Understanding Deduplication Methods: Ways to Preserve the Integrity of Your Data


Growing data volumes and complexity have made data deduplication more relevant than ever

Data duplication continues to be an issue for many organisations. Although data processing and storage systems have developed rapidly alongside technological advances, the complexity of the information produced has also increased. Furthermore, with the proliferation of Big Data and the adoption of cloud-based applications, today's organisations increasingly have to deal with fragmented data sources.


Ignoring large amounts of duplicated data can have a negative impact on the organisation, such as:

  • Disruption of the decision-making process. Unclean data can bias metrics so they no longer reflect actual conditions. For instance, if a single customer is represented as two or three customer records in the CRM, revenue projections will be distorted.
  • Swelling storage costs, because every duplicated piece of information takes up additional storage space.
  • Disruption of the customer experience. For instance, if the system sends notifications or emails to customers, it is very likely that customers with duplicated records will receive multiple copies of the same notification.
  • Making AI training less than optimal. When an organisation starts developing an AI solution, one of the requirements is to train on clean data. If the data still contains many duplicates, it cannot be considered clean, and using it for training will potentially produce a biased model.

Given the serious impact of failing to reduce or eliminate duplicated data, data deduplication becomes increasingly relevant. It is also critical for ensuring data quality. As systems grow in sophistication and complexity, deduplication techniques must evolve to keep up.

In this article, we will examine three of the latest deduplication methods, which can serve as a reference for practitioners when planning a deduplication process.

Global deduplication is the process of eliminating duplicate data across multiple storage locations. It is now common for organisations to store their data across multiple servers, data centers, or the cloud. Global deduplication ensures that only one copy of each piece of data is stored.

This method works by creating a global index: a list of all existing data in the form of unique codes (hashes), generated with an algorithm such as SHA-256, where each hash represents a piece of data. When a new file is uploaded to a server (for instance, Server 1), the system stores the unique code for that file.

Later, when a user uploads a file to Server 2, the system compares the new file's unique code with the global index. If the new file's hash already exists in the global index, then instead of storing the same file in two places, the system replaces the duplicate on Server 2 with a reference (pointer) to the copy that already exists on Server 1.
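As a rough illustration of this idea, the sketch below keeps a global index that maps SHA-256 hashes to the location of the first stored copy. The class name, server labels, and in-memory storage layout are assumptions made for illustration only; it shows the hash-then-point logic, not a production implementation.

```python
import hashlib


class GlobalDedupIndex:
    """Minimal sketch of a global deduplication index (illustrative only)."""

    def __init__(self):
        # Maps SHA-256 hash -> (server, path) of the single stored copy.
        self.index = {}
        # Per-server view: path -> either raw bytes or a pointer entry.
        self.servers = {}

    def upload(self, server: str, path: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        store = self.servers.setdefault(server, {})

        if digest in self.index:
            # Duplicate content: store only a reference to the original location.
            store[path] = {"pointer_to": self.index[digest]}
            return f"deduplicated -> {self.index[digest]}"

        # First time this content is seen: store it physically and register it.
        store[path] = {"data": data}
        self.index[digest] = (server, path)
        return "stored new copy"


if __name__ == "__main__":
    dedup = GlobalDedupIndex()
    report = b"quarterly-report-contents"
    print(dedup.upload("server1", "/files/report.pdf", report))   # stored new copy
    print(dedup.upload("server2", "/backup/report.pdf", report))  # deduplicated -> ('server1', ...)
```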

With this method, storage space is clearly saved. Combined with a data virtualisation technique, when a file is needed the system retrieves it from its original location, while users still perceive the data as residing on their own servers.

The illustration below shows how global deduplication works: each server stores only one copy of the unique data, and duplicates on other servers are replaced with references to the original file.

Source: Author

Note that global deduplication does not work in real time but as a post-process, meaning the method can only be applied after the file has entered storage.

Unlike global deduplication, inline deduplication works in real time, right when data is being written to the storage system. With this technique, duplicate data is immediately replaced with references and never physically stored.

The process begins when data is about to enter the system: as a file is being uploaded, the system immediately divides it into several small pieces, or chunks. Using an algorithm such as SHA-256, each chunk is then given a hash value as its unique code. For example:

Chunk1 -> hashA
Chunk2 -> hashB
Chunk3 -> hashC

The system then checks whether any of the chunk hashes already exist in the storage index. If a chunk's unique code is already in the index, the system does not re-save that chunk's physical data; it only stores a reference to the location of the previously stored chunk.

Each unique chunk, meanwhile, is stored physically.

Later, when a user wants to access the file, the system reassembles the data from the existing chunks based on those references, so that the complete file is available to the user.
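A minimal sketch of this chunk-hash-reference flow is shown below. The fixed chunk size, the in-memory chunk store, and the function names are all assumptions for illustration; real systems typically use content-defined chunking and persistent indexes.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; unrealistically small, just to make the example visible

chunk_store = {}  # hash -> physical chunk bytes (each unique chunk stored once)


def write_file(data: bytes) -> list[str]:
    """Split data into chunks, store only unseen chunks, return the hash 'recipe'."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in chunk_store:   # unique chunk: store it physically
            chunk_store[digest] = chunk
        recipe.append(digest)           # duplicates only add a reference
    return recipe


def read_file(recipe: list[str]) -> bytes:
    """Reassemble the file from stored chunks by following the references."""
    return b"".join(chunk_store[digest] for digest in recipe)


if __name__ == "__main__":
    recipe = write_file(b"AAAABBBBAAAACCCC")  # the chunk 'AAAA' appears twice
    print(len(chunk_store))                   # 3 unique chunks stored, not 4
    print(read_file(recipe))                  # b'AAAABBBBAAAACCCC'
```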

Inline deduplication is widely used by cloud service providers such as Amazon S3 or Google Drive. The method is very useful for optimising storage capacity.

The simple illustration below shows the inline deduplication process, from data chunking to how the data is accessed.

Source: Author

Machine learning-powered deduplication uses AI to detect and remove duplicate data, even when records are not completely identical.

The process begins when incoming data, such as files, documents, or records, is sent to the deduplication system for evaluation. For example, the system receives two scanned documents that at first glance look similar but actually have subtle differences in layout or text format.

The system then extracts important features, often in the form of metadata or visual patterns. These features are analysed and compared for similarity, and the similarity is expressed as a score. Each system or organisation can define what counts as a duplicate based on that score; for example, only data with a similarity score above 90% might be flagged as a potential duplicate.

Based on the similarity score, the system can judge whether the data is a duplicate. If it is, the same step as in the other deduplication methods can be taken: for the duplicate data, only a reference to the original is stored.
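The sketch below approximates this idea using TF-IDF features and cosine similarity as a stand-in for a learned model. The 0.9 threshold, the sample documents, and the scikit-learn-based pipeline are assumptions for illustration only; a production system would more likely use trained embeddings or a dedicated matching model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

SIMILARITY_THRESHOLD = 0.9  # example policy: above 90% counts as a potential duplicate

# Hypothetical incoming documents; doc_a and doc_b describe the same invoice with small differences.
documents = {
    "doc_a": "Invoice 2024-001 for ACME Corp, total amount 1,500 USD, due 31 March.",
    "doc_b": "Invoice 2024-001 ACME Corp - total amount 1500 USD (due March 31).",
    "doc_c": "Meeting notes: quarterly planning session with the sales team.",
}

# Feature extraction: character n-gram TF-IDF vectors stand in for richer learned features.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
matrix = vectorizer.fit_transform(list(documents.values()))

# Pairwise similarity scores between all documents.
scores = cosine_similarity(matrix)

names = list(documents)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        score = scores[i, j]
        verdict = "potential duplicate" if score >= SIMILARITY_THRESHOLD else "distinct"
        print(f"{names[i]} vs {names[j]}: score={score:.2f} -> {verdict}")
```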

What is interesting about ML-enhanced deduplication is that it allows human involvement to validate the classifications made by the system, so that the system keeps getting smarter based on that input (a feedback loop).
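As a rough sketch of such a feedback loop, one simple (assumed) approach is to log the reviewer's verdict for each scored pair and periodically re-tune the similarity threshold from those labels; real systems might instead retrain the underlying model.

```python
# Hypothetical feedback log: (similarity_score, human_says_duplicate)
feedback = [
    (0.95, True), (0.88, True), (0.91, True),
    (0.86, False), (0.79, False), (0.92, True),
]


def retune_threshold(labels, candidates=None):
    """Pick the candidate threshold that best matches the human verdicts."""
    candidates = candidates or [round(0.5 + 0.01 * i, 2) for i in range(50)]

    def accuracy(threshold):
        return sum((score >= threshold) == is_dup for score, is_dup in labels) / len(labels)

    return max(candidates, key=accuracy)


print(retune_threshold(feedback))  # a threshold near the boundary of the labelled scores
```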

However, note that unlike inline deduplication, ML-enhanced deduplication is not well suited to real-time use. This is due to latency: the ML pipeline needs time to extract features and process the data. In addition, forcing it to run in real time requires considerably more computing resources.

Although it is not real-time, the benefits it brings are still substantial, especially its ability to handle unstructured or semi-structured data.

The following is an illustration of the steps of ML-enhanced deduplication, together with examples.

Source: Author