Parallel Data Deduplication

The Project

The PC² conducts research on data deduplication to reduce the amount of physical storage required. The goal is to use the available storage resources much more efficiently. Storage systems that apply a data deduplication scheme generally aim to detect redundancies in the stored data blocks and then replace redundant data with pointers to the already stored blocks containing the same data.
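The idea of replacing redundant blocks with pointers can be illustrated with a minimal sketch. It is not the dedupv1 implementation: the fixed 4 KB block size, the use of SHA-256 as fingerprint, and the names `DedupStore`, `write`, `read` and `physical_bytes` are assumptions chosen only for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only


class DedupStore:
    """Toy block store: each distinct block is kept once, files hold pointers."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block data (stored exactly once)
        self.files = {}   # file name -> list of fingerprints (the "pointers")

    def write(self, name, data):
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fingerprint = hashlib.sha256(block).digest()
            # Store the block only if its fingerprint has not been seen before.
            self.blocks.setdefault(fingerprint, block)
            recipe.append(fingerprint)
        self.files[name] = recipe

    def read(self, name):
        # Follow the pointers to rebuild the original data.
        return b"".join(self.blocks[fp] for fp in self.files[name])

    def physical_bytes(self):
        return sum(len(block) for block in self.blocks.values())


# Four distinct 4 KB blocks stand in for an attachment.
attachment = b"".join(bytes([i]) * BLOCK_SIZE for i in range(4))
# A slightly modified copy: only the last block differs.
modified = attachment[:-BLOCK_SIZE] + b"x" * BLOCK_SIZE

store = DedupStore()
store.write("alice/slides.ppt", attachment)
store.write("bob/slides.ppt", attachment)
store.write("carol/slides.ppt", modified)
```

In this sketch the three files hold 48 KB of logical data, but only five distinct blocks (20 KB) are physically stored, because the identical copy adds nothing and the modified copy adds a single new block.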

One example of such redundancy is a PowerPoint attachment in an e-mail that is sent to a whole department. Each recipient of the e-mail stores the attachment in the same or only slightly modified form. Deduplication systems can reduce the storage capacity required for such data and therefore also reduce the costs and the carbon footprint. This process is similar to data compression, but at a much larger scale. The goal is to detect redundancies at an intra-file level in storage systems with multiple terabytes of data at high throughput.

Our Mission

The data deduplication research done by the PC² focuses on throughput and scalability. Naive deduplication systems store the fingerprint information necessary for redundancy detection either on disk or in memory. If the information is stored on disk, the throughput is limited to around 20 MB/s even with a large disk array, because the number of random IO operations per disk is limited to fewer than 200 per second. If the information is stored in main memory, the scale of the deduplication is limited to a few terabytes, because main memory is prohibitively expensive at sizes above 64 GB. The PC² research aims to break this dilemma by combining two directions: solid state disks and cluster storage.
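The two limits can be made concrete with a back-of-the-envelope calculation. Only the 200 random IOs per second and the 64 GB of main memory come from the text above; the 12 disks, the 8 KB chunk size, the 64 bytes per index entry and both function names are assumptions picked for illustration, and the model assumes one random index lookup per chunk.

```python
def index_bound_throughput_mb_s(disks, iops_per_disk, chunk_kb):
    """Throughput if every chunk costs one random fingerprint lookup on disk."""
    return disks * iops_per_disk * chunk_kb / 1024


def indexable_capacity_tib(memory_gib, entry_bytes, chunk_kb):
    """Amount of unique data whose fingerprints fit into main memory."""
    entries = memory_gib * 2**30 / entry_bytes
    return entries * chunk_kb * 2**10 / 2**40


# Assumed: 12 disks, 8 KB chunks; 200 IO/s per disk is the figure from the text.
disk_limit = index_bound_throughput_mb_s(disks=12, iops_per_disk=200, chunk_kb=8)

# Assumed: 64 bytes per index entry, 8 KB chunks; 64 GB is the figure from the text.
memory_limit = indexable_capacity_tib(memory_gib=64, entry_bytes=64, chunk_kb=8)

print(disk_limit, "MB/s")   # 18.75 MB/s
print(memory_limit, "TiB")  # 8.0 TiB
```

Under these assumptions the disk-based index yields roughly 19 MB/s and the memory-based index covers about 8 TiB of unique data, which matches the orders of magnitude given above. Different chunk sizes or entry layouts shift the numbers but not the basic trade-off.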

Solid state disks (SSDs) are new storage devices based on flash memory that provide a disk-like interface. SSDs have no moving mechanical parts and provide much higher random IO performance than disks, but they are limited in capacity and lifetime. The PC² aims to store the fingerprint information on a series of solid state disks to increase the throughput of deduplication systems. To enable this research, the PC² develops the dedupv1 block-based deduplication system.

An additional goal is to scale deduplication using a cluster system. The cluster system should provide so-called global deduplication, that is, the system should detect redundancies regardless of which cluster node is used and regardless of the cluster node on which the redundant data has been stored before. While conventional cluster storage has been a research topic for a few years, cluster deduplication involves different tradeoffs and raises different research issues.
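The property of global deduplication can be sketched as follows. The text does not describe how the PC² cluster system is designed, so the scheme shown here, in which each fingerprint is assigned to an owning node by its hash value, is only one possible approach and is an assumption, as are the names `Cluster`, `owner` and `store`.

```python
import hashlib


class Cluster:
    """Toy cluster: the fingerprint index is partitioned across all nodes."""

    def __init__(self, node_count):
        # One fingerprint set per node; together they form a single global index.
        self.nodes = [set() for _ in range(node_count)]

    def owner(self, fingerprint):
        # Assumed routing rule: the fingerprint value selects the owning node.
        return int.from_bytes(fingerprint[:8], "big") % len(self.nodes)

    def store(self, entry_node, block):
        """Return True if the block is new, False if it is a duplicate.

        entry_node is the node that received the write. It is deliberately
        not used for the lookup, so the result does not depend on it.
        """
        fingerprint = hashlib.sha256(block).digest()
        index = self.nodes[self.owner(fingerprint)]
        if fingerprint in index:
            return False
        index.add(fingerprint)
        return True


cluster = Cluster(node_count=4)
first = cluster.store(entry_node=0, block=b"a" * 4096)   # new block
second = cluster.store(entry_node=3, block=b"a" * 4096)  # same data, other node
third = cluster.store(entry_node=1, block=b"b" * 4096)   # different data
```

The second write is recognized as a duplicate although it enters the cluster through a different node than the first. A real system would additionally have to handle the network cost of remote lookups, node failures and rebalancing, which is where the different tradeoffs arise.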

Contact

Jun.-Prof. Dr.-Ing. André Brinkmann

Dirk Meister, MSc.

Research