MadSystems Seminar

Accelerating Content-Defined Chunking for Data Deduplication

Event Details

Date
Thursday, April 3, 2025
Time
4:30-5:30 p.m.
Location
Description

Abstract: Data deduplication is used to conserve storage space and network bandwidth. Content-defined chunking (CDC) algorithms divide data into chunks and thereby determine the space-saving efficiency of deduplication systems. However, modern CDC algorithms are slow because they are compute-intensive and must scan large amounts of data, which makes chunking one of the main bottlenecks in the deduplication pipeline.

In this talk, I will present two solutions to accelerate content-defined chunking. The first solution, VectorCDC, uses AVX-friendly techniques to redesign and accelerate existing chunking algorithms. The second solution, SeqCDC, presents a new vector-friendly algorithm that uses content-defined heuristics to selectively skip scanning data regions, improving throughput without significantly affecting space savings.
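
For readers unfamiliar with content-defined chunking, the general idea can be sketched as follows. This is a minimal, generic rolling-hash chunker and is not VectorCDC or SeqCDC; the names `chunk_boundaries` and `chunks`, the hash table, the mask, and the size limits are all illustrative assumptions rather than details from the talk.

```python
import random

# Illustrative parameters (assumptions, not taken from the talk).
MIN_SIZE = 64          # never cut a chunk shorter than this many bytes
MAX_SIZE = 1024        # always cut once a chunk reaches this many bytes
MASK = (1 << 6) - 1    # cut when the low 6 bits of the hash are all zero

# Fixed pseudo-random table mapping each byte value to a 32-bit number.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk_boundaries(data: bytes) -> list:
    """Return the end offset of every chunk in data."""
    cuts = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Rolling hash: older bytes shift out of the 32-bit state
        # while the newest byte is mixed in.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if length < MIN_SIZE:
            continue
        # The boundary depends on the content (hash bits), not on a
        # fixed offset, except when the maximum size forces a cut.
        if (h & MASK) == 0 or length >= MAX_SIZE:
            cuts.append(i + 1)
            start = i + 1
            h = 0
    if start < len(data):
        cuts.append(len(data))  # trailing partial chunk
    return cuts


def chunks(data: bytes) -> list:
    """Split data into chunks at the content-defined boundaries."""
    out, prev = [], 0
    for end in chunk_boundaries(data):
        out.append(data[prev:end])
        prev = end
    return out
```

Note that the loop inspects every byte of the input, which illustrates the scanning cost the abstract identifies as a bottleneck.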

Bio: Sreeharsha is a fourth-year PhD student at the University of Waterloo, advised by Prof. Samer Al-Kiswany. His research focuses on incorporating hardware acceleration into large-scale distributed systems.

Cost
Free
Accessibility

We value inclusion and access for all participants and are pleased to provide reasonable accommodations for this event. Please email chenhaoy@cs.wisc.edu to make a disability-related accommodation request. Reasonable effort will be made to support your request.