MadSystems Seminar
Accelerating Content-Defined Chunking for Data Deduplication
Event Details
Also offered online
Abstract: Data deduplication is used to conserve storage space and network bandwidth. Content-defined chunking (CDC) algorithms divide data into chunks and thereby dictate the space-saving efficiency of deduplication systems. However, modern CDC algorithms are slow because they are compute-intensive and must scan large amounts of data, which makes chunking one of the main bottlenecks in the deduplication pipeline.
In this talk, I will present two solutions to accelerate content-defined chunking. The first solution, VectorCDC, uses AVX-friendly techniques to redesign and accelerate existing chunking algorithms. The second solution, SeqCDC, presents a new vector-friendly algorithm that uses content-defined heuristics to selectively skip scanning data regions, improving throughput without significantly affecting space savings.
Bio: Sreeharsha is a fourth-year PhD student at the University of Waterloo, advised by Prof. Samer Al-Kiswany. His research focuses on incorporating hardware acceleration into large-scale distributed systems.
We value inclusion and access for all participants and are pleased to provide reasonable accommodations for this event. Please email chenhaoy@cs.wisc.edu to make a disability-related accommodation request. Reasonable effort will be made to support your request.