Automatic Clustering at Snowflake

Jiaqi Yan

Event Details

Wednesday, November 11, 2020
1-2 p.m.

Snowflake is a database built on top of major cloud computing platforms. It provides reliable data storage at low cost and makes it easy to load large volumes of data, so customers can readily create very large tables for data analytics workloads. To speed up query processing on large tables, Snowflake automatically partitions incoming data and uses zonemap metadata for pruning. For partitioned tables, maintaining good clustering is critical for query performance. However, both the size and the volume of data ingestion present challenges for efficient clustering maintenance on large tables, where naive approaches could be prohibitively expensive. In this talk, I will introduce Snowflake's approach to automatically maintaining clustering on clustered tables in the presence of DML operations. I will focus on our incremental approximate clustering mechanisms, which lower the cost of clustering maintenance while ensuring good query performance. I will also dive into our service-oriented approach, which significantly simplifies the task of performance tuning and reduces management overhead.
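
For readers unfamiliar with zonemap pruning, the following is a minimal, generic sketch of why clustering matters for it. It is not Snowflake's implementation: the `Partition` class, the helper names, the partition size, and the sample data are all assumptions made for illustration. Each partition records the minimum and maximum of one column, and a range query skips any partition whose range cannot overlap the predicate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Partition:
    """Hypothetical partition; min/max of its rows act as the zonemap."""
    rows: List[int]

    @property
    def min_val(self) -> int:
        return min(self.rows)

    @property
    def max_val(self) -> int:
        return max(self.rows)


def make_partitions(values: List[int], size: int) -> List[Partition]:
    """Split values into fixed-size partitions in arrival order."""
    return [Partition(values[i:i + size]) for i in range(0, len(values), size)]


def partitions_to_scan(partitions: List[Partition], lo: int, hi: int) -> List[Partition]:
    """Keep only partitions whose [min, max] range overlaps the query range [lo, hi]."""
    return [p for p in partitions if p.max_val >= lo and p.min_val <= hi]


values = list(range(16))

# Well clustered: rows arrive sorted, so partition ranges do not overlap.
clustered = make_partitions(values, 4)

# Poorly clustered: rows arrive interleaved, so every partition spans a wide range.
interleaved_values = [v for start in range(4) for v in values[start::4]]
interleaved = make_partitions(interleaved_values, 4)

# Range query: 5 <= value <= 6
scanned_clustered = len(partitions_to_scan(clustered, 5, 6))
scanned_interleaved = len(partitions_to_scan(interleaved, 5, 6))
print(scanned_clustered, scanned_interleaved)
```

In this toy setup the sorted layout lets the query skip all but one partition, while the interleaved layout forces a scan of every partition even though the data is identical. That gap is the reason clustering needs to be maintained as new data arrives.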