Snowflake Iceberg Tables

In this article, we will compare the performance of Iceberg tables and external tables in Snowflake using the sample dataset provided by Snowflake: SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.LINEITEM, which contains over 600 million rows.

Background: What is Apache Iceberg?

Apache Iceberg is an open table format that has become an industry standard for handling big data. Known for being fast, reliable, and scalable, Iceberg manages large datasets and tracks changes over time. Snowflake recently introduced enhanced support for Iceberg, offering advanced features and better integration, now generally available as of June 2024.

Creating the Tables

When using Iceberg tables in Snowflake, both the data and metadata are stored externally on cloud storage. Let’s walk through how to set up both Iceberg and external tables in Snowflake for performance comparison.

Creating an External Volume

An external volume in Snowflake is essentially a configuration that allows data to be stored outside of Snowflake (for instance, in an Amazon S3 bucket).

CREATE OR REPLACE EXTERNAL VOLUME iceberg_external_volume
STORAGE_LOCATIONS = (
(
NAME = 'my-s3-volume',
STORAGE_PROVIDER = 'S3',
STORAGE_BASE_URL = 's3://my-iceberg-bucket/',
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::xxxx:role/snowflakerole',
STORAGE_AWS_EXTERNAL_ID = 'iceberg_table_external_id'
)
);

DESC EXTERNAL VOLUME iceberg_external_volume;

This external volume uses AWS IAM authentication to access the S3 bucket.

Creating and Loading Data into an Iceberg Table

When creating an Iceberg table, you specify the external volume and the base location for data and metadata storage. Data is written directly to the specified S3 bucket, which also houses metadata for snapshots, partitions, and transactions.

CREATE OR REPLACE ICEBERG TABLE lineitem_iceberg (
L_COMMENT VARCHAR,
L_COMMITDATE DATE,
L_DISCOUNT FLOAT,
L_EXTENDEDPRICE FLOAT,
L_LINENUMBER INT,
L_LINESTATUS VARCHAR,
L_ORDERKEY INT,
L_PARTKEY INT,
L_QUANTITY FLOAT,
L_RECEIPTDATE DATE,
L_RETURNFLAG VARCHAR,
L_SHIPDATE DATE,
L_SHIPINSTRUCT VARCHAR,
L_SHIPMODE VARCHAR,
L_SUPPKEY INT,
L_TAX FLOAT
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'iceberg_external_volume'
BASE_LOCATION = 'LINEITEM';

INSERT INTO lineitem_iceberg
SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.LINEITEM;

Creating and Loading Data into External Tables

An external table in Snowflake allows you to query data stored in an external stage, such as Amazon S3, without loading it into Snowflake itself. These tables can be slower to query than native Snowflake tables, but performance can be enhanced using materialized views.

Here’s how to set up an external table:

CREATE OR REPLACE STORAGE INTEGRATION my_s3_integration
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = S3
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::xxxx:role/snowflakerole'
STORAGE_ALLOWED_LOCATIONS = ('s3://my-external-stage/');

CREATE OR REPLACE STAGE my_s3_stage
URL = 's3://my-aws-external-stage/'
STORAGE_INTEGRATION = my_s3_integration;

COPY INTO @my_s3_stage/data/lineitem.parquet
FROM snowflake_sample_data.tpch_sf100.lineitem
FILE_FORMAT = (TYPE = 'parquet')
HEADER = TRUE;

CREATE OR REPLACE EXTERNAL TABLE lineitem_external (
L_COMMENT VARCHAR AS (value:L_COMMENT::VARCHAR),
L_COMMITDATE DATE AS (value:L_COMMITDATE::DATE),
L_DISCOUNT FLOAT AS (value:L_DISCOUNT::FLOAT),
L_EXTENDEDPRICE FLOAT AS (value:L_EXTENDEDPRICE::FLOAT),
L_LINENUMBER INT AS (value:L_LINENUMBER::INT),
L_LINESTATUS VARCHAR AS (value:L_LINESTATUS::VARCHAR),
L_ORDERKEY INT AS (value:L_ORDERKEY::INT),
L_PARTKEY INT AS (value:L_PARTKEY::INT),
L_QUANTITY FLOAT AS (value:L_QUANTITY::FLOAT),
L_RECEIPTDATE DATE AS (value:L_RECEIPTDATE::DATE),
L_RETURNFLAG VARCHAR AS (value:L_RETURNFLAG::VARCHAR),
L_SHIPDATE DATE AS (value:L_SHIPDATE::DATE),
L_SHIPINSTRUCT VARCHAR AS (value:L_SHIPINSTRUCT::VARCHAR),
L_SHIPMODE VARCHAR AS (value:L_SHIPMODE::VARCHAR),
L_SUPPKEY INT AS (value:L_SUPPKEY::INT),
L_TAX FLOAT AS (value:L_TAX::FLOAT)
)
LOCATION = @my_s3_stage/data
FILE_FORMAT = (TYPE = PARQUET)
AUTO_REFRESH = TRUE;

Performance Test Results

To compare performance, we ran several tests on both the Iceberg and external tables using the same dataset:

Full Table Scan

select * from lineitem_external; -- 12m 57s

select * from lineitem_iceberg; -- 10m 31s

Counting Orders per Month

select count(distinct L_ORDERKEY) as nb_commandes, year(L_COMMITDATE) as annee, month(L_COMMITDATE) as mois from lineitem_external group by annee, mois; -- 1m 28s

select count(distinct L_ORDERKEY) as nb_commandes, year(L_COMMITDATE) as annee, month(L_COMMITDATE) as mois from lineitem_iceberg group by annee, mois; -- 47s

Sum of Sales by Shipping Mode for 1998

select sum(L_EXTENDEDPRICE - L_DISCOUNT) as prix_de_vente, L_SHIPMODE from lineitem_external where year(L_COMMITDATE) = 1998 group by L_SHIPMODE; -- 1m 26s

select sum(L_EXTENDEDPRICE - L_DISCOUNT) as prix_de_vente, L_SHIPMODE from lineitem_iceberg where year(L_COMMITDATE) = 1998 group by L_SHIPMODE; -- 4s

Revenue by Year

select sum((L_EXTENDEDPRICE - L_DISCOUNT) * L_QUANTITY) as CA, year(L_COMMITDATE) as annee from lineitem_external group by annee order by annee ASC; -- 1m 28s

select sum((L_EXTENDEDPRICE - L_DISCOUNT) * L_QUANTITY) as CA, year(L_COMMITDATE) as annee from lineitem_iceberg group by annee order by annee ASC; -- 10s

Conclusion

The results show that Iceberg tables are generally faster compared to external tables in Snowflake. Iceberg’s structure and metadata management allow for more efficient querying, especially when working with large datasets. For both Iceberg and external tables, partitioning can further improve performance when used in query WHERE clauses.

In summary, Iceberg tables in Snowflake offer a powerful solution for scalable, high-performance data storage and querying.