September 16, 2024

What Is a Snowflake External Table?

Learn about creating and querying external tables in Snowflake for accessing external data sources.
Dexter Chu
Head of Marketing

Snowflake external tables offer a powerful way to query data stored in external stages without needing to load it into Snowflake. This feature is particularly beneficial for organizations dealing with large volumes of data stored in cloud storage services such as AWS S3, Azure Blob Storage, or Google Cloud Storage. In this guide, we will explore the foundational concepts, schema design, performance considerations, and practical use cases of Snowflake external tables.

What are Snowflake External Tables?

Snowflake external tables allow users to query data stored in an external stage as if it were inside a Snowflake table. This capability eliminates the need to load data into Snowflake, making it an efficient solution for handling large datasets stored in cloud storage. External tables are inherently read-only, meaning they cannot perform Data Manipulation Language (DML) operations like insert, update, or delete. However, they can be queried and joined with other tables.

Example of creating an external table in Snowflake

CREATE OR REPLACE EXTERNAL TABLE my_external_table
WITH LOCATION = '@my_stage/path/to/data'
FILE_FORMAT = (TYPE = PARQUET);

This code snippet demonstrates how to create an external table in Snowflake using data stored in a specified location and file format. The external table can then be queried as if it were a native Snowflake table.

How do Snowflake External Tables Work?

Snowflake external tables work by referencing data stored in external stages, such as cloud storage services. When a query is executed on an external table, Snowflake accesses the external storage to retrieve the data. While querying external tables might be slower compared to native Snowflake tables due to the overhead of accessing external storage, using materialized views can significantly enhance performance.

  • Read-Only Nature: External tables are inherently read-only, meaning they cannot perform DML operations like insert, update, or delete. However, they can be queried and joined with other tables.
  • Query Performance: While querying external tables might be slower compared to native Snowflake tables due to the overhead of accessing external storage, using materialized views can significantly enhance performance.
  • Metadata Storage: Snowflake stores file-level metadata within the external tables, including filenames and version identifiers.

Creating and Managing Snowflake External Tables

1. Defining the Schema and Columns

External tables in Snowflake have a specific schema design to handle data efficiently. The VALUE column represents a single row in the external file and is of type VARIANT, which is essential for handling semi-structured data formats such as JSON, Avro, ORC, and Parquet. Additionally, columns like METADATA$FILENAME and METADATA$FILE_ROW_NUMBER provide context about the data source and help in tracking and debugging.

<!-- Example of defining schema and columns for an external table -->
CREATE OR REPLACE EXTERNAL TABLE my_external_table (
VALUE VARIANT,
METADATA$FILENAME STRING,
METADATA$FILE_ROW_NUMBER NUMBER
)
WITH LOCATION = '@my_stage/path/to/data'
FILE_FORMAT = (TYPE = PARQUET);

This code snippet demonstrates how to define the schema and columns for an external table in Snowflake, including the VALUE column and metadata columns.

2. Optimizing File Sizing

Proper file sizing is crucial for optimizing the performance of external tables. The following recommendations are based on the file formats supported by Snowflake:

File Format Recommended File Size Recommended Row Group Size Parquet 256 - 512 MB 16 - 256 MB Other Formats 16 - 256 MB N/A

3. Partitioning External Tables

Partitioning external tables can significantly improve query performance by dividing data into smaller, more manageable parts. There are two primary methods for partitioning:

  • Automatic Partitions: These are defined using expressions based on the filename and path and benefit from automatic metadata refresh.
  • Manual Partitions: These partitions are user-defined and added via the ALTER EXTERNAL TABLE … ADD PARTITION command.

4. Supporting Delta Lake

Snowflake external tables support Delta Lake, a storage layer that brings ACID (Atomicity, Consistency, Isolation, Durability) transactions to data lakes using Apache Parquet format. Key aspects of Delta Lake support include:

  • Table Format: Delta Lake tables are created using the TABLE_FORMAT = DELTA option in the CREATE EXTERNAL TABLE command.
  • Metadata Refresh: Snowflake can parse Delta Lake transaction logs to sync metadata, although automated refreshes are not supported.

Modifying and Protecting Snowflake External Tables

1. Adding and Removing Columns

Users can add or remove columns using the ALTER TABLE … ADD COLUMN or ALTER TABLE … DROP COLUMN commands. However, default columns such as VALUE, METADATA$FILENAME, and METADATA$FILE_ROW_NUMBER cannot be dropped.

<!-- Example of adding a column to an external table -->
ALTER TABLE my_external_table ADD COLUMN new_column STRING;

This code snippet demonstrates how to add a new column to an existing external table in Snowflake.

2. Protecting External Tables

To secure external tables, users can apply masking and row access policies. These policies help ensure that sensitive data is protected and that only authorized users can access specific rows or columns.

<!-- Example of applying a masking policy to an external table -->
CREATE MASKING POLICY mask_sensitive_data AS (val STRING) RETURNS STRING ->
CASE
WHEN CURRENT_ROLE() IN ('authorized_role') THEN val
ELSE '****'
END;
ALTER TABLE my_external_table MODIFY COLUMN sensitive_column SET MASKING POLICY mask_sensitive_data;

This code snippet demonstrates how to create and apply a masking policy to an external table in Snowflake.

Practical Use Cases and Performance Considerations

Snowflake external tables are particularly useful in scenarios where data resides in external storage systems, and users need to perform ad-hoc queries or combine external data with internal Snowflake data. Some practical use cases include:

  • Data Lake Querying: Organizations can query large datasets stored in data lakes without moving data into Snowflake.
  • ETL Processes: External tables can be used in Extract, Transform, Load (ETL) processes to transform and load data into Snowflake.
  • Data Integration: External tables facilitate integrating data from different sources, providing a unified view for analysis.

However, there are performance considerations to keep in mind:

  • Query Latency: Querying data from external tables might introduce latency compared to querying native Snowflake tables.
  • Cost Implications: Accessing external storage might incur additional costs, depending on the cloud storage service used.

Comparison with Native Snowflake Tables

Feature External Tables Native Snowflake Tables Data Storage External storage (e.g., AWS S3, Azure Blob) Internal Snowflake storage Read/Write Capability Read-only Read and write Performance Potentially slower due to external data access Optimized for high performance Metadata Storage File-level metadata stored in Snowflake Complete metadata stored internally Use Cases Querying external data, ETL processes High-performance querying, comprehensive analytics Cost Storage costs depend on external storage provider Included in Snowflake storage pricing

Common Challenges and Solutions

While working with Snowflake external tables, users might encounter several challenges. Here are some common issues and their solutions:

  • Query Performance: To improve query performance, consider using materialized views and optimizing file sizes.
  • Metadata Management: Ensure that metadata is regularly refreshed, especially when dealing with Delta Lake tables.
  • Cost Management: Be mindful of the costs associated with accessing external storage and optimize data retrieval to minimize expenses.

Recap of Snowflake External Tables

In this guide, we explored the various aspects of Snowflake external tables, including their creation, management, and practical use cases. Here are the key takeaways:

  • Efficient Data Querying: Snowflake external tables allow users to query data stored in external stages without loading it into Snowflake, making it an efficient solution for handling large datasets.
  • Schema Design and Optimization: Proper schema design and file sizing are crucial for optimizing the performance of external tables.
  • Practical Use Cases: External tables are particularly useful for data lake querying, ETL processes, and data integration, providing flexibility and efficiency in data management.

By understanding and implementing the concepts discussed in this guide, users can effectively leverage Snowflake external tables to enhance their data querying and integration capabilities.

Keep reading

View all