Data duplication in Snowflake refers to the occurrence of identical records within a dataset. This can happen due to various reasons such as human error, system glitches, or inadequate data integration processes. Understanding and managing data duplication is crucial for maintaining data integrity and ensuring accurate data analysis.
-- Sample table; string columns use VARCHAR, Snowflake's standard string type.
CREATE OR REPLACE TABLE STUDENT_RECORD (
  STUDENT_ID NUMBER(6,0),
  FIRST_NAME VARCHAR(20),
  LAST_NAME VARCHAR(20),
  AGE NUMBER(3,0),
  ADDRESS VARCHAR(100),
  PHONE_NUMBER VARCHAR(20),
  GRADE VARCHAR(10)
);
INSERT INTO STUDENT_RECORD(STUDENT_ID, FIRST_NAME, LAST_NAME, AGE, ADDRESS, PHONE_NUMBER, GRADE) VALUES
(1, 'John', 'Cena', 18, '123 Main St, City', '123-456-7890', 'A'),
(1, 'John', 'Cena', 18, '123 Main St, City', '123-456-7890', 'A'),
(2, 'Rock', 'Bottom', 17, '456 Second St, Town', '987-654-3210', 'B'),
(2, 'Rock', 'Bottom', 17, '456 Second St, Town', '987-654-3210', 'B'),
(2, 'Rock', 'Bottom', 17, '456 Second St, Town', '987-654-3210', 'B'),
(3, 'James', 'Johnson', 16, '789 Oak St, Village', '456-123-7890', 'C'),
(3, 'James', 'Johnson', 16, '789 Oak St, Village', '456-123-7890', 'C'),
(4, 'Sarah', 'Williams', 18, '321 Pine St, County', '789-123-4560', 'A');
This code creates a sample table named STUDENT_RECORD and inserts dummy data, including duplicate rows. This setup is used to demonstrate how to identify and delete duplicate records in Snowflake.
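Before deleting anything, it can help to see which rows are duplicated. The following is a minimal sketch, assuming the STUDENT_RECORD sample table; the alias COPIES is an illustrative name, not part of the original example. It groups on every column so that only fully identical rows are counted together.

```sql
-- List each distinct row that appears more than once, with its number of copies.
SELECT STUDENT_ID, FIRST_NAME, LAST_NAME, AGE, ADDRESS, PHONE_NUMBER, GRADE,
       COUNT(*) AS COPIES  -- COPIES is an illustrative alias
FROM STUDENT_RECORD
GROUP BY STUDENT_ID, FIRST_NAME, LAST_NAME, AGE, ADDRESS, PHONE_NUMBER, GRADE
HAVING COUNT(*) > 1
ORDER BY STUDENT_ID;
```

With the sample data, this should return STUDENT_ID 1 (2 copies), 2 (3 copies), and 3 (2 copies). STUDENT_ID 4 appears once and is filtered out.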
Identifying and deleting duplicate rows in Snowflake can be achieved using various methods. One common approach is to use the DISTINCT keyword, which filters duplicate records out of a query result. Another method uses the ROW_NUMBER() window function to number the rows within each group of matching records, so that every row numbered above 1 can be treated as a duplicate and deleted.
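Both approaches can be previewed with read-only queries before any rows are deleted. This is a sketch, assuming the STUDENT_RECORD sample table and that STUDENT_ID identifies a student; QUALIFY is Snowflake's clause for filtering on window-function results.

```sql
-- Approach 1: DISTINCT returns one copy of each fully identical row.
SELECT DISTINCT STUDENT_ID, FIRST_NAME, LAST_NAME, AGE, ADDRESS, PHONE_NUMBER, GRADE
FROM STUDENT_RECORD;

-- Approach 2: ROW_NUMBER() numbers the rows within each STUDENT_ID;
-- QUALIFY keeps only the first row of each group.
SELECT STUDENT_ID, FIRST_NAME, LAST_NAME, AGE, ADDRESS, PHONE_NUMBER, GRADE
FROM STUDENT_RECORD
QUALIFY ROW_NUMBER() OVER (PARTITION BY STUDENT_ID ORDER BY STUDENT_ID) = 1;
```

Neither query modifies the table. On the sample data, each should return four rows, one per student.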
The DISTINCT keyword is used with the SELECT command to return only unique records from a table, effectively removing duplicates from the result set.
Set up a demo table, such as STUDENT_RECORD, and insert rows that include duplicates.
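Use a query to create a temporary table that stores all duplicate records, then delete those records from the main table.
-- Copy every row whose STUDENT_ID occurs more than once into a temporary table,
-- numbering the copies within each STUDENT_ID.
CREATE OR REPLACE TEMPORARY TABLE DUPLICATES AS
SELECT STUDENT_ID, FIRST_NAME, LAST_NAME, AGE, ADDRESS, PHONE_NUMBER, GRADE,
  ROW_NUMBER() OVER (PARTITION BY STUDENT_ID ORDER BY STUDENT_ID) AS ROW_NUM
FROM STUDENT_RECORD
QUALIFY COUNT(*) OVER (PARTITION BY STUDENT_ID) > 1;
-- Delete all copies of the duplicated rows from the main table.
DELETE FROM STUDENT_RECORD WHERE STUDENT_ID IN (
  SELECT STUDENT_ID FROM DUPLICATES
);
This query uses the ROW_NUMBER() window function to number the copies of each duplicated STUDENT_ID and stores them in the temporary table DUPLICATES. Because the copies are identical in every column, the DELETE cannot target only the extra copies, so it removes every row with a duplicated STUDENT_ID.
Reinsert the unique rows from the temporary table back into the main table.
INSERT INTO STUDENT_RECORD (STUDENT_ID, FIRST_NAME, LAST_NAME, AGE, ADDRESS, PHONE_NUMBER, GRADE)
SELECT STUDENT_ID, FIRST_NAME, LAST_NAME, AGE, ADDRESS, PHONE_NUMBER, GRADE
FROM DUPLICATES
WHERE ROW_NUM = 1;
This code reinserts one copy of each duplicated record into the STUDENT_RECORD table, ensuring that only unique rows are retained.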
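Once the duplicates have been removed, a quick check can confirm the result. This is a minimal sketch, assuming STUDENT_ID should be unique in the cleaned table; the aliases TOTAL_ROWS and UNIQUE_IDS are illustrative.

```sql
-- The two counts match when no STUDENT_ID appears more than once.
SELECT COUNT(*) AS TOTAL_ROWS,
       COUNT(DISTINCT STUDENT_ID) AS UNIQUE_IDS
FROM STUDENT_RECORD;
```

For the sample data, a fully deduplicated table should report 4 for both values.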
Use the ROW_NUMBER() function to minimize performance impact.
Run a SELECT query to ensure that only unique records remain.
Methods such as DISTINCT, ROW_NUMBER(), and CTEs can effectively identify and delete duplicate records.
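For tables whose duplicates are identical in every column, the DISTINCT approach can also be written as a single statement. This sketch is an alternative to the temporary-table steps, not the method walked through in this article; it relies on Snowflake's INSERT OVERWRITE, which truncates the target table before inserting the query result.

```sql
-- Replace the table's contents with one copy of each distinct row.
INSERT OVERWRITE INTO STUDENT_RECORD
SELECT DISTINCT STUDENT_ID, FIRST_NAME, LAST_NAME, AGE, ADDRESS, PHONE_NUMBER, GRADE
FROM STUDENT_RECORD;
```

Since this rewrites the whole table, it is worth trying on a copy of the data first.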