Long Live Data Metric Functions!

Arnaud Col, Monday, April 29, 2024

Data Metric Functions

A new object arrives in Snowflake... the Data Metric Function!

You can use standard functions or create your own to measure the quality of your data.
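As a sketch of the custom route, a user-defined DMF takes a table argument and returns a number. The function name, location, and logic below are assumptions for illustration, following the documented CREATE DATA METRIC FUNCTION syntax:

```sql
-- Sketch of a user-defined DMF (name and location are assumptions):
-- it receives a table argument and must return a NUMBER.
CREATE OR REPLACE DATA METRIC FUNCTION demo_db.public.null_name_count(
    arg_t TABLE(arg_c VARCHAR)
)
RETURNS NUMBER
AS
'SELECT COUNT_IF(arg_c IS NULL) FROM arg_t';
```

Once created, such a function can be attached to a table exactly like the built-in ones.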

What better way to discover this new feature than with an example?

Let’s start with the basic but recurring need for automatic duplicate detection.

With Snowflake, we often start by assigning roles:

use role accountadmin;
grant database role SNOWFLAKE.DATA_METRIC_USER to role sysadmin;
grant execute data metric function on account to role sysadmin;
grant application role snowflake.data_quality_monitoring_viewer to role sysadmin;


Let's go to our working context:

USE ROLE SYSADMIN;
USE DATABASE demo_db;
USE SCHEMA public;


Then, let's create a table with 1 duplicate:

CREATE OR REPLACE TABLE region
AS
SELECT * FROM snowflake_sample_data.tpch_sf1.region
UNION ALL
(SELECT * FROM snowflake_sample_data.tpch_sf1.region limit 1)
;


Manually invoke the system function for detecting duplicates:

select snowflake.core.duplicate_count(select r_name from region);


Here is the result:

SNOWFLAKE.CORE.DUPLICATE_COUNT(SELECT R_NAME FROM PUBLIC.REGION)
1


We indeed find 1 duplicate.


Let's add a trigger on our table so that quality checks are executed when the data changes:

ALTER TABLE region SET
DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES';

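TRIGGER_ON_CHANGES is only one of the scheduling options: per the documentation, DATA_METRIC_SCHEDULE also accepts a minute interval or a cron expression. For example:

```sql
-- Alternative schedules (one at a time; the last SET wins):
ALTER TABLE region SET DATA_METRIC_SCHEDULE = '5 MINUTE';                  -- every 5 minutes
ALTER TABLE region SET DATA_METRIC_SCHEDULE = 'USING CRON 0 8 * * * UTC';  -- daily at 08:00 UTC
```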

Finally, let’s add the duplicate detection function to our table:

ALTER TABLE region
ADD DATA METRIC FUNCTION snowflake.core.duplicate_count
ON (r_name);


All that remains is to inject new duplicates:

INSERT OVERWRITE INTO region
SELECT * FROM snowflake_sample_data.tpch_sf1.region
UNION ALL
(SELECT * FROM snowflake_sample_data.tpch_sf1.region limit 3);


Tip 🎁: the OVERWRITE keyword truncates the table before inserting the new rows.

Wait a few minutes for the check to run before querying the results.

Another tip 🎁: the TABLE keyword saves you from writing SELECT * FROM ...

So handy ♥️

Let’s now retrieve the result of our check:

TABLE snowflake.local.data_quality_monitoring_results;

This table contains many columns.


Let's keep the ones that interest us:

select measurement_time,
       metric_database,
       table_schema,
       table_name,
       metric_schema,
       metric_name,
       value
from snowflake.local.data_quality_monitoring_results;


And we indeed find our 3 duplicates:

MEASUREMENT_TIME: 2024-04-09 10:35:22.290 -0400
METRIC_DATABASE:  SNOWFLAKE
TABLE_SCHEMA:     PUBLIC
TABLE_NAME:       REGION
METRIC_SCHEMA:    CORE
METRIC_NAME:      DUPLICATE_COUNT
VALUE:            3


Of course, you can create alerts or dashboards from this data.
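For instance, here is a minimal alert sketch on top of the monitoring results; the warehouse compute_wh, the notification integration my_email_int, and the recipient address are all assumptions:

```sql
-- Sketch: alert when a new DUPLICATE_COUNT measurement is above 0.
-- compute_wh, my_email_int, and the e-mail address are assumptions.
CREATE OR REPLACE ALERT duplicate_alert
  WAREHOUSE = compute_wh
  SCHEDULE = '60 MINUTE'
  IF (EXISTS (
    SELECT 1
    FROM snowflake.local.data_quality_monitoring_results
    WHERE metric_name = 'DUPLICATE_COUNT'
      AND value > 0
      AND measurement_time > SNOWFLAKE.ALERT.LAST_SUCCESSFUL_SCHEDULED_TIME()
  ))
  THEN CALL SYSTEM$SEND_EMAIL('my_email_int', 'dq@example.com',
                              'Duplicates detected', 'Check the REGION table.');

ALTER ALERT duplicate_alert RESUME;
```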

We could already do all this programmatically. It’s just become easier! 🎉


Don't forget to clean up afterwards:

DROP TABLE region;
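If you want to keep the table and only detach the quality check, you can instead drop the DMF association and unset the schedule:

```sql
-- Detach the metric and remove the schedule without dropping the table
ALTER TABLE region DROP DATA METRIC FUNCTION snowflake.core.duplicate_count ON (r_name);
ALTER TABLE region UNSET DATA_METRIC_SCHEDULE;
```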


I hope you will take full advantage of these handy functions to monitor your data quality.


To learn more about DMFs... RTFM 😁

https://docs.snowflake.com/en/user-guide/data-quality-working
