Ronald Steelman

Snowflake User-Defined Types: Giving Your Schema a Face-Lift

Ronny Steelman — Sat, 14 Mar 2026 02:16:00 +0000

Data platforms are only as good as the contracts they enforce. Developers can build elegant pipelines, meticulously layered architecture, and robust data quality processes, and still end up with garbage data. One of the driving causes of garbage data is a schema that does not reflect the business reality. That’s where most teams hit a wall: they’ve got NUMBER columns that could mean anything, VARCHAR fields holding data that should follow stricter rules, and no clean way to communicate data intent across tables, teams, and time.

Enter the latest game-changer from Snowflake: User-Defined Types, or UDTs. With a UDT, a Snowflake developer can create a custom data type rooted in existing Snowflake types but carrying the meaning and constraints of the business domain. Think of a UDT as a way to make your schema self-documenting by having column types such as age or address, which tell users more than NUMBER(3,0) or OBJECT ever will.

In this post, I review the fundamentals of UDTs, explore the technical requirements needed to make your first UDT, and walk through real-world patterns that make UDTs worth adopting. Whether you’re a data engineer trying to tighten up your schemas or an architect looking for better ways to enforce consistency, there’s something here for you.

What Are User-Defined Types (And Why Should You Care)?

At their core, User-Defined Types are schema-level objects that let developers define new data types based on existing Snowflake data types. Instead of inventing data types from scratch, developers create named aliases that carry specific type information and can be reused wherever types are used: column definitions, function signatures, procedure parameters, and cast expressions.

Why are UDTs important for the evolution of data architecture? Consider a scenario most developers have lived through. You have a customers table with a postal_code column defined as VARCHAR. In a separate table, orders, you also have a column called zip_code of type VARCHAR. Furthermore, your shipping_address table has a postal code embedded in a semi-structured VARIANT. Across these three tables there is no shared contract, no guarantee that the fields follow the same rules, and no way for a new team member to look at the schema and understand that the fields are all supposed to represent the same type of data.

With UDTs, you define postal_code once as a data type in your schema, and use it everywhere. If someone looks at any table in your schema, they immediately understand what that column represents and what rules it follows. This DRY (don’t repeat yourself) approach is another way Snowflake is bridging the gap between data developers and software developers, adopting more software engineering methodologies and applying them to the data world.

While such a narrow use case can help create data contracts in a space that has traditionally been underserved by data platforms, UDTs also enable grouping related fields into a single, logical column using structured OBJECT types. Instead of scattering street, city, state, and postal_code across four columns (or worse, multiple tables), you can define an address data type that holds all of these related fields together. Let’s explore what that looks like.

Getting Started: Creating Your First UDT

To kick things off, I’ll start with a simple example. To create a User-Defined Type, a developer would use the CREATE TYPE statement like below:

CREATE TYPE age AS NUMBER(3,0);

As simple as that, I have created a custom data type called age that maps to NUMBER(3,0), a number with at most three digits and no decimal places. With the creation of the new data type, I can now use it in my table definitions:

CREATE TABLE employees (
    emp_id VARCHAR NOT NULL,
    emp_name VARCHAR(100),
    emp_age age
);

Any developer who looks at the employees table will know that emp_age isn’t just a number; it is a value that carries meaning.
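The same idea applies to the postal code scenario described earlier. The sketch below reuses the CREATE TYPE syntax shown above; the VARCHAR(10) base type and the table names customer_contacts and order_shipments are assumptions made for illustration, not objects from a real schema.

```sql
-- One shared definition (the VARCHAR(10) base type is an assumption).
CREATE TYPE postal_code AS VARCHAR(10);

-- Hypothetical tables that both reuse the same type, so the column
-- contract is stated once instead of once per table.
CREATE TABLE customer_contacts (
    cust_id VARCHAR NOT NULL,
    postal postal_code
);

CREATE TABLE order_shipments (
    order_id VARCHAR NOT NULL,
    ship_to_postal postal_code
);
```

Both columns now point at a single definition, which is the shared contract the three-table scenario was missing.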

As developers create new UDTs, there is one prerequisite to keep in mind: privileges. To create a UDT in a schema, the user must have the CREATE TYPE privilege on that schema. Snowflake administrators can grant the privilege by using:

GRANT CREATE TYPE ON SCHEMA my_db.my_schema TO ROLE data_engineer_role;

Once a developer has been granted the necessary privilege and the UDT has been defined, inserting data works exactly the way one might expect:

INSERT INTO employees VALUES ('E001', 'Jane Doe', 32);

Snowflake handles the coercion from the literal 32 to the age type seamlessly. The value fits within NUMBER(3,0), so the insert succeeds without any data type errors. If a user or system tries to insert a value such as 1000, Snowflake returns an error because the value does not fit within three digits. A fractional value such as 32.5 is worth testing in your own account: with a plain NUMBER(3,0) column, Snowflake rounds fractional input to the column's scale rather than rejecting it.

Beyond Scalars: Structured Object UDTs

Simple scalar types are useful, but the real power of UDTs shows up when you combine a UDT with structured OBJECT types. This coupling of types allows developers to model complex, real-world entities as first-class data types in any schema. Take the example below:

CREATE TYPE address AS OBJECT (
    street VARCHAR(100)
    , city VARCHAR(50)
    , state_abbr CHAR(2)
    , postal_code CHAR(10)
);

Now that there is a reusable data type that encapsulates everything an address needs to be valid, it can be used in a table:

CREATE TABLE customers (
    cust_id VARCHAR NOT NULL,
    cust_name VARCHAR(100),
    cust_address address
);

Inserting data into a structured UDT column requires the developer to cast the object to the UDT. There are two approaches that developers can take, and both are worth knowing:

Approach 1: OBJECT constant with cast

INSERT INTO customers (cust_id, cust_name, cust_address)
    SELECT
        '1000'
        , 'Acme Corp'
        , {
            'street': '101 Bikini Bottom'
            , 'city': 'Ocean Floor'
            , 'state_abbr': 'CA'
            , 'postal_code': '90210'
        }::address;

Approach 2: OBJECT_CONSTRUCT with cast

INSERT INTO customers (cust_id, cust_name, cust_address)
    SELECT
        '1001'
        , 'Widgets Inc'
        , OBJECT_CONSTRUCT(
            'street', '101 Bikini Bottom'
            , 'city', 'Ocean Floor'
            , 'state_abbr', 'CA'
            , 'postal_code', '90210'
        )::address;

While both approaches work just fine, the OBJECT constant syntax is cleaner for hardcoded values; OBJECT_CONSTRUCT is more flexible when a developer is building objects dynamically from other columns or expressions.
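To illustrate the dynamic case, here is a minimal sketch that builds the address from flat columns. The stg_customers staging table and its columns are hypothetical and exist only for this example; the customers table and the address type are the ones defined above.

```sql
-- Hypothetical staging table with flat address columns (not from the post).
CREATE TABLE stg_customers (
    cust_id VARCHAR,
    cust_name VARCHAR,
    street VARCHAR,
    city VARCHAR,
    state_abbr VARCHAR,
    postal_code VARCHAR
);

-- Build the object from column values, then cast it to the UDT.
INSERT INTO customers (cust_id, cust_name, cust_address)
    SELECT
        cust_id
        , cust_name
        , OBJECT_CONSTRUCT(
            'street', street
            , 'city', city
            , 'state_abbr', state_abbr
            , 'postal_code', postal_code
        )::address
    FROM stg_customers;
```

One thing to watch: OBJECT_CONSTRUCT omits any key whose value is NULL, so rows with missing address parts produce objects with fewer keys. OBJECT_CONSTRUCT_KEEP_NULL is the variant that retains them; how either interacts with the cast to address is worth verifying on your own data.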

Once the data is inserted into the table, querying individual fields is straightforward using the colon operator:

SELECT
    cust_id
    , cust_name
    , cust_address:city
    , cust_address:postal_code
FROM customers;

Developers now have a query that returns clean, extracted values from a structured data type without needing PARSE_JSON, lateral flattening, or other “gotchas”. The result is direct field access on a well-defined type, an experience that makes schemas easier to work with at scale.

Casting, Coercion, and the Gotchas That’ll Get Ya

In this section, I explore UDT-specific behaviors for type casting and coercion. These concepts are straightforward once understood, but they can be confusing on first encounter.

Explicit Casting

A value can be cast to a UDT if it can be cast to the UDT’s base type. Going the other direction, a UDT value can be cast to any type that its base type can be cast to:

-- Cast a literal to the age type
SELECT 25::age;

-- Cast an age value to VARCHAR
SELECT 25::age::VARCHAR;

This chaining works because Snowflake resolves the UDT to its base type and then applies the normal casting rules. Nothing surprising here.

Implicit Coercion

Implicit coercion is where things get interesting. UDT values coerce to their base types implicitly in operations. Such implicit coercion means arithmetic, comparisons, and other expressions work exactly as they would with the base type:

CREATE TABLE test_ages (a age, b age);
INSERT INTO test_ages VALUES (10, 20);

SELECT a + b AS result,
       SYSTEM$TYPEOF(a + b) AS type
  FROM test_ages;

The result is 30, and the type is NUMBER(4,0), not age. The UDT coerced to its base type for the operation. This behavior is important to internalize: operations on UDT values produce base-type results, not UDT results.
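If the sum itself should carry the age type, it can be cast back explicitly. This is a sketch against the test_ages table above, and it assumes the cast only succeeds while the sum still fits within NUMBER(3,0).

```sql
-- Cast the arithmetic result back to the UDT and inspect its type.
SELECT (a + b)::age AS result,
       SYSTEM$TYPEOF((a + b)::age) AS type
  FROM test_ages;
```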

The Set Operator and Conditional Expression Trap

Set operators and conditional expressions can trip developers up if they don’t understand them. When using set operators like UNION, INTERSECT, or EXCEPT, or conditional expressions like CASE, IFF, COALESCE, or NVL with UDT values, Snowflake resolves to the common base type. The result is not a UDT.

I’ll attempt to make this concept concrete. Create two UDTs that share the same base type:

CREATE TYPE us_zipcode AS VARCHAR;
CREATE TYPE uk_postcode AS VARCHAR;

Now use the new UDTs in a conditional expression:

SELECT IFF(TRUE, '90210'::us_zipcode, '10006') AS result,
       SYSTEM$TYPEOF(IFF(TRUE, '90210'::us_zipcode, '10006')) AS type;

The result type? VARCHAR, not us_zipcode. The UDT information is gone. If a developer needs to preserve the UDT, they must explicitly cast the entire expression:

SELECT IFF(TRUE, '90210'::us_zipcode, '10006')::us_zipcode AS result,
       SYSTEM$TYPEOF(IFF(TRUE, '90210'::us_zipcode, '10006')::us_zipcode) AS type;

Now it returns MYDB.MYSCHEMA.US_ZIPCODE as the type. The pattern holds for CASE expressions, COALESCE, and set operators. If developers want UDT output, they need to cast the final result.

This same casting behavior applies when mixing compatible UDTs. A CASE expression that returns either a uk_postcode or a us_zipcode will resolve to VARCHAR:

SELECT CASE
         WHEN TRUE THEN 'SW1A 0AA'::uk_postcode
         ELSE '90210'::us_zipcode
       END AS result,
       SYSTEM$TYPEOF(CASE
         WHEN TRUE THEN 'SW1A 0AA'::uk_postcode
         ELSE '90210'::us_zipcode
       END) AS type;

Result type: VARCHAR. To get uk_postcode, wrap the whole thing in a CAST:

SELECT CAST(
         CASE
           WHEN TRUE THEN 'SW1A 0AA'::uk_postcode
           ELSE '90210'::us_zipcode
         END AS uk_postcode
       ) AS result;

SYSTEM$TYPEOF now becomes a developer’s best friend when debugging the above situations. When something downstream breaks because a function expects a UDT but receives a base type instead, this is almost always the reason.
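The same check can be run against a set operator. The sketch below assumes the us_zipcode type created above and only inspects the resulting types. Going by the rule described in this section, the first query should report the base type and the second the UDT, but confirm this in your own account.

```sql
-- Inspect the type produced by a UNION ALL that mixes a UDT with a plain literal.
SELECT postal_value,
       SYSTEM$TYPEOF(postal_value) AS type
  FROM (
        SELECT '90210'::us_zipcode AS postal_value
        UNION ALL
        SELECT '10006'
       );

-- Cast the combined result back to the UDT when downstream code needs one.
SELECT postal_value::us_zipcode AS postal_value,
       SYSTEM$TYPEOF(postal_value::us_zipcode) AS type
  FROM (
        SELECT '90210'::us_zipcode AS postal_value
        UNION ALL
        SELECT '10006'
       );
```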

UDTs in Functions, Procedures, and Overloading

UDTs integrate with the broader Snowflake programming model, but there are a few nuances worth calling out.

UDTs as Function Arguments and Return Types

Developers can use UDTs as argument types and return types in UDFs and stored procedures. This functionality is great for enforcing data contracts in your function signatures. A function that accepts an age parameter communicates something different than one that accepts NUMBER(3,0).

There’s one critical rule for return types: if a UDT is specified as the return type of a SQL UDF or Snowflake Scripting stored procedure, the return value must be explicitly cast to the UDT in the function body. Snowflake won’t cast the return value automatically:

CREATE OR REPLACE FUNCTION format_age(input_age age)
  RETURNS age
  LANGUAGE SQL
  AS
  $$
    SELECT input_age::age
  $$;

Skip that cast, and Snowflake will throw an error. It’s a small detail, but it catches developers off guard the first time.

Non-SQL Languages

When writing UDFs or procedures in Python, Java, or other non-SQL languages, UDTs are treated as their base types. There’s no special UDT handling in the Python or Java runtime — a parameter of type age comes in as a regular number. This behavior is pragmatic; it means you don’t need UDT-aware libraries in your handler code. But it also means the UDT boundary is enforced at the Snowflake SQL layer rather than within procedural code.

Function Overloading

UDTs and their compatible base types can be used for function overloading. Developers can define two functions with the same name, where one accepts an age argument, and another accepts a NUMBER(3,0) argument. Snowflake will resolve the correct function based on the argument type at call time. Function overloading is a powerful pattern for building type-safe APIs within a data platform.
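As a rough sketch of what that overloading could look like, the two functions below share a name and differ only in argument type. The function name describe_value and its parameter v are hypothetical, and the resolution behavior is the one described above rather than something verified here, so test it before relying on it.

```sql
-- Overload 1: accepts the UDT (hypothetical function name).
CREATE OR REPLACE FUNCTION describe_value(v age)
  RETURNS VARCHAR
  LANGUAGE SQL
  AS
  $$
    'age: ' || v::VARCHAR
  $$;

-- Overload 2: accepts the base type.
CREATE OR REPLACE FUNCTION describe_value(v NUMBER(3,0))
  RETURNS VARCHAR
  LANGUAGE SQL
  AS
  $$
    'number: ' || v::VARCHAR
  $$;

-- Call each overload with an explicitly typed argument.
SELECT describe_value(32::age) AS from_udt,
       describe_value(32::NUMBER(3,0)) AS from_base;
```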

Real-World Patterns and When (Not) to Use UDTs

Now that I’ve covered the mechanics, let’s talk about where UDTs deliver real value and where you should think twice.

Where UDTs Shine

Domain-specific type standardization. This standardization is the primary use case. If an organization has concepts that appear across multiple tables, such as customer IDs, product codes, currency amounts, and postal codes, defining them as UDTs establishes a single source of truth for how those fields are defined. Change the definition once, and every table that uses the type is aligned (with the caveats I’ll cover below).

Structured entity modeling. The address example isn’t just a demo, it’s a pattern. Anywhere there is a cluster of related fields that always travel together (addresses, contact info, geographic coordinates, monetary amounts with currency), a structured OBJECT UDT keeps them cohesive. It reduces column sprawl and makes for a more intuitive schema design.

Self-documenting schemas. When a new engineer joins the team and looks at table definitions, UDTs tell them what the data means, not just its shape. age communicates something that NUMBER(3,0) doesn’t. address communicates something that four separate VARCHAR columns don’t. This is an underrated use case that can be a huge benefit for organizations that need to ramp up new developers quickly.

Function signature contracts. Using UDTs in UDF and procedure signatures makes a data platform’s API layer more expressive. A function that takes a us_zipcode is making a statement about what it expects, and Snowflake can enforce that at the type level.

Where to Be Careful

Schema evolution isn’t supported. Schema evolution is the big gotcha. If data sources change frequently and developers rely on schema evolution to automatically add columns, UDTs won’t play well with that workflow. Unsupported schema evolution is a meaningful limitation for ingestion-heavy pipelines where source schemas are volatile.

Drop-and-recreate for changes. Developers can’t use ALTER TYPE to change the definition of a UDT. To modify one, developers have to drop it and recreate it. When dropping and recreating a UDT, SQL statements that operate on columns using the type may start returning errors. Functions and procedures that reference the type will also break and need to be dropped and recreated. This behavior means a UDT change is essentially a schema migration that requires coordination, which Snowflake will hopefully address over time.

The coercion behavior requires discipline. As I showed in the casting section, operations on UDT values silently resolve to base types. If downstream logic depends on the result being a UDT, developers need to use explicit casts everywhere. In complex queries with multiple CTEs and transformations, it’s easy to lose the UDT type along the way without realizing it. Build the habit of using SYSTEM$TYPEOF during development to verify types at each stage.

Column alterations. The ALTER TABLE ... ALTER COLUMN command can change a column from a UDT to a compatible Snowflake type and vice versa. Using this approach gives developers an escape hatch if they need to move away from a UDT, but it also means anyone with the right privileges could inadvertently strip the UDT from a column. Governance around type changes matters.
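A minimal sketch of that escape hatch, using the employees table from earlier. It assumes the standard SET DATA TYPE form of ALTER COLUMN is the one that applies to UDT columns, so treat it as a starting point to verify rather than confirmed syntax.

```sql
-- Move the column off the UDT and onto its base type.
ALTER TABLE employees ALTER COLUMN emp_age SET DATA TYPE NUMBER(3,0);

-- Move it back onto the UDT once the type is available again.
ALTER TABLE employees ALTER COLUMN emp_age SET DATA TYPE age;

-- Confirm which type the column currently reports.
SELECT SYSTEM$TYPEOF(emp_age) AS type
  FROM employees
 LIMIT 1;
```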

Conclusion

User-Defined Types are one of those features that don’t make much noise but can fundamentally improve how a business’s data platform communicates intent. UDTs sit at the intersection of data quality, schema design, and developer experience, three things that every mature data organization cares about deeply.

The basics of UDTs are approachable. Create a custom type. Use the UDT in a table. Insert data into the table. The learning curve is gentle, and the immediate payoff is a more readable, self-documenting schema. As developers dive deeper, the integration with structured OBJECT types, UDFs, stored procedures, and function overloading opens up patterns that make a data platform more expressive and data contracts more enforceable.

But UDTs aren’t a magic wand. The lack of schema evolution support, the drop-and-recreate lifecycle, and the implicit coercion behavior all require intentional design decisions. Developers and architects need to consider where UDTs add value within a specific architecture and where the overhead isn’t justified.

My recommendation? Start small. Pick a handful of domain concepts that appear across multiple tables, things like customer IDs, postal codes, or monetary amounts, and define them as UDTs. Get comfortable with the casting behavior. Build some UDFs that use UDTs in their signatures. Once developers see the benefits of schema clarity and type safety, they’ll naturally find more places to apply them.

At the end of the day, the best data platforms aren’t just fast and scalable. They’re understandable. UDTs are a step toward schemas that speak the language of the business, and that’s a step worth taking.

Need help or hands-on guidance? Ronny Steelman has over 20 years of experience in data development and architecture, including more than 8 years working with Snowflake. He is a two-time published Snowflake author and a 2026 Snowflake Data Superhero.

Snowflake, DataOps, and the Book That Needed to Exist

Ronny Steelman — Mon, 29 Dec 2025 16:26:10 +0000

I never set out to write a second book about Snowflake. After the first one, I thought I had scratched that itch. But somewhere between managing branch-based development in massive data environments, building production pipelines with DataOps.Live, and listening to teams struggle with the same problems I once tripped over, it became clear: someone needed to connect the modern data world with the operational rigor that software engineering has long had.

When I started my career as a software engineer, CI/CD wasn’t some abstract methodology; it was survival. It was the reason releases didn’t turn into all-night code freezes, and why developers could experiment without blowing up production. When I shifted into data, that discipline didn’t immediately follow me. Engineers and technical teams were still viewing data like a wild ecosystem that resisted automation, where change management meant “pray and deploy” and testing was an afterthought. Meanwhile, Snowflake arrived as a flexible, powerful, and genuinely different platform, opening the door to possibilities that demanded a better way to control the chaos.

That’s where DataOps.Live entered my world. By 2020, I was using it in real-world environments; not proofs-of-concept, not toy projects, but messy, high-stakes architectures. It was the first time I saw DataOps principles applied in a way that felt as natural and essential as CI/CD did back when I was writing application code. The workflow made sense. The discipline felt familiar. The platform enabled what the philosophy demanded. And somewhere between refactoring pipelines, reviewing branch-based merges, and watching teams breathe easier after deployments, I realized there was a book in all of this.

Writing it wasn’t about ego or money, though I’d be lying if I said the finished manuscript didn’t hit me with a wave of pride that I didn’t expect. It was about putting tribal knowledge into a form that wouldn’t get lost in Slack threads or conference calls. It was about bridging gaps between beginners and leaders, between those just stepping into Snowflake and those who need to justify an architectural investment. It was about showing data practitioners why CI/CD isn’t a luxury; it’s oxygen. And yes, it was about helping others avoid the painful lessons I had already learned.

I didn’t want to write the definitive encyclopedia of DataOps or DataOps.Live…nobody wants a second job disguised as a book. I wanted something you could read without feeling overwhelmed, but still feel like you walked away with the blueprint. Tight, readable, and grounded in reality, even if that meant cutting chapters I spent weeks on. It turns out restraint is as important in writing as it is in architecture.

Why CI/CD Isn’t Optional Anymore, Even Though Data Teams Still Act Like It Is

If you ask a room full of software developers whether CI/CD is necessary, they’ll look at you like you just wondered whether gravity is optional. Ask the same question in a room full of data engineers, analysts, or platform owners, and you’re still going to get hesitation. Tradition is part of it, as data work has long been tied to manual processes and rituals passed down like folklore. But the environment has changed.

Today, data products evolve as fast as applications. Dashboards are front doors into decision-making. ELT pipelines change weekly, and sometimes daily. And when the surface area of your data estate grows, the consequences of informal change grow with it. Break a table, and you might break an executive dashboard. Break a dashboard, and you might break a quarterly outcome.

Snowflake, almost paradoxically, made this worse and better at the same time. Its architecture makes iteration easier, sometimes infinitely easier, but iteration without discipline leads to drift, inconsistency, and the sinking feeling that nobody knows whether the model being queried today matches the one that existed last week.

That’s why CI/CD matters. It’s the difference between development being an adventure and development being a controlled experiment. But explaining that to someone who has never used branch-based development in a data environment is like describing color to someone who’s only lived in grayscale. They won’t feel the need until they experience the consequences.

And then comes the first revelation: zero-copy cloning gives you every reason to apply CI/CD in Snowflake. Every branch gets its own sandbox, backed with real data, isolated enough to experiment yet safe enough not to wreck production. When that clicked for me, I had discovered the thing that made software-style development finally work in data environments.
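For readers who have not seen it, the per-branch sandbox can be as small as one statement. The database names below are hypothetical, and in practice a DataOps tool would usually issue these statements for you.

```sql
-- Hypothetical names: analytics_prod is the production database and
-- analytics_feature_1234 is the sandbox for a single feature branch.
CREATE DATABASE analytics_feature_1234 CLONE analytics_prod;

-- ... develop and test against the clone without touching production ...

-- Remove the sandbox once the branch has been merged.
DROP DATABASE IF EXISTS analytics_feature_1234;
```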

I’ve seen teams go from hesitant to believers in a single sprint, not because of theory, but because they watched a branch come to life, tested changes, ran automated transformations across bronze, silver, and gold layers, deployed confidently, and walked away with the incredible feeling of meaningful progress without collateral damage.

DataOps isn’t a trend; it’s an overdue inevitability. Snowflake just made it obvious.

What I Hope People Take Away, And What I Want to See Happen Next

If readers walk away with one insight from this book, I hope it’s this: DataOps isn’t about Snowflake, it’s about the mindset you bring to data. Snowflake just happens to be the environment where the philosophy shines brightest. I’ve used these principles across SQL Server, Oracle, and Postgres. Anywhere data exists, reliable pipelines and disciplined deployments are needed as well.

But Snowflake still feels special to me. The combination of elasticity, separation of storage and compute, and zero-copy cloning feels like an ecosystem designed to be fully unlocked by DataOps. It’s why, if I’m honest, I want to see Snowflake buy DataOps.Live outright and make DataOps a first-class feature. It shouldn’t be something users might consider someday; it should be part of the contract, a default expectation like security or backups. A platform this capable deserves operational guardrails baked in.

The book is my attempt to push that future forward, even if just a step. It’s for the beginner who wants to understand the why, for the expert who wants confidence they’re not alone in their frustrations, for the technical leader who needs language to justify investment, and for the practitioner who has always suspected that “just refresh the model and hope” isn’t a strategy.

I’ve seen first-hand what a DataOps workflow enables: collaborative development without collisions, pipelines that survive change, data layers that support each other rather than interfere, and the rare feeling of going to bed before a release with real sleep, not anxious sleep.

So far, the responses are overwhelmingly telling me others feel the same. DataOps.Live has endorsed the book, they’re sharing it with customers, and early readers have told me it puts words to problems they couldn’t articulate. That’s the reward, knowing something I once carried in my head is now in the hands of others who can build with it.

And yes, when the box of author copies arrived, I took a moment to feel proud, not because it was finished, but because it finally said what I had been trying to say for years: data deserves discipline, and discipline enables freedom.

If you’ve ever felt the gap between what Snowflake can do and what your processes let it do, this book is my attempt to close that distance. It’s practical, it’s honest, and it reflects the way teams actually work, not the way slide decks pretend they do. If you want a more straightforward path to bringing discipline, collaboration, and CI/CD into your data world, give it a read and see if it moves you forward.

Mastering Snowflake DataOps: A Practical Guide to DataOps.Live — available here: https://link.springer.com/book/10.1007/979-8-8688-1754-0

Optimizing Data Workflows in Snowflake with Medallion Architecture

Ronny Steelman — Thu, 27 Feb 2025 04:12:00 +0000

Data platforms today must balance scalability, governance, and performance while enabling seamless access to high-quality, analytics-ready datasets. The Medallion Data Architecture, originally introduced in lakehouse architectures, provides a structured approach for organizing data at different levels of refinement. While this concept is often associated with platforms like Databricks and Delta Lake, it is just as effective when implemented in Snowflake’s cloud data platform.

Snowflake’s separation of compute and storage, combined with its native support for semi-structured data, incremental processing, and security features, makes it an excellent fit for Medallion architecture. Organizations can improve data reliability, simplify transformations, and enhance performance for analytics and machine learning workloads by structuring data into Bronze, Silver, and Gold layers.

In this article, we’ll explore how Medallion Architecture can be applied within Snowflake and how it benefits organizations across industries such as finance, healthcare, and retail.

Understanding the Medallion Architecture

The Medallion Data Architecture is a framework for progressive data refinement, ensuring that raw data is structured, cleansed, and enriched before being consumed by business intelligence, reporting, and AI/ML models. It consists of three primary layers:

Bronze Layer — Stores raw, unprocessed data from diverse source systems.
Silver Layer — Cleans, validates, and standardizes data for transformation.
Gold Layer — Provides enriched, aggregated, and analytics-ready datasets.

This structured approach offers several advantages:

Data Lineage & Auditability: Each layer provides a checkpoint, making it easier to trace the origin and transformations of data.
Incremental Processing: Changes are applied in stages, reducing redundancy and computational overhead.
Flexibility Across Use Cases: Different layers support multiple needs — from real-time analytics to machine learning feature engineering.
Data Quality & Governance: Issues such as duplicates, missing values, and schema inconsistencies are addressed progressively.

Unlike a traditional ETL (Extract, Transform, Load) approach, where data is processed upfront before landing in a warehouse, Medallion Architecture aligns more closely with ELT (Extract, Load, Transform) principles. This allows raw data to be retained, ensuring flexibility in reprocessing historical records, debugging errors, and enabling ad hoc analytics.

ETL vs. ELT

Traditional data pipelines followed an ETL (Extract, Transform, Load) approach, where data was extracted from source systems, transformed into a structured format in an external processing layer, and then loaded into a data warehouse. While this method worked well when computational resources were limited, it created challenges in scalability, flexibility, and performance. ETL required significant preprocessing, making it difficult to modify transformations without extensive rework. Additionally, transformations performed outside the data warehouse often led to data silos and governance issues, as teams had less visibility into the raw, unprocessed data.

With the rise of cloud-native architectures like Snowflake, organizations have shifted to ELT (Extract, Load, Transform), which inverts the process by loading raw data first and performing transformations directly within the warehouse. This approach leverages Snowflake’s scalable compute and storage, allowing transformations to be applied incrementally and in parallel. Since raw data is always retained, teams can reprocess data with different logic, backfill historical datasets, and support multiple transformation workflows without needing to modify source ingestion pipelines. ELT also ensures that data engineers, analysts, and data scientists have access to both raw and processed datasets, enabling richer analytics and faster experimentation.

The Medallion Architecture naturally aligns with ELT by structuring transformations across Bronze, Silver, and Gold layers. Data is first extracted and loaded into the Bronze Layer, ensuring a single source of truth that captures all incoming records. Transformations such as deduplication, validation, and enrichment occur progressively in the Silver and Gold layers, optimizing query performance while maintaining historical traceability. By adopting ELT, organizations gain greater flexibility, improved data governance, and faster processing speeds, all while reducing complexity in data engineering workflows.

Bronze Layer: Storing Raw Data in Snowflake

The Bronze Layer is the foundation of the Medallion Architecture, serving as the initial landing zone for raw data ingestion. This layer is designed to accommodate data from multiple sources in its original, unprocessed format, preserving the full fidelity of the source system.

Raw Data Storage: Data is loaded from various sources, including databases, APIs, logs, IoT devices, and third-party feeds.
Schema Flexibility: Supports semi-structured formats like JSON, Avro, Parquet, and CSV.
Data Retention: Historical records are maintained, ensuring full traceability for auditing and debugging.
Minimal Transformation: No major processing occurs except for basic ingestion validation (file format, structure, integrity checks, etc.).

Example: Loading Raw Data into Snowflake

CREATE OR REPLACE TABLE bronze_sales (
    raw_data VARIANT, 
    ingestion_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);

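-- A JSON load yields one VARIANT value per record, so name the target
-- column explicitly and let ingestion_timestamp fall back to its default.
COPY INTO bronze_sales (raw_data)
FROM (SELECT $1 FROM @my_s3_stage)
FILE_FORMAT = (TYPE = 'JSON')
PATTERN = '.*sales_data.*';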

This example demonstrates storing raw JSON data in a VARIANT column inside Snowflake, allowing flexible ingestion without predefined schemas.

Silver Layer: Data Cleaning and Transformation

Once data is ingested into the Bronze Layer, it progresses to the Silver Layer, where the primary focus is on data quality, standardization, and transformation.

Deduplication & Cleansing: Identifies and removes duplicate records and invalid entries.
Schema Standardization: Converts semi-structured data into structured tables with defined data types.
Enrichment: Joins data with reference tables, lookup values, or external datasets for additional context.
Handling Missing Data: Fills NULL values with defaults or imputed values, or removes incomplete records.

Example: Transforming Bronze Data into Silver Table

CREATE OR REPLACE TABLE silver_sales AS
SELECT 
    raw_data:id::STRING AS sales_id,
    raw_data:customer::STRING AS customer_name,
    raw_data:amount::FLOAT AS sales_amount,
    ingestion_timestamp
FROM bronze_sales
WHERE raw_data:id IS NOT NULL;

This transformation extracts structured fields from JSON data, converting them into a clean, structured table for downstream analysis.
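The query above filters out records without an id, but it does not deduplicate or handle missing values, two of the Silver Layer responsibilities listed earlier. Below is a minimal Python sketch of that logic, assuming "keep the most recently ingested record per id" as the deduplication rule and a default of 0.0 for missing amounts. Both rules are illustrative assumptions, and whether a default amount is appropriate is a business decision. In Snowflake itself, this kind of deduplication is commonly written with ROW_NUMBER() and QUALIFY.

```python
def to_silver(bronze_rows, default_amount=0.0):
    """Keep the latest record per id and fill missing amounts with a default."""
    latest = {}
    for row in bronze_rows:
        sales_id = row["raw_data"].get("id")
        if sales_id is None:
            continue  # same rule as WHERE raw_data:id IS NOT NULL
        current = latest.get(sales_id)
        if current is None or row["ingestion_timestamp"] > current["ingestion_timestamp"]:
            latest[sales_id] = row  # a newer record replaces the older duplicate

    silver = []
    for sales_id, row in latest.items():
        data = row["raw_data"]
        amount = data.get("amount")
        silver.append({
            "sales_id": str(sales_id),
            "customer_name": data.get("customer"),
            "sales_amount": float(amount) if amount is not None else default_amount,
            "ingestion_timestamp": row["ingestion_timestamp"],
        })
    return silver

# Integer timestamps stand in for real load times to keep the example short.
bronze = [
    {"raw_data": {"id": "s1", "customer": "Acme", "amount": 100}, "ingestion_timestamp": 1},
    {"raw_data": {"id": "s1", "customer": "Acme", "amount": 120}, "ingestion_timestamp": 2},
    {"raw_data": {"customer": "NoId", "amount": 5}, "ingestion_timestamp": 3},
    {"raw_data": {"id": "s2", "customer": "Globex"}, "ingestion_timestamp": 4},
]
silver = to_silver(bronze)
for record in silver:
    print(record)
```

Of the four Bronze rows, the older duplicate of s1 and the row without an id are dropped, and s2 receives the default amount.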

Gold Layer: Business-Ready Data for Analytics

The Gold Layer is the final stage of the Medallion Architecture, where data is refined into a fully optimized, analytics-ready state.

Aggregations & Summaries: Pre-computed metrics such as monthly revenue, customer churn, or fraud risk scores.
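Optimized Performance: Data is pre-aggregated and, for large tables, organized with clustering keys on top of Snowflake’s automatic micro-partitioning for faster query execution. Standard Snowflake tables do not rely on traditional indexes.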
Business-Focused Views: Designed for reporting dashboards, AI/ML models, and decision-making processes.
Security & Governance: Implements role-based access control (RBAC) to restrict sensitive data access.

Example: Creating a Gold Table for Business Insights

CREATE OR REPLACE TABLE gold_sales_summary AS
SELECT 
    customer_name,
    DATE_TRUNC('month', ingestion_timestamp) AS month,
    SUM(sales_amount) AS total_revenue,
    COUNT(DISTINCT sales_id) AS transaction_count
FROM silver_sales
GROUP BY customer_name, month;

This query aggregates sales data at a monthly level, optimizing it for executive reporting and BI dashboards.

Platinum Layer: Advanced Analytics and AI Optimization

While the traditional Medallion Architecture consists of Bronze, Silver, and Gold layers, some organizations require an additional level of data refinement. This is where the Platinum Layer comes in — designed to support advanced analytics, AI/ML feature engineering, and high-performance optimization.

Unlike the Gold Layer, which focuses on business reporting and aggregated metrics, the Platinum Layer is tailored for predictive modeling, AI-driven decision-making, and real-time data applications. This layer is particularly beneficial for industries such as:

Finance: Fraud detection models, credit risk scoring, algorithmic trading.
Healthcare: Patient risk stratification, personalized treatment recommendations.
Retail & E-commerce: Demand forecasting, recommendation engines, dynamic pricing.

Key Features of the Platinum Layer

Machine Learning Feature Stores: Curates structured datasets optimized for AI/ML training.
Real-Time Processing: Supports streaming data ingestion for up-to-the-minute analytics.
Hyper-Optimized Query Performance: Uses materialized views, caching, and query acceleration techniques to enhance speed.
Automated Data Science Pipelines: Integrates with platforms like Snowflake Snowpark, Databricks, or AWS SageMaker to run ML workflows.

Example: Creating a Platinum Table for Predictive Modeling

CREATE OR REPLACE TABLE platinum_customer_churn AS
SELECT 
    customer_name,
    SUM(total_revenue) AS total_revenue,
    SUM(transaction_count) AS transaction_count,
    -- gold_sales_summary is monthly and has no ingestion_timestamp,
    -- so recency is measured from the start of the last active month
    DATEDIFF(DAY, MAX(month), CURRENT_DATE) AS days_since_last_purchase,
    CASE 
        WHEN DATEDIFF(DAY, MAX(month), CURRENT_DATE) > 90 THEN 1 
        ELSE 0 
    END AS churn_label
FROM gold_sales_summary
GROUP BY customer_name;

In this example, the Platinum Layer creates a churn prediction dataset, helping businesses identify at-risk customers based on their purchasing patterns. This dataset can be fed into ML models to predict churn and trigger customer retention strategies.
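The labeling rule in the query (strictly more than 90 days since the last recorded activity counts as churn) is easy to get wrong at the boundary. The Python sketch below isolates it; the function name and the threshold_days parameter are hypothetical and exist only for illustration.

```python
from datetime import date

def churn_label(last_purchase, today, threshold_days=90):
    """Return (days_since_last_purchase, label); label is 1 only when more than threshold_days have passed."""
    days = (today - last_purchase).days
    return days, (1 if days > threshold_days else 0)

today = date(2025, 6, 30)
print(churn_label(date(2025, 4, 1), today))   # (90, 0): exactly 90 days is not yet churn
print(churn_label(date(2025, 3, 31), today))  # (91, 1): 91 days crosses the threshold
```

Keeping the threshold as a parameter makes it straightforward to test how sensitive the label is to the chosen cutoff before training a model on it.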

When to Use the Platinum Layer?

Not all organizations need a Platinum Layer, but it becomes essential when:

AI/ML and predictive analytics are a core part of business strategy.
Streaming and real-time analytics are required for decision-making.
Complex transformations and feature engineering are needed beyond traditional BI reporting.
Compute-intensive workloads must be separated from operational analytics for performance reasons.

The Platinum Layer builds upon the Gold Layer to unlock advanced insights, automation, and AI-driven decision-making — a game-changer for data-driven enterprises.

Semantic Layer: A Unified Data Interface

As organizations move toward more complex and diverse data landscapes, the need for a semantic layer becomes apparent. The semantic layer acts as a unified interface between business users and raw data, providing a business-friendly view of the underlying data in the Medallion Architecture. This layer plays a crucial role in improving both the accessibility and understandability of data, ensuring that stakeholders across various business functions can work with data effectively without needing deep technical knowledge.

What is a Semantic Layer?

A semantic layer abstracts and simplifies the complexity of raw data by creating a business-oriented model that aligns directly with business goals and objectives. It provides a consistent set of business definitions, metrics, and KPIs that business users can easily query. Essentially, it allows users to interact with data using terms they understand, like “revenue,” “profit margin,” or “customer satisfaction,” rather than raw database schema or technical jargon.

For instance, instead of exposing business users directly to technical table names like sales_transactions or customer_data, the semantic layer can define an easy-to-understand metric like total_sales or active_customers based on a combination of data from various sources across the Medallion Architecture. This ensures that even non-technical users, such as product managers, marketing analysts, or executives, can gain insights without needing to understand the underlying technical complexity.

Benefits of a Semantic Layer in Medallion Architecture

Improved Data Accessibility: The semantic layer creates an intuitive interface for business users, making it easier for them to understand and explore data without the need for SQL or technical expertise.
Consistent Business Logic: By centralizing business rules and logic in one layer, organizations ensure that all company users work from the same definitions and calculations.
Self-Service Analytics: With a semantic layer in place, users can independently query data and generate reports, reducing the dependency on data teams for routine tasks.
Data Security and Governance: The semantic layer helps enforce data governance rules and ensure that users only access the data they are authorized to see. This is crucial for industries with strict compliance requirements, such as healthcare and finance.

Example: Creating a Semantic Layer View in Snowflake

To implement a semantic layer in Snowflake, organizations typically create views or materialized views that map the raw data from the Bronze, Silver, and Gold layers to user-friendly business metrics.

CREATE OR REPLACE VIEW semantic_sales_summary AS
SELECT
    customer_id,
    SUM(transaction_amount) AS total_sales,
    COUNT(DISTINCT product_id) AS product_count,
    CASE
        WHEN SUM(transaction_amount) > 500 THEN 'High'
        ELSE 'Low'
    END AS customer_segment
FROM gold_sales_data
GROUP BY customer_id;

In this example, a view is created on the Gold Layer sales data to provide an aggregated view of sales at the customer level. Instead of raw transaction data, business users can now work with total_sales, product_count, and customer_segment — simple, business-friendly metrics that help define the customer profile.
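One way to picture the "consistent business logic" benefit is to keep each metric definition in a single place and generate views from it. The Python sketch below does that with a hypothetical METRICS registry and a hypothetical build_view_sql helper; neither is a Snowflake feature, and the identifiers are assumed to come from trusted code rather than user input.

```python
# Hypothetical registry: one agreed SQL expression per business metric.
METRICS = {
    "total_sales": "SUM(transaction_amount)",
    "product_count": "COUNT(DISTINCT product_id)",
}

def build_view_sql(view_name, source_table, group_by, metric_names):
    """Render a CREATE VIEW statement from the shared metric definitions."""
    unknown = [name for name in metric_names if name not in METRICS]
    if unknown:
        raise KeyError(f"undefined metrics: {unknown}")
    select_items = [group_by] + [f"{METRICS[name]} AS {name}" for name in metric_names]
    select_list = ",\n    ".join(select_items)
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT\n    {select_list}\n"
        f"FROM {source_table}\n"
        f"GROUP BY {group_by};"
    )

sql = build_view_sql(
    "semantic_sales_summary", "gold_sales_data", "customer_id",
    ["total_sales", "product_count"],
)
print(sql)
```

Because every view is rendered from the same registry, a change to how total_sales is calculated happens once, and a request for an undefined metric fails loudly instead of producing an ad hoc definition.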

When Should You Consider Implementing a Semantic Layer?

A semantic layer is particularly useful when:

Non-technical users need to access and understand complex data without direct interaction with the raw data or underlying databases.
Multiple teams need to align on consistent definitions of key business metrics to ensure coherent decision-making.
The data model is complex, and it’s important to present users with a simplified, business-friendly interface.
Data security and governance policies need to be enforced while ensuring user-friendly data access.

Conclusion

By adopting Medallion Architecture in Snowflake, businesses can create a highly optimized and scalable data pipeline that efficiently handles everything from raw data ingestion to advanced analytics. The Bronze, Silver, and Gold layers provide a structured framework to refine and transform data into valuable insights, while the Platinum layer enables advanced data science capabilities.

Moreover, the inclusion of a semantic layer adds a crucial layer of accessibility and consistency, empowering business users to interact with data in a way that is intuitive and aligned with business goals. This enables organizations to improve decision-making, drive efficiency, and ultimately gain a competitive advantage in the market.

Implementing Medallion Architecture in Snowflake is about more than optimizing your data pipeline; it’s about building a solid foundation for data-driven success.

Exploring DeepSeek-R1 on Snowflake Cortex AI: The Future of AI-Powered Insights

Ronny Steelman — Fri, 07 Feb 2025 22:42:00 +0000

In the ever-evolving landscape of artificial intelligence, businesses are increasingly turning to advanced models to drive innovation and unlock valuable insights. One such groundbreaking advancement is the introduction of DeepSeek-R1, a highly capable large language model (LLM) that has now been integrated into Snowflake Cortex AI. This powerful model, trained using reinforcement learning (RL) techniques, brings cutting-edge capabilities to the forefront, offering unprecedented performance in complex tasks such as reasoning, math, and code generation. With its inclusion in Snowflake’s Cortex AI suite, companies now have access to a highly efficient, scalable solution that can seamlessly integrate AI into their workflows.

Why DeepSeek-R1 Is a Game Changer for AI

DeepSeek-R1 represents a significant leap in the development of AI models. Unlike traditional models that rely heavily on supervised fine-tuning (SFT), DeepSeek-R1 is trained primarily through large-scale reinforcement learning, a method that allows the model to learn and improve through interaction and self-guided exploration, with only a small amount of cold-start data used beforehand. This approach enables DeepSeek-R1 to exhibit exceptional capabilities in areas that have traditionally been challenging for AI systems, such as complex reasoning, problem-solving, and chain-of-thought (CoT) generation.

The model’s ability to generate long, complex reasoning chains allows it to solve higher-order problems that require multiple steps, self-verification, and reflection. DeepSeek-R1’s flexibility and ability to adapt are key factors that set it apart from other models. These attributes are especially critical for businesses looking to implement AI for tasks like customer feedback analysis, financial decision-making, and large-scale data analysis.

How DeepSeek-R1 Enhances AI with Reinforcement Learning

Reinforcement learning (RL) provides the core innovation behind DeepSeek-R1’s design. Traditional AI models often require significant fine-tuning using labeled datasets, but this process can be time-consuming and may not always lead to the best results. By applying RL at scale to the base model, with only a small cold-start dataset rather than extensive supervised fine-tuning, DeepSeek-R1 is able to enhance its reasoning and problem-solving capabilities.

DeepSeek-R1’s developers leveraged this approach to explore new ways of solving complex problems. For example, the model can generate detailed reasoning steps for issues that demand intricate, multi-step thinking. Earlier RL-only iterations of the model struggled with language mixing, poor readability, and endless repetition. By incorporating cold-start data before the reinforcement learning phase, the DeepSeek team mitigated these challenges, resulting in a more readable, accurate model capable of handling a wider range of tasks.

Seamless Integration with Snowflake Cortex AI

One of the most exciting aspects of the DeepSeek-R1 model is its integration into Snowflake Cortex AI. Snowflake has built a suite of AI tools that allow businesses to easily integrate advanced models like DeepSeek-R1 without manual setup or maintenance complexities. With Snowflake’s serverless architecture, businesses can access the model’s capabilities through a simple interface that eliminates the need for managing APIs, complex integrations, or resource-heavy deployments.

For companies already using Snowflake’s cloud platform, the addition of DeepSeek-R1 to Cortex AI means they can easily integrate powerful AI-driven insights into their existing data pipelines and applications. Whether you’re processing large volumes of structured data or working with unstructured data like text, Snowflake Cortex AI enables businesses to seamlessly blend these insights into their workflows.

SQL and Python Integration

Snowflake Cortex AI offers two primary ways to integrate DeepSeek-R1 into your systems: via SQL and Python. This versatility ensures that businesses can access the model using familiar tools, making it easy to leverage the full power of AI in a way that suits their needs.

SQL Integration: With the COMPLETE function in Snowflake, users can easily incorporate DeepSeek-R1 into their data queries to generate insights from structured data. Whether you need to summarize customer feedback, analyze transaction data, or generate predictions, Snowflake’s SQL interface lets you run these tasks efficiently and with minimal coding.

For example, here’s how businesses can use SQL to summarize customer feedback:

SELECT SNOWFLAKE.CORTEX.COMPLETE('deepseek-r1', 
   [{'role': 'user', 'content': CONCAT('Summarize this customer feedback in bullet points: ', content)}], 
    {'guardrails': true}
)
FROM customer_feedback;  -- hypothetical table with a text column named content

By activating the Cortex Guard feature, businesses can ensure that the model’s responses adhere to governance policies and filter out inappropriate content, ensuring that only safe and valuable insights are returned.

Python Integration: For developers who prefer working in Python, Snowflake Cortex AI also supports this popular programming language for custom integrations. Whether you’re building a standalone application or integrating the model into an existing system, Python offers a flexible environment for leveraging DeepSeek-R1’s capabilities.

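Here’s an example of an inference call to the REST API, shown with curl; the same JSON body can be sent from Python with any HTTP client. The headers and endpoint URL are assumptions based on the general shape of the Cortex REST API, and <token>, <feedback>, and <account_identifier> are placeholders:

curl -X POST \
    -H "Authorization: Bearer <token>" \
    -H "Content-Type: application/json" \
    -H "Accept: application/json, text/event-stream" \
    -d '{
    "model": "deepseek-r1",
    "messages": [{ "content": "Summarize this customer feedback in bullet points: <feedback>" }],
    "top_p": 0,
    "temperature": 0.6
}' \
    "https://<account_identifier>.snowflakecomputing.com/api/v2/cortex/inference:complete"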

With both options available, businesses can choose the integration method that best aligns with their technical infrastructure and workflow.
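For readers who want to see the Python side of the REST option, here is a minimal sketch that only builds the request and does not send it. The endpoint path, the header set, and the build_cortex_request helper are assumptions for illustration and should be checked against the current Cortex REST API reference; the account identifier and token are placeholders.

```python
import json
import urllib.request

def build_cortex_request(account_identifier, token, feedback):
    """Build (but do not send) a POST request for a Cortex inference call."""
    # Assumed endpoint path; verify it against the Cortex REST API reference.
    url = (
        f"https://{account_identifier}.snowflakecomputing.com"
        "/api/v2/cortex/inference:complete"
    )
    payload = {
        "model": "deepseek-r1",
        "messages": [
            {"content": f"Summarize this customer feedback in bullet points: {feedback}"}
        ],
        "top_p": 0,
        "temperature": 0.6,
    }
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
        "Accept": "application/json, text/event-stream",
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

req = build_cortex_request(
    "myorg-myaccount",   # placeholder account identifier
    "<token>",           # placeholder credential
    "Shipping was slow but support was helpful.",
)
print(req.get_method(), req.full_url)
print(json.loads(req.data)["messages"][0]["content"])
# Sending is left out on purpose: urllib.request.urlopen(req) needs a real account and token.
```

Separating request construction from sending keeps the payload easy to inspect and test before any credentials are involved.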

Ensuring Security and Governance with Cortex Guard

As AI technology becomes more powerful, ensuring the safe and responsible use of these models is critical. Snowflake’s Cortex Guard provides a safeguard against potentially harmful or unsafe content generated by the model. Whether it’s hate speech, violent content, or inappropriate recommendations, Cortex Guard helps businesses use DeepSeek-R1 in a way that aligns with their ethical guidelines and governance policies.

Cortex Guard can be activated with just a simple setting, ensuring that any harmful content is automatically filtered out. This feature helps businesses maintain control over their AI interactions, providing peace of mind that their use of AI technology will not result in the generation of inappropriate or harmful outputs.

DeepSeek-R1: Unlocking Cost-Effective AI Insights

With DeepSeek-R1, businesses can achieve remarkable results with fewer resources and reduced training costs. DeepSeek’s low-precision FP8 training and auxiliary-loss-free load-balancing strategy ensure that the model delivers state-of-the-art performance while minimizing computational costs. For organizations looking to drive cost efficiencies in their AI deployments, DeepSeek-R1 offers an attractive solution.

What’s Next for Snowflake Cortex AI and DeepSeek-R1?

As DeepSeek-R1 continues to evolve, Snowflake’s AI research team is working to further enhance the model’s performance, focusing on reducing inference costs and expanding its capabilities. In the future, businesses can expect even more advanced features that will help streamline the creation and deployment of generative AI applications.

Furthermore, with the upcoming general availability of DeepSeek-R1, businesses will be able to control access to the model using role-based access control (RBAC), ensuring that only authorized personnel can interact with the model based on governance policies. This added layer of control will help businesses maintain the integrity of their AI-driven operations.

Get Started Today

DeepSeek-R1 is currently available through a private preview on Snowflake Cortex AI, and we encourage businesses to reach out to their Snowflake sales team to request access. By embracing this powerful AI model, businesses can begin their journey into the future of data-driven insights, problem-solving, and innovation.

Incorporating DeepSeek-R1 into your organization’s workflows is more than just implementing a new tool — it’s about unleashing the power of AI to drive real, transformative change. Let’s explore the future of AI together and discover how DeepSeek-R1 can revolutionize your business.