DataForGeeks

What It Really Takes to Run Snowflake’s Snowpipe in Production at Scale – A Comprehensive Guide

Nikhil Aggarwal — Wed, 28 May 2025 22:49:11 +0000

We adopted a practical Medallion-style approach to structure our data flows, segmenting them into Bronze, Silver, and Gold layers. As part of this redesign, we needed to optimize how curated data was exported to Snowflake. That’s when we hit performance issues with external tables. I know the common suggestion is to use the ...

The post What It Really Takes to Run Snowflake’s Snowpipe in Production at Scale – A Comprehensive Guide appeared first on DataForGeeks.

Apache Iceberg: The Data Lake Breakthrough That’s Reshaping the Big Data Landscape

Nikhil Aggarwal — Wed, 21 May 2025 09:47:02 +0000

By the end of this read, you’ll understand why Apache Iceberg is not just another open table format: it’s the seismic shift that’s redefining the role of Delta Lake in the modern data ecosystem and transforming how the world thinks about data lakes. From Data Lakes to Data Icebergs: A New Era Begins Over ...

The Medallion Masterstroke: How Databricks Rewired the Data World One Bronze Layer at a Time

Nikhil Aggarwal — Mon, 20 Jan 2025 20:32:41 +0000

The Era of Chaos – and Snowflake’s Rise: Back in 2017, most of us were drowning in messy data. Files were everywhere in S3 buckets, Hadoop jobs kept failing at the worst times, and analysts? They were always chasing clean data that never seemed to arrive when needed. It was frustrating, and honestly, it felt ...

Mastering Python Setup on macOS: Bye Conda, Hello pyenv + Fancy iTerm2 Terminal

Nikhil Aggarwal — Wed, 09 Oct 2024 12:54:00 +0000

Tired of messy Python setups? Ever screamed at your terminal? Been there, done that, deleted Anaconda. Let me show you how I set up a clean, beautiful, and powerful Python development environment on my Mac. It’s light, customizable, and perfect for devs who love a good-looking terminal and tight control over Python versions. 🤓 Why ...

Python Data Structures Simplified: List, Tuple, Dict, Set, Frozenset & More

Nikhil Aggarwal — Thu, 02 May 2024 11:23:31 +0000

Python offers a rich set of built-in and extended data structures to efficiently manage and process data. In this blog, we’ll dive deep into the essential ones: List, Tuple, Dictionary (Dict), Set, and Frozenset, and also explore some powerful structures from the collections and dataclasses modules. We’ll cover their properties, use-cases, constructors, and how to convert between them using intuitive examples. Note: Since Python 3.7, dictionaries ...
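
To make the "convert between them" idea concrete, here is a minimal sketch of such conversions. The helper name convert_examples and the Point dataclass are hypothetical names chosen for illustration, not taken from the post itself.

```python
from collections import Counter
from dataclasses import dataclass, asdict


def convert_examples(items):
    """Convert one list into several other containers (illustrative helper)."""
    as_tuple = tuple(items)            # immutable, keeps order and duplicates
    as_set = set(items)                # mutable, duplicates collapse
    as_frozenset = frozenset(items)    # immutable and hashable set
    as_dict = dict.fromkeys(items, 0)  # items become keys, duplicates collapse
    counts = Counter(items)            # collections: tally of each item
    return as_tuple, as_set, as_frozenset, as_dict, counts


@dataclass
class Point:
    """Hypothetical record type to show dataclass -> dict conversion."""
    x: int
    y: int


if __name__ == "__main__":
    print(convert_examples(["a", "b", "a"]))
    print(asdict(Point(1, 2)))  # dataclass instance converted to a plain dict
```

A frozenset can be used as a dictionary key or as a member of another set, which a plain set cannot, and that is usually the reason to reach for it.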

Understanding SQL Execution Order and Corresponding PySpark Syntax

Nikhil Aggarwal — Sat, 02 Sep 2023 16:33:01 +0000

When writing SQL queries, it is essential to understand the order in which SQL clauses are executed. This helps in writing optimized queries, especially when transitioning from SQL to PySpark. In this blog, we’ll walk through the SQL execution order and its clauses, and provide the corresponding PySpark syntax for each. SQL Execution Order and Corresponding ...
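
As a rough illustration of the logical execution order, the sketch below applies each clause as a separate step in plain Python, with the PySpark DataFrame method that plays the same role noted in the comments. The function name run_query, the column names region and amount, and the sample query are all assumptions made for this example. It runs without Spark, so it mirrors the ordering only, not Spark's actual planner.

```python
from collections import defaultdict


def run_query(rows, min_amount, min_total, limit):
    """Mimic, step by step, the logical order of:

    SELECT region, SUM(amount) AS total
    FROM rows
    WHERE amount >= :min_amount
    GROUP BY region
    HAVING SUM(amount) >= :min_total
    ORDER BY total DESC
    LIMIT :limit
    """
    # 1. FROM: pick the source rows (PySpark: start from a DataFrame)
    source = list(rows)
    # 2. WHERE: filter individual rows (PySpark: df.filter(...))
    filtered = [r for r in source if r["amount"] >= min_amount]
    # 3. GROUP BY + aggregation (PySpark: .groupBy("region").agg(...))
    totals = defaultdict(int)
    for r in filtered:
        totals[r["region"]] += r["amount"]
    # 4. HAVING: filter groups (PySpark: .filter(...) on the aggregated column)
    kept = {k: v for k, v in totals.items() if v >= min_total}
    # 5. SELECT: project the output columns (PySpark: .select(...))
    projected = [{"region": k, "total": v} for k, v in kept.items()]
    # 6. ORDER BY: sort the projected rows (PySpark: .orderBy(...))
    ordered = sorted(projected, key=lambda r: r["total"], reverse=True)
    # 7. LIMIT: keep the first n rows (PySpark: .limit(n))
    return ordered[:limit]
```

Because SELECT runs after WHERE in this order, an alias such as total does not exist yet when rows are being filtered, which is why standard SQL puts aggregate conditions in HAVING instead.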

Snowflake – Performance Tuning and Best Practices

Nikhil Aggarwal — Sat, 14 May 2022 20:39:55 +0000

Snowflake’s cloud-native architecture makes it incredibly easy to get started, but running it efficiently at scale is a whole different game. If you’ve ever faced slow queries, ballooning credit consumption, or unpredictable performance, you’re not alone. Tuning Snowflake workloads requires more than just adjusting warehouse sizes; it involves understanding how Snowflake stores data, ...

Apache Spark – Performance Tuning and Best Practices

Nikhil Aggarwal — Wed, 04 May 2022 12:33:18 +0000

Apache Spark has revolutionized the way we process large-scale data, delivering unparalleled speed, scalability, and flexibility. But as many engineers discover, achieving optimal performance in Spark is far from automatic. Your job runs, but takes longer than expected. The cluster scales, but the costs rise disproportionately. Memory errors appear out of nowhere. ...

Data Serialisation – Avro vs Protocol Buffers

Nikhil Aggarwal — Wed, 23 Mar 2022 20:43:39 +0000

Background: File Formats Evolution. Why not use CSV/XML/JSON? Repeated or no meta information. Files are not splittable, so they cannot be used in a map-reduce environment. Missing or limited schema definition and evolution support. JSON can leverage “JsonSchema” to maintain the schema separately, but it may still require transformation based on a schema, so why not consider Avro/Proto? ...

Count(*) – Explaining different behaviour in Joins

Nikhil Aggarwal — Fri, 04 Feb 2022 13:27:00 +0000

Observations: Count(1) or Count(*): this is never expanded over each column individually, so it works perfectly fine on the complete data. Count(1) is more optimized than Count(*). Count(source.*), where source is the left table of a left outer join: this is evaluated as Count(source.col1, source.col2, …, source.colN), so if any column has NULL, then the complete row ...
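
The NULL-skipping behaviour behind these observations can be reproduced on a small scale. The sketch below swaps in SQLite through Python's standard sqlite3 module, which is not the engine the post discusses and does not accept Count(source.*). The last expression therefore only emulates the "all columns non-NULL" rule with a CASE expression, and that emulation is an assumption for illustration. The tables source and target, their columns, and the function count_variants are all made up for this example.

```python
import sqlite3


def count_variants():
    """Return (COUNT(*), COUNT(1), COUNT(t.id), emulated COUNT(source.*))."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE source (id INTEGER, note TEXT);
        CREATE TABLE target (id INTEGER, val TEXT);
        INSERT INTO source VALUES (1, 'a'), (2, NULL), (3, 'c');
        INSERT INTO target VALUES (1, 'x');
        """
    )
    row = con.execute(
        """
        SELECT COUNT(*),     -- counts every joined row
               COUNT(1),     -- same result: the constant 1 is never NULL
               COUNT(t.id),  -- skips rows where the right side did not match
               -- emulation (assumption): count rows where no source column is NULL
               COUNT(CASE WHEN s.id IS NOT NULL AND s.note IS NOT NULL
                          THEN 1 END)
        FROM source s
        LEFT JOIN target t ON s.id = t.id
        """
    ).fetchone()
    con.close()
    return row


if __name__ == "__main__":
    print(count_variants())
```

With three source rows, one matching target row, and one NULL in source.note, the four counts differ even though the join itself returns three rows.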
