<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>aviyehuda.com</title>
	<atom:link href="https://aviyehuda2.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://aviyehuda2.wordpress.com</link>
	<description>Software, Data, Tech ...</description>
	<lastBuildDate>Mon, 05 Jan 2026 16:59:57 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<site xmlns="com-wordpress:feed-additions:1">237945424</site><cloud domain='aviyehuda2.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>https://s0.wp.com/i/buttonw-com.png</url>
		<title>aviyehuda.com</title>
		<link>https://aviyehuda2.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="https://aviyehuda2.wordpress.com/osd.xml" title="aviyehuda.com" />
	<atom:link rel='hub' href='https://aviyehuda2.wordpress.com/?pushpress=hub'/>
	<item>
		<title>Merge + liquid clustering &#8211; common issues</title>
		<link>https://aviyehuda2.wordpress.com/2026/01/03/merge-liquid-clustering-common-issues/</link>
					<comments>https://aviyehuda2.wordpress.com/2026/01/03/merge-liquid-clustering-common-issues/#respond</comments>
		
		<dc:creator><![CDATA[avi yehuda]]></dc:creator>
		<pubDate>Sat, 03 Jan 2026 09:34:45 +0000</pubDate>
				<category><![CDATA[Data Lake]]></category>
		<category><![CDATA[Spark]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[Deltalake]]></category>
		<category><![CDATA[Liquid Clustering]]></category>
		<guid isPermaLink="false">http://aviyehuda2.wordpress.com/?p=5556</guid>

					<description><![CDATA[As a Spark support engineer, I still encounter many cases where MERGE or JOIN operations on Delta tables do not perform as expected even when liquid clustering is used. While liquid clustering is a significant improvement over traditional partitioning and offers many advantages, people still sometimes struggle. There is often an assumption that enabling liquid [&#8230;]]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img width="1024" height="456" data-attachment-id="5579" data-permalink="https://aviyehuda2.wordpress.com/2026/01/03/merge-liquid-clustering-common-issues/image-11/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-6.png" data-orig-size="1104,492" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-6.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-6.png?w=1024" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-6.png?w=1024" alt="" class="wp-image-5579" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-6.png?w=1024 1024w, https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-6.png?w=150 150w, https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-6.png?w=300 300w, https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-6.png?w=768 768w, https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-6.png 1104w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>As a Spark support engineer, I still encounter many cases where MERGE or JOIN operations on Delta tables do not perform as expected even when liquid clustering is used. While liquid clustering is a significant improvement over traditional partitioning and offers many advantages, people still sometimes struggle. There is often an assumption that enabling liquid clustering will automatically result in efficient merges, but in practice this is not always true, usually because of a misunderstanding of how liquid clustering works.</p>



<p>Here are the most common issues when executing MERGE on a liquid clustered table.</p>



<p><strong>1. Clustering by a key, merging by another</strong></p>



<p>This is one of the most straightforward and common problems. If a table is clustered by one key but the MERGE condition uses a different key, liquid clustering cannot help.</p>



<p>When the merge key is not part of the clustering keys, Spark is generally unable to prune files when scanning the target table. Each incoming key must be searched across the entire dataset. As a result, Spark often performs a full scan and falls back to a Sort Merge Join or, when the dataset is small enough, a Broadcast Nested Loop Join, both of which can be expensive at scale.</p>
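<p>As a hedged illustration (table and column names here are hypothetical), the mismatch and its fix can be written as Spark SQL, built below as Python strings:</p>

```python
# Hypothetical example: the table is clustered by event_date,
# but the MERGE matches on user_id, so file pruning cannot kick in.
create_sql = (
    "CREATE TABLE events (user_id STRING, event_date DATE, amount DOUBLE) "
    "USING DELTA CLUSTER BY (event_date)"
)

# Merge key != clustering key: Spark must scan every target file.
slow_merge = (
    "MERGE INTO events t USING updates s "
    "ON t.user_id = s.user_id "
    "WHEN MATCHED THEN UPDATE SET * "
    "WHEN NOT MATCHED THEN INSERT *"
)

# Aligning the clustering key with the merge key restores pruning:
fix_sql = "ALTER TABLE events CLUSTER BY (user_id)"
# spark.sql(create_sql)  # requires a running Spark session with Delta
```

<p>Note that changing the clustering keys affects only newly written data; existing files keep their old layout until the table is optimized again.</p>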



<p><strong>2. Clustering by multiple keys, merging by only one</strong></p>

<p>In some cases, users do merge on a clustering key but still observe little or no pruning. A frequent cause is clustering on multiple columns.</p>



<p>Liquid clustering relies on indexing, but adding more clustering keys dilutes the effectiveness of the index for each individual column. Instead of having a strong index on a single dimension, the data layout is optimized across combinations of keys. When a merge filters on only one of those keys, pruning becomes much less effective.</p>
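<p>A minimal sketch of the trade-off (names are illustrative): the more columns in CLUSTER BY, the weaker the layout is for any single one of them:</p>

```python
# Hypothetical sketch: clustering on two keys interleaves their ranges,
# so a merge that filters only on event_date prunes far fewer files.
two_keys = (
    "CREATE TABLE events (tenant_id STRING, event_date DATE, v DOUBLE) "
    "USING DELTA CLUSTER BY (tenant_id, event_date)"
)

# If merges always filter on event_date alone, clustering by that single
# column gives each file a tight, mostly non-overlapping date range:
one_key = "ALTER TABLE events CLUSTER BY (event_date)"
```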



<p></p>



<figure class="wp-block-image size-large"><img width="579" height="352" data-attachment-id="5568" data-permalink="https://aviyehuda2.wordpress.com/2026/01/03/merge-liquid-clustering-common-issues/image-9/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-4.png" data-orig-size="579,352" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-4.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-4.png?w=579" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-4.png?w=579" alt="" class="wp-image-5568" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-4.png 579w, https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-4.png?w=150 150w, https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-4.png?w=300 300w" sizes="(max-width: 579px) 100vw, 579px" /></figure>



<p class="has-small-font-size" style="border-style:none;border-width:0px">Clustering by a single index</p>



<p></p>



<figure class="wp-block-image size-large"><img width="561" height="388" data-attachment-id="5559" data-permalink="https://aviyehuda2.wordpress.com/2026/01/03/merge-liquid-clustering-common-issues/image-6/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-2.png" data-orig-size="561,388" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-2.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-2.png?w=561" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-2.png?w=561" alt="" class="wp-image-5559" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-2.png 561w, https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-2.png?w=150 150w, https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-2.png?w=300 300w" sizes="(max-width: 561px) 100vw, 561px" /></figure>



<p class="has-small-font-size">Clustering by multiple indexes</p>






<p><strong>3. Clustering and merging on the same key, but the key has no natural order</strong></p>



<p>Liquid clustering is well suited for high-cardinality dimensions, but the <em>type</em> of key still matters. If the clustering key does not have a logical or natural order, it may not behave well as a merge key.</p>



<p>Date or timestamp columns are usually excellent candidates. For example, merging data from only the previous month into a table clustered by date allows Spark to prune most of the files efficiently.</p>



<figure class="wp-block-image size-large"><img loading="lazy" width="579" height="352" data-attachment-id="5569" data-permalink="https://aviyehuda2.wordpress.com/2026/01/03/merge-liquid-clustering-common-issues/image-10/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-5.png" data-orig-size="579,352" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-5.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-5.png?w=579" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-5.png?w=579" alt="" class="wp-image-5569" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-5.png 579w, https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-5.png?w=150 150w, https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-5.png?w=300 300w" sizes="(max-width: 579px) 100vw, 579px" /></figure>



<p>By contrast, consider a user ID represented as a UUID. Even if the table is clustered by this column, pruning will likely be minimal. Liquid clustering indexes data using value ranges. Dates have a natural ordering, so filtering by a range maps cleanly to those indexes. UUIDs, however, are effectively random, causing most ranges to overlap and forcing Spark to scan a large portion of the table.</p>
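<p>The effect can be sketched with a toy model of the min/max file statistics that liquid clustering relies on (a deliberate simplification, not Delta&#8217;s actual index format):</p>

```python
import datetime as dt
import random
import uuid

# Toy model: each data file keeps min/max values of the clustering key;
# a file can be skipped when the incoming key range does not overlap it.
def files_to_scan(file_ranges, lo, hi):
    return [f for f in file_ranges if not (f[1] < lo or f[0] > hi)]

# Date-clustered table: files hold contiguous, non-overlapping date ranges.
base = dt.date(2025, 1, 1)
date_files = [(base + dt.timedelta(days=10 * i),
               base + dt.timedelta(days=10 * i + 9)) for i in range(36)]
recent = files_to_scan(date_files, dt.date(2025, 12, 1), dt.date(2025, 12, 31))
print(len(recent), "of", len(date_files))  # 3 of 36: most files are pruned

# UUID-clustered table: random keys make every file's [min, max] span
# almost the whole key space, so hardly anything can be skipped.
random.seed(0)
uuid_files = []
for _ in range(36):
    vals = sorted(str(uuid.UUID(int=random.getrandbits(128))) for _ in range(100))
    uuid_files.append((vals[0], vals[-1]))
probe = str(uuid.UUID(int=random.getrandbits(128)))
hit = files_to_scan(uuid_files, probe, probe)
print(len(hit), "of", len(uuid_files))  # nearly all 36 files must be scanned
```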



<figure class="wp-block-image size-large"><img loading="lazy" width="579" height="352" data-attachment-id="5560" data-permalink="https://aviyehuda2.wordpress.com/2026/01/03/merge-liquid-clustering-common-issues/image-7/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-2-1.png" data-orig-size="579,352" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-2-1.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-2-1.png?w=579" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-2-1.png?w=579" alt="" class="wp-image-5560" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-2-1.png 579w, https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-2-1.png?w=150 150w, https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-2-1.png?w=300 300w" sizes="(max-width: 579px) 100vw, 579px" /></figure>



<p><strong>4. Clustering and merging on an ordered key, but with a very wide merge range</strong></p>



<p>There are cases where users correctly cluster and merge on an ordered key, such as a creation date, yet the merge still takes a long time. In these scenarios, the issue often lies in the incoming dataframe rather than the target table.</p>



<p>Liquid clustering uses min–max ranges to index data files. If the incoming dataset contains values spanning a very large range, for example, dates from three years ago through today, those ranges overlap many indexed files. This overlap significantly reduces pruning and results in large scans despite correct clustering.</p>
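<p>A common mitigation is to bound the incoming batch and repeat the bound in the merge condition, so its key range overlaps only a few target files; if a backfill really spans years, it can be split into several such narrow merges. A hedged sketch with illustrative names:</p>

```python
# Hypothetical sketch: restrict the incoming range and repeat the bound
# in the ON clause so the optimizer can prune target files.
lo, hi = "2025-12-01", "2025-12-31"
narrow_merge = (
    "MERGE INTO orders t "
    f"USING (SELECT * FROM staging_orders WHERE order_date BETWEEN '{lo}' AND '{hi}') s "
    f"ON t.order_id = s.order_id AND t.order_date BETWEEN '{lo}' AND '{hi}' "
    "WHEN MATCHED THEN UPDATE SET * "
    "WHEN NOT MATCHED THEN INSERT *"
)
# spark.sql(narrow_merge)  # requires a Spark session with these tables
```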



<figure class="wp-block-image size-large"><img loading="lazy" width="579" height="352" data-attachment-id="5557" data-permalink="https://aviyehuda2.wordpress.com/2026/01/03/merge-liquid-clustering-common-issues/image-4/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image.png" data-orig-size="579,352" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image.png?w=579" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image.png?w=579" alt="" class="wp-image-5557" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image.png 579w, https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image.png?w=150 150w, https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image.png?w=300 300w" sizes="(max-width: 579px) 100vw, 579px" /></figure>



<p><strong>5. Complex or non-pushdown-friendly filtering conditions</strong></p>



<p>One of the common ways to force pruning is to use a hard-coded filter inside the merge condition or right before doing the merge. That is indeed a very good way to help Spark understand which files can be skipped; however, it doesn’t always work. Even when it appears that Spark <em>should</em> be able to skip files, it sometimes cannot. A common reason is overly complex filtering logic.</p>



<p>Filters involving functions, casts, nested expressions, or non-deterministic logic can prevent predicate pushdown and make it difficult for Spark to reason about value ranges. When Spark cannot translate filters into simple range predicates, file skipping becomes ineffective.</p>
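<p>A small, hedged illustration of the difference (column names hypothetical): the intent of both queries is identical, but only the second exposes a plain range on the raw clustering column that min/max file statistics can answer:</p>

```python
# Wrapping the clustering column in functions hides its value range
# from file-level min/max statistics:
not_pushdown_friendly = (
    "SELECT * FROM events "
    "WHERE year(event_date) = 2025 AND month(event_date) = 12"
)

# The same filter as a plain range predicate on the raw column:
pushdown_friendly = (
    "SELECT * FROM events "
    "WHERE event_date BETWEEN '2025-12-01' AND '2025-12-31'"
)
```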



<p>Since this post is already quite long, I’ll write a separate one with examples of such complex filtering conditions.</p>



<p>If you do use complex conditions, though, one thing that can help is enabling Photon on the cluster. Photon includes performance improvements that apply in such cases.</p>



<p><strong>Recommendations</strong></p>



<p>The most important advice: when designing a merge on a liquid clustered table, keep in mind the two sides of the merge &#8211; the target table and the incoming dataframe. Make sure that the range of key values in the incoming dataframe maps to only a limited number of files that Spark has to read from the target table.</p>



<p>Here are some practical guidelines you can follow:</p>



<ul class="wp-block-list">
<li>Use a merge key that is also a clustering key.</li>



<li>Minimize the number of clustering keys.</li>



<li>Prefer clustering keys with a logical order (such as dates or timestamps).</li>



<li>Keep the value range of the merge key in the incoming dataframe as narrow as possible.</li>



<li>Apply filters early, ideally before the merge or directly in the merge condition. Keep filtering conditions as simple as possible.</li>



<li>Regularly optimize the table to maintain an efficient layout.</li>
</ul>
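<p>For the last bullet, a quick maintenance sketch (table name illustrative): verify which columns the table is clustered by, then re-optimize the layout:</p>

```python
# Hypothetical maintenance sketch: DESCRIBE DETAIL reports the current
# clusteringColumns of a Delta table; OPTIMIZE re-clusters its data files.
check_sql = "DESCRIBE DETAIL sales"
optimize_sql = "OPTIMIZE sales"
# spark.sql(check_sql).select("clusteringColumns").show()  # requires Spark
# spark.sql(optimize_sql)
```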



<p>Liquid clustering is a powerful optimization, but its effectiveness depends heavily on how well the clustering strategy aligns with merge patterns and data characteristics. Understanding these trade-offs is key to avoiding unexpected performance bottlenecks.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://aviyehuda2.wordpress.com/2026/01/03/merge-liquid-clustering-common-issues/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5556</post-id>
		<media:content url="https://2.gravatar.com/avatar/58277fd5adf41424400f6241b432f5cbe201500ba80bc32cd8a25cf94285f3dc?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">aviyehuda555</media:title>
		</media:content>

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-6.png?w=1024" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-4.png?w=579" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-2.png?w=561" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-5.png?w=579" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image-2-1.png?w=579" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2026/01/image.png?w=579" medium="image" />
	</item>
		<item>
		<title>Let Users Ask Questions to Your Data Lake Using AI and Spark</title>
		<link>https://aviyehuda2.wordpress.com/2025/07/06/let-users-ask-questions-to-your-data-lake-using-ai-and-spark/</link>
					<comments>https://aviyehuda2.wordpress.com/2025/07/06/let-users-ask-questions-to-your-data-lake-using-ai-and-spark/#respond</comments>
		
		<dc:creator><![CDATA[avi yehuda]]></dc:creator>
		<pubDate>Sun, 06 Jul 2025 15:04:12 +0000</pubDate>
				<category><![CDATA[ai]]></category>
		<category><![CDATA[Data Lake]]></category>
		<category><![CDATA[Spark]]></category>
		<category><![CDATA[DataLake]]></category>
		<category><![CDATA[Data_engineering]]></category>
		<category><![CDATA[llm]]></category>
		<guid isPermaLink="false">http://aviyehuda.com/?p=5036</guid>

					<description><![CDATA[In my past roles as a data infrastructure engineer, building systems that enabled analysts to query the data warehouse in a simple, efficient, and cost-effective way was always a challenge. With the rise of AI, this task has become much easier. However, letting AI query datasets can be costly, and there’s always the risk of [&#8230;]]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img loading="lazy" width="1024" height="447" data-attachment-id="5109" data-permalink="https://aviyehuda2.wordpress.com/2025/07/06/let-users-ask-questions-to-your-data-lake-using-ai-and-spark/chatgpt-image-jul-6-2025-11_48_54-am-3/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/chatgpt-image-jul-6-2025-11_48_54-am-2.png" data-orig-size="1530,669" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="ChatGPT Image Jul 6, 2025, 11_48_54 AM" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/chatgpt-image-jul-6-2025-11_48_54-am-2.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/chatgpt-image-jul-6-2025-11_48_54-am-2.png?w=1024" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/chatgpt-image-jul-6-2025-11_48_54-am-2.png?w=1024" alt="" class="wp-image-5109" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/chatgpt-image-jul-6-2025-11_48_54-am-2.png?w=1024 1024w, https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/chatgpt-image-jul-6-2025-11_48_54-am-2.png?w=150 150w, https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/chatgpt-image-jul-6-2025-11_48_54-am-2.png?w=300 300w, https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/chatgpt-image-jul-6-2025-11_48_54-am-2.png?w=768 768w, https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/chatgpt-image-jul-6-2025-11_48_54-am-2.png?w=1440 1440w, 
https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/chatgpt-image-jul-6-2025-11_48_54-am-2.png 1530w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>In my past roles as a data infrastructure engineer, building systems that enabled analysts to query the data warehouse in a simple, efficient, and cost-effective way was always a challenge. With the rise of AI, this task has become much easier. However, letting AI query datasets can be costly, and there’s always the risk of hallucinations. More importantly, exposing sensitive data to large language models is often not acceptable.</p>



<p>By incorporating Spark into the workflow, costs are significantly reduced, hallucinations become far less likely, and most importantly, there&#8217;s no need to send confidential data to an external LLM.</p>



<h2 class="wp-block-heading">Querying a specific dataset with Spark 4</h2>



<p>Recently, Spark introduced a great new feature that allows you to ask questions in natural language directly on a DataFrame.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
# assumes the pyspark-ai package and a LangChain chat model instance
from pyspark_ai import SparkAI

spark_ai = SparkAI(llm=chatOpenAI)  # chatOpenAI: a langchain ChatOpenAI instance
spark_ai.activate()                 # adds the .ai namespace to DataFrames
res_df = df.ai.transform("what product id has the highest revenue")
res_df.show()
</pre></div>


<p>This is a neat feature. However, the limitation is that it only works on a specific DataFrame. It would be far more useful if users could ask questions across the entire data catalog, allowing the AI to determine what data to query and how.</p>



<h2 class="wp-block-heading">Querying the entire data catalog with LLM + Spark</h2>



<p>While this can be achieved using an LLM alone, feeding the entire dataset to the model would be very expensive—and, as mentioned, potentially undesirable from a security perspective. A better approach is to describe the data to the LLM and ask it to generate a Spark SQL query, which I then execute with Spark. This way, although we might spend a bit more on compute, we save significantly on LLM token costs. It also greatly reduces the chances of mistakes.</p>



<p>To do this, I will use <a href="https://python.langchain.com/docs/integrations/chat/openai/">langchain&#8217;s ChatOpenAI model</a>. I used <a href="https://cursor.com/">Cursor IDE</a> to generate the code for me.</p>



<p>The gist of the instructions is:</p>



<ol class="wp-block-list">
<li>Describe the datasets to the LLM.</li>



<li>Ask the LLM to generate a Spark SQL query based on the user&#8217;s question.</li>



<li>Trigger the query using spark.sql().</li>
</ol>
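<p>The three steps above can be sketched as follows; the dataset names and columns are illustrative, and <code>llm.invoke</code> stands in for a LangChain ChatOpenAI call:</p>

```python
# Minimal sketch: describe the datasets (schema only, no rows),
# ask the LLM for Spark SQL, then execute the query locally with Spark.
datasets = {
    "product_sales": ("Daily sales per product", ["product_id", "day", "revenue"]),
    "product_reviews": ("User reviews", ["product_id", "rating", "text"]),
}

def build_prompt(question):
    lines = ["You can query these Spark SQL tables:"]
    for name, (desc, cols) in datasets.items():
        lines.append(f"- {name}: {desc}; columns: {', '.join(cols)}")
    lines.append("Answer ONLY with a Spark SQL query for this question:")
    lines.append(question)
    return "\n".join(lines)

prompt = build_prompt("what product id has the highest revenue")
# query = llm.invoke(prompt).content  # langchain ChatOpenAI (not run here)
# spark.sql(query).show()             # runs on the cluster; data never leaves it
```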



<p>Here is the full instruction prompt I gave Cursor:</p>



<p style="height: 230px;overflow: auto;padding: 10px;background-color:black;color:#AAA;border: 3px #AAA solid;font-size: 13px">&gt;&gt;&gt;create a new python file.<br><br>Create a program in spark that holds a map of dataframes.<br>The map has a key, which is the name of the dataframe ,
<br>the value is the dataframe, but also a description of the dataframe.
<br><br>Create 3 dataframes of data &#8211; produce description, product reviews and product sales.<br><br>The program get input from the program user, the user is asking a question in human text ,<br>the question related to the data in the dataframes.<br><br>The question goes through 2 phases.<br>First we send it to langchain ChatOpenAI.<br>In the prompt we ask explain to ChatOpenAI about our dataset,<br>but we don&#8217;t send the actual data, so not to lose money, we only send the name,<br>the description and the columns of the datasets.<br>We also ask ChatOpenAI to give us the result in a structure manner.<br>We will ask ChatOpenAI to give us the correct query in spark sql syntax to query the datasets.<br><br>The response can either be a follow up question or a query in spark sql syntax.<br><br>If the response is a follow up question than we will ask the user the followup question to the user and do the whole process again,<br>if the response is a query, than we will query our dataframes the query using spark.sql(query) and present the result to the user.<br><br>If the result has only one record and one column, we will display the value of that one cell.<br>Otherwise we will display it as a table to the user.<br><br>if there is more than 1 rows in the result,<br>after showing the result, ask the user whether they want to see a graph , if they do, than do df.plot() to display the graph</p>



<p><a href="https://github.com/aviyehuda/datalake_query/blob/main/datalake_ai_query.py"><img loading="lazy" width="100" height="100" data-attachment-id="5117" data-permalink="https://aviyehuda2.wordpress.com/2025/07/06/let-users-ask-questions-to-your-data-lake-using-ai-and-spark/images/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/images.jpeg" data-orig-size="225,225" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="images" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/images.jpeg?w=225" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/images.jpeg?w=225" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/images.jpeg?w=100" alt="" class="wp-image-5117" style="width:34px;height:auto" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/images.jpeg?w=100 100w, https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/images.jpeg?w=200 200w, https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/images.jpeg?w=150 150w" sizes="(max-width: 100px) 100vw, 100px" /> <font size="+1">Here is the generated code</font></a></p>



<p>And this is what I got back from Cursor: after asking a question about my datasets, the system understood what needed to be queried and even generated a graph.</p>



<div data-carousel-extra='{&quot;blog_id&quot;:237945424,&quot;permalink&quot;:&quot;https://aviyehuda2.wordpress.com/2025/07/06/let-users-ask-questions-to-your-data-lake-using-ai-and-spark/&quot;}'  class="wp-block-jetpack-tiled-gallery aligncenter is-style-rectangular"><div class=""><div class="tiled-gallery__gallery"><div class="tiled-gallery__row"><div class="tiled-gallery__col" style="flex-basis:100.00000%"><figure class="tiled-gallery__item"><img data-attachment-id="5136" data-permalink="https://aviyehuda2.wordpress.com/2025/07/06/let-users-ask-questions-to-your-data-lake-using-ai-and-spark/screenshot-2025-07-06-at-16-42-33/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.33.png" data-orig-size="1928,662" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot 2025-07-06 at 16.42.33" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.33.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.33.png?w=1024" data-attachment-id="5136" data-permalink="https://aviyehuda2.wordpress.com/2025/07/06/let-users-ask-questions-to-your-data-lake-using-ai-and-spark/screenshot-2025-07-06-at-16-42-33/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.33.png" data-orig-size="1928,662" data-comments-opened="1" 
data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot 2025-07-06 at 16.42.33" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.33.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.33.png?w=1024" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.33.png?strip=info&#038;w=600 600w,https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.33.png?strip=info&#038;w=900 900w,https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.33.png?strip=info&#038;w=1200 1200w,https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.33.png?strip=info&#038;w=1500 1500w,https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.33.png?strip=info&#038;w=1800 1800w,https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.33.png?strip=info&#038;w=1928 1928w" alt="" data-height="662" data-id="5136" data-link="https://aviyehuda2.wordpress.com/?attachment_id=5136" data-url="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.33.png?w=1024" data-width="1928" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.33.png" /></figure></div></div><div class="tiled-gallery__row"><div class="tiled-gallery__col" style="flex-basis:100.00000%"><figure 
class="tiled-gallery__item"><img data-attachment-id="5137" data-permalink="https://aviyehuda2.wordpress.com/2025/07/06/let-users-ask-questions-to-your-data-lake-using-ai-and-spark/screenshot-2025-07-06-at-16-42-52/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.52.png" data-orig-size="1794,866" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot 2025-07-06 at 16.42.52" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.52.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.52.png?w=1024" data-attachment-id="5137" data-permalink="https://aviyehuda2.wordpress.com/2025/07/06/let-users-ask-questions-to-your-data-lake-using-ai-and-spark/screenshot-2025-07-06-at-16-42-52/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.52.png" data-orig-size="1794,866" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot 2025-07-06 at 16.42.52" data-image-description="" data-image-caption="" 
data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.52.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.52.png?w=1024" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.52.png?strip=info&#038;w=600 600w,https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.52.png?strip=info&#038;w=900 900w,https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.52.png?strip=info&#038;w=1200 1200w,https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.52.png?strip=info&#038;w=1500 1500w,https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.52.png?strip=info&#038;w=1794 1794w" alt="" data-height="866" data-id="5137" data-link="https://aviyehuda2.wordpress.com/?attachment_id=5137" data-url="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.52.png?w=1024" data-width="1794" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.52.png" /></figure></div></div><div class="tiled-gallery__row"><div class="tiled-gallery__col" style="flex-basis:100.00000%"><figure class="tiled-gallery__item"><img data-attachment-id="5141" data-permalink="https://aviyehuda2.wordpress.com/2025/07/06/let-users-ask-questions-to-your-data-lake-using-ai-and-spark/screenshot-2025-07-06-at-16-45-51/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.45.51.png" data-orig-size="1542,668" data-comments-opened="1" 
data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot 2025-07-06 at 16.45.51" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.45.51.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.45.51.png?w=1024" data-attachment-id="5141" data-permalink="https://aviyehuda2.wordpress.com/2025/07/06/let-users-ask-questions-to-your-data-lake-using-ai-and-spark/screenshot-2025-07-06-at-16-45-51/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.45.51.png" data-orig-size="1542,668" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot 2025-07-06 at 16.45.51" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.45.51.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.45.51.png?w=1024" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.45.51.png?strip=info&#038;w=600 
600w,https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.45.51.png?strip=info&#038;w=900 900w,https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.45.51.png?strip=info&#038;w=1200 1200w,https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.45.51.png?strip=info&#038;w=1500 1500w,https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.45.51.png?strip=info&#038;w=1542 1542w" alt="" data-height="668" data-id="5141" data-link="https://aviyehuda2.wordpress.com/?attachment_id=5141" data-url="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.45.51.png?w=1024" data-width="1542" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.45.51.png" /></figure></div></div></div></div></div>



<div style="height:125px" aria-hidden="true" class="wp-block-spacer"></div>



<figure class="wp-block-image size-large"><img loading="lazy" width="800" height="450" data-attachment-id="5132" data-permalink="https://aviyehuda2.wordpress.com/2025/07/06/let-users-ask-questions-to-your-data-lake-using-ai-and-spark/attachment/5/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/5.gif" data-orig-size="800,450" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="5" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/5.gif?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/5.gif?w=800" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/5.gif?w=800" alt="" class="wp-image-5132" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/5.gif 800w, https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/5.gif?w=150 150w, https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/5.gif?w=300 300w, https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/5.gif?w=768 768w" sizes="(max-width: 800px) 100vw, 800px" /></figure>



<p>Here the program asks a follow-up question:</p>



<figure class="wp-block-image size-large"><img loading="lazy" width="800" height="450" data-attachment-id="5134" data-permalink="https://aviyehuda2.wordpress.com/2025/07/06/let-users-ask-questions-to-your-data-lake-using-ai-and-spark/attachment/6/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/6.gif" data-orig-size="800,450" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="6" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/6.gif?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/6.gif?w=800" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/6.gif?w=800" alt="" class="wp-image-5134" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/6.gif 800w, https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/6.gif?w=150 150w, https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/6.gif?w=300 300w, https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/6.gif?w=768 768w" sizes="(max-width: 800px) 100vw, 800px" /></figure>



<p><strong>Notes:</strong></p>



<ul class="wp-block-list">
<li>This approach requires you to provide clear and accurate descriptions for every dataset in your catalog. The more detailed the descriptions, the lower the risk of errors.</li>



<li>Additionally, to improve the quality of the results, you can include a few sample (dummy) records for each dataset. This gives the LLM a better understanding of the data’s structure and semantics.</li>
</ul>
]]></content:encoded>
					
					<wfw:commentRss>https://aviyehuda2.wordpress.com/2025/07/06/let-users-ask-questions-to-your-data-lake-using-ai-and-spark/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5036</post-id>
		<media:content url="https://2.gravatar.com/avatar/58277fd5adf41424400f6241b432f5cbe201500ba80bc32cd8a25cf94285f3dc?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">aviyehuda555</media:title>
		</media:content>

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/chatgpt-image-jul-6-2025-11_48_54-am-2.png?w=1024" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/images.jpeg?w=100" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.33.png" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.42.52.png" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/screenshot-2025-07-06-at-16.45.51.png" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/5.gif?w=800" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/07/6.gif?w=800" medium="image" />
	</item>
		<item>
		<title>Vector Search in Databricks</title>
		<link>https://aviyehuda2.wordpress.com/2025/05/25/vector-search-in-databricks/</link>
					<comments>https://aviyehuda2.wordpress.com/2025/05/25/vector-search-in-databricks/#respond</comments>
		
		<dc:creator><![CDATA[avi yehuda]]></dc:creator>
		<pubDate>Sun, 25 May 2025 07:16:47 +0000</pubDate>
				<category><![CDATA[ai]]></category>
		<category><![CDATA[Data Lake]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[DataLake]]></category>
		<category><![CDATA[Data_engineering]]></category>
		<category><![CDATA[vector_search]]></category>
		<guid isPermaLink="false">http://aviyehuda.com/?p=4815</guid>

					<description><![CDATA[A short presentation I gave to my team in Databricks]]></description>
										<content:encoded><![CDATA[
<p>A short presentation I gave to my team in Databricks</p>



<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vSFHgKlZ_W5ePAO8mR8sSpalCuTKZAC5tDTQpj95E8WwAgNrRE0oe3emz6Ld5Yzd4o7K2QH4gwlXlZw/pubembed?start=false&#038;loop=false&#038;delayms=3000" frameborder="0" width="860" height="569" marginheight="0" marginwidth="0" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>



<p></p>
]]></content:encoded>
					
					<wfw:commentRss>https://aviyehuda2.wordpress.com/2025/05/25/vector-search-in-databricks/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4815</post-id>
		<media:content url="https://2.gravatar.com/avatar/58277fd5adf41424400f6241b432f5cbe201500ba80bc32cd8a25cf94285f3dc?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">aviyehuda555</media:title>
		</media:content>
	</item>
		<item>
		<title>Databricks: Using DECLARE VARIABLE to overcome a file pruning issue in the SQL editor</title>
		<link>https://aviyehuda2.wordpress.com/2025/01/18/databricks-using-declare-variable-to-overcome-a-file-pruning-issue-in-the-sql-editor/</link>
					<comments>https://aviyehuda2.wordpress.com/2025/01/18/databricks-using-declare-variable-to-overcome-a-file-pruning-issue-in-the-sql-editor/#respond</comments>
		
		<dc:creator><![CDATA[avi yehuda]]></dc:creator>
		<pubDate>Sat, 18 Jan 2025 06:14:49 +0000</pubDate>
				<category><![CDATA[Spark]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[SparkSQL]]></category>
		<guid isPermaLink="false">http://aviyehuda.com/?p=3954</guid>

					<description><![CDATA[File pruning is an optimization process in Spark that skips unneeded files from being read during query execution, based on the query&#8217;s filter condition. It is a critical performance optimization in distributed data processing systems like Spark, especially when working with large datasets stored in partitioned file formats such as Parquet, ORC, or Delta Lake. [&#8230;]]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-full"><img data-attachment-id="3961" data-permalink="https://aviyehuda2.wordpress.com/code-lego-2/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/code-lego-edited.png" data-orig-size="1024,576" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="code-lego" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/code-lego-edited.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/code-lego-edited.png?w=1024" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/code-lego-edited.png" alt="" class="wp-image-3961" /></figure>



<p>File pruning is an optimization that skips reading unneeded files during query execution, based on the query&#8217;s filter condition. It is critical in distributed data processing systems like Spark, especially when working with large datasets stored in partitioned file formats such as Parquet, ORC, or Delta Lake.</p>



<p>In this query:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SELECT * from my_table where my_key=123
</pre></div>


<p></p>



<figure class="wp-block-image size-large"><img loading="lazy" width="920" height="554" data-attachment-id="3968" data-permalink="https://aviyehuda2.wordpress.com/2025/01/18/databricks-using-declare-variable-to-overcome-a-file-pruning-issue-in-the-sql-editor/screenshot-2025-01-18-at-07-27-31-2/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/screenshot-2025-01-18-at-07.27.31-1.png" data-orig-size="920,554" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot 2025-01-18 at 07.27.31" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/screenshot-2025-01-18-at-07.27.31-1.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/screenshot-2025-01-18-at-07.27.31-1.png?w=920" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/screenshot-2025-01-18-at-07.27.31-1.png?w=920" alt="" class="wp-image-3968" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/screenshot-2025-01-18-at-07.27.31-1.png 920w, https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/screenshot-2025-01-18-at-07.27.31-1.png?w=150 150w, https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/screenshot-2025-01-18-at-07.27.31-1.png?w=300 300w, https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/screenshot-2025-01-18-at-07.27.31-1.png?w=768 768w" sizes="(max-width: 920px) 100vw, 920px" /></figure>



<p>Spark can skip reading files that do not contain the value 123 for my_key. This optimization reduces disk I/O and improves query performance.</p>
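

<p>As a simplified sketch of the idea (not Spark&#8217;s actual implementation), file pruning boils down to comparing the filter value against per-file min/max statistics. The file names and statistics below are hypothetical:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; gutter: false; title: ; notranslate">
# Hypothetical per-file min/max statistics for my_key
file_stats = {
    "part-0001.parquet": {"min": 1, "max": 100},
    "part-0002.parquet": {"min": 101, "max": 200},
    "part-0003.parquet": {"min": 201, "max": 300},
}

def files_to_read(stats, value):
    # keep only files whose min/max range may contain the filter value
    return [f for f, s in stats.items() if s["min"] <= value <= s["max"]]

print(files_to_read(file_stats, 123))  # only part-0002.parquet is read
</pre></div>
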



<p>File pruning is automatically enabled whether the user writes Spark code programmatically or uses SQL syntax in a notebook or the SQL editor.</p>



<p>However, when a user employs a user-defined function (UDF) as a filter in a query, it may interfere with file pruning. For example:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
CREATE FUNCTION get_key_value_udf()

RETURNS INT

RETURN (SELECT key_value FROM config_table WHERE config_name = 'target_key' LIMIT 1); 

SELECT * from my_table where my_key = get_key_value_udf()
</pre></div>


<p>In most cases, the above query will prevent file pruning from working because Spark treats UDFs as black boxes and cannot infer their logic for optimization. As a result, Spark may read all the files, potentially degrading query performance.</p>



<h2 class="wp-block-heading">Cause</h2>



<p>Using a UDF prevents file pruning in Spark because UDFs are treated as black boxes by the Spark optimizer. Since the output of a UDF is determined only at runtime, it becomes nearly impossible for Spark to leverage UDFs for file pruning.</p>



<p>The solution is easy when using Spark&#8217;s Python or Scala API: simply execute the lookup query separately, collect the result, and use the value in the filter.&nbsp;</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
# Run the lookup once on the driver instead of wrapping it in a UDF
row = spark.sql("SELECT key_value FROM config_table WHERE config_name = 'target_key' LIMIT 1").first()
key_value = row['key_value'] if row else None

# The filter now contains a literal value, so file pruning can apply
filtered_df = spark.table("my_table").filter(f"my_key = {key_value}")
</pre></div>


<p>The solution in the SQL editor is not as straightforward.&nbsp;</p>



<p>Using views or temp tables will not work either, since Spark executes them together with the main query.</p>



<h2 class="wp-block-heading">Solution</h2>



<p>The solution is straightforward and quite elegant: you can accomplish it using <strong>`DECLARE VARIABLE`</strong>.</p>



<p>Example:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
DECLARE VARIABLE myvar INT DEFAULT 123;

SET VAR myvar = (SELECT key_value FROM config_table WHERE config_name = 'target_key' LIMIT 1);

SELECT * FROM my_table WHERE my_key = myvar;
</pre></div>


<p><a href="https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-declare-variable.html">More about DECLARE VARIABLE</a>.</p>



<figure class="wp-block-image size-large"><img loading="lazy" width="977" height="648" data-attachment-id="3969" data-permalink="https://aviyehuda2.wordpress.com/2025/01/18/databricks-using-declare-variable-to-overcome-a-file-pruning-issue-in-the-sql-editor/screenshot-2025-01-18-at-07-28-16/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/screenshot-2025-01-18-at-07.28.16.png" data-orig-size="977,648" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot 2025-01-18 at 07.28.16" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/screenshot-2025-01-18-at-07.28.16.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/screenshot-2025-01-18-at-07.28.16.png?w=977" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/screenshot-2025-01-18-at-07.28.16.png?w=977" alt="" class="wp-image-3969" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/screenshot-2025-01-18-at-07.28.16.png 977w, https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/screenshot-2025-01-18-at-07.28.16.png?w=150 150w, https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/screenshot-2025-01-18-at-07.28.16.png?w=300 300w, https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/screenshot-2025-01-18-at-07.28.16.png?w=768 768w" sizes="(max-width: 977px) 100vw, 977px" /></figure>
]]></content:encoded>
					
					<wfw:commentRss>https://aviyehuda2.wordpress.com/2025/01/18/databricks-using-declare-variable-to-overcome-a-file-pruning-issue-in-the-sql-editor/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">3954</post-id>
		<media:content url="https://2.gravatar.com/avatar/58277fd5adf41424400f6241b432f5cbe201500ba80bc32cd8a25cf94285f3dc?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">aviyehuda555</media:title>
		</media:content>

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/code-lego-edited.png" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/screenshot-2025-01-18-at-07.27.31-1.png?w=920" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2025/01/screenshot-2025-01-18-at-07.28.16.png?w=977" medium="image" />
	</item>
		<item>
		<title>Data Engineering: Strategies for data retrieval on multi-dimensional data</title>
		<link>https://aviyehuda2.wordpress.com/2023/11/20/data-engineering-strategies-for-data-retrieval-on-multi-dimensional-data/</link>
					<comments>https://aviyehuda2.wordpress.com/2023/11/20/data-engineering-strategies-for-data-retrieval-on-multi-dimensional-data/#respond</comments>
		
		<dc:creator><![CDATA[avi yehuda]]></dc:creator>
		<pubDate>Mon, 20 Nov 2023 18:45:55 +0000</pubDate>
				<category><![CDATA[Data Lake]]></category>
		<category><![CDATA[Spark]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[data engineering]]></category>
		<guid isPermaLink="false">http://www.aviyehuda.com/?p=3225</guid>

					<description><![CDATA[You&#8217;ve likely heard about the benefits of partitioning data by a single dimension to boost retrieval performance. It&#8217;s a common practice in relational databases, NoSQL databases, and, notably, data lakes. For example, a very common dimension to partition data in data lakes is by date or time. However, what if your data querying requirements involve [&#8230;]]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image aligncenter size-large"><img loading="lazy" width="963" height="323" data-attachment-id="3510" data-permalink="https://aviyehuda2.wordpress.com/2023/11/20/data-engineering-strategies-for-data-retrieval-on-multi-dimensional-data/lego-2/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2023/11/lego-1.png" data-orig-size="963,323" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="lego" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2023/11/lego-1.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2023/11/lego-1.png?w=963" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2023/11/lego-1.png?w=963" alt="" class="wp-image-3510" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2023/11/lego-1.png 963w, https://aviyehuda2.wordpress.com/wp-content/uploads/2023/11/lego-1.png?w=150 150w, https://aviyehuda2.wordpress.com/wp-content/uploads/2023/11/lego-1.png?w=300 300w, https://aviyehuda2.wordpress.com/wp-content/uploads/2023/11/lego-1.png?w=768 768w" sizes="(max-width: 963px) 100vw, 963px" /></figure>



<p>You&#8217;ve likely heard about the benefits of partitioning data by a single dimension to boost retrieval performance. It&#8217;s a common practice in relational databases, NoSQL databases, and, notably, data lakes. For example, a very common way to partition data in data lakes is by date or time. But what if your querying requirements involve multiple dimensions? Say you wish to query your data by field A and also by field B, or sometimes by field A but other times by field B.<br>In this post I&#8217;ll go over several common options for such cases.<br>For the sake of convenience I&#8217;ll give examples of how to implement them in a data lake using standard folder names and Parquet files to hold the data. Keep in mind, however, that the same paradigms are valid for other areas like relational DBs, NoSQL DBs, in-memory storage and so on.</p>



<div style="height:47px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading">The default: micro partitions</h2>



<p>Micro-partitioning is a technique used to sub-partition data within a dataset. Each micro-partition carries metadata for individual fields, which the query engine can use to optimize data consumption.</p>



<p>For instance, consider a scenario where data is organized into daily partitions stored in Parquet files. </p>



<pre class="wp-block-preformatted">&lt;dataset_root&gt;/day=20240101/data1.parquet
&lt;dataset_root&gt;/day=20240101/data2.parquet
&lt;dataset_root&gt;/day=20240101/data3.parquet</pre>



<p>In this setup, each Parquet file (or even each page within a Parquet file) can be referred to as a micro-partition. Parquet files inherently <a href="https://parquet.apache.org/docs/file-format/metadata/">store metadata</a> per file and per page, which can enhance data consumption performance.</p>



<p>Snowflake also <a href="https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions">employs micro-partitions</a> by default, but with richer metadata and superior indexing capabilities compared to plain Parquet files. This enhanced metadata and indexing contribute to significant performance gains, making micro-partitions a highly beneficial feature of the platform.</p>






<h2 class="wp-block-heading">The obvious approach: nested partitions</h2>



<p>Let&#8217;s start with nested partitions. A typical Hive partition structure looks like this:</p>



<pre class="wp-block-preformatted">&lt;dataset_root&gt;/&lt;FieldA&gt;=&lt;value&gt;/&lt;FieldB&gt;=&lt;value&gt;/data.parquet</pre>



<p>While this works well for consistent queries involving both Field A and Field B, it falls short when you need the flexibility to query either field separately. For instance:</p>




<div class="wp-block-syntaxhighlighter-code "><pre class="brush: scala; title: ; notranslate">
//Good for this:
spark.sql("select * from my_data_set where FieldA=11 and FieldB=22")

//Not so good for this:
spark.sql("select * from my_data_set where FieldA=11")
spark.sql("select * from my_data_set where FieldB=22")
</pre></div>


<p>This method falls short for the second type of query: to filter on Field B alone, every Field A partition must be scanned, which defeats the purpose of partitioning.</p>
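

<p>To make the difference concrete, here is a hypothetical Python sketch (the paths are made up for illustration). A filter on both fields narrows the scan to one leaf directory, while a filter on Field B alone matches files under every Field A branch:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; gutter: false; title: ; notranslate">
# Hypothetical listing of nested Hive-style partition paths
paths = [
    "root/FieldA=11/FieldB=22/data.parquet",
    "root/FieldA=11/FieldB=33/data.parquet",
    "root/FieldA=44/FieldB=22/data.parquet",
    "root/FieldA=44/FieldB=55/data.parquet",
]

def prune(paths, **filters):
    # keep only paths whose directories match every given Field=value filter
    return [p for p in paths
            if all(f"{k}={v}" in p.split("/") for k, v in filters.items())]

# Both fields known: pruning narrows the scan to a single leaf directory
print(prune(paths, FieldA=11, FieldB=22))

# FieldB alone: matching files sit under every FieldA branch,
# so all top-level partitions must still be listed and scanned
print(prune(paths, FieldB=22))
</pre></div>
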



<div style="height:57px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading">The opposite approach: data duplication with separate partitions</h2>



<p>Another approach involves duplicating the data, partitioning one copy by Field A and the other by Field B. In a Hive-like structure the directories might look like this:</p>



<pre class="wp-block-preformatted">&lt;dataset_root&gt;/&lt;FieldA&gt;=&lt;value&gt;/data.parquet</pre>



<p>and</p>



<pre class="wp-block-preformatted">&lt;dataset_root&gt;/&lt;FieldB&gt;=&lt;value&gt;/data.parquet</pre>



<p></p>



<p>It represents the opposite of the previous option, meaning:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: scala; gutter: false; title: ; notranslate">
// Good for this:
spark.sql("select * from my_data_set where FieldA=11")
spark.sql("select * from my_data_set where FieldB=22")

// Not good for this:
spark.sql("select * from my_data_set where FieldA=11 and FieldB=22")
</pre></div>


<p>Also, maintaining data consistency becomes more challenging in this scenario.</p>



<div style="height:45px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading">Best of Both Worlds? Partitioning by field A + externally indexing by field B </h2>



<p>This is a widely adopted strategy in databases. The advantage here is that the index serves as a reference to the data, not a copy of it.</p>



<p>In the data lake world this means partitioning the data by field A, same as before:</p>



<pre class="wp-block-preformatted">&lt;dataset_root&gt;/&lt;FieldA&gt;=&lt;value&gt;/data.parquet</pre>



<p>And, in addition, maintaining a slim dataset which references the same data files by field B values.</p>



<p>In data lakes it&#8217;s possible to implement this yourself, although usually it is implemented with an additional data catalog. This is also one of the advantages of lakehouses (like the Databricks lakehouse), since you get it out of the box.<br>It&#8217;s ideal for cases where you need to query the data by specific values of field B.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: scala; gutter: false; title: ; notranslate">
spark.sql("select * from my_data_set where FieldB=22")
</pre></div>


<p>However, it&#8217;s less suitable for queries involving a range of values for field B:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: scala; gutter: false; title: ; notranslate">
spark.sql("select * from my_data_set where FieldB&gt;22")
</pre></div>


<p>The reason is that the indexed keys are not stored contiguously on disk, the way partitioned data usually is.</p>
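<p>To make the idea concrete, here is a minimal sketch of such an external index in plain Python. The file paths and values are made up, and a real index would live in a slim catalog dataset rather than an in-memory dict; the sketch also shows why a range lookup is weaker: the matching files are scattered.</p>

```python
# A toy external index over data files partitioned by field A.
# Hypothetical paths and values; a real index lives in a catalog dataset.

def build_index(files):
    """Map each field-B value to the data files that contain it."""
    index = {}
    for path, field_b_values in files.items():
        for value in field_b_values:
            index.setdefault(value, set()).add(path)
    return index

# Each data file (partitioned by field A) and the field-B values inside it.
files = {
    "dataset_root/FieldA=1/data.parquet": [11, 22],
    "dataset_root/FieldA=2/data.parquet": [22, 33],
    "dataset_root/FieldA=3/data.parquet": [44],
}
index = build_index(files)

# Point lookup: read only the files that actually contain FieldB=22.
to_read = sorted(index.get(22, set()))

# Range lookup (FieldB > 22): every index entry must be checked, and the
# matching files can be scattered, since index keys are not stored
# contiguously the way partition directories are.
range_hits = sorted({p for v, paths in index.items() if v > 22 for p in paths})
```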



<div style="height:52px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading" id="option4">Often useful: partitioning by field A + clustering or sorting by field B</h2>



<p>This is an improvement over the micro-partitions approach. Here you partition the data by field A as you normally would, but make sure that inside each partition the data is clustered by field B. <br><br>Here is one example of how to implement it using Spark:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: scala; gutter: false; title: ; notranslate">
    // range-partition the data by field A, and by field B within each range
    val sortedDF = df.repartitionByRange($"fieldA", $"fieldB")

    // then write the data in a partitioned manner
    sortedDF.write
      .mode(SaveMode.Overwrite) 
      .partitionBy("fieldA")
      .parquet("/dataset_root")
 
</pre></div>


<p>In the example above, data will be written partitioned by field A, but inside each partition the data will be divided into files (micro-partitions) by field B as well. </p>



<p>The technologies used need to support this, of course. In the case of Parquet it works well, since Parquet holds metadata for each field, including min and max values. Most technologies (like Apache Spark) take this into account, so they are able to skip files which do not include the required values of field B.</p>
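<p>The file-skipping logic itself can be sketched in a few lines of plain Python. The statistics below are made-up stand-ins for the per-field min/max values that Parquet stores:</p>

```python
# A sketch of metadata-based file skipping: keep only the files whose
# recorded [min, max] range for field B can contain the requested value.
# The paths and statistics are hypothetical.

file_stats = {
    "FieldA=1/part-0.parquet": (10, 19),   # (min, max) of field B in the file
    "FieldA=1/part-1.parquet": (20, 29),
    "FieldA=2/part-0.parquet": (15, 24),
}

def files_to_read(stats, value):
    """Return the files that might contain `value`; the rest are skipped."""
    return [path for path, (lo, hi) in stats.items() if lo <= value <= hi]

matching = files_to_read(file_stats, 22)
```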



<p>This is a solid choice for various use cases, although it is not the best approach for queries like this:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: scala; gutter: false; title: ; notranslate">
spark.sql("select * from my_data_set where FieldB=22")
</pre></div>


<p>or</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: scala; gutter: false; title: ; notranslate">
spark.sql("select * from my_data_set where FieldB&gt;22")
</pre></div>


<p>This is because such queries mean going over all partitions. However, since the data is grouped by field B within the partitions, at least some of the files may be skipped.<br>This approach is particularly useful when field B contains a wide range of possible values (high cardinality). It can also be a beneficial design when field B&#8217;s values are unevenly distributed (skewed).<br><br>This is why this paradigm is very common across technologies, for example: clustering in BigQuery, sort keys in DynamoDB, clustering inside micro-partitions in Snowflake, and so on.</p>



<div style="height:45px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading">The secret weapon &#8211; Z-order </h2>



<p>A less common but important option is to index or partition by a Z-order. In this case, the data is also sorted, but instead of being partitioned by field A and sorted by field B, it will be sorted by a key which is a composite of both fields A and B:</p>



<pre class="wp-block-preformatted">&lt;dataset_root&gt;/&lt;A combination of FieldA+fieldB&gt;
</pre>



<p>This method is well suited to all of the query types mentioned so far.<br>The secret is in the method which combines the two fields together: it makes sure that keys with similar values are stored in proximity to one another, and this holds true for both fields that make up the key. So no matter whether you&#8217;re retrieving data based on one of the fields or both, and whether you need a precise value or a range of values, this method will help. Like the previous method, it also handles high cardinality and skew well.</p>
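<p>The classic way to build such a composite key is bit interleaving (a Morton code). Here is a minimal sketch in plain Python, assuming both fields have already been mapped to non-negative integers; real engines handle arbitrary types and much wider keys:</p>

```python
# A sketch of a Z-order (Morton) composite key: interleave the bits of the
# two fields so that rows close in either dimension get close key values.

def z_order_key(a, b, bits=16):
    key = 0
    for i in range(bits):
        key |= ((a >> i) & 1) << (2 * i + 1)  # bits of field A at odd positions
        key |= ((b >> i) & 1) << (2 * i)      # bits of field B at even positions
    return key

# Sorting (or range-partitioning) rows by this key keeps rows with similar
# field A *and* similar field B values near each other on disk.
rows = [(3, 5), (0, 0), (2, 3), (1, 1)]
rows.sort(key=lambda r: z_order_key(*r))
```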



<p>Implementations of this are not very common, though, and are quite complex. Currently the most widespread ones come from hosted environments like the Databricks lakehouse.</p>



<div style="height:53px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading">Conclusion</h2>



<p>Choosing the right strategy for multi-dimensional data querying depends on your specific use case. Each approach has its strengths and trade-offs. Whether you go for nested partitions, data duplication, external indexing, sorting, or Z-ordering, understanding these strategies equips you to make informed decisions based on your data lake architecture and querying needs.<br></p>



<div style="height:31px" aria-hidden="true" class="wp-block-spacer"></div>



<p>See also:  <a href="http://www.aviyehuda.com/blog/2023/10/13/parquet-data-filtering-with-pandas/">Parquet data filtering with Pandas</a></p>
]]></content:encoded>
					
					<wfw:commentRss>https://aviyehuda2.wordpress.com/2023/11/20/data-engineering-strategies-for-data-retrieval-on-multi-dimensional-data/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">3225</post-id>
		<media:content url="https://2.gravatar.com/avatar/58277fd5adf41424400f6241b432f5cbe201500ba80bc32cd8a25cf94285f3dc?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">aviyehuda555</media:title>
		</media:content>

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2023/11/lego-1.png?w=963" medium="image" />
	</item>
		<item>
		<title>Parquet data filtering with Pandas</title>
		<link>https://aviyehuda2.wordpress.com/2023/10/13/parquet-data-filtering-with-pandas/</link>
					<comments>https://aviyehuda2.wordpress.com/2023/10/13/parquet-data-filtering-with-pandas/#comments</comments>
		
		<dc:creator><![CDATA[avi yehuda]]></dc:creator>
		<pubDate>Fri, 13 Oct 2023 19:37:41 +0000</pubDate>
				<category><![CDATA[Python]]></category>
		<category><![CDATA[data engineering]]></category>
		<category><![CDATA[pandas]]></category>
		<category><![CDATA[parquet]]></category>
		<guid isPermaLink="false">http://www.aviyehuda.com/?p=3174</guid>

					<description><![CDATA[When it comes to filtering data from Parquet files using pandas, several strategies can be employed. While it&#8217;s widely recognized that partitioning data can significantly enhance the efficiency of filtering operations, there are additional methods to optimize the performance of querying data stored in Parquet files. Partitioning is just one of the options. Filtering by [&#8230;]]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img data-attachment-id="3280" data-permalink="https://aviyehuda2.wordpress.com/pandas_parquet-2/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2024/10/pandas_parquet-edited.png" data-orig-size="768,432" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="pandas_parquet" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2024/10/pandas_parquet-edited.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2024/10/pandas_parquet-edited.png?w=768" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2024/10/pandas_parquet-edited.png" alt="" class="wp-image-3280" /></figure>



<p>When it comes to filtering data from Parquet files using pandas, several strategies can be employed. While it&#8217;s widely recognized that partitioning data can significantly enhance the efficiency of filtering operations, there are additional methods to optimize the performance of querying data stored in Parquet files. Partitioning is just one of the options.</p>



<p class="has-medium-font-size"><strong>Filtering by partitioned fields</strong></p>



<p>As previously mentioned, this approach is not only the most familiar but also typically the most impactful in terms of performance optimization. The rationale is straightforward: when partitions are employed, entire files, or even entire directories of files, can be skipped (aka predicate pushdown), resulting in a substantial and dramatic improvement in performance.<br></p>



<pre class="brush: python; title: ; notranslate">import pandas as pd
import time
from faker import Faker

fake = Faker()

MIL=1000000
NUM_OF_RECORDS=10*MIL
FOLDER="/tmp/out/"
PARTITIONED_PATH=f"{FOLDER}partitioned_{NUM_OF_RECORDS}/"
NON_PARTITIONED_PATH=f"{FOLDER}non_partitioned_{NUM_OF_RECORDS}.parquet"

print("Creating fake data")
data = {
    'id': range(NUM_OF_RECORDS),  # sequential IDs from 0 to NUM_OF_RECORDS-1
    'name': [fake.name() for _ in range(NUM_OF_RECORDS)],
    'age': [fake.random_int(min=18, max=99) for _ in range(NUM_OF_RECORDS)],
    'state': [fake.state() for _ in range(NUM_OF_RECORDS)],
    'city': [fake.city() for _ in range(NUM_OF_RECORDS)],
    'street': [fake.street_address() for _ in range(NUM_OF_RECORDS)]
}

df = pd.DataFrame(data)

# writing without partitions
df.to_parquet(path=NON_PARTITIONED_PATH)

# writing partitioned data
df.to_parquet(path=PARTITIONED_PATH, partition_cols=['state'])

# reading non partitioned
start_time = time.time()
df1 = pd.read_parquet(path=NON_PARTITIONED_PATH)
df1 = df1[df1['state']=='California']
runtime1 = (time.time()) - start_time  # 37 sec

# reading partitioned data
start_time = time.time()
df2 = pd.read_parquet(path=PARTITIONED_PATH, filters=[('state','==','California')])
runtime2 = (time.time()) - start_time # 0.20 sec</pre>



<p>The time improvement (along with reduced memory and CPU usage) is substantial, decreasing from 37 seconds to just 0.20 seconds.</p>



<p class="has-medium-font-size"><strong>Filtering by non partitioned fields</strong></p>



<p>In the example above, we observed how filtering based on a partitioned field can enhance data retrieval. However, there are scenarios where data can&#8217;t be effectively partitioned by the specific field we wish to filter. Moreover, in some cases, filtering is required based on multiple fields. This means all input files will be opened, which can be harmful to performance.</p>



<p>Thankfully, Parquet offers a clever solution to mitigate this issue. Parquet files are split into row groups, and within each row group Parquet stores metadata. This metadata includes the minimum and maximum values of each field.<br><br>When writing Parquet files with Pandas you can select the number of records in each row group.</p>



<p>When using Pandas to read Parquet files with filters, the Pandas library leverages this Parquet metadata to efficiently filter the data loaded into memory. If the desired value falls outside the min/max range of a row group, that entire row group is skipped.<br></p>



<pre class="brush: python; title: ; notranslate">df = pd.DataFrame(data)

# writing non partitioned data, specifying the size of the row group
df.to_parquet(path=PATH_TO_PARQUET_FILE, row_group_size=1000000)

# reading non partitioned data and filtering by row groups only
df = pd.read_parquet(path=DATASET_PATH, filters=[('state','==','California')])</pre>



<p>Viewing the metadata inside Parquet files can be done using PyArrow.</p>



<pre class="wp-block-code has-text-color has-background has-small-font-size" style="color:#42af32;background-color:#505050"><code>&gt;&gt;&gt; import pyarrow.parquet as pq

&gt;&gt;&gt; parquet_file = pq.ParquetFile(PATH_TO_PARQUET_FILE)

&gt;&gt;&gt; parquet_file.metadata
&lt;pyarrow._parquet.FileMetaData object at 0x125b21220&gt;
  created_by: parquet-cpp-arrow version 11.0.0
  num_columns: 6
  <strong>num_rows: 1000000
  num_row_groups: 10</strong>
  format_version: 2.6
  serialized_size: 9325

&gt;&gt;&gt; parquet_file.metadata.row_group(0).column(3)
&lt;pyarrow._parquet.ColumnChunkMetaData object at 0x125b5b180&gt;
  file_offset: 1675616
  file_path: 
  physical_type: BYTE_ARRAY
  num_values: 100000
  path_in_schema: state
  is_stats_set: True
  statistics:
    &lt;pyarrow._parquet.Statistics object at 0x115283590&gt;
      has_min_max: True
      <strong>min: Alabama
      max: Wyoming</strong>
      null_count: 0
      distinct_count: 0
      num_values: 100000
      physical_type: BYTE_ARRAY
      logical_type: String
      converted_type (legacy): UTF8
  compression: SNAPPY
  encodings: ('RLE_DICTIONARY', 'PLAIN', 'RLE')
  has_dictionary_page: True
  dictionary_page_offset: 1599792
  data_page_offset: 1600354
  total_compressed_size: 75824
  total_uncompressed_size: 75891

</code></pre>



<p>Notice that the number of row groups is mentioned in the metadata of the entire file, and the minimum and maximum values appear inside the statistics section of each column, for each row group.<br><br>However, there is a method to further harness this Parquet feature for even more optimized results: <strong>sorting</strong>.</p>






<p class="has-medium-font-size"><strong>Filtering by sorted fields</strong></p>



<p>As mentioned in the previous section, part of the metadata stored by Parquet includes the minimum and maximum values for each field within every row group. When the data is sorted based on the field we intend to filter by, Pandas has a greater likelihood of skipping more row groups.</p>



<p>For example, let&#8217;s consider a dataset that includes a list of records, with one of the fields representing &#8216;state&#8217;. If the records are unsorted, there&#8217;s a good chance that each state appears in most of the row groups. Look at the metadata in the previous section, for example: you can see that the first row group alone holds all the states from &#8216;Alabama&#8217; to &#8216;Wyoming&#8217;.</p>



<p>However, if we sort the data based on the &#8216;state&#8217; field, there&#8217;s a significant probability of skipping many row groups.</p>



<pre class="brush: python; title: ; notranslate">df = pd.DataFrame(data)

# sorting the data based on 'state'
df.sort_values("state").to_parquet(path=NON_PARTITIONED_SORTED_PATH)</pre>



<p>Now let&#8217;s look again at the metadata and see how it changed</p>



<pre class="wp-block-code has-text-color has-background has-small-font-size" style="color:#418d2f;background-color:#505050"><code>&gt;&gt;&gt; parquet_file = pq.ParquetFile(PATH_TO_PARQUET_FILE)

&gt;&gt;&gt; parquet_file.metadata.row_group(0).column(3).statistics.min
'Alabama'
&gt;&gt;&gt; parquet_file.metadata.row_group(0).column(3).statistics.max
'Kentucky'


&gt;&gt;&gt; parquet_file.metadata.row_group(1).column(3).statistics.min
'Kentucky'
&gt;&gt;&gt; parquet_file.metadata.row_group(1).column(3).statistics.max
'North Dakota'


&gt;&gt;&gt; parquet_file.metadata.row_group(2).column(3).statistics.min
'North Dakota'
&gt;&gt;&gt; parquet_file.metadata.row_group(2).column(3).statistics.max
'Wyoming'
</code></pre>



<p>As you can see, after sorting by state the min/max values are affected accordingly: each row group holds only part of the states instead of all of them. This means reading with filters should be a lot quicker now.<br><br>Now let&#8217;s see how it affects the performance of reading the data. The code for reading the data hasn&#8217;t changed.</p>



<pre class="brush: python; title: ; notranslate"># reading non partitioned data and filtering by row groups, the input is sorted by state
start_time = time.time()
df = pd.read_parquet(path=DATASET_PATH, filters=[('state','==','California')])
runtime = (time.time()) - start_time # 0.24 seconds</pre>



<p>Astonishingly, the performance here is almost as good as with partitions.</p>



<p>This principle applies to both partitioned and non-partitioned data; we can use both methods at the same time. If we sometimes want to filter the data based on field A and other times based on field B, then partitioning by field A and sorting by field B could be a good option. <br>In other cases, for instance where the field we want to filter by has high cardinality, we can partition by some hash of the value (bucketing) and sort the data inside each bucket by the actual value of the field. This way we enjoy the advantages of both methods &#8211; partitioning and row groups. <br></p>



<p class="has-medium-font-size"><strong>Reading a subset of the columns</strong></p>



<p>Although less commonly used, another method for achieving better results during data retrieval is selecting only the specific fields that are essential for your task. This strategy can occasionally yield improvements in performance due to the nature of the Parquet format: Parquet is columnar, meaning it stores the data column by column inside each row group, so reading only some of the columns means the other columns are skipped entirely.</p>



<pre class="brush: python; title: ; notranslate">start_time = time.time()
df = pd.read_parquet(path=NON_PARTITIONED_SORTED_PATH, columns=["name", "state"])
runtime = (time.time()) - start_time # 0.08 seconds</pre>



<p>Unsurprisingly, the improvement in performance is great.</p>



<p class="has-medium-font-size"><strong>Conclusion</strong></p>



<p>While partitioning data is typically the optimal approach, it is not always a possibility. Sorting the data can lead to significant improvements, since it allows more row groups to be skipped. Additionally, if feasible, selecting only the necessary columns is always a good choice.</p>



<p>I hope this post helped you understand how to harness the power of Parquet and pandas for better performance.<br><a rel="noreferrer noopener" href="https://github.com/aviyehuda/PandasParquetTests/blob/main/parquet_test.py" target="_blank">Here is a script</a> containing all the previously mentioned examples, complete with time comparisons.</p>



]]></content:encoded>
					
					<wfw:commentRss>https://aviyehuda2.wordpress.com/2023/10/13/parquet-data-filtering-with-pandas/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">3174</post-id>
		<media:content url="https://2.gravatar.com/avatar/58277fd5adf41424400f6241b432f5cbe201500ba80bc32cd8a25cf94285f3dc?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">aviyehuda555</media:title>
		</media:content>

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2024/10/pandas_parquet-edited.png" medium="image" />
	</item>
		<item>
		<title>Spark and Small Files</title>
		<link>https://aviyehuda2.wordpress.com/2022/03/12/spark-and-small-files/</link>
					<comments>https://aviyehuda2.wordpress.com/2022/03/12/spark-and-small-files/#respond</comments>
		
		<dc:creator><![CDATA[avi yehuda]]></dc:creator>
		<pubDate>Sat, 12 Mar 2022 11:12:00 +0000</pubDate>
				<category><![CDATA[Data Lake]]></category>
		<category><![CDATA[Spark]]></category>
		<category><![CDATA[DataLake]]></category>
		<guid isPermaLink="false">http://www.aviyehuda.com/?p=3075</guid>

					<description><![CDATA[In my previous post I have showed this short code example: And I asked what may be the problem with that code, assuming that the input ( my_website_visits ) is very big and that we filter most of it using the &#8216;where&#8217; clause. Well the answer is of course, is that that piece of code [&#8230;]]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img data-attachment-id="3277" data-permalink="https://aviyehuda2.wordpress.com/spark_firework-4/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2024/10/spark_firework-1-edited-1.jpg" data-orig-size="1627,915" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;1.7&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;DMC-GH4&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;20&quot;,&quot;iso&quot;:&quot;200&quot;,&quot;shutter_speed&quot;:&quot;0.0025&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="spark_firework" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2024/10/spark_firework-1-edited-1.jpg?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2024/10/spark_firework-1-edited-1.jpg?w=1024" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2024/10/spark_firework-1-edited-1.jpg" alt="" class="wp-image-3277" /></figure>



<p>

In my <a href="http://www.aviyehuda.com/blog/2022/01/10/coalesce-with-care/">previous post</a> I showed this short code example:

</p>



<div class="wp-block-group is-layout-flow wp-block-group-is-layout-flow">
<pre class="brush: python; title: ; notranslate">sparkSession.sql("select * from my_website_visits where post_id=317456")
   .write.parquet("s3://reports/visits_report")</pre>
</div>



<p>


And I asked what may be the problem with that code, assuming that the input (my_website_visits) is very big and that we filter most of it using the &#8216;where&#8217; clause.
</p>



<p>Well, the answer of course is that this piece of code may result in a large number of small files.</p>



<p>Why?<br>
Because we are reading a large input, the number of tasks will be quite large. When filtering out most of the data and then writing it, the number of tasks will remain the same, since no shuffling was done. This means that each task will write only a small amount of data, which means small files on the target path.


</p>



<figure class="wp-block-image aligncenter size-large"><img src="https://i0.wp.com/www.aviyehuda.com/images/coalesce/coalesce7.png" alt="" /></figure>






<p>In the example above, Spark created 165 tasks to handle our input. That means that even after filtering most of the data, the output of this process will be at least 165 files with only a few KB in each.</p>



<figure class="wp-block-image size-large"><img src="https://i0.wp.com/www.aviyehuda.com/images/coalesce/coalesce1.png" alt=""></figure>



<p><strong>What is the problem with a lot of small files?</strong><br>Well, first of all, the writing itself is inefficient. More files mean unneeded overhead in resources and time. If you&#8217;re storing your output on the cloud, like AWS S3, this problem may be even worse, since Spark&#8217;s file committer stores files in a temporary location before writing the output to the final location. Only when all the data is done being written to the temporary location is it copied to the final location. <br></p>



<p>But perhaps worse than the impact a lot of small files have on the writing process is the impact they have on the consumers of that data, since data is usually written once but read multiple times. So when creating the data in multiple small files, you&#8217;re also hurting your consumers.</p>



<p>But that&#8217;s not all. Sometimes you need to store the output in a partitioned manner.
Let&#8217;s say that you want to write this data partitioned by country.

<pre class="brush: python; title: ; notranslate">sparkSession.sql("select * from my_website_visits where post_id=317456")
.write.partitionBy("country").parquet("s3://reports/visits_report")</pre>

That will make things even worse, right?
Now each of those 165 tasks can potentially write a file to each of those partitions, so the output can reach up to 165 * (number of countries) files.

</p>



<p>
<b>What can be done to solve it? </b><br>

Well, the obvious solution is of course to use repartition or coalesce. But as I mentioned in my last post, be careful if you&#8217;re planning to use coalesce.

If you do partition the data when writing, as we saw above, there is something else you can do.
<br>

 <pre class="brush: python; title: ; notranslate">sparkSession.sql("select * from my_website_visits where post_id=317456")
.repartition("country").write.partitionBy("country").parquet("s3://reports/visits_report")</pre>

In the example above we repartitioned the data by country before writing, so each country&#8217;s rows end up in a single task. Therefore I expect to have only 1 file per country.

</p>
]]></content:encoded>
					
					<wfw:commentRss>https://aviyehuda2.wordpress.com/2022/03/12/spark-and-small-files/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">3075</post-id>
		<media:content url="https://2.gravatar.com/avatar/58277fd5adf41424400f6241b432f5cbe201500ba80bc32cd8a25cf94285f3dc?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">aviyehuda555</media:title>
		</media:content>

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2024/10/spark_firework-1-edited-1.jpg" medium="image" />

		<media:content url="http://www.aviyehuda.com/images/coalesce/coalesce7.png" medium="image" />

		<media:content url="http://www.aviyehuda.com/images/coalesce/coalesce1.png" medium="image" />
	</item>
		<item>
		<title>Coalesce with care&#8230;Coalesce Vs. Repartition in SparkSQL</title>
		<link>https://aviyehuda2.wordpress.com/2022/01/10/coalesce-with-care/</link>
					<comments>https://aviyehuda2.wordpress.com/2022/01/10/coalesce-with-care/#comments</comments>
		
		<dc:creator><![CDATA[avi yehuda]]></dc:creator>
		<pubDate>Mon, 10 Jan 2022 08:01:27 +0000</pubDate>
				<category><![CDATA[Spark]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[SparkSQL]]></category>
		<guid isPermaLink="false">http://www.aviyehuda.com/?p=2838</guid>

					<description><![CDATA[Here is a quick Spark SQL riddle for you; what do you think can be problematic in the next spark code (assume that spark session was configured in an ideal way)? Hint1: the input data (my_website_visits) is quite big. Hint2: we filter out most of the data before writing. I&#8217;m sure that you got it [&#8230;]]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img data-attachment-id="3285" data-permalink="https://aviyehuda2.wordpress.com/spark_lighter-2/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2024/10/spark_lighter-edited.jpg" data-orig-size="5760,3243" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="spark_lighter" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2024/10/spark_lighter-edited.jpg?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2024/10/spark_lighter-edited.jpg?w=1024" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2024/10/spark_lighter-edited.jpg" alt="" class="wp-image-3285" /></figure>



<p>Here is a quick Spark SQL riddle for you; what do you think can be problematic in the next spark code (assume that spark session was configured in an ideal way)?</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
sparkSession.sql("select * from my_website_visits where post_id=317456")
.write.parquet("s3://reports/visits_report")
</pre></div>


<h4 class="wp-block-heading"><b>Hint1:</b> the input data (my_website_visits) is quite big.</h4>



<p><b>Hint2:</b> we filter out most of the data before writing.</p>



<p>I&#8217;m sure that you got it by now; if the input data is big, and Spark is configured in an ideal way, it means that my Spark job has a lot of tasks.<br>This means that the writing is also done from multiple tasks.<br>This probably means that the output will be a large number of very small Parquet files.</p>



<figure class="wp-block-image size-large"><img loading="lazy" width="616" height="367" data-attachment-id="3287" data-permalink="https://aviyehuda2.wordpress.com/2022/01/10/coalesce-with-care/coalesce7/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce7.png" data-orig-size="616,367" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="coalesce7" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce7.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce7.png?w=616" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce7.png?w=616" alt="" class="wp-image-3287" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce7.png 616w, https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce7.png?w=150 150w, https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce7.png?w=300 300w" sizes="(max-width: 616px) 100vw, 616px" /></figure>



<p>Small files are a known problem in the big data world. It takes an unnecessarily large amount of resources to write this data, but more importantly, it takes a large amount of resources to read it (more IO, more memory, more runtime&#8230;).<br>This is how it looks in the Spark UI.</p>



<figure class="wp-block-image size-large"><img loading="lazy" width="1024" height="417" data-attachment-id="3288" data-permalink="https://aviyehuda2.wordpress.com/2022/01/10/coalesce-with-care/coalesce1/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce1.png" data-orig-size="1144,466" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="coalesce1" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce1.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce1.png?w=1024" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce1.png?w=1024" alt="" class="wp-image-3288" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce1.png?w=1024 1024w, https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce1.png?w=150 150w, https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce1.png?w=300 300w, https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce1.png?w=768 768w, https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce1.png 1144w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>In this case we have 165 tasks, which means that we can have up to 165 output files.</p>



<p>How would you improve this?<br>Instead of writing from multiple workers, let&#8217;s write from a single worker.</p>



<figure class="wp-block-image size-large"><img loading="lazy" width="932" height="371" data-attachment-id="3290" data-permalink="https://aviyehuda2.wordpress.com/2022/01/10/coalesce-with-care/coalesce8/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce8.png" data-orig-size="932,371" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="coalesce8" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce8.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce8.png?w=932" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce8.png?w=932" alt="" class="wp-image-3290" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce8.png 932w, https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce8.png?w=150 150w, https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce8.png?w=300 300w, https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce8.png?w=768 768w" sizes="(max-width: 932px) 100vw, 932px" /></figure>



<p>How is this done in Spark?</p>



<h3 class="wp-block-heading">Coalesce vs. Repartition</h3>



<p>In Spark there are two common transformations for changing the number of tasks; <b>coalesce</b> and <b>repartition</b>. They are very similar but not identical.<br>Repartition in Spark SQL triggers a shuffle, whereas coalesce doesn&#8217;t. And as we know, a shuffle can be expensive (note: this is true for DataFrames/Datasets; in RDDs the behaviour is a bit different).<br>So let&#8217;s try to use coalesce.</p>





<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
sparkSession.sql("select * from my_website_visits where post_id=317456")
.coalesce(1).write.parquet("s3://reports/visits_report")

</pre></div>


<p>That took <b>3.3 minutes</b> to run, while the original program took only <b>12 seconds</b>.</p>



<p>Now let&#8217;s try it with repartition:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
sparkSession.sql("select * from my_website_visits where post_id=317456")
.repartition(1).write.parquet("s3://reports/visits_report")
</pre></div>





<p>That took only <b>8 seconds</b> to run.</p>



<p>How can that be?! Repartition adds a shuffle, so it should be more expensive.<br>Let&#8217;s look at the Spark UI.</p>



<p>This is how it looks when using coalesce.</p>



<figure class="wp-block-image size-large"><img loading="lazy" width="995" height="476" data-attachment-id="3291" data-permalink="https://aviyehuda2.wordpress.com/2022/01/10/coalesce-with-care/coalesce2/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce2.png" data-orig-size="995,476" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="coalesce2" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce2.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce2.png?w=995" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce2.png?w=995" alt="" class="wp-image-3291" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce2.png 995w, https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce2.png?w=150 150w, https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce2.png?w=300 300w, https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce2.png?w=768 768w" sizes="(max-width: 995px) 100vw, 995px" /></figure>



<p>And this is when using repartition:</p>



<figure class="wp-block-image size-large"><img loading="lazy" width="1024" height="319" data-attachment-id="3292" data-permalink="https://aviyehuda2.wordpress.com/2022/01/10/coalesce-with-care/coalesce3/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce3.png" data-orig-size="1291,403" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="coalesce3" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce3.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce3.png?w=1024" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce3.png?w=1024" alt="" class="wp-image-3292" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce3.png?w=1024 1024w, https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce3.png?w=150 150w, https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce3.png?w=300 300w, https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce3.png?w=768 768w, https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce3.png 1291w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>The reason should be clear. Using coalesce reduces the number of tasks for the <u>entire stage</u>, including the part which comes before the coalesce call. This means that the reading of the input and the filtering were done using only a single worker with a single task, as opposed to 165 tasks in the original program.</p>



<p>Repartition, on the other hand, creates a shuffle, and this indeed adds to the runtime, but since the 1st stage is still done using 165 tasks, the total runtime is much better than with coalesce.</p>
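<p>The semantic difference can be sketched with a toy model in plain Python. This is only an illustration of the idea, not Spark&#8217;s actual implementation: coalesce glues existing partitions together, while repartition redistributes every individual record (the shuffle).</p>

```python
# Toy illustration of the coalesce vs. repartition semantics
# (a simplified model, not Spark's real code).

def coalesce(partitions, n):
    """Merge existing partitions into n groups; whole partitions move together."""
    groups = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        groups[i % n].extend(part)  # a partition is glued onto a group as-is
    return groups

def repartition(partitions, n):
    """Redistribute every record across n new partitions (the 'shuffle')."""
    groups = [[] for _ in range(n)]
    records = (r for part in partitions for r in part)
    for i, record in enumerate(records):
        groups[i % n].append(record)  # each record is moved individually
    return groups

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(coalesce(parts, 2))     # [[1, 2, 5, 6], [3, 4, 7, 8]]
print(repartition(parts, 2))  # [[1, 3, 5, 7], [2, 4, 6, 8]]
```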



<p>Does this mean that coalesce is evil? Definitely not.<br>Let&#8217;s see an example where coalesce is actually the better choice.</p>



<h3 class="wp-block-heading">Limit with Coalesce</h3>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
sparkSession.sql("select * from my_website_visits").limit(10)
.write.parquet("s3://reports/visits_report")
</pre></div>





<p>This ran for 2.1 minutes, but when using coalesce:</p>





<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
sparkSession.sql("select * from my_website_visits")
.coalesce(1).limit(10).write.parquet("s3://reports/visits_report")
</pre></div>


<p>It ran for only 3 seconds(!)</p>



<p>As you can see, coalesce helped a lot here. To understand why, we need to understand how the <b>limit</b> operator works.</p>



<p>Limit actually divides the job into 2 stages with a shuffle in between. In the 1st stage, a <b>LocalLimit</b> operation is executed in each of the partitions. The limited data from each of the partitions is then combined into a single partition, where another limit operation is executed on that data. This operation is defined as the <b>GlobalLimit</b>.</p>
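<p>The two stages can be sketched in plain Python (a simplified model of the mechanism, not Spark&#8217;s implementation):</p>

```python
# Toy model of how Spark executes limit(n): a per-partition LocalLimit,
# a shuffle into one partition, then a GlobalLimit (simplified).

def local_limit(partitions, n):
    """Stage 1: each partition independently keeps at most n rows."""
    return [part[:n] for part in partitions]

def global_limit(partitions, n):
    """Stage 2: combine the survivors into one partition and limit again."""
    combined = [row for part in partitions for row in part]
    return combined[:n]

partitions = [list(range(100)) for _ in range(165)]  # 165 partitions of 100 rows
survivors = local_limit(partitions, 10)  # 165 * 10 = 1650 rows cross the shuffle
result = global_limit(survivors, 10)     # the final 10 rows
print(len(result))  # 10
```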



<p>This is how it looks in the SQL tab in Spark UI:</p>



<figure class="wp-block-image size-large"><img loading="lazy" width="307" height="933" data-attachment-id="3293" data-permalink="https://aviyehuda2.wordpress.com/2022/01/10/coalesce-with-care/coalesce4/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce4.png" data-orig-size="307,933" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="coalesce4" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce4.png?w=99" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce4.png?w=307" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce4.png?w=307" alt="" class="wp-image-3293" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce4.png 307w, https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce4.png?w=49 49w, https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce4.png?w=99 99w" sizes="(max-width: 307px) 100vw, 307px" /></figure>



<p>Notice that the local limit and global limit are on separate stages.</p>



<p>Now, if this data were ordered in some way, it would make sense to execute a local limit on each partition before doing the global limit. But since there is no ordering here at all, this is obviously a wasteful operation; we could&#8217;ve just taken those 10 records randomly from one of the partitions. Logically it wouldn&#8217;t make a difference, and it would&#8217;ve been much faster.</p>



<p>Using <code>coalesce(1)</code>, though, helps in 2 ways.<br>First, as seen, it sets the number of tasks to 1 for the entire stage. Since limit also reduces the number of tasks to 1, the extra stage and shuffle which limit adds are not needed anymore.<br>But there is another, more important reason why <code>coalesce(1)</code> helps here. Because <code>coalesce(1)</code> reduces the number of tasks to 1 for the entire stage (unlike repartition, which splits the stage), the local limit operation is done on only a single partition instead of on many partitions. And that helps the performance by a lot.</p>



<p>Looking at this in the Spark UI when using coalesce, you can clearly see that the local and global limit are executed on the same stage.</p>



<figure class="wp-block-image size-large"><img loading="lazy" width="267" height="785" data-attachment-id="3295" data-permalink="https://aviyehuda2.wordpress.com/2022/01/10/coalesce-with-care/coalesce5/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce5.png" data-orig-size="267,785" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="coalesce5" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce5.png?w=102" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce5.png?w=267" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce5.png?w=267" alt="" class="wp-image-3295" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce5.png 267w, https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce5.png?w=51 51w" sizes="(max-width: 267px) 100vw, 267px" /></figure>



<p>And what about repartition in this case?</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
sparkSession.sql("select * from my_website_visits").repartition(1).limit(10)
.write.parquet("s3://reports/visits_report")
</pre></div>


<p>It takes 2.7 minutes. Even slower than the original job.</p>



<p>This is how it looks in the SQL tab in the Spark UI:</p>



<figure class="wp-block-image size-large"><img loading="lazy" width="350" height="911" data-attachment-id="3296" data-permalink="https://aviyehuda2.wordpress.com/2022/01/10/coalesce-with-care/coalesce6/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce6.png" data-orig-size="350,911" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="coalesce6" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce6.png?w=115" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce6.png?w=350" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce6.png?w=350" alt="" class="wp-image-3296" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce6.png 350w, https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce6.png?w=58 58w, https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce6.png?w=115 115w" sizes="(max-width: 350px) 100vw, 350px" /></figure>



<p>We see that in this case the local and global limit are also executed on the same stage, on that single task, just like with coalesce. So why is it slower here?</p>



<p>Repartition, as opposed to coalesce, doesn&#8217;t change the number of tasks for the entire stage; instead it creates a new stage with the new number of partitions. This means that in our case it is actually taking all the data from all the partitions and combining it into a single big partition. This, of course, has a huge impact on performance.</p>



<h3 class="wp-block-heading">Conclusion</h3>



<p>As seen, both coalesce and repartition can help or hurt the performance of our applications; we just need to be careful when using them.</p>



<p></p>
]]></content:encoded>
					
					<wfw:commentRss>https://aviyehuda2.wordpress.com/2022/01/10/coalesce-with-care/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2838</post-id>
		<media:content url="https://2.gravatar.com/avatar/58277fd5adf41424400f6241b432f5cbe201500ba80bc32cd8a25cf94285f3dc?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">aviyehuda555</media:title>
		</media:content>

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2024/10/spark_lighter-edited.jpg" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce7.png?w=616" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce1.png?w=1024" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce8.png?w=932" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce2.png?w=995" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce3.png?w=1024" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce4.png?w=307" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce5.png?w=267" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2022/01/coalesce6.png?w=350" medium="image" />
	</item>
		<item>
		<title>Quick tip: Easily find data on the data lake when using AWS Glue Catalog</title>
		<link>https://aviyehuda2.wordpress.com/2021/01/15/quick-tip-easily-find-data-on-the-data-lake-when-using-aws-glue-catalog/</link>
					<comments>https://aviyehuda2.wordpress.com/2021/01/15/quick-tip-easily-find-data-on-the-data-lake-when-using-aws-glue-catalog/#respond</comments>
		
		<dc:creator><![CDATA[avi yehuda]]></dc:creator>
		<pubDate>Fri, 15 Jan 2021 06:23:53 +0000</pubDate>
				<category><![CDATA[AWS]]></category>
		<category><![CDATA[Data Lake]]></category>
		<guid isPermaLink="false">http://www.aviyehuda.com/?p=2807</guid>

					<description><![CDATA[Finding data on the data lake can sometimes be a challenge. At my current workplace (ZipRecruiter) we have hundreds of tables on the data lake and it&#8217;s growing each day. We store the data on AWS S3 and we use AWS Glue Catalog as meta data for our Hive tables. But even with Glue Catalog, [&#8230;]]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image aligncenter is-resized is-style-rectangular"><img loading="lazy" width="465" height="374" data-attachment-id="3301" data-permalink="https://aviyehuda2.wordpress.com/2021/01/15/quick-tip-easily-find-data-on-the-data-lake-when-using-aws-glue-catalog/glue0/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue0.png" data-orig-size="465,374" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="glue0" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue0.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue0.png?w=465" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue0.png?w=465" alt="" class="wp-image-3301" style="width:454px;height:auto" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue0.png 465w, https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue0.png?w=150 150w, https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue0.png?w=300 300w" sizes="(max-width: 465px) 100vw, 465px" /></figure>



<p>Finding data on the data lake can sometimes be a challenge. At my current workplace (<a href="https://www.ziprecruiter.com/">ZipRecruiter</a>) we have hundreds of tables on the data lake, and the number is growing each day. We store the data on AWS S3 and we use AWS Glue Catalog as the metadata store for our Hive tables.</p>



<p>But even with Glue Catalog, finding data on the data lake can still be a hassle. Let&#8217;s say I am trying to find a certain type of data, like &#8216;clicks&#8217; for example. It would be very nice to have an easy way to get all the clicks-related tables (including aggregation tables, join tables and so on) so I could choose from them. Or perhaps I would like to know which tables were generated by a specific application. There is no easy way to find these tables by default.<br><br>But here is something pretty cool that I recently found out about Glue Catalog that can help.<br>If you add properties to Glue tables, you can then search for tables based on those properties.</p>



<p>For example, if you add the property &#8220;clicks&#8221; to all the clicks-related tables, then you can get all of those tables as results by searching for the phrase &#8220;clicks&#8221; in Glue Catalog.</p>



<p>You can also add a property like &#8220;Application: ClicksGenerator&#8221; to all of the tables that were generated by the ClicksGenerator application.<br>Other ideas for labels may be: team names, last update date, data lag, data update frequency, and so on&#8230;</p>
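<p>For illustration, here is roughly how that tagging could be done with boto3. The database and table names, the label values and the <code>with_properties</code> helper are made-up examples; the Glue calls shown in the trailing comment (<code>update_table</code>, <code>search_tables</code>) are the relevant APIs.</p>

```python
# Sketch: merging searchable properties into a Glue TableInput.
# All names and labels here are hypothetical examples.
import copy

def with_properties(table_input, props):
    """Return a copy of a Glue TableInput with extra Parameters merged in."""
    updated = copy.deepcopy(table_input)
    updated.setdefault("Parameters", {}).update(props)
    return updated

table_input = {"Name": "clicks_daily_agg", "Parameters": {"classification": "parquet"}}
updated = with_properties(table_input, {"clicks": "true", "Application": "ClicksGenerator"})
print(updated["Parameters"])

# With AWS credentials configured, applying and searching would look like:
#   import boto3
#   glue = boto3.client("glue")
#   glue.update_table(DatabaseName="analytics", TableInput=updated)
#   glue.search_tables(SearchText="clicks")  # matches tables carrying the label
```

<p>Note that in practice the <code>TableInput</code> is usually built from an existing <code>get_table</code> response, with the read-only fields stripped, before calling <code>update_table</code>.</p>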



<figure class="wp-block-image size-large"><img loading="lazy" width="1024" height="578" data-attachment-id="3298" data-permalink="https://aviyehuda2.wordpress.com/2021/01/15/quick-tip-easily-find-data-on-the-data-lake-when-using-aws-glue-catalog/glue1/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue1.png" data-orig-size="1116,631" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="glue1" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue1.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue1.png?w=1024" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue1.png?w=1024" alt="" class="wp-image-3298" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue1.png?w=1024 1024w, https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue1.png?w=150 150w, https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue1.png?w=300 300w, https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue1.png?w=768 768w, https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue1.png 1116w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<figure class="wp-block-image size-large"><img loading="lazy" width="931" height="344" data-attachment-id="3300" data-permalink="https://aviyehuda2.wordpress.com/2021/01/15/quick-tip-easily-find-data-on-the-data-lake-when-using-aws-glue-catalog/glue2/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue2.png" data-orig-size="931,344" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="glue2" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue2.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue2.png?w=931" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue2.png?w=931" alt="" class="wp-image-3300" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue2.png 931w, https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue2.png?w=150 150w, https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue2.png?w=300 300w, https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue2.png?w=768 768w" sizes="(max-width: 931px) 100vw, 931px" /></figure>



<p></p>
]]></content:encoded>
					
					<wfw:commentRss>https://aviyehuda2.wordpress.com/2021/01/15/quick-tip-easily-find-data-on-the-data-lake-when-using-aws-glue-catalog/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2807</post-id>
		<media:content url="https://2.gravatar.com/avatar/58277fd5adf41424400f6241b432f5cbe201500ba80bc32cd8a25cf94285f3dc?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">aviyehuda555</media:title>
		</media:content>

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue0.png?w=465" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue1.png?w=1024" medium="image" />

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2021/01/glue2.png?w=931" medium="image" />
	</item>
		<item>
		<title>The right way to use Spark and JDBC</title>
		<link>https://aviyehuda2.wordpress.com/2018/12/17/the-right-way-to-use-spark-and-jdbc/</link>
					<comments>https://aviyehuda2.wordpress.com/2018/12/17/the-right-way-to-use-spark-and-jdbc/#respond</comments>
		
		<dc:creator><![CDATA[avi yehuda]]></dc:creator>
		<pubDate>Mon, 17 Dec 2018 09:17:08 +0000</pubDate>
				<category><![CDATA[Spark]]></category>
		<category><![CDATA[AWS]]></category>
		<category><![CDATA[Scala]]></category>
		<category><![CDATA[Sqoop]]></category>
		<guid isPermaLink="false">http://www.aviyehuda.com/?p=2775</guid>

					<description><![CDATA[A while ago I had to read data from a MySQL table, do a bit of manipulations on that data and store the results on the disk.The obvious choice was to use Spark, I was already using it for other stuff and it seemed super easy to implement. This is more or less what I [&#8230;]]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img loading="lazy" width="465" height="158" data-attachment-id="3305" data-permalink="https://aviyehuda2.wordpress.com/2018/12/17/the-right-way-to-use-spark-and-jdbc/spark_jdbc/" data-orig-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2018/12/spark_jdbc.png" data-orig-size="465,158" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="spark_jdbc" data-image-description="" data-image-caption="" data-medium-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2018/12/spark_jdbc.png?w=300" data-large-file="https://aviyehuda2.wordpress.com/wp-content/uploads/2018/12/spark_jdbc.png?w=465" src="https://aviyehuda2.wordpress.com/wp-content/uploads/2018/12/spark_jdbc.png?w=465" alt="" class="wp-image-3305" srcset="https://aviyehuda2.wordpress.com/wp-content/uploads/2018/12/spark_jdbc.png 465w, https://aviyehuda2.wordpress.com/wp-content/uploads/2018/12/spark_jdbc.png?w=150 150w, https://aviyehuda2.wordpress.com/wp-content/uploads/2018/12/spark_jdbc.png?w=300 300w" sizes="(max-width: 465px) 100vw, 465px" /></figure>



<p>A while ago I had to read data from a MySQL table, do a bit of manipulation on that data and store the results on the disk.<br>The obvious choice was to use Spark; I was already using it for other stuff and it seemed super easy to implement.</p>



<p>This is more or less what I had to do (I removed the part which does the manipulation for the sake of simplicity):</p>



<pre class="brush: scala; title: ; notranslate">spark.read.format(&quot;jdbc&quot;). 
   option(&quot;url&quot;, &quot;jdbc:mysql://dbhost/sbschhema&quot;). 
   option(&quot;dbtable&quot;, &quot;mytable&quot;). 
   option(&quot;user&quot;, &quot;myuser&quot;). 
   option(&quot;password&quot;, &quot;mypassword&quot;).
 load().write.parquet(&quot;/data/out&quot;)</pre>



<p>Looks good, only it didn&#8217;t quite work. Either it was super slow or it totally crashed, depending on the size of the table.<br>Tuning Spark and the cluster properties helped a bit, but it didn&#8217;t solve the problems.</p>



<p>Since I was using <a href="http://www.aviyehuda.com/blog/tag/emr/">AWS EMR</a>, it made sense to give <a href="http://sqoop.apache.org/">Sqoop</a> a try, since it is one of the applications supported on EMR.</p>



<pre class="brush: bash; title: ; notranslate">sqoop import --verbose \
--connect jdbc:mysql://dbhost/sbschhema \
--username myuser --table mytable \
--password mypassword --m 20 --as-parquetfile --target-dir /data/out</pre>



<p>Sqoop performed much better almost instantly; all you needed to do was set the number of mappers according to the size of the data, and it worked perfectly.</p>



<p>Since both Spark and Sqoop are based on the Hadoop MapReduce framework, it was clear that Spark could work at least as well as Sqoop; I only needed to find out how. I decided to look closer at what Sqoop does, to see if I could imitate it with Spark.</p>



<p>By turning on Sqoop&#8217;s verbose flag, you can get a lot more details.<br>What I found was that Sqoop splits the input between the different mappers, which makes sense; this is map-reduce after all, and Spark does the same thing.<br>But before doing that, Sqoop does something smart that Spark doesn&#8217;t do.</p>



<p>It first fetches the primary key (unless you give it another key to split the data by), then checks its minimum and maximum values. It then lets each of its mappers query the data, but with different boundaries on the key, so that the rows are split evenly between the mappers.</p>



<p>If, for example, the key&#8217;s maximum value is 100 and there are 5 mappers, then the query of the 1st mapper will look like this:</p>



<pre class="brush: sql; title: ; notranslate">SELECT * FROM mytable WHERE mykey &gt;= 1 AND mykey &lt;= 20;</pre>



<p>and the query for the second mapper will look like this:</p>



<pre class="brush: sql; title: ; notranslate">SELECT * FROM mytable WHERE mykey &gt;= 21 AND mykey &lt;= 40;</pre>



<p>and so on..</p>
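<p>The splitting logic amounts to something like the following (a plain-Python sketch of the idea, not Sqoop&#8217;s actual code):</p>

```python
# Sketch of Sqoop-style key-range splitting: divide [key_min, key_max]
# into contiguous, roughly equal ranges, one per mapper.

def split_boundaries(key_min, key_max, num_mappers):
    total = key_max - key_min + 1
    size = -(-total // num_mappers)  # ceiling division
    ranges = []
    lo = key_min
    while lo <= key_max:
        hi = min(lo + size - 1, key_max)
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

for lo, hi in split_boundaries(1, 100, 5):
    print(f"SELECT * FROM mytable WHERE mykey >= {lo} AND mykey <= {hi};")
```

<p>Note that splitting the key space evenly only splits the rows evenly if the key values are roughly uniformly distributed; that caveat applies to Sqoop as well.</p>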



<p>This totally made sense. Spark was not working properly because it didn&#8217;t know how to split the data between its mappers.</p>



<p>So it was time to implement the same logic with Spark.<br>This meant I had to take these steps in my code to make Spark work properly:<br>1. Fetch the primary key of the table<br>2. Find the key&#8217;s minimum and maximum values<br>3. Execute Spark with those values</p>



<p>This is the code I ended up with:</p>



<pre class="brush: scala; title: ; notranslate">def main(args: Array[String]){

// parsing input parameters ...

val primaryKey = executeQuery(url, user, password, s&quot;SHOW KEYS FROM ${config(&quot;schema&quot;)}.${config(&quot;table&quot;)} WHERE Key_name = 'PRIMARY'&quot;).getString(5)
val result = executeQuery(url, user, password, s&quot;select min(${primaryKey}), max(${primaryKey}) from ${config(&quot;schema&quot;)}.${config(&quot;table&quot;)}&quot;)
val min = result.getString(1).toInt
val max = result.getString(2).toInt
val numPartitions = (max - min) / 5000 + 1

val spark = SparkSession.builder().appName(&quot;Spark reading jdbc&quot;).getOrCreate()

var df = spark.read.format(&quot;jdbc&quot;).
option(&quot;url&quot;, s&quot;${url}${config(&quot;schema&quot;)}&quot;).
option(&quot;driver&quot;, &quot;com.mysql.jdbc.Driver&quot;).
option(&quot;lowerBound&quot;, min).
option(&quot;upperBound&quot;, max).
option(&quot;numPartitions&quot;, numPartitions).
option(&quot;partitionColumn&quot;, primaryKey).
option(&quot;dbtable&quot;, config(&quot;table&quot;)).
option(&quot;user&quot;, user).
option(&quot;password&quot;, password).load()

// some data manipulations here ...

df.repartition(10).write.
   mode(SaveMode.Overwrite).parquet(outputPath)

}</pre>



<p>And it worked perfectly.</p>



<p>Remarks:<br>1. The numPartitions I set for Spark is just a value I found to give good results for this number of rows. It can be changed, since the size of the data is of course also affected by the column sizes and data types.<br>2. The repartition at the end is there to avoid having small files.</p>
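<p>Remark 1&#8217;s heuristic, extracted from the Scala snippet above, amounts to roughly one JDBC partition per 5,000 key values; the 5,000 is just the empirical value used here:</p>

```python
# The numPartitions heuristic from the snippet above: about one JDBC
# partition per 5,000 key values (5,000 is an empirical choice).
def num_partitions(key_min, key_max, keys_per_partition=5000):
    return (key_max - key_min) // keys_per_partition + 1

print(num_partitions(1, 1_000_000))  # 200
```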
]]></content:encoded>
					
					<wfw:commentRss>https://aviyehuda2.wordpress.com/2018/12/17/the-right-way-to-use-spark-and-jdbc/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2775</post-id>
		<media:content url="https://2.gravatar.com/avatar/58277fd5adf41424400f6241b432f5cbe201500ba80bc32cd8a25cf94285f3dc?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">aviyehuda555</media:title>
		</media:content>

		<media:content url="https://aviyehuda2.wordpress.com/wp-content/uploads/2018/12/spark_jdbc.png?w=465" medium="image" />
	</item>
	</channel>
</rss>
