<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:blogger='http://schemas.google.com/blogger/2008' xmlns:georss='http://www.georss.org/georss' xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-6862508</id><updated>2026-04-15T00:31:29.078+02:00</updated><category term="emacs"/><category term="c++"/><category term="book"/><category term="fp"/><category term="boost"/><category term="clojure"/><category term="databricks"/><category term="article"/><category term="file formats"/><category term="haskell"/><category term="vcs"/><category term="cedet"/><category term="msoffice"/><category term="programming"/><category term="content filtering"/><category term="mac"/><category term="spark"/><category term="work"/><category term="delta live tables"/><category term="dlt"/><category term="f#"/><category term="it"/><category term="life"/><category term="machine-learning"/><category term="scheme"/><category term="terraform"/><category term="asio"/><category term="cassandra"/><category term="datastax"/><category term="git"/><category term="pyspark"/><category term="zeppelin"/><category term="2010"/><category term="common-lisp"/><category term="cybersecurity"/><category term="devops"/><category term="eventhubs"/><category term="google"/><category term="hadoop"/><category term="job"/><category term="linux"/><category term="mooc"/><category term="oss"/><category term="testing"/><category term="astra"/><category term="delta lake"/><category term="dse"/><category term="edx"/><category term="erlang"/><category term="incanter"/><category term="kafka"/><category term="microsoft"/><category term="photography"/><category term="software development"/><category term="solaris"/><category term="2011"/><category term="2015"/><category 
term="DSL"/><category term="R"/><category term="acer"/><category term="algorithms"/><category term="cicd"/><category term="cmake"/><category term="confluent"/><category term="coursera"/><category term="cs"/><category term="cuda"/><category term="data mining"/><category term="dsefs"/><category term="eclipse"/><category term="education"/><category term="emulation"/><category term="fasttext"/><category term="germany"/><category term="gpu"/><category term="handheld"/><category term="hardware"/><category term="home"/><category term="humor"/><category term="information retrieval"/><category term="instant messaging"/><category term="internet"/><category term="java"/><category term="jenkins"/><category term="language-detection"/><category term="latex"/><category term="lisp"/><category term="lucene"/><category term="mahout"/><category term="mapreduce"/><category term="maven"/><category term="muse"/><category term="nlp"/><category term="ocaml"/><category term="opencl"/><category term="opensolaris"/><category term="opensource"/><category term="palm"/><category term="personal"/><category term="presentations"/><category term="quality"/><category term="ruby"/><category term="russia"/><category term="scala"/><category term="security"/><category term="sicp"/><category term="spirit"/><category term="squid"/><category term="tika"/><category term="tips"/><category term="travel"/><category term="ubuntu"/><category term="unix"/><category term="vacation"/><category term="version control"/><category term="video"/><category term="windows"/><title type='text'>Alex Ott&#39;s blog</title><subtitle type='html'>Blog dedicated to Software Development, Unixes, Content Filtering, Emacs, Lisp, and other things.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default?redirect=false'/><link rel='alternate' 
type='text/html' href='http://alexott.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default?start-index=26&amp;max-results=25&amp;redirect=false'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>387</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-6862508.post-3286252514506508114</id><published>2026-04-10T12:56:00.000+02:00</published><updated>2026-04-10T12:56:16.767+02:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="databricks"/><category scheme="http://www.blogger.com/atom/ns#" term="terraform"/><title type='text'>Managing Databricks settings and previews using Terraform</title><content type='html'>&lt;p&gt;
Many Databricks customers use Terraform to create workspaces and deploy resources within them, or to create account-level resources.  But very often, there is a requirement not only to deploy resources but also to ensure that workspaces are correctly configured. E.g., many security-conscious customers disable the export of results from notebooks and SQL queries, and completely disable the embedding of dashboards into 3rd-party systems, etc.  And all these settings must be set without human involvement, especially for production environments.
&lt;/p&gt;

&lt;p&gt;
For a long time, people were using &lt;a href=&quot;https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/workspace_conf&quot;&gt;databricks_workspace_conf&lt;/a&gt; and &lt;a href=&quot;https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/sql_global_config&quot;&gt;databricks_sql_global_config&lt;/a&gt; resources to control some of the settings, but they had a few major problems:
&lt;/p&gt;

&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;Not all workspace settings were exposed via these resources;&lt;/li&gt;
&lt;li&gt;Many settings available via  &lt;a href=&quot;https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/workspace_conf&quot;&gt;databricks_workspace_conf&lt;/a&gt; weren&#39;t publicly documented. Over time, many customers discovered setting names on their own, but it was not officially supported, as settings could be removed without notice.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
At some point, specific development teams began adding dedicated workspace- and account-level APIs to control specific settings, and corresponding Terraform resources were added. E.g., &lt;a href=&quot;https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/disable_legacy_dbfs_setting&quot;&gt;databricks_disable_legacy_dbfs_setting&lt;/a&gt; (workspace-level), or &lt;a href=&quot;https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/disable_legacy_features_setting&quot;&gt;databricks_disable_legacy_features_setting&lt;/a&gt; (account-level).  But this approach was unsustainable, as it led to resource sprawl and maintenance overhead.  And there was still no way to control previews or users&#39; preferences.
&lt;/p&gt;

&lt;p&gt;
The situation has changed with the introduction of the generic settings API for both &lt;a href=&quot;https://docs.databricks.com/api/workspace/settingsv2/getpublicworkspacesetting&quot;&gt;workspace&lt;/a&gt; and &lt;a href=&quot;https://docs.databricks.com/api/account/settingsv2/getpublicaccountsetting&quot;&gt;account&lt;/a&gt; levels, allowing the development teams to easily add new settings when necessary, and they are automatically exposed to users.  To work with those APIs, corresponding Terraform resources were added: &lt;a href=&quot;https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/workspace_setting_v2&quot;&gt;databricks_workspace_setting_v2&lt;/a&gt; and  &lt;a href=&quot;https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/account_setting_v2&quot;&gt;databricks_account_setting_v2&lt;/a&gt;.  And what is important, these resources could be used to configure Databricks previews on both workspace and account levels!
&lt;/p&gt;

&lt;p&gt;
The usage of those resources is quite simple:
&lt;/p&gt;

&lt;ol class=&quot;org-ol&quot;&gt;
&lt;li&gt;Find the setting name using &lt;a href=&quot;https://docs.databricks.com/api/workspace/settingsv2/listworkspacesettingsmetadata&quot;&gt;workspace&lt;/a&gt; or &lt;a href=&quot;https://docs.databricks.com/api/account/settingsv2/listaccountsettingsmetadata&quot;&gt;account&lt;/a&gt;-level APIs.&lt;/li&gt;
&lt;li&gt;Create an instance of workspace or account-level resource using that setting name as an argument, and specify the required value argument.  The actual value argument depends on the specific setting - it could be a primitive value: &lt;code&gt;boolean_val&lt;/code&gt;, &lt;code&gt;integer_val&lt;/code&gt;, &lt;code&gt;string_val&lt;/code&gt;, or it could be a complex value, e.g. &lt;code&gt;automatic_cluster_update_workspace&lt;/code&gt;, &lt;code&gt;aibi_dashboard_embedding_approved_domains&lt;/code&gt;, etc. (check resource documentation for more details).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
E.g., I want to enable the preview for &quot;Lakeflow Connect for Jira&quot;, which is a workspace-level preview:
&lt;/p&gt;

&lt;ol class=&quot;org-ol&quot;&gt;
&lt;li&gt;Via the list settings API, I find that the setting has the &lt;code&gt;jira_connector&lt;/code&gt; name and is defined as follows:&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-json&quot;&gt;&lt;code&gt;{&lt;/code&gt;
&lt;code&gt;  &lt;span style=&quot;color: #a020f0;&quot;&gt;&quot;description&quot;&lt;/span&gt;: &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;Ingest Jira data with a simple and efficient connector. 
Available via API for both Jira Cloud and on premise instances.&quot;&lt;/span&gt;,&lt;/code&gt;
&lt;code&gt;  &lt;span style=&quot;color: #a020f0;&quot;&gt;&quot;name&quot;&lt;/span&gt;: &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;jira_connector&quot;&lt;/span&gt;,&lt;/code&gt;
&lt;code&gt;  &lt;span style=&quot;color: #a020f0;&quot;&gt;&quot;type&quot;&lt;/span&gt;: &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;{\&quot;boolean_val\&quot;: {\&quot;value\&quot;: true}}&quot;&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;}&lt;/code&gt;
&lt;/pre&gt;
&lt;/div&gt;

&lt;ol class=&quot;org-ol&quot;&gt;
&lt;li&gt;Add a corresponding resource to my Terraform code:&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-hcl&quot;&gt;&lt;code&gt;resource &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;databricks_workspace_setting_v2&quot;&lt;/span&gt; &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;jira&quot;&lt;/span&gt; {&lt;/code&gt;
&lt;code&gt;  &lt;span style=&quot;color: sienna;&quot;&gt;name&lt;/span&gt; = &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;jira_connector&quot;&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;  &lt;span style=&quot;color: sienna;&quot;&gt;boolean_val&lt;/span&gt; = {&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: sienna;&quot;&gt;value&lt;/span&gt; =&lt;span style=&quot;color: darkcyan;&quot;&gt; true&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;  }&lt;/code&gt;
&lt;code&gt;}&lt;/code&gt;
&lt;/pre&gt;
&lt;/div&gt;

&lt;ol class=&quot;org-ol&quot;&gt;
&lt;li&gt;Do standard &lt;code&gt;terraform plan&lt;/code&gt;, &lt;code&gt;terraform apply&lt;/code&gt; to apply setting change.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
And I can see in the UI that it&#39;s flipped. 
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjL63DAjbWRCnjFpRYa5rIiG8JHmOx7elAXlIprSiZfe8NUhU3DFIvLcAVoLt77XzxXQE2EnBClr5boSNJXPQ1nAVE7nlMKJzETwxOKRTK1STklF_S1D_np_SjMPNUbjLsSk9XmzgIqUMcGSsMDjLdXSttWYCljT2k9xxKvcbZOLidStYVcED36Kg/s725/Screenshot%202026-04-10%20at%2012.35.55.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;71&quot; data-original-width=&quot;725&quot; height=&quot;63&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjL63DAjbWRCnjFpRYa5rIiG8JHmOx7elAXlIprSiZfe8NUhU3DFIvLcAVoLt77XzxXQE2EnBClr5boSNJXPQ1nAVE7nlMKJzETwxOKRTK1STklF_S1D_np_SjMPNUbjLsSk9XmzgIqUMcGSsMDjLdXSttWYCljT2k9xxKvcbZOLidStYVcED36Kg/w640-h63/Screenshot%202026-04-10%20at%2012.35.55.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
Similarly, it could be done on the account level.
&lt;/p&gt;

&lt;p&gt;
As of right now, we need to keep in mind a few things when using these resources:
&lt;/p&gt;

&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;Not all settings are available yet in the new API, as migration is still in progress.&lt;/li&gt;
&lt;li&gt;Deletion of a setting is a &lt;b&gt;no-op&lt;/b&gt; - it won&#39;t revert the setting to the original value.  So if you want to disable preview or revert another setting to the original value, you need to do it explicitly.&lt;/li&gt;
&lt;/ul&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/3286252514506508114/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/3286252514506508114' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/3286252514506508114'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/3286252514506508114'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2026/04/managing-databricks-settings-and.html' title='Managing Databricks settings and previews using Terraform'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjL63DAjbWRCnjFpRYa5rIiG8JHmOx7elAXlIprSiZfe8NUhU3DFIvLcAVoLt77XzxXQE2EnBClr5boSNJXPQ1nAVE7nlMKJzETwxOKRTK1STklF_S1D_np_SjMPNUbjLsSk9XmzgIqUMcGSsMDjLdXSttWYCljT2k9xxKvcbZOLidStYVcED36Kg/s72-w640-h63-c/Screenshot%202026-04-10%20at%2012.35.55.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-5921215364937510703</id><published>2025-12-31T18:52:00.000+01:00</published><updated>2025-12-31T18:52:29.948+01:00</updated><title type='text'>Traditional New Year post, 2025th edition</title><content type='html'>&lt;p&gt;
It&#39;s the last day of the year, and it&#39;s time for a traditional blog post.
&lt;/p&gt;

&lt;p&gt;
As usual, it was quite busy at work this year - many different customers, different tasks on different topics, and a lot of different internal activities. Although I got the opportunity to concentrate more on team upskilling (internal presentations, trainings, etc.), development and maintenance of reusable assets, development of different tooling for migrations, organizing/overseeing work of my colleagues, working more closely with different product teams, etc.  And this year, after five years at Databricks, I was promoted to Principal SSA, and I&#39;m very thankful to my managers for their support throughout that journey.
&lt;/p&gt;

&lt;p&gt;
Early this year, we released the &lt;a href=&quot;https://databrickslabs.github.io/dqx/&quot;&gt;DQX project&lt;/a&gt; into Databricks Labs (thank you, Marcin &amp;amp; the team) - the data quality library originally written almost five years ago.  We were really surprised by how fast customers started to adopt it in their data processing pipelines. This growth allowed us to invest even more time in developing the new functionality.  We are also working directly with the Data Quality monitoring product team to ensure that the official product incorporates all learnings from the field.  I even had the opportunity to be &lt;a href=&quot;https://www.databricks.com/dataaisummit/session/elevating-data-quality-standards-databricks-dqx&quot;&gt;on stage at the Data and AI summit with Marcin and Neha (a big thank you!)&lt;/a&gt; talking about DQX.
&lt;/p&gt;

&lt;p&gt;
For the first time, I visited the Data and AI summit in San Francisco.  It was a very interesting experience, talking with so many people in different formats (braindates, customer product meetings, booth, …).  Although I feel that I needed at least one more week to catch up with my colleagues :-)
&lt;/p&gt;

&lt;p&gt;
This year, the work on Terraform continued in different forms.  ~250 pull requests were merged into Databricks Terraform provider - new functionality, bug fixes, etc. A lot of work was done on the &lt;a href=&quot;https://registry.terraform.io/providers/databricks/databricks/latest/docs/guides/experimental-exporter&quot;&gt;Terraform exporter&lt;/a&gt;, which is heavily used by Databricks customers for migrations, disaster recovery, or to start their own Terraform journey (I need to write a separate blog post about the exporter and the challenges of reconstructing deployed resources).  Besides internal trainings, Vuong and I recorded a &lt;a href=&quot;https://customer-academy.databricks.com/learn/course/4264/managing-databricks-at-scale-using-terraform&quot;&gt;webinar about using Terraform&lt;/a&gt; for deploying Databricks resources at scale (it&#39;s already available in the customer academy - you can watch it even as a user of Free Edition).  I even got recognized by Hashicorp as a Hashicorp Core Contributor 2025 - primarily for our work that I described in a &lt;a href=&quot;https://alexott.blogspot.com/2024/12/working-with-huge-terraform-states.html&quot;&gt;separate blog post&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
Adoption of LLMs for work significantly grew this year - it went from occasional use of Copilot for programming to use of a mix of Claude &amp;amp; Cursor for programming,  Glean, Perplexity, and custom agents for working with documents, more efficient search for information, understanding new stuff, etc.  On the programming side, I often feel like in this &lt;a href=&quot;https://www.facebook.com/groups/it.humor.and.memes/posts/33558693033729726/&quot;&gt;meme&lt;/a&gt; (even as I learn new stuff, I&#39;m still far from being a very advanced user):
&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPkffg_RbbSFSlHxlfM7dCCvp5WTdAzpTJQ4brK55Vpkcz_SD27nBbZZ15yrJ9bWuiivXGH3LckF_ObHhcRua49yUzXzCYZ1T6MgtlUWOWVbA4XNGd1UE-WNhyphenhyphenzhyphenhyphenRiPYlI0_2TsJtl7yJdlO3nLjWZ92r3uyY3uCq96ZjjU6a9roXemc21bxFNA/s499/1.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;327&quot; data-original-width=&quot;499&quot; height=&quot;263&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPkffg_RbbSFSlHxlfM7dCCvp5WTdAzpTJQ4brK55Vpkcz_SD27nBbZZ15yrJ9bWuiivXGH3LckF_ObHhcRua49yUzXzCYZ1T6MgtlUWOWVbA4XNGd1UE-WNhyphenhyphenzhyphenhyphenRiPYlI0_2TsJtl7yJdlO3nLjWZ92r3uyY3uCq96ZjjU6a9roXemc21bxFNA/w400-h263/1.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
Agents allow me to concentrate on writing a specification, helping with researching a specific topic, doing the code reviews, and offloading the boring stuff, like writing tests to an agent (or a swarm of different agents).  With new LLM tools, I&#39;ve significantly reduced the number of open TODO items, and some of them were quite complex, so I was always waiting to find more time to work on them.  The new tools helped me build a lot of new functionality in the Terraform exporter. E.g., implementing support for the plugin framework allowed me to increase exported resource coverage to almost all resources available in the Terraform provider, or LLMs helped me implement functionality for cross-cloud resource migration, rewriting cloud attributes and instance types.  Besides Terraform work, it helped me a lot in designing and writing code in other areas - cybersecurity-related, migration tooling, etc.  In the new year, I plan to continue investing in learning new patterns to enhance my work efficiency.
&lt;/p&gt;

&lt;p&gt;
Cybersecurity is my favorite topic, especially when it comes to big data.  This year, we continued to help customers adopt Databricks for their cybersecurity needs.  And we see more and more customers doing that at scale - you can watch a number of presentations at the Data and AI summit on the topic of cybersecurity.  And new product features, like &lt;a href=&quot;https://alexott.blogspot.com/2025/03/effective-use-of-latest-dlt-features.html&quot;&gt;new stuff in declarative pipelines&lt;/a&gt; help to implement use cases faster and more efficiently.  Another significant topic we observe is the adoption of LLMs and Agents for cybersecurity use cases on Databricks.  And some of the results are very impressive - AgentBricks in combination with Genie allows not only to understand what happens from the data, but also to generate mitigation procedures based on existing runbooks, or even call mitigation tools automatically.&amp;nbsp;&lt;/p&gt;&lt;p&gt;&amp;nbsp;&lt;/p&gt;


&lt;p&gt;
I wish everyone a healthy and prosperous New Year!
&lt;/p&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/5921215364937510703/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/5921215364937510703' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/5921215364937510703'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/5921215364937510703'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2025/12/traditional-new-year-post-2025th-edition.html' title='Traditional New Year post, 2025th edition'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPkffg_RbbSFSlHxlfM7dCCvp5WTdAzpTJQ4brK55Vpkcz_SD27nBbZZ15yrJ9bWuiivXGH3LckF_ObHhcRua49yUzXzCYZ1T6MgtlUWOWVbA4XNGd1UE-WNhyphenhyphenzhyphenhyphenRiPYlI0_2TsJtl7yJdlO3nLjWZ92r3uyY3uCq96ZjjU6a9roXemc21bxFNA/s72-w400-h263-c/1.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-7155070382133752516</id><published>2025-05-29T12:38:00.001+02:00</published><updated>2025-07-03T09:07:11.749+02:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="databricks"/><category scheme="http://www.blogger.com/atom/ns#" term="delta live tables"/><category scheme="http://www.blogger.com/atom/ns#" term="dlt"/><category scheme="http://www.blogger.com/atom/ns#" 
term="eventhubs"/><title type='text'>Delta Live Tables recipes: Consuming from Azure Event Hubs using Unity Catalog Service Credentials</title><content type='html'>&lt;p&gt;
I &lt;a href=&quot;https://alexott.blogspot.com/search/label/eventhubs&quot;&gt;wrote previously&lt;/a&gt; on different methods of connection from Delta Live Tables to Azure Event Hubs, but both of them suffer from a common problem - they need either a service principal secret or a Shared Access Signature (SAS), which are long-living credentials that could be potentially leaked and used outside of the pipeline.
&lt;/p&gt;

&lt;p&gt;
Several months ago, Databricks introduced &lt;a href=&quot;https://learn.microsoft.com/en-us/azure/databricks/connect/unity-catalog/cloud-services/service-credentials&quot;&gt;Unity Catalog Service Credentials&lt;/a&gt; that are based on a special type of managed identity called Azure Databricks access connector.  Service credentials allow the generation of short-lived authentication tokens and connection to different Azure services without requiring passwords or other long-lived credentials.  And they are managed by Unity Catalog, so you can limit who can use them, or allow their usage only from specific workspaces.  All of this heavily improves the security posture.
&lt;/p&gt;

&lt;p&gt;
Although we could already connect to different services with generated authentication tokens, we still couldn&#39;t do this in the Kafka connector out of the box, as the authentication flow is handled by Kafka itself.  But recently, this problem was fixed, and now we can authenticate to Azure Event Hubs using &lt;a href=&quot;https://learn.microsoft.com/en-us/azure/databricks/connect/streaming/kafka#service-cred&quot;&gt;UC Service Credentials&lt;/a&gt;.  Support for it is rolled out in Databricks Runtime 16.1+ (serverless support is coming soon), and available in Delta Live Tables preview channel that is based on DBR 16.1 (both serverless and classic compute).
&lt;/p&gt;

&lt;p&gt;
And it&#39;s the easiest way to authenticate to Event Hubs:
&lt;/p&gt;

&lt;ol class=&quot;org-ol&quot;&gt;
&lt;li&gt;Create &lt;a href=&quot;https://learn.microsoft.com/en-us/azure/databricks/connect/unity-catalog/cloud-services/service-credentials&quot;&gt;UC Service Credential&lt;/a&gt; if you don&#39;t have one.&lt;/li&gt;
&lt;li&gt;Assign &lt;a href=&quot;https://learn.microsoft.com/en-us/azure/event-hubs/authorize-access-azure-active-directory#azure-built-in-roles-for-azure-event-hubs&quot;&gt;necessary roles&lt;/a&gt; to it on Event Hubs (i.e., &lt;code&gt;Azure Event Hubs Data receiver&lt;/code&gt;, &lt;code&gt;Azure Event Hubs Data sender&lt;/code&gt;, etc.)&lt;/li&gt;
&lt;li&gt;Specify the service credential name in the&amp;nbsp;&lt;code&gt;databricks.serviceCredential&lt;/code&gt; option when reading or writing data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
&lt;b&gt;That&#39;s all!
&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;
We can check that it works in the notebook attached to a cluster running DBR 16.4:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-python&quot;&gt;&lt;code&gt;&lt;span style=&quot;color: sienna;&quot;&gt;credential_name&lt;/span&gt; = &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;service-credential&quot;&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: sienna;&quot;&gt;eh_server&lt;/span&gt; = &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;&amp;lt;host&amp;gt;.servicebus.windows.net:9093&quot;&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: sienna;&quot;&gt;eh_opts&lt;/span&gt; = {&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;databricks.serviceCredential&quot;&lt;/span&gt;: credential_name,&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;kafka.bootstrap.servers&quot;&lt;/span&gt;: eh_server,&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;subscribe&quot;&lt;/span&gt;: &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;iocs&quot;&lt;/span&gt;,&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;startingOffsets&quot;&lt;/span&gt;: &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;earliest&quot;&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;}&lt;/code&gt;
&lt;code&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: sienna;&quot;&gt;df&lt;/span&gt; = spark.readStream.&lt;span style=&quot;color: darkslateblue;&quot;&gt;format&lt;/span&gt;(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;kafka&quot;&lt;/span&gt;).options(**eh_opts).load()&lt;/code&gt;
&lt;code&gt;display(df.selectExpr(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;CAST(value AS STRING) as value&quot;&lt;/span&gt;))&lt;/code&gt;
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
And we can see the data read from the topic:
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHCfV9CxB_Qsks3fT46UVyiHvz-uCZJ9DSgYTAmimCFg2RxbUyjktXvlj44UkB_nmquydnBzdTxzCtiSeEydlozQiD0Rb1NnQDvmFT66yP9SU9w4ARUJwWaFk9vgjjkswyIjDm3ZvjpkK9wGojkbagVe8ZMXgqqUvDEU8hAu9SAaPqZgZPoZXp9g/s1131/EH-UC-Credentials.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;775&quot; data-original-width=&quot;1131&quot; height=&quot;438&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHCfV9CxB_Qsks3fT46UVyiHvz-uCZJ9DSgYTAmimCFg2RxbUyjktXvlj44UkB_nmquydnBzdTxzCtiSeEydlozQiD0Rb1NnQDvmFT66yP9SU9w4ARUJwWaFk9vgjjkswyIjDm3ZvjpkK9wGojkbagVe8ZMXgqqUvDEU8hAu9SAaPqZgZPoZXp9g/w640-h438/EH-UC-Credentials.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
DLT supports service credentials as well, both for reading data with &lt;code&gt;spark.readStream&lt;/code&gt;, and writing via &lt;a href=&quot;https://learn.microsoft.com/en-us/azure/databricks/dlt/dlt-sinks&quot;&gt;DLT Sinks&lt;/a&gt;:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-python&quot;&gt;&lt;code&gt;&lt;span style=&quot;color: sienna;&quot;&gt;credential_name&lt;/span&gt; = &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;service-credential&quot;&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: sienna;&quot;&gt;eh_server&lt;/span&gt; = &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;&amp;lt;host&amp;gt;.servicebus.windows.net:9093&quot;&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;Read data from Event Hubs&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: forestgreen;&quot;&gt;@dlt.table&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;def&lt;/span&gt; &lt;span style=&quot;color: blue;&quot;&gt;raw_iocs&lt;/span&gt;():&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: sienna;&quot;&gt;eh_opts&lt;/span&gt; = {&lt;/code&gt;
&lt;code&gt;        &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;databricks.serviceCredential&quot;&lt;/span&gt;: credential_name,&lt;/code&gt;
&lt;code&gt;        &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;kafka.bootstrap.servers&quot;&lt;/span&gt;: eh_server,&lt;/code&gt;
&lt;code&gt;        &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;subscribe&quot;&lt;/span&gt;: &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;iocs&quot;&lt;/span&gt;,&lt;/code&gt;
&lt;code&gt;        &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;startingOffsets&quot;&lt;/span&gt;: &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;earliest&quot;&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;    }&lt;/code&gt;
&lt;code&gt;&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: sienna;&quot;&gt;df&lt;/span&gt; = spark.readStream.&lt;span style=&quot;color: darkslateblue;&quot;&gt;format&lt;/span&gt;(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;kafka&quot;&lt;/span&gt;).options(**eh_opts).load()&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #a020f0;&quot;&gt;return&lt;/span&gt; df&lt;/code&gt;
&lt;code&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;Create a write sink&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;dlt.create_sink(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;eventhubs&quot;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;kafka&quot;&lt;/span&gt;, {&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;databricks.serviceCredential&quot;&lt;/span&gt;: credential_name,&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;kafka.bootstrap.servers&quot;&lt;/span&gt;: eh_server,&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;topic&quot;&lt;/span&gt;: &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;altest&quot;&lt;/span&gt;,&lt;/code&gt;
&lt;code&gt;  }&lt;/code&gt;
&lt;code&gt;)&lt;/code&gt;
&lt;code&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;Actual data writer&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: forestgreen;&quot;&gt;@dlt.append_flow&lt;/span&gt;(&lt;/code&gt;
&lt;code&gt;    name=&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;write_back&quot;&lt;/span&gt;,&lt;/code&gt;
&lt;code&gt;    target=&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;eventhubs&quot;&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;)&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;def&lt;/span&gt; &lt;span style=&quot;color: blue;&quot;&gt;write_back&lt;/span&gt;():&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: sienna;&quot;&gt;df&lt;/span&gt; = dlt.read_stream(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;raw_iocs&quot;&lt;/span&gt;).select(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;value&quot;&lt;/span&gt;)&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #a020f0;&quot;&gt;return&lt;/span&gt; df&lt;/code&gt;
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
And if we run that pipeline, we&#39;ll see both read and written data (visible in the Azure portal):
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrROOA3Ccm1himzowIUgKcYFCJs0rFBlOrEQHVcfAmNpgZ7EyNqPixmoivMnIsLL1WHUs5tLMe4OWKTRuGfjWd9fXMrQCldyRSVadvltKS8plyz5mAReL5w4geNypDUk-tQtMyPYiHYt_ktz14g8Ud0T0pEPK1xDST7w3pmuIsBuFI7F-OE2qBCQ/s1536/EH-UC-Credentials-DLT.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;1037&quot; data-original-width=&quot;1536&quot; height=&quot;432&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrROOA3Ccm1himzowIUgKcYFCJs0rFBlOrEQHVcfAmNpgZ7EyNqPixmoivMnIsLL1WHUs5tLMe4OWKTRuGfjWd9fXMrQCldyRSVadvltKS8plyz5mAReL5w4geNypDUk-tQtMyPYiHYt_ktz14g8Ud0T0pEPK1xDST7w3pmuIsBuFI7F-OE2qBCQ/w640-h432/EH-UC-Credentials-DLT.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
So, if you need to connect to Azure Event Hubs from Databricks, I recommend starting to use service credentials instead of service principal or SAS authentication.
&lt;/p&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/7155070382133752516/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/7155070382133752516' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/7155070382133752516'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/7155070382133752516'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2025/05/delta-live-tables-recipes-consuming.html' title='Delta Live Tables recipes: Consuming from Azure Event Hubs using Unity Catalog Service Credentials'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHCfV9CxB_Qsks3fT46UVyiHvz-uCZJ9DSgYTAmimCFg2RxbUyjktXvlj44UkB_nmquydnBzdTxzCtiSeEydlozQiD0Rb1NnQDvmFT66yP9SU9w4ARUJwWaFk9vgjjkswyIjDm3ZvjpkK9wGojkbagVe8ZMXgqqUvDEU8hAu9SAaPqZgZPoZXp9g/s72-w640-h438-c/EH-UC-Credentials.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-2150254748307239520</id><published>2025-03-03T10:11:00.003+01:00</published><updated>2025-07-03T09:07:18.340+02:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="cybersecurity"/><category scheme="http://www.blogger.com/atom/ns#" term="databricks"/><category scheme="http://www.blogger.com/atom/ns#" term="delta live 
tables"/><category scheme="http://www.blogger.com/atom/ns#" term="dlt"/><title type='text'>Efficient use of the latest DLT features for cybersecurity use cases</title><content type='html'>&lt;p&gt;
In cybersecurity, everything starts with the collection and processing of data from multiple data sources.  These data should be parsed, and then converted into a normalized form, matching some common information model, such as, &lt;a href=&quot;https://ocsf.io/&quot;&gt;OCSF (Open Cybersecurity Schema Framework)&lt;/a&gt;.  Typically, the data is organized into several categories, such as Network activity, Identity and Access Management, etc.  After that, this data could be used for threat hunting or performing automated detection and response - the common schema helps a lot because we can apply the same detections and queries to data from multiple data sources.  We can depict that activity as follows:
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1dcpFmeejhbIHZp3XLKVVe3mKSsTzK3iA-gol8o35z-GrK0N-AQDFUGI9KfIqARkXSAmYMcbbaOrFRVVuKvgl7bpJO5vB3DBo02FnD3Dg9GPn2Vr9IZx4DG_6kh-SBE3f8fPOHnjQ_W0XOIBU079zmgC15NMn7iPF-liEymSEWGghyU6GkmVBdQ/s1237/cyber-pipeline-general.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;655&quot; data-original-width=&quot;1237&quot; height=&quot;338&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1dcpFmeejhbIHZp3XLKVVe3mKSsTzK3iA-gol8o35z-GrK0N-AQDFUGI9KfIqARkXSAmYMcbbaOrFRVVuKvgl7bpJO5vB3DBo02FnD3Dg9GPn2Vr9IZx4DG_6kh-SBE3f8fPOHnjQ_W0XOIBU079zmgC15NMn7iPF-liEymSEWGghyU6GkmVBdQ/w640-h338/cyber-pipeline-general.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
It all looks nice in the picture, but implementing efficient and scalable data ingestion and normalization at scale is quite a challenge because we need to handle dozens of different data sources that often use different data formats, and there are spikes in the log volumes, e.g., when people come to the office. Plus, we need to be cost-efficient and have a good balance between the amount of provisioned resources and data processing latency.  Very often, when we use Apache Spark on Databricks to process security logs, people try to combine multiple streaming pipelines inside a single job to get more efficient cluster resource usage, but this leads to more complexity due to the need to handle dependencies between multiple streams, handle errors, and restart individual streams inside the single job.
&lt;/p&gt;

&lt;p&gt;
&lt;a href=&quot;https://www.databricks.com/product/data-engineering/delta-live-tables&quot;&gt;Delta Live Tables (DLT)&lt;/a&gt; is a great tool for data ingestion, transformation, and normalization.  The declarative nature of DLT pipelines makes it easier to write data processing pipelines - very often we can come up with some generic implementations driven by a config. &lt;a href=&quot;https://www.databricks.com/blog/2022/12/08/build-reliable-and-cost-effective-streaming-data-pipelines.html&quot;&gt;Enhanced Autoscaling&lt;/a&gt; allows handling data spikes by automatically scaling clusters up and down, providing the right balance between cost and data processing latency. DLT is also well integrated with &lt;a href=&quot;https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/&quot;&gt;Databricks Auto Loader&lt;/a&gt; to efficiently ingest data in different formats from cloud storage. Other features, such as expectations, automatic maintenance, and simplified observability, allow us to simplify maintenance and make sure that we have correct data in our tables. 
&lt;/p&gt;

&lt;p&gt;
For some time, DLT had some limitations that required careful planning of an implementation.  For example:
&lt;/p&gt;

&lt;ol class=&quot;org-ol&quot;&gt;
&lt;li&gt;Tables created by the DLT pipeline are fully owned by a specific pipeline, and it was not possible to write to the same table from other pipelines. This made work with normalized data more complex, as only a single pipeline could be used for writing to it.&lt;/li&gt;
&lt;li&gt;We were able to write data only to Delta tables, so if we wanted to push detection data to external destinations (Kafka, Splunk, Microsoft Sentinel, etc.), we had to have a separate job that just read new data and wrote it into the corresponding destination, or tinker with &lt;code&gt;mapInPandas&lt;/code&gt; and similar things.&lt;/li&gt;
&lt;li&gt;All tables created by the DLT pipeline were stored under the same schema. This made it more complex to maintain permissions, as we typically give wide access only to normalized data (gold), leaving access to bronze (raw data) and silver (decoded, but not normalized data) layers only to a smaller audience (i.e., data engineers).&lt;/li&gt;
&lt;li&gt;When multiple source directories were present for the same dataset, we needed to combine them into a single stream using the &lt;code&gt;union&lt;/code&gt; function, but Spark Structured Streaming has some specific rules about adding and removing new stream sources, and it was not easy to handle that correctly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
However, during the last year, the DLT product team implemented a lot of new functionality and reduced the number of limitations, making it much easier to develop complex data processing pipelines. I prepared a demo of all of this new stuff. You can find full source code and setup instructions in the &lt;a href=&quot;https://github.com/alexott/databricks-cybersecurity-playground/tree/main/dlt_modern_stuff&quot;&gt;repository&lt;/a&gt;.
&lt;/p&gt;

&lt;h3 id=&quot;orgd4d4a22&quot;&gt;Writing to the same table from multiple streams&lt;/h3&gt;

&lt;p&gt;
Let&#39;s start with the last item - this problem was fixed more than half a year ago with the introduction of &lt;a href=&quot;https://docs.databricks.com/aws/en/delta-live-tables/flows&quot;&gt;append flows&lt;/a&gt;.  With append flows you can easily add or remove data sources that are used to populate a defined streaming table without the need to do a full refresh - this is especially useful when source data has a short retention period.  It&#39;s very easy to use - you just define a destination table, and then define one or more functions that will be used as append flows. It also combines well with the meta-programming approach, allowing you to define a function that will return an append flow parameterized by something, e.g., the source path for the data (the only requirement is that each flow should have a unique name).  For example, here is how we can &lt;a href=&quot;https://github.com/alexott/databricks-cybersecurity-playground/blob/main/dlt_modern_stuff/src/ingest_apache_web.py&quot;&gt;ingest and parse log data&lt;/a&gt; for Apache and Nginx web servers that use the same log file format:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-python&quot;&gt;&lt;code&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;Define the streaming table to which we&#39;ll write&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;dlt.create_streaming_table(&lt;/code&gt;
&lt;code&gt;    name=apache_web_table_name,&lt;/code&gt;
&lt;code&gt;    comment=&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;Table for data parsed from Apache HTTP server-compatible logs&quot;&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;)&lt;/code&gt;
&lt;code&gt;&lt;/code&gt;
&lt;code&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;def&lt;/span&gt; &lt;span style=&quot;color: blue;&quot;&gt;read_apache_web&lt;/span&gt;(&lt;span style=&quot;color: darkslateblue;&quot;&gt;input&lt;/span&gt;: &lt;span style=&quot;color: darkslateblue;&quot;&gt;str&lt;/span&gt;, add_opts: Optional[&lt;span style=&quot;color: darkslateblue;&quot;&gt;dict&lt;/span&gt;] = &lt;span style=&quot;color: darkcyan;&quot;&gt;None&lt;/span&gt;) -&amp;gt; DataFrame:&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;read data, parse, and convert into the DataFrame&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #a020f0;&quot;&gt;return&lt;/span&gt; df&lt;/code&gt;
&lt;code&gt;&lt;/code&gt;
&lt;code&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;def&lt;/span&gt; &lt;span style=&quot;color: blue;&quot;&gt;create_apache_web_flow&lt;/span&gt;(&lt;span style=&quot;color: darkslateblue;&quot;&gt;input&lt;/span&gt;: &lt;span style=&quot;color: darkslateblue;&quot;&gt;str&lt;/span&gt;, add_opts: Optional[&lt;span style=&quot;color: darkslateblue;&quot;&gt;dict&lt;/span&gt;] = &lt;span style=&quot;color: darkcyan;&quot;&gt;None&lt;/span&gt;):&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #3a5fcd;&quot;&gt;@dlt.append_flow&lt;/span&gt;(&lt;/code&gt;
&lt;code&gt;        name=f&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;apache_web_&lt;/span&gt;{sanitize_string_for_flow_name(&lt;span style=&quot;color: darkslateblue;&quot;&gt;input&lt;/span&gt;)}&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;&lt;/span&gt;,&lt;/code&gt;
&lt;code&gt;        target=apache_web_table_name,&lt;/code&gt;
&lt;code&gt;        comment=f&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;Ingesting from &lt;/span&gt;{&lt;span style=&quot;color: darkslateblue;&quot;&gt;input&lt;/span&gt;}&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;&lt;/span&gt;,&lt;/code&gt;
&lt;code&gt;    )&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #a020f0;&quot;&gt;def&lt;/span&gt; &lt;span style=&quot;color: blue;&quot;&gt;flow&lt;/span&gt;():&lt;/code&gt;
&lt;code&gt;        &lt;span style=&quot;color: #a020f0;&quot;&gt;return&lt;/span&gt; read_apache_web(&lt;span style=&quot;color: darkslateblue;&quot;&gt;input&lt;/span&gt;, add_opts)&lt;/code&gt;
&lt;code&gt;&lt;/code&gt;
&lt;code&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;Handling of Apache Web logs&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;create_apache_web_flow(apache_web_input)&lt;/code&gt;
&lt;code&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;Handling of NGINX logs (compatible with Apache Web)&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;create_apache_web_flow(nginx_input)&lt;/code&gt;
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
If we run this code, then we&#39;ll see a single table inside the graph, but if we select it, and navigate to the Flows tab, then we&#39;ll see that it&#39;s populated from two data sources:
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh85OmHGDHTy5Ae4Njo2k-eYg6VPEVmJZDSSzhneloYHg6_Ge7n7nPpi2qE7gygM5_DVlJrLdkfhV9C8Sf_OVITouVt9C4XfQr_g9HlcuPfv2XmMqll0liTRRJydrgZ9-h5XHpjNeykHmyIQkERapPcGti8S7xS1IqrcK6GT_4-O42B2CkRyy0XIg/s861/cyber-pipeline-append-flows.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;861&quot; data-original-width=&quot;565&quot; height=&quot;640&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh85OmHGDHTy5Ae4Njo2k-eYg6VPEVmJZDSSzhneloYHg6_Ge7n7nPpi2qE7gygM5_DVlJrLdkfhV9C8Sf_OVITouVt9C4XfQr_g9HlcuPfv2XmMqll0liTRRJydrgZ9-h5XHpjNeykHmyIQkERapPcGti8S7xS1IqrcK6GT_4-O42B2CkRyy0XIg/w420-h640/cyber-pipeline-append-flows.png&quot; width=&quot;420&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;h3&gt;Publishing to tables in different UC catalogs and schemas&lt;/h3&gt;

&lt;p&gt;
Previously, when we created tables inside the DLT pipeline, they were all stored under the UC schema configured in the &lt;code&gt;catalog&lt;/code&gt; and &lt;code&gt;target&lt;/code&gt; configuration parameters. However, this wasn&#39;t always desired, as it made permissions management more complex, especially when a single DLT pipeline produced tables for all layers of &lt;a href=&quot;https://www.databricks.com/glossary/data-lakehouse&quot;&gt;Lakehouse architecture&lt;/a&gt; (bronze/silver/gold).
&lt;/p&gt;

&lt;p&gt;
But now &lt;a href=&quot;https://docs.databricks.com/aws/en/delta-live-tables/target-schema#target-a-dataset-in-a-different-catalog-or-schema&quot;&gt;it&#39;s possible to specify where a specific table goes&lt;/a&gt; - it depends on the table name: if the table has a simple name, it will be put into &lt;code&gt;&amp;lt;default-catalog&amp;gt;.&amp;lt;default-schema&amp;gt;.&amp;lt;name&amp;gt;&lt;/code&gt; (the default catalog and schema are defined on the pipeline level); it will go into &lt;code&gt;&amp;lt;default-catalog&amp;gt;.&amp;lt;schema&amp;gt;.&amp;lt;name&amp;gt;&lt;/code&gt; if it has the form &lt;code&gt;&amp;lt;schema&amp;gt;.&amp;lt;name&amp;gt;&lt;/code&gt;; or we can use a fully qualified name like &lt;code&gt;&amp;lt;catalog&amp;gt;.&amp;lt;schema&amp;gt;.&amp;lt;name&amp;gt;&lt;/code&gt;. In any case, it&#39;s best to avoid hardcoding catalog and schema names, and instead either rely on the catalog and schema names specified on the pipeline level or pass catalog and schema names as configuration of the pipeline.
&lt;/p&gt;

&lt;p&gt;
In the provided project I&#39;m passing catalog and schema names for normalized data explicitly, and then forming the fully qualified name like this (it&#39;s done by &lt;code&gt;get_qualified_table_name&lt;/code&gt; function that I defined in &lt;code&gt;helpers.py&lt;/code&gt;):
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-python&quot;&gt;&lt;code&gt;&lt;span style=&quot;color: sienna;&quot;&gt;name&lt;/span&gt; = &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;test&quot;&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: sienna;&quot;&gt;catalog&lt;/span&gt; = spark.conf.get(&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;gold_catalog&quot;&lt;/span&gt;)&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: sienna;&quot;&gt;schema&lt;/span&gt; = spark.conf.get(&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;gold_schema&quot;&lt;/span&gt;)&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: sienna;&quot;&gt;table_name&lt;/span&gt; = f&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;&lt;/span&gt;{catalog}&lt;span style=&quot;color: #008b00;&quot;&gt;.&lt;/span&gt;{schema}&lt;span style=&quot;color: #008b00;&quot;&gt;.&lt;/span&gt;{name}&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;&lt;/span&gt;&lt;/code&gt;
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
while for silver tables I use catalog and schema configured on the pipeline level:
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjngaa2ZL2i1Y6WhiVtn62yt7NZv8E_80LFv2v8nw_bhub2molSIa7Px9_p7Z5jUH_qV3jTowDjPet0v9fsbxbIO-UVrizaGYndgq-CyhDhfkyf5r08gOvTKemnJFvxJv2DiyQ9imjJwz6DqUlVubsBzWVB66iDZnoRiT6GhnPjzBU_pVop3Bcnkg/s749/cyber-pipeline-dpm.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;405&quot; data-original-width=&quot;749&quot; height=&quot;346&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjngaa2ZL2i1Y6WhiVtn62yt7NZv8E_80LFv2v8nw_bhub2molSIa7Px9_p7Z5jUH_qV3jTowDjPet0v9fsbxbIO-UVrizaGYndgq-CyhDhfkyf5r08gOvTKemnJFvxJv2DiyQ9imjJwz6DqUlVubsBzWVB66iDZnoRiT6GhnPjzBU_pVop3Bcnkg/w640-h346/cyber-pipeline-dpm.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;h3&gt;Using sinks to write to non-Delta destinations or from multiple pipelines&lt;/h3&gt;

&lt;p&gt;
The first two items from the list of limitations above were the most critical for cybersecurity use cases - it should be easy to write to normalized tables from multiple pipelines or write to external systems without additional jobs or complex code.
&lt;/p&gt;

&lt;p&gt;
And now it&#39;s possible with the recently released &lt;a href=&quot;https://docs.databricks.com/aws/en/delta-live-tables/dlt-sinks&quot;&gt;DLT Sinks API&lt;/a&gt; - you can define a sink object pointing to a Delta table defined outside of the pipeline, or even to another data format supporting streaming writes, such as Kafka, or custom data sources.  The usage is very similar to append flows - instead of a streaming table, you define a sink object, and then use it as a target for append flow functions.   For example, here is how we can write to the same Delta table from multiple pipelines:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-python&quot;&gt;&lt;code&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;Create a sink&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: sienna;&quot;&gt;sink_name&lt;/span&gt; = &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;http_normalized&quot;&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;dlt.create_sink(sink_name, &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;delta&quot;&lt;/span&gt;, {&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;tableName&quot;&lt;/span&gt;: &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;my_catalog.my_schema.http&quot;&lt;/span&gt;,&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;mergeSchema&quot;&lt;/span&gt;: &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;true&quot;&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;  }&lt;/code&gt;
&lt;code&gt;)&lt;/code&gt;
&lt;code&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;This is in one pipeline&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: #3a5fcd;&quot;&gt;@dlt.append_flow&lt;/span&gt;(&lt;/code&gt;
&lt;code&gt;    name=&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;apache_web_normalized&quot;&lt;/span&gt;, &lt;/code&gt;
&lt;code&gt;    target=sink_name&lt;/code&gt;
&lt;code&gt;)&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;def&lt;/span&gt; &lt;span style=&quot;color: blue;&quot;&gt;write_normalized&lt;/span&gt;():&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: sienna;&quot;&gt;df&lt;/span&gt; = dlt.read_stream(apache_web_table_name)&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: sienna;&quot;&gt;df&lt;/span&gt; = ... &lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;transform data to a normalized form&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #a020f0;&quot;&gt;return&lt;/span&gt; df&lt;/code&gt;
&lt;code&gt;&lt;/code&gt;
&lt;code&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;This is in another pipeline&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: #3a5fcd;&quot;&gt;@dlt.append_flow&lt;/span&gt;(&lt;/code&gt;
&lt;code&gt;    name=&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;zeek_normalized&quot;&lt;/span&gt;, &lt;/code&gt;
&lt;code&gt;    target=sink_name&lt;/code&gt;
&lt;code&gt;)&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;def&lt;/span&gt; &lt;span style=&quot;color: blue;&quot;&gt;write_normalized&lt;/span&gt;():&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: sienna;&quot;&gt;df&lt;/span&gt; = dlt.read_stream(zeek_http_table_name)&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: sienna;&quot;&gt;df&lt;/span&gt; = ... &lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;transform data to a normalized form&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #a020f0;&quot;&gt;return&lt;/span&gt; df&lt;/code&gt;
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
If we want to write to non-Delta destinations, we need to provide all the necessary options for that connector.  For example, here is how you can define a sink for Azure Event Hubs using the Kafka connector bundled with DLT and then write to it using the append flow:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-python&quot;&gt;&lt;code&gt;dlt.create_sink(&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;alerts_eventhubs&quot;&lt;/span&gt;,&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;kafka&quot;&lt;/span&gt;,&lt;/code&gt;
&lt;code&gt;    { &lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;Create Kafka options dictionary for connection with OAuth authentication&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;        &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;kafka.bootstrap.servers&quot;&lt;/span&gt;: f&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;&lt;/span&gt;{eh_server}&lt;span style=&quot;color: #008b00;&quot;&gt;:9093&quot;&lt;/span&gt;,&lt;/code&gt;
&lt;code&gt;        &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;topic&quot;&lt;/span&gt;: &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;alerts&quot;&lt;/span&gt;,&lt;/code&gt;
&lt;code&gt;        ....&lt;/code&gt;
&lt;code&gt;    }&lt;/code&gt;
&lt;code&gt;)&lt;/code&gt;
&lt;code&gt;&lt;/code&gt;
&lt;code&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: #3a5fcd;&quot;&gt;@dlt.append_flow&lt;/span&gt;(&lt;/code&gt;
&lt;code&gt;    name=&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;alerts&quot;&lt;/span&gt;, &lt;/code&gt;
&lt;code&gt;    target=&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;alerts_eventhubs&quot;&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;)&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;def&lt;/span&gt; &lt;span style=&quot;color: blue;&quot;&gt;write_alerts&lt;/span&gt;():&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: sienna;&quot;&gt;df&lt;/span&gt; = dlt.read_stream(detections_table_name)&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: sienna;&quot;&gt;df&lt;/span&gt; = ... &lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;transform data to a format supported by Kafka&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #a020f0;&quot;&gt;return&lt;/span&gt; df&lt;/code&gt;
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
And when it&#39;s executed, we can see the data pushed to Azure Event Hubs topic:
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgShvZcX16Kw9CvtiOLsg_VrWEsRH4b-VNOoh8y-HQVQAODtyFtVnXFYpEoSV3PE7u0x0dKxh3UYCPFlBdbBdHbpU0UIfhEwQ1ufZtN3E6ZLIo3QlJPLUM7UWgzh2ZHQDzIIRF9WEQOQ7T-9WSuGJHeWShp1YwL4FpiWbS8gjS4bZUujQ8GuFK3cg/s1133/cyber-pipeline-eventhubs.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;313&quot; data-original-width=&quot;1133&quot; height=&quot;176&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgShvZcX16Kw9CvtiOLsg_VrWEsRH4b-VNOoh8y-HQVQAODtyFtVnXFYpEoSV3PE7u0x0dKxh3UYCPFlBdbBdHbpU0UIfhEwQ1ufZtN3E6ZLIo3QlJPLUM7UWgzh2ZHQDzIIRF9WEQOQ7T-9WSuGJHeWShp1YwL4FpiWbS8gjS4bZUujQ8GuFK3cg/w640-h176/cyber-pipeline-eventhubs.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;Technically, the sinks implemented using custom data source APIs (&lt;a href=&quot;https://alexott.blogspot.com/2024/11/spark-custom-data-sources-and-sinks-for.html&quot;&gt;I wrote about them a few months ago&lt;/a&gt;) are supported as well, although there are still limitations related to their support on serverless and to custom libraries inside UC UDFs; they should be fixed soon.  But this code works on a non-serverless pipeline:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-python&quot;&gt;&lt;code&gt;&lt;span style=&quot;color: sienna;&quot;&gt;splunk_opts&lt;/span&gt; = {&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;url&quot;&lt;/span&gt;: &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;http://10.1.0.6:8088/services/collector/event&quot;&lt;/span&gt;,&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;token&quot;&lt;/span&gt;: &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;splunk_hec_token&quot;&lt;/span&gt;,&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;time_column&quot;&lt;/span&gt;: &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;detection_time&quot;&lt;/span&gt;,&lt;/code&gt;
&lt;code&gt;}&lt;/code&gt;
&lt;code&gt;dlt.create_sink(&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;splunk&quot;&lt;/span&gt;, &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;splunk&quot;&lt;/span&gt;, splunk_opts)&lt;/code&gt;
&lt;code&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: #3a5fcd;&quot;&gt;@dlt.append_flow&lt;/span&gt;(name = &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;write_to_splunk&quot;&lt;/span&gt;, target = &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;splunk&quot;&lt;/span&gt;)&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;def&lt;/span&gt; &lt;span style=&quot;color: blue;&quot;&gt;flowFunc&lt;/span&gt;():&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: #a020f0;&quot;&gt;return&lt;/span&gt; dlt.read_stream(detections_table_name)&lt;/code&gt;
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
And we can see the data in the Splunk interface:&amp;nbsp; &lt;br /&gt;&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-T_QegmccLo5GXBy6nGh7Fy0819bSRyRYqRVUj6XaI5asYkZcWR3u9anXZKXtKCQ-9zhIFV8rakTmp-Ri12qEC3FceQpWtekR7kCaY9zgjqaQtQxhjwMEwsMuay2UUxlWFjI-2rXaTHFHkpeglpiEExYqfxx9mb3SJg2dYfAqip1NxysrL7qFpQ/s1110/cyber-pipeline-splunk-view.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;500&quot; data-original-width=&quot;1110&quot; height=&quot;288&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-T_QegmccLo5GXBy6nGh7Fy0819bSRyRYqRVUj6XaI5asYkZcWR3u9anXZKXtKCQ-9zhIFV8rakTmp-Ri12qEC3FceQpWtekR7kCaY9zgjqaQtQxhjwMEwsMuay2UUxlWFjI-2rXaTHFHkpeglpiEExYqfxx9mb3SJg2dYfAqip1NxysrL7qFpQ/w640-h288/cyber-pipeline-splunk-view.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;


&lt;h3&gt;Putting it all together&lt;/h3&gt;

&lt;p&gt;
To demonstrate all these things working together, I created a sample &lt;a href=&quot;https://github.com/alexott/databricks-cybersecurity-playground/tree/main/dlt_modern_stuff&quot;&gt;project available on GitHub&lt;/a&gt;. This project consists of three DLT pipelines that perform data ingestion and parsing, normalization of the schema to &lt;a href=&quot;https://schema.ocsf.io/&quot;&gt;Open Cybersecurity Schema Framework (OCSF)&lt;/a&gt;, and doing rudimentary detection against normalized data as it&#39;s shown on the image below:
&lt;/p&gt;

&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;Ingestion of Apache Web and Nginx logs into &lt;code&gt;apache_web&lt;/code&gt; table and then normalizing it into an &lt;code&gt;http&lt;/code&gt; table corresponding to &lt;a href=&quot;https://schema.ocsf.io/1.4.0/classes/http_activity?extensions=&quot;&gt;OCSF&#39;s HTTP activity&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Ingestion of Zeek data:
&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;Zeek HTTP data into &lt;code&gt;zeek_http&lt;/code&gt; table,  and then normalizing it into an &lt;code&gt;http&lt;/code&gt; table corresponding to &lt;a href=&quot;https://schema.ocsf.io/1.4.0/classes/http_activity?extensions=&quot;&gt;OCSF&#39;s HTTP activity&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Zeek Conn data into &lt;code&gt;zeek_conn&lt;/code&gt; table,  and then normalizing it into a &lt;code&gt;network&lt;/code&gt; table corresponding to &lt;a href=&quot;https://schema.ocsf.io/1.4.0/classes/network_activity?extensions=&quot;&gt;OCSF&#39;s Network activity&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Detection pipeline that does the following:
&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;Matches network connections data from &lt;code&gt;network&lt;/code&gt; table against &lt;code&gt;iocs&lt;/code&gt; table (it&#39;s filled with dummy data, just for a demo).&lt;/li&gt;
&lt;li&gt;Checks HTTP logs from &lt;code&gt;http&lt;/code&gt; table for admin page scans from external parties.&lt;/li&gt;
&lt;li&gt;All matches are stored in the &lt;code&gt;detections&lt;/code&gt; table, and optionally pushed to Azure Event Hubs.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDprY7ymcwAv8yPpAnzNc4FPGpMPlyYOfaTdS7dYapTdjQ7dVok9EYWbhryongxB4BqsMF1Jseu3JZ4u78llKlxCWSwUqie2kxi-zpEYaUjdR2v3LJ5j9927smHxXj7qOdjE6qpM7rT-LMIt_962ICJUf8ut3EjP72ZXZ9lRu4j1n2U6sDehyphenhyphenjEg/s1466/cyber-pipeline-impl.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;642&quot; data-original-width=&quot;1466&quot; height=&quot;280&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDprY7ymcwAv8yPpAnzNc4FPGpMPlyYOfaTdS7dYapTdjQ7dVok9EYWbhryongxB4BqsMF1Jseu3JZ4u78llKlxCWSwUqie2kxi-zpEYaUjdR2v3LJ5j9927smHxXj7qOdjE6qpM7rT-LMIt_962ICJUf8ut3EjP72ZXZ9lRu4j1n2U6sDehyphenhyphenjEg/w640-h280/cyber-pipeline-impl.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
Follow the instructions in &lt;a href=&quot;https://github.com/alexott/databricks-cybersecurity-playground/blob/main/dlt_modern_stuff/README.md&quot;&gt;README&lt;/a&gt; to deploy, set up, and run the code.
&lt;/p&gt;

&lt;p&gt;
The execution graph for ingestion of Apache Web logs is quite simple - silver + normalized table:
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqHANebFLVV8FH5YwTP409f2c8IuQckyGBTL0-74IsZdq0TcwsXSvCfJ63Bqbk8xxqV3uLB7139N4uoLJ_AnA2awaK_8s1WKuDxQ_Cd6NGERG-jPCqsaY4BKs5TLn_0T4Ji_SGmaBSmtr86HXkzXDfZGKfCQ9NYEV35ky-Q5ckwL9dBYpswm-2pA/s742/cyber-pipeline-apache.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;177&quot; data-original-width=&quot;742&quot; height=&quot;152&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqHANebFLVV8FH5YwTP409f2c8IuQckyGBTL0-74IsZdq0TcwsXSvCfJ63Bqbk8xxqV3uLB7139N4uoLJ_AnA2awaK_8s1WKuDxQ_Cd6NGERG-jPCqsaY4BKs5TLn_0T4Ji_SGmaBSmtr86HXkzXDfZGKfCQ9NYEV35ky-Q5ckwL9dBYpswm-2pA/w640-h152/cyber-pipeline-apache.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
The execution graph for Zeek data is a bit more complex, just because we have two different log types and two corresponding normalized tables:
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitKt43rwrSrSPFCFb-rC4daJDCb7az3ROo3-txK6xzNjGeYteTLmIz7Nl7UYAiof53CfgnGsrDcfoOZGHZtSESicdtajfJDJc2xY8BIMTNLhLA0hfpWWumJwSpOLWo24c5nHy0ts6FumCSnMgNpZU3pJzzNNoneMf8AFawm4e8ERJio2TVFIaoYA/s750/cyber-pipeline-zeek.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;387&quot; data-original-width=&quot;750&quot; height=&quot;330&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitKt43rwrSrSPFCFb-rC4daJDCb7az3ROo3-txK6xzNjGeYteTLmIz7Nl7UYAiof53CfgnGsrDcfoOZGHZtSESicdtajfJDJc2xY8BIMTNLhLA0hfpWWumJwSpOLWo24c5nHy0ts6FumCSnMgNpZU3pJzzNNoneMf8AFawm4e8ERJio2TVFIaoYA/w640-h330/cyber-pipeline-zeek.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
And the execution graph for the detections pipeline is also quite simple (the actual view depends on the pipeline configuration):
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7zvXhDct_Ev1iZjWVVWnyqaqKD_dylvOv0fPTkmzOVBZ5OMo74LxvApZd96ACJMZXpB-L6ruKP1izrvsAmUA2wVzRM5Es-EXT0aR6iHPMrYTuYi6ym3itoN5jnv4qM60RtsKhdhZ9FS1u2hCA5WW5Lc34Jwt4Zn2zk5rdOzB38Ua12N6_zJhcxA/s761/cyber-pipeline-detections.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;321&quot; data-original-width=&quot;761&quot; height=&quot;270&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7zvXhDct_Ev1iZjWVVWnyqaqKD_dylvOv0fPTkmzOVBZ5OMo74LxvApZd96ACJMZXpB-L6ruKP1izrvsAmUA2wVzRM5Es-EXT0aR6iHPMrYTuYi6ym3itoN5jnv4qM60RtsKhdhZ9FS1u2hCA5WW5Lc34Jwt4Zn2zk5rdOzB38Ua12N6_zJhcxA/w640-h270/cyber-pipeline-detections.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
The relationships between all objects (UC volumes and tables) across all pipelines are better visible on the data lineage graph:
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgi-znwl04U_CdXCkkFBQJuNiAXafvIRbuXp6Ql7c8czLpbPreoswryqR7qnMcsXYjHdjvoEqhHVp1keWEV954qeLiJWyQDdo_KO3-wtcWgqdBSN_mztu0hf0cz3Z85_Oj8lUOf2E9AeYbhvJe3ChYEYAuAsN3W5SvDcSM1LDO9MkfQaXDTgxR50A/s1462/cyber-pipeline-data-lineage.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;774&quot; data-original-width=&quot;1462&quot; height=&quot;338&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgi-znwl04U_CdXCkkFBQJuNiAXafvIRbuXp6Ql7c8czLpbPreoswryqR7qnMcsXYjHdjvoEqhHVp1keWEV954qeLiJWyQDdo_KO3-wtcWgqdBSN_mztu0hf0cz3Z85_Oj8lUOf2E9AeYbhvJe3ChYEYAuAsN3W5SvDcSM1LDO9MkfQaXDTgxR50A/w640-h338/cyber-pipeline-data-lineage.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;h3&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;
The latest changes in the Delta Live Tables helped simplify the implementation of cybersecurity use cases - try DLT if you&#39;re doing cybersecurity on Databricks!  And more functionality is coming soon, stay tuned!
&lt;/p&gt;

</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/2150254748307239520/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/2150254748307239520' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/2150254748307239520'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/2150254748307239520'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2025/03/effective-use-of-latest-dlt-features.html' title='Efficient use of the latest DLT features for cybersecurity use cases'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1dcpFmeejhbIHZp3XLKVVe3mKSsTzK3iA-gol8o35z-GrK0N-AQDFUGI9KfIqARkXSAmYMcbbaOrFRVVuKvgl7bpJO5vB3DBo02FnD3Dg9GPn2Vr9IZx4DG_6kh-SBE3f8fPOHnjQ_W0XOIBU079zmgC15NMn7iPF-liEymSEWGghyU6GkmVBdQ/s72-w640-h338-c/cyber-pipeline-general.png" height="72" width="72"/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-1266772667387650035</id><published>2024-12-31T17:18:00.003+01:00</published><updated>2024-12-31T17:18:32.647+01:00</updated><title type='text'>Looking back to 2024th</title><content type='html'>&lt;p&gt;
It will be a New Year in a couple of hours and it&#39;s time for the traditional blog post…
&lt;/p&gt;

&lt;p&gt;
From the professional side, it was the &quot;year of Terraform&quot; with a lot of activity around both Databricks Terraform provider and even the core Terraform.  The total of &lt;a href=&quot;https://github.com/databricks/terraform-provider-databricks/graphs/contributors?from=01%2F01%2F2024&quot;&gt;224 my pull requests&lt;/a&gt; were merged into Databricks provider (😱, I really didn&#39;t realize that there were so many…). A lot of PRs were for &lt;a href=&quot;https://registry.terraform.io/providers/databricks/databricks/latest/docs/guides/experimental-exporter&quot;&gt;Terraform Exporter&lt;/a&gt; adding new resources and improving performance/stability, but besides the exporter, there were many new resources and data sources, bug fixes, doc improvements, etc.  And of course, a lot of time was spent on code reviews, issues triage, working with my colleagues on &lt;a href=&quot;https://github.com/databricks/terraform-databricks-examples/&quot;&gt;Databricks Terraform Examples&lt;/a&gt;, and other stuff.  Quite a lot of activity was around enablement:
&lt;/p&gt;
&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;together with &lt;a href=&quot;https://www.linkedin.com/in/vuong-nguyen/&quot;&gt;Vuong Nguyen&lt;/a&gt; we started the year with a public webinar on Terraform best practices (the recording is available in Databricks Academy under the title “Deep Dive into automating your Databricks platform using Terraform”).&lt;/li&gt;
&lt;/ul&gt;
&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;in the middle of the year, I held an internal session for my colleagues in field engineering to discuss more best practices, troubleshooting, etc.&lt;/li&gt;&lt;li&gt;quite a lot of this content went into the Terraform workshop that we conduct with our customers who are interested in deep dives (contact your Databricks account team if you&#39;re interested!).&lt;/li&gt;&lt;li&gt;and some of the content went into public blog posts: &lt;a href=&quot;https://alexott.blogspot.com/2024/08/terraform-vs-databricks-asset-bundles.html&quot;&gt;Terraform vs. Databricks Asset Bundles&lt;/a&gt; (most popular), &lt;a href=&quot;https://alexott.blogspot.com/2024/09/databricks-sdks-vs-cli-vs-rest-apis-vs.html&quot;&gt;Databricks SDKs vs. CLI vs. REST APIs vs. Terraform provider vs. DABs&lt;/a&gt;, and &lt;a href=&quot;https://alexott.blogspot.com/2024/12/working-with-huge-terraform-states.html&quot;&gt;Working with huge Terraform states&lt;/a&gt;.&lt;ul class=&quot;org-ul&quot;&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
In general, quite a lot of effort was spent around automation (CI/CD, DevOps/DataOps/MLOps, …), cloud infrastructure, security, disaster recovery, etc. - all the things that should be in place to have a robust and secure data and ML platform 😜.
&lt;/p&gt;

&lt;p&gt;
Another big part of my work was concentrated on cybersecurity.  Fun fact - I came to Apache Spark more than ten years ago when I worked at McAfee, and we were building a scalable data processing platform for a new product.  At that time we selected Apache Spark because it had more potential (batch, streaming, ml, …) than other solutions (Storm, …), and time showed that we made the right choice.  Many customers realized that cybersecurity is really a big data problem (three Vs - volume/variety/velocity) and it requires the right technology to solve that problem that isn&#39;t really solvable with existing SIEM solutions.  So this year I worked with my colleagues on helping customers build their cybersecurity solutions on top of Databricks - from data ingestion to real-time and ad-hoc threat detection, reporting on cybersecurity data, and applying ML to that data.  And there were not only end customers - my colleagues and I are helping other software companies to build on top of Databricks.  And Apache Spark also evolves, allowing the easier building of integrations for cybersecurity, i.e., allowing the easier building of customer readers/writers, as it was demoed in a blog post &lt;a href=&quot;https://alexott.blogspot.com/2024/11/spark-custom-data-sources-and-sinks-for.html&quot;&gt;Spark custom data sources and sinks for cybersecurity use cases&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
There was a bit more &lt;a href=&quot;https://github.com/alexott?tab=overview&amp;amp;from=2024-12-01&amp;amp;to=2024-12-31&quot;&gt;GitHub activity&lt;/a&gt; compared to the last year - contributions to different OSS projects, cybersecurity-related (&lt;a href=&quot;https://github.com/alexott/cyber-spark-data-connectors&quot;&gt;custom data sources for Spark&lt;/a&gt;, &lt;a href=&quot;https://github.com/alexott/pySigma-backend-databricks&quot;&gt;pySigma backend for Databricks&lt;/a&gt; that allows converting Sigma rules into Apache Spark queries), a lot of examples for different topics, and many one-time contributions, from code fixes to improving documentation, etc. (I hope that I&#39;ll continue OSS contributions next year as well.)
&lt;br /&gt;&lt;/p&gt;&lt;p&gt;
&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEju8Dzasa8D24I4oskaPoSYlQwauG_q-qw2hvPckE93R1lVQyuYhy1dHjZH-N1pakacbLz1DTrxYewSB8oV2mnMp3bl7eq3ZNDjptKTErlB47E0oiBrIgUz_fjw2wdW2W7g3qPD-KLBLbXPzhJFfJfyg2QMfdLi1RLJRUwAvvOcHvTTpjWOWV27pw/s763/Screenshot%202024-12-31%20at%2016.49.46.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;196&quot; data-original-width=&quot;763&quot; height=&quot;165&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEju8Dzasa8D24I4oskaPoSYlQwauG_q-qw2hvPckE93R1lVQyuYhy1dHjZH-N1pakacbLz1DTrxYewSB8oV2mnMp3bl7eq3ZNDjptKTErlB47E0oiBrIgUz_fjw2wdW2W7g3qPD-KLBLbXPzhJFfJfyg2QMfdLi1RLJRUwAvvOcHvTTpjWOWV27pw/w640-h165/Screenshot%202024-12-31%20at%2016.49.46.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;This year was very interesting from a professional standpoint, and I hope that next year will be as well.
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
I wish my readers a happy and prosperous New Year!
&lt;/p&gt;
&lt;br /&gt;</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/1266772667387650035/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/1266772667387650035' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/1266772667387650035'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/1266772667387650035'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2024/12/looking-back-to-2024th.html' title='Looking back to 2024th'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEju8Dzasa8D24I4oskaPoSYlQwauG_q-qw2hvPckE93R1lVQyuYhy1dHjZH-N1pakacbLz1DTrxYewSB8oV2mnMp3bl7eq3ZNDjptKTErlB47E0oiBrIgUz_fjw2wdW2W7g3qPD-KLBLbXPzhJFfJfyg2QMfdLi1RLJRUwAvvOcHvTTpjWOWV27pw/s72-w640-h165-c/Screenshot%202024-12-31%20at%2016.49.46.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-7644938503571773477</id><published>2024-12-27T13:18:00.004+01:00</published><updated>2024-12-27T13:19:22.218+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="terraform"/><title type='text'>Working with huge Terraform states</title><content type='html'>&lt;p&gt;
If you regularly work with Terraform, you should be familiar with the best practices regarding the number of resources in a single state. For example, &lt;a href=&quot;https://cloud.google.com/docs/terraform/best-practices/root-modules#minimize-resources&quot;&gt;Google&#39;s best practices for Terraform&lt;/a&gt; recommend not to include more than 100 resources (and ideally only a few dozen) in a single state.
&lt;/p&gt;

&lt;p&gt;
If you have too many resources in the state it affects many things:
&lt;/p&gt;


&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;Planning takes too long. You usually need to perform a state refresh for
 existing resources to check their presence and configuration and decide
 if any changes should be made, and this will happen even for small 
changes. It typically requires performing API calls that are often 
rate-limited, so you can&#39;t get information fast even with high 
parallelism. (For example, a general limit for Databricks APIs is 30 
requests/seconds, and it&#39;s lower for some APIs.)&amp;nbsp; &lt;br /&gt;&lt;/li&gt;&lt;li&gt;A similar problem is with the &lt;span style=&quot;font-family: verdana;&quot;&gt;&lt;span style=&quot;font-size: x-small;&quot;&gt;apply&lt;/span&gt;&lt;/span&gt; - you will most probably be rate-limited when creating/modifying/deleting resources or getting information via data sources.&lt;/li&gt;
&lt;li&gt;Increased blast radius - it&#39;s harder to review changes in big plans, and if something goes wrong, it may affect all resources.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
With correct code organization, splitting code into modules, etc. we can avoid having too many resources in a single state, but it will be a topic for a separate blog post.
&lt;/p&gt;

&lt;p&gt;
In reality, you cannot always follow the best practices, and you may end up having tens or hundreds of thousands of objects in the same state.  In my own practice, I saw this problem in the following Databricks-related cases:
&lt;/p&gt;

&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;Centralized provisioning of users/groups - typically this happens when existing solutions, such as Microsoft SCIM connector, don&#39;t provide enough flexibility.&lt;/li&gt;
&lt;li&gt;Centralized provisioning of Unity Catalog objects - catalogs, schemas, etc.  Usually, it&#39;s an anti-pattern and is solved by correct code organization, splitting into multiple states, etc.&lt;/li&gt;
&lt;li&gt;Workspace migrations - typically they are done by using the &lt;a href=&quot;https://registry.terraform.io/providers/databricks/databricks/latest/docs/guides/experimental-exporter&quot;&gt;Databricks Terraform Exporter&lt;/a&gt; to generate Terraform code for workspace content, and then applying it to the destination workspace.  However they are usually done once, so the slowness during the plan/apply isn&#39;t a big problem.&lt;/li&gt;
&lt;li&gt;Disaster Recovery (DR) - the usual recommendation is to use Terraform from the beginning to deploy all necessary resources, and follow the best practices on code organization.  But it&#39;s not always possible in some cases, i.e., a customer doesn&#39;t use IaC solutions for deployments, there is a lot of content generated by users working interactively, etc.  In this case, customers try to use Terraform Exporter to periodically generate the code and apply it to a destination workspace. (Native DR implementation is coming soon, so we won&#39;t need this approach anymore).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
During this year I worked on supporting a disaster recovery solution for a Databricks workspace used by more than ten thousand users to interactively develop code using Databricks Notebooks, SQL queries, and dashboards, and all this work should be replicated daily to a backup workspace, including not only the content, but also permissions and other related things.  A typical approach in this case is to use Git repositories to store the code, and then only repositories should be replicated, so we won&#39;t have too many objects in the Terraform state. But in this specific case, the usage of Git was blocked by the customer&#39;s security team, and we needed to replicate ~400k notebooks, plus necessary objects such as directories in the workspace, plus permissions for both notebooks and directories, increasing the state size to approximately 600k resources.
&lt;/p&gt;

&lt;p&gt;
The implemented solution was quite standard:
&lt;/p&gt;

&lt;ol class=&quot;org-ol&quot;&gt;
&lt;li&gt;Use the Terraform exporter to generate Terraform code and associated objects (like files with notebook code).&lt;/li&gt;
&lt;li&gt;Apply changes to the destination workspace by performing plan/apply.  Because the apply direction was almost always primary -&amp;gt; backup and users didn&#39;t work in the backup workspace until the DR event, we saved a lot of time by skipping refresh in the plan.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
Initially, I concentrated on the first item - making the Exporter run as fast as possible, improving the implementation of the incremental export mode, etc. (I need to write a separate blog post about this part).  But we quickly found that the export phase wasn&#39;t the main problem - it was Terraform itself.  We simply couldn&#39;t run plan/apply fast enough to meet the target SLAs even with increased API limits and Terraform execution parallelism. This led us to look into the Terraform code more deeply and make some changes in the Terraform (and OpenTofu) code to improve the performance. 
&lt;/p&gt;

&lt;p&gt;
The major bottlenecks identified were related to the Terraform architecture and implementation:
&lt;/p&gt;
&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;not optimized code - the code in some places is written in a naive style.  In a normal situation, it&#39;s not visible for users because performance problems arise when you&#39;re working with 10th of thousands of resources.  For example, the code in &lt;span style=&quot;font-family: verdana;&quot;&gt;&lt;code&gt;&lt;span style=&quot;font-size: medium;&quot;&gt;AttachResourceConfigTransformer&lt;/span&gt;&lt;/code&gt;&lt;/span&gt; that is called in both &lt;span style=&quot;font-size: medium;&quot;&gt;&lt;code&gt;plan&lt;/code&gt;&lt;/span&gt; and &lt;code&gt;&lt;span style=&quot;font-size: medium;&quot;&gt;apply&lt;/span&gt;&lt;/code&gt; phases had an N&lt;sup&gt;2&lt;/sup&gt; complexity, and with 400k resources it took ~18 hours to execute.  The &lt;a href=&quot;https://github.com/hashicorp/terraform/pull/35088&quot;&gt;fix was relatively small&lt;/a&gt;, but allowed to decrease execution time to just a few minutes.  There are other places where we still have N&lt;sup&gt;2&lt;/sup&gt; complexity, but N there is not a number of resources, but a number of changes, so it was ok for us as a typical change set is just a couple of thousand resources.&lt;/li&gt;
&lt;li&gt;copy the entire state on each change - during the apply phase, each resource instance copies the whole state to make changes in it (deep copy, including every nested value).  With a huge state, occupying dozens or hundred megabytes of memory, it puts a lot of pressure on the Go&#39;s garbage collector, leading to a situation when almost half of the execution time is spent there.  We also identified that there was an extra copy done, leading to the production of more garbage than required.  This extra copying was also &lt;a href=&quot;https://github.com/hashicorp/terraform/commit/e8119cced39bdd9edf0944f35a8c980336c2c6d1&quot;&gt;fixed by the Hashicorp team&lt;/a&gt; decreasing the pressure on garbage collector a bit.  Although the correct solution would be to implement copy-on-write and do it only for affected values, this will be a bigger architectural change, so it was postponed.&lt;/li&gt;
&lt;li&gt;global lock around the state - to make changes to the state a specific resource instance needs to acquire the lock.  In a typical situation, it&#39;s almost not visible to a user, but because it&#39;s coupled with the deep copy (described in the previous item), in case of the huge state this operation is much slower, leading to slow execution even with high execution parallelism.  In our evaluations, we found that with ~600k resources, we can get approximately 2 operations per second, even with parallelism set to the hundred.  Fixing this problem also will require a re-architecture of the Terraform, so we didn&#39;t do anything there.&lt;/li&gt;
&lt;li&gt;checkpointing the state - when the remote state backend is used (as recommended), Terraform performs changes to the state in memory, and then periodically saves it to remote storage.  By default, it&#39;s done every 20 seconds and it&#39;s a &quot;stop the world&quot; operation when no changes are made to the state.  With a huge state size, the overhead of serializing the state as JSON (see the next item) and saving it to remote storage becomes very significant.  Technically it would be best to implement checkpointing as a separate goroutine to perform it asynchronously and not block the execution, but required a lot of work to handle all edge cases, so we went with the simpler solution - made the checkpoint interval configurable.    Now it&#39;s possible to &lt;a href=&quot;https://github.com/hashicorp/terraform/commit/a72d02135bd33d6d581d6e0df6e15882c26e8d20#diff-6627e3489968d07107612d76f924dc2cfc0aa526715a5fef572f55560f6e1912R743&quot;&gt;set the checkpoint interval&lt;/a&gt; via &lt;span style=&quot;font-size: medium;&quot;&gt;&lt;code&gt;TF_STATE_PERSIST_INTERVAL&lt;/code&gt;&lt;/span&gt; environment variable, decreasing the number of &quot;stop the world&quot; operations during the execution (it was ok for our case because operations on notebooks and other workspace objects are idempotent).  Note: the local state backend is much worse as it checkpoints after each change.&lt;/li&gt;
&lt;li&gt;JSON representation of the state - Terraform uses JSON format to save the state.  By default, it uses a pretty-printed representation that it&#39;s easy to read by humans, but it&#39;s very inefficient - the space character that is used for code indenting occupies approximately 20-25% of the total file size, significantly increasing serialized state and as result, upload times (on relatively slow links).  When using compact JSON representation we can significantly decrease state file size and this was &lt;a href=&quot;https://github.com/hashicorp/terraform/pull/35175&quot;&gt;confirmed by the implementation&lt;/a&gt; (unfortunately, the PR is still not accepted into Terraform, only to OpenTofu).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
With the fixes released as part of Terraform 1.9, we were able to reach our SLAs. With ~600k resources in the state and 3-10k changes per day, the plan takes approximately 1-1.5 hours, and the apply time is one to three hours, depending on the number of changes per day.
&lt;/p&gt;


&lt;h4 style=&quot;text-align: left;&quot;&gt;
Conclusion
&lt;/h4&gt;

&lt;p&gt;
It&#39;s possible (but not recommended) to use Terraform with tens of thousands of resources if you understand how Terraform works, the limitations of architecture, and you can tune it accordingly (checkpoint interval, parallelism, etc.).
&lt;/p&gt;

&lt;p&gt;
One of the general observations was that performance degrades non-linearly, and slowness due to memory copying and other factors starts after 50-70k resources in the state. So, if you can split your huge state into multiple chunks of smaller size that could be applied independently, it will help with the performance of Terraform itself, although you may still hit the limits of APIs used by specific Terraform providers.
&lt;/p&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/7644938503571773477/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/7644938503571773477' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/7644938503571773477'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/7644938503571773477'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2024/12/working-with-huge-terraform-states.html' title='Working with huge Terraform states'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-4148409637189933855</id><published>2024-11-24T12:58:00.004+01:00</published><updated>2024-11-24T13:03:45.402+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="cybersecurity"/><category scheme="http://www.blogger.com/atom/ns#" term="databricks"/><category scheme="http://www.blogger.com/atom/ns#" term="pyspark"/><category scheme="http://www.blogger.com/atom/ns#" term="spark"/><title type='text'>Spark custom data sources and sinks for cybersecurity use cases</title><content type='html'>&lt;p&gt;
It&#39;s very common in cybersecurity that we need to load from different sources (i.e., load data from threat feeds) or write data to external systems (i.e., push data to SIEM/SOAR).  Apache Spark is a great tool for crunching big amounts of cybersecurity data, in a batch or streaming manner.  Out of the box, Spark has built-in data sources and sinks for file-based formats and event streaming systems (such as Kafka), but its integration with other external systems isn&#39;t very trivial. Typically you work with them using REST APIs, and then you need to have different implementations for batch and streaming use cases, mixing that implementation complexity (i.e.,&amp;nbsp; &lt;span style=&quot;font-size: medium;&quot;&gt;&lt;code&gt;foreachBatch&lt;/code&gt;&lt;/span&gt;) with actual business logic.
&lt;/p&gt;

&lt;p&gt;
The upcoming release of Apache Spark 4 includes &lt;a href=&quot;https://spark.apache.org/docs/preview/api/python/user_guide/sql/python_data_source.html&quot;&gt;PySpark DataSource API&lt;/a&gt; (already included in &lt;a href=&quot;https://docs.databricks.com/en/pyspark/datasources.html&quot;&gt;Databricks Runtime 15.3+&lt;/a&gt;) that greatly simplifies the task of integrating with external systems.  Now we can easily add a custom data source implementation and then use it the same way as built-in data sources and sinks - just specify the name of your custom data source in the&amp;nbsp;&lt;span style=&quot;font-size: medium;&quot;&gt;&lt;code&gt;format&lt;/code&gt;&lt;/span&gt;, and your implementation will be called to handle reads or writes (both batch and streaming, if your implementation supports it):
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-python&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;class&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;MyDataSource&lt;/span&gt;(DataSource):
    @&lt;span style=&quot;color: darkslateblue;&quot;&gt;classmethod&lt;/span&gt;
    &lt;span style=&quot;color: #a020f0;&quot;&gt;def&lt;/span&gt; &lt;span style=&quot;color: blue;&quot;&gt;name&lt;/span&gt;(cls):
        &lt;span style=&quot;color: #a020f0;&quot;&gt;return&lt;/span&gt; &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;my-source&quot;&lt;/span&gt;

    &lt;span style=&quot;color: #a020f0;&quot;&gt;def&lt;/span&gt; &lt;span style=&quot;color: blue;&quot;&gt;writer&lt;/span&gt;(&lt;span style=&quot;color: #a020f0;&quot;&gt;self&lt;/span&gt;, schema: StructType, overwrite: &lt;span style=&quot;color: darkslateblue;&quot;&gt;bool&lt;/span&gt;):
        ...


spark.dataSource.register(MyDataSource)
&lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;df is read from some source&lt;/span&gt;
df.write.&lt;span style=&quot;color: darkslateblue;&quot;&gt;format&lt;/span&gt;(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;my-source&quot;&lt;/span&gt;).mode(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;overwrite&quot;&lt;/span&gt;).save()
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
To play with the new data source APIs I decided to implement sinks for a typical task - push data (such as detections, alerts, etc.) to external systems, such as Splunk (I&#39;m thinking about supporting reads as well).  &lt;a href=&quot;https://github.com/alexott/cyber-spark-data-connectors&quot;&gt;The implementation&lt;/a&gt; is quite simple, but it greatly simplifies integration now - instead of using&amp;nbsp;&lt;span style=&quot;font-size: medium;&quot;&gt;&lt;code&gt;foreachBatch&lt;/code&gt;&lt;/span&gt; with your stream or calling REST API from &lt;span style=&quot;font-family: verdana; font-size: medium;&quot;&gt;&lt;code&gt;mapInPandas&lt;/code&gt;&lt;/span&gt; you can now just say &lt;code&gt;&lt;span style=&quot;font-size: medium;&quot;&gt;.format(&quot;splunk&quot;)&lt;/span&gt;&lt;/code&gt;, and provide necessary options, and the data source implementation will take care of calling necessary APIs, and it works the same for both batch and streaming use cases.
&lt;/p&gt;

&lt;p&gt;
First we need to register our data source:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-python&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;from&lt;/span&gt; cyber_connectors &lt;span style=&quot;color: #a020f0;&quot;&gt;import&lt;/span&gt; *

spark.dataSource.register(SplunkDataSource)
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
And then we can just use it to write data to Splunk providing necessary options such as &lt;span style=&quot;font-size: x-small;&quot;&gt;&lt;span style=&quot;font-family: verdana;&quot;&gt;url&lt;/span&gt;&lt;/span&gt; and &lt;span style=&quot;font-family: verdana;&quot;&gt;&lt;span style=&quot;font-size: x-small;&quot;&gt;token&lt;/span&gt;&lt;/span&gt; so the data source knows where to send data and how to authenticate (see &lt;a href=&quot;https://github.com/alexott/cyber-spark-data-connectors?tab=readme-ov-file#splunk-data-source&quot;&gt;README&lt;/a&gt; for the list of supported options).  For example, I have some Zeek HTTP logs coming as JSON files, and I can easily push them to Splunk:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-python&quot;&gt;&lt;span style=&quot;color: sienna;&quot;&gt;dir_name&lt;/span&gt; = &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;tests/samples/json/&quot;&lt;/span&gt;
&lt;span style=&quot;color: sienna;&quot;&gt;bdf&lt;/span&gt; = spark.read.&lt;span style=&quot;color: darkslateblue;&quot;&gt;format&lt;/span&gt;(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;json&quot;&lt;/span&gt;).load(dir_name)  &lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;to infer schema - don&#39;t use in prod!&lt;/span&gt;

&lt;span style=&quot;color: sienna;&quot;&gt;sdf&lt;/span&gt; = spark.readStream.&lt;span style=&quot;color: darkslateblue;&quot;&gt;format&lt;/span&gt;(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;json&quot;&lt;/span&gt;).schema(bdf.schema).load(dir_name)
&lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;apply some filtering here to detect suspicious events&lt;/span&gt;

&lt;span style=&quot;color: sienna;&quot;&gt;stream_options&lt;/span&gt; = {
  &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;url&quot;&lt;/span&gt;: &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;http://localhost:8088/services/collector/event&quot;&lt;/span&gt;,
  &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;token&quot;&lt;/span&gt;: &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;....&quot;&lt;/span&gt;,
  &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;source&quot;&lt;/span&gt;: &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;zeek&quot;&lt;/span&gt;,
  &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;index&quot;&lt;/span&gt;: &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;zeek&quot;&lt;/span&gt;,
  &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;host&quot;&lt;/span&gt;: &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;my_host&quot;&lt;/span&gt;,
  &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;time_column&quot;&lt;/span&gt;: &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;ts&quot;&lt;/span&gt;,
  &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;checkpointLocation&quot;&lt;/span&gt;: &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;/tmp/splunk-checkpoint/&quot;&lt;/span&gt;
}
&lt;span style=&quot;color: sienna;&quot;&gt;stream&lt;/span&gt; = sdf.writeStream.&lt;span style=&quot;color: darkslateblue;&quot;&gt;format&lt;/span&gt;(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;splunk&quot;&lt;/span&gt;) \
  .trigger(availableNow=&lt;span style=&quot;color: darkcyan;&quot;&gt;True&lt;/span&gt;) \
  .options(**stream_options).start()
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
And I can see the data in my Splunk instance:
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAgcjVg8tn8BCNGl5lgQaLs-WWlW1x8_ILbdrYlg4p40fPjM2YPF9PcwuWBhSsDemvdj89Tkl3hT4ZV_ppV49hq0N7Xr8T2GgOAxrS_xz7J8q3yIMISXcKAiN_vmBhjuRVnTeIm65O6pObyE3KJt2kHNIpXMIla02HL4eMoZo3_yttnwsPe7y1WA/s968/splunk-zeek.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;968&quot; data-original-width=&quot;838&quot; height=&quot;640&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAgcjVg8tn8BCNGl5lgQaLs-WWlW1x8_ILbdrYlg4p40fPjM2YPF9PcwuWBhSsDemvdj89Tkl3hT4ZV_ppV49hq0N7Xr8T2GgOAxrS_xz7J8q3yIMISXcKAiN_vmBhjuRVnTeIm65O6pObyE3KJt2kHNIpXMIla02HL4eMoZo3_yttnwsPe7y1WA/w554-h640/splunk-zeek.png&quot; width=&quot;554&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;


&lt;p&gt;
And that&#39;s all! My code is now concentrated on handling my business logic and is not polluted with some implementation details.  If necessary, I can switch to another external system by just changing &lt;code&gt;&lt;span style=&quot;font-size: medium;&quot;&gt;.format(&quot;splunk&quot;)&lt;/span&gt;&lt;/code&gt; to &lt;code&gt;&lt;span style=&quot;font-size: medium;&quot;&gt;.format(&quot;something-else&quot;)&lt;/span&gt;&lt;/code&gt;.
&lt;/p&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/4148409637189933855/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/4148409637189933855' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/4148409637189933855'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/4148409637189933855'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2024/11/spark-custom-data-sources-and-sinks-for.html' title='Spark custom data sources and sinks for cybersecurity use cases'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAgcjVg8tn8BCNGl5lgQaLs-WWlW1x8_ILbdrYlg4p40fPjM2YPF9PcwuWBhSsDemvdj89Tkl3hT4ZV_ppV49hq0N7Xr8T2GgOAxrS_xz7J8q3yIMISXcKAiN_vmBhjuRVnTeIm65O6pObyE3KJt2kHNIpXMIla02HL4eMoZo3_yttnwsPe7y1WA/s72-w554-h640-c/splunk-zeek.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-6486036865781172579</id><published>2024-09-16T12:26:00.000+02:00</published><updated>2024-09-16T12:26:18.763+02:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="databricks"/><category scheme="http://www.blogger.com/atom/ns#" term="devops"/><category scheme="http://www.blogger.com/atom/ns#" term="terraform"/><title type='text'>Databricks SDKs vs. CLI vs. 
REST APIs vs. Terraform provider vs. DABs</title><content type='html'>&lt;p&gt;
The &lt;a href=&quot;https://alexott.blogspot.com/2024/08/terraform-vs-databricks-asset-bundles.html&quot;&gt;previous blog post&lt;/a&gt; about Databricks Terraform provider vs. Databricks Asset Bundles (DABs) was quite successful, but it didn&#39;t cover all possible application areas. So, there were requests for a follow-up post covering other tools, such as Databricks CLI, SDKs, and REST APIs, and when to use them compared to Databricks Terraform provider and DABs.
&lt;/p&gt;

&lt;h2 id=&quot;org9e9b4d9&quot;&gt;Databricks REST API&lt;/h2&gt;

&lt;p&gt;
The &lt;a href=&quot;https://docs.databricks.com/api/&quot;&gt;Databricks REST API&lt;/a&gt; is the foundation for all other tools. All interactions with the Databricks Platform happen via it and you have full control over what you&#39;re doing.  But with the great power, you&#39;re now responsible for handling all nuances of API usage:
&lt;/p&gt;

&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;Authentication: Multiple authentication methods are supported, but, for example, you need to generate and renew OAuth tokens yourselves.&lt;/li&gt;
&lt;li&gt;Implementation details, like pagination in list API: different APIs use different pagination methods, and you need to understand all the details of each (note: the unification is in progress, but it takes time).&lt;/li&gt;
&lt;li&gt;Error handling: You need to retry the call when you get the HTTP 429 status code (rate limit) and some other situations, or stop processing if you get other, non-retryable errors.&lt;/li&gt;
&lt;li&gt;Some services, such as clusters, model serving, etc., start their objects asynchronously, and you need to wait until they successfully start before declaring success. This could be done by continuous polling, but you shouldn&#39;t overload APIs by polling too often, nor waste time by polling too rarely.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;orgb64d838&quot;&gt;Databricks SDKs&lt;/h2&gt;
&lt;p&gt;
Databricks provides a number of SDKs for different languages (officially supported are for &lt;a href=&quot;https://docs.databricks.com/en/dev-tools/sdk-go.html&quot;&gt;Go&lt;/a&gt;, &lt;a href=&quot;https://docs.databricks.com/en/dev-tools/sdk-python.html&quot;&gt;Python&lt;/a&gt;, and &lt;a href=&quot;https://docs.databricks.com/en/dev-tools/sdk-java.html&quot;&gt;Java&lt;/a&gt; languages).  All these SDKs are generated from the same source - API specification that describes the whole Databricks REST API surface.  Having SDKs generated from the same source has a big advantage - SDKs get new functionality as soon as new/updated APIs are published.  Another great thing is that APIs and their usage in different languages are quite similar to each other (taking into account language differences), so it&#39;s easier to switch between different languages.
&lt;/p&gt;

&lt;p&gt;
SDKs solve all the problems described above by providing:
&lt;/p&gt;

&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;Authentication - you can authenticate using all &lt;a href=&quot;https://docs.databricks.com/en/dev-tools/auth/index.html&quot;&gt;supported authentication methods&lt;/a&gt; (PATs, Databricks/Azure/GCP user-to-machine and machine-to-machine OAuth, …).  You can provide authentication parameters either explicitly when creating an API client, via environment variables, or have a mix of them.  SDKs also support using information from configuration profiles defined in Databricks CLI&#39;s configuration file.  And when you&#39;re running it from a Databricks Notebook, you don&#39;t even need to specify any authentication parameters - everything will be configured automatically.&lt;/li&gt;
&lt;li&gt;Abstracting away implementation details, such as pagination implementation, you just call &lt;code&gt;Clusters.ListAll&lt;/code&gt; and don&#39;t worry about what pagination method is used by the specific API.&lt;/li&gt;
&lt;li&gt;Handling retries and errors - SDKs automatically retry the call if it hits rate limits or other conditions that allow the action to be tried again.&lt;/li&gt;
&lt;li&gt;Providing auxiliary methods, such as &lt;code&gt;GetByName&lt;/code&gt; to get an object by its name or &lt;code&gt;WaitGetClusterRunning&lt;/code&gt; to wait for cluster creation - all these methods are generated automatically for most services. But SDKs also include manually written auxiliary methods, such as &lt;code&gt;Clusters.SelectNodeType&lt;/code&gt; or &lt;code&gt;Clusters.SelectSparkVersion&lt;/code&gt;, that allow the building of cloud-agnostic code (similar to &lt;a href=&quot;https://alexott.blogspot.com/2022/11/cloud-agnostic-resources-deployment.html&quot;&gt;this Terraform example&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
In general, the use of Databricks SDKs is very simple:
&lt;/p&gt;

&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;you create an instance of workspace or account client&lt;/li&gt;
&lt;li&gt;you use methods of specific service exposed by the client - clusters, jobs, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Here is a simple example of listing all jobs in the workspace using Python SDK (authentication parameters will be taken from the notebook environment or environment variables):
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-python&quot;&gt;&lt;code&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;from&lt;/span&gt; databricks.sdk &lt;span style=&quot;color: #a020f0;&quot;&gt;import&lt;/span&gt; WorkspaceClient&lt;/code&gt;
&lt;code&gt;&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: sienna;&quot;&gt;w&lt;/span&gt; = WorkspaceClient()&lt;/code&gt;
&lt;code&gt;&lt;span style=&quot;color: sienna;&quot;&gt;job_list&lt;/span&gt; = w.jobs.&lt;span style=&quot;color: darkslateblue;&quot;&gt;list&lt;/span&gt;(expand_tasks=&lt;span style=&quot;color: darkcyan;&quot;&gt;False&lt;/span&gt;)&lt;/code&gt;
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
More complex examples could be found in &lt;a href=&quot;https://docs.databricks.com/en/dev-tools/sdk-go.html#examples&quot;&gt;documentation&lt;/a&gt; and in &lt;a href=&quot;https://github.com/databrickslabs/sandbox&quot;&gt;Databricks Labs Sandbox&lt;/a&gt; repository.
&lt;/p&gt;

&lt;h2 id=&quot;org836b82a&quot;&gt;Databricks CLI&lt;/h2&gt;

&lt;p&gt;
&lt;a href=&quot;https://docs.databricks.com/en/dev-tools/cli/index.html&quot;&gt;Databricks CLI&lt;/a&gt; is built on top of Databricks Go SDK and provides an easy-to-use interface to interact with Databricks Platform from the command line (on both workspace and account levels).  CLI is also a home for &lt;a href=&quot;https://docs.databricks.com/en/dev-tools/bundles/index.html&quot;&gt;Databricks Asset Bundles&lt;/a&gt; that greatly simplify deployment and promotion of the code and other assets to the Databricks Platform.
&lt;/p&gt;

&lt;p&gt;
Because it&#39;s built on top of Go SDK, it inherits all its capabilities but provides an easier-to-use interface to perform specific tasks - create or start clusters, list jobs, etc. That&#39;s ideal for one-time use, or for scripting.  But we still need to take care of providing the right payload, such as JSON-encoded cluster or job specification, etc., the same as for the corresponding REST APIs.  These payloads could be quite complex, and not so portable if we talk about references to cluster policies, instance pools, DLT pipelines, or other &quot;external&quot; references.  For such cases, it&#39;s better to use DABs or the Databricks Terraform provider to define the environment consisting of multiple objects and deploy them in the right order, with references, etc.
&lt;/p&gt;

&lt;p&gt;
One great part of Databricks CLI is the ability to define a configuration profile - a named entity describing a specific environment - primarily these are authentication parameters, like, host, token, etc., but it&#39;s possible to specify other configurations as well.  After the profile is defined we can easily use that configuration by specifying only its name, without the need to specify all parameters together.  I.e., it&#39;s easy to export workspace objects (notebooks, workspace files, etc.) from one workspace and import them into another using the following command:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-shell&quot;&gt;&lt;code&gt;databricks --profile ws1 workspace export-dir -o &lt;span style=&quot;color: #008b00;&quot;&gt;&#39;/Users/...&#39;&lt;/span&gt; local-dir &amp;amp;&amp;amp; &lt;span style=&quot;color: #008b00;&quot;&gt;\&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;    databricks --profile ws2 workspace import-dir -o local-dir &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;/...&quot;&lt;/span&gt;&lt;/code&gt;
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
Profiles can also be used by SDKs and Terraform providers, making it easy to reuse the same code by specifying environment variables to specify which profile should be used instead of hardcoding configuration in the code or specifying multiple environment variables.
&lt;/p&gt;

&lt;h2 id=&quot;orgd0ace94&quot;&gt;When to use what?&lt;/h2&gt;

&lt;p&gt;
To decide what tool to use I typically ask myself a very simple question - what do I want to achieve?
&lt;/p&gt;

&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;If I want to define some &quot;environment&quot; (especially a complex one, consisting of multiple objects, like a job with multiple tasks of different types), and keep its configuration up to date - then use DABs or Terraform.  These tools will take care of tracking what objects are already created, what configuration they have, etc., and make changes if necessary to bring them to the desired state.   DABs provide additional functionality on top of that, like starting a DLT pipeline or job and waiting for its execution (this isn&#39;t available in Terraform by default).&lt;/li&gt;
&lt;li&gt;If you need to perform some action - use Databricks CLI or SDKs for the language of your choice.  (These actions are typically stateless):
&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;CLI is ideal for one-time actions, like, start cluster, list jobs, etc.  As soon as you need to implement more complex logic, you will start to chain CLI calls using shell, and it will become an unsupported mess (believe me, I wrote and supported huge shell scripts ;-)&lt;/li&gt;
&lt;li&gt;SDKs are ideal for implementing complex logic - you can use the full power of selected programming language with abstractions provided by SDKs.  With SDKs, it&#39;s easy to implement custom tasks, i.e., &lt;a href=&quot;https://github.com/alexott/databricks-playground/tree/main/pause-unpause-jobs&quot;&gt;find all scheduled/triggered jobs, and pause/unpause them&lt;/a&gt;, or &lt;a href=&quot;https://github.com/alexott/databricks-playground/tree/main/deactivate-activate-users-sps&quot;&gt;deactivate/reactivate all non-admin users/service principals in the workspace&lt;/a&gt;, etc.  See more examples in the &lt;a href=&quot;https://github.com/databrickslabs/sandbox&quot;&gt;Databricks Labs Sandbox&lt;/a&gt; repository.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
The direct use of the Databricks REST APIs &lt;i&gt;&lt;b&gt;should be the last resort&lt;/b&gt;&lt;/i&gt; due to the need to handle authentication, retries, implementation details (i.e., pagination), etc. yourself.  Although there are still cases when you can select to use REST APIs:
&lt;/p&gt;

&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;There is no SDK for your language. The best approach would be to raise a request to add one for your language of choice, and use REST API directly until it&#39;s available.  With correct design of your programs, you can easily swap your direct implementation with SDKs.&lt;/li&gt;
&lt;li&gt;SDKs don&#39;t provide the necessary functionality yet - typically this happens with APIs that are in the private preview, so the API specification isn&#39;t updated yet.  In this case, you can still use CLI and SDKs - they provide a raw interface to REST APIs, handling things like authentication and error handling/retries:
&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;For the CLI, the raw interface is available as &lt;a href=&quot;https://docs.databricks.com/en/dev-tools/cli/api-commands.html&quot;&gt;databricks api&lt;/a&gt; commands.&lt;/li&gt;
&lt;li&gt;SDKs provide a low-level API client that is used by both workspace and account-level clients under the hood.  For example, the Python SDK has the &lt;a href=&quot;https://github.com/databricks/databricks-sdk-py/blob/main/databricks/sdk/core.py#L27&quot;&gt;ApiClient class&lt;/a&gt; that can be used to call an arbitrary REST API.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;org61dc407&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;
I hope that this blog post will help you identify and start using the right tool for your Databricks automation journey. I would be really grateful for your feedback!
&lt;/p&gt;

</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/6486036865781172579/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/6486036865781172579' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/6486036865781172579'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/6486036865781172579'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2024/09/databricks-sdks-vs-cli-vs-rest-apis-vs.html' title='Databricks SDKs vs. CLI vs. REST APIs vs. Terraform provider vs. DABs'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-8699962226684466945</id><published>2024-08-01T15:10:00.000+02:00</published><updated>2024-08-01T15:10:00.285+02:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="databricks"/><category scheme="http://www.blogger.com/atom/ns#" term="devops"/><category scheme="http://www.blogger.com/atom/ns#" term="terraform"/><title type='text'>Terraform vs. 
Databricks Asset Bundles</title><content type='html'>&lt;p&gt;I often get questions from customers and my colleagues: We have&amp;nbsp; &lt;a href=&quot;https://registry.terraform.io/providers/databricks/databricks/latest/docs&quot; target=&quot;_blank&quot;&gt;Databricks Terraform Provider&lt;/a&gt; and &lt;a href=&quot;https://docs.databricks.com/en/dev-tools/bundles/index.html&quot;&gt;Databricks Asset Bundles&lt;/a&gt; (DABs), and they have overlapping functionality—what should we use for deploying my data processing and machine learning pipelines and what should we use for deploying the infrastructure? I recently presented internally, and as part of this presentation, I tried to formulate specific guidance on that topic...&lt;/p&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;Typical Challenges when using Terraform&lt;/h2&gt;&lt;p&gt;One of the most significant challenges with Terraform is that many data engineers, data scientists, and machine learning engineers are not familiar with it. Terraform is predominantly a tool used by DevOps and infrastructure teams, and its steep learning curve can be a barrier for those who primarily work with data and machine learning models. This lack of familiarity often leads to a reliance on DevOps teams to manage infrastructure, which can slow down the development process.&lt;br /&gt;&lt;br /&gt;Managing Terraform code across multiple environments (development, staging, production) requires careful planning and organization. The need to modularize code and create environment-specific configurations adds complexity. Tools like Terragrunt can help by providing a wrapper that simplifies some of these tasks, but it is not a perfect solution and still requires significant setup and maintenance. 
Often, customers end up relying on pre-built templates provided by their DevOps teams, which can limit flexibility and autonomy for developers.&lt;br /&gt;&lt;br /&gt;Terraform requires a state file to keep track of the resources it manages. When deploying from CI/CD pipelines, this state must be stored somewhere accessible, typically in cloud storage. However, managing permissions and access to this state file can be problematic, especially in large organizations with stringent security policies. Issues with state management can lead to failed deployments and require manual intervention, further complicating the deployment process.&lt;br /&gt;&lt;/p&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;&amp;nbsp;How DABs Solve These Pain Points&lt;/h2&gt;&lt;p&gt;DABs allow users to specify multiple environments (development, staging, production) in a single configuration file. This streamlined approach reduces the need for extensive modularization and environment-specific code. Additionally, the -t switch enables easy deployment to different environments by overriding environment-specific parameters, making it straightforward to integrate into CI/CD pipelines.&lt;br /&gt;Databricks Asset Bundles (DABs) use Terraform under the hood, but they abstract away much of the complexity. This means that data engineers, data scientists, and ML engineers can deploy infrastructure without needing deep knowledge of Terraform. By simplifying the interface, DABs make it easier for these professionals to manage their own infrastructure needs.&lt;br /&gt;DABs handle state management by using workspace files to store the Terraform state. This approach eliminates the need for dedicated cloud storage and simplifies permission management. 
With DABs, developers do not have to worry about where and how to store state files, reducing the potential for deployment issues related to state management.&lt;br /&gt;By addressing the above challenges, DABs reduce the load on infrastructure teams and provide more autonomy to developers. This autonomy allows data professionals to implement integration tests and manage their own deployments without heavy reliance on DevOps teams, leading to faster development cycles and more efficient workflows.&lt;br /&gt;&lt;/p&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;DABs vs. Terraform - when to use what&lt;br /&gt;&lt;/h2&gt;&lt;p&gt;If your organization does not have a robust DevOps framework in place or if your engineering team is not well-versed in Terraform, adopting DABs can be highly beneficial. DABs provide a more accessible and streamlined way to manage infrastructure, allowing data professionals to focus on their core tasks without being bogged down by infrastructure complexities.&lt;br /&gt;&lt;/p&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;When to Use Terraform&lt;/h3&gt;&lt;p&gt;Terraform remains a powerful tool for managing large-scale infrastructure and is well-suited for the following tasks:&lt;br /&gt;&lt;/p&gt;&lt;ul style=&quot;text-align: left;&quot;&gt;&lt;li&gt;&lt;b&gt;Deployment of Workspaces and Related Cloud Infrastructure&lt;/b&gt;: Use Terraform to set up foundational components like workspaces and the associated cloud resources.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Assignment of Groups/Users/Service Principals to Workspaces&lt;/b&gt;: Manage access control and user assignments with Terraform to ensure secure and organized access to resources.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Deployment of Workspace-Level Resources&lt;/b&gt;: Terraform is ideal for deploying shared resources such as cluster policies, groups, and permissions at the workspace level.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Management of Major Unity Catalog Objects&lt;/b&gt;: Deploy and manage 
essential catalog objects like metastore, catalogs, and grants with Terraform for a structured data governance framework.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;/p&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;When to Use DABs&lt;/h3&gt;&lt;p&gt;DABs are particularly effective for managing project-level artifacts and promoting them between environments. Consider using DABs for:&lt;br /&gt;&lt;/p&gt;&lt;ul style=&quot;text-align: left;&quot;&gt;&lt;li&gt;&lt;b&gt;Deployment of Project-Level Artifacts&lt;/b&gt;: DABs can deploy data pipelines, workflows, and other project-specific resources. Although not all resources are currently supported, DABs provide a straightforward way to manage these artifacts.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Environment Promotion and CI/CD Integration&lt;/b&gt;: DABs excel at promoting artifacts between environments and integrating them into CI/CD pipelines, simplifying the process of moving changes from development to production.&lt;/li&gt;&lt;/ul&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;Conclusion&lt;br /&gt;&lt;/h2&gt;&lt;p&gt;In summary, while Terraform is a robust tool for infrastructure management, DABs offer a more accessible and streamlined approach for data professionals. 
By leveraging the strengths of both tools, organizations can optimize their infrastructure management processes and empower their teams to work more efficiently.&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/8699962226684466945/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/8699962226684466945' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/8699962226684466945'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/8699962226684466945'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2024/08/terraform-vs-databricks-asset-bundles.html' title='Terraform vs. Databricks Asset Bundles'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-2755981010880887659</id><published>2023-12-31T15:06:00.000+01:00</published><updated>2023-12-31T15:06:19.097+01:00</updated><title type='text'>Traditional New Year post, 2023rd edition</title><content type='html'>&lt;p&gt;
Today is the last day of the year, and it&#39;s time for a traditional blog post with a review of the year.
&lt;/p&gt;

&lt;p&gt;
From the professional side, it was another busy but very interesting year with many activities across multiple areas.  For me it was primarily cloud infrastructure, security, all things automation, disaster recovery, migrations, and related areas.  I tried to reflect on this in my &lt;a href=&quot;https://www.linkedin.com/pulse/3-years-databricks-alex-ott/&quot;&gt;post on three years at Databricks&lt;/a&gt; published on LinkedIn, and it&#39;s also visible from the range of topics of &lt;a href=&quot;https://www.databricks.com/blog/author/alex-ott&quot;&gt;blog posts published this year&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
From my point of view the automation (cloud infra, security, DevOps &amp;amp; CI/CD, &amp;#x2026;) is a critical part of the project&#39;s success, and this was one of the most significant parts of my work.  Terraform is a robust tool for automation, and I did spend a considerable amount of time on the related work:
&lt;/p&gt;

&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;More than 150 pull requests were merged into &lt;a href=&quot;https://github.com/databricks/terraform-provider-databricks&quot;&gt;Databricks Terraform provider&lt;/a&gt; - not only the new functionality or bug fixes but also quite a lot of work was done on &lt;a href=&quot;https://registry.terraform.io/providers/databricks/databricks/latest/docs/guides/experimental-exporter&quot;&gt;Terraform exporter&lt;/a&gt; that is heavily used for environment migrations and disaster recovery projects.&lt;/li&gt;
&lt;li&gt;In May &lt;a href=&quot;https://www.databricks.com/blog/announcing-terraform-databricks-modules&quot;&gt;we announced Terraform modules for Databricks&lt;/a&gt; - reusable code that helps customers to build their Databricks infrastructure faster, and we&#39;re working on including more modules so customers will be able just to combine necessary pieces to get their infrastructure ready to use.&lt;/li&gt;
&lt;li&gt;A lot of internal work on enablement around Terraform adoption - some parts of it will be presented in the &lt;a href=&quot;https://pages.databricks.com/databricks-specialist-sessions.html&quot;&gt;upcoming webinar&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Besides Terraform, quite a lot of work (PRs, GH issues, &amp;#x2026;) was done with the engineering team responsible for the developer ecosystem - new Databricks SDKs for Go and Python languages and the new Databricks CLI.  With these new tools, it&#39;s much easier to develop additional tools for Databricks (like &lt;a href=&quot;https://github.com/databrickslabs/sandbox/tree/main/ip_access_list_analyzer&quot;&gt;this&lt;/a&gt;) or automate some boring tasks.
&lt;/p&gt;

&lt;p&gt;
This year, a few projects related to cybersecurity kicked off, and hopefully, we&#39;ll get more work in this area where I have significant experience and where Databricks and Apache Spark are the natural fit. Modern cybersecurity is a big data domain with challenges around large-scale real-time data processing, data normalization, threat detection, and reporting.  Technologies like Delta Live Tables not only simplify development and deployment of scalable data processing pipelines, but they also include features like &lt;a href=&quot;https://www.databricks.com/blog/2022/12/08/build-reliable-and-cost-effective-streaming-data-pipelines.html&quot;&gt;enhanced autoscaling&lt;/a&gt; that allow to automatically scale pipelines up and down, providing a cost-efficient way of handling spiky workloads that are natural for cybersecurity (we had those challenges back at McAfee).
&lt;/p&gt;

&lt;p&gt;
In February, Databricks celebrated ten years, and attending the company kick-off event in Las Vegas was interesting.  For me, it was a chance to finally meet people in person after working with many of them for 2.5 years.  It was also the first long-distance business trip since the pandemic began almost three years ago.  Although frankly speaking, I can&#39;t say that I miss these trips - it&#39;s interesting to meet people, but travel takes too much time, so I need to wait for teleportation :-)
&lt;/p&gt;

&lt;p&gt;
With all this, I&#39;m looking forward to what the new year will bring.  And I wish a happy New Year to all!
&lt;/p&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/2755981010880887659/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/2755981010880887659' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/2755981010880887659'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/2755981010880887659'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2023/12/traditional-new-year-post-2023rd-edition.html' title='Traditional New Year post, 2023rd edition'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-6297639211475240882</id><published>2023-10-28T13:00:00.005+02:00</published><updated>2025-07-03T09:07:30.829+02:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="databricks"/><category scheme="http://www.blogger.com/atom/ns#" term="delta live tables"/><category scheme="http://www.blogger.com/atom/ns#" term="dlt"/><category scheme="http://www.blogger.com/atom/ns#" term="eventhubs"/><title type='text'>Delta Live Tables recipes: Consuming from Azure Event Hubs using OAuth 2.0/OIDC authentication</title><content type='html'>&lt;p&gt;Last year,&lt;a href=&quot;https://alexott.blogspot.com/2022/06/delta-live-tables-recipes-consuming.html&quot;&gt;I blogged&lt;/a&gt; about consuming data from the Azure Event Hubs with Delta Live Tables (DLT). 
That blog post showed how to do that using Apache Kafka client that is bundled together with Databricks Runtime that is used by DLT.&lt;/p&gt;

&lt;p&gt;That example used Shared Access Signatures (SAS) generated for a specific Event Hubs namespace or a topic. However, in many organizations, the use of SAS is prohibited because it’s a long-living token that is potentially risky to use. Instead, it’s recommended to use short-living tokens of service principals that need to be &lt;a href=&quot;https://learn.microsoft.com/en-us/entra/identity-platform/v2-oauth2-client-creds-grant-flow&quot;&gt;generated according to the OIDC/OAuth 2.0 specification&lt;/a&gt;. These tokens need to be periodically refreshed, which should be done automatically by a consumer.&lt;/p&gt;

&lt;p&gt;Before Databricks Runtime 12.2 was released earlier this year, DBR versions were using 2.x versions of Apache Kafka clients that didn’t support OAuth/OIDC authentication, so I even created a &lt;a href=&quot;https://github.com/alexott/databricks-playground/tree/main/kafka-eventhubs-aad-auth&quot;&gt;simple library&lt;/a&gt; that could be used with Databricks clusters to generate and refresh OAuth tokens. But we still had a problem using it on DLT as we can’t attach jar libraries to the DLT pipeline.&lt;/p&gt;

&lt;p&gt;Things had changed in DBR 12.2, which upgraded the Apache Kafka clients library, and it now has built-in support for OAuth 2.0/OIDC authentication flows (see &lt;a href=&quot;https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=186877575&quot;&gt;KIP-768&lt;/a&gt; for more details), so it’s now just a matter of correct configuration to start consuming from the Azure Event Hubs topic using an Azure service principal.&amp;nbsp; To make it work, we need a service principal ID, secret, and Azure Tenant ID - using this data, we can construct the correct SASL configuration string. We also need to grant the service principal a corresponding role on Azure Event Hubs (“ Azure Event Hubs Data Receiver” for reading data or “Azure Event Hubs Data Sender” for writing data).&amp;nbsp;&lt;/p&gt;

&lt;p&gt;The complete example of a DLT pipeline that consumes from an Event Hubs topic looks as follows:&lt;/p&gt;

    &lt;pre&gt;
&lt;span style=&quot;color: #a020f0;&quot;&gt;import&lt;/span&gt; pyspark.sql.functions &lt;span style=&quot;color: #a020f0;&quot;&gt;as&lt;/span&gt; F
&lt;span style=&quot;color: #a020f0;&quot;&gt;import&lt;/span&gt; dlt

&lt;span style=&quot;color: #a0522d;&quot;&gt;topic&lt;/span&gt; = &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;&amp;lt;topic&amp;gt;&quot;&lt;/span&gt;
&lt;span style=&quot;color: #a0522d;&quot;&gt;eh_namespace_name&lt;/span&gt; = &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;&amp;lt;eh_namespace_name&amp;gt;&quot;&lt;/span&gt;
&lt;span style=&quot;color: #a0522d;&quot;&gt;eh_server&lt;/span&gt; = f&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;&lt;/span&gt;{eh_namespace_name}&lt;span style=&quot;color: #008b00;&quot;&gt;.servicebus.windows.net&quot;&lt;/span&gt;

&lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;Data for service principal are stored in the secret scope
&lt;/span&gt;&lt;span style=&quot;color: #a0522d;&quot;&gt;tenant_id&lt;/span&gt; = dbutils.secrets.get(&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;scope&quot;&lt;/span&gt;, &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;tenant_id&quot;&lt;/span&gt;)
&lt;span style=&quot;color: #a0522d;&quot;&gt;client_id&lt;/span&gt; = dbutils.secrets.get(&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;scope&quot;&lt;/span&gt;, &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;sp-id&quot;&lt;/span&gt;)
&lt;span style=&quot;color: #a0522d;&quot;&gt;client_secret&lt;/span&gt; = dbutils.secrets.get(&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;scope&quot;&lt;/span&gt;, &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;sp-secret&quot;&lt;/span&gt;)
&lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;Generate SASL configuration string (it&#39;s split to fit into the screen)
&lt;/span&gt;&lt;span style=&quot;color: #a0522d;&quot;&gt;sasl_config&lt;/span&gt; = f&lt;span style=&quot;color: #008b00;&quot;&gt;&#39;kafkashaded.org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule&#39;&lt;/span&gt; + \
  f&lt;span style=&quot;color: #008b00;&quot;&gt;&#39; required clientId=&quot;&lt;/span&gt;{client_id}&lt;span style=&quot;color: #008b00;&quot;&gt;&quot; clientSecret=&quot;&lt;/span&gt;{client_secret}&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;&#39;&lt;/span&gt; + \
  f&lt;span style=&quot;color: #008b00;&quot;&gt;&#39; scope=&quot;https://&lt;/span&gt;{eh_server}&lt;span style=&quot;color: #008b00;&quot;&gt;/.default&quot; ssl.protocol=&quot;SSL&quot;;&#39;&lt;/span&gt;

&lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;Create Kafka options dictionary
&lt;/span&gt;&lt;span style=&quot;color: #a0522d;&quot;&gt;callback_class&lt;/span&gt; = &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;kafkashaded.org.apache.kafka.common.security.oauthbearer.secured.OAuthBearerLoginCallbackHandler&quot;&lt;/span&gt;
&lt;span style=&quot;color: #a0522d;&quot;&gt;oauth_endpoint&lt;/span&gt; = f&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;https://login.microsoft.com/&lt;/span&gt;{tenant_id}&lt;span style=&quot;color: #008b00;&quot;&gt;/oauth2/v2.0/token&quot;&lt;/span&gt;
&lt;span style=&quot;color: #a0522d;&quot;&gt;kafka_options&lt;/span&gt; = {
  &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;kafka.bootstrap.servers&quot;&lt;/span&gt;: f&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;&lt;/span&gt;{eh_server}&lt;span style=&quot;color: #008b00;&quot;&gt;:9093&quot;&lt;/span&gt;,
  &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;subscribe&quot;&lt;/span&gt;: topic,
  &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;startingOffsets&quot;&lt;/span&gt;: &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;earliest&quot;&lt;/span&gt;,
  &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;kafka.security.protocol&quot;&lt;/span&gt;: &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;SASL_SSL&quot;&lt;/span&gt;,
  &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;kafka.sasl.mechanism&quot;&lt;/span&gt;: &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;OAUTHBEARER&quot;&lt;/span&gt;,
  &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;kafka.sasl.jaas.config&quot;&lt;/span&gt;: sasl_config,
  &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;kafka.sasl.oauthbearer.token.endpoint.url&quot;&lt;/span&gt;: oauth_endpoint,
  &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;kafka.sasl.login.callback.handler.class&quot;&lt;/span&gt;: callback_class,
}

&lt;span style=&quot;color: #0000ee;&quot;&gt;@dlt.table&lt;/span&gt;
&lt;span style=&quot;color: #a020f0;&quot;&gt;def&lt;/span&gt; &lt;span style=&quot;color: #b22222;&quot;&gt;bronze&lt;/span&gt;():
    &lt;span style=&quot;color: #a0522d;&quot;&gt;df&lt;/span&gt; = spark.readStream.&lt;span style=&quot;color: #483d8b;&quot;&gt;format&lt;/span&gt;(&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;kafka&quot;&lt;/span&gt;).options(**kafka_options).load()
    &lt;span style=&quot;color: #a020f0;&quot;&gt;return&lt;/span&gt; df.withColumn(&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;value&quot;&lt;/span&gt;, F.col(&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;value&quot;&lt;/span&gt;).cast(&lt;span style=&quot;color: #008b00;&quot;&gt;&quot;string&quot;&lt;/span&gt;))
&lt;/pre&gt;

&lt;p&gt;The only change necessary to make it work on Databricks is to prepend kafkashaded to the class names because the Apache Kafka client is shaded.&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/6297639211475240882/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/6297639211475240882' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/6297639211475240882'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/6297639211475240882'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2023/10/delta-live-tables-recipes-consuming.html' title='Delta Live Tables recipes: Consuming from Azure Event Hubs using OAuth 2.0/OIDC authentication'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-7600517521325239909</id><published>2022-12-31T16:31:00.002+01:00</published><updated>2023-01-01T14:28:56.384+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="databricks"/><title type='text'>Looking back to 2022nd</title><content type='html'>&lt;p&gt;
It&#39;s the last day of the year, and it&#39;s time to write a traditional &quot;year in review&quot; blog post.
&lt;/p&gt;

&lt;p&gt;
On professional side it was very intensive &amp;amp; interesting year. I&#39;m still working with customers, although my role has changed a bit - now I belong to a group of specialist solution architects, working with customers on advanced use cases in specific areas.  For me it&#39;s an interesting mix of data engineering, platform, security, data governance, devops, cybersecurity, …, and ability to work with big enterprise customers. Work with customers was tightly connected with other activities - blogging, internal &amp;amp; external knowledge sharing, contributing to internal &amp;amp; open source projects, working with product teams in releasing new functionality, etc. 
&lt;/p&gt;

&lt;p&gt;
The significant amount of work was done for &lt;a href=&quot;https://github.com/databricks/terraform-provider-databricks&quot;&gt;Databricks Terraform provider&lt;/a&gt;.  The most significant event was that &lt;a href=&quot;https://www.databricks.com/blog/2022/06/22/databricks-terraform-provider-is-now-generally-available.html&quot;&gt;Databricks Terraform provider reached version 1.0 and became a fully supported part of Databricks portfolio&lt;/a&gt;, and continues to be a &lt;a href=&quot;https://www.linkedin.com/feed/update/urn:li:activity:7009602440190676994/&quot;&gt;very popular tool between Databricks customers&lt;/a&gt;. Although the provider now is a part of the product, the field team continues actively contributing to its functionality - knowing how people are using it is a very important aspect of developing tools for end-users.  From my side, during the year there were more than 80 merged pull requests, with quite a bit of work in the last months on the &lt;a href=&quot;https://registry.terraform.io/providers/databricks/databricks/latest/docs/guides/experimental-exporter&quot;&gt;exporter functionality&lt;/a&gt; that allows users to quickly start to maintain existing Databricks resources with Terraform.
&lt;/p&gt;

&lt;p&gt;
Databricks Terraform provider wasn&#39;t the only open source contribution this year.  In the first half of the year I had a possibility to continue contributions to Apache Airflow, not only fixing bugs or improving existing Airflow operators, but also adding new functionality, like &lt;a href=&quot;https://www.databricks.com/blog/2022/04/29/build-data-and-ml-pipelines-more-easily-with-databricks-and-apache-airflow.html&quot;&gt;support for Databricks SQL&lt;/a&gt; that simplifies data ingestion from different data sources into Delta Lake tables.  Plus there were many contributions to projects under the &lt;a href=&quot;https://github.com/databrickslabs/&quot;&gt;Databricks Labs&lt;/a&gt; &amp;amp; &lt;a href=&quot;https://github.com/orgs/databricks/repositories&quot;&gt;Databricks&lt;/a&gt;  umbrellas, and quite a lot of work (code samples/demos/…) inside &lt;a href=&quot;https://github.com/alexott?tab=repositories&quot;&gt;personal repositories&lt;/a&gt;…
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyVAT7NFRXgW7Lz6ZfEHVZmLkbdCCj4UIgcE3qx_bkB9YoeB-EdSNvV4Qo3StVF_Cx57-Wr8kYM2v40nfDIdbsk-VbBMyFGodXuZkLkMCQO1Dh4reOAdxWuKc-Sk5OHCncueWEdp7YIT1cuzTDY-gU2_0z8X9xJpZLYuMnVRGwpyxutN-tVB4/s824/%D0%A1%D0%BD%D0%B8%D0%BC%D0%BE%D0%BA%20%D1%8D%D0%BA%D1%80%D0%B0%D0%BD%D0%B0%202022-12-31%20%D0%B2%2015.03.54.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;209&quot; data-original-width=&quot;824&quot; height=&quot;162&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyVAT7NFRXgW7Lz6ZfEHVZmLkbdCCj4UIgcE3qx_bkB9YoeB-EdSNvV4Qo3StVF_Cx57-Wr8kYM2v40nfDIdbsk-VbBMyFGodXuZkLkMCQO1Dh4reOAdxWuKc-Sk5OHCncueWEdp7YIT1cuzTDY-gU2_0z8X9xJpZLYuMnVRGwpyxutN-tVB4/w640-h162/%D0%A1%D0%BD%D0%B8%D0%BC%D0%BE%D0%BA%20%D1%8D%D0%BA%D1%80%D0%B0%D0%BD%D0%B0%202022-12-31%20%D0%B2%2015.03.54.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
This year I tried to return to blogging.  Besides &lt;a href=&quot;https://alexott.blogspot.com/2022/&quot;&gt;publishing in the personal blog&lt;/a&gt;, I managed to co-author &lt;a href=&quot;https://www.databricks.com/blog/author/alex-ott&quot;&gt;five blog posts in the company blog&lt;/a&gt; on different topics.  I&#39;m planning to continue writing in both blogs, already having a few drafts in the works.
&lt;/p&gt;

&lt;p&gt;
Continuing to &lt;a href=&quot;https://stackoverflow.com/users/18627/alex-ott&quot;&gt;answer on StackOverflow&lt;/a&gt; was another form of external knowledge sharing about all things Databricks, Delta Lake, Apache Spark, etc. and sometimes I hear from customers that they know me because of answers.  This year I managed to get a gold badge (score of 1000) for the &lt;a href=&quot;https://stackoverflow.com/questions/tagged/databricks&quot;&gt;databricks&lt;/a&gt; tag.
&lt;/p&gt;

&lt;p&gt;
Another thing that I managed to do this year is to get back to more cybersecurity-related work - the area where I have good practical experience.  It was in the different forms - two blog posts (&lt;a href=&quot;https://www.databricks.com/blog/2022/07/19/building-a-cybersecurity-lakehouse-for-crowdstrike-falcon-events-part-ii.html&quot;&gt;1&lt;/a&gt;, &lt;a href=&quot;https://www.databricks.com/blog/2022/12/16/building-cybersecurity-lakehouse-crowdstrike-falcon-events-part-iii.html&quot;&gt;2&lt;/a&gt;) about working with CrowdStrike data in the company blog, &lt;a href=&quot;https://alexott.blogspot.com/2022/10/ingesting-indicators-of-compromise-with.html&quot;&gt;one post&lt;/a&gt; in personal blog, writing a lot of code for ingestion &amp;amp; enrichment of different data sources (not open yet), helping customers to build cybersecurity lakehouses, …  Cybersecurity is a big data area, where Apache Spark/Databricks are a natural fit.
&lt;/p&gt;

&lt;p&gt;
There were many other things that happened during this interesting year - it&#39;s a pleasure to work surrounded by many talented colleagues, and I&#39;m looking with hope into the next year.
&lt;/p&gt;

&lt;p&gt;
Happy New Year!
&lt;/p&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/7600517521325239909/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/7600517521325239909' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/7600517521325239909'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/7600517521325239909'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2022/12/looking-back-to-2022nd.html' title='Looking back to 2022nd'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyVAT7NFRXgW7Lz6ZfEHVZmLkbdCCj4UIgcE3qx_bkB9YoeB-EdSNvV4Qo3StVF_Cx57-Wr8kYM2v40nfDIdbsk-VbBMyFGodXuZkLkMCQO1Dh4reOAdxWuKc-Sk5OHCncueWEdp7YIT1cuzTDY-gU2_0z8X9xJpZLYuMnVRGwpyxutN-tVB4/s72-w640-h162-c/%D0%A1%D0%BD%D0%B8%D0%BC%D0%BE%D0%BA%20%D1%8D%D0%BA%D1%80%D0%B0%D0%BD%D0%B0%202022-12-31%20%D0%B2%2015.03.54.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-2382413987121046555</id><published>2022-12-22T15:41:00.006+01:00</published><updated>2025-07-03T09:07:37.410+02:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="cicd"/><category scheme="http://www.blogger.com/atom/ns#" term="databricks"/><category scheme="http://www.blogger.com/atom/ns#" term="delta live tables"/><category 
scheme="http://www.blogger.com/atom/ns#" term="devops"/><category scheme="http://www.blogger.com/atom/ns#" term="dlt"/><category scheme="http://www.blogger.com/atom/ns#" term="testing"/><title type='text'>Delta Live Tables recipes: implementing unit &amp; integration tests, and doing CI/CD</title><content type='html'>&lt;p&gt;The extended &amp;amp; updated version of this blog post is &lt;a href=&quot;https://www.databricks.com/blog/applying-software-development-devops-best-practices-delta-live-table-pipelines&quot;&gt;published on the Databricks blog&lt;/a&gt;.

&lt;/p&gt;

</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/2382413987121046555/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/2382413987121046555' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/2382413987121046555'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/2382413987121046555'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2022/12/delta-live-tables-recipes-implementing.html' title='Delta Live Tables recipes: implementing unit &amp; integration tests, and doing CI/CD'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-597332594309965720</id><published>2022-11-25T18:01:00.003+01:00</published><updated>2022-12-20T09:14:22.442+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="databricks"/><category scheme="http://www.blogger.com/atom/ns#" term="terraform"/><title type='text'>Cloud-agnostic resources deployment with Databricks Terraform Provider</title><content type='html'>&lt;p&gt;
&lt;a href=&quot;https://registry.terraform.io/providers/databricks/databricks/latest/docs&quot;&gt;Databricks Terraform Provider&lt;/a&gt; includes a number of the data sources that greatly simplify creation of portable Terraform templates.  There are few classes of data sources related to compute, user &amp;amp; group management, and other topics.  In practice, the most often used data sources are:
&lt;/p&gt;

&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;&lt;a href=&quot;https://registry.terraform.io/providers/databricks/databricks/latest/docs/data-sources/node_type&quot;&gt;databricks_node_type&lt;/a&gt; together with &lt;a href=&quot;https://registry.terraform.io/providers/databricks/databricks/latest/docs/data-sources/spark_version&quot;&gt;databricks_spark_version&lt;/a&gt; allow to define jobs, clusters, instance pools &amp;amp; DLT pipelines that are cloud agnostic.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://registry.terraform.io/providers/databricks/databricks/latest/docs/data-sources/current_user&quot;&gt;databricks_current_user&lt;/a&gt; allows to avoid hard coding of paths to notebooks in jobs &amp;amp; DLT pipelines, so it&#39;s easy to move resources between environments, or avoid name conflicts - for example, during development a job or a DLT pipeline could be created for each of the developers, pointing to a notebook for the given user, while in the production environment this job or DLT pipeline will be owned by a service principal.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://registry.terraform.io/providers/databricks/databricks/latest/docs/data-sources/group&quot;&gt;databricks_group&lt;/a&gt; is heavily used to refer to predefined user groups, such as, &lt;code&gt;admins&lt;/code&gt; or &lt;code&gt;users&lt;/code&gt;, for example, when setting permissions to specific resources, or when adding users as workspace administrators (you can find examples in the documentation).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Let&#39;s look at how &lt;code&gt;databricks_node_type&lt;/code&gt;, &lt;code&gt;databricks_spark_version&lt;/code&gt;, and &lt;code&gt;databricks_current_user&lt;/code&gt; could be used to create cloud agnostic Terraform templates. When you work with multiple clouds and define jobs or clusters, you need to specify node type - name of the instance type that will be used to run your code.  The problem is that these names are cloud specific, and in some cases people resolve to ugly code like &lt;code&gt;node_type_id = (var.cloud == &quot;aws&quot;) ? &quot;c5d.2xlarge&quot; : (var.cloud == &quot;azure&quot; ? &quot;Standard_F8s&quot; : &quot;c2-standard-8&quot;)&lt;/code&gt; that is hard to read &amp;amp; support (and it will break if Databricks will add support for another cloud). Also, you need to specify a Databricks Runtime (DBR) version that you want to use (the &lt;code&gt;spark_version&lt;/code&gt; parameter in cluster definition) that consists of several pieces: version itself, is it ML runtime or not, is it ML runtime for GPU or CPU, is it Photon-optimized, is it long term support version (LTS) or not, etc., for example, &lt;code&gt;11.3.x-cpu-ml-scala2.12&lt;/code&gt; or &lt;code&gt;11.3.x-photon-scala2.12&lt;/code&gt;.  Also, new versions are released regularly, and if you want to have clusters/jobs to run on the latest version, you may need to update your Terraform code after each release of new runtimes.
&lt;/p&gt;

&lt;p&gt;
And use of &lt;code&gt;databricks_node_type&lt;/code&gt; and &lt;code&gt;databricks_spark_version&lt;/code&gt; solves these problems:
&lt;/p&gt;

&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;you parameterize &lt;code&gt;databricks_node_type&lt;/code&gt; by specifying what is the minimal number of cores required per node, how much memory should be per core, should it have GPU or not, category (compute or memory optimized, …), and many other parameters described in the &lt;a href=&quot;https://registry.terraform.io/providers/databricks/databricks/latest/docs/data-sources/node_type&quot;&gt;documentation&lt;/a&gt;.  When executing, Databricks Terraform provider fetches the list available node types via REST API, and finds a node matching your parameters that you can use in the cluster/job definition (Warning: sometimes it can&#39;t find it if you have incompatible requirements).&lt;/li&gt;
&lt;li&gt;similarly, you tell &lt;code&gt;databricks_spark_version&lt;/code&gt; to search a DBR version matching your requirements: ML or not, with Photon or not, etc. - see &lt;a href=&quot;https://registry.terraform.io/providers/databricks/databricks/latest/docs/data-sources/spark_version&quot;&gt;documentation&lt;/a&gt; for full list.  Similarly, when invoked, Terraform provider will call corresponding REST API, and find a specific version matching your requirements (or not find, if you specify incorrect combination, like, Photon + ML).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Let&#39;s look at the specific example - deployment of a Databricks job that will execute a notebook on a job cluster. Full source code is &lt;a href=&quot;https://github.com/alexott/terraform-playground/tree/main/cloud-agnostic&quot;&gt;available on GitHub&lt;/a&gt;. It also demonstrates the use of &lt;code&gt;databricks_current_user&lt;/code&gt; data source to create user-specific name for a job, and deploy a notebook into the user&#39;s directory.
&lt;/p&gt;

&lt;p&gt;
First let&#39;s select the corresponding node type for our job - here I want a node that has a local disk, has at least 8 cores, and it&#39;s compute optimized:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-terraform&quot;&gt;&lt;code&gt;&lt;span style=&quot;color: darkslateblue;&quot;&gt;data&lt;/span&gt; &lt;span style=&quot;color: darkcyan;&quot;&gt;&quot;databricks_node_type&quot;&lt;/span&gt; &lt;span style=&quot;color: #cd661d;&quot;&gt;&quot;this&quot;&lt;/span&gt; {&lt;/code&gt;
&lt;code&gt;  &lt;span style=&quot;color: sienna;&quot;&gt;local_disk&lt;/span&gt;            =&lt;span style=&quot;color: darkcyan;&quot;&gt; true&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;  &lt;span style=&quot;color: sienna;&quot;&gt;min_cores&lt;/span&gt;             = 8&lt;/code&gt;
&lt;code&gt;  &lt;span style=&quot;color: sienna;&quot;&gt;category&lt;/span&gt;              = &lt;span style=&quot;color: #008b00;&quot;&gt;&quot;Compute Optimized&quot;&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;}&lt;/code&gt;
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
I also want to use latest Databricks ML Runtime with long term support:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-terraform&quot;&gt;&lt;code&gt;&lt;span style=&quot;color: darkslateblue;&quot;&gt;data&lt;/span&gt; &lt;span style=&quot;color: darkcyan;&quot;&gt;&quot;databricks_spark_version&quot;&lt;/span&gt; &lt;span style=&quot;color: #cd661d;&quot;&gt;&quot;latest_lts&quot;&lt;/span&gt; {&lt;/code&gt;
&lt;code&gt;  &lt;span style=&quot;color: sienna;&quot;&gt;long_term_support&lt;/span&gt; =&lt;span style=&quot;color: darkcyan;&quot;&gt; true&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;  &lt;span style=&quot;color: sienna;&quot;&gt;ml&lt;/span&gt;                =&lt;span style=&quot;color: darkcyan;&quot;&gt; true&lt;/span&gt;&lt;/code&gt;
&lt;code&gt;}&lt;/code&gt;
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
Then I just refer to those data sources in my job definition:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-terraform&quot;&gt;&lt;code&gt;&lt;span style=&quot;color: darkslateblue;&quot;&gt;resource&lt;/span&gt; &lt;span style=&quot;color: darkcyan;&quot;&gt;&quot;databricks_job&quot;&lt;/span&gt; &lt;span style=&quot;color: #cd661d;&quot;&gt;&quot;this&quot;&lt;/span&gt; {&lt;/code&gt;
&lt;code&gt;  ...&lt;/code&gt;
&lt;code&gt;  &lt;span style=&quot;color: #0000ee;&quot;&gt;new_cluster&lt;/span&gt; {&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: sienna;&quot;&gt;num_workers&lt;/span&gt;   = 1&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: sienna;&quot;&gt;spark_version&lt;/span&gt; = data.databricks_spark_version.latest_lts.id&lt;/code&gt;
&lt;code&gt;    &lt;span style=&quot;color: sienna;&quot;&gt;node_type_id&lt;/span&gt;  = data.databricks_node_type.this.id&lt;/code&gt;
&lt;code&gt;  }&lt;/code&gt;
&lt;code&gt;  ...&lt;/code&gt;
&lt;code&gt;}&lt;/code&gt;
&lt;code&gt;&lt;/code&gt;
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
That&#39;s all! 
&lt;/p&gt;

&lt;p&gt;
Let&#39;s see what happens if I execute that code on Azure, and then compare results with AWS &amp;amp; GCP.  After the job is created, let&#39;s look into the job cluster definition:
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMU6zowBhui3CgKAq08WZWQSgxXE6Ni1TPtQUdg89y3GEbQfEP2r9YeSXqhVY7vobYwO2HpCwngXPrwb0Yv65d4kbK6YI-qC_jmq9frPhD14DfmBZgy2M4EakbL1rkUFwXc4mA3kbe_LrPeHusevD26jyn4wSVTnGi-wcaVglyIh8qJfFxjAk/s807/Screenshot%202022-11-25%20at%2016.19.30.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;793&quot; data-original-width=&quot;807&quot; height=&quot;629&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMU6zowBhui3CgKAq08WZWQSgxXE6Ni1TPtQUdg89y3GEbQfEP2r9YeSXqhVY7vobYwO2HpCwngXPrwb0Yv65d4kbK6YI-qC_jmq9frPhD14DfmBZgy2M4EakbL1rkUFwXc4mA3kbe_LrPeHusevD26jyn4wSVTnGi-wcaVglyIh8qJfFxjAk/w640-h629/Screenshot%202022-11-25%20at%2016.19.30.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
As we can see, Terraform provider has selected the &lt;code&gt;Standard_F8s&lt;/code&gt; instance type (compute optimized, with 8 cores), and selected &lt;code&gt;11.3.x-cpu-ml-scala2.12&lt;/code&gt; as runtime version (latest LTS version with ML support for execution on nodes without GPU).
&lt;/p&gt;

&lt;p&gt;
If you execute the same code on AWS, runtime version won&#39;t change, but we&#39;ll get &lt;code&gt;c5d.2xlarge&lt;/code&gt; as the node type:
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixIUkXnuyCThS_AOZ7az7z13llYz_Uxg_zPcbtrTSgKoRpRTKDQLgrfrghY1fwXU0jqLK5nRg9hVduV8smspe5hYdq_Dpm1nfI2L4FgiVF5Q9BRjBvj6qDr9mIzqv2aUP9kpeL_ZHNgLPju3BfJt4cf-XPNkMrvBag47iMVATfwK9ReLs0vaw/s636/Screenshot%202022-11-25%20at%2016.21.49.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;264&quot; data-original-width=&quot;636&quot; height=&quot;266&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixIUkXnuyCThS_AOZ7az7z13llYz_Uxg_zPcbtrTSgKoRpRTKDQLgrfrghY1fwXU0jqLK5nRg9hVduV8smspe5hYdq_Dpm1nfI2L4FgiVF5Q9BRjBvj6qDr9mIzqv2aUP9kpeL_ZHNgLPju3BfJt4cf-XPNkMrvBag47iMVATfwK9ReLs0vaw/w640-h266/Screenshot%202022-11-25%20at%2016.21.49.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
And if we do the same on GCP, the node type will change to the &lt;code&gt;c2-standard-8&lt;/code&gt; (have you noticed that this node has 32GB of RAM instead of 16GB on Azure &amp;amp; AWS? This happens because there was no other node with a smaller amount of memory):&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
 &lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZleU9ktuhXMm-A3GIgaF_13QHUFpBspzvEydyt6M_bO6XKsj56yKHKXIyzUR7sFIJQH8aJ4X_BP2C8myzSD5Idsi48c8FSA8ZM8mProj6ElDJXSggAccnFWvNTLHRee20rRpL2zgrIUj7FaqDRfSrGkDD1V41i28_F3Yse5pEvLkLQs4HTW0/s717/Screenshot%202022-11-25%20at%2016.23.12.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;220&quot; data-original-width=&quot;717&quot; height=&quot;196&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZleU9ktuhXMm-A3GIgaF_13QHUFpBspzvEydyt6M_bO6XKsj56yKHKXIyzUR7sFIJQH8aJ4X_BP2C8myzSD5Idsi48c8FSA8ZM8mProj6ElDJXSggAccnFWvNTLHRee20rRpL2zgrIUj7FaqDRfSrGkDD1V41i28_F3Yse5pEvLkLQs4HTW0/w640-h196/Screenshot%202022-11-25%20at%2016.23.12.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
This blog post demonstrated that it&#39;s really easy to create Terraform code for Databricks that is easy to use on different clouds, and also avoid updating your code when new runtime versions are released.
&lt;/p&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/597332594309965720/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/597332594309965720' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/597332594309965720'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/597332594309965720'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2022/11/cloud-agnostic-resources-deployment.html' title='Cloud-agnostic resources deployment with Databricks Terraform Provider'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMU6zowBhui3CgKAq08WZWQSgxXE6Ni1TPtQUdg89y3GEbQfEP2r9YeSXqhVY7vobYwO2HpCwngXPrwb0Yv65d4kbK6YI-qC_jmq9frPhD14DfmBZgy2M4EakbL1rkUFwXc4mA3kbe_LrPeHusevD26jyn4wSVTnGi-wcaVglyIh8qJfFxjAk/s72-w640-h629-c/Screenshot%202022-11-25%20at%2016.19.30.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-5911880544375363093</id><published>2022-10-21T12:33:00.000+02:00</published><updated>2022-10-21T12:33:02.942+02:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="cybersecurity"/><category scheme="http://www.blogger.com/atom/ns#" term="databricks"/><category scheme="http://www.blogger.com/atom/ns#" term="delta lake"/><title 
type='text'>Ingesting indicators of compromise with Filebeat, Azure Event Hubs &amp; Delta Lake on Databricks</title><content type='html'>&lt;p&gt;
In cybersecurity, an &lt;a href=&quot;https://en.wikipedia.org/wiki/Indicator_of_compromise&quot;&gt;Indicator of Compromise (IoC)&lt;/a&gt; is a very important piece of information that is observed on a network or in an operating system that usually indicates a computer intrusion.  Typical IoCs are things like file hashes, URLs/domains/IPs of botnet command &amp;amp; control servers, etc. Having this information we can use it to perform the real-time matching of logs &amp;amp; other data against known IoCs, or to perform investigations against historical data.  There are multiple data formats that are used to exchange information about IoCs that allow sharing this information between different parties - there are open threat exchange platforms, but there are also a few security vendors that provide high quality, curated threat feeds.
&lt;/p&gt;

&lt;p&gt;
As mentioned above, when it comes to use of IoC data we typically have two distinct use cases:
&lt;/p&gt;

&lt;ol class=&quot;org-ol&quot;&gt;
&lt;li&gt;matching IoCs against new data - this usually happens in the real-time or near-real-time fashion against the streaming data, and generated alerts are kicking-in the investigation process.  To minimize the time between generation of events/logs and generation of alerts, our tool should support efficient lookup in the IoC data.&lt;/li&gt;
&lt;li&gt;matching IoCs against historical data - typically this happens as part of the incident response process, when analysts are looking into previous activity in light of the new data.  In this case the tool should be able efficiently process huge amounts of historical data joined with IoC data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
&lt;a href=&quot;https://spark.apache.org/&quot;&gt;Apache Spark&lt;/a&gt; in combination with &lt;a href=&quot;https://delta.io/&quot;&gt;Delta Lake&lt;/a&gt; as underlying file storage format is a perfect combination that is able to handle both of these use cases very efficiently - Spark &amp;amp; Delta support both streaming &amp;amp; batch workloads using the same code, so you don&#39;t need to duplicate the IoC data, or write different code for each of the use cases.   Additional efficiency when working with historical data could come from use of &lt;a href=&quot;https://www.databricks.com/product/databricks-sql&quot;&gt;Databricks SQL&lt;/a&gt; that allows to process big amounts of data faster due use of the &lt;a href=&quot;https://www.databricks.com/product/photon&quot;&gt;Photon engine&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
To make IoC data available for use we need to perform two tasks:
&lt;/p&gt;

&lt;ol class=&quot;org-ol&quot;&gt;
&lt;li&gt;&lt;b&gt;Collect IoC data&lt;/b&gt;.  When you need to receive IoC data from the threat feeds, you usually need to scrap some REST API or something like that - this task often needs a custom code. But for popular threat exchange platforms there is an easier way to do that - you can simply use the &lt;a href=&quot;https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-module-threatintel.html&quot; target=&quot;_blank&quot;&gt;Threat Intel&lt;/a&gt; module of the &lt;a href=&quot;https://www.elastic.co/beats/filebeat&quot; target=&quot;_blank&quot;&gt;Elastic Filebeat&lt;/a&gt; - very popular, lightweight tool for shipping log data to Elasticsearch and other destinations, such as, Apache Kafka (or &lt;a href=&quot;https://azure.microsoft.com/en-us/products/event-hubs/#overview&quot; target=&quot;_blank&quot;&gt;Azure Event Hubs&lt;/a&gt; that also supports Apache Kafka protocol).&lt;/li&gt;

&lt;li&gt;&lt;b&gt;Make collected data available for consumption&lt;/b&gt;. Usually data from different threat exchange platforms come in slightly different formats, also depending on what kind of IoCs are reported (domain names vs. IPs vs. file hashes, etc.).  To access IoC data efficiently we need to transform them into a unified format.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
The rest of the article describes these two steps in more detail.
&lt;/p&gt;

&lt;h3 id=&quot;org38d7f57&quot;&gt;Setting up &amp;amp; running the Filebeat to ingest IoC data&lt;/h3&gt;
&lt;ol class=&quot;org-ol&quot;&gt;
&lt;li&gt;Setting up Filebeat to output to &lt;a href=&quot;https://azure.microsoft.com/en-us/products/event-hubs/#overview&quot; target=&quot;_blank&quot;&gt;Azure Event Hubs&lt;/a&gt; - it&#39;s easy to configure Filebeat to ingest data into Event Hubs (a full example of the config file can be found &lt;a href=&quot;https://gist.github.com/alexott/3367e6757f8094ba4398b4f59fcb7887&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;):

&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;we need to make sure that we disable &lt;code&gt;output.elasticsearch&lt;/code&gt; and &lt;code&gt;output.logstash&lt;/code&gt; blocks in the &lt;code&gt;filebeat.yaml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;and we need to modify the &lt;code&gt;output.kafka&lt;/code&gt; block as shown below, replacing values in the &lt;code&gt;&amp;lt;&amp;gt;&lt;/code&gt; with actual values:
&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;&lt;code&gt;eh-namespace&lt;/code&gt; in the &lt;code&gt;hosts&lt;/code&gt; is the name of your EventHubs namespace&lt;/li&gt;
&lt;li&gt;for authentication we&#39;re using &lt;a href=&quot;https://learn.microsoft.com/en-us/azure/event-hubs/authorize-access-shared-access-signature&quot; target=&quot;_blank&quot;&gt;Shared Access Signature&lt;/a&gt; that you need to copy from Azure Portal (or get via command-line/terraform) - you need to put it into the &lt;code&gt;password&lt;/code&gt; field.   The value of &lt;code&gt;username&lt;/code&gt; is fixed and equal to &lt;code&gt;$ConnectionString&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;set value of &lt;code&gt;topic&lt;/code&gt; field to the name of EventHubs topic into which we&#39;ll ingest the data&lt;/li&gt;
&lt;li&gt;the rest of the fields should have the fixed values as specified below.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;

&lt;pre class=&quot;src src-yaml&quot;&gt;&lt;span style=&quot;color: sienna;&quot;&gt;output.kafka&lt;/span&gt;:
  &lt;span style=&quot;color: sienna;&quot;&gt;hosts&lt;/span&gt;: [&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;&amp;lt;eh-namespace&amp;gt;.servicebus.windows.net:9093&quot;&lt;/span&gt;]
  &lt;span style=&quot;color: sienna;&quot;&gt;sasl.mechanism&lt;/span&gt;: &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;PLAIN&quot;&lt;/span&gt;
  &lt;span style=&quot;color: sienna;&quot;&gt;username&lt;/span&gt;: &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;$ConnectionString&quot;&lt;/span&gt;
  &lt;span style=&quot;color: sienna;&quot;&gt;password&lt;/span&gt;: &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;Endpoint=sb://&amp;lt;eh-namespace&amp;gt;.servicebus.windows.net/...&quot;&lt;/span&gt;

  &lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;message topic selection + partitioning&lt;/span&gt;
  &lt;span style=&quot;color: sienna;&quot;&gt;topic&lt;/span&gt;: &lt;span style=&quot;color: #8b2252;&quot;&gt;&#39;&amp;lt;topic-name&amp;gt;&#39;&lt;/span&gt;
  &lt;span style=&quot;color: sienna;&quot;&gt;partition.round_robin&lt;/span&gt;:
    &lt;span style=&quot;color: sienna;&quot;&gt;reachable_only&lt;/span&gt;: &lt;span style=&quot;color: darkcyan;&quot;&gt;false&lt;/span&gt;

  &lt;span style=&quot;color: sienna;&quot;&gt;required_acks&lt;/span&gt;: 1
  &lt;span style=&quot;color: sienna;&quot;&gt;compression&lt;/span&gt;: none
  &lt;span style=&quot;color: sienna;&quot;&gt;ssl.enabled&lt;/span&gt;: &lt;span style=&quot;color: darkcyan;&quot;&gt;true&lt;/span&gt;
  &lt;span style=&quot;color: sienna;&quot;&gt;max_message_bytes&lt;/span&gt;: 1000000
&lt;/pre&gt;

&lt;li&gt;Enable the Threat Intel module - that&#39;s also a very easy task:

&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;in the &lt;code&gt;filebeat.yaml&lt;/code&gt; make sure that all subsections inside the &lt;code&gt;filebeat.inputs&lt;/code&gt; are commented out.&lt;/li&gt;
&lt;li&gt;we need to enable &lt;code&gt;threatintel&lt;/code&gt; module by renaming the &lt;code&gt;modules.d/threatintel.yml.disabled&lt;/code&gt; to &lt;code&gt;modules.d/threatintel.yml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;edit &lt;code&gt;modules.d/threatintel.yml&lt;/code&gt; to enable specific integrations.  In the current article we&#39;re using the following feeds: &lt;code&gt;abuseurl&lt;/code&gt;, &lt;code&gt;abusemalware&lt;/code&gt; &amp;amp; &lt;code&gt;malwarebazaar&lt;/code&gt; from &lt;a href=&quot;https://abuse.ch&quot; target=&quot;_blank&quot;&gt;Abuse.ch&lt;/a&gt;, and &lt;code&gt;otx&lt;/code&gt; from &lt;a href=&quot;https://otx.alienvault.com/&quot; target=&quot;_blank&quot;&gt;AlienVault OTX&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-starting.html&quot; target=&quot;_blank&quot;&gt;Start Filebeat&lt;/a&gt; - of course we can run &lt;code&gt;filebeat&lt;/code&gt; on a personal machine, but because it needs to run all the time, it could be easier to run it in the cloud, where we can use something like &lt;code&gt;Standard B1ls&lt;/code&gt; (on Azure) that has enough memory to run the Filebeat process, and it will cost you less than $4/month.&lt;/li&gt;
&lt;/ol&gt;


&lt;h3 id=&quot;orgf295280&quot;&gt;Processing collected IoC data&lt;/h3&gt;
&lt;p&gt;
The previous section described how we can make IoC data published, but now we need to read them, and make them available for direct use.  To do it we need to take several things into consideration when implementing data processing:
&lt;/p&gt;

&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;Filebeat&#39;s Threat Intel module periodically loads data from the specified REST API endpoints, but it doesn&#39;t perform de-duplication of the data - if there are no changes in the API output, it still writes collected data into a configured sink.  The solution for this is to generate a hash of the actual payload &amp;amp; discard all duplicate events that have the same hash.&lt;/li&gt;
&lt;li&gt;Different threat feeds use different data formats, and we need to perform normalization - use the same field names, expand different hashes of the same file into individual rows for easier matching, etc.&lt;/li&gt;
&lt;li&gt;The same IoC may come via different threat feeds.  There are different ways of handling this - ignore duplicates, merge data from multiple providers, etc.  For simplicity I selected the first method - ignore duplicate submissions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
The implementation itself is quite straightforward and follows the standard &lt;a href=&quot;https://www.databricks.com/glossary/medallion-architecture&quot; target=&quot;_blank&quot;&gt;medallion architecture&lt;/a&gt; (full source code is on &lt;a href=&quot;https://github.com/alexott/databricks-cybersecurity-playground/tree/main/iocs-ingest&quot; target=&quot;_blank&quot;&gt;GitHub&lt;/a&gt;):
&lt;/p&gt;

&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;Raw data are ingested from Event Hubs into a bronze layer without much modification - we add a hash of the actual payload that is used to detect duplicates, extracting the threat feed name (the &lt;code&gt;dataset&lt;/code&gt; column), and also adding a &lt;code&gt;date&lt;/code&gt; column that is used for data partitioning.  By keeping the raw data intact we&#39;ll be able to reprocess them if necessary, or add handling of new threat feeds later.&lt;/li&gt;
&lt;li&gt;Actual data transformation happens when we ingest data into a silver layer.  The code consists of the few functions that perform decoding and normalization of data for specific threat feeds (datasets) - this data then is written into a single Delta Lake table that then is used for streaming &amp;amp; batch processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
The current implementation uses &lt;a href=&quot;https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html&quot; target=&quot;_blank&quot;&gt;Spark Structured Streaming&lt;/a&gt;, but right now it runs as a batch-like job using Trigger.Once that is triggered several times per day using &lt;a href=&quot;https://www.databricks.com/blog/2022/05/10/introducing-databricks-workflows.html&quot; target=&quot;_blank&quot;&gt;Databricks Workflows&lt;/a&gt;, which looks as follows:
&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEiAsTV-6x50qlffUcpEDLiOMkJU5n_BDTlzCyxI36jLA7plSw3nK44VRQnjjpkwt26gPt6IQXSPQ4NEHG5fAV4wwiuC3i0pKWhyvzcjwypFCYBtet1tMHBWuKGBPDJQhmYfqHy5ZbBpqYF1iKA2mAH93A2JbPxISkc5ZSlFgNgXoNl0AIVSG0E&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img alt=&quot;&quot; data-original-height=&quot;573&quot; data-original-width=&quot;413&quot; height=&quot;400&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEiAsTV-6x50qlffUcpEDLiOMkJU5n_BDTlzCyxI36jLA7plSw3nK44VRQnjjpkwt26gPt6IQXSPQ4NEHG5fAV4wwiuC3i0pKWhyvzcjwypFCYBtet1tMHBWuKGBPDJQhmYfqHy5ZbBpqYF1iKA2mAH93A2JbPxISkc5ZSlFgNgXoNl0AIVSG0E=w289-h400&quot; width=&quot;289&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;


&lt;p&gt;
To reach the best performance when working with collected IoC data we need to have the correct data layout.  In the current implementation, the silver table has following structure (only main columns are listed):
&lt;/p&gt;

&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;&lt;code&gt;dataset&lt;/code&gt; (string) - from which threat feed we got this IoC.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ioc_type&lt;/code&gt; (string) - IoC type (possible values are &lt;code&gt;URL&lt;/code&gt;, &lt;code&gt;domain&lt;/code&gt;, &lt;code&gt;hostname&lt;/code&gt;, &lt;code&gt;IPv4&lt;/code&gt;, and different file hashes in form of &lt;code&gt;FileHash-&amp;lt;hash-type&amp;gt;&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ioc&lt;/code&gt; (string) - actual IoC value, depending on the IoC type (hash/IP/…).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;first_seen&lt;/code&gt; (timestamp) - when a given IoC was first reported.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;last_seen&lt;/code&gt; (timestamp) - when a given entry was seen last time (please note that not all threat feeds report it).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Based on the target schema of the silver table, we can use following techniques to get best performance when working with IoC data:
&lt;/p&gt;

&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;partition table by the &lt;code&gt;ioc_type&lt;/code&gt; column, so we&#39;ll read only specific data when matching specific IoC types.&lt;/li&gt;
&lt;li&gt;index the &lt;code&gt;first_seen&lt;/code&gt; &amp;amp; &lt;code&gt;last_seen&lt;/code&gt; columns so we can get advantage of the &lt;a href=&quot;https://docs.databricks.com/delta/file-mgmt.html#data-skipping-1&quot; target=&quot;_blank&quot;&gt;data skipping&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.databricks.com/delta/file-mgmt.html#z-ordering-multi-dimensional-clustering&quot; target=&quot;_blank&quot;&gt;Z-Order&lt;/a&gt; data by &lt;code&gt;first_seen&lt;/code&gt; column to make data skipping even more efficient.  This is done by a maintenance task.&lt;/li&gt;
&lt;li&gt;create a &lt;a href=&quot;https://docs.databricks.com/optimizations/bloom-filters.html&quot; target=&quot;_blank&quot;&gt;bloom filter&lt;/a&gt; (currently Databricks-only) for the &lt;code&gt;ioc&lt;/code&gt; column to make joins &amp;amp; point lookups more efficient.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;org043aad8&quot;&gt;&lt;span class=&quot;section-number-4&quot;&gt;1.1.3.&lt;/span&gt; Use the collected IoC data&lt;/h3&gt;
&lt;p&gt;
After we prepared our IoC data, it&#39;s really easy to use them - we just need to perform a join between a dataframe with data (from stream or batch read) and our IoC table - we only need to make sure that we have input data in the correct format (the &lt;code&gt;ioc_type&lt;/code&gt; should specify type of entry (IP/file hash/…), &lt;code&gt;ioc&lt;/code&gt; - value to check, and &lt;code&gt;timestamp&lt;/code&gt; - when the event happened):
&lt;/p&gt;

&lt;pre class=&quot;src src-python&quot;&gt;&lt;span style=&quot;color: sienna;&quot;&gt;data&lt;/span&gt; = &lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;input dataframe&lt;/span&gt;
&lt;span style=&quot;color: sienna;&quot;&gt;iocs&lt;/span&gt; = &lt;span style=&quot;color: #7f7f7f;&quot;&gt;# &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;dataframe with IoC data&lt;/span&gt;
&lt;span style=&quot;color: sienna;&quot;&gt;joined&lt;/span&gt; = data.join(iocs, (
       (data.ioc_type == iocs.ioc_type) &amp;amp; (data.ioc == iocs.ioc) &amp;amp;
       (data.timestamp &amp;gt;= iocs.first_seen) &amp;amp;
       (data.timestamp &amp;lt;= F.coalesce(iocs.last_seen, F.current_timestamp())))) \
    .drop(data.ioc_type).drop(data.ioc)
&lt;/pre&gt;

&lt;p&gt;
And that&#39;s all.  It took less than 200 lines of Python code to implement ingestion &amp;amp; normalization of the data for four threat feeds, and then use this data to detect potential security incidents.
&lt;/p&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/5911880544375363093/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/5911880544375363093' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/5911880544375363093'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/5911880544375363093'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2022/10/ingesting-indicators-of-compromise-with.html' title='Ingesting indicators of compromise with Filebeat, Azure Event Hubs &amp; Delta Lake on Databricks'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/a/AVvXsEiAsTV-6x50qlffUcpEDLiOMkJU5n_BDTlzCyxI36jLA7plSw3nK44VRQnjjpkwt26gPt6IQXSPQ4NEHG5fAV4wwiuC3i0pKWhyvzcjwypFCYBtet1tMHBWuKGBPDJQhmYfqHy5ZbBpqYF1iKA2mAH93A2JbPxISkc5ZSlFgNgXoNl0AIVSG0E=s72-w289-h400-c" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-7828601133081329627</id><published>2022-08-13T14:51:00.006+02:00</published><updated>2022-08-13T14:51:49.880+02:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="databricks"/><title type='text'>Reflecting on two years at Databricks...</title><content type='html'>&lt;p&gt;This Wednesday, 10th of August, was my second anniversary of working at 
Databricks.&amp;nbsp; Initially I planned to write this blog post on that day, but as usual, started to dig into customer work, and remembered about it only in the evening, after I went away from the keyboard.&lt;/p&gt;
&lt;p&gt;I joined the Databricks professional services team in the middle of the pandemic year. All interviews were done remotely (it became normal by that time), and I was really impressed by the people who did the interviews - there were deep technical and non-technical questions, but it wasn&#39;t something done to demonstrate superiority (I&#39;ve seen such things previously). People were really excited to talk about the position, working at Databricks, etc. There were multiple reasons to join Databricks:&lt;/p&gt;

&lt;ul style=&quot;text-align: left;&quot;&gt;
  &lt;li&gt;I have always liked Apache Spark since I started using it in early 2015, and the possibility of working in the company behind Spark was really exciting. (before that, Spark was one of the decision points when I thought about joining DataStax...)&lt;/li&gt;
  &lt;li&gt;culture - few of my colleagues from DataStax were already working at Databricks, and I heard many stories about company&#39;s culture&lt;/li&gt;
  &lt;li&gt;the company was (and still is) growing at a fast pace, and besides Spark, there were many other interesting products in the portfolio. And I especially wanted to get deeper into machine learning.&lt;/li&gt;
  &lt;li&gt;remote position - as a SaaS product, the amount of (potential) travel is much lower compared to the products that aren&#39;t cloud-based.&amp;nbsp; Really, as of right now, I didn&#39;t have any work-related trips, although I had a possibility of working with many customers across the whole Europe - all not leaving the comfort of my home office setup.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During the first weeks, as I was going through the onboarding trainings, I started to get to know a wider team - not only the direct colleagues, but people from other geo locations, and different departments - product &amp;amp; engineering, pre-sales solutions architects, ... And these interactions were confirming the stories that I&#39;ve heard previously about company culture - there are a lot of very smart, but humble people, they are ready to help when you have questions or problems (especially during the onboarding), they are open for suggestions, you can reach people across org boundaries, discuss something with high management, ...&amp;nbsp; And it keeps the same after two years, even though the company grew very significantly (when I joined we had less than 1,500 across the globe, and now we&#39;re close to 4,000).&lt;/p&gt;

&lt;p&gt;The pace of the product development inside the company is very high - looking back, I can see how many things were added or heavily changed even since the last year, not even talking about two years ago. Databricks SQL, Delta Live Tables, Databricks Repos, Unity Catalog, just to name a few - these things are making life of our customers easier, allowing them to concentrate on solving their business problems, not trying to reinvent the wheel of running Spark &amp;amp; other things themselves.&amp;nbsp; This makes work very interesting, although sometimes you can feel a kind of information overload, when you&#39;re trying to cover all areas of your interest.&lt;/p&gt;
&lt;p&gt;Often, when I&#39;m talking with people outside of the Databricks, they have an impression that my work is primarily around Spark (data engineering) and machine learning. &amp;nbsp; But reality is quite different - reliable &amp;amp; scalable data engineering and machine learning aren&#39;t possible until you have a solid foundation of automation (cloud infra/data/ml/dev ops), security/compliance, and related things.&amp;nbsp; As result, a big chunk of our work is spent around deployment planning (for Databricks and other cloud infrastructure), security, building CI/CD pipelines, and related topics. These things are the base on which customer&#39;s teams can build their data and machine learning products.&amp;nbsp; And, almost from the beginning, I&#39;ve started to contribute to the &lt;a href=&quot;https://github.com/databricks/terraform-provider-databricks&quot; target=&quot;_blank&quot;&gt;Databricks Terraform provider&lt;/a&gt; that is used by the significant number of Databricks customers to automate their deployments. 
And I want specifically mention &lt;a href=&quot;https://github.com/nfx&quot; target=&quot;_blank&quot;&gt;Serge Smertin&lt;/a&gt; who leads the development effort of terraform provider (and many other projects) - I learned many new things from him, and always was amused by his relentless push for making things powerful but easy to use.&amp;nbsp; With the similar goal of helping customers to automate, I&#39;ve started to contribute to Apache Airflow, so now it&#39;s not only possible just to run Databricks jobs, but you can query the data, import new data sets, and do many other things using the Databricks SQL.&amp;nbsp; And besides of this, there were many other things done that allow to simplify work with Databricks, for example, &lt;a href=&quot;https://github.com/alexott/databricks-nutter-repos-demo&quot; target=&quot;_blank&quot;&gt;testing of code in Databricks notebooks with Nutter&lt;/a&gt;, a lot of code snippets demonstrating different aspects of the platform (check &lt;a href=&quot;https://github.com/alexott/spark-playground&quot; target=&quot;_blank&quot;&gt;spark-playground&lt;/a&gt; &amp;amp; &lt;a href=&quot;https://github.com/alexott/databricks-playground&quot; target=&quot;_blank&quot;&gt;databricks-playground&lt;/a&gt; repositories on GitHub), etc.&lt;/p&gt;

&lt;p&gt;Working in a quickly growing company gives you a lot of possibilities to contribute to its success. These contributions could come in the different forms, like, sharing knowledge internally (in form of SME groups, presentations, creating new workshops, ...) &amp;amp; externally (i.e., I published &lt;a href=&quot;https://www.databricks.com/blog/author/alex-ott&quot; target=&quot;_blank&quot;&gt;three blog posts&lt;/a&gt; in the company&#39;s blog, answering on Databricks Community &amp;amp; StackOverflow), working closely with product &amp;amp; engineering on new functionality, simplifying internal processes, contributing to open source, ...&amp;nbsp; But most important is that these contributions are recognized, allowing career growth, switching to a new role if you want to try a new area, etc.&lt;/p&gt;

&lt;p&gt;All these things could be summarised as follows - decision to join Databricks was one of the best decisions so far, and I&#39;m looking forward to more things happening there...&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/7828601133081329627/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/7828601133081329627' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/7828601133081329627'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/7828601133081329627'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2022/08/reflecting-on-two-years-at-databricks.html' title='Reflecting on two years at Databricks...'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-1374007020476928648</id><published>2022-06-19T16:30:00.004+02:00</published><updated>2022-06-19T18:34:25.872+02:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="databricks"/><category scheme="http://www.blogger.com/atom/ns#" term="delta live tables"/><category scheme="http://www.blogger.com/atom/ns#" term="dlt"/><category scheme="http://www.blogger.com/atom/ns#" term="eventhubs"/><category scheme="http://www.blogger.com/atom/ns#" term="kafka"/><title type='text'>Delta Live Tables recipes: Consuming from Azure Event Hubs</title><content 
type='html'>&lt;p&gt;
&lt;a href=&quot;https://docs.databricks.com/data-engineering/delta-live-tables/index.html&quot;&gt;Databricks Delta Live Tables&lt;/a&gt; (DLT) is a new framework from Databricks aimed at simplifying building reliable &amp;amp; maintainable data processing pipelines.  With this framework developers are concentrating on writing data transformations themselves, linking them together, and Delta Live Tables handles task orchestration, cluster management,  error handling, monitoring, and data quality.   Delta Live Tables supports both batch &amp;amp; streaming workloads, supporting all data formats &amp;amp; input sources included in the Databricks Runtime (DBR).
&lt;/p&gt;

&lt;p&gt;
On Azure, &lt;a href=&quot;https://docs.microsoft.com/en-us/azure/event-hubs/&quot;&gt;Event Hubs&lt;/a&gt;&amp;nbsp;(often spelled as EventHubs) is a popular solution for events transportation, similar to Apache Kafka, so when it comes to building solutions on Azure, the Event Hubs is a natural choice.  There is a &lt;a href=&quot;https://github.com/Azure/azure-event-hubs-spark&quot;&gt;Spark connector for Event Hubs&lt;/a&gt;, but right now it&#39;s not included into Databricks Runtime, and DLT doesn&#39;t allow (yet) to attach 3rd party Java libraries to a DLT pipeline.
&lt;/p&gt;

&lt;p&gt;
But there is a workaround for that problem - Azure Event Hubs &lt;a href=&quot;https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview&quot;&gt;provides an endpoint compatible with Apache Kafka protocol&lt;/a&gt;, so we can work with Event Hub topics using the Apache Kafka connector that is included into a Databricks Runtime.  We just need to follow the instructions in the &lt;a href=&quot;https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-kafka-spark-tutorial&quot;&gt;official documentation&lt;/a&gt; with small changes, specific to DBR:
&lt;/p&gt;

&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;we need to get &lt;a href=&quot;https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-get-connection-string&quot;&gt;Shared Access Signatures (SAS)&lt;/a&gt; to authenticate to Event Hubs topic - it should look like &lt;code&gt;Endpoint=sb://&amp;lt;....&amp;gt;.windows.net/;?...&lt;/code&gt; and will be used as a password.   For security reasons it&#39;s recommended to put it into a Databricks secret scope (update variables &lt;code&gt;secret_scope&lt;/code&gt; and &lt;code&gt;secret_name&lt;/code&gt; with your actual values).&lt;/li&gt;
&lt;li&gt;we need to form the correct string (the &lt;code&gt;eh_sasl&lt;/code&gt; variable) for SASL (&lt;a href=&quot;https://en.wikipedia.org/wiki/Simple_Authentication_and_Security_Layer&quot;&gt;Simple Authentication and Security Layer&lt;/a&gt;) authentication - as a user name we&#39;re using static value &lt;code&gt;$ConnectionString&lt;/code&gt;, and Event Hubs SAS is used as a password. SASL string looks a bit different on Databricks - instead of &lt;code&gt;org.apache.kafka.common.security.plain.PlainLoginModule...&lt;/code&gt; it should be prefixed with &lt;code&gt;kafkashaded.&lt;/code&gt; as the original Java package is shaded to avoid conflicts with other packages.&lt;/li&gt;
&lt;li&gt;you need to provide the name of the Event Hubs namespace &amp;amp; topic from which to read data in &lt;span style=&quot;font-family: courier;&quot;&gt;&lt;code&gt;eh_namespace_name&lt;/code&gt;&lt;/span&gt; and &lt;code&gt;topic_name&lt;/code&gt; variables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
The final solution looks as follows:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-python&quot;&gt;&lt;span style=&quot;color: forestgreen;&quot;&gt;@dlt.table&lt;/span&gt;
&lt;span style=&quot;color: #a020f0;&quot;&gt;def&lt;/span&gt; &lt;span style=&quot;color: blue;&quot;&gt;eventhubs_topic1&lt;/span&gt;():
  &lt;span style=&quot;color: sienna;&quot;&gt;secret_scope&lt;/span&gt; = &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;scope&quot;&lt;/span&gt;
  &lt;span style=&quot;color: sienna;&quot;&gt;secret_name&lt;/span&gt; = &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;eventhub_sas&quot;&lt;/span&gt;
  &lt;span style=&quot;color: sienna;&quot;&gt;topic_name&lt;/span&gt; = &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;topic1&quot;&lt;/span&gt;
  &lt;span style=&quot;color: sienna;&quot;&gt;eh_namespace_name&lt;/span&gt; = &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;&amp;lt;eh-ns-name&amp;gt;&quot;&lt;/span&gt;
  &lt;span style=&quot;color: sienna;&quot;&gt;readConnectionString&lt;/span&gt; = dbutils.secrets.get(secret_scope, secret_name)
  &lt;span style=&quot;color: sienna;&quot;&gt;eh_sasl&lt;/span&gt; = &lt;span style=&quot;color: #8b2252;&quot;&gt;&#39;kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule&#39;&lt;/span&gt; \
    + f&lt;span style=&quot;color: #8b2252;&quot;&gt;&#39; &lt;/span&gt;&lt;span style=&quot;color: #8b2252;&quot;&gt;&lt;span style=&quot;color: #8b2252;&quot;&gt;required &lt;/span&gt;username=&quot;$ConnectionString&quot; password=&quot;{readConnectionString}&quot;;&#39;&lt;/span&gt;
  &lt;span style=&quot;color: sienna;&quot;&gt;bootstrap_servers&lt;/span&gt; = f&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;{eh_namespace_name}.servicebus.windows.net:9093&quot;&lt;/span&gt;
  &lt;span style=&quot;color: sienna;&quot;&gt;kafka_options&lt;/span&gt; = {
     &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;kafka.bootstrap.servers&quot;&lt;/span&gt;: bootstrap_servers,
     &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;kafka.sasl.mechanism&quot;&lt;/span&gt;: &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;PLAIN&quot;&lt;/span&gt;,
     &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;kafka.security.protocol&quot;&lt;/span&gt;: &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;SASL_SSL&quot;&lt;/span&gt;,
     &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;kafka.request.timeout.ms&quot;&lt;/span&gt;: &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;60000&quot;&lt;/span&gt;,
     &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;kafka.session.timeout.ms&quot;&lt;/span&gt;: &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;30000&quot;&lt;/span&gt;,
     &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;startingOffsets&quot;&lt;/span&gt;: &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;earliest&quot;&lt;/span&gt;,
     &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;kafka.sasl.jaas.config&quot;&lt;/span&gt;: eh_sasl,
     &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;subscribe&quot;&lt;/span&gt;: topic_name,
  }
  &lt;span style=&quot;color: #a020f0;&quot;&gt;return&lt;/span&gt; spark.readStream.&lt;span style=&quot;color: darkslateblue;&quot;&gt;format&lt;/span&gt;(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;kafka&quot;&lt;/span&gt;) \
    .options(**kafka_options).load()
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
With it you can refer to your DLT table by name &lt;code&gt;eventhubs_topic1&lt;/code&gt; in the &lt;code&gt;dlt.read&lt;/code&gt; or &lt;code&gt;dlt.read_stream&lt;/code&gt; functions.  An example of using similar code can be seen in the image of a real DLT pipeline that I&#39;m using for processing of threat feeds (there will be a separate post on that topic) - the &lt;span style=&quot;font-family: courier;&quot;&gt;&lt;code&gt;threatintel_bronze&lt;/code&gt;&lt;/span&gt; consumes data from the Event Hubs.&amp;nbsp;&lt;/p&gt;&lt;p&gt;&amp;nbsp;&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyLA1BAkoGjl3nEA5zkg0fJVEWkZLi79f02w4gZJ99Gq2N5mtNdCKINJ11XxIWQdcE-aZMTWd4qgFH5ZOTdu_ZAN-039Agf0ARGaenNBKHfkyCR48vbsz5FQg-hCkDJ35ZgN6KifcwElOZrBifwqhAjw6xpJ7YyEOt5pvPzssPzJoRZKvNsOY/s2434/Screenshot%202022-06-19%20at%2016.22.58.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;496&quot; data-original-width=&quot;2434&quot; height=&quot;130&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyLA1BAkoGjl3nEA5zkg0fJVEWkZLi79f02w4gZJ99Gq2N5mtNdCKINJ11XxIWQdcE-aZMTWd4qgFH5ZOTdu_ZAN-039Agf0ARGaenNBKHfkyCR48vbsz5FQg-hCkDJ35ZgN6KifcwElOZrBifwqhAjw6xpJ7YyEOt5pvPzssPzJoRZKvNsOY/w640-h130/Screenshot%202022-06-19%20at%2016.22.58.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
There are also additional benefits in using the Apache Kafka connector. The biggest one is that the original Event Hubs connector requires 1-to-1 mapping between partitions in Event Hubs topic and Spark partitions.  This means that if you have more CPU cores than partitions in Event Hubs topic, then your cluster resources will be used only partially, so you will spend money doing nothing.  In the Apache Kafka connector, the &lt;code&gt;minPartitions&lt;/code&gt; parameter allows specifying the desired number of Spark partitions, and the connector will split existing Kafka/Event Hubs partitions into subranges, allowing creation of Spark partitions without 1-to-1 mapping.  And this greatly improves cluster utilization.  Stay tuned for a separate blog post on optimization of Spark + Event Hubs combo.
&lt;/p&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/1374007020476928648/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/1374007020476928648' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/1374007020476928648'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/1374007020476928648'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2022/06/delta-live-tables-recipes-consuming.html' title='Delta Live Tables recipes: Consuming from Azure Event Hubs'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyLA1BAkoGjl3nEA5zkg0fJVEWkZLi79f02w4gZJ99Gq2N5mtNdCKINJ11XxIWQdcE-aZMTWd4qgFH5ZOTdu_ZAN-039Agf0ARGaenNBKHfkyCR48vbsz5FQg-hCkDJ35ZgN6KifcwElOZrBifwqhAjw6xpJ7YyEOt5pvPzssPzJoRZKvNsOY/s72-w640-h130-c/Screenshot%202022-06-19%20at%2016.22.58.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-5587780599743263670</id><published>2021-12-31T18:43:00.002+01:00</published><updated>2021-12-31T18:43:17.383+01:00</updated><title type='text'>Goodbye 2021st...</title><content type='html'>&lt;p&gt;&amp;nbsp;As usual, 31st December is a good time to look back on the year behind.&amp;nbsp; This year flew by very fast, filled with many things both professional 
&amp;amp; personal.&lt;br /&gt;&lt;br /&gt;On the professional side there was a lot of activity - many different clients from small to huge, and interesting projects around very different things - architecture &amp;amp; implementation, security, automation/infrastructure, data quality, scaling the data processing (from an organisational perspective), improving development processes, etc.&amp;nbsp; I&#39;ll try to find time to write about lessons in some of these areas.&amp;nbsp; It&#39;s interesting that &lt;a href=&quot;https://github.com/alexott/databricks-nutter-projects-demo&quot;&gt;one demo&lt;/a&gt; that demonstrates how to do automated testing of Databricks notebooks (I developed it for my CI/CD workshop) is 2nd most popular of my Github repositories.&lt;br /&gt;&lt;br /&gt;As time allowed, I tried to continue to contribute to OSS.&amp;nbsp; The most significant OSS contribution was to &lt;a href=&quot;https://github.com/databrickslabs/terraform-provider-databricks&quot;&gt;Databricks Terraform Provider&lt;/a&gt; - around 10k lines of Go code.&amp;nbsp; Another big part of activity was to the &lt;a href=&quot;https://github.com/databrickslabs/overwatch&quot;&gt;project Overwatch&lt;/a&gt; that simplifies collection &amp;amp; analysis of data from Databricks workspaces to find problematic usage of resources, analyse costs, etc.&amp;nbsp; And on top of that, quite many small activities (PRs, issues, etc.) 
to fix bugs in documentation, port some components to Spark 3, etc.&amp;nbsp; Hopefully, I&#39;ll continue to work on OSS stuff at the same scale.&lt;br /&gt;&lt;br /&gt;From a personal side, the second pandemic year didn&#39;t allow a return back to &quot;normal life&quot;.&amp;nbsp; But we still managed to travel two times (it was quite a relief after almost 1.5 year since a &quot;normal&quot; vacation).&amp;nbsp; This year, I finally managed to complete my &lt;a href=&quot;https://www.goodreads.com/user_challenges/25194547&quot;&gt;reading challenge&lt;/a&gt; - primarily because of travelling just with Kindle, without distraction from iPad/laptop.&lt;br /&gt;&lt;br /&gt;I wish Happy New Year to everyone! And be safe!&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/5587780599743263670/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/5587780599743263670' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/5587780599743263670'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/5587780599743263670'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2021/12/goodbye-2021st.html' title='Goodbye 2021st...'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' 
src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-5998033149993842880</id><published>2020-12-31T17:47:00.000+01:00</published><updated>2020-12-31T17:47:14.574+01:00</updated><title type='text'>Looking back to 2020th</title><content type='html'>&lt;p&gt;
The last day of the year is a good opportunity to look back on what happened during the current year.
&lt;/p&gt;

&lt;p&gt;
The pandemic changed our life significantly, and I&#39;m not the exception, although maybe not so cardinally as for others - I&#39;m working mostly remotely for the last three years, with periodic trips to customers.  And this was the most significant change for me this year - everything became virtual in a very short period of time, without visits to customers onsite.  Although, in the first two months of the year I traveled a lot - almost half of the mileage for the 2019th.  The biggest effect of this switch to virtual was on trainings that I did for customers - if you can collaborate with people remotely when investigating some problems, discussing implementations, etc., with trainings that&#39;s different - it&#39;s harder to see if people understand what you&#39;re teaching, as you don&#39;t see reactions - this required changing the approach to teaching, including materials that are presented…
&lt;/p&gt;

&lt;p&gt;
The biggest change that happened this year is the new job - after interesting years at DataStax, I went to Databricks, to a similar position - helping customers to build solutions on the top of the Databricks platform.  Databricks is well known as a company behind Spark, but it&#39;s not only Spark - MLflow &amp;amp; Delta Lake are very popular &amp;amp; powerful technologies for building data processing &amp;amp; machine learning solutions.  And inside Databricks, all of them are getting new functionality faster, before release to the open source.  And being a cloud platform, Databricks made it easier working with customers during pandemic - you aren&#39;t required to be onsite to help people.  Overall, it&#39;s very interesting to be in a fast growing company, with a lot of really smart people around, so you can learn a lot.  Plus, I got much more exposure to the Azure &amp;amp; AWS services that I didn&#39;t touch much before.  One of the interesting things to observe is that Spark is traditionally associated with Scala, but in practice I&#39;m writing much more code in Python/PySpark :-)
&lt;/p&gt;

&lt;p&gt;
This year I again didn&#39;t make my reading challenge - I set it to 55 books, like the last year, but read only 40 (it was 46 in 2019th) - this also was a side effect of the not traveling so much, plus a job change (but I read a lot of documentation :-).  One of the book-related activities was a technical proofreading of several books by O&#39;Reilly &amp;amp; Manning (I worked with Manning before on several books):
&lt;/p&gt;
&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;Cassandra. The Definitive Guide, 3rd edition&lt;/li&gt;
&lt;li&gt;The Practitioner&#39;s Guide to Graph Data&lt;/li&gt;
&lt;li&gt;Graph Databases in Action&lt;/li&gt;
&lt;li&gt;Graph-Powered Machine Learning (not released yet)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Programming related activity was spread between open source &amp;amp; internal projects.  On the open source front, I became the committer for &lt;a href=&quot;https://zeppelin.apache.org/&quot;&gt;Apache Zeppelin&lt;/a&gt;, primarily improving support for Cassandra (more details are &lt;a href=&quot;https://alexott.blogspot.com/2020/07/new-functionality-of-cassandra.html&quot;&gt;in this blog post&lt;/a&gt;), but with the job change that was put on hold.  But at the new job I suddenly started to write in Go again, contributing to the &lt;a href=&quot;https://github.com/databrickslabs/terraform-provider-databricks&quot;&gt;Terraform provider for Databricks&lt;/a&gt;.  Besides that, there were a lot of small contributions to multiple OSS projects, including the several &lt;a href=&quot;https://github.com/DataStax-Toolkit/&quot;&gt;open sourced projects&lt;/a&gt; at DataStax that just make life of administrators easier.
&lt;/p&gt;

&lt;p&gt;
This year also was more productive than previous around writing.  I wrote (with co-authors) two blog posts for DataStax&#39;s blog (&lt;a href=&quot;https://www.datastax.com/blog/advanced-apache-cassandra-analytics-now-open-all&quot;&gt;1&lt;/a&gt;, &lt;a href=&quot;https://www.datastax.com/blog/migrate-cassandra-apps-cloud-20-lines-code&quot;&gt;2&lt;/a&gt;), and seven blog posts for my &lt;a href=&quot;https://alexott.blogspot.com/&quot;&gt;own blog&lt;/a&gt; around &lt;a href=&quot;https://alexott.blogspot.com/search/label/cassandra&quot;&gt;Cassandra&lt;/a&gt;, &lt;a href=&quot;https://alexott.blogspot.com/search/label/spark&quot;&gt;Spark&lt;/a&gt;, &lt;a href=&quot;https://alexott.blogspot.com/search/label/zeppelin&quot;&gt;Zeppelin&lt;/a&gt;, &lt;a href=&quot;https://alexott.blogspot.com/search/label/datastax&quot;&gt;DataStax&lt;/a&gt;, and &lt;a href=&quot;https://alexott.blogspot.com/search/label/databricks&quot;&gt;Databricks&lt;/a&gt;.  And I have drafts for several blog posts that I&#39;m planning to publish early next year.
&lt;/p&gt;

&lt;p&gt;
And many other things happened during the year, but I don&#39;t want to list everything here :-)
&lt;/p&gt;

&lt;p&gt;
I wish you everyone a happy &amp;amp; prosperous New Year! And stay healthy - this is the main thing right now. 
&lt;/p&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/5998033149993842880/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/5998033149993842880' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/5998033149993842880'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/5998033149993842880'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2020/12/looking-back-to-2020th.html' title='Looking back to 2020th'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-3094407075655152998</id><published>2020-07-31T12:50:00.005+02:00</published><updated>2020-07-31T12:51:18.243+02:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="datastax"/><category scheme="http://www.blogger.com/atom/ns#" term="spark"/><category scheme="http://www.blogger.com/atom/ns#" term="zeppelin"/><title type='text'>Running Apache Zeppelin on DSE Analytics</title><content type='html'>&lt;p&gt;
DataStax Enterprise (DSE) includes the modified version of Apache Spark branded as DSE Analytics.  This version has increased fault tolerance, doesn&#39;t rely on the Zookeeper, and has many additional optimizations &amp;amp; enhancements for work with Cassandra. It also includes a Hadoop-compatible distributed file system - DSEFS.  And I already wrote about &lt;a href=&quot;https://alexott.blogspot.com/2020/07/using-zeppelin-to-work-with-data-in-dse.html&quot;&gt;using Zeppelin with another component of DSE Analytics - AlwaysOn SQL Service&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
In this post I want to discuss how we can use Apache Zeppelin to run on the DSE Analytics, allowing us to use the Spark resources that we already have in the DSE cluster.  If we just need to access data in DSE from Spark, without running Spark code in DSE Analytics, then we just need to configure Zeppelin to use the &lt;a href=&quot;https://www.datastax.com/blog/2020/05/advanced-apache-cassandra-analytics-now-open-all&quot;&gt;recently released Spark Cassandra Connector 2.5&lt;/a&gt; as it has better compatibility with DSE.  For this post I used Zeppelin 0.9-preview2 with DSE 6.8.1 that includes Spark 2.4.
&lt;/p&gt;

&lt;p&gt;
To run Zeppelin on DSE Analytics we have two options that are described in corresponding sections:
&lt;/p&gt;
&lt;ol class=&quot;org-ol&quot;&gt;
&lt;li&gt;Execute Zeppelin directly on the node of DSE cluster - this is the easiest way, but not very good from a security standpoint, adding more load to the DSE node, etc.&lt;/li&gt;
&lt;li&gt;Execute Zeppelin on some other node that has access to the DSE cluster - this solves security and other problems, but requires more work to set up&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
In both cases we&#39;re relying on the code shipped in DSE, and we don&#39;t need to explicitly install Spark Cassandra Connector.
&lt;/p&gt;

&lt;div class=&quot;outline-4&quot; id=&quot;outline-container-org60055c6&quot;&gt;
&lt;h4 id=&quot;org60055c6&quot;&gt;Running Zeppelin on DSE node(s)&lt;/h4&gt;
&lt;div class=&quot;outline-text-4&quot; id=&quot;text-1-5-1&quot;&gt;
&lt;p&gt;
This is the most straightforward way to run Zeppelin &amp;amp; get access to DSE Analytics, DSEFS, etc.  The procedure is simple:
&lt;/p&gt;
&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;Start Zeppelin as &lt;code&gt;dse exec path_to_zeppelin.sh&lt;/code&gt; on one of the nodes inside DSE Analytics data center. &lt;a href=&quot;https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/spark/thirdPartyToolsIntegrationSpark.html#Zeppelinintegration&quot;&gt;dse exec&lt;/a&gt; will setup all necessary parameters - &lt;code&gt;CLASSPATH&lt;/code&gt;, etc., so Zeppelin will pick up all necessary jars that are necessary to submit jobs to the DSE Analytics&lt;/li&gt;
&lt;li&gt;In Zeppelin UI change the Spark interpreter settings. Change the &lt;code&gt;spark.master&lt;/code&gt; (&lt;code&gt;master&lt;/code&gt; in the Zeppelin 0.8)  parameter to &lt;code&gt;dse://?&lt;/code&gt; instead of default &lt;code&gt;local[*]&lt;/code&gt; - this will force Zeppelin to execute jobs on DSE Analytics, with all its advantages, like, automatic registration of DSE tables in Spark SQL, access to DSEFS, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgi1rXv-CW9ho-Q0E1m8ArPbngvS6ZadRwmKnv7RoVQ_ro6tiz3rxrZX1v-boopH1O_f8-P3mxaW7JYIJgeGEL6GwQtaGAQK83Wk6aeK4JRTPCZ1RBNA7aDD7ILi5CxQkr3fDf0gg/s1132/zeppelin-dse-analytics-1-configure.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;351&quot; data-original-width=&quot;1132&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgi1rXv-CW9ho-Q0E1m8ArPbngvS6ZadRwmKnv7RoVQ_ro6tiz3rxrZX1v-boopH1O_f8-P3mxaW7JYIJgeGEL6GwQtaGAQK83Wk6aeK4JRTPCZ1RBNA7aDD7ILi5CxQkr3fDf0gg/d/zeppelin-dse-analytics-1-configure.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
After configuration is changed, we can execute Spark code to read data from DSE, write to DSEFS, execute Spark SQL queries (and we don&#39;t need to explicitly register Cassandra tables!), etc.:
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTgqGI335OhhL1wIEaJhYEQ8wORmIH9IImh4hyphenhyphenSIrPjsbKZKF1lFsAQAJjYimbB_LXnLVWNN0LyrWv10Ba4CLCCx1s3hCKmC4cYGRRmjL4thvY_sL1mnIPJMX6Zo80tWU9DgPMBQ/s1115/zeppelin-dse-analytics-1-notebook.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;833&quot; data-original-width=&quot;1115&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTgqGI335OhhL1wIEaJhYEQ8wORmIH9IImh4hyphenhyphenSIrPjsbKZKF1lFsAQAJjYimbB_LXnLVWNN0LyrWv10Ba4CLCCx1s3hCKmC4cYGRRmjL4thvY_sL1mnIPJMX6Zo80tWU9DgPMBQ/d/zeppelin-dse-analytics-1-notebook.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
And we can see in the Spark Master of DSE Analytics that Zeppelin is really executed there:
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXdyUTAzuamsnaXtxVDClaOPXya-h3vusAZA6Hhjg02aRnzEX_dnaWHA6pOTgjDntgyERKZuZBEqM75hUQM3GAw696XlpWp35BBF1m874AWINmmxiIlfjRv3A5DS8AfDO5OouzQg/s1116/zeppelin-dse-analytics-1-spark-master.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;756&quot; data-original-width=&quot;1116&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXdyUTAzuamsnaXtxVDClaOPXya-h3vusAZA6Hhjg02aRnzEX_dnaWHA6pOTgjDntgyERKZuZBEqM75hUQM3GAw696XlpWp35BBF1m874AWINmmxiIlfjRv3A5DS8AfDO5OouzQg/d/zeppelin-dse-analytics-1-spark-master.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div class=&quot;outline-4&quot; id=&quot;outline-container-orga9caecc&quot;&gt;
&lt;h4 id=&quot;orga9caecc&quot;&gt;&lt;span class=&quot;section-number-4&quot;&gt;&lt;/span&gt;Running Zeppelin with DSE Analytics outside of DSE Cluster&lt;/h4&gt;
&lt;div class=&quot;outline-text-4&quot; id=&quot;text-1-5-2&quot;&gt;
&lt;p&gt;
Sometimes, it&#39;s undesirable to run Zeppelin on the DSE node directly due to many reasons - resource consumption, security concerns (for example, people may get access to files via shell interpreter or other means), etc.  In this case we can still have benefits of running Zeppelin via &lt;code&gt;dse exec&lt;/code&gt; - we just need to do the following:
&lt;/p&gt;

&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;download DSE distribution and unpack it on the machine where we want to run Zeppelin - you don&#39;t need to configure or start anything.  We just need DSE-specific jar files to be able to run it&lt;/li&gt;
&lt;li&gt;start Zeppelin via &lt;code&gt;dse exec&lt;/code&gt; as before&lt;/li&gt;
&lt;li&gt;configure it to run on DSE Analytics, but we&#39;ll need to make more operations to achieve this:
&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;we need to obtain an IP address of Spark master - this could be done either by looking for Spark master IP in output of &lt;code&gt;dsetool status&lt;/code&gt;, or we can use &lt;a href=&quot;https://docs.datastax.com/en/dse/6.8/dse-dev/datastax_enterprise/tools/dseClientTool/dseClientToolcommands/dseClient-toolSpark.html&quot;&gt;dse client-tool spark master-address&lt;/a&gt; - this option would be even easier for automatic configuration of the Zeppelin, because it will print complete URI of Spark master&lt;/li&gt;
&lt;li&gt;change &lt;code&gt;spark.master&lt;/code&gt; parameter to value obtained via &lt;code&gt;dse client-tool spark master-address&lt;/code&gt; - it should be at least &lt;code&gt;dse://&amp;lt;Master-IP&amp;gt;?&lt;/code&gt;, or with more parameters like &lt;code&gt;connection.host&lt;/code&gt; and &lt;code&gt;local_dc&lt;/code&gt;. For example: &lt;code&gt;dse://10.121.34.176:9042?connection.local_dc=SearchAnalytics;connection.host=10.121.34.94,10.121.33.133&lt;/code&gt;; &lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmf8P67F3qh4A2PvR8hYzR5tI01Aat6JQroCJGBKw0FKuXrAKAaxwIrkHCSEjskmCpoFVQIBtmD5YNc1rIHeJUrH3oziDOjvbmKvgHAq25KOLWtal95NbWgNkVcQXXoOtkQs_KNw/s1094/zeppelin-dse-analytics-2-configure-1.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;381&quot; data-original-width=&quot;1094&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmf8P67F3qh4A2PvR8hYzR5tI01Aat6JQroCJGBKw0FKuXrAKAaxwIrkHCSEjskmCpoFVQIBtmD5YNc1rIHeJUrH3oziDOjvbmKvgHAq25KOLWtal95NbWgNkVcQXXoOtkQs_KNw/d/zeppelin-dse-analytics-2-configure-1.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;if there is no &lt;code&gt;connection.host&lt;/code&gt; in the Spark master URI, then you need to add the &lt;code&gt;spark.cassandra.connection.host&lt;/code&gt; property, and put there a comma-separated list of DSE nodes&lt;/li&gt;
&lt;li&gt;if necessary, add other properties specific for Spark Cassandra Connector and DSE Analytics.  We can obtain them by executing &lt;a href=&quot;https://docs.datastax.com/en/dse/6.8/dse-dev/datastax_enterprise/tools/dseClientTool/dseClientToolcommands/dseClient-toolConfigurationByos-export.html&quot;&gt;dse client-tool configuration byos-export&lt;/a&gt; command. Usually these are properties related to security, but we can specify any additional property specific for the Spark Cassandra Connector, like, username and passwords, or performance tuning options&lt;/li&gt;
&lt;li&gt;to work with DSEFS as the default file system we can specify the Hadoop option &lt;code&gt;spark.hadoop.fs.defaultFS&lt;/code&gt; with value of &lt;code&gt;dsefs://&amp;lt;DSE_NODE_IP&amp;gt;&lt;/code&gt;.  This is not strictly required, we still can use DSEFS but we&#39;ll need to specify node address in the path, like, &lt;code&gt;dsefs://192.168.0.10/file.csv&lt;/code&gt; (see screenshot below)&lt;/li&gt;&lt;/ul&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEin_T5WCVsAr4EbBQypa18QvmTDCkD6i74i1C3SYHs9cjzLEBAA2CQarEThtK1JEuYC-yvogdd1LsQr23vg0MEFX8GfKZI24lA-TDVNsrnknKwjpjncEnCLf1AiB6rYPhdKZCeX4g/s1082/zeppelin-dse-analytics-2-configure-2.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;91&quot; data-original-width=&quot;1082&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEin_T5WCVsAr4EbBQypa18QvmTDCkD6i74i1C3SYHs9cjzLEBAA2CQarEThtK1JEuYC-yvogdd1LsQr23vg0MEFX8GfKZI24lA-TDVNsrnknKwjpjncEnCLf1AiB6rYPhdKZCeX4g/d/zeppelin-dse-analytics-2-configure-2.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;ul class=&quot;org-ul&quot;&gt;
&lt;/ul&gt;

&lt;p&gt;
After everything is configured, we can execute our code. The result will be the same: we&#39;ll get the Zeppelin process running on DSE Analytics, and we&#39;ll have full access to data. And we can use DSEFS as well - we can write data to DSEFS using explicit or implicit filesystem:
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZjdDFdBYj93H3qwVDMUjUBCBrv3hPjTTgC7RijYl8WZ-faj5RexR4-zq8jPb-0dm1AX4ud6va19UdEJJMb-63hcLdiTJfVzIZTlK2liJuk86NYyqt3r6Rxt5LLbopU7iaw5aVBQ/s1095/zeppelin-dse-analytics-2-dsefs-results.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;173&quot; data-original-width=&quot;1095&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZjdDFdBYj93H3qwVDMUjUBCBrv3hPjTTgC7RijYl8WZ-faj5RexR4-zq8jPb-0dm1AX4ud6va19UdEJJMb-63hcLdiTJfVzIZTlK2liJuk86NYyqt3r6Rxt5LLbopU7iaw5aVBQ/d/zeppelin-dse-analytics-2-dsefs-results.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
and see that data on DSEFS:
&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEik7JAe3oBxD3Vu0LyGh6AzIpCnQxmhtQdb0_denF5CJfkW662G4SSFhftnEvsGQx5cWtCSyl1pu2zhDPQ1q9r2JTtznPXlktg2Z5N8nYTWMS3hopu9rH-F7705S0wUpzTOObhqBQ/s1009/zeppelin-dse-analytics-2-dsefs-write.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;573&quot; data-original-width=&quot;1009&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEik7JAe3oBxD3Vu0LyGh6AzIpCnQxmhtQdb0_denF5CJfkW662G4SSFhftnEvsGQx5cWtCSyl1pu2zhDPQ1q9r2JTtznPXlktg2Z5N8nYTWMS3hopu9rH-F7705S0wUpzTOObhqBQ/s640/zeppelin-dse-analytics-2-dsefs-write.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div class=&quot;outline-4&quot; id=&quot;outline-container-orgc7ebf51&quot;&gt;
&lt;h4 id=&quot;orgc7ebf51&quot;&gt;Conclusion&lt;/h4&gt;
&lt;div class=&quot;outline-text-4&quot; id=&quot;text-1-5-3&quot;&gt;
&lt;p&gt;
This post shows that it&#39;s quite easy to run Apache Zeppelin on DSE Analytics, either directly on the cluster&#39;s nodes, or outside of the DSE cluster.  For the second option, the setup process could be simplified by packing both DSE &amp;amp; Zeppelin into a Docker image (&lt;a href=&quot;https://gist.github.com/alexott/246e9ab5e50416d83c080f53529cecf6&quot;&gt;example&lt;/a&gt;), and configuring Zeppelin using its &lt;a href=&quot;https://zeppelin.apache.org/docs/0.9.0-preview2/usage/rest_api/configuration.html&quot;&gt;configuration REST API&lt;/a&gt;.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/3094407075655152998/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/3094407075655152998' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/3094407075655152998'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/3094407075655152998'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2020/07/running-apache-zeppelin-on-dse-analytics.html' title='Running Apache Zeppelin on DSE Analytics'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgi1rXv-CW9ho-Q0E1m8ArPbngvS6ZadRwmKnv7RoVQ_ro6tiz3rxrZX1v-boopH1O_f8-P3mxaW7JYIJgeGEL6GwQtaGAQK83Wk6aeK4JRTPCZ1RBNA7aDD7ILi5CxQkr3fDf0gg/s72-c-d/zeppelin-dse-analytics-1-configure.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-1312506369334664265</id><published>2020-07-30T11:59:00.004+02:00</published><updated>2020-07-30T12:10:20.088+02:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="cassandra"/><category scheme="http://www.blogger.com/atom/ns#" term="zeppelin"/><title type='text'>What&#39;s new in Apache Zeppelin&#39;s Cassandra interpreter</title><content type='html'>&lt;p&gt;
The upcoming Zeppelin 0.9 is a very big release for Apache Zeppelin (the 0.9.0-preview2 was just released).  A lot has happened since release of the 0.8.x series - better support for Spark &amp;amp; Flink, new interpreters (Influxdb, KSQL, MongoDB, SPARQL, …), a lot of bug fixes and improvements in the existing interpreters.  In this blog post I want to specifically discuss improvements in the Cassandra interpreter that exists since Zeppelin 0.5.5, released almost 5 years ago.  
&lt;/p&gt;

&lt;p&gt;
The two most notable changes in the new release (already available in the &lt;code&gt;0.9.0-preview2&lt;/code&gt;) are:
&lt;/p&gt;
&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;Upgrade of the driver to DataStax Java driver 4.x (&lt;a href=&quot;https://issues.apache.org/jira/browse/ZEPPELIN-4378&quot;&gt;ZEPPELIN-4378&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Control of formatting for results of SELECT queries (&lt;a href=&quot;https://issues.apache.org/jira/browse/ZEPPELIN-4796&quot;&gt;ZEPPELIN-4796&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;outline-4&quot; id=&quot;outline-container-org1198230&quot;&gt;
&lt;h4 id=&quot;org1198230&quot;&gt;Upgrade to the DataStax Java driver 4.x&lt;/h4&gt;
&lt;div class=&quot;outline-text-4&quot; id=&quot;text-1-3-1&quot;&gt;
&lt;p&gt;
Prior releases of the Cassandra interpreter were based on the open source &lt;a href=&quot;https://github.com/datastax/java-driver/tree/3.x&quot;&gt;DataStax Java Driver for Apache Cassandra 3.x&lt;/a&gt;.  It worked fine with Apache Cassandra, but was not always usable with DataStax Enterprise (DSE), for example, you couldn&#39;t use it with DSE-specific data types, like, &lt;code&gt;Point&lt;/code&gt;, when you get data back as &lt;code&gt;ByteBuffer&lt;/code&gt; instead of &lt;code&gt;Point&lt;/code&gt;:
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVgF_4na0zQkMMAUwIhYc36s4eVJvkcA0IVYhUQ83aygcEm8F7vyVEdN04CR5hHIAQVM1P2KUCmLb36_kl3qZdxxYZ_3SdwakGTu6ZUIbgwcY1Vz5xeLFT8xhXozD3kvO85xvM0w/s1082/zeppelin-driver3-geo-points.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;222&quot; data-original-width=&quot;1082&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVgF_4na0zQkMMAUwIhYc36s4eVJvkcA0IVYhUQ83aygcEm8F7vyVEdN04CR5hHIAQVM1P2KUCmLb36_kl3qZdxxYZ_3SdwakGTu6ZUIbgwcY1Vz5xeLFT8xhXozD3kvO85xvM0w/d/zeppelin-driver3-geo-points.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
DataStax Java driver 4.0, released in March 2019th, was a complete rewrite of the Cassandra driver to make it more scalable and fault-tolerant.  To achieve these goals, the architecture of the driver has changed significantly, making it binary incompatible with previous versions.  Also since Java driver 4.4.0, released in January 2020th, all DSE-specific functionality is &lt;a href=&quot;https://www.datastax.com/blog/2020/01/better-drivers-for-cassandra&quot;&gt;available in the single (unified) driver&lt;/a&gt;, instead of traditional separation on OSS &amp;amp; DSE drivers.  With release of the unified driver 4, the 3.x series of the driver was put into the maintenance mode, receiving only critical bug-fixes, but no new features.  
&lt;/p&gt;

&lt;p&gt;
To get access to the new features of the driver, internals of Cassandra interpreter were rewritten.  Because of the architectural changes of the new driver, the changes in the interpreter were quite significant.  But in result we&#39;re getting more functionality:
&lt;/p&gt;
&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;Access to all improvements and new functions provided by the driver itself - better load balancing policy, fault tolerance, performance, etc.&lt;/li&gt;
&lt;li&gt;Allow to configure all parameters of the Java driver.  In previous versions of interpreter, every configuration option of the driver should be explicitly exposed in the interpreter&#39;s configuration, and addition of the new option required change in the interpreter&#39;s code, and release of the new version together with Zeppelin release.  In the new version of interpreter, we can set any driver configuration option, even if it&#39;s not explicitly exposed by interpreter.  This is possible because of the &lt;a href=&quot;https://docs.datastax.com/en/developer/java-driver/4.7/manual/core/configuration/&quot;&gt;way the new Java driver is configured&lt;/a&gt; - configuration could be specified in the config file, set programmatically, or even via Java system properties.   This flexibility was already demonstrated in the &lt;a href=&quot;https://alexott.blogspot.com/2020/06/working-with-datastax-astra-from-apache.html&quot;&gt;blog post on connecting Zeppelin to the DataStax&#39;s Astra&lt;/a&gt; (Cassandra as a Service)&lt;/li&gt;
&lt;li&gt;Support for DSE-specific features, for example, now it&#39;s possible to execute commands of DSE Search, or work with geospatial data types:&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCNt01nHY2Rj0VZWV0azUpY54OH3DpIVTJ1GJREXPfpEA7idGS0dG15vt_-EEUWSx1QRJfl7pmchZ5UDDkcGf30cZxAjmHlf67BoALxl3wz03kK_LdVFPWRhWnSQRO3bTdgUiVlA/s1090/zeppelin-driver4-geo-points.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;201&quot; data-original-width=&quot;1090&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCNt01nHY2Rj0VZWV0azUpY54OH3DpIVTJ1GJREXPfpEA7idGS0dG15vt_-EEUWSx1QRJfl7pmchZ5UDDkcGf30cZxAjmHlf67BoALxl3wz03kK_LdVFPWRhWnSQRO3bTdgUiVlA/d/zeppelin-driver4-geo-points.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
Because of the changes in driver itself, there are some breaking changes in interpreter:
&lt;/p&gt;
&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;the new driver &lt;a href=&quot;https://docs.datastax.com/en/developer/java-driver/4.7/manual/core/native_protocol/#compatibility-matrix&quot;&gt;supports only Cassandra versions&lt;/a&gt; that implement native protocol V3 and higher (Cassandra 2.1+, and DSE 4.7+).  As result, support for Cassandra 1.2 and 2.0 is dropped (but you shouldn&#39;t use them in 2020th anyway)&lt;/li&gt;
&lt;li&gt;there is only &lt;a href=&quot;https://docs.datastax.com/en/developer/java-driver/4.7/manual/core/retries/&quot;&gt;one retry policy provided by the new driver&lt;/a&gt;, and support for other retry policies (&lt;code&gt;LoggingRetryPolicy&lt;/code&gt;, &lt;code&gt;FallthroughRetryPolicy&lt;/code&gt;, etc.) are removed.  As result of this, support for query parameter &lt;code&gt;@retryPolicy&lt;/code&gt; was dropped, so existing notebooks that are using this parameter need to be modified&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div class=&quot;outline-4&quot; id=&quot;outline-container-orgef5a268&quot;&gt;
&lt;h4 id=&quot;orgef5a268&quot;&gt;Control of the results&#39; formatting&lt;/h4&gt;
&lt;div class=&quot;outline-text-4&quot; id=&quot;text-1-3-2&quot;&gt;
&lt;p&gt;
The previous version of the interpreter always used the predefined formatting for numbers, and date/time related data types. Also, the content of the collections (maps, sets &amp;amp; lists), tuples, and user-defined types was always formatted using the CQL syntax.  This wasn&#39;t always flexible, especially for building graphs, or exporting data into a file for importing into external system that may expect data in some specific format.  
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmjOwoWfoH96kMx4zxIKy0j928fIkGozdhk6LVP9cSblFP7AKsdugCMCPwjyFsaUavDbnVBfDrA7vrj-u22OcIqvD2i7E6mC3aonni0scAlqBkAJ0uc3HiZOpvQVQDHctGBz2wgg/s1218/zeppelin-formatting-old.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;217&quot; data-original-width=&quot;1218&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmjOwoWfoH96kMx4zxIKy0j928fIkGozdhk6LVP9cSblFP7AKsdugCMCPwjyFsaUavDbnVBfDrA7vrj-u22OcIqvD2i7E6mC3aonni0scAlqBkAJ0uc3HiZOpvQVQDHctGBz2wgg/d/zeppelin-formatting-old.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
In a new interpreter users can control formatting of results - you can configure this on interpreter and even on the cell level.  This includes:
&lt;/p&gt;
&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;selection between output in the human-readable or strict CQL format.  In the human-readable format, users can have more control on the formatting, like, specification of precision, formatting of date/time results, etc.&lt;/li&gt;
&lt;li&gt;control of precision for &lt;code&gt;float&lt;/code&gt;, &lt;code&gt;double&lt;/code&gt;, and &lt;code&gt;decimal&lt;/code&gt; types&lt;/li&gt;
&lt;li&gt;specification of locale that will be used for formatting - this affects date/time &amp;amp; numeric types&lt;/li&gt;
&lt;li&gt;specification of format for date/time types for each of &lt;code&gt;date&lt;/code&gt;, &lt;code&gt;time&lt;/code&gt;, and &lt;code&gt;timestamp&lt;/code&gt; types.  You can use any option of &lt;a href=&quot;https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html&quot; target=&quot;_blank&quot;&gt;DateTimeFormatter&lt;/a&gt; class&lt;/li&gt;
&lt;li&gt;specification of timezone for &lt;code&gt;timestamp&lt;/code&gt; type&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
All of this is applied to all data, including the content of collections, tuples, and user-defined types.
&lt;/p&gt;

&lt;p&gt;
Formatting options could be set on the interpreter level by changing new configuration options (see &lt;a href=&quot;https://zeppelin.apache.org/docs/0.9.0-preview2/interpreter/cassandra.html&quot; target=&quot;_blank&quot;&gt;documentation for details&lt;/a&gt;) - if you change them, this will affect all notebooks:
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxNitlBvLqM2VISkX8qOF4FATWylU0l_NVbZwRjAjtZyluQ6DY_ZaZBJjoTmTYEO3Aaa1KJ6rSQrdca2I2lCfhJFYKpG1oxzRXqYDaLGDRDNj2oQ0xhzuZFqsD9NVm17NWnhs6SA/s1335/zeppelin-formatting-config-options.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;305&quot; data-original-width=&quot;1335&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxNitlBvLqM2VISkX8qOF4FATWylU0l_NVbZwRjAjtZyluQ6DY_ZaZBJjoTmTYEO3Aaa1KJ6rSQrdca2I2lCfhJFYKpG1oxzRXqYDaLGDRDNj2oQ0xhzuZFqsD9NVm17NWnhs6SA/d/zeppelin-formatting-config-options.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;
With default options, user will get data in human-readable format, like this:
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4-XpNA5RgTfbTG4M81eqQT2z8IcTyPCQSTw7P7jn4P0wuussGD9yMxFBle89tQgPwWcGu3LAJaI_PbHN7KzKBFJZi_gnDp29-BQafAjo42gfKan0OaM3BEO9DV0cujzLMakkCVA/s1226/zeppelin-formatting-new-default.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;216&quot; data-original-width=&quot;1226&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4-XpNA5RgTfbTG4M81eqQT2z8IcTyPCQSTw7P7jn4P0wuussGD9yMxFBle89tQgPwWcGu3LAJaI_PbHN7KzKBFJZi_gnDp29-BQafAjo42gfKan0OaM3BEO9DV0cujzLMakkCVA/d/zeppelin-formatting-new-default.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
But sometimes it&#39;s useful to change formatting only in specific cells.  This is now possible by specifying options in the list after the interpreter name, like &lt;code&gt;%cassandra(option=value, ...)&lt;/code&gt; (please note, that if option includes &lt;code&gt;=&lt;/code&gt; or &lt;code&gt;,&lt;/code&gt; characters, it should be put into double quotes, or escaped with &lt;code&gt;\&lt;/code&gt;). There are multiple options available, that are described in the &lt;a href=&quot;https://zeppelin.apache.org/docs/0.9.0-preview2/interpreter/cassandra.html&quot;&gt;documentation&lt;/a&gt; and built-in help. For example, we can change formatting to CQL:
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMiHS0LLtVMEs7tgqn47NgTyCr0SW7QTLT91g3Jyp5mCovKemnFElzXJAWyHhWkuIfhVCcshwQfBsMOos9oWrk8RVDArYw-uWfKP-zmcMBnJKtMaMw_CvWsTVmz4qaSZz4BrbLBw/s1227/zeppelin-formatting-new-cql.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;197&quot; data-original-width=&quot;1227&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMiHS0LLtVMEs7tgqn47NgTyCr0SW7QTLT91g3Jyp5mCovKemnFElzXJAWyHhWkuIfhVCcshwQfBsMOos9oWrk8RVDArYw-uWfKP-zmcMBnJKtMaMw_CvWsTVmz4qaSZz4BrbLBw/d/zeppelin-formatting-new-cql.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
Or we can set multiple options at the same time - locale (see that it affects formatting of numbers and date/time), timezone, format of timestamp, date, etc.:
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrwgNaA0wPMSvtRnHT_X9hk4FFO3GERshUV637iE64zzmmd7alPcNUHExVzhnzXiRMOBRzqF74OHLG9YzllYsrhGRjTVyqpCNu1ZpJ5hO26HwxdqFglVKWQi1J5_08rWdwAiIHVQ/s1345/zeppelin-formatting-new-with-options.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;250&quot; data-original-width=&quot;1345&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrwgNaA0wPMSvtRnHT_X9hk4FFO3GERshUV637iE64zzmmd7alPcNUHExVzhnzXiRMOBRzqF74OHLG9YzllYsrhGRjTVyqpCNu1ZpJ5hO26HwxdqFglVKWQi1J5_08rWdwAiIHVQ/d/zeppelin-formatting-new-with-options.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div class=&quot;outline-4&quot; id=&quot;outline-container-org4a27d51&quot;&gt;
&lt;h4 id=&quot;org4a27d51&quot;&gt;Other changes&lt;/h4&gt;
&lt;div class=&quot;outline-text-4&quot; id=&quot;text-1-3-3&quot;&gt;
&lt;p&gt;
There are also smaller changes available in the new release - they are making the interpreter more stable, or add a new functionality.  This includes:
&lt;/p&gt;
&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;(&lt;a href=&quot;https://issues.apache.org/jira/browse/ZEPPELIN-4444&quot;&gt;ZEPPELIN-4444&lt;/a&gt;) explicitly check for schema disagreement when executing the DDL statements (&lt;code&gt;CREATE/ALTER/DROP&lt;/code&gt;).  This is very important for stability of the Cassandra cluster, especially when executing many of them from the same cell.  Because Cassandra is a distributed system, they could be executed on the different nodes in almost the same time, and such uncoordinated execution may lead to a state of the cluster called &quot;schema disagreement&quot; when different nodes have different versions of the database schema.  Fixing this state usually &lt;a href=&quot;https://support.datastax.com/hc/en-us/articles/360001375763-Handling-schema-disagreements-and-Schema-version-mismatch-detected-on-node-restart&quot;&gt;requires manual intervention of database administrators&lt;/a&gt;, and restarting of the affected nodes&lt;/li&gt;
&lt;li&gt;(&lt;a href=&quot;https://issues.apache.org/jira/browse/ZEPPELIN-4393&quot;&gt;ZEPPELIN-4393&lt;/a&gt;) added support for &lt;code&gt;--&lt;/code&gt; comment style, in addition to already supported &lt;span style=&quot;font-family: courier;&quot;&gt;&lt;code&gt;//&lt;/code&gt;&lt;/span&gt; and &lt;span style=&quot;font-family: courier;&quot;&gt;&lt;code&gt;/* .. */&lt;/code&gt;&lt;/span&gt; styles&lt;/li&gt;
&lt;li&gt;(&lt;a href=&quot;https://issues.apache.org/jira/browse/ZEPPELIN-4756&quot;&gt;ZEPPELIN-4756&lt;/a&gt;) make &quot;No results&quot; messages foldable &amp;amp; folded by default.  In previous versions, when we didn&#39;t get any results from Cassandra, for example, by executing &lt;code&gt;INSERT/DELETE/UPDATE&lt;/code&gt;, or DDL queries, interpreter output a table with statement itself, and information about execution (what hosts were used for execution, etc.).  This table occupied quite significant space on the screen, but usually didn&#39;t bring much useful information for a user.  In the new version, this information is still produced, but it&#39;s folded, so it doesn&#39;t occupy screen space, and still available if necessary.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div class=&quot;outline-4&quot; id=&quot;outline-container-org3fad7a0&quot;&gt;
&lt;h4 id=&quot;org3fad7a0&quot;&gt;Conclusion&lt;/h4&gt;
&lt;div class=&quot;outline-text-4&quot; id=&quot;text-1-3-4&quot;&gt;
&lt;p&gt;
I hope that all described changes will make use of the Cassandra from Zeppelin easier.  If you have ideas for a new functionality in Cassandra interpreter, or found a bug, feel free to create an issue at &lt;a href=&quot;https://issues.apache.org/jira/browse/ZEPPELIN&quot;&gt;Apache Zeppelin&#39;s Jira&lt;/a&gt;, or drop an email to &lt;a href=&quot;https://zeppelin.apache.org/community.html&quot;&gt;Zeppelin user mailing list&lt;/a&gt;.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/1312506369334664265/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/1312506369334664265' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/1312506369334664265'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/1312506369334664265'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2020/07/new-functionality-of-cassandra.html' title='What&#39;s new in Apache Zeppelin&#39;s Cassandra interpreter'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVgF_4na0zQkMMAUwIhYc36s4eVJvkcA0IVYhUQ83aygcEm8F7vyVEdN04CR5hHIAQVM1P2KUCmLb36_kl3qZdxxYZ_3SdwakGTu6ZUIbgwcY1Vz5xeLFT8xhXozD3kvO85xvM0w/s72-c-d/zeppelin-driver3-geo-points.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-234608211915646986</id><published>2020-07-28T18:00:00.005+02:00</published><updated>2020-10-16T10:26:04.273+02:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="astra"/><category scheme="http://www.blogger.com/atom/ns#" term="cassandra"/><category scheme="http://www.blogger.com/atom/ns#" term="databricks"/><category scheme="http://www.blogger.com/atom/ns#" term="datastax"/><category 
scheme="http://www.blogger.com/atom/ns#" term="spark"/><title type='text'>Working with DataStax Astra from Databricks platform</title><content type='html'>&lt;p&gt;
One of the notable changes in the &lt;a href=&quot;https://www.datastax.com/blog/2020/05/advanced-apache-cassandra-analytics-now-open-all&quot;&gt;Spark Cassandra Connector (SCC) 2.5.0&lt;/a&gt; is the support for &lt;a href=&quot;https://astra.datastax.com/&quot;&gt;Astra&lt;/a&gt; - DataStax&#39;s Cassandra as a Service offering.  Having managed Cassandra makes it very easy to start development of the applications - you can create a new database in a couple of minutes.  &lt;a href=&quot;https://databricks.com/&quot;&gt;Databricks&lt;/a&gt; is also well known for its Spark-based unified cloud data processing platform.  Both Databricks &amp;amp; DataStax offer the free tier, and this combination is an ideal ground for prototypes.  This short blog describes how to work with Astra from the Databricks, using free tiers in both cases. 
&lt;/p&gt;

&lt;p&gt;
To get access to Astra from Databricks, we need the following:
&lt;/p&gt;
&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;running instance of Astra database - if you don&#39;t have it, it&#39;s easy to create - just &lt;a href=&quot;https://astra.datastax.com/&quot; target=&quot;_blank&quot;&gt;login to Astra&lt;/a&gt;, and press &quot;Create New Database&quot;&lt;br /&gt;&lt;/li&gt;
&lt;li&gt;credentials (username &amp;amp; password) specified when creating database&lt;/li&gt;
&lt;li&gt;secure connect bundle that will be used to establish connection to Astra - this bundle could be downloaded from the database&#39;s main page, and need to be uploaded to DBFS (Databricks File System), so it will be available to Spark&lt;/li&gt;
&lt;li&gt;Spark cluster configured to use secure connect bundle, together with other parameters - credentials, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
First we need to upload the secure connect bundle to DBFS.  The easiest way to do it is to go to &quot;Data&quot;, click the &quot;Add Data&quot; button, and use the &quot;Upload File&quot; form.  After the file is uploaded, remember the path to the uploaded file (like, &lt;code&gt;/FileStore/tables/secure_connect_aott.zip&lt;/code&gt;) - we&#39;ll need it in the next steps.
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvhwvEXkWFIjT5Fy13O7Jy9Na6ZFkrAKqs-yOO4qDFIvqTpVixu158hYoQZrrEeIl2Hn8CVIq3Uq8cbErfBnkaWSgR_17I7PU0jWfdLnQBmvJYFlkpRJXNisX2QXhiaXeFBTwTKA/s721/astra-databricks-upload-bundle.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;497&quot; data-original-width=&quot;721&quot; height=&quot;431&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvhwvEXkWFIjT5Fy13O7Jy9Na6ZFkrAKqs-yOO4qDFIvqTpVixu158hYoQZrrEeIl2Hn8CVIq3Uq8cbErfBnkaWSgR_17I7PU0jWfdLnQBmvJYFlkpRJXNisX2QXhiaXeFBTwTKA/w625-h431/astra-databricks-upload-bundle.png&quot; width=&quot;625&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
Then we need to create a Spark cluster.  Go to &quot;Clusters&quot; and click &quot;Create Cluster&quot;. Select runtime, either Spark 2.4, or Spark 3.0 - depending on the version selected we&#39;ll need to use different versions of Spark Cassandra Connector.  Click &quot;Spark&quot; link and enter configuration parameters there:
&lt;/p&gt;
&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;&lt;code&gt;spark.cassandra.auth.username&lt;/code&gt; - user name to connect to database (&lt;code&gt;test&lt;/code&gt; in my case)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;spark.cassandra.auth.password&lt;/code&gt; - password for user (&lt;code&gt;123456&lt;/code&gt; in my case)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;spark.cassandra.connection.config.cloud.path&lt;/code&gt; - path to the uploaded file with secure connect bundle (&lt;code&gt;dbfs:/FileStore/tables/secure_connect_aott.zip&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;spark.dse.continuousPagingEnabled&lt;/code&gt; with value &lt;code&gt;false&lt;/code&gt; - this is a workaround for &lt;a href=&quot;https://datastax-oss.atlassian.net/browse/SPARKC-602&quot;&gt;SPARKC-602&lt;/a&gt; that we need to apply right now to avoid errors when reading data from Astra&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDjWTi_JrJTfe16M0wKdfsxP0-j03yH_Agwx5zpAIRkXizG5pL7PE3hx-rHr4DBlYIYtARZ6znXQayeNUutOd8Rrah5kb65R3S7kK8iZzQ4WNe6oSIFKb1QjZy88HSwcoQlLv0_A/s847/astra-databricks-create-cluster.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;607&quot; data-original-width=&quot;847&quot; height=&quot;458&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDjWTi_JrJTfe16M0wKdfsxP0-j03yH_Agwx5zpAIRkXizG5pL7PE3hx-rHr4DBlYIYtARZ6znXQayeNUutOd8Rrah5kb65R3S7kK8iZzQ4WNe6oSIFKb1QjZy88HSwcoQlLv0_A/w640-h458/astra-databricks-create-cluster.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
After entering all this data, press &quot;Create cluster&quot; - this will take you to the list of your running clusters.  If you point a mouse to the instance that is creating right now, you can see several links, like &quot;Libraries / Spark UI / Logs&quot; - we need to select &quot;Libraries&quot; to add Spark Cassandra Connector to a cluster.   In the opened page, click &quot;Install New&quot; - this will open the dialog for addition of the library.  Select the &quot;Maven&quot; tab.   Because we have dependency conflict between SCC and Databricks runtime we must not use the &lt;code&gt;spark-cassandra-connector&lt;/code&gt; artifact, but the assembly version of it: &lt;code&gt;spark-cassandra-connector-assembly&lt;/code&gt; (see &lt;a href=&quot;https://datastax-oss.atlassian.net/browse/SPARKC-601&quot;&gt;SPARKC-601&lt;/a&gt; for details).   For runtime version 6.x we need to use &lt;code&gt;com.datastax.spark:spark-cassandra-connector-assembly_2.11:2.5.1&lt;/code&gt;, and for runtime 7.0 we need to take &lt;code&gt;com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.0.0-beta&lt;/code&gt; (or released version when it&#39;s done).  Click &quot;Install&quot; to add the library to a cluster. (Please note that to use &lt;code&gt;SparkCassandraExtensions&lt;/code&gt;, for DirectJoin, for example, you need to have a &lt;a href=&quot;https://docs.databricks.com/clusters/init-scripts.html&quot;&gt;cluster init script&lt;/a&gt; in place, that should copy the assembly before the driver &amp;amp; executor will start...)&lt;br /&gt;&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOZ7H7FwkBos4LoXJ6tOAXOWiDbuR_FKNL6-bAbQbXkY82smEH1cQpjRCnaIKsKLvxtJOzhWVixA4dBSeSNV_IO9mi5sUhlrpq_I-k22gDJoawUW0eUSTeDeDiPoA3r5MHFPnISg/s1122/astra-databricks-add-library.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;515&quot; data-original-width=&quot;1122&quot; height=&quot;294&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOZ7H7FwkBos4LoXJ6tOAXOWiDbuR_FKNL6-bAbQbXkY82smEH1cQpjRCnaIKsKLvxtJOzhWVixA4dBSeSNV_IO9mi5sUhlrpq_I-k22gDJoawUW0eUSTeDeDiPoA3r5MHFPnISg/w640-h294/astra-databricks-add-library.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
To make this blog post self contained and not dependent on the previously created tables &amp;amp; loaded data, let&#39;s generate test data, create a table using SCC, and write data into that table:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-scala&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;import&lt;/span&gt; org.apache.spark.sql.cassandra.&lt;span style=&quot;color: #a020f0;&quot;&gt;_&lt;/span&gt;
&lt;span style=&quot;color: #a020f0;&quot;&gt;import&lt;/span&gt; com.datastax.spark.connector.&lt;span style=&quot;color: #a020f0;&quot;&gt;_&lt;/span&gt;
&lt;span style=&quot;color: #a020f0;&quot;&gt;import&lt;/span&gt; com.datastax.spark.connector.cql.&lt;span style=&quot;color: darkcyan;&quot;&gt;ClusteringColumn&lt;/span&gt;

&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;newData&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; spark.range(&lt;span style=&quot;color: darkcyan;&quot;&gt;1&lt;/span&gt;, &lt;span style=&quot;color: darkcyan;&quot;&gt;1000&lt;/span&gt;)
  .select($&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;.as(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;pk&quot;&lt;/span&gt;), $&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;.cast(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;int&quot;&lt;/span&gt;).as(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;c1&quot;&lt;/span&gt;), 
          $&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;.cast(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;int&quot;&lt;/span&gt;).as(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;c2&quot;&lt;/span&gt;), $&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;.cast(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;string&quot;&lt;/span&gt;).as(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;str&quot;&lt;/span&gt;))

&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;ksName&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;test&quot;&lt;/span&gt;
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;tableName&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;newdata&quot;&lt;/span&gt;
newData.createCassandraTableEx(ksName, tableName, 
                               partitionKeyColumns &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; &lt;span style=&quot;color: darkcyan;&quot;&gt;Seq&lt;/span&gt;(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;pk&quot;&lt;/span&gt;), 
                               clusteringKeyColumns &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; &lt;span style=&quot;color: darkcyan;&quot;&gt;Seq&lt;/span&gt;(
                                  (&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;c1&quot;&lt;/span&gt;, &lt;span style=&quot;color: darkcyan;&quot;&gt;ClusteringColumn&lt;/span&gt;.&lt;span style=&quot;color: darkcyan;&quot;&gt;Ascending&lt;/span&gt;), 
                                  (&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;c2&quot;&lt;/span&gt;, &lt;span style=&quot;color: darkcyan;&quot;&gt;ClusteringColumn&lt;/span&gt;.&lt;span style=&quot;color: darkcyan;&quot;&gt;Descending&lt;/span&gt;)), 
                               ifNotExists &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; &lt;span style=&quot;color: darkcyan;&quot;&gt;true&lt;/span&gt;)
newData.write.cassandraFormat(tableName, ksName).mode(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;append&quot;&lt;/span&gt;).save
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
This code will generate a dataframe with 999 rows, create a table with the following structure, and write data into it:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-cql&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;CREATE TABLE&lt;/span&gt; &lt;span style=&quot;color: blue;&quot;&gt;test.newdata&lt;/span&gt; (
    pk &lt;span style=&quot;color: forestgreen;&quot;&gt;bigint&lt;/span&gt;,
    c1 &lt;span style=&quot;color: forestgreen;&quot;&gt;int&lt;/span&gt;,
    c2 &lt;span style=&quot;color: forestgreen;&quot;&gt;int&lt;/span&gt;,
    str &lt;span style=&quot;color: forestgreen;&quot;&gt;text&lt;/span&gt;,
    &lt;span style=&quot;color: #a020f0;&quot;&gt;PRIMARY KEY&lt;/span&gt; (pk, c1, c2)
) &lt;span style=&quot;color: #a020f0;&quot;&gt;WITH CLUSTERING ORDER BY&lt;/span&gt; (c1 &lt;span style=&quot;color: #a020f0;&quot;&gt;ASC&lt;/span&gt;, c2 &lt;span style=&quot;color: #a020f0;&quot;&gt;DESC&lt;/span&gt;);
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
To check that the data was written correctly, let&#39;s read it into another dataframe, print its schema, and count the number of rows:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-scala&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;data&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; spark.read.cassandraFormat(tableName, ksName).load
data.printSchema
data.count
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
as expected, this will print schema as:
&lt;/p&gt;

&lt;pre class=&quot;example&quot;&gt;root
 |-- pk: long (nullable = false)
 |-- c1: integer (nullable = false)
 |-- c2: integer (nullable = false)
 |-- str: string (nullable = true)
&lt;/pre&gt;

&lt;p&gt;
and output the number of rows - as expected, it&#39;s 999.
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQ6QwpIUF3TswLfAeAnr5ABPa6r0nMAnD1LJJag3duro0vV4ikZ5vuc4KegcOFsR13WNjakqY7Nd2KOcQAiy4jKivQIRX1pDZ0kwA8cnvOpFanomuIsAZMRNFIsNHctn5VZTBEUQ/s969/astra-databricks-notebook.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;938&quot; data-original-width=&quot;969&quot; height=&quot;620&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQ6QwpIUF3TswLfAeAnr5ABPa6r0nMAnD1LJJag3duro0vV4ikZ5vuc4KegcOFsR13WNjakqY7Nd2KOcQAiy4jKivQIRX1pDZ0kwA8cnvOpFanomuIsAZMRNFIsNHctn5VZTBEUQ/w640-h620/astra-databricks-notebook.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
That&#39;s all!  You can continue to use Spark Cassandra Connector to work with data in Astra using either Dataframe, or RDD APIs - all functionality is the same, including &lt;a href=&quot;https://alexott.blogspot.com/2020/07/spark-effective-joins-with-cassandra.html&quot;&gt;joins with Cassandra&lt;/a&gt;, writing streaming data into it, etc.  See &lt;a href=&quot;https://github.com/datastax/spark-cassandra-connector/tree/b2.5/doc&quot;&gt;Spark Cassandra Connector documentation&lt;/a&gt; for more information.
&lt;/p&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/234608211915646986/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/234608211915646986' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/234608211915646986'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/234608211915646986'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2020/07/working-with-datastax-astra-from.html' title='Working with DataStax Astra from Databricks platform'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvhwvEXkWFIjT5Fy13O7Jy9Na6ZFkrAKqs-yOO4qDFIvqTpVixu158hYoQZrrEeIl2Hn8CVIq3Uq8cbErfBnkaWSgR_17I7PU0jWfdLnQBmvJYFlkpRJXNisX2QXhiaXeFBTwTKA/s72-w625-h431-c/astra-databricks-upload-bundle.png" height="72" width="72"/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-29732542745204182</id><published>2020-07-27T15:17:00.005+02:00</published><updated>2020-08-24T22:48:44.554+02:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="cassandra"/><category scheme="http://www.blogger.com/atom/ns#" term="spark"/><title type='text'>Spark &amp; efficient joins with Cassandra</title><content type='html'>&lt;p&gt;
In modern data processing, especially when handling streaming data, quite often there is a need for enrichment of data coming from external sources.  High-level diagram for such data processing usually looks as following:
&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjY6cF3M0n8ZteH4EextITAHEN8kGvn6rfEkRvoL3zy-4oguOIQh-o3DzZZJnKtu1eLMcBOI2JqGzSJOY9ZGB8DTVC7ygHNdxyzaT04KDVC5CY3f9Pq6L7au7ibs_4DQfj40H6O8g/s579/data-enrichment.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;176&quot; data-original-width=&quot;579&quot; height=&quot;189&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjY6cF3M0n8ZteH4EextITAHEN8kGvn6rfEkRvoL3zy-4oguOIQh-o3DzZZJnKtu1eLMcBOI2JqGzSJOY9ZGB8DTVC7ygHNdxyzaT04KDVC5CY3f9Pq6L7au7ibs_4DQfj40H6O8g/w625-h189/data-enrichment.png&quot; width=&quot;625&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;

&lt;p&gt;
For effective enrichment of the data, the database that holds that additional information should provide low-latency, high throughput access to the data.  And Apache Cassandra with its very fast reads by primary key, is the ideal candidate for storing the data that will be used for enrichment (maybe in addition to being a storage of processed data).
&lt;/p&gt;

&lt;p&gt;
Apache Spark is often a base for implementation of the data processing pipelines, for both batch &amp;amp; streaming data.  And it has very good support for Cassandra provided by the &lt;a href=&quot;https://github.com/datastax/spark-cassandra-connector&quot;&gt;Spark Cassandra Connector (SCC)&lt;/a&gt;.  Connector provides access to Cassandra via both RDD &amp;amp; Dataframe APIs, and recently released SCC 2.5 &lt;a href=&quot;https://www.datastax.com/blog/2020/05/advanced-apache-cassandra-analytics-now-open-all&quot;&gt;added a lot of the new functionality&lt;/a&gt; that earlier was available only as a part of &lt;a href=&quot;https://www.datastax.com/products/datastax-enterprise&quot;&gt;DataStax Enterprise&lt;/a&gt;, including support for effective joins with Cassandra for dataframes.
&lt;/p&gt;

&lt;p&gt;
Spark Cassandra Connector has optimizations for executing join of dataframe or RDD with data in Cassandra - data that is used to join are converted into requests to individual partitions or rows that are executed in parallel, avoiding the full scan of the data in Cassandra (there are settings that define the thresholds when SCC will decide to do a full scan vs individual requests). Russell Spitzer has a &lt;a href=&quot;http://www.russellspitzer.com/2018/05/23/DSEDirectJoin/&quot;&gt;very good blog post&lt;/a&gt; about joins in dataframes, including information about its performance.  SCC allows to perform either inner join, or left join between RDD/dataframe and Cassandra.  One of the useful things when performing joins is that it reflects changes done in Cassandra, so you can always have access to the latest data. You can also use caching on data in Cassandra to avoid hitting Cassandra on every join, and periodically refresh the cached version to pull the latest changes.
&lt;/p&gt;

&lt;p&gt;
We&#39;ll use the following table definition and data in most of the examples shown below:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-cql&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;create table&lt;/span&gt; &lt;span style=&quot;color: blue;&quot;&gt;test.jtest1&lt;/span&gt; (
  pk &lt;span style=&quot;color: forestgreen;&quot;&gt;int&lt;/span&gt;,
  c1 &lt;span style=&quot;color: forestgreen;&quot;&gt;int&lt;/span&gt;,
  c2 &lt;span style=&quot;color: forestgreen;&quot;&gt;int&lt;/span&gt;,
  v  &lt;span style=&quot;color: forestgreen;&quot;&gt;text&lt;/span&gt;,
  &lt;span style=&quot;color: #a020f0;&quot;&gt;primary key&lt;/span&gt;(pk, c1, c2));
&lt;span style=&quot;color: #a020f0;&quot;&gt;insert into&lt;/span&gt; test.jtest1(pk, c1, c2, v) &lt;span style=&quot;color: #a020f0;&quot;&gt;values&lt;/span&gt; (1, 1, 1, &lt;span style=&quot;color: #8b2252;&quot;&gt;&#39;t1-1-1&#39;&lt;/span&gt;);
&lt;span style=&quot;color: #a020f0;&quot;&gt;insert into&lt;/span&gt; test.jtest1(pk, c1, c2, v) &lt;span style=&quot;color: #a020f0;&quot;&gt;values&lt;/span&gt; (1, 1, 2, &lt;span style=&quot;color: #8b2252;&quot;&gt;&#39;t1-1-2&#39;&lt;/span&gt;);
&lt;span style=&quot;color: #a020f0;&quot;&gt;insert into&lt;/span&gt; test.jtest1(pk, c1, c2, v) &lt;span style=&quot;color: #a020f0;&quot;&gt;values&lt;/span&gt; (1, 2, 1, &lt;span style=&quot;color: #8b2252;&quot;&gt;&#39;t1-2-1&#39;&lt;/span&gt;);
&lt;span style=&quot;color: #a020f0;&quot;&gt;insert into&lt;/span&gt; test.jtest1(pk, c1, c2, v) &lt;span style=&quot;color: #a020f0;&quot;&gt;values&lt;/span&gt; (1, 2, 2, &lt;span style=&quot;color: #8b2252;&quot;&gt;&#39;t1-2-2&#39;&lt;/span&gt;);
&lt;span style=&quot;color: #a020f0;&quot;&gt;insert into&lt;/span&gt; test.jtest1(pk, c1, c2, v) &lt;span style=&quot;color: #a020f0;&quot;&gt;values&lt;/span&gt; (2, 1, 1, &lt;span style=&quot;color: #8b2252;&quot;&gt;&#39;t2-1-1&#39;&lt;/span&gt;);
&lt;span style=&quot;color: #a020f0;&quot;&gt;insert into&lt;/span&gt; test.jtest1(pk, c1, c2, v) &lt;span style=&quot;color: #a020f0;&quot;&gt;values&lt;/span&gt; (2, 1, 2, &lt;span style=&quot;color: #8b2252;&quot;&gt;&#39;t2-1-2&#39;&lt;/span&gt;);
&lt;span style=&quot;color: #a020f0;&quot;&gt;insert into&lt;/span&gt; test.jtest1(pk, c1, c2, v) &lt;span style=&quot;color: #a020f0;&quot;&gt;values&lt;/span&gt; (2, 2, 1, &lt;span style=&quot;color: #8b2252;&quot;&gt;&#39;t2-2-1&#39;&lt;/span&gt;);
&lt;span style=&quot;color: #a020f0;&quot;&gt;insert into&lt;/span&gt; test.jtest1(pk, c1, c2, v) &lt;span style=&quot;color: #a020f0;&quot;&gt;values&lt;/span&gt; (2, 2, 2, &lt;span style=&quot;color: #8b2252;&quot;&gt;&#39;t2-2-2&#39;&lt;/span&gt;);
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
The join condition could be on:
&lt;/p&gt;
&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;full partition key (&lt;code&gt;pk&lt;/code&gt; column) - in this case, SCC will pull all rows from that partition and create N rows for each input row&lt;/li&gt;
&lt;li&gt;partial primary key - all preceding clustering columns must be specified, for example (&lt;code&gt;pk&lt;/code&gt; + &lt;code&gt;c1&lt;/code&gt; columns) - SCC will pull all rows that match the given range query, and create that many rows for each input row&lt;/li&gt;
&lt;li&gt;full primary key (&lt;code&gt;pk&lt;/code&gt; + &lt;code&gt;c1&lt;/code&gt; + &lt;code&gt;c2&lt;/code&gt;) - in this case SCC will pull only one row, if it exists, and use that data for joining&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
The join isn&#39;t supported on following:
&lt;/p&gt;
&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;partial partition key in case of composite partition key&lt;/li&gt;
&lt;li&gt;on the clustering columns that are not preceded by previous clustering columns, for example, &lt;code&gt;pk&lt;/code&gt; + &lt;code&gt;c2&lt;/code&gt; without &lt;code&gt;c1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;other join types, like, &lt;code&gt;right&lt;/code&gt;, or &lt;code&gt;full&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
In such cases, depending on the API, SCC either throws an error, or falls back to performing a full scan of the Cassandra table and executing the join on the Spark side.
&lt;/p&gt;

&lt;div class=&quot;outline-4&quot; id=&quot;outline-container-orgcb57400&quot;&gt;
&lt;h4 id=&quot;orgcb57400&quot;&gt;&lt;span class=&quot;section-number-4&quot;&gt;&lt;/span&gt;Joins in Dataframe API&lt;/h4&gt;
&lt;div class=&quot;outline-text-4&quot; id=&quot;text-1-3-1&quot;&gt;
&lt;p&gt;
Let&#39;s start with the &lt;a href=&quot;https://github.com/datastax/spark-cassandra-connector/blob/b2.5/doc/14_data_frames.md&quot;&gt;Dataframe API&lt;/a&gt; that is recommended to use in modern Spark.  Support for effective joins of dataframes with data in Cassandra for a long time was only available in DSE Analytics - commercial distribution of Cassandra and Spark developed by DataStax, and open source version of SCC had support for joins only in RDD API.  But with release of &lt;a href=&quot;https://www.datastax.com/blog/2020/05/advanced-apache-cassandra-analytics-now-open-all&quot;&gt;SCC 2.5&lt;/a&gt;, join of dataframes also became available for all users of the open source version of SCC.
&lt;/p&gt;

&lt;p&gt;
Please note that this functionality is not enabled by default (together with some other, like, support for &lt;code&gt;ttl&lt;/code&gt; and &lt;code&gt;writetime&lt;/code&gt; functions).  You need to &lt;a href=&quot;https://github.com/datastax/spark-cassandra-connector/blob/b2.5/doc/14_data_frames.md#special-cassandra-catalyst-rules-since-scc-25&quot;&gt;enable special Catalyst rules&lt;/a&gt; by setting configuration parameter &lt;code&gt;spark.sql.extensions&lt;/code&gt; to a value &lt;code&gt;com.datastax.spark.connector.CassandraSparkExtensions&lt;/code&gt; when starting Spark shell, or submitting a job.  Something like this:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-shell&quot;&gt;bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.1 &lt;span style=&quot;color: #8b2252;&quot;&gt;\&lt;/span&gt;
   --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
If you forget to do this, SCC won&#39;t optimize your join, and it will be performed as usual - by reading all data from Cassandra, and executing join in Spark (with shuffle!).  You can always check that this optimization is applied by running &lt;code&gt;dataframe.explain&lt;/code&gt;, and looking for a string &quot;Cassandra Direct Join&quot; in the physical plan - we&#39;ll see that in the examples below (if you&#39;re running code on the DSE Analytics, it will be &quot;DSE Direct Join&quot;).
&lt;/p&gt;

&lt;p&gt;
Let&#39;s look at specific examples of performing joins of dataframe with data in Cassandra.  We&#39;re using the &lt;code&gt;test.jtest1&lt;/code&gt; defined above to demonstrate the behaviour when we&#39;re using only partition key, and complete or partial primary key.  All dataframe examples have following code in common:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-scala&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;import&lt;/span&gt; spark.implicits.&lt;span style=&quot;color: #a020f0;&quot;&gt;_&lt;/span&gt;
&lt;span style=&quot;color: #a020f0;&quot;&gt;import&lt;/span&gt; org.apache.spark.sql.cassandra.&lt;span style=&quot;color: #a020f0;&quot;&gt;_&lt;/span&gt;

&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;cassdata&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; spark.read.cassandraFormat(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;jtest1&quot;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;test&quot;&lt;/span&gt;).load
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
We&#39;re starting with a partition key only. For that we generate the dataframe with one column with name &lt;code&gt;id&lt;/code&gt; and values from one to four:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-scala&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;toJoin&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; spark.range(&lt;span style=&quot;color: darkcyan;&quot;&gt;1&lt;/span&gt;, &lt;span style=&quot;color: darkcyan;&quot;&gt;5&lt;/span&gt;).select($&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;.cast(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;int&quot;&lt;/span&gt;).as(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;))
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;joined&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; toJoin.join(cassdata, cassdata(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;pk&quot;&lt;/span&gt;) === toJoin(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;))
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
After its execution, we can check that SCC optimized this join by executing &lt;code&gt;explain&lt;/code&gt;:
&lt;/p&gt;

&lt;pre class=&quot;example&quot;&gt;scala&amp;gt; joined.explain
== Physical Plan ==
Cassandra Direct Join [pk = id#2] test.jtest1 - Reading (pk, c1, c2, v) Pushed {}
+- *(1) Project [cast(id#0L as int) AS id#2]
   +- *(1) Range (1, 5, step=1, splits=8)
&lt;/pre&gt;

&lt;p&gt;
and we check that we have correct data pulled from Cassandra.  We can see that SCC pulled all rows from partitions &lt;code&gt;1&lt;/code&gt; and &lt;code&gt;2&lt;/code&gt; and converted them into individual rows in the resulting dataframe:
&lt;/p&gt;

&lt;pre class=&quot;example&quot;&gt;scala&amp;gt; joined.count
res1: Long = 8

scala&amp;gt; joined.show
+---+---+---+---+------+
| id| pk| c1| c2|     v|
+---+---+---+---+------+
|  1|  1|  1|  1|t1-1-1|
|  1|  1|  1|  2|t1-1-2|
|  1|  1|  2|  1|t1-2-1|
|  1|  1|  2|  2|t1-2-2|
|  2|  2|  1|  1|t2-1-1|
|  2|  2|  1|  2|t2-1-2|
|  2|  2|  2|  1|t2-2-1|
|  2|  2|  2|  2|t2-2-2|
+---+---+---+---+------+
&lt;/pre&gt;

&lt;p&gt;
Handling of the partial primary key is similar - we&#39;re generating a similar dataframe but with 2 columns, and joining it:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-scala&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;toJoin&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; spark.range(&lt;span style=&quot;color: darkcyan;&quot;&gt;1&lt;/span&gt;, &lt;span style=&quot;color: darkcyan;&quot;&gt;5&lt;/span&gt;).select($&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;.cast(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;int&quot;&lt;/span&gt;).as(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;))
  .select($&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;, $&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;.as(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;cc1&quot;&lt;/span&gt;))
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;joined&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; toJoin.join(cassdata, cassdata(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;pk&quot;&lt;/span&gt;) === toJoin(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;) 
   &amp;amp;&amp;amp; cassdata(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;c1&quot;&lt;/span&gt;) === toJoin(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;cc1&quot;&lt;/span&gt;))
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
Again we see that SCC optimized that query:
&lt;/p&gt;

&lt;pre class=&quot;example&quot;&gt;scala&amp;gt; joined.explain
== Physical Plan ==
Cassandra Direct Join [pk = id#30, c1 = cc1#32] test.jtest1 - Reading (pk, c1, c2, v) Pushed {}
+- *(1) Project [cast(id#28L as int) AS id#30, cast(id#28L as int) AS cc1#32]
   +- *(1) Range (1, 5, step=1, splits=8)

scala&amp;gt; joined.count
res8: Long = 4

scala&amp;gt; joined.show
+---+---+---+---+---+------+
| id|cc1| pk| c1| c2|     v|
+---+---+---+---+---+------+
|  1|  1|  1|  1|  1|t1-1-1|
|  1|  1|  1|  1|  2|t1-1-2|
|  2|  2|  2|  2|  1|t2-2-1|
|  2|  2|  2|  2|  2|t2-2-2|
+---+---+---+---+---+------+
&lt;/pre&gt;

&lt;p&gt;
And with full primary key behaviour is the same:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-scala&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;toJoin&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; spark.range(&lt;span style=&quot;color: darkcyan;&quot;&gt;1&lt;/span&gt;, &lt;span style=&quot;color: darkcyan;&quot;&gt;5&lt;/span&gt;).select($&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;.cast(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;int&quot;&lt;/span&gt;).as(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;))
  .select($&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;, $&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;.as(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;cc1&quot;&lt;/span&gt;), $&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;.as(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;cc2&quot;&lt;/span&gt;))
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;joined&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; toJoin.join(cassdata, cassdata(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;pk&quot;&lt;/span&gt;) === toJoin(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;) 
   &amp;amp;&amp;amp; cassdata(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;c1&quot;&lt;/span&gt;) === toJoin(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;cc1&quot;&lt;/span&gt;) 
   &amp;amp;&amp;amp; cassdata(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;c2&quot;&lt;/span&gt;) === toJoin(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;cc2&quot;&lt;/span&gt;))
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
We have a one-to-one correspondence of rows from the generated dataframe with rows in Cassandra:
&lt;/p&gt;

&lt;pre class=&quot;example&quot;&gt;scala&amp;gt; joined.explain
== Physical Plan ==
Cassandra Direct Join [pk = id#318, c1 = cc1#320, c2 = cc2#321] test.jtest1 - Reading (pk, c1, c2, v) Pushed {}
+- *(1) Project [cast(id#316L as int) AS id#318, cast(id#316L as int) AS cc1#320, cast(id#316L as int) AS cc2#321]
   +- *(1) Range (1, 5, step=1, splits=8)

scala&amp;gt; joined.count
res13: Long = 2

scala&amp;gt; joined.show
+---+---+---+---+---+---+------+
| id|cc1|cc2| pk| c1| c2|     v|
+---+---+---+---+---+---+------+
|  1|  1|  1|  1|  1|  1|t1-1-1|
|  2|  2|  2|  2|  2|  2|t2-2-2|
+---+---+---+---+---+---+------+
&lt;/pre&gt;

&lt;p&gt;
Left join isn&#39;t much different - we only need to explicitly specify it with the &quot;left&quot; or &quot;leftouter&quot; argument:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-scala&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;toJoin&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; spark.range(&lt;span style=&quot;color: darkcyan;&quot;&gt;1&lt;/span&gt;, &lt;span style=&quot;color: darkcyan;&quot;&gt;5&lt;/span&gt;).select($&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;.cast(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;int&quot;&lt;/span&gt;).as(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;))
  .select($&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;, $&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;.as(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;cc1&quot;&lt;/span&gt;), $&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;.as(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;cc2&quot;&lt;/span&gt;))
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;joined&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; toJoin.join(cassdata, cassdata(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;pk&quot;&lt;/span&gt;) === toJoin(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;) 
       &amp;amp;&amp;amp; cassdata(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;c1&quot;&lt;/span&gt;) === toJoin(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;cc1&quot;&lt;/span&gt;) 
       &amp;amp;&amp;amp; cassdata(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;c2&quot;&lt;/span&gt;) === toJoin(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;cc2&quot;&lt;/span&gt;), 
   &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;left&quot;&lt;/span&gt;)
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
And again, we see that SCC optimized the query.  The only difference is that it retains the rows for which we didn&#39;t find matching rows in Cassandra:
&lt;/p&gt;

&lt;pre class=&quot;example&quot;&gt;scala&amp;gt; joined.explain
== Physical Plan ==
Cassandra Direct Join [pk = id#349, c1 = cc1#351, c2 = cc2#352] test.jtest1 - Reading (pk, c1, c2, v) Pushed {}
+- *(1) Project [cast(id#347L as int) AS id#349, cast(id#347L as int) AS cc1#351, cast(id#347L as int) AS cc2#352]
   +- *(1) Range (1, 5, step=1, splits=8)

scala&amp;gt; joined.count
res5: Long = 4

scala&amp;gt; joined.show
+---+---+---+----+----+----+------+
| id|cc1|cc2|  pk|  c1|  c2|     v|
+---+---+---+----+----+----+------+
|  1|  1|  1|   1|   1|   1|t1-1-1|
|  2|  2|  2|   2|   2|   2|t2-2-2|
|  3|  3|  3|null|null|null|  null|
|  4|  4|  4|null|null|null|  null|
+---+---+---+----+----+----+------+
&lt;/pre&gt;

&lt;p&gt;
But if we try to perform right or full join:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-scala&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;joined&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; toJoin.join(cassdata, cassdata(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;pk&quot;&lt;/span&gt;) === toJoin(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;) 
       &amp;amp;&amp;amp; cassdata(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;c1&quot;&lt;/span&gt;) === toJoin(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;cc1&quot;&lt;/span&gt;) 
       &amp;amp;&amp;amp; cassdata(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;c2&quot;&lt;/span&gt;) === toJoin(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;cc2&quot;&lt;/span&gt;), 
   &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;right&quot;&lt;/span&gt;)
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
then we&#39;ll see that it&#39;s executed by reading the data from the whole table, and performing the join on the Spark level (this is an example for the &quot;right&quot; join; the plan for the &quot;full&quot; join looks slightly different):
&lt;/p&gt;

&lt;pre class=&quot;example&quot;&gt;scala&amp;gt; joined.explain
== Physical Plan ==
*(2) BroadcastHashJoin [id#56, cc1#58, cc2#59], [pk#4, c1#5, c2#6], RightOuter, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(input[0, int, false], input[1, int, false], input[2, int, false]))
:  +- *(1) Project [cast(id#54L as int) AS id#56, cast(id#54L as int) AS cc1#58, cast(id#54L as int) AS cc2#59]
:     +- *(1) Range (1, 5, step=1, splits=8)
+- *(2) Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [pk#4,c1#5,c2#6,v#7] PushedFilters: [], ReadSchema: struct&amp;lt;pk:int,c1:int,c2:int,v:string&amp;gt;

scala&amp;gt; joined.show
+----+----+----+---+---+---+------+
|  id| cc1| cc2| pk| c1| c2|     v|
+----+----+----+---+---+---+------+
|   1|   1|   1|  1|  1|  1|t1-1-1|
|null|null|null|  1|  1|  2|t1-1-2|
|null|null|null|  1|  2|  1|t1-2-1|
|null|null|null|  1|  2|  2|t1-2-2|
|null|null|null|  2|  1|  1|t2-1-1|
|null|null|null|  2|  1|  2|t2-1-2|
|null|null|null|  2|  2|  1|t2-2-1|
|   2|   2|   2|  2|  2|  2|t2-2-2|
+----+----+----+---+---+---+------+
&lt;/pre&gt;

&lt;p&gt;
As mentioned above, in case of a partial primary key, all preceding clustering columns need to be specified in the joining condition as well.  If we don&#39;t do that, like in this example that joins on the partition key &amp;amp; the second clustering column:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-scala&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;toJoin&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; spark.range(&lt;span style=&quot;color: darkcyan;&quot;&gt;1&lt;/span&gt;, &lt;span style=&quot;color: darkcyan;&quot;&gt;5&lt;/span&gt;).select($&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;.cast(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;int&quot;&lt;/span&gt;).as(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;))
  .select($&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;, $&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;.as(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;cc2&quot;&lt;/span&gt;))
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;joined&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; toJoin.join(cassdata, cassdata(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;pk&quot;&lt;/span&gt;) === toJoin(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;) 
  &amp;amp;&amp;amp; cassdata(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;c2&quot;&lt;/span&gt;) === toJoin(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;cc2&quot;&lt;/span&gt;))
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
then we&#39;ll get an error saying that we can&#39;t do that (please note that the error will be thrown only when reading the data, not when you&#39;re declaring the join):
&lt;/p&gt;

&lt;pre class=&quot;example&quot;&gt;scala&amp;gt; joined.show
java.lang.IllegalArgumentException: Can&#39;t pushdown join on column ColumnDef(c2,ClusteringColumn(1,ASC),IntType) without also specifying [ Set(ColumnDef(c1,ClusteringColumn(0,ASC),IntType)) ]
  at com.datastax.spark.connector.rdd.AbstractCassandraJoin$class.checkValidJoin(AbstractCassandraJoin.scala:114)
  at com.datastax.spark.connector.rdd.CassandraJoinRDD.checkValidJoin(CassandraJoinRDD.scala:26)
  at com.datastax.spark.connector.rdd.AbstractCassandraJoin$class.getPartitions(AbstractCassandraJoin.scala:210)
  at com.datastax.spark.connector.rdd.CassandraJoinRDD.getPartitions(CassandraJoinRDD.scala:26)
&lt;/pre&gt;

&lt;p&gt;
If you still want to do it, then you need to set &lt;code&gt;directJoinSetting&lt;/code&gt; to &lt;code&gt;off&lt;/code&gt; when reading the data, like this:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-scala&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;cassdata&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; spark.read.cassandraFormat(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;jtest1&quot;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;test&quot;&lt;/span&gt;)
  .option(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;directJoinSetting&quot;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;off&quot;&lt;/span&gt;).load
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;toJoin&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; spark.range(&lt;span style=&quot;color: darkcyan;&quot;&gt;1&lt;/span&gt;, &lt;span style=&quot;color: darkcyan;&quot;&gt;5&lt;/span&gt;).select($&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;.cast(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;int&quot;&lt;/span&gt;).as(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;))
  .select($&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;, $&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;.as(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;cc2&quot;&lt;/span&gt;))
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;joined&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; toJoin.join(cassdata, cassdata(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;pk&quot;&lt;/span&gt;) === toJoin(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;) 
  &amp;amp;&amp;amp; cassdata(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;c2&quot;&lt;/span&gt;) === toJoin(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;cc2&quot;&lt;/span&gt;))
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
and this will force SCC to perform full table scan:
&lt;/p&gt;

&lt;pre class=&quot;example&quot;&gt;scala&amp;gt; joined.explain
== Physical Plan ==
*(2) BroadcastHashJoin [id#195, cc2#197], [pk#185, c2#187], Inner, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List((shiftleft(cast(input[0, int, false] as bigint), 32) | (cast(input[1, int, false] as bigint) &amp;amp; 4294967295))))
:  +- *(1) Project [cast(id#193L as int) AS id#195, cast(id#193L as int) AS cc2#197]
:     +- *(1) Range (1, 5, step=1, splits=8)
+- *(2) Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [pk#185,c1#186,c2#187,v#188] PushedFilters: [], ReadSchema: struct&amp;lt;pk:int,c1:int,c2:int,v:string&amp;gt;

scala&amp;gt; joined.show
+---+---+---+---+---+------+
| id|cc2| pk| c1| c2|     v|
+---+---+---+---+---+------+
|  1|  1|  1|  1|  1|t1-1-1|
|  1|  1|  1|  2|  1|t1-2-1|
|  2|  2|  2|  1|  2|t2-1-2|
|  2|  2|  2|  2|  2|t2-2-2|
+---+---+---+---+---+------+
&lt;/pre&gt;

&lt;p&gt;
Theoretically direct join should also work for Spark SQL, like this:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-scala&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;toJoin&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; spark.range(&lt;span style=&quot;color: darkcyan;&quot;&gt;1&lt;/span&gt;, &lt;span style=&quot;color: darkcyan;&quot;&gt;5&lt;/span&gt;).select($&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;.cast(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;int&quot;&lt;/span&gt;).as(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;id&quot;&lt;/span&gt;))
toJoin.createOrReplaceTempView(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;tojoin&quot;&lt;/span&gt;)

spark.sql(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;&quot;&quot;CREATE OR REPLACE TEMPORARY VIEW cassdata&lt;/span&gt;
&lt;span style=&quot;color: #8b2252;&quot;&gt;  USING org.apache.spark.sql.cassandra&lt;/span&gt;
&lt;span style=&quot;color: #8b2252;&quot;&gt;  OPTIONS (table &quot;jtest1&quot;, keyspace &quot;test&quot;, pushdown &quot;true&quot;, directJoinSetting &quot;auto&quot;)&quot;&quot;&quot;&lt;/span&gt;)
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;joined&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; spark.sql(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;select * from tojoin tj inner join cassdata cd on tj.id = cd.pk&quot;&lt;/span&gt;)
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
but if we look into execution plan, we can see that it doesn&#39;t happen:
&lt;/p&gt;

&lt;pre class=&quot;example&quot;&gt;scala&amp;gt; joined.explain
== Physical Plan ==
*(2) BroadcastHashJoin [id#552], [pk#554], Inner, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
:  +- *(1) Project [cast(id#550L as int) AS id#552]
:     +- *(1) Range (1, 5, step=1, splits=8)
+- *(2) Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [pk#554,c1#555,c2#556,v#557] PushedFilters: [], ReadSchema: struct&amp;lt;pk:int,c1:int,c2:int,v:string&amp;gt;
&lt;/pre&gt;

&lt;p&gt;
This is investigated as &lt;a href=&quot;https://datastax-oss.atlassian.net/browse/SPARKC-613&quot;&gt;SPARKC-613&lt;/a&gt;, and hopefully will be fixed.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div class=&quot;outline-4&quot; id=&quot;outline-container-orgc69cb7b&quot;&gt;
&lt;h4 id=&quot;orgc69cb7b&quot;&gt;&lt;span class=&quot;section-number-4&quot;&gt;&lt;/span&gt;Joins in RDD API&lt;/h4&gt;
&lt;div class=&quot;outline-text-4&quot; id=&quot;text-1-3-2&quot;&gt;
&lt;p&gt;
For a long time, the RDD API was the only way to perform effective joins with data in Cassandra.   Spark Cassandra Connector &lt;a href=&quot;https://github.com/datastax/spark-cassandra-connector/blob/b2.5/doc/2_loading.md#using-joinwithcassandratable&quot;&gt;provides two functions for performing joins&lt;/a&gt;: &lt;code&gt;joinWithCassandraTable&lt;/code&gt; and &lt;code&gt;leftJoinWithCassandraTable&lt;/code&gt; - they have existed in SCC for a long time (since version 1.2, released more than 5 years ago).  When executed, both functions return an instance of a special RDD type: &lt;code&gt;CassandraJoinRDD&lt;/code&gt; - it has all functions of the &lt;code&gt;CassandraRDD&lt;/code&gt; API, plus one function (&lt;code&gt;.on&lt;/code&gt;) that specifies the list of columns on which the join should be performed.
&lt;/p&gt;

&lt;p&gt;
Let&#39;s re-implement the same examples as above but with RDD API.  We&#39;re starting with partition key only:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-scala&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;import&lt;/span&gt; com.datastax.spark.connector.&lt;span style=&quot;color: #a020f0;&quot;&gt;_&lt;/span&gt;

&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;toJoin&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; sc.parallelize(&lt;span style=&quot;color: darkcyan;&quot;&gt;1&lt;/span&gt; until &lt;span style=&quot;color: darkcyan;&quot;&gt;5&lt;/span&gt;).map(x &lt;span style=&quot;color: #a020f0;&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span style=&quot;color: darkcyan;&quot;&gt;Tuple1&lt;/span&gt;(x.toInt))
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;joined&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; toJoin.joinWithCassandraTable(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;test&quot;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;jtest1&quot;&lt;/span&gt;)
  .on(&lt;span style=&quot;color: darkcyan;&quot;&gt;SomeColumns&lt;/span&gt;(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;pk&quot;&lt;/span&gt;))
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
Please note that we need to explicitly create &lt;code&gt;Tuple1&lt;/code&gt; objects, as &lt;code&gt;joinWithCassandraTable&lt;/code&gt; expects an RDD of tuples.  We can check that we got the correct RDD type as the result of execution:
&lt;/p&gt;

&lt;pre class=&quot;example&quot;&gt;scala&amp;gt; joined.toDebugString
res21: String =
(8) CassandraJoinRDD[150] at RDD at CassandraRDD.scala:18 []
 |  ParallelCollectionRDD[147] at parallelize at &amp;lt;console&amp;gt;:33 []
&lt;/pre&gt;

&lt;p&gt;
The type of &lt;code&gt;joined&lt;/code&gt; is &lt;code&gt;CassandraJoinRDD[(Int,),CassandraRow]&lt;/code&gt;, and we can check that with &lt;code&gt;collect&lt;/code&gt;:
&lt;/p&gt;

&lt;pre class=&quot;example&quot;&gt;scala&amp;gt; joined.collect
res22: Array[((Int,), com.datastax.spark.connector.CassandraRow)] = Array(
 ((1,),CassandraRow{pk: 1, c1: 1, c2: 1, v: t1-1-1}), 
 ((1,),CassandraRow{pk: 1, c1: 1, c2: 2, v: t1-1-2}), 
 ((1,),CassandraRow{pk: 1, c1: 2, c2: 1, v: t1-2-1}), 
 ((1,),CassandraRow{pk: 1, c1: 2, c2: 2, v: t1-2-2}), 
 ((2,),CassandraRow{pk: 2, c1: 1, c2: 1, v: t2-1-1}), 
 ((2,),CassandraRow{pk: 2, c1: 1, c2: 2, v: t2-1-2}), 
 ((2,),CassandraRow{pk: 2, c1: 2, c2: 1, v: t2-2-1}), 
 ((2,),CassandraRow{pk: 2, c1: 2, c2: 2, v: t2-2-2})
)
&lt;/pre&gt;

&lt;p&gt;
We can access data in &lt;code&gt;CassandraRow&lt;/code&gt; using the standard functions, like, &lt;a href=&quot;https://github.com/datastax/spark-cassandra-connector/blob/b2.5/doc/2_loading.md#reading-primitive-column-values&quot;&gt;getInt, getString, get, …&lt;/a&gt;, but it&#39;s not always handy.  To simplify work with results, both functions &lt;a href=&quot;https://github.com/datastax/spark-cassandra-connector/blob/b2.5/doc/4_mapper.md&quot;&gt;support mapping of the rows into tuples or into instances of specific (case) classes&lt;/a&gt;, that are much easier to use when doing processing of obtained data:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-scala&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;case&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;class&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;CassData&lt;/span&gt;(pk&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;Int&lt;/span&gt;, c1&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;Int&lt;/span&gt;, c2&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;Int&lt;/span&gt;, v&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;String&lt;/span&gt;)
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;joined&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; toJoin.joinWithCassandraTable[&lt;span style=&quot;color: darkcyan;&quot;&gt;CassData&lt;/span&gt;](&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;test&quot;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;jtest1&quot;&lt;/span&gt;)
  .on(&lt;span style=&quot;color: darkcyan;&quot;&gt;SomeColumns&lt;/span&gt;(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;pk&quot;&lt;/span&gt;))
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
and we&#39;ll get data in Cassandra mapped into our case class:
&lt;/p&gt;

&lt;pre class=&quot;example&quot;&gt;scala&amp;gt; joined.collect
res23: Array[((Int,), CassData)] = Array(
 ((1,),CassData(1,1,1,t1-1-1)),
 ((1,),CassData(1,1,2,t1-1-2)),
 ((1,),CassData(1,2,1,t1-2-1)),
 ((1,),CassData(1,2,2,t1-2-2)),
 ((2,),CassData(2,1,1,t2-1-1)),
 ((2,),CassData(2,1,2,t2-1-2)),
 ((2,),CassData(2,2,1,t2-2-1)),
 ((2,),CassData(2,2,2,t2-2-2))
)
&lt;/pre&gt;

&lt;p&gt;
With a partial primary key, the behaviour is similar - create a two-element tuple, and call &lt;code&gt;joinWithCassandraTable&lt;/code&gt;:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-scala&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;toJoin&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; sc.parallelize(&lt;span style=&quot;color: darkcyan;&quot;&gt;1&lt;/span&gt; until &lt;span style=&quot;color: darkcyan;&quot;&gt;5&lt;/span&gt;).map(x &lt;span style=&quot;color: #a020f0;&quot;&gt;=&amp;gt;&lt;/span&gt; (x.toInt, x.toInt))
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;joined&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; toJoin.joinWithCassandraTable[&lt;span style=&quot;color: darkcyan;&quot;&gt;CassData&lt;/span&gt;](&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;test&quot;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;jtest1&quot;&lt;/span&gt;)
  .on(&lt;span style=&quot;color: darkcyan;&quot;&gt;SomeColumns&lt;/span&gt;(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;pk&quot;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;c1&quot;&lt;/span&gt;))
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
and as expected, we&#39;re getting back four rows:
&lt;/p&gt;

&lt;pre class=&quot;example&quot;&gt;scala&amp;gt; joined.collect
res28: Array[((Int, Int), CassData)] = Array(
 ((1,1),CassData(1,1,1,t1-1-1)),
 ((1,1),CassData(1,1,2,t1-1-2)),
 ((2,2),CassData(2,2,1,t2-2-1)),
 ((2,2),CassData(2,2,2,t2-2-2))
)
&lt;/pre&gt;

&lt;p&gt;
Similarly, for full primary key:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-scala&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;toJoin&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; sc.parallelize(&lt;span style=&quot;color: darkcyan;&quot;&gt;1&lt;/span&gt; until &lt;span style=&quot;color: darkcyan;&quot;&gt;5&lt;/span&gt;).map(x &lt;span style=&quot;color: #a020f0;&quot;&gt;=&amp;gt;&lt;/span&gt; (x.toInt, x.toInt, x.toInt))
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;joined&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; toJoin.joinWithCassandraTable[&lt;span style=&quot;color: darkcyan;&quot;&gt;CassData&lt;/span&gt;](&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;test&quot;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;jtest1&quot;&lt;/span&gt;)
  .on(&lt;span style=&quot;color: darkcyan;&quot;&gt;SomeColumns&lt;/span&gt;(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;pk&quot;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;c1&quot;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;c2&quot;&lt;/span&gt;))
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
that gives us two rows:
&lt;/p&gt;

&lt;pre class=&quot;example&quot;&gt;scala&amp;gt; joined.collect
res29: Array[((Int, Int, Int), CassData)] = Array(
 ((1,1,1),CassData(1,1,1,t1-1-1)),
 ((2,2,2),CassData(2,2,2,t2-2-2))
)
&lt;/pre&gt;

&lt;p&gt;
Left join is done the same way as inner join, only another function is used:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-scala&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;toJoin&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; sc.parallelize(&lt;span style=&quot;color: darkcyan;&quot;&gt;1&lt;/span&gt; until &lt;span style=&quot;color: darkcyan;&quot;&gt;5&lt;/span&gt;).map(x &lt;span style=&quot;color: #a020f0;&quot;&gt;=&amp;gt;&lt;/span&gt; (x.toInt, x.toInt))
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;joined&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; toJoin.leftJoinWithCassandraTable[&lt;span style=&quot;color: darkcyan;&quot;&gt;CassData&lt;/span&gt;](&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;test&quot;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;jtest1&quot;&lt;/span&gt;)
  .on(&lt;span style=&quot;color: darkcyan;&quot;&gt;SomeColumns&lt;/span&gt;(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;pk&quot;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;c1&quot;&lt;/span&gt;))
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
But the type of &lt;code&gt;joined&lt;/code&gt; is &lt;code&gt;CassandraLeftJoinRDD&lt;/code&gt; instead of &lt;code&gt;CassandraJoinRDD&lt;/code&gt;, and instead of a &lt;code&gt;CassandraRow&lt;/code&gt; or an instance of our class, we&#39;re getting an &lt;code&gt;Option&lt;/code&gt;, as the data for a given key may be absent in Cassandra:
&lt;/p&gt;

&lt;pre class=&quot;example&quot;&gt;scala&amp;gt; joined.collect
res38: Array[((Int, Int, Int), Option[CassData])] = Array(
 ((1,1,1),Some(CassData(1,1,1,t1-1-1))),
 ((1,1,1),Some(CassData(1,1,2,t1-1-2))),
 ((2,2,2),Some(CassData(2,2,1,t2-2-1))),
 ((2,2,2),Some(CassData(2,2,2,t2-2-2))),
 ((3,3,3),None),
 ((4,4,4),None)
)
&lt;/pre&gt;

&lt;p&gt;
All examples above used tuples to represent the data to join with.  But we can also use case classes here - we only need to have field names matching the columns in the table, like here (please note that we need to explicitly specify the list of columns, otherwise the functions will use just the partition key - it&#39;s the same as for tuples):
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-scala&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;case&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;class&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;ToJoin&lt;/span&gt;(pk&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;Int&lt;/span&gt;, c1&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;Int&lt;/span&gt;)
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;toJoin&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; sc.parallelize(&lt;span style=&quot;color: darkcyan;&quot;&gt;1&lt;/span&gt; until &lt;span style=&quot;color: darkcyan;&quot;&gt;5&lt;/span&gt;).map(x &lt;span style=&quot;color: #a020f0;&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span style=&quot;color: darkcyan;&quot;&gt;ToJoin&lt;/span&gt;(x.toInt, x.toInt))
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;joined&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; toJoin.leftJoinWithCassandraTable[&lt;span style=&quot;color: darkcyan;&quot;&gt;CassData&lt;/span&gt;](&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;test&quot;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;jtest1&quot;&lt;/span&gt;)
  .on(&lt;span style=&quot;color: darkcyan;&quot;&gt;SomeColumns&lt;/span&gt;(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;pk&quot;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;c1&quot;&lt;/span&gt;))
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
In this case, it&#39;s easier to work with an instance of the case class instead of a tuple:
&lt;/p&gt;

&lt;pre class=&quot;example&quot;&gt;scala&amp;gt; joined.collect
res48: Array[(ToJoin, Option[CassData])] = Array(
 (ToJoin(1,1),Some(CassData(1,1,1,t1-1-1))),
 (ToJoin(1,1),Some(CassData(1,1,2,t1-1-2))),
 (ToJoin(2,2),Some(CassData(2,2,1,t2-2-1))),
 (ToJoin(2,2),Some(CassData(2,2,2,t2-2-2))),
 (ToJoin(3,3),None),
 (ToJoin(4,4),None))
&lt;/pre&gt;

&lt;p&gt;
Besides simple usage that was shown above, SCC provides more capabilities in the RDD API.  For example, we can &lt;a href=&quot;https://github.com/datastax/spark-cassandra-connector/blob/b2.5/doc/2_loading.md#performing-efficient-joins-with-cassandra-tables-since-12&quot;&gt;repartition RDD that we join with data in Cassandra&lt;/a&gt; to match partitioning of the data in Cassandra, so we can avoid non-local reads from Cassandra.
&lt;/p&gt;

&lt;p&gt;
Also, we can use &lt;a href=&quot;https://github.com/datastax/spark-cassandra-connector/blob/b2.5/doc/2_loading.md#cassandra-operations-on-a-cassandrajoinrdd&quot;&gt;any function of the &lt;code&gt;CassandraRDD&lt;/code&gt; API&lt;/a&gt;, such as, &lt;code&gt;select&lt;/code&gt;, &lt;code&gt;where&lt;/code&gt;, &lt;code&gt;limit&lt;/code&gt;, etc.  For example, we can limit the number of the returned rows by putting the condition onto the &lt;code&gt;c2&lt;/code&gt; column that will be applied to every partition (please note that it should be valid CQL expression!):
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-scala&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;case&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;class&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;ToJoin&lt;/span&gt;(pk&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;Int&lt;/span&gt;, c1&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;Int&lt;/span&gt;)
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;toJoin&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; sc.parallelize(&lt;span style=&quot;color: darkcyan;&quot;&gt;1&lt;/span&gt; until &lt;span style=&quot;color: darkcyan;&quot;&gt;5&lt;/span&gt;).map(x &lt;span style=&quot;color: #a020f0;&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span style=&quot;color: darkcyan;&quot;&gt;ToJoin&lt;/span&gt;(x.toInt, x.toInt))
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;joined&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; toJoin.leftJoinWithCassandraTable[&lt;span style=&quot;color: darkcyan;&quot;&gt;CassData&lt;/span&gt;](&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;test&quot;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;jtest1&quot;&lt;/span&gt;)
  .on(&lt;span style=&quot;color: darkcyan;&quot;&gt;SomeColumns&lt;/span&gt;(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;pk&quot;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;c1&quot;&lt;/span&gt;)).where(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;c2 &amp;gt; 1&quot;&lt;/span&gt;)
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
so we get less data than in the previous example:
&lt;/p&gt;

&lt;pre class=&quot;example&quot;&gt;scala&amp;gt; joined.collect
res55: Array[(ToJoin, Option[CassData])] = Array(
 (ToJoin(1,1),Some(CassData(1,1,2,t1-1-2))),
 (ToJoin(2,2),Some(CassData(2,2,2,t2-2-2))),
 (ToJoin(3,3),None),
 (ToJoin(4,4),None)
)
&lt;/pre&gt;

&lt;p&gt;
The &lt;code&gt;limit(N)&lt;/code&gt; call will return max N rows per Spark partition.  While &lt;code&gt;perPartitionLimit(N)&lt;/code&gt; will return max N rows per Cassandra partition.  This is quite useful, for example, when we&#39;re doing joins with some &quot;historical&quot; data, where we have multiple rows per partition, but usually need to join with the latest entry for a given partition.  For example, we may have a table that contains information about historical stock prices (sorted by timestamp in descending order):
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-cql&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;create table&lt;/span&gt; &lt;span style=&quot;color: blue;&quot;&gt;test.stock_price&lt;/span&gt; (
  ticker &lt;span style=&quot;color: forestgreen;&quot;&gt;text&lt;/span&gt;,
  tm &lt;span style=&quot;color: forestgreen;&quot;&gt;timestamp&lt;/span&gt;,
  price &lt;span style=&quot;color: forestgreen;&quot;&gt;double&lt;/span&gt;,
  &lt;span style=&quot;color: #a020f0;&quot;&gt;primary key&lt;/span&gt;(ticker, tm)
) &lt;span style=&quot;color: #a020f0;&quot;&gt;with clustering order by&lt;/span&gt; (tm &lt;span style=&quot;color: #a020f0;&quot;&gt;desc&lt;/span&gt;);
&lt;span style=&quot;color: #a020f0;&quot;&gt;insert into&lt;/span&gt; test.stock_price (ticker, tm, price) &lt;span style=&quot;color: #a020f0;&quot;&gt;&lt;br /&gt;  values&lt;/span&gt; (&lt;span style=&quot;color: #8b2252;&quot;&gt;&#39;MSFT&#39;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&#39;2020-07-25T10:00:00Z&#39;&lt;/span&gt;, 100.0);
&lt;span style=&quot;color: #a020f0;&quot;&gt;insert into&lt;/span&gt; test.stock_price (ticker, tm, price) &lt;span style=&quot;color: #a020f0;&quot;&gt;&lt;br /&gt;  values&lt;/span&gt; (&lt;span style=&quot;color: #8b2252;&quot;&gt;&#39;MSFT&#39;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&#39;2020-07-25T11:00:00Z&#39;&lt;/span&gt;, 101.0);
&lt;span style=&quot;color: #a020f0;&quot;&gt;insert into&lt;/span&gt; test.stock_price (ticker, tm, price) &lt;span style=&quot;color: #a020f0;&quot;&gt;&lt;br /&gt;  values&lt;/span&gt; (&lt;span style=&quot;color: #8b2252;&quot;&gt;&#39;MSFT&#39;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&#39;2020-07-25T12:00:00Z&#39;&lt;/span&gt;, 99.0);
&lt;span style=&quot;color: #a020f0;&quot;&gt;insert into&lt;/span&gt; test.stock_price (ticker, tm, price) &lt;span style=&quot;color: #a020f0;&quot;&gt;&lt;br /&gt;  values&lt;/span&gt; (&lt;span style=&quot;color: #8b2252;&quot;&gt;&#39;MSFT&#39;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&#39;2020-07-25T13:00:00Z&#39;&lt;/span&gt;, 97.0);
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
For example, we may have data about stocks coming from some source.  In this case we can join the incoming data with the latest prices for the given shares, and perform some calculation:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-scala&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;case&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;class&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;StockData&lt;/span&gt;(ticker&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;String&lt;/span&gt;, currentPrice&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;Double&lt;/span&gt;)
&lt;span style=&quot;color: #a020f0;&quot;&gt;case&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;class&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;StockPrice&lt;/span&gt;(ticker&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;String&lt;/span&gt;, tm&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;java&lt;/span&gt;.time.&lt;span style=&quot;color: darkcyan;&quot;&gt;Instant&lt;/span&gt;, price&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;Double&lt;/span&gt;)

&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;stocks&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; sc.parallelize(&lt;span style=&quot;color: darkcyan;&quot;&gt;Seq&lt;/span&gt;(&lt;span style=&quot;color: darkcyan;&quot;&gt;StockData&lt;/span&gt;(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;MSFT&quot;&lt;/span&gt;, &lt;span style=&quot;color: darkcyan;&quot;&gt;100&lt;/span&gt;), &lt;span style=&quot;color: darkcyan;&quot;&gt;StockData&lt;/span&gt;(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;INTC&quot;&lt;/span&gt;, &lt;span style=&quot;color: darkcyan;&quot;&gt;200&lt;/span&gt;)))
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;joined&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; stocks.leftJoinWithCassandraTable[&lt;span style=&quot;color: darkcyan;&quot;&gt;StockPrice&lt;/span&gt;](&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;test&quot;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;stock_price&quot;&lt;/span&gt;)
   .on(&lt;span style=&quot;color: darkcyan;&quot;&gt;SomeColumns&lt;/span&gt;(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;ticker&quot;&lt;/span&gt;)).perPartitionLimit(&lt;span style=&quot;color: darkcyan;&quot;&gt;1&lt;/span&gt;)
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
After execution, we can see that we pulled the latest price for Microsoft shares:
&lt;/p&gt;

&lt;pre class=&quot;example&quot;&gt;scala&amp;gt; joined.collect
res37: Array[(StockData, Option[StockPrice])] = Array(
  (StockData(MSFT,100.0),Some(StockPrice(MSFT,2020-07-25T13:00:00Z,97.0))), 
  (StockData(INTC,200.0),None)
)
&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div class=&quot;outline-4&quot; id=&quot;outline-container-orgf0c36f0&quot;&gt;
&lt;h4 id=&quot;orgf0c36f0&quot;&gt;Configuration options, optimizations, etc.&lt;/h4&gt;
&lt;div class=&quot;outline-text-4&quot; id=&quot;text-1-3-3&quot;&gt;
&lt;p&gt;
Spark Cassandra Connector has a number of configuration parameters that may affect execution of the joins.  Some of the configuration parameters could be specified globally, via an instance of the &lt;code&gt;ReadConf&lt;/code&gt; class or via &lt;code&gt;option&lt;/code&gt;, while others could be specified only as a table option.  
&lt;/p&gt;

&lt;p&gt;
With the &lt;code&gt;spark.cassandra.concurrent.reads&lt;/code&gt; parameter we can control how many parallel requests will be sent per core when executing a join (default: 512).  For example, we can change it to a lower value if we want to decrease the load on the cluster caused by joins, although this will increase processing time.
&lt;/p&gt;

&lt;p&gt;
Table-only options include (only for Dataframe API!):
&lt;/p&gt;
&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;&lt;code&gt;directJoinSetting&lt;/code&gt; with possible values &lt;code&gt;on&lt;/code&gt; (always perform direct join), &lt;code&gt;off&lt;/code&gt; (disable direct join), and &lt;code&gt;auto&lt;/code&gt; - when SCC decides about use of direct join based on the specified threshold between size of the data to join, and data in Cassandra (default: &lt;code&gt;auto&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;directJoinSizeRatio&lt;/code&gt; defines a threshold for switching to full scan (default: &lt;code&gt;0.9&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div class=&quot;outline-4&quot; id=&quot;outline-container-org7ff1a77&quot;&gt;
&lt;h4 id=&quot;org7ff1a77&quot;&gt;&lt;span class=&quot;section-number-4&quot;&gt;&lt;/span&gt;More practical example&lt;/h4&gt;
&lt;div class=&quot;outline-text-4&quot; id=&quot;text-1-3-4&quot;&gt;
&lt;p&gt;
The previous sections showed the basic usage of the joins with data in Cassandra.  This section is trying to show how to perform joins when processing streaming data.  Following project (&lt;a href=&quot;https://github.com/alexott/cassandra-dse-playground/tree/master/cassandra-join-spark&quot;&gt;full source code&lt;/a&gt;) demonstrates the use of joins with Dataframe &amp;amp; RDD APIs to perform enrichment of data coming from Kafka with data from Cassandra.  In our case, we&#39;re getting from Kafka the information about stocks (stock ticker, timestamp, and price), and enrich that data with more information about specific stock, like, full company name, type company, stock exchange, etc.  After enrichment, we just output data to the console, but the code could be adjusted to do something more useful with enriched data.  To run the code, just follow instructions in &lt;a href=&quot;https://github.com/alexott/cassandra-dse-playground/blob/master/cassandra-join-spark/README.md&quot;&gt;README&lt;/a&gt;.  
&lt;/p&gt;

&lt;p&gt;
The &lt;a href=&quot;https://github.com/alexott/cassandra-dse-playground/blob/master/cassandra-join-spark/src/main/scala/com/datastax/alexott/demos/streaming/StockTickersJoinDataFrames.scala&quot;&gt;implementation that uses Spark Structured Streaming&lt;/a&gt; is very straightforward: 
&lt;/p&gt;
&lt;ol class=&quot;org-ol&quot;&gt;
&lt;li&gt;get data from Kafka&lt;/li&gt;
&lt;li&gt;decode JSON payload&lt;/li&gt;
&lt;li&gt;create a dataframe for data in Cassandra (if we have a &quot;static&quot; dataset in Cassandra, then we can cache the data so that it will be read only once)&lt;/li&gt;
&lt;li&gt;perform join (we use &lt;code&gt;joined.explain&lt;/code&gt; to check that we got &quot;Cassandra Direct Join&quot;)&lt;/li&gt;
&lt;li&gt;output data to console&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-scala&quot;&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;// &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;1.&lt;/span&gt;
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;streamingInputDF&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; spark.readStream
  .format(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;kafka&quot;&lt;/span&gt;)
  .option(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;kafka.bootstrap.servers&quot;&lt;/span&gt;, kafkaServes)
  .option(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;subscribe&quot;&lt;/span&gt;, topicName)
  .load()

&lt;span style=&quot;color: #7f7f7f;&quot;&gt;// &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;2.&lt;/span&gt;
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;parsed&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; streamingInputDF.selectExpr(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;CAST(value AS STRING)&quot;&lt;/span&gt;)
  .select(from_json($&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;value&quot;&lt;/span&gt;, schema).as(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;stock&quot;&lt;/span&gt;))
  .select(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;stock.*&quot;&lt;/span&gt;)
  .withColumnRenamed(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;symbol&quot;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;ticker&quot;&lt;/span&gt;)

&lt;span style=&quot;color: #7f7f7f;&quot;&gt;// &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;3.&lt;/span&gt;
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;cassandra&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; spark.read
  .format(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;org.apache.spark.sql.cassandra&quot;&lt;/span&gt;)
  .options(&lt;span style=&quot;color: darkcyan;&quot;&gt;Map&lt;/span&gt;(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;table&quot;&lt;/span&gt; -&amp;gt; &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;stock_info&quot;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;keyspace&quot;&lt;/span&gt; -&amp;gt; &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;test&quot;&lt;/span&gt;))
  .load

&lt;span style=&quot;color: #7f7f7f;&quot;&gt;// &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;4.&lt;/span&gt;
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;joined&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; parsed.join(cassandra, cassandra(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;symbol&quot;&lt;/span&gt;) === parsed(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;ticker&quot;&lt;/span&gt;), &lt;span style=&quot;color: #8b2252;&quot;&gt;&lt;br /&gt;  &quot;left&quot;&lt;/span&gt;).drop(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;ticker&quot;&lt;/span&gt;)
joined.explain

&lt;span style=&quot;color: #7f7f7f;&quot;&gt;// &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;5.&lt;/span&gt;
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;query&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; joined.writeStream
      .outputMode(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;update&quot;&lt;/span&gt;)
      .format(&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;console&quot;&lt;/span&gt;)
      .start()
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
And when we execute it, we can see the data printed to the console, like this:
&lt;/p&gt;

&lt;pre class=&quot;example&quot;&gt;+------------------+--------------------+------+----------+--------+--------------+--------------------+
|             value|            datetime|symbol|base_price|exchange|      industry|                name|
+------------------+--------------------+------+----------+--------+--------------+--------------------+
| 254.5442902345344|2020-07-14 14:03:...|  ADBE|     253.0|  NASDAQ|          TECH|       ADOBE SYSTEMS|
| 66.13761365408801|2020-07-14 14:03:...|   LNC|      66.0|    NYSE|    FINANCIALS|    LINCOLN NATIONAL|
| 37.18736354960266|2020-07-14 14:04:...|   AAL|      37.0|  NASDAQ|TRANSPORTATION|AMERICAN TRANSPOR...|
+------------------+--------------------+------+----------+--------+--------------+--------------------+
&lt;/pre&gt;


&lt;p&gt;
The &lt;a href=&quot;https://github.com/alexott/cassandra-dse-playground/blob/master/cassandra-join-spark/src/main/scala/com/datastax/alexott/demos/streaming/StockTickersJoinRDD.scala&quot;&gt;implementation that uses RDD-based Spark Streaming&lt;/a&gt; follows the same steps as previous example, although it is slightly more complicated, because it&#39;s doing more than dataframe-based implementation - it filters out entries for which we didn&#39;t find data in Cassandra, and prints only entries for which we have data in Cassandra:
&lt;/p&gt;

&lt;div class=&quot;org-src-container&quot;&gt;
&lt;pre class=&quot;src src-scala&quot;&gt;&lt;span style=&quot;color: #a020f0;&quot;&gt;case&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;class&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;StockData&lt;/span&gt;(symbol&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;String&lt;/span&gt;, timestamp&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;Instant&lt;/span&gt;, price&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;Double&lt;/span&gt;) &lt;span style=&quot;color: #a020f0;&quot;&gt;&lt;br /&gt;   extends&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;Serializable&lt;/span&gt;
&lt;span style=&quot;color: #a020f0;&quot;&gt;case&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;class&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;StockInfo&lt;/span&gt;(symbol&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;String&lt;/span&gt;, exchange&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;String&lt;/span&gt;, name&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;String&lt;/span&gt;, &lt;br /&gt;   industry&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;String&lt;/span&gt;, base_price&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;Double&lt;/span&gt;) &lt;span style=&quot;color: #a020f0;&quot;&gt;extends&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;Serializable&lt;/span&gt;
&lt;span style=&quot;color: #a020f0;&quot;&gt;case&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;class&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;JoinedData&lt;/span&gt;(symbol&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;String&lt;/span&gt;, exchange&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;String&lt;/span&gt;, name&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;String&lt;/span&gt;, &lt;br /&gt;   industry&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;String&lt;/span&gt;, base_price&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;Double&lt;/span&gt;, timestamp&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;Instant&lt;/span&gt;, price&lt;span style=&quot;color: #a020f0;&quot;&gt;:&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;Double&lt;/span&gt;) &lt;br /&gt;   &lt;span style=&quot;color: #a020f0;&quot;&gt;extends&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;Serializable&lt;/span&gt;

&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;ssc&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;new&lt;/span&gt; &lt;span style=&quot;color: forestgreen;&quot;&gt;StreamingContext&lt;/span&gt;(sc, &lt;span style=&quot;color: darkcyan;&quot;&gt;Seconds&lt;/span&gt;(&lt;span style=&quot;color: darkcyan;&quot;&gt;10&lt;/span&gt;))
&lt;span style=&quot;color: #7f7f7f;&quot;&gt;// &lt;/span&gt;&lt;span style=&quot;color: #7f7f7f;&quot;&gt;....&lt;/span&gt;
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;stream&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; &lt;span style=&quot;color: darkcyan;&quot;&gt;KafkaUtils&lt;/span&gt;.createDirectStream[&lt;span style=&quot;color: darkcyan;&quot;&gt;String&lt;/span&gt;, &lt;span style=&quot;color: darkcyan;&quot;&gt;String&lt;/span&gt;](
  ssc, &lt;span style=&quot;color: darkcyan;&quot;&gt;PreferConsistent&lt;/span&gt;, &lt;span style=&quot;color: darkcyan;&quot;&gt;Subscribe&lt;/span&gt;[&lt;span style=&quot;color: darkcyan;&quot;&gt;String&lt;/span&gt;, &lt;span style=&quot;color: darkcyan;&quot;&gt;String&lt;/span&gt;](topics, kafkaParams)
)

&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;parsedData&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; stream.flatMap(x &lt;span style=&quot;color: #a020f0;&quot;&gt;=&amp;gt;&lt;/span&gt; parseJson(x.value()))
&lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;transformedData&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; parsedData.transform(rdd &lt;span style=&quot;color: #a020f0;&quot;&gt;=&amp;gt;&lt;/span&gt; {
  &lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;joined&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; rdd.leftJoinWithCassandraTable[&lt;span style=&quot;color: darkcyan;&quot;&gt;StockInfo&lt;/span&gt;](&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;test&quot;&lt;/span&gt;, &lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;stock_info&quot;&lt;/span&gt;)
  joined.persist()
  &lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;missingInfoCount&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; joined.filter(x &lt;span style=&quot;color: #a020f0;&quot;&gt;=&amp;gt;&lt;/span&gt; x._2.isEmpty).count()
  &lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;stocksWithInfo&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; joined.filter(x &lt;span style=&quot;color: #a020f0;&quot;&gt;=&amp;gt;&lt;/span&gt; x._2.isDefined)
  &lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;existingInfoCount&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; stocksWithInfo.count()
  println(s&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;There are &lt;/span&gt;&lt;span style=&quot;color: sienna;&quot;&gt;$missingInfoCount&lt;/span&gt;&lt;span style=&quot;color: #8b2252;&quot;&gt; stock tickers without information in Cassandra&quot;&lt;/span&gt;)
  println(s&lt;span style=&quot;color: #8b2252;&quot;&gt;&quot;There are &lt;/span&gt;&lt;span style=&quot;color: sienna;&quot;&gt;$existingInfoCount&lt;/span&gt;&lt;span style=&quot;color: #8b2252;&quot;&gt; stock tickers with information in Cassandra&quot;&lt;/span&gt;)
  &lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;combined&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; stocksWithInfo.map(x &lt;span style=&quot;color: #a020f0;&quot;&gt;=&amp;gt;&lt;/span&gt; {
    &lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;i&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; x._2.get
    &lt;span style=&quot;color: #a020f0;&quot;&gt;val&lt;/span&gt; &lt;span style=&quot;color: sienna;&quot;&gt;d&lt;/span&gt; &lt;span style=&quot;color: #a020f0;&quot;&gt;=&lt;/span&gt; x._1
    &lt;span style=&quot;color: darkcyan;&quot;&gt;JoinedData&lt;/span&gt;(i.symbol, i.exchange, i.name, i.industry, i.base_price, &lt;br /&gt;       d.timestamp, d.price)
  })
  joined.unpersist()
  combined
})
transformedData.foreachRDD(rdd &lt;span style=&quot;color: #a020f0;&quot;&gt;=&amp;gt;&lt;/span&gt; rdd.foreach(println))
ssc.start()
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
and when it&#39;s running, we&#39;ll see the following messages on the console:
&lt;/p&gt;

&lt;pre class=&quot;example&quot;&gt;There are 0 stock tickers without information in Cassandra
There are 20 stock tickers with information in Cassandra
...
JoinedData(ESND,NASDAQ,ESSENDANT,WHOLESALERS,13.0,2020-07-14T16:19:19.588Z,13.483634952551117)
JoinedData(SWK,NYSE,STANLEY BLACK &amp;amp; DECKER,HOUSEHOLD PRODUCTS,128.0,2020-07-14T16:19:23.588Z,121.58327281753643)
JoinedData(BLK,NYSE,BLACKROCK,FINANCIALS,424.0,2020-07-14T16:19:24.588Z,394.7030616365362)
&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div class=&quot;outline-4&quot; id=&quot;outline-container-org2c657c2&quot;&gt;
&lt;h4 id=&quot;org2c657c2&quot;&gt;&lt;span class=&quot;section-number-4&quot;&gt;&lt;/span&gt;Conclusion&lt;/h4&gt;
&lt;div class=&quot;outline-text-4&quot; id=&quot;text-1-3-5&quot;&gt;
&lt;p&gt;
Joining with data in Cassandra is a very convenient and fast method for data enrichment - with a small amount of code we can quickly pull necessary data from the database, and perform data processing based on the retrieved data.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/29732542745204182/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/29732542745204182' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/29732542745204182'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/29732542745204182'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2020/07/spark-effective-joins-with-cassandra.html' title='Spark &amp; efficient joins with Cassandra'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjY6cF3M0n8ZteH4EextITAHEN8kGvn6rfEkRvoL3zy-4oguOIQh-o3DzZZJnKtu1eLMcBOI2JqGzSJOY9ZGB8DTVC7ygHNdxyzaT04KDVC5CY3f9Pq6L7au7ibs_4DQfj40H6O8g/s72-w625-h189-c/data-enrichment.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-5936357249627989908</id><published>2020-07-23T17:08:00.002+02:00</published><updated>2020-07-23T17:11:28.237+02:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="datastax"/><category scheme="http://www.blogger.com/atom/ns#" term="dse"/><category scheme="http://www.blogger.com/atom/ns#" term="spark"/><category scheme="http://www.blogger.com/atom/ns#" term="zeppelin"/><title type='text'>Using Apache Zeppelin to work with data in DSE via 
AlwaysOn SQL Service</title><content type='html'>Since release 6.0
DataStax Enterprise (DSE) includes &lt;a href=&quot;https://www.datastax.com/blog/2018/05/introducing-alwayson-sql-dse-analytics&quot;&gt;AlwaysOn SQL Service (AOSS)&lt;/a&gt; that allows to connect to DSE Analytics via JDBC or ODBC drivers and execute Spark SQL queries against data in DSE, or external sources, such as, data on DSEFS.  AOSS is built on the top of the Spark Thrift Server, but has a number of improvements, such as, improved fault tolerance, support for advanced security features of DSE (for example, row-level access control),  better caching of the data to improve response time on restarts, etc.  Using AOSS people can use their favorite BI tools to access data stored in DSE, and this greatly simplifies work with that data.
&lt;br /&gt;
&lt;a href=&quot;https://zeppelin.apache.org/&quot;&gt;Apache Zeppelin&lt;/a&gt; has a &lt;a href=&quot;https://zeppelin.apache.org/docs/0.9.0-preview1/interpreter/jdbc.html&quot;&gt;dedicated interpreter&lt;/a&gt; for accessing databases via JDBC and documentation contains all information on how to configure and use this interpreter, with examples for many popular databases, such as, PostgreSQL, MySQL, etc.  JDBC interpreter also supports dynamic forms, and interpolation of variables to simplify creation of interactive &amp;amp; dynamic queries.  So we can also use Apache Zeppelin to work with data in DSE via JDBC interpreter. 
&lt;br /&gt;
&lt;div class=&quot;outline-4&quot; id=&quot;outline-container-org407dbc9&quot;&gt;
&lt;h3 id=&quot;org407dbc9&quot; style=&quot;text-align: left;&quot;&gt;
Configuring Zeppelin to work with AOSS&lt;/h3&gt;
&lt;div class=&quot;outline-text-4&quot; id=&quot;text-1-2-1&quot;&gt;
To access data via AOSS we need to get a special version of JDBC driver that supports AOSS enhancements, such as, auto-discovery of AOSS instance, or reconnection to another server if AOSS fails.  We need to get &quot;Simba JDBC Driver for Apache Spark&quot; from the &lt;a href=&quot;https://downloads.datastax.com/#odbc-jdbc-drivers&quot;&gt;corresponding section of DataStax download site&lt;/a&gt;. (besides driver it makes sense to download the driver manual as well, as it describes all driver options).  After the driver is downloaded, we need to unpack the archive to a place accessible by Zeppelin.  Archive should contain a file with the name &lt;code&gt;&quot;SparkJDBC41.jar&lt;/code&gt;&lt;code&gt;&quot;&lt;/code&gt;.  
&lt;br /&gt;
Now we can configure Zeppelin to connect to AOSS by going to &lt;code&gt;&quot;&lt;/code&gt;Interpreters&lt;code&gt;&quot;&lt;/code&gt; section in the top right drop-down that shows the user name. We can configure existing instance of the JDBC interpreter, but it&#39;s usually recommended to create a new interpreter based on the JDBC interpreter template for each type of the used database.  Click &lt;code&gt;&quot;&lt;/code&gt;+Create&lt;code&gt;&quot;&lt;/code&gt; button, enter interpreter name, like &lt;code&gt;aoss&lt;/code&gt; (it will be used to specify interpreter on the cell level, like,&amp;nbsp;&lt;code&gt;&lt;code&gt;&quot;&lt;/code&gt;%aoss&lt;/code&gt;&lt;code&gt;&quot;&lt;/code&gt;), and select &lt;code&gt;&quot;&lt;/code&gt;&lt;code&gt;jdbc&lt;/code&gt;&lt;code&gt;&quot;&lt;/code&gt; in &quot;Interpreter group&quot; drop-down - this will load all existing properties of JDBC interpreter, that we can fill with information specific for AOSS:
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-XnS1NJWzpXQ9cueysxp5uJH0z3I46GVDEODJ4QVwR1mSgkjKLbKmsf6fK4Emj_L1MyQT__ni-7cjZV_RBbMU6KCUqOQpP67YNwo5wxo-IZvYsGRgyJQzUI40jtETbKR5QdMZDg/s1149/zeppelin-aoss-create-interpreter.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;378&quot; data-original-width=&quot;1149&quot; height=&quot;211&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-XnS1NJWzpXQ9cueysxp5uJH0z3I46GVDEODJ4QVwR1mSgkjKLbKmsf6fK4Emj_L1MyQT__ni-7cjZV_RBbMU6KCUqOQpP67YNwo5wxo-IZvYsGRgyJQzUI40jtETbKR5QdMZDg/w640-h211/zeppelin-aoss-create-interpreter.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
We need to configure several things that are common for all AOSS installations:
&lt;br /&gt;
&lt;ul class=&quot;org-ul&quot;&gt;
&lt;li&gt;we need to put full path to Simba JDBC driver into &quot;artifact&quot; field of &quot;Dependencies&quot; section (like, &lt;code&gt;/Users/ott/work/zeppelin/SparkJDBC41.jar&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;we need to put driver&#39;s class name (&lt;code&gt;com.simba.spark.jdbc41.Driver&lt;/code&gt;) for configuration parameter &lt;code&gt;default.driver&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
We also must specify a value for the configuration parameter &lt;code&gt;default.url&lt;/code&gt;.  For AOSS, there are two ways to do it:
&lt;br /&gt;
&lt;ol class=&quot;org-ol&quot;&gt;
&lt;li&gt;Explicitly specify host name or IP-address with port configured for AOSS (&lt;code&gt;10000&lt;/code&gt; by default, configured by &lt;code&gt;alwayson_sql_options:thrift_port&lt;/code&gt; setting in &lt;code&gt;dse.yaml&lt;/code&gt;), like, &lt;code&gt;jdbc:spark://server:10000&lt;/code&gt; - although this method works, but it&#39;s not optimal as it requires to know which of servers is running AOSS right now, and no connection to another server will happen in case of failover&lt;/li&gt;
&lt;li&gt;Use auto-discovery functionality of the driver that relies on the meta-information published by every node of DSE Analytics (by default on the port &lt;code&gt;9077&lt;/code&gt;, configured by &lt;code&gt;alwayson_sql_options:web_ui_port&lt;/code&gt; setting in &lt;code&gt;dse.yaml&lt;/code&gt;).  In this case, the driver will automatically discover where the instance of AOSS is running, and also perform connection to the new node if the current node fails.   For this case, URL looks as following: &lt;code&gt;jdbc:spark://AOSSStatusEndpoints=server1:9077,server2:9077;&lt;/code&gt; (we can specify any number of nodes as parameter)&lt;/li&gt;
&lt;/ol&gt;
We can pass additional driver options by adding them to the URL.  Refer to the driver documentation for a list of the available options.  We can also configure other Zeppelin parameters, but we can leave them with default values.  After everything is configured, press &quot;Save&quot; to save the changes (I removed unnecessary parameters to make the screenshot smaller):
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAynb9RptHNGj5_20K4tHjwuyfTjFDnLm3aObpu8JZU6Ik-LkKCiPVm42-ANHVWFTtMXmCUB_tPH6enlrzNM4hbOrqmYONEXuygaU_3Yr-Q4yNSTjsGBiWPyQiUhMNgF4fVjH4xA/s1135/zeppelin-aoss-configure.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;411&quot; data-original-width=&quot;1135&quot; height=&quot;232&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAynb9RptHNGj5_20K4tHjwuyfTjFDnLm3aObpu8JZU6Ik-LkKCiPVm42-ANHVWFTtMXmCUB_tPH6enlrzNM4hbOrqmYONEXuygaU_3Yr-Q4yNSTjsGBiWPyQiUhMNgF4fVjH4xA/w640-h232/zeppelin-aoss-configure.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;outline-4&quot; id=&quot;outline-container-org0d15d39&quot;&gt;
&lt;h3 id=&quot;org0d15d39&quot; style=&quot;text-align: left;&quot;&gt;
Usage&lt;/h3&gt;
&lt;div class=&quot;outline-text-4&quot; id=&quot;text-1-2-2&quot;&gt;
After we create the interpreter, we can start to use it either in the new notebooks, or in existing ones.  We can configure interpreter on the notebook level when creating it, or we can put &lt;code&gt;%interpreter_name&lt;/code&gt; at the beginning of the cell, to indicate that we&#39;re using a specific interpreter.
&lt;br /&gt;
Everything we need to do now is just issue Spark SQL queries and wait for results, like this:
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6O8ShXlLn6Vw4dqtWSYGIhib8gLBSgzVubCnomJQ6x2NULyqw_feBZBvVgJlmobHetIs0Rb1SAclUXJCLOXrG3reCsRSqGOXlvj0nMGmXOCPfxvbiV14I8nBqDgyNuoWPjmgETg/s1153/zeppelin-aoss-query.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;546&quot; data-original-width=&quot;1153&quot; height=&quot;304&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6O8ShXlLn6Vw4dqtWSYGIhib8gLBSgzVubCnomJQ6x2NULyqw_feBZBvVgJlmobHetIs0Rb1SAclUXJCLOXrG3reCsRSqGOXlvj0nMGmXOCPfxvbiV14I8nBqDgyNuoWPjmgETg/w640-h304/zeppelin-aoss-query.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
We can check that the same data is available via CQL (don&#39;t wonder about syntax - this table has DSE Search index):
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcX3GByGHTcIgvZNZ2wvtzkOTqASo2D0vQpg-ca4l6-J2x9qKI5nhd1Pe6ua4rke31n5ptlpyAYBcxKaegdcKe_jKq-d9H887KxqP0eupRJlZx8zF4IMKhFls0cRh1mcm4K4afBQ/s1108/zeppelin-aoss-query-cql.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;297&quot; data-original-width=&quot;1108&quot; height=&quot;172&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcX3GByGHTcIgvZNZ2wvtzkOTqASo2D0vQpg-ca4l6-J2x9qKI5nhd1Pe6ua4rke31n5ptlpyAYBcxKaegdcKe_jKq-d9H887KxqP0eupRJlZx8zF4IMKhFls0cRh1mcm4K4afBQ/w640-h172/zeppelin-aoss-query-cql.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
We can also use all available visualizations, including additional, like &lt;i&gt;geospark-zeppelin&lt;/i&gt;, that is installed from the Helium registry:
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgy_bX-TTrOA2iWaA8M-qDE_I1t77or8NMqDUOq7sKfXvVrMOYu42EldBG57h8Iy-dzgUsw5nX5YWiLB8AhPB0FTv_4s-EM7JAUUXX-lGJhGiPHNa6kB3fLLE1dUC6Ffc59fHDVtA/s1111/zeppelin-aoss-query-geomap.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;449&quot; data-original-width=&quot;1111&quot; height=&quot;258&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgy_bX-TTrOA2iWaA8M-qDE_I1t77or8NMqDUOq7sKfXvVrMOYu42EldBG57h8Iy-dzgUsw5nX5YWiLB8AhPB0FTv_4s-EM7JAUUXX-lGJhGiPHNa6kB3fLLE1dUC6Ffc59fHDVtA/w640-h258/zeppelin-aoss-query-geomap.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;h3 style=&quot;text-align: left;&quot;&gt;
Conclusion&lt;/h3&gt;
This post demonstrates flexibility and ease of use of the Apache Zeppelin when working with different technologies, such as databases, etc.&lt;br /&gt;
P.S. this post was written using Zeppelin 0.9.0-preview1&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/5936357249627989908/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/5936357249627989908' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/5936357249627989908'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/5936357249627989908'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2020/07/using-zeppelin-to-work-with-data-in-dse.html' title='Using Apache Zeppelin to work with data in DSE via AlwaysOn SQL Service'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-XnS1NJWzpXQ9cueysxp5uJH0z3I46GVDEODJ4QVwR1mSgkjKLbKmsf6fK4Emj_L1MyQT__ni-7cjZV_RBbMU6KCUqOQpP67YNwo5wxo-IZvYsGRgyJQzUI40jtETbKR5QdMZDg/s72-w640-h211-c/zeppelin-aoss-create-interpreter.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862508.post-6966808613791304468</id><published>2020-06-23T12:01:00.001+02:00</published><updated>2020-06-23T13:02:06.897+02:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="astra"/><category scheme="http://www.blogger.com/atom/ns#" term="cassandra"/><category scheme="http://www.blogger.com/atom/ns#" term="datastax"/><category scheme="http://www.blogger.com/atom/ns#" term="zeppelin"/><title 
type='text'>Working with DataStax Astra from Apache Zeppelin</title><content type='html'>&lt;a href=&quot;https://zeppelin.apache.org/&quot;&gt;Apache Zeppelin&lt;/a&gt; is a very powerful web-based environment for collaborative work with very good support for the big data technologies and databases, such as, Spark, Flink, Cassandra, and many others.  Apache Zeppelin 0.9.0 will include a lot of changes for Cassandra interpreter:&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;migration to the &lt;a href=&quot;https://docs.datastax.com/en/developer/java-driver/latest/&quot;&gt;new, unified DataStax Java driver&lt;/a&gt; that brings more performance &amp;amp; stability, and also support for DSE-specific functionality, such as, geospatial types&lt;/li&gt;
&lt;li&gt;flexible formatting of results - we can output data in CQL, or human-readable formats, format time/date-related columns using custom patterns, control formatting of floating point numbers, etc. All of this could be configured on interpreter and/or cell level&lt;/li&gt;
&lt;li&gt;ability to change any configuration parameter of the Java driver&lt;/li&gt;
&lt;/ul&gt;
The last item is the most important one for connecting to &lt;a href=&quot;https://astra.datastax.com/&quot;&gt;DataStax Astra&lt;/a&gt; (Cassandra as a Service from DataStax) - we can specify the path to secure connect bundle, and get access to our Astra instance.&lt;br /&gt;
&lt;br /&gt;
Right now, there is no precompiled version of Zeppelin with these changes available, so you will need to compile Zeppelin from sources. After compilation is done, start Zeppelin, and open in web browser default Zeppelin address: http://localhost:8080/.&lt;br /&gt;&lt;br /&gt;
We can configure Cassandra interpreter directly to work with Astra, but often it&#39;s better to create a separate interpreter (as shown in the picture below):&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;go to the &quot;Interpreter&quot; menu (in drop down in the top right corner), and there click &quot;Create&quot;&lt;/li&gt;
&lt;li&gt;enter the name of the interpreter (&lt;i&gt;astra&lt;/i&gt;), and select the &lt;i&gt;cassandra&lt;/i&gt; in the interpreter group dropdown&lt;/li&gt;
&lt;li&gt;enter the username/password in the &lt;tt&gt;cassandra.credentials.username&lt;/tt&gt; &amp;amp; &lt;tt&gt;cassandra.credentials.password properties&lt;/tt&gt;&lt;/li&gt;
&lt;li&gt;clear the value of the &lt;tt&gt;cassandra.hosts&lt;/tt&gt; property (this is temporary workaround until it&#39;s fixed on the driver level)&lt;/li&gt;
&lt;li&gt;change the value of &lt;tt&gt;cassandra.query.default.consistency&lt;/tt&gt; to &lt;tt&gt;LOCAL_QUORUM&lt;/tt&gt;, and &lt;tt&gt;cassandra.query.default.serial.consistency&lt;/tt&gt; to &lt;tt&gt;LOCAL_SERIAL&lt;/tt&gt; as &lt;a href=&quot;https://docs.datastax.com/en/astra/aws/doc/dscloud/astra/dscloudDatabaseConditions.html&quot;&gt;Astra requires this to perform write operations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;optionally change the value of &lt;tt&gt;cassandra.keyspace&lt;/tt&gt; to the name of keyspace that was created in Astra&lt;/li&gt;
&lt;li&gt;add a property with name &lt;tt&gt;datastax-java-driver.basic.cloud.secure-connect-bundle&lt;/tt&gt; and value of the path to the secure bundle&lt;/li&gt;
&lt;li&gt;save the interpreter - this enables connection to Astra&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhl-DNGbzzbfZ08w0JpgkW-80xaglBILdXMuCUmnN1tu_hSSGrE9fjmpKASE7vFdPXgaJbQkSwUmeSiIcEovkB9-c3IhY8AyliLXbFGclCrzfGbqowfruxyCXewqBHpN8-U65mFbQ/s1600/Screen+Shot+2020-06-23+at+09.06.19.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;888&quot; data-original-width=&quot;1233&quot; height=&quot;576&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhl-DNGbzzbfZ08w0JpgkW-80xaglBILdXMuCUmnN1tu_hSSGrE9fjmpKASE7vFdPXgaJbQkSwUmeSiIcEovkB9-c3IhY8AyliLXbFGclCrzfGbqowfruxyCXewqBHpN8-U65mFbQ/s640/Screen+Shot+2020-06-23+at+09.06.19.png&quot; width=&quot;800&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
Using the new interpreter is easy:
&lt;ul&gt;
&lt;li&gt;Click &quot;Create new note&quot;&lt;/li&gt;
&lt;li&gt;Enter the note name, and select &lt;i&gt;astra&lt;/i&gt; as the default interpreter.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibNjhyphenhyphentBOeOTlKCH5xBGk7XZc8-EeesMD8hKP7bKE_nbuLYJEQrbp6K9GS_C88zhkA4AhjalOMrJC8HYLzTtC_QmS3FkBaRrh4oiMGhX2hy4Qc_ullcGqw6j1Boi_BT8hJxQiicg/s1600/Screen+Shot+2020-06-23+at+09.08.39.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;458&quot; data-original-width=&quot;746&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibNjhyphenhyphentBOeOTlKCH5xBGk7XZc8-EeesMD8hKP7bKE_nbuLYJEQrbp6K9GS_C88zhkA4AhjalOMrJC8HYLzTtC_QmS3FkBaRrh4oiMGhX2hy4Qc_ullcGqw6j1Boi_BT8hJxQiicg/s400/Screen+Shot+2020-06-23+at+09.08.39.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Start to execute commands, for example, execute &lt;tt&gt;describe cluster;&lt;/tt&gt; that should show something like this:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdMi7IDnUCBBpZLD7Dl_X_11i1Ct1OeKf67Oi3G_kjwAdghI54aeB0VyLn9lPdGxXv8iA3zE2v5sIkMgt4OUjZA6rGlhdYTsONDPNrJyp1jaNFAO2ZSMIPq0etFkjKegr9T-ARng/s1600/Screen+Shot+2020-06-23+at+09.14.54.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;508&quot; data-original-width=&quot;1390&quot; height=&quot;292&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdMi7IDnUCBBpZLD7Dl_X_11i1Ct1OeKf67Oi3G_kjwAdghI54aeB0VyLn9lPdGxXv8iA3zE2v5sIkMgt4OUjZA6rGlhdYTsONDPNrJyp1jaNFAO2ZSMIPq0etFkjKegr9T-ARng/s640/Screen+Shot+2020-06-23+at+09.14.54.png&quot; width=&quot;800&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Create table, insert data, and select them:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaK_JBvAnImhTLqnOmIlHfRm6TAx_Oi-Opc8appgyyXhyphenhyphenqZfFtT49X-OF2yUYB0FPAWnf13uZykS91z4BlPFk5g3JpJcINSZzOScR3ylDpjbFTTKR53w-neenmQ-GkBs_igdxtyg/s1600/Screen+Shot+2020-06-23+at+11.50.28.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;699&quot; data-original-width=&quot;1138&quot; height=&quot;491&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaK_JBvAnImhTLqnOmIlHfRm6TAx_Oi-Opc8appgyyXhyphenhyphenqZfFtT49X-OF2yUYB0FPAWnf13uZykS91z4BlPFk5g3JpJcINSZzOScR3ylDpjbFTTKR53w-neenmQ-GkBs_igdxtyg/s640/Screen+Shot+2020-06-23+at+11.50.28.png&quot; width=&quot;800&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
Please note that the new interpreter inherits all functionality of the base interpreter, for example, it&#39;s possible to specify formatting options, like this (formatting using German locale, for German timezone):&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1YTPsXBENQ3Ly1TM2HB3tq0PQSiqvG5mPcGuQx8VTWQt8A0VF7xnaH7pFxAN6xJE6krn-rnxjdUu_dVdqFa6EsZgmRFsRZ6Ltcndf-XRUssh_mnbXgDsfZu6hAGw4NfNiYd_7Tg/s1600/Screen+Shot+2020-06-23+at+10.41.28.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;448&quot; data-original-width=&quot;1070&quot; height=&quot;335&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1YTPsXBENQ3Ly1TM2HB3tq0PQSiqvG5mPcGuQx8VTWQt8A0VF7xnaH7pFxAN6xJE6krn-rnxjdUu_dVdqFa6EsZgmRFsRZ6Ltcndf-XRUssh_mnbXgDsfZu6hAGw4NfNiYd_7Tg/s640/Screen+Shot+2020-06-23+at+10.41.28.png&quot; width=&quot;800&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alexott.blogspot.com/feeds/6966808613791304468/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6862508/6966808613791304468' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/6966808613791304468'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862508/posts/default/6966808613791304468'/><link rel='alternate' type='text/html' href='http://alexott.blogspot.com/2020/06/working-with-datastax-astra-from-apache.html' title='Working with DataStax Astra from Apache Zeppelin'/><author><name>Alex Ott</name><uri>http://www.blogger.com/profile/13001951608173211050</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZbs-KeNgwHzuMovddm11TJ8k6o1XXIwsYnJtZEwXDTWXAr9ZX1YH5Z8Dq5mCu9soZ2sY2S2BtA-6IMhv1F6uZtzooMPuaHx7h6wpEHz9Qdk8aechVbR5wE3WPfvZxHA/s220/avatar2.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhl-DNGbzzbfZ08w0JpgkW-80xaglBILdXMuCUmnN1tu_hSSGrE9fjmpKASE7vFdPXgaJbQkSwUmeSiIcEovkB9-c3IhY8AyliLXbFGclCrzfGbqowfruxyCXewqBHpN8-U65mFbQ/s72-c/Screen+Shot+2020-06-23+at+09.06.19.png" height="72" width="72"/><thr:total>0</thr:total></entry></feed>