Stories by Christopher Jones on Medium

New Python Jupyter Notebooks for python-oracledb

Christopher Jones — Mon, 20 Oct 2025 03:04:58 GMT

The Jupyter notebooks for python-oracledb have been updated with the latest and greatest functionality to teach you how to use Python to connect to Oracle Database.

The notebooks are at github.com/oracle/python-oracledb/tree/main/samples/notebooks

Python-oracledb Resources

Python-oracledb is an open source package that implements the Python Database API specification, with many additions to support advanced Oracle Database features. By default, it is a ‘Thin’ driver that is immediately usable without any additional installation; for example, no Instant Client is required. Python-oracledb is used by frameworks, ORMs, SQL generation libraries, and other projects.

The python-oracledb driver and Oracle AI Database 26ai

Christopher Jones — Wed, 15 Oct 2025 11:01:39 GMT

Oracle AI Database 26ai has been announced: “Oracle AI Database 26ai replaces Oracle Database 23ai. Transitioning from 23ai to 26ai is simple — just apply the October 2025 release update with no database upgrade or application re-certification. Advanced AI features like AI Vector Search are included at no additional charge.”

The database version is now of the form 23.26.x.y.z where “x” becomes the release update number. Since the Oracle Client libraries are a subset of the database libraries, users of Oracle’s Python driver python-oracledb (when running in “Thick mode”) will see a similar, updated Client library version. The Python code:

import oracledb

oracledb.init_oracle_client()
print(f'Oracle Client version {oracledb.clientversion()}')
with oracledb.connect(user='cj', password='cj', dsn='localhost/orclpdb1') as connection:
    print(f'Oracle AI Database version {connection.version}')

will print:

$ python v.py
Oracle Client version (23, 26, 0, 0, 0)
Oracle AI Database version 23.26.0.0.0

Oracle Instant Client will unzip to the directory instantclient_23_26.
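
If your application needs to branch on the version, the string and tuple forms shown above can be compared once they are in the same shape. Here is a minimal sketch. The helper names parse_version() and is_26ai_or_later() are my own and not part of python-oracledb, and the "23.26 or higher" test is an assumption based on the version format described above. The 23ai-style version string is just an example value.

```python
def parse_version(version_text):
    """Turn a dotted version string such as '23.26.0.0.0' into a tuple of ints."""
    return tuple(int(part) for part in version_text.split("."))


def is_26ai_or_later(version):
    # Assumption: 26ai is reported as 23.26 or higher in the first two fields.
    return tuple(version[:2]) >= (23, 26)


db_version = parse_version("23.26.0.0.0")  # e.g. the value of connection.version
client_version = (23, 26, 0, 0, 0)         # e.g. the value of oracledb.clientversion()

print(db_version == client_version)
print(is_26ai_or_later(db_version))
print(is_26ai_or_later(parse_version("23.9.0.25.7")))  # example 23ai-style value
```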

To read more about the versioning scheme and updating to 26ai, check out Mike Dietrich’s blog post “Oracle AI Database 26ai replaces Oracle Database 23ai”.

What do you get in Python?

It’s a great time to review some of the fantastic Oracle AI Database 26ai and related Oracle Cloud Infrastructure features that are supported by python-oracledb. These include:

AI VECTOR data type and AI Vector Search for AI applications
Sessionless Transactions to decouple transactions from connections
JSON Duality Views to let developers and SQL lovers have the best of both worlds
Pipelining to improve statement throughput and keep the applications and database working without delays
Multi-pool DRCP for better workload partitioning
Implicit Connection pooling for scaling of legacy applications
SQL Annotations for centralized application metadata
Data Use Case Domains for properties and constraints on the data schema
OCI Instance Principal authentication for automatic database authentication from authorized Oracle Cloud Infrastructure compute instances
Configuration Providers for centralized credential management and application configuration
Cloud Native Authentication for database authentication with tokens
Oracle’s fast, efficient OSON internal representation of JSON to accelerate your modern applications
TCP Fast Open for faster connection usage with Oracle Autonomous AI Database

There are many other recent changes in the database world that bring benefit to all applications: faster connection protocols, lots of new SQL syntax features, lots of improved database features, updated client tooling, and a focus on developers. You can find details in the Oracle AI Database New Features manual.

Python-oracledb itself has new, important features that complement Oracle AI Database but can also be used with older database releases:

DataFrame support for fast analysis and AI use with your favorite Python modules
Direct Path Loads for fast data ingestion and ETL workloads

Overall, Oracle AI Database 26ai and the python-oracledb driver bring great technology to your business, improving efficiency and powering your work with AI agents, MCP, and vectors.

The python-oracledb driver and Oracle AI Database 26ai was originally published in Oracle Developers on Medium, where people are continuing the conversation by highlighting and responding to this story.

OpenTelemetry with Python and Oracle Database

Christopher Jones — Thu, 09 Oct 2025 21:14:15 GMT

Python Oracle Database applications can easily be integrated with OpenTelemetry for monitoring and troubleshooting.

With the growing demand to monitor and trace fleets of applications, an observability framework becomes important. OpenTelemetry is the most widely known, being usable across a wide range of environments and languages.

OpenTelemetry describes itself as:

An observability framework and toolkit designed to facilitate the

- Generation

- Export

- Collection

of telemetry data such as traces, metrics, and logs.

It is open source, and vendor- and tool-agnostic.

Python modules that enable sophisticated OpenTelemetry support are widely used. In particular, the OpenTelemetry Database API Instrumentation package opentelemetry-instrumentation-dbapi (documentation here) provides observability integration for Python DB API V2 compliant drivers. This makes tracing of Oracle’s open source python-oracledb driver very easy, giving you rapid, ongoing insight into your applications.

Basic Tracing

Start by installing required packages:

python -m pip install opentelemetry-sdk opentelemetry-api opentelemetry-instrumentation-dbapi oracledb

The simplest tracing available is to display logging output to your terminal. This is done by adding a ConsoleSpanExporter() and trace integration with the oracledb module to the top of your script. For example:

# opentelemetry-db-1.py

import os

import oracledb

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
)
from opentelemetry.instrumentation.dbapi import trace_integration

provider = TracerProvider()
processor = BatchSpanProcessor(ConsoleSpanExporter())
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

trace_integration(oracledb, "connect", "oracle")

# Database credentials and connection string
un = os.environ.get("PYTHON_USERNAME")
pw = os.environ.get("PYTHON_PASSWORD")
cs = os.environ.get("PYTHON_CONNECTSTRING")

with oracledb.connect(user=un, password=pw, dsn=cs) as connection:
    with connection.cursor() as cursor:
        sql = "select 'hello' from dual"
        for r, in cursor.execute(sql):
            print(r)

You can see that the connection and query code doesn’t have additional hooks or tracing annotations. Tracing is automatically performed.

Set your credential environment variables and run the script:

export PYTHON_USERNAME=cj
export PYTHON_PASSWORD=...
export PYTHON_CONNECTSTRING=localhost/freepdb1

python opentelemetry-db-1.py

The output shows the expected query result “hello” and an OpenTelemetry span named “select” which records information about the query execution:

hello
{
    "name": "select",
    "context": {
        "trace_id": "0x02aa04b49d900dcadaf577fb9e5f15a3",
        "span_id": "0x0ca4c709bf0ac7d7",
        "trace_state": "[]"
    },
    "kind": "SpanKind.CLIENT",
    "parent_id": null,
    "start_time": "2025-10-05T22:42:33.460805Z",
    "end_time": "2025-10-05T22:42:33.881186Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {
        "db.system": "oracle",
        "db.name": "",
        "db.statement": "select 'hello' from dual"
    },
    "events": [],
    "links": [],
    "resource": {
        "attributes": {
            "telemetry.sdk.language": "python",
            "telemetry.sdk.name": "opentelemetry",
            "telemetry.sdk.version": "1.27.0",
            "service.name": "unknown_service"
        },
        "schema_url": ""
    }
}

The trace contains metadata that is useful for understanding when the event occurred and what was happening. The start and end times can be used to calculate how long the query took. The trace also has identifier information that helps correlate nested spans in bigger applications.
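
As a small illustration of the timing calculation, this sketch takes the two timestamps from the console span shown above and subtracts them. The helper name span_duration_seconds() is my own, and the sketch assumes the timestamps always have the UTC "Z" format that the console exporter printed here.

```python
from datetime import datetime

# Timestamps copied from the console span shown above
span = {
    "name": "select",
    "start_time": "2025-10-05T22:42:33.460805Z",
    "end_time": "2025-10-05T22:42:33.881186Z",
}

# Assumption: both values are UTC with microseconds and a trailing "Z"
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def span_duration_seconds(span):
    """Return the elapsed time of a span dictionary in seconds."""
    start = datetime.strptime(span["start_time"], TIME_FORMAT)
    end = datetime.strptime(span["end_time"], TIME_FORMAT)
    return (end - start).total_seconds()


print(f"{span['name']} took {span_duration_seconds(span):.6f} seconds")
```

For this span the difference is about 0.42 seconds, which includes the round trip to the database.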

OpenTelemetry Database API Instrumentation also lets you automatically track unexpected events. Edit opentelemetry-db-1.py and force an error (but don't show the message in application output) by adding this:

        try:
            sql = "insert into doesnotexist values (1)"
            cursor.execute(sql)
        except oracledb.Error:
            # Deliberately hide the real error from the application output
            print("Whoops")

The output will now contain a second OpenTelemetry span, this time named “insert” by default. Even though the script itself “swallowed” the error and just displayed “Whoops” on the console as normal output, the OpenTelemetry span contains the real error information:

hello
Whoops
{
    "name": "select",
    . . .
}
{
    "name": "insert",

    . . .

    "status": {
        "status_code": "ERROR",
        "description": "DatabaseError: ORA-00942: table or view \"CJ\".\"DOESNOTEXIST\" does not exist\nHelp: https://docs.oracle.com/error-help/db/ora-00942/"
    },

    . . .
}

There will also be a stack trace in the span, which you can see when you run it yourself.

This error information capture allows you to remotely monitor and troubleshoot unexpected events without even having to explicitly log them.
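
To give a feel for how a monitoring job might use this, here is a minimal sketch that picks the failed spans out of a list. It assumes the spans have already been collected as dictionaries shaped like the console output above. The failed_spans() helper is hypothetical, and the error description in the sample data is abbreviated.

```python
def failed_spans(spans):
    """Return (name, description) for every span whose status is ERROR."""
    return [
        (s["name"], s["status"].get("description", ""))
        for s in spans
        if s.get("status", {}).get("status_code") == "ERROR"
    ]


# Sample data shaped like the console spans; the description is abbreviated
spans = [
    {"name": "select", "status": {"status_code": "UNSET"}},
    {
        "name": "insert",
        "status": {
            "status_code": "ERROR",
            "description": "DatabaseError: ORA-00942: ...",
        },
    },
]

for name, description in failed_spans(spans):
    print(f"{name}: {description}")
```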

Customizing Tracing

There are various attributes that can be set to enhance your traces. Guidelines are at OpenTelemetry Database API Instrumentation and Semantic conventions for database client spans.

The following example sets some attributes to customize the output of a query from the sample LOCATIONS table:

# opentelemetry-db-2.py

import os

import oracledb

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
)
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.dbapi import trace_integration

# Same environment variables as opentelemetry-db-1.py
un = os.environ.get('PYTHON_USERNAME')
pw = os.environ.get('PYTHON_PASSWORD')
cs = os.environ.get('PYTHON_CONNECTSTRING')

resource = Resource(attributes={

    # displayed as resource.attributes.service.name
    "service.name": "flightBooking",

    # displayed as resource.attributes.db.name
    "db.name": "myOracleDB"
})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(ConsoleSpanExporter())
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

trace_integration(
    oracledb,

    connect_method_name="connect",

    # displayed as attributes.db.system
    database_system="Oracle ADB",

    # displayed as attributes.db.statement.parameters
    # !! SECURITY WARNING: shows bind variable values !!
    capture_parameters=True,
)

with oracledb.connect(user=un, password=pw, dsn=cs) as connection:
    with connection.cursor() as cursor:
        sql = "select city from locations where location_id = :1"
        for r, in cursor.execute(sql, [2200]):
            print(r)

Output is like:

Sydney
{
    "name": "select",

    . . .

    "attributes": {
        "db.system": "Oracle ADB",
        "db.name": "",
        "db.statement": "select city from locations where location_id = :1",
        "db.statement.parameters": "[2200]"
    },
    "events": [],
    "links": [],
    "resource": {
        "attributes": {
            "service.name": "flightBooking",
            "db.name": "myOracleDB"
        },
        "schema_url": ""
    }
}

Note that setting capture_parameters caused the bind variable value "2200" to be recorded, which is a security risk. Only use this setting in limited development circumstances, not in production applications.
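
One way to keep the setting out of production is to derive it from the deployment environment instead of hard coding True. The following is only a sketch: APP_ENV is a hypothetical environment variable name and capture_parameters_enabled() is my own helper, so adapt both to however your deployments are labelled.

```python
import os


def capture_parameters_enabled(environ=os.environ):
    """Allow bind value capture only when explicitly running in development."""
    # APP_ENV is a hypothetical variable name; anything unset counts as production
    return environ.get("APP_ENV", "production").lower() == "development"


# The result would then be passed to the instrumentation call, for example:
#
# trace_integration(
#     oracledb,
#     connect_method_name="connect",
#     database_system="oracle",
#     capture_parameters=capture_parameters_enabled(),
# )

print(capture_parameters_enabled({"APP_ENV": "development"}))
print(capture_parameters_enabled({}))
```

Defaulting to "production" when the variable is missing means a forgotten setting fails safe.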

Tracing extended python-oracledb functionality

Python-oracledb has great functionality that is not part of the Python DB API standard and therefore is not automatically instrumented by opentelemetry-instrumentation-dbapi.

To show this, edit opentelemetry-db-2.py and add a query that uses python-oracledb's fantastic DataFrame method fetch_df_all():

        sql = "select city from locations where country_id = :1"
        odf = connection.fetch_df_all(sql, ['UK'])
        print(odf.num_rows())

The script output now displays an additional 3 from the print() statement (because the sample table I used contains three cities in the UK), but only the original OpenTelemetry span for the previous cursor.execute() call is shown:

Sydney
3
{
    "name": "select",

    . . .

    "attributes": {
        "db.system": "Oracle ADB",
        "db.name": "",
        "db.statement": "select city from locations where location_id = :1",
        "db.statement.parameters": "[2200]"
    },

    . . .
}

To trace calls such as fetch_df_all() that extend the Python DB API, you can add explicit instrumentation. Change the DataFrame query code to be:

        tracer = trace.get_tracer(__name__)
        with tracer.start_as_current_span("myDFQuery"):
            sql = "select city from locations where country_id = :1"
            odf = connection.fetch_df_all(sql, ['UK'])
            print(odf.num_rows())

The output of the script now has an extra span for the DataFrame operation:

. . .
{
    "name": "myDFQuery",
    "context": {
        "trace_id": "0x8512a9fac568c07fc16cd872f68d0346",
        "span_id": "0x03f424111825540f",
        "trace_state": "[]"
    },
    "kind": "SpanKind.INTERNAL",
    "parent_id": null,
    "start_time": "2025-10-06T01:20:34.200129Z",
    "end_time":   "2025-10-06T01:20:39.212618Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {},
    "events": [],
    "links": [],
    "resource": {
        "attributes": {
            "service.name": "flightBooking",
            "db.name": "myOracleDB"
        },
        "schema_url": ""
    }
}
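
The "myDFQuery" span above has an empty attributes section. If you trace many DataFrame queries, you might wrap the call in a small helper that also records the statement text. This is a sketch, not part of python-oracledb or OpenTelemetry: traced_fetch_df_all() is my own helper name, and "app.rows_returned" is an illustrative attribute name rather than an official semantic convention. The tracer argument is expected to be what trace.get_tracer(__name__) returns.

```python
def traced_fetch_df_all(connection, tracer, sql, parameters=None):
    """Run connection.fetch_df_all() inside its own span and describe the call."""
    with tracer.start_as_current_span("fetch_df_all") as span:
        span.set_attribute("db.system", "oracle")
        span.set_attribute("db.statement", sql)
        odf = connection.fetch_df_all(sql, parameters)
        # Illustrative attribute name, not an official semantic convention
        span.set_attribute("app.rows_returned", odf.num_rows())
        return odf


# Example use inside the script above:
#
# tracer = trace.get_tracer(__name__)
# sql = "select city from locations where country_id = :1"
# odf = traced_fetch_df_all(connection, tracer, sql, ["UK"])
```

The bind values are deliberately not recorded, for the same security reason discussed for capture_parameters.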

Recording and Visualizing OpenTelemetry Traces

All the samples above wrote OpenTelemetry tracing straight to the console. However, it is more useful to store traces and use them for later analysis, or for aggregation over longer time periods. There are various recording and graphical tools that can be used with OpenTelemetry, for example Zipkin, Prometheus, Grafana, and Jaeger. Here I will show Zipkin.

The easiest way to get a Zipkin instance is in a container:

docker run -d -p 9411:9411 --name zipkin openzipkin/zipkin

For the application, install the additional Python exporter module for Zipkin:

python -m pip install opentelemetry-exporter-zipkin

Copy and edit opentelemetry-db-2b.py to add an import for ZipkinExporter. Change the BatchSpanProcessor to use it. The file diff is:

--- opentelemetry-db-2b.py 2025-10-06 12:52:25
+++ opentelemetry-db-2b-zipkin.py 2025-10-06 12:53:01
@@ -1,14 +1,14 @@
-# opentelemetry-db-2b.py
+# opentelemetry-db-2b-zipkin.py
 
 import os
 
 import oracledb
 
 from opentelemetry import trace
+from opentelemetry.exporter.zipkin.json import ZipkinExporter
 from opentelemetry.sdk.trace import TracerProvider
 from opentelemetry.sdk.trace.export import (
     BatchSpanProcessor,
-    ConsoleSpanExporter,
 )
 from opentelemetry.sdk.resources import Resource
 from opentelemetry.instrumentation.dbapi import trace_integration
@@ -27,7 +27,7 @@
 })
 
 provider = TracerProvider(resource=resource)
-processor = BatchSpanProcessor(ConsoleSpanExporter())
+processor = BatchSpanProcessor(ZipkinExporter())
 provider.add_span_processor(processor)
 trace.set_tracer_provider(provider)

When the docker ps command shows the Zipkin container is "healthy", run your edited Python script. The result is just the application output:

Sydney
3

The tracing information has been exported to the Zipkin instance in the background.

Now you can load http://localhost:9411/zipkin/ in a browser. This gives you an overview of what happened in the app.

You can drill into each span, for example by clicking SHOW for the first query.

The bind variable value is shown since capture_parameters=True is still set. As a reminder, don’t do this in production applications.

Summary

OpenTelemetry is an observability framework that can be used across a wide variety of applications. Comprehensive Python packages let you rapidly enable it in python-oracledb applications for monitoring and troubleshooting. The python-oracledb driver can be instrumented automatically, or with custom tracing calls.

Various backend systems can be used to record and analyze OpenTelemetry information, letting you drill into nested spans to find slow code areas, aggregate data, and enable custom alerts.

You may also be interested in our work to add OpenTelemetry support to Node.js.

OpenTelemetry with Python and Oracle Database was originally published in Oracle Developers on Medium, where people are continuing the conversation by highlighting and responding to this story.

Direct Path Loads: Fast data ingestion with Python and Oracle Database

Christopher Jones — Tue, 07 Oct 2025 04:25:20 GMT

The fastest way to load very large datasets with Python into Oracle Database is python-oracledb’s direct_path_load() method. This can radically improve your ETL workflow performance.

This blog benchmarks various ways to load data into Oracle Database using Python.

The Background

ETL (“Extract, Transform, and Load”) pipelines are common in Python, especially in the world of data analysis and AI. Performance is critical, so how can you improve data loading times?

Direct Path Loading is a database feature most commonly known from SQL*Loader, but also exposed by JDBC and ODP.NET — and now by python-oracledb. It allows data being inserted into an Oracle Database table to bypass code layers such as the database buffer cache. Also there are no INSERT statements used. Direct Path Loads allow very fast ingestion of huge amounts of data.

Direct Path Load support was introduced in python-oracledb 3.4 with a direct_path_load() method off the connection object. It lets you pass in a list of sequences or a DataFrame. For example with a list:

DATA = [
    (1, "First row"),
    (2, "Second row"),
    (3, "Third row"),
]

connection.direct_path_load(
    schema_name="HR",
    table_name="mytab",
    column_names=["id", "name"],
    data=DATA
)

The API is simple. You pass in the schema, table name, column names, and the data. There is only one additional argument available in this initial release: a batch_size parameter that is used to split the processing of the supplied data into chunks of that number of rows, allowing you to more easily tune the performance without having to explicitly loop and make multiple calls to direct_path_load(). The method is supported only in python-oracledb Thin mode - there are no plans to add it to Thick mode.
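
To show what batch_size saves you from writing, here is a sketch of the explicit loop next to the single call. The two helper function names are my own. Both assume the same two-column "mytab" example used above, and that the supplied rows are in a list.

```python
def load_in_chunks(connection, rows, chunk_size):
    """Manual chunking: one direct_path_load() call per slice of rows."""
    for start in range(0, len(rows), chunk_size):
        connection.direct_path_load(
            schema_name="HR",
            table_name="mytab",
            column_names=["id", "name"],
            data=rows[start:start + chunk_size],
        )


def load_with_batch_size(connection, rows, chunk_size):
    """Same intent in a single call, letting the driver split the data."""
    connection.direct_path_load(
        schema_name="HR",
        table_name="mytab",
        column_names=["id", "name"],
        data=rows,
        batch_size=chunk_size,
    )
```

With the second form, tuning becomes a matter of changing one number and re-running your benchmark.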

The API documentation is here. The User Guide is here.

As a consequence of the specialized Database architecture, there are a few restrictions on when Direct Path Loads can be used: check the documentation on SQL*Loader Direct Path Loads and on the Oracle Call Interface Direct Path Load Interface.

The CSV file, the database table, the benchmark

I decided to measure not just insertion time, but the whole time to ingest a CSV file into an Oracle Database table. This is a task I know a lot of you are doing.

The CSV files: I used files with 100,000 lines, with 1,000,000 lines, and with 2,000,000 lines. The CSV files were like:

1,"23-Sep-2025","String for row 1"
2,"23-Sep-2025","String for row 2"
. . .

The simple code I used to create the files is here.
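
If you just want to reproduce files in the format shown above, a minimal sketch with the standard csv module is below. This is not the linked generator script. The write_sample_csv() helper name and the fixed date value are assumptions for illustration.

```python
import csv
import io


def write_sample_csv(out, num_rows, date_text="23-Sep-2025"):
    """Write rows like: 1,"23-Sep-2025","String for row 1" to a file-like object."""
    # QUOTE_NONNUMERIC quotes the two string fields but leaves the integer id bare
    writer = csv.writer(out, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for i in range(1, num_rows + 1):
        writer.writerow([i, date_text, f"String for row {i}"])


# Demonstrate with an in-memory buffer
buffer = io.StringIO()
write_sample_csv(buffer, 2)
print(buffer.getvalue(), end="")
```

To create a real file, open it with newline="" and pass the file object, for example with open("sample.csv", "w", newline="") as f: write_sample_csv(f, 100_000).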

The table: The database table was created like:

create table mytab (id number, dt date, name varchar2(50));

The code: Each test read the whole file into Python memory, and wrote it to the database in one go. My code was actually structured to allow loading in smaller batches so I could experiment with streaming data. In a real world scenario you might have too much data to hold in memory, or to efficiently send across the network in one operation. But, for this blog, I kept it simple and did everything in one shot. Wherever you see BATCH_SIZE or BLOCK_SIZE in the code snippets, assume that this is set to the appropriate value to read and insert all rows at once.

The complete code can be found here.

The data ingestion choices

There are three ways I used to read data from a CSV file into Python:

With the Pandas read_csv() method which reads into a DataFrame
With the standard Python library csv module which returns each row as a list of strings
With PyArrow’s csv methods which read into a PyArrow Table interoperable with a DataFrame

After data was in Python memory, there were five ways to insert into the database:

Calling the Pandas to_sql() method
Passing a list to python-oracledb’s executemany() method
Passing a DataFrame to python-oracledb’s executemany() method
Passing a list to python-oracledb’s direct_path_load() method
Passing a DataFrame to python-oracledb’s direct_path_load() method

Obviously, not all combinations are possible.

The Solutions

I tried five solutions, which are listed here in slowest-to-fastest order. The complete code can be found here. The results are shown at the bottom.

1. Loading CSV Data with Pandas

Though simple to code, using Pandas to load data is the slowest solution — and not just because of extra initial overhead checking the schema. So unless you need specific Pandas data loading functionality, or performance really isn’t important, you should not use it.

Code is like:

csv_reader = pandas.read_csv(
    "sample.csv",
    header=None,
    names=["id", "dt", "name"],
    parse_dates=['dt'],
    chunksize=BATCH_SIZE)
for d in csv_reader:
    d.to_sql("mytab", engine, if_exists='append', index=False)

To be roughly comparable with other solutions, I pre-created the table, so I used the append mode of Pandas.

With large data sets, the overhead Pandas incurs checking the schema before inserting data is relatively lower, but it is still a factor if ultimate speed is your goal.

2. Using Python’s csv module to read into a list, and executemany() to insert

Using the standard Python csv module and calling Oracle’s efficient executemany() method used to be the “best-practice” solution but now comes in as second slowest.

Code is like:

cursor.setinputsizes(None, None, 50)

sql = "insert into mytab (id, dt, name) values (:1, :2, :3)"
data = []
csv_reader = csv.reader(open("sample.csv", "r"), delimiter=",")
for line in csv_reader:
    data.append((float(line[0]), datetime.strptime(line[1], "%d-%b-%Y"), line[2]))
    if len(data) % BATCH_SIZE == 0:
        cursor.executemany(sql, data)
        data = []
if data:
    cursor.executemany(sql, data)
connection.commit()

I carefully used setinputsizes() so that python-oracledb’s executemany() call knew how much memory to allocate for each of the three fields / bind variables and didn’t have to do slow re-allocations as more data was parsed. The first column is numeric, so I passed None to use the default type handling. The second bind variable of the call is also None since the default python-oracledb date handling knows the size of dates. The third CSV field is a string, so I passed 50, the maximum size of the NAME column.

The dates are stored in the CSV file as strings “23-Sep-2025”. My code converts them to datetime objects for insertion. Without this explicit conversion, performance can be slow due to the type mismatch with the database type. See my earlier blog Application and database type mismatches slow down data loads.

3. Using PyArrow’s CSV loader and passing the DataFrame to executemany()

Using PyArrow’s CSV loader has some type handling efficiencies, so performance of this solution was another step better for me.

Code is like:

sql = "insert into mytab (id, dt, name) values (:1, :2, :3)"

read_options = pyarrow.csv.ReadOptions(
    column_names=["id", "dt", "name"],
    block_size=BLOCK_SIZE
)
convert_options = pyarrow.csv.ConvertOptions(
    timestamp_parsers=["%d-%b-%Y"],
    column_types={
        "id":pyarrow.int64(),
        "dt":pyarrow.timestamp("us"),
        "name":pyarrow.string()
    }
)
csv_reader = pyarrow.csv.open_csv(
    "sample.csv",
    read_options=read_options,
    convert_options=convert_options)
for df in csv_reader:
    if df is None:
        break
    cursor.executemany(sql, df)
connection.commit()

For the read options, I set the column names since my CSV file didn’t have a header row.

When reading in batches, PyArrow takes a buffer size, not a row count. As previously noted, I set this to a size so that all the data was read and inserted in one go.

Since executemany() can take a DataFrame, I didn't need to do any conversion of the data to a list. This resulted in compact code iterating over the data.

4. Using Python’s csv module to read into a list, and direct_path_load() to insert

Now the fun begins: Using the standard Python csv module to construct a list, and then passing that to the new direct_path_load() method is getting speedy.

Code is like:

data = []
csv_reader = csv.reader(open("sample.csv", "r"), delimiter=",")
for line in csv_reader:
    data.append((float(line[0]), datetime.strptime(line[1], "%d-%b-%Y"), line[2]))
    if len(data) % BATCH_SIZE == 0:
        connection.direct_path_load(
            schema_name="HR",
            table_name="mytab",
            column_names=["id", "dt", "name"],
            data=data)
        data = []
if data:
    connection.direct_path_load(
        schema_name="HR",
        table_name="mytab",
        column_names=["id", "dt", "name"],
        data=data)

The code is very similar to solution #2, but calls direct_path_load() instead of executemany().

There is no explicit commit() call here, since direct_path_load() will commit the data.

5. Using PyArrow’s CSV loader and passing the DataFrame to direct_path_load()

The new best-practice solution uses the PyArrow CSV loader to read into a DataFrame format, and then passes this DataFrame to direct_path_load().

Code is like:

read_options = pyarrow.csv.ReadOptions(
    column_names=["id", "dt", "name"],
    block_size=BLOCK_SIZE
)
convert_options = pyarrow.csv.ConvertOptions(
    timestamp_parsers=["%d-%b-%Y"],
    column_types={
        "id": pyarrow.int64(),
        "dt": pyarrow.timestamp("us"),
        "name": pyarrow.string()
    }
)
csv_reader = pyarrow.csv.open_csv(
    "sample.csv", read_options=read_options, convert_options=convert_options)
for df in csv_reader:
    if df is None:
        break
    connection.direct_path_load(
        schema_name="HR",
        table_name="mytab",
        column_names=["id", "dt", "name"],
        data=df)

The code is similar to solution #3, but calls direct_path_load() instead of executemany().

For the read options, I set the column names since my CSV file didn’t have a header row.

When reading in batches, PyArrow takes a buffer size, not a row count. As previously noted, I set this to the maximum size so that all the data was read and inserted in one go.

Since direct_path_load() can take a DataFrame, I didn't need to do any conversion of the data to a list. This resulted in compact code iterating over the data.

There is no explicit commit() call here, since direct_path_load() will commit the data.

Results in a Picture

Here are my results. Less is better. The numbers are averages over a few runs. In each group, the bars represent the solutions in the order described above.

For the three file sizes, loading with Pandas (solution #1) was slowest, while loading with PyArrow’s csv module and calling direct_path_load() (solution #5) was the fastest. The benefit increased as the data size increased. For my 2,000,000 row file, Direct Path Loading was 4 times faster than using Pandas, and 3 times faster than solution #2, our old recommendation.

On the numbers themselves, I am using an x86_64 version 23 database, emulating the architecture on an arm64 Mac, so it is inherently slow. Also, I was running Python on the same Mac. Your results will vary for all the normal reasons, so do your own benchmarking. Reasons include how slow your machine is, the network speed, how much other work your database is doing, how much data you are loading, and what data types are involved. When testing, you may also want to measure the impact on the database itself. Even if the elapsed time for using direct_path_load() is longer than for executemany() with a particular data set (e.g. a small one), there may still be a reduced impact on the database, which may improve overall system performance.
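
For your own benchmarking, a tiny timing harness is enough to get averages over a few runs. This is a sketch and not the harness I used for the results above. The average_elapsed() helper and the load_stub() placeholder are hypothetical names. In a real test the placeholder would reset the table and then call one of the five solutions.

```python
import time


def average_elapsed(func, runs=3):
    """Call func() `runs` times and return the mean wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


def load_stub():
    # Placeholder workload standing in for a real data load
    sum(range(100_000))


print(f"average: {average_elapsed(load_stub):.4f} s")
```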

Summary

The fastest way to load very large datasets into Oracle Database with Python is to use direct_path_load(). When loading a CSV file, you can also take advantage of PyArrow CSV methods to read into a DataFrame format. Direct Path Loading performance will vary with many factors, including the data types involved. Direct Path Loads are fast but have a few database restrictions, so review the database documentation before assuming you can magically drop them into an existing ETL pipeline. Unlike executemany(), there is currently no way to filter noisy data, so make sure your data is clean before trying to load it.

When evaluating which Oracle Database data loading solution to use, don’t forget to check out Oracle’s specialized SQL*Loader tool, and also External Tables.

Let us know what you think of Direct Path Loading. If you want to try (or improve) the five solutions shown in this blog, the code is here.

Thanks for using python-oracledb.

Direct Path Loads: Fast data ingestion with Python and Oracle Database was originally published in Oracle Developers on Medium, where people are continuing the conversation by highlighting and responding to this story.

python-oracledb 3.4 introduces Direct Path Loads for rapid bulk data insertion

Christopher Jones — Tue, 07 Oct 2025 04:21:12 GMT

Another great release of python-oracledb supports your use of Python and Oracle Database for ETL, Data Analysis, and AI.

The python-oracledb 3.4 driver is now available on PyPI.

Direct Path Loading

The marquee feature in python-oracledb 3.4 is Direct Path Loading, something I’ve really wanted to support for a long time.

Direct Path Loading is a database feature most commonly known from SQL*Loader, but also exposed by JDBC, Oracle Call Interface, and ODP.NET — and now by python-oracledb. It allows data being inserted into an Oracle Database table to bypass code layers such as the database buffer cache. Also there are no INSERT statements used. This allows very fast ingestion of huge amounts of data.

In python-oracledb 3.4, a simple API direct_path_load() lets you pass in a list of sequences or a DataFrame to be loaded into a table. For example with data in a sequence:

DATA = [
    (1, "First row"),
    (2, "Second row"),
    (3, "Third row"),
]

connection.direct_path_load(
    schema_name="HR",
    table_name="mytab",
    column_names=["id", "name"],
    data=DATA
)

If you are using Python to load large amounts of data into Oracle Database you will want to check out this new feature. See my companion blog Direct Path Loads: Fast data ingestion with Python and Oracle Database for more information.

Python-oracledb documentation on Direct Path Loads is here.

DataFrames are “production”

We have removed the “pre-release” status for DataFrame features. My thanks to all our users and early adopters for their inputs into the design and functionality over the last few releases.

DataFrame Type Mapping

We now allow explicit “type mapping” when querying into a DataFrame, letting you choose the types that your data frames will use. The following will create a DataFrame containing an int16 and a string, with names “col_1” and “col_2” respectively.

schema = pyarrow.schema([
    ("col_1", pyarrow.int16()),
    ("col_2", pyarrow.string())
])

odf = connection.fetch_df_all(
    "select 456 c1, 'King' c2 from dual",
    requested_schema=schema
)
tab = pyarrow.table(odf)
print(tab)

One use for this is to reduce the memory requirements for numbers which have a known, small value range, since you can now specify a smaller numeric type than the default type mapping uses. This can also help performance.
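To get a feel for the saving, here is a back-of-the-envelope sketch. It does not use python-oracledb or PyArrow at all: the standard library `array` module stands in for a column buffer, and it assumes the default mapping would have produced an 8-byte integer where a 2-byte int16 is enough for the data.

```python
from array import array

values = range(1000)  # a known, small value range that fits in 16 bits

as_int64 = array("q", values)  # 8-byte signed integers
as_int16 = array("h", values)  # 2-byte signed integers

# Size of the raw value storage for each choice
bytes_int64 = as_int64.itemsize * len(as_int64)
bytes_int16 = as_int16.itemsize * len(as_int16)

print(f"int64: {bytes_int64} bytes, int16: {bytes_int16} bytes")
```

The same ratio applies per value regardless of row count, so the benefit grows with the size of the result set.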

See the documentation Explicit Data Frame Type Mapping for more information.

DataFrame Chunk Support

We now support multiple chunks when ingesting DataFrames.

DataFrame Type Support

Additional data types are now supported in DataFrames. We added support for all of the signed and unsigned fixed width integer types when ingesting data frames supporting the Arrow PyCapsule interface into Oracle Database. Previously only int64 was supported. Also added was support for types date32 and date64.

When querying string and binary data into a DataFrame, we now default to the Apache Arrow “LARGE_STRING” and “LARGE_BINARY” types. These support 64-bit offsets, making it more convenient to work with large data sets. If saving 4 bytes per record is important, you can use explicit type mapping to request STRING or BINARY types.
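The "4 bytes per record" figure comes from the Arrow layout for variable-length data, which keeps one offset per value plus one extra. A small sketch of the arithmetic follows; the helper name `offsets_buffer_bytes` and the row count are made up for illustration.

```python
def offsets_buffer_bytes(num_values, offset_width_bytes):
    """Size of the offsets buffer for an Arrow variable-length column."""
    # The layout stores num_values + 1 offsets
    return (num_values + 1) * offset_width_bytes


rows = 1_000_000  # illustrative row count

string_bytes = offsets_buffer_bytes(rows, 4)        # STRING / BINARY: 32-bit offsets
large_string_bytes = offsets_buffer_bytes(rows, 8)  # LARGE_STRING / LARGE_BINARY: 64-bit offsets
extra_bytes = large_string_bytes - string_bytes

# Signed 32-bit offsets cap the total character data in one array
max_data_bytes_32bit = 2**31 - 1

print(f"Extra offset storage for {rows} rows: {extra_bytes} bytes")
print(f"Data limit per array with 32-bit offsets: {max_data_bytes_32bit} bytes")
```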

Easy Batch Size Control with executemany()

The executemany() method commonly used for batch INSERT and UPDATE statements now accepts a batch_size parameter. This is used to split the processing of the supplied data into chunks of that number of rows. This allows you to more easily tune the performance of batch inserts without having to explicitly loop and make multiple calls to executemany().
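As a rough before-and-after sketch: the first helper is the kind of explicit loop you may have written previously, and the second passes the new batch_size keyword named above. Both helper names are hypothetical, `cursor` is assumed to be an open python-oracledb cursor, and I have not shown how batch_size interacts with other executemany() options.

```python
def insert_in_batches(cursor, sql, rows, batch_size):
    """Older style: slice the data and call executemany() once per slice."""
    for start in range(0, len(rows), batch_size):
        cursor.executemany(sql, rows[start:start + batch_size])


def insert_with_batch_size(cursor, sql, rows, batch_size):
    """python-oracledb 3.4 style: one call, the driver does the splitting."""
    cursor.executemany(sql, rows, batch_size=batch_size)
```

With the second form, trying a different batch size is a one-argument change, which makes timing experiments much less fiddly.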

Fine-grained control over LOB and number handling

We added fetch_lobs and fetch_decimals parameters where applicable to the methods used for fetching rows or DataFrames. These behave in the same way as the oracledb.defaults.fetch_lobs and oracledb.defaults.fetch_decimals attributes, but give you more control over which routines in your code return data in the desired format.
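Why would you want decimals in one routine but not another? The fetch_decimals setting controls whether numbers come back as decimal.Decimal instead of Python floats. The sketch below uses only the standard library (no database call) to show the kind of difference that matters for, say, currency totals, while floats remain cheaper where exactness is not needed.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly
float_total = 0.1 + 0.2
decimal_total = Decimal("0.1") + Decimal("0.2")

print(float_total == 0.3)                # False
print(decimal_total == Decimal("0.3"))   # True
```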

New Optional Install Dependencies

For users of Centralized Configuration Provider support and Cloud Native Authentication, we added optional install dependencies [oci_config], [azure_config], [oci_auth] and [azure_auth] to simplify installation of required dependencies. For example, to get set up to use OCI Cloud Native Authentication, you can now do:

python -m pip install oracledb[oci_auth]

Deprecation Warnings

We’re giving notice that we will necessarily need to de-support old stuff at some future time.

Eventually we will have to stop building packages for macOS Intel and Windows 32-bit because the Python cryptography package we require has just announced its deprecation of these architectures. Their timeline puts an upper limit on how long we can continue to produce packages for these architectures. We may need to drop support a bit earlier, depending on how the python-oracledb major release schedule aligns with the cryptography package’s.

We’re also giving notice that as time passes and new versions of Oracle Database and Oracle Client are released, we will eventually need to take very old versions out of our test plan and connectivity / interoperability support. Over the last 18 years we’ve been adding support for new versions as they have come out: Oracle Database 11g features were added to cx_Oracle/python-oracledb back in October 2007. Oracle Database version 12c support was added in May 2014. Those are ancient dates. These older database and client software packages have themselves been in “Upgrade Support”-only status from Oracle Support for a number of years. Now that Oracle Database is at version 23, we’re nudging you to upgrade. We haven’t set any specific timeline for python-oracledb’s desupport of the old versions, but be warned it will have to happen.

Other changes

For all the other improvements and bug fixes in python-oracledb 3.4, see the Release Notes.

Thank you for using python-oracledb!

python-oracledb 3.4 introduces Direct Path Loads for rapid bulk data insertion was originally published in Oracle Developers on Medium, where people are continuing the conversation by highlighting and responding to this story.

Python Data Classes make it easy to fetch database rows as objects

Christopher Jones — Mon, 22 Sep 2025 22:49:50 GMT

Python-oracledb rowfactories are a powerful way for Oracle Database queries to alter the representation of fetched rows, reducing the amount of application boilerplate code and data copying. This blog shows how easy it is to use a Python Data Class with a rowfactory to transform rows into instances of a user-defined class.

A rowfactory is a method that is invoked for each row fetched from the database before it is returned to the application. It can be set on a cursor after statement execution, before data is fetched.

Consider this code which does not use a rowfactory:

cursor.execute(
    """select employee_id, last_name, hire_date
       from employees
       where employee_id < 103
       order by employee_id""")

for row in cursor:
    print(row)

It simply prints tuples:

(100, 'King', datetime.datetime(2003, 6, 17, 0, 0))
(101, 'Kochhar', datetime.datetime(2005, 9, 21, 0, 0))
(102, 'De Haan', datetime.datetime(2001, 1, 13, 0, 0))

Since my goal is to get rows as objects, I can create a class for the three fields and add a dataclass decorator so it can be used as the cursor rowfactory. The full script is:

# dc.py - Data Classes and Rowfactories

import getpass
import platform
import dataclasses
import datetime

import oracledb

un = 'cj'
cs = 'localhost/orclpdb1'
pw = getpass.getpass(f'Enter password for {un}@{cs}: ')

@dataclasses.dataclass
class MyRow:
    employee_id: int
    last_name: str
    hire_date: datetime.datetime

connection = oracledb.connect(user=un, password=pw, dsn=cs)
cursor = connection.cursor()

cursor.execute(
    """select employee_id, last_name, hire_date
       from employees
       where employee_id < 103
       order by employee_id""")

cursor.rowfactory = MyRow

for row in cursor:
    print(row)

The output is now:

MyRow(employee_id=100, last_name='King', hire_date=datetime.datetime(2003, 6, 17, 0, 0))
MyRow(employee_id=101, last_name='Kochhar', hire_date=datetime.datetime(2005, 9, 21, 0, 0))
MyRow(employee_id=102, last_name='De Haan', hire_date=datetime.datetime(2001, 1, 13, 0, 0))

Each row has been returned as a MyRow object from which you can access the data fields as normal. For example, if you change the loop to:

for row in cursor:
    print("Number:", row.employee_id)
    print("Name:", row.last_name)
    print("Hire Date:", row.hire_date)

the output is:

Number: 100
Name: King
Hire Date: 2003-06-17 00:00:00
Number: 101
Name: Kochhar
Hire Date: 2005-09-21 00:00:00
Number: 102
Name: De Haan
Hire Date: 2001-01-13 00:00:00

It’s simple and easy!

Notes

Any time execute() is called, any existing rowfactory on the cursor is cleared, so if you re-execute a statement, remember to set cursor.rowfactory again.
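One way to avoid forgetting is to wrap the two steps together. This is only a sketch: `execute_with_rowfactory` is a hypothetical helper, not part of python-oracledb, and `cursor` is assumed to be an open cursor.

```python
def execute_with_rowfactory(cursor, rowfactory, statement, parameters=None):
    """Run a query, then apply the rowfactory that execute() cleared."""
    cursor.execute(statement, parameters)
    # Must be set after execute() and before any rows are fetched
    cursor.rowfactory = rowfactory
    return cursor

# Usage with the MyRow data class from the script above:
#   for row in execute_with_rowfactory(cursor, MyRow, "select ..."):
#       print(row.last_name)
```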

For other ways to change data with rowfactories, for example to return rows as dictionaries, see the python-oracledb documentation Changing Query Results with Rowfactories.
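For a quick taste of the dictionary variant, here is a minimal sketch. A rowfactory is called with the column values of each row as positional arguments, so a small closure over the column names is enough. The helper name `dict_rowfactory` is my own, and the example calls it directly with sample values rather than through a live cursor.

```python
def dict_rowfactory(column_names):
    """Return a callable suitable for cursor.rowfactory that builds dicts."""
    def make_row(*values):
        return dict(zip(column_names, values))
    return make_row

# With a live cursor, after execute():
#   columns = [col.name for col in cursor.description]
#   cursor.rowfactory = dict_rowfactory(columns)

make_row = dict_rowfactory(["EMPLOYEE_ID", "LAST_NAME"])
row = make_row(100, "King")
print(row)  # {'EMPLOYEE_ID': 100, 'LAST_NAME': 'King'}
```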

Python Data Classes make it easy to fetch database rows as objects was originally published in Oracle Developers on Medium, where people are continuing the conversation by highlighting and responding to this story.

Come and hear all about python-oracledb at Oracle AI World in Las Vegas

Christopher Jones — Mon, 22 Sep 2025 00:07:57 GMT

If you are going to Oracle AI World in Las Vegas next month, a must-see is the session Python-oracledb: Advanced Integration for Oracle Database and Python [LRN2902] on Thursday, 16th October 2025.

The Talk

Anthony Tuininga, the widely respected creator and lead developer of python-oracledb, is presenting Python-oracledb: Advanced Integration for Oracle Database and Python [LRN2902] at 10:15am on Thursday, 16th October 2025. Hear all about the great new features in python-oracledb (including some cool features that are so new they aren’t in the abstract). It’s a great opportunity to ask questions, and to connect with us.

You may also be interested in Create and Implement Multicloud Applications for Any Runtime and Hyperscaler [LRN2914], which will cover “centralized configuration providers” in more detail. Our development VP, Srinath Krishnaswamy, will be giving the python-oracledb section of this talk.

The Booth

Yes, there will be a demo booth. Details will be forthcoming.

Application and database type mismatches slow down data loads

Christopher Jones — Fri, 19 Sep 2025 22:29:06 GMT

Back in Round-trips are only part of the data insertion problem, I linked to a user case study where it was the optimizer, not the cost of round-trips, that slowed loading data into Oracle Database. The same problem bit me recently. Let me show you my example and the solution.

The scenario is using Python to load a 500,000-line CSV file “sample.csv” like:

1,"18-Sep-2025","String for row 1"
2,"18-Sep-2025","String for row 2"
3,"18-Sep-2025","String for row 3"
. . .
500000,"18-Sep-2025","String for row 500000"

into an Oracle Database table:

create table mytab (id number, dt date, name varchar2(50));

Notice that the data file has string formatted dates, as you would expect in CSV files.

A standard way of loading in python-oracledb is:

import csv

cursor.setinputsizes(None, 20, 50)
data = []
csv_reader = csv.reader(open("sample.csv", "r"), delimiter=",")
for line in csv_reader:
    data.append((line[0], line[1], line[2]))
if data:
    sql = "insert into mytab (id, dt, name) values (:1, :2, :3)"
    cursor.executemany(sql, data)

I carefully used setinputsizes() so that python-oracledb’s executemany() call knew how much memory to allocate for each of the three fields and didn’t have to do slow re-allocations as more data was parsed. The first column is numeric, so I passed None to use the default type handling. The second and third CSV fields are strings so I chose some reasonable sizes.

In the CSV reader loop, I appended a tuple of each row to a list, which was eventually passed to the efficient executemany() function.

Running my example program (which has some additional timing code) gave:

Total elapsed time: 29,576 ms
500000 rows were inserted

My database is an emulated x86_64 architecture on an arm64 Mac, so it is inherently slow. Also I was running Python on the same Mac. However, the elapsed time still looked way off what I expected.

The issue is the date field: there is a type conversion that is slow. Here is an updated variant of the script:

import csv
from datetime import datetime

cursor.setinputsizes(None, None, 50)
data = []
csv_reader = csv.reader(open("sample.csv", "r"), delimiter=",")
for line in csv_reader:
    # Convert the date string to a datetime so it binds as a date, not a string
    data.append((float(line[0]), datetime.strptime(line[1], "%d-%b-%Y"), line[2]))
if data:
    sql = "insert into mytab (id, dt, name) values (:1, :2, :3)"
    cursor.executemany(sql, data)

It converts the CSV date strings into datetime objects (and the id strings into numbers) when constructing the list. The corresponding second field of the setinputsizes() call is now None since the default python-oracledb date handling knows the size of dates.

This time it ran in:

Total elapsed time: 7,568 ms
500000 rows were inserted

which is a lot faster. The number of round-trips was the same, but the bind type mismatch has been solved.
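If you want to check the conversion step on its own, it can be tried without a database. The sketch below feeds two sample lines through csv.reader from an in-memory string; the `typed_rows` helper and the `SAMPLE` text are just for illustration. Note that `%b` month names are locale-dependent, so this assumes an English locale.

```python
import csv
import io
from datetime import datetime

# Two lines in the same shape as sample.csv
SAMPLE = (
    '1,"18-Sep-2025","String for row 1"\n'
    '2,"18-Sep-2025","String for row 2"\n'
)


def typed_rows(csv_file):
    """Yield rows with Python types that match the NUMBER, DATE, VARCHAR2 columns."""
    for id_, dt, name in csv.reader(csv_file, delimiter=","):
        yield (float(id_), datetime.strptime(dt, "%d-%b-%Y"), name)


data = list(typed_rows(io.StringIO(SAMPLE)))
print(data[0])
```

The resulting list of tuples is in the shape that executemany() expects for the three bind variables.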

If you didn’t start by reviewing Round-trips are only part of the data insertion problem, I recommend reading it now. As well as the case study, it has some links to cases where round-trips were the problem, and shows how to measure and improve the load speed.

In summary, by efficiently saving round-trips, and by making sure the database is efficiently ingesting the data you are sending it, you can accelerate your data loads.

An addendum: you might know that Pandas CSV functionality makes it easy to load into the database - and it has options to automatically parse dates. However, Pandas can be very slow so I generally avoid it for data uploads. An upcoming blog post will show this and give some new best-practice samples which will be even faster than executemany(). [Update: that post is Direct Path Loads: Fast data ingestion with Python and Oracle Database].

Application and database type mismatches slow down data loads was originally published in Oracle Developers on Medium, where people are continuing the conversation by highlighting and responding to this story.

New Tcl oratcl fork makes deployment easy

Christopher Jones — Thu, 18 Sep 2025 21:39:48 GMT

Do you use the Tcl language with Oracle Database? Miguel Bañón has released a fork of the oratcl driver that introduces a loose coupling on the Oracle Client libraries, making it much easier to distribute and deploy.

You can now locate or update the Oracle Client libraries without having to rebuild oratcl. You can use Oracle Client libraries from Oracle Instant Client packages, from a full Oracle Client installation (such as installed by Oracle’s GUI installer), or from those included in Oracle Database if Tcl is on the same machine as the database.

The technology behind the oratcl change is ODPI-C, Oracle’s wrapper over the Oracle Client libraries. ODPI-C dynamically loads the Oracle Client at runtime, so as long as you have the libraries in your OS library loading path, oratcl will use them. ODPI-C is used by a number of projects, notably python-oracledb and node-oracledb Thick modes.

The new oratcl fork is on GitHub here. Documentation is here.

Using Python for Data Analysis and AI

Christopher Jones — Mon, 01 Sep 2025 05:33:36 GMT

Are you using Python for data analysis and AI? Did you know the python-oracledb driver for Oracle Database can query directly into, and insert from, Python DataFrames? This can be very fast when you want to use packages such as Apache PyArrow, Pandas, Polars, NumPy, Dask, PyTorch, or to write files in Apache Parquet or Delta Lake format.

Oracle Database is the best repository for your data when doing analytic or AI workloads.

Here are handy resources showing how simple and efficient it is to fetch into DataFrames, and to insert DataFrames directly into Oracle Database using Python.

Videos

Blogs

Samples

See files beginning with “dataframe” on GitHub

Documentation

Background

Python-oracledb is an open source package for the Python Database API specification with many additions to support advanced Oracle Database features. By default, it is a ‘Thin’ driver that is immediately usable without needing Oracle Instant Client libraries. Install it with python -m pip install oracledb

Using Python for Data Analysis and AI was originally published in Oracle Developers on Medium, where people are continuing the conversation by highlighting and responding to this story.