<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://www.pgrs.net/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.pgrs.net/" rel="alternate" type="text/html" /><updated>2026-01-26T21:15:53-08:00</updated><id>https://www.pgrs.net/feed.xml</id><title type="html">Paul Gross’s Blog</title><subtitle>Principal software developer</subtitle><author><name>Paul Gross</name><email>pgross@gmail.com</email></author><entry><title type="html">PostgreSQL Scripting Tips</title><link href="https://www.pgrs.net/2026/01/06/postgresql-scripting-tips/" rel="alternate" type="text/html" title="PostgreSQL Scripting Tips" /><published>2026-01-06T00:00:00-08:00</published><updated>2026-01-06T00:00:00-08:00</updated><id>https://www.pgrs.net/2026/01/06/postgresql-scripting-tips</id><content type="html" xml:base="https://www.pgrs.net/2026/01/06/postgresql-scripting-tips/"><![CDATA[<p>I have been working on a double-entry ledger implementation in PostgreSQL called <a href="https://github.com/pgr0ss/pgledger">pgledger</a>, and I wanted to write some example scripts. <a href="https://github.com/pgr0ss/pgledger">pgledger</a> is written in SQL and meant to be used from any language or platform where you can call SQL functions, so I wanted the examples to be pure SQL.</p>

<p>Here are a few tips I discovered along the way. The sections are relatively independent:</p>
<ul>
  <li><a href="#storing-response-variables-with-gset">Storing Response Variables with <code class="language-plaintext highlighter-rouge">\gset</code></a></li>
  <li><a href="#transposing-data-with-crosstabview">Transposing Data with <code class="language-plaintext highlighter-rouge">\crosstabview</code></a></li>
  <li><a href="#showing-sql-statements-and-output-in-the-same-file">Showing SQL Statements and Output in the Same File</a></li>
</ul>

<h2 id="storing-response-variables-with-gset">Storing Response Variables with <code class="language-plaintext highlighter-rouge">\gset</code></h2>

<p>At first, I found it difficult to write SQL scripts given the random nature of <code class="language-plaintext highlighter-rouge">pgledger</code> function responses. Each function call would generate new random identifiers, so I couldn’t just write a static set of SQL statements. (See here for more about how I <a href="https://github.com/pgr0ss/pgledger#ids">generate random IDs for <code class="language-plaintext highlighter-rouge">pgledger</code></a>.)</p>

<p>This is when I discovered the <code class="language-plaintext highlighter-rouge">psql</code> command <code class="language-plaintext highlighter-rouge">\gset</code>. <code class="language-plaintext highlighter-rouge">\gset</code> executes a SQL statement and stores the result in a local variable. For example:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pgledger</span><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">id</span> <span class="k">FROM</span> <span class="n">pgledger_create_account</span><span class="p">(</span><span class="s1">'user1.external'</span><span class="p">,</span> <span class="s1">'USD'</span><span class="p">)</span> <span class="err">\</span><span class="n">gset</span>
</code></pre></div></div>

<p>And then <code class="language-plaintext highlighter-rouge">id</code> can be used in follow-up SQL statements as <code class="language-plaintext highlighter-rouge">:'id'</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pgledger=# select :'id';
            ?column?
---------------------------------
 pgla_01KE4VY0XGE7FTJY9FVYV2689N
(1 row)
</code></pre></div></div>

<p>Variables can be named inline or by setting a prefix as an argument to <code class="language-plaintext highlighter-rouge">\gset</code>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pgledger</span><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">id</span> <span class="k">AS</span> <span class="n">user1_external_id</span> <span class="k">FROM</span> <span class="n">pgledger_create_account</span><span class="p">(</span><span class="s1">'user1.external'</span><span class="p">,</span> <span class="s1">'USD'</span><span class="p">)</span> <span class="err">\</span><span class="n">gset</span>

<span class="n">pgledger</span><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">id</span> <span class="k">FROM</span> <span class="n">pgledger_create_account</span><span class="p">(</span><span class="s1">'user1.receivables'</span><span class="p">,</span> <span class="s1">'USD'</span><span class="p">)</span> <span class="err">\</span><span class="n">gset</span> <span class="n">user1_receivables_</span>

<span class="n">pgledger</span><span class="o">=#</span> <span class="k">select</span> <span class="p">:</span><span class="s1">'user1_external_id'</span><span class="p">,</span> <span class="p">:</span><span class="s1">'user1_receivables_id'</span><span class="p">;</span>
            <span class="o">?</span><span class="k">column</span><span class="o">?</span>             <span class="o">|</span>            <span class="o">?</span><span class="k">column</span><span class="o">?</span>
<span class="c1">---------------------------------+---------------------------------</span>
 <span class="n">pgla_01KE4W21EXFWHASSZKMSGH6RT2</span> <span class="o">|</span> <span class="n">pgla_01KE4W25Y5FVF9W256JE7RYJZP</span>
<span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span>
</code></pre></div></div>
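<p>These stored variables make it easy to script multi-step flows. For example, a transfer between the two accounts created above might look something like this (the exact <code class="language-plaintext highlighter-rouge">pgledger_create_transfer</code> signature here is a sketch; check the pgledger README for the current one):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pgledger=# SELECT * FROM pgledger_create_transfer(:'user1_external_id', :'user1_receivables_id', '50.00');
</code></pre></div></div>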

<p>For more information, check out the PostgreSQL docs at
<a href="https://www.postgresql.org/docs/18/app-psql.html#APP-PSQL-META-COMMAND-GSET"><code class="language-plaintext highlighter-rouge">\gset</code></a> or see how I use it in the <a href="https://github.com/pgr0ss/pgledger/tree/main/examples">pgledger examples</a>.</p>

<h2 id="transposing-data-with-crosstabview">Transposing Data with <code class="language-plaintext highlighter-rouge">\crosstabview</code></h2>

<p>Once data is recorded in a ledger, one of the ways to visualize it is by tracing one flow across many accounts. For example, a single payment might first be recorded as a receivable (expecting money), then as available funds, and finally as partially or fully refunded. If the payment ID is recorded in metadata, a simple query in <a href="https://github.com/pgr0ss/pgledger">pgledger</a> might look like this:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
    <span class="n">e</span><span class="p">.</span><span class="n">transfer_id</span><span class="p">,</span>
    <span class="n">a</span><span class="p">.</span><span class="n">name</span><span class="p">,</span>
    <span class="n">e</span><span class="p">.</span><span class="n">amount</span>
<span class="k">FROM</span> <span class="n">pgledger_entries_view</span> <span class="n">e</span>
<span class="k">INNER</span> <span class="k">JOIN</span> <span class="n">pgledger_accounts_view</span> <span class="n">a</span> <span class="k">ON</span> <span class="n">e</span><span class="p">.</span><span class="n">account_id</span> <span class="o">=</span> <span class="n">a</span><span class="p">.</span><span class="n">id</span>
<span class="k">WHERE</span> <span class="n">e</span><span class="p">.</span><span class="n">metadata</span> <span class="o">-&gt;&gt;</span> <span class="s1">'payment_id'</span> <span class="o">=</span> <span class="s1">'p_123'</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">e</span><span class="p">.</span><span class="n">transfer_id</span><span class="p">;</span>

           <span class="n">transfer_id</span>           <span class="o">|</span>          <span class="n">name</span>          <span class="o">|</span> <span class="n">amount</span>
<span class="c1">---------------------------------+------------------------+--------</span>
 <span class="n">pglt_01KE4XK326F8PB40DKGMW4V7J3</span> <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="k">external</span>         <span class="o">|</span> <span class="o">-</span><span class="mi">50</span><span class="p">.</span><span class="mi">00</span>
 <span class="n">pglt_01KE4XK326F8PB40DKGMW4V7J3</span> <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="n">receivables</span>      <span class="o">|</span>  <span class="mi">50</span><span class="p">.</span><span class="mi">00</span>
 <span class="n">pglt_01KE4XK328F5EBTFS44PMV4NNF</span> <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="n">receivables</span>      <span class="o">|</span> <span class="o">-</span><span class="mi">49</span><span class="p">.</span><span class="mi">50</span>
 <span class="n">pglt_01KE4XK328F5EBTFS44PMV4NNF</span> <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="n">available</span>        <span class="o">|</span>  <span class="mi">49</span><span class="p">.</span><span class="mi">50</span>
 <span class="n">pglt_01KE4XK329ENCRYKCTYWSW7JMG</span> <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="n">available</span>        <span class="o">|</span> <span class="o">-</span><span class="mi">20</span><span class="p">.</span><span class="mi">00</span>
 <span class="n">pglt_01KE4XK329ENCRYKCTYWSW7JMG</span> <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="n">pending_outbound</span> <span class="o">|</span>  <span class="mi">20</span><span class="p">.</span><span class="mi">00</span>
 <span class="n">pglt_01KE4XK329FGK92HRWTEARYBVQ</span> <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="n">pending_outbound</span> <span class="o">|</span> <span class="o">-</span><span class="mi">20</span><span class="p">.</span><span class="mi">00</span>
 <span class="n">pglt_01KE4XK329FGK92HRWTEARYBVQ</span> <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="k">external</span>         <span class="o">|</span>  <span class="mi">20</span><span class="p">.</span><span class="mi">00</span>
<span class="p">(</span><span class="mi">8</span> <span class="k">rows</span><span class="p">)</span>
</code></pre></div></div>

<p>This shows all of the account movements, but it can be hard to visualize in this form as rows. Since each transfer in the ledger is made up of multiple entries (e.g. from and to), the query results show multiple rows for each transfer (as you can see by the duplicate <code class="language-plaintext highlighter-rouge">transfer_id</code> values).</p>

<p>With data like this, I find that transposing (or rotating) the output is often a better way to visualize it. Each account becomes a column, and the logical movements each become a single row affecting multiple columns.</p>

<p>In <code class="language-plaintext highlighter-rouge">psql</code>, this can be done with the <code class="language-plaintext highlighter-rouge">\crosstabview</code> command. This is as simple as adding <code class="language-plaintext highlighter-rouge">\crosstabview</code> to the end of the SQL statement:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
    <span class="n">e</span><span class="p">.</span><span class="n">transfer_id</span><span class="p">,</span>
    <span class="n">a</span><span class="p">.</span><span class="n">name</span><span class="p">,</span>
    <span class="n">e</span><span class="p">.</span><span class="n">amount</span>
<span class="k">FROM</span> <span class="n">pgledger_entries_view</span> <span class="n">e</span>
<span class="k">INNER</span> <span class="k">JOIN</span> <span class="n">pgledger_accounts_view</span> <span class="n">a</span> <span class="k">ON</span> <span class="n">e</span><span class="p">.</span><span class="n">account_id</span> <span class="o">=</span> <span class="n">a</span><span class="p">.</span><span class="n">id</span>
<span class="k">WHERE</span> <span class="n">e</span><span class="p">.</span><span class="n">metadata</span> <span class="o">-&gt;&gt;</span> <span class="s1">'payment_id'</span> <span class="o">=</span> <span class="s1">'p_123'</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">e</span><span class="p">.</span><span class="n">transfer_id</span> <span class="err">\</span><span class="n">crosstabview</span>

           <span class="n">transfer_id</span>           <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="k">external</span> <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="n">receivables</span> <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="n">available</span> <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="n">pending_outbound</span>
<span class="c1">---------------------------------+----------------+-------------------+-----------------+------------------------</span>
 <span class="n">pglt_01KE4XK326F8PB40DKGMW4V7J3</span> <span class="o">|</span>         <span class="o">-</span><span class="mi">50</span><span class="p">.</span><span class="mi">00</span> <span class="o">|</span>             <span class="mi">50</span><span class="p">.</span><span class="mi">00</span> <span class="o">|</span>                 <span class="o">|</span>
 <span class="n">pglt_01KE4XK328F5EBTFS44PMV4NNF</span> <span class="o">|</span>                <span class="o">|</span>            <span class="o">-</span><span class="mi">49</span><span class="p">.</span><span class="mi">50</span> <span class="o">|</span>           <span class="mi">49</span><span class="p">.</span><span class="mi">50</span> <span class="o">|</span>
 <span class="n">pglt_01KE4XK329ENCRYKCTYWSW7JMG</span> <span class="o">|</span>                <span class="o">|</span>                   <span class="o">|</span>          <span class="o">-</span><span class="mi">20</span><span class="p">.</span><span class="mi">00</span> <span class="o">|</span>                  <span class="mi">20</span><span class="p">.</span><span class="mi">00</span>
 <span class="n">pglt_01KE4XK329FGK92HRWTEARYBVQ</span> <span class="o">|</span>          <span class="mi">20</span><span class="p">.</span><span class="mi">00</span> <span class="o">|</span>                   <span class="o">|</span>                 <span class="o">|</span>                 <span class="o">-</span><span class="mi">20</span><span class="p">.</span><span class="mi">00</span>
<span class="p">(</span><span class="mi">4</span> <span class="k">rows</span><span class="p">)</span>
</code></pre></div></div>
<p>Now, each transfer is a single row, and each account is a column. Reading down, you can easily see which accounts were affected for each transfer.</p>
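<p><code class="language-plaintext highlighter-rouge">\crosstabview</code> also accepts explicit column arguments (<code class="language-plaintext highlighter-rouge">\crosstabview [ colV [ colH [ colD [ sortcolH ] ] ] ]</code>) when the default choice of vertical, horizontal, and data columns isn’t what you want. The query above could spell them out like this:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ORDER BY e.transfer_id \crosstabview transfer_id name amount
</code></pre></div></div>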

<p>For more information, check out the PostgreSQL docs at
<a href="https://www.postgresql.org/docs/18/app-psql.html#APP-PSQL-META-COMMANDS-CROSSTABVIEW"><code class="language-plaintext highlighter-rouge">\crosstabview</code></a> or see how I use it in the <a href="https://github.com/pgr0ss/pgledger/blob/edc811b8d6b4a3deeea28810a13b968c90f4a323/examples/reconciliation.sql.out">pgledger reconciliation examples</a>.</p>

<h2 id="showing-sql-statements-and-output-in-the-same-file">Showing SQL Statements and Output in the Same File</h2>

<p>The last trick I discovered was showing both SQL input and output in the same file. I didn’t want folks to read the examples without understanding what each SQL query returned.</p>

<p>I did this by writing each example in one file (e.g. <code class="language-plaintext highlighter-rouge">basic-example.sql</code>) and then running that file through <code class="language-plaintext highlighter-rouge">psql</code> with the <code class="language-plaintext highlighter-rouge">--echo-all</code> flag. This echoed both the input SQL and the output results, which I wrote to a new file (e.g. <code class="language-plaintext highlighter-rouge">basic-example.sql.out</code>). This produces a single file for review, which even includes the comments from the original file.</p>
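<p>A command along these lines produces the combined file (the exact invocation in my justfile may differ, but <code class="language-plaintext highlighter-rouge">--echo-all</code>, <code class="language-plaintext highlighter-rouge">--file</code>, and <code class="language-plaintext highlighter-rouge">ON_ERROR_STOP</code> are standard <code class="language-plaintext highlighter-rouge">psql</code> options):</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Echo each statement before its output, stop on the first error,
# and capture everything (including errors) in one file
psql --echo-all --file basic-example.sql \
  --set ON_ERROR_STOP=1 \
  &gt; basic-example.sql.out 2&gt;&amp;1
</code></pre></div></div>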

<p>Check out the resulting <code class="language-plaintext highlighter-rouge">.out</code> files in the <a href="https://github.com/pgr0ss/pgledger/tree/main/examples">pgledger examples</a>, or the scripting behind them in the <a href="https://github.com/pgr0ss/pgledger/blob/edc811b8d6b4a3deeea28810a13b968c90f4a323/justfile#L66-L80">justfile</a>.</p>]]></content><author><name>Paul Gross</name><email>pgross@gmail.com</email></author><summary type="html"><![CDATA[I have been working on a double-entry ledger implementation in PostgreSQL called pgledger, and I wanted to write some example scripts. pgledger is written in SQL and meant to be used from any language or platform where you can call SQL functions, so I wanted the examples to be pure SQL.]]></summary></entry><entry><title type="html">Double-Entry Ledgers: The Missing Primitive in Modern Software</title><link href="https://www.pgrs.net/2025/06/17/double-entry-ledgers-missing-primitive-in-modern-software/" rel="alternate" type="text/html" title="Double-Entry Ledgers: The Missing Primitive in Modern Software" /><published>2025-06-17T00:00:00-07:00</published><updated>2025-06-17T00:00:00-07:00</updated><id>https://www.pgrs.net/2025/06/17/double-entry-ledgers-missing-primitive-in-modern-software</id><content type="html" xml:base="https://www.pgrs.net/2025/06/17/double-entry-ledgers-missing-primitive-in-modern-software/"><![CDATA[<p>I think ledgers are underutilized in software development today. Specifically, <a href="https://en.wikipedia.org/wiki/Double-entry_bookkeeping">double-entry ledger</a> modeling would be a better fit in a lot of systems than the ad-hoc ledger-ish things they currently have.</p>

<p>This is why I’ve been working on <a href="https://github.com/pgr0ss/pgledger">pgledger</a>, a pure PostgreSQL ledger implementation. If adding a ledger implementation is super simple, then I’m hoping more folks will do it. And it can become another modeling primitive that we reach for to accomplish all sorts of things.</p>

<h2 id="what-is-a-ledger">What is a Ledger?</h2>

<p>A double-entry ledger at its core is a few simple concepts put together:</p>

<ul>
  <li>The current amount or balance of a thing</li>
  <li>A historical record of how the balance got to that amount (immutable, append-only, etc.)</li>
  <li>Where that amount came from at each step</li>
</ul>

<p>That’s it. So if Alice sends $100 to Bob, the ledger records Alice’s balance changing from $0 to $-100 (going to Bob) and Bob’s balance changing from $0 to $100 (coming from Alice). All of this is recorded at once, all amounts are accounted for, and all balances sum to $0.</p>

<p><strong>Note:</strong> Many ledgers model debits and credits rather than negative and positive numbers. In this case, Alice would have a debit of $100 and Bob would have a credit of $100. Personally, I find using negative and positive numbers simpler.<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup></p>

<p>The fact that every transfer only moves amounts, never creates them from scratch, is a built-in error check.<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> And the historical record serves as an audit log.</p>

<p>How the ledger is implemented and what is actually stored on disk varies with each ledger implementation, but the important point is that all of this information is recorded atomically.</p>
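<p>As a deliberately minimal sketch of the idea (not pgledger’s actual schema): a <code class="language-plaintext highlighter-rouge">transfers</code> table plus an <code class="language-plaintext highlighter-rouge">entries</code> table, where the function that records a transfer inserts everything in a single transaction and rejects entries that don’t sum to zero:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CREATE TABLE transfers (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE entries (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    transfer_id bigint NOT NULL REFERENCES transfers (id),
    account_id bigint NOT NULL,
    amount numeric NOT NULL
);

-- The zero-sum invariant spans multiple rows, so it is enforced by
-- the function that writes a transfer, not by a column constraint.
</code></pre></div></div>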

<p>Once you start thinking about tracking amount changes over time, you start seeing ledgers in more places. Let’s walk through some examples I’ve seen in real software.</p>

<h2 id="recording-payments">Recording Payments</h2>

<p>Say we are building an online business. Starting simple, we need to know when someone places an order, so maybe we start with an <code class="language-plaintext highlighter-rouge">orders</code> table. But then we realize that payments have a more complicated lifecycle (we don’t receive the payment right away when an order is created), so we want to know when we can actually start service or ship a product. We might add a <code class="language-plaintext highlighter-rouge">payments</code> table with a <code class="language-plaintext highlighter-rouge">status</code> column that represents values like <code class="language-plaintext highlighter-rouge">waiting_to_receive</code> or <code class="language-plaintext highlighter-rouge">complete</code>.</p>

<p>This is sort of like a single-entry ledger. We have a table of “transfers” from customers to our business. But things soon get more complicated. How do we record a refund? Is that a new <code class="language-plaintext highlighter-rouge">refunds</code> table? Or do we record a row in <code class="language-plaintext highlighter-rouge">payments</code> with a negative amount? What happens when our account balance isn’t what we expect? Are we missing payments? Or did we receive a different amount than we expected? How can we figure it out?</p>

<p>If we have a real double-entry ledger, we can record these interactions more explicitly:</p>

<p>When an order is created, we now have a <a href="https://www.investopedia.com/terms/a/accountsreceivable.asp">receivable</a>, where we are waiting on money. We can represent this as a transfer from the external user to a <code class="language-plaintext highlighter-rouge">receivables</code> account:</p>

<table>
  <thead>
    <tr>
      <th>Transfer ID</th>
      <th>Description</th>
      <th>┃</th>
      <th>user</th>
      <th>receivables</th>
      <th>available</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>order created</td>
      <td>┃</td>
      <td>-$10</td>
      <td>$10</td>
      <td> </td>
    </tr>
  </tbody>
</table>

<p>Note that each row in this representation is a transfer, and each column to the right of the vertical bar (┃) is an account. All of the row amounts sum to $0.</p>

<p>Then, when we actually receive the money in our account, we can move it from the <code class="language-plaintext highlighter-rouge">receivables</code> to our <code class="language-plaintext highlighter-rouge">available</code> balance:</p>

<table>
  <thead>
    <tr>
      <th>Transfer ID</th>
      <th>Description</th>
      <th>┃</th>
      <th>user</th>
      <th>receivables</th>
      <th>available</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>order created</td>
      <td>┃</td>
      <td>-$10</td>
      <td>$10</td>
      <td> </td>
    </tr>
    <tr>
      <td>2</td>
      <td>payment received</td>
      <td>┃</td>
      <td> </td>
      <td>-$10</td>
      <td>$10</td>
    </tr>
  </tbody>
</table>

<p>And now we can see where the built-in error checking comes into play. After receiving the payment, the receivables balance should be $0. If it isn’t, something went wrong, such as receiving less than we expected. With the original modeling, we’d have to build something custom to check the received amount against what we expected. We can also easily answer questions like “how much money are we waiting for?” without any extra logic (the balance of the <code class="language-plaintext highlighter-rouge">receivables</code> account).</p>
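<p>In pgledger terms, that check is a one-line query (assuming a <code class="language-plaintext highlighter-rouge">balance</code> column on the accounts view; the column names may differ from the real schema):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Should return 0.00 once the payment has been received
SELECT name, balance FROM pgledger_accounts_view WHERE name = 'receivables';
</code></pre></div></div>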

<p>Or if we don’t have the balance we expect, it’s easier to figure out why. We can look at the entries for an account and see every balance change over time and look for discrepancies. We can also look at other balances and see how they relate. Maybe our bank balance is $100 lower than we expect, but a different account is $100 more than we expect. In that case, we can look for missing or incorrect transfers between those two accounts.</p>

<p>Continuing the modeling, refunds would go the other way, often for a different amount (e.g. partial refund):</p>

<table>
  <thead>
    <tr>
      <th>Transfer ID</th>
      <th>Description</th>
      <th>┃</th>
      <th>user</th>
      <th>receivables</th>
      <th>available</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>order created</td>
      <td>┃</td>
      <td>-$10</td>
      <td>$10</td>
      <td> </td>
    </tr>
    <tr>
      <td>2</td>
      <td>payment received</td>
      <td>┃</td>
      <td> </td>
      <td>-$10</td>
      <td>$10</td>
    </tr>
    <tr>
      <td>3</td>
      <td>partial refund</td>
      <td>┃</td>
      <td>$5</td>
      <td> </td>
      <td>-$5</td>
    </tr>
  </tbody>
</table>

<p>Now we can see that the external user received $5 back, and the company’s available account now holds only $5. We have a unified view over both payments and refunds in the same tables. And we can see where the money went at every step.</p>

<p>This is obviously a simplified example, but another benefit of maintaining a ledger is the ability to add as many accounts as we want. For example, instead of maintaining a single <code class="language-plaintext highlighter-rouge">receivables</code> account, we can have a receivables account per user. Or we can have sub-accounts within the <code class="language-plaintext highlighter-rouge">available</code> account to manage pending funds, held funds, or more.<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup></p>

<h2 id="reward-points">Reward Points</h2>

<p>Tracking payments is perhaps an obvious example, so let’s consider a different case for moving around amounts: tracking user points. For example, a user can earn points by taking actions on our site, such as posting a message or referring a friend. Or maybe they earn points based on purchases, like airline miles.</p>

<p>If we were going to start super simple, maybe we’d just add a <code class="language-plaintext highlighter-rouge">points</code> column to the <code class="language-plaintext highlighter-rouge">users</code> table. Then, when someone earns or spends points, we just update the amount:</p>

<p><code class="language-plaintext highlighter-rouge">UPDATE users SET points = points + 100 WHERE id = 'u_123';</code></p>

<p>But then we learn we need to show someone a history of their point changes. So next, we introduce a <code class="language-plaintext highlighter-rouge">point_events</code> table to add an audit log of point changes. We write a new row whenever points are earned or spent. But already we can start to see the complexity growing. Now, we need to atomically write a row and update a balance at the same time, and we need to ensure that concurrent actions don’t conflict with each other.</p>
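<p>Even in this ad-hoc version, the two writes have to happen in one transaction with some form of locking, along these lines (a sketch with a hypothetical <code class="language-plaintext highlighter-rouge">point_events</code> shape):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BEGIN;
-- Lock the user row so concurrent point changes serialize
SELECT points FROM users WHERE id = 'u_123' FOR UPDATE;
UPDATE users SET points = points + 100 WHERE id = 'u_123';
INSERT INTO point_events (user_id, amount, reason)
    VALUES ('u_123', 100, 'referred a friend');
COMMIT;
</code></pre></div></div>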

<p>Over time, the requirements keep growing and getting more complicated:</p>

<ol>
  <li>Once points are spent or used, there will be a row in the <code class="language-plaintext highlighter-rouge">point_events</code> table with a negative amount. But where did the points go? Were they sent to another user? Were they spent? Did they expire? How do we track this? Do we add new columns to track this data?</li>
  <li>How do we model users sending points to another user? Presumably we record two rows to the <code class="language-plaintext highlighter-rouge">point_events</code> table and update two balances, but we have to ensure our code writes everything atomically and correctly.</li>
  <li>What links these two rows together? Do we need optional foreign keys on the <code class="language-plaintext highlighter-rouge">point_events</code> table, populating when it’s a transfer between users?</li>
</ol>

<p>As the features evolve, the requirements look more and more like a double-entry ledger, with the “currency” of each account set to “points”. Rather than build an ad-hoc bespoke data model that we need to keep expanding, we can use ledger modeling from the beginning which handles all of these cases.</p>

<p>Let’s start with a points account per user, with transfers coming from a single company account. In reality, you would probably use different company accounts for different purposes or types of points. We can also use a <code class="language-plaintext highlighter-rouge">spent</code> account to track when points are spent.</p>

<table>
  <thead>
    <tr>
      <th>Transfer ID</th>
      <th>description</th>
      <th>┃</th>
      <th>company</th>
      <th>user1</th>
      <th>user2</th>
      <th>spent</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>user1 earns 100 points</td>
      <td>┃</td>
      <td>-100</td>
      <td>100</td>
      <td> </td>
      <td> </td>
    </tr>
    <tr>
      <td>2</td>
      <td>user2 earns 200 points</td>
      <td>┃</td>
      <td>-200</td>
      <td> </td>
      <td>200</td>
      <td> </td>
    </tr>
    <tr>
      <td>3</td>
      <td>user2 spends 100 points</td>
      <td>┃</td>
      <td> </td>
      <td> </td>
      <td>-100</td>
      <td>100</td>
    </tr>
    <tr>
      <td>4</td>
      <td>user1 sends user2 50 points</td>
      <td>┃</td>
      <td> </td>
      <td>-50</td>
      <td>50</td>
      <td> </td>
    </tr>
  </tbody>
</table>

<p>At the end of this flow, it’s easy to see that user1 has 50 points, user2 has 150, the company has sent 300 points, and users have spent 100 of those.</p>
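<p>With a ledger, each row in the table above is a single function call. Using psql variables for the account IDs, the pgledger version would look roughly like this (the <code class="language-plaintext highlighter-rouge">pgledger_create_transfer</code> signature is a sketch; see the pgledger README for the real one):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT * FROM pgledger_create_transfer(:'company_id', :'user1_id', '100');
SELECT * FROM pgledger_create_transfer(:'company_id', :'user2_id', '200');
SELECT * FROM pgledger_create_transfer(:'user2_id', :'spent_id', '100');
SELECT * FROM pgledger_create_transfer(:'user1_id', :'user2_id', '50');
</code></pre></div></div>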

<p>The ledger also gives us simple auditability. If user2 wants to know why their balance is 150, we can show them the series of ledger entries that result in that balance (along with the counterparty, timestamp, etc).</p>

<p>Later, we can even model redeeming points for cash/gift cards as a currency conversion from “points” to “USD”, capturing the exchange rate in the ledger as well without any extra modeling.<sup id="fnref:4"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">4</a></sup></p>

<h2 id="more-use-cases">More Use Cases</h2>

<p>Another similar use case is modeling usage credits for an API, such as buying credits, spending them on various actions, and monitoring when they approach or reach zero. Credits would be another “currency” and the various things to track would each be accounts.</p>

<p>Taking it further, we can model things like content moderation actions per user (e.g. offenses, warnings, appeals, etc). Each user has accounts for the various actions, so we can count them over time, understand totals, compute reputation scores, etc.</p>

<p>A ledger could even represent an inventory management system, tracking quantities of items in various locations, movement between them, and their current states.</p>

<h2 id="summary">Summary</h2>

<p>The main idea here is that if an app already has ledger modeling built in, then many things can be built on top of it without a lot of extra work or complexity per use case. We don’t need to reinvent concepts and modeling and code each time. We just use the ledger with a new set of accounts and currencies. There’s an initial cost to introducing a ledger, but then that value is recouped over time.</p>

<p>And the ledger components can be encapsulated with clear seams and interfaces. The ledger implementation stands alone, and the business logic is how you structure the accounts and transfers.</p>

<p>How you add ledgers as a core component is up to you. You can use <a href="https://github.com/pgr0ss/pgledger">pgledger</a>, <a href="https://tigerbeetle.com/">TigerBeetle</a>, your own custom code, or something else entirely. And if you find more interesting use cases for ledgers, please let me know!</p>

<p><strong>Discussions:</strong></p>

<p>There are some good (and not so good) discussions about this post:</p>

<ul>
  <li><a href="https://lobste.rs/s/uqniaz/double_entry_ledgers_missing_primitive">https://lobste.rs/s/uqniaz/double_entry_ledgers_missing_primitive</a></li>
  <li><a href="https://news.ycombinator.com/item?id=44320050#44326891">https://news.ycombinator.com/item?id=44320050#44326891</a></li>
  <li><a href="https://www.reddit.com/r/programming/comments/1lduuw1/doubleentry_ledgers_the_missing_primitive_in/">https://www.reddit.com/r/programming/comments/1lduuw1/doubleentry_ledgers_the_missing_primitive_in/</a></li>
</ul>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p>I learned a lot about double-entry accounting from the Ledger CLI tool, which also uses negative and positive numbers: <a href="https://ledger-cli.org/doc/ledger3.html#Stating-where-money-goes">https://ledger-cli.org/doc/ledger3.html#Stating-where-money-goes</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2">
      <p>More specifically, the full <a href="https://en.wikipedia.org/wiki/Accounting_equation">accounting equation</a> is generally written as <code class="language-plaintext highlighter-rouge">Assets = Liabilities + Equity</code> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3">
      <p>For a longer discussion, see <a href="https://github.com/pgr0ss/pgledger/discussions/29">https://github.com/pgr0ss/pgledger/discussions/29</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4">
      <p>For examples on currency conversions, see <a href="https://github.com/pgr0ss/pgledger?tab=readme-ov-file#currencies">https://github.com/pgr0ss/pgledger?tab=readme-ov-file#currencies</a> or <a href="https://docs.tigerbeetle.com/coding/recipes/currency-exchange/">https://docs.tigerbeetle.com/coding/recipes/currency-exchange/</a> <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Paul Gross</name><email>pgross@gmail.com</email></author><summary type="html"><![CDATA[I think ledgers are underutilized in software development today. Specifically, double-entry ledger modeling would be a better fit in a lot of systems than the ad-hoc ledger-ish things they currently have.]]></summary></entry><entry><title type="html">Visualizing Financial Data with DuckDB And Plotly</title><link href="https://www.pgrs.net/2025/05/22/visualizing-financial-data-with-duckdb-and-plotly/" rel="alternate" type="text/html" title="Visualizing Financial Data with DuckDB And Plotly" /><published>2025-05-22T00:00:00-07:00</published><updated>2025-05-22T00:00:00-07:00</updated><id>https://www.pgrs.net/2025/05/22/visualizing-financial-data-with-duckdb-and-plotly</id><content type="html" xml:base="https://www.pgrs.net/2025/05/22/visualizing-financial-data-with-duckdb-and-plotly/"><![CDATA[<p>I like to keep a pretty close eye on my finances, such as my spending habits and net worth. Over the years, I’ve used a lot of different tools, such as YNAB, Mint and Quicken.</p>

<p>These days, I really like the spreadsheet-based tool <a href="https://www.awin1.com/cread.php?awinmid=18709&amp;awinaffid=1947491&amp;p=">Tiller</a> (note: affiliate link). With Tiller, all of my financial data lives in a spreadsheet that I control (in Google Sheets).</p>

<p>The visualizations that come with Tiller are incredible (both official and community-built), but a huge benefit is having all of my data available as a spreadsheet. I can export the data as CSVs and then run whatever tools I want.</p>

<p>I also recently learned about <a href="https://plotly.com/python/">Plotly</a>, a great graphing library for Python. Combine that with my love of <a href="https://duckdb.org/">DuckDB</a> for querying data<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>, and I have found new ways to visualize my financial data.</p>

<p>For example, below is a sunburst diagram visualizing expenses. An interactive version is available at <a href="/assets/plotly_expenses.html" target="_blank">plotly_expenses.html</a>.</p>

<p><img src="/assets/expenses_sunburst.png" alt="expenses sunburst" /></p>

<p>(For these examples, I’m using this <a href="https://docs.google.com/spreadsheets/d/1Lj5RWlaKDlIkU516SmzxcF105xCef3gbX4qhy_PH-Fo/edit#gid=1256593101">sample Tiller sheet</a> from this <a href="https://community.tiller.com/t/tiller-feeds-sample-data-for-google-builders/16960">community post</a>.)</p>

<p>In the sections below, I’ll walk through how I generated this diagram.</p>

<h2 id="querying-tiller-data-with-duckdb">Querying Tiller Data with DuckDB</h2>

<p>First, I exported the <code class="language-plaintext highlighter-rouge">Transactions</code> and <code class="language-plaintext highlighter-rouge">Categories</code> tabs from the Tiller Sheet. <code class="language-plaintext highlighter-rouge">Transactions</code> is a list of every ingested transaction with an assigned category. The <code class="language-plaintext highlighter-rouge">Categories</code> tab adds a hierarchy to the set of categories, such as <code class="language-plaintext highlighter-rouge">Groceries</code> and <code class="language-plaintext highlighter-rouge">Restaurants</code> belonging to the <code class="language-plaintext highlighter-rouge">Food</code> group of expenses.</p>

<p>Then, I used DuckDB to query these sheets in Python. For example, here’s a sum of expenses by group and category:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">expenses_by_category</span> <span class="o">=</span> <span class="n">duckdb</span><span class="p">.</span><span class="nf">sql</span><span class="p">(</span>
    <span class="sh">"""</span><span class="s">
    select
        c.Type,
        c.Group,
        t.Category,
        -round(sum(replace(replace(t.Amount, </span><span class="sh">'</span><span class="s">$</span><span class="sh">'</span><span class="s">, </span><span class="sh">''</span><span class="s">), </span><span class="sh">'</span><span class="s">,</span><span class="sh">'</span><span class="s">, </span><span class="sh">''</span><span class="s">)::decimal), 1) as Amount
    from read_csv(</span><span class="sh">'</span><span class="s">Tiller Sample Data - Transactions.csv</span><span class="sh">'</span><span class="s">) t
    join read_csv(</span><span class="sh">'</span><span class="s">Tiller Sample Data - Categories.csv</span><span class="sh">'</span><span class="s">) c on c.Category = t.Category
    and c.</span><span class="sh">"</span><span class="s">Hide From Reports</span><span class="sh">"</span><span class="s"> is null
    and c.Type = </span><span class="sh">'</span><span class="s">Expense</span><span class="sh">'</span><span class="s">
    group by t.Category, c.Group, c.Type
</span><span class="sh">"""</span>
<span class="p">)</span>

<span class="n">expenses_by_category</span><span class="p">.</span><span class="nf">show</span><span class="p">()</span>
</code></pre></div></div>

<p>Which prints:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────┬───────────────┬─────────────────────┬───────────────┐
│  Type   │     Group     │      Category       │    Amount     │
│ varchar │    varchar    │       varchar       │ decimal(38,1) │
├─────────┼───────────────┼─────────────────────┼───────────────┤
│ Expense │ Discretionary │ Clothes/Gear        │        8612.5 │
│ Expense │ Wellness      │ Guidance            │        8131.8 │
│ Expense │ Food          │ Snacks/Coffee       │        1796.6 │
│ Expense │ Auto          │ Camper              │        2392.8 │
│ Expense │ Discretionary │ Subscriptions       │        1540.0 │
│ Expense │ Food          │ Groceries           │       17483.5 │
│ Expense │ Living        │ Household           │        4231.0 │
│ Expense │ Health        │ Pharmacy            │        3437.9 │
│ Expense │ Discretionary │ Streaming           │        2303.0 │
│ Expense │ Wellness      │ Gym/Yoga            │        3171.9 │
│    ·    │  ·            │    ·                │           ·   │
│    ·    │  ·            │    ·                │           ·   │
│    ·    │  ·            │    ·                │           ·   │
│ Expense │ Food          │ Restaurants         │       10897.4 │
│ Expense │ Auto          │ Fees/Repairs/Maint. │        1549.1 │
│ Expense │ Discretionary │ Hobbies             │        3538.9 │
│ Expense │ Living        │ Cell Phone          │        3060.0 │
│ Expense │ Giving        │ Donations           │        3086.4 │
│ Expense │ Living        │ Utilities           │        3600.0 │
│ Expense │ Living        │ Rent                │       40727.9 │
│ Expense │ Discretionary │ Fun                 │        7076.9 │
│ Expense │ Living        │ Internet            │        3150.0 │
│ Expense │ Auto          │ Car Insurance       │       11682.0 │
├─────────┴───────────────┴─────────────────────┴───────────────┤
│ 26 rows (20 shown)                                  4 columns │
└───────────────────────────────────────────────────────────────┘
</code></pre></div></div>

<p>The only tricky part was cleaning up the Amount field so DuckDB treated it as a number. There may be a simpler way to do that.</p>
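<p>The cleanup is just string surgery before the cast: strip the dollar sign and the thousands separators, then parse the result as a decimal. Here is the same logic in plain Python (the <code class="language-plaintext highlighter-rouge">clean_amount</code> helper is illustrative, not part of the script above):</p>

```python
from decimal import Decimal

def clean_amount(raw: str) -> Decimal:
    # Mirrors replace(replace(Amount, '$', ''), ',', '')::decimal from the SQL
    return Decimal(raw.replace("$", "").replace(",", ""))

print(clean_amount("$17,483.50"))  # the Groceries total from the table above
```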

<h2 id="graphing-with-plotly">Graphing with Plotly</h2>

<p>Once I had the tabular data from DuckDB, graphing it with Plotly was simple. Plotly can generate an image file, but I think it really shines when generating standalone, interactive HTML:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">expenses_by_category_sunburst</span> <span class="o">=</span> <span class="n">px</span><span class="p">.</span><span class="nf">sunburst</span><span class="p">(</span>
    <span class="n">expenses_by_category</span><span class="p">,</span>
    <span class="n">path</span><span class="o">=</span><span class="p">[</span><span class="sh">"</span><span class="s">Type</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">Group</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">Category</span><span class="sh">"</span><span class="p">],</span>
    <span class="n">values</span><span class="o">=</span><span class="sh">"</span><span class="s">Amount</span><span class="sh">"</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">expenses_by_category_sunburst</span><span class="p">.</span><span class="nf">update_traces</span><span class="p">(</span><span class="n">textinfo</span><span class="o">=</span><span class="sh">"</span><span class="s">label+percent parent</span><span class="sh">"</span><span class="p">)</span>

<span class="n">expenses_by_category_sunburst</span><span class="p">.</span><span class="nf">write_image</span><span class="p">(</span><span class="sh">"</span><span class="s">expenses_sunburst.png</span><span class="sh">"</span><span class="p">)</span>

<span class="n">expenses_by_category_sunburst</span><span class="p">.</span><span class="nf">write_html</span><span class="p">(</span><span class="sh">"</span><span class="s">plotly_expenses.html</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="full-script">Full script</h2>

<p>I use <a href="https://docs.astral.sh/uv/">uv</a> these days for managing Python dependencies, which lets you embed the dependency requirements as a comment in the Python script. So you can save this as <code class="language-plaintext highlighter-rouge">tiller_plotly.py</code> and then run it with <code class="language-plaintext highlighter-rouge">uv run tiller_plotly.py</code>, which will automatically download the dependencies:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># /// script
# requires-python = "&gt;=3.12"
# dependencies = [
#     "duckdb~=1.2",
#     "kaleido~=0.2", # Needed to generate images
#     "plotly[express]~=6.0",
#     "pyarrow~=19.0",
# ]
# ///
</span>
<span class="kn">import</span> <span class="n">duckdb</span>
<span class="kn">import</span> <span class="n">plotly.express</span> <span class="k">as</span> <span class="n">px</span>

<span class="n">expenses_by_category</span> <span class="o">=</span> <span class="n">duckdb</span><span class="p">.</span><span class="nf">sql</span><span class="p">(</span>
    <span class="sh">"""</span><span class="s">
    select
        c.Type,
        c.Group,
        t.Category,
        -round(sum(replace(replace(t.Amount, </span><span class="sh">'</span><span class="s">$</span><span class="sh">'</span><span class="s">, </span><span class="sh">''</span><span class="s">), </span><span class="sh">'</span><span class="s">,</span><span class="sh">'</span><span class="s">, </span><span class="sh">''</span><span class="s">)::decimal), 1) as Amount
    from read_csv(</span><span class="sh">'</span><span class="s">Tiller Sample Data - Transactions.csv</span><span class="sh">'</span><span class="s">) t
    join read_csv(</span><span class="sh">'</span><span class="s">Tiller Sample Data - Categories.csv</span><span class="sh">'</span><span class="s">) c on c.Category = t.Category
    and c.</span><span class="sh">"</span><span class="s">Hide From Reports</span><span class="sh">"</span><span class="s"> is null
    and c.Type = </span><span class="sh">'</span><span class="s">Expense</span><span class="sh">'</span><span class="s">
    group by t.Category, c.Group, c.Type
</span><span class="sh">"""</span>
<span class="p">)</span>

<span class="n">expenses_by_category</span><span class="p">.</span><span class="nf">show</span><span class="p">()</span>

<span class="n">expenses_by_category_sunburst</span> <span class="o">=</span> <span class="n">px</span><span class="p">.</span><span class="nf">sunburst</span><span class="p">(</span>
    <span class="n">expenses_by_category</span><span class="p">,</span>
    <span class="n">path</span><span class="o">=</span><span class="p">[</span><span class="sh">"</span><span class="s">Type</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">Group</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">Category</span><span class="sh">"</span><span class="p">],</span>
    <span class="n">values</span><span class="o">=</span><span class="sh">"</span><span class="s">Amount</span><span class="sh">"</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">expenses_by_category_sunburst</span><span class="p">.</span><span class="nf">update_traces</span><span class="p">(</span><span class="n">textinfo</span><span class="o">=</span><span class="sh">"</span><span class="s">label+percent parent</span><span class="sh">"</span><span class="p">)</span>

<span class="n">expenses_by_category_sunburst</span><span class="p">.</span><span class="nf">write_image</span><span class="p">(</span><span class="sh">"</span><span class="s">expenses_sunburst.png</span><span class="sh">"</span><span class="p">)</span>

<span class="n">expenses_by_category_sunburst</span><span class="p">.</span><span class="nf">write_html</span><span class="p">(</span><span class="sh">"</span><span class="s">plotly_expenses.html</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<p>And if you want to try out Tiller, check it out here: <a href="https://www.awin1.com/cread.php?awinmid=18709&amp;awinaffid=1947491&amp;p=">Tiller</a> (affiliate link)</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p>Other DuckDB posts I’ve written: <a href="/2024/03/21/duckdb-as-the-new-jq/">DuckDB as the New jq</a> and <a href="/2024/11/01/duckdb-over-pandas-polars/">DuckDB over Pandas/Polars</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Paul Gross</name><email>pgross@gmail.com</email></author><summary type="html"><![CDATA[I like to keep a pretty close eye on my finances, such as my spending habits and net worth. Over the years, I’ve used a lot of different tools, such as YNAB, Mint and Quicken.]]></summary></entry><entry><title type="html">A Ledger In PostgreSQL Is Fast!</title><link href="https://www.pgrs.net/2025/05/16/pgledger-in-postgresql-is-fast/" rel="alternate" type="text/html" title="A Ledger In PostgreSQL Is Fast!" /><published>2025-05-16T00:00:00-07:00</published><updated>2025-05-16T00:00:00-07:00</updated><id>https://www.pgrs.net/2025/05/16/pgledger-in-postgresql-is-fast</id><content type="html" xml:base="https://www.pgrs.net/2025/05/16/pgledger-in-postgresql-is-fast/"><![CDATA[<p>I’ve been working on a ledger implementation in pure PostgreSQL called <a href="https://github.com/pgr0ss/pgledger">pgledger</a>. For the backstory, please read my previous blog post: <a href="https://pgrs.net/2025/03/24/pgledger-ledger-implementation-in-postgresql/">Ledger Implementation in PostgreSQL</a>.</p>

<p>Now that the project is a bit further along, I decided to gather some performance numbers. And it’s fast! Depending on the scenario, I can easily get over 10,000 ledger transfers per second on my laptop with a stock, unoptimized PostgreSQL. I would imagine a well-tuned production database would do a lot more.</p>

<p>Sure, it’s not <a href="https://tigerbeetle.com/">TigerBeetle</a> level performance, but still more than enough for most applications. And the simplicity of having the ledger in your main database is huge.</p>

<h2 id="scenarios-and-scripting">Scenarios and Scripting</h2>

<p>Performance is a notoriously hard thing to measure, since different usage patterns and different hardware can yield very different results.</p>

<p>I have been iterating on a script in the pgledger repository to help measure performance: <a href="https://github.com/pgr0ss/pgledger/blob/37ae752a5eaf2cec739e5da0907e61f012705e1c/go/performance_check.go">performance_check.go</a>. My thought was that others could also run this script in their environments if they want to gather more realistic numbers for their setup.</p>

<p>The script takes a few inputs, including:</p>

<ul>
  <li>The number of accounts (each transfer moves money from one of these accounts to another)</li>
  <li>The number of concurrent workers doing transfers</li>
  <li>The duration of time to run</li>
</ul>

<p>To simulate a scenario where there isn’t much account contention, we can ensure there are many more accounts than workers. That way, concurrent workers rarely try to transfer between the same accounts. Alternatively, if our system has only a handful of hot accounts, we can keep the number of accounts low to simulate workers waiting for locked accounts.</p>

<p>The script also measures the database size before and after, and then calculates the amount of disk space used per transfer. This should take into account the data in both tables and indexes.</p>

<h2 id="local-results">Local Results</h2>

<p>Here are some results from my laptop (M3 MacBook Air). I am using a vanilla, unoptimized PostgreSQL 17.5, set up with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>brew <span class="nb">install </span>postgresql@17
brew services start postgresql@17
</code></pre></div></div>

<p>First, the low contention scenario:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span> go run performance_check.go <span class="nt">--accounts</span><span class="o">=</span>50 <span class="nt">--workers</span><span class="o">=</span>20 <span class="nt">--duration</span><span class="o">=</span>30s

Completed transfers: 319105
Elapsed <span class="nb">time </span><span class="k">in </span>seconds: 30.0
Database size before: 1795 MB
Database size after:  2021 MB
Database size growth <span class="k">in </span>bytes: 237223936
Transfers/second: 10636.8
Milliseconds/transfer: 1.9
Bytes/transfer: 743
</code></pre></div></div>

<p>We can see here that we spent less than 2 milliseconds per transfer for an overall rate of 10,636.8 transfers per second. And each transfer added about 743 bytes to the database. Compared to many queries I’ve seen in financial application code, this is quite fast.</p>
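<p>As a sanity check, the derived metrics can be reproduced from the raw numbers, assuming (as these results suggest) that milliseconds/transfer is the per-worker latency rather than wall-clock time divided by transfers:</p>

```python
# Raw numbers from the low contention run above
transfers = 319_105
elapsed_seconds = 30.0
workers = 20
size_growth_bytes = 237_223_936

transfers_per_second = transfers / elapsed_seconds
# Each of the 20 workers issues transfers serially, so per-transfer
# latency is total worker-seconds divided by completed transfers
ms_per_transfer = elapsed_seconds * workers / transfers * 1000
bytes_per_transfer = size_growth_bytes // transfers

print(round(transfers_per_second, 1), round(ms_per_transfer, 1), bytes_per_transfer)
# 10636.8 1.9 743
```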

<p>And the account contention scenario, where workers may need to wait on other workers currently using the same accounts:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span> go run performance_check.go <span class="nt">--accounts</span><span class="o">=</span>10 <span class="nt">--workers</span><span class="o">=</span>20 <span class="nt">--duration</span><span class="o">=</span>30s

Completed transfers: 226767
Elapsed <span class="nb">time </span><span class="k">in </span>seconds: 30.0
Database size before: 2021 MB
Database size after:  2182 MB
Database size growth <span class="k">in </span>bytes: 168566784
Transfers/second: 7558.9
Milliseconds/transfer: 2.6
Bytes/transfer: 743
</code></pre></div></div>

<p>In this scenario, our throughput dropped to 7,558.9 transfers per second since transfers were waiting to lock hot accounts before proceeding. You can also see that the time per transfer increased to 2.6 milliseconds. Still quite fast, though.</p>

<h2 id="remote-results">Remote Results</h2>

<p>I also wanted to test the performance against a hosted database, so I set up an account on <a href="https://neon.tech/">Neon</a>. The free database has 2 vCPUs and 8 GB of RAM. Since the latency from my laptop to the cloud-hosted database is higher, I ran more workers (since each worker would spend more time waiting on network responses):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span> go run performance_check.go <span class="nt">--accounts</span><span class="o">=</span>100 <span class="nt">--workers</span><span class="o">=</span>60 <span class="nt">--duration</span><span class="o">=</span>30s

Completed transfers: 48937
Elapsed <span class="nb">time </span><span class="k">in </span>seconds: 30.0
Database size before: 157 MB
Database size after:  192 MB
Database size growth <span class="k">in </span>bytes: 36470784
Transfers/second: 1631.2
Milliseconds/transfer: 36.8
Bytes/transfer: 745
</code></pre></div></div>

<p>(Note that I also had to increase the pool size in the PostgreSQL client to handle the increased workers by appending <code class="language-plaintext highlighter-rouge">&amp;pool_max_conns=100</code> to the connection string.<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>)</p>

<p>These numbers aren’t as great, but it’s still over 1,000 transfers per second on the free tier of a database that is at least one US state away, which I think is pretty good. If someone wants to test this on their large, optimized production database, I would love to see the results.</p>

<h2 id="final-thoughts">Final Thoughts</h2>

<p>I’m happy with these numbers for now. I think this performance is high enough for most use cases, and it’s likely that any production system will have other bottlenecks in the database before hitting the pgledger limits.</p>

<p>I also haven’t done a lot of optimizations to the ledger. In fact, I’ve made some choices that hurt performance in the name of simplicity. If/when performance becomes a bigger concern, there are a few things that could be done to make pgledger even faster. The limiting factor right now seems to be disk write speed, so reducing the amount of data per transfer would help, for example:</p>

<ol>
  <li>Removing fields which are duplicated between <code class="language-plaintext highlighter-rouge">pgledger_transfers</code> and <code class="language-plaintext highlighter-rouge">pgledger_entries</code> such as the account IDs</li>
  <li>Using an ID format that takes up less space (I like <a href="https://pgrs.net/2023/01/10/ulid-identifiers-and-ulid-tools-website/">prefixed ULIDs</a>, but they could be stored as UUID types)</li>
  <li>Removing the <code class="language-plaintext highlighter-rouge">created_at</code> fields and relying on the timestamps within the IDs</li>
</ol>
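<p>On the last point, a ULID encodes a millisecond Unix timestamp in its first 10 characters using Crockford base32, so creation times are recoverable without a separate column. A quick illustrative decoder (not code from pgledger):</p>

```python
# Crockford base32 alphabet used by ULIDs (no I, L, O, or U)
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def ulid_timestamp_ms(ulid: str) -> int:
    """Decode the millisecond Unix timestamp from a ULID's first 10 characters."""
    ts = 0
    for ch in ulid[:10]:
        ts = ts * 32 + CROCKFORD.index(ch)
    return ts

# The all-zero ULID decodes to the Unix epoch
print(ulid_timestamp_ms("0" * 26))  # 0
```

<p>(Since pgledger uses prefixed ULIDs, the prefix would need to be stripped before decoding.)</p>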

<p>So if you need a ledger, please check out <a href="https://github.com/pgr0ss/pgledger">pgledger</a> and let me know what you think!</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p><a href="https://github.com/jackc/pgx/blob/777e7e5cdf2d349c37e1eef8eedc0e21857e9b95/pgxpool/pool.go#L141">https://github.com/jackc/pgx/blob/777e7e5cdf2d349c37e1eef8eedc0e21857e9b95/pgxpool/pool.go#L141</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Paul Gross</name><email>pgross@gmail.com</email></author><summary type="html"><![CDATA[I’ve been working on a ledger implementation in pure PostgreSQL called pgledger. For the backstory, please read my previous blog post: Ledger Implementation in PostgreSQL.]]></summary></entry><entry><title type="html">Ledger Implementation in PostgreSQL</title><link href="https://www.pgrs.net/2025/03/24/pgledger-ledger-implementation-in-postgresql/" rel="alternate" type="text/html" title="Ledger Implementation in PostgreSQL" /><published>2025-03-24T00:00:00-07:00</published><updated>2025-03-24T00:00:00-07:00</updated><id>https://www.pgrs.net/2025/03/24/pgledger-ledger-implementation-in-postgresql</id><content type="html" xml:base="https://www.pgrs.net/2025/03/24/pgledger-ledger-implementation-in-postgresql/"><![CDATA[<p><strong>First, the tl;dr</strong>: I am working on a financial ledger implemented entirely in <a href="https://www.postgresql.org/">PostgreSQL</a> called <a href="https://github.com/pgr0ss/pgledger">pgledger</a>.</p>

<p>Before I get to the why, here’s how it looks so far:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Set up your accounts:</span>
<span class="k">select</span> <span class="n">id</span> <span class="k">from</span> <span class="n">pgledger_create_account</span><span class="p">(</span><span class="s1">'account_1'</span><span class="p">);</span> <span class="c1">-- save this as account_1_id</span>
<span class="k">select</span> <span class="n">id</span> <span class="k">from</span> <span class="n">pgledger_create_account</span><span class="p">(</span><span class="s1">'account_2'</span><span class="p">);</span> <span class="c1">-- save this as account_2_id</span>

<span class="c1">-- Create transfers:</span>
<span class="k">select</span> <span class="o">*</span> <span class="k">from</span> <span class="n">pgledger_create_transfer</span><span class="p">(</span><span class="err">$</span><span class="n">account_1_id</span><span class="p">,</span> <span class="err">$</span><span class="n">account_2_id</span><span class="p">,</span> <span class="mi">12</span><span class="p">.</span><span class="mi">34</span><span class="p">);</span>
<span class="k">select</span> <span class="o">*</span> <span class="k">from</span> <span class="n">pgledger_create_transfer</span><span class="p">(</span><span class="err">$</span><span class="n">account_1_id</span><span class="p">,</span> <span class="err">$</span><span class="n">account_2_id</span><span class="p">,</span> <span class="mi">56</span><span class="p">.</span><span class="mi">78</span><span class="p">);</span>

<span class="c1">-- See updated balances:</span>
<span class="k">select</span> <span class="n">name</span><span class="p">,</span> <span class="n">balance</span><span class="p">,</span> <span class="k">version</span> <span class="k">from</span> <span class="n">pgledger_get_account</span><span class="p">(</span><span class="err">$</span><span class="n">account_2_id</span><span class="p">);</span>

   <span class="n">name</span>    <span class="o">|</span> <span class="n">balance</span> <span class="o">|</span> <span class="k">version</span>
<span class="c1">-----------+---------+---------</span>
 <span class="n">account_2</span> <span class="o">|</span>   <span class="mi">69</span><span class="p">.</span><span class="mi">12</span> <span class="o">|</span>       <span class="mi">2</span>

<span class="c1">-- See ledger entries:</span>
<span class="k">select</span> <span class="o">*</span> <span class="k">from</span> <span class="n">pgledger_entries</span> <span class="k">where</span> <span class="n">account_id</span> <span class="o">=</span> <span class="err">$</span><span class="n">account_2_id</span><span class="p">;</span>

  <span class="n">id</span>   <span class="o">|</span> <span class="n">account_id</span> <span class="o">|</span> <span class="n">transfer_id</span> <span class="o">|</span> <span class="n">amount</span> <span class="o">|</span> <span class="n">account_previous_balance</span> <span class="o">|</span> <span class="n">account_current_balance</span> <span class="o">|</span> <span class="n">account_version</span> <span class="o">|</span>          <span class="n">created_at</span>
<span class="c1">-------+------------+-------------+--------+--------------------------+-------------------------+-----------------+-------------------------------</span>
 <span class="mi">96198</span> <span class="o">|</span>         <span class="mi">42</span> <span class="o">|</span>       <span class="mi">48103</span> <span class="o">|</span>  <span class="mi">12</span><span class="p">.</span><span class="mi">34</span> <span class="o">|</span>                     <span class="mi">0</span><span class="p">.</span><span class="mi">00</span> <span class="o">|</span>                   <span class="mi">12</span><span class="p">.</span><span class="mi">34</span> <span class="o">|</span>               <span class="mi">1</span> <span class="o">|</span> <span class="mi">2025</span><span class="o">-</span><span class="mi">03</span><span class="o">-</span><span class="mi">19</span> <span class="mi">21</span><span class="p">:</span><span class="mi">31</span><span class="p">:</span><span class="mi">03</span><span class="p">.</span><span class="mi">596426</span><span class="o">+</span><span class="mi">00</span>
 <span class="mi">96200</span> <span class="o">|</span>         <span class="mi">42</span> <span class="o">|</span>       <span class="mi">48104</span> <span class="o">|</span>  <span class="mi">56</span><span class="p">.</span><span class="mi">78</span> <span class="o">|</span>                    <span class="mi">12</span><span class="p">.</span><span class="mi">34</span> <span class="o">|</span>                   <span class="mi">69</span><span class="p">.</span><span class="mi">12</span> <span class="o">|</span>               <span class="mi">2</span> <span class="o">|</span> <span class="mi">2025</span><span class="o">-</span><span class="mi">03</span><span class="o">-</span><span class="mi">19</span> <span class="mi">21</span><span class="p">:</span><span class="mi">31</span><span class="p">:</span><span class="mi">21</span><span class="p">.</span><span class="mi">615916</span><span class="o">+</span><span class="mi">00</span>
</code></pre></div></div>

<p>Each transfer subtracts from one account’s balance and adds to another account’s balance. It also writes two entries, one for each account, which record the previous and current balances, as well as the account versions. Each transfer to or from an account increments the version, giving a linear view of the changing balance over time.</p>

<p>This makes it easy to view the history to understand why an account balance is at its current value, or even query for a historical value at a given time.</p>
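<p>As a toy illustration of this bookkeeping, here is an in-memory Python sketch (not the actual pgledger implementation) that replays the transfers from the example above:</p>

```python
from decimal import Decimal

class Account:
    def __init__(self, name):
        self.name = name
        self.balance = Decimal("0.00")
        self.version = 0

entries = []  # (account, amount, previous_balance, current_balance, version)

def transfer(from_account, to_account, amount):
    amount = Decimal(amount)
    # Double entry: a negative entry on the source, a positive one on the destination
    for account, delta in ((from_account, -amount), (to_account, amount)):
        previous = account.balance
        account.balance += delta
        account.version += 1
        entries.append((account.name, delta, previous, account.balance, account.version))

a1, a2 = Account("account_1"), Account("account_2")
transfer(a1, a2, "12.34")
transfer(a1, a2, "56.78")

print(a2.balance, a2.version)  # 69.12 2
```

<p>The two entries written for each transfer always sum to zero, which is the invariant that keeps the history auditable.</p>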

<p>It’s all just functions and tables, so you have the full power of PostgreSQL. Start a transaction and execute as many transfers as you want grouped together. Or query the tables using whatever SQL you desire.</p>

<h2 id="why-ledgers">Why Ledgers</h2>

<p>I’ve worked in payments for a long time at many different companies, and one recurring theme is building in-house financial ledger software.</p>

<p>Ledgers are a fundamental building block of any software that deals with money. It’s incredibly important to know what money is where, how it got there, and what it’s for.</p>

<p>They serve current app needs as well as reporting and reconciliation. Everything from “do I have enough money in this account to do X?” to “why doesn’t an account have the balance I expect?” and “where is my money getting held up in my processes?”</p>

<p>A pattern I’ve noticed is that most companies tend to build their own internal ledger. There are many ways to implement a ledger, but at their core, the concepts are the same and the feature sets are pretty consistent. It doesn’t feel like something everyone should have to reimplement every time.</p>

<p>Furthermore, building a ledger properly from scratch is tricky, and there are lots of potential edge cases and race conditions. For example, concurrent transfers causing an account to go negative, or concurrent balance updates clobbering each other. Several times, I’ve seen folks start with a simple table to record payments/transfers/etc., and then realize over time that they actually did need a proper <a href="https://en.wikipedia.org/wiki/Double-entry_bookkeeping">double-entry ledger</a>.</p>
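
<p>To make the clobbering problem concrete: a read-then-write balance check is racy, because two concurrent transfers can both pass the check before either one writes. A common fix is to fold the check into the write itself. This sketch uses a hypothetical <code class="language-plaintext highlighter-rouge">accounts</code> table, not pgledger’s actual schema:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Racy: two transfers can both read a balance of 15, both decide
-- 10 is affordable, and the final balance ends up at -5.
--   SELECT balance FROM accounts WHERE id = 42;  -- app verifies balance &gt;= 10
--   UPDATE accounts SET balance = balance - 10 WHERE id = 42;

-- Atomic: the check and the decrement happen in one statement.
UPDATE accounts
SET balance = balance - 10
WHERE id = 42
  AND balance &gt;= 10;
-- If zero rows were updated, reject the transfer.
</code></pre></div></div>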

<h2 id="why-postgresql">Why PostgreSQL</h2>

<p>Today, if you don’t want to build your own ledger, you do have a few options. For example, there are hosted services like <a href="https://www.moderntreasury.com/products/ledgers">Modern Treasury</a> and ledger-specific databases like <a href="https://tigerbeetle.com/">TigerBeetle</a>. Both of these are impressive and probably a good fit for many.</p>

<p>But by using a ledger outside of the main application database, you lose transactionality and atomicity. Namely, you have to worry about orchestrating two systems that can fail independently. What happens if you write your main data, but the ledger update fails? Or what if the ledger operation succeeds, but your app hits an error and fails to write the surrounding data? Integrating with these systems often requires two-phase commits or other strategies to keep them in sync. And when they fall out of sync, it can be very hard to debug.</p>

<p>What I generally want is to include ledger updates in the same database transaction as the other work, and then have everything either commit or roll back atomically. In the past, I’ve done this with a bunch of application code. But that ties the ledger to a specific language, framework, etc. It’s not very portable to a new project, which is possibly why we don’t see a lot of open source libraries around ledgers.</p>

<p>So this time I’m trying something different: a ledger implementation entirely in PostgreSQL. That means as long as you use PostgreSQL (and in theory it could be ported to other databases), it is entirely transactional/atomic and language/framework/application agnostic. All you need to do from app code is execute the right SQL functions within the same database transactions in which you write everything else.</p>
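
<p>From the application’s perspective, that looks something like the following. The <code class="language-plaintext highlighter-rouge">payments</code> table and the function signature here are illustrative; see the pgledger README for the actual API:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BEGIN;

-- The application's own write...
INSERT INTO payments (customer_id, amount) VALUES (456, 12.34);

-- ...and the ledger transfer, in the same transaction
-- (from account, to account, amount).
SELECT * FROM pgledger_create_transfer(48103, 48104, '12.34');

COMMIT; -- both writes commit together, or neither does
</code></pre></div></div>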

<p>Building it in PostgreSQL also means you don’t have to integrate with any new APIs or run any new services. It’s in line with the idea of <a href="https://www.amazingcto.com/postgres-for-everything/">“Just Use Postgres for Everything”</a>, which is especially attractive to startups and small companies.</p>

<p>This is partly an experiment to see what a ledger implementation in pure PostgreSQL would look like, and partly something I hope to use on a future project.</p>

<h2 id="testing-in-go">Testing in Go</h2>

<p>If you look at the code today, you’ll notice that while the entire implementation is in SQL, the tests are in Go. I chose Go mainly because it’s the language I work with most these days, and it has good concurrency support. I have a couple of concurrent tests already that look for deadlocks and race conditions, and I hope to write more.</p>

<h2 id="feedback-please">Feedback Please</h2>

<p>Please check out <a href="https://github.com/pgr0ss/pgledger">pgledger</a>; I’d love to hear what you think, especially if you work with ledgers today. I have many more features I’d like to implement, so feel free to keep an eye on the project. And if this is something that you are interested in using, please let me know.</p>]]></content><author><name>Paul Gross</name><email>pgross@gmail.com</email></author><summary type="html"><![CDATA[First, the tl;dr: I am working on a financial ledger implemented entirely in PostgreSQL called pgledger.]]></summary></entry><entry><title type="html">Personal Notes Tooling</title><link href="https://www.pgrs.net/2025/01/29/personal-notes-tooling/" rel="alternate" type="text/html" title="Personal Notes Tooling" /><published>2025-01-29T00:00:00-08:00</published><updated>2025-01-29T00:00:00-08:00</updated><id>https://www.pgrs.net/2025/01/29/personal-notes-tooling</id><content type="html" xml:base="https://www.pgrs.net/2025/01/29/personal-notes-tooling/"><![CDATA[<p>I love keeping notes. Everything from meeting summaries to packing lists to books I’ve read and more. And these days, I tend to favor digital notes. I’m faster at typing than writing, and I love being able to search them. And with my phone, I don’t have to carry around a notebook.</p>

<p>I’ve gone through a few different systems/tools as I continue the search for my perfect setup. I figured I’d share what I am looking for and what I currently use.</p>

<p>To be explicit, this is for personal notes. When I’m writing shared notes, I use different tools. For example, at work I’ll often use a wiki or project management system, and with family and friends, something like Google Docs.</p>

<p>But for my personal notes, here is what I’m looking for in my ideal setup:</p>

<h3 id="good-mobile-support">Good mobile support</h3>

<p>I want to be able to rely on my phone when I’m not at my computer. It’s ok if the mobile experience is more limited, but I need the basics of adding and editing notes, and ideally searching.</p>

<h3 id="offline-support">Offline support</h3>

<p>I travel a lot, and I want to be able to add and edit notes when my phone or laptop is in airplane mode. Or when I have spotty cell/wifi coverage.</p>

<p>Ideally, I should be able to edit normally, and then when I’m back in service, the notes should sync in the background.</p>

<p>As an example of something I don’t like, the iPhone Dropbox app allows me to edit notes without service, but then when I try to save, it can hang and refuse to persist. I don’t want to fear losing updates when I’m offline.</p>

<h3 id="open-format">Open format</h3>

<p>I’ve gone through various tools in the past, and I don’t want to get locked into a tool that I can’t easily get out of. Concretely, I would prefer the files be plain text (e.g. markdown). Plain text files are universal, and I can use many different tools to edit or convert them. Here’s a good summary of this argument: <a href="https://sive.rs/plaintext">Write plain text files</a>.</p>

<p>If plain text isn’t an option, the next best thing would be a different open format, such as SQLite.</p>

<p>With open formats, even if the tool I’m using doesn’t support a good export, I can write my own. Or if the new tool wants a different format, I can easily convert the files.</p>

<p>If all else fails, I’m ok using a proprietary format as long as there’s good export functionality to an open format.</p>

<h3 id="vim-support">Vim support</h3>

<p>I’m a big vim fan (or rather <a href="https://neovim.io/">neovim</a> these days). When I’m at my computer, I would like to be able to type notes with vim. Vim mode on another tool can work, but I prefer real vim when possible.</p>

<h3 id="image-support">Image support</h3>

<p>This isn’t a strict must have for me, but it’s nice if I can embed images in docs (such as recipes).</p>

<h3 id="versioning">Versioning</h3>

<p>Another nice to have is some kind of versioning or point-in-time backup. Worst case, I can back up periodic exports, but ideally, the system would have a better solution for this.</p>

<h2 id="current-tooling">Current Tooling</h2>

<p>So with this list of requirements and desires, here’s the set of current tools I’m using:</p>

<p>The crux of my current setup is directories of plain text markdown files stored in iCloud Drive. iCloud Drive handles the syncing between my laptop and my phone.</p>

<p>I also made this directory a git repository, and I periodically commit and push changes to a hosted git repo. This gives me both versioning and backups, in case something goes wrong with iCloud Drive (e.g. it gets accidentally deleted or corrupted).</p>
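
<p>The periodic commit is a tiny script. Here is a minimal sketch (the notes path is illustrative; I run something like this by hand or from a scheduler):</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/sh
# Commit and push any changed notes; do nothing if the tree is clean.
backup_notes() {
  cd "${1:-$HOME/notes}" || return 1
  if [ -z "$(git status --porcelain)" ]; then
    return 0  # tree is clean; nothing to back up
  fi
  git add -A
  git commit -m "notes backup $(date +%Y-%m-%d)"
  git push
}
</code></pre></div></div>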

<p>On desktop, I add/edit these files mostly in neovim. I also occasionally use <a href="https://obsidian.md/">Obsidian</a>, especially for more complex things like adding images (it automatically copies in the image and makes the markdown links to it).</p>

<p>On my phone, I use the Obsidian mobile app. The main reason I chose iCloud Drive over another syncing method is because that’s what the Obsidian app supports outside of their proprietary sync: <a href="https://help.obsidian.md/getting-started/sync-your-notes-across-devices">https://help.obsidian.md/getting-started/sync-your-notes-across-devices</a>.</p>

<h3 id="downsides">Downsides</h3>

<p>While iCloud Drive syncing mostly works, it’s definitely not the smoothest. I frequently see sync issues, where changes on one device don’t show up on another for a long time. And it doesn’t handle conflicts well, just blowing away changes on one of the devices rather than trying to merge files, or asking me to resolve conflicts. Thankfully at least, iCloud finally supports the ability to keep a folder downloaded on a device: <a href="https://www.macrumors.com/2024/06/26/icloud-keep-downloaded-option-ios-18-macos-sequoia/">https://www.macrumors.com/2024/06/26/icloud-keep-downloaded-option-ios-18-macos-sequoia/</a>.</p>

<p>Even though I set the entire Obsidian directory in iCloud to <code class="language-plaintext highlighter-rouge">Keep Downloaded</code>, Obsidian will often pop up a message that it’s synchronizing my files and I need to wait. I’m afraid that if I skip it, I will somehow edit an old version and blow away newer changes. I don’t know if the issue is with Obsidian or with iCloud, but they are trying to push me towards their proprietary sync:</p>

<p><img src="/assets/obsidian_sync_message.png" alt="obsidian sync message" /></p>

<p>My current tooling is Mac only. This isn’t an issue for me today, as I have a Mac laptop and an iPhone. I do have Linux devices, but I don’t currently use them for notes. I still don’t love being tied to the Apple ecosystem, though, so I would prefer something that is cross platform.</p>

<h2 id="summary">Summary</h2>

<p>I’ve been using this setup for the last 6 months or so, and all in all, it’s working pretty well for me. The Obsidian mobile app is good, and it’s easy to edit and search. I do hope that iCloud syncing improves over time. But since it’s all just text files and I don’t use the advanced Obsidian features, I don’t feel too locked in, and I could switch to different apps easily.</p>

<p>If you have suggestions on other approaches or tools I should try, please let me know!</p>]]></content><author><name>Paul Gross</name><email>pgross@gmail.com</email></author><summary type="html"><![CDATA[I love keeping notes. Everything from meeting summaries to packing lists to books I’ve read and more. And these days, I tend to favor digital notes. I’m faster at typing than writing, and I love being able to search them. And with my phone, I don’t have to carry around a notebook.]]></summary></entry><entry><title type="html">DuckDB over Pandas/Polars</title><link href="https://www.pgrs.net/2024/11/01/duckdb-over-pandas-polars/" rel="alternate" type="text/html" title="DuckDB over Pandas/Polars" /><published>2024-11-01T00:00:00-07:00</published><updated>2024-11-01T00:00:00-07:00</updated><id>https://www.pgrs.net/2024/11/01/duckdb-over-pandas-polars</id><content type="html" xml:base="https://www.pgrs.net/2024/11/01/duckdb-over-pandas-polars/"><![CDATA[<p>Since my previous post on <a href="https://duckdb.org/">DuckDB</a> (<a href="/2024/03/21/duckdb-as-the-new-jq/">DuckDB as the New jq</a>), I’ve been continuing to use and enjoy DuckDB.</p>

<p>Recently, I wanted to analyze and visualize some financial CSVs, including joining a few files together. I started out with <a href="https://pola.rs/">Polars</a> (which I understood to be a newer/better <a href="https://pandas.pydata.org/">Pandas</a>). However, as someone who doesn’t use it frequently, I found the syntax confusing and cumbersome.</p>

<p>For example, here is how I parsed a <code class="language-plaintext highlighter-rouge">Transactions.csv</code> and summed entries by <code class="language-plaintext highlighter-rouge">Category</code> for rows in 2024 (simplified example, code formatted with <a href="https://github.com/psf/black">Black</a>):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">pl</span><span class="p">.</span><span class="nf">read_csv</span><span class="p">(</span><span class="sh">"</span><span class="s">Transactions.csv</span><span class="sh">"</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">df</span><span class="p">.</span><span class="nf">select</span><span class="p">(</span><span class="sh">"</span><span class="s">Date</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">Category</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">Amount</span><span class="sh">"</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">with_columns</span><span class="p">(</span>
        <span class="n">pl</span><span class="p">.</span><span class="nf">col</span><span class="p">(</span><span class="sh">"</span><span class="s">Date</span><span class="sh">"</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="nf">to_date</span><span class="p">(</span><span class="sh">"</span><span class="s">%m/%d/%Y</span><span class="sh">"</span><span class="p">),</span>
        <span class="n">pl</span><span class="p">.</span><span class="nf">col</span><span class="p">(</span><span class="sh">"</span><span class="s">Amount</span><span class="sh">"</span><span class="p">)</span>
        <span class="p">.</span><span class="nf">map_elements</span><span class="p">(</span><span class="k">lambda</span> <span class="n">amount</span><span class="p">:</span> <span class="n">amount</span><span class="p">.</span><span class="nf">replace</span><span class="p">(</span><span class="sh">"</span><span class="s">$</span><span class="sh">"</span><span class="p">,</span> <span class="sh">""</span><span class="p">))</span>
        <span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="nf">to_decimal</span><span class="p">(),</span>
    <span class="p">)</span>
    <span class="p">.</span><span class="nf">filter</span><span class="p">(</span><span class="n">pl</span><span class="p">.</span><span class="nf">col</span><span class="p">(</span><span class="sh">"</span><span class="s">Date</span><span class="sh">"</span><span class="p">)</span> <span class="o">&gt;</span> <span class="n">datetime</span><span class="p">.</span><span class="nf">date</span><span class="p">(</span><span class="mi">2024</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
    <span class="p">.</span><span class="nf">group_by</span><span class="p">(</span><span class="sh">"</span><span class="s">Category</span><span class="sh">"</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">agg</span><span class="p">(</span><span class="n">pl</span><span class="p">.</span><span class="nf">col</span><span class="p">(</span><span class="sh">"</span><span class="s">Amount</span><span class="sh">"</span><span class="p">).</span><span class="nf">sum</span><span class="p">())</span>
<span class="p">)</span>

<span class="nf">print</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div></div>

<p>Things that tripped me up:</p>

<ul>
  <li>The syntax for selecting and transforming columns</li>
  <li>Telling it how to parse the month/day/year column</li>
  <li>Writing a lambda to strip out the <code class="language-plaintext highlighter-rouge">$</code> (maybe there is a better way to do this?)</li>
  <li>The mix of <code class="language-plaintext highlighter-rouge">df.</code> and <code class="language-plaintext highlighter-rouge">pl.</code> calls, such as calling <code class="language-plaintext highlighter-rouge">df.group_by</code> but passing in <code class="language-plaintext highlighter-rouge">pl.col(...).sum(...)</code> as the argument to the aggregation</li>
</ul>

<p>I’m sure this is straightforward for someone who uses these tools frequently. However, that’s not me. I play around for a bit and then come back to it weeks or months later and have to relearn.</p>

<p>In contrast, I write SQL day in and day out, so I find it much easier. Once I switched to DuckDB, I could write much more familiar (to me) SQL, while still using python for the rest of the code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">results</span> <span class="o">=</span> <span class="n">duckdb</span><span class="p">.</span><span class="nf">sql</span><span class="p">(</span>
    <span class="sh">"""</span><span class="s">
    select
        Category,
        sum(replace(Amount, </span><span class="sh">'</span><span class="s">$</span><span class="sh">'</span><span class="s">, </span><span class="sh">''</span><span class="s">)::decimal) as Amount
    from read_csv(</span><span class="sh">'</span><span class="s">Transactions.csv</span><span class="sh">'</span><span class="s">)
    where Date &gt; </span><span class="sh">'</span><span class="s">2024-01-01</span><span class="sh">'</span><span class="s">
    group by Category
</span><span class="sh">"""</span>
<span class="p">)</span>
<span class="n">results</span><span class="p">.</span><span class="nf">show</span><span class="p">()</span>
</code></pre></div></div>

<p>Note that DuckDB automatically figured out how to parse the date column.</p>

<p>And I can even join multiple CSVs together with SQL and add more complex <code class="language-plaintext highlighter-rouge">WHERE</code> conditions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">results</span> <span class="o">=</span> <span class="n">duckdb</span><span class="p">.</span><span class="nf">sql</span><span class="p">(</span>
    <span class="sh">"""</span><span class="s">
    select
        c.Group,
        sum(replace(t.Amount, </span><span class="sh">'</span><span class="s">$</span><span class="sh">'</span><span class="s">, </span><span class="sh">''</span><span class="s">)::decimal) as Amount
    from read_csv(</span><span class="sh">'</span><span class="s">Transactions.csv</span><span class="sh">'</span><span class="s">) t
    join read_csv(</span><span class="sh">'</span><span class="s">Categories.csv</span><span class="sh">'</span><span class="s">) c on c.Category = t.Category
    where t.Date &gt; </span><span class="sh">'</span><span class="s">2024-01-01</span><span class="sh">'</span><span class="s">
    and c.Type in (</span><span class="sh">'</span><span class="s">Income</span><span class="sh">'</span><span class="s">, </span><span class="sh">'</span><span class="s">Expense</span><span class="sh">'</span><span class="s">)
    group by c.Group
</span><span class="sh">"""</span>
<span class="p">)</span>
<span class="n">results</span><span class="p">.</span><span class="nf">show</span><span class="p">()</span>
</code></pre></div></div>

<p>In summary, I find DuckDB powerful, easy, and fun to use.</p>

<p><strong>Update:</strong></p>

<p>A <a href="https://www.reddit.com/r/DuckDB/comments/1ghd6t4/comment/luxjjnq/">Reddit comment</a> showed me how to remove the <code class="language-plaintext highlighter-rouge">map_elements</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pl</span><span class="p">.</span><span class="nf">col</span><span class="p">(</span><span class="sh">"</span><span class="s">Amount</span><span class="sh">"</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="nf">replace</span><span class="p">(</span><span class="sh">"</span><span class="se">\\</span><span class="s">$</span><span class="sh">"</span><span class="p">,</span> <span class="sh">""</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="nf">to_decimal</span><span class="p">()</span>
</code></pre></div></div>

<p>But I think the double use of <code class="language-plaintext highlighter-rouge">.str</code> is a good example of how this is complex to me as a casual user.</p>

<p><strong>Update 2:</strong></p>

<p>Another <a href="https://www.reddit.com/r/DuckDB/comments/1ghd6t4/comment/lvmx8m0/">Reddit comment</a> showed how to do “a shorter (no intermediary steps) and more efficient (scan) version”:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">pl</span><span class="p">.</span><span class="nf">scan_csv</span><span class="p">(</span><span class="sh">"</span><span class="s">Transactions.csv</span><span class="sh">"</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">filter</span><span class="p">(</span><span class="n">pl</span><span class="p">.</span><span class="nf">col</span><span class="p">(</span><span class="sh">"</span><span class="s">Date</span><span class="sh">"</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="nf">to_date</span><span class="p">(</span><span class="sh">"</span><span class="s">%m/%d/%Y</span><span class="sh">"</span><span class="p">)</span> <span class="o">&gt;</span> <span class="n">datetime</span><span class="p">.</span><span class="nf">date</span><span class="p">(</span><span class="mi">2024</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
    <span class="p">.</span><span class="nf">group_by</span><span class="p">(</span><span class="sh">"</span><span class="s">Category</span><span class="sh">"</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">agg</span><span class="p">(</span><span class="n">pl</span><span class="p">.</span><span class="nf">col</span><span class="p">(</span><span class="sh">"</span><span class="s">Amount</span><span class="sh">"</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="nf">replace</span><span class="p">(</span><span class="sh">"</span><span class="se">\\</span><span class="s">$</span><span class="sh">"</span><span class="p">,</span> <span class="sh">""</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="nf">to_decimal</span><span class="p">().</span><span class="nf">sum</span><span class="p">())</span>
    <span class="p">.</span><span class="nf">collect</span><span class="p">()</span>
<span class="p">)</span>

<span class="nf">print</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Discussions:</strong></p>

<p>There are some good discussions about this post, especially around the increased composability of Polars/Pandas vs SQL and better ways to write the Polars code:</p>

<ul>
  <li><a href="https://lobste.rs/s/rlkltp/duckdb_over_pandas_polars">https://lobste.rs/s/rlkltp/duckdb_over_pandas_polars</a></li>
  <li><a href="https://www.reddit.com/r/DuckDB/comments/1ghd6t4/duckdb_over_pandaspolars/">https://www.reddit.com/r/DuckDB/comments/1ghd6t4/duckdb_over_pandaspolars/</a></li>
</ul>]]></content><author><name>Paul Gross</name><email>pgross@gmail.com</email></author><summary type="html"><![CDATA[Since my previous post on DuckDB (DuckDB as the New jq), I’ve been continuing to use and enjoy DuckDB.]]></summary></entry><entry><title type="html">The Many Ways To Read Tech News</title><link href="https://www.pgrs.net/2024/03/29/the-many-ways-to-read-tech-news/" rel="alternate" type="text/html" title="The Many Ways To Read Tech News" /><published>2024-03-29T00:00:00-07:00</published><updated>2024-03-29T00:00:00-07:00</updated><id>https://www.pgrs.net/2024/03/29/the-many-ways-to-read-tech-news</id><content type="html" xml:base="https://www.pgrs.net/2024/03/29/the-many-ways-to-read-tech-news/"><![CDATA[<p>When I am interested in reading tech news, I have a few sites I often visit, such as:</p>
<ul>
  <li><a href="https://news.ycombinator.com/">https://news.ycombinator.com</a> (aka Hacker News)</li>
  <li><a href="https://lobste.rs/">https://lobste.rs</a></li>
  <li><a href="https://www.reddit.com/r/programming">https://www.reddit.com/r/programming</a></li>
</ul>

<p>These sites and many others collect links to other original sources, such as blog posts, news articles, tech projects, etc. Until recently I wasn’t aware of just how many sites aggregate the same links.</p>

<p>I wrote a couple of blog posts recently that managed to get on the front page of Hacker News, Lobste.rs, etc. and saw just how many places linked to my post.</p>

<p>This is by no means scientific, but I tried to capture and group some of the popular aggregators, roughly sorted by how many referrals I received from each one. There are many, many more that seem to be single-person hobby projects. Hacker News in particular seems to attract a lot of folks building and hosting alternative frontends.</p>

<p>Maybe you’ll discover a new preferred way to read tech news.</p>

<h2 id="hacker-news-alternative-frontends">Hacker News Alternative Frontends</h2>

<p>Judging by these sites, there are a few reasons why people run alternative frontends to Hacker News. Most of these present the same information but with a different user interface. Some change the content by adjusting the rankings of posts, snapshotting different time periods (e.g. top links per day for each day), or adding summaries. I’ve commented on a few of them in the list:</p>

<ul>
  <li><a href="https://hckrnews.com/">https://hckrnews.com</a></li>
  <li><a href="https://hn.algolia.com/">https://hn.algolia.com</a> (Search, filtering and sorting)</li>
  <li><a href="https://www.daemonology.net/hn-daily/">https://www.daemonology.net/hn-daily</a> (Top posts per day)</li>
  <li><a href="https://hackerweb.app">https://hackerweb.app</a></li>
  <li><a href="https://hn.premii.com/">https://hn.premii.com</a></li>
  <li><a href="https://hn.svelte.dev/">https://hn.svelte.dev</a></li>
  <li><a href="https://hackernews.betacat.io/">https://hackernews.betacat.io</a> (ChatGPT summaries)</li>
  <li><a href="https://www.hackernewz.com/">https://www.hackernewz.com</a></li>
  <li><a href="https://news.social-protocols.org/">https://news.social-protocols.org</a></li>
  <li><a href="https://hnrss.github.io/">https://hnrss.github.io</a> (Custom, realtime RSS feeds)</li>
  <li><a href="https://hnrankings.info/">https://hnrankings.info</a> (Rank of posts over time)</li>
  <li><a href="https://vue-hn.netlify.app/">https://vue-hn.netlify.app</a></li>
  <li><a href="https://hn42.net/">https://hn42.net</a></li>
  <li><a href="https://hn.toonmaterial.com/">https://hn.toonmaterial.com</a></li>
  <li><a href="http://hnapp.com/">http://hnapp.com</a> (Search interface)</li>
  <li><a href="http://hn.elijames.org/">http://hn.elijames.org</a></li>
  <li><a href="https://yester-hn.riched.net/">https://yester-hn.riched.net</a></li>
  <li><a href="https://ycombinator.mytools.pw/">https://ycombinator.mytools.pw</a></li>
  <li><a href="https://yahni.news/">https://yahni.news</a></li>
  <li><a href="https://www.hntoplinks.com/">https://www.hntoplinks.com</a></li>
  <li><a href="https://www.hakaran.com/">https://www.hakaran.com</a></li>
  <li><a href="https://www.distilhn.com/">https://www.distilhn.com</a> (AI generated summaries)</li>
  <li><a href="https://whnex.com/">https://whnex.com</a></li>
  <li><a href="https://viralerts.com/">https://viralerts.com</a></li>
  <li><a href="https://slacker-news.fly.dev/">https://slacker-news.fly.dev</a></li>
  <li><a href="https://news.sune.one/">https://news.sune.one</a></li>
  <li><a href="https://modernorange.io/">https://modernorange.io</a></li>
  <li><a href="https://innerself-hn.com/">https://innerself-hn.com</a></li>
  <li><a href="https://hn.test.tube/">https://hn.test.tube</a> (Top posts by time period)</li>
  <li><a href="https://hackernewsday.com/">https://hackernewsday.com</a> (Top posts by day)</li>
</ul>

<h2 id="aggregate-multiple-sources">Aggregate Multiple Sources</h2>

<p>These are the sites which aggregate multiple sources into one interface. Most of these show separate lists for each source, but a few do combine them into unified lists.</p>

<ul>
  <li><a href="https://hackurls.com/">https://hackurls.com</a></li>
  <li><a href="https://brutalist.report/">https://brutalist.report</a></li>
  <li><a href="https://skimfeed.com/">https://skimfeed.com</a></li>
  <li><a href="https://devurls.com/">https://devurls.com</a></li>
  <li><a href="https://jimmyr.com/">https://jimmyr.com</a></li>
  <li><a href="https://progscrape.com/">https://progscrape.com</a> (Unified list)</li>
  <li><a href="https://serializer.io/">https://serializer.io</a> (Unified list)</li>
  <li><a href="https://www.freshnews.org/">https://www.freshnews.org</a></li>
  <li><a href="https://spike.news/">https://spike.news</a></li>
  <li><a href="https://techurls.com/">https://techurls.com</a></li>
  <li><a href="https://upstract.com/">https://upstract.com</a></li>
  <li><a href="https://nuuz.io/">https://nuuz.io</a></li>
  <li><a href="https://now.hackertab.dev/">https://now.hackertab.dev</a></li>
  <li><a href="https://webdigest.pages.dev/">https://webdigest.pages.dev</a></li>
  <li><a href="https://summary.nz/">https://summary.nz</a></li>
  <li><a href="https://readspike.com/">https://readspike.com</a></li>
  <li><a href="https://popuris.com/">https://popuris.com</a></li>
  <li><a href="https://old.thenews.im/">https://old.thenews.im</a></li>
  <li><a href="https://news.t0.vc/">https://news.t0.vc</a> (Unified list)</li>
  <li><a href="https://news.cyberwar.nl/">https://news.cyberwar.nl</a></li>
  <li><a href="https://linksfor.dev/">https://linksfor.dev</a></li>
</ul>

<h2 id="mailing-lists">Mailing Lists</h2>

<p>There are also a handful of popular mailing lists, likely getting their stories from the main aggregators. I doubt I have a full list here, however, since the referrer is generally an email client instead of the website.</p>

<ul>
  <li><a href="https://tldr.tech/">https://tldr.tech</a></li>
  <li><a href="https://hackernewsletter.com">https://hackernewsletter.com</a></li>
  <li><a href="https://daily.dev/">https://daily.dev</a></li>
  <li><a href="https://www.pointer.io/">https://www.pointer.io</a></li>
  <li><a href="https://www.bigdatanewsweekly.com/">https://www.bigdatanewsweekly.com</a></li>
  <li><a href="https://www.hndigest.com/">https://www.hndigest.com</a></li>
  <li><a href="https://architecturenotes.co/">https://architecturenotes.co</a></li>
</ul>

<p>I also learned about <a href="https://kill-the-newsletter.com/">https://kill-the-newsletter.com</a>, which converts newsletters into Atom feeds that can then be consumed via news readers.</p>

<h2 id="misc">Misc</h2>

<p>There are other popular tech podcasts, but <a href="https://changelog.com/">The Changelog</a> has really good show notes with links, so it was easy to see that folks had clicked through.</p>

<p>And of course there are other sets of user-submitted posts where I saw some traffic, such as:</p>
<ul>
  <li><a href="https://slashdot.org/">https://slashdot.org</a></li>
  <li><a href="https://alterslash.org/">https://alterslash.org</a> (Alternative Slashdot frontend)</li>
  <li><a href="https://www.sqox.com/">https://www.sqox.com</a></li>
  <li><a href="https://lemmy.world/">https://lemmy.world</a></li>
</ul>

<p>I also came across a bunch of sites in other languages, but I left those out because in most cases, it was hard for me to determine quite what they were doing.</p>]]></content><author><name>Paul Gross</name><email>pgross@gmail.com</email></author><summary type="html"><![CDATA[When I am interested in reading tech news, I have a few sites I often visit, such as: https://news.ycombinator.com (aka Hacker News) https://lobste.rs https://www.reddit.com/r/programming]]></summary></entry><entry><title type="html">DuckDB as the New jq</title><link href="https://www.pgrs.net/2024/03/21/duckdb-as-the-new-jq/" rel="alternate" type="text/html" title="DuckDB as the New jq" /><published>2024-03-21T00:00:00-07:00</published><updated>2024-03-21T00:00:00-07:00</updated><id>https://www.pgrs.net/2024/03/21/duckdb-as-the-new-jq</id><content type="html" xml:base="https://www.pgrs.net/2024/03/21/duckdb-as-the-new-jq/"><![CDATA[<p>Recently, I’ve been interested in the <a href="https://duckdb.org/">DuckDB</a> project (like a <a href="https://www.sqlite.org/">SQLite</a> geared towards data applications). And one of the amazing features is that it has many data importers included without requiring extra dependencies. This means it can natively read and parse JSON as a database table, among many other formats.</p>

<p>I work extensively with JSON day to day, and I often reach for <a href="https://jqlang.github.io/jq/">jq</a> when exploring documents. I love <code class="language-plaintext highlighter-rouge">jq</code>, but I find it hard to use. The syntax is super powerful, but I have to study the docs anytime I want to do anything beyond just selecting fields.</p>


<p>Once I learned DuckDB could read JSON files directly into memory, I realized that I could use it for many of the things where I’m currently using <code class="language-plaintext highlighter-rouge">jq</code>. In contrast to the complicated and custom <code class="language-plaintext highlighter-rouge">jq</code> syntax, I’m very familiar with SQL and use it almost daily.</p>

<p>Here’s an example:</p>

<p>First, we fetch some sample JSON to play around with. I used the GitHub API to grab the repository information from the golang org:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% curl <span class="s1">'https://api.github.com/orgs/golang/repos'</span> <span class="o">&gt;</span> repos.json
</code></pre></div></div>

<p>Now, as a sample question to answer, let’s get some stats on the types of open source licenses used.</p>

<p>The JSON structure looks like this:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="w">
  </span><span class="p">{</span><span class="w">
    </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="mi">1914329</span><span class="p">,</span><span class="w">
    </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gddo"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"license"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"key"</span><span class="p">:</span><span class="w"> </span><span class="s2">"bsd-3-clause"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"BSD 3-Clause </span><span class="se">\"</span><span class="s2">New</span><span class="se">\"</span><span class="s2"> or </span><span class="se">\"</span><span class="s2">Revised</span><span class="se">\"</span><span class="s2"> License"</span><span class="p">,</span><span class="w">
      </span><span class="err">...</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="err">...</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="p">{</span><span class="w">
    </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="mi">11440704</span><span class="p">,</span><span class="w">
    </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"glog"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"license"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"key"</span><span class="p">:</span><span class="w"> </span><span class="s2">"apache-2.0"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Apache License 2.0"</span><span class="p">,</span><span class="w">
      </span><span class="err">...</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="err">...</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="err">...</span><span class="w">
</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>

<p>This might not be the best way, but here is what I cobbled together after searching and reading some docs for how to do this in <code class="language-plaintext highlighter-rouge">jq</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> % <span class="nb">cat </span>repos.json | jq <span class="se">\</span>
  <span class="s1">'group_by(.license.key)
  | map({license: .[0].license.key, count: length})
  | sort_by(.count)
  | reverse'</span>
<span class="o">[</span>
  <span class="o">{</span>
    <span class="s2">"license"</span>: <span class="s2">"bsd-3-clause"</span>,
    <span class="s2">"count"</span>: 23
  <span class="o">}</span>,
  <span class="o">{</span>
    <span class="s2">"license"</span>: <span class="s2">"apache-2.0"</span>,
    <span class="s2">"count"</span>: 5
  <span class="o">}</span>,
  <span class="o">{</span>
    <span class="s2">"license"</span>: null,
    <span class="s2">"count"</span>: 2
  <span class="o">}</span>
<span class="o">]</span>
</code></pre></div></div>

<p>And here is what it looks like in DuckDB using SQL:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% duckdb <span class="nt">-c</span> <span class="se">\</span>
  <span class="s2">"select license-&gt;&gt;'key' as license, count(*) as count </span><span class="se">\</span><span class="s2">
  from 'repos.json' </span><span class="se">\</span><span class="s2">
  group by 1 </span><span class="se">\</span><span class="s2">
  order by count desc"</span>
┌──────────────┬───────┐
│   license    │ count │
│   varchar    │ int64 │
├──────────────┼───────┤
│ bsd-3-clause │    23 │
│ apache-2.0   │     5 │
│              │     2 │
└──────────────┴───────┘
</code></pre></div></div>

<p>For me, this SQL is much simpler and I was able to write it without looking at any docs. The only tricky part is querying nested JSON with the <code class="language-plaintext highlighter-rouge">-&gt;&gt;</code> operator. The syntax is the same as the <a href="https://www.postgresql.org/docs/current/functions-json.html">PostgreSQL JSON Functions</a>, however, so I was familiar with it.</p>

<p>And if we do need the output in JSON, there’s a DuckDB flag for that:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% duckdb <span class="nt">-json</span> <span class="nt">-c</span> <span class="se">\</span>
  <span class="s2">"select license-&gt;&gt;'key' as license, count(*) as count </span><span class="se">\</span><span class="s2">
  from 'repos.json' </span><span class="se">\</span><span class="s2">
  group by 1 </span><span class="se">\</span><span class="s2">
  order by count desc"</span>
<span class="o">[{</span><span class="s2">"license"</span>:<span class="s2">"bsd-3-clause"</span>,<span class="s2">"count"</span>:23<span class="o">}</span>,
<span class="o">{</span><span class="s2">"license"</span>:<span class="s2">"apache-2.0"</span>,<span class="s2">"count"</span>:5<span class="o">}</span>,
<span class="o">{</span><span class="s2">"license"</span>:null,<span class="s2">"count"</span>:2<span class="o">}]</span>
</code></pre></div></div>

<p>We can still pretty print with <code class="language-plaintext highlighter-rouge">jq</code> at the end, after using DuckDB to do the heavy lifting:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% duckdb <span class="nt">-json</span> <span class="nt">-c</span> <span class="se">\</span>
  <span class="s2">"select license-&gt;&gt;'key' as license, count(*) as count </span><span class="se">\</span><span class="s2">
  from 'repos.json' </span><span class="se">\</span><span class="s2">
  group by 1 </span><span class="se">\</span><span class="s2">
  order by count desc"</span> <span class="se">\</span>
  | jq
<span class="o">[</span>
  <span class="o">{</span>
    <span class="s2">"license"</span>: <span class="s2">"bsd-3-clause"</span>,
    <span class="s2">"count"</span>: 23
  <span class="o">}</span>,
  <span class="o">{</span>
    <span class="s2">"license"</span>: <span class="s2">"apache-2.0"</span>,
    <span class="s2">"count"</span>: 5
  <span class="o">}</span>,
  <span class="o">{</span>
    <span class="s2">"license"</span>: null,
    <span class="s2">"count"</span>: 2
  <span class="o">}</span>
<span class="o">]</span>
</code></pre></div></div>

<p>JSON is just one of the many ways of importing data into DuckDB. This same approach would work for CSV, Parquet, Excel files, etc.</p>

<p>And I could choose to create tables and persist locally, but often I’m just interrogating data and don’t need the persistence.</p>

<p>Read more about DuckDB’s great JSON support in this blog post: <a href="https://duckdb.org/2023/03/03/json.html">Shredding Deeply Nested JSON, One Vector at a Time</a></p>

<p><strong>Update:</strong></p>

<p>I also learned that DuckDB can read the JSON directly from a URL, not just a local file:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% duckdb <span class="nt">-c</span> <span class="se">\</span>
  <span class="s2">"select license-&gt;&gt;'key' as license, count(*) as count </span><span class="se">\</span><span class="s2">
  from read_json('https://api.github.com/orgs/golang/repos') </span><span class="se">\</span><span class="s2">
  group by 1 </span><span class="se">\</span><span class="s2">
  order by count desc"</span>
</code></pre></div></div>]]></content><author><name>Paul Gross</name><email>pgross@gmail.com</email></author><summary type="html"><![CDATA[Recently, I’ve been interested in the DuckDB project (like a SQLite geared towards data applications). And one of the amazing features is that it has many data importers included without requiring extra dependencies. This means it can natively read and parse JSON as a database table, among many other formats.]]></summary></entry><entry><title type="html">Lessons Learned From Payments Startups</title><link href="https://www.pgrs.net/2024/01/26/lessons-learned-from-payments-startups/" rel="alternate" type="text/html" title="Lessons Learned From Payments Startups" /><published>2024-01-26T00:00:00-08:00</published><updated>2024-01-26T00:00:00-08:00</updated><id>https://www.pgrs.net/2024/01/26/lessons-learned-from-payments-startups</id><content type="html" xml:base="https://www.pgrs.net/2024/01/26/lessons-learned-from-payments-startups/"><![CDATA[<p>Over my career so far, I’ve worked in a number of payments companies, including several startups. At the last startup, I was involved in building out a payments platform from scratch (from first line of code). This post is a collection of thoughts and lessons learned. Hopefully, at least some of this is useful to others.</p>

<p>The sections are relatively independent, so here are some quick links:</p>
<ul>
  <li><a href="#use-the-tools-you-have-before-adding-new-tools">Use The Tools You Have</a></li>
  <li><a href="#optimize-for-change">Optimize For Change</a></li>
  <li><a href="#focus-on-iteration">Focus on Iteration</a></li>
  <li><a href="#testing">Testing</a></li>
  <li><a href="#modular-monolith">Modular Monolith</a></li>
  <li><a href="#put-everything-in-the-database">Put Everything in the Database</a></li>
  <li><a href="#make-it-easy-to-query-the-database">Make It Easy To Query The Database</a></li>
  <li><a href="#job-drain-pattern">Job Drain Pattern</a></li>
  <li><a href="#check-in-generated-files">Check in Generated Files</a></li>
  <li><a href="#decision-logs">Decision Logs</a></li>
  <li><a href="#continuous-deployment">Continuous Deployment</a></li>
</ul>

<h2 id="use-the-tools-you-have-before-adding-new-tools">Use The Tools You Have (Before Adding New Tools)</h2>

<p>Every new tool, language, database, etc. adds an enormous amount of complexity. You have to set it up and manage it (even managed offerings still require work), integrate with it, and learn its ins and outs (often only after it has failed in some way), and you will still find out things you didn’t even know to think about.</p>

<p>So before I reach for something new, I try to use what we have, even if it’s not the optimal thing. For example, my projects have often used <a href="https://www.postgresql.org/">PostgreSQL</a> as the database. PostgreSQL is quite full featured, so I try to use it for as much as possible. This includes job queues, search, and even simple caching (e.g. a table that stores temporary values which get cleared out over time). It’s not necessarily the ideal platform for these, but it’s so much easier to just manage the one database than a whole suite of data systems. And at some point, the app will outgrow PostgreSQL’s capability for one or more of these, but even deferring that decision and work is hugely valuable.</p>
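<p>As an illustration of the “simple caching” idea — a table of temporary values with an expiry, cleared out over time. This sketch uses SQLite so it is self-contained (the same approach carries over to PostgreSQL), and the table and key names are invented:</p>

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE simple_cache (key TEXT PRIMARY KEY, value TEXT, expires_at REAL)")

def cache_put(key, value, ttl_seconds):
    # Upsert the value with an expiry timestamp.
    conn.execute(
        "INSERT INTO simple_cache (key, value, expires_at) VALUES (?, ?, ?) "
        "ON CONFLICT(key) DO UPDATE SET value = excluded.value, expires_at = excluded.expires_at",
        (key, value, time.time() + ttl_seconds),
    )

def cache_get(key):
    # Only return entries that have not expired yet.
    row = conn.execute(
        "SELECT value FROM simple_cache WHERE key = ? AND expires_at > ?",
        (key, time.time()),
    ).fetchone()
    return row[0] if row else None

def cache_sweep():
    # Run periodically to clear out expired entries.
    conn.execute("DELETE FROM simple_cache WHERE expires_at <= ?", (time.time(),))

cache_put("fx_rate:USD-EUR", "0.92", ttl_seconds=60)
rate = cache_get("fx_rate:USD-EUR")  # "0.92" until the entry expires
```

<p>If the app later outgrows this, only these three small functions have to move to a dedicated cache.</p>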

<p>The same goes for introducing new languages and frameworks. When possible, I like to use what we have and only introduce something new once we’ve pushed the existing stuff to the breaking point.</p>

<p>Another advantage is that over time, a lot of software becomes deprecated, but not removed. Some product or feature is no longer maintained, but since it’s in active use, it’s not fully shut down or deleted. It’s bad enough to leave deprecated code and services running, but it’s even worse if this means you now have extra databases or other platform systems that still have to be maintained, but don’t provide any current value. Even deprecated systems still need security upgrades, migrations to new servers, and more.</p>

<h2 id="optimize-for-change">Optimize For Change</h2>

<p>This is especially true of startups, but change is a part of any software project. Requirements change, our understanding of the problems changes, technology changes, and even the focus of a company can change. So it’s important to ensure that the software can change as well. Sometimes this is subjective (which architecture is the most amenable to change) and other times it’s concrete.</p>

<p>For example, I worked on one system which had both a customer-installed on-premise system and a cloud hosted system. The on-premise system was extremely hard to change as it required customers to do their own upgrades (often on their own schedules). In contrast, the cloud hosted system was fully under our control. So optimizing for change meant putting as much into the cloud hosted system as possible and keeping the on-premise portion thin. That way, we needed fewer changes to the hard-to-change parts, and we could roll out as many changes as we needed to the cloud piece on our own schedule.</p>

<p>Optimizing for change can also help with architecture discussions and decisions. When deciding between alternatives, picking the one that is easiest to change later can be helpful. It’s easier to try new things when the cost of undoing that change isn’t as high. If the new framework or tool doesn’t work out, you can switch back or switch to something else that’s new.</p>

<p>In my opinion, one of the best ways to optimize for change is to keep things as simple as possible. Sometimes, folks will over-engineer current systems to try to predict how they will evolve in the future and to try to future-proof them now. One example of this is making things generic when there is only one type today. I think this is a mistake. Our guesses for how things will change are often incorrect, and it’s easier to change a simple system than a complex one. It’s also easier to maintain a simpler system today than carry the over-engineered baggage around with us.</p>

<h2 id="focus-on-iteration">Focus on Iteration</h2>

<p>It’s super important to be able to break down work into small, deliverable pieces. I’ve seen too many projects go months without showing any value. Sometimes they do finally deliver, but other times, they will get canceled or significantly altered instead. It’s far better to release piecemeal, even if it’s not fully featured. Feature flags and other ways to partially roll out features are great here. It allows you to get production feedback from a subset of customers, or even just internal folks. And it allows visible progress throughout a long project.</p>
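<p>One common way to implement a partial rollout (not specific to any particular feature-flag product) is to deterministically bucket users, so a given user stays in or out of the rollout across requests. A small sketch — the function and flag names are made up:</p>

```python
import hashlib

def flag_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into 0-99 so the same user always
    gets the same answer for a given flag (no flapping between requests)."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < rollout_percent

# Roll a hypothetical "new-checkout" feature out to 10% of users.
enabled = flag_enabled("new-checkout", "user-42", rollout_percent=10)
```

<p>Ramping up is then just raising the percentage, and internal folks or specific beta customers can be force-enabled with an allowlist on top.</p>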

<p>I find that a lot of frustration over software estimates and delivery time frames go away if folks can see visible progress over time, rather than a nebulous future delivery date.</p>

<p>One thing I wish I had a better solution for was making the stability of features more obvious. For example, I want to ship quickly to get feedback, but then I want to still be able to change that feature or API. However, once customers start using something, they often implicitly assume that it won’t change.</p>

<p>It would be great to find a way to mark features or APIs as alpha, beta, stable, etc and set clear expectations and time frames for those features. For example, encouraging customers to try out an alpha API, but knowing that it will change and they will have to update their integration periodically. Personally, I haven’t seen this done super well yet.</p>

<h2 id="testing">Testing</h2>

<p>Testing code is super valuable, and there are many different approaches with different trade-offs. A lot can be said on this topic, but I’ll just mention one aspect that I’ve been thinking about a lot: balancing speed and quality of tests.</p>

<p>In general, having a lot of tests lets you make changes with confidence. If a large, thorough suite of tests pass, you can be reasonably sure you haven’t broken something. It can even let you upgrade core components with confidence, such as the application framework or language version.</p>

<p>However, the more tests you have, the longer they will take to run. What starts as a suite of just a few seconds can easily take minutes or longer if you aren’t careful. One way this is addressed is by trying to isolate tests from other systems, often with mocking. For example, testing the core of the business logic without the database, or testing the API without actually opening connections and making API calls, or mocking out responses from 3rd party systems.</p>

<p>But the trade-off here is that as you isolate tests to make them faster, you may also make them less realistic and less able to catch problems. The mock based tests are fast, but perhaps the mock doesn’t work the same way as the real component in certain edge cases.</p>

<p>Or you want to change something about the interaction between components, and now you have to update hundreds of cases where you set up mocks for testing.</p>

<p>I don’t have a great answer for this one. I try to isolate code from external dependencies when I can (e.g. by writing business logic as simple functions that take their data as input). And for the rest at the edges or when testing interactions, I just try to be thoughtful about the trade-offs we make for speed vs accuracy with testing. I also tend to prefer fakes over mocks, where you have a mostly stable stand-in that is used across many tests instead of setting up mock expectations per test.</p>
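<p>To make the fakes-vs-mocks distinction concrete, here is a minimal sketch: one small, working in-memory stand-in shared across many tests, rather than per-test mock expectations. All names here (the gateway, the tokens) are hypothetical:</p>

```python
class FakePaymentGateway:
    """In-memory stand-in that behaves like a real gateway for common cases."""

    def __init__(self):
        self.charges = []  # lets tests inspect what was charged

    def charge(self, amount_cents, card_token):
        # A hypothetical magic token triggers the decline path.
        if card_token == "tok_declined":
            return {"status": "declined"}
        charge = {"status": "approved", "amount_cents": amount_cents}
        self.charges.append(charge)
        return charge

def checkout(gateway, amount_cents, card_token):
    # The business logic under test depends only on the gateway's interface.
    result = gateway.charge(amount_cents, card_token)
    return result["status"] == "approved"

# Tests reuse the same fake without restating expectations each time:
gateway = FakePaymentGateway()
assert checkout(gateway, 500, "tok_visa")
assert not checkout(gateway, 500, "tok_declined")
assert gateway.charges == [{"status": "approved", "amount_cents": 500}]
```

<p>If the gateway interaction changes, the fake is updated in one place instead of in hundreds of mock setups.</p>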

<h2 id="modular-monolith">Modular Monolith</h2>

<p>A lot has been said on modular monoliths elsewhere, so I’ll just add that I really like this approach. It’s really hard to know what the eventual seams of a software system will be, and it’s hard for a small team to work on many separate services (including hosting, deployment, monitoring, upgrades, etc).</p>

<p>In the recent cases where we used a monolith, I think it worked out really well. It will always be work to pull a service out of the monolith eventually, but we can try to be thoughtful about the code separation within the monolith to help make it easier (and to crystallize our thinking on what is a separate domain area). And we’re deferring these decisions until later, so we can focus on building more quickly now (which is especially important in a startup).</p>

<h2 id="put-everything-in-the-database">Put Everything in the Database</h2>

<p>I’m a big fan of storing almost everything in the database. I find that it makes things so much easier to understand and debug if you can query all of the relevant data together. Often, I will prefer the database to logging, since you can’t easily correlate logs with stored data (e.g. <a href="/2023/12/14/storing-external-requests/">Storing External Requests</a>).</p>

<p>For example, in payment systems, payments often move through many different states. It’s really helpful to have entries in the database that represent what changed and when, even if only the final state is important. Then, when trying to debug why a payment is in a weird state, we can see all of the relevant data in all of the tables in one place (e.g. in an events or audits table).</p>

<p>Adding a unique request identifier makes it even more useful. Then, you can associate a failed API request with all of its database records.</p>
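<p>A sketch of what this can look like: an events table that records each state change along with a request identifier, so one query pulls up everything that happened under a given request. SQLite keeps the example self-contained, and the schema is invented for illustration:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (id INTEGER PRIMARY KEY, state TEXT);
    CREATE TABLE payment_events (
        id INTEGER PRIMARY KEY,
        payment_id INTEGER REFERENCES payments(id),
        old_state TEXT,
        new_state TEXT,
        request_id TEXT,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
""")

def transition(payment_id, new_state, request_id):
    # Record the change itself, not just the final state.
    old = conn.execute("SELECT state FROM payments WHERE id = ?", (payment_id,)).fetchone()
    conn.execute("UPDATE payments SET state = ? WHERE id = ?", (new_state, payment_id))
    conn.execute(
        "INSERT INTO payment_events (payment_id, old_state, new_state, request_id) "
        "VALUES (?, ?, ?, ?)",
        (payment_id, old[0] if old else None, new_state, request_id),
    )
    conn.commit()

conn.execute("INSERT INTO payments (id, state) VALUES (1, 'created')")
transition(1, "authorized", request_id="req_abc123")
transition(1, "captured", request_id="req_def456")

# Debugging: everything that happened under one request identifier.
rows = conn.execute(
    "SELECT old_state, new_state FROM payment_events WHERE request_id = ?",
    ("req_abc123",),
).fetchall()
# rows → [("created", "authorized")]
```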

<p>There are practical considerations, however, as data sizes really start to grow. One strategy I’ve used is to store some of this data with a shorter lifespan. For example, log style data may only be useful for a few weeks, so it can be deleted after that. Or exported to files and archived separately.</p>

<p>Another issue is with Personally Identifiable Information (PII). There are often legal and ethical requirements for this type of data, so it needs to be considered on a case by case basis. Sometimes, it can still be stored, but only for a short time. Other times, it should be scrubbed or excluded from the database.</p>

<h2 id="make-it-easy-to-query-the-database">Make It Easy To Query The Database</h2>

<p>Once you get everything into the database, I find it super helpful to give folks an easy way to query it. Recently, I used <a href="https://github.com/metabase/metabase">Metabase</a> and really enjoyed how it allowed easy, web based querying and graphing of our data. We set it up with a read-only connection to a read replica, so there was little concern of impacting production or accidentally changing data. We found that both developers and non-technical folks used it extensively.</p>

<p>For example, we made dashboards where you could enter an <code class="language-plaintext highlighter-rouge">orderId</code> and see all of the data from all of the tables that stored associated data. This was hugely valuable for debugging and for our support folks.</p>

<p>Again, there are considerations of who can see the data, and how much of it. But in general, giving folks the ability to answer their own data questions is super powerful, and it takes load off developers. And building shared dashboards and graphs so everyone can watch the same metrics was very powerful.</p>

<h2 id="job-drain-pattern">Job Drain Pattern</h2>

<p>Once a system outgrows a single database, data consistency issues start to pop up. Even introducing a background job system or a search tool can start to surface issues. For example, the write to the main database succeeded, but the process that copied the data to the search tool failed. Or the background job was queued before the main database transaction was committed.</p>

<p>There are various ways to solve this problem, and in particular, I like the job drain pattern, written up well at <a href="https://brandur.org/job-drain">Transactionally Staged Job Drains in Postgres</a>. I’ve used this pattern on several different projects successfully.</p>
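<p>The core of the pattern: stage the job row in the same transaction as the business write, then have a separate drainer move staged jobs into the real queue. A minimal sketch, with SQLite standing in for PostgreSQL and a plain list standing in for the real job system:</p>

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (id INTEGER PRIMARY KEY, amount_cents INTEGER);
    CREATE TABLE staged_jobs (id INTEGER PRIMARY KEY, kind TEXT, payload TEXT);
""")

real_queue = []  # stand-in for the actual job system

# 1. Stage the job in the SAME transaction as the business write, so the
#    staged job exists if and only if the payment row was committed.
with conn:
    cur = conn.execute("INSERT INTO payments (amount_cents) VALUES (?)", (500,))
    conn.execute(
        "INSERT INTO staged_jobs (kind, payload) VALUES (?, ?)",
        ("send_receipt", json.dumps({"payment_id": cur.lastrowid})),
    )

# 2. A separate drainer repeatedly moves staged jobs to the real queue,
#    deleting each row only after it has been enqueued.
def drain():
    with conn:
        jobs = conn.execute("SELECT id, kind, payload FROM staged_jobs ORDER BY id").fetchall()
        for job_id, kind, payload in jobs:
            real_queue.append((kind, json.loads(payload)))
            conn.execute("DELETE FROM staged_jobs WHERE id = ?", (job_id,))

drain()
# real_queue → [("send_receipt", {"payment_id": 1})]
```

<p>Because the staged row commits atomically with the payment, a crash loses either both or neither; deleting only after enqueueing gives at-least-once delivery to the queue.</p>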

<h2 id="check-in-generated-files">Check in Generated Files</h2>

<p>Similar to putting everything in the database is putting everything into git. For me, this includes generated files when possible. I know a lot of ecosystems prefer generating only at build time into temporary directories, but I really like having them in git. I find it really useful to be able to diff these files when making changes, such as upgrading the generation library or code. Otherwise, it can be hard to tell if anything meaningful has changed, or if more has changed than you expected.</p>

<p>When working with <a href="https://gradle.org">Gradle</a>, I also like to check in the generated <a href="https://docs.gradle.org/current/userguide/dependency_locking.html">lockfiles</a> that specify the exact version of every transitive dependency. Then, when <a href="https://docs.github.com/en/code-security/dependabot">Dependabot</a>/<a href="https://github.com/renovatebot/renovate">Renovate</a>/etc perform automated upgrades, it’s easy to see which transitive dependencies have also changed.</p>
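<p>For reference, enabling this in Gradle is a small build-script change — roughly the following (see the Gradle dependency locking docs linked above for the full details):</p>

```groovy
// In build.gradle: opt every configuration into dependency locking
dependencyLocking {
    lockAllConfigurations()
}
```

<p>Lock files are then written or refreshed with <code class="language-plaintext highlighter-rouge">./gradlew dependencies --write-locks</code>, and the generated <code class="language-plaintext highlighter-rouge">gradle.lockfile</code> gets checked into git.</p>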

<h2 id="decision-logs">Decision Logs</h2>

<p>I think in general, a lot of internal documentation is wasted effort. People spend countless hours writing up product plans or docs that are never looked at again.</p>

<p>However, I do think some documentation is often valuable. In particular, I like Decision Logs. The idea is that whenever the team needs to make a decision, that decision is captured in some light documentation. I think it serves two purposes:</p>

<ol>
  <li>
    <p>Writing up the options along with the advantages and disadvantages of each helps clarify thinking, and helps make better decisions. It shows what you’ve considered, and allows others to note gaps or misunderstandings. It’s also often helpful to clarify what you are <em>not</em> trying to address with the decision, i.e. what’s out of scope.</p>
  </li>
  <li>
    <p>Months or even years later, looking back at the Decision Log can be useful to understand why the system is designed a certain way. For example, someone new is hired and doesn’t understand why you chose Database X over Database Y. They can go read the entry. Or when someone proposes something new that’s already been considered, you can go back and see why it wasn’t chosen previously and if anything in the situation has changed (e.g. with the company or the capabilities of the tool). The Decision Log helps to remove “institutional knowledge” where only a handful of old-timers know the reasons for anything.</p>
  </li>
</ol>

<p>I do think that these Decision Logs (and other documentation) should be kept relatively light, however. Folks should not spend days writing them up.</p>

<h2 id="continuous-deployment">Continuous Deployment</h2>

<p>I’m a big fan of continuous deployment. This can look different on different projects, but ideally, every commit to the main branch will deploy to both test and production environments. I see a number of benefits:</p>

<ol>
  <li>It means the time between commit and production is small, so completed work gets into the hands of users quickly. You also don’t have to worry about when code will be released, and when other code that depends on it can also be released. You can merge a change, let it deploy, and then merge another change.</li>
  <li>It requires the deploys to be fully automated, which both makes them repeatable and also generally discoverable. Anyone can see what steps are run for every deployment and they are always the same (no hidden steps). Furthermore, if there’s ever a need for a manual deployment, someone can go look at the scripts and run the same commands.</li>
  <li>It removes an often time consuming developer chore. Now, deploys just happen and you don’t have to spend time coordinating or performing them.</li>
</ol>

<p>For beta features, or features that aren’t ready to be visible to everyone, I think feature flags work well. There are lots of libraries and products in this space, but it’s possible to start simple with what is built into GitLab: <a href="https://docs.gitlab.com/ee/operations/feature_flags.html">https://docs.gitlab.com/ee/operations/feature_flags.html</a></p>]]></content><author><name>Paul Gross</name><email>pgross@gmail.com</email></author><summary type="html"><![CDATA[Over my career so far, I’ve worked in a number of payments companies, including several startups. At the last startup, I was involved in building out a payments platform from scratch (from first line of code). This post is a collection of thoughts and lessons learned. Hopefully, at least some of this is useful to others.]]></summary></entry></feed>