<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://www.pgrs.net/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.pgrs.net/" rel="alternate" type="text/html" /><updated>2026-01-26T21:15:53-08:00</updated><id>https://www.pgrs.net/feed.xml</id><title type="html">Paul Gross’s Blog</title><subtitle>Principal software developer</subtitle><author><name>Paul Gross</name><email>pgross@gmail.com</email></author><entry><title type="html">PostgreSQL Scripting Tips</title><link href="https://www.pgrs.net/2026/01/06/postgresql-scripting-tips/" rel="alternate" type="text/html" title="PostgreSQL Scripting Tips" /><published>2026-01-06T00:00:00-08:00</published><updated>2026-01-06T00:00:00-08:00</updated><id>https://www.pgrs.net/2026/01/06/postgresql-scripting-tips</id><content type="html" xml:base="https://www.pgrs.net/2026/01/06/postgresql-scripting-tips/"><![CDATA[<p>I have been working on a double-entry ledger implementation in PostgreSQL called <a href="https://github.com/pgr0ss/pgledger">pgledger</a>, and I wanted to write some example scripts. <a href="https://github.com/pgr0ss/pgledger">pgledger</a> is written in SQL and meant to be used from any language or platform where you can call SQL functions, so I wanted the examples to be pure SQL.</p>

<p>Here are a few tips I discovered along the way. The sections are relatively independent:</p>
<ul>
  <li><a href="#storing-response-variables-with-gset">Storing Response Variables with <code class="language-plaintext highlighter-rouge">\gset</code></a></li>
  <li><a href="#transposing-data-with-crosstabview">Transposing Data with <code class="language-plaintext highlighter-rouge">\crosstabview</code></a></li>
  <li><a href="#showing-sql-statements-and-output-in-the-same-file">Showing SQL Statements and Output in the Same File</a></li>
</ul>

<h2 id="storing-response-variables-with-gset">Storing Response Variables with <code class="language-plaintext highlighter-rouge">\gset</code></h2>

<p>At first, I found it difficult to write SQL scripts given the random nature of <code class="language-plaintext highlighter-rouge">pgledger</code> function responses. Each function call would generate new random identifiers, so I couldn’t just write a static set of SQL statements. (See here for more about how I <a href="https://github.com/pgr0ss/pgledger#ids">generate random IDs for <code class="language-plaintext highlighter-rouge">pgledger</code></a>.)</p>

<p>This is when I discovered the <code class="language-plaintext highlighter-rouge">psql</code> command <code class="language-plaintext highlighter-rouge">\gset</code>. <code class="language-plaintext highlighter-rouge">\gset</code> executes a SQL statement and stores the result in a local variable. For example:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pgledger</span><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">id</span> <span class="k">FROM</span> <span class="n">pgledger_create_account</span><span class="p">(</span><span class="s1">'user1.external'</span><span class="p">,</span> <span class="s1">'USD'</span><span class="p">)</span> <span class="err">\</span><span class="n">gset</span>
</code></pre></div></div>

<p>And then <code class="language-plaintext highlighter-rouge">id</code> can be used in follow-up SQL statements as <code class="language-plaintext highlighter-rouge">:'id'</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pgledger=# select :'id';
            ?column?
---------------------------------
 pgla_01KE4VY0XGE7FTJY9FVYV2689N
(1 row)
</code></pre></div></div>

<p>Variables can be named inline or by setting a prefix as an argument to <code class="language-plaintext highlighter-rouge">\gset</code>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pgledger</span><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">id</span> <span class="k">AS</span> <span class="n">user1_external_id</span> <span class="k">FROM</span> <span class="n">pgledger_create_account</span><span class="p">(</span><span class="s1">'user1.external'</span><span class="p">,</span> <span class="s1">'USD'</span><span class="p">)</span> <span class="err">\</span><span class="n">gset</span>

<span class="n">pgledger</span><span class="o">=#</span> <span class="k">SELECT</span> <span class="n">id</span> <span class="k">FROM</span> <span class="n">pgledger_create_account</span><span class="p">(</span><span class="s1">'user1.receivables'</span><span class="p">,</span> <span class="s1">'USD'</span><span class="p">)</span> <span class="err">\</span><span class="n">gset</span> <span class="n">user1_receivables_</span>

<span class="n">pgledger</span><span class="o">=#</span> <span class="k">select</span> <span class="p">:</span><span class="s1">'user1_external_id'</span><span class="p">,</span> <span class="p">:</span><span class="s1">'user1_receivables_id'</span><span class="p">;</span>
            <span class="o">?</span><span class="k">column</span><span class="o">?</span>             <span class="o">|</span>            <span class="o">?</span><span class="k">column</span><span class="o">?</span>
<span class="c1">---------------------------------+---------------------------------</span>
 <span class="n">pgla_01KE4W21EXFWHASSZKMSGH6RT2</span> <span class="o">|</span> <span class="n">pgla_01KE4W25Y5FVF9W256JE7RYJZP</span>
<span class="p">(</span><span class="mi">1</span> <span class="k">row</span><span class="p">)</span>
</code></pre></div></div>
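<p>These stored variables make it easy to script multi-step flows. For example, a transfer between the two accounts created above might look something like this (the exact <code class="language-plaintext highlighter-rouge">pgledger_create_transfer</code> signature here is a sketch; check the pgledger README for the current one):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pgledger=# SELECT * FROM pgledger_create_transfer(:'user1_external_id', :'user1_receivables_id', '50.00');
</code></pre></div></div>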

<p>For more information, check out the PostgreSQL docs at
<a href="https://www.postgresql.org/docs/18/app-psql.html#APP-PSQL-META-COMMAND-GSET"><code class="language-plaintext highlighter-rouge">\gset</code></a> or see how I use it in the <a href="https://github.com/pgr0ss/pgledger/tree/main/examples">pgledger examples</a>.</p>

<h2 id="transposing-data-with-crosstabview">Transposing Data with <code class="language-plaintext highlighter-rouge">\crosstabview</code></h2>

<p>Once data is recorded in a ledger, one of the ways to visualize it is by tracing one flow across many accounts. For example, a single payment might first be recorded as a receivable (expecting money), then as available funds, and finally as partially or fully refunded. If the payment ID is recorded in metadata, a simple query in <a href="https://github.com/pgr0ss/pgledger">pgledger</a> might look like this:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
    <span class="n">e</span><span class="p">.</span><span class="n">transfer_id</span><span class="p">,</span>
    <span class="n">a</span><span class="p">.</span><span class="n">name</span><span class="p">,</span>
    <span class="n">e</span><span class="p">.</span><span class="n">amount</span>
<span class="k">FROM</span> <span class="n">pgledger_entries_view</span> <span class="n">e</span>
<span class="k">INNER</span> <span class="k">JOIN</span> <span class="n">pgledger_accounts_view</span> <span class="n">a</span> <span class="k">ON</span> <span class="n">e</span><span class="p">.</span><span class="n">account_id</span> <span class="o">=</span> <span class="n">a</span><span class="p">.</span><span class="n">id</span>
<span class="k">WHERE</span> <span class="n">e</span><span class="p">.</span><span class="n">metadata</span> <span class="o">-&gt;&gt;</span> <span class="s1">'payment_id'</span> <span class="o">=</span> <span class="s1">'p_123'</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">e</span><span class="p">.</span><span class="n">transfer_id</span><span class="p">;</span>

           <span class="n">transfer_id</span>           <span class="o">|</span>          <span class="n">name</span>          <span class="o">|</span> <span class="n">amount</span>
<span class="c1">---------------------------------+------------------------+--------</span>
 <span class="n">pglt_01KE4XK326F8PB40DKGMW4V7J3</span> <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="k">external</span>         <span class="o">|</span> <span class="o">-</span><span class="mi">50</span><span class="p">.</span><span class="mi">00</span>
 <span class="n">pglt_01KE4XK326F8PB40DKGMW4V7J3</span> <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="n">receivables</span>      <span class="o">|</span>  <span class="mi">50</span><span class="p">.</span><span class="mi">00</span>
 <span class="n">pglt_01KE4XK328F5EBTFS44PMV4NNF</span> <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="n">receivables</span>      <span class="o">|</span> <span class="o">-</span><span class="mi">49</span><span class="p">.</span><span class="mi">50</span>
 <span class="n">pglt_01KE4XK328F5EBTFS44PMV4NNF</span> <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="n">available</span>        <span class="o">|</span>  <span class="mi">49</span><span class="p">.</span><span class="mi">50</span>
 <span class="n">pglt_01KE4XK329ENCRYKCTYWSW7JMG</span> <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="n">available</span>        <span class="o">|</span> <span class="o">-</span><span class="mi">20</span><span class="p">.</span><span class="mi">00</span>
 <span class="n">pglt_01KE4XK329ENCRYKCTYWSW7JMG</span> <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="n">pending_outbound</span> <span class="o">|</span>  <span class="mi">20</span><span class="p">.</span><span class="mi">00</span>
 <span class="n">pglt_01KE4XK329FGK92HRWTEARYBVQ</span> <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="n">pending_outbound</span> <span class="o">|</span> <span class="o">-</span><span class="mi">20</span><span class="p">.</span><span class="mi">00</span>
 <span class="n">pglt_01KE4XK329FGK92HRWTEARYBVQ</span> <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="k">external</span>         <span class="o">|</span>  <span class="mi">20</span><span class="p">.</span><span class="mi">00</span>
<span class="p">(</span><span class="mi">8</span> <span class="k">rows</span><span class="p">)</span>
</code></pre></div></div>

<p>This shows all of the account movements, but it can be hard to visualize in this form as rows. Since each transfer in the ledger is made up of multiple entries (e.g. from and to), the query results show multiple rows for each transfer (as you can see by the duplicate <code class="language-plaintext highlighter-rouge">transfer_id</code> values).</p>

<p>With data like this, I find that transposing (or rotating) the output is often a better way to visualize it. Each account becomes a column, and the logical movements each become a single row affecting multiple columns.</p>

<p>In <code class="language-plaintext highlighter-rouge">psql</code>, this can be done with the <code class="language-plaintext highlighter-rouge">\crosstabview</code> command. This is as simple as adding <code class="language-plaintext highlighter-rouge">\crosstabview</code> to the end of the SQL statement:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
    <span class="n">e</span><span class="p">.</span><span class="n">transfer_id</span><span class="p">,</span>
    <span class="n">a</span><span class="p">.</span><span class="n">name</span><span class="p">,</span>
    <span class="n">e</span><span class="p">.</span><span class="n">amount</span>
<span class="k">FROM</span> <span class="n">pgledger_entries_view</span> <span class="n">e</span>
<span class="k">INNER</span> <span class="k">JOIN</span> <span class="n">pgledger_accounts_view</span> <span class="n">a</span> <span class="k">ON</span> <span class="n">e</span><span class="p">.</span><span class="n">account_id</span> <span class="o">=</span> <span class="n">a</span><span class="p">.</span><span class="n">id</span>
<span class="k">WHERE</span> <span class="n">e</span><span class="p">.</span><span class="n">metadata</span> <span class="o">-&gt;&gt;</span> <span class="s1">'payment_id'</span> <span class="o">=</span> <span class="s1">'p_123'</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">e</span><span class="p">.</span><span class="n">transfer_id</span> <span class="err">\</span><span class="n">crosstabview</span>

           <span class="n">transfer_id</span>           <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="k">external</span> <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="n">receivables</span> <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="n">available</span> <span class="o">|</span> <span class="n">user1</span><span class="p">.</span><span class="n">pending_outbound</span>
<span class="c1">---------------------------------+----------------+-------------------+-----------------+------------------------</span>
 <span class="n">pglt_01KE4XK326F8PB40DKGMW4V7J3</span> <span class="o">|</span>         <span class="o">-</span><span class="mi">50</span><span class="p">.</span><span class="mi">00</span> <span class="o">|</span>             <span class="mi">50</span><span class="p">.</span><span class="mi">00</span> <span class="o">|</span>                 <span class="o">|</span>
 <span class="n">pglt_01KE4XK328F5EBTFS44PMV4NNF</span> <span class="o">|</span>                <span class="o">|</span>            <span class="o">-</span><span class="mi">49</span><span class="p">.</span><span class="mi">50</span> <span class="o">|</span>           <span class="mi">49</span><span class="p">.</span><span class="mi">50</span> <span class="o">|</span>
 <span class="n">pglt_01KE4XK329ENCRYKCTYWSW7JMG</span> <span class="o">|</span>                <span class="o">|</span>                   <span class="o">|</span>          <span class="o">-</span><span class="mi">20</span><span class="p">.</span><span class="mi">00</span> <span class="o">|</span>                  <span class="mi">20</span><span class="p">.</span><span class="mi">00</span>
 <span class="n">pglt_01KE4XK329FGK92HRWTEARYBVQ</span> <span class="o">|</span>          <span class="mi">20</span><span class="p">.</span><span class="mi">00</span> <span class="o">|</span>                   <span class="o">|</span>                 <span class="o">|</span>                 <span class="o">-</span><span class="mi">20</span><span class="p">.</span><span class="mi">00</span>
<span class="p">(</span><span class="mi">4</span> <span class="k">rows</span><span class="p">)</span>
</code></pre></div></div>
<p>Now, each transfer is a single row, and each account is a column. Reading down, you can easily see which accounts were affected for each transfer.</p>
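<p><code class="language-plaintext highlighter-rouge">\crosstabview</code> also accepts explicit column arguments (<code class="language-plaintext highlighter-rouge">\crosstabview [ colV [ colH [ colD [ sortcolH ] ] ] ]</code>) when the default choice of vertical, horizontal, and data columns isn’t what you want. The query above could spell them out like this:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ORDER BY e.transfer_id \crosstabview transfer_id name amount
</code></pre></div></div>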

<p>For more information, check out the PostgreSQL docs at
<a href="https://www.postgresql.org/docs/18/app-psql.html#APP-PSQL-META-COMMANDS-CROSSTABVIEW"><code class="language-plaintext highlighter-rouge">\crosstabview</code></a> or see how I use it in the <a href="https://github.com/pgr0ss/pgledger/blob/edc811b8d6b4a3deeea28810a13b968c90f4a323/examples/reconciliation.sql.out">pgledger reconciliation examples</a>.</p>

<h2 id="showing-sql-statements-and-output-in-the-same-file">Showing SQL Statements and Output in the Same File</h2>

<p>The last trick I discovered was showing both SQL input and output in the same file. I didn’t want folks to read the examples without understanding what each SQL query returned.</p>

<p>I did this by writing each example in one file (e.g. <code class="language-plaintext highlighter-rouge">basic-example.sql</code>) and then running that file through <code class="language-plaintext highlighter-rouge">psql</code> with the <code class="language-plaintext highlighter-rouge">--echo-all</code> flag. This echoed both the input SQL and the output results, which I wrote to a new file (e.g. <code class="language-plaintext highlighter-rouge">basic-example.sql.out</code>). This produces a single file for review, which even includes the comments from the original file.</p>
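<p>A command along these lines produces the combined file (the exact invocation in my justfile may differ, but <code class="language-plaintext highlighter-rouge">--echo-all</code>, <code class="language-plaintext highlighter-rouge">--file</code>, and <code class="language-plaintext highlighter-rouge">ON_ERROR_STOP</code> are standard <code class="language-plaintext highlighter-rouge">psql</code> options):</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Echo each statement before its output, stop on the first error,
# and capture everything (including errors) in one file
psql --echo-all --file basic-example.sql \
  --set ON_ERROR_STOP=1 \
  &gt; basic-example.sql.out 2&gt;&amp;1
</code></pre></div></div>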

<p>Check out the resulting <code class="language-plaintext highlighter-rouge">.out</code> files in the <a href="https://github.com/pgr0ss/pgledger/tree/main/examples">pgledger examples</a>, or the scripting behind them in the <a href="https://github.com/pgr0ss/pgledger/blob/edc811b8d6b4a3deeea28810a13b968c90f4a323/justfile#L66-L80">justfile</a>.</p>]]></content><author><name>Paul Gross</name><email>pgross@gmail.com</email></author><summary type="html"><![CDATA[I have been working on a double-entry ledger implementation in PostgreSQL called pgledger, and I wanted to write some example scripts. pgledger is written in SQL and meant to be used from any language or platform where you can call SQL functions, so I wanted the examples to be pure SQL.]]></summary></entry><entry><title type="html">Double-Entry Ledgers: The Missing Primitive in Modern Software</title><link href="https://www.pgrs.net/2025/06/17/double-entry-ledgers-missing-primitive-in-modern-software/" rel="alternate" type="text/html" title="Double-Entry Ledgers: The Missing Primitive in Modern Software" /><published>2025-06-17T00:00:00-07:00</published><updated>2025-06-17T00:00:00-07:00</updated><id>https://www.pgrs.net/2025/06/17/double-entry-ledgers-missing-primitive-in-modern-software</id><content type="html" xml:base="https://www.pgrs.net/2025/06/17/double-entry-ledgers-missing-primitive-in-modern-software/"><![CDATA[<p>I think ledgers are underutilized in software development today. Specifically, <a href="https://en.wikipedia.org/wiki/Double-entry_bookkeeping">double-entry ledger</a> modeling would be a better fit in a lot of systems than the ad-hoc ledger-ish things they currently have.</p>

<p>This is why I’ve been working on <a href="https://github.com/pgr0ss/pgledger">pgledger</a>, a pure PostgreSQL ledger implementation. If adding a ledger implementation is super simple, then I’m hoping more folks will do it. And it can become another modeling primitive that we reach for to accomplish all sorts of things.</p>

<h2 id="what-is-a-ledger">What is a Ledger?</h2>

<p>A double-entry ledger at its core is a few simple concepts put together:</p>

<ul>
  <li>The current amount or balance of a thing</li>
  <li>A historical record of how the balance got to that amount (immutable, append-only, etc.)</li>
  <li>Where that amount came from at each step</li>
</ul>

<p>That’s it. So if Alice sends $100 to Bob, the ledger records Alice’s balance changing from $0 to $-100 (going to Bob) and Bob’s balance changing from $0 to $100 (coming from Alice). All of this is recorded at once, all amounts are accounted for, and all balances sum to $0.</p>

<p><strong>Note:</strong> Many ledgers model debits and credits rather than negative and positive numbers. In this case, Alice would have a debit of $100 and Bob would have a credit of $100. Personally, I find using negative and positive numbers simpler.<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup></p>

<p>The fact that every transfer only moves amounts, never creates them from scratch, is a built-in error check.<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> And the historical record serves as an audit log.</p>

<p>How the ledger is implemented and what is actually stored on disk varies with each ledger implementation, but the important point is that all of this information is recorded atomically.</p>
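<p>As a deliberately minimal sketch of the idea (not pgledger’s actual schema): a <code class="language-plaintext highlighter-rouge">transfers</code> table plus an <code class="language-plaintext highlighter-rouge">entries</code> table, where the function that records a transfer inserts everything in a single transaction and rejects entries that don’t sum to zero:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CREATE TABLE transfers (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE entries (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    transfer_id bigint NOT NULL REFERENCES transfers (id),
    account_id bigint NOT NULL,
    amount numeric NOT NULL
);

-- The zero-sum invariant spans multiple rows, so it is enforced by
-- the function that writes a transfer, not by a column constraint.
</code></pre></div></div>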

<p>Once you start thinking about tracking amount changes over time, you start seeing ledgers in more places. Let’s walk through some examples I’ve seen in real software.</p>

<h2 id="recording-payments">Recording Payments</h2>

<p>Say we are building an online business. Starting simple, we need to know when someone places an order, so maybe we start with an <code class="language-plaintext highlighter-rouge">orders</code> table. But then we realize that payments have a more complicated lifecycle (we don’t receive the payment right away when an order is created), so we want to know when we can actually start service or ship a product. We might add a <code class="language-plaintext highlighter-rouge">payments</code> table with a <code class="language-plaintext highlighter-rouge">status</code> column that represents values like <code class="language-plaintext highlighter-rouge">waiting_to_receive</code> or <code class="language-plaintext highlighter-rouge">complete</code>.</p>

<p>This is sort of like a single-entry ledger. We have a table of “transfers” from customers to our business. But things soon get more complicated. How do we record a refund? Is that a new <code class="language-plaintext highlighter-rouge">refunds</code> table? Or do we record a row in <code class="language-plaintext highlighter-rouge">payments</code> with a negative amount? What happens when our account balance isn’t what we expect? Are we missing payments? Or did we receive a different amount than we expected? How can we figure it out?</p>

<p>If we have a real double-entry ledger, we can record these interactions more explicitly:</p>

<p>When an order is created, we now have a <a href="https://www.investopedia.com/terms/a/accountsreceivable.asp">receivable</a>, where we are waiting on money. We can represent this as a transfer from the external user to a <code class="language-plaintext highlighter-rouge">receivables</code> account:</p>

<table>
  <thead>
    <tr>
      <th>Transfer ID</th>
      <th>Description</th>
      <th>┃</th>
      <th>user</th>
      <th>receivables</th>
      <th>available</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>order created</td>
      <td>┃</td>
      <td>-$10</td>
      <td>$10</td>
      <td> </td>
    </tr>
  </tbody>
</table>

<p>Note that each row in this representation is a transfer, and each column to the right of the vertical bar (┃) is an account. All of the row amounts sum to $0.</p>

<p>Then, when we actually receive the money in our account, we can move it from the <code class="language-plaintext highlighter-rouge">receivables</code> to our <code class="language-plaintext highlighter-rouge">available</code> balance:</p>

<table>
  <thead>
    <tr>
      <th>Transfer ID</th>
      <th>Description</th>
      <th>┃</th>
      <th>user</th>
      <th>receivables</th>
      <th>available</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>order created</td>
      <td>┃</td>
      <td>-$10</td>
      <td>$10</td>
      <td> </td>
    </tr>
    <tr>
      <td>2</td>
      <td>payment received</td>
      <td>┃</td>
      <td> </td>
      <td>-$10</td>
      <td>$10</td>
    </tr>
  </tbody>
</table>

<p>And now we can see where the built-in error checking comes into play. After receiving the payment, the receivables balance should be $0. If it isn’t, something went wrong, such as receiving less than we expected. With the original modeling, we’d have to build something custom to check the received amount against what we expected. We can also easily answer questions like “how much money are we waiting for?” without any extra logic (the balance of the <code class="language-plaintext highlighter-rouge">receivables</code> account).</p>
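<p>In pgledger terms, that check is a one-line query (assuming a <code class="language-plaintext highlighter-rouge">balance</code> column on the accounts view; the column names may differ from the real schema):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Should return 0.00 once the payment has been received
SELECT name, balance FROM pgledger_accounts_view WHERE name = 'receivables';
</code></pre></div></div>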

<p>Or if we don’t have the balance we expect, it’s easier to figure out why. We can look at the entries for an account and see every balance change over time and look for discrepancies. We can also look at other balances and see how they relate. Maybe our bank balance is $100 lower than we expect, but a different account is $100 more than we expect. In that case, we can look for missing or incorrect transfers between those two accounts.</p>

<p>Continuing the modeling, refunds would go the other way, often for a different amount (e.g. partial refund):</p>

<table>
  <thead>
    <tr>
      <th>Transfer ID</th>
      <th>Description</th>
      <th>┃</th>
      <th>user</th>
      <th>receivables</th>
      <th>available</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>order created</td>
      <td>┃</td>
      <td>-$10</td>
      <td>$10</td>
      <td> </td>
    </tr>
    <tr>
      <td>2</td>
      <td>payment received</td>
      <td>┃</td>
      <td> </td>
      <td>-$10</td>
      <td>$10</td>
    </tr>
    <tr>
      <td>3</td>
      <td>partial refund</td>
      <td>┃</td>
      <td>$5</td>
      <td> </td>
      <td>-$5</td>
    </tr>
  </tbody>
</table>

<p>Now we can see that the external user received $5 back, and the company’s available account now holds only $5. We have a unified view over both payments and refunds in the same tables. And we can see where the money went at every step.</p>

<p>This is obviously a simplified example, but another benefit of maintaining a ledger is the ability to add as many accounts as we want. For example, instead of maintaining a single <code class="language-plaintext highlighter-rouge">receivables</code> account, we can have a receivables account per user. Or we can have sub-accounts within the <code class="language-plaintext highlighter-rouge">available</code> account to manage pending funds, held funds, or more.<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup></p>

<h2 id="reward-points">Reward Points</h2>

<p>Tracking payments is perhaps an obvious example, so let’s consider a different case for moving around amounts: tracking user points. For example, a user can earn points by taking actions on our site, such as posting a message or referring a friend. Or maybe they earn points based on purchases, like airline miles.</p>

<p>If we were going to start super simple, maybe we’d just add a <code class="language-plaintext highlighter-rouge">points</code> column to the <code class="language-plaintext highlighter-rouge">users</code> table. Then, when someone earns or spends points, we just update the amount:</p>

<p><code class="language-plaintext highlighter-rouge">UPDATE users SET points = points + 100 WHERE id = 'u_123';</code></p>

<p>But then we learn we need to show someone a history of their point changes. So next, we introduce a <code class="language-plaintext highlighter-rouge">point_events</code> table to add an audit log of point changes. We write a new row whenever points are earned or spent. But already we can start to see the complexity growing. Now, we need to atomically write a row and update a balance at the same time, and we need to ensure that concurrent actions don’t conflict with each other.</p>
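<p>Even in this ad-hoc version, the two writes have to happen in one transaction with some form of locking, along these lines (a sketch with a hypothetical <code class="language-plaintext highlighter-rouge">point_events</code> shape):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BEGIN;
-- Lock the user row so concurrent point changes serialize
SELECT points FROM users WHERE id = 'u_123' FOR UPDATE;
UPDATE users SET points = points + 100 WHERE id = 'u_123';
INSERT INTO point_events (user_id, amount, reason)
    VALUES ('u_123', 100, 'referred a friend');
COMMIT;
</code></pre></div></div>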

<p>Over time, the requirements keep growing and getting more complicated:</p>

<ol>
  <li>Once points are spent or used, there will be a row in the <code class="language-plaintext highlighter-rouge">point_events</code> table with a negative amount. But where did the points go? Were they sent to another user? Were they spent? Did they expire? How do we track this? Do we add new columns to track this data?</li>
  <li>How do we model users sending points to another user? Presumably we record two rows to the <code class="language-plaintext highlighter-rouge">point_events</code> table and update two balances, but we have to ensure our code writes everything atomically and correctly.</li>
  <li>What links these two rows together? Do we need optional foreign keys on the <code class="language-plaintext highlighter-rouge">point_events</code> table, populating when it’s a transfer between users?</li>
</ol>

<p>As the features evolve, the requirements look more and more like a double-entry ledger, with the “currency” of each account set to “points”. Rather than build an ad-hoc bespoke data model that we need to keep expanding, we can use ledger modeling from the beginning which handles all of these cases.</p>

<p>Let’s start with a points account per user, with transfers coming from a single company account. In reality, you would probably use different company accounts for different purposes or types of points. We can also use a <code class="language-plaintext highlighter-rouge">spent</code> account to track when points are spent.</p>

<table>
  <thead>
    <tr>
      <th>Transfer ID</th>
      <th>description</th>
      <th>┃</th>
      <th>company</th>
      <th>user1</th>
      <th>user2</th>
      <th>spent</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>user1 earns 100 points</td>
      <td>┃</td>
      <td>-100</td>
      <td>100</td>
      <td> </td>
      <td> </td>
    </tr>
    <tr>
      <td>2</td>
      <td>user2 earns 200 points</td>
      <td>┃</td>
      <td>-200</td>
      <td> </td>
      <td>200</td>
      <td> </td>
    </tr>
    <tr>
      <td>3</td>
      <td>user2 spends 100 points</td>
      <td>┃</td>
      <td> </td>
      <td> </td>
      <td>-100</td>
      <td>100</td>
    </tr>
    <tr>
      <td>4</td>
      <td>user1 sends user2 50 points</td>
      <td>┃</td>
      <td> </td>
      <td>-50</td>
      <td>50</td>
      <td> </td>
    </tr>
  </tbody>
</table>

<p>At the end of this flow, it’s easy to see that user1 has 50 points, user2 has 150, the company has sent 300 points, and users have spent 100 of those.</p>
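<p>With a ledger, each row in the table above is a single function call. Using psql variables for the account IDs, the pgledger version would look roughly like this (the <code class="language-plaintext highlighter-rouge">pgledger_create_transfer</code> signature is a sketch; see the pgledger README for the real one):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT * FROM pgledger_create_transfer(:'company_id', :'user1_id', '100');
SELECT * FROM pgledger_create_transfer(:'company_id', :'user2_id', '200');
SELECT * FROM pgledger_create_transfer(:'user2_id', :'spent_id', '100');
SELECT * FROM pgledger_create_transfer(:'user1_id', :'user2_id', '50');
</code></pre></div></div>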

<p>The ledger also gives us simple auditability. If user2 wants to know why their balance is 150, we can show them the series of ledger entries that result in that balance (along with the counterparty, timestamp, etc).</p>

<p>Later, we can even model redeeming points for cash/gift cards as a currency conversion from “points” to “USD”, capturing the exchange rate in the ledger as well without any extra modeling.<sup id="fnref:4"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">4</a></sup></p>

<h2 id="more-use-cases">More Use Cases</h2>

<p>Another similar use case is modeling usage credits for an API, such as buying credits, spending them on various actions, and monitoring when they approach or reach zero. Credits would be another “currency” and the various things to track would each be accounts.</p>

<p>Taking it further, we can model things like content moderation actions per user (e.g. offenses, warnings, appeals, etc). Each user has accounts for the various actions, so we can count them over time, understand totals, compute reputation scores, etc.</p>

<p>A ledger could even represent an inventory management system, tracking quantities of items in various locations, movement between them, and their current states.</p>

<h2 id="summary">Summary</h2>

<p>The main idea here is that if an app already has ledger modeling built in, then many things can be built on top of it without a lot of extra work or complexity per use case. We don’t need to reinvent concepts and modeling and code each time. We just use the ledger with a new set of accounts and currencies. There’s an initial cost to introducing a ledger, but then that value is recouped over time.</p>

<p>And the ledger components can be encapsulated with clear seams and interfaces. The ledger implementation stands alone, and the business logic is how you structure the accounts and transfers.</p>

<p>How you add ledgers as a core component is up to you. You can use <a href="https://github.com/pgr0ss/pgledger">pgledger</a>, <a href="https://tigerbeetle.com/">TigerBeetle</a>, your own custom code, or something else entirely. And if you find more interesting use cases for ledgers, please let me know!</p>

<p><strong>Discussions:</strong></p>

<p>There are some good (and not so good) discussions about this post:</p>

<ul>
  <li><a href="https://lobste.rs/s/uqniaz/double_entry_ledgers_missing_primitive">https://lobste.rs/s/uqniaz/double_entry_ledgers_missing_primitive</a></li>
  <li><a href="https://news.ycombinator.com/item?id=44320050#44326891">https://news.ycombinator.com/item?id=44320050#44326891</a></li>
  <li><a href="https://www.reddit.com/r/programming/comments/1lduuw1/doubleentry_ledgers_the_missing_primitive_in/">https://www.reddit.com/r/programming/comments/1lduuw1/doubleentry_ledgers_the_missing_primitive_in/</a></li>
</ul>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p>I learned a lot about double-entry accounting from the Ledger CLI tool, which also uses negative and positive numbers: <a href="https://ledger-cli.org/doc/ledger3.html#Stating-where-money-goes">https://ledger-cli.org/doc/ledger3.html#Stating-where-money-goes</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2">
      <p>More specifically, the full <a href="https://en.wikipedia.org/wiki/Accounting_equation">accounting equation</a> is generally written as <code class="language-plaintext highlighter-rouge">Assets = Liabilities + Equity</code> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3">
      <p>For a longer discussion, see <a href="https://github.com/pgr0ss/pgledger/discussions/29">https://github.com/pgr0ss/pgledger/discussions/29</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4">
      <p>For examples on currency conversions, see <a href="https://github.com/pgr0ss/pgledger?tab=readme-ov-file#currencies">https://github.com/pgr0ss/pgledger?tab=readme-ov-file#currencies</a> or <a href="https://docs.tigerbeetle.com/coding/recipes/currency-exchange/">https://docs.tigerbeetle.com/coding/recipes/currency-exchange/</a> <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Paul Gross</name><email>pgross@gmail.com</email></author><summary type="html"><![CDATA[I think ledgers are underutilized in software development today. Specifically, double-entry ledger modeling would be a better fit in a lot of systems than the ad-hoc ledger-ish things they currently have.]]></summary></entry><entry><title type="html">Visualizing Financial Data with DuckDB And Plotly</title><link href="https://www.pgrs.net/2025/05/22/visualizing-financial-data-with-duckdb-and-plotly/" rel="alternate" type="text/html" title="Visualizing Financial Data with DuckDB And Plotly" /><published>2025-05-22T00:00:00-07:00</published><updated>2025-05-22T00:00:00-07:00</updated><id>https://www.pgrs.net/2025/05/22/visualizing-financial-data-with-duckdb-and-plotly</id><content type="html" xml:base="https://www.pgrs.net/2025/05/22/visualizing-financial-data-with-duckdb-and-plotly/"><![CDATA[<p>I like to keep a pretty close eye on my finances, such as my spending habits and net worth. Over the years, I’ve used a lot of different tools, such as YNAB, Mint and Quicken.</p>

<p>These days, I really like the spreadsheet-based tool <a href="https://www.awin1.com/cread.php?awinmid=18709&amp;awinaffid=1947491&amp;p=">Tiller</a> (note: affiliate link). With Tiller, all of my financial data lives in a spreadsheet that I control (in Google Sheets).</p>

<p>The visualizations that come with Tiller are incredible (both official and community-built), but a huge benefit is having all of my data available as a spreadsheet. I can export the data as CSVs and then run whatever tools I want.</p>

<p>I also recently learned about <a href="https://plotly.com/python/">Plotly</a>, a great graphing library for Python. Combine that with my love of <a href="https://duckdb.org/">DuckDB</a> for querying data<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>, and I have found new ways to visualize my financial data.</p>

<p>For example, below is a sunburst diagram visualizing expenses. An interactive version is available at <a href="/assets/plotly_expenses.html" target="_blank">plotly_expenses.html</a>.</p>

<p><img src="/assets/expenses_sunburst.png" alt="expenses sunburst" /></p>

<p>(For these examples, I’m using this <a href="https://docs.google.com/spreadsheets/d/1Lj5RWlaKDlIkU516SmzxcF105xCef3gbX4qhy_PH-Fo/edit#gid=1256593101">sample Tiller sheet</a> from this <a href="https://community.tiller.com/t/tiller-feeds-sample-data-for-google-builders/16960">community post</a>.)</p>

<p>In the sections below, I’ll walk through how I generated this diagram.</p>

<h2 id="querying-tiller-data-with-duckdb">Querying Tiller Data with DuckDB</h2>

<p>First, I exported the <code class="language-plaintext highlighter-rouge">Transactions</code> and <code class="language-plaintext highlighter-rouge">Categories</code> tabs from the Tiller Sheet. <code class="language-plaintext highlighter-rouge">Transactions</code> is a list of every ingested transaction with an assigned category. The <code class="language-plaintext highlighter-rouge">Categories</code> tab adds a hierarchy to the set of categories, such as <code class="language-plaintext highlighter-rouge">Groceries</code> and <code class="language-plaintext highlighter-rouge">Restaurants</code> belonging to the <code class="language-plaintext highlighter-rouge">Food</code> group of expenses.</p>

<p>Then, I used DuckDB to query these sheets in Python. For example, here’s a sum of expenses by group and category:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">expenses_by_category</span> <span class="o">=</span> <span class="n">duckdb</span><span class="p">.</span><span class="nf">sql</span><span class="p">(</span>
    <span class="sh">"""</span><span class="s">
    select
        c.Type,
        c.Group,
        t.Category,
        -round(sum(replace(replace(t.Amount, </span><span class="sh">'</span><span class="s">$</span><span class="sh">'</span><span class="s">, </span><span class="sh">''</span><span class="s">), </span><span class="sh">'</span><span class="s">,</span><span class="sh">'</span><span class="s">, </span><span class="sh">''</span><span class="s">)::decimal), 1) as Amount
    from read_csv(</span><span class="sh">'</span><span class="s">Tiller Sample Data - Transactions.csv</span><span class="sh">'</span><span class="s">) t
    join read_csv(</span><span class="sh">'</span><span class="s">Tiller Sample Data - Categories.csv</span><span class="sh">'</span><span class="s">) c on c.Category = t.Category
    and c.</span><span class="sh">"</span><span class="s">Hide From Reports</span><span class="sh">"</span><span class="s"> is null
    and c.Type = </span><span class="sh">'</span><span class="s">Expense</span><span class="sh">'</span><span class="s">
    group by t.Category, c.Group, c.Type
</span><span class="sh">"""</span>
<span class="p">)</span>

<span class="n">expenses_by_category</span><span class="p">.</span><span class="nf">show</span><span class="p">()</span>
</code></pre></div></div>

<p>Which prints:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────┬───────────────┬─────────────────────┬───────────────┐
│  Type   │     Group     │      Category       │    Amount     │
│ varchar │    varchar    │       varchar       │ decimal(38,1) │
├─────────┼───────────────┼─────────────────────┼───────────────┤
│ Expense │ Discretionary │ Clothes/Gear        │        8612.5 │
│ Expense │ Wellness      │ Guidance            │        8131.8 │
│ Expense │ Food          │ Snacks/Coffee       │        1796.6 │
│ Expense │ Auto          │ Camper              │        2392.8 │
│ Expense │ Discretionary │ Subscriptions       │        1540.0 │
│ Expense │ Food          │ Groceries           │       17483.5 │
│ Expense │ Living        │ Household           │        4231.0 │
│ Expense │ Health        │ Pharmacy            │        3437.9 │
│ Expense │ Discretionary │ Streaming           │        2303.0 │
│ Expense │ Wellness      │ Gym/Yoga            │        3171.9 │
│    ·    │  ·            │    ·                │           ·   │
│    ·    │  ·            │    ·                │           ·   │
│    ·    │  ·            │    ·                │           ·   │
│ Expense │ Food          │ Restaurants         │       10897.4 │
│ Expense │ Auto          │ Fees/Repairs/Maint. │        1549.1 │
│ Expense │ Discretionary │ Hobbies             │        3538.9 │
│ Expense │ Living        │ Cell Phone          │        3060.0 │
│ Expense │ Giving        │ Donations           │        3086.4 │
│ Expense │ Living        │ Utilities           │        3600.0 │
│ Expense │ Living        │ Rent                │       40727.9 │
│ Expense │ Discretionary │ Fun                 │        7076.9 │
│ Expense │ Living        │ Internet            │        3150.0 │
│ Expense │ Auto          │ Car Insurance       │       11682.0 │
├─────────┴───────────────┴─────────────────────┴───────────────┤
│ 26 rows (20 shown)                                  4 columns │
└───────────────────────────────────────────────────────────────┘
</code></pre></div></div>

<p>The only tricky part was cleaning up the Amount field so DuckDB treated it as a number. There may be a simpler way to do that.</p>
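<p>The cleanup is just string surgery before the cast: strip the dollar sign and the thousands separators, then parse the result as a decimal. Here is the same logic in plain Python (the <code class="language-plaintext highlighter-rouge">clean_amount</code> helper is illustrative, not part of the script above):</p>

```python
from decimal import Decimal

def clean_amount(raw: str) -> Decimal:
    # Mirrors replace(replace(Amount, '$', ''), ',', '')::decimal from the SQL
    return Decimal(raw.replace("$", "").replace(",", ""))

print(clean_amount("$17,483.50"))  # the Groceries total from the table above
```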

<h2 id="graphing-with-plotly">Graphing with Plotly</h2>

<p>Once I had the tabular data from DuckDB, graphing it with Plotly was simple. Plotly can generate an image file, but I think it really shines when generating standalone, interactive HTML:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">expenses_by_category_sunburst</span> <span class="o">=</span> <span class="n">px</span><span class="p">.</span><span class="nf">sunburst</span><span class="p">(</span>
    <span class="n">expenses_by_category</span><span class="p">,</span>
    <span class="n">path</span><span class="o">=</span><span class="p">[</span><span class="sh">"</span><span class="s">Type</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">Group</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">Category</span><span class="sh">"</span><span class="p">],</span>
    <span class="n">values</span><span class="o">=</span><span class="sh">"</span><span class="s">Amount</span><span class="sh">"</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">expenses_by_category_sunburst</span><span class="p">.</span><span class="nf">update_traces</span><span class="p">(</span><span class="n">textinfo</span><span class="o">=</span><span class="sh">"</span><span class="s">label+percent parent</span><span class="sh">"</span><span class="p">)</span>

<span class="n">expenses_by_category_sunburst</span><span class="p">.</span><span class="nf">write_image</span><span class="p">(</span><span class="sh">"</span><span class="s">expenses_sunburst.png</span><span class="sh">"</span><span class="p">)</span>

<span class="n">expenses_by_category_sunburst</span><span class="p">.</span><span class="nf">write_html</span><span class="p">(</span><span class="sh">"</span><span class="s">plotly_expenses.html</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="full-script">Full script</h2>

<p>I use <a href="https://docs.astral.sh/uv/">uv</a> these days for managing Python dependencies, which lets you embed the dependency requirements as a comment in the Python script. So you can save this as <code class="language-plaintext highlighter-rouge">tiller_plotly.py</code> and then run it with <code class="language-plaintext highlighter-rouge">uv run tiller_plotly.py</code>, which will automatically download the dependencies:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># /// script
# requires-python = "&gt;=3.12"
# dependencies = [
#     "duckdb~=1.2",
#     "kaleido~=0.2", # Needed to generate images
#     "plotly[express]~=6.0",
#     "pyarrow~=19.0",
# ]
# ///
</span>
<span class="kn">import</span> <span class="n">duckdb</span>
<span class="kn">import</span> <span class="n">plotly.express</span> <span class="k">as</span> <span class="n">px</span>

<span class="n">expenses_by_category</span> <span class="o">=</span> <span class="n">duckdb</span><span class="p">.</span><span class="nf">sql</span><span class="p">(</span>
    <span class="sh">"""</span><span class="s">
    select
        c.Type,
        c.Group,
        t.Category,
        -round(sum(replace(replace(t.Amount, </span><span class="sh">'</span><span class="s">$</span><span class="sh">'</span><span class="s">, </span><span class="sh">''</span><span class="s">), </span><span class="sh">'</span><span class="s">,</span><span class="sh">'</span><span class="s">, </span><span class="sh">''</span><span class="s">)::decimal), 1) as Amount
    from read_csv(</span><span class="sh">'</span><span class="s">Tiller Sample Data - Transactions.csv</span><span class="sh">'</span><span class="s">) t
    join read_csv(</span><span class="sh">'</span><span class="s">Tiller Sample Data - Categories.csv</span><span class="sh">'</span><span class="s">) c on c.Category = t.Category
    and c.</span><span class="sh">"</span><span class="s">Hide From Reports</span><span class="sh">"</span><span class="s"> is null
    and c.Type = </span><span class="sh">'</span><span class="s">Expense</span><span class="sh">'</span><span class="s">
    group by t.Category, c.Group, c.Type
</span><span class="sh">"""</span>
<span class="p">)</span>

<span class="n">expenses_by_category</span><span class="p">.</span><span class="nf">show</span><span class="p">()</span>

<span class="n">expenses_by_category_sunburst</span> <span class="o">=</span> <span class="n">px</span><span class="p">.</span><span class="nf">sunburst</span><span class="p">(</span>
    <span class="n">expenses_by_category</span><span class="p">,</span>
    <span class="n">path</span><span class="o">=</span><span class="p">[</span><span class="sh">"</span><span class="s">Type</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">Group</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">Category</span><span class="sh">"</span><span class="p">],</span>
    <span class="n">values</span><span class="o">=</span><span class="sh">"</span><span class="s">Amount</span><span class="sh">"</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">expenses_by_category_sunburst</span><span class="p">.</span><span class="nf">update_traces</span><span class="p">(</span><span class="n">textinfo</span><span class="o">=</span><span class="sh">"</span><span class="s">label+percent parent</span><span class="sh">"</span><span class="p">)</span>

<span class="n">expenses_by_category_sunburst</span><span class="p">.</span><span class="nf">write_image</span><span class="p">(</span><span class="sh">"</span><span class="s">expenses_sunburst.png</span><span class="sh">"</span><span class="p">)</span>

<span class="n">expenses_by_category_sunburst</span><span class="p">.</span><span class="nf">write_html</span><span class="p">(</span><span class="sh">"</span><span class="s">plotly_expenses.html</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<p>And if you want to try out Tiller, check it out here: <a href="https://www.awin1.com/cread.php?awinmid=18709&amp;awinaffid=1947491&amp;p=">Tiller</a> (affiliate link)</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p>Other DuckDB posts I’ve written: <a href="/2024/03/21/duckdb-as-the-new-jq/">DuckDB as the New jq</a> and <a href="/2024/11/01/duckdb-over-pandas-polars/">DuckDB over Pandas/Polars</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Paul Gross</name><email>pgross@gmail.com</email></author><summary type="html"><![CDATA[I like to keep a pretty close eye on my finances, such as my spending habits and net worth. Over the years, I’ve used a lot of different tools, such as YNAB, Mint and Quicken.]]></summary></entry><entry><title type="html">A Ledger In PostgreSQL Is Fast!</title><link href="https://www.pgrs.net/2025/05/16/pgledger-in-postgresql-is-fast/" rel="alternate" type="text/html" title="A Ledger In PostgreSQL Is Fast!" /><published>2025-05-16T00:00:00-07:00</published><updated>2025-05-16T00:00:00-07:00</updated><id>https://www.pgrs.net/2025/05/16/pgledger-in-postgresql-is-fast</id><content type="html" xml:base="https://www.pgrs.net/2025/05/16/pgledger-in-postgresql-is-fast/"><![CDATA[<p>I’ve been working on a ledger implementation in pure PostgreSQL called <a href="https://github.com/pgr0ss/pgledger">pgledger</a>. For the backstory, please read my previous blog post: <a href="https://pgrs.net/2025/03/24/pgledger-ledger-implementation-in-postgresql/">Ledger Implementation in PostgreSQL</a>.</p>

<p>Now that the project is a bit further along, I decided to gather some performance numbers. And it’s fast! Depending on the scenario, I can easily get over 10,000 ledger transfers per second on my laptop with a stock, unoptimized PostgreSQL. I would imagine a well-tuned production database would do a lot more.</p>

<p>Sure, it’s not <a href="https://tigerbeetle.com/">TigerBeetle</a> level performance, but still more than enough for most applications. And the simplicity of having the ledger in your main database is huge.</p>

<h2 id="scenarios-and-scripting">Scenarios and Scripting</h2>

<p>Performance is a notoriously hard thing to measure, since different usage patterns and different hardware can yield very different results.</p>

<p>I have been iterating on a script in the pgledger repository to help measure performance: <a href="https://github.com/pgr0ss/pgledger/blob/37ae752a5eaf2cec739e5da0907e61f012705e1c/go/performance_check.go">performance_check.go</a>. My thought was that others could also run this script in their environments if they want to gather more realistic numbers for their setup.</p>

<p>The script takes a few inputs, including:</p>

<ul>
  <li>The number of accounts (each transfer moves money from one of these accounts to another)</li>
  <li>The number of concurrent workers doing transfers</li>
  <li>The duration of time to run</li>
</ul>

<p>To simulate a scenario where there isn’t much account contention, we can ensure there are many more accounts than workers. That way, concurrent workers rarely try to transfer between the same accounts. Alternatively, if our system has only a handful of hot accounts, we can keep the number of accounts low to simulate workers waiting for locked accounts.</p>

<p>The script also measures the database size before and after, and then calculates the amount of disk space used per transfer. This should take into account the data in both tables and indexes.</p>

<h2 id="local-results">Local Results</h2>

<p>Here are some results from my laptop (M3 MacBook Air). I am using a vanilla, unoptimized PostgreSQL 17.5, set up with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>brew <span class="nb">install </span>postgresql@17
brew services start postgresql@17
</code></pre></div></div>

<p>First, the low contention scenario:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span> go run performance_check.go <span class="nt">--accounts</span><span class="o">=</span>50 <span class="nt">--workers</span><span class="o">=</span>20 <span class="nt">--duration</span><span class="o">=</span>30s

Completed transfers: 319105
Elapsed <span class="nb">time </span><span class="k">in </span>seconds: 30.0
Database size before: 1795 MB
Database size after:  2021 MB
Database size growth <span class="k">in </span>bytes: 237223936
Transfers/second: 10636.8
Milliseconds/transfer: 1.9
Bytes/transfer: 743
</code></pre></div></div>

<p>We can see here that we spent less than 2 milliseconds per transfer for an overall rate of 10,636.8 transfers per second. And each transfer added about 743 bytes to the database. Compared to many queries I’ve seen in financial application code, this is quite fast.</p>
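<p>As a sanity check, the derived metrics can be reproduced from the raw numbers, assuming (as these results suggest) that milliseconds/transfer is the per-worker latency rather than wall-clock time divided by transfers:</p>

```python
# Raw numbers from the low contention run above
transfers = 319_105
elapsed_seconds = 30.0
workers = 20
size_growth_bytes = 237_223_936

transfers_per_second = transfers / elapsed_seconds
# Each of the 20 workers issues transfers serially, so per-transfer
# latency is total worker-seconds divided by completed transfers
ms_per_transfer = elapsed_seconds * workers / transfers * 1000
bytes_per_transfer = size_growth_bytes // transfers

print(round(transfers_per_second, 1), round(ms_per_transfer, 1), bytes_per_transfer)
# 10636.8 1.9 743
```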

<p>And the account contention scenario, where workers may need to wait on other workers currently using the same accounts:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span> go run performance_check.go <span class="nt">--accounts</span><span class="o">=</span>10 <span class="nt">--workers</span><span class="o">=</span>20 <span class="nt">--duration</span><span class="o">=</span>30s

Completed transfers: 226767
Elapsed <span class="nb">time </span><span class="k">in </span>seconds: 30.0
Database size before: 2021 MB
Database size after:  2182 MB
Database size growth <span class="k">in </span>bytes: 168566784
Transfers/second: 7558.9
Milliseconds/transfer: 2.6
Bytes/transfer: 743
</code></pre></div></div>

<p>In this scenario, our throughput dropped to 7,558.9 transfers per second since transfers were waiting to lock hot accounts before proceeding. You can also see that the time per transfer increased to 2.6 milliseconds. Still quite fast, though.</p>

<h2 id="remote-results">Remote Results</h2>

<p>I also wanted to test the performance against a hosted database, so I set up an account on <a href="https://neon.tech/">Neon</a>. The free database has 2 vCPUs and 8 GB of RAM. Since the latency from my laptop to the cloud-hosted database is higher, I ran more workers (since each worker would spend more time waiting on network responses):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span> go run performance_check.go <span class="nt">--accounts</span><span class="o">=</span>100 <span class="nt">--workers</span><span class="o">=</span>60 <span class="nt">--duration</span><span class="o">=</span>30s

Completed transfers: 48937
Elapsed <span class="nb">time </span><span class="k">in </span>seconds: 30.0
Database size before: 157 MB
Database size after:  192 MB
Database size growth <span class="k">in </span>bytes: 36470784
Transfers/second: 1631.2
Milliseconds/transfer: 36.8
Bytes/transfer: 745
</code></pre></div></div>

<p>(Note that I also had to increase the pool size in the PostgreSQL client to handle the increased workers by appending <code class="language-plaintext highlighter-rouge">&amp;pool_max_conns=100</code> to the connection string.<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>)</p>

<p>These numbers aren’t as great, but it’s still over 1,000 transfers per second on the free tier of a database that is at least one US state away, which I think is pretty good. If someone wants to test this on their large, optimized production database, I would love to see the results.</p>

<h2 id="final-thoughts">Final Thoughts</h2>

<p>I’m happy with these numbers for now. I think this performance is high enough for most use cases, and it’s likely that any production system will have other bottlenecks in the database before hitting the pgledger limits.</p>

<p>I also haven’t done a lot of optimizations to the ledger. In fact, I’ve made some choices that hurt performance in the name of simplicity. If/when performance becomes a bigger concern, there are a few things that could be done to make pgledger even faster. The limiting factor right now seems to be disk write speed, so reducing the amount of data per transfer would help, for example:</p>

<ol>
  <li>Removing fields which are duplicated between <code class="language-plaintext highlighter-rouge">pgledger_transfers</code> and <code class="language-plaintext highlighter-rouge">pgledger_entries</code> such as the account IDs</li>
  <li>Using an ID format that takes up less space (I like <a href="https://pgrs.net/2023/01/10/ulid-identifiers-and-ulid-tools-website/">prefixed ULIDs</a>, but they could be stored as UUID types)</li>
  <li>Removing the <code class="language-plaintext highlighter-rouge">created_at</code> fields and relying on the timestamps within the IDs</li>
</ol>
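<p>On the last point, a ULID encodes a millisecond Unix timestamp in its first 10 characters using Crockford base32, so creation times are recoverable without a separate column. A quick illustrative decoder (not code from pgledger):</p>

```python
# Crockford base32 alphabet used by ULIDs (no I, L, O, or U)
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def ulid_timestamp_ms(ulid: str) -> int:
    """Decode the millisecond Unix timestamp from a ULID's first 10 characters."""
    ts = 0
    for ch in ulid[:10]:
        ts = ts * 32 + CROCKFORD.index(ch)
    return ts

# The all-zero ULID decodes to the Unix epoch
print(ulid_timestamp_ms("0" * 26))  # 0
```

<p>(Since pgledger uses prefixed ULIDs, the prefix would need to be stripped before decoding.)</p>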

<p>So if you need a ledger, please check out <a href="https://github.com/pgr0ss/pgledger">pgledger</a> and let me know what you think!</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p><a href="https://github.com/jackc/pgx/blob/777e7e5cdf2d349c37e1eef8eedc0e21857e9b95/pgxpool/pool.go#L141">https://github.com/jackc/pgx/blob/777e7e5cdf2d349c37e1eef8eedc0e21857e9b95/pgxpool/pool.go#L141</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Paul Gross</name><email>pgross@gmail.com</email></author><summary type="html"><![CDATA[I’ve been working on a ledger implementation in pure PostgreSQL called pgledger. For the backstory, please read my previous blog post: Ledger Implementation in PostgreSQL.]]></summary></entry><entry><title type="html">Ledger Implementation in PostgreSQL</title><link href="https://www.pgrs.net/2025/03/24/pgledger-ledger-implementation-in-postgresql/" rel="alternate" type="text/html" title="Ledger Implementation in PostgreSQL" /><published>2025-03-24T00:00:00-07:00</published><updated>2025-03-24T00:00:00-07:00</updated><id>https://www.pgrs.net/2025/03/24/pgledger-ledger-implementation-in-postgresql</id><content type="html" xml:base="https://www.pgrs.net/2025/03/24/pgledger-ledger-implementation-in-postgresql/"><![CDATA[<p><strong>First, the tl;dr</strong>: I am working on a financial ledger implemented entirely in <a href="https://www.postgresql.org/">PostgreSQL</a> called <a href="https://github.com/pgr0ss/pgledger">pgledger</a>.</p>

<p>Before I get to the why, here’s how it looks so far:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Set up your accounts:</span>
<span class="k">select</span> <span class="n">id</span> <span class="k">from</span> <span class="n">pgledger_create_account</span><span class="p">(</span><span class="s1">'account_1'</span><span class="p">);</span> <span class="c1">-- save this as account_1_id</span>
<span class="k">select</span> <span class="n">id</span> <span class="k">from</span> <span class="n">pgledger_create_account</span><span class="p">(</span><span class="s1">'account_2'</span><span class="p">);</span> <span class="c1">-- save this as account_2_id</span>

<span class="c1">-- Create transfers:</span>
<span class="k">select</span> <span class="o">*</span> <span class="k">from</span> <span class="n">pgledger_create_transfer</span><span class="p">(</span><span class="err">$</span><span class="n">account_1_id</span><span class="p">,</span> <span class="err">$</span><span class="n">account_2_id</span><span class="p">,</span> <span class="mi">12</span><span class="p">.</span><span class="mi">34</span><span class="p">);</span>
<span class="k">select</span> <span class="o">*</span> <span class="k">from</span> <span class="n">pgledger_create_transfer</span><span class="p">(</span><span class="err">$</span><span class="n">account_1_id</span><span class="p">,</span> <span class="err">$</span><span class="n">account_2_id</span><span class="p">,</span> <span class="mi">56</span><span class="p">.</span><span class="mi">78</span><span class="p">);</span>

<span class="c1">-- See updated balances:</span>
<span class="k">select</span> <span class="n">name</span><span class="p">,</span> <span class="n">balance</span><span class="p">,</span> <span class="k">version</span> <span class="k">from</span> <span class="n">pgledger_get_account</span><span class="p">(</span><span class="err">$</span><span class="n">account_2_id</span><span class="p">);</span>

   <span class="n">name</span>    <span class="o">|</span> <span class="n">balance</span> <span class="o">|</span> <span class="k">version</span>
<span class="c1">-----------+---------+---------</span>
 <span class="n">account_2</span> <span class="o">|</span>   <span class="mi">69</span><span class="p">.</span><span class="mi">12</span> <span class="o">|</span>       <span class="mi">2</span>

<span class="c1">-- See ledger entries:</span>
<span class="k">select</span> <span class="o">*</span> <span class="k">from</span> <span class="n">pgledger_entries</span> <span class="k">where</span> <span class="n">account_id</span> <span class="o">=</span> <span class="err">$</span><span class="n">account_2_id</span><span class="p">;</span>

  <span class="n">id</span>   <span class="o">|</span> <span class="n">account_id</span> <span class="o">|</span> <span class="n">transfer_id</span> <span class="o">|</span> <span class="n">amount</span> <span class="o">|</span> <span class="n">account_previous_balance</span> <span class="o">|</span> <span class="n">account_current_balance</span> <span class="o">|</span> <span class="n">account_version</span> <span class="o">|</span>          <span class="n">created_at</span>
<span class="c1">-------+------------+-------------+--------+--------------------------+-------------------------+-----------------+-------------------------------</span>
 <span class="mi">96198</span> <span class="o">|</span>         <span class="mi">42</span> <span class="o">|</span>       <span class="mi">48103</span> <span class="o">|</span>  <span class="mi">12</span><span class="p">.</span><span class="mi">34</span> <span class="o">|</span>                     <span class="mi">0</span><span class="p">.</span><span class="mi">00</span> <span class="o">|</span>                   <span class="mi">12</span><span class="p">.</span><span class="mi">34</span> <span class="o">|</span>               <span class="mi">1</span> <span class="o">|</span> <span class="mi">2025</span><span class="o">-</span><span class="mi">03</span><span class="o">-</span><span class="mi">19</span> <span class="mi">21</span><span class="p">:</span><span class="mi">31</span><span class="p">:</span><span class="mi">03</span><span class="p">.</span><span class="mi">596426</span><span class="o">+</span><span class="mi">00</span>
 <span class="mi">96200</span> <span class="o">|</span>         <span class="mi">42</span> <span class="o">|</span>       <span class="mi">48104</span> <span class="o">|</span>  <span class="mi">56</span><span class="p">.</span><span class="mi">78</span> <span class="o">|</span>                    <span class="mi">12</span><span class="p">.</span><span class="mi">34</span> <span class="o">|</span>                   <span class="mi">69</span><span class="p">.</span><span class="mi">12</span> <span class="o">|</span>               <span class="mi">2</span> <span class="o">|</span> <span class="mi">2025</span><span class="o">-</span><span class="mi">03</span><span class="o">-</span><span class="mi">19</span> <span class="mi">21</span><span class="p">:</span><span class="mi">31</span><span class="p">:</span><span class="mi">21</span><span class="p">.</span><span class="mi">615916</span><span class="o">+</span><span class="mi">00</span>
</code></pre></div></div>

<p>Each transfer subtracts from one account’s balance and adds to another account’s balance. It also writes two entries, one for each account, which record the previous and current balances, as well as the account versions. Each transfer to or from an account increments the version, giving a linear view of the changing balance over time.</p>

<p>This makes it easy to view the history to understand why an account balance is at its current value, or even query for a historical value at a given time.</p>
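<p>As a toy illustration of this bookkeeping, here is an in-memory Python sketch (not the actual pgledger implementation) that replays the transfers from the example above:</p>

```python
from decimal import Decimal

class Account:
    def __init__(self, name):
        self.name = name
        self.balance = Decimal("0.00")
        self.version = 0

entries = []  # (account, amount, previous_balance, current_balance, version)

def transfer(from_account, to_account, amount):
    amount = Decimal(amount)
    # Double entry: a negative entry on the source, a positive one on the destination
    for account, delta in ((from_account, -amount), (to_account, amount)):
        previous = account.balance
        account.balance += delta
        account.version += 1
        entries.append((account.name, delta, previous, account.balance, account.version))

a1, a2 = Account("account_1"), Account("account_2")
transfer(a1, a2, "12.34")
transfer(a1, a2, "56.78")

print(a2.balance, a2.version)  # 69.12 2
```

<p>The two entries written for each transfer always sum to zero, which is the invariant that keeps the history auditable.</p>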

<p>It’s all just functions and tables, so you have the full power of PostgreSQL. Start a transaction and execute as many transfers as you want grouped together. Or query the tables using whatever SQL you desire.</p>

<h2 id="why-ledgers">Why Ledgers</h2>

<p>I’ve worked in payments for a long time at many different companies, and one recurring theme is building in-house financial ledger software.</p>

<p>Ledgers are a fundamental building block of any software that deals with money. It’s incredibly important to know what money is where, how it got there, and what it’s for.</p>

<p>They serve current app needs as well as reporting and reconciliation. Everything from “do I have enough money in this account to do X?” to “why doesn’t an account have the balance I expect?” and “where is my money getting held up in my processes?”</p>

<p>A pattern I’ve noticed is that most companies tend to build their own internal ledger. There are many ways to implement a ledger, but at their core, the concepts are the same and the feature sets are pretty consistent. It doesn’t feel like something everyone should have to reimplement every time.</p>

<p>Furthermore, building a ledger properly from scratch is tricky, and there are lots of potential edge cases and race conditions. For example, concurrent transfers causing an account to go negative, or concurrent balance updates clobbering each other. Several times, I’ve seen folks start with a simple table to record payments/transfers/etc., and then realize over time that they actually did need a proper <a href="https://en.wikipedia.org/wiki/Double-entry_bookkeeping">double-entry ledger</a>.</p>
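
<p>To make the clobbering problem concrete: a read-then-write balance check is racy, because two concurrent transfers can both pass the check before either one writes. A common fix is to fold the check into the write itself. This sketch uses a hypothetical <code class="language-plaintext highlighter-rouge">accounts</code> table, not pgledger’s actual schema:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Racy: two transfers can both read a balance of 15, both decide
-- 10 is affordable, and the final balance ends up at -5.
--   SELECT balance FROM accounts WHERE id = 42;  -- app verifies balance &gt;= 10
--   UPDATE accounts SET balance = balance - 10 WHERE id = 42;

-- Atomic: the check and the decrement happen in one statement.
UPDATE accounts
SET balance = balance - 10
WHERE id = 42
  AND balance &gt;= 10;
-- If zero rows were updated, reject the transfer.
</code></pre></div></div>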

<h2 id="why-postgresql">Why PostgreSQL</h2>

<p>Today, if you don’t want to build your own ledger, you do have a few options. For example, there are hosted services like <a href="https://www.moderntreasury.com/products/ledgers">Modern Treasury</a> and ledger-specific databases like <a href="https://tigerbeetle.com/">TigerBeetle</a>. Both of these are impressive and probably a good fit for many.</p>

<p>But by using a ledger outside of the main application database, you lose transactionality and atomicity. Namely, you have to worry about orchestrating two systems that can fail independently. What happens if you write your main data, but the ledger update fails? Or what if the ledger operation succeeds, but your app hits an error and fails to write the surrounding data? Integrating with these systems often requires two-phase commits or other strategies to keep them in sync. And when they fall out of sync, it can be very hard to debug.</p>

<p>What I generally want is to include ledger updates in the same database transaction as the other work, and then have everything either commit or roll back atomically. In the past, I’ve done this with a bunch of application code. But that ties the ledger to a specific language, framework, etc. It’s not very portable to a new project, which is possibly why we don’t see a lot of open source libraries around ledgers.</p>

<p>So this time I’m trying something different: a ledger implementation entirely in PostgreSQL. That means as long as you use PostgreSQL (and in theory it could be ported to other databases), it is entirely transactional/atomic and language/framework/application agnostic. All you need to do from app code is execute the right SQL functions within the same database transactions in which you write everything else.</p>
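
<p>From the application’s perspective, that looks something like the following. The <code class="language-plaintext highlighter-rouge">payments</code> table and the function signature here are illustrative; see the pgledger README for the actual API:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BEGIN;

-- The application's own write...
INSERT INTO payments (customer_id, amount) VALUES (456, 12.34);

-- ...and the ledger transfer, in the same transaction
-- (from account, to account, amount).
SELECT * FROM pgledger_create_transfer(48103, 48104, '12.34');

COMMIT; -- both writes commit together, or neither does
</code></pre></div></div>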

<p>Building it in PostgreSQL also means you don’t have to integrate with any new APIs or run any new services. It’s in line with the idea of <a href="https://www.amazingcto.com/postgres-for-everything/">“Just Use Postgres for Everything”</a>, which is especially attractive to startups and small companies.</p>

<p>This is partly an experiment to see what a ledger implementation in pure PostgreSQL would look like, and partly something I hope to use on a future project.</p>

<h2 id="testing-in-go">Testing in Go</h2>

<p>If you look at the code today, you’ll notice that while the entire implementation is in SQL, the tests are in Go. I chose Go mainly because it’s the language I work with most these days, and it has good concurrency support. I have a couple of concurrent tests already that look for deadlocks and race conditions, and I hope to write more.</p>

<h2 id="feedback-please">Feedback Please</h2>

<p>Please check out <a href="https://github.com/pgr0ss/pgledger">pgledger</a>; I’d love to hear what you think, especially if you work with ledgers today. I have many more features I’d like to implement, so feel free to keep an eye on the project. And if this is something that you are interested in using, please let me know.</p>]]></content><author><name>Paul Gross</name><email>pgross@gmail.com</email></author><summary type="html"><![CDATA[First, the tl;dr: I am working on a financial ledger implemented entirely in PostgreSQL called pgledger.]]></summary></entry><entry><title type="html">Personal Notes Tooling</title><link href="https://www.pgrs.net/2025/01/29/personal-notes-tooling/" rel="alternate" type="text/html" title="Personal Notes Tooling" /><published>2025-01-29T00:00:00-08:00</published><updated>2025-01-29T00:00:00-08:00</updated><id>https://www.pgrs.net/2025/01/29/personal-notes-tooling</id><content type="html" xml:base="https://www.pgrs.net/2025/01/29/personal-notes-tooling/"><![CDATA[<p>I love keeping notes. Everything from meeting summaries to packing lists to books I’ve read and more. And these days, I tend to favor digital notes. I’m faster at typing than writing, and I love being able to search them. And with my phone, I don’t have to carry around a notebook.</p>

<p>I’ve gone through a few different systems/tools as I continue the search for my perfect setup. I figured I’d share what I am looking for and what I currently use.</p>

<p>To be explicit, this is for personal notes. When I’m writing shared notes, I use different tools. For example, at work I’ll often use a wiki or project management system, and with family and friends, something like Google Docs.</p>

<p>But for my personal notes, here is what I’m looking for in my ideal setup:</p>

<h3 id="good-mobile-support">Good mobile support</h3>

<p>I want to be able to rely on my phone when I’m not at my computer. It’s ok if the mobile experience is more limited, but I need the basics of adding and editing notes, and ideally searching.</p>

<h3 id="offline-support">Offline support</h3>

<p>I travel a lot, and I want to be able to add and edit notes when my phone or laptop is in airplane mode. Or when I have spotty cell/wifi coverage.</p>

<p>Ideally, I should be able to edit normally, and then when I’m back in service, the notes should sync in the background.</p>

<p>As an example of something I don’t like, the iPhone Dropbox app allows me to edit notes without service, but then when I try to save, it can hang and refuse to persist. I don’t want to fear losing updates when I’m offline.</p>

<h3 id="open-format">Open format</h3>

<p>I’ve gone through various tools in the past, and I don’t want to get locked into a tool that I can’t easily get out of. Concretely, I would prefer the files be plain text (e.g. markdown). Plain text files are universal, and I can use many different tools to edit or convert them. Here’s a good summary of this argument: <a href="https://sive.rs/plaintext">Write plain text files</a>.</p>

<p>If plain text isn’t an option, the next best thing would be a different open format, such as SQLite.</p>

<p>With open formats, even if the tool I’m using doesn’t support a good export, I can write my own. Or if the new tool wants a different format, I can easily convert the files.</p>

<p>If all else fails, I’m ok using a proprietary format as long as there’s good export functionality to an open format.</p>

<h3 id="vim-support">Vim support</h3>

<p>I’m a big vim fan (or rather <a href="https://neovim.io/">neovim</a> these days). When I’m at my computer, I would like to be able to type notes with vim. Vim mode on another tool can work, but I prefer real vim when possible.</p>

<h3 id="image-support">Image support</h3>

<p>This isn’t a strict must have for me, but it’s nice if I can embed images in docs (such as recipes).</p>

<h3 id="versioning">Versioning</h3>

<p>Another nice to have is some kind of versioning or point-in-time backup. Worst case, I can back up periodic exports, but ideally, the system would have a better solution for this.</p>

<h2 id="current-tooling">Current Tooling</h2>

<p>So with this list of requirements and desires, here’s the set of current tools I’m using:</p>

<p>The crux of my current setup is directories of plain text markdown files stored in iCloud Drive. iCloud Drive handles the syncing between my laptop and my phone.</p>

<p>I also made this directory a git repository, and I periodically commit and push changes to a hosted git repo. This gives me both versioning and backups, in case something goes wrong with iCloud Drive (e.g. it gets accidentally deleted or corrupted).</p>
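
<p>The periodic commit is a tiny script. Here is a minimal sketch (the notes path is illustrative; I run something like this by hand or from a scheduler):</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/sh
# Commit and push any changed notes; do nothing if the tree is clean.
backup_notes() {
  cd "${1:-$HOME/notes}" || return 1
  if [ -z "$(git status --porcelain)" ]; then
    return 0  # tree is clean; nothing to back up
  fi
  git add -A
  git commit -m "notes backup $(date +%Y-%m-%d)"
  git push
}
</code></pre></div></div>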

<p>On desktop, I add/edit these files mostly in neovim. I also occasionally use <a href="https://obsidian.md/">Obsidian</a>, especially for more complex things like adding images (it automatically copies in the image and makes the markdown links to it).</p>

<p>On my phone, I use the Obsidian mobile app. The main reason I chose iCloud Drive over another syncing method is because that’s what the Obsidian app supports outside of their proprietary sync: <a href="https://help.obsidian.md/getting-started/sync-your-notes-across-devices">https://help.obsidian.md/getting-started/sync-your-notes-across-devices</a>.</p>

<h3 id="downsides">Downsides</h3>

<p>While iCloud Drive syncing mostly works, it’s definitely not the smoothest. I frequently see sync issues, where changes on one device don’t show up on another for a long time. And it doesn’t handle conflicts well, just blowing away changes on one of the devices rather than trying to merge files, or asking me to resolve conflicts. Thankfully at least, iCloud finally supports the ability to keep a folder downloaded on a device: <a href="https://www.macrumors.com/2024/06/26/icloud-keep-downloaded-option-ios-18-macos-sequoia/">https://www.macrumors.com/2024/06/26/icloud-keep-downloaded-option-ios-18-macos-sequoia/</a>.</p>

<p>Even though I set the entire Obsidian directory in iCloud to <code class="language-plaintext highlighter-rouge">Keep Downloaded</code>, Obsidian will often pop up a message that it’s synchronizing my files and I need to wait. I’m afraid that if I skip it, I will somehow edit an old version and blow away newer changes. I don’t know if the issue is with Obsidian or with iCloud, but they are trying to push me towards their proprietary sync:</p>

<p><img src="/assets/obsidian_sync_message.png" alt="obsidian sync message" /></p>

<p>My current tooling is Mac only. This isn’t an issue for me today, as I have a Mac laptop and an iPhone. I do have Linux devices, but I don’t currently use them for notes. I still don’t love being tied to the Apple ecosystem, though, so I would prefer something that is cross platform.</p>

<h2 id="summary">Summary</h2>

<p>I’ve been using this setup for the last 6 months or so, and all in all, it’s working pretty well for me. The Obsidian mobile app is good, and it’s easy to edit and search. I do hope that iCloud syncing improves over time. But since it’s all just text files and I don’t use the advanced Obsidian features, I don’t feel too locked in, and I could switch to different apps easily.</p>

<p>If you have suggestions on other approaches or tools I should try, please let me know!</p>]]></content><author><name>Paul Gross</name><email>pgross@gmail.com</email></author><summary type="html"><![CDATA[I love keeping notes. Everything from meeting summaries to packing lists to books I’ve read and more. And these days, I tend to favor digital notes. I’m faster at typing than writing, and I love being able to search them. And with my phone, I don’t have to carry around a notebook.]]></summary></entry><entry><title type="html">DuckDB over Pandas/Polars</title><link href="https://www.pgrs.net/2024/11/01/duckdb-over-pandas-polars/" rel="alternate" type="text/html" title="DuckDB over Pandas/Polars" /><published>2024-11-01T00:00:00-07:00</published><updated>2024-11-01T00:00:00-07:00</updated><id>https://www.pgrs.net/2024/11/01/duckdb-over-pandas-polars</id><content type="html" xml:base="https://www.pgrs.net/2024/11/01/duckdb-over-pandas-polars/"><![CDATA[<p>Since my previous post on <a href="https://duckdb.org/">DuckDB</a> (<a href="/2024/03/21/duckdb-as-the-new-jq/">DuckDB as the New jq</a>), I’ve been continuing to use and enjoy DuckDB.</p>

<p>Recently, I wanted to analyze and visualize some financial CSVs, including joining a few files together. I started out with <a href="https://pola.rs/">Polars</a> (which I understood to be a newer/better <a href="https://pandas.pydata.org/">Pandas</a>). However, as someone who doesn’t use it frequently, I found the syntax confusing and cumbersome.</p>

<p>For example, here is how I parsed a <code class="language-plaintext highlighter-rouge">Transactions.csv</code> and summed entries by <code class="language-plaintext highlighter-rouge">Category</code> for rows in 2024 (simplified example, code formatted with <a href="https://github.com/psf/black">Black</a>):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">pl</span><span class="p">.</span><span class="nf">read_csv</span><span class="p">(</span><span class="sh">"</span><span class="s">Transactions.csv</span><span class="sh">"</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">df</span><span class="p">.</span><span class="nf">select</span><span class="p">(</span><span class="sh">"</span><span class="s">Date</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">Category</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">Amount</span><span class="sh">"</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">with_columns</span><span class="p">(</span>
        <span class="n">pl</span><span class="p">.</span><span class="nf">col</span><span class="p">(</span><span class="sh">"</span><span class="s">Date</span><span class="sh">"</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="nf">to_date</span><span class="p">(</span><span class="sh">"</span><span class="s">%m/%d/%Y</span><span class="sh">"</span><span class="p">),</span>
        <span class="n">pl</span><span class="p">.</span><span class="nf">col</span><span class="p">(</span><span class="sh">"</span><span class="s">Amount</span><span class="sh">"</span><span class="p">)</span>
        <span class="p">.</span><span class="nf">map_elements</span><span class="p">(</span><span class="k">lambda</span> <span class="n">amount</span><span class="p">:</span> <span class="n">amount</span><span class="p">.</span><span class="nf">replace</span><span class="p">(</span><span class="sh">"</span><span class="s">$</span><span class="sh">"</span><span class="p">,</span> <span class="sh">""</span><span class="p">))</span>
        <span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="nf">to_decimal</span><span class="p">(),</span>
    <span class="p">)</span>
    <span class="p">.</span><span class="nf">filter</span><span class="p">(</span><span class="n">pl</span><span class="p">.</span><span class="nf">col</span><span class="p">(</span><span class="sh">"</span><span class="s">Date</span><span class="sh">"</span><span class="p">)</span> <span class="o">&gt;</span> <span class="n">datetime</span><span class="p">.</span><span class="nf">date</span><span class="p">(</span><span class="mi">2024</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
    <span class="p">.</span><span class="nf">group_by</span><span class="p">(</span><span class="sh">"</span><span class="s">Category</span><span class="sh">"</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">agg</span><span class="p">(</span><span class="n">pl</span><span class="p">.</span><span class="nf">col</span><span class="p">(</span><span class="sh">"</span><span class="s">Amount</span><span class="sh">"</span><span class="p">).</span><span class="nf">sum</span><span class="p">())</span>
<span class="p">)</span>

<span class="nf">print</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div></div>

<p>Things that tripped me up:</p>

<ul>
  <li>The syntax for selecting and transforming columns</li>
  <li>Telling it how to parse the month/day/year column</li>
  <li>Writing a lambda to strip out the <code class="language-plaintext highlighter-rouge">$</code> (maybe there is a better way to do this?)</li>
  <li>The mix of <code class="language-plaintext highlighter-rouge">df.</code> and <code class="language-plaintext highlighter-rouge">pl.</code> calls, such as calling <code class="language-plaintext highlighter-rouge">df.group_by</code> but passing in <code class="language-plaintext highlighter-rouge">pl.col(...).sum(...)</code> as the argument to the aggregation</li>
</ul>

<p>I’m sure this is straightforward for someone who uses these tools frequently. However, that’s not me. I play around for a bit and then come back to it weeks or months later and have to relearn.</p>

<p>In contrast, I write SQL day in and day out, so I find it much easier. Once I switched to DuckDB, I could write much more familiar (to me) SQL, while still using python for the rest of the code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">results</span> <span class="o">=</span> <span class="n">duckdb</span><span class="p">.</span><span class="nf">sql</span><span class="p">(</span>
    <span class="sh">"""</span><span class="s">
    select
        Category,
        sum(replace(Amount, </span><span class="sh">'</span><span class="s">$</span><span class="sh">'</span><span class="s">, </span><span class="sh">''</span><span class="s">)::decimal) as Amount
    from read_csv(</span><span class="sh">'</span><span class="s">Transactions.csv</span><span class="sh">'</span><span class="s">)
    where Date &gt; </span><span class="sh">'</span><span class="s">2024-01-01</span><span class="sh">'</span><span class="s">
    group by Category
</span><span class="sh">"""</span>
<span class="p">)</span>
<span class="n">results</span><span class="p">.</span><span class="nf">show</span><span class="p">()</span>
</code></pre></div></div>

<p>Note that DuckDB automatically figured out how to parse the date column.</p>

<p>And I can even join multiple CSVs together with SQL and add more complex <code class="language-plaintext highlighter-rouge">WHERE</code> conditions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">results</span> <span class="o">=</span> <span class="n">duckdb</span><span class="p">.</span><span class="nf">sql</span><span class="p">(</span>
    <span class="sh">"""</span><span class="s">
    select
        c.Group,
        sum(replace(t.Amount, </span><span class="sh">'</span><span class="s">$</span><span class="sh">'</span><span class="s">, </span><span class="sh">''</span><span class="s">)::decimal) as Amount
    from read_csv(</span><span class="sh">'</span><span class="s">Transactions.csv</span><span class="sh">'</span><span class="s">) t
    join read_csv(</span><span class="sh">'</span><span class="s">Categories.csv</span><span class="sh">'</span><span class="s">) c on c.Category = t.Category
    where t.Date &gt; </span><span class="sh">'</span><span class="s">2024-01-01</span><span class="sh">'</span><span class="s">
    and c.Type in (</span><span class="sh">'</span><span class="s">Income</span><span class="sh">'</span><span class="s">, </span><span class="sh">'</span><span class="s">Expense</span><span class="sh">'</span><span class="s">)
    group by c.Group
</span><span class="sh">"""</span>
<span class="p">)</span>
<span class="n">results</span><span class="p">.</span><span class="nf">show</span><span class="p">()</span>
</code></pre></div></div>

<p>In summary, I find DuckDB powerful, easy, and fun to use.</p>

<p><strong>Update:</strong></p>

<p>A <a href="https://www.reddit.com/r/DuckDB/comments/1ghd6t4/comment/luxjjnq/">Reddit comment</a> showed me how to remove the <code class="language-plaintext highlighter-rouge">map_elements</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pl</span><span class="p">.</span><span class="nf">col</span><span class="p">(</span><span class="sh">"</span><span class="s">Amount</span><span class="sh">"</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="nf">replace</span><span class="p">(</span><span class="sh">"</span><span class="se">\\</span><span class="s">$</span><span class="sh">"</span><span class="p">,</span> <span class="sh">""</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="nf">to_decimal</span><span class="p">()</span>
</code></pre></div></div>

<p>But I think the double use of <code class="language-plaintext highlighter-rouge">.str</code> is a good example of how this is complex to me as a casual user.</p>

<p><strong>Update 2:</strong></p>

<p>Another <a href="https://www.reddit.com/r/DuckDB/comments/1ghd6t4/comment/lvmx8m0/">Reddit comment</a> showed how to do “a shorter (no intermediary steps) and more efficient (scan) version”:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">pl</span><span class="p">.</span><span class="nf">scan_csv</span><span class="p">(</span><span class="sh">"</span><span class="s">Transactions.csv</span><span class="sh">"</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">filter</span><span class="p">(</span><span class="n">pl</span><span class="p">.</span><span class="nf">col</span><span class="p">(</span><span class="sh">"</span><span class="s">Date</span><span class="sh">"</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="nf">to_date</span><span class="p">(</span><span class="sh">"</span><span class="s">%m/%d/%Y</span><span class="sh">"</span><span class="p">)</span> <span class="o">&gt;</span> <span class="n">datetime</span><span class="p">.</span><span class="nf">date</span><span class="p">(</span><span class="mi">2024</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
    <span class="p">.</span><span class="nf">group_by</span><span class="p">(</span><span class="sh">"</span><span class="s">Category</span><span class="sh">"</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">agg</span><span class="p">(</span><span class="n">pl</span><span class="p">.</span><span class="nf">col</span><span class="p">(</span><span class="sh">"</span><span class="s">Amount</span><span class="sh">"</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="nf">replace</span><span class="p">(</span><span class="sh">"</span><span class="se">\\</span><span class="s">$</span><span class="sh">"</span><span class="p">,</span> <span class="sh">""</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="nf">to_decimal</span><span class="p">().</span><span class="nf">sum</span><span class="p">())</span>
    <span class="p">.</span><span class="nf">collect</span><span class="p">()</span>
<span class="p">)</span>

<span class="nf">print</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Discussions:</strong></p>

<p>There are some good discussions about this post, especially around the increased composability of Polars/Pandas vs SQL and better ways to write the Polars code:</p>

<ul>
  <li><a href="https://lobste.rs/s/rlkltp/duckdb_over_pandas_polars">https://lobste.rs/s/rlkltp/duckdb_over_pandas_polars</a></li>
  <li><a href="https://www.reddit.com/r/DuckDB/comments/1ghd6t4/duckdb_over_pandaspolars/">https://www.reddit.com/r/DuckDB/comments/1ghd6t4/duckdb_over_pandaspolars/</a></li>
</ul>]]></content><author><name>Paul Gross</name><email>pgross@gmail.com</email></author><summary type="html"><![CDATA[Since my previous post on DuckDB (DuckDB as the New jq), I’ve been continuing to use and enjoy DuckDB.]]></summary></entry><entry><title type="html">The Many Ways To Read Tech News</title><link href="https://www.pgrs.net/2024/03/29/the-many-ways-to-read-tech-news/" rel="alternate" type="text/html" title="The Many Ways To Read Tech News" /><published>2024-03-29T00:00:00-07:00</published><updated>2024-03-29T00:00:00-07:00</updated><id>https://www.pgrs.net/2024/03/29/the-many-ways-to-read-tech-news</id><content type="html" xml:base="https://www.pgrs.net/2024/03/29/the-many-ways-to-read-tech-news/"><![CDATA[<p>When I am interested in reading tech news, I have a few sites I often visit, such as:</p>
<ul>
  <li><a href="https://news.ycombinator.com/">https://news.ycombinator.com</a> (aka Hacker News)</li>
  <li><a href="https://lobste.rs/">https://lobste.rs</a></li>
  <li><a href="https://www.reddit.com/r/programming">https://www.reddit.com/r/programming</a></li>
</ul>

<p>These sites and many others collect links to other original sources, such as blog posts, news articles, tech projects, etc. Until recently I wasn’t aware of just how many sites aggregate the same links.</p>

<p>I wrote a couple of blog posts recently that managed to get on the front page of Hacker News, Lobste.rs, etc. and saw just how many places linked to my post.</p>

<p>This is by no means scientific, but I tried to capture and group some of the popular aggregators, roughly sorted by how many referrals I received from each one. There are many, many more that seem to be single-person hobby projects. Hacker News in particular seems to attract a lot of folks building and hosting alternative frontends.</p>

<p>Maybe you’ll discover a new preferred way to read tech news.</p>

<h2 id="hacker-news-alternative-frontends">Hacker News Alternative Frontends</h2>

<p>Judging by these sites, there are a few reasons why people run alternative frontends to Hacker News. Most of these present the same information but with a different user interface. Some change the content by adjusting the rankings of posts, snapshotting different time periods (e.g. top links per day for each day), or adding summaries. I’ve commented on a few of them in the list:</p>

<ul>
  <li><a href="https://hckrnews.com/">https://hckrnews.com</a></li>
  <li><a href="https://hn.algolia.com/">https://hn.algolia.com</a> (Search, filtering and sorting)</li>
  <li><a href="https://www.daemonology.net/hn-daily/">https://www.daemonology.net/hn-daily</a> (Top posts per day)</li>
  <li><a href="https://hackerweb.app">https://hackerweb.app</a></li>
  <li><a href="https://hn.premii.com/">https://hn.premii.com</a></li>
  <li><a href="https://hn.svelte.dev/">https://hn.svelte.dev</a></li>
  <li><a href="https://hackernews.betacat.io/">https://hackernews.betacat.io</a> (ChatGPT summaries)</li>
  <li><a href="https://www.hackernewz.com/">https://www.hackernewz.com</a></li>
  <li><a href="https://news.social-protocols.org/">https://news.social-protocols.org</a></li>
  <li><a href="https://hnrss.github.io/">https://hnrss.github.io</a> (Custom, realtime RSS feeds)</li>
  <li><a href="https://hnrankings.info/">https://hnrankings.info</a> (Rank of posts over time)</li>
  <li><a href="https://vue-hn.netlify.app/">https://vue-hn.netlify.app</a></li>
  <li><a href="https://hn42.net/">https://hn42.net</a></li>
  <li><a href="https://hn.toonmaterial.com/">https://hn.toonmaterial.com</a></li>
  <li><a href="http://hnapp.com/">http://hnapp.com</a> (Search interface)</li>
  <li><a href="http://hn.elijames.org/">http://hn.elijames.org</a></li>
  <li><a href="https://yester-hn.riched.net/">https://yester-hn.riched.net</a></li>
  <li><a href="https://ycombinator.mytools.pw/">https://ycombinator.mytools.pw</a></li>
  <li><a href="https://yahni.news/">https://yahni.news</a></li>
  <li><a href="https://www.hntoplinks.com/">https://www.hntoplinks.com</a></li>
  <li><a href="https://www.hakaran.com/">https://www.hakaran.com</a></li>
  <li><a href="https://www.distilhn.com/">https://www.distilhn.com</a> (AI generated summaries)</li>
  <li><a href="https://whnex.com/">https://whnex.com</a></li>
  <li><a href="https://viralerts.com/">https://viralerts.com</a></li>
  <li><a href="https://slacker-news.fly.dev/">https://slacker-news.fly.dev</a></li>
  <li><a href="https://news.sune.one/">https://news.sune.one</a></li>
  <li><a href="https://modernorange.io/">https://modernorange.io</a></li>
  <li><a href="https://innerself-hn.com/">https://innerself-hn.com</a></li>
  <li><a href="https://hn.test.tube/">https://hn.test.tube</a> (Top posts by time period)</li>
  <li><a href="https://hackernewsday.com/">https://hackernewsday.com</a> (Top posts by day)</li>
</ul>

<h2 id="aggregate-multiple-sources">Aggregate Multiple Sources</h2>

<p>These are the sites which aggregate multiple sources into one interface. Most of these show separate lists for each source, but a few do combine them into unified lists.</p>

<ul>
  <li><a href="https://hackurls.com/">https://hackurls.com</a></li>
  <li><a href="https://brutalist.report/">https://brutalist.report</a></li>
  <li><a href="https://skimfeed.com/">https://skimfeed.com</a></li>
  <li><a href="https://devurls.com/">https://devurls.com</a></li>
  <li><a href="https://jimmyr.com/">https://jimmyr.com</a></li>
  <li><a href="https://progscrape.com/">https://progscrape.com</a> (Unified list)</li>
  <li><a href="https://serializer.io/">https://serializer.io</a> (Unified list)</li>
  <li><a href="https://www.freshnews.org/">https://www.freshnews.org</a></li>
  <li><a href="https://spike.news/">https://spike.news</a></li>
  <li><a href="https://techurls.com/">https://techurls.com</a></li>
  <li><a href="https://upstract.com/">https://upstract.com</a></li>
  <li><a href="https://nuuz.io/">https://nuuz.io</a></li>
  <li><a href="https://now.hackertab.dev/">https://now.hackertab.dev</a></li>
  <li><a href="https://webdigest.pages.dev/">https://webdigest.pages.dev</a></li>
  <li><a href="https://summary.nz/">https://summary.nz</a></li>
  <li><a href="https://readspike.com/">https://readspike.com</a></li>
  <li><a href="https://popuris.com/">https://popuris.com</a></li>
  <li><a href="https://old.thenews.im/">https://old.thenews.im</a></li>
  <li><a href="https://news.t0.vc/">https://news.t0.vc</a> (Unified list)</li>
  <li><a href="https://news.cyberwar.nl/">https://news.cyberwar.nl</a></li>
  <li><a href="https://linksfor.dev/">https://linksfor.dev</a></li>
</ul>

<h2 id="mailing-lists">Mailing Lists</h2>

<p>There are also a handful of popular mailing lists, likely getting their stories from the main aggregators. I doubt I have a full list here, however, since the referrer is generally an email client instead of the website.</p>

<ul>
  <li><a href="https://tldr.tech/">https://tldr.tech</a></li>
  <li><a href="https://hackernewsletter.com">https://hackernewsletter.com</a></li>
  <li><a href="https://daily.dev/">https://daily.dev</a></li>
  <li><a href="https://www.pointer.io/">https://www.pointer.io</a></li>
  <li><a href="https://www.bigdatanewsweekly.com/">https://www.bigdatanewsweekly.com</a></li>
  <li><a href="https://www.hndigest.com/">https://www.hndigest.com</a></li>
  <li><a href="https://architecturenotes.co/">https://architecturenotes.co</a></li>
</ul>

<p>I also learned about <a href="https://kill-the-newsletter.com/">https://kill-the-newsletter.com</a>, which converts newsletters into Atom feeds that can then be consumed via news readers.</p>

<h2 id="misc">Misc</h2>

<p>There are other popular tech podcasts, but <a href="https://changelog.com/">The Changelog</a> has really good show notes with links, so it was easy to see that folks had clicked through.</p>

<p>And of course there are other sets of user-submitted posts where I saw some traffic, such as:</p>
<ul>
  <li><a href="https://slashdot.org/">https://slashdot.org</a></li>
  <li><a href="https://alterslash.org/">https://alterslash.org</a> (Alternative Slashdot frontend)</li>
  <li><a href="https://www.sqox.com/">https://www.sqox.com</a></li>
  <li><a href="https://lemmy.world/">https://lemmy.world</a></li>
</ul>

<p>I also came across a bunch of sites in other languages, but I left those out because in most cases, it was hard for me to determine quite what they were doing.</p>]]></content><author><name>Paul Gross</name><email>pgross@gmail.com</email></author><summary type="html"><![CDATA[When I am interested in reading tech news, I have a few sites I often visit, such as: https://news.ycombinator.com (aka Hacker News) https://lobste.rs https://www.reddit.com/r/programming]]></summary></entry><entry><title type="html">DuckDB as the New jq</title><link href="https://www.pgrs.net/2024/03/21/duckdb-as-the-new-jq/" rel="alternate" type="text/html" title="DuckDB as the New jq" /><published>2024-03-21T00:00:00-07:00</published><updated>2024-03-21T00:00:00-07:00</updated><id>https://www.pgrs.net/2024/03/21/duckdb-as-the-new-jq</id><content type="html" xml:base="https://www.pgrs.net/2024/03/21/duckdb-as-the-new-jq/"><![CDATA[<p>Recently, I’ve been interested in the <a href="https://duckdb.org/">DuckDB</a> project (like a <a href="https://www.sqlite.org/">SQLite</a> geared towards data applications). And one of the amazing features is that it has many data importers included without requiring extra dependencies. This means it can natively read and parse JSON as a database table, among many other formats.</p>

<p>I work extensively with JSON day to day, and I often reach for <a href="https://jqlang.github.io/jq/">jq</a> when exploring documents. I love <code class="language-plaintext highlighter-rouge">jq</code>, but I find it hard to use. The syntax is super powerful, but I have to study the docs anytime I want to do anything beyond just selecting fields.</p>


<p>Once I learned DuckDB could read JSON files directly into memory, I realized that I could use it for many of the things where I’m currently using <code class="language-plaintext highlighter-rouge">jq</code>. In contrast to the complicated and custom <code class="language-plaintext highlighter-rouge">jq</code> syntax, I’m very familiar with SQL and use it almost daily.</p>

<p>Here’s an example:</p>

<p>First, we fetch some sample JSON to play around with. I used the GitHub API to grab the repository information from the golang org:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% curl <span class="s1">'https://api.github.com/orgs/golang/repos'</span> <span class="o">&gt;</span> repos.json
</code></pre></div></div>

<p>Now, as a sample question to answer, let’s get some stats on the types of open source licenses used.</p>

<p>The JSON structure looks like this:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="w">
  </span><span class="p">{</span><span class="w">
    </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="mi">1914329</span><span class="p">,</span><span class="w">
    </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gddo"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"license"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"key"</span><span class="p">:</span><span class="w"> </span><span class="s2">"bsd-3-clause"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"BSD 3-Clause </span><span class="se">\"</span><span class="s2">New</span><span class="se">\"</span><span class="s2"> or </span><span class="se">\"</span><span class="s2">Revised</span><span class="se">\"</span><span class="s2"> License"</span><span class="p">,</span><span class="w">
      </span><span class="err">...</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="err">...</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="p">{</span><span class="w">
    </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="mi">11440704</span><span class="p">,</span><span class="w">
    </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"glog"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"license"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"key"</span><span class="p">:</span><span class="w"> </span><span class="s2">"apache-2.0"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Apache License 2.0"</span><span class="p">,</span><span class="w">
      </span><span class="err">...</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="err">...</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="err">...</span><span class="w">
</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>

<p>This might not be the best way, but here is what I cobbled together after searching and reading some docs for how to do this in <code class="language-plaintext highlighter-rouge">jq</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> % <span class="nb">cat </span>repos.json | jq <span class="se">\</span>
  <span class="s1">'group_by(.license.key)
  | map({license: .[0].license.key, count: length})
  | sort_by(.count)
  | reverse'</span>
<span class="o">[</span>
  <span class="o">{</span>
    <span class="s2">"license"</span>: <span class="s2">"bsd-3-clause"</span>,
    <span class="s2">"count"</span>: 23
  <span class="o">}</span>,
  <span class="o">{</span>
    <span class="s2">"license"</span>: <span class="s2">"apache-2.0"</span>,
    <span class="s2">"count"</span>: 5
  <span class="o">}</span>,
  <span class="o">{</span>
    <span class="s2">"license"</span>: null,
    <span class="s2">"count"</span>: 2
  <span class="o">}</span>
<span class="o">]</span>
</code></pre></div></div>

<p>And here is what it looks like in DuckDB using SQL:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% duckdb <span class="nt">-c</span> <span class="se">\</span>
  <span class="s2">"select license-&gt;&gt;'key' as license, count(*) as count </span><span class="se">\</span><span class="s2">
  from 'repos.json' </span><span class="se">\</span><span class="s2">
  group by 1 </span><span class="se">\</span><span class="s2">
  order by count desc"</span>
┌──────────────┬───────┐
│   license    │ count │
│   varchar    │ int64 │
├──────────────┼───────┤
│ bsd-3-clause │    23 │
│ apache-2.0   │     5 │
│              │     2 │
└──────────────┴───────┘
</code></pre></div></div>

<p>For me, this SQL is much simpler and I was able to write it without looking at any docs. The only tricky part is querying nested JSON with the <code class="language-plaintext highlighter-rouge">-&gt;&gt;</code> operator. The syntax is the same as the <a href="https://www.postgresql.org/docs/current/functions-json.html">PostgreSQL JSON Functions</a>, however, so I was familiar with it.</p>

<p>And if we do need the output in JSON, there’s a DuckDB flag for that:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% duckdb <span class="nt">-json</span> <span class="nt">-c</span> <span class="se">\</span>
  <span class="s2">"select license-&gt;&gt;'key' as license, count(*) as count </span><span class="se">\</span><span class="s2">
  from 'repos.json' </span><span class="se">\</span><span class="s2">
  group by 1 </span><span class="se">\</span><span class="s2">
  order by count desc"</span>
<span class="o">[{</span><span class="s2">"license"</span>:<span class="s2">"bsd-3-clause"</span>,<span class="s2">"count"</span>:23<span class="o">}</span>,
<span class="o">{</span><span class="s2">"license"</span>:<span class="s2">"apache-2.0"</span>,<span class="s2">"count"</span>:5<span class="o">}</span>,
<span class="o">{</span><span class="s2">"license"</span>:null,<span class="s2">"count"</span>:2<span class="o">}]</span>
</code></pre></div></div>

<p>We can still pretty print with <code class="language-plaintext highlighter-rouge">jq</code> at the end, after using DuckDB to do the heavy lifting:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% duckdb <span class="nt">-json</span> <span class="nt">-c</span> <span class="se">\</span>
  <span class="s2">"select license-&gt;&gt;'key' as license, count(*) as count </span><span class="se">\</span><span class="s2">
  from 'repos.json' </span><span class="se">\</span><span class="s2">
  group by 1 </span><span class="se">\</span><span class="s2">
  order by count desc"</span> <span class="se">\</span>
  | jq
<span class="o">[</span>
  <span class="o">{</span>
    <span class="s2">"license"</span>: <span class="s2">"bsd-3-clause"</span>,
    <span class="s2">"count"</span>: 23
  <span class="o">}</span>,
  <span class="o">{</span>
    <span class="s2">"license"</span>: <span class="s2">"apache-2.0"</span>,
    <span class="s2">"count"</span>: 5
  <span class="o">}</span>,
  <span class="o">{</span>
    <span class="s2">"license"</span>: null,
    <span class="s2">"count"</span>: 2
  <span class="o">}</span>
<span class="o">]</span>
</code></pre></div></div>

<p>JSON is just one of the many ways of importing data into DuckDB. This same approach would work for CSV, Parquet, Excel files, etc.</p>

<p>And I could choose to create tables and persist locally, but often I’m just interrogating data and don’t need the persistence.</p>

<p>Read more about DuckDB’s great JSON support in this blog post: <a href="https://duckdb.org/2023/03/03/json.html">Shredding Deeply Nested JSON, One Vector at a Time</a></p>

<p><strong>Update:</strong></p>

<p>I also learned that DuckDB can read the JSON directly from a URL, not just a local file:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% duckdb <span class="nt">-c</span> <span class="se">\</span>
  <span class="s2">"select license-&gt;&gt;'key' as license, count(*) as count </span><span class="se">\</span><span class="s2">
  from read_json('https://api.github.com/orgs/golang/repos') </span><span class="se">\</span><span class="s2">
  group by 1 </span><span class="se">\</span><span class="s2">
  order by count desc"</span>
</code></pre></div></div>]]></content><author><name>Paul Gross</name><email>pgross@gmail.com</email></author><summary type="html"><![CDATA[Recently, I’ve been interested in the DuckDB project (like a SQLite geared towards data applications). And one of the amazing features is that it has many data importers included without requiring extra dependencies. This means it can natively read and parse JSON as a database table, among many other formats.]]></summary></entry><entry><title type="html">Lessons Learned From Payments Startups</title><link href="https://www.pgrs.net/2024/01/26/lessons-learned-from-payments-startups/" rel="alternate" type="text/html" title="Lessons Learned From Payments Startups" /><published>2024-01-26T00:00:00-08:00</published><updated>2024-01-26T00:00:00-08:00</updated><id>https://www.pgrs.net/2024/01/26/lessons-learned-from-payments-startups</id><content type="html" xml:base="https://www.pgrs.net/2024/01/26/lessons-learned-from-payments-startups/"><![CDATA[<p>Over my career so far, I’ve worked in a number of payments companies, including several startups. At the last startup, I was involved in building out a payments platform from scratch (from first line of code). This post is a collection of thoughts and lessons learned. Hopefully, at least some of this is useful to others.</p>

<p>The sections are relatively independent, so here are some quick links:</p>
<ul>
  <li><a href="#use-the-tools-you-have-before-adding-new-tools">Use The Tools You Have</a></li>
  <li><a href="#optimize-for-change">Optimize For Change</a></li>
  <li><a href="#focus-on-iteration">Focus on Iteration</a></li>
  <li><a href="#testing">Testing</a></li>
  <li><a href="#modular-monolith">Modular Monolith</a></li>
  <li><a href="#put-everything-in-the-database">Put Everything in the Database</a></li>
  <li><a href="#make-it-easy-to-query-the-database">Make It Easy To Query The Database</a></li>
  <li><a href="#job-drain-pattern">Job Drain Pattern</a></li>
  <li><a href="#check-in-generated-files">Check in Generated Files</a></li>
  <li><a href="#decision-logs">Decision Logs</a></li>
  <li><a href="#continuous-deployment">Continuous Deployment</a></li>
</ul>

<h2 id="use-the-tools-you-have-before-adding-new-tools">Use The Tools You Have (Before Adding New Tools)</h2>

<p>Every new tool, language, database, etc. adds an enormous amount of complexity. You have to set it up and manage it (even managed offerings still require work), integrate with it, and learn its ins and outs (often only after it has failed in some way), and you will still find out things you didn’t even know to think about.</p>

<p>So before I reach for something new, I try to use what we have, even if it’s not the optimal thing. For example, my projects have often used <a href="https://www.postgresql.org/">PostgreSQL</a> as the database. PostgreSQL is quite full featured, so I try to use it for as much as possible. This includes job queues, search, and even simple caching (e.g. a table that stores temporary values which get cleared out over time). It’s not necessarily the ideal platform for these, but it’s so much easier to just manage the one database than a whole suite of data systems. And at some point, the app will outgrow PostgreSQL’s capability for one or more of these, but even deferring that decision and work is hugely valuable.</p>
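<p>As an illustration of the “simple caching” idea — a table of temporary values with an expiry, cleared out over time. This sketch uses SQLite so it is self-contained (the same approach carries over to PostgreSQL), and the table and key names are invented:</p>

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE simple_cache (key TEXT PRIMARY KEY, value TEXT, expires_at REAL)")

def cache_put(key, value, ttl_seconds):
    # Upsert the value with an expiry timestamp.
    conn.execute(
        "INSERT INTO simple_cache (key, value, expires_at) VALUES (?, ?, ?) "
        "ON CONFLICT(key) DO UPDATE SET value = excluded.value, expires_at = excluded.expires_at",
        (key, value, time.time() + ttl_seconds),
    )

def cache_get(key):
    # Only return entries that have not expired yet.
    row = conn.execute(
        "SELECT value FROM simple_cache WHERE key = ? AND expires_at > ?",
        (key, time.time()),
    ).fetchone()
    return row[0] if row else None

def cache_sweep():
    # Run periodically to clear out expired entries.
    conn.execute("DELETE FROM simple_cache WHERE expires_at <= ?", (time.time(),))

cache_put("fx_rate:USD-EUR", "0.92", ttl_seconds=60)
rate = cache_get("fx_rate:USD-EUR")  # "0.92" until the entry expires
```

<p>If the app later outgrows this, only these three small functions have to move to a dedicated cache.</p>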

<p>The same goes for introducing new languages and frameworks. When possible, I like to use what we have and only introduce something new once we’ve pushed the existing stuff to the breaking point.</p>

<p>Another advantage is that over time, a lot of software becomes deprecated, but not removed. Some product or feature is no longer maintained, but since it’s in active use, it’s not fully shut down or deleted. It’s bad enough to leave deprecated code and services running, but it’s even worse if this means you now have extra databases or other platform systems that still have to be maintained, but don’t provide any current value. Even deprecated systems still need security upgrades, migrations to new servers, and more.</p>

<h2 id="optimize-for-change">Optimize For Change</h2>

<p>This is especially true of startups, but change is a part of any software project. Requirements change, our understanding of the problems changes, technology changes, and even the focus of a company can change. So it’s important to ensure that the software can change as well. Sometimes this is subjective (which architecture is the most amenable to change) and other times it’s concrete.</p>

<p>For example, I worked on one system which had both a customer-installed on-premise system and a cloud hosted system. The on-premise system was extremely hard to change as it required customers to do their own upgrades (often on their own schedules). In contrast, the cloud hosted system was fully under our control. So optimizing for change meant putting as much into the cloud hosted system as possible and keeping the on-premise portion thin. That way, we needed fewer changes to the hard-to-change parts, and we could roll out as many changes as we needed to the cloud piece on our own schedule.</p>

<p>Optimizing for change can also help with architecture discussions and decisions. When deciding between alternatives, picking the one that is easiest to change later can be helpful. It’s easier to try new things when the cost of undoing that change isn’t as high. If the new framework or tool doesn’t work out, you can switch back or switch to something else that’s new.</p>

<p>In my opinion, one of the best ways to optimize for change is to keep things as simple as possible. Sometimes, folks will over-engineer current systems to try to predict how they will evolve in the future and to try to future-proof them now. One example of this is making things generic when there is only one type today. I think this is a mistake. Our guesses for how things will change are often incorrect, and it’s easier to change a simple system than a complex one. It’s also easier to maintain a simpler system today than carry the over-engineered baggage around with us.</p>

<h2 id="focus-on-iteration">Focus on Iteration</h2>

<p>It’s super important to be able to break down work into small, deliverable pieces. I’ve seen too many projects go months without showing any value. Sometimes they do finally deliver, but other times, they will get canceled or significantly altered instead. It’s far better to release piecemeal, even if it’s not fully featured. Feature flags and other ways to partially roll out features are great here. It allows you to get production feedback from a subset of customers, or even just internal folks. And it allows visible progress throughout a long project.</p>
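<p>One common way to implement a partial rollout (not specific to any particular feature-flag product) is to deterministically bucket users, so a given user stays in or out of the rollout across requests. A small sketch — the function and flag names are made up:</p>

```python
import hashlib

def flag_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into 0-99 so the same user always
    gets the same answer for a given flag (no flapping between requests)."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < rollout_percent

# Roll a hypothetical "new-checkout" feature out to 10% of users.
enabled = flag_enabled("new-checkout", "user-42", rollout_percent=10)
```

<p>Ramping up is then just raising the percentage, and internal folks or specific beta customers can be force-enabled with an allowlist on top.</p>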

<p>I find that a lot of frustration over software estimates and delivery time frames go away if folks can see visible progress over time, rather than a nebulous future delivery date.</p>

<p>One thing I wish I had a better solution for was making the stability of features more obvious. For example, I want to ship quickly to get feedback, but then I want to still be able to change that feature or API. However, once customers start using something, they often implicitly assume that it won’t change.</p>

<p>It would be great to find a way to mark features or APIs as alpha, beta, stable, etc and set clear expectations and time frames for those features. For example, encouraging customers to try out an alpha API, but knowing that it will change and they will have to update their integration periodically. Personally, I haven’t seen this done super well yet.</p>

<h2 id="testing">Testing</h2>

<p>Testing code is super valuable, and there are many different approaches with different trade-offs. A lot can be said on this topic, but I’ll just mention one aspect that I’ve been thinking about a lot: balancing speed and quality of tests.</p>

<p>In general, having a lot of tests lets you make changes with confidence. If a large, thorough suite of tests pass, you can be reasonably sure you haven’t broken something. It can even let you upgrade core components with confidence, such as the application framework or language version.</p>

<p>However, the more tests you have, the longer they will take to run. What starts as a suite of just a few seconds can easily take minutes or longer if you aren’t careful. One way this is addressed is by trying to isolate tests from other systems, often with mocking. For example, testing the core of the business logic without the database, or testing the API without actually opening connections and making API calls, or mocking out responses from 3rd party systems.</p>

<p>But the trade-off here is that as you isolate tests to make them faster, you may also make them less realistic and less able to catch problems. The mock based tests are fast, but perhaps the mock doesn’t work the same way as the real component in certain edge cases.</p>

<p>Or you want to change something about the interaction between components, and now you have to update hundreds of cases where you set up mocks for testing.</p>

<p>I don’t have a great answer for this one. I try to isolate code from external dependencies when I can (e.g. by writing business logic as simple functions that take their data as input). And for the rest at the edges or when testing interactions, I just try to be thoughtful about the trade-offs we make for speed vs accuracy with testing. I also tend to prefer fakes over mocks, where you have a mostly stable stand-in that is used across many tests instead of setting up mock expectations per test.</p>
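<p>To make the fakes-vs-mocks distinction concrete, here is a minimal sketch: one small, working in-memory stand-in shared across many tests, rather than per-test mock expectations. All names here (the gateway, the tokens) are hypothetical:</p>

```python
class FakePaymentGateway:
    """In-memory stand-in that behaves like a real gateway for common cases."""

    def __init__(self):
        self.charges = []  # lets tests inspect what was charged

    def charge(self, amount_cents, card_token):
        # A hypothetical magic token triggers the decline path.
        if card_token == "tok_declined":
            return {"status": "declined"}
        charge = {"status": "approved", "amount_cents": amount_cents}
        self.charges.append(charge)
        return charge

def checkout(gateway, amount_cents, card_token):
    # The business logic under test depends only on the gateway's interface.
    result = gateway.charge(amount_cents, card_token)
    return result["status"] == "approved"

# Tests reuse the same fake without restating expectations each time:
gateway = FakePaymentGateway()
assert checkout(gateway, 500, "tok_visa")
assert not checkout(gateway, 500, "tok_declined")
assert gateway.charges == [{"status": "approved", "amount_cents": 500}]
```

<p>If the gateway interaction changes, the fake is updated in one place instead of in hundreds of mock setups.</p>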

<h2 id="modular-monolith">Modular Monolith</h2>

<p>A lot has been said on modular monoliths elsewhere, so I’ll just add that I really like this approach. It’s really hard to know what the eventual seams of a software system will be, and it’s hard for a small team to work on many separate services (including hosting, deployment, monitoring, upgrades, etc).</p>

<p>In the recent cases where we used a monolith, I think it worked out really well. It will always be work to pull a service out of the monolith eventually, but we can try to be thoughtful about the code separation within the monolith to help make it easier (and to crystallize our thinking on what is a separate domain area). And we’re deferring these decisions until later, so we can focus on building more quickly now (which is especially important in a startup).</p>

<h2 id="put-everything-in-the-database">Put Everything in the Database</h2>

<p>I’m a big fan of storing almost everything in the database. I find that it makes things so much easier to understand and debug if you can query all of the relevant data together. Often, I will prefer the database to logging, since you can’t easily correlate logs with stored data (e.g. <a href="/2023/12/14/storing-external-requests/">Storing External Requests</a>).</p>

<p>For example, in payment systems, payments often move through many different states. It’s really helpful to have entries in the database that represent what changed and when, even if only the final state is important. Then, when trying to debug why a payment is in a weird state, we can see all of the relevant data in all of the tables in one place (e.g. in an events or audits table).</p>

<p>Adding a unique request identifier makes it even more useful. Then, you can associate a failed API request with all of its database records.</p>
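<p>A sketch of what this can look like: an events table that records each state change along with a request identifier, so one query pulls up everything that happened under a given request. SQLite keeps the example self-contained, and the schema is invented for illustration:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (id INTEGER PRIMARY KEY, state TEXT);
    CREATE TABLE payment_events (
        id INTEGER PRIMARY KEY,
        payment_id INTEGER REFERENCES payments(id),
        old_state TEXT,
        new_state TEXT,
        request_id TEXT,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
""")

def transition(payment_id, new_state, request_id):
    # Record the change itself, not just the final state.
    old = conn.execute("SELECT state FROM payments WHERE id = ?", (payment_id,)).fetchone()
    conn.execute("UPDATE payments SET state = ? WHERE id = ?", (new_state, payment_id))
    conn.execute(
        "INSERT INTO payment_events (payment_id, old_state, new_state, request_id) "
        "VALUES (?, ?, ?, ?)",
        (payment_id, old[0] if old else None, new_state, request_id),
    )
    conn.commit()

conn.execute("INSERT INTO payments (id, state) VALUES (1, 'created')")
transition(1, "authorized", request_id="req_abc123")
transition(1, "captured", request_id="req_def456")

# Debugging: everything that happened under one request identifier.
rows = conn.execute(
    "SELECT old_state, new_state FROM payment_events WHERE request_id = ?",
    ("req_abc123",),
).fetchall()
# rows → [("created", "authorized")]
```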

<p>There are practical considerations, however, as data sizes really start to grow. One strategy I’ve used is to store some of this data with a shorter lifespan. For example, log style data may only be useful for a few weeks, so it can be deleted after that. Or exported to files and archived separately.</p>

<p>Another issue is with Personally Identifiable Information (PII). There are often legal and ethical requirements for this type of data, so it needs to be considered on a case by case basis. Sometimes, it can still be stored, but only for a short time. Other times, it should be scrubbed or excluded from the database.</p>

<h2 id="make-it-easy-to-query-the-database">Make It Easy To Query The Database</h2>

<p>Once you get everything into the database, I find it super helpful to give folks an easy way to query it. Recently, I used <a href="https://github.com/metabase/metabase">Metabase</a> and really enjoyed how it allowed easy, web based querying and graphing of our data. We set it up with a read-only connection to a read replica, so there was little concern of impacting production or accidentally changing data. We found that both developers and non-technical folks used it extensively.</p>

<p>For example, we made dashboards where you could enter an <code class="language-plaintext highlighter-rouge">orderId</code> and see all of the data from all of the tables that stored associated data. This was hugely valuable for debugging and for our support folks.</p>

<p>Again, there are considerations of who can see the data, and how much of it. But in general, giving folks the ability to answer their own data questions is super powerful, and it takes load off developers. And building shared dashboards and graphs so everyone can watch the same metrics was very powerful.</p>

<h2 id="job-drain-pattern">Job Drain Pattern</h2>

<p>Once a system outgrows a single database, data consistency issues start to pop up. Even introducing a background job system or a search tool can start to surface issues. For example, the write to the main database succeeded, but the process that copied the data to the search tool failed. Or the background job was queued before the main database transaction was committed.</p>

<p>There are various ways to solve this problem, and in particular, I like the job drain pattern, written up well at <a href="https://brandur.org/job-drain">Transactionally Staged Job Drains in Postgres</a>. I’ve used this pattern on several different projects successfully.</p>
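<p>The core of the pattern: stage the job row in the same transaction as the business write, then have a separate drainer move staged jobs into the real queue. A minimal sketch, with SQLite standing in for PostgreSQL and a plain list standing in for the real job system:</p>

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (id INTEGER PRIMARY KEY, amount_cents INTEGER);
    CREATE TABLE staged_jobs (id INTEGER PRIMARY KEY, kind TEXT, payload TEXT);
""")

real_queue = []  # stand-in for the actual job system

# 1. Stage the job in the SAME transaction as the business write, so the
#    staged job exists if and only if the payment row was committed.
with conn:
    cur = conn.execute("INSERT INTO payments (amount_cents) VALUES (?)", (500,))
    conn.execute(
        "INSERT INTO staged_jobs (kind, payload) VALUES (?, ?)",
        ("send_receipt", json.dumps({"payment_id": cur.lastrowid})),
    )

# 2. A separate drainer repeatedly moves staged jobs to the real queue,
#    deleting each row only after it has been enqueued.
def drain():
    with conn:
        jobs = conn.execute("SELECT id, kind, payload FROM staged_jobs ORDER BY id").fetchall()
        for job_id, kind, payload in jobs:
            real_queue.append((kind, json.loads(payload)))
            conn.execute("DELETE FROM staged_jobs WHERE id = ?", (job_id,))

drain()
# real_queue → [("send_receipt", {"payment_id": 1})]
```

<p>Because the staged row commits atomically with the payment, a crash loses either both or neither; deleting only after enqueueing gives at-least-once delivery to the queue.</p>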

<h2 id="check-in-generated-files">Check in Generated Files</h2>

<p>Similar to putting everything in the database is putting everything into git. For me, this includes generated files when possible. I know a lot of ecosystems prefer generating only at build time into temporary directories, but I really like having them in git. I find it really useful to be able to diff these files when making changes, such as upgrading the generation library or code. Otherwise, it can be hard to tell if anything meaningful has changed, or if more has changed than you expected.</p>

<p>When working with <a href="https://gradle.org">Gradle</a>, I also like to check in the generated <a href="https://docs.gradle.org/current/userguide/dependency_locking.html">lockfiles</a> that specify the exact version of every transitive dependency. Then, when <a href="https://docs.github.com/en/code-security/dependabot">Dependabot</a>/<a href="https://github.com/renovatebot/renovate">Renovate</a>/etc perform automated upgrades, it’s easy to see which transitive dependencies have also changed.</p>
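<p>For reference, enabling this in Gradle is a small build-script change — roughly the following (see the Gradle dependency locking docs linked above for the full details):</p>

```groovy
// In build.gradle: opt every configuration into dependency locking
dependencyLocking {
    lockAllConfigurations()
}
```

<p>Lock files are then written or refreshed with <code class="language-plaintext highlighter-rouge">./gradlew dependencies --write-locks</code>, and the generated <code class="language-plaintext highlighter-rouge">gradle.lockfile</code> gets checked into git.</p>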

<h2 id="decision-logs">Decision Logs</h2>

<p>I think in general, a lot of internal documentation is wasted effort. People spend countless hours writing up product plans or docs that are never looked at again.</p>

<p>However, I do think some documentation is often valuable. In particular, I like Decision Logs. The idea is that whenever the team needs to make a decision, that decision is captured in some light documentation. I think it serves two purposes:</p>

<ol>
  <li>
    <p>Writing up the options along with the advantages and disadvantages of each helps clarify thinking, and helps make better decisions. It shows what you’ve considered, and allows others to note gaps or misunderstandings. It’s also often helpful to clarify what you are <em>not</em> trying to address with the decision, i.e. what’s out of scope.</p>
  </li>
  <li>
    <p>Months or even years later, looking back at the Decision Log can be useful to understand why the system is designed a certain way. For example, someone new is hired and doesn’t understand why you chose Database X over Database Y. They can go read the entry. Or when someone proposes something new that’s already been considered, you can go back and see why it wasn’t chosen previously and if anything in the situation has changed (e.g. with the company or the capabilities of the tool). The Decision Log helps to remove “institutional knowledge” where only a handful of old-timers know the reasons for anything.</p>
  </li>
</ol>

<p>I do think that these Decision Logs (and other documentation) should be kept relatively light, however. Folks should not spend days writing them up.</p>

<h2 id="continuous-deployment">Continuous Deployment</h2>

<p>I’m a big fan of continuous deployment. This can look different on different projects, but ideally, every commit to the main branch will deploy to both test and production environments. I see a number of benefits:</p>

<ol>
  <li>It means the time between commit and production is small, so completed work gets into the hands of users quickly. You also don’t have to worry about when code will be released, and when other code that depends on it can also be released. You can merge a change, let it deploy, and then merge another change.</li>
  <li>It requires the deploys to be fully automated, which both makes them repeatable and also generally discoverable. Anyone can see what steps are run for every deployment and they are always the same (no hidden steps). Furthermore, if there’s ever a need for a manual deployment, someone can go look at the scripts and run the same commands.</li>
  <li>It removes an often time consuming developer chore. Now, deploys just happen and you don’t have to spend time coordinating or performing them.</li>
</ol>

<p>For beta features, or features that aren’t ready to be visible to everyone, I think feature flags work well. There are lots of libraries and products in this space, but it’s possible to start simple with what is built into GitLab: <a href="https://docs.gitlab.com/ee/operations/feature_flags.html">https://docs.gitlab.com/ee/operations/feature_flags.html</a></p>]]></content><author><name>Paul Gross</name><email>pgross@gmail.com</email></author><summary type="html"><![CDATA[Over my career so far, I’ve worked in a number of payments companies, including several startups. At the last startup, I was involved in building out a payments platform from scratch (from first line of code). This post is a collection of thoughts and lessons learned. Hopefully, at least some of this is useful to others.]]></summary></entry></feed>