<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>SQL and Data Blog</title>
	<atom:link href="https://www.dpriver.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.dpriver.com/blog/</link>
	<description>SQL-related blog for database professionals</description>
	<lastBuildDate>Fri, 01 May 2026 07:31:30 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.2.9</generator>

<image>
	<url>https://www.dpriver.com/blog/wp-content/uploads/2022/07/cropped-logo_150_150-32x32.png</url>
	<title>SQL and Data Blog</title>
	<link>https://www.dpriver.com/blog/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Why DataHub Loses Column-Level Lineage on dbt Deduplication Macros — and How to Recover It</title>
		<link>https://www.dpriver.com/blog/2026/05/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re/</link>
		
		<dc:creator><![CDATA[James]]></dc:creator>
		<pubDate>Fri, 01 May 2026 06:24:30 +0000</pubDate>
				<category><![CDATA[Data Governance]]></category>
		<category><![CDATA[gsp]]></category>
		<category><![CDATA[SQLFlow]]></category>
		<category><![CDATA[array-agg]]></category>
		<category><![CDATA[bigquery]]></category>
		<category><![CDATA[column-lineage]]></category>
		<category><![CDATA[datahub]]></category>
		<category><![CDATA[dbt]]></category>
		<category><![CDATA[deduplication]]></category>
		<category><![CDATA[sqlglot]]></category>
		<guid isPermaLink="false">https://www.dpriver.com/blog/?p=3211</guid>

					<description><![CDATA[<p>DataHub silently drops column-level lineage on the dbt-utils deduplicate macro because of how sqlglot's column resolver handles ARRAY_AGG + struct unpack. Here's why — and an open-source post-processor that recovers the missing lineage.</p>
<p>The post <a rel="nofollow" href="https://www.dpriver.com/blog/2026/05/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re/">Why DataHub Loses Column-Level Lineage on dbt Deduplication Macros — and How to Recover It</a> appeared first on <a rel="nofollow" href="https://www.dpriver.com/blog">SQL and Data Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>If your dbt + BigQuery stack lands models in DataHub, you may have hit a quiet failure mode on a subset of your tables: the <strong>table-level</strong> lineage edge appears, but the <strong>column-level</strong> lineage is empty. The graph looks complete — until you click into a column and find no upstream.</p>



<p>This is a known issue (<a href="https://github.com/datahub-project/datahub/issues/11670">datahub-project/datahub#11670</a>, open since October 2024). The root cause is specific: the BigQuery SQL that the <a href="https://github.com/dbt-labs/dbt-utils/blob/main/macros/sql/deduplicate.sql">dbt-utils <code>deduplicate</code> macro</a> emits uses two BigQuery-specific semantics — row-as-STRUCT and <code>STRUCT.&#042;</code> field expansion — that DataHub&#8217;s bundled SQL parser (sqlglot) does not currently model in its column-level lineage walker.</p>



<p>This post explains exactly what&#8217;s going on, with an honest accounting of what&#8217;s recoverable from the SQL alone versus what needs schema metadata, and shows how to recover the missing edge today using an open-source post-processor — without modifying your DataHub installation.</p>



<h2 class="wp-block-heading">What dbt-utils is actually doing on BigQuery</h2>



<p>The dbt-utils <code>deduplicate</code> macro is dialect-dispatched. Most warehouses get a <code>qualify</code>/<code>row_number()</code> translation, but BigQuery gets a <a href="https://github.com/dbt-labs/dbt-utils/issues/335#issuecomment-788157572">different implementation on purpose</a> — <code>array_agg ... limit 1</code> is significantly cheaper at scale than <code>row_number() over (...)</code> in BigQuery&#8217;s execution engine. The compiled SQL looks like this:</p>



<pre class="wp-block-code"><code>CREATE VIEW analytics.deduplicated_articles AS
SELECT unique.*
FROM (
    SELECT
        ARRAY_AGG(
            original
            ORDER BY article_name DESC
            LIMIT 1
        )[OFFSET(0)] AS unique
    FROM all_articles AS original
    GROUP BY id
);</code></pre>



<p>The variable names are deceptive in a way that matters for parsers, so it&#8217;s worth being precise about what each one is:</p>



<ul class="wp-block-list">

<li><code>original</code> is the <strong>table alias</strong> for <code>all_articles</code>. Used as a value (inside <code>ARRAY_AGG(original)</code>), it&#8217;s the <em>whole row</em> of <code>all_articles</code> viewed as a STRUCT — not a column.</li>

<li><code>unique</code> is a <strong>STRUCT-valued column alias</strong> in the inner query. <code>ARRAY_AGG(original ...)[OFFSET(0)]</code> returns one STRUCT per group (the deduplicated winner row), and <code>AS unique</code> names that STRUCT.</li>

<li><code>SELECT unique.&#042;</code> is <strong>STRUCT field expansion</strong> (per BigQuery&#8217;s <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#select_expression-star"><code>SELECT expression.*</code></a> rules), not a star against a table. It explodes the STRUCT&#8217;s fields into top-level output columns.</li>
</ul>




<p>Net effect: <code>analytics.deduplicated_articles</code> has the same top-level column shape as <code>all_articles</code>, with one row per <code>id</code> chosen by <code>article_name DESC</code>.</p>
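<p>To make the net effect concrete, here is a minimal Python sketch of what this SQL computes (rows modeled as dicts standing in for BigQuery row STRUCTs; the sample data is illustrative, not from the post):</p>



<pre class="wp-block-code"><code>def deduplicate(rows, group_key, order_key):
    # GROUP BY group_key, then ORDER BY order_key DESC LIMIT 1 per group:
    # sort descending and keep the first row seen for each group key.
    winners = {}
    for row in sorted(rows, key=lambda r: r[order_key], reverse=True):
        winners.setdefault(row[group_key], row)
    return list(winners.values())

all_articles = [
    {"id": 1, "article_name": "alpha", "body": "first draft"},
    {"id": 1, "article_name": "zeta",  "body": "final"},
    {"id": 2, "article_name": "beta",  "body": "only version"},
]

deduped = deduplicate(all_articles, "id", "article_name")
# One row per id, chosen by article_name DESC, same column shape as the input.</code></pre>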



<h2 class="wp-block-heading">What lineage is recoverable from the SQL alone</h2>



<p>Here&#8217;s an honest expected-vs-actual table for what a metadata-free parser can produce on this SQL:</p>



<p><table> <thead> <tr style="background:#f0f4f8"> <th style="padding:8px 12px;border:1px solid #ddd;text-align:left">Lineage you&#8217;d like</th> <th style="padding:8px 12px;border:1px solid #ddd;text-align:left">Recoverable without schema?</th> </tr> </thead> <tbody> <tr> <td style="padding:8px 12px;border:1px solid #ddd">Table edge: <code>all_articles &rarr; analytics.deduplicated_articles</code></td> <td style="padding:8px 12px;border:1px solid #ddd">Yes — sqlglot already gets this</td> </tr> <tr style="background:#f9fafb"> <td style="padding:8px 12px;border:1px solid #ddd">Whole-row flow: <code>all_articles.&#042; &rarr; analytics.deduplicated_articles.&#042;</code></td> <td style="padding:8px 12px;border:1px solid #ddd">Yes — but only by a parser that models row-as-STRUCT and <code>STRUCT.&#042;</code> expansion</td> </tr> <tr> <td style="padding:8px 12px;border:1px solid #ddd">Per-column edges: <code>all_articles.&lt;col&gt; &rarr; analytics.deduplicated_articles.&lt;col&gt;</code></td> <td style="padding:8px 12px;border:1px solid #ddd">No — needs schema for <code>all_articles</code> to enumerate the column names</td> </tr> <tr style="background:#f9fafb"> <td style="padding:8px 12px;border:1px solid #ddd">Row-selection dependency on <code>article_name</code> and <code>id</code></td> <td style="padding:8px 12px;border:1px solid #ddd">Yes — the <code>GROUP BY</code> and <code>ORDER BY</code> keys can be emitted as separate edges into the output (as control inputs, not value lineage)</td> </tr> </tbody> </table></p>



<p>For the deeper walkthrough on what BigQuery semantics a metadata-free parser needs to implement to recover the row-flow shape (<code>SELECT AS STRUCT</code>, <code>ARRAY(SELECT AS STRUCT ...)</code>, row-as-STRUCT in <code>array_agg</code>, <code>[OFFSET(n)]</code>, <code>STRUCT.&#042;</code> field expansion), see the companion post: <a href="https://www.dpriver.com/blog/2026/05/bigquery-column-level-lineage-without-metadata-inferring-struct-and-array-types/?utm_source=datahub&amp;utm_medium=oss&amp;utm_campaign=issue-11670&amp;utm_content=companion-blog">BigQuery Column-Level Lineage Without Metadata</a>.</p>



<h2 class="wp-block-heading">Where DataHub&#8217;s column walker stops short</h2>



<p>DataHub&#8217;s dbt source runs the model SQL through <strong>sqlglot</strong>. sqlglot extracts the table-level edge here (<code>all_articles → analytics.deduplicated_articles</code>) and resolves the inner <code>unique</code> alias correctly. The gap is in the outer query: the projection is literally <code>unique.&#042;</code>, and sqlglot&#8217;s column-lineage walker does not currently expand a STRUCT-valued column alias into its fields. Without that expansion, there are no concrete target columns to attach column-level lineage to.</p>



<p>You can see the symptom in three lines:</p>



<pre class="wp-block-code"><code>&gt;&gt;&gt; import sqlglot
&gt;&gt;&gt; from sqlglot.lineage import lineage
&gt;&gt;&gt; outer = """SELECT unique.* FROM (
...   SELECT ARRAY_AGG(original ORDER BY article_name DESC LIMIT 1)[OFFSET(0)] AS unique
...   FROM all_articles AS original GROUP BY id)"""
&gt;&gt;&gt; # Tables: works fine
&gt;&gt;&gt; [str(t) for t in sqlglot.parse_one(outer, dialect='bigquery').find_all(sqlglot.exp.Table)]
['all_articles AS original']
&gt;&gt;&gt; # Lineage walker on the outer query
&gt;&gt;&gt; lineage('unique', outer, dialect='bigquery')
SqlglotError: Cannot find column 'unique' in query.</code></pre>



<p>The error message is a side effect, not the root cause. <code>unique</code> isn&#8217;t a final output column of the outer query — only <code>unique.&#042;</code> (a star expansion) is. The walker needs to (a) recognize <code>unique</code> as a STRUCT-valued projection in the inner query, (b) carry that STRUCT type through the subquery boundary, and (c) expand <code>unique.&#042;</code> into its field set. None of those steps are wired up today, so the projection drops out of column-level lineage.</p>
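<p>Steps (a)&#8211;(c) amount to a small piece of type bookkeeping. The sketch below is a hypothetical data model invented for illustration (it is not sqlglot&#8217;s internals): a scope maps inner-query projection aliases to their STRUCT field sets, and a resolver expands <code>alias.&#042;</code> against it:</p>



<pre class="wp-block-code"><code># Sentinel for a row STRUCT whose fields cannot be enumerated without schema.
STRUCT_OF_UNKNOWN_FIELDS = object()

def expand_star(alias, scope):
    """Resolve alias.* in the outer query against inner projections."""
    struct_fields = scope.get(alias)
    if struct_fields is None:
        # Today's behavior: alias is not a known STRUCT, the projection drops out.
        return []
    if struct_fields is STRUCT_OF_UNKNOWN_FIELDS:
        return ["*"]              # structural whole-row edge, no field names
    return list(struct_fields)    # concrete columns when schema is known

# Inner query exports: ARRAY_AGG(original ...)[OFFSET(0)] AS unique,
# where `original` is the whole row of all_articles.
scope = {"unique": STRUCT_OF_UNKNOWN_FIELDS}
expand_star("unique", scope)      # ["*"]: fields need schema to enumerate</code></pre>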



<p>This isn&#8217;t a defect in sqlglot per se — modeling row-as-STRUCT, array element extraction, and <code>STRUCT.&#042;</code> expansion is genuinely additional surface area, and the upstream maintainers have asked (rightly) for a sqlglot-side issue to track it. But for teams running DataHub today on a BigQuery + dbt stack, the column lineage on every deduplicate-macro model is missing now.</p>



<h2 class="wp-block-heading">Workaround: a post-processor sidecar</h2>



<p>We built <a href="https://github.com/gudusoftware/gsp-datahub-sidecar">gsp-datahub-sidecar</a>, an Apache-2.0 tool that fills this gap. It runs <strong>alongside</strong> your existing DataHub ingestion — no changes to DataHub itself, and it doesn&#8217;t replace anything sqlglot already produced.</p>



<p>The sidecar re-parses SQL that sqlglot&#8217;s column walker couldn&#8217;t fully resolve using <a href="https://sqlflow.gudusoft.com/?utm_source=datahub&amp;utm_medium=oss&amp;utm_campaign=issue-11670&amp;utm_content=engine-mention">Gudu SQLFlow</a>, whose engine implements the row-as-STRUCT, <code>[OFFSET(N)]</code>, and <code>STRUCT.&#042;</code> semantics natively across BigQuery, Snowflake, and other dialects. It then emits the recovered lineage to DataHub via the standard REST API.</p>



<pre class="wp-block-code"><code>DataHub ingestion (unchanged)        gsp-datahub-sidecar
  dbt source -&gt; sqlglot -&gt; lineage     |
                  |                    | re-parse the SQL with Gudu SQLFlow
                  v                    | (handles row-as-STRUCT + STRUCT.*)
       table edge OK, column            v
       expansion missing for #11670 -&gt; DataHub GMS (lineage filled in)</code></pre>



<h3 class="wp-block-heading">Try it in three commands</h3>



<p><strong>Step 1 — Install:</strong></p>



<pre class="wp-block-code"><code>pip install gsp-datahub-sidecar</code></pre>



<p><strong>Step 2 — Dry-run on the dbt-utils <code>deduplicate</code> pattern.</strong> The repo ships with <code>examples/bigquery_dbt_dedup.sql</code> — the exact SQL from issue #11670, ready to run:</p>



<pre class="wp-block-code"><code>gsp-datahub-sidecar --sql-file examples/bigquery_dbt_dedup.sql --dry-run</code></pre>



<p>Output (high-level summary, with the per-edge URN lines collapsed for readability — full output is shown in <a href="https://github.com/gudusoftware/gsp-datahub-sidecar#examples">the GitHub repo</a>):</p>



<pre class="wp-block-code"><code>[INFO] Processing 1 SQL statement(s) in 'anonymous' mode...
[INFO] Extracted 1 table-level lineage relationships
[INFO]   ALL_ARTICLES --&gt; ANALYTICS.DEDUPLICATED_ARTICLES (3 columns)
[INFO] Built 2 MCPs for 1 downstream tables (3 column-level mappings)
[INFO] [DRY RUN] Would emit 4 MCPs to http://localhost:8080
[INFO] [DRY RUN]   UpstreamLineageClass (1 upstream table, 3 column-level lineages):
[INFO] [DRY RUN]     all_articles.&#042;            -&gt;  analytics.deduplicated_articles.&#042;
[INFO] [DRY RUN]     all_articles.id           -&gt;  analytics.deduplicated_articles.&#042;
[INFO] [DRY RUN]     all_articles.article_name -&gt;  analytics.deduplicated_articles.&#042;</code></pre>



<p>The three column-level edges map cleanly onto the dbt-utils macro&#8217;s structure:</p>



<ul class="wp-block-list">

<li><strong><code>all_articles.&#042; -&gt; analytics.deduplicated_articles.&#042;</code></strong> — the row-flow edge. Every column of the source flows into the same-named column on the target, because <code>unique.&#042;</code> field-expands the row STRUCT.</li>

<li><strong><code>all_articles.id -&gt; analytics.deduplicated_articles.&#042;</code></strong> — <code>id</code> is the <code>GROUP BY</code> key. Different <code>id</code> values produce different output rows, so <code>id</code> is a row-selection (control) input to every output column.</li>

<li><strong><code>all_articles.article_name -&gt; analytics.deduplicated_articles.&#042;</code></strong> — <code>article_name</code> is the <code>ORDER BY ... DESC LIMIT 1</code> key. It picks <em>which</em> row wins per group, so it&#8217;s also a row-selection input to every output column.</li>
</ul>




<p>The two row-selection edges (<code>id</code>, <code>article_name</code>) are the <code>GROUP BY</code> and <code>ORDER BY</code> columns annotated as control dependencies, not value lineage. That&#8217;s the right semantics for impact analysis: if <code>article_name</code> changes, the <em>choice</em> of winning row may change, but the <em>value</em> still flows from the row&#8217;s other columns.</p>
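<p>One way to keep that distinction usable downstream is to tag each edge with its kind, so impact analysis can see both while value-lineage views see only true flow. A small sketch (edge tuples and kind labels are illustrative, not the sidecar&#8217;s actual output format):</p>



<pre class="wp-block-code"><code># (source, target, kind): "value" = data flows, "control" = row selection.
edges = [
    ("all_articles.*",            "analytics.deduplicated_articles.*", "value"),
    ("all_articles.id",           "analytics.deduplicated_articles.*", "control"),
    ("all_articles.article_name", "analytics.deduplicated_articles.*", "control"),
]

def impact_inputs(edges):
    """Every source that can change the output (value or control)."""
    return sorted({src for src, _dst, _kind in edges})

def value_inputs(edges):
    """Only sources whose values actually flow into output columns."""
    return sorted({src for src, _dst, kind in edges if kind == "value"})</code></pre>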



<p>With schema metadata supplied (via SQLFlow&#8217;s database integrations or a supplied schema file), the engine expands the <code>&#042;</code> on the right into concrete same-named column edges per known column of <code>all_articles</code>. Without metadata, you still get the structural row-flow edge in DataHub rather than a silent gap — which is enough for impact analysis even when the leaf column names aren&#8217;t enumerated.</p>



<p><strong>Step 3 — Emit to DataHub.</strong> Once the dry-run looks right, drop <code>--dry-run</code> and point at your DataHub GMS:</p>



<pre class="wp-block-code"><code>gsp-datahub-sidecar \
  --sql-file examples/bigquery_dbt_dedup.sql \
  --datahub-server http://localhost:8080</code></pre>



<p>You can run the sidecar on real dbt project SQL the same way — point <code>--sql-file</code> at the compiled SQL in <code>target/run/.../&lt;model&gt;.sql</code>, or feed a directory of compiled models with a wrapper script.</p>
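<p>A wrapper script can be as simple as collecting every compiled model and building one CLI invocation per file. A Python sketch (the CLI name is from this post; the directory layout is the usual dbt <code>target/run</code> convention):</p>



<pre class="wp-block-code"><code>from pathlib import Path

def sidecar_commands(target_run_dir, datahub_server):
    """One sidecar invocation per compiled dbt model SQL file."""
    commands = []
    for sql_file in sorted(Path(target_run_dir).rglob("*.sql")):
        commands.append([
            "gsp-datahub-sidecar",
            "--sql-file", str(sql_file),
            "--datahub-server", datahub_server,
        ])
    return commands
# Each entry can then be executed with subprocess.run(cmd, check=True).</code></pre>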



<h2 class="wp-block-heading">Backend modes</h2>



<p>Pick the backend based on your security and volume needs:</p>



<p><table> <thead> <tr style="background:#f0f4f8"> <th style="padding:8px 12px;border:1px solid #ddd;text-align:left">Mode</th> <th style="padding:8px 12px;border:1px solid #ddd;text-align:left">Auth</th> <th style="padding:8px 12px;border:1px solid #ddd;text-align:left">Limit</th> <th style="padding:8px 12px;border:1px solid #ddd;text-align:left">Data location</th> <th style="padding:8px 12px;border:1px solid #ddd;text-align:left">Use case</th> </tr> </thead> <tbody> <tr> <td style="padding:8px 12px;border:1px solid #ddd"><code>anonymous</code> (default)</td> <td style="padding:8px 12px;border:1px solid #ddd">None</td> <td style="padding:8px 12px;border:1px solid #ddd">50/day per IP</td> <td style="padding:8px 12px;border:1px solid #ddd">SQL sent to api.gudusoft.com</td> <td style="padding:8px 12px;border:1px solid #ddd">Quick evaluation</td> </tr> <tr style="background:#f9fafb"> <td style="padding:8px 12px;border:1px solid #ddd"><code>authenticated</code></td> <td style="padding:8px 12px;border:1px solid #ddd">userId + secretKey</td> <td style="padding:8px 12px;border:1px solid #ddd">10k/month</td> <td style="padding:8px 12px;border:1px solid #ddd">SQL sent to api.gudusoft.com</td> <td style="padding:8px 12px;border:1px solid #ddd">Extended evaluation</td> </tr> <tr> <td style="padding:8px 12px;border:1px solid #ddd"><code>self_hosted</code></td> <td style="padding:8px 12px;border:1px solid #ddd">userId + secretKey</td> <td style="padding:8px 12px;border:1px solid #ddd">Unlimited</td> <td style="padding:8px 12px;border:1px solid #ddd">SQL stays in your VPC</td> <td style="padding:8px 12px;border:1px solid #ddd">Production</td> </tr> <tr style="background:#f9fafb"> <td style="padding:8px 12px;border:1px solid #ddd"><code>local_jar</code></td> <td style="padding:8px 12px;border:1px solid #ddd">None (local subprocess)</td> <td style="padding:8px 12px;border:1px solid #ddd">Per JAR license</td> <td style="padding:8px 12px;border:1px solid #ddd">SQL never leaves the 
process</td> <td style="padding:8px 12px;border:1px solid #ddd">Air-gapped / CI</td> </tr> </tbody> </table></p>



<p>For production use with regulated SQL, the <a href="https://docs.gudusoft.com/docker/?utm_source=datahub&amp;utm_medium=oss&amp;utm_campaign=issue-11670&amp;utm_content=docker-mention">self-hosted SQLFlow Docker</a> keeps everything in your network — no SQL leaves your infrastructure.</p>



<h2 class="wp-block-heading">What&#8217;s next</h2>



<p>The sidecar is a workaround, not a substitute for fixing this upstream. The honest path forward:</p>



<ol class="wp-block-list">

<li><strong>Patch sqlglot&#8217;s column-lineage walker</strong> to model row-as-STRUCT, array element extraction, and <code>STRUCT.&#042;</code> expansion. That unblocks every downstream tool that uses sqlglot, not just DataHub.</li>

<li><strong>Until then</strong>, run the sidecar on dbt models that hit the <code>deduplicate</code> macro (or any other <code>array_agg</code>-based dedup pattern), so your column lineage isn&#8217;t silently empty.</li>
</ol>




<p>If you&#8217;re affected by <a href="https://github.com/datahub-project/datahub/issues/11670">#11670</a> or other lineage gaps in your DataHub instance, browse the <a href="https://github.com/gudusoftware/gsp-datahub-sidecar">GitHub repo</a> for the full example library — BigQuery procedural SQL, Power BI comments, MSSQL stored procedures, Oracle CREATE VIEW. The companion post on <a href="https://www.dpriver.com/blog/2026/04/why-your-datahub-bigquery-lineage-silently-breaks-on-procedural-sql-and-how-to-f/?utm_source=datahub&amp;utm_medium=oss&amp;utm_campaign=issue-11670&amp;utm_content=related-blog">BigQuery procedural SQL lineage</a> covers DataHub issue #11654 with the same playbook. Issues, PRs, and feedback welcome on GitHub.</p>



<p><em>Disclosure: This tool is built by <a href="https://www.gudusoft.com/?utm_source=datahub&amp;utm_medium=oss&amp;utm_campaign=issue-11670&amp;utm_content=disclosure">Gudu Software</a>, the team behind General SQL Parser and SQLFlow. We specialize in deep SQL parsing and column-level lineage across SQL dialects.</em></p>
<p>The post <a rel="nofollow" href="https://www.dpriver.com/blog/2026/05/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re/">Why DataHub Loses Column-Level Lineage on dbt Deduplication Macros — and How to Recover It</a> appeared first on <a rel="nofollow" href="https://www.dpriver.com/blog">SQL and Data Blog</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>BigQuery Column-Level Lineage Without Metadata: Inferring STRUCT and ARRAY Types from SQL Alone</title>
		<link>https://www.dpriver.com/blog/2026/05/bigquery-column-level-lineage-without-metadata-inferring-struct-and-array-types/</link>
		
		<dc:creator><![CDATA[James]]></dc:creator>
		<pubDate>Fri, 01 May 2026 02:19:12 +0000</pubDate>
				<category><![CDATA[Data Governance]]></category>
		<category><![CDATA[gsp]]></category>
		<category><![CDATA[SQL language]]></category>
		<category><![CDATA[SQL skills]]></category>
		<category><![CDATA[SQLFlow]]></category>
		<category><![CDATA[array]]></category>
		<category><![CDATA[bigquery]]></category>
		<category><![CDATA[bigquery-lineage]]></category>
		<category><![CDATA[column-lineage]]></category>
		<category><![CDATA[data-lineage]]></category>
		<category><![CDATA[nested-types]]></category>
		<category><![CDATA[struct]]></category>
		<guid isPermaLink="false">https://www.dpriver.com/blog/?p=3208</guid>

					<description><![CDATA[<p>How a BigQuery-aware SQL parser can extract column-level lineage from ARRAY&#60;STRUCT&#62;, SELECT AS STRUCT, and array_agg(row) patterns without any catalog metadata — and where generic parsers silently fail.</p>
<p>The post <a rel="nofollow" href="https://www.dpriver.com/blog/2026/05/bigquery-column-level-lineage-without-metadata-inferring-struct-and-array-types/">BigQuery Column-Level Lineage Without Metadata: Inferring STRUCT and ARRAY Types from SQL Alone</a> appeared first on <a rel="nofollow" href="https://www.dpriver.com/blog">SQL and Data Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>If you are searching for a <strong>column-level lineage solution for BigQuery</strong>, you have probably already discovered the part of the problem nobody talks about up front: BigQuery is a <em>typed</em>, <em>nested</em> SQL dialect. The moment your warehouse adopts <code>STRUCT</code>, <code>ARRAY&lt;STRUCT&gt;</code>, <code>SELECT AS STRUCT</code>, or <code>array_agg(table_alias)</code>, a parser that does not know the schema can stop dead, give up on the column, or — worse — silently produce wrong lineage.</p>



<p>This post walks through two real BigQuery patterns and shows what a SQL parser can actually infer from the SQL text alone, with <strong>no access to the BigQuery catalog, no INFORMATION_SCHEMA, no dbt manifest</strong>. We will use these patterns to explain why generic SQL parsers struggle with BigQuery lineage, and what it takes to get column-level lineage right when metadata is missing.</p>



<p>The two patterns we will analyze:</p>



<ol class="wp-block-list">

<li><code>ARRAY(SELECT AS STRUCT ...)</code> — building a nested <code>ARRAY&lt;STRUCT&gt;</code> column from a flat input row.</li>

<li><code>array_agg(original ORDER BY ... LIMIT 1)[OFFSET(0)] unique</code> followed by <code>SELECT unique.*</code> — the dbt-utils-style deduplication pattern that flattens a row STRUCT back into top-level columns.</li>
</ol>




<p>Both produce <em>correct</em> BigQuery lineage even without metadata — but only if the parser understands BigQuery semantics, not just BigQuery grammar.</p>



<h2 class="wp-block-heading">Why BigQuery lineage is harder than &#8220;normal&#8221; SQL lineage</h2>



<p>A column-level lineage extractor for plain ANSI SQL has one job per output column: walk the <code>SELECT</code> list, resolve each expression back to a base table column. That breaks down in BigQuery for three reasons:</p>



<ul class="wp-block-list">

<li><strong>Output columns are not always scalar.</strong> A <code>SELECT AS STRUCT</code> expression produces a single output column whose type is itself a record. A downstream view&#8217;s <code>column_name</code> may actually expand into a dozen physical fields.</li>

<li><strong>Row variables behave like STRUCTs.</strong> BigQuery lets you write <code>array_agg(table_alias)</code> and get back <code>ARRAY&lt;STRUCT&gt;</code> — without ever naming the columns in the SQL.</li>

<li><strong><code>STRUCT.&#042;</code> is field expansion, not column projection.</strong> The same syntax (<code>alias.&#042;</code>) means very different things depending on whether <code>alias</code> is a table or a STRUCT-typed expression.</li>
</ul>




<p>If your lineage tool was originally built for Postgres or MySQL, every one of those is a hole. Most popular open-source parsers — JSQLParser, sqlglot, sqlparse — handle two of the three at best, and even then only with metadata. Without metadata, generic parsers either drop the column from lineage entirely or attribute it to the wrong source.</p>



<p>The good news: <strong>a lot more is recoverable from the SQL alone than people assume</strong>, as long as the parser knows BigQuery&#8217;s nested-type rules.</p>



<h2 class="wp-block-heading">Pattern 1: <code>ARRAY(SELECT AS STRUCT ...)</code> builds a nested column</h2>



<p>Here is a typical BigQuery view that flattens an <code>orders_raw</code> table into a single nested <code>orders</code> column:</p>



<pre class="wp-block-code"><code>CREATE VIEW orders_view AS
SELECT
    ARRAY(SELECT AS STRUCT
        customer.name AS customer_name,
        order_id      AS order_id
    ) AS orders
FROM
    orders_raw;</code></pre>



<p>Read by a metadata-free parser, this SQL says quite a lot.</p>



<p><strong>Step 1 — the inner query is a STRUCT constructor.</strong></p>



<pre class="wp-block-code"><code>SELECT AS STRUCT
    customer.name AS customer_name,
    order_id      AS order_id</code></pre>



<p><code>SELECT AS STRUCT</code> is a BigQuery-specific keyword. It is <em>not</em> &#8220;select two columns&#8221; — it is &#8220;build one row of type <code>STRUCT</code>&#8221;. The parser does not need to know whether <code>customer.name</code> is <code>STRING</code> or <code>BYTES</code>; it knows for certain that <code>customer_name</code> and <code>order_id</code> are <em>fields of a STRUCT</em>. That&#8217;s a semantic guarantee from the grammar itself.</p>



<p><strong>Step 2 — <code>ARRAY(...)</code> wraps the STRUCT in an ARRAY.</strong></p>



<pre class="wp-block-code"><code>ARRAY(SELECT AS STRUCT ...)</code></pre>



<p>The result type is <code>ARRAY&lt;STRUCT&gt;</code>. Again, the leaf scalar types are unknown without metadata, but the <em>shape</em> is fully determined by the SQL.</p>



<p><strong>Step 3 — the outer alias names a single column.</strong></p>



<pre class="wp-block-code"><code>) AS orders</code></pre>



<p>So <code>orders_view</code> has exactly one top-level column named <code>orders</code>. Its type form is:</p>



<pre class="wp-block-code"><code>orders ARRAY&lt;STRUCT&lt;customer_name ?, order_id ?&gt;&gt;</code></pre>



<h3 class="wp-block-heading">What lineage can we extract without metadata?</h3>



<p>A BigQuery-aware parser can produce this column-level lineage:</p>



<pre class="wp-block-code"><code>orders_raw.customer.name -&gt; orders_view.orders.customer_name
orders_raw.order_id      -&gt; orders_view.orders.order_id</code></pre>



<p>That is <em>real</em>, <em>navigable</em>, <em>nested</em> column-level lineage from the SQL text — no schema lookup required. The only thing the parser cannot prove from SQL alone is the leaf scalar type (<code>STRING</code>? <code>INT64</code>?) and whether <code>orders_raw.customer</code> is itself a <code>STRUCT</code> in the source table. Those need either metadata or BigQuery&#8217;s own type checker at execution time.</p>
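<p>Because the STRUCT constructor names its fields right in the SELECT list, deriving those nested edges is a direct mapping from field alias to source expression. A sketch (the dict-of-strings representation is invented for illustration):</p>



<pre class="wp-block-code"><code># Field alias in the STRUCT constructor, mapped to its source expression.
struct_fields = {
    "customer_name": "orders_raw.customer.name",
    "order_id":      "orders_raw.order_id",
}
output_column = "orders_view.orders"   # the single ARRAY-of-STRUCT column

edges = sorted((src, output_column + "." + fld)
               for fld, src in struct_fields.items())</code></pre>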



<h3 class="wp-block-heading">What a generic parser typically gets wrong</h3>



<p>Without BigQuery semantics, a generic parser tends to do one of these:</p>



<ul class="wp-block-list">

<li>Treat <code>ARRAY(SELECT AS STRUCT ...)</code> as an unknown function and drop the lineage.</li>

<li>Emit a single edge from <code>orders_raw -&gt; orders_view.orders</code> with no internal field structure, hiding the fact that <code>customer_name</code> and <code>order_id</code> are <em>separate</em> fields downstream.</li>

<li>Flatten the STRUCT incorrectly, producing two top-level columns <code>customer_name</code> and <code>order_id</code> on <code>orders_view</code> — which is wrong: there is only one top-level column called <code>orders</code>.</li>
</ul>




<p>If your data catalog tells you <code>orders_view</code> has columns <code>customer_name</code> and <code>order_id</code>, the parser feeding it has misread the SQL.</p>



<h2 class="wp-block-heading">Pattern 2: <code>array_agg(row)[OFFSET(0)] unique</code> followed by <code>unique.*</code></h2>



<p>This is the canonical dbt-utils deduplication pattern, which is one of the most common SQL shapes in modern BigQuery warehouses:</p>



<pre class="wp-block-code"><code>CREATE TABLE deduplicated_articles AS
  SELECT unique.*
  FROM (
       SELECT
           array_agg(
               original
                   ORDER BY article_name DESC
                   LIMIT 1
           )[OFFSET(0)] unique
       FROM all_articles original
       GROUP BY id
  );</code></pre>



<p>This SQL is harder for a generic parser because <em>no individual columns are ever named</em> in the projection. And yet, a BigQuery-aware parser can still extract correct lineage from the structure.</p>



<p><strong>Step 1 — <code>original</code> is a row variable, not a column.</strong></p>



<pre class="wp-block-code"><code>FROM all_articles original</code></pre>



<p><code>original</code> is the table alias for <code>all_articles</code>. When passed as a value expression (instead of with a <code>.column</code> access), it represents the entire row, which BigQuery treats as a STRUCT containing every column of <code>all_articles</code>.</p>



<p><strong>Step 2 — <code>array_agg(original)</code> aggregates whole rows.</strong></p>



<pre class="wp-block-code"><code>array_agg(original ORDER BY article_name DESC LIMIT 1)</code></pre>



<p>This produces <code>ARRAY&lt;STRUCT&gt;</code>. The <code>ORDER BY ... LIMIT 1</code> keeps the top row per group.</p>



<p><strong>Step 3 — <code>[OFFSET(0)]</code> extracts a STRUCT.</strong></p>



<pre class="wp-block-code"><code>array_agg(...)[OFFSET(0)] unique</code></pre>



<p>Now <code>unique</code> is a single STRUCT containing every column of <code>all_articles</code>. Crucially, the parser knows this <em>without</em> knowing what those columns are.</p>



<p><strong>Step 4 — <code>unique.*</code> flattens the STRUCT into top-level columns.</strong></p>



<pre class="wp-block-code"><code>SELECT unique.*</code></pre>



<p>This is the syntactic twist that breaks naive parsers. <code>unique.&#042;</code> does <em>not</em> mean &#8220;select the column named <code>unique</code>&#8221;. It means &#8220;expand every field of the STRUCT <code>unique</code> into its own top-level column&#8221;. So <code>deduplicated_articles</code> ends up with the same top-level column set as <code>all_articles</code>.</p>
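<p>Steps 1&#8211;4 are really a chain of type transformations, and each one is determined by the SQL alone. A sketch of that shape propagation (the tuple representation is invented for illustration):</p>



<pre class="wp-block-code"><code>def propagate(source_table):
    t = ("row_struct", source_table)      # step 1: table alias used as a value
    t = ("array", t)                      # step 2: array_agg of whole rows
    t = t[1]                              # step 3: [OFFSET(0)] yields the element type
    assert t[0] == "row_struct"           # step 4: .* is only legal on a STRUCT
    return ("columns_of", t[1])           # ...and expands to the table's columns

propagate("all_articles")                 # ("columns_of", "all_articles")</code></pre>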



<h3 class="wp-block-heading">What lineage can we extract without metadata?</h3>



<p>The parser can confidently emit:</p>



<pre class="wp-block-code"><code>all_articles.* -&gt; deduplicated_articles.*</code></pre>



<p>That is, &#8220;every column of <code>all_articles</code> flows into <code>deduplicated_articles</code>, picking the row with the largest <code>article_name</code> per <code>id</code>&#8221;. The parser can also note that <code>article_name</code> participates in the row-selection logic (for impact analysis), even though it does not change the <em>value</em> lineage.</p>



<p>What the parser <em>cannot</em> do without metadata is enumerate the exact column names. If you need <code>deduplicated_articles.article_name -&gt; all_articles.article_name</code> as a concrete edge, you need either the schema or a metadata-aware second pass.</p>



<h3 class="wp-block-heading">Why this pattern defeats most generic parsers</h3>



<p>Three reasons:</p>



<ul class="wp-block-list">

<li>They treat <code>original</code> (a bare table alias used as a value) as an error or as an unresolved column.</li>

<li>They treat <code>[OFFSET(0)]</code> after a function call as unknown syntax.</li>

<li>They treat <code>unique.*</code> like <code>table_alias.*</code> and look for a <em>table</em> called <code>unique</code>, fail to find one, and drop the projection.</li>
</ul>




<p>Any one of those failures collapses the entire lineage edge.</p>



<h2 class="wp-block-heading">What a BigQuery-aware parser needs to know</h2>



<p>Stepping back, here is the irreducible set of BigQuery semantics a metadata-free lineage tool must implement:</p>



<p><figure class="wp-block-table"><table><thead><tr><th style="padding:8px 12px;border:1px solid #ddd;background:#f0f4f8;text-align:left">Syntax</th><th style="padding:8px 12px;border:1px solid #ddd;background:#f0f4f8;text-align:left">Semantic</th><th style="padding:8px 12px;border:1px solid #ddd;background:#f0f4f8;text-align:left">What lineage gains</th></tr></thead><tbody><tr><td style="padding:8px 12px;border:1px solid #ddd"><code>SELECT AS STRUCT</code></td><td style="padding:8px 12px;border:1px solid #ddd">Build a single STRUCT row, not multiple columns</td><td style="padding:8px 12px;border:1px solid #ddd">Output is one column whose fields are named in the SELECT list</td></tr><tr><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb"><code>ARRAY(SELECT AS STRUCT ...)</code></td><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb">Wrap STRUCT rows into ARRAY&lt;STRUCT&lt;&#8230;&gt;&gt;</td><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb">Output column type form is fully recoverable</td></tr><tr><td style="padding:8px 12px;border:1px solid #ddd"><code>array_agg(table_alias)</code></td><td style="padding:8px 12px;border:1px solid #ddd">Aggregate whole rows; result is ARRAY&lt;row STRUCT&gt;</td><td style="padding:8px 12px;border:1px solid #ddd">Whole-row flow is detectable even without column enumeration</td></tr><tr><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb"><code>arr[OFFSET(n)]</code></td><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb">Extract one element of an ARRAY</td><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb">Result type is the array element type (often STRUCT)</td></tr><tr><td style="padding:8px 12px;border:1px solid #ddd"><code>struct_alias.*</code></td><td style="padding:8px 12px;border:1px solid #ddd">Expand STRUCT fields into top-level columns</td><td style="padding:8px 12px;border:1px solid #ddd">Disambiguates 
&#8220;select all from table&#8221; vs &#8220;explode this STRUCT&#8221;</td></tr><tr><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb"><code>nested.path.access</code></td><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb">Walk into nested STRUCT fields</td><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb">Edges target the specific leaf field, not the parent column</td></tr></tbody></table></figure></p>



<p>If a parser does not implement at least these six semantics, BigQuery column-level lineage will be either incomplete or misleading whenever your team uses nested types — which, in modern BigQuery warehouses, is most of the time.</p>



<h2 class="wp-block-heading">A simple memory aid for reading BigQuery SQL</h2>



<p>When eyeballing BigQuery SQL for lineage, four patterns are worth memorizing:</p>



<ul class="wp-block-list">

<li>See <code>SELECT AS STRUCT</code> → the SQL is <em>constructing</em> a STRUCT.</li>

<li>See <code>ARRAY(SELECT AS STRUCT ...)</code> → the result column is <code>ARRAY&lt;STRUCT&gt;</code>.</li>

<li>See <code>table_alias</code> used as a <em>value</em> (e.g. <code>array_agg(original)</code>) → the value is the entire row as a STRUCT.</li>

<li>See <code>alias.*</code> where <code>alias</code> is a STRUCT-typed expression → fields are being <em>flattened</em> into top-level columns, not selecting from a table.</li>
</ul>




<p>Anywhere those patterns appear, a metadata-free parser can still recover the lineage <em>shape</em>. What it cannot recover is the leaf scalar types, the full column list of a wildcard-flattened STRUCT, or the schema of the underlying table — those still need metadata or a second resolution pass.</p>
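<p>As a toy illustration of spotting these patterns in raw SQL text, the following uses regexes where a real parser would use a proper grammar; the pattern names are made up here:</p>

```python
import re

# Hypothetical pattern names; regex matching is a crude stand-in for parsing
PATTERNS = {
    "struct_construction": re.compile(r"SELECT\s+AS\s+STRUCT", re.IGNORECASE),
    "array_of_struct": re.compile(r"ARRAY\s*\(\s*SELECT\s+AS\s+STRUCT", re.IGNORECASE),
    "array_index": re.compile(r"\[\s*OFFSET\s*\(\s*\d+\s*\)\s*\]", re.IGNORECASE),
}

def spot_patterns(sql):
    """Return the sorted names of nested-type patterns present in the SQL text."""
    return sorted(name for name, rx in PATTERNS.items() if rx.search(sql))

sql = "SELECT array_agg(original ORDER BY article_name DESC LIMIT 1)[OFFSET(0)] unique_row"
```

<p>A hit on any of these tells a reader the statement involves nested-type lineage, even before working out the exact edges.</p>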



<h2 class="wp-block-heading">How Gudu&#8217;s SQL engine handles it</h2>



<p>All three of Gudu Software&#8217;s products — <a href="https://www.sqlparser.com/?utm_source=dpriver_blog&amp;utm_medium=referral&amp;utm_campaign=bigquery_struct_lineage">General SQL Parser (GSP)</a>, <a href="https://www.gudusoft.com/?utm_source=dpriver_blog&amp;utm_medium=referral&amp;utm_campaign=bigquery_struct_lineage">SQLFlow</a>, and <a href="https://gudu-sql-omni.gudusoft.com/?utm_source=dpriver_blog&amp;utm_medium=referral&amp;utm_campaign=bigquery_struct_lineage">SQL Omni</a> — share one column-level lineage engine. They differ in how you use it:</p>



<ul class="wp-block-list">

<li><strong>GSP</strong> is the SDK: embed lineage extraction in a Java application or data platform.</li>

<li><strong>SQLFlow</strong> is the hosted lineage platform with REST APIs and visualization.</li>

<li><strong>SQL Omni</strong> is the VS Code extension for individual developers, 100% offline.</li>
</ul>




<p>The engine implements BigQuery&#8217;s nested-type semantics — <code>SELECT AS STRUCT</code>, <code>ARRAY(...)</code>, row-as-STRUCT in <code>array_agg</code>, <code>[OFFSET(n)]</code> array indexing, and <code>STRUCT.*</code> field expansion — so that the two patterns in this post produce correct lineage with or without metadata. When metadata <em>is</em> available (via SQLFlow&#8217;s database integrations or a supplied schema), the engine fills in concrete column names where wildcards appeared. When metadata is <em>not</em> available, you still get the structural lineage shown above, rather than silent gaps.</p>



<p>Disclosure: I work at Gudu Software.</p>



<h2 class="wp-block-heading">Try it yourself</h2>



<p>Paste either of the SQL samples in this post into the <a href="https://sqlflow.gudusoft.com/?utm_source=dpriver_blog&amp;utm_medium=referral&amp;utm_campaign=bigquery_struct_lineage">SQLFlow online demo</a> and pick <em>BigQuery</em> as the dialect. You&#8217;ll see column-level lineage rendered as a graph, including the nested <code>orders.customer_name</code> and <code>orders.order_id</code> edges from Pattern 1, and the <code>all_articles.* -&gt; deduplicated_articles.*</code> whole-row flow from Pattern 2.</p>



<p>If you&#8217;d rather embed the engine in your own pipeline, the <a href="https://www.sqlparser.com/sql-parser-java.php?utm_source=dpriver_blog&amp;utm_medium=referral&amp;utm_campaign=bigquery_struct_lineage">GSP BigQuery quickstart</a> shows how to extract the same lineage as JSON in Java. For VS Code users on regulated stacks, <a href="https://marketplace.visualstudio.com/items?itemName=gudusoftware.gudu-sql-omni">SQL Omni</a> runs the same analysis 100% offline.</p>



<h2 class="wp-block-heading">Takeaways</h2>



<ul class="wp-block-list">

<li>BigQuery column-level lineage is harder than ANSI lineage because of <code>STRUCT</code>, <code>ARRAY&lt;STRUCT&gt;</code>, row-as-STRUCT, and <code>STRUCT.*</code> field expansion.</li>

<li>A <em>lot</em> is recoverable from SQL text alone, including the type <em>shape</em> of nested output columns, the field names of <code>SELECT AS STRUCT</code>, and whole-row flows from <code>array_agg(row)</code>.</li>

<li>A <em>little</em> is genuinely impossible without metadata: leaf scalar types and the column list of any wildcard-flattened STRUCT.</li>

<li>A generic SQL parser that does not implement these BigQuery semantics will produce either silent gaps or wrong attributions on any modern BigQuery warehouse — which is why a BigQuery-specialized parser matters when you&#8217;re choosing a column-level lineage solution.</li>
</ul>

<p>The post <a rel="nofollow" href="https://www.dpriver.com/blog/2026/05/bigquery-column-level-lineage-without-metadata-inferring-struct-and-array-types/">BigQuery Column-Level Lineage Without Metadata: Inferring STRUCT and ARRAY Types from SQL Alone</a> appeared first on <a rel="nofollow" href="https://www.dpriver.com/blog">SQL and Data Blog</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>OpenMetadata + MSSQL Stored Procedures: Why Your Lineage Is Silently Empty — and How to Fix It</title>
		<link>https://www.dpriver.com/blog/2026/04/openmetadata-mssql-stored-procedures-why-your-lineage-is-silently-empty-and-how/</link>
		
		<dc:creator><![CDATA[James]]></dc:creator>
		<pubDate>Tue, 21 Apr 2026 11:42:49 +0000</pubDate>
				<category><![CDATA[gsp]]></category>
		<category><![CDATA[SQL language]]></category>
		<category><![CDATA[SQLFlow]]></category>
		<category><![CDATA[column-lineage]]></category>
		<category><![CDATA[data-governance]]></category>
		<category><![CDATA[gsp-openmetadata-sidecar]]></category>
		<category><![CDATA[merge]]></category>
		<category><![CDATA[mssql]]></category>
		<category><![CDATA[openmetadata]]></category>
		<category><![CDATA[sql-server]]></category>
		<category><![CDATA[stored-procedure]]></category>
		<category><![CDATA[temp-tables]]></category>
		<guid isPermaLink="false">https://www.dpriver.com/blog/?p=3198</guid>

					<description><![CDATA[<p>OpenMetadata issues #16737, #25299, and #17586 report zero lineage from MSSQL stored procedures. We analyze three failure patterns — BEGIN/END blocks, temp table chains, and square bracket identifiers — with real SQL from the community, and show how gsp-openmetadata-sidecar recovers full column-level lineage.</p>
<p>The post <a rel="nofollow" href="https://www.dpriver.com/blog/2026/04/openmetadata-mssql-stored-procedures-why-your-lineage-is-silently-empty-and-how/">OpenMetadata + MSSQL Stored Procedures: Why Your Lineage Is Silently Empty — and How to Fix It</a> appeared first on <a rel="nofollow" href="https://www.dpriver.com/blog">SQL and Data Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><strong>Author:</strong> James <strong>Date:</strong> 2026-04-21 <strong>Categories:</strong> gsp, SQLFlow, OpenMetadata <strong>Keywords:</strong> column-lineage, data-governance, openmetadata, mssql, stored-procedure, sql-server, gsp-openmetadata-sidecar, temp-tables, merge</p>






<p>OpenMetadata issues <a href="https://github.com/open-metadata/OpenMetadata/issues/16737">#16737</a>, <a href="https://github.com/open-metadata/OpenMetadata/issues/25299">#25299</a>, and <a href="https://github.com/open-metadata/OpenMetadata/issues/17586">#17586</a> all report the same frustrating problem: <strong>MSSQL stored procedures produce zero lineage in OpenMetadata</strong>. No tables. No columns. The lineage graph just stops at the procedure boundary.</p>



<p>These issues have been open since June 2024 — nearly two years — with users confirming the same behavior across OpenMetadata 1.4.x through 1.11.x. We analyzed the real SQL from these issues, identified three distinct failure patterns, and built a sidecar tool that solves all of them. Here&#8217;s what&#8217;s happening and how to fix it.</p>



<h2 class="wp-block-heading">Why OpenMetadata Can&#8217;t Parse Stored Procedures</h2>



<p>OpenMetadata extracts lineage by reading SQL query logs from your database, then passing each statement through a three-parser chain: <strong>sqlglot → sqlfluff → sqlparse</strong> (via the <code>collate-sqllineage</code> library). If all three parsers fail, the statement is <strong>silently skipped</strong> — no error in the UI, just empty lineage.</p>
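<p>The failure mode can be sketched as a simple fallback chain. This is a simplified model, not OpenMetadata's actual code; <code>strict_parser</code> is a hypothetical stand-in for the three real parsers:</p>

```python
def extract_lineage(sql, parsers):
    """Try each parser in order; if every one raises, the statement is
    dropped with no error surfaced, mirroring the silent skip."""
    for parse in parsers:
        try:
            return parse(sql)
        except Exception:
            continue  # fall through to the next parser
    return None  # silently skipped: no lineage, no error in the UI

# Hypothetical stand-in for sqlglot / sqlfluff / sqlparse, all of which
# choke on the T-SQL procedure wrapper
def strict_parser(sql):
    if sql.upper().startswith("CREATE PROCEDURE"):
        raise ValueError("unsupported T-SQL wrapper")
    return {"tables": ["test1", "test2"]}

proc_sql = "CREATE PROCEDURE myproc AS BEGIN INSERT INTO test2 SELECT * FROM test1 END"
result = extract_lineage(proc_sql, [strict_parser, strict_parser, strict_parser])
```

<p>Because every failure path ends in <code>return None</code> rather than an error, the operator sees an empty lineage graph and nothing else.</p>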



<p>The problem is that MSSQL stored procedures aren&#8217;t plain SQL statements. They&#8217;re wrapped in T-SQL procedural syntax that none of the three parsers can handle:</p>



<ul class="wp-block-list">

<li><code>CREATE PROCEDURE ... AS BEGIN ... END</code> — the procedure declaration wrapper</li>

<li><code>DECLARE</code> — variable declarations</li>

<li><code>#tempTable</code> — temporary tables used as intermediate staging</li>

<li><code>MERGE ... WHEN MATCHED ... WHEN NOT MATCHED</code> — upsert patterns</li>

<li><code>[dbo].[table_name]</code> — square bracket identifiers</li>
</ul>



<p>These are <strong>standard patterns in every MSSQL environment</strong> — not edge cases. If your data warehouse uses stored procedures for ETL (and most MSSQL shops do), OpenMetadata is missing your lineage.</p>



<h2 class="wp-block-heading">Three Failure Patterns, All Solved</h2>



<p>The issues in #16737, #25299, and #17586 document three distinct failure modes. Our <a href="https://github.com/gudusoftware/gsp-openmetadata-sidecar">gsp-openmetadata-sidecar</a> tool handles all of them:</p>



<figure class="wp-block-table"><table style="border-collapse:collapse;width:100%">
<thead><tr><th style="padding:8px 12px;border:1px solid #ddd;background:#f0f4f8;font-weight:bold">Pattern</th><th style="padding:8px 12px;border:1px solid #ddd;background:#f0f4f8;font-weight:bold">Issues</th><th style="padding:8px 12px;border:1px solid #ddd;background:#f0f4f8;font-weight:bold">What Happens</th><th style="padding:8px 12px;border:1px solid #ddd;background:#f0f4f8;font-weight:bold">With GSP</th></tr></thead>
<tbody>
<tr><td style="padding:8px 12px;border:1px solid #ddd"><strong>Pattern A</strong> — BEGIN/END blocks</td><td style="padding:8px 12px;border:1px solid #ddd">#16737, #25299, #17586</td><td style="padding:8px 12px;border:1px solid #ddd"><code>CREATE PROCEDURE ... BEGIN ... END</code> breaks all three parsers. Lineage silently dropped.</td><td style="padding:8px 12px;border:1px solid #ddd"><strong>Full lineage recovered</strong> — procedure wrapper parsed correctly</td></tr>
<tr><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb"><strong>Pattern B</strong> — Temp table chains</td><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb">#25299</td><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb"><code>source → #tempTable → target</code> 3-hop lineage. Parser can&#8217;t follow data through temp tables.</td><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb"><strong>Multi-hop lineage recovered</strong> — temp tables resolved as intermediates</td></tr>
<tr><td style="padding:8px 12px;border:1px solid #ddd"><strong>Pattern C</strong> — Square brackets</td><td style="padding:8px 12px;border:1px solid #ddd">#16424</td><td style="padding:8px 12px;border:1px solid #ddd"><code>[database].[schema].[table]</code> identifiers break regex-based bracket handling</td><td style="padding:8px 12px;border:1px solid #ddd"><strong>Bracket identifiers resolved</strong> — cross-database FQNs extracted correctly</td></tr>
</tbody></table></figure>






<h2 class="wp-block-heading">Pattern A: CREATE PROCEDURE with BEGIN/END (Issues #16737, #25299, #17586)</h2>



<p>This is the most common failure. Here&#8217;s the exact SQL from <a href="https://github.com/open-metadata/OpenMetadata/issues/17586">issue #17586</a> — a minimal stored procedure that OpenMetadata cannot parse:</p>



<pre class="wp-block-code"><code>CREATE PROCEDURE myproc
AS
BEGIN
    INSERT INTO test2 SELECT * FROM test1
END</code></pre>



<p><strong>What OpenMetadata sees:</strong> The ingestion pipeline reads this from <code>sys.dm_exec_cached_plans</code>. The SQL text starts with <code>CREATE PROCEDURE</code> — and historically, OpenMetadata&#8217;s query log reader <a href="https://github.com/open-metadata/OpenMetadata/issues/17586">explicitly filtered these out</a> with:</p>



<pre class="wp-block-code"><code>AND lower(t.text) NOT LIKE '%%create%%procedure%%'</code></pre>



<p>That filter was removed in <a href="https://github.com/open-metadata/OpenMetadata/pull/14586">PR #14586</a>, but even after removal, the parser chain (sqlglot → sqlfluff → sqlparse) still cannot handle the <code>BEGIN...END</code> wrapper. As one user in <a href="https://github.com/open-metadata/OpenMetadata/issues/16737">#16737</a> confirmed: <em>&#8220;commenting commands like <code>CREATE PROCEDURE</code>, <code>BEGIN</code>, <code>END</code>, <code>DECLARE</code>, <code>AS</code> gives correct lineage&#8221;</em> — proving the SQL inside is valid, but the parser rejects the T-SQL wrapper.</p>



<p><strong>What GSP extracts:</strong></p>



<pre class="wp-block-code"><code>$ gsp-openmetadata-sidecar --sql-file om_issue_17586_procedure_filtered.sql --dry-run

Analyzing SQL (544 chars)...
Extracted 1 table-level lineage relationships
  test1 --&gt; test2 (1 column: *)

[DRY RUN] Would emit lineage: mssql.dbo.test1 --&gt; mssql.dbo.test2</code></pre>



<p>GSP&#8217;s T-SQL parser strips the <code>CREATE PROCEDURE ... AS BEGIN ... END</code> wrapper, recognizes the <code>INSERT INTO ... SELECT</code> inside, and extracts the lineage: <strong>test1 → test2</strong>.</p>
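<p>The core of that recovery can be illustrated with a crude wrapper-stripping pass. This regex sketch handles only the single-block shape above; GSP's actual parser implements the full T-SQL grammar rather than regexes:</p>

```python
import re

def unwrap_procedure_body(sql):
    """Pull the inner statements out of CREATE PROCEDURE ... AS BEGIN ... END.
    Only handles this simple single-block shape (a deliberate simplification)."""
    m = re.search(r"CREATE\s+PROCEDURE\s+\S+\s+AS\s+BEGIN\b(.*)\bEND\s*;?\s*$",
                  sql, re.IGNORECASE | re.DOTALL)
    return m.group(1).strip() if m else sql

proc = """CREATE PROCEDURE myproc
AS
BEGIN
    INSERT INTO test2 SELECT * FROM test1
END"""
```

<p>The unwrapped body is plain DML that any of the three stock parsers could already handle, which is why commenting out the wrapper keywords "fixes" lineage in the community workaround.</p>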






<h2 class="wp-block-heading">Pattern B: Temp Table Multi-Hop Chains (Issue #25299)</h2>



<p>This is the most technically interesting failure. Here&#8217;s the exact SQL from <a href="https://github.com/open-metadata/OpenMetadata/issues/25299">issue #25299</a> — a stored procedure that stages data through a temp table:</p>



<pre class="wp-block-code"><code>CREATE PROCEDURE schName.procName
AS
BEGIN
    DROP TABLE IF EXISTS #tempTable

    CREATE TABLE #tempTable (columnName int)

    INSERT INTO #tempTable (columnName)
    SELECT columnName FROM schName.sourceTable

    INSERT INTO schName.targetTable (columnName)
    SELECT columnName FROM #tempTable
END</code></pre>



<p><strong>The data flow is:</strong> <code>schName.sourceTable</code> → <code>#tempTable</code> → <code>schName.targetTable</code></p>



<p>The reporter in #25299 identified this as a 3-hop lineage problem: <em>&#8220;Procedures using temp tables (<code>#tempTable</code>) in the transformation chain: <code>source → #temp → target</code> — only two-hop lineage works, not three-hop through temp tables.&#8221;</em></p>



<p><strong>What GSP extracts:</strong></p>



<pre class="wp-block-code"><code>$ gsp-openmetadata-sidecar --sql-file om_issue_25299_create_procedure_begin_end.sql --dry-run

Analyzing SQL (1007 chars)...
Extracted 2 table-level lineage relationships
  schName.sourceTable --&gt; #tempTable (1 column: columnName)
  #tempTable --&gt; schName.targetTable (1 column: columnName)

[DRY RUN] Would emit lineage:
  mssql.schname.sourcetable --&gt; mssql.dbo.#temptable (columnName → columnName)
  mssql.dbo.#temptable --&gt; mssql.schname.targettable (columnName → columnName)</code></pre>



<p>GSP extracts the complete chain: <strong>sourceTable → #tempTable → targetTable</strong> with column-level lineage at each hop. OpenMetadata&#8217;s lineage graph would show the full data flow through the temp table intermediate.</p>
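<p>Collapsing the temp table hop is a small graph exercise once both edges are captured. A sketch follows; real tools must also scope <code>#temp</code> names per session, which this ignores:</p>

```python
def collapse_temp_hops(edges):
    """Given (source, target) edges, add end-to-end edges that bypass
    #temp intermediates so graphs also show source -> final target."""
    by_source = {}
    for src, dst in edges:
        by_source.setdefault(src, []).append(dst)
    collapsed = list(edges)
    for src, dst in edges:
        if dst.startswith("#"):  # temp table: follow one more hop
            for final in by_source.get(dst, []):
                collapsed.append((src, final))
    return collapsed

edges = [("schName.sourceTable", "#tempTable"),
         ("#tempTable", "schName.targetTable")]
```

<p>The original two hops are kept, so a catalog can show the temp table as an intermediate while still offering the direct source-to-target edge for impact analysis.</p>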






<h2 class="wp-block-heading">Pattern C: Square Bracket Cross-Database Identifiers (Issue #16424)</h2>



<p>MSSQL uses square brackets for identifier quoting, and cross-database queries use 3-part or 4-part names. <a href="https://github.com/open-metadata/OpenMetadata/issues/16424">Issue #16424</a> reported that views using <code>[database].[schema].[table]</code> syntax produce no lineage.</p>



<p>The root cause was a greedy regex <code>r"\[(.*)\]"</code> in OpenMetadata&#8217;s <code>parser.py</code> that matched across multiple bracket pairs, returning <code>"db].[schema"</code> instead of separate identifiers. Here&#8217;s a reconstructed example based on the issue description:</p>



<pre class="wp-block-code"><code>CREATE VIEW [ReportDB].[dbo].[vw_CustomerOrders]
AS
SELECT
    [SalesDB].[dbo].[Customers].[CustomerID],
    [SalesDB].[dbo].[Customers].[CustomerName],
    [SalesDB].[dbo].[Orders].[OrderID],
    [SalesDB].[dbo].[Orders].[OrderDate],
    [SalesDB].[dbo].[Orders].[TotalAmount]
FROM [SalesDB].[dbo].[Customers]
INNER JOIN [SalesDB].[dbo].[Orders]
    ON [SalesDB].[dbo].[Customers].[CustomerID] = [SalesDB].[dbo].[Orders].[CustomerID]
WHERE [SalesDB].[dbo].[Orders].[OrderDate] &gt;= '2024-01-01'</code></pre>
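<p>The greedy-match failure described in that root cause is easy to reproduce in a few lines of Python:</p>

```python
import re

ident = "[SalesDB].[dbo].[Customers]"

# Greedy: '.*' runs to the LAST ']', swallowing the inner bracket pairs
greedy = re.findall(r"\[(.*)\]", ident)

# Lazy: '.*?' stops at the first ']', yielding each identifier part separately
lazy = re.findall(r"\[(.*?)\]", ident)
```

<p>The greedy form returns the single mangled string <code>SalesDB].[dbo].[Customers</code>, while the lazy form yields the three identifier parts, which is why the one-character regex fix matters.</p>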



<p><strong>What GSP extracts:</strong></p>



<pre class="wp-block-code"><code>$ gsp-openmetadata-sidecar --sql-file om_issue_16424_square_brackets.sql --dry-run

Analyzing SQL (984 chars)...
Extracted 2 table-level lineage relationships
  SalesDB.dbo.Customers --&gt; ReportDB.dbo.vw_CustomerOrders (2 columns)
  SalesDB.dbo.Orders --&gt; ReportDB.dbo.vw_CustomerOrders (3 columns)

[DRY RUN] Would emit lineage:
  mssql.salesdb.dbo.customers --&gt; mssql.reportdb.dbo.vw_customerorders
    CustomerID → CustomerID
    CustomerName → CustomerName
  mssql.salesdb.dbo.orders --&gt; mssql.reportdb.dbo.vw_customerorders
    OrderID → OrderID
    OrderDate → OrderDate
    TotalAmount → TotalAmount</code></pre>



<p>GSP correctly handles the square brackets, resolves cross-database references (<code>SalesDB</code> → <code>ReportDB</code>), and extracts <strong>5 column-level lineage mappings</strong> across 2 source tables.</p>


<figure class="wp-block-image size-large"><img decoding="async" src="https://www.dpriver.com/blog/wp-content/uploads/2026/04/openmetadata-16424.png" alt="OpenMetadata lineage visualization showing cross-database column-level lineage from SalesDB.dbo.Customers and SalesDB.dbo.Orders to ReportDB.dbo.vw_CustomerOrders, with 5 column mappings recovered by GSP" /><figcaption class="wp-element-caption">Column-level lineage recovered by GSP for the cross-database view — 2 source tables, 5 column mappings, all square bracket identifiers resolved correctly.</figcaption></figure>






<h2 class="wp-block-heading">A Real-World Example: Everything Combined</h2>



<p>Let&#8217;s look at a more realistic stored procedure that combines all three patterns — the kind of ETL logic you&#8217;d find in any production MSSQL warehouse:</p>



<pre class="wp-block-code"><code>CREATE PROCEDURE [dbo].[usp_UpdateCustomerOrders]
AS
BEGIN
    SET NOCOUNT ON;

    SELECT
        c.customer_id, c.customer_name,
        o.order_id, o.order_date, o.total_amount
    INTO #staged_orders
    FROM [dbo].[customers] c
    INNER JOIN [dbo].[orders] o ON c.customer_id = o.customer_id
    WHERE o.order_date &gt;= DATEADD(day, -30, GETDATE());

    MERGE [dbo].[customer_order_summary] AS target
    USING #staged_orders AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED THEN
        UPDATE SET
            target.last_order_date = source.order_date,
            target.total_amount = source.total_amount,
            target.customer_name = source.customer_name
    WHEN NOT MATCHED THEN
        INSERT (customer_id, customer_name, last_order_date, total_amount)
        VALUES (source.customer_id, source.customer_name,
                source.order_date, source.total_amount);

    INSERT INTO [dbo].[audit_log] (action, record_count, run_date)
    SELECT 'usp_UpdateCustomerOrders', COUNT(*), GETDATE()
    FROM #staged_orders;

    DROP TABLE #staged_orders;
END</code></pre>



<p>This procedure has everything OpenMetadata can&#8217;t handle: <code>CREATE PROCEDURE</code>, <code>BEGIN/END</code>, a temp table (<code>#staged_orders</code>), a <code>MERGE</code> statement, and square bracket identifiers.</p>



<p><strong>What GSP extracts — 4 lineage edges, 12 column-level mappings:</strong></p>



<pre class="wp-block-code"><code>$ gsp-openmetadata-sidecar --sql-file mssql_stored_procedure.sql --dry-run

Extracted 4 table-level lineage relationships:
  dbo.customers --&gt; #staged_orders (2 columns: customer_id, customer_name)
  dbo.orders --&gt; #staged_orders (3 columns: order_id, order_date, total_amount)
  #staged_orders --&gt; dbo.customer_order_summary (4 columns)
  #staged_orders --&gt; dbo.audit_log (3 columns)</code></pre>



<figure class="wp-block-table"><table style="border-collapse:collapse;width:100%">
<thead><tr><th style="padding:8px 12px;border:1px solid #ddd;background:#f0f4f8;font-weight:bold">Source Table</th><th style="padding:8px 12px;border:1px solid #ddd;background:#f0f4f8;font-weight:bold">Target Table</th><th style="padding:8px 12px;border:1px solid #ddd;background:#f0f4f8;font-weight:bold">Column Mappings</th></tr></thead>
<tbody>
<tr><td style="padding:8px 12px;border:1px solid #ddd"><code>dbo.customers</code></td><td style="padding:8px 12px;border:1px solid #ddd"><code>#staged_orders</code></td><td style="padding:8px 12px;border:1px solid #ddd"><code>customer_id</code> → <code>customer_id</code>, <code>customer_name</code> → <code>customer_name</code></td></tr>
<tr><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb"><code>dbo.orders</code></td><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb"><code>#staged_orders</code></td><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb"><code>order_id</code> → <code>order_id</code>, <code>order_date</code> → <code>order_date</code>, <code>total_amount</code> → <code>total_amount</code></td></tr>
<tr><td style="padding:8px 12px;border:1px solid #ddd"><code>#staged_orders</code></td><td style="padding:8px 12px;border:1px solid #ddd"><code>dbo.customer_order_summary</code></td><td style="padding:8px 12px;border:1px solid #ddd"><code>customer_id</code>, <code>customer_name</code>, <code>order_date</code> → <code>last_order_date</code>, <code>total_amount</code></td></tr>
<tr><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb"><code>#staged_orders</code></td><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb"><code>dbo.audit_log</code></td><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb"><code>COUNT(*)</code> → <code>record_count</code>, <code>GETDATE()</code> → <code>run_date</code>, literal → <code>action</code></td></tr>
</tbody></table></figure>



<p>The full data flow through the temp table staging layer is preserved, including the MERGE upsert pattern and the audit log insert.</p>


<figure class="wp-block-image size-large"><img decoding="async" src="https://www.dpriver.com/blog/wp-content/uploads/2026/04/openmetadata-stored-procedure.png" alt="OpenMetadata lineage visualization showing the combined stored procedure lineage — customers and orders flowing through #staged_orders temp table to customer_order_summary, with column-level detail" /><figcaption class="wp-element-caption">Complete lineage recovered by GSP for the combined stored procedure — 4 table-level edges and 12 column-level mappings through the temp table staging layer, including the MERGE upsert pattern.</figcaption></figure>






<h2 class="wp-block-heading">Summary: All Patterns Solved</h2>



<figure class="wp-block-table"><table style="border-collapse:collapse;width:100%">
<thead><tr><th style="padding:8px 12px;border:1px solid #ddd;background:#f0f4f8;font-weight:bold">Issue</th><th style="padding:8px 12px;border:1px solid #ddd;background:#f0f4f8;font-weight:bold">Pattern</th><th style="padding:8px 12px;border:1px solid #ddd;background:#f0f4f8;font-weight:bold">OpenMetadata Today</th><th style="padding:8px 12px;border:1px solid #ddd;background:#f0f4f8;font-weight:bold">With gsp-openmetadata-sidecar</th></tr></thead>
<tbody>
<tr><td style="padding:8px 12px;border:1px solid #ddd"><a href="https://github.com/open-metadata/OpenMetadata/issues/17586">#17586</a></td><td style="padding:8px 12px;border:1px solid #ddd">A — BEGIN/END wrapper</td><td style="padding:8px 12px;border:1px solid #ddd">0 lineage (silently skipped)</td><td style="padding:8px 12px;border:1px solid #ddd"><strong>1 table-level, 1 column mapping</strong></td></tr>
<tr><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb"><a href="https://github.com/open-metadata/OpenMetadata/issues/25299">#25299</a></td><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb">B — Temp table chain</td><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb">0 lineage (3-hop fails)</td><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb"><strong>2 table-level, 2 column mappings</strong></td></tr>
<tr><td style="padding:8px 12px;border:1px solid #ddd"><a href="https://github.com/open-metadata/OpenMetadata/issues/16424">#16424</a></td><td style="padding:8px 12px;border:1px solid #ddd">C — Square brackets</td><td style="padding:8px 12px;border:1px solid #ddd">0 lineage (regex bug)</td><td style="padding:8px 12px;border:1px solid #ddd"><strong>2 table-level, 5 column mappings</strong></td></tr>
<tr><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb">Combined example</td><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb">A + B + C</td><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb">0 lineage</td><td style="padding:8px 12px;border:1px solid #ddd;background:#f9fafb"><strong>4 table-level, 12 column mappings</strong></td></tr>
</tbody></table></figure>






<h2 class="wp-block-heading">How It Works</h2>



<p>The <a href="https://github.com/gudusoftware/gsp-openmetadata-sidecar">gsp-openmetadata-sidecar</a> is a companion tool that runs alongside your OpenMetadata installation. It does not modify OpenMetadata — it uses the public REST API (<code>PUT /api/v1/lineage</code>) to push the lineage that OpenMetadata&#8217;s own parser misses.</p>



<pre class="wp-block-code"><code>OpenMetadata ingestion (unchanged)       gsp-openmetadata-sidecar
  query logs → collate-sqllineage          |
              |                            | 1. Parse SQL with Gudu SQLFlow
              v                            | 2. Resolve table FQNs via OM API
        silently skipped                   | 3. Push lineage via PUT /api/v1/lineage
        (lineage lost)                     v
                                     OpenMetadata (lineage restored)</code></pre>



<p><strong>What it does:</strong></p>



<ol class="wp-block-list">

<li><strong>Parses your SQL</strong> using <a href="https://sqlflow.gudusoft.com">Gudu SQLFlow</a> — a specialized SQL parser that handles procedural T-SQL natively (including <code>BEGIN/END</code>, <code>DECLARE</code>, temp tables, <code>MERGE</code>, dynamic SQL, cursors, and 20+ dialects)</li>

<li><strong>Resolves table names</strong> to OpenMetadata entity UUIDs via <code>GET /api/v1/tables/name/{fqn}</code></li>

<li><strong>Pushes lineage edges</strong> to OpenMetadata via <code>PUT /api/v1/lineage</code> with column-level detail</li>
</ol>
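<p>Step 3's request body can be sketched as below. The field names follow OpenMetadata's public lineage API as we understand it, but treat them as assumptions and check the API reference before relying on them:</p>

```python
def build_lineage_edge(from_uuid, to_uuid, column_pairs):
    """Assemble a PUT /api/v1/lineage body linking two table entities,
    with column-level detail. Field names are assumptions, not verified."""
    return {
        "edge": {
            "fromEntity": {"id": from_uuid, "type": "table"},
            "toEntity": {"id": to_uuid, "type": "table"},
            "lineageDetails": {
                "columnsLineage": [
                    {"fromColumns": [src], "toColumn": dst}
                    for src, dst in column_pairs
                ]
            },
        }
    }

body = build_lineage_edge("uuid-src", "uuid-dst",
                          [("sourceTable.columnName", "targetTable.columnName")])
```

<p>The UUIDs come from the step 2 lookup; since the call goes through the public API, upgrades of OpenMetadata itself are unaffected.</p>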



<p><strong>What it supports beyond MSSQL stored procedures:</strong></p>



<ul class="wp-block-list">

<li>BigQuery procedural SQL (<code>DECLARE</code>, <code>IF/THEN</code>, <code>CREATE TEMP TABLE</code>)</li>

<li>Oracle PL/SQL blocks</li>

<li>Snowflake scripting</li>

<li><code>MERGE INTO</code> column-level lineage (all dialects)</li>

<li>Multi-statement SQL scripts</li>

<li>Cross-database and cross-schema references</li>

<li>20+ SQL dialects total</li>
</ul>






<h2 class="wp-block-heading">Try It</h2>



<pre class="wp-block-code"><code># Install
pip install gsp-openmetadata-sidecar

# Test with the exact SQL from issue #25299 (dry run — no OM connection needed):
gsp-openmetadata-sidecar \
  --sql-file examples/om_issue_25299_create_procedure_begin_end.sql \
  --dry-run

# Test with your own stored procedures:
gsp-openmetadata-sidecar \
  --sql-file your_procedure.sql \
  --db-vendor dbvmssql \
  --dry-run

# Push lineage to your OpenMetadata instance:
gsp-openmetadata-sidecar \
  --sql-file your_procedure.sql \
  --om-server http://localhost:8585/api \
  --om-token "eyJ..." \
  --service-name mssql_prod \
  --database-name YourDB \
  --schema-name dbo</code></pre>



<p>The tool supports four backend modes: anonymous (no signup, 50 calls/day), authenticated (API key, 10,000 calls/month), self-hosted Docker (unlimited, data stays in your network), and local JAR (offline, no network at all).</p>



<p>Source code and documentation: <a href="https://github.com/gudusoftware/gsp-openmetadata-sidecar">github.com/gudusoftware/gsp-openmetadata-sidecar</a></p>



<p>&#8212;</p>



<h2 class="wp-block-heading">Related OpenMetadata Issues</h2>



<ul class="wp-block-list">

<li><a href="https://github.com/open-metadata/OpenMetadata/issues/16737">#16737</a> — Data Lineage Not Reflected for MSSQL Stored Procedure (open since June 2024)</li>

<li><a href="https://github.com/open-metadata/OpenMetadata/issues/25299">#25299</a> — Stored Procedure lineage is not supported for MS SQL connector (release backlog)</li>

<li><a href="https://github.com/open-metadata/OpenMetadata/issues/17586">#17586</a> — MS SQL Procedures Lineage Not Picked Up (partially fixed — filter removed, parser still fails)</li>

<li><a href="https://github.com/open-metadata/OpenMetadata/issues/16424">#16424</a> — Square bracket syntax breaks lineage (fixed in sqlfluff, but proc wrapper still fails)</li>

<li><a href="https://github.com/open-metadata/OpenMetadata/discussions/23717">Discussion #23717</a> — Cross-database MSSQL lineage (unanswered)</li>
</ul>



<h2 class="wp-block-heading">Also Affected by SQL Parsing Gaps?</h2>



<p>If you&#8217;re using <strong>DataHub</strong> and hitting similar SQL parsing failures, see our <a href="https://pypi.org/project/gsp-datahub-sidecar/">gsp-datahub-sidecar</a> — same approach, same engine, built for DataHub&#8217;s ingestion pipeline. It addresses BigQuery procedural SQL (<a href="https://github.com/datahub-project/datahub/issues/11654">#11654</a>), Power BI M-language lineage (<a href="https://github.com/datahub-project/datahub/issues/15327">#15327</a>), and MSSQL stored procedures (<a href="https://github.com/datahub-project/datahub/issues/12606">#12606</a>).</p>



<p>&#8212;</p>



<p><em>Disclosure: I work at <a href="https://www.gudusoft.com">Gudu Software</a>, the company behind GSP SQL Parser, SQLFlow, and the gsp-openmetadata-sidecar. All lineage results shown were produced by running the SQL examples through GSP&#8217;s T-SQL parser and verified against the actual SQL from the referenced OpenMetadata GitHub issues. The example SQL files used in this post are available in the <a href="https://github.com/gudusoftware/gsp-openmetadata-sidecar/tree/main/examples">gsp-openmetadata-sidecar repository</a>.</em></p>
<p>The post <a rel="nofollow" href="https://www.dpriver.com/blog/2026/04/openmetadata-mssql-stored-procedures-why-your-lineage-is-silently-empty-and-how/">OpenMetadata + MSSQL Stored Procedures: Why Your Lineage Is Silently Empty — and How to Fix It</a> appeared first on <a rel="nofollow" href="https://www.dpriver.com/blog">SQL and Data Blog</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>DataHub #15327: Why Power BI M-Language Queries Produce Zero Lineage — 4-Query Deep Analysis</title>
		<link>https://www.dpriver.com/blog/2026/04/datahub-15327-why-power-bi-m-language-queries-produce-zero-lineage-4-query-deep/</link>
		
		<dc:creator><![CDATA[James]]></dc:creator>
		<pubDate>Mon, 20 Apr 2026 06:30:33 +0000</pubDate>
				<category><![CDATA[gsp]]></category>
		<category><![CDATA[SQLFlow]]></category>
		<category><![CDATA[column-lineage]]></category>
		<category><![CDATA[data-governance]]></category>
		<category><![CDATA[datahub]]></category>
		<category><![CDATA[gsp-datahub-sidecar]]></category>
		<category><![CDATA[m-language]]></category>
		<category><![CDATA[power-bi]]></category>
		<category><![CDATA[Snowflake]]></category>
		<category><![CDATA[sqlglot]]></category>
		<guid isPermaLink="false">https://www.dpriver.com/blog/?p=3196</guid>

					<description><![CDATA[<p>Deep analysis of 4 Power BI M-language queries from DataHub issue #15327. Both patterns solved: Pattern A (M navigation chains) produces table-level lineage; Pattern B (Value.NativeQuery with embedded SQL) produces column-level lineage with 6 column mappings traced through expressions.</p>
<p>The post <a rel="nofollow" href="https://www.dpriver.com/blog/2026/04/datahub-15327-why-power-bi-m-language-queries-produce-zero-lineage-4-query-deep/">DataHub #15327: Why Power BI M-Language Queries Produce Zero Lineage — 4-Query Deep Analysis</a> appeared first on <a rel="nofollow" href="https://www.dpriver.com/blog">SQL and Data Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>DataHub issue <a href="https://github.com/datahub-project/datahub/issues/15327" target="_blank" rel="noopener">#15327</a> reports a common Power BI + DataHub problem: <strong>four Snowflake-connected Power BI datasets produce zero upstream lineage</strong>. No tables. No columns. The lineage graph just stops at the Power BI boundary.</p>
<p>We analyzed all four M-language queries from the issue, identified two distinct failure patterns, and ran real lineage extraction against each one using GSP&#8217;s Power Query M-language parser. Here&#8217;s what we found — and how we solve both patterns.</p>
<h2>Why DataHub Can&#8217;t Parse These Queries</h2>
<p>The root cause is that Power BI doesn&#8217;t generate SQL — it generates <strong>M-language</strong> (Power Query Formula Language). M is a full functional programming language (closer to F# than to SQL) that Power BI uses behind the scenes for all &#8220;Get Data&#8221; and &#8220;Transform Data&#8221; operations.</p>
<p>DataHub&#8217;s ingestion pipeline expects SQL. When it receives M-language, no SQL parser (sqlglot, sqllineage, or any other) can extract lineage from it because <em>it&#8217;s not SQL</em>.</p>
<h2>Two Patterns, Both Solved</h2>
<p>The four queries in #15327 fall into two fundamentally different categories. GSP&#8217;s Power Query M-language extractor (<code>dbvpowerquery</code>) handles both:</p>
<table>
<thead>
<tr>
<th>Pattern</th>
<th>Queries</th>
<th>What Happens</th>
<th>GSP Result</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Pattern A</strong> — M Navigation</td>
<td>Q1, Q3</td>
<td>Table reference encoded in M&#8217;s record-access syntax: <code>{[Name="X", Kind="Database"]}[Data]</code>. No SQL anywhere.</td>
<td><strong>Table-level lineage recovered</strong> via navigation chain parsing</td>
</tr>
<tr>
<td><strong>Pattern B</strong> — Value.NativeQuery()</td>
<td>Q2, Q4</td>
<td>M wraps actual SQL inside a <code>Value.NativeQuery()</code> call. SQL has M-encoded escapes (<code>#(lf)</code>, <code>""</code>).</td>
<td><strong>Column-level lineage recovered</strong> via SQL extraction + parsing</td>
</tr>
</tbody>
</table>
<h2>Pattern A: Pure M Navigation (Queries 1 &amp; 3)</h2>
<p>These queries use M-language&#8217;s navigation syntax to reach Snowflake objects — no SQL at all. GSP&#8217;s M-language parser walks the navigation chain <code>{[Name="X", Kind="Y"]}[Data]</code> to extract the full <code>database.schema.table</code> path.</p>
<h3>Query 1: Navigation to a Snowflake View</h3>
<p><strong>Original M-Language:</strong></p>
<pre><code>let
    Source = Snowflake.Databases(SnowFlakeConnector, SnowflakeWarehouse),
    my_Database = Source{[Name="PROD201_DB_redacted", Kind="Database"]}[Data],
    CONSUMPTION_Schema = my_Database{[Name="CONSUMPTION", Kind="Schema"]}[Data],
    Details_View = CONSUMPTION_Schema{[Name="details", Kind="View"]}[Data],
    #"Changed Type1" = Table.TransformColumnTypes(Details_View, ...),
    #"Filtered Rows" = Table.SelectRows(#"Changed Type1", ...)
in #"Filtered Rows"</code></pre>
<p><strong>GSP analysis:</strong> The parser identifies <code>Snowflake.Databases</code> as the connector, then walks the navigation chain through three steps:</p>
<ul>
<li><code>my_Database</code>: <code>{[Name="PROD201_DB_redacted", Kind="Database"]}</code> → Database level</li>
<li><code>CONSUMPTION_Schema</code>: <code>{[Name="CONSUMPTION", Kind="Schema"]}</code> → Schema level</li>
<li><code>Details_View</code>: <code>{[Name="details", Kind="View"]}</code> → View level</li>
</ul>
<p><strong>GSP result:</strong></p>
<pre><code>Navigation resolved:
  Step: Details_View | Vendor: dbvsnowflake
  Path: PROD201_DB_redacted.CONSUMPTION.details
  Synthetic SQL: SELECT * FROM "PROD201_DB_redacted"."CONSUMPTION"."details"</code></pre>
<p><strong>Upstream table recovered: <code>PROD201_DB_redacted.CONSUMPTION.details</code></strong> (table-level lineage)</p>
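<p>To make the navigation-chain idea concrete, here is a minimal Python sketch (an illustration, not GSP&#8217;s implementation) that pulls the database, schema, and table/view names out of the <code>{[Name="X", Kind="Y"]}[Data]</code> steps with a regex; the connector and warehouse names in the sample are placeholders:</p>

```python
import re

# One navigation step looks like: {[Name="X", Kind="Database"]}[Data]
STEP = re.compile(r'\{\[Name="([^"]+)",\s*Kind="([^"]+)"\]\}\[Data\]')

def resolve_navigation_path(m_script: str) -> str:
    """Collect Database -> Schema -> Table/View names from an M script
    and join them into a dotted path. Simplified: a real extractor also
    follows step references instead of scanning the whole script."""
    parts = {kind: name for name, kind in STEP.findall(m_script)}
    ordered = [parts.get("Database"), parts.get("Schema"),
               parts.get("Table") or parts.get("View")]
    return ".".join(p for p in ordered if p)

m = '''
let
    Source = Snowflake.Databases(Server, Warehouse),
    my_Database = Source{[Name="PROD_DB", Kind="Database"]}[Data],
    MySchema = my_Database{[Name="CONSUMPTION", Kind="Schema"]}[Data],
    MyView = MySchema{[Name="details", Kind="View"]}[Data]
in MyView
'''
print(resolve_navigation_path(m))  # PROD_DB.CONSUMPTION.details
```

<p>From the resolved path, emitting the synthetic <code>SELECT * FROM ...</code> shown above is a straightforward string template.</p>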
<h3>Query 3: Navigation to a Snowflake Table + Column Renaming</h3>
<p><strong>Original M-Language:</strong></p>
<pre><code>let
    Source = Snowflake.Databases("redacted.snowflakecomputing.com", #"Warehouse name"),
    my_Database = Source{[Name="PROD201_DB_redacted", Kind="Database"]}[Data],
    CONSUMPTION_Schema = my_Database{[Name="CONSUMPTION", Kind="Schema"]}[Data],
    MY_View = CONSUMPTION_Schema{[Name="MESSAGES", Kind="Table"]}[Data],
    #"Renamed Columns" = Table.RenameColumns(MY_View, {{"DATE","Date"}, ...}),
    #"Added Custom" = Table.AddColumn(#"Renamed Columns", "Message Type", ...)
in #"Added Custom"</code></pre>
<p><strong>GSP result:</strong></p>
<pre><code>Navigation resolved:
  Step: MY_View | Vendor: dbvsnowflake
  Path: PROD201_DB_redacted.CONSUMPTION.MESSAGES
  Synthetic SQL: SELECT * FROM "PROD201_DB_redacted"."CONSUMPTION"."MESSAGES"</code></pre>
<p><strong>Upstream table recovered: <code>PROD201_DB_redacted.CONSUMPTION.MESSAGES</code></strong> (table-level lineage). The <code>Table.RenameColumns</code> and <code>Table.AddColumn</code> steps are M-level transformations that don&#8217;t change the upstream source.</p>
<h2>Pattern B: Embedded SQL (Queries 2 &amp; 4)</h2>
<p>These queries wrap real Snowflake SQL inside M&#8217;s <code>Value.NativeQuery()</code> function. GSP extracts the SQL string, decodes M-specific escapes, infers the SQL dialect from the inline connector call, and parses the SQL for full column-level lineage.</p>
<h3>Query 2: Simple SELECT with #(lf) escape</h3>
<p><strong>Original M-Language:</strong></p>
<pre><code>let
    Source = Value.NativeQuery(
        Snowflake.Databases(SnowFlakeServer, SnowFlakeDWH)
            {[Name="PROD201_redacted"]}[Data],
        "SELECT *#(lf)FROM PROD201_DB_redacted.CONSUMPTION.CV_OFFER_redacted_CITY",
        null, [EnableFolding=true])
in Source</code></pre>
<p><strong>Step 1 — Decode M escapes:</strong> <code>#(lf)</code> → newline. <strong>Step 2 — Infer vendor:</strong> The inline <code>Snowflake.Databases()</code> tells GSP this is Snowflake SQL. <strong>Step 3 — Parse SQL:</strong> <code>SELECT *</code> produces a wildcard column mapping.</p>
<p><strong>GSP result:</strong></p>
<pre><code>NativeQuery resolved:
  Step: Source | Vendor: dbvsnowflake | Parse RC: 0
  Decoded SQL: SELECT *
               FROM PROD201_DB_redacted.CONSUMPTION.CV_OFFER_redacted_CITY
  Column lineage: * → * (all columns flow through)</code></pre>
<p><strong>Upstream: <code>PROD201_DB_redacted.CONSUMPTION.CV_OFFER_redacted_CITY</code></strong> — 1 table-level + 1 column-level (<code>* → *</code>) lineage.</p>
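<p>The decode in Step 1 is plain string preprocessing. A minimal Python sketch (an illustration, not GSP&#8217;s code) covering the escapes seen in these queries, including the doubled quotes that appear in Query 4 below:</p>

```python
def decode_m_sql(m_string: str) -> str:
    """Decode the M-language escapes found inside a Value.NativeQuery
    SQL string: #(lf)/#(cr)/#(tab) control characters, plus "" for a
    literal double quote inside an M string literal."""
    return (m_string
            .replace("#(lf)", "\n")
            .replace("#(cr)", "\r")
            .replace("#(tab)", "\t")
            .replace('""', '"'))  # M doubles quotes inside string literals

encoded = 'SELECT *#(lf)FROM PROD201_DB_redacted.CONSUMPTION.CV_OFFER_redacted_CITY'
print(decode_m_sql(encoded))
```

<p>Only after this decode does the string become valid Snowflake SQL that a dialect-aware parser can analyze.</p>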
<h3>Query 4: Complex Snowflake SQL with GROUP BY ALL, quoted aliases, expressions</h3>
<p><strong>Original M-Language</strong> (abbreviated):</p>
<pre><code>let
    Source = Value.NativeQuery(
        Snowflake.Databases("redacted.snowflakecomputing.com",
            WAREHOUSE, [Implementation="2.0"])
            {[Name="PROD201_DB_redacted"]}[Data],
        "
SELECT
  SERVICE_START_DATE AS ""Service Date"",
  TO_NUMBER(OPERATIVE_OFFICE) || '_' || statement || '_'
    || VENDOR_CODE || '_' || TO_CHAR(SERVICE_START_DATE, 'YYYYMMDD')
    AS OFFICE_STATEMENT_VENDOR_DATE,
  VENDOR_CODE
FROM PROD201_DB_redacted_DATAMARTS.SALES_MARKETING
     .CV_COMMDX_SALES_TRANSFERS
WHERE SERVICE_START_DATE &gt;= '2023-10-01' ...
GROUP BY ALL
HAVING SUM(COST_OF_SALES_CUR) = 0 AND COUNT(*) &gt; 1
        ", null, [EnableFolding=true]),
    #"TimeFilter" = Table.SelectRows(Source, ...)
in #"TimeFilter"</code></pre>
<p><strong>Step 1 — Decode M escapes:</strong> <code>""Service Date""</code> → <code>"Service Date"</code> (Snowflake quoted identifier). <strong>Step 2 — Infer vendor:</strong> Inline <code>Snowflake.Databases()</code> → Snowflake. <strong>Step 3 — Analyze SQL:</strong></p>
<ul>
<li><strong>Column 1:</strong> <code>SERVICE_START_DATE AS "Service Date"</code> — simple alias (1 mapping)</li>
<li><strong>Column 2:</strong> <code>TO_NUMBER(OPERATIVE_OFFICE) || '_' || statement || '_' || VENDOR_CODE || '_' || TO_CHAR(SERVICE_START_DATE, 'YYYYMMDD') AS OFFICE_STATEMENT_VENDOR_DATE</code> — concatenation with 4 source columns traced through <code>TO_NUMBER()</code> and <code>TO_CHAR()</code> (4 mappings)</li>
<li><strong>Column 3:</strong> <code>VENDOR_CODE</code> — direct pass-through (1 mapping)</li>
</ul>
<p><strong>GSP result — 6 column-level lineage mappings:</strong></p>
<table>
<thead>
<tr>
<th>Source Column</th>
<th></th>
<th>Target Column</th>
<th>Why</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>SERVICE_START_DATE</code></td>
<td>→</td>
<td><code>SERVICE DATE</code></td>
<td>Direct alias: <code>AS "Service Date"</code></td>
</tr>
<tr>
<td><code>OPERATIVE_OFFICE</code></td>
<td>→</td>
<td><code>OFFICE_STATEMENT_VENDOR_DATE</code></td>
<td>Via <code>TO_NUMBER()</code> in concat</td>
</tr>
<tr>
<td><code>STATEMENT</code></td>
<td>→</td>
<td><code>OFFICE_STATEMENT_VENDOR_DATE</code></td>
<td>Direct in <code>||</code> concat</td>
</tr>
<tr>
<td><code>VENDOR_CODE</code></td>
<td>→</td>
<td><code>OFFICE_STATEMENT_VENDOR_DATE</code></td>
<td>Direct in <code>||</code> concat</td>
</tr>
<tr>
<td><code>SERVICE_START_DATE</code></td>
<td>→</td>
<td><code>OFFICE_STATEMENT_VENDOR_DATE</code></td>
<td>Via <code>TO_CHAR()</code> in concat</td>
</tr>
<tr>
<td><code>VENDOR_CODE</code></td>
<td>→</td>
<td><code>VENDOR_CODE</code></td>
<td>Direct pass-through</td>
</tr>
</tbody>
</table>
<p><strong>Upstream: <code>PROD201_DB_redacted_DATAMARTS.SALES_MARKETING.CV_COMMDX_SALES_TRANSFERS</code></strong> — 1 table-level + 6 column-level lineages.</p>
<h2>Summary: All 4 Queries Solved</h2>
<table>
<thead>
<tr>
<th>Query</th>
<th>Pattern</th>
<th>Upstream Table</th>
<th>DataHub Today</th>
<th>With GSP</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Q1</strong></td>
<td>A (M Navigation)</td>
<td><code>PROD201_DB_redacted.CONSUMPTION.details</code></td>
<td>0 lineage</td>
<td><strong>Table-level lineage</strong></td>
</tr>
<tr>
<td><strong>Q2</strong></td>
<td>B (NativeQuery)</td>
<td><code>PROD201_DB_redacted.CONSUMPTION.CV_OFFER_redacted_CITY</code></td>
<td>0 lineage</td>
<td><strong>1 table + 1 column (*)</strong></td>
</tr>
<tr>
<td><strong>Q3</strong></td>
<td>A (M Navigation)</td>
<td><code>PROD201_DB_redacted.CONSUMPTION.MESSAGES</code></td>
<td>0 lineage</td>
<td><strong>Table-level lineage</strong></td>
</tr>
<tr>
<td><strong>Q4</strong></td>
<td>B (NativeQuery)</td>
<td><code>...DATAMARTS.SALES_MARKETING.CV_COMMDX_SALES_TRANSFERS</code></td>
<td>0 lineage</td>
<td><strong>1 table + 6 columns</strong></td>
</tr>
</tbody>
</table>
<h2>How It Works</h2>
<p>GSP&#8217;s Power Query module (<code>dbvpowerquery</code>) is a lightweight M-language extractor — not a full M parser. It answers exactly 6 questions about any M document:</p>
<ol>
<li>What are the <code>let</code> bindings?</li>
<li>Which binding is returned by <code>in</code>?</li>
<li>Does a binding contain <code>Value.NativeQuery()</code>? → Extract SQL, decode escapes, delegate to vendor SQL parser</li>
<li>Does a binding represent a navigation chain? → Extract <code>database.schema.table</code></li>
<li>What connector function is at the root? → Infer SQL dialect (Snowflake, SQL Server, Oracle, PostgreSQL, MySQL, BigQuery, Redshift, Databricks, Hive)</li>
<li>Which bindings reference other bindings? → Follow step references</li>
</ol>
<p>Everything else in M (closures, <code>each</code>, <code>Table.*</code> transforms, type system) is parsed permissively — unknown constructs produce warnings, never wrong lineage.</p>
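<p>The first two questions (the <code>let</code> bindings and the <code>in</code> result) can be sketched in a few lines of Python. This is a deliberately naive illustration that assumes one binding per line; real M allows nested expressions that need a proper tokenizer:</p>

```python
import re

def parse_let_bindings(m_script: str):
    """Extract the let-bindings and the name returned by 'in'.
    Simplification: assumes one 'name = expr,' binding per line."""
    frame = re.search(r'\blet\b(.*)\bin\b\s*(.+?)\s*$', m_script, re.S)
    if not frame:
        return {}, None
    bindings = {}
    for line in frame.group(1).split(",\n"):
        name, eq, expr = line.partition("=")
        if eq:  # skip lines without a binding
            bindings[name.strip()] = expr.strip()
    return bindings, frame.group(2)

m = """let
    Source = Csv.Document(File.Contents("data.csv")),
    Renamed = Table.RenameColumns(Source, {{"A", "B"}})
in Renamed"""

bindings, result = parse_let_bindings(m)
print(result)            # Renamed
print(sorted(bindings))  # ['Renamed', 'Source']
```

<p>Question 6 (which bindings reference which) then reduces to scanning each binding&#8217;s expression for the names of the other bindings, starting from the <code>in</code> result.</p>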
<h2>Try It</h2>
<pre><code># Install the sidecar
pip install git+https://github.com/gudusoftware/gsp-datahub-sidecar.git

# Test with your Power BI queries
gsp-datahub-sidecar --mode authenticated \
  --user-id YOUR_USER_ID \
  --secret-key YOUR_SECRET_KEY \
  --sql-file your_query.sql \
  --db-vendor dbvsnowflake \
  --dry-run</code></pre>
<p>Sign up for a free API key at <a href="https://docs.gudusoft.com/sign-up/" target="_blank" rel="noopener">docs.gudusoft.com/sign-up</a>. For a full interactive analysis, see our <a href="https://www.sqlparser.com/datahub/powerbi-m-language/" target="_blank" rel="noopener">dedicated analysis page</a>.</p>
<h2>Related Issues</h2>
<ul>
<li><a href="https://github.com/datahub-project/datahub/issues/15327" target="_blank" rel="noopener">#15327</a> — No upstream lineage tracked for specific Power BI tables (this analysis)</li>
<li><a href="https://github.com/datahub-project/datahub/issues/11251" target="_blank" rel="noopener">#11251</a> — Power BI SQL comments break lineage (#(lf) encoding) — <a href="https://www.sqlparser.com/datahub/powerbi-lineage/">solved</a></li>
<li><a href="https://github.com/datahub-project/datahub/issues/11654" target="_blank" rel="noopener">#11654</a> — BigQuery procedural SQL drops lineage — <a href="https://www.sqlparser.com/datahub/bigquery-lineage/">solved</a></li>
</ul>
<p><em>Disclosure: I work at <a href="https://www.gudusoft.com" target="_blank" rel="noopener">Gudu Software</a>, the company behind GSP SQL Parser, SQLFlow, and the gsp-datahub-sidecar. All lineage results shown were produced by running the M scripts through GSP&#8217;s Power Query parser (<code>dbvpowerquery</code>) and verified against the actual M-language queries from DataHub issue #15327.</em></p>
<p>The post <a rel="nofollow" href="https://www.dpriver.com/blog/2026/04/datahub-15327-why-power-bi-m-language-queries-produce-zero-lineage-4-query-deep/">DataHub #15327: Why Power BI M-Language Queries Produce Zero Lineage — 4-Query Deep Analysis</a> appeared first on <a rel="nofollow" href="https://www.dpriver.com/blog">SQL and Data Blog</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Why Power BI SQL Comments Break Your DataHub Lineage — and How to Fix It</title>
		<link>https://www.dpriver.com/blog/2026/04/why-power-bi-sql-comments-break-your-datahub-lineage-and-how-to-fix-it/</link>
		
		<dc:creator><![CDATA[James]]></dc:creator>
		<pubDate>Sun, 19 Apr 2026 06:24:12 +0000</pubDate>
				<category><![CDATA[gsp]]></category>
		<category><![CDATA[sql]]></category>
		<category><![CDATA[SQLFlow]]></category>
		<category><![CDATA[column-lineage]]></category>
		<category><![CDATA[data-governance]]></category>
		<category><![CDATA[datahub]]></category>
		<category><![CDATA[m-language]]></category>
		<category><![CDATA[power-bi]]></category>
		<category><![CDATA[sql-comments]]></category>
		<category><![CDATA[sqlglot]]></category>
		<guid isPermaLink="false">https://www.dpriver.com/blog/?p=3192</guid>

					<description><![CDATA[<p>Power BI encodes newlines as #(lf) in M-language SQL. When DataHub parses queries with -- comments, it silently drops all subsequent JOINs from lineage. The gsp-datahub-sidecar recovers every missing relationship.</p>
<p>The post <a rel="nofollow" href="https://www.dpriver.com/blog/2026/04/why-power-bi-sql-comments-break-your-datahub-lineage-and-how-to-fix-it/">Why Power BI SQL Comments Break Your DataHub Lineage — and How to Fix It</a> appeared first on <a rel="nofollow" href="https://www.dpriver.com/blog">SQL and Data Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>If you use Power BI with DataHub, some of your lineage is probably missing right now — and you don&#8217;t know it.</p>



<h2 class="wp-block-heading">Background: How Power BI Handles SQL</h2>



<p>A common question from newcomers: does Power BI have its own SQL dialect? The short answer is <strong>no</strong>. Power BI connects to external databases — SQL Server, Snowflake, PostgreSQL, and others — and sends SQL queries written in <em>that database&#8217;s native dialect</em>. If your Power BI report pulls from SQL Server, the SQL inside is T-SQL; if it pulls from Snowflake, it&#8217;s Snowflake SQL.</p>



<p>However, Power BI does have its own query language called <strong>M</strong> (also known as Power Query Formula Language). M wraps everything — including embedded SQL. When you write a <code>Value.NativeQuery</code> expression in Power BI, the SQL string lives inside an M expression, and M applies its own encoding rules.</p>



<p>The most important encoding rule: <strong>M replaces actual newline characters with <code>#(lf)</code></strong> (M&#8217;s literal for &#8220;line feed&#8221;). So a query that a human writes as:</p>



<pre class="wp-block-code"><code>SELECT name
FROM customers
-- old filter
WHERE active = 1</code></pre>



<p>gets stored by Power BI as:</p>



<pre class="wp-block-code"><code>select#(lf)name#(lf)from customers#(lf)-- old filter#(lf)where active = 1</code></pre>



<p>This is where the trouble starts. The SQL itself is standard — the encoding around it is not.</p>



<h2 class="wp-block-heading">The Problem: Why Comments Break Lineage</h2>



<p>In standard SQL, <code>--</code> marks a single-line comment: everything from <code>--</code> to the next <em>actual newline character</em> is ignored. But when Power BI encodes newlines as <code>#(lf)</code>, those aren&#8217;t real newline characters — they&#8217;re just ordinary text. So any SQL parser that receives this encoded string sees the <code>--</code> comment as running all the way to the end of the <em>entire query</em>, because there is no real newline to terminate it.</p>



<p>This is not a parser bug in sqlglot or any other SQL parser — it&#8217;s a <strong>missing preprocessing step</strong>. Before parsing, someone needs to decode M&#8217;s <code>#(lf)</code> sequences back into real newline characters. DataHub&#8217;s Power BI ingestion connector does not do this, and neither does GSP (General SQL Parser) on its own — no SQL parser should be expected to understand Power BI M-language encoding.</p>
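<p>A few lines of Python make the failure mode concrete (a sketch of what any SQL lexer does, not DataHub&#8217;s actual code):</p>

```python
import re

def strip_line_comments(sql: str) -> str:
    """Remove -- comments the way a SQL lexer does: from -- to the
    next real newline character (or end of input)."""
    return re.sub(r'--[^\n]*', '', sql)

encoded = ('select name#(lf)from customers#(lf)'
           '-- old filter#(lf)where active = 1')

# On the encoded form there is no real newline after '--', so the
# comment swallows the rest of the query, WHERE clause included.
assert strip_line_comments(encoded) == 'select name#(lf)from customers#(lf)'

# Decode #(lf) first, and the comment ends where the author intended.
decoded = encoded.replace('#(lf)', '\n')
assert 'where active = 1' in strip_line_comments(decoded)
```

<p>The fix is therefore a one-line decode applied before parsing, not a change to any SQL parser.</p>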



<p>Here&#8217;s a real query from <a href="https://github.com/datahub-project/datahub/issues/11251">DataHub issue #11251</a> that demonstrates the problem:</p>



<pre class="wp-block-code"><code>select
upper(cs.customercode) as customercode
, cs.ear2id as ear2id
, db.branch_rollup_name
, db.subregion1
, db.subregion2
from nast_truckload_domain.broker.dim_customer cs
--join ... (commented out)
join nast_customer_domain.broker.dim_customer_ear2_single_ownership_history as so
  on cs.ear2id = so.ear2_id and is_current = true
join nast_customer_domain.broker.ref_branch db
  on db.Branch_code = so.branch_code
-- join ... (commented out)
where cs.customerstatusid = 1 --active
and db.primary_business_line_id in ('62','73')</code></pre>



<p>This query references two upstream tables: <code>dim_customer</code> and <code>ref_branch</code>. But when Power BI encodes it with <code>#(lf)</code> and DataHub&#8217;s parser reads it, the <code>--join</code> comment on line 8 swallows everything after it — because <code>#(lf)</code> doesn&#8217;t terminate the comment. DataHub sees only one table.</p>



<h2 class="wp-block-heading">Why It&#8217;s Worse Than It Looks</h2>



<p>SQL comments are everywhere in Power BI datasets. Developers comment out JOINs while debugging, leave notes like <code>--active</code> next to filter conditions, and disable WHERE clauses during development. This is normal SQL practice — but when Power BI encodes these queries with <code>#(lf)</code>, every single <code>--</code> comment becomes a lineage-destroying landmine.</p>



<p>Any Power BI dataset with even a single commented-out JOIN or WHERE clause will produce incomplete lineage. The lineage graph shows only a fraction of the upstream tables, and DataHub gives no warning. We verified this: sqlglot handles <code>--</code> comments perfectly when the newlines are real — the problem is entirely that <code>#(lf)</code> is not decoded before parsing.</p>



<p>The issue has been open since August 2024. Three users have confirmed it blocks DataHub adoption in their organizations:</p>



<blockquote class="wp-block-quote"><p>&#8220;This is something very important. Especially when SQL comments are very common in PBI M-queries.&#8221; — @AntonisCSt</p></blockquote>



<blockquote class="wp-block-quote"><p>&#8220;We also run into this issue and it is a real issue for part of our business to accept Datahub as a common catalog solution.&#8221; — @rospe</p></blockquote>



<h2 class="wp-block-heading">The Fix: gsp-datahub-sidecar</h2>



<p>The <a href="https://github.com/gudusoftware/gsp-datahub-sidecar">gsp-datahub-sidecar</a> solves this with a two-step approach:</p>



<ol class="wp-block-list">
<li><strong>Preprocessing</strong>: The sidecar decodes Power BI&#8217;s M-language escape sequences (<code>#(lf)</code>, <code>#(cr)</code>, <code>#(tab)</code>) back into real characters before any SQL parsing happens. This is the step that DataHub&#8217;s ingestion pipeline is missing.</li>
<li><strong>Parsing</strong>: The decoded SQL — now with real newlines — is sent to Gudu&#8217;s General SQL Parser (GSP) via SQLFlow, which correctly handles comments, JOINs, and WHERE clauses to extract complete lineage.</li>
</ol>



<p>We verified this with the exact query from the issue. First, the raw Power BI encoded form (all on one line with <code>#(lf)</code>):</p>



<pre class="wp-block-code"><code>select#(lf)upper(cs.customercode) as customercode#(lf), cs.ear2id ...#(lf)from nast_truckload_domain.broker.dim_customer cs#(lf)--join ...#(lf)join ...#(lf)where cs.customerstatusid = 1 --active#(lf)and ...</code></pre>



<p>Without preprocessing, <strong>every SQL parser fails</strong> — GSP included. The <code>--join</code> on the encoded &#8220;line 8&#8221; swallows everything after it. Zero lineage extracted.</p>



<p>With the sidecar&#8217;s preprocessing, the <code>#(lf)</code> sequences become real newlines, and GSP correctly parses the result:</p>



<pre class="wp-block-code"><code>$ gsp-datahub-sidecar \
    --sql-file powerbi_comments.sql \
    --db-vendor dbvmssql \
    --dry-run

Extracted 2 table-level lineage relationships
  NAST_TRUCKLOAD_DOMAIN.BROKER.DIM_CUSTOMER --&gt; DBO.CUSTOMER_BRANCHES (2 columns)
  NAST_CUSTOMER_DOMAIN.BROKER.REF_BRANCH --&gt; DBO.CUSTOMER_BRANCHES (3 columns)
Column-level mappings extracted: 5</code></pre>



<p>Both upstream tables found. All five column-level lineages recovered:</p>



<ul class="wp-block-list">
<li><code>customercode</code> ← <code>dim_customer.customercode</code> (through <code>UPPER()</code>)</li>
<li><code>ear2id</code> ← <code>dim_customer.ear2id</code></li>
<li><code>branch_rollup_name</code> ← <code>ref_branch.branch_rollup_name</code></li>
<li><code>subregion1</code> ← <code>ref_branch.subregion1</code></li>
<li><code>subregion2</code> ← <code>ref_branch.subregion2</code></li>
</ul>



<p>Comments correctly stripped. JOINs preserved. WHERE clause intact.</p>



<h2 class="wp-block-heading">Quick Start</h2>



<p>Three commands to recover your Power BI lineage:</p>



<pre class="wp-block-code"><code># 1. Install the sidecar
pip install git+https://github.com/gudusoftware/gsp-datahub-sidecar.git

# 2. Verify with the built-in example
gsp-datahub-sidecar \
  --sql-file examples/powerbi_comments.sql \
  --db-vendor dbvmssql \
  --dry-run

# 3. Point at your DataHub instance
gsp-datahub-sidecar \
  --sql-file your_query.sql \
  --db-vendor dbvmssql \
  --datahub-url http://your-datahub:8080</code></pre>



<p>The sidecar emits standard DataHub MCPs (Metadata Change Proposals) via the REST API — DataHub merges the lineage with your existing metadata automatically.</p>
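<p>For illustration, an MCP carrying table-level upstream lineage has roughly the shape below. The field names follow DataHub&#8217;s documented metadata model, but this is a hand-built sketch, not the sidecar&#8217;s code; verify the schema against your DataHub version before relying on it:</p>

```python
import json

def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """DataHub dataset URN: urn:li:dataset:(urn:li:dataPlatform:<p>,<name>,<env>)"""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

def upstream_lineage_mcp(target_urn: str, upstream_urns: list) -> dict:
    """Sketch of a Metadata Change Proposal with an upstreamLineage aspect,
    as accepted by DataHub's REST ingestion endpoint."""
    return {
        "entityType": "dataset",
        "entityUrn": target_urn,
        "changeType": "UPSERT",
        "aspectName": "upstreamLineage",
        "aspect": {
            "upstreams": [
                {"dataset": u, "type": "TRANSFORMED"} for u in upstream_urns
            ]
        },
    }

mcp = upstream_lineage_mcp(
    make_dataset_urn("mssql", "yourdb.dbo.customer_branches"),
    [make_dataset_urn("mssql", "nast_truckload_domain.broker.dim_customer")],
)
print(json.dumps(mcp, indent=2))
```

<p>Because it is an UPSERT against a single aspect, DataHub merges the edge into whatever lineage the ingestion pipeline already recorded for that dataset.</p>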



<h2 class="wp-block-heading">Your Power BI Connects to More Than One Database</h2>



<p>The <code>#(lf)</code> encoding problem is Power BI specific, but the SQL inside your Power BI datasets comes from many different databases. One report might query SQL Server, another Snowflake, another Oracle. The sidecar handles all of them — GSP supports 20+ SQL dialects, so you just change the <code>--db-vendor</code> flag to match your backend.</p>



<p>This matters because SQL dialects differ in syntax — T-SQL&#8217;s <code>TOP</code>, Snowflake&#8217;s <code>QUALIFY</code>, Oracle&#8217;s <code>CONNECT BY</code> — and a parser that only handles one dialect will silently drop lineage from the others. The sidecar gives you correct lineage regardless of which database your Power BI report connects to.</p>



<h2 class="wp-block-heading">Part of a Broader Pattern</h2>



<p>The <code>#(lf)</code> comment issue is one example of a broader pattern: DataHub&#8217;s parser silently drops lineage on SQL constructs it can&#8217;t fully parse. We&#8217;ve documented similar gaps with <a href="https://www.dpriver.com/blog/2026/04/why-your-datahub-bigquery-lineage-silently-breaks-on-procedural-sql-and-how-to-f/">BigQuery procedural SQL</a> (DECLARE, IF/END IF, temp tables) and <a href="https://github.com/datahub-project/datahub/issues/11670">dbt deduplication macros</a> (array_agg with struct unpacking).</p>



<p>The sidecar fills these gaps without replacing DataHub&#8217;s parser — it augments it.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<p><em>Disclosure: The gsp-datahub-sidecar is built by <a href="https://www.gudusoft.com">Gudu Software</a>, the company behind General SQL Parser and SQLFlow. The sidecar is open source under the Apache 2.0 license.</em></p>



<p><em>Have a SQL parsing challenge? <a href="https://sqlflow.gudusoft.com">Try SQLFlow</a> with your own SQL, or visit the <a href="https://github.com/gudusoftware/gsp-datahub-sidecar">sidecar repo</a> to get started.</em></p>
<p>The post <a rel="nofollow" href="https://www.dpriver.com/blog/2026/04/why-power-bi-sql-comments-break-your-datahub-lineage-and-how-to-fix-it/">Why Power BI SQL Comments Break Your DataHub Lineage — and How to Fix It</a> appeared first on <a rel="nofollow" href="https://www.dpriver.com/blog">SQL and Data Blog</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Why Your DataHub BigQuery Lineage Silently Breaks on Procedural SQL — and How to Fix It</title>
		<link>https://www.dpriver.com/blog/2026/04/why-your-datahub-bigquery-lineage-silently-breaks-on-procedural-sql-and-how-to-f/</link>
		
		<dc:creator><![CDATA[James]]></dc:creator>
		<pubDate>Sat, 18 Apr 2026 03:42:05 +0000</pubDate>
				<category><![CDATA[gsp]]></category>
		<category><![CDATA[sql]]></category>
		<category><![CDATA[SQLFlow]]></category>
		<category><![CDATA[bigquery]]></category>
		<category><![CDATA[column-lineage]]></category>
		<category><![CDATA[data-governance]]></category>
		<category><![CDATA[datahub]]></category>
		<category><![CDATA[procedural-sql]]></category>
		<category><![CDATA[sql-lineage]]></category>
		<category><![CDATA[sqlglot]]></category>
		<guid isPermaLink="false">https://www.dpriver.com/blog/?p=3187</guid>

					<description><![CDATA[<p>DataHub's BigQuery ingestion silently loses lineage on procedural SQL (DECLARE, IF/THEN, CALL). This post explains why, and shows how to recover the missing lineage using an open-source sidecar tool.</p>
<p>The post <a rel="nofollow" href="https://www.dpriver.com/blog/2026/04/why-your-datahub-bigquery-lineage-silently-breaks-on-procedural-sql-and-how-to-f/">Why Your DataHub BigQuery Lineage Silently Breaks on Procedural SQL — and How to Fix It</a> appeared first on <a rel="nofollow" href="https://www.dpriver.com/blog">SQL and Data Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>If you use DataHub to track data lineage from BigQuery, you may have noticed something unsettling: some of your most important queries produce <strong>no lineage at all</strong>. No upstream tables, no column mappings — just a silent gap in your lineage graph.</p>
<p>This isn&#8217;t a DataHub bug. It&#8217;s a parser limitation that affects a specific but increasingly common class of SQL: <strong>GoogleSQL procedural blocks</strong>.</p>
<p>This post explains exactly what breaks, why it happens, and how to recover the missing lineage using an open-source sidecar tool — without modifying your DataHub installation.</p>
<h2 class="wp-block-heading">What breaks: procedural SQL in BigQuery</h2>
<p>BigQuery&#8217;s GoogleSQL supports procedural constructs: <code>DECLARE</code>, <code>IF/THEN/END IF</code>, <code>CALL</code>, <code>BEGIN...EXCEPTION...END</code>, and <code>CREATE TEMP TABLE</code> inside scripting blocks. These are widely used in production BigQuery workloads for ETL orchestration, conditional logic, and error handling.</p>
<p>Here&#8217;s a simplified example (derived from <a href="https://github.com/datahub-project/datahub/issues/11654">datahub-project/datahub#11654</a>):</p>
<pre class="wp-block-code"><code>DECLARE current_job_start TIMESTAMP DEFAULT CURRENT_TIMESTAMP;
DECLARE partitions STRUCT&lt;max_record_ts TIMESTAMP, dates ARRAY&lt;DATE&gt;&gt;;

CALL `internal_project.get_partitions`(
  ('project.dataset.view_name', 'EventTimestamp'),
  partitions
);

IF ARRAY_LENGTH(partitions.dates) &gt; 0 THEN
  CREATE OR REPLACE TEMP TABLE temp_table AS
  SELECT * EXCEPT (SnapshotTimestamp)
  FROM `project.dataset.view_name`
  WHERE DATE(EventTimestamp) IN UNNEST(partitions.dates)
  ORDER BY EventTimestamp;

  -- temp_table_delta is derived from temp_table in a step omitted here
  IF (SELECT COUNT(1) FROM temp_table_delta) &gt; 0 THEN
    CREATE OR REPLACE TEMP TABLE final_output AS
    SELECT DISTINCT IDField, Email, UserID,
           EventTimestamp, BusinessDate
    FROM temp_table_delta
    WHERE EventTimestamp BETWEEN '2023-01-01' AND '2023-12-31';
  END IF;
END IF;</code></pre>
<p>This SQL contains two clear lineage relationships:</p>
<ul class="wp-block-list">
<li><code>project.dataset.view_name</code> → <code>temp_table</code></li>
<li><code>temp_table_delta</code> → <code>final_output</code> (5 columns)</li>
</ul>
<p>But when DataHub ingests this query, <strong>both relationships are lost</strong>.</p>
<h2 class="wp-block-heading">Why it happens</h2>
<p>DataHub&#8217;s BigQuery ingestion uses sqlglot for SQL parsing and lineage extraction. sqlglot handles standard SQL well — <code>SELECT</code>, <code>INSERT</code>, <code>CREATE VIEW</code>, joins, CTEs, window functions — but its BigQuery dialect does not support procedural language constructs.</p>
<p>When sqlglot encounters <code>DECLARE</code>, <code>IF/THEN</code>, or <code>CALL</code>, it falls back to an opaque <code>Command</code> node. The SQL inside the procedural block is not parsed, and no lineage is extracted. This isn&#8217;t a bug in sqlglot — procedural SQL is genuinely hard to parse, and full procedural language support is outside sqlglot&#8217;s current scope.</p>
<p>The result: DataHub silently drops lineage for any BigQuery query that uses procedural syntax. No error message, no warning in the UI — the lineage simply doesn&#8217;t appear.</p>
<p>This issue was first reported in October 2024 (<a href="https://github.com/datahub-project/datahub/issues/11654">datahub-project/datahub#11654</a>) and remains open.</p>
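<p>You can estimate how much of your own workload is affected before reaching for any tool. The following is a minimal heuristic sketch (a hypothetical helper, not part of DataHub, sqlglot, or the sidecar) that flags SQL scripts likely to hit the procedural fallback:</p>

```python
import re

# Procedural GoogleSQL markers that commonly trigger the opaque
# Command fallback (illustrative list, not exhaustive).
PROCEDURAL_MARKERS = re.compile(
    r"^\s*(DECLARE\s|IF\s|CALL\s|BEGIN\b|WHILE\s|LOOP\b|EXECUTE\s+IMMEDIATE\b)",
    re.IGNORECASE | re.MULTILINE,
)

def is_likely_procedural(sql: str) -> bool:
    """Heuristic: does this script use procedural constructs?"""
    return bool(PROCEDURAL_MARKERS.search(sql))

scripts = [
    "SELECT id, email FROM `project.dataset.users`",
    "DECLARE ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP;",
    "IF ARRAY_LENGTH(parts) > 0 THEN SELECT 1; END IF;",
]
flagged = [s for s in scripts if is_likely_procedural(s)]
print(f"{len(flagged)} of {len(scripts)} scripts would likely lose lineage")
```

<p>Running a pass like this over a day of BigQuery audit-log SQL gives you a rough count of statements falling into the gap, before you deploy anything.</p>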
<h2 class="wp-block-heading">The fix: a post-processor sidecar</h2>
<p>We built <a href="https://github.com/gudusoftware/gsp-datahub-sidecar">gsp-datahub-sidecar</a>, an open-source (Apache 2.0) tool that recovers the missing lineage. It works as a post-processor alongside your existing DataHub ingestion — no changes to your DataHub installation required.</p>
<p>The sidecar re-parses the SQL that sqlglot couldn&#8217;t handle using <a href="https://sqlflow.gudusoft.com">Gudu SQLFlow</a>, which supports procedural SQL natively across 20+ dialects, including BigQuery&#8217;s GoogleSQL. It then emits the recovered lineage to DataHub via the standard REST API.</p>
<pre class="wp-block-code"><code>DataHub ingestion (unchanged)       gsp-datahub-sidecar
  BQ audit log → sqlglot → lineage     |
                    |                   | re-parse procedural SQL
                    v                   | with Gudu SQLFlow
              "Command fallback"        v
              (lineage lost)      DataHub GMS (lineage restored)</code></pre>
<p>The sidecar does <strong>not</strong> replace or interfere with sqlglot-generated lineage. It adds to it — filling the gaps that procedural SQL creates.</p>
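<p>&#8220;Adds to it&#8221; can be pictured as a plain union of lineage edges. A toy sketch of the merge semantics (illustrative table names; this is not the sidecar&#8217;s actual code):</p>

```python
# Lineage edges as (upstream, downstream) pairs.
sqlglot_edges = {
    # Standard SQL that sqlglot parsed fine.
    ("project.dataset.raw_events", "project.dataset.daily_agg"),
}
sidecar_edges = {
    # Recovered from the procedural script above.
    ("project.dataset.view_name", "temp_table"),
    ("temp_table_delta", "final_output"),
}

# Union: existing edges are never removed or rewritten,
# and re-emitting an already-known edge is a no-op.
merged = sqlglot_edges | sidecar_edges
assert sqlglot_edges <= merged
print(len(merged))
```
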
<h2 class="wp-block-heading">How to use it</h2>
<h3 class="wp-block-heading">Step 1: Install</h3>
<pre class="wp-block-code"><code>pip install git+https://github.com/gudusoftware/gsp-datahub-sidecar.git</code></pre>
<h3 class="wp-block-heading">Step 2: Dry-run (no DataHub needed)</h3>
<p>Preview the lineage that would be recovered:</p>
<pre class="wp-block-code"><code>gsp-datahub-sidecar \
  --sql-file examples/bigquery_procedural.sql \
  --dry-run</code></pre>
<p>Output:</p>
<pre class="wp-block-code"><code>Processing 1 SQL statement(s) in 'anonymous' mode...
  Lineage: PROJECT.DATASET.VIEW_NAME --&gt; TEMP_TABLE (12 columns)
  Lineage: TEMP_TABLE_DELTA --&gt; FINAL_OUTPUT (5 columns)
[DRY RUN] Would emit 2 MCPs to http://localhost:8080</code></pre>
<p>The two lineage relationships that DataHub was missing are recovered, along with 17 column-level mappings (12 + 5).</p>
<h3 class="wp-block-heading">Step 3: Emit to DataHub</h3>
<p>Point the sidecar at your DataHub GMS:</p>
<pre class="wp-block-code"><code>gsp-datahub-sidecar \
  --sql-file examples/bigquery_procedural.sql \
  --datahub-server http://localhost:8080</code></pre>
<h3 class="wp-block-heading">Step 4: Verify in DataHub</h3>
<p>Open the DataHub UI and search for <code>temp_table</code>. The lineage tab now shows the upstream relationship from <code>project.dataset.view_name</code>, with column-level lineage arrows:</p>
<figure class="wp-block-image"><img decoding="async" src="https://raw.githubusercontent.com/gudusoftware/gsp-datahub-sidecar/main/images/temp_table.png" alt="temp_table lineage in DataHub — table-level and column-level arrows recovered" /></figure>
<p>And for <code>final_output</code>, the lineage from <code>temp_table_delta</code> with all 5 column mappings:</p>
<figure class="wp-block-image"><img decoding="async" src="https://raw.githubusercontent.com/gudusoftware/gsp-datahub-sidecar/main/images/final_output.png" alt="final_output lineage in DataHub — 5 column-level lineage arrows from temp_table_delta" /></figure>
<p>Both table-level and column-level lineage are fully visible in DataHub&#8217;s lineage graph — recovered from procedural SQL that was previously invisible.</p>
<h2 class="wp-block-heading">Three backend modes</h2>
<p>The sidecar supports three backends, depending on your security and volume requirements:</p>
<figure class="wp-block-table">
<table style="border-collapse:collapse;width:100%">
<thead>
<tr style="background:#f0f4f8">
<th style="padding:8px 12px;border:1px solid #ddd;text-align:left">Mode</th>
<th style="padding:8px 12px;border:1px solid #ddd;text-align:left">Auth</th>
<th style="padding:8px 12px;border:1px solid #ddd;text-align:left">Limit</th>
<th style="padding:8px 12px;border:1px solid #ddd;text-align:left">Data location</th>
<th style="padding:8px 12px;border:1px solid #ddd;text-align:left">Use case</th>
</tr>
</thead>
<tbody>
<tr>
<td style="padding:8px 12px;border:1px solid #ddd"><code>anonymous</code> (default)</td>
<td style="padding:8px 12px;border:1px solid #ddd">None</td>
<td style="padding:8px 12px;border:1px solid #ddd">50/day</td>
<td style="padding:8px 12px;border:1px solid #ddd">SQL sent to api.gudusoft.com</td>
<td style="padding:8px 12px;border:1px solid #ddd">Quick evaluation</td>
</tr>
<tr style="background:#f9fafb">
<td style="padding:8px 12px;border:1px solid #ddd"><code>authenticated</code></td>
<td style="padding:8px 12px;border:1px solid #ddd">userId + secretKey</td>
<td style="padding:8px 12px;border:1px solid #ddd">10k/month</td>
<td style="padding:8px 12px;border:1px solid #ddd">SQL sent to api.gudusoft.com</td>
<td style="padding:8px 12px;border:1px solid #ddd">Extended evaluation</td>
</tr>
<tr>
<td style="padding:8px 12px;border:1px solid #ddd"><code>self_hosted</code></td>
<td style="padding:8px 12px;border:1px solid #ddd">userId + secretKey</td>
<td style="padding:8px 12px;border:1px solid #ddd">Unlimited</td>
<td style="padding:8px 12px;border:1px solid #ddd">SQL stays in your VPC</td>
<td style="padding:8px 12px;border:1px solid #ddd">Production</td>
</tr>
</tbody>
</table>
</figure>
<p>For production use with sensitive SQL, the <a href="https://docs.gudusoft.com/docker/">self-hosted SQLFlow Docker</a> keeps all data in your infrastructure — no SQL leaves your network.</p>
<h2 class="wp-block-heading">What&#8217;s next</h2>
<p>If you&#8217;re affected by <a href="https://github.com/datahub-project/datahub/issues/11654">datahub-project/datahub#11654</a> or similar lineage gaps in your DataHub instance:</p>
<ol class="wp-block-list">
<li><strong>Try the dry-run</strong> with your own SQL files to see what lineage you&#8217;re currently missing</li>
<li><strong>Check the <a href="https://github.com/gudusoftware/gsp-datahub-sidecar">GitHub repo</a></strong> for full documentation, configuration options, and additional examples (including Oracle stored procedures)</li>
<li><strong>Sign up for the authenticated tier</strong> at <a href="https://docs.gudusoft.com/sign-up/">docs.gudusoft.com/sign-up</a> for higher volume evaluation</li>
</ol>
<p>The sidecar is Apache 2.0 licensed. Gudu SQLFlow is a commercial product with a free evaluation tier. For questions or feedback, open an issue on GitHub or reach out at support@gudusoft.com.</p>
<p><em>Disclosure: This tool is built by <a href="https://gudusoft.com">Gudu Software</a>, the team behind General SQL Parser and SQLFlow. We specialize in deep SQL parsing and data lineage across 20+ SQL dialects.</em></p>
<p>The post <a rel="nofollow" href="https://www.dpriver.com/blog/2026/04/why-your-datahub-bigquery-lineage-silently-breaks-on-procedural-sql-and-how-to-f/">Why Your DataHub BigQuery Lineage Silently Breaks on Procedural SQL — and How to Fix It</a> appeared first on <a rel="nofollow" href="https://www.dpriver.com/blog">SQL and Data Blog</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Analyzing Oracle PL/SQL Dependencies Before Your Snowflake Migration</title>
		<link>https://www.dpriver.com/blog/2026/04/analyzing-oracle-plsql-dependencies-before-your-snowflake-migration/</link>
		
		<dc:creator><![CDATA[James]]></dc:creator>
		<pubDate>Mon, 06 Apr 2026 03:59:54 +0000</pubDate>
				<category><![CDATA[sql]]></category>
		<category><![CDATA[SQL language]]></category>
		<category><![CDATA[SQLFlow]]></category>
		<category><![CDATA[column lineage]]></category>
		<category><![CDATA[data lineage]]></category>
		<category><![CDATA[migration]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[PL/SQL]]></category>
		<category><![CDATA[Snowflake]]></category>
		<category><![CDATA[SQL parser]]></category>
		<category><![CDATA[stored procedures]]></category>
		<guid isPermaLink="false">https://www.dpriver.com/blog/?p=3185</guid>

					<description><![CDATA[<p>Oracle-to-Snowflake conversion tools handle syntax translation, but skip dependency analysis. Learn how to use GSP and SQLFlow to map PL/SQL call graphs, column-level lineage, and vendor-specific constructs before you start converting.</p>
<p>The post <a rel="nofollow" href="https://www.dpriver.com/blog/2026/04/analyzing-oracle-plsql-dependencies-before-your-snowflake-migration/">Analyzing Oracle PL/SQL Dependencies Before Your Snowflake Migration</a> appeared first on <a rel="nofollow" href="https://www.dpriver.com/blog">SQL and Data Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Oracle-to-Snowflake is one of the most common database migration paths in 2026. Every major cloud consulting firm has a practice dedicated to it, and the tooling ecosystem has matured significantly: <strong>SnowConvert AI</strong> (now free), <strong>AWS Schema Conversion Tool</strong>, and <strong>Ora2pg</strong> all handle syntax translation from PL/SQL to Snowflake SQL or Snowflake Scripting.</p>



<p>But here is the problem: these tools focus on <em>converting</em> SQL syntax. They skip the critical first step of <strong>analyzing what you actually have</strong>. If you have 2,000 stored procedures in your Oracle environment, which ones call which? Which columns flow through which procedures? Which procedures contain vendor-specific constructs that no tool can auto-convert? Without answering these questions first, you are converting blind.</p>



<h2 class="wp-block-heading">Why Pre-Migration Dependency Analysis Matters</h2>



<p>Oracle PL/SQL codebases are not flat lists of independent procedures. They are interconnected systems with hidden dependencies that conversion tools do not surface:</p>



<ul class="wp-block-list">
<li><strong>Call graphs</strong>: Procedure A calls Procedure B, which calls Package C&#8217;s function. If you convert A but not B, A breaks in Snowflake.</li>
<li><strong>Cross-procedure column flows</strong>: Data flows from a source table through a chain of procedures, each transforming specific columns. Conversion tools translate each procedure in isolation — they do not show you the end-to-end column lineage.</li>
<li><strong>Package-level coupling</strong>: Oracle packages group related procedures and functions with shared state (package-level variables, cursors). Snowflake has no direct package equivalent, so each package must be decomposed — but you need to know the internal dependency structure first.</li>
<li><strong>Dynamic SQL</strong>: Procedures that build and execute SQL strings at runtime. These create dependencies that are invisible to static conversion tools.</li>
</ul>



<h3 class="wp-block-heading">What Conversion Tools Actually Produce</h3>



<p>To be fair, conversion tools are good at what they do — translating syntax. But it helps to understand their limitations for complex PL/SQL:</p>



<p><strong>SnowConvert AI</strong> (originally from Mobilize.Net, now a free Snowflake tool): Handles straightforward PL/SQL well. For complex constructs like <code>CONNECT BY</code> hierarchical queries or the <code>MODEL</code> clause, it generates what the documentation calls &#8220;compilable but non-functional objects&#8221; — the output compiles in Snowflake without errors, but it does not produce the same results as the Oracle original. You need to know which procedures contain these constructs <em>before</em> conversion so you can plan manual rework.</p>



<p><strong>AWS Schema Conversion Tool (SCT)</strong>: Wraps untranslatable logic in comments with action items. This is honest and helpful, but with hundreds of procedures, you end up with a long list of flagged items and no way to prioritize them by dependency or business impact.</p>



<p><strong>Ora2pg</strong>: The open-source workhorse for Oracle-to-PostgreSQL, sometimes used as a first hop toward Snowflake because PostgreSQL&#8217;s SQL is much closer to Snowflake&#8217;s dialect than Oracle&#8217;s is. It handles most DML well but struggles with PL/SQL-specific patterns like <code>BULK COLLECT</code> + <code>FORALL</code> batch processing, collection types, and pipelined table functions.</p>



<p>None of these tools answer the fundamental pre-migration questions: <em>What depends on what? What is the migration order? What is the blast radius if this table changes?</em></p>



<h2 class="wp-block-heading">PL/SQL Patterns That Migration Tools Struggle With</h2>



<p>Let us look at concrete PL/SQL constructs that require pre-migration analysis because no tool auto-converts them reliably.</p>



<h3 class="wp-block-heading">1. CONNECT BY Hierarchical Queries</h3>



<p>Oracle&#8217;s <code>CONNECT BY</code> syntax for hierarchical/tree queries has no direct Snowflake equivalent. You must rewrite these as recursive CTEs:</p>



<pre class="wp-block-code"><code>-- Oracle: Hierarchical query with CONNECT BY
CREATE OR REPLACE PROCEDURE get_org_hierarchy(p_root_id NUMBER) AS
  CURSOR c_hierarchy IS
    SELECT employee_id, manager_id, first_name, last_name,
           LEVEL AS depth,
           SYS_CONNECT_BY_PATH(last_name, '/') AS path,
           CONNECT_BY_ISLEAF AS is_leaf
    FROM employees
    START WITH employee_id = p_root_id
    CONNECT BY PRIOR employee_id = manager_id
    ORDER SIBLINGS BY last_name;
BEGIN
  FOR rec IN c_hierarchy LOOP
    DBMS_OUTPUT.PUT_LINE(RPAD(' ', rec.depth * 2) || rec.path);
  END LOOP;
END;</code></pre>



<p>This procedure uses <code>LEVEL</code>, <code>SYS_CONNECT_BY_PATH</code>, <code>CONNECT_BY_ISLEAF</code>, and <code>ORDER SIBLINGS BY</code> — all Oracle-specific. Before converting, you need to know: what other procedures call <code>get_org_hierarchy</code>? What tables does it read from? Are there other procedures that also query the <code>employees</code> table hierarchically?</p>
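<p>The standard rewrite target for <code>CONNECT BY</code> is a recursive CTE: <code>LEVEL</code> becomes an explicit depth counter and <code>SYS_CONNECT_BY_PATH</code> becomes string concatenation in the recursive branch. The sketch below runs the rewrite against SQLite purely to demonstrate the shape (Snowflake&#8217;s <code>WITH RECURSIVE</code> is analogous); note that <code>ORDER SIBLINGS BY</code> has no direct equivalent and is approximated here by sorting on the path:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (employee_id INTEGER, manager_id INTEGER, last_name TEXT);
INSERT INTO employees VALUES
  (1, NULL, 'King'), (2, 1, 'Kochhar'), (3, 1, 'DeHaan'), (4, 2, 'Greenberg');
""")

# Recursive CTE equivalent of:
#   START WITH employee_id = 1
#   CONNECT BY PRIOR employee_id = manager_id
rows = conn.execute("""
WITH RECURSIVE hierarchy(employee_id, last_name, depth, path) AS (
  SELECT employee_id, last_name, 1, '/' || last_name
  FROM employees WHERE employee_id = 1                 -- START WITH
  UNION ALL
  SELECT e.employee_id, e.last_name, h.depth + 1,
         h.path || '/' || e.last_name                  -- SYS_CONNECT_BY_PATH
  FROM employees e
  JOIN hierarchy h ON e.manager_id = h.employee_id     -- CONNECT BY PRIOR
)
SELECT path, depth FROM hierarchy ORDER BY path
""").fetchall()

for path, depth in rows:
    print(depth, path)
```

<p>Edge cases still need attention in a real migration: Oracle&#8217;s <code>NOCYCLE</code> has no automatic counterpart, so cycle detection must be added by hand if the data can contain one.</p>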



<h3 class="wp-block-heading">2. MODEL Clause for Spreadsheet-Style Calculations</h3>



<p>Oracle&#8217;s <code>MODEL</code> clause allows spreadsheet-like inter-row calculations directly in SQL. There is no Snowflake equivalent:</p>



<pre class="wp-block-code"><code>-- Oracle: MODEL clause for financial projections
CREATE OR REPLACE VIEW revenue_forecast AS
SELECT country, product, year, amount
FROM quarterly_revenue
MODEL
  PARTITION BY (country)
  DIMENSION BY (product, year)
  MEASURES (revenue AS amount)
  RULES (
    amount['Electronics', 2027] =
      amount['Electronics', 2026] * 1.12,
    amount['Software', 2027] =
      amount['Software', 2026] * 1.25
      + amount['Services', 2026] * 0.05
  );</code></pre>



<p>The <code>MODEL</code> clause creates column-level dependencies that span across dimension values. In this example, the 2027 revenue for Software depends on both the 2026 Software <em>and</em> 2026 Services rows. Understanding these cross-row data flows is essential before attempting a manual rewrite for Snowflake.</p>
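<p>One common manual rewrite is to turn each <code>MODEL</code> rule into an explicit <code>INSERT ... SELECT</code>, so the cross-row dependency becomes a visible join. The sketch below reproduces the two rules above in portable SQL, executed on SQLite with made-up revenue figures purely to demonstrate the shape of the rewrite:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE quarterly_revenue (country TEXT, product TEXT, year INT, revenue REAL);
INSERT INTO quarterly_revenue VALUES
  ('US', 'Electronics', 2026, 100.0),
  ('US', 'Software',    2026, 200.0),
  ('US', 'Services',    2026,  80.0);
""")

# Each MODEL rule becomes one projection; the cross-row dependency
# (Software 2027 reads Services 2026) becomes an explicit join.
conn.executescript("""
INSERT INTO quarterly_revenue
SELECT country, 'Electronics', 2027, revenue * 1.12
FROM quarterly_revenue WHERE product = 'Electronics' AND year = 2026;

INSERT INTO quarterly_revenue
SELECT s.country, 'Software', 2027, s.revenue * 1.25 + sv.revenue * 0.05
FROM quarterly_revenue s
JOIN quarterly_revenue sv ON sv.country = s.country
  AND sv.product = 'Services' AND sv.year = 2026
WHERE s.product = 'Software' AND s.year = 2026;
""")

for row in conn.execute(
    "SELECT product, revenue FROM quarterly_revenue WHERE year = 2027 ORDER BY product"
):
    print(row)
```

<p>This is exactly why pre-migration column lineage matters: without knowing that the Software rule reads the Services row, a rewrite would silently drop that term.</p>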



<h3 class="wp-block-heading">3. BULK COLLECT + FORALL Batch Processing</h3>



<p>High-performance PL/SQL uses <code>BULK COLLECT</code> and <code>FORALL</code> for batch operations. This pattern has no Snowflake equivalent — Snowflake Scripting processes rows differently:</p>



<pre class="wp-block-code"><code>-- Oracle: Batch processing with BULK COLLECT + FORALL
CREATE OR REPLACE PROCEDURE sync_customer_scores AS
  TYPE t_customer_ids IS TABLE OF customers.customer_id%TYPE;
  TYPE t_scores IS TABLE OF NUMBER;
  l_ids   t_customer_ids;
  l_scores t_scores;
BEGIN
  SELECT customer_id, calculate_score(customer_id)
  BULK COLLECT INTO l_ids, l_scores
  FROM customers
  WHERE last_updated &lt; SYSDATE - 30;

  FORALL i IN 1..l_ids.COUNT
    UPDATE customer_scores
    SET score = l_scores(i),
        updated_at = SYSDATE
    WHERE customer_id = l_ids(i);

  COMMIT;
END;</code></pre>



<p>This procedure references the <code>customers</code> table, the <code>customer_scores</code> table, and calls the <code>calculate_score</code> function. A conversion tool might translate the SQL syntax, but it will not tell you that <code>calculate_score</code> itself calls three other functions and reads from two additional tables. You need the full call graph to plan the migration.</p>
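<p>In many cases the whole <code>BULK COLLECT</code> + <code>FORALL</code> pair collapses into a single set-based <code>UPDATE</code>, which Snowflake executes far more naturally than a row loop. A sketch of that shape, run on SQLite for demonstration (<code>calculate_score</code> is stubbed inline, and the 30-day cutoff is a fixed literal here for determinism):</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INT, last_updated TEXT);
CREATE TABLE customer_scores (customer_id INT, score REAL, updated_at TEXT);
INSERT INTO customers VALUES (1, '2024-01-01'), (2, '2099-01-01');
INSERT INTO customer_scores VALUES (1, 0.0, NULL), (2, 0.0, NULL);
""")

# The BULK COLLECT + FORALL loop becomes one set-based statement.
# calculate_score() is stubbed as customer_id * 10 for the demo;
# in Snowflake it would be a UDF or an inlined expression.
conn.execute("""
UPDATE customer_scores
SET score = (SELECT c.customer_id * 10 FROM customers c
             WHERE c.customer_id = customer_scores.customer_id),
    updated_at = '2026-04-06'
WHERE customer_id IN (
    SELECT customer_id FROM customers
    WHERE last_updated < '2026-03-07'   -- SYSDATE - 30 in the original
)
""")

print(conn.execute(
    "SELECT customer_id, score FROM customer_scores ORDER BY customer_id"
).fetchall())
```
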



<h3 class="wp-block-heading">4. Package-Level Dependencies</h3>



<p>Oracle packages bundle related procedures, functions, and shared state. Snowflake has no package concept — you must decompose them:</p>



<pre class="wp-block-code"><code>-- Oracle: Package with internal dependencies
CREATE OR REPLACE PACKAGE customer_mgmt AS
  PROCEDURE onboard_customer(p_name VARCHAR2, p_email VARCHAR2);
  FUNCTION  get_risk_score(p_customer_id NUMBER) RETURN NUMBER;
  PROCEDURE update_credit_limit(p_customer_id NUMBER);
END customer_mgmt;
/

CREATE OR REPLACE PACKAGE BODY customer_mgmt AS
  -- Private package variable (shared state)
  g_default_credit_limit NUMBER := 5000;

  FUNCTION get_risk_score(p_customer_id NUMBER) RETURN NUMBER IS
    l_score NUMBER;
  BEGIN
    SELECT risk_rating INTO l_score
    FROM customer_risk_profiles
    WHERE customer_id = p_customer_id;
    RETURN l_score;
  END;

  PROCEDURE update_credit_limit(p_customer_id NUMBER) IS
    l_risk NUMBER;
  BEGIN
    l_risk := get_risk_score(p_customer_id);  -- internal call
    UPDATE customers
    SET credit_limit = g_default_credit_limit * (1 - l_risk/100)
    WHERE customer_id = p_customer_id;
  END;

  PROCEDURE onboard_customer(p_name VARCHAR2, p_email VARCHAR2) IS
    l_id NUMBER;
  BEGIN
    INSERT INTO customers(name, email, credit_limit)
    VALUES (p_name, p_email, g_default_credit_limit)
    RETURNING customer_id INTO l_id;

    -- Cross-package call
    audit_pkg.log_event('CUSTOMER_ONBOARD', l_id);
    update_credit_limit(l_id);  -- internal call
  END;
END customer_mgmt;</code></pre>



<p>This package has internal function calls (<code>update_credit_limit</code> calls <code>get_risk_score</code>), cross-package calls (<code>onboard_customer</code> calls <code>audit_pkg.log_event</code>), shared state (<code>g_default_credit_limit</code>), and touches multiple tables (<code>customers</code>, <code>customer_risk_profiles</code>). Decomposing this for Snowflake requires understanding all these relationships first.</p>



<h2 class="wp-block-heading">Using GSP to Analyze PL/SQL Before Migration</h2>



<p><a href="https://www.sqlparser.com">General SQL Parser (GSP)</a> is a SQL parsing library that handles Oracle PL/SQL natively, including all the constructs shown above. Here is how it supports pre-migration analysis.</p>



<h3 class="wp-block-heading">Extracting Table and Column References</h3>



<p>GSP parses each stored procedure and extracts every table and column reference, including those inside complex constructs like <code>CONNECT BY</code>, <code>MODEL</code>, and <code>BULK COLLECT</code>:</p>



<ul class="wp-block-list">
<li><strong>Direct table references</strong>: Which tables does each procedure SELECT from, INSERT into, UPDATE, or DELETE from?</li>
<li><strong>Column-level detail</strong>: Which specific columns are read and which are written, even through aliases, subqueries, and CTEs?</li>
<li><strong>Vendor-specific constructs</strong>: GSP flags Oracle-specific syntax (<code>CONNECT BY</code>, <code>MODEL</code>, <code>MERGE</code> with error logging, <code>FORALL</code>) so you can identify procedures that need manual rework.</li>
</ul>



<p>For the <code>sync_customer_scores</code> procedure above, GSP identifies:</p>



<ul class="wp-block-list">
<li>Reads from: <code>customers.customer_id</code>, <code>customers.last_updated</code></li>
<li>Writes to: <code>customer_scores.score</code>, <code>customer_scores.updated_at</code></li>
<li>Calls: <code>calculate_score()</code> function</li>
<li>Vendor-specific: <code>BULK COLLECT INTO</code>, <code>FORALL</code>, <code>SYSDATE</code>, <code>%TYPE</code> attribute</li>
</ul>



<h3 class="wp-block-heading">Building the Call Graph</h3>



<p>By parsing all procedures in your Oracle schema, GSP builds a complete call graph showing which procedures call which other procedures. For the <code>customer_mgmt</code> package example:</p>



<pre class="wp-block-code"><code>customer_mgmt.onboard_customer
  ├── audit_pkg.log_event          (cross-package)
  └── customer_mgmt.update_credit_limit
        └── customer_mgmt.get_risk_score

Tables affected:
  customers              ← INSERT (onboard), UPDATE (update_credit_limit)
  customer_risk_profiles ← SELECT (get_risk_score)
  audit_log              ← INSERT (audit_pkg.log_event)</code></pre>



<p>This call graph tells you the migration order: <code>get_risk_score</code> has no procedure dependencies (migrate first), <code>update_credit_limit</code> depends on <code>get_risk_score</code> (migrate second), and <code>onboard_customer</code> depends on both plus the <code>audit_pkg</code> (migrate last, and audit_pkg must also be ready).</p>
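<p>Once the call graph is extracted, deriving the migration order is just a topological sort, which Python&#8217;s standard library computes directly. A sketch over the <code>customer_mgmt</code> example (procedure names taken from the code above; in practice the graph would come from GSP&#8217;s parsing output):</p>

```python
from graphlib import TopologicalSorter

# procedure -> set of procedures it calls (its dependencies)
calls = {
    "customer_mgmt.onboard_customer": {
        "audit_pkg.log_event",
        "customer_mgmt.update_credit_limit",
    },
    "customer_mgmt.update_credit_limit": {"customer_mgmt.get_risk_score"},
}

# Dependencies must be migrated before their callers, which is
# exactly the order static_order() yields (callees first).
order = list(TopologicalSorter(calls).static_order())
print(order)
```

<p>On a real schema this also surfaces call cycles immediately: <code>TopologicalSorter</code> raises <code>CycleError</code>, flagging procedure clusters that must be migrated together as one unit.</p>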



<h3 class="wp-block-heading">Identifying Vendor-Specific Syntax</h3>



<p>GSP categorizes Oracle-specific constructs by migration complexity so you can estimate effort:</p>


<figure class="wp-block-table"><table style="border-collapse: collapse;width: 100%"><thead><tr><th style="border: 1px solid #ddd;padding: 10px;text-align: left;background-color: #f5f5f5">Oracle Construct</th><th style="border: 1px solid #ddd;padding: 10px;text-align: left;background-color: #f5f5f5">Snowflake Equivalent</th><th style="border: 1px solid #ddd;padding: 10px;text-align: left;background-color: #f5f5f5">Migration Effort</th></tr></thead><tbody><tr><td style="border: 1px solid #ddd;padding: 10px"><code>CONNECT BY</code> / <code>START WITH</code></td><td style="border: 1px solid #ddd;padding: 10px">Recursive CTE</td><td style="border: 1px solid #ddd;padding: 10px">Medium — mechanical rewrite but must verify edge cases (cycles, NOCYCLE)</td></tr><tr><td style="border: 1px solid #ddd;padding: 10px"><code>MODEL</code> clause</td><td style="border: 1px solid #ddd;padding: 10px">None — must redesign as window functions, CTEs, or application logic</td><td style="border: 1px solid #ddd;padding: 10px">High — requires understanding business logic intent</td></tr><tr><td style="border: 1px solid #ddd;padding: 10px"><code>BULK COLLECT</code> + <code>FORALL</code></td><td style="border: 1px solid #ddd;padding: 10px">Set-based SQL or Snowflake Scripting loops</td><td style="border: 1px solid #ddd;padding: 10px">Medium — often simplifies to a single SQL statement</td></tr><tr><td style="border: 1px solid #ddd;padding: 10px">Package spec + body</td><td style="border: 1px solid #ddd;padding: 10px">Individual stored procedures + UDFs</td><td style="border: 1px solid #ddd;padding: 10px">High — shared state and internal dependencies must be untangled</td></tr><tr><td style="border: 1px solid #ddd;padding: 10px"><code>DBMS_OUTPUT</code>, <code>UTL_FILE</code></td><td style="border: 1px solid #ddd;padding: 10px">Snowflake logging, stages</td><td style="border: 1px solid #ddd;padding: 10px">Medium — different patterns but well-documented</td></tr><tr><td style="border: 1px solid 
#ddd;padding: 10px">Pipelined table functions</td><td style="border: 1px solid #ddd;padding: 10px">UDTFs or JavaScript UDFs</td><td style="border: 1px solid #ddd;padding: 10px">High — fundamental paradigm shift</td></tr></tbody></table></figure>


<p>When GSP identifies that 15% of your procedures use <code>CONNECT BY</code> and 3% use the <code>MODEL</code> clause, you can estimate the manual rework effort before committing to a migration timeline.</p>



<h2 class="wp-block-heading">Visualizing Lineage with SQLFlow</h2>



<p><a href="https://www.gudusoft.com">SQLFlow</a> takes the parsing results from GSP and renders them as interactive lineage diagrams. For pre-migration analysis, this provides two capabilities that spreadsheets and documentation cannot match.</p>



<h3 class="wp-block-heading">Column-Level Lineage Visualization</h3>



<p>Upload your PL/SQL to SQLFlow and see column-level data flow across procedures. For the examples above, SQLFlow shows:</p>



<ul class="wp-block-list">
<li><code>customers.customer_id</code> flows into <code>customer_scores.customer_id</code> via <code>sync_customer_scores</code></li>
<li><code>customer_risk_profiles.risk_rating</code> flows through <code>get_risk_score</code> into <code>customers.credit_limit</code> via <code>update_credit_limit</code></li>
<li><code>employees.employee_id</code> and <code>employees.manager_id</code> form a self-referential hierarchy in <code>get_org_hierarchy</code></li>
</ul>



<p>This visual lineage map helps migration teams understand data flow before they touch any code.</p>



<h3 class="wp-block-heading">Impact Analysis: What Breaks When a Table Changes?</h3>



<p>The most practical use of SQLFlow during migration planning is impact analysis. Select any Oracle table — say, <code>customers</code> — and SQLFlow shows every procedure, view, and downstream table that depends on it. If you are migrating the <code>customers</code> table first, you can see that <code>onboard_customer</code>, <code>update_credit_limit</code>, and <code>sync_customer_scores</code> all need to be migrated or adapted at the same time.</p>



<p>This answers the question that migration project managers ask most often: <em>what is the minimum set of objects I need to migrate together for a working increment?</em></p>



<h2 class="wp-block-heading">For VS Code Users: SQL Omni</h2>



<p>If your team works in VS Code and wants to run this analysis locally without uploading SQL to any cloud service, <a href="https://gudu-sql-omni.gudusoft.com">SQL Omni</a> brings GSP&#8217;s parsing and lineage capabilities into VS Code as an extension. It supports Oracle PL/SQL natively and runs entirely offline — useful for organizations where SQL code cannot leave the local environment.</p>



<p>SQL Omni provides the same dependency analysis and column-level lineage visualization described above, directly in your editor. You can install it from the <a href="https://marketplace.visualstudio.com/items?itemName=gudusoftware.gudu-sql-omni">VS Code Marketplace</a>.</p>



<h2 class="wp-block-heading">Putting It Together: Analyze Before You Convert</h2>



<p>Oracle-to-Snowflake migration tools have come a long way. SnowConvert AI being free removes a major cost barrier. AWS SCT (free but closed-source) and Ora2pg (fully open-source) are solid alternatives. But all of them work best when you know what you are feeding them.</p>



<p>A practical pre-migration workflow looks like this:</p>



<ol class="wp-block-list">
<li><strong>Inventory</strong>: Use GSP to parse all PL/SQL objects and extract table/column references, call graphs, and vendor-specific syntax.</li>
<li><strong>Visualize</strong>: Use SQLFlow to render the dependency graph and column-level lineage. Identify clusters of tightly-coupled objects.</li>
<li><strong>Assess</strong>: Categorize each procedure by migration complexity based on which Oracle-specific constructs it uses.</li>
<li><strong>Prioritize</strong>: Use the dependency graph to determine migration order — start with leaf objects that have no downstream dependencies.</li>
<li><strong>Convert</strong>: Run SnowConvert AI, AWS SCT, or Ora2pg on each batch, knowing what to expect and where manual rework is needed.</li>
<li><strong>Validate</strong>: After conversion, use GSP to parse the Snowflake output and verify that the lineage is preserved.</li>
</ol>



<p>The tools for converting Oracle SQL to Snowflake are good and getting better. The missing piece for most teams is understanding what they have before they start converting. GSP and SQLFlow fill that gap.</p>



<h2 class="wp-block-heading">Try It With Your Own PL/SQL</h2>



<ul class="wp-block-list">
<li><strong>SQLFlow online</strong>: Visit <a href="https://sqlflow.gudusoft.com">sqlflow.gudusoft.com</a>, select Oracle as the dialect, and paste a stored procedure to see its column-level lineage.</li>
<li><strong>GSP</strong>: Visit <a href="https://www.sqlparser.com">sqlparser.com</a> for the parsing library that powers the analysis.</li>
<li><strong>SQL Omni for VS Code</strong>: Install from the <a href="https://marketplace.visualstudio.com/items?itemName=gudusoftware.gudu-sql-omni">VS Code Marketplace</a> for offline analysis in your editor.</li>
</ul>



<p><em>Planning an Oracle-to-Snowflake migration? Start with analysis, not conversion. Your migration timeline will thank you.</em></p>
<p>The post <a rel="nofollow" href="https://www.dpriver.com/blog/2026/04/analyzing-oracle-plsql-dependencies-before-your-snowflake-migration/">Analyzing Oracle PL/SQL Dependencies Before Your Snowflake Migration</a> appeared first on <a rel="nofollow" href="https://www.dpriver.com/blog">SQL and Data Blog</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>GSP vs JSQLParser vs sqlglot — SQL Parser Comparison 2026</title>
		<link>https://www.dpriver.com/blog/2026/04/gsp-vs-jsqlparser-vs-sqlglot-sql-parser-comparison-2026/</link>
		
		<dc:creator><![CDATA[James]]></dc:creator>
		<pubDate>Mon, 06 Apr 2026 03:20:32 +0000</pubDate>
				<category><![CDATA[gsp]]></category>
		<category><![CDATA[sql]]></category>
		<category><![CDATA[SQL language]]></category>
		<category><![CDATA[bigquery]]></category>
		<category><![CDATA[comparison]]></category>
		<category><![CDATA[data lineage]]></category>
		<category><![CDATA[JSQLParser]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Snowflake]]></category>
		<category><![CDATA[SQL parser]]></category>
		<category><![CDATA[sqlglot]]></category>
		<category><![CDATA[stored procedures]]></category>
		<guid isPermaLink="false">https://www.dpriver.com/blog/?p=3181</guid>

					<description><![CDATA[<p>We tested General SQL Parser, JSQLParser, and sqlglot against 14 real-world SQL patterns across 6 dialects. Here are the verified results — including where each parser shines and where it struggles with Oracle, stored procedures, and BigQuery procedural SQL.</p>
<p>The post <a rel="nofollow" href="https://www.dpriver.com/blog/2026/04/gsp-vs-jsqlparser-vs-sqlglot-sql-parser-comparison-2026/">GSP vs JSQLParser vs sqlglot — SQL Parser Comparison 2026</a> appeared first on <a rel="nofollow" href="https://www.dpriver.com/blog">SQL and Data Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>If you work with SQL across multiple databases, you have probably hit a wall where your parser just stops understanding your queries. Maybe it chokes on Oracle&#8217;s MODEL clause, or silently drops structure from a T-SQL stored procedure. Choosing the right SQL parser matters more than ever in 2026 — data lineage tools, SQL migration pipelines, and AI-powered SQL validation all depend on accurate parsing across dialects.</p>
<p>We tested three widely used SQL parsers against 14 real-world SQL patterns spanning 6 dialects. Here is what we found.</p>
<h2 class="wp-block-heading">The Contenders</h2>
<p><strong>General SQL Parser (GSP) 4.1.0</strong> — A commercial Java library from Gudu Software that supports 20+ SQL dialects. It focuses on deep parsing of vendor-specific syntax including stored procedures, and provides built-in column-level lineage extraction. (Disclosure: this blog is published by Gudu Software.)</p>
<p><strong>JSQLParser 5.3</strong> — An open-source Java SQL parser (Apache 2.0 / LGPL) widely used in the Java ecosystem. It handles standard SQL well and has growing support for vendor-specific syntax. Available on <a href="https://github.com/JSQLParser/JSqlParser">GitHub</a> and Maven Central.</p>
<p><strong>sqlglot 30.2</strong> — An open-source Python SQL parser and transpiler that has gained significant traction for SQL migration and lineage use cases. It supports reading and writing SQL across 20+ dialects and includes basic lineage extraction. Available on <a href="https://github.com/tobymao/sqlglot">GitHub</a> and PyPI.</p>
<h2 class="wp-block-heading">Test Methodology</h2>
<p>We ran 14 test cases across 6 SQL dialects: standard SQL (ANSI), Oracle, SQL Server (T-SQL), PostgreSQL, BigQuery, and Snowflake. The tests cover three categories:</p>
<ul class="wp-block-list">
<li><strong>Standard DML</strong> — CTEs, window functions, subqueries (the baseline every parser should handle)</li>
<li><strong>Vendor-specific syntax</strong> — Oracle MODEL, MERGE with LOG ERRORS, T-SQL CROSS APPLY, BigQuery UNNEST with STRUCT, Snowflake FLATTEN and QUALIFY</li>
<li><strong>Stored procedures and procedural SQL</strong> — T-SQL TRY/CATCH blocks, PL/pgSQL RETURN QUERY functions, Oracle PL/SQL BULK COLLECT/FORALL, BigQuery procedural DECLARE/IF</li>
</ul>
<p>Versions tested: GSP 4.1.0.10, JSQLParser 5.3, sqlglot 30.2.1. All tests were run on 2026-04-05.</p>
<p>A &#8220;PASS&#8221; means the parser correctly recognized the SQL structure and produced a usable parse tree. A &#8220;FAIL&#8221; means the parser either threw an error or could not represent the syntax meaningfully.</p>
<h2 class="wp-block-heading">Results</h2>
<figure class="wp-block-table">
<table style="border-collapse:collapse;width:100%;border:1px solid #ddd">
<thead>
<tr style="background:#f0f4f8">
<th style="padding:10px 12px;border:1px solid #ddd;text-align:left">Test Case</th>
<th style="padding:10px 12px;border:1px solid #ddd;text-align:center">GSP 4.1.0</th>
<th style="padding:10px 12px;border:1px solid #ddd;text-align:center">JSQLParser 5.3</th>
<th style="padding:10px 12px;border:1px solid #ddd;text-align:center">sqlglot 30.2</th>
</tr>
</thead>
<tbody>
<tr>
<td style="padding:8px 12px;border:1px solid #ddd">Standard CTE with JOIN</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
</tr>
<tr style="background:#f9fafb">
<td style="padding:8px 12px;border:1px solid #ddd">Window function with PARTITION BY</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
</tr>
<tr>
<td style="padding:8px 12px;border:1px solid #ddd">Subquery in SELECT</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
</tr>
<tr style="background:#f9fafb">
<td style="padding:8px 12px;border:1px solid #ddd">Oracle CONNECT BY</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
</tr>
<tr>
<td style="padding:8px 12px;border:1px solid #ddd"><strong>Oracle MODEL clause</strong></td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#dc2626"><strong>FAIL</strong></td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#dc2626"><strong>FAIL</strong></td>
</tr>
<tr style="background:#f9fafb">
<td style="padding:8px 12px;border:1px solid #ddd"><strong>Oracle MERGE with error logging</strong></td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#dc2626"><strong>FAIL</strong></td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#dc2626"><strong>FAIL</strong></td>
</tr>
<tr>
<td style="padding:8px 12px;border:1px solid #ddd">T-SQL CROSS APPLY</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
</tr>
<tr style="background:#f9fafb">
<td style="padding:8px 12px;border:1px solid #ddd"><strong>T-SQL stored procedure (TRY/CATCH)</strong></td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#d97706"><strong>PASS*</strong></td>
</tr>
<tr>
<td style="padding:8px 12px;border:1px solid #ddd">PL/pgSQL RETURN QUERY function</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
</tr>
<tr style="background:#f9fafb">
<td style="padding:8px 12px;border:1px solid #ddd"><strong>Oracle PL/SQL BULK COLLECT</strong></td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#dc2626"><strong>FAIL</strong></td>
</tr>
<tr>
<td style="padding:8px 12px;border:1px solid #ddd">BigQuery UNNEST with STRUCT</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
</tr>
<tr style="background:#f9fafb">
<td style="padding:8px 12px;border:1px solid #ddd"><strong>BigQuery procedural DECLARE/IF</strong></td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#dc2626"><strong>FAIL</strong></td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
</tr>
<tr>
<td style="padding:8px 12px;border:1px solid #ddd">Snowflake FLATTEN</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
</tr>
<tr style="background:#f9fafb">
<td style="padding:8px 12px;border:1px solid #ddd">Snowflake QUALIFY</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
<td style="padding:8px 12px;border:1px solid #ddd;text-align:center;color:#16a34a">PASS</td>
</tr>
<tr style="background:#e8f0fe;font-weight:bold">
<td style="padding:10px 12px;border:1px solid #ddd"><strong>Total</strong></td>
<td style="padding:10px 12px;border:1px solid #ddd;text-align:center;color:#16a34a"><strong>14/14</strong></td>
<td style="padding:10px 12px;border:1px solid #ddd;text-align:center;color:#d97706"><strong>11/14</strong></td>
<td style="padding:10px 12px;border:1px solid #ddd;text-align:center;color:#d97706"><strong>11/14</strong></td>
</tr>
</tbody>
</table>
</figure>
<p>*sqlglot 30.2 returned &#8220;contains unsupported syntax. Falling back to parsing as a &#8216;Command&#8217;&#8221; for the T-SQL stored procedure. It technically parsed without error, but the result lost structural understanding of the procedure body — TRY/CATCH blocks, variable declarations, and control flow were not represented in the parse tree. We counted this as a PASS with a caveat.</p>
<h2 class="wp-block-heading">Analysis</h2>
<h3 class="wp-block-heading">Standard SQL: All Three Handle It Well</h3>
<p>For standard ANSI SQL patterns — CTEs with JOINs, window functions with PARTITION BY and ORDER BY, and correlated subqueries — all three parsers performed without issue. If your SQL stays within standard syntax, any of these parsers will serve you well.</p>
<h3 class="wp-block-heading">Oracle-Specific Syntax: The Dividing Line</h3>
<p>Oracle is where the differences become clear. All three parsers handle CONNECT BY (hierarchical queries), which has been around for decades and is widely implemented. But Oracle&#8217;s MODEL clause and MERGE with LOG ERRORS are a different story.</p>
<p>The MODEL clause is a spreadsheet-like computation feature unique to Oracle. Both JSQLParser and sqlglot failed to parse it. Similarly, Oracle&#8217;s MERGE statement with the LOG ERRORS INTO extension — used in ETL pipelines for error handling — tripped up both open-source parsers.</p>
<p>GSP handled all Oracle syntax including these less common but production-critical patterns.</p>
<h3 class="wp-block-heading">Stored Procedures: Where Depth Matters</h3>
<p>Stored procedure parsing is where things get interesting for lineage and migration use cases. You cannot extract column-level lineage from a data warehouse if your parser skips the business logic inside stored procedures.</p>
<ul class="wp-block-list">
<li><strong>T-SQL stored procedures</strong>: GSP and JSQLParser both parsed the full procedure including TRY/CATCH blocks. sqlglot accepted the input but fell back to its &#8220;Command&#8221; mode, which means it treated the procedure body as an opaque string rather than parsing the internal structure. If you need to analyze what happens inside T-SQL procedures &#8212; variable assignments, control flow, DML within TRY blocks &#8212; sqlglot&#8217;s Command fallback will not give you that.</li>
<li><strong>PL/pgSQL functions</strong>: All three handled RETURN QUERY functions. PostgreSQL&#8217;s procedural syntax has received good attention from all parser projects.</li>
<li><strong>Oracle PL/SQL BULK COLLECT/FORALL</strong>: GSP and JSQLParser both parsed this pattern. sqlglot failed. BULK COLLECT is fundamental to performant Oracle PL/SQL and appears frequently in production codebases.</li>
</ul>
<h3 class="wp-block-heading">BigQuery Procedural SQL: sqlglot Shines</h3>
<p>BigQuery&#8217;s procedural extensions (DECLARE, IF/ELSE, loops) are relatively new, and here sqlglot showed its strength. sqlglot parsed BigQuery procedural DECLARE/IF blocks successfully, while JSQLParser could not. This makes sense given sqlglot&#8217;s strong focus on modern cloud SQL dialects.</p>
<p>All three handled BigQuery&#8217;s UNNEST with STRUCT patterns, which is good news since this pattern is common in BigQuery analytics.</p>
<h3 class="wp-block-heading">Snowflake: Solid Across the Board</h3>
<p>Both FLATTEN (Snowflake&#8217;s way of unnesting semi-structured data) and QUALIFY (a Snowflake/Teradata extension for filtering window function results) were handled by all three parsers. Snowflake syntax support has matured well across the ecosystem.</p>
<h2 class="wp-block-heading">When to Use Which</h2>
<h3 class="wp-block-heading">JSQLParser — Free, Java, Good for Standard SQL</h3>
<p>JSQLParser is a solid choice if you are building a Java application that needs to parse standard SQL with some vendor-specific syntax. It handles the majority of real-world SQL patterns and has an active open-source community. It works well for SQL validation, basic query rewriting, and extracting table references.</p>
<p><strong>Best for</strong>: Java projects, standard SQL parsing, query validation, table-level lineage where you do not need deep vendor-specific or procedural coverage.</p>
<p><strong>Limitations</strong>: Oracle MODEL, MERGE LOG ERRORS, and BigQuery procedural SQL are not supported as of version 5.3.</p>
<ul class="wp-block-list">
<li><a href="https://github.com/JSQLParser/JSqlParser">JSQLParser on GitHub</a></li>
</ul>
<h3 class="wp-block-heading">sqlglot — Free, Python, Best for Transpilation</h3>
<p>sqlglot has earned its reputation as the go-to tool for SQL transpilation (converting SQL between dialects). Its Python API is clean, and it includes basic lineage extraction capabilities. It handles BigQuery procedural SQL better than JSQLParser, making it a strong choice for cloud-first data teams.</p>
<p><strong>Best for</strong>: Python projects, SQL transpilation/migration, BigQuery-heavy workloads, basic column-level lineage for standard SQL patterns.</p>
<p><strong>Limitations</strong>: Oracle MODEL, MERGE LOG ERRORS, and PL/SQL BULK COLLECT are not supported. T-SQL stored procedures parse but lose structural detail. If you need to extract lineage from inside stored procedures, the Command fallback mode will not provide it.</p>
<ul class="wp-block-list">
<li><a href="https://github.com/tobymao/sqlglot">sqlglot on GitHub</a></li>
</ul>
<h3 class="wp-block-heading">General SQL Parser — Commercial, Java, Deepest Dialect Coverage</h3>
<p>GSP is the only parser in this comparison that handled all 14 test cases, including the most challenging Oracle-specific syntax and stored procedure patterns. It is designed for use cases where parsing accuracy across every dialect is non-negotiable — data lineage platforms, SQL migration tools, and compliance analysis.</p>
<p><strong>Best for</strong>: Enterprise environments with multiple database dialects, stored procedure lineage extraction, Oracle-heavy workloads, column-level lineage that needs to trace through procedural code.</p>
<p><strong>Trade-off</strong>: It is a commercial product, so there is a licensing cost. For teams that need the depth, the trade-off is straightforward. For simpler use cases, the open-source options may be sufficient.</p>
<ul class="wp-block-list">
<li><a href="https://www.gudusoft.com/general-sql-parser/">General SQL Parser</a></li>
</ul>
<p><strong>All test scripts are available on GitHub:</strong> <a href="https://github.com/sqlparser/sqlflow_public/tree/master/demos/sql-parser-comparison-2026">sqlparser/sqlflow_public/tree/master/demos/sql-parser-comparison-2026</a></p>
<h2 class="wp-block-heading">Try It Yourself</h2>
<p>All three parsers are easy to test with your own SQL:</p>
<ul class="wp-block-list">
<li><strong>General SQL Parser</strong>: Download from <a href="https://www.gudusoft.com/general-sql-parser/">gudusoft.com</a> or try lineage visualization with <a href="https://sqlflow.gudusoft.com">SQLFlow Online Demo</a></li>
<li><strong>JSQLParser</strong>: Add to your Maven/Gradle project from <a href="https://github.com/JSQLParser/JSqlParser">Maven Central</a></li>
<li><strong>sqlglot</strong>: <code>pip install sqlglot</code> and start parsing in Python</li>
</ul>
<p>If you work in VS Code and want SQL parsing, lineage visualization, and ER diagrams without leaving your editor, take a look at <a href="https://marketplace.visualstudio.com/items?itemName=gudusoftware.gudu-sql-omni">SQL Omni</a> — it uses GSP under the hood and runs 100% offline.</p>
<h2 class="wp-block-heading">Conclusion</h2>
<p>There is no single &#8220;best&#8221; SQL parser — the right choice depends on your language ecosystem, the SQL dialects you need to support, and how deep you need to go into vendor-specific and procedural syntax.</p>
<p>For standard SQL in Java, JSQLParser is capable and free. For Python-based transpilation and BigQuery work, sqlglot is hard to beat. For full dialect coverage including stored procedures and Oracle-specific syntax, GSP handles patterns that the open-source alternatives currently do not.</p>
<p>The test data speaks for itself. We encourage you to run your own SQL through all three and see which one fits your needs.</p>
<p>The post <a rel="nofollow" href="https://www.dpriver.com/blog/2026/04/gsp-vs-jsqlparser-vs-sqlglot-sql-parser-comparison-2026/">GSP vs JSQLParser vs sqlglot — SQL Parser Comparison 2026</a> appeared first on <a rel="nofollow" href="https://www.dpriver.com/blog">SQL and Data Blog</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Why Most SQL Lineage Tools Skip Stored Procedures (And What You Can Do About It)</title>
		<link>https://www.dpriver.com/blog/2026/04/why-most-sql-lineage-tools-skip-stored-procedures-and-what-you-can-do-about-it/</link>
		
		<dc:creator><![CDATA[James]]></dc:creator>
		<pubDate>Sun, 05 Apr 2026 15:00:38 +0000</pubDate>
				<category><![CDATA[gsp]]></category>
		<category><![CDATA[sql]]></category>
		<category><![CDATA[SQL language]]></category>
		<category><![CDATA[column lineage]]></category>
		<category><![CDATA[data lineage]]></category>
		<category><![CDATA[PL/SQL]]></category>
		<category><![CDATA[SQL parser]]></category>
		<category><![CDATA[stored procedures]]></category>
		<category><![CDATA[T-SQL]]></category>
		<guid isPermaLink="false">https://www.dpriver.com/blog/?p=3177</guid>

					<description><![CDATA[<p>Most SQL lineage tools silently skip stored procedures. Learn why PL/SQL, T-SQL, and PL/pgSQL procedural code breaks typical parsers, and how General SQL Parser extracts column-level lineage through stored procedures -- not just around them.</p>
<p>The post <a rel="nofollow" href="https://www.dpriver.com/blog/2026/04/why-most-sql-lineage-tools-skip-stored-procedures-and-what-you-can-do-about-it/">Why Most SQL Lineage Tools Skip Stored Procedures (And What You Can Do About It)</a> appeared first on <a rel="nofollow" href="https://www.dpriver.com/blog">SQL and Data Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>If you work with enterprise databases, you already know: stored procedures are everywhere. They hold critical business logic, enforce data transformations, and move data between tables in ways that no view or SELECT statement can capture. PL/SQL packages in Oracle, T-SQL procedures in SQL Server, PL/pgSQL functions in PostgreSQL &#8212; these are the workhorses of data processing in large organizations.</p>



<p>And yet, when you fire up a SQL lineage tool and point it at your database, those stored procedures are often quietly ignored. The lineage diagram shows tables and views connected by SELECT and INSERT statements, but the procedural code that actually drives data flow? It is either missing entirely or represented as a black box with no column-level detail.</p>



<p>This is not a minor gap. In regulated industries, auditors want to know exactly how data moves from source to target. In migration projects, you need to understand what a stored procedure does before you can rewrite it. And for impact analysis, a schema change that breaks a stored procedure can cascade into production failures that no lineage tool warned you about.</p>



<p>Let us look at why this happens, what the evidence looks like in practice, and what a proper solution requires.</p>



<h2 class="wp-block-heading">Why Tools Skip Stored Procedures</h2>



<p>The core challenge is architectural. Most SQL parsers are built to handle <strong>declarative SQL</strong> &#8212; SELECT, INSERT, UPDATE, DELETE, CREATE VIEW. These statements describe <em>what</em> data to retrieve or modify. Stored procedures, on the other hand, contain <strong>procedural logic</strong> &#8212; IF/ELSE branches, loops, exception handlers, cursor operations, dynamic SQL construction, and vendor-specific control flow.</p>



<p>There are three main reasons tools struggle:</p>



<h3 class="wp-block-heading">1. Preprocessing vs. Native Parsing</h3>



<p>Many lineage tools use a preprocessing approach: they extract the SQL statements embedded inside a procedure body and feed them into a standard SQL parser one at a time. This loses context. A variable assigned in one statement and used three lines later in an INSERT becomes invisible. A cursor that iterates over a result set and inserts rows conditionally cannot be resolved without understanding the control flow.</p>



<p>Native parsing means the parser understands the procedural language itself &#8212; DECLARE blocks, BEGIN/END scoping, cursor declarations, loop constructs, exception handlers &#8212; as first-class syntax. This is significantly harder to build, which is why most tools take the shortcut.</p>



<h3 class="wp-block-heading">2. Control Flow Complexity</h3>



<p>A stored procedure is not a single data flow &#8212; it is a program with branches. Consider a PL/SQL procedure that uses <code>BULK COLLECT</code> to fetch 10,000 rows into a collection, then uses <code>FORALL</code> to insert them into a target table with an exception handler that logs failures to an error table. The lineage from source columns to target columns passes through collection variables, loop indices, and conditional branches. Tracking column-level lineage through this requires something closer to program analysis than SQL parsing.</p>



<h3 class="wp-block-heading">3. Vendor-Specific Syntax</h3>



<p>Each database vendor has its own procedural language with unique syntax. Oracle&#8217;s PL/SQL has packages, nested procedures, autonomous transactions, and pipelined table functions. SQL Server&#8217;s T-SQL has TRY/CATCH, table variables, the GO batch separator, and MERGE with OUTPUT. PostgreSQL&#8217;s PL/pgSQL has PERFORM (execute and discard), RAISE (logging), RETURN QUERY, and dollar-quoted strings. Supporting all of these at production quality means maintaining separate, deep grammars for each dialect.</p>



<h2 class="wp-block-heading">Evidence From the Field</h2>



<p>The pattern is consistent across the ecosystem. Without naming and shaming, here is what we see:</p>



<p><strong>Open-source metadata platforms</strong> have acknowledged procedural SQL gaps. One major platform has had open issues for over five months related to stored procedure lineage extraction, with community members requesting support for PL/SQL and T-SQL procedure parsing that the platform&#8217;s SQL parser cannot handle.</p>



<p><strong>Python-based SQL transpilers</strong> that are popular in the dbt ecosystem work well for SELECT statement analysis but struggle with procedural constructs. T-SQL&#8217;s GO batch delimiter, complex PL/SQL blocks with nested BEGIN/END, and vendor-specific syntax like Oracle&#8217;s <code>EXECUTE IMMEDIATE</code> are known limitations. The transpiler approach works for the 80% case (queries and views) but hits a wall with procedural code.</p>



<p><strong>Large-scale migration tools at Fortune 500 companies</strong> have encountered this directly. One major technology company&#8217;s open-source migration utility (issue #2098) documented how their SQL parser could not propagate column-level lineage tokens through procedural logic, and the team discussed replacing their parser entirely due to dialect limitations in handling stored procedures.</p>



<p><strong>Database migration tools</strong> &#8212; both commercial (cloud provider schema conversion tools) and open-source (Oracle-to-PostgreSQL converters) &#8212; frequently fail on specific PL/SQL patterns. <code>BULK COLLECT</code> combined with <code>FORALL</code>, which is one of the most common performance optimization patterns in Oracle, is a known pain point. The migration tool either errors out or produces incorrect target code because it does not fully parse the procedural context.</p>



<p><strong>Data catalog platforms</strong> have added stored procedure support in their UI &#8212; you can view and edit procedure code &#8212; but the actual <em>parsing</em> for lineage extraction is not there. Showing the code is not the same as understanding the data flow within it.</p>



<h2 class="wp-block-heading">What Good Stored Procedure Parsing Looks Like</h2>



<p>To extract accurate column-level lineage from stored procedures, a parser needs to handle the full procedural language, not just the SQL statements embedded in it. Here are three real-world examples that illustrate the complexity.</p>



<h3 class="wp-block-heading">PL/SQL: Package With Cursor Loops and BULK COLLECT</h3>



<pre class="wp-block-code"><code>CREATE OR REPLACE PACKAGE BODY etl_pkg AS
  PROCEDURE load_customer_summary IS
    TYPE t_cust_rec IS RECORD (
      customer_id   customers.customer_id%TYPE,
      total_amount  NUMBER,
      order_count   NUMBER
    );
    TYPE t_cust_tab IS TABLE OF t_cust_rec;
    l_batch t_cust_tab;

    CURSOR c_active IS
      SELECT c.customer_id,
             SUM(o.amount) AS total_amount,
             COUNT(o.order_id) AS order_count
      FROM customers c
      JOIN orders o ON o.customer_id = c.customer_id
      WHERE c.status = 'ACTIVE'
      GROUP BY c.customer_id;
  BEGIN
    OPEN c_active;
    LOOP
      FETCH c_active BULK COLLECT INTO l_batch LIMIT 5000;
      EXIT WHEN l_batch.COUNT = 0;

      FORALL i IN 1..l_batch.COUNT
        MERGE INTO customer_summary cs
        USING (SELECT l_batch(i).customer_id AS cid,
                      l_batch(i).total_amount AS amt,
                      l_batch(i).order_count AS cnt FROM dual) src
        ON (cs.customer_id = src.cid)
        WHEN MATCHED THEN
          UPDATE SET cs.total_amount = src.amt,
                     cs.order_count = src.cnt,
                     cs.updated_at = SYSDATE
        WHEN NOT MATCHED THEN
          INSERT (customer_id, total_amount, order_count, updated_at)
          VALUES (src.cid, src.amt, src.cnt, SYSDATE);

      COMMIT;
    END LOOP;
    CLOSE c_active;
  EXCEPTION
    WHEN OTHERS THEN
      ROLLBACK;
      INSERT INTO etl_error_log (procedure_name, error_msg, error_time)
      VALUES ('load_customer_summary', SQLERRM, SYSDATE);
      COMMIT;
      RAISE;
  END load_customer_summary;
END etl_pkg;</code></pre>



<p>The lineage here flows from <code>customers.customer_id</code>, <code>orders.amount</code>, and <code>orders.order_id</code> through a cursor, into a PL/SQL collection variable, through a <code>FORALL</code> + <code>MERGE</code> statement, and finally into <code>customer_summary.customer_id</code>, <code>customer_summary.total_amount</code>, and <code>customer_summary.order_count</code>. The error path also creates lineage into <code>etl_error_log</code>. A parser that only sees the MERGE statement without understanding the cursor and collection context will miss the source columns entirely.</p>



<p><a href="https://www.gudusoft.com">General SQL Parser</a> handles this by parsing the full PL/SQL package body, resolving variable types through <code>%TYPE</code> and record declarations, tracking data flow through <code>BULK COLLECT</code> into collection variables, and connecting those variables through the <code>FORALL</code> loop to the final MERGE target.</p>



<h3 class="wp-block-heading">T-SQL: Procedure With Temp Tables, Dynamic SQL, and TRY/CATCH</h3>



<pre class="wp-block-code"><code>CREATE PROCEDURE dbo.usp_refresh_product_metrics
    @CategoryID INT,
    @StartDate DATE
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRY
        -- Stage data in temp table
        SELECT p.product_id,
               p.product_name,
               p.category_id,
               SUM(s.quantity) AS total_qty,
               SUM(s.quantity * s.unit_price) AS total_revenue
        INTO #product_staging
        FROM products p
        INNER JOIN sales s ON s.product_id = p.product_id
        WHERE p.category_id = @CategoryID
          AND s.sale_date &gt;= @StartDate
        GROUP BY p.product_id, p.product_name, p.category_id;

        -- Apply business rules
        UPDATE #product_staging
        SET total_revenue = total_revenue * 0.9
        WHERE total_qty &gt; 1000;

        -- Merge into target
        MERGE product_metrics AS tgt
        USING #product_staging AS src
        ON tgt.product_id = src.product_id
        WHEN MATCHED THEN
            UPDATE SET tgt.product_name = src.product_name,
                       tgt.total_qty = src.total_qty,
                       tgt.total_revenue = src.total_revenue,
                       tgt.last_updated = GETDATE()
        WHEN NOT MATCHED THEN
            INSERT (product_id, product_name, category_id,
                    total_qty, total_revenue, last_updated)
            VALUES (src.product_id, src.product_name, src.category_id,
                    src.total_qty, src.total_revenue, GETDATE());

        DROP TABLE #product_staging;
    END TRY
    BEGIN CATCH
        IF OBJECT_ID('tempdb..#product_staging') IS NOT NULL
            DROP TABLE #product_staging;

        INSERT INTO dbo.error_log (proc_name, error_message, error_time)
        VALUES ('usp_refresh_product_metrics', ERROR_MESSAGE(), GETDATE());

        THROW;
    END CATCH
END
GO</code></pre>



<p>This procedure creates a temp table via <code>SELECT INTO</code>, modifies it with an UPDATE, then merges into the final target. The column-level lineage must trace <code>products.product_name</code> and <code>sales.quantity</code> through <code>#product_staging</code> and into <code>product_metrics</code>. The <code>GO</code> batch delimiter, <code>TRY/CATCH</code> error handling, and temp table lifecycle all need to be understood. General SQL Parser resolves the temp table schema from the <code>SELECT INTO</code> statement, tracks the UPDATE transformation, and maps columns through the MERGE into the final target &#8212; including the error path to <code>error_log</code>.</p>



<h3 class="wp-block-heading">PL/pgSQL: Function With PERFORM, RAISE, and RETURN QUERY</h3>



<pre class="wp-block-code"><code>CREATE OR REPLACE FUNCTION analytics.refresh_user_engagement(
    p_days_back INTEGER DEFAULT 30
)
RETURNS TABLE (
    user_id BIGINT,
    engagement_score NUMERIC,
    segment TEXT
)
LANGUAGE plpgsql
AS $$
DECLARE
    v_threshold NUMERIC;
    v_count INTEGER;
BEGIN
    -- Cleanup old data
    PERFORM analytics.archive_old_engagement(p_days_back * 2);

    SELECT AVG(login_count)::NUMERIC INTO v_threshold
    FROM analytics.user_activity
    WHERE activity_date &gt;= CURRENT_DATE - p_days_back;

    INSERT INTO analytics.engagement_daily (
        user_id, engagement_score, segment, calc_date
    )
    SELECT ua.user_id,
           (ua.login_count * 0.4 + ua.page_views * 0.3
            + ua.actions_taken * 0.3)::NUMERIC AS engagement_score,
           CASE
             WHEN (ua.login_count * 0.4 + ua.page_views * 0.3
                   + ua.actions_taken * 0.3) &gt; v_threshold THEN 'high'
             ELSE 'standard'
           END AS segment,
           CURRENT_DATE
    FROM analytics.user_activity ua
    WHERE ua.activity_date &gt;= CURRENT_DATE - p_days_back
    ON CONFLICT (user_id, calc_date) DO UPDATE
    SET engagement_score = EXCLUDED.engagement_score,
        segment = EXCLUDED.segment;

    GET DIAGNOSTICS v_count = ROW_COUNT;
    RAISE NOTICE 'Processed % user engagement records', v_count;

    RETURN QUERY
    SELECT ed.user_id, ed.engagement_score, ed.segment
    FROM analytics.engagement_daily ed
    WHERE ed.calc_date = CURRENT_DATE;
END;
$$;</code></pre>



<p>Here, <code>PERFORM</code> calls another function (with side effects), <code>RAISE NOTICE</code> is PostgreSQL-specific logging, and <code>RETURN QUERY</code> makes this a table-returning function whose output can feed into other queries. The lineage traces <code>user_activity.login_count</code>, <code>user_activity.page_views</code>, and <code>user_activity.actions_taken</code> through computed expressions into <code>engagement_daily.engagement_score</code> and <code>engagement_daily.segment</code>, and onward through the function&#8217;s table return type. General SQL Parser understands all of these PL/pgSQL constructs natively.</p>
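<p>To make the traced transformation concrete, the weighted-score and segment logic can be mirrored in a few lines of Python. This is an illustrative sketch only: in the function the threshold comes from the <code>AVG(login_count)</code> query, while the value here is invented.</p>

```python
# Mirror of the engagement INSERT ... SELECT logic (illustrative only;
# the real threshold is AVG(login_count), the one below is a stand-in).

def engagement_score(login_count: float, page_views: float, actions_taken: float) -> float:
    # Matches: login_count * 0.4 + page_views * 0.3 + actions_taken * 0.3
    return login_count * 0.4 + page_views * 0.3 + actions_taken * 0.3

def segment(score: float, threshold: float) -> str:
    # Matches the CASE expression: 'high' above the threshold, else 'standard'
    return "high" if score > threshold else "standard"

score = engagement_score(login_count=10, page_views=20, actions_taken=5)
print(score, segment(score, threshold=8.0))  # 11.5 high
```

<p>Column-level lineage must attribute <code>engagement_score</code> to all three inputs, and <code>segment</code> to the same inputs plus the computed threshold; that is exactly the column-to-column mapping the parser extracts from the SQL.</p>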



<h2 class="wp-block-heading">Column-Level Lineage Through Stored Procedures, Not Around Them</h2>



<p>The key distinction is whether a tool extracts lineage <em>through</em> stored procedures or just <em>around</em> them. Working around them means treating the procedure as a black box &#8212; you know it reads from table A and writes to table B, but you do not know which specific columns are involved or how they are transformed. Working through them means the parser understands the procedural logic well enough to trace individual columns from source to target, including through variables, cursors, collections, temp tables, and control flow branches.</p>



<p><a href="https://www.gudusoft.com">General SQL Parser</a> takes the &#8220;through&#8221; approach. It parses the full procedural language for each supported dialect, builds an internal representation of variable declarations, assignments, and data flow, and resolves column-level lineage across procedural boundaries. When you feed it a PL/SQL package or a T-SQL stored procedure, the output is not just &#8220;this procedure touches these tables&#8221; &#8212; it is a complete column-to-column lineage map that shows exactly how data flows from source to target.</p>



<p>You can see this in action with <a href="https://sqlflow.gudusoft.com">SQLFlow</a>, which visualizes the lineage extracted by General SQL Parser. Upload a stored procedure and you will see the column-level connections rendered as a data flow diagram, with each transformation step visible.</p>



<h2 class="wp-block-heading">Try It Yourself</h2>



<p>If stored procedure lineage is a gap in your current tooling, here is what you can do:</p>



<ul class="wp-block-list">
<li><strong>Visualize your procedures</strong>: Upload a stored procedure to the <a href="https://sqlflow.gudusoft.com">SQLFlow online demo</a> and see the column-level lineage diagram it produces.</li>
<li><strong>Test your edge cases</strong>: Try the patterns that break other tools &#8212; BULK COLLECT + FORALL, temp tables in T-SQL, PL/pgSQL RETURN QUERY. These are the cases where parser quality shows.</li>
<li><strong>Analyze in your IDE</strong>: If you use VS Code, <a href="https://gudu-sql-omni.gudusoft.com/index.html">Gudu SQL Omni</a> brings column-level lineage, ER diagrams, and impact analysis directly into your editor &#8212; 100% offline, 34 dialects, free trial. <a href="https://marketplace.visualstudio.com/items?itemName=gudusoftware.gudu-sql-omni">Install from the VS Code Marketplace</a>.</li>
<li><strong>Evaluate against your codebase</strong>: General SQL Parser supports Oracle PL/SQL, SQL Server T-SQL, PostgreSQL PL/pgSQL, MySQL stored procedures, and DB2 SQL PL. Details and downloads are at <a href="https://www.gudusoft.com">gudusoft.com</a>.</li>
</ul>



<p>Stored procedures are not going away. If anything, the push toward database-side logic for performance and security is growing. A lineage solution that skips them is a lineage solution with blind spots &#8212; and in an era of data governance and regulatory compliance, blind spots are not acceptable.</p>
<p>The post <a rel="nofollow" href="https://www.dpriver.com/blog/2026/04/why-most-sql-lineage-tools-skip-stored-procedures-and-what-you-can-do-about-it/">Why Most SQL Lineage Tools Skip Stored Procedures (And What You Can Do About It)</a> appeared first on <a rel="nofollow" href="https://www.dpriver.com/blog">SQL and Data Blog</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Understanding SQLFlow Job: Automate Your Data Lineage Analysis at Scale</title>
		<link>https://www.dpriver.com/blog/2026/02/understanding-sqlflow-job-automate-your-data-lineage-analysis-at-scale/</link>
		
		<dc:creator><![CDATA[James]]></dc:creator>
		<pubDate>Thu, 05 Feb 2026 11:02:29 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">https://www.dpriver.com/blog/?p=3169</guid>

					<description><![CDATA[<p>When dealing with large-scale data environments, analyzing data lineage manually becomes impractical. You might have hundreds of SQL files across different repositories, databases, and ETL processes. This is where SQLFlow Job comes into play – a powerful feature that enables batch processing of SQL files for automated data lineage discovery. In this article, we&#8217;ll explore two ways to create and manage SQLFlow Jobs: through the intuitive Web UI and programmatically via the REST API. Part 1: Creating Jobs Through the SQLFlow UI The SQLFlow web interface provides a user-friendly way to create and manage jobs without writing any code. This…</p>
<p>The post <a rel="nofollow" href="https://www.dpriver.com/blog/2026/02/understanding-sqlflow-job-automate-your-data-lineage-analysis-at-scale/">Understanding SQLFlow Job: Automate Your Data Lineage Analysis at Scale</a> appeared first on <a rel="nofollow" href="https://www.dpriver.com/blog">SQL and Data Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>When dealing with large-scale data environments, analyzing data lineage manually becomes impractical. You might have hundreds of SQL files across different repositories, databases, and ETL processes. This is where <strong>SQLFlow Job</strong> comes into play – a powerful feature that enables batch processing of SQL files for automated data lineage discovery.</p>



<p>In this article, we&#8217;ll explore two ways to create and manage SQLFlow Jobs: through the intuitive <strong>Web UI</strong> and programmatically via the <strong>REST API</strong>.</p>



<h2 class="wp-block-heading">Part 1: Creating Jobs Through the SQLFlow UI</h2>



<p>The SQLFlow web interface provides a user-friendly way to create and manage jobs without writing any code. It suits ad-hoc analysis, exploration, and anyone who prefers a visual approach.</p>



<h3 class="wp-block-heading">Step 1: Access the Job Creation Panel</h3>



<p>After logging into <a href="https://sqlflow.gudusoft.com">SQLFlow</a>, navigate to the Job List section and click the <strong>Job Creation</strong> button.</p>



<h3 class="wp-block-heading">Step 2: Choose Your Job Source</h3>



<p>SQLFlow supports multiple data sources for job creation:</p>



<ul>
<li><strong>Upload File</strong>: Upload SQL files or a ZIP archive containing multiple SQL files (up to 200MB). This is ideal for analyzing DDL scripts, stored procedures, or any SQL codebase.</li>
<li><strong>From Database</strong>: Connect directly to your database server. SQLFlow will read metadata and SQL objects to generate lineage automatically.</li>
<li><strong>Upload File + Database Metadata</strong>: Combine uploaded SQL files with database metadata to resolve column ambiguities for more accurate lineage.</li>
<li><strong>dbt</strong>: Import lineage from your dbt transformation projects using manifest.json and catalog.json files.</li>
<li><strong>Snowflake Query History</strong>: Analyze lineage from Snowflake query history automatically.</li>
<li><strong>Redshift Log</strong>: Parse Amazon Redshift logs (.gz or .zip format) for lineage discovery.</li>
</ul>



<h3 class="wp-block-heading">Step 3: Configure Job Parameters</h3>



<p>Set up your job with these options:</p>



<ul>
<li><strong>Database Vendor</strong>: Select your SQL dialect (Oracle, SQL Server, MySQL, PostgreSQL, Snowflake, etc.)</li>
<li><strong>Default Server/Database/Schema</strong>: Set default values for unqualified object references</li>
<li><strong>Job Type</strong>: Choose between regular job or summary mode for large datasets</li>
<li><strong>Advanced Settings</strong>: Filter specific schemas, stored procedures, or views to include/exclude</li>
</ul>



<h3 class="wp-block-heading">Step 4: View and Manage Results</h3>



<p>Once the job completes, you can:</p>



<ul>
<li><strong>Lineage Overview</strong>: View table-level lineage across your entire database</li>
<li><strong>Lineage Detail</strong>: Drill down into column-level relationships</li>
<li><strong>Job Info</strong>: Check metadata like creation time, execution time, and job status</li>
<li><strong>Export</strong>: Download results in JSON, CSV, or GraphML formats</li>
</ul>



<h2 class="wp-block-heading">Part 2: Creating Jobs Through the REST API</h2>



<p>For automation, CI/CD integration, and programmatic access, SQLFlow provides a comprehensive REST API. This approach is ideal for scheduled lineage scans, integration with data pipelines, and enterprise automation scenarios.</p>



<h3 class="wp-block-heading">API Authentication</h3>



<p>First, obtain your API credentials (userId and token) from your SQLFlow account settings. See the <a href="https://docs.gudusoft.com/3.-api-docs/prerequisites#generate-account-secret">prerequisites documentation</a> for details.</p>



<h3 class="wp-block-heading">Submit a Job</h3>



<p>Use the <code>/submitUserJob</code> endpoint to upload SQL files:</p>



<pre class="wp-block-code"><code>curl -X POST "https://api.gudusoft.com/gspLive_backend/sqlflow/job/submitUserJob" \
  -H "Content-Type:multipart/form-data" \
  -F "userId=YOUR_USER_ID" \
  -F "token=YOUR_TOKEN" \
  -F "sqlfiles=@/path/to/queries.sql" \
  -F "sqlfiles=@/path/to/procedures.sql" \
  -F "dbvendor=dbvoracle" \
  -F "jobName=my-lineage-analysis"</code></pre>



<p>The API returns a <code>jobId</code> for tracking:</p>



<pre class="wp-block-code"><code>{
  "code": 200,
  "data": {
    "jobId": "c359aef4bd9641d697732422debd8055",
    "jobName": "my-lineage-analysis",
    "status": "create"
  }
}</code></pre>
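<p>The same submission can be scripted, for example in Python with the third-party <code>requests</code> library. This is a sketch under those assumptions: the endpoint and form fields mirror the curl call above, while credentials and file paths are placeholders.</p>

```python
# Sketch: submit SQL files to /submitUserJob and pull the jobId out of the
# response shape shown above. Assumes the third-party `requests` library;
# credentials and file paths are placeholders.
import requests

API_BASE = "https://api.gudusoft.com/gspLive_backend/sqlflow/job"

def extract_job_id(body: dict) -> str:
    # Response shape per the example above: {"code": 200, "data": {"jobId": ...}}
    if body.get("code") != 200:
        raise RuntimeError(f"job submission failed: {body}")
    return body["data"]["jobId"]

def submit_job(user_id: str, token: str, sql_paths: list,
               dbvendor: str = "dbvoracle",
               job_name: str = "my-lineage-analysis") -> str:
    data = {"userId": user_id, "token": token,
            "dbvendor": dbvendor, "jobName": job_name}
    # One "sqlfiles" part per file, matching the repeated curl -F flags
    files = [("sqlfiles", open(p, "rb")) for p in sql_paths]
    resp = requests.post(f"{API_BASE}/submitUserJob", data=data, files=files)
    resp.raise_for_status()
    return extract_job_id(resp.json())
```

<p>Hold on to the returned <code>jobId</code>; every later call (status checks, exports) keys off it.</p>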



<h3 class="wp-block-heading">Check Job Status</h3>



<p>Monitor progress with <code>/displayUserJobSummary</code>:</p>



<pre class="wp-block-code"><code>curl -X POST "https://api.gudusoft.com/gspLive_backend/sqlflow/job/displayUserJobSummary" \
  -F "jobId=c359aef4bd9641d697732422debd8055" \
  -F "userId=YOUR_USER_ID" \
  -F "token=YOUR_TOKEN"</code></pre>



<p>When <code>status</code> becomes <code>success</code>, your lineage is ready.</p>
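<p>In an automation script you would poll this endpoint until the job reaches a terminal state. A hedged Python sketch: the <code>success</code> status is documented above, but treating <code>fail</code> as the terminal error state, the <code>data.status</code> response path, and the third-party <code>requests</code> library are all assumptions.</p>

```python
# Poll /displayUserJobSummary until the job finishes. "success" comes from
# the text above; "fail" as the terminal error state, and the data.status
# response path, are assumptions.
import time
import requests

API_BASE = "https://api.gudusoft.com/gspLive_backend/sqlflow/job"

def is_terminal(status: str) -> bool:
    # Stop polling once the job has succeeded or (assumed) failed
    return status in {"success", "fail"}

def wait_for_job(user_id: str, token: str, job_id: str,
                 poll_seconds: float = 10.0, timeout: float = 600.0) -> str:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # (None, value) pairs make requests send multipart/form-data, like curl -F
        resp = requests.post(f"{API_BASE}/displayUserJobSummary",
                             files={"userId": (None, user_id),
                                    "token": (None, token),
                                    "jobId": (None, job_id)})
        resp.raise_for_status()
        status = resp.json()["data"]["status"]
        if is_terminal(status):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} still running after {timeout}s")
```

<p>A fixed interval with an overall timeout keeps the loop simple; for large jobs you might lengthen <code>poll_seconds</code> or add backoff.</p>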



<h3 class="wp-block-heading">Export Lineage Results</h3>



<p>Download results in your preferred format:</p>



<pre class="wp-block-code"><code># JSON format
curl -X POST ".../exportLineageAsJson" -F "jobId=..." --output lineage.json

# CSV format  
curl -X POST ".../exportFullLineageAsCsv" -F "jobId=..." --output lineage.csv

# GraphML format (for yEd visualization)
curl -X POST ".../exportLineageAsGraphml" -F "jobId=..." --output lineage.graphml</code></pre>



<h3 class="wp-block-heading">Incremental Jobs</h3>



<p>For evolving codebases, use incremental analysis with <code>/submitPersistJob</code>:</p>



<pre class="wp-block-code"><code>curl -X POST ".../submitPersistJob" \
  -F "incremental=true" \
  -F "jobId=EXISTING_JOB_ID" \
  -F "sqlfiles=@/path/to/updated_queries.sql" \
  ...</code></pre>



<p>This updates the existing lineage graph without re-analyzing everything.</p>



<h2 class="wp-block-heading">When to Use UI vs API</h2>



<figure class="wp-block-table"><table><thead><tr><th>Scenario</th><th>Recommended Approach</th></tr></thead><tbody><tr><td>Ad-hoc analysis and exploration</td><td>UI</td></tr><tr><td>One-time database lineage scan</td><td>UI</td></tr><tr><td>CI/CD pipeline integration</td><td>API</td></tr><tr><td>Scheduled nightly lineage scans</td><td>API</td></tr><tr><td>Integration with data catalog tools</td><td>API</td></tr><tr><td>Quick visualization needs</td><td>UI</td></tr></tbody></table></figure>



<h2 class="wp-block-heading">Real-World Use Cases</h2>



<ol>
<li><strong>Data Governance Automation</strong>: Schedule nightly API jobs to scan your data warehouse SQL and detect lineage changes</li>
<li><strong>CI/CD Integration</strong>: Trigger lineage analysis when SQL changes are committed to your repository</li>
<li><strong>Migration Projects</strong>: Use the UI to batch analyze legacy stored procedures before modernization</li>
<li><strong>Compliance Audits</strong>: Generate lineage reports in CSV format for regulatory requirements</li>
<li><strong>Impact Analysis</strong>: Before modifying a table, understand all downstream dependencies</li>
</ol>



<h2 class="wp-block-heading">Getting Started</h2>



<p>Ready to try SQLFlow Job? Here&#8217;s how to get started:</p>



<ol>
<li><strong>Sign up</strong> at <a href="https://sqlflow.gudusoft.com">sqlflow.gudusoft.com</a> for a free trial</li>
<li><strong>Try the UI</strong>: Upload a SQL file and explore the interactive lineage visualization</li>
<li><strong>Explore the API</strong>: Check the <a href="https://docs.gudusoft.com/3.-api-docs/sqlflow-rest-api-reference/job-interface/">Job API documentation</a> for code samples</li>
</ol>



<p>Whether you prefer the visual approach or need full automation, SQLFlow Job provides the flexibility to handle data lineage analysis at any scale.</p>



<p><a href="https://sqlflow.gudusoft.com">Try SQLFlow today</a> – your data lineage journey starts here.</p>
<p>The post <a rel="nofollow" href="https://www.dpriver.com/blog/2026/02/understanding-sqlflow-job-automate-your-data-lineage-analysis-at-scale/">Understanding SQLFlow Job: Automate Your Data Lineage Analysis at Scale</a> appeared first on <a rel="nofollow" href="https://www.dpriver.com/blog">SQL and Data Blog</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
