<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>R-bloggers</title>
	<atom:link href="https://www.r-bloggers.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.r-bloggers.com</link>
	<description>R news and tutorials contributed by hundreds of R bloggers</description>
	<lastBuildDate>Sat, 14 Mar 2026 18:27:36 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=5.5.18</generator>

<image>
	<url>https://i0.wp.com/www.r-bloggers.com/wp-content/uploads/2016/08/cropped-R_single_01-200.png?fit=32%2C32&#038;ssl=1</url>
	<title>R-bloggers</title>
	<link>https://www.r-bloggers.com</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">11524731</site>	<item>
		<title>Bayesian Linear Regression in R: A Step-by-Step Tutorial</title>
		<link>https://www.r-bloggers.com/2026/03/bayesian-linear-regression-in-r-a-step-by-step-tutorial/</link>
		
		<dc:creator><![CDATA[rprogrammingbooks]]></dc:creator>
		<pubDate>Sat, 14 Mar 2026 18:27:36 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://rprogrammingbooks.com/?p=2516</guid>

					<description><![CDATA[<p>Bayesian linear regression is one of the best ways to learn Bayesian modeling in R because it combines familiar regression ideas with a more realistic treatment of uncertainty. Instead of producing a single fixed estimate for each coefficient, Bayesian methods estimate full probability distributions. That means we can talk about ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/bayesian-linear-regression-in-r-a-step-by-step-tutorial/">Bayesian Linear Regression in R: A Step-by-Step Tutorial</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://rprogrammingbooks.com/bayesian-linear-regression-in-r-step-by-step-tutorial/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=bayesian-linear-regression-in-r-step-by-step-tutorial"> Blog - R Programming Books</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <meta name="description" content="A complete step-by-step Bayesian linear regression tutorial in R using brms, bayesplot, tidybayes, and related packages." />
  <meta name="keywords" content="Bayesian Linear Regression in R, Bayesian regression tutorial, brms tutorial, rstanarm, bayesplot, tidybayes, Bayesian modeling R" />
  <meta name="author" content="RProgrammingBooks" />
  <meta name="robots" content="index, follow" />

  <style>
    body {
      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Arial, sans-serif;
      line-height: 1.8;
      background: #f8fafc;
      color: #1f2937;
      max-width: 920px;
      margin: 0 auto;
      padding: 40px 24px 80px;
    }

    h1, h2, h3, h4 {
      color: #0f172a;
      line-height: 1.3;
      margin-top: 42px;
      margin-bottom: 16px;
    }

    p {
      margin: 16px 0;
    }

    ul, ol {
      margin: 16px 0 16px 22px;
    }

    li {
      margin-bottom: 10px;
    }

    pre {
      background: #0b1220;
      color: #e5e7eb;
      padding: 18px;
      overflow-x: auto;
      border-radius: 8px;
      font-size: 0.95rem;
      line-height: 1.6;
      margin: 20px 0;
    }

    code {
      font-family: "SFMono-Regular", Consolas, "Liberation Mono", Menlo, monospace;
    }

    .callout {
      background: #eef6ff;
      border-left: 4px solid #2563eb;
      padding: 18px;
      border-radius: 6px;
      margin: 28px 0;
    }

    .package-grid {
      display: grid;
      grid-template-columns: repeat(auto-fit, minmax(230px, 1fr));
      gap: 16px;
      margin: 28px 0;
    }

    .package-card {
      background: #ffffff;
      border: 1px solid #e5e7eb;
      border-radius: 8px;
      padding: 16px;
    }

    .package-card h4 {
      margin-top: 0;
      margin-bottom: 10px;
      font-size: 1rem;
    }

    .book-box {
      background: #ffffff;
      border: 1px solid #dbe3ee;
      border-radius: 8px;
      padding: 22px;
      margin: 38px 0;
    }

    .table-wrap {
      overflow-x: auto;
      margin: 24px 0;
    }

    table {
      width: 100%;
      border-collapse: collapse;
      background: #ffffff;
    }

    th, td {
      border: 1px solid #e5e7eb;
      padding: 12px;
      text-align: left;
    }

    th {
      background: #f1f5f9;
    }

    a {
      color: #1d4ed8;
      text-decoration: none;
    }

    a:hover {
      text-decoration: underline;
    }

    .section-divider {
      height: 1px;
      background: #e5e7eb;
      margin: 42px 0;
    }
  </style>
</head>
<body>

  <p>
    Bayesian linear regression is one of the best ways to learn Bayesian modeling in R because it combines familiar regression ideas with a more realistic treatment of uncertainty.
    Instead of producing a single fixed estimate for each coefficient, Bayesian methods estimate a full probability distribution for it.
    That means we can talk about uncertainty, prior beliefs, posterior updates, and credible intervals in a way that is often more intuitive than classical statistics.
  </p>

  <p>
    In this tutorial, you will learn how to fit a Bayesian linear regression model in R step by step.
    We will start with the theory, build a dataset, choose priors, fit a model with <strong>brms</strong>, inspect posterior distributions, evaluate diagnostics, perform posterior predictive checks, and generate predictions for new observations.
    We will also look at several R packages that belong to a practical Bayesian workflow.
  </p>

  <div class="callout">
    <strong>What you will learn in this tutorial:</strong>
    <ul>
      <li>How Bayesian linear regression works</li>
      <li>How priors and posteriors differ from classical estimates</li>
      <li>How to fit a model in R using <code>brms</code></li>
      <li>How to inspect convergence and model quality</li>
      <li>How to use related packages such as <code>tidybayes</code>, <code>bayesplot</code>, and <code>rstanarm</code></li>
    </ul>
  </div>

  <h2>Why Bayesian Linear Regression Matters</h2>

  <p>
    In classical linear regression, the model estimates coefficients such as the intercept and slope as fixed unknown values.
    In Bayesian regression, those same coefficients are modeled as random variables with prior distributions.
    Once data is observed, those priors are updated into posterior distributions.
  </p>

  <p>
    This gives us several advantages:
  </p>

  <ul>
    <li>We can incorporate prior knowledge into the model</li>
    <li>We get full uncertainty distributions, not just point estimates</li>
    <li>Predictions naturally include uncertainty</li>
    <li>Bayesian methods scale well into multilevel and hierarchical models</li>
    <li>The interpretation of intervals is often more direct</li>
  </ul>

  <p>
    If you are working in predictive analytics, experimental analysis, or sports modeling, this framework is especially useful because it lets you update beliefs as new data arrives.
  </p>

  <h2>The Bayesian Formula Behind Linear Regression</h2>

  <p>
    A simple linear regression can be written as:
  </p>

  <pre>y = β0 + β1x + ε</pre>

  <p>
    Where:
  </p>

  <ul>
    <li><strong>y</strong> is the response variable</li>
    <li><strong>x</strong> is the predictor</li>
    <li><strong>β0</strong> is the intercept</li>
    <li><strong>β1</strong> is the slope</li>
    <li><strong>ε</strong> is the error term, typically assumed to be normally distributed</li>
  </ul>

  <p>
    In the Bayesian version, we add priors:
  </p>

  <pre>β0 ~ Normal(0, 10)
β1 ~ Normal(0, 5)
σ  ~ Student_t(3, 0, 2.5)</pre>

  <p>
    After seeing the data, we compute:
  </p>

  <pre>Posterior ∝ Likelihood × Prior</pre>

  <p>
    That one line is the core of Bayesian inference.
  </p>
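  <p>
    To see that formula in action without any MCMC machinery, here is a small, self-contained grid approximation for the mean of a normal sample. The data and grid values are invented for illustration; the point is that multiplying the prior by the likelihood and normalizing yields the posterior.
  </p>

  <pre>set.seed(1)

y &lt;- rnorm(20, mean = 2, sd = 1)          # toy observed data

mu_grid &lt;- seq(-2, 6, length.out = 201)   # candidate values for the mean

prior      &lt;- dnorm(mu_grid, mean = 0, sd = 10)                 # weak prior
likelihood &lt;- sapply(mu_grid, function(m) prod(dnorm(y, m, 1)))

posterior &lt;- prior * likelihood           # posterior is proportional to likelihood times prior
posterior &lt;- posterior / sum(posterior)   # normalize so the grid sums to 1

mu_grid[which.max(posterior)]             # posterior mode, close to mean(y)</pre>

  <p>
    With such a weak prior, the posterior mode lands almost exactly on the sample mean; a tighter prior would pull it toward zero.
  </p>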

  <h2>R Packages You Should Know for Bayesian Regression</h2>

  <p>
    Before fitting models, it helps to understand the ecosystem. Bayesian modeling in R is not just about one package.
    It is usually a workflow involving model fitting, posterior extraction, diagnostics, and visualization.
  </p>

  <div class="package-grid">
    <div class="package-card">
      <h4>brms</h4>
      <p>High-level Bayesian regression modeling with formula syntax similar to <code>lm()</code> and <code>glm()</code>.</p>
    </div>
    <div class="package-card">
      <h4>rstanarm</h4>
      <p>Bayesian applied regression with an interface that feels familiar to many R users.</p>
    </div>
    <div class="package-card">
      <h4>tidybayes</h4>
      <p>Extracts and visualizes posterior draws in a tidy format for easy analysis and plotting.</p>
    </div>
    <div class="package-card">
      <h4>bayesplot</h4>
      <p>Useful for trace plots, posterior predictive checks, and MCMC diagnostics.</p>
    </div>
    <div class="package-card">
      <h4>posterior</h4>
      <p>Helpful for working with draws, summaries, and posterior diagnostics in a standardized way.</p>
    </div>
    <div class="package-card">
      <h4>cmdstanr</h4>
      <p>R interface to CmdStan, useful for users who want more direct Stan workflows and model control.</p>
    </div>
    <div class="package-card">
      <h4>loo</h4>
      <p>Widely used for approximate leave-one-out cross-validation and model comparison.</p>
    </div>
    <div class="package-card">
      <h4>ggplot2</h4>
      <p>Still essential for custom data exploration and clean visualization of posterior summaries.</p>
    </div>
  </div>

  <h2>Installing the Required Packages</h2>

  <p>
    For this tutorial, we will focus on <code>brms</code> for model fitting, while also using a few companion packages for diagnostics and visualization.
  </p>

  <pre>install.packages(c(
  &quot;brms&quot;,
  &quot;tidyverse&quot;,
  &quot;tidybayes&quot;,
  &quot;bayesplot&quot;,
  &quot;posterior&quot;,
  &quot;loo&quot;,
  &quot;rstanarm&quot;
))</pre>

  <p>
    Then load the packages:
  </p>

  <pre>library(brms)
library(tidyverse)
library(tidybayes)
library(bayesplot)
library(posterior)
library(loo)
library(rstanarm)</pre>

  <div class="callout">
    <strong>Tip:</strong> If you want a lower-level interface to Stan, you can also explore <code>cmdstanr</code>.
    For most readers, however, <code>brms</code> is a better starting point because it keeps the syntax concise while still giving access to advanced Bayesian models.
  </div>

  <h2>Creating a Simple Dataset</h2>

  <p>
    To make the tutorial reproducible, we will simulate a small dataset where one predictor explains a continuous response.
  </p>

  <pre>set.seed(123)

n &lt;- 120

advertising_spend &lt;- rnorm(n, mean = 15, sd = 4)

sales &lt;- 20 + 3.5 * advertising_spend + rnorm(n, mean = 0, sd = 8)

df &lt;- data.frame(
  advertising_spend = advertising_spend,
  sales = sales
)

head(df)</pre>

  <p>
    In this synthetic example, higher advertising spend tends to increase sales.
    The true slope used in the simulation is 3.5, but in a real modeling situation we would not know that value ahead of time.
  </p>

  <h2>Exploring the Data First</h2>

  <p>
    It is always a good idea to inspect the data visually before fitting any Bayesian model.
  </p>

  <pre>summary(df)

ggplot(df, aes(x = advertising_spend, y = sales)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = &quot;lm&quot;, se = FALSE) +
  theme_minimal()</pre>

  <p>
    This scatterplot helps confirm that the relationship is roughly linear.
    Even when using Bayesian methods, the basic logic of exploratory data analysis still applies.
  </p>

  <h2>A Quick Classical Baseline with lm()</h2>

  <p>
    Before fitting the Bayesian model, it is useful to compare it with a standard linear regression.
  </p>

  <pre>lm_model &lt;- lm(sales ~ advertising_spend, data = df)
summary(lm_model)</pre>

  <p>
    This gives us a baseline estimate for the intercept and slope.
    Later, we will compare it to Bayesian posterior summaries.
  </p>

  <h2>Choosing Priors</h2>

  <p>
    Priors are a defining part of Bayesian modeling.
    A prior reflects what we believe about a parameter before seeing the current data.
    Priors can be weakly informative, informative, or strongly regularizing depending on the context.
  </p>

  <p>
    In many practical applications, weakly informative priors are a good default.
  </p>

  <pre>priors &lt;- c(
  prior(normal(0, 20), class = &quot;Intercept&quot;),
  prior(normal(0, 10), class = &quot;b&quot;),
  prior(student_t(3, 0, 10), class = &quot;sigma&quot;)
)

priors</pre>

  <p>
    This prior specification says:
  </p>

  <ul>
    <li>The intercept is centered around 0 with broad uncertainty</li>
    <li>The slope is also centered around 0 with a wide standard deviation</li>
    <li>The residual standard deviation is positive and given a weakly informative prior</li>
  </ul>

  <p>
    In real projects, priors should reflect domain knowledge whenever possible.
    For example, in marketing, finance, or sports analytics, prior expectations often come from previous seasons, experiments, or historical model performance.
  </p>

  <h2>Fitting the Bayesian Linear Regression Model</h2>

  <p>
    Now we can fit the Bayesian model using <code>brm()</code>.
  </p>

  <pre>bayes_model &lt;- brm(
  formula = sales ~ advertising_spend,
  data = df,
  prior = priors,
  family = gaussian(),
  chains = 4,
  iter = 4000,
  warmup = 2000,
  seed = 123
)</pre>

  <p>
    Here is what the most important arguments mean:
  </p>

  <ul>
    <li><code>chains = 4</code>: run four independent Markov chains</li>
    <li><code>iter = 4000</code>: total iterations per chain, including warmup</li>
    <li><code>warmup = 2000</code>: iterations per chain used for sampler adaptation and discarded, leaving 2,000 posterior draws per chain</li>
    <li><code>family = gaussian()</code>: assume a normal likelihood for the response</li>
  </ul>

  <h2>Reading the Model Summary</h2>

  <pre>summary(bayes_model)</pre>

  <p>
    The summary output typically reports:
  </p>

  <ul>
    <li>Posterior mean (Estimate) and posterior standard deviation (Est.Error) for each parameter</li>
    <li>Credible intervals</li>
    <li>R-hat values for convergence</li>
    <li>Effective sample sizes</li>
  </ul>

  <p>
    A good sign is when R-hat values are close to 1.00.
    That suggests the MCMC chains mixed well and converged.
  </p>
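  <p>
    You can also pull the R-hat values out programmatically rather than reading them off the summary. This sketch assumes the <code>bayes_model</code> object fitted above; <code>rhat()</code> is provided by <code>brms</code>.
  </p>

  <pre>rhats &lt;- rhat(bayes_model)
rhats

# A common rule of thumb: flag anything above 1.01
any(rhats &gt; 1.01, na.rm = TRUE)</pre>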

  <h2>Understanding the Posterior Output</h2>

  <p>
    Suppose the slope posterior is centered near 3.4 with a 95% credible interval from 3.0 to 3.8.
    In Bayesian terms, that means the model assigns high posterior probability to the slope being in that interval.
    This is one reason many analysts find Bayesian intervals easier to interpret.
  </p>
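  <p>
    As a sketch of how to compute such an interval yourself, assuming the fitted <code>bayes_model</code> from above: <code>posterior_interval()</code> and <code>fixef()</code> are both available for <code>brms</code> models.
  </p>

  <pre># 95% credible intervals for all parameters
posterior_interval(bayes_model, prob = 0.95)

# Or just the regression coefficients, with posterior means and intervals
fixef(bayes_model)</pre>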

  <p>
    In practical language, we could say:
  </p>

  <div class="callout">
    Based on the model and the observed data, higher advertising spend is strongly associated with higher sales, and the posterior distribution indicates that the effect is likely positive and substantial.
  </div>

  <h2>Extracting Posterior Draws</h2>

  <p>
    One of the strengths of Bayesian modeling is that you can work directly with posterior draws.
  </p>

  <pre>draws &lt;- as_draws_df(bayes_model)
head(draws)</pre>

  <p>
    This lets you explore parameter distributions, uncertainty, and custom probabilities.
  </p>

  <pre>mean(draws$b_advertising_spend &gt; 0)</pre>

  <p>
    The code above estimates the posterior probability that the slope is greater than zero.
    That is a very natural Bayesian quantity.
  </p>

  <h2>Visualizing Posterior Distributions</h2>

  <pre>plot(bayes_model)</pre>

  <p>
    The default plot gives a quick view of posterior densities and chain behavior.
    You can also visualize intervals more explicitly:
  </p>

  <pre>mcmc_areas(
  as.array(bayes_model),
  pars = c(&quot;b_Intercept&quot;, &quot;b_advertising_spend&quot;, &quot;sigma&quot;)
)</pre>

  <p>
    This is where <code>bayesplot</code> becomes especially useful.
  </p>

  <h2>Checking Convergence with Trace Plots</h2>

  <p>
    Trace plots help determine whether the MCMC chains have mixed properly.
  </p>

  <pre>mcmc_trace(
  as.array(bayes_model),
  pars = c(&quot;b_Intercept&quot;, &quot;b_advertising_spend&quot;, &quot;sigma&quot;)
)</pre>

  <p>
    Healthy trace plots should look like fuzzy horizontal bands rather than trending lines or stuck sequences.
  </p>

  <h2>Posterior Predictive Checks</h2>

  <p>
    Posterior predictive checks are one of the most important parts of a Bayesian workflow.
    They compare the observed data to data simulated from the fitted model.
  </p>

  <pre>pp_check(bayes_model)</pre>

  <p>
    If the simulated data looks broadly similar to the observed data, that is a sign the model captures the main structure reasonably well.
  </p>

  <p>
    You can also try more specific checks:
  </p>

  <pre>pp_check(bayes_model, type = &quot;dens_overlay&quot;)
pp_check(bayes_model, type = &quot;hist&quot;)
pp_check(bayes_model, type = &quot;scatter_avg&quot;)</pre>

  <h2>Using tidybayes for Tidy Posterior Workflows</h2>

  <p>
    The <code>tidybayes</code> package is extremely useful when you want to extract posterior draws into tidy data frames and build custom visualizations with <code>ggplot2</code>.
  </p>

  <pre>tidy_draws &lt;- bayes_model %&gt;%
  spread_draws(b_Intercept, b_advertising_spend, sigma)

head(tidy_draws)</pre>

  <p>
    For example, you can visualize the slope distribution:
  </p>

  <pre>tidy_draws %&gt;%
  ggplot(aes(x = b_advertising_spend)) +
  geom_density(fill = &quot;steelblue&quot;, alpha = 0.4) +
  theme_minimal()</pre>

  <p>
    This makes posterior analysis much more flexible than relying only on built-in summary output.
  </p>

  <h2>Generating Predictions for New Data</h2>

  <p>
    One of the biggest reasons to use regression is prediction.
    Bayesian models make this especially valuable because predictions come with uncertainty intervals.
  </p>

  <pre>new_customers &lt;- data.frame(
  advertising_spend = c(10, 15, 20, 25)
)

predict(bayes_model, newdata = new_customers)</pre>

  <p>
    You can also generate expected values without residual noise:
  </p>

  <pre>fitted(bayes_model, newdata = new_customers)</pre>

  <p>
    The difference is important:
  </p>

  <ul>
    <li><code>predict()</code> includes outcome uncertainty</li>
    <li><code>fitted()</code> focuses on the expected mean response</li>
  </ul>
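  <p>
    One way to see the difference, assuming the <code>new_customers</code> data frame and fitted model from above: prediction intervals from <code>predict()</code> should be noticeably wider than the expected-value intervals from <code>fitted()</code>, because they add residual noise on top of parameter uncertainty.
  </p>

  <pre>pred &lt;- predict(bayes_model, newdata = new_customers)
fit  &lt;- fitted(bayes_model, newdata = new_customers)

# Interval widths: prediction intervals are the wider ones
pred[, &quot;Q97.5&quot;] - pred[, &quot;Q2.5&quot;]
fit[, &quot;Q97.5&quot;]  - fit[, &quot;Q2.5&quot;]</pre>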

  <h2>Visualizing the Regression Line with Uncertainty</h2>

  <pre>conditional_effects(bayes_model)</pre>

  <p>
    This is a quick way to visualize the fitted relationship and credible intervals.
    It is particularly useful when presenting results to readers who are new to Bayesian modeling.
  </p>

  <h2>Comparing the Classical and Bayesian Models</h2>

  <div class="table-wrap">
    <table>
      <thead>
        <tr>
          <th>Aspect</th>
          <th>Classical lm()</th>
          <th>Bayesian Model</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>Parameter estimates</td>
          <td>Single point estimate</td>
          <td>Full posterior distribution</td>
        </tr>
        <tr>
          <td>Intervals</td>
          <td>Confidence intervals</td>
          <td>Credible intervals</td>
        </tr>
        <tr>
          <td>Prior knowledge</td>
          <td>Not included directly</td>
          <td>Included through priors</td>
        </tr>
        <tr>
          <td>Predictions</td>
          <td>Often point-centered</td>
          <td>Naturally uncertainty-aware</td>
        </tr>
        <tr>
          <td>Interpretability</td>
          <td>Frequentist framework</td>
          <td>Probability-based framework</td>
        </tr>
      </tbody>
    </table>
  </div>

  <h2>Alternative Approach with rstanarm</h2>

  <p>
    If you want a very approachable alternative to <code>brms</code>, you can fit a similar model with <code>rstanarm</code>.
  </p>

  <pre>rstanarm_model &lt;- stan_glm(
  sales ~ advertising_spend,
  data = df,
  family = gaussian(),
  chains = 4,
  iter = 4000,
  seed = 123
)

print(rstanarm_model)</pre>

  <p>
    This package is especially attractive for users who want Bayesian estimation with minimal syntax changes from familiar regression workflows.
  </p>

  <h2>Model Comparison with loo</h2>

  <p>
    For more advanced workflows, model comparison is often done with approximate leave-one-out cross-validation.
  </p>

  <pre>loo_result &lt;- loo(bayes_model)
print(loo_result)</pre>

  <p>
    This becomes particularly useful when comparing multiple Bayesian models with different predictors or structures.
  </p>
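  <p>
    As a sketch of what such a comparison might look like, here is a hypothetical second model with the predictor removed, compared with <code>loo_compare()</code> (re-exported by <code>brms</code>). The intercept-only model is invented purely for illustration.
  </p>

  <pre>null_model &lt;- brm(
  sales ~ 1,
  data = df,
  family = gaussian(),
  chains = 4,
  iter = 4000,
  seed = 123
)

loo_compare(loo(bayes_model), loo(null_model))</pre>

  <p>
    The model listed first in the output has the best expected predictive performance; here we would expect <code>bayes_model</code> to win comfortably.
  </p>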

  <h2>Common Beginner Mistakes in Bayesian Regression</h2>

  <ul>
    <li>Using priors without thinking about the scale of the data</li>
    <li>Ignoring convergence diagnostics such as R-hat and trace plots</li>
    <li>Skipping posterior predictive checks</li>
    <li>Confusing credible intervals with classical confidence intervals</li>
    <li>Treating Bayesian modeling as only a different fitting function rather than a full workflow</li>
  </ul>

  <h2>When Bayesian Linear Regression Is a Great Choice</h2>

  <p>
    Bayesian linear regression is especially useful when:
  </p>

  <ul>
    <li>You want to express uncertainty directly</li>
    <li>You have prior knowledge from previous studies or historical data</li>
    <li>Your sample size is not huge and regularization helps</li>
    <li>You plan to expand into hierarchical or multilevel models later</li>
    <li>You need probabilistic predictions rather than just fitted coefficients</li>
  </ul>

  <h2>From Linear Regression to Real-World Prediction</h2>

  <p>
    Once you understand Bayesian linear regression, you can move into more realistic applications such as multilevel models, logistic regression, time series forecasting, and domain-specific predictive systems.
    In practice, many analysts first learn Bayesian methods through regression, then extend them into richer workflows for decision-making and forecasting.
  </p>

  <div class="book-box">
    <p>
      If your interest goes beyond introductory examples and into real prediction workflows, Bayesian methods are especially valuable in sports modeling, where uncertainty, updating, and probabilistic forecasts matter a lot.
    </p>

    <ul>
      <li>
        <a href="https://rprogrammingbooks.com/product/bayesian-sports-analytics-r-predictive-modeling-betting-performance/" rel="nofollow" target="_blank">
          Bayesian Sports Analytics R: Predictive Modeling, Betting, and Performance
        </a>
      </li>
      <li>
        <a href="https://rprogrammingbooks.com/product/bayesian-sports-betting-with-r/" rel="nofollow" target="_blank">
          Bayesian Sports Betting with R
        </a>
      </li>
    </ul>

    <p>
      Those kinds of projects often build on the same foundations covered here: priors, posterior updating, uncertainty-aware prediction, and iterative model improvement.
    </p>
  </div>

  <h2>Conclusion</h2>

  <p>
    Bayesian linear regression in R is one of the best entry points into Bayesian statistics because it combines familiar regression ideas with a much richer treatment of uncertainty.
    Instead of asking only for a coefficient estimate, you ask for a distribution of plausible values.
    Instead of pretending uncertainty is secondary, you put it at the center of the analysis.
  </p>

  <p>
    In this tutorial, we covered the full process:
  </p>

  <ul>
    <li>Building a dataset</li>
    <li>Understanding priors</li>
    <li>Fitting a model with <code>brms</code></li>
    <li>Inspecting posterior summaries</li>
    <li>Checking convergence and fit</li>
    <li>Generating predictions</li>
    <li>Using additional packages from the Bayesian R ecosystem</li>
  </ul>

  <p>
    Once you are comfortable with these steps, the next natural move is to explore Bayesian logistic regression, hierarchical models, and domain-specific forecasting systems.
  </p>

</body>
</html>

<p>The post <a href="https://rprogrammingbooks.com/bayesian-linear-regression-in-r-step-by-step-tutorial/" rel="nofollow" target="_blank">Bayesian Linear Regression in R: A Step-by-Step Tutorial</a> appeared first on <a href="https://rprogrammingbooks.com/" rel="nofollow" target="_blank">R Programming Books</a>.</p>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://rprogrammingbooks.com/bayesian-linear-regression-in-r-step-by-step-tutorial/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=bayesian-linear-regression-in-r-step-by-step-tutorial"> Blog - R Programming Books</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/bayesian-linear-regression-in-r-a-step-by-step-tutorial/">Bayesian Linear Regression in R: A Step-by-Step Tutorial</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">399842</post-id>	</item>
		<item>
		<title>Three Posit Platform Features Worth Knowing About</title>
		<link>https://www.r-bloggers.com/2026/03/three-posit-platform-features-worth-knowing-about/</link>
		
		<dc:creator><![CDATA[The Jumping Rivers Blog]]></dc:creator>
		<pubDate>Fri, 13 Mar 2026 23:59:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://www.jumpingrivers.com/blog/posit-platform-updates/</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; ">
<p>We recently ran a session on Posit platform updates, the kind of features that don’t always make it onto your radar but can make a real difference once you know they’re there.<br />
This post covers the three highlights: speeding up R packa...</p></div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/three-posit-platform-features-worth-knowing-about/">Three Posit Platform Features Worth Knowing About</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://www.jumpingrivers.com/blog/posit-platform-updates/"> The Jumping Rivers Blog</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>

<p>
<a href = "https://www.jumpingrivers.com/blog/posit-platform-updates/">
<img src="https://i1.wp.com/www.jumpingrivers.com/blog/posit-platform-updates/featured.png?w=400&#038;ssl=1" style="width:400px" class="image-center" style="display: block; margin: auto;" data-recalc-dims="1" />
</a>
</p>
<p>We recently ran a session on <a href="https://www.jumpingrivers.com/blog/data-processing-pandas-polars-webinar/" rel="nofollow" target="_blank">Posit platform updates</a>, the kind of features that don’t always make it onto your radar but can make a real difference once you know they’re there.</p>
<p>This post covers the three highlights: speeding up R package installation with Posit Package Manager, a new way to explore example apps on Connect, and Workbench Jobs for long-running tasks.</p>
<h2 id="r-package-installs-dont-have-to-take-26-minutes">R package installs don’t have to take 26 minutes</h2>
<p>If you’ve ever kicked off a Tidyverse install and gone to make a coffee (and come back to find it still running), this one’s for you. When you point R at a plain CRAN mirror on Linux, packages are installed from source: R downloads each source tarball and compiles everything from scratch. That takes time. A lot of it. In our test, a clean Tidyverse install on R 4.4 took 26 minutes.</p>
<p>The fix is to point R at a binary-supporting mirror, which is exactly what <a href="https://www.jumpingrivers.com/posit/managed-services/" rel="nofollow" target="_blank">Posit Package Manager</a> provides. With binaries, that same install dropped to under two minutes: no compilation, no hunting down system dependencies.</p>
<p>If you’re on R 4.5, it gets better. R 4.5 introduced parallel package downloads, which cuts that two-minute install down to around 40 seconds. Throw in parallel CPU usage for installation as well via the <code>Ncpus</code> argument, and you’re looking at 15 seconds for a full Tidyverse install in a clean environment.</p>
<p>There’s also a preview feature to keep an eye on: ManyLinux support in <a href="https://posit.co/products/enterprise/package-manager/" rel="nofollow" target="_blank">Package Manager</a>. The idea is to bundle more of the system-level dependencies into the package itself, which means less dependency management for sysadmins. Downloads are a bit larger, but the maintenance overhead is lower. If you want a deeper dive into PPM itself, we have a <a href="https://www.jumpingrivers.com/training/course/package-management-with-ppm/" rel="nofollow" target="_blank">Managing Packages with Posit Package Manager training course</a> that covers this in detail.</p>
<p><strong>The short version:</strong> use binaries + R 4.5 + parallel installs. You can go from half an hour to about 15 seconds.</p>
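<p>The recipe above, as a minimal sketch in R (the repository URL is Posit’s public Package Manager instance; swap in your own PPM URL if you run one):</p>
<pre># Use a binary-serving mirror, then install with parallel CPUs
options(repos = c(CRAN = &quot;https://packagemanager.posit.co/cran/latest&quot;))
install.packages(&quot;tidyverse&quot;, Ncpus = parallel::detectCores())</pre>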
<h2 id="connect-gallery-example-apps-without-the-setup-friction">Connect Gallery: example apps without the setup friction</h2>
<p>If you’ve used <a href="https://posit.co/products/enterprise/connect/" rel="nofollow" target="_blank">Posit Connect</a> for a while, you might remember the quick-start popup that appeared on first login — a set of example apps you could try out. That’s been replaced by Connect Gallery, which lives in the interface rather than popping up in front of you.</p>
<p>What’s changed isn’t just where it lives. Installing an example app is now one click. Previously you’d follow a set of instructions to get it running; now it just deploys.</p>
<p>Two examples worth highlighting from the gallery:</p>
<p><strong>Usage Metrics</strong> — shows you which content on your Connect server is actually being used, filtered by time period and user. It uses a visitor key, so the app shows each viewer only the content they have permission to see. Useful for admins wondering what’s getting traction and what isn’t.</p>
<p><strong>Command Center for Publishers</strong> — a dashboard built with Python that reimplements much of the Connect admin interface inside an app. You can rename deployed content, lock it, and manage it through the Connect API. Worth looking at both as a tool and as an example of how to build admin functionality on top of Connect.</p>
<p>If you’re new to Connect or want to get more from it, our <a href="https://www.jumpingrivers.com/training/course/r-posit-workbench-team-cloud/" rel="nofollow" target="_blank">Introduction to Posit Workbench training course</a> covers the full Posit environment including how Workbench and Connect work together.</p>
<h2 id="workbench-jobs-run-something-long-and-close-your-session">Workbench Jobs: run something long and close your session</h2>
<p>This one comes up as a question fairly often: if I start a background job in <a href="https://www.jumpingrivers.com/posit/managed-services/" rel="nofollow" target="_blank">Posit Workbench</a> and close my session, will it keep running?</p>
<p>The old answer was no. Background jobs were child processes of your session: close the session and the job went with it.</p>
<p>Workbench Jobs are different. They run independently of your session. You can start a job, close RStudio Pro or VS Code entirely, and the job keeps going. When you open a new session, you can still see it running, check its live output, and monitor resource usage.</p>
<p>This is handy for anything that takes longer than you want to babysit: data processing pipelines, model training runs, file exports. The job has access to your data sources and connections, and you can pick up wherever you left off.</p>
<p>There’s also an auditing option for Workbench Jobs. When enabled, the output gets a cryptographic signature, useful if you need to demonstrate not just that the job ran, but exactly what it produced.</p>
<h2 id="workbench-jobs-vs-scheduled-content-on-connect">Workbench Jobs vs scheduled content on Connect</h2>
<p>A quick note on when to use which. If you need to run something once from inside your current workflow and you want access to local files, data connections, and everything in your working environment, a Workbench Job makes sense. It’s more hands-on.</p>
<p>If you need to schedule something to run repeatedly, share the results with other people, or get an email when it’s done, that’s what Connect is for. The two tools complement each other rather than compete.</p>
<p>If any of this is relevant to your setup, whether you’re looking at speeding up your package environment, making better use of Connect, or running longer jobs in Workbench — <a href="https://www.jumpingrivers.com/posit/managed-services/" rel="nofollow" target="_blank">get in touch</a>. As a <a href="https://www.jumpingrivers.com/posit/license-resale/" rel="nofollow" target="_blank">certified Posit Partner</a>, we help teams get the most from their Posit investment, from infrastructure setup to long-term managed support.</p>
<hr>
<blockquote>
<h3 id="ai-in-production--45-june-2026-newcastle">AI in Production — 4–5 June 2026, Newcastle</h3>
<p>If you’re thinking about how AI fits into production data science environments, this is the conference for it. Two days of real-world talks and hands-on workshops from practitioners across engineering and ML, covering deployment, monitoring, scaling, and what actually works when AI leaves the prototype stage.</p>
<p><a href="https://ai-in-production.jumpingrivers.com/" rel="nofollow" target="_blank"><strong>Register now at ai-in-production.jumpingrivers.com</strong></a></p>
</blockquote>
<p>
For updates and revisions to this article, see the <a href = "https://www.jumpingrivers.com/blog/posit-platform-updates/">original post</a>
</p>
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://www.jumpingrivers.com/blog/posit-platform-updates/"> The Jumping Rivers Blog</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/three-posit-platform-features-worth-knowing-about/">Three Posit Platform Features Worth Knowing About</a>]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">399825</post-id>	</item>
		<item>
		<title>Current views on generative AI</title>
		<link>https://www.r-bloggers.com/2026/03/current-views-on-generative-ai/</link>
		
		<dc:creator><![CDATA[Fran&#231;oisn - f@briatte.org]]></dc:creator>
		<pubDate>Fri, 13 Mar 2026 23:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://f.briatte.org/r/current-views-on-generative-ai</guid>

					<description><![CDATA[<p>This post contains my current views on generative artificial intelligence, and Large Language Models in particular. The context is mostly academia, which is about research and teaching.</p>
<p>Personal context</p>
<p>Generative AI is slowly creeping into my profes...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/current-views-on-generative-ai/">Current views on generative AI</a>]]></description>
<content:encoded><![CDATA[

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://f.briatte.org/r/current-views-on-generative-ai"> R / Notes</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
</div>
<p>This post contains my current views on <a href="https://en.wikipedia.org/wiki/Generative_artificial_intelligence" rel="nofollow" target="_blank">generative artificial intelligence</a>, and <a href="https://en.wikipedia.org/wiki/Large_language_model" rel="nofollow" target="_blank">Large Language Models</a> in particular. The context is mostly academia, which is about research and teaching.</p>

<h1>Personal context</h1>

<p>Generative AI is slowly creeping into my professional workflow, not because I am using it myself (I don&#8217;t, although I guess that I will, at some point), but because everyone around me is.</p>

<p>My students use <a href="https://en.wikipedia.org/wiki/ChatGPT" rel="nofollow" target="_blank">ChatGPT</a> and other tools like, I believe, <a href="https://en.wikipedia.org/wiki/NotebookLM" rel="nofollow" target="_blank">NotebookLM</a> and <a href="https://en.wikipedia.org/wiki/Comet_(browser)" rel="nofollow" target="_blank">Perplexity Comet</a>. My RSS news feed (that&#8217;s how old I am) recently had an <a href="https://www.newyorker.com/magazine/2026/02/16/what-is-claude-anthropic-doesnt-know-either" rel="nofollow" target="_blank">article on Claude</a>, and I use Google applications, so I keep getting passive-aggressively asked to use <a href="https://en.wikipedia.org/wiki/Gemini_(language_model)" rel="nofollow" target="_blank">Gemini</a>, which I might do one day via <a href="https://scholar.googleblog.com/2025/11/scholar-labs-ai-powered-scholar-search.html" rel="nofollow" target="_blank">Scholar Labs</a>.</p>

<p>My workplace, which is a university, has taken a very basic stance on generative AI: unless stated otherwise, students are to follow <a href="https://info.lse.ac.uk/staff/divisions/Eden-Centre/Artificial-Intelligence-Education-and-Assessment/School-position-on-generative-AI" rel="nofollow" target="_blank">LSE Position 1</a> (no use of generative AI in graded work), which I suppose goes both ways (no use of generative AI in grading, either).</p>

<p>I do not know of any equivalent position on generative AI in research. It seems like everyone wants to discuss the topic and play around with whatever is available for free online, but no one wants to make hard decisions about it yet, possibly due to <a href="https://artificialintelligenceact.eu/" rel="nofollow" target="_blank">upcoming EU-level regulations</a>.</p>

<h1>Risks for teaching and learning</h1>

<p>From a teaching perspective, generative AI is only useful to me if it helps students go through the following process:</p>

<ol>
<li>Learn</li>
<li>Draft</li>
<li>Revise</li>
<li>Submit</li>
<li>Defend</li>
</ol>

<p>Part of what I teach is code, and code is the topic of this blog. As it happens, generative AI is already very good with code, and I am confident that it can be put to good use to go through Steps 1&#8211;3 of the process above.</p>

<p>There are, however, at least four reasons why I am currently taking ‘LSE Position 1’ on using generative AI in graded work that relies on code:</p>

<ol>
<li>Many students are using AI to <a href="https://www.chronicle.com/article/is-ai-enhancing-education-or-replacing-it" rel="nofollow" target="_blank">bypass the learning process</a>, rather than enhance it. This creates <a href="https://f.briatte.org/r/ai-generated-code-security-risks" rel="nofollow" target="_blank">security risks</a>, and violates <strong>academic ethics</strong> in the same way that hiring an external party would. This comes on top of other breaches of student ethics, such as plagiarism.</li>
<li>The two issues mentioned in the previous point <strong>cannot be defended against</strong> at my level, at least not with my current resources. I can spot security risks, but I cannot reliably detect AI-generated code, which is neither watermarked nor detectable through anti-plagiarism tools.</li>
<li>The software that I use in class is mostly open-source, and <strong>reproducibility</strong> is part of the core principles that I teach in class. As far as I understand, and unless proven otherwise, the kind of generative AI technology used by my students does not enforce these principles.</li>
<li>To make things worse, most generative AI also violates <strong>intellectual property</strong>, rather than reconfigure it around the ‘<a href="https://en.wikipedia.org/wiki/Copyleft" rel="nofollow" target="_blank">copyleft</a>’ and ‘<a href="https://creativecommons.org/" rel="nofollow" target="_blank">creative commons</a>’ principles that many of us have spent years defending and advocating within fields such as academic publishing.</li>
</ol>

<p>I have not been exposed to any argument that makes any attempt at solving the ethical, logistical, moral and eventually legal issues that I have outlined above. Until I do, I will treat generative AI as a form of <a href="https://en.wikipedia.org/wiki/Doping_in_sport" rel="nofollow" target="_blank">doping</a>, and will keep banning it.</p>

<p>The analogy above with doping is not an innocent one. There is, in my view, a very real rhetorical arc that goes from generative AI to the <a href="https://en.wikipedia.org/wiki/Enhanced_Games" rel="nofollow" target="_blank">Enhanced Games</a>. Higher education does not approve of students taking <a href="https://en.wikipedia.org/wiki/Adderall" rel="nofollow" target="_blank">Adderall</a>, and neither do I.</p>

<h1>Risks for scientific research</h1>

<p>From a research perspective, generative AI is only useful to me if it helps me go through the following process:</p>

<ol>
<li>Compile existing <strong>evidence</strong></li>
<li>Collect meaningful <strong>data</strong></li>
<li>Produce meaningful <strong>measures</strong></li>
<li>Formulate correct <strong>interpretations</strong></li>
<li>Enhance existing <strong>knowledge</strong></li>
</ol>

<p>There is no doubt that generative AI can help with every step above, especially perhaps at the level of data collection and, in the case of ‘big data’ or whatever people call it today, classification. I am also very interested in what it can contribute with regard to compiling scientific studies, in the same way that it is already helping with <a href="https://spectrum.ieee.org/ai-proof-verification" rel="nofollow" target="_blank">mathematical problems</a>.</p>

<p>The risks that I have heard about so far when it comes to generative AI and social science research (which is what I do) are the following:</p>

<ol>
<li>Generative AI can <strong>poison the evidence base</strong> (<a href="https://doi.org/10.1073/pnas.2314021121" rel="nofollow" target="_blank">Bail 2024</a>) through the mass production of low-quality academic output, or by compromising data such as online surveys (<a href="https://doi.org/10.1073/pnas.2518075122" rel="nofollow" target="_blank">Westwood 2025</a>, <a href="https://doi.org/10.1073/pnas.2537420123" rel="nofollow" target="_blank">Westwood and Frederick 2026</a>). This is already happening.</li>
<li>Generative AI does not yet produce <strong>reliable data annotations</strong> for the kind of data that I am interested in (<a href="https://www.eddieyang.net/research/llm_annotation.pdf" rel="nofollow" target="_blank">Yang <em>et al.</em> 2025</a>), and even if its coding reliability improves, it will require additional effort to mitigate related issues (<a href="https://doi.org/10.48550/arXiv.2509.08825" rel="nofollow" target="_blank">Baumann <em>et al.</em> 2025</a>).</li>
<li>Relatedly, generative AI cannot improve organically if it maintains its <strong>human bias</strong> towards evidence produced in the Global North (<a href="https://doi.org/10.31219/osf.io/w8q3y_v1" rel="nofollow" target="_blank">Ramirez-Ruiz and Senninger 2025</a>), mostly by ‘WEIRD’ individuals (<a href="https://doi.org/10.31234/osf.io/5b26t" rel="nofollow" target="_blank">Atari <em>et al.</em> 2023</a>). This will be hard and slow to solve.</li>
<li>Last but not least, generative AI will be used to <strong>erode scientific authority</strong> at the profit of <a href="https://en.wikipedia.org/wiki/Merchants_of_Doubt" rel="nofollow" target="_blank">those</a> who are interested in attacking the contribution that scientific (and higher education) institutions make to society. This is of course far from a trivial issue.</li>
</ol>

<p>The issues listed are all real, hard to solve, and controversial insofar as some people have a vested interest in seeing them <em>not</em> addressed, at least not in the short term.</p>

<p>None of these issues will stop me from installing and trying out <a href="https://tidyverse.org/blog/2025/11/ellmer-0-4-0/" rel="nofollow" target="_blank"><code>ellmer</code></a> one day. However, I do expect this to happen within a scientific environment that will have acknowledged each issue in one way or another, and formulated guidelines to address them.</p>

<p>Are we there yet?</p>

<hr />

<p>This post was inspired by the <a href="https://www.bydamo.la/p/ai-manifesto" rel="nofollow" target="_blank">/ai ‘manifesto’</a>, which I discovered <a href="https://www.andrewheiss.com/ai/" rel="nofollow" target="_blank">thanks to Andrew Heiss</a>. I obtained some of the cited references through Jessica Hullman&#8217;s <a href="https://statmodeling.stat.columbia.edu/2026/03/10/new-course-on-generative-ai-for-behavioral-science/" rel="nofollow" target="_blank">‘New course on generative AI for behavioral science’</a> blog post.</p>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://f.briatte.org/r/current-views-on-generative-ai"> R / Notes</a></strong>.</div>
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/current-views-on-generative-ai/">Current views on generative AI</a>]]></content:encoded>
					
		

		<post-id xmlns="com-wordpress:feed-additions:1">399844</post-id>	</item>
		<item>
		<title>Does every finite string of numbers appear within π ?</title>
		<link>https://www.r-bloggers.com/2026/03/does-every-finite-string-of-numbers-appear-within-%cf%80/</link>
		
		<dc:creator><![CDATA[Jerry Tuttle]]></dc:creator>
		<pubDate>Fri, 13 Mar 2026 15:05:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">http://www.r-bloggers.com/?guid=b4731aa991998963c61574f6bfc13b22</guid>

					<description><![CDATA[<p>     <br />
  In honor of Pi Day (March 14), I offer the following:<br />
Does every finite string of numbers like your social security number eventually appear somewhere within the decimal expansion of π ? </p>
<p>    &#038;n...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/does-every-finite-string-of-numbers-appear-within-%cf%80/">Does every finite string of numbers appear within π ?</a>]]></description>
<content:encoded><![CDATA[

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://onlinecollegemathteacher.blogspot.com/2026/03/does-every-finite-string-of-numbers.html"> Online College Math Teacher</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
</div>
<font size = 3>
  
     
  In honor of Pi Day (March 14), I offer the following:
Does every finite string of numbers like your social security number eventually appear somewhere within the decimal expansion of π ? <p>

     
  So said Harold Finch, the computer genius on TV show <i>&#8220;Person of Interest&#8221;</i> (&#8220;You are being watched &#8230; &#8220;) in a 2013 episode where he poses as a substitute high school math teacher. <p>
  
<center>
<iframe width="450" src="https://www.youtube.com/embed/yGmYCfWyVAM" frameborder="0" allowfullscreen></iframe>
</center>
<br><br>

     
&#8220;Pi&#8230; keeps on going, forever, without ever repeating. Which means that contained within this string of decimals, is every single other number. Your birthdate, combination to your locker, your social security number, it&#8217;s all in there, somewhere. And if you convert these decimals into letters, you would have every word that ever existed&#8230; all of the world&#8217;s infinite possibilities rest within this one simple circle.&#8221; <p>
  
  
     
  We know π is <i>irrational</i>;  it cannot be expressed as the quotient of two integers.  
<a href = "https://www.youtube.com/watch?v=PgKmstECld0">Video</a> <p>

  
     
An irrational number has an infinite, non-repeating decimal expansion. <p>
  
     
A number is defined as <i>normal</i> if any finite string of numbers like your social security number will eventually appear somewhere in its decimal expansion.  (This is a simpler version of a more complicated mathematical
<a href="https://en.wikipedia.org/wiki/Normal_number" rel="nofollow" target="_blank">
definition</a>, but it is sufficient for this discussion.)
Mr. Finch is claiming that π is a normal number.  However, mathematicians have not yet proved this. (I once attempted to add this last statement, and the fact that Finch was wrong, to Wikipedia&#8217;s Person of Interest page, but the Wikipedia editors declined it.) <p>
  
  
     
Although π is defined geometrically as the ratio of a circle&#8217;s circumference to its diameter, mathematicians calculate long sequences of π using infinite series, rather than physical measurements.  Various websites such as <a href = "https://www.angio.net/pi/digits.html">angio</a> provide files containing up to 1 million digits of π. <p>
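As a small illustration of the series-based approach (not the method any record-setting computation actually uses), Machin-style identities underpin many such calculations; in base R, Machin&#8217;s 1706 identity already recovers π to the limit of double precision:

```r
# Machin's identity: pi/4 = 4*atan(1/5) - atan(1/239)
machin_pi <- 4 * (4 * atan(1/5) - atan(1/239))

machin_pi            # 3.141593 (printed at default precision)
abs(machin_pi - pi)  # on the order of 1e-16: double-precision agreement
```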
  
     
I wrote some R code that would take a number (or a word, converted to a number)
  and search for where that number could be found within π.   I decided the first 100,000 digits of π were enough for me. Of course, the fact that a number cannot be found within the first 100,000 digits does not preclude finding it in the next 100,000 digits, or the next 1 million digits.  And so on.  Infinitely many digits goes on for a long time. <p>

  
     
I tested these examples:<p>
<ul>
<li>&#8220;Eggs&#8221;.  e = 5, g = 7, g = 7, s = 19.   57719 appears starting at the 6026th digit of π.  </li>
<li>&#8220;0123&#8221; appears starting at the 27846th digit of π.  </li>
<li>&#8220;01234567&#8221; was not found in the first 100k digits. But according to <a href="https://www.angio.net/pi/" rel="nofollow" target="_blank">The Pi Search Page</a> it can be found starting at the 112,099,767th digit.  </li>
</ul>
I did not try &#8220;0123456789&#8221;, but another source says it was not found in the first 2 billion digits.<p>
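Those &#8220;not found&#8221; results are less surprising than they look. If π&#8217;s digits behaved like independent uniform random digits (roughly what normality would imply, and which remains unproven), the chance that a fixed k-digit string appears somewhere in the first N digits is approximately <code>1 - (1 - 10^-k)^(N - k + 1)</code>, ignoring the small correction for overlapping matches. A quick sketch:

```r
# Approximate chance that a fixed k-digit string appears in the first N
# digits of an i.i.d.-uniform digit stream (overlap effects ignored, so
# this is only an approximation -- and pi's normality is unproven).
p_appears <- function(k, N) 1 - (1 - 10^-k)^(N - k + 1)

p_appears(8, 1e5)    # ~0.001: missing "01234567" in 100k digits is expected
p_appears(8, 112e6)  # ~0.67: a hit around the 112-millionth digit is plausible
p_appears(10, 2e9)   # ~0.18: missing "0123456789" in 2 billion digits is unsurprising
```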
  
  
     
Here is my R code.  Happy Pi Day! <p>
  
  
<pre>
# Symbol for pi is \u03c0
library(stringr)
library(readr)

find_in_pi &lt;- function(input_value) {
  # 1. Load the Pi data
  data.raw &lt;- readr::read_file(&quot;https://assets.angio.net/100000.txt&quot;)
  data.vec &lt;- unlist(str_split(data.raw, pattern = &quot;&quot;))
  data.vec &lt;- data.vec[-c(1, 2)] # remove '3.'
  pi_string &lt;- paste(as.character(data.vec), collapse = &quot;&quot;)
  
  # 2. Process Input: Determine if it's a word or a numeric string
  # Convert to character and lowercase for uniformity
  clean_input &lt;- tolower(as.character(input_value))
  
  if (grepl(&quot;[a-z]&quot;, clean_input)) {
    # It's a word: Convert letters to numbers (1-26)
    word_only &lt;- gsub(&quot;[^a-z]&quot;, &quot;&quot;, clean_input)
    z &lt;- unlist(str_split(word_only, &quot;&quot;))
    q &lt;- match(z, letters)
    search_target &lt;- paste(q, collapse = &quot;&quot;)
  } else {
    # It's already a number: Just strip non-digits (like decimals or spaces)
    search_target &lt;- gsub(&quot;[^0-9]&quot;, &quot;&quot;, clean_input)
  }
  
  # 3. Search Pi
  pos &lt;- regexpr(search_target, pi_string)
  
  # 4. Handle Results
  if (pos[1] == -1) {
    return(paste0(&quot;The sequence '&quot;, search_target, &quot;' was not found in the first 100k digits.&quot;))
  } else {
    start &lt;- pos[1]
    end &lt;- start + nchar(search_target) - 1
    return(list(
      found = search_target,
      start_index = start,
      end_index = end,
      index_sequence = seq(from = start, to = end)
    ))
  }
}

# Search for a word
find_in_pi(&quot;Eggs&quot;)

# Search for a number;  enclose in quotes if a leading zero
find_in_pi(&quot;01234&quot;)
find_in_pi(&quot;01234567&quot;)

</pre><p>
End
</font>
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://onlinecollegemathteacher.blogspot.com/2026/03/does-every-finite-string-of-numbers.html"> Online College Math Teacher</a></strong>.</div>
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/does-every-finite-string-of-numbers-appear-within-%cf%80/">Does every finite string of numbers appear within π ?</a>]]></content:encoded>
					
		

		<post-id xmlns="com-wordpress:feed-additions:1">399836</post-id>	</item>
		<item>
		<title>Is Your Dashboard User Friendly?</title>
		<link>https://www.r-bloggers.com/2026/03/is-your-dashboard-user-friendly/</link>
		
		<dc:creator><![CDATA[The Jumping Rivers Blog]]></dc:creator>
		<pubDate>Thu, 12 Mar 2026 23:59:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://www.jumpingrivers.com/blog/dashboard-ux/</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; ">
<p>For a while we, at Jumping Rivers, have offered a Dashboard Health Check (DHC) largely focused around backend features and other facets the end-user doesn’t see: things like version control, documentation and deployment. However, the DHC also included a few checks related to user experience and accessibility. While we’...</p></div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/is-your-dashboard-user-friendly/">Is Your Dashboard User Friendly?</a>]]></description>
<content:encoded><![CDATA[

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://www.jumpingrivers.com/blog/dashboard-ux/"> The Jumping Rivers Blog</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
</div>

<p>
<a href = "https://www.jumpingrivers.com/blog/dashboard-ux/">
<img src="https://i0.wp.com/www.jumpingrivers.com/blog/dashboard-ux/featured.jpg?w=400&#038;ssl=1" style="width:400px" class="image-center" style="display: block; margin: auto;" data-recalc-dims="1" />
</a>
</p>
<style>
figure {
background-color: var(--cream);
padding: 1rem;
font-size: 0.8rem;
display: flex;
flex-direction: column;
row-gap: 1rem;
margin-bottom: 1rem;
width: 625px;
max-width: 100%;
border-radius: 0.5rem;
margin-left: auto;
margin-right: auto;
}
figure img {
width: 100%;
}
figcaption {
border-top: 1px solid var(--off-white);
padding-top: 0.25rem;
}
</style>
<p>For a while we, at Jumping Rivers, have offered a Dashboard Health Check (DHC) largely focused around backend features and other facets the end-user doesn’t see: things like version control, documentation and deployment. However, the DHC also included a few checks related to user experience and accessibility. While we’ve always believed these are useful additions, we would like to offer more in-depth guidance to our clients on how they can make their applications more user-friendly. To facilitate this, we are now introducing the Frontend Dashboard Health Check (FDHC).</p>
<h2 id="what-could-an-fdhc-help-me-with">What could an FDHC help me with?</h2>
<p>So what kind of advice can you get from us from a Frontend Dashboard Healthcheck, you might wonder. Here are just a few of the possibilities:</p>
<ul>
<li>Tools like Shiny and Dash make it relatively quick and easy to build data dashboards. These can often start out as a fixed single page of data and, over time, morph into something much more complex and interactive with multiple views. Such applications can be incredibly powerful, but with great power comes great <del>responsibility</del> complexity. For a dashboard to be successful, users need to understand how to use it effectively to answer their questions. This can mean discovering and/or learning many features, from basic navigation between views to interrogating the data within using techniques like search, filter, sort, partition, drill-down and summarise. We can identify places where users may get stuck or confused, and suggest means of amelioration.</li>
<li>A successful, production-ready dashboard also needs to be robust. At minimum that means resilient to unexpected user input and to its own (perhaps temporary) inability to provide the output it’s supposed to (if a server is down, for example). An app that just hangs when something goes wrong is going to confuse and frustrate users and can lead to wasted time and even loss of work. We can show you where your app may fall over so that you can take action to prevent it.</li>
<li>These days we consume pages from the world wide web using all manner of devices. Does your app work on 4k and 5k monitors? More importantly, at the other end of the scale, there is now usually an expectation that things should work on mobile and other touchscreen devices. We can show you at which dimensions your app layout may become difficult or impossible to use, and where users relying on specific input methods &#8211; e.g. mouse, touch, keyboard &#8211; may have difficulties.</li>
</ul>
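<p>As a hedged illustration of the robustness point above (a sketch only, not part of the FDHC deliverables): in a Shiny app, <code>req()</code>, <code>validate()</code> and <code>tryCatch()</code> can keep an app responsive when an input is missing or a backend call fails. Note that <code>fetch_data()</code> below is a hypothetical helper.</p>

```r
library(shiny)

server <- function(input, output, session) {
  output$summary <- renderPrint({
    # Halt quietly until the user has actually selected a region
    req(input$region)
    # fetch_data() is a hypothetical backend call; convert a hard error
    # into NULL so the app does not simply hang or crash
    data <- tryCatch(fetch_data(input$region), error = function(e) NULL)
    # Show a friendly message instead of a grey, frozen output
    validate(need(!is.null(data), "Data service unavailable; please try again later."))
    summary(data)
  })
}
```

<p>The same pattern applies to any output that depends on a resource outside the app’s control.</p>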
<h2 id="what-deliverables-would-i-get-from-an-fdhc">What deliverables would I get from an FDHC?</h2>
<p>The principal deliverable from an FDHC is a detailed spreadsheet indicating what issues we’ve found and where they occur (or how to reproduce them). Wherever practical we will also include annotated screenshots (or occasionally recordings) giving a visual outline of a problem (see below). We will also strive to suggest possible remedies.</p>
<figure>
<picture id="page-spill" aria-labelledby="page-spill-label">
<source srcset_temp="assets/page-spill.webp 1x, assets/page-spill@2x.webp 2x" type="image/webp">
<img src="https://www.jumpingrivers.com/blog/dashboard-ux/assets/page-spill.%7B%7B%20$fallback%20%7D%7D" alt="Annotated screenshot showing page content spilling outside the layout at certain widths.">
</picture>
<figcaption id="page-spill-label">An example of annotated screenshots highlighting an issue with the page layout for certain width-ranges for an old version of our own Litmus Dashboard application.</figcaption>
</figure>
<figure>
<picture id="permanent-labels" aria-labelledby="permanent-labels-label">
<source srcset_temp="assets/permanent-labels.webp 1x, assets/permanent-labels@2x.webp 2x" type="image/webp">
<img src="https://www.jumpingrivers.com/blog/dashboard-ux/assets/permanent-labels.%7B%7B%20$fallback%20%7D%7D" alt="Annotated screenshot highlighting missing permanent labels on inputs.">
</picture>
<figcaption id="permanent-labels-label">An example of an annotated screenshot highlighting an issue with input labelling for an old version of our own Litmus Dashboard application.</figcaption>
</figure>
<h2 id="what-about-the-old-dhc">What about the old DHC?</h2>
<p>We will continue to offer a separate, report-based health check for data dashboards. This “Backend Dashboard Health Check” (BDHC) will cover things like version control, documentation and deployment, as before. We are, of course, more than happy to run a BDHC and an FDHC on the same application.</p>
<h2 id="how-do-i-find-out-more">How do I find out more?</h2>
<p>Please get in touch via <a href="https://www.jumpingrivers.com/contact/" rel="nofollow" target="_blank">this contact form</a> or drop us an email at <a href="mailto:hello@jumpingrivers.com" rel="nofollow" target="_blank">hello@jumpingrivers.com</a>.</p>
<p>
For updates and revisions to this article, see the <a href = "https://www.jumpingrivers.com/blog/dashboard-ux/">original post</a>
</p>
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://www.jumpingrivers.com/blog/dashboard-ux/"> The Jumping Rivers Blog</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/is-your-dashboard-user-friendly/">Is Your Dashboard User Friendly?</a>]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">399795</post-id>	</item>
		<item>
		<title>deltatest 0.2.0: Statistical Hypothesis Testing Using the Delta Method for Online A/B Testing</title>
		<link>https://www.r-bloggers.com/2026/03/deltatest-0-2-0-statistical-hypothesis-testing-using-the-delta-method-for-online-a-b-testing/</link>
		
		<dc:creator><![CDATA[Koji Makiyama]]></dc:creator>
		<pubDate>Thu, 12 Mar 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://hoxo-m.github.io/blog/posts/deltatest-0-2-0/</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; ">
<p>I’m happy to share a new release of deltatest.<br />
This release includes two main changes:</p>
<p>a new tidy() method for deltatest objects<br />
a fix for p-value calculation in one-sided tests</p>
<p>Before looking at what changed in this release, let’s briefly rev...</p></div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/deltatest-0-2-0-statistical-hypothesis-testing-using-the-delta-method-for-online-a-b-testing/">deltatest 0.2.0: Statistical Hypothesis Testing Using the Delta Method for Online A/B Testing</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://hoxo-m.github.io/blog/posts/deltatest-0-2-0/"> HOXO-M Blog</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
 





<p>I’m happy to share a new release of <strong>deltatest</strong>.</p>
<p>This release includes two main changes:</p>
<ul>
<li>a new <code>tidy()</code> method for <code>deltatest</code> objects</li>
<li>a fix for p-value calculation in one-sided tests</li>
</ul>
<p>Before looking at what changed in this release, let’s briefly revisit the purpose of <strong>deltatest</strong>.</p>
<section id="what-deltatest-is-for" class="level2">
<h2 class="anchored" data-anchor-id="what-deltatest-is-for">What deltatest is for</h2>
<p>The <strong>deltatest</strong> package provides <code>deltatest()</code>, a function for performing two-sample Z-tests using the delta method.</p>
<p>It is designed for common settings in online A/B testing where:</p>
<ul>
<li>randomization is done at the user level, but</li>
<li>the metric is measured at a finer unit such as page views or sessions.</li>
</ul>
<p>In such settings, naive tests (standard Z-tests, chi-squared tests, or tests for differences in proportions, for example) can underestimate uncertainty because observations within a user are not independent. <code>deltatest()</code> addresses this issue by using a delta-method-based variance estimator.</p>
<pre># Install the released version from CRAN
install.packages(&quot;deltatest&quot;)

# Load packages
library(dplyr)
library(deltatest)

# Generate dummy data
data &lt;- deltatest::generate_dummy_data(2000) |&gt;
  mutate(group = if_else(group == 0, &quot;control&quot;, &quot;treatment&quot;)) |&gt;
  group_by(user_id, group) |&gt;
  summarise(clicks = sum(metric), pageviews = n(), .groups = &quot;drop&quot;)

# Run a test
deltatest(data, clicks / pageviews, by = group)</pre>
<p>Typical output:</p>
<pre>#&gt; Two Sample Z-test Using the Delta Method
#&gt; 
#&gt; data:  clicks/pageviews by group
#&gt; Z = 0.31437, p-value = 0.7532
#&gt; alternative hypothesis: true difference in means between control and treatment is not equal to 0
#&gt; 95 percent confidence interval:
#&gt;  -0.01410593  0.01949536
#&gt; sample estimates:
#&gt;   mean in control mean in treatment        difference
#&gt;       0.245959325       0.248654038       0.002694713</pre>
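<p>For intuition, the delta-method variance of a ratio-of-means metric has a well-known closed form. The sketch below is a hedged illustration of that textbook formula (with per-user totals <code>y</code> for clicks and <code>x</code> for pageviews), not a copy of the package’s internals:</p>

```r
# Delta-method variance of mean(y) / mean(x) over n users: a hedged
# sketch of the standard formula, not deltatest's internal code.
delta_var_ratio <- function(y, x) {
  n <- length(y)
  mx <- mean(x)
  r <- mean(y) / mx  # the ratio metric itself
  # The covariance term captures the within-user dependence a naive test ignores
  (var(y) - 2 * r * cov(x, y) + r^2 * var(x)) / (n * mx^2)
}
```

<p>The Z statistic for an A/B comparison then divides the difference in ratios by the square root of the summed group variances.</p>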
</section>
<section id="whats-new-in-0.2.0" class="level2">
<h2 class="anchored" data-anchor-id="whats-new-in-0.2.0">What’s new in 0.2.0</h2>
<section id="tidy-support-for-deltatest-objects" class="level3">
<h3 class="anchored" data-anchor-id="tidy-support-for-deltatest-objects"><code>tidy()</code> support for <code>deltatest</code> objects</h3>
<p>With this release, <code>deltatest()</code> results can now be converted directly into a tidy tibble with <code>broom::tidy()</code>.</p>
<p><code>deltatest()</code> returns an <code>htest</code>-class object, which is convenient for printing and interactive use. But in a tidyverse workflow, it is often much easier to work with results in a tidy tibble format. This makes it easier to combine results across many experiments or metrics, and to visualize patterns in estimates, confidence intervals, or p-values with tools like <strong>ggplot2</strong>.</p>
<p>First, here is a simple example of converting the result to a tidy format:</p>
<div class="cell">
<pre>library(dplyr)
library(deltatest)
library(broom)

data &lt;- deltatest::generate_dummy_data(2000) |&gt;
  mutate(group = if_else(group == 0, &quot;control&quot;, &quot;treatment&quot;)) |&gt;
  group_by(user_id, group) |&gt;
  summarise(clicks = sum(metric), pageviews = n(), .groups = &quot;drop&quot;)

result &lt;- deltatest(data, clicks / pageviews, by = group)

tidy(result)
#&gt; # A tibble: 1 × 9
#&gt;   estimate mean_ctrl mean_treat statistic p.value conf.low conf.high method     
#&gt;      &lt;dbl&gt;     &lt;dbl&gt;      &lt;dbl&gt;     &lt;dbl&gt;   &lt;dbl&gt;    &lt;dbl&gt;     &lt;dbl&gt; &lt;chr&gt;      
#&gt; 1  0.00269     0.246      0.249     0.314   0.753  -0.0141    0.0195 Two Sample…
#&gt; # &#x2139; 1 more variable: alternative &lt;chr&gt;</pre>
</div>
<p>Next, here is an example of using the tidy results to compare multiple experiments in a plot:</p>
<div class="cell">
<pre>library(ggplot2)

data2 &lt;- deltatest::generate_dummy_data(2000, xi = 0.05) |&gt;
  mutate(group = if_else(group == 0, &quot;control&quot;, &quot;treatment&quot;)) |&gt;
  group_by(user_id, group) |&gt;
  summarise(clicks = sum(metric), pageviews = n(), .groups = &quot;drop&quot;)

result2 &lt;- deltatest(data2, clicks / pageviews, by = group)

result_tidy1 &lt;- tidy(result)  |&gt; mutate(experiment_id = &quot;test01&quot;)
result_tidy2 &lt;- tidy(result2) |&gt; mutate(experiment_id = &quot;test02&quot;)

result_tidy &lt;- bind_rows(result_tidy1, result_tidy2)

ggplot(result_tidy, aes(experiment_id, estimate)) +
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
  geom_hline(yintercept = 0, color = &quot;red&quot;) +
  xlab(NULL) + ylab(&quot;Estimated CTR difference&quot;) +
  ggtitle(&quot;Treatment effects by experiment&quot;)</pre>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://i2.wp.com/hoxo-m.github.io/blog/posts/deltatest-0-2-0/index_files/figure-html/tidy-example2-1.png?w=384&#038;ssl=1" class="img-fluid figure-img"  data-recalc-dims="1"></p>
</figure>
</div>
</div>
</div>
</section>
<section id="fix-for-one-sided-p-value-calculation" class="level3">
<h3 class="anchored" data-anchor-id="fix-for-one-sided-p-value-calculation">Fix for one-sided p-value calculation</h3>
<p>This release also fixes a bug in the p-value calculation for one-sided tests. In the previous version, p-values for one-sided tests could be incorrectly calculated using the two-sided formula. That behavior has now been fixed.</p>
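<p>For the curious, the distinction the fix addresses can be sketched in a few lines. This is an illustration of standard Z-test p-values, not the package’s actual code:</p>

```r
# One- vs two-sided p-values from the same Z statistic. The bug meant
# one-sided tests could effectively use the two-sided formula.
p_two_sided <- function(z) 2 * pnorm(-abs(z))
p_greater <- function(z) pnorm(z, lower.tail = FALSE)  # H1: difference > 0
p_less <- function(z) pnorm(z)                         # H1: difference < 0
```

<p>For a positive Z, the correct “greater” p-value is half the two-sided one, not equal to it.</p>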
<p>I would like to thank Kazuyuki Sano for reporting this issue and contributing to the fix.</p>
</section>
</section>
<section id="final-thoughts" class="level2">
<h2 class="anchored" data-anchor-id="final-thoughts">Final thoughts</h2>
<p>I’m glad to keep improving <strong>deltatest</strong> little by little. If you use R for online A/B experiments, I hope it is useful to you.</p>
<p>For more details, see:</p>
<ul>
<li>Package website: <a href="https://hoxo-m.github.io/deltatest/" rel="nofollow" target="_blank">https://hoxo-m.github.io/deltatest/</a></li>
<li>GitHub repository: <a href="https://github.com/hoxo-m/deltatest" rel="nofollow" target="_blank">https://github.com/hoxo-m/deltatest</a></li>
</ul>


</section>

 
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://hoxo-m.github.io/blog/posts/deltatest-0-2-0/"> HOXO-M Blog</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/deltatest-0-2-0-statistical-hypothesis-testing-using-the-delta-method-for-online-a-b-testing/">deltatest 0.2.0: Statistical Hypothesis Testing Using the Delta Method for Online A/B Testing</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">399801</post-id>	</item>
		<item>
		<title>MBBEFDLite has its mature release</title>
		<link>https://www.r-bloggers.com/2026/03/mbbefdlite-has-its-mature-release/</link>
		
		<dc:creator><![CDATA[Avi]]></dc:creator>
		<pubDate>Wed, 11 Mar 2026 00:49:31 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://www.avrahamadler.com/?p=921</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> The MBBEFD Distribution After two years, the MBBEFDLite package for R is finally mature! As a brand-new actuary, in my first job, I had the privilege of working with Dr. Stefan Bernegger at Swiss Reinsurance. Albeit I did not have much to do with him—I was a one-exam, no ...</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/mbbefdlite-has-its-mature-release/">MBBEFDLite has its mature release</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://www.avrahamadler.com/2026/03/10/mbbefdlite-v1/"> R Archives | Strange Attractors</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<div id="attachment_923" style="width: 653px" class="wp-caption alignleft"><img decoding="async" aria-describedby="caption-attachment-923" loading="lazy" class="wp-image-923 size-full" src="https://i1.wp.com/www.avrahamadler.com/wp-content/uploads/2026/03/MBBEFD2.png?w=450&#038;ssl=1" alt="Definition of probability density function for the MBBEFD family of curves." srcset_temp="https://i1.wp.com/www.avrahamadler.com/wp-content/uploads/2026/03/MBBEFD2.png?w=450&#038;ssl=1 643w, https://www.avrahamadler.com/wp-content/uploads/2026/03/MBBEFD2-300x97.png 300w" sizes="auto, (max-width: 643px) 100vw, 643px" data-recalc-dims="1" /><p id="caption-attachment-923" class="wp-caption-text">PDF of the MBBEFD distribution</p></div>
<h3>The MBBEFD Distribution</h3>
<p>After two years, the <a href="https://cran.r-project.org/package=MBBEFDLite" rel="nofollow" target="_blank">MBBEFDLite</a> package for R is finally mature! As a brand-new actuary, in my first job, I had the privilege of working with Dr. Stefan Bernegger at Swiss Reinsurance. Although I did not have much to do with him—I was a one-exam, no-experience new actuary and he held a PhD in nuclear physics and was already a world-renowned actuary—Dr. Bernegger treated me as the scholar and gentleman he is and made time to explain what we were doing and why. One of his best-known contributions to actuarial science is <a href="https://www.cambridge.org/core/journals/astin-bulletin-journal-of-the-iaa/article/swiss-re-exposure-curves-and-the-mbbefd-distribution-class1/0360BFFA7640908DC177687523164485" rel="nofollow" target="_blank">his introduction in 1997</a> of the Maxwell-Boltzmann Bose-Einstein Fermi-Dirac family of curves to (re)insurance exposure rating. The curves, thankfully known by their acronym MBBEFD, are smooth two-parameter curves that range from 0 to 1, or no-loss to complete loss in insurance terms.</p>
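<p>As a hedged sketch, using one common parameterization of the curves (check the 1997 paper or the package documentation for the exact conventions used there), the MBBEFD exposure curve for parameters <code>a</code> and <code>b</code> can be written as:</p>

```r
# MBBEFD exposure curve, G(d) = ln((a + b^d)/(a + 1)) / ln((a + b)/(a + 1)):
# the normalized limited expected value E[min(X, d)] / E[X] for a
# deductible d in [0, 1]. A sketch of one common parameterization,
# not MBBEFDLite's C implementation.
mbbefd_exposure <- function(d, a, b) {
  log((a + b^d) / (a + 1)) / log((a + b) / (a + 1))
}
```

<p>By construction the curve runs from 0 at <code>d = 0</code> to 1 at <code>d = 1</code>, matching the no-loss to complete-loss range described above.</p>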
<h3>MBBEFDLite Package</h3>
<p>Over the course of my career I have had many opportunities to use these curves when exposure rating accounts or building stochastic models. While I have (painfully) implemented them in Excel, I tend to do so in R, which is much better suited to statistical work. A number of years ago, I looked through CRAN and saw the excellent <a href="https://cran.r-project.org/package=mbbefd" rel="nofollow" target="_blank">mbbefd</a> package, written by three actuaries/statisticians who are well known in the reinsurance world: Drs. Christophe Dutang, Giorgio Spedicato, and Markus Gesmann. The package is a tour de force and does significantly more than implement the simple (limited) MBBEFD commonly used in reinsurance. Partly because of this, it has a number of dependencies, which means installing other packages. My philosophy is to do my best to <a href="https://www.avrahamadler.com/2022/01/20/reduce-dependency-hell/" rel="nofollow" target="_blank">reduce dependency hell</a>. Therefore, I wrote my own, simple, package called MBBEFDLite.</p>
<p>The nascent MBBEFDLite package worked well enough for my purposes, followed CRAN styles, and was reasonably fast given that the heavy lifting was done in C. One of the functions of the package was a method-of-moments (MoM) fitting routine based on the original paper’s suggestions in sections 4.2 and 4.3. The 1997 paper was a bit vague on the mechanics, so I wrote my own expectation-maximization-type algorithm, which depends on the difference between the expected value of the second moment and the point mass at 1. It worked well, but there are some samples for which it simply did not converge; samples for which the “implied” point mass is not positive. This bothered me, and the package was left in development mode (major version 0).</p>
<h3>Recent Updates</h3>
<p>A short time ago, in early 2026, Dr. Bernegger <a href="https://www.researchgate.net/publication/400516019_Properties_of_the_MBBEF_D_Distribution_Classes" rel="nofollow" target="_blank">published a new paper</a> which describes the distribution more completely and also provides pseudocode for a fitting algorithm, using the mean and cv as opposed to the first two moments. I was now able to implement the grid search! What I found was that samples which were problematic for my algorithm also failed to properly converge under the grid search. For those who have read the 2026 paper, the <img decoding="async" loading="lazy" src="https://i2.wp.com/www.avrahamadler.com/wp-content/ql-cache/quicklatex.com-a50139d27599034008c74b749d771be7_l3.png?resize=19%2C21&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="b_\mu" title="Rendered by QuickLaTeX.com" height="21" width="19" style="vertical-align: -6px;" data-recalc-dims="1"/> and <img decoding="async" loading="lazy" src="https://i0.wp.com/www.avrahamadler.com/wp-content/ql-cache/quicklatex.com-1cf1cef845be3a027db0313346d09382_l3.png?resize=18%2C21&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="b_\rho" title="Rendered by QuickLaTeX.com" height="21" width="18" style="vertical-align: -6px;" data-recalc-dims="1"/> curves did not intersect.</p>
<p>As part of revisiting the fitting algorithm, I looked at three implementations: an actual grid-search, two-dimensional nonlinear fitting—Nelder-Mead on <img decoding="async" loading="lazy" src="https://i1.wp.com/www.avrahamadler.com/wp-content/ql-cache/quicklatex.com-70039f9a81a73c9c93e2b7b0701fcd15_l3.png?resize=9%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="b" title="Rendered by QuickLaTeX.com" height="15" width="9" style="vertical-align: 0px;" data-recalc-dims="1"/> and <img decoding="async" loading="lazy" src="https://i0.wp.com/www.avrahamadler.com/wp-content/ql-cache/quicklatex.com-374c1ced5eb02c94c301c7a30c2114a5_l3.png?resize=10%2C13&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="g" title="Rendered by QuickLaTeX.com" height="13" width="10" style="vertical-align: -4px;" data-recalc-dims="1"/> simultaneously, and a nested pair of one-dimensional fits, where the unidimensional fit on <img decoding="async" loading="lazy" src="https://i0.wp.com/www.avrahamadler.com/wp-content/ql-cache/quicklatex.com-374c1ced5eb02c94c301c7a30c2114a5_l3.png?resize=10%2C13&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="g" title="Rendered by QuickLaTeX.com" height="13" width="10" style="vertical-align: -4px;" data-recalc-dims="1"/> calls the one on <img decoding="async" loading="lazy" src="https://i1.wp.com/www.avrahamadler.com/wp-content/ql-cache/quicklatex.com-70039f9a81a73c9c93e2b7b0701fcd15_l3.png?resize=9%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="b" title="Rendered by QuickLaTeX.com" height="15" width="9" style="vertical-align: 0px;" data-recalc-dims="1"/>. All three returned functionally the same answers, and the same as the EM algorithm I originally wrote. I decided to implement the nested 1D fits as that was the fastest of the three.</p>
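<p>The nested 1D pattern can be sketched generically with <code>uniroot()</code>. This is a structural illustration with hypothetical moment functions <code>mean_fn(b, g)</code> and <code>cv_fn(b, g)</code>, not the package’s actual routine:</p>

```r
# Nested 1D method-of-moments fit: the outer root-find over g calls an
# inner root-find over b that matches the target mean; the outer
# residual then matches the target cv. Search ranges are illustrative.
fit_nested <- function(target_mean, target_cv, mean_fn, cv_fn,
                       b_range = c(1e-6, 1 - 1e-6),
                       g_range = c(1 + 1e-6, 1e6)) {
  inner_b <- function(g) {
    uniroot(function(b) mean_fn(b, g) - target_mean, b_range)$root
  }
  g <- uniroot(function(g) cv_fn(inner_b(g), g) - target_cv, g_range)$root
  list(b = inner_b(g), g = g)
}
```

<p>Each inner call is a cheap bracketed root-find, which is why this variant can beat both a full grid search and a simultaneous 2D optimizer.</p>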
<h3>Mature Release</h3>
<p>Now I feel much better about the fitting algorithms: there are simply some samples that do not lend themselves to MoM fitting. In any case, maximum likelihood is almost always more robust <strong>if</strong> you have the actual observations. The MoM routine is nice in that if all you have are the moments, you can still find a fit—much of the time. With that done, and a slew of code review and hardening, I felt it was time for the MBBEFDLite package to have its mature release, so version 1.0.0 is now on CRAN. If you happen to use it, I would appreciate any constructive criticism you may have! There is more detail in the functions’ documentation, the NEWS file, and the commit comments. Enjoy!</p>
<p>The post <a href="https://www.avrahamadler.com/2026/03/10/mbbefdlite-v1/" rel="nofollow" target="_blank">MBBEFDLite has its mature release</a> appeared first on <a href="https://www.avrahamadler.com/" rel="nofollow" target="_blank">Strange Attractors</a>.</p>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://www.avrahamadler.com/2026/03/10/mbbefdlite-v1/"> R Archives | Strange Attractors</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/mbbefdlite-has-its-mature-release/">MBBEFDLite has its mature release</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">399775</post-id>	</item>
		<item>
		<title>Sharing data across shiny modules, an update</title>
		<link>https://www.r-bloggers.com/2026/03/sharing-data-across-shiny-modules-an-update/</link>
		
		<dc:creator><![CDATA[Colin Fay]]></dc:creator>
		<pubDate>Tue, 10 Mar 2026 20:26:59 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://rtask.thinkr.fr/?p=29604</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> You can read the original post in its original format on Rtask website by ThinkR here: Sharing data across shiny modules, an update<br />
Some people have recently been vocal about misuses of the “stratégie du petit r”, a mechanism for sharing data across {shiny} modules that was detailed both ...</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/sharing-data-across-shiny-modules-an-update/">Sharing data across shiny modules, an update</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://rtask.thinkr.fr/sharing-data-across-shiny-modules-an-update/"> Rtask</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p>You can read the original post in its original format on <a rel="nofollow" href="https://rtask.thinkr.fr/" target="_blank">Rtask</a> website by ThinkR here: <a rel="nofollow" href="https://rtask.thinkr.fr/sharing-data-across-shiny-modules-an-update/" target="_blank">Sharing data across shiny modules, an update</a></p>
<p>Some people have recently been vocal about misuses of the <code>&quot;stratégie du petit r&quot;</code>, a mechanism for sharing data across <code>{shiny}</code> modules that was detailed both in the <a href="https://engineering-shiny.org/structuring-project.html#communication-between-modules" rel="nofollow" target="_blank">Engineering Production-Grade Shiny Apps</a> book and in an <a href="https://rtask.thinkr.fr/communication-between-modules-and-its-whims/" rel="nofollow" target="_blank">older post written in 2019</a> on this blog.<br />
And yes, if you’re wondering, I did feel old when I realized this blog post is almost 7 years old now <img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f605.png" alt="😅" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<p>I’m always happy to be proven wrong, to challenge the way I build software, and to become a better software engineer. But given that we weren’t contacted to discuss the ideas behind this strategy<strong>, I thought the moment was perfect to give y’all an update on the latest approaches I’ve been using to share data across <code>{shiny}</code> modules</strong>, along with some thoughts and comments on the <code>&quot;stratégie du petit r&quot;</code>.</p>
<h2>Discovering {shiny} modules</h2>
<p>I’ve been building <code>{shiny}</code> apps for quite a while now. I’ve probably built more apps than I can remember, and for far longer than I dare to admit. One of the apps I’m most proud of is used by 1,000+ people in a large company, managing millions of euros every year. I’ve been working on this one for five years now.</p>
<p><strong>I can still remember the day Vincent sent us a message on Slack, sharing a video about <code>{shiny}</code> modules and how they would change the way we build apps</strong>. I was on a train ride back from one of the many R conferences I’ve attended (and spoken at) over the years. I think this might have been in 2017, but I’m not really sure. And I’ll admit it: I had absolutely no idea what this video was about, or how it would fit into my current <code>{shiny}</code> apps.</p>
<p>Being the nerd that I am, I watched this video a couple of times because I wanted to understand it and use it. Any time Vincent shows up and says something will change your coding style, you’d better listen and understand. It took me a bit of time, but after a couple of months, <strong><code>{shiny}</code> modules were a core part of my development workflow</strong>, and the <code>add_module()</code> function from <code>{golem}</code> became one of my favorites: it has saved me five minutes of perilous copy-pasting every time I need a new module. That’s a significant amount of lifetime saved thanks to a simple function.</p>
<p>But one of the more complex things with <code>{shiny}</code> modules is this: how do you share global state, data, and reactivity between them? How do I access the CSV read in <code>mod_csv_reader</code> from <code>mod_data_visualisation</code>?</p>
<p>Let’s dive into this question.</p>
<h2>What is a {shiny} module</h2>
<h3>Modules are functions</h3>
<p>I feel like <code>{shiny}</code> modules have been mistakenly presented as “reusable pieces of <code>{shiny}</code> code”. Well, they are, but <strong>95% of the modules I’ve written in my career have been used only once</strong>. And that’s because most of the time, parts and pieces of an app are too specific to be reused anywhere else.</p>
<p>So <code>{shiny}</code> modules are useful primarily because they address a scoping issue: via two functions, they let you define a small part of your app without having to worry about ID uniqueness across the whole application. Basically, they are building blocks: you start at the top level, then break things down into smaller and smaller pieces.</p>
<p><code>{shiny}</code> modules being functions means several things:</p>
<ul>
<li>They operate in an <strong>ecosystem of environments</strong></li>
<li>They are <strong>scoped</strong>, meaning what happens in them usually stays there unless you actively decide otherwise</li>
<li>They can take <strong>inputs</strong> and generate <strong>outputs</strong></li>
</ul>
<p>Good software engineering practice tells us: <strong>a function should take a set of inputs, do just one thing, and produce an output,</strong> and we plug these functions into one another like Russian dolls to build a larger workflow. So we might have, for example, one module that contains a tab of the app, which contains two cards, with one card being a module that contains a module with a <code>fileInput</code> to read a CSV.</p>
<p>Let’s take a look at a simple application like <a href="https://connect.thinkr.fr/gpxviewer" rel="nofollow" target="_blank">this GPX Viewer</a>, with the source code available at <a href="https://github.com/ThinkR-open/gpxviewer" rel="nofollow" target="_blank">https://github.com/ThinkR-open/gpxviewer</a>.</p>
<p><img loading="lazy" fetchpriority="high" decoding="async" class="aligncenter size-large wp-image-29607" src="https://i1.wp.com/rtask.thinkr.fr/wp-content/uploads/gpx-viewer-1024x576.png?w=450&#038;ssl=1" alt="" srcset_temp="https://i1.wp.com/rtask.thinkr.fr/wp-content/uploads/gpx-viewer-1024x576.png?w=450&#038;ssl=1 1024w, https://rtask.thinkr.fr/wp-content/uploads/gpx-viewer-300x169.png 300w, https://rtask.thinkr.fr/wp-content/uploads/gpx-viewer-768x432.png 768w, https://rtask.thinkr.fr/wp-content/uploads/gpx-viewer-1536x864.png 1536w, https://rtask.thinkr.fr/wp-content/uploads/gpx-viewer.png 1920w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /></p>
<p>This app follows a pretty common Shiny workflow: take a dataset, plot it, and summarize it. There are multiple ways to split this app into modules.</p>
<p>This can be seen as splitting modules by “doing just one thing” (data configuration / data visualization):</p>
<p><img loading="lazy" decoding="async" class="aligncenter size-large wp-image-29608" src="https://i0.wp.com/rtask.thinkr.fr/wp-content/uploads/two-modules-1024x561.png?w=450&#038;ssl=1" alt="" srcset_temp="https://i0.wp.com/rtask.thinkr.fr/wp-content/uploads/two-modules-1024x561.png?w=450&#038;ssl=1 1024w, https://rtask.thinkr.fr/wp-content/uploads/two-modules-300x164.png 300w, https://rtask.thinkr.fr/wp-content/uploads/two-modules-768x421.png 768w, https://rtask.thinkr.fr/wp-content/uploads/two-modules-1536x841.png 1536w, https://rtask.thinkr.fr/wp-content/uploads/two-modules-2048x1122.png 2048w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /></p>
<p>Another could be (upload / configure / plot / summarize):</p>
<p><img loading="lazy" decoding="async" class="aligncenter size-large wp-image-29609" src="https://i0.wp.com/rtask.thinkr.fr/wp-content/uploads/four-modules-1024x575.png?w=450&#038;ssl=1" alt="" srcset_temp="https://i0.wp.com/rtask.thinkr.fr/wp-content/uploads/four-modules-1024x575.png?w=450&#038;ssl=1 1024w, https://rtask.thinkr.fr/wp-content/uploads/four-modules-300x168.png 300w, https://rtask.thinkr.fr/wp-content/uploads/four-modules-768x431.png 768w, https://rtask.thinkr.fr/wp-content/uploads/four-modules-1536x862.png 1536w, https://rtask.thinkr.fr/wp-content/uploads/four-modules-2048x1149.png 2048w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /></p>
<p>This too (check for example / upload / configure / download / plot / summarize):</p>
<p><img loading="lazy" decoding="async" class="aligncenter size-large wp-image-29610" src="https://i1.wp.com/rtask.thinkr.fr/wp-content/uploads/eight-modules-1024x575.png?w=450&#038;ssl=1" alt="" srcset_temp="https://i1.wp.com/rtask.thinkr.fr/wp-content/uploads/eight-modules-1024x575.png?w=450&#038;ssl=1 1024w, https://rtask.thinkr.fr/wp-content/uploads/eight-modules-300x168.png 300w, https://rtask.thinkr.fr/wp-content/uploads/eight-modules-768x431.png 768w, https://rtask.thinkr.fr/wp-content/uploads/eight-modules-1536x862.png 1536w, https://rtask.thinkr.fr/wp-content/uploads/eight-modules-2048x1149.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /></p>
<p>What I’m trying to show here is that “just one thing” can be relative in the context of a big app. Furthermore, and I think this is the biggest point: <strong>we need to strike a balance between perfect and practical</strong>. For example, I’m currently working for a client where the codebase (R code only) is just under 20,000 lines, and if you start from the top level, the deepest module in the stack is six levels down. Of course, some of the modules could be split into smaller ones, and then into smaller ones, and so on.</p>
<p>But I’m trying to keep things easy to maintain. Given the current size of the codebase, <strong>adding more layers or going deeper would make the code far more complex and harder to maintain </strong>with no real benefit. So yes, some of these modules are not perfect, and they might not be doing “just one thing”.</p>
<p>You know what they say: “<em>perfect is the enemy of good</em>.”</p>
<h3>Can functions take lists as parameters?</h3>
<p>This is something I’ve been struggling with for a long time. Can a function take only scalar parameters, or can those parameters be lists too?</p>
<p>I’ve come to terms with this idea for two reasons:</p>
<ol>
<li>Data frames are lists, and I don’t see any good reason to forbid passing a <code>data.frame</code> as an argument to a function.</li>
<li>JavaScript is full of functions that take scalar values <em>and</em> a list of parameters, and it works well.<br />
For example, making an HTTP request in JS looks like this:</li>
</ol>
<pre>fetch(
  &quot;/api/users&quot;,
  {
    method: &quot;GET&quot;,
    headers: {
      &quot;Content-Type&quot;: &quot;application/json&quot;,
      &quot;Accept&quot;: &quot;application/json&quot;,
      &quot;Authorization&quot;: &quot;Bearer YOUR_TOKEN&quot;,
    }
  }
)
</pre>
<p>Guess what: in <code>{httr}</code> (I know, I’m old-school), you’d do:</p>
<pre>GET(
  url = &quot;/api/users&quot;,
  config  = add_headers(
    `Content-Type`  = &quot;application/json&quot;,
    `Accept`        = &quot;application/json&quot;,
    Authorization   = &quot;Bearer YOUR_TOKEN&quot;
  )
)
</pre>
<p>Yep, <code>config</code> is a <code>list()</code>.</p>
<p>If you feel like I’m digressing a bit from my original point, you’re right, a little. But it’s relevant to what I’ll be explaining in the rest of this blog post.</p>
<h2>Sharing data across modules</h2>
<h3>What are we even talking about?</h3>
<p>Let’s imagine, for a moment, the following Shiny architecture, which is, to be honest, a very simple one (most of the time, modules won’t be split this evenly).</p>
<p><img loading="lazy" decoding="async" class="aligncenter size-large wp-image-29611" src="https://i1.wp.com/rtask.thinkr.fr/wp-content/uploads/archi-1024x349.png?w=450&#038;ssl=1" alt="" srcset_temp="https://i1.wp.com/rtask.thinkr.fr/wp-content/uploads/archi-1024x349.png?w=450&#038;ssl=1 1024w, https://rtask.thinkr.fr/wp-content/uploads/archi-300x102.png 300w, https://rtask.thinkr.fr/wp-content/uploads/archi-768x262.png 768w, https://rtask.thinkr.fr/wp-content/uploads/archi-1536x524.png 1536w, https://rtask.thinkr.fr/wp-content/uploads/archi-2048x698.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /></p>
<p>Modules usually live in two scopes:</p>
<ul>
<li>They do things <strong>within themselves</strong></li>
<li>They do things <strong>that need to be passed to other modules</strong></li>
</ul>
<p>Doing things within themselves is pretty standard and doesn’t require a lot of thought (as long as you don’t forget the <code>ns()</code> <img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f605.png" alt="😅" class="wp-smiley" style="height: 1em; max-height: 1em;" />), but <strong>sharing things from one module to another in a reactive context can be more challenging</strong>. For example, let’s say our app contains the following: <code>mod_3_a</code> has a checkbox, <code>mod_3_b</code> has a data upload module, <code>mod_3_c</code> has a set of configuration options for cleaning the data, <code>mod_3_d</code> has a set of configuration options for the plot, and finally <code>mod_3_g</code> is the module that draws the plot. Once the data is uploaded and cleaned, the code has to be organized in a way that allows two things to happen:</p>
<ol>
<li>The dataset and configuration are available in <code>mod_3_g</code></li>
<li><code>mod_3_g</code>’s context is invalidated and a new plot is drawn (i.e., reactivity is triggered)</li>
</ol>
<p>If we only had (1), things would be a bit easier, but we also need to make everything reactive.</p>
<p>Let’s now explore the patterns we could use.</p>
<h3>Passing reactive objects</h3>
<p>One thing I’ve learned over the years is that <strong>what works for example apps can be a nightmare in a production context</strong>. The official Shiny docs recommend the following pattern: return one or more <code>reactive()</code> objects that can be passed to other modules.</p>
<p>Here, in our context, that would mean the following code (going from the bottom left of the tree to the bottom right):</p>
<pre># Bottom left level
mod_3a_server &lt;- function(){
  return(reactive({ input$abc }))
}
mod_3b_server &lt;- function(){
  return(reactive({ input$def }))
}
mod_3c_server &lt;- function(){
  return(reactive({ input$ghi }))
}
mod_3d_server &lt;- function(){
  return(reactive({ input$jkl }))
}

mod_2_a &lt;- function(){
  mod_3a_reactive &lt;- mod_3a_server()
  mod_3b_reactive &lt;- mod_3b_server()
  return(
    list(
      mod_3a_reactive = mod_3a_reactive,
      mod_3b_reactive = mod_3b_reactive
    )
  )
}
mod_2_b &lt;- function(){
  mod_3c_reactive &lt;- mod_3c_server()
  mod_3d_reactive &lt;- mod_3d_server()
  return(
    list(
      mod_3c_reactive = mod_3c_reactive,
      mod_3d_reactive = mod_3d_reactive
    )
  )
}

mod_1_a &lt;- function(){
  mod_2_a_results &lt;- mod_2_a()
  mod_2_b_results &lt;- mod_2_b()
  return(
    list(
      mod_3a_reactive = mod_2_a_results$mod_3a_reactive,
      mod_3b_reactive = mod_2_a_results$mod_3b_reactive,
      mod_3c_reactive = mod_2_b_results$mod_3c_reactive,
      mod_3d_reactive = mod_2_b_results$mod_3d_reactive
    )
  )
}

# in server

reactives_from_mod_1_a &lt;- mod_1_a(...)

mod_1_b_server(
  mod_3a_reactive = reactives_from_mod_1_a$mod_3a_reactive,
  mod_3b_reactive = reactives_from_mod_1_a$mod_3b_reactive,
  mod_3c_reactive = reactives_from_mod_1_a$mod_3c_reactive,
  mod_3d_reactive = reactives_from_mod_1_a$mod_3d_reactive
)

# in mod_1_b
mod_1_b_server &lt;- function(
  mod_3a_reactive,
  mod_3b_reactive,
  mod_3c_reactive,
  mod_3d_reactive
){
  mod_2_d_server(
    mod_3a_reactive = mod_3a_reactive,
    mod_3b_reactive = mod_3b_reactive,
    mod_3c_reactive = mod_3c_reactive,
    mod_3d_reactive = mod_3d_reactive
  )
}

mod_2_d_server &lt;- function(
  mod_3a_reactive,
  mod_3b_reactive,
  mod_3c_reactive,
  mod_3d_reactive
){
  mod_3g_server(
    mod_3a_reactive = mod_3a_reactive,
    mod_3b_reactive = mod_3b_reactive,
    mod_3c_reactive = mod_3c_reactive,
    mod_3d_reactive = mod_3d_reactive
  )
}

mod_3g_server &lt;- function(
  mod_3a_reactive,
  mod_3b_reactive,
  mod_3c_reactive,
  mod_3d_reactive
){
  output$xyz &lt;- renderPlot({
    draw(
      mod_3a_reactive = mod_3a_reactive(),
      mod_3b_reactive = mod_3b_reactive(),
      mod_3c_reactive = mod_3c_reactive(),
      mod_3d_reactive = mod_3d_reactive()
    )
  })
}
</pre>
<p><strong>If you feel like it’s a mess and complex to reason about, that’s because it is</strong>. And we’re in a simple case where data travels at the same depth in the stack.</p>
<p><img loading="lazy" decoding="async" class="aligncenter size-large wp-image-29612" src="https://i0.wp.com/rtask.thinkr.fr/wp-content/uploads/archi-travel-2-1024x388.png?w=450&#038;ssl=1" alt="" srcset_temp="https://i0.wp.com/rtask.thinkr.fr/wp-content/uploads/archi-travel-2-1024x388.png?w=450&#038;ssl=1 1024w, https://rtask.thinkr.fr/wp-content/uploads/archi-travel-2-300x114.png 300w, https://rtask.thinkr.fr/wp-content/uploads/archi-travel-2-768x291.png 768w, https://rtask.thinkr.fr/wp-content/uploads/archi-travel-2-1536x583.png 1536w, https://rtask.thinkr.fr/wp-content/uploads/archi-travel-2-2048x777.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /></p>
<p>With this, we’d get:</p>
<pre>mod_1_b_server &lt;- function(mod_3a_reactive, mod_3b_reactive, mod_3c_reactive, mod_3d_reactive){
  mod_2_d_server(
    mod_3a_reactive = mod_3a_reactive,
    mod_3b_reactive = mod_3b_reactive,
    mod_3c_reactive = mod_3c_reactive,
    mod_3d_reactive = mod_3d_reactive
  )
}

mod_2_d_server &lt;- function(mod_3a_reactive, mod_3b_reactive, mod_3c_reactive, mod_3d_reactive){

  output$abc &lt;- renderText({
    mod_3d_reactive()
  })

  mod_3g_server(
    mod_3a_reactive = mod_3a_reactive,
    mod_3b_reactive = mod_3b_reactive,
    mod_3c_reactive = mod_3c_reactive
  )
}

mod_3g_server &lt;- function(mod_3a_reactive, mod_3b_reactive, mod_3c_reactive){
  output$xyz &lt;- renderPlot({
    draw(
      mod_3a_reactive = mod_3a_reactive(),
      mod_3b_reactive = mod_3b_reactive(),
      mod_3c_reactive = mod_3c_reactive()
    )
  })
}
</pre>
<p>And that’s just to make four values travel through the module graph, in a pretty shallow and evenly organized stack, as I said. And so far we’re only passing reactives as parameters; in practice, a module usually mixes them with plain, non-reactive values:</p>
<pre>mod_3g_server &lt;- function(
  dataset,
  mod_3a_reactive,
  mod_3b_reactive,
  mod_3c_reactive,
  with_coordflip = TRUE
){
  output$xyz &lt;- renderPlot({
    draw(
      dataset = dataset,
      mod_3a_reactive = mod_3a_reactive(),
      mod_3b_reactive = mod_3b_reactive(),
      mod_3c_reactive = mod_3c_reactive(),
      with_coordflip = with_coordflip
    )
  })
}
</pre>
<p>Which is even more complex if you add a layer of <code>reactive()</code> inside your module:</p>
<pre>mod_3g_server &lt;- function(
  dataset,
  mod_3a_reactive,
  mod_3b_reactive,
  mod_3c_reactive,
  with_coordflip = TRUE
){
  the_plot_to_draw &lt;- reactive({
    drawing &lt;- draw(
      dataset = dataset,
      mod_3a_reactive = mod_3a_reactive(),
      mod_3b_reactive = mod_3b_reactive(),
      mod_3c_reactive = mod_3c_reactive(),
      with_coordflip = with_coordflip
    )
    return(drawing)
  })
  output$xyz &lt;- renderPlot({
    the_plot_to_draw()
  })
}
</pre>
<p><strong>Good luck understanding the reactive graph for that one.</strong></p>
<p>As a side note, I think <code>reactive()</code> objects are conceptually neat, but I don’t think they should be your go-to building block.<br />
Let’s have a quick look at:</p>
<pre>the_data_frame &lt;- reactive({
  result &lt;- clean_and_transform(
    input$dataset
  )
  return(result)
})

output$table_one &lt;- renderDT({
  the_data_frame()
})
</pre>
<p>That’s indeed neat: whenever <code>input$dataset</code> changes, something is computed and displayed. It works well for small examples, but as soon as you have to pass it to other functions or modules, it starts to feel harder to reason about, especially if you’re not used to manipulating functions as objects.</p>
<p>I’ve met a lot of R developers who didn’t know you could pass a function as a parameter to another function, and most of the time, with <code>reactive()</code>, people are copying examples from the web without really understanding what’s happening.</p>
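<p>If that idea is new to you, here is a tiny, non-Shiny illustration of passing a function as a parameter, which is essentially what you do every time you hand a <code>reactive()</code> to another module:</p>

```r
# make_doubler() takes a function as input and returns a new function.
# A reactive() works the same way: it is a function you call to get the
# current value, and you can pass it around like any other object.
make_doubler <- function(get_value) {
  function() get_value() * 2
}

get_x <- function() 21
double_x <- make_doubler(get_x)
double_x()  # 42: get_value() is only evaluated when double_x() is called
```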
<h3>But how do they do that in other languages?</h3>
<p>I haven’t built real apps in that many languages, but there is one I know (more or less) well: JavaScript.</p>
<p>In the summer of 2024, we spent a couple of weeks working on <a href="https://rtask.thinkr.fr/introducing-rlinguo-a-native-mobile-app-that-runs-r/" rel="nofollow" target="_blank">Rlinguo</a>, a mobile app that can run R code. It’s built in React, and it works just like <code>{shiny}</code> does (well, from a conceptual point of view): <strong>you have stateful objects, and when these objects change, they trigger another part of the app to be recomputed. In our case, whenever you interact with the first tab, the second tab (with the visualization) is updated</strong>.</p>
<p>In the app, the first layer creates a webR instance, an SQLite connection, and a score object, which is used to trigger a recomputation of the viz. When the app launches, you get a loading screen that waits for webR to be ready. Once it is, webR is queried for functions, and once you’ve validated your answer (in “module” 1), an alert is sent to the viz (in “module 2”) to query the SQLite DB and recompute the graph.</p>
<p>To sum up, some objects are created at the top level and used to share data and trigger reactivity from one “module” to the other.</p>
<p><em>Note: my colleague Arthur pointed out that Vue.js has something called a <code>store</code> in <a href="https://pinia.vuejs.org/core-concepts/" rel="nofollow" target="_blank">Pinia</a>. I’m not exactly sure how it works, but apparently it’s more or less the same as <code>reactiveValues()</code>. And Claude confirmed it <img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f604.png" alt="😄" class="wp-smiley" style="height: 1em; max-height: 1em;" /></em></p>
<h3>The “stratégie du petit r”</h3>
<p>One strategy we recommended is what we called the “stratégie du petit r”. Looking back, <strong>I can admit that it was a poor choice of name, but you know, sh*t happens</strong>.</p>
<p>The principle is quite simple: instead of returning and passing <code>reactive()</code> objects as arguments, you create one <strong>or more</strong> <code>reactiveValues()</code> at an upper level, which you then pass downstream to lower-level modules. <code>reactiveValues()</code> behave a lot like environments, meaning that values set down the stack are available everywhere.</p>
<p><img loading="lazy" decoding="async" class="aligncenter size-large wp-image-29613" src="https://i1.wp.com/rtask.thinkr.fr/wp-content/uploads/rv-1024x388.png?w=450&#038;ssl=1" alt="" srcset_temp="https://i1.wp.com/rtask.thinkr.fr/wp-content/uploads/rv-1024x388.png?w=450&#038;ssl=1 1024w, https://rtask.thinkr.fr/wp-content/uploads/rv-300x114.png 300w, https://rtask.thinkr.fr/wp-content/uploads/rv-768x291.png 768w, https://rtask.thinkr.fr/wp-content/uploads/rv-1536x583.png 1536w, https://rtask.thinkr.fr/wp-content/uploads/rv-2048x777.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /></p>
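<p>Here is a minimal sketch of the pattern (module and field names are illustrative, not from a real app): one <code>reactiveValues()</code> is created at the top level and handed down to the modules that need it:</p>

```r
library(shiny)

# One reactiveValues() created at the top level and passed down.
# Because it behaves like an environment, a value set in one module
# is visible, and reactive, in the other.
mod_reader_server <- function(id, storage) {
  moduleServer(id, function(input, output, session) {
    observeEvent(input$file, {
      storage$dataset <- read.csv(input$file$datapath)
    })
  })
}

mod_plot_server <- function(id, storage) {
  moduleServer(id, function(input, output, session) {
    output$plot <- renderPlot({
      req(storage$dataset)  # invalidated whenever the dataset changes
      plot(storage$dataset)
    })
  })
}

app_server <- function(input, output, session) {
  storage <- reactiveValues()  # could be named global, storage, r, ...
  mod_reader_server("reader", storage = storage)
  mod_plot_server("plot", storage = storage)
}
```

<p>Because <code>storage</code> behaves like an environment, <code>mod_plot_server</code> sees the dataset set by <code>mod_reader_server</code>, and the <code>renderPlot()</code> context is invalidated when it changes.</p>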
<p><strong>I still think this is a valid way to share data</strong>, but only if you avoid applying it too literally and focus on how to work with it in practice.</p>
<p>The main criticism I’ve read about this approach is that you’ll end up with a huge <code>r</code> object with 300 entries in it, creating a monster that’s impossible to debug.</p>
<p>So yes, these monsters exist. But I don’t think the idea itself is the problem.<strong> It’s always easier to blame the tool than to acknowledge the lack of understanding behind its misuse</strong>. Or, as Beckett wrote, “<em>Voilà l’homme tout entier, s’en prenant à sa chaussure alors que c’est son pied le coupable.</em>” (“<em>There’s man all over for you, blaming on his boots the faults of his feet.</em>”)</p>
<p>Here are some random thoughts:</p>
<h4>1. Don’t call it <code>r</code> (probably)</h4>
<p>Conventions are great, and they help humans thrive. I think we need them when building software: <strong>we spend more time reading than writing code, and conventions help us navigate an unfamiliar codebase</strong>. For example, I know that all files starting with <code>mod_</code> in <code>{golem}</code> contain modules.</p>
<p>When presenting examples for the “stratégie du petit r”, we used <code>r &lt;- reactiveValues()</code>. But that was just for the example. In this post, I’ve used <code>mod_1_a</code> and <code>mod_3_g</code>; please don’t reuse these names, they’re only examples.</p>
<p>So yes, a small <code>r</code> might be confusing if you don’t work with people who know that convention. <strong>If I stumble upon a codebase with an <code>r</code>, I’ll know what it is because I’ve used it before</strong>. But nowadays, I tend to go for more expressive naming, usually either <code>global</code> (since it’s global storage), or simply <code>storage</code>. You might prefer other names like <code>global_storage</code>, <code>reactive_storage</code>, or anything that would be clearer to your team.</p>
<p>That being said, <strong>everything is a matter of context and convention</strong>. For example, <code>dplyr::mutate()</code> has a parameter called <code>.data</code>. You could debate whether it’s a good choice or not, but anytime I see <code>.data</code>, I know it’s a table and that we’re in the tidyverse.</p>
<h4>2. You don’t need to share everything between all modules</h4>
<p>There is a very, very small chance that you need to share <strong>everything</strong> across all modules.</p>
<p>Think about the app you’re working on right now. Yes, there are probably a handful of things that need to be available in all modules, but <strong>there is no need to store everything in an upper-level <code>reactiveValues()</code></strong>.</p>
<p>Your modules need to stay scoped, and this is probably the most important idea for making this implementation work:</p>
<ul>
<li><strong>Things that are only needed inside a module should not be stored in a <code>reactiveValues()</code> defined at an upper level</strong></li>
<li><strong>Things that are only needed inside a module should not be passed down to lower-level modules</strong></li>
</ul>
<p>That’s as simple as that. Think of your app as a tree: values that are only necessary at level N should not “go up” to level N + 1.</p>
<h4>3. You <del>can</del> need to have several <code>reactiveValues()</code></h4>
<p>The corollary of the last point is simple: <strong>you need several <code>reactiveValues()</code></strong>, operating at different scopes in your application.</p>
<p>Here is a simplified extract of a module from an app I’m currently working on:</p>
<pre>mod_abstract_server &lt;- function(
  id,
  global
) {

  local &lt;- reactiveValues()

  observeEvent(input$language, {
    local$ai_alert &lt;- build_text_for_ai_alert(
      input$language
    )
  })

  output$alert_ai &lt;- renderUI({
    local$ai_alert
  })

  country_rv &lt;- reactiveValues()

  observe({
    country_rv$country &lt;- input$country
  })

  mod_checklist_server(
    &quot;checklist_1&quot;,
    country_rv = country_rv,
    global = global
  )
}
</pre>
<p>So here, we have:</p>
<ul>
<li><code>global</code> (which could also be named <code>r_global</code>), the <code>reactiveValues()</code> <strong>shared across all modules</strong>. It contains a dataset that can be updated in an admin panel but needs to be read in the other modules. It’s passed down from <code>app_server</code>, goes through <code>mod_abstract_server</code>, and down into <code>mod_checklist_server</code>. I can name ten use cases from client apps where this is a valid pattern; just ask me next time you meet me at a conference.</li>
<li><code>local</code> (which could also be named <code>r_local</code>), a <code>reactiveValues()</code> that stores values needed only inside the current module.</li>
<li><code>country_rv</code>, which is defined within the module and passed down to <code>mod_checklist_server</code>.</li>
</ul>
<p>I could have stored everything in <code>global</code>, and it would still work. But that wouldn’t be good organization or separation of concerns.</p>
<h4>To sum up</h4>
<p><strong>No structure, no idea, and no framework will ever prevent someone from writing bad code</strong>. JavaScript used to be joked about as a language that’s too permissive. Then TypeScript came along and imposed more structure on the language, with a loophole: you can hack around the type system and use <code>any</code> as the type for everything, and it will still work. You can write bad code with TypeScript, even if the language is supposed to enforce structure. <strong>Nothing can stop you from writing bad code</strong>.</p>
<p>Yes, using <code>reactiveValues()</code> as a storage object shared between modules can create monsters if you don’t really think about what you’re doing.</p>
<p>Yes, in an app with a very large number of values floating around, trying to pass data via strict function parameters can create even scarier monsters.</p>
<p>Yes, it’s OK to have a list as a parameter to a module function.</p>
<h3>Other patterns</h3>
<p>Here are some other patterns that can be used in a <code>{shiny}</code> app to share data across modules.</p>
<h4>Storage using an R6 object</h4>
<p>One downside I can think of when using the <code>reactiveValues()</code> strategy I just described is that, well, it’s reactive, meaning it can lead to uncontrolled reactivity if things aren’t scoped correctly.</p>
<p>One pattern I’ve used in an app is combining an <code>R6</code> object, used to store and process data, with the trigger mechanism from <code>{gargoyle}</code>. Basically, the idea behind <code>{gargoyle}</code> is simple: instead of relying on the reactive graph to invalidate itself, you <code>init</code> flags that are <code>trigger</code>ed in the code, and when a flag is triggered, the context where the flag is <code>watch</code>ed is invalidated.<br />
It’s a bit longer to implement, but you get better control over what is happening.</p>
<p>Combined with this, you can use an <code>R6</code> object that is passed along the modules, and that gets transformed to store, process, and serve the data.</p>
<p>You can read more about this in “15.1.3 Building triggers and watchers” and “15.1.4 Using R6 as data storage” in <a href="https://engineering-shiny.org/common-app-caveats.html" rel="nofollow" target="_blank">Chapter 15</a> of the Engineering Shiny book.</p>
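<p>As a rough sketch of that combination (assuming <code>{R6}</code> and <code>{gargoyle}</code> are available; all names here are illustrative, not from a real app):</p>

```r
library(R6)

# A plain R6 store: it has no reactivity of its own, so nothing
# invalidates unless a {gargoyle} flag is explicitly triggered.
DataStore <- R6Class("DataStore",
  public = list(
    dataset = NULL,
    set_dataset = function(x) {
      self$dataset <- x
      invisible(self)
    }
  )
)

# In app_server (sketch):
#   store <- DataStore$new()
#   gargoyle::init("data_updated")
#
# In the module that uploads the data:
#   store$set_dataset(read.csv(input$file$datapath))
#   gargoyle::trigger("data_updated")
#
# In the module that plots it:
#   output$plot <- renderPlot({
#     gargoyle::watch("data_updated")  # only this flag invalidates the plot
#     req(store$dataset)
#     plot(store$dataset)
#   })
```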
<h4><code>session$userData</code></h4>
<p>This one should be used with a lot of caution, but it can be very effective if you know what you’re doing (and if you don’t have too many things to share).</p>
<p>The <code>session</code> object is an environment available everywhere in your Shiny app. It represents the current interaction between each user and the R session (i.e., each user has their own). This environment has a special slot called <code>userData</code> that can be populated with data, and it is scoped to the session.</p>
<p>The way I’ve used it in the past is via wrappers, which would look like:</p>
<pre>set_this &lt;- function(value, session = shiny::getDefaultReactiveDomain()){
  session$userData$this &lt;- compute_this(value)
}
get_this &lt;- function(session = shiny::getDefaultReactiveDomain()){
  session$userData$this
}
</pre>
<p>So anywhere I need it, I’ll use the wrapper function instead of <code>session$userData$this</code>. I would generally use it to define things at the top level that need to be accessible everywhere downstream, but I feel it might be a bit complex to manage if you need to pass data from <code>mod_3_a</code> to <code>mod_3_g</code>.</p>
<p>The documentation says it can be used <em>“to store whatever session-specific data (we) want”</em>, but my gut feeling is that it’s best not to shove too much into it. I don’t have any rational reason for that, though, and I’d be happy to be proven wrong.</p>
<h4>An environment in the scope of the package/top level of the app</h4>
<p>This is something a lot of R developers do: define an environment inside the package namespace so that, when the package is loaded, you can CRUD into it. For example, there are some (well, several) in <code>{shiny}</code>:</p>
<pre>&gt; shiny:::.globals</pre>
<p>The function <code>shinyOptions()</code> writes to it, and <code>getShinyOption()</code> reads from it.</p>
<p>This pattern can be used as global storage, but be careful: it’s not session-scoped, so whatever is in this environment is shared across sessions.</p>
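<p>A bare-bones version of that pattern could look like this (function names are mine, not an existing API):</p>

```r
# An environment created when the package is loaded. Using emptyenv()
# as parent avoids accidental lookups further up the search path.
.app_globals <- new.env(parent = emptyenv())

set_app_option <- function(name, value) {
  assign(name, value, envir = .app_globals)
}

get_app_option <- function(name, default = NULL) {
  if (exists(name, envir = .app_globals, inherits = FALSE)) {
    get(name, envir = .app_globals, inherits = FALSE)
  } else {
    default
  }
}
```

<p>Remember: unlike <code>session$userData</code>, this environment lives in the R process, so every session of the app sees (and can overwrite) the same values.</p>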
<h4>An external database or storage system</h4>
<p>Another solution is to store values in an external database, and query that DB inside modules.</p>
<p>If you try to implement this solution, two things to keep in mind are:</p>
<ul>
<li>Make the data session-scoped, i.e., use <code>session$token</code> to identify the current session, and remove the data when the session ends.</li>
<li>You’ll need to handle reactivity manually, for example with <code>{gargoyle}</code>.</li>
</ul>
<p>For example, with <code>{storr}</code>:</p>
<pre># Mimicking a session
session &lt;- shiny::MockShinySession$new()

# In module 1
st &lt;- storr::storr_rds(here::here())
st$set(&quot;dataset&quot;, mtcars, namespace = session$token)

# In module 2
st &lt;- storr::storr_rds(here::here())
st$get(&quot;dataset&quot;, namespace = session$token)
</pre>
<p>Of course, this is a short piece of code and you’ll need more engineering, but you get the idea.</p>
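<p>For the session-scoping part, a sketch of the cleanup (assuming the same <code>{storr}</code> store as above) could hook into <code>session$onSessionEnded()</code>:</p>

```r
# In app_server: remove this session's namespace when the session ends,
# so the external store doesn't accumulate stale per-session data.
app_server <- function(input, output, session) {
  st <- storr::storr_rds(here::here())
  session$onSessionEnded(function() {
    st$clear(namespace = session$token)
  })
}
```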
<h2>Conclusion</h2>
<p>It’s been a long post, but I wanted to dive a bit deeper into the why, and to develop the ideas and drawbacks behind the “stratégie du petit r”.<br />
I should have written this post much sooner, but I suppose being attacked publicly on social media without being consulted first is quite the motivator.</p>
<p>Anyway, <strong>I’m always happy to chat about the ideas developed here</strong>, so feel free to comment or reach out to me (I’m pretty sure that if you need to, it’s very easy to find a way to contact me <img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f605.png" alt="😅" class="wp-smiley" style="height: 1em; max-height: 1em;" />).</p>
<p><strong>As with anything in life, writing software is always a matter of compromise</strong>. Any decision you make while writing code has benefits and drawbacks, and if you can’t find any drawbacks, it’s because you haven’t thought hard enough. When building applications for production, the codebase can become very large. I mentioned an app with 20,000 lines of code, which I recently spent a week refactoring to reduce its size by 20%, but I’m sure other apps I’ve worked on are larger. Still manageable if well organized, but complex anyway.</p>
<p>In the perfect world of software engineering, modules would be so small that they handle just one value, reactive graphs would be fully under control, we’d get code coverage of 100%, all required inputs would be passed as parameters, we would use a typed language that wouldn’t allow unsafe values, and no variable would ever be called <code>x</code> or <code>result</code>.</p>
<p>And then there’s reality.</p>
<p>The client needed this yesterday. Their boss needed it last month. I’m out of coffee. And, to be honest, I’d rather be out in the woods running than debugging <code>renv::install()</code> again.</p>
<p>So we might take shortcuts, use bad variable names, forget to delete a test <code>data.frame</code> from the SQL database, and create <code>reactiveValues()</code> that are monsters.</p>
<p>Still, I genuinely believe <strong>nobody is here to sabotage the project</strong>.</p>
<p>That <strong>we’re all doing the best we can with what we have</strong>.</p>
<p>This post is better presented on its original ThinkR website here: <a rel="nofollow" href="https://rtask.thinkr.fr/sharing-data-across-shiny-modules-an-update/" target="_blank">Sharing data across shiny modules, an update</a></p>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://rtask.thinkr.fr/sharing-data-across-shiny-modules-an-update/"> Rtask</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/sharing-data-across-shiny-modules-an-update/">Sharing data across shiny modules, an update</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">399748</post-id>	</item>
		<item>
		<title>Special k: The Science (or Art) of Finding the Optimal k in Clustering</title>
		<link>https://www.r-bloggers.com/2026/03/special-k-the-science-or-art-of-finding-the-optimal-k-in-clustering/</link>
		
		<dc:creator><![CDATA[Jason Bryer]]></dc:creator>
		<pubDate>Tue, 10 Mar 2026 04:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://bryer.org/posts/2026-03-10-Special_k.html</guid>

					<description><![CDATA[<p>Download slides<br />
Cluster analysis is a statistical procedure for grouping observations using an observation-centered approach as compared to variable-centered approaches (e.g. PCA, factor analysis). As an unsupervised method true cluster mem...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/special-k-the-science-or-art-of-finding-the-optimal-k-in-clustering/">Special k: The Science (or Art) of Finding the Optimal k in Clustering</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://bryer.org/posts/2026-03-10-Special_k.html"> Jason Bryer</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
 




<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/B4MU7ORbCWI" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<p><a href="https://github.com/jbryer/clav/blob/master/slides/clav_nyhackr_2026.pdf" rel="nofollow" target="_blank">Download slides</a></p>
<p>Cluster analysis is a statistical procedure for grouping observations using an observation-centered approach, in contrast to variable-centered approaches such as PCA and factor analysis. Because it is an unsupervised method, true cluster membership is usually not known, so determining the optimal number of clusters, k, poses unique challenges. This talk reviews six common metrics for determining k across several clustering methods using two data sets, introduces two bootstrap fit statistics, and presents validation techniques for evaluating the validity and stability of the cluster results across bootstrap samples.</p>
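<p>As a small taste of the problem, here is a minimal elbow-method sketch in base R (using the built-in iris data, which is not one of the talk&#8217;s data sets, and plain k-means rather than the talk&#8217;s methods):</p>
<pre>X &lt;- scale(iris[, 1:4])

# Total within-cluster sum of squares for k = 1..8; look for the "elbow"
set.seed(2026)
wss &lt;- sapply(1:8, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = &quot;b&quot;, xlab = &quot;k&quot;, ylab = &quot;Total within-cluster sum of squares&quot;)</pre>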



 
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://bryer.org/posts/2026-03-10-Special_k.html"> Jason Bryer</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/special-k-the-science-or-art-of-finding-the-optimal-k-in-clustering/">Special k: The Science (or Art) of Finding the Optimal k in Clustering</a>]]></content:encoded>
					
		
		<enclosure url="https://bryer.org/posts/2026-03-10-Special_k.png" length="0" type="image/png" />

		<post-id xmlns="com-wordpress:feed-additions:1">399810</post-id>	</item>
		<item>
		<title>Breaking Release of the patentsview R Package</title>
		<link>https://www.r-bloggers.com/2026/03/breaking-release-of-the-patentsview-r-package/</link>
		
		<dc:creator><![CDATA[rOpenSci]]></dc:creator>
		<pubDate>Tue, 10 Mar 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://ropensci.org/blog/2026/03/10/patentsview-breaking-release/</guid>

					<description><![CDATA[<p>The patentsview R package was created by Chris Baker to simplify interactions with the<br />
PatentsView API as announced in Chris’<br />
blog post<br />
in 2017. The API can be queried for data from US patents granted since 1976 as well as<br />
patent applications si...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/breaking-release-of-the-patentsview-r-package/">Breaking Release of the patentsview R Package</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://ropensci.org/blog/2026/03/10/patentsview-breaking-release/"> rOpenSci - open tools for open science</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>

<p>The <a href="http://docs.ropensci.org/patentsview/" rel="nofollow" target="_blank">patentsview</a> R package was created by Chris Baker to simplify interactions with the
PatentsView API as announced in Chris’
<a href="https://ropensci.org/blog/2017/09/19/patentsview/" rel="nofollow" target="_blank">blog post</a>
in 2017. The API can be queried for data from US patents granted since 1976 as well as
patent applications since 2001 (not all going on to become granted patents).<br>
As shown in the package’s vignettes, location data can be mapped, charts of
assignees can be created etc. using other R packages, only limited by the
developer’s imagination.</p>
<p>Fast-forward to today and we find ourselves in a precarious
position: the PatentsView API team has made breaking changes and retired
the original API (all calls to the original endpoints return 410 Gone).
We have therefore spent some time updating patentsview to work with these API changes.
The updated patentsview package is now on CRAN but, unfortunately, as this Tech Note was being prepared
the PatentsView API team made more troubling changes.</p>
<p>In late February they replaced their
<a href="https://patentsview.org/forum" rel="nofollow" target="_blank">forum</a> with a message saying the page was temporarily
unavailable. They have also removed the link to request an API key, so it&#8217;s unclear
whether they&#8217;d honor requests made through the link below. Nothing has been officially
announced, but the long-term viability of the API seems uncertain.</p>
<p><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f374.png" alt="🍴" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Here you&#8217;ve come to a fork (and knife) in the road: continue reading
if you are/were using the original version of the patentsview package, and we&#8217;ll guide
you through the necessary changes. If you have an interest in US patent data but haven&#8217;t
used the patentsview package yet (and are willing to take the risk!), check out the <a href="https://docs.ropensci.org/patentsview/articles/ropensci-blog-post.html" rel="nofollow" target="_blank">vignette</a> reworked from Chris&#8217; original blog post to use the new version of the R package and API.</p>
<h2>
New changes to the API:
</h2><ol>
<li>Users will need to <a href="https://patentsview-support.atlassian.net/servicedesk/customer/portals" rel="nofollow" target="_blank">request an API key</a> and set an environmental variable PATENTSVIEW_API_KEY to this value.</li>
<li>Endpoint changes:
<ul>
<li>The <code>nber_subcategories</code>, one of the original seven endpoints, was removed</li>
<li><code>cpc_subsections</code> is now <code>cpc_group</code></li>
<li>The remaining five original endpoints went from plural to singular; <code>patents</code> is now <code>patent</code>, for example.
Interestingly, the returned data structures are still plural for the most part.</li>
<li>There are now 27 endpoints, and more than one may need to be called to retrieve fields that were
available from a single call to one of the original endpoints</li>
<li>Some of the endpoints now return HATEOAS (Hypermedia as the Engine of Application State) links that can be called to retrieve additional data</li>
</ul>
</li>
<li>Some fields are now nested and need to be fully qualified when used in a query,
for instance, <code>search_pv('{&quot;cpc_current.cpc_group_id&quot;:&quot;A01B1/00&quot;}')</code> when using the patent endpoint.
In the fields parameter, nested fields can be fully qualified or a new API shorthand can be used,
which allows you to specify group names. When group names are used, all of the group’s nested fields will be returned.
For example, defining <code>fields = c(&quot;assignees&quot;)</code> when
using the patent endpoint means that all nested assignees’ fields will be returned by the API.</li>
<li>Some field names have changed, most significantly, <code>patent_number</code> is now <code>patent_id</code>,
and some fields were removed entirely, for instance, <code>rawinventor_first_name</code> and <code>rawinventor_last_name</code>.</li>
<li>The original version of the API had queryable fields and additional fields which could be
retrieved but couldn’t be part of a conditional query. That notion does not apply to the
new version of the API as all fields are now queryable. You may be able
to simplify your code if you found yourself post-processing returned data
because a field you were interested in was not queryable.</li>
<li>Now there isn’t supposed to be a difference between
operators used on strings vs full text fields, as there was in the original
version of the API. See the tip below the <a href="https://search.patentsview.org/docs/docs/Search%20API/SearchAPIReference/#syntax" rel="nofollow" target="_blank">Syntax section</a>.</li>
<li>Result set paging has changed significantly. This only matters to users implementing their own
paging, as the package continues to handle result set paging with <code>search_pv()</code>&#8217;s <code>all_pages = TRUE</code>.
There is a new <a href="https://docs.ropensci.org/patentsview/articles/result-set-paging.html" rel="nofollow" target="_blank">Result set paging</a> vignette explaining the API&#8217;s paging,
which uses the <code>size</code> and <code>after</code> parameters rather than <code>page</code> and <code>per_page</code>.</li>
<li>Result set sizes are seemingly unbounded now. The original version of the API capped result sets at
100,000 rows.</li>
</ol>
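<p>Pulling those pieces together, a minimal call against the new API might look like the following. This is a sketch only: it assumes the PATENTSVIEW_API_KEY environment variable is set and reuses the field names mentioned above.</p>
<pre>library(patentsview)

# Nested field fully qualified in the query; the &quot;assignees&quot; group
# shorthand in fields returns all of the group's nested fields
res &lt;- search_pv(
  query = '{&quot;cpc_current.cpc_group_id&quot;:&quot;A01B1/00&quot;}',
  fields = c(&quot;patent_id&quot;, &quot;assignees&quot;),
  endpoint = &quot;patent&quot;
)
res$data</pre>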
<p>The API team also <a href="https://search.patentsview.org/docs/#naming-update" rel="nofollow" target="_blank">renamed the API</a>,
PatentsView’s Search API is now the PatentSearch API.
Note that the R package will retain its name, continue to use <code>library(patentsview)</code> to load the package.</p>
<h2>
New changes to the R package:
</h2><ol>
<li>Throttling is now enforced by the API and handled by the R package (it sleeps for the time specified by the throttle response before retrying)</li>
<li>There are five new vignettes
<ul>
<li><a href="https://docs.ropensci.org/patentsview/articles/api-changes.html" rel="nofollow" target="_blank">API Changes</a></li>
<li><a href="https://docs.ropensci.org/patentsview/articles/converting-an-existing-script.html" rel="nofollow" target="_blank">Converting an existing script</a></li>
<li><a href="https://docs.ropensci.org/patentsview/articles/result-set-paging.html" rel="nofollow" target="_blank">Result set paging</a>, should custom paging be needed</li>
<li><a href="https://docs.ropensci.org/patentsview/articles/understanding-the-api.html" rel="nofollow" target="_blank">Understanding the API</a>, the API team’s jupyter notebook converted to R and enhanced</li>
<li><a href="https://ropensci.org/blog/2017/09/19/patentsview/" rel="nofollow" target="_blank">Accessing patent data with the patentsview package</a>, the blog post that announced the original version of the R package has been updated to work with the new version of the API</li>
</ul>
</li>
<li>The R package changed internally from using httr to httr2. This only affects users who
passed additional arguments (<code>...</code>) to <code>search_pv()</code>. Where they previously passed <code>config = httr::timeout(40)</code>,
they would now pass <code>timeout = 40</code> (name-value pairs of valid curl options, as found in <code>curl::curl_options()</code>; see <a href="https://httr2.r-lib.org/reference/req_options.html" rel="nofollow" target="_blank">req_options</a>)</li>
<li>Now that the R package is using httr2, users can make use of its <code>last_request()</code> method to see what was sent to the API. This could be useful when trying to fix an invalid request. Also fun would be seeing the raw API response.</li>
</ol>
<pre>httr2::last_request()
httr2::last_response()
httr2::last_response() |&gt; httr2::resp_body_json()
</pre><ol start="5">
<li>A new function was added
<code>retrieve_linked_data()</code> to retrieve data from a HATEOAS link the API sent back, retrying if throttled</li>
<li>An existing function was removed. With the API changes, there is less of a need for
<code>cast_pv_data()</code> which was previously part of the R package. The API now returns most fields as appropriate
types, boolean, numeric etc., instead of always returning strings.</li>
</ol>
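<p>The new helper can then be pointed at one of those HATEOAS links (the URL below is purely hypothetical, shown only to illustrate the shape of the call; real links come back in an endpoint&#8217;s response):</p>
<pre># hypothetical HATEOAS link taken from an API response
link &lt;- &quot;https://search.patentsview.org/api/v1/...&quot;
linked &lt;- retrieve_linked_data(link)</pre>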
<h2>
Online API documentation
</h2><p>The PatentsView API team has thoughtfully provided a Swagger UI page for the new version of the API at <a href="https://search.patentsview.org/swagger-ui/" rel="nofollow" target="_blank">https://search.patentsview.org/swagger-ui/</a>.
Think of it as an online version of Postman already loaded with the API’s new endpoints and returns.
The Swagger UI page documents what fields are returned by each endpoint on a successful call
(http response code 200).
You can even send in requests and see actual API responses if you enter your API key and press
an endpoint&#8217;s &#8220;Try it out&#8221; and &#8220;Execute&#8221; buttons. Even error responses can be informative: the API&#8217;s X-Status-Reason response header
usually points out what went wrong.</p>
<p>In a similar format, the <a href="https://search.patentsview.org/docs/docs/Search%20API/EndpointDictionary/" rel="nofollow" target="_blank">updated API documentation</a>
lists what each endpoint does. Additionally, the R package’s <code>fieldsdf</code> data frame has been updated,
now listing the new set of endpoints and fields that can be queried and/or returned. The R package’s
reference pages have also been updated.</p>
<h2>
Final thoughts
</h2><p>As shown in the updated <a href="https://docs.ropensci.org/patentsview/articles/top-assignees.html" rel="nofollow" target="_blank">Top Assignees</a> vignette, there will now be occasions where multiple API calls are needed to retrieve data that a single call returned in the original version of the API and R package.
Additionally, the <a href="https://docs.ropensci.org/patentsview/articles/ropensci-blog-post.html" rel="nofollow" target="_blank">reworked rOpenSci post</a> explains what changes had to be made since assignee latitude
and longitude are no longer available from the patent endpoint.</p>
<p>Issues for the R package can be raised in the <a href="https://github.com/ropensci/patentsview/issues" rel="nofollow" target="_blank">patentsview repo</a>.</p>
<p>As we mentioned at the start, the future of the PatentsView API is a bit uncertain. PatentsView is funded by the <a href="https://www.uspto.gov/" rel="nofollow" target="_blank">USPTO</a>, which may be looking to cut costs. However, until we know for certain, we hope patentsview serves you well. If nothing else,
it’s been a great run, starting in 2015 for the API and 2017 for the R package!</p>
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://ropensci.org/blog/2026/03/10/patentsview-breaking-release/"> rOpenSci - open tools for open science</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/breaking-release-of-the-patentsview-r-package/">Breaking Release of the patentsview R Package</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">399746</post-id>	</item>
		<item>
		<title>Formula 1 Analysis in R with f1dataR: Lap Times, Pit Stops, and Driver Performance</title>
		<link>https://www.r-bloggers.com/2026/03/formula-1-analysis-in-r-with-f1datar-lap-times-pit-stops-and-driver-performance/</link>
		
		<dc:creator><![CDATA[rprogrammingbooks]]></dc:creator>
		<pubDate>Mon, 09 Mar 2026 20:32:37 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://rprogrammingbooks.com/?p=2507</guid>

					<description><![CDATA[<p>Formula 1 is one of the most compelling areas for data analysis in R because it combines structured results, lap-by-lap timing, pit strategy, and driver performance into one of the richest datasets in sport. For anyone building authority in technical R content, this is an excellent niche: it is specific enough ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/formula-1-analysis-in-r-with-f1datar-lap-times-pit-stops-and-driver-performance/">Formula 1 Analysis in R with f1dataR: Lap Times, Pit Stops, and Driver Performance</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://rprogrammingbooks.com/formula-1-analysis-r-f1datar/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=formula-1-analysis-r-f1datar"> Blog - R Programming Books</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>

<p>Formula 1 is one of the most compelling areas for data analysis in R because it combines structured results, lap-by-lap timing, pit strategy, and driver performance into one of the richest datasets in sport. For anyone building authority in technical R content, this is an excellent niche: it is specific enough to stand out, but broad enough to support tutorials, visualizations, predictive models, and long-form analytical writing.</p>

<p>One of the biggest advantages of working in this space is that <code>f1dataR</code> gives R users access to both historical Formula 1 data and richer session-level workflows linked to the wider Ergast/Jolpica and FastF1 ecosystem. That makes it possible to move from simple race results into much more interesting questions: Who had the strongest race pace? Which driver managed tyre degradation best? Did a pit stop strategy actually work? Can we build a basic model to estimate race outcomes?</p>

<p>This is where Formula 1 becomes much more than a sports topic. It becomes a practical case study in data wrangling, time-series thinking, feature engineering, visualization, and prediction. And because the R blog space has relatively little deep Formula 1 content compared with more general analytics topics, a strong tutorial here can help position your site as a serious source of expertise.</p>

<h2>Why Formula 1 analysis in R is such a strong niche</h2>

<p>Most R tutorials on the web focus on standard examples: sales dashboards, housing prices, or generic machine learning datasets. Formula 1 is different. The data has context, drama, and a built-in audience. Every race gives you new material to analyze, and every session contains multiple layers of information: qualifying pace, stint length, tyre compounds, safety car timing, sector performance, overtakes, and pit strategy.</p>

<p>That is part of what makes this topic attractive for long-form content. You are not just teaching code. You are showing how code helps explain real competitive decisions. A lap time is not just a number. It is evidence of tyre wear, traffic, fuel load, track evolution, and driver execution.</p>

<p>For readers who want to go deeper into this kind of workflow, resources such as <a href="https://rprogrammingbooks.com/product/racing-with-data-formula-1-and-nascar-analytics-with-r/" rel="nofollow" target="_blank"><em>Racing with Data: Formula 1 and NASCAR Analytics with R</em></a> are useful because they reinforce the idea that racing analytics in R can go well beyond basic charts and into serious, code-driven analysis.</p>

<h2>Installing the packages</h2>

<p>The first step is to set up a workflow that is both reproducible and flexible. For most Formula 1 analysis projects in R, you will want <code>f1dataR</code> plus a small set of packages for data cleaning, plotting, reporting, and modeling.</p>

<pre>install.packages(c(
  &quot;f1dataR&quot;,
  &quot;tidyverse&quot;,
  &quot;lubridate&quot;,
  &quot;janitor&quot;,
  &quot;scales&quot;,
  &quot;slider&quot;,
  &quot;broom&quot;,
  &quot;tidymodels&quot;,
  &quot;gt&quot;,
  &quot;patchwork&quot;
))

library(f1dataR)
library(tidyverse)
library(lubridate)
library(janitor)
library(scales)
library(slider)
library(broom)
library(tidymodels)
library(gt)
library(patchwork)</pre>

<p>If you want to work with official session-level timing data, it is also a good idea to configure FastF1 support and define a local cache.</p>

<pre>setup_fastf1()

options(f1dataR.cache = &quot;f1_cache&quot;)
dir.create(&quot;f1_cache&quot;, showWarnings = FALSE)</pre>

<p>That may look like a small detail, but caching matters when you are building serious analytical content. It makes your workflow faster, cleaner, and much easier to reproduce when updating notebooks, reports, or blog posts later.</p>

<h2>Start with race results</h2>

<p>Before diving into laps and strategy, start with historical race results. They provide the backbone for season summaries, driver comparisons, constructor trends, and predictive features.</p>

<pre>results_2024 &lt;- load_results(season = 2024)

results_2024 %&gt;%
  clean_names() %&gt;%
  select(round, race_name, driver, constructor, grid, position, points, status) %&gt;%
  glimpse()</pre>

<p>Once the results are loaded, you can build a season summary table that gives readers an immediate overview of the competitive picture.</p>

<pre>season_table &lt;- results_2024 %&gt;%
  clean_names() %&gt;%
  group_by(driver, constructor) %&gt;%
  summarise(
    races = n(),
    wins = sum(position == 1, na.rm = TRUE),
    podiums = sum(position &lt;= 3, na.rm = TRUE),
    avg_finish = mean(position, na.rm = TRUE),
    avg_grid = mean(grid, na.rm = TRUE),
    points = sum(points, na.rm = TRUE),
    .groups = &quot;drop&quot;
  ) %&gt;%
  arrange(desc(points), avg_finish)

season_table</pre>

<p>You can also convert that summary into a cleaner publication table for a blog or report.</p>

<pre>season_table %&gt;%
  mutate(
    avg_finish = round(avg_finish, 2),
    avg_grid = round(avg_grid, 2)
  ) %&gt;%
  gt() %&gt;%
  tab_header(
    title = &quot;2024 Driver Season Summary&quot;,
    subtitle = &quot;Wins, podiums, average finish, and points&quot;
  )</pre>

<p>This type of summary is useful, but by itself it does not explain much about how results were achieved. That is why the next step matters.</p>

<h2>Looking beyond the finishing position</h2>

<p>One of the easiest ways to improve an F1 analysis is to move beyond final classification. A driver finishing sixth may have delivered an excellent performance in a midfield car, while a podium in a dominant car may tell a much simpler story. A stronger framework compares results to starting position, teammate performance, and race pace.</p>

<p>A good place to begin is position gain.</p>

<pre>position_gain_table &lt;- results_2024 %&gt;%
  clean_names() %&gt;%
  mutate(
    position_gain = grid - position
  ) %&gt;%
  group_by(driver, constructor) %&gt;%
  summarise(
    mean_gain = mean(position_gain, na.rm = TRUE),
    median_gain = median(position_gain, na.rm = TRUE),
    total_gain = sum(position_gain, na.rm = TRUE),
    races = n(),
    .groups = &quot;drop&quot;
  ) %&gt;%
  arrange(desc(mean_gain))

position_gain_table</pre>

<p>This metric is simple, but it is still valuable because it gives a first signal of race execution. Of course, it has limits. Front-runners have less room to gain places, and midfield races are often influenced by strategy variance, incidents, and reliability. Still, that nuance is exactly what makes the discussion interesting.</p>
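<p>The front-runner ceiling is easy to see with made-up numbers (hypothetical drivers, not real results): a pole-sitter can at best hold station, while a back-marker has the whole field ahead to gain from.</p>
<pre># Toy illustration of the position-gain ceiling
toy &lt;- data.frame(
  driver   = c(&quot;A&quot;, &quot;B&quot;, &quot;C&quot;),
  grid     = c(1, 10, 18),
  position = c(1, 6, 11)
)
toy$position_gain &lt;- toy$grid - toy$position
toy  # A gains 0 places, B gains 4, C gains 7</pre>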

<h2>Add race and circuit context</h2>

<p>Formula 1 performance is always track-dependent. Some cars are stronger on high-speed circuits, some drivers thrive on street tracks, and some teams handle tyre-sensitive venues better than others. Joining race results with schedule data allows you to frame those questions more clearly.</p>

<pre>schedule_2024 &lt;- load_schedule(season = 2024) %&gt;%
  clean_names()

results_with_schedule &lt;- results_2024 %&gt;%
  clean_names() %&gt;%
  left_join(
    schedule_2024 %&gt;%
      select(round, race_name, circuit_name, locality, country, race_date),
    by = c(&quot;round&quot;, &quot;race_name&quot;)
  )

results_with_schedule %&gt;%
  select(round, race_name, circuit_name, country, driver, constructor, grid, position) %&gt;%
  slice_head(n = 10)</pre>

<p>Even at this stage, you already have enough structure to write multiple types of posts: best performing drivers by circuit type, constructor consistency across the season, teammate gaps by venue, or overperformance relative to starting position.</p>
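<p>Even a rough cut at one of those questions takes only a few lines. This is a sketch that assumes the <code>results_with_schedule</code> data frame built above; grouping by country is a stand-in for a proper circuit-type classification.</p>
<pre>results_with_schedule %&gt;%
  group_by(driver, country) %&gt;%
  summarise(
    avg_finish = mean(position, na.rm = TRUE),
    avg_gain   = mean(grid - position, na.rm = TRUE),
    races      = n(),
    .groups    = &quot;drop&quot;
  ) %&gt;%
  arrange(avg_finish)</pre>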

<h2>Lap times: where the analysis gets serious</h2>

<p>Race results tell you what happened. Lap times tell you how it happened. This is where Formula 1 analysis becomes much more valuable, because you can begin to evaluate race pace, traffic effects, tyre degradation, and the shape of a driver’s performance over the full event.</p>

<p>It is usually best to focus on one race session first, especially if your goal is to explain the process clearly.</p>

<pre>session_laps &lt;- load_laps(
  season = 2024,
  round = 10,
  session = &quot;R&quot;
) %&gt;%
  clean_names()

session_laps %&gt;%
  select(driver, lap_number, lap_time, compound, tyre_life, stint, pit_out_time, pit_in_time) %&gt;%
  glimpse()</pre>

<p>Lap time fields often need cleaning before they are suitable for visualization or modeling. Converting them into seconds is usually the most practical approach.</p>

<pre>laps_clean &lt;- session_laps %&gt;%
  mutate(
    lap_time_seconds = as.numeric(lap_time),
    sector1_seconds = as.numeric(sector_1_time),
    sector2_seconds = as.numeric(sector_2_time),
    sector3_seconds = as.numeric(sector_3_time)
  ) %&gt;%
  filter(!is.na(lap_time_seconds)) %&gt;%
  filter(lap_time_seconds &gt; 50, lap_time_seconds &lt; 200)

summary(laps_clean$lap_time_seconds)</pre>

<h2>Comparing race pace by driver</h2>

<p>Once the lap data is cleaned, you can compare selected drivers and visualize how their pace evolves through the race.</p>

<pre>selected_drivers &lt;- c(&quot;VER&quot;, &quot;NOR&quot;, &quot;LEC&quot;, &quot;HAM&quot;)

laps_clean %&gt;%
  filter(driver %in% selected_drivers) %&gt;%
  ggplot(aes(x = lap_number, y = lap_time_seconds, color = driver)) +
  geom_line(alpha = 0.8, linewidth = 0.8) +
  geom_point(size = 1.2, alpha = 0.7) +
  scale_y_continuous(labels = label_number(accuracy = 0.1)) +
  labs(
    title = &quot;Race pace by lap&quot;,
    subtitle = &quot;Raw lap times across the Grand Prix&quot;,
    x = &quot;Lap&quot;,
    y = &quot;Lap time (seconds)&quot;,
    color = &quot;Driver&quot;
  ) +
  theme_minimal(base_size = 13)</pre>

<p>Raw lap time plots are useful, but they are often noisy because pit laps, out-laps, and unusual traffic can distort the pattern. A stronger analysis filters some of that noise and focuses on green-flag pace.</p>

<pre>green_flag_laps &lt;- laps_clean %&gt;%
  filter(driver %in% selected_drivers) %&gt;%
  filter(is.na(pit_in_time), is.na(pit_out_time)) %&gt;%
  group_by(driver) %&gt;%
  mutate(
    median_lap = median(lap_time_seconds, na.rm = TRUE),
    lap_delta = lap_time_seconds - median_lap
  ) %&gt;%
  ungroup() %&gt;%
  filter(abs(lap_delta) &lt; 5)

green_flag_laps %&gt;%
  ggplot(aes(lap_number, lap_time_seconds, color = driver)) +
  geom_line(linewidth = 0.9) +
  geom_smooth(se = FALSE, method = &quot;loess&quot;, span = 0.25, linewidth = 1.1) +
  labs(
    title = &quot;Green-flag race pace&quot;,
    subtitle = &quot;Smoothed lap-time profile after removing pit laps and large outliers&quot;,
    x = &quot;Lap&quot;,
    y = &quot;Lap time (seconds)&quot;
  ) +
  theme_minimal(base_size = 13)</pre>

<p>This kind of chart is one of the most useful in F1 analytics because it shows whether a driver was genuinely fast, merely benefiting from track position, or fading late in the race.</p>
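<p>One way to put a number on &#8220;fading late&#8221; is to compare each driver&#8217;s median pace across the two halves of the race (a sketch reusing the <code>green_flag_laps</code> data built above):</p>
<pre>green_flag_laps %&gt;%
  group_by(driver) %&gt;%
  mutate(half = if_else(lap_number &lt;= max(lap_number) / 2, &quot;first&quot;, &quot;second&quot;)) %&gt;%
  group_by(driver, half) %&gt;%
  summarise(median_pace = median(lap_time_seconds), .groups = &quot;drop&quot;) %&gt;%
  pivot_wider(names_from = half, values_from = median_pace) %&gt;%
  mutate(fade = second - first)  # positive = slower in the second half</pre>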

<h2>Tyre degradation and stint analysis</h2>

<p>One of the best ways to add real authority to an F1 post is to quantify degradation. Instead of simply saying a driver “managed tyres well,” you can estimate how lap time changed as tyre life increased during a stint.</p>

<pre>stint_degradation &lt;- laps_clean %&gt;%
  filter(driver %in% selected_drivers) %&gt;%
  filter(!is.na(stint), !is.na(tyre_life), !is.na(compound)) %&gt;%
  filter(is.na(pit_in_time), is.na(pit_out_time)) %&gt;%
  group_by(driver, stint, compound) %&gt;%
  filter(n() &gt;= 8) %&gt;%
  nest() %&gt;%
  mutate(
    model = map(data, ~ lm(lap_time_seconds ~ tyre_life, data = .x)),
    tidied = map(model, broom::tidy)
  ) %&gt;%
  unnest(tidied) %&gt;%
  filter(term == &quot;tyre_life&quot;) %&gt;%
  transmute(
    driver,
    stint,
    compound,
    degradation_per_lap = estimate,
    p_value = p.value
  ) %&gt;%
  arrange(degradation_per_lap)

stint_degradation</pre>

<p>A positive slope generally means pace is dropping as the stint gets older. A smaller slope suggests better tyre preservation or more stable pace. The interpretation is not always simple, because race context matters, but the method is very effective for turning race discussion into evidence.</p>
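<p>As a quick sanity check on that interpretation, the slope recovery can be verified on synthetic data. The sketch below (base R only, with made-up numbers rather than real session laps) simulates a stint with a known degradation of 0.06 s per lap, confirms that <code>lm()</code> recovers it, and converts the slope into total pace lost across the stint.</p>

```r
# Synthetic stint: 92 s base pace plus 0.06 s of degradation per lap of tyre life
set.seed(1)
tyre_life <- 1:20
lap_time_seconds <- 92 + 0.06 * tyre_life + rnorm(20, sd = 0.05)

fit <- lm(lap_time_seconds ~ tyre_life)
slope <- coef(fit)[["tyre_life"]]

# cumulative pace loss across the stint, relative to lap-1 pace
time_lost <- slope * (max(tyre_life) - min(tyre_life))
round(c(degradation_per_lap = slope, total_loss_s = time_lost), 3)
```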

<pre>laps_clean %&gt;%
  filter(driver %in% selected_drivers, !is.na(stint), !is.na(tyre_life)) %&gt;%
  filter(is.na(pit_in_time), is.na(pit_out_time)) %&gt;%
  ggplot(aes(tyre_life, lap_time_seconds, color = driver)) +
  geom_point(alpha = 0.5, size = 1.6) +
  geom_smooth(method = &quot;lm&quot;, se = FALSE, linewidth = 1) +
  facet_wrap(~ compound, scales = &quot;free_x&quot;) +
  labs(
    title = &quot;Tyre degradation by compound&quot;,
    subtitle = &quot;Linear approximation of pace loss as the stint ages&quot;,
    x = &quot;Tyre life (laps)&quot;,
    y = &quot;Lap time (seconds)&quot;
  ) +
  theme_minimal(base_size = 13)</pre>

<p>This is exactly the kind of analysis that makes a technical article memorable, because it moves from “who won?” to “why did the performance pattern look the way it did?”</p>

<h2>Pit stops and strategy</h2>

<p>Pit strategy is one of the clearest examples of how Formula 1 combines data and decision-making. A stop is not just an event; it is a trade-off between track position, tyre life, race pace, and the behaviour of nearby competitors.</p>

<pre>pit_summary &lt;- session_laps %&gt;%
  clean_names() %&gt;%
  mutate(
    had_pit_event = !is.na(pit_out_time) | !is.na(pit_in_time)
  ) %&gt;%
  group_by(driver) %&gt;%
  summarise(
    total_laps = n(),
    pit_events = sum(had_pit_event, na.rm = TRUE),
    stints = n_distinct(stint, na.rm = TRUE),
    first_compound = first(na.omit(compound)),
    last_compound = last(na.omit(compound)),
    .groups = &quot;drop&quot;
  ) %&gt;%
  arrange(desc(pit_events))

pit_summary</pre>

<p>A better way to explain strategy is to reconstruct the stints directly.</p>

<pre>strategy_table &lt;- session_laps %&gt;%
  clean_names() %&gt;%
  arrange(driver, lap_number) %&gt;%
  group_by(driver, stint) %&gt;%
  summarise(
    start_lap = min(lap_number, na.rm = TRUE),
    end_lap = max(lap_number, na.rm = TRUE),
    laps_in_stint = n(),
    compound = first(na.omit(compound)),
    avg_lap = mean(as.numeric(lap_time), na.rm = TRUE),
    median_lap = median(as.numeric(lap_time), na.rm = TRUE),
    .groups = &quot;drop&quot;
  ) %&gt;%
  arrange(driver, stint)

strategy_table

strategy_table %&gt;%
  ggplot(aes(x = start_lap, xend = end_lap, y = driver, yend = driver, color = compound)) +
  geom_segment(linewidth = 6, lineend = &quot;round&quot;) +
  labs(
    title = &quot;Race strategy by driver&quot;,
    subtitle = &quot;Stint map reconstructed from lap-level data&quot;,
    x = &quot;Lap window&quot;,
    y = &quot;Driver&quot;,
    color = &quot;Compound&quot;
  ) +
  theme_minimal(base_size = 13)</pre>

<p>Once you have stint maps, your analysis immediately becomes more strategic. You can discuss undercuts, overcuts, long first stints, aggressive early stops, and whether a team actually converted tyre freshness into meaningful gains.</p>

<h2>Measuring post-stop pace</h2>

<p>A useful extension is to examine whether a driver actually benefitted from fresh tyres after a stop. That is one of the simplest ways to move from descriptive pit analysis into strategic interpretation.</p>

<pre>post_stop_pace &lt;- session_laps %&gt;%
  clean_names() %&gt;%
  arrange(driver, lap_number) %&gt;%
  group_by(driver) %&gt;%
  mutate(
    pit_out_lap = !is.na(pit_out_time),
    # running count of completed stops, i.e. a stint index for each lap
    stint_index = cumsum(lag(pit_out_lap, default = FALSE))
  ) %&gt;%
  ungroup() %&gt;%
  filter(!is.na(lap_time)) %&gt;%
  group_by(driver, stint_index) %&gt;%
  summarise(
    first_laps_avg = mean(as.numeric(lap_time)[1:min(3, n())], na.rm = TRUE),
    stint_avg = mean(as.numeric(lap_time), na.rm = TRUE),
    .groups = &quot;drop&quot;
  )

post_stop_pace</pre>

<p>This kind of table helps answer a much better question than “when did they pit?” It asks: “Did the stop create usable pace, and was that pace strong enough to influence the race?”</p>
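<p>A minimal, self-contained illustration of that question (hand-made numbers, not real laps): compare the average of the last three green-flag laps before a stop with the first three after it, then ask how many laps of that gain are needed to repay the time lost in the pit lane. The 20-second pit loss is an assumption for the sketch, not a measured value.</p>

```r
# Hypothetical lap times (seconds) around one stop; the in-lap and out-lap
# themselves are excluded, as in the pit-lap filters used earlier
before_stop <- c(94.1, 94.3, 94.6)   # old tyres, fading pace
after_stop  <- c(92.8, 92.9, 93.0)   # fresh tyres

pace_gain <- mean(before_stop) - mean(after_stop)   # seconds per lap

pit_loss_s <- 20                      # assumed total time lost by stopping
payback_laps <- pit_loss_s / pace_gain
round(c(pace_gain = pace_gain, payback_laps = payback_laps), 2)
```

<p>If <code>payback_laps</code> exceeds the laps remaining, the stop could not have paid for itself on pace alone.</p>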

<h2>Teammate comparison as the best benchmark</h2>

<p>In Formula 1, teammate comparison is often more informative than full-grid comparison because the car is the closest thing to a controlled environment. If one driver consistently beats the other in grid position, race finish, or pace consistency, that tells you something much more precise than the overall championship table.</p>

<pre>teammate_table &lt;- results_2024 %&gt;%
  clean_names() %&gt;%
  group_by(constructor, round, race_name) %&gt;%
  mutate(
    teammate_finish_rank = min_rank(position),
    teammate_grid_rank = min_rank(grid)
  ) %&gt;%
  ungroup() %&gt;%
  group_by(driver, constructor) %&gt;%
  summarise(
    avg_finish = mean(position, na.rm = TRUE),
    avg_grid = mean(grid, na.rm = TRUE),
    teammate_beating_rate_finish = mean(teammate_finish_rank == 1, na.rm = TRUE),
    teammate_beating_rate_grid = mean(teammate_grid_rank == 1, na.rm = TRUE),
    points = sum(points, na.rm = TRUE),
    .groups = &quot;drop&quot;
  ) %&gt;%
  arrange(desc(teammate_beating_rate_finish), desc(points))

teammate_table</pre>

<p>That kind of comparison is especially strong in a technical post because it gives readers a benchmark they already understand intuitively, while still grounding the discussion in data.</p>

<h2>Sector analysis</h2>

<p>If lap times tell you the overall pace story, sectors can help reveal where that pace is being gained or lost. Even without diving into full telemetry, sector splits can expose whether a driver is strong in traction zones, high-speed sections, or braking-heavy parts of the circuit.</p>

<pre>sector_summary &lt;- laps_clean %&gt;%
  filter(driver %in% selected_drivers) %&gt;%
  group_by(driver) %&gt;%
  summarise(
    s1 = mean(sector1_seconds, na.rm = TRUE),
    s2 = mean(sector2_seconds, na.rm = TRUE),
    s3 = mean(sector3_seconds, na.rm = TRUE),
    total = mean(lap_time_seconds, na.rm = TRUE),
    .groups = &quot;drop&quot;
  ) %&gt;%
  pivot_longer(cols = c(s1, s2, s3), names_to = &quot;sector&quot;, values_to = &quot;seconds&quot;)

sector_summary %&gt;%
  ggplot(aes(sector, seconds, fill = driver)) +
  geom_col(position = &quot;dodge&quot;) +
  labs(
    title = &quot;Average sector times by driver&quot;,
    subtitle = &quot;A simple way to localize pace differences&quot;,
    x = &quot;Sector&quot;,
    y = &quot;Average time (seconds)&quot;,
    fill = &quot;Driver&quot;
  ) +
  theme_minimal(base_size = 13)</pre>

<p>This type of breakdown is useful because it adds shape to the analysis. Instead of saying a driver was faster overall, you can show where the time was coming from.</p>

<h2>From description to prediction</h2>

<p>One of the strongest editorial angles for an article like this is to end with a predictive modeling section. A title such as <em>Formula 1 Data Science in R: Predicting Race Results</em> works well because it combines clear intent, technical interest, and a topic with built-in audience appeal.</p>

<p>The key is to be realistic. The purpose is not to promise perfect forecasts. It is to show how descriptive Formula 1 data can be converted into features for a baseline model.</p>

<pre>model_data &lt;- results_2024 %&gt;%
  clean_names() %&gt;%
  arrange(driver, round) %&gt;%
  group_by(driver) %&gt;%
  mutate(
    rolling_avg_finish_3 = slide_dbl(position, mean, .before = 2, .complete = FALSE, na.rm = TRUE),
    rolling_avg_grid_3 = slide_dbl(grid, mean, .before = 2, .complete = FALSE, na.rm = TRUE),
    rolling_points_3 = slide_dbl(points, mean, .before = 2, .complete = FALSE, na.rm = TRUE),
    prev_finish = lag(position),
    prev_grid = lag(grid)
  ) %&gt;%
  ungroup() %&gt;%
  mutate(
    target_top10 = if_else(position &lt;= 10, 1, 0),
    target_podium = if_else(position &lt;= 3, 1, 0)
  ) %&gt;%
  select(
    round, race_name, driver, constructor, grid, points, position,
    rolling_avg_finish_3, rolling_avg_grid_3, rolling_points_3,
    prev_finish, prev_grid, target_top10, target_podium
  ) %&gt;%
  drop_na()

glimpse(model_data)</pre>

<p>This dataset is intentionally simple, but that is a strength in a tutorial. It makes the logic visible and gives readers something they can actually reproduce and extend.</p>
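<p>One natural place to start extending is the rolling features themselves. The <code>slide_dbl(..., .before = 2)</code> calls compute a trailing mean over the current race and up to two previous ones; if you want to verify that logic, or avoid the <code>slider</code> dependency, the same window can be written in base R. A sketch on a made-up vector of finishing positions:</p>

```r
# Trailing mean over a window of up to 3 races, matching the intent of
# slide_dbl(x, mean, .before = 2): partial windows at the start are allowed
rolling_mean_3 <- function(x) {
  sapply(seq_along(x), function(i) mean(x[max(1, i - 2):i], na.rm = TRUE))
}

finishes <- c(4, 2, 7, 1, 3)   # hypothetical finishing positions
round(rolling_mean_3(finishes), 2)
```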

<h2>Predicting a top-10 finish</h2>

<pre>set.seed(42)

# logistic_reg() needs a factor outcome, so convert the 0/1 target first
model_data &lt;- model_data %&gt;%
  mutate(target_top10 = factor(target_top10, levels = c(0, 1)))

split_obj &lt;- initial_split(model_data, prop = 0.8, strata = target_top10)
train_data &lt;- training(split_obj)
test_data &lt;- testing(split_obj)

log_recipe &lt;- recipe(
  target_top10 ~ grid + rolling_avg_finish_3 + rolling_avg_grid_3 +
    rolling_points_3 + prev_finish + prev_grid,
  data = train_data
) %&gt;%
  step_impute_median(all_numeric_predictors()) %&gt;%
  step_normalize(all_numeric_predictors())

log_spec &lt;- logistic_reg() %&gt;%
  set_engine(&quot;glm&quot;)

log_workflow &lt;- workflow() %&gt;%
  add_recipe(log_recipe) %&gt;%
  add_model(log_spec)

log_fit &lt;- fit(log_workflow, data = train_data)

top10_predictions &lt;- predict(log_fit, new_data = test_data, type = &quot;prob&quot;) %&gt;%
  bind_cols(predict(log_fit, new_data = test_data)) %&gt;%
  bind_cols(test_data %&gt;% select(target_top10))

top10_predictions

top10_predictions %&gt;%
  roc_auc(truth = factor(target_top10), .pred_1, event_level = &quot;second&quot;)

top10_predictions %&gt;%
  accuracy(truth = factor(target_top10), estimate = .pred_class)</pre>

<h2>Predicting finishing position</h2>

<pre>finish_recipe &lt;- recipe(
  position ~ grid + rolling_avg_finish_3 + rolling_avg_grid_3 +
    rolling_points_3 + prev_finish + prev_grid,
  data = train_data
) %&gt;%
  step_impute_median(all_numeric_predictors()) %&gt;%
  step_normalize(all_numeric_predictors())

lm_spec &lt;- linear_reg() %&gt;%
  set_engine(&quot;lm&quot;)

lm_workflow &lt;- workflow() %&gt;%
  add_recipe(finish_recipe) %&gt;%
  add_model(lm_spec)

lm_fit &lt;- fit(lm_workflow, data = train_data)

finish_predictions &lt;- predict(lm_fit, new_data = test_data) %&gt;%
  bind_cols(test_data %&gt;% select(position, driver, constructor, race_name, grid))

metrics(finish_predictions, truth = position, estimate = .pred)

finish_predictions %&gt;%
  ggplot(aes(position, .pred)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_abline(slope = 1, intercept = 0, linetype = &quot;dashed&quot;) +
  labs(
    title = &quot;Predicted vs actual finishing position&quot;,
    subtitle = &quot;Baseline linear model&quot;,
    x = &quot;Actual finish&quot;,
    y = &quot;Predicted finish&quot;
  ) +
  theme_minimal(base_size = 13)</pre>

<p>A baseline model like this is not meant to be a perfect forecasting system. Its real value is educational. It shows how to move from results tables to feature engineering, then from features into a reproducible predictive workflow.</p>

<h2>A simple custom driver rating</h2>

<p>If you want the article to feel more original, one strong option is to create a custom driver score. Composite metrics work well in Formula 1 writing because they combine multiple dimensions of performance into one interpretable ranking.</p>

<pre>driver_rating &lt;- results_2024 %&gt;%
  clean_names() %&gt;%
  group_by(driver, constructor) %&gt;%
  summarise(
    avg_finish = mean(position, na.rm = TRUE),
    avg_grid = mean(grid, na.rm = TRUE),
    points = sum(points, na.rm = TRUE),
    wins = sum(position == 1, na.rm = TRUE),
    podiums = sum(position &lt;= 3, na.rm = TRUE),
    gain = mean(grid - position, na.rm = TRUE),
    .groups = &quot;drop&quot;
  ) %&gt;%
  mutate(
    finish_score = rescale(-avg_finish, to = c(0, 100)),
    grid_score = rescale(-avg_grid, to = c(0, 100)),
    points_score = rescale(points, to = c(0, 100)),
    gain_score = rescale(gain, to = c(0, 100)),
    win_score = rescale(wins, to = c(0, 100)),
    rating = 0.30 * finish_score +
             0.20 * grid_score +
             0.25 * points_score +
             0.15 * gain_score +
             0.10 * win_score
  ) %&gt;%
  arrange(desc(rating))

driver_rating</pre>

<p>The important thing here is transparency. Readers do not need to agree with every weight in the formula. What matters is that the method is explicit, interpretable, and easy to critique or improve.</p>
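<p>That critique is easy to make concrete: re-run the rating under a different weight vector and see whether the ordering changes. A small sketch with two fictional drivers and hand-made component scores:</p>

```r
# Hand-made 0-100 component scores for two fictional drivers
scores <- data.frame(
  driver       = c("A", "B"),
  finish_score = c(90, 70),
  points_score = c(60, 95)
)

rate <- function(w_finish, w_points) {
  with(scores, w_finish * finish_score + w_points * points_score)
}

finish_heavy <- rate(0.7, 0.3)   # driver A comes out ahead
points_heavy <- rate(0.3, 0.7)   # driver B comes out ahead
rbind(finish_heavy, points_heavy)
```

<p>The ranking flips with the weights, which is exactly why publishing the formula matters.</p>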

<h2>Final thoughts</h2>

<p>Formula 1 analysis in R is an unusually strong content niche because it combines technical rigor with a naturally engaged audience. With <code>f1dataR</code>, you can begin with historical race results, move into lap-time and stint analysis, explore pit strategy and driver benchmarking, and then build baseline predictive models that make the workflow feel complete.</p>

<p>That range is exactly what makes this such a good topic for an authority-building article. It is practical, it is reproducible, and it opens the door to an entire cluster of follow-up posts on telemetry, qualifying, tyre degradation, teammate comparisons, and race prediction.</p>

<p>If your goal is to publish technical content that demonstrates real expertise rather than just covering surface-level examples, Formula 1 data science in R is one of the best domains you can choose.</p>
<p>The post <a href="https://rprogrammingbooks.com/formula-1-analysis-r-f1datar/" rel="nofollow" target="_blank">Formula 1 Analysis in R with f1dataR: Lap Times, Pit Stops, and Driver Performance</a> appeared first on <a href="https://rprogrammingbooks.com/" rel="nofollow" target="_blank">R Programming Books</a>.</p>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://rprogrammingbooks.com/formula-1-analysis-r-f1datar/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=formula-1-analysis-r-f1datar"> Blog - R Programming Books</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/formula-1-analysis-in-r-with-f1datar-lap-times-pit-stops-and-driver-performance/">Formula 1 Analysis in R with f1dataR: Lap Times, Pit Stops, and Driver Performance</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">399732</post-id>	</item>
		<item>
		<title>Get Better: loading multiple csv files in R</title>
		<link>https://www.r-bloggers.com/2026/03/get-better-loading-multiple-csv-files-in-r/</link>
		
		<dc:creator><![CDATA[Stephen Royle]]></dc:creator>
		<pubDate>Mon, 09 Mar 2026 15:19:18 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://quantixed.org/?p=3706</guid>

					<description><![CDATA[<p>In a previous post, I described how to run a session to teach R to cell biologists. In this post we’ll look in a bit more detail at one of the steps: how to load data into R. As a reminder, a typical analysis task in cell biology follows ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/get-better-loading-multiple-csv-files-in-r/">Get Better: loading multiple csv files in R</a>]]></description>
<content:encoded><![CDATA[

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://quantixed.org/2026/03/09/get-better-loading-multiple-csv-files-in-r/"> Rstats – quantixed</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
</div>

<p>In a previous post, I described <a href="https://quantixed.org/2025/01/20/get-better-r-for-cell-biologists/" data-type="post" data-id="3389" rel="nofollow" target="_blank">how to run a session to teach R to cell biologists</a>. In this post we’ll look in a bit more detail at one of the steps: how to load data into R.</p>



<p>As a reminder, a typical analysis task in cell biology follows these steps:</p>



<ul class="wp-block-list">
<li>do the experiment(s)</li>



<li>collect the data – e.g. microscopy images</li>



<li>analyse and get a plain text (csv) output – e.g. using Fiji</li>



<li>load the data into R</li>



<li>crunch some numbers and plot</li>
</ul>



<h2 class="wp-block-heading">Loading data into R</h2>



<p>In the test dataset from the previous post we have a single folder of 80 csv files: 4 experiments, 2 conditions, 10 files from each. Each file has 20 rows of data. See below <a href="https://quantixed.org/2026/03/09/get-better-loading-multiple-csv-files-in-r/#scenarios" rel="nofollow" target="_blank">for other scenarios</a>.</p>



<p>The goal is to load all of the data into R and assemble it into a single data frame. There are many ways to do this; I’ll show three of the most popular.</p>



<ul class="wp-block-list">
<li>base R</li>



<li>tidyverse</li>



<li>data.table</li>
</ul>


<pre>
# Demonstrate three ways to load all CSV files in Data:
# 1) base R
# 2) tidyverse (readr + dplyr + purrr)
# 3) data.table

data_dir &lt;- &quot;Data&quot;
csv_files &lt;- list.files(data_dir, pattern = &quot;\\.csv$&quot;, full.names = TRUE)

# --- 1) base R ---------------------------------------------------------------
base_list &lt;- lapply(csv_files, function(path) {
	df &lt;- read.csv(path)
	df$source_file &lt;- basename(path)
	df
})

base_all &lt;- do.call(rbind, base_list)

# --- 2) tidyverse ------------------------------------------------------------
if (!requireNamespace(&quot;readr&quot;, quietly = TRUE) ||
		!requireNamespace(&quot;dplyr&quot;, quietly = TRUE) ||
		!requireNamespace(&quot;purrr&quot;, quietly = TRUE)) {
	stop(&quot;Please install tidyverse components: readr, dplyr, purrr&quot;)
}

tidy_all &lt;- purrr::map_dfr(
	csv_files,
	~ readr::read_csv(.x, show_col_types = FALSE) |&gt;
		dplyr::mutate(source_file = basename(.x))
)

# --- 3) data.table -----------------------------------------------------------
if (!requireNamespace(&quot;data.table&quot;, quietly = TRUE)) {
	stop(&quot;Please install data.table&quot;)
}

dt_list &lt;- lapply(csv_files, function(path) {
	dt &lt;- data.table::fread(path)
	dt[, source_file := basename(path)]
	dt
})

dt_all &lt;- data.table::rbindlist(dt_list, use.names = TRUE, fill = TRUE)

# Quick checks
cat(&quot;Files loaded:&quot;, length(csv_files), &quot;\n&quot;)
cat(&quot;Rows (base):&quot;, nrow(base_all), &quot;\n&quot;)
cat(&quot;Rows (tidyverse):&quot;, nrow(tidy_all), &quot;\n&quot;)
cat(&quot;Rows (data.table):&quot;, nrow(dt_all), &quot;\n&quot;)
</pre>


<p>In each case we list the files, then use this list to load each item and assemble the results into a single large data frame.</p>



<p>We need to know which rows of the large data frame came from which file. This is essential if there is no identifier within the data. So in each case, after loading, we add the name of the file as a new column called <code>source_file</code>. Then we assemble these modified data frames into a single large data frame.</p>



<p>Here is the output from the last part of the code:</p>


<pre>
&gt; # Quick checks
&gt; cat(&quot;Files loaded:&quot;, length(csv_files), &quot;\n&quot;)
Files loaded: 80 
&gt; cat(&quot;Rows (base):&quot;, nrow(base_all), &quot;\n&quot;)
Rows (base): 1600 
&gt; cat(&quot;Rows (tidyverse):&quot;, nrow(tidy_all), &quot;\n&quot;)
Rows (tidyverse): 1600 
&gt; cat(&quot;Rows (data.table):&quot;, nrow(dt_all), &quot;\n&quot;)
Rows (data.table): 1600 
&gt; 
</pre>


<p>The result is identical with all three approaches.</p>



<p>My preferred strategy is to use base R for tasks like this. Generally, it is best to stick to base R rather than relying on libraries. In terms of speed, <code>{data.table}</code> is renowned for being fast, so if your data is massive it is worth using. However, for a set of 80 small files, speed is not a concern, and base R performs very well. It could be that you prefer the tidyverse syntax and find it easier to understand, in which case go for it. Otherwise my advice is to stick to base R.</p>



<p>Note that, with each of these three approaches, there are several different ways to achieve the same thing. I am only presenting one. For example, with base R you may see examples where a for-loop is used to achieve the same thing as <code>lapply</code>. This approach is slower than the one shown here although it is arguably more readable.</p>
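<p>For reference, here is the for-loop version of the base R load. To keep the example runnable anywhere, it first writes two tiny csv files into a throwaway temp folder; in practice you would point <code>data_dir</code> at your own <code>Data/</code> folder. Growing the data frame with <code>rbind()</code> inside the loop is the part that makes this slow on large file sets.</p>

```r
# throwaway demo folder with two small csv files, so the loop runs anywhere
data_dir <- file.path(tempdir(), "Data")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(Area = 1:3), file.path(data_dir, "control_n1_1.csv"), row.names = FALSE)
write.csv(data.frame(Area = 4:6), file.path(data_dir, "rapa_n1_1.csv"), row.names = FALSE)

csv_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# for-loop equivalent of the lapply() approach
base_all <- NULL
for (path in csv_files) {
  df <- read.csv(path)
  df$source_file <- basename(path)
  base_all <- rbind(base_all, df)   # growing the result each pass is the slow bit
}
nrow(base_all)   # 6 rows from 2 files
```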



<p>We’ll keep going with base R as we look at a few alternative scenarios that you are likely to encounter.</p>



<h2 class="wp-block-heading" id="scenarios">Alternative scenarios</h2>



<h3 class="wp-block-heading">The base case (in more detail)</h3>



<p>We had a flat directory (<code>Data/</code>) of 80 files, where the filenames encoded the experimental details. Therefore we used this base R code to load the data in.</p>


<pre>
data_dir &lt;- &quot;Data&quot;
csv_files &lt;- list.files(data_dir, pattern = &quot;\\.csv$&quot;, full.names = TRUE)

base_list &lt;- lapply(csv_files, function(path) {
	df &lt;- read.csv(path)
	df$source_file &lt;- basename(path)
	df
})

base_all &lt;- do.call(rbind, base_list)
</pre>


<p>Following this, we need to use the <code>source_file</code> column to add additional columns that signify the experimental details. Our files are called things like <code>control_n2_9.csv</code> or <code>rapa_n3_2.csv</code> – in other words they are of the form <code>condition_experiment_x.csv</code>. The underscores can be used to split the filename with <code>strsplit()</code>, and we can then take the first and second elements of the result and store them in new columns.</p>


<pre>
# source_file column has name of the file, name is of the form foo_bar_1.csv
# extract foo and bar into two columns
base_all$cond &lt;- sapply(strsplit(base_all$source_file, &quot;_&quot;), &quot;[&quot;, 1)
base_all$expt &lt;- sapply(strsplit(base_all$source_file, &quot;_&quot;), &quot;[&quot;, 2)
</pre>


<p>As I said above, there’s always other approaches and here it could be that we do these steps inside the original <code>lapply()</code> call, i.e. before we have assembled the large data frame.</p>


<pre>
base_list &lt;- lapply(csv_files, function(path) {
  df &lt;- read.csv(path)
  df$source_file &lt;- basename(path)
  df$cond &lt;- sapply(strsplit(df$source_file, &quot;_&quot;), &quot;[&quot;, 1)
  df$expt &lt;- sapply(strsplit(df$source_file, &quot;_&quot;), &quot;[&quot;, 2)
  df
})
base_all &lt;- do.call(rbind, base_list)
</pre>


<p>However, if you leave the column wrangling until you have assembled the large data frame the loading part of the code is more likely to be reusable.</p>



<p>Let’s look at a few other scenarios.</p>



<h3 class="wp-block-heading">Only a subset of columns are required</h3>



<p>If the files have many columns, you may only require a subset of columns in your data frame. More rarely, the csv files may have differing numbers of columns. In this case, it isn’t possible to use the code above because we need an equal number of columns to assemble the large data frame.</p>



<p>The solution in both of these cases is to specify which columns to load. We can do:</p>


<pre>
&gt; head(read.csv(csv_files[1]))
  X Area     Mean    StdDev Min Max IntDen RawIntDen
1 1 1248 48.83477  8.864353   0 255  60945     60945
2 2 1248 52.46805 10.564050   0 255  65480     65480
3 3 1248 72.14579  8.947991   0 255  90037     90037
4 4 1248 55.77559  9.542218   0 255  69607     69607
5 5 1248 56.42217  9.749921   0 255  70414     70414
6 6 1248 73.86571  7.626613   0 255  92184     92184
</pre>


<p>This shows the first part (<code>head()</code>) of the first file. Let’s say we only want the Area and Mean columns (plus <code>source_file</code>). We can then do:</p>


<pre>
data_dir &lt;- &quot;Data&quot;
csv_files &lt;- list.files(data_dir, pattern = &quot;\\.csv$&quot;, full.names = TRUE)

my_columns &lt;- c(&quot;Area&quot;, &quot;Mean&quot;)
base_list &lt;- lapply(csv_files, function(path) {
  df &lt;- read.csv(path)
  df &lt;- df[,my_columns]
  df$source_file &lt;- basename(path)
  df
})
base_all &lt;- do.call(rbind, base_list)
</pre>


<p>From here, we can assemble the <code>cond</code> and <code>expt</code> columns as shown above.</p>



<p>So far, all of the information required is encoded in the filename. If this isn’t the case, it is better at this stage to alter the script that generated the csvs so that the necessary information can be read from the filename or alternatively, from the filepath.</p>



<h3 class="wp-block-heading">Identical (non-unique) filenames in different folders</h3>



<p>In the example above, the <em>condition</em>, the <em>experiment</em> and a <em>differentiator</em> were encoded in the filename. It could be that the csv files are organised like this:</p>



<ul class="wp-block-list">
<li>Data/
<ul class="wp-block-list">
<li>Control/
<ul class="wp-block-list">
<li>cell1.csv</li>



<li>cell2.csv</li>
</ul>
</li>



<li>Drug/
<ul class="wp-block-list">
<li>cell1.csv</li>



<li>cell2.csv</li>



<li>cell3.csv</li>
</ul>
</li>
</ul>
</li>
</ul>



<p>or</p>



<ul class="wp-block-list">
<li>Data/
<ul class="wp-block-list">
<li>Expt1/
<ul class="wp-block-list">
<li>Control/
<ul class="wp-block-list">
<li>cell1.csv</li>



<li>cell2.csv</li>
</ul>
</li>



<li>Drug/
<ul class="wp-block-list">
<li>cell1.csv</li>



<li>cell2.csv</li>
</ul>
</li>
</ul>
</li>



<li>Expt2/
<ul class="wp-block-list">
<li>Control/
<ul class="wp-block-list">
<li>cell1.csv</li>



<li>cell2.csv</li>



<li>cell3.csv</li>
</ul>
</li>



<li>Drug/
<ul class="wp-block-list">
<li>cell1.csv</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>



<p>or any other combination. The point being that the filename is no longer unique. There are several files with the same differentiator, and to access the condition or the experiment, we need to manipulate the filepath rather than the filename.</p>


<pre>
data_dir &lt;- &quot;Data&quot;
csv_files &lt;- list.files(data_dir,
                        pattern = &quot;\\.csv$&quot;,
                        full.names = TRUE, # get the full path (folder names)
                        recursive = TRUE) # ensures we look in subfolders of data_dir

base_list &lt;- lapply(csv_files, function(path) {
  df &lt;- read.csv(path)
  df$source_path &lt;- path
  df
})
base_all &lt;- do.call(rbind, base_list)
</pre>


<p>This code block will deal with any arrangement of subfolders within <code>data_dir</code> and assemble the large data frame. This time, we make a column called <code>source_path</code>, which stores the full path of each file.</p>



<p>So, a file called <code>cell2.csv</code> in <code>Expt1/Drug/</code> within the <code>Data/</code> folder in the project will have the <code>source_path</code> of <code>Data/Expt1/Drug/cell2.csv</code>.</p>



<p>We just need to wrangle this path to extract the condition and experiment information.</p>


<pre>
# &quot;experiment&quot; folder
base_all$cond &lt;- sapply(strsplit(base_all$source_path, .Platform$file.sep, fixed = TRUE), &quot;[&quot;, 2)
# folder enclosing file
base_all$expt &lt;- sapply(strsplit(base_all$source_path, .Platform$file.sep, fixed = TRUE), &quot;[&quot;, 3)
base_all$source_file &lt;- basename(base_all$source_path) # get the filename (differentiator)
</pre>


<p>This wrangling step will need to be adjusted to your needs. We use <code>.Platform$file.sep</code> rather than hard-coding <code>&quot;/&quot;</code> or <code>&quot;\&quot;</code> to make the separator explicit; note that paths returned by <code>list.files()</code> use <code>&quot;/&quot;</code> on all platforms, including Windows.</p>



<h2 class="wp-block-heading">Exercises</h2>



<p>Here are three scenarios you might encounter. How would you solve them?</p>



<ol class="wp-block-list">
<li>The files are in <code>Data/</code> and are organised into three condition folders, inside each are four experiment subfolders, each with 10 csv files in.</li>



<li>The files are in <code>Data/</code> and are organised into four experiment folders, inside each are 20 csv files. They are named like this: <code>microscopy-analysis_Control_IF488_cell3.csv</code> and <code>microscopy-analysis_Drug_IF488_cell2.csv</code></li>



<li>The files are located at <code>~/Desktop/</code> and are organised into 12 folders called <code>Control-Expt1</code> or <code>DRUG1-Expt2</code> (there are three conditions and four experiments). The files inside are called <code>cell1.csv</code> etc.</li>



<li>The files are in a single folder, using experiment_condition_differentiator labelling; however, the user has been inconsistent. Sometimes the control is called <code>Control</code>, sometimes <code>Ctrl</code> or <code>ctrl</code>.</li>
</ol>



<h2 class="wp-block-heading">Answers</h2>



<p>Click to reveal.</p>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Problem 1</summary><pre>
data_dir &lt;- &quot;Data&quot;
csv_files &lt;- list.files(data_dir,
                        pattern = &quot;\\.csv$&quot;,
                        full.names = TRUE, # get the full path (folder names)
                        recursive = TRUE) # ensures we look in subfolders of data_dir

base_list &lt;- lapply(csv_files, function(path) {
  df &lt;- read.csv(path)
  df$source_path &lt;- path
  df
})
base_all &lt;- do.call(rbind, base_list)
base_all$cond &lt;- sapply(strsplit(base_all$source_path, .Platform$file.sep, fixed = TRUE), &quot;[&quot;, 3)
base_all$expt &lt;- sapply(strsplit(base_all$source_path, .Platform$file.sep, fixed = TRUE), &quot;[&quot;, 2)
base_all$source_file &lt;- basename(base_all$source_path)
</pre></details>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Problem 2</summary><pre>
data_dir &lt;- &quot;Data&quot;
csv_files &lt;- list.files(data_dir,
                        pattern = &quot;\\.csv$&quot;,
                        full.names = TRUE, # get the full path (folder names)
                        recursive = TRUE) # ensures we look in subfolders of data_dir

base_list &lt;- lapply(csv_files, function(path) {
  df &lt;- read.csv(path)
  df$source_path &lt;- path
  df
})
base_all &lt;- do.call(rbind, base_list)
base_all$expt &lt;- sapply(strsplit(base_all$source_path, .Platform$file.sep, fixed = TRUE), &quot;[&quot;, 2)
base_all$source_file &lt;- basename(base_all$source_path)
base_all$cond &lt;- sapply(strsplit(base_all$source_path, &quot;_&quot;, fixed = TRUE), &quot;[&quot;, 2) # assumes underscores occur only in the filenames
</pre></details>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Problem 3</summary><pre>
# first relocate the files to &quot;Data/&quot; in the project folder and then...
data_dir &lt;- &quot;Data&quot;
csv_files &lt;- list.files(data_dir,
                        pattern = &quot;\\.csv$&quot;,
                        full.names = TRUE, # get the full path (folder names)
                        recursive = TRUE) # ensures we look in subfolders of data_dir

base_list &lt;- lapply(csv_files, function(path) {
  df &lt;- read.csv(path)
  df$source_path &lt;- path
  df
})
base_all &lt;- do.call(rbind, base_list)
base_all$condexpt &lt;- sapply(strsplit(base_all$source_path, .Platform$file.sep, fixed = TRUE), &quot;[&quot;, 2)
base_all$source_file &lt;- basename(base_all$source_path)
base_all$cond &lt;- sapply(strsplit(base_all$condexpt, &quot;-&quot;, fixed = TRUE), &quot;[&quot;, 1)
base_all$expt &lt;- sapply(strsplit(base_all$condexpt, &quot;-&quot;, fixed = TRUE), &quot;[&quot;, 2)
</pre></details>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Problem 4</summary>
<p>There are a number of ways to deal with this problem. If the labelling is very inconsistent, it is best to rerun the analysis (which generated the csv files) in a way that gives consistent labelling. If this is not possible, e.g. the original files are inconsistently named, then you can extract the condition and experiment labels as before and then rename them. Given the mix of upper and lower case, it’s advisable to run <code>tolower()</code> first, figure out which entries are unique to which group, and assign them accordingly. Another approach is to make a data frame showing how the entries should be renamed and use that to rename the labels. The bonus of dealing with a problem like this is that once you have been through the pain, it will make you more consistent when naming things in the future!</p>
</details>
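<p>As an illustration of the lookup-table idea, here is a minimal sketch; the label variants below are invented for illustration, not taken from the exercise data:</p>

```r
# Hypothetical inconsistent condition labels (illustrative only)
raw_labels <- c("Control", "ctrl", "Ctrl", "DRUG1", "drug1")

# 1. lower-case everything so case variants collapse together
lower <- tolower(raw_labels)

# 2. map each known variant to a canonical label via a named lookup vector
lookup <- c(control = "Control", ctrl = "Control", drug1 = "Drug1")
clean <- unname(lookup[lower])
clean
#> [1] "Control" "Control" "Control" "Drug1"   "Drug1"
```

Any variant missing from the lookup comes back as <code>NA</code>, which makes unexpected labels easy to spot.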



<p>—</p>



<p>The post title comes from Get Better by The New Fast Automatic Daffodils.</p>



<p>Part of a series on <a href="https://quantixed.org/category/development/" rel="nofollow" target="_blank">development</a> of lab members’ skills.</p>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://quantixed.org/2026/03/09/get-better-loading-multiple-csv-files-in-r/"> Rstats – quantixed</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/get-better-loading-multiple-csv-files-in-r/">Get Better: loading multiple csv files in R</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">399722</post-id>	</item>
		<item>
		<title>Using Quarto to Write a Book</title>
		<link>https://www.r-bloggers.com/2026/03/using-quarto-to-write-a-book/</link>
		
		<dc:creator><![CDATA[Kieran Healy]]></dc:creator>
		<pubDate>Mon, 09 Mar 2026 13:34:53 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://kieranhealy.org/blog/archives/2026/03/09/using-quarto-to-write-a-book/</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> I’ve spent the last couple of months revising my Data Visualization book for a second edition that, ideally, will appear some time in the next twelve months. As with the first edition, I’ve posted a complete draft of the book at its website...</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/using-quarto-to-write-a-book/">Using Quarto to Write a Book</a>]]></description>
										<content:encoded><![CDATA[

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://kieranhealy.org/blog/archives/2026/03/09/using-quarto-to-write-a-book/"> R on kieranhealy.org</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p>I’ve spent the last couple of months revising my <a href="https://press.princeton.edu/books/hardcover/9780691181615/data-visualization" rel="nofollow" target="_blank">Data Visualization book</a> for a second edition that, ideally, will appear some time in the next twelve months. As with the first edition, I’ve posted a <a href="https://socviz.co/" rel="nofollow" target="_blank">complete draft of the book</a> at its website. The production process hasn’t started yet, so it’s not ready to pre-order or anything, but the site has a one-question <a href="https://forms.gle/4xeALwJLbzdzT8rz7" rel="nofollow" target="_blank">form you can fill out</a> that asks for your email address if you’d like to be notified with one (and only one) email when it’s available. A lot has changed since the first edition, reflecting changes both in R and ggplot specifically, and in the world of coding generally. I may end up highlighting some of those new elements in other posts. But here, I want to focus on some nerdy details involved in getting the book to its final draft. I’ll discuss <a href="https://quarto.org/" rel="nofollow" target="_blank">Quarto</a>, the publishing system I used, its many advantages, and its current limits with respect to the demands I made of it.</p>
<figure class="full-width"><a href="https://i2.wp.com/kieranhealy.org/blog/archives/2026/03/09/using-quarto-to-write-a-book/dv2-distributions-page-detail.png?ssl=1" rel="nofollow" target="_blank">
    <img src="https://i2.wp.com/kieranhealy.org/blog/archives/2026/03/09/using-quarto-to-write-a-book/dv2-distributions-page-detail.png?w=578&#038;ssl=1"
         alt="Page detail from the draft book." data-recalc-dims="1"/></a><figcaption>
            <p>A detail from facing pages in Chapter 4, in the PDF version.</p>
        </figcaption>
</figure>
<p>The book is about doing good data visualization using <a href="https://www.r-project.org/" rel="nofollow" target="_blank">R</a> and <a href="https://ggplot2.tidyverse.org/" rel="nofollow" target="_blank">ggplot</a>. The book contains many figures, almost all of which are written using the code the book shows and explains.</p>
<h3 id="reasonable-demands">Reasonable Demands</h3>
<p>My baseline list of requirements for the book manuscript was as follows:</p>
<ul>
<li>The entire text of the book is in some kind of plain-text format.</li>
<li>Figures in the book that are the result of R code should be directly produced by R code in the actual document; no cutting and pasting of code snippets and separately-produced figures. Doing that is a recipe for error.</li>
<li>The scholarly machinery of the book—chapter, section, table, and figure numbering; cross-references; in-text bibliographical references; the bibliography itself and its formatting, and so on—should be automatically handled. No manual numbering and renumbering of figures, etc.</li>
<li>It should be straightforward to repeatedly generate a fully-formatted and laid-out version of the book manuscript as I go, ideally in any of several output formats (e.g. PDF, HTML), despite it all being written in plain text.</li>
</ul>
<p>These requirements are reasonable because, for projects like this, working in <a href="https://plain-text.co/" rel="nofollow" target="_blank">plain text is the right thing to do</a>. We are writing and revising text and our code; we keep the text in a version control system; we don’t want the results of the code to come apart from the code that generated it; and we need to deliver outputs that consist both of fully-formatted material and replication packages that allow other people to see what we did. PDF is of course <a href="https://kieranhealy.org/blog/archives/2025/02/06/kerning-and-kerning-in-a-widening-gyre/" rel="nofollow" target="_blank">the worst</a>, but we still need to target it as one of our output formats.</p>
<p>Despite being reasonable, these requirements are in truth quite demanding. Once you start thinking about what all the pieces entail you realize there’s a <em>lot</em> to keep track of. Systems for doing some or all of this have been developed in whole or in part over the years. Newer ones sometimes escape the constraints of older ones; sometimes they inherit their legacies. I’m not going to review them here. This time around I used <a href="https://quarto.org/" rel="nofollow" target="_blank">Quarto</a>.</p>
<p>Quarto is a publishing system focused on documents of different kinds (articles,
presentations, books, websites), written as plain-text sources that mix prose
and code in any of several languages (R, Python, Julia, others), destined to be
fully-finished outputs in any of several formats (PDFs, HTML, or Word files).
Quarto builds on and extends many tools, notably <a href="https://pandoc.org/" rel="nofollow" target="_blank">pandoc</a>
for getting from Markdown to any number of other output formats. It’s a
spiritual descendant of <a href="https://en.wikipedia.org/wiki/Literate_programming" rel="nofollow" target="_blank">literate
programming</a> approaches for
dealing with code that needs to be run in the context of prose. In the R world
these descendants include
<a href="https://cran.r-project.org/doc/manuals/r-patched/packages/utils/vignettes/Sweave.pdf" rel="nofollow" target="_blank">Sweave</a>
and <a href="https://yihui.org/knitr/" rel="nofollow" target="_blank">RMarkdown/knitr</a>. These broadly “notebook”
approaches to writing and discussing code have benefits and also sharp
limits if your focus is full-on software development and its documentation, or
complex data analysis involving many interrelated steps.<sup id="fnref:1"><a href="https://kieranhealy.org/blog/archives/2026/03/09/using-quarto-to-write-a-book/#fn:1" class="footnote-ref" role="doc-noteref" rel="nofollow" target="_blank">1</a></sup> But they’re <em>very</em> useful
if you are primarily writing longer-form text that periodically requires things
like figures and tables to be programmatically generated in a reproducible
fashion.</p>
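<p>For orientation, a Quarto book project is driven by a single <code>_quarto.yml</code> file at the project root; a minimal sketch (the chapter file names here are invented) looks roughly like this:</p>

```yaml
project:
  type: book

book:
  title: "My Book"
  chapters:
    - index.qmd
    - 01-intro.qmd   # illustrative file names

format:
  html: default
  pdf:
    documentclass: scrbook
```

Running <code>quarto render</code> then produces each listed format from the same <code>qmd</code> sources.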
<p>If you just want to know whether you can write long-form projects like articles, books, or websites using Quarto and R, the answer is absolutely yes. A long time ago I wrote parts of my dissertation and several articles using Sweave. A few years ago I wrote the first edition of <em>Data Visualization</em> using RMarkdown. I wrote the second edition using Quarto. Each one was better than the previous version in terms of flexibility and power. Quarto eliminated several pain-points that I had to deal with for the first edition of this book. It’s very <a href="https://quarto.org/docs/guide/" rel="nofollow" target="_blank">well-documented</a> and continually improving. Its defaults are sensible and produce <a href="https://quarto.org/docs/gallery/" rel="nofollow" target="_blank">good-looking output</a>. You can stop reading now.</p>
<figure class="full-width"><a href="https://i0.wp.com/kieranhealy.org/blog/archives/2026/03/09/using-quarto-to-write-a-book/workflow-wide-quarto.png?ssl=1" rel="nofollow" target="_blank">
    <img src="https://i0.wp.com/kieranhealy.org/blog/archives/2026/03/09/using-quarto-to-write-a-book/workflow-wide-quarto.png?w=578&#038;ssl=1"
         alt="A schematic overview of how Quarto orchestrates its document processing." data-recalc-dims="1"/></a><figcaption>
            <p>A schematic overview of how Quarto orchestrates its document processing.</p>
        </figcaption>
</figure>
<h3 id="unreasonable-demands">Unreasonable Demands</h3>
<p>I had a very clear idea about how I wanted the first edition of the book to look
in print. I also knew that I wanted to make it available as a website. I was
fortunate enough to be able to have both of these things work out. This time
around, I did the same again but I really wanted there to be as little as
possible <em>post hoc</em> work with the website version. I knew that wouldn’t be the
case with the PDF, for reasons I will discuss in a moment. I’m pleased that
Quarto performed so well with the whole process. I wrote two pretty
heavily-customized output formats (one for PDF and one for HTML) that specified
the layout of the book. Quarto’s LaTeX-based book pipeline uses the <a href="https://ctan.org/pkg/scrbook?lang=en" rel="nofollow" target="_blank"><code>scrbook</code> class</a> from the <a href="https://ctan.org/pkg/koma-script?lang=en" rel="nofollow" target="_blank">KOMA-script</a> bundle, which has many nice features, though I find its documentation a tiny bit eccentric. (This might be because I wrote my first book using <a href="https://ctan.org/pkg/memoir?lang=en" rel="nofollow" target="_blank">the <code>memoir</code> class</a>.)  I also wrote a couple of R packages that
managed the themes and some other details of how PNG and especially PDF figures
were produced. A version of the theme is in the development version of the
<a href="https://kjhealy.github.io/socviz/" rel="nofollow" target="_blank"><code>socviz</code> package</a> that accompanies the
book.</p>
<p>The PDF design is a two-column “Tufte-style” layout with wide margins for side-notes and figures. It works very well for a book of this kind as we can show small figures alongside the code that generates them, but also have figures break out of the main text column if needed.</p>
<figure class="full-width"><a href="https://i0.wp.com/kieranhealy.org/blog/archives/2026/03/09/using-quarto-to-write-a-book/dv2-halloween-page.png?ssl=1" rel="nofollow" target="_blank">
    <img src="https://i0.wp.com/kieranhealy.org/blog/archives/2026/03/09/using-quarto-to-write-a-book/dv2-halloween-page.png?w=578&#038;ssl=1"
         alt="Facing pages with a figure that runs the full width of one of the pages." data-recalc-dims="1"/></a><figcaption>
            <p>Facing pages with a figure that runs the full width of one of the pages.</p>
        </figcaption>
</figure>
<p>A layout like this can’t be rigidly ported over to a website, especially in an era of widely-varying screen sizes and small layouts. So the HTML version of the book has a broadly responsive layout that arranges things differently at different sizes. Organizing and tweaking it this time around was made a lot easier by Quarto’s much better support for margin notes and marginal figures. It certainly wasn’t without its headaches. Marginal figures and notes are quite annoying to deal with in both HTML and PDF formats, for different reasons. In the PDF case, it’s tricky to get captions right, and there are still a few hacks in there to make it work. But it’s <em>much</em> cleaner than what I had to do in RMarkdown for the first edition, which was in effect a lot of regular expression substitution for things I could only add after the <code>.tex</code> file was produced. That’s gone now.</p>
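<p>For reference, margin material of this kind is written with Quarto’s column classes; for example, a margin note or figure can be produced with a <code>.column-margin</code> div (a small sketch, not taken from the book’s sources):</p>

```markdown
Some body text in the main column.

::: {.column-margin}
This material is placed in the margin in supported output formats.
:::
```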
<p>Here’s a screenshot of a facing page layout with some code, some marginal notes, and two kinds of figures, one in the margin and one full page-width:</p>
<figure class="full-width"><a href="https://i1.wp.com/kieranhealy.org/blog/archives/2026/03/09/using-quarto-to-write-a-book/dv2-gdppercap-page.png?ssl=1" rel="nofollow" target="_blank">
    <img src="https://i1.wp.com/kieranhealy.org/blog/archives/2026/03/09/using-quarto-to-write-a-book/dv2-gdppercap-page.png?w=578&#038;ssl=1"
         alt="Gapminder figures in the PDF version." data-recalc-dims="1"/></a><figcaption>
            <p>Gapminder figures in the PDF version.</p>
        </figcaption>
</figure>
<p>And here’s some of the same material as seen on the website:</p>
<figure class="full-width"><a href="https://i2.wp.com/kieranhealy.org/blog/archives/2026/03/09/using-quarto-to-write-a-book/dv2-gdppercap-web.png?ssl=1" rel="nofollow" target="_blank">
    <img src="https://i2.wp.com/kieranhealy.org/blog/archives/2026/03/09/using-quarto-to-write-a-book/dv2-gdppercap-web.png?w=578&#038;ssl=1"
         alt="Gapminder figures in the HTML version" data-recalc-dims="1"/></a><figcaption>
            <p>Gapminder figures in the HTML version</p>
        </figcaption>
</figure>
<p><a href="https://socviz.co/04-group-facet-transform.html#facet-to-make-small-multiples" rel="nofollow" target="_blank">Here’s a direct link to the same section.</a> In the website version the marginal figures appear more marginal. There’s also a little bit of conflict to be worked out between the navigation guides and the marginal notes. In addition, the intrinsic variability of the web layout means that the positioning of the marginal notes is less precisely controllable than it is in the PDF output. But the overall result is pretty good. And I have to say it’s very satisfying to be able to produce a good website and a clean PDF (and also an ePub!) from the same folder of <code>qmd</code> files, with the text written in <a href="https://daringfireball.net/projects/markdown/" rel="nofollow" target="_blank">Markdown</a>, the bibliography managed by <a href="https://www.zotero.org/" rel="nofollow" target="_blank">Zotero</a> and <a href="https://retorque.re/zotero-better-bibtex/" rel="nofollow" target="_blank">BBT</a>, interspersed with the code that makes all the figures.</p>
<h3 id="let-the-professionals-do-a-professional-job">Let the Professionals do a Professional Job</h3>
<p>I should say “less precisely controllable <em>without substantial further adjustment</em>”. Because this is the crux of the customization biscuit. There’s no end to it. One of the benefits of being in a position to do a second edition—something I really am very grateful for—is that it allowed me to have a much better sense of the production process for the hard-copy of the book. This in turn placed sharp limits on what I was willing to do when it came to customizing the PDF version myself. Camera-ready files for books published by proper Presses are produced in many different ways. My <a href="https://theordinalsociety.com/" rel="nofollow" target="_blank">most recent book</a>, which is all prose and no code, was designed and typeset using <a href="https://en.wikipedia.org/wiki/Adobe_InDesign" rel="nofollow" target="_blank">Adobe InDesign</a>. For the first edition of <em>Data Visualization</em> I sent the Press a set of LaTeX files and PDF image assets. The LaTeX files produced a very good facsimile of the design we’d agreed on. Then the Press’s typesetter laid it out in LaTeX.</p>
<p>You might think that they just took my files, lightly edited them here and there, and added the trim, bleed, registration, and color marks for the physical print job.  That’s not how it went. Book layouts are very hard to get just right, especially layouts that have many different-sized images and notes and other paraphernalia. They’re fragile. Moving something slightly here or editing a sentence there can cause a cascade of unwanted effects. Even ordinary pages of text will have issues with excessive or insufficient spacing around paragraph and section breaks, or <a href="https://en.wikipedia.org/wiki/Widows_and_orphans" rel="nofollow" target="_blank">widows and orphans</a>, or <a href="https://en.wikipedia.org/wiki/River_(typography)" rel="nofollow" target="_blank">rivers</a>, and many other infelicities that most people won’t notice explicitly, but which cumulatively convey bad vibes even to people who don’t much care about design.</p>
<p>Some of this can be automated. That’s what layout algorithms do. The <a href="https://en.wikipedia.org/wiki/Knuth%E2%80%93Plass_line-breaking_algorithm" rel="nofollow" target="_blank">Knuth-Plass box-and-glue algorithm</a>, which is the thing that causes TeX to emit those <code>Underfull \hbox (badness 10000)</code> complaints, is a real marvel. But it can’t quite work miracles. In my case, the professional typesetter took my LaTeX file, threw away my document class and substituted their own custom one (and some custom style files). Like any document class it defined the layout and all the features of the book, but it also included a variety of commands that allowed her to finely adjust the text as needed on any particular page. Tightening up the spacing here; forcing a break there; very slightly expanding or contracting the page size when needed to make sure that the layout didn’t break in a visible way on the next page, and so on. Here’s an example from the first edition:</p>
<figure><a href="https://i1.wp.com/kieranhealy.org/blog/archives/2026/03/09/using-quarto-to-write-a-book/dv-tweaks-1.png?ssl=1" rel="nofollow" target="_blank">
    <img src="https://i1.wp.com/kieranhealy.org/blog/archives/2026/03/09/using-quarto-to-write-a-book/dv-tweaks-1.png?w=578&#038;ssl=1" data-recalc-dims="1"/></a>
</figure>
<p>And another:</p>
<figure><a href="https://i0.wp.com/kieranhealy.org/blog/archives/2026/03/09/using-quarto-to-write-a-book/dv-tweaks-2.png?ssl=1" rel="nofollow" target="_blank">
    <img src="https://i0.wp.com/kieranhealy.org/blog/archives/2026/03/09/using-quarto-to-write-a-book/dv-tweaks-2.png?w=578&#038;ssl=1" data-recalc-dims="1"/></a>
</figure>
<p>Those uses of <code>{\break}</code>, <code>\enlargethispage</code>, <code>\vspace{}</code>, and the non-breaking space in <code>this~way</code> are all done by hand, based on rendering and re-rendering the document as it’s built to make sure each page meets the Press’s standards. An automatically-produced PDF can get you eighty-five or ninety percent of the way there, but if you really want to get things right, that last stretch will inevitably mean a bunch of adjustment by hand in whatever the final format is. That’s not something you can incorporate into your reproducibility pipeline.</p>
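<p>For a sense of what that hand-tuning looks like in source form, here is an invented fragment (not from the actual book files) using the commands mentioned above:</p>

```latex
% Invented fragment illustrating typical hand tweaks.
... and the caption should stay with its figure in exactly
this~way.{\break}                 % tie words together; then force a break
\enlargethispage{\baselineskip}   % let this page run one line long
\vspace{-0.5\baselineskip}        % tighten space before the next block
```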
<p>Fortunately, you don’t need to. Most of the time we don’t require anything like that level of attention to detail. It’s worth producing and circulating material in accessible and readable formats that also don’t look like garbage. And it’s gratifying to be able to reliably generate pretty high-quality versions of those outputs from plain-text sources. That’s more than good enough in almost all cases. When writing papers that end up as PDFs, for example, I use a template that’s almost 20 years old. I only touch it when something breaks. By the same token, while my amateur interests compel me to run up polished custom Quarto book formats for book projects, I also know that the people who set type for a living know a lot more about the fine grain of that work than I do, or need to know. But once in a while it’s nice to see how far you can push things.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>The trick is to have the code chunks in your document be short and sweet, and have structured scripts and properly-documented packages manage the heavy lifting in any analysis. <a href="https://kieranhealy.org/blog/archives/2026/03/09/using-quarto-to-write-a-book/#fnref:1" class="footnote-backref" role="doc-backlink" rel="nofollow" target="_blank"><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></p>
</li>
</ol>
</div>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://kieranhealy.org/blog/archives/2026/03/09/using-quarto-to-write-a-book/"> R on kieranhealy.org</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/using-quarto-to-write-a-book/">Using Quarto to Write a Book</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">399734</post-id>	</item>
		<item>
		<title>Leptodon 1.0.0 released!</title>
		<link>https://www.r-bloggers.com/2026/03/leptodon-1-0-0-released/</link>
		
		<dc:creator><![CDATA[Open Analytics]]></dc:creator>
		<pubDate>Mon, 09 Mar 2026 11:52:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://www.openanalytics.eu/blog/2026/03/09/leptodon-1.0.0/</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; ">
<p>We are releasing the first version of Leptodon, our Leptos UI toolkit, into the wild.<br />
This release of Leptodon contains UI components for general application development. However, the end goal is to make Leptodon capable enough to easily build comple...</p></div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/leptodon-1-0-0-released/">Leptodon 1.0.0 released!</a>]]></description>
										<content:encoded><![CDATA[

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://www.openanalytics.eu/blog/2026/03/09/leptodon-1.0.0/"> Open Analytics</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>

<p><img style="float: right; height: 25vh;" src="https://www.openanalytics.eu/blog-img/leptodon-logo.svg">
We are releasing the first version of <a href="https://leptodon.dev/" rel="nofollow" target="_blank">Leptodon</a>, our Leptos UI toolkit, into the wild.
This release of Leptodon contains UI components for general application development, but the end goal is to make Leptodon capable enough that complete data science dashboards and applications can be built with it easily. Since at Open Analytics we believe in open source, we are releasing this project <a href="https://github.com/openanalytics/leptodon" rel="nofollow" target="_blank">on GitHub</a> under an Apache-2.0 license. We hope it will prove useful for building new websites and applications! Below we explore some additional information, our technical choices, and a few examples.</p>
<h1 id="about">About</h1>
<p>Leptodon is a <a href="https://leptos.dev/" rel="nofollow" target="_blank">Leptos</a>-based component library written in Rust. Leptos employs a <a href="https://book.leptos.dev/appendix_reactive_graph.html#the-reactive-graph" rel="nofollow" target="_blank">reactive-graph</a> system to perform targeted DOM updates, which makes it suitable for highly interactive applications without causing unnecessary slowdowns in your browser. Combining the reactive graph with a powerful, fully typed language like Rust means we should be able to build robust and efficient components for data science applications.</p>
<p><figure style="float: right; margin: 2rem;">
<img class="img-responsive" src="https://i2.wp.com/upload.wikimedia.org/wikipedia/commons/3/3e/Leptodon_cayannensis_-_Gray-headed_kite.JPG?w=578&#038;ssl=1" alt="Gray-headed kite with a gray head, soft-looking white body and black wings." style="max-height: 50vh;" data-recalc-dims="1">
<figcaption>By <a href="https://commons.wikimedia.org/wiki/User:Hector_Bottai" title="User:Hector Bottai" rel="nofollow" target="_blank">Hector Bottai</a> &#8211; <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/licenses/by-sa/4.0" title="Creative Commons Attribution-Share Alike 4.0" rel="nofollow" target="_blank">CC BY-SA 4.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=46529090" rel="nofollow" target="_blank">Link</a></figcaption>
</figure>
Our project is named after the <a href="https://en.wikipedia.org/wiki/Leptodon" rel="nofollow" target="_blank">Leptodon cayanensis (Grey-headed kite)</a>. The beautiful Leptodon cayanensis makes for a great mascot as we strive to build a UI toolkit that is just as beautiful. The Leptodon was of course chosen over other birds because its name sounds very similar to “Leptos” (which is Greek for light/thin).</p>
<p>Leptos allows you to write almost-native HTML inside Rust using the power of procedural macros. This Rust-HTML mix is called RSX, and it can be interleaved with real Rust code (much like JSX) to create interactive components. The classic example below creates a blue “+1” button that increments a counter:</p>
<pre>// Rust code
let count = RwSignal::new(0);
view! {
    // RSX
    &lt;p&gt;
        {move || format!(&quot;Button was pressed {} times!&quot;, count.get())}
    &lt;/p&gt;
    &lt;Button
        appearance=ButtonAppearance::Primary
        shape=ButtonShape::Rounded
        icon=icon::AddIcon()
        on_click=move |_| {
            count.update(|old| *old += 1);
        }
    &gt;
        1
    &lt;/Button&gt;
}
</pre>
<p>Properties like <code>appearance</code> are type-checked by the Rust compiler, which means passing a non-existent option is impossible.</p>
<h2 id="styling-with-tailwind">Styling with Tailwind</h2>
<p>We chose to have Tailwind v3 as the default CSS framework to style components.
Tailwind maps CSS properties to class names (e.g. <code>padding-right: 1px</code> to <code>pr-px</code>). This system lets us style almost everything directly on the exact HTML element inside the component, without very large <code>style=...</code> blocks.
This keeps the components cohesive, with less interference between them.
To keep this efficient, Tailwind generates a CSS file containing only the classes used in the source code.
This is problematic for our project: we are building a crate, and by default Tailwind cannot see a dependency&#8217;s source code.
To work around this, we expose our source code via a function generated at build time.
Projects depending on Leptodon are expected to place the Leptodon source code into a file scanned by Tailwind; our <a href="https://github.com/openanalytics/leptodon-starter" rel="nofollow" target="_blank">starter template</a> does this for you!</p>
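<p>In a consuming project, the wiring could look roughly like the following hypothetical <code>tailwind.config.js</code>. The dumped-source path shown here is purely illustrative; the starter template sets up the real one:</p>

```javascript
// Hypothetical Tailwind v3 config for a project that depends on Leptodon.
// Tailwind only emits classes it finds while scanning the files listed in
// `content`, so the Leptodon source dumped at build time must be listed too.
module.exports = {
  content: [
    "./src/**/*.rs",          // the project's own RSX components
    "./leptodon-src.dump.rs", // illustrative path: dumped Leptodon source
  ],
  theme: { extend: {} },
  plugins: [],
};
```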
<h1 id="docs-powered-by-macros">Docs powered by <img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f980.png" alt="🦀" class="wp-smiley" style="height: 1em; max-height: 1em;" /> macros</h1>
<p>To aid developers in using our components, we added component demos with their source code beneath them on our website. See the <a href="https://leptodon.dev/demo/badge" rel="nofollow" target="_blank">Leptodon badge demo</a> for an example! To keep the demo examples&#8217; source code blocks in sync with the demonstrated components, we employ Rust procedural macros. Our <code>#[generate_codeblock]</code> macro, when applied to a demo component, creates a second component containing a code block of the annotated function&#8217;s source code. This guarantees that our demo pages and the component source code stay in sync. Lastly, at the bottom of each page we have a table of the demonstrated component&#8217;s parameters. These are also generated, but via the <code>#[generate_docs]</code> macro on the Leptodon components.
This code generation lets us maintain the demo site more efficiently, since documentation changes are immediately reflected.
You can see the relation between code and output below.</p>
<pre>#[generate_codeblock(LinkExample)] // Generates the box containing the demo + source code.
#[component]
pub fn LinkDemo() -&gt; impl IntoView {
    view! {
        &quot;Explore more about OA on the &quot;
        &lt;Link href=&quot;https://openanalytics.eu&quot; target=&quot;_blank&quot;&gt;OA website&lt;/Link&gt;
    }
}

#[component]
pub fn LinkDemoPage() -&gt; impl IntoView {
    view! {
        &lt;Title text=&quot;Link&quot;/&gt;
        &lt;FixedCenterColumn&gt;
            &lt;Heading4 anchor=&quot;link&quot;&gt;&quot;Link&quot;&lt;/Heading4&gt;
            &lt;LinkExample /&gt; // call generated function
            &lt;leptodon::link::LinkDocs /&gt; // call generated function
        &lt;/FixedCenterColumn&gt;
    }
}
</pre>
<p><b>Output:</b>
<img src="https://i0.wp.com/www.openanalytics.eu/blog-img/leptodon-link-demo.png?w=578&#038;ssl=1" alt="An HTML link component demonstration page, showing a link embedded in a sentence, followed by its source code. At the bottom, a table shows the documentation of the Link component" class="img-responsive" data-recalc-dims="1"></p>
<h1 id="automated-testing">Automated testing</h1>
<p>Testing is critically important when creating production applications: we want to ensure users a frictionless experience. For traditional input/output testing of functions we use unit tests written in Rust. For anything more complex we run a suite of end-to-end tests with Playwright, in which different web browsers load our test pages to assert that each component still behaves as it should. This helps catch bugs early and keeps web pages looking and functioning well. We require every interactive component to be tested.
Our CI pipeline runs the tests, along with some other code style checks, on every commit and PR to the <code>main</code> or <code>develop</code> branch.</p>
<h1 id="v1-0-highlights">v1.0 highlights!</h1>
<p>We listed our favourite components below:
<table style="border-spacing: 2rem; border-collapse: separate; border: 1px solid gray;">
<tr>
<td style="width: 50%; vertical-align: top;"><b>A Calendar with custom event support:</b><img src="https://i2.wp.com/www.openanalytics.eu/blog-img/leptodon-calendar.png?w=578&#038;ssl=1" class="img-responsive" data-recalc-dims="1"></td>
<td style="width: 50%; vertical-align: top;"><b>A combobox tag-picker:</b><img src="https://i1.wp.com/www.openanalytics.eu/blog-img/leptodon-tag-picker.png?w=578&#038;ssl=1" class="img-responsive" data-recalc-dims="1"></td>
</tr>
<tr>
<td style="width: 50%; vertical-align: top;"><b>An assortment of labels:</b><img src="https://i1.wp.com/www.openanalytics.eu/blog-img/leptodon-badges.png?w=578&#038;ssl=1" class="img-responsive" data-recalc-dims="1"></td>
<td style="width: 50%; vertical-align: top;"><b>A detailed upload input:</b><img src="https://i1.wp.com/www.openanalytics.eu/blog-img/leptodon-upload.png?w=578&#038;ssl=1" class="img-responsive" data-recalc-dims="1"></td>
</tr>
<tr>
<td colspan="2" style="text-align: center; vertical-align: top;"><b>Other components can be discovered at <a href="https://leptodon.dev/" rel="nofollow" target="_blank">https://leptodon.dev</a></b>
</td>
</tr>
</table></p>
<p>If you have any questions, suggestions or feedback feel free to reach out to us via <a href="https://support.openanalytics.eu/c/leptodon/13" rel="nofollow" target="_blank">our support page</a> or <a href="https://github.com/openanalytics/leptodon" rel="nofollow" target="_blank">on GitHub</a>.</p>
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://www.openanalytics.eu/blog/2026/03/09/leptodon-1.0.0/"> Open Analytics</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/leptodon-1-0-0-released/">Leptodon 1.0.0 released!</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">399742</post-id>	</item>
		<item>
		<title>Getting to the bottom of TMLE: forcing the target to behave</title>
		<link>https://www.r-bloggers.com/2026/03/getting-to-the-bottom-of-tmle-forcing-the-target-to-behave/</link>
		
		<dc:creator><![CDATA[ouR data generation]]></dc:creator>
		<pubDate>Mon, 09 Mar 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://www.rdatagen.net/post/2026-03-10-getting-to-the-bottom-of-tmle-2/</guid>

					<description><![CDATA[<p>In the last couple of posts (starting here), I’ve tried to unpack some of the ideas that sit underneath TMLE: viewing parameters as functionals of a distribution, thinking about sampling as a perturbation, and understanding how influence functions d...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/getting-to-the-bottom-of-tmle-forcing-the-target-to-behave/">Getting to the bottom of TMLE: forcing the target to behave</a>]]></description>
<content:encoded><![CDATA[

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://www.rdatagen.net/post/2026-03-10-getting-to-the-bottom-of-tmle-2/"> ouR data generation</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>



<p>In the last couple of posts (<a href="https://www.rdatagen.net/post/2026-02-05-getting-to-the-bottom-of-tmle-1/" rel="nofollow" target="_blank">starting here</a>), I’ve tried to unpack some of the ideas that sit underneath TMLE: viewing parameters as functionals of a distribution, thinking about sampling as a perturbation, and understanding how influence functions describe the leading behavior of estimation error. In the second <a href="https://www.rdatagen.net/post/2026-03-03-getting-to-the-bottom-of-tmle-simulating-the-orthogonality/" rel="nofollow" target="_blank">post</a>, I showed through simulation how errors in nuisance estimation can interact with sampling variability, but typically have a smaller effect than the main sampling fluctuation itself. This brings us to the central idea behind TMLE.</p>
<p>(This series is not meant to be a tutorial, just a set of notes I put together while trying to wrap my head around this important tool. If you are looking for a more comprehensive introduction, there is an excellent collection of videos and tutorials available <a href="https://ctml.berkeley.edu/home" rel="nofollow" target="_blank">here</a>.)</p>
<p>If we knew the true exposure and outcome models, the empirical mean of the influence function would already be (approximately) zero, and the estimator would differ from the truth only because of sampling variability.</p>
<p>In practice, we estimate those nuisance models without full knowledge. Even small mistakes can disturb the balance that keeps the influence function centered. When that happens, the final estimate can be pulled away from the truth not just by sampling noise, but by imperfections in the nuisance models.</p>
<p>In this simple continuous-outcome setup, TMLE makes a targeted adjustment to the initial outcome fit so that the empirical mean of the estimated influence function is driven to zero. <em>After this adjustment, the remaining discrepancy behaves like ordinary sampling noise rather than model-driven bias.</em> Rather than simply improving the nuisance fits themselves, TMLE tries to correct the behavior of the target parameter.</p>
<p>To see how this plays out in a causal setting, we first need to be explicit about the parameter we’re trying to estimate.</p>
<div id="a-brief-causal-grounding" class="section level3">
<h3>A brief causal grounding</h3>
<p>In causal inference, the parameters we care about are usually contrasts between potential outcomes. For a binary treatment <span class="math inline">\(A\)</span>, each unit has two potential outcomes:
<span class="math display">\[
Y(1),\ Y(0),
\]</span>
representing the outcomes that would be observed under treatment and control. A common causal target is the average treatment effect:
<span class="math display">\[
\psi^0 = E[Y(1) - Y(0)]
\]</span>
(In this post, I move from the generic functional notation <span class="math inline">\(T(P)\)</span> used earlier to the specific causal parameter <span class="math inline">\(\psi(P)\)</span>, the average treatment effect.) Because we observe only one of these for each person, estimating this quantity requires assumptions (consistency, exchangeability, and positivity) that allow us to express it as a functional of the observed data distribution.</p>
<p>TMLE operates entirely within this observed-data framework. It does not try to recover individual counterfactuals. Instead, it constructs an estimator whose statistical behavior is aligned with the influence function of the causal parameter.</p>
</div>
<div id="why-nuisance-models-matter" class="section level3">
<h3>Why nuisance models matter</h3>
<p>In an ideal world, we would observe both potential outcomes and simply average their differences. In reality, we only observe one outcome per person, so we rely on models for the outcome and treatment mechanism to fill in the missing structure. These nuisance models help identify the causal effect, but if they are imperfect (and more likely than not, they will be), their errors can bias the final estimate.</p>
<p>TMLE begins with these initial nuisance estimates but makes a small, carefully chosen adjustment so that their errors interact rather than accumulate. The targeting step is chosen so that the empirical average of the estimated influence function equals zero, the centering property discussed earlier.</p>
<p>In this way, TMLE does not attempt to perfectly reconstruct missing counterfactuals. Instead, it realigns the estimate so that it responds primarily to genuine sampling noise rather than to quirks of the nuisance models.</p>
</div>
<div id="the-ate-and-its-efficient-influence-function" class="section level3">
<h3>The ATE and its efficient influence function</h3>
<p>Let <span class="math inline">\(Z = (X, A, Y)\)</span> denote observed data: baseline covariates <span class="math inline">\(X\)</span>, binary treatment <span class="math inline">\(A \in \{0,1\}\)</span>, and outcome <span class="math inline">\(Y\)</span>.</p>
<p>We can define the outcome regression (<span class="math inline">\(Q\)</span>) and treatment mechanism (<span class="math inline">\(g\)</span>) as
<span class="math display">\[Q_a(x) = E[Y \mid A = a, X = x], \ \ \ g(x) = P(A = 1 \mid X = x).\]</span>
The ATE can be written as the functional
<span class="math display">\[
\psi(P) = E_P\big[Q_1(X) - Q_0(X)\big],
\]</span>
with the target parameter <span class="math inline">\(\psi_0 = \psi(P_0)\)</span> under the true distribution <span class="math inline">\(P_0\)</span>.</p>
<p>Conceptually, the influence function comes from the same perturbation-and-differentiation process discussed in the initial <a href="https://www.rdatagen.net/post/2026-02-05-getting-to-the-bottom-of-tmle-1/" rel="nofollow" target="_blank">post</a>: we slightly perturb the underlying distribution and examine the component of the resulting change in the ATE that dominates when the perturbation is small. It turns out that this dominant component can be written as the perturbation <span class="math inline">\(P_n − P_0\)</span> acting on a particular function of the data:
<span class="math display">\[\psi(P_n) - \psi(P_0) \approx (P_n - P_0)\phi_{P_0}.\]</span>
Here <span class="math inline">\(\phi_{P_0}\)</span> is the efficient influence function, and <span class="math inline">\((P_n−P_0) \phi_{P_0}\)</span> simply means the difference between the empirical and population averages of <span class="math inline">\(\phi_{P_0}(Z)\)</span>.</p>
<p>The version shown below, which I am not explicitly deriving, is the efficient influence function for the ATE:
<span class="math display">\[
\phi_P(Z) = \big( Q_1(X) - Q_0(X) - \psi(P)\big) + \frac{A}{g(X)} \big(Y - Q_1(X)\big) - \frac{1-A}{1 - g(X)} \big(Y - Q_0(X)\big).
\]</span>
A key element of the EIF is its structure. It combines a plug-in piece involving the conditional mean outcomes and a residual-correction piece weighted by the propensity score.</p>
<p>It turns out that if we plug in the true outcome and treatment models, then the EIF is centered under the true distribution:
<span class="math display">\[
E_{P_0}[\phi_{P_0}(Z)] = 0.
\]</span>
But if we plug in <em>estimated</em> nuisances, the empirical mean typically won’t be zero:</p>
<p><span class="math display">\[P_n \phi_{\hat{P}} = \frac{1}{n} \sum_{i=1}^{n} \phi_{\hat{P}}(Z_i) \ne 0.\]</span>
This matters, because the ideal first-order expansion behaves like
<span class="math display">\[\psi(P_n) - \psi(P_0) \approx (P_n - P_0)\phi_{P_0}.\]</span>
That approximation only behaves as expected if the influence function balances out in the sample (i.e., its average is zero). When <span class="math inline">\(P_n \phi_{\hat P} \neq 0\)</span>, that balance is broken, and the estimator drifts away from its clean first-order description.</p>
<p>TMLE restores the balance by slightly adjusting the nuisance fits until the empirical mean of the estimated influence function is brought back to zero.</p>
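<p>The centering claim is easy to check numerically. The sketch below (in Python rather than R, with a made-up data-generating process) evaluates the EIF at the <em>true</em> nuisances and confirms that its empirical mean is only sampling noise away from zero:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical data-generating process with a known ATE of 2.0.
X = rng.normal(size=n)
g = 1 / (1 + np.exp(-0.5 * X))          # true propensity P(A=1|X)
A = rng.binomial(1, g)
Y = 2.0 * A + X + rng.normal(size=n)

Q1, Q0 = 2.0 + X, X                     # true conditional means Q_1, Q_0
psi0 = 2.0                              # true parameter psi_0

# Efficient influence function evaluated at the TRUE nuisances.
phi = (Q1 - Q0 - psi0) + A / g * (Y - Q1) - (1 - A) / (1 - g) * (Y - Q0)

print(np.mean(phi))   # close to 0, up to O(n^{-1/2}) sampling noise
```

<p>With estimated nuisances plugged in instead of the true ones, this empirical mean generally drifts away from zero; that imbalance is exactly what the targeting step removes.</p>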
</div>
<div id="evaluating-the-eif-at-the-initial-fit" class="section level3">
<h3>Evaluating the EIF at the initial fit</h3>
<p>Let <span class="math inline">\(\hat P^0\)</span> denote the observed-data distribution indexed by the initial nuisance estimates <span class="math inline">\((\hat Q^0,\hat g,\hat\psi^0)\)</span>. The estimated EIF at that initial fit is
<span class="math display">\[
\phi_{\hat P^0}(Z)
=
\big(\hat Q_1^0(X) - \hat Q_0^0(X) - \hat\psi^0\big)
+
\frac{A}{\hat g(X)}\big(Y - \hat Q_1^0(X)\big)
-
\frac{1-A}{1-\hat g(X)}\big(Y - \hat Q_0^0(X)\big),
\]</span>
where I’m using <span class="math inline">\(\hat Q^0_a(X)\)</span> as shorthand for <span class="math inline">\(\hat Q^0(a,X)\)</span>.</p>
<p>Now compute its empirical mean:
<span class="math display">\[
P_n \phi_{\hat P^0}
=
\frac{1}{n}\sum_{i=1}^n \phi_{\hat P^0}(Z_i).
\]</span>
If this equals zero, you’re already in great shape: your estimator behaves (to first order) like the ideal one with a centered EIF. If it doesn’t equal zero, TMLE does not throw away the nuisance fits. Instead it “tilts” them just enough to remove the imbalance.</p>
</div>
<div id="bring-in-the-clever-covariate" class="section level3">
<h3>Bring in the clever covariate</h3>
<p>This raises the question of what that tilt should look like. From above, the EIF evaluated at the initial estimates is
<span class="math display">\[
\phi_{\hat P^0}(Z)
=
\big(\hat Q_1^0(X) - \hat Q_0^0(X) - \hat\psi^0\big)
+
\frac{A}{\hat g(X)}\big(Y - \hat Q_1^0(X)\big)
-
\frac{1-A}{1-\hat g(X)}\big(Y - \hat Q_0^0(X)\big).
\]</span>
Focusing on the part involving the observed outcome <span class="math inline">\(Y\)</span>:
<span class="math display">\[
\frac{A}{\hat g(X)}\big(Y - \hat Q_1^0(X)\big)
-
\frac{1-A}{1-\hat g(X)}\big(Y - \hat Q_0^0(X)\big),
\]</span>
we can rewrite this in a slightly simpler form. Because only one of these terms is active for any individual (depending on whether <span class="math inline">\(A=1\)</span> or <span class="math inline">\(A=0\)</span>), the two elements can be combined into a single expression:
<span class="math display">\[
\left(
\frac{A}{\hat g(X)}
-
\frac{1-A}{1-\hat g(X)}
\right)
\big(Y - \hat Q^0(A,X)\big).
\]</span>
This motivates the definition of the <strong>clever covariate</strong>
<span class="math display">\[
H_{\hat g}(A,X)
=
\frac{A}{\hat g(X)}
-
\frac{1-A}{1-\hat g(X)}.
\]</span>
With this notation, the outcome-dependent part of the EIF becomes
<span class="math display">\[
H_{\hat g}(A,X)\big(Y - \hat Q^0(A,X)\big).
\]</span>
Now the EIF can be written more compactly as
<span class="math display">\[
\phi_{\hat P^0}(Z)
=
\big(\hat Q_1^0(X) - \hat Q_0^0(X) - \hat\psi^0\big)
+
H_{\hat g}(A,X)\big(Y - \hat Q^0(A,X)\big).
\]</span>
This decomposition makes the source of the imbalance easier to see. The plug-in estimator is
<span class="math display">\[
\hat\psi^0 = P_n\big(\hat Q_1^0(X) - \hat Q_0^0(X)\big),
\]</span>
so by construction,
<span class="math display">\[
P_n\big(\hat Q_1^0(X) - \hat Q_0^0(X) - \hat\psi^0\big) = 0.
\]</span>
That means any imbalance in the empirical EIF must come entirely from
<span class="math display">\[
P_n\left[
H_{\hat g}(A,X)\big(Y - \hat Q^0(A,X)\big)
\right].
\]</span>
So the only part of the EIF we can directly influence is the residual <span class="math inline">\(Y - \hat Q^0(A,X)\)</span>. If we move <span class="math inline">\(\hat Q^0\)</span> slightly until this residual imbalance disappears, we can bring the empirical EIF back into balance and better target the parameter.</p>
</div>
<div id="the-fluctuation-step" class="section level3">
<h3>The fluctuation step</h3>
<p>TMLE does not refit the outcome model from scratch. Instead, it introduces a one-dimensional update that adjusts the initial regression just enough to remove the imbalance in the empirical influence-function equation:
<span class="math display">\[
\hat Q^\epsilon(A,X)
=
\hat Q^0(A,X)
+
\epsilon H_{\hat g}(A,X).
\]</span>
The parameter <span class="math inline">\(\epsilon\)</span> controls how much we tilt the regression. We estimate <span class="math inline">\(\epsilon\)</span> using the observed outcomes <span class="math inline">\(Y_i\)</span>. For a continuous outcome, we estimate <span class="math inline">\(\epsilon\)</span> by least squares. The normal equation for this regression is
<span class="math display">\[
\sum_{i=1}^n
H_{\hat g}(A_i, X_i)
\big(
Y_i - \hat Q^{\epsilon}(A_i,X_i)
\big)
=0.
\]</span>
This is equivalent to saying
<span class="math display">\[
P_n
\Big[
H_{\hat g}(A,X)\big(Y-\hat Q^{\epsilon}(A,X)\big)
\Big]
= 0,
\]</span>
after dividing both sides by <span class="math inline">\(n\)</span>. Define the updated regression <span class="math inline">\(Q^*\)</span> by plugging in the estimated <span class="math inline">\(\hat\epsilon\)</span>:
<span class="math display">\[
\hat Q^*(A,X) = \hat Q^{\hat\epsilon}(A,X).
\]</span>
Once we have <span class="math inline">\(\hat Q^*\)</span>, we update the ATE estimate to get the TMLE estimate using the usual plug-in formula
<span class="math display">\[
\hat\psi^*
=
\frac{1}{n}\sum_{i=1}^n
\big(
\hat Q^*(1,X_i) - \hat Q^*(0,X_i)
\big).
\]</span>
The updated estimated influence function becomes
<span class="math display">\[
\phi_{\hat P^*}(Z) =
\underbrace{\big(\hat Q^*_1(X) - \hat Q^*_0(X) - \hat\psi^*\big)}_{\text{plug-in}}
+
\underbrace{H_{\hat g}(A,X)\big(Y - \hat Q^*(A,X)\big)}_{\text{weighted residual error}}.
\]</span>
The plug-in construction guarantees that the first term has empirical mean zero, while the normal equation above ensures that the residual term also has empirical mean zero. As a result,
<span class="math display">\[
P_n \phi_{\hat P^*} \approx 0.
\]</span>
In other words, the targeting step removes the residual imbalance in the efficient influence function within the observed sample. Now the behavior of the estimator matches the ideal first-order expansion.</p>
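<p>The whole targeting step fits in a few lines. Here is a minimal numerical sketch (in Python rather than R, with a simulated dataset, a deliberately misspecified initial outcome fit, and, for simplicity, the true propensity score) that runs the fluctuation end to end:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Simulated data: one covariate, binary treatment, continuous outcome.
X = rng.normal(size=n)
g_hat = 1 / (1 + np.exp(-0.4 * X))           # propensity P(A=1|X), taken as known
A = rng.binomial(1, g_hat)
Y = 1.0 * A + 0.5 * X + rng.normal(size=n)   # true ATE = 1.0

def Q0_hat(a, x):
    # Deliberately imperfect initial outcome fit Q^0(a, x).
    return 0.7 * a + 0.3 * x

# Clever covariate H(A, X) = A/g - (1 - A)/(1 - g).
H = A / g_hat - (1 - A) / (1 - g_hat)

# Fluctuation: the least-squares slope of the residual on H solves the
# normal equation  sum_i H_i (Y_i - Q^eps_i) = 0.
resid = Y - Q0_hat(A, X)
eps = np.sum(H * resid) / np.sum(H * H)

# Updated regression Q*(a, x) = Q^0(a, x) + eps * H(a, x), at a = 1 and a = 0.
Q1_star = Q0_hat(1, X) + eps * (1 / g_hat)
Q0_star = Q0_hat(0, X) + eps * (-1 / (1 - g_hat))
psi_star = np.mean(Q1_star - Q0_star)        # TMLE plug-in estimate

# Empirical mean of the updated EIF: zero up to floating-point error.
Q_star_obs = np.where(A == 1, Q1_star, Q0_star)
eif = (Q1_star - Q0_star - psi_star) + H * (Y - Q_star_obs)
print(psi_star, np.mean(eif))
```

<p>Both pieces of the updated EIF average to zero by construction: the plug-in part because <code>psi_star</code> is its sample mean, and the weighted-residual part because <code>eps</code> solves the normal equation.</p>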
</div>
<div id="returning-to-the-nuisance-interaction" class="section level3">
<h3>Returning to the nuisance interaction</h3>
<p>In the earlier posts, I tried to argue that influence-function–based estimators behave well when the interaction term
<span class="math display">\[
(P_n - P_0)(\phi_{\hat P} - \phi_{P_0}),
\]</span>
becomes small relative to the main sampling fluctuation. When that happens, the estimator behaves as if the true influence function were known. In the previous post, we explored this interaction through simulation and saw that it can shrink toward zero, though it may still be quite variable in finite samples when nuisance models are estimated imperfectly. The targeting step in TMLE is designed to enforce the empirical influence-function equation in the observed sample, which helps ensure that any remaining discrepancy appears only in the higher-order remainder.</p>
<p>To see how targeting helps achieve this, start from the identity
<span class="math display">\[
(P_n - P_0)(\phi_{\hat P^*} - \phi_{P_0})
=
P_n(\phi_{\hat P^*} - \phi_{P_0})
-
P_0(\phi_{\hat P^*} - \phi_{P_0}).
\]</span>
Expanding the first term gives
<span class="math display">\[
P_n(\phi_{\hat P^*} - \phi_{P_0})
=
P_n\phi_{\hat P^*} - P_n\phi_{P_0}.
\]</span>
The targeting step enforces
<span class="math display">\[
P_n \phi_{\hat P^*} \approx 0,
\]</span>
so this term reduces to
<span class="math display">\[
P_n(\phi_{\hat P^*} - \phi_{P_0})
\approx
-\, P_n \phi_{P_0}.
\]</span>
The quantity <span class="math inline">\(P_n \phi_{P_0}\)</span> is simply the empirical average of the true influence function, which fluctuates at order <span class="math inline">\(n^{-1/2}\)</span> due to sampling variability.</p>
<p>The second term,
<span class="math display">\[
P_0(\phi_{\hat P^*} - \phi_{P_0}),
\]</span>
reflects how far the targeted influence function is from the true one in the population. Its magnitude is largely determined by the accuracy of the nuisance estimates.</p>
<p>A key feature of the influence function is that first-order errors in either nuisance model cancel out. What remains behaves roughly like the product of the errors in the outcome regression and the propensity score model. As those nuisance estimates improve, this interaction shrinks and becomes negligible relative to the <span class="math inline">\(n^{-1/2}\)</span> sampling fluctuation. Targeting removes the leading imbalance caused by nuisance estimation in the observed sample. What remains is dominated by the usual sampling fluctuation <span class="math inline">\((P_n-P_0)\phi_{P_0}\)</span>, with nuisance errors entering only through a smaller interaction term.</p>
</div>
<div id="a-quick-word-on-cross-fitting" class="section level3">
<h3>A quick word on cross-fitting</h3>
<p>Everything above can be done with or without cross-fitting. But when <span class="math inline">\(\hat Q^0\)</span> and <span class="math inline">\(\hat g\)</span> are estimated using flexible machine-learning methods, cross-fitting helps ensure that the empirical EIF equation behaves the way the theory expects.</p>
<p>Without cross-fitting, the same observations both train the nuisance models and evaluate the influence function. Cross-fitting separates those roles, so each observation is evaluated using nuisance estimates that were learned from other data. This avoids the feedback loop that can otherwise distort the EIF centering condition and helps the usual influence-function asymptotics show up more clearly in finite samples.</p>
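<p>A minimal sketch of the mechanics (Python; two folds, a least-squares outcome fit standing in for a flexible learner, and, for brevity, the true propensity score rather than a cross-fitted one): each observation's influence-function contribution is evaluated with an outcome model trained on the <em>other</em> fold.</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

X = rng.normal(size=n)
g = 1 / (1 + np.exp(-0.5 * X))         # true propensity, used as-is for brevity
A = rng.binomial(1, g)
Y = 1.0 * A + X + rng.normal(size=n)   # true ATE = 1.0

def fit_outcome(x, a, y):
    # Least-squares fit of Y ~ 1 + A + X, a stand-in for any ML learner.
    Z = np.column_stack([np.ones_like(x), a, x])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta

folds = np.arange(n) % 2               # two-fold split
phi = np.empty(n)
for k in (0, 1):
    tr, ev = folds != k, folds == k    # train on one fold, evaluate on the other
    b = fit_outcome(X[tr], A[tr], Y[tr])
    Q1 = b[0] + b[1] + b[2] * X[ev]
    Q0 = b[0] + b[2] * X[ev]
    phi[ev] = (Q1 - Q0) \
        + A[ev] / g[ev] * (Y[ev] - Q1) \
        - (1 - A[ev]) / (1 - g[ev]) * (Y[ev] - Q0)

print(np.mean(phi))                    # cross-fitted one-step estimate of the ATE
```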
</div>
<div id="where-the-double-robustness-shows-up" class="section level3">
<h3>Where the “double robustness” shows up</h3>
<p>TMLE also inherits a key robustness property from the structure of the influence function. Roughly speaking, the estimator remains consistent if either the outcome regression or the propensity model is estimated correctly.</p>
<p>Nuisance errors enter the estimator multiplicatively rather than additively. If the outcome regression has error <span class="math inline">\(e_Q\)</span> and the propensity model has error <span class="math inline">\(e_g\)</span>, the leading bias behaves roughly like the product <span class="math inline">\(e_Q \times e_g\)</span>. If either model is correct the bias disappears, and even when both are imperfect their interaction can still be small.</p>
<p>This multiplicative structure comes from the orthogonality built into the efficient influence function: first-order errors in either nuisance model cancel out, so nuisance mistakes only matter through their interaction.</p>
<p>In that sense, TMLE is not trying to perfectly estimate the nuisance models themselves. Instead, it adjusts them just enough so that the target parameter obeys the influence-function equation that governs its asymptotic behavior.</p>
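<p>The multiplicative error structure can be seen directly in a small simulation (Python; the uncorrected one-step/AIPW estimator is used below because it shares the efficient-influence-function structure that gives TMLE this property, and the &#8220;bad&#8221; models are deliberately misspecified for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

X = rng.normal(size=n)
g = 1 / (1 + np.exp(-0.5 * X))         # true propensity
A = rng.binomial(1, g)
Y = 1.5 * A + X + rng.normal(size=n)   # true ATE = 1.5

def aipw(Q1, Q0, g_hat):
    # One-step estimator built from the efficient influence function.
    return np.mean((Q1 - Q0)
                   + A / g_hat * (Y - Q1)
                   - (1 - A) / (1 - g_hat) * (Y - Q0))

Q1_bad, Q0_bad = np.full(n, 2.5), np.zeros(n)  # misspecified outcome model
g_bad = np.full(n, 0.5)                        # misspecified propensity model

print(aipw(Q1_bad, Q0_bad, g))      # wrong Q, correct g: consistent
print(aipw(1.5 + X, X, g_bad))      # correct Q, wrong g: consistent
print(aipw(Q1_bad, Q0_bad, g_bad))  # both wrong: bias survives
```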
</div>
<div id="next-steps" class="section level3">
<h3>Next steps</h3>
<p>I had hoped to include some simulations here to see the theory in action, but this post ended up longer than anticipated. As I did after the first post, I’ll follow up with another post that focuses on simulation examples illustrating the ideas developed here.</p>
<p>
<p><small><font color="darkkhaki">
Reference:</p>
<p>Van der Laan, Mark J., and Sherri Rose. Targeted learning: causal inference for observational and experimental data. Vol. 4. New York: Springer, 2011.</p>
<p>Support:</p>
This work was supported by the National Institute on Aging (NIA) of the National Institutes of Health under Award Number U54AG063546, which funds the NIA IMbedded Pragmatic Alzheimer’s Disease and AD-Related Dementias Clinical Trials Collaboratory (<a href="https://impactcollaboratory.org/" rel="nofollow" target="_blank">NIA IMPACT Collaboratory</a>). The author, a member of the Design and Statistics Core, was the sole writer of this blog post and has no conflicts. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health.
</font></small>
</p>
</div>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://www.rdatagen.net/post/2026-03-10-getting-to-the-bottom-of-tmle-2/"> ouR data generation</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/getting-to-the-bottom-of-tmle-forcing-the-target-to-behave/">Getting to the bottom of TMLE: forcing the target to behave</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">399737</post-id>	</item>
		<item>
		<title>Learning PK/PD Simulation: A Beginner&#8217;s Monte Carlo Analysis With mrgsolve in R</title>
		<link>https://www.r-bloggers.com/2026/03/learning-pk-pd-simulation-a-beginners-monte-carlo-analysis-with-mrgsolve-in-r/</link>
		
		<dc:creator><![CDATA[r on Everyday Is A School Day]]></dc:creator>
		<pubDate>Mon, 09 Mar 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://www.kenkoonwong.com/blog/pkpd/</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; ">
🧪 Diving into PK/PD for the first time — simulating ceftriaxone with mrgsolve in R. Free drug levels were… surprisingly high? Even pushed it to q48h dosing out of curiosity and the results left me with more questions than answers 🤔📈</p>
<p>Motivations</p>
<p>Learning pharmacokinetics (PK) and pharmacodynamics (PD) has always been ...</p></div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/learning-pk-pd-simulation-a-beginners-monte-carlo-analysis-with-mrgsolve-in-r/">Learning PK/PD Simulation: A Beginner’s Monte Carlo Analysis With mrgsolve in R</a>]]></description>
<content:encoded><![CDATA[

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://www.kenkoonwong.com/blog/pkpd/"> r on Everyday Is A School Day</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<blockquote>
<p><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f9ea.png" alt="🧪" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Diving into PK/PD for the first time — simulating ceftriaxone with mrgsolve in R. Free drug levels were… surprisingly high? Even pushed it to q48h dosing out of curiosity and the results left me with more questions than answers <img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f914.png" alt="🤔" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f4c8.png" alt="📈" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
</blockquote>
<p><img src="https://i2.wp.com/www.kenkoonwong.com/blog/pkpd/logo.png?w=578&#038;ssl=1" alt="" data-recalc-dims="1"></p>




<h2 id="motivations">Motivations
  <a href="https://www.kenkoonwong.com/blog/pkpd/#motivations" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<p>Learning pharmacokinetics (PK) and pharmacodynamics (PD) has always been an interest of mine. It’s challenging to read through population PK papers with all their numbers and notation. What better way to scratch the surface than to code a simulation that estimates the probability of target attainment (PTA) at different minimal inhibitory concentrations (MIC) and learn the basics along the way? Let’s dive in!</p>




<h4 id="disclaimer">Disclaimer:
  <a href="https://www.kenkoonwong.com/blog/pkpd/#disclaimer" rel="nofollow" target="_blank"></a>
</h4>
<p><em>I am not a pharmacist or an expert in PK/PD. This is documentation of my own learning, for educational purposes only; it is not medical advice. If you notice anything wrong here, please let me know!</em></p>




<h2 id="objectives">Objectives:
  <a href="https://www.kenkoonwong.com/blog/pkpd/#objectives" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<ul>
<li>
<a href="https://www.kenkoonwong.com/blog/pkpd/#poppk" rel="nofollow" target="_blank">What Is Population PK</a></li>
<li>
<a href="https://www.kenkoonwong.com/blog/pkpd/#param" rel="nofollow" target="_blank">What Are The Parameters of Interest On a Paper?</a></li>
<li>
<a href="https://www.kenkoonwong.com/blog/pkpd/#code" rel="nofollow" target="_blank">Let’s Code</a>
<ul>
<li>
<a href="https://www.kenkoonwong.com/blog/pkpd/#crcl" rel="nofollow" target="_blank">Different CrCl</a></li>
<li>
<a href="https://www.kenkoonwong.com/blog/pkpd/#albumin" rel="nofollow" target="_blank">Low Albumin</a></li>
<li>
<a href="https://www.kenkoonwong.com/blog/pkpd/#48" rel="nofollow" target="_blank">?q48 Dosing</a></li>
</ul>
</li>
<li>
<a href="https://www.kenkoonwong.com/blog/pkpd/#opportunities" rel="nofollow" target="_blank">Opportunities For Improvement</a></li>
<li>
<a href="https://www.kenkoonwong.com/blog/pkpd/#lessons" rel="nofollow" target="_blank">Lessons Learnt</a></li>
</ul>




<h2 id="poppk">What Is Population PK
  <a href="https://www.kenkoonwong.com/blog/pkpd/#poppk" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<p>Population pharmacokinetics (popPK) is a statistical approach that describes how medications behave in the body across groups of people, accounting for variability between individuals. Instead of studying one person intensively, popPK analyzes sparse data from many patients to understand typical medication behavior and why people differ in their medication exposure.</p>
<p>We’ll use 
<a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC3243010/pdf/bcp0072-0758.pdf" rel="nofollow" target="_blank">Garot D et al Population pharmacokinetics of ceftriaxone in critically ill septic patients: a reappraisal</a> as an example for learning.</p>




<h2 id="param">What Are The Parameters of Interest On a Paper?
  <a href="https://www.kenkoonwong.com/blog/pkpd/#param" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<p>From the paper, we can see that there are a lot of parameters and numbers. But what are the parameters of interest? We will focus on the following parameters:</p>
<p><img src="https://i1.wp.com/www.kenkoonwong.com/blog/pkpd/table3.png?w=578&#038;ssl=1" alt="" data-recalc-dims="1"></p>
<p>Looking at their <code>table 3</code>, we see these values:  <br>
<code>CL</code> = <code>\(\theta_1 + \theta_2 \cdot (CL_{cr}/4.26)\)</code>    <br>
<code>\(\theta_1\)</code> : Non-renal (baseline) clearance component  <br>
<code>\(\theta_2\)</code> : Renal clearance scaling coefficient   <br>
<code>V1</code> : Volume of distribution of the central compartment.   <br>
<code>V2</code> : Volume of distribution of the peripheral compartment.    <br>
<code>Q</code> : Inter-compartmental clearance.   <br>
<code>\(\omega^2 (CL)\)</code> : Between-subject variability of clearance.   <br>
<code>\(\omega^2 (V1)\)</code> : Between-subject variability of the central-compartment volume of distribution.   <br>
<code>\(\omega^2 (V2)\)</code> : Between-subject variability of the peripheral-compartment volume of distribution.</p>
<p>These are the parameters we’ll use in our mrgsolve model. I’d always wondered what these parameters represent; they were difficult to conceptualize until we dove into the code and the rationale clicked. It’s a mixed-effects model: each estimate is modeled as a function of a fixed effect (theta) and a random effect (eta), as you will see in the code later. The fixed effect represents the typical value of the parameter in the population, while the random effect represents the variability between individuals. The random effect is assumed to be normally distributed with a mean of zero and a variance of omega squared.</p>
<p>If we were to draw a flow chart of the above, it will look something like this:</p>
<p align="center">
  <img loading="lazy" src="https://i2.wp.com/www.kenkoonwong.com/blog/pkpd/flowchart.png?w=450&#038;ssl=1" alt="image" height="auto" data-recalc-dims="1">
</p>
<p>Medication is administered into the central compartment, from which it either distributes into the peripheral compartment (tissue, etc.) or is eliminated via clearance. Notice that <code>Q</code> is a bidirectional flow between the central and peripheral compartments, whereas every other arrow points either into the central compartment or out of it to clearance. This picture gave me a working surface-level understanding of drug distribution. Let’s get on with the code!</p>
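<p>Before reaching for mrgsolve, the flow chart can be sanity-checked with a few lines of base R. The sketch below integrates the two compartment equations with a simple Euler step, using the typical parameter values; as a simplification for illustration I use a 2 g bolus instead of the 30-minute infusion, and the clearance of 0.88 L/hr is the population value reported in the paper.</p>

```r
# Two-compartment model, amounts in mg, crude Euler integration (sketch only)
CL <- 0.88   # clearance (L/hr)
V1 <- 10.3   # central volume (L)
V2 <- 7.35   # peripheral volume (L)
Q  <- 5.28   # inter-compartmental clearance (L/hr)

dt <- 0.01                        # step size (hours)
CENT <- 2000; PERI <- 0           # 2 g bolus into the central compartment
cleared <- 0                      # cumulative amount eliminated

for (i in seq_len(24 / dt)) {     # simulate 24 hours
  C1 <- CENT / V1                 # central concentration (mg/L)
  C2 <- PERI / V2                 # peripheral concentration (mg/L)
  cleared <- cleared + CL * C1 * dt
  dCENT <- (-CL * C1 - Q * C1 + Q * C2) * dt
  dPERI <- ( Q * C1 - Q * C2) * dt
  CENT <- CENT + dCENT
  PERI <- PERI + dPERI
}

CENT + PERI + cleared             # mass balance: sums back to the 2000 mg given
```

<p>Clearance is the only exit from the system, which is why the three amounts sum exactly to the administered dose; the <code>$ODE</code> block in the mrgsolve model encodes the same two equations.</p>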




<h2 id="code">Let’s Code
  <a href="https://www.kenkoonwong.com/blog/pkpd/#code" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<pre>library(mrgsolve)
library(tidyverse)

mod &lt;- mcode(model = &quot;ceftriaxone&quot;, code=&#39;
$PARAM
theta1 = 0.56,
theta2 = 0.32,
CLcr   = 4.26,   // median creatinine clearance of 68.5 ml min-1, hence ~4.26 L hr-1
V1     = 10.3,
V2     = 7.35,
Q      = 5.28,  
fu     = 0.10  // fraction unbound; a static value picked from the package-insert range

$CMT CENT PERI

$MAIN
double CL  = theta1 + theta2 * (CLcr / 4.26);  // alternatively, use 0.88 as reported in their results section
double CLi = CL * exp(ETA(1));                  // ETA(1) ~ normal(0, omega^2), so exp(ETA) is log-normal
double V1i = V1 * exp(ETA(2));     
double V2i = V2 * exp(ETA(3));

$OMEGA
0.24   // omega2(CL) from table
0.23
0.42

$SIGMA
0.0576   // √0.0576 = 0.24, 

$ODE
dxdt_CENT = -(CLi/V1i)*CENT - (Q/V1i)*CENT + (Q/V2i)*PERI;
dxdt_PERI =  (Q/V1i)*CENT   - (Q/V2i)*PERI;

$TABLE
double Cp_total = (CENT / V1i)*(1+EPS(1)); 
double Cp_free = fu * Cp_total;

$CAPTURE Cp_free
&#39;)

dosing &lt;- ev(amt = 2000, rate = 4000, ii = 24, addl = 2, cmt = &quot;CENT&quot;)

set.seed(1)
sims &lt;- mod |&gt;
  ev(dosing) |&gt;
  mrgsim(nid = 1000, end = 72, delta = 0.25) |&gt;
  as_tibble()
</pre><p>ETA above is the random effect, assumed to be normally distributed with a mean of zero and a variance of omega squared (ETA is the Greek letter eta, pronounced “eh-ta”). EPS here is epsilon, the residual error term.</p>
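<p>To make the ETA mechanics concrete outside mrgsolve, here is a small base-R sketch that draws 1000 individual clearances the same way: a normal random effect with variance omega squared, exponentiated so each individual value is log-normal around the typical value (0.88 L/hr, the population clearance mentioned above).</p>

```r
set.seed(1)
CL     <- 0.88                         # typical (population) clearance, L/hr
omega2 <- 0.24                         # between-subject variance of ETA(1)
eta    <- rnorm(1000, mean = 0, sd = sqrt(omega2))
CLi    <- CL * exp(eta)                # individual clearances

summary(CLi)                           # all positive, median near 0.88
```

<p>Because ETA has mean zero, the <em>median</em> of <code>CLi</code> sits near the typical value, while the mean is pulled higher by the log-normal right tail.</p>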
<p>For <code>ev</code>, <code>amt</code> is the dose in <code>mg</code>; <code>rate</code> is the infusion rate (amount per hour); <code>ii</code> is the interdose interval; <code>addl</code> is the number of additional doses; <code>cmt</code> is the compartment receiving the dose. In this case, we are giving 2000 mg of ceftriaxone as a 30-minute infusion (2000 mg at 4000 mg/hr) every 24 hours for 3 doses (1 initial dose + 2 additional doses) into the central compartment.</p>
<p>We then set a seed for reproducibility and pipe the model and the dosing event into <code>mrgsim</code>: <code>nid</code> is the number of individuals to simulate, <code>end</code> is the end time of the simulation in hours, and <code>delta</code> is the time step of the simulation output in hours. In this case, we are simulating 1000 individuals for 72 hours at 0.25-hour (15-minute) intervals.</p>
<p>Next, we’ll calculate the probability of target attainment (PTA) for different minimal inhibitory concentration (MIC) values. The PTA is the probability that the free drug concentration exceeds the MIC for a certain percentage of the dosing interval.</p>
<pre>MIC &lt;- 1

print(paste0(&quot;Probability of Target Attainment: &quot;, sims |&gt;
  filter(time &gt;= 48) |&gt;
  group_by(ID) |&gt;
  summarise(fT = mean(Cp_free &gt; MIC)) |&gt;
  summarise(PTA = mean(fT &gt;= 0.50)) |&gt;
  pull()))

## [1] &quot;Probability of Target Attainment: 0.996&quot;

sims |&gt;
  ggplot(aes(x=time,y=Cp_free,group=ID)) +
  geom_line(alpha=0.01) +
  geom_hline(yintercept = MIC, color = &quot;red&quot;) +
  theme_bw()
</pre><img src="https://i0.wp.com/www.kenkoonwong.com/blog/pkpd/index_files/figure-html/unnamed-chunk-2-1.png?w=450&#038;ssl=1" data-recalc-dims="1" />
<p>In the above we chose an MIC of 1, kept only times at or after 48 hours (to approximate steady state), computed for each simulated subject the fraction of time free ceftriaxone exceeds the MIC (<code>fT</code>), and then took the proportion of subjects whose <code>fT</code> is at least 50%; that proportion is the PTA, here about 99.6%. We can also visualize the free ceftriaxone concentration over time, with the red line marking the MIC of 1. Not too shabby! Now let’s assess what happens when CrCl and albumin change. I’ll spare you the code.</p>
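<p>To see exactly what the two <code>summarise()</code> calls compute, here is a toy base-R example with three hypothetical subjects observed at eight time points each: one always above the MIC, one above half the time, one never above.</p>

```r
conc <- list(
  s1 = rep(2, 8),                          # always above MIC
  s2 = c(2, 2, 2, 2, 0.5, 0.5, 0.5, 0.5),  # above 50% of the time
  s3 = rep(0.5, 8)                         # never above
)
MIC <- 1

fT  <- sapply(conc, function(x) mean(x > MIC))  # per-subject fraction of time above MIC
PTA <- mean(fT >= 0.50)                         # fraction of subjects attaining the target

fT    # 1.0, 0.5, 0.0
PTA   # 2/3: subjects s1 and s2 attain the target
```

<p>The first <code>summarise()</code> corresponds to the <code>sapply()</code> step, the second to the final <code>mean()</code>.</p>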




<h3 id="crcl">Changes in CrCl
  <a href="https://www.kenkoonwong.com/blog/pkpd/#crcl" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h3>
<details>
<summary>code</summary>
<pre>library(glue)

crcl_vec &lt;- c(1.8,7.2,10.8)
crcl_vec_i &lt;- c(30, 120, 180)

# 30 ml/min = 1.8 L/hr
# 120 ml/min = 7.2 L/hr
# 180 ml/min = 10.8 L/hr

for (crcl in crcl_vec) {

mod &lt;- mcode(model = &quot;ceftriaxone&quot;, code=glue(&#39;
$PARAM
theta1 = 0.56,
theta2 = 0.32,
CLcr   = {crcl},
V1     = 10.3,
V2     = 7.35,
Q      = 5.28,  
fu     = 0.10  // fraction unbound; a static value picked from the package-insert range

$CMT CENT PERI

$MAIN
double CL  = theta1 + theta2 * (CLcr / 4.26);  // alternatively, use 0.88 as reported in their results section
double CLi = CL * exp(ETA(1));                  // ETA(1) ~ normal(0, omega^2), so exp(ETA) is log-normal
double V1i = V1 * exp(ETA(2));     
double V2i = V2 * exp(ETA(3));

$OMEGA
0.24   // omega2(CL) from table
0.23
0.42

$SIGMA
0.0576   // √0.0576 = 0.24, 

$ODE
dxdt_CENT = -(CLi/V1i)*CENT - (Q/V1i)*CENT + (Q/V2i)*PERI;
dxdt_PERI =  (Q/V1i)*CENT   - (Q/V2i)*PERI;

$TABLE
double Cp_total = (CENT / V1i)*(1+EPS(1)); 
double Cp_free = fu * Cp_total;

$CAPTURE Cp_free
&#39;,crcl))

dosing &lt;- ev(amt = 2000, rate = 4000, ii = 24, addl = 2, cmt = &quot;CENT&quot;)

set.seed(1)
sims &lt;- mod |&gt;
  ev(dosing) |&gt;
  mrgsim(nid = 1000, end = 72, delta = 0.25) |&gt;
  as_tibble()

MIC &lt;- 1

pta &lt;- paste0(&quot;Probability of Target Attainment: &quot;, sims |&gt;
  filter(time &gt;= 48) |&gt;
  group_by(ID) |&gt;
  summarise(fT = mean(Cp_free &gt; MIC)) |&gt;
  summarise(PTA = mean(fT &gt;= 0.50)) |&gt;
  pull(), &quot; ,CrCl: &quot;, crcl_vec_i[crcl_vec==crcl], &quot;ml/min&quot;)


plot &lt;- sims |&gt;
  ggplot(aes(x=time,y=Cp_free,group=ID)) +
  geom_line(alpha=0.01) +
  geom_hline(yintercept = MIC, color = &quot;red&quot;) +
  theme_bw() +
  ggtitle(pta)

plot(plot)
}
</pre></details>
<p><img src="https://i2.wp.com/www.kenkoonwong.com/blog/pkpd/index_files/figure-html/unnamed-chunk-4-1.png?w=450&#038;ssl=1" data-recalc-dims="1" /><img src="https://i0.wp.com/www.kenkoonwong.com/blog/pkpd/index_files/figure-html/unnamed-chunk-4-2.png?w=450&#038;ssl=1" data-recalc-dims="1" /><img src="https://i1.wp.com/www.kenkoonwong.com/blog/pkpd/index_files/figure-html/unnamed-chunk-4-3.png?w=450&#038;ssl=1" data-recalc-dims="1" /></p>
<p>That’s interesting, and it makes sense: increased CrCl increases the clearance of ceftriaxone, hence the drop in PTA. It’s still pretty good, though! But what is considered acceptable? 90%? 70%? 50%? Also, the PTA above is based on free ceftriaxone being above the MIC for 50% of the time; what is the acceptable threshold for that? <img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f937-200d-2642-fe0f.png" alt="🤷‍♂️" class="wp-smiley" style="height: 1em; max-height: 1em;" /> And since ceftriaxone is albumin-bound, what if we model albumin as well?</p>




<h3 id="albumin">Changes With Hypoalbuminemia
  <a href="https://www.kenkoonwong.com/blog/pkpd/#albumin" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h3>
<p>Notice that our initial model used a fixed fraction unbound (fu) of 0.1, the middle of the range reported in the package insert. However, in critically ill patients hypoalbuminemia is common and can increase the fraction of unbound drug, which affects the pharmacokinetics and pharmacodynamics of ceftriaxone. Let’s see how we can model this in our mrgsolve code. In their methods section, the paper used the formula below to estimate free ceftriaxone from total ceftriaxone:</p>
<p align="center">
  <img loading="lazy" src="https://i0.wp.com/www.kenkoonwong.com/blog/pkpd/formula.png?w=450&#038;ssl=1" alt="image" height="auto" data-recalc-dims="1">
</p>
<p>We’ll add that to our model and adjust <code>np</code> (the total concentration of protein binding sites) to reflect lower albumin (np = 295); this number again comes from the discussion section of the paper, where the median albumin was ~25 g/L.</p>
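<p>For intuition, the single-site binding model behind such a formula can be solved directly. The sketch below is my own derivation from the standard mass-balance quadratic, assuming <code>kaff</code> is an affinity constant so that <code>Kd = 1/kaff</code>; treat it as a cross-check of orders of magnitude, not necessarily the paper’s exact parameterization.</p>

```r
# total = free + bound, with bound = np * free / (Kd + free)
# => free^2 + (np + Kd - total) * free - Kd * total = 0
solve_free <- function(total, np, kaff) {
  kd <- 1 / kaff
  b  <- np + kd - total
  (-b + sqrt(b^2 + 4 * kd * total)) / 2   # positive root of the quadratic
}

free  <- solve_free(total = 100, np = 517, kaff = 0.0367)
bound <- 517 * free / (1 / 0.0367 + free)
free + bound   # recovers the total of 100
```

<p>A closed-form check like this is handy when validating a C++ binding helper: these solvers are easy to get subtly wrong, for example by dropping the division by 2 in the quadratic formula.</p>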
<details>
<summary>code</summary>
<pre>mod &lt;- mcode(model = &quot;ceftriaxone&quot;, code=&#39;
$PARAM
theta1 = 0.56,
theta2 = 0.32,
CLcr   = 4.26,
V1     = 10.3,
V2     = 7.35,
Q      = 5.28,
np     = 517,  
kaff   = 0.0367 

$OMEGA
0.24
0.23
0.42

$SIGMA
0.0576

$CMT CENT PERI

$GLOBAL
double solveFree(double CTOT, double np, double kaff) {
  double cf   = (-(np+1/kaff-CTOT)+sqrt(pow(np+1/kaff-CTOT,2.0)+(4.0*CTOT/kaff)));
  return cf &gt; 0 ? cf : 0;
}

$MAIN
double CL  = theta1 + theta2 * (CLcr / 4.26);
double CLi = CL * exp(ETA(1));
double V1i = V1 * exp(ETA(2));
double V2i = V2 * exp(ETA(3));

$ODE
double CTOT  = CENT / V1i;           // renamed: avoid clash with $TABLE
double CFREE = solveFree(CTOT, np, kaff);

dxdt_CENT = -CLi * CFREE
            - (Q / V1i) * CENT
            + (Q / V2i) * PERI;

dxdt_PERI =  (Q / V1i) * CENT
            - (Q / V2i) * PERI;

$TABLE
double CTOTAL      = CENT / V1i;      // notice this is not CTOT
double Cp_free     = solveFree(CTOTAL, np, kaff);
double Cp_bound    = CTOTAL - Cp_free;
double FU          = Cp_free / (CTOTAL + 1e-9);
double Cp_obs      = CTOTAL * (1 + EPS(1));

$CAPTURE CTOTAL Cp_free Cp_bound FU Cp_obs
&#39;)

dosing &lt;- ev(amt = 2000, rate = 4000, ii = 24, addl = 2, cmt = &quot;CENT&quot;)

set.seed(1)
sims &lt;- mod |&gt;
  param(np = 295) |&gt;
  ev(dosing) |&gt;
  mrgsim(nid = 1000, end = 72, delta = 0.25) |&gt;
  as_tibble()

MIC &lt;- 1

pta &lt;- paste0(&quot;Probability of Target Attainment: &quot;, sims |&gt;
  filter(time &gt;= 48) |&gt;
  group_by(ID) |&gt;
  summarise(fT = mean(Cp_free &gt; MIC)) |&gt;
  summarise(PTA = mean(fT &gt;= 0.50)) |&gt;
  pull(),&quot;, Albumin: ~25g/L, CrCl: ~63 ml/min&quot;)

plot &lt;- sims |&gt;
  ggplot(aes(x=time,y=Cp_free,group=ID)) +
  geom_line(alpha=0.01) +
  geom_hline(yintercept = MIC, color = &quot;red&quot;) +
  # geom_text(aes(x=20,y=150,label=pta)) +
  theme_bw() +
  ggtitle(pta)

plot(plot)
</pre></details>
<img src="https://i1.wp.com/www.kenkoonwong.com/blog/pkpd/index_files/figure-html/unnamed-chunk-6-1.png?w=450&#038;ssl=1" data-recalc-dims="1" />
<p>Wow, that’s interesting! After properly fitting the free ceftriaxone estimation into the model, the PTA actually improved even with lower albumin. What if we lower albumin further to ~15 g/L (np ≈ 172), increase CrCl to 180 ml/min, and raise the target to fT >= 0.7 (free ceftriaxone above the MIC more than 70% of the time)? Will the medication clear fast enough to miss the target?</p>
<img src="https://i0.wp.com/www.kenkoonwong.com/blog/pkpd/index_files/figure-html/unnamed-chunk-7-1.png?w=450&#038;ssl=1" data-recalc-dims="1" />
<p>PTA is still 100%!? Wow, ceftriaxone 2 g really is a beast! Hmmm… the free ceftriaxone is REALLY high, around ~200 to 300. Can we simulate q48h dosing and see what the PTA looks like, even for our worst-case scenario: low albumin, high CrCl, and still requiring free ceftriaxone above the MIC at least 70% of the time?</p>




<h3 id="48">?q48 Dosing
  <a href="https://www.kenkoonwong.com/blog/pkpd/#48" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h3>
<details>
<summary>code</summary>
<pre>dosing &lt;- ev(amt = 2000, rate = 4000, ii = 48, addl = 2, cmt = &quot;CENT&quot;)

set.seed(1)
sims &lt;- mod |&gt;
  param(np = 172, CLcr = 10.8) |&gt;
  ev(dosing) |&gt;
  mrgsim(nid = 1000, end = 144, delta = 0.25) |&gt;
  as_tibble()

MIC &lt;- 1

pta &lt;- paste0(&quot;PTA: &quot;, sims |&gt;
  filter(time &gt;= 48) |&gt;
  group_by(ID) |&gt;
  summarise(fT = mean(Cp_free &gt; MIC)) |&gt;
  summarise(PTA = mean(fT &gt;= 0.70)) |&gt;
  pull(),&quot;, Albumin: ~15g/L, CrCl: ~180 ml/min, fT &gt; mic &gt;= 70%, q48h dosing&quot;)

plot &lt;- sims |&gt;
  ggplot(aes(x=time,y=Cp_free,group=ID)) +
  geom_line(alpha=0.01) +
  geom_hline(yintercept = MIC, color = &quot;red&quot;) +
  theme_bw() +
  ggtitle(pta)

plot(plot)
</pre></details>
<img src="https://i0.wp.com/www.kenkoonwong.com/blog/pkpd/index_files/figure-html/unnamed-chunk-9-1.png?w=450&#038;ssl=1" data-recalc-dims="1" />
<p>Seriously!? PTA is still so high !? What does this actually mean? Is there literature on this? Maybe my code is not right… <img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f914.png" alt="🤔" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f937-200d-2642-fe0f.png" alt="🤷‍♂️" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<p>Let’s look at requiring free ceftriaxone to stay above the MIC at least 99% of the time.</p>
<pre>## [1] &quot;PTA: 0.979, Albumin: ~15g/L, CrCl: ~180 ml/min, fT &gt; mic &gt;= 99%, q48h dosing&quot;
</pre><p>If you know anything about this, please let me know! This is for organisms with MIC <= 1, with ceftriaxone 2 g. Again, note that this is purely for educational and learning purposes; the finding above is just a curious exploration, and I wonder if there is a coding error on my part. Click <code>code</code> above to expand for details. I also wonder whether most earlier trials were based on higher MICs, whereas the MICs for ceftriaxone nowadays are mainly <= 1. &#x1f914;</p>




<h2 id="opportunities">Opportunities For Improvement
  <a href="https://www.kenkoonwong.com/blog/pkpd/#opportunities" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<ul>
<li>Learn how popPK models are actually fit; this will really help us understand how those theta and omega estimates were obtained</li>
<li>I don’t quite understand the sigma portion yet; I will dive into this next time, especially how these values are estimated</li>
<li>Explore other PK/PD indices such as AUC/MIC and Cmax/MIC, and see how the PTA changes</li>
<li>Review the literature on what counts as an acceptable fraction of time free drug stays above the MIC, and an acceptable PTA</li>
</ul>




<h2 id="lessons">Lessons Learnt
  <a href="https://www.kenkoonwong.com/blog/pkpd/#lessons" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<ul>
<li>learnt some mrgsolve model coding (it uses C++)</li>
<li>learnt some basic PK/PD equations and popPK concepts</li>
<li>learnt about the two compartments</li>
<li>found an unexpected result for q48 dosing through simulation; still not sure whether it is real</li>
<li>learnt that the thetas are not tied to central/peripheral compartments; rather, theta1 is baseline clearance and theta2 is (presumably) renal scaling</li>
</ul>
<p>If you like this article:</p>
<ul>
<li>please feel free to send me a 
<a href="https://www.kenkoonwong.com/blog/" rel="nofollow" target="_blank">comment or visit my other blogs</a></li>
<li>please feel free to follow me on 
<a href="https://bsky.app/profile/kenkoonwong.bsky.social" rel="nofollow" target="_blank">BlueSky</a>, 
<a href="https://twitter.com/kenkoonwong/" rel="nofollow" target="_blank">twitter</a>, 
<a href="https://github.com/kenkoonwong/" rel="nofollow" target="_blank">GitHub</a> or 
<a href="https://rstats.me/@kenkoonwong" rel="nofollow" target="_blank">Mastodon</a></li>
<li>if you would like to collaborate, please feel free to 
<a href="https://www.kenkoonwong.com/contact/" rel="nofollow" target="_blank">contact me</a></li>
</ul>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://www.kenkoonwong.com/blog/pkpd/"> r on Everyday Is A School Day</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/learning-pk-pd-simulation-a-beginners-monte-carlo-analysis-with-mrgsolve-in-r/">Learning PK/PD Simulation: A Beginner’s Monte Carlo Analysis With mrgsolve in R</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">399704</post-id>	</item>
		<item>
		<title>DuckDB + dbplyr: When Your Pipeline Gives Different Results Every Time It Runs</title>
		<link>https://www.r-bloggers.com/2026/03/duckdb-dbplyr-when-your-pipeline-gives-different-results-every-time-it-runs/</link>
		
		<dc:creator><![CDATA[Rtask]]></dc:creator>
		<pubDate>Sun, 08 Mar 2026 18:39:48 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://rtask.thinkr.fr/?p=29639</guid>

					<description><![CDATA[<p>You can read the original post in its original format on Rtask website by ThinkR here: DuckDB + dbplyr: When Your Pipeline Gives Different Results Every Time It Runs<br />
Short on time? Here’s the gist: DuckDB parallelizes query execution and never guarantees row order unless you explicitly ask for it. ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/duckdb-dbplyr-when-your-pipeline-gives-different-results-every-time-it-runs/">DuckDB + dbplyr: When Your Pipeline Gives Different Results Every Time It Runs</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://rtask.thinkr.fr/duckdb-dbplyr-when-your-pipeline-gives-different-results-every-time-it-runs/"> Rtask</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p>You can read the original post in its original format on <a rel="nofollow" href="https://rtask.thinkr.fr/" target="_blank">Rtask</a> website by ThinkR here: <a rel="nofollow" href="https://rtask.thinkr.fr/duckdb-dbplyr-when-your-pipeline-gives-different-results-every-time-it-runs/" target="_blank">DuckDB + dbplyr: When Your Pipeline Gives Different Results Every Time It Runs</a></p>
<blockquote><p>
  Short on time? Here’s the gist: DuckDB parallelizes query execution and <strong>never guarantees row order</strong> unless you explicitly ask for it. If any step in your pipeline is order-sensitive (<code>row_number()</code>, <code>cumsum()</code>, <code>lag()</code>, <code>distinct(.keep_all = TRUE)</code>, inequality joins), you are silently producing non-deterministic results. This post shows the four patterns that bite you and how to fix each one.
</p></blockquote>
<hr />
<h2>The Setup: A SAS Pipeline, Now in R</h2>
<p>You have inherited (or written) a data pipeline originally coded in SAS. It processes administrative billing records: matching line items against reference tables, applying time-varying coefficients, deduplicating based on business identifiers, computing running counters. Classic ETL work.</p>
<p>The migration to R goes well. You use <code>{DBI}</code> to open a DuckDB connection, load your source files as lazy tables via <code>{arrow}</code> or <code>dplyr::tbl()</code>, build the transformations with <code>{dbplyr}</code>, and collect the result at the very end. Your code is readable, your tests compare the R output to the SAS reference, and they pass (maybe using <a href="https://github.com/ThinkR-open/datadiff" rel="nofollow" target="_blank">{datadiff}</a>).</p>
<p>Then you run the pipeline again.</p>
<p>The numbers are different.</p>
<p>Not wildly different. A few rows shifted, a few amounts swapped. Exactly the kind of difference that would slip through a quick visual check but break a reconciliation report. You run it ten more times. Seven match the first run. Three match the second. You are now staring at intermittent, data-dependent non-determinism, which is the worst kind.</p>
<p>This post documents the four root causes we encountered in production and the patterns that fix them.</p>
<hr />
<h2>Why DuckDB Is Different From What You Expect</h2>
<p>In SAS, the data step processes rows in physical order, the order they sit on disk. That order is stable. Procedures like <code>PROC SORT</code> make it explicit. The whole language is built around the idea that row order matters and is predictable.</p>
<p>DuckDB is a columnar, parallel query engine. It splits work across CPU cores, processes data in chunks (vectors), and reassembles results. <strong>The order in which chunks are processed is not guaranteed.</strong> It depends on the query plan, the number of threads, the size of the data, and internal scheduling decisions that can change between runs.</p>
<p>This is not a bug. It is the expected behavior of any modern analytical database. The SQL standard does not define a row order unless you write <code>ORDER BY</code>. DuckDB simply makes this visible in ways that SQLite or an in-memory data frame do not, because it actually parallelizes.</p>
<p>The consequence for <code>{dbplyr}</code> users: any R code that <em>implicitly</em> relies on row order, even if it looks like ordinary dplyr, will produce unpredictable results when translated to SQL and executed by DuckDB.</p>
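<p>You can observe this directly. The following is a minimal sketch (not from the original post): the same unordered query may legally return rows in a different order between runs, although on small data the runs often coincide because parallelism only kicks in at scale.</p>
<pre>library(DBI)
library(duckdb)

con &lt;- dbConnect(duckdb::duckdb())
dbWriteTable(con, &quot;tx&quot;, data.frame(id = 1:100000, amount = runif(100000)))

# No ORDER BY: the row order of the result is unspecified
r1 &lt;- dbGetQuery(con, &quot;SELECT id FROM tx LIMIT 5&quot;)
r2 &lt;- dbGetQuery(con, &quot;SELECT id FROM tx LIMIT 5&quot;)
identical(r1, r2)   # not guaranteed to be TRUE

dbDisconnect(con, shutdown = TRUE)
</pre>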
<hr />
<h2>Source 1, Window Functions Without an Explicit Order</h2>
<p>This is the most common culprit.</p>
<h3>The problem</h3>
<pre># Looks fine. It isn't.
data |&gt;
  group_by(entity_id) |&gt;
  mutate(rn = row_number()) |&gt;
  filter(rn == 1)
</pre>
<p><code>row_number()</code> without an order clause assigns numbers in whatever order the rows happen to arrive at the window function. In DuckDB that order is non-deterministic. The row you keep is random.</p>
<p>The same applies to <code>cumsum()</code>, <code>lag()</code>, and <code>lead()</code>:</p>
<pre># cumsum() accumulates in random order if rows aren't sorted first
data |&gt;
  group_by(entity_id, invoice_id, delay) |&gt;
  mutate(counter = cumsum(code == &quot;TYPE_A&quot;))

# lag() reads the &quot;previous&quot; row, undefined if order is undefined
data |&gt;
  group_by(code) |&gt;
  mutate(prev_rate = lag(rate))
</pre>
<h3>The fix: <code>window_order()</code> before every window function</h3>
<p><code>dbplyr</code> provides <code>window_order()</code> to inject an <code>ORDER BY</code> clause inside the window frame. The key is that the columns listed must collectively <strong>break all ties</strong> within a group, otherwise rows with identical sort keys are still processed in random order.</p>
<pre># WRONG, all rows in the same group have identical values for these three columns
# The tie is never broken
data |&gt;
  window_order(entity_id, invoice_id, delay) |&gt;
  group_by(entity_id, invoice_id, delay) |&gt;
  mutate(rn = row_number())

# CORRECT, row_id is unique per line and breaks every tie
data |&gt;
  window_order(entity_id, invoice_id, delay, row_id) |&gt;
  group_by(entity_id, invoice_id, delay) |&gt;
  mutate(rn = row_number())
</pre>
<p><strong>Rule:</strong> the <code>window_order()</code> key must include at least one column that is unique within the group. The columns of <code>group_by()</code> alone are never sufficient, they are identical for every row in the group by definition.</p>
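<p>One practical question: where does <code>row_id</code> come from? It has to be created while row order is still deterministic, i.e. before the data enters DuckDB; generating it inside DuckDB with an unordered <code>row_number()</code> would itself be non-deterministic. A sketch (<code>con</code> and <code>local_df</code> are placeholders):</p>
<pre># Create the tiebreaker in the local data frame, where order is stable,
# then load the table into DuckDB
local_df &lt;- local_df |&gt;
  dplyr::mutate(row_id = dplyr::row_number())

DBI::dbWriteTable(con, &quot;tx&quot;, local_df)
data &lt;- dplyr::tbl(con, &quot;tx&quot;)
</pre>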
<hr />
<h2>Source 2, <code>distinct(.keep_all = TRUE)</code></h2>
<h3>The problem</h3>
<p><code>distinct()</code> without <code>.keep_all</code> is safe: it only retains the columns listed, which are identical across all matching rows by definition. But <code>.keep_all = TRUE</code> asks DuckDB to also return the <em>other</em> columns from <em>one</em> of the matching rows, and it picks arbitrarily.</p>
<pre># If multiple rows share (client_id, product_id) with different amounts,
# the amount you get back is random
data |&gt;
  distinct(client_id, product_id, .keep_all = TRUE)

# Adding a filter upstream doesn't save you if the filter can still
# return multiple rows per group
data |&gt;
  group_by(client_id, product_id) |&gt;
  filter(date == min(date, na.rm = TRUE)) |&gt;  # ties on date → multiple rows
  ungroup() |&gt;
  distinct(client_id, product_id, .keep_all = TRUE)   # ← still random
</pre>
<h3>Option A: <code>summarise()</code> when you only need one aggregated value</h3>
<pre>data |&gt;
  group_by(client_id, product_id) |&gt;
  summarise(
    first_date = min(date, na.rm = TRUE),
    .groups = &quot;drop&quot;
  )
</pre>
<h3>Option B: <code>window_order() + filter(row_number() == 1L)</code> when you need the whole row</h3>
<pre>data |&gt;
  group_by(client_id, product_id) |&gt;
  window_order(date, desc(amount)) |&gt;   # explicit, deterministic choice
  filter(row_number() == 1L) |&gt;
  ungroup()
</pre>
<p>The second option lets you express <em>which</em> row you actually want, which is almost always what the business logic intended in the first place.</p>
<hr />
<h2>Source 3, Inequality Joins That Create a Fan-Out</h2>
<p>This one is subtle and data-dependent, which makes it especially dangerous.</p>
<h3>The problem</h3>
<p>A common pattern in billing pipelines is joining a transaction table against a reference table of time-varying rates or coefficients:</p>
<pre>data |&gt;
  left_join(
    ref_rates,
    by = join_by(code, date &gt;= rate_start, date &lt;= rate_end)
  )
</pre>
<p>If <code>ref_rates</code> has two overlapping validity periods for the same <code>code</code>, say one row covers Jan–Dec and another covers Jul–Dec for a corrected value, then every transaction in that period matches <em>two</em> rows in <code>ref_rates</code>. The join doubles those rows (fan-out ×2).</p>
<p>This fan-out then propagates silently through every downstream step. Your <code>cumsum()</code> accumulates double. Your <code>row_number()</code> sees duplicate keys and becomes non-deterministic even with a <code>window_order()</code> that was previously sufficient.</p>
<p>The worst part: this only manifests for the specific <code>code</code> values that happen to have overlapping periods in your reference data. It may affect one entity out of fifty, making it look like a rare data quality issue rather than a structural pipeline bug.</p>
<pre># Verify whether a fan-out has already occurred
data_after_join |&gt;
  count(entity_id, line_id) |&gt;
  filter(n &gt; 1) |&gt;
  collect()
# Non-empty → fan-out confirmed
</pre>
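<p>It is also worth checking the reference table itself for overlaps before any join. A sketch, assuming dbplyr&#8217;s inequality-join support (dbplyr &gt;= 2.4); note that pairs of periods with identical <code>rate_start</code> are not caught by this condition:</p>
<pre># Find codes whose validity periods overlap in the reference table
# (counts each overlapping pair once, via the strict inequality)
ref_rates |&gt;
  inner_join(
    ref_rates |&gt; rename(start_b = rate_start, end_b = rate_end),
    by = join_by(code, rate_start &lt; start_b, rate_end &gt;= start_b)
  ) |&gt;
  count(code) |&gt;
  collect()
# Non-empty → overlapping periods exist for those codes
</pre>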
<h3>The fix: pre-resolve by (key × date) before the equi-join</h3>
<p>Instead of joining the full transaction table against the reference with an inequality condition, first build a small lookup that maps each unique (key, date) pair to exactly one reference row:</p>
<pre># Step 1: find all unique (code, date) combinations present in the data
# Step 2: apply the inequality join only on this small lookup
# Step 3: deduplicate to one row per (code, date), choosing explicitly which period wins
# Step 4: join back to the full table with a simple equi-join, no fan-out possible

rates_resolved &lt;- data |&gt;
  distinct(code, date) |&gt;
  left_join(
    ref_rates,
    by = join_by(code, date &gt;= rate_start, date &lt;= rate_end)
  ) |&gt;
  group_by(code, date) |&gt;
  window_order(desc(rate_start), desc(rate_end)) |&gt;  # most recent period wins
  filter(row_number() == 1L) |&gt;
  ungroup() |&gt;
  select(-rate_start, -rate_end)

data &lt;- data |&gt;
  left_join(rates_resolved, by = c(&quot;code&quot;, &quot;date&quot;))  # equi-join, safe
</pre>
<p>Note that you should <strong>not</strong> deduplicate the reference table globally by key before the join. That would discard non-overlapping historical periods that are still valid for other dates. The pre-resolution must be surgical: resolve only the pairs where multiple periods are simultaneously valid for a given target date.</p>
<hr />
<h2>Source 4, Synthetic Rows That Are Perfectly Identical</h2>
<h3>The problem</h3>
<p>Some pipelines expand rows based on a quantity field: an invoice line with <code>qty = 3</code> becomes three separate line items. If you discard the expansion index after duplicating, the three rows become perfect duplicates, identical on every column. No <code>window_order()</code> can distinguish them.</p>
<pre># Expansion creates qty identical rows, then throws away the only discriminant
data |&gt;
  slice(rep(seq_len(n()), times = qty)) |&gt;
  select(-qty)   # ← now you have perfect duplicates
</pre>
<p>Any downstream window function operating on these rows will produce arbitrary results because the engine has no way to deterministically assign numbers or order to indistinguishable objects.</p>
<h3>The fix: keep the expansion index as a tiebreaker</h3>
<pre># Keep the position within the expansion as a discriminant column
# (`series` is the 1, 2, 3, … index produced by the expansion step)
expanded &lt;- data |&gt;
  mutate(series = as.integer(series))

# Include it in every downstream window_order
expanded |&gt;
  window_order(entity_id, line_id, series) |&gt;
  group_by(entity_id) |&gt;
  mutate(rn = row_number())

# Drop it only at the very end of the pipeline, after all window operations
result |&gt;
  select(-series)
</pre>
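<p>Note that <code>slice(rep(...))</code> itself does not translate to SQL, so a database-side expansion needs another route. One sketch, assuming dbplyr&#8217;s inequality-join support and a known upper bound on <code>qty</code> (here 10, an assumption):</p>
<pre># Expand each row qty times database-side; the series index (1 … qty)
# comes for free as the tiebreaker column
numbers &lt;- copy_to(con, data.frame(series = 1:10), &quot;numbers&quot;, overwrite = TRUE)

expanded &lt;- data |&gt;
  inner_join(numbers, by = join_by(qty &gt;= series))
</pre>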
<p>The same logic applies whenever you <code>union_all()</code> tables that might contain identical rows: add a source tag before the union so downstream steps can use it as a tiebreaker.</p>
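<p>A sketch of that pattern (<code>table_a</code> and <code>table_b</code> are placeholders):</p>
<pre># Tag each branch before the union so downstream window_order() calls
# can break ties between otherwise identical rows
a_tagged &lt;- table_a |&gt; mutate(src = &quot;a&quot;)
b_tagged &lt;- table_b |&gt; mutate(src = &quot;b&quot;)

union_all(a_tagged, b_tagged) |&gt;
  group_by(entity_id) |&gt;
  window_order(entity_id, line_id, src) |&gt;
  mutate(rn = row_number())
</pre>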
<hr />
<h2>Bonus: Type-Dependent Deduplication</h2>
<p>A related trap: when a table contains multiple row types that share a key column, a single deduplication pass using one type’s counter will silently drop the other type’s rows.</p>
<pre># records contains TYPE_A and TYPE_B rows sharing the same entity_id
# Deduplicating by (entity_id, counter_a) eliminates TYPE_B rows
# because counter_a is the same for both types within a given entity_id
records |&gt;
  group_by(entity_id, counter_a) |&gt;
  window_order(entity_id, counter_a, line_id) |&gt;
  filter(row_number() == 1L) |&gt;
  ungroup()
</pre>
<p>The fix is to split into branches and apply the correct counter to each type:</p>
<pre>records_a &lt;- records |&gt;
  filter(type != &quot;TYPE_B&quot;) |&gt;
  group_by(entity_id, counter_a) |&gt;
  window_order(entity_id, counter_a, line_id) |&gt;
  filter(row_number() == 1L) |&gt;
  ungroup()

records_b &lt;- records |&gt;
  filter(type == &quot;TYPE_B&quot;) |&gt;
  group_by(entity_id, counter_b) |&gt;
  window_order(entity_id, counter_b, line_id) |&gt;
  filter(row_number() == 1L) |&gt;
  ungroup()

records_final &lt;- union_all(records_a, records_b)
</pre>
<hr />
<h2>Checklist Before You Ship DuckDB/dbplyr Code</h2>
<p>Copy this into your code review template:</p>
<p><strong>Window functions</strong><br />
– [ ] Every <code>mutate(rn = row_number())</code> is preceded by <code>window_order()</code> with a key that breaks all ties within the group<br />
– [ ] Every <code>mutate(x = cumsum(...))</code> is preceded by <code>window_order()</code> that includes at least one column unique within the group<br />
– [ ] Every <code>mutate(prev = lag(...))</code> is preceded by a deterministic <code>window_order()</code><br />
– [ ] No <code>window_order()</code> call relies exclusively on <code>group_by()</code> columns</p>
<p><strong>distinct()</strong><br />
– [ ] No <code>distinct(..., .keep_all = TRUE)</code> is used unless the upstream filter is guaranteed to return exactly one row per group<br />
– [ ] All <code>distinct(.keep_all = TRUE)</code> have been replaced by <code>summarise()</code> or <code>window_order() + filter(row_number() == 1L)</code></p>
<p><strong>Inequality joins</strong><br />
– [ ] Every <code>join_by(key, date &gt;= start, date &lt;= end)</code> is followed by a check that no two periods in the reference table overlap for the same key<br />
– [ ] Where overlap is possible, the pre-resolution pattern (key × target date) is used instead of a direct join<br />
– [ ] Deduplication after an inequality join is on (key × target date), not on (key) alone</p>
<p><strong>Synthetic rows</strong><br />
– [ ] Every <code>slice(rep(...))</code> or equivalent expansion retains an index column usable as a tiebreaker in downstream <code>window_order()</code> calls<br />
– [ ] That index column is dropped only after all window operations are complete</p>
<p><strong>Type-dependent logic</strong><br />
– [ ] When deduplication logic differs by row type, each type is processed in a separate branch with its own reference counter</p>
<hr />
<h2>How to Detect Residual Non-Determinism</h2>
<p>The most direct method: run the pipeline multiple times and compare the aggregate output.</p>
<pre>library(purrr)

runs &lt;- map(1:8, function(i) {
  source(&quot;pipeline.R&quot;)
  result_table |&gt;
    summarise(
      total_amount = sum(amount, na.rm = TRUE),
      n_rows = n()
    ) |&gt;
    collect()
})

map_dfr(runs, identity)
# If total_amount or n_rows varies across the 8 runs → residual non-determinism
</pre>
<p>If you find variation, binary-search your pipeline: collect intermediate tables at the midpoint of your transformation chain and run the first half N times. If the midpoint is stable, the bug is in the second half. Repeat until you isolate the step where variation first appears.</p>
<hr />
<h2>Conclusion</h2>
<p>DuckDB is an excellent tool for this kind of work, fast, embeddable, compatible with Arrow and Parquet, and it composes beautifully with <code>{dbplyr}</code>. But it is not a data frame with SQL syntax. It is a parallel query engine, and it will silently expose every assumption your code makes about row order.</p>
<p>The good news: all four patterns described here are fixable without restructuring your pipeline. The rules are simple once you internalize them:</p>
<ol>
<li>Every order-sensitive window operation needs an explicit <code>window_order()</code> with a true tiebreaker.</li>
<li><code>distinct(.keep_all = TRUE)</code> is a code smell in DuckDB, replace it with an explicit choice.</li>
<li>Inequality joins need a pre-resolution step if the reference table can have overlapping periods.</li>
<li>Synthetic rows need to keep their expansion index until the end.</li>
</ol>
<p>The tricky part is that none of these bugs announce themselves. The code runs without errors, tests pass on some data, and the difference between two runs can be as small as one row in a thousand. The only defense is systematic code review against the checklist above, and running your pipeline more than once during development.</p>
<p>This post is better presented on its original ThinkR website here: <a rel="nofollow" href="https://rtask.thinkr.fr/duckdb-dbplyr-when-your-pipeline-gives-different-results-every-time-it-runs/" target="_blank">DuckDB + dbplyr: When Your Pipeline Gives Different Results Every Time It Runs</a></p>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://rtask.thinkr.fr/duckdb-dbplyr-when-your-pipeline-gives-different-results-every-time-it-runs/"> Rtask</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/duckdb-dbplyr-when-your-pipeline-gives-different-results-every-time-it-runs/">DuckDB + dbplyr: When Your Pipeline Gives Different Results Every Time It Runs</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">399677</post-id>	</item>
		<item>
		<title>Explaining Time-Series Forecasts with Exact Shapley Values (ahead::dynrmf with external regressors applied to scenarios)</title>
		<link>https://www.r-bloggers.com/2026/03/explaining-time-series-forecasts-with-exact-shapley-values-aheaddynrmf-with-external-regressors-applied-to-scenarios/</link>
		
		<dc:creator><![CDATA[T. Moudiki]]></dc:creator>
		<pubDate>Sun, 08 Mar 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://thierrymoudiki.github.io//blog/2026/03/08/r/exact-shapley-dynrmf</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> Explaining Time-Series Forecasts with Exact Shapley Values (ahead::dynrmf with external regressors applied to macroeconomic scenarios)</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/explaining-time-series-forecasts-with-exact-shapley-values-aheaddynrmf-with-external-regressors-applied-to-scenarios/">Explaining Time-Series Forecasts with Exact Shapley Values (ahead::dynrmf with external regressors applied to scenarios)</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://thierrymoudiki.github.io//blog/2026/03/08/r/exact-shapley-dynrmf"> T. Moudiki's Webpage - R</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p>Shapley values are a widely adopted way to attribute the contribution of each feature (explanatory variable) to a model&#8217;s prediction. They are mostly used in supervised learning; this post illustrates how to use them to explain time-series forecasts, with exact Shapley values, based on the <code>ahead::dynrmf</code> model with external regressors.</p>

<p>The code below uses the <code>ahead</code> package to compute exact Shapley values for a time-series forecast. It uses the <code>ahead::dynrmf_shap</code> function to compute the Shapley values and the <code>ahead::plot_dynrmf_shap_waterfall</code> function to plot them.</p>

<p>First, install the package:</p>

<pre>devtools::install_github(&quot;Techtonique/ahead&quot;)
</pre>

<p>Then, run the following code (applies Shapley values to the <code>dynrmf</code> model, for different scenarios). I use the <code>uschange</code> dataset (quarterly changes in US macroeconomic variables) from the <code>fpp2</code> package. The target time series variable is <code>Consumption</code>; the regressors are <code>Income</code>, <code>Savings</code>, and <code>Unemployment</code> (scaled).</p>

<pre>library(fpp2); library(ahead); library(e1071); library(misc)
library(ggplot2); library(patchwork)

y       &lt;- fpp2::uschange[, &quot;Consumption&quot;]
xreg    &lt;- scale(fpp2::uschange[, c(&quot;Income&quot;, &quot;Savings&quot;, &quot;Unemployment&quot;)])
split   &lt;- misc::splitts(y, split_prob = 0.9)
xreg_train &lt;- window(xreg, start = start(split$training), end = end(split$training))
xreg_test &lt;- window(xreg, start = start(split$testing),  end = end(split$testing))

shap &lt;- ahead::dynrmf_shap(
  y            = split$training,
  xreg_fit     = xreg_train,
  xreg_predict = xreg_test,
  fit_func     = e1071::svm
)

p1 &lt;- ahead::plot_dynrmf_shap_waterfall(shap, title = &quot;Baseline scenario&quot;)

xreg_pess &lt;- xreg_test
xreg_pess[, &quot;Income&quot;]  &lt;- -1
xreg_pess[, &quot;Savings&quot;] &lt;- -0.5

shap_pess &lt;- dynrmf_shap(
  y            = split$training,
  xreg_fit     = xreg_train,
  xreg_predict = xreg_pess,
  fit_func     = e1071::svm
)

p2 &lt;- ahead::plot_dynrmf_shap_waterfall(shap_pess, title = &quot;Pessimistic scenario&quot;)

xreg_opt  &lt;- xreg_test
xreg_opt[, &quot;Income&quot;]  &lt;- 2
xreg_opt[, &quot;Savings&quot;] &lt;- 0.5

shap_opt &lt;- dynrmf_shap(
  y            = split$training,
  xreg_fit     = xreg_train,
  xreg_predict = xreg_opt,
  fit_func     = e1071::svm
)

p3 &lt;- ahead::plot_dynrmf_shap_waterfall(shap_opt, title = &quot;Optimistic scenario&quot;)

xreg_ovr  &lt;- xreg_test
xreg_ovr[, &quot;Income&quot;]  &lt;- 2.5
xreg_ovr[, &quot;Savings&quot;] &lt;- 0.75

shap_ovr &lt;- ahead::dynrmf_shap(
  y            = split$training,
  xreg_fit     = xreg_train,
  xreg_predict = xreg_ovr,
  fit_func     = e1071::svm
)

p4 &lt;- plot_dynrmf_shap_waterfall(shap_ovr, title = &quot;Overly optimistic scenario&quot;)

(p1 + p2)/(p3 + p4)
</pre>

<p><img src="https://i0.wp.com/thierrymoudiki.github.io/images/2026-03-08/2026-03-08-image1.png?w=578&#038;ssl=1" alt="image-title-here" class="img-responsive" data-recalc-dims="1" /></p>

<p>One check that is always good practice with Shapley values is to verify that the contributions sum to the difference between the prediction and the baseline forecast (the model forecast when every regressor is replaced by its training-set column mean). That is the case in the plots above.</p>

<p>It&#8217;s worth mentioning that exact Shapley values are computable here because there are only a few external regressors: the exact computation enumerates every coalition of features, so its cost grows as <code>2^p</code> in the number of regressors <code>p</code>. It remains feasible for fewer than about 15 regressors, which is a realistic count in this setting.</p>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://thierrymoudiki.github.io//blog/2026/03/08/r/exact-shapley-dynrmf"> T. Moudiki's Webpage - R</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/explaining-time-series-forecasts-with-exact-shapley-values-aheaddynrmf-with-external-regressors-applied-to-scenarios/">Explaining Time-Series Forecasts with Exact Shapley Values (ahead::dynrmf with external regressors applied to scenarios)</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">399671</post-id>	</item>
		<item>
		<title>Pacific island remittances by @ellis2013nz</title>
		<link>https://www.r-bloggers.com/2026/03/pacific-island-remittances-by-ellis2013nz/</link>
		
		<dc:creator><![CDATA[free range statistics - R]]></dc:creator>
		<pubDate>Sat, 07 Mar 2026 13:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://freerangestats.info/blog/2026/03/08/pacific-remittances</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> This post is the sixth of a series of seven on population issues in the Pacific, re-generating the charts I used in a keynote speech before the November 2025 meeting of the Pacific Heads of Planning and Statistics in Wellington, New Zealand. The seven ...</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/pacific-island-remittances-by-ellis2013nz/">Pacific island remittances by @ellis2013nz</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://freerangestats.info/blog/2026/03/08/pacific-remittances"> free range statistics - R</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p>This post is the sixth of a series of seven on population issues in the Pacific, re-generating the charts I used in a keynote speech before the November 2025 meeting of the Pacific Heads of Planning and Statistics in Wellington, New Zealand. The seven pieces of the puzzle are:</p>
<ul>
  <li><a href="https://freerangestats.info/blog/2025/11/30/pacific-population" rel="nofollow" target="_blank">Visual summaries of population size and growth</a></li>
  <li><a href="https://freerangestats.info/blog/2025/12/04/pacific-net-migration" rel="nofollow" target="_blank">Net migration</a></li>
  <li><a href="https://freerangestats.info/blog/2026/02/16/pacific-cities" rel="nofollow" target="_blank">World cities with the most Pacific Islanders</a></li>
  <li><a href="https://freerangestats.info/blog/2026/02/18/pacific-diaspora" rel="nofollow" target="_blank">Pacific diaspora</a></li>
  <li><a href="https://freerangestats.info/blog/2026/03/01/pacific-pyramids" rel="nofollow" target="_blank">Population pyramids</a></li>
  <li>Remittances (This post today)</li>
  <li>Tying it all together (to come)</li>
</ul>

<p>Remittances are payments from family or other contacts overseas, typically in a higher-income country. The source of remittances can be people on relatively short trips overseas (in the Pacific, examples include participants in the Pacific Australia Labour Mobility scheme or the New Zealand Recognised Seasonal Employer scheme) or long-term migrants who have made the other country their indefinite home.</p>

<p>The distinction between the two types of duration is important for where these funds appear in the National Accounts, but unfortunately is difficult to measure statistically. Banks can keep track of how much money is being transferred and give this information to a central bank or national statistical office, but generally will not be able to classify the sources as short term or long term residents.</p>

<p>The implications of all this, in the context of how many Pacific islanders reside overseas and where (the subject of previous posts in this series) will all be discussed later. But for now, here is the chart of Pacific remittances:</p>
<object type="image/svg+xml" data="https://freerangestats.info/img/0315-remittances-bar.svg" width="450"><img src="https://i2.wp.com/freerangestats.info/img/0315-remittances-bar.png?w=450&#038;ssl=1" data-recalc-dims="1" /></object>

<p>This is designed mostly to a) show how a number of Pacific countries have very high levels of remittances relative to their national economy (more than 40% of GDP for Tonga) compared to world averages and b) highlight a few of the Pacific island countries that are most extreme in this respect. Sometimes a simple bar chart is all you need to make the point, although this one isn&#8217;t as simple as it might seem at first glance: quite a bit of thought went into sequencing the country categories along the bottom to maximise the impact, and of course into colour-coding the bars to distinguish the Pacific countries from the global comparators.</p>

<p>Here’s the code to produce this chart. Super simple today, just pulling the data from the World Bank’s World Development Indicators and turning it into a single chart:</p>

<figure class="highlight"><pre># This script draws a simple bar chart of the latest year of remittances data
#
# Peter Ellis November 2025

library(WDI)
library(tidyverse)
library(glue)

picts &lt;- c(
  &quot;Fiji&quot;, &quot;New Caledonia&quot;, &quot;Papua New Guinea&quot;, &quot;Solomon Islands&quot;,                                             
  &quot;Guam&quot;, &quot;Kiribati&quot;, &quot;Marshall Islands&quot;, &quot;Micronesia, Fed. Sts.&quot;, &quot;Nauru&quot;,
  &quot;Vanuatu&quot;, &quot;Northern Mariana Islands&quot;,&quot;Palau&quot;, &quot;American Samoa&quot;, &quot;Cook Islands&quot;,
  &quot;French Polynesia&quot;, &quot;Niue&quot;, &quot;Samoa&quot;, &quot;Tokelau&quot;, &quot;Tonga&quot;, &quot;Tuvalu&quot;, &quot;Wallis and Futuna Islands&quot; 
)
length(picts)
sort(picts) # all 22 SPC PICT members except for Pitcairn

# Used this to see what series are available:
# WDIsearch(&quot;remittance&quot;) |&gt;  View()
#
# Download data from World Bank's World Development Indicators.
# Apparently worker remittances is a subset of personal. But
# the worker remittances are all NA anyway:

remit &lt;- WDI(indicator = c(personal = &quot;BX.TRF.PWKR.DT.GD.ZS&quot;,
                           worker = &quot;BX.TRF.PWKR.GD.ZS&quot;), start = 2000) |&gt; 
  as_tibble()

# which countries have we got?
sort(unique(remit$country))

# check who's missing: just the 3 NZ Realm countries plus Wallis and Futuna
picts[!picts %in% unique(remit$country)]

# data for bar chart:
pac_data &lt;- remit |&gt; 
  group_by(country) |&gt; 
  filter(!is.na(personal)) |&gt; 
  arrange(desc(year)) |&gt; 
  slice(1) |&gt; 
  ungroup() |&gt; 
  filter(country %in% c(picts, &quot;Middle income&quot;, &quot;Low income&quot;, &quot;Small states&quot;, &quot;World&quot;, &quot;Australia&quot;, &quot;New Zealand&quot;)) |&gt; 
  mutate(is_pict = ifelse(country %in% picts, &quot;Pacific island&quot;, &quot;Comparison&quot;)) |&gt; 
  mutate(country_order = ifelse(country %in% picts, personal, 1000 - personal),
         country = fct_reorder(country, country_order)) 

# draw bar chart
pac_data |&gt;
  ggplot(aes(x = country, y = personal, fill = is_pict)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  scale_fill_manual(values = c(&quot;brown&quot;, &quot;steelblue&quot;)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position  = &quot;none&quot;,
        plot.caption = element_text(colour = &quot;grey50&quot;)) +
  labs(x = &quot;&quot;, fill = &quot;&quot;,
      subtitle = glue('{attr(remit$personal, &quot;label&quot;)}, {min(pac_data$year)} to {max(pac_data$year)}'),
        y = &quot;&quot;,
       title = &quot;High dependency on remittances for many Pacific Island countries and territories&quot;,
       caption = &quot;Source: World Bank World Development Indicators, series BX.TRF.PWKR.DT.GD.ZS&quot;)</pre></figure>

<p>That’s all for today. Coming soon (I hope), a more narrative blog tying all this Pacific population stuff together, more or less as a written version of the talk this is all based on.</p>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://freerangestats.info/blog/2026/03/08/pacific-remittances"> free range statistics - R</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/pacific-island-remittances-by-ellis2013nz/">Pacific island remittances by @ellis2013nz</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">399654</post-id>	</item>
		<item>
		<title>eSports Analytics in R: Predicting Dota 2 Matches</title>
		<link>https://www.r-bloggers.com/2026/03/esports-analytics-in-r-predicting-dota-2-matches/</link>
		
		<dc:creator><![CDATA[rprogrammingbooks]]></dc:creator>
		<pubDate>Fri, 06 Mar 2026 18:51:30 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://rprogrammingbooks.com/?p=2499</guid>

					<description><![CDATA[<p>eSports analytics is still an underexplored area in the R ecosystem, which makes it a great niche for practical, original work. While football, basketball, and betting models already have strong communities, competitive games such as Dota 2 and Counter-Strike offer rich event data, fast feedback loops, and interesting prediction problems. In ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/esports-analytics-in-r-predicting-dota-2-matches/">eSports Analytics in R: Predicting Dota 2 Matches</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://rprogrammingbooks.com/esports-analytics-in-r-predicting-dota-2-matches/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=esports-analytics-in-r-predicting-dota-2-matches"> Blog - R Programming Books</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>

<section class="post-content">

  <p>
    eSports analytics is still an underexplored area in the R ecosystem, which makes it a great niche for practical, original work. While football, basketball, and betting models already have strong communities, competitive games such as Dota 2 and Counter-Strike offer rich event data, fast feedback loops, and interesting prediction problems. In this post, I will show how R can be used to extract match data, engineer useful features, classify players or teams, and build models to predict match outcomes.
  </p>

  <p>
    The main idea is simple: treat eSports matches like any other structured competition dataset. We can collect historical match information, transform it into team-level or player-level predictors, and then train machine learning models that estimate the probability of victory. For Dota 2, the OpenDota ecosystem is especially useful because it exposes public match and player data through an API that can be accessed from R.
  </p>

  <h2>Why eSports analytics is a strong fit for R</h2>

  <p>
    R is particularly well suited for eSports analytics because it combines data collection, cleaning, visualization, modeling, and reporting in a single workflow. With packages from the tidyverse, tidymodels, and API tools such as <code>httr</code> and <code>jsonlite</code>, it becomes straightforward to move from raw match endpoints to a predictive pipeline.
  </p>

  <p>
    This is also one of the reasons the topic stands out. Compared with mainstream sports, eSports still has much less mature R coverage, so a post focused on <strong>predicting Dota 2 matches in R</strong> feels fresh. It is practical, technically interesting, and relevant to analysts who want to work on non-traditional sports datasets.
  </p>

  <h2>Typical analytics questions in Dota 2 or CS-style games</h2>

  <p>
    Once match data is available, several interesting problems appear naturally:
  </p>

  <ul>
    <li>Which team features are most associated with winning?</li>
    <li>Can we predict the outcome of a match before it starts?</li>
    <li>Which players outperform their role or bracket expectations?</li>
    <li>Do certain heroes, maps, or compositions create measurable edges?</li>
    <li>How stable are team ratings over time?</li>
  </ul>

  <p>
    Some of these are classification tasks, others are ranking or regression problems, and several can benefit from time-aware modeling. If you enjoy probabilistic approaches, a <a href="https://rprogrammingbooks.com/product/bayesian-sports-analytics-r-predictive-modeling-betting-performance/" rel="nofollow" target="_blank">Bayesian sports analytics book in R</a> can be a useful complement when you want to move from point predictions to uncertainty-aware forecasts.
  </p>

  <h2>Data collection in R with OpenDota</h2>

  <p>
    A practical starting point is Dota 2 match data from the OpenDota API. In R, you can work either with a dedicated wrapper such as <code>ROpenDota</code> when available in your environment, or call the API directly with <code>httr2</code>, <code>httr</code>, and <code>jsonlite</code>. I often prefer direct API calls because they make the data flow more transparent and easier to debug.
  </p>

  <p>
    The example below shows a simple way to retrieve recent professional matches and convert them into a tidy tibble.
  </p>

  <pre>library(httr2)
library(jsonlite)
library(dplyr)
library(purrr)
library(tibble)

base_url &lt;- &quot;https://api.opendota.com/api/proMatches&quot;

resp &lt;- request(base_url) |&gt;
  req_perform()

pro_matches &lt;- resp |&gt;
  resp_body_string() |&gt;
  fromJSON(flatten = TRUE) |&gt;
  as_tibble()

glimpse(pro_matches)
</pre>

  <p>
    At this stage, the key goal is not modeling yet. It is understanding what the dataset contains. You want to inspect variables such as match identifiers, start times, radiant and dire team names, duration, league information, and the final winner. Once the structure is clear, the next step is to collect richer match-level or player-level details.
  </p>

  <h2>Downloading detailed match records</h2>

  <p>
    Predictive models usually need more than a top-level match result. We often want per-match detail: kills, deaths, assists, gold per minute, experience per minute, hero picks, bans, lobby type, patch information, and team-level aggregates. A common workflow is to fetch a set of match IDs and then loop through the detailed endpoint for each match.
  </p>

  <pre>library(httr2)
library(jsonlite)
library(dplyr)
library(purrr)
library(tidyr)

get_match_details &lt;- function(match_id) {
  url &lt;- paste0(&quot;https://api.opendota.com/api/matches/&quot;, match_id)

  tryCatch({
    request(url) |&gt;
      req_perform() |&gt;
      resp_body_string() |&gt;
      fromJSON(flatten = TRUE)
  }, error = function(e) {
    NULL
  })
}

sample_ids &lt;- pro_matches |&gt;
  slice_head(n = 50) |&gt;
  pull(match_id)

match_details_raw &lt;- map(sample_ids, get_match_details)
match_details_raw &lt;- compact(match_details_raw)
</pre>

  <p>
    This gives us a list of match records. From there, we can create a team-level modeling table. For predictive work, that usually means one row per team per match, along with a target variable indicating whether that team won.
  </p>

  <h2>Feature engineering for match prediction</h2>

  <p>
    Feature engineering is where most of the value is created. A model rarely becomes useful because of the algorithm alone; it becomes useful because the input variables capture something meaningful about team quality, momentum, or composition.
  </p>

  <p>
    Some strong candidate features include:
  </p>

  <ul>
    <li>Recent win rate over the last 5 or 10 matches</li>
    <li>Average team KDA from recent games</li>
    <li>Average gold per minute and experience per minute</li>
    <li>Hero-pool diversity</li>
    <li>Patch-specific performance</li>
    <li>Opponent strength proxies</li>
    <li>Side indicator such as Radiant vs Dire</li>
    <li>Time since the team last played</li>
  </ul>

  <p>
    A basic team-level engineering pipeline in R might look like this:
  </p>

  <pre>library(dplyr)
library(purrr)
library(tidyr)
library(stringr)

team_rows &lt;- map_dfr(match_details_raw, function(m) {
  if (is.null(m$players) || length(m$players) == 0) return(NULL)

  players &lt;- as_tibble(m$players)

  players &lt;- players |&gt;
    mutate(
      side = if_else(player_slot &lt; 128, &quot;radiant&quot;, &quot;dire&quot;)
    )

  team_summary &lt;- players |&gt;
    group_by(side) |&gt;
    summarise(
      team_kills = sum(kills, na.rm = TRUE),
      team_deaths = sum(deaths, na.rm = TRUE),
      team_assists = sum(assists, na.rm = TRUE),
      avg_gpm = mean(gold_per_min, na.rm = TRUE),
      avg_xpm = mean(xp_per_min, na.rm = TRUE),
      hero_diversity = n_distinct(hero_id),
      .groups = &quot;drop&quot;
    ) |&gt;
    mutate(
      match_id = m$match_id,
      duration = m$duration,
      radiant_win = m$radiant_win,
      win = if_else(
        (side == &quot;radiant&quot; & radiant_win) | (side == &quot;dire&quot; & !radiant_win),
        1, 0
      )
    )

  team_summary
})

team_rows
</pre>

  <p>
    This table is already enough for a first classification model. It is not perfect, and it does not yet include pre-match-only features, but it is ideal for prototyping. In real forecasting, we should be careful not to leak post-match information into the predictors. For example, final kills and average GPM are fine for explanatory analysis but not for true pre-match forecasting.
  </p>

  <h2>Building a proper pre-match dataset</h2>

  <p>
    If the goal is to predict the winner before the game begins, then every feature must be available before the first second of the match. That means historical rolling summaries are usually better than in-match totals. A cleaner setup is:
  </p>

  <ol>
    <li>Sort matches chronologically</li>
    <li>Create one row per team per match</li>
    <li>Compute rolling features from previous matches only</li>
    <li>Join the two competing teams into a head-to-head row</li>
    <li>Train a binary classifier on the winner</li>
  </ol>

  <p>
    Here is a simplified example of rolling team form:
  </p>

  <pre>library(dplyr)
library(slider)

team_history &lt;- team_rows |&gt;
  arrange(match_id) |&gt;
  group_by(side) |&gt;
  mutate(
    # lag() excludes the current match from each window, so these rolling
    # means summarise the previous five games only and cannot leak the
    # outcome we are trying to predict
    recent_win_rate = slide_dbl(lag(win), mean, na.rm = TRUE, .before = 4),
    recent_avg_kills = slide_dbl(lag(team_kills), mean, na.rm = TRUE, .before = 4),
    recent_avg_deaths = slide_dbl(lag(team_deaths), mean, na.rm = TRUE, .before = 4)
  ) |&gt;
  ungroup()
</pre>

  <p>
    In a more complete dataset, you would calculate these rolling statistics by actual team identity rather than by side alone. That produces a much more realistic team-strength signal.
  </p>
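
  <p>
    As a hedged sketch of what that might look like, assume each row of <code>team_rows</code> also carried a <code>team_id</code> column (for example, joined from the <code>radiant_team_id</code> and <code>dire_team_id</code> fields that OpenDota exposes; the toy data below simply stands in for that). The only real change is the grouping variable:
  </p>

  <pre>library(dplyr)
library(slider)
library(tibble)

# toy stand-in for team_rows with a team_id column; in practice this
# would be joined from the OpenDota team identifiers
team_rows_id &lt;- tibble(
  match_id = 1:6,
  team_id  = c(1, 2, 1, 2, 1, 2),
  win      = c(1, 0, 0, 1, 1, 0)
)

team_form &lt;- team_rows_id |&gt;
  arrange(match_id) |&gt;
  group_by(team_id) |&gt;
  mutate(
    # lag() drops the current match, so the form variable is pre-match only
    recent_win_rate = slide_dbl(lag(win), mean, na.rm = TRUE, .before = 4)
  ) |&gt;
  ungroup()
</pre>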

  <h2>Predicting match outcomes with tidymodels</h2>

  <p>
    Once a clean modeling table is ready, <code>tidymodels</code> provides an elegant framework for splitting data, preprocessing predictors, training models, and evaluating performance. Logistic regression is a strong baseline because it is interpretable and fast. After that, tree-based methods such as random forests or gradient boosting can be tested.
  </p>

  <pre>library(tidymodels)

model_data &lt;- team_rows |&gt;
  select(win, team_kills, team_deaths, team_assists, avg_gpm, avg_xpm, hero_diversity, duration) |&gt;
  mutate(win = factor(win, levels = c(0, 1)))

set.seed(123)

split_obj &lt;- initial_split(model_data, prop = 0.8, strata = win)
train_data &lt;- training(split_obj)
test_data  &lt;- testing(split_obj)

rec &lt;- recipe(win ~ ., data = train_data) |&gt;
  step_impute_median(all_numeric_predictors()) |&gt;
  step_normalize(all_numeric_predictors())

log_spec &lt;- logistic_reg() |&gt;
  set_engine(&quot;glm&quot;)

wf &lt;- workflow() |&gt;
  add_recipe(rec) |&gt;
  add_model(log_spec)

fit_log &lt;- fit(wf, data = train_data)

preds &lt;- predict(fit_log, test_data, type = &quot;prob&quot;) |&gt;
  bind_cols(predict(fit_log, test_data)) |&gt;
  bind_cols(test_data)

# win has factor levels c(0, 1), so the event of interest (&quot;1&quot;) is the
# second level; yardstick treats the first level as the event by default
roc_auc(preds, truth = win, .pred_1, event_level = &quot;second&quot;)
accuracy(preds, truth = win, .pred_class)
</pre>

  <p>
    The first model is rarely the final model, but it gives us a baseline. If performance is weak, that usually means the issue is in the feature set rather than the modeling syntax. Better historical variables, better team identifiers, and better patch-aware data often matter more than switching algorithms immediately.
  </p>

  <h2>Moving beyond logistic regression</h2>

  <p>
    After a baseline, several improvements are possible. Random forests can capture nonlinear relationships. Gradient boosting often performs well when feature interactions matter. Bayesian models can be especially attractive when sample sizes are uneven or when you want probability distributions instead of single-point estimates. For readers interested in probabilistic thinking and predictive uncertainty, a resource on <a href="https://rprogrammingbooks.com/product/bayesian-sports-betting-with-r/" rel="nofollow" target="_blank">Bayesian sports betting with R</a> can help connect model outputs with practical decision-making.
  </p>

  <pre>rf_spec &lt;- rand_forest(
  trees = 500,
  min_n = 5
) |&gt;
  set_engine(&quot;ranger&quot;) |&gt;
  set_mode(&quot;classification&quot;)

rf_wf &lt;- workflow() |&gt;
  add_recipe(rec) |&gt;
  add_model(rf_spec)

fit_rf &lt;- fit(rf_wf, data = train_data)

rf_preds &lt;- predict(fit_rf, test_data, type = &quot;prob&quot;) |&gt;
  bind_cols(predict(fit_rf, test_data)) |&gt;
  bind_cols(test_data)

roc_auc(rf_preds, truth = win, .pred_1, event_level = &quot;second&quot;)  # &quot;1&quot; is the second factor level
accuracy(rf_preds, truth = win, .pred_class)
</pre>

  <p>
    A good post does not need to claim perfect predictive power. In fact, readers usually trust the analysis more when you clearly explain the constraints. Team rosters change, patches alter the meta, public data can be incomplete, and many matches are influenced by contextual factors that are difficult to encode numerically.
  </p>

  <h2>Player classification and rating ideas</h2>

  <p>
    Match prediction is only one angle. Another strong direction is player classification. For example, we can cluster players based on aggression, farming style, support contribution, and efficiency. This is particularly interesting because eSports roles are both strategic and behavioral.
  </p>

  <p>
    A simple unsupervised workflow could include:
  </p>

  <ul>
    <li>K-means clustering on player performance metrics</li>
    <li>PCA for dimensionality reduction and visualization</li>
    <li>Role classification using labeled examples</li>
    <li>Elo-style or Glicko-style rating systems for evolving skill estimates</li>
  </ul>
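
  <p>
    The last idea on that list is easy to prototype in a few lines. Here is a minimal Elo sketch, with the standard logistic expected-score formula and an assumed K-factor of 32 (both conventional defaults, not taken from any particular Dota 2 rating system):
  </p>

  <pre># expected score for team A against team B under the logistic Elo model
elo_expected &lt;- function(r_a, r_b) {
  1 / (1 + 10^((r_b - r_a) / 400))
}

# post-match rating update for team A; k controls how fast ratings move
elo_update &lt;- function(r_a, r_b, a_won, k = 32) {
  r_a + k * (a_won - elo_expected(r_a, r_b))
}

# two equally rated teams: the winner gains exactly k / 2 points
elo_update(1500, 1500, a_won = 1)
</pre>

  <p>
    Iterating this update over matches sorted chronologically gives an evolving strength estimate that can itself feed into the prediction models above.
  </p>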

  <pre>library(dplyr)
library(purrr)   # map_dfr() comes from purrr, not dplyr
library(ggplot2)

player_data &lt;- map_dfr(match_details_raw, function(m) {
  if (is.null(m$players) || length(m$players) == 0) return(NULL)

  as_tibble(m$players) |&gt;
    transmute(
      match_id = m$match_id,
      account_id = account_id,
      hero_id = hero_id,
      kills = kills,
      deaths = deaths,
      assists = assists,
      gpm = gold_per_min,
      xpm = xp_per_min,
      last_hits = last_hits
    )
}) |&gt;
  filter(!is.na(account_id))

player_summary &lt;- player_data |&gt;
  group_by(account_id) |&gt;
  summarise(
    avg_kills = mean(kills, na.rm = TRUE),
    avg_deaths = mean(deaths, na.rm = TRUE),
    avg_assists = mean(assists, na.rm = TRUE),
    avg_gpm = mean(gpm, na.rm = TRUE),
    avg_xpm = mean(xpm, na.rm = TRUE),
    avg_last_hits = mean(last_hits, na.rm = TRUE),
    matches = n(),
    .groups = &quot;drop&quot;
  ) |&gt;
  filter(matches &gt;= 10)
</pre>

  <p>
    From there, clustering or supervised classification becomes straightforward. This is the kind of section that makes an eSports article feel broader than a simple API tutorial.
  </p>
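
  <p>
    To make the clustering step concrete, here is a minimal, hedged sketch using base R&#8217;s <code>kmeans()</code> and <code>prcomp()</code>. The toy rows below stand in for <code>player_summary</code>; with real data you would feed the summarised table in directly.
  </p>

  <pre>library(dplyr)
library(tibble)

# toy stand-in for player_summary: two rough archetypes, cores and supports
players &lt;- tibble(
  account_id = 1:6,
  avg_kills  = c(9, 8, 2, 3, 10, 2),
  avg_deaths = c(4, 5, 7, 6, 3, 8),
  avg_gpm    = c(620, 600, 310, 330, 650, 300)
)

# standardise so that GPM does not dominate the distance calculation
feat &lt;- players |&gt;
  select(-account_id) |&gt;
  scale()

set.seed(42)
km &lt;- kmeans(feat, centers = 2, nstart = 25)

# PCA gives a two-dimensional view for plotting the clusters
pca &lt;- prcomp(feat)
cluster_view &lt;- tibble(
  pc1 = pca$x[, 1],
  pc2 = pca$x[, 2],
  cluster = factor(km$cluster)
)
</pre>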

  <h2>Visualization ideas that make the post stronger</h2>

  <p>
    Visuals can turn a technical post into a memorable one. In eSports, a few plots are especially effective:
  </p>

  <ul>
    <li>Win probability calibration plots</li>
    <li>Rolling team form charts</li>
    <li>Hero usage and win-rate heatmaps</li>
    <li>Player cluster scatterplots from PCA</li>
    <li>Feature importance plots for tree models</li>
  </ul>

  <p>
    For example, here is a simple variable importance chart after fitting a random forest:
  </p>

  <pre>library(vip)

fit_rf |&gt;
  extract_fit_parsnip() |&gt;
  vip()
</pre>

  <p>
    The purpose of these plots is not just decoration. They help answer the analytical question visually: what actually drives team success, and which signals seem stable across matches?
  </p>
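
  <p>
    The first item on that list, a calibration plot, needs nothing beyond <code>dplyr</code> and <code>ggplot2</code>: bin the predicted win probabilities and compare each bin&#8217;s average prediction with its observed win rate. The simulated predictions below stand in for a real table of model outputs:
  </p>

  <pre>library(dplyr)
library(ggplot2)
library(tibble)

# simulated predictions standing in for model output; calibrated by construction
set.seed(1)
toy_preds &lt;- tibble(
  pred_win = runif(500),
  win      = rbinom(500, 1, pred_win)
)

calib &lt;- toy_preds |&gt;
  mutate(bin = cut(pred_win, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)) |&gt;
  group_by(bin) |&gt;
  summarise(
    mean_pred = mean(pred_win),
    obs_rate  = mean(win),
    .groups = &quot;drop&quot;
  )

ggplot(calib, aes(mean_pred, obs_rate)) +
  geom_abline(linetype = &quot;dashed&quot;) +   # the perfect-calibration line
  geom_point() +
  labs(x = &quot;Mean predicted win probability&quot;, y = &quot;Observed win rate&quot;)
</pre>

  <p>
    Points hugging the dashed diagonal indicate well-calibrated probabilities; systematic departures suggest the model is over- or under-confident in some range.
  </p>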

  <h2>What about Counter-Strike or other eSports titles?</h2>

  <p>
    The same workflow generalizes well. Even if package support is less standardized than in Dota 2, the modeling logic remains the same:
  </p>

  <ul>
    <li>Collect historical match data</li>
    <li>Build team and player features</li>
    <li>Use rolling windows to represent recent form</li>
    <li>Train classification or rating models</li>
    <li>Evaluate probabilities, not just hard predictions</li>
  </ul>

  <p>
    In Counter-Strike style datasets, likely features include map win rates, side-specific strength, recent kill differential, roster stability, and head-to-head history. In that sense, the sport changes, but the R workflow does not.
  </p>

  <h2>Why this kind of post can stand out</h2>

  <p>
    A post on eSports analytics in R stands out because it sits at the intersection of data science novelty and practical modeling. It is specific enough to be useful, but unusual enough to attract readers who are tired of the same repeated examples from traditional sports. A title built around predicting Dota 2 matches is especially effective because it immediately communicates a concrete deliverable.
  </p>

  <p>
    It also fits naturally into a broader sports analytics learning path. Readers who discover this topic through eSports may later want to explore work in football, soccer, or multi-sport modeling, where books such as <a href="https://rprogrammingbooks.com/product/football-analytics-r-nflfastr-nflverse/" rel="nofollow" target="_blank">Football Analytics with R</a>, <a href="https://rprogrammingbooks.com/product/mastering-sports-analytics-with-r-soccer/" rel="nofollow" target="_blank">Mastering Sports Analytics with R: Soccer</a>, or <a href="https://rprogrammingbooks.com/product/sports-analytics-with-r-data-science-for-six-major-sports-nfl-nba-tennis-golf-boxing/" rel="nofollow" target="_blank">Sports Analytics with R across multiple sports</a> can expand the same analytical mindset into other domains.
  </p>

  <h2>Final thoughts</h2>

  <p>
    eSports analytics deserves more attention in R, and Dota 2 is one of the best places to start. With API access, tidy data workflows, and flexible modeling tools, it is possible to go from raw public match records to meaningful predictive systems entirely in R. Even a simple first version can teach a lot about data engineering, feature design, classification, and evaluation.
  </p>

  <p>
    The real opportunity is not only to predict winners, but to build a reproducible framework for understanding team performance, player styles, and competitive dynamics in games that are becoming more important every year. That combination of novelty, data richness, and analytical depth is exactly what makes eSports such a compelling subject for an R post.
  </p>

</section>
<p>The post <a href="https://rprogrammingbooks.com/esports-analytics-in-r-predicting-dota-2-matches/" rel="nofollow" target="_blank">eSports Analytics in R: Predicting Dota 2 Matches</a> appeared first on <a href="https://rprogrammingbooks.com/" rel="nofollow" target="_blank">R Programming Books</a>.</p>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://rprogrammingbooks.com/esports-analytics-in-r-predicting-dota-2-matches/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=esports-analytics-in-r-predicting-dota-2-matches"> Blog - R Programming Books</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/esports-analytics-in-r-predicting-dota-2-matches/">eSports Analytics in R: Predicting Dota 2 Matches</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">399618</post-id>	</item>
		<item>
		<title>Data Visualization, Second Edition</title>
		<link>https://www.r-bloggers.com/2026/03/data-visualization-second-edition/</link>
		
		<dc:creator><![CDATA[Kieran Healy]]></dc:creator>
		<pubDate>Fri, 06 Mar 2026 11:52:36 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://kieranhealy.org/blog/archives/2026/03/06/data-visualization-second-edition/</guid>

					<description><![CDATA[<p>I’ve written a second edition of Data Visualization: A Practical Introduction, which ideally should come out with Princeton University Press later this year. As with the first edition, a full draft of the book is available at https://socviz.co. T...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/data-visualization-second-edition/">Data Visualization, Second Edition</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://kieranhealy.org/blog/archives/2026/03/06/data-visualization-second-edition/"> R on kieranhealy.org</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p>I’ve written a second edition of <a href="https://socviz.co/" rel="nofollow" target="_blank"><em>Data Visualization: A Practical Introduction</em></a>, which ideally should come out with Princeton University Press later this year. As with the first edition, a full draft of the book is available at <a href="https://socviz.co/" rel="nofollow" target="_blank">https://socviz.co</a>. The production process is just getting started so there’s no new cover yet, and there isn’t a link to pre-order. But (also like last time) I’ve put up a link to a <a href="https://forms.gle/4xeALwJLbzdzT8rz7" rel="nofollow" target="_blank">form</a> that lets you add your email if you’d like to be notified when it’s available to buy. You’ll only get one email (from me personally, not a marketing department) if you do; no spam or anything.</p>
<figure><a href="https://i1.wp.com/kieranhealy.org/blog/archives/2026/03/06/data-visualization-second-edition/global_mean_simple.png?ssl=1" rel="nofollow" target="_blank">
    <img src="https://i1.wp.com/kieranhealy.org/blog/archives/2026/03/06/data-visualization-second-edition/global_mean_simple.png?w=578&#038;ssl=1"
         alt="Global Mean Sea Surface Temperatures" data-recalc-dims="1"/></a>
</figure>
<p>The revised edition is a pretty thorough rewrite. Naturally all the code is brought up to date for ggplot 4 and R version 4.5 and higher. The code from the first edition still runs, but you’ll get warnings and so on; those are all now gone. The back half of the book has been substantially redone to reflect big changes in the availability of software for maps (the <code>sf</code> package) and for extracting results from models (the <code>marginaleffects</code> package). Meanwhile, several years of teaching this material (and getting feedback from others) have resulted in shifts of emphasis here and there to introduce just a little bit more on data wrangling. As the book goes on I also shift from an “object-based” approach to writing plots to a more “pipeline-based” one.</p>
<p>The recent rise of LLMs and coding agents gets some discussion, too. There the question is “Why can’t I just have a robot write all the code for me?” I don’t dismiss this question out of hand, and I don’t pretend that agents aren’t very powerful. My feeling about this is summed up in the <a href="https://socviz.co/#whats-new-in-this-edition" rel="nofollow" target="_blank">Preface</a>:</p>



<blockquote>
    <p>Perhaps you have a robot to help you write your code now. Large Language Models (LLMs) and coding agents are now part of the workflow of code generation and evaluation. They can do a great deal; so much so that it might seem superfluous to spend any time with the iterative, write-try-redo approach to visualization that this book presents. Can’t the robot write all the code instead? Not quite. It’s not that I believe repeatedly doing repetitive and error-prone tasks yourself is a virtue. To the contrary, that’s what computers are for. This book is full of examples where we end up automating something in order not to worry about it. But I also want you, the reader, to learn how to do good graphical work in a reproducible way. That means having a keen eye for quality and a good nose for error. Cultivating those senses requires practice and a vocabulary to express them. It seems faintly absurd to have to say it explicitly but, whatever tools you use, your work will be better if you know what you are doing and understand why you are doing it. This book teaches you ggplot specifically, but it is not trying to lock you in to a particular framework. It’s just that, the way you acquire a general skill or a wide-ranging taste is by first learning some more specific version of those things, and then practicing them. Automation can come later. In the words of the author Ann Leckie, you don’t learn how to do something by not doing it. For that reason, this book remains a hands-on introduction.</p>

</blockquote>

<p>Or to put it another way, the book is an introduction to how to do something. One feature of books like it is that they tend to have two audiences: people who don’t know anything about the topic, and who’d like to learn something about it, and people who know a <em>lot</em>, at least in relative terms, and who have forgotten what it’s like not to know it.  When the first edition came out, one of the early Amazon reviews was a complaint that the book seemed “pretty introductory” in its content. I mean, my Brother in Christ, that is right there in the title.</p>
<p>As with any corner of the vast division of labor that is human society, not everyone has to know about any specific thing in great detail. We’re all taking huge amounts of stuff for granted at any moment. But if you want to be proficient in some piece of that enormous web, it’s better that you know rather than not know what’s what. There’s nothing wrong with using tools that give you tremendous leverage. You do it every time you use a stand mixer in the kitchen, or a sander in the garage. You do it every time you turn your computer on, in fact. But you still need to develop the capacity to tell good work from bad, or correct from incorrect output, or safe uses from dangerous ones. That way you can take advantage of the power tools without being at risk of slicing your own or anyone else’s arm off.</p>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://kieranhealy.org/blog/archives/2026/03/06/data-visualization-second-edition/"> R on kieranhealy.org</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/data-visualization-second-edition/">Data Visualization, Second Edition</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">399616</post-id>	</item>
		<item>
		<title>Why did I create ESR (my thoughts on ESS)</title>
		<link>https://www.r-bloggers.com/2026/03/why-did-i-create-esr-my-thoughts-on-ess/</link>
		
		<dc:creator><![CDATA[Teoten]]></dc:creator>
		<pubDate>Tue, 03 Mar 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">http://www.r-bloggers.com/?guid=d657ad2a4639360babe9466013ba5303</guid>

					<description><![CDATA[<p>Ever since I started using a text editor for R, I have been using ESS. That makes it 10 years now. And so, I have decided that it is enough.<br />
   I started using R during my master's studies for my statistical analysis. I wanted to learn the core of the ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/why-did-i-create-esr-my-thoughts-on-ess/">Why did I create ESR (my thoughts on ESS)</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://blog.teoten.com/posts/2026/why_i_created_esr/"> Teoten&#039;s blog</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues with the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<article id="post-/posts/2026/why_i_created_esr/" data-post-id="/posts/2026/why_i_created_esr/"><div>
 <span><span>
   </span></span><p>Ever since I started using a text editor for R, I have been using ESS. That makes 10 years now. And so, I have decided that it is enough.</p>
   <p>I started using R during my master&#8217;s studies, for my statistical analyses. I wanted to learn the core of the language, so I decided that I didn&#8217;t want any interface between me and the language. I wrote all my code directly in the console, saving some Rdata and keeping track of my history. And I learned a lot.</p>
   <p>A few years later I started my PhD and decided that this was too archaic. For some reason I wanted to stay away from RStudio, so I started searching for a good text editor for R. I was surprised to find that there are a lot of good options. But I had a plan: I would try a few, see how I felt with each, and then choose. I made a list of 3 or 4 and started with the first one. It was Emacs with ESS. I never made it to the second one.</p>
   <p>From the very beginning, Emacs felt natural and intuitive to me. I guess its structure and way of working fit my mindset. I adopted ESS in the same way, simply as part of my Emacs experience. I learned the basic key bindings, the connection to the R console, and a few shortcuts for package development and plots. In no time I had a great working environment for R. It was the falling-in-love phase, and I just loved all of it.</p>
   <p><span>ESS has an interesting aim: to make statistics easy and possible with Emacs. If you think about it beyond the scope of R, the task is huge. ESS provides a lot of functionality for other statistical languages, such as SAS, S-PLUS and, more recently, even Julia. According to <a href="https://en.wikipedia.org/wiki/Emacs_Speaks_Statistics" rel="nofollow" target="_blank">Wikipedia</a>, &#8220;it has the capability to submit a batch job for statistical packages like SAS, BUGS or JAGS when an interactive session is unwanted due to the potentially lengthy time required for the task to complete&#8221;. It may well be one of the very first editors for R, from way before RStudio was even imagined. </span></p>
   <p>Its changelog goes back to 1992, with version 3.41. That would have been at around Emacs version 18. At the moment of writing this post, ESS is at version 25.01.0 and Emacs at 30.2. It is evident that ESS was crafted with very old and limited Emacs tools. And to the experienced user, it is also clear that ESS has had trouble adapting to the evolution of Emacs. The developers behind Emacs have been very smart at incorporating libraries of high value both for the end user and for the elisp programmer. Emacs is so flexible and versatile that it keeps up with cutting-edge technology such as LSP, tree-sitter and, more recently, generative AI. Unfortunately, we cannot say the same about ESS.</p>
   <p><span>The ESS team really did an outstanding job with this package. It not only works, but goes beyond that and gives you a lot of tools to make working with statistics easy. R is the star of the package: it has 9 elisp files dedicated exclusively to it, plus a lot of additional or supporting functionality scattered over the code, and it receives special attention for new features and bug fixes. I have always liked how easy it is to start a new R console and pair any R script buffer to it. I really like its debugging system. And I have a bunch of R functions wrapped in simple elisp code to execute via keybindings. They did create an excellent development environment for R code. Unfortunately, it is also very traditionalist, rooted in the old R way, and things start getting foggy when you try alternative methods of developing R code, such as <a href="https://klmr.me/box/" rel="nofollow" target="_blank">box</a> and <a href="https://appsilon.github.io/rhino/" rel="nofollow" target="_blank">rhino</a>. </span></p>
   <p>Despite all the wonderful things it offers, it also has a lot of drawbacks. The code base is too big and too old. When you try to contribute, or simply to fix or change something that is supposed to be simple, you can easily get lost in a labyrinth of elisp functions and variables scattered all over the place. There is a certain structure, though, so once you get used to it and learn which script is for what, the labyrinth becomes navigable with the help of xref. But the code is old: it has been growing for more than 30 years. More than once, after updating to a new Emacs version, I have had a bunch of warnings and/or errors suddenly pop up for a few packages. That is because Emacs has changed certain things in certain functions: sometimes new variables, different defaults, a change in the location or implementation of something, etc. Generally I just wait for the maintainers to fix it. There are always some users running the newest pre-release Emacs, and they report issues quickly, so the maintainers can fix them fast. Unfortunately, this is not the case with ESS. In my experience, ESS has always been the last to fix bugs like this. And when you dive into the code, it is full of comments about what would be a good implementation, code that has been commented out, and many workarounds for this kind of bug, rather than attempts to implement things in the new way that Emacs suggests. And I don&#8217;t blame them. Their code is so big and interdependent that fixing something that is supposed to be simple can break other functions downstream, or break an implementation required upstream.</p>
   <p><span>On top of all that fuss, ESS can be too opinionated about certain topics. My favorite example is <a href="https://www.gnu.org/software/emacs/manual/html_node/emacs/Projects.html" rel="nofollow" target="_blank">working with projects</a>. Unlike any other major mode for a programming language, ESS has its own definition of a project. There is a whole discussion in <a href="https://github.com/emacs-ess/ESS/issues/1289" rel="nofollow" target="_blank">issue 1289</a> if you want to read the opinions. It is intended to set up projects the R way, but then, if you choose to work with <a href="https://appsilon.github.io/rhino/" rel="nofollow" target="_blank">Rhino</a><span>, you only get headaches. I used to work on an R project for a big data pipeline with a well-structured code base outside of &#8220;the R way&#8221;, so whenever I wanted to use project functionality, it was not available to me. I ended up adding an <code>.Rprofile</code> just for that. But it is upsetting that the package developers get to decide what an R project should look like. </span></span></p>
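   <p>For illustration only (this sketch is not the author&#8217;s actual setup, and the <code>box.path</code> option assumes you use the box package): a minimal <code>.Rprofile</code> dropped at the repository root can both mark the directory as the project and point project-relative tooling at it.</p>
   <pre><code># .Rprofile -- minimal, illustrative project marker
local({
  root <- normalizePath(".")
  # let box::use() resolve modules relative to the project root
  options(box.path = root)
  message("R project root: ", root)
})
</code></pre>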
   <p><span>All of that brings us to their inability to produce a tree-sitter implementation for ESS. I will save my comments on this; you can check the details yourself in <a href="https://github.com/emacs-ess/ESS/issues/1239" rel="nofollow" target="_blank">Issue 1239</a>, which was opened on February 5th, 2023. Three years later, draw your own conclusions. </span></p>
   <p><span>It simply seems that they have too much on their plate. Even though R is their main focus, it has to drag along the rest of their code base, which is huge. <a href="https://codeberg.org/teoten/esr" rel="nofollow" target="_blank">ESR</a> was born because we believe that R users deserve a better Emacs experience. We deserve a major mode where R is a first-class citizen, just like Python or JavaScript, and one that allows us to keep up with the latest R tools, like radian and air. The target of <a href="https://codeberg.org/teoten/esr" rel="nofollow" target="_blank">ESR</a> is to be a minimalist package focused on R, with the support of tree-sitter. </span></p>
   <p>There are some remarkable differences with ESS:</p>
  <ul>
   <li>
    <p><span>Emacs Speaks R instead of Emacs Speaks Statistics. <a href="https://codeberg.org/teoten/esr" rel="nofollow" target="_blank">ESR</a> focuses on R. </span></p></li>
   <li>
    <p>Use of tree-sitter. This opens up a lot of new possibilities for syntax highlighting, code navigation and code editing.</p></li>
   <li>
    <p>Use of Emacs&#8217;s built-in functionality: don&#8217;t reinvent the wheel, and update to newer Emacs tools.</p></li>
   <li>
    <p><span>Minimal key map. ESS provides a huge key map which resembles the buttons and menus of Rstudio. <a href="https://codeberg.org/teoten/esr" rel="nofollow" target="_blank">ESR</a> attempts to move away from that strategy by keeping the key map clean. </span></p></li>
   <li>
    <p><span>Use of modern tools to power R development, such as <a href="https://www.gnu.org/software/emacs/manual/html_node/eglot/Quick-Start.html" rel="nofollow" target="_blank">Eglot</a> and <a href="https://github.com/akermu/emacs-libvterm" rel="nofollow" target="_blank">Vterm</a>, which can support <a href="https://github.com/randy3k/radian" rel="nofollow" target="_blank">Radian</a> and <a href="https://tidyverse.org/blog/2025/02/air/" rel="nofollow" target="_blank">Air</a>. </span></p></li>
  </ul>
  <p><span><a href="https://codeberg.org/teoten/esr" rel="nofollow" target="_blank">ESR</a> was born as a tree-sitter mode for R but, thanks to the support of the community interested in it, it keeps growing as an alternative to ESS. We are greatly thankful for the work that the ESS team has put in throughout these years, and we decided to name the package to honor that. We will work hard to make sure that Emacs speaks R. </span></p>
</div></article>
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://blog.teoten.com/posts/2026/why_i_created_esr/"> Teoten&#039;s blog</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/why-did-i-create-esr-my-thoughts-on-ess/">Why did I create ESR (my thoughts on ESS)</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">399547</post-id>	</item>
		<item>
		<title>Agentic coding with R workshop</title>
		<link>https://www.r-bloggers.com/2026/03/agentic-coding-with-r-workshop/</link>
		
		<dc:creator><![CDATA[Dariia Mykhailyshyna]]></dc:creator>
		<pubDate>Mon, 02 Mar 2026 11:01:29 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://r-posts.com/?p=18825</guid>

					<description><![CDATA[<p>Join our workshop on Agentic coding with R,  which is a part of our workshops for Ukraine series!  Here’s some more info:  Title: Agentic coding with R  Date: Thursday, April 2nd, 14:00 – 16:00 CET (Rome, Berlin, Paris timezone)  Speaker: Charles Crabtree is a political scientist and Senior Lecturer in the School ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/agentic-coding-with-r-workshop/">Agentic coding with R workshop</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="http://r-posts.com/agentic-coding-with-r-workshop/"> R-posts.com</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues with the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p><span style="font-weight: 400">Join our workshop on </span><span style="font-weight: 400">Agentic coding with R, </span><span style="font-weight: 400"> which is a part of our workshops for Ukraine series! </span></p>
<br />
<p><b>Here’s some more info: </b></p>
<br />
<p><b>Title</b><span style="font-weight: 400">: </span><span style="font-weight: 400">Agentic coding with R </span></p>
<p><b>Date</b><span style="font-weight: 400">: Thursday, April 2nd, 14:00 – 16:00 CET (Rome, Berlin, Paris timezone) </span></p>
<p><b>Speaker</b><span style="font-weight: 400">: Charles Crabtree is a political scientist and Senior Lecturer in the School of Social Sciences at Monash University. His research sits at the intersection of political behavior, discrimination, and research methods, with work spanning experiments, text analysis, and large-scale observational data.</span></p>
<p><b>Description:</b> <span style="font-weight: 400">This workshop introduces </span><i><span style="font-weight: 400">agentic coding</span></i><span style="font-weight: 400"> for R: using AI assistants that can help you plan, write, run, and revise multi-step analysis workflows while keeping your work transparent and reproducible. Using Warp.dev as a concrete interface, we will walk through practical patterns for (1) turning messy research tasks into clear, checkable steps, (2) writing R code safely, (3) generating documentation and analysis notes as you work, and (4) developing a paper trail you can share with coauthors or future you. A key focus is adversarial agentic coding: pairing a “builder” agent with a separate “reviewer” agent that tries to break, audit, and improve the code the first agent produced—stress-testing assumptions, spotting silent failures, and proposing fixes. The emphasis is not on prompt tricks, but on reliable habits: how to constrain the agent, verify outputs, and integrate agentic help into real projects (data cleaning, modeling, tables and figures, and report generation). Participants will leave with copy-paste templates they can reuse immediately.</span></p>
<p><b>Minimal registration fee:</b><span style="font-weight: 400"> 20 euro (or 20 USD or 800 UAH)</span></p>
<br />
<p><span style="font-weight: 400">Please note that the registration confirmation is sent one day before the workshop to all registered participants, rather than immediately after registration.</span></p>
<br />
<p><b>How can I register?</b></p>
<br />
<ul>
	<li style="font-weight: 400"><span style="font-weight: 400">Go to </span><a href="https://bit.ly/3wvwMA6" rel="nofollow" target="_blank"><span style="font-weight: 400">https://bit.ly/3wvwMA6</span></a><span style="font-weight: 400"> or </span><a href="https://bit.ly/4aD5LMC" rel="nofollow" target="_blank"><span style="font-weight: 400">https://bit.ly/4aD5LMC</span></a><span style="font-weight: 400">  or  </span><a href="https://bit.ly/3PFxtNA" rel="nofollow" target="_blank"><span style="font-weight: 400">https://bit.ly/3PFxtNA</span></a><span style="font-weight: 400"> and donate</span><b> at least 20 euro</b><span style="font-weight: 400">. </span><span style="font-weight: 400">Feel free to donate more if you can, all proceeds go directly to support Ukraine.</span></li>
</ul>
<br />
<ul>
	<li style="font-weight: 400"><span style="font-weight: 400">Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)</span></li>
</ul>
<br />
<ul>
	<li style="font-weight: 400"><span style="font-weight: 400">Fill in the </span><a href="https://forms.gle/P5fXwu2prs3CMDRg9" rel="nofollow" target="_blank"><span style="font-weight: 400">registration form</span></a><span style="font-weight: 400">, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).</span></li>
</ul>
<br />
<p><span style="font-weight: 400">If you are not personally interested in attending, you can also contribute by sponsoring the participation of a student, who will then be able to participate for free. If you choose to sponsor a student, all proceeds will also go directly to organisations working in Ukraine. You can either sponsor a particular student or leave it up to us to allocate the sponsored place to students who have signed up for the waiting list.</span></p>
<br />
<p><b>How can I sponsor a student?</b></p>
<ul>
	<li style="font-weight: 400"><span style="font-weight: 400">Go to </span><a href="https://bit.ly/3wvwMA6" rel="nofollow" target="_blank"><span style="font-weight: 400">https://bit.ly/3wvwMA6</span></a><span style="font-weight: 400"> or </span><a href="https://bit.ly/4aD5LMC" rel="nofollow" target="_blank"><span style="font-weight: 400">https://bit.ly/4aD5LMC</span></a><span style="font-weight: 400">  or </span><a href="https://bit.ly/3PFxtNA" rel="nofollow" target="_blank"><span style="font-weight: 400">https://bit.ly/3PFxtNA</span></a><span style="font-weight: 400"> and donate </span><b>at least 20 euro </b><span style="font-weight: 400">(or 17 GBP or 20 USD or 800 UAH). </span><span style="font-weight: 400">Feel free to donate more if you can, all proceeds go to support Ukraine!</span></li>
</ul>
<br />
<ul>
	<li style="font-weight: 400"><span style="font-weight: 400">Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)</span></li>
</ul>
<br />
<ul>
	<li style="font-weight: 400"><span style="font-weight: 400">Fill in the </span><a href="https://forms.gle/cTjj5rm6U5Z9DFQE7" rel="nofollow" target="_blank"><span style="font-weight: 400">sponsorship form</span></a><span style="font-weight: 400">, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.</span></li>
</ul>
<br />
<br />
<p><span style="font-weight: 400">If you are a university student and cannot afford the registration fee, you can also sign up for the </span><b>waiting list</b> <a href="https://forms.gle/g2oqdCSmr5FsV7Dn9" rel="nofollow" target="_blank"><span style="font-weight: 400">here</span></a><span style="font-weight: 400">. (Note that you are not guaranteed to participate by signing up for the waiting list).</span></p>
<br />
<br />
<p><span style="font-weight: 400">You can also find more information about this workshop series, a schedule of our future workshops, and a list of our past workshops, whose recordings &#038; materials you can get, </span><a href="http://bit.ly/3wBeY4S" rel="nofollow" target="_blank"><span style="font-weight: 400">here</span></a><span style="font-weight: 400">.</span></p>
<br />
<p><span style="font-weight: 400">Looking forward to seeing you during the workshop!</span></p>
<br /><hr style="border-top: black solid 1px" /><a href="http://r-posts.com/agentic-coding-with-r-workshop/" rel="nofollow" target="_blank">Agentic coding with R workshop</a> was first posted on March 2, 2026 at 11:01 am.<br />
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="http://r-posts.com/agentic-coding-with-r-workshop/"> R-posts.com</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/agentic-coding-with-r-workshop/">Agentic coding with R workshop</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">399468</post-id>	</item>
		<item>
		<title>A Few Claude Skills for R Users</title>
		<link>https://www.r-bloggers.com/2026/03/a-few-claude-skills-for-r-users/</link>
		
		<dc:creator><![CDATA[Isabella Velásquez]]></dc:creator>
		<pubDate>Mon, 02 Mar 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://rworks.dev/posts/claude-skills-for-r-users/</guid>

					<description><![CDATA[<p>If you’re like me, you might be feeling a bit overwhelmed by all the new AI tools for coding. So, this post may be adding one more thing to your plate, but I promise to keep it as whelming as possible. 😄<br />
This is a (very) short roundup of Skills ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/a-few-claude-skills-for-r-users/">A Few Claude Skills for R Users</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://rworks.dev/posts/claude-skills-for-r-users/"> R Works</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues with the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
 





<p>If you’re like me, you might be feeling a bit overwhelmed by all the new AI tools for coding. So, this post may be adding one more thing to your plate, but I promise to keep it as whelming as possible. <img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f604.png" alt="😄" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<p>This is a (very) short roundup of Skills created by members of the community that are especially helpful for R users. Note that I won’t show Claude output, but rather, point you to resources on where to find Skills. I’m still very much a newbie in this space. If I misrepresent anything, or if you know of another Skill that should be included, please reach out on <a href="https://bsky.app/profile/ivelasq3.bsky.social" rel="nofollow" target="_blank">Bluesky</a>.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Does it have to be Claude?
</div>
</div>
<div class="callout-body-container callout-body">
<p>Although I mention “Claude Skills” throughout this post, other providers have adopted similar features for modular, task-specific capabilities in their LLM tools. They often use the same <code>SKILL.md</code> format, with the AI tools designed to look for a folder (often called <code>.skills/</code>) containing these Markdown-based instructions.</p>
</div>
</div>
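<p>For orientation, a minimal <code>SKILL.md</code> typically looks something like the sketch below. The skill shown here is invented for illustration, and frontmatter fields beyond <code>name</code> and <code>description</code> vary by tool.</p>
<pre><code>---
name: tidyverse-style-r
description: Guidance for writing modern, tidyverse-style R code.
  Use when writing or reviewing R scripts.
---

# Tidyverse-style R

- Prefer pivot_longer()/pivot_wider() over gather()/spread().
- Use the native pipe |> and dplyr 1.1.0+ idioms.
- Avoid superseded reshape2 functions such as cast().
</code></pre>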
<section id="quick-definitions" class="level2">
<h2 class="anchored" data-anchor-id="quick-definitions">Quick definitions</h2>
<p>If you haven’t installed Claude Code yet, Anthropic has <a href="https://code.claude.com/docs/en/quickstart" rel="nofollow" target="_blank">great documentation</a> to get you started. Here are a few broad definitions to set the stage:</p>
<ul>
<li><a href="https://www.anthropic.com/" rel="nofollow" target="_blank">Anthropic</a>: An AI company that builds AI systems.</li>
<li><a href="https://claude.ai/" rel="nofollow" target="_blank">Claude</a>: An AI assistant created by Anthropic that can help with a wide range of tasks, including coding.</li>
<li>Claude Models: Different versions of Claude, such as <a href="https://www.anthropic.com/claude/opus" rel="nofollow" target="_blank">Claude Opus 4.5</a>, <a href="https://www.anthropic.com/claude/sonnet" rel="nofollow" target="_blank">Claude Sonnet 4.5</a>, and <a href="https://www.anthropic.com/claude/haiku" rel="nofollow" target="_blank">Claude Haiku 4.5</a>. Each model offers different trade-offs between performance, speed, and cost.</li>
<li><a href="https://claude.com/product/claude-code" rel="nofollow" target="_blank">Claude Code</a>: A command-line interface (CLI) that brings Claude into your terminal.</li>
<li><a href="https://claude.com/blog/using-claude-md-files" rel="nofollow" target="_blank">CLAUDE.md</a>: A configuration file where you can give Claude Code project-specific context, preferences, and instructions.</li>
<li><a href="https://claude.com/skills" rel="nofollow" target="_blank">Claude Skills</a>: Reusable, specialized commands that help Claude handle common development tasks more consistently.</li>
</ul>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>A question many people (<em>cough</em> I <em>cough</em>) have wondered about is the difference between CLAUDE.md and Claude Skills. The key distinction is how broadly the information should apply and how much of Claude’s context window you want to use. If you want Claude to always be aware of certain information for every task in a project, use CLAUDE.md. This might include project conventions, coding style, or high-level rules. If the information is only relevant for specific tasks, Claude Skills are a better fit. Skills let you scope guidance to when it’s actually needed, instead of filling up the context window with instructions that don’t apply most of the time. (Thanks to my colleague Nick Pelikan for helping clarify this.)</p>
</div>
</div>
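<p>As a concrete (and purely hypothetical, illustrative) example of the kind of always-on context that belongs in <code>CLAUDE.md</code> rather than in a Skill:</p>
<pre><code># CLAUDE.md -- always-loaded project context (illustrative)

## Conventions
- This is an R package developed with devtools; run devtools::test()
  after any change under R/.
- Follow the tidyverse style guide; use the native pipe |>.

## Rules
- Never modify files under data-raw/ without asking first.
</code></pre>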
</section>
<section id="claude-skills-for-r-users-a-roundup" class="level2">
<h2 class="anchored" data-anchor-id="claude-skills-for-r-users-a-roundup">Claude Skills for R users: a roundup</h2>
<section id="claude-r-tidyverse-expert-by-sarah-johnson" class="level3">
<h3 class="anchored" data-anchor-id="claude-r-tidyverse-expert-by-sarah-johnson">1. Claude R Tidyverse Expert by Sarah Johnson</h3>
<p>Ever asked an LLM for R code, only for it to kindly give you a response that uses <code>spread()</code> (superseded back in 2019)?</p>
<p><a href="https://sarahjohnson.io/" rel="nofollow" target="_blank">Sarah Johnson</a> created the <a href="https://gist.github.com/sj-io/3828d64d0969f2a0f05297e59e6c15ad" rel="nofollow" target="_blank">Modern R Development Guide</a> to help Claude Code behave like a modern R user. Among other things, it guides Claude to prefer tidyverse-style solutions, use recent versions of packages like dplyr 1.1.0+, and avoid outdated patterns. Never see <code>cast()</code> again!</p>
<center>
<blockquote class="bluesky-embed blockquote" data-bluesky-uri="at://did:plc:z37oae56a45bzgiybi23p4my/app.bsky.feed.post/3lwvpothzjk2o" data-bluesky-cid="bafyreifbf5heboqaalqpo25ovz532myn6gsucfj7bxlwjzqoik6yakqkca" data-bluesky-embed-color-mode="system">
<p lang="en">
</p><p>I was loving Claude Code… until I tried it with #rstats. Constant errors, wouldn&#8217;t use the tidyverse even when asked, &#8220;optimized&#8221; functions were slower.</p>
Frustrated, I started a session just to teach R to Claude and summarize what it learned into a CLAUDE.md file gist.github.com/sj-io/3828d6…<br><br><a href="https://bsky.app/profile/did:plc:z37oae56a45bzgiybi23p4my/post/3lwvpothzjk2o?ref_src=embed" rel="nofollow" target="_blank">[image or embed]</a>
<p></p>
— sarah (<a href="https://bsky.app/profile/did:plc:z37oae56a45bzgiybi23p4my?ref_src=embed" rel="nofollow" target="_blank"><span class="citation" data-cites="sarahjohnson.io">@sarahjohnson.io</span></a>) <a href="https://bsky.app/profile/did:plc:z37oae56a45bzgiybi23p4my/post/3lwvpothzjk2o?ref_src=embed" rel="nofollow" target="_blank">August 21, 2025 at 5:18 AM</a>
</blockquote>
<script async="" src="https://embed.bsky.app/static/embed.js" charset="utf-8"></script>
</center>
<p>One nice follow-up suggestion comes from <a href="https://bsky.app/profile/jeremy-data.bsky.social/post/3mc3lucxbks2v" rel="nofollow" target="_blank">Jeremy Allen</a>, who recommends breaking this Skill into smaller ones if you don’t need all the guidance at once. He has also created a Skill that can pull in <a href="https://github.com/jeremy-allen/claude-skills/tree/main/deliver-posit-news" rel="nofollow" target="_blank">recent updates</a> from Posit!</p>
</section>
<section id="claude-code-r-skills-by-alistair-bailey" class="level3">
<h3 class="anchored" data-anchor-id="claude-code-r-skills-by-alistair-bailey">2. Claude Code R Skills by Alistair Bailey</h3>
<p><a href="https://bsky.app/profile/ab604.uk" rel="nofollow" target="_blank">Alistair Bailey</a> used and built upon Sarah’s (and others’) Skills above to create <a href="https://github.com/ab604/claude-code-r-skills?tab=readme-ov-file#token-optimization" rel="nofollow" target="_blank">Claude Code R Skills</a>. I particularly enjoy the section on <a href="https://github.com/ab604/claude-code-r-skills?tab=readme-ov-file#recommended-workflow" rel="nofollow" target="_blank">recommended workflow</a>, which provides a specific order for Claude to follow when writing code.</p>
<center>
<blockquote class="bluesky-embed blockquote" data-bluesky-uri="at://did:plc:xlq5qg6yjvwxfa26oizvy43u/app.bsky.feed.post/3mdusyrw3ek26" data-bluesky-cid="bafyreicphe726tkqdttywpftnq7xpmy37ffevvuopvb4spb3zpkkgrboge" data-bluesky-embed-color-mode="system">
<p lang="en">
Based on others work, I&#8217;ve created Claude Code configurations for R: modular skills (tidyverse, rlang, performance, OOP, testing), enforcement rules (security, testing, git workflow), workflow commands (planning, code review, TDD), and context management hooks. #claudecode #rstats<br><br><a href="https://bsky.app/profile/did:plc:xlq5qg6yjvwxfa26oizvy43u/post/3mdusyrw3ek26?ref_src=embed" rel="nofollow" target="_blank">[image or embed]</a>
</p>
— Alistair Bailey (<a href="https://bsky.app/profile/did:plc:xlq5qg6yjvwxfa26oizvy43u?ref_src=embed" rel="nofollow" target="_blank"><span class="citation" data-cites="ab604.uk">@ab604.uk</span></a>) <a href="https://bsky.app/profile/did:plc:xlq5qg6yjvwxfa26oizvy43u/post/3mdusyrw3ek26?ref_src=embed" rel="nofollow" target="_blank">February 2, 2026 at 12:12 PM</a>
</blockquote>
<script async="" src="https://embed.bsky.app/static/embed.js" charset="utf-8"></script>
</center>
</section>
<section id="posit-claude-skills" class="level3">
<h3 class="anchored" data-anchor-id="posit-claude-skills">3. Posit Claude Skills</h3>
<p>Several folks at Posit have been experimenting with Claude Skills and sharing them in this <a href="https://github.com/posit-dev/skills" rel="nofollow" target="_blank">GitHub repository</a>. A few that may be especially interesting for R users include:</p>
<ul>
<li><a href="https://github.com/posit-dev/skills/blob/main/quarto/README.md#quarto-authoring-skill" rel="nofollow" target="_blank">Quarto Authoring Skill</a>: Helpful for converting existing R Markdown projects to Quarto.</li>
<li><a href="https://github.com/posit-dev/skills/blob/main/open-source/create-release-checklist/SKILL.md" rel="nofollow" target="_blank">Create an R Package Release Checklist Skill</a>: Create a release checklist and GitHub issue for an R package, with automatic version calculation and customizable checklist generation.</li>
<li><a href="https://github.com/posit-dev/skills/blob/main/shiny/shiny-bslib/SKILL.md" rel="nofollow" target="_blank">Modern Shiny Apps with bslib Skill</a>: Build modern Shiny dashboards using bslib with Bootstrap 5 layouts, cards, value boxes, navigation, theming, and modern inputs. Includes a migration guide from legacy Shiny patterns.</li>
</ul>
</section>
<section id="brand.yml-skills" class="level3">
<h3 class="anchored" data-anchor-id="brand.yml-skills">4. <code>_brand.yml</code> Skills</h3>
<p><a href="https://posit-dev.github.io/brand-yml/" rel="nofollow" target="_blank">brand.yml</a> allows you to create reports, apps, dashboards, plots, and more that match your company’s brand guidelines with a single YAML file. It is currently supported in Quarto and Shiny (for R and Python).</p>
<p>Here are a few Skills to help you create your <code>_brand.yml</code> file:</p>
<ul>
<li><a href="https://github.com/posit-dev/skills/blob/main/brand-yml/SKILL.md" rel="nofollow" target="_blank">brand.yml Skill by posit-dev</a></li>
<li><a href="https://github.com/stephenturner/skill-brand-yml" rel="nofollow" target="_blank">brand.yml Skill by Stephen Turner</a>: <a href="https://bsky.app/profile/stephenturner.us" rel="nofollow" target="_blank">Stephen Turner</a> walks through the process of developing this skill in his post, <a href="https://blog.stephenturner.us/p/brand-yml-claude-skill-uva-sds-quarto" rel="nofollow" target="_blank">A Claude Skill for _brand.yml, and sharing with Quarto 1.9</a>.</li>
</ul>
<center>
<blockquote class="bluesky-embed blockquote" data-bluesky-uri="at://did:plc:ppvxhapnptcy5v6cih3ynmzg/app.bsky.feed.post/3mfreyzjbok2q" data-bluesky-cid="bafyreiclocg2ivthqgnieq6nbchqcvymua3qqmjp4iae2mjs7oss7eyh6y" data-bluesky-embed-color-mode="system">
<p lang="en">
I created a Claude Skill to make _brand.yml files for your organization, and with the upcoming Quarto 1.9 release you can share brand.yml files via GitHub and <code>quarto use brand</code>. More details and how to use it: blog.stephenturner.us/p/brand-yml-… #Rstats 1/ <img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f9f5.png" alt="🧵" class="wp-smiley" style="height: 1em; max-height: 1em;" /><br><br><a href="https://bsky.app/profile/did:plc:ppvxhapnptcy5v6cih3ynmzg/post/3mfreyzjbok2q?ref_src=embed" rel="nofollow" target="_blank">[image or embed]</a>
</p>
— Stephen Turner (<a href="https://bsky.app/profile/did:plc:ppvxhapnptcy5v6cih3ynmzg?ref_src=embed" rel="nofollow" target="_blank"><span class="citation" data-cites="stephenturner.us">@stephenturner.us</span></a>) <a href="https://bsky.app/profile/did:plc:ppvxhapnptcy5v6cih3ynmzg/post/3mfreyzjbok2q?ref_src=embed" rel="nofollow" target="_blank">February 26, 2026 at 2:14 PM</a>
</blockquote>
<script async="" src="https://embed.bsky.app/static/embed.js" charset="utf-8"></script>
</center>
</section>
<section id="learning-opportunities-a-claude-code-skill-for-deliberate-skill-development-by-cat-hicks" class="level3">
<h3 class="anchored" data-anchor-id="learning-opportunities-a-claude-code-skill-for-deliberate-skill-development-by-cat-hicks">5. Learning Opportunities: A Claude Code Skill for Deliberate Skill Development by Cat Hicks</h3>
<p>If you are worried about all these Skills deteriorating your R coding skills, check out <a href="https://github.com/DrCatHicks/learning-opportunities" rel="nofollow" target="_blank">Learning Opportunities: A Claude Code Skill for Deliberate Skill Development</a> from <a href="https://bsky.app/profile/grimalkina.bsky.social" rel="nofollow" target="_blank">Cat Hicks</a>. It uses a “dynamic textbook” approach to help you deliberately work your coding muscles <strong>while</strong> you’re using LLM tools.</p>
<center>
<blockquote class="bluesky-embed blockquote" data-bluesky-uri="at://did:plc:yjvayj5thzisljwor7yykhlx/app.bsky.feed.post/3mevvbm3a6s26" data-bluesky-cid="bafyreigsjkiynnec4bfmouz7gfo34v7qyv67yn2px3rouxswkz7xezb7fa" data-bluesky-embed-color-mode="system">
<p lang="en">
</p><p>Can you learn AND offload? Yes. We do it all the time across our days. You just cannot do every single thing at once, and have to think about the structural support for different (sometimes competing) goals.</p>
Self-regulation isn&#8217;t new, but it is a vital skill for developers now.
<p></p>
— Cat Hicks (<a href="https://bsky.app/profile/did:plc:yjvayj5thzisljwor7yykhlx?ref_src=embed" rel="nofollow" target="_blank"><span class="citation" data-cites="grimalkina.bsky.social">@grimalkina.bsky.social</span></a>) <a href="https://bsky.app/profile/did:plc:yjvayj5thzisljwor7yykhlx/post/3mevvbm3a6s26?ref_src=embed" rel="nofollow" target="_blank">February 15, 2026 at 3:51 PM</a>
</blockquote>
<script async="" src="https://embed.bsky.app/static/embed.js" charset="utf-8"></script>
</center>
</section>
</section>
<section id="how-to-add-a-claude-skill" class="level2">
<h2 class="anchored" data-anchor-id="how-to-add-a-claude-skill">How to add a Claude Skill</h2>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>While you can use Claude Code in RStudio, I have been using <a href="https://positron.posit.co/" rel="nofollow" target="_blank">Positron</a>. Posit has started rolling out <a href="https://posit.co/products/ai/" rel="nofollow" target="_blank">Posit AI</a> in RStudio, which also supports Skills.</p>
</div>
</div>
<section id="install-from-a-github-repository" class="level3">
<h3 class="anchored" data-anchor-id="install-from-a-github-repository">1. Install from a GitHub repository</h3>
<p>As shown in the <a href="https://github.com/posit-dev/skills?tab=readme-ov-file#installation" rel="nofollow" target="_blank">Posit Claude Skills README</a>, you can install Skills directly from a GitHub repository using a Claude Code command. For example, this installs all of the Quarto-related Skills from the Posit Claude Skills repo:</p>
<pre>/plugin install quarto@posit-dev-skills</pre>
<p>This is a good option if you want to pull in a maintained set of Skills all at once.</p>
</section>
<section id="install-from-a-local-directory" class="level3">
<h3 class="anchored" data-anchor-id="install-from-a-local-directory">2. Install from a local directory</h3>
<p>If you’ve downloaded a Skill locally, you can install it directly from its folder:</p>
<pre>/plugin add /path/to/skill-directory</pre>
</section>
<section id="manual-installation" class="level3">
<h3 class="anchored" data-anchor-id="manual-installation">3. Manual installation</h3>
<p>You can also install a Skill by placing it directly in the appropriate directory:</p>
<ul>
<li>For personal Skills: <code>~/.claude/skills/skill-name/</code> (in your home directory)</li>
<li>For project Skills: <code>.claude/skills/skill-name/</code> (in your project root)</li>
</ul>
<p>Once the files are in place, Claude Code will automatically discover and use the Skill when it’s relevant.</p>
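<p>For example, here is a minimal base-R sketch of the manual route. It assumes, as in the repositories above, that a Skill is simply a folder containing a <code>SKILL.md</code> file; the skill name and source path below are hypothetical placeholders:</p>
<pre># copy a downloaded Skill into your personal skills directory (sketch)
skill_name &lt;- &quot;my-r-skill&quot;   # hypothetical name
dest &lt;- file.path(path.expand(&quot;~&quot;), &quot;.claude&quot;, &quot;skills&quot;, skill_name)
dir.create(dest, recursive = TRUE, showWarnings = FALSE)
file.copy(
  from = list.files(&quot;/path/to/skill-directory&quot;, full.names = TRUE),
  to = dest, recursive = TRUE
)</pre>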
</section>
</section>
<section id="create-your-own-skill" class="level2">
<h2 class="anchored" data-anchor-id="create-your-own-skill">Create your own Skill</h2>
<p>Anthropic has <a href="https://code.claude.com/docs/en/skills#create-your-first-skill" rel="nofollow" target="_blank">documentation</a> on creating your own Claude Skill. I enjoy looking through other people’s Skills to see how they organize and develop them. Trying them out and further tweaking them is a great way of creating a customized Skill of your own.</p>
<p>As both Stephen and <a href="https://bsky.app/profile/sarahjohnson.io/post/3lwxvrldnjc2o" rel="nofollow" target="_blank">Sarah note</a>, you can give Claude Code examples, documentation, and guidance, then ask it to help generate the Skill for you. Perhaps there’s a good Skill out there for creating Skills. It’s Skills all the way down!</p>
</section>
<section id="more-on-using-claude-code-for-r-development" class="level2">
<h2 class="anchored" data-anchor-id="more-on-using-claude-code-for-r-development">More on using Claude Code for R development</h2>
<p>If you want examples of Claude Code in action with R, Simon Couch has a couple of great blog posts on the subject: <a href="https://www.simonpcouch.com/blog/2025-03-26-claude-code/" rel="nofollow" target="_blank">Post 1</a>, <a href="https://www.simonpcouch.com/blog/2025-07-17-claude-code-2/" rel="nofollow" target="_blank">Post 2</a>. They’re a great complement to this roundup and show what this can look like in real workflows.</p>
<p>New features and techniques are popping up all the time. If you’re experimenting with Claude Code for R and find something useful, please reach out anytime!</p>


</section>

 
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://rworks.dev/posts/claude-skills-for-r-users/"> R Works</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/a-few-claude-skills-for-r-users/">A Few Claude Skills for R Users</a>]]></content:encoded>
					
		
		<enclosure url="https://rworks.dev/posts/claude-skills-for-r-users/thumbnail.png" length="0" type="image/png" />

		<post-id xmlns="com-wordpress:feed-additions:1">399506</post-id>	</item>
		<item>
		<title>Getting to the bottom of TMLE: the (almost) vanishing nuisance interaction</title>
		<link>https://www.r-bloggers.com/2026/03/getting-to-the-bottom-of-tmle-the-almost-vanishing-nuisance-interaction/</link>
		
		<dc:creator><![CDATA[ouR data generation]]></dc:creator>
		<pubDate>Mon, 02 Mar 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://www.rdatagen.net/post/2026-03-03-getting-to-the-bottom-of-tmle-simulating-the-orthogonality/</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> In the previous post, I argued that understanding TMLE starts with understanding how estimation error behaves. In particular, we saw that influence functions allow us to separate sampling variability from nuisance estimation error. But something subtle...</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/getting-to-the-bottom-of-tmle-the-almost-vanishing-nuisance-interaction/">Getting to the bottom of TMLE: the (almost) vanishing nuisance interaction</a>]]></description>
					<content:encoded><![CDATA[

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://www.rdatagen.net/post/2026-03-03-getting-to-the-bottom-of-tmle-simulating-the-orthogonality/"> ouR data generation</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p>In the <a href="https://www.rdatagen.net/post/2026-02-05-getting-to-the-bottom-of-tmle-1/" rel="nofollow" target="_blank">previous post</a>, I argued that understanding TMLE starts with understanding how estimation error behaves. In particular, we saw that influence functions allow us to separate sampling variability from nuisance estimation error. But something subtle happens when nuisance models are estimated rather than known. The interaction term that captures their effect on the target parameter appears to shrink as the sample size grows, sometimes quite a bit. In this post, I explore that behavior through simulation. We’ll see that the nuisance interaction does shrink (though perhaps not fast enough to ignore).</p>




<div id="a-key-theoretical-underpinning-a-quick-recap" class="section level3">
<h3>A key theoretical underpinning — a quick recap</h3>
<p>In these analyses, we care about some target parameter <span class="math inline">\(T(P_0)\)</span>, but in practice we only observe data drawn from that distribution, from which we can compute <span class="math inline">\(T(P_n)\)</span>. The quantity that really matters is the difference
<span class="math display">\[
T\big(P_n\big) − T\big(P_0\big),
\]</span>
because that difference determines bias, variance, and uncertainty. Using the influence function, this error can be approximated as
<span class="math display">\[
T\big(P_n\big) − T\big(P_0\big) \approx \big(P_n − P_0\big) \phi_{P_0}.
\]</span>
This is powerful because the right-hand side has well-understood statistical behavior. Unfortunately, we never observe the true influence function <span class="math inline">\(\phi_{P_0}\)</span>. In causal problems, the influence function depends on nuisance components, outcome regressions and treatment mechanisms that must be estimated from data. So in practice, we replace the true influence function with an estimated <span class="math inline">\(\phi_{\hat{P}}\)</span>, and the key quantity becomes
<span class="math display">\[
\big(P_n−P_0\big) \phi_{\hat{P}}.
\]</span>
This can be decomposed as
<span class="math display">\[
\big(P_n−P_0\big) \phi_{\hat{P}} = \big(P_n − P_0\big)  \phi_{P_0} + \big( P_n − P_0 \big) \big( \phi_{\hat{P}} − \phi_{P_0} \big)
\]</span>
The first term is the “good” stochastic fluctuation we understand; the second term is the dangerous one. It is the pathway through which errors in the nuisance models can leak into the leading behavior of the estimator: nuisance errors do not affect the estimator directly, but only through their interaction with the sampling variability captured by <span class="math inline">\(P_n−P_0\)</span>. If this term does not vanish, flexible nuisance estimation could distort the target parameter itself. However, when the influence function is constructed properly (and I think this is a key theoretical foundation behind TMLE), this leakage term shrinks toward zero as the sample size grows:
<span class="math display">\[
\big( P_n − P_0 \big) \big( \phi_{\hat{P}} − \phi_{P_0} \big) \rightarrow 0
\]</span>
I really wanted to see whether we can observe this, at least in an artificial setting with simulated data. In particular, I wanted to see whether this nuisance-driven term actually disappears as the sample size increases, even when the nuisance models are misspecified.</p>
</div>
<div id="a-concrete-example-the-ate-influence-function" class="section level3">
<h3>A concrete example: the ATE influence function</h3>
<p>To make this discussion less abstract, we need a concrete influence function. Suppose our target parameter is the average treatment effect (ATE):
<span class="math display">\[\psi_0 = E_{P_0}\big[ Y_1 − Y_0 \big].\]</span>
Under the usual identification conditions (consistency, exchangeability, and positivity), this can be written as a functional of the observed data distribution.</p>
<p>The efficient influence function for the ATE is:
<span class="math display">\[\phi_{P_0} \big( Z \big ) = \big (Q_1 (X) − Q_0(X) − \psi_0 \big) + \frac{A}{g(X)} \big( Y − Q_1(X)\big) − \frac{1−A}{1−g(X)} \big( Y − Q_0(X) \big),\]</span>
where the nuisance functions for the outcome (<span class="math inline">\(Q\)</span>) and propensity score (<span class="math inline">\(g\)</span>), respectively, are
<span class="math display">\[
Q_a(X)=E[Y∣A=a,X], \ \ \ \ \ \ g(X)=P(A=1∣X).
\]</span>
I won’t go into the derivation of this influence function here (and maybe not anywhere, since there are many other sources far more qualified than me) but the structure is important. The nuisance functions appear in two distinct roles: directly through the plug-in term
<span class="math inline">\(Q_1(X)−Q_0(X)\)</span>, and indirectly through residual-based corrections such as <span class="math inline">\(Y−Q_a(X)\)</span>, with the propensity score entering through weights. This layered form means that errors in the nuisance models do not affect the influence function in a single direction, but instead enter through both plug-in and correction terms.</p>
<p>If we knew the true outcome model <span class="math inline">\(Q_a(X)\)</span> and the true treatment mechanism <span class="math inline">\(g(X)\)</span>, we would know the true influence function <span class="math inline">\(\phi_{P_0}\)</span>. But in practice, we must estimate them, producing an estimated influence function <span class="math inline">\(\phi_{\hat{P}}\)</span>. So for the ATE, the “dangerous term” <span class="math inline">\((P_n−P_0)(\phi_{\hat{P}} − \phi_{P_0})\)</span> is driven entirely by how errors in estimating
<span class="math inline">\(Q\)</span> and <span class="math inline">\(g\)</span> propagate through this expression.</p>
<p>In other words, misspecifying either the outcome model or the treatment model changes the influence function itself. If orthogonality were not present, these changes could enter the leading behavior of the estimator. Theory suggests that, for this influence function, their impact should diminish with increasing sample size.</p>
<p>Even when <span class="math inline">\(Q\)</span> and <span class="math inline">\(g\)</span> are estimated imperfectly, the interaction <span class="math inline">\((P_n−P_0)(\phi_{\hat{P}} − \phi_{P_0})\)</span> should shrink toward zero. The simulation below is built around this specific influence function. We will deliberately estimate <span class="math inline">\(Q\)</span> and <span class="math inline">\(g\)</span> correctly or incorrectly, construct an estimated EIF, and observe whether this nuisance-driven term vanishes as the sample size grows.</p>
</div>
<div id="a-note-on-cross-fitting" class="section level3">
<h3>A note on cross-fitting</h3>
<p>Before getting to the simulation, I want to point out why I use cross-fitting to estimate the nuisance parameters. While not strictly required for TMLE, it is generally good practice, especially when using flexible models. The quantity we want to examine, <span class="math inline">\((P_n−P_0)(\phi_{\hat{P}} − \phi_{P_0})\)</span>, captures how nuisance estimation error interacts with sampling variability.</p>
<p>Without cross-fitting, both the empirical fluctuation <span class="math inline">\((P_n − P_0)\)</span> and the nuisance-driven error <span class="math inline">\(\phi_{\hat{P}} − \phi_{P_0}\)</span> are functions of the same data and therefore share the same randomness. Cross-fitting separates these sources of variation, reducing the feedback between nuisance estimation and empirical fluctuation, and allowing their interaction to better reflect the theoretical quantity of interest.</p>
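<p>As a toy illustration in base R (a sketch of the splitting idea only, not the full EIF machinery), two-fold cross-fitting means that each observation&#8217;s nuisance prediction comes from a model fit on the <em>other</em> fold:</p>
<pre># two-fold cross-fitting sketch: fit on one fold, predict on the other, swap
set.seed(1)
n &lt;- 200
x &lt;- rnorm(n)
y &lt;- 2 * x + rnorm(n)

idx &lt;- sample.int(n)
I1 &lt;- idx[1:(n/2)]
I2 &lt;- idx[(n/2 + 1):n]

fit1 &lt;- lm(y ~ x, data = data.frame(x = x[I1], y = y[I1]))
fit2 &lt;- lm(y ~ x, data = data.frame(x = x[I2], y = y[I2]))

# out-of-fold residuals: no prediction uses the data its model was fit on
r &lt;- numeric(n)
r[I2] &lt;- y[I2] - predict(fit1, newdata = data.frame(x = x[I2]))
r[I1] &lt;- y[I1] - predict(fit2, newdata = data.frame(x = x[I1]))

mean(r)  # close to zero</pre>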
</div>
<div id="simulating-the-vanishing-term" class="section level3">
<h3>Simulating the vanishing term</h3>
<p>Before we get to the key functions, we need to load the two libraries:</p>
<pre>library(simstudy)
library(data.table)</pre>
<div id="data-generating-process" class="section level4">
<h4>Data-generating process</h4>
<p>First, we define a simple data-generating process. This creates covariates (<span class="math inline">\(X_1\)</span> and <span class="math inline">\(X_2\)</span>), treatment assignment (<span class="math inline">\(A\)</span>) driven by those covariates, and an outcome <span class="math inline">\(Y\)</span> that depends on treatment, covariates, and their interaction. The parameter <span class="math inline">\(\tau\)</span> determines the true treatment effect.</p>
<pre>gen_dgp &lt;- function(n) {
  
  def &lt;- 
    defData(varname = &quot;x1&quot;, formula = .5, dist = &quot;binary&quot;) |&gt;
    defData(varname = &quot;x2&quot;, formula = 0, variance = 1) |&gt;
    defData(
      varname = &quot;a&quot;, 
      formula = &quot;-0.2 + 0.8 * x1 + 0.6 * x2&quot;, 
      dist = &quot;binary&quot;, 
      link = &quot;logit&quot;
    ) |&gt;
    defData(
      varname = &quot;y&quot;, 
      formula = &quot;..tau * a + 1.0 * x1 + 1.0 * x2 + 1.5 * x1 * x2&quot;,
      variance = 1,
      dist = &quot;normal&quot;
    )
  
  genData(n, def)[]
  
}</pre>
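<p>If you don&#8217;t have <code>simstudy</code> installed, an equivalent generator in base R (a sketch matching the same coefficients; note that <code>tau</code> is passed explicitly here rather than pulled from the calling environment via <code>..tau</code>) would be:</p>
<pre>gen_dgp_base &lt;- function(n, tau) {
  x1 &lt;- rbinom(n, 1, 0.5)   # binary covariate
  x2 &lt;- rnorm(n)            # continuous covariate
  a  &lt;- rbinom(n, 1, plogis(-0.2 + 0.8 * x1 + 0.6 * x2))  # treatment
  y  &lt;- tau * a + 1.0 * x1 + 1.0 * x2 + 1.5 * x1 * x2 + rnorm(n)
  data.frame(x1 = x1, x2 = x2, a = a, y = y)
}

set.seed(1)
dd &lt;- gen_dgp_base(1000, tau = 5)</pre>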
</div>
<div id="fitting-nuisance-models" class="section level4">
<h4>Fitting nuisance models</h4>
<p>Next, we create a helper function that fits the nuisance models — the outcome regression <span class="math inline">\(Q\)</span> and the propensity score <span class="math inline">\(g\)</span>.</p>
<p>Depending on the scenario, these models are either correctly specified or deliberately misspecified. This allows us to examine whether the vanishing term behaves differently when nuisance models are wrong.</p>
<pre>fit_nuisance &lt;- function(dt, scenario) {
  
  # Outcome regression Q(a,x)
  
  if (scenario %in% c(&quot;both_correct&quot;, &quot;g_wrong&quot;)) {
    Q_fit &lt;- lm(y ~ a + x1 + x2 + x1:x2, data = dt)  # correct
  } else {
    Q_fit &lt;- lm(y ~ a + x1, data = dt)               # wrong on purpose
  }
  
  # Propensity model g(x)
  
  if (scenario %in% c(&quot;both_correct&quot;, &quot;Q_wrong&quot;)) {
    g_fit &lt;- glm(a ~ x1 + x2, data = dt, family = binomial())  # correct
  } else {
    g_fit &lt;- glm(a ~ x1, data = dt, family = binomial())       # wrong on purpose
  }
  
  list(Q_fit = Q_fit, g_fit = g_fit)
}</pre>
</div>
<div id="predictions-and-true-nuisance-functions" class="section level4">
<h4>Predictions and true nuisance functions</h4>
<p>These functions generate predicted values from the fitted nuisance models, as well as the true outcome regression and propensity score implied by the data-generating process.</p>
<p>The predicted versions reflect estimation error; the true versions give us the benchmark influence function we would have if the nuisances were known.</p>
<pre>predict_Q &lt;- function(Q_fit, dt, a_val) {
  nd &lt;- copy(dt)
  nd[, a := a_val]
  as.numeric(predict(Q_fit, newdata = nd))
}

predict_g &lt;- function(g_fit, dt) {
  p &lt;- as.numeric(predict(g_fit, newdata = dt, type = &quot;response&quot;))
  pmin(pmax(p, 0.01), 0.99)  # simple stabilization
}

Q_true &lt;- function(dt, a_val, tau) {
  tau * a_val + 1.0 * dt$x1 + 1.0 * dt$x2 + 1.5 * dt$x1 * dt$x2
}

g_true &lt;- function(dt) {
  plogis(-0.2 + 0.8 * dt$x1 + 0.6 * dt$x2)
}</pre>
</div>
<div id="constructing-the-influence-function" class="section level4">
<h4>Constructing the influence function</h4>
<p>Using the EIF expression for the ATE defined above, we can construct two versions:</p>
<ul>
<li>an estimated influence function based on fitted nuisance models</li>
<li>the true influence function based on the known data-generating process</li>
</ul>
<p>One small technical detail arises here. The EIF includes the target parameter <span class="math inline">\(\psi_0\)</span>, and by definition it is centered — its mean should be zero under the relevant distribution. When we construct an estimated EIF, we therefore need to plug in a compatible estimate <span class="math inline">\(\hat{\psi}\)</span>.</p>
<p>In this simulation, <span class="math inline">\(\hat{\psi}\)</span> is not the object of interest. Instead, it serves only to center the estimated influence function so that it behaves like a true influence function. To do this, we compute a fold-specific <span class="math inline">\(\hat{\psi}\)</span> using the same fitted nuisance models that are used to build the estimated EIF.</p>
<p>In principle, we could center the EIF using a simple plug-in estimate such as the average of
<span class="math inline">\(Q_1(X)−Q_0(X)\)</span>. Instead, we use an adjusted version that also includes a residual-based correction involving <span class="math inline">\(\hat{Q}\)</span> and <span class="math inline">\(\hat{g}\)</span>. This choice ensures that the estimated EIF has approximately mean zero in the evaluation fold, making it behave more like the true influence function constructed from the same nuisance fits.</p>
<p>This allows us to compare the estimated influence function based on <span class="math inline">\(\hat{Q}\)</span> and <span class="math inline">\(\hat{g}\)</span>, and the true influence function based on the known data-generating process, and ultimately evaluate how nuisance estimation error propagates through the EIF.</p>
<pre>psi_hat_from_fits &lt;- function(dt, Q_fit, g_fit) {
  Q1 &lt;- predict_Q(Q_fit, dt, 1)
  Q0 &lt;- predict_Q(Q_fit, dt, 0)
  g  &lt;- predict_g(g_fit, dt)
  A &lt;- dt$a; Y &lt;- dt$y
  mean((Q1 - Q0) + A/g * (Y - Q1) - (1 - A)/(1 - g) * (Y - Q0))
}

phi_ate &lt;- function(dt, Q1, Q0, g, psi) {
  A &lt;- dt$a
  Y &lt;- dt$y
  (Q1 - Q0 - psi) + A/g * (Y - Q1) - (1 - A)/(1 - g) * (Y - Q0)
}

# Build phi_hat using your fitted nuisances (AIPW-style)

phi_hat_from_fits &lt;- function(dt, Q_fit, g_fit, psi_hat) {
  Q1 &lt;- predict_Q(Q_fit, dt, 1)
  Q0 &lt;- predict_Q(Q_fit, dt, 0)
  g  &lt;- predict_g(g_fit, dt)
  phi_ate(dt, Q1, Q0, g, psi_hat)
}

# Build phi0 from true nuisances

phi0_true &lt;- function(dt, tau) {
  Q1 &lt;- Q_true(dt, 1, tau)
  Q0 &lt;- Q_true(dt, 0, tau)
  g  &lt;- g_true(dt)
  psi0 &lt;- tau
  phi_ate(dt, Q1, Q0, g, psi0)
}</pre>
</div>
<div id="estimating-the-term-we-hope-will-vanish" class="section level4">
<h4>Estimating the term we hope will vanish</h4>
<p>This function performs the core task of the simulation. For a given data set:</p>
<ul>
<li>we split the data into two folds</li>
<li>fit nuisance models on each fold</li>
<li>compute a cross-fitted EIF</li>
</ul>
<p>We then compare the estimated EIF with the true EIF both in the sample and in an independent population draw (which is fixed across iterations). This allows us to approximate the nuisance-driven interaction term whose behavior we want to study.</p>
<pre>est_2T &lt;- function(scenario, dd, tau, dd_pop) {
  
  n &lt;- nrow(dd)
  idx &lt;- sample.int(n)
  I1 &lt;- idx[1:floor(n/2)]
  I2 &lt;- idx[(floor(n/2)+1):n]
  
  # fit nuisances on each training fold
  
  fits1 &lt;- fit_nuisance(dd[I1], scenario)  # trained on fold 1
  fits2 &lt;- fit_nuisance(dd[I2], scenario)  # trained on fold 2
  
  # cross-fitted psi_hat (evaluate each model on opposite fold, then average)
  
  psi1 &lt;- psi_hat_from_fits(dd[I2], fits1$Q_fit, fits1$g_fit)  # train 1, eval 2
  psi2 &lt;- psi_hat_from_fits(dd[I1], fits2$Q_fit, fits2$g_fit)  # train 2, eval 1
  psi_hat_cf &lt;- 0.5 * (psi1 + psi2)
  
  # cross-fitted phi_hat on dd:
  # - for obs in fold 2, use fits1 (trained on fold 1)
  # - for obs in fold 1, use fits2 (trained on fold 2)
  
  phi_hat_dd &lt;- numeric(n)
  phi_hat_dd[I2] &lt;- phi_hat_from_fits(dd[I2], fits1$Q_fit, fits1$g_fit, psi_hat_cf)
  phi_hat_dd[I1] &lt;- phi_hat_from_fits(dd[I1], fits2$Q_fit, fits2$g_fit, psi_hat_cf)
  
  dphi_dd &lt;- phi_hat_dd - phi0_true(dd, tau)
  
  # approximate P0 expectation:
  # evaluate delta-phi under each fold-specific nuisance fit on independent pop,
  # then average them (since cross-fitting produces two fitted nuisance models)
  
  dphi_pop_1 &lt;- 
    phi_hat_from_fits(dd_pop, fits1$Q_fit, fits1$g_fit, psi_hat_cf) - 
    phi0_true(dd_pop, tau)
  
  dphi_pop_2 &lt;- 
    phi_hat_from_fits(dd_pop, fits2$Q_fit, fits2$g_fit, psi_hat_cf) - 
    phi0_true(dd_pop, tau)
  
  dphi_pop &lt;- 0.5 * (dphi_pop_1 + dphi_pop_2)
  
  T2 &lt;- mean(dphi_dd) - mean(dphi_pop)
  
  data.table(scenario, n, T2)[]
}</pre>
</div>
<div id="running-the-simulation" class="section level4">
<h4>Running the simulation</h4>
<p>Finally, we repeatedly generate data and apply the procedure across sample sizes and nuisance model scenarios to see whether this interaction term shrinks toward zero.</p>
<pre>run_sim &lt;- function(n, tau, dd_pop, scenarios) {
  
  dd &lt;- gen_dgp(n)
  rbindlist(lapply(
      scenarios, 
      function(s) est_2T(s, dd, tau, dd_pop)
    )
  )
}

set.seed(1)

tau &lt;- 5
pop_dd &lt;- gen_dgp(5e5)

n &lt;- rep(c(100, 250, 750, 1000), each = 500)
scenarios &lt;- c(&quot;both_correct&quot;, &quot;Q_wrong&quot;)

res &lt;- rbindlist(
  lapply(n, function(x) run_sim(x, tau, pop_dd, scenarios))
)</pre>
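<p>The per-scenario summaries reported in the results can then be computed from <code>res</code> with a grouped <code>data.table</code> aggregation. Here is the idea with a small toy stand-in (<code>res_toy</code>) so the snippet runs on its own; in the actual simulation you would run the same summary on <code>res</code>:</p>
<pre>library(data.table)

# toy stand-in shaped like `res` above (10 replicates per cell)
res_toy &lt;- data.table(
  scenario = rep(rep(c(&quot;both_correct&quot;, &quot;Q_wrong&quot;), each = 4), times = 10),
  n        = rep(c(100, 250, 750, 1000), times = 20),
  T2       = rnorm(80, sd = 0.1)
)

# mean and spread of the interaction term by scenario and sample size
res_toy[, .(mean_T2 = mean(T2), sd_T2 = sd(T2)), keyby = .(scenario, n)]</pre>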
</div>
</div>
<div id="results" class="section level3">
<h3>Results</h3>
<p>Each point in the figure represents one estimate of the interaction term <span class="math inline">\((P_n−P_0)(\phi_{\hat{P}} − \phi_{P_0})\)</span> from a single simulated data set, across sample sizes and nuisance model scenarios.</p>
<p>When both nuisance models are correctly specified, the estimates are tightly centered around zero even at smaller sample sizes. At <span class="math inline">\(n = 100\)</span>, there is noticeable variability, but this rapidly diminishes as the sample size increases. By <span class="math inline">\(n=750\)</span> and <span class="math inline">\(n=1000\)</span>, the estimates are highly concentrated near zero, consistent with the expectation that this term should vanish.</p>
<p>More interesting is the case where the outcome model is misspecified. Here, variability remains substantially larger across all sample sizes — reflecting the fact that nuisance estimation error is present and does not disappear simply because the sample grows. However, the estimates remain centered around zero and the spread clearly decreases with increasing <span class="math inline">\(n\)</span>.</p>
<p><img src="https://i2.wp.com/www.rdatagen.net/post/2026-03-03-getting-to-the-bottom-of-tmle-simulating-the-orthogonality/code_and_output/ortho.png?w=578&#038;ssl=1" data-recalc-dims="1" /></p>
<p>For example, when both models are correct, the standard deviation drops from 0.28 at <span class="math inline">\(n=100\)</span> to 0.006 at <span class="math inline">\(n=2000\)</span>. When the outcome model is misspecified, variability is much higher initially (1.34 at <span class="math inline">\(n=100\)</span>), but still shrinks markedly with increasing <span class="math inline">\(n\)</span>, falling to 0.12 by <span class="math inline">\(n=2000\)</span>.</p>
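<p>Summaries like these can be computed directly from <code>res</code>. A minimal sketch (the mock table below simply stands in for the simulation output, which has the same columns <code>scenario</code>, <code>n</code>, and <code>T2</code>; with the real results, skip the mock and summarize <code>res</code> itself):</p>
<pre>library(data.table)

# Stand-in for the simulation output `res` above (columns: scenario, n, T2)
res = data.table(
  scenario = rep(c(&quot;both_correct&quot;, &quot;Q_wrong&quot;), each = 6),
  n = rep(c(100, 250, 1000), times = 4),
  T2 = rnorm(12, 0, 0.1)
)

# Mean and spread of the interaction-term estimates by scenario and sample size
summ = res[, .(avg = mean(T2), sd = sd(T2)), keyby = .(scenario, n)]
summ</pre>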
<p>Across all settings, the averages remain close to zero, reinforcing that nuisance error affects variability rather than introducing systematic drift in this interaction term.</p>
<p>Even when nuisance models are imperfect, their contribution does not appear to enter at first order in these simulations. Instead, the interaction term shrinks with sample size, behaving like a second-order quantity. In other words, misspecification increases noise, but does not induce systematic drift.</p>
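<p>One way to probe the &#8220;second-order&#8221; claim empirically is to regress <span class="math inline">\(\log(\text{sd})\)</span> on <span class="math inline">\(\log(n)\)</span>: a slope near <span class="math inline">\(-1\)</span> indicates <span class="math inline">\(1/n\)</span> behavior, while a slope near <span class="math inline">\(-1/2\)</span> indicates ordinary root-<span class="math inline">\(n\)</span> behavior. A sketch, using mock data constructed so the sd scales like <span class="math inline">\(1/n\)</span> as a stand-in for one scenario of <code>res</code>:</p>
<pre>library(data.table)

set.seed(1)

# Mock stand-in for one scenario of `res`: T2 drawn so its sd scales like 1/n
mock = data.table(n = rep(c(100, 250, 750, 1000), each = 200))
mock[, T2 := rnorm(.N, 0, 30 / n)]

# Slope of log(sd) against log(n); close to -1 here by construction
sds = mock[, .(sd = sd(T2)), keyby = n]
coef(lm(log(sd) ~ log(n), data = sds))[2]</pre>
<p>Applied to the real <code>res</code> (grouping by scenario as well as <code>n</code>), the same two lines would give the empirical shrink rate for each nuisance model scenario.</p>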
</div>
<div id="next-steps" class="section level3">
<h3>Next steps</h3>
<p>The fact that the interaction shrinks does not guarantee that it becomes small relative to sampling error. How quickly it shrinks depends on how accurately the nuisance models are estimated. Orthogonality ensures that nuisance errors do not affect the estimator directly, but only through their interaction. Still, it does not force the empirical influence function equation to hold in finite samples. Without guarantees on the rate at which the nuisance models improve, the remaining discrepancy may still affect inference. That’s where TMLE fits in.</p>
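<p>To make the role of rates concrete: with cross-fitting, the interaction term satisfies <span class="math inline">\((P_n - P_0)(\phi_{\hat{P}} - \phi_{P_0}) = O_P\!\left(n^{-1/2}\,\lVert \phi_{\hat{P}} - \phi_{P_0} \rVert\right)\)</span>, so it is <span class="math inline">\(o_P(n^{-1/2})\)</span> whenever the estimated influence function is <span class="math inline">\(L_2\)</span>-consistent, and faster nuisance convergence makes it vanish faster still.</p>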
<p>In the next post, I will look more squarely at TMLE, specifically considering how the targeting step is designed to make the efficient influence function equation hold and to ensure that the remaining discrepancy behaves like sampling noise rather than model-driven bias.</p>
<p><small><font color="darkkhaki">
Reference: Van der Laan, Mark J., and Sherri Rose. <em>Targeted Learning: Causal Inference for Observational and Experimental Data</em>. Vol. 4. New York: Springer, 2011.
</font></small></p>
</div>
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://www.rdatagen.net/post/2026-03-03-getting-to-the-bottom-of-tmle-simulating-the-orthogonality/"> ouR data generation</a></strong>.</div>
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/03/getting-to-the-bottom-of-tmle-the-almost-vanishing-nuisance-interaction/">Getting to the bottom of TMLE: the (almost) vanishing nuisance interaction</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">399502</post-id>	</item>
	</channel>
</rss>
