<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>R-bloggers</title>
	<atom:link href="https://www.r-bloggers.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.r-bloggers.com</link>
	<description>R news and tutorials contributed by hundreds of R bloggers</description>
	<lastBuildDate>Tue, 12 May 2026 13:00:00 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=5.5.18</generator>

<image>
	<url>https://i0.wp.com/www.r-bloggers.com/wp-content/uploads/2016/08/cropped-R_single_01-200.png?fit=32%2C32&#038;ssl=1</url>
	<title>R-bloggers</title>
	<link>https://www.r-bloggers.com</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">11524731</site>	<item>
		<title>Durations of wars by @ellis2013nz</title>
		<link>https://www.r-bloggers.com/2026/05/durations-of-wars-by-ellis2013nz/</link>
		
		<dc:creator><![CDATA[free range statistics - R]]></dc:creator>
		<pubDate>Tue, 12 May 2026 13:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://freerangestats.info/blog/2026/05/13/war-durations</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> How long do wars last, on average? If a war such as that currently under way in Iran has lasted 74 days so far, how long do we expect it to last in total? For all sorts of reasons, inquiring minds are interested. Luckily there are some very well curate...</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/durations-of-wars-by-ellis2013nz/">Durations of wars by @ellis2013nz</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://freerangestats.info/blog/2026/05/13/war-durations"> free range statistics - R</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p>How long do wars last, on average? If a war such as that currently under way in Iran has lasted 74 days so far, how long do we expect it to last in total? For all sorts of reasons, inquiring minds are interested. Luckily there are some very well curated datasets out there, including the <a href="https://correlatesofwar.org/data-sets/cow-war/" rel="nofollow" target="_blank">Correlates of War</a>, that make it easy to answer these questions.</p>

<p>One caveat applies to all of this: I am not a military historian, just an interested amateur. I’m very open to having mistakes of interpretation or method pointed out to me.</p>

<h2 id="distribution-of-wars-durations">Distribution of wars’ durations</h2>

<p>The Correlates of War data lets us see, for example, that this is the distribution (on a logarithmic scale) of durations of wars post-Napoleon:</p>

<object type="image/svg+xml" data="https://freerangestats.info/img/0321-density.svg" width="450"><img src="https://i1.wp.com/freerangestats.info/img/0321-density.png?w=450&#038;ssl=1" data-recalc-dims="1" /></object>

<p>You can see I’ve compared this to a log-normal distribution and found that it doesn’t have quite such fat tails. But that’s OK; I’m not too worried about the precise shape, because later on I’ll be using pretty straightforward empirical methods.</p>

<p>This data covers only inter-state wars, in contrast to intra-state wars (eg civil wars) and extra-state wars (eg against external non-state actors). As I’m interested in a reference population to compare the current USA-Israel-Iran war to, it’s the inter-state population I want.</p>

<p>The median length of a war is 139 days and the mean is 408 days.</p>

<p>The four-day war in the dataset is the so-called “<a href="https://en.wikipedia.org/wiki/Football_War" rel="nofollow" target="_blank">Football War</a>” of 1969 between Honduras and El Salvador. The 3,734-day war was the much better-known “Vietnam War Phase II”, involving the USA, Australia, Vietnam, Cambodia and others.</p>

<p>Here’s the code to import the data from the Correlates of War project and draw that first density plot:</p>

<figure class="highlight"><pre>library(tidyverse)
library(lubridate)
library(janitor)
library(glue)
library(ggrepel)
library(scales)

# https://correlatesofwar.org/data-sets/cow-war/


#----- import interstate war data----------------------

interstate &lt;- read_csv(&quot;https://correlatesofwar.org/wp-content/uploads/Inter-StateWarData_v4.0.csv&quot;) |&gt; 
  clean_names() |&gt; 
  mutate(start_date = as.Date(sprintf(&quot;%04d-%02d-%02d&quot;, start_year1, start_month1, start_day1)),
         end_date = as.Date(sprintf(&quot;%04d-%02d-%02d&quot;, end_year1, end_month1, end_day1)))

interstate_wars &lt;- interstate |&gt; 
  group_by(war_num, war_name) |&gt; 
  summarise(earliest_start= min(start_date),
            latest_end = max(end_date),
            bat_death = sum(bat_death)) |&gt; 
  mutate(duration = as.numeric(latest_end - earliest_start),
         start_year = year(earliest_start)) |&gt; 
  ungroup()

# what years covered? 1823 to 2003 at time of writing
range(interstate_wars$start_year)

#==========================plots=================
 
simple_caption &lt;- &quot;Source: Correlates of War, Inter-State War Data; analysis by freerangestats.info&quot;

#-----------------distribution of duration------------
summary(interstate_wars$duration)

sim_norm &lt;- data.frame(duration = 10 ^ (rnorm(1e6, 
                                        mean = mean(log10(interstate_wars$duration)), 
                                        sd = sd(log10(interstate_wars$duration)))))

interstate_wars |&gt; 
  ggplot(aes(x = duration)) +
  geom_density() +
  geom_rug() +
  geom_density(data = sim_norm, colour = &quot;orange&quot;) +
  annotate(&quot;text&quot;, x= 1, y = 0.18, label = &quot;Simulated log-normal distribution&quot;, 
           colour = &quot;orange&quot;, hjust = 0) +
  annotate(&quot;text&quot;, x= 300, y = 0.51, label = &quot;Empirical distribution of war durations&quot;, 
           colour = &quot;black&quot;, hjust = 0) +
  # carefully chosen labels for x axis:
  scale_x_log10(label = comma, breaks = c(range(interstate_wars$duration), 10, 100, 1000)) +
  labs(x = &quot;Duration of wars (in days, logarithmic scale)&quot;,
       y = &quot;Density&quot;,
       title = &quot;Distribution of war durations, 1823 to 2003&quot;,
       subtitle = &quot;More concentrated, less-fat tails than a log-normal distribution&quot;,
       caption = simple_caption) +
  # use coord to limit x axis so statistical calculations are all done on full data:
  coord_cartesian(xlim = c(1, 8000))</pre></figure>

<p>OK, so my main analytical task here is to work out the conditional expected duration of a war that has reached 74 days &#8211; the length so far of the USA-Israel-Iran war. Yes, I know there’s an incompletely observed ceasefire, but there’s also a blockade (or two), and that’s unambiguously an act of war under international law. So I’m counting the war as ongoing.</p>

<p>My chart to answer this question is this one:</p>

<object type="image/svg+xml" data="https://freerangestats.info/img/0321-cumulative-distribution.svg" width="450"><img src="https://i1.wp.com/freerangestats.info/img/0321-cumulative-distribution.png?w=450&#038;ssl=1" data-recalc-dims="1" /></object>

<p>What’s happening here is:</p>

<ul>
  <li>the empirical cumulative distribution function of durations is the dark line &#8211; basically the cumulative frequency on the vertical axis, but expressed as a proportion.</li>
  <li>the grey line is a simple LOESS smoother of that cumulative frequency, useful for modelling values that aren’t exactly matched in the data.</li>
  <li>the red lines show the duration of the current war, and where it would fit in the distribution of 1823 to 2003 wars. It’s about 0.33 (defined in the code below as the variable <code>current_cf</code>), meaning that the current war is already longer than about 33% of wars.</li>
  <li>the horizontal blue line sits half way, in the vertical space, between the horizontal red line and 1. The point where it meets the smoothed line, marked by the vertical blue line dropping to the x axis, shows the expected median duration of a war that has already reached this 0.33 point on the cumulative frequency.</li>
</ul>

<p>So we see that, of wars that last as long as 74 days, we expect the median total length to be 261 days. That’s a bit grim for those of us who think that even extending into June is going to be very bad indeed for the world economy, but it’s good to know. Of course, there are plenty of wars that get to 74 days and then stop soon after, so there’s hope there too.</p>
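<p>As a rough cross-check on the idea (though not on the exact numbers), the conditional median can also be approximated without any smoothing, by taking the empirical median of total durations among wars that lasted at least 74 days. Here is a minimal sketch with made-up durations, not the Correlates of War data:</p>

```r
# Hypothetical durations in days, purely for illustration
durations <- c(4, 30, 74, 100, 139, 200, 261, 408, 1000, 3734)

current_dur <- 74

# Empirical conditional median: the median total duration among
# wars that lasted at least current_dur days
conditional_median <- median(durations[durations >= current_dur])
conditional_median
```

<p>The LOESS-based approach below smooths this same quantity, which helps when the conditioning value falls between durations actually observed in the data.</p>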

<p>Here’s the code to do that bit of statistical inference and draw the chart:</p>

<figure class="highlight"><pre>#-------------------cumulative distribution--------------
interstate_cumulative &lt;- interstate_wars |&gt; 
  arrange(duration) |&gt; 
  mutate(cumulative_freq = 1:n() / n()) 

# smoothed model of the cumulative distribution, including estimates of where
# the Iran war is on it:
model &lt;- loess(cumulative_freq ~ log(duration), data = interstate_cumulative)
current_dur &lt;- 74 # as at 13 May 2026 - war started 28 February 2026
current_cf &lt;- predict(model, newdata = data.frame(duration = current_dur))

# inverse model to estimate duration given a cumulative frequency, useful for
# annotations on the chart:
inv_model &lt;- loess(duration ~ x, 
                   data = data.frame(duration = interstate_cumulative$duration, 
                                     x = fitted(model)))

# of wars that last this long, what is the median cumulative frequency (i.e. half-way to 1):
conditional_median_freq &lt;- (1 + current_cf) / 2
# of wars with that median cumulative frequency, convert it back into a duration,
conditional_median_dur &lt;- predict(inv_model, data.frame(x = conditional_median_freq))

# Draw chart of cumulative distribution:
interstate_cumulative |&gt; 
  ggplot(aes(x = duration, y = cumulative_freq)) +
  geom_smooth(method = &quot;loess&quot;, colour = &quot;grey80&quot;) +
  geom_line() +
  # note that (a bit oddly) the scale transform has to be applied manually in geom_segment here:
  geom_segment(x = log10(current_dur), xend = log10(current_dur), y = -Inf, yend = current_cf, colour = &quot;red&quot;) +
  geom_segment(x = 0, xend = log10(current_dur), y = current_cf, yend = current_cf, colour = &quot;red&quot;) +
  geom_segment(x = log10(conditional_median_dur), xend = log10(conditional_median_dur), y = -Inf, yend = conditional_median_freq, colour = &quot;blue&quot;) +
  geom_segment(x = 0, xend = log10(conditional_median_dur), y = conditional_median_freq, yend = conditional_median_freq, colour = &quot;blue&quot;) +
  
  annotate(&quot;text&quot;, x = current_dur * 0.95, y = 0.39, label = &quot;Current Iran war&quot;, colour = &quot;red&quot;, hjust = 1) +
  annotate(&quot;text&quot;, x = conditional_median_dur * 1.05, y = 0.62, colour = &quot;blue&quot;, hjust = 0, vjust = 1, 
           label = glue(&quot;Median expectation conditional 
on at least {current_dur} days&quot;)) +
  scale_x_log10(label = comma, breaks = c(10, current_dur, 100, conditional_median_dur, 1000)) +
  labs(x = &quot;Total duration of war (in days, logarithmic scale)&quot;,
       y = &quot;Cumulative frequency of wars&quot;,
       title = &quot;Expectations of duration of Iran war, based on modern inter-state wars' duration&quot;,
       subtitle = glue(&quot;Comparison to wars from 1823 to 2003. The median war that lasts {current_dur} days goes on to last {round(conditional_median_dur)} days.&quot;),
       caption = simple_caption)</pre></figure>

<p>We can use the same approach to calculate not just the median war duration (conditional on getting to 74 days) but other percentiles too. For example, below we construct an 80% prediction interval (between the 0.1 and 0.9 quantiles) for total duration, running from 94.9 to 1,752 days. To put this another way, from this 74-day point, only 10% of wars will have a total duration of 94.9 days or less (ie another 21 days).</p>

<p>All up, that’s a big range, of course; the main thing it tells us is that wars last longer than many people would like, and that there’s big variation in their durations.</p>

<figure class="highlight"><pre># some prediction intervals, conditional on getting to 74 days:
probs &lt;- c(0.05, 0.1, 0.5, 0.8, 0.9, 0.95)
more_freqs &lt;- probs * (1 - current_cf) + current_cf
conditional_dur &lt;- predict(inv_model, data.frame(x = more_freqs))
tibble(probability = probs, duration = conditional_dur)
# so 80% of wars that reach 74 days will have a total duration between 95 and 1,752 days</pre></figure>

<pre>  probability duration
        &lt;dbl&gt;    &lt;dbl&gt;
1        0.05     82.3
2        0.1      94.9
3        0.5     261. 
4        0.8    1141. 
5        0.9    1752. 
6        0.95   2119. 
</pre>

<h2 id="duration-and-other-factors">Duration and other factors</h2>

<p>So I’d answered my main question but I was naturally curious about some other relationships too. Obviously one expects longer wars to have more deaths in battle; can we see this in the data? Yes we can:</p>

<object type="image/svg+xml" data="https://freerangestats.info/img/0321-duration-deaths.svg" width="450"><img src="https://i1.wp.com/freerangestats.info/img/0321-duration-deaths.png?w=450&#038;ssl=1" data-recalc-dims="1" /></object>

<p>I like this chart for presenting the scale of nearly two centuries of inter-state war in one easy visualisation.</p>

<p>We also see that if there’s a pattern in the relationship between duration, deaths and when the war started (the starting year is mapped to colour in the chart above), it’s not an obvious one. We’ll come back to that in the next chart, but first, here’s the code to create the scatter plot above.</p>

<figure class="highlight"><pre>#------------------Compare duration and number of deaths----------------
interstate_wars |&gt; 
  ggplot(aes(x = duration, y = bat_death, label = war_name)) +
  geom_point(aes(colour = start_year), size = 3.5) +
  geom_text_repel(colour = &quot;grey50&quot;, size = 2, seed = 123) +
  scale_y_log10(label = comma) +
  scale_x_log10(label = comma) +
  scale_colour_viridis_c() +
  labs(title = &quot;Inter-state wars, 1823-2003&quot;,
       colour = &quot;Starting year&quot;,
       x = &quot;Duration in days&quot;,
       y = &quot;Number of battle deaths&quot;,
       caption = simple_caption) +
  theme(legend.position = c(0.15, 0.8))</pre></figure>

<p>I was a bit worried about that “two centuries” thing. Are recent wars all much shorter, or perhaps much longer, than older wars? If so, it would be a big limitation on my inference about likely war length. So I prepared one more plot to check whether there was an obvious relationship, more rigorously than just eye-balling colour on the previous plot. I was a bit surprised to see that there is actually no real growth or reduction in war duration over time:</p>

<object type="image/svg+xml" data="https://freerangestats.info/img/0321-duration-history.svg" width="450"><img src="https://i0.wp.com/freerangestats.info/img/0321-duration-history.png?w=450&#038;ssl=1" data-recalc-dims="1" /></object>

<p>I also quite like this chart for giving us an instant comparison of our current USA-Israel-Iran war with some of those in history. We can see that it is already longer than the Boxer Rebellion, but not quite as long as the Falkland Islands or the War for Kosovo (for all of these names I am using those provided by the Correlates of War project - I’m well aware that these are contested labels).</p>

<p>Here’s my final chunk of code drawing that last chart:</p>

<figure class="highlight"><pre>#------------Compare duration with when in history it happened---------------
interstate_wars |&gt; 
  arrange(bat_death) |&gt; 
  ggplot(aes(x = earliest_start, y = duration)) +
  geom_hline(yintercept = current_dur, colour = &quot;red&quot;) +
  geom_point(aes(size = bat_death), shape = 1) +
  geom_text_repel(aes(label = war_name), colour = &quot;steelblue&quot;, size = 3, seed = 123) +
  annotate(&quot;text&quot;, x= as.Date(&quot;1820-01-01&quot;), y = current_dur + 8, hjust = 0,
           label = &quot;Duration of 2026 US-Israel-Iran war so far&quot;, colour = &quot;red&quot;) +
  scale_y_log10(label = comma) +
  scale_size_area(label = comma, max_size = 25) +
  labs(title = &quot;Inter-state wars, 1823-2003&quot;,
       subtitle = glue(&quot;Compared to the USA-Israel-Iran war as at {Sys.Date()}&quot;),
       x = &quot;Start of war&quot;,
       y = &quot;Duration of war (days)&quot;,
       size = &quot;Number of battle deaths:&quot;,
       caption = simple_caption)</pre></figure>

<p>That’s all folks. Stay safe out there.</p>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://freerangestats.info/blog/2026/05/13/war-durations"> free range statistics - R</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/durations-of-wars-by-ellis2013nz/">Durations of wars by @ellis2013nz</a>]]></content:encoded>
					
		

		<post-id xmlns="com-wordpress:feed-additions:1">401218</post-id>	</item>
		<item>
		<title>Learning Data Science: Why a High R^2 Can Be Misleading</title>
		<link>https://www.r-bloggers.com/2026/05/learning-data-science-why-a-high-r2-can-be-misleading/</link>
		
		<dc:creator><![CDATA[Learning Machines]]></dc:creator>
		<pubDate>Mon, 11 May 2026 15:09:30 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://blog.ephorie.de/?p=7048</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> A high can make a regression model look impressively accurate — but this number can be deceptive. If you want to understand why a high is not always a sign of a good model, read on! In the post, Learning Data Science: Modelling Basics, we built a simple model to predict ...</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/learning-data-science-why-a-high-r2-can-be-misleading/">Learning Data Science: Why a High R^2 Can Be Misleading</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://blog.ephorie.de/learning-data-science-why-a-high-r2-can-be-misleading?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=learning-data-science-why-a-high-r2-can-be-misleading"> R-Bloggers – Learning Machines</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p><img loading="lazy" fetchpriority="high" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/uploads/2020/01/discount-1015451_1280-e1568068884649-300x277.jpg?resize=300%2C277&#038;ssl=1" alt="" width="300" height="277" class="alignleft size-medium wp-image-2386" srcset_temp="https://i1.wp.com/blog.ephorie.de/wp-content/uploads/2020/01/discount-1015451_1280-e1568068884649-300x277.jpg?resize=300%2C277&#038;ssl=1 300w, https://blog.ephorie.de/wp-content/uploads/2020/01/discount-1015451_1280-e1568068884649-768x709.jpg 768w, https://blog.ephorie.de/wp-content/uploads/2020/01/discount-1015451_1280-e1568068884649-840x776.jpg 840w, https://blog.ephorie.de/wp-content/uploads/2020/01/discount-1015451_1280-e1568068884649.jpg 1081w" sizes="(max-width: 300px) 85vw, 300px" data-recalc-dims="1" /></p>
<p>A high <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> can make a regression model look impressively accurate — but this number can be deceptive. If you want to understand why a high <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> is not always a sign of a good model, read on!</p>
<p><span id="more-7048"></span></p>
<p>In the post, <a href="https://blog.ephorie.de/learning-data-science-modelling-basics" rel="nofollow" target="_blank">Learning Data Science: Modelling Basics</a>, we built a simple model to predict income from age. R printed a model summary containing something called <code>R-squared</code>, but we did not yet discuss what that value actually means.</p>
<p>At first sight, a high <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> looks highly reassuring. In our example, the linear model achieved an <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> close to 90%. That sounds impressive.</p>
<p>However, just as high classification accuracy can be misleading — as discussed in <a href="https://blog.ephorie.de/zeror-the-simplest-possible-classifier-or-why-high-accuracy-can-be-misleading" rel="nofollow" target="_blank">ZeroR: The Simplest Possible Classifier, or Why High Accuracy can be Misleading</a> — a high <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> can also create a false sense of confidence.</p>
<p>To understand why, it helps to examine the formula itself and then revisit the three models from the previous post: the <em>mean model</em>, the <em>linear model</em>, and the <em>polynomial model</em>.</p>
<hr />
<h2>The Meaning of <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/></h2>
<p>The coefficient of determination is defined as:</p>
<p><img loading="lazy" decoding="async" src="https://i2.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-00e6249ac935aca67c46d1437aa02ef0_l3.png?resize=144%2C31&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2 = 1 - \frac{\sum (y_i-\hat y_i)^2}{\sum (y_i-\bar y)^2}" title="Rendered by QuickLaTeX.com" height="31" width="144" style="vertical-align: -11px;" data-recalc-dims="1"/></p>
<p>At first glance, the formula appears intimidating, but its basic idea is relatively simple.</p>
<p>The denominator</p>
<p><img loading="lazy" decoding="async" src="https://i2.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-2c3d465528e95ffd2ac7eee3afdd8fc0_l3.png?resize=149%2C20&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="SS_{tot} = \sum (y_i-\bar y)^2" title="Rendered by QuickLaTeX.com" height="20" width="149" style="vertical-align: -5px;" data-recalc-dims="1"/></p>
<p>measures the <em>total variation in the target variable</em>. It quantifies how strongly the observed values differ from their mean.</p>
<p>The numerator</p>
<p><img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-61749266165fa0e8fe29b5c6c993ee17_l3.png?resize=156%2C20&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="SS_{res} = \sum (y_i-\hat y_i)^2" title="Rendered by QuickLaTeX.com" height="20" width="156" style="vertical-align: -5px;" data-recalc-dims="1"/></p>
<p>measures the <em>remaining unexplained error after fitting the model</em>.</p>
<p>Thus, <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> measures the <em>proportion of variation explained by the model</em>.</p>
<p>An <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> of:</p>
<ul>
<li>0 means the model explains none of the variation,</li>
<li>1 means the model explains all variation perfectly.</li>
</ul>
<p>This sounds straightforward enough. The difficulty is that perfectly explaining the observed data is not necessarily the same thing as building a useful predictive model.</p>
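<p>The definition is easy to check by hand in R. The sketch below uses made-up age/income numbers (not necessarily the data from the earlier post) and compares the manual calculation with the value that <code>summary()</code> reports:</p>

```r
# Illustrative data: income loosely increasing with age (hypothetical values)
age    <- c(21, 28, 35, 46, 55)
income <- c(1800, 1850, 2230, 2500, 2560)

fit <- lm(income ~ age)

# R^2 from its definition: 1 - SS_res / SS_tot
ss_res <- sum((income - fitted(fit))^2)
ss_tot <- sum((income - mean(income))^2)
r2 <- 1 - ss_res / ss_tot

# The manual value matches the "Multiple R-squared" reported by summary()
all.equal(r2, summary(fit)$r.squared)
```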
<hr />
<h2>The Mean Model</h2>
<p>Let us begin with the simplest possible regression model.</p>
<p>Suppose we completely ignore age and simply predict the average income for every individual:</p>
<p><img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-6e7456d883a3a0982c0da3efb13957f8_l3.png?resize=47%2C16&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="\hat y_i = \bar y" title="Rendered by QuickLaTeX.com" height="16" width="47" style="vertical-align: -4px;" data-recalc-dims="1"/></p>
<p>This is effectively the regression equivalent of ZeroR. The model does not learn any relationship at all.</p>
<p>In this case:</p>
<p><img loading="lazy" decoding="async" src="https://i0.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-8b14fd9ed1319341f6f2f7ffa5ad52d6_l3.png?resize=119%2C16&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="y_i - \hat y_i = y_i - \bar y" title="Rendered by QuickLaTeX.com" height="16" width="119" style="vertical-align: -4px;" data-recalc-dims="1"/></p>
<p>Therefore, the residual sum of squares becomes identical to the total sum of squares:</p>
<p><img loading="lazy" decoding="async" src="https://i0.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-dad01239a88490f46b0c50e3d31d2e2e_l3.png?resize=199%2C20&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="\sum (y_i-\hat y_i)^2 = \sum (y_i-\bar y)^2" title="Rendered by QuickLaTeX.com" height="20" width="199" style="vertical-align: -5px;" data-recalc-dims="1"/></p>
<p>Substituting this into the formula gives:</p>
<p><img loading="lazy" decoding="async" src="https://i2.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-201f649c3db6193ecc7d5d6ec2cf3dab_l3.png?resize=145%2C24&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2 = 1 - \frac{SS_{tot}}{SS_{tot}} = 0" title="Rendered by QuickLaTeX.com" height="24" width="145" style="vertical-align: -8px;" data-recalc-dims="1"/></p>
<p>The model explains none of the variation in the data.</p>
<p>This corresponds to the <em>underfitting</em> case discussed previously: the model is too simple to capture the underlying structure.</p>
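<p>This is easy to verify directly: in R the mean model is just an intercept-only regression, and the reported <code>R-squared</code> comes out as zero. A minimal sketch with hypothetical income values:</p>

```r
# Illustrative income values (hypothetical)
income <- c(1800, 1850, 2230, 2500, 2560)

# Intercept-only "mean model": it predicts mean(income) for every observation
mean_model <- lm(income ~ 1)

unique(as.numeric(fitted(mean_model)))  # the single prediction: mean(income)
summary(mean_model)$r.squared           # 0 (up to numerical noise)
```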
<hr />
<h2>The Polynomial Model</h2>
<p>Now consider the opposite extreme.</p>
<p>Instead of fitting a straight line, suppose we fit a polynomial of sufficiently high degree. In fact, if we have <img loading="lazy" decoding="async" src="https://i0.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-ec4217f4fa5fcd92a9edceba0e708cf7_l3.png?resize=11%2C8&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="n" title="Rendered by QuickLaTeX.com" height="8" width="11" style="vertical-align: 0px;" data-recalc-dims="1"/> observations with distinct age values, a polynomial of degree up to <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-3fd905b384548c9de7011828b88081d5_l3.png?resize=40%2C12&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="n-1" title="Rendered by QuickLaTeX.com" height="12" width="40" style="vertical-align: 0px;" data-recalc-dims="1"/> can pass exactly through all observed data points: </p>
<p><img loading="lazy" decoding="async" src="https://i2.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-2ba542bb2156501d745b39f95647edf6_l3.png?resize=291%2C19&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="y = a_0 + a_1x + a_2x^2 + \dots + a_{n-1}x^{n-1}" title="Rendered by QuickLaTeX.com" height="19" width="291" style="vertical-align: -4px;" data-recalc-dims="1"/></p>
<p>In that case:</p>
<p><img loading="lazy" decoding="async" src="https://i2.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-3723adf0d61a6d5758ddf7bbbe0865d5_l3.png?resize=52%2C16&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="y_i = \hat y_i" title="Rendered by QuickLaTeX.com" height="16" width="52" style="vertical-align: -4px;" data-recalc-dims="1"/></p>
<p>for all observations, implying:</p>
<p><img loading="lazy" decoding="async" src="https://i0.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-67e17d4eacf7a7e6995c47e968ca1464_l3.png?resize=123%2C20&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="\sum (y_i-\hat y_i)^2 = 0" title="Rendered by QuickLaTeX.com" height="20" width="123" style="vertical-align: -5px;" data-recalc-dims="1"/></p>
<p>and therefore:</p>
<p><img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-449b1f5b7fe03b52c0d2a080f98ea2ed_l3.png?resize=53%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2 = 1" title="Rendered by QuickLaTeX.com" height="15" width="53" style="vertical-align: 0px;" data-recalc-dims="1"/></p>
<p>The model achieves a perfect fit.</p>
<p>At first sight, this appears ideal. In practice, however, such a model often performs poorly on unseen data because it has adapted itself not only to the underlying relationship, but also to random fluctuations and noise within the training data.</p>
<p>This is the classical <em>overfitting</em> problem.</p>
<p>A perfect <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> may therefore indicate not a particularly good model, but a model that has become too flexible.</p>
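<p>This, too, can be verified with a short R sketch (again with simulated data and hypothetical names): a polynomial of degree <em>n</em> - 1 fitted to <em>n</em> points reproduces them exactly.</p>
<pre># Illustrative sketch: a degree n-1 polynomial interpolates all n training points
set.seed(42)
n &lt;- 10
x &lt;- 1:n
y &lt;- 2 * x + rnorm(n)

poly_model &lt;- lm(y ~ poly(x, n - 1))  # as many coefficients as observations
summary(poly_model)$r.squared         # 1, up to numerical precision</pre>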
<hr />
<h2>The Linear Model</h2>
<p>The linear model from the previous post lies between these two extremes.</p>
<p>It is simple enough to avoid memorizing every random fluctuation, yet flexible enough to capture a meaningful trend in the data.</p>
<p>This balance between simplicity and flexibility is one of the central themes in statistical learning.</p>
<p>The idea was summarized in the previous post with the following plot:</p>
<p><img loading="lazy" decoding="async" src="https://i2.wp.com/blog.ephorie.de/wp-content/uploads/2019/02/mb4.png?w=450&#038;ssl=1" alt="" class="aligncenter size-full wp-image-554" srcset_temp="https://i2.wp.com/blog.ephorie.de/wp-content/uploads/2019/02/mb4.png?w=450&#038;ssl=1 534w, https://blog.ephorie.de/wp-content/uploads/2019/02/mb4-300x231.png 300w" sizes="auto, (max-width: 534px) 85vw, 534px" data-recalc-dims="1" /></p>
<p>and by the famous observation attributed to George Box:</p>
<blockquote><p>
“All models are wrong, but some are useful.”
</p></blockquote>
<p>The objective in modelling is therefore not to maximize complexity or maximize <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/>, but to find a model that generalizes well beyond the observed sample.</p>
<hr />
<h2>Why <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> Alone Is Insufficient</h2>
<p>The key limitation of <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> is that it evaluates fit on the observed data only.</p>
<p>It does not directly measure:</p>
<ul>
<li>predictive performance on unseen data,</li>
<li>robustness,</li>
<li>causal validity, or</li>
<li>generalization ability.</li>
</ul>
<p>As model complexity increases, <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> almost always increases as well. A sufficiently flexible model can often achieve values very close to 1 even when its predictions on new data are poor.</p>
<p>For this reason, practical data science relies on additional evaluation methods such as:</p>
<ul>
<li>train-test splits,</li>
<li>cross-validation,</li>
<li>regularization,</li>
<li>adjusted <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/>, and</li>
<li>out-of-sample testing.</li>
</ul>
<p>The goal is not to reproduce historical observations perfectly, but to construct models that remain useful when confronted with new data.</p>
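<p>A simple train-test split illustrates the point (simulated data, hypothetical names): the more flexible model fits the training sample more closely, but its out-of-sample R<sup>2</sup> tends to be worse.</p>
<pre># Illustrative sketch: flexibility helps in-sample, not necessarily out-of-sample
set.seed(42)
x &lt;- runif(60)
y &lt;- sin(2 * pi * x) + rnorm(60, sd = 0.3)
train &lt;- 1:40
test  &lt;- 41:60

r2 &lt;- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

fit_simple   &lt;- lm(y ~ poly(x, 3),  subset = train)
fit_flexible &lt;- lm(y ~ poly(x, 20), subset = train)

r2(y[test], predict(fit_simple,   data.frame(x = x[test])))
r2(y[test], predict(fit_flexible, data.frame(x = x[test])))  # typically much lower</pre>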
<p>A high <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> can therefore mean two very different things:</p>
<ul>
<li>the model has identified a genuine structure,</li>
<li>or the model has merely adapted itself too closely to the training data.</li>
</ul>
<p>Distinguishing between these possibilities is one of the central challenges of machine learning and statistical modelling.</p>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://blog.ephorie.de/learning-data-science-why-a-high-r2-can-be-misleading?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=learning-data-science-why-a-high-r2-can-be-misleading"> R-Bloggers – Learning Machines</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/learning-data-science-why-a-high-r2-can-be-misleading/">Learning Data Science: Why a High R^2 Can Be Misleading</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">401170</post-id>	</item>
		<item>
		<title>How to Build an Expected Goals (xG) Model in R with worldfootballR</title>
		<link>https://www.r-bloggers.com/2026/05/how-to-build-an-expected-goals-xg-model-in-r-with-worldfootballr/</link>
		
		<dc:creator><![CDATA[rprogrammingbooks]]></dc:creator>
		<pubDate>Sat, 09 May 2026 23:13:30 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://rprogrammingbooks.com/?p=2554</guid>

					<description><![CDATA[<p>Expected goals has become one of the most important concepts in modern football analytics. Instead of judging a team only by goals scored, xG helps us estimate the quality of the chances created. In this tutorial, we will build a practical expected goals model in R using football data, feature ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/how-to-build-an-expected-goals-xg-model-in-r-with-worldfootballr/">How to Build an Expected Goals (xG) Model in R with worldfootballR</a>]]></description>
										<content:encoded><![CDATA[

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://rprogrammingbooks.com/expected-goals-model-r-worldfootballr/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=expected-goals-model-r-worldfootballr"> Blog - R Programming Books</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>

<p><strong>Expected goals</strong> has become one of the most important concepts in modern football analytics. Instead of judging a team only by goals scored, xG helps us estimate the quality of the chances created. In this tutorial, we will build a practical expected goals model in R using football data, feature engineering, logistic regression, model evaluation, and visualization.</p>

<p>This is a hands-on guide for analysts who want to move beyond simple football statistics and start building reproducible soccer analytics workflows in R.</p>

<h2>What Is Expected Goals?</h2>

<p>Expected goals, usually written as xG, measures the probability that a shot becomes a goal. A shot from two meters in front of goal will usually have a high xG value, while a long-range shot from outside the box will usually have a low xG value.</p>

<p>An xG model can use variables such as:</p>

<ul>
  <li>Shot distance</li>
  <li>Shot angle</li>
  <li>Body part used</li>
  <li>Game state</li>
  <li>Minute of the match</li>
  <li>Shot type</li>
  <li>Set-piece situation</li>
  <li>Home or away context</li>
</ul>

<p>In this post, we will build a clean starter model using R. You can later extend it with richer event data, tracking data, or more advanced machine learning models.</p>

<h2>Install and Load R Packages</h2>

<pre># Core data science packages
install.packages(c(
  &quot;tidyverse&quot;,
  &quot;ggplot2&quot;,
  &quot;dplyr&quot;,
  &quot;readr&quot;,
  &quot;janitor&quot;,
  &quot;broom&quot;,
  &quot;yardstick&quot;,
  &quot;rsample&quot;,
  &quot;pROC&quot;,
  &quot;patchwork&quot;
))

# Football data package
install.packages(&quot;worldfootballR&quot;)

library(tidyverse)
library(ggplot2)
library(dplyr)
library(readr)
library(janitor)
library(broom)
library(yardstick)
library(rsample)
library(pROC)
library(patchwork)
library(worldfootballR)</pre>

<h2>Create a Simple Shot Dataset</h2>

<p>Different public football data sources structure shot data differently. To make this tutorial reproducible, we will first create a synthetic shot dataset that behaves like real football event data. Later, you can replace this with your own data from FBref, StatsBomb open data, Wyscout-style exports, or custom event feeds.</p>

<pre>set.seed(123)

n_shots &lt;- 5000

shots &lt;- tibble(
  shot_id = 1:n_shots,
  player = sample(
    c(&quot;Player A&quot;, &quot;Player B&quot;, &quot;Player C&quot;, &quot;Player D&quot;, &quot;Player E&quot;),
    n_shots,
    replace = TRUE
  ),
  team = sample(
    c(&quot;Team Red&quot;, &quot;Team Blue&quot;, &quot;Team Green&quot;, &quot;Team White&quot;),
    n_shots,
    replace = TRUE
  ),
  minute = sample(1:95, n_shots, replace = TRUE),
  x_location = runif(n_shots, min = 70, max = 120),
  y_location = runif(n_shots, min = 0, max = 80),
  body_part = sample(
    c(&quot;Right Foot&quot;, &quot;Left Foot&quot;, &quot;Header&quot;, &quot;Other&quot;),
    n_shots,
    replace = TRUE,
    prob = c(0.43, 0.32, 0.20, 0.05)
  ),
  situation = sample(
    c(&quot;Open Play&quot;, &quot;Corner&quot;, &quot;Free Kick&quot;, &quot;Penalty&quot;, &quot;Counter Attack&quot;),
    n_shots,
    replace = TRUE,
    prob = c(0.68, 0.12, 0.08, 0.03, 0.09)
  ),
  home_away = sample(c(&quot;Home&quot;, &quot;Away&quot;), n_shots, replace = TRUE)
)

glimpse(shots)</pre>

<h2>Engineer Shot Distance and Angle</h2>

<p>Distance and angle are two of the most important features in a basic xG model. We will assume the goal is centered at x = 120 and y = 40.</p>

<pre>goal_x &lt;- 120
goal_y &lt;- 40

shots &lt;- shots %&gt;%
  mutate(
    distance_to_goal = sqrt(
      (goal_x - x_location)^2 + (goal_y - y_location)^2
    ),
    angle_to_goal = atan2(
      abs(goal_y - y_location),
      goal_x - x_location
    ),
    angle_degrees = angle_to_goal * 180 / pi
  )

shots %&gt;%
  select(shot_id, x_location, y_location, distance_to_goal, angle_degrees) %&gt;%
  head()</pre>

<h2>Create a Goal Outcome</h2>

<p>For demonstration, we will simulate goals using realistic football logic. Shots closer to goal should be more likely to become goals. Penalties should have higher probability. Headers and long-range attempts should usually be harder.</p>

<pre>shots &lt;- shots %&gt;%
  mutate(
    linear_probability =
      -2.8 -
      0.08 * distance_to_goal +
      0.025 * angle_degrees +
      if_else(body_part == &quot;Header&quot;, -0.35, 0) +
      if_else(body_part == &quot;Other&quot;, -0.60, 0) +
      if_else(situation == &quot;Penalty&quot;, 3.00, 0) +
      if_else(situation == &quot;Counter Attack&quot;, 0.35, 0) +
      if_else(situation == &quot;Free Kick&quot;, -0.45, 0),
    
    goal_probability = plogis(linear_probability),
    goal = rbinom(n(), size = 1, prob = goal_probability)
  )

shots %&gt;%
  summarise(
    total_shots = n(),
    total_goals = sum(goal),
    conversion_rate = mean(goal)
  )</pre>

<h2>Explore the Shot Data</h2>

<pre>shots %&gt;%
  count(body_part, goal) %&gt;%
  group_by(body_part) %&gt;%
  mutate(rate = n / sum(n))
shots %&gt;%
  group_by(situation) %&gt;%
  summarise(
    shots = n(),
    goals = sum(goal),
    conversion_rate = mean(goal),
    avg_distance = mean(distance_to_goal),
    .groups = &quot;drop&quot;
  ) %&gt;%
  arrange(desc(conversion_rate))</pre>

<h2>Visualize Shot Locations</h2>

<pre>ggplot(shots, aes(x = x_location, y = y_location, color = factor(goal))) +
  geom_point(alpha = 0.35) +
  coord_fixed() +
  labs(
    title = &quot;Shot Map&quot;,
    x = &quot;Pitch Length&quot;,
    y = &quot;Pitch Width&quot;,
    color = &quot;Goal&quot;
  ) +
  theme_minimal()</pre>

<h2>Split Data into Training and Testing Sets</h2>

<pre>set.seed(123)

shot_split &lt;- initial_split(shots, prop = 0.80, strata = goal)

train_data &lt;- training(shot_split)
test_data  &lt;- testing(shot_split)

nrow(train_data)
nrow(test_data)</pre>

<h2>Build a Logistic Regression xG Model</h2>

<p>Expected goals is naturally suited to logistic regression because the outcome is binary: goal or no goal.</p>

<pre>xg_model &lt;- glm(
  goal ~ distance_to_goal +
    angle_degrees +
    body_part +
    situation +
    home_away +
    minute,
  data = train_data,
  family = binomial()
)

summary(xg_model)</pre>

<h2>Convert Model Output into xG Values</h2>

<pre>test_predictions &lt;- test_data %&gt;%
  mutate(
    xg = predict(xg_model, newdata = test_data, type = &quot;response&quot;)
  )

test_predictions %&gt;%
  select(player, team, goal, xg, distance_to_goal, angle_degrees) %&gt;%
  head(10)</pre>

<h2>Evaluate the xG Model</h2>

<p>A good xG model should not only predict goals, but also produce well-calibrated probabilities. If 100 shots each have an xG of 0.10, we would expect roughly 10 goals over a large enough sample.</p>

<pre>test_predictions %&gt;%
  summarise(
    actual_goals = sum(goal),
    expected_goals = sum(xg),
    avg_xg = mean(xg),
    actual_conversion = mean(goal)
  )</pre>

<h3>ROC AUC</h3>

<pre>roc_obj &lt;- roc(
  response = test_predictions$goal,
  predictor = test_predictions$xg
)

auc(roc_obj)
plot(
  roc_obj,
  main = &quot;ROC Curve for xG Model&quot;
)</pre>

<h3>Brier Score</h3>

<pre>brier_score &lt;- mean((test_predictions$xg - test_predictions$goal)^2)

brier_score</pre>

<h2>Create xG Buckets for Calibration</h2>

<pre>calibration_table &lt;- test_predictions %&gt;%
  mutate(
    xg_bucket = cut(
      xg,
      breaks = seq(0, 1, by = 0.05),
      include.lowest = TRUE
    )
  ) %&gt;%
  group_by(xg_bucket) %&gt;%
  summarise(
    shots = n(),
    avg_xg = mean(xg),
    actual_goal_rate = mean(goal),
    goals = sum(goal),
    .groups = &quot;drop&quot;
  ) %&gt;%
  filter(shots &gt;= 10)

calibration_table
ggplot(calibration_table, aes(x = avg_xg, y = actual_goal_rate)) +
  geom_point(size = 3) +
  geom_abline(intercept = 0, slope = 1, linetype = &quot;dashed&quot;) +
  labs(
    title = &quot;xG Model Calibration&quot;,
    x = &quot;Average Predicted xG&quot;,
    y = &quot;Actual Goal Rate&quot;
  ) +
  theme_minimal()</pre>

<h2>Player-Level xG Analysis</h2>

<p>Once every shot has an xG value, we can aggregate by player. This allows us to compare goals, expected goals, overperformance, and shot volume.</p>

<pre>player_xg &lt;- test_predictions %&gt;%
  group_by(player) %&gt;%
  summarise(
    shots = n(),
    goals = sum(goal),
    xg_per_shot = mean(xg),      # compute before xg is overwritten by the total below
    conversion_rate = mean(goal),
    xg = sum(xg),
    goals_minus_xg = goals - xg,
    .groups = &quot;drop&quot;
  ) %&gt;%
  arrange(desc(xg))

player_xg
ggplot(player_xg, aes(x = reorder(player, xg), y = xg)) +
  geom_col() +
  coord_flip() +
  labs(
    title = &quot;Expected Goals by Player&quot;,
    x = &quot;Player&quot;,
    y = &quot;Total xG&quot;
  ) +
  theme_minimal()</pre>

<h2>Team-Level xG Analysis</h2>

<pre>team_xg &lt;- test_predictions %&gt;%
  group_by(team) %&gt;%
  summarise(
    shots = n(),
    goals = sum(goal),
    avg_xg_per_shot = mean(xg),  # compute before xg is overwritten by the total below
    xg = sum(xg),
    goals_minus_xg = goals - xg,
    .groups = &quot;drop&quot;
  ) %&gt;%
  arrange(desc(xg))

team_xg
ggplot(team_xg, aes(x = reorder(team, goals_minus_xg), y = goals_minus_xg)) +
  geom_col() +
  coord_flip() +
  labs(
    title = &quot;Goals Minus xG by Team&quot;,
    x = &quot;Team&quot;,
    y = &quot;Goals - Expected Goals&quot;
  ) +
  theme_minimal()</pre>

<h2>Shot Quality Distribution</h2>

<pre>ggplot(test_predictions, aes(x = xg)) +
  geom_histogram(bins = 40) +
  labs(
    title = &quot;Distribution of Shot Quality&quot;,
    x = &quot;Expected Goals&quot;,
    y = &quot;Number of Shots&quot;
  ) +
  theme_minimal()</pre>

<h2>Compare Goals and xG by Situation</h2>

<pre>situation_xg &lt;- test_predictions %&gt;%
  group_by(situation) %&gt;%
  summarise(
    shots = n(),
    goals = sum(goal),
    avg_xg = mean(xg),           # compute before xg is overwritten by the total below
    xg = sum(xg),
    .groups = &quot;drop&quot;
  ) %&gt;%
  arrange(desc(avg_xg))

situation_xg
situation_long &lt;- situation_xg %&gt;%
  select(situation, goals, xg) %&gt;%
  pivot_longer(
    cols = c(goals, xg),
    names_to = &quot;metric&quot;,
    values_to = &quot;value&quot;
  )

ggplot(situation_long, aes(x = reorder(situation, value), y = value, fill = metric)) +
  geom_col(position = &quot;dodge&quot;) +
  coord_flip() +
  labs(
    title = &quot;Goals vs Expected Goals by Situation&quot;,
    x = &quot;Situation&quot;,
    y = &quot;Value&quot;,
    fill = &quot;Metric&quot;
  ) +
  theme_minimal()</pre>

<h2>Build a More Advanced xG Model with Interactions</h2>

<p>A simple model is useful, but football is full of interactions. For example, distance may affect headers differently than footed shots. We can include interaction terms in the model.</p>

<pre>xg_model_interaction &lt;- glm(
  goal ~ distance_to_goal * body_part +
    angle_degrees +
    situation +
    home_away +
    minute,
  data = train_data,
  family = binomial()
)

summary(xg_model_interaction)
test_predictions_interaction &lt;- test_data %&gt;%
  mutate(
    xg_interaction = predict(
      xg_model_interaction,
      newdata = test_data,
      type = &quot;response&quot;
    )
  )

mean((test_predictions_interaction$xg_interaction - test_predictions_interaction$goal)^2)</pre>

<h2>Compare Two xG Models</h2>

<pre>model_comparison &lt;- tibble(
  model = c(&quot;Basic Logistic Regression&quot;, &quot;Interaction Logistic Regression&quot;),
  brier_score = c(
    mean((test_predictions$xg - test_predictions$goal)^2),
    mean((test_predictions_interaction$xg_interaction - test_predictions_interaction$goal)^2)
  ),
  total_predicted_goals = c(
    sum(test_predictions$xg),
    sum(test_predictions_interaction$xg_interaction)
  ),
  actual_goals = c(
    sum(test_predictions$goal),
    sum(test_predictions_interaction$goal)
  )
)

model_comparison</pre>

<h2>Create a Reusable xG Prediction Function</h2>

<pre>predict_xg &lt;- function(model, new_shots) {
  new_shots %&gt;%
    mutate(
      predicted_xg = predict(
        model,
        newdata = new_shots,
        type = &quot;response&quot;
      )
    )
}

new_predictions &lt;- predict_xg(xg_model, test_data)

head(new_predictions)</pre>

<h2>Create a Custom Shot Example</h2>

<pre>custom_shot &lt;- tibble(
  distance_to_goal = 12,
  angle_degrees = 28,
  body_part = &quot;Right Foot&quot;,
  situation = &quot;Open Play&quot;,
  home_away = &quot;Home&quot;,
  minute = 62
)

predict(
  xg_model,
  newdata = custom_shot,
  type = &quot;response&quot;
)</pre>

<h2>Use worldfootballR for Real Football Workflows</h2>

<p>For real projects, you can use packages such as <code>worldfootballR</code> to collect football data from public sources and build reproducible analysis pipelines. The exact available columns depend on the source and endpoint, so always inspect your data before modeling.</p>

<pre>library(worldfootballR)
library(tidyverse)

# Example: get FBref match results
# Adjust country, gender, season_end_year, and tier depending on your project

premier_league_results &lt;- fb_match_results(
  country = &quot;ENG&quot;,
  gender = &quot;M&quot;,
  season_end_year = 2025,
  tier = &quot;1st&quot;
)

glimpse(premier_league_results)
premier_league_results %&gt;%
  clean_names() %&gt;%
  head()</pre>

<p>If you are building a full football analytics pipeline with FBref, Transfermarkt, and Understat-style workflows, a more structured project template can save a lot of time. I cover that type of end-to-end workflow in <a href="https://rprogrammingbooks.com/product/mastering-football-data-worldfootballr/" rel="nofollow" target="_blank">Mastering Football Data with worldfootballR</a>, especially for readers who want reusable R scripts, clean folders, and practical football data examples.</p>

<h2>Example: Clean Match Results Data</h2>

<pre>clean_results &lt;- premier_league_results %&gt;%
  clean_names()

clean_results %&gt;%
  glimpse()
# Example structure will depend on the returned data
# Always check column names first

names(clean_results)</pre>

<h2>Build a Match-Level Team Summary</h2>

<pre># This is an example pattern.
# You may need to adjust column names depending on your data source.

team_summary_example &lt;- clean_results %&gt;%
  summarise(
    matches = n()
  )

team_summary_example</pre>

<h2>Save Your xG Model</h2>

<p>Once you have trained a model, save it so you can reuse it later in reports, dashboards, APIs, or automated pipelines.</p>

<pre>saveRDS(xg_model, &quot;xg_model_logistic_regression.rds&quot;)

loaded_xg_model &lt;- readRDS(&quot;xg_model_logistic_regression.rds&quot;)

predict(
  loaded_xg_model,
  newdata = custom_shot,
  type = &quot;response&quot;
)</pre>

<h2>Create an xG Report Table</h2>

<pre>xg_report &lt;- test_predictions %&gt;%
  group_by(team, player) %&gt;%
  summarise(
    shots = n(),
    goals = sum(goal),
    xg_per_shot = round(mean(xg), 3),    # compute before xg is overwritten below
    xg = round(sum(xg), 2),
    goals_minus_xg = round(goals - xg, 2),
    .groups = &quot;drop&quot;
  ) %&gt;%
  arrange(desc(xg))

xg_report
write_csv(xg_report, &quot;xg_player_report.csv&quot;)</pre>

<h2>Create an xG Shot Map</h2>

<pre>ggplot(test_predictions, aes(x = x_location, y = y_location)) +
  geom_point(aes(size = xg, alpha = xg)) +
  coord_fixed() +
  labs(
    title = &quot;xG Shot Map&quot;,
    x = &quot;Pitch Length&quot;,
    y = &quot;Pitch Width&quot;,
    size = &quot;xG&quot;,
    alpha = &quot;xG&quot;
  ) +
  theme_minimal()</pre>

<h2>Create a High-Value Chances Table</h2>

<pre>big_chances &lt;- test_predictions %&gt;%
  filter(xg &gt;= 0.30) %&gt;%
  arrange(desc(xg)) %&gt;%
  select(
    player,
    team,
    minute,
    body_part,
    situation,
    distance_to_goal,
    angle_degrees,
    xg,
    goal
  )

big_chances %&gt;%
  head(20)</pre>

<h2>Model Improvement Ideas</h2>

<p>This starter xG model can be improved in many ways. A professional football analytics workflow may include:</p>

<ul>
  <li>More accurate shot coordinates</li>
  <li>Goalkeeper position</li>
  <li>Defender pressure</li>
  <li>Pass type before the shot</li>
  <li>Through balls and cutbacks</li>
  <li>Shot speed</li>
  <li>First-time shots</li>
  <li>Game state</li>
  <li>Team strength</li>
  <li>Player finishing history</li>
</ul>

<h2>Train an XGBoost-Style Model Later</h2>

<p>Logistic regression is interpretable and a good starting point. For higher predictive performance, you can later compare it with random forests, gradient boosting, or Bayesian models.</p>

<pre># Example packages for future model upgrades
# install.packages(c(&quot;xgboost&quot;, &quot;ranger&quot;, &quot;tidymodels&quot;))

library(tidymodels)

# A future tidymodels workflow could look like this:

xg_recipe &lt;- recipe(
  goal ~ distance_to_goal + angle_degrees + body_part + situation + home_away + minute,
  data = train_data %&gt;%
    mutate(goal = factor(goal, levels = c(0, 1)))  # classification outcomes must be factors
) %&gt;%
  step_dummy(all_nominal_predictors()) %&gt;%
  step_normalize(all_numeric_predictors())

xg_recipe</pre>

<h2>Build a Tidymodels Logistic Regression Workflow</h2>

<pre>logistic_spec &lt;- logistic_reg() %&gt;%
  set_engine(&quot;glm&quot;) %&gt;%
  set_mode(&quot;classification&quot;)

xg_workflow &lt;- workflow() %&gt;%
  add_recipe(xg_recipe) %&gt;%
  add_model(logistic_spec)

xg_fit &lt;- fit(
  xg_workflow,
  data = train_data %&gt;%
    mutate(goal = factor(goal, levels = c(0, 1)))
)

xg_fit
tidy(xg_fit)</pre>

<h2>Predict Probabilities with Tidymodels</h2>

<pre>tidy_predictions &lt;- predict(
  xg_fit,
  new_data = test_data,
  type = &quot;prob&quot;
) %&gt;%
  bind_cols(test_data %&gt;% mutate(goal = factor(goal, levels = c(0, 1))))

head(tidy_predictions)
tidy_predictions %&gt;%
  roc_auc(
    truth = goal,
    .pred_1,
    event_level = &quot;second&quot;  # the positive class (goal = 1) is the second factor level
  )</pre>

<h2>Turn xG into Match Insights</h2>

<p>The real value of expected goals is not just predicting whether one shot becomes a goal. The value comes from aggregation. Once every shot has a probability, you can create match-level and season-level insights.</p>

<pre>match_shots &lt;- test_predictions %&gt;%
  mutate(
    match_id = sample(1:100, n(), replace = TRUE)
  )

match_xg &lt;- match_shots %&gt;%
  group_by(match_id, team) %&gt;%
  summarise(
    shots = n(),
    goals = sum(goal),
    xg = sum(xg),
    .groups = &quot;drop&quot;
  )

match_xg %&gt;%
  arrange(match_id, desc(xg)) %&gt;%
  head(20)</pre>

<h2>Find Teams Creating Better Chances</h2>

<pre>team_chance_quality &lt;- test_predictions %&gt;%
  group_by(team) %&gt;%
  summarise(
    shots = n(),
    total_xg = sum(xg),
    avg_xg_per_shot = mean(xg),
    big_chances = sum(xg &gt;= 0.30),
    low_quality_shots = sum(xg &lt;= 0.05),
    .groups = &quot;drop&quot;
  ) %&gt;%
  arrange(desc(avg_xg_per_shot))

team_chance_quality</pre>

<h2>Final Thoughts</h2>

<p>Building an expected goals model in R is one of the best ways to learn football analytics because it combines data cleaning, feature engineering, statistical modeling, visualization, and interpretation. A simple logistic regression model can already teach you a lot about shot quality, player performance, and team attacking style.</p>

<p>From here, the next steps are clear: use richer football data, improve your features, compare different models, evaluate calibration, and build repeatable workflows that can be updated every week during the season.</p>

<p>Expected goals is not the final answer to football analysis, but it is one of the best starting points for serious soccer data science in R.</p>
<p>The post <a href="https://rprogrammingbooks.com/expected-goals-model-r-worldfootballr/" rel="nofollow" target="_blank">How to Build an Expected Goals (xG) Model in R with worldfootballR</a> appeared first on <a href="https://rprogrammingbooks.com/" rel="nofollow" target="_blank">R Programming Books</a>.</p>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://rprogrammingbooks.com/expected-goals-model-r-worldfootballr/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=expected-goals-model-r-worldfootballr"> Blog - R Programming Books</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/how-to-build-an-expected-goals-xg-model-in-r-with-worldfootballr/">How to Build an Expected Goals (xG) Model in R with worldfootballR</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">401149</post-id>	</item>
		<item>
		<title>One interface, (Almost) Every Classifier (and Regressor): unifiedml v0.3.0</title>
		<link>https://www.r-bloggers.com/2026/05/one-interface-almost-every-classifier-and-regressor-unifiedml-v0-3-0/</link>
		
		<dc:creator><![CDATA[T. Moudiki]]></dc:creator>
		<pubDate>Sat, 09 May 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://thierrymoudiki.github.io//blog/2026/05/09/r/New-UnifiedML</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> News from the R package unifiedml, which offers a unified interface to R machine learning models</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/one-interface-almost-every-classifier-and-regressor-unifiedml-v0-3-0/">One interface, (Almost) Every Classifier (and Regressor): unifiedml v0.3.0</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://thierrymoudiki.github.io//blog/2026/05/09/r/New-UnifiedML"> T. Moudiki's Webpage - R</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report an issue with the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p>In the new version of <a href="https://cran.r-project.org/web/packages/unifiedml/index.html" rel="nofollow" target="_blank"><code>unifiedml</code></a> available on CRAN, you can benchmark different models using k-fold cross-validation (section 1 of this blog post), and there’s a unified interface for predicting model probabilities (section 2 of this blog post).</p>

<pre>install.packages(&quot;unifiedml&quot;)

install.packages(c(&quot;e1071&quot;, &quot;randomForest&quot;, &quot;caret&quot;))

install.packages(&quot;glmnet&quot;)

library(unifiedml)
</pre>

<h1 id="1---benchmarking-models">1 &#8211; Benchmarking models</h1>

<pre>set.seed(123)

X &lt;- iris[, 1:4]
y &lt;- iris$Species

models &lt;- list( # `Model` is exported from package 'unifiedml'
  glm  = Model$new(caret::train), # caret can be used (see https://topepo.github.io/caret/available-models.html)
  rf   = Model$new(randomForest::randomForest), # or a native pkg
  svm  = Model$new(e1071::svm) # or another pkg
)

params &lt;- list(
  glm = list(method = &quot;glmnet&quot;,
             tuneGrid = data.frame(alpha = 0, lambda = 0.01), # for caret models, all hyperparameters must be provided
             trControl = caret::trainControl(method = &quot;none&quot;)),
  rf  = list(ntree = 150), # no need to specify all hyperparameters
  svm = list(kernel = &quot;radial&quot;,
             cost = 1,
             gamma = 0.1)
)

results &lt;- unifiedml::benchmark(models, X, y, cv = 5, params = params)

[1/3] Fitting model: glm
Mean CV score for glm: 0.9533

[2/3] Fitting model: rf
Mean CV score for rf: 0.9600

[3/3] Fitting model: svm
Mean CV score for svm: 0.9733

print(results) # 5-fold cross-validation results

$glm
$glm$avg_score
[1] 0.9533333

$glm$scores
    fold1     fold2     fold3     fold4     fold5 
0.9333333 0.9666667 0.9333333 0.9333333 1.0000000 


$rf
$rf$avg_score
[1] 0.96

$rf$scores
    fold1     fold2     fold3     fold4     fold5 
0.9333333 1.0000000 0.9333333 0.9333333 1.0000000 


$svm
$svm$avg_score
[1] 0.9733333

$svm$scores
    fold1     fold2     fold3     fold4     fold5 
0.9666667 1.0000000 0.9666667 0.9333333 1.0000000 

# collect the fold scores into a long data frame for plotting
model_vec &lt;- c()
fold_vec  &lt;- c()
score_vec &lt;- c()

for (model in names(results)) {
  scores &lt;- results[[model]]$scores

  model_vec &lt;- c(model_vec, rep(model, length(scores)))
  fold_vec  &lt;- c(fold_vec, names(scores))
  score_vec &lt;- c(score_vec, as.numeric(scores))
}

df &lt;- data.frame(
  model = model_vec,
  fold  = fold_vec,
  score = score_vec
)

library(ggplot2)

ggplot(df, aes(x = model, y = score, fill = model)) +
  geom_violin(trim = FALSE, alpha = 0.6) +
  geom_jitter(width = 0.08, size = 2) +
  theme_minimal() +
  labs(
    title = &quot;Cross-validation score distribution&quot;,
    x = &quot;Model&quot;,
    y = &quot;Score&quot;
  ) +
  theme(legend.position = &quot;none&quot;)
</pre>

<p><img src="https://i0.wp.com/thierrymoudiki.github.io/images/2026-05-09/2026-05-09-New-UnifiedML_9_0.png?w=578&#038;ssl=1" alt="image-title-here" class="img-responsive" data-recalc-dims="1" /></p>

<h1 id="2---unified-interface-for-predicting-probabilities">2 &#8211; Unified interface for predicting probabilities</h1>

<pre># Load required packages
library(unifiedml)
library(randomForest)
library(nnet)
library(e1071)

# Load iris dataset
data(iris)

# Setup reproducible data
set.seed(42)

# Create feature matrix (all 4 numeric features)
X &lt;- as.matrix(iris[, 1:4])
colnames(X) &lt;- c(&quot;Sepal.Length&quot;, &quot;Sepal.Width&quot;, &quot;Petal.Length&quot;, &quot;Petal.Width&quot;)

# Target: Species (multi-class with 3 levels)
y_multiclass &lt;- iris$Species

# Create binary classification target (Versicolor vs others)
y_binary &lt;- factor(
  ifelse(iris$Species == &quot;versicolor&quot;, &quot;versicolor&quot;, &quot;other&quot;),
  levels = c(&quot;other&quot;, &quot;versicolor&quot;)
)

# Split into train/test (75% train, 25% test)
set.seed(42)
train_idx &lt;- sample(1:nrow(X), size = floor(0.75 * nrow(X)), replace = FALSE)
test_idx &lt;- setdiff(1:nrow(X), train_idx)

X_train &lt;- X[train_idx, ]
X_test &lt;- X[test_idx, ]
y_train_multiclass &lt;- y_multiclass[train_idx]
y_test_multiclass &lt;- y_multiclass[test_idx]
y_train_binary &lt;- y_binary[train_idx]
y_test_binary &lt;- y_binary[test_idx]

cat(&quot;\n&quot;)
cat(&quot;============================================================================\n&quot;)
cat(&quot;IRIS DATASET - Summary\n&quot;)
cat(&quot;============================================================================\n&quot;)
cat(sprintf(&quot;Training samples: %d\n&quot;, nrow(X_train)))
cat(sprintf(&quot;Test samples: %d\n&quot;, nrow(X_test)))
cat(sprintf(&quot;Features: %d\n&quot;, ncol(X_train)))
cat(sprintf(&quot;Classes: %s\n&quot;, paste(levels(y_multiclass), collapse = &quot;, &quot;)))

# ============================================================================
# EXAMPLE 1: randomForest - Multi-class Classification on IRIS
# ============================================================================

cat(&quot;\n&quot;)
cat(&quot;============================================================================\n&quot;)
cat(&quot;EXAMPLE 1: randomForest - Multi-class Classification\n&quot;)
cat(&quot;============================================================================\n&quot;)

mod_rf &lt;- Model$new(randomForest::randomForest)
mod_rf$fit(X_train, y_train_multiclass, ntree = 100)

cat(&quot;\nPredicting probabilities for first 5 test samples:\n&quot;)
probs_rf &lt;- mod_rf$predict_proba(X_test[1:5, ])

cat(&quot;\nProbability matrix:\n&quot;)
print(round(probs_rf, 3))

cat(&quot;\nInterpretation:\n&quot;)
for(i in 1:5) {
  cat(sprintf(&quot;\nSample %d (Actual: %s):\n&quot;, i, as.character(y_test_multiclass[i])))
  cat(sprintf(&quot;  setosa:     %.1f%%\n&quot;, probs_rf[i, &quot;setosa&quot;] * 100))
  cat(sprintf(&quot;  versicolor: %.1f%%\n&quot;, probs_rf[i, &quot;versicolor&quot;] * 100))
  cat(sprintf(&quot;  virginica:  %.1f%%\n&quot;, probs_rf[i, &quot;virginica&quot;] * 100))
  cat(sprintf(&quot;  Predicted:  %s\n&quot;, colnames(probs_rf)[which.max(probs_rf[i, ])]))
}

# Get class predictions
pred_classes_rf &lt;- mod_rf$predict(X_test[1:5, ], type = &quot;class&quot;)
cat(&quot;\nPredicted classes (first 5):&quot;, as.character(pred_classes_rf), &quot;\n&quot;)
cat(&quot;Actual classes (first 5):   &quot;, as.character(y_test_multiclass[1:5]), &quot;\n&quot;)

# Calculate accuracy on full test set
probs_all_rf &lt;- mod_rf$predict_proba(X_test)
pred_all_rf &lt;- colnames(probs_all_rf)[apply(probs_all_rf, 1, which.max)]
accuracy_rf &lt;- mean(pred_all_rf == as.character(y_test_multiclass))
cat(sprintf(&quot;\nTest set accuracy: %.1f%%\n&quot;, accuracy_rf * 100))

# ============================================================================
# EXAMPLE 2: nnet - Multi-class Classification on IRIS
# ============================================================================

cat(&quot;\n&quot;)
cat(&quot;============================================================================\n&quot;)
cat(&quot;EXAMPLE 2: nnet - Multi-class Classification\n&quot;)
cat(&quot;============================================================================\n&quot;)

mod_nnet &lt;- Model$new(nnet::nnet)
mod_nnet$fit(X_train, y_train_multiclass, size = 10, maxit = 200, trace = FALSE)

cat(&quot;\nPredicting probabilities for first 5 test samples:\n&quot;)
probs_nnet &lt;- mod_nnet$predict_proba(X_test[1:5, ])

cat(&quot;\nProbability matrix (all 3 classes):\n&quot;)
print(round(probs_nnet, 3))

cat(&quot;\nDetailed predictions:\n&quot;)
for(i in 1:5) {
  cat(sprintf(&quot;\nSample %d (Actual: %s):\n&quot;, i, as.character(y_test_multiclass[i])))
  cat(sprintf(&quot;  setosa:     %.1f%%\n&quot;, probs_nnet[i, &quot;setosa&quot;] * 100))
  cat(sprintf(&quot;  versicolor: %.1f%%\n&quot;, probs_nnet[i, &quot;versicolor&quot;] * 100))
  cat(sprintf(&quot;  virginica:  %.1f%%\n&quot;, probs_nnet[i, &quot;virginica&quot;] * 100))
  cat(sprintf(&quot;  Predicted:  %s\n&quot;, colnames(probs_nnet)[which.max(probs_nnet[i, ])]))
}

# Get class predictions
pred_classes_nnet &lt;- mod_nnet$predict(X_test[1:5, ], type = &quot;class&quot;)
cat(&quot;\nPredicted classes (first 5):&quot;, as.character(pred_classes_nnet), &quot;\n&quot;)
cat(&quot;Actual classes (first 5):   &quot;, as.character(y_test_multiclass[1:5]), &quot;\n&quot;)

# Calculate accuracy
probs_all_nnet &lt;- mod_nnet$predict_proba(X_test)
pred_all_nnet &lt;- colnames(probs_all_nnet)[apply(probs_all_nnet, 1, which.max)]
accuracy_nnet &lt;- mean(pred_all_nnet == as.character(y_test_multiclass))
cat(sprintf(&quot;\nTest set accuracy: %.1f%%\n&quot;, accuracy_nnet * 100))

# ============================================================================
# EXAMPLE 3: SVM - Multi-class Classification on IRIS
# ============================================================================

cat(&quot;\n&quot;)
cat(&quot;============================================================================\n&quot;)
cat(&quot;EXAMPLE 3: SVM - Multi-class Classification\n&quot;)
cat(&quot;============================================================================\n&quot;)

mod_svm &lt;- Model$new(e1071::svm)
mod_svm$fit(X_train, y_train_multiclass, probability = TRUE, kernel = &quot;radial&quot;)

cat(&quot;\nPredicting probabilities for first 5 test samples:\n&quot;)
probs_svm &lt;- mod_svm$predict_proba(X_test[1:5, ])

cat(&quot;\nProbability matrix:\n&quot;)
print(round(probs_svm, 4))

cat(&quot;\nDetailed predictions:\n&quot;)
for(i in 1:5) {
  cat(sprintf(&quot;\nSample %d (Actual: %s):\n&quot;, i, as.character(y_test_multiclass[i])))
  cat(sprintf(&quot;  setosa:     %.1f%%\n&quot;, probs_svm[i, &quot;setosa&quot;] * 100))
  cat(sprintf(&quot;  versicolor: %.1f%%\n&quot;, probs_svm[i, &quot;versicolor&quot;] * 100))
  cat(sprintf(&quot;  virginica:  %.1f%%\n&quot;, probs_svm[i, &quot;virginica&quot;] * 100))
  cat(sprintf(&quot;  Predicted:  %s\n&quot;, colnames(probs_svm)[which.max(probs_svm[i, ])]))
}

# Calculate accuracy
probs_all_svm &lt;- mod_svm$predict_proba(X_test)
pred_all_svm &lt;- colnames(probs_all_svm)[apply(probs_all_svm, 1, which.max)]
accuracy_svm &lt;- mean(pred_all_svm == as.character(y_test_multiclass))
cat(sprintf(&quot;\nTest set accuracy: %.1f%%\n&quot;, accuracy_svm * 100))

============================================================================
IRIS DATASET - Summary
============================================================================
Training samples: 112
Test samples: 38
Features: 4
Classes: setosa, versicolor, virginica

============================================================================
EXAMPLE 1: randomForest - Multi-class Classification
============================================================================

Predicting probabilities for first 5 test samples:

Probability matrix:
  setosa versicolor virginica
1      1          0         0
2      1          0         0
3      1          0         0
4      1          0         0
5      1          0         0
attr(,&quot;assign&quot;)
[1] 1 1 1
attr(,&quot;contrasts&quot;)
attr(,&quot;contrasts&quot;)$pred
[1] &quot;contr.treatment&quot;

attr(,&quot;extraction_method&quot;)
[1] &quot;fallback::1&quot;
attr(,&quot;model_class&quot;)
[1] &quot;randomForest.formula&quot;

Interpretation:

Sample 1 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 2 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 3 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 4 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 5 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Predicted classes (first 5): setosa setosa setosa setosa setosa 
Actual classes (first 5):    setosa setosa setosa setosa setosa 

Test set accuracy: 94.7%

============================================================================
EXAMPLE 2: nnet - Multi-class Classification
============================================================================

Predicting probabilities for first 5 test samples:

Probability matrix (all 3 classes):
  setosa versicolor virginica
1      1          0         0
2      1          0         0
3      1          0         0
4      1          0         0
5      1          0         0
attr(,&quot;extraction_method&quot;)
[1] &quot;fallback::5&quot;
attr(,&quot;model_class&quot;)
[1] &quot;nnet.formula&quot;

Detailed predictions:

Sample 1 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 2 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 3 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 4 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 5 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Predicted classes (first 5): setosa setosa setosa setosa setosa 
Actual classes (first 5):    setosa setosa setosa setosa setosa 

Test set accuracy: 97.4%

============================================================================
EXAMPLE 3: SVM - Multi-class Classification
============================================================================

Predicting probabilities for first 5 test samples:

Probability matrix:
  setosa versicolor virginica
1      1          0         0
2      1          0         0
3      1          0         0
4      1          0         0
5      1          0         0
attr(,&quot;assign&quot;)
[1] 1 1 1
attr(,&quot;contrasts&quot;)
attr(,&quot;contrasts&quot;)$pred
[1] &quot;contr.treatment&quot;

attr(,&quot;extraction_method&quot;)
[1] &quot;fallback::1&quot;
attr(,&quot;model_class&quot;)
[1] &quot;svm.formula&quot;

Detailed predictions:

Sample 1 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 2 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 3 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 4 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 5 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Test set accuracy: 94.7%
</pre>


<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://thierrymoudiki.github.io//blog/2026/05/09/r/New-UnifiedML"> T. Moudiki's Webpage - R</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/one-interface-almost-every-classifier-and-regressor-unifiedml-v0-3-0/">One interface, (Almost) Every Classifier (and Regressor): unifiedml v0.3.0</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">401135</post-id>	</item>
		<item>
		<title>Edge detection in Python</title>
		<link>https://www.r-bloggers.com/2026/05/edge-detection-in-python/</link>
		
		<dc:creator><![CDATA[Francisco de Abreu e Lima]]></dc:creator>
		<pubDate>Fri, 08 May 2026 19:56:22 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">http://poissonisfish.com/?p=10082</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> Great strides in artificial intelligence development during the last five years produced agents that are now commonplace at work and home. It is humbling to note that virtually all frontier large language models today trace back to a preprint introducing the transformer neural network architecture – a fifteen-page paper that profoundly ...</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/edge-detection-in-python/">Edge detection in Python</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/"> poissonisfish</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report an issue with the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>

<figure class="wp-block-image"><img data-attachment-id="10046" data-permalink="https://poissonisfish.com/?attachment_id=10046" data-orig-file="https://i0.wp.com/poissonisfish.com/wp-content/uploads/2026/05/butterfly_canny.png?w=578&#038;ssl=1" data-orig-size="2756,1824" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="butterfly_canny" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/poissonisfish.com/wp-content/uploads/2026/05/butterfly_canny.png?w=578&#038;ssl=1?w=1024" src="https://i0.wp.com/poissonisfish.com/wp-content/uploads/2026/05/butterfly_canny.png?w=578&#038;ssl=1" alt="" class="wp-image-10046" data-recalc-dims="1" /><figcaption class="wp-element-caption">Edge detection is ubiquitous in animal vision and yet poorly understood. Canny edge detection on <em>Polygonia c-album</em> (Portugal, 2010)</figcaption></figure>



<p class="wp-block-paragraph">Great strides in artificial intelligence development over the last five years have produced agents that are now commonplace at work and at home. It is humbling to note that virtually all frontier large language models today trace back to a preprint introducing the transformer neural network architecture<sup data-fn="563b7add-8b04-4fd2-a688-2383895c42c9" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#563b7add-8b04-4fd2-a688-2383895c42c9" id="563b7add-8b04-4fd2-a688-2383895c42c9-link" rel="nofollow" target="_blank">1</a></sup> – a fifteen-page paper that profoundly rocked the world through waves of excitement and angst.</p>



<p class="wp-block-paragraph">This paradigm shift in model design has also heavily influenced computer vision, leading to a surge in vision-language models (VLMs). Not only can such systems easily generalize across tasks such as segmentation, depth estimation and image generation or editing<sup data-fn="bd3dda1f-27f7-44e9-8d0f-9a42c28ed201" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#bd3dda1f-27f7-44e9-8d0f-9a42c28ed201" id="bd3dda1f-27f7-44e9-8d0f-9a42c28ed201-link" rel="nofollow" target="_blank">2</a></sup>, they have also blown legacy models out of the water in object detection benchmarks, with little to no fine-tuning<sup data-fn="63b9bf52-e78a-405a-bc33-5082dc51f74e" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#63b9bf52-e78a-405a-bc33-5082dc51f74e" id="63b9bf52-e78a-405a-bc33-5082dc51f74e-link" rel="nofollow" target="_blank">3</a></sup>.</p>



<p class="wp-block-paragraph">However, it should not be lightly assumed that the transformer architecture is the only path forward to a more meaningful, cost-effective or even better-performing AI – not when we are still having <a href="https://techcrunch.com/2024/08/27/why-ai-cant-spell-strawberry/" rel="nofollow" target="_blank">trouble counting “r” in the word <em>strawberry</em></a>. Neuromorphic computation<sup data-fn="9740d4a1-35a7-4003-9e9f-63a4fa16b90b" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#9740d4a1-35a7-4003-9e9f-63a4fa16b90b" id="9740d4a1-35a7-4003-9e9f-63a4fa16b90b-link" rel="nofollow" target="_blank">4</a></sup>, photonic neural networks<sup data-fn="58f9bd6d-ec82-4b9b-8285-5f8b083184ad" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#58f9bd6d-ec82-4b9b-8285-5f8b083184ad" id="58f9bd6d-ec82-4b9b-8285-5f8b083184ad-link" rel="nofollow" target="_blank">5</a></sup>, JEPA<sup data-fn="7b848804-3070-4e9f-bc5f-ea3e60f0bf14" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#7b848804-3070-4e9f-bc5f-ea3e60f0bf14" id="7b848804-3070-4e9f-bc5f-ea3e60f0bf14-link" rel="nofollow" target="_blank">6</a></sup> and many other techniques have recently shown us different ways to design and implement intelligent systems that produce optimal solutions for a variety of problems.</p>



<p class="wp-block-paragraph">Today I want to focus on a topic from a timeless book that inspired me to think differently and, particularly, to effectively apply a first principles approach to problem-solving. The topic is edge detection, and that book is <em>Vision</em>, by David Marr<sup data-fn="d6a1d31f-a93f-4844-8b90-70b4da955016" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#d6a1d31f-a93f-4844-8b90-70b4da955016" id="d6a1d31f-a93f-4844-8b90-70b4da955016-link" rel="nofollow" target="_blank">7</a></sup>. Just as <em>On the Origin of Species</em><sup data-fn="210cf9a3-ea19-4402-b37e-c2535bd96366" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#210cf9a3-ea19-4402-b37e-c2535bd96366" id="210cf9a3-ea19-4402-b37e-c2535bd96366-link" rel="nofollow" target="_blank">8</a></sup> and <em>On Growth and Form</em><sup data-fn="8d6afae8-ee06-495e-9b78-9705d6088f63" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#8d6afae8-ee06-495e-9b78-9705d6088f63" id="8d6afae8-ee06-495e-9b78-9705d6088f63-link" rel="nofollow" target="_blank">9</a></sup>, this is yet another masterpiece that brought together different disciplines – in this case neurophysiology and computer vision – to revolutionise science.</p>



<p class="wp-block-paragraph">In this blog post we will define and compare algorithms for image edge detection, and explore their remarkable similarity with neurophysiological readings.</p>



<h1 class="wp-block-heading">Introduction</h1>



<p class="wp-block-paragraph">Modern computer vision is deeply rooted in Marr’s pioneering work. To understand any information-processing system, Marr argued, one must describe it at three interdependent levels of analysis:</p>



<ul class="wp-block-list">
<li>The computational level – <strong>what</strong> problem is being solved and <strong>why</strong> (e.g. edge detection)</li>



<li>The algorithmic level – <strong>how</strong> it is solved, and what representations and procedures are used (e.g. the Laplacian of Gaussian operator)</li>



<li>The implementational level – <strong>where</strong> it is physically realised (e.g. <em>in vivo</em>, <em>in silico</em>)</li>
</ul>



<p class="wp-block-paragraph">This layered thinking is what makes the book so enduring. Marr was not merely describing the visual system; he was arguing that to truly understand it you had to explain it at all three levels simultaneously. The book also features memorable passages on random dot stereograms<sup data-fn="ab32dfeb-f0f5-476f-9b2a-ad0939f8514f" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#ab32dfeb-f0f5-476f-9b2a-ad0939f8514f" id="ab32dfeb-f0f5-476f-9b2a-ad0939f8514f-link" rel="nofollow" target="_blank">10</a></sup>, binocular disparity and motion perception – overall, highly recommended for science enthusiasts.</p>



<p class="wp-block-paragraph">Let us now introduce the key concepts underlying edge detection that leveraged this structured approach, to gain a better understanding of how it can be solved in practice.</p>



<h2 class="wp-block-heading">Zero-crossing <img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/270f.png?w=578&#038;ssl=1" alt="&#x270f;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" /></h2>



<p class="wp-block-paragraph">From a computational perspective, edges are fundamentally spatial discontinuities in images. If for a brief moment we consider a simple greyscale image, edges are wherever dark-to-light and light-to-dark transitions occur, in whatever direction.</p>



<p class="wp-block-paragraph">Because such transitions mathematically translate to local changes in pixel intensity, the most natural approach to identify edges is to compute <strong>image gradients</strong>, the two-dimensional equivalent of the derivative. The first derivative of image intensity evaluated across an edge produces a peak (for a dark-to-bright transition) or a trough (for a bright-to-dark transition), depending on the direction. However, the <strong>second derivative</strong> provides not only the means to identify both transition types, but also a beautifully simple detection mechanism: it crosses zero at the precise location of the edge. This is the essence of the <strong>zero-crossing</strong>.</p>
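<p class="wp-block-paragraph">The zero-crossing idea is easy to verify numerically. The following sketch (my own illustration, not code from the post) differentiates a smooth dark-to-bright step twice with NumPy and locates the sign flip of the second derivative:</p>

```python
import numpy as np

# A smooth dark-to-bright step edge in one dimension
x = np.linspace(-5, 5, 201)
signal = 1 / (1 + np.exp(-4 * x))  # sigmoidal step centred at x = 0

# First and second derivatives via finite differences
d1 = np.gradient(signal, x)
d2 = np.gradient(d1, x)

# The first derivative peaks at the edge...
peak = x[np.argmax(d1)]

# ...while the second derivative changes sign there: the zero-crossing
crossings = np.where(np.diff(np.sign(d2)) != 0)[0]
edge = x[crossings[0]]

print(peak, edge)  # both land near x = 0
```

<p class="wp-block-paragraph">Note that <code>np.diff(np.sign(d2))</code> registers any sign change as a crossing; practical detectors additionally threshold the local gradient magnitude so that near-flat regions do not fire.</p>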



<figure class="wp-block-image size-full"><img data-attachment-id="9914" data-permalink="https://poissonisfish.com/?attachment_id=9914" src="https://poissonisfish.com/wp-content/uploads/2026/03/step_edge_derivatives.png" alt="First and second derivatives of a one-dimensional signal" class="wp-image-9914" /><figcaption class="wp-element-caption">First and second derivatives of a one-dimensional signal. Left: signal displaying low-to-high (dark-to-bright) transition. Middle: first derivative of the signal, capturing that transition as a peak. Right: second derivative of the signal, exhibiting the zero-crossing.</figcaption></figure>



<p class="wp-block-paragraph">Marr and Hildreth formalised this insight by proposing the <strong>Laplacian of Gaussian</strong> (LoG) as the operator of choice<sup data-fn="4b5583df-7b37-4973-9b87-cd2a13af4711" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#4b5583df-7b37-4973-9b87-cd2a13af4711" id="4b5583df-7b37-4973-9b87-cd2a13af4711-link" rel="nofollow" target="_blank">11</a></sup>. The Laplacian <img src="https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\nabla^2 " class="latex" /> is the sum of second partial derivatives in both spatial dimensions:</p>



<p class="wp-block-paragraph"><img src="https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+f+%3D+%5Cfrac%7B%5Cpartial%5E2+f%7D%7B%5Cpartial+x%5E2%7D+%2B+%5Cfrac%7B%5Cpartial%5E2+f%7D%7B%5Cpartial+y%5E2%7D+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=2&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+f+%3D+%5Cfrac%7B%5Cpartial%5E2+f%7D%7B%5Cpartial+x%5E2%7D+%2B+%5Cfrac%7B%5Cpartial%5E2+f%7D%7B%5Cpartial+y%5E2%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=2&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+f+%3D+%5Cfrac%7B%5Cpartial%5E2+f%7D%7B%5Cpartial+x%5E2%7D+%2B+%5Cfrac%7B%5Cpartial%5E2+f%7D%7B%5Cpartial+y%5E2%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=2&#038;c=20201002&#038;zoom=4.5 4x" alt="\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} " class="latex" /></p>



<p class="wp-block-paragraph">Applied directly to a noisy image, the Laplacian amplifies every small intensity fluctuation. The Gaussian pre-filter <img src="https://s0.wp.com/latex.php?latex=G_%5Csigma+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=G_%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=G_%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="G_\sigma " class="latex" /> solves this by smoothing the image at a chosen scale <img src="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\sigma " class="latex" /> before differentiation. Because convolution is associative, the two steps can be combined into a single kernel – the LoG, also denoted <img src="https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+G+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+G+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+G+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\nabla^2 G " class="latex" />:</p>



<p class="wp-block-paragraph"><img src="https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+G_%5Csigma%28x%2C+y%29+%3D+-%5Cfrac%7B1%7D%7B%5Cpi%5Csigma%5E4%7D%5Cleft%281+-+%5Cfrac%7Bx%5E2+%2B+y%5E2%7D%7B2%5Csigma%5E2%7D%5Cright%29e%5E%7B-%5Cfrac%7Bx%5E2%2By%5E2%7D%7B2%5Csigma%5E2%7D%7D+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=2&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+G_%5Csigma%28x%2C+y%29+%3D+-%5Cfrac%7B1%7D%7B%5Cpi%5Csigma%5E4%7D%5Cleft%281+-+%5Cfrac%7Bx%5E2+%2B+y%5E2%7D%7B2%5Csigma%5E2%7D%5Cright%29e%5E%7B-%5Cfrac%7Bx%5E2%2By%5E2%7D%7B2%5Csigma%5E2%7D%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=2&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+G_%5Csigma%28x%2C+y%29+%3D+-%5Cfrac%7B1%7D%7B%5Cpi%5Csigma%5E4%7D%5Cleft%281+-+%5Cfrac%7Bx%5E2+%2B+y%5E2%7D%7B2%5Csigma%5E2%7D%5Cright%29e%5E%7B-%5Cfrac%7Bx%5E2%2By%5E2%7D%7B2%5Csigma%5E2%7D%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=2&#038;c=20201002&#038;zoom=4.5 4x" alt="\nabla^2 G_\sigma(x, y) = -\frac{1}{\pi\sigma^4}\left(1 - \frac{x^2 + y^2}{2\sigma^2}\right)e^{-\frac{x^2+y^2}{2\sigma^2}} " class="latex" /></p>



<p class="wp-block-paragraph">This kernel, which resembles an inverted sombrero and is sometimes called the <strong>Mexican hat wavelet</strong>, produces a response that crosses zero exactly at an intensity edge. The width of the Gaussian <img src="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\sigma " class="latex" /> determines the scale of detection: small <img src="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\sigma " class="latex" /> preserves fine detail, large <img src="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\sigma " class="latex" /> captures only coarse structure. Marr argued that the visual system operates simultaneously at multiple scales – an idea that would later resonate in scale-space theory and, much later, in the multi-scale feature maps of deep convolutional networks.</p>


<div class="wp-block-image wp-image-10036 size-large">
<figure class="aligncenter size-full is-resized"><img data-attachment-id="10036" data-permalink="https://poissonisfish.com/?attachment_id=10036" data-orig-file="https://i1.wp.com/poissonisfish.com/wp-content/uploads/2026/05/laplacian_of_gaussian.png?w=578&#038;ssl=1" data-orig-size="1179,980" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="laplacian_of_gaussian" data-image-description="" data-image-caption="" data-large-file="https://i1.wp.com/poissonisfish.com/wp-content/uploads/2026/05/laplacian_of_gaussian.png?w=578&#038;ssl=1?w=1024" src="https://i1.wp.com/poissonisfish.com/wp-content/uploads/2026/05/laplacian_of_gaussian.png?w=578&#038;ssl=1" alt="" class="wp-image-10036" style="aspect-ratio:1.2048252458447897;object-fit:cover;width:650px" data-recalc-dims="1" /><figcaption class="wp-element-caption">One-dimensional cross-section of the Laplacian of Gaussian. The characteristic positive central lobe flanked by two negative side lobes – the “Mexican hat” – is what produces a zero-crossing wherever image intensity changes sharply.</figcaption></figure>
</div>


<p class="wp-block-paragraph">Marr went further and showed that the centre-surround organisation of <strong>retinal ganglion cell</strong> receptive fields – which he modelled as a <strong>Difference of Gaussians</strong> (DoG), the difference between a narrow excitatory Gaussian and a broader inhibitory one – is a close biological approximation of the LoG. Put differently, your retina is already computing zero-crossings before the signal ever reaches the visual cortex. The agreement between computational predictions and <em>in vivo</em> electrophysiological measurements, documented in <em>Vision</em> (p. 64), remains one of the most compelling examples of theory meeting experiment in all of neuroscience.</p>
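<p class="wp-block-paragraph">The DoG-to-LoG approximation can be verified numerically. The sketch below (pure NumPy, using the surround-to-centre ratio of 1.6 that Marr and Hildreth recommended) compares a 1-D DoG against the LoG profile on the same grid:</p>

```python
import numpy as np

x = np.linspace(-8, 8, 401)
sigma = 1.0
ratio = 1.6  # surround/centre ratio recommended by Marr and Hildreth

def gauss(x, s):
    """Normalised 1-D Gaussian."""
    return np.exp(-x**2 / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)

# Second derivative of the Gaussian (1-D LoG) and the Difference of Gaussians
log = (x**2 / sigma**4 - 1 / sigma**2) * gauss(x, sigma)
dog = gauss(x, sigma) - gauss(x, ratio * sigma)  # narrow minus broad

# Normalise shapes; the sign flip matches the DoG's positive centre
# to the LoG's negative centre under this sign convention
log_n = log / np.abs(log).max()
dog_n = -dog / np.abs(dog).max()

corr = np.corrcoef(log_n, dog_n)[0, 1]
print(round(corr, 3))  # close to 1: the two profiles nearly coincide
```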


<div class="wp-block-image wp-image-10033 size-large">
<figure class="aligncenter"><img loading="lazy" data-attachment-id="10033" data-permalink="https://poissonisfish.com/?attachment_id=10033" data-orig-file="https://poissonisfish.com/wp-content/uploads/2026/05/zero_cross_deblurred.png" data-orig-size="3537,2418" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="zero_cross_deblurred" data-image-description="" data-image-caption="" data-large-file="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/zero_cross_deblurred.png?w=450&#038;ssl=1" src="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/zero_cross_deblurred.png?w=450&#038;ssl=1" alt="" class="wp-image-10033" srcset_temp="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/zero_cross_deblurred.png?w=450&#038;ssl=1 1024w, https://poissonisfish.com/wp-content/uploads/2026/05/zero_cross_deblurred.png?w=2048 2048w, https://poissonisfish.com/wp-content/uploads/2026/05/zero_cross_deblurred.png?w=150 150w, https://poissonisfish.com/wp-content/uploads/2026/05/zero_cross_deblurred.png?w=300 300w, https://poissonisfish.com/wp-content/uploads/2026/05/zero_cross_deblurred.png?w=768 768w, https://poissonisfish.com/wp-content/uploads/2026/05/zero_cross_deblurred.png?w=1440 1440w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /><figcaption class="wp-element-caption">Response of the LoG operator to three idealised stimuli: a step edge, a thin bar and a wide bar (columns). Top: input intensity profile. Middle: filter response. Bottom: histogram of intracellular recordings from cat retinal X-cells exposed to analogous stimuli, after Marr (1982). The agreement is striking and forms the basis for the claim that retinal ganglion cells implement a biological LoG.</figcaption></figure>
</div>


<p class="wp-block-paragraph">Zero-crossings are theoretically elegant, but the workhorse operators most computer-vision tools reach for – including, as we will see, the Canny detector itself – operate on first-derivative gradients. Let us look at those.</p>



<h2 class="wp-block-heading">Image gradients <img src="https://i1.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f4d0.png?w=578&#038;ssl=1" alt="&#x1f4d0;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" /></h2>



<p class="wp-block-paragraph">In practice, image gradients are computed using <strong>convolution filters</strong> – small kernels that slide across the image and produce a weighted local sum at each pixel, as illustrated in my <a href="https://poissonisfish.com/2018/07/08/convolutional-neural-networks-in-r/" rel="nofollow" target="_blank">post on convolutional neural networks</a>. The two most widely used first-order gradient operators are:</p>



<p class="wp-block-paragraph"><strong>Sobel:</strong> weights the central row and column more heavily, providing a modest degree of smoothing:</p>



<p class="wp-block-paragraph"><img src="https://s0.wp.com/latex.php?latex=G_x+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+0+%26+1+%5C%5C+-2+%26+0+%26+2+%5C%5C+-1+%26+0+%26+1+%5Cend%7Bbmatrix%7D%2C+%5Cquad+G_y+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+-2+%26+-1+%5C%5C+0+%26+0+%26+0+%5C%5C+1+%26+2+%26+1+%5Cend%7Bbmatrix%7D+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=2&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=G_x+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+0+%26+1+%5C%5C+-2+%26+0+%26+2+%5C%5C+-1+%26+0+%26+1+%5Cend%7Bbmatrix%7D%2C+%5Cquad+G_y+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+-2+%26+-1+%5C%5C+0+%26+0+%26+0+%5C%5C+1+%26+2+%26+1+%5Cend%7Bbmatrix%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=2&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=G_x+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+0+%26+1+%5C%5C+-2+%26+0+%26+2+%5C%5C+-1+%26+0+%26+1+%5Cend%7Bbmatrix%7D%2C+%5Cquad+G_y+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+-2+%26+-1+%5C%5C+0+%26+0+%26+0+%5C%5C+1+%26+2+%26+1+%5Cend%7Bbmatrix%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=2&#038;c=20201002&#038;zoom=4.5 4x" alt="G_x = \begin{bmatrix} -1 &#038; 0 &#038; 1 \\ -2 &#038; 0 &#038; 2 \\ -1 &#038; 0 &#038; 1 \end{bmatrix}, \quad G_y = \begin{bmatrix} -1 &#038; -2 &#038; -1 \\ 0 &#038; 0 &#038; 0 \\ 1 &#038; 2 &#038; 1 \end{bmatrix} " class="latex" /></p>



<p class="wp-block-paragraph"><strong>Prewitt:</strong> weights all neighbours equally:</p>



<p class="wp-block-paragraph"><img src="https://s0.wp.com/latex.php?latex=G_x+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+0+%26+1+%5C%5C+-1+%26+0+%26+1+%5C%5C+-1+%26+0+%26+1+%5Cend%7Bbmatrix%7D%2C+%5Cquad+G_y+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+-1+%26+-1+%5C%5C+0+%26+0+%26+0+%5C%5C+1+%26+1+%26+1+%5Cend%7Bbmatrix%7D+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=2&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=G_x+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+0+%26+1+%5C%5C+-1+%26+0+%26+1+%5C%5C+-1+%26+0+%26+1+%5Cend%7Bbmatrix%7D%2C+%5Cquad+G_y+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+-1+%26+-1+%5C%5C+0+%26+0+%26+0+%5C%5C+1+%26+1+%26+1+%5Cend%7Bbmatrix%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=2&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=G_x+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+0+%26+1+%5C%5C+-1+%26+0+%26+1+%5C%5C+-1+%26+0+%26+1+%5Cend%7Bbmatrix%7D%2C+%5Cquad+G_y+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+-1+%26+-1+%5C%5C+0+%26+0+%26+0+%5C%5C+1+%26+1+%26+1+%5Cend%7Bbmatrix%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=2&#038;c=20201002&#038;zoom=4.5 4x" alt="G_x = \begin{bmatrix} -1 &#038; 0 &#038; 1 \\ -1 &#038; 0 &#038; 1 \\ -1 &#038; 0 &#038; 1 \end{bmatrix}, \quad G_y = \begin{bmatrix} -1 &#038; -1 &#038; -1 \\ 0 &#038; 0 &#038; 0 \\ 1 &#038; 1 &#038; 1 \end{bmatrix} " class="latex" /></p>



<p class="wp-block-paragraph">In both cases the gradient magnitude at each pixel is <img src="https://s0.wp.com/latex.php?latex=%5C%7C%5Cnabla+f%5C%7C+%3D+%5Csqrt%7BG_x%5E2+%2B+G_y%5E2%7D+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5C%7C%5Cnabla+f%5C%7C+%3D+%5Csqrt%7BG_x%5E2+%2B+G_y%5E2%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5C%7C%5Cnabla+f%5C%7C+%3D+%5Csqrt%7BG_x%5E2+%2B+G_y%5E2%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\|\nabla f\| = \sqrt{G_x^2 + G_y^2} " class="latex" />, and the gradient direction is <img src="https://s0.wp.com/latex.php?latex=%5Ctheta+%3D+%5Carctan%28G_y+%2F+G_x%29+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Ctheta+%3D+%5Carctan%28G_y+%2F+G_x%29+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Ctheta+%3D+%5Carctan%28G_y+%2F+G_x%29+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\theta = \arctan(G_y / G_x) " class="latex" />. Where the magnitude is large, a transition is occurring; where it is small, the neighbourhood is uniform.</p>
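<p class="wp-block-paragraph">A tiny worked example makes the arithmetic concrete. The sketch below (assuming SciPy; <code>correlate</code> implements the sliding-window weighted sum without kernel flipping) applies the Sobel pair to a synthetic vertical step edge:</p>

```python
import numpy as np
from scipy.ndimage import correlate

# Toy image: a vertical step edge (dark left half, bright right half)
img = np.zeros((5, 5))
img[:, 2:] = 1.0

sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

Gx = correlate(img, sobel_x)
Gy = correlate(img, sobel_y)
magnitude = np.hypot(Gx, Gy)
theta = np.arctan2(Gy, Gx)  # atan2 avoids dividing by zero when Gx == 0

# The centre pixel sits on the edge: a purely horizontal gradient
print(magnitude[2, 2], np.degrees(theta[2, 2]))
```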



<figure class="wp-block-image size-full"><img data-attachment-id="10175" data-permalink="https://poissonisfish.com/2026/05/08/edge-detection-in-python/prewitt_sobel/" data-orig-file="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/prewitt_sobel.png?w=578&#038;ssl=1" data-orig-size="2800,1000" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="prewitt_sobel" data-image-description="" data-image-caption="" data-large-file="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/prewitt_sobel.png?w=578&#038;ssl=1?w=1024" src="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/prewitt_sobel.png?w=578&#038;ssl=1" alt="" class="wp-image-10175" data-recalc-dims="1" /><figcaption class="wp-element-caption">Effect of Prewitt and Sobel operators on a natural image of a brick floor (gradient magnitude).</figcaption></figure>



<p class="wp-block-paragraph">The limitation of these operators is that they are sensitive to noise and produce thick, diffuse edges. Every pixel with a large gradient is flagged regardless of whether it truly lies on the edge or merely near it. This is precisely the problem that John Canny set out to solve.</p>



<h2 class="wp-block-heading">Canny edge detection <img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f50d.png?w=578&#038;ssl=1" alt="&#x1f50d;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" /></h2>



<p class="wp-block-paragraph">Published in 1986, John Canny’s paper <em>A Computational Approach to Edge Detection</em><sup data-fn="b51508a4-906a-492d-87d6-4a3a4971d9cc" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#b51508a4-906a-492d-87d6-4a3a4971d9cc" id="b51508a4-906a-492d-87d6-4a3a4971d9cc-link" rel="nofollow" target="_blank">12</a></sup> remains one of the most cited works in computer vision. Canny framed edge detection as an explicit optimisation problem and derived a detector that simultaneously maximises three criteria: <em>i</em>) good detection (few missed edges, few “false alarms”), <em>ii</em>) good localisation (detected edges close to true edges), and <em>iii</em>) single response (one response per edge, not many). The resulting algorithm is a four-step pipeline outlined below:</p>



<h3 class="wp-block-heading">Step 1 – Gaussian smoothing</h3>



<p class="wp-block-paragraph">As with the LoG, the first step is to suppress noise by convolving the image with a Gaussian kernel of width <img src="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\sigma " class="latex" />. The choice of <img src="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\sigma " class="latex" /> directly governs the trade-off between noise suppression and fine detail preservation. A larger <img src="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\sigma " class="latex" /> removes more noise but blurs genuine edges.</p>
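<p class="wp-block-paragraph">In discrete form, the kernel size must also be chosen. A common heuristic, illustrated below (a sketch, not an OpenCV requirement), is a half-width of 3<em>σ</em>, since a Gaussian keeps roughly 99.7% of its mass within ±3<em>σ</em>:</p>

```python
import math
import numpy as np

sigma = 1.4
half = int(np.ceil(3 * sigma))  # half-width covering +/- 3 sigma
ksize = 2 * half + 1            # odd kernel size

# Fraction of a continuous Gaussian's mass inside +/- 3 sigma
inside = math.erf(3 / math.sqrt(2))
print(ksize, round(inside, 4))

# The discrete, normalised kernel actually used for smoothing
x = np.arange(-half, half + 1)
kernel = np.exp(-x**2 / (2 * sigma**2))
kernel /= kernel.sum()  # normalise so smoothing preserves mean intensity
```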



<h3 class="wp-block-heading">Step 2 – Gradient computation</h3>



<p class="wp-block-paragraph">The smoothed image is then differentiated – typically using Sobel kernels – to obtain the gradient magnitude <img src="https://s0.wp.com/latex.php?latex=%5C%7C%5Cnabla+f%5C%7C+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5C%7C%5Cnabla+f%5C%7C+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5C%7C%5Cnabla+f%5C%7C+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\|\nabla f\| " class="latex" /> and direction <img src="https://s0.wp.com/latex.php?latex=%5Ctheta+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Ctheta+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Ctheta+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\theta " class="latex" /> at every pixel.</p>



<h3 class="wp-block-heading">Step 3 – Non-maximum suppression (NMS)</h3>



<p class="wp-block-paragraph">This step thins the edges. For each pixel, Canny checks whether its gradient magnitude is a local maximum along the gradient direction – that is, whether it is larger than its two neighbours in the direction <img src="https://s0.wp.com/latex.php?latex=%5Ctheta+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Ctheta+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Ctheta+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\theta " class="latex" />. If it is not, it is suppressed to zero. The result is a set of thin, one-pixel-wide candidate edges.</p>
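<p class="wp-block-paragraph">A minimal sketch of NMS, with gradient directions quantised to the four usual orientations (an illustrative implementation rather than the one inside OpenCV):</p>

```python
import numpy as np

def non_max_suppression(mag, theta):
    """Keep a pixel only if it is a local maximum along its gradient direction.
    Directions are quantised to 0/45/90/135 degrees."""
    out = np.zeros_like(mag)
    angle = (np.degrees(theta) + 180) % 180  # fold into [0, 180)
    for i in range(1, mag.shape[0] - 1):
        for j in range(1, mag.shape[1] - 1):
            a = angle[i, j]
            if a < 22.5 or a >= 157.5:      # horizontal gradient
                n1, n2 = mag[i, j - 1], mag[i, j + 1]
            elif a < 67.5:                  # 45-degree gradient
                n1, n2 = mag[i - 1, j + 1], mag[i + 1, j - 1]
            elif a < 112.5:                 # vertical gradient
                n1, n2 = mag[i - 1, j], mag[i + 1, j]
            else:                           # 135-degree gradient
                n1, n2 = mag[i - 1, j - 1], mag[i + 1, j + 1]
            if mag[i, j] >= n1 and mag[i, j] >= n2:
                out[i, j] = mag[i, j]
    return out

# A blurry vertical edge: the gradient ridge runs along column 2
mag = np.array([[0, 1, 3, 1, 0]] * 5, dtype=float)
theta = np.zeros_like(mag)  # gradient points along x everywhere
thinned = non_max_suppression(mag, theta)
print(thinned[2])  # only the ridge column survives
```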



<h3 class="wp-block-heading">Step 4 – Hysteresis thresholding</h3>



<p class="wp-block-paragraph">The final step uses <strong>two thresholds</strong>, <img src="https://s0.wp.com/latex.php?latex=T_%5Ctext%7Bhigh%7D+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=T_%5Ctext%7Bhigh%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=T_%5Ctext%7Bhigh%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="T_\text{high} " class="latex" /> and <img src="https://s0.wp.com/latex.php?latex=T_%5Ctext%7Blow%7D+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=T_%5Ctext%7Blow%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=T_%5Ctext%7Blow%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="T_\text{low} " class="latex" />, to prune candidate pixels. A pixel whose gradient magnitude exceeds <img src="https://s0.wp.com/latex.php?latex=T_%5Ctext%7Bhigh%7D+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=T_%5Ctext%7Bhigh%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=T_%5Ctext%7Bhigh%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="T_\text{high} " class="latex" /> is accepted as an edge, and conversely a pixel below <img src="https://s0.wp.com/latex.php?latex=T_%5Ctext%7Blow%7D+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=T_%5Ctext%7Blow%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=T_%5Ctext%7Blow%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="T_\text{low} " class="latex" /> is rejected. 
A pixel between the two thresholds is accepted only if it is connected, directly or through other such pixels, to a strong edge pixel. This connectivity analysis – the defining feature of hysteresis – ensures that long, continuous edges are preserved even when their local gradient fluctuates, while isolated noise responses are discarded. For a more visual understanding of NMS and hysteresis I recommend reading the <a href="https://docs.opencv.org/4.x/da/d22/tutorial_py_canny.html" rel="nofollow" target="_blank">Canny edge detection</a> documentation from OpenCV.</p>
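<p class="wp-block-paragraph">The connectivity analysis can be sketched compactly with connected-component labelling, as below (an illustrative implementation assuming SciPy, not the one inside OpenCV):</p>

```python
import numpy as np
from scipy.ndimage import label

def hysteresis(mag, t_low, t_high):
    """Minimal hysteresis sketch: keep weak pixels only when their
    connected component also contains a strong pixel."""
    strong = mag >= t_high
    candidate = mag >= t_low                    # strong and weak pixels
    # 8-connected components of all candidate pixels
    labels, n = label(candidate, structure=np.ones((3, 3)))
    keep = np.zeros(n + 1, dtype=bool)
    keep[np.unique(labels[strong])] = True      # components touching a strong pixel
    keep[0] = False                             # background label
    return keep[labels]

mag = np.array([
    [0,  0,   0,  0, 0],
    [0, 60, 120, 60, 0],   # weak-strong-weak chain: all three kept
    [0,  0,   0,  0, 0],
    [0, 60,   0,  0, 0],   # isolated weak pixel: dropped
    [0,  0,   0,  0, 0],
], dtype=float)

edges = hysteresis(mag, t_low=50, t_high=100)
print(edges.astype(int))
```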


<div class="wp-block-image wp-image-10050 size-large">
<figure class="aligncenter"><img loading="lazy" data-attachment-id="10050" data-permalink="https://poissonisfish.com/?attachment_id=10050" data-orig-file="https://poissonisfish.com/wp-content/uploads/2026/05/canny_workflow.png" data-orig-size="3004,1638" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="canny_workflow" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/poissonisfish.com/wp-content/uploads/2026/05/canny_workflow.png?w=450&#038;ssl=1" src="https://i0.wp.com/poissonisfish.com/wp-content/uploads/2026/05/canny_workflow.png?w=450&#038;ssl=1" alt="" class="wp-image-10050" srcset_temp="https://i0.wp.com/poissonisfish.com/wp-content/uploads/2026/05/canny_workflow.png?w=450&#038;ssl=1 1024w, https://poissonisfish.com/wp-content/uploads/2026/05/canny_workflow.png?w=2048 2048w, https://poissonisfish.com/wp-content/uploads/2026/05/canny_workflow.png?w=150 150w, https://poissonisfish.com/wp-content/uploads/2026/05/canny_workflow.png?w=300 300w, https://poissonisfish.com/wp-content/uploads/2026/05/canny_workflow.png?w=768 768w, https://poissonisfish.com/wp-content/uploads/2026/05/canny_workflow.png?w=1440 1440w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /><figcaption class="wp-element-caption">The Canny pipeline applied to a sample image. From left to right: original greyscale input, Gaussian-blurred image, Sobel gradient magnitude, output of NMS, and final edge map after hysteresis thresholding. Each step removes a specific failure mode of the previous one (AI-generated image)</figcaption></figure>
</div>


<p class="wp-block-paragraph">The elegance of Canny lies in how each step addresses a specific failure mode of earlier operators. In essence, Gaussian smoothing tackles noise, NMS tackles thick edges, and hysteresis tackles the false-edge / broken-edge trade-off that a single threshold cannot resolve.</p>



<h1 class="wp-block-heading">Let’s get started with Python</h1>



<p class="wp-block-paragraph">Time to practice! We will first build the separate components (Gaussian blur, Sobel gradients, LoG zero-crossings), then run the full Canny pipeline and explore how its parameters trade off recall against noise. We will use <code>opencv</code> and <code>scikit-image</code> alongside the usual suspects <code>numpy</code> and <code>matplotlib</code>. You can install all packages using the following shell command:</p>



<p class="wp-block-paragraph"><code>pip install opencv-python scikit-image matplotlib numpy</code></p>



<h2 class="wp-block-heading">Image loading and preprocessing</h2>



<p class="wp-block-paragraph">We start by loading a greyscale image. For demonstration purposes I use a stock picture from <code>scikit-image</code> – feel free to use any other image of your choice.</p>


<div class="wp-block-code">
	<div class="cm-editor">
		<div class="cm-scroller">
			
<pre>
import cv2
import numpy as np
import matplotlib.pyplot as plt
from skimage import data
from scipy.ndimage import gaussian_laplace

# Load a greyscale test image (uint8, values 0–255)
image = data.camera()

fig, ax = plt.subplots(figsize=(5, 5))
ax.imshow(image, cmap=&apos;gray&apos;)
ax.set_title(&apos;Original image&apos;)
ax.axis(&apos;off&apos;)
plt.tight_layout()
plt.show()</pre>
		</div>
	</div>
</div>

<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-attachment-id="10134" data-permalink="https://poissonisfish.com/2026/05/08/edge-detection-in-python/camera_original/" data-orig-file="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_original-1.png?w=578&#038;ssl=1" data-orig-size="1000,1000" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="camera_original" data-image-description="" data-image-caption="" data-large-file="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_original-1.png?w=578&#038;ssl=1?w=1000" src="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_original-1.png?w=578&#038;ssl=1" alt="" class="wp-image-10134" style="width:650px" data-recalc-dims="1" /></figure>
</div>


<h2 class="wp-block-heading">The Laplacian of Gaussian and zero-crossings</h2>



<p class="wp-block-paragraph">Let us inspect the LoG response and its zero-crossings – the theoretical backbone we discussed earlier. </p>


<div class="wp-block-code">
	<div class="cm-editor">
		<div class="cm-scroller">
			
<pre>
# LoG: positive sigma = apply Gaussian of that std, then Laplacian
log_response = gaussian_laplace(image.astype(float), sigma=2.0)

# Zero-crossings: sign changes between neighbouring pixels
def zero_crossings(log_img):
    &quot;&quot;&quot;Return a binary mask of zero-crossing locations.&quot;&quot;&quot;
    zc = np.zeros_like(log_img, dtype=bool)
    # Check horizontal and vertical sign changes
    for shift in [(0, 1), (1, 0)]:
        shifted = np.roll(log_img, shift=shift, axis=(0, 1))
        zc |= (np.sign(log_img) != np.sign(shifted))
    return zc

zc_mask = zero_crossings(log_response)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].imshow(log_response, cmap=&apos;RdBu_r&apos;)
axes[0].set_title(&apos;LoG response (σ=2.0)&apos;)
axes[0].axis(&apos;off&apos;)
axes[1].imshow(zc_mask, cmap=&apos;gray&apos;)
axes[1].set_title(&apos;Zero-crossings&apos;)
axes[1].axis(&apos;off&apos;)
plt.tight_layout()
plt.show()</pre>
		</div>
	</div>
</div>


<figure class="wp-block-image size-large"><img loading="lazy" data-attachment-id="10137" data-permalink="https://poissonisfish.com/2026/05/08/edge-detection-in-python/camera_zerocross/" data-orig-file="https://poissonisfish.com/wp-content/uploads/2026/05/camera_zerocross-1.png" data-orig-size="2000,800" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="camera_zerocross" data-image-description="" data-image-caption="" data-large-file="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_zerocross-1.png?w=450&#038;ssl=1" src="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_zerocross-1.png?w=450&#038;ssl=1" alt="" class="wp-image-10137" srcset_temp="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_zerocross-1.png?w=450&#038;ssl=1 1024w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_zerocross-1.png?w=150 150w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_zerocross-1.png?w=300 300w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_zerocross-1.png?w=768 768w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_zerocross-1.png?w=1440 1440w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_zerocross-1.png 2000w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /></figure>



<p class="wp-block-paragraph">Notice how the zero-crossing map already captures much of the scene’s edge structure, but it is sensitive to low-level noise and retains spurious responses in flat regions. This motivates the additional refinement steps of the Canny algorithm.</p>



<h2 class="wp-block-heading">Gaussian smoothing and gradient computation</h2>



<p class="wp-block-paragraph">Before running the full Canny pipeline, it is instructive to inspect the intermediate steps. Here we apply a Gaussian blur and then compute Sobel gradients manually.</p>


<div class="wp-block-code">
	<div class="cm-editor">
		<div class="cm-scroller">
			
<pre>
# Gaussian blur — sigma controlled by ksize (must be odd) and sigmaX
blurred = cv2.GaussianBlur(image, ksize=(5, 5), sigmaX=1.4)

# Sobel gradients in x and y
Gx = cv2.Sobel(blurred, cv2.CV_64F, dx=1, dy=0, ksize=3)
Gy = cv2.Sobel(blurred, cv2.CV_64F, dx=0, dy=1, ksize=3)

# Gradient magnitude
magnitude = np.sqrt(Gx**2 + Gy**2)
magnitude = (magnitude / magnitude.max() * 255).astype(np.uint8)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, img, title in zip(axes, [blurred, magnitude],
                          [&apos;Gaussian blur (σ=1.4)&apos;, &apos;|∇f| - Sobel magnitude&apos;]):
    ax.imshow(img, cmap=&apos;gray&apos;)
    ax.set_title(title)
    ax.axis(&apos;off&apos;)
plt.tight_layout()
plt.show()</pre>
		</div>
	</div>
</div>


<figure class="wp-block-image size-large"><img loading="lazy" data-attachment-id="10136" data-permalink="https://poissonisfish.com/2026/05/08/edge-detection-in-python/camera_smooth_grad/" data-orig-file="https://poissonisfish.com/wp-content/uploads/2026/05/camera_smooth_grad-1.png" data-orig-size="2000,800" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="camera_smooth_grad" data-image-description="" data-image-caption="" data-large-file="https://i1.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_smooth_grad-1.png?w=450&#038;ssl=1" src="https://i1.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_smooth_grad-1.png?w=450&#038;ssl=1" alt="" class="wp-image-10136" srcset_temp="https://i1.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_smooth_grad-1.png?w=450&#038;ssl=1 1024w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_smooth_grad-1.png?w=150 150w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_smooth_grad-1.png?w=300 300w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_smooth_grad-1.png?w=768 768w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_smooth_grad-1.png?w=1440 1440w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_smooth_grad-1.png 2000w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /></figure>



<h2 class="wp-block-heading">Canny edge detection with OpenCV</h2>



<p class="wp-block-paragraph">The OpenCV <code>Canny()</code> function accepts the image, the two hysteresis thresholds, and an optional aperture size for the Sobel operator. Crucially, the Gaussian smoothing step should be applied manually beforehand so you have full control over <img src="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\sigma " class="latex" />.</p>


<div class="wp-block-code">
	<div class="cm-editor">
		<div class="cm-scroller">
			
<pre>
def run_canny(image, sigma, t_low, t_high, aperture=3):
    &quot;&quot;&quot;Apply Gaussian blur then Canny edge detection.&quot;&quot;&quot;
    # Kernel size: 2 * ceil(3*sigma) + 1 ensures the kernel covers ±3σ
    ksize = 2 * int(np.ceil(3 * sigma)) + 1
    blurred = cv2.GaussianBlur(image, (ksize, ksize), sigmaX=sigma)
    edges = cv2.Canny(blurred, threshold1=t_low, threshold2=t_high, apertureSize=aperture)
    return edges

edges = run_canny(image, sigma=1.4, t_low=50, t_high=150)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].imshow(image, cmap=&apos;gray&apos;)
axes[0].set_title(&apos;Original&apos;)
axes[0].axis(&apos;off&apos;)
axes[1].imshow(edges, cmap=&apos;gray&apos;)
axes[1].set_title(&apos;Canny edges (σ=1.4, T_low=50, T_high=150)&apos;)
axes[1].axis(&apos;off&apos;)
plt.tight_layout()
plt.show()</pre>
		</div>
	</div>
</div>


<figure class="wp-block-image size-large"><img loading="lazy" data-attachment-id="10132" data-permalink="https://poissonisfish.com/2026/05/08/edge-detection-in-python/camera_canny/" data-orig-file="https://poissonisfish.com/wp-content/uploads/2026/05/camera_canny-1.png" data-orig-size="2000,800" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="camera_canny" data-image-description="" data-image-caption="" data-large-file="https://i1.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_canny-1.png?w=450&#038;ssl=1" src="https://i1.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_canny-1.png?w=450&#038;ssl=1" alt="" class="wp-image-10132" srcset_temp="https://i1.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_canny-1.png?w=450&#038;ssl=1 1024w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_canny-1.png?w=150 150w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_canny-1.png?w=300 300w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_canny-1.png?w=768 768w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_canny-1.png?w=1440 1440w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_canny-1.png 2000w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /></figure>



<h2 class="wp-block-heading">The effect of hysteresis thresholds</h2>



<p class="wp-block-paragraph">The ratio <img src="https://s0.wp.com/latex.php?latex=T_%5Ctext%7Bhigh%7D+%2F+T_%5Ctext%7Blow%7D+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=T_%5Ctext%7Bhigh%7D+%2F+T_%5Ctext%7Blow%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=T_%5Ctext%7Bhigh%7D+%2F+T_%5Ctext%7Blow%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="T_\text{high} / T_\text{low} " class="latex" /> is at least as important as the absolute values. A common rule of thumb is to use a 2:1 or 3:1 ratio. Let us explore this now.</p>


<div class="wp-block-code">
	<div class="cm-editor">
		<div class="cm-scroller">
			
<pre>
fig, axes = plt.subplots(2, 3, figsize=(14, 8))
configs = [
    (1.4, 20, 60, &apos;↑ Recall, ↑ Noise\nσ=1.4, T=20/60&apos;),
    (1.4, 50, 150, &apos;Balanced\nσ=1.4, T=50/150&apos;),
    (1.4, 100, 200, &apos;↓ Recall, ↓ Noise\nσ=1.4, T=100/200&apos;),
    (0.5, 50, 150, &apos;Fine scale\nσ=0.5, T=50/150&apos;),
    (2.0, 50, 150, &apos;Coarse scale\nσ=2.0, T=50/150&apos;),
    (4.0, 50, 150, &apos;Very coarse scale\nσ=4.0, T=50/150&apos;),
]

for ax, (sigma, tl, th, title) in zip(axes.ravel(), configs):
    result = run_canny(image, sigma=sigma, t_low=tl, t_high=th)
    ax.imshow(result, cmap=&apos;gray&apos;)
    ax.set_title(title, fontsize=9)
    ax.axis(&apos;off&apos;)

plt.tight_layout()
plt.show()</pre>
		</div>
	</div>
</div>


<figure class="wp-block-image size-large"><img loading="lazy" data-attachment-id="10133" data-permalink="https://poissonisfish.com/2026/05/08/edge-detection-in-python/camera_hyst_thresh/" data-orig-file="https://poissonisfish.com/wp-content/uploads/2026/05/camera_hyst_thresh-1.png" data-orig-size="2800,1600" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="camera_hyst_thresh" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_hyst_thresh-1.png?w=450&#038;ssl=1" src="https://i0.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_hyst_thresh-1.png?w=450&#038;ssl=1" alt="" class="wp-image-10133" srcset_temp="https://i0.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_hyst_thresh-1.png?w=450&#038;ssl=1 1024w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_hyst_thresh-1.png?w=2048 2048w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_hyst_thresh-1.png?w=150 150w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_hyst_thresh-1.png?w=300 300w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_hyst_thresh-1.png?w=768 768w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_hyst_thresh-1.png?w=1440 1440w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /></figure>



<p class="wp-block-paragraph">The top row demonstrates the threshold effect: lower thresholds recover more edges but also more noise, while higher thresholds yield cleaner output at the cost of broken contours. The bottom row shows the scale effect governed by <img src="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\sigma " class="latex" />: at small scales the detector responds to fine texture and noise, while at large scales only the dominant structural boundaries survive.</p>
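<p class="wp-block-paragraph">One practical way to pick the band without manual tuning (a common heuristic, not taken from this post) is to centre it on the median intensity of the blurred image:</p>

```python
import numpy as np

def auto_thresholds(blurred, dev=0.33):
    """Centre the hysteresis band on the median intensity.

    A common 'auto-Canny' heuristic; dev=0.33 yields roughly the 2:1
    high:low ratio recommended by the rule of thumb above.
    """
    med = float(np.median(blurred))
    t_low = round(max(0.0, (1.0 - dev) * med))
    t_high = round(min(255.0, (1.0 + dev) * med))
    return t_low, t_high
```

<p class="wp-block-paragraph">The resulting pair can then be passed as <code>t_low</code> and <code>t_high</code> to the <code>run_canny()</code> helper defined earlier.</p>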



<h2 class="wp-block-heading">Overlaying edges on the original image</h2>



<p class="wp-block-paragraph">A useful visualisation is to overlay the detected edges on the original image, which makes it easy to judge the quality of the detection.</p>


<div class="wp-block-code">
	<div class="cm-editor">
		<div class="cm-scroller">
			
<pre>
# Convert to RGB so we can draw edges in red
overlay = cv2.cvtColor(image, cv2.COLOR_GRAY2RGB)
overlay[edges &gt; 0] = [220, 40, 40]  # red edges

fig, ax = plt.subplots(figsize=(6, 6))
ax.imshow(overlay)
ax.set_title(&apos;Canny edges overlaid (σ=1.4, T=50/150)&apos;)
ax.axis(&apos;off&apos;)
plt.tight_layout()
plt.show()</pre>
		</div>
	</div>
</div>

<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img loading="lazy" data-attachment-id="10135" data-permalink="https://poissonisfish.com/2026/05/08/edge-detection-in-python/camera_overlay/" data-orig-file="https://poissonisfish.com/wp-content/uploads/2026/05/camera_overlay-1.png" data-orig-size="1200,1200" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="camera_overlay" data-image-description="" data-image-caption="" data-large-file="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_overlay-1.png?w=450&#038;ssl=1" src="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_overlay-1.png?w=450&#038;ssl=1" alt="" class="wp-image-10135" style="width:650px" srcset_temp="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_overlay-1.png?w=450&#038;ssl=1 1024w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_overlay-1.png?w=150 150w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_overlay-1.png?w=300 300w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_overlay-1.png?w=768 768w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_overlay-1.png 1200w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /></figure>
</div>


<p class="wp-block-paragraph">As with the butterfly picture earlier, the result precisely identifies the sharpest edges in the image.</p>



<h1 class="wp-block-heading">Conclusion <img src="https://i1.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f3c1.png?w=578&#038;ssl=1" alt="&#x1f3c1;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" /></h1>



<p class="wp-block-paragraph">We have traced a line from the centre-surround receptive fields of retinal ganglion cells to the LoG operator and Marr’s zero-crossings, and from there to the Canny detector – one of the most popular algorithms in image processing. The key ideas are worth summarising:</p>



<ul class="wp-block-list">
<li><strong>Edges are zero-crossings of the second derivative</strong> of image intensity, a principle Marr derived from first principles and validated against neurophysiology <img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f9e0.png?w=578&#038;ssl=1" alt="&#x1f9e0;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" /></li>



<li><strong>The LoG operator</strong> implements this computationally: a Gaussian pre-filter controls scale and suppresses noise, whereas the Laplacian finds sign changes <img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f4bb.png?w=578&#038;ssl=1" alt="&#x1f4bb;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" /></li>



<li><strong>Canny refines the idea</strong> with NMS for thin, well-localised edges, and hysteresis thresholding to preserve continuous contours without fragmenting them <img src="https://i1.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f3d9.png?w=578&#038;ssl=1" alt="&#x1f3d9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" /></li>
</ul>
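<p class="wp-block-paragraph">The hysteresis step in the last point can be sketched in a few lines of NumPy (an illustrative toy that grows the strong set through 4-connected neighbours; OpenCV&#8217;s implementation differs in detail):</p>

```python
import numpy as np

def hysteresis(mag, t_low, t_high, max_iter=100):
    """Keep weak pixels (>= t_low) only if connected to a strong pixel (>= t_high).

    Toy sketch: iteratively grows the kept set into 4-connected weak
    neighbours until it stops changing.
    """
    strong = mag >= t_high
    weak = mag >= t_low
    keep = strong.copy()
    for _ in range(max_iter):
        grown = keep.copy()
        # grow the kept set into its 4-connected neighbours
        grown[1:, :] |= keep[:-1, :]
        grown[:-1, :] |= keep[1:, :]
        grown[:, 1:] |= keep[:, :-1]
        grown[:, :-1] |= keep[:, 1:]
        grown &= weak                      # only weak-or-stronger pixels survive
        if np.array_equal(grown, keep):
            return grown
        keep = grown
    return keep
```

<p class="wp-block-paragraph">This is exactly why hysteresis preserves continuous contours: a faint segment survives as long as it touches a confident one.</p>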



<p class="wp-block-paragraph">Edge detection may seem like a solved problem in an era of end-to-end learned vision systems, but it remains the conceptual foundation of a surprisingly wide range of techniques. Some practical applications worth exploring on your own include:</p>



<ul class="wp-block-list">
<li><strong>Hough transform</strong> for line and circle detection – it operates directly on edge maps</li>



<li><strong>Contour-based object detection</strong> – a classical pre-deep-learning approach that is still competitive in constrained domains</li>



<li><strong>Medical image segmentation</strong> – where edge-based pre-processing still complements learned models for thin-structure detection</li>
</ul>
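<p class="wp-block-paragraph">To make the first point concrete, here is a minimal from-scratch accumulator (a toy sketch, not <code>cv2.HoughLines</code>): every edge pixel votes for all lines (ρ, θ) passing through it, and peaks in the accumulator correspond to detected lines.</p>

```python
import numpy as np

def hough_lines(edges, n_theta=180):
    """Toy Hough accumulator: each edge pixel votes for (rho, theta) pairs."""
    h, w = edges.shape
    diag = int(np.ceil(np.hypot(h, w)))          # maximum possible |rho|
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((2 * diag + 1, n_theta), dtype=np.int32)
    ys, xs = np.nonzero(edges)
    for i, theta in enumerate(thetas):
        # rho = x*cos(theta) + y*sin(theta), shifted so indices are non-negative
        rhos = np.round(xs * np.cos(theta) + ys * np.sin(theta)).astype(int) + diag
        np.add.at(acc, (rhos, i), 1)             # unbuffered voting
    return acc, thetas, diag
```

<p class="wp-block-paragraph">Because the votes come straight from the binary edge map, the quality of the Canny output directly bounds the quality of the detected lines.</p>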



<p class="wp-block-paragraph">That brings us to a close – thanks for reading, I hope this post was insightful and entertaining. Stay curious! <img src="https://i1.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f4a1.png?w=578&#038;ssl=1" alt="&#x1f4a1;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" /></p>



<h1 class="wp-block-heading">References <img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f4d6.png?w=578&#038;ssl=1" alt="&#x1f4d6;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" /></h1>


<ol class="wp-block-footnotes"><li id="563b7add-8b04-4fd2-a688-2383895c42c9">Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., &#038; Polosukhin, I. (2017). <em>Attention Is All You Need.</em> arXiv:1706.03762. <a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#563b7add-8b04-4fd2-a688-2383895c42c9-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li><li id="bd3dda1f-27f7-44e9-8d0f-9a42c28ed201">Vision Banana team, Google DeepMind (2026). <em>Image Generators are Generalist Vision Learners.</em> arXiv:2604.20329. <a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#bd3dda1f-27f7-44e9-8d0f-9a42c28ed201-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li><li id="63b9bf52-e78a-405a-bc33-5082dc51f74e">Robicheaux, P. <em>et al.</em> (2025). <em>RF-DETR: Neural Architecture Search for Real-Time Detection Transformers.</em> arXiv:2511.09554 (ICLR 2026). <a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#63b9bf52-e78a-405a-bc33-5082dc51f74e-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li><li id="9740d4a1-35a7-4003-9e9f-63a4fa16b90b">Kudithipudi, D. <em>et al.</em> (2025). <em>Neuromorphic computing at scale.</em> Nature 637, 801–812. 
<a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#9740d4a1-35a7-4003-9e9f-63a4fa16b90b-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li><li id="58f9bd6d-ec82-4b9b-8285-5f8b083184ad">Ashtiani, F., Idjadi, M. H., &#038; Kim, K. (2026). <em>Integrated photonic neural network with on-chip backpropagation training.</em> Nature 651, 927–932. <a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#58f9bd6d-ec82-4b9b-8285-5f8b083184ad-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li><li id="7b848804-3070-4e9f-bc5f-ea3e60f0bf14">Assran, M. <em>et al.</em> (2023). <em>Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA).</em> arXiv:2301.08243. <a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#7b848804-3070-4e9f-bc5f-ea3e60f0bf14-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li><li id="d6a1d31f-a93f-4844-8b90-70b4da955016">Marr, D. (1982). <em>Vision: A Computational Investigation into the Human Representation and Processing of Visual Information.</em> W. H. Freeman; reissued by MIT Press (2010). 
<a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#d6a1d31f-a93f-4844-8b90-70b4da955016-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li><li id="210cf9a3-ea19-4402-b37e-c2535bd96366">Darwin, C. (1859). <em>On the Origin of Species by Means of Natural Selection.</em> John Murray, London. <a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#210cf9a3-ea19-4402-b37e-c2535bd96366-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li><li id="8d6afae8-ee06-495e-9b78-9705d6088f63">Thompson, D’A. W. (1917). <em>On Growth and Form.</em> Cambridge University Press. <a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#8d6afae8-ee06-495e-9b78-9705d6088f63-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li><li id="ab32dfeb-f0f5-476f-9b2a-ad0939f8514f">Julesz, B. (1971). <em>Foundations of Cyclopean Perception.</em> University of Chicago Press. <a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#ab32dfeb-f0f5-476f-9b2a-ad0939f8514f-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li><li id="4b5583df-7b37-4973-9b87-cd2a13af4711">Marr, D., &#038; Hildreth, E. (1980). 
<em>Theory of edge detection.</em> Proceedings of the Royal Society of London B, 207(1167), 187–217. <a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#4b5583df-7b37-4973-9b87-cd2a13af4711-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li><li id="b51508a4-906a-492d-87d6-4a3a4971d9cc">Canny, J. (1986). <em>A Computational Approach to Edge Detection.</em> IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8(6), 679–698. <a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#b51508a4-906a-492d-87d6-4a3a4971d9cc-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li></ol>
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/"> poissonisfish</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/edge-detection-in-python/">Edge detection in Python</a>]]></content:encoded>
					
		
		<enclosure url="https://1.gravatar.com/avatar/ddaccde3ecefe0821900911d3cd41d541083048d067f5e78cd9d597f0ea3ceaa?s=96&#038;d=identicon&#038;r=G" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/05/butterfly_canny.png" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/03/step_edge_derivatives.png" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/05/laplacian_of_gaussian.png" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/05/zero_cross_deblurred.png?w=1024" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/05/prewitt_sobel.png" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/05/canny_workflow.png?w=1024" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/05/camera_original-1.png" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/05/camera_zerocross-1.png?w=1024" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/05/camera_smooth_grad-1.png?w=1024" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/05/camera_canny-1.png?w=1024" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/05/camera_hyst_thresh-1.png?w=1024" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/05/camera_overlay-1.png?w=1024" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">401133</post-id>	</item>
		<item>
		<title>Differencing: A Transformation or a Trap?</title>
		<link>https://www.r-bloggers.com/2026/05/differencing-a-transformation-or-a-trap/</link>
		
		<dc:creator><![CDATA[M. Fatih Tüzen]]></dc:creator>
		<pubDate>Thu, 07 May 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; ">
<p>1 Introduction<br />
Differencing is one of the most common transformations in time series analysis.<br />
It is also one of the easiest transformations to misunderstand.<br />
In many ARIMA-style workflows, differencing is introduced almost mechanically: i...</p></div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/differencing-a-transformation-or-a-trap/">Differencing: A Transformation or a Trap?</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/"> A Statistician&#039;s R Notebook</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://i1.wp.com/mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/timeseries_differencing.png?w=578&#038;ssl=1" class="img-fluid quarto-figure quarto-figure-center figure-img" data-recalc-dims="1"></p>
</figure>
</div>
<section id="introduction" class="level1" data-number="1">
<h1 data-number="1"><span class="header-section-number">1</span> Introduction</h1>
<p>Differencing is one of the most common transformations in time series analysis.</p>
<p>It is also one of the easiest transformations to misunderstand.</p>
<p>In many ARIMA-style workflows, differencing is introduced almost mechanically: if a series is not stationary, take a difference; if it still appears non-stationary, take another one. While this advice is not entirely wrong, it can quietly create a dangerous habit. Differencing is not merely a technical preprocessing step — it changes the object of analysis itself.</p>
<p>In the previous article of this series, <em>Why Most Time Series Models Fail Before They Start</em>, we explored stationarity using real CPI data and discussed why many forecasting problems begin long before model estimation. The central idea was simple but important: unstable statistical properties can make even sophisticated models misleading.</p>
<p>You can read the first article here:</p>
<p><a href="https://mfatihtuzen.github.io/posts/2026-04-16_timeseries_stationary/" class="uri" rel="nofollow" target="_blank">https://mfatihtuzen.github.io/posts/2026-04-16_timeseries_stationary/</a></p>
<p>This article continues that discussion with a more subtle question:</p>
<blockquote class="blockquote">
<p>What exactly happens when we difference a time series?</p>
</blockquote>
<p>To explore this question, we will use the <strong>S&#038;P CoreLogic Case-Shiller U.S. National Home Price Index</strong>, available from the Federal Reserve Economic Data (FRED) database under the code <code>CSUSHPINSA</code>.</p>
<p>FRED series link:</p>
<p><a href="https://fred.stlouisfed.org/series/CSUSHPINSA" class="uri" rel="nofollow" target="_blank">https://fred.stlouisfed.org/series/CSUSHPINSA</a></p>
<p>The series tracks U.S. national home prices and provides a rich real-world example: long-run growth, sharp reversals during the housing crisis, and rapid post-pandemic acceleration.</p>
<p>That makes it an ideal setting for a deeper lesson:</p>
<blockquote class="blockquote">
<p>Differencing can stabilize a series, but it can also reshape the structure of the signal.</p>
</blockquote>
</section>
<section id="setup" class="level1" data-number="2">
<h1 data-number="2"><span class="header-section-number">2</span> Setup</h1>
<p>The data used in this article can be downloaded directly from FRED:</p>
<p><a href="https://fred.stlouisfed.org/graph/fredgraph.csv?id=CSUSHPINSA" class="uri" rel="nofollow" target="_blank">https://fred.stlouisfed.org/graph/fredgraph.csv?id=CSUSHPINSA</a></p>
<p>For reproducibility, the CSV file is saved in the same folder as this Quarto document.</p>
<div class="cell">
<pre>library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)
library(slider)
library(forecast)
library(tseries)
library(scales)

theme_set(theme_minimal(base_size = 13))</pre>
</div>
<div class="cell">
<pre>hpi &lt;- read_csv(&quot;CSUSHPINSA.csv&quot;, show_col_types = FALSE) %&gt;%
  transmute(
    date = as.Date(observation_date),
    hpi  = as.numeric(CSUSHPINSA)
  ) %&gt;%
  arrange(date) %&gt;%
  filter(!is.na(date), !is.na(hpi))

hpi %&gt;% slice_head(n = 5)</pre>
<div class="cell-output cell-output-stdout">
<pre># A tibble: 5 × 2
  date         hpi
  &lt;date&gt;     &lt;dbl&gt;
1 1987-01-01  63.7
2 1987-02-01  64.1
3 1987-03-01  64.5
4 1987-04-01  65.0
5 1987-05-01  65.5</pre>
</div>
</div>
<p>We will create several transformed versions of the series.</p>
<div class="cell">
<pre>hpi &lt;- hpi %&gt;%
  mutate(
    diff_1   = hpi - lag(hpi),
    diff_2   = diff_1 - lag(diff_1),
    log_hpi  = log(hpi),
    log_diff = log_hpi - lag(log_hpi)
  )</pre>
</div>
<p>The variables have different meanings:</p>
<ul>
<li><code>hpi</code>: the index level itself</li>
<li><code>diff_1</code>: monthly absolute change in the index</li>
<li><code>diff_2</code>: change in the monthly change</li>
<li><code>log_diff</code>: approximate monthly proportional change</li>
</ul>
<p>This distinction matters. Transformations are not neutral. Each one changes what the series represents.</p>
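<p>The reason <code>log_diff</code> approximates a proportional change is the standard first-order expansion of the logarithm (a general identity, not specific to this series):</p>

```latex
\log x_t - \log x_{t-1}
  = \log\left(1 + \frac{x_t - x_{t-1}}{x_{t-1}}\right)
  \approx \frac{x_t - x_{t-1}}{x_{t-1}}
```

<p>The approximation is accurate when the monthly change is small relative to the level, which is typically the case for a price index.</p>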
</section>
<section id="the-raw-series-persistence-everywhere" class="level1" data-number="3">
<h1 data-number="3"><span class="header-section-number">3</span> The raw series: persistence everywhere</h1>
<p>Let us begin with the raw housing price index.</p>
<div class="cell">
<pre>ggplot(hpi, aes(date, hpi)) +
  geom_line(linewidth = 0.8, color = &quot;#1f4e5f&quot;) +
  labs(
    title = &quot;Raw Housing Price Index: Strong Persistence and Long-Run Trend&quot;,
    subtitle = &quot;S&P CoreLogic Case-Shiller U.S. National Home Price Index&quot;,
    x = NULL,
    y = &quot;Index&quot;
  )</pre>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://i2.wp.com/mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/index_files/figure-html/raw-series-plot-1.png?w=450&#038;ssl=1" class="img-fluid figure-img"  data-recalc-dims="1"></p>
</figure>
</div>
</div>
</div>
<p>Even before running a formal statistical test, the plot already reveals something important. The series does not fluctuate around a stable mean. Instead, it exhibits strong persistence and a pronounced long-run upward movement.</p>
<p>Several major economic episodes are immediately visible: the housing boom of the mid-2000s, the collapse following the global financial crisis, the gradual recovery during the 2010s, and the rapid acceleration after 2020.</p>
<p>This is clearly not a series that looks ready for direct stationary modeling.</p>
<p>But the key issue is not simply the presence of trend. The trend itself carries economic meaning. Housing prices are not merely noisy observations around a fixed level; they reflect long-run structural forces such as credit conditions, interest rates, demographic demand, construction constraints, and broader macroeconomic cycles.</p>
<p>This creates the central tension behind differencing:</p>
<blockquote class="blockquote">
<p>By removing persistence, we may improve the statistical properties of the series — while simultaneously weakening part of its long-run economic signal.</p>
</blockquote>
</section>
<section id="the-acf-of-the-raw-series" class="level1" data-number="4">
<h1 data-number="4"><span class="header-section-number">4</span> The ACF of the raw series</h1>
<p>The autocorrelation function provides another perspective on the same phenomenon.</p>
<div class="cell">
<pre>forecast::ggAcf(na.omit(hpi$hpi), lag.max = 26) +
  labs(
    title = &quot;ACF of Raw Housing Price Index&quot;,
    x = &quot;Lag&quot;,
    y = &quot;ACF&quot;
  )</pre>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://i1.wp.com/mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/index_files/figure-html/raw-acf-1.png?w=450&#038;ssl=1" class="img-fluid figure-img"  data-recalc-dims="1"></p>
</figure>
</div>
</div>
</div>
<p>The ACF declines extremely slowly and remains strongly positive even at relatively long lags. This is one of the classic visual signatures of a highly persistent process.</p>
<p>In practical terms, today’s housing price index is strongly related to its past values. That is not surprising. Housing markets do not reset from month to month; they evolve gradually through credit conditions, market expectations, supply constraints, and macroeconomic forces.</p>
<p>From a modeling standpoint, however, this dependence structure creates a challenge. Methods built around stationarity assumptions may struggle to distinguish genuine short-run dynamics from long-run drift if we model the raw level directly.</p>
<p>To formalize this intuition, let us turn to the Augmented Dickey–Fuller (ADF) test.</p>
<div class="cell">
<pre>adf_level &lt;- tseries::adf.test(na.omit(hpi$hpi))
adf_level</pre>
<div class="cell-output cell-output-stdout">
<pre>
    Augmented Dickey-Fuller Test

data:  na.omit(hpi$hpi)
Dickey-Fuller = -0.97386, Lag order = 7, p-value = 0.9427
alternative hypothesis: stationary</pre>
</div>
</div>
<p>The ADF test fails to reject the null hypothesis of a unit root for the raw series. In other words, there is no statistical evidence supporting stationarity in the housing price index at the level scale.</p>
<p>This result aligns closely with what we already observed visually: the series behaves more like a drifting process than a stable mean-reverting one.</p>
<p>So far, the standard recommendation appears sensible:</p>
<blockquote class="blockquote">
<p>If the series is non-stationary, take a difference.</p>
</blockquote>
</section>
<section id="first-differencing-less-trend-but-not-no-structure" class="level1" data-number="5">
<h1 data-number="5"><span class="header-section-number">5</span> First differencing: less trend, but not no structure</h1>
<p>A first difference replaces the level of the series with its period-to-period change:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CDelta%20x_t%20=%20x_t%20-%20x_%7Bt-1%7D.%0A"></p>
<p>This operation is often described as “removing the trend.” That description is useful, but incomplete.</p>
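<p>A textbook case shows why the description is incomplete (a standard result, not derived from the CSUSHPINSA data): if the series is a deterministic trend plus stationary noise,</p>

```latex
x_t = \alpha + \beta t + \varepsilon_t
\quad\Longrightarrow\quad
\Delta x_t = \beta + \varepsilon_t - \varepsilon_{t-1}
```

<p>the first difference is stationary, but it now carries a non-invertible MA(1) component. Differencing removed the trend and, at the same time, introduced new dependence that was not present in the original noise.</p>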
<p>Let us now examine the first-differenced series.</p>
<div class="cell">
<pre>ggplot(hpi, aes(date, diff_1)) +
  geom_line(linewidth = 0.7, color = &quot;#d95f02&quot;, na.rm = TRUE) +
  labs(
    title = &quot;First Difference of the Housing Price Index&quot;,
    subtitle = &quot;Monthly absolute change in the index&quot;,
    x = NULL,
    y = &quot;Δ Index&quot;
  )</pre>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://i1.wp.com/mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/index_files/figure-html/first-difference-plot-1.png?w=450&#038;ssl=1" class="img-fluid figure-img"  data-recalc-dims="1"></p>
</figure>
</div>
</div>
</div>
<p>The transformation clearly changes the behavior of the data. The dominant upward drift visible in the raw housing price index is no longer the central feature. Instead, the series fluctuates around a much more stable level.</p>
<p>That is the good news.</p>
<p>But something equally important remains: the differenced series is still highly structured. It does not resemble white noise. Distinct regimes, bursts of volatility, and recurring short-run movements are still visible throughout the series.</p>
<p>The periods surrounding the housing crisis and the post-pandemic surge are especially revealing. The magnitude of month-to-month changes increases sharply, and the volatility structure itself becomes more pronounced.</p>
<p>In other words, differencing reduced the trend — but it did not eliminate dependence.</p>
<p>This is a crucial distinction.</p>
<p>A transformed series can become statistically more manageable while still retaining meaningful internal structure. That is precisely why treating differencing as a mechanical preprocessing step can be misleading.</p>
</section>
<section id="the-acf-after-first-differencing" class="level1" data-number="6">
<h1 data-number="6"><span class="header-section-number">6</span> The ACF after first differencing</h1>
<p>Let us now inspect the autocorrelation structure after first differencing.</p>
<div class="cell">
<pre>forecast::ggAcf(na.omit(hpi$diff_1), lag.max = 26) +
  labs(
    title = &quot;ACF of First Difference&quot;,
    x = &quot;Lag&quot;,
    y = &quot;ACF&quot;
  )</pre>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://i2.wp.com/mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/index_files/figure-html/first-diff-acf-1.png?w=450&#038;ssl=1" class="img-fluid figure-img"  data-recalc-dims="1"></p>
</figure>
</div>
</div>
</div>
<p>The ACF is no longer dominated by the extremely slow decay observed in the raw housing price index. This is an important change. The transformation has substantially reduced the long-run persistence associated with the level series.</p>
<p>But the structure has not vanished.</p>
<p>Several lags remain clearly significant, and the series still exhibits meaningful short-run dynamics. Cyclical patterns and medium-range dependence are still visible, suggesting that the transformation reduced the trend without erasing the internal behavior of the process.</p>
<p>This is particularly important in housing markets, where adjustments tend to occur gradually rather than instantaneously. Prices respond over time through financing conditions, supply rigidities, expectations, and broader economic cycles.</p>
<p>A common beginner misconception is that differencing should transform a series into white noise. It should not. If every form of dependence disappeared completely, there would be little left to model.</p>
<p>The goal of differencing is not to destroy structure. The goal is to remove problematic non-stationarity while preserving meaningful dynamics.</p>
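<p>One way to see this distinction in practice is the Ljung–Box test, which checks whether autocorrelation remains after a transformation. In the hedged sketch below (simulated data, not the article’s series), the level series has a unit root, yet its first difference still carries short-run dependence that the test detects:</p>
<div class="cell">
<pre>set.seed(123)
increments &lt;- arima.sim(model = list(ar = 0.7), n = 500)  # AR(1) shocks
x &lt;- cumsum(increments)                                   # unit-root level series
Box.test(diff(x), lag = 12, type = &quot;Ljung-Box&quot;)           # small p-value: dependence remains</pre>
</div>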
<p>The Augmented Dickey–Fuller test now tells a very different statistical story.</p>
<div class="cell">
<pre>adf_diff1 &lt;- tseries::adf.test(na.omit(hpi$diff_1))
adf_diff1</pre>
<div class="cell-output cell-output-stdout">
<pre>
    Augmented Dickey-Fuller Test

data:  na.omit(hpi$diff_1)
Dickey-Fuller = -3.9775, Lag order = 7, p-value = 0.01019
alternative hypothesis: stationary</pre>
</div>
</div>
<p>The ADF test rejects the null hypothesis of a unit root for the first-differenced series. Statistically speaking, the transformation appears successful: the series is now much more compatible with stationarity assumptions.</p>
<p>But this is where a subtle danger begins.</p>
<p>Once a transformation starts “working,” it becomes tempting to continue applying it mechanically. And that raises an important question:</p>
<blockquote class="blockquote">
<p>What happens if we difference the series again?</p>
</blockquote>
</section>
<section id="second-differencing-cleaner-or-distorted" class="level1" data-number="7">
<h1 data-number="7"><span class="header-section-number">7</span> Second differencing: cleaner or distorted?</h1>
<p>If one difference helps, should two differences help even more?</p>
<p>This is where the trap begins.</p>
<p>A second difference is defined as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CDelta%5E2%20x_t%20=%20%5CDelta%20x_t%20-%20%5CDelta%20x_%7Bt-1%7D.%0A"></p>
<p>Conceptually, it measures the change in the change. In our case, the transformation no longer asks how much housing prices change from one month to the next. Instead, it asks whether those monthly changes themselves are accelerating or decelerating.</p>
<p>That is a fundamentally different question.</p>
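<p>In base R, a second difference can be taken either by differencing twice or with the <code>differences</code> argument of <code>diff()</code>. A toy sketch (not the article’s data):</p>
<div class="cell">
<pre>x &lt;- c(100, 102, 105, 104, 108)
diff(diff(x))             # change in the change
diff(x, differences = 2)  # equivalent shortcut; both give 1 -4 5</pre>
</div>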
<div class="cell">
<pre>ggplot(hpi, aes(date, diff_2)) +
  geom_line(linewidth = 0.7, color = &quot;#7b3294&quot;, na.rm = TRUE) +
  labs(
    title = &quot;Second Difference of the Housing Price Index&quot;,
    subtitle = &quot;Change in the monthly change&quot;,
    x = NULL,
    y = &quot;Δ² Index&quot;
  )</pre>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://i1.wp.com/mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/index_files/figure-html/second-difference-plot-1.png?w=450&#038;ssl=1" class="img-fluid figure-img"  data-recalc-dims="1"></p>
</figure>
</div>
</div>
</div>
<p>The second-differenced series appears more centered, but also far more oscillatory. It reacts strongly to turning points, reversals, and short-run fluctuations. At the same time, however, it becomes increasingly difficult to interpret in economic terms.</p>
<p>This is where the statistical and substantive perspectives begin to diverge.</p>
<p>From a purely statistical viewpoint, the second difference may appear attractive because the series now looks even more stationary. But statistical improvement alone does not guarantee that the transformed series remains meaningful for analysis or forecasting.</p>
<p>The key question is no longer:</p>
<blockquote class="blockquote">
<p>“Did we remove non-stationarity?”</p>
</blockquote>
<p>The key question becomes:</p>
<blockquote class="blockquote">
<p>“What happened to the original signal?”</p>
</blockquote>
</section>
<section id="the-acf-after-second-differencing" class="level1" data-number="8">
<h1 data-number="8"><span class="header-section-number">8</span> The ACF after second differencing</h1>
<p>The autocorrelation structure after second differencing makes the issue even clearer.</p>
<div class="cell">
<pre>forecast::ggAcf(na.omit(hpi$diff_2), lag.max = 26) +
  labs(
    title = &quot;ACF of Second Difference&quot;,
    x = &quot;Lag&quot;,
    y = &quot;ACF&quot;
  )</pre>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://i1.wp.com/mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/index_files/figure-html/second-diff-acf-1.png?w=450&#038;ssl=1" class="img-fluid figure-img"  data-recalc-dims="1"></p>
</figure>
</div>
</div>
</div>
<p>The pattern is now fundamentally different from what we observed earlier. The raw housing price index exhibited strong long-run persistence, while the first difference retained a more moderate and interpretable dependence structure. The second difference, however, introduces a much more alternating and oscillatory behavior.</p>
<p>This is one of the classic warning signs of over-differencing.</p>
<p>Excessive differencing can artificially induce negative dependence and amplify short-run fluctuations that were far less dominant in the original data. In practical terms, the transformation may begin to reshape the signal rather than simply stabilize it.</p>
<p>In other words:</p>
<blockquote class="blockquote">
<p>The second difference may look statistically cleaner, while simultaneously becoming substantively less meaningful.</p>
</blockquote>
<p>Let us now examine the Augmented Dickey–Fuller result.</p>
<div class="cell">
<pre>adf_diff2 &lt;- tseries::adf.test(na.omit(hpi$diff_2))
adf_diff2</pre>
<div class="cell-output cell-output-stdout">
<pre>
    Augmented Dickey-Fuller Test

data:  na.omit(hpi$diff_2)
Dickey-Fuller = -16.035, Lag order = 7, p-value = 0.01
alternative hypothesis: stationary</pre>
</div>
</div>
<p>The ADF test strongly rejects the null hypothesis of a unit root for the second-differenced series. In fact, the warning message suggests that the p-value is even smaller than the value printed by the function.</p>
<p>From a purely statistical perspective, this might appear highly desirable. The transformation seems extremely successful at producing stationarity.</p>
<p>But this creates a useful paradox.</p>
<blockquote class="blockquote">
<p>The test becomes increasingly confident — but should we?</p>
</blockquote>
<p>A more stationary series is not automatically a better modeling target. Sometimes it is simply a more aggressively transformed version of the original data, with less economically meaningful structure left to explain.</p>
<p>At this point, another question naturally emerges:</p>
<blockquote class="blockquote">
<p>Is repeated ordinary differencing always the most meaningful transformation for economic time series?</p>
</blockquote>
</section>
<section id="a-brief-note-on-log-differencing" class="level1" data-number="9">
<h1 data-number="9"><span class="header-section-number">9</span> A brief note on log differencing</h1>
<p>So far, we have focused on ordinary differencing based on absolute changes. But in many economic and financial applications, analysts often prefer log differences instead.</p>
<p>Why?</p>
<p>Because the interpretation of absolute changes becomes increasingly problematic when the scale of a series evolves over time. A one-point increase in a housing price index does not carry the same meaning when the index is near 80 and when it exceeds 300.</p>
<p>Log differencing addresses this issue by focusing on proportional change rather than absolute change:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CDelta%20%5Clog(x_t)%20=%20%5Clog(x_t)%20-%20%5Clog(x_%7Bt-1%7D).%0A"></p>
<p>For relatively small changes, this quantity closely approximates the growth rate:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CDelta%20%5Clog(x_t)%20%5Capprox%20%5Cfrac%7Bx_t%20-%20x_%7Bt-1%7D%7D%7Bx_%7Bt-1%7D%7D.%0A"></p>
<p>This is one reason why log differences are widely used in macroeconomics, inflation analysis, and financial modeling. They often provide a more interpretable representation of economic dynamics because they express changes relative to the current scale of the series.</p>
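<p>A hedged sketch of this approximation on a toy index (not the article’s data) shows how closely log differences track exact growth rates when changes are small:</p>
<div class="cell">
<pre>x &lt;- c(100, 103, 110, 115)
log_diff    &lt;- diff(log(x))             # Δ log(x_t)
growth_rate &lt;- diff(x) / head(x, -1)    # (x_t - x_{t-1}) / x_{t-1}
round(cbind(log_diff, growth_rate), 4)  # the two columns nearly coincide</pre>
</div>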
<p>But an important caution remains.</p>
<p>Log differencing does not eliminate the broader trade-offs discussed in this article. It still transforms the dependence structure of the data, and it still changes the underlying modeling question.</p>
<p>The key lesson is therefore not:</p>
<blockquote class="blockquote">
<p>“Which transformation is universally correct?”</p>
</blockquote>
<p>The real question is:</p>
<blockquote class="blockquote">
<p>“Which transformation preserves the most meaningful structure for the problem we are trying to study?”</p>
</blockquote>
</section>
<section id="comparing-the-three-versions" class="level1" data-number="10">
<h1 data-number="10"><span class="header-section-number">10</span> Comparing the three versions</h1>
<p>A direct comparison makes the effect of differencing much easier to see. The figure below summarizes the central theme of this article.</p>
<div class="cell">
<pre>hpi_long &lt;- hpi %&gt;%
  select(date, hpi, diff_1, diff_2) %&gt;%
  pivot_longer(
    cols = c(hpi, diff_1, diff_2),
    names_to = &quot;series&quot;,
    values_to = &quot;value&quot;
  ) %&gt;%
  mutate(
    series = recode(
      series,
      hpi = &quot;Raw level&quot;,
      diff_1 = &quot;First difference&quot;,
      diff_2 = &quot;Second difference&quot;
    ),
    series = factor(series, levels = c(&quot;Raw level&quot;, &quot;First difference&quot;, &quot;Second difference&quot;))
  )

ggplot(hpi_long, aes(date, value)) +
  geom_line(linewidth = 0.7, color = &quot;#2c3e50&quot;, na.rm = TRUE) +
  facet_wrap(~ series, scales = &quot;free_y&quot;, ncol = 1) +
  labs(
    title = &quot;Raw series, first difference, and second difference&quot;,
    subtitle = &quot;Each transformation changes both the statistical properties and the interpretation&quot;,
    x = NULL,
    y = NULL
  )</pre>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://i0.wp.com/mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/index_files/figure-html/compare-series-1.png?w=450&#038;ssl=1" class="img-fluid figure-img"  data-recalc-dims="1"></p>
</figure>
</div>
</div>
</div>
<p>The raw housing price index contains long-run persistence, structural trend, and broad economic cycles. The first difference shifts attention toward month-to-month changes and substantially reduces the long-run drift. The second difference goes even further, emphasizing acceleration and deceleration in those monthly movements.</p>
<p>Each transformation produces a series with different statistical properties.</p>
<p>But more importantly, each transformation changes the interpretation of the data itself.</p>
<p>That is the crucial point.</p>
<p>These are not simply cleaner or noisier versions of the same series. They are fundamentally different analytical objects, each answering a different question about the underlying process.</p>
</section>
<section id="rolling-volatility-transformation-does-not-solve-everything" class="level1" data-number="11">
<h1 data-number="11"><span class="header-section-number">11</span> Rolling volatility: transformation does not solve everything</h1>
<p>Differencing may stabilize the mean of a series, but it does not guarantee stable variance.</p>
<div class="cell">
<pre>hpi &lt;- hpi %&gt;%
  mutate(
    roll_sd_diff1 = slider::slide_dbl(diff_1, sd, .before = 24, .complete = TRUE),
    roll_sd_diff2 = slider::slide_dbl(diff_2, sd, .before = 24, .complete = TRUE)
  )</pre>
</div>
<div class="cell">
<pre>hpi %&gt;%
  select(date, roll_sd_diff1, roll_sd_diff2) %&gt;%
  pivot_longer(
    cols = c(roll_sd_diff1, roll_sd_diff2),
    names_to = &quot;series&quot;,
    values_to = &quot;rolling_sd&quot;
  ) %&gt;%
  mutate(
    series = recode(
      series,
      roll_sd_diff1 = &quot;First difference&quot;,
      roll_sd_diff2 = &quot;Second difference&quot;
    )
  ) %&gt;%
  ggplot(aes(date, rolling_sd, color = series)) +
  geom_line(linewidth = 0.8, na.rm = TRUE) +
  scale_color_manual(values = c(&quot;First difference&quot; = &quot;#d95f02&quot;, &quot;Second difference&quot; = &quot;#7b3294&quot;)) +
  labs(
    title = &quot;24-month rolling standard deviation&quot;,
    subtitle = &quot;Differencing changes the volatility structure too&quot;,
    x = NULL,
    y = &quot;Rolling SD&quot;,
    color = NULL
  )</pre>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://i0.wp.com/mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/index_files/figure-html/rolling-volatility-plot-1.png?w=450&#038;ssl=1" class="img-fluid figure-img"  data-recalc-dims="1"></p>
</figure>
</div>
</div>
</div>
<p>The rolling standard deviation highlights an important lesson that is often overlooked in introductory time series discussions: stationarity is not a single on–off property. A transformation can improve one aspect of the data while leaving other forms of instability unresolved.</p>
<p>The housing price series illustrates this clearly. Even after differencing, the post-2020 period remains substantially more volatile than earlier decades. Large swings, volatility bursts, and changing dispersion are still visible in both transformed series.</p>
<p>This matters because many classical time series models implicitly assume not only stable mean behavior, but also relatively stable variance structure.</p>
<p>A model that ignores changing volatility may appear statistically successful while still producing fragile forecasts and misleading uncertainty estimates in practice.</p>
<p>In other words:</p>
<blockquote class="blockquote">
<p>Differencing can reduce trend-related non-stationarity without fully stabilizing the broader dynamics of the process.</p>
</blockquote>
</section>
<section id="a-compact-comparison" class="level1" data-number="12">
<h1 data-number="12"><span class="header-section-number">12</span> A compact comparison</h1>
<p>The table below summarizes both the transformations examined directly in this article and closely related alternatives frequently used in applied time series analysis.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th>Series version</th>
<th>What it represents</th>
<th>What improves</th>
<th>What may be lost</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Raw level</td>
<td>Housing price index itself</td>
<td>Preserves long-run economic structure and trend information</td>
<td>Strong persistence and unit-root-like behavior</td>
</tr>
<tr class="even">
<td>First difference</td>
<td>Monthly absolute change</td>
<td>Reduces long-run drift and improves stationarity properties</td>
<td>Level interpretation and part of the long-run dependence structure</td>
</tr>
<tr class="odd">
<td>Second difference</td>
<td>Change in monthly change</td>
<td>Produces an even stronger stationarity signal</td>
<td>Economic interpretability and smoother dependence dynamics</td>
</tr>
<tr class="even">
<td>Log difference</td>
<td>Approximate proportional change</td>
<td>Often provides a more scale-adjusted interpretation of change</td>
<td>May still contain volatility shifts, structural breaks, or persistence</td>
</tr>
</tbody>
</table>
<p>No transformation is universally best. The appropriate choice depends on the analytical question, the structure of the data, and the type of signal we want to preserve.</p>
</section>
<section id="the-paradox-of-differencing" class="level1" data-number="13">
<h1 data-number="13"><span class="header-section-number">13</span> The paradox of differencing</h1>
<p>Differencing is powerful because it reduces persistence.</p>
<p>But that is also where the danger begins.</p>
<p>Persistence is not always a statistical nuisance. In many economic and financial time series, persistence is part of the signal itself. Long-run movements in housing prices, inflation, production, or income are often economically meaningful features of the process rather than accidental artifacts.</p>
<p>This creates a practical tension at the heart of time series modeling.</p>
<p>If we difference too little, we may mistake long-run drift for stable structure.</p>
<p>If we difference too aggressively, we may weaken meaningful dependence and end up modeling noise-like fluctuations instead of economically relevant dynamics.</p>
<p>And if we difference mechanically, without thinking carefully about interpretation, we may ultimately answer a question nobody intended to ask.</p>
<p>That is why differencing should not be treated as a preprocessing ritual.</p>
<p>It is a modeling decision.</p>
</section>
<section id="common-mistakes" class="level1" data-number="14">
<h1 data-number="14"><span class="header-section-number">14</span> Common mistakes</h1>
<p>Most mistakes with differencing are not computational. They are conceptual.</p>
<p><strong>Mistake 1: assuming first differencing automatically solves the problem</strong></p>
<p>First differencing often reduces trend and improves stationarity properties, but it does not guarantee white noise, stable variance, or a well-specified model.</p>
<p><strong>Mistake 2: increasing the differencing order simply because a test improves</strong></p>
<p>A second difference may appear statistically “better” according to a unit root test, but that does not automatically make it a more meaningful modeling target.</p>
<p><strong>Mistake 3: forgetting that differencing changes the question</strong></p>
<p>Modeling levels, monthly changes, and changes in monthly changes are fundamentally different analytical tasks.</p>
<p><strong>Mistake 4: ignoring the ACF after transformation</strong></p>
<p>The ACF is not merely a diagnostic plot. It reveals how the dependence structure of the series has been reshaped by the transformation itself.</p>
<p><strong>Mistake 5: treating preprocessing as separate from modeling</strong></p>
<p>Every transformation changes what the model sees. And once the model sees a different series, the modeling problem itself has changed.</p>
</section>
<section id="practical-workflow" class="level1" data-number="15">
<h1 data-number="15"><span class="header-section-number">15</span> Practical workflow</h1>
<p>A sensible differencing workflow should not begin with the question:</p>
<blockquote class="blockquote">
<p>“How many differences do I need?”</p>
</blockquote>
<p>A better workflow is something closer to this:</p>
<ol type="1">
<li>Plot the raw series.</li>
<li>Ask what the level of the series actually represents.</li>
<li>Inspect the autocorrelation structure.</li>
<li>Apply the smallest transformation that addresses the main statistical problem.</li>
<li>Re-examine the transformed series visually.</li>
<li>Re-check the dependence structure using the ACF.</li>
<li>Use tests such as the ADF test as supporting evidence rather than final truth.</li>
<li>Ask whether the transformed series still answers the substantive question of interest.</li>
</ol>
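<p>As supporting evidence for step 7, the <code>forecast</code> package offers <code>ndiffs()</code>, which estimates a differencing order from unit-root tests. A hedged sketch on simulated data (not the article’s series); treat its answer as one input to the decision, not a verdict:</p>
<div class="cell">
<pre>library(forecast)
set.seed(42)
x &lt;- cumsum(rnorm(300, mean = 0.2))  # toy series with drift
ndiffs(x)                            # suggested number of differences, typically 1 here</pre>
</div>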
<p>This workflow is slower than blindly calling <code>auto.arima()</code> and accepting whatever transformation it selects automatically. But it is also safer. And in real analytical work, safer usually wins.</p>
</section>
<section id="final-thoughts" class="level1" data-number="16">
<h1 data-number="16"><span class="header-section-number">16</span> Final thoughts</h1>
<p>Differencing is not a trap by itself.</p>
<p>It becomes a trap when we start treating it as a harmless default.</p>
<p>The housing price example illustrates this tension clearly. The raw series is highly persistent and visibly non-stationary. The first difference improves the statistical behavior of the data while still preserving meaningful short-run dynamics. The second difference pushes the series even further toward stationarity, but it also reshapes the dependence structure and weakens the direct economic interpretation.</p>
<p>This is the central trade-off behind differencing.</p>
<p>The real question is not:</p>
<blockquote class="blockquote">
<p>“Is the series stationary now?”</p>
</blockquote>
<p>The more difficult — and ultimately more useful — question is:</p>
<blockquote class="blockquote">
<p>“After transformation, am I still modeling the signal I actually care about?”</p>
</blockquote>
<p>That question matters far more than the differencing order itself.</p>
</section>
<section id="references-and-further-reading" class="level1" data-number="17">
<h1 data-number="17"><span class="header-section-number">17</span> References and further reading</h1>
<p><strong>Data source</strong></p>
<ul>
<li><p>Federal Reserve Bank of St. Louis. <em>S&P CoreLogic Case-Shiller U.S. National Home Price Index (CSUSHPINSA).</em><br>
<a href="https://fred.stlouisfed.org/series/CSUSHPINSA" class="uri" rel="nofollow" target="_blank">https://fred.stlouisfed.org/series/CSUSHPINSA</a></p></li>
<li><p>FRED CSV download link used in this article:<br>
<a href="https://fred.stlouisfed.org/graph/fredgraph.csv?id=CSUSHPINSA" class="uri" rel="nofollow" target="_blank">https://fred.stlouisfed.org/graph/fredgraph.csv?id=CSUSHPINSA</a></p></li>
</ul>
<p><strong>Core time series references</strong></p>
<ul>
<li><p>Box, G. E. P., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). <em>Time Series Analysis: Forecasting and Control.</em> Wiley.</p></li>
<li><p>Hyndman, R. J., & Athanasopoulos, G. (2021). <em>Forecasting: Principles and Practice</em> (3rd ed.).<br>
<a href="https://otexts.com/fpp3/" class="uri" rel="nofollow" target="_blank">https://otexts.com/fpp3/</a></p></li>
<li><p>Hamilton, J. D. (1994). <em>Time Series Analysis.</em> Princeton University Press.</p></li>
</ul>
<p><strong>Unit roots and differencing</strong></p>
<ul>
<li><p>Dickey, D. A., & Fuller, W. A. (1979). Distribution of the estimators for autoregressive time series with a unit root. <em>Journal of the American Statistical Association.</em></p></li>
<li><p>Said, S. E., & Dickey, D. A. (1984). Testing for unit roots in autoregressive-moving average models of unknown order. <em>Biometrika.</em></p></li>
</ul>
<p><strong>Practical R resources</strong></p>
<ul>
<li><p>R Core Team. <em>R: A Language and Environment for Statistical Computing.</em><br>
<a href="https://www.r-project.org/" class="uri" rel="nofollow" target="_blank">https://www.r-project.org/</a></p></li>
<li><p>Hyndman, R. J. et al. <em>forecast package documentation.</em><br>
<a href="https://pkg.robjhyndman.com/forecast/" class="uri" rel="nofollow" target="_blank">https://pkg.robjhyndman.com/forecast/</a></p></li>
</ul>



</section>

 
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/"> A Statistician&#039;s R Notebook</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/differencing-a-transformation-or-a-trap/">Differencing: A Transformation or a Trap?</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">401116</post-id>	</item>
		<item>
		<title>New Mentoring Team with Experienced Mentors and New Voices</title>
		<link>https://www.r-bloggers.com/2026/05/new-mentoring-team-with-experienced-mentors-and-new-voices/</link>
		
		<dc:creator><![CDATA[rOpenSci]]></dc:creator>
		<pubDate>Wed, 06 May 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://ropensci.org/blog/2026/05/06/mentors-2026/</guid>

					<description><![CDATA[<p>Read it in: Español. We are excited to introduce the new team of mentors for the rOpenSci 2026 Champions Program! This year we have eleven individuals committed to open science, bringing together a rich diversity of backgrounds and perspectives. The t...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/new-mentoring-team-with-experienced-mentors-and-new-voices/">New Mentoring Team with Experienced Mentors and New Voices</a>]]></description>
<content:encoded><![CDATA[

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://ropensci.org/blog/2026/05/06/mentors-2026/"> rOpenSci - open tools for open science</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>

<p><a href='https://ropensci.org/es/blog/2026/05/26/mentoras_es-2026/' rel="nofollow" target="_blank">Read it in: Español</a>.</p> <p>We are excited to introduce the new team of mentors for the rOpenSci 2026 Champions Program! This year we have eleven individuals committed to open science, bringing together a rich diversity of backgrounds and perspectives. The team is made up of people joining the program for the first time, former Champions returning as mentors, and experienced mentors from previous cohorts returning to continue to strengthen this community.</p>
<p>This year’s mentors come from a variety of disciplines and countries, and are active voices in the R community in Latin America and beyond. With their guidance, the new group of Champions will not only develop their projects, but also grow as leaders in open science and research software development.</p>
<h2>
New mentors
</h2><h3>
Alber Hamersson Sánchez Ipia
</h3><figure class="pull-left"><img src="https://i0.wp.com/ropensci.org/img/team/alber-sanchez.jpg?w=250&#038;ssl=1"
alt="Profile photo of Alber Hamersson Sánchez Ipia"  data-recalc-dims="1"><figcaption>
<p><strong>Alber Hamersson Sánchez Ipia </br> Instituto Nacional de Investigación Espacial del Brasil </br> rOpenSci</strong></p>
</figcaption>
</figure>
<p>Hi! I’m Alber and I’m going to be an rOpenSci mentor this year.</p>
<p>I was born in Colombia, in the department of Cauca, in one of the country’s most mountainous regions, called Tierradentro.
I am a Cadastral and Geodetic Engineer from the Francisco José de Caldas District University in Colombia, where I also earned a Master’s in Information and Communication Sciences.
I later completed a second Master’s degree in Geoinformatics at the University of Münster in Germany, and earned a PhD in Earth System Science at the National Institute for Space Research (INPE) in Brazil, where I currently live and work as a research assistant.</p>
<p>Part of my daily work involves writing R code to process spatial data and to ensure the reproducibility of scientific results, so I am familiar with R package development.
I am also a co-author of the segmetric package, currently available on CRAN,
and I maintain one of the Data Carpentry lessons,
the introduction to R for geospatial data.</p>
<p>I am interested in sharing the knowledge and experience I have accumulated so far with anyone who is going to write scientific or statistical software,
particularly in Spanish.
For this reason I am joining rOpenSci,
where I hope to be part of and help build a community of developers.</p>
</br>
</br>
<h3>
Pablo Paccioretti
</h3><figure class="pull-right"><img src="https://i0.wp.com/ropensci.org/img/team/pablo-paccioretti.jpg?w=250&#038;ssl=1"
alt="Profile photo of Pablo Paccioretti"  data-recalc-dims="1"><figcaption>
<p><strong>Pablo Paccioretti <br/> Universidad Nacional de Córdoba (UNC) &#038; CONICET, Argentina <br/> rOpenSci</strong></p>
</figcaption>
</figure>
<p>Hello! I am Pablo, an Agricultural Engineer with a PhD from the National University of Córdoba (Argentina), where I work as a researcher and teacher. Since my student years I have been interested in statistics, which directed my work towards data analysis. In particular, I apply and develop methodologies and software tools to analyze georeferenced data from field trials and agricultural monitoring platforms.</p>
<p>I am interested in the development of open tools for data processing and analysis. I have developed scientific software, including R packages for georeferenced data analysis.</p>
<p>My participation in the Champions Program arises from an interest in strengthening the links between applied data analysis and programming, and promoting good practices in both areas. Through this program I hope to contribute to the community by sharing experiences and resources, while also learning from other professionals working in different contexts and disciplines.</p>
<br/>
<br/>
<h2>
Champions to mentors
</h2><h3>
Erick Navarro Delgado
</h3><figure class="pull-left"><img src="https://i0.wp.com/ropensci.org/img/team/erick-navarro-delgado.jpg?w=250&#038;ssl=1"
alt="Profile photo of Erick Navarro Delgado"  data-recalc-dims="1"><figcaption>
<p><strong>Erick Navarro Delgado <br/> The University of British Columbia <br/> rOpenSci</strong></p>
</figcaption>
</figure>
<p>Hello! My name is Erick Navarro. I have a degree in biology from the Universidad Nacional Autónoma de México and am a PhD candidate in Bioinformatics at The University of British Columbia. I was born and raised in Mexico City, but currently live in Vancouver, Canada. My research focuses on developing computational tools to understand how genetic factors and environmental exposures/lived experiences act together or separately to shape our molecular landscape.</p>
<p>I am excited to participate in the rOpenSci Champions Program because I believe that open and accessible science is essential for conducting relevant research whose results benefit everyone in our society. In this program I hope to connect with new members of the open science community, share my programming skills, and drive software development in Latin America.</p>
<br/>
<h3>
Guadalupe Pascal
</h3><figure class="pull-right"><img src="https://i0.wp.com/ropensci.org/img/team/guadalupe-pascal.jpg?w=250&#038;ssl=1"
alt="Profile photo of Guadalupe Pascal"  data-recalc-dims="1"><figcaption>
<p><strong>Guadalupe Pascal <br/> UNLZ-UCA-UGR <br/> rOpenSci</strong></p>
</figcaption>
</figure>
<p>Hello! My name is Guadalupe.</p>
<p>I am a researcher and project coordinator in optimization and data science for decision making in social systems, with transfeminist, open science and regional perspectives. I am also an associate professor of optimization and quantitative methods (UNLZ-UCA) and professor in data science and artificial intelligence courses (UGR). I have a Master’s in Decision Systems Engineering from URJC (Spain) and am an industrial engineer from UNLZ (Argentina), a PhD student in Information Technology and Engineering (URJC-UNLZ), and hold diplomas in Gender and Society (UNLZ), Cognitive Neuroscience (Neurotransmitting), and Education in the Age of Artificial Intelligence (UMET). I am part of the Matilda Latin American Open Chair and Women in Engineering, as a founding member and representative of the Gender Network of Engineering Faculties of Argentina.</p>
<p>I am also currently part of the rOpenSci community as a 2025-2026 cohort Champion, and I am very excited to be a mentor in this program for several reasons. On the one hand, I feel deep gratification at being engaged with the current cohort. From a simple point of view, the quality and rigor with which the program is implemented in all its instances have a direct impact on the quality and rigor of my own work. And from a holistic point of view, this serves as extremely valuable and compelling evidence of the synergy within communities of practice in developing skills and producing situated knowledge: the rOpenSci Champions Program is a concrete and real example of how communities share knowledge and, fundamentally, values, perspectives and embodied learning. On the other hand, I am looking forward to the challenge of being a mentor in this program because, although it is a role that I have played in other environments, I have never mentored the development of someone else’s R package. Finally, I would like to work in this role to share my experiences as both a mentor and mentee with the community. I believe that accompanying each other in a formative and transformative process is one of the most human dimensions of this ecosystem in which we work.</p>
<br/>
<h3>
Andrea Gomez Vargas
</h3><figure class="pull-left"><img src="https://i2.wp.com/ropensci.org/img/team/andrea-vargas.png?w=250&#038;ssl=1"
alt="Profile photo of Andrea Gomez Vargas"  data-recalc-dims="1"><figcaption>
<p><strong>Andrea Gomez Vargas <br/> INDEC <br/> R-Ladies, rOpenSci</strong></p>
</figcaption>
</figure>
<p>I am Colombian by origin and Argentinean by choice; Argentina is where I live today, develop my career, and actively participate in the R community. I am a sociologist and work in the national statistics office of Argentina, in the area of social statistics, where I analyze information about the population to understand inequalities and living conditions.</p>
<p>The R community is my favorite space to share knowledge and build collectively. Currently, I am co-organizer of <a href="https://renbaires.github.io/" rel="nofollow" target="_blank">R in Buenos Aires</a> and <a href="https://rse-argentina.github.io/" rel="nofollow" target="_blank">RSE Argentina</a>, and I also participate in communities such as R-Ladies, LatinR and rOpenSci, contributing to the strengthening of networks at local, regional and global levels, promoting the learning and use of open tools in data science.</p>
<p>I was a <a href="https://blog/2025/05/15/puentes-comunidades-campeones-ropensci/" rel="nofollow" target="_blank">Champion in the 2023-2024 cohort</a>, where I developed <a href="https://soyandrea.github.io/arcenso/" rel="nofollow" target="_blank">{ARcenso}, a package that facilitates access to historical census data for Argentina</a>. I am motivated to return to the program as a mentor to keep promoting open knowledge and to accompany other people in the development of projects with an impact on their communities.</p>
<br/>
<h3>
Monika Avila Marquez
</h3><figure class="pull-right"><img src="https://i1.wp.com/ropensci.org/img/team/monika-avila-marquez.jpeg?w=250&#038;ssl=1"
alt="Profile photo of Monika Ávila Márquez"  data-recalc-dims="1"><figcaption>
<p><strong>Monika Ávila Márquez <br/> University of Geneva <br/> R-Ladies, rOpenSci</strong></p>
</figcaption>
</figure>
<p>Hi, I am Monika, a postdoctoral researcher in statistics at the University of Geneva, where I work on causal inference and machine learning methods for panel data. I have a PhD in econometrics and my research focuses on the development of semi-parametric estimators that combine machine learning techniques with econometric foundations for estimating panel data models. I also work on mixed effects model selection and causal inference with interference.</p>
<p>I am co-organizer of the R-Ladies Geneva chapter, where I strive to build an inclusive community of practice for people using R in research.</p>
<p>I am participating as a mentor in this program because I want to give back for all that rOpenSci has given me. This community has accompanied me in my professional development &#8211; as a source of resources, as a learning space and as an example of what it means to do open science with rigor and generosity. Today I have the opportunity to offer that same support to others, and that excites me deeply.</p>
<br/>
<h2>
Returning mentors
</h2><h3>
Luis D. Verde Arregoitia
</h3><figure class="pull-left"><img src="https://i2.wp.com/ropensci.org/img/team/luis-verde.jpeg?w=250&#038;ssl=1"
alt="Profile photo of Luis D. Verde Arregoitia"  data-recalc-dims="1"><figcaption>
<p><strong>Luis D. Verde Arregoitia<br/>Instituto de Ecología AC &#8211; INECOL<br/>LatinR, rOpenSci, The Carpentries</strong></p>
</figcaption>
</figure>
<p>Hi, I’m Luis D. Verde Arregoitia, a Mexican living in Xalapa, Mexico. A biologist with a PhD in Biological Sciences, I am a mammal specialist with experience in R programming for data analysis, visualization and statistical modeling. I am also a certified instructor and author of several packages.</p>
<p>I was a mentor in two previous cohorts of the program, where I supported software developers in Latin America, and I return to this new cohort with much enthusiasm.</p>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<h3>
Pao Corrales
</h3><figure class="pull-right"><img src="https://i1.wp.com/ropensci.org/img/team/paola-corrales.png?w=250&#038;ssl=1"
alt="Profile photo of Pao Corrales"  data-recalc-dims="1"><figcaption>
<p><strong>Pao Corrales<br/>Australian National University &#038; 21st century weather CoE <br/> R-Ladies, LatinR, rOpenSci, The Carpentries, RForwards </strong></p>
</figcaption>
</figure>
<p>I have a PhD in Atmospheric Sciences from the University of Buenos Aires (Argentina) and am currently working in Australia at the <em>21st Century Weather Centre</em> as a research software engineer.</p>
<p>I actively participate in R-Ladies, R Forwards, The Carpentries, LatinR and rOpenSci, learning and sharing knowledge about R in the community. In 2023 I participated in the Champions Program as a Champion, submitting the agroclimate package to the rOpenSci peer review process. I learned a lot and connected with people from all over the world. It was an excellent experience!</p>
<p>I am passionate about teaching and helping other people grow in what they do, access new opportunities and develop professionally and as individuals. I am very excited to participate again this year as a mentor in the Latin America Champions Program.</p>
<h3>
Francisco Cardozo
</h3><figure class="pull-right"><img src="https://i2.wp.com/ropensci.org/img/team/francisco-cardozo.jpg?w=250&#038;ssl=1"
alt="Profile photo of Francisco Cardozo"  data-recalc-dims="1"><figcaption>
<p><strong>Francisco Cardozo<br/>[University affiliation] <br/> rOpenSci &#8211; The Carpentries</strong></p>
</figcaption>
</figure>
<p>My name is Francisco Cardozo. I am originally from Colombia and came to the United States to pursue my doctoral studies. I am currently working at the University of Miami as a postdoctoral researcher in the IMPAC research center, an institution dedicated to advancing our understanding of adolescent development. I have participated in the Champions Program on several occasions. Much of my professional work has focused on research design and the application of statistical methods, particularly through the use of the R software environment.</p>
<br/>
<br/>
<br/>
<br/>
<h3>
Milagros Mendoza
</h3><figure class="pull-left"><img src="https://i1.wp.com/ropensci.org/img/team/milagros-mendoza.jpeg?w=250&#038;ssl=1"
alt="Milagros Mendoza&#39;s Profile Picture "  data-recalc-dims="1"><figcaption>
<p><strong>Milagros Mendoza <br/> Universidade Federal Rural de Pernambuco<br/> R-Ladies Natal, rOpenSci</strong></p>
</figcaption>
</figure>
<p>Hello, my name is Milagros. I am an ecologist and statistician driven by a desire to understand the complex systems that intertwine nature, society, and data. Throughout my career, I have worked with interdisciplinary data in the fields of climate, demography, and ecology, always striving to translate that data into knowledge that engages with reality and contributes to more informed decision-making. I am currently pursuing a postdoctoral fellowship at the Vale Institute of Technology in Brazil, where I am part of the research group on territories and natural resources.</p>
<p>I decided to serve as a mentor at rOpenSci because I am motivated to help more people develop confidence in using scientific tools, strengthen their critical thinking, and actively engage within the academic community. In this sense, I view mentoring as a learning space focused on dialogue and mutual growth.</p>
<h3>
Elio Campitelli
</h3><figure class="pull-left"><img src="https://i1.wp.com/ropensci.org/img/team/elio-campitelli.jpg?w=250&#038;ssl=1"
alt="Profile photo of Elio Campitelli"  data-recalc-dims="1"><figcaption>
<p><strong>Elio Campitelli <br/> Monash University &#8211; rOpenSci</strong></p>
</figcaption>
</figure>
<p>I am from Argentina but two years ago I moved to Australia because it is the only other country that starts with A and uses the same type of plug.</p>
<p>I am doing a postdoc at Monash University researching interactions between Antarctic sea ice and the atmosphere.</p>
<p>I have been a mentor to previous cohorts of the program. It was a great experience that I want to repeat once more.</p>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<h2>
What’s next
</h2><p>We are happy to have this diverse and talented team of mentors, who embody the values of collaboration and commitment to collective growth. Their support will be key to helping the new Champions move their ideas and projects forward and contribute to the development of a stronger and more diverse open science community.</p>
<p>The selection of Champions is now complete, and we’ll be announcing them soon.</p>
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://ropensci.org/blog/2026/05/06/mentors-2026/"> rOpenSci - open tools for open science</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/new-mentoring-team-with-experienced-mentors-and-new-voices/">New Mentoring Team with Experienced Mentors and New Voices</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">401076</post-id>	</item>
		<item>
		<title>Differential Machine Learning with Twin Networks in R: Forecasting Bitcoin with Volatility Proxies</title>
		<link>https://www.r-bloggers.com/2026/05/differential-machine-learning-with-twin-networks-in-r-forecasting-bitcoin-with-volatility-proxies/</link>
		
		<dc:creator><![CDATA[Selcuk Disci]]></dc:creator>
		<pubDate>Tue, 05 May 2026 14:04:43 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">http://datageeek.com/?p=11991</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> Introduction Differential Machine Learning (DML), as introduced in the recent arXiv paper (Differential Machine Learning for 0DTE Options with Stochastic Volatility and Jumps), extends supervised learning by incorporating not only function values but also their derivatives. In financial contexts, this often means sensitivities such as Greeks. However, when direct derivatives ...</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/differential-machine-learning-with-twin-networks-in-r-forecasting-bitcoin-with-volatility-proxies/">Differential Machine Learning with Twin Networks in R: Forecasting Bitcoin with Volatility Proxies</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://datageeek.com/2026/05/05/differential-machine-learning-with-twin-networks-in-r-forecasting-bitcoin-with-volatility-proxies/"> DataGeeek</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>

<h2 class="wp-block-heading">Introduction</h2>



<p class="wp-block-paragraph">Differential Machine Learning (DML), as introduced in the recent <strong><em><a href="https://arxiv.org/html/2603.07600v1" rel="nofollow" target="_blank">arXiv paper (Differential Machine Learning for 0DTE Options with Stochastic Volatility and Jumps)</a></em></strong>, extends supervised learning by incorporating not only function values but also their derivatives. In financial contexts, this often means sensitivities such as Greeks. However, when direct derivatives are unavailable, we can approximate market dynamics using <strong>volatility indicators</strong>.</p>



<p class="wp-block-paragraph">In this project, we adapt DML to Bitcoin price forecasting. Instead of derivatives, we use <strong>RSI, MACD, and Bollinger Bands</strong> as proxies for volatility. These indicators capture momentum, trend strength, and price dispersion, providing a practical way to embed uncertainty into the learning process. To implement this, we design a <strong>twin-network architecture</strong> in Keras: one network learns price dynamics from time-based features, while the other learns volatility signals. Finally, we combine them via a stacking ensemble to achieve robust forecasts with confidence intervals.</p>



<h2 class="wp-block-heading">Why Volatility Variables Instead of Derivatives?</h2>



<ul class="wp-block-list">
<li><strong>RSI (Relative Strength Index)</strong>: Measures momentum and overbought/oversold conditions.</li>



<li><strong>MACD (Moving Average Convergence Divergence)</strong>: Captures trend direction and strength.</li>



<li><strong>Bollinger Bands (upper/lower bands, %B)</strong>: Quantifies price dispersion and volatility.</li>
</ul>
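<p class="wp-block-paragraph">All three indicators are available in the <code>TTR</code> package. The sketch below is illustrative rather than the post's exact code, and assumes <code>close</code> is a numeric vector of daily BTC closing prices:</p>

```r
# Compute the three volatility proxies with TTR (toy prices for illustration)
library(TTR)

set.seed(1)
close <- cumsum(rnorm(200, 0, 50)) + 30000   # hypothetical BTC closing prices

rsi  <- RSI(close, n = 14)                            # momentum / overbought-oversold
macd <- MACD(close, nFast = 12, nSlow = 26, nSig = 9) # columns: macd, signal
bb   <- BBands(close, n = 20, sd = 2)                 # columns: dn, mavg, up, pctB

# Assemble the proxy features that stand in for derivatives
features <- data.frame(rsi  = rsi,
                       macd = macd[, "macd"],
                       pctB = bb[, "pctB"])
head(na.omit(features))
```

The leading rows are `NA` until each indicator's lookback window fills, so they are dropped before training.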



<p class="wp-block-paragraph">These indicators act as empirical substitutes for theoretical derivatives. While DML in its pure form requires sensitivities, in practice, these volatility proxies provide similar information about how prices respond to market forces.</p>



<h3 class="wp-block-heading">Why Twin Networks?</h3>



<p class="wp-block-paragraph">The idea is to separate the learning tasks:</p>



<ul class="wp-block-list">
<li>The <strong>primary network</strong> models the continuous component of the price process.</li>



<li>The <strong>auxiliary network</strong> models the volatility/jump component. Together, they mimic the decomposition found in stochastic models such as Bates or Heston, but implemented within a flexible neural framework.</li>
</ul>



<h2 class="wp-block-heading">Ensemble via Stacking</h2>



<p class="wp-block-paragraph">Once both networks are trained, their predictions are combined using a <strong>linear regression meta-model</strong>. This stacking ensemble learns the optimal weighting between the primary and auxiliary outputs. The result is a forecast that integrates both trend and volatility signals, significantly improving accuracy compared to either network alone.</p>



<h2 class="wp-block-heading">Evaluation</h2>



<figure data-wp-context="{"imageId":"69f9f92c89a12"}" data-wp-interactive="core/image" data-wp-key="69f9f92c89a12" class="wp-block-image size-large wp-lightbox-container"><img loading="lazy" data-attachment-id="12020" data-permalink="https://datageeek.com/2026/05/05/differential-machine-learning-with-twin-networks-in-r-forecasting-bitcoin-with-volatility-proxies/image-132/" data-orig-file="https://datageeek.com/wp-content/uploads/2026/05/image-1.png" data-orig-size="1012,353" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="image" data-image-description="" data-image-caption="" data-large-file="https://i1.wp.com/datageeek.com/wp-content/uploads/2026/05/image-1.png?w=450&#038;ssl=1" data-wp-class--hide="state.isContentHidden" data-wp-class--show="state.isContentVisible" data-wp-init="callbacks.setButtonStyles" data-wp-on--click="actions.showLightbox" data-wp-on--load="callbacks.setButtonStyles" data-wp-on--pointerdown="actions.preloadImage" data-wp-on--pointerenter="actions.preloadImageWithDelay" data-wp-on--pointerleave="actions.cancelPreload" data-wp-on-window--resize="callbacks.setButtonStyles" src="https://i1.wp.com/datageeek.com/wp-content/uploads/2026/05/image-1.png?w=450&#038;ssl=1" alt="" class="wp-image-12020" srcset_temp="https://datageeek.com/wp-content/uploads/2026/05/image-1.png 1012w, https://datageeek.com/wp-content/uploads/2026/05/image-1.png?w=150 150w, https://datageeek.com/wp-content/uploads/2026/05/image-1.png?w=300 300w, https://datageeek.com/wp-content/uploads/2026/05/image-1.png?w=768 768w" sizes="(max-width: 1012px) 100vw, 1012px" data-recalc-dims="1" /><button
			class="lightbox-trigger"
			type="button"
			aria-haspopup="dialog"
			data-wp-bind--aria-label="state.thisImage.triggerButtonAriaLabel"
			data-wp-init="callbacks.initTriggerButton"
			data-wp-on--click="actions.showLightbox"
			data-wp-style--right="state.thisImage.buttonRight"
			data-wp-style--top="state.thisImage.buttonTop"
		>
			<svg xmlns="http://www.w3.org/2000/svg" width="12" height="12" fill="none" viewBox="0 0 12 12">
				<path fill="#fff" d="M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z" />
			</svg>
		</button></figure>



<ul class="wp-block-list">
<li>Metrics: RMSE and MAPE, computed with the <code>yardstick</code> package.</li>



<li>Results:
<ul class="wp-block-list">
<li>Individual networks → RMSE ~76,000, MAPE ~99%.</li>



<li>Stacking ensemble → RMSE ~3,030, MAPE ~3.65%.</li>
</ul>
</li>
</ul>



<p class="wp-block-paragraph">This demonstrates the power of combining price and volatility signals in a unified framework.</p>



<h2 class="wp-block-heading">Confidence Intervals</h2>



<p class="wp-block-paragraph">To quantify uncertainty, we compute <strong>residual-based confidence intervals</strong> around the point forecasts:</p>



<p class="wp-block-paragraph"><math display="block"><mrow><msub><mover accent="true"><mi>y</mi><mo>^</mo></mover><mi>t</mi></msub><mo>±</mo><mn>1.96</mn><mo>⋅</mo><msub><mi>σ</mi><mtext>residuals</mtext></msub></mrow></math></p>



<p class="wp-block-paragraph">This approach uses the standard deviation of training residuals to generate 95% confidence bands. It provides interpretable uncertainty estimates without requiring explicit probabilistic modeling.</p>



<h2 class="wp-block-heading">Visualization</h2>



<p class="wp-block-paragraph">The forecasts are visualized with <code>ggplot2</code>:</p>



<ul class="wp-block-list">
<li><strong>Grey ribbon</strong> → confidence intervals.</li>



<li><strong>Red line</strong> → stacking ensemble forecast.</li>



<li><strong>Black line</strong> → actual BTC prices.</li>
</ul>
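<p class="wp-block-paragraph">The three layers above map directly onto <code>geom_ribbon</code> and two <code>geom_line</code> calls. A minimal sketch with a hypothetical <code>plot_df</code> (toy data in place of the real forecasts):</p>

```r
library(ggplot2)

# Toy data frame standing in for dates, actuals, and the ensemble forecast
set.seed(3)
plot_df <- data.frame(
  date   = seq(as.Date("2026-04-01"), by = "day", length.out = 30),
  actual = cumsum(rnorm(30, 0, 300)) + 60000
)
plot_df$forecast <- plot_df$actual + rnorm(30, sd = 500)
plot_df$lower    <- plot_df$forecast - 2000
plot_df$upper    <- plot_df$forecast + 2000

ggplot(plot_df, aes(date)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "grey80") +  # confidence band
  geom_line(aes(y = actual), color = "black") +                    # actual BTC prices
  geom_line(aes(y = forecast), color = "red") +                    # ensemble forecast
  labs(x = NULL, y = "BTC price (USD)")
```

Drawing the ribbon first keeps both lines visible on top of the grey band.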



<figure data-wp-context="{"imageId":"69f9f92c8a8d9"}" data-wp-interactive="core/image" data-wp-key="69f9f92c8a8d9" class="wp-block-image size-large wp-lightbox-container"><img loading="lazy" data-attachment-id="12021" data-permalink="https://datageeek.com/2026/05/05/differential-machine-learning-with-twin-networks-in-r-forecasting-bitcoin-with-volatility-proxies/image-133/" data-orig-file="https://datageeek.com/wp-content/uploads/2026/05/image-2.png" data-orig-size="1673,592" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="image" data-image-description="" data-image-caption="" data-large-file="https://i1.wp.com/datageeek.com/wp-content/uploads/2026/05/image-2.png?w=450&#038;ssl=1" data-wp-class--hide="state.isContentHidden" data-wp-class--show="state.isContentVisible" data-wp-init="callbacks.setButtonStyles" data-wp-on--click="actions.showLightbox" data-wp-on--load="callbacks.setButtonStyles" data-wp-on--pointerdown="actions.preloadImage" data-wp-on--pointerenter="actions.preloadImageWithDelay" data-wp-on--pointerleave="actions.cancelPreload" data-wp-on-window--resize="callbacks.setButtonStyles" src="https://i1.wp.com/datageeek.com/wp-content/uploads/2026/05/image-2.png?w=450&#038;ssl=1" alt="" class="wp-image-12021" srcset_temp="https://i1.wp.com/datageeek.com/wp-content/uploads/2026/05/image-2.png?w=450&#038;ssl=1 1024w, https://datageeek.com/wp-content/uploads/2026/05/image-2.png?w=150 150w, https://datageeek.com/wp-content/uploads/2026/05/image-2.png?w=300 300w, https://datageeek.com/wp-content/uploads/2026/05/image-2.png?w=768 768w, https://datageeek.com/wp-content/uploads/2026/05/image-2.png?w=1440 1440w, https://datageeek.com/wp-content/uploads/2026/05/image-2.png 1673w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /><button
			class="lightbox-trigger"
			type="button"
			aria-haspopup="dialog"
			data-wp-bind--aria-label="state.thisImage.triggerButtonAriaLabel"
			data-wp-init="callbacks.initTriggerButton"
			data-wp-on--click="actions.showLightbox"
			data-wp-style--right="state.thisImage.buttonRight"
			data-wp-style--top="state.thisImage.buttonTop"
		>
			<svg xmlns="http://www.w3.org/2000/svg" width="12" height="12" fill="none" viewBox="0 0 12 12">
				<path fill="#fff" d="M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z" />
			</svg>
		</button></figure>



<p class="wp-block-paragraph">This design clearly communicates both the central forecast and the uncertainty range. The chart you will include at the end of the blog shows exactly this: a red forecast line, black actuals, and a grey confidence band, illustrating how the ensemble integrates volatility information into predictive intervals.</p>



<figure data-wp-context="{"imageId":"69f9f92c8b334"}" data-wp-interactive="core/image" data-wp-key="69f9f92c8b334" class="wp-block-image size-large wp-lightbox-container"><img loading="lazy" data-attachment-id="12007" data-permalink="https://datageeek.com/2026/05/05/differential-machine-learning-with-twin-networks-in-r-forecasting-bitcoin-with-volatility-proxies/btc_dml/" data-orig-file="https://datageeek.com/wp-content/uploads/2026/05/btc_dml.png" data-orig-size="1112,646" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="btc_dml" data-image-description="" data-image-caption="" data-large-file="https://i2.wp.com/datageeek.com/wp-content/uploads/2026/05/btc_dml.png?w=450&#038;ssl=1" data-wp-class--hide="state.isContentHidden" data-wp-class--show="state.isContentVisible" data-wp-init="callbacks.setButtonStyles" data-wp-on--click="actions.showLightbox" data-wp-on--load="callbacks.setButtonStyles" data-wp-on--pointerdown="actions.preloadImage" data-wp-on--pointerenter="actions.preloadImageWithDelay" data-wp-on--pointerleave="actions.cancelPreload" data-wp-on-window--resize="callbacks.setButtonStyles" src="https://i2.wp.com/datageeek.com/wp-content/uploads/2026/05/btc_dml.png?w=450&#038;ssl=1" alt="" class="wp-image-12007" srcset_temp="https://i2.wp.com/datageeek.com/wp-content/uploads/2026/05/btc_dml.png?w=450&#038;ssl=1 1024w, https://datageeek.com/wp-content/uploads/2026/05/btc_dml.png?w=150 150w, https://datageeek.com/wp-content/uploads/2026/05/btc_dml.png?w=300 300w, https://datageeek.com/wp-content/uploads/2026/05/btc_dml.png?w=768 768w, https://datageeek.com/wp-content/uploads/2026/05/btc_dml.png 1112w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /><button
			class="lightbox-trigger"
			type="button"
			aria-haspopup="dialog"
			data-wp-bind--aria-label="state.thisImage.triggerButtonAriaLabel"
			data-wp-init="callbacks.initTriggerButton"
			data-wp-on--click="actions.showLightbox"
			data-wp-style--right="state.thisImage.buttonRight"
			data-wp-style--top="state.thisImage.buttonTop"
		>
			<svg xmlns="http://www.w3.org/2000/svg" width="12" height="12" fill="none" viewBox="0 0 12 12">
				<path fill="#fff" d="M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z" />
			</svg>
		</button></figure>



<h2 class="wp-block-heading">Keras3 in R: Flexible Deep Learning for Financial Forecasting</h2>



<h3 class="wp-block-heading">What is Keras3?</h3>



<p class="wp-block-paragraph"><strong><em><a href="https://keras3.posit.co/" rel="nofollow" target="_blank">Keras3</a></em></strong> is the modern R interface to the Keras deep learning library, built on top of TensorFlow. It allows R users to define, train, and evaluate neural networks with concise syntax while leveraging TensorFlow’s computational power. Unlike earlier versions, Keras3 is fully aligned with TensorFlow 2.x, ensuring long-term support and compatibility.</p>



<h3 class="wp-block-heading">How We Used Keras3</h3>



<p class="wp-block-paragraph">In our workflow, Keras3 was the backbone for implementing the <strong>twin-network architecture</strong>:</p>



<figure data-wp-context="{"imageId":"69f9f92c8bffa"}" data-wp-interactive="core/image" data-wp-key="69f9f92c8bffa" class="wp-block-image size-large wp-lightbox-container"><img loading="lazy" data-attachment-id="12017" data-permalink="https://datageeek.com/2026/05/05/differential-machine-learning-with-twin-networks-in-r-forecasting-bitcoin-with-volatility-proxies/image-131/" data-orig-file="https://datageeek.com/wp-content/uploads/2026/05/image.png" data-orig-size="1064,654" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="image" data-image-description="" data-image-caption="" data-large-file="https://i2.wp.com/datageeek.com/wp-content/uploads/2026/05/image.png?w=450&#038;ssl=1" data-wp-class--hide="state.isContentHidden" data-wp-class--show="state.isContentVisible" data-wp-init="callbacks.setButtonStyles" data-wp-on--click="actions.showLightbox" data-wp-on--load="callbacks.setButtonStyles" data-wp-on--pointerdown="actions.preloadImage" data-wp-on--pointerenter="actions.preloadImageWithDelay" data-wp-on--pointerleave="actions.cancelPreload" data-wp-on-window--resize="callbacks.setButtonStyles" src="https://i2.wp.com/datageeek.com/wp-content/uploads/2026/05/image.png?w=450&#038;ssl=1" alt="" class="wp-image-12017" srcset_temp="https://i2.wp.com/datageeek.com/wp-content/uploads/2026/05/image.png?w=450&#038;ssl=1 1024w, https://datageeek.com/wp-content/uploads/2026/05/image.png?w=150 150w, https://datageeek.com/wp-content/uploads/2026/05/image.png?w=300 300w, https://datageeek.com/wp-content/uploads/2026/05/image.png?w=768 768w, https://datageeek.com/wp-content/uploads/2026/05/image.png 1064w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /><button
			class="lightbox-trigger"
			type="button"
			aria-haspopup="dialog"
			data-wp-bind--aria-label="state.thisImage.triggerButtonAriaLabel"
			data-wp-init="callbacks.initTriggerButton"
			data-wp-on--click="actions.showLightbox"
			data-wp-style--right="state.thisImage.buttonRight"
			data-wp-style--top="state.thisImage.buttonTop"
		>
			<svg xmlns="http://www.w3.org/2000/svg" width="12" height="12" fill="none" viewBox="0 0 12 12">
				<path fill="#fff" d="M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z" />
			</svg>
		</button></figure>



<h3 class="wp-block-heading">Why ReLU?</h3>



<ul class="wp-block-list">
<li><strong>ReLU (Rectified Linear Unit)</strong> is the activation function used in hidden layers.</li>



<li>Formula: <math><mrow><mtext>ReLU</mtext><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo>=</mo><mi>max</mi><mo>⁡</mo><mo stretchy="false">(</mo><mn>0</mn><mo separator="true">,</mo><mi>x</mi><mo stretchy="false">)</mo></mrow></math>.</li>



<li>Benefits:
<ul class="wp-block-list">
<li>Introduces non-linearity, enabling the network to learn complex relationships.</li>



<li>Efficient and helps avoid vanishing gradients.</li>



<li>Well-suited for financial data where signals can be sparse and directional.</li>
</ul>
</li>
</ul>
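<p class="wp-block-paragraph">As a quick illustration in plain R (independent of Keras), ReLU just clips negative inputs to zero, and its gradient is 1 for positive inputs and 0 otherwise:</p>

```r
# ReLU in base R: element-wise max(0, x)
relu <- function(x) pmax(0, x)

relu(c(-2, -0.5, 0, 1.5, 3))
#> [1] 0.0 0.0 0.0 1.5 3.0

# Subgradient: 1 where x > 0, 0 elsewhere; it never saturates
# for positive inputs, which helps against vanishing gradients.
relu_grad <- function(x) as.numeric(x > 0)
```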



<h3 class="wp-block-heading">Why Adam?</h3>



<ul class="wp-block-list">
<li><strong>Adam (Adaptive Moment Estimation)</strong> is the optimizer chosen.</li>



<li>Combines <strong>momentum</strong> (using past gradients to accelerate learning) and <strong>adaptive learning rates</strong> (adjusting step sizes per parameter).</li>



<li>Benefits:
<ul class="wp-block-list">
<li>Robust for noisy, non-stationary data like cryptocurrency prices.</li>



<li>Requires minimal tuning, making it ideal for plug-and-play workflows.</li>



<li>Widely adopted in both academic and applied machine learning.</li>
</ul>
</li>
</ul>
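<p class="wp-block-paragraph">In keras3, both choices amount to a single argument each. A minimal sketch (the layer sizes and input width here are illustrative, not the post’s actual twin-network architecture, and it assumes keras3 with a working TensorFlow backend):</p>

```r
library(keras3)

# Illustrative sizes only: 8 input features, two ReLU hidden layers
model <- keras_model_sequential(input_shape = 8) |>
  layer_dense(units = 64, activation = "relu") |>
  layer_dense(units = 32, activation = "relu") |>
  layer_dense(units = 1)  # linear output for regression

model |> compile(
  optimizer = optimizer_adam(learning_rate = 0.001),  # Adam defaults usually work well
  loss = "mse"
)
```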



<h3 class="wp-block-heading">Contribution to the R Ecosystem</h3>



<p class="wp-block-paragraph">Keras3 bridges the gap between R’s <strong>tidyverse/tidymodels ecosystem</strong> and modern deep learning:</p>



<ul class="wp-block-list">
<li>Integrates seamlessly with data preprocessing pipelines (<code>recipes</code>, <code>timetk</code>).</li>



<li>Allows financial analysts and data scientists to stay within R while accessing TensorFlow’s deep learning capabilities.</li>



<li>Encourages reproducibility: models can be defined, trained, and evaluated entirely in R, without switching to Python.</li>



<li>Expands R’s role beyond traditional statistical modeling into <strong>state-of-the-art AI applications</strong>.</li>
</ul>



<h2 class="wp-block-heading">Why It Matters for DML</h2>



<p class="wp-block-paragraph">By using Keras3:</p>



<ul class="wp-block-list">
<li>We could <strong>separate learning tasks</strong> into a primary network (trend/seasonality) and an auxiliary network (volatility/momentum).</li>



<li>Both networks were trained with ReLU activations and Adam optimization, ensuring stability and efficiency.</li>



<li>Their outputs were combined in a stacking ensemble, yielding forecasts that integrate both price dynamics and volatility signals.</li>
</ul>



<p class="wp-block-paragraph">This demonstrates how Keras3 empowers R users to implement advanced architectures like twin networks, making Differential Machine Learning concepts practical in financial forecasting.</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p class="wp-block-paragraph">This case study demonstrates how Differential Machine Learning concepts can be adapted for financial forecasting in R:</p>



<ul class="wp-block-list">
<li>Volatility indicators serve as practical substitutes for derivatives.</li>



<li>A twin-network architecture in Keras captures both trend and volatility.</li>



<li>Stacking ensembles significantly improve predictive performance.</li>



<li>Residual-based confidence intervals provide interpretable uncertainty estimates.</li>
</ul>
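<p class="wp-block-paragraph">The last point can be sketched in a few lines of base R. Under an assumption of roughly i.i.d. residuals, a normal-approximation 95% interval around a point forecast looks like this (residual values and the forecast are simulated purely for illustration):</p>

```r
set.seed(42)
resids <- rnorm(200, mean = 0, sd = 250)  # stand-in for model residuals
point_forecast <- 65000                   # hypothetical point forecast

half_width <- qnorm(0.975) * sd(resids)   # ~1.96 residual standard deviations
c(lower = point_forecast - half_width,
  upper = point_forecast + half_width)
```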



<p class="wp-block-paragraph">By combining academic ideas with reproducible R workflows, we can build robust forecasting pipelines that bridge theory and practice.</p>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://datageeek.com/2026/05/05/differential-machine-learning-with-twin-networks-in-r-forecasting-bitcoin-with-volatility-proxies/"> DataGeeek</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/differential-machine-learning-with-twin-networks-in-r-forecasting-bitcoin-with-volatility-proxies/">Differential Machine Learning with Twin Networks in R: Forecasting Bitcoin with Volatility Proxies</a>]]></content:encoded>
					
		
		<enclosure url="https://datageeek.com/wp-content/uploads/2026/05/btc_dml.png" length="0" type="" />
<enclosure url="https://1.gravatar.com/avatar/db5e3f9ef188ea98fe38ab05c5a3fad9fb52fe3472715a8fc02f7ea41731f77c?s=96&#038;d=identicon&#038;r=G" length="0" type="" />
<enclosure url="https://datageeek.com/wp-content/uploads/2026/05/image-1.png?w=1012" length="0" type="" />
<enclosure url="https://datageeek.com/wp-content/uploads/2026/05/image-2.png?w=1024" length="0" type="" />
<enclosure url="https://datageeek.com/wp-content/uploads/2026/05/btc_dml.png?w=1024" length="0" type="" />
<enclosure url="https://datageeek.com/wp-content/uploads/2026/05/image.png?w=1024" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">401060</post-id>	</item>
		<item>
		<title>Setting function parameters for debugging</title>
		<link>https://www.r-bloggers.com/2026/05/setting-function-parameters-for-debugging/</link>
		
		<dc:creator><![CDATA[Jason Bryer]]></dc:creator>
		<pubDate>Tue, 05 May 2026 04:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://bryer.org/posts/2026-05-05-Setting_Function_Parameters_for_Debugging.html</guid>

					<description><![CDATA[<p>I tend to write a lot of functions that create specific graphics implemented with ggplot2. Although I try to pick graphic parameters (e.g. colors, text size, etc.) that are reasonable, I will typically define all relevant aesthetics as param...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/setting-function-parameters-for-debugging/">Setting function parameters for debugging</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://bryer.org/posts/2026-05-05-Setting_Function_Parameters_for_Debugging.html"> Jason Bryer</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
 




<p>I tend to write a lot of functions that create specific graphics implemented with <a href="https://ggplot2.tidyverse.org/" rel="nofollow" target="_blank"><code>ggplot2</code></a>. Although I try to pick graphic parameters (e.g. colors, text size) that are reasonable, I will typically define all relevant aesthetics as parameters to my function. As a result, my functions tend to have a lot of parameters. When I need to debug such a function, I need all of those parameters set in the global environment, which usually means highlighting each assignment and running it by hand. The function below automates that process. You can pass it any function and it will attempt to assign each parameter’s default value in the given environment (the global environment by default). It returns a data frame with a column indicating whether each variable was set, along with its value, which makes it easy to spot parameters that have no default and must be set yourself.</p>
<div class="cell">
<pre>#' Set function parameters to an environment.
#'
#' This function is designed to help debug functions. It will attempt to set all
#' the default parameter values to the specified environment (global environment
#' by default). This is useful for when you want to execute code within the 
#' function definition interactively but need the parameters set in the current 
#' environment.
#'
#' **Warning:** This function will modify the global environment and therefore 
#' violates CRAN policy
#' [&quot;Packages should not modify the global environment (user’s workspace)&quot;]
#' (https://cran.r-project.org/web/packages/policies.html#Source-packages).
#'
#' @param FUN the function to assign parameters to an environment.
#' @param envir the environment to assign the variables to. Defaults to the 
#'        global environment.
#' @param verbose whether to return the data frame invisibly or to print the results.
#' @return a data frame where row names correspond to the parameter name with 
#'        two columns: `set` which is logical indicating if the variable was set 
#'        and `value` with a character representation of the variable value.
set_function_params &lt;- function(FUN, envir = globalenv(), verbose = interactive()) {
    params &lt;- formals(FUN)
    params_set &lt;- data.frame(row.names = names(params),
                             set = rep(FALSE, length(params)),
                             value = rep(NA_character_, length(params)))
    for(param in names(params)) {
        value &lt;- params[[param]]
        if(!missing(value)) {
            if(is.character(value)) {
                assign(param, value, envir = envir)
                params_set[param,]$value &lt;- value
            } else {
                assign(param, eval(value), envir = envir)
                params_set[param,]$value &lt;- eval(value)
            }
            params_set[param,]$set &lt;- TRUE
        }
    }
    if(verbose) {
        return(params_set)
    } else {
        invisible(params_set)
    }
}</pre>
</div>
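<p>To see why the loop above can tell which parameters have defaults, it helps to look at what <code>formals()</code> returns: a parameter with no default is bound to the empty symbol, which is what the <code>missing(value)</code> check detects. A small illustration, separate from the function itself:</p>

```r
# formals() returns a pairlist of default values; parameters
# without a default are bound to the empty symbol.
f <- function(a, b = 2, col = "red") NULL
fm <- formals(f)

identical(fm$b, 2)        # TRUE: numeric default
identical(fm$col, "red")  # TRUE: character default
class(fm$a)               # "name": the empty symbol, i.e. no default
as.character(fm$a)        # "" for a parameter with no default
```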
<p>Very recently I was trying to debug a function that creates profile plots for cluster analysis (<a href="https://github.com/jbryer/clav/blob/master/R/profile_plot.R" rel="nofollow" target="_blank"><code>clav::profile_plot()</code></a>, <a href="https://clav.bryer.org/reference/profile_plot.html" rel="nofollow" target="_blank">documentation</a>). This function has 23 parameters! Setting these all manually is pretty tedious.</p>
<div class="cell">
<pre># List objects in the current environment
ls()</pre>
<div class="cell-output cell-output-stdout">
<pre>[1] &quot;set_function_params&quot;</pre>
</div>
<pre># Call the function
param_set_result &lt;- set_function_params(clav::profile_plot)

# Check to see if the parameters are actually set
ls()</pre>
<div class="cell-output cell-output-stdout">
<pre> [1] &quot;bonferroni&quot;          &quot;center_alpha&quot;        &quot;center_band&quot;        
 [4] &quot;center_fill&quot;         &quot;cluster_label_hjust&quot; &quot;color_palette&quot;      
 [7] &quot;hjust&quot;               &quot;label_clusters&quot;      &quot;label_means&quot;        
[10] &quot;label_outcome_means&quot; &quot;label_profile_means&quot; &quot;param_set_result&quot;   
[13] &quot;point_size&quot;          &quot;se_factor&quot;           &quot;set_function_params&quot;
[16] &quot;standardize&quot;         &quot;text_size&quot;           &quot;title&quot;              
[19] &quot;ylab&quot;               </pre>
</div>
</div>
<p>We can examine the data frame which gives a summary of the parameters set (or not).</p>
<div class="cell">
<pre>param_set_result</pre>
<div class="cell-output cell-output-stdout">
<pre>                      set               value
df                  FALSE                &lt;NA&gt;
clusters            FALSE                &lt;NA&gt;
df_dep              FALSE                &lt;NA&gt;
standardize          TRUE                TRUE
bonferroni           TRUE                TRUE
label_means          TRUE                TRUE
label_profile_means  TRUE                TRUE
label_outcome_means  TRUE                TRUE
center_band          TRUE                0.25
center_fill          TRUE             #f0f9e8
center_alpha         TRUE                 0.1
text_size            TRUE                   4
hjust                TRUE                 0.5
point_size           TRUE                   2
se_factor            TRUE                1.96
color_palette        TRUE                   2
cluster_labels      FALSE                &lt;NA&gt;
cluster_order       FALSE                &lt;NA&gt;
label_clusters       TRUE                TRUE
cluster_label_x     FALSE                &lt;NA&gt;
cluster_label_hjust  TRUE                   5
ylab                 TRUE Mean Standard Score
title                TRUE    Cluster Profiles</pre>
</div>
</div>



 
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://bryer.org/posts/2026-05-05-Setting_Function_Parameters_for_Debugging.html"> Jason Bryer</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/setting-function-parameters-for-debugging/">Setting function parameters for debugging</a>]]></content:encoded>
					
		
		<enclosure url="https://bryer.org/posts/2026-05-05-banner.png" length="0" type="image/png" />

		<post-id xmlns="com-wordpress:feed-additions:1">401048</post-id>	</item>
		<item>
		<title>JAGS 5.0.0-beta is available</title>
		<link>https://www.r-bloggers.com/2026/05/jags-5-0-0-beta-is-available/</link>
		
		<dc:creator><![CDATA[Martyn]]></dc:creator>
		<pubDate>Mon, 04 May 2026 17:20:26 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">http://martynplummer.wordpress.com/?p=1992</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> JAGS 5.0.0-beta is now available from SourceForge. The beta release is for two groups of people: Please send feedback via the JAGS forums or file a bug report The JAGS library The following packages are available: The rjags package In … Continue reading →</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/jags-5-0-0-beta-is-available/">JAGS 5.0.0-beta is available</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://martynplummer.wordpress.com/2026/05/04/jags-5-0-0-beta-is-available/"> R – JAGS News</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>

<figure class="wp-block-image size-large"><a href="https://i1.wp.com/martynplummer.wordpress.com/wp-content/uploads/2026/05/img_0083.jpg?ssl=1" rel="nofollow" target="_blank"><img loading="lazy" data-attachment-id="1993" data-permalink="https://martynplummer.wordpress.com/img_0083/" data-orig-file="https://martynplummer.wordpress.com/wp-content/uploads/2026/05/img_0083.jpg" data-orig-size="1500,2000" data-comments-opened="1" data-image-meta="{"aperture":"1.64","credit":"","camera":"iPhone 16e","caption":"","created_timestamp":"1777199209","copyright":"","focal_length":"4.2","iso":"32","shutter_speed":"0.0021052631578947","title":"","orientation":"1","alt":""}" data-image-title="img_0083" data-image-description="" data-image-caption="" data-large-file="https://i1.wp.com/martynplummer.wordpress.com/wp-content/uploads/2026/05/img_0083.jpg?w=450&#038;ssl=1" src="https://i1.wp.com/martynplummer.wordpress.com/wp-content/uploads/2026/05/img_0083.jpg?w=450&#038;ssl=1" alt="" class="wp-image-1993" srcset_temp="https://i1.wp.com/martynplummer.wordpress.com/wp-content/uploads/2026/05/img_0083.jpg?w=450&#038;ssl=1 584w, https://martynplummer.wordpress.com/wp-content/uploads/2026/05/img_0083.jpg?w=1168 1168w, https://martynplummer.wordpress.com/wp-content/uploads/2026/05/img_0083.jpg?w=113 113w, https://martynplummer.wordpress.com/wp-content/uploads/2026/05/img_0083.jpg?w=225 225w, https://martynplummer.wordpress.com/wp-content/uploads/2026/05/img_0083.jpg?w=768 768w" sizes="(max-width: 584px) 100vw, 584px" data-recalc-dims="1" /></a></figure>



<p class="wp-block-paragraph">JAGS 5.0.0-beta is now available from SourceForge. </p>



<p class="wp-block-paragraph">The beta release is for two groups of people:</p>



<ul class="wp-block-list">
<li>People who have written software depending on JAGS, in particular authors of R packages that depend on one of the four interfaces between R and JAGS – rjags, runjags, R2jags, and jagsUI. Currently some of these packages do not pass the CRAN tests with the new version of JAGS; the beta period allows some time to fix these problems before the official release.</li>



<li>Anyone who wants to try out the new version and find problems with it before the official release. </li>
</ul>



<p class="wp-block-paragraph">Please send feedback via the <a href="https://sourceforge.net/p/mcmc-jags/discussion/" rel="nofollow" target="_blank">JAGS forums</a> or file a <a href="https://sourceforge.net/p/mcmc-jags/bugs/" rel="nofollow" target="_blank">bug report</a>.</p>



<h1 class="wp-block-heading">The JAGS library</h1>



<p class="wp-block-paragraph">The following packages are available:</p>



<ul class="wp-block-list">
<li><a href="https://sourceforge.net/projects/mcmc-jags/files/JAGS/5.x/Source/" rel="nofollow" target="_blank">Source tarball</a></li>



<li><a href="https://sourceforge.net/projects/mcmc-jags/files/JAGS/5.x/Windows/" rel="nofollow" target="_blank">Windows binary</a> installer (x86_64)</li>



<li><a href="https://sourceforge.net/projects/mcmc-jags/files/JAGS/5.x/macOS/" rel="nofollow" target="_blank">macOS binary</a> installer
<ul class="wp-block-list">
<li>There is a single macOS installer for both x86_64 and arm64.</li>
</ul>
</li>
</ul>



<h2 class="wp-block-heading">The rjags package</h2>



<p class="wp-block-paragraph">In order to interface to JAGS 5.0.0 from R you will need rjags_5-1. This is not yet available from CRAN because some of the reverse dependencies do not yet work with version 5.0.0 of JAGS. The following packages are provided:</p>



<ul class="wp-block-list">
<li><a href="https://sourceforge.net/projects/mcmc-jags/files/rjags/5/Source/" rel="nofollow" target="_blank">Source tarball</a></li>



<li><a href="https://sourceforge.net/projects/mcmc-jags/files/rjags/5/Source/" rel="nofollow" target="_blank">Windows binary</a> (x86_64)</li>



<li><a href="https://sourceforge.net/projects/mcmc-jags/files/rjags/5/macOS/" rel="nofollow" target="_blank">macOS binaries</a>
<ul class="wp-block-list">
<li>Separate binaries are provided for x86_64 and arm64 and for R version 4.5.3 and 4.6.0.</li>
</ul>
</li>
</ul>




<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://martynplummer.wordpress.com/2026/05/04/jags-5-0-0-beta-is-available/"> R – JAGS News</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/jags-5-0-0-beta-is-available/">JAGS 5.0.0-beta is available</a>]]></content:encoded>
					
		
		<enclosure url="https://martynplummer.wordpress.com/wp-content/uploads/2026/05/img_0083.jpg" length="0" type="" />
<enclosure url="https://0.gravatar.com/avatar/fdc509bd31ae635d89cccbdc64ef09464ea1c20d7858c4089a07ea3bea91b8e3?s=96&#038;d=identicon&#038;r=G" length="0" type="" />
<enclosure url="https://martynplummer.wordpress.com/wp-content/uploads/2026/05/img_0083.jpg?w=584" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">401041</post-id>	</item>
		<item>
		<title>Comparing R&#8217;s {targets} and dbt for Data Engineering</title>
		<link>https://www.r-bloggers.com/2026/05/comparing-rs-targets-and-dbt-for-data-engineering/</link>
		
		<dc:creator><![CDATA[Jonathan Carroll]]></dc:creator>
		<pubDate>Mon, 04 May 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> I’m getting more and more into data engineering these days and having used R for<br />
a long time, I’m seeing a lot of problems that look nail-shaped to my R-shaped<br />
hammer. The available tools to solve those problems exist for (presumably) very<br />
good reasons, so I wanted to ...</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/comparing-rs-targets-and-dbt-for-data-engineering/">Comparing R’s {targets} and dbt for Data Engineering</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/"> rstats on Irregularly Scheduled Programming</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p>I’m getting more and more into data engineering these days and having used R for
a long time, I’m seeing a lot of problems that look nail-shaped to my R-shaped
hammer. The available tools to solve those problems exist for (presumably) very
good reasons, so I wanted to take some time to dig into how to use them and
compare their workflows to what I would otherwise naively do in R.</p>
<p>I should mention here that I’m currently open to data/code-related opportunities
and am actively seeking a new role – if your organisation is looking for someone
aligned with my skillset, please get in touch with me any way you can, e.g.
<a href="mailto:contact@jcarroll.com.au?subject=Work%20opportunity" rel="nofollow" target="_blank">contact at jcarroll.com.au</a>.</p>
<p>I’m a firm believer in “you learn with your hands, not with your eyes” so I wanted
to actually build something. I definitely could spin up Claude Code and have it
produce the entire thing for me – and in a different project I might do that – but
in this case I want to make the mistakes myself so I can learn where the complexity
really lives and where my prior assumptions are misaligned. I did have Claude
(the chat version, not the full coding agent) guide me through the steps to
get this project running, and I did let it clean up my SQL; this project wasn’t
about learning to better optimise my SQL, but understanding exactly what it
produced will help me write a better version on my next iteration.</p>
<p>Thinking of a real-world project I could take for a spin, I decided to build some
ingestion for my personal finances. I’ve used Quickbooks previously which connects
up to my bank and helps categorise personal and business (as a freelance contractor)
expenses. I decided I’ll build my own ‘slowbooks’ processing workflow based on
some manual exports (I don’t think my bank has an API).</p>
<p>Both of the approaches I’ll compare here build on the idea of a <code>Makefile</code> which
connects up commands to run based on dependencies, and only runs what is needed;
if all the input dependencies of a step have not changed, there’s no need to
re-run that step. From what I understand, you could largely get away with just
writing some <code>Makefile</code>s (or the newer implementation
<a href="https://just.systems/man/en/" rel="nofollow" target="_blank"><code>just</code></a>) but these two approaches help to better
structure how that’s constructed.</p>
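<p>On the R side, that Makefile-style dependency graph is exactly what {targets} provides. A minimal hypothetical sketch of a <code>_targets.R</code> (the file path and column names are invented for illustration, not taken from the actual project):</p>

```r
# _targets.R: a hypothetical minimal pipeline; paths and columns are made up.
library(targets)

list(
  # Track the exported CSV itself, so edits to the file invalidate downstream targets
  tar_target(raw_file, "exports/transactions.csv", format = "file"),
  tar_target(raw, read.csv(raw_file)),
  tar_target(by_category, aggregate(amount ~ category, data = raw, FUN = sum))
)
```

<p>Running <code>tar_make()</code> then rebuilds only the targets whose upstream dependencies have changed, much like <code>make</code> skipping up-to-date rules.</p>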
<p>This is a somewhat longer post than some of mine, so here are some quick links to
the sections:</p>
<nav id="TableOfContents">
  <ul>
    <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#dbt" rel="nofollow" target="_blank">dbt</a></li>
    <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#targets" rel="nofollow" target="_blank">{targets}</a></li>
    <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#comparing-workflows" rel="nofollow" target="_blank">Comparing Workflows</a>
      <ul>
        <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#staging---load-data" rel="nofollow" target="_blank">Staging &#8211; Load Data</a></li>
        <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#intermediate---joins-and-enrichment" rel="nofollow" target="_blank">Intermediate &#8211; Joins and Enrichment</a></li>
        <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#marts---summaries-and-outputs" rel="nofollow" target="_blank">Marts &#8211; Summaries and Outputs</a></li>
        <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#tests--validation" rel="nofollow" target="_blank">Tests / Validation</a></li>
        <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#analysis" rel="nofollow" target="_blank">Analysis</a></li>
        <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#the-complete-workflow" rel="nofollow" target="_blank">The Complete Workflow</a></li>
        <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#dag--visualisation--docs" rel="nofollow" target="_blank">DAG / Visualisation / Docs</a></li>
        <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#exploration" rel="nofollow" target="_blank">Exploration</a></li>
      </ul>
    </li>
    <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#comparison" rel="nofollow" target="_blank">Comparison</a></li>
    <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#other-solutions" rel="nofollow" target="_blank">Other Solutions</a></li>
    <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#conclusion" rel="nofollow" target="_blank">Conclusion</a></li>
  </ul>
</nav>

<h2 id="dbt">dbt</h2>
<p>One tool that comes up frequently is ‘data build tool’, most commonly referred to
as just dbt, though that full name doesn’t even show up on <a href="https://www.getdbt.com/product/what-is-dbt" rel="nofollow" target="_blank">their website</a>. Started in 2016, it’s
released as a Python package (<a href="https://pypi.org/project/dbt-core/" rel="nofollow" target="_blank">dbt-core</a>)
though if you do try to just install something called ‘dbt’ you get the
<a href="https://pypi.org/project/dbt/" rel="nofollow" target="_blank">cloud CLI</a> tool which isn’t quite the same. Naming
stuff is hard.</p>
<p>It’s a way to write code you can commit, which translates to SQL and performs data
ingestion, processing, transformation, and storage in a structured way with
relationships between various steps in the workflow. It adds macros on top of
plain SQL to make the transformations easier, written in
<a href="https://github.com/pallets/jinja" rel="nofollow" target="_blank">jinja</a>, a template engine which enables
writing something more like Python within SQL.</p>
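<p>As a rough sketch of what that looks like (the model and column names here are
hypothetical, not from my project), a dbt model might use a jinja loop to generate
repetitive SQL:</p>
<pre>-- Hypothetical model: generate one summed column per payment method
{% set payment_methods = [&quot;bank_transfer&quot;, &quot;credit_card&quot;, &quot;gift_card&quot;] %}

select
    order_id,
    {% for method in payment_methods %}
    sum(case when payment_method = &#39;{{ method }}&#39; then amount else 0 end)
        as {{ method }}_amount{% if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref(&#39;raw_payments&#39;) }}
group by 1
</pre><p>At compile time, jinja expands the loop into three plain <code>sum(case ...)</code>
columns before the query ever reaches the database.</p>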
<p><a href="https://www.youtube.com/watch?v=f7_WwFmlslo" rel="nofollow" target="_blank">This episode of Data Science Lab</a>
from Posit walks through an example of using dbt, and while it’s a fantastic
overview of what a project looks like, it can’t answer all of the ‘how would I do
that?’ problems that will come up in a different project.</p>
<p>As they did, I will use DuckDB as the database &#8211; I enjoyed reading through
&#8216;<a href="https://www.manning.com/books/duckdb-in-action" rel="nofollow" target="_blank">DuckDB in Action</a>&#8217; with the
<a href="https://dslc.io/bookclubs.html" rel="nofollow" target="_blank">DSLC.io</a> book club and can definitely see the
advantages over SQLite, which I would previously have reached for in this case.</p>
<p>I installed dbt via <code>uv</code> &#8211; the
<a href="https://docs.getdbt.com/docs/local/install-dbt" rel="nofollow" target="_blank">official instructions</a> use <code>pip</code>
and I&#8217;ve been burned too many times with that tool; <code>uv</code> is much nicer.
Nonetheless, I still encountered Python-related issues: it looks like
<code>dbt</code> doesn&#8217;t yet support Python 3.14, which isn&#8217;t mentioned in their
instructions either. I got it working with these commands, pinning Python 3.12 and
adding the <code>dbt-duckdb</code> extension I plan to use, as well as <code>streamlit</code>
to make a dashboard later</p>
<pre>uv init slowbooks --python 3.12
cd slowbooks
uv add dbt-duckdb duckdb streamlit
</pre><p>I then added a <code>profiles.yml</code> in the project root, defining the database (DuckDB) I
want to produce to store the tables</p>
<pre>slowbooks:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: slowbooks.duckdb
      schema: main
</pre><p>I can then initialise the project with</p>
<pre>uv run dbt init . --skip-profile-setup
</pre><p>This creates the basic project structure, and there’s a lot going on.</p>
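<p>From a fresh <code>dbt init</code>, the generated layout looks roughly like this
(the exact folders and example files vary a little between dbt versions):</p>
<pre>slowbooks/
├── dbt_project.yml
├── models/
│   └── example/
├── analysis/
├── macros/
├── seeds/
├── snapshots/
└── tests/
</pre>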
<p>I also needed to define a dependency in <code>packages.yml</code> so that I could use the
macros</p>
<pre>packages:
  - package: dbt-labs/dbt_utils
    version: [&quot;&gt;=1.0.0&quot;]
</pre><p>and ran</p>
<pre>uv run dbt deps
</pre><p>I put my exported CSVs (several for my transaction/savings accounts and one for
my credit card) in a new <code>raw/</code> folder; my understanding is that the <code>seeds/</code>
folder is for static data, although that’s the folder used in the Posit tutorial
above.</p>
<p>I also ran some pre-processing over my CSVs to categorise the merchants. My bank
provides a &#8216;category&#8217; and &#8216;subcategory&#8217; for each item, but I wanted to be able
to override some of those with more specific definitions so that I could group by
them, e.g. &#8216;total spent on books&#8217;, since I mainly buy those from just a couple of
merchants. This produced a new CSV of patterns, resolved names, and classifications,
since the &#8216;description&#8217; of an item in my transactions might contain, e.g.</p>
<pre>Paypal *FruitShop 0401000000 Au
</pre><p>and I want to identify the ‘FruitShop’ part, so I can match against that pattern.
This <em>is</em> a (fairly) static file (the source data will occasionally be extended),
so that did go into <code>seeds/</code>.</p>
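<p>For illustration, a few (entirely made-up) rows of that seed file might look
like this, with the columns matching those referenced in the models later
(<code>keyword</code>, <code>merchant_name</code>, <code>merchant_category</code>):</p>
<pre>keyword,merchant_name,merchant_category
fruitshop,FruitShop,Groceries
bigbookstore,Big Book Store,Books
coffeeplace,Coffee Place,Dining &amp; Drinks
</pre>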
<h2 id="targets">{targets}</h2>
<p>The whole time I’ve been learning about dbt I’ve had a voice in my head asking
“can’t I just use <a href="https://docs.ropensci.org/targets/" rel="nofollow" target="_blank">{targets}</a>?” Yes, it’s an
R-specific tool, but it does a fantastic job at what it does. It’s not a new
tool at all – <a href="https://ropensci.org/blog/2021/02/03/targets/" rel="nofollow" target="_blank">this post from 2021</a>
demonstrates the power of it, and Miles McBain has been singing the praises of it
since <a href="https://www.youtube.com/watch?v=jU1Zv21GvT4" rel="nofollow" target="_blank">at least as early as 2020</a>
(along with the predecessor {drake}).</p>
<p>Rather than duplicate all of my inputs, I will just keep the {targets} implementation
as a subdirectory of my dbt project and refer to the exact same source files. I
will create a distinct database, though.</p>
<p>Installing {targets}, provided you already have a working R installation, is
as straightforward as</p>
<pre>install.packages(&quot;targets&quot;)
</pre><p>within an R session, be that in RStudio, Positron, Emacs, or a terminal.</p>
<p>As for the rest of the file structure, 100% of the R code here goes into a
<code>_targets.R</code> file &#8211; much cleaner, though that&#8217;s a tradeoff against
separating the different components into their own files.</p>
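<p>For those who haven&#8217;t seen one, the overall shape of a <code>_targets.R</code> file
(sketched here with placeholder names, following the structure described in the
{targets} documentation) is: load the package, set options, define or source helper
functions, then end by returning a list of targets</p>
<pre>library(targets)

# Packages that the targets themselves will need
tar_option_set(packages = c(&quot;dplyr&quot;, &quot;readr&quot;))

# Helper functions are defined here, or sourced from files e.g. in R/

# The pipeline itself: the file must end by returning a list of targets
list(
  tar_target(raw_file, &quot;data/input.csv&quot;, format = &quot;file&quot;),
  tar_target(cleaned, readr::read_csv(raw_file, show_col_types = FALSE))
)
</pre>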
<p><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#TableOfContents" rel="nofollow" target="_blank"><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/2b06.png" alt="⬆" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Table of Contents</a></p>
<h2 id="comparing-workflows">Comparing Workflows</h2>
<p>For the actual processing I’m going to show both dbt and {targets} approaches in
tabsets for switching back and forth.</p>
<p>In dbt, a &#8216;model&#8217; is a <code>select</code> statement producing a table, and models are
conventionally split into three layers of increasingly production-ready data. From
the dbt docs, these are defined as:</p>
<ul>
<li>Staging: Preparing atomic building blocks</li>
<li>Intermediate: Purpose-built transformation steps</li>
<li>Marts: Business-defined entities</li>
</ul>
<p>and I’m trying to stick to that as best as I can.</p>
<h3 id="staging---load-data">Staging &#8211; Load Data</h3>
<p>The first step was to ingest the raw CSV exports into a &#8216;staging&#8217; model. This is where the
initial data loading happens. For this personal project I’ve exported the CSV
files I need, and will do so again in the future, adding them to the same folder
for de-duplication within the pipeline. In a more mature project these might be
read from an API or a connection to a managed database, and both approaches can
easily switch between different ’environments’ (dev, staging, prod, …) without
adjusting much, certainly without having to rename all the dependency labels.</p>
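<p>In dbt, for example, that switch lives in <code>profiles.yml</code>: the profile below
extends the earlier one with a second (purely illustrative) <code>prod</code> output,
selectable with <code>uv run dbt run --target prod</code></p>
<pre>slowbooks:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: slowbooks.duckdb
      schema: main
    prod:
      type: duckdb
      path: /srv/data/slowbooks.duckdb   # illustrative path
      schema: main
</pre>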
<p></p>
<div class="tabset"></div>
<ul>
<li>
<p><span><img src="https://i2.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/dbt.png?fit=578%2C40&#038;ssl=1" height="40 px" data-recalc-dims="1"></span></p>
<p>First I defined the sources in <code>models/staging/sources.yml</code>, leveraging
DuckDB’s <code>read_csv()</code> to read all the CSV files in <code>raw/</code></p>
<pre>version: 2

sources:
  - name: all_raw
    schema: main
    meta:
      external_location: &quot;read_csv(&#39;raw/*.csv&#39;, filename=true, union_by_name=true, header=true)&quot;
    tables:
      - name: all_transactions
        description: &quot;All raw CSV exports&quot;
</pre><p>Reading in these files occurs in a model <code>models/staging/stg_bank.sql</code> and
<code>models/staging/stg_cc.sql</code>, the first of which is</p>
<pre>with source as (
    select * from {{ source(&#39;all_raw&#39;, &#39;all_transactions&#39;) }}
),

filtered as (
    select * from source
    where filename not like &#39;%visa_%&#39;
),

cleaned as (
    select
        -- Parse YYYYMMDD integer date format
        strptime(cast(&quot;Date&quot; as varchar), &#39;%Y%m%d&#39;)::date as date,

        -- Collapse whitespace runs in description
        regexp_replace(trim(&quot;Description&quot;), &#39;\s+&#39;, &#39; &#39;, &#39;g&#39;) as description,

        -- Debit = spend (positive), Credit = refund/income (negative)
        coalesce(&quot;Debit&quot;, 0) - coalesce(&quot;Credit&quot;, 0) as amount_aud,

        &quot;Category&quot;    as raw_category,
        &quot;SubCategory&quot; as raw_subcategory,
        filename      as raw_source
    from filtered
    where &quot;Date&quot; is not null
)

select * from cleaned
</pre><p>A similar model handles the credit card data. I&#8217;ve separated these based on
matching <code>'visa'</code> in the filename, and named my files accordingly.</p>
<p>These are then combined in <code>models/staging/stg_transactions.sql</code>, referencing
each of the dependencies with the <code>ref()</code> macro. A ‘surrogate key’ is created
to uniquely identify rows, so that when I add more data, the duplicates will
drop out. This does mean that any intentional duplicates (double records on
the same date from the same merchant for the same amount – e.g. buying one
ice-cream, dropping it, and buying another) will also be dropped, but I&#8217;m
considering that an edge case and not worrying about it.</p>
<pre>with bank as (
    select * from {{ ref(&#39;stg_bank&#39;) }}
),

cc as (
    select * from {{ ref(&#39;stg_cc&#39;) }}
),

unioned as (
    select * from bank
    union all
    select * from cc
),

with_surrogate_key as (
    select
        {{ dbt_utils.generate_surrogate_key([&#39;date&#39;, &#39;description&#39;, &#39;amount_aud&#39;]) }} as transaction_id,
        date,
        description,
        amount_aud,
        raw_category,
        raw_subcategory,
        raw_source
    from unioned
    where description not ilike &#39;%Internet Withdrawal%&#39; -- drop transfers between accounts
      -- this does include manual payments, but most of these are small
) 

select * from with_surrogate_key
</pre><p>I’ve also stripped out the ‘internet withdrawal’ records as these are mostly
transfers between my own accounts. It also includes manual transfers to e.g.
contractors or even some bills, but dealing with these didn’t seem worth the
effort.</p>
<p>One point worth noting here is that this processing is all in SQL; I definitely
got the feeling after working with this tool that it was made for data folks
who naturally reach for SQL when working with data. Personally, I prefer an
abstraction on top of my SQL, so this felt limiting to me, but tastes will
absolutely differ.</p>
<p>The merchants seed file is loaded (with <code>dbt seed</code>, or as part of
<code>dbt build</code>) as a table named <code>seed_merchants</code>, matching the file name.</p>
</li>
<li>
<p><span><img src="https://i0.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/targets.png?fit=578%2C40&#038;ssl=1" height="40 px" data-recalc-dims="1"></span></p>
<p>The equivalent in {targets} uses <code>tar_target()</code> to identify dependencies and
outputs. I start by identifying the files I want to read in. A
strict comparison would have been to do another <code>grepv()</code> with <code>invert=TRUE</code>,
but <code>setdiff()</code> works nicely here</p>
<pre>RAW_DIR &lt;- &quot;../raw&quot;

file_list &lt;- list.files(RAW_DIR, full.names = TRUE)
cc_list &lt;- grepv(&quot;visa.*\\.csv$&quot;, file_list)
bank_list &lt;- setdiff(file_list, cc_list)
</pre><p>(sidenote: ooh, yeah – I get to use that new <code>grepv()</code> added in R 4.5.0)</p>
<p>Loading the data requires a function, but now we can leverage R and its
abstractions, in this case {dplyr}, {stringr}, and {lubridate}</p>
<pre>stage_source &lt;- function(files) {
  # read_files() is a small helper (defined elsewhere) that reads each CSV
  # and row-binds them with a `filename` column
  read_files(files) |&gt;
    mutate(
      date = ymd(as.character(Date)),
      # Collapse whitespace runs — mirrors regexp_replace(trim(Description), &#39;\s+&#39;, &#39; &#39;, &#39;g&#39;)
      description = str_squish(Description),
      # Debit = spend (positive), Credit = refund/income (negative)
      amount_aud = coalesce(Debit, 0) - coalesce(Credit, 0),
      raw_category = Category,
      raw_subcategory = SubCategory,
      raw_source = filename
    ) |&gt;
    filter(!is.na(Date)) |&gt;
    select(
      date,
      description,
      amount_aud,
      raw_category,
      raw_subcategory,
      raw_source
    )
}
</pre><p>and combining the sources along with filtering out the transfers</p>
<pre>stg_transactions &lt;- function(bank, cc) {
  bind_rows(bank, cc) |&gt;
    # Drop inter-account transfers — mirrors WHERE description NOT ILIKE &#39;%Internet Withdrawal%&#39;
    filter(!str_detect(str_to_lower(description), &quot;internet withdrawal&quot;)) |&gt;
    surrogate_key(c(&quot;date&quot;, &quot;description&quot;, &quot;amount_aud&quot;)) |&gt;
    select(
      transaction_id,
      date,
      description,
      amount_aud,
      raw_category,
      raw_subcategory,
      raw_source
    )
}
</pre><p>The <code>surrogate_key</code> function is something I did have to define, but Claude
happily provided me with an equivalent to what’s in dbt</p>
<pre>surrogate_key &lt;- function(df, cols) {
  df |&gt;
    mutate(
      transaction_id = purrr::pmap_chr(pick(all_of(cols)), \(...) {
        vals &lt;- list(...)
        parts &lt;- purrr::map_chr(seq_along(vals), \(i) {
          v &lt;- vals[[i]]
          if (is.na(v)) &quot;^^NULL^^&quot; else as.character(v)
        })
        digest::digest(
          paste(parts, collapse = &quot;|&quot;),
          algo = &quot;md5&quot;,
          serialize = FALSE
        )
      })
    )
}
</pre><p>With those pieces, plus loading the merchants file, the full pipeline so far is</p>
<pre>list(
  tar_target(cc_files,   cc_list,   format = &quot;file&quot;),
  tar_target(bank_files, bank_list, format = &quot;file&quot;),

  # Staging
  tar_target(stg_bank, stage_source(bank_files)),
  tar_target(stg_cc, stage_source(cc_files)),
  tar_target(stg_txns, stg_transactions(stg_bank, stg_cc))
)
</pre></li>
</ul>
<div style="text-align: right;">
  <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#staging---load-data" rel="nofollow" target="_blank">Top of this section</a> | <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#TableOfContents" rel="nofollow" target="_blank"><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/2b06.png" alt="⬆" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Table of Contents</a>
</div>
<h3 id="intermediate---joins-and-enrichment">Intermediate - Joins and Enrichment</h3>
<p>For this ‘simple’ example there won’t be a lot of difference between an ‘intermediate’
stage and a final ‘mart’ stage, but this is where the merging with the merchant
categories occurs. The transactions from staging are loaded and joined according
to the patterns I’ve defined in the seed file.</p>
<p></p>
<div class="tabset"></div>
<ul>
<li>
<p><span><img src="https://i2.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/dbt.png?fit=578%2C40&#038;ssl=1" height="40 px" data-recalc-dims="1"></span></p>
<p>In this step the transaction items are categorised according to the merchant
file, taking the entry with the most specific pattern match. In
<code>models/intermediate/int_transactions_categorised.sql</code></p>
<pre>with transactions as (
    select * from {{ ref(&#39;stg_transactions&#39;) }}
),

merchants as (
    select * from {{ ref(&#39;seed_merchants&#39;) }}
),

matched as (
    select
        t.*,
        m.keyword,
        m.merchant_name,
        m.merchant_category
    from transactions t
    left join merchants m
        on t.description ilike &#39;%&#39; || m.keyword || &#39;%&#39;

    -- Where multiple keywords match, take the longest (most specific)
    qualify row_number() over (
        partition by t.transaction_id
        order by length(m.keyword) desc
    ) = 1
)

select
    transaction_id,
    date,
    description,
    amount_aud,
    raw_category,
    raw_subcategory,
    raw_source,
    coalesce(merchant_name,     &#39;Unknown&#39;)       as merchant_name,
    coalesce(merchant_category, &#39;Uncategorised&#39;) as merchant_category
from matched
</pre><p>From there, a monthly aggregation table is produced. The <code>date_trunc()</code> feature
makes this fairly clean, and being able to <code>sum()</code> values is nice. In
<code>models/intermediate/int_monthly_balances.sql</code></p>
<pre>select
    date_trunc(&#39;month&#39;, date)::date as month,
    sum(amount_aud) as total_spend_aud,
    count(*) as transaction_count
from {{ ref(&#39;int_transactions_categorised&#39;) }}
group by 1
</pre></li>
<li>
<p><span><img src="https://i0.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/targets.png?fit=578%2C40&#038;ssl=1" height="40 px" data-recalc-dims="1"></span></p>
<p>Rather than relying on SQL&#8217;s pattern matching, R can use the {fuzzyjoin}
package to perform the matching</p>
<pre>categorise_transactions &lt;- function(transactions, merchants) {
  # Mirrors: JOIN merchants ON description ILIKE &#39;%&#39; || keyword || &#39;%&#39;
  matched &lt;- fuzzyjoin::fuzzy_left_join(
    transactions,
    merchants,
    by = c(&quot;description&quot; = &quot;keyword&quot;),
    match_fun = \(x, y) {
      str_detect(str_to_lower(x), str_to_lower(y), negate = FALSE)
    }
  ) |&gt;
    # Where multiple keywords match, prefer the longest — mirrors QUALIFY ROW_NUMBER() OVER (... ORDER BY length(keyword) DESC) = 1
    group_by(transaction_id) |&gt;
    arrange(desc(str_length(keyword)), .by_group = TRUE) |&gt;
    slice(1) |&gt;
    ungroup() |&gt;
    mutate(
      merchant_name = coalesce(merchant_name, &quot;Unknown&quot;),
      merchant_category = coalesce(merchant_category, &quot;Uncategorised&quot;)
    ) |&gt;
    select(
      transaction_id,
      date,
      description,
      amount_aud,
      merchant_name,
      merchant_category,
      raw_category,
      raw_subcategory,
      raw_source
    )

  matched
}
</pre><p>The month aggregations are a bread-and-butter problem for {dplyr}</p>
<pre>monthly_balances &lt;- function(transactions_categorised) {
  transactions_categorised |&gt;
    mutate(month = floor_date(date, &quot;month&quot;)) |&gt;
    group_by(month) |&gt;
    summarise(
      total_spend_aud = sum(amount_aud),
      transaction_count = n(),
      .groups = &quot;drop&quot;
    )
}
</pre><p>and now the pipeline can include those steps</p>
<pre>list(
  tar_target(cc_files,   cc_list,   format = &quot;file&quot;),
  tar_target(bank_files, bank_list, format = &quot;file&quot;),

  tar_target(merchant_file, &quot;../seeds/seed_merchants.csv&quot;, format = &quot;file&quot;),
  tar_target(merchants, readr::read_csv(merchant_file, show_col_types = FALSE)),

  # Staging
  tar_target(stg_bank, stage_source(bank_files)),
  tar_target(stg_cc, stage_source(cc_files)),
  tar_target(stg_txns, stg_transactions(stg_bank, stg_cc)),

  # Intermediate
  tar_target(int_categorised, categorise_transactions(stg_txns, merchants)),
  tar_target(int_monthly, monthly_balances(int_categorised))
)
</pre></li>
</ul>
<div style="text-align: right;">
  <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#intermediate---joins-and-enrichment" rel="nofollow" target="_blank">Top of this section</a> | <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#TableOfContents" rel="nofollow" target="_blank"><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/2b06.png" alt="⬆" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Table of Contents</a>
</div>
<h3 id="marts---summaries-and-outputs">Marts - Summaries and Outputs</h3>
<p>I could create some definitive ‘data product’ results here, but for now this is
very similar to the &#8216;intermediate&#8217; stage, with one additional grouping by merchant
category as well as month</p>
<p></p>
<div class="tabset"></div>
<ul>
<li>
<p><span><img src="https://i2.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/dbt.png?fit=578%2C40&#038;ssl=1" height="40 px" data-recalc-dims="1"></span></p>
<p>This is basically just a <code>select</code>, but it does filter for uniqueness on the key.
In <code>models/marts/mart_transactions.sql</code></p>
<pre>{{
    config(
        materialized=&#39;incremental&#39;,
        unique_key=&#39;transaction_id&#39;
    )
}}

select
    transaction_id,
    date,
    description,
    amount_aud,
    merchant_name,
    merchant_category,
    raw_category,
    raw_subcategory,
    raw_source
from {{ ref(&#39;int_transactions_categorised&#39;) }}

{% if is_incremental() %}
    where transaction_id not in (select transaction_id from {{ this }})
{% endif %}
</pre><p>and finally a month/category aggregation in <code>models/marts/mart_category_trends.sql</code></p>
<pre>select
  date_trunc(&#39;month&#39;, date)::date as month,
  merchant_category,
  sum(amount_aud)                 as total_aud,
  count(*)                        as transaction_count
from {{ ref(&#39;int_transactions_categorised&#39;) }}
group by 1, 2
</pre></li>
<li>
<p><span><img src="https://i0.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/targets.png?fit=578%2C40&#038;ssl=1" height="40 px" data-recalc-dims="1"></span></p>
<p>These are essentially the same as intermediate, but with an additional
dimension for the monthly summary</p>
<pre>mart_transactions &lt;- function(transactions_categorised) {
  # Equivalent to the incremental mart — deduplication by transaction_id
  transactions_categorised |&gt;
    distinct(transaction_id, .keep_all = TRUE)
}

mart_monthly_summary &lt;- function(mart_txns) {
  mart_txns |&gt;
    mutate(month = floor_date(date, &quot;month&quot;)) |&gt;
    group_by(month, merchant_category) |&gt;
    summarise(
      total_spend_aud = sum(amount_aud),
      transaction_count = n(),
      .groups = &quot;drop&quot;
    )
}
</pre></li>
</ul>
<div style="text-align: right;">
  <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#marts---summaries-and-outputs" rel="nofollow" target="_blank">Top of this section</a> | <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#TableOfContents" rel="nofollow" target="_blank"><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/2b06.png" alt="⬆" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Table of Contents</a>
</div>
<h3 id="tests--validation">Tests / Validation</h3>
<p>One &#8216;selling point&#8217; I&#8217;ve seen for dbt is that it can run validation
tests within the workflow. That&#8217;s extremely useful for ensuring that you&#8217;re not
inadvertently producing junk data.</p>
<p></p>
<div class="tabset"></div>
<ul>
<li>
<p><span><img src="https://i2.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/dbt.png?fit=578%2C40&#038;ssl=1" height="40 px" data-recalc-dims="1"></span></p>
<p>A <code>schema.yml</code> can be added to a model folder with details of tests to be run
on the resulting object. This adds a test that the <code>transaction_id</code> is not
null and is unique, and specifies the known values of the <code>merchant_category</code>
column</p>
<pre>version: 2

models:
  - name: mart_transactions
    columns:
      - name: transaction_id
        tests:
          - not_null
          - unique
      - name: merchant_category
        tests:
        - accepted_values:
            name: merchant_category_is_valid
            values:
              - Accommodation
              - Business
              - Cash
              - Clothing
              - Dining & Drinks
              - Donations
              - Education
[..truncated..]
</pre><p>The <code>tests/</code> folder can contain additional SQL tests to be run as part of the
workflow. These just need to return a result that <em>should</em> be empty if all
goes well, with some number of rows returned if the validation fails -
i.e. no news is good news. I will set up more of these as I figure out what
else belongs in my definition of &#8216;good quality&#8217;, but for now I&#8217;ll
ensure that no records have the &#8216;Uncategorised&#8217; category, which would mean I don&#8217;t
have an entry for them in my <code>seed_merchants.csv</code> definition.</p>
<p>I’ve set the option for this to <code>'warn'</code> because while I do want to identify
those missing categories, I don’t want it to stop the workflow entirely</p>
<pre>{{ config(severity=&#39;warn&#39;) }}

select *
from {{ ref(&#39;mart_transactions&#39;) }}
where merchant_category = &#39;Uncategorised&#39;
</pre><p>In my case there are still some (34) uncategorised transactions (manual transfers),
but the <code>merchant_category_is_valid</code> validation passes</p>
<pre>10 of 13 WARN 34 assert_all_transactions_categorised ........................... [WARN 34 in 0.01s]
11 of 13 START test merchant_category_is_valid ................................. [RUN]
11 of 13 PASS merchant_category_is_valid ....................................... [PASS in 0.01s]
</pre><p>and otherwise (if I remove one of the ‘valid’ values)</p>
<pre>10 of 13 WARN 34 assert_all_transactions_categorised ........................... [WARN 34 in 0.02s]
11 of 13 START test merchant_category_is_valid ................................. [RUN]
11 of 13 FAIL 1 merchant_category_is_valid ..................................... [FAIL 1 in 0.02s]
</pre><p>which in this case shows that one category didn’t match.</p>
<p>The tests can also be run independently with</p>
<pre>uv run dbt test
</pre></li>
<li>
<p><span><img src="https://i0.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/targets.png?fit=578%2C40&#038;ssl=1" height="40 px" data-recalc-dims="1"></span></p>
<p>{targets} doesn’t have a specific way to test results, but it does have a way
to produce artefacts as part of the workflow in exactly the same way as we do
for the data, so I can run arbitrary code including a full data validation.</p>
<p>In my case I&#8217;ll use Appsilon&#8217;s
<a href="https://appsilon.github.io/data.validator/articles/targets_workflow.html" rel="nofollow" target="_blank">{data.validator}</a>,
which already has an example for {targets}, but one could just as easily use
{pointblank} or {validate}.</p>
<p>That leverages assertions from {assertr} and is again just another function</p>
<pre>run_tests &lt;- function(mart_txns) {
  report &lt;- data.validator::data_validation_report()

  data.validator::validate(mart_txns, name = &quot;mart_transactions&quot;) |&gt;
    data.validator::validate_cols(
      predicate = assertr::not_na,
      &quot;transaction_id&quot;,
      description = &quot;transaction_id is not null&quot;
    ) |&gt;
    data.validator::validate_cols(
      predicate = assertr::is_uniq,
      &quot;transaction_id&quot;,
      description = &quot;transaction_id is unique&quot;
    ) |&gt;
    data.validator::validate_cols(
      # valid_categories (defined elsewhere) holds the accepted category names
      predicate = assertr::in_set(valid_categories),
      &quot;merchant_category&quot;,
      description = &quot;merchant_category is an accepted value&quot;
    ) |&gt;
    data.validator::add_results(report)

  data.validator::save_report(report, output_file = &quot;validation_report.html&quot;)
  report
}
</pre><p>but it ties in nicely because errors here get reported correctly by {targets}.</p>
<p>Once the workflow has run, the output file can be opened</p>
<pre>browseURL(&quot;validation_report.html&quot;)
</pre><p>and if all went well it looks like this</p>
<img src="https://i1.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/targets_validation_good.png?w=578&#038;ssl=1" alt="Successful {targets} + {data.validator} run" data-recalc-dims="1"/>
<div class="figcaption">Successful {targets} + {data.validator} run</div>
<p>otherwise a failure is reported</p>
<img src="https://i2.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/targets_validation_bad.png?w=578&#038;ssl=1" alt="Errored {targets} + {data.validator} run" data-recalc-dims="1"/>
<div class="figcaption">Errored {targets} + {data.validator} run</div>
<p>and clicking on ‘Show’ opens a table of the offending results.</p>
</li>
</ul>
<div style="text-align: right;">
  <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#tests--validation" rel="nofollow" target="_blank">Top of this section</a> | <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#TableOfContents" rel="nofollow" target="_blank"><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/2b06.png" alt="⬆" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Table of Contents</a>
</div>
<h3 id="analysis">Analysis</h3>
<p>What&#8217;s the point of organising this data if we&#8217;re not going to do something with
it? This is where I start to wonder whether {targets} has a bigger picture
in mind when it connects up the data: dbt does all of its processing in SQL and
stops at producing tables, while R will happily continue on to the analysis.</p>
<p>I think this is where a separation of concerns becomes necessary, and that depends
on the scale of the data involved. While you or I working on a small project might
be very happy to tie the analysis into the data preparation all in one place,
Netflix probably wants to segregate the data processing and analysis steps into
entirely different divisions, so tying a bow on the cleaned data and letting
analysts pick it up from a database makes a lot more sense.</p>
<p>For my example, let’s say I’m interested in analysing which categories have
out-of-the-ordinary amounts of spend in a given month - have I spent more on
groceries this month? To do that, I want to calculate the average spend in each
category each month plus the variation and identify when the spend is more than
a standard deviation away from the average.</p>
<p></p>
<div class="tabset"></div>
<ul>
<li>
<p><span><img src="https://i2.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/dbt.png?fit=578%2C40&#038;ssl=1" height="40 px" data-recalc-dims="1"></span></p>
<p>Given that this needs the ‘final’ tables, it belongs in the <code>models/marts</code>
folder. There <em>is</em> an <code>analysis/</code> folder in the dbt project by default,
but that’s for ad-hoc SQL queries that need to use <code>ref()</code> but don’t necessarily
produce anything one wishes to persist.</p>
<p>Calculating the standard deviations relies on
<a href="https://duckdb.org/docs/lts/sql/functions/aggregates#stddev_sampx" rel="nofollow" target="_blank">DuckDB’s helpers</a>,
which I don’t even want to consider writing in bare SQL myself. In
<code>models/marts/mart_category_outliers.sql</code>:</p>
<pre>with monthly as (
    select * from {{ ref(&#39;mart_category_trends&#39;) }}
),

stats as (
    select
        merchant_category,
        avg(total_aud)                                    as mean_spend,
        stddev_samp(total_aud)                            as sd_spend,
        count(*)                                          as n_months
    from monthly
    where total_aud &gt; 0
    group by 1
    having count(*) &gt; 1  -- need &gt;1 observation for stddev
),

outliers as (
    select
        m.month,
        m.merchant_category,
        m.total_aud,
        s.mean_spend,
        m.total_aud - s.mean_spend                       as deviation,
        (m.total_aud - s.mean_spend) / s.sd_spend        as z_score
    from monthly m
    inner join stats s using (merchant_category)
    where m.total_aud &gt; 0
      and abs((m.total_aud - s.mean_spend) / s.sd_spend) &gt; 1
)

select * from outliers
order by abs(z_score) desc
</pre><p>This creates a new table in the database with the results.</p>
</li>
<li>
<p><span><img src="https://i0.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/targets.png?fit=578%2C40&#038;ssl=1" height="40 px" data-recalc-dims="1"></span></p>
<p>The <code>sd()</code> function in R is no stranger to anyone who&#8217;s done stats, and it
drops into this code cleanly because {dbplyr} translates {dplyr} code to SQL,
with support for DuckDB</p>
<pre>monthly_outliers &lt;- function(mart_monthly) {
  spend &lt;- mart_monthly |&gt;
    filter(total_spend_aud &gt; 0)

  stats &lt;- spend |&gt;
    group_by(merchant_category) |&gt;
    summarise(
      mean_spend = mean(total_spend_aud),
      sd_spend = sd(total_spend_aud),
      n_months = n(),
      .groups = &quot;drop&quot;
    ) |&gt;
    filter(n_months &gt; 1) # need &gt;1 observation for sd

  spend |&gt;
    inner_join(stats, by = &quot;merchant_category&quot;) |&gt;
    mutate(
      z_score = (total_spend_aud - mean_spend) / sd_spend,
      deviation = total_spend_aud - mean_spend
    ) |&gt;
    filter(abs(z_score) &gt; 1) |&gt;
    arrange(desc(abs(z_score))) |&gt;
    select(
      month,
      merchant_category,
      total_spend_aud,
      mean_spend,
      deviation,
      z_score
    )
}
</pre><p>One can examine this translation by asking {dbplyr} to show the query it generates</p>
<pre>dplyr::copy_to(
  DBI::dbConnect(duckdb::duckdb()), 
  data.frame(x = c(2, 3, 1, 5, 4)), 
  &quot;example&quot;
) |&gt; 
  dplyr::summarise(sd_x = sd(x, na.rm = TRUE)) |&gt;
  dplyr::show_query()

## &lt;SQL&gt;
## SELECT STDDEV(x) AS sd_x
## FROM example
</pre><p>which shows that <code>sd()</code> is translated to <code>STDDEV(x)</code>, DuckDB’s alias for <code>STDDEV_SAMP</code>.</p>
</li>
</ul>
<div style="text-align: right;">
  <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#analysis" rel="nofollow" target="_blank">Top of this section</a> | <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#TableOfContents" rel="nofollow" target="_blank"><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/2b06.png" alt="⬆" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Table of Contents</a>
</div>
<h3 id="the-complete-workflow">The Complete Workflow</h3>
<p>That’s all the pieces I need to push the data from the exported CSVs through the
pipeline and produce a database of monthly aggregated, categorised totals. Here’s
how it looks with each of the two tools.</p>
<p></p>
<div class="tabset"></div>
<ul>
<li>
<p><span><img src="https://i2.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/dbt.png?fit=578%2C40&#038;ssl=1" height="40" data-recalc-dims="1"></span></p>
<p>The file structure is perhaps best seen in the <code>docs</code> website (see the next
section), but essentially the files in <code>models</code> define the workflow</p>
<pre>models
├── intermediate
│   ├── int_monthly_balances.sql
│   └── int_transactions_categorised.sql
├── marts
│   ├── mart_category_outliers.sql
│   ├── mart_category_trends.sql
│   ├── mart_transactions.sql
│   └── schema.yml
└── staging
    ├── sources.yml
    ├── stg_bank.sql
    ├── stg_cc.sql
    └── stg_transactions.sql
</pre><p>Throughout the dbt steps I’ve detailed here, each <code>.sql</code> model produced a
corresponding table in the DuckDB database. Running</p>
<pre>uv run dbt build
</pre><p>runs through the DAG, identifying what needs to run before what, then runs
all the steps in order. A successful run looks like</p>
<pre>01:51:09  Found 1 seed, 8 models, 4 data tests, 2 sources, 591 macros
01:51:09  
01:51:09  Concurrency: 1 threads (target=&#39;dev&#39;)
01:51:09  
01:51:09  1 of 13 START sql view model main.stg_bank ..................................... [RUN]
01:51:09  1 of 13 OK created sql view model main.stg_bank ................................ [OK in 0.04s]
01:51:09  2 of 13 START sql view model main.stg_cc ....................................... [RUN]
01:51:09  2 of 13 OK created sql view model main.stg_cc .................................. [OK in 0.02s]
01:51:09  3 of 13 START seed file main.seed_merchants .................................... [RUN]
01:51:09  3 of 13 OK loaded seed file main.seed_merchants ................................ [INSERT 402 in 0.02s]
01:51:09  4 of 13 START sql view model main.stg_transactions ............................. [RUN]
01:51:09  4 of 13 OK created sql view model main.stg_transactions ........................ [OK in 0.04s]
01:51:09  5 of 13 START sql view model main.int_transactions_categorised ................. [RUN]
01:51:09  5 of 13 OK created sql view model main.int_transactions_categorised ............ [OK in 0.03s]
01:51:09  6 of 13 START sql view model main.int_monthly_balances ......................... [RUN]
01:51:09  6 of 13 OK created sql view model main.int_monthly_balances .................... [OK in 0.03s]
01:51:09  7 of 13 START sql table model main.mart_category_trends ........................ [RUN]
01:51:09  7 of 13 OK created sql table model main.mart_category_trends ................... [OK in 0.22s]
01:51:09  8 of 13 START sql incremental model main.mart_transactions ..................... [RUN]
01:51:09  8 of 13 OK created sql incremental model main.mart_transactions ................ [OK in 0.26s]
01:51:09  9 of 13 START sql table model main.mart_category_outliers ...................... [RUN]
01:51:09  9 of 13 OK created sql table model main.mart_category_outliers ................. [OK in 0.01s]
01:51:09  10 of 13 START test assert_all_transactions_categorised ........................ [RUN]
01:51:09  10 of 13 WARN 34 assert_all_transactions_categorised ........................... [WARN 34 in 0.01s]
01:51:09  11 of 13 START test merchant_category_is_valid ................................. [RUN]
01:51:09  11 of 13 PASS merchant_category_is_valid ....................................... [PASS in 0.01s]
01:51:09  12 of 13 START test not_null_mart_transactions_transaction_id .................. [RUN]
01:51:09  12 of 13 PASS not_null_mart_transactions_transaction_id ........................ [PASS in 0.01s]
01:51:09  13 of 13 START test unique_mart_transactions_transaction_id .................... [RUN]
01:51:09  13 of 13 PASS unique_mart_transactions_transaction_id .......................... [PASS in 0.01s]
01:51:09  
01:51:09  Finished running 1 incremental model, 1 seed, 2 table models, 4 data tests, 5 view models in 0 hours 0 minutes and 0.78 seconds (0.78s).
01:51:09  
01:51:09  Completed with 1 warning:
01:51:09  
01:51:09  Warning in test assert_all_transactions_categorised (tests/assert_all_transactions_categorised.sql)
01:51:09  Got 34 results, configured to warn if != 0
01:51:09  
01:51:09    compiled code at target/compiled/slowbooks/tests/assert_all_transactions_categorised.sql
01:51:09  
01:51:09  Done. PASS=12 WARN=1 ERROR=0 SKIP=0 NO-OP=0 TOTAL=13
</pre><p>The times on the left side are in UTC, and I can’t find a way to change that,
which may be for the best - one can always convert to local after the fact if
needed.</p>
<p>Everything completed with an <code>OK</code> status except for the assertion which I’ve
allowed to <code>WARN</code> because I haven’t categorised a handful of records.</p>
<p>Looking at the resulting database, e.g. in a terminal, shows all the tables
which have been created</p>
<pre>duckdb slowbooks.duckdb

DuckDB v1.5.2 (Variegata)
Enter &quot;.help&quot; for usage hints.
slowbooks D show tables;
┌──────────────────────────────┐
│             name             │
│           varchar            │
├──────────────────────────────┤
│ int_monthly_balances         │
│ int_transactions_categorised │
│ mart_category_outliers       │
│ mart_category_trends         │
│ mart_transactions            │
│ seed_merchants               │
│ stg_bank                     │
│ stg_cc                       │
│ stg_transactions             │
└──────────────────────────────┘
</pre></li>
<li>
<p><span><img src="https://i0.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/targets.png?fit=578%2C40&#038;ssl=1" height="40" data-recalc-dims="1"></span></p>
<p>The {targets} workflow is defined as a list within the <code>_targets.R</code> file</p>
<pre>list(
  # format=&quot;file&quot; means targets re-runs downstream when file list or contents change
  tar_target(
    cc_files,
    cc_list,
    format = &quot;file&quot;
  ),
  tar_target(
    bank_files,
    bank_list,
    format = &quot;file&quot;
  ),
  tar_target(merchant_file, &quot;../seeds/seed_merchants.csv&quot;, format = &quot;file&quot;),

  # Seeds
  tar_target(merchants, readr::read_csv(merchant_file, show_col_types = FALSE)),

  # Staging — passing file vectors directly so targets tracks dependencies correctly
  tar_target(stg_bank, stage_source(bank_files)),
  tar_target(stg_cc, stage_source(cc_files)),
  tar_target(stg_txns, stg_transactions(stg_bank, stg_cc)),

  # Intermediate
  tar_target(int_categorised, categorise_transactions(stg_txns, merchants)),
  tar_target(int_monthly, monthly_balances(int_categorised)),

  # Tests
  tar_target(validation, run_tests(int_categorised)),
  tar_target(is_violation, validation_violation(validation)),

  # Marts
  tar_target(mart_txns, mart_transactions(int_categorised)),
  tar_target(mart_monthly, mart_monthly_summary(mart_txns)),

  # Persist to DuckDB
  tar_target(
    db_mart_transactions,
    write_to_duckdb(mart_txns, &quot;mart_transactions&quot;)
  ),
  tar_target(
    db_mart_monthly_summary,
    write_to_duckdb(mart_monthly, &quot;mart_monthly_summary&quot;)
  ),

  # Analysis
  tar_target(outliers, monthly_outliers(mart_monthly))
)
</pre><p>Running <code>tar_make()</code> (or, with the added validation, <code>tar_make_catch()</code>)
from the working directory containing that file runs the workflow</p>
<pre>+ bank_files dispatched                      
&#x2714; bank_files completed [281ms, 74.42 kB]
+ cc_files dispatched
&#x2714; cc_files completed [0ms, 137.82 kB]
+ merchant_file dispatched
&#x2714; merchant_file completed [1ms, 15.89 kB]
+ stg_bank dispatched
&#x2714; stg_bank completed [88ms, 12.98 kB]
+ stg_cc dispatched
&#x2714; stg_cc completed [84ms, 18.60 kB]
+ merchants dispatched
&#x2714; merchants completed [137ms, 7.33 kB]
+ stg_txns dispatched
&#x2714; stg_txns completed [101ms, 71.10 kB]
+ int_categorised dispatched
&#x2714; int_categorised completed [871ms, 86.06 kB]                
+ mart_txns dispatched                                       
&#x2714; mart_txns completed [0ms, 86.06 kB]                        
+ validation dispatched                                      
&#x2714; validation completed [2.7s, 5.10 kB]                        
+ int_monthly dispatched                                      
&#x2714; int_monthly completed [3ms, 650 B]                          
+ mart_monthly dispatched                                     
&#x2714; mart_monthly completed [4ms, 2.78 kB]                       
+ db_mart_transactions dispatched                             
&#x2714; db_mart_transactions completed [83ms, 70 B]                 
+ is_violation dispatched                                     
&#x2714; is_violation completed [1ms, 48 B]                          
+ db_mart_monthly_summary dispatched                          
&#x2714; db_mart_monthly_summary completed [79ms, 72 B]              
+ outliers dispatched                                         
&#x2714; outliers completed [10ms, 2.30 kB]                          
&#x2714; ended pipeline [4.6s, 16 completed, 0 skipped] 
</pre><p>An important difference is that writing to the database only happened in two
of the steps at the end, so the (distinct) database only contains those tables</p>
<pre>duckdb targets/slowbooks_r.duckdb

DuckDB v1.5.2 (Variegata)
Enter &quot;.help&quot; for usage hints.
slowbooks_r D show tables;
┌──────────────────────┐
│         name         │
│       varchar        │
├──────────────────────┤
│ mart_monthly_summary │
│ mart_transactions    │
└──────────────────────┘
</pre><p>There’s nothing stopping me from also adding a <code>write_to_duckdb()</code> (a function
which just opens the database, writes a table, then closes) call for any of
the other steps, but I was satisfied that I was building the same thing in
both cases.</p>
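<p>The post doesn’t show the body of that helper, but a minimal sketch matching the
description (open, write, close) might look like the following; the <code>path</code>
default and the return value are illustrative assumptions, not the author’s code:</p>

```r
# Minimal sketch of a write-then-close DuckDB helper (illustrative only;
# the actual write_to_duckdb() in the project isn't shown in the post).
write_to_duckdb <- function(data, table_name,
                            path = "targets/slowbooks_r.duckdb") {
  con <- DBI::dbConnect(duckdb::duckdb(), dbdir = path)
  on.exit(DBI::dbDisconnect(con, shutdown = TRUE), add = TRUE)
  DBI::dbWriteTable(con, table_name, as.data.frame(data), overwrite = TRUE)
  # Return something small and hashable for {targets} to track
  table_name
}
```

<p>Returning the table name (rather than the whole data) keeps the target’s
stored result tiny while still letting downstream targets depend on the write.</p>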
</li>
</ul>
<div style="text-align: right;">
  <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#the-complete-workflow" rel="nofollow" target="_blank">Top of this section</a> | <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#TableOfContents" rel="nofollow" target="_blank"><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/2b06.png" alt="⬆" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Table of Contents</a>
</div>
<h3 id="dag--visualisation--docs">DAG / Visualisation / Docs</h3>
<p>The similarity of these approaches to a <code>Makefile</code> depends on being able to
determine what has ‘changed’ and what is the same, and this is where the two
approaches differ. Both treat the workflow as a Directed Acyclic Graph (DAG),
with steps depending on previous steps or data sources. This means I can
visualise the workflow as a graph, but it also makes for some important
differences in how things actually run.</p>
<p></p>
<div class="tabset"></div>
<ul>
<li>
<p><span><img src="https://i2.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/dbt.png?fit=578%2C40&#038;ssl=1" height="40" data-recalc-dims="1"></span></p>
<p>Every time I run <code>dbt build</code> the entire model is re-run. If the model is
‘incremental’ then it won’t need to do a full <code>CREATE TABLE</code> or re-categorise
existing records every time, but all of the steps will be re-run. This also
makes for more complexity if I <em>do</em> want to re-categorise existing records, in
which case I need to add <code>--full-refresh</code> to the build step.</p>
<p>With the model defined I can visualise it by generating and serving the
documentation site with</p>
<pre>uv run dbt docs generate && uv run dbt docs serve 
</pre><p>This builds a locally hosted site which includes all of the SQL code and a
lineage graph showing how the different pieces connect together</p>
<a href="https://i1.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/dbt_docs.png?ssl=1" rel="nofollow" target="_blank">
  <img src="https://i1.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/dbt_docs.png?w=578&#038;ssl=1" alt="dbt docs site (click to embiggen)" data-recalc-dims="1"/>
</a>
<div class="figcaption">dbt docs site (click to embiggen)</div>
<p>Expanding this pane shows more of the DAG for the project, though not all of
the connections</p>
<a href="https://i0.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/dbt_dag.png?ssl=1" rel="nofollow" target="_blank">
  <img src="https://i0.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/dbt_dag.png?w=578&#038;ssl=1" alt="dbt DAG for the whole slowbooks project (click to embiggen)" data-recalc-dims="1"/>
</a>
<div class="figcaption">dbt DAG for the whole slowbooks project (click to embiggen)</div>
<p>This is really nice, and shows how the data flows from the raw data to the
final summary.</p>
</li>
<li>
<p><span><img src="https://i0.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/targets.png?fit=578%2C40&#038;ssl=1" height="40" data-recalc-dims="1"></span></p>
<p>This is where I think {targets} may have an advantage over dbt – since the
workflow considers a hash of the data objects to determine what has changed,
even if the code remains the same, it can identify which steps of the DAG are
invalidated, and can skip over any steps which don’t need to be re-run.</p>
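<p>One way to see that skip logic without running anything is <code>tar_outdated()</code>,
which lists only the targets whose cached results have been invalidated:</p>

```r
# List the targets that the next tar_make() would actually re-run
# (run from the directory containing _targets.R); up-to-date targets
# are omitted entirely
targets::tar_outdated()
```
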
<p>This is significant when the data you’re processing is no longer necessarily
local to the machine running the pipeline. dbt performs the queries with SQL
<em>on</em> the database (in this example the tables are written in DuckDB and materialised
as views for downstream models), while the structure I’m using for {targets}
here explicitly pulls in the data for R-native processing. I <em>could</em> make it
more remote and use lazy <code>tbl()</code> operations via {dbplyr}, but it’s a trade-off
one needs to consider.</p>
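<p>That lazier alternative might be sketched as follows, keeping the computation
inside DuckDB until a final <code>collect()</code>; the table name comes from the marts
above, and the summarised column is an illustrative assumption:</p>

```r
# Keep the heavy lifting in DuckDB via a lazy tbl(); {dbplyr} translates
# the pipeline to SQL, and only the final result is pulled into R
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "slowbooks.duckdb")
dplyr::tbl(con, "mart_transactions") |>
  dplyr::count(merchant_category, name = "n_transactions") |>
  dplyr::collect()
DBI::dbDisconnect(con, shutdown = TRUE)
```
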
<p>A full DAG for the project can be produced in an editor able to render HTML
such as RStudio or Positron, with</p>
<pre>targets::tar_visnetwork()
</pre><p>producing an interactive plot</p>
<a href="https://i1.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/target_dag_utd.png?ssl=1" rel="nofollow" target="_blank">
  <img src="https://i1.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/target_dag_utd.png?w=578&#038;ssl=1" alt="Up-to-date {targets} DAG visualisation (click to embiggen)" data-recalc-dims="1"/>
</a>
<div class="figcaption">Up-to-date {targets} DAG visualisation (click to embiggen)</div>
<p>If I change some of the code in the mart definition, and re-evaluate just that
function, then re-running <code>targets::tar_visnetwork()</code> shows me which nodes
are affected</p>
<a href="https://i0.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/target_dag_inv.png?ssl=1" rel="nofollow" target="_blank">
  <img src="https://i0.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/target_dag_inv.png?w=578&#038;ssl=1" alt="Invalidated {targets} DAG visualisation (click to embiggen)" data-recalc-dims="1"/>
</a>
<div class="figcaption">Invalidated {targets} DAG visualisation (click to embiggen)</div>
<p>(note the different colour of some nodes). This is fantastic!</p>
<p>What’s more, if I have a failing test during the validation, I can see what is
downstream from that</p>
<a href="https://i1.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/targets_dag_err.png?ssl=1" rel="nofollow" target="_blank">
  <img src="https://i1.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/targets_dag_err.png?w=578&#038;ssl=1" alt="Failed validation in {targets} DAG (click to embiggen)" data-recalc-dims="1"/>
</a>
<div class="figcaption">Failed validation in {targets} DAG (click to embiggen)</div>
<p>That did require following the article in the {data.validator} docs to define</p>
<pre>validation_violation &lt;- function(report) {
  violations_exist &lt;- report$get_validations() %&gt;%
    summarise(
      sum(num.violations, na.rm = TRUE) &gt; 0
    ) %&gt;%
    pull()
  if (isTRUE(violations_exist)) {
    rlang::abort(
      &quot;Validation schema error&quot;,
      body = capture.output(report),
      class = &quot;validation_violation&quot;
    )
  }
  FALSE
}
</pre><p>and adding to the target</p>
<pre>tar_target(is_violation, validation_violation(validation)),
</pre><p>and instead of running <code>tar_make()</code>, using a <code>tryCatch()</code></p>
<pre>tar_make_catch &lt;- function() {
  tryCatch(
    tar_make(),
    validation_violation = function(e) {
      print(e)
      tar_visnetwork()
    }
  )
}
</pre><p>Incredibly powerful stuff, right?</p>
</li>
</ul>
<div style="text-align: right;">
  <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#dag--visualisation--docs" rel="nofollow" target="_blank">Top of this section</a> | <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#TableOfContents" rel="nofollow" target="_blank"><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/2b06.png" alt="⬆" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Table of Contents</a>
</div>
<h3 id="exploration">Exploration</h3>
<p>I only built the exploration dashboard as part of the dbt project because I’ve
built plenty of shiny apps - I wanted to see what Claude could build based on
this database data source. It built a streamlit app which shows the monthly spend
broken down by category, and I had it add filters for the various categories,
tables of transactions, and the monthly outliers.</p>
<p>The dashboard works well, albeit not perfectly. It looks something like this</p>
<a href="https://i1.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/slowbooks.png?ssl=1" rel="nofollow" target="_blank">
  <img src="https://i1.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/slowbooks.png?w=578&#038;ssl=1" alt="Slowbooks dashboard (click to embiggen)" data-recalc-dims="1"/>
</a>
<div class="figcaption">Slowbooks dashboard (click to embiggen)</div>
<p>There are obvious issues with it - not least that the legend is incomplete - but
for the sort of exploration I wanted to try out, it’s a great starting point.</p>
<p>It reads the summary tables directly from the database, so the analysis doesn’t
need to happen within the app – a nice separation of business logic and
visualisation.</p>
<h2 id="comparison">Comparison</h2>
<p>As a final sanity check, I’ll confirm that I get the same number of transactions
in the monthly trend tables which <em>are</em> saved to both databases, albeit with
different names</p>
<pre>duckdb slowbooks.duckdb -c &quot;select sum(transaction_count) from mart_category_trends;&quot;
┌────────────────────────┐
│ sum(transaction_count) │
│         int128         │
├────────────────────────┤
│                   2028 │
└────────────────────────┘

duckdb targets/slowbooks_r.duckdb -c &quot;select sum(transaction_count) from mart_monthly_summary;&quot;
┌────────────────────────┐
│ sum(transaction_count) │
│         int128         │
├────────────────────────┤
│                   2028 │
└────────────────────────┘
</pre><p><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f389.png" alt="🎉" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<p>As for what I like and don’t like about each approach:</p>
<ul>
<li>
<p>Language: I don’t reach for SQL as a primary language (though it’s absolutely the
second language everyone who codes with data should learn, in my opinion), so having
to write <em>everything</em> in SQL myself doesn’t appeal much to me. I’m very happy
to be able to use {dplyr} more or less everywhere and have it write the SQL for me.
On the other hand, I can see the value in moving to a language that’s closer to
the data itself – the abstractions change over time ({dplyr} is notorious for this)
so with fewer bells and whistles likely comes more stability. Provided the helper
functions are available for things like basic statistics (e.g. in DuckDB) this
doesn’t sound like too much of a downside. Doing a bit more research, it seems
that dbt <em>does</em> support using
<a href="https://docs.getdbt.com/docs/build/python-models" rel="nofollow" target="_blank">Python for models</a>,
provided the adapter supports it (which <code>dbt-duckdb</code> does), so that’s a big win
for those more familiar with Python, although I am under the impression that not
everything works exactly the same for these models.</p>
</li>
<li>
<p>Connection: I appreciate the massive leg-up that dbt offers in terms of handling
connections to sources via extensions (e.g. <code>dbt-duckdb</code>). I’m sure if you’re not
used to that then it looks like magic, but for those familiar with working with
databases via {dbplyr} and {DBI}, it loses some of the wonder. Importantly, the
dbt SQL code all runs within the database - downstream models rely on views, so
the data never really leaves the database. The R version <em>could</em> get closer to this,
but I suspect the more common use-case is to actually pull down all of the data,
in which case it’s likely to fit within RAM.</p>
</li>
<li>
<p>Version Control: For people not used to committing their work, again this seems
like a huge step up, but R users get taught fairly early on to work with git and
track their code, even if it’s just scripts. Someone used to just throwing SQL
at a database from a terminal might be rightly amazed at the benefits opened up
by tracking code this way, but for me it’s the default state.</p>
</li>
<li>
<p>Layout: dbt has so many files for even a simple project that my VSCode file
explorer runs off the screen. {targets} has everything in a single file. This
could be organised more like dbt with a liberal use of <code>source()</code> calls at the
top of <code>_targets.R</code>, say for each model and some utils.</p>
</li>
<li>
<p>Interrogation: Perhaps there’s some more tooling I’m not aware of, but the
{targets} visualisation of the DAG is a clear winner for me. Part of the tradeoff
between ‘run everything locally’ and ‘run everything remotely’ is that I can
inspect the intermediate data in the {targets} workflow with <code>tar_read(id)</code> and
see what’s happening. I <em>can</em> read the generated table in the database, but for
smallish data being able to just crack it open and have a look wins for me.</p>
</li>
</ul>
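<p>As a concrete example of that last point, inspecting any intermediate result is a
one-liner; the target name here is one from the pipeline above:</p>

```r
# Read a cached intermediate result straight from the {targets} store,
# without re-running any of the pipeline
targets::tar_read(int_categorised) |> dplyr::glimpse()
```
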
<h2 id="other-solutions">Other Solutions</h2>
<p>While I’ve focused on this comparison between dbt and {targets}, these aren’t
the only players in the game. I’m aware of <a href="https://airflow.apache.org/" rel="nofollow" target="_blank">Airflow</a>,
at least in the sense that it can ingest dbt pipelines and schedule them.
For the Python folks there’s also <a href="https://www.prefect.io/" rel="nofollow" target="_blank">prefect</a> and
<a href="https://dagster.io/" rel="nofollow" target="_blank">dagster</a>, the latter of which also has an R ingestion
route in the form of <a href="https://joekirincic.github.io/dagsterpipes/" rel="nofollow" target="_blank">dagsterpipes</a>.
A purely R solution is <a href="https://whipson.github.io/maestro/" rel="nofollow" target="_blank">maestro</a> which appears
to target (pun intended) data coming from an API or database for which {targets}
can’t identify the ‘up-to-date-ness’ (since that involves a hash of the file).</p>
<div style="text-align: right;">
  <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#TableOfContents" rel="nofollow" target="_blank"><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/2b06.png" alt="⬆" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Table of Contents</a>
</div>
<h2 id="conclusion">Conclusion</h2>
<p>I’ve vastly grown my understanding of both dbt and {targets} and have a much
greater appreciation for what goes into using each of these to move and curate
data. Plus, now I have a cool new toy I’ve built to explore my finances. I’m not
sharing the code itself – partly so that I don’t risk committing my own finance
data by accident, and partly because what I’ve done here isn’t anything you need
to build on; if you’re interested in learning either or both of these tools, I
recommend you do what I did and build a toy project.</p>
<p>I’m interested to hear what you think of this comparison – have I overlooked
some significant difference or similarity? Some use-case where one of them would
really shine over the other? Have I misrepresented something? I’m here to learn,
so by all means please do let me know. And if you’re looking for someone with
a history of programming and data who digs into projects this way, I’m on the
market for opportunities.</p>
<p>As always, I can be found on
<a href="https://fosstodon.org/@jonocarroll" rel="nofollow" target="_blank">Mastodon</a> and the comment section below.</p>
<br />
<details>
  <summary>
    <tt>devtools::session_info()</tt>
  </summary>
<pre>## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.5.3 (2026-03-11)
##  os       macOS Tahoe 26.3.1
##  system   aarch64, darwin20
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Australia/Adelaide
##  date     2026-05-04
##  pandoc   3.6.3 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
##  quarto   1.7.31 @ /usr/local/bin/quarto
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blob          1.3.0   2026-01-14 [1] CRAN (R 4.5.2)
##  blogdown      1.23    2026-01-18 [1] CRAN (R 4.5.2)
##  bookdown      0.46    2025-12-05 [1] CRAN (R 4.5.2)
##  bslib         0.10.0  2026-01-26 [1] CRAN (R 4.5.2)
##  cachem        1.1.0   2024-05-16 [1] CRAN (R 4.5.0)
##  cli           3.6.5   2025-04-23 [1] CRAN (R 4.5.0)
##  DBI           1.3.0   2026-02-25 [1] CRAN (R 4.5.2)
##  dbplyr        2.5.2   2026-02-13 [1] CRAN (R 4.5.2)
##  devtools      2.4.6   2025-10-03 [1] CRAN (R 4.5.0)
##  digest        0.6.39  2025-11-19 [1] CRAN (R 4.5.2)
##  dplyr         1.2.1   2026-04-03 [1] CRAN (R 4.5.2)
##  duckdb        1.5.2   2026-04-13 [1] CRAN (R 4.5.2)
##  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.5.0)
##  evaluate      1.0.5   2025-08-27 [1] CRAN (R 4.5.0)
##  fastmap       1.2.0   2024-05-15 [1] CRAN (R 4.5.0)
##  fs            1.6.7   2026-03-06 [1] CRAN (R 4.5.2)
##  generics      0.1.4   2025-05-09 [1] CRAN (R 4.5.0)
##  glue          1.8.1   2026-04-17 [1] CRAN (R 4.5.2)
##  htmltools     0.5.9   2025-12-04 [1] CRAN (R 4.5.2)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.5.0)
##  jsonlite      2.0.0   2025-03-27 [1] CRAN (R 4.5.0)
##  knitr         1.51    2025-12-20 [1] CRAN (R 4.5.2)
##  lifecycle     1.0.5   2026-01-08 [1] CRAN (R 4.5.2)
##  magrittr      2.0.4   2025-09-12 [1] CRAN (R 4.5.0)
##  memoise       2.0.1   2021-11-26 [1] CRAN (R 4.5.0)
##  otel          0.2.0   2025-08-29 [1] CRAN (R 4.5.0)
##  pillar        1.11.1  2025-09-17 [1] CRAN (R 4.5.0)
##  pkgbuild      1.4.8   2025-05-26 [1] CRAN (R 4.5.0)
##  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.5.0)
##  pkgload       1.5.0   2026-02-03 [1] CRAN (R 4.5.2)
##  purrr         1.2.2   2026-04-10 [1] CRAN (R 4.5.2)
##  R6            2.6.1   2025-02-15 [1] CRAN (R 4.5.0)
##  remotes       2.5.0   2024-03-17 [1] CRAN (R 4.5.0)
##  rlang         1.1.7   2026-01-09 [1] CRAN (R 4.5.2)
##  rmarkdown     2.30    2025-09-28 [1] CRAN (R 4.5.0)
##  rstudioapi    0.18.0  2026-01-16 [1] CRAN (R 4.5.2)
##  sass          0.4.10  2025-04-11 [1] CRAN (R 4.5.0)
##  sessioninfo   1.2.3   2025-02-05 [1] CRAN (R 4.5.0)
##  tibble        3.3.1   2026-01-11 [1] CRAN (R 4.5.2)
##  tidyselect    1.2.1   2024-03-11 [1] CRAN (R 4.5.0)
##  usethis       3.2.1   2025-09-06 [1] CRAN (R 4.5.0)
##  vctrs         0.7.1   2026-01-23 [1] CRAN (R 4.5.2)
##  withr         3.0.2   2024-10-28 [1] CRAN (R 4.5.0)
##  xfun          0.56    2026-01-18 [1] CRAN (R 4.5.2)
##  yaml          2.3.12  2025-12-10 [1] CRAN (R 4.5.2)
## 
##  [1] /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library
## 
## ──────────────────────────────────────────────────────────────────────────────
</pre></details>
<br />
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/"> rstats on Irregularly Scheduled Programming</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/comparing-rs-targets-and-dbt-for-data-engineering/">Comparing R’s {targets} and dbt for Data Engineering</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">401038</post-id>	</item>
		<item>
		<title>The Magic of In-Context Learning (ICL): When Your Model Already Knows Your Data</title>
		<link>https://www.r-bloggers.com/2026/05/the-magic-of-in-context-learning-icl-when-your-model-already-knows-your-data/</link>
		
		<dc:creator><![CDATA[Learning Machines]]></dc:creator>
		<pubDate>Sun, 03 May 2026 15:10:05 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://blog.ephorie.de/?p=7029</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> Have you ever looked at a freshly plotted scatter plot and immediately thought, “Ah, this is clearly a logarithmic curve with some heteroskedastic noise,” without running a single line of modeling code? How do you do that? You don’t perform gradient descent in your head. You use your intuition! ...</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/the-magic-of-in-context-learning-icl-when-your-model-already-knows-your-data/">The Magic of In-Context Learning (ICL): When Your Model Already Knows Your Data</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://blog.ephorie.de/the-magic-of-in-context-learning-icl-when-your-model-already-knows-your-data?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=the-magic-of-in-context-learning-icl-when-your-model-already-knows-your-data"> R-Bloggers – Learning Machines</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p><img loading="lazy" fetchpriority="high" decoding="async" src="https://i0.wp.com/blog.ephorie.de/wp-content/uploads/2026/05/ChatGPT-Image-May-3-2026-04_51_25-PM-e1777819949657-243x300.png?resize=243%2C300&#038;ssl=1" alt="" width="243" height="300" class="alignleft size-medium wp-image-7032" srcset_temp="https://i0.wp.com/blog.ephorie.de/wp-content/uploads/2026/05/ChatGPT-Image-May-3-2026-04_51_25-PM-e1777819949657-243x300.png?resize=243%2C300&#038;ssl=1 243w, https://blog.ephorie.de/wp-content/uploads/2026/05/ChatGPT-Image-May-3-2026-04_51_25-PM-e1777819949657-829x1024.png 829w, https://blog.ephorie.de/wp-content/uploads/2026/05/ChatGPT-Image-May-3-2026-04_51_25-PM-e1777819949657-768x949.png 768w, https://blog.ephorie.de/wp-content/uploads/2026/05/ChatGPT-Image-May-3-2026-04_51_25-PM-e1777819949657.png 1068w" sizes="(max-width: 243px) 85vw, 243px" data-recalc-dims="1" /><br />
Have you ever looked at a freshly plotted scatter plot and immediately thought, “<em>Ah, this is clearly a logarithmic curve with some heteroskedastic noise,</em>” without running a single line of modeling code? How do you do that? You don’t perform gradient descent in your head. You use your <em>intuition</em>!<br />
<span id="more-7029"></span></p>
<p>As an experienced data scientist, you have seen thousands of datasets in your career. When confronted with new data, your natural neural network (a.k.a. brain) simply draws on this vast library of past mathematical shapes and immediately recognizes the pattern. But what if an artificial neural network could do exactly the same thing? What if it could predict your data without actually being trained on it?</p>
<p>Welcome to the mind-bending world of <em>In-Context Learning (ICL)</em> for tabular data, brought to R via the incredible new <code>TabPFN</code> package (on CRAN).</p>
<h2>The Transformer: From Text to Tables</h2>
<p>To understand ICL, we have to talk about <em>Large Language Models</em> like ChatGPT (see also <a href="https://blog.ephorie.de/building-your-own-mini-chatgpt-with-r-from-markov-chains-to-transformers" rel="nofollow" target="_blank">Building Your Own Mini-ChatGPT with R: From Markov Chains to Transformers!</a>). When you give a chatbot an unfinished sentence, it doesn’t retrain its weights to guess the next word. It uses a <em>Transformer </em>architecture equipped with an attention mechanism (see also <a href="https://blog.ephorie.de/attention-what-lies-at-the-core-of-chatgpt" rel="nofollow" target="_blank">Attention! What lies at the Core of ChatGPT? (Also as a Video!)</a>). It reads the words you provided, understands the dependencies between them (the grammar and context), and instantly extrapolates what comes next.</p>
<p>The genius of <code>TabPFN</code> is taking this exact architecture and applying it to spreadsheets. Instead of a sequence of words, the Transformer reads a sequence of data rows. It treats your features (<em>X</em>) and your target (<em>Y</em>) like the grammar of a language. By comparing all the rows and columns simultaneously in its “context window,” it figures out the dependencies in the table just like a language model figures out dependencies in text.</p>
<p>The model that arises is a <em>foundation model for tabular data</em>, or <em>tabular foundation model</em> for short.</p>
<blockquote><p>This process is formally known as Few-Shot Learning. You aren’t giving the model an empty brain to train; you are “prompting” a pre-trained brain with a few dozen (or a few hundred) “shots” (rows) of your data to establish the pattern!</p></blockquote>
<h2>The Training Matrix: Learning the Shape of Maths</h2>
<p>You might be wondering: <em>If it isn’t training on my data, what exactly was it trained on?</em></p>
<p>This is where it gets incredibly cool. The researchers who built <code>TabPFN</code> didn’t train it on real-world datasets like housing prices or medical records. Instead, they wrote algorithms to generate millions of completely random, artificially created mathematical dependency structures.</p>
<p>They forced the network to practice on synthetic datasets containing every statistical quirk imaginable: linear trends, severe non-linearities, bizarre interaction effects, extreme missing data mechanisms, and sheer noise. Because it spent its entire training solving billions of abstract maths puzzles, the model learned the fundamental <em>shape</em> of causal mathematical dependencies. When it sees your real-world data, it’s just recognizing a pattern it has already solved synthetically a thousand times before.</p>
<h2>Let’s see it in action</h2>
<p>Let’s use the venerable <code>iris</code> dataset. Because <code>iris</code> is small and the mathematical boundaries are very clear, it’s the perfect candidate for few-shot learning. Notice how the code looks exactly like traditional machine learning, but under the hood, <em>no training is actually happening!</em></p>
<pre>
# Load the package
library(tabpfn)

# 1. Prepare the Data
set.seed(42)
train_indices &lt;- sample(seq_len(nrow(iris)), size = 0.7 * nrow(iris))

iris_train &lt;- iris[train_indices, ]
iris_test  &lt;- iris[-train_indices, ]

# 2. Fit the Model
cat(&quot;Generating embeddings...\n&quot;)
## Generating embeddings...
tab_fit &lt;- tab_pfn(Species ~ ., data = iris_train)

# 3. Make Predictions
cat(&quot;Predicting...\n&quot;)
## Predicting...
predictions &lt;- predict(tab_fit, new_data = iris_test)

# 4. Check the accuracy
accuracy &lt;- sum(predictions$.pred_class == iris_test$Species) / nrow(iris_test)
cat(&quot;\nSuccess! Overall Accuracy:&quot;, round(accuracy * 100, 1), &quot;%\n&quot;)
## 
## Success! Overall Accuracy: 97.8 %
</pre>
<p>When you run this, you will see an accuracy of 97.8%. The model looked at the few examples in <code>iris_train</code>, instantly recognized the multidimensional shapes separating the species using its synthetic intuition, and accurately classified the new test data without a single epoch of traditional backpropagation.</p>
<h2>Conclusion</h2>
<p><code>TabPFN</code> is a paradigm shift. For small to medium tabular datasets, we no longer need to spend hours tuning hyperparameters for Random Forests or XGBoost. We can simply hand the data to an experienced, mathematically omniscient Transformer and let In-Context Learning do the heavy lifting.</p>
<p>Give it a try on your own data, and tell us about your experience in the comments below!</p>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://blog.ephorie.de/the-magic-of-in-context-learning-icl-when-your-model-already-knows-your-data?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=the-magic-of-in-context-learning-icl-when-your-model-already-knows-your-data"> R-Bloggers – Learning Machines</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/the-magic-of-in-context-learning-icl-when-your-model-already-knows-your-data/">The Magic of In-Context Learning (ICL): When Your Model Already Knows Your Data</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">401005</post-id>	</item>
		<item>
		<title>Bad Weather and the Subway</title>
		<link>https://www.r-bloggers.com/2026/05/bad-weather-and-the-subway/</link>
		
		<dc:creator><![CDATA[Kieran Healy]]></dc:creator>
		<pubDate>Sat, 02 May 2026 12:59:15 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://kieranhealy.org/blog/archives/2026/05/02/bad-weather-and-the-subway/</guid>

					<description><![CDATA[<p>Snow in Inwood, New York. Photograph by the author.</p>
<p>Recently I’ve been looking at hourly ridership data from the New York City Subway. Last time we learned that people go to work in the morning and come home in the eve...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/bad-weather-and-the-subway/">Bad Weather and the Subway</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://kieranhealy.org/blog/archives/2026/05/02/bad-weather-and-the-subway/"> R on kieranhealy.org</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<figure class="full-width"><a href="https://i0.wp.com/kieranhealy.org/blog/archives/2026/05/02/bad-weather-and-the-subway/snow-in-nyc.png?ssl=1" rel="nofollow" target="_blank">
    <img src="https://i0.wp.com/kieranhealy.org/blog/archives/2026/05/02/bad-weather-and-the-subway/snow-in-nyc.png?w=578&#038;ssl=1"
         alt="Two figures walking in the snow; trees in the distance." data-recalc-dims="1"/></a><figcaption>
            <p>Snow in Inwood, New York. Photograph by the author.</p>
        </figcaption>
</figure>
<p>Recently I’ve been looking at hourly ridership data from the New York City Subway. Last time we learned that <a href="https://kieranhealy.org/blog/archives/2026/04/25/hourly-subway-station-flows/" rel="nofollow" target="_blank">people go to work in the morning and come home in the evening</a>, for example. (All together now: “Only in New York, baby!”) Today we’ll learn that bad weather makes people stay at home. Except, sometimes it doesn’t.</p>
<p>Regular readers will recall that the subway system <a href="https://kieranhealy.org/blog/archives/2025/02/19/mta-ridership/" rel="nofollow" target="_blank">carries a <em>lot</em> of passengers every day</a>. The ridership data for the whole of 2025 represents just over 1.3 billion entries into the system via an OMNY tap or Metrocard. It’s available aggregated to hourly resolution by station complex. With that data in hand, we can calculate average hourly ridership for every day of the week. This gives us a profile of what, for example, a Monday or a Wednesday typically looks like, by hour. When calculating the average day-of-the-week profile we exclude holidays and the like.</p>
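<p>The baseline computation described above can be sketched in a few lines of base R. The column names and the synthetic Poisson counts below are made up for illustration; the real MTA data has its own schema:</p>

```r
# Sketch of a day-of-week hourly ridership profile on fake data.
# Four full weeks of hourly timestamps:
set.seed(1)
ts <- seq(as.POSIXct("2025-01-01 00:00", tz = "UTC"),
          by = "hour", length.out = 24 * 7 * 4)

rides <- data.frame(
  wday    = weekdays(ts),
  hour    = as.integer(format(ts, "%H")),
  entries = rpois(length(ts), lambda = 1000)  # fake ridership counts
)

# One mean per (weekday, hour) cell: the baseline profile
profile <- aggregate(entries ~ wday + hour, data = rides, FUN = mean)
nrow(profile)  # 168 rows: 7 weekdays x 24 hours
```

A specific date’s observed curve can then be plotted against the matching weekday’s profile, which is essentially what the gray-versus-red panels in the figure do.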
<p>Meanwhile, the National Weather Service provides data on severe weather events that affected the New York City region in 2025. We could get more fine-grained if we wanted to, but for now we’ll just use the <a href="https://www.weather.gov/okx/stormarchive" rel="nofollow" target="_blank">general list of events</a> the NWS provides. Then we plot the Subway ridership profile for that specific date against the average profile for whatever day of the week the event happened on.</p>
<figure class="full-width"><a href="https://i0.wp.com/kieranhealy.org/blog/archives/2026/05/02/bad-weather-and-the-subway/rhythms_2025_weather.png?ssl=1" rel="nofollow" target="_blank">
    <img src="https://i0.wp.com/kieranhealy.org/blog/archives/2026/05/02/bad-weather-and-the-subway/rhythms_2025_weather.png?w=578&#038;ssl=1"
         alt="Small multiple showing generally suppressive relationship between subway ridership and adverse weather days in 2025." data-recalc-dims="1"/></a><figcaption>
            <p>Bad weather suppresses Subway ridership, in general. But not always.</p>
        </figcaption>
</figure>
<p>The gray lines are the baseline. The red ones are the bad weather day. The basic shape of the gray lines (and many of the red ones) is set by the rhythm of daily life. The sharp double-peak pattern is what someone I’ve shown too many of these graphs to has taken to calling “The Giant Cat-Ears of Employment”. The cat-ear shapes vary by work day (which might be the topic of another post), but are most sharply-contrasted with the weekends, which look more like little hillocks or <a href="https://en.wikipedia.org/wiki/Drumlin" rel="nofollow" target="_blank">drumlins</a>.</p>
<p>We can see a few different cases in the panels. First are days when the weather event put no dent at all in people’s day. This is because <del>of the incredible toughness resilience of New Yorkers, something they are surprisingly very modest about</del> even though there was a weather event in the region that day, it just didn’t impinge on the city much, or at all. The light snow on February 11th or the heavy rain on March 6th are examples here. People just continued to go about their business.</p>
<p>Second are cases where there’s a lot of travel suppression but it’s not really—or not wholly—the weather that’s responsible. The winter storm on Friday December 26th is a case of this. That’s not a regular Friday. Many people are able to stay at home anyway, because it’s the day after Christmas.</p>
<p>Third are cases where the weather does seem to have suppressed travel. These are days like the snow on January 19th, or the shitty weather on Sunday February 16th. These events look like they made people stay at home. Some of these are more severe than others. The strongest example is the flash flooding on Thursday July 31st. That happened in the back half of the day and affected the evening commute directly.</p>
<p>Our fourth and final category is my favorite one. Sometimes snow makes no difference at all, especially if it’s on a workday. Sometimes it’s snowy on the weekend but you’re kind of sick of it, maybe because it’s late in the winter, so you’re either going about your business as usual or you’re just staying indoors. But there’s another kind of snow day.</p>
<figure class="full-width"><a href="https://i1.wp.com/kieranhealy.org/blog/archives/2026/05/02/bad-weather-and-the-subway/rhythms_2025_weather_storm_dec13.png?ssl=1" rel="nofollow" target="_blank">
    <img src="https://i1.wp.com/kieranhealy.org/blog/archives/2026/05/02/bad-weather-and-the-subway/rhythms_2025_weather_storm_dec13.png?w=578&#038;ssl=1"
         alt="A close up of Dec 13th and 14th, when the first snow of the season fell and it made people want to go outside." data-recalc-dims="1"/></a><figcaption>
            <p>Let’s go exploring.</p>
        </figcaption>
</figure>
<p>The weekend of <a href="https://www.weather.gov/okx/20251213_14" rel="nofollow" target="_blank">December 13th and 14th 2025</a> brought the city’s <a href="https://weather.com/news/news/2025-12-14-first-snow-new-york-city" rel="nofollow" target="_blank">first measurable snow of the year</a>, and in decent amounts, too—<a href="https://www.weather.gov/okx/20251213_14" rel="nofollow" target="_blank">between four and eight inches of accumulation</a>. Reports remarked on how long it had been in arriving. The result was that, over the weekend, ridership on the subway went <em>up</em>. Maybe on the Saturday it was to go out and buy the mandatory bread, milk, and eggs.<sup id="fnref:1"><a href="https://kieranhealy.org/blog/archives/2026/05/02/bad-weather-and-the-subway/#fn:1" class="footnote-ref" role="doc-noteref" rel="nofollow" target="_blank">1</a></sup>  But maybe it was also just to be out in the snow. The next day, the people who didn’t have to go to work slept in as usual. But that day, too, across the afternoon, more people than usual headed outside and took the subway somewhere. I’d like to think a bunch of them had a sled under their arm.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Maybe, at least for some New Yorkers, it was because it made more sense to take the subway than drive. Though this probably wouldn’t be all that many people. It’d be somewhat possible to investigate this with the data at hand, especially if e.g. outlying stations showed higher ridership rates. <a href="https://kieranhealy.org/blog/archives/2026/05/02/bad-weather-and-the-subway/#fnref:1" class="footnote-backref" role="doc-backlink" rel="nofollow" target="_blank"><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></p>
</li>
</ol>
</div>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://kieranhealy.org/blog/archives/2026/05/02/bad-weather-and-the-subway/"> R on kieranhealy.org</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/bad-weather-and-the-subway/">Bad Weather and the Subway</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400985</post-id>	</item>
		<item>
		<title>Learning &#038; Exploring Survival Analysis Part 1 &#8211; A Note To Myself</title>
		<link>https://www.r-bloggers.com/2026/05/learning-exploring-survival-analysis-part-1-a-note-to-myself/</link>
		
		<dc:creator><![CDATA[r on Everyday Is A School Day]]></dc:creator>
		<pubDate>Sat, 02 May 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://www.kenkoonwong.com/blog/survival/</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; ">
A note to myself on survival analysis — KM curves, log-rank tests &#038; Cox models 🧮 If I wrote it the way I understood it, maybe I’ll actually remember it 🤞</p>
<p>Motivations</p>
<p>We see survival analysis or more generally call...</p></div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/learning-exploring-survival-analysis-part-1-a-note-to-myself/">Learning & Exploring Survival Analysis Part 1 – A Note To Myself</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://www.kenkoonwong.com/blog/survival/"> r on Everyday Is A School Day</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<blockquote>
<p>A note to myself on survival analysis — KM curves, log-rank tests &#038; Cox models <img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f9ee.png" alt="🧮" class="wp-smiley" style="height: 1em; max-height: 1em;" /> If I wrote it the way I understood it, maybe I’ll actually remember it <img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f91e.png" alt="🤞" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
</blockquote>




<h2 id="motivations">Motivations
  <a href="https://www.kenkoonwong.com/blog/survival/#motivations" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<p>We see survival analysis, or more generally time-to-event analysis, almost all the time when we review journal articles in NEJM and the like. Even though we understand the heuristics for interpreting some of the simpler results, I realized that I needed to look at this a bit closer to fully understand the workings and math behind it. A recent project made me feel that my understanding was not as solid as I had hoped, after talking to one of my statistician colleagues, who also happens to have written 
<a href="https://www.emilyzabor.com/survival-analysis-in-r.html" rel="nofollow" target="_blank">this blog</a>. Please take a look at Emily’s blog for a better and more accurate survival analysis tutorial. This post is more for my own learning, so that I can refer back to the fundamentals when I need a refresher in the future. Also, if I write it the way I understood it, maybe that will increase the probability of me recollecting it later. What are we waiting for? Let’s time-to-event this analysis!</p>




<h2 id="objectives">Objectives:
  <a href="https://www.kenkoonwong.com/blog/survival/#objectives" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<p>
<a href="https://www.kenkoonwong.com/blog/survival/#time" rel="nofollow" target="_blank">Time-to-event Analysis</a></p>
<ul>
<li>
<a href="https://www.kenkoonwong.com/blog/survival/#function" rel="nofollow" target="_blank">Survival function</a></li>
<li>
<a href="https://www.kenkoonwong.com/blog/survival/#handcalc" rel="nofollow" target="_blank">Let’s Calculate By Hand</a></li>
<li>
<a href="https://www.kenkoonwong.com/blog/survival/#sim" rel="nofollow" target="_blank">Simulation</a>
<ul>
<li>
<a href="https://www.kenkoonwong.com/blog/survival/#km" rel="nofollow" target="_blank">Kaplan-Meier Estimator</a></li>
<li>
<a href="https://www.kenkoonwong.com/blog/survival/#cox" rel="nofollow" target="_blank">Cox Proportional Hazard Model</a></li>
</ul>
</li>
<li>
<a href="https://www.kenkoonwong.com/blog/survival/#ack" rel="nofollow" target="_blank">Acknowledgement</a></li>
<li>
<a href="https://www.kenkoonwong.com/blog/survival/#opportunities" rel="nofollow" target="_blank">Oppotunities For Improvement</a></li>
<li>
<a href="https://www.kenkoonwong.com/blog/survival/#lessons" rel="nofollow" target="_blank">Lessons Learnt</a></li>
</ul>




<h2 id="time">Time-to-event Analysis
  <a href="https://www.kenkoonwong.com/blog/survival/#time" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<p>The name “survival analysis” is a bit misleading if you first encounter it outside of clinical research. The “survival” doesn’t necessarily mean staying alive — it means surviving without experiencing the event. And that event does not necessarily have to be mortality; it could be any unwanted event. Outside of clinical research, an event could be the time when a Waymo arrives at your doorstep or when someone flakes out. <img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f923.png" alt="🤣" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Hence “time-to-event analysis” seems to be the better terminology, in my opinion.</p>
<p>This is different from good ol’ regression because <code>time is the outcome</code>: not only whether the event occurred (binary), but when! Now, if you’re like me, you might think that’s just negative binomial regression, right? Not quite, because there is an additional special feature of time-to-event analysis called <code>censoring</code>.</p>
<blockquote>
<p>Censoring can mean that the event did not occur, but it can also mean that we lost track of the patient, or the study ended before the event occurred.</p>
</blockquote>
<p>I don’t know about you, but for me censoring has a negative connotation. It sounds like we’re hiding something. In survival analysis, though, censoring is actually a good thing: it means we have partial information about the time to event, even if we don’t know the exact time. So think of censoring as the last time we noticed that the event DID NOT happen; it’s usually coded as 0. In good ol’ regression, we usually either do a complete-case analysis (throw out missing data) or impute. But imputing an outcome is a tad odd, 
<a href="https://stats.stackexchange.com/questions/46226/multiple-imputation-for-outcome-variables" rel="nofollow" target="_blank">in most cases except here</a>. The most common form is <code>right censoring</code>, meaning we lose track of someone on the right side of the timeline.</p>




<h2 id="function">Survival Function
  <a href="https://www.kenkoonwong.com/blog/survival/#function" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<p>The <code>survival function</code>, written S(t), answers one simple question: “What is the probability that a person has NOT yet experienced the event by time t?”. At t=0, everyone is event-free, so S(0) = 1 (100%). As time goes on, people experience the event, and S(t) decreases.</p>
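<p>In the simplest case with no censoring at all, S(t) is just the proportion of subjects whose event time exceeds t. A toy base-R sketch (event times invented for illustration):</p>

```r
# Five made-up event times, no censoring
event_times <- c(2, 5, 6, 9, 12)

# Empirical survival function: share of subjects still event-free at t
S <- function(t) mean(event_times > t)

S(0)    # 1.0 -- everyone is event-free at the start
S(5.5)  # 0.6 -- 3 of the 5 event times exceed 5.5
S(20)   # 0.0 -- everyone has had the event by then
```

Censoring is what breaks this simple proportion, and it is why we need the Kaplan-Meier estimator below.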




<h2 id="handcalc">Let’s Calculate By Hand
  <a href="https://www.kenkoonwong.com/blog/survival/#handcalc" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<table>
<thead>
<tr>
<th>patient</th>
<th>time (months)</th>
<th>status</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>2</td>
<td>1 (event)</td>
</tr>
<tr>
<td>B</td>
<td>3</td>
<td>0 (censored)</td>
</tr>
<tr>
<td>C</td>
<td>5</td>
<td>1 (event)</td>
</tr>
<tr>
<td>D</td>
<td>6</td>
<td>1 (event)</td>
</tr>
<tr>
<td>E</td>
<td>8</td>
<td>0 (censored)</td>
</tr>
</tbody>
</table>
<p>Alright, the above looks quite self-explanatory. We have 5 patients, and we are tracking their time to event in months. Patient A experienced the event at 2 months, while patient B was censored at 3 months (we lost track of them). Patient C had the event at 5 months, patient D at 6 months, and patient E was censored at 8 months. Now let’s do some calculation.</p>
<p>Formula:
$$
\hat{S}(t) = \prod_{i:\, t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)
$$</p>
<ul>
<li><code>\(\hat{S}(t)\)</code>: The estimated survival function; the probability of surviving beyond time <code>\(t\)</code>.</li>
<li><code>\(\prod_{i:\, t_i \leq t}\)</code>: Product over all event times <code>\(t_i\)</code> that are less than or equal to <code>\(t\)</code>.</li>
<li><code>\(t_i\)</code>: The <code>\(i\)</code>-th observed event (death/failure) time.</li>
<li><code>\(d_i\)</code>: The number of events (deaths/failures) that occurred at time <code>\(t_i\)</code>.</li>
<li><code>\(n_i\)</code>: The number of individuals at risk (still alive and under observation) just before time <code>\(t_i\)</code>.</li>
<li><code>\(\frac{d_i}{n_i}\)</code>: The estimated probability of the event occurring at time <code>\(t_i\)</code>.</li>
<li><code>\(1 - \frac{d_i}{n_i}\)</code>: The estimated probability of <strong>surviving</strong> through time <code>\(t_i\)</code>.</li>
</ul>
<p>or, more simply, as a running product over successive event times:</p>
<p><code>\(S(t_i) = S(t_{i-1}) \times (1 - d_i/n_i)\)</code></p>
<p>Let’s calculate by hand:</p>
<table>
<thead>
<tr>
<th>time</th>
<th>at risk (n)</th>
<th>event (d)</th>
<th>S(t)</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>5</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>2</td>
<td>5</td>
<td>1</td>
<td>1*(1-1/5)=0.8</td>
</tr>
<tr>
<td>3</td>
<td>5-1=4</td>
<td>0</td>
<td>0.8*(1-0)=0.8</td>
</tr>
<tr>
<td>5</td>
<td>4-1=3</td>
<td>1</td>
<td>0.8*(1-1/3)=0.5333</td>
</tr>
<tr>
<td>6</td>
<td>3-1=2</td>
<td>1</td>
<td>0.5333*(1-1/2)=0.2667</td>
</tr>
<tr>
<td>8</td>
<td>2-1=1</td>
<td>0</td>
<td>0.2667*(1-0)=0.2667</td>
</tr>
</tbody>
</table>
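<p>As a quick sanity check (not in the original post), the same numbers fall out of <code>survival::survfit</code> on this five-patient table, so we can verify the hand calculation against the package:</p>

```r
library(survival)

# The five-patient example from the table above
patients <- data.frame(
  time   = c(2, 3, 5, 6, 8),
  status = c(1, 0, 1, 1, 0)  # 1 = event, 0 = censored
)

fit <- survfit(Surv(time, status) ~ 1, data = patients)
summary(fit)  # S(t) at event times 2, 5, 6: 0.8, 0.533, 0.267
```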
<p>That’s interesting! I don’t think I’ve calculated these by hand before, and working through a simple example and observing the result is very helpful. Alright, when we read articles there is typically a group factor, so how do they use KM to generate two survival curves, one per group? They do the same thing, but only on the subset of the data that belongs to that group. So if we have a treatment and a control group, we calculate S(t) separately for each group, and then we can compare the two survival curves to see whether there is a difference in survival between the groups. How? We can use the log-rank test to compare the survival curves, or a Cox proportional hazards model to estimate the hazard ratio between the groups. Now things are starting to look a tad more familiar. Let’s use R with a simple example and see if we can get to the log-rank test from just the simple KM estimator.</p>
<pre>library(tidyverse)

simple_df &lt;- tribble(
  ~time, ~status, ~treatment,
  5,1,1,
  2,1,0,
  6,0,1,
  1,1,0,
  2,0,1,
  2,0,0,
  7,1,1,
  3,1,0,
  7,1,1,
  2,1,0,
  1,1,0,
  6,1,1
) |&gt;
  mutate(subject = row_number()) 

treatment_df &lt;- simple_df |&gt;
  filter(treatment == 1) |&gt;
  arrange(time)

treatment &lt;- tibble()
time &lt;- treatment_df |&gt; distinct(time) |&gt; pull()
at_risk &lt;- nrow(treatment_df)
S_t &lt;- 1

for (i in time) {
  df_i &lt;- treatment_df |&gt;
    filter(time == i)
  status &lt;- df_i |&gt; pull(status) |&gt; sum()
  n &lt;- df_i |&gt; pull(status) |&gt; length()
  S_t &lt;- S_t * (1 - status/at_risk)
  treatment &lt;- treatment |&gt;
    bind_rows(tibble(time=i,at_risk=at_risk,S_t=S_t,treatment=1))
  at_risk &lt;- at_risk - n
}

(treatment)

## # A tibble: 4 × 4
##    time at_risk   S_t treatment
##   &lt;dbl&gt;   &lt;int&gt; &lt;dbl&gt;     &lt;dbl&gt;
## 1     2       6   1           1
## 2     5       5   0.8         1
## 3     6       4   0.6         1
## 4     7       2   0           1

no_treatment_df &lt;- simple_df |&gt;
  filter(treatment == 0) |&gt;
  arrange(time)

no_treatment &lt;- tibble()
time &lt;- no_treatment_df |&gt; distinct(time) |&gt; pull()
at_risk &lt;- nrow(no_treatment_df)
S_t &lt;- 1

for (i in time) {
  df_i &lt;- no_treatment_df |&gt;
    filter(time == i)
  status &lt;- df_i |&gt; pull(status) |&gt; sum()
  n &lt;- df_i |&gt; pull(status) |&gt; length()
  S_t &lt;- S_t * (1 - status/at_risk)
  no_treatment &lt;- no_treatment |&gt;
    bind_rows(tibble(time=i,at_risk=at_risk,S_t=S_t,treatment=0))
  at_risk &lt;- at_risk - n
}

(no_treatment)

## # A tibble: 3 × 4
##    time at_risk   S_t treatment
##   &lt;dbl&gt;   &lt;int&gt; &lt;dbl&gt;     &lt;dbl&gt;
## 1     1       6 0.667         0
## 2     2       4 0.333         0
## 3     3       1 0             0
</pre>



<h4 id="visualize">Visualize
  <a href="https://www.kenkoonwong.com/blog/survival/#visualize" rel="nofollow" target="_blank"></a>
</h4>
<pre>rbind(treatment,no_treatment) |&gt;
  bind_rows(tibble(
    time=c(0,0), status=c(0,0), treatment=c(1,0), subject=c(0,0), at_risk=c(6,6), S_t=c(1,1)  #add initial phase 
  )) |&gt;
  ggplot(aes(x=time,y=S_t,color=as.factor(treatment))) +
  geom_step() +
  theme_bw()
</pre><img src="https://i2.wp.com/www.kenkoonwong.com/blog/survival/index_files/figure-html/unnamed-chunk-2-1.png?w=450&#038;ssl=1" data-recalc-dims="1" />
<p>Wow, since we created this simple dataset knowing that treatment extended the time to event whereas no treatment didn’t, we can nicely see that the two KM curves stratified by treatment group look very different. Now let’s quickly look at the log-rank test with the <code>survival</code> package, then calculate it by hand and see if we can reproduce the same p-value.</p>
<pre>## log-rank test
(log_rank_test &lt;- survival::survdiff(Surv(time, status) ~ treatment, data = simple_df))

## Call:
## survival::survdiff(formula = Surv(time, status) ~ treatment, 
##     data = simple_df)
## 
##             N Observed Expected (O-E)^2/E (O-E)^2/V
## treatment=0 6        5     1.97      4.68      9.02
## treatment=1 6        4     7.03      1.31      9.02
## 
##  Chisq= 9  on 1 degrees of freedom, p= 0.003
</pre><p>Alright! It looks like a chi-square test and has a p-value of 0.003. Let’s see if we can reproduce that. Note that <code>survdiff</code> uses (O-E)^2/V rather than the sum of (O-E)^2/E from the usual chi-square test to get the chi-square statistic. Interesting.</p>
<p><code>V (variance) = n_0 * n_1 * d * (n - d) / (n^2 * (n - 1))</code></p>
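<p>Before the step-by-step version, here is a compact sketch (mine, not the <code>survival</code> internals) that loops over the distinct event times and accumulates O, E, and V in one pass; defining the risk set as everyone with <code>time &gt;= t</code> avoids the manual bookkeeping and the NaN rows:</p>

```r
library(tidyverse)

simple_df <- tribble(
  ~time, ~status, ~treatment,
  5,1,1, 2,1,0, 6,0,1, 1,1,0, 2,0,1, 2,0,0,
  7,1,1, 3,1,0, 7,1,1, 2,1,0, 1,1,0, 6,1,1
)

event_times <- simple_df |> filter(status == 1) |> distinct(time) |> pull() |> sort()

O0 <- E0 <- V <- 0
for (t in event_times) {
  risk <- simple_df |> filter(time >= t)  # everyone still under observation just before t
  n0 <- sum(risk$treatment == 0); n1 <- sum(risk$treatment == 1); n <- n0 + n1
  d0 <- sum(risk$time == t & risk$status == 1 & risk$treatment == 0)
  d  <- sum(risk$time == t & risk$status == 1)
  O0 <- O0 + d0          # observed events in group 0
  E0 <- E0 + n0 / n * d  # expected events in group 0
  if (n > 1) V <- V + n0 * n1 * d * (n - d) / (n^2 * (n - 1))
}
(chi_sq <- (O0 - E0)^2 / V)                 # ~9.02
pchisq(chi_sq, df = 1, lower.tail = FALSE)  # ~0.003
```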
<details>
<summary>Click Here For Calculated Details</summary>
<pre>## log-rank test by hand
n0 &lt;- 6
n1 &lt;- 6

### time 1
simple_df |&gt;
  filter(time == 1) |&gt;
  group_by(treatment) |&gt;
  summarize(n = n(),
            d = sum(status))

## # A tibble: 1 × 3
##   treatment     n     d
##       &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt;
## 1         0     2     2

(t_1 &lt;- tibble(n0=n0, n1=n1, d0=2, d1=0) |&gt;
  mutate(n = n0 + n1) |&gt;
  mutate(d = d0 + d1) |&gt;
  mutate(V = n0 * n1 * d * (n - d) / (n^2 * (n - 1))) |&gt;
  mutate(E0 = (n0 /n) * d) |&gt;
  mutate(chi_i = (d0-E0)^2/V))

## # A tibble: 1 × 9
##      n0    n1    d0    d1     n     d     V    E0 chi_i
##   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1     6     6     2     0    12     2 0.455     1   2.2

n0 &lt;- n0 - 2
n1 &lt;- n1 - 0

### time 2
simple_df |&gt;
  filter(time == 2) |&gt;
  group_by(treatment) |&gt;
  summarize(n = n(),
            d = sum(status))

## # A tibble: 2 × 3
##   treatment     n     d
##       &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt;
## 1         0     3     2
## 2         1     1     0

(t_2 &lt;- tibble(n0=n0, n1=n1, d0=2, d1=0) |&gt;
  mutate(n = n0 + n1) |&gt;
  mutate(d = d0 + d1) |&gt;
  mutate(V = n0 * n1 * d * (n - d) / (n^2 * (n - 1))) |&gt;
  mutate(E0 = (n0 /n) * d) |&gt;
  mutate(chi_i = (d0-E0)^2/V))

## # A tibble: 1 × 9
##      n0    n1    d0    d1     n     d     V    E0 chi_i
##   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1     4     6     2     0    10     2 0.427   0.8  3.37

n0 &lt;- n0 - 3
n1 &lt;- n1 - 1

### time 3
simple_df |&gt;
  filter(time == 3) |&gt;
  group_by(treatment) |&gt;
  summarize(n = n(),
            d = sum(status))

## # A tibble: 1 × 3
##   treatment     n     d
##       &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt;
## 1         0     1     1

(t_3 &lt;- tibble(n0=n0, n1=n1, d0=1, d1=0) |&gt;
  mutate(n = n0 + n1) |&gt;
  mutate(d = d0 + d1) |&gt;
  mutate(V = n0 * n1 * d * (n - d) / (n^2 * (n - 1))) |&gt;
  mutate(E0 = (n0 /n) * d) |&gt;
  mutate(chi_i = (d0-E0)^2/V))

## # A tibble: 1 × 9
##      n0    n1    d0    d1     n     d     V    E0 chi_i
##   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1     1     5     1     0     6     1 0.139 0.167     5

n0 &lt;- n0 - 1
n1 &lt;- n1 - 0

### time 5
simple_df |&gt;
  filter(time == 5) |&gt;
  group_by(treatment) |&gt;
  summarize(n = n(),
            d = sum(status))

## # A tibble: 1 × 3
##   treatment     n     d
##       &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt;
## 1         1     1     1

(t_5 &lt;- tibble(n0=n0, n1=n1, d0=0, d1=1) |&gt;
  mutate(n = n0 + n1) |&gt;
  mutate(d = d0 + d1) |&gt;
  mutate(V = n0 * n1 * d * (n - d) / (n^2 * (n - 1))) |&gt;
  mutate(E0 = (n0 /n) * d) |&gt;
  mutate(chi_i = (d0-E0)^2/V))

## # A tibble: 1 × 9
##      n0    n1    d0    d1     n     d     V    E0 chi_i
##   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1     0     5     0     1     5     1     0     0   NaN

n1 &lt;- n1 - 1

### time 6
simple_df |&gt;
  filter(time == 6) |&gt;
  group_by(treatment) |&gt;
  summarize(n = n(),
            d = sum(status))

## # A tibble: 1 × 3
##   treatment     n     d
##       &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt;
## 1         1     2     1

(t_6 &lt;- tibble(n0=n0, n1=n1, d0=0, d1=1) |&gt;
  mutate(n = n0 + n1) |&gt;
  mutate(d = d0 + d1) |&gt;
  mutate(V = n0 * n1 * d * (n - d) / (n^2 * (n - 1))) |&gt;
  mutate(E0 = (n0 /n) * d) |&gt;
  mutate(chi_i = (d0-E0)^2/V))

## # A tibble: 1 × 9
##      n0    n1    d0    d1     n     d     V    E0 chi_i
##   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1     0     4     0     1     4     1     0     0   NaN

n1 &lt;- n1 - 2

### time 7
simple_df |&gt;
  filter(time == 7) |&gt;
  group_by(treatment) |&gt;
  summarize(n = n(),
            d = sum(status))

## # A tibble: 1 × 3
##   treatment     n     d
##       &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt;
## 1         1     2     2

(t_7 &lt;- tibble(n0=n0, n1=n1, d0=0, d1=2) |&gt;
  mutate(n = n0 + n1) |&gt;
  mutate(d = d0 + d1) |&gt;
  mutate(V = n0 * n1 * d * (n - d) / (n^2 * (n - 1))) |&gt;
  mutate(E0 = (n0 /n) * d) |&gt;
  mutate(chi_i = (d0-E0)^2/V))

## # A tibble: 1 × 9
##      n0    n1    d0    d1     n     d     V    E0 chi_i
##   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1     0     2     0     2     2     2     0     0   NaN

n1 &lt;- n1 - 2
</pre><p>Alright, that was a lot of back-and-forth sanity checking, but I think we did it! Now let’s replace those NaN values with 0, do some calculation, and check our final chi-square statistic.</p>
</details>
<pre>bind_rows(t_1, t_2, t_3, t_5, t_6, t_7) |&gt;
  mutate(V = replace_na(V, 0),
         E0 = replace_na(E0, 0)) |&gt;
  summarise(
    O0 = sum(d0),
    E0 = sum(E0),
    V  = sum(V)
  ) |&gt;
  mutate(chi_sq = (O0 - E0)^2 / V)

## # A tibble: 1 × 4
##      O0    E0     V chi_sq
##   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt;
## 1     5  1.97  1.02   9.02

pchisq(q = 9.02, df = 1, lower.tail = F)

## [1] 0.002670414
</pre><p><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f64c.png" alt="🙌" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f64c.png" alt="🙌" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f64c.png" alt="🙌" class="wp-smiley" style="height: 1em; max-height: 1em;" /> We got it! If we round it up, it’s exactly 0.003 just like from <code>survival</code>.</p>
<blockquote>
<p>Notice that we used E0, but you could use E1 and it would return the same chi-square statistic. Click below for details.</p>
</blockquote>
<details>
<summary>Click to expand</summary>
<pre>bind_rows(t_1, t_2, t_3, t_5, t_6, t_7) |&gt;
  mutate(E1 = (n1 / n) * d) |&gt;
  summarise(
    O1 = sum(d1),
    E1 = sum(E1),
    V = sum(V)
  ) |&gt;
  mutate(chi_sq = (O1-E1)^2/V) 

## # A tibble: 1 × 4
##      O1    E1     V chi_sq
##   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt;
## 1     4  7.03  1.02   9.02
</pre></details>
<blockquote>
<p>Take note that the KM estimator can only estimate the survival function, and you can only compare the survival curves with the log-rank test; you can’t add more variables to adjust for confounding. In other words, we assume there aren’t any confounding factors between treatment groups.</p>
</blockquote>
<p>If any adjustment is needed, that’s where the Cox proportional hazards model comes in. Now if we were to add age and run a Cox model, we would get a different hazard ratio and p-value, but the log-rank test would stay the same because it only compares the survival curves without adjusting for any covariates. Let’s see that in action. Click below to expand; you’re going to see an interesting warning: complete separation.</p>
<details>
<summary>Click to Expand</summary>
<pre>simple_df &lt;- tribble(
  ~time, ~status, ~treatment, ~age,
  5,1,1,30,
  2,1,0,80,
  6,0,1,35,
  1,1,0,85,
  2,0,1,32,
  2,0,0,30,
  7,1,1,25,
  3,1,0,90,
  7,1,1,98,
  2,1,0,98,
  1,1,0,89,
  6,1,1,20
) |&gt;
  mutate(subject = row_number()) 

survival::coxph(Surv(time,status) ~ treatment+age, data = simple_df)

## Warning in coxph.fit(X, Y, istrat, offset, init, control, weights = weights, :
## Loglik converged before variable 1 ; coefficient may be infinite.

## Call:
## survival::coxph(formula = Surv(time, status) ~ treatment + age, 
##     data = simple_df)
## 
##                 coef  exp(coef)   se(coef)      z     p
## treatment -2.202e+01  2.729e-10  1.937e+04 -0.001 0.999
## age        2.275e-03  1.002e+00  1.192e-02  0.191 0.849
## 
## Likelihood ratio test=10.61  on 2 df, p=0.004959
## n= 12, number of events= 9
</pre><p>Notice how our treatment has a high p-value and a huge SE? Since our mock data has a clear separation between treatment and age, where all the treated patients are young and all the untreated patients are old, the model has a hard time estimating the effect of treatment because it’s confounded by age. This is called 
<a href="https://www.kenkoonwong.com/blog/mle/" rel="nofollow" target="_blank">complete separation</a>, and it leads to infinite estimates for the coefficients, which is why we see those warnings. In real-world data we might not have such clear-cut separation, but even partial separation can lead to unstable estimates. That’s why it’s important to check for separation and consider penalized regression methods if we encounter this issue.</p>
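<p>As a quick illustration of the penalized-regression idea (a sketch of mine, not from the original post), the <code>survival::ridge()</code> term adds an L2 penalty inside the <code>coxph</code> formula and tames the runaway coefficient; <code>theta = 1</code> is an arbitrary penalty strength here:</p>

```r
library(survival)

# Same mock data as above: complete separation between treatment and age
df_sep <- data.frame(
  time      = c(5, 2, 6, 1, 2, 2, 7, 3, 7, 2, 1, 6),
  status    = c(1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1),
  treatment = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1),
  age       = c(30, 80, 35, 85, 32, 30, 25, 90, 98, 98, 89, 20)
)

# ridge() penalizes the listed terms; theta = 1 is an arbitrary choice
fit_ridge <- coxph(Surv(time, status) ~ ridge(treatment, age, theta = 1),
                   data = df_sep)
coef(fit_ridge)  # finite, shrunken estimates instead of a runaway treatment coefficient
```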
</details>
<p>Let’s simulate data so that we can estimate a more accurate hazard ratio with a Cox model and see how it compares to the true hazard ratio that we set in the simulation.</p>




<h2 id="sim">Simulation
  <a href="https://www.kenkoonwong.com/blog/survival/#sim" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<pre>library(survival)
library(survminer)

# simulate data of HR 0.55 (95%CI 0.442-0.674)
set.seed(1)
n &lt;- 350
base_event &lt;- 25
base_rate &lt;- 1/base_event
treatment_event &lt;- base_event + 20
treatment_rate &lt;- 1/treatment_event
hr &lt;- treatment_rate/base_rate
coef &lt;- log(hr)
confounder &lt;- rbinom(n,1,0.5)
treatment &lt;- rbinom(n, 1, plogis(0.5*confounder))
true_time &lt;- rexp(n, rate = base_rate*exp(coef*treatment+0.5*confounder))
cens_time &lt;- runif(n, min = 0, max = treatment_event)         

df &lt;- tibble(true_time, cens_time) |&gt;
  mutate(time = pmin(true_time, cens_time),
         status = case_when(
           true_time &lt;= cens_time ~ 1,
           TRUE ~ 0
         )) |&gt;
  mutate(confounder = confounder) |&gt;
  mutate(treatment = treatment |&gt; as.factor())

head(df)

## # A tibble: 6 × 6
##   true_time cens_time  time status confounder treatment
##       &lt;dbl&gt;     &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt;      &lt;int&gt; &lt;fct&gt;    
## 1     13.0      42.3  13.0       1          0 0        
## 2      8.83      2.60  2.60      0          0 0        
## 3      4.49     13.7   4.49      1          1 0        
## 4     61.4      10.8  10.8       0          1 1        
## 5     47.7      17.3  17.3       0          0 1        
## 6     11.6      34.3  11.6       1          1 1
</pre><p>In the simulation above, we set the true hazard ratio to 0.55, which means the treatment group has a 45% reduction in the hazard of the event compared to the control group. We then simulate the true time to event from an exponential distribution, along with a censoring time from a uniform distribution. The observed time is the minimum of the true time and the censoring time, and the status variable indicates whether the event was observed (1) or censored (0).</p>
<p>Simulating the above is helpful because we know the true rate was derived from an exponential distribution with the base rate multiplied by the hazard ratio, so we can compare the estimated hazard ratio from the Cox model to the true hazard ratio we set in the simulation. The part that connects intuitively is how <code>exp(coef*treatment + coef2*confounder)</code> resembles linear regression. If you noticed that we use <code>base_rate*exp(...)</code>, it’s essentially the same as <code>exp(log(base_rate) + coef*treatment + coef2*confounder)</code>, which is the same as <code>exp(intercept + coef*treatment + coef2*confounder)</code>, where the intercept is log(base_rate). So, in a way, the Cox model treats the log of the hazard function as a linear combination of the covariates, much as linear regression models the mean of the outcome as a linear combination of the covariates. The difference is that the Cox model targets the hazard function, the instantaneous rate of event occurrence at time t, whereas linear regression targets the mean of the outcome variable.</p>
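<p>The equivalence in the paragraph above is easy to check numerically (a sketch using the same numbers as the simulation):</p>

```r
base_rate <- 1 / 25                    # baseline hazard rate
coef      <- log((1 / 45) / (1 / 25))  # treatment log hazard ratio, ~log(0.56)
coef2     <- 0.5                       # confounder effect
treatment <- 1; confounder <- 1

r1 <- base_rate * exp(coef * treatment + coef2 * confounder)
r2 <- exp(log(base_rate) + coef * treatment + coef2 * confounder)
all.equal(r1, r2)  # TRUE: same rate, written as exp(intercept + linear predictor)
```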




<h2 id="km">Kaplan-Meier Estimator
  <a href="https://www.kenkoonwong.com/blog/survival/#km" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<pre>(survdiff(Surv(time,status) ~ treatment, data = df))

## Call:
## survdiff(formula = Surv(time, status) ~ treatment, data = df)
## 
##               N Observed Expected (O-E)^2/E (O-E)^2/V
## treatment=0 149       74     62.3      2.20      3.61
## treatment=1 201       87     98.7      1.39      3.61
## 
##  Chisq= 3.6  on 1 degrees of freedom, p= 0.06

km_fit &lt;- survfit(Surv(time, status) ~ treatment, data = df)
</pre><p>Interestingly, the log-rank test did not show a statistically significant difference between the two groups (p = 0.06), even though we set a true hazard ratio of 0.55 in the simulation; the unadjusted comparison is diluted because the confounder is more common in the treatment group and increases the hazard. The Kaplan-Meier estimator will give us the estimated survival curves for each group, and we can visually compare the survival of the treatment and control groups.</p>




<h2 id="cox">Cox Proportional Hazard Model
  <a href="https://www.kenkoonwong.com/blog/survival/#cox" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<pre>cox_fit &lt;- coxph(Surv(time, status) ~ treatment + confounder, data = df, x = T)
summary(cox_fit)

## Call:
## coxph(formula = Surv(time, status) ~ treatment + confounder, 
##     data = df, x = T)
## 
##   n= 350, number of events= 161 
## 
##               coef exp(coef) se(coef)      z Pr(&gt;|z|)   
## treatment1 -0.3730    0.6887   0.1603 -2.327  0.01999 * 
## confounder  0.5024    1.6526   0.1605  3.130  0.00175 **
## ---
## Signif. codes:  0 &#39;***&#39; 0.001 &#39;**&#39; 0.01 &#39;*&#39; 0.05 &#39;.&#39; 0.1 &#39; &#39; 1
## 
##            exp(coef) exp(-coef) lower .95 upper .95
## treatment1    0.6887     1.4521     0.503    0.9429
## confounder    1.6526     0.6051     1.207    2.2635
## 
## Concordance= 0.582  (se = 0.024 )
## Likelihood ratio test= 13.34  on 2 df,   p=0.001
## Wald test            = 13.34  on 2 df,   p=0.001
## Score (logrank) test = 13.5  on 2 df,   p=0.001

# plot
ggsurvplot(
  fit = km_fit,
  data = df,
  # pval = TRUE,
  conf.int = TRUE,
  risk.table = TRUE,
  legend.labs = c(&quot;Control&quot;, &quot;Treatment&quot;),
  title    = &quot;Kaplan-Meier Survival Curves&quot;
)
</pre><img src="https://i0.wp.com/www.kenkoonwong.com/blog/survival/index_files/figure-html/unnamed-chunk-10-1.png?w=450&#038;ssl=1" data-recalc-dims="1" />
<p>Notice how the estimated hazard ratio from the Cox model is reasonably close to the true hazard ratio of 0.55 that we set in the simulation: our estimated HR is 0.69 (95% CI 0.50-0.94). The Kaplan-Meier plot shows the survival curves for each group, and we can visually see the difference in survival between the treatment and control groups.</p>
<blockquote>
<p>Note: <code>survdiff</code> calculates log-rank test, <code>survfit</code> estimates the survival function, and <code>coxph</code> estimates the hazard ratio adjusting for covariates</p>
</blockquote>
<p>There is an 
<a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC3653612/" rel="nofollow" target="_blank">interesting article by Hernán</a> that cautions against the use of unadjusted HRs and unadjusted survival curves (which we just plotted, since they are based on the KM estimator). He also points out that a single average HR across the entire follow-up can be misleading because the true effect may vary over time. Let’s see if we can apply that to our current plot. Let’s use <code>adjustedCurves</code> and see if it looks different.</p>
<pre>library(adjustedCurves)

adjust_curve &lt;- adjustedsurv(
  data = df, 
  ev_time = &quot;time&quot;, 
  event = &quot;status&quot;, 
  variable = &quot;treatment&quot;, 
  method = &quot;direct&quot;, 
  outcome_model = cox_fit, 
  conf_int = T)

plot(adjust_curve, conf_int = T)
</pre><img src="https://i1.wp.com/www.kenkoonwong.com/blog/survival/index_files/figure-html/unnamed-chunk-11-1.png?w=450&#038;ssl=1" data-recalc-dims="1" />
<p>Interesting. They do look different! (Though the paper didn’t directly use this package.) Hernán also proposed a way to estimate a time-varying HR more accurately, using pooled logistic regression with a spline on time as a feature. Let’s try that next time! So much to learn! On 
<a href="https://www.emilyzabor.com/survival-analysis-in-r.html#assessing-proportional-hazards" rel="nofollow" target="_blank">Emily Zabor’s blog</a>, she mentions the <code>survival::cox.zph()</code> function, which lets us check the proportional hazards assumption.</p>
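<p>A minimal, self-contained sketch of that check (using the built-in <code>lung</code> dataset rather than our simulated data):</p>

```r
library(survival)

# Fit a Cox model on the built-in lung dataset
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Schoenfeld residual test of the proportional hazards assumption;
# a small p-value suggests the hazard ratio varies over time
zph <- cox.zph(fit)
print(zph)
plot(zph)  # flat smoothed residuals over time support PH
```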




<h2 id="ack">Acknowledgement
  <a href="https://www.kenkoonwong.com/blog/survival/#ack" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<p>Thanks to Emily Zabor’s tutorial and also personal advice on practical usage of survival analysis! Her blog contains so much more advanced topics and some functions and packages I’m planning to use in the future. It’s truly one of the more comprehensive and yet easy to understand tutorials I’ve seen on survival analysis. Thanks Emily!</p>




<h2 id="opportunities">Opportunities For Improvement
  <a href="https://www.kenkoonwong.com/blog/survival/#opportunities" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<ul>
<li>learn competing risk analysis with fine-gray model</li>
<li>learn to customize ggsurvplot</li>
<li>use <code>gtsummary::tbl_regression(exp = TRUE)</code> to further beautify aHR</li>
<li>test out Hernán’s proposed solution to calculate HR</li>
<li>let’s test out other dataset such as BMT from SemiCompRisks, Melanoma from MASS,</li>
</ul>




<h2 id="lessons">Lessons learnt
  <a href="https://www.kenkoonwong.com/blog/survival/#lessons" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<ul>
<li>calculating by hand is helpful; I realized we can’t update the at-risk count one row at a time, because tied observations (subjects sharing the same time) have to be handled together at each distinct time point, which is why we used a for loop over distinct times for clarity</li>
<li><code>survdiff</code> calculates log-rank test, <code>survfit</code> estimates the survival function, and <code>coxph</code> estimates the hazard ratio adjusting for covariates</li>
<li>censoring is usually a good thing (the subject reached the end of observation event-free), but it can also mean lost to follow-up.</li>
</ul>
<p>If you like this article:</p>
<ul>
<li>please feel free to send me a 
<a href="https://www.kenkoonwong.com/blog/" rel="nofollow" target="_blank">comment or visit my other blogs</a></li>
<li>please feel free to follow me on 
<a href="https://bsky.app/profile/kenkoonwong.bsky.social" rel="nofollow" target="_blank">BlueSky</a>, 
<a href="https://twitter.com/kenkoonwong/" rel="nofollow" target="_blank">twitter</a>, 
<a href="https://github.com/kenkoonwong/" rel="nofollow" target="_blank">GitHub</a> or 
<a href="https://rstats.me/@kenkoonwong" rel="nofollow" target="_blank">Mastodon</a></li>
<li>if you would like collaborate please feel free to 
<a href="https://www.kenkoonwong.com/contact/" rel="nofollow" target="_blank">contact me</a></li>
</ul>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://www.kenkoonwong.com/blog/survival/"> r on Everyday Is A School Day</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/learning-exploring-survival-analysis-part-1-a-note-to-myself/">Learning & Exploring Survival Analysis Part 1 – A Note To Myself</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400996</post-id>	</item>
		<item>
		<title>You Don&#8217;t Need to Learn All the Weights on tabular data: The Case for rvflnet (a nonlinear expressive glmnet) on regression, classification and survival analysis</title>
		<link>https://www.r-bloggers.com/2026/05/you-dont-need-to-learn-all-the-weights-on-tabular-data-the-case-for-rvflnet-a-nonlinear-expressive-glmnet-on-regression-classification-and-survival-analysis/</link>
		
		<dc:creator><![CDATA[T. Moudiki]]></dc:creator>
		<pubDate>Sat, 02 May 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://thierrymoudiki.github.io//blog/2026/05/02/r/rvflnet</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> rvflnet is an R package that implements a Random Vector Functional Link (RVFL) network. It is a nonlinear expressive version of glmnet that can be used for regression, classification and survival analysis.</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/you-dont-need-to-learn-all-the-weights-on-tabular-data-the-case-for-rvflnet-a-nonlinear-expressive-glmnet-on-regression-classification-and-survival-analysis/">You Don’t Need to Learn All the Weights on tabular data: The Case for rvflnet (a nonlinear expressive glmnet) on regression, classification and survival analysis</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://thierrymoudiki.github.io//blog/2026/05/02/r/rvflnet"> T. Moudiki's Webpage - R</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<h2 id="introduction">Introduction</h2>

<p>Random Vector Functional Link (RVFL) networks offer a simple yet powerful alternative to traditional neural networks for tabular data. Instead of learning hidden layers through backpropagation, RVFL generates them <strong>randomly</strong> (or not, if using a deterministic sequence of quasi-random numbers) and focuses all learning effort on a final, regularized linear model.</p>

<p>Formally, let</p>

\[X \in \mathbb{R}^{n \times p}\]

<p>be the input data. RVFL networks (the ones described in this blog post) construct a set of nonlinear features by projecting \(X\) onto a random matrix</p>

\[W \in \mathbb{R}^{p \times m},\]

<p>and applying an activation function \(g(\cdot)\):</p>

\[H = g\left( \frac{X - \mu}{\sigma} ; W \right).\]

<p>These random nonlinear features are then concatenated with the original inputs to form an augmented design matrix:</p>

\[Z = [X | H].\]

<p>The model prediction is obtained by fitting a linear model on this expanded space (hence, a nonlinear GLM):</p>

\[\hat{y} = Z \beta.\]

<p>Because \(Z\) can be high-dimensional and highly redundant, RVFL networks (the ones described in this blog post) rely on <strong>Elastic Net regularization</strong> (<a href="https://glmnet.stanford.edu/articles/glmnet.html" rel="nofollow" target="_blank"><code>glmnet</code></a>) to estimate the coefficients:</p>

\[\hat{\beta} = \arg\min_{\beta}\mathcal{L}(y, Z\beta) + \lambda \left(\alpha ||\beta||_1 + (1-\alpha)||\beta||_2^2\right).\]

<p>In this framework, randomness creates a rich pool of nonlinear transformations, while regularization selects and stabilizes the most useful ones. The result is a nonlinear model that combines the flexibility of neural networks with the efficiency and robustness of linear methods.</p>
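<p>The construction above can be sketched in a few lines of base R. This is a toy illustration, not the <code>rvflnet</code> implementation: the sigmoid activation, the standardization, and a closed-form ridge solve stand in for the package’s <code>glmnet</code> fit:</p>

<pre>set.seed(1)
n &lt;- 100; p &lt;- 3; m &lt;- 10
X &lt;- matrix(rnorm(n * p), n, p)
y &lt;- sin(X[, 1]) + 0.1 * rnorm(n)

# Random hidden layer: standardize X, project on W, apply sigmoid
W &lt;- matrix(rnorm(p * m), p, m)
H &lt;- 1 / (1 + exp(-scale(X) %*% W))

# Augmented design matrix Z = [X | H]
Z &lt;- cbind(X, H)

# Ridge regression in closed form (stand-in for the Elastic Net fit)
lambda &lt;- 0.1
beta &lt;- solve(crossprod(Z) + lambda * diag(ncol(Z)), crossprod(Z, y))
y_hat &lt;- Z %*% beta
</pre>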

<p>Of course, this blog post is not a proof of its title. It’s about the <a href="https://docs.techtonique.net/rvflnet/index.html" rel="nofollow" target="_blank">R package <code>rvflnet</code></a>. But you can appreciate the <strong>high performance of RVFLs</strong> on regression, classification, and survival analysis, and notably on the controversial <code>Boston</code> dataset, where they perform on par with Random Forest or Gradient Boosting.</p>

<h2 id="0---install-package">0 &#8211; Install package</h2>

<pre>install.packages(&quot;survival&quot;, repos = &quot;https://cran.r-project.org&quot;) # survival analysis

install.packages(&quot;remotes&quot;, repos = &quot;https://cran.r-project.org&quot;)

remotes::install_github(&quot;thierrymoudiki/rvflnet&quot;) # Nonlinear GLM (RVFL networks)
</pre>

<h2 id="1---regression">1 &#8211; Regression</h2>

<pre>set.seed(123)

library(glmnet)
data(Boston, package = &quot;MASS&quot;)

# -------------------------
# Data
# -------------------------
X &lt;- as.matrix(Boston[, -14])
y &lt;- Boston$medv

n &lt;- nrow(X)
idx &lt;- sample(1:n, size = round(0.8 * n))

X_train &lt;- X[idx, ]
y_train &lt;- y[idx]

X_test &lt;- X[-idx, ]
y_test &lt;- y[-idx]

# -------------------------
# Grid
# -------------------------
grid &lt;- expand.grid(
  n_hidden = c(175, 200, 225, 250),
  alpha = seq(0.1, 0.5, by=0.2),
  include_original = c(TRUE, FALSE),
  seed = 1,
  stringsAsFactors = FALSE
)

results &lt;- vector(&quot;list&quot;, nrow(grid))

# -------------------------
# Loop
# -------------------------
for (i in seq_len(nrow(grid))) {

  params &lt;- grid[i, ]

  #cat(&quot;\n========================================\n&quot;)
  #cat(sprintf(&quot;Run %d / %d\n&quot;, i, nrow(grid)))
  #print(params)

  # -------------------------
  # Fit model
  # -------------------------
  fit &lt;- rvflnet::rvflnet(
    X_train, y_train,
    n_hidden = params$n_hidden,
    activation = &quot;sigmoid&quot;,
    W_type = &quot;gaussian&quot;,
    seed = params$seed,
    include_original = params$include_original, # direct link, skip connection or not
    alpha = params$alpha
  )

  # -------------------------
  # Evaluate full lambda path
  # -------------------------
  lambdas &lt;- fit$fit$lambda

  preds &lt;- predict(fit, newx = X_test, s = lambdas)

  rmse_path &lt;- sqrt(colMeans((preds - y_test)^2))

  best_idx &lt;- which.min(rmse_path)

  best_rmse &lt;- rmse_path[best_idx]
  best_lambda &lt;- lambdas[best_idx]

  # -------------------------
  # Sparsity
  # -------------------------
  coef_mat &lt;- coef(fit, s = best_lambda)
  nonzero &lt;- sum(coef_mat[-1, 1] != 0)

  # -------------------------
  # Verbose output
  # -------------------------
  #cat(sprintf(&quot;Best RMSE: %.4f\n&quot;, best_rmse))
  #cat(sprintf(&quot;Best lambda: %.6f\n&quot;, best_lambda))
  #cat(sprintf(&quot;Non-zero coeffs: %d\n&quot;, nonzero))

  # -------------------------
  # Store
  # -------------------------
  results[[i]] &lt;- data.frame(
    n_hidden = params$n_hidden,
    alpha = params$alpha,
    include_original = params$include_original,
    seed = params$seed,
    rmse = best_rmse,
    lambda = best_lambda,
    nonzero = nonzero
  )
}

# -------------------------
# Aggregate
# -------------------------
results_df &lt;- do.call(rbind, results)
results_df &lt;- results_df[order(results_df$rmse), ]
print(head(results_df))

Loading required package: Matrix

Loaded glmnet 4.1-10



               n_hidden alpha include_original seed     rmse     lambda nonzero
s= 0.027561759      200   0.1             TRUE    1 2.881935 0.02756176     190
s= 0.017620327      200   0.3             TRUE    1 2.884739 0.01762033     167
s= 0.012734248      200   0.5             TRUE    1 2.889339 0.01273425     158
s= 0.036435024      175   0.1             TRUE    1 2.920012 0.03643502     165
s= 0.016833926      175   0.5             TRUE    1 2.938472 0.01683393     136
s= 0.023293035      175   0.3             TRUE    1 2.941267 0.02329304     144
</pre>

<p>An RMSE of 2.88 is on par with Random Forest or Gradient Boosting, with a <strong>significantly</strong> faster computation time.</p>

<h2 id="2---classification">2 - Classification</h2>

<h3 id="2---1-binary-classification">2 - 1 Binary Classification</h3>

<pre>set.seed(123)

data(iris)

# Binary classification: setosa vs others
y &lt;- ifelse(iris$Species == &quot;setosa&quot;, 1, 0)
X &lt;- as.matrix(iris[, 1:4])

# Train/test split
n &lt;- nrow(X)
idx &lt;- sample(1:n, size = round(0.8 * n))

X_train &lt;- X[idx, ]
y_train &lt;- y[idx]

X_test &lt;- X[-idx, ]
y_test &lt;- y[-idx]

# -------------------------
# Fit model
# -------------------------
cv_model &lt;- rvflnet::cv.rvflnet(
  X_train, y_train,
  n_hidden = 50,
  activation = &quot;relu&quot;,
  W_type = &quot;gaussian&quot;,
  family = &quot;binomial&quot;,
  nfolds = 5
)

# -------------------------
# Predictions (probabilities)
# -------------------------
(probs &lt;- predict(cv_model, X_test, type = &quot;response&quot;))

# Convert to class
y_pred &lt;- ifelse(probs &gt; 0.5, 1, 0)

all.equal(as.numeric(y_pred), as.numeric(predict(cv_model, X_test, type=&quot;class&quot;)))

# -------------------------
# Diagnostics
# -------------------------

# Accuracy
acc &lt;- mean(drop(y_pred) == y_test)
cat(&quot;Accuracy:&quot;, acc, &quot;\n&quot;)

# Confusion matrix
table(Predicted = y_pred, Actual = y_test)
</pre>

<table class="dataframe">
<caption>A matrix: 30 × 1 of type dbl</caption>
<thead>
	<tr><th scope="col">lambda.min</th></tr>
</thead>
<tbody>
	<tr><td>0.9997617002</td></tr>
	<tr><td>0.9992267955</td></tr>
	<tr><td>0.9997120678</td></tr>
	<tr><td>0.9997524867</td></tr>
	<tr><td>0.9996600481</td></tr>
	<tr><td>0.9992472082</td></tr>
	<tr><td>0.9996101744</td></tr>
	<tr><td>0.9999356520</td></tr>
	<tr><td>0.9998139568</td></tr>
	<tr><td>0.9995418762</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0019937012</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0005459970</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0005035848</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
</tbody>
</table>

<p>TRUE</p>

<pre>Accuracy: 1 



         Actual
Predicted  0  1
        0 20  0
        1  0 10
</pre>

<h3 id="2---2-multiclass-classification">2 - 2 Multiclass Classification</h3>

<pre>set.seed(123)

data(iris)

y &lt;- as.numeric(iris$Species)
X &lt;- as.matrix(iris[, 1:4])

# Train/test split
n &lt;- nrow(X)
idx &lt;- sample(1:n, size = round(0.8 * n))

X_train &lt;- X[idx, ]
y_train &lt;- y[idx]

X_test &lt;- X[-idx, ]
y_test &lt;- y[-idx]

# -------------------------
# Fit model
# -------------------------
cv_model &lt;- rvflnet::rvflnet(
  X_train, y_train,
  n_hidden = 50,
  activation = &quot;relu&quot;,
  W_type = &quot;gaussian&quot;,
  family = &quot;multinomial&quot;,
  nlambda = 25,
  nfolds = 5
)

# -------------------------
# Diagnostics
# -------------------------

# Accuracy
acc &lt;- colMeans(predict(cv_model, X_test, type=&quot;class&quot;) == y_test)
cat(&quot;Accuracies:&quot;, acc, &quot;\n&quot;) # consider other metrics

Accuracies: 0.1666667 0.7666667 0.9333333 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 
</pre>
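<p>Since <code>predict</code> returns one column per value of <code>lambda</code>, the best one on the path can be read off the accuracy vector. A self-contained continuation (accuracy values taken from the output above; in practice, select <code>lambda</code> on a validation set rather than the test set):</p>

<pre># Accuracy for each lambda on the path (values from the output above)
acc &lt;- c(0.1666667, 0.7666667, 0.9333333, rep(0.9666667, 20))

best &lt;- which.max(acc)   # index of the first lambda reaching the maximum
acc[best]
</pre>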

<h2 id="3---nonlinear-cox-survival-analysis">3 - Nonlinear Cox survival analysis</h2>

<h3 id="3---1-example-1">3 - 1 Example 1</h3>

<pre>library(survival)
library(rvflnet)

data(ovarian)

X &lt;- as.matrix(ovarian[, c(&quot;age&quot;, &quot;resid.ds&quot;, &quot;rx&quot;, &quot;ecog.ps&quot;)])
y &lt;- Surv(ovarian$futime, ovarian$fustat)

set.seed(123)
n &lt;- nrow(X)
train_idx &lt;- sample(1:n, size = round(0.8 * n))

X_train &lt;- X[train_idx, ]
X_test  &lt;- X[-train_idx, ]
y_train &lt;- y[train_idx]
y_test  &lt;- y[-train_idx]

# -------------------------
# Fit model
# -------------------------
cv_fit &lt;- rvflnet::cv.rvflnet(
  X_train, y_train,
  family = &quot;cox&quot;,
  nfolds = 5,
  type.measure = &quot;C&quot;
)

plot(cv_fit)

# Out-of-sample C-index
print(glmnet::Cindex(pred = predict(cv_fit, X_test), y = y_test))


Warning message in data(ovarian):
“data set ‘ovarian’ not found”


[1] 0.8571429
</pre>

<p><img src="https://i2.wp.com/thierrymoudiki.github.io/images/2026-05-02/2026-05-02-rvflnet_15_2.png?w=578&#038;ssl=1" alt="image-title-here" class="img-responsive" data-recalc-dims="1" /></p>
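<p>For reference, the C-index computed by <code>glmnet::Cindex</code> is the proportion of comparable patient pairs whose predicted risk ordering agrees with their observed survival ordering. A naive base R version (a toy sketch for intuition only, ignoring the tie-handling and efficiency of the real implementation) looks like this:</p>

<pre>cindex_naive &lt;- function(risk, time, event) {
  num &lt;- 0; den &lt;- 0
  n &lt;- length(time)
  for (i in seq_len(n)) for (j in seq_len(n)) {
    # a pair is comparable if subject i fails before time j
    if (time[i] &lt; time[j] &amp;&amp; event[i] == 1) {
      den &lt;- den + 1
      if (risk[i] &gt; risk[j]) num &lt;- num + 1
      else if (risk[i] == risk[j]) num &lt;- num + 0.5
    }
  }
  num / den
}

# Sanity check: a risk score that decreases with survival time is perfect
cindex_naive(risk = -(1:5), time = 1:5, event = rep(1, 5)) # 1
</pre>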

<h3 id="3---2-example-2">3 - 2 Example 2</h3>

<pre>library(glmnet)
library(survival)

data(pbc)
pbc2       &lt;- pbc[!is.na(pbc$trt), ]
pbc2$event &lt;- as.integer(pbc2$status == 2)
pbc2$sex_n &lt;- as.integer(pbc2$sex == &quot;f&quot;)

feat_cols &lt;- c(&quot;trt&quot;,&quot;age&quot;,&quot;sex_n&quot;,&quot;ascites&quot;,&quot;hepato&quot;,&quot;spiders&quot;,&quot;edema&quot;,
               &quot;bili&quot;,&quot;chol&quot;,&quot;albumin&quot;,&quot;copper&quot;,&quot;alk.phos&quot;,&quot;ast&quot;,
               &quot;trig&quot;,&quot;platelet&quot;,&quot;protime&quot;,&quot;stage&quot;)

df &lt;- pbc2[, c(&quot;time&quot;, &quot;event&quot;, feat_cols)]
for (col in feat_cols)
  if (any(is.na(df[[col]])))
    df[[col]][is.na(df[[col]])] &lt;- median(df[[col]], na.rm = TRUE)

set.seed(42)
idx_train &lt;- sample(nrow(df), floor(0.75 * nrow(df)))
train &lt;- df[idx_train, ]; test &lt;- df[-idx_train, ]
X_tr  &lt;- as.matrix(train[, feat_cols])
X_te  &lt;- as.matrix(test[,  feat_cols])
y_tr   &lt;- Surv(train$time, train$event)

fit &lt;- rvflnet::rvflnet(
  X_tr, y_tr,
  family = &quot;cox&quot;,
  alpha=0.1, lambda=0.1 # not recommended
)

y_te   &lt;- Surv(test$time, test$event)
ci &lt;- glmnet::Cindex(predict(fit, X_te), y_te)

cat(&quot;\n=== Test-set C-index ===\n&quot;)
print(ci)


=== Test-set C-index ===
[1] 0.8218117

fit &lt;- rvflnet::rvflnet(
  X_tr, y_tr,
  family = &quot;cox&quot;,
  alpha=0.1, nlambda=50
)

y_te   &lt;- Surv(test$time, test$event)

(cis &lt;- apply(predict(fit, X_te), 2, function(x) glmnet::Cindex(x, y_te)))

#cat(&quot;\n=== Test-set C-index ===\n&quot;)
plot(log(fit$fit$lambda), cis, type = 'l')
abline(h=0.8, lty=2, col=&quot;red&quot;)
</pre>

<style>
.dl-inline {width: auto; margin:0; padding: 0}
.dl-inline>dt, .dl-inline>dd {float: none; width: auto; display: inline-block}
.dl-inline>dt::after {content: ":\0020"; padding-right: .5ex}
.dl-inline>dt:not(:first-of-type) {padding-left: .5ex}
</style>
<p><dl class=dl-inline><dt>s0</dt><dd>0.5</dd><dt>s1</dt><dd>0.762812872467223</dd><dt>s2</dt><dd>0.802145411203814</dd><dt>s3</dt><dd>0.811084624553039</dd><dt>s4</dt><dd>0.811680572109654</dd><dt>s5</dt><dd>0.814064362336114</dd><dt>s6</dt><dd>0.815852205005959</dd><dt>s7</dt><dd>0.817640047675805</dd><dt>s8</dt><dd>0.820023837902265</dd><dt>s9</dt><dd>0.81942789034565</dd><dt>s10</dt><dd>0.817640047675805</dd><dt>s11</dt><dd>0.81823599523242</dd><dt>s12</dt><dd>0.81823599523242</dd><dt>s13</dt><dd>0.815852205005959</dd><dt>s14</dt><dd>0.814660309892729</dd><dt>s15</dt><dd>0.813468414779499</dd><dt>s16</dt><dd>0.813468414779499</dd><dt>s17</dt><dd>0.815852205005959</dd><dt>s18</dt><dd>0.814660309892729</dd><dt>s19</dt><dd>0.82061978545888</dd><dt>s20</dt><dd>0.81942789034565</dd><dt>s21</dt><dd>0.82181168057211</dd><dt>s22</dt><dd>0.82061978545888</dd><dt>s23</dt><dd>0.817044100119189</dd><dt>s24</dt><dd>0.817640047675805</dd><dt>s25</dt><dd>0.81823599523242</dd><dt>s26</dt><dd>0.814660309892729</dd><dt>s27</dt><dd>0.810488676996424</dd><dt>s28</dt><dd>0.803933253873659</dd><dt>s29</dt><dd>0.802145411203814</dd><dt>s30</dt><dd>0.799761620977354</dd><dt>s31</dt><dd>0.793206197854589</dd><dt>s32</dt><dd>0.789034564958284</dd><dt>s33</dt><dd>0.777711561382598</dd><dt>s34</dt><dd>0.771156138259833</dd><dt>s35</dt><dd>0.766984505363528</dd><dt>s36</dt><dd>0.756853396901073</dd><dt>s37</dt><dd>0.748510131108462</dd><dt>s38</dt><dd>0.743146603098927</dd><dt>s39</dt><dd>0.735399284862932</dd><dt>s40</dt><dd>0.728843861740167</dd><dt>s41</dt><dd>0.721692491060787</dd><dt>s42</dt><dd>0.718116805721096</dd><dt>s43</dt><dd>0.717520858164482</dd><dt>s44</dt><dd>0.716924910607867</dd><dt>s45</dt><dd>0.716924910607867</dd><dt>s46</dt><dd>0.715733015494636</dd><dt>s47</dt><dd>0.716328963051251</dd><dt>s48</dt><dd>0.715137067938021</dd><dt>s49</dt><dd>0.713945172824791</dd></dl></p>

<p><img src="https://i0.wp.com/thierrymoudiki.github.io/images/2026-05-02/2026-05-02-rvflnet_18_1.png?w=578&#038;ssl=1" alt="image-title-here" class="img-responsive" data-recalc-dims="1" /></p>


<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://thierrymoudiki.github.io//blog/2026/05/02/r/rvflnet"> T. Moudiki's Webpage - R</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/you-dont-need-to-learn-all-the-weights-on-tabular-data-the-case-for-rvflnet-a-nonlinear-expressive-glmnet-on-regression-classification-and-survival-analysis/">You Don’t Need to Learn All the Weights on tabular data: The Case for rvflnet (a nonlinear expressive glmnet) on regression, classification and survival analysis</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400979</post-id>	</item>
		<item>
		<title>Regression Modeling Strategies Short Course 2026, with Frank Harrell; May 14, 15, 18, 19</title>
		<link>https://www.r-bloggers.com/2026/05/regression-modeling-strategies-short-course-2026-with-frank-harrell-may-14-15-18-19/</link>
		
		<dc:creator><![CDATA[Drew Levy]]></dc:creator>
		<pubDate>Fri, 01 May 2026 17:20:19 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://r-posts.com/?p=19151</guid>

					<description><![CDATA[<p>Frank Harrell’s Regression Modeling Strategies online seminar will take place May 14, 15, 18, and 19. This workshop covers principled strategies for building, validating, and interpreting multivariable regression models for a wide range of outcomes, with emphasis on predictive accuracy, avoiding overfitting, and interpreting estimated effects. It explores spline methods, data reduction, benefits ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/regression-modeling-strategies-short-course-2026-with-frank-harrell-may-14-15-18-19/">Regression Modeling Strategies Short Course 2026, with Frank Harrell; May 14, 15, 18, 19</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="http://r-posts.com/regression-modeling-strategies-short-course-2026-with-frank-harrell-may-14-15-18-19/"> R-posts.com</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report an issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p>Frank Harrell’s <a href="https://hbiostat.org/doc/rms/4day" rel="nofollow" target="_blank"><strong>Regression Modeling Strategies</strong></a> online seminar will take place May 14, 15, 18, and 19.</p>
<p>This workshop covers principled strategies for building, validating, and interpreting multivariable regression models for a wide range of outcomes, with emphasis on predictive accuracy, avoiding overfitting, and interpreting estimated effects. It explores spline methods, data reduction, benefits of Bayesian modeling, robust semiparametric ordinal, longitudinal, and survival models, and rigorous resampling-based validation, illustrated with applied case studies and R examples. More details <a href="https://hbiostat.org/doc/rms/desc" rel="nofollow" target="_blank">here</a>.</p>
<p>Along with the 1-day <a href="https://instats.org/seminar/introduction-to-r-regression-and-the-rms" rel="nofollow" target="_blank">Introduction to R, Regression, and the rms Package,</a> these <a href="https://hbiostat.org/course/" rel="nofollow" target="_blank">virtual seminars</a> are offered through <a href="https://instats.org/" rel="nofollow" target="_blank">Instats</a>, in association with the A.S.A.</p>
<p><img loading="lazy" fetchpriority="high" decoding="async" src="https://i0.wp.com/r-posts.com/wp-content/uploads/2026/04/download-24-198x300.png?resize=198%2C300" alt="" width="198" height="300" class="alignnone size-medium wp-image-19154" srcset_temp="https://i0.wp.com/r-posts.com/wp-content/uploads/2026/04/download-24-198x300.png?resize=198%2C300 198w, http://r-posts.com/wp-content/uploads/2026/04/download-24.png 441w" sizes="(max-width: 198px) 100vw, 198px" data-recalc-dims="1" /> <img loading="lazy" decoding="async" src="https://i2.wp.com/r-posts.com/wp-content/uploads/2026/04/FrankHarrell-2-450x143.png?resize=450%2C143" alt="" width="450" height="143" class="alignnone size-medium wp-image-19155" srcset_temp="https://i2.wp.com/r-posts.com/wp-content/uploads/2026/04/FrankHarrell-2-450x143.png?resize=450%2C143 450w, http://r-posts.com/wp-content/uploads/2026/04/FrankHarrell-2-768x243.png 768w, http://r-posts.com/wp-content/uploads/2026/04/FrankHarrell-2.png 770w" sizes="(max-width: 450px) 100vw, 450px" data-recalc-dims="1" /></p><hr style="border-top: black solid 1px" /><a href="http://r-posts.com/regression-modeling-strategies-short-course-2026-with-frank-harrell-may-14-15-18-19/" rel="nofollow" target="_blank">Regression Modeling Strategies Short Course 2026, with Frank Harrell; May 14, 15, 18, 19</a> was first posted on May 1, 2026 at 5:20 pm.<br />
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="http://r-posts.com/regression-modeling-strategies-short-course-2026-with-frank-harrell-may-14-15-18-19/"> R-posts.com</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/regression-modeling-strategies-short-course-2026-with-frank-harrell-may-14-15-18-19/">Regression Modeling Strategies Short Course 2026, with Frank Harrell; May 14, 15, 18, 19</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400962</post-id>	</item>
		<item>
		<title>Reactive Shiny Apps and Deployment with Google Cloud Run: Intermediate R Shiny Workshop</title>
		<link>https://www.r-bloggers.com/2026/05/reactive-shiny-apps-and-deployment-with-google-cloud-run-intermediate-r-shiny-workshop/</link>
		
		<dc:creator><![CDATA[Dariia Mykhailyshyna]]></dc:creator>
		<pubDate>Fri, 01 May 2026 11:35:38 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://r-posts.com/?p=19182</guid>

					<description><![CDATA[<p>Join our workshop on Reactive Shiny Apps and Deployment with Google Cloud Run: Intermediate R Shiny Workshop,  which is a part of our workshops for Ukraine series!  Here’s some more info:  Title: Reactive Shiny Apps and Deployment with Google Cloud Run: Intermediate R Shiny Workshop  Date: Thursday, May 21st, 18:00 – 20:00 ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/reactive-shiny-apps-and-deployment-with-google-cloud-run-intermediate-r-shiny-workshop/">Reactive Shiny Apps and Deployment with Google Cloud Run: Intermediate R Shiny Workshop</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="http://r-posts.com/reactive-shiny-apps-and-deployment-with-google-cloud-run-intermediate-r-shiny-workshop/"> R-posts.com</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report an issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p><span style="font-weight: 400">Join our workshop on Reactive Shiny Apps and Deployment with Google Cloud Run: Intermediate R Shiny Workshop</span><span style="font-weight: 400">, </span><span style="font-weight: 400"> which is a part of our workshops for Ukraine series! </span></p>
<br />
<p><b>Here’s some more info: </b></p>
<br />
<p><b>Title</b><span style="font-weight: 400">: Reactive Shiny Apps and Deployment with Google Cloud Run: Intermediate R Shiny Workshop </span></p>
<p><b>Date</b><span style="font-weight: 400">: Thursday, May 21st, 18:00 – 20:00 CET (Rome, Berlin, Paris timezone) </span></p>
<p><b>Speaker</b><span style="font-weight: 400">:Alfredo Hernández Sánchez is a Marie Skłodowska Curie ERA Postdoctoral Fellow at Vilnius University, where he leads the FIRSA project on financial regulation and innovation in Europe. His work combines applied research, data analysis, and reproducible computational methods, with a strong interest in turning research outputs into accessible digital tools such as dashboards and interactive web applications. He works extensively with R, Quarto, and Shiny in academic and policy-oriented settings.</span></p>
<p><a href="http://www.alfredohs.com/" rel="nofollow" target="_blank"><span style="font-weight: 400">www.alfredohs.com</span></a><span style="font-weight: 400"> </span></p>
<p><b>Description: </b><span style="font-weight: 400">This workshop is designed for people who already know the basics of Shiny and want to build apps that are more robust, more reactive, and easier to maintain. We will look at practical reactive patterns, app structure, and some common choices that make Shiny dashboards easier to develop as they become more complex. In the second part of the workshop, I will show how a Shiny app can move from local development to a public deployment on Google Cloud Run, using a real dashboard project as an example. The session will give participants a practical introduction to a cloud-based workflow for publishing and maintaining Shiny applications in a highly customizable environment. Basic familiarity with Shiny is assumed, and some previous experience building simple apps will help participants get the most out of the session.</span></p>
<p><b>Minimal registration fee:</b><span style="font-weight: 400"> 20 euro (or 20 USD or 800 UAH)</span></p>
<br />
<br />
<br />
<br />
<br />
<br />
<p><span style="font-weight: 400">Please note that the registration confirmation is sent 1 day before the workshop to all registered participants rather than immediately after registration</span></p>
<br />
<p><b>How can I register?</b></p>
<br />
<ul>
	<li style="font-weight: 400"><span style="font-weight: 400">Go to </span><a href="https://bit.ly/3wvwMA6" rel="nofollow" target="_blank"><span style="font-weight: 400">https://bit.ly/3wvwMA6</span></a><span style="font-weight: 400"> or </span><a href="https://bit.ly/4aD5LMC" rel="nofollow" target="_blank"><span style="font-weight: 400">https://bit.ly/4aD5LMC</span></a><span style="font-weight: 400">  or  </span><a href="https://bit.ly/3PFxtNA" rel="nofollow" target="_blank"><span style="font-weight: 400">https://bit.ly/3PFxtNA</span></a><span style="font-weight: 400"> and donate</span><b> at least 20 euro</b><span style="font-weight: 400">. </span><span style="font-weight: 400">Feel free to donate more if you can, all proceeds go directly to support Ukraine.</span></li>
</ul>
<br />
<ul>
	<li style="font-weight: 400"><span style="font-weight: 400">Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)</span></li>
</ul>
<br />
<ul>
	<li style="font-weight: 400"><span style="font-weight: 400">Fill in the</span><a href="https://forms.gle/DL4VFswe6gQQrDKz9" rel="nofollow" target="_blank"><span style="font-weight: 400"> registration form</span></a><span style="font-weight: 400">, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).</span></li>
</ul>
<br />
<p><span style="font-weight: 400">If you are not personally interested in attending, you can also contribute by sponsoring a participation of a student, who will then be able to participate for free. If you choose to sponsor a student, all proceeds will also go directly to organisations working in Ukraine. You can either sponsor a particular student or you can leave it up to us so that we can allocate the sponsored place to students who have signed up for the waiting list.</span></p>
<br />
<p><b>How can I sponsor a student?</b></p>
<ul>
	<li style="font-weight: 400"><span style="font-weight: 400">Go to </span><a href="https://bit.ly/3wvwMA6" rel="nofollow" target="_blank"><span style="font-weight: 400">https://bit.ly/3wvwMA6</span></a><span style="font-weight: 400"> or </span><a href="https://bit.ly/4aD5LMC" rel="nofollow" target="_blank"><span style="font-weight: 400">https://bit.ly/4aD5LMC</span></a><span style="font-weight: 400">  or </span><a href="https://bit.ly/3PFxtNA" rel="nofollow" target="_blank"><span style="font-weight: 400">https://bit.ly/3PFxtNA</span></a><span style="font-weight: 400"> and donate </span><b>at least 20 euro </b><span style="font-weight: 400">(or 17 GBP or 20 USD or 800 UAH). </span><span style="font-weight: 400">Feel free to donate more if you can, all proceeds go to support Ukraine!</span></li>
</ul>
<br />
<ul>
	<li style="font-weight: 400"><span style="font-weight: 400">Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)</span></li>
</ul>
<br />
<ul>
	<li style="font-weight: 400"><span style="font-weight: 400">Fill in the </span><a href="https://forms.gle/BzEZVjXNNP8sYRsT6" rel="nofollow" target="_blank"><span style="font-weight: 400">sponsorship form</span></a><span style="font-weight: 400">, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.</span></li>
</ul>
<br />
<br />
<p><span style="font-weight: 400">If you are a university student and cannot afford the registration fee, you can also sign up for the </span><b>waiting list</b> <a href="https://forms.gle/TpGE6zsZMACyEtQVA" rel="nofollow" target="_blank"><span style="font-weight: 400">here</span></a><span style="font-weight: 400">. (Note that you are not guaranteed to participate by signing up for the waiting list).</span></p>
<br />
<br />
<p><span style="font-weight: 400">You can also find more information about this workshop series,  a schedule of our future workshops as well as a list of our past workshops which you can get the recordings &#038; materials </span><a href="http://bit.ly/3wBeY4S" rel="nofollow" target="_blank"><span style="font-weight: 400">here</span></a><span style="font-weight: 400">.</span></p>
<br />
<p><span style="font-weight: 400">Looking forward to seeing you during the workshop!</span></p>
<br /><hr style="border-top: black solid 1px" /><a href="http://r-posts.com/reactive-shiny-apps-and-deployment-with-google-cloud-run-intermediate-r-shiny-workshop/" rel="nofollow" target="_blank">Reactive Shiny Apps and Deployment with Google Cloud Run: Intermediate R Shiny Workshop</a> was first posted on May 1, 2026 at 11:35 am.<br />
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="http://r-posts.com/reactive-shiny-apps-and-deployment-with-google-cloud-run-intermediate-r-shiny-workshop/"> R-posts.com</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/reactive-shiny-apps-and-deployment-with-google-cloud-run-intermediate-r-shiny-workshop/">Reactive Shiny Apps and Deployment with Google Cloud Run: Intermediate R Shiny Workshop</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400956</post-id>	</item>
		<item>
		<title>Closing the Gap in Exposure-Response Data: A Pharmaverse Framework</title>
		<link>https://www.r-bloggers.com/2026/04/closing-the-gap-in-exposure-response-data-a-pharmaverse-framework/</link>
		
		<dc:creator><![CDATA[Jeff Dickinson]]></dc:creator>
		<pubDate>Thu, 30 Apr 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://pharmaverse.github.io/blog/posts/2026-04-17-closing-the-gap-in/closing-the-gap-in-exposure-response-data-a-pharmaverse-framework.html</guid>

					<description><![CDATA[<p>The Missing Standard<br />
CDISC released the Population Pharmacokinetic (PopPK) Implementation Guide in 2023, giving the clinical programming community a clear structural blueprint for PK analysis datasets. But Exposure-Response (ER) modeling — wh...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/closing-the-gap-in-exposure-response-data-a-pharmaverse-framework/">Closing the Gap in Exposure-Response Data: A Pharmaverse Framework</a>]]></description>
<content:encoded><![CDATA[

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://pharmaverse.github.io/blog/posts/2026-04-17-closing-the-gap-in/closing-the-gap-in-exposure-response-data-a-pharmaverse-framework.html"> pharmaverse blog</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
</div>
 





<section id="the-missing-standard" class="level2">
<h2 class="anchored" data-anchor-id="the-missing-standard">The Missing Standard</h2>
<p>CDISC released the Population Pharmacokinetic (PopPK) Implementation Guide in 2023, giving the clinical programming community a clear structural blueprint for PK analysis datasets. But Exposure-Response (ER) modeling — which builds directly on PopPK outputs to characterize relationships between drug exposure, safety, and efficacy — has no equivalent standard.</p>
<p>The result is predictable: different studies, different variable names, different exposure metrics, different dataset structures. Every ER analysis team starts more or less from scratch. That makes cross-study pooling, automation, and programming more difficult than necessary, particularly with ever-quickening turnaround times in drug development.</p>
</section>
<section id="a-framework-built-on-what-we-already-have" class="level2">
<h2 class="anchored" data-anchor-id="a-framework-built-on-what-we-already-have">A Framework Built on What We Already Have</h2>
<p>ER datasets share a lot of structural DNA with PopPK datasets — numeric covariates, relative time variables, pharmacokinetic exposure metrics. That overlap is the starting point for this framework: extending CDISC ADaM principles already established for PopPK into the ER space.</p>
<p>Early discussions are underway with the CDISC ADaM working group about moving this framework forward as a Knowledge Article or Examples Document. The working group has expressed interest in positioning ER datasets as a subclass of ADPPK — grounding the framework within existing CDISC standards architecture and providing a clear lineage from the 2023 PopPK Implementation Guide. Nothing is formalized yet, but the direction is encouraging.</p>
<p>The framework defines four specialized datasets, each targeting a different aspect of ER analysis:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 50%">
<col style="width: 50%">
</colgroup>
<thead>
<tr class="header">
<th>Dataset</th>
<th>Purpose</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong><code>ADER</code></strong></td>
<td>Exposure foundation — comprehensive PK metrics, transformations, and baseline covariates</td>
</tr>
<tr class="even">
<td><strong><code>ADEE</code></strong></td>
<td>Exposure-Efficacy — time-to-event efficacy outcomes linked to drug exposure</td>
</tr>
<tr class="odd">
<td><strong><code>ADES</code></strong></td>
<td>Exposure-Safety — adverse event occurrence, severity, and time-to-onset by exposure</td>
</tr>
<tr class="even">
<td><strong><code>ADTRR</code></strong></td>
<td>Exposure-Tumor Response Rate — categorical tumor response (CR, PR, SD, PD) by exposure</td>
</tr>
</tbody>
</table>
<p>Each dataset builds on standard ADaM datasets (<code>ADSL</code>, <code>ADRS</code>, <code>ADTTE</code>, <code>ADAE</code>, <code>ADLB</code>, <code>ADVS</code>) and incorporates PK parameters from <code>ADPC</code>/<code>ADPP</code>, producing analysis-ready datasets without additional data wrangling.</p>
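<p>As a toy illustration of the grain these datasets target, here is a minimal base-R sketch. The column names (<code>USUBJID</code>, <code>CAVGSS</code>, <code>AVALC</code>, <code>RESPFL</code>) follow common ADaM conventions, but the values and the exact variable set are assumptions for illustration, not the framework's actual specification:</p>

```r
# Hypothetical sketch, not the official framework code: join a toy
# PopPK-style exposure summary onto best-overall-response records to
# form an ADTRR-like, one-row-per-subject analysis dataset.
adpp <- data.frame(USUBJID = c("01", "02", "03"),
                   CAVGSS  = c(12.1, 8.4, 15.0))   # steady-state average conc.
adrs <- data.frame(USUBJID = c("01", "02", "03"),
                   AVALC   = c("PR", "SD", "CR"))  # best overall response

# One row per subject, exposure alongside outcome: the grain an
# exposure-response dataset needs for modeling.
adtrr <- merge(adpp, adrs, by = "USUBJID")
adtrr$RESPFL <- ifelse(adtrr$AVALC %in% c("CR", "PR"), "Y", "N")  # responder flag
adtrr
```

<p>A real derivation would run through <code>{admiral}</code> pipelines against <code>ADSL</code>/<code>ADRS</code> and the PK datasets; this sketch only shows the one-row-per-subject, exposure-plus-outcome shape.</p>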
<p>The framework was presented as paper DS12 at PHUSE US Connect 2026 in Austin, TX. The <a href="https://phuse.s3.eu-central-1.amazonaws.com/Archive/2026/Connect/US/Austin/PAP_DS12.pdf" rel="nofollow" target="_blank">paper</a> and <a href="https://phuse.s3.eu-central-1.amazonaws.com/Archive/2026/Connect/US/Austin/PRE_DS12.pdf" rel="nofollow" target="_blank">slides</a> are now available in the PHUSE archive.</p>
</section>
<section id="why-the-pharmaverse-ecosystem" class="level2">
<h2 class="anchored" data-anchor-id="why-the-pharmaverse-ecosystem">Why the Pharmaverse Ecosystem?</h2>
<p>The framework is implemented using <code>{admiral}</code>, <code>{metacore}</code>, <code>{metatools}</code>, and <code>{xportr}</code> — the same toolchain used across the pharmaverse for ADaM dataset development. That choice was intentional.</p>
<p><code>{admiral}</code>’s modular derivation functions map naturally onto how ER datasets are built incrementally. Its <code>assert_*</code> functions catch errors at the point of derivation rather than burying them downstream. <code>{metacore}</code> keeps specs and code in sync. <code>{metatools}</code> provides utility functions for metadata management and validation. <code>{xportr}</code> handles CDISC compliance at the point of export.</p>
<p>The pharmaverse ecosystem did not just make implementation easier — it made the framework more trustworthy and maintainable. And because it is open-source, every improvement feeds back to the community.</p>
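<p>The fail-fast idea described above can be sketched in plain base R. Note that <code>derive_exposure()</code>, <code>DOSE</code>, and <code>CL</code> are hypothetical names for illustration, and <code>{admiral}</code>'s actual <code>assert_*</code> helpers are richer than a bare <code>stopifnot()</code>:</p>

```r
# Illustrative sketch only, not admiral code: validate inputs at the
# derivation step, so bad data fails here rather than surfacing
# downstream in the model.
derive_exposure <- function(dat) {
  stopifnot(
    is.data.frame(dat),
    all(c("USUBJID", "DOSE", "CL") %in% names(dat)),  # required variables present
    all(dat$CL > 0)                                   # clearance must be positive
  )
  dat$CAVGSS <- dat$DOSE / dat$CL  # steady-state average concentration
  dat
}

ex <- data.frame(USUBJID = "01", DOSE = 100, CL = 5)
derive_exposure(ex)$CAVGSS  # returns 20
```

<p>The design point is the same one the toolchain makes: an assertion at the point of derivation produces an error message naming the offending step, instead of a silently wrong exposure metric three scripts later.</p>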
</section>
<section id="the-examples-page" class="level2">
<h2 class="anchored" data-anchor-id="the-examples-page">The Examples Page</h2>
<p>The working R code is now live on the <a href="https://pharmaverse.github.io/examples/adam/ader.html" rel="nofollow" target="_blank">pharmaverse examples site</a>.</p>
<p>The <code>ADER+</code> page covers all four datasets in a single tabbed page, with:</p>
<ul>
<li>A shared introduction explaining the ER framework and its relationship to the PopPK Implementation Guide</li>
<li>Common derivations used across all four datasets</li>
<li>Dataset-specific derivation code for <code>ADER</code>, <code>ADEE</code>, <code>ADES</code>, and <code>ADTRR</code></li>
<li>Full variable listings and metadata</li>
</ul>
<p>The code uses <code>{pharmaverseadam}</code> as source data, making it immediately reproducible. Think of it as a template — a starting point you can adapt for your own studies.</p>
</section>
<section id="we-need-your-feedback" class="level2">
<h2 class="anchored" data-anchor-id="we-need-your-feedback">We Need Your Feedback</h2>
<p>This framework is a proposal, not a finished standard. Any formal ER ADaM standard would ultimately be owned and ratified by CDISC — the community can propose, pilot, and advocate, but the path to an official standard requires active collaboration with CDISC. The groundwork for that is community validation: pilot testing across therapeutic areas, working group discussion, and real-world use.</p>
<p>That means we need two kinds of input:</p>
<p><strong>Clinical programmers</strong> — try the code. Does the derivation logic hold up? What edge cases are we missing? Open an issue or PR on the <a href="https://github.com/pharmaverse/examples" rel="nofollow" target="_blank">pharmaverse examples repository</a>.</p>
<p><strong>ER modelers and pharmacometricians</strong> — this one is especially for you. Does this dataset structure actually serve your modeling needs? Are the exposure metrics the right ones? Is the dataset grain appropriate for the analyses you run? You are the end users of these datasets, and your perspective is exactly what’s needed to make this framework scientifically sound, not just technically compliant.</p>
<p>The discussion is open. Let’s keep it going.</p>
</section>



<div id="quarto-appendix" class="default"><section id="last-updated" class="level2 appendix"><h2 class="anchored quarto-appendix-heading">Last updated</h2><div class="quarto-appendix-contents">

<p>2026-05-04 12:40:51.777961</p>
</div></section><section id="details" class="level2 appendix"><h2 class="anchored quarto-appendix-heading">Details</h2><div class="quarto-appendix-contents">

<p><a href="https://github.com/pharmaverse/blog/tree/main/posts/2026-04-17-closing-the-gap-in/closing-the-gap-in-exposure-response-data-a-pharmaverse-framework.qmd" rel="nofollow" target="_blank">Source</a>, <a href="https://pharmaverse.github.io/blog/session_info.html" rel="nofollow" target="_blank">Session info</a></p>
</div></section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="nofollow" href="https://creativecommons.org/licenses/by/4.0/" target="_blank">CC BY 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre>@online{dickinson2026,
  author = {Dickinson, Jeff},
  title = {Closing the {Gap} in {Exposure-Response} {Data:} {A}
    {Pharmaverse} {Framework}},
  date = {2026-04-30},
  url = {https://pharmaverse.github.io/blog/posts/2026-04-17-closing-the-gap-in/closing-the-gap-in-exposure-response-data-a-pharmaverse-framework.html},
  langid = {en}
}
</pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-dickinson2026" class="csl-entry quarto-appendix-citeas">
Dickinson, Jeff. 2026. <span>“Closing the Gap in Exposure-Response Data:
A Pharmaverse Framework.”</span> April 30, 2026. <a href="https://pharmaverse.github.io/blog/posts/2026-04-17-closing-the-gap-in/closing-the-gap-in-exposure-response-data-a-pharmaverse-framework.html" rel="nofollow" target="_blank">https://pharmaverse.github.io/blog/posts/2026-04-17-closing-the-gap-in/closing-the-gap-in-exposure-response-data-a-pharmaverse-framework.html</a>.
</div></div></section></div> 
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://pharmaverse.github.io/blog/posts/2026-04-17-closing-the-gap-in/closing-the-gap-in-exposure-response-data-a-pharmaverse-framework.html"> pharmaverse blog</a></strong>.</div>
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/closing-the-gap-in-exposure-response-data-a-pharmaverse-framework/">Closing the Gap in Exposure-Response Data: A Pharmaverse Framework</a>]]></content:encoded>
					
		
		<enclosure url="https://pharmaverse.github.io/blog/posts/2026-04-17-closing-the-gap-in/pharmaverse_examples.png" length="0" type="image/png" />

		<post-id xmlns="com-wordpress:feed-additions:1">401018</post-id>	</item>
		<item>
		<title>rOpenSci News Digest, April 2026</title>
		<link>https://www.r-bloggers.com/2026/04/ropensci-news-digest-april-2026/</link>
		
		<dc:creator><![CDATA[rOpenSci]]></dc:creator>
		<pubDate>Thu, 30 Apr 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://ropensci.org/blog/2026/04/30/news-april-2026/</guid>

					<description><![CDATA[<p>Dear rOpenSci friends, it’s time for our monthly news roundup!  You can read this post on our blog. Now let’s dive into the activity at and around rOpenSci!</p>
<p>Tomáš Kalibera (1978–2026)<br />
The rOpenSci team is deeply saddened at the los...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/ropensci-news-digest-april-2026/">rOpenSci News Digest, April 2026</a>]]></description>
<content:encoded><![CDATA[

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://ropensci.org/blog/2026/04/30/news-april-2026/"> rOpenSci - open tools for open science</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
</div>

<p>Dear rOpenSci friends, it’s time for our monthly news roundup! You can read this post <a href="https://ropensci.org/blog/2026/04/30/news-april-2026" rel="nofollow" target="_blank">on our blog</a>. Now let’s dive into the activity at and around rOpenSci!</p>
<h2>
Tomáš Kalibera (1978–2026)
</h2><p>The rOpenSci team is deeply saddened at the loss of Tomáš Kalibera, a member of R-Core and contributor to the R community for almost 10 years. Tomáš passed away on April 1.  Our thoughts are with Tomáš’s friends and family at this time. </p>
<p>Read <a href="https://prl-prg.github.io/tomas-kalibera.html" rel="nofollow" target="_blank">Jan Vitek’s remembrance of Tomáš</a>.</p>
<h2>
rOpenSci HQ
</h2><h3>
New editors Alec Robitaille and Lucy D’Agostino McGowan
</h3><p>We’re excited to welcome <a href="https://ropensci.org/author/alec-robitaille/" rel="nofollow" target="_blank">Alec Robitaille</a> and <a href="https://ropensci.org/author/lucy-dagostino-mcgowan/" rel="nofollow" target="_blank">Lucy D’Agostino McGowan</a> as new editors. Alec joins our general review team, and Lucy our statistical software review team. Read more in the <a href="https://ropensci.org/blog/2026/04/16/editors2026/" rel="nofollow" target="_blank">post introducing them</a>!</p>
<h3>
Champions Program update
</h3><p>We’re excited to share that we’ve finished selecting the new cohort for the rOpenSci Champions Program! This was not an easy process. The quality of the proposals was exceptionally high, which made the selection both challenging and inspiring. We’re grateful to everyone who applied and shared their ideas with us. Please join us in welcoming our new Champions: Bastián Olea Herrera, Durga Valentina Linares Herrera, José Daniel Conejeros, Denisse Fierro Arcos, Evelia Lorena Coss Navarrete, Gladys Choque Ulloa, Linda Jazmín Cabrera Orellana, Patricia Andrea Loto, Marina Cecilia Cock, María Florencia Tames, and Estefanía Torrejón.</p>
<p>Over the coming months, they will contribute to the R Community through developing new packages, reviewing packages, and submitting packages for peer review. We’re looking forward to working with this amazing group and supporting their projects!</p>
<h3>
Collaborating between Bioconductor and R-universe on Development of Common Infrastructure
</h3><p>Bioconductor is collaborating with R-universe to gradually modernize parts of its infrastructure, while accommodating the project’s scale, governance, and established processes. In turn, Bioconductor is helping R-universe expand and refine its features as we learn to serve the complex needs of the Bioconductor community. Read more in the <a href="https://ropensci.org/blog/2026/04/08/r-universe-bioc/" rel="nofollow" target="_blank">blog post</a>.</p>
<h3>
rOpenSci Staff presentations
</h3><h4>
Yanina Bellini Saibene at R/Medicine 2026
</h4><p>Yani will deliver her keynote talk <a href="https://ropensci.org/events/r-medicine-2026-keynote/" rel="nofollow" target="_blank">“Software Sustainability and Community Management”</a> on Thursday May 7th, 11:15AM–12:15PM ET.</p>
<h4>
Jeroen Ooms at “Where Do R Packages Live?”
</h4><p>Jeroen will take part in an online discussion panel on Wednesday 20 May at 5:00 PM &#8211; 6:00 PM (AEST). The panel is organized by the Statistical Computing and Visualisation section of the Statistical Society of Australia (<a href="https://statsoc.org.au/event-6653060" rel="nofollow" target="_blank">details</a>).</p>
<h3>
Updates to the goodpractice R package
</h3><p>We have long recommended the <a href="https://docs.ropensci.org/goodpractice" rel="nofollow" target="_blank">goodpractice package</a>, which identifies issues with R packages, and advises how to fix them. Thanks to a huge amount of work by valued community member <a href="https://ropensci.org/author/athanasia-mo-mowinckel/" rel="nofollow" target="_blank">Athanasia Mo Mowinckel</a>, goodpractice has been extended and improved to include entirely new suites of checks, and improved ability to control which checks are run. A blog post describing the updates will be published soon, but in the meantime, we encourage you to install the current <a href="https://docs.ropensci.org/goodpractice/#installation" rel="nofollow" target="_blank">development version</a> and try it out yourself.</p>
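<p>A minimal sketch of trying the development version, assuming you have the <code>{remotes}</code> package available (see the goodpractice README for the canonical installation instructions):</p>

```r
# Install the development version from GitHub and run the checks on a
# local package source directory. gp() is goodpractice's main entry
# point; the path below is a placeholder for your own package.
# install.packages("remotes")
remotes::install_github("ropensci/goodpractice")
library(goodpractice)
gp("path/to/your/package")  # prints advice on issues it finds
```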
<h3>
Analyse your targets pipeline
</h3><p>Following our <a href="https://ropensci.org/blog/2026/04/02/tree-sitter-overview/" rel="nofollow" target="_blank">blog post about tree-sitter</a>, Tyler Morgan Wall was <a href="https://news.ycombinator.com/item?id=47801899" rel="nofollow" target="_blank">inspired</a> to create a <a href="https://github.com/tylermorganwall/tarborist" rel="nofollow" target="_blank">static analysis tool for targets pipelines (as a VS Code extension)</a>.</p>
<h3>
Coworking
</h3><p>Read <a href="https://ropensci.org/blog/2023/06/21/coworking/" rel="nofollow" target="_blank">all about coworking</a>!</p>
<ul>
<li>Tuesday May 5th 2026, 9:00 Australia Western (01:00 UTC) <a href="https://ropensci.org/events/coworking-2026-05/" rel="nofollow" target="_blank">“Code Review with rOpenSci”</a> with <a href="https://ropensci.org/author/steffi-lazerte/" rel="nofollow" target="_blank">Steffi LaZerte</a> and cohost <a href="https://ropensci.org/author/liz-hare/" rel="nofollow" target="_blank">Liz Hare</a>.
<ul>
<li>Explore resources for Code Review</li>
<li>Sign up to volunteer to do <a href="https://airtable.com/app8dssb6a7PG6Vwj/shrnfDI2S9uuyxtDw" rel="nofollow" target="_blank">software peer-review</a> at rOpenSci</li>
<li>Meet cohost, Liz Hare, and discuss resources for Code Review with rOpenSci.</li>
</ul>
</li>
<li>Tuesday June 2nd 2026, 14:00 Europe Central (12:00 UTC) [theme to be determined], with <a href="https://ropensci.org/author/steffi-lazerte/" rel="nofollow" target="_blank">Steffi LaZerte</a> and cohost to be determined.
<ul>
<li>Explore resources related to the theme</li>
<li>Meet the cohost, and other attendees, and discuss the theme or other topics.</li>
</ul>
</li>
</ul>
<p>And remember, you can always cowork independently on work related to R, work on packages that tend to be neglected, or work on whatever you need to get done!</p>
<h3>
Editors’ Office Hours
</h3><p>We are exploring hosting a new event, Editors’ Office Hours, where you can drop in to ask questions about rOpenSci Software Peer Review on or near the third Tuesday of each month, alternating among timezones to accommodate different parts of the world.</p>
<p>Upcoming office hours:</p>
<ul>
<li>Tuesday May 19, 16:00-17:00 Europe Central (14:00-15:00 UTC) (<a href="https://ropensci.org/events/office-hours-2026-05/" rel="nofollow" target="_blank">event</a>)</li>
</ul>
<h3>
useR! 2026 Diversity Scholarship Program
</h3><p>useR! 2026 is offering diversity scholarships to support participation from people in underrepresented or historically marginalized groups within the R community. The program includes both registration fee waivers and full needs-based scholarships, which cover conference registration as well as travel and lodging (via reimbursement). Applications are open to eligible participants worldwide <strong>until May 10</strong>, and will be reviewed based on need, eligibility, and potential impact by a committee from Forwards, RLadies+ Global, and rOpenSci.</p>
<p>Find all the details and important links on the conference website: <a href="https://user2026.r-project.org/additional/diversity_scholarship.html" rel="nofollow" target="_blank">https://user2026.r-project.org/additional/diversity_scholarship.html</a></p>
<h2>
Software <img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f4e6.png" alt="📦" class="wp-smiley" style="height: 1em; max-height: 1em;" />
</h2><h3>
New packages
</h3><p>The following package recently became a part of our software suite:</p>
<ul>
<li><a href="https://docs.ropensci.org/reviser" rel="nofollow" target="_blank">reviser</a>, developed by Marc Burri together with Philipp Wegmueller: Analyzes revisions in real-time time series vintages. The package converts between wide revision triangles and tidy long vintages, extracts selected releases, computes revision series, visualizes vintage paths, and summarizes revision properties such as bias, dispersion, autocorrelation, and news-noise diagnostics. It also identifies efficient releases and estimates state-space models for revision nowcasting. Methods are based on Howrey (1978) <a href="https://doi.org/10.2307/1924972" rel="nofollow" target="_blank">https://doi.org/10.2307/1924972</a>, Jacobs and Van Norden (2011) <a href="https://doi.org/10.1016/j.jeconom.2010.04.010" rel="nofollow" target="_blank">https://doi.org/10.1016/j.jeconom.2010.04.010</a>, and Kishor and Koenig (2012) <a href="https://doi.org/10.1198/jbes.2010.08169" rel="nofollow" target="_blank">https://doi.org/10.1198/jbes.2010.08169</a>. It has been <a href="https://github.com/ropensci/software-review/issues/709" rel="nofollow" target="_blank">reviewed</a>.</li>
</ul>
<p>Discover <a href="https://ropensci.org/packages" rel="nofollow" target="_blank">more packages</a>, read more about <a href="https://ropensci.org/software-review" rel="nofollow" target="_blank">Software Peer Review</a>.</p>
<h3>
New versions
</h3><p>The following ten packages have had an update since the last newsletter: <a href="https://docs.ropensci.org/osmextract" title="Download and Import Open Street Map Data Extracts" rel="nofollow" target="_blank">osmextract</a> (<a href="https://github.com/ropensci/osmextract/releases/tag/v0.6.0" rel="nofollow" target="_blank"><code>v0.6.0</code></a>), <a href="https://docs.ropensci.org/Athlytics" title="Academic R Package for Sports Physiology Analysis from Local Strava Data" rel="nofollow" target="_blank">Athlytics</a> (<a href="https://github.com/ropensci/Athlytics/releases/tag/v1.0.5" rel="nofollow" target="_blank"><code>v1.0.5</code></a>), <a href="https://docs.ropensci.org/emodnet.wfs" title="Access EMODnet Web Feature Service Data" rel="nofollow" target="_blank">emodnet.wfs</a> (<a href="https://github.com/EMODnet/emodnet.wfs/releases/tag/v2.1.2" rel="nofollow" target="_blank"><code>v2.1.2</code></a>), <a href="https://docs.ropensci.org/fellingdater" title="Tree-ring dating and estimating felling dates of historical timbers" rel="nofollow" target="_blank">fellingdater</a> (<a href="https://github.com/ropensci/fellingdater/releases/tag/v1.2.1" rel="nofollow" target="_blank"><code>v1.2.1</code></a>), <a href="https://docs.ropensci.org/readODS" title="Read and Write ODS Files" rel="nofollow" target="_blank">readODS</a> (<a href="https://github.com/ropensci/readODS/releases/tag/v2.3.5" rel="nofollow" target="_blank"><code>v2.3.5</code></a>), <a href="https://docs.ropensci.org/git2rdata" title="Store and Retrieve Data.frames in a Git Repository" rel="nofollow" target="_blank">git2rdata</a> (<a href="https://github.com/ropensci/git2rdata/releases/tag/v0.5.2" rel="nofollow" target="_blank"><code>v0.5.2</code></a>), <a href="https://docs.ropensci.org/weatherOz" title="An API Client for Australian Weather and Climate Data Resources" rel="nofollow" target="_blank">weatherOz</a> (<a href="https://github.com/ropensci/weatherOz/releases/tag/v3.0.0" rel="nofollow" 
target="_blank"><code>v3.0.0</code></a>), <a href="https://docs.ropensci.org/promoutils" title="Utilities for Promoting rOpenSci on Social Media" rel="nofollow" target="_blank">promoutils</a> (<a href="https://github.com/ropensci-org/promoutils/releases/tag/v0.5.0" rel="nofollow" target="_blank"><code>v0.5.0</code></a>), <a href="https://docs.ropensci.org/allcontributors" title="Acknowledge all Contributors to a Project" rel="nofollow" target="_blank">allcontributors</a> (<a href="https://github.com/ropensci/allcontributors/releases/tag/v0.2.3" rel="nofollow" target="_blank"><code>v0.2.3</code></a>), and <a href="https://docs.ropensci.org/reviser" title="Analyzing Revisions in Real-Time Time Series Vintages" rel="nofollow" target="_blank">reviser</a> (<a href="https://github.com/ropensci/reviser/releases/tag/v0.1.1" rel="nofollow" target="_blank"><code>v0.1.1</code></a>).</p>
<p>The writexl package has a <a href="https://github.com/ropensci/writexl/pull/98#issuecomment-4191858158" rel="nofollow" target="_blank">new maintainer</a>, Bill Denney. NLMR is now maintained by <a href="https://github.com/ropensci/NLMR/issues/116#issuecomment-4280937012" rel="nofollow" target="_blank">Jakub Nowosad</a>.</p>
<h2>
Software Peer Review
</h2><p>There are eighteen recently closed and active submissions and four submissions on hold. Issues are at different stages:</p>
<ul>
<li>
<p>One at <a href="https://github.com/ropensci/software-review/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc+label%3A%226/approved%22" rel="nofollow" target="_blank">‘6/approved’</a>:</p>
<ul>
<li><a href="https://github.com/ropensci/software-review/issues/709" rel="nofollow" target="_blank">reviser</a>, Tools for Studying Revision Properties in Real-Time Time Series Vintages. Submitted by <a href="https://marcburri.github.io/" rel="nofollow" target="_blank">Marc Burri</a>. (Stats).</li>
</ul>
</li>
<li>
<p>Two at <a href="https://github.com/ropensci/software-review/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc+label%3A%225/awaiting-reviewer(s)-response%22" rel="nofollow" target="_blank">‘5/awaiting-reviewer(s)-response’</a>:</p>
<ul>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/754" rel="nofollow" target="_blank">saperlipopette</a>, Create Example Git Messes. Submitted by <a href="https://masalmon.eu/" rel="nofollow" target="_blank">Maëlle Salmon</a>.</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/671" rel="nofollow" target="_blank">pkgmatch</a>, Find R Packages Matching Either Descriptions or Other R Packages. Submitted by <a href="https://mpadge.github.io/" rel="nofollow" target="_blank">mark padgham</a>.</p>
</li>
</ul>
</li>
<li>
<p>Four at <a href="https://github.com/ropensci/software-review/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc+label%3A%224/review(s)-in-awaiting-changes%22" rel="nofollow" target="_blank">‘4/review(s)-in-awaiting-changes’</a>:</p>
<ul>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/760" rel="nofollow" target="_blank">pvEBayes</a>, Empirical Bayes Methods for Pharmacovigilance. Submitted by <a href="https://github.com/YihaoTancn" rel="nofollow" target="_blank">Yihao Tan</a>. (Stats).</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/741" rel="nofollow" target="_blank">logolink</a>, An Interface for Running NetLogo Simulations. Submitted by <a href="http://danielvartan.com/" rel="nofollow" target="_blank">Daniel Vartanian</a>.</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/732" rel="nofollow" target="_blank">ActiGlobe</a>, Wearable Recording Processor for Time Shift Adjustment and Data Analysis. Submitted by <a href="https://scholar.google.ca/citations?user=T7y9ckwAAAAJ&#038;hl=en" rel="nofollow" target="_blank">C. William Yao</a>.</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/615" rel="nofollow" target="_blank">galamm</a>, Generalized Additive Latent and Mixed Models. Submitted by <a href="https://osorensen.github.io/" rel="nofollow" target="_blank">Øystein Sørensen</a>. (Stats).</p>
</li>
</ul>
</li>
<li>
<p>Five at <a href="https://github.com/ropensci/software-review/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc+label%3A%223/reviewer(s)-assigned%22" rel="nofollow" target="_blank">‘3/reviewer(s)-assigned’</a>:</p>
<ul>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/750" rel="nofollow" target="_blank">nycOpenData</a>, Convenient Access to NYC Open Data API Endpoints. Submitted by <a href="https://github.com/martinezc1" rel="nofollow" target="_blank">Christian Martinez</a>.</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/743" rel="nofollow" target="_blank">RAMEN</a>, RAMEN: Regional Association of Methylome variability with the Exposome and geNome. Submitted by <a href="https://erick-navarrodelgado.netlify.app/" rel="nofollow" target="_blank">Erick Navarro-Delgado</a>.</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/730" rel="nofollow" target="_blank">ernest</a>, A Toolkit for Nested Sampling. Submitted by <a href="https://github.com/kylesnap" rel="nofollow" target="_blank">Kyle Dewsnap</a>. (Stats).</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/718" rel="nofollow" target="_blank">rcrisp</a>, Automate the Delineation of Urban River Spaces. Submitted by <a href="https://github.com/cforgaci" rel="nofollow" target="_blank">Claudiu Forgaci</a>. (Stats).</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/704" rel="nofollow" target="_blank">priorsense</a>, Prior Diagnostics and Sensitivity Analysis. Submitted by <a href="https://github.com/n-kall" rel="nofollow" target="_blank">Noa Kallioinen</a>. (Stats).</p>
</li>
</ul>
</li>
<li>
<p>Three at <a href="https://github.com/ropensci/software-review/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc+label%3A%222/seeking-reviewer(s)%22" rel="nofollow" target="_blank">‘2/seeking-reviewer(s)’</a>:</p>
<ul>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/763" rel="nofollow" target="_blank">EpiStrainDynamics</a>, Infer temporal trends of multiple pathogens. Submitted by <a href="http://www.smwindecker.com/" rel="nofollow" target="_blank">Saras Windecker</a>. (Stats).</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/762" rel="nofollow" target="_blank">lakefetch</a>, Calculate Fetch and Wave Exposure for Lake Sampling Points. Submitted by <a href="https://github.com/jeremylfarrell" rel="nofollow" target="_blank">jeremylfarrell</a>.</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/740" rel="nofollow" target="_blank">fcmconfr</a>, Fuzzy Cognitive Map Analysis in R. Submitted by <a href="https://github.com/bhroston" rel="nofollow" target="_blank">benroston</a>. (Stats).</p>
</li>
</ul>
</li>
<li>
<p>Three at <a href="https://github.com/ropensci/software-review/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc+label%3A%221/editor-checks%22" rel="nofollow" target="_blank">‘1/editor-checks’</a>:</p>
<ul>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/765" rel="nofollow" target="_blank">ciecl</a>, International Classification of Diseases ICD-10/ICD-11 for Chile. Submitted by <a href="https://github.com/Rodotasso" rel="nofollow" target="_blank">Rodolfo Tasso</a>.</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/744" rel="nofollow" target="_blank">RAQSAPI</a>, A Simple Interface to the US EPA Air Quality System Data Mart API. Submitted by <a href="https://github.com/mccroweyclinton-EPA" rel="nofollow" target="_blank">mccroweyclinton-EPA</a>.</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/717" rel="nofollow" target="_blank">coevolve</a>, Fit Bayesian Generalized Dynamic Phylogenetic Models using Stan. Submitted by <a href="https://scottclaessens.github.io/" rel="nofollow" target="_blank">Scott Claessens</a>. (Stats).</p>
</li>
</ul>
</li>
</ul>
<p>Find out more about <a href="https://ropensci.org/software-review" rel="nofollow" target="_blank">Software Peer Review</a> and how to get involved.</p>
<h2>
On the blog
</h2>
<h3>
Software Review
</h3><ul>
<li>
<p><a href="https://ropensci.org/blog/2026/04/13/reviser" rel="nofollow" target="_blank">reviser: Analyzing Real-Time Data Revisions in R</a> by Marc Burri. A short introduction to reviser for analyzing real-time data vintages and revisions in R.</p>
</li>
<li>
<p><a href="https://ropensci.org/blog/2026/04/16/editors2026" rel="nofollow" target="_blank">Expanding the Editorial Team: Alec Robitaille and Lucy D’Agostino McGowan Join as Editors</a> by Alec Robitaille, Lucy D’Agostino McGowan, and Yanina Bellini Saibene. Introducing two new editors for rOpenSci statistical software peer review.</p>
</li>
<li>
<p><a href="https://ropensci.org/blog/2026/04/08/r-universe-bioc" rel="nofollow" target="_blank">Collaborating between Bioconductor and R-universe on Development of Common Infrastructure</a> by The rOpenSci Team and The Bioconductor Team. R-consortium ISC top-level project R-universe is working with Bioconductor to help gradually modernize parts of its infrastructure, while accommodating the project’s scale, governance, and established processes.</p>
</li>
</ul>
<h3>
Tech Notes
</h3><ul>
<li><a href="https://ropensci.org/blog/2026/04/02/tree-sitter-overview" rel="nofollow" target="_blank">A Better R Programming Experience Thanks to Tree-sitter</a> by Maëlle Salmon. Modern tooling for parsing, searching, formatting, editing R code, just like for other programming languages.</li>
</ul>
<h2>
Calls for contributions
</h2><h3>
Calls for maintainers
</h3><p>If you’re interested in maintaining any of the R packages below, you might enjoy reading our blog post <a href="https://ropensci.org/blog/2023/02/07/what-does-it-mean-to-maintain-a-package/" rel="nofollow" target="_blank">What Does It Mean to Maintain a Package?</a>.</p>
<ul>
<li>
<p><a href="https://docs.ropensci.org/landscapetools" rel="nofollow" target="_blank">landscapetools</a>, R package for some of the less-glamorous tasks involved in landscape analysis. <a href="https://github.com/ropensci/landscapetools/issues/48" rel="nofollow" target="_blank">Issue for volunteering</a>.</p>
</li>
<li>
<p><a href="https://docs.ropensci.org/hddtools" rel="nofollow" target="_blank">hddtools</a>, Tools to discover hydrological data, accessing catalogues and databases from various data providers. <a href="https://github.com/ropensci/hddtools/issues/36" rel="nofollow" target="_blank">Issue for volunteering</a>.</p>
</li>
<li>
<p><a href="https://docs.ropensci.org/qualtRics/" rel="nofollow" target="_blank">qualtRics</a>, download Qualtrics survey data. <a href="https://github.com/ropensci/qualtRics/issues/383" rel="nofollow" target="_blank">Issue for volunteering</a>.</p>
</li>
</ul>
<h3>
Calls for contributions
</h3><p>Refer to our <a href="https://ropensci.org/help-wanted/" rel="nofollow" target="_blank">help wanted page</a> – before opening a PR, we recommend asking in the issue whether help is still needed.</p>
<h2>
Package development corner
</h2><p>Some useful information for R package developers. <img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f440.png" alt="👀" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<h3>
Useless code, duplicated function? Refactoring with Jarl
</h3><p>The Jarl CLI by Étienne Bacher received several useful new features for package developers:</p>
<ul>
<li><a href="https://jarl.etiennebacher.com/rules/unused_function" rel="nofollow" target="_blank"><code>unused_function</code></a></li>
<li><a href="https://jarl.etiennebacher.com/rules/duplicated_function_definition" rel="nofollow" target="_blank"><code>duplicated_function_definition</code></a>.</li>
</ul>
<p>They are a nice complement to <a href="https://jarl.etiennebacher.com/rules/unreachable_code" rel="nofollow" target="_blank"><code>unreachable_code</code></a>.</p>
<p>Read more in the <a href="https://www.etiennebacher.com/posts/2026-03-23-jarl-0.5.0/#other-features-for-package-developers" rel="nofollow" target="_blank">release announcement</a>.</p>
<h3>
Git commands to get to know a project
</h3><p>Ally Piechowski wrote an insightful post entitled <a href="https://piechowski.io/post/git-commands-before-reading-code/" rel="nofollow" target="_blank">“The Git Commands I Run Before Reading Any Code”</a>, which suggests Git commands useful for understanding a code base. For instance, a command to determine which files recently changed the most!</p>
<p>To complement this post, Garrick Aden-Buie wrote <a href="https://gist.github.com/gadenbuie/463ff1e9f3b0f48cddc44db2224d286b" rel="nofollow" target="_blank">“a little <code>git-recon</code> bash script that runs them in series, complete with some ascii bar plots”</a>.</p>
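<p>As a taste of the approach, here is one common idiom for finding the files that have changed most often in a repository’s history (a generic example, not necessarily the exact command from the post):</p>

<figure class="highlight"><pre># count how many commits touched each file, most-changed first
git log --name-only --pretty=format: | grep -v '^$' | sort | uniq -c | sort -rn | head</pre></figure>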
<h3>
Another R mailing list archive
</h3><p>Last month we listed several ways to <a href="https://ropensci.org/blog/2026/03/30/news-mars-2026/#how-to-browse-the-r-mailing-lists" rel="nofollow" target="_blank">browse the R mailing lists</a>. Newsletter reader Florian Kohrt kindly wrote to us to mention the <a href="https://github.com/MichaelChirico/r-mailing-list-archive" rel="nofollow" target="_blank">plain text backup</a> maintained by Michael Chirico.</p>
<h3>
Will R run out of random seeds? Useful seed explainer
</h3><p>Andrew Heiss published a useful and interesting deep dive into <a href="https://www.andrewheiss.com/blog/2026/04/13/seeds-predetermined-universes/" rel="nofollow" target="_blank">random seeds</a>.</p>
<h3>
Enforcing the coalesce operator
</h3><p>Are you enjoying the coalesce operator <code>%||%</code> introduced in <a href="https://cran.r-project.org/bin/windows/base/old/4.4.0/NEWS.R-4.4.0.html" rel="nofollow" target="_blank">R 4.4.0</a>?</p>
<blockquote>
<p><code>L %||% R</code> newly in base is an expressive idiom for the phrases <code>if(!is.null(L)) L else R</code> or <code>if(is.null(L)) R else L</code>.</p>
</blockquote>
<p>Consider enforcing it via <a href="https://jarl.etiennebacher.com/rules/coalesce" rel="nofollow" target="_blank">Jarl</a> or <a href="https://lintr.r-lib.org/reference/coalesce_linter.html" rel="nofollow" target="_blank">lintr</a>.</p>
<p>As a reminder, the operator can be used in older versions of R through the <a href="https://github.com/r-lib/backports/pull/81/changes" rel="nofollow" target="_blank">backports</a> R package or by importing it from <a href="https://rlang.r-lib.org/reference/op-null-default.html" rel="nofollow" target="_blank">rlang</a>.</p>
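<p>As a quick illustration (base R 4.4.0 or later, or with <code>%||%</code> imported from rlang):</p>

<figure class="highlight"><pre>x &lt;- NULL
y &lt;- 42

x %||% y  # x is NULL, so the right-hand side is used: 42
y %||% 0  # y is not NULL, so it is returned unchanged: 42</pre></figure>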
<h3>
Positive AI reading
</h3><p>If you’re feeling some AI dread, you might enjoy:</p>
<ul>
<li>The blog post <a href="https://vickiboykis.com/2026/03/04/antidote/" rel="nofollow" target="_blank">“Antidote”</a> by Vicky Boykis;</li>
<li>The talk <a href="https://www.youtube.com/watch?v=5kTxZMSB9oo&#038;t" rel="nofollow" target="_blank">“Is my degree worthless”</a> by Davis Vaughan.</li>
</ul>
<h2>
Last words
</h2><p>Thanks for reading! If you want to get involved with rOpenSci, check out our <a href="https://contributing.ropensci.org/" rel="nofollow" target="_blank">Contributing Guide</a> that can help direct you to the right place, whether you want to make code contributions, non-code contributions, or contribute in other ways like sharing use cases. You can also support our work through <a href="https://ropensci.org/donate" rel="nofollow" target="_blank">donations</a>.</p>
<p>If you haven’t subscribed to our newsletter yet, you can <a href="https://ropensci.org/news/" rel="nofollow" target="_blank">do so via a form</a>. Until it’s time for our next newsletter, you can keep in touch with us via our <a href="https://ropensci.org/" rel="nofollow" target="_blank">website</a> and <a href="https://hachyderm.io/@rOpenSci" rel="nofollow" target="_blank">Mastodon account</a>.</p>
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://ropensci.org/blog/2026/04/30/news-april-2026/"> rOpenSci - open tools for open science</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/ropensci-news-digest-april-2026/">rOpenSci News Digest, April 2026</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400938</post-id>	</item>
		<item>
		<title>logrittr: A Verbose Pipe Operator for Logging dplyr Pipelines</title>
		<link>https://www.r-bloggers.com/2026/04/logrittr-a-verbose-pipe-operator-for-logging-dplyr-pipelines-2/</link>
		
		<dc:creator><![CDATA[Guillaume Pressiat]]></dc:creator>
		<pubDate>Wed, 29 Apr 2026 08:05:55 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://guillaumepressiat.github.io/blog/2026/04/logrittr-re</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; ">
<p>  dplyr verbs are descriptive: let’s make them more verbose!</p>
<p>  Yet another pipe for R.</p>
<p>Repost for better image handling on r-bloggers.</p>
<p>Motivation</p>
<p>In SAS, every DATA step prints a log:</p>
<p>NOTE: There were 120000 observations read f...</p></div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/logrittr-a-verbose-pipe-operator-for-logging-dplyr-pipelines-2/">logrittr: A Verbose Pipe Operator for Logging dplyr Pipelines</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://guillaumepressiat.github.io/blog/2026/04/logrittr-re"> Guillaume Pressiat</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p><a href="https://github.com/guillaumepressiat/logrittr" rel="nofollow" target="_blank">
<img src="https://i2.wp.com/github.com/GuillaumePressiat/logrittr/raw/main/man/figures/logo.png?w=15%25&#038;ssl=1" style="float:right;padding-bottom: 20px;padding-right:30%" data-recalc-dims="1" />
</a></p>

<p><br /></p>

<blockquote>
  <p>dplyr verbs are descriptive: let’s make them more verbose!</p>
</blockquote>

<blockquote>
  <p>Yet another pipe for R.</p>
</blockquote>

<p><br />
<br /></p>

<span id="more-400890"></span>

<p><em>Repost for better image handling on r-bloggers.</em></p>

<p><img src="https://i2.wp.com/guillaumepressiat.github.io/images/pastels_example.png?w=578&#038;ssl=1" data-recalc-dims="1" /></p>

<p><br />
<br /></p>

<h2 id="motivation">Motivation</h2>

<p>In SAS, every DATA step prints a log:</p>

<figure class="highlight"><pre>NOTE: There were 120000 observations read from WORK.SALES.
NOTE: 7153 observations were deleted.
NOTE: The data set WORK.RESULT has 112847 observations and 11 variables.</pre></figure>

<p>R’s <code>dplyr</code> pipelines are silent. <code>logrittr</code> fills that gap with <code>%&gt;=%</code>, a
drop-in pipe that logs row counts, column counts, added/dropped columns, and
timing at every step, with no function masking.</p>

<p>With <a href="https://github.com/tonsky/FiraCode" rel="nofollow" target="_blank">Fira Code</a> ligatures, <code>%&gt;=%</code> renders
as a single wide arrow visually similar to <code>%&gt;%</code> with an underline added, like a subtitle: a way to read between the lines of a pipeline and see what happened.</p>

<h2 id="multiples-contexts">Multiple contexts</h2>

<p>Things happen:</p>

<figure class="highlight"><pre>NOTE: There were 120000 observations read from WORK.SALES.
NOTE: 120000 observations were deleted.
NOTE: The data set WORK.RESULT has 0 observations and 11 variables.</pre></figure>

<p>“This is where we lost all the rows during script execution.”</p>

<h4 id="pro">Pro</h4>

<p>Reading the log long after a script has run helps you:</p>

<ul>
  <li>see what happened at each stage of data processing without having to rerun the code, for example in a production environment where the input data is constantly changing</li>
  <li>monitor key processes</li>
  <li>explain what happened (for an audit, for example)</li>
</ul>

<p>This is often needed in professional contexts.</p>

<h4 id="educational">Educational</h4>

<p>A console log also makes pipelines clearer for those with little experience with 
the tidyverse: people taking their first steps in programming by following a tutorial or teaching themselves.</p>

<h2 id="installation">Installation</h2>

<figure class="highlight"><pre>install.packages('logrittr', repos = 'https://guillaumepressiat.r-universe.dev')

# or from github
# devtools::install_github(&quot;GuillaumePressiat/logrittr&quot;)</pre></figure>

<p>See <a href="https://github.com/guillaumepressiat/logrittr" rel="nofollow" target="_blank">github</a> or <a href="https://guillaumepressiat.r-universe.dev/logrittr" rel="nofollow" target="_blank">r-universe</a>.</p>

<h2 id="usage">Usage</h2>

<figure class="highlight"><pre>library(logrittr)
library(dplyr)

iris %&gt;=%
  as_tibble() %&gt;=%
  filter(Sepal.Length &lt; 5)  %&gt;=%
  mutate(rn = row_number()) %&gt;=%
  semi_join(
    iris %&gt;% as_tibble() %&gt;=%
      filter(Species == &quot;setosa&quot;),
    by = &quot;Species&quot;
  )  %&gt;=%
  group_by(Species) %&gt;=%
  summarise(n = n_distinct(rn))</pre></figure>

<figure class="highlight"><pre>── iris  [rows:       150  cols:    5] ─────────────────────────────────────────────────────
&#x2139; as_tibble()                            rows:       150 +0        cols:    5 +0    [   0.0 ms]
&#x2139; filter(Sepal.Length &lt; 5)               rows:        22 -128      cols:    5 +0    [   3.0 ms]
&#x2139; mutate(rn = row_number())              rows:        22 +0        cols:    6 +1    [   1.0 ms]
  added: rn
&#x2139; &gt; filter(Species == &quot;setosa&quot;)          rows:        50 -100      cols:    5 +0    [   1.0 ms]
&#x2139; semi_join(iris %&gt;% as_tibble() %&gt;=%    rows:        20 -2        cols:    6 +0    [   5.0 ms]
  filter(Species == &quot;setosa&quot;), by =
  &quot;Species&quot;)
&#x2139; group_by(Species)                      rows:        20 +0        cols:    6 +0    [   3.0 ms]
&#x2139; summarise(n = n_distinct(rn))          rows:         1 -19       cols:    2 -4    [   2.0 ms]
  dropped: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, rn
  added: n</pre></figure>

<h3 id="screenshot">Screenshot</h3>

<p><img src="https://i1.wp.com/guillaumepressiat.github.io/images/nycflights13_example.png?w=578&#038;ssl=1" data-recalc-dims="1" /></p>

<p><br /></p>

<figure class="highlight"><pre>library(dplyr)
library(logrittr)

logrittr_options(lang = &quot;en&quot;, big_mark = &quot;,&quot;, wrap_width = NULL, max_cols = 3)

nycflights13::flights %&gt;=% 
  as_tibble() %&gt;=%
  group_by(year, month, day) %&gt;=% 
  count() %&gt;=% 
  tidyr::pivot_wider(values_from = &quot;n&quot;, names_from = &quot;day&quot;) %&gt;=% 
  glimpse()</pre></figure>

<h2 id="related-package-tidylog">Related package: <code>tidylog</code></h2>

<p><a href="https://github.com/elbersb/tidylog" rel="nofollow" target="_blank">tidylog</a> is a really neat package that gave me the motivation for this one.
<code>tidylog</code> works by masking dplyr functions, which doesn’t seem ideal to me.</p>

<p>This project was also an opportunity for me to try out a new programming tool that 
is widely used at the moment.</p>

<p><code>logrittr</code> uses a custom pipe operator and never touches
the dplyr namespace. Its console output is colorful and informative thanks to the cli package.</p>
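<p>For the curious, the no-masking approach can be sketched in a few lines. This is a simplified illustration of how a custom logging pipe might work, not logrittr’s actual implementation:</p>

<figure class="highlight"><pre>`%log&gt;%` &lt;- function(lhs, rhs) {
  rhs_call &lt;- substitute(rhs)
  # splice the (already evaluated) left-hand value in as the first argument,
  # as a pipe does, without masking any dplyr function
  new_call &lt;- as.call(c(rhs_call[[1]], list(lhs), as.list(rhs_call)[-1]))
  out &lt;- eval(new_call, envir = parent.frame())
  message(deparse1(rhs_call), ": rows ", nrow(lhs), " -&gt; ", nrow(out))
  out
}

iris %log&gt;% subset(Sepal.Length &lt; 5) %log&gt;% head()
# subset(Sepal.Length &lt; 5): rows 150 -&gt; 22
# head(): rows 22 -&gt; 6</pre></figure>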

<h2 id="working-with-lumberjack">Working with <code>lumberjack</code></h2>

<p>If you already know the <a href="https://github.com/markvanderloo/lumberjack" rel="nofollow" target="_blank">lumberjack</a> package, 
logrittr provides compatibility with it (timings are approximate).</p>

<p>Calling <code>logrittr_logger$new()</code>:</p>

<figure class="highlight"><pre>library(lumberjack)
library(dplyr)

l &lt;- logrittr_logger$new(verbose = TRUE)
logfile &lt;- tempfile(fileext=&quot;.-r.log.csv&quot;)

iris %L&gt;%
   start_log(log = l, label = &quot;iris step&quot;) %L&gt;%
   as_tibble() %L&gt;%
   filter(Sepal.Length &lt; 5) %L&gt;%
   mutate(rn = row_number()) %L&gt;%
   group_by(Species) %L&gt;%
   summarise(n = n_distinct(rn)) %L&gt;%
   dump_log(file=logfile, stop = FALSE)
   

mtcars %&gt;% 
  start_log(log = l, label = &quot;mtcars step&quot;) %L&gt;%
   count() %L&gt;%
   dump_log(file=logfile, stop = TRUE)


logdata &lt;- read.csv(logfile)</pre></figure>

<p>This will write the logrittr log content of multiple data steps to the same CSV file.</p>

<h2 id="limitations">Limitations</h2>

<ul>
  <li>
    <p>Like <code>tidylog</code>, logrittr only works with dplyr pipelines on in-memory R data.frames,
and is not able to handle dbplyr pipelines on database-backed (remote/lazy) tables.</p>
  </li>
  <li>
    <p>Join cardinalities, which tidylog reports nicely, are difficult to obtain from the pipe 
because the join has already completed; for now we only show the row and column evolution (before / after).</p>
  </li>
  <li>
    <p>Yes, it’s another pipe, which isn’t ideal. We can dream of a <code>with_logging(TRUE)</code> context that would activate the logrittr pipe’s behaviour in <code>|&gt;</code> or in <code>%&gt;%</code>.</p>
  </li>
</ul>

<h2 id="take-another-pipe-for-a-spin">Take another pipe for a spin</h2>

<p><code>logrittr</code> prioritizes the user experience with a structured and colorful display in the console.</p>

<p>For now, this package is just a proof of concept that gave me a chance to experiment a bit with the <code>cli</code> package and a few other things. But I think there’s a need for this in R, in a specific area where SAS output is so informative.</p>

<ul>
  <li><a href="https://guillaumepressiat.r-universe.dev/logrittr" rel="nofollow" target="_blank">https://guillaumepressiat.r-universe.dev/logrittr</a></li>
  <li><a href="https://github.com/guillaumepressiat/logrittr" rel="nofollow" target="_blank">https://github.com/guillaumepressiat/logrittr</a></li>
</ul>


<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://guillaumepressiat.github.io/blog/2026/04/logrittr-re"> Guillaume Pressiat</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/logrittr-a-verbose-pipe-operator-for-logging-dplyr-pipelines-2/">logrittr: A Verbose Pipe Operator for Logging dplyr Pipelines</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400890</post-id>	</item>
		<item>
		<title>CougarStats: a free and open-source Statistics web app for Teaching and Learning</title>
		<link>https://www.r-bloggers.com/2026/04/cougarstats-a-free-and-open-source-statistics-web-app-for-teaching-and-learning/</link>
		
		<dc:creator><![CDATA[Ashok Krishnamurthy]]></dc:creator>
		<pubDate>Wed, 29 Apr 2026 06:18:25 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://r-posts.com/?p=17352</guid>

					<description><![CDATA[<p>Hello, I’d like to share CougarStats, a free and open-source R Shiny web app I developed to support the teaching and learning of Statistics. CougarStats runs entirely in a browser and is designed for accessibility and ease of use. You can explore the app here: https://www.cougarstats.ca/    ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/cougarstats-a-free-and-open-source-statistics-web-app-for-teaching-and-learning/">CougarStats: a free and open-source Statistics web app for Teaching and Learning</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="http://r-posts.com/cougarstats-a-free-and-open-source-statistics-web-app-for-teaching-and-learning/"> R-posts.com</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
Hello,<br />
<br />
<div class="gmail_default"><span style="font-family: georgia, serif">I’d like to share <i>CougarStats</i>, a free and open-source R Shiny web app I developed to support the teaching and learning of Statistics. CougarStats runs entirely in a browser and is designed for accessibility and ease of use. You can explore the app here:<span style="color: #0000ff"> </span><a href="https://www.cougarstats.ca/" data-saferedirecturl="https://www.google.com/url?q=https://www.cougarstats.ca/&#038;source=gmail&#038;ust=1762130919782000&#038;usg=AOvVaw3JsAA9uJ8ViqpkU-5Kuau8" rel="nofollow" target="_blank"><span style="color: #0000ff">https://www.cougarstats.ca/</span></a> </span></div>
<div class="gmail_default"><span style="font-family: georgia, serif"> </span></div>
<div class="gmail_default"><span style="font-family: georgia, serif">The name <i>CougarStats</i> is inspired by Mount Royal University’s athletics mascot, the cougar, symbolizing strength and agility, and by the app’s focus on statistics. </span></div>
<div class="gmail_default"><span style="font-family: georgia, serif"> </span></div>
<div class="gmail_default"><span style="font-family: georgia, serif"><b>Key features of <i>CougarStats</i></b></span></div>
<div class="gmail_default"><span style="font-family: georgia, serif"> </span></div>
<div class="gmail_default">
<ul>
	<li><span style="font-family: georgia, serif"><b>Descriptive Statistics:</b> </span><span>Compute measures like mean, median, mode, quartiles, IQR, standard deviation, and identify potential outliers. </span></li>
	<li><span style="font-family: georgia, serif"><b>Data Visualization:</b> Construct Boxplots, Histograms, and Scatterplots. </span></li>
	<li><span style="font-family: georgia, serif"><b>Probability:</b> Calculate marginal, joint, union, and conditional probability for contingency tables; exact and cumulative probabilities for Binomial, Poisson, Negative Binomial and Hypergeometric distributions; and cumulative probabilities for the Normal distribution. </span></li>
	<li><span style="font-family: georgia, serif"><b>Sample Size Estimation:</b> Determine the required sample sizes for various scenarios. </span></li>
	<li><span style="font-family: georgia, serif"><b>Statistical Inference:</b> Construct confidence intervals, conduct hypothesis tests for one- and two-samples (mean, proportion and standard deviation). </span></li>
	<li><span style="font-family: georgia, serif"><b>ANOVA:</b> Perform one-way Analysis of Variance with an option to conduct Bonferroni post hoc tests. </span></li>
	<li><span style="font-family: georgia, serif"><b>Regression and Correlation:</b> Fit simple linear regression models and compute Pearson correlation coefficient, multiple linear regression, logistic regression. </span></li>
	<li><span style="font-family: georgia, serif"><b>Categorical Data Analysis:</b> Perform Chi-Square test of independence with and without Yates’ continuity correction, Fisher’s exact test. </span></li>
	<li><span style="font-family: georgia, serif"><b>Nonparametric Tests:</b> Perform the Mann-Whitney <i>U</i> Test, Kruskal-Wallis test etc.</span></li>
</ul>
</div>
<div class="gmail_default"><span style="font-family: georgia, serif"> </span></div>
<div class="gmail_default"><span style="font-family: georgia, serif">I would be delighted if you could explore <i>CougarStats</i> and share it with your students and colleagues who might find it useful.</span></div>
<div class="gmail_default"><span style="font-family: georgia, serif"> </span></div>
<div class="gmail_default"><span style="font-family: georgia, serif">Thank you for your time, and I look forward to hearing your thoughts.</span></div>
<div class="gmail_default"><span style="font-family: georgia, serif"> </span></div>
<div class="gmail_default"><span style="font-family: georgia, serif">Sincerely, </span></div>
<div class="gmail_default"><span style="font-family: georgia, serif"><span style="font-family: georgia, serif">Ashok<br />
<br />
—<br />
</span></span>
<div><span style="font-family: georgia, serif">Ashok Krishnamurthy, PhD<br />
</span></div>
<div><span style="font-family: georgia, serif">Associate Professor</span></div>
<div><span style="font-family: georgia, serif">Department of Mathematics and Computing</span></div>
<div><span style="font-family: georgia, serif">Mount Royal University</span></div>
<div><span style="font-family: georgia, serif">4825 Mount Royal Gate SW</span></div>
<div><span style="font-family: georgia, serif">Calgary, AB, T3E 6K6 Canada</span></div>
<div></div>
<div><span style="color: #0000ff"><a href="mailto:akrishnamurthy@mtroyal.ca" rel="nofollow" style="color: #0000ff" target="_blank"><span style="font-family: georgia, serif">akrishnamurthy@mtroyal.ca</span></a></span></div>
</div><hr style="border-top: black solid 1px" /><a href="http://r-posts.com/cougarstats-a-free-and-open-source-statistics-web-app-for-teaching-and-learning/" rel="nofollow" target="_blank">CougarStats: a free and open-source Statistics web app for Teaching and Learning</a> was first posted on April 29, 2026 at 6:18 am.<br />
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="http://r-posts.com/cougarstats-a-free-and-open-source-statistics-web-app-for-teaching-and-learning/"> R-posts.com</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/cougarstats-a-free-and-open-source-statistics-web-app-for-teaching-and-learning/">CougarStats: a free and open-source Statistics web app for Teaching and Learning</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400894</post-id>	</item>
		<item>
		<title>grouper: An R package for Optimal Group Assignment</title>
		<link>https://www.r-bloggers.com/2026/04/grouper-an-r-package-for-optimal-group-assignment/</link>
		
		<dc:creator><![CDATA[Vik Gopal]]></dc:creator>
		<pubDate>Wed, 29 Apr 2026 06:18:23 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://r-posts.com/?p=17067</guid>

					<description><![CDATA[<p>Introduction Universities are increasingly using collaborative learning pedagogies, which can benefit learners through deeper understanding of course content and teamwork skills. However, the realisation of these sought-after benefits depends on how educators assign learners to groups. Educators have formulated various mathematical models to perform this assignment. Some have developed ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/grouper-an-r-package-for-optimal-group-assignment/">grouper: An R package for Optimal Group Assignment</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="http://r-posts.com/grouper-an-r-package-for-optimal-group-assignment/"> R-posts.com</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<h2>Introduction</h2>

Universities are increasingly using collaborative learning pedagogies, which can benefit learners through deeper understanding of course content and teamwork skills. However, the realisation of these sought-after benefits depends on how educators assign learners to groups. <br />
<br />
Educators have formulated various mathematical models to perform this assignment. Some have developed models that prioritised maximising students’ project preferences. Others developed a model that prioritised students’ preferences, group sizes and group composition. Yet other models address related, but distinct, problems such as assigning students to elective courses or incorporating staff workload into student-to-project supervisor assignments. <br />
<br />
Whichever approach is used, it is apparent that there is a need for an algorithmic solution for the assignment. This would ease the burden on the instructor, while providing an objective procedure for the assignment. Our contribution is an R package <code>grouper</code> that offers two flexible group allocation strategies.<br />
<h2>Optimisation Models</h2>
<p dir="auto"><code>grouper</code> provides two distinct integer linear programming optimisation models.</p>
<div class="sourceCode" id="cb2">
<pre>library(grouper)
library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)</pre>
</div>
<h3>Preference-Based Assignment</h3>
<p dir="auto">The Preference-Based Assignment (PBA) model allows educators to assign student groups to topics to maximise overall student preferences for those topics. The topics can be viewed as project titles. The model allows for repetitions of each project title. This formulation also allows each project team to comprise multiple sub-groups. This is useful in cases where the project requires teams with different functionality to work together, e.g. where one team works on a front-end while the other develops a back-end model.</p>
<p dir="auto">To execute the optimisation routine, an instructor prepares:</p>
<ol dir="auto">
	<li>A group composition table listing the member students within each self-formed group.</li>
	<li>A preference matrix containing the preference that each self-formed group has for each topic.</li>
	<li>A YAML file defining the remaining parameters of the model.</li>
</ol>
<h3>Examples</h3>

Consider the following simple dataset with 8 students:

<div class="sourceCode" id="cb6">
<pre>pba_gc_ex002
#&gt;   id grouping
#&gt; 1  1        1
#&gt; 2  2        1
#&gt; 3  3        2
#&gt; 4  4        2
#&gt; 5  5        3
#&gt; 6  6        3
#&gt; 7  7        4
#&gt; 8  8        4</pre>
</div>
<p>Each student is in a self-formed group of size 2, indicated via the <code>grouping</code> column. Suppose that, for this set of students, the instructor wishes to assign students into two topics, with each topic having two sub-groups. This requires the preference matrix to have 4 columns – one for each topic-subgroup combination. Remember that the ordering of topics/subtopics in the preference matrix should be:</p>
<p><em>Topic1-Subtopic1, Topic2-Subtopic1, Topic1-Subtopic2, Topic2-Subtopic2</em></p>
<p>Thus there should be 4 rows in the preference matrix – one for each self-formed group.</p>
<div class="sourceCode" id="cb7">
<pre>pba_prefmat_ex002
#&gt;      col1 col2 col3 col4
#&gt; [1,]    4    3    2    1
#&gt; [2,]    3    4    2    1
#&gt; [3,]    1    2    4    3
#&gt; [4,]    1    2    3    4</pre>
</div>
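<p>The column ordering above can be captured in a small helper. The sketch below is illustrative only (the <code>pref_col</code> function is our own, not part of <code>grouper</code>): it maps a (topic, subtopic) pair to its preference-matrix column under the stated ordering, in which the topic index varies fastest.</p>

```r
# Hypothetical helper (not part of grouper): the column index of a
# (topic, subtopic) pair, with the topic index varying fastest.
pref_col <- function(topic, subtopic, n_topics) {
  (subtopic - 1) * n_topics + topic
}

pref_col(1, 1, 2)  # 1: Topic1-Subtopic1
pref_col(2, 1, 2)  # 2: Topic2-Subtopic1
pref_col(1, 2, 2)  # 3: Topic1-Subtopic2
pref_col(2, 2, 2)  # 4: Topic2-Subtopic2
```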
<p>The YAML file for this model contains the following parameters:</p>
<pre>n_topics: 2
B: 2
R: 1
nmin: 2
nmax: 2
rmin: 1
rmax: 1</pre>
<p>B corresponds to the number of sub-topics per topic, while r<sub>min</sub> and r<sub>max</sub> denote the minimum and maximum number of repetitions of each topic. n<sub>min</sub> and n<sub>max</sub> denote the minimum and maximum number of members in each sub-topic group.<br />
<br />
It is possible to assign each self-formed group to its optimal choice of topic-subtopic combination. In our solution, we should see that group 1 is assigned to subtopic 1 of topic 1, group 2 is assigned to sub-topic 1 of topic 2, and so on.</p>
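<p>As a rough feasibility check (this is our reading of the parameters, not a <code>grouper</code> function), these settings bound the total number of students the model can seat:</p>

```r
# Rough capacity bounds implied by the YAML parameters above (our own
# interpretation, not a grouper API): each of the n_topics topics is
# repeated between rmin and rmax times, each repetition has B sub-groups,
# and each sub-group seats between nmin and nmax students.
params <- list(n_topics = 2, B = 2, rmin = 1, rmax = 1, nmin = 2, nmax = 2)
min_seats <- with(params, n_topics * rmin * B * nmin)
max_seats <- with(params, n_topics * rmax * B * nmax)
c(min_seats, max_seats)  # 8 8, matching the 8 students in the example
```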
<div class="sourceCode" id="cb8">
<pre>df_ex002_list &lt;- extract_student_info(pba_gc_ex002, &quot;preference&quot;, 
                                     self_formed_groups = 2, 
                                     pref_mat = pba_prefmat_ex002)
yaml_ex002_list &lt;- extract_params_yaml(system.file(&quot;extdata&quot;, 
                                         &quot;pba_params_ex002.yml&quot;,  
                                          package = &quot;grouper&quot;),
                                      &quot;preference&quot;)
m2 &lt;- prepare_model(df_ex002_list, yaml_ex002_list, &quot;preference&quot;)
result2 &lt;- solve_model(m2, with_ROI(solver=&quot;glpk&quot;))

assign_groups(result2, assignment = &quot;preference&quot;, 
              dframe=pba_gc_ex002, yaml_ex002_list, 
              group_names=&quot;grouping&quot;)
#&gt;   topic2 subtopic rep group size
#&gt; 1      1        1   1     1    2
#&gt; 2      2        1   1     2    2
#&gt; 3      1        2   1     3    2
#&gt; 4      2        2   1     4    2</pre>
</div>
<h3>Diversity-Based Assignment</h3>
<p dir="auto">The Diversity-Based Assignment (DBA) model enables educators to assign students to groups and topics with the dual, but weighted, aims of maximising diversity (based on student attributes) within groups and balancing specific skill levels across different groups.</p>
<p dir="auto">To execute the DBA optimisation routine, the instructor prepares:</p>
<ol dir="auto">
	<li>A group composition table containing:
<ol dir="auto">
	<li>the member students within each self-formed group,</li>
	<li>the demographics that will be used to compute pairwise dissimilarity between students, and</li>
	<li>a numeric measure of each student’s skill.</li>
</ol>
</li>
	<li>A YAML file defining the remaining parameters of the model.</li>
</ol>
<h4>Examples</h4>

Consider the following dataset, which comes with the package. There are 4 students in total.
<div class="sourceCode" id="cb2">
<pre>dba_gc_ex001
#&gt;   id major skill groups
#&gt; 1  1     A     1      1
#&gt; 2  2     A     1      2
#&gt; 3  3     B     3      3
#&gt; 4  4     B     3      4</pre>
</div>
<p>It is intuitive that an assignment into two groups of size two, based on the diversity of majors alone, should place one student from major A and one student from major B in each group.</p>
<p>The corresponding YAML <code>dba_gc_ex001.yml</code> file for this exercise consists of the following lines:</p>
<pre>n_topics:  2
R:  1
nmin: 2
nmax: 2
rmin: 1
rmax: 1</pre>
<p>To run the assignment, we can use the following commands. Either the Gurobi solver or the glpk solver can be used for this example; both solve it equally fast.</p>
<div class="sourceCode" id="cb4">
<pre># Indicate appropriate columns using integer ids.
df_ex001_list &lt;- extract_student_info(dba_gc_ex001, &quot;diversity&quot;,
                                      demographic_cols = 2, 
                                      skills = 3, 
                                      self_formed_groups = 4)

yaml_ex001_list &lt;- extract_params_yaml(system.file(&quot;extdata&quot;, 
                                         &quot;dba_params_ex001.yml&quot;,  
                                         package = &quot;grouper&quot;),
                                       &quot;diversity&quot;)
m1 &lt;- prepare_model(df_ex001_list, yaml_ex001_list,
                    assignment=&quot;diversity&quot;,w1=0.5, w2=0.5)

result3 &lt;- solve_model(m1, with_ROI(solver=&quot;glpk&quot;))
assign_groups(result3, assignment = &quot;diversity&quot;, 
              dframe=dba_gc_ex001, 
              group_names=&quot;groups&quot;)
#&gt;   topic rep group id major skill
#&gt; 1     1   1     2  2     A     1
#&gt; 2     1   1     3  3     B     3
#&gt; 3     2   1     1  1     A     1
#&gt; 4     2   1     4  4     B     3</pre>
</div>
<p>We can see that students 2 and 3 have been assigned to topic 1, repetition 1. Students 1 and 4 have been assigned to topic 2, repetition 1. w<sub>1</sub> and w<sub>2</sub> both have weights 0.5, which means the skills and demographic inputs are given equal weight in the optimisation.<br />
<br />
At present, the routines use the <code>daisy</code> function from the <code>cluster</code> package to compute a pairwise dissimilarity matrix between students. However, it is also possible to supply your own custom dissimilarity matrix. Consider the following dataset of 4 students:</p>
<pre>dba_gc_ex003
#&gt;   year   major self_groups id
#&gt; 1    1    math           1  1
#&gt; 2    2 history           2  2
#&gt; 3    3    dsds           3  3
#&gt; 4    4    elts           4  4</pre>
<p>Now consider a situation where we wish to consider years 1 and 2 different from years 3 and 4, and <code>math</code> and <code>dsds</code> (STEM majors) to be different from <code>elts</code> and <code>history</code> (non-STEM majors). For each difference, we assign a score of 1. This means that students 1 and 2 would have a dissimilarity score of 1 due to their difference in majors. Students 1 and 3 would also have a score of 1, but due to their difference in years. Students 1 and 4 would have a score of 2, due to their differences in majors and in years. The overall dissimilarity matrix would be:</p>
<div class="sourceCode" id="cb7">
<pre>d_mat &lt;- matrix(c(0, 1, 1, 2,
                  1, 0, 2, 1,
                  1, 2, 0, 1,
                  2, 1, 1, 0), nrow=4, byrow = TRUE)</pre>
</div>
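<p>The same matrix can be built programmatically from the two binary attributes instead of typed by hand. This base-R sketch applies the scoring rule just described (one point per differing attribute):</p>

```r
# Base-R sketch of the scoring rule above: one point for a differing
# year band (years 1-2 vs 3-4) plus one point for STEM vs non-STEM major.
year_band <- c(1, 1, 2, 2)                 # students 1 and 2 vs 3 and 4
stem      <- c(TRUE, FALSE, TRUE, FALSE)   # math, history, dsds, elts
d_mat2 <- outer(year_band, year_band, "!=") + outer(stem, stem, "!=")
d_mat2  # equal to the d_mat entered manually above
```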
<p>To run the optimisation for this model, we can execute the following code:</p>
<div class="sourceCode" id="cb8">
<pre>df_ex003_list &lt;- extract_student_info(dba_gc_ex003, &quot;diversity&quot;,
                                       skills = NULL,
                                       self_formed_groups = 3,
                                       d_mat=d_mat)
yaml_ex003_list &lt;- extract_params_yaml(system.file(&quot;extdata&quot;,   
                                         &quot;dba_params_ex003.yml&quot;,
                                         package = &quot;grouper&quot;), 
                                       &quot;diversity&quot;)
m3 &lt;- prepare_model(df_ex003_list, yaml_ex003_list, w1=1.0, w2=0.0)
result &lt;- solve_model(m3, with_ROI(solver=&quot;glpk&quot;))

assign_groups(result, &quot;diversity&quot;, dba_gc_ex003,
              group_names=&quot;self_groups&quot;)
#&gt;   topic rep group year   major id
#&gt; 1     1   1     1    1    math  1
#&gt; 2     1   1     4    4    elts  4
#&gt; 3     2   1     2    2 history  2
#&gt; 4     2   1     3    3    dsds  3</pre>
</div>
<p>As you can see, within each group the two members are maximally dissimilar – they differ in terms of their year, <em>and</em> in terms of their major. Notice that we specified</p>
<pre>skills = NULL</pre>
<p>and</p>
<pre>w2 = 0.0</pre>
<p>These settings ensure that no skill columns are taken into account in this optimisation.</p>
<h3>Gurobi Optimiser</h3>

While the routines above use the glpk optimiser, we recommend using the Gurobi optimiser. The latter is commercial software that runs to completion much faster than glpk. For more information, please refer to <a href="https://www.gurobi.com/solutions/gurobi-optimizer/" rel="nofollow" target="_blank">this website</a>. Note that <a href="https://www.gurobi.com/academia/academic-program-and-licenses/" rel="nofollow" target="_blank">academic licenses</a> are available from Gurobi.<br />
<h3>Shiny Applications</h3>

The package provides numerous options for each of the two optimisation models. It also includes two shiny applications, which may be useful if one only needs a straightforward group assignment. <br />
<br />
To run the DBA shiny app, the following code will suffice:

<pre>library(shiny)
runApp(appDir=system.file(&quot;shiny&quot;, &quot;dbaWebApp&quot;, package=&quot;grouper&quot;))

# Analogous code for PBA app:
# runApp(appDir=system.file(&quot;shiny&quot;, &quot;pbaWebApp&quot;, package=&quot;grouper&quot;))</pre>
<p>Here is a screenshot of the diversity-based shiny application. <br />
<br />
<img loading="lazy" fetchpriority="high" decoding="async" src="https://i1.wp.com/r-posts.com/wp-content/uploads/2025/10/dba_screenshot-450x216.png?resize=450%2C216" alt="" width="450" height="216" class="aligncenter size-large wp-image-17093" srcset_temp="https://i1.wp.com/r-posts.com/wp-content/uploads/2025/10/dba_screenshot-450x216.png?resize=450%2C216 450w, http://r-posts.com/wp-content/uploads/2025/10/dba_screenshot-768x368.png 768w, http://r-posts.com/wp-content/uploads/2025/10/dba_screenshot-1536x737.png 1536w, http://r-posts.com/wp-content/uploads/2025/10/dba_screenshot.png 1868w" sizes="(max-width: 450px) 100vw, 450px" data-recalc-dims="1" /><br />
<br />
The system folders with the shiny apps also contain example csv files for use with the apps.</p>
<h3>More Details</h3>

The two optimisation models are flexibly parametrised. Here are some of the features:<br />
<ul>
	<li>Define the number of repetitions for each topic.</li>
	<li>Define the max. and min. number of group members for each topic.</li>
</ul>

The vignettes also contain the precise mathematical formulation of the optimisation models. For full details, please refer to these links:<br />
<ul>
	<li><a href="http://cran.r-project.org/package=grouper" rel="nofollow" target="_blank">CRAN page</a></li>
	<li><a href="https://github.com/singator/grouper" rel="nofollow" target="_blank">Github repository</a></li>
</ul>
</ul><hr style="border-top: black solid 1px" /><a href="http://r-posts.com/grouper-an-r-package-for-optimal-group-assignment/" rel="nofollow" target="_blank">grouper: An R package for Optimal Group Assignment</a> was first posted on April 29, 2026 at 6:18 am.<br />
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="http://r-posts.com/grouper-an-r-package-for-optimal-group-assignment/"> R-posts.com</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/grouper-an-r-package-for-optimal-group-assignment/">grouper: An R package for Optimal Group Assignment</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400896</post-id>	</item>
		<item>
		<title>Understanding R’s `describe()` Function: A Complete Guide to Summary Statistics</title>
		<link>https://www.r-bloggers.com/2026/04/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/</link>
		
		<dc:creator><![CDATA[Nick Han]]></dc:creator>
		<pubDate>Wed, 29 Apr 2026 06:09:11 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://r-posts.com/?p=14802</guid>

					<description><![CDATA[<p>Understanding R’s describe() Function: A Complete Guide to Summary Statistics Table of Contents Introduction to describe() Breaking Down the Output Columns Key Statistics and Their Interpretation Practical Examples When to Use Which Statistic Extending the Functionality Conclusion Introduction to describe() The describe() function from R’s psych package (Revelle, 2023) ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/">Understanding R’s `describe()` Function: A Complete Guide to Summary Statistics</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="http://r-posts.com/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/"> R-posts.com</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<span style="font-size: 35px;font-weight: bold">Understanding R’s </span><code>describe()</code><span style="font-size: 35px;font-weight: bold"> Function: A Complete Guide to Summary Statistics</span><br />
<div class="toc">
<h2>Table of Contents</h2>
<ol>
	<li><a href="http://r-posts.com/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/#introduction-to-describe" rel="nofollow" target="_blank">Introduction to <code>describe()</code></a></li>
	<li><a href="http://r-posts.com/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/#breaking-down-the-output-columns" rel="nofollow" target="_blank">Breaking Down the Output Columns</a></li>
	<li><a href="http://r-posts.com/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/#key-statistics-and-their-interpretation" rel="nofollow" target="_blank">Key Statistics and Their Interpretation</a></li>
	<li><a href="http://r-posts.com/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/#practical-examples" rel="nofollow" target="_blank">Practical Examples</a></li>
	<li><a href="http://r-posts.com/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/#when-to-use-which-statistic" rel="nofollow" target="_blank">When to Use Which Statistic</a></li>
	<li><a href="http://r-posts.com/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/#extending-the-functionality" rel="nofollow" target="_blank">Extending the Functionality</a></li>
	<li><a href="http://r-posts.com/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/#conclusion" rel="nofollow" target="_blank">Conclusion</a></li>
</ol>
</div>
<h2 id="introduction-to-describe">Introduction to <code>describe()</code></h2>
<p>The <code>describe()</code> function from R’s <code>psych</code> package (Revelle, 2023) provides a comprehensive statistical summary of your dataset. Unlike R’s base <code>summary()</code> function, it includes additional metrics that are particularly useful for data exploration and assumption checking.</p>
<pre>library(psych)
describe(your_data)</pre>
<h2 id="breaking-down-the-output-columns">Breaking Down the Output Columns</h2>
<p>Here’s what each column in the output represents:</p>
<table>
<thead>
<tr>
<th>Column</th>
<th>Description</th>
<th>Formula/Calculation</th>
<th>Ideal Use Case</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>vars</strong></td>
<td>Variable index number</td>
<td>–</td>
<td>Tracking variable order</td>
</tr>
<tr>
<td><strong>n</strong></td>
<td>Complete cases</td>
<td><code>length(na.omit(x))</code></td>
<td>Data completeness check</td>
</tr>
<tr>
<td><strong>mean</strong></td>
<td>Arithmetic average</td>
<td><code>sum(x)/n</code></td>
<td>Normally distributed data</td>
</tr>
<tr>
<td><strong>sd</strong></td>
<td>Standard deviation</td>
<td><code>sqrt(var(x))</code></td>
<td>Measuring spread</td>
</tr>
<tr>
<td><strong>median</strong></td>
<td>50th percentile</td>
<td><code>quantile(x, 0.5)</code></td>
<td>Skewed distributions</td>
</tr>
<tr>
<td><strong>trimmed</strong></td>
<td>Mean after removing extremes</td>
<td><code>mean(x, trim=0.1)</code></td>
<td>Robust central tendency</td>
</tr>
<tr>
<td><strong>mad</strong></td>
<td>Median absolute deviation</td>
<td><code>1.4826*median(abs(x-median(x)))</code></td>
<td>Outlier-resistant spread</td>
</tr>
<tr>
<td><strong>min</strong></td>
<td>Minimum value</td>
<td><code>min(x)</code></td>
<td>Range assessment</td>
</tr>
<tr>
<td><strong>max</strong></td>
<td>Maximum value</td>
<td><code>max(x)</code></td>
<td>Range assessment</td>
</tr>
<tr>
<td><strong>range</strong></td>
<td>Max – Min</td>
<td><code>max(x)-min(x)</code></td>
<td>Total spread</td>
</tr>
<tr>
<td><strong>skew</strong></td>
<td>Distribution asymmetry</td>
<td><code>sum((x-mean(x))³)/(n*sd(x)³)</code></td>
<td>Detecting skew direction</td>
</tr>
<tr>
<td><strong>kurtosis</strong></td>
<td>Tailedness</td>
<td><code>sum((x-mean(x))⁴)/(n*sd(x)⁴)-3</code></td>
<td>Outlier propensity</td>
</tr>
<tr>
<td><strong>se</strong></td>
<td>Standard error</td>
<td><code>sd(x)/sqrt(n)</code></td>
<td>Precision of mean estimate</td>
</tr>
</tbody>
</table>
<h2 id="key-statistics-and-their-interpretation">Key Statistics and Their Interpretation</h2>
<h3>Central Tendency</h3>
<ul>
	<li><strong>Mean vs. Median</strong>: Differences indicate skewness</li>
	<li><strong>Trimmed Mean</strong>: Removes influence of outliers (default drops top/bottom 10%)</li>
</ul>
<h3>Variability</h3>
<ul>
	<li><strong>SD vs. MAD</strong>: Use MAD when outliers are present</li>
	<li><strong>Range</strong>: Simple but outlier-sensitive</li>
</ul>
<h3>Distribution Shape</h3>
<ul>
	<li><strong>Skewness</strong>:

<ul>
	<li>&gt;0: Right-tailed</li>
	<li>&lt;0: Left-tailed</li>
	<li>0: Symmetric</li>
</ul>
</li>
	<li><strong>Kurtosis</strong> (Excess):

<ul>
	<li>&gt;0: Heavy-tailed (more outliers than normal)</li>
	<li>&lt;0: Light-tailed</li>
</ul>
</li>
</ul>
<h2 id="practical-examples">Practical Examples</h2>
<h3>Example 1: MPG from mtcars</h3>
<pre>describe(mtcars$mpg)</pre>
<p>Output Interpretation:</p>
<pre>  vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
1    1 32 20.09 6.03   19.2   19.70 5.41 10.4 33.9  23.5 0.61    -0.37 1.07</pre>
<ul>
	<li><strong>Right-skewed</strong> (mean > median, positive skew)</li>
	<li><strong>Light-tailed</strong> (negative kurtosis)</li>
	<li><strong>SD (6.03) > MAD (5.41)</strong>: Suggests some outlier influence</li>
</ul>
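<p>These diagnostics can be cross-checked in base R using the built-in <code>mtcars</code> data and the formulas from the table above (the simple skew formula differs slightly from <code>psych</code>’s bias-corrected estimate, but the sign agrees):</p>

```r
# Cross-check the reported diagnostics with base R only.
x <- mtcars$mpg
mean(x) > median(x)      # TRUE: the mean sits right of the median
skew <- sum((x - mean(x))^3) / (length(x) * sd(x)^3)
skew > 0                 # TRUE: positive, consistent with the reported 0.61
sd(x) / sqrt(length(x))  # about 1.07, the reported standard error
```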
<h2 id="when-to-use-which-statistic">When to Use Which Statistic</h2>
<table>
<thead>
<tr>
<th>Scenario</th>
<th>Recommended Statistics</th>
</tr>
</thead>
<tbody>
<tr>
<td>Normal Distribution</td>
<td>Mean, SD</td>
</tr>
<tr>
<td>Skewed Data</td>
<td>Median, IQR, MAD</td>
</tr>
<tr>
<td>Outlier Detection</td>
<td>MAD, trimmed mean, kurtosis</td>
</tr>
<tr>
<td>Parametric Testing</td>
<td>Mean, SE</td>
</tr>
<tr>
<td>Nonparametric Analysis</td>
<td>Median, IQR</td>
</tr>
</tbody>
</table>
<h2 id="extending-the-functionality">Extending the Functionality</h2>
<h3>Adding IQR</h3>
<p>The default <code>describe()</code> doesn’t show IQR, but you can add it:</p>
<pre>library(dplyr)
describe(mtcars) %&gt;% 
  mutate(IQR = apply(mtcars, 2, IQR, na.rm = TRUE))</pre>
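<p>If you prefer to avoid the <code>dplyr</code> dependency, the same per-column IQR can be computed with base R alone:</p>

```r
# Base-R equivalent of the IQR column added above, for built-in mtcars.
iqr_by_col <- apply(mtcars, 2, IQR, na.rm = TRUE)
iqr_by_col["mpg"]  # 7.375: the 75th minus the 25th percentile of mpg
```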
<h3>Comparing Groups</h3>
<p>Use <code>describeBy()</code> for grouped statistics:</p>
<pre>describeBy(mtcars$mpg, group = mtcars$cyl)</pre>
<h2 id="conclusion">Conclusion</h2>
<p>R’s <code>describe()</code> function provides a powerful starting point for exploratory data analysis. By understanding each statistic it provides, you can:</p>
<ul>
	<li>Detect data quality issues</li>
	<li>Choose appropriate analysis methods</li>
	<li>Understand your variables’ distributions</li>
	<li>Make informed decisions about data transformations</li>
</ul>
<p>For formal reporting, consider supplementing these metrics with visualization and statistical tests.</p>
<blockquote><strong>Pro Tip</strong>: Always visualize your data alongside these statistics – numbers tell part of the story, but plots reveal the full picture!</blockquote>
<p>Happy coding!<br />
<br />
—<br />
<strong>Reference:</strong><br />
<span>Revelle, W. (2023). </span><em>psych: Procedures for Psychological, Psychometric, and Personality Research</em><span>. Northwestern University.</span></p><hr style="border-top: black solid 1px" /><a href="http://r-posts.com/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/" rel="nofollow" target="_blank">Understanding R’s `describe()` Function: A Complete Guide to Summary Statistics</a> was first posted on April 29, 2026 at 6:09 am.<br />
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="http://r-posts.com/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/"> R-posts.com</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/">Understanding R’s `describe()` Function: A Complete Guide to Summary Statistics</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400901</post-id>	</item>
		<item>
		<title>Understanding Statistical Coefficients: From Regression to Variation</title>
		<link>https://www.r-bloggers.com/2026/04/understanding-statistical-coefficients-from-regression-to-variation/</link>
		
		<dc:creator><![CDATA[Nick Han]]></dc:creator>
		<pubDate>Wed, 29 Apr 2026 06:09:04 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://r-posts.com/?p=14800</guid>

					<description><![CDATA[<p>The Data Analyst’s Guide to  Statistical Coefficients Table of Contents What Are Coefficients? Regression Coefficient Coefficient of Determination (R²) Coefficient of Variation (CV) Correlation Coefficient Other Common Coefficients Implementation in R Key Takeaways What Are Coefficients? In statistics and data analysis, coefficients are numerical measures that quantify relationships between ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/understanding-statistical-coefficients-from-regression-to-variation/">Understanding Statistical Coefficients: From Regression to Variation</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="http://r-posts.com/understanding-statistical-coefficients-from-regression-to-variation/"> R-posts.com</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<span style="font-size: 35px;font-weight: bold">The Data Analyst’s Guide to Statistical Coefficients</span><br />
<div class="toc">
<h2>Table of Contents</h2>
<ol>
	<li><a href="http://r-posts.com/understanding-statistical-coefficients-from-regression-to-variation/#what-are-coefficients" rel="nofollow" target="_blank">What Are Coefficients?</a></li>
	<li><a href="http://r-posts.com/understanding-statistical-coefficients-from-regression-to-variation/#regression-coefficient" rel="nofollow" target="_blank">Regression Coefficient</a></li>
	<li><a href="http://r-posts.com/understanding-statistical-coefficients-from-regression-to-variation/#coefficient-of-determination" rel="nofollow" target="_blank">Coefficient of Determination (R²)</a></li>
	<li><a href="http://r-posts.com/understanding-statistical-coefficients-from-regression-to-variation/#coefficient-of-variation" rel="nofollow" target="_blank">Coefficient of Variation (CV)</a></li>
	<li><a href="http://r-posts.com/understanding-statistical-coefficients-from-regression-to-variation/#correlation-coefficient" rel="nofollow" target="_blank">Correlation Coefficient</a></li>
	<li><a href="http://r-posts.com/understanding-statistical-coefficients-from-regression-to-variation/#other-common-coefficients" rel="nofollow" target="_blank">Other Common Coefficients</a></li>
	<li><a href="http://r-posts.com/understanding-statistical-coefficients-from-regression-to-variation/#implementation-in-r" rel="nofollow" target="_blank">Implementation in R</a></li>
	<li><a href="http://r-posts.com/understanding-statistical-coefficients-from-regression-to-variation/#key-takeaways" rel="nofollow" target="_blank">Key Takeaways</a></li>
</ol>
</div>
<h2 id="what-are-coefficients">What Are Coefficients?</h2>
<p>In statistics and data analysis, <strong>coefficients</strong> are numerical measures that quantify relationships between variables or characteristics of data distributions. They serve as fundamental indicators in statistical modeling and data interpretation.</p>
<h2 id="regression-coefficient">1. Regression Coefficient</h2>
<h3>Definition</h3>
<p>The regression coefficient measures the expected change in the dependent variable (Y) for a one-unit change in the independent variable (X).</p>
<h3>Formula</h3>
<p>For the linear model Y = aX + b:</p>
<ul>
	<li>a: Regression coefficient (change in Y per unit change in X)</li>
	<li>b: Intercept</li>
</ul>
<h3>R Implementation</h3>
<pre># Linear regression example
model &lt;- lm(mpg ~ wt, data = mtcars)
summary(model)

# Extract coefficients
coef(model)</pre>
<h3>Interpretation</h3>
<p>A coefficient of about -5.34 for vehicle weight (wt, which mtcars records in units of 1000 lbs) means each additional 1000 lbs of weight reduces mileage by about 5.34 mpg on average.</p>
<h2 id="coefficient-of-determination">2. Coefficient of Determination (R²)</h2>
<h3>Definition</h3>
<p>R-squared represents the proportion of variance in the dependent variable explained by the model (0-1 scale).</p>
<h3>R Code</h3>
<pre># Get R-squared value
summary(model)$r.squared</pre>
<h3>Guidelines</h3>
<ul>
	<li>R² = 0.75 → Model explains 75% of data variation</li>
	<li>Higher values indicate better model fit</li>
</ul>
<h2 id="coefficient-of-variation">3. Coefficient of Variation (CV)</h2>
<h3>Definition</h3>
<p>CV is a standardized measure of dispersion expressed as a percentage of the mean.</p>
<h3>Formula</h3>
<p>CV% = (Standard Deviation / Mean) × 100%</p>
<h3>R Function</h3>
<pre># Calculate CV
cv &lt;- function(x) {
  (sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE)) * 100
}

# Example usage
cv(mtcars$mpg)</pre>
<h3>Interpretation Benchmarks</h3>
<ul>
	<li>CV < 15%: Low variability</li>
	<li>15-30%: Moderate variability</li>
	<li>>30%: High variability</li>
</ul>
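<p>Because CV is unit-free, its main practical use is comparing variability across variables measured on different scales. A short self-contained sketch (the helper repeats the <code>cv()</code> function defined above):</p>
<pre># Unit-free comparison of spread across differently scaled variables
cv &lt;- function(x) 100 * sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

cv(mtcars$mpg)  # fuel economy, in mpg
cv(mtcars$hp)   # horsepower: larger relative spread, despite different units</pre>
<p>Comparing raw standard deviations here would be meaningless, since mpg and horsepower are on entirely different scales.</p>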
<h2 id="correlation-coefficient">4. Correlation Coefficient</h2>
<h3>Definition</h3>
<p>Measures the strength and direction of the linear relationship between two variables (-1 to 1).</p>
<h3>R Implementation</h3>
<pre># Calculate correlation
cor(mtcars$mpg, mtcars$wt)

# Correlation matrix
cor(mtcars[, c(&quot;mpg&quot;, &quot;wt&quot;, &quot;hp&quot;)])</pre>
<h3>Interpretation</h3>
<ul>
	<li>1: Perfect positive correlation</li>
	<li>-1: Perfect negative correlation</li>
	<li>0: No linear correlation</li>
</ul>
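<p>The interpretation above can be paired with a significance test. Base R's <code>cor.test()</code> returns both the estimated correlation and a p-value for the null hypothesis of zero correlation:</p>
<pre># Is the mpg-wt correlation significantly different from zero?
ct &lt;- cor.test(mtcars$mpg, mtcars$wt)
ct$estimate  # about -0.87: a strong negative linear relationship
ct$p.value   # far below 0.05</pre>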
<h2 id="other-common-coefficients">Other Common Coefficients</h2>
<table>
<thead>
<tr>
<th>Coefficient</th>
<th>Description</th>
<th>R Package/Function</th>
</tr>
</thead>
<tbody>
<tr>
<td>Skewness</td>
<td>Measures distribution asymmetry</td>
<td><code>moments::skewness()</code></td>
</tr>
<tr>
<td>Kurtosis</td>
<td>Measures tail heaviness</td>
<td><code>moments::kurtosis()</code></td>
</tr>
<tr>
<td>Concordance</td>
<td>Assesses agreement</td>
<td><code>epiR::epi.ccc()</code></td>
</tr>
</tbody>
</table>
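<p>As a base-R sketch of what the moments-package functions in the table compute (moment-based sample skewness and kurtosis; this mirrors the standard formulas, but treat it as an illustration rather than the package's exact implementation):</p>
<pre># Moment-based skewness and kurtosis (population moments, no bias correction)
skw &lt;- function(x) {
  m &lt;- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}
kurt &lt;- function(x) {
  m &lt;- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}

skw(mtcars$mpg)   # positive: mpg is right-skewed
kurt(mtcars$mpg)  # compare to 3, the kurtosis of a normal distribution</pre>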
<h2 id="implementation-in-r">Implementation in R</h2>
<h3>Comprehensive Analysis</h3>
<pre>library(psych)

# Descriptive statistics (includes multiple coefficients)
describe(mtcars)

# Full regression output
summary(lm(mpg ~ ., data = mtcars))</pre>
<h3>Custom Coefficient Calculations</h3>
<pre># Multi-coefficient function (assumes the cv() helper defined earlier)
data_analysis &lt;- function(x) {
  list(
    mean = mean(x),
    sd = sd(x),
    cv = cv(x),
    skewness = moments::skewness(x),
    kurtosis = moments::kurtosis(x)
  )
}

lapply(mtcars[, 1:4], data_analysis)</pre>
<h3>Visualization</h3>
<pre>library(ggplot2)
ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() + 
  geom_smooth(method = &quot;lm&quot;) +
  labs(title = &quot;MPG vs Weight with Regression Line&quot;,
       x = &quot;Weight (1000 lbs)&quot;,
       y = &quot;Miles per Gallon&quot;)</pre>
<h2 id="key-takeaways">Key Takeaways</h2>
<ol>
	<li><strong>Select coefficients</strong> based on analytical goals:

<ul>
	<li>Variable relationships → Regression/Correlation coefficients</li>
	<li>Model evaluation → R-squared</li>
	<li>Variability comparison → CV</li>
</ul>
</li>
	<li><strong>R advantages</strong>:

<ul>
	<li>Built-in functions for all major coefficients</li>
	<li>Seamless integration of statistical and visual analysis</li>
</ul>
</li>
	<li><strong>Best practices</strong>:

<ul>
	<li>Understand assumptions behind each coefficient</li>
	<li>Combine statistical results with domain knowledge</li>
	<li>Clearly distinguish between different coefficients</li>
</ul>
</li>
	<li><strong>Advanced applications</strong>:

<pre># Robust regression (for outlier-resistant coefficients)
library(MASS)
rlm(mpg ~ wt, data = mtcars)

# Standardized coefficients
library(lm.beta)
lm.beta(model)</pre>
</li>
</ol>
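<p>For the simple regression used throughout this post, the standardized coefficient that <code>lm.beta</code> reports can also be computed by hand, which makes clear what standardization does (a sketch of the idea, not <code>lm.beta</code>'s implementation):</p>
<pre># Standardized slope: rescale the raw coefficient by sd(x)/sd(y),
# i.e. the slope you would get after z-scoring both variables
model &lt;- lm(mpg ~ wt, data = mtcars)
beta_std &lt;- coef(model)[["wt"]] * sd(mtcars$wt) / sd(mtcars$mpg)

# In a one-predictor model the standardized slope equals
# the Pearson correlation between the two variables
beta_std
cor(mtcars$mpg, mtcars$wt)</pre>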
<p>By mastering these statistical coefficients and their R implementations, you’ll be equipped to conduct more rigorous data analysis and communicate results effectively. Remember that coefficients are tools – their proper interpretation always depends on context and research questions.<br />
<br />
Happy coding!</p><hr style="border-top: black solid 1px" /><a href="http://r-posts.com/understanding-statistical-coefficients-from-regression-to-variation/" rel="nofollow" target="_blank">Understanding Statistical Coefficients: From Regression to Variation</a> was first posted on April 29, 2026 at 6:09 am.<br />
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="http://r-posts.com/understanding-statistical-coefficients-from-regression-to-variation/"> R-posts.com</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/understanding-statistical-coefficients-from-regression-to-variation/">Understanding Statistical Coefficients: From Regression to Variation</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400917</post-id>	</item>
		<item>
		<title>Producing Systematic Literature Reviews with Bibliometrix R and Biblioshiny</title>
		<link>https://www.r-bloggers.com/2026/04/producing-systematic-literature-reviews-with-bibliometrix-r-and-biblioshiny/</link>
		
		<dc:creator><![CDATA[Luca D'Aniello]]></dc:creator>
		<pubDate>Wed, 29 Apr 2026 06:08:27 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://r-posts.com/?p=14686</guid>

					<description><![CDATA[<p>Summer School in Science Mapping (SSSM) 2025 – I International Edition Title: Producing Systematic Literature Reviews with Bibliometrix R and Biblioshiny Date &#038; Location: June 9-13, 2025 – Naples, ITA We are pleased to announce the upcoming Summer School in Science Mapping (SSSM) 2025 – I International Edition, an intensive training program focused on conducting … Continue reading ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/producing-systematic-literature-reviews-with-bibliometrix-r-and-biblioshiny/">Producing Systematic Literature Reviews with Bibliometrix R and Biblioshiny</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="http://r-posts.com/producing-systematic-literature-reviews-with-bibliometrix-r-and-biblioshiny/"> R-posts.com</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<h1><span style="font-size: 18pt">Summer School in Science Mapping (SSSM) 2025 – I International Edition</span></h1>
<p><strong><em><span>Title</span></em><span>:</span></strong><span> Producing Systematic Literature Reviews with Bibliometrix R and Biblioshiny<br />
<strong><em>Date &#038; Location</em>:</strong> June 9-13, 2025 – Naples, ITA</span></p>
<p><span>We are pleased to announce the upcoming Summer School in Science Mapping (SSSM) 2025 – I International Edition, an intensive training program focused on conducting Systematic Literature Reviews using the Bibliometrix R package and its Shiny app, Biblioshiny.</span></p>
<p><span>Organized by the academic spin-off K-Synth in collaboration with the Department of Economics and Statistics at the University of Naples Federico II, the school will be held in Naples, Italy, from June 9 to June 13, 2025.</span></p>
<h2><span style="font-size: 14pt">Aim and Scope</span></h2>
<p><span>The SSSM 2025 is an intensive training program tailored for early-career researchers and academics seeking to enhance their expertise in bibliometric methods and scientific mapping. By integrating theoretical foundations with practical sessions, the school equips participants with robust skills in citation analysis, co-citation techniques, science mapping, and reproducible workflows for scholarly evaluation. Designed as both a learning and networking opportunity, SSSM 2025 fosters methodological development and international collaboration in a dynamic, research-oriented environment.</span></p>
<p><span>The school’s content covers:</span></p>
<p><span>– Overview of bibliometric theory and methods<br />
– Query design and data retrieval from major scientific databases<br />
– Descriptive, relational, and structural bibliometric analyses in R<br />
– Practical training in Bibliometrix R package and Biblioshiny app<br />
– Applications to real-world research review cases</span></p>
<h2><span style="font-size: 14pt">Lecturers and Guest Speakers</span></h2>
<p><span>The school will be led by Professors Massimo Aria and Corrado Cuccurullo, the developers of Bibliometrix and Biblioshiny.</span></p>
<p><span>Additionally, the 2025 edition will feature the following keynotes by distinguished international scholars in the field of scientometrics:</span></p>
<p><span>– Nicolas Robinson-Garcia (University of Granada), Scientific Director of the Computational Social Sciences and Humanities Unit (U-CHASS)<br />
– Manuel Jesús Cobo Martín (University of Cádiz), lead developer of the SciMAT software<br />
– Nicola De Bellis (University of Modena and Reggio Emilia), Coordinator of the Bibliometric Office and author of influential studies in the evaluation of scientific research</span></p>
<h2><span style="font-size: 14pt">Target Audience and Prerequisites</span></h2>
<p><span>This Summer School is designed for PhD students, postdoctoral researchers, and academics affiliated with universities or research institutions. Participants are expected to have a basic knowledge of R programming and be familiar with RStudio. </span></p>
<h2><span style="font-size: 14pt">Registration and Fees</span></h2>
<p><span>Registration is open on the official Bibliometrix website (check the Summer School section):</span></p>
<p><span><a href="https://www.bibliometrix.org/sssm/" rel="nofollow" target="_blank">https://www.bibliometrix.org/sssm/</a> </span></p>
<p><span>For any inquiries, feel free to contact the organizing committee at: <a href="mailto:sssm@bibliometrix.com" rel="nofollow" target="_blank">sssm@bibliometrix.com</a> </span></p><hr style="border-top: black solid 1px" /><a href="http://r-posts.com/producing-systematic-literature-reviews-with-bibliometrix-r-and-biblioshiny/" rel="nofollow" target="_blank">Producing Systematic Literature Reviews with Bibliometrix R and Biblioshiny</a> was first posted on April 29, 2026 at 6:08 am.<br />
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="http://r-posts.com/producing-systematic-literature-reviews-with-bibliometrix-r-and-biblioshiny/"> R-posts.com</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/producing-systematic-literature-reviews-with-bibliometrix-r-and-biblioshiny/">Producing Systematic Literature Reviews with Bibliometrix R and Biblioshiny</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400919</post-id>	</item>
	</channel>
</rss>
