<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>R-bloggers</title>
	<atom:link href="https://www.r-bloggers.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.r-bloggers.com</link>
	<description>R news and tutorials contributed by hundreds of R bloggers</description>
	<lastBuildDate>Tue, 12 May 2026 13:00:00 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=5.5.18</generator>

<image>
	<url>https://i0.wp.com/www.r-bloggers.com/wp-content/uploads/2016/08/cropped-R_single_01-200.png?fit=32%2C32&#038;ssl=1</url>
	<title>R-bloggers</title>
	<link>https://www.r-bloggers.com</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">11524731</site>	<item>
		<title>Durations of wars by @ellis2013nz</title>
		<link>https://www.r-bloggers.com/2026/05/durations-of-wars-by-ellis2013nz/</link>
		
		<dc:creator><![CDATA[free range statistics - R]]></dc:creator>
		<pubDate>Tue, 12 May 2026 13:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://freerangestats.info/blog/2026/05/13/war-durations</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> How long do wars last, on average? If a war such as that currently under way in Iran has lasted 74 days so far, how long do we expect it to last in total? For all sorts of reasons, inquiring minds are interested. Luckily there are some very well curate...</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/durations-of-wars-by-ellis2013nz/">Durations of wars by @ellis2013nz</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://freerangestats.info/blog/2026/05/13/war-durations"> free range statistics - R</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p>How long do wars last, on average? If a war such as that currently under way in Iran has lasted 74 days so far, how long do we expect it to last in total? For all sorts of reasons, inquiring minds are interested. Luckily there are some very well curated datasets out there, including the <a href="https://correlatesofwar.org/data-sets/cow-war/" rel="nofollow" target="_blank">Correlates of War</a>, that make it easy to answer these questions.</p>

<p>One caveat applies to all of this: I am not a military historian, just an interested amateur. I’m very open to having mistakes of interpretation or method pointed out to me.</p>

<h2 id="distribution-of-wars-durations">Distribution of wars’ durations</h2>

<p>The Correlates of War data lets us see, for example, that this is the distribution (on a logarithmic scale) of durations of wars post-Napoleon:</p>

<object type="image/svg+xml" data="https://freerangestats.info/img/0321-density.svg" width="450"><img src="https://i1.wp.com/freerangestats.info/img/0321-density.png?w=450&#038;ssl=1" data-recalc-dims="1" /></object>

<p>You can see I’ve compared this to a log-normal distribution and found that it doesn’t have quite such fat tails. But that’s OK; I’m not too worried about the precise shape, because later on I’ll be using pretty straightforward empirical methods.</p>

<p>This data covers only inter-state wars, in contrast to intra-state wars (eg civil wars) and extra-state wars (eg against external non-state actors). As I’m interested in a reference population to compare the current USA-Israel-Iran war to, it’s the inter-state population I want.</p>

<p>The median length of a war is 139 days and the mean is 408 days.</p>

<p>The four-day war in the dataset is the so-called “<a href="https://en.wikipedia.org/wiki/Football_War" rel="nofollow" target="_blank">Football War</a>” of 1969 between Honduras and El Salvador. The 3,734-day war was the much better-known “Vietnam War Phase II”, involving the USA, Australia, Vietnam, Cambodia and others.</p>

<p>Here’s the code to import the data from the Correlates of War project and draw that first density plot:</p>

<figure class="highlight"><pre>library(tidyverse)
library(lubridate)
library(janitor)
library(glue)
library(ggrepel)
library(scales)

# https://correlatesofwar.org/data-sets/cow-war/


#----- import interstate war data----------------------

interstate &lt;- read_csv(&quot;https://correlatesofwar.org/wp-content/uploads/Inter-StateWarData_v4.0.csv&quot;) |&gt; 
  clean_names() |&gt; 
  mutate(start_date = as.Date(sprintf(&quot;%04d-%02d-%02d&quot;, start_year1, start_month1, start_day1)),
         end_date = as.Date(sprintf(&quot;%04d-%02d-%02d&quot;, end_year1, end_month1, end_day1)))

interstate_wars &lt;- interstate |&gt; 
  group_by(war_num, war_name) |&gt; 
  summarise(earliest_start= min(start_date),
            latest_end = max(end_date),
            bat_death = sum(bat_death)) |&gt; 
  mutate(duration = as.numeric(latest_end - earliest_start),
         start_year = year(earliest_start)) |&gt; 
  ungroup()

# what years covered? 1823 to 2003 at time of writing
range(interstate_wars$start_year)

#==========================plots=================
 
simple_caption &lt;- &quot;Source: Correlates of War, Inter-State War Data; analysis by freerangestats.info&quot;

#-----------------distribution of duration------------
summary(interstate_wars$duration)

sim_norm &lt;- data.frame(duration = 10 ^ (rnorm(1e6, 
                                        mean = mean(log10(interstate_wars$duration)), 
                                        sd = sd(log10(interstate_wars$duration)))))

interstate_wars |&gt; 
  ggplot(aes(x = duration)) +
  geom_density() +
  geom_rug() +
  geom_density(data = sim_norm, colour = &quot;orange&quot;) +
  annotate(&quot;text&quot;, x= 1, y = 0.18, label = &quot;Simulated log-normal distribution&quot;, 
           colour = &quot;orange&quot;, hjust = 0) +
  annotate(&quot;text&quot;, x= 300, y = 0.51, label = &quot;Empirical distribution of war durations&quot;, 
           colour = &quot;black&quot;, hjust = 0) +
  # carefully chosen labels for x axis:
  scale_x_log10(label = comma, breaks = c(range(interstate_wars$duration), 10, 100, 1000)) +
  labs(x = &quot;Duration of wars (in days, logarithmic scale)&quot;,
       y = &quot;Density&quot;,
       title = &quot;Distribution of war durations, 1823 to 2003&quot;,
       subtitle = &quot;More concentrated, less-fat tails than a log-normal distribution&quot;,
       caption = simple_caption) +
  # use coord to limit x axis so statistical calculations are all done on full data:
  coord_cartesian(xlim = c(1, 8000))</pre></figure>

<p>OK, so my main analytical task here is to work out the conditional expected duration of a war that has reached 74 days &#8211; the length so far of the USA-Israel-Iran war. Yes, I know there’s an incompletely observed ceasefire, but there’s also a blockade (or two), and that’s unambiguously an act of war under international law. So I’m counting the war as ongoing.</p>

<p>My chart to answer this question is this one:</p>

<object type="image/svg+xml" data="https://freerangestats.info/img/0321-cumulative-distribution.svg" width="450"><img src="https://i1.wp.com/freerangestats.info/img/0321-cumulative-distribution.png?w=450&#038;ssl=1" data-recalc-dims="1" /></object>

<p>What’s happening here is:</p>

<ul>
  <li>the empirical cumulative distribution function of durations is the dark line &#8211; basically the cumulative frequency on the vertical axis, but expressed as a proportion.</li>
  <li>the grey line is a simple LOESS smoother of that cumulative frequency, useful for modelling values that aren’t exactly matched in the data.</li>
  <li>the red lines show the duration of the current war, and where it would fit in the distribution of 1823 to 2003 wars. It’s about 0.33 (defined in the code below as the variable <code>current_cf</code>), meaning that the current war is already longer than about 33% of wars.</li>
  <li>the horizontal blue line sits half way, in the vertical space, between the horizontal red line and 1. The point where it meets the smoothed line, marked by the vertical blue line dropping to the x axis, shows the expected median duration of a war that has already reached this 0.33 point on the cumulative frequency.</li>
</ul>

<p>So we see that, of wars that last as long as 74 days, we expect the median total length to be 261 days. That’s a bit grim for those of us who think that even extending into June is going to be very bad indeed for the world economy, but it’s good to know. Of course, there are plenty of wars that get to 74 days and then stop soon after, so there’s hope there too.</p>
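<p>As a rough cross-check on the idea (though not on the exact numbers), the conditional median can also be approximated without any smoothing, by taking the empirical median of total durations among wars that lasted at least 74 days. Here is a minimal sketch with made-up durations, not the Correlates of War data:</p>

```r
# Hypothetical durations in days, purely for illustration
durations <- c(4, 30, 74, 100, 139, 200, 261, 408, 1000, 3734)

current_dur <- 74

# Empirical conditional median: the median total duration among
# wars that lasted at least current_dur days
conditional_median <- median(durations[durations >= current_dur])
conditional_median
```

<p>The LOESS-based approach below smooths this same quantity, which helps when the conditioning value falls between durations actually observed in the data.</p>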

<p>Here’s the code to do that bit of statistical inference and draw the chart:</p>

<figure class="highlight"><pre>#-------------------cumulative distribution--------------
interstate_cumulative &lt;- interstate_wars |&gt; 
  arrange(duration) |&gt; 
  mutate(cumulative_freq = 1:n() / n()) 

# smoothed model of the cumulative distribution, including estimates of where
# the Iran war is on it:
model &lt;- loess(cumulative_freq ~ log(duration), data = interstate_cumulative)
current_dur &lt;- 74 # as at 13 May 2026 - war started 28 February 2026
current_cf &lt;- predict(model, newdata = data.frame(duration = current_dur))

# inverse model to estimate duration given a cumulative frequency, useful for
# annotations on the chart:
inv_model &lt;- loess(duration ~ x, 
                   data = data.frame(duration = interstate_cumulative$duration, 
                                     x = fitted(model)))

# of wars that last this long, what is the median cumulative frequency (i.e. half-way to 1):
conditional_median_freq &lt;- (1 + current_cf) / 2
# of wars with that median cumulative frequency, convert it back into a duration,
conditional_median_dur &lt;- predict(inv_model, data.frame(x = conditional_median_freq))

# Draw chart of cumulative distribution:
interstate_cumulative |&gt; 
  ggplot(aes(x = duration, y = cumulative_freq)) +
  geom_smooth(method = &quot;loess&quot;, colour = &quot;grey80&quot;) +
  geom_line() +
  # note that (a bit oddly) the scale transform has to be applied manually in geom_segment here:
  geom_segment(x = log10(current_dur), xend = log10(current_dur), y = -Inf, yend = current_cf, colour = &quot;red&quot;) +
  geom_segment(x = 0, xend = log10(current_dur), y = current_cf, yend = current_cf, colour = &quot;red&quot;) +
  geom_segment(x = log10(conditional_median_dur), xend = log10(conditional_median_dur), y = -Inf, yend = conditional_median_freq, colour = &quot;blue&quot;) +
  geom_segment(x = 0, xend = log10(conditional_median_dur), y = conditional_median_freq, yend = conditional_median_freq, colour = &quot;blue&quot;) +
  
  annotate(&quot;text&quot;, x = current_dur * 0.95, y = 0.39, label = &quot;Current Iran war&quot;, colour = &quot;red&quot;, hjust = 1) +
  annotate(&quot;text&quot;, x = conditional_median_dur * 1.05, y = 0.62, colour = &quot;blue&quot;, hjust = 0, vjust = 1, 
           label = glue(&quot;Median expectation conditional 
on at least {current_dur} days&quot;)) +
  scale_x_log10(label = comma, breaks = c(10, current_dur, 100, conditional_median_dur, 1000)) +
  labs(x = &quot;Total duration of war (in days, logarithmic scale)&quot;,
       y = &quot;Cumulative frequency of wars&quot;,
       title = &quot;Expectations of duration of Iran war, based on modern inter-state wars' duration&quot;,
       subtitle = glue(&quot;Comparison to wars from 1823 to 2003. The median war that lasts {current_dur} days goes on to last {round(conditional_median_dur)} days.&quot;),
       caption = simple_caption)</pre></figure>

<p>We can use the same approach to calculate not just the median war duration (conditional on getting to 74 days) but other percentiles too. For example, below we construct an 80% prediction interval (between the 0.1 and 0.9 quantiles) for total duration, running from 94.9 to 1,752 days. To put this another way, from this 74-day point, only 10% of wars will have a total duration of 94.9 days or less (ie another 21 days).</p>

<p>All up, that’s a big range, of course; the main thing it tells us is that wars last longer than many people would like, and that there’s big variation in their durations.</p>

<figure class="highlight"><pre># some prediction intervals, conditional on getting to 74 days:
probs &lt;- c(0.05, 0.1, 0.5, 0.8, 0.9, 0.95)
more_freqs &lt;- probs * (1 - current_cf) + current_cf
conditional_dur &lt;- predict(inv_model, data.frame(x = more_freqs))
tibble(probability = probs, duration = conditional_dur)
# so 80% of wars that reach 74 days will have a total duration between 95 and 1,752 days</pre></figure>

<pre>  probability duration
        &lt;dbl&gt;    &lt;dbl&gt;
1        0.05     82.3
2        0.1      94.9
3        0.5     261. 
4        0.8    1141. 
5        0.9    1752. 
6        0.95   2119. 
</pre>

<h2 id="duration-and-other-factors">Duration and other factors</h2>

<p>So I’d answered my main question but I was naturally curious about some other relationships too. Obviously one expects longer wars to have more deaths in battle; can we see this in the data? Yes we can:</p>

<object type="image/svg+xml" data="https://freerangestats.info/img/0321-duration-deaths.svg" width="450"><img src="https://i1.wp.com/freerangestats.info/img/0321-duration-deaths.png?w=450&#038;ssl=1" data-recalc-dims="1" /></object>

<p>I like this chart for presenting the scale of nearly two centuries of inter-state war in one easy visualisation.</p>

<p>We also see that if there’s a pattern in the relationship between duration, deaths and when the war started (the starting year is mapped to colour in the chart above), it’s not an obvious one. We’ll come back to that in the next chart, but first, here’s the code to create the scatter plot above.</p>

<figure class="highlight"><pre>#------------------Compare duration and number of deaths----------------
interstate_wars |&gt; 
  ggplot(aes(x = duration, y = bat_death, label = war_name)) +
  geom_point(aes(colour = start_year), size = 3.5) +
  geom_text_repel(colour = &quot;grey50&quot;, size = 2, seed = 123) +
  scale_y_log10(label = comma) +
  scale_x_log10(label = comma) +
  scale_colour_viridis_c() +
  labs(title = &quot;Inter-state wars, 1823-2003&quot;,
       colour = &quot;Starting year&quot;,
       x = &quot;Duration in days&quot;,
       y = &quot;Number of battle deaths&quot;,
       caption = simple_caption) +
  theme(legend.position = c(0.15, 0.8))</pre></figure>

<p>I was a bit worried about that “two centuries” thing. Are recent wars all much shorter, or perhaps much longer, than older wars? If so, it would be a big limitation on my inference about likely war length. So I prepared one more plot to check whether there was an obvious relationship, more rigorously than just eye-balling colour on the previous plot. I was a bit surprised to see that there is actually no real growth or reduction in war duration over time:</p>

<object type="image/svg+xml" data="https://freerangestats.info/img/0321-duration-history.svg" width="450"><img src="https://i0.wp.com/freerangestats.info/img/0321-duration-history.png?w=450&#038;ssl=1" data-recalc-dims="1" /></object>

<p>I also quite like this chart for giving us an instant comparison of our current USA-Israel-Iran war with some of those in history. We can see that it is already longer than the Boxer Rebellion, but not quite as long as the Falkland Islands or the War for Kosovo (for all of these names I am using those provided by the Correlates of War project - I’m well aware that these are contested labels).</p>

<p>Here’s my final chunk of code drawing that last chart:</p>

<figure class="highlight"><pre>#------------Compare duration with when in history it happened---------------
interstate_wars |&gt; 
  arrange(bat_death) |&gt; 
  ggplot(aes(x = earliest_start, y = duration)) +
  geom_hline(yintercept = current_dur, colour = &quot;red&quot;) +
  geom_point(aes(size = bat_death), shape = 1) +
  geom_text_repel(aes(label = war_name), colour = &quot;steelblue&quot;, size = 3, seed = 123) +
  annotate(&quot;text&quot;, x= as.Date(&quot;1820-01-01&quot;), y = current_dur + 8, hjust = 0,
           label = &quot;Duration of 2026 US-Israel-Iran war so far&quot;, colour = &quot;red&quot;) +
  scale_y_log10(label = comma) +
  scale_size_area(label = comma, max_size = 25) +
  labs(title = &quot;Inter-state wars, 1823-2003&quot;,
       subtitle = glue(&quot;Compared to the USA-Israel-Iran war as at {Sys.Date()}&quot;),
       x = &quot;Start of war&quot;,
       y = &quot;Duration of war (days)&quot;,
       size = &quot;Number of battle deaths:&quot;,
       caption = simple_caption)</pre></figure>

<p>That’s all folks. Stay safe out there.</p>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://freerangestats.info/blog/2026/05/13/war-durations"> free range statistics - R</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/durations-of-wars-by-ellis2013nz/">Durations of wars by @ellis2013nz</a>]]></content:encoded>
					
		

		<post-id xmlns="com-wordpress:feed-additions:1">401218</post-id>	</item>
		<item>
		<title>Learning Data Science: Why a High R^2 Can Be Misleading</title>
		<link>https://www.r-bloggers.com/2026/05/learning-data-science-why-a-high-r2-can-be-misleading/</link>
		
		<dc:creator><![CDATA[Learning Machines]]></dc:creator>
		<pubDate>Mon, 11 May 2026 15:09:30 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://blog.ephorie.de/?p=7048</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> A high can make a regression model look impressively accurate — but this number can be deceptive. If you want to understand why a high is not always a sign of a good model, read on! In the post, Learning Data Science: Modelling Basics, we built a simple model to predict ...</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/learning-data-science-why-a-high-r2-can-be-misleading/">Learning Data Science: Why a High R^2 Can Be Misleading</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://blog.ephorie.de/learning-data-science-why-a-high-r2-can-be-misleading?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=learning-data-science-why-a-high-r2-can-be-misleading"> R-Bloggers – Learning Machines</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p><img loading="lazy" fetchpriority="high" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/uploads/2020/01/discount-1015451_1280-e1568068884649-300x277.jpg?resize=300%2C277&#038;ssl=1" alt="" width="300" height="277" class="alignleft size-medium wp-image-2386" srcset_temp="https://i1.wp.com/blog.ephorie.de/wp-content/uploads/2020/01/discount-1015451_1280-e1568068884649-300x277.jpg?resize=300%2C277&#038;ssl=1 300w, https://blog.ephorie.de/wp-content/uploads/2020/01/discount-1015451_1280-e1568068884649-768x709.jpg 768w, https://blog.ephorie.de/wp-content/uploads/2020/01/discount-1015451_1280-e1568068884649-840x776.jpg 840w, https://blog.ephorie.de/wp-content/uploads/2020/01/discount-1015451_1280-e1568068884649.jpg 1081w" sizes="(max-width: 300px) 85vw, 300px" data-recalc-dims="1" /></p>
<p>A high <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> can make a regression model look impressively accurate — but this number can be deceptive. If you want to understand why a high <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> is not always a sign of a good model, read on!</p>
<p><span id="more-7048"></span></p>
<p>In the post, <a href="https://blog.ephorie.de/learning-data-science-modelling-basics" rel="nofollow" target="_blank">Learning Data Science: Modelling Basics</a>, we built a simple model to predict income from age. R printed a model summary containing something called <code>R-squared</code>, but we did not yet discuss what that value actually means.</p>
<p>At first sight, a high <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> looks highly reassuring. In our example, the linear model achieved an <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> close to 90%. That sounds impressive.</p>
<p>However, just as high classification accuracy can be misleading — as discussed in <a href="https://blog.ephorie.de/zeror-the-simplest-possible-classifier-or-why-high-accuracy-can-be-misleading" rel="nofollow" target="_blank">ZeroR: The Simplest Possible Classifier, or Why High Accuracy can be Misleading</a> — a high <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> can also create a false sense of confidence.</p>
<p>To understand why, it helps to examine the formula itself and then revisit the three models from the previous post: the <em>mean model</em>, the <em>linear model</em>, and the <em>polynomial model</em>.</p>
<hr />
<h2>The Meaning of <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/></h2>
<p>The coefficient of determination is defined as:</p>
<p><img loading="lazy" decoding="async" src="https://i2.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-00e6249ac935aca67c46d1437aa02ef0_l3.png?resize=144%2C31&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2 = 1 - \frac{\sum (y_i-\hat y_i)^2}{\sum (y_i-\bar y)^2}" title="Rendered by QuickLaTeX.com" height="31" width="144" style="vertical-align: -11px;" data-recalc-dims="1"/></p>
<p>At first glance, the formula appears intimidating, but its basic idea is relatively simple.</p>
<p>The denominator</p>
<p><img loading="lazy" decoding="async" src="https://i2.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-2c3d465528e95ffd2ac7eee3afdd8fc0_l3.png?resize=149%2C20&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="SS_{tot} = \sum (y_i-\bar y)^2" title="Rendered by QuickLaTeX.com" height="20" width="149" style="vertical-align: -5px;" data-recalc-dims="1"/></p>
<p>measures the <em>total variation in the target variable</em>. It quantifies how strongly the observed values differ from their mean.</p>
<p>The numerator</p>
<p><img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-61749266165fa0e8fe29b5c6c993ee17_l3.png?resize=156%2C20&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="SS_{res} = \sum (y_i-\hat y_i)^2" title="Rendered by QuickLaTeX.com" height="20" width="156" style="vertical-align: -5px;" data-recalc-dims="1"/></p>
<p>measures the <em>remaining unexplained error after fitting the model</em>.</p>
<p>Thus, <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> measures the <em>proportion of variation explained by the model</em>.</p>
<p>An <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> of:</p>
<ul>
<li>0 means the model explains none of the variation,</li>
<li>1 means the model explains all variation perfectly.</li>
</ul>
<p>This sounds straightforward enough. The difficulty is that perfectly explaining the observed data is not necessarily the same thing as building a useful predictive model.</p>
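<p>The definition is easy to check by hand in R. The sketch below uses made-up age/income numbers (not necessarily the data from the earlier post) and compares the manual calculation with the value that <code>summary()</code> reports:</p>

```r
# Illustrative data: income loosely increasing with age (hypothetical values)
age    <- c(21, 28, 35, 46, 55)
income <- c(1800, 1850, 2230, 2500, 2560)

fit <- lm(income ~ age)

# R^2 from its definition: 1 - SS_res / SS_tot
ss_res <- sum((income - fitted(fit))^2)
ss_tot <- sum((income - mean(income))^2)
r2 <- 1 - ss_res / ss_tot

# The manual value matches the "Multiple R-squared" reported by summary()
all.equal(r2, summary(fit)$r.squared)
```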
<hr />
<h2>The Mean Model</h2>
<p>Let us begin with the simplest possible regression model.</p>
<p>Suppose we completely ignore age and simply predict the average income for every individual:</p>
<p><img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-6e7456d883a3a0982c0da3efb13957f8_l3.png?resize=47%2C16&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="\hat y_i = \bar y" title="Rendered by QuickLaTeX.com" height="16" width="47" style="vertical-align: -4px;" data-recalc-dims="1"/></p>
<p>This is effectively the regression equivalent of ZeroR. The model does not learn any relationship at all.</p>
<p>In this case:</p>
<p><img loading="lazy" decoding="async" src="https://i0.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-8b14fd9ed1319341f6f2f7ffa5ad52d6_l3.png?resize=119%2C16&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="y_i - \hat y_i = y_i - \bar y" title="Rendered by QuickLaTeX.com" height="16" width="119" style="vertical-align: -4px;" data-recalc-dims="1"/></p>
<p>Therefore, the residual sum of squares becomes identical to the total sum of squares:</p>
<p><img loading="lazy" decoding="async" src="https://i0.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-dad01239a88490f46b0c50e3d31d2e2e_l3.png?resize=199%2C20&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="\sum (y_i-\hat y_i)^2 = \sum (y_i-\bar y)^2" title="Rendered by QuickLaTeX.com" height="20" width="199" style="vertical-align: -5px;" data-recalc-dims="1"/></p>
<p>Substituting this into the formula gives:</p>
<p><img loading="lazy" decoding="async" src="https://i2.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-201f649c3db6193ecc7d5d6ec2cf3dab_l3.png?resize=145%2C24&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2 = 1 - \frac{SS_{tot}}{SS_{tot}} = 0" title="Rendered by QuickLaTeX.com" height="24" width="145" style="vertical-align: -8px;" data-recalc-dims="1"/></p>
<p>The model explains none of the variation in the data.</p>
<p>This corresponds to the <em>underfitting</em> case discussed previously: the model is too simple to capture the underlying structure.</p>
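<p>This is easy to verify directly: in R the mean model is just an intercept-only regression, and the reported <code>R-squared</code> comes out as zero. A minimal sketch with hypothetical income values:</p>

```r
# Illustrative income values (hypothetical)
income <- c(1800, 1850, 2230, 2500, 2560)

# Intercept-only "mean model": it predicts mean(income) for every observation
mean_model <- lm(income ~ 1)

unique(as.numeric(fitted(mean_model)))  # the single prediction: mean(income)
summary(mean_model)$r.squared           # 0 (up to numerical noise)
```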
<hr />
<h2>The Polynomial Model</h2>
<p>Now consider the opposite extreme.</p>
<p>Instead of fitting a straight line, suppose we fit a polynomial of sufficiently high degree. In fact, if we have <img loading="lazy" decoding="async" src="https://i0.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-ec4217f4fa5fcd92a9edceba0e708cf7_l3.png?resize=11%2C8&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="n" title="Rendered by QuickLaTeX.com" height="8" width="11" style="vertical-align: 0px;" data-recalc-dims="1"/> observations with distinct age values, a polynomial of degree up to <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-3fd905b384548c9de7011828b88081d5_l3.png?resize=40%2C12&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="n-1" title="Rendered by QuickLaTeX.com" height="12" width="40" style="vertical-align: 0px;" data-recalc-dims="1"/> can pass exactly through all observed data points: </p>
<p><img loading="lazy" decoding="async" src="https://i2.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-2ba542bb2156501d745b39f95647edf6_l3.png?resize=291%2C19&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="y = a_0 + a_1x + a_2x^2 + \dots + a_{n-1}x^{n-1}" title="Rendered by QuickLaTeX.com" height="19" width="291" style="vertical-align: -4px;" data-recalc-dims="1"/></p>
<p>In that case:</p>
<p><img loading="lazy" decoding="async" src="https://i2.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-3723adf0d61a6d5758ddf7bbbe0865d5_l3.png?resize=52%2C16&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="y_i = \hat y_i" title="Rendered by QuickLaTeX.com" height="16" width="52" style="vertical-align: -4px;" data-recalc-dims="1"/></p>
<p>for all observations, implying:</p>
<p><img loading="lazy" decoding="async" src="https://i0.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-67e17d4eacf7a7e6995c47e968ca1464_l3.png?resize=123%2C20&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="\sum (y_i-\hat y_i)^2 = 0" title="Rendered by QuickLaTeX.com" height="20" width="123" style="vertical-align: -5px;" data-recalc-dims="1"/></p>
<p>and therefore:</p>
<p><img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-449b1f5b7fe03b52c0d2a080f98ea2ed_l3.png?resize=53%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2 = 1" title="Rendered by QuickLaTeX.com" height="15" width="53" style="vertical-align: 0px;" data-recalc-dims="1"/></p>
<p>The model achieves a perfect fit.</p>
<p>At first sight, this appears ideal. In practice, however, such a model often performs poorly on unseen data because it has adapted itself not only to the underlying relationship, but also to random fluctuations and noise within the training data.</p>
<p>This is the classical <em>overfitting</em> problem.</p>
<p>A perfect <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> may therefore indicate not a particularly good model, but a model that has become too flexible.</p>
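<p>This, too, can be verified with a short R sketch (again with simulated data and hypothetical names): a polynomial of degree <em>n</em> - 1 fitted to <em>n</em> points reproduces them exactly.</p>
<pre># Illustrative sketch: a degree n-1 polynomial interpolates all n training points
set.seed(42)
n &lt;- 10
x &lt;- 1:n
y &lt;- 2 * x + rnorm(n)

poly_model &lt;- lm(y ~ poly(x, n - 1))  # as many coefficients as observations
summary(poly_model)$r.squared         # 1, up to numerical precision</pre>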
<hr />
<h2>The Linear Model</h2>
<p>The linear model from the previous post lies between these two extremes.</p>
<p>It is simple enough to avoid memorizing every random fluctuation, yet flexible enough to capture a meaningful trend in the data.</p>
<p>This balance between simplicity and flexibility is one of the central themes in statistical learning.</p>
<p>The idea was summarized in the previous post with the following plot:</p>
<p><img loading="lazy" decoding="async" src="https://i2.wp.com/blog.ephorie.de/wp-content/uploads/2019/02/mb4.png?w=450&#038;ssl=1" alt="" class="aligncenter size-full wp-image-554" srcset_temp="https://i2.wp.com/blog.ephorie.de/wp-content/uploads/2019/02/mb4.png?w=450&#038;ssl=1 534w, https://blog.ephorie.de/wp-content/uploads/2019/02/mb4-300x231.png 300w" sizes="auto, (max-width: 534px) 85vw, 534px" data-recalc-dims="1" /></p>
<p>and by the famous observation attributed to George Box:</p>
<blockquote><p>
“All models are wrong, but some are useful.”
</p></blockquote>
<p>The objective in modelling is therefore not to maximize complexity or maximize <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/>, but to find a model that generalizes well beyond the observed sample.</p>
<hr />
<h2>Why <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> Alone Is Insufficient</h2>
<p>The key limitation of <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> is that it evaluates fit on the observed data only.</p>
<p>It does not directly measure:</p>
<ul>
<li>predictive performance on unseen data,</li>
<li>robustness,</li>
<li>causal validity, or</li>
<li>generalization ability.</li>
</ul>
<p>As model complexity increases, <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> almost always increases as well. A sufficiently flexible model can often achieve values very close to 1 even when its predictions on new data are poor.</p>
<p>For this reason, practical data science relies on additional evaluation methods such as:</p>
<ul>
<li>train-test splits,</li>
<li>cross-validation,</li>
<li>regularization,</li>
<li>adjusted <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/>, and</li>
<li>out-of-sample testing.</li>
</ul>
<p>The goal is not to reproduce historical observations perfectly, but to construct models that remain useful when confronted with new data.</p>
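<p>A simple train-test split illustrates the point (simulated data, hypothetical names): the more flexible model fits the training sample more closely, but its out-of-sample R<sup>2</sup> tends to be worse.</p>
<pre># Illustrative sketch: flexibility helps in-sample, not necessarily out-of-sample
set.seed(42)
x &lt;- runif(60)
y &lt;- sin(2 * pi * x) + rnorm(60, sd = 0.3)
train &lt;- 1:40
test  &lt;- 41:60

r2 &lt;- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

fit_simple   &lt;- lm(y ~ poly(x, 3),  subset = train)
fit_flexible &lt;- lm(y ~ poly(x, 20), subset = train)

r2(y[test], predict(fit_simple,   data.frame(x = x[test])))
r2(y[test], predict(fit_flexible, data.frame(x = x[test])))  # typically much lower</pre>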
<p>A high <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&#038;ssl=1" class="ql-img-inline-formula quicklatex-auto-format" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21" style="vertical-align: 0px;" data-recalc-dims="1"/> can therefore mean two very different things:</p>
<ul>
<li>the model has identified a genuine structure,</li>
<li>or the model has merely adapted itself too closely to the training data.</li>
</ul>
<p>Distinguishing between these possibilities is one of the central challenges of machine learning and statistical modelling.</p>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://blog.ephorie.de/learning-data-science-why-a-high-r2-can-be-misleading?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=learning-data-science-why-a-high-r2-can-be-misleading"> R-Bloggers – Learning Machines</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/learning-data-science-why-a-high-r2-can-be-misleading/">Learning Data Science: Why a High R^2 Can Be Misleading</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">401170</post-id>	</item>
		<item>
		<title>How to Build an Expected Goals (xG) Model in R with worldfootballR</title>
		<link>https://www.r-bloggers.com/2026/05/how-to-build-an-expected-goals-xg-model-in-r-with-worldfootballr/</link>
		
		<dc:creator><![CDATA[rprogrammingbooks]]></dc:creator>
		<pubDate>Sat, 09 May 2026 23:13:30 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://rprogrammingbooks.com/?p=2554</guid>

					<description><![CDATA[<p>Expected goals has become one of the most important concepts in modern football analytics. Instead of judging a team only by goals scored, xG helps us estimate the quality of the chances created. In this tutorial, we will build a practical expected goals model in R using football data, feature ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/how-to-build-an-expected-goals-xg-model-in-r-with-worldfootballr/">How to Build an Expected Goals (xG) Model in R with worldfootballR</a>]]></description>
										<content:encoded><![CDATA[

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://rprogrammingbooks.com/expected-goals-model-r-worldfootballr/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=expected-goals-model-r-worldfootballr"> Blog - R Programming Books</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>

<p><strong>Expected goals</strong> has become one of the most important concepts in modern football analytics. Instead of judging a team only by goals scored, xG helps us estimate the quality of the chances created. In this tutorial, we will build a practical expected goals model in R using football data, feature engineering, logistic regression, model evaluation, and visualization.</p>

<p>This is a hands-on guide for analysts who want to move beyond simple football statistics and start building reproducible soccer analytics workflows in R.</p>

<h2>What Is Expected Goals?</h2>

<p>Expected goals, usually written as xG, measures the probability that a shot becomes a goal. A shot from two meters in front of goal will usually have a high xG value, while a long-range shot from outside the box will usually have a low xG value.</p>

<p>An xG model can use variables such as:</p>

<ul>
  <li>Shot distance</li>
  <li>Shot angle</li>
  <li>Body part used</li>
  <li>Game state</li>
  <li>Minute of the match</li>
  <li>Shot type</li>
  <li>Set-piece situation</li>
  <li>Home or away context</li>
</ul>

<p>In this post, we will build a clean starter model using R. You can later extend it with richer event data, tracking data, or more advanced machine learning models.</p>

<h2>Install and Load R Packages</h2>

<pre># Core data science packages
install.packages(c(
  &quot;tidyverse&quot;,
  &quot;ggplot2&quot;,
  &quot;dplyr&quot;,
  &quot;readr&quot;,
  &quot;janitor&quot;,
  &quot;broom&quot;,
  &quot;yardstick&quot;,
  &quot;rsample&quot;,
  &quot;pROC&quot;,
  &quot;patchwork&quot;
))

# Football data package
install.packages(&quot;worldfootballR&quot;)

library(tidyverse)
library(ggplot2)
library(dplyr)
library(readr)
library(janitor)
library(broom)
library(yardstick)
library(rsample)
library(pROC)
library(patchwork)
library(worldfootballR)</pre>

<h2>Create a Simple Shot Dataset</h2>

<p>Different public football data sources structure shot data differently. To make this tutorial reproducible, we will first create a synthetic shot dataset that behaves like real football event data. Later, you can replace this with your own data from FBref, StatsBomb open data, Wyscout-style exports, or custom event feeds.</p>

<pre>set.seed(123)

n_shots &lt;- 5000

shots &lt;- tibble(
  shot_id = 1:n_shots,
  player = sample(
    c(&quot;Player A&quot;, &quot;Player B&quot;, &quot;Player C&quot;, &quot;Player D&quot;, &quot;Player E&quot;),
    n_shots,
    replace = TRUE
  ),
  team = sample(
    c(&quot;Team Red&quot;, &quot;Team Blue&quot;, &quot;Team Green&quot;, &quot;Team White&quot;),
    n_shots,
    replace = TRUE
  ),
  minute = sample(1:95, n_shots, replace = TRUE),
  x_location = runif(n_shots, min = 70, max = 120),
  y_location = runif(n_shots, min = 0, max = 80),
  body_part = sample(
    c(&quot;Right Foot&quot;, &quot;Left Foot&quot;, &quot;Header&quot;, &quot;Other&quot;),
    n_shots,
    replace = TRUE,
    prob = c(0.43, 0.32, 0.20, 0.05)
  ),
  situation = sample(
    c(&quot;Open Play&quot;, &quot;Corner&quot;, &quot;Free Kick&quot;, &quot;Penalty&quot;, &quot;Counter Attack&quot;),
    n_shots,
    replace = TRUE,
    prob = c(0.68, 0.12, 0.08, 0.03, 0.09)
  ),
  home_away = sample(c(&quot;Home&quot;, &quot;Away&quot;), n_shots, replace = TRUE)
)

glimpse(shots)</pre>

<h2>Engineer Shot Distance and Angle</h2>

<p>Distance and angle are two of the most important features in a basic xG model. We will assume the goal is centered at x = 120 and y = 40.</p>

<pre>goal_x &lt;- 120
goal_y &lt;- 40

shots &lt;- shots %&gt;%
  mutate(
    distance_to_goal = sqrt(
      (goal_x - x_location)^2 + (goal_y - y_location)^2
    ),
    angle_to_goal = atan2(
      abs(goal_y - y_location),
      goal_x - x_location
    ),
    angle_degrees = angle_to_goal * 180 / pi
  )

shots %&gt;%
  select(shot_id, x_location, y_location, distance_to_goal, angle_degrees) %&gt;%
  head()</pre>

<h2>Create a Goal Outcome</h2>

<p>For demonstration, we will simulate goals using realistic football logic. Shots closer to goal should be more likely to become goals. Penalties should have higher probability. Headers and long-range attempts should usually be harder.</p>

<pre>shots &lt;- shots %&gt;%
  mutate(
    linear_probability =
      -2.8 -
      0.08 * distance_to_goal +
      0.025 * angle_degrees +
      if_else(body_part == &quot;Header&quot;, -0.35, 0) +
      if_else(body_part == &quot;Other&quot;, -0.60, 0) +
      if_else(situation == &quot;Penalty&quot;, 3.00, 0) +
      if_else(situation == &quot;Counter Attack&quot;, 0.35, 0) +
      if_else(situation == &quot;Free Kick&quot;, -0.45, 0),
    
    goal_probability = plogis(linear_probability),
    goal = rbinom(n(), size = 1, prob = goal_probability)
  )

shots %&gt;%
  summarise(
    total_shots = n(),
    total_goals = sum(goal),
    conversion_rate = mean(goal)
  )</pre>

<h2>Explore the Shot Data</h2>

<pre>shots %&gt;%
  count(body_part, goal) %&gt;%
  group_by(body_part) %&gt;%
  mutate(rate = n / sum(n))
shots %&gt;%
  group_by(situation) %&gt;%
  summarise(
    shots = n(),
    goals = sum(goal),
    conversion_rate = mean(goal),
    avg_distance = mean(distance_to_goal),
    .groups = &quot;drop&quot;
  ) %&gt;%
  arrange(desc(conversion_rate))</pre>

<h2>Visualize Shot Locations</h2>

<pre>ggplot(shots, aes(x = x_location, y = y_location, color = factor(goal))) +
  geom_point(alpha = 0.35) +
  coord_fixed() +
  labs(
    title = &quot;Shot Map&quot;,
    x = &quot;Pitch Length&quot;,
    y = &quot;Pitch Width&quot;,
    color = &quot;Goal&quot;
  ) +
  theme_minimal()</pre>

<h2>Split Data into Training and Testing Sets</h2>

<pre>set.seed(123)

shot_split &lt;- initial_split(shots, prop = 0.80, strata = goal)

train_data &lt;- training(shot_split)
test_data  &lt;- testing(shot_split)

nrow(train_data)
nrow(test_data)</pre>

<h2>Build a Logistic Regression xG Model</h2>

<p>Expected goals is naturally suited to logistic regression because the outcome is binary: goal or no goal.</p>

<pre>xg_model &lt;- glm(
  goal ~ distance_to_goal +
    angle_degrees +
    body_part +
    situation +
    home_away +
    minute,
  data = train_data,
  family = binomial()
)

summary(xg_model)</pre>

<h2>Convert Model Output into xG Values</h2>

<pre>test_predictions &lt;- test_data %&gt;%
  mutate(
    xg = predict(xg_model, newdata = test_data, type = &quot;response&quot;)
  )

test_predictions %&gt;%
  select(player, team, goal, xg, distance_to_goal, angle_degrees) %&gt;%
  head(10)</pre>

<h2>Evaluate the xG Model</h2>

<p>A good xG model should not only predict goals, but also produce well-calibrated probabilities. If 100 shots each have an xG of 0.10, we would expect roughly 10 goals over a large enough sample.</p>

<pre>test_predictions %&gt;%
  summarise(
    actual_goals = sum(goal),
    expected_goals = sum(xg),
    avg_xg = mean(xg),
    actual_conversion = mean(goal)
  )</pre>

<h3>ROC AUC</h3>

<pre>roc_obj &lt;- roc(
  response = test_predictions$goal,
  predictor = test_predictions$xg
)

auc(roc_obj)
plot(
  roc_obj,
  main = &quot;ROC Curve for xG Model&quot;
)</pre>

<h3>Brier Score</h3>

<pre>brier_score &lt;- mean((test_predictions$xg - test_predictions$goal)^2)

brier_score</pre>

<h2>Create xG Buckets for Calibration</h2>

<pre>calibration_table &lt;- test_predictions %&gt;%
  mutate(
    xg_bucket = cut(
      xg,
      breaks = seq(0, 1, by = 0.05),
      include.lowest = TRUE
    )
  ) %&gt;%
  group_by(xg_bucket) %&gt;%
  summarise(
    shots = n(),
    avg_xg = mean(xg),
    actual_goal_rate = mean(goal),
    goals = sum(goal),
    .groups = &quot;drop&quot;
  ) %&gt;%
  filter(shots &gt;= 10)

calibration_table
ggplot(calibration_table, aes(x = avg_xg, y = actual_goal_rate)) +
  geom_point(size = 3) +
  geom_abline(intercept = 0, slope = 1, linetype = &quot;dashed&quot;) +
  labs(
    title = &quot;xG Model Calibration&quot;,
    x = &quot;Average Predicted xG&quot;,
    y = &quot;Actual Goal Rate&quot;
  ) +
  theme_minimal()</pre>

<h2>Player-Level xG Analysis</h2>

<p>Once every shot has an xG value, we can aggregate by player. This allows us to compare goals, expected goals, overperformance, and shot volume.</p>

<pre>player_xg &lt;- test_predictions %&gt;%
  group_by(player) %&gt;%
  summarise(
    shots = n(),
    goals = sum(goal),
    xg_per_shot = mean(xg),      # compute before xg is overwritten by the total below
    conversion_rate = mean(goal),
    xg = sum(xg),
    goals_minus_xg = goals - xg,
    .groups = &quot;drop&quot;
  ) %&gt;%
  arrange(desc(xg))

player_xg
ggplot(player_xg, aes(x = reorder(player, xg), y = xg)) +
  geom_col() +
  coord_flip() +
  labs(
    title = &quot;Expected Goals by Player&quot;,
    x = &quot;Player&quot;,
    y = &quot;Total xG&quot;
  ) +
  theme_minimal()</pre>

<h2>Team-Level xG Analysis</h2>

<pre>team_xg &lt;- test_predictions %&gt;%
  group_by(team) %&gt;%
  summarise(
    shots = n(),
    goals = sum(goal),
    avg_xg_per_shot = mean(xg),  # compute before xg is overwritten by the total below
    xg = sum(xg),
    goals_minus_xg = goals - xg,
    .groups = &quot;drop&quot;
  ) %&gt;%
  arrange(desc(xg))

team_xg
ggplot(team_xg, aes(x = reorder(team, goals_minus_xg), y = goals_minus_xg)) +
  geom_col() +
  coord_flip() +
  labs(
    title = &quot;Goals Minus xG by Team&quot;,
    x = &quot;Team&quot;,
    y = &quot;Goals - Expected Goals&quot;
  ) +
  theme_minimal()</pre>

<h2>Shot Quality Distribution</h2>

<pre>ggplot(test_predictions, aes(x = xg)) +
  geom_histogram(bins = 40) +
  labs(
    title = &quot;Distribution of Shot Quality&quot;,
    x = &quot;Expected Goals&quot;,
    y = &quot;Number of Shots&quot;
  ) +
  theme_minimal()</pre>

<h2>Compare Goals and xG by Situation</h2>

<pre>situation_xg &lt;- test_predictions %&gt;%
  group_by(situation) %&gt;%
  summarise(
    shots = n(),
    goals = sum(goal),
    avg_xg = mean(xg),           # compute before xg is overwritten by the total below
    xg = sum(xg),
    .groups = &quot;drop&quot;
  ) %&gt;%
  arrange(desc(avg_xg))

situation_xg
situation_long &lt;- situation_xg %&gt;%
  select(situation, goals, xg) %&gt;%
  pivot_longer(
    cols = c(goals, xg),
    names_to = &quot;metric&quot;,
    values_to = &quot;value&quot;
  )

ggplot(situation_long, aes(x = reorder(situation, value), y = value, fill = metric)) +
  geom_col(position = &quot;dodge&quot;) +
  coord_flip() +
  labs(
    title = &quot;Goals vs Expected Goals by Situation&quot;,
    x = &quot;Situation&quot;,
    y = &quot;Value&quot;,
    fill = &quot;Metric&quot;
  ) +
  theme_minimal()</pre>

<h2>Build a More Advanced xG Model with Interactions</h2>

<p>A simple model is useful, but football is full of interactions. For example, distance may affect headers differently than footed shots. We can include interaction terms in the model.</p>

<pre>xg_model_interaction &lt;- glm(
  goal ~ distance_to_goal * body_part +
    angle_degrees +
    situation +
    home_away +
    minute,
  data = train_data,
  family = binomial()
)

summary(xg_model_interaction)
test_predictions_interaction &lt;- test_data %&gt;%
  mutate(
    xg_interaction = predict(
      xg_model_interaction,
      newdata = test_data,
      type = &quot;response&quot;
    )
  )

mean((test_predictions_interaction$xg_interaction - test_predictions_interaction$goal)^2)</pre>

<h2>Compare Two xG Models</h2>

<pre>model_comparison &lt;- tibble(
  model = c(&quot;Basic Logistic Regression&quot;, &quot;Interaction Logistic Regression&quot;),
  brier_score = c(
    mean((test_predictions$xg - test_predictions$goal)^2),
    mean((test_predictions_interaction$xg_interaction - test_predictions_interaction$goal)^2)
  ),
  total_predicted_goals = c(
    sum(test_predictions$xg),
    sum(test_predictions_interaction$xg_interaction)
  ),
  actual_goals = c(
    sum(test_predictions$goal),
    sum(test_predictions_interaction$goal)
  )
)

model_comparison</pre>

<h2>Create a Reusable xG Prediction Function</h2>

<pre>predict_xg &lt;- function(model, new_shots) {
  new_shots %&gt;%
    mutate(
      predicted_xg = predict(
        model,
        newdata = new_shots,
        type = &quot;response&quot;
      )
    )
}

new_predictions &lt;- predict_xg(xg_model, test_data)

head(new_predictions)</pre>

<h2>Create a Custom Shot Example</h2>

<pre>custom_shot &lt;- tibble(
  distance_to_goal = 12,
  angle_degrees = 28,
  body_part = &quot;Right Foot&quot;,
  situation = &quot;Open Play&quot;,
  home_away = &quot;Home&quot;,
  minute = 62
)

predict(
  xg_model,
  newdata = custom_shot,
  type = &quot;response&quot;
)</pre>

<h2>Use worldfootballR for Real Football Workflows</h2>

<p>For real projects, you can use packages such as <code>worldfootballR</code> to collect football data from public sources and build reproducible analysis pipelines. The exact available columns depend on the source and endpoint, so always inspect your data before modeling.</p>

<pre>library(worldfootballR)
library(tidyverse)

# Example: get FBref match results
# Adjust country, gender, season_end_year, and tier depending on your project

premier_league_results &lt;- fb_match_results(
  country = &quot;ENG&quot;,
  gender = &quot;M&quot;,
  season_end_year = 2025,
  tier = &quot;1st&quot;
)

glimpse(premier_league_results)
premier_league_results %&gt;%
  clean_names() %&gt;%
  head()</pre>

<p>If you are building a full football analytics pipeline with FBref, Transfermarkt, and Understat-style workflows, a more structured project template can save a lot of time. I cover that type of end-to-end workflow in <a href="https://rprogrammingbooks.com/product/mastering-football-data-worldfootballr/" rel="nofollow" target="_blank">Mastering Football Data with worldfootballR</a>, especially for readers who want reusable R scripts, clean folders, and practical football data examples.</p>

<h2>Example: Clean Match Results Data</h2>

<pre>clean_results &lt;- premier_league_results %&gt;%
  clean_names()

clean_results %&gt;%
  glimpse()
# Example structure will depend on the returned data
# Always check column names first

names(clean_results)</pre>

<h2>Build a Match-Level Team Summary</h2>

<pre># This is an example pattern.
# You may need to adjust column names depending on your data source.

team_summary_example &lt;- clean_results %&gt;%
  summarise(
    matches = n()
  )

team_summary_example</pre>

<h2>Save Your xG Model</h2>

<p>Once you have trained a model, save it so you can reuse it later in reports, dashboards, APIs, or automated pipelines.</p>

<pre>saveRDS(xg_model, &quot;xg_model_logistic_regression.rds&quot;)

loaded_xg_model &lt;- readRDS(&quot;xg_model_logistic_regression.rds&quot;)

predict(
  loaded_xg_model,
  newdata = custom_shot,
  type = &quot;response&quot;
)</pre>

<h2>Create an xG Report Table</h2>

<pre>xg_report &lt;- test_predictions %&gt;%
  group_by(team, player) %&gt;%
  summarise(
    shots = n(),
    goals = sum(goal),
    xg_per_shot = round(mean(xg), 3),    # compute before xg is overwritten below
    xg = round(sum(xg), 2),
    goals_minus_xg = round(goals - xg, 2),
    .groups = &quot;drop&quot;
  ) %&gt;%
  arrange(desc(xg))

xg_report
write_csv(xg_report, &quot;xg_player_report.csv&quot;)</pre>

<h2>Create an xG Shot Map</h2>

<pre>ggplot(test_predictions, aes(x = x_location, y = y_location)) +
  geom_point(aes(size = xg, alpha = xg)) +
  coord_fixed() +
  labs(
    title = &quot;xG Shot Map&quot;,
    x = &quot;Pitch Length&quot;,
    y = &quot;Pitch Width&quot;,
    size = &quot;xG&quot;,
    alpha = &quot;xG&quot;
  ) +
  theme_minimal()</pre>

<h2>Create a High-Value Chances Table</h2>

<pre>big_chances &lt;- test_predictions %&gt;%
  filter(xg &gt;= 0.30) %&gt;%
  arrange(desc(xg)) %&gt;%
  select(
    player,
    team,
    minute,
    body_part,
    situation,
    distance_to_goal,
    angle_degrees,
    xg,
    goal
  )

big_chances %&gt;%
  head(20)</pre>

<h2>Model Improvement Ideas</h2>

<p>This starter xG model can be improved in many ways. A professional football analytics workflow may include:</p>

<ul>
  <li>More accurate shot coordinates</li>
  <li>Goalkeeper position</li>
  <li>Defender pressure</li>
  <li>Pass type before the shot</li>
  <li>Through balls and cutbacks</li>
  <li>Shot speed</li>
  <li>First-time shots</li>
  <li>Game state</li>
  <li>Team strength</li>
  <li>Player finishing history</li>
</ul>

<h2>Train an XGBoost-Style Model Later</h2>

<p>Logistic regression is interpretable and a good starting point. For higher predictive performance, you can later compare it with random forests, gradient boosting, or Bayesian models.</p>

<pre># Example packages for future model upgrades
# install.packages(c(&quot;xgboost&quot;, &quot;ranger&quot;, &quot;tidymodels&quot;))

library(tidymodels)

# A future tidymodels workflow could look like this:

xg_recipe &lt;- recipe(
  goal ~ distance_to_goal + angle_degrees + body_part + situation + home_away + minute,
  data = train_data %&gt;%
    mutate(goal = factor(goal, levels = c(0, 1)))  # classification outcomes must be factors
) %&gt;%
  step_dummy(all_nominal_predictors()) %&gt;%
  step_normalize(all_numeric_predictors())

xg_recipe</pre>

<h2>Build a Tidymodels Logistic Regression Workflow</h2>

<pre>logistic_spec &lt;- logistic_reg() %&gt;%
  set_engine(&quot;glm&quot;) %&gt;%
  set_mode(&quot;classification&quot;)

xg_workflow &lt;- workflow() %&gt;%
  add_recipe(xg_recipe) %&gt;%
  add_model(logistic_spec)

xg_fit &lt;- fit(
  xg_workflow,
  data = train_data %&gt;%
    mutate(goal = factor(goal, levels = c(0, 1)))
)

xg_fit
tidy(xg_fit)</pre>

<h2>Predict Probabilities with Tidymodels</h2>

<pre>tidy_predictions &lt;- predict(
  xg_fit,
  new_data = test_data,
  type = &quot;prob&quot;
) %&gt;%
  bind_cols(test_data %&gt;% mutate(goal = factor(goal, levels = c(0, 1))))

head(tidy_predictions)
tidy_predictions %&gt;%
  roc_auc(
    truth = goal,
    .pred_1,
    event_level = &quot;second&quot;  # the positive class (goal = 1) is the second factor level
  )</pre>

<h2>Turn xG into Match Insights</h2>

<p>The real value of expected goals is not just predicting whether one shot becomes a goal. The value comes from aggregation. Once every shot has a probability, you can create match-level and season-level insights.</p>

<pre>match_shots &lt;- test_predictions %&gt;%
  mutate(
    match_id = sample(1:100, n(), replace = TRUE)
  )

match_xg &lt;- match_shots %&gt;%
  group_by(match_id, team) %&gt;%
  summarise(
    shots = n(),
    goals = sum(goal),
    xg = sum(xg),
    .groups = &quot;drop&quot;
  )

match_xg %&gt;%
  arrange(match_id, desc(xg)) %&gt;%
  head(20)</pre>

<h2>Find Teams Creating Better Chances</h2>

<pre>team_chance_quality &lt;- test_predictions %&gt;%
  group_by(team) %&gt;%
  summarise(
    shots = n(),
    total_xg = sum(xg),
    avg_xg_per_shot = mean(xg),
    big_chances = sum(xg &gt;= 0.30),
    low_quality_shots = sum(xg &lt;= 0.05),
    .groups = &quot;drop&quot;
  ) %&gt;%
  arrange(desc(avg_xg_per_shot))

team_chance_quality</pre>

<h2>Final Thoughts</h2>

<p>Building an expected goals model in R is one of the best ways to learn football analytics because it combines data cleaning, feature engineering, statistical modeling, visualization, and interpretation. A simple logistic regression model can already teach you a lot about shot quality, player performance, and team attacking style.</p>

<p>From here, the next steps are clear: use richer football data, improve your features, compare different models, evaluate calibration, and build repeatable workflows that can be updated every week during the season.</p>

<p>Expected goals is not the final answer to football analysis, but it is one of the best starting points for serious soccer data science in R.</p>
<p>The post <a href="https://rprogrammingbooks.com/expected-goals-model-r-worldfootballr/" rel="nofollow" target="_blank">How to Build an Expected Goals (xG) Model in R with worldfootballR</a> appeared first on <a href="https://rprogrammingbooks.com/" rel="nofollow" target="_blank">R Programming Books</a>.</p>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://rprogrammingbooks.com/expected-goals-model-r-worldfootballr/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=expected-goals-model-r-worldfootballr"> Blog - R Programming Books</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/how-to-build-an-expected-goals-xg-model-in-r-with-worldfootballr/">How to Build an Expected Goals (xG) Model in R with worldfootballR</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">401149</post-id>	</item>
		<item>
		<title>One interface, (Almost) Every Classifier (and Regressor): unifiedml v0.3.0</title>
		<link>https://www.r-bloggers.com/2026/05/one-interface-almost-every-classifier-and-regressor-unifiedml-v0-3-0/</link>
		
		<dc:creator><![CDATA[T. Moudiki]]></dc:creator>
		<pubDate>Sat, 09 May 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://thierrymoudiki.github.io//blog/2026/05/09/r/New-UnifiedML</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> News from the R package unifiedml, which offers a unified interface to R machine learning models</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/one-interface-almost-every-classifier-and-regressor-unifiedml-v0-3-0/">One interface, (Almost) Every Classifier (and Regressor): unifiedml v0.3.0</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://thierrymoudiki.github.io//blog/2026/05/09/r/New-UnifiedML"> T. Moudiki's Webpage - R</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report an issue with the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p>In the new version of <a href="https://cran.r-project.org/web/packages/unifiedml/index.html" rel="nofollow" target="_blank"><code>unifiedml</code></a> available on CRAN, you can benchmark different models using k-fold cross-validation (section 1 of this blog post), and there’s a unified interface for predicting model probabilities (section 2 of this blog post).</p>

<pre>install.packages(&quot;unifiedml&quot;)

install.packages(c(&quot;e1071&quot;, &quot;randomForest&quot;, &quot;caret&quot;))

install.packages(&quot;glmnet&quot;)

library(unifiedml)
</pre>

<h1 id="1---benchmarking-models">1 &#8211; Benchmarking models</h1>

<pre>set.seed(123)

X &lt;- iris[, 1:4]
y &lt;- iris$Species

models &lt;- list( # `Model` is exported from package 'unifiedml'
  glm  = Model$new(caret::train), # caret can be used (see https://topepo.github.io/caret/available-models.html)
  rf   = Model$new(randomForest::randomForest), # or a native pkg
  svm  = Model$new(e1071::svm) # or another pkg
)

params &lt;- list(
  glm = list(method = &quot;glmnet&quot;,
             tuneGrid = data.frame(alpha = 0, lambda = 0.01), # for caret models, all hyperparameters must be provided
             trControl = caret::trainControl(method = &quot;none&quot;)),
  rf  = list(ntree = 150), # no need to specify all hyperparameters
  svm = list(kernel = &quot;radial&quot;,
             cost = 1,
             gamma = 0.1)
)

results &lt;- unifiedml::benchmark(models, X, y, cv = 5, params = params)

[1/3] Fitting model: glm
Mean CV score for glm: 0.9533

[2/3] Fitting model: rf
Mean CV score for rf: 0.9600

[3/3] Fitting model: svm
Mean CV score for svm: 0.9733

print(results) # 5-fold cross-validation results

$glm
$glm$avg_score
[1] 0.9533333

$glm$scores
    fold1     fold2     fold3     fold4     fold5 
0.9333333 0.9666667 0.9333333 0.9333333 1.0000000 


$rf
$rf$avg_score
[1] 0.96

$rf$scores
    fold1     fold2     fold3     fold4     fold5 
0.9333333 1.0000000 0.9333333 0.9333333 1.0000000 


$svm
$svm$avg_score
[1] 0.9733333

$svm$scores
    fold1     fold2     fold3     fold4     fold5 
0.9666667 1.0000000 0.9666667 0.9333333 1.0000000 

# collect the fold scores into a long data frame for plotting
model_vec &lt;- c()
fold_vec  &lt;- c()
score_vec &lt;- c()

for (model in names(results)) {
  scores &lt;- results[[model]]$scores

  model_vec &lt;- c(model_vec, rep(model, length(scores)))
  fold_vec  &lt;- c(fold_vec, names(scores))
  score_vec &lt;- c(score_vec, as.numeric(scores))
}

df &lt;- data.frame(
  model = model_vec,
  fold  = fold_vec,
  score = score_vec
)

library(ggplot2)

ggplot(df, aes(x = model, y = score, fill = model)) +
  geom_violin(trim = FALSE, alpha = 0.6) +
  geom_jitter(width = 0.08, size = 2) +
  theme_minimal() +
  labs(
    title = &quot;Cross-validation score distribution&quot;,
    x = &quot;Model&quot;,
    y = &quot;Score&quot;
  ) +
  theme(legend.position = &quot;none&quot;)
</pre>

<p><img src="https://i0.wp.com/thierrymoudiki.github.io/images/2026-05-09/2026-05-09-New-UnifiedML_9_0.png?w=578&#038;ssl=1" alt="image-title-here" class="img-responsive" data-recalc-dims="1" /></p>

<h1 id="2---unified-interface-for-predicting-probabilities">2 &#8211; Unified interface for predicting probabilities</h1>

<pre># Load required packages
library(unifiedml)
library(randomForest)
library(nnet)
library(e1071)

# Load iris dataset
data(iris)

# Setup reproducible data
set.seed(42)

# Create feature matrix (all 4 numeric features)
X &lt;- as.matrix(iris[, 1:4])
colnames(X) &lt;- c(&quot;Sepal.Length&quot;, &quot;Sepal.Width&quot;, &quot;Petal.Length&quot;, &quot;Petal.Width&quot;)

# Target: Species (multi-class with 3 levels)
y_multiclass &lt;- iris$Species

# Create binary classification target (Versicolor vs others)
y_binary &lt;- factor(
  ifelse(iris$Species == &quot;versicolor&quot;, &quot;versicolor&quot;, &quot;other&quot;),
  levels = c(&quot;other&quot;, &quot;versicolor&quot;)
)

# Split into train/test (75% train, 25% test)
set.seed(42)
train_idx &lt;- sample(1:nrow(X), size = floor(0.75 * nrow(X)), replace = FALSE)
test_idx &lt;- setdiff(1:nrow(X), train_idx)

X_train &lt;- X[train_idx, ]
X_test &lt;- X[test_idx, ]
y_train_multiclass &lt;- y_multiclass[train_idx]
y_test_multiclass &lt;- y_multiclass[test_idx]
y_train_binary &lt;- y_binary[train_idx]
y_test_binary &lt;- y_binary[test_idx]

cat(&quot;\n&quot;)
cat(&quot;============================================================================\n&quot;)
cat(&quot;IRIS DATASET - Summary\n&quot;)
cat(&quot;============================================================================\n&quot;)
cat(sprintf(&quot;Training samples: %d\n&quot;, nrow(X_train)))
cat(sprintf(&quot;Test samples: %d\n&quot;, nrow(X_test)))
cat(sprintf(&quot;Features: %d\n&quot;, ncol(X_train)))
cat(sprintf(&quot;Classes: %s\n&quot;, paste(levels(y_multiclass), collapse = &quot;, &quot;)))

# ============================================================================
# EXAMPLE 1: randomForest - Multi-class Classification on IRIS
# ============================================================================

cat(&quot;\n&quot;)
cat(&quot;============================================================================\n&quot;)
cat(&quot;EXAMPLE 1: randomForest - Multi-class Classification\n&quot;)
cat(&quot;============================================================================\n&quot;)

mod_rf &lt;- Model$new(randomForest::randomForest)
mod_rf$fit(X_train, y_train_multiclass, ntree = 100)

cat(&quot;\nPredicting probabilities for first 5 test samples:\n&quot;)
probs_rf &lt;- mod_rf$predict_proba(X_test[1:5, ])

cat(&quot;\nProbability matrix:\n&quot;)
print(round(probs_rf, 3))

cat(&quot;\nInterpretation:\n&quot;)
for(i in 1:5) {
  cat(sprintf(&quot;\nSample %d (Actual: %s):\n&quot;, i, as.character(y_test_multiclass[i])))
  cat(sprintf(&quot;  setosa:     %.1f%%\n&quot;, probs_rf[i, &quot;setosa&quot;] * 100))
  cat(sprintf(&quot;  versicolor: %.1f%%\n&quot;, probs_rf[i, &quot;versicolor&quot;] * 100))
  cat(sprintf(&quot;  virginica:  %.1f%%\n&quot;, probs_rf[i, &quot;virginica&quot;] * 100))
  cat(sprintf(&quot;  Predicted:  %s\n&quot;, colnames(probs_rf)[which.max(probs_rf[i, ])]))
}

# Get class predictions
pred_classes_rf &lt;- mod_rf$predict(X_test[1:5, ], type = &quot;class&quot;)
cat(&quot;\nPredicted classes (first 5):&quot;, as.character(pred_classes_rf), &quot;\n&quot;)
cat(&quot;Actual classes (first 5):   &quot;, as.character(y_test_multiclass[1:5]), &quot;\n&quot;)

# Calculate accuracy on full test set
probs_all_rf &lt;- mod_rf$predict_proba(X_test)
pred_all_rf &lt;- colnames(probs_all_rf)[apply(probs_all_rf, 1, which.max)]
accuracy_rf &lt;- mean(pred_all_rf == as.character(y_test_multiclass))
cat(sprintf(&quot;\nTest set accuracy: %.1f%%\n&quot;, accuracy_rf * 100))

# ============================================================================
# EXAMPLE 2: nnet - Multi-class Classification on IRIS
# ============================================================================

cat(&quot;\n&quot;)
cat(&quot;============================================================================\n&quot;)
cat(&quot;EXAMPLE 2: nnet - Multi-class Classification\n&quot;)
cat(&quot;============================================================================\n&quot;)

mod_nnet &lt;- Model$new(nnet::nnet)
mod_nnet$fit(X_train, y_train_multiclass, size = 10, maxit = 200, trace = FALSE)

cat(&quot;\nPredicting probabilities for first 5 test samples:\n&quot;)
probs_nnet &lt;- mod_nnet$predict_proba(X_test[1:5, ])

cat(&quot;\nProbability matrix (all 3 classes):\n&quot;)
print(round(probs_nnet, 3))

cat(&quot;\nDetailed predictions:\n&quot;)
for(i in 1:5) {
  cat(sprintf(&quot;\nSample %d (Actual: %s):\n&quot;, i, as.character(y_test_multiclass[i])))
  cat(sprintf(&quot;  setosa:     %.1f%%\n&quot;, probs_nnet[i, &quot;setosa&quot;] * 100))
  cat(sprintf(&quot;  versicolor: %.1f%%\n&quot;, probs_nnet[i, &quot;versicolor&quot;] * 100))
  cat(sprintf(&quot;  virginica:  %.1f%%\n&quot;, probs_nnet[i, &quot;virginica&quot;] * 100))
  cat(sprintf(&quot;  Predicted:  %s\n&quot;, colnames(probs_nnet)[which.max(probs_nnet[i, ])]))
}

# Get class predictions
pred_classes_nnet &lt;- mod_nnet$predict(X_test[1:5, ], type = &quot;class&quot;)
cat(&quot;\nPredicted classes (first 5):&quot;, as.character(pred_classes_nnet), &quot;\n&quot;)
cat(&quot;Actual classes (first 5):   &quot;, as.character(y_test_multiclass[1:5]), &quot;\n&quot;)

# Calculate accuracy
probs_all_nnet &lt;- mod_nnet$predict_proba(X_test)
pred_all_nnet &lt;- colnames(probs_all_nnet)[apply(probs_all_nnet, 1, which.max)]
accuracy_nnet &lt;- mean(pred_all_nnet == as.character(y_test_multiclass))
cat(sprintf(&quot;\nTest set accuracy: %.1f%%\n&quot;, accuracy_nnet * 100))

# ============================================================================
# EXAMPLE 3: SVM - Multi-class Classification on IRIS
# ============================================================================

cat(&quot;\n&quot;)
cat(&quot;============================================================================\n&quot;)
cat(&quot;EXAMPLE 3: SVM - Multi-class Classification\n&quot;)
cat(&quot;============================================================================\n&quot;)

mod_svm &lt;- Model$new(e1071::svm)
mod_svm$fit(X_train, y_train_multiclass, probability = TRUE, kernel = &quot;radial&quot;)

cat(&quot;\nPredicting probabilities for first 5 test samples:\n&quot;)
probs_svm &lt;- mod_svm$predict_proba(X_test[1:5, ])

cat(&quot;\nProbability matrix:\n&quot;)
print(round(probs_svm, 4))

cat(&quot;\nDetailed predictions:\n&quot;)
for(i in 1:5) {
  cat(sprintf(&quot;\nSample %d (Actual: %s):\n&quot;, i, as.character(y_test_multiclass[i])))
  cat(sprintf(&quot;  setosa:     %.1f%%\n&quot;, probs_svm[i, &quot;setosa&quot;] * 100))
  cat(sprintf(&quot;  versicolor: %.1f%%\n&quot;, probs_svm[i, &quot;versicolor&quot;] * 100))
  cat(sprintf(&quot;  virginica:  %.1f%%\n&quot;, probs_svm[i, &quot;virginica&quot;] * 100))
  cat(sprintf(&quot;  Predicted:  %s\n&quot;, colnames(probs_svm)[which.max(probs_svm[i, ])]))
}

# Calculate accuracy
probs_all_svm &lt;- mod_svm$predict_proba(X_test)
pred_all_svm &lt;- colnames(probs_all_svm)[apply(probs_all_svm, 1, which.max)]
accuracy_svm &lt;- mean(pred_all_svm == as.character(y_test_multiclass))
cat(sprintf(&quot;\nTest set accuracy: %.1f%%\n&quot;, accuracy_svm * 100))

============================================================================
IRIS DATASET - Summary
============================================================================
Training samples: 112
Test samples: 38
Features: 4
Classes: setosa, versicolor, virginica

============================================================================
EXAMPLE 1: randomForest - Multi-class Classification
============================================================================

Predicting probabilities for first 5 test samples:

Probability matrix:
  setosa versicolor virginica
1      1          0         0
2      1          0         0
3      1          0         0
4      1          0         0
5      1          0         0
attr(,&quot;assign&quot;)
[1] 1 1 1
attr(,&quot;contrasts&quot;)
attr(,&quot;contrasts&quot;)$pred
[1] &quot;contr.treatment&quot;

attr(,&quot;extraction_method&quot;)
[1] &quot;fallback::1&quot;
attr(,&quot;model_class&quot;)
[1] &quot;randomForest.formula&quot;

Interpretation:

Sample 1 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 2 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 3 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 4 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 5 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Predicted classes (first 5): setosa setosa setosa setosa setosa 
Actual classes (first 5):    setosa setosa setosa setosa setosa 

Test set accuracy: 94.7%

============================================================================
EXAMPLE 2: nnet - Multi-class Classification
============================================================================

Predicting probabilities for first 5 test samples:

Probability matrix (all 3 classes):
  setosa versicolor virginica
1      1          0         0
2      1          0         0
3      1          0         0
4      1          0         0
5      1          0         0
attr(,&quot;extraction_method&quot;)
[1] &quot;fallback::5&quot;
attr(,&quot;model_class&quot;)
[1] &quot;nnet.formula&quot;

Detailed predictions:

Sample 1 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 2 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 3 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 4 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 5 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Predicted classes (first 5): setosa setosa setosa setosa setosa 
Actual classes (first 5):    setosa setosa setosa setosa setosa 

Test set accuracy: 97.4%

============================================================================
EXAMPLE 3: SVM - Multi-class Classification
============================================================================

Predicting probabilities for first 5 test samples:

Probability matrix:
  setosa versicolor virginica
1      1          0         0
2      1          0         0
3      1          0         0
4      1          0         0
5      1          0         0
attr(,&quot;assign&quot;)
[1] 1 1 1
attr(,&quot;contrasts&quot;)
attr(,&quot;contrasts&quot;)$pred
[1] &quot;contr.treatment&quot;

attr(,&quot;extraction_method&quot;)
[1] &quot;fallback::1&quot;
attr(,&quot;model_class&quot;)
[1] &quot;svm.formula&quot;

Detailed predictions:

Sample 1 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 2 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 3 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 4 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Sample 5 (Actual: setosa):
  setosa:     100.0%
  versicolor: 0.0%
  virginica:  0.0%
  Predicted:  setosa

Test set accuracy: 94.7%
</pre>


<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://thierrymoudiki.github.io//blog/2026/05/09/r/New-UnifiedML"> T. Moudiki's Webpage - R</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/one-interface-almost-every-classifier-and-regressor-unifiedml-v0-3-0/">One interface, (Almost) Every Classifier (and Regressor): unifiedml v0.3.0</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">401135</post-id>	</item>
		<item>
		<title>Edge detection in Python</title>
		<link>https://www.r-bloggers.com/2026/05/edge-detection-in-python/</link>
		
		<dc:creator><![CDATA[Francisco de Abreu e Lima]]></dc:creator>
		<pubDate>Fri, 08 May 2026 19:56:22 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">http://poissonisfish.com/?p=10082</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> Great strides in artificial intelligence development during the last five years produced agents that are now commonplace at work and home. It is humbling to note that virtually all frontier large language models today trace back to a preprint introducing the transformer neural network architecture – a fifteen-page paper that profoundly ...</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/edge-detection-in-python/">Edge detection in Python</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/"> poissonisfish</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report an issue with the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>

<figure class="wp-block-image"><img data-attachment-id="10046" data-permalink="https://poissonisfish.com/?attachment_id=10046" data-orig-file="https://i0.wp.com/poissonisfish.com/wp-content/uploads/2026/05/butterfly_canny.png?w=578&#038;ssl=1" data-orig-size="2756,1824" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="butterfly_canny" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/poissonisfish.com/wp-content/uploads/2026/05/butterfly_canny.png?w=578&#038;ssl=1?w=1024" src="https://i0.wp.com/poissonisfish.com/wp-content/uploads/2026/05/butterfly_canny.png?w=578&#038;ssl=1" alt="" class="wp-image-10046" data-recalc-dims="1" /><figcaption class="wp-element-caption">Edge detection is ubiquitous in animal vision and yet poorly understood. Canny edge detection on <em>Polygonia c-album</em> (Portugal, 2010)</figcaption></figure>



<p class="wp-block-paragraph">Great strides in artificial intelligence development over the last five years have produced agents that are now commonplace at work and at home. It is humbling to note that virtually all frontier large language models today trace back to a preprint introducing the transformer neural network architecture<sup data-fn="563b7add-8b04-4fd2-a688-2383895c42c9" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#563b7add-8b04-4fd2-a688-2383895c42c9" id="563b7add-8b04-4fd2-a688-2383895c42c9-link" rel="nofollow" target="_blank">1</a></sup> – a fifteen-page paper that profoundly rocked the world through waves of excitement and angst.</p>



<p class="wp-block-paragraph">This paradigm shift in model design has also heavily influenced computer vision, leading to a surge in vision-language models (VLMs). Not only can such systems easily generalize across tasks such as segmentation, depth estimation and image generation or editing<sup data-fn="bd3dda1f-27f7-44e9-8d0f-9a42c28ed201" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#bd3dda1f-27f7-44e9-8d0f-9a42c28ed201" id="bd3dda1f-27f7-44e9-8d0f-9a42c28ed201-link" rel="nofollow" target="_blank">2</a></sup>, they have also blown legacy models out of the water in object detection benchmarks, with little to no fine-tuning<sup data-fn="63b9bf52-e78a-405a-bc33-5082dc51f74e" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#63b9bf52-e78a-405a-bc33-5082dc51f74e" id="63b9bf52-e78a-405a-bc33-5082dc51f74e-link" rel="nofollow" target="_blank">3</a></sup>.</p>



<p class="wp-block-paragraph">However, it should not be lightly assumed that the transformer architecture is the only path forward to a more meaningful, cost-effective or even better-performing AI – not when we are still having <a href="https://techcrunch.com/2024/08/27/why-ai-cant-spell-strawberry/" rel="nofollow" target="_blank">trouble counting “r” in the word <em>strawberry</em></a>. Neuromorphic computation<sup data-fn="9740d4a1-35a7-4003-9e9f-63a4fa16b90b" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#9740d4a1-35a7-4003-9e9f-63a4fa16b90b" id="9740d4a1-35a7-4003-9e9f-63a4fa16b90b-link" rel="nofollow" target="_blank">4</a></sup>, photonic neural networks<sup data-fn="58f9bd6d-ec82-4b9b-8285-5f8b083184ad" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#58f9bd6d-ec82-4b9b-8285-5f8b083184ad" id="58f9bd6d-ec82-4b9b-8285-5f8b083184ad-link" rel="nofollow" target="_blank">5</a></sup>, JEPA<sup data-fn="7b848804-3070-4e9f-bc5f-ea3e60f0bf14" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#7b848804-3070-4e9f-bc5f-ea3e60f0bf14" id="7b848804-3070-4e9f-bc5f-ea3e60f0bf14-link" rel="nofollow" target="_blank">6</a></sup> and many other techniques have recently shown us different ways to design and implement intelligent systems that produce optimal solutions for a variety of problems.</p>



<p class="wp-block-paragraph">Today I want to focus on a topic from a timeless book that inspired me to think differently and, particularly, to effectively apply a first principles approach to problem-solving. The topic is edge detection, and that book is <em>Vision</em>, by David Marr<sup data-fn="d6a1d31f-a93f-4844-8b90-70b4da955016" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#d6a1d31f-a93f-4844-8b90-70b4da955016" id="d6a1d31f-a93f-4844-8b90-70b4da955016-link" rel="nofollow" target="_blank">7</a></sup>. Just as <em>On the Origin of Species</em><sup data-fn="210cf9a3-ea19-4402-b37e-c2535bd96366" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#210cf9a3-ea19-4402-b37e-c2535bd96366" id="210cf9a3-ea19-4402-b37e-c2535bd96366-link" rel="nofollow" target="_blank">8</a></sup> and <em>On Growth and Form</em><sup data-fn="8d6afae8-ee06-495e-9b78-9705d6088f63" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#8d6afae8-ee06-495e-9b78-9705d6088f63" id="8d6afae8-ee06-495e-9b78-9705d6088f63-link" rel="nofollow" target="_blank">9</a></sup>, this is yet another masterpiece that brought together different disciplines – in this case neurophysiology and computer vision – to revolutionise science.</p>



<p class="wp-block-paragraph">In this blog post we will define and compare algorithms for image edge detection, and explore their remarkable similarity with neurophysiological readings.</p>



<h1 class="wp-block-heading">Introduction</h1>



<p class="wp-block-paragraph">Modern computer vision is deeply rooted in Marr’s pioneering work. To understand any information-processing system, Marr argued, one must describe it at three interdependent levels of analysis:</p>



<ul class="wp-block-list">
<li>The computational level – <strong>what</strong> problem is being solved and <strong>why</strong> (e.g. edge detection)</li>



<li>The algorithmic level – <strong>how</strong> it is solved, and what representations and procedures are used (e.g. the Laplacian of Gaussian operator)</li>



<li>The implementational level – <strong>where</strong> it is physically realised (e.g. <em>in vivo</em>, <em>in silico</em>)</li>
</ul>



<p class="wp-block-paragraph">This layered thinking is what makes the book so enduring. Marr was not merely describing the visual system; he was arguing that to truly understand it you had to explain it at all three levels simultaneously. The book also features memorable passages on random dot stereograms<sup data-fn="ab32dfeb-f0f5-476f-9b2a-ad0939f8514f" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#ab32dfeb-f0f5-476f-9b2a-ad0939f8514f" id="ab32dfeb-f0f5-476f-9b2a-ad0939f8514f-link" rel="nofollow" target="_blank">10</a></sup>, binocular disparity and motion perception – overall, highly recommended for science enthusiasts.</p>



<p class="wp-block-paragraph">Let us now introduce the key concepts underlying edge detection that leveraged this structured approach, to gain a better understanding of how it can be solved in practice.</p>



<h2 class="wp-block-heading">Zero-crossing <img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/270f.png?w=578&#038;ssl=1" alt="&#x270f;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" /></h2>



<p class="wp-block-paragraph">From a computational perspective, edges are fundamentally spatial discontinuities in images. If for a brief moment we consider a simple greyscale image, edges are wherever dark-to-light and light-to-dark transitions occur, in whatever direction.</p>



<p class="wp-block-paragraph">Because such transitions mathematically translate to local changes in pixel intensity, the most natural approach to identify edges is to compute <strong>image gradients</strong>, the two-dimensional equivalent of the derivative. The first derivative of image intensity evaluated across an edge produces a peak (for a dark-to-bright transition) or a trough (for a bright-to-dark transition), depending on the direction. However, the <strong>second derivative</strong> provides not only the means to identify both transition types, but also a beautifully simple detection mechanism: it crosses zero at the precise location of the edge. This is the essence of the <strong>zero-crossing</strong>.</p>
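<p class="wp-block-paragraph">The zero-crossing idea is easy to verify numerically. The following sketch (my own illustration, not code from the post) differentiates a smooth dark-to-bright step twice with NumPy and locates the sign flip of the second derivative:</p>

```python
import numpy as np

# A smooth dark-to-bright step edge in one dimension
x = np.linspace(-5, 5, 201)
signal = 1 / (1 + np.exp(-4 * x))  # sigmoidal step centred at x = 0

# First and second derivatives via finite differences
d1 = np.gradient(signal, x)
d2 = np.gradient(d1, x)

# The first derivative peaks at the edge...
peak = x[np.argmax(d1)]

# ...while the second derivative changes sign there: the zero-crossing
crossings = np.where(np.diff(np.sign(d2)) != 0)[0]
edge = x[crossings[0]]

print(peak, edge)  # both land near x = 0
```

<p class="wp-block-paragraph">Note that <code>np.diff(np.sign(d2))</code> registers any sign change as a crossing; practical detectors additionally threshold the local gradient magnitude so that near-flat regions do not fire.</p>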



<figure class="wp-block-image size-full"><img data-attachment-id="9914" data-permalink="https://poissonisfish.com/?attachment_id=9914" src="https://poissonisfish.com/wp-content/uploads/2026/03/step_edge_derivatives.png" alt="First and second derivatives of a one-dimensional signal" class="wp-image-9914" /><figcaption class="wp-element-caption">First and second derivatives of a one-dimensional signal. Left: signal displaying low-to-high (dark-to-bright) transition. Middle: first derivative of the signal, capturing that transition as a peak. Right: second derivative of the signal, exhibiting the zero-crossing.</figcaption></figure>



<p class="wp-block-paragraph">Marr and Hildreth formalised this insight by proposing the <strong>Laplacian of Gaussian</strong> (LoG) as the operator of choice<sup data-fn="4b5583df-7b37-4973-9b87-cd2a13af4711" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#4b5583df-7b37-4973-9b87-cd2a13af4711" id="4b5583df-7b37-4973-9b87-cd2a13af4711-link" rel="nofollow" target="_blank">11</a></sup>. The Laplacian <img src="https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\nabla^2 " class="latex" /> is the sum of second partial derivatives in both spatial dimensions:</p>



<p class="wp-block-paragraph"><img src="https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+f+%3D+%5Cfrac%7B%5Cpartial%5E2+f%7D%7B%5Cpartial+x%5E2%7D+%2B+%5Cfrac%7B%5Cpartial%5E2+f%7D%7B%5Cpartial+y%5E2%7D+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=2&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+f+%3D+%5Cfrac%7B%5Cpartial%5E2+f%7D%7B%5Cpartial+x%5E2%7D+%2B+%5Cfrac%7B%5Cpartial%5E2+f%7D%7B%5Cpartial+y%5E2%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=2&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+f+%3D+%5Cfrac%7B%5Cpartial%5E2+f%7D%7B%5Cpartial+x%5E2%7D+%2B+%5Cfrac%7B%5Cpartial%5E2+f%7D%7B%5Cpartial+y%5E2%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=2&#038;c=20201002&#038;zoom=4.5 4x" alt="\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} " class="latex" /></p>



<p class="wp-block-paragraph">Applied directly to a noisy image, the Laplacian amplifies every small intensity fluctuation. The Gaussian pre-filter <img src="https://s0.wp.com/latex.php?latex=G_%5Csigma+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=G_%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=G_%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="G_\sigma " class="latex" /> solves this by smoothing the image at a chosen scale <img src="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\sigma " class="latex" /> before differentiation. Because convolution is associative, the two steps can be combined into a single kernel – the LoG, also denoted <img src="https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+G+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+G+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+G+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\nabla^2 G " class="latex" />:</p>



<p class="wp-block-paragraph"><img src="https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+G_%5Csigma%28x%2C+y%29+%3D+-%5Cfrac%7B1%7D%7B%5Cpi%5Csigma%5E4%7D%5Cleft%281+-+%5Cfrac%7Bx%5E2+%2B+y%5E2%7D%7B2%5Csigma%5E2%7D%5Cright%29e%5E%7B-%5Cfrac%7Bx%5E2%2By%5E2%7D%7B2%5Csigma%5E2%7D%7D+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=2&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+G_%5Csigma%28x%2C+y%29+%3D+-%5Cfrac%7B1%7D%7B%5Cpi%5Csigma%5E4%7D%5Cleft%281+-+%5Cfrac%7Bx%5E2+%2B+y%5E2%7D%7B2%5Csigma%5E2%7D%5Cright%29e%5E%7B-%5Cfrac%7Bx%5E2%2By%5E2%7D%7B2%5Csigma%5E2%7D%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=2&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Cnabla%5E2+G_%5Csigma%28x%2C+y%29+%3D+-%5Cfrac%7B1%7D%7B%5Cpi%5Csigma%5E4%7D%5Cleft%281+-+%5Cfrac%7Bx%5E2+%2B+y%5E2%7D%7B2%5Csigma%5E2%7D%5Cright%29e%5E%7B-%5Cfrac%7Bx%5E2%2By%5E2%7D%7B2%5Csigma%5E2%7D%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=2&#038;c=20201002&#038;zoom=4.5 4x" alt="\nabla^2 G_\sigma(x, y) = -\frac{1}{\pi\sigma^4}\left(1 - \frac{x^2 + y^2}{2\sigma^2}\right)e^{-\frac{x^2+y^2}{2\sigma^2}} " class="latex" /></p>



<p class="wp-block-paragraph">This kernel, which resembles an inverted sombrero and is sometimes called the <strong>Mexican hat wavelet</strong>, produces a response that crosses zero exactly at an intensity edge. The width of the Gaussian <img src="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\sigma " class="latex" /> determines the scale of detection: small <img src="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\sigma " class="latex" /> preserves fine detail, large <img src="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\sigma " class="latex" /> captures only coarse structure. Marr argued that the visual system operates simultaneously at multiple scales – an idea that would later resonate in scale-space theory and, much later, in the multi-scale feature maps of deep convolutional networks.</p>


<div class="wp-block-image wp-image-10036 size-large">
<figure class="aligncenter size-full is-resized"><img data-attachment-id="10036" data-permalink="https://poissonisfish.com/?attachment_id=10036" data-orig-file="https://i1.wp.com/poissonisfish.com/wp-content/uploads/2026/05/laplacian_of_gaussian.png?w=578&#038;ssl=1" data-orig-size="1179,980" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="laplacian_of_gaussian" data-image-description="" data-image-caption="" data-large-file="https://i1.wp.com/poissonisfish.com/wp-content/uploads/2026/05/laplacian_of_gaussian.png?w=578&#038;ssl=1?w=1024" src="https://i1.wp.com/poissonisfish.com/wp-content/uploads/2026/05/laplacian_of_gaussian.png?w=578&#038;ssl=1" alt="" class="wp-image-10036" style="aspect-ratio:1.2048252458447897;object-fit:cover;width:650px" data-recalc-dims="1" /><figcaption class="wp-element-caption">One-dimensional cross-section of the Laplacian of Gaussian. The characteristic positive central lobe flanked by two negative side lobes – the “Mexican hat” – is what produces a zero-crossing wherever image intensity changes sharply.</figcaption></figure>
</div>


<p class="wp-block-paragraph">Marr went further and showed that the centre-surround organisation of <strong>retinal ganglion cell</strong> receptive fields – which he modelled as a <strong>Difference of Gaussians</strong> (DoG), the difference between a narrow excitatory Gaussian and a broader inhibitory one – is a close biological approximation of the LoG. Put differently, your retina is already computing zero-crossings before the signal ever reaches the visual cortex. The agreement between computational predictions and <em>in vivo</em> electrophysiological measurements, documented in <em>Vision</em> (p. 64), remains one of the most compelling examples of theory meeting experiment in all of neuroscience.</p>
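<p class="wp-block-paragraph">The DoG-to-LoG approximation can be verified numerically. The sketch below (pure NumPy, using the surround-to-centre ratio of 1.6 that Marr and Hildreth recommended) compares a 1-D DoG against the LoG profile on the same grid:</p>

```python
import numpy as np

x = np.linspace(-8, 8, 401)
sigma = 1.0
ratio = 1.6  # surround/centre ratio recommended by Marr and Hildreth

def gauss(x, s):
    """Normalised 1-D Gaussian."""
    return np.exp(-x**2 / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)

# Second derivative of the Gaussian (1-D LoG) and the Difference of Gaussians
log = (x**2 / sigma**4 - 1 / sigma**2) * gauss(x, sigma)
dog = gauss(x, sigma) - gauss(x, ratio * sigma)  # narrow minus broad

# Normalise shapes; the sign flip matches the DoG's positive centre
# to the LoG's negative centre under this sign convention
log_n = log / np.abs(log).max()
dog_n = -dog / np.abs(dog).max()

corr = np.corrcoef(log_n, dog_n)[0, 1]
print(round(corr, 3))  # close to 1: the two profiles nearly coincide
```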


<div class="wp-block-image wp-image-10033 size-large">
<figure class="aligncenter"><img loading="lazy" data-attachment-id="10033" data-permalink="https://poissonisfish.com/?attachment_id=10033" data-orig-file="https://poissonisfish.com/wp-content/uploads/2026/05/zero_cross_deblurred.png" data-orig-size="3537,2418" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="zero_cross_deblurred" data-image-description="" data-image-caption="" data-large-file="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/zero_cross_deblurred.png?w=450&#038;ssl=1" src="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/zero_cross_deblurred.png?w=450&#038;ssl=1" alt="" class="wp-image-10033" srcset_temp="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/zero_cross_deblurred.png?w=450&#038;ssl=1 1024w, https://poissonisfish.com/wp-content/uploads/2026/05/zero_cross_deblurred.png?w=2048 2048w, https://poissonisfish.com/wp-content/uploads/2026/05/zero_cross_deblurred.png?w=150 150w, https://poissonisfish.com/wp-content/uploads/2026/05/zero_cross_deblurred.png?w=300 300w, https://poissonisfish.com/wp-content/uploads/2026/05/zero_cross_deblurred.png?w=768 768w, https://poissonisfish.com/wp-content/uploads/2026/05/zero_cross_deblurred.png?w=1440 1440w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /><figcaption class="wp-element-caption">Response of the LoG operator to three idealised stimuli: a step edge, a thin bar and a wide bar (columns). Top: input intensity profile. Middle: filter response. Bottom: histogram of intracellular recordings from cat retinal X-cells exposed to analogous stimuli, after Marr (1982). The agreement is striking and forms the basis for the claim that retinal ganglion cells implement a biological LoG.</figcaption></figure>
</div>


<p class="wp-block-paragraph">Zero-crossings are theoretically elegant, but the workhorse operators most computer-vision tools reach for – including, as we will see, the Canny detector itself – operate on first-derivative gradients. Let us look at those.</p>



<h2 class="wp-block-heading">Image gradients <img src="https://i1.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f4d0.png?w=578&#038;ssl=1" alt="&#x1f4d0;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" /></h2>



<p class="wp-block-paragraph">In practice, image gradients are computed using <strong>convolution filters</strong> – small kernels that slide across the image and produce a weighted local sum at each pixel, as illustrated in my <a href="https://poissonisfish.com/2018/07/08/convolutional-neural-networks-in-r/" rel="nofollow" target="_blank">post on convolutional neural networks</a>. The two most widely used first-order gradient operators are:</p>



<p class="wp-block-paragraph"><strong>Sobel:</strong> weights the central row and column more heavily, providing a modest degree of smoothing:</p>



<p class="wp-block-paragraph"><img src="https://s0.wp.com/latex.php?latex=G_x+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+0+%26+1+%5C%5C+-2+%26+0+%26+2+%5C%5C+-1+%26+0+%26+1+%5Cend%7Bbmatrix%7D%2C+%5Cquad+G_y+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+-2+%26+-1+%5C%5C+0+%26+0+%26+0+%5C%5C+1+%26+2+%26+1+%5Cend%7Bbmatrix%7D+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=2&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=G_x+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+0+%26+1+%5C%5C+-2+%26+0+%26+2+%5C%5C+-1+%26+0+%26+1+%5Cend%7Bbmatrix%7D%2C+%5Cquad+G_y+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+-2+%26+-1+%5C%5C+0+%26+0+%26+0+%5C%5C+1+%26+2+%26+1+%5Cend%7Bbmatrix%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=2&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=G_x+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+0+%26+1+%5C%5C+-2+%26+0+%26+2+%5C%5C+-1+%26+0+%26+1+%5Cend%7Bbmatrix%7D%2C+%5Cquad+G_y+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+-2+%26+-1+%5C%5C+0+%26+0+%26+0+%5C%5C+1+%26+2+%26+1+%5Cend%7Bbmatrix%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=2&#038;c=20201002&#038;zoom=4.5 4x" alt="G_x = \begin{bmatrix} -1 &#038; 0 &#038; 1 \\ -2 &#038; 0 &#038; 2 \\ -1 &#038; 0 &#038; 1 \end{bmatrix}, \quad G_y = \begin{bmatrix} -1 &#038; -2 &#038; -1 \\ 0 &#038; 0 &#038; 0 \\ 1 &#038; 2 &#038; 1 \end{bmatrix} " class="latex" /></p>



<p class="wp-block-paragraph"><strong>Prewitt:</strong> weights all neighbours equally:</p>



<p class="wp-block-paragraph"><img src="https://s0.wp.com/latex.php?latex=G_x+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+0+%26+1+%5C%5C+-1+%26+0+%26+1+%5C%5C+-1+%26+0+%26+1+%5Cend%7Bbmatrix%7D%2C+%5Cquad+G_y+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+-1+%26+-1+%5C%5C+0+%26+0+%26+0+%5C%5C+1+%26+1+%26+1+%5Cend%7Bbmatrix%7D+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=2&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=G_x+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+0+%26+1+%5C%5C+-1+%26+0+%26+1+%5C%5C+-1+%26+0+%26+1+%5Cend%7Bbmatrix%7D%2C+%5Cquad+G_y+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+-1+%26+-1+%5C%5C+0+%26+0+%26+0+%5C%5C+1+%26+1+%26+1+%5Cend%7Bbmatrix%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=2&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=G_x+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+0+%26+1+%5C%5C+-1+%26+0+%26+1+%5C%5C+-1+%26+0+%26+1+%5Cend%7Bbmatrix%7D%2C+%5Cquad+G_y+%3D+%5Cbegin%7Bbmatrix%7D+-1+%26+-1+%26+-1+%5C%5C+0+%26+0+%26+0+%5C%5C+1+%26+1+%26+1+%5Cend%7Bbmatrix%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=2&#038;c=20201002&#038;zoom=4.5 4x" alt="G_x = \begin{bmatrix} -1 &#038; 0 &#038; 1 \\ -1 &#038; 0 &#038; 1 \\ -1 &#038; 0 &#038; 1 \end{bmatrix}, \quad G_y = \begin{bmatrix} -1 &#038; -1 &#038; -1 \\ 0 &#038; 0 &#038; 0 \\ 1 &#038; 1 &#038; 1 \end{bmatrix} " class="latex" /></p>



<p class="wp-block-paragraph">In both cases the gradient magnitude at each pixel is <img src="https://s0.wp.com/latex.php?latex=%5C%7C%5Cnabla+f%5C%7C+%3D+%5Csqrt%7BG_x%5E2+%2B+G_y%5E2%7D+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5C%7C%5Cnabla+f%5C%7C+%3D+%5Csqrt%7BG_x%5E2+%2B+G_y%5E2%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5C%7C%5Cnabla+f%5C%7C+%3D+%5Csqrt%7BG_x%5E2+%2B+G_y%5E2%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\|\nabla f\| = \sqrt{G_x^2 + G_y^2} " class="latex" />, and the gradient direction is <img src="https://s0.wp.com/latex.php?latex=%5Ctheta+%3D+%5Carctan%28G_y+%2F+G_x%29+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Ctheta+%3D+%5Carctan%28G_y+%2F+G_x%29+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Ctheta+%3D+%5Carctan%28G_y+%2F+G_x%29+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\theta = \arctan(G_y / G_x) " class="latex" />. Where the magnitude is large, a transition is occurring; where it is small, the neighbourhood is uniform.</p>
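<p class="wp-block-paragraph">A tiny worked example makes the arithmetic concrete. The sketch below (assuming SciPy; <code>correlate</code> implements the sliding-window weighted sum without kernel flipping) applies the Sobel pair to a synthetic vertical step edge:</p>

```python
import numpy as np
from scipy.ndimage import correlate

# Toy image: a vertical step edge (dark left half, bright right half)
img = np.zeros((5, 5))
img[:, 2:] = 1.0

sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

Gx = correlate(img, sobel_x)
Gy = correlate(img, sobel_y)
magnitude = np.hypot(Gx, Gy)
theta = np.arctan2(Gy, Gx)  # atan2 avoids dividing by zero when Gx == 0

# The centre pixel sits on the edge: a purely horizontal gradient
print(magnitude[2, 2], np.degrees(theta[2, 2]))
```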



<figure class="wp-block-image size-full"><img data-attachment-id="10175" data-permalink="https://poissonisfish.com/2026/05/08/edge-detection-in-python/prewitt_sobel/" data-orig-file="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/prewitt_sobel.png?w=578&#038;ssl=1" data-orig-size="2800,1000" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="prewitt_sobel" data-image-description="" data-image-caption="" data-large-file="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/prewitt_sobel.png?w=578&#038;ssl=1?w=1024" src="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/prewitt_sobel.png?w=578&#038;ssl=1" alt="" class="wp-image-10175" data-recalc-dims="1" /><figcaption class="wp-element-caption">Effect of Prewitt and Sobel operators on a natural image of a brick floor (gradient magnitude).</figcaption></figure>



<p class="wp-block-paragraph">The limitation of these operators is that they are sensitive to noise and produce thick, diffuse edges. Every pixel with a large gradient is flagged regardless of whether it truly lies on the edge or merely near it. This is precisely the problem that John Canny set out to solve.</p>



<h2 class="wp-block-heading">Canny edge detection <img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f50d.png?w=578&#038;ssl=1" alt="&#x1f50d;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" /></h2>



<p class="wp-block-paragraph">Published in 1986, John Canny’s paper <em>A Computational Approach to Edge Detection</em><sup data-fn="b51508a4-906a-492d-87d6-4a3a4971d9cc" class="fn"><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#b51508a4-906a-492d-87d6-4a3a4971d9cc" id="b51508a4-906a-492d-87d6-4a3a4971d9cc-link" rel="nofollow" target="_blank">12</a></sup> remains one of the most cited works in computer vision. Canny framed edge detection as an explicit optimisation problem and derived a detector that simultaneously maximises three criteria: <em>i</em>) good detection (few missed edges, few “false alarms”), <em>ii</em>) good localisation (detected edges close to true edges), and <em>iii</em>) single response (one response per edge, not many). The resulting algorithm is a four-step pipeline outlined below:</p>



<h3 class="wp-block-heading">Step 1 – Gaussian smoothing</h3>



<p class="wp-block-paragraph">As with the LoG, the first step is to suppress noise by convolving the image with a Gaussian kernel of width <img src="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\sigma " class="latex" />. The choice of <img src="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\sigma " class="latex" /> directly governs the trade-off between noise suppression and fine detail preservation. A larger <img src="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\sigma " class="latex" /> removes more noise but blurs genuine edges.</p>
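<p class="wp-block-paragraph">In discrete form, the kernel size must also be chosen. A common heuristic, illustrated below (a sketch, not an OpenCV requirement), is a half-width of 3<em>σ</em>, since a Gaussian keeps roughly 99.7% of its mass within ±3<em>σ</em>:</p>

```python
import math
import numpy as np

sigma = 1.4
half = int(np.ceil(3 * sigma))  # half-width covering +/- 3 sigma
ksize = 2 * half + 1            # odd kernel size

# Fraction of a continuous Gaussian's mass inside +/- 3 sigma
inside = math.erf(3 / math.sqrt(2))
print(ksize, round(inside, 4))

# The discrete, normalised kernel actually used for smoothing
x = np.arange(-half, half + 1)
kernel = np.exp(-x**2 / (2 * sigma**2))
kernel /= kernel.sum()  # normalise so smoothing preserves mean intensity
```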



<h3 class="wp-block-heading">Step 2 – Gradient computation</h3>



<p class="wp-block-paragraph">The smoothed image is then differentiated – typically using Sobel kernels – to obtain the gradient magnitude <img src="https://s0.wp.com/latex.php?latex=%5C%7C%5Cnabla+f%5C%7C+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5C%7C%5Cnabla+f%5C%7C+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5C%7C%5Cnabla+f%5C%7C+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\|\nabla f\| " class="latex" /> and direction <img src="https://s0.wp.com/latex.php?latex=%5Ctheta+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Ctheta+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Ctheta+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\theta " class="latex" /> at every pixel.</p>



<h3 class="wp-block-heading">Step 3 – Non-maximum suppression (NMS)</h3>



<p class="wp-block-paragraph">This step thins the edges. For each pixel, Canny checks whether its gradient magnitude is a local maximum along the gradient direction – that is, whether it is larger than its two neighbours in the direction <img src="https://s0.wp.com/latex.php?latex=%5Ctheta+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Ctheta+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Ctheta+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\theta " class="latex" />. If it is not, it is suppressed to zero. The result is a set of thin, one-pixel-wide candidate edges.</p>
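<p class="wp-block-paragraph">A minimal sketch of NMS, with gradient directions quantised to the four usual orientations (an illustrative implementation rather than the one inside OpenCV):</p>

```python
import numpy as np

def non_max_suppression(mag, theta):
    """Keep a pixel only if it is a local maximum along its gradient direction.
    Directions are quantised to 0/45/90/135 degrees."""
    out = np.zeros_like(mag)
    angle = (np.degrees(theta) + 180) % 180  # fold into [0, 180)
    for i in range(1, mag.shape[0] - 1):
        for j in range(1, mag.shape[1] - 1):
            a = angle[i, j]
            if a < 22.5 or a >= 157.5:      # horizontal gradient
                n1, n2 = mag[i, j - 1], mag[i, j + 1]
            elif a < 67.5:                  # 45-degree gradient
                n1, n2 = mag[i - 1, j + 1], mag[i + 1, j - 1]
            elif a < 112.5:                 # vertical gradient
                n1, n2 = mag[i - 1, j], mag[i + 1, j]
            else:                           # 135-degree gradient
                n1, n2 = mag[i - 1, j - 1], mag[i + 1, j + 1]
            if mag[i, j] >= n1 and mag[i, j] >= n2:
                out[i, j] = mag[i, j]
    return out

# A blurry vertical edge: the gradient ridge runs along column 2
mag = np.array([[0, 1, 3, 1, 0]] * 5, dtype=float)
theta = np.zeros_like(mag)  # gradient points along x everywhere
thinned = non_max_suppression(mag, theta)
print(thinned[2])  # only the ridge column survives
```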



<h3 class="wp-block-heading">Step 4 – Hysteresis thresholding</h3>



<p class="wp-block-paragraph">The final step uses <strong>two thresholds</strong>, <img src="https://s0.wp.com/latex.php?latex=T_%5Ctext%7Bhigh%7D+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=T_%5Ctext%7Bhigh%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=T_%5Ctext%7Bhigh%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="T_\text{high} " class="latex" /> and <img src="https://s0.wp.com/latex.php?latex=T_%5Ctext%7Blow%7D+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=T_%5Ctext%7Blow%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=T_%5Ctext%7Blow%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="T_\text{low} " class="latex" />, to prune candidate pixels. A pixel whose gradient magnitude exceeds <img src="https://s0.wp.com/latex.php?latex=T_%5Ctext%7Bhigh%7D+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=T_%5Ctext%7Bhigh%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=T_%5Ctext%7Bhigh%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="T_\text{high} " class="latex" /> is accepted as an edge, and conversely a pixel below <img src="https://s0.wp.com/latex.php?latex=T_%5Ctext%7Blow%7D+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=T_%5Ctext%7Blow%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=T_%5Ctext%7Blow%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="T_\text{low} " class="latex" /> is rejected. 
A pixel between the two thresholds is accepted only if it is connected, directly or through other such pixels, to a strong edge pixel. This connectivity analysis – the defining feature of hysteresis – ensures that long, continuous edges are preserved even when their local gradient fluctuates, while isolated noise responses are discarded. For a more visual understanding of NMS and hysteresis I recommend reading the <a href="https://docs.opencv.org/4.x/da/d22/tutorial_py_canny.html" rel="nofollow" target="_blank">Canny edge detection</a> documentation from OpenCV.</p>
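<p class="wp-block-paragraph">The connectivity analysis can be sketched compactly with connected-component labelling, as below (an illustrative implementation assuming SciPy, not the one inside OpenCV):</p>

```python
import numpy as np
from scipy.ndimage import label

def hysteresis(mag, t_low, t_high):
    """Minimal hysteresis sketch: keep weak pixels only when their
    connected component also contains a strong pixel."""
    strong = mag >= t_high
    candidate = mag >= t_low                    # strong and weak pixels
    # 8-connected components of all candidate pixels
    labels, n = label(candidate, structure=np.ones((3, 3)))
    keep = np.zeros(n + 1, dtype=bool)
    keep[np.unique(labels[strong])] = True      # components touching a strong pixel
    keep[0] = False                             # background label
    return keep[labels]

mag = np.array([
    [0,  0,   0,  0, 0],
    [0, 60, 120, 60, 0],   # weak-strong-weak chain: all three kept
    [0,  0,   0,  0, 0],
    [0, 60,   0,  0, 0],   # isolated weak pixel: dropped
    [0,  0,   0,  0, 0],
], dtype=float)

edges = hysteresis(mag, t_low=50, t_high=100)
print(edges.astype(int))
```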


<div class="wp-block-image wp-image-10050 size-large">
<figure class="aligncenter"><img loading="lazy" data-attachment-id="10050" data-permalink="https://poissonisfish.com/?attachment_id=10050" data-orig-file="https://poissonisfish.com/wp-content/uploads/2026/05/canny_workflow.png" data-orig-size="3004,1638" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="canny_workflow" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/poissonisfish.com/wp-content/uploads/2026/05/canny_workflow.png?w=450&#038;ssl=1" src="https://i0.wp.com/poissonisfish.com/wp-content/uploads/2026/05/canny_workflow.png?w=450&#038;ssl=1" alt="" class="wp-image-10050" srcset_temp="https://i0.wp.com/poissonisfish.com/wp-content/uploads/2026/05/canny_workflow.png?w=450&#038;ssl=1 1024w, https://poissonisfish.com/wp-content/uploads/2026/05/canny_workflow.png?w=2048 2048w, https://poissonisfish.com/wp-content/uploads/2026/05/canny_workflow.png?w=150 150w, https://poissonisfish.com/wp-content/uploads/2026/05/canny_workflow.png?w=300 300w, https://poissonisfish.com/wp-content/uploads/2026/05/canny_workflow.png?w=768 768w, https://poissonisfish.com/wp-content/uploads/2026/05/canny_workflow.png?w=1440 1440w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /><figcaption class="wp-element-caption">The Canny pipeline applied to a sample image. From left to right: original greyscale input, Gaussian-blurred image, Sobel gradient magnitude, output of NMS, and final edge map after hysteresis thresholding. Each step removes a specific failure mode of the previous one (AI-generated image)</figcaption></figure>
</div>


<p class="wp-block-paragraph">The elegance of Canny lies in how each step addresses a specific failure mode of earlier operators. In essence, Gaussian smoothing tackles noise, NMS tackles thick edges, and hysteresis tackles the false-edge / broken-edge trade-off that a single threshold cannot resolve.</p>



<h1 class="wp-block-heading">Let’s get started with Python</h1>



<p class="wp-block-paragraph">Time to practice! We will first build the separate components (Gaussian blur, Sobel gradients, LoG zero-crossings), then run the full Canny pipeline and explore how its parameters trade off recall against noise. We will use <code>opencv</code> and <code>scikit-image</code> alongside the usual suspects <code>numpy</code> and <code>matplotlib</code>. You can install all packages using the following shell command:</p>



<p class="wp-block-paragraph"><code>pip install opencv-python scikit-image matplotlib numpy</code></p>



<h2 class="wp-block-heading">Image loading and preprocessing</h2>



<p class="wp-block-paragraph">We start by loading a greyscale image. For demonstration purposes I use a stock picture from <code>scikit-image</code> – feel free to use any other image of your choice.</p>


<div class="wp-block-code">
	<div class="cm-editor">
		<div class="cm-scroller">
			
<pre>
import cv2
import numpy as np
import matplotlib.pyplot as plt
from skimage import data
from scipy.ndimage import gaussian_laplace

# Load a greyscale test image (uint8, values 0–255)
image = data.camera()

fig, ax = plt.subplots(figsize=(5, 5))
ax.imshow(image, cmap=&apos;gray&apos;)
ax.set_title(&apos;Original image&apos;)
ax.axis(&apos;off&apos;)
plt.tight_layout()
plt.show()</pre>
		</div>
	</div>
</div>

<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-attachment-id="10134" data-permalink="https://poissonisfish.com/2026/05/08/edge-detection-in-python/camera_original/" data-orig-file="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_original-1.png?w=578&#038;ssl=1" data-orig-size="1000,1000" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="camera_original" data-image-description="" data-image-caption="" data-large-file="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_original-1.png?w=578&#038;ssl=1?w=1000" src="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_original-1.png?w=578&#038;ssl=1" alt="" class="wp-image-10134" style="width:650px" data-recalc-dims="1" /></figure>
</div>


<h2 class="wp-block-heading">The Laplacian of Gaussian and zero-crossings</h2>



<p class="wp-block-paragraph">Let us inspect the LoG response and its zero-crossings – the theoretical backbone we discussed earlier. </p>


<div class="wp-block-code">
	<div class="cm-editor">
		<div class="cm-scroller">
			
<pre>
# LoG: positive sigma = apply Gaussian of that std, then Laplacian
log_response = gaussian_laplace(image.astype(float), sigma=2.0)

# Zero-crossings: sign changes between neighbouring pixels
def zero_crossings(log_img):
    &quot;&quot;&quot;Return a binary mask of zero-crossing locations.&quot;&quot;&quot;
    zc = np.zeros_like(log_img, dtype=bool)
    # Check horizontal and vertical sign changes
    for shift in [(0, 1), (1, 0)]:
        shifted = np.roll(log_img, shift=shift, axis=(0, 1))
        zc |= (np.sign(log_img) != np.sign(shifted))
    return zc

zc_mask = zero_crossings(log_response)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].imshow(log_response, cmap=&apos;RdBu_r&apos;)
axes[0].set_title(&apos;LoG response (σ=2.0)&apos;)
axes[0].axis(&apos;off&apos;)
axes[1].imshow(zc_mask, cmap=&apos;gray&apos;)
axes[1].set_title(&apos;Zero-crossings&apos;)
axes[1].axis(&apos;off&apos;)
plt.tight_layout()
plt.show()</pre>
		</div>
	</div>
</div>


<figure class="wp-block-image size-large"><img loading="lazy" data-attachment-id="10137" data-permalink="https://poissonisfish.com/2026/05/08/edge-detection-in-python/camera_zerocross/" data-orig-file="https://poissonisfish.com/wp-content/uploads/2026/05/camera_zerocross-1.png" data-orig-size="2000,800" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="camera_zerocross" data-image-description="" data-image-caption="" data-large-file="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_zerocross-1.png?w=450&#038;ssl=1" src="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_zerocross-1.png?w=450&#038;ssl=1" alt="" class="wp-image-10137" srcset_temp="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_zerocross-1.png?w=450&#038;ssl=1 1024w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_zerocross-1.png?w=150 150w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_zerocross-1.png?w=300 300w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_zerocross-1.png?w=768 768w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_zerocross-1.png?w=1440 1440w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_zerocross-1.png 2000w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /></figure>



<p class="wp-block-paragraph">Notice how the zero-crossing map already captures much of the scene’s edge structure, but it is sensitive to low-level noise and retains spurious responses in flat regions. This motivates the additional refinement steps of the Canny algorithm.</p>



<h2 class="wp-block-heading">Gaussian smoothing and gradient computation</h2>



<p class="wp-block-paragraph">Before running the full Canny pipeline, it is instructive to inspect the intermediate steps. Here we apply a Gaussian blur and then compute Sobel gradients manually.</p>


<div class="wp-block-code">
	<div class="cm-editor">
		<div class="cm-scroller">
			
<pre>
# Gaussian blur — sigma controlled by ksize (must be odd) and sigmaX
blurred = cv2.GaussianBlur(image, ksize=(5, 5), sigmaX=1.4)

# Sobel gradients in x and y
Gx = cv2.Sobel(blurred, cv2.CV_64F, dx=1, dy=0, ksize=3)
Gy = cv2.Sobel(blurred, cv2.CV_64F, dx=0, dy=1, ksize=3)

# Gradient magnitude
magnitude = np.sqrt(Gx**2 + Gy**2)
magnitude = (magnitude / magnitude.max() * 255).astype(np.uint8)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, img, title in zip(axes, [blurred, magnitude],
                          [&apos;Gaussian blur (σ=1.4)&apos;, &apos;|∇f| - Sobel magnitude&apos;]):
    ax.imshow(img, cmap=&apos;gray&apos;)
    ax.set_title(title)
    ax.axis(&apos;off&apos;)
plt.tight_layout()
plt.show()</pre>
		</div>
	</div>
</div>


<figure class="wp-block-image size-large"><img loading="lazy" data-attachment-id="10136" data-permalink="https://poissonisfish.com/2026/05/08/edge-detection-in-python/camera_smooth_grad/" data-orig-file="https://poissonisfish.com/wp-content/uploads/2026/05/camera_smooth_grad-1.png" data-orig-size="2000,800" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="camera_smooth_grad" data-image-description="" data-image-caption="" data-large-file="https://i1.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_smooth_grad-1.png?w=450&#038;ssl=1" src="https://i1.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_smooth_grad-1.png?w=450&#038;ssl=1" alt="" class="wp-image-10136" srcset_temp="https://i1.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_smooth_grad-1.png?w=450&#038;ssl=1 1024w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_smooth_grad-1.png?w=150 150w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_smooth_grad-1.png?w=300 300w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_smooth_grad-1.png?w=768 768w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_smooth_grad-1.png?w=1440 1440w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_smooth_grad-1.png 2000w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /></figure>



<h2 class="wp-block-heading">Canny edge detection with OpenCV</h2>



<p class="wp-block-paragraph">The OpenCV <code>Canny()</code> function accepts the image, the two hysteresis thresholds, and an optional aperture size for the Sobel operator. Crucially, the Gaussian smoothing step should be applied manually beforehand so you have full control over <img src="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\sigma " class="latex" />.</p>


<div class="wp-block-code">
	<div class="cm-editor">
		<div class="cm-scroller">
			
<pre>
def run_canny(image, sigma, t_low, t_high, aperture=3):
    &quot;&quot;&quot;Apply Gaussian blur then Canny edge detection.&quot;&quot;&quot;
    # Kernel size: 2 * ceil(3*sigma) + 1 ensures the kernel covers ±3σ
    ksize = 2 * int(np.ceil(3 * sigma)) + 1
    blurred = cv2.GaussianBlur(image, (ksize, ksize), sigmaX=sigma)
    edges = cv2.Canny(blurred, threshold1=t_low, threshold2=t_high, apertureSize=aperture)
    return edges

edges = run_canny(image, sigma=1.4, t_low=50, t_high=150)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].imshow(image, cmap=&apos;gray&apos;)
axes[0].set_title(&apos;Original&apos;)
axes[0].axis(&apos;off&apos;)
axes[1].imshow(edges, cmap=&apos;gray&apos;)
axes[1].set_title(&apos;Canny edges (σ=1.4, T_low=50, T_high=150)&apos;)
axes[1].axis(&apos;off&apos;)
plt.tight_layout()
plt.show()</pre>
		</div>
	</div>
</div>


<figure class="wp-block-image size-large"><img loading="lazy" data-attachment-id="10132" data-permalink="https://poissonisfish.com/2026/05/08/edge-detection-in-python/camera_canny/" data-orig-file="https://poissonisfish.com/wp-content/uploads/2026/05/camera_canny-1.png" data-orig-size="2000,800" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="camera_canny" data-image-description="" data-image-caption="" data-large-file="https://i1.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_canny-1.png?w=450&#038;ssl=1" src="https://i1.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_canny-1.png?w=450&#038;ssl=1" alt="" class="wp-image-10132" srcset_temp="https://i1.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_canny-1.png?w=450&#038;ssl=1 1024w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_canny-1.png?w=150 150w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_canny-1.png?w=300 300w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_canny-1.png?w=768 768w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_canny-1.png?w=1440 1440w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_canny-1.png 2000w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /></figure>



<h2 class="wp-block-heading">The effect of hysteresis thresholds</h2>



<p class="wp-block-paragraph">The ratio <img src="https://s0.wp.com/latex.php?latex=T_%5Ctext%7Bhigh%7D+%2F+T_%5Ctext%7Blow%7D+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=T_%5Ctext%7Bhigh%7D+%2F+T_%5Ctext%7Blow%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=T_%5Ctext%7Bhigh%7D+%2F+T_%5Ctext%7Blow%7D+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="T_\text{high} / T_\text{low} " class="latex" /> is at least as important as the absolute values. A common rule of thumb is to use a 2:1 or 3:1 ratio. Let us explore this now.</p>


<div class="wp-block-code">
	<div class="cm-editor">
		<div class="cm-scroller">
			
<pre>
fig, axes = plt.subplots(2, 3, figsize=(14, 8))
configs = [
    (1.4, 20, 60, &apos;↑ Recall, ↑ Noise\nσ=1.4, T=20/60&apos;),
    (1.4, 50, 150, &apos;Balanced\nσ=1.4, T=50/150&apos;),
    (1.4, 100, 200, &apos;↓ Recall, ↓ Noise\nσ=1.4, T=100/200&apos;),
    (0.5, 50, 150, &apos;Fine scale\nσ=0.5, T=50/150&apos;),
    (2.0, 50, 150, &apos;Coarse scale\nσ=2.0, T=50/150&apos;),
    (4.0, 50, 150, &apos;Very coarse scale\nσ=4.0, T=50/150&apos;),
]

for ax, (sigma, tl, th, title) in zip(axes.ravel(), configs):
    result = run_canny(image, sigma=sigma, t_low=tl, t_high=th)
    ax.imshow(result, cmap=&apos;gray&apos;)
    ax.set_title(title, fontsize=9)
    ax.axis(&apos;off&apos;)

plt.tight_layout()
plt.show()</pre>
		</div>
	</div>
</div>


<figure class="wp-block-image size-large"><img loading="lazy" data-attachment-id="10133" data-permalink="https://poissonisfish.com/2026/05/08/edge-detection-in-python/camera_hyst_thresh/" data-orig-file="https://poissonisfish.com/wp-content/uploads/2026/05/camera_hyst_thresh-1.png" data-orig-size="2800,1600" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="camera_hyst_thresh" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_hyst_thresh-1.png?w=450&#038;ssl=1" src="https://i0.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_hyst_thresh-1.png?w=450&#038;ssl=1" alt="" class="wp-image-10133" srcset_temp="https://i0.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_hyst_thresh-1.png?w=450&#038;ssl=1 1024w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_hyst_thresh-1.png?w=2048 2048w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_hyst_thresh-1.png?w=150 150w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_hyst_thresh-1.png?w=300 300w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_hyst_thresh-1.png?w=768 768w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_hyst_thresh-1.png?w=1440 1440w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /></figure>



<p class="wp-block-paragraph">The top row demonstrates the threshold effect: lower thresholds recover more edges but also more noise, while higher thresholds yield cleaner output at the cost of broken contours. The bottom row shows the scale effect governed by <img src="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;%23038;fg=555555&#038;%23038;s=0&#038;%23038;c=20201002" srcset_temp="https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Csigma+&#038;bg=f9f9f9&#038;fg=555555&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="\sigma " class="latex" />: at small scales the detector responds to fine texture and noise, while at large scales only the dominant structural boundaries survive.</p>
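<p class="wp-block-paragraph">One practical way to pick the band without manual tuning (a common heuristic, not taken from this post) is to centre it on the median intensity of the blurred image:</p>

```python
import numpy as np

def auto_thresholds(blurred, dev=0.33):
    """Centre the hysteresis band on the median intensity.

    A common 'auto-Canny' heuristic; dev=0.33 yields roughly the 2:1
    high:low ratio recommended by the rule of thumb above.
    """
    med = float(np.median(blurred))
    t_low = round(max(0.0, (1.0 - dev) * med))
    t_high = round(min(255.0, (1.0 + dev) * med))
    return t_low, t_high
```

<p class="wp-block-paragraph">The resulting pair can then be passed as <code>t_low</code> and <code>t_high</code> to the <code>run_canny()</code> helper defined earlier.</p>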



<h2 class="wp-block-heading">Overlaying edges on the original image</h2>



<p class="wp-block-paragraph">A useful visualisation is to overlay the detected edges on the original image, which makes it easy to judge the quality of the detection.</p>


<div class="wp-block-code">
	<div class="cm-editor">
		<div class="cm-scroller">
			
<pre>
# Convert to RGB so we can draw edges in red
overlay = cv2.cvtColor(image, cv2.COLOR_GRAY2RGB)
overlay[edges &gt; 0] = [220, 40, 40]  # red edges

fig, ax = plt.subplots(figsize=(6, 6))
ax.imshow(overlay)
ax.set_title(&apos;Canny edges overlaid (σ=1.4, T=50/150)&apos;)
ax.axis(&apos;off&apos;)
plt.tight_layout()
plt.show()</pre>
		</div>
	</div>
</div>

<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img loading="lazy" data-attachment-id="10135" data-permalink="https://poissonisfish.com/2026/05/08/edge-detection-in-python/camera_overlay/" data-orig-file="https://poissonisfish.com/wp-content/uploads/2026/05/camera_overlay-1.png" data-orig-size="1200,1200" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="camera_overlay" data-image-description="" data-image-caption="" data-large-file="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_overlay-1.png?w=450&#038;ssl=1" src="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_overlay-1.png?w=450&#038;ssl=1" alt="" class="wp-image-10135" style="width:650px" srcset_temp="https://i2.wp.com/poissonisfish.com/wp-content/uploads/2026/05/camera_overlay-1.png?w=450&#038;ssl=1 1024w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_overlay-1.png?w=150 150w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_overlay-1.png?w=300 300w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_overlay-1.png?w=768 768w, https://poissonisfish.com/wp-content/uploads/2026/05/camera_overlay-1.png 1200w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /></figure>
</div>


<p class="wp-block-paragraph">As with the butterfly picture earlier, the result precisely identifies the sharpest edges in the image.</p>



<h1 class="wp-block-heading">Conclusion <img src="https://i1.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f3c1.png?w=578&#038;ssl=1" alt="&#x1f3c1;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" /></h1>



<p class="wp-block-paragraph">We have traced a line from the centre-surround receptive fields of retinal ganglion cells to the LoG operator and Marr’s zero-crossings, and from there to the Canny detector – one of the most popular algorithms in image processing. The key ideas are worth summarising:</p>



<ul class="wp-block-list">
<li><strong>Edges are zero-crossings of the second derivative</strong> of image intensity, a principle Marr derived from first principles and validated against neurophysiology <img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f9e0.png?w=578&#038;ssl=1" alt="&#x1f9e0;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" /></li>



<li><strong>The LoG operator</strong> implements this computationally: a Gaussian pre-filter controls scale and suppresses noise, whereas the Laplacian finds sign changes <img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f4bb.png?w=578&#038;ssl=1" alt="&#x1f4bb;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" /></li>



<li><strong>Canny refines the idea</strong> with NMS for thin, well-localised edges, and hysteresis thresholding to preserve continuous contours without fragmenting them <img src="https://i1.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f3d9.png?w=578&#038;ssl=1" alt="&#x1f3d9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" /></li>
</ul>
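<p class="wp-block-paragraph">The hysteresis step in the last point can be sketched in a few lines of NumPy (an illustrative toy that grows the strong set through 4-connected neighbours; OpenCV&#8217;s implementation differs in detail):</p>

```python
import numpy as np

def hysteresis(mag, t_low, t_high, max_iter=100):
    """Keep weak pixels (>= t_low) only if connected to a strong pixel (>= t_high).

    Toy sketch: iteratively grows the kept set into 4-connected weak
    neighbours until it stops changing.
    """
    strong = mag >= t_high
    weak = mag >= t_low
    keep = strong.copy()
    for _ in range(max_iter):
        grown = keep.copy()
        # grow the kept set into its 4-connected neighbours
        grown[1:, :] |= keep[:-1, :]
        grown[:-1, :] |= keep[1:, :]
        grown[:, 1:] |= keep[:, :-1]
        grown[:, :-1] |= keep[:, 1:]
        grown &= weak                      # only weak-or-stronger pixels survive
        if np.array_equal(grown, keep):
            return grown
        keep = grown
    return keep
```

<p class="wp-block-paragraph">This is exactly why hysteresis preserves continuous contours: a faint segment survives as long as it touches a confident one.</p>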



<p class="wp-block-paragraph">Edge detection may seem like a solved problem in an era of end-to-end learned vision systems, but it remains the conceptual foundation of a surprisingly wide range of techniques. Some practical applications worth exploring on your own include:</p>



<ul class="wp-block-list">
<li><strong>Hough transform</strong> for line and circle detection – it operates directly on edge maps</li>



<li><strong>Contour-based object detection</strong> – a classical pre-deep-learning approach that is still competitive in constrained domains</li>



<li><strong>Medical image segmentation</strong> – where edge-based pre-processing still complements learned models for thin-structure detection</li>
</ul>
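<p class="wp-block-paragraph">To make the first point concrete, here is a minimal from-scratch accumulator (a toy sketch, not <code>cv2.HoughLines</code>): every edge pixel votes for all lines (ρ, θ) passing through it, and peaks in the accumulator correspond to detected lines.</p>

```python
import numpy as np

def hough_lines(edges, n_theta=180):
    """Toy Hough accumulator: each edge pixel votes for (rho, theta) pairs."""
    h, w = edges.shape
    diag = int(np.ceil(np.hypot(h, w)))          # maximum possible |rho|
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((2 * diag + 1, n_theta), dtype=np.int32)
    ys, xs = np.nonzero(edges)
    for i, theta in enumerate(thetas):
        # rho = x*cos(theta) + y*sin(theta), shifted so indices are non-negative
        rhos = np.round(xs * np.cos(theta) + ys * np.sin(theta)).astype(int) + diag
        np.add.at(acc, (rhos, i), 1)             # unbuffered voting
    return acc, thetas, diag
```

<p class="wp-block-paragraph">Because the votes come straight from the binary edge map, the quality of the Canny output directly bounds the quality of the detected lines.</p>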



<p class="wp-block-paragraph">That brings us to a close – thanks for reading, I hope this post was insightful and entertaining. Stay curious! <img src="https://i1.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f4a1.png?w=578&#038;ssl=1" alt="&#x1f4a1;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" /></p>



<h1 class="wp-block-heading">References <img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f4d6.png?w=578&#038;ssl=1" alt="&#x1f4d6;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" /></h1>


<ol class="wp-block-footnotes"><li id="563b7add-8b04-4fd2-a688-2383895c42c9">Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., &#038; Polosukhin, I. (2017). <em>Attention Is All You Need.</em> arXiv:1706.03762. <a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#563b7add-8b04-4fd2-a688-2383895c42c9-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li><li id="bd3dda1f-27f7-44e9-8d0f-9a42c28ed201">Vision Banana team, Google DeepMind (2026). <em>Image Generators are Generalist Vision Learners.</em> arXiv:2604.20329. <a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#bd3dda1f-27f7-44e9-8d0f-9a42c28ed201-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li><li id="63b9bf52-e78a-405a-bc33-5082dc51f74e">Robicheaux, P. <em>et al.</em> (2025). <em>RF-DETR: Neural Architecture Search for Real-Time Detection Transformers.</em> arXiv:2511.09554 (ICLR 2026). <a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#63b9bf52-e78a-405a-bc33-5082dc51f74e-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li><li id="9740d4a1-35a7-4003-9e9f-63a4fa16b90b">Kudithipudi, D. <em>et al.</em> (2025). <em>Neuromorphic computing at scale.</em> Nature 637, 801–812. 
<a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#9740d4a1-35a7-4003-9e9f-63a4fa16b90b-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li><li id="58f9bd6d-ec82-4b9b-8285-5f8b083184ad">Ashtiani, F., Idjadi, M. H., &#038; Kim, K. (2026). <em>Integrated photonic neural network with on-chip backpropagation training.</em> Nature 651, 927–932. <a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#58f9bd6d-ec82-4b9b-8285-5f8b083184ad-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li><li id="7b848804-3070-4e9f-bc5f-ea3e60f0bf14">Assran, M. <em>et al.</em> (2023). <em>Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA).</em> arXiv:2301.08243. <a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#7b848804-3070-4e9f-bc5f-ea3e60f0bf14-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li><li id="d6a1d31f-a93f-4844-8b90-70b4da955016">Marr, D. (1982). <em>Vision: A Computational Investigation into the Human Representation and Processing of Visual Information.</em> W. H. Freeman; reissued by MIT Press (2010). 
<a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#d6a1d31f-a93f-4844-8b90-70b4da955016-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li><li id="210cf9a3-ea19-4402-b37e-c2535bd96366">Darwin, C. (1859). <em>On the Origin of Species by Means of Natural Selection.</em> John Murray, London. <a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#210cf9a3-ea19-4402-b37e-c2535bd96366-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li><li id="8d6afae8-ee06-495e-9b78-9705d6088f63">Thompson, D’A. W. (1917). <em>On Growth and Form.</em> Cambridge University Press. <a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#8d6afae8-ee06-495e-9b78-9705d6088f63-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li><li id="ab32dfeb-f0f5-476f-9b2a-ad0939f8514f">Julesz, B. (1971). <em>Foundations of Cyclopean Perception.</em> University of Chicago Press. <a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#ab32dfeb-f0f5-476f-9b2a-ad0939f8514f-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li><li id="4b5583df-7b37-4973-9b87-cd2a13af4711">Marr, D., &#038; Hildreth, E. (1980). 
<em>Theory of edge detection.</em> Proceedings of the Royal Society of London B, 207(1167), 187–217. <a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#4b5583df-7b37-4973-9b87-cd2a13af4711-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li><li id="b51508a4-906a-492d-87d6-4a3a4971d9cc">Canny, J. (1986). <em>A Computational Approach to Edge Detection.</em> IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8(6), 679–698. <a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/#b51508a4-906a-492d-87d6-4a3a4971d9cc-link" rel="nofollow" target="_blank"><img src="https://i0.wp.com/s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/21a9.png?w=578&#038;ssl=1" alt="&#x21a9;" class="wp-smiley" style="height: 1em; max-height: 1em;" data-recalc-dims="1" />︎</a></li></ol>
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://poissonisfish.com/2026/05/08/edge-detection-in-python/"> poissonisfish</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/edge-detection-in-python/">Edge detection in Python</a>]]></content:encoded>
					
		
		<enclosure url="https://1.gravatar.com/avatar/ddaccde3ecefe0821900911d3cd41d541083048d067f5e78cd9d597f0ea3ceaa?s=96&#038;d=identicon&#038;r=G" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/05/butterfly_canny.png" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/03/step_edge_derivatives.png" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/05/laplacian_of_gaussian.png" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/05/zero_cross_deblurred.png?w=1024" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/05/prewitt_sobel.png" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/05/canny_workflow.png?w=1024" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/05/camera_original-1.png" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/05/camera_zerocross-1.png?w=1024" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/05/camera_smooth_grad-1.png?w=1024" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/05/camera_canny-1.png?w=1024" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/05/camera_hyst_thresh-1.png?w=1024" length="0" type="" />
<enclosure url="https://poissonisfish.com/wp-content/uploads/2026/05/camera_overlay-1.png?w=1024" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">401133</post-id>	</item>
		<item>
		<title>Differencing: A Transformation or a Trap?</title>
		<link>https://www.r-bloggers.com/2026/05/differencing-a-transformation-or-a-trap/</link>
		
		<dc:creator><![CDATA[M. Fatih Tüzen]]></dc:creator>
		<pubDate>Thu, 07 May 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; ">
<p>1 Introduction<br />
Differencing is one of the most common transformations in time series analysis.<br />
It is also one of the easiest transformations to misunderstand.<br />
In many ARIMA-style workflows, differencing is introduced almost mechanically: i...</p></div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/differencing-a-transformation-or-a-trap/">Differencing: A Transformation or a Trap?</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/"> A Statistician&#039;s R Notebook</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://i1.wp.com/mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/timeseries_differencing.png?w=578&#038;ssl=1" class="img-fluid quarto-figure quarto-figure-center figure-img" data-recalc-dims="1"></p>
</figure>
</div>
<section id="introduction" class="level1" data-number="1">
<h1 data-number="1"><span class="header-section-number">1</span> Introduction</h1>
<p>Differencing is one of the most common transformations in time series analysis.</p>
<p>It is also one of the easiest transformations to misunderstand.</p>
<p>In many ARIMA-style workflows, differencing is introduced almost mechanically: if a series is not stationary, take a difference; if it still appears non-stationary, take another one. While this advice is not entirely wrong, it can quietly create a dangerous habit. Differencing is not merely a technical preprocessing step — it changes the object of analysis itself.</p>
<p>In the previous article of this series, <em>Why Most Time Series Models Fail Before They Start</em>, we explored stationarity using real CPI data and discussed why many forecasting problems begin long before model estimation. The central idea was simple but important: unstable statistical properties can make even sophisticated models misleading.</p>
<p>You can read the first article here:</p>
<p><a href="https://mfatihtuzen.github.io/posts/2026-04-16_timeseries_stationary/" class="uri" rel="nofollow" target="_blank">https://mfatihtuzen.github.io/posts/2026-04-16_timeseries_stationary/</a></p>
<p>This article continues that discussion with a more subtle question:</p>
<blockquote class="blockquote">
<p>What exactly happens when we difference a time series?</p>
</blockquote>
<p>To explore this question, we will use the <strong>S&#038;P CoreLogic Case-Shiller U.S. National Home Price Index</strong>, available from the Federal Reserve Economic Data (FRED) database under the code <code>CSUSHPINSA</code>.</p>
<p>FRED series link:</p>
<p><a href="https://fred.stlouisfed.org/series/CSUSHPINSA" class="uri" rel="nofollow" target="_blank">https://fred.stlouisfed.org/series/CSUSHPINSA</a></p>
<p>The series tracks U.S. national home prices and provides a rich real-world example: long-run growth, sharp reversals during the housing crisis, and rapid post-pandemic acceleration.</p>
<p>That makes it an ideal setting for a deeper lesson:</p>
<blockquote class="blockquote">
<p>Differencing can stabilize a series, but it can also reshape the structure of the signal.</p>
</blockquote>
</section>
<section id="setup" class="level1" data-number="2">
<h1 data-number="2"><span class="header-section-number">2</span> Setup</h1>
<p>The data used in this article can be downloaded directly from FRED:</p>
<p><a href="https://fred.stlouisfed.org/graph/fredgraph.csv?id=CSUSHPINSA" class="uri" rel="nofollow" target="_blank">https://fred.stlouisfed.org/graph/fredgraph.csv?id=CSUSHPINSA</a></p>
<p>For reproducibility, the CSV file is saved in the same folder as this Quarto document.</p>
<div class="cell">
<pre>library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)
library(slider)
library(forecast)
library(tseries)
library(scales)

theme_set(theme_minimal(base_size = 13))</pre>
</div>
<div class="cell">
<pre>hpi &lt;- read_csv(&quot;CSUSHPINSA.csv&quot;, show_col_types = FALSE) %&gt;%
  transmute(
    date = as.Date(observation_date),
    hpi  = as.numeric(CSUSHPINSA)
  ) %&gt;%
  arrange(date) %&gt;%
  filter(!is.na(date), !is.na(hpi))

hpi %&gt;% slice_head(n = 5)</pre>
<div class="cell-output cell-output-stdout">
<pre># A tibble: 5 × 2
  date         hpi
  &lt;date&gt;     &lt;dbl&gt;
1 1987-01-01  63.7
2 1987-02-01  64.1
3 1987-03-01  64.5
4 1987-04-01  65.0
5 1987-05-01  65.5</pre>
</div>
</div>
<p>We will create several transformed versions of the series.</p>
<div class="cell">
<pre>hpi &lt;- hpi %&gt;%
  mutate(
    diff_1   = hpi - lag(hpi),
    diff_2   = diff_1 - lag(diff_1),
    log_hpi  = log(hpi),
    log_diff = log_hpi - lag(log_hpi)
  )</pre>
</div>
<p>The variables have different meanings:</p>
<ul>
<li><code>hpi</code>: the index level itself</li>
<li><code>diff_1</code>: monthly absolute change in the index</li>
<li><code>diff_2</code>: change in the monthly change</li>
<li><code>log_diff</code>: approximate monthly proportional change</li>
</ul>
<p>This distinction matters. Transformations are not neutral. Each one changes what the series represents.</p>
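<p>The reason <code>log_diff</code> approximates a proportional change is the standard first-order expansion of the logarithm (a general identity, not specific to this series):</p>

```latex
\log x_t - \log x_{t-1}
  = \log\left(1 + \frac{x_t - x_{t-1}}{x_{t-1}}\right)
  \approx \frac{x_t - x_{t-1}}{x_{t-1}}
```

<p>The approximation is accurate when the monthly change is small relative to the level, which is typically the case for a price index.</p>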
</section>
<section id="the-raw-series-persistence-everywhere" class="level1" data-number="3">
<h1 data-number="3"><span class="header-section-number">3</span> The raw series: persistence everywhere</h1>
<p>Let us begin with the raw housing price index.</p>
<div class="cell">
<pre>ggplot(hpi, aes(date, hpi)) +
  geom_line(linewidth = 0.8, color = &quot;#1f4e5f&quot;) +
  labs(
    title = &quot;Raw Housing Price Index: Strong Persistence and Long-Run Trend&quot;,
    subtitle = &quot;S&P CoreLogic Case-Shiller U.S. National Home Price Index&quot;,
    x = NULL,
    y = &quot;Index&quot;
  )</pre>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://i2.wp.com/mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/index_files/figure-html/raw-series-plot-1.png?w=450&#038;ssl=1" class="img-fluid figure-img"  data-recalc-dims="1"></p>
</figure>
</div>
</div>
</div>
<p>Even before running a formal statistical test, the plot already reveals something important. The series does not fluctuate around a stable mean. Instead, it exhibits strong persistence and a pronounced long-run upward movement.</p>
<p>Several major economic episodes are immediately visible: the housing boom of the mid-2000s, the collapse following the global financial crisis, the gradual recovery during the 2010s, and the rapid acceleration after 2020.</p>
<p>This is clearly not a series that looks ready for direct stationary modeling.</p>
<p>But the key issue is not simply the presence of trend. The trend itself carries economic meaning. Housing prices are not merely noisy observations around a fixed level; they reflect long-run structural forces such as credit conditions, interest rates, demographic demand, construction constraints, and broader macroeconomic cycles.</p>
<p>This creates the central tension behind differencing:</p>
<blockquote class="blockquote">
<p>By removing persistence, we may improve the statistical properties of the series — while simultaneously weakening part of its long-run economic signal.</p>
</blockquote>
</section>
<section id="the-acf-of-the-raw-series" class="level1" data-number="4">
<h1 data-number="4"><span class="header-section-number">4</span> The ACF of the raw series</h1>
<p>The autocorrelation function provides another perspective on the same phenomenon.</p>
<div class="cell">
<pre>forecast::ggAcf(na.omit(hpi$hpi), lag.max = 26) +
  labs(
    title = &quot;ACF of Raw Housing Price Index&quot;,
    x = &quot;Lag&quot;,
    y = &quot;ACF&quot;
  )</pre>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://i1.wp.com/mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/index_files/figure-html/raw-acf-1.png?w=450&#038;ssl=1" class="img-fluid figure-img"  data-recalc-dims="1"></p>
</figure>
</div>
</div>
</div>
<p>The ACF declines extremely slowly and remains strongly positive even at relatively long lags. This is one of the classic visual signatures of a highly persistent process.</p>
<p>In practical terms, today’s housing price index is strongly related to its past values. That is not surprising. Housing markets do not reset from month to month; they evolve gradually through credit conditions, market expectations, supply constraints, and macroeconomic forces.</p>
<p>From a modeling standpoint, however, this dependence structure creates a challenge. Methods built around stationarity assumptions may struggle to distinguish genuine short-run dynamics from long-run drift if we model the raw level directly.</p>
<p>To formalize this intuition, let us turn to the Augmented Dickey–Fuller (ADF) test.</p>
<div class="cell">
<pre>adf_level &lt;- tseries::adf.test(na.omit(hpi$hpi))
adf_level</pre>
<div class="cell-output cell-output-stdout">
<pre>
    Augmented Dickey-Fuller Test

data:  na.omit(hpi$hpi)
Dickey-Fuller = -0.97386, Lag order = 7, p-value = 0.9427
alternative hypothesis: stationary</pre>
</div>
</div>
<p>The ADF test fails to reject the null hypothesis of a unit root for the raw series. In other words, there is no statistical evidence supporting stationarity in the housing price index at the level scale.</p>
<p>This result aligns closely with what we already observed visually: the series behaves more like a drifting process than a stable mean-reverting one.</p>
<p>So far, the standard recommendation appears sensible:</p>
<blockquote class="blockquote">
<p>If the series is non-stationary, take a difference.</p>
</blockquote>
</section>
<section id="first-differencing-less-trend-but-not-no-structure" class="level1" data-number="5">
<h1 data-number="5"><span class="header-section-number">5</span> First differencing: less trend, but not no structure</h1>
<p>A first difference replaces the level of the series with its period-to-period change:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CDelta%20x_t%20=%20x_t%20-%20x_%7Bt-1%7D.%0A"></p>
<p>This operation is often described as “removing the trend.” That description is useful, but incomplete.</p>
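<p>A textbook case shows why the description is incomplete (a standard result, not derived from the CSUSHPINSA data): if the series is a deterministic trend plus stationary noise,</p>

```latex
x_t = \alpha + \beta t + \varepsilon_t
\quad\Longrightarrow\quad
\Delta x_t = \beta + \varepsilon_t - \varepsilon_{t-1}
```

<p>the first difference is stationary, but it now carries a non-invertible MA(1) component. Differencing removed the trend and, at the same time, introduced new dependence that was not present in the original noise.</p>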
<p>Let us now examine the first-differenced series.</p>
<div class="cell">
<pre>ggplot(hpi, aes(date, diff_1)) +
  geom_line(linewidth = 0.7, color = &quot;#d95f02&quot;, na.rm = TRUE) +
  labs(
    title = &quot;First Difference of the Housing Price Index&quot;,
    subtitle = &quot;Monthly absolute change in the index&quot;,
    x = NULL,
    y = &quot;Δ Index&quot;
  )</pre>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://i1.wp.com/mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/index_files/figure-html/first-difference-plot-1.png?w=450&#038;ssl=1" class="img-fluid figure-img"  data-recalc-dims="1"></p>
</figure>
</div>
</div>
</div>
<p>The transformation clearly changes the behavior of the data. The dominant upward drift visible in the raw housing price index is no longer the central feature. Instead, the series fluctuates around a much more stable level.</p>
<p>That is the good news.</p>
<p>But something equally important remains: the differenced series is still highly structured. It does not resemble white noise. Distinct regimes, bursts of volatility, and recurring short-run movements are still visible throughout the series.</p>
<p>The periods surrounding the housing crisis and the post-pandemic surge are especially revealing. The magnitude of month-to-month changes increases sharply, and the volatility structure itself becomes more pronounced.</p>
<p>In other words, differencing reduced the trend — but it did not eliminate dependence.</p>
<p>This is a crucial distinction.</p>
<p>A transformed series can become statistically more manageable while still retaining meaningful internal structure. That is precisely why treating differencing as a mechanical preprocessing step can be misleading.</p>
</section>
<section id="the-acf-after-first-differencing" class="level1" data-number="6">
<h1 data-number="6"><span class="header-section-number">6</span> The ACF after first differencing</h1>
<p>Let us now inspect the autocorrelation structure after first differencing.</p>
<div class="cell">
<pre>forecast::ggAcf(na.omit(hpi$diff_1), lag.max = 26) +
  labs(
    title = &quot;ACF of First Difference&quot;,
    x = &quot;Lag&quot;,
    y = &quot;ACF&quot;
  )</pre>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://i2.wp.com/mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/index_files/figure-html/first-diff-acf-1.png?w=450&#038;ssl=1" class="img-fluid figure-img"  data-recalc-dims="1"></p>
</figure>
</div>
</div>
</div>
<p>The ACF is no longer dominated by the extremely slow decay observed in the raw housing price index. This is an important change. The transformation has substantially reduced the long-run persistence associated with the level series.</p>
<p>But the structure has not vanished.</p>
<p>Several lags remain clearly significant, and the series still exhibits meaningful short-run dynamics. Cyclical patterns and medium-range dependence are still visible, suggesting that the transformation reduced the trend without erasing the internal behavior of the process.</p>
<p>This is particularly important in housing markets, where adjustments tend to occur gradually rather than instantaneously. Prices respond over time through financing conditions, supply rigidities, expectations, and broader economic cycles.</p>
<p>A common beginner misconception is that differencing should transform a series into white noise. It should not. If every form of dependence disappeared completely, there would be little left to model.</p>
<p>The goal of differencing is not to destroy structure. The goal is to remove problematic non-stationarity while preserving meaningful dynamics.</p>
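<p>One way to see this distinction in practice is the Ljung–Box test, which checks whether autocorrelation remains after a transformation. In the hedged sketch below (simulated data, not the article’s series), the level series has a unit root, yet its first difference still carries short-run dependence that the test detects:</p>
<div class="cell">
<pre>set.seed(123)
increments &lt;- arima.sim(model = list(ar = 0.7), n = 500)  # AR(1) shocks
x &lt;- cumsum(increments)                                   # unit-root level series
Box.test(diff(x), lag = 12, type = &quot;Ljung-Box&quot;)           # small p-value: dependence remains</pre>
</div>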
<p>The Augmented Dickey–Fuller test now tells a very different statistical story.</p>
<div class="cell">
<pre>adf_diff1 &lt;- tseries::adf.test(na.omit(hpi$diff_1))
adf_diff1</pre>
<div class="cell-output cell-output-stdout">
<pre>
    Augmented Dickey-Fuller Test

data:  na.omit(hpi$diff_1)
Dickey-Fuller = -3.9775, Lag order = 7, p-value = 0.01019
alternative hypothesis: stationary</pre>
</div>
</div>
<p>The ADF test rejects the null hypothesis of a unit root for the first-differenced series. Statistically speaking, the transformation appears successful: the series is now much more compatible with stationarity assumptions.</p>
<p>But this is where a subtle danger begins.</p>
<p>Once a transformation starts “working,” it becomes tempting to continue applying it mechanically. And that raises an important question:</p>
<blockquote class="blockquote">
<p>What happens if we difference the series again?</p>
</blockquote>
</section>
<section id="second-differencing-cleaner-or-distorted" class="level1" data-number="7">
<h1 data-number="7"><span class="header-section-number">7</span> Second differencing: cleaner or distorted?</h1>
<p>If one difference helps, should two differences help even more?</p>
<p>This is where the trap begins.</p>
<p>A second difference is defined as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CDelta%5E2%20x_t%20=%20%5CDelta%20x_t%20-%20%5CDelta%20x_%7Bt-1%7D.%0A"></p>
<p>Conceptually, it measures the change in the change. In our case, the transformation no longer asks how much housing prices change from one month to the next. Instead, it asks whether those monthly changes themselves are accelerating or decelerating.</p>
<p>That is a fundamentally different question.</p>
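<p>In base R, a second difference can be taken either by differencing twice or with the <code>differences</code> argument of <code>diff()</code>. A toy sketch (not the article’s data):</p>
<div class="cell">
<pre>x &lt;- c(100, 102, 105, 104, 108)
diff(diff(x))             # change in the change
diff(x, differences = 2)  # equivalent shortcut; both give 1 -4 5</pre>
</div>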
<div class="cell">
<pre>ggplot(hpi, aes(date, diff_2)) +
  geom_line(linewidth = 0.7, color = &quot;#7b3294&quot;, na.rm = TRUE) +
  labs(
    title = &quot;Second Difference of the Housing Price Index&quot;,
    subtitle = &quot;Change in the monthly change&quot;,
    x = NULL,
    y = &quot;Δ² Index&quot;
  )</pre>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://i1.wp.com/mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/index_files/figure-html/second-difference-plot-1.png?w=450&#038;ssl=1" class="img-fluid figure-img"  data-recalc-dims="1"></p>
</figure>
</div>
</div>
</div>
<p>The second-differenced series appears more centered, but also far more oscillatory. It reacts strongly to turning points, reversals, and short-run fluctuations. At the same time, however, it becomes increasingly difficult to interpret in economic terms.</p>
<p>This is where the statistical and substantive perspectives begin to diverge.</p>
<p>From a purely statistical viewpoint, the second difference may appear attractive because the series now looks even more stationary. But statistical improvement alone does not guarantee that the transformed series remains meaningful for analysis or forecasting.</p>
<p>The key question is no longer:</p>
<blockquote class="blockquote">
<p>“Did we remove non-stationarity?”</p>
</blockquote>
<p>The key question becomes:</p>
<blockquote class="blockquote">
<p>“What happened to the original signal?”</p>
</blockquote>
</section>
<section id="the-acf-after-second-differencing" class="level1" data-number="8">
<h1 data-number="8"><span class="header-section-number">8</span> The ACF after second differencing</h1>
<p>The autocorrelation structure after second differencing makes the issue even clearer.</p>
<div class="cell">
<pre>forecast::ggAcf(na.omit(hpi$diff_2), lag.max = 26) +
  labs(
    title = &quot;ACF of Second Difference&quot;,
    x = &quot;Lag&quot;,
    y = &quot;ACF&quot;
  )</pre>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://i1.wp.com/mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/index_files/figure-html/second-diff-acf-1.png?w=450&#038;ssl=1" class="img-fluid figure-img"  data-recalc-dims="1"></p>
</figure>
</div>
</div>
</div>
<p>The pattern is now fundamentally different from what we observed earlier. The raw housing price index exhibited strong long-run persistence, while the first difference retained a more moderate and interpretable dependence structure. The second difference, however, introduces a much more alternating and oscillatory behavior.</p>
<p>This is one of the classic warning signs of over-differencing.</p>
<p>Excessive differencing can artificially induce negative dependence and amplify short-run fluctuations that were far less dominant in the original data. In practical terms, the transformation may begin to reshape the signal rather than simply stabilize it.</p>
<p>In other words:</p>
<blockquote class="blockquote">
<p>The second difference may look statistically cleaner, while simultaneously becoming substantively less meaningful.</p>
</blockquote>
<p>Let us now examine the Augmented Dickey–Fuller result.</p>
<div class="cell">
<pre>adf_diff2 &lt;- tseries::adf.test(na.omit(hpi$diff_2))
adf_diff2</pre>
<div class="cell-output cell-output-stdout">
<pre>
    Augmented Dickey-Fuller Test

data:  na.omit(hpi$diff_2)
Dickey-Fuller = -16.035, Lag order = 7, p-value = 0.01
alternative hypothesis: stationary</pre>
</div>
</div>
<p>The ADF test strongly rejects the null hypothesis of a unit root for the second-differenced series. In fact, the warning message suggests that the p-value is even smaller than the value printed by the function.</p>
<p>From a purely statistical perspective, this might appear highly desirable. The transformation seems extremely successful at producing stationarity.</p>
<p>But this creates a useful paradox.</p>
<blockquote class="blockquote">
<p>The test becomes increasingly confident — but should we?</p>
</blockquote>
<p>A more stationary series is not automatically a better modeling target. Sometimes it is simply a more aggressively transformed version of the original data, with less economically meaningful structure left to explain.</p>
<p>At this point, another question naturally emerges:</p>
<blockquote class="blockquote">
<p>Is repeated ordinary differencing always the most meaningful transformation for economic time series?</p>
</blockquote>
</section>
<section id="a-brief-note-on-log-differencing" class="level1" data-number="9">
<h1 data-number="9"><span class="header-section-number">9</span> A brief note on log differencing</h1>
<p>So far, we have focused on ordinary differencing based on absolute changes. But in many economic and financial applications, analysts often prefer log differences instead.</p>
<p>Why?</p>
<p>Because the interpretation of absolute changes becomes increasingly problematic when the scale of a series evolves over time. A one-point increase in a housing price index does not carry the same meaning when the index is near 80 and when it exceeds 300.</p>
<p>Log differencing addresses this issue by focusing on proportional change rather than absolute change:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CDelta%20%5Clog(x_t)%20=%20%5Clog(x_t)%20-%20%5Clog(x_%7Bt-1%7D).%0A"></p>
<p>For relatively small changes, this quantity closely approximates the growth rate:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CDelta%20%5Clog(x_t)%20%5Capprox%20%5Cfrac%7Bx_t%20-%20x_%7Bt-1%7D%7D%7Bx_%7Bt-1%7D%7D.%0A"></p>
<p>This is one reason why log differences are widely used in macroeconomics, inflation analysis, and financial modeling. They often provide a more interpretable representation of economic dynamics because they express changes relative to the current scale of the series.</p>
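<p>A hedged sketch of this approximation on a toy index (not the article’s data) shows how closely log differences track exact growth rates when changes are small:</p>
<div class="cell">
<pre>x &lt;- c(100, 103, 110, 115)
log_diff    &lt;- diff(log(x))             # Δ log(x_t)
growth_rate &lt;- diff(x) / head(x, -1)    # (x_t - x_{t-1}) / x_{t-1}
round(cbind(log_diff, growth_rate), 4)  # the two columns nearly coincide</pre>
</div>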
<p>But an important caution remains.</p>
<p>Log differencing does not eliminate the broader trade-offs discussed in this article. It still transforms the dependence structure of the data, and it still changes the underlying modeling question.</p>
<p>The key lesson is therefore not:</p>
<blockquote class="blockquote">
<p>“Which transformation is universally correct?”</p>
</blockquote>
<p>The real question is:</p>
<blockquote class="blockquote">
<p>“Which transformation preserves the most meaningful structure for the problem we are trying to study?”</p>
</blockquote>
</section>
<section id="comparing-the-three-versions" class="level1" data-number="10">
<h1 data-number="10"><span class="header-section-number">10</span> Comparing the three versions</h1>
<p>A direct comparison makes the effect of differencing much easier to see. The figure below summarizes the central theme of this article.</p>
<div class="cell">
<pre>hpi_long &lt;- hpi %&gt;%
  select(date, hpi, diff_1, diff_2) %&gt;%
  pivot_longer(
    cols = c(hpi, diff_1, diff_2),
    names_to = &quot;series&quot;,
    values_to = &quot;value&quot;
  ) %&gt;%
  mutate(
    series = recode(
      series,
      hpi = &quot;Raw level&quot;,
      diff_1 = &quot;First difference&quot;,
      diff_2 = &quot;Second difference&quot;
    ),
    series = factor(series, levels = c(&quot;Raw level&quot;, &quot;First difference&quot;, &quot;Second difference&quot;))
  )

ggplot(hpi_long, aes(date, value)) +
  geom_line(linewidth = 0.7, color = &quot;#2c3e50&quot;, na.rm = TRUE) +
  facet_wrap(~ series, scales = &quot;free_y&quot;, ncol = 1) +
  labs(
    title = &quot;Raw series, first difference, and second difference&quot;,
    subtitle = &quot;Each transformation changes both the statistical properties and the interpretation&quot;,
    x = NULL,
    y = NULL
  )</pre>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://i0.wp.com/mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/index_files/figure-html/compare-series-1.png?w=450&#038;ssl=1" class="img-fluid figure-img"  data-recalc-dims="1"></p>
</figure>
</div>
</div>
</div>
<p>The raw housing price index contains long-run persistence, structural trend, and broad economic cycles. The first difference shifts attention toward month-to-month changes and substantially reduces the long-run drift. The second difference goes even further, emphasizing acceleration and deceleration in those monthly movements.</p>
<p>Each transformation produces a series with different statistical properties.</p>
<p>But more importantly, each transformation changes the interpretation of the data itself.</p>
<p>That is the crucial point.</p>
<p>These are not simply cleaner or noisier versions of the same series. They are fundamentally different analytical objects, each answering a different question about the underlying process.</p>
</section>
<section id="rolling-volatility-transformation-does-not-solve-everything" class="level1" data-number="11">
<h1 data-number="11"><span class="header-section-number">11</span> Rolling volatility: transformation does not solve everything</h1>
<p>Differencing may stabilize the mean of a series, but it does not guarantee stable variance.</p>
<div class="cell">
<pre>hpi &lt;- hpi %&gt;%
  mutate(
    roll_sd_diff1 = slider::slide_dbl(diff_1, sd, .before = 24, .complete = TRUE),
    roll_sd_diff2 = slider::slide_dbl(diff_2, sd, .before = 24, .complete = TRUE)
  )</pre>
</div>
<div class="cell">
<pre>hpi %&gt;%
  select(date, roll_sd_diff1, roll_sd_diff2) %&gt;%
  pivot_longer(
    cols = c(roll_sd_diff1, roll_sd_diff2),
    names_to = &quot;series&quot;,
    values_to = &quot;rolling_sd&quot;
  ) %&gt;%
  mutate(
    series = recode(
      series,
      roll_sd_diff1 = &quot;First difference&quot;,
      roll_sd_diff2 = &quot;Second difference&quot;
    )
  ) %&gt;%
  ggplot(aes(date, rolling_sd, color = series)) +
  geom_line(linewidth = 0.8, na.rm = TRUE) +
  scale_color_manual(values = c(&quot;First difference&quot; = &quot;#d95f02&quot;, &quot;Second difference&quot; = &quot;#7b3294&quot;)) +
  labs(
    title = &quot;24-month rolling standard deviation&quot;,
    subtitle = &quot;Differencing changes the volatility structure too&quot;,
    x = NULL,
    y = &quot;Rolling SD&quot;,
    color = NULL
  )</pre>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://i0.wp.com/mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/index_files/figure-html/rolling-volatility-plot-1.png?w=450&#038;ssl=1" class="img-fluid figure-img"  data-recalc-dims="1"></p>
</figure>
</div>
</div>
</div>
<p>The rolling standard deviation highlights an important lesson that is often overlooked in introductory time series discussions: stationarity is not a single on–off property. A transformation can improve one aspect of the data while leaving other forms of instability unresolved.</p>
<p>The housing price series illustrates this clearly. Even after differencing, the post-2020 period remains substantially more volatile than earlier decades. Large swings, volatility bursts, and changing dispersion are still visible in both transformed series.</p>
<p>This matters because many classical time series models implicitly assume not only stable mean behavior, but also relatively stable variance structure.</p>
<p>A model that ignores changing volatility may appear statistically successful while still producing fragile forecasts and misleading uncertainty estimates in practice.</p>
<p>In other words:</p>
<blockquote class="blockquote">
<p>Differencing can reduce trend-related non-stationarity without fully stabilizing the broader dynamics of the process.</p>
</blockquote>
</section>
<section id="a-compact-comparison" class="level1" data-number="12">
<h1 data-number="12"><span class="header-section-number">12</span> A compact comparison</h1>
<p>The table below summarizes both the transformations examined directly in this article and closely related alternatives frequently used in applied time series analysis.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th>Series version</th>
<th>What it represents</th>
<th>What improves</th>
<th>What may be lost</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Raw level</td>
<td>Housing price index itself</td>
<td>Preserves long-run economic structure and trend information</td>
<td>Strong persistence and unit-root-like behavior</td>
</tr>
<tr class="even">
<td>First difference</td>
<td>Monthly absolute change</td>
<td>Reduces long-run drift and improves stationarity properties</td>
<td>Level interpretation and part of the long-run dependence structure</td>
</tr>
<tr class="odd">
<td>Second difference</td>
<td>Change in monthly change</td>
<td>Produces an even stronger stationarity signal</td>
<td>Economic interpretability and smoother dependence dynamics</td>
</tr>
<tr class="even">
<td>Log difference</td>
<td>Approximate proportional change</td>
<td>Often provides a more scale-adjusted interpretation of change</td>
<td>May still contain volatility shifts, structural breaks, or persistence</td>
</tr>
</tbody>
</table>
<p>No transformation is universally best. The appropriate choice depends on the analytical question, the structure of the data, and the type of signal we want to preserve.</p>
</section>
<section id="the-paradox-of-differencing" class="level1" data-number="13">
<h1 data-number="13"><span class="header-section-number">13</span> The paradox of differencing</h1>
<p>Differencing is powerful because it reduces persistence.</p>
<p>But that is also where the danger begins.</p>
<p>Persistence is not always a statistical nuisance. In many economic and financial time series, persistence is part of the signal itself. Long-run movements in housing prices, inflation, production, or income are often economically meaningful features of the process rather than accidental artifacts.</p>
<p>This creates a practical tension at the heart of time series modeling.</p>
<p>If we difference too little, we may mistake long-run drift for stable structure.</p>
<p>If we difference too aggressively, we may weaken meaningful dependence and end up modeling noise-like fluctuations instead of economically relevant dynamics.</p>
<p>And if we difference mechanically, without thinking carefully about interpretation, we may ultimately answer a question nobody intended to ask.</p>
<p>That is why differencing should not be treated as a preprocessing ritual.</p>
<p>It is a modeling decision.</p>
</section>
<section id="common-mistakes" class="level1" data-number="14">
<h1 data-number="14"><span class="header-section-number">14</span> Common mistakes</h1>
<p>Most mistakes with differencing are not computational. They are conceptual.</p>
<p><strong>Mistake 1: assuming first differencing automatically solves the problem</strong></p>
<p>First differencing often reduces trend and improves stationarity properties, but it does not guarantee white noise, stable variance, or a well-specified model.</p>
<p><strong>Mistake 2: increasing the differencing order simply because a test improves</strong></p>
<p>A second difference may appear statistically “better” according to a unit root test, but that does not automatically make it a more meaningful modeling target.</p>
<p><strong>Mistake 3: forgetting that differencing changes the question</strong></p>
<p>Modeling levels, monthly changes, and changes in monthly changes are fundamentally different analytical tasks.</p>
<p><strong>Mistake 4: ignoring the ACF after transformation</strong></p>
<p>The ACF is not merely a diagnostic plot. It reveals how the dependence structure of the series has been reshaped by the transformation itself.</p>
<p><strong>Mistake 5: treating preprocessing as separate from modeling</strong></p>
<p>Every transformation changes what the model sees. And once the model sees a different series, the modeling problem itself has changed.</p>
</section>
<section id="practical-workflow" class="level1" data-number="15">
<h1 data-number="15"><span class="header-section-number">15</span> Practical workflow</h1>
<p>A sensible differencing workflow should not begin with the question:</p>
<blockquote class="blockquote">
<p>“How many differences do I need?”</p>
</blockquote>
<p>A better workflow is something closer to this:</p>
<ol type="1">
<li>Plot the raw series.</li>
<li>Ask what the level of the series actually represents.</li>
<li>Inspect the autocorrelation structure.</li>
<li>Apply the smallest transformation that addresses the main statistical problem.</li>
<li>Re-examine the transformed series visually.</li>
<li>Re-check the dependence structure using the ACF.</li>
<li>Use tests such as the ADF test as supporting evidence rather than final truth.</li>
<li>Ask whether the transformed series still answers the substantive question of interest.</li>
</ol>
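<p>As supporting evidence for step 7, the <code>forecast</code> package offers <code>ndiffs()</code>, which estimates a differencing order from unit-root tests. A hedged sketch on simulated data (not the article’s series); treat its answer as one input to the decision, not a verdict:</p>
<div class="cell">
<pre>library(forecast)
set.seed(42)
x &lt;- cumsum(rnorm(300, mean = 0.2))  # toy series with drift
ndiffs(x)                            # suggested number of differences, typically 1 here</pre>
</div>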
<p>This workflow is slower than blindly calling <code>auto.arima()</code> and accepting whatever transformation it selects automatically. But it is also safer. And in real analytical work, safer usually wins.</p>
</section>
<section id="final-thoughts" class="level1" data-number="16">
<h1 data-number="16"><span class="header-section-number">16</span> Final thoughts</h1>
<p>Differencing is not a trap by itself.</p>
<p>It becomes a trap when we start treating it as a harmless default.</p>
<p>The housing price example illustrates this tension clearly. The raw series is highly persistent and visibly non-stationary. The first difference improves the statistical behavior of the data while still preserving meaningful short-run dynamics. The second difference pushes the series even further toward stationarity, but it also reshapes the dependence structure and weakens the direct economic interpretation.</p>
<p>This is the central trade-off behind differencing.</p>
<p>The real question is not:</p>
<blockquote class="blockquote">
<p>“Is the series stationary now?”</p>
</blockquote>
<p>The more difficult — and ultimately more useful — question is:</p>
<blockquote class="blockquote">
<p>“After transformation, am I still modeling the signal I actually care about?”</p>
</blockquote>
<p>That question matters far more than the differencing order itself.</p>
</section>
<section id="references-and-further-reading" class="level1" data-number="17">
<h1 data-number="17"><span class="header-section-number">17</span> References and further reading</h1>
<p><strong>Data source</strong></p>
<ul>
<li><p>Federal Reserve Bank of St. Louis. <em>S&P CoreLogic Case-Shiller U.S. National Home Price Index (CSUSHPINSA).</em><br>
<a href="https://fred.stlouisfed.org/series/CSUSHPINSA" class="uri" rel="nofollow" target="_blank">https://fred.stlouisfed.org/series/CSUSHPINSA</a></p></li>
<li><p>FRED CSV download link used in this article:<br>
<a href="https://fred.stlouisfed.org/graph/fredgraph.csv?id=CSUSHPINSA" class="uri" rel="nofollow" target="_blank">https://fred.stlouisfed.org/graph/fredgraph.csv?id=CSUSHPINSA</a></p></li>
</ul>
<p><strong>Core time series references</strong></p>
<ul>
<li><p>Box, G. E. P., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). <em>Time Series Analysis: Forecasting and Control.</em> Wiley.</p></li>
<li><p>Hyndman, R. J., & Athanasopoulos, G. (2021). <em>Forecasting: Principles and Practice</em> (3rd ed.).<br>
<a href="https://otexts.com/fpp3/" class="uri" rel="nofollow" target="_blank">https://otexts.com/fpp3/</a></p></li>
<li><p>Hamilton, J. D. (1994). <em>Time Series Analysis.</em> Princeton University Press.</p></li>
</ul>
<p><strong>Unit roots and differencing</strong></p>
<ul>
<li><p>Dickey, D. A., & Fuller, W. A. (1979). Distribution of the estimators for autoregressive time series with a unit root. <em>Journal of the American Statistical Association.</em></p></li>
<li><p>Said, S. E., & Dickey, D. A. (1984). Testing for unit roots in autoregressive-moving average models of unknown order. <em>Biometrika.</em></p></li>
</ul>
<p><strong>Practical R resources</strong></p>
<ul>
<li><p>R Core Team. <em>R: A Language and Environment for Statistical Computing.</em><br>
<a href="https://www.r-project.org/" class="uri" rel="nofollow" target="_blank">https://www.r-project.org/</a></p></li>
<li><p>Hyndman, R. J. et al. <em>forecast package documentation.</em><br>
<a href="https://pkg.robjhyndman.com/forecast/" class="uri" rel="nofollow" target="_blank">https://pkg.robjhyndman.com/forecast/</a></p></li>
</ul>



</section>

 
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://mfatihtuzen.github.io/posts/2026-05-07_timeseries_differencing/"> A Statistician&#039;s R Notebook</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/differencing-a-transformation-or-a-trap/">Differencing: A Transformation or a Trap?</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">401116</post-id>	</item>
		<item>
		<title>New Mentoring Team with Experienced Mentors and New Voices</title>
		<link>https://www.r-bloggers.com/2026/05/new-mentoring-team-with-experienced-mentors-and-new-voices/</link>
		
		<dc:creator><![CDATA[rOpenSci]]></dc:creator>
		<pubDate>Wed, 06 May 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://ropensci.org/blog/2026/05/06/mentors-2026/</guid>

					<description><![CDATA[<p>Read it in: Español. We are excited to introduce the new team of mentors for the rOpenSci 2026 Champions Program! This year we have eleven individuals committed to open science, bringing together a rich diversity of backgrounds and perspectives. The t...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/new-mentoring-team-with-experienced-mentors-and-new-voices/">New Mentoring Team with Experienced Mentors and New Voices</a>]]></description>
<content:encoded><![CDATA[

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://ropensci.org/blog/2026/05/06/mentors-2026/"> rOpenSci - open tools for open science</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>

<p><a href='https://ropensci.org/es/blog/2026/05/26/mentoras_es-2026/' rel="nofollow" target="_blank">Read it in: Español</a>.</p> <p>We are excited to introduce the new team of mentors for the rOpenSci 2026 Champions Program! This year we have eleven individuals committed to open science, bringing together a rich diversity of backgrounds and perspectives. The team is made up of people joining the program for the first time, former Champions returning as mentors, and experienced mentors from previous cohorts returning to continue to strengthen this community.</p>
<p>This year’s mentors come from a variety of disciplines and countries, and are active voices in the R community in Latin America and beyond. With their guidance, the new group of Champions will not only develop their projects, but also grow as leaders in open science and research software development.</p>
<h2>
New mentors
</h2><h3>
Alber Hamersson Sánchez Ipia
</h3><figure class="pull-left"><img src="https://i0.wp.com/ropensci.org/img/team/alber-sanchez.jpg?w=250&#038;ssl=1"
alt="Profile photo of Alber Hamersson Sánchez Ipia"  data-recalc-dims="1"><figcaption>
<p><strong>Alber Hamersson Sánchez Ipia </br> Instituto Nacional de Investigación Espacial del Brasil </br> rOpenSci</strong></p>
</figcaption>
</figure>
<p>Hi! I’m Alber and I’m going to be an rOpenSci mentor this year.</p>
<p>I was born in Colombia, in the department of Cauca, in one of the country’s most mountainous regions, called Tierradentro.
I am a Cadastral and Geodetic Engineer from the Francisco José de Caldas District University in Colombia, where I also earned a Master’s in Information and Communication Sciences.
I later completed a second Master’s degree in Geoinformatics at the University of Münster in Germany, and earned a PhD in Earth System Science at the National Institute for Space Research (INPE) in Brazil, where I currently live and work as a research assistant.</p>
<p>Part of my daily work involves writing R code to process spatial data and to ensure the reproducibility of scientific results, so I am familiar with R package development.
I am also a co-author of the segmetric package, currently available on CRAN,
and I maintain one of the Data Carpentry lessons,
the introduction to R for geospatial data.</p>
<p>I am interested in sharing the knowledge and experience I have accumulated so far with anyone who is going to write scientific or statistical software,
particularly in Spanish.
For this reason I am joining rOpenSci,
where I hope to be part of and help build a community of developers.</p>
</br>
</br>
<h3>
Pablo Paccioretti
</h3><figure class="pull-right"><img src="https://i0.wp.com/ropensci.org/img/team/pablo-paccioretti.jpg?w=250&#038;ssl=1"
alt="Profile photo of Pablo Paccioretti"  data-recalc-dims="1"><figcaption>
<p><strong>Pablo Paccioretti <br/> Universidad Nacional de Córdoba (UNC) &#038; CONICET, Argentina <br/> rOpenSci</strong></p>
</figcaption>
</figure>
<p>Hello! I am Pablo, an Agricultural Engineer with a PhD from the National University of Córdoba (Argentina), where I work as a researcher and teacher. Since my student years I have been interested in statistics, which directed my work towards data analysis. In particular, I apply and develop methodologies and software tools to analyze georeferenced data from field trials and agricultural monitoring platforms.</p>
<p>I am interested in the development of open tools for data processing and analysis. I have developed scientific software, including R packages for georeferenced data analysis.</p>
<p>My participation in the Champions Program arises from an interest in strengthening the links between applied data analysis and programming, and promoting good practices in both areas. Through this program I hope to contribute to the community by sharing experiences and resources, while also learning from other professionals working in different contexts and disciplines.</p>
<br/>
<br/>
<h2>
Champions to mentors
</h2><h3>
Erick Navarro Delgado
</h3><figure class="pull-left"><img src="https://i0.wp.com/ropensci.org/img/team/erick-navarro-delgado.jpg?w=250&#038;ssl=1"
alt="Profile photo of Erick Navarro Delgado"  data-recalc-dims="1"><figcaption>
<p><strong>Erick Navarro Delgado <br/> The University of British Columbia <br/> rOpenSci</strong></p>
</figcaption>
</figure>
<p>Hello! My name is Erick Navarro. I have a degree in biology from the Universidad Nacional Autónoma de México and am a PhD candidate in Bioinformatics at The University of British Columbia. I was born and raised in Mexico City, but currently live in Vancouver, Canada. My research focuses on developing computational tools to understand how genetic factors and environmental exposures/lived experiences act together or separately to shape our molecular landscape.</p>
<p>I am excited to participate in the rOpenSci Champions Program because I believe that open and accessible science is essential for conducting relevant research whose results benefit everyone in our society. In this program I hope to connect with new members of the open science community, share my programming skills, and drive software development in Latin America.</p>
<br/>
<h3>
Guadalupe Pascal
</h3><figure class="pull-right"><img src="https://i0.wp.com/ropensci.org/img/team/guadalupe-pascal.jpg?w=250&#038;ssl=1"
alt="Profile photo of Guadalupe Pascal"  data-recalc-dims="1"><figcaption>
<p><strong>Guadalupe Pascal <br/> UNLZ-UCA-UGR <br/> rOpenSci</strong></p>
</figcaption>
</figure>
<p>Hello! My name is Guadalupe.</p>
<p>I am a researcher and project coordinator in optimization and data science for decision making in social systems, with transfeminist, open science and regional perspectives. I am also an associate professor of optimization and quantitative methods (UNLZ-UCA) and professor in data science and artificial intelligence courses (UGR). I have a Master’s in Decision Systems Engineering from URJC (Spain) and am an industrial engineer from UNLZ (Argentina), a PhD student in Information Technology and Engineering (URJC-UNLZ), and hold diplomas in Gender and Society (UNLZ), Cognitive Neuroscience (Neurotransmitting), and Education in the Age of Artificial Intelligence (UMET). I am part of the Matilda Latin American Open Chair and Women in Engineering, as a founding member and representative of the Gender Network of Engineering Faculties of Argentina.</p>
<p>I am also currently part of the rOpenSci community as a 2025-2026 cohort Champion, and I am very excited to be a mentor in this program for several reasons. On the one hand, I feel deep gratification at being engaged with the current cohort. From a simple point of view, the quality and rigor with which the program is implemented in all its instances have a direct impact on the quality and rigor of my own work. And from a holistic point of view, this serves as extremely valuable and compelling evidence of the synergy within communities of practice in developing skills and producing situated knowledge: the rOpenSci Champions Program is a concrete and real example of how communities share knowledge and, fundamentally, values, perspectives and embodied learning. On the other hand, I am looking forward to the challenge of being a mentor in this program because, although it is a role that I have played in other environments, I have never mentored the development of someone else’s R package. Finally, I would like to work in this role to share my experiences as both a mentor and mentee with the community. I believe that accompanying each other in a formative and transformative process is one of the most human dimensions of this ecosystem in which we work.</p>
<br/>
<h3>
Andrea Gomez Vargas
</h3><figure class="pull-left"><img src="https://i2.wp.com/ropensci.org/img/team/andrea-vargas.png?w=250&#038;ssl=1"
alt="Profile photo of Andrea Gomez Vargas"  data-recalc-dims="1"><figcaption>
<p><strong>Andrea Gomez Vargas <br/> INDEC <br/> R-Ladies, rOpenSci</strong></p>
</figcaption>
</figure>
<p>I am Colombian by origin and Argentinean by choice; Argentina is where I live today, develop my career, and actively participate in the R community. I am a sociologist and work in the national statistics office of Argentina, in the area of social statistics, where I analyze information about the population to understand inequalities and living conditions.</p>
<p>The R community is my favorite space to share knowledge and build collectively. Currently, I am co-organizer of <a href="https://renbaires.github.io/" rel="nofollow" target="_blank">R in Buenos Aires</a> and <a href="https://rse-argentina.github.io/" rel="nofollow" target="_blank">RSE Argentina</a>, and I also participate in communities such as R-Ladies, LatinR and rOpenSci, contributing to the strengthening of networks at local, regional and global levels, promoting the learning and use of open tools in data science.</p>
<p>I was a <a href="https://blog/2025/05/15/puentes-comunidades-campeones-ropensci/" rel="nofollow" target="_blank">Champion in the 2023-2024 cohort</a>, where I developed <a href="https://soyandrea.github.io/arcenso/" rel="nofollow" target="_blank">{ARcenso}, a package that facilitates access to historical census data for Argentina</a>. I am motivated to return to the program as a mentor to keep promoting open knowledge and to accompany other people in the development of projects with an impact on their communities.</p>
<br/>
<h3>
Monika Avila Marquez
</h3><figure class="pull-right"><img src="https://i1.wp.com/ropensci.org/img/team/monika-avila-marquez.jpeg?w=250&#038;ssl=1"
alt="Profile photo of Monika Ávila Márquez"  data-recalc-dims="1"><figcaption>
<p><strong>Monika Ávila Márquez <br/> University of Geneva <br/> R-Ladies, rOpenSci</strong></p>
</figcaption>
</figure>
<p>Hi, I am Monika, a postdoctoral researcher in statistics at the University of Geneva, where I work on causal inference and machine learning methods for panel data. I have a PhD in econometrics and my research focuses on the development of semi-parametric estimators that combine machine learning techniques with econometric foundations for estimating panel data models. I also work on mixed effects model selection and causal inference with interference.</p>
<p>I am co-organizer of the R-Ladies Geneva chapter, where I strive to build an inclusive community of practice for people using R in research.</p>
<p>I am participating as a mentor in this program because I want to give back for all that rOpenSci has given me. This community has accompanied me in my professional development &#8211; as a source of resources, as a learning space and as an example of what it means to do open science with rigor and generosity. Today I have the opportunity to offer that same support to others, and that excites me deeply.</p>
<br/>
<h2>
Returning mentors
</h2><h3>
Luis D. Verde Arregoitia
</h3><figure class="pull-left"><img src="https://i2.wp.com/ropensci.org/img/team/luis-verde.jpeg?w=250&#038;ssl=1"
alt="Profile photo of Luis D. Verde Arregoitia"  data-recalc-dims="1"><figcaption>
<p><strong>Luis D. Verde Arregoitia<br/>Instituto de Ecología AC &#8211; INECOL<br/>LatinR, rOpenSci, The Carpentries</strong></p>
</figcaption>
</figure>
<p>Hi, I’m Luis D. Verde Arregoitia, a Mexican living in Xalapa, Mexico. A biologist with a PhD in Biological Sciences, I am a mammal specialist with experience in R programming for data analysis, visualization and statistical modeling. I am also a certified instructor and author of several packages.</p>
<p>I was a mentor in two previous cohorts of the program, where I supported software developers in Latin America, and I return to this new cohort with much enthusiasm.</p>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<h3>
Pao Corrales
</h3><figure class="pull-right"><img src="https://i1.wp.com/ropensci.org/img/team/paola-corrales.png?w=250&#038;ssl=1"
alt="Profile photo of Pao Corrales"  data-recalc-dims="1"><figcaption>
<p><strong>Pao Corrales<br/>Australian National University &#038; 21st century weather CoE <br/> R-Ladies, LatinR, rOpenSci, The Carpentries, RForwards </strong></p>
</figcaption>
</figure>
<p>I have a PhD in Atmospheric Sciences from the University of Buenos Aires (Argentina) and am currently working in Australia at the <em>21st Century Weather Centre</em> as a research software engineer.</p>
<p>I actively participate in R-Ladies, R Forwards, The Carpentries, LatinR and rOpenSci, learning and sharing knowledge about R in the community. In 2023 I participated in the Champions Program as a Champion, submitting the agroclimate package to the rOpenSci peer review process. I learned a lot and connected with people from all over the world. It was an excellent experience!</p>
<p>I am passionate about teaching and helping other people grow in what they do, access new opportunities and develop professionally and as individuals. I am very excited to participate again this year as a mentor in the Latin America Champions Program.</p>
<h3>
Francisco Cardozo
</h3><figure class="pull-right"><img src="https://i2.wp.com/ropensci.org/img/team/francisco-cardozo.jpg?w=250&#038;ssl=1"
alt="Profile photo of Francisco Cardozo"  data-recalc-dims="1"><figcaption>
<p><strong>Francisco Cardozo<br/>[University affiliation] <br/> rOpenSci &#8211; The Carpentries</strong></p>
</figcaption>
</figure>
<p>My name is Francisco Cardozo. I am originally from Colombia and came to the United States to pursue my doctoral studies. I am currently working at the University of Miami as a postdoctoral researcher in the IMPAC research center, an institution dedicated to advancing our understanding of adolescent development. I have participated in the Champions Program on several occasions. Much of my professional work has focused on research design and the application of statistical methods, particularly through the use of the R software environment.</p>
<br/>
<br/>
<br/>
<br/>
<h3>
Milagros Mendoza
</h3><figure class="pull-left"><img src="https://i1.wp.com/ropensci.org/img/team/milagros-mendoza.jpeg?w=250&#038;ssl=1"
alt="Milagros Mendoza&#39;s Profile Picture "  data-recalc-dims="1"><figcaption>
<p><strong>Milagros Mendoza <br/> Universidade Federal Rural de Pernambuco<br/> R-Ladies Natal, rOpenSci</strong></p>
</figcaption>
</figure>
<p>Hello, my name is Milagros. I am an ecologist and statistician driven by a desire to understand the complex systems that intertwine nature, society, and data. Throughout my career, I have worked with interdisciplinary data in the fields of climate, demography, and ecology, always striving to translate that data into knowledge that engages with reality and contributes to more informed decision-making. I am currently pursuing a postdoctoral fellowship at the Vale Institute of Technology in Brazil, where I am part of the research group on territories and natural resources.</p>
<p>I decided to serve as a mentor at rOpenSci because I am motivated to help more people develop confidence in using scientific tools, strengthen their critical thinking, and actively engage within the academic community. In this sense, I view mentoring as a learning space focused on dialogue and mutual growth.</p>
<h3>
Elio Campitelli
</h3><figure class="pull-left"><img src="https://i1.wp.com/ropensci.org/img/team/elio-campitelli.jpg?w=250&#038;ssl=1"
alt="Profile photo of Elio Campitelli"  data-recalc-dims="1"><figcaption>
<p><strong>Elio Campitelli <br/> Monash University &#8211; rOpenSci</strong></p>
</figcaption>
</figure>
<p>I am from Argentina but two years ago I moved to Australia because it is the only other country that starts with A and uses the same type of plug.</p>
<p>I am doing a postdoc at Monash University researching interactions between Antarctic sea ice and the atmosphere.</p>
<p>I have been a mentor to previous cohorts of the program. It was a great experience that I want to repeat once more.</p>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<h2>
What’s next
</h2><p>We are happy to have this diverse and talented team of mentors, who embody the values of collaboration and commitment to collective growth. Their support will be key to helping the new Champions move their ideas and projects forward and contribute to the development of a stronger and more diverse open science community.</p>
<p>The selection of Champions is now complete, and we’ll be announcing them soon.</p>
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://ropensci.org/blog/2026/05/06/mentors-2026/"> rOpenSci - open tools for open science</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/new-mentoring-team-with-experienced-mentors-and-new-voices/">New Mentoring Team with Experienced Mentors and New Voices</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">401076</post-id>	</item>
		<item>
		<title>Differential Machine Learning with Twin Networks in R: Forecasting Bitcoin with Volatility Proxies</title>
		<link>https://www.r-bloggers.com/2026/05/differential-machine-learning-with-twin-networks-in-r-forecasting-bitcoin-with-volatility-proxies/</link>
		
		<dc:creator><![CDATA[Selcuk Disci]]></dc:creator>
		<pubDate>Tue, 05 May 2026 14:04:43 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">http://datageeek.com/?p=11991</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> Introduction Differential Machine Learning (DML), as introduced in the recent arXiv paper (Differential Machine Learning for 0DTE Options with Stochastic Volatility and Jumps), extends supervised learning by incorporating not only function values but also their derivatives. In financial contexts, this often means sensitivities such as Greeks. However, when direct derivatives ...</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/differential-machine-learning-with-twin-networks-in-r-forecasting-bitcoin-with-volatility-proxies/">Differential Machine Learning with Twin Networks in R: Forecasting Bitcoin with Volatility Proxies</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://datageeek.com/2026/05/05/differential-machine-learning-with-twin-networks-in-r-forecasting-bitcoin-with-volatility-proxies/"> DataGeeek</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>

<h2 class="wp-block-heading">Introduction</h2>



<p class="wp-block-paragraph">Differential Machine Learning (DML), as introduced in the recent <strong><em><a href="https://arxiv.org/html/2603.07600v1" rel="nofollow" target="_blank">arXiv paper (Differential Machine Learning for 0DTE Options with Stochastic Volatility and Jumps)</a></em></strong>, extends supervised learning by incorporating not only function values but also their derivatives. In financial contexts, this often means sensitivities such as Greeks. However, when direct derivatives are unavailable, we can approximate market dynamics using <strong>volatility indicators</strong>.</p>



<p class="wp-block-paragraph">In this project, we adapt DML to Bitcoin price forecasting. Instead of derivatives, we use <strong>RSI, MACD, and Bollinger Bands</strong> as proxies for volatility. These indicators capture momentum, trend strength, and price dispersion, providing a practical way to embed uncertainty into the learning process. To implement this, we design a <strong>twin-network architecture</strong> in Keras: one network learns price dynamics from time-based features, while the other learns volatility signals. Finally, we combine them via a stacking ensemble to achieve robust forecasts with confidence intervals.</p>



<h2 class="wp-block-heading">Why Volatility Variables Instead of Derivatives?</h2>



<ul class="wp-block-list">
<li><strong>RSI (Relative Strength Index)</strong>: Measures momentum and overbought/oversold conditions.</li>



<li><strong>MACD (Moving Average Convergence Divergence)</strong>: Captures trend direction and strength.</li>



<li><strong>Bollinger Bands (upper/lower bands, %B)</strong>: Quantifies price dispersion and volatility.</li>
</ul>
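<p class="wp-block-paragraph">All three indicators are available in the <code>TTR</code> package. The sketch below is illustrative rather than the post's exact code, and assumes <code>close</code> is a numeric vector of daily BTC closing prices:</p>

```r
# Compute the three volatility proxies with TTR (toy prices for illustration)
library(TTR)

set.seed(1)
close <- cumsum(rnorm(200, 0, 50)) + 30000   # hypothetical BTC closing prices

rsi  <- RSI(close, n = 14)                            # momentum / overbought-oversold
macd <- MACD(close, nFast = 12, nSlow = 26, nSig = 9) # columns: macd, signal
bb   <- BBands(close, n = 20, sd = 2)                 # columns: dn, mavg, up, pctB

# Assemble the proxy features that stand in for derivatives
features <- data.frame(rsi  = rsi,
                       macd = macd[, "macd"],
                       pctB = bb[, "pctB"])
head(na.omit(features))
```

The leading rows are `NA` until each indicator's lookback window fills, so they are dropped before training.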



<p class="wp-block-paragraph">These indicators act as empirical substitutes for theoretical derivatives. While DML in its pure form requires sensitivities, in practice, these volatility proxies provide similar information about how prices respond to market forces.</p>



<h3 class="wp-block-heading">Why Twin Networks?</h3>



<p class="wp-block-paragraph">The idea is to separate the learning tasks:</p>



<ul class="wp-block-list">
<li>The <strong>primary network</strong> models the continuous component of the price process.</li>



<li>The <strong>auxiliary network</strong> models the volatility/jump component. Together, they mimic the decomposition found in stochastic models such as Bates or Heston, but implemented within a flexible neural framework.</li>
</ul>



<h2 class="wp-block-heading">Ensemble via Stacking</h2>



<p class="wp-block-paragraph">Once both networks are trained, their predictions are combined using a <strong>linear regression meta-model</strong>. This stacking ensemble learns the optimal weighting between the primary and auxiliary outputs. The result is a forecast that integrates both trend and volatility signals, significantly improving accuracy compared to either network alone.</p>



<h2 class="wp-block-heading">Evaluation</h2>



<figure data-wp-context="{"imageId":"69f9f92c89a12"}" data-wp-interactive="core/image" data-wp-key="69f9f92c89a12" class="wp-block-image size-large wp-lightbox-container"><img loading="lazy" data-attachment-id="12020" data-permalink="https://datageeek.com/2026/05/05/differential-machine-learning-with-twin-networks-in-r-forecasting-bitcoin-with-volatility-proxies/image-132/" data-orig-file="https://datageeek.com/wp-content/uploads/2026/05/image-1.png" data-orig-size="1012,353" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="image" data-image-description="" data-image-caption="" data-large-file="https://i1.wp.com/datageeek.com/wp-content/uploads/2026/05/image-1.png?w=450&#038;ssl=1" data-wp-class--hide="state.isContentHidden" data-wp-class--show="state.isContentVisible" data-wp-init="callbacks.setButtonStyles" data-wp-on--click="actions.showLightbox" data-wp-on--load="callbacks.setButtonStyles" data-wp-on--pointerdown="actions.preloadImage" data-wp-on--pointerenter="actions.preloadImageWithDelay" data-wp-on--pointerleave="actions.cancelPreload" data-wp-on-window--resize="callbacks.setButtonStyles" src="https://i1.wp.com/datageeek.com/wp-content/uploads/2026/05/image-1.png?w=450&#038;ssl=1" alt="" class="wp-image-12020" srcset_temp="https://datageeek.com/wp-content/uploads/2026/05/image-1.png 1012w, https://datageeek.com/wp-content/uploads/2026/05/image-1.png?w=150 150w, https://datageeek.com/wp-content/uploads/2026/05/image-1.png?w=300 300w, https://datageeek.com/wp-content/uploads/2026/05/image-1.png?w=768 768w" sizes="(max-width: 1012px) 100vw, 1012px" data-recalc-dims="1" /><button
			class="lightbox-trigger"
			type="button"
			aria-haspopup="dialog"
			data-wp-bind--aria-label="state.thisImage.triggerButtonAriaLabel"
			data-wp-init="callbacks.initTriggerButton"
			data-wp-on--click="actions.showLightbox"
			data-wp-style--right="state.thisImage.buttonRight"
			data-wp-style--top="state.thisImage.buttonTop"
		>
			<svg xmlns="http://www.w3.org/2000/svg" width="12" height="12" fill="none" viewBox="0 0 12 12">
				<path fill="#fff" d="M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z" />
			</svg>
		</button></figure>



<ul class="wp-block-list">
<li>Metrics: RMSE and MAPE, computed with the <code>yardstick</code> package.</li>



<li>Results:
<ul class="wp-block-list">
<li>Individual networks → RMSE ~76,000, MAPE ~99%.</li>



<li>Stacking ensemble → RMSE ~3,030, MAPE ~3.65%.</li>
</ul>
</li>
</ul>



<p class="wp-block-paragraph">This demonstrates the power of combining price and volatility signals in a unified framework.</p>



<h2 class="wp-block-heading">Confidence Intervals</h2>



<p class="wp-block-paragraph">To quantify uncertainty, we compute <strong>residual-based confidence intervals</strong> around the point forecasts:</p>



<p class="wp-block-paragraph"><math display="block"><mrow><msub><mover accent="true"><mi>y</mi><mo>^</mo></mover><mi>t</mi></msub><mo>±</mo><mn>1.96</mn><mo>⋅</mo><msub><mi>σ</mi><mtext>residuals</mtext></msub></mrow></math></p>



<p class="wp-block-paragraph">This approach uses the standard deviation of training residuals to generate 95% confidence bands. It provides interpretable uncertainty estimates without requiring explicit probabilistic modeling.</p>



<h2 class="wp-block-heading">Visualization</h2>



<p class="wp-block-paragraph">The forecasts are visualized with <code>ggplot2</code>:</p>



<ul class="wp-block-list">
<li><strong>Grey ribbon</strong> → confidence intervals.</li>



<li><strong>Red line</strong> → stacking ensemble forecast.</li>



<li><strong>Black line</strong> → actual BTC prices.</li>
</ul>
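<p class="wp-block-paragraph">The three layers above map directly onto <code>geom_ribbon</code> and two <code>geom_line</code> calls. A minimal sketch with a hypothetical <code>plot_df</code> (toy data in place of the real forecasts):</p>

```r
library(ggplot2)

# Toy data frame standing in for dates, actuals, and the ensemble forecast
set.seed(3)
plot_df <- data.frame(
  date   = seq(as.Date("2026-04-01"), by = "day", length.out = 30),
  actual = cumsum(rnorm(30, 0, 300)) + 60000
)
plot_df$forecast <- plot_df$actual + rnorm(30, sd = 500)
plot_df$lower    <- plot_df$forecast - 2000
plot_df$upper    <- plot_df$forecast + 2000

ggplot(plot_df, aes(date)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "grey80") +  # confidence band
  geom_line(aes(y = actual), color = "black") +                    # actual BTC prices
  geom_line(aes(y = forecast), color = "red") +                    # ensemble forecast
  labs(x = NULL, y = "BTC price (USD)")
```

Drawing the ribbon first keeps both lines visible on top of the grey band.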



<figure data-wp-context="{"imageId":"69f9f92c8a8d9"}" data-wp-interactive="core/image" data-wp-key="69f9f92c8a8d9" class="wp-block-image size-large wp-lightbox-container"><img loading="lazy" data-attachment-id="12021" data-permalink="https://datageeek.com/2026/05/05/differential-machine-learning-with-twin-networks-in-r-forecasting-bitcoin-with-volatility-proxies/image-133/" data-orig-file="https://datageeek.com/wp-content/uploads/2026/05/image-2.png" data-orig-size="1673,592" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="image" data-image-description="" data-image-caption="" data-large-file="https://i1.wp.com/datageeek.com/wp-content/uploads/2026/05/image-2.png?w=450&#038;ssl=1" data-wp-class--hide="state.isContentHidden" data-wp-class--show="state.isContentVisible" data-wp-init="callbacks.setButtonStyles" data-wp-on--click="actions.showLightbox" data-wp-on--load="callbacks.setButtonStyles" data-wp-on--pointerdown="actions.preloadImage" data-wp-on--pointerenter="actions.preloadImageWithDelay" data-wp-on--pointerleave="actions.cancelPreload" data-wp-on-window--resize="callbacks.setButtonStyles" src="https://i1.wp.com/datageeek.com/wp-content/uploads/2026/05/image-2.png?w=450&#038;ssl=1" alt="" class="wp-image-12021" srcset_temp="https://i1.wp.com/datageeek.com/wp-content/uploads/2026/05/image-2.png?w=450&#038;ssl=1 1024w, https://datageeek.com/wp-content/uploads/2026/05/image-2.png?w=150 150w, https://datageeek.com/wp-content/uploads/2026/05/image-2.png?w=300 300w, https://datageeek.com/wp-content/uploads/2026/05/image-2.png?w=768 768w, https://datageeek.com/wp-content/uploads/2026/05/image-2.png?w=1440 1440w, https://datageeek.com/wp-content/uploads/2026/05/image-2.png 1673w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /><button
			class="lightbox-trigger"
			type="button"
			aria-haspopup="dialog"
			data-wp-bind--aria-label="state.thisImage.triggerButtonAriaLabel"
			data-wp-init="callbacks.initTriggerButton"
			data-wp-on--click="actions.showLightbox"
			data-wp-style--right="state.thisImage.buttonRight"
			data-wp-style--top="state.thisImage.buttonTop"
		>
			<svg xmlns="http://www.w3.org/2000/svg" width="12" height="12" fill="none" viewBox="0 0 12 12">
				<path fill="#fff" d="M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z" />
			</svg>
		</button></figure>



<p class="wp-block-paragraph">This design clearly communicates both the central forecast and the uncertainty range. The chart you will include at the end of the blog shows exactly this: a red forecast line, black actuals, and a grey confidence band, illustrating how the ensemble integrates volatility information into predictive intervals.</p>



<figure data-wp-context="{"imageId":"69f9f92c8b334"}" data-wp-interactive="core/image" data-wp-key="69f9f92c8b334" class="wp-block-image size-large wp-lightbox-container"><img loading="lazy" data-attachment-id="12007" data-permalink="https://datageeek.com/2026/05/05/differential-machine-learning-with-twin-networks-in-r-forecasting-bitcoin-with-volatility-proxies/btc_dml/" data-orig-file="https://datageeek.com/wp-content/uploads/2026/05/btc_dml.png" data-orig-size="1112,646" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="btc_dml" data-image-description="" data-image-caption="" data-large-file="https://i2.wp.com/datageeek.com/wp-content/uploads/2026/05/btc_dml.png?w=450&#038;ssl=1" data-wp-class--hide="state.isContentHidden" data-wp-class--show="state.isContentVisible" data-wp-init="callbacks.setButtonStyles" data-wp-on--click="actions.showLightbox" data-wp-on--load="callbacks.setButtonStyles" data-wp-on--pointerdown="actions.preloadImage" data-wp-on--pointerenter="actions.preloadImageWithDelay" data-wp-on--pointerleave="actions.cancelPreload" data-wp-on-window--resize="callbacks.setButtonStyles" src="https://i2.wp.com/datageeek.com/wp-content/uploads/2026/05/btc_dml.png?w=450&#038;ssl=1" alt="" class="wp-image-12007" srcset_temp="https://i2.wp.com/datageeek.com/wp-content/uploads/2026/05/btc_dml.png?w=450&#038;ssl=1 1024w, https://datageeek.com/wp-content/uploads/2026/05/btc_dml.png?w=150 150w, https://datageeek.com/wp-content/uploads/2026/05/btc_dml.png?w=300 300w, https://datageeek.com/wp-content/uploads/2026/05/btc_dml.png?w=768 768w, https://datageeek.com/wp-content/uploads/2026/05/btc_dml.png 1112w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /><button
			class="lightbox-trigger"
			type="button"
			aria-haspopup="dialog"
			data-wp-bind--aria-label="state.thisImage.triggerButtonAriaLabel"
			data-wp-init="callbacks.initTriggerButton"
			data-wp-on--click="actions.showLightbox"
			data-wp-style--right="state.thisImage.buttonRight"
			data-wp-style--top="state.thisImage.buttonTop"
		>
			<svg xmlns="http://www.w3.org/2000/svg" width="12" height="12" fill="none" viewBox="0 0 12 12">
				<path fill="#fff" d="M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z" />
			</svg>
		</button></figure>



<h2 class="wp-block-heading">Keras3 in R: Flexible Deep Learning for Financial Forecasting</h2>



<h3 class="wp-block-heading">What is Keras3?</h3>



<p class="wp-block-paragraph"><strong><em><a href="https://keras3.posit.co/" rel="nofollow" target="_blank">Keras3</a></em></strong> is the modern R interface to the Keras deep learning library, built on top of TensorFlow. It allows R users to define, train, and evaluate neural networks with concise syntax while leveraging TensorFlow’s computational power. Unlike earlier versions, Keras3 is fully aligned with TensorFlow 2.x, ensuring long-term support and compatibility.</p>



<h3 class="wp-block-heading">How We Used Keras3</h3>



<p class="wp-block-paragraph">In our workflow, Keras3 was the backbone for implementing the <strong>twin-network architecture</strong>:</p>



<figure data-wp-context="{"imageId":"69f9f92c8bffa"}" data-wp-interactive="core/image" data-wp-key="69f9f92c8bffa" class="wp-block-image size-large wp-lightbox-container"><img loading="lazy" data-attachment-id="12017" data-permalink="https://datageeek.com/2026/05/05/differential-machine-learning-with-twin-networks-in-r-forecasting-bitcoin-with-volatility-proxies/image-131/" data-orig-file="https://datageeek.com/wp-content/uploads/2026/05/image.png" data-orig-size="1064,654" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0","alt":""}" data-image-title="image" data-image-description="" data-image-caption="" data-large-file="https://i2.wp.com/datageeek.com/wp-content/uploads/2026/05/image.png?w=450&#038;ssl=1" data-wp-class--hide="state.isContentHidden" data-wp-class--show="state.isContentVisible" data-wp-init="callbacks.setButtonStyles" data-wp-on--click="actions.showLightbox" data-wp-on--load="callbacks.setButtonStyles" data-wp-on--pointerdown="actions.preloadImage" data-wp-on--pointerenter="actions.preloadImageWithDelay" data-wp-on--pointerleave="actions.cancelPreload" data-wp-on-window--resize="callbacks.setButtonStyles" src="https://i2.wp.com/datageeek.com/wp-content/uploads/2026/05/image.png?w=450&#038;ssl=1" alt="" class="wp-image-12017" srcset_temp="https://i2.wp.com/datageeek.com/wp-content/uploads/2026/05/image.png?w=450&#038;ssl=1 1024w, https://datageeek.com/wp-content/uploads/2026/05/image.png?w=150 150w, https://datageeek.com/wp-content/uploads/2026/05/image.png?w=300 300w, https://datageeek.com/wp-content/uploads/2026/05/image.png?w=768 768w, https://datageeek.com/wp-content/uploads/2026/05/image.png 1064w" sizes="(max-width: 1024px) 100vw, 1024px" data-recalc-dims="1" /><button
			class="lightbox-trigger"
			type="button"
			aria-haspopup="dialog"
			data-wp-bind--aria-label="state.thisImage.triggerButtonAriaLabel"
			data-wp-init="callbacks.initTriggerButton"
			data-wp-on--click="actions.showLightbox"
			data-wp-style--right="state.thisImage.buttonRight"
			data-wp-style--top="state.thisImage.buttonTop"
		>
			<svg xmlns="http://www.w3.org/2000/svg" width="12" height="12" fill="none" viewBox="0 0 12 12">
				<path fill="#fff" d="M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z" />
			</svg>
		</button></figure>



<h3 class="wp-block-heading">Why ReLU?</h3>



<ul class="wp-block-list">
<li><strong>ReLU (Rectified Linear Unit)</strong> is the activation function used in hidden layers.</li>



<li>Formula: <math><mrow><mtext>ReLU</mtext><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo>=</mo><mi>max</mi><mo>⁡</mo><mo stretchy="false">(</mo><mn>0</mn><mo separator="true">,</mo><mi>x</mi><mo stretchy="false">)</mo></mrow></math>.</li>



<li>Benefits:
<ul class="wp-block-list">
<li>Introduces non-linearity, enabling the network to learn complex relationships.</li>



<li>Efficient and helps avoid vanishing gradients.</li>



<li>Well-suited for financial data where signals can be sparse and directional.</li>
</ul>
</li>
</ul>
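<p class="wp-block-paragraph">As a quick illustration in plain R (independent of Keras), ReLU just clips negative inputs to zero, and its gradient is 1 for positive inputs and 0 otherwise:</p>

```r
# ReLU in base R: element-wise max(0, x)
relu <- function(x) pmax(0, x)

relu(c(-2, -0.5, 0, 1.5, 3))
#> [1] 0.0 0.0 0.0 1.5 3.0

# Subgradient: 1 where x > 0, 0 elsewhere; it never saturates
# for positive inputs, which helps against vanishing gradients.
relu_grad <- function(x) as.numeric(x > 0)
```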



<h3 class="wp-block-heading">Why Adam?</h3>



<ul class="wp-block-list">
<li><strong>Adam (Adaptive Moment Estimation)</strong> is the optimizer chosen.</li>



<li>Combines <strong>momentum</strong> (using past gradients to accelerate learning) and <strong>adaptive learning rates</strong> (adjusting step sizes per parameter).</li>



<li>Benefits:
<ul class="wp-block-list">
<li>Robust for noisy, non-stationary data like cryptocurrency prices.</li>



<li>Requires minimal tuning, making it ideal for plug-and-play workflows.</li>



<li>Widely adopted in both academic and applied machine learning.</li>
</ul>
</li>
</ul>
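<p class="wp-block-paragraph">In keras3, both choices amount to a single argument each. A minimal sketch (the layer sizes and input width here are illustrative, not the post’s actual twin-network architecture, and it assumes keras3 with a working TensorFlow backend):</p>

```r
library(keras3)

# Illustrative sizes only: 8 input features, two ReLU hidden layers
model <- keras_model_sequential(input_shape = 8) |>
  layer_dense(units = 64, activation = "relu") |>
  layer_dense(units = 32, activation = "relu") |>
  layer_dense(units = 1)  # linear output for regression

model |> compile(
  optimizer = optimizer_adam(learning_rate = 0.001),  # Adam defaults usually work well
  loss = "mse"
)
```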



<h3 class="wp-block-heading">Contribution to the R Ecosystem</h3>



<p class="wp-block-paragraph">Keras3 bridges the gap between R’s <strong>tidyverse/tidymodels ecosystem</strong> and modern deep learning:</p>



<ul class="wp-block-list">
<li>Integrates seamlessly with data preprocessing pipelines (<code>recipes</code>, <code>timetk</code>).</li>



<li>Allows financial analysts and data scientists to stay within R while accessing TensorFlow’s deep learning capabilities.</li>



<li>Encourages reproducibility: models can be defined, trained, and evaluated entirely in R, without switching to Python.</li>



<li>Expands R’s role beyond traditional statistical modeling into <strong>state-of-the-art AI applications</strong>.</li>
</ul>



<h2 class="wp-block-heading">Why It Matters for DML</h2>



<p class="wp-block-paragraph">By using Keras3:</p>



<ul class="wp-block-list">
<li>We could <strong>separate learning tasks</strong> into a primary network (trend/seasonality) and an auxiliary network (volatility/momentum).</li>



<li>Both networks were trained with ReLU activations and Adam optimization, ensuring stability and efficiency.</li>



<li>Their outputs were combined in a stacking ensemble, yielding forecasts that integrate both price dynamics and volatility signals.</li>
</ul>



<p class="wp-block-paragraph">This demonstrates how Keras3 empowers R users to implement advanced architectures like twin networks, making Differential Machine Learning concepts practical in financial forecasting.</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p class="wp-block-paragraph">This case study demonstrates how Differential Machine Learning concepts can be adapted for financial forecasting in R:</p>



<ul class="wp-block-list">
<li>Volatility indicators serve as practical substitutes for derivatives.</li>



<li>A twin-network architecture in Keras captures both trend and volatility.</li>



<li>Stacking ensembles significantly improve predictive performance.</li>



<li>Residual-based confidence intervals provide interpretable uncertainty estimates.</li>
</ul>
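<p class="wp-block-paragraph">The last point can be sketched in a few lines of base R. Under an assumption of roughly i.i.d. residuals, a normal-approximation 95% interval around a point forecast looks like this (residual values and the forecast are simulated purely for illustration):</p>

```r
set.seed(42)
resids <- rnorm(200, mean = 0, sd = 250)  # stand-in for model residuals
point_forecast <- 65000                   # hypothetical point forecast

half_width <- qnorm(0.975) * sd(resids)   # ~1.96 residual standard deviations
c(lower = point_forecast - half_width,
  upper = point_forecast + half_width)
```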



<p class="wp-block-paragraph">By combining academic ideas with reproducible R workflows, we can build robust forecasting pipelines that bridge theory and practice.</p>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://datageeek.com/2026/05/05/differential-machine-learning-with-twin-networks-in-r-forecasting-bitcoin-with-volatility-proxies/"> DataGeeek</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/differential-machine-learning-with-twin-networks-in-r-forecasting-bitcoin-with-volatility-proxies/">Differential Machine Learning with Twin Networks in R: Forecasting Bitcoin with Volatility Proxies</a>]]></content:encoded>
					
		
		<enclosure url="https://datageeek.com/wp-content/uploads/2026/05/btc_dml.png" length="0" type="" />
<enclosure url="https://1.gravatar.com/avatar/db5e3f9ef188ea98fe38ab05c5a3fad9fb52fe3472715a8fc02f7ea41731f77c?s=96&#038;d=identicon&#038;r=G" length="0" type="" />
<enclosure url="https://datageeek.com/wp-content/uploads/2026/05/image-1.png?w=1012" length="0" type="" />
<enclosure url="https://datageeek.com/wp-content/uploads/2026/05/image-2.png?w=1024" length="0" type="" />
<enclosure url="https://datageeek.com/wp-content/uploads/2026/05/btc_dml.png?w=1024" length="0" type="" />
<enclosure url="https://datageeek.com/wp-content/uploads/2026/05/image.png?w=1024" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">401060</post-id>	</item>
		<item>
		<title>Setting function parameters for debugging</title>
		<link>https://www.r-bloggers.com/2026/05/setting-function-parameters-for-debugging/</link>
		
		<dc:creator><![CDATA[Jason Bryer]]></dc:creator>
		<pubDate>Tue, 05 May 2026 04:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://bryer.org/posts/2026-05-05-Setting_Function_Parameters_for_Debugging.html</guid>

					<description><![CDATA[<p>I tend to write a lot of functions that create specific graphics implemented with ggplot2. Although I try to pick graphic parameters (e.g. colors, text size, etc.) that are reasonable, I will typically define all relevant aesthetics as param...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/setting-function-parameters-for-debugging/">Setting function parameters for debugging</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://bryer.org/posts/2026-05-05-Setting_Function_Parameters_for_Debugging.html"> Jason Bryer</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
 




<p>I tend to write a lot of functions that create specific graphics implemented with <a href="https://ggplot2.tidyverse.org/" rel="nofollow" target="_blank"><code>ggplot2</code></a>. Although I try to pick graphic parameters (e.g. colors, text size) that are reasonable, I will typically define all relevant aesthetics as parameters to my function. As a result, my functions tend to have a lot of parameters. When I need to debug such a function, I need all of those parameters set in the global environment, which usually means highlighting each assignment and running it by hand. The function below automates that process. You can pass it any function and it will attempt to assign each parameter’s default value in the given environment (the global environment by default). It returns a data frame with a column indicating whether each variable was set, along with its value, which makes it easy to spot parameters that have no default and must be set yourself.</p>
<div class="cell">
<pre>#' Set function parameters to an environment.
#'
#' This function is designed to help debug functions. It will attempt to set all
#' the default parameter values to the specified environment (global environment
#' by default). This is useful for when you want to execute code within the 
#' function definition interactively but need the parameters set in the current 
#' environment.
#'
#' **Warning:** This function will modify the global environment and therefore 
#' violates CRAN policy
#' [&quot;Packages should not modify the global environment (user’s workspace)&quot;]
#' (https://cran.r-project.org/web/packages/policies.html#Source-packages).
#'
#' @param FUN the function to assign parameters to an environment.
#' @param envir the environment to assign the variables to. Defaults to the 
#'        global environment.
#' @param verbose whether to return the data frame invisibly or to print the results.
#' @return a data frame where row names correspond to the parameter name with 
#'        two columns: `set` which is logical indicating if the variable was set 
#'        and `value` with a character representation of the variable value.
set_function_params &lt;- function(FUN, envir = globalenv(), verbose = interactive()) {
    params &lt;- formals(FUN)
    params_set &lt;- data.frame(row.names = names(params),
                             set = rep(FALSE, length(params)),
                             value = rep(NA_character_, length(params)))
    for(param in names(params)) {
        value &lt;- params[[param]]
        if(!missing(value)) {
            if(is.character(value)) {
                assign(param, value, envir = envir)
                params_set[param,]$value &lt;- value
            } else {
                assign(param, eval(value), envir = envir)
                params_set[param,]$value &lt;- eval(value)
            }
            params_set[param,]$set &lt;- TRUE
        }
    }
    if(verbose) {
        return(params_set)
    } else {
        invisible(params_set)
    }
}</pre>
</div>
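<p>To see why the loop above can tell which parameters have defaults, it helps to look at what <code>formals()</code> returns: a parameter with no default is bound to the empty symbol, which is what the <code>missing(value)</code> check detects. A small illustration, separate from the function itself:</p>

```r
# formals() returns a pairlist of default values; parameters
# without a default are bound to the empty symbol.
f <- function(a, b = 2, col = "red") NULL
fm <- formals(f)

identical(fm$b, 2)        # TRUE: numeric default
identical(fm$col, "red")  # TRUE: character default
class(fm$a)               # "name": the empty symbol, i.e. no default
as.character(fm$a)        # "" for a parameter with no default
```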
<p>Very recently I was trying to debug a function that creates profile plots for cluster analysis (<a href="https://github.com/jbryer/clav/blob/master/R/profile_plot.R" rel="nofollow" target="_blank"><code>clav::profile_plot()</code></a>, <a href="https://clav.bryer.org/reference/profile_plot.html" rel="nofollow" target="_blank">documentation</a>). This function has 23 parameters! Setting these all manually is pretty tedious.</p>
<div class="cell">
<pre># List objects in the current environment
ls()</pre>
<div class="cell-output cell-output-stdout">
<pre>[1] &quot;set_function_params&quot;</pre>
</div>
<pre># Call the function
param_set_result &lt;- set_function_params(clav::profile_plot)

# Check to see if the parameters are actually set
ls()</pre>
<div class="cell-output cell-output-stdout">
<pre> [1] &quot;bonferroni&quot;          &quot;center_alpha&quot;        &quot;center_band&quot;        
 [4] &quot;center_fill&quot;         &quot;cluster_label_hjust&quot; &quot;color_palette&quot;      
 [7] &quot;hjust&quot;               &quot;label_clusters&quot;      &quot;label_means&quot;        
[10] &quot;label_outcome_means&quot; &quot;label_profile_means&quot; &quot;param_set_result&quot;   
[13] &quot;point_size&quot;          &quot;se_factor&quot;           &quot;set_function_params&quot;
[16] &quot;standardize&quot;         &quot;text_size&quot;           &quot;title&quot;              
[19] &quot;ylab&quot;               </pre>
</div>
</div>
<p>We can examine the data frame which gives a summary of the parameters set (or not).</p>
<div class="cell">
<pre>param_set_result</pre>
<div class="cell-output cell-output-stdout">
<pre>                      set               value
df                  FALSE                &lt;NA&gt;
clusters            FALSE                &lt;NA&gt;
df_dep              FALSE                &lt;NA&gt;
standardize          TRUE                TRUE
bonferroni           TRUE                TRUE
label_means          TRUE                TRUE
label_profile_means  TRUE                TRUE
label_outcome_means  TRUE                TRUE
center_band          TRUE                0.25
center_fill          TRUE             #f0f9e8
center_alpha         TRUE                 0.1
text_size            TRUE                   4
hjust                TRUE                 0.5
point_size           TRUE                   2
se_factor            TRUE                1.96
color_palette        TRUE                   2
cluster_labels      FALSE                &lt;NA&gt;
cluster_order       FALSE                &lt;NA&gt;
label_clusters       TRUE                TRUE
cluster_label_x     FALSE                &lt;NA&gt;
cluster_label_hjust  TRUE                   5
ylab                 TRUE Mean Standard Score
title                TRUE    Cluster Profiles</pre>
</div>
</div>



 
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://bryer.org/posts/2026-05-05-Setting_Function_Parameters_for_Debugging.html"> Jason Bryer</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/setting-function-parameters-for-debugging/">Setting function parameters for debugging</a>]]></content:encoded>
					
		
		<enclosure url="https://bryer.org/posts/2026-05-05-banner.png" length="0" type="image/png" />

		<post-id xmlns="com-wordpress:feed-additions:1">401048</post-id>	</item>
		<item>
		<title>JAGS 5.0.0-beta is available</title>
		<link>https://www.r-bloggers.com/2026/05/jags-5-0-0-beta-is-available/</link>
		
		<dc:creator><![CDATA[Martyn]]></dc:creator>
		<pubDate>Mon, 04 May 2026 17:20:26 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">http://martynplummer.wordpress.com/?p=1992</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> JAGS 5.0.0-beta is now available from SourceForge. The beta release is for two groups of people: Please send feedback via the JAGS forums or file a bug report The JAGS library The following packages are available: The rjags package In … Continue reading →</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/jags-5-0-0-beta-is-available/">JAGS 5.0.0-beta is available</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://martynplummer.wordpress.com/2026/05/04/jags-5-0-0-beta-is-available/"> R – JAGS News</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>

<figure class="wp-block-image size-large"><a href="https://i1.wp.com/martynplummer.wordpress.com/wp-content/uploads/2026/05/img_0083.jpg?ssl=1" rel="nofollow" target="_blank"><img loading="lazy" data-attachment-id="1993" data-permalink="https://martynplummer.wordpress.com/img_0083/" data-orig-file="https://martynplummer.wordpress.com/wp-content/uploads/2026/05/img_0083.jpg" data-orig-size="1500,2000" data-comments-opened="1" data-image-meta="{"aperture":"1.64","credit":"","camera":"iPhone 16e","caption":"","created_timestamp":"1777199209","copyright":"","focal_length":"4.2","iso":"32","shutter_speed":"0.0021052631578947","title":"","orientation":"1","alt":""}" data-image-title="img_0083" data-image-description="" data-image-caption="" data-large-file="https://i1.wp.com/martynplummer.wordpress.com/wp-content/uploads/2026/05/img_0083.jpg?w=450&#038;ssl=1" src="https://i1.wp.com/martynplummer.wordpress.com/wp-content/uploads/2026/05/img_0083.jpg?w=450&#038;ssl=1" alt="" class="wp-image-1993" srcset_temp="https://i1.wp.com/martynplummer.wordpress.com/wp-content/uploads/2026/05/img_0083.jpg?w=450&#038;ssl=1 584w, https://martynplummer.wordpress.com/wp-content/uploads/2026/05/img_0083.jpg?w=1168 1168w, https://martynplummer.wordpress.com/wp-content/uploads/2026/05/img_0083.jpg?w=113 113w, https://martynplummer.wordpress.com/wp-content/uploads/2026/05/img_0083.jpg?w=225 225w, https://martynplummer.wordpress.com/wp-content/uploads/2026/05/img_0083.jpg?w=768 768w" sizes="(max-width: 584px) 100vw, 584px" data-recalc-dims="1" /></a></figure>



<p class="wp-block-paragraph">JAGS 5.0.0-beta is now available from SourceForge. </p>



<p class="wp-block-paragraph">The beta release is for two groups of people:</p>



<ul class="wp-block-list">
<li>People who have written software depending on JAGS, in particular authors of R packages that depend on one of the four interfaces between R and JAGS – rjags, runjags, R2jags, and jagsUI. Currently some of these packages do not pass the CRAN tests with the new version of JAGS; the beta period allows some time to fix these problems before the official release.</li>



<li>Anyone who wants to try out the new version and find problems with it before the official release. </li>
</ul>



<p class="wp-block-paragraph">Please send feedback via the <a href="https://sourceforge.net/p/mcmc-jags/discussion/" rel="nofollow" target="_blank">JAGS forums</a> or file a <a href="https://sourceforge.net/p/mcmc-jags/bugs/" rel="nofollow" target="_blank">bug report</a>.</p>



<h1 class="wp-block-heading">The JAGS library</h1>



<p class="wp-block-paragraph">The following packages are available:</p>



<ul class="wp-block-list">
<li><a href="https://sourceforge.net/projects/mcmc-jags/files/JAGS/5.x/Source/" rel="nofollow" target="_blank">Source tarball</a></li>



<li><a href="https://sourceforge.net/projects/mcmc-jags/files/JAGS/5.x/Windows/" rel="nofollow" target="_blank">Windows binary</a> installer (x86_64)</li>



<li><a href="https://sourceforge.net/projects/mcmc-jags/files/JAGS/5.x/macOS/" rel="nofollow" target="_blank">macOS binary</a> installer
<ul class="wp-block-list">
<li>There is a single macOS installer for both x86_64 and arm64.</li>
</ul>
</li>
</ul>



<h2 class="wp-block-heading">The rjags package</h2>



<p class="wp-block-paragraph">In order to interface to JAGS 5.0.0 from R you will need rjags_5-1. This is not yet available from CRAN because some of the reverse dependencies do not yet work with version 5.0.0 of JAGS. The following packages are provided:</p>



<ul class="wp-block-list">
<li><a href="https://sourceforge.net/projects/mcmc-jags/files/rjags/5/Source/" rel="nofollow" target="_blank">Source tarball</a></li>



<li><a href="https://sourceforge.net/projects/mcmc-jags/files/rjags/5/Source/" rel="nofollow" target="_blank">Windows binary</a> (x86_64)</li>



<li><a href="https://sourceforge.net/projects/mcmc-jags/files/rjags/5/macOS/" rel="nofollow" target="_blank">macOS binaries</a>
<ul class="wp-block-list">
<li>Separate binaries are provided for x86_64 and arm64 and for R version 4.5.3 and 4.6.0.</li>
</ul>
</li>
</ul>




<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://martynplummer.wordpress.com/2026/05/04/jags-5-0-0-beta-is-available/"> R – JAGS News</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/jags-5-0-0-beta-is-available/">JAGS 5.0.0-beta is available</a>]]></content:encoded>
					
		
		<enclosure url="https://martynplummer.wordpress.com/wp-content/uploads/2026/05/img_0083.jpg" length="0" type="" />
<enclosure url="https://0.gravatar.com/avatar/fdc509bd31ae635d89cccbdc64ef09464ea1c20d7858c4089a07ea3bea91b8e3?s=96&#038;d=identicon&#038;r=G" length="0" type="" />
<enclosure url="https://martynplummer.wordpress.com/wp-content/uploads/2026/05/img_0083.jpg?w=584" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">401041</post-id>	</item>
		<item>
		<title>Comparing R&#8217;s {targets} and dbt for Data Engineering</title>
		<link>https://www.r-bloggers.com/2026/05/comparing-rs-targets-and-dbt-for-data-engineering/</link>
		
		<dc:creator><![CDATA[Jonathan Carroll]]></dc:creator>
		<pubDate>Mon, 04 May 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> I’m getting more and more into data engineering these days and having used R for<br />
a long time, I’m seeing a lot of problems that look nail-shaped to my R-shaped<br />
hammer. The available tools to solve those problems exist for (presumably) very<br />
good reasons, so I wanted to ...</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/comparing-rs-targets-and-dbt-for-data-engineering/">Comparing R’s {targets} and dbt for Data Engineering</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/"> rstats on Irregularly Scheduled Programming</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p>I’m getting more and more into data engineering these days and having used R for
a long time, I’m seeing a lot of problems that look nail-shaped to my R-shaped
hammer. The available tools to solve those problems exist for (presumably) very
good reasons, so I wanted to take some time to dig into how to use them and
compare their workflows to what I would otherwise naively do in R.</p>
<p>I should mention here that I’m currently open to data/code-related opportunities
and am actively seeking a new role – if your organisation is looking for someone
aligned with my skillset, please get in touch with me any way you can, e.g.
<a href="mailto:contact@jcarroll.com.au?subject=Work%20opportunity" rel="nofollow" target="_blank">contact at jcarroll.com.au</a>.</p>
<p>I’m a firm believer in “you learn with your hands, not with your eyes” so I wanted
to actually build something. I definitely could spin up Claude Code and have it
produce the entire thing for me – and in a different project I might do that – but
in this case I want to make the mistakes myself so I can learn where the complexity
really lives and where my prior assumptions are misaligned. I did have Claude
(the chat version, not the full coding agent) guide me through the steps to
get this project running, and I did let it clean up my SQL; this project wasn’t
about learning to better optimise my SQL, but understanding exactly what it
produced will help me write a better version on my next iteration.</p>
<p>Thinking of a real-world project I could take for a spin, I decided to build some
ingestion for my personal finances. I’ve used Quickbooks previously which connects
up to my bank and helps categorise personal and business (as a freelance contractor)
expenses. I decided I’ll build my own ‘slowbooks’ processing workflow based on
some manual exports (I don’t think my bank has an API).</p>
<p>Both of the approaches I’ll compare here build on the idea of a <code>Makefile</code> which
connects up commands to run based on dependencies, and only runs what is needed;
if all the input dependencies of a step have not changed, there’s no need to
re-run that step. From what I understand, you could largely get away with just
writing some <code>Makefile</code>s (or the newer implementation
<a href="https://just.systems/man/en/" rel="nofollow" target="_blank"><code>just</code></a>) but these two approaches help to better
structure how that’s constructed.</p>
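<p>On the R side, that Makefile-style dependency graph is exactly what {targets} provides. A minimal hypothetical sketch of a <code>_targets.R</code> (the file path and column names are invented for illustration, not taken from the actual project):</p>

```r
# _targets.R: a hypothetical minimal pipeline; paths and columns are made up.
library(targets)

list(
  # Track the exported CSV itself, so edits to the file invalidate downstream targets
  tar_target(raw_file, "exports/transactions.csv", format = "file"),
  tar_target(raw, read.csv(raw_file)),
  tar_target(by_category, aggregate(amount ~ category, data = raw, FUN = sum))
)
```

<p>Running <code>tar_make()</code> then rebuilds only the targets whose upstream dependencies have changed, much like <code>make</code> skipping up-to-date rules.</p>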
<p>This is a somewhat longer post than some of mine, so here are some quick links to
the sections:</p>
<nav id="TableOfContents">
  <ul>
    <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#dbt" rel="nofollow" target="_blank">dbt</a></li>
    <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#targets" rel="nofollow" target="_blank">{targets}</a></li>
    <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#comparing-workflows" rel="nofollow" target="_blank">Comparing Workflows</a>
      <ul>
        <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#staging---load-data" rel="nofollow" target="_blank">Staging &#8211; Load Data</a></li>
        <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#intermediate---joins-and-enrichment" rel="nofollow" target="_blank">Intermediate &#8211; Joins and Enrichment</a></li>
        <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#marts---summaries-and-outputs" rel="nofollow" target="_blank">Marts &#8211; Summaries and Outputs</a></li>
        <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#tests--validation" rel="nofollow" target="_blank">Tests / Validation</a></li>
        <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#analysis" rel="nofollow" target="_blank">Analysis</a></li>
        <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#the-complete-workflow" rel="nofollow" target="_blank">The Complete Workflow</a></li>
        <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#dag--visualisation--docs" rel="nofollow" target="_blank">DAG / Visualisation / Docs</a></li>
        <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#exploration" rel="nofollow" target="_blank">Exploration</a></li>
      </ul>
    </li>
    <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#comparison" rel="nofollow" target="_blank">Comparison</a></li>
    <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#other-solutions" rel="nofollow" target="_blank">Other Solutions</a></li>
    <li><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#conclusion" rel="nofollow" target="_blank">Conclusion</a></li>
  </ul>
</nav>

<h2 id="dbt">dbt</h2>
<p>One tool that comes up frequently is ‘data build tool’, most commonly referred to
as just dbt, though that full name doesn’t even show up on <a href="https://www.getdbt.com/product/what-is-dbt" rel="nofollow" target="_blank">their website</a>. Started in 2016, it’s
released as a Python package (<a href="https://pypi.org/project/dbt-core/" rel="nofollow" target="_blank">dbt-core</a>)
though if you do try to just install something called ‘dbt’ you get the
<a href="https://pypi.org/project/dbt/" rel="nofollow" target="_blank">cloud CLI</a> tool which isn’t quite the same. Naming
stuff is hard.</p>
<p>It’s a way to write code you can commit, which translates to SQL and performs data
ingestion, processing, transformation, and storage in a structured way with
relationships between various steps in the workflow. It adds macros on top of
plain SQL to make the transformations easier, written in
<a href="https://github.com/pallets/jinja" rel="nofollow" target="_blank">jinja</a>, a template engine which enables
writing something more like Python within SQL.</p>
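<p>As a rough sketch of what that looks like (the model and column names here are
hypothetical, not from my project), a dbt model might use a jinja loop to generate
repetitive SQL:</p>
<pre>-- Hypothetical model: generate one summed column per payment method
{% set payment_methods = [&quot;bank_transfer&quot;, &quot;credit_card&quot;, &quot;gift_card&quot;] %}

select
    order_id,
    {% for method in payment_methods %}
    sum(case when payment_method = &#39;{{ method }}&#39; then amount else 0 end)
        as {{ method }}_amount{% if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref(&#39;raw_payments&#39;) }}
group by 1
</pre><p>At compile time, jinja expands the loop into three plain <code>sum(case ...)</code>
columns before the query ever reaches the database.</p>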
<p><a href="https://www.youtube.com/watch?v=f7_WwFmlslo" rel="nofollow" target="_blank">This episode of Data Science Lab</a>
from Posit walks through an example of using dbt, and while it’s a fantastic
overview of what a project looks like, it can’t answer all of the ‘how would I do
that?’ problems that will come up in a different project.</p>
<p>As they did, I will use DuckDB as the database &#8211; I enjoyed reading through
&#8216;<a href="https://www.manning.com/books/duckdb-in-action" rel="nofollow" target="_blank">DuckDB in Action</a>&#8217; with the
<a href="https://dslc.io/bookclubs.html" rel="nofollow" target="_blank">DSLC.io</a> book club and can definitely see the
advantages over SQLite, which I would previously have reached for in this case.</p>
<p>I installed dbt via <code>uv</code> &#8211; the
<a href="https://docs.getdbt.com/docs/local/install-dbt" rel="nofollow" target="_blank">official instructions</a> use <code>pip</code>
and I&#8217;ve been burned too many times with that tool; <code>uv</code> is much nicer.
Nonetheless, I still encountered Python-related issues: it looks like
<code>dbt</code> doesn&#8217;t yet support Python 3.14, which isn&#8217;t mentioned in their
instructions either. I got it working with these commands, pinning Python 3.12 and
adding the <code>dbt-duckdb</code> extension I plan to use, as well as <code>streamlit</code>
to make a dashboard later</p>
<pre>uv init slowbooks --python 3.12
cd slowbooks
uv add dbt-duckdb duckdb streamlit
</pre><p>I then added a <code>profiles.yml</code> in the project root, defining the database (DuckDB) I
want to produce to store the tables</p>
<pre>slowbooks:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: slowbooks.duckdb
      schema: main
</pre><p>I can then initialise the project with</p>
<pre>uv run dbt init . --skip-profile-setup
</pre><p>This creates the basic project structure, and there’s a lot going on.</p>
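<p>From a fresh <code>dbt init</code>, the generated layout looks roughly like this
(the exact folders and example files vary a little between dbt versions):</p>
<pre>slowbooks/
├── dbt_project.yml
├── models/
│   └── example/
├── analysis/
├── macros/
├── seeds/
├── snapshots/
└── tests/
</pre>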
<p>I also needed to define a dependency in <code>packages.yml</code> so that I could use the
macros</p>
<pre>packages:
  - package: dbt-labs/dbt_utils
    version: [&quot;&gt;=1.0.0&quot;]
</pre><p>and ran</p>
<pre>uv run dbt deps
</pre><p>I put my exported CSVs (several for my transaction/savings accounts and one for
my credit card) in a new <code>raw/</code> folder; my understanding is that the <code>seeds/</code>
folder is for static data, although that’s the folder used in the Posit tutorial
above.</p>
<p>I also ran some pre-processing over my CSVs to categorise the merchants. My bank
provides a &#8216;category&#8217; and &#8216;subcategory&#8217; for each item, but I wanted to be able
to override some of those with more specific definitions so that I could group by
them, e.g. &#8216;total spent on books&#8217;, since I mainly buy those from just a couple of
merchants. This produced a new CSV of patterns, resolved names, and classifications,
since the &#8216;description&#8217; of an item in my transactions might contain, e.g.</p>
<pre>Paypal *FruitShop 0401000000 Au
</pre><p>and I want to identify the ‘FruitShop’ part, so I can match against that pattern.
This <em>is</em> a (fairly) static file (the source data will occasionally be extended),
so that did go into <code>seeds/</code>.</p>
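<p>For illustration, a few (entirely made-up) rows of that seed file might look
like this, with the columns matching those referenced in the models later
(<code>keyword</code>, <code>merchant_name</code>, <code>merchant_category</code>):</p>
<pre>keyword,merchant_name,merchant_category
fruitshop,FruitShop,Groceries
bigbookstore,Big Book Store,Books
coffeeplace,Coffee Place,Dining &amp; Drinks
</pre>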
<h2 id="targets">{targets}</h2>
<p>The whole time I’ve been learning about dbt I’ve had a voice in my head asking
“can’t I just use <a href="https://docs.ropensci.org/targets/" rel="nofollow" target="_blank">{targets}</a>?” Yes, it’s an
R-specific tool, but it does a fantastic job at what it does. It’s not a new
tool at all – <a href="https://ropensci.org/blog/2021/02/03/targets/" rel="nofollow" target="_blank">this post from 2021</a>
demonstrates the power of it, and Miles McBain has been singing the praises of it
since <a href="https://www.youtube.com/watch?v=jU1Zv21GvT4" rel="nofollow" target="_blank">at least as early as 2020</a>
(along with the predecessor {drake}).</p>
<p>Rather than duplicate all of my inputs, I will just keep the {targets} implementation
as a subdirectory of my dbt project and refer to the exact same source files. I
will create a distinct database, though.</p>
<p>Installing {targets}, provided you already have a working R installation, is
as straightforward as</p>
<pre>install.packages(&quot;targets&quot;)
</pre><p>within an R session, be that in RStudio, Positron, Emacs, or a terminal.</p>
<p>As for the rest of the file structure, 100% of the R code here goes into a
<code>_targets.R</code> file &#8211; much cleaner, though that&#8217;s a tradeoff against
separating the different components into their own files.</p>
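<p>For those who haven&#8217;t seen one, the overall shape of a <code>_targets.R</code> file
(sketched here with placeholder names, following the structure described in the
{targets} documentation) is: load the package, set options, define or source helper
functions, then end by returning a list of targets</p>
<pre>library(targets)

# Packages that the targets themselves will need
tar_option_set(packages = c(&quot;dplyr&quot;, &quot;readr&quot;))

# Helper functions are defined here, or sourced from files e.g. in R/

# The pipeline itself: the file must end by returning a list of targets
list(
  tar_target(raw_file, &quot;data/input.csv&quot;, format = &quot;file&quot;),
  tar_target(cleaned, readr::read_csv(raw_file, show_col_types = FALSE))
)
</pre>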
<p><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#TableOfContents" rel="nofollow" target="_blank"><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/2b06.png" alt="⬆" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Table of Contents</a></p>
<h2 id="comparing-workflows">Comparing Workflows</h2>
<p>For the actual processing I’m going to show both dbt and {targets} approaches in
tabsets for switching back and forth.</p>
<p>In dbt, a &#8216;model&#8217; is a <code>select</code> statement producing a table, and models are
conventionally split into three layers of increasingly production-ready data. From
the dbt docs, these are defined as:</p>
<ul>
<li>Staging: Preparing atomic building blocks</li>
<li>Intermediate: Purpose-built transformation steps</li>
<li>Marts: Business-defined entities</li>
</ul>
<p>and I’m trying to stick to that as best as I can.</p>
<h3 id="staging---load-data">Staging &#8211; Load Data</h3>
<p>The first step was to ingest the raw CSV exports into a &#8216;staging&#8217; model. This is where the
initial data loading happens. For this personal project I’ve exported the CSV
files I need, and will do so again in the future, adding them to the same folder
for de-duplication within the pipeline. In a more mature project these might be
read from an API or a connection to a managed database, and both approaches can
easily switch between different ’environments’ (dev, staging, prod, …) without
adjusting much, certainly without having to rename all the dependency labels.</p>
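<p>In dbt, for example, that switch lives in <code>profiles.yml</code>: the profile below
extends the earlier one with a second (purely illustrative) <code>prod</code> output,
selectable with <code>uv run dbt run --target prod</code></p>
<pre>slowbooks:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: slowbooks.duckdb
      schema: main
    prod:
      type: duckdb
      path: /srv/data/slowbooks.duckdb   # illustrative path
      schema: main
</pre>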
<p></p>
<div class="tabset"></div>
<ul>
<li>
<p><span><img src="https://i2.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/dbt.png?fit=578%2C40&#038;ssl=1" height="40 px" data-recalc-dims="1"></span></p>
<p>First I defined the sources in <code>models/staging/sources.yml</code>, leveraging
DuckDB’s <code>read_csv()</code> to read all the CSV files in <code>raw/</code></p>
<pre>version: 2

sources:
  - name: all_raw
    schema: main
    meta:
      external_location: &quot;read_csv(&#39;raw/*.csv&#39;, filename=true, union_by_name=true, header=true)&quot;
    tables:
      - name: all_transactions
        description: &quot;All raw CSV exports&quot;
</pre><p>Reading in these files occurs in a model <code>models/staging/stg_bank.sql</code> and
<code>models/staging/stg_cc.sql</code>, the first of which is</p>
<pre>with source as (
    select * from {{ source(&#39;all_raw&#39;, &#39;all_transactions&#39;) }}
),

filtered as (
    select * from source
    where filename not like &#39;%visa_%&#39;
),

cleaned as (
    select
        -- Parse YYYYMMDD integer date format
        strptime(cast(&quot;Date&quot; as varchar), &#39;%Y%m%d&#39;)::date as date,

        -- Collapse whitespace runs in description
        regexp_replace(trim(&quot;Description&quot;), &#39;\s+&#39;, &#39; &#39;, &#39;g&#39;) as description,

        -- Debit = spend (positive), Credit = refund/income (negative)
        coalesce(&quot;Debit&quot;, 0) - coalesce(&quot;Credit&quot;, 0) as amount_aud,

        &quot;Category&quot;    as raw_category,
        &quot;SubCategory&quot; as raw_subcategory,
        filename      as raw_source
    from filtered
    where &quot;Date&quot; is not null
)

select * from cleaned
</pre><p>A similar model handles the credit card data. I&#8217;ve separated these based on
matching <code>'visa'</code> in the filename, and named my files accordingly.</p>
<p>These are then combined in <code>models/staging/stg_transactions.sql</code>, referencing
each of the dependencies with the <code>ref()</code> macro. A ‘surrogate key’ is created
to uniquely identify rows, so that when I add more data, the duplicates will
drop out. This does mean that any intentional duplicates (double records on
the same date from the same merchant for the same amount – e.g. buying one
ice-cream, dropping it, and buying another) will also be dropped, but I&#8217;m
considering that an edge case and not worrying about it.</p>
<pre>with bank as (
    select * from {{ ref(&#39;stg_bank&#39;) }}
),

cc as (
    select * from {{ ref(&#39;stg_cc&#39;) }}
),

unioned as (
    select * from bank
    union all
    select * from cc
),

with_surrogate_key as (
    select
        {{ dbt_utils.generate_surrogate_key([&#39;date&#39;, &#39;description&#39;, &#39;amount_aud&#39;]) }} as transaction_id,
        date,
        description,
        amount_aud,
        raw_category,
        raw_subcategory,
        raw_source
    from unioned
    where description not ilike &#39;%Internet Withdrawal%&#39; -- drop transfers between accounts
      -- this does include manual payments, but most of these are small
) 

select * from with_surrogate_key
</pre><p>I’ve also stripped out the ‘internet withdrawal’ records as these are mostly
transfers between my own accounts. It also includes manual transfers to e.g.
contractors or even some bills, but dealing with these didn’t seem worth the
effort.</p>
<p>One point worth noting here is that this processing is all in SQL; I definitely
got the feeling after working with this tool that it was made for data folks
who naturally reach for SQL when working with data. Personally, I prefer an
abstraction on top of my SQL, so this felt limiting to me, but tastes will
absolutely differ.</p>
<p>The merchants seed file is loaded (with <code>dbt seed</code>, or as part of
<code>dbt build</code>) as a table named <code>seed_merchants</code>, matching the file name.</p>
</li>
<li>
<p><span><img src="https://i0.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/targets.png?fit=578%2C40&#038;ssl=1" height="40 px" data-recalc-dims="1"></span></p>
<p>The equivalent in {targets} uses <code>tar_target()</code> to identify dependencies and
outputs. I start by identifying the files I want to read in. A
strict comparison would have been to do another <code>grepv()</code> with <code>invert=TRUE</code>,
but <code>setdiff()</code> works nicely here</p>
<pre>RAW_DIR &lt;- &quot;../raw&quot;

file_list &lt;- list.files(RAW_DIR, full.names = TRUE)
cc_list &lt;- grepv(&quot;visa.*\\.csv$&quot;, file_list)
bank_list &lt;- setdiff(file_list, cc_list)
</pre><p>(sidenote: ooh, yeah – I get to use that new <code>grepv()</code> added in R 4.5.0)</p>
<p>Loading the data requires a function, but now we can leverage R and its
abstractions, in this case {dplyr}, {stringr}, and {lubridate}</p>
<pre>stage_source &lt;- function(files) {
  # read_files() is a small helper (defined elsewhere) that reads each CSV
  # and row-binds them with a `filename` column
  read_files(files) |&gt;
    mutate(
      date = ymd(as.character(Date)),
      # Collapse whitespace runs — mirrors regexp_replace(trim(Description), &#39;\s+&#39;, &#39; &#39;, &#39;g&#39;)
      description = str_squish(Description),
      # Debit = spend (positive), Credit = refund/income (negative)
      amount_aud = coalesce(Debit, 0) - coalesce(Credit, 0),
      raw_category = Category,
      raw_subcategory = SubCategory,
      raw_source = filename
    ) |&gt;
    filter(!is.na(Date)) |&gt;
    select(
      date,
      description,
      amount_aud,
      raw_category,
      raw_subcategory,
      raw_source
    )
}
</pre><p>and combining the sources along with filtering out the transfers</p>
<pre>stg_transactions &lt;- function(bank, cc) {
  bind_rows(bank, cc) |&gt;
    # Drop inter-account transfers — mirrors WHERE description NOT ILIKE &#39;%Internet Withdrawal%&#39;
    filter(!str_detect(str_to_lower(description), &quot;internet withdrawal&quot;)) |&gt;
    surrogate_key(c(&quot;date&quot;, &quot;description&quot;, &quot;amount_aud&quot;)) |&gt;
    select(
      transaction_id,
      date,
      description,
      amount_aud,
      raw_category,
      raw_subcategory,
      raw_source
    )
}
</pre><p>The <code>surrogate_key</code> function is something I did have to define, but Claude
happily provided me with an equivalent to what’s in dbt</p>
<pre>surrogate_key &lt;- function(df, cols) {
  df |&gt;
    mutate(
      transaction_id = purrr::pmap_chr(pick(all_of(cols)), \(...) {
        vals &lt;- list(...)
        parts &lt;- purrr::map_chr(seq_along(vals), \(i) {
          v &lt;- vals[[i]]
          if (is.na(v)) &quot;^^NULL^^&quot; else as.character(v)
        })
        digest::digest(
          paste(parts, collapse = &quot;|&quot;),
          algo = &quot;md5&quot;,
          serialize = FALSE
        )
      })
    )
}
</pre><p>With those pieces, plus loading the merchants file, the full pipeline so far is</p>
<pre>list(
  tar_target(cc_files,   cc_list,   format = &quot;file&quot;),
  tar_target(bank_files, bank_list, format = &quot;file&quot;),

  # Staging
  tar_target(stg_bank, stage_source(bank_files)),
  tar_target(stg_cc, stage_source(cc_files)),
  tar_target(stg_txns, stg_transactions(stg_bank, stg_cc))
)
</pre></li>
</ul>
<div style="text-align: right;">
  <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#staging---load-data" rel="nofollow" target="_blank">Top of this section</a> | <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#TableOfContents" rel="nofollow" target="_blank"><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/2b06.png" alt="⬆" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Table of Contents</a>
</div>
<h3 id="intermediate---joins-and-enrichment">Intermediate - Joins and Enrichment</h3>
<p>For this ‘simple’ example there won’t be a lot of difference between an ‘intermediate’
stage and a final ‘mart’ stage, but this is where the merging with the merchant
categories occurs. The transactions from staging are loaded and joined according
to the patterns I’ve defined in the seed file.</p>
<p></p>
<div class="tabset"></div>
<ul>
<li>
<p><span><img src="https://i2.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/dbt.png?fit=578%2C40&#038;ssl=1" height="40 px" data-recalc-dims="1"></span></p>
<p>In this step the transaction items are categorised according to the merchant
file, taking the entry with the most specific pattern match. In
<code>models/intermediate/int_transactions_categorised.sql</code></p>
<pre>with transactions as (
    select * from {{ ref(&#39;stg_transactions&#39;) }}
),

merchants as (
    select * from {{ ref(&#39;seed_merchants&#39;) }}
),

matched as (
    select
        t.*,
        m.keyword,
        m.merchant_name,
        m.merchant_category
    from transactions t
    left join merchants m
        on t.description ilike &#39;%&#39; || m.keyword || &#39;%&#39;

    -- Where multiple keywords match, take the longest (most specific)
    qualify row_number() over (
        partition by t.transaction_id
        order by length(m.keyword) desc
    ) = 1
)

select
    transaction_id,
    date,
    description,
    amount_aud,
    raw_category,
    raw_subcategory,
    raw_source,
    coalesce(merchant_name,     &#39;Unknown&#39;)       as merchant_name,
    coalesce(merchant_category, &#39;Uncategorised&#39;) as merchant_category
from matched
</pre><p>From there, a monthly aggregation table is produced. The <code>date_trunc()</code> feature
makes this fairly clean, and being able to <code>sum()</code> values is nice. In
<code>models/intermediate/int_monthly_balances.sql</code></p>
<pre>select
    date_trunc(&#39;month&#39;, date)::date as month,
    sum(amount_aud) as total_spend_aud,
    count(*) as transaction_count
from {{ ref(&#39;int_transactions_categorised&#39;) }}
group by 1
</pre></li>
<li>
<p><span><img src="https://i0.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/targets.png?fit=578%2C40&#038;ssl=1" height="40 px" data-recalc-dims="1"></span></p>
<p>Rather than relying on SQL&#8217;s pattern matching, R can use the {fuzzyjoin}
package to perform the matching</p>
<pre>categorise_transactions &lt;- function(transactions, merchants) {
  # Mirrors: JOIN merchants ON description ILIKE &#39;%&#39; || keyword || &#39;%&#39;
  matched &lt;- fuzzyjoin::fuzzy_left_join(
    transactions,
    merchants,
    by = c(&quot;description&quot; = &quot;keyword&quot;),
    match_fun = \(x, y) {
      str_detect(str_to_lower(x), str_to_lower(y), negate = FALSE)
    }
  ) |&gt;
    # Where multiple keywords match, prefer the longest — mirrors QUALIFY ROW_NUMBER() OVER (... ORDER BY length(keyword) DESC) = 1
    group_by(transaction_id) |&gt;
    arrange(desc(str_length(keyword)), .by_group = TRUE) |&gt;
    slice(1) |&gt;
    ungroup() |&gt;
    mutate(
      merchant_name = coalesce(merchant_name, &quot;Unknown&quot;),
      merchant_category = coalesce(merchant_category, &quot;Uncategorised&quot;)
    ) |&gt;
    select(
      transaction_id,
      date,
      description,
      amount_aud,
      merchant_name,
      merchant_category,
      raw_category,
      raw_subcategory,
      raw_source
    )

  matched
}
</pre><p>The month aggregations are a bread-and-butter problem for {dplyr}</p>
<pre>monthly_balances &lt;- function(transactions_categorised) {
  transactions_categorised |&gt;
    mutate(month = floor_date(date, &quot;month&quot;)) |&gt;
    group_by(month) |&gt;
    summarise(
      total_spend_aud = sum(amount_aud),
      transaction_count = n(),
      .groups = &quot;drop&quot;
    )
}
</pre><p>and now the pipeline can include those steps</p>
<pre>list(
  tar_target(cc_files,   cc_list,   format = &quot;file&quot;),
  tar_target(bank_files, bank_list, format = &quot;file&quot;),

  tar_target(merchant_file, &quot;../seeds/seed_merchants.csv&quot;, format = &quot;file&quot;),
  tar_target(merchants, readr::read_csv(merchant_file, show_col_types = FALSE)),

  # Staging
  tar_target(stg_bank, stage_source(bank_files)),
  tar_target(stg_cc, stage_source(cc_files)),
  tar_target(stg_txns, stg_transactions(stg_bank, stg_cc)),

  # Intermediate
  tar_target(int_categorised, categorise_transactions(stg_txns, merchants)),
  tar_target(int_monthly, monthly_balances(int_categorised))
)
</pre></li>
</ul>
<div style="text-align: right;">
  <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#intermediate---joins-and-enrichment" rel="nofollow" target="_blank">Top of this section</a> | <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#TableOfContents" rel="nofollow" target="_blank"><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/2b06.png" alt="⬆" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Table of Contents</a>
</div>
<h3 id="marts---summaries-and-outputs">Marts - Summaries and Outputs</h3>
<p>I could create some definitive ‘data product’ results here, but for now this is
very similar to the &#8216;intermediate&#8217; stage, with one additional grouping by merchant
category as well as month</p>
<p></p>
<div class="tabset"></div>
<ul>
<li>
<p><span><img src="https://i2.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/dbt.png?fit=578%2C40&#038;ssl=1" height="40 px" data-recalc-dims="1"></span></p>
<p>This is basically just a <code>select</code>, but it does filter for uniqueness on the key.
In <code>models/marts/mart_transactions.sql</code></p>
<pre>{{
    config(
        materialized=&#39;incremental&#39;,
        unique_key=&#39;transaction_id&#39;
    )
}}

select
    transaction_id,
    date,
    description,
    amount_aud,
    merchant_name,
    merchant_category,
    raw_category,
    raw_subcategory,
    raw_source
from {{ ref(&#39;int_transactions_categorised&#39;) }}

{% if is_incremental() %}
    where transaction_id not in (select transaction_id from {{ this }})
{% endif %}
</pre><p>and finally a month/category aggregation in <code>models/marts/mart_category_trends.sql</code></p>
<pre>select
  date_trunc(&#39;month&#39;, date)::date as month,
  merchant_category,
  sum(amount_aud)                 as total_aud,
  count(*)                        as transaction_count
from {{ ref(&#39;int_transactions_categorised&#39;) }}
group by 1, 2
</pre></li>
<li>
<p><span><img src="https://i0.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/targets.png?fit=578%2C40&#038;ssl=1" height="40 px" data-recalc-dims="1"></span></p>
<p>These are essentially the same as intermediate, but with an additional
dimension for the monthly summary</p>
<pre>mart_transactions &lt;- function(transactions_categorised) {
  # Equivalent to the incremental mart — deduplication by transaction_id
  transactions_categorised |&gt;
    distinct(transaction_id, .keep_all = TRUE)
}

mart_monthly_summary &lt;- function(mart_txns) {
  mart_txns |&gt;
    mutate(month = floor_date(date, &quot;month&quot;)) |&gt;
    group_by(month, merchant_category) |&gt;
    summarise(
      total_spend_aud = sum(amount_aud),
      transaction_count = n(),
      .groups = &quot;drop&quot;
    )
}
</pre></li>
</ul>
<div style="text-align: right;">
  <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#marts---summaries-and-outputs" rel="nofollow" target="_blank">Top of this section</a> | <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#TableOfContents" rel="nofollow" target="_blank"><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/2b06.png" alt="⬆" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Table of Contents</a>
</div>
<h3 id="tests--validation">Tests / Validation</h3>
<p>One &#8216;selling point&#8217; I&#8217;ve seen for dbt is that it can run validation
tests within the workflow. That&#8217;s extremely useful for ensuring that you&#8217;re not
inadvertently producing junk data.</p>
<p></p>
<div class="tabset"></div>
<ul>
<li>
<p><span><img src="https://i2.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/dbt.png?fit=578%2C40&#038;ssl=1" height="40 px" data-recalc-dims="1"></span></p>
<p>A <code>schema.yml</code> can be added to a model folder with details of tests to be run
on the resulting object. This adds a test that the <code>transaction_id</code> is not
null and is unique, and specifies the known values of the <code>merchant_category</code>
column</p>
<pre>version: 2

models:
  - name: mart_transactions
    columns:
      - name: transaction_id
        tests:
          - not_null
          - unique
      - name: merchant_category
        tests:
        - accepted_values:
            name: merchant_category_is_valid
            values:
              - Accommodation
              - Business
              - Cash
              - Clothing
              - Dining & Drinks
              - Donations
              - Education
[..truncated..]
</pre><p>The <code>tests/</code> folder can contain additional SQL tests to be run as part of the
workflow. These just need to return a result that <em>should</em> be empty if all
goes well, with some number of rows returned if the validation fails -
i.e. no news is good news. I will set up more of these as I figure out what
else belongs in my definition of &#8216;good quality&#8217;, but for now I&#8217;ll
ensure that no records have the &#8216;Uncategorised&#8217; category, which would mean I don&#8217;t
have an entry for them in my <code>seed_merchants.csv</code> definition.</p>
<p>I’ve set the option for this to <code>'warn'</code> because while I do want to identify
those missing categories, I don’t want it to stop the workflow entirely</p>
<pre>{{ config(severity=&#39;warn&#39;) }}

select *
from {{ ref(&#39;mart_transactions&#39;) }}
where merchant_category = &#39;Uncategorised&#39;
</pre><p>In my case there are still some (34) uncategorised transactions (manual transfers),
but the <code>merchant_category_is_valid</code> validation passes</p>
<pre>10 of 13 WARN 34 assert_all_transactions_categorised ........................... [WARN 34 in 0.01s]
11 of 13 START test merchant_category_is_valid ................................. [RUN]
11 of 13 PASS merchant_category_is_valid ....................................... [PASS in 0.01s]
</pre><p>and otherwise (if I remove one of the ‘valid’ values)</p>
<pre>10 of 13 WARN 34 assert_all_transactions_categorised ........................... [WARN 34 in 0.02s]
11 of 13 START test merchant_category_is_valid ................................. [RUN]
11 of 13 FAIL 1 merchant_category_is_valid ..................................... [FAIL 1 in 0.02s]
</pre><p>which in this case shows that one category didn’t match.</p>
<p>The tests can also be run independently with</p>
<pre>uv run dbt test
</pre></li>
<li>
<p><span><img src="https://i0.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/targets.png?fit=578%2C40&#038;ssl=1" height="40 px" data-recalc-dims="1"></span></p>
<p>{targets} doesn’t have a specific way to test results, but it does have a way
to produce artefacts as part of the workflow in exactly the same way as we do
for the data, so I can run arbitrary code including a full data validation.</p>
<p>In my case I&#8217;ll use Appsilon&#8217;s
<a href="https://appsilon.github.io/data.validator/articles/targets_workflow.html" rel="nofollow" target="_blank">{data.validator}</a>,
which already has an example for {targets}, but one could just as easily use
{pointblank} or {validate}.</p>
<p>That leverages assertions from {assertr} and is again just another function</p>
<pre>run_tests &lt;- function(mart_txns) {
  report &lt;- data.validator::data_validation_report()

  data.validator::validate(mart_txns, name = &quot;mart_transactions&quot;) |&gt;
    data.validator::validate_cols(
      predicate = assertr::not_na,
      &quot;transaction_id&quot;,
      description = &quot;transaction_id is not null&quot;
    ) |&gt;
    data.validator::validate_cols(
      predicate = assertr::is_uniq,
      &quot;transaction_id&quot;,
      description = &quot;transaction_id is unique&quot;
    ) |&gt;
    data.validator::validate_cols(
      # valid_categories (defined elsewhere) holds the accepted category names
      predicate = assertr::in_set(valid_categories),
      &quot;merchant_category&quot;,
      description = &quot;merchant_category is an accepted value&quot;
    ) |&gt;
    data.validator::add_results(report)

  data.validator::save_report(report, output_file = &quot;validation_report.html&quot;)
  report
}
</pre><p>but it ties in nicely because errors here get reported correctly by {targets}.</p>
<p>Once the workflow has run, the output file can be opened</p>
<pre>browseURL(&quot;validation_report.html&quot;)
</pre><p>and if all went well it looks like this</p>
<img src="https://i1.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/targets_validation_good.png?w=578&#038;ssl=1" alt="Successful {targets} + {data.validator} run" data-recalc-dims="1"/>
<div class="figcaption">Successful {targets} + {data.validator} run</div>
<p>otherwise a failure is reported</p>
<img src="https://i2.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/targets_validation_bad.png?w=578&#038;ssl=1" alt="Errored {targets} + {data.validator} run" data-recalc-dims="1"/>
<div class="figcaption">Errored {targets} + {data.validator} run</div>
<p>and clicking on ‘Show’ opens a table of the offending results.</p>
</li>
</ul>
<div style="text-align: right;">
  <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#tests--validation" rel="nofollow" target="_blank">Top of this section</a> | <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#TableOfContents" rel="nofollow" target="_blank"><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/2b06.png" alt="⬆" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Table of Contents</a>
</div>
<h3 id="analysis">Analysis</h3>
<p>What&#8217;s the point of organising this data if we&#8217;re not going to do something with
it? This is where I start to wonder whether {targets} has a bigger picture
in mind when it connects up the data: dbt does all of its processing in SQL and
stops at producing tables, while R will happily continue on to the analysis.</p>
<p>I think this is where a separation of concerns becomes necessary, and that depends
on the scale of the data involved. While you or I working on a small project might
be very happy to tie the analysis into the data preparation all in one place,
Netflix probably wants to segregate the data processing and analysis steps into
entirely different divisions, so tying a bow on the cleaned data and letting
analysts pick it up from a database makes a lot more sense.</p>
<p>For my example, let’s say I’m interested in analysing which categories have
out-of-the-ordinary amounts of spend in a given month - have I spent more on
groceries this month? To do that, I want to calculate the average spend in each
category each month plus the variation and identify when the spend is more than
a standard deviation away from the average.</p>
<p></p>
<div class="tabset"></div>
<ul>
<li>
<p><span><img src="https://i2.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/dbt.png?fit=578%2C40&#038;ssl=1" height="40 px" data-recalc-dims="1"></span></p>
<p>Given that this needs the ‘final’ tables, it belongs in the <code>models/marts</code>
folder. There <em>is</em> an <code>analysis/</code> folder in the dbt project by default,
but that’s for ad-hoc SQL queries that need to use <code>ref()</code> but don’t necessarily
produce anything one wishes to persist.</p>
<p>Calculating the standard deviations relies on
<a href="https://duckdb.org/docs/lts/sql/functions/aggregates#stddev_sampx" rel="nofollow" target="_blank">DuckDB’s helpers</a>,
which I don’t even want to consider writing in bare SQL myself. In
<code>models/marts/mart_category_outliers.sql</code>:</p>
<pre>with monthly as (
    select * from {{ ref(&#39;mart_category_trends&#39;) }}
),

stats as (
    select
        merchant_category,
        avg(total_aud)                                    as mean_spend,
        stddev_samp(total_aud)                            as sd_spend,
        count(*)                                          as n_months
    from monthly
    where total_aud &gt; 0
    group by 1
    having count(*) &gt; 1  -- need &gt;1 observation for stddev
),

outliers as (
    select
        m.month,
        m.merchant_category,
        m.total_aud,
        s.mean_spend,
        m.total_aud - s.mean_spend                       as deviation,
        (m.total_aud - s.mean_spend) / s.sd_spend        as z_score
    from monthly m
    inner join stats s using (merchant_category)
    where m.total_aud &gt; 0
      and abs((m.total_aud - s.mean_spend) / s.sd_spend) &gt; 1
)

select * from outliers
order by abs(z_score) desc
</pre><p>This creates a new table in the database with the results.</p>
</li>
<li>
<p><span><img src="https://i0.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/targets.png?fit=578%2C40&#038;ssl=1" height="40 px" data-recalc-dims="1"></span></p>
<p>The <code>sd()</code> function in R is no stranger to anyone who&#8217;s done stats, and it
drops into this code cleanly because {dbplyr} translates {dplyr} code to SQL,
with support for DuckDB</p>
<pre>monthly_outliers &lt;- function(mart_monthly) {
  spend &lt;- mart_monthly |&gt;
    filter(total_spend_aud &gt; 0)

  stats &lt;- spend |&gt;
    group_by(merchant_category) |&gt;
    summarise(
      mean_spend = mean(total_spend_aud),
      sd_spend = sd(total_spend_aud),
      n_months = n(),
      .groups = &quot;drop&quot;
    ) |&gt;
    filter(n_months &gt; 1) # need &gt;1 observation for sd

  spend |&gt;
    inner_join(stats, by = &quot;merchant_category&quot;) |&gt;
    mutate(
      z_score = (total_spend_aud - mean_spend) / sd_spend,
      deviation = total_spend_aud - mean_spend
    ) |&gt;
    filter(abs(z_score) &gt; 1) |&gt;
    arrange(desc(abs(z_score))) |&gt;
    select(
      month,
      merchant_category,
      total_spend_aud,
      mean_spend,
      deviation,
      z_score
    )
}
</pre><p>One can examine this translation by asking {dbplyr} to show the query it generates</p>
<pre>dplyr::copy_to(
  DBI::dbConnect(duckdb::duckdb()), 
  data.frame(x = c(2, 3, 1, 5, 4)), 
  &quot;example&quot;
) |&gt; 
  dplyr::summarise(sd_x = sd(x, na.rm = TRUE)) |&gt;
  dplyr::show_query()

## &lt;SQL&gt;
## SELECT STDDEV(x) AS sd_x
## FROM example
</pre><p>which shows that <code>sd()</code> is translated to <code>STDDEV(x)</code>, DuckDB’s alias for <code>STDDEV_SAMP</code>.</p>
</li>
</ul>
<div style="text-align: right;">
  <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#analysis" rel="nofollow" target="_blank">Top of this section</a> | <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#TableOfContents" rel="nofollow" target="_blank"><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/2b06.png" alt="⬆" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Table of Contents</a>
</div>
<h3 id="the-complete-workflow">The Complete Workflow</h3>
<p>That’s all the pieces I need to push the data from the exported CSVs through the
pipeline and produce a database of monthly aggregated, categorised totals. Here’s
how it looks with each of the two tools.</p>
<p></p>
<div class="tabset"></div>
<ul>
<li>
<p><span><img src="https://i2.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/dbt.png?fit=578%2C40&#038;ssl=1" height="40" data-recalc-dims="1"></span></p>
<p>The file structure is perhaps best seen in the <code>docs</code> website (see the next
section), but essentially the files in <code>models</code> define the workflow</p>
<pre>models
├── intermediate
│   ├── int_monthly_balances.sql
│   └── int_transactions_categorised.sql
├── marts
│   ├── mart_category_outliers.sql
│   ├── mart_category_trends.sql
│   ├── mart_transactions.sql
│   └── schema.yml
└── staging
    ├── sources.yml
    ├── stg_bank.sql
    ├── stg_cc.sql
    └── stg_transactions.sql
</pre><p>Throughout the dbt steps I’ve detailed here, each <code>.sql</code> model produced a
corresponding table in the DuckDB database. Running</p>
<pre>uv run dbt build
</pre><p>runs through the DAG, identifying what needs to run before what, then runs
all the steps in order. A successful run looks like</p>
<pre>01:51:09  Found 1 seed, 8 models, 4 data tests, 2 sources, 591 macros
01:51:09  
01:51:09  Concurrency: 1 threads (target=&#39;dev&#39;)
01:51:09  
01:51:09  1 of 13 START sql view model main.stg_bank ..................................... [RUN]
01:51:09  1 of 13 OK created sql view model main.stg_bank ................................ [OK in 0.04s]
01:51:09  2 of 13 START sql view model main.stg_cc ....................................... [RUN]
01:51:09  2 of 13 OK created sql view model main.stg_cc .................................. [OK in 0.02s]
01:51:09  3 of 13 START seed file main.seed_merchants .................................... [RUN]
01:51:09  3 of 13 OK loaded seed file main.seed_merchants ................................ [INSERT 402 in 0.02s]
01:51:09  4 of 13 START sql view model main.stg_transactions ............................. [RUN]
01:51:09  4 of 13 OK created sql view model main.stg_transactions ........................ [OK in 0.04s]
01:51:09  5 of 13 START sql view model main.int_transactions_categorised ................. [RUN]
01:51:09  5 of 13 OK created sql view model main.int_transactions_categorised ............ [OK in 0.03s]
01:51:09  6 of 13 START sql view model main.int_monthly_balances ......................... [RUN]
01:51:09  6 of 13 OK created sql view model main.int_monthly_balances .................... [OK in 0.03s]
01:51:09  7 of 13 START sql table model main.mart_category_trends ........................ [RUN]
01:51:09  7 of 13 OK created sql table model main.mart_category_trends ................... [OK in 0.22s]
01:51:09  8 of 13 START sql incremental model main.mart_transactions ..................... [RUN]
01:51:09  8 of 13 OK created sql incremental model main.mart_transactions ................ [OK in 0.26s]
01:51:09  9 of 13 START sql table model main.mart_category_outliers ...................... [RUN]
01:51:09  9 of 13 OK created sql table model main.mart_category_outliers ................. [OK in 0.01s]
01:51:09  10 of 13 START test assert_all_transactions_categorised ........................ [RUN]
01:51:09  10 of 13 WARN 34 assert_all_transactions_categorised ........................... [WARN 34 in 0.01s]
01:51:09  11 of 13 START test merchant_category_is_valid ................................. [RUN]
01:51:09  11 of 13 PASS merchant_category_is_valid ....................................... [PASS in 0.01s]
01:51:09  12 of 13 START test not_null_mart_transactions_transaction_id .................. [RUN]
01:51:09  12 of 13 PASS not_null_mart_transactions_transaction_id ........................ [PASS in 0.01s]
01:51:09  13 of 13 START test unique_mart_transactions_transaction_id .................... [RUN]
01:51:09  13 of 13 PASS unique_mart_transactions_transaction_id .......................... [PASS in 0.01s]
01:51:09  
01:51:09  Finished running 1 incremental model, 1 seed, 2 table models, 4 data tests, 5 view models in 0 hours 0 minutes and 0.78 seconds (0.78s).
01:51:09  
01:51:09  Completed with 1 warning:
01:51:09  
01:51:09  Warning in test assert_all_transactions_categorised (tests/assert_all_transactions_categorised.sql)
01:51:09  Got 34 results, configured to warn if != 0
01:51:09  
01:51:09    compiled code at target/compiled/slowbooks/tests/assert_all_transactions_categorised.sql
01:51:09  
01:51:09  Done. PASS=12 WARN=1 ERROR=0 SKIP=0 NO-OP=0 TOTAL=13
</pre><p>The times on the left side are in UTC, and I can’t find a way to change that,
which may be for the best - one can always convert to local after the fact if
needed.</p>
<p>Everything completed with an <code>OK</code> status except for the assertion which I’ve
allowed to <code>WARN</code> because I haven’t categorised a handful of records.</p>
<p>Looking at the resulting database, e.g. in a terminal, shows all the tables
which have been created</p>
<pre>duckdb slowbooks.duckdb

DuckDB v1.5.2 (Variegata)
Enter &quot;.help&quot; for usage hints.
slowbooks D show tables;
┌──────────────────────────────┐
│             name             │
│           varchar            │
├──────────────────────────────┤
│ int_monthly_balances         │
│ int_transactions_categorised │
│ mart_category_outliers       │
│ mart_category_trends         │
│ mart_transactions            │
│ seed_merchants               │
│ stg_bank                     │
│ stg_cc                       │
│ stg_transactions             │
└──────────────────────────────┘
</pre></li>
<li>
<p><span><img src="https://i0.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/targets.png?fit=578%2C40&#038;ssl=1" height="40" data-recalc-dims="1"></span></p>
<p>The {targets} workflow is defined as a list within the <code>_targets.R</code> file</p>
<pre>list(
  # format=&quot;file&quot; means targets re-runs downstream when file list or contents change
  tar_target(
    cc_files,
    cc_list,
    format = &quot;file&quot;
  ),
  tar_target(
    bank_files,
    bank_list,
    format = &quot;file&quot;
  ),
  tar_target(merchant_file, &quot;../seeds/seed_merchants.csv&quot;, format = &quot;file&quot;),

  # Seeds
  tar_target(merchants, readr::read_csv(merchant_file, show_col_types = FALSE)),

  # Staging — passing file vectors directly so targets tracks dependencies correctly
  tar_target(stg_bank, stage_source(bank_files)),
  tar_target(stg_cc, stage_source(cc_files)),
  tar_target(stg_txns, stg_transactions(stg_bank, stg_cc)),

  # Intermediate
  tar_target(int_categorised, categorise_transactions(stg_txns, merchants)),
  tar_target(int_monthly, monthly_balances(int_categorised)),

  # Tests
  tar_target(validation, run_tests(int_categorised)),
  tar_target(is_violation, validation_violation(validation)),

  # Marts
  tar_target(mart_txns, mart_transactions(int_categorised)),
  tar_target(mart_monthly, mart_monthly_summary(mart_txns)),

  # Persist to DuckDB
  tar_target(
    db_mart_transactions,
    write_to_duckdb(mart_txns, &quot;mart_transactions&quot;)
  ),
  tar_target(
    db_mart_monthly_summary,
    write_to_duckdb(mart_monthly, &quot;mart_monthly_summary&quot;)
  ),

  # Analysis
  tar_target(outliers, monthly_outliers(mart_monthly))
)
</pre><p>Running <code>tar_make()</code> (or, with the added validation, <code>tar_make_catch()</code>)
from the working directory containing that file runs the workflow</p>
<pre>+ bank_files dispatched                      
&#x2714; bank_files completed [281ms, 74.42 kB]
+ cc_files dispatched
&#x2714; cc_files completed [0ms, 137.82 kB]
+ merchant_file dispatched
&#x2714; merchant_file completed [1ms, 15.89 kB]
+ stg_bank dispatched
&#x2714; stg_bank completed [88ms, 12.98 kB]
+ stg_cc dispatched
&#x2714; stg_cc completed [84ms, 18.60 kB]
+ merchants dispatched
&#x2714; merchants completed [137ms, 7.33 kB]
+ stg_txns dispatched
&#x2714; stg_txns completed [101ms, 71.10 kB]
+ int_categorised dispatched
&#x2714; int_categorised completed [871ms, 86.06 kB]                
+ mart_txns dispatched                                       
&#x2714; mart_txns completed [0ms, 86.06 kB]                        
+ validation dispatched                                      
&#x2714; validation completed [2.7s, 5.10 kB]                        
+ int_monthly dispatched                                      
&#x2714; int_monthly completed [3ms, 650 B]                          
+ mart_monthly dispatched                                     
&#x2714; mart_monthly completed [4ms, 2.78 kB]                       
+ db_mart_transactions dispatched                             
&#x2714; db_mart_transactions completed [83ms, 70 B]                 
+ is_violation dispatched                                     
&#x2714; is_violation completed [1ms, 48 B]                          
+ db_mart_monthly_summary dispatched                          
&#x2714; db_mart_monthly_summary completed [79ms, 72 B]              
+ outliers dispatched                                         
&#x2714; outliers completed [10ms, 2.30 kB]                          
&#x2714; ended pipeline [4.6s, 16 completed, 0 skipped] 
</pre><p>An important difference is that writing to the database only happened in two
of the steps at the end, so the (distinct) database only contains those tables</p>
<pre>duckdb targets/slowbooks_r.duckdb

DuckDB v1.5.2 (Variegata)
Enter &quot;.help&quot; for usage hints.
slowbooks_r D show tables;
┌──────────────────────┐
│         name         │
│       varchar        │
├──────────────────────┤
│ mart_monthly_summary │
│ mart_transactions    │
└──────────────────────┘
</pre><p>There’s nothing stopping me from also adding a <code>write_to_duckdb()</code> (a function
which just opens the database, writes a table, then closes) call for any of
the other steps, but I was satisfied that I was building the same thing in
both cases.</p>
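<p>The post doesn’t show the body of that helper, but a minimal sketch matching the
description (open, write, close) might look like the following; the <code>path</code>
default and the return value are illustrative assumptions, not the author’s code:</p>

```r
# Minimal sketch of a write-then-close DuckDB helper (illustrative only;
# the actual write_to_duckdb() in the project isn't shown in the post).
write_to_duckdb <- function(data, table_name,
                            path = "targets/slowbooks_r.duckdb") {
  con <- DBI::dbConnect(duckdb::duckdb(), dbdir = path)
  on.exit(DBI::dbDisconnect(con, shutdown = TRUE), add = TRUE)
  DBI::dbWriteTable(con, table_name, as.data.frame(data), overwrite = TRUE)
  # Return something small and hashable for {targets} to track
  table_name
}
```

<p>Returning the table name (rather than the whole data) keeps the target’s
stored result tiny while still letting downstream targets depend on the write.</p>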
</li>
</ul>
<div style="text-align: right;">
  <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#the-complete-workflow" rel="nofollow" target="_blank">Top of this section</a> | <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#TableOfContents" rel="nofollow" target="_blank"><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/2b06.png" alt="⬆" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Table of Contents</a>
</div>
<h3 id="dag--visualisation--docs">DAG / Visualisation / Docs</h3>
<p>The similarity of these approaches to a <code>Makefile</code> depends on being able to
determine what has ‘changed’ and what is the same, and this is where the two
approaches differ. Both treat the workflow as a Directed Acyclic Graph (DAG),
with steps depending on previous steps or data sources. This means I can
visualise the workflow as a graph, but it also makes for some important
differences in how things actually run.</p>
<p></p>
<div class="tabset"></div>
<ul>
<li>
<p><span><img src="https://i2.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/dbt.png?fit=578%2C40&#038;ssl=1" height="40" data-recalc-dims="1"></span></p>
<p>Every time I run <code>dbt build</code> the entire model is re-run. If the model is
‘incremental’ then it won’t need to do a full <code>CREATE TABLE</code> or re-categorise
existing records every time, but all of the steps will be re-run. This also
makes for more complexity if I <em>do</em> want to re-categorise existing records, in
which case I need to add <code>--full-refresh</code> to the build step.</p>
<p>With the model defined I can visualise it by generating and serving the
documentation site with</p>
<pre>uv run dbt docs generate && uv run dbt docs serve 
</pre><p>This builds a locally hosted site which includes all of the SQL code and a
lineage graph showing how the different pieces connect together</p>
<a href="https://i1.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/dbt_docs.png?ssl=1" rel="nofollow" target="_blank">
  <img src="https://i1.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/dbt_docs.png?w=578&#038;ssl=1" alt="dbt docs site (click to embiggen)" data-recalc-dims="1"/>
</a>
<div class="figcaption">dbt docs site (click to embiggen)</div>
<p>Expanding this pane shows more of the DAG for the project, though not all of
the connections</p>
<a href="https://i0.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/dbt_dag.png?ssl=1" rel="nofollow" target="_blank">
  <img src="https://i0.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/dbt_dag.png?w=578&#038;ssl=1" alt="dbt DAG for the whole slowbooks project (click to embiggen)" data-recalc-dims="1"/>
</a>
<div class="figcaption">dbt DAG for the whole slowbooks project (click to embiggen)</div>
<p>This is really nice, and shows how the data flows from the raw data to the
final summary.</p>
</li>
<li>
<p><span><img src="https://i0.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/targets.png?fit=578%2C40&#038;ssl=1" height="40" data-recalc-dims="1"></span></p>
<p>This is where I think {targets} may have an advantage over dbt – since the
workflow considers a hash of the data objects to determine what has changed,
even if the code remains the same, it can identify which steps of the DAG are
invalidated, and can skip over any steps which don’t need to be re-run.</p>
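<p>One way to see that skip logic without running anything is <code>tar_outdated()</code>,
which lists only the targets whose cached results have been invalidated:</p>

```r
# List the targets that the next tar_make() would actually re-run
# (run from the directory containing _targets.R); up-to-date targets
# are omitted entirely
targets::tar_outdated()
```
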
<p>This is significant when the data you’re processing is no longer necessarily
local to the machine running the pipeline. dbt performs the queries with SQL
<em>on</em> the database (in this example the tables are written in DuckDB and materialised
as views for downstream models), while the structure I’m using for {targets}
here explicitly pulls in the data for R-native processing. I <em>could</em> make it
more remote and use lazy <code>tbl()</code> operations via {dbplyr}, but it’s a trade-off
one needs to consider.</p>
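<p>That lazier alternative might be sketched as follows, keeping the computation
inside DuckDB until a final <code>collect()</code>; the table name comes from the marts
above, and the summarised column is an illustrative assumption:</p>

```r
# Keep the heavy lifting in DuckDB via a lazy tbl(); {dbplyr} translates
# the pipeline to SQL, and only the final result is pulled into R
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "slowbooks.duckdb")
dplyr::tbl(con, "mart_transactions") |>
  dplyr::count(merchant_category, name = "n_transactions") |>
  dplyr::collect()
DBI::dbDisconnect(con, shutdown = TRUE)
```
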
<p>A full DAG for the project can be produced in an editor able to render HTML
such as RStudio or Positron, with</p>
<pre>targets::tar_visnetwork()
</pre><p>producing an interactive plot</p>
<a href="https://i1.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/target_dag_utd.png?ssl=1" rel="nofollow" target="_blank">
  <img src="https://i1.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/target_dag_utd.png?w=578&#038;ssl=1" alt="Up-to-date {targets} DAG visualisation (click to embiggen)" data-recalc-dims="1"/>
</a>
<div class="figcaption">Up-to-date {targets} DAG visualisation (click to embiggen)</div>
<p>If I change some of the code in the mart definition, and re-evaluate just that
function, then re-running <code>targets::tar_visnetwork()</code> shows me which nodes
are affected</p>
<a href="https://i0.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/target_dag_inv.png?ssl=1" rel="nofollow" target="_blank">
  <img src="https://i0.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/target_dag_inv.png?w=578&#038;ssl=1" alt="Invalidated {targets} DAG visualisation (click to embiggen)" data-recalc-dims="1"/>
</a>
<div class="figcaption">Invalidated {targets} DAG visualisation (click to embiggen)</div>
<p>(note the different colour of some nodes). This is fantastic!</p>
<p>What’s more, if I have a failing test during the validation, I can see what is
downstream from that</p>
<a href="https://i1.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/targets_dag_err.png?ssl=1" rel="nofollow" target="_blank">
  <img src="https://i1.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/targets_dag_err.png?w=578&#038;ssl=1" alt="Failed validation in {targets} DAG (click to embiggen)" data-recalc-dims="1"/>
</a>
<div class="figcaption">Failed validation in {targets} DAG (click to embiggen)</div>
<p>That did require following the article in the {data.validator} docs to define</p>
<pre>validation_violation &lt;- function(report) {
  violations_exist &lt;- report$get_validations() %&gt;%
    summarise(
      sum(num.violations, na.rm = TRUE) &gt; 0
    ) %&gt;%
    pull()
  if (isTRUE(violations_exist)) {
    rlang::abort(
      &quot;Validation schema error&quot;,
      body = capture.output(report),
      class = &quot;validation_violation&quot;
    )
  }
  FALSE
}
</pre><p>and adding to the target</p>
<pre>tar_target(is_violation, validation_violation(validation)),
</pre><p>and instead of running <code>tar_make()</code>, using a <code>tryCatch()</code></p>
<pre>tar_make_catch &lt;- function() {
  tryCatch(
    tar_make(),
    validation_violation = function(e) {
      print(e)
      tar_visnetwork()
    }
  )
}
</pre><p>Incredibly powerful stuff, right?</p>
</li>
</ul>
<div style="text-align: right;">
  <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#dag--visualisation--docs" rel="nofollow" target="_blank">Top of this section</a> | <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#TableOfContents" rel="nofollow" target="_blank"><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/2b06.png" alt="⬆" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Table of Contents</a>
</div>
<h3 id="exploration">Exploration</h3>
<p>I only built the exploration dashboard as part of the dbt project because I’ve
built plenty of shiny apps - I wanted to see what Claude could build based on
this database data source. It built a streamlit app which shows the monthly spend
broken down by category, and I had it add filters for the various categories,
tables of transactions, and the monthly outliers.</p>
<p>The dashboard works well, albeit not perfectly. It looks something like this</p>
<a href="https://i1.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/slowbooks.png?ssl=1" rel="nofollow" target="_blank">
  <img src="https://i1.wp.com/jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/images/slowbooks.png?w=578&#038;ssl=1" alt="Slowbooks dashboard (click to embiggen)" data-recalc-dims="1"/>
</a>
<div class="figcaption">Slowbooks dashboard (click to embiggen)</div>
<p>There are obvious issues with it - not least that the legend is incomplete - but
for the sort of exploration I wanted to try out, it’s a great starting point.</p>
<p>It reads the summary tables directly from the database, so the analysis doesn’t
need to happen within the app – a nice separation of business logic and
visualisation.</p>
<h2 id="comparison">Comparison</h2>
<p>As a final sanity check, I’ll confirm that I get the same number of transactions
in the monthly trend tables which <em>are</em> saved to both databases, albeit with
different names</p>
<pre>duckdb slowbooks.duckdb -c &quot;select sum(transaction_count) from mart_category_trends;&quot;
┌────────────────────────┐
│ sum(transaction_count) │
│         int128         │
├────────────────────────┤
│                   2028 │
└────────────────────────┘

duckdb targets/slowbooks_r.duckdb -c &quot;select sum(transaction_count) from mart_monthly_summary;&quot;
┌────────────────────────┐
│ sum(transaction_count) │
│         int128         │
├────────────────────────┤
│                   2028 │
└────────────────────────┘
</pre><p><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f389.png" alt="🎉" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<p>As for what I like and don’t like about each approach:</p>
<ul>
<li>
<p>Language: I don’t reach for SQL as a primary language (though it’s absolutely the
second language everyone who codes with data should learn, in my opinion), so having
to write <em>everything</em> in SQL myself doesn’t appeal much to me. I’m very happy
to be able to use {dplyr} more or less everywhere and have it write the SQL for me.
On the other hand, I can see the value in moving to a language that’s closer to
the data itself – the abstractions change over time ({dplyr} is notorious for this)
so with fewer bells and whistles likely comes more stability. Provided the helper
functions are available for things like basic statistics (e.g. in DuckDB) this
doesn’t sound like too much of a downside. Doing a bit more research, it seems
that dbt <em>does</em> support using
<a href="https://docs.getdbt.com/docs/build/python-models" rel="nofollow" target="_blank">Python for models</a>,
provided the adapter supports it (which <code>dbt-duckdb</code> does), so that’s a big win
for those more familiar with Python, although I am under the impression that not
everything works exactly the same for these models.</p>
</li>
<li>
<p>Connection: I appreciate the massive leg-up that dbt offers in terms of handling
connections to sources via extensions (e.g. <code>dbt-duckdb</code>). I’m sure if you’re not
used to that then it looks like magic, but for those familiar with working with
databases via {dbplyr} and {DBI}, it loses some of the wonder. Importantly, the
dbt SQL code all runs within the database - downstream models rely on views, so
the data never really leaves the database. The R version <em>could</em> get closer to this,
but I suspect the more common use-case is to actually pull down all of the data,
in which case it’s likely to fit within RAM.</p>
</li>
<li>
<p>Version Control: For people not used to committing their work, again this seems
like a huge step up, but R users get taught fairly early on to work with git and
track their code, even if it’s just scripts. Someone used to just throwing SQL
at a database from a terminal might be rightly amazed at the benefits opened up
by tracking code this way, but for me it’s the default state.</p>
</li>
<li>
<p>Layout: dbt has so many files for even a simple project that my VSCode file
explorer runs off the screen. {targets} has everything in a single file. This
could be organised more like dbt with a liberal use of <code>source()</code> calls at the
top of <code>_targets.R</code>, say for each model and some utils.</p>
</li>
<li>
<p>Interrogation: Perhaps there’s some more tooling I’m not aware of, but the
{targets} visualisation of the DAG is a clear winner for me. Part of the tradeoff
between ‘run everything locally’ and ‘run everything remotely’ is that I can
inspect the intermediate data in the {targets} workflow with <code>tar_read(id)</code> and
see what’s happening. I <em>can</em> read the generated table in the database, but for
smallish data being able to just crack it open and have a look wins for me.</p>
</li>
</ul>
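<p>As a concrete example of that last point, inspecting any intermediate result is a
one-liner; the target name here is one from the pipeline above:</p>

```r
# Read a cached intermediate result straight from the {targets} store,
# without re-running any of the pipeline
targets::tar_read(int_categorised) |> dplyr::glimpse()
```
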
<h2 id="other-solutions">Other Solutions</h2>
<p>While I’ve focused on this comparison between dbt and {targets}, these aren’t
the only players in the game. I’m aware of <a href="https://airflow.apache.org/" rel="nofollow" target="_blank">Airflow</a>,
at least in the sense that it can ingest dbt pipelines and schedule them.
For the Python folks there’s also <a href="https://www.prefect.io/" rel="nofollow" target="_blank">prefect</a> and
<a href="https://dagster.io/" rel="nofollow" target="_blank">dagster</a>, the latter of which also has an R ingestion
route in the form of <a href="https://joekirincic.github.io/dagsterpipes/" rel="nofollow" target="_blank">dagsterpipes</a>.
A purely R solution is <a href="https://whipson.github.io/maestro/" rel="nofollow" target="_blank">maestro</a> which appears
to target (pun intended) data coming from an API or database for which {targets}
can’t identify the ‘up-to-date-ness’ (since that involves a hash of the file).</p>
<div style="text-align: right;">
  <a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#TableOfContents" rel="nofollow" target="_blank"><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/2b06.png" alt="⬆" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Table of Contents</a>
</div>
<h2 id="conclusion">Conclusion</h2>
<p>I’ve vastly grown my understanding of both dbt and {targets} and have a much
greater appreciation for what goes into using each of these to move and curate
data. Plus, now I have a cool new toy I’ve built to explore my finances. I’m not
sharing the code itself – partly so that I don’t risk committing my own finance
data by accident, and partly because what I’ve done here isn’t anything you need
to build on; if you’re interested in learning either or both of these tools, I
recommend you do what I did and build a toy project.</p>
<p>I’m interested to hear what you think of this comparison – have I overlooked
some significant difference or similarity? Some use-case where one of them would
really shine over the other? Have I misrepresented something? I’m here to learn,
so by all means please do let me know. And if you’re looking for someone with
a history of programming and data who digs into projects this way, I’m on the
market for opportunities.</p>
<p>As always, I can be found on
<a href="https://fosstodon.org/@jonocarroll" rel="nofollow" target="_blank">Mastodon</a> and the comment section below.</p>
<br />
<details>
  <summary>
    <tt>devtools::session_info()</tt>
  </summary>
<pre>## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.5.3 (2026-03-11)
##  os       macOS Tahoe 26.3.1
##  system   aarch64, darwin20
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Australia/Adelaide
##  date     2026-05-04
##  pandoc   3.6.3 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
##  quarto   1.7.31 @ /usr/local/bin/quarto
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blob          1.3.0   2026-01-14 [1] CRAN (R 4.5.2)
##  blogdown      1.23    2026-01-18 [1] CRAN (R 4.5.2)
##  bookdown      0.46    2025-12-05 [1] CRAN (R 4.5.2)
##  bslib         0.10.0  2026-01-26 [1] CRAN (R 4.5.2)
##  cachem        1.1.0   2024-05-16 [1] CRAN (R 4.5.0)
##  cli           3.6.5   2025-04-23 [1] CRAN (R 4.5.0)
##  DBI           1.3.0   2026-02-25 [1] CRAN (R 4.5.2)
##  dbplyr        2.5.2   2026-02-13 [1] CRAN (R 4.5.2)
##  devtools      2.4.6   2025-10-03 [1] CRAN (R 4.5.0)
##  digest        0.6.39  2025-11-19 [1] CRAN (R 4.5.2)
##  dplyr         1.2.1   2026-04-03 [1] CRAN (R 4.5.2)
##  duckdb        1.5.2   2026-04-13 [1] CRAN (R 4.5.2)
##  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.5.0)
##  evaluate      1.0.5   2025-08-27 [1] CRAN (R 4.5.0)
##  fastmap       1.2.0   2024-05-15 [1] CRAN (R 4.5.0)
##  fs            1.6.7   2026-03-06 [1] CRAN (R 4.5.2)
##  generics      0.1.4   2025-05-09 [1] CRAN (R 4.5.0)
##  glue          1.8.1   2026-04-17 [1] CRAN (R 4.5.2)
##  htmltools     0.5.9   2025-12-04 [1] CRAN (R 4.5.2)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.5.0)
##  jsonlite      2.0.0   2025-03-27 [1] CRAN (R 4.5.0)
##  knitr         1.51    2025-12-20 [1] CRAN (R 4.5.2)
##  lifecycle     1.0.5   2026-01-08 [1] CRAN (R 4.5.2)
##  magrittr      2.0.4   2025-09-12 [1] CRAN (R 4.5.0)
##  memoise       2.0.1   2021-11-26 [1] CRAN (R 4.5.0)
##  otel          0.2.0   2025-08-29 [1] CRAN (R 4.5.0)
##  pillar        1.11.1  2025-09-17 [1] CRAN (R 4.5.0)
##  pkgbuild      1.4.8   2025-05-26 [1] CRAN (R 4.5.0)
##  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.5.0)
##  pkgload       1.5.0   2026-02-03 [1] CRAN (R 4.5.2)
##  purrr         1.2.2   2026-04-10 [1] CRAN (R 4.5.2)
##  R6            2.6.1   2025-02-15 [1] CRAN (R 4.5.0)
##  remotes       2.5.0   2024-03-17 [1] CRAN (R 4.5.0)
##  rlang         1.1.7   2026-01-09 [1] CRAN (R 4.5.2)
##  rmarkdown     2.30    2025-09-28 [1] CRAN (R 4.5.0)
##  rstudioapi    0.18.0  2026-01-16 [1] CRAN (R 4.5.2)
##  sass          0.4.10  2025-04-11 [1] CRAN (R 4.5.0)
##  sessioninfo   1.2.3   2025-02-05 [1] CRAN (R 4.5.0)
##  tibble        3.3.1   2026-01-11 [1] CRAN (R 4.5.2)
##  tidyselect    1.2.1   2024-03-11 [1] CRAN (R 4.5.0)
##  usethis       3.2.1   2025-09-06 [1] CRAN (R 4.5.0)
##  vctrs         0.7.1   2026-01-23 [1] CRAN (R 4.5.2)
##  withr         3.0.2   2024-10-28 [1] CRAN (R 4.5.0)
##  xfun          0.56    2026-01-18 [1] CRAN (R 4.5.2)
##  yaml          2.3.12  2025-12-10 [1] CRAN (R 4.5.2)
## 
##  [1] /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library
## 
## ──────────────────────────────────────────────────────────────────────────────
</pre></details>
<br />
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/"> rstats on Irregularly Scheduled Programming</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/comparing-rs-targets-and-dbt-for-data-engineering/">Comparing R’s {targets} and dbt for Data Engineering</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">401038</post-id>	</item>
		<item>
		<title>The Magic of In-Context Learning (ICL): When Your Model Already Knows Your Data</title>
		<link>https://www.r-bloggers.com/2026/05/the-magic-of-in-context-learning-icl-when-your-model-already-knows-your-data/</link>
		
		<dc:creator><![CDATA[Learning Machines]]></dc:creator>
		<pubDate>Sun, 03 May 2026 15:10:05 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://blog.ephorie.de/?p=7029</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> Have you ever looked at a freshly plotted scatter plot and immediately thought, “Ah, this is clearly a logarithmic curve with some heteroskedastic noise,” without running a single line of modeling code? How do you do that? You don’t perform gradient descent in your head. You use your intuition! ...</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/the-magic-of-in-context-learning-icl-when-your-model-already-knows-your-data/">The Magic of In-Context Learning (ICL): When Your Model Already Knows Your Data</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://blog.ephorie.de/the-magic-of-in-context-learning-icl-when-your-model-already-knows-your-data?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=the-magic-of-in-context-learning-icl-when-your-model-already-knows-your-data"> R-Bloggers – Learning Machines</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p><img loading="lazy" fetchpriority="high" decoding="async" src="https://i0.wp.com/blog.ephorie.de/wp-content/uploads/2026/05/ChatGPT-Image-May-3-2026-04_51_25-PM-e1777819949657-243x300.png?resize=243%2C300&#038;ssl=1" alt="" width="243" height="300" class="alignleft size-medium wp-image-7032" srcset_temp="https://i0.wp.com/blog.ephorie.de/wp-content/uploads/2026/05/ChatGPT-Image-May-3-2026-04_51_25-PM-e1777819949657-243x300.png?resize=243%2C300&#038;ssl=1 243w, https://blog.ephorie.de/wp-content/uploads/2026/05/ChatGPT-Image-May-3-2026-04_51_25-PM-e1777819949657-829x1024.png 829w, https://blog.ephorie.de/wp-content/uploads/2026/05/ChatGPT-Image-May-3-2026-04_51_25-PM-e1777819949657-768x949.png 768w, https://blog.ephorie.de/wp-content/uploads/2026/05/ChatGPT-Image-May-3-2026-04_51_25-PM-e1777819949657.png 1068w" sizes="(max-width: 243px) 85vw, 243px" data-recalc-dims="1" /><br />
Have you ever looked at a freshly plotted scatter plot and immediately thought, “<em>Ah, this is clearly a logarithmic curve with some heteroskedastic noise,</em>” without running a single line of modeling code? How do you do that? You don’t perform gradient descent in your head. You use your <em>intuition</em>!<br />
<span id="more-7029"></span></p>
<p>As an experienced data scientist, you have seen thousands of datasets in your career. When confronted with new data, your natural neural network (a.k.a. brain) simply draws on this vast library of past mathematical shapes and immediately recognizes the pattern. But what if an artificial neural network could do exactly the same thing? What if it could predict your data without actually being trained on it?</p>
<p>Welcome to the mind-bending world of <em>In-Context Learning (ICL)</em> for tabular data, brought to R via the incredible new <code>TabPFN</code> package (on CRAN).</p>
<h2>The Transformer: From Text to Tables</h2>
<p>To understand ICL, we have to talk about <em>Large Language Models</em> like ChatGPT (see also <a href="https://blog.ephorie.de/building-your-own-mini-chatgpt-with-r-from-markov-chains-to-transformers" rel="nofollow" target="_blank">Building Your Own Mini-ChatGPT with R: From Markov Chains to Transformers!</a>). When you give a chatbot an unfinished sentence, it doesn’t retrain its weights to guess the next word. It uses a <em>Transformer </em>architecture equipped with an attention mechanism (see also <a href="https://blog.ephorie.de/attention-what-lies-at-the-core-of-chatgpt" rel="nofollow" target="_blank">Attention! What lies at the Core of ChatGPT? (Also as a Video!)</a>). It reads the words you provided, understands the dependencies between them (the grammar and context), and instantly extrapolates what comes next.</p>
<p>The genius of <code>TabPFN</code> is taking this exact architecture and applying it to spreadsheets. Instead of a sequence of words, the Transformer reads a sequence of data rows. It treats your features (<em>X</em>) and your target (<em>Y</em>) like the grammar of a language. By comparing all the rows and columns simultaneously in its “context window,” it figures out the dependencies in the table just like a language model figures out dependencies in text.</p>
<p>The model that arises is a <em>foundation model for tabular data</em>, or <em>tabular foundation model</em> for short.</p>
<blockquote><p>This process is formally known as Few-Shot Learning. You aren’t giving the model an empty brain to train; you are “prompting” a pre-trained brain with a few dozen (or a few hundred) “shots” (rows) of your data to establish the pattern!</p></blockquote>
<h2>The Training Matrix: Learning the Shape of Maths</h2>
<p>You might be wondering: <em>If it isn’t training on my data, what exactly was it trained on?</em></p>
<p>This is where it gets incredibly cool. The researchers who built <code>TabPFN</code> didn’t train it on real-world datasets like housing prices or medical records. Instead, they wrote algorithms to generate millions of completely random, artificially created mathematical dependency structures.</p>
<p>They forced the network to practice on synthetic datasets containing every statistical quirk imaginable: linear trends, severe non-linearities, bizarre interaction effects, extreme missing data mechanisms, and sheer noise. Because it spent its entire training solving billions of abstract maths puzzles, the model learned the fundamental <em>shape</em> of causal mathematical dependencies. When it sees your real-world data, it’s just recognizing a pattern it has already solved synthetically a thousand times before.</p>
<h2>Let’s see it in action</h2>
<p>Let’s use the venerable <code>iris</code> dataset. Because <code>iris</code> is small and the mathematical boundaries are very clear, it’s the perfect candidate for few-shot learning. Notice how the code looks exactly like traditional machine learning, but under the hood, <em>no training is actually happening!</em></p>
<pre>
# Load the package
library(tabpfn)

# 1. Prepare the Data
set.seed(42)
train_indices &lt;- sample(seq_len(nrow(iris)), size = 0.7 * nrow(iris))

iris_train &lt;- iris[train_indices, ]
iris_test  &lt;- iris[-train_indices, ]

# 2. Fit the Model
cat(&quot;Generating embeddings...\n&quot;)
## Generating embeddings...
tab_fit &lt;- tab_pfn(Species ~ ., data = iris_train)

# 3. Make Predictions
cat(&quot;Predicting...\n&quot;)
## Predicting...
predictions &lt;- predict(tab_fit, new_data = iris_test)

# 4. Check the accuracy
accuracy &lt;- sum(predictions$.pred_class == iris_test$Species) / nrow(iris_test)
cat(&quot;\nSuccess! Overall Accuracy:&quot;, round(accuracy * 100, 1), &quot;%\n&quot;)
## 
## Success! Overall Accuracy: 97.8 %
</pre>
<p>When you run this, you will see an accuracy of 97.8%. The model looked at the few examples in <code>iris_train</code>, instantly recognized the multidimensional shapes separating the species using its synthetic intuition, and accurately classified the new test data without a single epoch of traditional backpropagation.</p>
<h2>Conclusion</h2>
<p><code>TabPFN</code> is a paradigm shift. For small to medium tabular datasets, we no longer need to spend hours tuning hyperparameters for Random Forests or XGBoost. We can simply hand the data to an experienced, mathematically omniscient Transformer and let In-Context Learning do the heavy lifting.</p>
<p>Give it a try on your own data, and tell us about your experience in the comments below!</p>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://blog.ephorie.de/the-magic-of-in-context-learning-icl-when-your-model-already-knows-your-data?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=the-magic-of-in-context-learning-icl-when-your-model-already-knows-your-data"> R-Bloggers – Learning Machines</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/the-magic-of-in-context-learning-icl-when-your-model-already-knows-your-data/">The Magic of In-Context Learning (ICL): When Your Model Already Knows Your Data</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">401005</post-id>	</item>
		<item>
		<title>Bad Weather and the Subway</title>
		<link>https://www.r-bloggers.com/2026/05/bad-weather-and-the-subway/</link>
		
		<dc:creator><![CDATA[Kieran Healy]]></dc:creator>
		<pubDate>Sat, 02 May 2026 12:59:15 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://kieranhealy.org/blog/archives/2026/05/02/bad-weather-and-the-subway/</guid>

					<description><![CDATA[<p>Snow in Inwood, New York. Photograph by the author.</p>
<p>Recently I’ve been looking at hourly ridership data from the New York City Subway. Last time we learned that people go to work in the morning and come home in the eve...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/bad-weather-and-the-subway/">Bad Weather and the Subway</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://kieranhealy.org/blog/archives/2026/05/02/bad-weather-and-the-subway/"> R on kieranhealy.org</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<figure class="full-width"><a href="https://i0.wp.com/kieranhealy.org/blog/archives/2026/05/02/bad-weather-and-the-subway/snow-in-nyc.png?ssl=1" rel="nofollow" target="_blank">
    <img src="https://i0.wp.com/kieranhealy.org/blog/archives/2026/05/02/bad-weather-and-the-subway/snow-in-nyc.png?w=578&#038;ssl=1"
         alt="Two figures walking in the snow; trees in the distance." data-recalc-dims="1"/></a><figcaption>
            <p>Snow in Inwood, New York. Photograph by the author.</p>
        </figcaption>
</figure>
<p>Recently I’ve been looking at hourly ridership data from the New York City Subway. Last time we learned that <a href="https://kieranhealy.org/blog/archives/2026/04/25/hourly-subway-station-flows/" rel="nofollow" target="_blank">people go to work in the morning and come home in the evening</a>, for example. (All together now: “Only in New York, baby!”) Today we’ll learn that bad weather makes people stay at home. Except, sometimes it doesn’t.</p>
<p>Regular readers will recall that the subway system <a href="https://kieranhealy.org/blog/archives/2025/02/19/mta-ridership/" rel="nofollow" target="_blank">carries a <em>lot</em> of passengers every day</a>. The ridership data for the whole of 2025 represents just over 1.3 billion entries into the system via an OMNY tap or Metrocard. It’s available aggregated to hourly resolution by station complex. With that data in hand, we can calculate average hourly ridership for every day of the week. This gives us a profile of what, for example, a Monday or a Wednesday typically looks like, by hour. When calculating the average day-of-the-week profile we exclude holidays and the like.</p>
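<p>The baseline computation described above can be sketched in a few lines of base R. The column names and the synthetic Poisson counts below are made up for illustration; the real MTA data has its own schema:</p>

```r
# Sketch of a day-of-week hourly ridership profile on fake data.
# Four full weeks of hourly timestamps:
set.seed(1)
ts <- seq(as.POSIXct("2025-01-01 00:00", tz = "UTC"),
          by = "hour", length.out = 24 * 7 * 4)

rides <- data.frame(
  wday    = weekdays(ts),
  hour    = as.integer(format(ts, "%H")),
  entries = rpois(length(ts), lambda = 1000)  # fake ridership counts
)

# One mean per (weekday, hour) cell: the baseline profile
profile <- aggregate(entries ~ wday + hour, data = rides, FUN = mean)
nrow(profile)  # 168 rows: 7 weekdays x 24 hours
```

A specific date’s observed curve can then be plotted against the matching weekday’s profile, which is essentially what the gray-versus-red panels in the figure do.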
<p>Meanwhile, the National Weather Service provides data on severe weather events that affected the New York City region in 2025. We could get more fine-grained if we wanted to, but for now we’ll just use the <a href="https://www.weather.gov/okx/stormarchive" rel="nofollow" target="_blank">general list of events</a> the NWS provides. Then we plot the Subway ridership profile for that specific date against the average profile for whatever day of the week the event happened on.</p>
<figure class="full-width"><a href="https://i0.wp.com/kieranhealy.org/blog/archives/2026/05/02/bad-weather-and-the-subway/rhythms_2025_weather.png?ssl=1" rel="nofollow" target="_blank">
    <img src="https://i0.wp.com/kieranhealy.org/blog/archives/2026/05/02/bad-weather-and-the-subway/rhythms_2025_weather.png?w=578&#038;ssl=1"
         alt="Small multiple showing generally suppressive relationship between subway ridership and adverse weather days in 2025." data-recalc-dims="1"/></a><figcaption>
            <p>Bad weather suppresses Subway ridership, in general. But not always.</p>
        </figcaption>
</figure>
<p>The gray lines are the baseline. The red ones are the bad weather day. The basic shape of the gray lines (and many of the red ones) is set by the rhythm of daily life. The sharp double-peak pattern is what someone I’ve shown too many of these graphs to has taken to calling “The Giant Cat-Ears of Employment”. The cat-ear shapes vary by work day (which might be the topic of another post), but are most sharply-contrasted with the weekends, which look more like little hillocks or <a href="https://en.wikipedia.org/wiki/Drumlin" rel="nofollow" target="_blank">drumlins</a>.</p>
<p>We can see a few different cases in the panels. First are days when the weather event put no dent at all in people’s day. This is because <del>of the incredible toughness resilience of New Yorkers, something they are surprisingly very modest about</del> even though there was a weather event in the region that day, it just didn’t impinge on the city much, or at all. The light snow on February 11th or the heavy rain on March 6th are examples here. People just continued to go about their business.</p>
<p>Second are cases where there’s a lot of travel suppression but it’s not really—or not wholly—the weather that’s responsible. The winter storm on Friday December 26th is a case of this. That’s not a regular Friday. Many people are able to stay at home anyway, because it’s the day after Christmas.</p>
<p>Third are cases where the weather does seem to have suppressed travel. These are days like the snow on January 19th, or the shitty weather on Sunday February 16th. These events look like they made people stay at home. Some of these are more severe than others. The strongest example is the flash flooding on Thursday July 31st. That happened in the back half of the day and affected the evening commute directly.</p>
<p>Our fourth and final category is my favorite one. Sometimes snow makes no difference at all, especially if it’s on a workday. Sometimes it’s snowy on the weekend but you’re kind of sick of it, maybe because it’s late in the winter, so you’re either going about your business as usual or you’re just staying indoors. But there’s another kind of snow day.</p>
<figure class="full-width"><a href="https://i1.wp.com/kieranhealy.org/blog/archives/2026/05/02/bad-weather-and-the-subway/rhythms_2025_weather_storm_dec13.png?ssl=1" rel="nofollow" target="_blank">
    <img src="https://i1.wp.com/kieranhealy.org/blog/archives/2026/05/02/bad-weather-and-the-subway/rhythms_2025_weather_storm_dec13.png?w=578&#038;ssl=1"
         alt="A close up of Dec 13th and 14th, when the first snow of the season fell and it made people want to go outside." data-recalc-dims="1"/></a><figcaption>
            <p>Let’s go exploring.</p>
        </figcaption>
</figure>
<p>The weekend of <a href="https://www.weather.gov/okx/20251213_14" rel="nofollow" target="_blank">December 13th and 14th 2025</a> brought the city’s <a href="https://weather.com/news/news/2025-12-14-first-snow-new-york-city" rel="nofollow" target="_blank">first measurable snow of the year</a>, and in decent amounts, too—<a href="https://www.weather.gov/okx/20251213_14" rel="nofollow" target="_blank">between four and eight inches of accumulation</a>. Reports remarked on how long it had been in arriving. The result was that, over the weekend, ridership on the subway went <em>up</em>. Maybe on the Saturday it was to go out and buy the mandatory bread, milk, and eggs.<sup id="fnref:1"><a href="https://kieranhealy.org/blog/archives/2026/05/02/bad-weather-and-the-subway/#fn:1" class="footnote-ref" role="doc-noteref" rel="nofollow" target="_blank">1</a></sup>  But maybe it was also just to be out in the snow. The next day, the people who didn’t have to go to work slept in as usual. But that day, too, across the afternoon, more people than usual headed outside and took the subway somewhere. I’d like to think a bunch of them had a sled under their arm.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Maybe, at least for some New Yorkers, it was because it made more sense to take the subway than drive. Though this probably wouldn’t be all that many people. It’d be somewhat possible to investigate this with the data at hand, especially if e.g. outlying stations showed higher ridership rates. <a href="https://kieranhealy.org/blog/archives/2026/05/02/bad-weather-and-the-subway/#fnref:1" class="footnote-backref" role="doc-backlink" rel="nofollow" target="_blank"><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></p>
</li>
</ol>
</div>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://kieranhealy.org/blog/archives/2026/05/02/bad-weather-and-the-subway/"> R on kieranhealy.org</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/bad-weather-and-the-subway/">Bad Weather and the Subway</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400985</post-id>	</item>
		<item>
		<title>Learning &#038; Exploring Survival Analysis Part 1 &#8211; A Note To Myself</title>
		<link>https://www.r-bloggers.com/2026/05/learning-exploring-survival-analysis-part-1-a-note-to-myself/</link>
		
		<dc:creator><![CDATA[r on Everyday Is A School Day]]></dc:creator>
		<pubDate>Sat, 02 May 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://www.kenkoonwong.com/blog/survival/</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; ">
A note to myself on survival analysis — KM curves, log-rank tests &#038; Cox models 🧮 If I wrote it the way I understood it, maybe I’ll actually remember it 🤞</p>
<p>Motivations</p>
<p>We see survival analysis or more generally call...</p></div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/learning-exploring-survival-analysis-part-1-a-note-to-myself/">Learning & Exploring Survival Analysis Part 1 – A Note To Myself</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://www.kenkoonwong.com/blog/survival/"> r on Everyday Is A School Day</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<blockquote>
<p>A note to myself on survival analysis — KM curves, log-rank tests &#038; Cox models <img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f9ee.png" alt="🧮" class="wp-smiley" style="height: 1em; max-height: 1em;" /> If I wrote it the way I understood it, maybe I’ll actually remember it <img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f91e.png" alt="🤞" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
</blockquote>




<h2 id="motivations">Motivations
  <a href="https://www.kenkoonwong.com/blog/survival/#motivations" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<p>We see survival analysis, or more generally time-to-event analysis, almost all the time when we review journal articles in NEJM and the like. Even though we understand the heuristics for interpreting some of the simpler results, I realized that I needed to look at this a bit closer to fully understand the workings and math behind it. A recent project made me feel that my understanding was not as solid as I had hoped, after talking to one of my statistician colleagues, who also happens to have written 
<a href="https://www.emilyzabor.com/survival-analysis-in-r.html" rel="nofollow" target="_blank">this blog</a>. Please take a look at Emily’s blog for a better and more accurate survival analysis tutorial. This post is more for my own learning, so that I can refer back to the fundamentals when I need a refresher in the future. Also, if I write it the way I understood it, maybe that will increase the probability of me recollecting it later. What are we waiting for? Let’s time-to-event this analysis!</p>




<h2 id="objectives">Objectives:
  <a href="https://www.kenkoonwong.com/blog/survival/#objectives" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<p>
<a href="https://www.kenkoonwong.com/blog/survival/#time" rel="nofollow" target="_blank">Time-to-event Analysis</a></p>
<ul>
<li>
<a href="https://www.kenkoonwong.com/blog/survival/#function" rel="nofollow" target="_blank">Survival function</a></li>
<li>
<a href="https://www.kenkoonwong.com/blog/survival/#handcalc" rel="nofollow" target="_blank">Let’s Calculate By Hand</a></li>
<li>
<a href="https://www.kenkoonwong.com/blog/survival/#sim" rel="nofollow" target="_blank">Simulation</a>
<ul>
<li>
<a href="https://www.kenkoonwong.com/blog/survival/#km" rel="nofollow" target="_blank">Kaplan-Meier Estimator</a></li>
<li>
<a href="https://www.kenkoonwong.com/blog/survival/#cox" rel="nofollow" target="_blank">Cox Proportional Hazard Model</a></li>
</ul>
</li>
<li>
<a href="https://www.kenkoonwong.com/blog/survival/#ack" rel="nofollow" target="_blank">Acknowledgement</a></li>
<li>
<a href="https://www.kenkoonwong.com/blog/survival/#opportunities" rel="nofollow" target="_blank">Oppotunities For Improvement</a></li>
<li>
<a href="https://www.kenkoonwong.com/blog/survival/#lessons" rel="nofollow" target="_blank">Lessons Learnt</a></li>
</ul>




<h2 id="time">Time-to-event Analysis
  <a href="https://www.kenkoonwong.com/blog/survival/#time" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<p>The name “survival analysis” is a bit misleading if you first encounter it outside of clinical research. The “survival” doesn’t necessarily mean staying alive — it means surviving without experiencing the event. And that event does not necessarily have to be mortality; it could be any unwanted event. Outside of clinical research, an event could be the time when a Waymo arrives at your doorstep or when someone flakes out. <img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f923.png" alt="🤣" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Hence “time-to-event analysis” seems to be the better terminology, in my opinion.</p>
<p>This is different from good ol’ regression because <code>time is the outcome</code>: not only whether the event occurred (binary), but when! Now, if you’re like me, you might think that’s just negative binomial regression, right? Not quite, because there is an additional special feature of time-to-event analysis called <code>censoring</code>.</p>
<blockquote>
<p>Censoring can mean that the event did not occur, but it can also mean that we lost track of the patient, or the study ended before the event occurred.</p>
</blockquote>
<p>I don’t know about you, but for me censoring has a negative connotation. It sounds like we’re hiding something. In survival analysis, though, censoring is actually a good thing: it means we have partial information about the time to event, even if we don’t know the exact time. So think of censoring as the last time we noticed that the event DID NOT happen; it’s usually coded as 0. In good ol’ regression, we usually either do a complete-case analysis (throw out missing data) or impute. But imputing an outcome is a tad odd, 
<a href="https://stats.stackexchange.com/questions/46226/multiple-imputation-for-outcome-variables" rel="nofollow" target="_blank">in most cases except here</a>. The most common form is <code>right censoring</code>, meaning we lose track of someone on the right side of the timeline.</p>




<h2 id="function">Survival Function
  <a href="https://www.kenkoonwong.com/blog/survival/#function" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<p>The <code>survival function</code>, written S(t), answers one simple question: “What is the probability that a person has NOT yet experienced the event by time t?”. At t=0, everyone is event-free, so S(0) = 1 (100%). As time goes on, people experience the event, and S(t) decreases.</p>
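<p>In the simplest case with no censoring at all, S(t) is just the proportion of subjects whose event time exceeds t. A toy base-R sketch (event times invented for illustration):</p>

```r
# Five made-up event times, no censoring
event_times <- c(2, 5, 6, 9, 12)

# Empirical survival function: share of subjects still event-free at t
S <- function(t) mean(event_times > t)

S(0)    # 1.0 -- everyone is event-free at the start
S(5.5)  # 0.6 -- 3 of the 5 event times exceed 5.5
S(20)   # 0.0 -- everyone has had the event by then
```

Censoring is what breaks this simple proportion, and it is why we need the Kaplan-Meier estimator below.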




<h2 id="handcalc">Let’s Calculate By Hand
  <a href="https://www.kenkoonwong.com/blog/survival/#handcalc" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<table>
<thead>
<tr>
<th>patient</th>
<th>time (months)</th>
<th>status</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>2</td>
<td>1 (event)</td>
</tr>
<tr>
<td>B</td>
<td>3</td>
<td>0 (censored)</td>
</tr>
<tr>
<td>C</td>
<td>5</td>
<td>1 (event)</td>
</tr>
<tr>
<td>D</td>
<td>6</td>
<td>1 (event)</td>
</tr>
<tr>
<td>E</td>
<td>8</td>
<td>0 (censored)</td>
</tr>
</tbody>
</table>
<p>Alright, the above looks quite self-explanatory. We have 5 patients, and we are tracking their time to event in months. Patient A experienced the event at 2 months, while patient B was censored at 3 months (we lost track of them). Patient C had the event at 5 months, patient D at 6 months, and patient E was censored at 8 months. Now let’s do some calculation.</p>
<p>Formula:
$$
\hat{S}(t) = \prod_{i:\, t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)
$$</p>
<ul>
<li><code>\(\hat{S}(t)\)</code>: The estimated survival function; the probability of surviving beyond time <code>\(t\)</code>.</li>
<li><code>\(\prod_{i:\, t_i \leq t}\)</code>: Product over all event times <code>\(t_i\)</code> that are less than or equal to <code>\(t\)</code>.</li>
<li><code>\(t_i\)</code>: The <code>\(i\)</code>-th observed event (death/failure) time.</li>
<li><code>\(d_i\)</code>: The number of events (deaths/failures) that occurred at time <code>\(t_i\)</code>.</li>
<li><code>\(n_i\)</code>: The number of individuals at risk (still alive and under observation) just before time <code>\(t_i\)</code>.</li>
<li><code>\(\frac{d_i}{n_i}\)</code>: The estimated probability of the event occurring at time <code>\(t_i\)</code>.</li>
<li><code>\(1 - \frac{d_i}{n_i}\)</code>: The estimated probability of <strong>surviving</strong> through time <code>\(t_i\)</code>.</li>
</ul>
<p>or, more simply, as a running product over successive event times:</p>
<p><code>\(S(t_i) = S(t_{i-1}) \times (1 - d_i/n_i)\)</code></p>
<p>Let’s calculate by hand:</p>
<table>
<thead>
<tr>
<th>time</th>
<th>at risk (n)</th>
<th>event (d)</th>
<th>S(t)</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>5</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>2</td>
<td>5</td>
<td>1</td>
<td>1*(1-1/5)=0.8</td>
</tr>
<tr>
<td>3</td>
<td>5-1=4</td>
<td>0</td>
<td>0.8*(1-0)=0.8</td>
</tr>
<tr>
<td>5</td>
<td>4-1=3</td>
<td>1</td>
<td>0.8*(1-1/3)=0.5333</td>
</tr>
<tr>
<td>6</td>
<td>3-1=2</td>
<td>1</td>
<td>0.5333*(1-1/2)=0.2667</td>
</tr>
<tr>
<td>8</td>
<td>2-1=1</td>
<td>0</td>
<td>0.2667*(1-0)=0.2667</td>
</tr>
</tbody>
</table>
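<p>As a quick sanity check (not in the original post), the same numbers fall out of <code>survival::survfit</code> on this five-patient table, so we can verify the hand calculation against the package:</p>

```r
library(survival)

# The five-patient example from the table above
patients <- data.frame(
  time   = c(2, 3, 5, 6, 8),
  status = c(1, 0, 1, 1, 0)  # 1 = event, 0 = censored
)

fit <- survfit(Surv(time, status) ~ 1, data = patients)
summary(fit)  # S(t) at event times 2, 5, 6: 0.8, 0.533, 0.267
```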
<p>That’s interesting! I don’t think I’ve calculated these by hand before, and working through a simple example and observing the result is very helpful. Alright, when we read articles there is typically a group factor, so how do they use KM to generate two survival curves, one per group? They do the same thing, but only on the subset of the data that belongs to that group. So if we have a treatment and a control group, we calculate S(t) separately for each group, and then we can compare the two survival curves to see whether there is a difference in survival between the groups. How? We can use the log-rank test to compare the survival curves, or a Cox proportional hazards model to estimate the hazard ratio between the groups. Now things are starting to look a tad more familiar. Let’s use R with a simple example and see if we can get to the log-rank test from just the simple KM estimator.</p>
<pre>library(tidyverse)

simple_df &lt;- tribble(
  ~time, ~status, ~treatment,
  5,1,1,
  2,1,0,
  6,0,1,
  1,1,0,
  2,0,1,
  2,0,0,
  7,1,1,
  3,1,0,
  7,1,1,
  2,1,0,
  1,1,0,
  6,1,1
) |&gt;
  mutate(subject = row_number()) 

treatment_df &lt;- simple_df |&gt;
  filter(treatment == 1) |&gt;
  arrange(time)

treatment &lt;- tibble()
time &lt;- treatment_df |&gt; distinct(time) |&gt; pull()
at_risk &lt;- nrow(treatment_df)
S_t &lt;- 1

for (i in time) {
  df_i &lt;- treatment_df |&gt;
    filter(time == i)
  status &lt;- df_i |&gt; pull(status) |&gt; sum()
  n &lt;- df_i |&gt; pull(status) |&gt; length()
  S_t &lt;- S_t * (1 - status/at_risk)
  treatment &lt;- treatment |&gt;
    bind_rows(tibble(time=i,at_risk=at_risk,S_t=S_t,treatment=1))
  at_risk &lt;- at_risk - n
}

(treatment)

## # A tibble: 4 × 4
##    time at_risk   S_t treatment
##   &lt;dbl&gt;   &lt;int&gt; &lt;dbl&gt;     &lt;dbl&gt;
## 1     2       6   1           1
## 2     5       5   0.8         1
## 3     6       4   0.6         1
## 4     7       2   0           1

no_treatment_df &lt;- simple_df |&gt;
  filter(treatment == 0) |&gt;
  arrange(time)

no_treatment &lt;- tibble()
time &lt;- no_treatment_df |&gt; distinct(time) |&gt; pull()
at_risk &lt;- nrow(no_treatment_df)
S_t &lt;- 1

for (i in time) {
  df_i &lt;- no_treatment_df |&gt;
    filter(time == i)
  status &lt;- df_i |&gt; pull(status) |&gt; sum()
  n &lt;- df_i |&gt; pull(status) |&gt; length()
  S_t &lt;- S_t * (1 - status/at_risk)
  no_treatment &lt;- no_treatment |&gt;
    bind_rows(tibble(time=i,at_risk=at_risk,S_t=S_t,treatment=0))
  at_risk &lt;- at_risk - n
}

(no_treatment)

## # A tibble: 3 × 4
##    time at_risk   S_t treatment
##   &lt;dbl&gt;   &lt;int&gt; &lt;dbl&gt;     &lt;dbl&gt;
## 1     1       6 0.667         0
## 2     2       4 0.333         0
## 3     3       1 0             0
</pre>



<h4 id="visualize">Visualize
  <a href="https://www.kenkoonwong.com/blog/survival/#visualize" rel="nofollow" target="_blank"></a>
</h4>
<pre>rbind(treatment,no_treatment) |&gt;
  bind_rows(tibble(
    time=c(0,0), status=c(0,0), treatment=c(1,0), subject=c(0,0), at_risk=c(6,6), S_t=c(1,1)  #add initial phase 
  )) |&gt;
  ggplot(aes(x=time,y=S_t,color=as.factor(treatment))) +
  geom_step() +
  theme_bw()
</pre><img src="https://i2.wp.com/www.kenkoonwong.com/blog/survival/index_files/figure-html/unnamed-chunk-2-1.png?w=450&#038;ssl=1" data-recalc-dims="1" />
<p>Wow, since we created this simple dataset knowing that treatment extended the time to event whereas no treatment didn’t, we can nicely see that the two KM curves stratified by treatment group look very different. Now let’s quickly look at the log-rank test with the <code>survival</code> package, then calculate it by hand and see if we can reproduce the same p-value.</p>
<pre>## log-rank test
(log_rank_test &lt;- survival::survdiff(Surv(time, status) ~ treatment, data = simple_df))

## Call:
## survival::survdiff(formula = Surv(time, status) ~ treatment, 
##     data = simple_df)
## 
##             N Observed Expected (O-E)^2/E (O-E)^2/V
## treatment=0 6        5     1.97      4.68      9.02
## treatment=1 6        4     7.03      1.31      9.02
## 
##  Chisq= 9  on 1 degrees of freedom, p= 0.003
</pre><p>Alright! It looks like a chi-square test and has a p-value of 0.003. Let’s see if we can reproduce that. Note that <code>survdiff</code> uses (O-E)^2/V rather than the sum of (O-E)^2/E from the usual chi-square test to get the chi-square statistic. Interesting.</p>
<p><code>V (variance) = n_0 * n_1 * d * (n - d) / (n^2 * (n - 1))</code></p>
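<p>Before the step-by-step version, here is a compact sketch (mine, not the <code>survival</code> internals) that loops over the distinct event times and accumulates O, E, and V in one pass; defining the risk set as everyone with <code>time &gt;= t</code> avoids the manual bookkeeping and the NaN rows:</p>

```r
library(tidyverse)

simple_df <- tribble(
  ~time, ~status, ~treatment,
  5,1,1, 2,1,0, 6,0,1, 1,1,0, 2,0,1, 2,0,0,
  7,1,1, 3,1,0, 7,1,1, 2,1,0, 1,1,0, 6,1,1
)

event_times <- simple_df |> filter(status == 1) |> distinct(time) |> pull() |> sort()

O0 <- E0 <- V <- 0
for (t in event_times) {
  risk <- simple_df |> filter(time >= t)  # everyone still under observation just before t
  n0 <- sum(risk$treatment == 0); n1 <- sum(risk$treatment == 1); n <- n0 + n1
  d0 <- sum(risk$time == t & risk$status == 1 & risk$treatment == 0)
  d  <- sum(risk$time == t & risk$status == 1)
  O0 <- O0 + d0          # observed events in group 0
  E0 <- E0 + n0 / n * d  # expected events in group 0
  if (n > 1) V <- V + n0 * n1 * d * (n - d) / (n^2 * (n - 1))
}
(chi_sq <- (O0 - E0)^2 / V)                 # ~9.02
pchisq(chi_sq, df = 1, lower.tail = FALSE)  # ~0.003
```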
<details>
<summary>Click Here For Calculated Details</summary>
<pre>## log-rank test by hand
n0 &lt;- 6
n1 &lt;- 6

### time 1
simple_df |&gt;
  filter(time == 1) |&gt;
  group_by(treatment) |&gt;
  summarize(n = n(),
            d = sum(status))

## # A tibble: 1 × 3
##   treatment     n     d
##       &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt;
## 1         0     2     2

(t_1 &lt;- tibble(n0=n0, n1=n1, d0=2, d1=0) |&gt;
  mutate(n = n0 + n1) |&gt;
  mutate(d = d0 + d1) |&gt;
  mutate(V = n0 * n1 * d * (n - d) / (n^2 * (n - 1))) |&gt;
  mutate(E0 = (n0 /n) * d) |&gt;
  mutate(chi_i = (d0-E0)^2/V))

## # A tibble: 1 × 9
##      n0    n1    d0    d1     n     d     V    E0 chi_i
##   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1     6     6     2     0    12     2 0.455     1   2.2

n0 &lt;- n0 - 2
n1 &lt;- n1 - 0

### time 2
simple_df |&gt;
  filter(time == 2) |&gt;
  group_by(treatment) |&gt;
  summarize(n = n(),
            d = sum(status))

## # A tibble: 2 × 3
##   treatment     n     d
##       &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt;
## 1         0     3     2
## 2         1     1     0

(t_2 &lt;- tibble(n0=n0, n1=n1, d0=2, d1=0) |&gt;
  mutate(n = n0 + n1) |&gt;
  mutate(d = d0 + d1) |&gt;
  mutate(V = n0 * n1 * d * (n - d) / (n^2 * (n - 1))) |&gt;
  mutate(E0 = (n0 /n) * d) |&gt;
  mutate(chi_i = (d0-E0)^2/V))

## # A tibble: 1 × 9
##      n0    n1    d0    d1     n     d     V    E0 chi_i
##   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1     4     6     2     0    10     2 0.427   0.8  3.37

n0 &lt;- n0 - 3
n1 &lt;- n1 - 1

### time 3
simple_df |&gt;
  filter(time == 3) |&gt;
  group_by(treatment) |&gt;
  summarize(n = n(),
            d = sum(status))

## # A tibble: 1 × 3
##   treatment     n     d
##       &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt;
## 1         0     1     1

(t_3 &lt;- tibble(n0=n0, n1=n1, d0=1, d1=0) |&gt;
  mutate(n = n0 + n1) |&gt;
  mutate(d = d0 + d1) |&gt;
  mutate(V = n0 * n1 * d * (n - d) / (n^2 * (n - 1))) |&gt;
  mutate(E0 = (n0 /n) * d) |&gt;
  mutate(chi_i = (d0-E0)^2/V))

## # A tibble: 1 × 9
##      n0    n1    d0    d1     n     d     V    E0 chi_i
##   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1     1     5     1     0     6     1 0.139 0.167     5

n0 &lt;- n0 - 1
n1 &lt;- n1 - 0

### time 5
simple_df |&gt;
  filter(time == 5) |&gt;
  group_by(treatment) |&gt;
  summarize(n = n(),
            d = sum(status))

## # A tibble: 1 × 3
##   treatment     n     d
##       &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt;
## 1         1     1     1

(t_5 &lt;- tibble(n0=n0, n1=n1, d0=0, d1=1) |&gt;
  mutate(n = n0 + n1) |&gt;
  mutate(d = d0 + d1) |&gt;
  mutate(V = n0 * n1 * d * (n - d) / (n^2 * (n - 1))) |&gt;
  mutate(E0 = (n0 /n) * d) |&gt;
  mutate(chi_i = (d0-E0)^2/V))

## # A tibble: 1 × 9
##      n0    n1    d0    d1     n     d     V    E0 chi_i
##   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1     0     5     0     1     5     1     0     0   NaN

n1 &lt;- n1 - 1

### time 6
simple_df |&gt;
  filter(time == 6) |&gt;
  group_by(treatment) |&gt;
  summarize(n = n(),
            d = sum(status))

## # A tibble: 1 × 3
##   treatment     n     d
##       &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt;
## 1         1     2     1

(t_6 &lt;- tibble(n0=n0, n1=n1, d0=0, d1=1) |&gt;
  mutate(n = n0 + n1) |&gt;
  mutate(d = d0 + d1) |&gt;
  mutate(V = n0 * n1 * d * (n - d) / (n^2 * (n - 1))) |&gt;
  mutate(E0 = (n0 /n) * d) |&gt;
  mutate(chi_i = (d0-E0)^2/V))

## # A tibble: 1 × 9
##      n0    n1    d0    d1     n     d     V    E0 chi_i
##   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1     0     4     0     1     4     1     0     0   NaN

n1 &lt;- n1 - 2

### time 7
simple_df |&gt;
  filter(time == 7) |&gt;
  group_by(treatment) |&gt;
  summarize(n = n(),
            d = sum(status))

## # A tibble: 1 × 3
##   treatment     n     d
##       &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt;
## 1         1     2     2

(t_7 &lt;- tibble(n0=n0, n1=n1, d0=0, d1=2) |&gt;
  mutate(n = n0 + n1) |&gt;
  mutate(d = d0 + d1) |&gt;
  mutate(V = n0 * n1 * d * (n - d) / (n^2 * (n - 1))) |&gt;
  mutate(E0 = (n0 /n) * d) |&gt;
  mutate(chi_i = (d0-E0)^2/V))

## # A tibble: 1 × 9
##      n0    n1    d0    d1     n     d     V    E0 chi_i
##   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1     0     2     0     2     2     2     0     0   NaN

n1 &lt;- n1 - 2
</pre><p>Alright, that was a lot of back-and-forth sanity checking, but I think we did it! Now let’s replace those NaN values with 0, do some calculation, and check our final chi-square statistic.</p>
</details>
<pre>bind_rows(t_1, t_2, t_3, t_5, t_6, t_7) |&gt;
  mutate(V = replace_na(V, 0),
         E0 = replace_na(E0, 0)) |&gt;
  summarise(
    O0 = sum(d0),
    E0 = sum(E0),
    V  = sum(V)
  ) |&gt;
  mutate(chi_sq = (O0 - E0)^2 / V)

## # A tibble: 1 × 4
##      O0    E0     V chi_sq
##   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt;
## 1     5  1.97  1.02   9.02

pchisq(q = 9.02, df = 1, lower.tail = F)

## [1] 0.002670414
</pre><p><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f64c.png" alt="🙌" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f64c.png" alt="🙌" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f64c.png" alt="🙌" class="wp-smiley" style="height: 1em; max-height: 1em;" /> We got it! If we round it up, it’s exactly 0.003 just like from <code>survival</code>.</p>
<blockquote>
<p>Notice that we used E0, but you could use E1 and it would return the same chi-square statistic. Click below for details.</p>
</blockquote>
<details>
<summary>Click to expand</summary>
<pre>bind_rows(t_1, t_2, t_3, t_5, t_6, t_7) |&gt;
  mutate(E1 = (n1 / n) * d) |&gt;
  summarise(
    O1 = sum(d1),
    E1 = sum(E1),
    V = sum(V)
  ) |&gt;
  mutate(chi_sq = (O1-E1)^2/V) 

## # A tibble: 1 × 4
##      O1    E1     V chi_sq
##   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt;
## 1     4  7.03  1.02   9.02
</pre></details>
<blockquote>
<p>Take note that the KM estimator can only estimate the survival function, and you can only compare the survival curves with the log-rank test; you can’t add more variables to adjust for confounding. In other words, we assume there aren’t any confounding factors between treatment groups.</p>
</blockquote>
<p>If any adjustment is needed, that’s where the Cox proportional hazards model comes in. Now if we were to add age and run a Cox model, we would get a different hazard ratio and p-value, but the log-rank test would stay the same because it only compares the survival curves without adjusting for any covariates. Let’s see that in action. Click below to expand; you’re going to see an interesting warning: complete separation.</p>
<details>
<summary>Click to Expand</summary>
<pre>simple_df &lt;- tribble(
  ~time, ~status, ~treatment, ~age,
  5,1,1,30,
  2,1,0,80,
  6,0,1,35,
  1,1,0,85,
  2,0,1,32,
  2,0,0,30,
  7,1,1,25,
  3,1,0,90,
  7,1,1,98,
  2,1,0,98,
  1,1,0,89,
  6,1,1,20
) |&gt;
  mutate(subject = row_number()) 

survival::coxph(Surv(time,status) ~ treatment+age, data = simple_df)

## Warning in coxph.fit(X, Y, istrat, offset, init, control, weights = weights, :
## Loglik converged before variable 1 ; coefficient may be infinite.

## Call:
## survival::coxph(formula = Surv(time, status) ~ treatment + age, 
##     data = simple_df)
## 
##                 coef  exp(coef)   se(coef)      z     p
## treatment -2.202e+01  2.729e-10  1.937e+04 -0.001 0.999
## age        2.275e-03  1.002e+00  1.192e-02  0.191 0.849
## 
## Likelihood ratio test=10.61  on 2 df, p=0.004959
## n= 12, number of events= 9
</pre><p>Notice how our treatment has a high p-value and a huge SE? Since our mock data has a clear separation between treatment and age, where all the treated patients are young and all the untreated patients are old, the model has a hard time estimating the effect of treatment because it’s confounded by age. This is called 
<a href="https://www.kenkoonwong.com/blog/mle/" rel="nofollow" target="_blank">complete separation</a>, and it leads to infinite estimates for the coefficients, which is why we see those warnings. In real-world data we might not have such clear-cut separation, but even partial separation can lead to unstable estimates. That’s why it’s important to check for separation and consider penalized regression methods if we encounter this issue.</p>
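<p>As a quick illustration of the penalized-regression idea (a sketch of mine, not from the original post), the <code>survival::ridge()</code> term adds an L2 penalty inside the <code>coxph</code> formula and tames the runaway coefficient; <code>theta = 1</code> is an arbitrary penalty strength here:</p>

```r
library(survival)

# Same mock data as above: complete separation between treatment and age
df_sep <- data.frame(
  time      = c(5, 2, 6, 1, 2, 2, 7, 3, 7, 2, 1, 6),
  status    = c(1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1),
  treatment = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1),
  age       = c(30, 80, 35, 85, 32, 30, 25, 90, 98, 98, 89, 20)
)

# ridge() penalizes the listed terms; theta = 1 is an arbitrary choice
fit_ridge <- coxph(Surv(time, status) ~ ridge(treatment, age, theta = 1),
                   data = df_sep)
coef(fit_ridge)  # finite, shrunken estimates instead of a runaway treatment coefficient
```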
</details>
<p>Let’s simulate data so that we can estimate a more accurate hazard ratio with a Cox model and see how it compares to the true hazard ratio that we set in the simulation.</p>




<h2 id="sim">Simulation
  <a href="https://www.kenkoonwong.com/blog/survival/#sim" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<pre>library(survival)
library(survminer)

# simulate data of HR 0.55 (95%CI 0.442-0.674)
set.seed(1)
n &lt;- 350
base_event &lt;- 25
base_rate &lt;- 1/base_event
treatment_event &lt;- base_event + 20
treatment_rate &lt;- 1/treatment_event
hr &lt;- treatment_rate/base_rate
coef &lt;- log(hr)
confounder &lt;- rbinom(n,1,0.5)
treatment &lt;- rbinom(n, 1, plogis(0.5*confounder))
true_time &lt;- rexp(n, rate = base_rate*exp(coef*treatment+0.5*confounder))
cens_time &lt;- runif(n, min = 0, max = treatment_event)         

df &lt;- tibble(true_time, cens_time) |&gt;
  mutate(time = pmin(true_time, cens_time),
         status = case_when(
           true_time &lt;= cens_time ~ 1,
           TRUE ~ 0
         )) |&gt;
  mutate(confounder = confounder) |&gt;
  mutate(treatment = treatment |&gt; as.factor())

head(df)

## # A tibble: 6 × 6
##   true_time cens_time  time status confounder treatment
##       &lt;dbl&gt;     &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt;      &lt;int&gt; &lt;fct&gt;    
## 1     13.0      42.3  13.0       1          0 0        
## 2      8.83      2.60  2.60      0          0 0        
## 3      4.49     13.7   4.49      1          1 0        
## 4     61.4      10.8  10.8       0          1 1        
## 5     47.7      17.3  17.3       0          0 1        
## 6     11.6      34.3  11.6       1          1 1
</pre><p>In the simulation above, we set the true hazard ratio to 0.55, which means the treatment group has a 45% reduction in the hazard of the event compared to the control group. We then simulate the true time to event from an exponential distribution, along with a censoring time from a uniform distribution. The observed time is the minimum of the true time and the censoring time, and the status variable indicates whether the event was observed (1) or censored (0).</p>
<p>Simulating the above is helpful because we know the true rate was derived from an exponential distribution with the base rate multiplied by the hazard ratio, so we can compare the estimated hazard ratio from the Cox model to the true hazard ratio we set in the simulation. The part that connects intuitively is how <code>exp(coef*treatment + coef2*confounder)</code> resembles linear regression. If you noticed that we use <code>base_rate*exp(...)</code>, it’s essentially the same as <code>exp(log(base_rate) + coef*treatment + coef2*confounder)</code>, which is the same as <code>exp(intercept + coef*treatment + coef2*confounder)</code>, where the intercept is log(base_rate). So, in a way, the Cox model treats the log of the hazard function as a linear combination of the covariates, much as linear regression models the mean of the outcome as a linear combination of the covariates. The difference is that the Cox model targets the hazard function, the instantaneous rate of event occurrence at time t, whereas linear regression targets the mean of the outcome variable.</p>
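<p>The equivalence in the paragraph above is easy to check numerically (a sketch using the same numbers as the simulation):</p>

```r
base_rate <- 1 / 25                    # baseline hazard rate
coef      <- log((1 / 45) / (1 / 25))  # treatment log hazard ratio, ~log(0.56)
coef2     <- 0.5                       # confounder effect
treatment <- 1; confounder <- 1

r1 <- base_rate * exp(coef * treatment + coef2 * confounder)
r2 <- exp(log(base_rate) + coef * treatment + coef2 * confounder)
all.equal(r1, r2)  # TRUE: same rate, written as exp(intercept + linear predictor)
```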




<h2 id="km">Kaplan-Meier Estimator
  <a href="https://www.kenkoonwong.com/blog/survival/#km" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<pre>(survdiff(Surv(time,status) ~ treatment, data = df))

## Call:
## survdiff(formula = Surv(time, status) ~ treatment, data = df)
## 
##               N Observed Expected (O-E)^2/E (O-E)^2/V
## treatment=0 149       74     62.3      2.20      3.61
## treatment=1 201       87     98.7      1.39      3.61
## 
##  Chisq= 3.6  on 1 degrees of freedom, p= 0.06

km_fit &lt;- survfit(Surv(time, status) ~ treatment, data = df)
</pre><p>Interestingly, the log-rank test did not show a statistically significant difference between the two groups (p = 0.06), even though we set a true hazard ratio of 0.55 in the simulation; the unadjusted comparison is diluted because the confounder is more common in the treatment group and increases the hazard. The Kaplan-Meier estimator will give us the estimated survival curves for each group, and we can visually compare the survival of the treatment and control groups.</p>




<h2 id="cox">Cox Proportional Hazard Model
  <a href="https://www.kenkoonwong.com/blog/survival/#cox" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<pre>cox_fit &lt;- coxph(Surv(time, status) ~ treatment + confounder, data = df, x = T)
summary(cox_fit)

## Call:
## coxph(formula = Surv(time, status) ~ treatment + confounder, 
##     data = df, x = T)
## 
##   n= 350, number of events= 161 
## 
##               coef exp(coef) se(coef)      z Pr(&gt;|z|)   
## treatment1 -0.3730    0.6887   0.1603 -2.327  0.01999 * 
## confounder  0.5024    1.6526   0.1605  3.130  0.00175 **
## ---
## Signif. codes:  0 &#39;***&#39; 0.001 &#39;**&#39; 0.01 &#39;*&#39; 0.05 &#39;.&#39; 0.1 &#39; &#39; 1
## 
##            exp(coef) exp(-coef) lower .95 upper .95
## treatment1    0.6887     1.4521     0.503    0.9429
## confounder    1.6526     0.6051     1.207    2.2635
## 
## Concordance= 0.582  (se = 0.024 )
## Likelihood ratio test= 13.34  on 2 df,   p=0.001
## Wald test            = 13.34  on 2 df,   p=0.001
## Score (logrank) test = 13.5  on 2 df,   p=0.001

# plot
ggsurvplot(
  fit = km_fit,
  data = df,
  # pval = TRUE,
  conf.int = TRUE,
  risk.table = TRUE,
  legend.labs = c(&quot;Control&quot;, &quot;Treatment&quot;),
  title    = &quot;Kaplan-Meier Survival Curves&quot;
)
</pre><img src="https://i0.wp.com/www.kenkoonwong.com/blog/survival/index_files/figure-html/unnamed-chunk-10-1.png?w=450&#038;ssl=1" data-recalc-dims="1" />
<p>Notice how the estimated hazard ratio from the Cox model is reasonably close to the true hazard ratio of 0.55 that we set in the simulation: our estimated HR is 0.69 (95% CI 0.50-0.94). The Kaplan-Meier plot shows the survival curves for each group, and we can visually see the difference in survival between the treatment and control groups.</p>
<blockquote>
<p>Note: <code>survdiff</code> calculates log-rank test, <code>survfit</code> estimates the survival function, and <code>coxph</code> estimates the hazard ratio adjusting for covariates</p>
</blockquote>
<p>There is an 
<a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC3653612/" rel="nofollow" target="_blank">interesting article by Hernán</a> that cautions against the use of unadjusted HRs and unadjusted survival curves (which we just plotted, since they are based on the KM estimator). He also points out that a single average HR across the entire follow-up can be misleading because the true effect may vary over time. Let’s see if we can apply that to our current plot. Let’s use <code>adjustedCurves</code> and see if it looks different.</p>
<pre>library(adjustedCurves)

adjust_curve &lt;- adjustedsurv(
  data = df, 
  ev_time = &quot;time&quot;, 
  event = &quot;status&quot;, 
  variable = &quot;treatment&quot;, 
  method = &quot;direct&quot;, 
  outcome_model = cox_fit, 
  conf_int = T)

plot(adjust_curve, conf_int = T)
</pre><img src="https://i1.wp.com/www.kenkoonwong.com/blog/survival/index_files/figure-html/unnamed-chunk-11-1.png?w=450&#038;ssl=1" data-recalc-dims="1" />
<p>Interesting. They do look different! (Though the paper didn’t directly use this package.) Hernán also proposed a way to estimate a time-varying HR more accurately, using pooled logistic regression with a spline on time as a feature. Let’s try that next time! So much to learn! On 
<a href="https://www.emilyzabor.com/survival-analysis-in-r.html#assessing-proportional-hazards" rel="nofollow" target="_blank">Emily Zabor’s blog</a>, she mentions the <code>survival::cox.zph()</code> function, which lets us check the proportional hazards assumption.</p>
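<p>A minimal, self-contained sketch of that check (using the built-in <code>lung</code> dataset rather than our simulated data):</p>

```r
library(survival)

# Fit a Cox model on the built-in lung dataset
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Schoenfeld residual test of the proportional hazards assumption;
# a small p-value suggests the hazard ratio varies over time
zph <- cox.zph(fit)
print(zph)
plot(zph)  # flat smoothed residuals over time support PH
```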




<h2 id="ack">Acknowledgement
  <a href="https://www.kenkoonwong.com/blog/survival/#ack" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<p>Thanks to Emily Zabor’s tutorial and also personal advice on practical usage of survival analysis! Her blog contains so much more advanced topics and some functions and packages I’m planning to use in the future. It’s truly one of the more comprehensive and yet easy to understand tutorials I’ve seen on survival analysis. Thanks Emily!</p>




<h2 id="opportunities">Opportunities For Improvement
  <a href="https://www.kenkoonwong.com/blog/survival/#opportunities" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<ul>
<li>learn competing risk analysis with fine-gray model</li>
<li>learn to customize ggsurvplot</li>
<li>use <code>gtsummary::tbl_regression(exp = TRUE)</code> to further beautify aHR</li>
<li>test out Hernán’s proposed solution to calculate HR</li>
<li>let’s test out other dataset such as BMT from SemiCompRisks, Melanoma from MASS,</li>
</ul>




<h2 id="lessons">Lessons learnt
  <a href="https://www.kenkoonwong.com/blog/survival/#lessons" rel="nofollow" target="_blank"><svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg">
      <path d="M0 0h24v24H0z" fill="currentColor"></path>
      <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path>
    </svg></a>
</h2>
<ul>
<li>calculating by hand is helpful; I realized we can’t update the at-risk count one row at a time, because tied observations (subjects sharing the same time) have to be handled together at each distinct time point, which is why we used a for loop over distinct times for clarity</li>
<li><code>survdiff</code> calculates log-rank test, <code>survfit</code> estimates the survival function, and <code>coxph</code> estimates the hazard ratio adjusting for covariates</li>
<li>censoring is usually a good thing (the subject reached the end of observation event-free), but it can also mean lost to follow-up.</li>
</ul>
<p>If you like this article:</p>
<ul>
<li>please feel free to send me a 
<a href="https://www.kenkoonwong.com/blog/" rel="nofollow" target="_blank">comment or visit my other blogs</a></li>
<li>please feel free to follow me on 
<a href="https://bsky.app/profile/kenkoonwong.bsky.social" rel="nofollow" target="_blank">BlueSky</a>, 
<a href="https://twitter.com/kenkoonwong/" rel="nofollow" target="_blank">twitter</a>, 
<a href="https://github.com/kenkoonwong/" rel="nofollow" target="_blank">GitHub</a> or 
<a href="https://rstats.me/@kenkoonwong" rel="nofollow" target="_blank">Mastodon</a></li>
<li>if you would like collaborate please feel free to 
<a href="https://www.kenkoonwong.com/contact/" rel="nofollow" target="_blank">contact me</a></li>
</ul>

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://www.kenkoonwong.com/blog/survival/"> r on Everyday Is A School Day</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/learning-exploring-survival-analysis-part-1-a-note-to-myself/">Learning & Exploring Survival Analysis Part 1 – A Note To Myself</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400996</post-id>	</item>
		<item>
		<title>You Don&#8217;t Need to Learn All the Weights on tabular data: The Case for rvflnet (a nonlinear expressive glmnet) on regression, classification and survival analysis</title>
		<link>https://www.r-bloggers.com/2026/05/you-dont-need-to-learn-all-the-weights-on-tabular-data-the-case-for-rvflnet-a-nonlinear-expressive-glmnet-on-regression-classification-and-survival-analysis/</link>
		
		<dc:creator><![CDATA[T. Moudiki]]></dc:creator>
		<pubDate>Sat, 02 May 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://thierrymoudiki.github.io//blog/2026/05/02/r/rvflnet</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; "> rvflnet is an R package that implements a Random Vector Functional Link (RVFL) network. It is a nonlinear expressive version of glmnet that can be used for regression, classification and survival analysis.</div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/you-dont-need-to-learn-all-the-weights-on-tabular-data-the-case-for-rvflnet-a-nonlinear-expressive-glmnet-on-regression-classification-and-survival-analysis/">You Don’t Need to Learn All the Weights on tabular data: The Case for rvflnet (a nonlinear expressive glmnet) on regression, classification and survival analysis</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://thierrymoudiki.github.io//blog/2026/05/02/r/rvflnet"> T. Moudiki's Webpage - R</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<h2 id="introduction">Introduction</h2>

<p>Random Vector Functional Link (RVFL) networks offer a simple yet powerful alternative to traditional neural networks for tabular data. Instead of learning hidden layers through backpropagation, RVFL generates them <strong>randomly</strong> (or not, if using a deterministic sequence of quasi-random numbers) and focuses all learning effort on a final, regularized linear model.</p>

<p>Formally, let</p>

\[X \in \mathbb{R}^{n \times p}\]

<p>be the input data. RVFL networks (the ones described in this blog post) construct a set of nonlinear features by projecting \(X\) onto a random matrix</p>

\[W \in \mathbb{R}^{p \times m},\]

<p>and applying an activation function \(g(\cdot)\):</p>

\[H = g\left( \frac{X - \mu}{\sigma} ; W \right).\]

<p>These random nonlinear features are then concatenated with the original inputs to form an augmented design matrix:</p>

\[Z = [X | H].\]

<p>The model prediction is obtained by fitting a linear model on this expanded space (hence, a nonlinear GLM):</p>

\[\hat{y} = Z \beta.\]

<p>Because \(Z\) can be high-dimensional and highly redundant, RVFL networks (the ones described in this blog post) rely on <strong>Elastic Net regularization</strong> (<a href="https://glmnet.stanford.edu/articles/glmnet.html" rel="nofollow" target="_blank"><code>glmnet</code></a>) to estimate the coefficients:</p>

\[\hat{\beta} = \arg\min_{\beta}\mathcal{L}(y, Z\beta) + \lambda \left(\alpha ||\beta||_1 + (1-\alpha)||\beta||_2^2\right).\]

<p>In this framework, randomness creates a rich pool of nonlinear transformations, while regularization selects and stabilizes the most useful ones. The result is a nonlinear model that combines the flexibility of neural networks with the efficiency and robustness of linear methods.</p>
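<p>The construction above can be sketched in a few lines of base R. This is a toy illustration, not the <code>rvflnet</code> implementation: the sigmoid activation, the standardization, and a closed-form ridge solve stand in for the package’s <code>glmnet</code> fit:</p>

<pre>set.seed(1)
n &lt;- 100; p &lt;- 3; m &lt;- 10
X &lt;- matrix(rnorm(n * p), n, p)
y &lt;- sin(X[, 1]) + 0.1 * rnorm(n)

# Random hidden layer: standardize X, project on W, apply sigmoid
W &lt;- matrix(rnorm(p * m), p, m)
H &lt;- 1 / (1 + exp(-scale(X) %*% W))

# Augmented design matrix Z = [X | H]
Z &lt;- cbind(X, H)

# Ridge regression in closed form (stand-in for the Elastic Net fit)
lambda &lt;- 0.1
beta &lt;- solve(crossprod(Z) + lambda * diag(ncol(Z)), crossprod(Z, y))
y_hat &lt;- Z %*% beta
</pre>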

<p>Of course, this blog post is not a proof of its title. It’s about the <a href="https://docs.techtonique.net/rvflnet/index.html" rel="nofollow" target="_blank">R package <code>rvflnet</code></a>. But you can appreciate the <strong>high performance of RVFLs</strong> on regression, classification, and survival analysis, and notably on the controversial <code>Boston</code> dataset, where they perform on par with Random Forest or Gradient Boosting.</p>

<h2 id="0---install-package">0 &#8211; Install package</h2>

<pre>install.packages(&quot;survival&quot;, repos = &quot;https://cran.r-project.org&quot;) # survival analysis

install.packages(&quot;remotes&quot;, repos = &quot;https://cran.r-project.org&quot;)

remotes::install_github(&quot;thierrymoudiki/rvflnet&quot;) # Nonlinear GLM (RVFL networks)
</pre>

<h2 id="1---regression">1 &#8211; Regression</h2>

<pre>set.seed(123)

library(glmnet)
data(Boston, package = &quot;MASS&quot;)

# -------------------------
# Data
# -------------------------
X &lt;- as.matrix(Boston[, -14])
y &lt;- Boston$medv

n &lt;- nrow(X)
idx &lt;- sample(1:n, size = round(0.8 * n))

X_train &lt;- X[idx, ]
y_train &lt;- y[idx]

X_test &lt;- X[-idx, ]
y_test &lt;- y[-idx]

# -------------------------
# Grid
# -------------------------
grid &lt;- expand.grid(
  n_hidden = c(175, 200, 225, 250),
  alpha = seq(0.1, 0.5, by=0.2),
  include_original = c(TRUE, FALSE),
  seed = 1,
  stringsAsFactors = FALSE
)

results &lt;- vector(&quot;list&quot;, nrow(grid))

# -------------------------
# Loop
# -------------------------
for (i in seq_len(nrow(grid))) {

  params &lt;- grid[i, ]

  #cat(&quot;\n========================================\n&quot;)
  #cat(sprintf(&quot;Run %d / %d\n&quot;, i, nrow(grid)))
  #print(params)

  # -------------------------
  # Fit model
  # -------------------------
  fit &lt;- rvflnet::rvflnet(
    X_train, y_train,
    n_hidden = params$n_hidden,
    activation = &quot;sigmoid&quot;,
    W_type = &quot;gaussian&quot;,
    seed = params$seed,
    include_original = params$include_original, # direct link, skip connection or not
    alpha = params$alpha
  )

  # -------------------------
  # Evaluate full lambda path
  # -------------------------
  lambdas &lt;- fit$fit$lambda

  preds &lt;- predict(fit, newx = X_test, s = lambdas)

  rmse_path &lt;- sqrt(colMeans((preds - y_test)^2))

  best_idx &lt;- which.min(rmse_path)

  best_rmse &lt;- rmse_path[best_idx]
  best_lambda &lt;- lambdas[best_idx]

  # -------------------------
  # Sparsity
  # -------------------------
  coef_mat &lt;- coef(fit, s = best_lambda)
  nonzero &lt;- sum(coef_mat[-1, 1] != 0)

  # -------------------------
  # Verbose output
  # -------------------------
  #cat(sprintf(&quot;Best RMSE: %.4f\n&quot;, best_rmse))
  #cat(sprintf(&quot;Best lambda: %.6f\n&quot;, best_lambda))
  #cat(sprintf(&quot;Non-zero coeffs: %d\n&quot;, nonzero))

  # -------------------------
  # Store
  # -------------------------
  results[[i]] &lt;- data.frame(
    n_hidden = params$n_hidden,
    alpha = params$alpha,
    include_original = params$include_original,
    seed = params$seed,
    rmse = best_rmse,
    lambda = best_lambda,
    nonzero = nonzero
  )
}

# -------------------------
# Aggregate
# -------------------------
results_df &lt;- do.call(rbind, results)
results_df &lt;- results_df[order(results_df$rmse), ]
print(head(results_df))

Loading required package: Matrix

Loaded glmnet 4.1-10



               n_hidden alpha include_original seed     rmse     lambda nonzero
s= 0.027561759      200   0.1             TRUE    1 2.881935 0.02756176     190
s= 0.017620327      200   0.3             TRUE    1 2.884739 0.01762033     167
s= 0.012734248      200   0.5             TRUE    1 2.889339 0.01273425     158
s= 0.036435024      175   0.1             TRUE    1 2.920012 0.03643502     165
s= 0.016833926      175   0.5             TRUE    1 2.938472 0.01683393     136
s= 0.023293035      175   0.3             TRUE    1 2.941267 0.02329304     144
</pre>

<p>An RMSE of 2.88 is on par with Random Forest or Gradient Boosting, with a <strong>significantly</strong> faster computation time.</p>

<h2 id="2---classification">2 - Classification</h2>

<h3 id="2---1-binary-classification">2 - 1 Binary Classification</h3>

<pre>set.seed(123)

data(iris)

# Binary classification: setosa vs others
y &lt;- ifelse(iris$Species == &quot;setosa&quot;, 1, 0)
X &lt;- as.matrix(iris[, 1:4])

# Train/test split
n &lt;- nrow(X)
idx &lt;- sample(1:n, size = round(0.8 * n))

X_train &lt;- X[idx, ]
y_train &lt;- y[idx]

X_test &lt;- X[-idx, ]
y_test &lt;- y[-idx]

# -------------------------
# Fit model
# -------------------------
cv_model &lt;- rvflnet::cv.rvflnet(
  X_train, y_train,
  n_hidden = 50,
  activation = &quot;relu&quot;,
  W_type = &quot;gaussian&quot;,
  family = &quot;binomial&quot;,
  nfolds = 5
)

# -------------------------
# Predictions (probabilities)
# -------------------------
(probs &lt;- predict(cv_model, X_test, type = &quot;response&quot;))

# Convert to class
y_pred &lt;- ifelse(probs &gt; 0.5, 1, 0)

all.equal(as.numeric(y_pred), as.numeric(predict(cv_model, X_test, type=&quot;class&quot;)))

# -------------------------
# Diagnostics
# -------------------------

# Accuracy
acc &lt;- mean(drop(y_pred) == y_test)
cat(&quot;Accuracy:&quot;, acc, &quot;\n&quot;)

# Confusion matrix
table(Predicted = y_pred, Actual = y_test)
</pre>

<table class="dataframe">
<caption>A matrix: 30 × 1 of type dbl</caption>
<thead>
	<tr><th scope="col">lambda.min</th></tr>
</thead>
<tbody>
	<tr><td>0.9997617002</td></tr>
	<tr><td>0.9992267955</td></tr>
	<tr><td>0.9997120678</td></tr>
	<tr><td>0.9997524867</td></tr>
	<tr><td>0.9996600481</td></tr>
	<tr><td>0.9992472082</td></tr>
	<tr><td>0.9996101744</td></tr>
	<tr><td>0.9999356520</td></tr>
	<tr><td>0.9998139568</td></tr>
	<tr><td>0.9995418762</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0019937012</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0005459970</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0005035848</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
	<tr><td>0.0003328885</td></tr>
</tbody>
</table>

<p>TRUE</p>

<pre>Accuracy: 1 



         Actual
Predicted  0  1
        0 20  0
        1  0 10
</pre>

<h3 id="2---2-multiclass-classification">2 - 2 Multiclass Classification</h3>

<pre>set.seed(123)

data(iris)

y &lt;- as.numeric(iris$Species)
X &lt;- as.matrix(iris[, 1:4])

# Train/test split
n &lt;- nrow(X)
idx &lt;- sample(1:n, size = round(0.8 * n))

X_train &lt;- X[idx, ]
y_train &lt;- y[idx]

X_test &lt;- X[-idx, ]
y_test &lt;- y[-idx]

# -------------------------
# Fit model
# -------------------------
cv_model &lt;- rvflnet::rvflnet(
  X_train, y_train,
  n_hidden = 50,
  activation = &quot;relu&quot;,
  W_type = &quot;gaussian&quot;,
  family = &quot;multinomial&quot;,
  nlambda = 25,
  nfolds = 5
)

# -------------------------
# Diagnostics
# -------------------------

# Accuracy
acc &lt;- colMeans(predict(cv_model, X_test, type=&quot;class&quot;) == y_test)
cat(&quot;Accuracies:&quot;, acc, &quot;\n&quot;) # consider other metrics

Accuracies: 0.1666667 0.7666667 0.9333333 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 0.9666667 
</pre>
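<p>Since <code>predict</code> returns one column per value of <code>lambda</code>, the best one on the path can be read off the accuracy vector. A self-contained continuation (accuracy values taken from the output above; in practice, select <code>lambda</code> on a validation set rather than the test set):</p>

<pre># Accuracy for each lambda on the path (values from the output above)
acc &lt;- c(0.1666667, 0.7666667, 0.9333333, rep(0.9666667, 20))

best &lt;- which.max(acc)   # index of the first lambda reaching the maximum
acc[best]
</pre>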

<h2 id="3---nonlinear-cox-survival-analysis">3 - Nonlinear Cox survival analysis</h2>

<h3 id="3---1-example-1">3 - 1 Example 1</h3>

<pre>library(survival)
library(rvflnet)

data(ovarian)

X &lt;- as.matrix(ovarian[, c(&quot;age&quot;, &quot;resid.ds&quot;, &quot;rx&quot;, &quot;ecog.ps&quot;)])
y &lt;- Surv(ovarian$futime, ovarian$fustat)

set.seed(123)
n &lt;- nrow(X)
train_idx &lt;- sample(1:n, size = round(0.8 * n))

X_train &lt;- X[train_idx, ]
X_test  &lt;- X[-train_idx, ]
y_train &lt;- y[train_idx]
y_test  &lt;- y[-train_idx]

# -------------------------
# Fit model
# -------------------------
cv_fit &lt;- rvflnet::cv.rvflnet(
  X_train, y_train,
  family = &quot;cox&quot;,
  nfolds = 5,
  type.measure = &quot;C&quot;
)

plot(cv_fit)

# Out-of-sample C-index
print(glmnet::Cindex(pred = predict(cv_fit, X_test), y = y_test))


Warning message in data(ovarian):
“data set ‘ovarian’ not found”


[1] 0.8571429
</pre>

<p><img src="https://i2.wp.com/thierrymoudiki.github.io/images/2026-05-02/2026-05-02-rvflnet_15_2.png?w=578&#038;ssl=1" alt="image-title-here" class="img-responsive" data-recalc-dims="1" /></p>
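<p>For reference, the C-index computed by <code>glmnet::Cindex</code> is the proportion of comparable patient pairs whose predicted risk ordering agrees with their observed survival ordering. A naive base R version (a toy sketch for intuition only, ignoring the tie-handling and efficiency of the real implementation) looks like this:</p>

<pre>cindex_naive &lt;- function(risk, time, event) {
  num &lt;- 0; den &lt;- 0
  n &lt;- length(time)
  for (i in seq_len(n)) for (j in seq_len(n)) {
    # a pair is comparable if subject i fails before time j
    if (time[i] &lt; time[j] &amp;&amp; event[i] == 1) {
      den &lt;- den + 1
      if (risk[i] &gt; risk[j]) num &lt;- num + 1
      else if (risk[i] == risk[j]) num &lt;- num + 0.5
    }
  }
  num / den
}

# Sanity check: a risk score that decreases with survival time is perfect
cindex_naive(risk = -(1:5), time = 1:5, event = rep(1, 5)) # 1
</pre>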

<h3 id="3---2-example-2">3 - 2 Example 2</h3>

<pre>library(glmnet)
library(survival)

data(pbc)
pbc2       &lt;- pbc[!is.na(pbc$trt), ]
pbc2$event &lt;- as.integer(pbc2$status == 2)
pbc2$sex_n &lt;- as.integer(pbc2$sex == &quot;f&quot;)

feat_cols &lt;- c(&quot;trt&quot;,&quot;age&quot;,&quot;sex_n&quot;,&quot;ascites&quot;,&quot;hepato&quot;,&quot;spiders&quot;,&quot;edema&quot;,
               &quot;bili&quot;,&quot;chol&quot;,&quot;albumin&quot;,&quot;copper&quot;,&quot;alk.phos&quot;,&quot;ast&quot;,
               &quot;trig&quot;,&quot;platelet&quot;,&quot;protime&quot;,&quot;stage&quot;)

df &lt;- pbc2[, c(&quot;time&quot;, &quot;event&quot;, feat_cols)]
for (col in feat_cols)
  if (any(is.na(df[[col]])))
    df[[col]][is.na(df[[col]])] &lt;- median(df[[col]], na.rm = TRUE)

set.seed(42)
idx_train &lt;- sample(nrow(df), floor(0.75 * nrow(df)))
train &lt;- df[idx_train, ]; test &lt;- df[-idx_train, ]
X_tr  &lt;- as.matrix(train[, feat_cols])
X_te  &lt;- as.matrix(test[,  feat_cols])
y_tr   &lt;- Surv(train$time, train$event)

fit &lt;- rvflnet::rvflnet(
  X_tr, y_tr,
  family = &quot;cox&quot;,
  alpha=0.1, lambda=0.1 # not recommended
)

y_te   &lt;- Surv(test$time, test$event)
ci &lt;- glmnet::Cindex(predict(fit, X_te), y_te)

cat(&quot;\n=== Test-set C-index ===\n&quot;)
print(ci)


=== Test-set C-index ===
[1] 0.8218117

fit &lt;- rvflnet::rvflnet(
  X_tr, y_tr,
  family = &quot;cox&quot;,
  alpha=0.1, nlambda=50
)

y_te   &lt;- Surv(test$time, test$event)

(cis &lt;- apply(predict(fit, X_te), 2, function(x) glmnet::Cindex(x, y_te)))

#cat(&quot;\n=== Test-set C-index ===\n&quot;)
plot(log(fit$fit$lambda), cis, type = 'l')
abline(h=0.8, lty=2, col=&quot;red&quot;)
</pre>

<style>
.dl-inline {width: auto; margin:0; padding: 0}
.dl-inline>dt, .dl-inline>dd {float: none; width: auto; display: inline-block}
.dl-inline>dt::after {content: ":\0020"; padding-right: .5ex}
.dl-inline>dt:not(:first-of-type) {padding-left: .5ex}
</style>
<p><dl class=dl-inline><dt>s0</dt><dd>0.5</dd><dt>s1</dt><dd>0.762812872467223</dd><dt>s2</dt><dd>0.802145411203814</dd><dt>s3</dt><dd>0.811084624553039</dd><dt>s4</dt><dd>0.811680572109654</dd><dt>s5</dt><dd>0.814064362336114</dd><dt>s6</dt><dd>0.815852205005959</dd><dt>s7</dt><dd>0.817640047675805</dd><dt>s8</dt><dd>0.820023837902265</dd><dt>s9</dt><dd>0.81942789034565</dd><dt>s10</dt><dd>0.817640047675805</dd><dt>s11</dt><dd>0.81823599523242</dd><dt>s12</dt><dd>0.81823599523242</dd><dt>s13</dt><dd>0.815852205005959</dd><dt>s14</dt><dd>0.814660309892729</dd><dt>s15</dt><dd>0.813468414779499</dd><dt>s16</dt><dd>0.813468414779499</dd><dt>s17</dt><dd>0.815852205005959</dd><dt>s18</dt><dd>0.814660309892729</dd><dt>s19</dt><dd>0.82061978545888</dd><dt>s20</dt><dd>0.81942789034565</dd><dt>s21</dt><dd>0.82181168057211</dd><dt>s22</dt><dd>0.82061978545888</dd><dt>s23</dt><dd>0.817044100119189</dd><dt>s24</dt><dd>0.817640047675805</dd><dt>s25</dt><dd>0.81823599523242</dd><dt>s26</dt><dd>0.814660309892729</dd><dt>s27</dt><dd>0.810488676996424</dd><dt>s28</dt><dd>0.803933253873659</dd><dt>s29</dt><dd>0.802145411203814</dd><dt>s30</dt><dd>0.799761620977354</dd><dt>s31</dt><dd>0.793206197854589</dd><dt>s32</dt><dd>0.789034564958284</dd><dt>s33</dt><dd>0.777711561382598</dd><dt>s34</dt><dd>0.771156138259833</dd><dt>s35</dt><dd>0.766984505363528</dd><dt>s36</dt><dd>0.756853396901073</dd><dt>s37</dt><dd>0.748510131108462</dd><dt>s38</dt><dd>0.743146603098927</dd><dt>s39</dt><dd>0.735399284862932</dd><dt>s40</dt><dd>0.728843861740167</dd><dt>s41</dt><dd>0.721692491060787</dd><dt>s42</dt><dd>0.718116805721096</dd><dt>s43</dt><dd>0.717520858164482</dd><dt>s44</dt><dd>0.716924910607867</dd><dt>s45</dt><dd>0.716924910607867</dd><dt>s46</dt><dd>0.715733015494636</dd><dt>s47</dt><dd>0.716328963051251</dd><dt>s48</dt><dd>0.715137067938021</dd><dt>s49</dt><dd>0.713945172824791</dd></dl></p>

<p><img src="https://i0.wp.com/thierrymoudiki.github.io/images/2026-05-02/2026-05-02-rvflnet_18_1.png?w=578&#038;ssl=1" alt="image-title-here" class="img-responsive" data-recalc-dims="1" /></p>


<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://thierrymoudiki.github.io//blog/2026/05/02/r/rvflnet"> T. Moudiki's Webpage - R</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/you-dont-need-to-learn-all-the-weights-on-tabular-data-the-case-for-rvflnet-a-nonlinear-expressive-glmnet-on-regression-classification-and-survival-analysis/">You Don’t Need to Learn All the Weights on tabular data: The Case for rvflnet (a nonlinear expressive glmnet) on regression, classification and survival analysis</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400979</post-id>	</item>
		<item>
		<title>Regression Modeling Strategies Short Course 2026, with Frank Harrell; May 14, 15, 18, 19</title>
		<link>https://www.r-bloggers.com/2026/05/regression-modeling-strategies-short-course-2026-with-frank-harrell-may-14-15-18-19/</link>
		
		<dc:creator><![CDATA[Drew Levy]]></dc:creator>
		<pubDate>Fri, 01 May 2026 17:20:19 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://r-posts.com/?p=19151</guid>

					<description><![CDATA[<p>Frank Harrell’s Regression Modeling Strategies online seminar will take place May 14, 15, 18, and 19. This workshop covers principled strategies for building, validating, and interpreting multivariable regression models for a wide range of outcomes, with emphasis on predictive accuracy, avoiding overfitting, and interpreting estimated effects. It explores spline methods, data reduction, benefits ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/regression-modeling-strategies-short-course-2026-with-frank-harrell-may-14-15-18-19/">Regression Modeling Strategies Short Course 2026, with Frank Harrell; May 14, 15, 18, 19</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="http://r-posts.com/regression-modeling-strategies-short-course-2026-with-frank-harrell-may-14-15-18-19/"> R-posts.com</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report an issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p>Frank Harrell’s <a href="https://hbiostat.org/doc/rms/4day" rel="nofollow" target="_blank"><strong>Regression Modeling Strategies</strong></a> online seminar will take place May 14, 15, 18, and 19.</p>
<p>This workshop covers principled strategies for building, validating, and interpreting multivariable regression models for a wide range of outcomes, with emphasis on predictive accuracy, avoiding overfitting, and interpreting estimated effects. It explores spline methods, data reduction, benefits of Bayesian modeling, robust semiparametric ordinal, longitudinal, and survival models, and rigorous resampling-based validation, illustrated with applied case studies and R examples. More details <a href="https://hbiostat.org/doc/rms/desc" rel="nofollow" target="_blank">here</a>.</p>
<p>Along with the 1-day <a href="https://instats.org/seminar/introduction-to-r-regression-and-the-rms" rel="nofollow" target="_blank">Introduction to R, Regression, and the rms Package,</a> these <a href="https://hbiostat.org/course/" rel="nofollow" target="_blank">virtual seminars</a> are offered through <a href="https://instats.org/" rel="nofollow" target="_blank">Instats</a>, in association with the A.S.A.</p>
<p><img loading="lazy" fetchpriority="high" decoding="async" src="https://i0.wp.com/r-posts.com/wp-content/uploads/2026/04/download-24-198x300.png?resize=198%2C300" alt="" width="198" height="300" class="alignnone size-medium wp-image-19154" srcset_temp="https://i0.wp.com/r-posts.com/wp-content/uploads/2026/04/download-24-198x300.png?resize=198%2C300 198w, http://r-posts.com/wp-content/uploads/2026/04/download-24.png 441w" sizes="(max-width: 198px) 100vw, 198px" data-recalc-dims="1" /> <img loading="lazy" decoding="async" src="https://i2.wp.com/r-posts.com/wp-content/uploads/2026/04/FrankHarrell-2-450x143.png?resize=450%2C143" alt="" width="450" height="143" class="alignnone size-medium wp-image-19155" srcset_temp="https://i2.wp.com/r-posts.com/wp-content/uploads/2026/04/FrankHarrell-2-450x143.png?resize=450%2C143 450w, http://r-posts.com/wp-content/uploads/2026/04/FrankHarrell-2-768x243.png 768w, http://r-posts.com/wp-content/uploads/2026/04/FrankHarrell-2.png 770w" sizes="(max-width: 450px) 100vw, 450px" data-recalc-dims="1" /></p><hr style="border-top: black solid 1px" /><a href="http://r-posts.com/regression-modeling-strategies-short-course-2026-with-frank-harrell-may-14-15-18-19/" rel="nofollow" target="_blank">Regression Modeling Strategies Short Course 2026, with Frank Harrell; May 14, 15, 18, 19</a> was first posted on May 1, 2026 at 5:20 pm.<br />
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="http://r-posts.com/regression-modeling-strategies-short-course-2026-with-frank-harrell-may-14-15-18-19/"> R-posts.com</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/regression-modeling-strategies-short-course-2026-with-frank-harrell-may-14-15-18-19/">Regression Modeling Strategies Short Course 2026, with Frank Harrell; May 14, 15, 18, 19</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400962</post-id>	</item>
		<item>
		<title>Reactive Shiny Apps and Deployment with Google Cloud Run: Intermediate R Shiny Workshop</title>
		<link>https://www.r-bloggers.com/2026/05/reactive-shiny-apps-and-deployment-with-google-cloud-run-intermediate-r-shiny-workshop/</link>
		
		<dc:creator><![CDATA[Dariia Mykhailyshyna]]></dc:creator>
		<pubDate>Fri, 01 May 2026 11:35:38 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://r-posts.com/?p=19182</guid>

					<description><![CDATA[<p>Join our workshop on Reactive Shiny Apps and Deployment with Google Cloud Run: Intermediate R Shiny Workshop,  which is a part of our workshops for Ukraine series!  Here’s some more info:  Title: Reactive Shiny Apps and Deployment with Google Cloud Run: Intermediate R Shiny Workshop  Date: Thursday, May 21st, 18:00 – 20:00 ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/reactive-shiny-apps-and-deployment-with-google-cloud-run-intermediate-r-shiny-workshop/">Reactive Shiny Apps and Deployment with Google Cloud Run: Intermediate R Shiny Workshop</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="http://r-posts.com/reactive-shiny-apps-and-deployment-with-google-cloud-run-intermediate-r-shiny-workshop/"> R-posts.com</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report an issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p><span style="font-weight: 400">Join our workshop on Reactive Shiny Apps and Deployment with Google Cloud Run: Intermediate R Shiny Workshop</span><span style="font-weight: 400">, </span><span style="font-weight: 400"> which is a part of our workshops for Ukraine series! </span></p>
<br />
<p><b>Here’s some more info: </b></p>
<br />
<p><b>Title</b><span style="font-weight: 400">: Reactive Shiny Apps and Deployment with Google Cloud Run: Intermediate R Shiny Workshop </span></p>
<p><b>Date</b><span style="font-weight: 400">: Thursday, May 21st, 18:00 – 20:00 CET (Rome, Berlin, Paris timezone) </span></p>
<p><b>Speaker</b><span style="font-weight: 400">:Alfredo Hernández Sánchez is a Marie Skłodowska Curie ERA Postdoctoral Fellow at Vilnius University, where he leads the FIRSA project on financial regulation and innovation in Europe. His work combines applied research, data analysis, and reproducible computational methods, with a strong interest in turning research outputs into accessible digital tools such as dashboards and interactive web applications. He works extensively with R, Quarto, and Shiny in academic and policy-oriented settings.</span></p>
<p><a href="http://www.alfredohs.com/" rel="nofollow" target="_blank"><span style="font-weight: 400">www.alfredohs.com</span></a><span style="font-weight: 400"> </span></p>
<p><b>Description: </b><span style="font-weight: 400">This workshop is designed for people who already know the basics of Shiny and want to build apps that are more robust, more reactive, and easier to maintain. We will look at practical reactive patterns, app structure, and some common choices that make Shiny dashboards easier to develop as they become more complex. In the second part of the workshop, I will show how a Shiny app can move from local development to a public deployment on Google Cloud Run, using a real dashboard project as an example. The session will give participants a practical introduction to a cloud-based workflow for publishing and maintaining Shiny applications in a highly customizable environment. Basic familiarity with Shiny is assumed, and some previous experience building simple apps will help participants get the most out of the session.</span></p>
<p><b>Minimal registration fee:</b><span style="font-weight: 400"> 20 euro (or 20 USD or 800 UAH)</span></p>
<br />
<br />
<br />
<br />
<br />
<br />
<p><span style="font-weight: 400">Please note that the registration confirmation is sent 1 day before the workshop to all registered participants rather than immediately after registration</span></p>
<br />
<p><b>How can I register?</b></p>
<br />
<ul>
	<li style="font-weight: 400"><span style="font-weight: 400">Go to </span><a href="https://bit.ly/3wvwMA6" rel="nofollow" target="_blank"><span style="font-weight: 400">https://bit.ly/3wvwMA6</span></a><span style="font-weight: 400"> or </span><a href="https://bit.ly/4aD5LMC" rel="nofollow" target="_blank"><span style="font-weight: 400">https://bit.ly/4aD5LMC</span></a><span style="font-weight: 400">  or  </span><a href="https://bit.ly/3PFxtNA" rel="nofollow" target="_blank"><span style="font-weight: 400">https://bit.ly/3PFxtNA</span></a><span style="font-weight: 400"> and donate</span><b> at least 20 euro</b><span style="font-weight: 400">. </span><span style="font-weight: 400">Feel free to donate more if you can, all proceeds go directly to support Ukraine.</span></li>
</ul>
<br />
<ul>
	<li style="font-weight: 400"><span style="font-weight: 400">Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)</span></li>
</ul>
<br />
<ul>
	<li style="font-weight: 400"><span style="font-weight: 400">Fill in the</span><a href="https://forms.gle/DL4VFswe6gQQrDKz9" rel="nofollow" target="_blank"><span style="font-weight: 400"> registration form</span></a><span style="font-weight: 400">, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).</span></li>
</ul>
<br />
<p><span style="font-weight: 400">If you are not personally interested in attending, you can also contribute by sponsoring a participation of a student, who will then be able to participate for free. If you choose to sponsor a student, all proceeds will also go directly to organisations working in Ukraine. You can either sponsor a particular student or you can leave it up to us so that we can allocate the sponsored place to students who have signed up for the waiting list.</span></p>
<br />
<p><b>How can I sponsor a student?</b></p>
<ul>
	<li style="font-weight: 400"><span style="font-weight: 400">Go to </span><a href="https://bit.ly/3wvwMA6" rel="nofollow" target="_blank"><span style="font-weight: 400">https://bit.ly/3wvwMA6</span></a><span style="font-weight: 400"> or </span><a href="https://bit.ly/4aD5LMC" rel="nofollow" target="_blank"><span style="font-weight: 400">https://bit.ly/4aD5LMC</span></a><span style="font-weight: 400">  or </span><a href="https://bit.ly/3PFxtNA" rel="nofollow" target="_blank"><span style="font-weight: 400">https://bit.ly/3PFxtNA</span></a><span style="font-weight: 400"> and donate </span><b>at least 20 euro </b><span style="font-weight: 400">(or 17 GBP or 20 USD or 800 UAH). </span><span style="font-weight: 400">Feel free to donate more if you can, all proceeds go to support Ukraine!</span></li>
</ul>
<br />
<ul>
	<li style="font-weight: 400"><span style="font-weight: 400">Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)</span></li>
</ul>
<br />
<ul>
	<li style="font-weight: 400"><span style="font-weight: 400">Fill in the </span><a href="https://forms.gle/BzEZVjXNNP8sYRsT6" rel="nofollow" target="_blank"><span style="font-weight: 400">sponsorship form</span></a><span style="font-weight: 400">, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.</span></li>
</ul>
<br />
<br />
<p><span style="font-weight: 400">If you are a university student and cannot afford the registration fee, you can also sign up for the </span><b>waiting list</b> <a href="https://forms.gle/TpGE6zsZMACyEtQVA" rel="nofollow" target="_blank"><span style="font-weight: 400">here</span></a><span style="font-weight: 400">. (Note that you are not guaranteed to participate by signing up for the waiting list).</span></p>
<br />
<br />
<p><span style="font-weight: 400">You can also find more information about this workshop series,  a schedule of our future workshops as well as a list of our past workshops which you can get the recordings &#038; materials </span><a href="http://bit.ly/3wBeY4S" rel="nofollow" target="_blank"><span style="font-weight: 400">here</span></a><span style="font-weight: 400">.</span></p>
<br />
<p><span style="font-weight: 400">Looking forward to seeing you during the workshop!</span></p>
<br /><hr style="border-top: black solid 1px" /><a href="http://r-posts.com/reactive-shiny-apps-and-deployment-with-google-cloud-run-intermediate-r-shiny-workshop/" rel="nofollow" target="_blank">Reactive Shiny Apps and Deployment with Google Cloud Run: Intermediate R Shiny Workshop</a> was first posted on May 1, 2026 at 11:35 am.<br />
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="http://r-posts.com/reactive-shiny-apps-and-deployment-with-google-cloud-run-intermediate-r-shiny-workshop/"> R-posts.com</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/05/reactive-shiny-apps-and-deployment-with-google-cloud-run-intermediate-r-shiny-workshop/">Reactive Shiny Apps and Deployment with Google Cloud Run: Intermediate R Shiny Workshop</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400956</post-id>	</item>
		<item>
		<title>Closing the Gap in Exposure-Response Data: A Pharmaverse Framework</title>
		<link>https://www.r-bloggers.com/2026/04/closing-the-gap-in-exposure-response-data-a-pharmaverse-framework/</link>
		
		<dc:creator><![CDATA[Jeff Dickinson]]></dc:creator>
		<pubDate>Thu, 30 Apr 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://pharmaverse.github.io/blog/posts/2026-04-17-closing-the-gap-in/closing-the-gap-in-exposure-response-data-a-pharmaverse-framework.html</guid>

					<description><![CDATA[<p>The Missing Standard<br />
CDISC released the Population Pharmacokinetic (PopPK) Implementation Guide in 2023, giving the clinical programming community a clear structural blueprint for PK analysis datasets. But Exposure-Response (ER) modeling — wh...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/closing-the-gap-in-exposure-response-data-a-pharmaverse-framework/">Closing the Gap in Exposure-Response Data: A Pharmaverse Framework</a>]]></description>
<content:encoded><![CDATA[

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://pharmaverse.github.io/blog/posts/2026-04-17-closing-the-gap-in/closing-the-gap-in-exposure-response-data-a-pharmaverse-framework.html"> pharmaverse blog</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
</div>
 





<section id="the-missing-standard" class="level2">
<h2 class="anchored" data-anchor-id="the-missing-standard">The Missing Standard</h2>
<p>CDISC released the Population Pharmacokinetic (PopPK) Implementation Guide in 2023, giving the clinical programming community a clear structural blueprint for PK analysis datasets. But Exposure-Response (ER) modeling — which builds directly on PopPK outputs to characterize relationships between drug exposure, safety, and efficacy — has no equivalent standard.</p>
<p>The result is predictable: different studies, different variable names, different exposure metrics, different dataset structures. Every ER analysis team starts more or less from scratch. That makes cross-study pooling, automation, and programming more difficult than necessary, particularly with ever-quickening turnaround times in drug development.</p>
</section>
<section id="a-framework-built-on-what-we-already-have" class="level2">
<h2 class="anchored" data-anchor-id="a-framework-built-on-what-we-already-have">A Framework Built on What We Already Have</h2>
<p>ER datasets share a lot of structural DNA with PopPK datasets — numeric covariates, relative time variables, pharmacokinetic exposure metrics. That overlap is the starting point for this framework: extending CDISC ADaM principles already established for PopPK into the ER space.</p>
<p>Early discussions are underway with the CDISC ADaM working group about moving this framework forward as a Knowledge Article or Examples Document. The working group has expressed interest in positioning ER datasets as a subclass of ADPPK — grounding the framework within existing CDISC standards architecture and providing a clear lineage from the 2023 PopPK Implementation Guide. Nothing is formalized yet, but the direction is encouraging.</p>
<p>The framework defines four specialized datasets, each targeting a different aspect of ER analysis:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 50%">
<col style="width: 50%">
</colgroup>
<thead>
<tr class="header">
<th>Dataset</th>
<th>Purpose</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong><code>ADER</code></strong></td>
<td>Exposure foundation — comprehensive PK metrics, transformations, and baseline covariates</td>
</tr>
<tr class="even">
<td><strong><code>ADEE</code></strong></td>
<td>Exposure-Efficacy — time-to-event efficacy outcomes linked to drug exposure</td>
</tr>
<tr class="odd">
<td><strong><code>ADES</code></strong></td>
<td>Exposure-Safety — adverse event occurrence, severity, and time-to-onset by exposure</td>
</tr>
<tr class="even">
<td><strong><code>ADTRR</code></strong></td>
<td>Exposure-Tumor Response Rate — categorical tumor response (CR, PR, SD, PD) by exposure</td>
</tr>
</tbody>
</table>
<p>Each dataset builds on standard ADaM datasets (<code>ADSL</code>, <code>ADRS</code>, <code>ADTTE</code>, <code>ADAE</code>, <code>ADLB</code>, <code>ADVS</code>) and incorporates PK parameters from <code>ADPC</code>/<code>ADPP</code>, producing analysis-ready datasets without additional data wrangling.</p>
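<p>As a toy illustration of the grain these datasets target, here is a minimal base-R sketch. The column names (<code>USUBJID</code>, <code>CAVGSS</code>, <code>AVALC</code>, <code>RESPFL</code>) follow common ADaM conventions, but the values and the exact variable set are assumptions for illustration, not the framework's actual specification:</p>

```r
# Hypothetical sketch, not the official framework code: join a toy
# PopPK-style exposure summary onto best-overall-response records to
# form an ADTRR-like, one-row-per-subject analysis dataset.
adpp <- data.frame(USUBJID = c("01", "02", "03"),
                   CAVGSS  = c(12.1, 8.4, 15.0))   # steady-state average conc.
adrs <- data.frame(USUBJID = c("01", "02", "03"),
                   AVALC   = c("PR", "SD", "CR"))  # best overall response

# One row per subject, exposure alongside outcome: the grain an
# exposure-response dataset needs for modeling.
adtrr <- merge(adpp, adrs, by = "USUBJID")
adtrr$RESPFL <- ifelse(adtrr$AVALC %in% c("CR", "PR"), "Y", "N")  # responder flag
adtrr
```

<p>A real derivation would run through <code>{admiral}</code> pipelines against <code>ADSL</code>/<code>ADRS</code> and the PK datasets; this sketch only shows the one-row-per-subject, exposure-plus-outcome shape.</p>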
<p>The framework was presented as paper DS12 at PHUSE US Connect 2026 in Austin, TX. The <a href="https://phuse.s3.eu-central-1.amazonaws.com/Archive/2026/Connect/US/Austin/PAP_DS12.pdf" rel="nofollow" target="_blank">paper</a> and <a href="https://phuse.s3.eu-central-1.amazonaws.com/Archive/2026/Connect/US/Austin/PRE_DS12.pdf" rel="nofollow" target="_blank">slides</a> are now available in the PHUSE archive.</p>
</section>
<section id="why-the-pharmaverse-ecosystem" class="level2">
<h2 class="anchored" data-anchor-id="why-the-pharmaverse-ecosystem">Why the Pharmaverse Ecosystem?</h2>
<p>The framework is implemented using <code>{admiral}</code>, <code>{metacore}</code>, <code>{metatools}</code>, and <code>{xportr}</code> — the same toolchain used across the pharmaverse for ADaM dataset development. That choice was intentional.</p>
<p><code>{admiral}</code>’s modular derivation functions map naturally onto how ER datasets are built incrementally. Its <code>assert_*</code> functions catch errors at the point of derivation rather than burying them downstream. <code>{metacore}</code> keeps specs and code in sync. <code>{metatools}</code> provides utility functions for metadata management and validation. <code>{xportr}</code> handles CDISC compliance at the point of export.</p>
<p>The pharmaverse ecosystem did not just make implementation easier — it made the framework more trustworthy and maintainable. And because it is open-source, every improvement feeds back to the community.</p>
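<p>The fail-fast idea described above can be sketched in plain base R. Note that <code>derive_exposure()</code>, <code>DOSE</code>, and <code>CL</code> are hypothetical names for illustration, and <code>{admiral}</code>'s actual <code>assert_*</code> helpers are richer than a bare <code>stopifnot()</code>:</p>

```r
# Illustrative sketch only, not admiral code: validate inputs at the
# derivation step, so bad data fails here rather than surfacing
# downstream in the model.
derive_exposure <- function(dat) {
  stopifnot(
    is.data.frame(dat),
    all(c("USUBJID", "DOSE", "CL") %in% names(dat)),  # required variables present
    all(dat$CL > 0)                                   # clearance must be positive
  )
  dat$CAVGSS <- dat$DOSE / dat$CL  # steady-state average concentration
  dat
}

ex <- data.frame(USUBJID = "01", DOSE = 100, CL = 5)
derive_exposure(ex)$CAVGSS  # returns 20
```

<p>The design point is the same one the toolchain makes: an assertion at the point of derivation produces an error message naming the offending step, instead of a silently wrong exposure metric three scripts later.</p>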
</section>
<section id="the-examples-page" class="level2">
<h2 class="anchored" data-anchor-id="the-examples-page">The Examples Page</h2>
<p>The working R code is now live on the <a href="https://pharmaverse.github.io/examples/adam/ader.html" rel="nofollow" target="_blank">pharmaverse examples site</a>.</p>
<p>The <code>ADER+</code> page covers all four datasets in a single tabbed page, with:</p>
<ul>
<li>A shared introduction explaining the ER framework and its relationship to the PopPK Implementation Guide</li>
<li>Common derivations used across all four datasets</li>
<li>Dataset-specific derivation code for <code>ADER</code>, <code>ADEE</code>, <code>ADES</code>, and <code>ADTRR</code></li>
<li>Full variable listings and metadata</li>
</ul>
<p>The code uses <code>{pharmaverseadam}</code> as source data, making it immediately reproducible. Think of it as a template — a starting point you can adapt for your own studies.</p>
</section>
<section id="we-need-your-feedback" class="level2">
<h2 class="anchored" data-anchor-id="we-need-your-feedback">We Need Your Feedback</h2>
<p>This framework is a proposal, not a finished standard. Any formal ER ADaM standard would ultimately be owned and ratified by CDISC — the community can propose, pilot, and advocate, but the path to an official standard requires active collaboration with CDISC. The groundwork for that is community validation: pilot testing across therapeutic areas, working group discussion, and real-world use.</p>
<p>That means we need two kinds of input:</p>
<p><strong>Clinical programmers</strong> — try the code. Does the derivation logic hold up? What edge cases are we missing? Open an issue or PR on the <a href="https://github.com/pharmaverse/examples" rel="nofollow" target="_blank">pharmaverse examples repository</a>.</p>
<p><strong>ER modelers and pharmacometricians</strong> — this one is especially for you. Does this dataset structure actually serve your modeling needs? Are the exposure metrics the right ones? Is the dataset grain appropriate for the analyses you run? You are the end users of these datasets, and your perspective is exactly what’s needed to make this framework scientifically sound, not just technically compliant.</p>
<p>The discussion is open. Let’s keep it going.</p>
</section>



<div id="quarto-appendix" class="default"><section id="last-updated" class="level2 appendix"><h2 class="anchored quarto-appendix-heading">Last updated</h2><div class="quarto-appendix-contents">

<p>2026-05-04 12:40:51.777961</p>
</div></section><section id="details" class="level2 appendix"><h2 class="anchored quarto-appendix-heading">Details</h2><div class="quarto-appendix-contents">

<p><a href="https://github.com/pharmaverse/blog/tree/main/posts/2026-04-17-closing-the-gap-in/closing-the-gap-in-exposure-response-data-a-pharmaverse-framework.qmd" rel="nofollow" target="_blank">Source</a>, <a href="https://pharmaverse.github.io/blog/session_info.html" rel="nofollow" target="_blank">Session info</a></p>
</div></section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="nofollow" href="https://creativecommons.org/licenses/by/4.0/" target="_blank">CC BY 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre>@online{dickinson2026,
  author = {Dickinson, Jeff},
  title = {Closing the {Gap} in {Exposure-Response} {Data:} {A}
    {Pharmaverse} {Framework}},
  date = {2026-04-30},
  url = {https://pharmaverse.github.io/blog/posts/2026-04-17-closing-the-gap-in/closing-the-gap-in-exposure-response-data-a-pharmaverse-framework.html},
  langid = {en}
}
</pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-dickinson2026" class="csl-entry quarto-appendix-citeas">
Dickinson, Jeff. 2026. <span>“Closing the Gap in Exposure-Response Data:
A Pharmaverse Framework.”</span> April 30, 2026. <a href="https://pharmaverse.github.io/blog/posts/2026-04-17-closing-the-gap-in/closing-the-gap-in-exposure-response-data-a-pharmaverse-framework.html" rel="nofollow" target="_blank">https://pharmaverse.github.io/blog/posts/2026-04-17-closing-the-gap-in/closing-the-gap-in-exposure-response-data-a-pharmaverse-framework.html</a>.
</div></div></section></div> 
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://pharmaverse.github.io/blog/posts/2026-04-17-closing-the-gap-in/closing-the-gap-in-exposure-response-data-a-pharmaverse-framework.html"> pharmaverse blog</a></strong>.</div>
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/closing-the-gap-in-exposure-response-data-a-pharmaverse-framework/">Closing the Gap in Exposure-Response Data: A Pharmaverse Framework</a>]]></content:encoded>
					
		
		<enclosure url="https://pharmaverse.github.io/blog/posts/2026-04-17-closing-the-gap-in/pharmaverse_examples.png" length="0" type="image/png" />

		<post-id xmlns="com-wordpress:feed-additions:1">401018</post-id>	</item>
		<item>
		<title>rOpenSci News Digest, April 2026</title>
		<link>https://www.r-bloggers.com/2026/04/ropensci-news-digest-april-2026/</link>
		
		<dc:creator><![CDATA[rOpenSci]]></dc:creator>
		<pubDate>Thu, 30 Apr 2026 00:00:00 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://ropensci.org/blog/2026/04/30/news-april-2026/</guid>

					<description><![CDATA[<p>Dear rOpenSci friends, it’s time for our monthly news roundup!  You can read this post on our blog. Now let’s dive into the activity at and around rOpenSci!</p>
<p>Tomáš Kalibera (1978–2026)<br />
The rOpenSci team is deeply saddened at the los...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/ropensci-news-digest-april-2026/">rOpenSci News Digest, April 2026</a>]]></description>
<content:encoded><![CDATA[

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://ropensci.org/blog/2026/04/30/news-april-2026/"> rOpenSci - open tools for open science</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issues about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
</div>

<p>Dear rOpenSci friends, it’s time for our monthly news roundup! You can read this post <a href="https://ropensci.org/blog/2026/04/30/news-april-2026" rel="nofollow" target="_blank">on our blog</a>. Now let’s dive into the activity at and around rOpenSci!</p>
<h2>
Tomáš Kalibera (1978–2026)
</h2><p>The rOpenSci team is deeply saddened at the loss of Tomáš Kalibera, a member of R-Core and contributor to the R community for almost 10 years. Tomáš passed away on April 1.  Our thoughts are with Tomáš’s friends and family at this time. </p>
<p>Read <a href="https://prl-prg.github.io/tomas-kalibera.html" rel="nofollow" target="_blank">Jan Vitek’s remembrance of Tomáš</a>.</p>
<h2>
rOpenSci HQ
</h2><h3>
New editors Alec Robitaille and Lucy D’Agostino McGowan
</h3><p>We’re excited to welcome <a href="https://ropensci.org/author/alec-robitaille/" rel="nofollow" target="_blank">Alec Robitaille</a> and <a href="https://ropensci.org/author/lucy-dagostino-mcgowan/" rel="nofollow" target="_blank">Lucy D’Agostino McGowan</a> as new editors. Alec joins our general review team, and Lucy our statistical software review team. Read more in the <a href="https://ropensci.org/blog/2026/04/16/editors2026/" rel="nofollow" target="_blank">post introducing them</a>!</p>
<h3>
Champions Program update
</h3><p>We’re excited to share that we’ve finished selecting the new cohort for the rOpenSci Champions Program! This was not an easy process. The quality of the proposals was exceptionally high, which made the selection both challenging and inspiring. We’re grateful to everyone who applied and shared their ideas with us. Please join us in welcoming our new Champions: Bastián Olea Herrera, Durga Valentina Linares Herrera, José Daniel Conejeros, Denisse Fierro Arcos, Evelia Lorena Coss Navarrete, Gladys Choque Ulloa, Linda Jazmín Cabrera Orellana, Patricia Andrea Loto, Marina Cecilia Cock, María Florencia Tames, and Estefanía Torrejón.</p>
<p>Over the coming months, they will contribute to the R Community through developing new packages, reviewing packages, and submitting packages for peer review. We’re looking forward to working with this amazing group and supporting their projects!</p>
<h3>
Collaborating between Bioconductor and R-universe on Development of Common Infrastructure
</h3><p>Bioconductor is collaborating with R-universe to gradually modernize parts of its infrastructure, while accommodating the project’s scale, governance, and established processes. In turn, Bioconductor is helping R-universe expand and refine its features as we learn to serve the complex needs of the Bioconductor community. Read more in the <a href="https://ropensci.org/blog/2026/04/08/r-universe-bioc/" rel="nofollow" target="_blank">blog post</a>.</p>
<h3>
rOpenSci Staff presentations
</h3><h4>
Yanina Bellini Saibene at R/Medicine 2026
</h4><p>Yani will deliver her keynote talk <a href="https://ropensci.org/events/r-medicine-2026-keynote/" rel="nofollow" target="_blank">“Software Sustainability and Community Management”</a> on Thursday May 7th, 11:15AM–12:15PM ET.</p>
<h4>
Jeroen Ooms at “Where Do R Packages Live?”
</h4><p>Jeroen will take part in an online discussion panel on Wednesday 20 May at 5:00 PM &#8211; 6:00 PM (AEST). The panel is organized by the Statistical Computing and Visualisation section of the Statistical Society of Australia (<a href="https://statsoc.org.au/event-6653060" rel="nofollow" target="_blank">details</a>).</p>
<h3>
Updates to the goodpractice R package
</h3><p>We have long recommended the <a href="https://docs.ropensci.org/goodpractice" rel="nofollow" target="_blank">goodpractice package</a>, which identifies issues with R packages, and advises how to fix them. Thanks to a huge amount of work by valued community member <a href="https://ropensci.org/author/athanasia-mo-mowinckel/" rel="nofollow" target="_blank">Athanasia Mo Mowinckel</a>, goodpractice has been extended and improved to include entirely new suites of checks, and improved ability to control which checks are run. A blog post describing the updates will be published soon, but in the meantime, we encourage you to install the current <a href="https://docs.ropensci.org/goodpractice/#installation" rel="nofollow" target="_blank">development version</a> and try it out yourself.</p>
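<p>A minimal sketch of trying the development version, assuming you have the <code>{remotes}</code> package available (see the goodpractice README for the canonical installation instructions):</p>

```r
# Install the development version from GitHub and run the checks on a
# local package source directory. gp() is goodpractice's main entry
# point; the path below is a placeholder for your own package.
# install.packages("remotes")
remotes::install_github("ropensci/goodpractice")
library(goodpractice)
gp("path/to/your/package")  # prints advice on issues it finds
```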
<h3>
Analyse your targets pipeline
</h3><p>Following our <a href="https://ropensci.org/blog/2026/04/02/tree-sitter-overview/" rel="nofollow" target="_blank">blog post about tree-sitter</a>, Tyler Morgan Wall was <a href="https://news.ycombinator.com/item?id=47801899" rel="nofollow" target="_blank">inspired</a> to create a <a href="https://github.com/tylermorganwall/tarborist" rel="nofollow" target="_blank">static analysis tool for targets pipelines (as a VS Code extension)</a>.</p>
<h3>
Coworking
</h3><p>Read <a href="https://ropensci.org/blog/2023/06/21/coworking/" rel="nofollow" target="_blank">all about coworking</a>!</p>
<ul>
<li>Tuesday May 5th 2026, 9:00 Australia Western (01:00 UTC) <a href="https://ropensci.org/events/coworking-2026-05/" rel="nofollow" target="_blank">“Code Review with rOpenSci”</a> with <a href="https://ropensci.org/author/steffi-lazerte/" rel="nofollow" target="_blank">Steffi LaZerte</a> and cohost <a href="https://ropensci.org/author/liz-hare/" rel="nofollow" target="_blank">Liz Hare</a>.
<ul>
<li>Explore resources for Code Review</li>
<li>Sign up to volunteer to do <a href="https://airtable.com/app8dssb6a7PG6Vwj/shrnfDI2S9uuyxtDw" rel="nofollow" target="_blank">software peer-review</a> at rOpenSci</li>
<li>Meet cohost, Liz Hare, and discuss resources for Code Review with rOpenSci.</li>
</ul>
</li>
<li>Tuesday June 2nd 2026, 14:00 Europe Central (12:00 UTC) [theme to be determined], with <a href="https://ropensci.org/author/steffi-lazerte/" rel="nofollow" target="_blank">Steffi LaZerte</a> and cohost to be determined.
<ul>
<li>Explore resources related to the theme</li>
<li>Meet the cohost, and other attendees, and discuss the theme or other topics.</li>
</ul>
</li>
</ul>
<p>And remember, you can always cowork independently on work related to R, work on packages that tend to be neglected, or work on whatever you need to get done!</p>
<h3>
Editors’ Office Hours
</h3><p>We are exploring hosting a new event, Editors’ Office Hours, where you can drop in to ask questions about rOpenSci Software Peer Review on or near the third Tuesday of each month, alternating among timezones to accommodate different parts of the world.</p>
<p>Upcoming office hours:</p>
<ul>
<li>Tuesday May 19, 16:00-17:00 Europe Central (14:00-15:00 UTC) (<a href="https://ropensci.org/events/office-hours-2026-05/" rel="nofollow" target="_blank">event</a>)</li>
</ul>
<h3>
useR! 2026 Diversity Scholarship Program
</h3><p>useR! 2026 is offering diversity scholarships to support participation from people in underrepresented or historically marginalized groups within the R community. The program includes both registration fee waivers and full needs-based scholarships, which cover conference registration as well as travel and lodging (via reimbursement). Applications are open to eligible participants worldwide <strong>until May 10</strong>, and will be reviewed based on need, eligibility, and potential impact by a committee from Forwards, RLadies+ Global, and rOpenSci.</p>
<p>Find all the details and important links on the conference website: <a href="https://user2026.r-project.org/additional/diversity_scholarship.html" rel="nofollow" target="_blank">https://user2026.r-project.org/additional/diversity_scholarship.html</a></p>
<h2>
Software <img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f4e6.png" alt="📦" class="wp-smiley" style="height: 1em; max-height: 1em;" />
</h2><h3>
New packages
</h3><p>The following package recently became a part of our software suite:</p>
<ul>
<li><a href="https://docs.ropensci.org/reviser" rel="nofollow" target="_blank">reviser</a>, developed by Marc Burri together with Philipp Wegmueller: Analyzes revisions in real-time time series vintages. The package converts between wide revision triangles and tidy long vintages, extracts selected releases, computes revision series, visualizes vintage paths, and summarizes revision properties such as bias, dispersion, autocorrelation, and news-noise diagnostics. It also identifies efficient releases and estimates state-space models for revision nowcasting. Methods are based on Howrey (1978) <a href="https://doi.org/10.2307/1924972" rel="nofollow" target="_blank">https://doi.org/10.2307/1924972</a>, Jacobs and Van Norden (2011) <a href="https://doi.org/10.1016/j.jeconom.2010.04.010" rel="nofollow" target="_blank">https://doi.org/10.1016/j.jeconom.2010.04.010</a>, and Kishor and Koenig (2012) <a href="https://doi.org/10.1198/jbes.2010.08169" rel="nofollow" target="_blank">https://doi.org/10.1198/jbes.2010.08169</a>. It has been <a href="https://github.com/ropensci/software-review/issues/709" rel="nofollow" target="_blank">reviewed</a>.</li>
</ul>
<p>Discover <a href="https://ropensci.org/packages" rel="nofollow" target="_blank">more packages</a>, read more about <a href="https://ropensci.org/software-review" rel="nofollow" target="_blank">Software Peer Review</a>.</p>
<h3>
New versions
</h3><p>The following ten packages have had an update since the last newsletter: <a href="https://docs.ropensci.org/osmextract" title="Download and Import Open Street Map Data Extracts" rel="nofollow" target="_blank">osmextract</a> (<a href="https://github.com/ropensci/osmextract/releases/tag/v0.6.0" rel="nofollow" target="_blank"><code>v0.6.0</code></a>), <a href="https://docs.ropensci.org/Athlytics" title="Academic R Package for Sports Physiology Analysis from Local Strava Data" rel="nofollow" target="_blank">Athlytics</a> (<a href="https://github.com/ropensci/Athlytics/releases/tag/v1.0.5" rel="nofollow" target="_blank"><code>v1.0.5</code></a>), <a href="https://docs.ropensci.org/emodnet.wfs" title="Access EMODnet Web Feature Service Data" rel="nofollow" target="_blank">emodnet.wfs</a> (<a href="https://github.com/EMODnet/emodnet.wfs/releases/tag/v2.1.2" rel="nofollow" target="_blank"><code>v2.1.2</code></a>), <a href="https://docs.ropensci.org/fellingdater" title="Tree-ring dating and estimating felling dates of historical timbers" rel="nofollow" target="_blank">fellingdater</a> (<a href="https://github.com/ropensci/fellingdater/releases/tag/v1.2.1" rel="nofollow" target="_blank"><code>v1.2.1</code></a>), <a href="https://docs.ropensci.org/readODS" title="Read and Write ODS Files" rel="nofollow" target="_blank">readODS</a> (<a href="https://github.com/ropensci/readODS/releases/tag/v2.3.5" rel="nofollow" target="_blank"><code>v2.3.5</code></a>), <a href="https://docs.ropensci.org/git2rdata" title="Store and Retrieve Data.frames in a Git Repository" rel="nofollow" target="_blank">git2rdata</a> (<a href="https://github.com/ropensci/git2rdata/releases/tag/v0.5.2" rel="nofollow" target="_blank"><code>v0.5.2</code></a>), <a href="https://docs.ropensci.org/weatherOz" title="An API Client for Australian Weather and Climate Data Resources" rel="nofollow" target="_blank">weatherOz</a> (<a href="https://github.com/ropensci/weatherOz/releases/tag/v3.0.0" rel="nofollow" 
target="_blank"><code>v3.0.0</code></a>), <a href="https://docs.ropensci.org/promoutils" title="Utilities for Promoting rOpenSci on Social Media" rel="nofollow" target="_blank">promoutils</a> (<a href="https://github.com/ropensci-org/promoutils/releases/tag/v0.5.0" rel="nofollow" target="_blank"><code>v0.5.0</code></a>), <a href="https://docs.ropensci.org/allcontributors" title="Acknowledge all Contributors to a Project" rel="nofollow" target="_blank">allcontributors</a> (<a href="https://github.com/ropensci/allcontributors/releases/tag/v0.2.3" rel="nofollow" target="_blank"><code>v0.2.3</code></a>), and <a href="https://docs.ropensci.org/reviser" title="Analyzing Revisions in Real-Time Time Series Vintages" rel="nofollow" target="_blank">reviser</a> (<a href="https://github.com/ropensci/reviser/releases/tag/v0.1.1" rel="nofollow" target="_blank"><code>v0.1.1</code></a>).</p>
<p>The writexl package has a <a href="https://github.com/ropensci/writexl/pull/98#issuecomment-4191858158" rel="nofollow" target="_blank">new maintainer</a>, Bill Denney. NLMR is now maintained by <a href="https://github.com/ropensci/NLMR/issues/116#issuecomment-4280937012" rel="nofollow" target="_blank">Jakub Nowosad</a>.</p>
<h2>
Software Peer Review
</h2><p>There are eighteen recently closed and active submissions and four submissions on hold. Issues are at different stages:</p>
<ul>
<li>
<p>One at <a href="https://github.com/ropensci/software-review/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc+label%3A%226/approved%22" rel="nofollow" target="_blank">‘6/approved’</a>:</p>
<ul>
<li><a href="https://github.com/ropensci/software-review/issues/709" rel="nofollow" target="_blank">reviser</a>, Tools for Studying Revision Properties in Real-Time Time Series Vintages. Submitted by <a href="https://marcburri.github.io/" rel="nofollow" target="_blank">Marc Burri</a>. (Stats).</li>
</ul>
</li>
<li>
<p>Two at <a href="https://github.com/ropensci/software-review/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc+label%3A%225/awaiting-reviewer(s)-response%22" rel="nofollow" target="_blank">‘5/awaiting-reviewer(s)-response’</a>:</p>
<ul>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/754" rel="nofollow" target="_blank">saperlipopette</a>, Create Example Git Messes. Submitted by <a href="https://masalmon.eu/" rel="nofollow" target="_blank">Maëlle Salmon</a>.</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/671" rel="nofollow" target="_blank">pkgmatch</a>, Find R Packages Matching Either Descriptions or Other R Packages. Submitted by <a href="https://mpadge.github.io/" rel="nofollow" target="_blank">mark padgham</a>.</p>
</li>
</ul>
</li>
<li>
<p>Four at <a href="https://github.com/ropensci/software-review/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc+label%3A%224/review(s)-in-awaiting-changes%22" rel="nofollow" target="_blank">‘4/review(s)-in-awaiting-changes’</a>:</p>
<ul>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/760" rel="nofollow" target="_blank">pvEBayes</a>, Empirical Bayes Methods for Pharmacovigilance. Submitted by <a href="https://github.com/YihaoTancn" rel="nofollow" target="_blank">Yihao Tan</a>. (Stats).</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/741" rel="nofollow" target="_blank">logolink</a>, An Interface for Running NetLogo Simulations. Submitted by <a href="http://danielvartan.com/" rel="nofollow" target="_blank">Daniel Vartanian</a>.</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/732" rel="nofollow" target="_blank">ActiGlobe</a>, Wearable Recording Processor for Time Shift Adjustment and Data Analysis. Submitted by <a href="https://scholar.google.ca/citations?user=T7y9ckwAAAAJ&#038;hl=en" rel="nofollow" target="_blank">C. William Yao</a>.</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/615" rel="nofollow" target="_blank">galamm</a>, Generalized Additive Latent and Mixed Models. Submitted by <a href="https://osorensen.github.io/" rel="nofollow" target="_blank">Øystein Sørensen</a>. (Stats).</p>
</li>
</ul>
</li>
<li>
<p>Five at <a href="https://github.com/ropensci/software-review/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc+label%3A%223/reviewer(s)-assigned%22" rel="nofollow" target="_blank">‘3/reviewer(s)-assigned’</a>:</p>
<ul>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/750" rel="nofollow" target="_blank">nycOpenData</a>, Convenient Access to NYC Open Data API Endpoints. Submitted by <a href="https://github.com/martinezc1" rel="nofollow" target="_blank">Christian Martinez</a>.</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/743" rel="nofollow" target="_blank">RAMEN</a>, RAMEN: Regional Association of Methylome variability with the Exposome and geNome. Submitted by <a href="https://erick-navarrodelgado.netlify.app/" rel="nofollow" target="_blank">Erick Navarro-Delgado</a>.</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/730" rel="nofollow" target="_blank">ernest</a>, A Toolkit for Nested Sampling. Submitted by <a href="https://github.com/kylesnap" rel="nofollow" target="_blank">Kyle Dewsnap</a>. (Stats).</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/718" rel="nofollow" target="_blank">rcrisp</a>, Automate the Delineation of Urban River Spaces. Submitted by <a href="https://github.com/cforgaci" rel="nofollow" target="_blank">Claudiu Forgaci</a>. (Stats).</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/704" rel="nofollow" target="_blank">priorsense</a>, Prior Diagnostics and Sensitivity Analysis. Submitted by <a href="https://github.com/n-kall" rel="nofollow" target="_blank">Noa Kallioinen</a>. (Stats).</p>
</li>
</ul>
</li>
<li>
<p>Three at <a href="https://github.com/ropensci/software-review/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc+label%3A%222/seeking-reviewer(s)%22" rel="nofollow" target="_blank">‘2/seeking-reviewer(s)’</a>:</p>
<ul>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/763" rel="nofollow" target="_blank">EpiStrainDynamics</a>, Infer temporal trends of multiple pathogens. Submitted by <a href="http://www.smwindecker.com/" rel="nofollow" target="_blank">Saras Windecker</a>. (Stats).</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/762" rel="nofollow" target="_blank">lakefetch</a>, Calculate Fetch and Wave Exposure for Lake Sampling Points. Submitted by <a href="https://github.com/jeremylfarrell" rel="nofollow" target="_blank">jeremylfarrell</a>.</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/740" rel="nofollow" target="_blank">fcmconfr</a>, Fuzzy Cognitive Map Analysis in R. Submitted by <a href="https://github.com/bhroston" rel="nofollow" target="_blank">benroston</a>. (Stats).</p>
</li>
</ul>
</li>
<li>
<p>Three at <a href="https://github.com/ropensci/software-review/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc+label%3A%221/editor-checks%22" rel="nofollow" target="_blank">‘1/editor-checks’</a>:</p>
<ul>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/765" rel="nofollow" target="_blank">ciecl</a>, International Classification of Diseases ICD-10/ICD-11 for Chile. Submitted by <a href="https://github.com/Rodotasso" rel="nofollow" target="_blank">Rodolfo Tasso</a>.</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/744" rel="nofollow" target="_blank">RAQSAPI</a>, A Simple Interface to the US EPA Air Quality System Data Mart API. Submitted by <a href="https://github.com/mccroweyclinton-EPA" rel="nofollow" target="_blank">mccroweyclinton-EPA</a>.</p>
</li>
<li>
<p><a href="https://github.com/ropensci/software-review/issues/717" rel="nofollow" target="_blank">coevolve</a>, Fit Bayesian Generalized Dynamic Phylogenetic Models using Stan. Submitted by <a href="https://scottclaessens.github.io/" rel="nofollow" target="_blank">Scott Claessens</a>. (Stats).</p>
</li>
</ul>
</li>
</ul>
<p>Find out more about <a href="https://ropensci.org/software-review" rel="nofollow" target="_blank">Software Peer Review</a> and how to get involved.</p>
<h2>
On the blog
</h2>
<h3>
Software Review
</h3><ul>
<li>
<p><a href="https://ropensci.org/blog/2026/04/13/reviser" rel="nofollow" target="_blank">reviser: Analyzing Real-Time Data Revisions in R</a> by Marc Burri. A short introduction to reviser for analyzing real-time data vintages and revisions in R.</p>
</li>
<li>
<p><a href="https://ropensci.org/blog/2026/04/16/editors2026" rel="nofollow" target="_blank">Expanding the Editorial Team: Alec Robitaille and Lucy D’Agostino McGowan Join as Editors</a> by Alec Robitaille, Lucy D’Agostino McGowan, and Yanina Bellini Saibene. Introducing two new editors for rOpenSci statistical software peer review.</p>
</li>
<li>
<p><a href="https://ropensci.org/blog/2026/04/08/r-universe-bioc" rel="nofollow" target="_blank">Collaborating between Bioconductor and R-universe on Development of Common Infrastructure</a> by The rOpenSci Team and The Bioconductor Team. R-consortium ISC top-level project R-universe is working with Bioconductor to help gradually modernize parts of its infrastructure, while accommodating the project’s scale, governance, and established processes.</p>
</li>
</ul>
<h3>
Tech Notes
</h3><ul>
<li><a href="https://ropensci.org/blog/2026/04/02/tree-sitter-overview" rel="nofollow" target="_blank">A Better R Programming Experience Thanks to Tree-sitter</a> by Maëlle Salmon. Modern tooling for parsing, searching, formatting, editing R code, just like for other programming languages.</li>
</ul>
<h2>
Calls for contributions
</h2><h3>
Calls for maintainers
</h3><p>If you’re interested in maintaining any of the R packages below, you might enjoy reading our blog post <a href="https://ropensci.org/blog/2023/02/07/what-does-it-mean-to-maintain-a-package/" rel="nofollow" target="_blank">What Does It Mean to Maintain a Package?</a>.</p>
<ul>
<li>
<p><a href="https://docs.ropensci.org/landscapetools" rel="nofollow" target="_blank">landscapetools</a>, R package for some of the less-glamorous tasks involved in landscape analysis. <a href="https://github.com/ropensci/landscapetools/issues/48" rel="nofollow" target="_blank">Issue for volunteering</a>.</p>
</li>
<li>
<p><a href="https://docs.ropensci.org/hddtools" rel="nofollow" target="_blank">hddtools</a>, Tools to discover hydrological data, accessing catalogues and databases from various data providers. <a href="https://github.com/ropensci/hddtools/issues/36" rel="nofollow" target="_blank">Issue for volunteering</a>.</p>
</li>
<li>
<p><a href="https://docs.ropensci.org/qualtRics/" rel="nofollow" target="_blank">qualtRics</a>, download Qualtrics survey data. <a href="https://github.com/ropensci/qualtRics/issues/383" rel="nofollow" target="_blank">Issue for volunteering</a>.</p>
</li>
</ul>
<h3>
Calls for contributions
</h3><p>Refer to our <a href="https://ropensci.org/help-wanted/" rel="nofollow" target="_blank">help wanted page</a> – before opening a PR, we recommend asking in the issue whether help is still needed.</p>
<h2>
Package development corner
</h2><p>Some useful information for R package developers. <img src="https://s.w.org/images/core/emoji/13.0.0/72x72/1f440.png" alt="👀" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<h3>
Useless code, duplicated function? Refactoring with Jarl
</h3><p>The Jarl CLI by Étienne Bacher received several useful new features for package developers:</p>
<ul>
<li><a href="https://jarl.etiennebacher.com/rules/unused_function" rel="nofollow" target="_blank"><code>unused_function</code></a></li>
<li><a href="https://jarl.etiennebacher.com/rules/duplicated_function_definition" rel="nofollow" target="_blank"><code>duplicated_function_definition</code></a>.</li>
</ul>
<p>They are a nice complement to <a href="https://jarl.etiennebacher.com/rules/unreachable_code" rel="nofollow" target="_blank"><code>unreachable_code</code></a>.</p>
<p>Read more in the <a href="https://www.etiennebacher.com/posts/2026-03-23-jarl-0.5.0/#other-features-for-package-developers" rel="nofollow" target="_blank">release announcement</a>.</p>
<h3>
Git commands to get to know a project
</h3><p>Ally Piechowski wrote an insightful post entitled <a href="https://piechowski.io/post/git-commands-before-reading-code/" rel="nofollow" target="_blank">“The Git Commands I Run Before Reading Any Code”</a>, which suggests Git commands useful for understanding a code base. For instance, a command to determine which files recently changed the most!</p>
<p>To complement this post, Garrick Aden-Buie wrote <a href="https://gist.github.com/gadenbuie/463ff1e9f3b0f48cddc44db2224d286b" rel="nofollow" target="_blank">“a little <code>git-recon</code> bash script that runs them in series, complete with some ascii bar plots”</a>.</p>
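<p>As a taste of the approach, here is one common idiom for finding the files that have changed most often in a repository’s history (a generic example, not necessarily the exact command from the post):</p>

<figure class="highlight"><pre># count how many commits touched each file, most-changed first
git log --name-only --pretty=format: | grep -v '^$' | sort | uniq -c | sort -rn | head</pre></figure>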
<h3>
Another R mailing list archive
</h3><p>Last month we listed several ways to <a href="https://ropensci.org/blog/2026/03/30/news-mars-2026/#how-to-browse-the-r-mailing-lists" rel="nofollow" target="_blank">browse the R mailing lists</a>. Newsletter reader Florian Kohrt kindly wrote to us to mention the <a href="https://github.com/MichaelChirico/r-mailing-list-archive" rel="nofollow" target="_blank">plain text backup</a> maintained by Michael Chirico.</p>
<h3>
Will R run out of random seeds? Useful seed explainer
</h3><p>Andrew Heiss published a useful and interesting deep dive into <a href="https://www.andrewheiss.com/blog/2026/04/13/seeds-predetermined-universes/" rel="nofollow" target="_blank">random seeds</a>.</p>
<h3>
Enforcing the coalesce operator
</h3><p>Are you enjoying the coalesce operator <code>%||%</code> introduced in <a href="https://cran.r-project.org/bin/windows/base/old/4.4.0/NEWS.R-4.4.0.html" rel="nofollow" target="_blank">R 4.4.0</a>?</p>
<blockquote>
<p><code>L %||% R</code> newly in base is an expressive idiom for the phrases <code>if(!is.null(L)) L else R</code> or <code>if(is.null(L)) R else L</code>.</p>
</blockquote>
<p>Consider enforcing it via <a href="https://jarl.etiennebacher.com/rules/coalesce" rel="nofollow" target="_blank">Jarl</a> or <a href="https://lintr.r-lib.org/reference/coalesce_linter.html" rel="nofollow" target="_blank">lintr</a>.</p>
<p>As a reminder, the operator can be used in older versions of R through the <a href="https://github.com/r-lib/backports/pull/81/changes" rel="nofollow" target="_blank">backports</a> R package or by importing it from <a href="https://rlang.r-lib.org/reference/op-null-default.html" rel="nofollow" target="_blank">rlang</a>.</p>
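<p>As a quick illustration (base R 4.4.0 or later, or with <code>%||%</code> imported from rlang):</p>

<figure class="highlight"><pre>x &lt;- NULL
y &lt;- 42

x %||% y  # x is NULL, so the right-hand side is used: 42
y %||% 0  # y is not NULL, so it is returned unchanged: 42</pre></figure>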
<h3>
Positive AI reading
</h3><p>If you’re feeling some AI dread, you might enjoy:</p>
<ul>
<li>The blog post <a href="https://vickiboykis.com/2026/03/04/antidote/" rel="nofollow" target="_blank">“Antidote”</a> by Vicky Boykis;</li>
<li>The talk <a href="https://www.youtube.com/watch?v=5kTxZMSB9oo&#038;t" rel="nofollow" target="_blank">“Is my degree worthless”</a> by Davis Vaughan.</li>
</ul>
<h2>
Last words
</h2><p>Thanks for reading! If you want to get involved with rOpenSci, check out our <a href="https://contributing.ropensci.org/" rel="nofollow" target="_blank">Contributing Guide</a> that can help direct you to the right place, whether you want to make code contributions, non-code contributions, or contribute in other ways like sharing use cases. You can also support our work through <a href="https://ropensci.org/donate" rel="nofollow" target="_blank">donations</a>.</p>
<p>If you haven’t subscribed to our newsletter yet, you can <a href="https://ropensci.org/news/" rel="nofollow" target="_blank">do so via a form</a>. Until it’s time for our next newsletter, you can keep in touch with us via our <a href="https://ropensci.org/" rel="nofollow" target="_blank">website</a> and <a href="https://hachyderm.io/@rOpenSci" rel="nofollow" target="_blank">Mastodon account</a>.</p>
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://ropensci.org/blog/2026/04/30/news-april-2026/"> rOpenSci - open tools for open science</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/ropensci-news-digest-april-2026/">rOpenSci News Digest, April 2026</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400938</post-id>	</item>
		<item>
		<title>logrittr: A Verbose Pipe Operator for Logging dplyr Pipelines</title>
		<link>https://www.r-bloggers.com/2026/04/logrittr-a-verbose-pipe-operator-for-logging-dplyr-pipelines-2/</link>
		
		<dc:creator><![CDATA[Guillaume Pressiat]]></dc:creator>
		<pubDate>Wed, 29 Apr 2026 08:05:55 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://guillaumepressiat.github.io/blog/2026/04/logrittr-re</guid>

					<description><![CDATA[<div style = "width:60%; display: inline-block; float:left; ">
<p>  dplyr verbs are descriptive: let’s make them more verbose!</p>
<p>  Yet another pipe for R.</p>
<p>Repost for better image handling on r-bloggers.</p>
<p>Motivation</p>
<p>In SAS, every DATA step prints a log:</p>
<p>NOTE: There were 120000 observations read f...</p></div>
<div style = "width: 40%; display: inline-block; float:right;"></div>
<div style="clear: both;"></div>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/logrittr-a-verbose-pipe-operator-for-logging-dplyr-pipelines-2/">logrittr: A Verbose Pipe Operator for Logging dplyr Pipelines</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="https://guillaumepressiat.github.io/blog/2026/04/logrittr-re"> Guillaume Pressiat</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<p><a href="https://github.com/guillaumepressiat/logrittr" rel="nofollow" target="_blank">
<img src="https://i2.wp.com/github.com/GuillaumePressiat/logrittr/raw/main/man/figures/logo.png?w=15%25&#038;ssl=1" style="float:right;padding-bottom: 20px;padding-right:30%" data-recalc-dims="1" />
</a></p>

<p><br /></p>

<blockquote>
  <p>dplyr verbs are descriptive: let’s make them more verbose!</p>
</blockquote>

<blockquote>
  <p>Yet another pipe for R.</p>
</blockquote>

<p><br />
<br /></p>

<span id="more-400890"></span>

<p><em>Repost for better image handling on r-bloggers.</em></p>

<p><img src="https://i2.wp.com/guillaumepressiat.github.io/images/pastels_example.png?w=578&#038;ssl=1" data-recalc-dims="1" /></p>

<p><br />
<br /></p>

<h2 id="motivation">Motivation</h2>

<p>In SAS, every DATA step prints a log:</p>

<figure class="highlight"><pre>NOTE: There were 120000 observations read from WORK.SALES.
NOTE: 7153 observations were deleted.
NOTE: The data set WORK.RESULT has 112847 observations and 11 variables.</pre></figure>

<p>R’s <code>dplyr</code> pipelines are silent. <code>logrittr</code> fills that gap with <code>%&gt;=%</code>, a
drop-in pipe that logs row counts, column counts, added/dropped columns, and
timing at every step, with no function masking.</p>

<p>With <a href="https://github.com/tonsky/FiraCode" rel="nofollow" target="_blank">Fira Code</a> ligatures, <code>%&gt;=%</code> renders
as a single wide arrow visually similar to <code>%&gt;%</code> with an underline added, like a subtitle: a way to read between the lines of a pipeline and see what happened.</p>

<h2 id="multiples-contexts">Multiple contexts</h2>

<p>Things happen:</p>

<figure class="highlight"><pre>NOTE: There were 120000 observations read from WORK.SALES.
NOTE: 120000 observations were deleted.
NOTE: The data set WORK.RESULT has 0 observations and 11 variables.</pre></figure>

<p>“This is where we lost all the rows during script execution.”</p>

<h4 id="pro">Pro</h4>

<p>Reading the log long after a script has run helps you:</p>

<ul>
  <li>see what happened at each stage of data processing without having to rerun the code, for example in a production environment where the input data is constantly changing</li>
  <li>monitor key processes</li>
  <li>explain what happened (for an audit, for example)</li>
</ul>

<p>This is often needed in professional contexts.</p>

<h4 id="educational">Educational</h4>

<p>A console log also makes pipelines clearer for those with little experience with 
the tidyverse: people taking their first steps in programming by following a tutorial or teaching themselves.</p>

<h2 id="installation">Installation</h2>

<figure class="highlight"><pre>install.packages('logrittr', repos = 'https://guillaumepressiat.r-universe.dev')

# or from github
# devtools::install_github(&quot;GuillaumePressiat/logrittr&quot;)</pre></figure>

<p>See <a href="https://github.com/guillaumepressiat/logrittr" rel="nofollow" target="_blank">github</a> or <a href="https://guillaumepressiat.r-universe.dev/logrittr" rel="nofollow" target="_blank">r-universe</a>.</p>

<h2 id="usage">Usage</h2>

<figure class="highlight"><pre>library(logrittr)
library(dplyr)

iris %&gt;=%
  as_tibble() %&gt;=%
  filter(Sepal.Length &lt; 5)  %&gt;=%
  mutate(rn = row_number()) %&gt;=%
  semi_join(
    iris %&gt;% as_tibble() %&gt;=%
      filter(Species == &quot;setosa&quot;),
    by = &quot;Species&quot;
  )  %&gt;=%
  group_by(Species) %&gt;=%
  summarise(n = n_distinct(rn))</pre></figure>

<figure class="highlight"><pre>── iris  [rows:       150  cols:    5] ─────────────────────────────────────────────────────
&#x2139; as_tibble()                            rows:       150 +0        cols:    5 +0    [   0.0 ms]
&#x2139; filter(Sepal.Length &lt; 5)               rows:        22 -128      cols:    5 +0    [   3.0 ms]
&#x2139; mutate(rn = row_number())              rows:        22 +0        cols:    6 +1    [   1.0 ms]
  added: rn
&#x2139; &gt; filter(Species == &quot;setosa&quot;)          rows:        50 -100      cols:    5 +0    [   1.0 ms]
&#x2139; semi_join(iris %&gt;% as_tibble() %&gt;=%    rows:        20 -2        cols:    6 +0    [   5.0 ms]
  filter(Species == &quot;setosa&quot;), by =
  &quot;Species&quot;)
&#x2139; group_by(Species)                      rows:        20 +0        cols:    6 +0    [   3.0 ms]
&#x2139; summarise(n = n_distinct(rn))          rows:         1 -19       cols:    2 -4    [   2.0 ms]
  dropped: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, rn
  added: n</pre></figure>

<h3 id="screenshot">Screenshot</h3>

<p><img src="https://i1.wp.com/guillaumepressiat.github.io/images/nycflights13_example.png?w=578&#038;ssl=1" data-recalc-dims="1" /></p>

<p><br /></p>

<figure class="highlight"><pre>library(dplyr)
library(logrittr)

logrittr_options(lang = &quot;en&quot;, big_mark = &quot;,&quot;, wrap_width = NULL, max_cols = 3)

nycflights13::flights %&gt;=% 
  as_tibble() %&gt;=%
  group_by(year, month, day) %&gt;=% 
  count() %&gt;=% 
  tidyr::pivot_wider(values_from = &quot;n&quot;, names_from = &quot;day&quot;) %&gt;=% 
  glimpse()</pre></figure>

<h2 id="related-package-tidylog">Related package: <code>tidylog</code></h2>

<p><a href="https://github.com/elbersb/tidylog" rel="nofollow" target="_blank">tidylog</a> is a really neat package that gave me the motivation for this one.
<code>tidylog</code> works by masking dplyr functions, which doesn’t seem ideal to me.</p>

<p>This project was also an opportunity for me to try out a new programming tool that 
is widely used at the moment.</p>

<p><code>logrittr</code> uses a custom pipe operator and never touches
the dplyr namespace. Its console output is colorful and informative thanks to the cli package.</p>
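<p>For the curious, the no-masking approach can be sketched in a few lines. This is a simplified illustration of how a custom logging pipe might work, not logrittr’s actual implementation:</p>

<figure class="highlight"><pre>`%log&gt;%` &lt;- function(lhs, rhs) {
  rhs_call &lt;- substitute(rhs)
  # splice the (already evaluated) left-hand value in as the first argument,
  # as a pipe does, without masking any dplyr function
  new_call &lt;- as.call(c(rhs_call[[1]], list(lhs), as.list(rhs_call)[-1]))
  out &lt;- eval(new_call, envir = parent.frame())
  message(deparse1(rhs_call), ": rows ", nrow(lhs), " -&gt; ", nrow(out))
  out
}

iris %log&gt;% subset(Sepal.Length &lt; 5) %log&gt;% head()
# subset(Sepal.Length &lt; 5): rows 150 -&gt; 22
# head(): rows 22 -&gt; 6</pre></figure>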

<h2 id="working-with-lumberjack">Working with <code>lumberjack</code></h2>

<p>If you already know the <a href="https://github.com/markvanderloo/lumberjack" rel="nofollow" target="_blank">lumberjack</a> package, 
logrittr provides compatibility with it (timings are approximate).</p>

<p>Calling <code>logrittr_logger$new()</code>:</p>

<figure class="highlight"><pre>library(lumberjack)
library(dplyr)

l &lt;- logrittr_logger$new(verbose = TRUE)
logfile &lt;- tempfile(fileext=&quot;.-r.log.csv&quot;)

iris %L&gt;%
   start_log(log = l, label = &quot;iris step&quot;) %L&gt;%
   as_tibble() %L&gt;%
   filter(Sepal.Length &lt; 5) %L&gt;%
   mutate(rn = row_number()) %L&gt;%
   group_by(Species) %L&gt;%
   summarise(n = n_distinct(rn)) %L&gt;%
   dump_log(file=logfile, stop = FALSE)
   

mtcars %&gt;% 
  start_log(log = l, label = &quot;mtcars step&quot;) %L&gt;%
   count() %L&gt;%
   dump_log(file=logfile, stop = TRUE)


logdata &lt;- read.csv(logfile)</pre></figure>

<p>This will write the logrittr log content of multiple data steps to the same CSV file.</p>

<h2 id="limitations">Limitations</h2>

<ul>
  <li>
    <p>Like <code>tidylog</code>, logrittr only works with dplyr pipelines on in-memory R data.frames,
and is not able to handle dbplyr pipelines on database-backed (remote/lazy) tables.</p>
  </li>
  <li>
    <p>Join cardinalities, which tidylog reports nicely, are difficult to obtain from the pipe 
because the join has already completed; for now we only show the row and column evolution (before / after).</p>
  </li>
  <li>
    <p>Yes, it’s another pipe, which isn’t ideal. We can dream of a <code>with_logging(TRUE)</code> context that would activate the logrittr pipe’s behaviour in <code>|&gt;</code> or in <code>%&gt;%</code>.</p>
  </li>
</ul>

<h2 id="take-another-pipe-for-a-spin">Take another pipe for a spin</h2>

<p><code>logrittr</code> prioritizes the user experience with a structured and colorful display in the console.</p>

<p>For now, this package is just a proof of concept that gave me a chance to experiment a bit with the <code>cli</code> package and a few other things. But I think there’s a need for this in R, in a specific area where SAS output is so informative.</p>

<ul>
  <li><a href="https://guillaumepressiat.r-universe.dev/logrittr" rel="nofollow" target="_blank">https://guillaumepressiat.r-universe.dev/logrittr</a></li>
  <li><a href="https://github.com/guillaumepressiat/logrittr" rel="nofollow" target="_blank">https://github.com/guillaumepressiat/logrittr</a></li>
</ul>


<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://guillaumepressiat.github.io/blog/2026/04/logrittr-re"> Guillaume Pressiat</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/logrittr-a-verbose-pipe-operator-for-logging-dplyr-pipelines-2/">logrittr: A Verbose Pipe Operator for Logging dplyr Pipelines</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400890</post-id>	</item>
		<item>
		<title>CougarStats: a free and open-source Statistics web app for Teaching and Learning</title>
		<link>https://www.r-bloggers.com/2026/04/cougarstats-a-free-and-open-source-statistics-web-app-for-teaching-and-learning/</link>
		
		<dc:creator><![CDATA[Ashok Krishnamurthy]]></dc:creator>
		<pubDate>Wed, 29 Apr 2026 06:18:25 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://r-posts.com/?p=17352</guid>

					<description><![CDATA[<p>Hello, I’d like to share CougarStats, a free and open-source R Shiny web app I developed to support the teaching and learning of Statistics. CougarStats runs entirely in a browser and is designed for accessibility and ease of use. You can explore the app here: https://www.cougarstats.ca/    ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/cougarstats-a-free-and-open-source-statistics-web-app-for-teaching-and-learning/">CougarStats: a free and open-source Statistics web app for Teaching and Learning</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="http://r-posts.com/cougarstats-a-free-and-open-source-statistics-web-app-for-teaching-and-learning/"> R-posts.com</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
Hello,<br />
<br />
<div class="gmail_default"><span style="font-family: georgia, serif">I’d like to share <i>CougarStats</i>, a free and open-source R Shiny web app I developed to support the teaching and learning of Statistics. CougarStats runs entirely in a browser and is designed for accessibility and ease of use. You can explore the app here:<span style="color: #0000ff"> </span><a href="https://www.cougarstats.ca/" data-saferedirecturl="https://www.google.com/url?q=https://www.cougarstats.ca/&#038;source=gmail&#038;ust=1762130919782000&#038;usg=AOvVaw3JsAA9uJ8ViqpkU-5Kuau8" rel="nofollow" target="_blank"><span style="color: #0000ff">https://www.cougarstats.ca/</span></a> </span></div>
<div class="gmail_default"><span style="font-family: georgia, serif"> </span></div>
<div class="gmail_default"><span style="font-family: georgia, serif">The name <i>CougarStats</i> is inspired by Mount Royal University’s athletics mascot, the cougar, symbolizing strength and agility, and by the app’s focus on statistics. </span></div>
<div class="gmail_default"><span style="font-family: georgia, serif"> </span></div>
<div class="gmail_default"><span style="font-family: georgia, serif"><b>Key features of <i>CougarStats</i></b></span></div>
<div class="gmail_default"><span style="font-family: georgia, serif"> </span></div>
<div class="gmail_default">
<ul>
	<li><span style="font-family: georgia, serif"><b>Descriptive Statistics:</b> </span><span>Compute measures like mean, median, mode, quartiles, IQR, standard deviation, and identify potential outliers. </span></li>
	<li><span style="font-family: georgia, serif"><b>Data Visualization:</b> Construct Boxplots, Histograms, and Scatterplots. </span></li>
	<li><span style="font-family: georgia, serif"><b>Probability:</b> Calculate marginal, joint, union, and conditional probability for contingency tables; exact and cumulative probabilities for Binomial, Poisson, Negative Binomial and Hypergeometric distributions; and cumulative probabilities for the Normal distribution. </span></li>
	<li><span style="font-family: georgia, serif"><b>Sample Size Estimation:</b> Determine the required sample sizes for various scenarios. </span></li>
	<li><span style="font-family: georgia, serif"><b>Statistical Inference:</b> Construct confidence intervals, conduct hypothesis tests for one- and two-samples (mean, proportion and standard deviation). </span></li>
	<li><span style="font-family: georgia, serif"><b>ANOVA:</b> Perform one-way Analysis of Variance with an option to conduct Bonferroni post hoc tests. </span></li>
	<li><span style="font-family: georgia, serif"><b>Regression and Correlation:</b> Fit simple linear regression models and compute Pearson correlation coefficient, multiple linear regression, logistic regression. </span></li>
	<li><span style="font-family: georgia, serif"><b>Categorical Data Analysis:</b> Perform Chi-Square test of independence with and without Yates’ continuity correction, Fisher’s exact test. </span></li>
	<li><span style="font-family: georgia, serif"><b>Nonparametric Tests:</b> Perform the Mann-Whitney <i>U</i> Test, Kruskal-Wallis test etc.</span></li>
</ul>
</div>
<div class="gmail_default"><span style="font-family: georgia, serif"> </span></div>
<div class="gmail_default"><span style="font-family: georgia, serif">I would be delighted if you could explore <i>CougarStats</i> and share it with your students and colleagues who might find it useful.</span></div>
<div class="gmail_default"><span style="font-family: georgia, serif"> </span></div>
<div class="gmail_default"><span style="font-family: georgia, serif">Thank you for your time, and I look forward to hearing your thoughts.</span></div>
<div class="gmail_default"><span style="font-family: georgia, serif"> </span></div>
<div class="gmail_default"><span style="font-family: georgia, serif">Sincerely, </span></div>
<div class="gmail_default"><span style="font-family: georgia, serif"><span style="font-family: georgia, serif">Ashok<br />
<br />
—<br />
</span></span>
<div><span style="font-family: georgia, serif">Ashok Krishnamurthy, PhD<br />
</span></div>
<div><span style="font-family: georgia, serif">Associate Professor</span></div>
<div><span style="font-family: georgia, serif">Department of Mathematics and Computing</span></div>
<div><span style="font-family: georgia, serif">Mount Royal University</span></div>
<div><span style="font-family: georgia, serif">4825 Mount Royal Gate SW</span></div>
<div><span style="font-family: georgia, serif">Calgary, AB, T3E 6K6 Canada</span></div>
<div></div>
<div><span style="color: #0000ff"><a href="mailto:akrishnamurthy@mtroyal.ca" rel="nofollow" style="color: #0000ff" target="_blank"><span style="font-family: georgia, serif">akrishnamurthy@mtroyal.ca</span></a></span></div>
</div><hr style="border-top: black solid 1px" /><a href="http://r-posts.com/cougarstats-a-free-and-open-source-statistics-web-app-for-teaching-and-learning/" rel="nofollow" target="_blank">CougarStats: a free and open-source Statistics web app for Teaching and Learning</a> was first posted on April 29, 2026 at 6:18 am.<br />
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="http://r-posts.com/cougarstats-a-free-and-open-source-statistics-web-app-for-teaching-and-learning/"> R-posts.com</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/cougarstats-a-free-and-open-source-statistics-web-app-for-teaching-and-learning/">CougarStats: a free and open-source Statistics web app for Teaching and Learning</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400894</post-id>	</item>
		<item>
		<title>grouper: An R package for Optimal Group Assignment</title>
		<link>https://www.r-bloggers.com/2026/04/grouper-an-r-package-for-optimal-group-assignment/</link>
		
		<dc:creator><![CDATA[Vik Gopal]]></dc:creator>
		<pubDate>Wed, 29 Apr 2026 06:18:23 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://r-posts.com/?p=17067</guid>

					<description><![CDATA[<p>Introduction Universities are increasingly using collaborative learning pedagogies, which can benefit learners through deeper understanding of course content and teamwork skills. However, the realisation of these sought-after benefits depends on how educators assign learners to groups. Educators have formulated various mathematical models to perform this assignment. Some have developed ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/grouper-an-r-package-for-optimal-group-assignment/">grouper: An R package for Optimal Group Assignment</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="http://r-posts.com/grouper-an-r-package-for-optimal-group-assignment/"> R-posts.com</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<h2>Introduction</h2>

Universities are increasingly using collaborative learning pedagogies, which can benefit learners through deeper understanding of course content and teamwork skills. However, the realisation of these sought-after benefits depends on how educators assign learners to groups. <br />
<br />
Educators have formulated various mathematical models to perform this assignment. Some have developed models that prioritised maximising students’ project preferences. Others developed a model that prioritised students’ preferences, group sizes and group composition. Yet other models address related, but distinct, problems such as assigning students to elective courses or incorporating staff workload into student-to-project supervisor assignments. <br />
<br />
Whichever approach is used, it is apparent that there is a need for an algorithmic solution for the assignment. This would ease the burden on the instructor, while providing an objective procedure for the assignment. Our contribution is an R package <code>grouper</code> that offers two flexible group allocation strategies.<br />
<h2>Optimisation Models</h2>
<p dir="auto"><code>grouper</code> provides two distinct integer linear programming optimisation models.</p>
<div class="sourceCode" id="cb2">
<pre>library(grouper)
library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)</pre>
</div>
<h3>Preference-Based Assignment</h3>
<p dir="auto">The Preference-Based Assignment (PBA) model allows educators to assign student groups to topics to maximise overall student preferences for those topics. The topics can be viewed as project titles. The model allows for repetitions of each project title. This formulation also allows each project team to comprise multiple sub-groups. This is useful in cases where the project requires teams with different functionality to work together, e.g. where one team works on a front-end while the other develops a back-end model.</p>
<p dir="auto">To execute the optimisation routine, an instructor prepares:</p>
<ol dir="auto">
	<li>A group composition table listing the member students within each self-formed group.</li>
	<li>A preference matrix containing the preference that each self-formed group has for each topic.</li>
	<li>A YAML file defining the remaining parameters of the model.</li>
</ol>
<h3>Examples</h3>

Consider the following simple dataset with 8 students:

<div class="sourceCode" id="cb6">
<pre>pba_gc_ex002
#&gt;   id grouping
#&gt; 1  1        1
#&gt; 2  2        1
#&gt; 3  3        2
#&gt; 4  4        2
#&gt; 5  5        3
#&gt; 6  6        3
#&gt; 7  7        4
#&gt; 8  8        4</pre>
</div>
<p>Each student is in a self-formed group of size 2, indicated via the <code>grouping</code> column. Suppose that, for this set of students, the instructor wishes to assign students into two topics, with each topic having two sub-groups. This requires the preference matrix to have 4 columns – one for each topic-subgroup combination. Remember that the ordering of topics/subtopics in the preference matrix should be:</p>
<p><em>Topic1-Subtopic1, Topic2-Subtopic1, Topic1-Subtopic2, Topic2-Subtopic2</em></p>
<p>Thus there should be 4 rows in the preference matrix – one for each self-formed group.</p>
<div class="sourceCode" id="cb7">
<pre>pba_prefmat_ex002
#&gt;      col1 col2 col3 col4
#&gt; [1,]    4    3    2    1
#&gt; [2,]    3    4    2    1
#&gt; [3,]    1    2    4    3
#&gt; [4,]    1    2    3    4</pre>
</div>
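<p>The column ordering above can be captured in a small helper. The sketch below is illustrative only (the <code>pref_col</code> function is our own, not part of <code>grouper</code>): it maps a (topic, subtopic) pair to its preference-matrix column under the stated ordering, in which the topic index varies fastest.</p>

```r
# Hypothetical helper (not part of grouper): the column index of a
# (topic, subtopic) pair, with the topic index varying fastest.
pref_col <- function(topic, subtopic, n_topics) {
  (subtopic - 1) * n_topics + topic
}

pref_col(1, 1, 2)  # 1: Topic1-Subtopic1
pref_col(2, 1, 2)  # 2: Topic2-Subtopic1
pref_col(1, 2, 2)  # 3: Topic1-Subtopic2
pref_col(2, 2, 2)  # 4: Topic2-Subtopic2
```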
<p>The YAML file for this model contains the following parameters:</p>
<pre>n_topics: 2
B: 2
R: 1
nmin: 2
nmax: 2
rmin: 1
rmax: 1</pre>
<p>B corresponds to the number of sub-topics per topic, while r<sub>min</sub> and r<sub>max</sub> denote the minimum and maximum number of repetitions of each topic. n<sub>min</sub> and n<sub>max</sub> denote the minimum and maximum number of members in each sub-topic group.<br />
<br />
It is possible to assign each self-formed group to its optimal choice of topic-subtopic combination. In our solution, we should see that group 1 is assigned to subtopic 1 of topic 1, group 2 is assigned to sub-topic 1 of topic 2, and so on.</p>
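<p>As a rough feasibility check (this is our reading of the parameters, not a <code>grouper</code> function), these settings bound the total number of students the model can seat:</p>

```r
# Rough capacity bounds implied by the YAML parameters above (our own
# interpretation, not a grouper API): each of the n_topics topics is
# repeated between rmin and rmax times, each repetition has B sub-groups,
# and each sub-group seats between nmin and nmax students.
params <- list(n_topics = 2, B = 2, rmin = 1, rmax = 1, nmin = 2, nmax = 2)
min_seats <- with(params, n_topics * rmin * B * nmin)
max_seats <- with(params, n_topics * rmax * B * nmax)
c(min_seats, max_seats)  # 8 8, matching the 8 students in the example
```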
<div class="sourceCode" id="cb8">
<pre>df_ex002_list &lt;- extract_student_info(pba_gc_ex002, &quot;preference&quot;, 
                                     self_formed_groups = 2, 
                                     pref_mat = pba_prefmat_ex002)
yaml_ex002_list &lt;- extract_params_yaml(system.file(&quot;extdata&quot;, 
                                         &quot;pba_params_ex002.yml&quot;,  
                                          package = &quot;grouper&quot;),
                                      &quot;preference&quot;)
m2 &lt;- prepare_model(df_ex002_list, yaml_ex002_list, &quot;preference&quot;)
result2 &lt;- solve_model(m2, with_ROI(solver=&quot;glpk&quot;))

assign_groups(result2, assignment = &quot;preference&quot;, 
              dframe=pba_gc_ex002, yaml_ex002_list, 
              group_names=&quot;grouping&quot;)
#&gt;   topic2 subtopic rep group size
#&gt; 1      1        1   1     1    2
#&gt; 2      2        1   1     2    2
#&gt; 3      1        2   1     3    2
#&gt; 4      2        2   1     4    2</pre>
</div>
<h3>Diversity-Based Assignment</h3>
<p dir="auto">The Diversity-Based Assignment (DBA) model enables educators to assign students to groups and topics with the dual, but weighted, aims of maximising diversity (based on student attributes) within groups and balancing specific skill levels across different groups.</p>
<p dir="auto">To execute the DBA optimisation routine, the instructor prepares:</p>
<ol dir="auto">
	<li>A group composition table containing:
<ol dir="auto">
	<li>the member students within each self-formed group,</li>
	<li>the demographics that will be used to compute pairwise dissimilarity between students, and</li>
	<li>a numeric measure of each student’s skill.</li>
</ol>
</li>
	<li>A YAML file defining the remaining parameters of the model.</li>
</ol>
<h4>Examples</h4>

Consider the following dataset, which comes with the package. There are 4 students in total.
<div class="sourceCode" id="cb2">
<pre>dba_gc_ex001
#&gt;   id major skill groups
#&gt; 1  1     A     1      1
#&gt; 2  2     A     1      2
#&gt; 3  3     B     3      3
#&gt; 4  4     B     3      4</pre>
</div>
<p>It is intuitive that an assignment into two groups of size two, based on the diversity of majors alone, should place one student from major A and one student from major B in each group.</p>
<p>The corresponding YAML <code>dba_gc_ex001.yml</code> file for this exercise consists of the following lines:</p>
<pre>n_topics:  2
R:  1
nmin: 2
nmax: 2
rmin: 1
rmax: 1</pre>
<p>To run the assignment, we can use the following commands. Either the Gurobi solver or the glpk solver can be used for this example; both solve it equally fast.</p>
<div class="sourceCode" id="cb4">
<pre># Indicate appropriate columns using integer ids.
df_ex001_list &lt;- extract_student_info(dba_gc_ex001, &quot;diversity&quot;,
                                      demographic_cols = 2, 
                                      skills = 3, 
                                      self_formed_groups = 4)

yaml_ex001_list &lt;- extract_params_yaml(system.file(&quot;extdata&quot;, 
                                         &quot;dba_params_ex001.yml&quot;,  
                                         package = &quot;grouper&quot;),
                                       &quot;diversity&quot;)
m1 &lt;- prepare_model(df_ex001_list, yaml_ex001_list,
                    assignment=&quot;diversity&quot;,w1=0.5, w2=0.5)

result3 &lt;- solve_model(m1, with_ROI(solver=&quot;glpk&quot;))
assign_groups(result3, assignment = &quot;diversity&quot;, 
              dframe=dba_gc_ex001, 
              group_names=&quot;groups&quot;)
#&gt;   topic rep group id major skill
#&gt; 1     1   1     2  2     A     1
#&gt; 2     1   1     3  3     B     3
#&gt; 3     2   1     1  1     A     1
#&gt; 4     2   1     4  4     B     3</pre>
</div>
<p>We can see that students 2 and 3 have been assigned to topic 1, repetition 1. Students 1 and 4 have been assigned to topic 2, repetition 1. w<sub>1</sub> and w<sub>2</sub> both have weights 0.5, which means the skills and demographic inputs are given equal weight in the optimisation.<br />
<br />
At present, the routines use the <code>daisy</code> function from the <code>cluster</code> package to compute a pairwise dissimilarity matrix between students. However, it is also possible to supply your own custom dissimilarity matrix. Consider the following dataset of 4 students:</p>
<pre>dba_gc_ex003
#&gt;   year   major self_groups id
#&gt; 1    1    math           1  1
#&gt; 2    2 history           2  2
#&gt; 3    3    dsds           3  3
#&gt; 4    4    elts           4  4</pre>
<p>Now consider a situation where we wish to consider years 1 and 2 different from years 3 and 4, and <code>math</code> and <code>dsds</code> (STEM majors) to be different from <code>elts</code> and <code>history</code> (non-STEM majors). For each difference, we assign a score of 1. This means that students 1 and 2 would have a dissimilarity score of 1 due to their difference in majors. Students 1 and 3 would also have a score of 1, but due to their difference in years. Students 1 and 4 would have a score of 2, due to their differences in majors and in years. The overall dissimilarity matrix would be:</p>
<div class="sourceCode" id="cb7">
<pre>d_mat &lt;- matrix(c(0, 1, 1, 2,
                  1, 0, 2, 1,
                  1, 2, 0, 1,
                  2, 1, 1, 0), nrow=4, byrow = TRUE)</pre>
</div>
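<p>The same matrix can be built programmatically from the two binary attributes instead of typed by hand. This base-R sketch applies the scoring rule just described (one point per differing attribute):</p>

```r
# Base-R sketch of the scoring rule above: one point for a differing
# year band (years 1-2 vs 3-4) plus one point for STEM vs non-STEM major.
year_band <- c(1, 1, 2, 2)                 # students 1 and 2 vs 3 and 4
stem      <- c(TRUE, FALSE, TRUE, FALSE)   # math, history, dsds, elts
d_mat2 <- outer(year_band, year_band, "!=") + outer(stem, stem, "!=")
d_mat2  # equal to the d_mat entered manually above
```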
<p>To run the optimisation for this model, we can execute the following code:</p>
<div class="sourceCode" id="cb8">
<pre>df_ex003_list &lt;- extract_student_info(dba_gc_ex003, &quot;diversity&quot;,
                                       skills = NULL,
                                       self_formed_groups = 3,
                                       d_mat=d_mat)
yaml_ex003_list &lt;- extract_params_yaml(system.file(&quot;extdata&quot;,   
                                         &quot;dba_params_ex003.yml&quot;,
                                         package = &quot;grouper&quot;), 
                                       &quot;diversity&quot;)
m3 &lt;- prepare_model(df_ex003_list, yaml_ex003_list, w1=1.0, w2=0.0)
result &lt;- solve_model(m3, with_ROI(solver=&quot;glpk&quot;))

assign_groups(result, &quot;diversity&quot;, dba_gc_ex003,
              group_names=&quot;self_groups&quot;)
#&gt;   topic rep group year   major id
#&gt; 1     1   1     1    1    math  1
#&gt; 2     1   1     4    4    elts  4
#&gt; 3     2   1     2    2 history  2
#&gt; 4     2   1     3    3    dsds  3</pre>
</div>
<p>As you can see, within each group the two members are maximally dissimilar – they differ in terms of their year, <em>and</em> in terms of their major. Notice that we specified</p>
<pre>skills = NULL</pre>
<p>and</p>
<pre>w2 = 0.0</pre>
<p>These settings ensure that no skill columns are taken into account in this optimisation.</p>
<h3>Gurobi Optimiser</h3>

While the routines above use the glpk optimiser, we recommend using the Gurobi optimiser. The latter is commercial software that runs to completion much faster than glpk. For more information, please refer to <a href="https://www.gurobi.com/solutions/gurobi-optimizer/" rel="nofollow" target="_blank">this website</a>. Note that <a href="https://www.gurobi.com/academia/academic-program-and-licenses/" rel="nofollow" target="_blank">academic licenses</a> are available from Gurobi.<br />
<h3>Shiny Applications</h3>

The package provides numerous options for each of the two optimisation models. It also includes two shiny applications, which may be useful if one only needs a straightforward group assignment. <br />
<br />
To run the DBA shiny app, the following code will suffice:

<pre>library(shiny)
runApp(appDir=system.file(&quot;shiny&quot;, &quot;dbaWebApp&quot;, package=&quot;grouper&quot;))

# Analogous code for PBA app:
# runApp(appDir=system.file(&quot;shiny&quot;, &quot;pbaWebApp&quot;, package=&quot;grouper&quot;))</pre>
<p>Here is a screenshot of the diversity-based shiny application. <br />
<br />
<img loading="lazy" fetchpriority="high" decoding="async" src="https://i1.wp.com/r-posts.com/wp-content/uploads/2025/10/dba_screenshot-450x216.png?resize=450%2C216" alt="" width="450" height="216" class="aligncenter size-large wp-image-17093" srcset_temp="https://i1.wp.com/r-posts.com/wp-content/uploads/2025/10/dba_screenshot-450x216.png?resize=450%2C216 450w, http://r-posts.com/wp-content/uploads/2025/10/dba_screenshot-768x368.png 768w, http://r-posts.com/wp-content/uploads/2025/10/dba_screenshot-1536x737.png 1536w, http://r-posts.com/wp-content/uploads/2025/10/dba_screenshot.png 1868w" sizes="(max-width: 450px) 100vw, 450px" data-recalc-dims="1" /><br />
<br />
The system folders with the shiny apps also contain example csv files for use with the apps.</p>
<h3>More Details</h3>

The two optimisation models are flexibly parametrised. Here are some of the features:<br />
<ul>
	<li>Define the number of repetitions for each topic.</li>
	<li>Define the max. and min. number of group members for each topic.</li>
</ul>

The vignettes also contain the precise mathematical formulation of the optimisation models. For full details, please refer to these links:<br />
<ul>
	<li><a href="http://cran.r-project.org/package=grouper" rel="nofollow" target="_blank">CRAN page</a></li>
	<li><a href="https://github.com/singator/grouper" rel="nofollow" target="_blank">Github repository</a></li>
</ul>
</ul><hr style="border-top: black solid 1px" /><a href="http://r-posts.com/grouper-an-r-package-for-optimal-group-assignment/" rel="nofollow" target="_blank">grouper: An R package for Optimal Group Assignment</a> was first posted on April 29, 2026 at 6:18 am.<br />
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="http://r-posts.com/grouper-an-r-package-for-optimal-group-assignment/"> R-posts.com</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/grouper-an-r-package-for-optimal-group-assignment/">grouper: An R package for Optimal Group Assignment</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400896</post-id>	</item>
		<item>
		<title>Understanding R’s `describe()` Function: A Complete Guide to Summary Statistics</title>
		<link>https://www.r-bloggers.com/2026/04/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/</link>
		
		<dc:creator><![CDATA[Nick Han]]></dc:creator>
		<pubDate>Wed, 29 Apr 2026 06:09:11 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://r-posts.com/?p=14802</guid>

					<description><![CDATA[<p>Understanding R’s describe() Function: A Complete Guide to Summary Statistics Table of Contents Introduction to describe() Breaking Down the Output Columns Key Statistics and Their Interpretation Practical Examples When to Use Which Statistic Extending the Functionality Conclusion Introduction to describe() The describe() function from R’s psych package (Revelle, 2023) ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/">Understanding R’s `describe()` Function: A Complete Guide to Summary Statistics</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="http://r-posts.com/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/"> R-posts.com</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<span style="font-size: 35px;font-weight: bold">Understanding R’s </span><code>describe()</code><span style="font-size: 35px;font-weight: bold"> Function: A Complete Guide to Summary Statistics</span><br />
<div class="toc">
<h2>Table of Contents</h2>
<ol>
	<li><a href="http://r-posts.com/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/#introduction-to-describe" rel="nofollow" target="_blank">Introduction to <code>describe()</code></a></li>
	<li><a href="http://r-posts.com/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/#breaking-down-the-output-columns" rel="nofollow" target="_blank">Breaking Down the Output Columns</a></li>
	<li><a href="http://r-posts.com/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/#key-statistics-and-their-interpretation" rel="nofollow" target="_blank">Key Statistics and Their Interpretation</a></li>
	<li><a href="http://r-posts.com/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/#practical-examples" rel="nofollow" target="_blank">Practical Examples</a></li>
	<li><a href="http://r-posts.com/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/#when-to-use-which-statistic" rel="nofollow" target="_blank">When to Use Which Statistic</a></li>
	<li><a href="http://r-posts.com/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/#extending-the-functionality" rel="nofollow" target="_blank">Extending the Functionality</a></li>
	<li><a href="http://r-posts.com/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/#conclusion" rel="nofollow" target="_blank">Conclusion</a></li>
</ol>
</div>
<h2 id="introduction-to-describe">Introduction to <code>describe()</code></h2>
<p>The <code>describe()</code> function from R’s <code>psych</code> package (Revelle, 2023) provides a comprehensive statistical summary of your dataset. Unlike R’s base <code>summary()</code> function, it includes additional metrics that are particularly useful for data exploration and assumption checking.</p>
<pre>library(psych)
describe(your_data)</pre>
<h2 id="breaking-down-the-output-columns">Breaking Down the Output Columns</h2>
<p>Here’s what each column in the output represents:</p>
<table>
<thead>
<tr>
<th>Column</th>
<th>Description</th>
<th>Formula/Calculation</th>
<th>Ideal Use Case</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>vars</strong></td>
<td>Variable index number</td>
<td>–</td>
<td>Tracking variable order</td>
</tr>
<tr>
<td><strong>n</strong></td>
<td>Complete cases</td>
<td><code>length(na.omit(x))</code></td>
<td>Data completeness check</td>
</tr>
<tr>
<td><strong>mean</strong></td>
<td>Arithmetic average</td>
<td><code>sum(x)/n</code></td>
<td>Normally distributed data</td>
</tr>
<tr>
<td><strong>sd</strong></td>
<td>Standard deviation</td>
<td><code>sqrt(var(x))</code></td>
<td>Measuring spread</td>
</tr>
<tr>
<td><strong>median</strong></td>
<td>50th percentile</td>
<td><code>quantile(x, 0.5)</code></td>
<td>Skewed distributions</td>
</tr>
<tr>
<td><strong>trimmed</strong></td>
<td>Mean after removing extremes</td>
<td><code>mean(x, trim=0.1)</code></td>
<td>Robust central tendency</td>
</tr>
<tr>
<td><strong>mad</strong></td>
<td>Median absolute deviation</td>
<td><code>1.4826*median(abs(x-median(x)))</code></td>
<td>Outlier-resistant spread</td>
</tr>
<tr>
<td><strong>min</strong></td>
<td>Minimum value</td>
<td><code>min(x)</code></td>
<td>Range assessment</td>
</tr>
<tr>
<td><strong>max</strong></td>
<td>Maximum value</td>
<td><code>max(x)</code></td>
<td>Range assessment</td>
</tr>
<tr>
<td><strong>range</strong></td>
<td>Max – Min</td>
<td><code>max(x)-min(x)</code></td>
<td>Total spread</td>
</tr>
<tr>
<td><strong>skew</strong></td>
<td>Distribution asymmetry</td>
<td><code>sum((x-mean(x))³)/(n*sd(x)³)</code></td>
<td>Detecting skew direction</td>
</tr>
<tr>
<td><strong>kurtosis</strong></td>
<td>Tailedness</td>
<td><code>sum((x-mean(x))⁴)/(n*sd(x)⁴)-3</code></td>
<td>Outlier propensity</td>
</tr>
<tr>
<td><strong>se</strong></td>
<td>Standard error</td>
<td><code>sd(x)/sqrt(n)</code></td>
<td>Precision of mean estimate</td>
</tr>
</tbody>
</table>
<h2 id="key-statistics-and-their-interpretation">Key Statistics and Their Interpretation</h2>
<h3>Central Tendency</h3>
<ul>
	<li><strong>Mean vs. Median</strong>: Differences indicate skewness</li>
	<li><strong>Trimmed Mean</strong>: Removes influence of outliers (default drops top/bottom 10%)</li>
</ul>
<h3>Variability</h3>
<ul>
	<li><strong>SD vs. MAD</strong>: Use MAD when outliers are present</li>
	<li><strong>Range</strong>: Simple but outlier-sensitive</li>
</ul>
<h3>Distribution Shape</h3>
<ul>
	<li><strong>Skewness</strong>:

<ul>
	<li>&gt;0: Right-tailed</li>
	<li>&lt;0: Left-tailed</li>
	<li>0: Symmetric</li>
</ul>
</li>
	<li><strong>Kurtosis</strong> (Excess):

<ul>
	<li>&gt;0: Heavy-tailed (more outliers than normal)</li>
	<li>&lt;0: Light-tailed</li>
</ul>
</li>
</ul>
<h2 id="practical-examples">Practical Examples</h2>
<h3>Example 1: MPG from mtcars</h3>
<pre>describe(mtcars$mpg)</pre>
<p>Output Interpretation:</p>
<pre>  vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
1    1 32 20.09 6.03   19.2   19.70 5.41 10.4 33.9  23.5 0.61    -0.37 1.07</pre>
<ul>
	<li><strong>Right-skewed</strong> (mean > median, positive skew)</li>
	<li><strong>Light-tailed</strong> (negative kurtosis)</li>
	<li><strong>SD (6.03) > MAD (5.41)</strong>: Suggests some outlier influence</li>
</ul>
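<p>These diagnostics can be cross-checked in base R using the built-in <code>mtcars</code> data and the formulas from the table above (the simple skew formula differs slightly from <code>psych</code>’s bias-corrected estimate, but the sign agrees):</p>

```r
# Cross-check the reported diagnostics with base R only.
x <- mtcars$mpg
mean(x) > median(x)      # TRUE: the mean sits right of the median
skew <- sum((x - mean(x))^3) / (length(x) * sd(x)^3)
skew > 0                 # TRUE: positive, consistent with the reported 0.61
sd(x) / sqrt(length(x))  # about 1.07, the reported standard error
```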
<h2 id="when-to-use-which-statistic">When to Use Which Statistic</h2>
<table>
<thead>
<tr>
<th>Scenario</th>
<th>Recommended Statistics</th>
</tr>
</thead>
<tbody>
<tr>
<td>Normal Distribution</td>
<td>Mean, SD</td>
</tr>
<tr>
<td>Skewed Data</td>
<td>Median, IQR, MAD</td>
</tr>
<tr>
<td>Outlier Detection</td>
<td>MAD, trimmed mean, kurtosis</td>
</tr>
<tr>
<td>Parametric Testing</td>
<td>Mean, SE</td>
</tr>
<tr>
<td>Nonparametric Analysis</td>
<td>Median, IQR</td>
</tr>
</tbody>
</table>
<h2 id="extending-the-functionality">Extending the Functionality</h2>
<h3>Adding IQR</h3>
<p>The default <code>describe()</code> doesn’t show IQR, but you can add it:</p>
<pre>library(dplyr)
describe(mtcars) %&gt;% 
  mutate(IQR = apply(mtcars, 2, IQR, na.rm = TRUE))</pre>
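<p>If you prefer to avoid the <code>dplyr</code> dependency, the same per-column IQR can be computed with base R alone:</p>

```r
# Base-R equivalent of the IQR column added above, for built-in mtcars.
iqr_by_col <- apply(mtcars, 2, IQR, na.rm = TRUE)
iqr_by_col["mpg"]  # 7.375: the 75th minus the 25th percentile of mpg
```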
<h3>Comparing Groups</h3>
<p>Use <code>describeBy()</code> for grouped statistics:</p>
<pre>describeBy(mtcars$mpg, group = mtcars$cyl)</pre>
<h2 id="conclusion">Conclusion</h2>
<p>R’s <code>describe()</code> function provides a powerful starting point for exploratory data analysis. By understanding each statistic it provides, you can:</p>
<ul>
	<li>Detect data quality issues</li>
	<li>Choose appropriate analysis methods</li>
	<li>Understand your variables’ distributions</li>
	<li>Make informed decisions about data transformations</li>
</ul>
<p>For formal reporting, consider supplementing these metrics with visualization and statistical tests.</p>
<blockquote><strong>Pro Tip</strong>: Always visualize your data alongside these statistics – numbers tell part of the story, but plots reveal the full picture!</blockquote>
<p>Happy coding!<br />
<br />
—<br />
<strong>Reference:</strong><br />
<span>Revelle, W. (2023). </span><em>psych: Procedures for Psychological, Psychometric, and Personality Research</em><span>. Northwestern University.</span></p><hr style="border-top: black solid 1px" /><a href="http://r-posts.com/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/" rel="nofollow" target="_blank">Understanding R’s `describe()` Function: A Complete Guide to Summary Statistics</a> was first posted on April 29, 2026 at 6:09 am.<br />
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="http://r-posts.com/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/"> R-posts.com</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/understanding-rs-describe-function-a-complete-guide-to-summary-statistics/">Understanding R’s `describe()` Function: A Complete Guide to Summary Statistics</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400901</post-id>	</item>
		<item>
		<title>Understanding Statistical Coefficients: From Regression to Variation</title>
		<link>https://www.r-bloggers.com/2026/04/understanding-statistical-coefficients-from-regression-to-variation/</link>
		
		<dc:creator><![CDATA[Nick Han]]></dc:creator>
		<pubDate>Wed, 29 Apr 2026 06:09:04 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://r-posts.com/?p=14800</guid>

					<description><![CDATA[<p>The Data Analyst’s Guide to  Statistical Coefficients Table of Contents What Are Coefficients? Regression Coefficient Coefficient of Determination (R²) Coefficient of Variation (CV) Correlation Coefficient Other Common Coefficients Implementation in R Key Takeaways What Are Coefficients? In statistics and data analysis, coefficients are numerical measures that quantify relationships between ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/understanding-statistical-coefficients-from-regression-to-variation/">Understanding Statistical Coefficients: From Regression to Variation</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="http://r-posts.com/understanding-statistical-coefficients-from-regression-to-variation/"> R-posts.com</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<span style="font-size: 35px;font-weight: bold">The Data Analyst’s Guide to Statistical Coefficients</span><br />
<div class="toc">
<h2>Table of Contents</h2>
<ol>
	<li><a href="http://r-posts.com/understanding-statistical-coefficients-from-regression-to-variation/#what-are-coefficients" rel="nofollow" target="_blank">What Are Coefficients?</a></li>
	<li><a href="http://r-posts.com/understanding-statistical-coefficients-from-regression-to-variation/#regression-coefficient" rel="nofollow" target="_blank">Regression Coefficient</a></li>
	<li><a href="http://r-posts.com/understanding-statistical-coefficients-from-regression-to-variation/#coefficient-of-determination" rel="nofollow" target="_blank">Coefficient of Determination (R²)</a></li>
	<li><a href="http://r-posts.com/understanding-statistical-coefficients-from-regression-to-variation/#coefficient-of-variation" rel="nofollow" target="_blank">Coefficient of Variation (CV)</a></li>
	<li><a href="http://r-posts.com/understanding-statistical-coefficients-from-regression-to-variation/#correlation-coefficient" rel="nofollow" target="_blank">Correlation Coefficient</a></li>
	<li><a href="http://r-posts.com/understanding-statistical-coefficients-from-regression-to-variation/#other-common-coefficients" rel="nofollow" target="_blank">Other Common Coefficients</a></li>
	<li><a href="http://r-posts.com/understanding-statistical-coefficients-from-regression-to-variation/#implementation-in-r" rel="nofollow" target="_blank">Implementation in R</a></li>
	<li><a href="http://r-posts.com/understanding-statistical-coefficients-from-regression-to-variation/#key-takeaways" rel="nofollow" target="_blank">Key Takeaways</a></li>
</ol>
</div>
<h2 id="what-are-coefficients">What Are Coefficients?</h2>
<p>In statistics and data analysis, <strong>coefficients</strong> are numerical measures that quantify relationships between variables or characteristics of data distributions. They serve as fundamental indicators in statistical modeling and data interpretation.</p>
<h2 id="regression-coefficient">1. Regression Coefficient</h2>
<h3>Definition</h3>
<p>The regression coefficient measures the expected change in the dependent variable (Y) for a one-unit change in the independent variable (X).</p>
<h3>Formula</h3>
<p>For the linear model Y = aX + b:</p>
<ul>
	<li>a: Regression coefficient (change in Y per unit change in X)</li>
	<li>b: Intercept</li>
</ul>
<h3>R Implementation</h3>
<pre># Linear regression example
model &lt;- lm(mpg ~ wt, data = mtcars)
summary(model)

# Extract coefficients
coef(model)</pre>
<h3>Interpretation</h3>
<p>A coefficient of about -5.34 for vehicle weight (wt, which mtcars records in units of 1000 lbs) means each additional 1000 lbs of weight reduces mileage by about 5.34 mpg on average.</p>
<h2 id="coefficient-of-determination">2. Coefficient of Determination (R²)</h2>
<h3>Definition</h3>
<p>R-squared represents the proportion of variance in the dependent variable explained by the model (0-1 scale).</p>
<h3>R Code</h3>
<pre># Get R-squared value
summary(model)$r.squared</pre>
<h3>Guidelines</h3>
<ul>
	<li>R² = 0.75 → Model explains 75% of data variation</li>
	<li>Higher values indicate better model fit</li>
</ul>
<h2 id="coefficient-of-variation">3. Coefficient of Variation (CV)</h2>
<h3>Definition</h3>
<p>CV is a standardized measure of dispersion expressed as a percentage of the mean.</p>
<h3>Formula</h3>
<p>CV% = (Standard Deviation / Mean) × 100%</p>
<h3>R Function</h3>
<pre># Calculate CV
cv &lt;- function(x) {
  (sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE)) * 100
}

# Example usage
cv(mtcars$mpg)</pre>
<h3>Interpretation Benchmarks</h3>
<ul>
	<li>CV < 15%: Low variability</li>
	<li>15-30%: Moderate variability</li>
	<li>>30%: High variability</li>
</ul>
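<p>Because CV is unit-free, its main practical use is comparing variability across variables measured on different scales. A short self-contained sketch (the helper repeats the <code>cv()</code> function defined above):</p>
<pre># Unit-free comparison of spread across differently scaled variables
cv &lt;- function(x) 100 * sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

cv(mtcars$mpg)  # fuel economy, in mpg
cv(mtcars$hp)   # horsepower: larger relative spread, despite different units</pre>
<p>Comparing raw standard deviations here would be meaningless, since mpg and horsepower are on entirely different scales.</p>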
<h2 id="correlation-coefficient">4. Correlation Coefficient</h2>
<h3>Definition</h3>
<p>Measures the strength and direction of the linear relationship between two variables (-1 to 1).</p>
<h3>R Implementation</h3>
<pre># Calculate correlation
cor(mtcars$mpg, mtcars$wt)

# Correlation matrix
cor(mtcars[, c(&quot;mpg&quot;, &quot;wt&quot;, &quot;hp&quot;)])</pre>
<h3>Interpretation</h3>
<ul>
	<li>1: Perfect positive correlation</li>
	<li>-1: Perfect negative correlation</li>
	<li>0: No linear correlation</li>
</ul>
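<p>The interpretation above can be paired with a significance test. Base R's <code>cor.test()</code> returns both the estimated correlation and a p-value for the null hypothesis of zero correlation:</p>
<pre># Is the mpg-wt correlation significantly different from zero?
ct &lt;- cor.test(mtcars$mpg, mtcars$wt)
ct$estimate  # about -0.87: a strong negative linear relationship
ct$p.value   # far below 0.05</pre>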
<h2 id="other-common-coefficients">Other Common Coefficients</h2>
<table>
<thead>
<tr>
<th>Coefficient</th>
<th>Description</th>
<th>R Package/Function</th>
</tr>
</thead>
<tbody>
<tr>
<td>Skewness</td>
<td>Measures distribution asymmetry</td>
<td><code>moments::skewness()</code></td>
</tr>
<tr>
<td>Kurtosis</td>
<td>Measures tail heaviness</td>
<td><code>moments::kurtosis()</code></td>
</tr>
<tr>
<td>Concordance</td>
<td>Assesses agreement</td>
<td><code>epiR::epi.ccc()</code></td>
</tr>
</tbody>
</table>
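<p>As a base-R sketch of what the moments-package functions in the table compute (moment-based sample skewness and kurtosis; this mirrors the standard formulas, but treat it as an illustration rather than the package's exact implementation):</p>
<pre># Moment-based skewness and kurtosis (population moments, no bias correction)
skw &lt;- function(x) {
  m &lt;- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}
kurt &lt;- function(x) {
  m &lt;- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}

skw(mtcars$mpg)   # positive: mpg is right-skewed
kurt(mtcars$mpg)  # compare to 3, the kurtosis of a normal distribution</pre>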
<h2 id="implementation-in-r">Implementation in R</h2>
<h3>Comprehensive Analysis</h3>
<pre>library(psych)

# Descriptive statistics (includes multiple coefficients)
describe(mtcars)

# Full regression output
summary(lm(mpg ~ ., data = mtcars))</pre>
<h3>Custom Coefficient Calculations</h3>
<pre># Multi-coefficient function (assumes the cv() helper defined earlier)
data_analysis &lt;- function(x) {
  list(
    mean = mean(x),
    sd = sd(x),
    cv = cv(x),
    skewness = moments::skewness(x),
    kurtosis = moments::kurtosis(x)
  )
}

lapply(mtcars[, 1:4], data_analysis)</pre>
<h3>Visualization</h3>
<pre>library(ggplot2)
ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() + 
  geom_smooth(method = &quot;lm&quot;) +
  labs(title = &quot;MPG vs Weight with Regression Line&quot;,
       x = &quot;Weight (1000 lbs)&quot;,
       y = &quot;Miles per Gallon&quot;)</pre>
<h2 id="key-takeaways">Key Takeaways</h2>
<ol>
	<li><strong>Select coefficients</strong> based on analytical goals:

<ul>
	<li>Variable relationships → Regression/Correlation coefficients</li>
	<li>Model evaluation → R-squared</li>
	<li>Variability comparison → CV</li>
</ul>
</li>
	<li><strong>R advantages</strong>:

<ul>
	<li>Built-in functions for all major coefficients</li>
	<li>Seamless integration of statistical and visual analysis</li>
</ul>
</li>
	<li><strong>Best practices</strong>:

<ul>
	<li>Understand assumptions behind each coefficient</li>
	<li>Combine statistical results with domain knowledge</li>
	<li>Clearly distinguish between different coefficients</li>
</ul>
</li>
	<li><strong>Advanced applications</strong>:

<pre># Robust regression (for outlier-resistant coefficients)
library(MASS)
rlm(mpg ~ wt, data = mtcars)

# Standardized coefficients
library(lm.beta)
lm.beta(model)</pre>
</li>
</ol>
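<p>For the simple regression used throughout this post, the standardized coefficient that <code>lm.beta</code> reports can also be computed by hand, which makes clear what standardization does (a sketch of the idea, not <code>lm.beta</code>'s implementation):</p>
<pre># Standardized slope: rescale the raw coefficient by sd(x)/sd(y),
# i.e. the slope you would get after z-scoring both variables
model &lt;- lm(mpg ~ wt, data = mtcars)
beta_std &lt;- coef(model)[["wt"]] * sd(mtcars$wt) / sd(mtcars$mpg)

# In a one-predictor model the standardized slope equals
# the Pearson correlation between the two variables
beta_std
cor(mtcars$mpg, mtcars$wt)</pre>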
<p>By mastering these statistical coefficients and their R implementations, you’ll be equipped to conduct more rigorous data analysis and communicate results effectively. Remember that coefficients are tools – their proper interpretation always depends on context and research questions.<br />
<br />
Happy coding!</p><hr style="border-top: black solid 1px" /><a href="http://r-posts.com/understanding-statistical-coefficients-from-regression-to-variation/" rel="nofollow" target="_blank">Understanding Statistical Coefficients: From Regression to Variation</a> was first posted on April 29, 2026 at 6:09 am.<br />
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="http://r-posts.com/understanding-statistical-coefficients-from-regression-to-variation/"> R-posts.com</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/understanding-statistical-coefficients-from-regression-to-variation/">Understanding Statistical Coefficients: From Regression to Variation</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400917</post-id>	</item>
		<item>
		<title>Producing Systematic Literature Reviews with Bibliometrix R and Biblioshiny</title>
		<link>https://www.r-bloggers.com/2026/04/producing-systematic-literature-reviews-with-bibliometrix-r-and-biblioshiny/</link>
		
		<dc:creator><![CDATA[Luca D'Aniello]]></dc:creator>
		<pubDate>Wed, 29 Apr 2026 06:08:27 +0000</pubDate>
				<category><![CDATA[R bloggers]]></category>
		<guid isPermaLink="false">https://r-posts.com/?p=14686</guid>

					<description><![CDATA[<p>Summer School in Science Mapping (SSSM) 2025 – I International Edition Title: Producing Systematic Literature Reviews with Bibliometrix R and Biblioshiny Date &#038; Location: June 9-13, 2025 – Naples, ITA We are pleased to announce the upcoming Summer School in Science Mapping (SSSM) 2025 – I International Edition, an intensive training program focused on conducting … Continue reading ...</p>
<strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/producing-systematic-literature-reviews-with-bibliometrix-r-and-biblioshiny/">Producing Systematic Literature Reviews with Bibliometrix R and Biblioshiny</a>]]></description>
										<content:encoded><![CDATA[<!-- 
<div style="min-height: 30px;">
[social4i size="small" align="align-left"]
</div>
-->

<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;">
[This article was first published on  <strong><a href="http://r-posts.com/producing-systematic-literature-reviews-with-bibliometrix-r-and-biblioshiny/"> R-posts.com</a></strong>, and kindly contributed to <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers</a>].  (You can report issue about the content on this page <a href="https://www.r-bloggers.com/contact-us/">here</a>)
<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div>
<h1><span style="font-size: 18pt">Summer School in Science Mapping (SSSM) 2025 – I International Edition</span></h1>
<p><strong><em><span>Title</span></em><span>:</span></strong><span> Producing Systematic Literature Reviews with Bibliometrix R and Biblioshiny<br />
<strong><em>Date &#038; Location</em>:</strong> June 9-13, 2025 – Naples, ITA</span></p>
<p><span>We are pleased to announce the upcoming Summer School in Science Mapping (SSSM) 2025 – I International Edition, an intensive training program focused on conducting Systematic Literature Reviews using the Bibliometrix R package and its Shiny app, Biblioshiny.</span></p>
<p><span>Organized by the academic spin-off K-Synth in collaboration with the Department of Economics and Statistics at the University of Naples Federico II, the school will be held in Naples, Italy, from June 9 to June 13, 2025.</span></p>
<h2><span style="font-size: 14pt">Aim and Scope</span></h2>
<p><span>The SSSM 2025 is an intensive training program tailored for early-career researchers and academics seeking to enhance their expertise in bibliometric methods and scientific mapping. By integrating theoretical foundations with practical sessions, the school equips participants with robust skills in citation analysis, co-citation techniques, science mapping, and reproducible workflows for scholarly evaluation. Designed as both a learning and networking opportunity, SSSM 2025 fosters methodological development and international collaboration in a dynamic, research-oriented environment.</span></p>
<p><span>The school’s content covers:</span></p>
<p><span>– Overview of bibliometric theory and methods<br />
– Query design and data retrieval from major scientific databases<br />
– Descriptive, relational, and structural bibliometric analyses in R<br />
– Practical training in Bibliometrix R package and Biblioshiny app<br />
– Applications to real-world research review cases</span></p>
<h2><span style="font-size: 14pt">Lecturers and Guest Speakers</span></h2>
<p><span>The school will be led by Professors Massimo Aria and Corrado Cuccurullo, the developers of Bibliometrix and Biblioshiny.</span></p>
<p><span>Additionally, the 2025 edition will feature the following keynotes by distinguished international scholars in the field of scientometrics:</span></p>
<p><span>– Nicolas Robinson-Garcia (University of Granada), Scientific Director of the Computational Social Sciences and Humanities Unit (U-CHASS)<br />
– Manuel Jesús Cobo Martín (University of Cádiz), lead developer of the SciMAT software<br />
– Nicola De Bellis (University of Modena and Reggio Emilia), Coordinator of the Bibliometric Office and author of influential studies in the evaluation of scientific research</span></p>
<h2><span style="font-size: 14pt">Target Audience and Prerequisites</span></h2>
<p><span>This Summer School is designed for PhD students, postdoctoral researchers, and academics affiliated with universities or research institutions. Participants are expected to have a basic knowledge of R programming and be familiar with RStudio. </span></p>
<h2><span style="font-size: 14pt">Registration and Fees</span></h2>
<p><span>Registration is open on the official Bibliometrix website (check the Summer School section):</span></p>
<p><span><a href="https://www.bibliometrix.org/sssm/" rel="nofollow" target="_blank">https://www.bibliometrix.org/sssm/</a> </span></p>
<p><span>For any inquiries, feel free to contact the organizing committee at: <a href="mailto:sssm@bibliometrix.com" rel="nofollow" target="_blank">sssm@bibliometrix.com</a> </span></p><hr style="border-top: black solid 1px" /><a href="http://r-posts.com/producing-systematic-literature-reviews-with-bibliometrix-r-and-biblioshiny/" rel="nofollow" target="_blank">Producing Systematic Literature Reviews with Bibliometrix R and Biblioshiny</a> was first posted on April 29, 2026 at 6:08 am.<br />
<div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="http://r-posts.com/producing-systematic-literature-reviews-with-bibliometrix-r-and-biblioshiny/"> R-posts.com</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div><strong>Continue reading</strong>: <a href="https://www.r-bloggers.com/2026/04/producing-systematic-literature-reviews-with-bibliometrix-r-and-biblioshiny/">Producing Systematic Literature Reviews with Bibliometrix R and Biblioshiny</a>]]></content:encoded>
					
		
		<enclosure url="" length="0" type="" />

		<post-id xmlns="com-wordpress:feed-additions:1">400919</post-id>	</item>
	</channel>
</rss>
