R-bloggers

From RUnit to testthat with Coding Agent Support

Mirai Solutions — Tue, 25 Nov 2025 00:00:00 +0000

[This article was first published on Mirai Solutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A case study of AI-supported maintenance in an R package

Migrating a Test Suite

Going from one testing framework to another in a mature R package can be tedious and still requires some understanding of the codebase’s overall structure and goals, as well as local test context. This makes it impossible to fully automate, which is why XLConnect, our open source Excel connector, still used RUnit until recently.

Coding tools based on Large Language Models (LLMs) have now become available. After positive experiences with these tools on tasks requiring detailed instructions and context, we decided to use them to support our test suite migration process to testthat.

This work impacts a large number of files and is quite repetitive. This makes it an ideal task for a category of tools known as coding agents, which aim to accomplish tasks spanning multiple files based on a single set of instructions and context (prompt). Coding agents are available in two types: running locally on the developer’s machine as a program, or running in a dedicated Virtual Machine (VM) — an asynchronous agent.

For this case study we used the async agent Google Jules. As of November 2025 it is powered by Gemini 2.5 Pro (soon the new Gemini 3 Pro); its free tier provides 15 sessions per day and lets you opt out of having your sessions used for model training. These features, plus a quick-to-provision isolated VM, make Jules a convenient way to experiment with coding agents. Async agents also reduce some risks of agent-assisted programming because they execute commands in a managed environment rather than directly on your local machine. A local agent requires either constant supervision or restricted permissions, limiting its usefulness, or a sandboxed environment you have to set up and maintain; otherwise, the agent could perform dangerous operations. A managed VM also reduces the risk of leaking sensitive information present on your machine to the internet via prompt injection, though any sensitive code or data given to the agent is still exposed.

Context Preparation

Coding agents have settled on using an AGENTS.md file, which contains instructions you wish tools to follow in general when working in your projects. We created this file in the root directory of our repository. You might start by adding development information found in your README.md or otherwise in your project documentation. It is also possible to make AGENTS.md a symlink to your README.md file, but the AGENTS.md file will likely need to become more detailed. You may also want to omit some README.md information as irrelevant for development.

Environment Setup

For Google Jules and other async agents, it will usually be necessary to customize their development environment. In Jules, this is currently done by a setup script that will need to do things like install R, your package’s dependencies, and other tools that would typically be required for development. The script runs after your repository is cloned, allowing you to install your package (for example, with R CMD INSTALL .) as part of the setup.

If one of the setup steps fails, or the setup times out, you may try moving some setup steps into AGENTS.md in a ## Setup section. This is less reliable but could avoid timing out the initial setup as each command is run separately by Jules.

The setup script for XLConnect in Jules, in the final version used for the present task, is the following:

rc="${HOME}/.Renviron"
line="R_LIBS_USER=$HOME/r-libs"
grep -qxF "$line" "$rc" || printf '\n%s\n' "$line" >> "$rc"
source $rc
mkdir -p $R_LIBS_USER
cp $rc .Renviron
curl -LsSf https://github.com/posit-dev/air/releases/latest/download/air-installer.sh | sh
sudo apt-get update && sudo apt-get install --no-install-recommends -y r-base libpcre2-dev libtirpc-dev libicu-dev
sudo R CMD javareconf
Rscript -e "install.packages(c('rJava'))"
Rscript -e "install.packages(c('RUnit', 'testthat'))"
R CMD INSTALL .

We use .Renviron to set the user R library directory, which avoids modifying preexisting .profile or .bashrc files and producing potentially unwanted side-effects.
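To check from within R that such a setting is actually picked up, something like the following can be used. This is a minimal sketch that works on a temporary file and directory instead of the real ~/.Renviron and ~/r-libs; readRenviron() and .libPaths() are base R functions, and all paths are placeholders.

```r
# Placeholder locations, used instead of ~/.Renviron and ~/r-libs
renviron <- tempfile("Renviron")
lib_dir <- file.path(tempdir(), "r-libs")

# Same kind of line the setup script appends to .Renviron
writeLines(paste0("R_LIBS_USER=", lib_dir), renviron)

# Load the file into the current session and inspect the variable
readRenviron(renviron)
print(Sys.getenv("R_LIBS_USER"))

# R_LIBS_USER is normally read at startup; in a session that is already
# running, the directory has to exist and be added to the search path by hand
dir.create(lib_dir, showWarnings = FALSE, recursive = TRUE)
.libPaths(c(lib_dir, .libPaths()))
print(.libPaths())
```

In a fresh R session started after the setup script has run, the first entry of .libPaths() should be the user library, provided the directory exists.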

Note that we would no longer need ‘RUnit’ in future tasks, as we are migrating away from it.

If you haven’t already, configure a code formatter for your project, along with instructions to apply it consistently. Otherwise, the code produced by the agent will be more difficult to review. We use Posit’s Air.

First Task and Iterating on Instructions

When running your first task, you may notice that Jules does not behave as you expected! It took some time for us to arrive at a combination of setup script and instructions that works well. We hope the present example can help you get to a working setup more quickly.

Based on an initial set of tests ported mechanically, we first simply asked Jules to fix the test code that was not running, given how to run the new test suite.

After submitting a request, Jules creates a plan and asks you to review it. If you do nothing within a short time (~2 minutes), Jules will start working based on its plan. It is a good idea to review the plan and provide feedback if necessary; this helps you formulate subsequent requests. In addition, Jules remembers general preferences based on your requests (this “Memories” feature can be disabled).

In some cases, Jules asks you to specifically clarify certain aspects. This is a chance to catch ambiguities and conceptual issues early.

The first change that Jules made was copying the test Excel files to a more convenient location, and simplifying the test code that reads them. Though this was a large change in terms of lines of code, it was not very complex; most of the time was spent by Jules in setting up its development environment.

Watching Jules work in this first task and subsequent ones helped us arrive at the environment setup described above. Reviewing the steps taken by Jules can reveal issues with the provided instructions or environment.

This can be the case even if the results are satisfactory: Jules will work around many issues and manage to achieve a goal. For example, it ran the following because package rJava had not been installed:

$ Rscript -e "install.packages(c('rJava'))"

This resulted in

  /usr/bin/ld: cannot find -lpcre2-8: No such file or directory
  /usr/bin/ld: cannot find -ltirpc: No such file or directory
  ERROR: compilation failed for package ‘rJava’

This was enough for Jules to install some missing packages:

$ sudo apt-get update && sudo apt-get install -y libpcre2-dev libtirpc-dev

It continued to try, and finally succeeded in installing rJava and further dependencies, solving other such issues along the way.

Another example: we were using export ... to set R_LIBS_USER in the setup script. This actually was not effective, but Jules simply started setting the variable for each command after this started failing:

$ Rscript -e "install.packages('testthat', repos = 'https://cloud.r-project.org')"

  ...
  Warning in install.packages(...) : 'lib = "/usr/local/lib/R/site-library"' is not writable
  ...

Jules simply started prefixing export R_LIBS_USER=... to each command.

Note that Jules now includes functionality to set environment variables explicitly. In many situations, Jules interprets instructions in unexpected ways or draws on its internal knowledge to find creative workarounds for problems.

Nevertheless, it is preferable to have a clean setup, as this will make tasks complete faster and more reliably, and make any issues that do arise quicker to resolve.

A Tedious but Simple Task: `expect_equal` Argument Order

As part of refactoring the test suite, we wanted to make sure the expected vs. actual values were passed in the correct order. This makes the test output more informative when failures occur: it is clear when comparing complex diffs which value was expected and which was actually produced. This is a specific example of a task that is relatively trivial but cannot be solved mechanically: the notions of expected and actual, though simple, are semantic — they cannot be inferred purely from syntax. We proceeded with the following prompt:

expect_equal’s first argument is the actual value under test, whereas the second argument is the expected value. Make sure these arguments are passed in the correct order. The expected value is typically a literal value or constructed in the test itself, whereas the actual value under test is typically the output of the function under test, or derived from that output.

This kind of prompt can work quite well, but there is a limitation due to the number of files involved. If you do not specify a scope in which Jules should perform modifications, it will often only perform the change in a few files. We add the following to the prompt:

Modify only the following test files under tests/testthat:

test.workbook.existsName.R

test.workbook.existsSheet.R

…

We also split the task into 3 groups of 20 files. Whether this is required or not will likely change depending on implementation details or even the current load on the Jules service. You could also create a single task with all the files. Jules frequently seeks confirmation on its progress during complex, multi-step tasks to ensure it’s on the right track.

This could also be a reason to split such tasks into multiple sessions if you are not close to using up your quota. This is now also easier to do with the CLI (Jules Tools) and the API.
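To make the argument-order convention concrete, here is a small sketch of a test written in the intended order. The function column_total() is hypothetical and defined only for this example; it is not part of XLConnect.

```r
library(testthat)

# Hypothetical function under test, defined here only for illustration
column_total <- function(df, col) sum(df[[col]])

test_that("column_total sums a column", {
  input <- data.frame(x = c(1, 2, 3))

  expected <- 6                       # constructed in the test itself
  actual <- column_total(input, "x")  # output of the function under test

  # Actual value first, expected value second
  expect_equal(actual, expected)
})
```

With the arguments in this order, a failure message labels the two values consistently, which is the point of the clean-up described above.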

Explicit Self-Review

After a number of iterations, we wanted to systematically investigate the fidelity of our ported test suite to the original in RUnit. To do so, we first asked Jules itself to perform a review with this goal in mind:

Check the testthat tests against the RUnit reference implementation according to the following plan:

List all files matching the patterns runit.<name>.R (RUnit) and test.<name>.R (testthat).

Pair files with matching <name>.

For each pair, compare their logic and note any discrepancies that could cause different test results.

1. Pairing RUnit and testthat Files

For each file in unitTests with the pattern runit.<name>.R, there is a corresponding file in testthat with the pattern test.<name>.R. For example:

runit.writeAndReadWorksheet.R test.writeAndReadWorksheet.R

runit.workbook.getReferenceCoordinates.R test.workbook.getReferenceCoordinates.R

…and so on for each <name>.

2. Comparison Process

For each pair:

Compare the logic and assertions.

Note any logical discrepancies that could cause the test results to differ.

Differences in test data or parameters.

Differences in assertion logic.

Missing or extra tests in one file.

Test coverage (are all scenarios tested in both?).

Assertion logic (do they check the same things, in the same way?).

Test data (are the same inputs used?).

Exception handling (are errors expected and checked the same way?).

3. Trigger differences

If any pairs seem to diverge in a significant way, try to make a small modification in the code under R/ to rerun the tests and see if you manage to trigger different behavior in the two test suites. In order for code changes to be effective, you must reinstall the source package. (See AGENTS.md for this and running tests).

If you can’t trigger these, still make a note of these in your final message. If you have a code change that triggers a behavior difference in the suites, commit that change individually.

4. Summary Table (Example)

Test Name	Logical Discrepancy?	Notes
`writeAndReadWorksheet`	No	Fully equivalent
`workbook.readNamedRegion`	Check	Ensure all `checkException`/`expect_error` and attribute checks match
`workbook.readWorksheet`	Check	Many error/edge cases—verify all are ported
…	…	…

The above prompt was generated with an interactive LLM tool (in this case GitHub Copilot), then slimmed down and edited manually. It is often useful to generate more detailed instructions based on your requirements; even if you end up not using them, they will often specify aspects of the task that were implicit or ambiguous in your initial problem statement.

In this task, Jules produced a response with many detailed notes regarding the differences between the RUnit and testthat implementations. Most of them concerned cosmetic differences, but some hinted at issues in the ported code where previous tasks had incorrectly simplified behavior. For example, in the case of workbook.readNamedRegion, the notes were:

Assertion styles differ. Differences in explicit attribute checking (worksheetScope), but core logic for numerous scenarios (data types, keep/drop, errors, cached values, bug fixes) is equivalent.

This was an understated hint to review the test.workbook.readNamedRegion.R file for inconsistencies, and to look in more detail at the differences between RUnit and testthat regarding attribute checks.

Note that point three in the prompt did not elicit usable output from Jules – it was not able to find differences between the test suites (this would be quite a difficult task in any case, but we wanted to see what would happen here).

We adapted some code manually for a few cases to understand how to fix the issue (see the documentation of waldo::compare for details). We then had Jules apply the solution across all test files:

in testthat tests, replace ignore_attr = TRUE, check.names = TRUE in expect_equals calls with a value of ignore_attr=c("worksheetScope"), which will result in ignoring the relevant attribute only….

In this case, Jules initially produced a “lazy” plan:

…

Modify the test file

I will run the tests in tests/testthat/test.writeAndReadNamedRegion.R to ensure that my changes are correct and have not introduced any regressions. …

We had to follow up to force it to widen its search:

Please thoroughly check the code for all calls. There are more files with this pattern. sometimes the arguments are not all on the same line.

This hints at the advantage of directly specifying the scope of a change in terms of target files or directories, when possible. This is now easier with the recently added file selector.
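The attribute handling behind this change can be tried out in isolation. The sketch below calls waldo::compare() directly, which is what expect_equal() forwards its extra arguments to in the third edition of testthat. The data frame and the attribute values are made up for illustration; only the attribute name worksheetScope comes from the text above.

```r
library(waldo)

expected <- data.frame(a = 1:3)
actual <- expected
attr(actual, "worksheetScope") <- "Sheet1"  # made-up value for illustration

# Strict comparison: the extra attribute counts as a difference
strict <- compare(actual, expected)
print(length(strict) > 0)

# Ignore only the named attribute; everything else is still compared
relaxed <- compare(actual, expected, ignore_attr = "worksheetScope")
print(length(relaxed) == 0)

# A difference in any other attribute is still reported
attr(actual, "someOtherAttribute") <- "x"   # hypothetical attribute
still_checked <- compare(actual, expected, ignore_attr = "worksheetScope")
print(length(still_checked) > 0)
```

Compared with ignore_attr = TRUE, naming the attribute keeps the check sensitive to every other attribute of the result.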

“Third-Party” Code Review

As an additional review step, we wanted to use a different model and agent to check the tests. To iterate over each file pair, we used a local agent: Aider. It is an open source program which can connect to most models via API. Here we used the Anthropic API with Claude 4 Sonnet to compare the RUnit and testthat scripts. We chose Aider based on familiarity and the ability to script it to automate the pairwise comparison, each pair of files being a distinct task, as well as to connect it to different model providers. The script gathers the file pairs then runs an Aider command for each of them. That command looks like the following:

    # ... inside a loop; we have defined paths to a RUnit test file ($unitfile), a testthat test file ($testfile), and a correspondingly named markdown file ($md).
    aider \
      --no-auto-commits \
      --message "Compare RUnit vs testthat for $name, summarise any differences in $md" \
      "$unitfile" "$testfile" "$md" || {
        echo "Warning: aider failed for $name, continuing..."
        continue
    }
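The file-pairing step that precedes this command could be sketched in R roughly as follows. This is not the script used for the migration: pair_test_files() is a hypothetical helper, and the demo runs on empty placeholder files in a temporary directory instead of the real repository layout.

```r
# Hypothetical helper: pair runit.<name>.R files with test.<name>.R files
pair_test_files <- function(runit_dir, testthat_dir) {
  runit_files <- list.files(runit_dir, pattern = "^runit\\..*\\.R$")
  # Derive the shared name, e.g. "runit.foo.R" -> "foo"
  name <- sub("^runit\\.(.*)\\.R$", "\\1", runit_files)
  pairs <- data.frame(
    name = name,
    unitfile = file.path(runit_dir, runit_files),
    testfile = file.path(testthat_dir, sprintf("test.%s.R", name)),
    stringsAsFactors = FALSE
  )
  # Flag RUnit files that have no testthat counterpart yet
  pairs$has_testthat <- file.exists(pairs$testfile)
  pairs
}

# Demo on placeholder files in a temporary directory
root <- tempfile("pairs_demo")
runit_dir <- file.path(root, "unitTests")
testthat_dir <- file.path(root, "tests", "testthat")
dir.create(runit_dir, recursive = TRUE)
dir.create(testthat_dir, recursive = TRUE)
file.create(file.path(
  runit_dir,
  c("runit.writeAndReadWorksheet.R", "runit.workbook.existsName.R")
))
file.create(file.path(testthat_dir, "test.writeAndReadWorksheet.R"))

pairs <- pair_test_files(runit_dir, testthat_dir)
print(pairs[, c("name", "has_testthat")])
```

Each row of the resulting data frame corresponds to one comparison task, and rows without a testthat counterpart point to tests that were never ported.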

We then summarized the findings, again with Aider, using the prompt

these files contain many duplicate comments regarding RUnit tests being migrated to testthat. Summarize the changes, making a single entry for each point, and avoid repeating the same point when it is made across multiple files.

This left us with a handful of issues, few enough to investigate and fix manually if necessary.

Conclusion

After a final “classic” coverage check, we can now be confident that the migrated testthat suite is faithful to the semantics and scope of the original RUnit suite! Jules and other tools were important to achieve this task. The experience so far shows that experimentation and detailed review of outputs are very much needed to get useful results out of AI-enabled tools. Trying multiple tools, sometimes in combination, is a good way to find out about possible ways that they can support your workflow. It is also important to remember that things are evolving fast — Jules itself sees significant improvements and additions on a monthly basis, and has already evolved since the work described in the present post.

Toponymy

Michael — Mon, 24 Nov 2025 21:00:00 +0000

[This article was first published on r.iresmi.net, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


Latin text on black marble – CC BY-NC-SA by Francisco Gonzalez

Day 24 of 30DayMapChallenge: « Places and their names » (previously).

BDTOPO is the most detailed topographic GIS database for France, made by IGN under a free licence. It includes names. It’s massive (40 GB) so take a break while downloading…

Config

library(sf)
library(ggplot2)
library(glue)
library(ggspatial)
library(dplyr)
library(stringr)
library(purrr)
library(httr)

Data

Basemap

# See https://r.iresmi.net/posts/2021/simplifying_polygons_layers/
dep <- read_sf("~/data/adminexpress/adminexpress-cog-simpl-000-2024.gpkg",
               layer = "departement") |> 
  filter(insee_dep < "971")

Download and extract

url_bdtopo <- glue("https://data.geopf.fr/telechargement/download/BDTOPO/\\
               BDTOPO_3-5_TOUSTHEMES_GPKG_WGS84G_FRA_2025-09-15/\\
               BDTOPO_3-5_TOUSTHEMES_GPKG_WGS84G_FRA_2025-09-15.7z.0")

glue("{url_bdtopo}{sprintf('%02i', 1:10)}") |> 
  walk(\(x) RETRY("GET", x, 
                  write_disk(basename(x), 
                             overwrite = TRUE)),
       .progress = TRUE)

system("7z e BDTOPO_3-5_TOUSTHEMES_GPKG_WGS84G_FRA_2025-09-15.7z.001 -r toponymie.gpkg")
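As a small sketch of how the numbered part URLs above are assembled, here is the same pattern with a made-up base URL (example.org is a placeholder, not the IGN server) and base R only:

```r
# Placeholder base URL ending in ".7z.0", mirroring the pattern used above
base_url <- "https://example.org/archive.7z.0"

# sprintf("%02i", ...) zero-pads each index to two digits: "01", "02", ..., "10"
part_urls <- paste0(base_url, sprintf("%02i", 1:10))

# File names of the ten parts, from "archive.7z.001" to "archive.7z.010"
print(basename(part_urls))
```

The extraction command is then pointed at the first part, as in the 7z call above.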

Get suffix

More information (in French)

toponymy <- read_sf("toponymie.gpkg",
                    query = r"[
                    SELECT * 
                    FROM toponymie 
                    WHERE classe_de_l_objet IN ("Lieu-dit non habité", 
                    "Zone d'habitation")
                    ]") |> 
  st_filter(dep)

suffix <- toponymy |> 
  mutate(suffix = case_when(
    str_detect(graphie_du_toponyme, "ac$") ~ "-acum → -ac",
    str_detect(graphie_du_toponyme, "(ey|ay)$") ~ "-acum → -ey -ay",
    str_detect(graphie_du_toponyme, "illy$") ~ "-acum → -illy",
    str_detect(graphie_du_toponyme, "at$") ~ "-acum → -at")) |> 
  filter(!is.na(suffix))
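The classification rule can be tried on a handful of names without downloading the database. The sketch below reproduces the same regular expressions in base R. The helper classify_suffix() and the sample names are illustrative assumptions, and matching a pattern says nothing certain about the actual etymology of a given place.

```r
# Same patterns and labels as in the case_when() call above
patterns <- c(
  "-acum → -ac"     = "ac$",
  "-acum → -ey -ay" = "(ey|ay)$",
  "-acum → -illy"   = "illy$",
  "-acum → -at"     = "at$"
)

# Hypothetical helper: first matching pattern wins, NA when nothing matches
classify_suffix <- function(name) {
  out <- rep(NA_character_, length(name))
  for (label in names(patterns)) {
    hit <- is.na(out) & grepl(patterns[[label]], name)
    out[hit] <- label
  }
  out
}

# Sample names chosen only to exercise each pattern
places <- c("Cognac", "Épernay", "Pouilly", "Royat", "Paris")
print(data.frame(place = places, suffix = classify_suffix(places)))
```

Names that match none of the patterns get NA, which is what the filter(!is.na(suffix)) step above removes.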

Map

suffix |> 
  bind_cols(st_coordinates(suffix)) |> 
  ggplot() +
  geom_sf(data = dep) +
  geom_sf(aes(color = suffix), size = .1, alpha = .2) +
  facet_wrap(~ suffix, ncol = 1) +
  coord_sf() +
  labs(title = "French localities suffix",
       subtitle = "from gallo-roman -acum",
       caption = glue("data: IGN BDTOPO 2025
                      https://r.iresmi.net - {Sys.Date()}")) +
  theme_void() +
  theme(plot.caption = element_text(size = 7, color = "grey40"),
        plot.margin = unit(c(.2, .2, .2, .2), units = "cm"),
        legend.position = "none")

Figure 1: French suffix

To leave a comment for the author, please follow the link and comment on their blog: r.iresmi.net.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.


Bioconductor in Africa: Ethiopia’s First In-Person Course

Maria Doyle — Mon, 24 Nov 2025 00:00:00 +0000

[This article was first published on Bioconductor community blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


Introduction

Following the success of our first in-person Bioconductor course in Nairobi earlier this year, we continued building momentum across the continent with Ethiopia’s first Bioconductor workshop, held in Addis Ababa from 25–29 August 2025. Hosted by the Bio and Emerging Technology Institute (BETin) in partnership with the International Institute of Tropical Agriculture (IITA) and the University of Limerick, the event brought together researchers, students, and educators to support and grow bioinformatics capacity in Ethiopia.

What we taught

A five-day, hands-on programme covering:

R for data management, manipulation, visualisation, and reproducible analysis
Bioconductor core data structures (e.g. SummarizedExperiment)
Exploratory data analysis and quality control
Differential expression with DESeq2
Gene set enrichment analysis

Agenda and instructor list are on the workshop page.

We welcomed 26 participants from universities, national research institutes, and biotech groups across Ethiopia, reflecting a strong mix of MSc/PhD students, lecturers, and researchers working in agriculture, public health, genomics, and AI-driven biomedical research. The cohort was selected from more than 170 applicants – a clear sign of the growing demand for hands-on bioinformatics training.

Learning & Community

The workshop opened with a warm welcome from BETin leadership, including Dr Zewdu Edea and Dr Hailu Dadi, with strong support from Prof Kassahun Tesfaye and Dr Helen Nigussie from Addis Ababa University. Sessions were interactive and collaborative as participants worked through exercises together.

From left to right: Dr Yohannes Gedamu Gebre, Niguse Kelile Lema, Dr Helen Nigussie, Trushar Shah, and Dr Maria Doyle teaching at the Ethiopia course.

Participants at the Bioconductor Ethiopia course

Participant voices

Impact snapshot:
From the post-course survey (n=24):

100% would recommend the workshop
92% rated the course “Excellent” or “Very good”
Average self-reported improvement in R skills: 4.3/5
23 out of 24 respondents expressed interest in helping grow the Bioconductor Africa community

Participant feedback summary.

“It’s very hands-on and engaging… The instructors were very experienced, knowledgeable, and approachable. It was a great learning experience!”

“There were enough supporters for the trainees; those who were rounding and supporting when we got stuck were a wonderful approach. The training shoots the target from my side.”

“I would appreciate it if the time for the training were increased.”

These sentiments echoed the hosts’ reflections:

“Participants were thrilled to have been given the chance and vowed to utilise the new skills they gained in their research and future careers.” — Dr Zewdu Edea (BETin), LinkedIn

“The workshop highlights the importance of international collaboration and knowledge exchange in advancing research and training, particularly in the African context.” — Niguse Kelile Lema (BETin), LinkedIn

Encouragingly, many respondents expressed interest in helping grow the Bioconductor Africa community through teaching, webinars, or community sessions – a strong signal for sustainable, local leadership.

Lessons learned

A key part of building a sustainable training programme is listening to feedback. A few themes stood out:

More time for deeper dives: Participants were keen for extra time to practise and explore. While extending the main workshop isn’t always possible, we’re exploring optional drop-in coding sessions or “hacky hours” after future events so participants can spend more time working through exercises with instructor support.
Clearer signposting of pre-workshop reading: Although we shared pre-reading and background material, some participants noted they would have benefited from more emphasis on the publication describing the experimental dataset. We’ll highlight these resources more clearly in future courses.
Mapping future topics: Feedback included a range of topics people would like to see next – including GWAS, single-cell analysis, and multi-omics integration. This helps us map out future workshops and tailor training to local research priorities.

Highlights

A short certificate ceremony on the final day was a lovely way to wrap up the week. BETin presented each participant with a certificate recognising their effort and commitment throughout the course. Participants were delighted to receive them, as reflected in the smiles during the ceremony.

Participants receiving their course certificates.

Another note from the week is that two of the local instructors, Yohannes and Niguse, were completing their Carpentries Instructor Certification, and this workshop served as their final teaching checkout.

Yohannes’ story of how he and Niguse became certified Carpentries instructors. LinkedIn

Collaborators & Acknowledgements

This workshop was hosted and funded by the Bio and Emerging Technology Institute (BETin). We are deeply grateful for their leadership and commitment to strengthening bioinformatics capacity in Ethiopia.

The event was co-organised with the International Institute of Tropical Agriculture (IITA) and the University of Limerick, with additional support from Bioconductor (CZI EOSS) and UL Global/Erasmus+.

Our instructor team: Dr Yohannes Gedamu Gebre (BETin), Niguse Kelile Lema (BETin), Dr Helen Nigussie (AAU), Trushar Shah (IITA), Dr Maria Doyle (UL).

We thank the Ministry of Innovation and Technology and Innobiz-K Ethiopia Incubation Center for generously providing their facilities for the training.

What’s next?

Our next workshop took place in West Africa, in Benin, from 17–21 November 2025. If your institute is interested in co‑hosting or contributing to future workshops, we’d love to hear from you.

Get involved

Workshop page
Materials
About Bioconductor training
Join the Bioconductor Africa mailing list or #bioc_africa channel in Bioconductor Chat

To leave a comment for the author, please follow the link and comment on their blog: Bioconductor community blog.


Twenty Questions and Decision Trees

Jerry Tuttle — Sat, 22 Nov 2025 22:05:00 +0000

[This article was first published on Online College Math Teacher, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


Most of us have probably played the game Twenty Questions. The answerer chooses something, and the other players try to guess it by asking yes or no questions. The TV show “What’s My Line” is an example of this where the players try to guess a contestant’s occupation.

Contrary to my kids’ attempts, the strategy should be to ask questions that split the remaining possibilities roughly in half each time.
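As a rough illustration of why halving works: each perfectly balanced yes/no question cuts the candidate set in half, so about log2(n) questions are enough to single out one of n equally likely candidates. The short sketch below works through that arithmetic. The figure of 47 candidates is only an example, and questions_needed() is a name made up here.

```r
# Number of yes/no questions needed if every question halves the candidates
questions_needed <- function(n_candidates) ceiling(log2(n_candidates))

questions_needed(47)   # 6, since 2^5 = 32 < 47 <= 64 = 2^6

# Conversely, twenty perfectly balanced questions can separate 2^20 candidates
2^20                   # 1048576
```

Real questions are rarely perfectly balanced, so this is a lower bound on the number of questions, not a guarantee.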

This seems very similar to a machine learning Decision Tree, although with an interesting distinction.

A decision tree cheats. The decision tree algorithm knows the answer (df$target = 1). The algorithm attempts to find the best feature and split value to separate df$target = 1 from df$target = 0 at each node, but it needs to know the right answer to ask the best questions. This is why, if the game is played multiple times with different US Presidents, the algorithm may choose different features and split values.

Nevertheless, I thought it would be fun to program a decision tree model with the US Presidents. I found some data on Presidents. I decided some variables had too many values (high cardinality – there were a lot of political party names in the 1800s), so I grouped some values to reduce the number of unique values.

I initially began with a random integer between 1 and 47 to select a President, which selected President Hoover, but I found a different President would create a tree closer to the questions I would have asked if I were a player. So I selected President Reagan to get a more interesting tree.

(I considered selecting President Garfield to be able to ask the question, “Is the president credited with a unique proof of the Pythagorean Theorem?”, but I decided that was a little quirky, even for me.)

Here is the resulting tree for President Reagan:

Here is the resulting variable importance plot. Note that the variables are not in the same order as the tree splits. I understand that variable importance is based on the sum of the improvements in all nodes where the variable was used as a splitter.
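For readers who want to see where these numbers live in a fitted model, here is a minimal sketch using the kyphosis data set that ships with rpart, not the Presidents data. It only shows how to read the importance vector and the root split from a fitted tree; it is not the code behind the plots in this post.

```r
library(rpart)

# Example data bundled with rpart, used here purely for illustration
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Named numeric vector: one importance score per variable
importance <- fit$variable.importance
print(importance)

# Variable used for the first (root) split, for comparison with the ranking
root_split <- as.character(fit$frame$var[1])
print(root_split)
```

According to the rpart documentation, a variable's importance also includes a weighted contribution from splits where it acts as a surrogate, which is one reason the ranking need not follow the order of the splits in the tree.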

Here is my R code:


library(dplyr)
library(rpart)
library(rpart.plot)
library(ggplot2)
df <- read.csv("prez.csv", header=TRUE)   
# data file available at github: prez.csv
set.seed(123)
# r <- sample(1:nrow(df),1)
r <- 40   # deliberate choice to get longer tree
answer <- df$LastName[r]
print(paste("The target president is:", answer))
df$target <- rep(0, nrow(df))
df$target[r] <- 1

# Feature engineering:
df <- df %>%
  # A. Categorical Reduction
  mutate(
    Party = case_when(
        Party %in% c("Democratic") ~ "Democratic", 
        Party %in% c("Republican") ~ "Republican", 
        TRUE ~ "Other"),
    Occupation = case_when(
        Occupation %in% c("Businessman", "Lawyer") ~ Occupation, 
        TRUE ~ "Other"),
    State = case_when(
        State %in% c("New York") ~ "NY", 
        State %in% c("Ohio") ~ "OH", 
        State %in% c("Virginia") ~ "VA", 
        State %in% c("Massachusetts") ~ "MA", 
        State %in% c("Texas") ~ "TX", TRUE ~ "Other"),
    Religion = case_when(
        Religion %in% c("Episcopalian", "Presbyterian", "Unitarian", "Baptist", "Methodist") ~ "Main_Prot", 
        TRUE ~ "Other"),
    
    # B. Year/Century Binning using cut()
    DOB = cut(DOB, breaks = c(-Inf, 1800, 1900, 2000, Inf), 
        labels = c("18th century", "19th century", "20th century", "21st century"), right = FALSE),
    DOD = cut(DOD, breaks = c(-Inf, 1800, 1900, 2000, Inf), 
        labels = c("18th century", "19th century", "20th century", "21st century"), right = FALSE),
    Began = cut(Began, breaks = c(-Inf, 1800, 1900, 2000, Inf), 
        labels = c("18th century", "19th century", "20th century", "21st century"), right = FALSE),
    Ended = cut(Ended, breaks = c(-Inf, 1800, 1900, 2000, Inf), 
        labels = c("18th century", "19th century", "20th century", "21st century"), right = FALSE)
  ) %>%

  # C. Convert all new/existing binary/categorical variables to factor
  mutate_at(vars(Party, State, Occupation, Religion, Assassinated, Military, Terms_GT_1, Pres_During_War, Was_Veep, DOB, DOD, Began, Ended), as.factor)


# selected variables 
formula <- as.formula(target ~ Began + State + Party + Occupation + Pres_During_War)
# Using aggressive control settings to force a maximal, unpruned tree
prez_tree <- rpart(formula, data = df, method = "class",
                   control = rpart.control(cp = 0.001, minsplit = 2, minbucket = 1))
rpart.plot(prez_tree, type = 4, extra = 101, main = "President Twenty Questions")

# check Reagan
df %>% filter(Began == "20th century" & 
              !State %in% c("MA", "NY", "OH", "TX") &
              Party == "Republican" &
              !Occupation %in% c( "Businessman", "Lawyer"))

variable_importance <- prez_tree$variable.importance
importance_df <- data.frame(
  Variable = names(variable_importance),
  Importance = variable_importance
)

importance_df <- importance_df[order(importance_df$Importance, decreasing = TRUE), ]

common_theme <- theme(
        legend.position = "none",
        plot.title = element_text(size=15, face="bold"),
        plot.subtitle = element_text(size=12.5, face="bold"),
        axis.title = element_text(size=15, face="bold"),
        axis.text = element_text(size=15, face="bold"),
        legend.title = element_text(size=15, face="bold"),
        legend.text = element_text(size=15, face="bold"))


ggplot(importance_df, aes(x = factor(Variable, levels = rev(Variable)), y = Importance)) +
  geom_col(aes(fill = Variable)) + 
  coord_flip() +
  labs(title = "20 Questions Variable Importance",
       x = "Variable",
       y = "Importance") +
  common_theme

# loop through all presidents to see different first split vars
library(purrr)

### 1. Define the Analysis Function ###
# The function is modified to return a data frame row for clarity
get_first_split_row <- function(df, r) {
  # Temporarily set the target for the current president
  df$target <- 0
  df$target[r] <- 1
  tree <- rpart(formula, data = df, method = "class",
                control = rpart.control(cp = 0.001, minsplit = 2, minbucket = 1))
  frame <- tree$frame
  
  # Determine the result
  if (nrow(frame) > 1) {
    first_split_var <- as.character(frame$var[1])
  } else {
    first_split_var <- "No split"
  }
  
  # Return a single-row data frame
  return(data.frame(
    President = df$LastName[r],
    First_Split_Variable = first_split_var
  ))
}

### 2. Run the Analysis and Combine Results ###
# Create a list of row indices to iterate over
indices_to_run <- seq_len(nrow(df))

# Use map_dfr to apply the function to every index and combine the results 
# into a single data frame (dfr = data frame row bind)
first_split_results_df <- map_dfr(indices_to_run, ~ get_first_split_row(df, .x))

### 3. Display the Table and Original Analysis ###
# Display the resulting table:
print(first_split_results_df)

print(table(first_split_results_df$First_Split_Variable))

End

To leave a comment for the author, please follow the link and comment on their blog: Online College Math Teacher.

Data Science Quiz For Humanities

Ponne, Bruno — Sat, 22 Nov 2025 00:00:00 +0000

[This article was first published on coding-the-past, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Test your skills with this interactive data science quiz covering statistics, Python, R, and data analysis.

To leave a comment for the author, please follow the link and comment on their blog: coding-the-past.

Flags

Michael — Fri, 21 Nov 2025 23:00:00 +0000

[This article was first published on r.iresmi.net, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Colors of Youth !! – CC BY-NC-ND by Rahul Sheel

Day 21-22 of 30DayMapChallenge: « Icons » & « Natural Earth » (previously).

The world as flags…

Data

# install.packages("grImport2")
# install.packages("ggflags", 
#                  repos = c("https://jimjam-slam.r-universe.dev"))
library(ggflags)
library(rnaturalearth)
library(sf)
library(ggplot2)
library(glue)
library(ggspatial)
library(dplyr)
library(stringr)

world <- ne_countries() |> 
  st_transform("EPSG:8857")

world_points <- world |> 
  group_by(flag = str_to_lower(iso_a2_eh)) |> 
  st_cast("POLYGON") |> 
  mutate(surf = st_area(geometry)) |> 
  filter(flag != "-99") |> 
  slice_max(surf, with_ties = FALSE) |> 
  ungroup()|> 
  select(flag, sovereignt) |> 
  st_point_on_surface()

Map

world_points |> 
  bind_cols(st_coordinates(world_points)) |> 
  ggplot() +
  geom_sf(data = world, color = "snow2", fill = "snow2") +
  geom_flag(aes(X, Y, country = flag), size = 4) +
  labs(title = "World Countries",
       subtitle = "by flags",
       caption = glue("data: Natural Earth - Flags: EmojiOne
                      {st_crs(world_points)$Name}
                      https://r.iresmi.net - {Sys.Date()}")) +
  theme_void() +
  theme(plot.caption = element_text(size = 7, color = "grey40"),
        plot.margin = unit(c(.2, .2, .2, .2), units = "cm"))

Figure 1: Flags of the world

To leave a comment for the author, please follow the link and comment on their blog: r.iresmi.net.

(ICYMI) RPweave: Unified R + Python + LaTeX System using uv

T. Moudiki — Fri, 21 Nov 2025 00:00:00 +0000

[This article was first published on T. Moudiki's Webpage - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

ICYMI: RPweave — Unified R + Python + LaTeX Workflow (Powered by uv)

If you juggle R, Python, and LaTeX for research, you know the pain: fragmented scripts, mixed environments, manual copying, and fragile reproducibility.

I needed a setup/workflow that handles both languages, any LaTeX template, environment isolation, and a command-line–first workflow — so I assembled RPweave. Not a new idea, but a polished, modern take that just works.

RPweave (GitHub template here: https://github.com/Techtonique/RPweave) ties everything together using:

knitr for R + Python chunks
reticulate for seamless Python integration
LaTeX for publication-ready typesetting
uv for fast, reproducible Python environments
Makefile automation for building and previewing
A ready-to-clone Git template

Getting Started

The workflow starts by cloning the RPweave template repo, and listing the Python packages you need in requirements.txt. Then, set up the isolated environment and install R dependencies via make setup. Finally, write your .Rnw document mixing R and Python chunks, and build with make view.

git clone https://github.com/Techtonique/RPweave my-paper
cd my-paper
uv venv venv && source venv/bin/activate
make setup
make view

Why It Matters

RPweave lets you:

Run R and Python in the same .Rnw document at the command line
Easily share objects across languages
Use any LaTeX template or academic style
Keep everything reproducible with isolated environments
Build your PDF with a single command (make view)

Minimal Example

The first chunk is mandatory.

<<>>=
library(knitr); library(reticulate)
use_python("venv/bin/python")
@

<<>>=
library(ggplot2)
ggplot(mtcars, aes(wt, mpg)) + geom_point()
@

<<engine='python'>>=
import pandas as pd
print(pd.DataFrame({'x': range(100)}).describe())
@

Ideal For

Papers mixing R stats + Python ML
Projects needing clean LaTeX output
Reproducible workflows with pinned dependencies
Researchers tired of context-switching between RStudio, Jupyter, and LaTeX editors

Pro Tips

Use make view as your main loop — instant rebuild + preview
Store long chunks in chunks/
Keep generated files out of Git
Pass data between R and Python via reticulate’s py$object (from R) and r.object (from Python) for smooth cross-language flows

Repo & Docs

RPweave: https://github.com/Techtonique/RPweave
uv: https://docs.astral.sh/uv/
knitr: https://yihui.org/knitr/

To leave a comment for the author, please follow the link and comment on their blog: T. Moudiki's Webpage - R.

Should I Use Figma Design for Dashboard Prototyping?

The Jumping Rivers Blog — Thu, 20 Nov 2025 23:59:00 +0000

[This article was first published on The Jumping Rivers Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Heard of Figma but not sure what it is? Seen Figma but not sure if it’s worth learning? Never seen or heard of Figma? If the answer to any of these questions is “Yes” then this blog post is for you.

What are Figma and Figma Design?

This is a simple question with a somewhat complex answer, not least because there are multiple products falling under the Figma umbrella, made by developers at the company Figma, Inc (often shortened to Figma). At the time of writing, these products are listed on Wikipedia as:

Figma
FigJam
Figma Slides
Figma Sites
Figma Make
Figma Buzz
Figma Draw

You’ll see the first of these products is listed simply as “Figma”. This is the original Figma product that’s been around since the mid-2010s. (By contrast, the last four products listed were all launched in 2025.) However, because of the existence of these other Figma products, Figma, Inc has now started to refer to the original product as “Figma Design” (or in some places just “Design”). I think this naming is slowly being adopted in general, but you will still find plenty of references to Figma that mean what Figma, Inc now calls Figma Design.

Screenshot of new-file options for a logged-in user at figma.com

So What is Figma Design?

According to Figma, Inc:

Figma Design is for people to create, share, and test designs for websites, mobile apps, and other digital products and experiences. It is a popular tool for designers, product managers, writers and developers and helps anyone involved in the design process contribute, give feedback, and make better decisions, faster.

I’d simplify that to:

Figma Design is cloud-based collaborative software that allows users to create wireframes, high-fidelity mock-ups and working prototypes of websites and mobile applications.

This doesn’t cover everything Figma Design can do or be used for, as I’ll come on to, but I think it covers the main reasons you’d choose to learn Figma Design over other design software.

What Can I Use Figma Design For?

As implied in the previous section, the core offering of Figma Design (in my view, at least) is the ability to quickly make wireframes, high-fidelity designs and interactive prototypes. These can be really helpful when building a complex dashboard.

Example of a wireframe of the top of the Jumping Rivers home page, built with Figma Design.

Screen recording of an interactive prototype built with Figma Design. (The first click at the start of the video is just to move focus to the prototype window. Subsequent clicks are interactions within the prototype.)

I’ve used Figma Design for a number of other things, including:

simple vector art
flow diagrams
annotating screenshots
promotional literature intended for print
very basic image editing

Figma Design is not the best tool available for any of these tasks. But if it is available to you, you know how to use it and it does the job to a satisfactory level, then it could be the most convenient tool you have at your disposal.

Example of a (joke) flow chart, built with Figma Design.

Is Figma Design Free?

Like a lot of (most?) cloud-based software tools, Figma Design (and the rest of the Figma products) is freemium software. What is and isn’t available on the free tier is liable to change so everything that follows in this section should be assumed to be caveated with “at the time of writing”.

While you can certainly use Figma Design for free – and, I think, learn how to use most of its tools – the answer to whether you can use it as desired without paying is, unsurprisingly, “it depends”. If you’re part of a team, the free tier strictly limits the number of collaborative files that can be created and your ability to create shared libraries. If you’re working independently these things may not be much of an issue, but you won’t have access to some other features available in paid tiers like Dev Mode and video imports.

What Tools Does Figma Design Give Me?

The things you’ll use most in Figma Design, alongside the ubiquitous Move tool, are almost certainly the Frame tool and the Text tool. These may not sound very exciting but you can get a long way using only these. Much of their power comes from the ability to finely customise the look of frames (essentially containers for stuff) and text and to build complex layouts by combining and nesting items you’ve created. Frames can also be filled with images, so while there is a separate Image/video tool, you don’t actually need to use it to create your high-fidelity mockups. This is illustrated below, where the top of the Jumping Rivers home page has been recreated using only the Frame and Text tools.

High-fidelity mockup of the top of the Jumping Rivers home page. The only tools from the Figma Design toolbar used to create this were the Frame and Text tools (plus the Move tool).

There are various other tools available associated with vector drawing – Line, Rectangle, Ellipse, Pen – as well as sectioning and commenting.

Screenshot of Figma Design’s toolbar with the vector-drawing submenu open.

Conceptual Tools

Alongside the literal (in a digital sense) tools described above, Figma Design gives you the tools (in a broad, conceptual sense) to perform a number of useful tasks.

The most significant of these conceptual tools is the ability to create interactive prototypes. In brief, you can select an item in your design, connect it to another item and then define one or more interactions. This is simple in principle and fairly simple in practice to start with. For complex designs with many interactions I find it quickly becomes messy: Figma, Inc calls the visual depictions of connections you create between elements “noodles”, and I find this apt, as it’s quite easy to end up with a sort of noodle soup that’s hard to decipher. Nevertheless, the tools are there and, for simple designs, it’s quick to set up and run a working prototype.

Screenshot of a simple four-tab dashboard design. The curved arrows (“noodles”) show interactions the user can do: e.g. click on one of the “Flight Delay” buttons to go to the second (top-right) view. Even for this fairly simple prototype, the interlocking noodle pattern can be quite hard to decipher.

Paying users can use the tools made available in Dev Mode. Because it’s not part of the free tier I won’t go into details here, but in brief it’s a suite of tools that should make it easier to convert design files into code.

It’s also easy to export arbitrary parts of a design as JPEG, PNG, SVG or PDF. There’s no native app support for WebP or AVIF export yet, but there are community plugins that offer these.

So, Should I Design My Dashboard with Figma Design?

That is, of course, up to you. If your dashboard is fairly simple and you’re working on your own, it may be easier to just go straight out and build version 1 of your dashboard with your favourite dashboard-building tool. If you’re proficient with a library like Shiny or Dash this can be pretty quick. However, if you’re part of a team building a complex app, Figma Design may make the initial stages of development easier. And, if you want to user-test with simple interactive prototypes then it’s definitely an option worth considering.

For updates and revisions to this article, see the original post

To leave a comment for the author, please follow the link and comment on their blog: The Jumping Rivers Blog.

Announcing AI in Production 2026: A New Conference for AI and ML Practitioners

The Jumping Rivers Blog — Wed, 19 Nov 2025 23:59:00 +0000

[This article was first published on The Jumping Rivers Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Registration is now open for our first AI in Production conference, taking place on 4 and 5 June 2026 in Newcastle Upon Tyne.

AI in Production is for people who want to see how AI works in day-to-day environments. The event brings together data scientists, engineers, analysts, researchers, and anyone who wants to learn from real projects rather than theory.

What to expect

The programme is split into two streams so you can follow what is most relevant to your work.

Engineering Stream
Covers deployment, monitoring, scaling, infrastructure, and what it takes to keep AI systems running.

Machine Learning Stream
Covers model development, evaluation, responsible use of data, and lessons from applied ML work across different industries.

Across both days you will hear open discussions about what teams tried, what worked, what failed, and what they learned along the way.

Workshops on Thursday 4 June

The conference opens with a day of hands-on workshops delivered by the Jumping Rivers team. These sessions guide you through practical tasks and give you time to ask questions as you go.

All tickets include entry to a relaxed drinks reception from 17:00 to 19:30.

Conference day on Friday 5 June

Talks begin at 09:30 and continue until around 16:15. You can move between the two streams or stay with one focus for the day.

Call for speakers

If you would like to speak at AI in Production 2026, we would love to hear from you!

We welcome both new and experienced speakers. You’ll need to submit:

A talk title
A short abstract (maximum 250 characters)
Your preferred talk format
- Lightning talk (around 6 minutes)
- Standard talk (around 25 minutes)
Whether you are happy for your talk to be recorded
A link to a page that represents you
(personal site, LinkedIn, GitHub or GitLab, Twitter, Mastodon etc.)

The submission deadline is 23 January 2026. Submit your abstract.

Key dates

9 January: Super early bird deadline
23 January: Abstract submission deadline
6 March: Early bird deadline
28 May: General registration deadline
4 June: Conference begins

Speakers

We are also excited to share our first confirmed speakers.

Mac Misiura, Red Hat
George Stagg, Posit Software

More speakers will be announced soon. If you’d like to be one of them, you can submit your abstract today.

Tickets

You can choose a ticket for the conference only or a combined ticket that includes one workshop. Learn more and register for the conference.

Planning your visit

The Catalyst is a short walk from Newcastle Central Station, with regular trains from Edinburgh and London. Newcastle International Airport is around thirty minutes away by Metro.

Sponsorship

If your organisation would like to support the conference, email events@jumpingrivers.com.

We look forward to welcoming you to Newcastle for two days of focused sessions, open conversations, and practical insight into running AI systems in real settings!

For updates and revisions to this article, see the original post

To leave a comment for the author, please follow the link and comment on their blog: The Jumping Rivers Blog.

Acidification

Michael — Wed, 19 Nov 2025 23:00:00 +0000

[This article was first published on r.iresmi.net, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Coral bleaching at Heron Island – CC BY by The Ocean Agency / XL Catlin Seaview Survey / Richard Vevers

Day 20 of 30DayMapChallenge: « Water » (previously).

Global ocean acidification mean sea water pH trend map from Multi-Observations Reprocessing from Copernicus.

Data

library(sf)
library(ggplot2)
library(rnaturalearth)
library(glue)
library(terra)
library(ggspatial)

eqearth <- "EPSG:8857"

world <- ne_countries() |> 
  st_transform(eqearth)

mask <- c(xmin = -179, ymin = -89, xmax = 179,  ymax = 89) |> 
  st_bbox() |> 
  st_as_sfc() |> 
  st_set_crs("EPSG:4326") |> 
  st_sf() |> 
  st_segmentize(100) |> 
  st_transform(eqearth) 

acid_trend <- "global_omi_health_carbon_ph_trend_1985_P20230930.nc" |> 
  rast() |> 
  rotate() |> 
  project(eqearth) |> 
  mask(mask)

Map

world |>
  ggplot() +
  layer_spatial(data = acid_trend,
                aes(fill = after_stat(band1))) +
  geom_sf(data = mask) +
  geom_sf(color = "grey", fill = "white") +
  scale_fill_viridis_c(name = bquote(Delta*pH~yr^-1), 
                       direction = -1, 
                       na.value = "white") +
  labs(title = "Global ocean acidification",
       subtitle = "mean sea water pH trend",
       caption = glue("data: Copernicus / LSCE doi:10.48670/moi-00277
                      Natural Earth - {st_crs(eqearth)$Name}
                      https://r.iresmi.net - {Sys.Date()}")) +
  theme_void() +
  theme(plot.caption = element_text(size = 7, color = "grey40"),
        plot.margin = unit(c(.2, .2, .2, .2), units = "cm"),
        legend.position = "bottom",
        legend.text = element_text(angle = 45, vjust = 1, hjust = 1))

Figure 1: Global ocean acidification – mean sea water pH trend

To leave a comment for the author, please follow the link and comment on their blog: r.iresmi.net.

EPSG:3035

Michael — Tue, 18 Nov 2025 23:00:00 +0000

[This article was first published on r.iresmi.net, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Michalská brána (second take) – CC BY-SA by Alexandre Duret-Lutz

Day 19 of 30DayMapChallenge: « Projections » (previously).

EPSG:3035 is a Lambert azimuthal equal-area projection used for mapping Europe at medium scale. Because it preserves area, it is widely used in European statistics and is the basis for a grid system.

We push this projection a little further by reprojecting the entire globe.

Data

library(sf)
library(ggplot2)
library(rnaturalearth)
library(glue)

world <- ne_countries() |> 
  st_transform("EPSG:3035")

Map

world |>
  ggplot() +
  geom_sf(color = "#abe338", fill = "#85b66f") +
  labs(title = "Earth",
       subtitle = st_crs(world)$Name,
       caption = glue("data : Natural Earth
                      https://r.iresmi.net - {Sys.Date()}")) +
  theme_minimal() +
   theme(text = element_text(family = "Ubuntu",  color = "#ffa07a"),
        plot.background = element_rect(fill = "#373737", color = NA),
        panel.background = element_blank(),
        panel.grid = element_line(color = "#0ac1c1"),
        axis.text = element_text(color = "#0ac1c1"),
        plot.caption = element_text(size = 7, color = "grey40"))

Figure 1: World in ETRS89-extended / LAEA Europe

To leave a comment for the author, please follow the link and comment on their blog: r.iresmi.net.

Perseverance

Michael — Tue, 18 Nov 2025 20:00:00 +0000

[This article was first published on r.iresmi.net, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Mars Perseverance Sol 1681: Right Mastcam-Z Camera – PD by NASA/JPL-Caltech/ASU

Day 18 of 30DayMapChallenge: « Out of this world » (previously).

Perseverance is still hiking…

library(dplyr)
library(sf)
library(ggplot2)
library(glue)

Data

Scraping “Where is Perseverance?”.

if (!file.exists("M20_waypoints.json")) {
  download.file("https://mars.nasa.gov/mmgis-maps/M20/Layers/json/M20_waypoints.json",
                "M20_waypoints.json")
}
perseverance <- read_sf("M20_waypoints.json")

Map

perseverance |> 
  ggplot() +
  geom_sf(aes(color = sol)) +
  scale_color_viridis_c() +
  labs(title = "Perseverance",
       caption = glue("data: https://science.nasa.gov/mission/mars-2020-\\
                       perseverance/location-map/
                       https://r.iresmi.net/ - {Sys.Date()}")) +
  theme(panel.background = element_rect(fill = NA),
        plot.background =  element_rect(fill = "sienna3",
                                        color = "sienna4"),
        text = element_text(color = "sienna4"),
        axis.text = element_text(color = "sienna4"),
        panel.grid = element_line(color = "sienna4"),
        legend.background = element_rect(fill = NA),
        plot.caption = element_text(size = 6))

Figure 1: Map of Perseverance itinerary on Mars

Don’t ask me about Mars datum or projections, I have no idea how it works…

To leave a comment for the author, please follow the link and comment on their blog: r.iresmi.net.

Why you should not use mean imputation for missing data

Jason Bryer — Tue, 18 Nov 2025 05:00:00 +0000

[This article was first published on Jason Bryer, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Today I encountered the question of what to do with missing values when conducting null hypothesis testing or regression. I have seen many people suggest mean imputation: simply replace any missing values with the mean of the variable calculated from the observed values. I argue that mean imputation is worse than doing nothing. Let’s explore.

To begin, let’s simulate a vector, x, from the random normal distribution.

set.seed(2112)
x <- rnorm(100, mean = 0, sd = 1)
(mean1 <- mean(x))

[1] 0.01129628

(sd1 <- sd(x))

[1] 1.032159

We can see that the mean and standard deviation are fairly close to 0 and 1, respectively. In the next code chunk we randomly select 20% of the observations and set their values to NA. We can calculate the mean and standard deviation excluding the missing values (i.e. NAs) by setting na.rm = TRUE. The mean and standard deviation remain relatively close to the originals.

x[sample(length(x), length(x) * 0.2, replace = FALSE)] <- NA
(mean2 <- mean(x, na.rm = TRUE))

[1] 0.02136184

(sd2 <- sd(x, na.rm = TRUE))

[1] 1.071757

Now we will replace the NAs we introduced above with the mean. The mean is unchanged, but the standard deviation is quite a bit smaller: imputation has artificially reduced the variance of the variable. Since many of our statistical tests rely on variance, reducing it may lead to spurious conclusions.

x[is.na(x)] <- mean(x, na.rm = TRUE)
(mean3 <- mean(x))

[1] 0.02136184

(sd3 <- sd(x))

[1] 0.9573977
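
The size of this drop is not specific to the sample. Mean imputation leaves the mean and the sum of squared deviations unchanged (each imputed value sits exactly at the mean), while the divisor of the variance grows from n_obs - 1 to n - 1. The standard deviation therefore shrinks by a factor of sqrt((n_obs - 1) / (n - 1)); with 80 observed values out of 100 that is sqrt(79/99), about 0.893, which matches sd3 / sd2 above. A minimal sketch of that identity, using a fresh simulated vector and illustrative names (z, n_obs, shrink) that are not part of the original code:

```r
# Sketch: mean imputation shrinks the SD by a known factor.
# z, n_obs and shrink are illustrative names, not from the original post.
set.seed(42)                       # arbitrary seed for reproducibility
z <- rnorm(100)
z[sample(length(z), 20)] <- NA     # set 20 of the 100 values to NA

n     <- length(z)                 # total number of values
n_obs <- sum(!is.na(z))            # number of observed values

# SD from the observed values only
sd_observed <- sd(z, na.rm = TRUE)

# SD after replacing every NA with the observed mean
z_imputed <- z
z_imputed[is.na(z_imputed)] <- mean(z, na.rm = TRUE)
sd_imputed <- sd(z_imputed)

# Same sum of squares, larger divisor
shrink <- sqrt((n_obs - 1) / (n - 1))
all.equal(sd_imputed, sd_observed * shrink)
```

The more values are imputed, the smaller n_obs becomes relative to n and the stronger the shrinkage.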

To show this is not a random anomaly for our one random sample, let’s repeat the above 1,000 times.

n_samples <- 1000
percent_missing <- 0.10
sd_diffs <- data.frame(sample = 1:n_samples,
                       sd_drop_miss = numeric(n_samples),
                       sd_impute_miss = numeric(n_samples))
for(i in seq_len(n_samples)) {
    x2 <- x
    x2[sample(length(x), length(x) * percent_missing, replace = FALSE)] <- NA
    sd_diffs[i,]$sd_drop_miss <- sd(x2, na.rm = TRUE)
    x2[is.na(x2)] <- mean(x2, na.rm = TRUE)
    sd_diffs[i,]$sd_impute_miss <- sd(x2)
}

library(ggplot2)

sd_diffs |> 
    reshape2::melt(id.vars = 'sample', variable.name = 'calculation_type', value.name = 'sd') |>
    ggplot(aes(x = sd, color = calculation_type)) +
        geom_vline(xintercept = sd(x)) +
        geom_density() +
        xlab('Standard Deviation') +
        theme_minimal()

As the figure above shows, there is a significant difference in the standard deviation estimates when calculated using only observed values and calculated with missing values imputed with the mean. The t-test below confirms this.

t.test(sd_diffs$sd_drop_miss, sd_diffs$sd_impute_miss)

    Welch Two Sample t-test

data:  sd_diffs$sd_drop_miss and sd_diffs$sd_impute_miss
t = 54.288, df = 1992.4, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.04782442 0.05140925
sample estimates:
mean of x mean of y 
0.9569447 0.9073278
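
The gap between the two means can be worked out by hand. With 10 of 100 values replaced by the mean, the imputed standard deviation in each iteration is the observed-only one multiplied by sqrt(89/99), because the sum of squared deviations is unchanged while the divisor grows from 89 to 99. A quick arithmetic check against the means printed above (the variable names are illustrative):

```r
# Arithmetic check using the means reported in the t-test output above.
n         <- 100
n_missing <- n * 0.10                        # percent_missing = 0.10
shrink    <- sqrt((n - n_missing - 1) / (n - 1))

mean_sd_drop   <- 0.9569447                  # "mean of x" in the output
mean_sd_impute <- 0.9073278                  # "mean of y" in the output

# Apply the shrink factor to the observed-only mean SD
predicted_impute <- mean_sd_drop * shrink
c(shrink = shrink, predicted = predicted_impute, reported = mean_sd_impute)
```

The predicted and reported values agree to about six decimal places, so the difference is a deterministic consequence of the imputation rather than sampling noise.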

Now let’s consider how mean imputation can impact the estimation of a correlation between two variables. We will simulate two variables with a population correlation of 0.18.

n <- 100
mean_x <- 0
mean_y <- 0
sd_x <- 1
sd_y <- 1
rho <- 0.18

set.seed(2112)
df <- mvtnorm::rmvnorm(
    n = 100,
    mean = c(mean_x, mean_y),
    sigma = matrix(c(sd_x^2, rho * (sd_x * sd_y),
                     rho * (sd_x * sd_y), sd_y^2), 2, 2)) |>
    as.data.frame() |>
    dplyr::rename(x = V1, y = V2)

cor.test(df$x, df$y)

    Pearson's product-moment correlation

data:  df$x and df$y
t = 1.8314, df = 98, p-value = 0.07008
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.01504323  0.36527878
sample estimates:
      cor 
0.1819124

We will now randomly select 20% of x values to set to NA.

df_miss <- df
df_miss[sample(n, size = 0.2 * n, replace = FALSE),]$x <- NA
cor.test(df_miss$x, df_miss$y)

    Pearson's product-moment correlation

data:  df_miss$x and df_miss$y
t = 1.8392, df = 78, p-value = 0.06969
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.01658176  0.40543327
sample estimates:
      cor 
0.2038779

Note that the p-value for both the correlation estimated using the complete dataset and estimated with observed values only is greater than 0.05 (i.e. we would fail to reject the null that the correlation is 0).

Now we will impute the missing values with the mean and calculate the correlation.

df_miss[is.na(df_miss$x),] <- mean(df$x, na.rm = TRUE)
cor.test(df_miss$x, df_miss$y)

    Pearson's product-moment correlation

data:  df_miss$x and df_miss$y
t = 2.0582, df = 98, p-value = 0.04223
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.007431517 0.384594022
sample estimates:
      cor 
0.2035525

We would now reject the null and conclude that there is a statistically significant correlation between x and y, even though the correlation in the complete dataset, before any values were removed, was not significant.
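
The mechanics behind this result can be checked with a small base-R sketch. It is not the code used above: it simulates its own data with rnorm() rather than mvtnorm, fills only the x column with the mean of the observed values, and all object names (x_miss, x_imp, sd_ratio) are illustrative. Mean-imputed values contribute nothing to the sum of squares of x, so the standard deviation shrinks by a known factor while the degrees of freedom of the test rise from 78 to 98.

```r
set.seed(2112)
n <- 100
rho <- 0.18

# Two standard normal variables with population correlation rho
x <- rnorm(n)
y <- rho * x + sqrt(1 - rho^2) * rnorm(n)

# Remove 20% of x completely at random
x_miss <- x
x_miss[sample(n, size = 0.2 * n)] <- NA

# Mean imputation of x only, using the observed values
x_imp <- x_miss
x_imp[is.na(x_imp)] <- mean(x_miss, na.rm = TRUE)

# Imputed points sit exactly on the mean, so the sum of squares is unchanged
# while the divisor grows from 79 to 99
sd_ratio <- sd(x_imp) / sd(x_miss, na.rm = TRUE)
c(sd_ratio = sd_ratio, expected = sqrt(79 / 99))

# The test on imputed data claims 20 extra degrees of freedom
df_observed <- unname(cor.test(x_miss, y)$parameter)
df_imputed <- unname(cor.test(x_imp, y)$parameter)
c(observed = df_observed, imputed = df_imputed)
```

Because the p-value depends on both the estimate and the degrees of freedom, a similar correlation can cross the 0.05 threshold simply because the imputed rows are counted as if they were real observations.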

To leave a comment for the author, please follow the link and comment on their blog: Jason Bryer.

Elevate Your Skills and Boost Your Career – Free Jumping Rivers Webinar on 20th November!

The Jumping Rivers Blog — Mon, 17 Nov 2025 23:59:00 +0000

[This article was first published on The Jumping Rivers Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Are you ready to stay ahead in the fast-evolving world of data? Join Jumping Rivers for our free monthly webinar series designed for data professionals at all levels. In just 55 minutes, you’ll gain practical insights, sharpen your skills, and tackle real-world challenges in R, Python, Shiny, and Posit – all from the comfort of your own desk.

Upcoming Webinar – Machine Learning with Python

Date & Time (BST): 20 November, 13:05

Why Attend?

Gain hands-on experience with the latest tools and best practices.
Make yourself more hireable by boosting your data science skills ahead of 2026.
Connect with a network of fellow data scientists, engineers, and experts.
Learn flexibly online with no cost or commitment.
Unlock exclusive discounts:
- Attend 2 sessions → 20% off AI in Production conference tickets.
- Attend more than 2 sessions → 20% off any of our high-quality public training courses.

Whether you want to improve your coding or explore machine learning, this webinar is your chance to stay ahead of the curve and grow your career.

For updates and revisions to this article, see the original post

To leave a comment for the author, please follow the link and comment on their blog: The Jumping Rivers Blog.

Choose Your Fighter: data-driven selection of the best marathon

Stephen Royle — Mon, 17 Nov 2025 20:44:57 +0000

[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Running a marathon is a big deal. It takes a lot of time to train to run a good time, and it takes a while to recover. So, if you’re chasing a marathon PB (personal best) time, you need to choose wisely which marathon to target. How can we use data to help our decision? Let’s use R to find out!

For the impatient: just show me the marathon data! or I want to see how to code this up!

Let’s leave aside the fact that for the most popular marathons, it might not be your choice whether you can register. What factors do we need to consider to pick the best one?

Flat course
Favourable weather
Travel considerations

The flattest course is ideal. Any elevation gain will slow us down. We could look at finish times to know how “fast” the course is in practice. However, finish times really depend on who is running and how many participants there are. It’s slightly circular, with the bigger marathons attracting faster runners. To keep things simple, I didn’t use timings and went solely with elevation data.

Most people would agree that running in cool temperatures, ideally dry, is best. This is why they tend to be organised in the spring and autumn. So, we need to have an idea of the likely conditions on the day.

The ideal marathon would also be easy to get to. Since I am based in the UK, I made a list of popular marathons in the UK and then added the World Marathons for comparison, as well as a few others from Europe that people I know have run. For each one, I grabbed a GPX file of the route from Garmin (more on this below), and made a note of what date the last 3 editions occurred (for the weather data). Using these things, and making use of a few R libraries, I could generate graphics to compare the marathon routes.

The marathon data

The course profile for each race is shown on the same scale to give a feel for how challenging it is. Here is the key data organised into a table, listed by date.

Marathon	Date	Elevation gain (m)	Typical max temp (°C)
Tokyo	1/3/26	150	14.8
Great Welsh	8/3/26	118	11.4
Cambridge Boundary	15/3/26	158	10.4
Boston Lincs.	12/4/26	46	13.4
Brighton	12/4/26	160	13.1
Paris	12/4/26	194	14.6
Manchester	19/4/26	121	14.2
Newport	19/4/26	77	13.6
Boston	20/4/26	234	17.9
Blackpool	26/4/26	172	13.1
London	26/4/26	162	14.6
Stratford-upon-Avon	26/4/26	195	14.2
Milton Keynes	4/5/26	205	14.8
Leeds Rob Burrow	10/5/26	400	19.4
Worcester	17/5/26	296	19.6
Edinburgh	24/5/26	113	14.4
Sydney	30/8/26	369	21.8
Berlin	27/9/26	101	19.7
Chester	4/10/26	213	17
Chicago	11/10/26	105	16.2
Abingdon	18/10/26	97	15.6
Yorkshire	18/10/26	148	14.1
Amsterdam	18/10/26	174	13.8
Frankfurt	25/10/26	142	13
New York	1/11/26	179	15.2
Valencia	6/12/26	144	17.2

and here’s a graphical look at the same data:

Breakdown

Let’s face it, most marathons market themselves as flat and fast. Which ones can really make that claim?

The three flattest on our list are Boston (Lincs.), Newport and Abingdon. The following marathons are all less than 150 m gain and therefore pretty flat: Great Welsh, Manchester, Edinburgh, Berlin, Chicago, Yorkshire, Frankfurt, Valencia. Between 150-200 m, which is still fairly flat, we have Tokyo, Cambridge Boundary, Brighton, Paris, Blackpool, London, Stratford-upon-Avon, Amsterdam and New York. Beyond this, we are into rolling territory. Marathons with more than 200 m of elevation gain are Boston, Milton Keynes, Leeds, Worcester, Sydney and Chester.
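
The banding used here can be reproduced with cut(). This is a minimal sketch using a handful of values copied from the table above; the band labels and the choice of left-closed intervals (so that Tokyo at 150 m lands in the middle band, as in the text) are one reading of the thresholds rather than code from the original analysis.

```r
# Elevation gain (m) for a few marathons, copied from the table above
gain <- c("Boston Lincs." = 46, "Newport" = 77, "Abingdon" = 97,
          "Tokyo" = 150, "London" = 162, "Boston" = 234,
          "Leeds Rob Burrow" = 400)

# Left-closed bands: [0, 150), [150, 200), [200, Inf)
band <- cut(gain, breaks = c(0, 150, 200, Inf), right = FALSE,
            labels = c("pretty flat", "fairly flat", "rolling"))

bands <- data.frame(marathon = names(gain), gain_m = unname(gain), band = band)
bands
```

The same call on the full summary table would assign every event to one of the three groups listed in the paragraph above.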

Of the flattest marathons on our list, the coolest temperatures are likely to be at Great Welsh, Frankfurt, Boston (Lincolnshire) and Newport, whereas Berlin, Valencia and Chicago are probably the warmest. So, this gives us an idea of where the best performances can be unlocked.

Data accuracy

Getting the total elevation gain is difficult. I used a single data source (Garmin Connect) for the GPS data to reduce variation but even on this single source, the total gain calculated varied a lot.

The elevation data for a GPS location obviously needs to be correct. This is not necessarily true if the data is taken from a watch, where the barometer could be inaccurate or where tall buildings interfere with the location (which is a problem for city marathons).

Even if the data is correct, the calculation can still be inaccurate due to sampling frequency. If we add all the elevation gains for a track sampled every 10 metres, versus one sampled every 50 metres, we will get a different answer because the latter is smoother than the former. To deal with this, I resampled the elevation data on a uniform distance scale to get the most accurate elevation gain I could from the data I had. This caveat applies to whatever marathon data you find online. So our comparison here allows us to say that one marathon has more or less elevation gain than another, but it doesn’t allow us to compare elevation gain with data on another site.
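
The effect of sampling interval is easy to see on a synthetic profile. The sketch below is not the analysis code: the course (gentle sine-wave hills plus random noise), the noise level and the helper name gain_at are all made up for illustration. It resamples the same track onto grids of different spacing with approx() and sums the positive elevation differences each time.

```r
set.seed(1)
# Hypothetical course: one point every 10 m over 42.2 km,
# gentle hills plus measurement noise in the elevation
dist_km <- seq(0, 42.2, by = 0.01)
ele_m <- 20 + 10 * sin(dist_km / 3) + rnorm(length(dist_km), sd = 0.3)

# Total gain after resampling elevation onto a uniform distance grid
gain_at <- function(step_km) {
  grid <- seq(0, max(dist_km), by = step_km)
  ele <- approx(dist_km, ele_m, xout = grid)$y
  d <- diff(ele)
  sum(d[d > 0], na.rm = TRUE)
}

steps_km <- c(0.01, 0.05, 0.1, 0.5)
gains <- sapply(steps_km, gain_at)
data.frame(step_km = steps_km, gain_m = round(gains))
```

In this sketch the 10 m grid reports far more climbing than the 100 m or 500 m grids, because it adds up every upward wiggle of noise. This is why totals computed at different spacings should not be compared with one another.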

The weather “forecast” is taken by looking at the weather at the last three editions, with the exception of Valencia where the 2025 edition has not yet happened. I used the average of the max temperature on those editions. A more accurate picture would come from taking several days either side of the event, because the weather on one or more of the editions could have been rather atypical.
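
Widening the window is mostly a matter of date arithmetic. Here is a minimal sketch, assuming three days either side of race day (an arbitrary choice) and using the London dates from the event list later in this post. Each start and end pair could then be passed to the same weather_history() call that the main script makes, and the daily maxima averaged.

```r
# Dates of the last three London editions, taken from the event list in this post
editions <- as.Date(c("2023-04-23", "2024-04-21", "2025-04-27"))

# Hypothetical choice: three days either side of race day
half_width <- 3
windows <- data.frame(
  edition = editions,
  start = editions - half_width,
  end = editions + half_width
)
windows
```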

Finally, I manually collated the data, so errors are possible. Apologies for any mistakes!

The code

If you came here for the R coding rather than the running, here is the bit where I show how the analysis works! Besides general R stuff – importing data, calculations, making plots – we need to do a few other things:

read the GPX data and calculate the elevation data – we’ll use {gpxtoolbox} to help with this
retrieve weather data – we’ll use {openmeteo} for this
convert WMO codes into icons, load the icons and display them

We have two functions saved to a script that gets sourced during the main script. Its purpose is to convert the WMO codes into icons. I found a gist that had the WMO codes and the corresponding URLs of the day or night versions of the icons. The first function converts this data (in json format) into a data frame that we can use in the main script. The second function converts the wind direction into a text arrow for display.

library(jsonlite)
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
library(tibble)

# Example: read the JSON into `lst`
# lst <- jsonlite::fromJSON("Data/descriptions.json", simplifyVector = FALSE)

descriptions_to_df <- function(lst) {
  # lst is expected to be a named list or a list of entries where each element corresponds to a WMO code.
  # Support either:
  #  - named list where names(lst) are WMO codes and each element is a list with day/night fields
  #  - or a list of objects where each object has a "wmo" or "WMO" field + nested day/night fields
  
  # Helper to normalize keys for day/night entries
  norm_field <- function(item, keys) {
    # keys: possible key names (vector), returns first non-NULL value or NA
    for (k in keys) {
      if (!is.null(item[[k]])) return(item[[k]])
    }
    return(NA_character_)
  }
  
  # If lst is a named list with codes as names
  if (!is.null(names(lst)) && all(names(lst) != "")) {
    codes <- names(lst)
    rows <- map2_df(lst, codes, function(item, code) {
      # item may have elements like $day$description, or $day_description, etc.
      # Try several common variants.
      day <- item[["day"]]    # might be a list
      night <- item[["night"]]
      day_description <- if (!is.null(day) && is.list(day)) norm_field(day, c("description", "desc", "text")) else norm_field(item, c("day_description", "dayDescription", "day-desc"))
      day_image <- if (!is.null(day) && is.list(day)) norm_field(day, c("image", "img", "image_url")) else norm_field(item, c("day_image", "dayImage", "day-img"))
      night_description <- if (!is.null(night) && is.list(night)) norm_field(night, c("description", "desc", "text")) else norm_field(item, c("night_description", "nightDescription", "night-desc"))
      night_image <- if (!is.null(night) && is.list(night)) norm_field(night, c("image", "img", "image_url")) else norm_field(item, c("night_image", "nightImage", "night-img"))
      
      tibble(
        wmo = code,
        day_description = as.character(day_description),
        day_image = as.character(day_image),
        night_description = as.character(night_description),
        night_image = as.character(night_image)
      )
    })
    
    return(rows)
  }
  
  # Otherwise treat as array of objects, each with a wmo field
  rows <- map_df(lst, function(item) {
    code <- norm_field(item, c("wmo", "WMO", "WMO_code", "wmo_code", "id"))
    day <- item[["day"]]
    night <- item[["night"]]
    day_description <- if (!is.null(day) && is.list(day)) norm_field(day, c("description", "desc", "text")) else norm_field(item, c("day_description", "dayDescription"))
    day_image <- if (!is.null(day) && is.list(day)) norm_field(day, c("image", "img", "image_url")) else norm_field(item, c("day_image", "dayImage"))
    night_description <- if (!is.null(night) && is.list(night)) norm_field(night, c("description", "desc", "text")) else norm_field(item, c("night_description", "nightDescription"))
    night_image <- if (!is.null(night) && is.list(night)) norm_field(night, c("image", "img", "image_url")) else norm_field(item, c("night_image", "nightImage"))
    
    tibble(
      wmo = as.character(code),
      day_description = as.character(day_description),
      day_image = as.character(day_image),
      night_description = as.character(night_description),
      night_image = as.character(night_image)
    )
  })
  
  # If wmo NA but names exist in original list, try to fill
  if (all(is.na(rows$wmo)) && !is.null(names(lst))) {
    rows$wmo <- names(lst)
  }
  
  # Ensure first column is wmo
  rows %>% select(wmo, everything())
}

windsymbol <- function(degree) {
  # Return wind direction symbol based on degree
  if (is.na(degree)) {
    return("-")
  }
  directions <- c("↓", "↙", "←", "↖", "↑", "↗", "→", "↘", "↓")
  index <- round(degree / 45) + 1
  return(directions[index])
}
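
windsymbol() handles one value at a time, because if (is.na(degree)) expects a single logical. If a whole column of directions needed converting, a vectorised variant could look like the sketch below. wind_arrows is a hypothetical name and not part of the original script, and the comment about wind coming from the north assumes the usual meteorological convention for wind direction.

```r
# Hypothetical vectorised variant of windsymbol(); same lookup, NA becomes "-"
wind_arrows <- function(degree) {
  directions <- c("↓", "↙", "←", "↖", "↑", "↗", "→", "↘", "↓")
  # 45-degree sectors; 0 and 360 both map to a downward arrow
  # (wind from the north, assuming the meteorological convention)
  out <- directions[round(degree / 45) + 1]
  out[is.na(degree)] <- "-"
  out
}

wind_arrows(c(0, 90, 180, 270, 360, NA))
```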

OK, so now for the main script. I had the marathon event list in tab-separated format as a file in the Data directory and a gpx file for each marathon in the same directory. The name of the gpx file is the same as the event name. The event list also had an alias for display of the marathon name. These are the contents of the file.

event	date2023	date2024	date2025	date2026	alias
Leeds	14/5/23	12/5/24	11/5/25	10/5/26	Leeds Rob Burrow
GreatWelsh	2/4/23	17/3/24	16/3/25	8/3/26	Great Welsh
Cambridge	12/3/23	10/3/24	16/3/25	15/3/26	Cambridge Boundary
Boston	16/4/23	28/4/24	13/4/25	12/4/26	Boston Lincs.
Brighton	2/4/23	7/4/24	6/4/25	12/4/26	Brighton
Manchester	16/4/23	14/4/24	27/4/25	19/4/26	Manchester
Newport	16/4/23	28/4/24	19/4/25	19/4/26	Newport
Blackpool	23/4/23	21/4/24	27/4/25	26/4/26	Blackpool
London	23/4/23	21/4/24	27/4/25	26/4/26	London
Shakespeare	23/4/23	21/4/24	27/4/25	26/4/26	Stratford-upon-Avon
MK	1/5/23	6/5/24	5/5/25	4/5/26	Milton Keynes
Worcester	21/5/23	19/5/24	18/5/25	17/5/26	Worcester
Edinburgh	28/5/23	26/5/24	25/5/25	24/5/26	Edinburgh
Chester	8/10/23	6/10/24	5/10/25	4/10/26	Chester
Abingdon	22/10/23	20/10/24	19/10/25	18/10/26	Abingdon
Yorkshire	15/10/23	20/10/24	19/10/25	18/10/26	Yorkshire
Tokyo	5/3/23	3/3/24	2/3/25	1/3/26	Tokyo
BostonUSA	17/4/23	15/4/24	21/4/25	20/4/26	Boston
Sydney	17/9/23	15/9/24	31/8/25	30/8/26	Sydney
Berlin	24/9/23	29/9/24	21/9/25	27/9/26	Berlin
Chicago	8/10/23	13/10/24	12/10/25	11/10/26	Chicago
NewYork	5/11/23	3/11/24	2/11/25	1/11/26	New York
Frankfurt	29/10/23	27/10/24	26/10/25	25/10/26	Frankfurt
Valencia	3/12/23	1/12/24	7/12/25	6/12/26	Valencia
Amsterdam	15/10/23	20/10/24	19/10/25	18/10/26	Amsterdam
Paris	2/4/23	7/4/24	13/4/25	12/4/26	Paris

From here we can read it in and use it to drive the data collection and processing.

library(ggplot2)
library(dplyr)
library(lubridate)
library(gpxtoolbox)
library(openmeteo)
library(cowplot)
library(png)
library(ggrepel)

## Functions ----

load_weather_image <- function(thisyear) {
  # relies on weather_df and wmo_df existing in the calling environment
  wcode <- weather_df$daily_weather_code[weather_df$yr == thisyear]
  # if wcode is missing or length 0 return NULL so the caller can skip the icon
  if (length(wcode) == 0 || is.na(wcode)) {
    return(NULL)
  }
  # get the image url from wmo_df
  img_url <- wmo_df$day_image[wmo_df$wmo == wcode]
  f <- tempfile(fileext = ".png")
  # mode = "wb" keeps the binary PNG intact on Windows
  download.file(img_url, f, mode = "wb")
  img <- readPNG(f)
  as.raster(img)
}

# load tsv of dates
date_df <- read.delim("Data/marathons.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
# in the "event" column, each row has the name of a marathon whose GPX file can be loaded by appending ".gpx" to the name

source("Script/mwo.R")
# load the json file descriptions.json from Data/
descriptions <- jsonlite::fromJSON("Data/descriptions.json")
wmo_df <- descriptions_to_df(descriptions)
# we will collate the weather data for each marathon and the max temp etc.
summary_df <- data.frame()

# Loop over each marathon event
for (i in seq_len(nrow(date_df))) {
  # pick the event for this iteration
  city <- date_df$event[i]
  date2026 <- date_df$date2026[date_df$event == city]
  
  # Analyse the example GPX file and get summary statistics
  gpx_path <- paste0("Data/",city,".gpx")
  
  # Get summary statistics
  # stats <- analyse_gpx(gpx_path, return = "stats")
  
  # Get processed track points data
  track_data <- analyse_gpx(gpx_path, return = "data")
  # find the mid point in lat long
  mid_lat <- mean(range(track_data$lat))
  mid_lon <- mean(range(track_data$lon))
  # convert lat and lon coordinates to km for distance calculation
  track_data <- track_data %>%
    mutate(
      lat_km = (lat - mid_lat) * 111.32,
      lon_km = (lon - mid_lon) * 111.32 * cos(mid_lat * pi / 180)
    ) %>%
    arrange(time) %>%
    mutate(
      delta_dist = sqrt((lat_km - lag(lat_km, default = first(lat_km)))^2 + (lon_km - lag(lon_km, default = first(lon_km)))^2),
      cumulative_distance = cumsum(delta_dist)
    )
  # Create route plot
  route <- ggplot() +
    geom_path(data = track_data, aes(x = lon_km, y = lat_km), color = "darkgrey", linewidth = 1) +
    coord_equal() +
    theme_void()
  # x axis is too narrow so expand limits by 10% on each side
  x_range <- range(track_data$lon_km)
  x_expand <- (x_range[2] - x_range[1]) * 0.1
  y_range <- range(track_data$lat_km)
  y_expand <- (y_range[2] - y_range[1]) * 0.1
  # check that x_expand and y_expand are at least 1.5 km
  x_expand <- max(x_expand, 1.5)
  y_expand <- max(y_expand, 1.5)
  route <- route +
    xlim(x_range[1] - x_expand, x_range[2] + x_expand) +
    ylim(y_range[1] - y_expand, y_range[2] + y_expand)
  # theme_void() has already removed the title, axis labels and tick labels
  route <- route + 
    # add a scale bar at bottom right
    ggspatial::annotation_scale(location = "br", width_hint = 0.2,
                                plot_unit = "km", bar_cols = c("grey", "white"),
                                line_col = "darkgrey",
                                text_col = "darkgrey",
                                text_cex = 0.8)
  
  # we have delta_dist which is the distance from one point to the next
  # calculate the cumulative distance along the path
  track_data$cum_dist <- cumsum(c(0, track_data$delta_dist[-nrow(track_data)]))
  # we have the ele which is the elevation at each point
  # resample ele so that we have elevation at regular intervals along the cumulative distance, use 0.1 km intervals
  resampled_dist <- seq(0, max(track_data$cum_dist), by = 0.1)
  resampled_ele <- approx(track_data$cum_dist, track_data$ele, xout = resampled_dist)$y
  # calculate the elevation gain and loss over each 0.1 km segment
  ele_diff <- diff(resampled_ele)
  ele_gain <- sum(ele_diff[ele_diff > 0], na.rm = TRUE)
  ele_loss <- sum(-ele_diff[ele_diff < 0], na.rm = TRUE)
  stats <- list(
    total_elevation_gain_m = ele_gain,
    total_elevation_loss_m = ele_loss,
    max_elevation_m = max(track_data$ele, na.rm = TRUE),
    min_elevation_m = min(track_data$ele, na.rm = TRUE)
  )
  new_track_data <- data.frame(resampled_dist, resampled_ele)
  
  # Create elevation profile plot
  # the biggest difference between min and max is 141 m so set y axis limits to min -5 to 150 above that
  ele_plot <- ggplot(new_track_data, aes(x = resampled_dist, y = resampled_ele)) +
    geom_ribbon(aes(ymin = stats$min_elevation_m - 5, ymax = resampled_ele), fill = "#55aa55") +
    geom_line() +
    labs(x = "Distance (km)", y = "Elevation (m)") +
    ylim(stats$min_elevation_m - 5, stats$min_elevation_m + 150) +
    theme_minimal()
  
  yearcols <- c("date2023", "date2024", "date2025")
  weather_df <- data.frame()
  
  for (yr in yearcols) {
    # select column using variable yr
    date_for_yr <- date_df[date_df$event == city, yr]
    # if date_for_yr is na then skip to next iteration
    if (is.na(date_for_yr)) {
      next
    }
    # the date is written in dd/mm/yy format, convert to yyyy-mm-dd
    date_for_yr <- dmy(date_for_yr)
    # if date is in the future, skip to next iteration
    if (date_for_yr > Sys.Date()) {
      next
    }
    
    weather_forecast <- weather_history(
      location = c(mid_lat, mid_lon),
      daily = c("temperature_2m_max",
                "temperature_2m_min",
                "precipitation_sum",
                "windspeed_10m_max",
                "wind_direction_10m_dominant",
                "weather_code"),
      start = date_for_yr,
      end = date_for_yr
    )
    
    weather_forecast$event <- city
    weather_forecast$yr <- year(date_for_yr)
    
    weather_df <- rbind(weather_df, weather_forecast)
  }
  
  # Get alias for city from date_df
  alias <- date_df$alias[date_df$event == city]
  
  # Make an object to display Marathon stats
  p <- ggdraw() +
    draw_label(
      alias,
      fontfamily = 'serif',
      fontface = 'bold',
      x = 0.05,
      y = 0.95,
      hjust = 0,
      vjust = 1,
      size = 24
    ) +
    draw_label(
      # print the date which is stored as 16/4/26 in date2026 column
      # Should say 16th April 2026
      format(dmy(date2026), "%d %B %Y"),
      fontfamily = 'serif',
      fontface = 'italic',
      x = 0.05,
      y = 0.9,
      hjust = 0,
      vjust = 1,
      size = 16
    ) +
    draw_label(
      paste0(
        "Gain: ", round(stats$total_elevation_gain_m, 0), " m\n",
        "Loss: ", round(stats$total_elevation_loss_m, 0), " m\n",
        "Max: ", round(stats$max_elevation_m, 0), " m\n",
        "Min: ", round(stats$min_elevation_m, 0), " m\n"
      ),
      fontface = 'plain',
      x = 0.5,
      y = 0.8,
      hjust = 0.5,
      vjust = 1,
      size = 14
    )
  # Add weather info for each year
  w2023 <- load_weather_image(2023)
  w2024 <- load_weather_image(2024)
  w2025 <- load_weather_image(2025)
  
  for(yr in c(2023, 2024, 2025)) {
    p <- p +
      draw_label(
        paste0(yr),
        fontface = 'bold',
        x = ifelse(yr == 2023, 0.2, ifelse(yr == 2024, 0.5, 0.8)),
        y = 0.5,
        vjust = 0,
        size = 16
      )
    # if there is no row corresponding to yr in weather_df, skip to next iteration
    if (nrow(weather_df[year(weather_df$date) == yr, ]) == 0) {
      next
    }
    p <- p +
      draw_label(
        paste0(
          "High: ", round(weather_df$daily_temperature_2m_max[year(weather_df$date) == yr], 1), " °C\n",
          "Low: ", round(weather_df$daily_temperature_2m_min[year(weather_df$date) == yr], 1), " °C\n",
          "Precip: ", round(weather_df$daily_precipitation_sum[year(weather_df$date) == yr], 1), " mm\n",
          windsymbol(round(weather_df$daily_wind_direction_10m_dominant[year(weather_df$date) == yr], 0))," ", round(weather_df$daily_windspeed_10m_max[year(weather_df$date) == yr], 1), " km/h\n"
        ),
        fontface = 'plain',
        x = ifelse(yr == 2023, 0.2, ifelse(yr == 2024, 0.5, 0.8)),
        y = 0.25,
        size = 12
      )
  }
  
  # check if w2023, w2024, w2025 are not null before adding to plot
  if (!is.null(w2023)) {
    p <- p +
      draw_image(w2023, x = 0.2, y = 0.4, width = 0.2, height = 0.2, hjust = 0.5, vjust = 0.5)
  }
  if (!is.null(w2024)) {
    p <- p +
      draw_image(w2024, x = 0.5, y = 0.4, width = 0.2, height = 0.2, hjust = 0.5, vjust = 0.5)
  }
  if (!is.null(w2025)) {
    p <- p +
      draw_image(w2025, x = 0.8, y = 0.4, width = 0.2, height = 0.2, hjust = 0.5, vjust = 0.5)
  }
  
  # Make a cowplot and assemble the plots
  top_row <- plot_grid(p, route, ncol = 2)
  combined_plot <- plot_grid(top_row, ele_plot, ncol = 1, align = "v", rel_heights = c(3, 1))
  # Save the combined plot to a file
  ggsave(filename = paste0("Output/Plots/",city,"_summary.png"),
         plot = combined_plot, width = 12, height = 8, dpi = 300, bg = "white")
  
  # add to summary_df
  summary_df <- rbind(summary_df, data.frame(
    alias = alias,
    date2026 = date2026,
    total_elevation_gain_m = round(stats$total_elevation_gain_m, 0),
    avg_daily_temp_max = round(mean(weather_df$daily_temperature_2m_max, na.rm = TRUE),1)
  ))
}

# reorder summary_df by date2026
summary_df <- summary_df %>%
  arrange(dmy(date2026))

# Save summary_df to a tsv file
write.table(summary_df, file = "Output/Data/marathon_summary.tsv", sep = "\t", row.names = FALSE, quote = FALSE)


p1 <- ggplot() +
  # add coloured rectangles to indicate elevation
  geom_rect(aes(xmin = 7, xmax = 24, ymin = 0, ymax = 150), fill = "#d0f0d0", alpha = 0.5) +
  geom_rect(aes(xmin = 7, xmax = 24, ymin = 150, ymax = 200), fill = "#fff0b0", alpha = 0.5) +
  geom_rect(aes(xmin = 7, xmax = 24, ymin = 200, ymax = 420), fill = "#f0d0d0", alpha = 0.5) +
  # add points and labels from summary_df
  geom_point(data = summary_df, aes(x = avg_daily_temp_max, y = total_elevation_gain_m)) +
  geom_text_repel(data = summary_df, aes(x = avg_daily_temp_max, y = total_elevation_gain_m, label = alias), size = 3.5, max.overlaps = 1000, segment.color = "#7f7f7f7f", segment.size = 0.2) +
  lims(x = c(7, NA), y = c(0, NA)) +
  labs(x = "Average Daily Max Temperature (°C)", y = "Total Elevation Gain (m)") +
  theme_cowplot(11)
ggsave(filename = "Output/Plots/marathonComparison.png",
       plot = p1, width = 12, height = 8, dpi = 300, bg = "white")

We load in the event list and, for each row (event), we load in the gpx file first. Using analyse_gpx() we read in the data from the file, which includes elevation data and lat/long coordinates. We convert these to local cartesian coordinates in km because we deal with coordinate sets from different parts of the globe. These data are used to generate the route map. The elevation data is resampled at 100 m intervals so that we get a uniform elevation measurement to calculate the total elevation gain and loss. This is used to make the elevation plot.
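
The coordinate conversion can be tried on its own. The sketch below mirrors the scaling in the main script (111.32 km per degree, with longitude shrunk by the cosine of the mid latitude); to_local_km is a hypothetical helper name and the test points are made up.

```r
# Hypothetical helper mirroring the conversion in the main script:
# degrees are scaled to km around the midpoint of the track
to_local_km <- function(lat, lon) {
  mid_lat <- mean(range(lat))
  mid_lon <- mean(range(lon))
  data.frame(
    x_km = (lon - mid_lon) * 111.32 * cos(mid_lat * pi / 180),
    y_km = (lat - mid_lat) * 111.32
  )
}

# Two made-up points one degree of latitude apart on the same meridian
pts <- to_local_km(lat = c(51, 52), lon = c(0, 0))
step_km <- sqrt(diff(pts$x_km)^2 + diff(pts$y_km)^2)
step_km

# One degree of longitude shrinks with latitude: compare 0 and 60 degrees north
ew_equator <- diff(to_local_km(lat = c(0, 0), lon = c(0, 1))$x_km)
ew_60north <- diff(to_local_km(lat = c(60, 60), lon = c(0, 1))$x_km)
c(equator = ew_equator, lat60 = ew_60north)
```

This flat approximation is reasonable over the few kilometres spanned by a single course, which is all the route map and the distance calculation need. It would not be a good choice for comparing positions across continents.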

Stats from the course can be retrieved using analyse_gpx() but, as discussed above, I recalculated the elevation data and stored this together with the other stats I needed. This saved an extra call to analyse_gpx() which sped up the execution time.

Using the dates for the last three editions, I looked up the historical weather data on those dates for a location that is the midpoint of the lat/long coordinates. This was possible using {openmeteo}, which is a client for the Open-Meteo API. I found that openweathermap (which I have used for other projects) charges for access to historical weather data, whereas Open-Meteo is truly free. Once we have this weather data we can then use the WMO code to retrieve the appropriate icon from openweathermap. These codes show the most extreme weather for the day, rather than a perfect summary. I just wanted something to signify the weather on the previous editions. The icons can be loaded from a URL using {png}. Finally, we have a wind direction which can be converted to arrows using the function above.

To build the graphic, I used {cowplot} to assemble the graphical and text elements. Then this “plot”, the route and the elevation profile were put together using plot_grid(), also from {cowplot}. This is done for each event and saved as a file. The summary data gets stored as we go so that we can make the table and the plot, which were shown above in the post.

Conclusion

I’m quite happy with the result(s) but I can see a few ways to improve it. For example, I think having a more sophisticated measure of marathon toughness would be good. I also could use some custom fonts and improve the colours of the graphics to get a more professional look. Anyway, the purpose was to figure out which marathon to run in 2026 and I have been able to do that.

—

The post title comes from Choose Your Fighter by The Nova Twins. I watched their NPR Tiny Desk Concert this week.

To leave a comment for the author, please follow the link and comment on their blog: Rstats – quantixed.

Maplibre

Michael — Mon, 17 Nov 2025 20:00:00 +0000

[This article was first published on r.iresmi.net, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Heatmap of the French population

Day 17 of 30DayMapChallenge: « A new tool » (previously).

Testing Maplibre with {mapgl}.

library(dplyr)
library(mapgl)
library(sf)

Data

Using the population of French communes.

pop <- read_sf("~/data/adminexpress/adminexpress-cog-simpl-000-2024.gpkg",
               layer = "commune") |> 
  st_centroid() |> 
  select(population)

Map

maplibre(center = c(5, 45),
         zoom = 6) |>
  add_heatmap_layer(
    id = "pop",
    source = pop,
    heatmap_weight = interpolate(
      column = "population",
      values = c(0, max(pop$population)),
      stops = c(0, 30)),
    heatmap_intensity = interpolate(
      property = "zoom",
      values = c(0, 1),
      stops = c(9, 3)),
    heatmap_color = interpolate(
      property = "heatmap-density",
      values = seq(0, 1, 0.2),
      stops = c(
        "rgba(33,102,172,0)", "rgb(103,169,207)",
        "rgb(209,229,240)", "rgb(253,219,199)",
        "rgb(239,138,98)", "rgb(178,24,43)")),
    heatmap_opacity = 0.7
  )

To leave a comment for the author, please follow the link and comment on their blog: r.iresmi.net.

Thuenen’s Natural Wage and the K-shaped Economy

datascienceconfidential - r — Mon, 17 Nov 2025 00:00:00 +0000

[This article was first published on datascienceconfidential - r, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Johann Heinrich von Thünen (1783-1850) was a German agronomist, agricultural reformer, and textbook author. If you look his name up on the internet, you’ll find that he is best remembered for his model of land use, which seems to have made its way onto the Advanced Placement Geography syllabus in the United States. However, this wasn’t always the case. In a 1955 essay, George Stigler wrote:

We like to associate each eminent economist with the doctrine he originated. Jevons—marginal utility; Walras—general equilibrium; Marshall—quasi-rents, long and short run; Wicksteed—Euler’s Theorem; Pigou—private and social marginal products; Thünen—√ap; and so it goes on.

I’ve never encountered $\sqrt{ap}$ in an economics textbook, so I decided to spend some time researching what it is all about. This was more difficult than I expected, since the formula and the argument behind it are widely believed to be flawed. Nevertheless, it turned out to be quite interesting. In particular, when I attempted to generalise von Thünen’s formula to a slightly different context, I stumbled upon a quite extraordinary numerical punchline.

This post, then, is all about what $\sqrt{ap}$ is, why Thünen thought it was important enough to be carved on his tombstone and what, if anything, it can tell us about the post-1980 world.

von Thünen

There’s a short biography of von Thünen in a 1969 paper by Dickinson. He was a wealthy landowner who ran an estate under near-mediaeval conditions, but who also cared deeply about the welfare of his employees. His book Der isolierte Staat was the fruit of long years of patiently gathering data and reasoning about economics from first principles.

An article by Blaug provides a reader’s guide to Der isolierte Staat (described as a “formless monster”). The first volume was published in 1826 and the second posthumously in 1850. There is also a third volume published in 1863. It’s the second volume which contains the famous $\sqrt{ap}$ formula for the natural wage. I worked through a few chapters of the book and found it to be surprisingly readable. It’s clear that the author wants to be understood. He often writes down a formula, then immediately gives a numerical example, then continues with further reasoning. He doesn’t omit any steps. Nevertheless, it’s not always clear what he’s getting at. Fortunately an 1895 article by H. L. Moore explains the main ideas.

The Natural Wage

In von Thünen’s isolated state, there are a large number of farms. Each farm supports one farmer who produces $p$ units of produce per year. The farm is owned by a landowner who pays the farmer a wage $w$. Everything is measured in bushels of rye. If $a$ is the amount required for subsistence and $y$ is the surplus the farmer earns on top of it, the farmer’s wage is $w = a + y$. The landowner therefore receives $p-(a+y)$.

The isolated state is assumed to have plenty of space for new farms. Von Thünen’s genius idea is to ask: what would happen if a group of farmers got fed up with working for landowners and decided to crowdfund a new farm?

To answer this, you need to make some assumptions on what it would cost to set up a new farm. Thünen spends a great deal of time discussing capital (including a lengthy digression about coconuts¹) but the upshot is that the required capital is assumed to be proportional to labour, so you may as well assume that you can set up a new farm with labour alone.

Suppose it takes one farmer to set up a new farm and that they are willing to work for subsistence wages while doing so. The crowdfunding farmers can finance this with their own wages. Each crowdfunder needs to use $a$ bushels of rye just to stay alive and in working order, but gets a surplus $w-a=y$ to spend or invest as they like. The setter-upper of the new farm also requires $a$ bushels, so the minimum number of crowdfunders required is

\[\frac{a}{y}.\]

Once the new farm is set up, a labourer needs to be hired to work the farm. The labourer needs to be paid at least $a+y$, otherwise they might as well work for one of the existing landowners. The new farm therefore produces a surplus $p-(a+y)$ which will be shared among the $a/y$ crowdfunders plus the person who did the work of setting up the new farm, which is $1 + a/y$ people in total. Each of these people will earn

\[\frac{p-(a+y)}{1+a/y}=y\frac{p-a-y}{(a+y)}\]

per year. If you allow $y$ to vary and maximise this quantity, you find that

\[a + y = \sqrt{ap}\]

and therefore $\sqrt{ap}$ is the wage level which maximises the return for each person in the group who decided to build the new farm. This is what Thünen called the natural wage.
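The maximisation step is easy to check. Writing $w = a + y$, the return per person becomes $p + a - w - ap/w$, whose derivative $-1 + ap/w^2$ vanishes at $w = \sqrt{ap}$. Here is a minimal numerical sketch of the same thing in R; the values of `a` and `p` are made up for illustration and are not taken from von Thünen.

```r
# Hypothetical values, in bushels of rye per year
a <- 100   # subsistence requirement
p <- 400   # annual output of one farm

# Annual return per member of the founding group, as a function of the surplus y
return_per_member <- function(y) y * (p - a - y) / (a + y)

# Search for the maximising surplus on the interval (0, p - a)
opt <- optimize(return_per_member, interval = c(0, p - a), maximum = TRUE)

natural_wage_numeric <- a + opt$maximum
natural_wage_formula <- sqrt(a * p)

c(numeric = natural_wage_numeric, formula = natural_wage_formula)
```

Both numbers should agree up to the tolerance of `optimize()`.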

Objections

The first problem with this is that it’s not clear what the natural wage actually means. It can’t be the wage that would hold in some sort of long-run equilibrium, because there’s nothing to stop the landowners from paying a surplus $y=0$ in the first place, and preventing the whole process from happening.

The key passage seems to be in Section 15 of the book:

The determination of wages is here placed in the hands of the workers themselves, and the wage that results from the workers’ decision is, as shown earlier, the norm for the whole isolated state.

In fixing their wage, the workers’ discretion meets no limit other than that of their own interest.

In other words, the crowdfunders are free to choose $a+y$, the wage they pay themselves but must also pay the labourer who works on their new farm, and it’s in everybody’s best interest to choose the $y$ that maximises the annual return per crowdfunder (which is also the return the extra labourer would get if they invested their own surplus $y$ in a similar crowdfunding venture).

So the natural wage is the wage that people would choose if they were playing the roles of landowners and labourers at the same time and were trying to maximise the annual return on their investment. As Dickinson says, it’s a normative result (a statement about what should be, not a prediction).

A further objection is discussed by Blaug. It seems that this analysis neglects the time element. It’s not clear how long it would take to set up the new farm and how much interest would be foregone in the meantime. It’s also not clear that people would want to maximise the return on one year of investment. Wouldn’t they want to maximise their lifetime return instead? What if they save the extra income from the new farm and invest in a new crowdfunding project? The mathematics seems to get more complicated. But I don’t think that this argument necessarily means that $\sqrt{ap}$ is bad as a first approximation to what’s going on.

In some ways it’s understandable that an enlightened landowner like von Thünen would find this formula beguiling. It’s mathematically very neat (the natural wage is the geometric mean of the minimum possible wage and the maximum possible wage) and the natural wage turns out to be surprisingly low, which I suppose means that he could pay his workers relatively little without being troubled by his conscience.

Natural Wage with Capital

In the course of trying to understand the derivation of the natural wage, I wondered what would happen if capital was included as a separate factor of production. The isolated state has a production function $Y = pL$ which is just a Cobb-Douglas function in which capital $K$ is in fixed proportion to labour $L$, so what happens if you copy the same argument for a more general production function?

Suppose that there is a technology which takes capital $K$ and labour $L$ as inputs and produces an output of $Y = f(K, L)$. Suppose wages are not determined by the marginal product of labour, but instead workers earn a wage $w = a+y$ where $a$ is a subsistence wage and $y$ is a surplus. Suppose capital can be rented at a rate $r$. Suppose a group of workers crowdfund a new production facility by saving their surpluses $y$ to rent $K_0$ units of capital at rate $r$ and $L_0$ units of labour at rate $a$ (the subsistence wage). The total number of workers required to crowdfund and set up the new facility is

\[L_0 + \frac{rK_0 + aL_0}{y}.\]

Once the new facility is set up, the owners can rent $K_1$ units of capital and $L_1$ units of labour (at rate $w = a+y$) to produce $f(K_1, L_1)$. The profit per owner is therefore

\[\frac{f(K_1, L_1) - rK_1-aL_1-yL_1}{L_0 + \frac{rK_0 + aL_0}{y}}\]

Following von Thünen, we can maximise this with respect to $y$. Setting the derivative to zero gives

\[-L_1\left(L_0 + \frac{rK_0+aL_0}{y}\right) + \\ (f(K_1, L_1) - rK_1-aL_1-yL_1)\frac{rK_0 + aL_0}{y^2} = 0\]

which yields

\[-L_0L_1y^2 -2L_1(rK_0 +aL_0)y + \\ (rK_0+aL_0)(f(K_1,L_1)-rK_1-aL_1) = 0\]

and

\[y = \frac{-(rK_0+aL_0)}{L_0} \pm \\ \sqrt{\frac{(rK_0+aL_0)^2}{L_0^2} + \frac{(rK_0+aL_0)}{L_0L_1}(f(K_1,L_1)-rK_1-aL_1)}.\]

To make progress, note that the factor

\[f(K_1, L_1) - rK_1 - aL_1\]

has no global maximum in general.² However, since the workers who crowdfunded the new facility need to hire capital and labour before they can produce, it makes sense that they cannot hire more than $rK_0 + aL_0$ worth in total. Therefore, let’s assume that $K_1 = K_0$ and $L_1 = L_0$. The equation for $y$ simplifies to

\[w = y+a = \frac{1}{L_0}\left(\sqrt{(rK_0+aL_0)f(K_0, L_0)} - rK_0\right).\]

Note that if $K_0 = 0$ and $f(K, L) = pL$ then this reduces to $w = \sqrt{ap}$.
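As a sanity check on the algebra, the closed form can be compared with a direct numerical maximisation of the profit per owner. The sketch below assumes $K_1 = K_0$ and $L_1 = L_0$ as in the text; the helper name `crowdfunded_wage` and all parameter values are hypothetical and only serve the illustration.

```r
# Closed form: w = (sqrt((r*K0 + a*L0) * f(K0, L0)) - r*K0) / L0
crowdfunded_wage <- function(f, K0, L0, r, a) {
  setup_cost <- r * K0 + a * L0
  (sqrt(setup_cost * f(K0, L0)) - r * K0) / L0
}

# Check 1: no capital and f(K, L) = p * L should give sqrt(a * p)
a <- 100
p <- 400
w_linear <- crowdfunded_wage(function(K, L) p * L, K0 = 0, L0 = 5, r = 0.05, a = a)

# Check 2: maximise the profit per owner numerically for a Cobb-Douglas technology
f_cd <- function(K, L) 10 * K^(1/3) * L^(2/3)
K0 <- 8; L0 <- 27; r <- 1; a_cd <- 1
profit_per_owner <- function(y) {
  (f_cd(K0, L0) - r * K0 - a_cd * L0 - y * L0) / (L0 + (r * K0 + a_cd * L0) / y)
}
y_max <- (f_cd(K0, L0) - r * K0 - a_cd * L0) / L0   # surplus at which profit hits zero
opt <- optimize(profit_per_owner, interval = c(0, y_max), maximum = TRUE)

c(linear_case = w_linear,
  numeric_cd  = a_cd + opt$maximum,
  formula_cd  = crowdfunded_wage(f_cd, K0, L0, r, a_cd))
```

The first entry should equal $\sqrt{ap}$, and the last two should agree with each other up to the tolerance of `optimize()`.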

Now suppose $f(K,L) = AK^\alpha L^{1-\alpha}$ is Cobb-Douglas. Suppose that the factor price of capital is equal to its marginal product. Let $Y=f(K_0, L_0)$. Then $\alpha Y = rK_0$ and

\[w = \sqrt{\alpha\left(\frac{Y}{L_0}\right)^2 + a\left(\frac{Y}{L_0}\right)} -\alpha\frac{Y}{L_0}.\]

If $a$ is small and $Y/L_0$ is relatively large then the first term under the square root sign will dominate, and so

\[w \approx (\sqrt{\alpha} - \alpha)Y/L_0.\]

In contrast, if labour is paid its marginal product then the wage will be

\[w^* = (1-\alpha)Y/L_0\]

which means that if workers and capital-owners are effectively the same people, it will be in their interest to pay themselves a wage which is lower than the marginal product of their labour by a factor of

\[\frac{\sqrt{\alpha} - \alpha}{1-\alpha}.\]

If $\alpha = 1/3$, which is a standard choice of parameter in the Cobb-Douglas function because it provides a good fit to the economy of early twentieth-century America, then

\[\frac{\sqrt{\alpha} - \alpha}{1-\alpha} = \frac{\sqrt{3}-1}{2} \approx 0.366.\]
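The arithmetic can be reproduced in a few lines of R. The sketch also compares the exact Cobb-Douglas wage with the small-$a$ approximation; the function names and the values of `y_per_l` and `a` are hypothetical and chosen only so that $a$ is small relative to $Y/L_0$.

```r
alpha <- 1/3

# Ratio of the crowdfunded wage to the marginal-product wage
ratio <- (sqrt(alpha) - alpha) / (1 - alpha)
ratio

# Exact wage versus the approximation that drops the subsistence term
exact_wage  <- function(y_per_l, a, alpha) sqrt(alpha * y_per_l^2 + a * y_per_l) - alpha * y_per_l
approx_wage <- function(y_per_l, alpha) (sqrt(alpha) - alpha) * y_per_l

y_per_l <- 1000   # hypothetical output per worker
a <- 1            # hypothetical subsistence wage, small relative to y_per_l

w_exact  <- exact_wage(y_per_l, a, alpha)
w_approx <- approx_wage(y_per_l, alpha)
c(exact = w_exact, approx = w_approx, relative_gap = (w_exact - w_approx) / w_approx)
```

The exact wage is always slightly above the approximation, because the subsistence term under the square root is positive.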

Wages and Productivity

After making this calculation, I wondered how closely it fitted the facts. There’s a famous graph showing a disconnect between wages and productivity beginning in the early 1970s. One version is shown in a FRED blog post from 2023.

The graph plots an index of US real wages and an index of US productivity measured as GDP/hours worked. I used R to look at what happened to the relationship between real wages and productivity before and after the election of Ronald Reagan at the end of 1980. The FRED data is saved as fredgraph.xlsx.

library(readxl)
df <- read_excel("fredgraph.xlsx", sheet = "Quarterly")
df$after_r <- (df$observation_date >= "1980-10-01 UTC")

# fit regression lines wages ~ productivity
m1 <- lm(COMPNFB_CPIAUCSL ~ OPHNFB_NBD19700101, data=df[df$after_r <= 0,])
m2 <- lm(COMPNFB_CPIAUCSL ~ OPHNFB_NBD19700101, data=df[df$after_r > 0, ])

# plot colours
cols <- rep(rgb(0, 100/255, 0, 0.2), nrow(df))
cols[df$after_r] <- rgb(1, 0, 0, 0.2)

plot(df$OPHNFB_NBD19700101, df$COMPNFB_CPIAUCSL, col=cols,
     pch=19, xlab="Productivity", ylab="Wages", las=1, cex=1.5)
abline(coef(m1), col="darkgreen", lty=2, lwd=2)
abline(coef(m2), col="red", lty=2, lwd=2)
legend("topleft", lty=2, col=c("darkgreen", "red"), lwd=2,
       legend=c("Before Reagan", "After Reagan"))

The ratio of the slope of the After Reagan line to the slope of the Before Reagan line is

coef(m2)[2]/coef(m1)[2]
# OPHNFB_NBD19700101 
#          0.3799075

which is remarkably close to the value $\frac{\sqrt{3}-1}{2} \approx 0.366$ predicted by the modified Thünen model.³

In other words, the relationship between wages and productivity after 1980 behaves almost exactly as though people have stopped getting all their income from work and started acting as if they are simultaneously workers and business owners. Why? I found one possible explanation in Debt: the first 5000 years by David Graeber. This was an extremely entertaining read full of bizarre and provocative ideas, one of which, proposed in the last chapter of the book, is the suggestion that neoliberalism is in some sense Marxist. In discussing the economics of Reagan and Thatcher, Graeber says:

All of this is not to say that the people of the world were not being offered something: just that, as I say, the terms had changed. In the new dispensation, wages would no longer rise, but workers were encouraged to buy a piece of capitalism. Rather than euthanize the rentiers, everyone could now become rentiers—effectively, could grab a chunk of the profits created by their own increasingly dramatic rates of exploitation.

In other words, deregulating markets and selling off state assets is one way of trying to put the means of production into the hands of workers.

Unfortunately, it seems that there is no way of doing this without creating massive inequality. Selling the assets would work perfectly if every potential buyer had the same level of wealth, but they don’t. Other approaches may be even worse. In the 1990s Russia tried voucher privatisation, which resulted in former state-owned industries falling into the hands of criminal gangs. And even earlier, the Russian revolutionaries used violence to seize private industries and give them to the workers themselves, but that ended in a similar result. In New Zealand our government always said that it wanted the shares of sold-off state-owned assets to go into the hands of Mums and Dads. Why Mums and Dads? Because the wealthier Mums and Dads were old. They were the ones who had accumulated enough lifetime wealth from wages to be able to buy the assets. Well, they did buy the assets. They bought all of them. Then they bought all the houses too.

In the US, people have started talking about a K-shaped economy in which the gap between rent-collecting asset owners and non-asset owning wage-earners grows at an ever-increasing pace. It’s interesting that a discredited nineteenth-century theory of feudalism fits such an economy so well.

1: “If we wish to picture the origin of capital, and the state of society in which a person with no capital can subsist by labour alone and even create some capital, we must transport ourselves in thought to the tropics.” It’s not clear exactly why.

2: This is because $a$ is not equal to the marginal product of labour, so the payments to factors of production do not exhaust the whole of the output.

3: Since the wage index has the form $\beta w + \gamma$ for some $\beta$ and $\gamma$ and the productivity index has the form $\beta'(Y/L) + \gamma'$ for some $\beta'$ and $\gamma'$, the predicted value for the ratio of the slopes of the two regression lines is $\frac{\sqrt{\alpha}-\alpha}{1-\alpha}$.

Continue reading: Thuenen’s Natural Wage and the K-shaped Economy

{talib}: Candlestick Pattern Recognition in R

Serkan Korkmaz — Sun, 16 Nov 2025 20:06:15 +0000

[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

{talib} is a new R package for Technical Analysis (TA) and Candlestick Pattern Recognition (yes, the patterns traders bet their life savings on). In this post I will show a basic example of how {talib} works, and how it compares performance-wise with {TTR}.

Basic example

In this example I will identify all ‘Harami’ patterns, and calculate the Bollinger Bands of the SPDR S&P 500 ETF (SPY).

Identify Harami patterns

x <- talib::harami(
  talib::SPY
)

talib::harami() is an S3 function and returns a matrix of the same length as the input. The number of identified patterns can be counted as the number of non-zero entries.

cat(
  "identified patterns:",
  sum(x[, 1] != 0, na.rm = TRUE)
)
#> identified patterns: 35

The Harami pattern can be bullish (1) or bearish (-1), and each type can be counted the same way:

cat(
  "identified bullish patterns:",
  sum(x[, 1] == 1, na.rm = TRUE)
)
#> identified bullish patterns: 20

cat(
  "identified bearish patterns:",
  sum(x[, 1] == -1, na.rm = TRUE)
)
#> identified bearish patterns: 15
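The three counts can also be obtained in a single pass with base R. The sketch below does not call {talib} at all: `signal` is a made-up vector standing in for the first column of the pattern matrix, using the same 1 / -1 / 0 coding described above.

```r
# Toy signal column (hypothetical data): 1 = bullish, -1 = bearish, 0 = no pattern
signal <- c(0, 1, 0, -1, NA, 1, 0, 0, -1, 1)

# Tabulate all three codes at once; table() drops NA values by default
counts <- table(
  factor(signal, levels = c(-1, 0, 1), labels = c("bearish", "none", "bullish"))
)
counts
```

Fixing the factor levels up front means a category with no matches still shows up with a count of zero.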

Charting

The Harami pattern can be plotted using talib::chart() with talib::bollinger_bands() to add Bollinger Bands to the chart.

{
  talib::chart(talib::SPY)
  talib::indicator(talib::harami)
  talib::indicator(talib::bollinger_bands)
}

Benchmarks

An often-asked question about {talib} in relation to {TTR} is what it “brings to the table”. Other than Candlestick Patterns and interactive charts, it brings speed and efficiency.

To demonstrate the difference in speed, I will create a univariate price series with 1 million entries.

set.seed(1903)
x <- runif(n = 1e6, min = 100, max = 150)

The univariate series x will be passed into the Bollinger Bands from each package:

bench::mark(
  talib::bollinger_bands(x),
  TTR::BBands(x),
  min_iterations = 10,
  check = FALSE
)[, c(1, 2, 3, 5)]
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 4
#>   expression                     min   median mem_alloc
#>   <bch:expr>                <bch:tm> <bch:tm> <bch:byt>
#> 1 talib::bollinger_bands(x)   6.65ms   9.07ms    22.9MB
#> 2 TTR::BBands(x)             65.12ms  72.42ms   139.3MB

In this benchmark {talib} is faster, and more memory efficient, than {TTR}.

{talib} is still under development, and will most likely not be submitted to CRAN before next year. Until then it can be installed from GitHub: pak::pak("serkor1/ta-lib-R")

Feel free to stop by the repository here: https://github.com/serkor1/ta-lib-R.

Created on 2025-11-16 with reprex v2.1.1

{talib}: Candlestick Pattern Recognition in R was first posted on November 16, 2025 at 8:06 pm.

Bias, Variance, and Doubly Robust Estimation: Testing The Promise of TMLE in Simulated Data

r on Everyday Is A School Day — Sun, 16 Nov 2025 00:00:00 +0000

[This article was first published on r on Everyday Is A School Day, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Finally understood TMLE’s “doubly robust” property through simulation. It works well when either the outcome OR the treatment model is correct. XGBoost + TMLE captured complex relationships without manual specification. It worked on complex simulated data; would it work in the real world?

Motivations:

I’ve always heard about Targeted Maximum Likelihood Estimation (TMLE), and I’ve read Katherine Hoffman’s blog post several times. I printed her cheat sheet and went through it several times too. Each time I thought I understood it, the next time I found myself questioning my understanding. So, what better way to dive a tad deeper into the machinery behind this, and why it is useful, than to simulate it? Let’s go!

Just to set the context right, we’re going to estimate Average Treatment Effect (ATE) and use g-computation as a standard approach.

Objectives:

What is TMLE?

TMLE is a statistical method used for estimating causal effects in observational studies and clinical trials. It combines elements of machine learning and traditional statistical techniques to provide robust estimates of treatment effects while controlling for confounding variables. TMLE operates in two main steps: first, it estimates the outcome model and the treatment model, and then it uses these models to adjust the treatment effect estimate, targeting the parameter of interest directly. This approach is particularly useful in settings where standard methods may be biased or inefficient, as it allows for the incorporation of flexible machine learning algorithms to improve estimation accuracy. You will often hear the term Doubly Robust used about this method. What’s so robust x 2 about it?

What Does Doubly Robust mean?

Doubly Robust (DR) estimation refers to a statistical property of certain estimators that remain consistent if either the model for the treatment assignment (propensity score) or the model for the outcome is correctly specified, but not necessarily both. In other words, a doubly robust estimator provides two chances for obtaining a valid estimate of the causal effect: if one of the models is misspecified, as long as the other model is correctly specified, the estimator will still yield consistent results. This property is particularly advantageous in observational studies where there may be uncertainty about the correct specification of either model, enhancing the reliability of causal inferences drawn from the data. I didn’t quite understand this until we simulated the data to test this theory. It will, hopefully, be more clear when we go through the simulation. But, wait, what metrics should we use for this? Bias and variance!

What is Bias and Variance?

Bias and variance are two fundamental concepts in statistics and machine learning that describe different sources of error in predictive models. Bias refers to the systematic error that occurs when a model consistently overestimates or underestimates the true value of a parameter. High bias can lead to underfitting, where the model fails to capture the underlying patterns in the data. Variance, on the other hand, refers to the variability of model predictions for different training datasets. High variance can lead to overfitting, where the model captures noise in the training data rather than the true signal. The trade-off between bias and variance is a key consideration in model selection and evaluation, as it affects the overall accuracy and generalizability of predictive models.

The formula for bias is: $$ \text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta $$ Where:

$\hat{\theta}$ is the estimator of the parameter $\theta$
$E[\hat{\theta}]$ is the expected value of the estimator

In pseudo-R code, it would look something like this:

# `original_training_data` and the true value `theta` are assumed to exist
predicted_theta <- numeric(1000)

for (i in seq_along(predicted_theta)) {
  # bootstrap resample of the training data
  training_data <- dplyr::slice_sample(original_training_data, n = nrow(original_training_data), replace = TRUE)
  model <- glm(outcome ~ treatment + confounder, data = training_data)
  # predict with everyone treated vs. everyone untreated
  outcome_hat_1 <- predict(model, newdata = dplyr::mutate(training_data, treatment = 1))
  outcome_hat_0 <- predict(model, newdata = dplyr::mutate(training_data, treatment = 0))
  predicted_theta[i] <- mean(outcome_hat_1) - mean(outcome_hat_0)
}

bias <- mean(predicted_theta) - theta

In my own words, bias is how far our estimate is, on average, from the true value.

The formula for variance is: $$ \text{Var}(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2] $$

Where:

$\hat{\theta}$ is the estimator of the parameter $\theta$
$E[\hat{\theta}]$ is the expected value of the estimator

In pseudo-R code, it would look something like this:

# `original_training_data` is assumed to exist
predicted_theta <- numeric(1000)

for (i in seq_along(predicted_theta)) {
  # bootstrap resample of the training data
  training_data <- dplyr::slice_sample(original_training_data, n = nrow(original_training_data), replace = TRUE)
  model <- glm(outcome ~ treatment + confounder, data = training_data)
  # predict with everyone treated vs. everyone untreated
  outcome_hat_1 <- predict(model, newdata = dplyr::mutate(training_data, treatment = 1))
  outcome_hat_0 <- predict(model, newdata = dplyr::mutate(training_data, treatment = 0))
  predicted_theta[i] <- mean(outcome_hat_1) - mean(outcome_hat_0)
}

variance <- mean((predicted_theta - mean(predicted_theta))^2)
# or, with the (n - 1) denominator, which is almost identical for 1000 draws
variance <- var(predicted_theta)
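For a version that actually runs, here is a small self-contained sketch in base R. Everything in it is made up for illustration: the data-generating process, the true effect `theta = 2`, and the use of `lm()` on a continuous outcome instead of the models used later in this post.

```r
set.seed(42)

# Simulated data with a known treatment effect (all names are hypothetical)
n <- 500
theta <- 2                                   # true effect of treatment on outcome
confounder <- rnorm(n)
treatment  <- rbinom(n, 1, plogis(confounder))
outcome    <- 1 + theta * treatment + 1.5 * confounder + rnorm(n)
original_training_data <- data.frame(outcome, treatment, confounder)

n_boot <- 200
predicted_theta <- numeric(n_boot)

for (i in seq_len(n_boot)) {
  # bootstrap resample, refit, then predict under treatment = 1 and treatment = 0
  idx <- sample.int(n, n, replace = TRUE)
  training_data <- original_training_data[idx, ]
  model <- lm(outcome ~ treatment + confounder, data = training_data)
  outcome_hat_1 <- predict(model, newdata = transform(training_data, treatment = 1))
  outcome_hat_0 <- predict(model, newdata = transform(training_data, treatment = 0))
  predicted_theta[i] <- mean(outcome_hat_1) - mean(outcome_hat_0)
}

bias <- mean(predicted_theta) - theta
variance <- var(predicted_theta)
c(bias = bias, variance = variance)
```

Because the model here matches how the data were generated, the bias should be small; dropping `confounder` from the formula is a quick way to see it grow.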

We will be using bias and variance to test the doubly robust theory. But first, let’s simulate some data!

Simulate Data

library(tidyverse)

set.seed(1)

n <- 10000
W1 <- rnorm(n)
W2 <- rnorm(n)
W3 <- rbinom(n, 1, 0.5)
W4 <- rnorm(n)

# TRUE propensity score model
A <- rbinom(n, 1, plogis(-0.5 + 0.8*W1 + 0.5*W2^2 + 0.3*W3 - 0.4*W1*W2 + 0.2*W4))

# TRUE outcome model
Y <- rbinom(n, 1, plogis(-1 + 0.2*A + 0.6*W1 - 0.4*W2^2 + 0.5*W3 + 0.3*W1*W3 + 0.2*W4^2))

# Calculate TRUE ATE
logit_Y1 <- -1 + 0.2 + 0.6*W1 - 0.4*W2^2 + 0.5*W3 + 0.3*W1*W3 + 0.2*W4^2
logit_Y0 <- -1 + 0 + 0.6*W1 - 0.4*W2^2 + 0.5*W3 + 0.3*W1*W3 + 0.2*W4^2

Y1_true <- plogis(logit_Y1)
Y0_true <- plogis(logit_Y0)
true_ATE <- mean(Y1_true - Y0_true)

df <- tibble(W1 = W1, W2 = W2, W3 = W3, W4 = W4, A = A, Y = Y,
           true_ATE = true_ATE, Y1_true = Y1_true, Y0_true = Y0_true)

Alright, our true ATE here is 0.0373518. We’ll see whether a doubly robust method is able to estimate this when either the outcome or the treatment model is misspecified.

Write a Function to Estimate

Let’s look at the WRONG Outcome model

model <- glm(Y ~ A + W1 + W2 + W3 + W4, family = "binomial")

summary(model)

## 
## Call:
## glm(formula = Y ~ A + W1 + W2 + W3 + W4, family = "binomial")
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.045765   0.041489 -25.206   <2e-16 ***
## A           -0.050142   0.047732  -1.050    0.293    
## W1           0.767386   0.026058  29.449   <2e-16 ***
## W2          -0.024726   0.022807  -1.084    0.278    
## W3           0.561572   0.045658  12.300   <2e-16 ***
## W4          -0.003209   0.022382  -0.143    0.886    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12737  on 9999  degrees of freedom
## Residual deviance: 11519  on 9994  degrees of freedom
## AIC: 11531
## 
## Number of Fisher Scoring iterations: 4

Ouch. A quick look at the coefficient of A shows -0.0501418. It is on the log-odds scale, so it is not directly comparable to the ATE, but its sign is the opposite of the true effect. Alright, let’s look at g-computation and see if it returns the same result.

g-computation function

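g_comp <- function(model, data, ml = FALSE) {
  if (ml) {
    # tidymodels workflow: take the predicted probability of the second class (Y = 1)
    y1 <- predict(model, new_data = data |> mutate(A = as.factor(1)), type = "prob")[, 2] |> pull()
    y0 <- predict(model, new_data = data |> mutate(A = as.factor(0)), type = "prob")[, 2] |> pull()
  } else {
    # glm: predicted probabilities with everyone treated vs. everyone untreated
    y1 <- predict(model, newdata = data |> mutate(A = 1), type = "response")
    y0 <- predict(model, newdata = data |> mutate(A = 0), type = "response")
  }
  # average difference in predicted outcomes = estimated ATE
  mean(y1 - y0)
}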

g_comp(model,df)

## [1] -0.009823307

Yup, incorrect! Now what if we use the RIGHT Outcome model?

The correct outcome model

model <- glm(Y ~ A + W1 + I(W2^2) + W3 + W1:W2 + W4, family = "binomial")

g_comp(model,df)

## [1] 0.03576854

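Wow! Look at that! With the quadratic W2 term in the outcome model, the estimate is VERY close to the true ATE! (Strictly speaking, the data-generating outcome model uses W1*W3 and W4^2 rather than the W1:W2 and W4 terms fitted here, so this specification is close to, but not exactly, the true one.)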

The wrong treatment model

Now what if we use IPW, but with the wrong treatment model? Let’s see if we can estimate the ATE.

ps_model <- glm(A ~ W1 + W2 + W3 + W4, family = "binomial")  
ps <- ps_model$fitted.values
ps_final <- pmax(pmin(ps, 0.95), 0.05)
weights <- ifelse(A == 1, 1/ps_final, 1/(1-ps_final))

model <- glm(Y ~ A, family = "binomial", weights = weights)
g_comp(model,df)

## [1] -0.0095777

Wow, very wrong indeed! Now let’s look at the right treatment model.

The Correct treatment model

ps_model <- glm(A ~ W1 + I(W2^2) + W3 + W1:W2 + W4, family = "binomial")  #-0.5 + 0.8*W1 + 0.5*W2^2 + 0.3*W3 - 0.4*W1*W2 + 0.2*W4

ps <- ps_model$fitted.values
ps_final <- pmax(pmin(ps, 0.95), 0.05)
weights <- ifelse(A == 1, 1/ps_final, 1/(1-ps_final))

model <- glm(Y ~ A, family = "binomial", weights = weights)
g_comp(model,df)

## [1] 0.03874379

Not too shabby! Very close to our true ATE! To be honest, how on earth are we supposed to know beforehand which complex equation to specify for either the treatment or the outcome model!?
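The IPW recipe used above (fit a propensity model, clip the scores, weight each unit by the inverse probability of the treatment it received) can also be written as a plain weighted difference in means, which is what a weighted glm containing only A boils down to. Here is a minimal self-contained sketch of that idea on a separate toy dataset with one confounder and a continuous outcome; the data-generating process and the true ATE of 2 are assumptions made up for this illustration, not the simulation above.

```r
set.seed(123)

n <- 20000
W <- rnorm(n)                              # single confounder
A <- rbinom(n, 1, plogis(0.8 * W))         # treatment assignment depends on W
Y <- 1 + 2 * A + 1.5 * W + rnorm(n)        # true ATE is 2 by construction

# Naive comparison of group means ignores confounding
naive <- mean(Y[A == 1]) - mean(Y[A == 0])

# Propensity scores from a correctly specified treatment model, clipped as above
ps <- fitted(glm(A ~ W, family = binomial))
ps <- pmax(pmin(ps, 0.95), 0.05)
w  <- ifelse(A == 1, 1 / ps, 1 / (1 - ps))

# Weighted difference in means between treated and untreated units
ipw <- weighted.mean(Y[A == 1], w[A == 1]) - weighted.mean(Y[A == 0], w[A == 0])

c(naive = naive, ipw = ipw)
```

The naive difference should overshoot the true effect because treated units tend to have larger W, while the weighted version should land close to 2.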

Let’s Try ML xgboost

Let’s see if xgboost can tease out the outcome model without us specifying all these weird interactions and quadratic relationships.

library(tidymodels)
library(doParallel)  
library(future)

workers <- parallel::detectCores(logical = FALSE) - 1
plan(multisession, workers = workers)
future::nbrOfWorkers() 

# Set up parallel processing
all_cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(all_cores - 1)  # Leave 1 core free
registerDoParallel(cl)

df_ml <- df |> 
  select(Y,A,W1,W2,W3,W4) |>
  mutate(Y = as.factor(Y),
         A = as.factor(A))

# Define model specification
xgb_spec <- boost_tree(
  trees = 1000,
  tree_depth = tune(),
  min_n = tune(),
  loss_reduction = tune(),
  # sample_size = tune(),
  mtry = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")  

# Create workflow
xgb_wf <- workflow() %>%
  add_model(xgb_spec) %>%
  add_formula(Y ~ .)

# Tuning grid
xgb_grid <- grid_space_filling(
  tree_depth(),
  min_n(),
  loss_reduction(),
  finalize(mtry(),df_ml),
  learn_rate(),
  size = 20
)

# Cross-validation and tuning
set.seed(1)
folds <- vfold_cv(df_ml, v = 5)

xgb_res <- tune_grid(
  xgb_wf,
  resamples = folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE,
                         parallel_over = "everything")
)

# Select best model
best_xgb <- select_best(xgb_res, metric = "roc_auc")

# Finalize and fit
final_xgb <- finalize_workflow(xgb_wf, best_xgb)
final_fit <- fit(final_xgb, data = df_ml)

# g-comp
g_comp(final_fit, df_ml, T)

## [1] 0.03447109

Wow, nice! Quite close to our true ATE without specifying any interactions or quadratic relationships. Mind you, this dataset is quite large.

Now let’s see whether we can use xgboost to build an accurate treatment model and plug its weights into our good trusty glm.

# Recreate workflow
xgb_wf <- workflow() %>%
  add_model(xgb_spec) %>%
  add_formula(A ~ .)

# Tuning grid
xgb_grid <- grid_space_filling(
  tree_depth(),
  min_n(),
  loss_reduction(),
  finalize(mtry(),df_ml |> select(-Y)),
  learn_rate(),
  size = 20
)

# Cross-validation and tuning
set.seed(1)
folds <- vfold_cv(df_ml |> select(-Y), v = 5)

xgb_res <- tune_grid(
  xgb_wf,
  resamples = folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE,
                         parallel_over = "everything")
)

# Select best model
best_xgb <- select_best(xgb_res, metric = "roc_auc")

# Finalize and fit
final_xgb <- finalize_workflow(xgb_wf, best_xgb)
final_fit <- fit(final_xgb, data = df_ml |> select(-Y))

# calc ps
ps <- predict(final_fit, new_data = df_ml |> select(-Y), type = "prob")[,2] |> pull()
ps_final <- ps
weights <- ifelse(A == 1, 1/ps_final, 1/(1-ps_final))

# glm model 
model <- glm(Y ~ A, family = "binomial", weights = weights)
g_comp(model, df_ml)

## [1] 0.04840169

Wow, this estimate is much closer to our true ATE than the one from the wrongly specified treatment model. It’s still quite biased, though, isn’t it? At 0.048 it’s some way off the true ATE of 0.037. But at least we know ML methods can probably handle these complex relationships.

Now let’s write a function to estimate bias and variance, since that’s our major question. And then we’ll look into the TMLE procedure.

How to access HomeAssistant’s InfluxDB from R

rstats-tips.net — Sun, 16 Nov 2025 00:00:00 +0000

[This article was first published on rstats-tips.net, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I’m running a HomeAssistant instance at home. I’ve configured it to log data into an InfluxDB database, so I can retrieve historical data for analysis later on. In default mode HomeAssistant would aggregate historical data for storage reasons.

So now I want to access the InfluxDB database from R to perform custom analyses. HomeAssistant is still using InfluxDB version 1. To connect to InfluxDB from R, I thought I could use the influxdbr package. But I got some errors because this package seems to be outdated.

Here’s what I get when fetching data:

library(influxdbr)
library(tidyverse)

influx_con <- influx_connection(
  scheme = "http", host = "my_influx_host", port = 8086,
  user = "my_influx_user", pass = "my_influx_pass"
)

query <- 'SELECT entity_id, value FROM "autogen"."°C" WHERE time > now() - 5d'

influx_query(influx_con,
             db = "my_influx_db",
             query = query
)

The error message is:

Error in `purrr::map()`:
ℹ In index: 1.
Caused by error in `purrr::map()`:
ℹ In index: 1.
ℹ With name: results.
Caused by error in `purrr::map()`:
ℹ In index: 1.
Caused by error:
! The `validate` argument of `as_tibble()` was deprecated in tibble 2.0.0 and is now defunct.
ℹ Please use the `.name_repair` argument instead.
Run `rlang::last_trace()` to see where the error occurred.

It seems that the influxdbr package is not compatible with the current version of the tibble package. The latest commit is 5 years old.

So I decided to use the httr package to send HTTP requests directly to the InfluxDB API. Here’s how to do it:

library(tidyverse)
library(httr)
library(jsonlite)

query <- 'SELECT entity_id, value FROM "autogen"."°C" WHERE time > now() - 5d'

# "my_influx_host" is a placeholder for the base URL, including scheme and port
response <- GET(
  url = paste0("my_influx_host", "/query"),
  query = list(
    db = "my_influx_db",
    q = query,
    u = "my_influx_user",
    p = "my_influx_pass"
  ),
  config = config(ssl_verifypeer = TRUE)
)

# Stop on HTTP errors so data_raw is never left undefined
stop_for_status(response)

data_raw <- content(response, as = "text", encoding = "UTF-8") %>%
  fromJSON(flatten = TRUE)

The data structure looks a little bit complicated:

data_raw %>% str()

## List of 1
##  $ results:'data.frame':	1 obs. of  2 variables:
##   ..$ statement_id: int 0
##   ..$ series      :List of 1
##   .. ..$ :'data.frame':	1 obs. of  3 variables:
##   .. .. ..$ name   : chr "°C"
##   .. .. ..$ columns:List of 1
##   .. .. .. ..$ : chr [1:3] "time" "entity_id" "value"
##   .. .. ..$ values :List of 1
##   .. .. .. ..$ : chr [1:44176, 1:3] "2025-11-11T13:20:31.171165Z" "2025-11-11T13:20:43.741052Z" "2025-11-11T13:20:43.741159Z" "2025-11-11T13:20:44.552422Z" ...

But this is where the magic of the tidyverse starts. The interesting data is stored in data_raw$results$series[[1]]$values[[1]]. We can extract it with pluck() and convert it to a data.frame:

data_raw_df <- data_raw %>%
  pluck("results", "series", 1, "values", 1) %>% 
  as.data.frame(stringsAsFactors = FALSE)

data_raw_df %>%
  head(10)

##                             V1      V2      V3
## 1  2025-11-11T13:20:31.171165Z  Temp_5  25.875
## 2  2025-11-11T13:20:43.741052Z Temp_11  30.625
## 3  2025-11-11T13:20:43.741159Z Temp_10   48.75
## 4  2025-11-11T13:20:44.552422Z  Temp_8   28.25
## 5  2025-11-11T13:20:44.552522Z  Temp_9   28.25
## 6  2025-11-11T13:20:50.225514Z  Temp_2    19.8
## 7   2025-11-11T13:21:07.61594Z Temp_15 26.5625
## 8  2025-11-11T13:21:07.616039Z Temp_14 22.9375
## 9  2025-11-11T13:21:10.101543Z Temp_13   26.75
## 10 2025-11-11T13:21:10.911298Z Temp_12    23.5

Now we can change the column names (they are stored in data_raw %>% pluck("results", "series", 1, "columns", 1)) and convert the time column to POSIXct:

data_raw_df %>%
  set_names(data_raw %>% pluck("results", "series", 1, "columns", 1)) %>%
  mutate(
    time = as.POSIXct(time, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC"),
    value = as.numeric(value)
  ) %>%
  str()

## 'data.frame':	44176 obs. of  3 variables:
##  $ time     : POSIXct, format: "2025-11-11 13:20:31" "2025-11-11 13:20:43" ...
##  $ entity_id: chr  "Temp_5" "Temp_11" "Temp_10" "Temp_8" ...
##  $ value    : num  25.9 30.6 48.8 28.2 28.2 ...
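
The steps above can be bundled into one small function. The following is only a sketch: influx_to_df() is a hypothetical helper name, the JSON string is a made-up two-row response with the same shape as the str() output shown earlier, and the function assumes a single statement returning a single series with time and value columns. It uses base R plus jsonlite so that it runs without a live InfluxDB server.

```r
library(jsonlite)

# Hypothetical helper (not from the original post): turn a parsed InfluxDB 1.x
# /query response into a typed data frame. Assumes one statement that returns
# one series containing "time" and "value" columns.
influx_to_df <- function(parsed) {
  series <- parsed$results$series[[1]]

  # The values arrive as a character matrix, the column names separately
  df <- as.data.frame(series$values[[1]], stringsAsFactors = FALSE)
  names(df) <- series$columns[[1]]

  df$time <- as.POSIXct(df$time, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")
  df$value <- as.numeric(df$value)
  df
}

# Made-up response for illustration only
mock_json <- '{"results":[{"statement_id":0,"series":[{"name":"temp",
  "columns":["time","entity_id","value"],
  "values":[["2025-11-11T13:20:31.171165Z","Temp_5",25.875],
            ["2025-11-11T13:20:43.741052Z","Temp_11",30.625]]}]}]}'

df <- influx_to_df(fromJSON(mock_json, flatten = TRUE))
str(df)
```

A query with a GROUP BY clause can return several series; in that case the helper would have to loop over parsed$results$series[[1]] row by row instead of taking only the first entry.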

To leave a comment for the author, please follow the link and comment on their blog: rstats-tips.net.

unifiedml: A Unified Machine Learning Interface for R, is now on CRAN + Discussion about AI replacing humans

T. Moudiki — Sun, 16 Nov 2025 00:00:00 +0000

[This article was first published on T. Moudiki's Webpage - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

unifiedml is now available on CRAN. Just because. I wanted to see what was new.

We had an interesting discussion about the process of package publishing, and about AI replacing human beings (the type of discussion that I’ll have with you when I’m out of this mysterious jail).

Here is what I had to say about it:

> > I see it has changed :) Just a question for my personal culture:
> > Is this going to be done by AI someday? Because it could/should.
>
> You want to be replaced by AI?

On 11.11.2025 09:23, Thierry Moudiki wrote:
> The question of whether one would want to be replaced by AI or not is
> too narrow (in general and in particular in this context).
>
> To me, personally, it's like asking someone, back in the day, whether
> he/she wants to use the wheel or not.
>
> It's already much better than us in many mundane tasks. And it could be
> way, way more objective in this particular context.

On 11.12.2025 00:09, Thierry Moudiki wrote:
I said "AI" to follow up on your argument but this has nothing to do with text models. Not everything has to be AI.

In order to validate or push such a thing in prod, you typically need GitHub Actions and a set of objective rules (e.g what's a critical warning, or a critical note, if there's such a thing...). It's not even AI (see e.g https://github.com/Techtonique/techtonique-r-pkgs). 

Typically, since submitting this package to your highest authority, I've shipped ~5 packages to PyPI (within minutes, and they also check if the code is malicious, or it will be removed) without being bothered by gatekeeping ("this comma is missing", "oh english grammar requires a full stop there"). And this is not an insult, it's food for thought (if you're humble enough to accept it), when Python is eating R at breakfast everyday. 

R seems to be stuck somewhere in a very distant past with discussions like this.

That’s the comment of a foolish guy. If that is foolishness, I want to be foolish forever.

The package, unifiedml, provides a consistent interface for machine learning in R, making it easier to work with different ML algorithms using a unified API.

Why `unifiedml`?

R has an incredibly rich ecosystem of machine learning packages, but each comes with its own syntax and conventions. unifiedml is an effort to bridge this gap by providing:

Extremely lightweight (see for yourself: https://github.com/Techtonique/unifiedml/blob/main/R/model.R) and consistent API across different ML algorithms
Automatic task detection (regression vs classification)
Built-in cross-validation with appropriate metrics
Model interpretation tools including feature importance and partial dependence plots
Seamless integration with existing R packages (once installed) like glmnet, randomForest, and more

Installation

install.packages("unifiedml", repos = "https://cran.r-project.org/")

library(unifiedml)

Quick Start: Regression Example

Here’s how easy it is to build a regression model:

library(glmnet)
data(mtcars)

# Prepare data
X <- as.matrix(mtcars[, -1])
y <- mtcars$mpg  # numeric → automatic regression

# Fit model
mod <- Model$new(glmnet::glmnet)
mod$fit(X, y, alpha = 0, lambda = 0.1)

# Make predictions
predictions <- mod$predict(X)

# Get model summary with feature importance
mod$summary()

# Visualize partial dependence
mod$plot(feature = 1)

# Cross-validation (automatically uses RMSE for regression)
cv_scores <- cross_val_score(mod, X, y, cv = 5)
cat("Mean RMSE:", mean(cv_scores), "\n")
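
To make concrete what a 5-fold cross-validated RMSE measures, here is a base R sketch that computes one by hand. It is not how unifiedml implements cross_val_score(); the linear model mpg ~ wt + hp, the seed, and the fold assignment are all assumptions chosen for illustration.

```r
# Manual 5-fold cross-validation on mtcars with a plain linear model
set.seed(123)
k <- 5

# Randomly assign each row to one of k folds of roughly equal size
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

rmse <- vapply(seq_len(k), function(i) {
  train <- mtcars[folds != i, ]
  test <- mtcars[folds == i, ]
  fit <- lm(mpg ~ wt + hp, data = train)
  pred <- predict(fit, newdata = test)
  # Root mean squared error on the held-out fold
  sqrt(mean((test$mpg - pred)^2))
}, numeric(1))

cat("Mean RMSE:", mean(rmse), "\n")
```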

Quick Start: Classification Example

The same intuitive API works for classification:

library(randomForest)

data(iris)

# Prepare data
X <- as.matrix(iris[, 1:4])
y <- iris$Species  # factor → automatic classification

# Fit model
mod <- Model$new(randomForest::randomForest)
mod$fit(X, y, ntree = 100)

# Make predictions
predictions <- mod$predict(X)

# Cross-validation (automatically uses accuracy for classification)
cv_scores <- cross_val_score(mod, X, y, cv = 5)
cat("Mean Accuracy:", mean(cv_scores), "\n")

Key Features

Consistent Interface: Whether you’re using glmnet, randomForest, xgboost, or any other compatible algorithm, the interface remains the same. This makes it easy to:

Switch between algorithms
Compare model performance
Build reproducible workflows

Automatic Task Detection: No need to specify whether you’re doing regression or classification; unifiedml automatically detects this based on your target variable type.

Model Interpretation (in progress): Built-in tools for understanding your models:

Feature importance rankings
Partial dependence plots
Comprehensive model summaries
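
The idea behind automatic task detection can be illustrated in a few lines of base R. This is a sketch only: detect_task() is a hypothetical function, not part of unifiedml, and it merely mirrors the rule stated in the examples above (numeric target means regression, factor target means classification).

```r
# Hypothetical helper, not unifiedml's implementation: infer the task
# from the type of the target variable
detect_task <- function(y) {
  if (is.factor(y)) {
    "classification"
  } else if (is.numeric(y)) {
    "regression"
  } else {
    stop("Unsupported target type: ", class(y)[1])
  }
}

detect_task(mtcars$mpg)
detect_task(iris$Species)
```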

Note

unifiedml is still very young, so there might be some rough edges. The package will mature over time as I refine the lightweight API (see for yourself: https://github.com/Techtonique/unifiedml/blob/main/R/model.R), expand functionality, and incorporate user feedback.

Feedback Welcome.

To leave a comment for the author, please follow the link and comment on their blog: T. Moudiki's Webpage - R.

MODIS fire

Michael — Sat, 15 Nov 2025 20:00:00 +0000

[This article was first published on r.iresmi.net, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Eagle Creek Fire – CC-BY-NC-SA by Curtis Gregory Perry

Day 15 of 30DayMapChallenge: « Fire » (previously).

An animation of global fires in 2024 using MODIS data.

library(dggridR)
library(dplyr)
library(readr)
library(ggplot2)
library(purrr)
library(sf)
library(rnaturalearth)
library(glue)
library(classInt)

Data

See the docs. You can use a client like Filezilla to download the data.

SFTP: fuoco.geog.umd.edu
Login / Password (as of time of writing): fire / burnt

Get all year 2024 files available in /data/MODIS/C61/MCD14ML.

We’ll use binning on a 250 km discrete global grid.

# countries background
world <- ne_countries(scale = 10) |> 
  st_make_valid() |> 
  st_wrap_dateline()

# build the grid
dggs <- dgconstruct(spacing = 250) 

hex <- dggs |> 
  dgshptogrid(world, cellsize = 0.5) |> 
  st_make_valid() |>
  st_wrap_dateline() |> 
  st_filter(world) |> 
  select(seqnum)

# read all MODIS files and find their grid cell ID
modis <- dir("~/data/modis/", full.names = TRUE) |> 
  read_fwf(
    skip = 1,
    col_types = cols("YYYYMMDD" = col_date(format = "%Y%m%d")),
    fwf_positions(
      c(1, 10, 15, 17, 26, 36, 42, 48, 53, 61, 65, 68),
      c(9, 14, 16, 25, 35, 41, 47, 52, 60, 64, 67, 69),
      c("YYYYMMDD", "HHMM", "sat", "lat", "lon", "T21", "T31", "sample", "FRP", 
        "conf", "type", "dn")),
    num_threads = 10) |> 
  mutate(seqnum = dgGEO_to_SEQNUM(dggs, lon, lat)$seqnum)

Map

We generate one PNG file per day and create the video with a call to a system-installed ffmpeg.

# compute the number of fires for each cell and each day
modis_cells <- modis |> 
  count(seqnum, YYYYMMDD) |> 
  left_join(hex,
            join_by(seqnum)) 

# prepare the breaks
breaks <- classIntervals(modis_cells$n, n = 4, style = "kmeans")

# create a PNG map for one day
create_map <- function(d) {
  p <- modis_cells |> 
    filter(YYYYMMDD == d) |> 
    left_join(hex,
              join_by(seqnum)) |> 
    st_sf() |> 
    ggplot() +
    geom_sf(data = world, fill = "#1a3853", color = "#002240") +
    geom_sf(aes(fill = n, color = n)) +
    scale_fill_viridis_c(aesthetics =  c("colour", "fill"),
                         breaks = round(breaks$brks),
                         transform = "log",
                         name = "Fires\nper cell\n(log\nscale)",
                         option = "B",
                         limits = c(1, max(modis_cells$n))) +
    coord_sf(crs = "EPSG:8857") +
    labs(title = "MODIS fire detection",
         subtitle = d,
         caption = glue("MODIS - Global Monthly Fire Location \\
                        Product (MCD14ML)
                        https://r.iresmi.net/ - {Sys.Date()}")) +
    theme_void() +
    theme(text = element_text(family = "Ubuntu",
                              color = "white"),
          plot.margin = margin(2, 2, 2, 5, unit = "mm"),
          plot.caption = element_text(size = 7,
                                      color = "#777"))
  ggsave(glue("img/fire_{d}.png"), p, bg = "#002240", width = 9, height = 5)
}

# Iterate
modis |> 
  distinct(YYYYMMDD) |> 
  pull(YYYYMMDD) |> 
  walk(create_map, .progress = TRUE)

# generate the video
system(glue('ffmpeg -framerate 24 -pattern_type glob -i "img/fire*.png" \\
            -c:v libx264 -pix_fmt yuv420p modis.mp4'))
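
The script above assumes that the img/ output directory already exists and that ffmpeg is on the PATH. A small pre-flight check can catch both problems before the long rendering loop starts. This is a sketch; prepare_output() is a hypothetical helper name, not something from the original post.

```r
# Hypothetical pre-flight helper: create the output directory if needed
# and report whether an ffmpeg executable can be found on the PATH
prepare_output <- function(dir = "img") {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  ffmpeg_path <- unname(Sys.which("ffmpeg"))
  list(
    dir_ok = dir.exists(dir),
    ffmpeg_found = nzchar(ffmpeg_path)
  )
}

# Demonstrated on a temporary directory so nothing is created in the project
status <- prepare_output(file.path(tempdir(), "img"))
if (!status$ffmpeg_found) {
  message("ffmpeg not found: the PNG files can be rendered, but not the video")
}
```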

Figure 1: Animation of global fires in 2024 using MODIS data

To leave a comment for the author, please follow the link and comment on their blog: r.iresmi.net.

Bayes vs. the Invaders (Redivivus)

moth — Sat, 15 Nov 2025 10:50:27 +0000

[This article was first published on Weird Data Science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Straying still further from whatever dubious graces it once presumed, academia’s luciferous descent into murky realms of huddled speculation continues unabated. Relocated, reconstituted, in ever-fading cycles, the Oxford Internet Institute at the University of Oxford once again saw fit to deprive its students of the peace and comfort of banal rationality through this fourth annual Halloween Lecture.

In the absence of fresh insight, this year’s lecture revisits the chilling implications of humanity’s contact with inexplicable aerial and marine phenomena, drawing from the faintest whispers of the ante-historical record through to the shimmering echoes of statistical reasoning. Through what means do visitors from beyond the void encroach on our night-time skies? What subtle deceptions underpin their visible geometries? What values lie behind their peculiar interests in certain uncomfortably-favoured regions?

Despite all safeguards, and in the face of numerous barely-perceptible currents opposing such efforts, this event was captured, stored, and released into a world still cruelly unprepared to face its findings.

More details, and the underlying code, for these findings can be found–for those unwary enough to look–in the series of entries beginning here.

Oxford Internet Institute Halloween Lecture
Bayes vs. the Invaders (Redivivus): A Bayesian Analysis of 70 Years of UFO Sightings
Prof. Joss Wright
Oxford. October 2025

The searing heat of summer retreats, cools, fades, surrendering its vitality to the flickering uncertainties of autumn. Nights draw close, like dimly-remembered friends clustering in our dreams. The spring leaves abandon their verdant dance, as they age, wither, and drift into a russet swirl of skeletal, wind-stirred fragments.

The dying seasons return, dragging with them time-hallowed fears and uneasy rumours, pooling around these darkly dreaming spires in a mire of primal superstition. The agonizingly brittle certainties of the modern enlightenment, our desperate faith in the gossamer fabrics of scientific progress, falter in the face of primal terrors that lurk implacably in the gloom.

In these darkening days, as faith in our treasured understanding dims, it is yet again time to turn our faces fully to the darkness. Halloween, slouching inexorably towards our minds, impels us as scholars to gather our methods, our theories, our data, our knowledge; and glean what light we can from the primordial glimmers of the unknown.

Unidentifiable aerial and marine phenomena. Impossible lights in the sky. Patterns of visitation and terror. Insidious influences from the hadal voids between the stars. Who–what–swoop and glide through the ink-black nights of our world, probing and testing our structures, our societies, our minds? From barely remembered history, to early reports of impossible objects, to blurrily evidenced documentation, data concerning flying arcane observations has grown and twisted, along with our capacity to lay them bare, to subject them to analysis, and to interrogate their secrets.

In this year’s OII Halloween Lecture, we will tremblingly revisit a Bayesian analysis of seventy years of UFO sightings, drawn from a dataset collected by the National UFO Reporting Center (NUFORC). Scepticism, fear, doubt, and most accepted standards of statistical rigour, will be cast aside in our unyielding and disquieting pursuit of an uncompromised truth.

2025-bayes_vs_invaders-redivivus

To leave a comment for the author, please follow the link and comment on their blog: Weird Data Science.

container: v1.1.0 on CRAN

R some blog — Sat, 15 Nov 2025 00:00:00 +0000

[This article was first published on R some blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The {container} package provides an enhanced version of base R’s list. Version 1.1.0 adds new extract and replace operations for interactive use to both match base R’s list behavior more closely and to introduce new convenient features beyond that.

To leave a comment for the author, please follow the link and comment on their blog: R some blog.

Recapping posit::conf 2025

tshafer.com — Fri, 14 Nov 2025 12:10:00 +0000

[This article was first published on tshafer.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

In September I had the opportunity to give a talk in person at posit::conf in Atlanta, and now the video recording, plus annotated slides and other goodies, are generally available.

This was my first time at a Posit/RStudio conference, and I can’t say enough good things about how it was run and organized. The general atmosphere was super welcoming, and the talks were excellent (with compliments to Articulation’s Acacia Duncan and Blythe Coons, who provided speaker coaching), and it was striking how nearly every talk was given by a practitioner for other practitioners. Almost no vendor-pitching-their-tool talks.

If I can swing it, I would love to attend again next year.

This post and others like it are kindly republished by R-bloggers.

To leave a comment for the author, please follow the link and comment on their blog: tshafer.com.
