R-bloggers

Creating Summary Tables in R with tidyquant and dplyr

Steven P. Sanderson II, MPH — Fri, 26 Jul 2024 04:00:00 +0000

[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

Creating summary tables is a key part of data analysis, allowing you to see trends and patterns in your data. In this post, we’ll explore how to create these tables using tidyquant and dplyr in R. These packages make it easy to manipulate and summarize your data.

Examples

Using tidyquant for Summary Tables

tidyquant is a versatile package that extends the tidyverse for financial and time series analysis. It simplifies working with data by integrating tidy principles.

Example: Calculating Average Price by Month

Here’s an example of how to calculate the average price by month using tidyquant:

# Load necessary libraries
library(tidyquant)
library(dplyr)

# Sample data: Daily stock prices
data <- tibble(
  date = seq(as.Date('2023-01-01'), as.Date('2023-06-30'), by = 'day'),
  price = runif(181, 100, 200)
)

# Create a summary table with average closing price by month
summary_table <- data |>
  mutate(month = floor_date(date, "month")) |>
  pivot_table(
    .rows = month, 
    .values = ~ mean(price, na.rm = TRUE)
  ) |>
  setNames(c("date", "avg_price"))

print(summary_table)

# A tibble: 6 × 2
  date       avg_price
  <date>         <dbl>
1 2023-01-01      149.
2 2023-02-01      162.
3 2023-03-01      151.
4 2023-04-01      151.
5 2023-05-01      145.
6 2023-06-01      149.

In this example:

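tidyquant and dplyr are loaded to handle data manipulation.
We create a sample dataset with daily stock prices. Because runif() draws random values and no seed is set, your averages will differ from the output shown.
The mutate function adds a new column month; floor_date rounds each date down to the first day of its month, so every day in a month shares one value.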
pivot_table calculates the average price for each month.
Finally, we rename the columns for clarity.

Using dplyr for Summary Tables

dplyr is a core tidyverse package known for its powerful data manipulation functions. It helps streamline the process of filtering, summarizing, and mutating data.

Example: Calculating Average Closing Price by Month

Here’s a similar example using dplyr:

# Load necessary libraries
library(dplyr)
library(lubridate)

# Sample data: Daily stock prices
data <- tibble(
  date = seq(as.Date('2023-01-01'), as.Date('2023-06-30'), by = 'day'),
  price = runif(181, 100, 200)
)

# Create a summary table with average closing price by month
summary_table <- data %>%
  mutate(month = floor_date(date, "month")) %>%
  group_by(month) %>%
  summarise(avg_close = mean(price))

print(summary_table)

# A tibble: 6 × 2
  month      avg_close
  <date>         <dbl>
1 2023-01-01      149.
2 2023-02-01      140.
3 2023-03-01      147.
4 2023-04-01      146.
5 2023-05-01      147.
6 2023-06-01      151.

In this dplyr example:

We load dplyr and lubridate for data manipulation and date handling.
The dataset is created the same way, but runif() draws fresh random values, which is why the averages differ from the tidyquant output.
The mutate function is used to add a month column.
We group the data by month using group_by and then calculate the average closing price for each group using summarise.
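
If you want more than one statistic per month, the same group_by and summarise pattern extends naturally. The sketch below is only an illustration: the seed value and the column names n_days, min_close and max_close are my own choices, not part of the original example.

```r
library(dplyr)
library(lubridate)

# Assumption: any fixed seed works; it just makes the random prices repeatable
set.seed(2024)

# Same sample data as above: one random price per day for six months
data <- tibble(
  date  = seq(as.Date("2023-01-01"), as.Date("2023-06-30"), by = "day"),
  price = runif(181, 100, 200)
)

# Several summary columns can be computed in a single summarise() call
summary_table <- data %>%
  mutate(month = floor_date(date, "month")) %>%
  group_by(month) %>%
  summarise(
    n_days    = n(),                        # number of daily rows in the month
    avg_close = mean(price, na.rm = TRUE),
    min_close = min(price, na.rm = TRUE),
    max_close = max(price, na.rm = TRUE)
  )

print(summary_table)
```

Counting the rows per group with n() is a cheap sanity check: the six months should report 31, 28, 31, 30, 31 and 30 days.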

Your Turn!

Using packages like tidyquant and dplyr simplifies data analysis tasks, making it easier to work with large datasets. These examples show just one way to create summary tables; there are many other functions and methods to explore. Give these examples a try with your own data and see how you can summarize and gain insights from your datasets.

Happy coding!

To leave a comment for the author, please follow the link and comment on their blog: Steve's Data Tips and Tricks.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Continue reading: Creating Summary Tables in R with tidyquant and dplyr

Global movement of Happiness ladder with Machine learning in R

R-Blog on Data modelling to develop ... — Fri, 26 Jul 2024 01:00:00 +0000

[This article was first published on R-Blog on Data modelling to develop ..., and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Introduction

This post looks at the world happiness ladder using the World Happiness Report data sets (Helliwell et al., 2024). The basic objective is to demonstrate the use of panel data, which is quite distinct from cross-sectional or time series data.
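
To make the distinction concrete, here is a minimal base R sketch. The data are synthetic: the countries A, B and C and the ladder scores are random values made up for illustration, not figures from the World Happiness Report.

```r
# Synthetic panel: every country is observed in every year
set.seed(1)
panel <- expand.grid(
  country = c("A", "B", "C"),   # hypothetical country labels
  year    = 2021:2023,
  stringsAsFactors = FALSE
)
panel$ladder <- round(runif(nrow(panel), min = 3, max = 8), 2)

# Cross-section: fix the time, vary the unit
cross_section <- panel[panel$year == 2023, ]

# Time series: fix the unit, vary the time
time_series <- panel[panel$country == "A", ]

print(panel)
print(cross_section)
print(time_series)
```

The full panel has one row per country-year pair; slicing it one way gives a cross-section, slicing it the other way gives a time series.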

Global happiness ladder

Cross-sectional happiness ladder for 2023

Fixed time, it is cross-sectional

Time series vs. panel data visualisation

Each line is a time series, but together they are panel data

Mean global happiness ladder

Cross-sectional mean happiness ladder for 2023

Time is fixed, cross-sectional

Time series vs. panel data visualisation

Each line is a time series, but together they are panel data

Factor analysis of panel global happiness ladder

Parallel analysis suggests that the number of factors = 4 and the number of components = NA 

Loadings:
MR1 MR2 MR4 MR3
Happiness 0.810
GDP 0.899
Support 0.761
Life_Exp 0.882
Freedom 0.651
Positive 0.775
Corruption -0.831
Negative 0.550
Year 0.487
Generosity 0.440
Regional
MR1 MR2 MR4 MR3
SS loadings 3.286 1.687 0.909 0.731
Proportion Var 0.299 0.153 0.083 0.066
Cumulative Var 0.299 0.452 0.535 0.601

	Values
degree of freedom	17.00
Chi-sq	132.48
Chi-sq/df	7.79
Harmonic sample size	2298.31
Root Mean Square	0.02
Probability of the empirical chi-sq	0.00
Adjusted Root Mean Square	0.04
Empirical BIC	0.90
Sample size adjusted BIC	54.91
fit (SSresidual vs SSoriginal values)	0.90
fit applied to off diagonal elements	1.00
SD of the residuals	0.02
Number of factors extracted	4.00
Number of observations	2363.00
Value of the minimised function	0.16
chi-sq based on the objective function	369.96
p-value of observing the chi-sq	0.00
chi-sq based on the objective function/df	6.73
Null model	5.32
df for null model	55.00
chi-sq for null model	12542.24
chi-sq for null model/df	228.04
Tucker Lewis Index of factoring reliability	0.91
RMSE Approximation	0.09
RMSE Approximation-lower	0.09
RMSE Approximation-upper	0.10
RMSE Approximation-confidence interval	0.90
RMSE Approximation-BIC	237.91
RMSE Approximation-empirical BIC	291.92
Mean item complexity	1.58
Kaiser Meyer Olkin Measure of Sampling Adequacy	0.81
Bartlett Chi	12542.24
Barlett p-value	0.00
Barlett df	55.00
Barlett Chi/df	228.04

lavaan 0.6-18 ended normally after 117 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 24
Used Total
Number of observations 2098 2363
Model Test User Model:
Test statistic 2216.113
Degrees of freedom 31
P-value (Chi-square) 0.000
Model Test Baseline Model:
Test statistic 10727.135
Degrees of freedom 45
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.795
Tucker-Lewis Index (TLI) 0.703
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -15445.704
Loglikelihood unrestricted model (H1) -14337.648
Akaike (AIC) 30939.408
Bayesian (BIC) 31074.978
Sample-size adjusted Bayesian (SABIC) 30998.727
Root Mean Square Error of Approximation:
RMSEA 0.183
90 Percent confidence interval - lower 0.177
90 Percent confidence interval - upper 0.190
P-value H_0: RMSEA <= 0.050 0.000
P-value H_0: RMSEA >= 0.080 1.000
Standardized Root Mean Square Residual:
SRMR 0.110
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Latent Variables:
Estimate Std.Err z-value P(>|z|)
MR1 =~
GDP 1.000
Life_Exp 5.659 0.097 58.044 0.000
Happiness 0.931 0.016 59.736 0.000
Support 0.089 0.002 45.585 0.000
MR2 =~
Positive 1.000
Freedom 1.858 0.076 24.518 0.000
Generosity 0.857 0.058 14.877 0.000
Regional 7.995 1.138 7.026 0.000
MR4 =~
Corruption 1.000
MR3 =~
Year 1.000
Covariances:
Estimate Std.Err z-value P(>|z|)
MR1 ~~
MR2 0.036 0.002 15.240 0.000
MR4 -0.078 0.005 -16.353 0.000
MR3 0.553 0.121 4.589 0.000
MR2 ~~
MR4 -0.007 0.000 -16.232 0.000
MR3 0.084 0.009 9.636 0.000
MR4 ~~
MR3 -0.086 0.020 -4.320 0.000
Variances:
Estimate Std.Err z-value P(>|z|)
.GDP 0.197 0.011 17.757 0.000
.Life_Exp 12.638 0.506 24.994 0.000
.Happiness 0.307 0.013 24.065 0.000
.Support 0.006 0.000 28.956 0.000
.Positive 0.007 0.000 26.406 0.000
.Freedom 0.003 0.001 6.186 0.000
.Generosity 0.023 0.001 31.520 0.000
.Regional 10.537 0.327 32.245 0.000
.Corruption 0.000
.Year 0.000
MR1 1.136 0.042 27.217 0.000
MR2 0.005 0.000 14.435 0.000
MR4 0.034 0.001 32.388 0.000
MR3 24.675 0.762 32.388 0.000

	x
npar	2.400000e+01
fmin	5.281489e-01
chisq	2.216113e+03
df	3.100000e+01
pvalue	0.000000e+00
baseline.chisq	1.072713e+04
baseline.df	4.500000e+01
baseline.pvalue	0.000000e+00
cfi	7.954423e-01
tli	7.030614e-01
nnfi	7.030614e-01
rfi	7.001121e-01
nfi	7.934106e-01
pnfi	5.465717e-01
ifi	7.957100e-01
rni	7.954423e-01
logl	-1.544570e+04
unrestricted.logl	-1.433765e+04
aic	3.093941e+04
bic	3.107498e+04
ntotal	2.098000e+03
bic2	3.099873e+04
rmsea	1.832962e-01
rmsea.ci.lower	1.768656e-01
rmsea.ci.upper	1.898093e-01
rmsea.ci.level	9.000000e-01
rmsea.pvalue	0.000000e+00
rmsea.close.h0	5.000000e-02
rmsea.notclose.pvalue	1.000000e+00
rmsea.notclose.h0	8.000000e-02
rmr	1.183338e+00
rmr_nomean	1.183338e+00
srmr	1.100957e-01
srmr_bentler	1.100957e-01
srmr_bentler_nomean	1.100957e-01
crmr	1.217154e-01
crmr_nomean	1.217154e-01
srmr_mplus	1.100957e-01
srmr_mplus_nomean	1.100957e-01
cn_05	4.358775e+01
cn_01	5.040973e+01
gfi	8.167185e-01
agfi	6.748232e-01
pgfi	4.603323e-01
mfi	5.940683e-01
ecvi	1.079177e+00
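
As a cross-check, the headline fit indices can be recomputed by hand from the chi-square values in the lavaan output. This is a sketch using the usual textbook formulas in their simplified form (valid here because both chi-square statistics exceed their degrees of freedom); the variable names are mine.

```r
# Values copied from the lavaan output above
chisq   <- 2216.113    # user model test statistic
df      <- 31          # user model degrees of freedom
b_chisq <- 10727.135   # baseline model test statistic
b_df    <- 45          # baseline model degrees of freedom
n       <- 2098        # number of observations used

# Comparative Fit Index: share of baseline misfit removed by the model
cfi <- 1 - (chisq - df) / (b_chisq - b_df)

# Tucker-Lewis Index: same idea, based on chi-square/df ratios
tli <- ((b_chisq / b_df) - (chisq / df)) / ((b_chisq / b_df) - 1)

# RMSEA: misfit per degree of freedom, scaled by sample size
rmsea <- sqrt((chisq - df) / (df * n))

round(c(cfi = cfi, tli = tli, rmsea = rmsea), 3)
```

These reproduce the reported 0.795, 0.703 and 0.183. By common rules of thumb (CFI and TLI of roughly 0.90 or higher, RMSEA of roughly 0.08 or lower), the four-factor model fits the data poorly.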

	MR1	MR2
alpha	0.4741262	0.0705956
omega	0.8358631	0.0570646
omega2	0.8358631	0.0570646
omega3	0.8354815	0.0556176
avevar	0.7454058	0.0295963

The inter-connectivity between the latent variables and the various variables used to measure happiness.

References

Helliwell, J. F., Layard, R., Sachs, J. D., De Neve, J.-E., Aknin, L. B., & Wang, S. (Eds.). (2024). World Happiness Report 2024. University of Oxford: Wellbeing Research Centre.

Mastering Wildcard Searches in R with grep()

Steven P. Sanderson II, MPH — Thu, 25 Jul 2024 04:00:00 +0000

[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Introduction

In R, finding patterns in text is a common task, and one of the most powerful functions to do this is grep(). This function is used to search for patterns in strings, allowing you to locate elements that match a specific pattern. Today, we’ll explore how to use wildcard characters with grep() to enhance your string searching capabilities. Let’s dive in!

Understanding `grep()`

At its core, grep() is a function that searches for matches to a pattern (regular expression) within a vector of strings. By default it returns the indices of the elements that contain the pattern. Here’s the basic syntax:

grep(pattern, x, ignore.case = FALSE, value = FALSE)

pattern: A character string containing a regular expression.
x: A character vector where the search is performed.
ignore.case: If TRUE, the search will be case-insensitive.
value: If TRUE, grep() returns the matching elements instead of their indices.
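
Here is a quick sketch of how ignore.case and value change the result. The vector x is made up for illustration.

```r
x <- c("Data", "database", "Analytics")

# Default: case-sensitive, returns indices of matching elements
idx_default <- grep("data", x)

# Case-insensitive search also picks up "Data"
idx_nocase <- grep("data", x, ignore.case = TRUE)

# value = TRUE returns the matching strings instead of their positions
val_nocase <- grep("data", x, ignore.case = TRUE, value = TRUE)

print(idx_default)   # 2
print(idx_nocase)    # 1 2
print(val_nocase)    # "Data" "database"
```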

Using Wildcards in `grep()`

Wildcard characters are incredibly useful in searching for patterns that may not be exactly known. In regular expressions, which grep() uses, wildcards and anchors (which match a position rather than a character) are written with specific metacharacters:

^: Asserts the start of a string.
$: Asserts the end of a string.
.: Matches any single character.
.*: Matches any number of any characters (including none).

Let’s look at some practical examples to see these in action!

Examples

Strings that Start with a Pattern

To find strings that start with a specific pattern, use ^ at the beginning of your pattern. For instance, if you’re looking for words starting with “data”:

words <- c("data", "dataframe", "database", "analytics", "visualization")
grep("^data", words)

[1] 1 2 3

This code will return the indices of “data”, “dataframe”, and “database” because they all start with “data”. If you set value = TRUE, it will return the matching elements:

grep("^data", words, value = TRUE)

[1] "data"      "dataframe" "database"

Strings that End with a Pattern

To find strings ending with a certain pattern, use $ at the end of your pattern. For example, to find words ending with “base”:

grep("base$", words, value = TRUE)

[1] "database"

Strings that Contain a Pattern

To find strings containing a pattern anywhere within them, use the pattern directly. For example, to find words containing “vis” (note that words is redefined here):

words <- c("data", "visualization", "database", "analyze", "predict")
grep("vis", words, value = TRUE)

[1] "visualization"

Combining Patterns with `.*`

The sequence .* matches any number of characters, making it useful for finding patterns separated by arbitrary text. For instance, to find words containing “a” followed somewhere later by “z”:

grep("a.*z", words, value = TRUE)

[1] "visualization" "analyze"
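
The single-character wildcard . was listed above but not demonstrated, so here is a small sketch. The vector is made up; glob2rx() is a base R helper (in the utils package) that translates shell-style wildcards such as ? and * into a regular expression, which can be handy if you think in those terms.

```r
shorts <- c("cat", "cot", "coat", "cut")

# "." stands for exactly one character, so "coat" (two characters
# between c and t) does not match
one_char <- grep("^c.t$", shorts, value = TRUE)

# The same search written as a shell-style wildcard and translated
# to a regular expression by glob2rx()
via_glob <- grep(glob2rx("c?t"), shorts, value = TRUE)

print(one_char)
print(via_glob)
```

Both calls return "cat", "cot" and "cut".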

Your Turn!

Regular expressions can seem intimidating at first, but with a bit of practice, they become a powerful tool in your R toolkit. I encourage you to play around with different patterns and see what you can find in your datasets. Try searching for different starting and ending patterns, or look for specific sequences within your strings. The grep() function is incredibly versatile, and mastering it can save you a lot of time when working with text data.

Feel free to share your discoveries or any interesting patterns you find.

Happy coding!

The distribution has changed; and pretty tables in base R

Jerry Tuttle — Wed, 24 Jul 2024 18:00:00 +0000

[This article was first published on Online College Math Teacher, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The distribution has changed; and pretty tables in base R, by Jerry Tuttle

So you spent hours, or maybe days, cranking out thousands of numbers, you submit it to your boss just at the deadline, your boss quickly peruses your exhibit of numbers, points to a single number and says, “This number doesn’t look right.” Bosses have an uncanny ability to do this.

Your boss is pointing to something like this: Your company sells property insurance on both personal and commercial properties. The average personal property premium increased 10% in 2024. The average commercial property premium increased 10% in 2024. But you say the combined average property premium decreased 3% in 2024. You realize that negative 3% does not look right.

You might have made an input error or a calculation error, but you don’t want to admit to that. So you blurt out, “That’s because the distribution has changed.” And to your relief, the boss buys into that.

“The distribution has changed” is probably a pretty good answer in more instances than you realize. A more common example: your investment portfolio starts at 90% stocks and 10% bonds, you have a good year with stocks, and at year-end the distribution of your portfolio has changed to 94% stocks and 6% bonds, so you may want to rebalance. How about the distribution of population by state? It has definitely changed since the prior Census. It might be hard to think of a non-trivial example of something where the distribution has not changed.

The key calculations are the 2023 average premium is (2000 * 90 + 10000 * 10) / (90 + 10) = 2800, the 2024 average premium is (2200 * 110 + 11000 * 7) / (110 + 7) = 2726, and so the average premium change is 2726 / 2800 = .97 (rounded) = – 3%. And the distribution between personal and commercial DID change, as measured either by the distribution of number of policies (was 90% personal, now 94%) or by the distribution of premium (was 64% personal, now 76%). So because the distribution has changed towards more smaller personal policies, this pulls the combined average premium down, even though the average premium for each of the two separate subgroups has increased. (There are alternatives to the key calculations, such as weighting the percentage changes, instead of weighting the average premiums.)
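
The arithmetic above can be checked in a few lines of base R. This is just a sketch of the same calculation using weighted.mean(); the variable names are mine.

```r
# Average premium and number of policies, by subgroup and year
prem_2023 <- c(personal = 2000, commercial = 10000)
pol_2023  <- c(personal = 90,   commercial = 10)
prem_2024 <- c(personal = 2200, commercial = 11000)
pol_2024  <- c(personal = 110,  commercial = 7)

# Combined average premium = policy-weighted mean of subgroup averages
avg_2023 <- weighted.mean(prem_2023, pol_2023)
avg_2024 <- weighted.mean(prem_2024, pol_2024)
change   <- avg_2024 / avg_2023 - 1

# Each subgroup on its own went up by 10%
subgroup_change <- prem_2024 / prem_2023 - 1

# The mix of policies shifted toward the cheaper personal business
policy_share <- rbind(
  `2023` = pol_2023 / sum(pol_2023),
  `2024` = pol_2024 / sum(pol_2024)
)

round(c(avg_2023 = avg_2023, avg_2024 = avg_2024), 0)
round(100 * change)          # combined change in percent
round(100 * subgroup_change) # subgroup changes in percent
round(policy_share, 2)
```

The combined average falls by about 3% even though both subgroup averages rise by 10%, because the weights moved from 90/10 to roughly 94/6.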

Years ago I posed a dilemma like this during job interviews for actuarial trainees to see how well they would respond to a sort of non-routine problem, but I decided it was too difficult.

I did the above exhibit in Excel because it was quick. It was also pretty easy to add custom colors to alternating rows, which I did via FONT > FILL > MORE COLORS > CUSTOM > enter HEX CODE. Here I chose cornsilk #fff8dc and cyan #00ffff for fun.

Then I wondered how easy it would be to make a pretty table in R. If you google something like “pretty tables in R”, you will find a number of R packages that create HTML type code that can be saved as an HTML file, a PDF file, or another file format. Much has been written about these packages, but they seem a little complicated for basic work, and further, I like the idea of staying exclusively within the R environment. When I realized a table is just a collection of rectangles, it occurred to me that the base R functions rect and text are pretty much all I need.

Here is a table of sample rectangles with text, written in R. The rectangle syntax is rect(xleft, ybottom, xright, ytop, col, border) and the text syntax is text(x, y, labels, col, cex, font). The numerical axis is helpful when first defining the rectangles, but can be deleted by adding axes = FALSE to the plot function for the final table.

# rectangle syntax: rect(xleft, ybottom, xright, ytop)
plot(x = c(0, 500), y = c(0, 500), type = "n", xlab = "", ylab = "", main = "Sample rectangles with text", cex = 2, font = 2)

rect(0, 0, 250, 250, col = "#E41A1C", border = "blue")
text(125, 125, "red rectangle, white font", col = "white", cex = 1.15, font = 2)

rect(250, 0, 500, 250, col = "yellow", border = "blue")
text(375, 125, "yellow rectangle, blue font", col = "navyblue", cex = 1.15, font = 2)

rect(0, 250, 250, 500, col = "cornsilk", border = "blue")
text(125, 375, "cornsilk rectangle, black font", col = "black", cex = 1., font = 2)

rect(250, 250, 500, 500, col = "cyan", border = "blue")
text(375, 375, "cyan rectangle, purple font", col = "purple", cex = 1.15, font = 2)

The R equivalent of the Excel exhibit is the following. Note that all code is in base R.

title <- c("Subgroups increase, but the average decreases")
df <- data.frame(Personal = c(2000, 90, 2200, 110, 10),
Commercial = c(10000, 10, 11000, 7, 10),
Weighted = c(2800, 100, 2726, 117, -3))

rownames(df) = c("2023 Avg Prem", "2023 No. Policies", "2024 Avg Prem", "2024 No. Policies", "Avg Prem % Change")
df

# rectangle syntax: rect(xleft, ybottom, xright, ytop)
op <- par(bg = "thistle")
col1 = "cornsilk"
col2 = "cyan"

plot(x = c(0, 500), y = c(0, 700), type = "n", xlab = "", ylab = "", axes = FALSE)
height = 100

# title row
rect(0, 6*height, 500, 7*height, col = col1, border = "blue")
text(250, 6.5*height, title, col = "black", cex = 1.25, font = 2)

# column header row
rect(0, 5*height, 200, 6*height, col = col2, border = "blue")
rect(200, 5*height, 300, 6*height, col = col2, border = "blue")
rect(300, 5*height, 400, 6*height, col = col2, border = "blue")
rect(400, 5*height, 500, 6*height, col = col2, border = "blue")

text(100, 5.5*height, "", col = "blue")
text(250, 5.5*height, colnames(df)[1], col = "blue")
text(350, 5.5*height, colnames(df)[2], col = "blue")
text(450, 5.5*height, colnames(df)[3], col = "blue")

# data row 1
rect(0, 4*height, 200, 5*height, col = col1, border = "blue")
rect(200, 4*height, 300, 5*height, col = col1, border = "blue")
rect(300, 4*height, 400, 5*height, col = col1, border = "blue")
rect(400, 4*height, 500, 5*height, col = col1, border = "blue")

text(100, 4.5*height, rownames(df)[1], col = "blue")
text(250, 4.5*height, df[1,1], col = "blue")
text(350, 4.5*height, df[1,2], col = "blue")
text(450, 4.5*height, df[1,3], col = "blue")

# data row 2
rect(0, 3*height, 200, 4*height, col = col2, border = "blue")
rect(200, 3*height, 300, 4*height, col = col2, border = "blue")
rect(300, 3*height, 400, 4*height, col = col2, border = "blue")
rect(400, 3*height, 500, 4*height, col = col2, border = "blue")

text(100, 3.5*height, rownames(df)[2], col = "blue")
text(250, 3.5*height, df[2,1], col = "blue")
text(350, 3.5*height, df[2,2], col = "blue")
text(450, 3.5*height, df[2,3], col = "blue")

# data row 3
rect(0, 2*height, 200, 3*height, col = col1, border = "blue")
rect(200, 2*height, 300, 3*height, col = col1, border = "blue")
rect(300, 2*height, 400, 3*height, col = col1, border = "blue")
rect(400, 2*height, 500, 3*height, col = col1, border = "blue")

text(100, 2.5*height, rownames(df)[3], col = "blue")
text(250, 2.5*height, df[3,1], col = "blue")
text(350, 2.5*height, df[3,2], col = "blue")
text(450, 2.5*height, df[3,3], col = "blue")

# data row 4
rect(0, height, 200, 2*height, col = col2, border = "blue")
rect(200, height, 300, 2*height, col = col2, border = "blue")
rect(300, height, 400, 2*height, col = col2, border = "blue")
rect(400, height, 500, 2*height, col = col2, border = "blue")

text(100, 1.5*height, rownames(df)[4], col = "blue")
text(250, 1.5*height, df[4,1], col = "blue")
text(350, 1.5*height, df[4,2], col = "blue")
text(450, 1.5*height, df[4,3], col = "blue")

# data row 5 (percentage changes)
rect(0, 0, 200, height, col = col1, border = "blue")
rect(200, 0, 300, height, col = col1, border = "blue")
rect(300, 0, 400, height, col = col1, border = "blue")
rect(400, 0, 500, height, col = col1, border = "blue")

text(100, .5*height, rownames(df)[5], col = "blue")
text(250, .5*height, paste(df[5,1], "%"), col = "black", cex = 1.5)
text(350, .5*height, paste(df[5,2], "%"), col = "black", cex = 1.5)
text(450, .5*height, paste(df[5,3], "%"), col = "black", cex = 2, font = 2)

par(op)

# END

#######################################
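
Since every row follows the same pattern, the drawing can also be written as a loop. The following is only a sketch of that idea, not the code used for the exhibit: draw_table is a hypothetical helper name, and it leaves out the title row, the thistle background and the larger font on the last row.

```r
df <- data.frame(
  Personal   = c(2000, 90, 2200, 110, 10),
  Commercial = c(10000, 10, 11000, 7, 10),
  Weighted   = c(2800, 100, 2726, 117, -3),
  row.names  = c("2023 Avg Prem", "2023 No. Policies", "2024 Avg Prem",
                 "2024 No. Policies", "Avg Prem % Change")
)

# Hypothetical helper: draws a header row plus one row per data frame row
draw_table <- function(df, height = 100, fills = c("cornsilk", "cyan")) {
  xleft  <- c(0, 200, 300, 400)     # left edge of each of the four columns
  xright <- c(200, 300, 400, 500)   # right edge of each column
  xmid   <- (xleft + xright) / 2    # horizontal center for the cell text

  # Character matrix of everything to print: header first, then the data
  cells <- rbind(c("", colnames(df)), cbind(rownames(df), as.matrix(df)))

  plot(x = c(0, 500), y = c(0, nrow(cells) * height), type = "n",
       xlab = "", ylab = "", axes = FALSE)

  for (i in seq_len(nrow(cells))) {
    ytop <- (nrow(cells) - i + 1) * height   # row 1 is drawn at the top
    # rect() and text() are vectorised, so one call draws a whole row
    rect(xleft, ytop - height, xright, ytop,
         col = fills[(i %% 2) + 1], border = "blue")
    text(xmid, ytop - height / 2, cells[i, ], col = "blue")
  }
  invisible(cells)
}

draw_table(df)
```

The trade-off is the usual one: the loop is shorter and adapts to any number of rows, while the explicit version makes it easier to style one particular cell.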

Getting the Workbook Name in VBA and Calling It from R

Steven P. Sanderson II, MPH — Wed, 24 Jul 2024 04:00:00 +0000

[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Introduction

When working with Excel, it’s often useful to know the name of the workbook you’re working in, especially if you’re managing multiple files. Today, we’ll look at how to retrieve the workbook name using VBA (Visual Basic for Applications) and then call this VBA code from R. This post will walk you through the steps with clear examples and explanations. Let’s get to it!

Example

Getting the Workbook Name Using VBA

First, we’ll start with a simple VBA script to get the workbook name. VBA is a powerful tool integrated into Microsoft Office applications, allowing you to automate tasks and interact with various elements in your documents.

Here’s a basic example of VBA code that retrieves the name of the workbook that contains the macro:

Sub GetWorkbookName()
    Dim wbName As String
    wbName = ThisWorkbook.Name
    MsgBox "The name of the active workbook is: " & wbName
End Sub

Explanation:

Sub GetWorkbookName(): This line defines a new subroutine named GetWorkbookName. A subroutine in VBA is a block of code that performs a specific task.
Dim wbName As String: This line declares a variable wbName that will hold the workbook’s name as a string.
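wbName = ThisWorkbook.Name: Here, we’re assigning the name of the workbook that contains this VBA code to the wbName variable. ThisWorkbook always refers to that workbook, which is usually, but not necessarily, the active one; use ActiveWorkbook.Name if you want whichever workbook is currently active.
MsgBox "The name of the active workbook is: " & wbName: Finally, we use a message box to display the workbook name.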

Calling VBA Code from R

Now that we have our VBA macro, the next step is to call it from R. This is particularly useful if you’re integrating Excel operations into your R workflows.

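We’ll use the RDCOMClient package in R, which allows us to interact with COM (Component Object Model) objects, such as Excel. COM is a Windows technology, so this approach requires Windows with Excel installed. If you haven’t installed this package, you can try the command below; if it is not available from your default repository, the omegahat/RDCOMClient repository on GitHub is the usual alternative source: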

install.packages("RDCOMClient")

Here’s a simple R script to call our VBA subroutine:

library(RDCOMClient)

# Create an instance of the Excel application
excel_app <- COMCreate("Excel.Application")

# File Path
f_path <- "C:/path_to/workbook_name.xlsm"

# Open the workbook (replace 'f_path' with the actual path)
workbook <- excel_app$Workbooks()$Open(f_path)

# Run the VBA macro
excel_app$Run("GetWorkbookName")

# Close the workbook without saving changes
workbook$Close(FALSE)

# Quit Excel
excel_app$Quit()

# Release the object
rm(excel_app)

Explanation:

library(RDCOMClient): This line loads the RDCOMClient package.
COMCreate("Excel.Application"): We create an instance of the Excel application.
workbook <- excel_app$Workbooks()$Open(f_path): This line opens the specified workbook. Set f_path to the path of your actual Excel file.
excel_app$Run("GetWorkbookName"): Here, we call the VBA subroutine GetWorkbookName to display the workbook’s name.
workbook$Close(FALSE): We close the workbook without saving any changes.
excel_app$Quit(): This closes the Excel application.
rm(excel_app): Finally, we release the Excel application object to free up resources.
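
One thing to watch for: if the macro or the Open call fails, the script stops before Quit() runs and an invisible Excel process can be left behind. A possible way around that, sketched below, is to wrap the steps in a function and register the cleanup with on.exit(). The function name run_excel_macro is hypothetical, and the sketch uses only the RDCOMClient calls shown above; I have not tested it against every Excel setup.

```r
# Hypothetical wrapper: opens a workbook, runs one macro, always cleans up.
# Requires Windows, Excel and the RDCOMClient package at call time.
run_excel_macro <- function(path, macro) {
  excel_app <- RDCOMClient::COMCreate("Excel.Application")
  # Registered first, so it runs last: quit Excel even if something fails
  on.exit(excel_app$Quit(), add = TRUE)

  workbook <- excel_app$Workbooks()$Open(path)
  # after = FALSE puts this ahead of Quit(): close the workbook (without
  # saving) before the application itself is shut down
  on.exit(workbook$Close(FALSE), add = TRUE, after = FALSE)

  excel_app$Run(macro)
  invisible(NULL)
}

# Usage (not run here; replace the path with your own file):
# run_excel_macro("C:/path_to/workbook_name.xlsm", "GetWorkbookName")
```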

Here is a picture of the message:

VBA Workbook Name

Conclusion

With these simple steps, you’ve learned how to retrieve the name of an Excel workbook using VBA and how to call this VBA code from R. This combination can be quite powerful, especially when automating data processing tasks that involve both Excel and R.

Feel free to try this on your own. Experiment with different VBA scripts and see how you can integrate them into your R workflows. It’s a great way to streamline your work and make the most of both tools.

Happy coding!

plotting individual values within multiple groups together with their means

Di Cook — Tue, 23 Jul 2024 13:37:00 +0000

[This article was first published on gacatag, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

In this post I show how groupScatterPlot(), a function of the rnatoolbox R package, can be used for plotting the individual values in several groups together with their mean (or other statistics). I think this is a useful function for plotting grouped data when some groups (or all groups) have few data points! You may be wondering why such a function is included in the rnatoolbox package. Well, I happen to use it quite a bit for plotting expression values of different groups of genes/transcripts in a sample, or expression levels of a specific gene/transcript in several sample groups. These expression values are either FPKM, TPM, LCPM, or PSI values (maybe I should go through these different normalizations later in a different post!). But of course its application is not restricted to gene expression or RNA-seq data analysis.

For the test, I first generate a list of three vectors of random values. The values are drawn from normal distributions with different means and standard deviations.

library(rnatoolbox)
datList<- list(
l1=rnorm(n=30, mean = 10, sd = 3),
l2=rnorm(n=20, mean = 0, sd = 1),
l3=rnorm(n=25, mean = 10, sd = 1)
)

Then I plot the grouped values. By default the mean function is used to add a summary for the values. However, other functions (e.g. median) can be passed as the FUN parameter.

png(
  "/proj/pehackma/ali/test/test_rnatoolbox/test_groupedScatterPlot_3.png",
  width = 500, height = 500, pointsize = 21)
groupScatterPlot(l = datList, col = rainbow(3),
                 lty = 1, lwd = 1.5,
                 ylab = "Test values")
dev.off()
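
If you do not have rnatoolbox installed, a rough base R analogue of the same idea can be sketched with stripchart() and segments(). To be clear, this is not how groupScatterPlot() is implemented and the result will look different; the seed is also my own addition so the random values are repeatable.

```r
set.seed(42)  # assumption: any fixed seed, only for repeatability
datList <- list(
  l1 = rnorm(n = 30, mean = 10, sd = 3),
  l2 = rnorm(n = 20, mean = 0, sd = 1),
  l3 = rnorm(n = 25, mean = 10, sd = 1)
)

# One summary value per group; swap mean for median if preferred
group_means <- sapply(datList, mean)

# Individual points, one jittered vertical strip per group
stripchart(datList, vertical = TRUE, method = "jitter", pch = 16,
           col = rainbow(3), ylab = "Test values")

# Short horizontal line at each group's mean
segments(x0 = seq_along(datList) - 0.25, y0 = group_means,
         x1 = seq_along(datList) + 0.25, y1 = group_means, lwd = 1.5)
```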

Augmenting RNA-Ligand Binding Prediction With Machine Learning: A Leap Towards Enhanced Drug Discovery

Piotr Suwara — Tue, 23 Jul 2024 07:00:35 +0000

[This article was first published on Appsilon | Enterprise R Shiny Dashboards, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

In pharmaceutical research, the exploration of RNA-ligand interactions is a significant challenge, marking a stark contrast to the more developed understanding of protein-ligand interactions.

This complexity arises from the relatively uncharted nature of RNA-ligand interactions, despite extensive studies, thereby complicating the development of RNA-targeted drugs. To tackle this challenge, we partnered with the scientists from the International Institute of Molecular and Cell Biology in Warsaw (IIMCB).

Researchers provided us with comprehensive datasets, which we utilized, applying our expertise in machine learning, to develop predictive models for RNA-ligand binding.

This collaboration led to very promising results: our models achieved Area Under the Receiver Operating Characteristic curve (AUC) values between 0.65 and 0.68 on test sets, surpassing the molecular docking techniques (which currently deliver state-of-the-art results and are widely utilized for virtual screenings), which reach between 0.50 and 0.60 on the same RNAs. This approach marks a significant step forward in RNA-targeted drug discovery.

Figure 1. AUC curves and scores reflecting performance of molecular docking as well as our model’s performance in test and validation conditions. Test on RNA X: model trained on two other RNAs and tested on RNA X. Validation on RNA X: average performance on a held-out subset of samples for RNA X when training on RNA X and another RNA. The class of the ligands (i.e. binders or non-binders) was determined experimentally.
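
For readers less familiar with the metric: AUC is the probability that a randomly chosen binder receives a higher score than a randomly chosen non-binder, so 0.5 corresponds to random ranking and 1.0 to perfect ranking. The sketch below shows one standard way to compute it, via ranks (the Mann-Whitney formulation). The scores and labels are toy values made up for illustration and have nothing to do with the models or data in this study.

```r
# AUC from ranks: labels are 1 (binder) or 0 (non-binder)
auc <- function(scores, labels) {
  r     <- rank(scores)            # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  # Sum of positive ranks, shifted and scaled to the range 0..1
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(1, 1, 0, 0)

# Every binder outranks every non-binder
auc_perfect <- auc(c(0.9, 0.8, 0.3, 0.1), labels)

# One binder scores highest, the other lowest: no better than chance
auc_chance <- auc(c(0.9, 0.1, 0.8, 0.3), labels)

print(auc_perfect)  # 1
print(auc_chance)   # 0.5
```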

The Challenge in Detail

Certain RNA molecules have been identified as crucial targets for therapeutic agents, highlighting the potential of RNA molecules in revolutionizing drug development. These include bacterial ribosomes and the human pre-mRNA of the survival of motor neuron 2 (SMN2) protein, targeted by specific drugs like bacterial ribosome-targeting antibiotics and risdiplam, respectively. Moreover, other RNAs such as mRNAs, regulatory RNAs in humans, riboswitches in bacteria, and conserved non-coding RNAs in viruses are acknowledged as promising candidates for novel therapeutics.

The flexible and dynamic nature of RNA structures poses significant challenges for in silico modeling and prediction, making it difficult to accurately target RNAs with drugs. Unlike proteins, whose domains are in most cases relatively rigid and well-defined, making them easier to model (with tools building on AlphaFold’s results and architecture providing a solid foundation), RNAs lack similar structural predictability, thus complicating the drug design process. Another challenge is the limited availability of experimental data on RNA and small molecule ligands which makes careful curation and preparation of the data set particularly important in method development.

“Effective prediction of binding of small molecule ligands to RNA, is the ultimate challenge of rational drug discovery. The Machine Learning-based methods developed with Appsilon are taking us closer to that goal.”

Filip Stefaniak, PhD
International Institute of Molecular and Cell Biology in Warsaw

Our Approach

Combining Expertise:

Our collaboration with IIMCB brought together expertise in the structural biology of RNA, bioinformatics, and machine learning. We teamed up with RNA-ligand interaction specialists Filip Stefaniak and Natalia Szulc, who created initial models for predicting the binding of small-molecule ligands to RNA [1]. Our role was to contribute our knowledge of building custom neural networks, in particular to handle input data of variable length, reflecting the varying lengths of RNA sequences.

Data Curation and Preparation:

The IIMCB researchers provided us with carefully curated datasets which contained experimental results on the interactions between three RNAs and tens of thousands of molecules, known as ligands. The data showed whether or not a ligand binds to a specific RNA. For every pair of RNA and ligand 3D structures, and for each nucleotide in the RNA, their software, fingeRNAt, generated a sequence of numbers, representing the nature of the noncovalent interactions.

Leveraging Transformer Architecture:

To handle variable-length sequences of nucleotides, we used the transformer architecture, known for its success in building large language models. This architecture can handle sequences of any length, making it well suited to RNA-ligand binding predictions. As it turns out, it is useful for building smaller models as well!

We aimed to create models that could learn from data determined for some RNAs and then predict binding for other RNAs, where little to no binding data might exist. To achieve this, we trained and validated our models with data from two of the RNAs, and then tested the models’ performance using the third RNA. This strategy ensured that our models could potentially be applied to a broad range of RNAs in future screenings.
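The train-on-two, test-on-the-third strategy described above can be sketched as a leave-one-group-out split. This is a minimal illustration in base R, not the project's actual pipeline: the toy data frame and its column names (`rna`, `ligand_id`, `binds`) are assumptions made up for the example.

```r
# Hypothetical toy data: one row per RNA-ligand pair.
# Column names (rna, ligand_id, binds) are assumptions for illustration.
set.seed(1)
pairs <- data.frame(
  rna       = rep(c("RNA_A", "RNA_B", "RNA_C"), each = 4),
  ligand_id = seq_len(12),
  binds     = rbinom(12, size = 1, prob = 0.3)
)

# Build one split per RNA: train on the other RNAs, test on the held-out one
leave_one_rna_out <- function(data, group_col = "rna") {
  groups <- as.character(unique(data[[group_col]]))
  splits <- lapply(groups, function(g) {
    held_out <- data[[group_col]] == g
    list(test_rna = g, train = data[!held_out, ], test = data[held_out, ])
  })
  names(splits) <- groups
  splits
}

splits <- leave_one_rna_out(pairs)

# Sanity check: no RNA appears on both sides of any split
overlap <- vapply(
  splits,
  function(s) length(intersect(s$train$rna, s$test$rna)),
  integer(1)
)
overlap
```

Splitting by RNA rather than by row is what makes the test score an estimate of performance on an unseen RNA; a random row-wise split would let the model see the test RNA during training.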

Achievements and Results

Our collaborative efforts bore fruit: our models achieved AUC scores of 0.65 to 0.68 on test sets and even higher scores of 0.70 to 0.72 on validation sets, suggesting the potential for even better performance with additional data or refined models. This is a significant improvement over existing approaches, which in our test setup achieve AUC scores in the range of 0.53 to 0.61 (cf. Figure 1).

In addition to AUC, our evaluation included the enrichment factor (EF₁₀), a crucial metric for assessing how efficiently true positives are identified early in the screening process. EF₁₀ measures how much more likely an active compound is to be found among the top 10% of samples ranked by the model than among the dataset as a whole, indicating the model’s efficiency in prioritizing potential binders. Our models achieved EF₁₀ scores of around 3, compared to state-of-the-art results, which only reach between 1.1 and 2.0 on this dataset (cf. Figure 2).

Figure 2. The percentage of all binders in the datasets that are ranked in the top 10% of the scores by each method, equal to EF₁₀ × 10%. Test/validation conditions: see Fig. 1 caption.

Curious about how machine learning is applied to drug discovery? Check out these 5 promising applications.

Implications for Drug Discovery

Our approach may have profound implications for drug discovery. With an EF₁₀ score of 3, our models demonstrate that, for the RNAs and molecules tested, selecting just 10% of the ligands and testing them in a wet lab would likely yield approximately 30-45 active binders from a dataset containing about 100-150 binders overall. This efficiency has the potential to drastically reduce the amount of time and resources needed for screening by increasing the likelihood of discovering an effective drug among those top-ranked binders. By effectively narrowing the field of candidates, our models enable a more focused and economically feasible approach to initial drug testing, and we look forward to making further improvements that greatly enhance the capabilities of our models.
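The arithmetic behind these numbers can be checked with a short sketch. The function below computes EF₁₀ as the hit rate in the top-ranked 10% divided by the overall hit rate, which is the usual definition and is consistent with the Figure 2 caption; the function name, its arguments, and the toy scores are assumptions for illustration, not the project's code.

```r
# Enrichment factor at a given fraction of the ranked list.
# scores: model scores (higher = more likely a binder)
# is_binder: logical vector of experimental labels
enrichment_factor <- function(scores, is_binder, top_frac = 0.10) {
  n_top   <- ceiling(length(scores) * top_frac)
  top_idx <- order(scores, decreasing = TRUE)[seq_len(n_top)]
  mean(is_binder[top_idx]) / mean(is_binder)
}

# Toy example: 20 ligands, 4 binders, both top-ranked ligands are binders
toy_scores <- 20:1
toy_labels <- c(TRUE, TRUE, rep(FALSE, 16), TRUE, TRUE)
toy_ef <- enrichment_factor(toy_scores, toy_labels)

# Numbers from the text: EF10 of 3 means 3 * 10% = 30% of all binders
# land in the top 10%, so 100-150 binders give roughly 30-45 hits
ef10          <- 3
total_binders <- c(100, 150)
expected_hits <- ef10 * 0.10 * total_binders
expected_hits
```

A random ranking has an expected EF₁₀ of 1, so a value of 3 means the top 10% holds about three times as many binders as a random 10% sample would.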

Moreover, the rapid development and deployment of our models underscore our capability to deliver significant advancements in predictive analytics for pharmaceutical research. These results were achieved within a limited preliminary effort, on a relatively small dataset, illustrating our team’s ability to quickly adapt and develop customized machine learning models tailored to specific datasets.

Partnering with us allows you to leverage our expertise and cutting-edge technology to accelerate your drug discovery processes, ultimately reducing costs and expediting the development of new treatments.

Develop your AI models with us!

The advent of AI in biotechnology heralds a new era of innovation and discovery, with machine learning at its core catalyzing advancements in drug discovery and biomedical research. Our pioneering work in RNA-ligand binding prediction is a testament to the transformative potential of integrating AI with biotechnology. If your organization is looking to sort through complex data more efficiently or speed up your research and development efforts, we’re ready to help.

As we continue to explore the frontiers of cheminformatics and biology, we envision ourselves contributing significantly to this revolution, aiding in the discovery of new drug targets, the development of novel therapeutics, and expanding our collective understanding of the molecular world. To understand better how our past projects have paved the way for innovative developments, we invite you to explore our previous work, in particular how we apply computer vision to speed up drug development.

Let’s discuss how we can support your goals with a customized proof-of-concept model. Contact us to see how a partnership can benefit your research needs and help forge the path to groundbreaking discoveries in pharmaceutical research.

[1] Szulc NA, Mackiewicz Z, Bujnicki JM, Stefaniak F (2022) fingeRNAt—A novel tool for high-throughput analysis of nucleic acid-ligand interactions. PLoS Comput Biol 18(6): e1009783. https://doi.org/10.1371/journal.pcbi.1009783

Shorten timelines, improve discovery rates, cut costs, and get to the next stage faster. Learn more about our work in AI for Drug Discovery.

This blog post was co-authored by Jędrzej Świeżewski, Natalia Szulc and Filip Stefaniak.

The post appeared first on appsilon.com/blog/.


FAQs on RStudio: Installation, Features, and Tips for Data Analysis

Zubair Goraya — Tue, 23 Jul 2024 06:43:00 +0000

[This article was first published on RStudioDataLab, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Welcome to the ultimate resource for Frequently Asked Questions related to RStudio! If you’ve ever felt overwhelmed by the vast world of data analysis or uncertain about how to get started with downloading RStudio, installing R, or integrating Python with RStudio, you’re in the right place. As a certified Data Analyst with a PhD and five years of experience, I’ve compiled answers to the most


Checking if a String Contains Multiple Substrings in R

Steven P. Sanderson II, MPH — Tue, 23 Jul 2024 04:00:00 +0000

[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Introduction

Hello, fellow R programmers! Today, we’re looking at a practical topic that often comes up when dealing with text data: how to check if a string contains multiple substrings. We’ll cover how to do this in base R, as well as using the stringr and stringi packages. Each approach has its own advantages, so let’s explore them together.

Examples

Base R Approach

First, let’s start with base R. Suppose we have a string and we want to check if it contains both “apple” and “banana”. Here’s how you can do it:

# Our main string
main_string <- "I have an apple and a banana."

# Substrings to check
substrings <- c("apple", "banana")

# Check if all substrings are in the main string
contains_all <- all(sapply(substrings, function(x) grepl(x, main_string)))

# Output the result
contains_all

[1] TRUE

sapply(substrings, grepl, x = main_string)

 apple banana 
  TRUE   TRUE

Explanation

main_string: This is the string we are checking.
substrings: A vector containing the substrings we are looking for.
sapply(substrings, function(x) grepl(x, main_string)): We use sapply to apply grepl (which checks if a pattern is found in a string) to each substring. This returns a logical vector indicating if each substring is present.
all(): This function checks if all values in the logical vector are TRUE.

By combining these functions, we can efficiently check if all the substrings are present in our main string.
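One caveat is worth a quick sketch: by default `grepl` treats each pattern as a regular expression, so substrings containing characters such as `.` or `(` may match more than intended. Passing `fixed = TRUE` (a documented `grepl` argument) matches the text literally. The helper name `contains_all_fixed` and the example strings are made up for illustration.

```r
# In a regex, "." matches any character, so "3.50" also matches "3150"
regex_match   <- grepl("3.50", "Invoice 3150")                # TRUE
literal_match <- grepl("3.50", "Invoice 3150", fixed = TRUE)  # FALSE

# Hypothetical helper: TRUE only if every substring occurs literally
contains_all_fixed <- function(string, substrings) {
  all(vapply(
    substrings,
    function(s) grepl(s, string, fixed = TRUE),
    logical(1)
  ))
}

contains_all_fixed("I have an apple and a banana.", c("apple", "banana"))  # TRUE
contains_all_fixed("Price: 3150", c("Price", "3.50"))                      # FALSE
```

Using `vapply` with `logical(1)` instead of `sapply` guarantees a logical vector even when `substrings` is empty.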

Using `stringr`

The stringr package provides a set of functions designed to make string manipulation easier and more intuitive. Here’s how we can use it to achieve the same goal:

# Load the stringr package
library(stringr)

# Our main string
main_string <- "I have an apple and a banana."

# Substrings to check
substrings <- c("apple", "banana")

# Check if all substrings are in the main string
contains_all <- all(str_detect(main_string, substrings))

# Output the result
contains_all

[1] TRUE

str_detect(main_string, substrings)

[1] TRUE TRUE

Explanation

library(stringr): Loads the stringr package.
str_detect(main_string, substrings): The str_detect function checks if each pattern in substrings is found in main_string. It returns a logical vector.
all(): As before, all checks if all values in the logical vector are TRUE.

The stringr package simplifies the syntax and makes the code more readable.

Using `stringi`

The stringi package is another powerful tool for string manipulation. It offers a highly efficient way to handle strings. Here’s how we can use stringi to check for multiple substrings:

# Load the stringi package
library(stringi)

# Our main string
main_string <- "I have an apple and a banana."

# Substrings to check
substrings <- c("apple", "banana")

# Check if all substrings are in the main string
contains_all <- all(stri_detect_fixed(main_string, substrings))

# Output the result
contains_all

[1] TRUE

stri_detect_fixed(main_string, substrings)

[1] TRUE TRUE

Explanation

library(stringi): Loads the stringi package.
stri_detect_fixed(main_string, substrings): The stri_detect_fixed function checks if each fixed pattern in substrings is found in main_string. This function is optimized for fixed patterns and is very fast.
all(): Again, we use all to check if all values in the logical vector are TRUE.

stringi provides highly optimized functions that can be very useful for handling large datasets or performance-critical applications.

Try It Yourself!

Now that we’ve walked through the different methods to check if a string contains multiple substrings, I encourage you to try these approaches on your own. Experiment with different strings and substrings to get a feel for how these functions work. Understanding these techniques can greatly enhance your text data manipulation skills in R.

Happy coding, and feel free to share your experiences and any questions you might have in the comments!


Empowering the R Community: Insights from Myles Mitchell of the Leeds Data Science Group

R Consortium — Mon, 22 Jul 2024 20:44:23 +0000

[This article was first published on R Consortium, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The R Consortium recently interviewed Myles Mitchell, co-organizer of the Leeds Data Science group, to discuss the local R community and the group’s recent activities. Myles highlighted the group’s efforts to create an inclusive and welcoming environment for all participants. The group is dedicated to creating networking opportunities for students interested in pursuing a career in data science and sharing job openings.

The Leeds Data Science group is hosting an in-person event titled “Improving the Fidelity and Stability of Large Language Models” on the 23rd of July.

Please share about your background and involvement with the RUGS group.

I am a data scientist at Jumping Rivers, a data science consultancy. We collaborate with various companies on data-related projects, such as data storage, modeling, developing data visualization dashboards, and offering data science training. Initially, I had a background in Python, but I learned R while working at Jumping Rivers, where many of our staff are proficient in R, and much of our infrastructure is written in R.

At Jumping Rivers, we receive funding from the R Consortium. We organize the Leeds Data Science Meetups every two months and the North East Data Science Meetups every three months. Additionally, we hold annual conferences, such as Shiny in Production (October) and SatRdays London (April). I organize the North East and Leeds Data Science Meetups for Jumping Rivers.

Can you share what the local R Community is like?

I am located in Newcastle, in the northeast of England, where a large community is keenly interested in data science. Our community includes Newcastle University and Northumbria University students, many of whom are studying data science or statistics. There are also professionals from various industries looking for data science jobs. Our meetups are attended by prospective data scientists and students eager to network and learn more about the field.

Both universities teach R, and many industries in the area employ data science techniques, including Northumbrian Water and Nissan. These companies use data science to solve everyday problems, such as detecting water leaks and optimizing manufacturing processes.

Data science is relevant across almost all industries, and R, along with Python and other languages, is a crucial tool in solving data science problems. In the Northeast, consultancies like Jumping Rivers specialize in data science. In summary, we have a large community of students and industry professionals in the Northeast, and it’s a similar story in Leeds.

You have a Meetup on “Improving the Fidelity and Stability of Large Language Models”, can you share more on the topic covered? Why this topic?

During our Meetup on “Improving the Fidelity and Stability of Large Language Models,” we will explore how to enhance software solutions with AI capabilities, focusing on improving the accuracy and reliability of these models. Drawing from real-world experiences, we will discuss successful strategies for development, tackle the challenge of model ‘hallucinations,’ and address other significant obstacles. This topic is essential as the AI sector continues to grow rapidly, and integrating AI effectively is crucial for developers to achieve robust performance and innovative functionality in their projects. The session is designed for developers of all skill levels interested in incorporating AI into their work, ensuring they can implement practical and effective methodologies for positive outcomes.

Ryan Mangan will be presenting this meetup. Ryan is a seasoned technologist with over 18 years of experience in cloud computing, AI, and virtualization. He founded Efficient Ether Ltd, a Microsoft startup specializing in AI, cloud optimization, and sustainability. Ryan is a recognized Microsoft MVP, VMware vExpert, and Chartered Fellow of the British Computer Society. He has authored several e-books and publications, including “Mastering Azure Virtual Desktop,” and is active in public speaking and blogging within the tech community.

Any techniques you recommend using for planning for or during the event? (Github, zoom, other) Can these techniques be used to make your group more inclusive to people that are unable to attend physical events in the future?

Regarding techniques, I’m currently reviewing how we organize our meetups. Our meetups are free to attend for all participants, and we aim to create a welcoming and accessible environment for everyone to network and meet like-minded individuals in the area. The meetups are held every two to three months on weekdays in the evenings, providing attendees with time to travel from their place of work to the venue. We offer refreshments at the start, including pizza and soft drinks, and we ensure that vegan, gluten-free, and halal options are included to cater to a wide range of dietary preferences.

We often run interactive workshops at the North East Data Science Meetups, including a recent meetup on the Apache Arrow interface for R, led by Nic Crane on July 18th. To make our workshops as inclusive as possible, we provide attendees with all necessary materials and dependencies via a cloud environment created using Posit Workbench. It allows participants without prior installation of RStudio IDE to contribute and interact with the workshop materials. Our goal is to make our workshops accessible to a broad audience, including those from non-R backgrounds. In general, we aim to create an event where attendees can participate without the burden of installing multiple packages or downloading data.

Most attendees attend our meetups to network and meet industry professionals, especially students exploring post-graduation career opportunities. With this in mind, we always reserve a part of the meetup for advertising similar meetups and conferences in the area and job opportunities in data science. Many attendees regard these meetups as a regular source of news, so we try to provide a central hub of information and a place to enjoy high-quality live talks and workshops.

These are in-person meetups. We could attract more people if we recorded them and live-streamed them on Zoom. However, managing a Zoom call and a recording would create more work for the organizers, and an in-person event provides better networking opportunities than an online one. That said, we will continue to look at ways to make the meetups accessible and appealing to people from a wide range of backgrounds, and we will always take feedback from the community seriously.

What trends do you currently see in R language and your industry? Any trends you see developing in the near future?

In the North East and Leeds Data Science Meetups, there is significant interest in machine learning, in training and deploying machine learning models, and in productionizing these models (ML Ops). Attendees often expect talks on these topics and are particularly interested in ChatGPT, generative AI, and related topics. However, data science encompasses a broader range of areas, including visualizing data and creating dashboards, and we try to cover all of these areas in our talks and workshops. Despite our efforts, there is a clear trend toward machine learning-focused discussions, with many talk submissions focusing on ML Ops and deploying models in the cloud.

How do I Join?

R Consortium’s R User Group and Small Conference Support Program (RUGS) provides grants to help R groups organize, share information, and support each other worldwide. We have given grants over the past four years, encompassing over 68,000 members in 33 countries. We would like to include you! Cash grants and meetup.com accounts are awarded based on the intended use of the funds and the amount of money available to distribute.

Learn more

The post Empowering the R Community: Insights from Myles Mitchell of the Leeds Data Science Group appeared first on R Consortium.


Bootstrap Confidence Intervals: Exports in Japan

Selcuk Disci — Mon, 22 Jul 2024 13:34:56 +0000

[This article was first published on DataGeeek, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Japanese Finance Minister Shunichi Suzuki recently said that volatility in USD/JPY rates hurts price competitiveness and even damages the profits of exporters in Japan. To examine this claim, we will first look at exports of goods and services in Japan, the Nikkei 225, and the exchange rate together.

library(tidyverse)
library(tidymodels)
library(tidyquant)
library(timetk)
library(modeltime)

#Quarterly change (%) of USD/JPY exchange rates
#https://finance.yahoo.com/quote/JPY%3DX/
df_usdjpy <- 
  tq_get("JPY=X", to = "2024-07-01") %>% 
  tq_transmute(select = "close",
               mutate_fun = to.quarterly) %>% 
  tq_transmute(mutate_fun = periodReturn,
               period = "quarterly",
               col_rename = "usd_jpy")
  


#Quarterly change (%) of exports of goods and services in Japan
#https://fred.stlouisfed.org/series/JPNEXPORTQDSNAQ
df_exports <- 
  tq_get("JPNEXPORTQDSNAQ", 
         get = "economic.data") %>% 
  select(date, exports = price) %>% 
  mutate(date = as.yearqtr(date),
         exports = exports / lag(exports) - 1) %>% 
  drop_na()


#Quarterly change (%) of Nikkei 225
#https://finance.yahoo.com/quote/%5EN225/
df_nikkei <- 
  tq_get("^N225", to = "2024-07-01") %>% 
  tq_transmute(select = "close",
               mutate_fun = to.quarterly) %>% 
  tq_transmute(mutate_fun = periodReturn,
               period = "quarterly",
               col_rename = "nikkei225")


#Merging series
df_merged <- 
  df_exports %>% 
  left_join(df_usdjpy, by = "date") %>% 
  left_join(df_nikkei, by = "date") 


#Change of quarterly %
df_merged %>% 
  pivot_longer(-date, names_to = "vars") %>% 
  filter(date >= 2021) %>% 
  mutate(type = case_when(
    vars == "exports" ~ "Exports in Japan",
    vars == "usd_jpy" ~ "USD/JPY",
    vars == "nikkei225" ~ "Nikkei 225",
    TRUE ~ vars
  )) %>% 
  ggplot(aes(date, value, color = vars)) +
  geom_line(linewidth = 1.25) +
  geom_text(
    data = . %>%  slice_tail(n = 1, by = type),
    aes(label = type),
    hjust = 0,
    vjust = 0,
    family = "Bricolage Grotesque",
    nudge_x = 0.05,
    size = 5
  ) +
  scale_y_continuous(labels = scales::percent) +
  scale_x_yearqtr(format = "%Y Q%q",
                  expand = expansion(mult = c(0, .3))) +
  labs(x = "", 
       y = "",
       subtitle = "Change of quarterly %") +
  theme_minimal(base_family = "Bricolage Grotesque",
                base_size = 16) +
  theme(legend.position = "none",
        panel.grid.minor = element_blank())

In the chart above, exchange rates and exports moved in opposite directions in the last quarter, but a single quarter is not evidence of a relationship between them. To examine whether the exchange rate has a significant effect on exports, we will use bootstrap confidence intervals.

#Bootstrap confidence intervals
set.seed(12345)
jpn_intervals <- 
  reg_intervals(exports ~ usd_jpy,
                data = df_merged,
                model_fn = "glm",
                keep_reps = TRUE)


jpn_intervals %>%
  unnest(.replicates) %>%
  ggplot(aes(estimate, fill = term)) +
  geom_vline(xintercept = 0, linewidth = 1.5, lty = 2, color = "gray50") +
  geom_histogram(alpha = 0.8, show.legend = FALSE) +
  labs(x = "", 
       y = "",
       subtitle = "The distribution includes zero which means there is no significant effect",
       title = "The estimated effect of USD/JPY rates on exports in Japan") +
  theme_minimal(base_family = "Bricolage Grotesque") +
  theme(axis.text = element_text(size = 16),
        plot.title = element_text(size = 16))

According to our bootstrap results, there seems to be no significant effect of USD/JPY exchange rates on exports of goods and services in Japan.
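To make the reasoning behind "the interval includes zero" concrete, here is a minimal base R sketch of a bootstrap interval for a regression slope. It is not what `reg_intervals()` does internally: it uses a plain percentile interval and simulated stand-in data in which `y` does not depend on `x`. All names and numbers below are assumptions for illustration.

```r
set.seed(12345)

# Simulated stand-in data (NOT the series from the post): y has no true
# dependence on x, mimicking a "no effect" scenario
n   <- 60
sim <- data.frame(x = rnorm(n), y = rnorm(n))

# Refit the regression on rows resampled with replacement and keep the slope
boot_slopes <- replicate(2000, {
  idx <- sample.int(n, replace = TRUE)
  coef(lm(y ~ x, data = sim[idx, ]))[["x"]]
})

# Percentile interval: the 2.5% and 97.5% quantiles of the bootstrap slopes
ci <- quantile(boot_slopes, probs = c(0.025, 0.975))
ci

# The slope is conventionally called non-significant at the 5% level
# when the interval contains zero
contains_zero <- ci[[1]] <= 0 && ci[[2]] >= 0
contains_zero
```

The histogram in the post shows the same kind of distribution as `boot_slopes`; the conclusion rests on whether the interval, not just the histogram's range, covers zero.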


{Shiny.Telemetry} 0.3.0: Track User Behavior In Your Shiny Applications

Appsilon — Mon, 22 Jul 2024 09:48:53 +0000

[This article was first published on Tag: r - Appsilon | Enterprise R Shiny Dashboards, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Understanding how users interact with your application in data analytics is crucial for continuous improvement and user satisfaction. {shiny.telemetry} is a groundbreaking tool that lets you do just that.

shiny.telemetry offers a nuanced approach to user engagement, adoption, and impact assessment. Learn more in this blog post.

Whether you are a developer aiming to refine the user experience, a business analyst looking for insights into user engagement, or a project manager focused on optimizing navigation and functionality, {shiny.telemetry} has got you covered.

Not just that, {shiny.telemetry} can also replace your traditional logging approaches in Shiny applications.

Here we are announcing the third release of {shiny.telemetry} loaded with new features and some bug fixes.

Introduction to shiny.telemetry 0.3.0

One of our main design goals with this package is to make it really simple, easy to use and easy to integrate with existing codebases. This blog post will walk you through the basics and show you how to integrate with an existing app in no time.

What’s New in Version 0.3.0?

The latest version brings in lots of new features alongside performance and security improvements.

Improved User Tracking

Detect username: {shiny.telemetry} can now identify users by their usernames if the shiny app is running in a Posit Connect or ShinyProxy environment and user authentication is enabled.

Track Anonymous users: You can now track anonymous users when the user is not logged in or user authentication is not enabled. {shiny.telemetry} will set a cookie in the browser and identify returning visitors. This is now enabled by default in the latest release. (Does not work in apps running in ShinyProxy)

telemetry$start_session(track_anonymous_user = TRUE)

Set username programmatically: You can also choose to set the username programmatically if your app does not run in Posit Connect or ShinyProxy environments.

telemetry$start_session(username = your_custom_function())

Track Only What You Want

Prior versions of shiny.telemetry offered basic control over tracked data. The latest release introduces two powerful new arguments to the log_all_inputs() method:

excluded_inputs_regex: This argument allows you to define a regular expression pattern to exclude specific types of inputs from tracking. For example, you might exclude all inputs containing “date” in their ID to avoid capturing every date picker interaction.

include_input_ids: This argument lets you prioritize specific inputs for tracking. This is useful when you’re particularly interested in certain user interactions within your app. You can provide a vector of input IDs that will be tracked regardless of any exclusions set with other arguments.

By combining these arguments, you can achieve a high level of granularity in your telemetry data.

Here’s an example:

telemetry <- Telemetry$new()

telemetry$start_session()

# Do not track inputs that contain "tbl_" while still tracking "tbl_row_selected"
telemetry$log_all_inputs(
  excluded_inputs_regex = "tbl_",
  include_input_ids = "tbl_row_selected"
)

This code excludes every input whose ID contains tbl_ from tracking, with one exception: changes to tbl_row_selected are still tracked. Inputs that do not match the pattern are tracked as usual.
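The way the two arguments combine can be illustrated with plain base R. This sketch only mirrors the behavior described above; it is not {shiny.telemetry}'s internal code, and the input IDs are hypothetical.

```r
# Illustration only: this is not shiny.telemetry's internal code.
# Hypothetical input IDs from an app
input_ids <- c("tbl_row_selected", "tbl_page", "tbl_search", "date_from", "n_bins")

excluded_inputs_regex <- "tbl_"
include_input_ids     <- "tbl_row_selected"

# An input is tracked if it does not match the exclusion pattern,
# or if it is explicitly listed in include_input_ids
excluded <- grepl(excluded_inputs_regex, input_ids)
tracked  <- input_ids[!excluded | input_ids %in% include_input_ids]
tracked
# "tbl_row_selected" "date_from" "n_bins"
```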

This level of control empowers you to focus on the most relevant user interactions within your Shiny app, leading to more focused and actionable insights.

To learn more about this, check our new how-to guide: “Track a Subset of Inputs to Improve Performance”.

Adds Tracking of Shiny Errors

This new version of {shiny.telemetry} adds a new type of event that tracks errors in Shiny applications. These can be exceptions raised inside render, observe or reactive calls.

Shiny has supported this natively since version 1.8.1 through the onUnhandledError callback. For users who cannot or do not want to update to that version, we added a custom mechanism that supports earlier versions.

The error event is tracked by default once you update {shiny.telemetry} to version 0.3.0.

Figure 1: Analytics dashboard showing error on user session.

Support for MongoDB

In the previous version of {shiny.telemetry}, all the major relational databases were supported, e.g. SQLite, MySQL, MariaDB, PostgreSQL, and MSSQL. This release adds support for MongoDB, one of the most widely used NoSQL databases. Here is an example of initializing a Telemetry object with a MongoDB backend:

telemetry <- Telemetry$new(
  app_name = "your app name",
  data_storage = DataStorageMongoDB$new(
    host = "localhost",
    port = 27017,
    username = "root",
    password = "example"
  )
)

The MongoDB backend shares the same API as the others, making it seamless to migrate between data storage backends. There are additional parameters to help configure the MongoDB connection; please read the documentation for more details.

Bug Fixes and Security Update

Along with the new functionalities we also worked on fixing user reported issues as well as reducing a possible security vulnerability when accessing the analytics data.

We fixed a problem with reading timestamps from SQLite data storage backends. This does not require any action on existing databases, as the timestamps are stored correctly.

This release also adds extra protection against SQL injection by using glue::glue_sql to build queries for the SQL data storage backends. The previous mechanism already protected against SQL injection by checking that the query-building parameters were only dates, but this adds an extra layer of security.

Documentation and Miscellaneous Improvements

Release 0.3.0 of {shiny.telemetry} adds 3 new guides to the documentation and some changes to the example applications in the inst/examples folder.

“Setup shiny.telemetry in a Rhino application” shows step by step how to integrate {shiny.telemetry} with Rhino;
“Track a Subset of Inputs to Improve Performance” explains and gives some examples of how to use this new feature in {shiny.telemetry};
“Use External Databases with shiny.telemetry” helps users understand how to use {shiny.telemetry} with different database servers and how to protect sensitive information.

Lastly, we also updated our logo as part of an image update of the Rhinoverse.

Try {shiny.telemetry} yourself on your applications and let us know on LinkedIn or in our Shiny Community on Slack how you feel about it.

{shiny.telemetry} is already being used on our projects and with our clients to help us better understand the users of a dashboard and improve it to deliver the best possible user experience, as per our Shiny Manifesto.

Stay up to date on the latest in shiny.telemetry and our Rhinoverse packages. Sign up for our weekly newsletter today.

This blog post was co-authored by André Veríssimo.

The post appeared first on appsilon.com/blog/.

To leave a comment for the author, please follow the link and comment on their blog: Tag: r - Appsilon | Enterprise R Shiny Dashboards.


Continue reading: {Shiny.Telemetry} 0.3.0: Track User Behavior In Your Shiny Applications

How to Concatenate Strings in R

Steven P. Sanderson II, MPH — Mon, 22 Jul 2024 04:00:00 +0000

[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


Introduction

Hello, R users! Today, we’re going to talk about a fundamental yet essential aspect of data manipulation: concatenating strings. String concatenation is the process of joining two or more strings together. Whether you’re working with text data, creating labels, or generating dynamic outputs, knowing how to concatenate strings efficiently is a must. We’ll explore how to do this using base R, the stringr package, and the stringi package. Let’s get started!

Examples

Concatenating Strings in Base R

Base R provides a straightforward way to concatenate strings using the paste() and paste0() functions. Here’s how you can use them:

Using `paste()`

The paste() function combines strings and adds a separator (default is a space).

# Example
string1 <- "Hello"
string2 <- "World"
result <- paste(string1, string2)
print(result)  # Output: "Hello World"

[1] "Hello World"

In this example, paste(string1, string2) joins “Hello” and “World” with a space in between.

Using `paste0()`

The paste0() function is similar to paste(), but it joins strings with no separator at all (it behaves like paste() with sep = "").

# Example
result_no_space <- paste0(string1, string2)
print(result_no_space)  # Output: "HelloWorld"

[1] "HelloWorld"

Here, paste0(string1, string2) joins “Hello” and “World” without any spaces.

Custom Separator

You can also specify a custom separator with paste().

# Example
result_custom_sep <- paste(string1, string2, sep = ", ")
print(result_custom_sep)  # Output: "Hello, World"

[1] "Hello, World"

By setting sep = ", ", we add a comma and a space between the strings.
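The examples so far join two single strings. As a small additional sketch (the names below are made-up sample data), paste() is also vectorized over its inputs, and its collapse argument reduces the resulting vector to a single string:

```r
first <- c("Ada", "Grace", "Alan")
last  <- c("Lovelace", "Hopper", "Turing")

# paste() is vectorized: one output string per pair of inputs
full_names <- paste(first, last)
print(full_names)

# collapse joins the resulting vector into one single string
one_line <- paste(first, last, collapse = "; ")
print(one_line)
```

Here sep controls what goes between the arguments within each element, while collapse controls what goes between the elements of the result.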

Concatenating Strings with `stringr`

The stringr package offers a more consistent and user-friendly way to handle strings in R. For concatenation, we use the str_c() function.

Using `str_c()`

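The str_c() function from stringr joins strings with no separator by default, like paste0(), but it propagates missing values (NA) instead of turning them into the string "NA", and it accepts sep and collapse arguments.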

# Load stringr package
library(stringr)

# Example
result_str_c <- str_c(string1, string2)
print(result_str_c)  # Output: "HelloWorld"

[1] "HelloWorld"

This example is equivalent to paste0().
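One practical difference that the example above does not show is how missing values are handled. The following short sketch, using a made-up vector x, compares the two functions on an input containing NA:

```r
library(stringr)

x <- c("a", NA, "c")

# paste0() converts NA to the literal string "NA"
base_result <- paste0(x, "_suffix")

# str_c() propagates the missing value instead
stringr_result <- str_c(x, "_suffix")

print(base_result)
print(stringr_result)
```

Which behavior is preferable depends on the task: propagating NA makes missing data visible downstream, while paste0() always returns a usable string.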

Custom Separator

To add a separator, use the sep argument in str_c().

# Example with separator
result_str_c_sep <- str_c(string1, string2, sep = " ")
print(result_str_c_sep)  # Output: "Hello World"

[1] "Hello World"

Here, sep = " " adds a space between the strings.

Concatenating Strings with `stringi`

The stringi package is another powerful tool for string manipulation in R. For concatenation, we use the stri_c() function.

Using `stri_c()`

The stri_c() function works similarly to paste0() and str_c().

# Load stringi package
library(stringi)

# Example
result_stri_c <- stri_c(string1, string2)
print(result_stri_c)  # Output: "HelloWorld"

[1] "HelloWorld"

This joins “Hello” and “World” without spaces.

Custom Separator

To include a separator, use the sep argument in stri_c().

# Example with separator
result_stri_c_sep <- stri_c(string1, string2, sep = " ")
print(result_stri_c_sep)  # Output: "Hello World"

[1] "Hello World"

The sep argument adds a space between the strings.

Conclusion

String concatenation is a simple yet vital task in data manipulation. Whether you prefer base R functions like paste() and paste0(), or the more specialized stringr and stringi packages, you have multiple options to choose from. Each method has its unique advantages, and understanding them will help you handle strings more effectively in your R projects.

Feel free to try these examples on your own and see how they work with your data. Happy coding!

To leave a comment for the author, please follow the link and comment on their blog: Steve's Data Tips and Tricks.


Continue reading: How to Concatenate Strings in R

Forecasting uncertainty: sequential split conformal prediction + Block bootstrap (web app)

T. Moudiki — Mon, 22 Jul 2024 00:00:00 +0000

[This article was first published on T. Moudiki's Webpage - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


This post was first submitted to the Applied Quantitative Investment Management group on LinkedIn. It illustrates a recipe implemented in the Python package nnetsauce for quantifying time series forecasting uncertainty through simulation: sequential split conformal prediction + block bootstrap.

Underlying algorithm:

Split data into training set, calibration set and test set
Obtain point forecast on calibration set
Obtain calibrated residuals = true observation on calibration set – point forecast on calibration set
Simulate calibrated residuals using block bootstrap
Obtain point forecast on test set
Prediction = Calibrated residuals simulations + point forecast on test set
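The steps above can be sketched in base R as follows. This is an illustration under stated assumptions, not the nnetsauce implementation: the toy series, the drift_forecast() forecaster, the block_bootstrap() helper, the block length, and the choice to refit on training plus calibration data before forecasting the test set are all hypothetical. Residuals are taken as observation minus forecast, so that adding them back to a point forecast is consistent.

```r
set.seed(123)

# Toy series: random walk with drift (assumed data, for illustration only)
n <- 120
y <- cumsum(rnorm(n, mean = 0.2))

# 1. Sequential split: training, calibration, test
n_train <- 60
n_calib <- 40
h <- n - n_train - n_calib
y_train <- y[1:n_train]
y_calib <- y[(n_train + 1):(n_train + n_calib)]
y_test  <- y[(n_train + n_calib + 1):n]

# Placeholder forecaster (an assumption): last value plus average drift
drift_forecast <- function(history, horizon) {
  drift <- mean(diff(history))
  tail(history, 1) + drift * seq_len(horizon)
}

# 2-3. Point forecast on the calibration set and calibrated residuals
f_calib <- drift_forecast(y_train, n_calib)
resid_calib <- y_calib - f_calib

# 4. Moving block bootstrap: resample contiguous blocks of residuals
block_bootstrap <- function(x, horizon, block_len) {
  n_blocks <- ceiling(horizon / block_len)
  starts <- sample.int(length(x) - block_len + 1, n_blocks, replace = TRUE)
  idx <- unlist(lapply(starts, function(s) s:(s + block_len - 1)))
  x[idx[seq_len(horizon)]]
}

# 5. Point forecast on the test set
f_test <- drift_forecast(c(y_train, y_calib), h)

# 6. Simulated predictions = point forecast + bootstrapped residuals
n_sims <- 250
sims <- replicate(n_sims, f_test + block_bootstrap(resid_calib, h, block_len = 5))

# Empirical 95% prediction interval for each forecast horizon
lower <- apply(sims, 1, quantile, probs = 0.025)
upper <- apply(sims, 1, quantile, probs = 0.975)
print(head(cbind(lower, f_test, upper)))
```

Resampling whole blocks rather than individual residuals is what preserves short-range dependence in the simulated paths.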

Interested in experimenting more? Here is a web app.

For more details, you can read (under review): https://www.researchgate.net/publication/379643443_Conformalized_predictive_simulations_for_univariate_time_series

To leave a comment for the author, please follow the link and comment on their blog: T. Moudiki's Webpage - R.


Continue reading: Forecasting uncertainty: sequential split conformal prediction + Block bootstrap (web app)

How to use read.table Function in R | Read table in r

Zubair Goraya — Sun, 21 Jul 2024 05:20:00 +0000


[This article was first published on RStudioDataLab, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


Key Point

The read.table function in R is essential for importing tabular data into data frames. It supports various file formats with customizable parameters, allowing you to handle complex datasets accurately.
```
df <- read.table("path/to/file.txt", header=TRUE, sep=",")
```

Adjusting the sep parameter lets read.table handle different delimiters, such as commas, tabs, and spaces, ensuring versatile data import capabilities.
```
df <- read.table("path/to/file.txt", header=TRUE, sep="\t")
```

Use additional parameters, such as row.names and col.names, with read.table to precisely control the data structure and enhance data analysis accuracy.
```
df <- read.table("path/to/file.csv", header=TRUE, sep=",", row.names=1, col.names=c("Col1", "Col2", "Col3"))
```

It also helps in converting raw data into structured formats for efficient manipulation and analysis in R, which is crucial for any data analyst.
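The paths in the snippets above are placeholders. For a version that runs as-is, here is a small sketch that writes a made-up comma-separated file to a temporary location and reads it back; the column names and values are invented for illustration.

```r
# Write a small comma-separated file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "id,height,weight",
  "a,170,65",
  "b,182,80",
  "c,158,52"
), tmp)

# header = TRUE : the first line holds the column names
# sep = ","     : fields are separated by commas
# row.names = 1 : use the first column as row names
df <- read.table(tmp, header = TRUE, sep = ",", row.names = 1)

print(df)
str(df)
```

Because the first column becomes the row names, the resulting data frame has two data columns, height and weight.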


Llama, Llama, Oh Give Me A Sign. What’s In The Latest IDSA Guideline?

r on Everyday Is A School Day — Sun, 21 Jul 2024 00:00:00 +0000

[This article was first published on r on Everyday Is A School Day, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


Wow, what a journey, and more to come! We learned how to perform simple RAG with an LLM and even ventured into LangChain territory. It wasn’t as scary as some people said! The documentation is fantastic. Best of all, we did it ALL in R with reticulate, without leaving RStudio! Not only can we read the IDSA guidelines, we can also use an LLM to help us retrieve information!

TL;DR

I think everyone should learn the intuition behind LLMs, prompt engineering, RAG, agents, etc. The concepts themselves, together with some trial and error, give users a renewed perspective on how these things work, how helpful and beneficial they can be, and how they serve as a tool for learning rather than a replacement. The simplest and most straightforward way to learn is to try the GPT4All GUI, follow LangChain’s tutorial, use a local LLM, and then give it a go! Yes, most of these are in Python, but that shouldn’t prevent R users like me from using them in R via reticulate!

Disclaimer

This is mainly for data science educational purposes only. This is NOT medical advice, nor is it medical education. Some medical statements here may be inaccurate. If you find any errors in this article, please feel free to educate me.

Objectives:

Motivation

I have been wanting to learn about LLMs for a while now. My first attempt was an utter failure. Not being proficient in Python, needing to install a bunch of Python packages, and facing unfamiliar error messages were all quite discouraging. That said, it’s always good to try something and fail, then pick it up again some other time and invest more time in learning its individual parts. This is exactly what happened! I’m glad that it did, and now I have a better understanding of it.

What really catalyzed this learning process was the recent AI Summit Conference. They offered a beginner session on prompt engineering, which made me want to learn more about LLMs. However, most LLM sessions and books involve a paid API such as OpenAI GPT-3.5/4o, Claude, Azure, etc. I wanted something local that does not charge for each token (in my case, each erroneous token) sent to the API, so that I would not be billed for my mistakes. Then we stumbled upon GPT4All. This really kick-started the learning process without requiring ANY code! Just download it (it has a GUI), attach local files, and chat away, without internet!

Below is a GIF of the website and a snapshot of the GUI

We won’t go through the details of using GPT4All here, as it is quite intuitive. They also have a Discord channel if you have questions, with very nice and helpful people. Through the channel, I recently learnt that GPT4All does not yet have embedding implemented in the Python SDK. The most straightforward way to learn how to run an LLM locally is to follow these simple steps:

Download the app
Open the app, select model to download
Attach folder on LocalDoc (this can be tricky if document is too large/long, but if you use nomic.ai’s embedding, which requires sign up and API key, it’s very fast).
Then start chat, attach the LocalDoc folder of interest, and start chatting!

You will also need to change the n_ctx setting (the number of tokens that can be sent) and the max_tokens setting (the number of tokens in the LLM’s response) if your question is long or you want a longer answer. I did not tweak much else here. I didn’t find the responses very accurate, though they were quick. Since it doesn’t allow a whole lot of customization, I went to LangChain instead, since it is widely used and I could find more tutorials for it. However, GPT4All really got me started, and I am forever grateful that I found it and its community. Another thing the community pointed out was that if the prompt template looks something like

### Human:
%1

### Assistant:

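then it is not the right template but rather a placeholder, and you have to enter the model-specific prompt template for it to work. For example, a ChatML-style system prompt template looks like the following (the exact tokens differ between models, so check the model card):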

<|im_start|>system
Write your system prompt here<|im_end|>

With LangChain, you don’t need the above; most of it is done for you! More to come.

LangChain

I chose to learn LangChain because of its standardization of lingo, functions, workflow, etc. It reminds me a whole lot of tidymodels: it has a specific workflow, but incorporates all the cool engines (ML methods) into that workflow. It doesn’t matter whether you’re using Llama 3, WizardLM, Gemma 2, etc.; the workflow and functions are all the same. Also, there are a lot of resources out there that use it, which is extremely helpful for getting started. Not to mention, the documentation of LangChain is fantastic! Because this field is evolving so swiftly, some of the tutorial code is deprecated, but the documentation offers the current function and points you in the right direction. I definitely enjoyed reading through it whenever I stumbled upon a problem.

That said, it does not come without LOTS and LOTS of trial and error. Below I’ll try to document the things I needed for this to run in R. It may not be exhaustive, but if you get stuck on any of the steps below, please let me know and I’ll try to reproduce it and help you troubleshoot. Here, I am using LlamaCpp because it is an efficient, open-source C++ implementation of inference for Meta’s LLaMA language models, designed for CPU-based inference. It allows users to run large language models on consumer-grade hardware with relatively low memory requirements, thanks to its support for various quantization levels.

Prerequisite

I assume you have Python and reticulate installed, and that reticulate is pointing to the Python installation you use to install the following Python packages:

pip install --upgrade langchain langchain-community langchain_core langchain_huggingface llama-cpp-python faiss-cpu sentence-transformers

Sorry if I missed anything. If you get package-not-found errors when you run the code, you can use them to troubleshoot. Let me know if I missed anything and I’ll update the list.

Code In Action – Explained

Load Packages

library(reticulate)
library(tidyverse)

# load modules
langchain_community <- import("langchain_community")
langchain <- import("langchain")
langchain_core <- import("langchain_core")
langchain_huggingface <- import("langchain_huggingface")

# load functions
### Documents
DirectoryLoader <- langchain_community$document_loaders$directory$DirectoryLoader
PyPDFLoader <- langchain_community$document_loaders$PyPDFLoader
RecursiveCharacterTextSplitter <- langchain$text_splitter$RecursiveCharacterTextSplitter
HuggingFaceEmbeddings <- langchain_huggingface$HuggingFaceEmbeddings

### Embedding / Vectorstorage / Retriever
FAISS <- langchain_community$vectorstores$FAISS

### Model
LlamaCpp <- langchain_community$llms$LlamaCpp
CallbackManager <- langchain_core$callbacks$CallbackManager
StreamingStdOutCallbackHandler <- langchain_core$callbacks$StreamingStdOutCallbackHandler


### Template
PromptTemplate <- langchain$prompts$PromptTemplate
ChatPromptTemplate <- langchain_core$prompts$ChatPromptTemplate

### Chain
create_retrieval_chain <- langchain$chains$create_retrieval_chain
create_stuff_documents_chain <- langchain$chains$combine_documents$create_stuff_documents_chain

This is quite self-explanatory. If you have questions, copy and paste the code into an LLM and have it explain. Make sure to get your copy of the PDF here.

Load & Embed Document

loader = PyPDFLoader("amr-guidance-4.0.pdf")
documents = loader$load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter$split_documents(documents)
vectorstore = FAISS$from_documents(documents=docs, embedding=HuggingFaceEmbeddings())
retriever = vectorstore$as_retriever()

Explanation:

Loads the contents of the PDF file named “amr-guidance-4.0.pdf”.
Extracts the text content from the loaded PDF and stores it in the documents variable.
Creates a text splitter that will divide the text into chunks of approximately 1000 characters each, with an overlap of 200 characters between adjacent chunks (to maintain context).
Applies the splitter to the documents variable, breaking the text into smaller chunks stored in the docs variable.
Initializes an embedding model from the Hugging Face library. Embeddings are numerical representations of text that capture semantic meaning.
Creates a FAISS vector store (vectorstore). It takes the split text chunks (docs) and converts them into embeddings using the specified HuggingFaceEmbeddings model. These embeddings are then stored in the vector store. When no model is specified, HuggingFaceEmbeddings defaults to the sentence-transformers/all-mpnet-base-v2 model.
Creates a retriever object from the vectorstore. This retriever allows you to efficiently search the vector store for text chunks that are semantically similar to a given query.
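To build intuition for what chunk_size and chunk_overlap mean, here is a deliberately simplified base R sketch. The helper chunk_text() is hypothetical and uses fixed-width windows; RecursiveCharacterTextSplitter is more sophisticated and prefers to split on separators such as paragraph and line breaks, so its chunks will not match this output.

```r
# Simplified fixed-width chunker (hypothetical helper, not LangChain's algorithm)
chunk_text <- function(text, chunk_size, chunk_overlap) {
  stopifnot(chunk_overlap < chunk_size)
  step <- chunk_size - chunk_overlap
  n <- nchar(text)
  # Each chunk starts `step` characters after the previous one,
  # so consecutive chunks share `chunk_overlap` characters
  starts <- seq(1, max(1, n - chunk_overlap), by = step)
  substring(text, starts, pmin(starts + chunk_size - 1, n))
}

text <- paste(rep(letters, 2), collapse = "")  # 52 characters
chunks <- chunk_text(text, chunk_size = 20, chunk_overlap = 5)
print(chunks)
```

The last five characters of each chunk reappear at the start of the next one, which is how overlap keeps some context across chunk boundaries.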

Embedding, huh?

Embedding is a technique used in natural language processing (NLP) to represent words, sentences, or documents as numerical vectors. These vectors capture the semantic meaning of the text and can be used for various NLP tasks, such as similarity search, text classification, and language generation. In this case, we are using the Hugging Face library to generate embeddings for the text chunks extracted from the PDF document.

This tutorial has a great description of what embedding model -> vector storage means.

Example of embedding vector

embedding = HuggingFaceEmbeddings()
embedding$embed_query(text="can i use gentamicin for pseudomonas infection")

This embedding vector is then compared with all of the chunk vectors using the squared L2 distance (squared Euclidean distance):

$d(a,b) = \sum_i (a_i-b_i)^2$

To get the actual Euclidean distance we would take the square root, but FAISS omits it here for computational efficiency. The lower the number, the more similar the two vectors are, and the ranking is the same with or without the square root because the square root is monotonically increasing. More details on the proof
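As a quick check of that claim, here is a toy example in base R with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions). It computes both the squared and the plain L2 distance and confirms that they rank the chunks identically:

```r
# Toy 3-dimensional "embeddings" (made-up numbers)
query  <- c(0.2, 0.1, 0.7)
chunks <- rbind(
  chunk_a = c(0.1, 0.1, 0.8),
  chunk_b = c(0.9, 0.0, 0.1),
  chunk_c = c(0.3, 0.2, 0.5)
)

# Squared L2 distance from the query to every chunk
sq_l2 <- rowSums(sweep(chunks, 2, query)^2)

# Plain L2 (Euclidean) distance
l2 <- sqrt(sq_l2)

print(sq_l2)
print(order(sq_l2))  # ranking by squared distance
print(order(l2))     # same ranking after taking the square root
```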

Return the Best (lowest squared L2) Similarity

vectorstore$similarity_search_with_score(query = "can i use gentamicin for pseudomonas infection", k=as.integer(1))

Now let’s calculate it by hand!

query_v <- embedding$embed_query(text="can i use gentamicin for pseudomonas infection") 
page53_v <- embedding$embed_query(text=docs[[167]]$page_content) 
sum((query_v - page53_v)^2)

## [1] 0.7473773

YES !!! Same number, awesome!!!

Model

llm = LlamaCpp(
  model_path="wizardlm-13b-v1.2.Q5_K_M.gguf",
  streaming=TRUE,  
  callback_manager=CallbackManager(handlers = list(StreamingStdOutCallbackHandler())),
  n_ctx = as.integer(2048),
  max_tokens = as.integer(1024),
  temperature = 0,
  verbose = FALSE)

Explanation:

Initializes a LlamaCpp object from the langchain_community.llms module.
Sets the path to the GGUF model file (see below for how to download one).
Enables streaming output, allowing the model to generate text incrementally and send it back as it’s produced, rather than waiting for the entire generation to finish.
Creates a CallbackManager object, which allows you to register callbacks (functions) to be executed during the text generation process. Here it registers a StreamingStdOutCallbackHandler, which prints the generated tokens directly to standard output (your console) as they are produced, providing a real-time view of the generation.
Sets the maximum context window size to 2048 tokens. The context window is the amount of text the model can “remember” and use to generate its output.
Limits the maximum number of tokens in the model’s output to 1024. This prevents the model from generating overly long responses.
Sets the temperature, which controls the “creativity” or randomness of the model’s output. A temperature of 0 makes the model effectively deterministic, always choosing the most likely next token.
Disables verbose logging from the LlamaCpp library. I disabled this to keep the blog output tidy; set it to TRUE to see detailed output during the generation process.
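To see why a temperature of 0 gives deterministic output, here is a simplified base R sketch of temperature-scaled softmax. The logits and the helper next_token_probs() are made up for illustration; real samplers such as the one in llama.cpp also apply other settings (for example top-k and top-p), so this is only the core idea.

```r
# Hypothetical next-token scores (logits) for a 4-word vocabulary
logits <- c(the = 2.0, a = 1.5, gentamicin = 0.3, llama = -1.0)

# Softmax with temperature; temperature 0 is treated as greedy (argmax) decoding
next_token_probs <- function(logits, temperature) {
  if (temperature == 0) {
    probs <- as.numeric(logits == max(logits))
    return(setNames(probs / sum(probs), names(logits)))
  }
  z <- logits / temperature
  z <- z - max(z)  # subtract the max for numerical stability
  exp(z) / sum(exp(z))
}

print(round(next_token_probs(logits, 1.0), 3))
print(round(next_token_probs(logits, 0.5), 3))
print(next_token_probs(logits, 0))
```

Lowering the temperature concentrates probability on the highest-scoring token, and at 0 all of the probability sits on it.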

How to Download GGUF models?

Go to Hugging Face; here I have pre-selected the text-generation task and the GGUF library for you
Select a model that piques your interest (try 7-15B parameters)
Select “Files and versions”
Select a model file and download it. If the GGUF file is split into parts (e.g. part 1 of 2), make sure to download all parts and select the first part when you’re assigning the model.

Please note that certain LLMs, such as Llama and Gemma, require you to request permission to use their models. It is quite straightforward: read through their policy, request access by filling out your information, wait for approval, and then you’re in!

Prompt

system_prompt = "
    You are an expert for question-answering tasks. 
    Use the following pieces of retrieved context to answer the question.
    When answering, also make sure to state the reason or rationale for that answer.
    Then question that reason or rationale critically once with a sentence.
    Then provide an answer to the critical appraisal.
    If you don't know the answer, say that you don't know.

    {context}

"

prompt = ChatPromptTemplate$from_messages(
  list(
    tuple("system",system_prompt),
    tuple("user",paste0("Question: ","{input}")),
    tuple("assistant",""))
)

prompt

Make sure the system prompt goes to SystemMessage. I had to debug this for some time and finally realized that the ChatPromptTemplate$from_messages function needs a list of tuples in order to work. I found this out by reading the LangChain documentation. To be honest, I find their documentation very helpful!

Chain or Runnables

question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

Explanation:

create_stuff_documents_chain(): This LangChain function creates a chain designed for question-answering over documents. It takes llm, the model that will generate the answers, and prompt, the ChatPromptTemplate that guides how the LLM structures its response. The resulting chain takes a question and some documents as input, inserts the documents into the prompt as context, and passes everything to the LLM, which generates an answer based on the given information.
create_retrieval_chain(): This LangChain function creates the RAG chain. It takes retriever, the retriever object created from the vector store, which fetches the documents most relevant to a query, and question_answer_chain, the chain created on the previous line, which generates answers from the retrieved documents. When invoked, this chain first retrieves the relevant chunks for the input and then passes them, together with the question, to question_answer_chain.
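Conceptually, “stuffing” just means pasting all retrieved chunks into the {context} placeholder of the prompt before it is sent to the model. The base R sketch below imitates that step only; the retrieved text, the template, and the helper stuff_prompt() are hypothetical stand-ins, and LangChain’s own implementation handles message roles and document formatting in its own way.

```r
# Hypothetical stand-ins for retrieved chunks and a prompt template
retrieved <- c(
  "Chunk one of the guideline text.",
  "Chunk two of the guideline text."
)
template <- "Use the following context to answer the question.\n\n{context}\n\nQuestion: {input}"

# "Stuff" = join all retrieved chunks and drop them into {context}
stuff_prompt <- function(template, docs, question) {
  context <- paste(docs, collapse = "\n\n")
  out <- sub("{context}", context, template, fixed = TRUE)
  sub("{input}", question, out, fixed = TRUE)
}

final_prompt <- stuff_prompt(template, retrieved, "What guideline are we looking at today?")
cat(final_prompt)
```

Because every retrieved chunk is included verbatim, the combined prompt has to fit within the model’s context window (n_ctx).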

Then, we are ready to ask our questions!

result = rag_chain$invoke(dict("input"= "What guideline are we looking at today?"))

result

The output was saved in result as a list. As you can see, the first output is the question, and the 2nd, 3rd, and 4th are the context chunks that our retriever found most similar to our question. Lastly, we have our answer, “Guidance on the Management of Antimicrobial Resistance”, though it seems to have missed “Gram Negative”.

Create A Vector of Questions

Let’s create a vector of questions, pass them to our LLM, and see what the responses are.

questions = c("What is the preferred treatment of CRE?",
             "What is the preferred treatment of ESBL-E?",
             "Can we use fosfomycin in ESBL Klebsiella?",
             "Can we use fosfomycin in ESBL Ecoli?",
             "What is the preferred treatment of stenotrophomonas?",
             "What is the preferred treatment of DTR Pseudomonas?",
             "Which organisms require two active agent when susceptibility is known?",
             "Can we use gentamicin in pseudomonas infection?",
             "Can we use tobramycin to treat pseudomonas infection?",
             "Why is there carbapenemase non-producing organism?",
             "Can we use oral antibiotics for any of these MDRO?",
             "What is the preferred treatment of MRSA?",
             "What is the preferred treatment of CRAB?",
             "Can fosofmycin be used for pyelonephritis?",
             "Is IV antibiotics better than oral antibiotics?")

response = vector(mode = "character", length = length(questions))
source = vector(mode = "character", length = length(questions))


for (i in seq_along(questions)) {
  cat(rep("\n",100))
  print(paste0("Question: ", questions[i]))
  result = rag_chain$invoke(dict("input" = questions[i]))
  response[i] = result$answer
  source[i] = map_chr(.x=result$context,.f=~paste(.x)) |> paste(collapse = "\n\n##########\n\n") 
  Sys.sleep(10)
}

Explanation:

The questions variable contains a vector of the questions we’re interested in asking.
Create empty response and source vectors.
Create a for loop that does the following:
Clears the console by printing lots of new lines.
Prints the question.
Invokes the chain on the i-th question.
Saves the i-th response to the i-th element of the response vector.
Saves the i-th source to the i-th element of the source vector (this tells us which parts of our documents the RAG drew on), separating each source with \n\n##########\n\n.
Sleeps for 10 seconds, otherwise our machine will heat up quite fast; this also helps us capture the response on screen.

Ready for the response? Let’s go!

Questions to Our LLM

Anything highlighted in baby blue is the LLM’s answer; the light green highlight shows the four source chunks the retriever found most similar to the question. If the GIF is too fast to follow, you can look at the actual output. I didn’t return all sources unless needed for investigation. The first question has everything printed.

The material below might be a bit clinical; please feel free to glance through the first few examples and move on if you’re not interested.

What is the preferred treatment of CRE?

Response:

The preferred treatment for infections outside of the urinary tract caused by Carbapenem-Resistant Enterobacteriaceae (CRE) that produce Klebsiella pneumoniae carbapenemase (KPC) depends on several factors, including the specific type of infection, the patient’s clinical status, and the susceptibility patterns of the CRE isolate. According to the guidelines provided, the preferred antibiotics for such infections include:

1. Ceftazidime-avibactam, which inhibits KPC enzymes and other β-lactamases.
2. Meropenem-vaborbactam, where vaborbactam stabilizes meropenem against certain β-lactamases.
3. Imipenem-cilastatin-relebactam, which combines imipenem with the beta-lactamase inhibitor relebactam to protect against KPC and other carbapenemases.
4. Cefiderocol, a siderophore cephalosporin that can access bacterial iron centers and maintain activity against many Gram-negative bacteria, including those producing KPC.
5. Aminoglycosides, which can be considered as alternative options for certain types of infections.

It is important to note that the choice of antibiotic should be guided by susceptibility testing results, and the treatment regimen should be tailored to the individual patient’s needs. The continued use of a second agent beyond the initial therapy should be evaluated for additional benefit and potential risks such as increased likelihood of antibiotic resistance development.

Critical Appraisal:
The answer provided is based on current guidelines and evidence-based recommendations for treating CRE infections, particularly when KPC production is present. The rationale includes the use of antibiotics that have been shown to be effective against KPC-producing CRE, such as ceftazidime-avibactam, meropenem-vaborbactam, imipenem-cilastatin-relebactam, cefiderocol, and aminoglycosides. These recommendations are supported by clinical studies and expert opinion.

However, it is crucial to consider that the treatment landscape for CRE infections may evolve with new data and emerging resistance patterns. Therefore, healthcare providers should always consult the most current guidelines and local antibiotic resistance profiles when making treatment decisions. Additionally, individual patient factors such as renal function, allergies, and concomitant medications must be taken into account to tailor the treatment plan appropriately.

Answer:
The preferred treatment for infections outside of the urinary tract caused by CRE that produce KPC, based on current guidelines and evidence, includes:

1. Ceftazidime-avibactam, which inhibits KPC enzymes and other β-lactamases.
2. Meropenem-vaborbactam, where vaborbactam stabilizes meropenem against certain β-lactamases.
3. Imipenem-cilastatin-relebactam, which combines imipenem with the beta-lactamase inhibitor relebactam to protect against KPC and other carbapenemases.
4. Cefiderocol, a siderophore cephalosporin that can access bacterial iron centers and maintain activity against many Gram-negative bacteria, including those producing KPC.
5. Aminoglycosides, which can be considered as alternative options for certain types of infections.

It is essential to perform susceptibility testing to guide the choice of antibiotic and to monitor for potential resistance development during treatment.

Source:

page_content=‘therapeutic agent for patients at risk for CRE infections is being administered, data do not indicate that continued combination therapy —once the β-lactam agent has demonstrated in vitro activity —offers any additional benefit379. Rather, the continued use of a second agent increases the likelihood of antibiotic -’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 52}
##########
page_content=‘alternative agents for the treatment of CRE bloodstream infections352. Tigecycline or eravacycline can be considered as alternative options for intra -abdominal infections, skin and soft tissue infections, osteomyelitis, and respiratory infections when optimal dosing is used ( Table 1 ). Nausea and emesis are
reported in as many as 20-40% of patients receiving tetracycline -derivatives353-355. Of note, CLSI’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 50}
##########
page_content=‘Last updated December 31, 2023, and posted online at https://www.idsociety.org/practice -guideline/amr - guidance/ . Please check website for most updated version of this guidance.
43
CRE infections when susceptibility to meropenem or imipenem has not been demonstrated252,253. It is plausible that the addition of vaborbactam or relebactam may decrease MICs of meropenem or imipenem even in isolates without a carbapenemase because of other β-lactamases (e.g., ESBLs) that may be overproduced253. Tigecycline or eravacycline are alternative options for the treatment of CRE infections not involving the bloodstream or urinary tract (Question 3.8). Their activity is independent of the presence
or type of carbapenemase.
Question 3.4: What are the preferred antibiotics for the treatment of infections outside of the urinary tract caused by CRE if KPC production is present?’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 42}
##########
page_content=‘pyelonephritis or cUTI caused by CRE, if susceptibility is demonstrated. Ceftazidime -avibactam, meropenem -vaborbactam, imipenem -cilastatin -relebactam, and cefiderocol are also preferred treatment options for pyelonephritis or cUTIs. Aminoglycosides are alternative options for the treatment of pyelonephritis or cUTI caused by CRE.
Rationale’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 39}

Looking at the output, it’s not too shabby. I did not specify a condition, and it picked infections outside the urinary tract. Aminoglycosides sounded wrong to me at first, but the answer did hedge with “certain types of infections”: according to the guideline, they can be used for UTI/pyelonephritis if susceptible. That still does not make sense in the LLM output, since it had already restricted itself to non-urinary sources. As you can see from the last source, the retrieved context did include cUTI/pyelonephritis.

Note to self: more fine-tuning of the search_type might be helpful. Maybe set k=3? Let’s continue with the other responses.
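To make the k idea concrete, here is a minimal sketch of what a top-k similarity retriever does. It is a toy under stated assumptions, not the retriever used in this post: the 2-d vectors and chunk labels are made up for illustration, and it assumes that the four chunks printed per answer correspond to k=4.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, chunks, k=3):
    """Return the ids of the k chunks whose vectors are closest to the query."""
    ranked = sorted(chunks, key=lambda item: squared_l2(query_vec, item[1]))
    return [chunk_id for chunk_id, _ in ranked[:k]]


# Hypothetical 2-d embeddings; real embeddings have hundreds of dimensions.
chunks = [
    ("Question 3.4 (CRE, KPC, non-urinary)", [0.9, 0.1]),
    ("combination therapy rationale", [0.8, 0.3]),
    ("tigecycline alternatives", [0.7, 0.4]),
    ("cUTI / pyelonephritis options", [0.2, 0.9]),
]
query = [1.0, 0.0]

print(top_k(query, chunks, k=4))  # the off-topic cUTI chunk is still included
print(top_k(query, chunks, k=3))  # the most distant chunk is dropped
```

If the retriever comes from a LangChain vector store, the equivalent setting is usually passed as `search_kwargs={"k": 3}` to `as_retriever()`; whether that matches the setup here is an assumption.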

What is the preferred treatment of ESBL-E?

The preferred treatment for Extended-Spectrum Beta-Lactamase (ESBL)-producing Enterobacterales (E) based on the provided context is carbapenem therapy. Specifically, meropenem or imipenem -cilastatin are recommended as first-line treatments. Ertapenem is also an option, particularly for patients who need to continue therapy in the outpatient setting when oral treatment options are not available. After clinical response is achieved and if susceptibility tests confirm it, a transition to oral therapies such as trimethoprim - sulfamethoxazole (TMP - SMX), ciprofloxacin, or levofloxacin may be considered.

Critical Appraisal:
The recommendation for carbapenem therapy is based on a large clinical trial that showed a notable direction of risk difference in 30-day mortality favoring carbapenems over piperacillin - tazobactam. However, the absolute risk difference was attenuated and no longer statistically significant after reanalyzing the data with more stringent inclusion criteria (clinical isolates with piperacillin - tazobactam MICs ≤16 µg/mL). This suggests that while carbapenems are currently preferred, further research may be needed to confirm their superiority, especially when considering the cost and potential for resistance development with widespread carbapenem use.

Answer to Critical Appraisal:
The critical appraisal raises a valid point about the reliance on a single clinical trial for the recommendation of carbapenems. While the initial trial showed promising results, the fact that the difference in mortality was not statistically significant after reanalysis suggests that there may be variability in outcomes. This highlights the importance of considering multiple studies and data sources when formulating treatment guidelines. Additionally, the potential for resistance development with carbapenems is a concern that must be balanced against the benefits of their use. Therefore, while carbapenems are currently recommended as first-line treatment, ongoing monitoring of clinical outcomes and resistance patterns is essential to ensure the most effective and safe treatment options for ESBL-E infections.

page_content=‘hours or meropenem 1 g IV every eight hours, both as standard infusions (i.e., over 30 minutes). The primary outcome of 30 -day mortality occurred in 12% and 4% of patients receiving piperacillin - tazobactam and meropenem, respectively60. Trial data were subsequently reanalyzed only including patients with clinical isolates against which piperacillin -tazobactam MICs were ≤16 µg/mL by broth microdilution, the reference standard for AST72. Reanalyzing the data from 320 (82%) patients with clinical isolates available for retesting, 30 -day mortality occurred in 9% versus 4% of those in the piperacillin -tazobactam and meropenem arms, respectively. Although the absolute risk difference was attenuated and no longer significant in the reanalysis (i.e., the 95% confidence interval ranged from −1% to 11%)72, the panel still suggests carbapenem therapy as the preferred treatment of ESBL -producing’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 15}
##########
page_content=‘Suggested approach: Cephamycins are not suggested for the treatment of ESBL -E infections until more clinical outcomes data using cefoxitin or cefotetan are available and optimal dosing has been defined.
Rationale
The cephamycins are cephalosporins that are generally able to withstand hydrolysis from ESBL enzymes114,115. The cephamycins available in the United States are cefoxitin and cefotetan which are both IV agents. At least ten observational studies have compared the clinical outcomes of patients with’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 20}
##########
page_content=‘to 11%)72, the panel still suggests carbapenem therapy as the preferred treatment of ESBL -producing bloodstream infections due to the notable direction of the risk difference. Limitations of piperacillin -’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 15}
##########
page_content=‘experiencing hypoalbuminemia, meropenem or imipenem -cilastatin are the preferred carbapenems. After appropriate clinical response is achieved, transitioning to oral TMP -SMX, ciprofloxacin, or levofloxacin should be considered, if susceptibility is demonstrated.
Rationale
A carbapenem is recommended as first -line treatment of ESBL -E infections outside of the urinary tract, based primarily on data from a large clinical trial, as described below60. Meropenem, imipenem -cilastatin, or ertapenem are preferred agents; ertapenem offers a more convenient option for patients needing to continue carbapenem therapy in the outpatient setting when oral treatment options are not available.
For patients who are critically ill and/or experiencing hypoalbuminemia, meropenem or imipenem -cilastatin are the preferred carbapenems. Ertapenem, in contrast to meropenem and’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 14}

Again, quite interesting. I did not provide the condition being treated, and it picked the broadest / most severe condition and correctly de-escalated to oral antibiotics if susceptible. It also mentioned the pip-tazo-susceptible ESBL data and questioned itself.

Can we use fosfomycin in ESBL Klebsiella?

Based on the context provided, fosfomycin can be considered an alternative treatment option for ESBL-producing E. coli (EPEC) infections, particularly when preferred options such as carbapenems, trimethoprim-sulfamethoxazole (TMP-SMX), or fluoroquinolones are not tolerated or do not show susceptibility. However, the data for fosfomycin’s use against ESBL-producing Klebsiella (ESBL-K) is less clear. The context specifically mentions that fosfomycin is not suggested for prostatitis caused by gram-negative organisms other than E. coli due to the likely presence of the fosA gene, which can inactivate this agent. While fosfomycin has activity against some ESBL-producing Gram-negative bacteria, its use in ESBL-K infections should be approached with caution, and clinical outcomes data specific to this organism are lacking.

Critical Appraisal:
The rationale provided is based on the available evidence for fosfomycin’s use against EPEC rather than ESBL-K. The context suggests that while fosfomycin can be used for EPEC, its efficacy against ESBL-K is not well-established. The presence of the fosA gene in organisms other than E. coli may reduce the effectiveness of fosfomycin against these strains. Additionally, the clinical outcomes data specifically for ESBL-K are lacking, which means that any use of fosfomycin in this context would be off-label and should be carefully considered within the context of individual patient circumstances and in consultation with an infectious disease specialist.

Answer to Critical Appraisal:
The critical appraisal raises a valid point that the evidence for using fosfomycin against ESBL-K is indeed less robust compared to its use against EPEC. The presence of the fosA gene in ESBL-K, which can render fosfomycin ineffective, is a significant concern. Therefore, while fosfomycin may be an alternative for EPEC when other options are not suitable, its use in ESBL-K should be approached with caution and only after considering the potential for reduced efficacy due to resistance mechanisms like the fosA gene. Clinicians should exercise judgment and consider consulting with an infectious disease specialist when treating ESBL-K infections with fosfomycin, especially in the absence of strong clinical data supporting its use.

page_content=‘of the additional days of oral fosfomycin or ciprofloxacin is unclear. Another clinical trial randomized 51 patients with cUTI to 3 g of fosfomycin daily or 750 mg of levofloxacin daily for 5-7 days, after up to two days of IV therapy33. Clinical cure at the end of therapy was similar in both treatment groups (69% versus 68%). In this study, 63% of infections were caused by E. coli but only one isolate in each arm was caused by an ESBL -producing isolate.
IV fosfomycin is not clinically available in the United States. Although some data suggest IV fosfomycin may have activity against organisms beyond E. coli , it is difficult to translate data from IV fosfomycin to oral fosfomycin given the limited oral bioavailability and lower daily dosages with oral fosfomycin52. Transitioning to daily oral fosfomycin needs further investigation before suggesting for or’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 13}
##########
page_content=‘fosfomycin52. Transitioning to daily oral fosfomycin needs further investigation before suggesting for or against this practice for the treatment of ESBL -E pyelonephritis or cUTI; however, it may be a reasonable option when other preferred or alternative oral options are not available.
Fosfomycin is an alternative option for the treatment of prostatitis caused by ESBL -producing E. coli when preferred options (i.e., carbapenems, TMP -SMX, or fluoroquinolones) cannot be tolerated or do not test susceptible53-59. In an observational study, fosfomycin, dosed at 3 g orally daily for one week, followed by 3 g orally every 48 hours for 6 to 12 weeks, was associated with clinical cure in 36 (82%) of 44 males with chronic bacterial prostatitis53. Fosfomycin is not suggested for prostatitis caused by gram - negative organisms other than E. coli due to the likely presence of the fosA gene and its ability to inactive this agent ( Question 1.1 ).’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 13}
##########
page_content=‘Last updated December 31, 2023, and posted online at https://www.idsociety.org/practice -guideline/amr - guidance/ . Please check website for most updated version of this guidance.
11
dissections, seizures, peripheral neuropathy, and Clostridioides difficile infections24-27. Treatment with a single intravenous (IV) dose of an aminoglycoside is an alternative treatment option for uncomplicated ESBL -E cystitis. Aminoglycosides are nearly exclusively eliminated by the renal route. A single IV dose is generally effective for uncomplicated cystitis, with minimal toxicity, but robust clinical trial data are lacking28. Oral fosfomycin is an alternative treatment option exclusively for uncomplicated ESBL -E cystitis caused by E. coli . Susceptibility of E. coli to fosfomycin is not routinely tested by most clinical microbiology laboratories but E. coli resistance to fosfomycin remains rare in the’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 10}
##########
page_content=‘Suggested approach: Cephamycins are not suggested for the treatment of ESBL -E infections until more clinical outcomes data using cefoxitin or cefotetan are available and optimal dosing has been defined.
Rationale
The cephamycins are cephalosporins that are generally able to withstand hydrolysis from ESBL enzymes114,115. The cephamycins available in the United States are cefoxitin and cefotetan which are both IV agents. At least ten observational studies have compared the clinical outcomes of patients with’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 20}

This is also quite interesting. It correctly stated that fosfomycin cannot be used for ESBL Klebsiella, but for some reason it started off by saying one can use it for ESBL E. coli, even though the question did not ask that. Generally, the statements provided here seem accurate to me at a glance.

Note to self: I think k=3 might be the magic number, as the last context does not answer the question either. Another problem is that the word klebsiella appears only twice in the article, and neither occurrence relates to treatment; the article does use K. pneumoniae. So maybe if I switch the wording, the result might be different.

Yes!!!

Can we use fosfomycin in ESBL Ecoli?

Based on the retrieved context, fosfomycin can be considered an alternative treatment option for uncomplicated ESBL-E cystitis caused by E. coli, and it may also be a reasonable option for ESBL-E pyelonephritis or complex urinary tract infections (cUTI) when other preferred or alternative oral options are not available. However, the use of fosfomycin for prostatitis caused by ESBL-producing E. coli is supported when preferred options such as carbapenems, trimethoprim-sulfamethoxazole (TMP-SMX), or fluoroquinolones cannot be tolerated or do not test susceptible.

For other gram-negative organisms causing prostatitis, fosfomycin is not suggested due to the likely presence of the fosA gene in these organisms, which can render fosfomycin inactive. The rationale behind this recommendation is that fosfomycin’s efficacy is primarily against E. coli and its activity against other gram-negative pathogens is limited due to resistance mechanisms like the fosA gene.

Critical Appraisal:
The recommendation for fosfomycin in ESBL-E infections is based on observational studies and limited clinical trial data. While these studies suggest that fosfomycin can be effective against E. coli causing cystitis, pyelonephritis, or prostatitis, the evidence is not as robust as for other antibiotics. Additionally, the recommendation against using fosfomycin for gram-negative organisms other than E. coli is based on the presence of resistance genes like fosA, which may not be universally present in all strains. Therefore, while fosfomycin can be a valuable option for E. coli infections, its use should be cautious and guided by susceptibility testing when considering other gram-negative pathogens.

Answer to Critical Appraisal:
The critical appraisal raises an important point about the limitations of the current recommendations for fosfomycin use. While fosfomycin has shown promise in treating ESBL-E infections caused by E. coli, its use should indeed be cautious and guided by susceptibility testing when dealing with other gram-negative organisms. The presence of resistance genes like fosA does not necessarily mean that all non-E. coli strains will be resistant to fosfomycin, but it does suggest that the efficacy may vary and should be confirmed on a case-by-case basis. Therefore, while fosfomycin is an alternative for E. coli infections, it is essential to perform susceptibility testing when treating other gram-negative pathogens to ensure appropriate therapy.

page_content=‘of the additional days of oral fosfomycin or ciprofloxacin is unclear. Another clinical trial randomized 51 patients with cUTI to 3 g of fosfomycin daily or 750 mg of levofloxacin daily for 5-7 days, after up to two days of IV therapy33. Clinical cure at the end of therapy was similar in both treatment groups (69% versus 68%). In this study, 63% of infections were caused by E. coli but only one isolate in each arm was caused by an ESBL -producing isolate.
IV fosfomycin is not clinically available in the United States. Although some data suggest IV fosfomycin may have activity against organisms beyond E. coli , it is difficult to translate data from IV fosfomycin to oral fosfomycin given the limited oral bioavailability and lower daily dosages with oral fosfomycin52. Transitioning to daily oral fosfomycin needs further investigation before suggesting for or’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 13}
##########
page_content=‘Last updated December 31, 2023, and posted online at https://www.idsociety.org/practice -guideline/amr - guidance/ . Please check website for most updated version of this guidance.
11
dissections, seizures, peripheral neuropathy, and Clostridioides difficile infections24-27. Treatment with a single intravenous (IV) dose of an aminoglycoside is an alternative treatment option for uncomplicated ESBL -E cystitis. Aminoglycosides are nearly exclusively eliminated by the renal route. A single IV dose is generally effective for uncomplicated cystitis, with minimal toxicity, but robust clinical trial data are lacking28. Oral fosfomycin is an alternative treatment option exclusively for uncomplicated ESBL -E cystitis caused by E. coli . Susceptibility of E. coli to fosfomycin is not routinely tested by most clinical microbiology laboratories but E. coli resistance to fosfomycin remains rare in the’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 10}
##########
page_content=‘fosfomycin52. Transitioning to daily oral fosfomycin needs further investigation before suggesting for or against this practice for the treatment of ESBL -E pyelonephritis or cUTI; however, it may be a reasonable option when other preferred or alternative oral options are not available.
Fosfomycin is an alternative option for the treatment of prostatitis caused by ESBL -producing E. coli when preferred options (i.e., carbapenems, TMP -SMX, or fluoroquinolones) cannot be tolerated or do not test susceptible53-59. In an observational study, fosfomycin, dosed at 3 g orally daily for one week, followed by 3 g orally every 48 hours for 6 to 12 weeks, was associated with clinical cure in 36 (82%) of 44 males with chronic bacterial prostatitis53. Fosfomycin is not suggested for prostatitis caused by gram - negative organisms other than E. coli due to the likely presence of the fosA gene and its ability to inactive this agent ( Question 1.1 ).’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 13}
##########
page_content=‘Suggested approach: Cephamycins are not suggested for the treatment of ESBL -E infections until more clinical outcomes data using cefoxitin or cefotetan are available and optimal dosing has been defined.
Rationale
The cephamycins are cephalosporins that are generally able to withstand hydrolysis from ESBL enzymes114,115. The cephamycins available in the United States are cefoxitin and cefotetan which are both IV agents. At least ten observational studies have compared the clinical outcomes of patients with’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 20}

lol it looks like it’s the same context as ESBL Klebsiella. I guess swapping E. coli for Klebsiella did not change the squared L2 distance by a whole lot?

Note to self: I may need to monitor the terminology / abbreviations used in the article in order for the retriever to retrieve the right context.
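A rough way to see why swapping one organism name barely moves the query: the sketch below compares the questions using squared L2 distance on simple word-count vectors. This is only an illustration. The actual embedding model behind the retriever is not a bag of words, so the numbers are not comparable to the real distances; the point is just that two queries sharing six of seven words sit close together.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, drop question marks, split on whitespace."""
    return text.lower().replace("?", "").split()


def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


q_kleb = "Can we use fosfomycin in ESBL Klebsiella?"
q_ecoli = "Can we use fosfomycin in ESBL Ecoli?"
q_steno = "What is the preferred treatment of stenotrophomonas?"

# Shared vocabulary across the three questions.
vocab = sorted(set(tokenize(q_kleb) + tokenize(q_ecoli) + tokenize(q_steno)))
v_kleb, v_ecoli, v_steno = (bag_of_words(q, vocab) for q in (q_kleb, q_ecoli, q_steno))

print(squared_l2(v_kleb, v_ecoli))  # 2: only one word was swapped
print(squared_l2(v_kleb, v_steno))  # 14: a completely different question
```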

What is the preferred treatment of stenotrophomonas?

Based on the provided context, the preferred treatment for Stenotrophomonas maltophilia infections can vary depending on the type of infection, patient condition, and local resistance patterns. However, several treatments have been reported to be effective:

1. Trimethoprim/sulfamethoxazole is a commonly used treatment for S. maltophilia infections as indicated by multiple studies (753, 754, 720).
2. Minocycline has also been clinically effective for treating S. maltophilia infections (753).
3. Tigecycline is mentioned as a therapeutic option for S. maltophilia infections and has been studied in clinical settings (754, 766).
4. A retrospective cohort study suggests that tigecycline can be effective for treating ventilator-associated pneumonia due to S. maltophilia (755).
5. Comparative studies have looked at trimethoprim/sulfamethoxazole versus minocycline for the treatment of nonurinary monomicrobial infections caused by S. maltophilia, with no significant difference in outcomes (756).
6. Levofloxacin has been considered as a useful alternative to trimethoprim/sulfamethoxazole for treating S. maltophilia bacteremia (766, 767).

It is important to note that the choice of treatment should be guided by the patient’s clinical status, the type and severity of infection, local antim

page_content=‘trimethoprim/sulfamethoxazole for treatment of Stenotrophomonas maltophilia infections. J Antimicrob Chemother . Apr 2016;71(4):1071 -5. doi:10.1093/jac/dkv456
753. Jacobson S, Junco Noa L, Wallace MR, Bowman MC. Clinical outcomes using minocycline for Stenotrophomonas maltophilia infections. J Antimicrob Chemother . Dec 2016;71(12):3620. doi:10.1093/jac/dkw327
754. Tekce YT, Erbay A, Cabadak H, Sen S. Tigecycline as a therapeutic option in Stenotrophomonas maltophilia infections. J Chemother . Jun 2012;24(3):150 -4. doi:10.1179/1120009X12Z.00000000022’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 163}
##########
page_content=‘Last updated December 31, 2023, and posted online at https://www.idsociety.org/practice -guideline/amr - guidance/ . Please check website for most updated version of this guidance.
165
755. Zha L, Zhang D, Pan L, et al. Tigecycline in the Treatment of Ventilator -Associated Pneumonia Due to Stenotrophomonas maltophilia: A Multicenter Retrospective Cohort Study. Infect Dis Ther . Dec 2021;10(4):2415 -2429. doi:10.1007/s40121 -021-00516 -5 756. Hevia EC, Wooten L, Carr AL. Trimethoprim/Sulfamethoxazole vs Minocycline for the Treatment of Nonurinary Monomicrobial Stenotrophomonas maltophilia Infections in Hospitalized Patients. Ann Pharmacother . Sep 30 2023:10600280231201850. doi:10.1177/10600280231201850
757. Chang YT, Lin CY, Chen YH, Hsueh PR. Update on infections caused by Stenotrophomonas maltophilia with particular attention to resistance mechanisms and therapeutic options. Front Microbiol . 2015;6:893. doi:10.3389/fmicb.2015.00893’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 164}
##########
page_content=‘Patients with Monomicrobial Stenotrophomonas maltophilia Infections. Antimicrob Agents Chemother . Nov 2019;63(11):e00788 -19. doi:10.1128/AAC.00788 -19 766. Cho SY, Kang CI, Kim J, et al. Can levofloxacin be a useful alternative to trimethoprim - sulfamethoxazole for treating Stenotrophomonas maltophilia bacteremia? Antimicrob Agents Chemother . 2014;58(1):581 -3. doi:10.1128/AAC.01682 -13 767. Watson L, Esterly J, Jensen AO, Postelnick M, Aguirre A, McLaughlin M. Sulfamethoxazole/trimethoprim versus fluoroquinolones for the treatment of Stenotrophomonas maltophilia bloodstream infections. J Glob Antimicrob Resist . Mar 2018;12:104 -106. doi:10.1016/j.jgar.2017.09.015’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 164}
##########
page_content=‘PLoS One . 2016;11(3):e0152132. doi:10.1371/journal.pone.0152132
720. Shah MD, Coe KE, El Boghdadly Z, et al. Efficacy of combination therapy versus monotherapy in the treatment of Stenotrophomonas maltophilia pneumonia. J Antimicrob Chemother . Jul 1 2019;74(7):2055 -2059. doi:10.1093/jac/dkz116
721. Araoka H, Baba M, Okada C, Abe M, Kimura M, Yoneyama A. Evaluation of trimethoprim -sulfamethoxazole based combination therapy against Stenotrophomonas maltophilia: in vitro effects and clinical efficacy in cancer patients. Int J Infect Dis . May 2017;58:18 -21. doi:10.1016/j.ijid.2017.02.020
722. Muder RR, Harris AP, Muller S, et al. Bacteremia due to Stenotrophomonas (Xanthomonas) maltophilia: a prospective, multicenter study of 91 episodes. Clin Infect Dis. Mar 1996;22(3):508 -12. doi:10.1093/clinids/22.3.508
723. Chen L, Hua J, Hong S, et al. Assessment of the relative benefits of monotherapy and’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 161}

This is also an interesting one. While the answer is not wrong, it does not resemble the executive summary of the guideline. See snapshot below.

Let’s look at the context: they are ALL references! I think the problem here is that most of the important context uses S. maltophilia rather than stenotrophomonas itself.

Note to self: remove references!!! Maybe also build a short-to-long-form mapping for organisms? Let’s try S. maltophilia and see if we can get a different result.

Wow!!! Looks like you just gotta use the right words! What if I include both?

Actually, when I tried it with references removed, it worked just fine with stenotrophomonas.
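One possible way to drop the reference pages before indexing is a simple heuristic on the chunk text. The sketch below is an assumption about how this could be done, not what I actually ran: it flags any chunk containing two or more DOIs as bibliography text. The function names and the `min_dois` threshold are made up for illustration, and the sample chunks are shortened from the retrieved context above.

```python
import re

# Matches strings such as "doi:10.1093/" as they appear in the reference list.
DOI_PATTERN = re.compile(r"doi:\s*10\.\d{4,9}/", re.IGNORECASE)


def looks_like_references(chunk_text, min_dois=2):
    """Heuristic: a chunk citing several DOIs is probably a bibliography page."""
    return len(DOI_PATTERN.findall(chunk_text)) >= min_dois


def drop_reference_chunks(chunks, min_dois=2):
    """Keep only chunks that do not look like reference-list text."""
    return [c for c in chunks if not looks_like_references(c, min_dois)]


chunks = [
    "753. Jacobson S, Junco Noa L, Wallace MR, Bowman MC. Clinical outcomes using "
    "minocycline for Stenotrophomonas maltophilia infections. J Antimicrob Chemother. "
    "Dec 2016;71(12):3620. doi:10.1093/jac/dkw327 754. Tekce YT, Erbay A, Cabadak H, "
    "Sen S. Tigecycline as a therapeutic option in Stenotrophomonas maltophilia "
    "infections. J Chemother. Jun 2012;24(3):150-4. doi:10.1179/1120009X12Z.00000000022",
    "Cefiderocol is an alternative treatment option for infections outside of the "
    "urinary tract caused by DTR P. aeruginosa.",
]

kept = drop_reference_chunks(chunks)
print(len(kept))  # 1: only the guidance text survives
```

A threshold of two leaves room for body text that happens to mention a single DOI; a page-range cutoff on the PDF would be another option if the reference section starts at a known page.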

Note to self: include both the full name and the abbreviation to increase accuracy! Or remove references? Or both.
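Including both forms could be automated with a small lookup applied to the question before it reaches the retriever. This is a minimal sketch with a hypothetical alias table built from the names that came up in this post; the pairs are illustrative (a genus name and a single species are not strictly interchangeable), and `expand_query` is a name introduced here, not part of any library.

```python
# Hypothetical short-form / guideline-form pairs; extend as needed.
ORGANISM_ALIASES = {
    "stenotrophomonas": "S. maltophilia",
    "klebsiella": "K. pneumoniae",
    "ecoli": "E. coli",
    "pseudomonas": "P. aeruginosa",
}


def expand_query(query):
    """Append the guideline-style name for each organism mentioned in the query."""
    lowered = query.lower()
    extras = [
        alias
        for name, alias in ORGANISM_ALIASES.items()
        if name in lowered and alias.lower() not in lowered
    ]
    if not extras:
        return query
    return f"{query} ({', '.join(extras)})"


print(expand_query("What is the preferred treatment of stenotrophomonas?"))
# What is the preferred treatment of stenotrophomonas? (S. maltophilia)
```

The expanded string would then be used as the retrieval query, so that chunks written with either form have a chance of matching.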

What is the preferred treatment of DTR Pseudomonas?

The preferred treatment for infections outside of the urinary tract caused by DTR (Difficult-to-Treat) Pseudomonas aeruginosa includes a range of antibiotics such as ceftolozane -tazobactam, ceftazidime -avibactam, and imipenem -cilastatin - relebactam. Cefiderocol is also an alternative treatment option. These choices are based on in vitro activity data, observational studies, and clinical trial data that support their efficacy against DTR P. aeruginosa. It is important to note that combination antibiotic therapy should not be used once the β-lactam agent has demonstrated in vitro activity against the infection, as there is no additional benefit over monotherapy and the continued use of a second agent increases the likelihood of antibiotic resistance.

Critical Appraisal:
The answer provided is based on a comprehensive review of available literature, including in vitro activity data, observational studies, and clinical trial data. However, it is important to consider that the treatment landscape for DTR P. aeruginosa may evolve with emerging resistance patterns and new antibiotics entering the market. Additionally, the definition of DTR P. aeruginosa used in this guidance (non-susceptibility to a broad range of antibiotics) may vary by region or institution, potentially affecting treatment recommendations. It is also crucial to consider individual patient factors such as renal function, allergies, and local resistance patterns when selecting an antibiotic regimen.

Answer to Critical Appraisal:
The critical appraisal raises valid points regarding the dynamic nature of antibiotic resistance and the importance of considering local resistance patterns and individual patient factors. While the provided answer is based on the most current data available at the time of the guidance, healthcare providers must stay informed about the latest developments in antimicrobial resistance and adjust their treatment strategies accordingly. They should also perform individual patient assessments to tailor therapy to each patient’s unique situation. The guidance emphasizes the importance of not using combination antibiotic therapy once a β-lactam agent with proven in vitro activity is identified, as this approach has not shown additional benefits and may contribute to resistance development. It is also essential to monitor patients closely for treatment response and any signs of worsening infection or resistance.

page_content=‘Last updated December 31, 2023, and posted online at https://www.idsociety.org/practice -guideline/amr - guidance/ . Please check website for most updated version of this guidance.
62
Question 4.5: What are preferred antibiotics for the treatment of infections outside of the urinary tract caused by DTR P. aeruginosa ? Suggested approach: Ceftolozane -tazobactam, ceftazidime -avibactam, and imipenem -cilastatin - relebactam are preferred options for the treatment of infections outside of the urinary tract caused by DTR P. aeruginosa . Cefiderocol is an alternative treatment option for infections outside of the urinary tract caused by DTR P. aeruginosa . Rationale
Ceftolozane -tazobactam, ceftazidime -avibactam, and imipenem -cilastatin -relebactam are preferred options for the treatment of DTR P. aeruginosa infections outside of the urinary tract, based’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 61}
##########
page_content=‘AST data are available.
Question 4.8: What is the role of combination antibiotic therapy for the treatment of infections caused by DTR P. aeruginosa ? Suggested approach: Combination antibiotic therapy is not suggested for infections caused by DTR P. aeruginosa if susceptibility to ceftolozane -tazobactam, ceftazidime -avibactam, imipenem -cilastatin - relebactam, or cefiderocol has been confirmed.
Rationale
Although empiric combination antibiotic therapy (e.g., the addition of tobramycin to a β -lactam agent) to broaden the likelihood of at least one active agent for patients at risk for DTR P. aeruginosa infections is reasonable, data do not indicate that continued combination therapy —once the β -lactam agent has demonstrated in vitro activity —offers any additional benefit over monotherapy with the β -lactam antibiotic379. Rather, the continued use of a second agent increases the likelihood of antibiotic -’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 66}
##########
page_content=‘preferred options for the treatment of DTR P. aeruginosa infections outside of the urinary tract, based on in vitro activity138,139,141,264,266,329,420 -458, observational studies459-464, and clinical trial data131,135,145,240,465 - 469. The vast majority of patients in clinical trials receiving newer β-lactam agents were not infected with DTR P. aeruginosa . Clinical trials comparing novel agents to each other (e.g., ceftolozane -tazobactam versus ceftazidime -avibactam) are lacking. Rather, available studies focus on comparing newer β-lactam agents to older agents (e.g., ceftolozane -tazobactam versus polymyxins), and generally focus on MDR P. aeruginosa and not DTR P. aeruginosa . The suggested approach for the treatment of infections outside of the urinary tract caused by DTR P. aeruginosa isolates confirmed to produce MBL enzymes (e.g., bla VIM) is reviewed in Question 4.6 .’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 61}
##########
page_content=‘cephalosporins, fluoroquinolones, aminoglycosides, and carbapenems381. In 2018, the concept of “difficult -to-treat” resistance was proposed382. In this guidance document, DTR is defined as P. aeruginosa exhibiting non-susceptibility to all of the following: piperacillin -tazobactam, ceftazidime, cefepime, aztreonam, meropenem, imipenem -cilastatin, ciprofloxacin, and levofloxacin.
MDR P. aeruginosa or DTR P. aeruginosa generally evolve as a result of an interplay of multiple resistance mechanisms, including decreased expression of outer membrane porins (e.g., OprD), increased production of or amino acid substitutions within Pseudomonas -derived cephalosporinase’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 53}

Not bad at all. Both statements are accurate: the antibiotics are correct, and so is the statement that two active agents are not needed.

Which organisms require two active agent when susceptibility is known?

According to the Infectious Diseases Society of America (IDSA) guidance, the benefit of using two agents for treatment may outweigh the risks in cases of Carbapenem-Resistant Acinetobacter (CRAB) infections, especially considering the high mortality associated with these infections. When susceptibility testing indicates that an organism is resistant to a single agent like ampicillin-sulbactam, it may be appropriate to use combination therapy with two active agents to improve outcomes. This is because CRAB infections are often resistant to multiple antibiotics, and using two agents increases the likelihood of covering the pathogen and improving clinical outcomes.

Critical Appraisal:
The guidance provided by IDSA suggests that combination therapy with two active agents may be necessary for CRAB infections when susceptibility testing confirms resistance to a single agent like ampicillin-sulbactam. However, this recommendation is based on the complexity of CRAB infections and their high mortality rate. It assumes that all CRAB infections are resistant to single-agent therapy and that combination therapy will always lead to better clinical outcomes. This may not always be the case, as there could be instances where a single agent with broader spectrum activity is sufficient and where combination therapy might increase the risk of adverse drug reactions without significant benefit. Additionally, the choice of second agents should be tailored based on individual patient factors, local resistance patterns, and the specific characteristics of the CRAB infection.

Answer to Critical Appraisal:
The critical appraisal raises a valid point that not all CRAB infections may require combination therapy, and there is a potential for increased risk of adverse drug reactions with the use of multiple antibiotics. The IDSA guidance does indeed emphasize that the decision to use combination therapy should be individualized based on patient-specific factors, infection characteristics, and local resistance patterns. The goal is to balance the potential benefits of combination therapy against the risks of adverse effects and the complexity of managing such treatments. It is also important to consider the latest clinical evidence and guidelines when making treatment decisions for CRAB infections, as resistance patterns and treatment options can change over time.

page_content=‘Last updated December 31, 2023, and posted online at https://www.idsociety.org/practice -guideline/amr - guidance/ . Please check website for most updated version of this guidance.
80
considering the high mortality associated with CRAB infections, the benefit of using two agents may outweigh the risks. Potential options for consideration as components of combination therapy in addition to high -dose ampicillin -sulbactam include: tetracycline derivatives (with the most experience available for minocycline), polymyxin B, or cefiderocol ( Questions 5.3 to 5.6 ).The decision to
preferentially select one agent over another should be based on patient and infection specific factors (e.g., polymyxin B may be less appealing for patients with chronic kidney diseases [ Question 5.5 ], minocycline may be less appealing for bloodstream infections [Question 5.6] ).’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 79}
##########
page_content=‘Last updated December 31, 2023, and posted online at https://www.idsociety.org/practice -guideline/amr - guidance/ . Please check website for most updated version of this guidance.
86
sulbactam, carbapenem, colistin), these studies did not compare outcomes of 3 -drug combinations versus 2-drug combinations of ampicillin -sulbactam and colistin637,677. As described in Question 5.4, two randomized trials evaluated the role of colistin monotherapy versus colistin plus meropenem and neither trial demonstrated a benefit with the combination of colistin plus meropenem for the treatment of CRAB infections569,570. A secondary analysis of one of the trials found that improved clinical outcomes were not observed with the combination of colistin and meropenem even when in vitro synergy was present678. Imipenem -cilastatin may retain activity against some meropenem -resistant isolates679-681;’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 85}
##########
page_content=‘636. Nation RL, Velkov T, Li J. Colistin and polymyxin B: peas in a pod, or chalk and cheese? Clin Infect Dis . Jul 1 2014;59(1):88 -94. doi:10.1093/cid/ciu213
637. Qureshi ZA, Hittle LE, O’Hara JA, et al. Colistin -resistant Acinetobacter baumannii: beyond carbapenem resistance. Clin Infect Dis . May 1 2015;60(9):1295 -303. doi:10.1093/cid/civ048
638. Bakthavatchalam YD, Veeraraghavan B. Challenges, Issues and Warnings from CLSI and EUCAST Working Group on Polymyxin Susceptibility Testing. J Clin Diagn Res . Aug 2017;11(8):DL03 -DL04. doi:10.7860/JCDR/2017/27182.10375’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 154}
##########
page_content=‘inform antibiotic decision -making.
Rationale
β-lactam antibiotics fall within a spectrum of potential for inducing ampC genes.
Aminopenicillins (i.e., amoxicillin, ampicillin), narrow -spectrum (i.e., first generation) cephalosporins, and cephamycins are potent ampC inducers 170,171. However, both organisms at low risk (e.g., S. marcescens ) and at moderate risk (e.g., E. cloacae ) for clinically significant ampC induction hydrolyze these antibiotics even at basal ampC expression levels. Therefore, such AmpC -E isolates will generally test as resistant to these drugs or are not recommended to be tested due to intrinsic resistance, averting treatment dilemmas.
Imipenem is also a potent ampC inducer but it generally remains stable to AmpC -E hydrolysis because of the formation of stable acyl enzyme complexes 170. The induction potential of ertapenem and’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 27}

An accurate statement: it looks like we do need two active agents for CRAB. But it missed Stenotrophomonas. Let’s try to rephrase.

No luck!

Can we use gentamicin in pseudomonas infection?

The use of gentamicin for Pseudomonas aeruginosa infections is a subject that requires careful consideration based on the patient’s clinical status, the severity of the infection, and the susceptibility patterns of the isolate. According to the Infectious Diseases Society of America (IDSA) 2024 Guidance on the Treatment of Antimicrobial-Resistant Gram-Negative Infections, when susceptibility testing indicates that Pseudomonas aeruginosa is susceptible to a preferred β-lactam agent, combination therapy including aminoglycosides like gentamicin is not recommended for the treatment of CRE infections. This is due to the potential for increased toxicity without clear evidence of improved clinical outcomes.

However, for MDR Pseudomonas aeruginosa where there are limited treatment options and the isolate is susceptible to gentamicin, it may be considered as part of combination therapy. The choice of antibiotics should always align with the latest guidelines, local resistance patterns, and individual patient factors. It’s also important to monitor renal function closely when using aminoglycosides due to their potential for nephrotoxicity.

Critical Appraisal:
The recommendation against combination therapy including aminoglycosides for CRE infections is based on clinical outcomes and known toxicities. However, this guidance may not fully account for situations where monotherapy with a preferred β-lactam agent might be insufficient due to high MIC values or the presence of additional resistance mechanisms. In such cases, the use of gentamicin could be justified if it improves the likelihood of successful treatment and there is close monitoring for toxicity.

Answer to Critical Appraisal:
You raise a valid point that

page_content=‘for the treatment of KPC -producing infections258. Thirty -day mortality was essentially identical at approximately 25% in both study arms.
Based on available outcomes data, clinical experience, and known toxicities associated with aminoglycosides, fluoroquinolones, tetracyclines, and polymyxins, the panel does not suggest combination therapy for CRE infections when susceptibility to a preferred β-lactam agent has been demonstrated.
Section 4: Pseudomonas aeruginosa with Difficult -to-Treat Resistance
MDR P. aeruginosa is defined as P. aeruginosa not susceptible to at least one antibiotic in at least three antibiotic classes for which P. aeruginosa susceptibility is generally expected: penicillins, cephalosporins, fluoroquinolones, aminoglycosides, and carbapenems381. In 2018, the concept of “difficult -to-treat” resistance was proposed382. In this guidance document, DTR is defined as P.’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 53}
##########
page_content=‘in patients with Pseudomonas aeruginosa infections. Antimicrob Agents Chemother . Jul 2013;57(7):2907 -12. doi:10.1128/AAC.02365 -12 403. Babich T, Naucler P, Valik JK, et al. Ceftazidime, Carbapenems, or Piperacillin - tazobactam as Single Definitive Therapy for Pseudomonas aeruginosa Bloodstream Infection: A Multisite Retrospective Study. Clin Infect Dis . May 23 2020 ;70(11):2270 -2280. doi:10.1093/cid/ciz668
404. Sader HS, Mendes RE, Arends SJR, Carvalhaes CG, Shortridge D, Castanheira M. Comparative activity of newer beta -lactam/beta -lactamase inhibitor combinations against Pseudomonas aeruginosa isolates from US medical centres (2020 -2021). Int J Antimicrob Agents . Apr 2023;61(4):106744. doi:10.1016/j.ijantimicag.2023.106744
405. Sader HS, Duncan LR, Doyle TB, Castanheira M. Antimicrobial activity of ceftazidime/avibactam, ceftolozane/tazobactam and comparator agents against Pseudomonas’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 135}
##########
page_content=‘Extensively Drug -Resistant Pseudomonas aeruginosa: CANWARD, 2007 to 2019. Microbiol Spectr . Aug 31 2022;10(4):e0172422. doi:10.1128/spectrum.01724 -22 417. Wagenlehner FM, Umeh O, Steenbergen J, Yuan G, Darouiche RO. Ceftolozane - tazobactam compared with levofloxacin in the treatment of complicated urinary -tract infections, including pyelonephritis: a randomised, double -blind, phase 3 trial (ASPECT -cUTI). Lancet . May 16 2015;385(9981):1949 -56. doi:10.1016/S0140 -6736(14)62220 -0 418. Walkty A, Adam H, Baxter M, et al. In vitro activity of plazomicin against 5,015 gram - negative and gram -positive clinical isolates obtained from patients in canadian hospitals as part of the CANWARD study, 2011 -2012. Antimicrob Agents Chemother . May 2014;58(5):2554 -63. doi:10.1128/AAC.02744 -13 419. Lopez Montesinos I, Gomez -Zorrilla S, Palacios -Baena ZR, et al. Aminoglycoside or’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 136}
##########
page_content=‘Last updated December 31, 2023, and posted online at https://www.idsociety.org/practice -guideline/amr - guidance/ . Please check website for most updated version of this guidance.
1
Infectious Diseases Society of America 2024 Guidance on the Treatment of Antimicrobial -Resistant Gram -Negative Infections
Pranita D. Tamma,1 Emily L. Heil2, Julie Ann Justo3, Amy J. Mathers4, Michael J. Satlin5, & Robert A. Bonomo,6 Affiliations
1Department of Pediatrics, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA 2Department of Practice, Sciences, and Health -Outcomes Research, University of Maryland School of Pharmacy, Baltimore, Maryland, USA
3Department of Pharmacy, Dartmouth Hitchcock Medical Center, Lebanon, New Hampshire, USA 4Departments of Medicine and Pathology, University of Virginia, Charlottesville, Virginia, USA 5Department of Medicine, Weill Cornell Medicine, New York, NY, USA’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 0}

As you can see, the response got cut off, maybe because it went over max_tokens. The output was also quite confusing: the same paragraph mentions both Pseudomonas and CRE. If you look at the sources, the most similar chunk contained the correct answer, but because we had split the document using overlap, that chunk included the end of the CRE text and the beginning of the Pseudomonas text.
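
One way to spot this kind of mixed chunk before it reaches the LLM is to check whether a section heading appears partway through the retrieved text. The sketch below is only an illustration: `straddles_section`, the `min_prefix` threshold, and the heading pattern are hypothetical, and the sample strings are shortened excerpts of the chunks shown above, not output from the actual pipeline.

```python
import re

# Assumed heading pattern, based on headings such as "Section 4: ..." in the guidance PDF.
SECTION_HEADING = re.compile(r"Section\s+\d+:")


def straddles_section(chunk_text: str, min_prefix: int = 50) -> bool:
    """Return True when a section heading appears after at least
    `min_prefix` characters, i.e. the chunk starts in one section
    and continues into the next."""
    match = SECTION_HEADING.search(chunk_text)
    return match is not None and match.start() >= min_prefix


# Shortened excerpt of the chunk that mixed CRE and Pseudomonas text.
mixed_chunk = (
    "the panel does not suggest combination therapy for CRE infections when "
    "susceptibility to a preferred β-lactam agent has been demonstrated.\n"
    "Section 4: Pseudomonas aeruginosa with Difficult-to-Treat Resistance\n"
    "MDR P. aeruginosa is defined as ..."
)

# A chunk that begins at a heading stays within one section.
clean_chunk = "Section 3: Carbapenem-Resistant Enterobacterales\nCRE are defined as ..."

print(straddles_section(mixed_chunk))  # True
print(straddles_section(clean_chunk))  # False
```

A flagged chunk could be dropped, trimmed at the heading, or simply logged for review; which option works best here is untested.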

Note to self: play with the overlap. Reduce it, or set it to 0? Typical practice is an overlap of 10-20% of the chunk size in tokens.
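
To make the overlap trade-off concrete, here is a minimal sliding-window chunker in plain Python. It is not the splitter used in this post: `chunk_tokens` and its parameters are made up for illustration, and the tokens are stand-in strings rather than real tokenizer output.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split `tokens` into windows of `chunk_size`, where each window
    repeats the last `overlap` tokens of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = [f"t{i}" for i in range(10)]

no_overlap = chunk_tokens(tokens, chunk_size=5, overlap=0)
with_overlap = chunk_tokens(tokens, chunk_size=5, overlap=1)  # 20% of chunk_size

print(no_overlap)    # 2 chunks, no shared tokens
print(with_overlap)  # 3 chunks, neighbours share one token
```

Setting the overlap to 0 removes the duplicated tokens, but a chunk can still straddle a section boundary unless the split points happen to line up with the headings, so the overlap is only one of the knobs worth trying.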

Can we use tobramycin to treat pseudomonas infection?

Yes, tobramycin is an aminoglycoside antibiotic that can be used to treat infections caused by Pseudomonas aeruginosa, including those with difficult-to-treat resistance (DTR). However, its use should be based on susceptibility testing, and it is often recommended as part of combination therapy when a β-lactam agent alone may not be sufficient. The Infectious Diseases Society of America (IDSA) guidelines suggest that if no β-lactam agent demonstrates activity against DTR P. aeruginosa, tobramycin can be considered in combination with certain β-lactam agents like ceftolozane-tazobactam, ceftazidime-avibactam, imipenem-cilastatin-relebactam, or cefiderocol. The choice of which β-lactam agent to use should be based on the in vitro susceptibility testing and the MIC values for each antibiotic against the specific Pseudomonas aeruginosa isolate.

Critical Appraisal:
The rationale provided for not routinely administering combination therapy for DTR P. aeruginosa infections when a β-lactam agent has demonstrated activity is based on clinical outcomes data and concerns about increased toxicity. However, this approach may overlook potential benefits of combination therapy, such as the possibility of reducing the emergence of resistance or providing a broader spectrum of coverage against potential co-infections. Additionally, the rationale assumes that in vitro susceptibility testing is always accurate and that the MIC values are reliable predictors of clinical outcomes, which may not always be the case. There is also a possibility that some patients may have unique circumstances where combination therapy could be beneficial despite the lack of strong supporting data.

Answer to Critical Appraisal:
The critical appraisal raises important considerations. While the IDSA guidelines emphasize the importance of in vitro susceptibility testing and the potential risks of combination therapy, including increased toxicity and antibiotic resistance, it is indeed possible that there are scenarios where combination therapy could offer additional benefits beyond what current data suggest. The guidelines recommend a case-by-case approach, taking into account the individual patient’s clinical status, the severity of the infection, and the potential risks and benefits of treatment. In cases where a β-lactam agent alone may not be sufficient to ensure effective treatment due to high resistance levels or other factors, the use of combination therapy with tobramycin could be considered, especially if there is a high likelihood that tobramycin will be active against the isolate in question. The decision to use combination therapy should always involve a careful assessment of the individual patient’s needs and the potential risks and benefits of treatment.

page_content=‘Last updated December 31, 2023, and posted online at https://www.idsociety.org/practice -guideline/amr - guidance/ . Please check website for most updated version of this guidance.
68
therapy are not available (e.g., ceftazidime -avibactam versus ceftazidime -avibactam and tobramycin). Observational studies have not identified a survival advantage with combination therapy472,501,502. Based on toxicities associated with aminoglycosides and polymyxins and clinical outcomes data not demonstrating a benefit with the use of combination therapy for P. aeruginosa infections379, the panel does not suggest that combination therapy be routinely administered for DTR P. aeruginosa infections when susceptibility to a β -lactam agent has been demonstrated.
If no β -lactam agent demonstrates activity against DTR P. aeruginosa , tobramycin (if susceptibility is demonstrated) can be considered in combination with either ceftolozane -tazobactam,’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 67}
##########
page_content=‘Last updated December 31, 2023, and posted online at https://www.idsociety.org/practice -guideline/amr - guidance/ . Please check website for most updated version of this guidance.
2
Abstract
Background: The Infectious Diseases Society of America (IDSA) is committed to providing up-to-date guidance on the treatment of antimicrobial -resistant (AMR) infections. This guidance document focuses on infections caused by extended -spectrum β-lactamase -producing Enterobacterales (ESBL -E), AmpC β- lactamase -producing Enterobacterales (AmpC -E), carbapenem -resistant Enterobacterales (CRE), Pseudomonas aeruginosa with difficult -to-treat resistance (DTR P. aeruginosa ), carbapenem -resistant Acinetobacter baumannii (CRAB), and Stenotrophomonas maltophilia . This updated document replaces previous versions of the guidance document.
Methods: A panel of six infectious diseases specialists with expertise in managing antimicrobial -’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 1}
##########
page_content=‘AST data are available.
Question 4.8: What is the role of combination antibiotic therapy for the treatment of infections caused by DTR P. aeruginosa ? Suggested approach: Combination antibiotic therapy is not suggested for infections caused by DTR P. aeruginosa if susceptibility to ceftolozane -tazobactam, ceftazidime -avibactam, imipenem -cilastatin - relebactam, or cefiderocol has been confirmed.
Rationale
Although empiric combination antibiotic therapy (e.g., the addition of tobramycin to a β -lactam agent) to broaden the likelihood of at least one active agent for patients at risk for DTR P. aeruginosa infections is reasonable, data do not indicate that continued combination therapy —once the β -lactam agent has demonstrated in vitro activity —offers any additional benefit over monotherapy with the β -lactam antibiotic379. Rather, the continued use of a second agent increases the likelihood of antibiotic -’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 66}
##########
page_content=‘If no β -lactam agent demonstrates activity against DTR P. aeruginosa , tobramycin (if susceptibility is demonstrated) can be considered in combination with either ceftolozane -tazobactam, ceftazidime -avibactam, imipenem -cilastatin -relebactam, or cefiderocol, preferentially selecting the β - lactam agent for which the MIC is closest to its susceptibility breakpoint. For example, if ceftolozane - tazobactam and ceftazidime -avibactam MICs against a DTR P. aeruginosa isolate are both >128/4 µg/mL (highly resistant) and the imipenem -cilastatin -relebactam MIC is 4/4 µg/mL (intermediate category), imipenem -cilastatin -relebactam in combination with tobramycin is favored. Data are lacking demonstrating a benefit to this approach and it should be considered as a last resort. This approach is suggested as it may increase the likelihood that at least one active agent is being included in the treatment regimen.’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 67}

Not too shabby, though the "yes" was a bit optimistic in my view. According to the guideline, tobramycin is a last resort for when the isolate is not susceptible to any β-lactam agent. The response did say that later on, though.

Why is there carbapenemase non-producing organism?

Carbapenemase non-producing organisms, particularly among Enterobacterales, can exhibit resistance to carbapenems through mechanisms other than the production of carbapenemases. These mechanisms include:

1. Amplification of non-carbapenemase β-lactamase genes: This can lead to increased production of enzymes that hydrolyze β-lactam antibiotics, such as extended-spectrum beta-lactamases (ESBLs).
2. Outer membrane porin disruption: The outer membrane of Gram-negative bacteria acts as a barrier to many antibiotics, including carbapenems. Disruption of porins, which are proteins that facilitate the passage of molecules through the outer membrane, can reduce the penetration of these antibiotics into the bacterial cell.
3. Upregulation of efflux pumps: These are active transport systems that expel antibiotics from the bacterial cell, thereby reducing their intracellular concentration and effectiveness.
4. Mutations in penicillin-binding proteins (PBPs): PBPs are essential for bacterial cell wall synthesis. Mutations can lead to altered PBPs that have a reduced affinity for carbapenems, thus reducing the antibiotic’s ability to inhibit cell wall synthesis.
5. Presence of expanded-spectrum β-lactamases (e.g., bla OXA-10): These enzymes can hydrolyze a broad range of β-lactams, including carbapenems.

The prevalence of non-carbapenemase producing CRE is variable and can be influenced by geographic location, healthcare settings, and the prevalent resistance mechanisms in the local microbial ecosystem.

Critical Appraisal:
While it is true that non-carbapenemase producing organisms can exhibit carbapenem resistance through various mechanisms, the initial explanation might imply that carbapenemase production is a less common or secondary pathway to resistance. However, carbapenemases are a significant and growing cause of carbapenem resistance, especially in certain regions and healthcare settings. The emphasis on non-carbapenemase producing mechanisms should not overshadow the importance of carbapenemase-producing organisms, which can be highly resistant and pose a significant public health threat.

Answer to Critical Appraisal:
You are correct that my previous explanation might have understated the significance of carbapenemase-producing organisms. Carbapenemases are indeed a major driver of carbapenem resistance and are responsible for a

page_content=‘resistance, broadly divided into those that are not carbapenemase -producing and those that are carbapenemase -producing. CRE that are not carbapenemase -producing may be the result of amplification of non -carbapenemase β -lactamase genes (e.g., ESBL genes) with concurrent outer membrane porin disruption215. Carbapenemase -producing isolates account for 35% -83% of CRE cases in the United States, with higher percentages observed when restricting the definition of CRE to require resistance to meropenem or imipenem216-218. The most common carbapenemases in the United States are K. pneumoniae carbapenemases (KPCs), which are not limited to K. pneumoniae isolates. Other carbapenemases include New Delhi metallo -β-lactamases (NDMs), Verona integron -encoded metallo -β-lactamases (VIMs), imipenem - hydrolyzing metallo -β-lactamases (IMPs), and oxacillinases (e.g., OXA -48-like)218-220. NDM, VIM, and IMP’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 37}
##########
page_content=‘Last updated December 31, 2023, and posted online at https://www.idsociety.org/practice -guideline/amr - guidance/ . Please check website for most updated version of this guidance.
55
(PDC) enzymes (commonly referred to as pseudomonal AmpC enzymes), upregulation of efflux pumps (e.g., MexAB -OprM), mutations in PBP targets, and the presence of expanded -spectrum β -lactamases (e.g., bla OXA -10)383,384. Carbapenemase production is a relatively uncommon cause of carbapenem resistance in P. aeruginosa isolates in the United States385,386, but is identified in significant portions of carbapenem -resistant P. aeruginosa in other regions of the world (e.g., 69% in Latin America, 57% Asia), commonly due to the presence of bla KPC or bla VIM enzymes385,387 -392. These estimates suggest the prevalence of carbapenemase -producing P. aeruginosa will increase in the United States in coming’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 54}
##########
page_content=‘Last updated December 31, 2023, and posted online at https://www.idsociety.org/practice -guideline/amr - guidance/ . Please check website for most updated version of this guidance.
39
to 2021 the percentages of bla KPC decreased from 74% to 57%, whereas the percentages of isolates with MBL genes (e.g., bla NDM, bla VIM, bla IMP) increased from 4% to 20% and those with bla OXA -48-like increased from 1% to 8%218. Knowledge of the carbapenemase produced when CRE is identified in clinical isolates is important in guiding treatment decisions as specific newer β -lactam antibiotics have activity against specific carbapenemases. Phenotypic tests such as the modified carbapenem inactivation method differentiate carbapenemase and non -carbapenemase -producing CRE but generally do not provide information on the specific carbapenemase present16,222. This information is increasingly important given’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 38}
##########
page_content=‘Last updated December 31, 2023, and posted online at https://www.idsociety.org/practice -guideline/amr - guidance/ . Please check website for most updated version of this guidance.
38
Section 3: Carbapenem -Resistant Enterobacterales
CRE are defined as members of the Enterobacterales order resistant to at least one carbapenem antibiotic (i.e., ertapenem, meropenem, imipenem, doripenem) or producing a carbapenemase enzyme
214. Resistance to at least one carbapenem other than imipenem is required for bacteria intrinsically less susceptible to imipenem (e.g., Proteus spp., Morganella spp., Providencia spp.)214. CRE comprise a heterogenous group of pathogens encompassing multiple mechanisms of resistance, broadly divided into those that are not carbapenemase -producing and those that are carbapenemase -producing. CRE that are not carbapenemase -producing may be the result of’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 37}

Not too shabby. I intended to ask about CRE, but since the question did not say so, the LLM answered it accurately by including CR-PsA as well. Wow, this is not a bad way of refining our questions to be more concise and precise.

Let’s see whether it gives a different answer if I specify only CRE.

Can we use oral antibiotics for any of these MDRO?

The question pertains to the use of oral antibiotics for multidrug-resistant organisms (MDRO), particularly for infections caused by difficult-to-treat pathogens like DTR Pseudomonas aeruginosa, Carbapenem-Resistant Enterobacterales (CRE), and other MDROs.

The use of oral antibiotics for MDRO infections depends on several factors:

1. Susceptibility Testing: The organism must be susceptible to an oral agent. This is determined through laboratory testing, such as antimicrobial susceptibility testing (AST).

2. Clinical Stability: The patient must be clinically stable, meaning they are not in septic shock or other life-threatening conditions that require intravenous (IV) administration for immediate effect.

3. Availability of AST Data: There should be available data from AST to guide the choice of an oral agent.

4. Drug Efficacy and Pharmacokinetics: The oral medication must have adequate bioavailability, distribution, metabolism, and excretion to effectively treat the infection at the site of infection.

5. Patient Compliance and Convenience: Oral therapy is generally more convenient for patients, which can improve compliance with treatment regimens.

6. Infection Site and Severity: Some infections may not be amenable to oral therapy due to the severity or location of the infection.

For DTR Pseudomonas aeruginosa, if susceptibility to appropriate oral agents is confirmed (such as ceftazidime-avibactam, cefiderocol, or others), and all other criteria are met, transitioning to oral therapy can be considered. The same applies to CRE infections where the organism is susceptible to oral agents.

It’s important to note that the decision to switch from IV to oral antibiotics should always be made on a case-by-case basis, considering the individual patient’s circumstances and in consultation with an infectious diseases specialist when possible. The most current guidelines and recommendations should be followed, as they are regularly updated to reflect new evidence and changing patterns of resistance.

Rationale Critique: While the suggested approach for transitioning to oral therapy is reasonable, it’s important to critically evaluate the potential risks associated with switching from IV to oral antibiotics. These include the possibility of suboptimal drug levels at the site of infection, patient non-compliance, and the emergence of resistance due to incomplete eradication of the pathogen. Additionally, some patients may have contraindications or adverse reactions to oral medications that were not present with IV therapy.

Critical Appraisal Response: The critique raises valid concerns about the transition from IV to oral antibiotics. However, when the criteria for oral therapy are met, and the patient is closely monitored, the benefits of improved patient comfort, convenience, and potentially better adherence to treatment can outweigh these risks. It is crucial that the choice of oral agent is based on reliable AST data and that the patient’s response to therapy is carefully monitored for any signs of treatment failure or the development of resistance. The decision to switch to oral therapy should be made in collaboration with a healthcare provider experienced in managing MDRO infections.

Answer: Oral antibiotics can be used for some MDRO infections, provided that the organism is susceptible to the oral agent, the patient is clinically stable, and there is available AST data supporting the choice of oral medication. The decision should be made on a case-by-case basis, considering all relevant factors and in consultation with an infectious diseases specialist when possible.

page_content=‘improvement nonetheless occurred, it is generally not necessary to repeat a urine culture, change the antibiotic regimen, or extend the planned treatment course. However, for all other infections, if AST results indicate a potentially inactive agent was initiated empirically, a change to an active regimen for a full treatment course (dated from the start of active therapy) is suggested. Additionally, important host factors related to immune status, ability to attain source control, and general response to therapy should be considered when determining treatment durations for AMR infections, as with the treatment of any bacterial infection. Finally, whenever possible, transitioning to oral therapy should be considered (assuming IV therapy was initially prescribed), particularly if the following criteria are met: (1) susceptibility to an appropriate oral agent is demonstrated, (2) the patient is hemodynamically stable,’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 7}
##########
page_content=‘AST data are available.
Question 4.8: What is the role of combination antibiotic therapy for the treatment of infections caused by DTR P. aeruginosa ? Suggested approach: Combination antibiotic therapy is not suggested for infections caused by DTR P. aeruginosa if susceptibility to ceftolozane -tazobactam, ceftazidime -avibactam, imipenem -cilastatin - relebactam, or cefiderocol has been confirmed.
Rationale
Although empiric combination antibiotic therapy (e.g., the addition of tobramycin to a β -lactam agent) to broaden the likelihood of at least one active agent for patients at risk for DTR P. aeruginosa infections is reasonable, data do not indicate that continued combination therapy —once the β -lactam agent has demonstrated in vitro activity —offers any additional benefit over monotherapy with the β -lactam antibiotic379. Rather, the continued use of a second agent increases the likelihood of antibiotic -’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 66}
##########
page_content=‘management considerations are discussed briefly. Suggested approaches apply for both adult and pediatric populations, although suggested antibiotic dosages are provided only for adults.
Conclusions: The field of AMR is highly dynamic. Consultation with an infectious diseases specialist is recommended for the treatment of AMR infections. This document is current as of December 31, 2023 and will be updated periodically. The most current version of this document, including date of publication, is available at www.idsociety.org/practice -guideline/amr -guidance/ .’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 1}
##########
page_content=‘Last updated December 31, 2023, and posted online at https://www.idsociety.org/practice -guideline/amr - guidance/ . Please check website for most updated version of this guidance.
48
respectively292. Clinical trial data comparing ceftazidime -avibactam/aztreonam versus cefiderocol are not available and both agents are considered preferred treatment options for MBL -producing Enterobacterales infections.
Tigecycline or eravacycline are alternative options for the treatment of NDM -producing infections not involving the bloodstream or urinary tract (Question 3.9). Their activity is independent of the presence or type of carbapenemases.
Question 3.6: What are the preferred antibiotics for the treatment of infections outside of the urinary tract caused by CRE if OXA -48-like production is present?
Suggested approach: Ceftazidime -avibactam is the preferred treatment option for OXA -48-like-’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 47}

This question mainly tests whether the LLM understands what an MDRO is, and it did. It was also able to provide a generic statement that therapy can be switched to an oral antibiotic if the organism is susceptible to it.

What is the preferred treatment of MRSA?

The preferred treatment for methicillin-resistant Staphylococcus aureus (MRSA) infections can vary depending on several factors, including the type of infection, its severity, the patient’s clinical status, and local resistance patterns. According to the Infectious Diseases Society of America (IDSA) 2024 Guidance on the Treatment of Antimicrobial-Resistant Gram-Negative Infections, which also includes guidance on MRSA, the treatment regimen should be tailored based on these factors and guided by culture results and susceptibility testing.

For skin and soft tissue infections caused by MRSA, options include:

1. Trimethoprim-sulfamethoxazole (caution in G6PD deficiency)
2. Doxycycline (for adolescents and adults)
3. Minocycline (for adolescents and adults)
4. Cephalexin or cefuroxime for uncomplicated infections in patients without penicillin allergy
5. For patients with a penicillin allergy, alternatives include doxycycline, minocycline, or clindamycin.

For more severe infections, such as bacteremia or endocarditis, treatment typically involves intravenous administration of agents like nafcillin, oxacillin, or a first-generation cephalosporin (e.g., cefazolin) for patients without penicillin allergy. For those with a penicillin allergy, alternatives include vancomycin, linezolid, or dalbavancin.

It is important to note that the IDSA guidance document is updated annually, and the most current recommendations should be consulted at the provided link: https://www.idsociety.org/practice-guideline/amr-guidance/.

Critical Appraisal:
One potential limitation of this answer is that it is based on a specific year’s guidance (2024) which may not capture the most recent developments or changes in treatment recommendations. Additionally, the treatment for MRSA can vary significantly based on geographic location and the evolving resistance patterns observed in different regions. Therefore, while the provided answer reflects the best available evidence as per the IDSA guidelines, it is essential to consult the latest guidance and local treatment protocols when making clinical decisions.

Response to Critical Appraisal:
The answer provided is based on the most recent IDSA guidance document available at the time of the question (2024). While it is true that treatment recommendations can change and may vary by region, the IDSA guidelines are widely respected and serve as a foundational reference for clinicians. It is also important to note that while local resistance patterns should always be considered, the IDSA guidance provides a standardized approach to treating MRSA infections when local data is not available or when it aligns with their recommendations. Clinicians should indeed consult the most current guidelines and local treatment protocols, but the IDSA guidance offers a robust starting point for treatment decisions.

page_content=‘data may not be very robust and continue to rapidly evolve – such as with AMR. Guidance documents are prepared by a small team of experts, who answer questions about treatment based on a comprehensive (but not necessarily systematic) review of the literature, clinical experience, and expert opinion. Documents are made available online and updated annually.
In the present document, guidance is provided on the treatment of infections caused by extended -spectrum β -lactamase -producing Enterobacterales (ESBL -E), AmpC β -lactamase -producing Enterobacterales (AmpC -E), carbapenem -resistant Enterobacterales (CRE), Pseudomonas aeruginosa with difficult -to-treat resistance (DTR P. aeruginosa ), carbapenem -resistant Acinetobacter baumannii (CRAB), and Stenotrophomonas maltophilia . Many of these pathogens have been designated urgent or serious threats by the CDC2. Each pathogen causes a wide range of infections that are encountered in’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 2}
##########
page_content=‘Last updated December 31, 2023, and posted online at https://www.idsociety.org/practice -guideline/amr - guidance/ . Please check website for most updated version of this guidance.
1
Infectious Diseases Society of America 2024 Guidance on the Treatment of Antimicrobial -Resistant Gram -Negative Infections
Pranita D. Tamma,1 Emily L. Heil2, Julie Ann Justo3, Amy J. Mathers4, Michael J. Satlin5, & Robert A. Bonomo,6 Affiliations
1Department of Pediatrics, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA 2Department of Practice, Sciences, and Health -Outcomes Research, University of Maryland School of Pharmacy, Baltimore, Maryland, USA
3Department of Pharmacy, Dartmouth Hitchcock Medical Center, Lebanon, New Hampshire, USA 4Departments of Medicine and Pathology, University of Virginia, Charlottesville, Virginia, USA 5Department of Medicine, Weill Cornell Medicine, New York, NY, USA’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 0}
##########
page_content=‘Last updated December 31, 2023, and posted online at https://www.idsociety.org/practice -guideline/amr - guidance/ . Please check website for most updated version of this guidance.
2
Abstract
Background: The Infectious Diseases Society of America (IDSA) is committed to providing up-to-date guidance on the treatment of antimicrobial -resistant (AMR) infections. This guidance document focuses on infections caused by extended -spectrum β-lactamase -producing Enterobacterales (ESBL -E), AmpC β- lactamase -producing Enterobacterales (AmpC -E), carbapenem -resistant Enterobacterales (CRE), Pseudomonas aeruginosa with difficult -to-treat resistance (DTR P. aeruginosa ), carbapenem -resistant Acinetobacter baumannii (CRAB), and Stenotrophomonas maltophilia . This updated document replaces previous versions of the guidance document.
Methods: A panel of six infectious diseases specialists with expertise in managing antimicrobial -’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 1}
##########
page_content=‘Last updated December 31, 2023, and posted online at https://www.idsociety.org/practice -guideline/amr - guidance/ . Please check website for most updated version of this guidance.
3
Introduction
Antimicrobial -resistant (AMR) infections are a global crisis. Internationally, approximately 1.3 million deaths were estimated to be directly attributable to AMR pathogens in 20191. In the United States, AMR pathogens caused more than 2.8 million infections and over 35,000 deaths annually from 2012 through 2017, according to the Centers for Disease Control and Prevention (CDC) Antibiotic Resistance Threats in the United States Report2. As an alternative to practice guidelines, the Infectious Diseases Society of America (IDSA) has endorsed developing more narrowly focused guidance documents for the treatment of infections where data may not be very robust and continue to rapidly evolve – such as with AMR. Guidance documents’ metadata={‘source’: ’llm-langchain/idsa_data/amr-guidance-4.0.pdf’, ‘page’: 2}

LOL ~! This is a test. There was no MRSA in the guideline; I just wanted to see what the response would be. It’s not good. It did provide some possible antibiotics that work, but it also included ones that we know will not work. I suspect the answers came from the training data. Let’s explore the content to see how I can prevent this in the future.

Interestingly, I think after I removed the references, this answer came up… I didn’t even have to adjust score_threshold.

What is the preferred treatment of CRAB?

The preferred treatment for moderate to severe infections caused by Carbapenem-Resistant Acinetobacter (CRAB) bacteria involves combination therapy. According to the guidance document, the recommended approach includes a sulbactam-containing agent, with the preferred regimen being sulbactam-durlobactam in combination with either imipenem-cilastatin or meropenem. An alternative regimen is high-dose ampicillin-sulbactam (with a total daily dose of 9 grams of the sulbactam component) combined with another active agent, such as colistin, minocycline, tigecycline, or cefiderocol if sulbactam-durlobactam is not available. This approach is based on the limited data supporting the effectiveness of single agents and the potential for increased bacterial killing with combination regimens, despite inconclusive results from in vitro and animal studies, as well as observational studies evaluating combination therapy versus monotherapy.

Critical Appraisal:
The preferred treatment for CRAB infections as outlined by the guidance document is based on a limited evidence base. The recommendation for sulbactam-durlobactam in combination with either imipenem-cilastatin or meropenem is made despite the lack of robust clinical data supporting its efficacy. The alternative regimen of high-dose ampicillin-sulbactam also lacks strong clinical trial evidence. The reliance on combination therapy may be a prudent approach given the resistance profile of CRAB, but it is not definitively proven to be superior to monotherapy based on current evidence. Additionally, the use of combination therapy could lead to increased antibiotic exposure and potential toxicity without necessarily improving outcomes or reducing the risk of treatment failure or mortality.

Answer to Critical Appraisal:
The critical appraisal raises a valid point regarding the strength of the evidence supporting the prefe...

To leave a comment for the author, please follow the link and comment on their blog: r on Everyday Is A School Day.

Continue reading: Llama, Llama, Oh Give Me A Sign. What’s In The Latest IDSA Guideline?

Text Styling With ggplot2

Albert Rapp — Sat, 20 Jul 2024 22:00:00 +0000

[This article was first published on Albert Rapp, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

In today’s blog post, we figure out how to fully control the styling of the text that we put into our ggplots. This means that we will learn

how to dynamically adjust the text color depending on the background color, and
how to use the extensive styling capabilities that the brand-new {marquee} package gives you.

Here, you will find all of the code chunks split into sections. For detailed explanations, check out the corresponding YT video.

Customize the text color based on the background color

library(tidyverse)
dat <- tibble(value = 1:5) |> 
  mutate(
    text_color = if_else(
      value <= 3,
      'black',
      'white'
    )
  ) 

dat |> 
  ggplot(aes(x = value, y = 1)) +
  geom_tile(
    aes(fill = value),
    width = 0.5, 
    height = 0.5,
    col = 'black'
  ) +
  geom_text(
    aes(label = value),
    color = dat$text_color,
    size = 8,
    fontface = 'bold',
    family = 'Source Sans Pro'
  ) +
  coord_fixed() +
  scale_fill_gradient(low = 'white', high = 'firebrick4') +
  theme_void() +
  theme(legend.position = 'none')

Dynamic text color with categorical labels

dat <- tibble(x = 1:5, letter = letters[1:5]) |> 
  mutate(
    text_color = if_else(
      letter %in% c('a', 'd'),
      'white',
      'black'
    )
  )

dat |> 
  ggplot(aes(x = x, y = 1)) +
  geom_tile(
    aes(fill = letter),
    width = 0.5, 
    height = 0.5,
    col = 'black'
  ) +
  geom_text(
    aes(label = letter),
    size = 8,
    color = dat$text_color,
    fontface = 'bold',
    family = 'Source Sans Pro'
  ) +
  coord_fixed() +
  theme_void() +
  scale_fill_brewer(palette = 'Set1') +
  theme(legend.position = 'none')
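
Both examples above hard-code which tiles get white text. As a rough sketch of how that choice could be derived from the fill colour itself, the helper below estimates the brightness of a colour with the Rec. 601 luma weights and picks black or white accordingly. Note that `contrast_text_color()` is a hypothetical name, not part of ggplot2 or the original post, and the 0.5 threshold is an assumption you may want to tune.

```r
# Hypothetical helper (not from the original post): choose black or white text
# for a background colour based on its approximate perceived brightness.
contrast_text_color <- function(bg, threshold = 0.5) {
  rgb <- grDevices::col2rgb(bg) / 255   # 3 x n matrix with values in [0, 1]
  # Rec. 601 luma weights, a common quick approximation of brightness
  brightness <- 0.299 * rgb["red", ] +
    0.587 * rgb["green", ] +
    0.114 * rgb["blue", ]
  unname(ifelse(brightness > threshold, "black", "white"))
}

# Light backgrounds get black text, dark backgrounds get white text
contrast_text_color(c("white", "firebrick4", "dodgerblue4", "yellow"))
```

The result could then be stored in a `text_color` column inside `mutate()`, provided you know each tile's fill as a colour name or hex code.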

Use `geom_marquee()` instead of `geom_text()`

geom_marquee() is a drop-in replacement for geom_text() and geom_label(). Important caveat: In order for everything to render properly, you might have to update your ragg package.

library(marquee)

md_text <- 'This is a **bold word** written in _Markdown_.'

tibble(x = 1, y = 1, label = md_text) |> 
  ggplot(aes(x, y)) +
  geom_marquee(
    aes(label = label),
    size = 13,
    family = 'Source Sans Pro'
  ) +
  theme_void()

Add more styles using the `style` aesthetic

tibble(x = 1, y = 1, label = md_text) |> 
  ggplot(aes(x, y)) +
  geom_marquee(
    aes(label = label),
    size = 13,
    family = 'Source Sans Pro',
    style = classic_style(
      weight = 'thin'
    )
  ) +
  theme_void()

Modify paragraph styles

my_own_style <- classic_style(
  weight = 'thin'
) |> 
  modify_style(
    'p', 
    background = 'dodgerblue4',
    padding = trbl(10),
    color = 'white',
    border_radius = 4
  )

tibble(x = 1, y = 1, label = md_text) |> 
  ggplot(aes(x, y)) +
  geom_marquee(
    aes(label = label),
    size = 13,
    family = 'Source Sans Pro',
    style = my_own_style
  ) +
  theme_void()

Modify Code styles

md_text <- 'Now let\'s try some `code` stuff and a [url]().'

tibble(x = 1, y = 1, label = md_text) |> 
  ggplot(aes(x, y)) +
  geom_marquee(
    aes(label = label),
    size = 13,
    family = 'Source Sans Pro',
    style = classic_style() |> 
      modify_style(
        'code',
        weight = 'bold',
        background = colorspace::lighten('dodgerblue4', 0.9),
        border_radius = 4,
        color = 'dodgerblue4',
        family = 'IBM Plex Mono',
        padding = trbl(0, 4, 0, 4)
      )
  ) +
  theme_void()

Use long texts as part of plot titles

md_text <- '# This is a headline

And the rest is just a regular text, i.e. paragraph, that will contain long and lengthy but also **SUPER** important information. Isn\'t that just great?'


headline_style <- classic_style() |> 
  remove_style('h1') |> 
  modify_style(
    'h1',
    weight = 'bold',
    size = 32,
    margin = trbl(b = 4),
    family = 'Merriweather'
  ) |> 
  modify_style(
    'p',
    lineheight = 1
  )

tibble(x = 1, y = 1) |> 
  ggplot(aes(x, y)) +
  geom_point(size = 10) +
  labs(title = md_text) +
  theme_minimal(
    base_size = 18, 
    base_family = 'Source Sans Pro'
  ) +
  theme(
    plot.title = element_marquee(
      width = 1,
      style = headline_style
    )
  )

Create text boxes using the `width` argument

text_box_style <- headline_style |> 
      modify_style(
        'p', 
        padding = trbl(l = 4),
        size = 16
      ) |> 
      modify_style(
        'body',
        border_radius = 4,
        border = 'dodgerblue4',
        border_size = trbl(2),
        padding = trbl(5)
      ) |> 
      modify_style(
        'h1',
        border_size = trbl(0)
      ) 

tibble(x = 1, y = 1) |> 
  ggplot(aes(x, y)) +
  geom_point(size = 10) +
  annotate(
    'marquee',
    x = 1.2,
    y = 1.5,
    label = md_text,
    width = 0.4,
    hjust = 0,
    fill = colorspace::lighten('dodgerblue1', 0.7),
    style = text_box_style
  ) +
  labs(title = md_text) +
  theme_minimal(
    base_size = 18, 
    base_family = 'Source Sans Pro'
  ) +
  theme(
    plot.title = element_marquee(
      width = 1,
      style = headline_style
    )
  ) +
  coord_cartesian(
    xlim = c(0, 2),
    ylim = c(0, 2)
  )

Colorize single words in your title

md_text <- '# This is a headline

And the rest is just a regular text, i.e. paragraph, that will contain long and lengthy but also {.red **SUPER** important information}. Isn\'t that just great?'

tibble(x = 1, y = 1) |> 
  ggplot(aes(x, y)) +
  geom_point(size = 10) +
  annotate(
    'marquee',
    x = 1.2,
    y = 1.5,
    label = md_text,
    width = 0.4,
    hjust = 0,
    fill = colorspace::lighten('dodgerblue1', 0.7),
    style = text_box_style
  ) +
  labs(title = md_text) +
  theme_minimal(
    base_size = 18, 
    base_family = 'Source Sans Pro'
  ) +
  theme(
    plot.title = element_marquee(
      width = 1,
      style = headline_style
    )
  ) +
  coord_cartesian(
    xlim = c(0, 2),
    ylim = c(0, 2)
  )

Define your own inline style

md_text <- '# This is a headline

And the rest is just a regular text, i.e. paragraph, that will contain long and lengthy but also {.my_style **SUPER** important information}. Isn\'t that just great?'

tibble(x = 1, y = 1) |> 
  ggplot(aes(x, y)) +
  geom_point(size = 10) +
  annotate(
    'marquee',
    x = 1.2,
    y = 1.5,
    label = md_text,
    width = 0.4,
    hjust = 0,
    fill = colorspace::lighten('dodgerblue1', 0.7),
    style = text_box_style |> 
      modify_style(
        'my_style',
        color = 'seagreen'
      )
  ) +
  labs(title = md_text) +
  theme_minimal(
    base_size = 18, 
    base_family = 'Source Sans Pro'
  ) +
  theme(
    plot.title = element_marquee(
      width = 1,
      style = headline_style |> 
        modify_style(
          'my_style',
          color = 'seagreen'
        )
    )
  ) +
  coord_cartesian(
    xlim = c(0, 2),
    ylim = c(0, 2)
  )

To leave a comment for the author, please follow the link and comment on their blog: Albert Rapp.

Continue reading: Text Styling With ggplot2

King of the Mountain: using R to bag a Strava KOM

Stephen Royle — Sat, 20 Jul 2024 20:25:26 +0000

[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

One of the best features of Strava is the battle to be King (or Queen) of the Mountain. Originally, in cycling, segments were typically climbs or difficult sections of road, and the simple idea is: who can complete the segment in the quickest time? Hence they would be KOM/QOM, King or Queen of the Mountain.

Segments quickly expanded to pretty much any section of a course and to include running segments, to separate them from cycling. In running if you are KOM/QOM you hold the Course Record (CR). There’s a brief rant below about how frustrating this could-be-amazing feature is, if you are interested.

I have a few CRs/KOMs, literally a few, and it would be nice to have more! Strava has been going for a long time, and these days, it is very hard to set a course record. Partly for reasons in the rant below, but mainly because a lot of people have logged their runs and rides on Strava, and inevitably some fast people have done so over these segments. Add in the people that target segments to try and beat the times, then it is difficult to find one that is beatable.

Data science to the rescue

What’s that you say? Something that’s hard or difficult? Let’s see if data science has the answer!

My idea was to find all the local segments for running, and analyse which ones are beatable. Then go out and set a course record. It is not feasible to do this manually due to the number of segments and the fact that Strava doesn’t display all the segment data nicely, so we’ll use R to do it.

There’s a wonderful library {rStrava} that allows us to interact with the Strava API within R. I won’t discuss how to set it up; there are guides elsewhere. Briefly, you need to register an app with Strava (they’re OK with individuals doing nerdy stuff with limited data using the API, but they have tightened up in the last year or so on bigger projects), and then use the keys to get an access token.

library(rStrava)
library(dplyr)
library(ggplot2)
# for mapping
library(leaflet)
library(htmlwidgets)
library(googleway)
library(mapview)

# create authentication token
# requires user created app name, id, and secret from Strava website
# initial setup ####
# Strava key
app_name <- "foobar"
app_client_id <- "1234"
app_secret <- "1234"

stoken <- httr::config(token = strava_oauth(app_name,
                                            app_client_id,
                                            app_secret,
                                            app_scope = "activity:read_all",
                                            cache = TRUE
))
# refresh stale token using the command below
# stoken <- httr::config(token = readRDS('.httr-oauth')[[1]])

OK, so now we’re good to go. Let’s pull the data for all segments within a region of interest.

We need a nice big region, to get a load of segments, BUT the request will only return the first 10 segments in a region. So we’ll break down the region into a 10 x 10 grid of squares and search within each so that we maximise the number of segments to analyse. The command get_explore() allows us to pull the ids of segments within bounds. The bounds are written as a string bnds <- "52.2970, -1.7326, 52.3876, -1.4598" i.e. bottom left and top right lat/long.

# define bottom-left corners of a 10 x 10 array of bounding boxes
# using 0.00906 increments for lat and 0.02728 for long
# starting at 52.2970, -1.7326
# so that the top-right corner of the last box is 52.3876, -1.4598
# (length.out = 10 gives 10 corners per axis; seq(from, to, by) would give 11)
lats <- seq(52.2970, by = 0.00906, length.out = 10)
longs <- seq(-1.7326, by = 0.02728, length.out = 10)

# create a list of bounding boxes
bnds <- lapply(lats, function(x) {
  lapply(longs, function(y) {
    paste0(x, ",", y, ",", x + 0.00906, ",", y + 0.02728)
  })
}) %>% unlist()

# for each element in bnds, get the explore data
all_segments <- data.frame()
for (i in seq_along(bnds)) {
  data <- get_explore(stoken, bnds[i], activity_type = "running")
  data$segments %>%
    lapply(function(x) x$id) %>%
    unlist() -> segments
  if(length(segments) == 0) next
  df <- data.frame(bnds = bnds[i], segments = segments)
  all_segments <- rbind(all_segments, df)
}

# unique values in all_segments
all_segments_list <- unique(all_segments$segments)

Note that the API currently has a read limit of 2,000 per day and 100 per 15 minutes. For this first part of the code, we’re only requesting 100 things so we’re all good. Later we’ll need to limit the requests.
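
As a quick back-of-the-envelope check, the limits quoted above translate into a minimum spacing between requests. The sketch below uses only the numbers from the text; `hours_needed()` is a hypothetical planning helper and the request count of 400 is a made-up example.

```r
# Rate limits as quoted in the post (they may change over time)
limit_15min <- 100
limit_daily <- 2000

# Minimum spacing between requests to stay inside the 15-minute window
min_delay <- (15 * 60) / limit_15min   # seconds per request
min_delay

# Hypothetical planning helper: hours needed for n requests at a given delay
hours_needed <- function(n_requests, delay = min_delay) {
  n_requests * delay / 3600
}

# e.g. 400 requests spaced 9 seconds apart
hours_needed(400)
```

Any delay a little above `min_delay` leaves some headroom, and the total number of requests also has to stay below the daily limit.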

We now have a list of all segment ids within the region of interest, a total of 483 segments (running only).

Next we need to get the data associated with each segment id. A tricky thing here is that the data fields differ between segments so we need to establish a core set of columns of interest and select those so that we can assemble a big data frame without getting an error from incompatible column numbers.

# columns of interest
keepcols <- c(
  "activity_type", "athlete_count",
  "athlete_segment_stats.effort_count", "average_grade",
  "created_at", "distance", "effort_count", "elevation_high",
  "elevation_low", "elevation_profile", "elevation_profiles.dark_url",
  "elevation_profiles.light_url", "end_latlng1", "end_latlng2", "hazardous",
  "id", "map.id", "map.polyline", "map.resource_state",
  "maximum_grade", "name", "private", "resource_state", "star_count",
  "starred", "start_latlng1", "start_latlng2", 
  "total_elevation_gain", "updated_at", "xoms.destination.href",
  "xoms.destination.name", "xoms.destination.type",
  "xoms.kom", "xoms.overall"
)
all_segment_data <- data.frame()

for (seg in all_segments_list) {
  # wait for 9.5 seconds (time out is 100 requests in 15 min)
  Sys.sleep(9.5)
  data <- get_segment(stoken, id = seg)
  data %>% compile_segment() -> data2
  if (!is.null(data2) && nrow(data2) > 0) {
    # keep cols
    data2 <- data2[, keepcols]
    all_segment_data <- rbind(all_segment_data, data2)
  }
}

Note that we use a sleep time here so that we don’t get locked out of the API for making too many requests. If you get an error during this loop, it is likely due to a segment missing one of the columns that we want to select. This happened for one segment; I just deleted the id from the list and restarted the loop from there to get it to completion.
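
One possible way to avoid the missing-column error altogether is to pad absent columns with NA before selecting. This is only a sketch: `select_or_na()` is a hypothetical helper, not part of {rStrava}, and the toy data frames stand in for the output of `compile_segment()`.

```r
# Hypothetical helper: return `df` with exactly the columns in `cols`,
# adding any that are missing as NA so that rbind() always lines up.
select_or_na <- function(df, cols) {
  missing_cols <- setdiff(cols, names(df))
  for (m in missing_cols) {
    df[[m]] <- rep(NA, nrow(df))
  }
  df[, cols, drop = FALSE]
}

# Toy example with made-up segment data
seg_a <- data.frame(id = 1, name = "Hill", distance = 400)
seg_b <- data.frame(id = 2, name = "Flat")   # no distance column
wanted <- c("id", "name", "distance")

combined <- rbind(select_or_na(seg_a, wanted), select_or_na(seg_b, wanted))
combined
```

In the loop above, this would mean replacing `data2[, keepcols]` with `select_or_na(data2, keepcols)`, at the cost of having to handle the resulting NA values later.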

Also note that I am only getting KOM data (and overall leader, although this is usually KOM) and not QOM. This is because not all segments have QOM data, i.e. they have not been attempted by any women runners. Any women out there wanting to try this, there are quite a few QOMs still up for grabs!

Now that we have the data, let’s calculate pace of the best time to figure out which ones are beatable. We have to do this because the segments vary in length and comparing speed or pace is the best way to compare.

# in the column xoms.kom, the time is written as a string
# it is either in the format "00:00:00", "00:00", "00s"
# convert to seconds
all_segment_data$overall_time <- sapply(all_segment_data$xoms.kom, function(x) {
  if (grepl("s", x)) {
    as.numeric(gsub("s", "", x))
  } else {
    # determine if it is in the format "00:00:00" or "00:00" by counting the number of ":"
    if (nchar(x) - nchar(gsub(":", "", x)) == 1) {
      x <- paste0("00:", x)
    }
    as.numeric(strsplit(x, ":")[[1]][1]) * 60 * 60 + as.numeric(strsplit(x, ":")[[1]][2]) * 60 + as.numeric(strsplit(x, ":")[[1]][3])
  }
})
# convert distance from character to numeric
all_segment_data$distance <- as.numeric(all_segment_data$distance)
# calculate pace in minutes per km
all_segment_data$pace <- (all_segment_data$overall_time / 60) / (all_segment_data$distance / 1000)
# pace is in decimal minutes, convert it to a string that is in mm:ss format
all_segment_data$pace <- format(as.POSIXct(all_segment_data$pace * 60, origin = "1970-01-01"), format = "%M:%S")
# order by pace
all_segment_data %>% arrange(pace) -> final_data

# save final_data
saveRDS(final_data, "Output/Data/segments.rds")
saveRDS(all_segments, "Output/Data/all_segments.rds")

It’s worth saving these R objects because of the hassle getting the data via the API.
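
The time-to-pace conversion above can be checked on a few hand-made values. The sketch below is a compact variant of the same idea, not the code used in the post: `time_to_seconds()` is a hypothetical helper and the times and distances are made up.

```r
# Hypothetical helper: convert "1:02:03", "12:34" or "45s" to seconds
time_to_seconds <- function(x) {
  vapply(x, function(s) {
    if (grepl("s$", s)) {
      return(as.numeric(sub("s$", "", s)))
    }
    parts <- as.numeric(strsplit(s, ":", fixed = TRUE)[[1]])
    # weights for (hours, minutes, seconds), right-aligned to the parts present
    weights <- rev(c(1, 60, 3600)[seq_along(parts)])
    sum(parts * weights)
  }, numeric(1), USE.NAMES = FALSE)
}

secs <- time_to_seconds(c("45s", "12:34", "1:02:03"))
secs

# pace in decimal minutes per km for made-up distances in metres
pace_min_per_km <- (secs / 60) / (c(250, 3000, 15000) / 1000)
pace_min_per_km
```

Keeping a numeric pace column alongside the formatted mm:ss string makes later filtering and plotting less dependent on string handling.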

Our final data frame (final_data) has all the segment data organised by pace so that checking the beatable segments is quite simple. Let’s see what we have in there.

Analysing the data

Figure: Time vs Distance

Figure: Time vs Distance (less than 1.5 km)

As a first look, we can plot the course record time as a function of segment length. There are some very long segments in this region, so we’ll zoom in on those 1.5 km or less.

Now let’s look at pace (the code for all of this is shown below).

I’m looking for segments where the course record equates to a pace of 4 min/km or more. These are the beatable segments. The plots show that there are a few of these in the region of interest, as well as plenty around 3 min/km, which would be much harder to beat.

There are a bunch of segments with silly-fast pace. Over 6% of the segments have a pace faster than 1:40 min/km, which is roughly Usain Bolt's sprinting pace. These are clearly garbage, which is annoying!
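
To put the 1:40 min/km threshold in context, here is a small worked calculation. It assumes the 100 m world record of 9.58 s and simply extrapolates it to a full kilometre; the paces in `pace_secs` are made up for illustration.

```r
# 100 m world record of 9.58 s, extrapolated to a full kilometre
bolt_secs_per_km <- 9.58 * (1000 / 100)
bolt_pace <- sprintf(
  "%d:%02d",
  as.integer(bolt_secs_per_km %/% 60),
  as.integer(round(bolt_secs_per_km %% 60))
)
bolt_pace

# Share of (made-up) segment paces, in seconds per km, faster than 1:40 min/km
pace_secs <- c(80, 95, 170, 185, 240, 260)
share_too_fast <- mean(pace_secs < 100) * 100
share_too_fast
```

So a threshold of 1:40 min/km corresponds to 10 s per 100 m, slightly slower than the record itself, which makes it a generous cut-off for implausible efforts.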

There’s also one segment with a very slow pace. I checked it and it seems not to be an error. The segment seems to be a women-only running charity event held on private property; clearly there was a man who jogged the course too and uploaded it to Strava.

We can map out all the segments in R or just show the segments that are beatable.

The code for visualisation is here:

ggplot(final_data, aes(x = distance, y = overall_time)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw() +
  labs(x = "Distance (m)",
       y = "Overall Time (s)")
ggsave("Output/Plots/distance_vs_time.png", width = 1200, height = 900, units = "px")

final_data %>% filter(distance < 1500) %>% ggplot(aes(x = distance, y = overall_time)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw() +
  labs(x = "Distance (m)",
       y = "Overall Time (s)")
ggsave("Output/Plots/distance_vs_time_1500.png", width = 1200, height = 900, units = "px")

final_data %>% filter(distance < 1500) %>%
  ggplot(aes(x = distance, y = as.POSIXct(pace, format="%M:%S"))) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw() +
  scale_y_datetime(date_labels = "%M:%S") +
  labs(x = "Distance (m)",
       y = "Pace (mm:ss/km)")
ggsave("Output/Plots/distance_vs_pace.png", width = 1200, height = 900, units = "px")

# faster than Usain Bolt
nrow(final_data[final_data$pace < "01:40",]) / nrow(final_data) * 100
#[1] 6.431535

# histogram of pace
final_data %>% ggplot(aes(x = as.POSIXct(pace, format="%M:%S"))) +
  geom_histogram(bins = 100) +
  theme_bw() +
  scale_x_datetime(date_labels = "%M:%S") +
  labs(x = "Pace (mm:ss/km)",
       y = "Frequency")
ggsave("Output/Plots/pace_histogram.png", width = 1200, height = 900, units = "px")

## Mapping ----
## map boundaries
lats.range <- c(52.2905, 52.3941)
lons.range <- c(-1.4553, -1.7391)
# make map
map <- leaflet(options = leafletOptions(zoomControl = FALSE)) %>%
  addProviderTiles('CartoDB.Positron',
                   options = providerTileOptions(noWrap = TRUE, minZoom = 12, maxZoom = 12)) %>%
  fitBounds(lng1 = min(lons.range), lat1 = max(lats.range), lng2 = max(lons.range), lat2 = min(lats.range))
# overlay each segment
for (i in seq_len(nrow(final_data))) {
  activity <- final_data[i,]
  #  if (activity$pace < "04:00") next
  coords   <- googleway::decode_pl(activity$map.polyline)
  map      <- addPolylines(map, lng = coords$lon, lat = coords$lat,
                           color = 'blue', opacity = 1/2, weight = 2)
}
map

Let’s set a new course record!

From the map, I could see that there were three beatable segments that could easily be run during one loop. These included “Holly Hop” which had only been tackled 95 times and had a best time of 4:58 for just over 1 km. The course record was set in 2013 which made me wonder if the low number of attempts and the long-standing record meant that it was inaccessible. Anyway, I made a mental note where the segments were, put on my running shoes, and set off.

Well, I set the CR for Holly Hop but failed on the other two. I was one second short on A Mighty Hill and 3 s short on Get up the Hill. Maybe next time…

The analysis showed that there’s a bunch of others for me to attempt in the future too.

Critique of Strava segments

Segments are basically Strava’s USP. Alternative platforms tried to launch equivalent features and were unsuccessful. Conversely, there’s almost nothing else that Strava does that you can’t do on one of the alternatives. Bizarrely, Strava seems to have never understood that segments are its USP.

Frustratingly, bad data plagues the segment leaderboards. As noted in the post, 6.4% of the segments were faster than Usain Bolt pace. These are almost certainly bad data, and anything up to 2:30 min/km is questionable IMO. The bad data comes from:

GPS blips – technical error.
Accidental user error – runner gets in a car and forgets to turn the device off. Segments along roads are often contaminated.
Accidental user error – runner pauses and restarts but is recorded as though the intervening distance has been run.
Data doping – manipulation of gpx/fit files
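As a rough idea of what an algorithmic sweep could look like, here is a minimal base R sketch that buckets course-record paces using the two thresholds mentioned above (1:40 and 2:30 min/km). The function name classify_pace() and the example paces are made up for illustration, and this is not anything Strava is known to do.

```r
# Bucket paces (seconds per km) into three bands:
#   <= 100 s/km (1:40)         -> "implausible"
#   100 to 150 s/km (to 2:30)  -> "questionable"
#   > 150 s/km                 -> "plausible"
classify_pace <- function(pace_s) {
  cut(
    pace_s,
    breaks = c(-Inf, 100, 150, Inf),
    labels = c("implausible", "questionable", "plausible"),
    right = TRUE
  )
}

# Made-up example paces in seconds per km
example_paces <- c(85, 140, 185, 250)
flags <- classify_pace(example_paces)

# Share of records that would be flagged as implausible, in percent
pct_implausible <- mean(flags == "implausible") * 100
```

A pace threshold alone would not catch paused-and-restarted runs at ordinary speeds, so it only addresses the most obvious bad records.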

Strava’s attitude to all this is just to shrug. They want users to police the leaderboards and report dodgy timings. They could easily sweep the leaderboards algorithmically, but they don’t. In fact it’s totally bizarre that they let their best feature just deteriorate. They could also make more out of segments by having yearly KOM/QOM competitions, so that bad data has less chance of wrecking these. They sort of do this now, but they could make much more of the feature. OK, that’s my rant over. There are plenty of other, more extensive rants out there, e.g. here, about the most frustrating company in the exercise business.

—

The post title comes from “King of the Mountain” by Kate Bush from her Aerial album.

To leave a comment for the author, please follow the link and comment on their blog: Rstats – quantixed.

Shockingly-fast data manipulation in R with polars

Business Science — Fri, 19 Jul 2024 12:00:00 +0000

[This article was first published on business-science.io, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Hey guys, welcome back to my R-tips newsletter. Polars is NOW available in R! Yes, the shockingly-fast data manipulation library built on top of Rust is now in R. Today, I’m excited to show off some of Polars’ capabilities for fast financial and time series analysis. Let’s go!

Here’s what you’re learning today:

What is polars? You’ll discover what polars is and how it accomplishes shockingly-fast data manipulation.
Benefits of using Polars: which types of data analysis can benefit from polars the most.
How to use Polars inside of R: I have prepared a full R code tutorial (get the code here).

Get the Code (In the R-Tip 082 Folder)

SPECIAL ANNOUNCEMENT: ChatGPT for Data Scientists Workshop on August 14th

Inside the workshop I’ll share how I built a Machine Learning Powered Production Shiny App with ChatGPT (extends this data analysis to an insane production app):

What: ChatGPT for Data Scientists

When: Wednesday August 14th, 2pm EST

How It Will Help You: Whether you are new to data science or are an expert, ChatGPT is changing the game. There’s a ton of hype. But how can ChatGPT actually help you become a better data scientist and help you stand out in your career? I’ll show you inside my free ChatGPT for Data Scientists workshop.

Price: Does Free sound good?

How To Join: Register Here

R-Tips Weekly

This article is part of R-Tips Weekly, a weekly video tutorial that shows you step-by-step how to do common R coding tasks. Pretty cool, right?

Here are the links to get set up.

This Tutorial is Available in Video (9-minutes)

I have a 9-minute video that walks you through setting up polars in R and running your first financial time series data analysis.

What is Polars?

According to the polars documentation:

The polars package for R gives users access to a lightning fast Data Frame library written in Rust. Polars’ embarrassingly parallel execution, cache efficient algorithms and expressive API makes it perfect for efficient data wrangling, data pipelines, snappy APIs, and much more besides. Polars also supports “streaming mode” for out-of-memory operations. This allows users to analyze datasets many times larger than RAM.

Lightning-Fast Data Frame Library Written in Rust

The key here is that, under the hood, both the R and Python implementations of polars use the hyper-scalable and blazingly fast Rust library. Key aspects of Rust include:

Memory Safety: Rust ensures memory safety without needing a garbage collector. This is achieved through its ownership system, which enforces strict rules on how memory is managed.
Concurrency: Rust is designed to make it easy to write concurrent programs. The language’s ownership system helps prevent data races, which are a common problem in concurrent programming.
Zero-cost Abstractions: Rust aims to provide high-level abstractions without the cost typically associated with them in terms of performance. This allows developers to write efficient code without sacrificing readability.
Performance: Rust’s performance is comparable to C and C++ due to its focus on low-level control over system resources.
Tooling: Rust comes with a powerful set of tools, including cargo (the package manager and build system), rustc (the Rust compiler), and rustfmt (a code formatting tool).

Rust in a Nutshell

Rust is fast. Its design is focused on parallel processing. And because of that, polars is fast, parallel, lazy (in a good way), and really good for most data operations.

Which Data Manipulations is Polars Good For?

I’ve been testing out polars for quite a while in both Python and R.

For background, as of a year ago I began work on pytimetk, which replicates many of the R timetk package’s time series analysis features in Python. And for that project, our team has internally used a polars engine for many time series operations that are known to be resource intensive.

Polars vs Pandas: Speed Comparison and Performance Test Results:

We’ve published our performance results here.

Rolling Operations: Polars can be 10X to 3500X faster than Pandas
Expanding Operations: 3X to 500X Faster
Aggregations (Summarizations): 13X Faster

The bottom line is that Polars is fast vs Pandas. It’s especially good for grouped time series operations including rolling, expanding, and aggregating operations.

I expect Polars in R to be faster than dplyr. However, I have not run similar tests (yet).

Tutorial: How to use Polars inside of R

It takes about 30 seconds to get polars set up so you can start using shockingly-fast data manipulation inside of R. All the tutorial code shown is available in the R-Tips Newsletter folder for R-Tip 082.

Step 1 – Install polars:

The first step is to set up polars. Polars is not on CRAN as of the writing of this article. But it’s simple to install from the r-multiverse.org team.

Run this line of code:

install.packages("polars", repos = "https://community.r-multiverse.org")

Step 2 – Load the Libraries and Data

Once polars is installed, load the libraries and data with this code.

Here’s the stock_data.csv once it’s read with pl$read_csv(). A few key points about the Polars Data Frame Structure:

Shape of the data is shown at the top.
Some columns and rows will not be shown when printed to the screen (identified with …)
The “Date” column is a str data type
The stocks (25 total) are f64 data type (float 64)

Step 3 – Pivot to Long Format for Grouped Data Analysis

The next step is to get the data into a format so we can begin to do grouped analysis. Use the unpivot() function to go from wide-to-long format:

The transformation was done shockingly-fast. This is what the long format looks like:
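To make the wide-to-long shape concrete, here is a minimal base R sketch. It is not the polars code from the tutorial folder; the tickers AAA and BBB, the prices, and the output column names stock and price are all made up for illustration.

```r
# Made-up wide table: one row per date, one column per ticker
wide <- data.frame(
  Date = c("2024-01-02", "2024-01-03", "2024-01-04"),
  AAA  = c(10.0, 10.5, 11.0),
  BBB  = c(20.0, 19.5, 19.0)
)

# Every column except the date holds prices for one ticker
value_cols <- setdiff(names(wide), "Date")

# Stack the ticker columns: one row per (date, ticker) pair
long <- data.frame(
  Date  = rep(wide$Date, times = length(value_cols)),
  stock = rep(value_cols, each = nrow(wide)),
  price = unlist(wide[value_cols], use.names = FALSE)
)
```

With 25 stock columns instead of two, the long table has 25 times as many rows as the wide one, and the stock column becomes the natural grouping key for the analysis that follows.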

To visualize the data, run this code:

Step 4 – Moving Averages with Polars’ Rolling Mean

The last step we’ll cover is how to perform moving averages using polars’ rolling mean functionality. This is one of the biggest benefits to using Polars.

Run this code to perform a 10-day and 50-day moving average over each of the 25 stocks:

Again, the performance is undeniable. In milliseconds, the rolling calculations are complete.
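For readers who want to see what a grouped rolling mean actually computes, here is a small base R sketch. It is a stand-in for illustration, not the polars call used in the tutorial: the helper rolling_mean(), the tickers, and the prices are made up, and a 3-row window is used instead of 10 or 50 to keep the example short.

```r
# Trailing moving average over the last n observations.
# The first n - 1 values are NA because the window is incomplete.
rolling_mean <- function(x, n) {
  as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
}

# Made-up long-format data, already sorted by date within each stock
long <- data.frame(
  stock = rep(c("AAA", "BBB"), each = 5),
  price = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
)

# Apply the rolling mean separately within each stock
long$ma3 <- ave(long$price, long$stock, FUN = function(p) rolling_mean(p, 3))
```

Grouping matters here: without it, the window for the first rows of BBB would include the last prices of AAA.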

Run this code to visualize the result:

We can quickly see which stocks have momentum from the 10-day and 50-day moving averages (those with Red lines above the Green Lines).

Reminder: The code is available free inside R-tips

All of the code you saw today is available in the R-Tips Newsletter folder for R-Tip 082.

Conclusions:

Polars is one of those libraries that is quickly becoming a standard in the Python ecosystem. I’m glad to see that R is getting the same treatment. It’s simply the fastest data manipulation library I’ve come across. And I’ve tried them all.

If you would like to grow your Business Data Science skills, then please read on…

Need to advance your business data science skills?

I’ve helped 6,107+ students learn data science for business from an elite business consultant’s perspective.

I’ve worked with Fortune 500 companies like S&P Global, Apple, MRM McCann, and more.

And I built a training program that gets my students life-changing data science careers (don’t believe me? see my testimonials here):

6-Figure Data Science Job at CVS Health ($125K)
Senior VP Of Analytics At JP Morgan ($200K)
50%+ Raises & Promotions ($150K)
Lead Data Scientist at Northwestern Mutual ($175K)
2X-ed Salary (From $60K to $120K)
2 Competing ML Job Offers ($150K)
Promotion to Lead Data Scientist ($175K)
Data Scientist Job at Verizon ($125K+)
Data Scientist Job at CitiBank ($100K + Bonus)

Whenever you are ready, here’s the system they are taking:

Here’s the system that has gotten aspiring data scientists, career transitioners, and life long learners data science jobs and promotions…

Join My 5-Course R-Track Program Now!
(And Become The Data Scientist You Were Meant To Be…)

P.S. – Samantha landed her NEW Data Science R Developer job at CVS Health (Fortune 500). This could be you.

To leave a comment for the author, please follow the link and comment on their blog: business-science.io.

rOpenSci News Digest, July 2024

rOpenSci - open tools for open science — Fri, 19 Jul 2024 00:00:00 +0000

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Dear rOpenSci friends, it’s time for our monthly news roundup!

You can read this post on our blog. Now let’s dive into the activity at and around rOpenSci!

rOpenSci HQ

Announcing New Software Peer Review Editors: Beatriz Milz and Margaret Siple

We are excited to welcome Beatriz Milz and Margaret Siple to our team of Associate Editors for rOpenSci Software Peer Review. They join Laura DeCicco, Julia Gustavsen, Anna Krystalli, Mauro Lepore, Noam Ross, Maëlle Salmon, Emily Riederer, Adam Sparks, and Jeff Hollister.

Meet Beatriz and Margaret in their introduction blog post. Welcome on board to you both, thank you for your service!

A fresh new look for R-universe!

You might have noticed that R-universe got a big refresh. Read all about this big overhaul of the interface.

Resources from the rOpenSci community at useR! 2024

While some video recordings have not yet been posted on the useR! YouTube channel, some slide decks and materials are already available.

Maëlle Salmon’s keynote talk “How your code might get rusty” (Slides | Recording)
Jeroen Ooms’ talk “Navigating the R Ecosystem Using R-Universe” (Slides)
Jon Harmon’s virtual talk “Learning Together at the Data Science Learning Community”
Paola Corrales and Elio Campitelli’s tutorial “Efficient Data Analysis with data.table” (Website)
Elio Campitelli’s talk “Building Bilingual Bridges with Multilingual Manuals” (Slides)
Hugo Gruson’s
- Talk “Building Interoperability in Existing Software Ecosystems with S3 Classes” (Slides).
- Poster “A reproducible analysis of CRAN Task Views to understand the state of an R package ecosystem”
Lluís Revilla and Henrik Bengtsson’s poster about CRAN packages archived and the cranhaven.org R-universe created to reduce the impact of that on users
Will Landau and Charlie Gao’s talk “Moju-Kapu: How {Mirai} and {Crew} Are Powering the Next Generation of Parallel Computing in R” (Slides)
Binod Jung Bogati’s talks
- “Generate Raw Synthetic Dataset for Clinical Trial – Binod Jung Bogati, Numeric Mind” (Slides)
- “Translate R for Global Reach”

Coworking

Read all about coworking!

Join us for social coworking & office hours monthly on first Tuesdays! Hosted by Steffi LaZerte and various community hosts. Everyone welcome. No RSVP needed. Consult our Events page to find your local time and how to join.

Tuesday, August 6th, 9:00 America Pacific (16:00 UTC), Building your first R package with cohost Carolina Pradier and Steffi LaZerte.
- Explore how to make R packages
- Plan out that package you’ve always wanted to create
- Chat with our cohost about tips and tricks for making your first R package
Tuesday, September 3rd, 9:00 Australia Western (01:00 UTC), Theme TBA with cohost TBA and Steffi LaZerte.

And remember, you can always cowork independently on work related to R, work on packages that tend to be neglected, or work on whatever you need to get done!

Software

New packages

The following package recently became a part of our software suite:

osmapiR, developed by Joan Maspons: Interface to OpenStreetMap API for fetching and saving data from/to the OpenStreetMap database (https://wiki.openstreetmap.org/wiki/API_v0.6). It is available on CRAN. It has been reviewed by Jon Harmon and Carlos Cámara.

Discover more packages, read more about Software Peer Review.

New versions

The following nine packages have had an update since the last newsletter: rotemplate (pkgdown-2.0.9), gitignore (v0.1.7), nodbi (v0.10.5), nuts (v1.1.0), occCite (v0.5.7), osmapiR (v0.1.0), phonfieldwork (v0.0.16), taxlist (v0.3.0), and waywiser (v0.6.0).

Software Peer Review

There are eleven recently closed and active submissions and six submissions on hold. Issues are at different stages:

One at ‘6/approved’:
- osmapiR, OpenStreetMap API. Submitted by Joan Maspons.
Four at ‘4/review(s)-in-awaiting-changes’:
- chopin, CHOPIN: Computation for Climate and Health research On Parallelized INfrastructure. Submitted by Insang Song.
- cancerprof, API Client for State Cancer Profiles. Submitted by Brian Park.
- rix, Rix: Reproducible Environments with Nix. Submitted by Bruno Rodrigues.
- agromet, Índices y Estadísticos Climáticos e Hidrológicos. Submitted by Paola Corrales.
Four at ‘3/reviewer(s)-assigned’:
- eDNAjoint, Joint Modeling of Traditional and Environmental DNA Survey Data. Submitted by Abigail Keller.
- rsi, Efficiently Retrieve and Process Satellite Imagery. Submitted by Michael Mahoney.
- sits, Satellite Image Time Series Analysis for Earth Observation Data Cubes. Submitted by Gilberto Camara.
- fwildclusterboot, Fast Wild Cluster Bootstrap Inference for Linear Models. Submitted by Alexander Fischer. (Stats).
One at ‘2/seeking-reviewer(s)’:
- QuadratiK, A Collection of Methods Using Kernel-Based Quadratic Distances for. Submitted by Giovanni Saraceno. (Stats).
One at ‘1/editor-checks’:
- gigs, Assess Fetal, Newborn, and Child Growth with International Standards. Submitted by Simon Parker. (Stats).

Find out more about Software Peer Review and how to get involved.

On the blog

Software Review

Announcing New Software Peer Review Editors: Beatriz Milz and Margaret Siple by Mark Padgham, Beatriz Milz, and Margaret Siple. Introducing two new editors for rOpenSci software review.

Use cases

Three use cases of our packages and resources have been reported recently.

Using {weatherOz} to Plot Perth’s May High Temperatures. Reported by Adam Sparks.
OAI interface request, response, and data download with R. Reported by Jörg Lehmann.
Translating Carpentries workbench lessons with babeldown. Reported by Hugo Gruson.

Explore other use cases and report your own!

Calls for contributions

Calls for maintainers

If you’re interested in maintaining any of the R packages below, you might enjoy reading our blog post What Does It Mean to Maintain a Package?.

historydata, datasets for historians. Issue for volunteering.
USAboundaries (and USAboundariesdata), historical and contemporary boundaries of the United States of America . Issue for volunteering.

Calls for contributions

Also refer to our help wanted page – before opening a PR, we recommend asking in the issue whether help is still needed.

Package development corner

Some useful tips for R package developers.

Last call: your opinion on the CRAN submission process!

Shared by Lluís Revilla and Heather Turner in our Slack workspace, a crucial survey ending today!

“If you have R package development experience and would like to share your thoughts on the CRAN submission process, please fill this short survey from the CRAN Cookbook project!”

Please find the Google form and read more about the exciting cookbook project in this post by Jasmine Daly.

Robust type-checking with r-lib

Don’t miss this insightful short post by Josiah Parry, “Type safe(r) R code”. A related older blog post is “Checking the inputs of your R functions” by Hugo Gruson, Sam Abbott, Carl Pearson.

The one with all the useR! links

The useR! 2024 conference featured quite a few talks relevant to package development, besides the talks we mentioned in the HQ section. Not all recordings are available yet, but make sure to check out the useR! YouTube channel.

If we missed any relevant content, please get in touch so we might add missing pieces to our next newsletter!

Retrospectives

Kurt Hornik and Torsten Hothorn gave keynote talks “More than 25 years of CRAN” (Slides) and “Some things you can’t read from a NEWS file” (Slides | Recording) about maintaining a package for decades.

Edzer Pebesma and Roger Bivand reported on “The Retirement of R Packages with Many Reverse Dependencies” (Slides).

On validation of R packages

Coline Zeballos and Yann Féhat from the R Validation Hub discussed how to support (pharma) companies with validation of R Packages (Slides). They use a toolset based on r-hub/repos and the riskmetric package.

Szymon Maksymiuk and Lorenzo Braschi presented a Deep Dive Into Industry R Package Quality Assessment. Besides introducing the concepts, they mentioned three open-source R packages that they created: checked for running reverse dependencies checks; covtracer for contextualizing tests using covr test traces; rd2markdown for converting .Rd files into Markdown.

Also on reverse dependencies checks, Pawel Rucki and André Veríssimo presented {verdepcheck} – A Tool for Dependencies Check (Slides | Package Docs).

Franciszek Walkowiak discussed Systems Integration Tests for R Package Cohorts, including the introductions to two open-source utilities, scribe that creates complete build, check and install reports for a collection of R projects and locksmith that helps with renv.lock creation (Slides).

On good practice

Daniel Sabanés Bové introduced openstatsware’s work on minimal viable good practice standards for R packages.

Pedro Silva listed Seven Deadly Sins Holding You Back as a Software Developer (Slides).

Hugo Gruson had a poster on A reproducible analysis of CRAN Task Views to understand the state of an R package ecosystem. See the live analysis.

On learning with silly projects

Fonti Kar shared her experience in creating {ohwhaley} – a ‘toy’ R package which serves as a tool for learning package development and upskilling new learners (Slides).

On package design

Hugo Gruson highlighted the benefits of using S3 classes for interoperability in Existing Software Ecosystems (Slides). See also his recent blog posts on the topic.

Ligia Adamska used an onion analogy to explain Layered Design for R Package Development: Meeting the Needs of Pharmaceutical R&D Stakeholders (Slides).

On tools

Daphne Grasselly, Franciszek Walkowiak and Pawel Rucki led a tutorial on Streamlining R package development with GitHub Actions Workflows (Slides).

Emil Hvitfeldt explained how to make better error messages with rlang and cli.

Ella Kaye shared her insights on C for R users (Slides).

Davis Vaughan introduced tree-sitter, an efficient incremental parsing library, and the R package treesitter, which provides bindings to it. The README states: “tree-sitter is useful for a number of things, including syntax highlighting, go-to definition, code reshaping, and more.”

On multilingualism

Elio Campitelli spoke about Building Bilingual Bridges with Multilingual Manuals (Slides). See also their post on our blog!

On debugging

Shannon Pileggi delivered a tutorial on debugging in R (Materials).

Antoine Fabri gave an overview of the motivations behind, and features of, his constructive package, which, among other things, can be useful for troubleshooting (Package docs).

On wrapping APIs

Hadley Wickham introduced and demo-ed his httr2 package (Package docs).

Simon Haller explained the Automated Generation of R Client Packages for RESTful APIs (Slides). See also Jon Harmon’s work on the same topic.

On a last resort for archived CRAN packages

Henrik Bengtsson and Lluís Revilla had a poster about their CRANhaven project, a backup solution for end-users when a package falls off CRAN (and which is built using R-universe!).

Last words

Thanks for reading! If you want to get involved with rOpenSci, check out our Contributing Guide that can help direct you to the right place, whether you want to make code contributions, non-code contributions, or contribute in other ways like sharing use cases. You can also support our work through donations.

If you haven’t subscribed to our newsletter yet, you can do so via a form. Until it’s time for our next newsletter, you can keep in touch with us via our website and Mastodon account.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

Radar charts and five-tool baseball players

Jerry Tuttle — Thu, 18 Jul 2024 18:32:00 +0000

[This article was first published on Online College Math Teacher, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Radar charts and five-tool baseball players by Jerry Tuttle

I was looking for an opportunity to practice with radar charts and I came across an article on five-tool baseball players, so this seemed like a perfect application for this kind of chart.

A radar chart is an alternative to a column chart to display three or more quantitative variables. The chart graphs the values in a circular manner around a center point.

The five tools in baseball are: (1) hitting for average; (2) hitting for power; (3) defense; (4) throwing; and (5) speed. A five-tool player excels in all five of these.

Among current players, Mike Trout is considered a five-tool player. The measurement of Trout’s five tools can be displayed in the following radar chart:

Trout is rated at 80 for hitting for average, 70 for hitting for power, and his lowest scores are 60 for defense, throwing and speed. This is based on a 20-to-80 rating system, where 80 is elite, 70 is plus-plus, and 60 is plus. Sorry – I could not get the points to line up with the concentric pentagons.

For comparison, here is a display of Aaron Judge’s ratings.

Judge is rated at 80 for hitting for power, 70 for hitting for average, 60 for defense, 70 for throwing, and 50 for speed, where the 50 is average at the major league level.

The results of several players can be displayed in a single radar chart, but this becomes hard to read. Three players are probably the maximum for readability.

The alternative to visualizing several players is either to create several individual radar charts or else to create a bar (horizontal) chart or a column (vertical) chart.

Each of the five tools is generally rated on a 20-to-80 scale, where 50 is average (for a major leaguer), 80 is elite, and every 10 points is supposed to represent one standard deviation. I suspect the standard deviation concept is more judgmental than mathematical. There is not a single rating system; some use traditional baseball statistics, and others use modern motion tracking data.
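Taken literally, the scale described above is a linear map from standard deviations to grades. The following base R sketch shows that arithmetic only; the function name z_to_grade() is made up, and real scouting grades are assigned by judgment or by each rating system's own method rather than by this formula.

```r
# Map a z-score (standard deviations from the major league average)
# to a 20-to-80 grade: 50 is average, 10 points per standard deviation,
# rounded to the nearest 10 and capped at the ends of the scale.
z_to_grade <- function(z) {
  raw <- 50 + 10 * z
  rounded <- round(raw / 10) * 10
  pmin(pmax(rounded, 20), 80)
}

# Average, plus (1 SD), plus-plus (2 SD), elite (3 SD), and beyond the cap
grades <- z_to_grade(c(0, 1, 2, 3, 4.2))
```

Under this reading, an 80 corresponds to three standard deviations above the major league average, which hints at how rare such a grade should be.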

The numerical data above was obtained from an article by Jake Mintz in 2022 for Fox Sports https://www.foxsports.com/stories/mlb/trout-betts-rodriguez-the-definition-of-mlbs-five-tool-players . In Mintz’s data, all numbers are shown rounded to the nearest 10. Mintz lists only five current players as five-tool players: Mike Trout, Mookie Betts, Trea Turner, Byron Buxton, and Julio Rodriguez. I tried graphing all five players in a single radar chart, but this was too hard to read. Mintz thinks a true five-tool player should have a grade of at least 60 in each of the five categories. By this measure, Aaron Judge is not quite a five-tool player due to a 50 in speed, and a number of elite major leaguers have at least one 50. Note that each category is considered separately. If there were some sort of weighting system, many people would weight hitting for power as most important, followed by hitting for average, although perhaps the weights should vary by position, with higher weights for defense and throwing for catchers, middle infielders, and center fielders. Pitchers have a different grading system.
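Mintz’s threshold is easy to check in code. This short base R sketch applies the at-least-60-in-every-category rule to the three players whose ratings appear in the code later in this post; the variable name is_five_tool is made up for illustration.

```r
# Ratings for the three players used in the radar charts
players <- data.frame(
  row.names = c("Trout", "Betts", "Judge"),
  Hit_avg   = c(80, 70, 70),
  Hit_power = c(70, 60, 80),
  Defense   = c(60, 70, 60),
  Throwing  = c(60, 80, 70),
  Speed     = c(60, 70, 50)
)

# A player qualifies only if every one of the five grades is at least 60
is_five_tool <- apply(players >= 60, 1, all)

# Lowest grade per player, which is what decides the outcome
lowest_grade <- apply(players, 1, min)
```

Trout and Betts pass, while Judge fails on speed alone, which matches the description above.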

What about Shohei Ohtani? At the time of his article, Mintz did not have sufficient data on Ohtani.

Mintz observes that Mike Trout worked one winter to improve his throwing, and Julio Rodriguez worked to increase his speed. This suggests that the ratings probably change over the life of a player and are dependent on when they are measured.

Other authors suggest that there is a sixth tool of exceptional players such as mental makeup and character. Another tool might be situational game awareness.

Modern motion tracking data by Statcast and others did not exist until fairly recently. Willie Mays is generally considered the greatest five-tool player. Using statistical measures, author Herm Krabbenhoft suggests Tris Speaker, Ty Cobb, and Honus Wagner should be considered as five-tool players, although Krabbenhoft measures hitting for power with SLG (slugging percentage) and ISO (isolated power), not home runs https://sabr.org/journal/article/honus-wagner-baseballs-prototypical-five-tooler/ . A very different measure of hitting for power would be something like home run distance greater than 425 feet or launch angle and velocity.

What about Babe Ruth? We know Babe Ruth’s career numbers are .342 batting average and 714 home runs. I have not read anything about his defense, throwing, or speed. He did steal 123 bases, including home 10 times; maybe he was faster than we realize. He is remembered for getting thrown out stealing second to end the 1926 World Series, but perhaps the hit-and-run play was on, and Bob Meusel, the batter, swung and missed the pitch? See https://baseballegg.com/2019/10/30/babe-ruths-failed-stolen-base-attempt-ended-the-1926-world-series-or-is-that-what-really-happened/ . Ruth had 204 assists as an outfielder, which sounds like a lot. I wonder how he would have ranked in defense, throwing, and speed?

Here is my R code. I do like radar charts for comparing one to three observations over five variables, as a change of pace from column charts. I used the fmsb library for the radar charts. There is also a ggradar library, but I did not like its visualization. One of the quirks of fmsb is that the axis for each variable can have its own scale. Originally I used each variable’s max and min values, but the axes were out of sync, so I replaced this with the grand max and min. Also, I could not get the values, which are all multiples of ten, to line up on the concentric pentagons.
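On the alignment problem, one possible fix, offered as an untested sketch rather than something verified against the charts in this post, is the seg argument of radarchart(), which sets the number of grid segments per axis and defaults to 4. With a grand minimum of 50 and a grand maximum of 80, four segments put the pentagons 7.5 points apart; three segments would put them 10 apart. The names n_seg and grid_values below are made up for illustration.

```r
# Ratings for the three players used in this post
players <- data.frame(
  row.names = c("Trout", "Betts", "Judge"),
  Hit_avg   = c(80, 70, 70),
  Hit_power = c(70, 60, 80),
  Defense   = c(60, 70, 60),
  Throwing  = c(60, 80, 70),
  Speed     = c(60, 70, 50)
)

grand_max <- max(players)
grand_min <- min(players)

# One grid segment per 10 rating points
n_seg <- (grand_max - grand_min) / 10

# Values at which the concentric pentagons would then fall
grid_values <- seq(grand_min, grand_max, length.out = n_seg + 1)

# Only draw when run interactively with fmsb installed
if (interactive() && requireNamespace("fmsb", quietly = TRUE)) {
  max_min <- as.data.frame(matrix(
    c(grand_max, grand_min),
    nrow = 2, ncol = ncol(players),
    dimnames = list(c("Max", "Min"), names(players))
  ))
  fmsb::radarchart(rbind(max_min, players), axistype = 0, seg = n_seg)
}
```

If the minimum and maximum rows change, the number of segments needs to be recomputed so that the range still divides evenly into steps of 10.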

library(fmsb)
library(scales)

group = c("Hit_avg", "Hit_power", "Defense", "Throwing", "Speed")
player_names = c("Trout", "Betts", "Judge")
players <- data.frame(
   row.names = player_names,
   Hit_avg = c(80, 70, 70),
   Hit_power = c(70,60,80),
   Defense = c(60,70,60),
   Throwing = c(60,80,70),
   Speed = c(60,70,50))
players

# The row 1 should contain the maximum values for each variable
# The row 2 should contain the minimum values for each variable
# Data for cases or individuals should be given starting from row 3
# Define the variable ranges: maximum and minimum; however, want axes to have equal scales

max_min <- data.frame(
   Hit_avg = c(max(players), min(players)),
   Hit_power = c(max(players), min(players)),
   Defense = c(max(players), min(players)),
   Throwing = c(max(players), min(players)),
   Speed = c(max(players), min(players)))

rownames(max_min) <- c("Max", "Min") # row 1 has max's, row 2 has min's.
df <- rbind(max_min, players)
df

player1_data <- df[c("Max", "Min", player_names[1]), ]
player2_data <- df[c("Max", "Min", player_names[2]), ]
player3_data <- df[c("Max", "Min", player_names[3]), ]

chart <- function(data, color, title){
     radarchart(data, axistype = 0,
       pcol = color, pfcol = scales::alpha(color, 0.5), plwd = 2, plty = 1,
        vlabels = colnames(data), vlcex = 1.5,
       cglcol = "black", cglty = 1, cglwd = 0.8,
        caxislabels = NULL,
       title = title)
}

# Plot the data for players 1, 2, and 3 separately
chart(data=player1_data, color="#00AFBB", title="MIKE TROUT 5 Tools")
chart(data=player2_data, color="#E7B800", title="MOOKIE BETTS 5 Tools")
chart(data=player3_data, color="#FC4E07", title="AARON JUDGE 5 Tools")

# Plot the data for three players
chart(data=df, color=c("#00AFBB", "#E7B800", "#FC4E07"), # teal, yellow, orange
     title="TROUT, BETTS, JUDGE 5 Tools")
legend(
     x = "bottom", legend = rownames(df[-c(1,2),]), horiz = FALSE,
     bty = "n", pch = 20 , col = c("#00AFBB", "#E7B800", "#FC4E07"),
     text.col = "black", cex = 1.25, pt.cex = 1.5)

###########################################

# column graphs

library(tibble)
library(tidyr)
library(ggplot2)
# Reshape data to long format
players_long <- players %>%
   rownames_to_column("player") %>%
   pivot_longer(cols = -player, names_to = "group", values_to = "value")

# Common theme for graphs
common_theme <- theme(
   legend.position="right",
   plot.title = element_text(size=15, face="bold"),
   axis.title = element_text(size=15, face="bold"),
   axis.text = element_text(size=15, face="bold"),
   legend.title = element_text(size=15, face="bold"),
   legend.text = element_text(size=15, face="bold"))

# Create column graph: Tool Ratings by Player
ggplot(players_long, aes(x = player, y = value, fill = group)) +
   geom_col(position = "dodge") +
   labs(x = "Player", y = "Rating", fill = "Group", title = "Tool Ratings by Player") +
   common_theme

# Create the column graph: Player Ratings for each Tool
ggplot(players_long, aes(x = group, y = value, fill = player)) +
   geom_col(position = "dodge") +
   labs(x = "Group", y = "Rating", fill = "Player", title = "Player Ratings for each Tool") +
   common_theme

### END

##################################################################################
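
One possible way to make the multiples of ten land on the concentric pentagons is sketched below. It is an untested suggestion, not the author's code: it rounds the grand minimum and maximum outward to multiples of ten and asks fmsb for one grid segment per ten rating points through the seg and caxislabels arguments of radarchart(). The names step, axis_min, axis_max and n_seg are made up for this sketch, and the plot is only drawn if fmsb is installed.

```r
# Ratings copied from the players data frame above
players <- data.frame(
  row.names = c("Trout", "Betts", "Judge"),
  Hit_avg   = c(80, 70, 70),
  Hit_power = c(70, 60, 80),
  Defense   = c(60, 70, 60),
  Throwing  = c(60, 80, 70),
  Speed     = c(60, 70, 50)
)

# Round the grand min and max outward to multiples of ten (assumed step size)
step     <- 10
axis_min <- floor(min(players) / step) * step
axis_max <- ceiling(max(players) / step) * step
n_seg    <- (axis_max - axis_min) / step   # one grid segment per ten points

# Row 1 = max, row 2 = min, same range for every axis
max_min <- as.data.frame(lapply(players, function(col) c(axis_max, axis_min)))
rownames(max_min) <- c("Max", "Min")
df <- rbind(max_min, players)

# Draw only when fmsb is available; axistype = 1 shows the centre axis labels
if (requireNamespace("fmsb", quietly = TRUE)) {
  fmsb::radarchart(df, axistype = 1, seg = n_seg,
                   caxislabels = seq(axis_min, axis_max, by = step),
                   pcol = c("#00AFBB", "#E7B800", "#FC4E07"),
                   plwd = 2, plty = 1,
                   cglcol = "black", cglty = 1, cglwd = 0.8,
                   axislabcol = "black",
                   title = "TROUT, BETTS, JUDGE 5 Tools")
}
```

With these ratings the axes run from 50 to 80 in three segments, so each grid pentagon should correspond to a multiple of ten.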

To leave a comment for the author, please follow the link and comment on their blog: Online College Math Teacher.


Simplify Regression Modeling with tidyAML’s fast_regression()

Steven P. Sanderson II, MPH — Thu, 18 Jul 2024 04:00:00 +0000

[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


Introduction

If you’ve ever faced the daunting task of setting up multiple regression models in R, you’ll appreciate the convenience and efficiency that tidyAML brings to the table. Today, we’re diving into one of its standout functions: fast_regression(). This function is designed to streamline the regression modeling process, allowing you to quickly create and evaluate a variety of model specifications with minimal code.

Introduction to `fast_regression()`

The fast_regression() function is part of the tidyAML package, a toolkit that simplifies machine learning workflows in R. This function takes your data and recipe object and generates multiple regression models, using a variety of engines and functions from the parsnip package.

Syntax

Here’s a look at the function’s syntax:

fast_regression(
  .data,
  .rec_obj,
  .parsnip_fns = "all",
  .parsnip_eng = "all",
  .split_type = "initial_split",
  .split_args = NULL,
  .drop_na = TRUE
)

Arguments

.data: The data frame to be used in the regression problem.
.rec_obj: A recipe object from the recipes package that defines the pre-processing steps.
.parsnip_fns: Specifies which parsnip functions to use. The default "all" will create all possible regression model specifications.
.parsnip_eng: Specifies which parsnip engines to use. The default "all" will create all possible regression model specifications.
.split_type: Defines the type of data split, with "initial_split" as the default. Other split types supported by rsample can also be used.
.split_args: Additional arguments for the split type. When set to NULL, default parameters for the chosen split type are used.
.drop_na: Determines whether to drop NA values from the data. Default is TRUE.

Example: Using `fast_regression()`

Let’s see fast_regression() in action with a simple example. We’ll use the well-known mtcars dataset, and set up a basic recipe to predict miles per gallon (mpg).

First, ensure you have the necessary packages loaded:

library(tidyAML)
library(dplyr)
library(recipes)
library(purrr)

Next, we create a recipe object:

rec_obj <- recipe(mpg ~ ., data = mtcars)

rec_obj

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:    1
predictor: 10

Now, we can run fast_regression() to create regression models using the lm and glm engines with a linear_reg parsnip function:

fast_reg_tbl <- fast_regression(
  .data = mtcars,
  .rec_obj = rec_obj,
  .parsnip_eng = c("lm", "glm"),
  .parsnip_fns = "linear_reg"
)

Output

The function returns a tibble with details about the generated models:

fast_reg_tbl

# A tibble: 2 × 8
  .model_id .parsnip_engine .parsnip_mode .parsnip_fns model_spec wflw      
      <int> <chr>           <chr>         <chr>        <list>     <list>    
1         1 lm              regression    linear_reg   <spec[+]>  <workflow>
2         2 glm             regression    linear_reg   <spec[+]>  <workflow>
# ℹ 2 more variables: fitted_wflw <list>, pred_wflw <list>

Explanation

.model_id: A unique identifier for each model.
.parsnip_engine: The engine used by parsnip (e.g., lm, glm).
.parsnip_mode: The mode of the model, typically regression.
.parsnip_fns: The parsnip function used (e.g., linear_reg).
model_spec: The model specification.
wflw: The workflow object.
fitted_wflw: The fitted workflow object.
pred_wflw: The predictions from the fitted workflow.

Benefits of `fast_regression()`

Efficiency: Quickly set up and evaluate multiple regression models with different specifications.
Flexibility: Supports a wide range of engines and functions, allowing for comprehensive model exploration.
Integration: Seamlessly integrates with the recipes, parsnip, and rsample packages, making it a versatile tool in your modeling arsenal.

Let’s Inspect

Now that we have our models, let’s take a closer look at the results. We can use the extract_regression_residuals() function to extract the residuals from the fitted models along with the original and predicted data:

extract_regression_residuals(fast_reg_tbl)

[[1]]
# A tibble: 32 × 4
   .model_type     .actual .predicted  .resid
   <chr>             <dbl>      <dbl>   <dbl>
 1 lm - linear_reg    21.4       21.1  0.276 
 2 lm - linear_reg    21         23.9 -2.91  
 3 lm - linear_reg    19.7       18.7  1.01  
 4 lm - linear_reg    18.1       19.7 -1.55  
 5 lm - linear_reg    14.7       12.0  2.73  
 6 lm - linear_reg    24.4       23.7  0.694 
 7 lm - linear_reg    22.8       24.3 -1.48  
 8 lm - linear_reg    13.3       12.5  0.820 
 9 lm - linear_reg    10.4       13.2 -2.85  
10 lm - linear_reg    19.2       19.3 -0.0583
# ℹ 22 more rows

[[2]]
# A tibble: 32 × 4
   .model_type      .actual .predicted  .resid
   <chr>              <dbl>      <dbl>   <dbl>
 1 glm - linear_reg    21.4       21.1  0.276 
 2 glm - linear_reg    21         23.9 -2.91  
 3 glm - linear_reg    19.7       18.7  1.01  
 4 glm - linear_reg    18.1       19.7 -1.55  
 5 glm - linear_reg    14.7       12.0  2.73  
 6 glm - linear_reg    24.4       23.7  0.694 
 7 glm - linear_reg    22.8       24.3 -1.48  
 8 glm - linear_reg    13.3       12.5  0.820 
 9 glm - linear_reg    10.4       13.2 -2.85  
10 glm - linear_reg    19.2       19.3 -0.0583
# ℹ 22 more rows
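
The two tables above are identical, which is expected: a linear regression fitted with lm and a Gaussian glm with the identity link estimate the same coefficients. The base R sketch below illustrates this and shows what the .actual, .predicted and .resid columns mean. It is not how tidyAML computes them (tidyAML works through workflows and a data split); it simply fits both models on the full mtcars data, and the names fit_lm, fit_glm and resid_tbl are made up for this sketch.

```r
# Fit the same linear model two ways on the full mtcars data
fit_lm  <- lm(mpg ~ ., data = mtcars)
fit_glm <- glm(mpg ~ ., data = mtcars, family = gaussian())

# Residual table with the same column meanings as the output above
resid_tbl <- data.frame(
  .actual    = mtcars$mpg,
  .predicted = unname(fitted(fit_lm))
)
resid_tbl$.resid <- resid_tbl$.actual - resid_tbl$.predicted

# Both engines should agree on the coefficients (up to numerical tolerance)
all.equal(coef(fit_lm), coef(fit_glm))

head(resid_tbl)
```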

Of course, we must now visualize! Let’s plot the residuals to see how well our models are performing:

extract_regression_residuals(fast_reg_tbl) |> 
  plot_regression_residuals()

[[1]]

[[2]]

Now that the residuals are plotted, we can see how well our models are performing. This is a great way to visually inspect the quality of our models and identify any potential issues.

Now let’s look at just the predicted vs. actual values:

extract_wflw_pred(fast_reg_tbl,1:nrow(fast_reg_tbl)) |>
  plot_regression_predictions()

[[1]]

[[2]]

Conclusion

The fast_regression() function in tidyAML is a powerful addition to any data scientist’s toolkit, providing a streamlined approach to regression modeling. Whether you’re a seasoned pro or just getting started with machine learning in R, this function can save you time and effort, allowing you to focus on what really matters – interpreting and acting on your results.

Give fast_regression() a try in your next project and experience the ease and efficiency it brings to regression modeling! If you have any questions or want to share your experience, feel free to leave a comment below.

To leave a comment for the author, please follow the link and comment on their blog: Steve's Data Tips and Tricks.


Extracting names of functions defined in a script with treesitter

Maëlle's R blog on Maëlle Salmon's personal website — Thu, 18 Jul 2024 00:00:00 +0000

[This article was first published on Maëlle's R blog on Maëlle Salmon's personal website, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


Coming back from a conference, we might be excited to install and try out the cool things we have heard about. I, going against the stream, decided to experiment with a tool I did not hear about last week, as I unfortunately missed Davis Vaughan’s talk about treesitter. Nonetheless, I caught the idea that treesitter is a parser of code, R code in particular. The treesitter R package uses the tree-sitter C library. There are awesome applications of treesitter, among which code search for R on GitHub, but I got to know a bit of treesitter by solving a boring use case.

My use case: I won’t copy-paste these function names

Have you noticed this nice error message you get if you try to use dplyr::across() directly?

dplyr::across()
#> Error in `dplyr::across()`:
#> ! Must only be used inside data-masking verbs like `mutate()`,
#>   `filter()`, and `group_by()`.

Kirill Müller got the idea to offer a similar behaviour for some igraph functions that exist for use only within the square bracket operator. Therefore I had to find all functions defined within this function (whose body has been simplified here):

`[.igraph.vs` <- function(...) {
  # some code
  bla <- function(...) {
    # some code
  }
  blop <- function(...) {
    # some code
  }
}

On the script outline I get on the right of my script in RStudio IDE¹, I can see the function names (that would be bla() and blop() in the simplified chunk above) but not copy-paste from there.

I would know how to extract them with xmlparsedata but I thought it might be an opportunity for me to have a look at treesitter. I went through different emotions as a beginner, not all of them positive, but I did get the function names in the end!

Get the whole tree from the script

Below, I dutifully followed the example on the package homepage:

load the R language from the treesitter.r package, as far as I understand the only language available at this point for the R package.
load the parser for that language.
read in the text.
parse the text.

A last step is getting the tree root node as node, because I will query the whole script, and you can only query nodes, not trees.

language <- treesitter.r::language()
parser <- treesitter::parser(language)
text <- brio::read_lines("/home/maelle/Documents/rigraph/R/iterators.R") |>
  paste(collapse = "\n")
tree <- treesitter::parser_parse(parser, text)
node <- treesitter::tree_root_node(tree)

Find the node of the parent function of interest

The documentation page of treesitter::query() recommends reading the documentation of tree-sitter about the query syntax. That documentation features very useful examples.

Below I am looking for “binary_operator” whose left hand side is an identifier that I capture as “name”, and whose right hand side is a “function_definition” I capture as “def”. For some reason if I did not capture “def” then I got less information about the node. I also use a predicate: the name of the function has to be equal to “[.igraph.vs”.

To find out I was after a “function_definition”, I parsed a few lines of made-up code to study the resulting tree.
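
That exploration is not shown in the post. A minimal sketch of what it might look like follows, assuming the treesitter and treesitter.r packages are installed; the toy code and the name toy_code are made up here. Printing a node displays its text and S-expression, which is where node types such as binary_operator and function_definition can be read off.

```r
# Made-up code with the same shape as the real use case
toy_code <- "
outer <- function(x) {
  inner <- function(y) y + 1
  inner(x)
}
"

# Parse and print only when both packages are available
if (requireNamespace("treesitter", quietly = TRUE) &&
    requireNamespace("treesitter.r", quietly = TRUE)) {
  toy_language <- treesitter.r::language()
  toy_parser   <- treesitter::parser(toy_language)
  toy_tree     <- treesitter::parser_parse(toy_parser, toy_code)
  toy_node     <- treesitter::tree_root_node(toy_tree)
  # The printed S-expression shows the node types to use in a query
  print(toy_node)
}
```

With the node types identified, the query itself reads: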

square_bracket_thing <- '
(
((binary_operator
  lhs: (identifier) @name
  rhs: (function_definition)) @def
  (#eq? @name "`[.igraph.vs`"))
  
)
'
square_bracket_query <- treesitter::query(language, square_bracket_thing)

Then I executed the query and extracted the node. In reality this took a bit more trial and error.

square_bracket_thing_captures <- treesitter::query_captures(square_bracket_query, node)
square_bracket_thing_captures
#> $name
#> [1] "def"  "name"
#> 
#> $node
#> $node[[1]]
#> 
#> 
#> ── Text ────────────────────────────────────────────────────────────────────────
#> `[.igraph.vs` <- function(x, ..., na_ok = FALSE) {
#>   args <- lazy_dots(..., .follow_symbols = FALSE)
#> 
#>   ## If indexing has no argument at all, then we still get one,
#>   ## but it is "empty", a name that is  ""
#> 
#>   ## Special case, no argument (but we might get an artificial
#>   ## empty one
#>   if (length(args) < 1 ||
#>     (length(args) == 1 && inherits(args[[1]]$expr, "name") &&
#>       as.character(args[[1]]$expr) == "")) {
#>     return(x)
#>   }
#> 
#>   ## Special case: single numeric argument
#>   if (length(args) == 1 && inherits(args[[1]]$expr, "numeric")) {
#>     res <- simple_vs_index(x, args[[1]]$expr, na_ok)
#>     return(add_vses_graph_ref(res, get_vs_graph(x)))
#>   }
#> 
#>   ## Special case: single symbol argument, no such attribute
#>   if (length(args) == 1 && inherits(args[[1]]$expr, "name")) {
#>     graph <- get_vs_graph(x)
#>     if (!(as.character(args[[1]]$expr) %in% vertex_attr_names(graph))) {
#>       res <- simple_vs_index(x, lazy_eval(args[[1]]), na_ok)
#> 
#> 
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (binary_operator [(479, 0), (637, 1)]
#>   lhs: (identifier [(479, 0), (479, 13)])
#>   operator: "<-" [(479, 14), (479, 16)]
#>   rhs: (function_definition [(479, 17), (637, 1)]
#>     name: "function" [(479, 17), (479, 25)]
#>     parameters: (parameters [(479, 25), (479, 48)]
#>       open: "(" [(479, 25), (479, 26)]
#>       parameter: (parameter [(479, 26), (479, 27)]
#>         name: (identifier [(479, 26), (479, 27)])
#>       )
#>       (comma [(479, 27), (479, 28)])
#>       parameter: (parameter [(479, 29), (479, 32)]
#>         name: (dots [(479, 29), (479, 32)])
#>       )
#>       (comma [(479, 32), (479, 33)])
#>       parameter: (parameter [(479, 34), (479, 47)]
#>         name: (identifier [(479, 34), (479, 39)])
#>         "=" [(479, 40), (479, 41)]
#>         default: (false [(479, 42), (479, 47)])
#>       )
#>       close: ")" [(479, 47), (479, 48)]
#>     )
#>     body: (braced_expression [(479, 49), (637, 1)]
#>       open: "{" [(479, 49), (479, 50)]
#>       body: (binary_operator [(480, 2), (480, 49)]
#> 
#> 
#> $node[[2]]
#> 
#> 
#> ── Text ────────────────────────────────────────────────────────────────────────
#> `[.igraph.vs`
#> 
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(479, 0), (479, 13)])
square_fn <- square_bracket_thing_captures$node[[1]]

At this point I was already very proud of my tiny win but I did not have the “children functions” yet!

Find the functions defined within the parent

I first considered creating a complicated nested query but I found no example of that. I did see someone telling StackOverflow they did a recursive query, and for some reason that gave me the idea of parsing the text of the node, then looking for functions in that text.

The query was simpler: looking for function definitions, only capturing the names on the left hand side.

square_tree <- treesitter::parser_parse(parser, treesitter::node_text(square_fn))
square_node <- treesitter::tree_root_node(square_tree)
kiddos_source <- '
(
(binary_operator
  lhs: (identifier) @name
  rhs: (function_definition))
)
'
kiddos_query <- treesitter::query(language, kiddos_source)
square_bracket_thing_captures <- treesitter::query_captures(kiddos_query, square_node)
head(square_bracket_thing_captures)
#> $name
#>  [1] "name" "name" "name" "name" "name" "name" "name" "name" "name" "name"
#> [11] "name" "name" "name" "name"
#> 
#> $node
#> $node[[1]]
#> 
#> 
#> ── Text ────────────────────────────────────────────────────────────────────────
#> `[.igraph.vs`
#> 
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(0, 0), (0, 13)])
#> 
#> $node[[2]]
#> 
#> 
#> ── Text ────────────────────────────────────────────────────────────────────────
#> .nei
#> 
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(29, 2), (29, 6)])
#> 
#> $node[[3]]
#> 
#> 
#> ── Text ────────────────────────────────────────────────────────────────────────
#> nei
#> 
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(50, 2), (50, 5)])
#> 
#> $node[[4]]
#> 
#> 
#> ── Text ────────────────────────────────────────────────────────────────────────
#> .innei
#> 
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(53, 2), (53, 8)])
#> 
#> $node[[5]]
#> 
#> 
#> ── Text ────────────────────────────────────────────────────────────────────────
#> innei
#> 
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(56, 2), (56, 7)])
#> 
#> $node[[6]]
#> 
#> 
#> ── Text ────────────────────────────────────────────────────────────────────────
#> .outnei
#> 
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(59, 2), (59, 9)])
#> 
#> $node[[7]]
#> 
#> 
#> ── Text ────────────────────────────────────────────────────────────────────────
#> outnei
#> 
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(62, 2), (62, 8)])
#> 
#> $node[[8]]
#> 
#> 
#> ── Text ────────────────────────────────────────────────────────────────────────
#> .inc
#> 
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(65, 2), (65, 6)])
#> 
#> $node[[9]]
#> 
#> 
#> ── Text ────────────────────────────────────────────────────────────────────────
#> inc
#> 
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(78, 2), (78, 5)])
#> 
#> $node[[10]]
#> 
#> 
#> ── Text ────────────────────────────────────────────────────────────────────────
#> adj
#> 
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(81, 2), (81, 5)])
#> 
#> $node[[11]]
#> 
#> 
#> ── Text ────────────────────────────────────────────────────────────────────────
#> .from
#> 
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(84, 2), (84, 7)])
#> 
#> $node[[12]]
#> 
#> 
#> ── Text ────────────────────────────────────────────────────────────────────────
#> from
#> 
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(96, 2), (96, 6)])
#> 
#> $node[[13]]
#> 
#> 
#> ── Text ────────────────────────────────────────────────────────────────────────
#> .to
#> 
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(99, 2), (99, 5)])
#> 
#> $node[[14]]
#> 
#> 
#> ── Text ────────────────────────────────────────────────────────────────────────
#> to
#> 
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(111, 2), (111, 4)])

After that I was able to get the names of the children functions! I was actually only interested in those whose names start with a dot as all the other ones are deprecated anyway.

kiddos_functions <- purrr::map_chr(square_bracket_thing_captures$node, treesitter::node_text)
kiddos_functions[startsWith(kiddos_functions, ".")]
#> [1] ".nei"    ".innei"  ".outnei" ".inc"    ".from"   ".to"
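
As a cross-check that does not need treesitter, the same kind of result can be obtained by walking the expression returned by base R's parse(). This is an alternative sketch, not the approach used in this post: the helper find_fn_names() is made up for illustration, it runs on the simplified function shown earlier rather than on the real igraph script, and it does not handle calls with empty arguments such as x[1, ].

```r
# Simplified stand-in for the real script (same shape as the earlier chunk)
code <- '
`[.igraph.vs` <- function(...) {
  bla <- function(...) {
    # some code
  }
  blop <- function(...) {
    # some code
  }
}
'

# Recursively collect the names on the left of `name <- function(...)`
find_fn_names <- function(expr) {
  found <- character()
  if (is.call(expr)) {
    is_assign <- identical(expr[[1]], as.name("<-"))
    if (is_assign && length(expr) == 3 && is.symbol(expr[[2]]) &&
        is.call(expr[[3]]) &&
        identical(expr[[3]][[1]], as.name("function"))) {
      found <- as.character(expr[[2]])
    }
    # Descend into every part of the call, including function bodies
    for (part in as.list(expr)) {
      found <- c(found, find_fn_names(part))
    }
  }
  found
}

exprs     <- parse(text = code, keep.source = FALSE)
all_names <- unlist(lapply(exprs, find_fn_names))

# The first name is the parent function itself; the rest are its children
children <- all_names[-1]
children
```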

Tada! Now I can go fix the issue I was tasked with.

Conclusion

In this post I report on my first encounter with the treesitter package for parsing R code. Copy-pasting the few function names would surely have been faster, but sometimes you’ve got to sit and learn something new.

¹ No, I have not installed Positron yet.

To leave a comment for the author, please follow the link and comment on their blog: Maëlle's R blog on Maëlle Salmon's personal website.


Inferring the gender of the subjects from RNAseq BAM files

Di Cook — Wed, 17 Jul 2024 12:33:00 +0000

[This article was first published on gacatag, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


In this post I show how the getGender() function can be used to infer the gender of the studied subjects from their binary alignment (BAM) files. Gender can be a source of unwanted variation within the data, for which you may want to adjust your differential gene expression or splicing analysis. However, complete metadata are unfortunately not always available. Furthermore, sometimes details within metadata are incorrect or have been misplaced due to manual error. Therefore, it is good practice to quickly double-check some details within the data, either to complete the missing metadata information or to make sure that the prior stages have been performed without any accidental mix-ups.

The getGender() function can be run on a vector that contains the paths to several bam files.

In the example, I first construct a vector that has the paths to several bam files (from various places on the server!) that I would like to analyze:

library(rnatoolbox)

bam1<-scan(
"/proj/pehackma/ali/OBSCN_ExonUsage/encode/bamOnlyFiles.txt",
what="character", sep='\n')
bam2<- list.files(
path = "/proj/pehackma/ali/OBSCN_ExonUsage/newDataAnalysis/star_align/",
pattern = ".bam$", recursive = TRUE, full.names = TRUE)
bam3<- list.files(
path = "/proj/pehackma/ali/OBSCN_ExonUsage/newDataAnalysis2/star_align/",
pattern = ".bam$", recursive = TRUE, full.names = TRUE)
bam2<- bam2[-c(grep("TX_",bam2))]
bam4<-scan(
"/proj/pehackma/ali/OBSCN_ExonUsage/bamOnlyFiles_05_10_2022.txt",
what="character", sep='\n')
bam5<-scan(
"/proj/pehackma/ali/OBSCN_ExonUsage/bamOnlyFiles2_05_10_2022.txt",
what="character", sep='\n')
bam6<- list.files(
path = "/proj/pehackma/ali/OBSCN_ExonUsage/newDataAnalysis/star_align/",
pattern = ".bam$", recursive = TRUE, full.names = TRUE)
bam7<- list.files(
path = "/proj/pehackma/ali/OBSCN_ExonUsage/newDataAnalysis2/star_align/",
pattern = ".bam$", recursive = TRUE, full.names = TRUE)
bam6<- bam6[c(grep("TX_",bam6))]
bam7<- bam7[c(grep("TX_",bam7))]

bamVec<- unique(c(bam1,bam2,bam3,bam4,bam5,bam6,bam7))
names(bamVec)<- gsub("\\..*","",basename(bamVec))

The classification will be based on the number of fragments mapping to chromosome Y (defined by the numChr parameter) vs. the number of fragments mapping to chromosome X (defined by the denumChr parameter). I need to define a ratio cutoff. For now I assign 0.05. Later, however, I will show how a suitable cutoff can be chosen and why 0.05 makes sense for this data.

ratioCuftoff=0.05

(chrRatio<- getGender(bamFiles=bamVec,
          plotFile="/proj/pehackma/ali/test/test_rnatoolbox/getGender_res_3.png",
          height=600, width=1100, fileFormat="png",pointsize=14,
          numChr="chrY", denumChr="chrX",
          mar=c(24.1, 4.1, 1.1, 1.1),
          hLine=ratioCuftoff, main=""))

# [1] 0.01506109 0.07722695 0.01344776 0.01196316 0.02678638 0.06733287
# [7] 0.01923814 0.07279685 0.06990508 0.01842288 0.06224493 0.01863105
# [13] 0.02040690 0.01802919 0.01715347 0.01968033 0.01868028 0.03803660
# [19] 0.01974870 0.01619692 0.07069193 0.06678158 0.07593763 0.10010664
# [25] 0.07228590 0.06714948 0.07386076 0.06651051 0.06846690 0.06959353
# [31] 0.08471437 0.07470632 0.09894125 0.08619078 0.01249461 0.07840080
# [37] 0.06956370 0.01244372 0.07121727 0.06335510 0.02085689 0.10241148
# [43] 0.07971098 0.09502325 0.02094541 0.07242126 0.02400726 0.01946038
# [49] 0.02036327 0.08630780 0.01860873 0.06303455 0.01895086 0.01871783
# [55] 0.02924722 0.08475490 0.08575956 0.08299903 0.08161365 0.02532606
# [61] 0.02747711 0.09117602 0.08533186 0.11900393 0.08611137 0.02049278
# [67] 0.08114836 0.07215539 0.06997266 0.08368385 0.08486157 0.07210998
# [73] 0.09763424 0.09177915 0.08260885 0.01730037 0.07390761 0.10200900
# [79] 0.08319738 0.09293284 0.09269107 0.02454981 0.10519674 0.09717035
# [85] 0.08505843 0.09700546

and the plot looks like this:

Now let’s see how to find a good ratio cutoff. The best cutoff should be where the biggest jump within the sorted ratio values is. This is how it can be detected:

diffDf<- data.frame(ratio=sort(chrRatio),
                    consequtive_difference=c(NA,diff(sort(chrRatio))))

head(diffDf)
# ratio consequtive_difference
# 1 0.01196316                     NA
# 2 0.01244372           4.805589e-04
# 3 0.01249461           5.088199e-05
# 4 0.01344776           9.531592e-04
# 5 0.01506109           1.613323e-03
# 6 0.01619692           1.135837e-03

indJump<- which(
diffDf$consequtive_difference==
    max(diffDf$consequtive_difference, na.rm=TRUE))

We notice that after sorting (in increasing order), the biggest jump of ratio values is from 0.03803660 to 0.06224493. The sum of these two values divided by 2, which is ~0.05, would be a good choice.

diffDf[c(indJump-1,indJump),]
# ratio consequtive_difference
# 31 0.03803660 0.008789386
# 32 0.06224493 0.024208324

sum(diffDf[c(indJump-1,indJump),1])/2
#[1] 0.05014077

Finally, we can use the ratios and the cutoff to label males and females.

(gender<- c("FEMALE","MALE")[as.numeric(chrRatio>ratioCuftoff)+1])
#[1] "FEMALE" "MALE"   "FEMALE" "FEMALE" "FEMALE" "MALE"   "FEMALE" "MALE"
#[9] "MALE"   "FEMALE" "MALE"   "FEMALE" "FEMALE" "FEMALE" "FEMALE" "FEMALE"
#[17] "FEMALE" "FEMALE" "FEMALE" "FEMALE" "MALE"   "MALE"   "MALE"   "MALE"
#[25] "MALE"   "MALE"   "MALE"   "MALE"   "MALE"   "MALE"   "MALE"   "MALE"
#[33] "MALE"   "MALE"   "FEMALE" "MALE"   "MALE"   "FEMALE" "MALE"   "MALE"
#[41] "FEMALE" "MALE"   "MALE"   "MALE"   "FEMALE" "MALE"   "FEMALE" "FEMALE"
#[49] "FEMALE" "MALE"   "FEMALE" "MALE"   "FEMALE" "FEMALE" "FEMALE" "MALE"
#[57] "MALE"   "MALE"   "MALE"   "FEMALE" "FEMALE" "MALE"   "MALE"   "MALE"
#[65] "MALE"   "FEMALE" "MALE"   "MALE"   "MALE"   "MALE"   "MALE"   "MALE"
#[73] "MALE"   "MALE"   "MALE"   "FEMALE" "MALE"   "MALE"   "MALE"   "MALE"
#[81] "MALE"   "FEMALE" "MALE"   "MALE"   "MALE"   "MALE"
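
The cutoff selection and labelling steps can also be wrapped in a small helper. The sketch below is a compact variant of the code above, not part of rnatoolbox: midpoint_cutoff() is a hypothetical name, and it uses which.max() on the consecutive differences instead of comparing against max(). To stay self-contained it runs on the first twelve ratios from the output above only.

```r
# Hypothetical helper: cutoff halfway across the largest gap in the sorted ratios
midpoint_cutoff <- function(ratios) {
  sorted <- sort(ratios)
  gaps   <- diff(sorted)       # consecutive differences
  i      <- which.max(gaps)    # position of the biggest jump
  (sorted[i] + sorted[i + 1]) / 2
}

# First twelve chrY/chrX ratios copied from the getGender() output above
ratios <- c(0.01506109, 0.07722695, 0.01344776, 0.01196316, 0.02678638, 0.06733287,
            0.01923814, 0.07279685, 0.06990508, 0.01842288, 0.06224493, 0.01863105)

cutoff <- midpoint_cutoff(ratios)
labels <- ifelse(ratios > cutoff, "MALE", "FEMALE")
data.frame(ratio = ratios, gender = labels)
```

On this subset the largest gap lies between 0.02678638 and 0.06224493, so the cutoff comes out at about 0.0445 rather than 0.05; the resulting labels for these twelve samples are the same as with the 0.05 cutoff.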

To leave a comment for the author, please follow the link and comment on their blog: gacatag.


VBA: Saving and Closing a Workbook

Steven P. Sanderson II, MPH — Wed, 17 Jul 2024 04:00:00 +0000

[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


Introduction

In this tutorial, you’ll learn how to save and close an Excel workbook using VBA (Visual Basic for Applications), and then how to do the same from R. We’ll create a simple VBA script that saves and closes a workbook, and then we’ll call this script from R using the RDCOMClient package.

Prerequisites

VBA Script

First, let’s create a simple VBA script that saves and closes a workbook. Here’s the VBA code:

Sub SaveAndCloseWorkbook()
    Dim wb As Workbook
    Set wb = ThisWorkbook
    wb.Save
    wb.Close
End Sub

Explanation:

Sub SaveAndCloseWorkbook(): This line starts the subroutine named SaveAndCloseWorkbook.
Dim wb As Workbook: This declares a variable wb as a Workbook object.
Set wb = ThisWorkbook: This sets wb to refer to the workbook where the VBA code is running.
wb.Save: This saves the workbook.
wb.Close: This closes the workbook.

Calling VBA from R

Now, let’s see how you can call this VBA script from R using the RDCOMClient package. This package allows R to interact with COM objects, such as Excel.

Step-by-Step R Code

Install RDCOMClient: If you haven’t installed it yet, you can do so from the R console.
```
install.packages("RDCOMClient", repos = "http://www.omegahat.net/R")
```
Write the R Code: Here’s the R script to run the VBA code.

library(RDCOMClient)

# Create a new Excel application
excel_app <- COMCreate("Excel.Application")

# Make the Excel application visible
excel_app[["Visible"]] <- TRUE

# Open an existing macro-enabled workbook that contains the macro
# (VBA code cannot be stored in a plain .xlsx file)
workbook_path <- "C:/path/to/your/workbook.xlsm"
wb <- excel_app$Workbooks()$Open(workbook_path)

# Run the VBA macro
excel_app$Run("SaveAndCloseWorkbook")

# Quit the Excel application
excel_app$Quit()

# Release the COM object
rm(excel_app)
gc()

Explanation:

library(RDCOMClient): Loads the RDCOMClient library to interact with COM objects.
excel_app <- COMCreate("Excel.Application"): Creates a new Excel application instance.
excel_app[["Visible"]] <- TRUE: Makes the Excel application visible (optional).
workbook_path: Path to your Excel workbook.
wb <- excel_app$Workbooks()$Open(workbook_path): Opens the workbook.
excel_app$Run("SaveAndCloseWorkbook"): Runs the VBA macro SaveAndCloseWorkbook.
excel_app$Quit(): Quits the Excel application.
rm(excel_app) and gc(): Releases the COM object and performs garbage collection to free up memory.
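If opening the workbook or running the macro fails, the script above stops before it reaches excel_app$Quit(), which can leave an Excel process running in the background. One way to guard against that is to wrap the same steps in a function and register the cleanup with on.exit(). This is only a sketch: the function name `run_excel_macro()` and its `visible` argument are made up for illustration, and the COM calls are the same ones used in the script above.

```r
# Attach RDCOMClient when it is installed (Windows only).
if (requireNamespace("RDCOMClient", quietly = TRUE)) library(RDCOMClient)

# Hypothetical wrapper around the steps shown above.
run_excel_macro <- function(workbook_path, macro_name, visible = FALSE) {
  # Fail early, before Excel is started, if the file is missing
  stopifnot(file.exists(workbook_path))

  excel_app <- COMCreate("Excel.Application")

  # Runs when the function exits, whether normally or because of an error
  on.exit({
    excel_app$Quit()
    gc()
  }, add = TRUE)

  excel_app[["Visible"]] <- visible
  excel_app$Workbooks()$Open(workbook_path)
  excel_app$Run(macro_name)
  invisible(TRUE)
}

# Usage (requires Windows, Excel, and a workbook containing the macro):
# run_excel_macro("C:/path/to/your/workbook.xlsm", "SaveAndCloseWorkbook")
```

The file check comes first on purpose, so a mistyped path raises an error without ever launching Excel.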

Try It Yourself

This example gives you a starting point for working with VBA and R together. Try modifying the VBA code to suit your needs, such as adding more functionality or handling different scenarios. Experimenting with this setup will give you a better understanding of how to automate Excel tasks from R.
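One variation to try: since the macro only calls Save and Close on the workbook, the same two methods can be called on the workbook object directly from R, with no VBA involved. The sketch below assumes a hypothetical helper name `save_and_close()`; `wb` is the workbook object returned by excel_app$Workbooks()$Open(workbook_path) in the script above.

```r
# Hypothetical helper: save and close a workbook object without a macro.
# `wb` is expected to expose Save() and Close(), as an Excel Workbook
# COM object does.
save_and_close <- function(wb) {
  wb$Save()   # write pending changes to disk
  wb$Close()  # close the workbook; Excel itself keeps running
  invisible(NULL)
}

# Usage (requires Windows, Excel, and RDCOMClient):
# wb <- excel_app$Workbooks()$Open(workbook_path)
# save_and_close(wb)
# excel_app$Quit()
```

This also works with plain .xlsx files, because the workbook no longer needs to store any macro.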

Happy coding!

To leave a comment for the author, please follow the link and comment on their blog: Steve's Data Tips and Tricks.

Continue reading: VBA: Saving and Closing a Workbook
