In programming, loops are used to repeat the execution of a block of code. Loops help you save time, avoid repetitive blocks of code, and write cleaner code.
In R, there are three types of loops:
while loops
for loops
repeat loops
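The rest of this article focuses on for loops. As a quick taste of the other two kinds, here is a minimal countdown written first as a while loop and then as a repeat loop (this sketch is our addition, not part of the original tutorial):

```r
# while loop: repeats as long as the condition is TRUE
i <- 3
while (i > 0) {
  print(i)
  i <- i - 1
}

# the same countdown with repeat, which must be exited explicitly with break
i <- 3
repeat {
  print(i)
  i <- i - 1
  if (i == 0) {
    break
  }
}
```

Both loops print 3, 2, 1.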
A for loop is used to iterate over a list, vector or any other object of elements. The syntax of a for loop is:
for (value in sequence) {
  # block of code
}
Here, sequence is a collection of elements and value takes each of those elements in turn. In each iteration, the block of code is executed. For example,
numbers = c(1, 2, 3, 4, 5)

# for loop to print all elements in numbers
for (x in numbers) {
  print(x)
}
Output
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
In this program, we have used a for loop to iterate through a sequence of numbers called numbers. In each iteration, the variable x stores an element from the sequence and the block of code is executed.
Let’s use a for loop to count the number of even numbers stored inside a vector of numbers.
# vector of numbers
num = c(2, 3, 12, 14, 5, 19, 23, 64)

# variable to store the count of even numbers
count = 0

# for loop to count even numbers
for (i in num) {
  # check if i is even
  if (i %% 2 == 0) {
    count = count + 1
  }
}

print(count)
Output
[1] 4
In this program, we have used a for loop to count the number of even numbers in the num vector. Here is how this program works:
First, we use a for loop to iterate through the num vector using the variable i.
for (i in num) {
  # code block
}
Inside the loop, we check if each element is divisible by 2. If it is, we increment count by 1.
if (i %% 2 == 0) {
  count = count + 1
}
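As an aside, the same count can be computed without an explicit loop, because comparison operators in R are vectorized; sum() then counts the TRUE values. This one-liner is a common idiom, shown here as a supplement to the loop above:

```r
# vector of numbers
num = c(2, 3, 12, 14, 5, 19, 23, 64)

# num %% 2 == 0 yields a logical vector; sum() counts the TRUEs
count = sum(num %% 2 == 0)
print(count)   # [1] 4
```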
You can use the break statement to exit from the for loop in any iteration. For example,
# vector of numbers
numbers = c(2, 3, 12, 14, 5, 19, 23, 64)

# for loop with break
for (i in numbers) {
  # break the loop if number is 5
  if (i == 5) {
    break
  }
  print(i)
}
Output
[1] 2
[1] 3
[1] 12
[1] 14
Here, we have used an if statement inside the for loop. If the current element is equal to 5, we exit the loop using the break statement. After this, no further iterations are executed.
Instead of terminating the loop, you can skip an iteration using the next statement. For example,
# vector of numbers
numbers = c(2, 3, 12, 14, 5, 19, 23, 64)

# for loop with next
for (i in numbers) {
  # use next to skip odd numbers
  if (i %% 2 != 0) {
    next
  }
  print(i)
}
Output
[1] 2
[1] 12
[1] 14
[1] 64
Here, we have used an if statement inside the for loop to check for odd numbers. If the number is odd, we skip the iteration using the next statement and print only the even numbers.
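For completeness, a for loop can also iterate over positions instead of values using seq_along(), which is handy when you need the index as well as the element. This example is our own sketch, not part of the original tutorial:

```r
numbers = c(2, 3, 12, 14)

# seq_along(numbers) generates the indices 1, 2, 3, 4
for (i in seq_along(numbers)) {
  cat("element", i, "is", numbers[i], "\n")
}
```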
You can include a for loop inside another for loop to create a nested loop.
Consider the example below. Suppose we have two sequences of numbers, and we want to print all the combinations where the sum of the numbers from the two sequences is even.
# vectors of numbers
sequence_1 = c(1, 2, 3)
sequence_2 = c(1, 2, 3)

# nested for loop
for (i in sequence_1) {
  for (j in sequence_2) {
    # check if sum is even
    if ((i + j) %% 2 == 0) {
      print(paste(i, j))
    }
  }
}
Output
[1] "1 1"
[1] "1 3"
[1] "2 2"
[1] "3 1"
[1] "3 3"
In the above program, we have created two sequences: sequence_1 and sequence_2, both containing numbers from 1 to 3.
We then used a nested for loop to iterate through the sequences. The outer loop iterates through sequence_1 and the inner loop iterates through sequence_2.
for (i in sequence_1) {
  for (j in sequence_2) {
    # code block
  }
}
In each iteration, the if statement inside the nested loops checks whether i + j is even. If it is, we print i and j.
if ((i + j) %% 2 == 0) {
  print(paste(i, j))
}
calmcode.io is an e-learning platform that I really really really recommend to programmers and data scientists:
It is free.
It involves open source tools.
It uses bite-sized tutorial videos.
It explains tools clearly.
It explains everything calmly.
There’s tons of content about computer programming, data science, and personal productivity.
On top of this all, it’s by Vincent Warmerdam, and everything he touches seems to turn to gold.
Check it out here: https://calmcode.io/
You can subscribe to their newsletter here, to receive new content fresh from the presses, straight in your inbox!
There are just so many tutorials on so many different topics! Here are some quick glances at some topics and tools:
We’re thrilled to announce that Quantargo Workspace is now out of Beta and generally available! Quantargo Workspace lets you easily create and manage data science projects using R and Python, with advanced features like publishing, scheduling and credential management. Get started here for free.
In tandem with the launch we also added awesome new features which enable a host of new use-cases:
Publishing
Publishing makes it dead simple to quickly share outputs of your workspace like reports, plots or data sets. Simply hit the “Publish” button and let the magic happen: the file is executed and all outputs are automatically published to a unique URL that you can share! This URL is always up-to-date, so if you re-publish your file the publication will reflect this automatically. This works with any R or Python code as well as RMarkdown documents!
Published outputs can then be viewed and shared via a standalone link:
Scheduling
You can now create schedules from a new panel in the workspace editor. Schedules allow you to automate tedious tasks like report generation and data aggregation by running your code in regular intervals. Different intervals are supported like daily, weekly and monthly:
You can create multiple schedules, each with different intervals and times. This makes it perfect for report generation, and together with Auto-Publish you get an always up-to-date link for your reports. Scheduling has been in the works for quite some time and it is finally ready, so please try it out and let us know what you think!
Credential Management
With this latest addition, you can now store confidential credentials like API keys and service credentials as secrets. Secrets allow you to use sensitive values in your code without exposing them. They are encrypted at rest and never shared.
Together with scheduling this allows you to securely connect to third party APIs. Check out the new Twitter Bot template for how to connect to the Twitter API through Quantargo Workspace.
Limited time coupon code for our Developer and PRO plans: Use the code FREEWORKSPACE
at checkout to get the first month completely free! Our paid plans allow you to create private workspaces as well as give you a lot more API calls.
That’s it for now. Stay safe and healthy!
In R, the ifelse() function is a shorthand, vectorized alternative to the standard if...else statement.
Most of the functions in R take a vector as input and return a vectorized output. Similarly, the vector equivalent of the traditional if...else block is the ifelse() function.
The syntax of the ifelse() function is:
ifelse(test_expression, x, y)
The output vector has the element x if the output of the test_expression is TRUE. If the output is FALSE, then the element in the output vector will be y.
# input vector
x <- c(12, 9, 23, 14, 20, 1, 5)

# ifelse() function to determine odd/even numbers
ifelse(x %% 2 == 0, "EVEN", "ODD")
Output
[1] "EVEN" "ODD" "ODD" "EVEN" "EVEN" "ODD" "ODD"
In this program, we have defined a vector x using the c() function in R. The vector contains a few odd and even numbers.
We then used the ifelse() function, which takes the vector x as an input. A logical operation is then performed on x to determine if the elements are odd or even.
For each element in the vector, if the test_expression evaluates to TRUE, then the corresponding output element is "EVEN"; otherwise it is "ODD".
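One detail worth knowing, though not covered above: ifelse() propagates missing values, so an NA in the test expression yields NA in the output rather than "EVEN" or "ODD". A quick sketch:

```r
# input vector containing a missing value
x <- c(12, NA, 9)

# the NA element produces NA in the result
result <- ifelse(x %% 2 == 0, "EVEN", "ODD")
print(result)
```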
# input vector of marks
marks <- c(63, 58, 12, 99, 49, 39, 41, 2)

# ifelse() function to determine pass/fail
ifelse(marks < 40, "FAIL", "PASS")
Output
[1] "PASS" "PASS" "FAIL" "PASS" "PASS" "FAIL" "PASS" "FAIL"
This program determines if the students have passed or failed based on a condition. Here, if the marks in the vector are less than 40, then the student is considered to have failed.
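Because ifelse() returns a vector, calls can also be nested to produce more than two outcomes. As an illustration, here is a sketch that adds a hypothetical distinction band (the 80-mark threshold is our own choice, not from the original example):

```r
# input vector of marks
marks <- c(63, 58, 12, 99, 49, 39, 41, 2)

# nested ifelse(): first test for failure, then for distinction
result <- ifelse(marks < 40, "FAIL",
                 ifelse(marks >= 80, "DISTINCTION", "PASS"))
print(result)
```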
The if statement is a conditional statement that allows you to provide conditions to execute a piece of code.
The syntax of if statement in R is:
if(test_expression) {
# body of if statement
}
If the test_expression inside the if statement is TRUE, then the code inside the if block will be executed.
You can use the else statement along with an if statement to specify code to be executed if the test_expression in the if statement returns FALSE.
if(test_expression) {
  # block of code if condition is true
} else {
  # block of code if condition is false
}
x <- 12

# check if x is positive or negative number
if (x > 0) {
  print("x is a positive number")
} else {
  print("x is a negative number")
}
Output
[1] "x is a positive number"
Here, since x > 0 evaluates to TRUE, the code inside the if block gets executed.
If you want to test more than one condition, you can use the optional else if statement along with your if...else statements. The syntax is:
if(test_expression_1) {
  # code block 1
} else if (test_expression_2) {
  # code block 2
} else {
  # code block 3
}
Here,
If test_expression_1 returns TRUE, then code block 1 is executed.
If test_expression_1 returns FALSE, then test_expression_2 is evaluated.
If test_expression_2 returns TRUE, then code block 2 is executed.
If test_expression_2 returns FALSE, then code block 3 is executed.
For example,
x <- 0

# check if x is positive or negative or zero
if (x > 0) {
  print("x is a positive number")
} else if (x < 0) {
  print("x is a negative number")
} else {
  print("x is zero")
}
Output
[1] "x is zero"
In this program, we have used an if...else if...else block to check whether x is a positive number, a negative number, or zero. Here,
if (x > 0) {...} is executed if x is positive
else if (x < 0) {...} is executed if x is negative
else {...} is executed if x is 0
Since x = 0, the else block is executed.
You can have nested if...else statements inside if...else blocks in R. This allows you to specify conditions inside conditions. For example,
x <- 20

# check if x is positive
if (x > 0) {
  # check if x is even or odd
  if (x %% 2 == 0) {
    print("x is a positive even number")
  } else {
    print("x is a positive odd number")
  }
# execute if x is not positive
} else {
  # check if x is even or odd
  if (x %% 2 == 0) {
    print("x is a negative even number")
  } else {
    print("x is a negative odd number")
  }
}
Output
[1] "x is a positive even number"
In this program,
The outer if...else block checks whether x is positive or negative. If x is greater than 0, the code inside the outer if block is executed. Otherwise, the code inside the outer else block is executed.
if (x > 0) {
  ... .. ...
} else {
  ... .. ...
}
The inner if...else block checks whether x is even or odd. If x is perfectly divisible by 2, the code inside the inner if block is executed. Otherwise, the code inside the inner else block is executed.
if (x %% 2 == 0) {
  ... .. ...
} else {
  ... .. ...
}
Numbers in R can be divided into 3 different categories:
Numeric: It represents both whole and floating-point numbers. For example, 123, 32.43, etc.
Integer: It represents only whole numbers and is denoted by the suffix L. For example, 23L, 39L, etc.
Complex: It represents complex numbers with an imaginary part, denoted by the suffix i. For example, 2 + 3i, 5i, etc.
The numeric data type is the most frequently used data type in R. It is the default data type whenever you declare a variable with numbers.
You can store any type of number (with or without a decimal) in a variable with the numeric data type. For example,
# decimal variable
my_decimal <- 123.45
print(class(my_decimal))

# variable without decimal
my_number <- 34
print(class(my_number))
Output
[1] "numeric"
[1] "numeric"
Here, both the my_decimal and my_number variables are of numeric type.
Integers are a type of numeric data that take values without decimals. They are mostly used when you are sure that the variable cannot have decimal values in the future.
In order to create an integer variable, you must use the suffix L at the end of the value. For example,
my_integer <- 123L

# print the value of my_integer
print(my_integer)

# print the data type of my_integer
print(class(my_integer))
Output
[1] 123
[1] "integer"
Here, the variable my_integer contains the value 123L. The suffix L at the end of the value indicates that my_integer is of integer type.
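A practical caveat (our addition, not part of the tutorial): R integers are 32-bit, so they range up to .Machine$integer.max (2147483647); integer arithmetic that overflows this range returns NA with a warning:

```r
# the largest representable integer in R
print(.Machine$integer.max)

# adding 1L overflows the integer range and yields NA (with a warning)
big <- .Machine$integer.max
overflowed <- suppressWarnings(big + 1L)
print(is.na(overflowed))
```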
In R, variables with the complex data type contain values with an imaginary part, indicated by using the suffix i. For example,
# variable with only imaginary part
z1 <- 5i
print(z1)
print(class(z1))

# variable with both real and imaginary parts
z2 <- 3 + 3i
print(z2)
print(class(z2))
Output
[1] 0+5i
[1] "complex"
[1] 3+3i
[1] "complex"
Here, the variables z1 and z2 have been declared as complex data types with an imaginary part denoted by the suffix i.
In R, we use the as.numeric() function to convert any number to a numeric value. For example,
# integer variable
a <- 4L
print(class(a))

# complex variable
b <- 1 + 2i
print(class(b))

# convert from integer to numeric
x <- as.numeric(a)
print(class(x))

# convert from complex to numeric
y <- as.numeric(b)
print(class(y))
Output
[1] "integer"
[1] "complex"
[1] "numeric"
[1] "numeric"
Warning message:
imaginary parts discarded in coercion
Here, you can see that while converting the complex number to a numeric value, the imaginary part is discarded.
You can use the as.complex() function to convert any number to a complex value. For example,
# integer variable
a <- 4L
print(class(a))

# numeric variable
b <- 23
print(class(b))

# convert from integer to complex
y <- as.complex(a)
print(class(y))

# convert from numeric to complex
z <- as.complex(b)
print(class(z))
Output
[1] "integer"
[1] "numeric"
[1] "complex"
[1] "complex"
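For completeness, there is also an as.integer() function, which the tutorial above does not cover; note that it truncates the decimal part toward zero rather than rounding:

```r
# numeric variable
b <- 23.9

# as.integer() truncates the decimal part (it does not round)
z <- as.integer(b)
print(z)
print(class(z))
```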
I am thrilled to share that my article “Bringing the World to the Classroom: Teaching Statistics and Programming in a Project-Based Setting” is published in PS: Political Science and Politics as an open-access article. I set up and tested the concept while teaching programming and statistics classes. It gives instructors a blueprint for teaching stats and programming in a project-based setting. It also shows how this can be applied in a virtual format or as a block seminar and how to integrate open science and peer group support practices in this teaching setting. While I use my introductory R course as a working example, the format can be applied to any course setting that requires students to learn (and most importantly practice) “hands-on” material!
I also highlighted some students’ projects in the article — and if the word limit had permitted, I would have mentioned even more to show how creative and innovative the students’ projects were!
I am incredibly thankful to Sabine Carey for allowing me to teach (and test) my methods classes in different settings (online and offline, seminar long and in a block seminar) and giving me the freedom to create the methods course I would have enjoyed when being a student. I would also like to thank Dennis Hammerschmidt, Anna-Lena Hönig, and Melanie Klinger for discussing earlier ideas of this course and for providing the best “teachers’ training” with the HDZ I could have wished for. And, of course, as the teaching part is only one side of the whole adventure, it wouldn’t have been so much fun if there weren’t curious and ambitious students who were willing to embark on this journey and to learn something new!
The article comes with supplementary material: a syllabus template in LaTeX, an R Markdown template for data analysis that I used throughout the course, and a term paper template in R Markdown. If you’re interested, the supplementary material is online on Dataverse.^1
And, to further spark the fire of R, all this wouldn’t have been possible without such a great community that also provides accessible, high-quality online resources for free – my special thanks go to RStudio, R-Ladies, and CorrelAid!
^1 If you click on the tree-based display, the order of the files and folders becomes more accessible.
As two postgrad students on summer vacation but with no travel plans (during this global pandemic of course), we took up an internship at King Abdullah University of Science and Technology (KAUST) last summer and ended up collaborating on a really cool project with Paula Moraga! Did we mention that we worked in Saudi Arabia, while living in Australia? All is well when we have the internet.
We also ended up presenting our project at useR! 2021, and winning an award for the most outstanding lightning talk! So, keep reading to see how we got there!
Although there is so much data being collected in multiple disciplines and made openly available, it may be difficult to find, retrieve and utilize these resources. Spatial and spatio-temporal data are two data types which are collected and used in research across a diverse array of domains.
With the R programming language becoming increasingly popular among academics, researchers, and scientists, and because spatial data is easier to interpret when visualized, Paula thought it would be a great idea to create a repository of data sources and simple tutorials on how to retrieve and visualize spatial data using R, during our time as interns at KAUST.
This idea became rspatialdata – a collection of data sources and tutorials on downloading and visualising spatial data using R.
rspatialdata consists of a non-exhaustive list of R-packages which have been developed as clients for different spatio-temporal databases and APIs. It also consists of tutorials on how to use these R packages, understand the different types of spatial data objects available and create visualizations using them. After doing a lot of research, we managed to pick out a few R-packages which we were confident gave up-to-date data from reliable sources.
Coincidentally, much of what we picked out happened to be rOpenSci packages! We were amazed at how rich and easy to use most of these packages were. So here is our experience using those packages and some things we really enjoyed doing!
The rnaturalearth package (by Andy South) facilitates communication with Natural Earth map data. It allows you to easily download and visualize boundaries of countries, and boundaries of states within countries as well.
We used rnaturalearth to download the boundaries of countries in many of our tutorials – including our tutorials on elevation, rainfall and humidity.
osmdata (by Mark Padgham, Robin Lovelace, Maëlle Salmon and Bob Rudis) is an R package for downloading spatial data from OpenStreetMap (OSM) – a very cool open source project. We were amazed at the huge variety of spatial features available to us! We were able to download spatial data about almost anything from amenities such as colleges, cinemas, hospitals and banks, to different types of highways and streets such as walking and bicycle paths, residential streets, motorways and service lanes.
We used the osmdata package to download and visualize hospitals in Lagos, Nigeria. Then we also downloaded spatial data for different types of highways, streets and waterways in Lagos, and created the following map entirely using data retrieved through osmdata!
Have a read through our Open Street Map data tutorial for a complete tutorial on how to install the osmdata package, find what spatial features are available, learn how to download these features and also create some cool maps (including an interactive one)!
MODIS (Moderate Resolution Imaging Spectroradiometer) is an instrument aboard the NASA Terra and Aqua satellites, which orbit the entire Earth every 1-2 days, acquiring data at different spatial resolutions. The data acquired by MODIS describes features of the land, oceans and the atmosphere.
MODIStsp – an acronym for ‘MODIS Time Series Processing’ (by Lorenzo Busetto and Luigi Ranghetti), is an R package for downloading and pre-processing time series raster data from MODIS data products.
We used the MODIS Vegetation Index Products (NDVI and EVI), to visualize the Normalized Difference Vegetation Index in Mongolia.
We also used the MODIS Land Cover Type Products to visualize the land cover classification in Zimbabwe.
Using the MODIStsp package made it so much easier to find what data products were available and also directly download and save this data using R. Our Vegetation and Land Cover tutorials explain how we used the MODIStsp R package to download data and create cool maps.
nasapower (by Adam Sparks) is a client for the ‘NASA POWER’ global meteorology, surface solar energy and climatology data API, and aims to make it quick and easy to download climatology data for analysis, visualization, modeling and many other purposes.
We used the nasapower package to retrieve rainfall and humidity data, and it was such a simple task. We only had to submit the duration of data we needed along with the geographical location, and the package took care of all the hard work.
We downloaded relative humidity and rainfall data and then visualized the relative humidity in Western Australia and the rainfall in Gansu, China.
Take a look at our rainfall and humidity tutorials for a complete guide on how to use the nasapower package to download data and how to create cool visualizations using the data.
The rdhs package (by OJ Watson and Jeff Eaton) is an API client for Demographic and Health Surveys (DHS) data – which is a collection of population, health, HIV and nutrition data from more than 400 surveys in over 90 countries. This data is considered to be sensitive and hence requires one to set up an account with DHS and request permission to access the data.
Although this may look like a tedious process, the rdhs package makes it very smooth by allowing us to test out the functions of the package using model data. So one could even start on the analysis while waiting for data access rights!
Read through our tutorial on Demographic and Health Surveys (DHS) for more details on how to use the rdhs package.
Have a look at our administrative boundaries tutorial which runs you through creating outline maps of countries and their administrative divisions, our population tutorial which runs you through visualizing population estimates of countries using choropleth maps and cartograms, and our tutorials on elevation, temperature, malaria, and air pollution for ideas on how to download and visualize each respective property.
You just read through some of the things we enjoyed while creating rspatialdata. We are excited for you to try it out too! So give it a go, get creative with it, and let us know how you did!
Our goal for rspatialdata is not to be a comprehensive list of R packages or tutorials, but to be a starting point for anyone to find and download spatial data related to different domains, and visualize different types of spatial data objects using different approaches.
So if you can think of any more R packages that would fit into this collection, ideas for new tutorials, or cool spatial visualizations, do reach out to us! Let’s talk about how we can collaborate!
Whether you want to do, share, teach, or learn data science, RStudio Cloud is a cloud-based solution that allows you to do so online. The RStudio Cloud team has rolled out new features and improvements since our last post in May 2021. So what’s new?
Let’s take a closer look at these updates.
Expand your data science workbench with Jupyter projects
Jupyter Notebook projects are now available to Premium, Instructor, or Organization account holders. Once you are in RStudio Cloud, you can create and work with Jupyter projects as easily as RStudio IDE projects. Click on the New Project button, then select New Jupyter Project from the menu that appears. If you haven’t yet joined the beta program, you will be prompted to fill out a brief form — submit that and we’ll get you into the program ASAP.
Doing so allows you to work in a Jupyter notebook:
This functionality is currently in beta — we’d love to hear your feedback.
Pay Less! Get More!
Need more time for your analysis? Or have other projects that you’d like to run? In RStudio Cloud, we’ve provided you with more cost-effective options for your data science work. We have bumped up the number of projects available on the Free and Plus plans to fifty. We have also provided more flexibility if you need more time by increasing the monthly project hours included in each plan and halving the cost per additional hour.
| Plan | Number of projects included | Monthly project hours included | Cost per additional project hour |
| Free plan | 50 (was 15) | 25 (was 15) | – |
| Plus plan | 50 (was 15) | 75 (was 50) | 10¢ (was 20¢) |
| Premium plan | Unlimited | 200 (was 160) | 10¢ (was 20¢) |
| Instructor plan | Unlimited | 300 (was 160) | 10¢ (was 20¢) |
In addition, each project can now store up to 20GB of files, data, and packages on disk, up from the prior 3GB for files and data, and 3GB for packages.
Work with confidence with upgraded Ubuntu
In RStudio Cloud, you can use the latest versions of R and Python packages with confidence that the underlying operating system has the features they require. Starting on September 13, 2021, all projects still running Ubuntu 16.04 (Xenial) will be automatically upgraded to Ubuntu 20.04 (Focal) the next time they are opened. For more information, please visit this community article.
Learn more about RStudio Cloud
We are excited to provide you with more capabilities so that you can jump right into your data science work. For more information and resources, please visit:
Way back in 2018, long before the pandemic, I described a soon-to-be implemented simstudy function genMultiFac that facilitates the generation of multi-factorial study data. I followed up that post with a description of how we can use these types of efficient designs to answer multiple questions in the context of a single study.
Fast forward three years, and I am thinking about these designs again for a new grant application that proposes to study simultaneously three interventions aimed at reducing emergency department (ED) use for people living with dementia. The primary interest is to evaluate each intervention on its own terms, but also to assess whether any combinations seem to be particularly effective. While this will be a fairly large cluster randomized trial with about 80 EDs being randomized to one of the 8 possible combinations, I was concerned about our ability to estimate the interaction effects of multiple interventions with sufficient precision to draw useful conclusions, particularly if the combined effects of two or three interventions are less than additive. (That is, two interventions may be better than one, but not twice as good.)
I am thinking that a null hypothesis testing framework might not be so useful here, given that the various estimates could be highly uncertain, not to mention the multiple statistical tests that we would need to conduct (and presumably adjust for). Rather, a Bayesian approach that pools estimates across interventions and provides posterior probability distributions of how the interventions interact could be a better way to go.
With this in mind, I went to the literature, and I found these papers by Kassler et al and Gelman. They both describe a way of thinking about interaction that emphasizes the estimates of variance across effect estimands. I went ahead and tested the idea with simulated data, which I’m showing here. Ultimately, I decided that this approach will not work so well for our study, and I came up with a pretty simple solution that I will share next time.
The scenarios described by both papers involve studies that may be evaluating many possible interventions or exposures, each of which may have two or more levels. If we are dealing with a normally distributed (continuous) outcome measure, we can model that outcome as
\[ y_{i} \sim N\left(\mu = \tau_0 + \tau^1_{j_{1_i}} + \dots + \tau^k_{j_{k_i}} + \tau^{12}_{j_{12_i}} + \dots + \tau^{k-1, k}_{j_{k-1,k_i}} + \tau^{123}_{123_i} + \dots + \tau^{k-2, k-1, k}_{k-2, k-1, k_i} + \dots, \ \sigma = \sigma_0\right), \]
where there are \(K\) interventions, and intervention \(k\) has \(j_k\) levels. So, if intervention \(3\) has 4 levels, \(j_3 \in \{1,2,3,4\}.\) \(\tau_0\) is effectively the grand mean. \(\tau^k_1, \tau^k_2, \dots, \tau^k_{j_k},\) are the mean contributions for the \(k\)th intervention, and we constrain \(\sum_{m=1}^{j_k} \tau^k_m = 0.\) Again, for intervention \(3\), we would have \(\tau^3_1 \dots, \tau^3_4,\) with \(\sum_{m=1}^{4} \tau^3_m = 0.\)
The adjustments made for the two-way interactions are represented by the \(\tau^{12}\)’s through the \(\tau^{k-1,k}\)’s. If intervention 5 has \(2\) levels then for the interaction between interventions 3 and 5 we have \(\tau^{35}_{11}, \tau^{35}_{12}, \tau^{35}_{21}, \dots, \tau^{35}_{42}\) and \(\sum_{m=1}^4 \sum_{n=1}^2 \tau^{35}_{m,n} = 0.\)
This pattern continues for higher orders of interaction (i.e. 3-way, 4-way, etc.).
In the Bayesian model, each set of \(\tau_k\)’s shares a common prior distribution with mean 0 and standard deviation \(\sigma_k\):
\[ \tau^k_1, \dots, \tau^k_{j_k} \sim N(\mu = 0, \sigma = \sigma_k), \] where \(\sigma_k\) is a hyperparameter that will be estimated from the data. The same is true for the interaction terms for interventions \(k\) and \(l\):
\[ \tau^{kl}_{11}, \dots, \tau^{kl}_{j_k, j_l} \sim N(\mu = 0, \sigma = \sigma_{kl}), \ \ \text{where } k < l \]
To assess whether there is interaction between the interventions (i.e., the effects are not merely additive), we are actually interested in the variance parameters of the interaction \(\tau\text{'s}\). If, for example, there is no interaction between different levels of interventions 3 and 5, then \(\sigma_{35}\) should be close to \(0\), implying that \(\tau^{35}_{11} \approx \tau^{35}_{12} \approx \dots \approx \tau^{35}_{42} \approx 0\). On the other hand, if there is some interaction effect, then \(\sigma_{35} > 0,\) implying that at least one \(\tau^{35} \neq 0.\)
One advantage of the proposed Bayesian model is that we can use partial pooling to get more precise estimates of the variance terms. By this, I mean that we can use information from each \(\sigma^{kl}\) to inform the others. So, in the case of 2-way interaction, the prior probability assumption would suggest that the variance terms were drawn from a common distribution:
\[ \sigma^{12}, \sigma^{13}, \dots, \sigma^{k-1,k} \sim N(\mu = 0, \sigma = \sigma_{\text{2-way}}) \]
We can impose more structure (and hopefully precision) by doing the same for the main effects:
\[ \sigma^{1}, \sigma^{2}, \dots, \sigma^{k} \sim N(\mu = 0, \sigma = \sigma_{\text{main}}) \]
Of course, for each higher order interaction (above 2-way), we could impose the same structure:
\[ \sigma^{123}, \dots, \sigma^{12k}, \dots, \sigma^{k-2, k-1, k} \sim N(\mu = 0, \sigma = \sigma_{\text{3-way}}) \]
And so on. Though at some point, we might want to assume that there is no higher order interaction and exclude it from the model; in most cases, we could stop at 2- or 3-way interaction and probably not sacrifice too much.
When I set out to explore this model, I started relatively simple, using only two interventions with four levels each. In this case, the factorial study would have 16 total arms \((4 \times 4)\). (Since I am using only 2 interventions, I am changing the notation slightly, using interventions \(a\) and \(b\) rather than \(1\) and \(2\).) Individual \(i\) is randomized to one level of \(a\) and one level of \(b\), with \(a_i \in \{1,2,3,4\}\), \(b_i \in \{1,2,3,4\}\), and \(ab_i \in \{11, 12, 13, 14, 21, 22, \dots, 44\}.\) Using the same general model from above, here is the specific model for continuous \(y\):
\[ y_{i} \sim N\left(\mu = \tau_0 + \tau^a_{a_i} + \tau^b_{b_i} + \tau^{ab}_{ab_i}, \ \sigma = \sigma_0\right) \]
Take note that we only have a single set of 2-way interactions since there are only two groups of interventions. Because of this, there is no need for a \(\sigma_{\text{2-way}}\) hyperparameter; however, there is a hyperparameter \(\sigma_{\text{main}}\) to pool across the main effects of \(a\) and \(b\). Here are the prior distribution assumptions:
\[\begin{aligned} \tau_0 &\sim N(0, 5) \\ \tau^a_1, \tau^a_2, \tau^a_3, \tau^a_4 &\sim N(0, \sigma_a) \\ \tau^b_1, \tau^b_2, \tau^b_3, \tau^b_4 &\sim N(0, \sigma_b) \\ \tau^{ab}_{11}, \tau^{ab}_{12}, \dots, \tau^{ab}_{44} &\sim N(0, \sigma_{ab}) \\ \sigma_a, \sigma_b &\sim N(0, \sigma_\text{main}) \\ \sigma_{ab} &\sim N(0, 5) \\ \sigma_\text{main} &\sim N(0, 5) \\ \sigma &\sim N(0,5) \end{aligned}\]

In order to ensure identifiability, we have the following constraints:
\[\begin{aligned} \tau^a_1 + \tau^a_2 + \tau^a_3 + \tau^a_4 &= 0 \\ \tau^b_1 + \tau^b_2 + \tau^b_3 + \tau^b_4 &= 0 \\ \tau^{ab}_{11} + \tau^{ab}_{12} + \dots + \tau^{ab}_{43} + \tau^{ab}_{44} &= 0 \end{aligned}\]

library(simstudy)
library(data.table)
library(cmdstanr)
library(caret)
library(posterior)
library(bayesplot)
library(ggdist)
library(glue)
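These constraints can be satisfied by construction: choose all but one effect freely and set the last to the negative of their sum, which is exactly the `append_row` trick used in the Stan code in the addendum. A minimal sketch:

```r
# enforce a sum-to-zero constraint by fixing the final effect
tau_free <- c(-8, -1, 3)            # freely chosen effects
tau <- c(tau_free, -sum(tau_free))  # the last effect forces the sum to zero

tau
sum(tau)
```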
The parameters \(\tau_0, \tau_a, \tau_b, \text{ and } \tau_{ab}\) are set so that there is greater variation in treatment \(a\) compared to treatment \(b\). In both cases, the sum of the parameters is set to \(0\).
t_0 <- 0
t_a <- c(-8, -1, 3, 6)
t_b <- c(-3, -1, 0, 4)
The interaction is set in this case so that there is an added effect when both \(a=2 \ \& \ b=2\) and \(a=3 \ \& \ b=2\). Again, the parameters are set so that the sum-to-zero constraint is maintained.
x <- c(4, 3)
nox <- - sum(x) / (16 - length(x))

t_ab <- matrix(c(nox, nox, nox, nox,
                 nox,   4, nox, nox,
                 nox,   3, nox, nox,
                 nox, nox, nox, nox), nrow = 4, byrow = TRUE)

t_ab
##      [,1] [,2] [,3] [,4]
## [1,] -0.5 -0.5 -0.5 -0.5
## [2,] -0.5  4.0 -0.5 -0.5
## [3,] -0.5  3.0 -0.5 -0.5
## [4,] -0.5 -0.5 -0.5 -0.5

sum(t_ab)
## [1] 0
The data definitions for the arm assignments and the outcome \(y\) are established using the simstudy package:
d1 <- defDataAdd(varname = "y", formula = "mu", variance = 16, dist = "normal")
Now we are ready to generate the data:
set.seed(110)

dd <- genMultiFac(nFactors = 2, levels = 4, each = 30, colNames = c("a", "b"))
dd[, mu := t_0 + t_a[a] + t_b[b] + t_ab[a, b], keyby = id]
dd <- addColumns(d1, dd)
The plot shows the average outcomes by arm. The interaction when \(a=2 \ \& \ b=2\) and \(a=3 \ \& \ b=2\) is apparent in the two locations where the smooth pattern of increases is interrupted.
The function shown next simply generates the data needed by Stan. (The Stan implementation is shown below in the addendum.) Take note that we convert the \(\tau_{ab}\) design matrix of 0’s and 1’s to a single vector with values ranging from 1 to 16.
dt_to_list <- function(dx) {

  dx[, a_f := factor(a)]
  dx[, b_f := factor(b)]

  dv <- dummyVars(~ b_f:a_f, data = dx, n = c(4, 4))
  dp <- predict(dv, dx)

  N <- nrow(dx)    ## number of observations
  I <- 2
  X2 <- 1
  main <- as.matrix(dx[, .(a, b)])
  ab <- as.vector(dp %*% c(1:16))
  x <- as.matrix(ab, nrow = N, ncol = X2)
  y <- dx[, y]

  list(N = N, I = I, X2 = X2, main = main, x = x, y = y)
}
I am using cmdstanr to interact with Stan:
mod <- cmdstan_model("code/model_2_factors.stan", force_recompile = TRUE)

fit <- mod$sample(
  data = dt_to_list(dd),
  refresh = 0,
  chains = 4L,
  parallel_chains = 4L,
  iter_warmup = 500,
  iter_sampling = 2500,
  adapt_delta = 0.99,
  step_size = .05,
  max_treedepth = 20,
  seed = 1721
)

## Running MCMC with 4 parallel chains...
##
## Chain 2 finished in 15.5 seconds.
## Chain 1 finished in 17.7 seconds.
## Chain 3 finished in 20.3 seconds.
## Chain 4 finished in 23.3 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 19.2 seconds.
## Total execution time: 23.6 seconds.
Here is just one set of trace plots for \(\tau^a_1, \dots, \tau^a_4\) that indicate the sampling went quite well - the variables not shown were equally well-behaved.
posterior <- as_draws_array(fit$draws()) mcmc_trace(posterior, pars = glue("t[{1},{1:4}]"))
Since we are focused on the possibility of 2-way interaction, the primary parameter of interest is \(\sigma_{ab},\) the variation of the interaction effects. (In the Stan model specification this variance parameter is sigma_x, the x standing for interaction.) The plot shows the 95% credible intervals for each of the main effect variance parameters as well as the interaction variance parameter.
The fact that the two main effect variance parameters (\(\sigma_a\) and \(\sigma_b\)) are greater than zero supports the data generation process which assumed different outcomes for different levels of interventions \(a\) and \(b\), respectively.
And the credible interval for \(\sigma_{ab}\) (sigma_x), likewise is shifted away from zero, suggesting there might be some interaction between \(a\) and \(b\) at certain levels of each.
mcmc_intervals(posterior, pars = c(glue("sigma_m[{1:2}]"), "sigma_x[1]"))
We can home in a bit more on the specific estimates of the \(\tau_{ab}\)’s to see where those interactions might be occurring. It appears that t_x[1,6] (representing \(\tau_{22}\)) is an important interaction term - which is consistent with the data generation process. However, \(\tau_{32}\), represented by t_x[1,10], is not obviously important. Perhaps we need more data.
mcmc_intervals(posterior, pars = glue("t_x[1,{1:16}]"))
Below is a visual representation of how well the model fits the data, showing the interval of predicted cell means for each \(a/b\) pair. The observed means (shown as white dots) sit on top of the predictions (shown by the colored lines), suggesting the model is appropriate.
r <- as_draws_rvars(fit$draws(variables = c("t_0", "t", "t_x")))

dnew <- data.frame(
  genMultiFac(nFactors = 2, levels = 4, each = 1, colNames = c("b", "a")))

dnew$yhat <- with(r,
  rep(t_0, 16) + rep(t[1, ], each = 4) + rep(t[2, ], times = 4) + t(t_x))

ggplot(data = dnew, aes(x = b, dist = yhat)) +
  geom_vline(aes(xintercept = b), color = "white", size = .25) +
  stat_dist_lineribbon() +
  geom_point(data = dsum, aes(y = yhat), color = "white", size = 2) +
  facet_grid(. ~ a, labeller = labeller(a = label_both)) +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major = element_blank()) +
  scale_fill_brewer()
Perhaps the rationale for focusing on the variance can be best appreciated by looking at a contrasting scenario where there is only a single main effect (for intervention \(a\)) and no interaction. Here we would expect the estimates for the intervention \(b\) main effects variance as well as the variance of the interaction terms to be close to zero.
t_0 <- 0
t_a <- c(-8, -1, 3, 6)
t_b <- c(0, 0, 0, 0)
t_ab <- matrix(0, nrow = 4, ncol = 4)
The plot of the observed means is consistent with the data generation process:
## Running MCMC with 4 parallel chains...
##
## Chain 2 finished in 9.4 seconds.
## Chain 1 finished in 9.9 seconds.
## Chain 4 finished in 10.2 seconds.
## Chain 3 finished in 21.8 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 12.8 seconds.
## Total execution time: 21.9 seconds.
And yes, the posterior distribution for \(\sigma_{ab}\) (sigma_x) is now very close to zero …
and the effect parameters are all centered around zero:
Once again, the predicted values are quite close to the observed means - indicating the model is a good fit:
In the motivating application, there are actually three interventions, but each one has only two levels (yes or no). In this case, the level mean and across-level variance parameters were poorly estimated, probably because there are so few levels. This forced me to take a more traditional approach, where I estimate the means of each randomization arm. I’ll share that next time.
References:
Gelman, Andrew. “Analysis of variance—why it is more important than ever.” The Annals of Statistics 33, no. 1 (2005): 1-53.

Kassler, Daniel, Ira Nichols-Barrer, and Mariel Finucane. “Beyond ‘treatment versus control’: How Bayesian analysis makes factorial experiments feasible in education research.” Evaluation Review 44, no. 4 (2020): 238-261.
The model is implemented in Stan using a non-centered parameterization, so that the parameters \(\tau\) are a function of a set of \(z\) parameters, which are standard normal. This does not dramatically change the estimates, but it eliminates divergent transitions, improving sampling behavior.
data {
  int<lower=1> N;     // # of observations
  int<lower=1> I;     // # of interventions
  int<lower=1> X2;    // # of 2-way interactions
  int main[N, I];     // interventions
  int x[N, X2];       // interactions
  vector[N] y;        // outcome
}

parameters {
  real t_0;
  vector[3] z_raw[I];
  vector[15] z_x_raw[X2];
  real<lower=0> sigma;
  real<lower=0> sigma_m[I];
  real<lower=0> sigma_x[X2];
  real<lower=0> sigma_main;
}

transformed parameters {
  // constrain parameters to sum to 0
  vector[4] z[I];
  vector[16] z_x[X2];
  vector[4] t[I];
  vector[16] t_x[X2];
  vector[N] yhat;

  for (i in 1:I) {
    z[i] = append_row(z_raw[i], -sum(z_raw[i]));
  }

  for (i in 1:X2) {
    z_x[i] = append_row(z_x_raw[i], -sum(z_x_raw[i]));
  }

  for (i in 1:I)
    for (j in 1:4)
      t[i, j] = sigma_m[i] * z[i, j];

  for (i in 1:X2)
    for (j in 1:16)
      t_x[i, j] = sigma_x[i] * z_x[i, j];

  // yhat
  for (n in 1:N) {
    real ytemp;
    ytemp = t_0;
    for (i in 1:I) ytemp = ytemp + t[i, main[n, i]];    // 2 sets of main effects
    for (i in 1:X2) ytemp = ytemp + t_x[i, x[n, i]];    // 1 set of interaction effects
    yhat[n] = ytemp;
  }
}

model {
  sigma ~ normal(0, 5);
  sigma_m ~ normal(0, sigma_main);
  sigma_x ~ normal(0, 5);
  sigma_main ~ normal(0, 5);
  t_0 ~ normal(0, 5);

  for (i in 1:I) z_raw[i] ~ std_normal();
  for (i in 1:X2) z_x_raw[i] ~ std_normal();

  y ~ normal(yhat, sigma);
}
Apache Parquet is a popular column storage file format used by Hadoop systems, such as Pig, Spark, and Hive. The file format is language independent and has a binary representation. Parquet is used to efficiently store large data sets and has the extension .parquet. This blog post aims to understand how parquet works and the tricks it uses to efficiently store data.
Key features of parquet are:
The latter two points allow for efficient storage and querying of data.
Suppose we have a simple data frame:
tibble::tibble(id = 1:3, name = c("n1", "n2", "n3"), age = c(20, 35, 62)) #> # A tibble: 3 × 3 #> id name age #> <int> <chr> <dbl> #> 1 1 n1 20 #> 2 2 n2 35 #> 3 3 n3 62
If we stored this data set as a CSV file, what we see in the R terminal is mirrored in the file storage format. This is row storage. This is efficient for file queries such as,
SELECT * FROM table_name WHERE id = 2
We simply go to the 2nd row and retrieve that data. It’s also very easy to append rows to the data set – we just add a row to the bottom of the file. However, if we want to sum the data in the age column, then this is potentially inefficient. We would need to determine which value on each row is related to age, and extract that value.
Parquet uses column storage. In column layouts, column data are stored sequentially.
1 2 3
n1 n2 n3
20 35 62
With this layout, queries such as
SELECT * FROM dd WHERE id = 2
are now inconvenient. But if we want to sum up all ages, we simply go to the third row and add up the numbers.
In R, we read and write parquet files using the {arrow} package.
# install.packages("arrow") library("arrow") packageVersion("arrow") #> [1] '5.0.0'
To create a parquet file, we use write_parquet().
# Use the penguins data set data(penguins, package = "palmerpenguins") # Create a temporary file for the output parquet = tempfile(fileext = ".parquet") write_parquet(penguins, sink = parquet)
To read the file, we use read_parquet(). One of the benefits of using parquet is small file sizes. This is important when dealing with large data sets, especially once you start incorporating the cost of cloud storage. Reduced file size is achieved via two methods:
- data encoding schemes, i.e. the storage tricks described below;
- compression, controlled via the compression argument in write_parquet(). The default is snappy.
Since parquet uses column storage, values of the same type are stored together. This opens up a whole world of optimisation tricks that aren’t available when we save data as rows, e.g. CSV files.
Suppose a column just contains a single value repeated on every row. Instead of storing the same number over and over (as a CSV file would), we can just record “value X repeated N times”. This means that even when N gets very large, the storage costs remain small. If we had more than one value in a column, then we can use a simple look-up table. In parquet, this is known as run length encoding. If we have the following column
c(4, 4, 4, 4, 4, 1, 2, 2, 2, 2) #> [1] 4 4 4 4 4 1 2 2 2 2
This would be stored as the value 4 repeated 5 times, the value 1 repeated once, and the value 2 repeated 4 times.
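In fact, base R ships a run length encoder, rle(), so we can inspect the encoded representation directly (and recover the original with inverse.rle()):

```r
x <- c(4, 4, 4, 4, 4, 1, 2, 2, 2, 2)

# run length encoding: each distinct run's value and its length
enc <- rle(x)
enc$values
enc$lengths

# the encoding is lossless
identical(inverse.rle(enc), x)
```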
To see this in action, let’s create a simple example, where the character A is repeated multiple times in a data frame column:
x = data.frame(x = rep("A", 1e6))
We can then create a couple of temporary files for our experiment
parquet = tempfile(fileext = ".parquet") csv = tempfile(fileext = ".csv")
and write the data to the files
arrow::write_parquet(x, sink = parquet, compression = "uncompressed") readr::write_csv(x, file = csv)
Using the {fs} package, we extract the size
# Could also use file.info() fs::file_info(c(parquet, csv))[, "size"] #> # A tibble: 2 × 1 #> size #> <fs::bytes> #> 1 1015 #> 2 1.91M
We see that the parquet file is tiny (around 1 kB), whereas the CSV file is almost 2MB. This is roughly a 2000-fold reduction in file space.
Suppose we had the following character vector
c("Jumping Rivers", "Jumping Rivers", "Jumping Rivers") #> [1] "Jumping Rivers" "Jumping Rivers" "Jumping Rivers"
If we want to save storage, then we could replace Jumping Rivers with the number 0 and have a table to map between 0 and Jumping Rivers. This would significantly reduce storage, especially for long vectors.
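This look-up-table idea is the same one behind R's own factor type, which makes a convenient mental model (a sketch of the idea, not of parquet's internal layout):

```r
# a factor stores compact integer codes plus a table mapping codes to labels
x <- rep("Jumping Rivers", 5)
f <- factor(x)

levels(f)      # the look-up table
as.integer(f)  # the codes actually stored
```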
x = data.frame(x = rep("Jumping Rivers", 1e6)) arrow::write_parquet(x, sink = parquet) readr::write_csv(x, file = csv) fs::file_info(c(parquet, csv))[, "size"] #> # A tibble: 2 × 1 #> size #> <fs::bytes> #> 1 1.09K #> 2 14.31M
This encoding is typically used in conjunction with timestamps. Times are usually stored as Unix time, i.e. the number of seconds that have elapsed since January 1st, 1970. This storage format isn’t particularly helpful for humans, so it is normally pretty-printed to make it more palatable for us. For example,
(time = Sys.time()) #> [1] "2021-09-21 17:05:08 BST" unclass(time) #> [1] 1632240309
If we have a large number of time stamps in a column, one method for reducing file size is to simply subtract the minimum time stamp from all values. For example, instead of storing
c(1628426074, 1628426078, 1628426080) #> [1] 1628426074 1628426078 1628426080
we would store
c(0, 4, 6) #> [1] 0 4 6
with the corresponding offset 1628426074.
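This delta encoding is easy to sketch in R; the deltas are small numbers that compress well, and decoding simply adds the offset back:

```r
times <- c(1628426074, 1628426078, 1628426080)

# encode: store the minimum once, then the differences
offset <- min(times)
deltas <- times - offset

# decode: add the offset back to recover the original values
decoded <- deltas + offset
identical(decoded, times)
```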
There are a few other tricks that parquet uses. Their GitHub page gives a complete overview.
If you have a parquet file, you can use parquet-mr to investigate the encoding used within a file. However, installing the tool isn’t trivial and does take some time.
The obvious question that comes to mind when discussing parquet, is how does it compare to the feather format. Feather is optimised for speed, whereas parquet is optimised for storage. It’s also worth noting, that the Apache Arrow file format is feather.
The RDS file format is used by readRDS()/saveRDS(), while the related RData format is used by load()/save(). RDS is a file format native to R and can only be read by R. The main benefit of using RDS is that it can store any R object – environments, lists, and functions.
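For example, RDS happily round-trips a function, something a rectangular format such as parquet or CSV cannot do:

```r
# serialise an arbitrary R object (here, a function) to an RDS file
f <- function(x) x + 1

path <- tempfile(fileext = ".rds")
saveRDS(f, path)

# read it back and call it
g <- readRDS(path)
g(2)
```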
If we are solely interested in rectangular data structures, e.g. data frames, then reasons for using RDS files are
The advantages of using parquet are
- smaller file sizes (for a fair comparison with RDS, set compression = "gzip" in write_parquet()).

For updates and revisions to this article, see the original post
An interesting case on X validated of someone puzzled by the simulation (and variance) of the random variable 1/X when being able to simulate X. And being surprised at the variance of the ratio being way larger than the variances of both numerator and denominator.
by Hong Ooi
This article is a lightly-edited version of the “Microsoft365R and Shiny” vignette in the latest Microsoft365R release.
We describe how to incorporate Microsoft365R and interactive authentication with Azure Active Directory (AAD) into a Shiny web app. There are a few steps involved:
The default Microsoft365R app registration only works when the package is used on a local machine; it does not support running on a remote server. Because of this, when you use Microsoft365R inside a Shiny app, you (or your friendly local sysadmin) must register that app in AAD.
The main things to set in your app registration are:
The redirect URI of your app, i.e. your user-facing site address. For example, if your app is hosted in shinyapps.io, this would be a URL of the form https://youraccount.shinyapps.io/appname. If your app uses a special port number rather than the default port 443 for HTTPS, don’t forget to include that as well. It’s possible to set more than one redirect, so you can reuse a single app registration for multiple Shiny apps.
The type of redirect, either native (mobile & desktop) or webapp. There are also other types of redirects, but these are the only ones relevant to R. The difference between a mobile & desktop and a webapp redirect is that you supply a client secret when authenticating with the latter, but not the former. It’s recommended to use a webapp redirect for a Shiny app, as the client secret helps prevent third parties from hijacking your app registration. The client secret is also set as part of the app registration.
The intended audience of your app, ie, who is allowed to use it. This can be only members of your AAD tenant; members of any AAD tenant; or anyone with a Microsoft account (including personal accounts).
The permissions required by your app. Refer to the app_registration.md file for the list of permissions Microsoft365R uses. You can omit any permissions that you don’t need if your app doesn’t use all of Microsoft365R’s functionality, eg if you don’t handle emails you can omit Mail.Send and Mail.ReadWrite.
The following pages at the AAD documentation will be helpful:
A step-by-step guide to registering an app in the Azure portal.
How to set permissions for an app. For a Shiny app, note that you want delegated permissions from the Microsoft Graph API, not application permissions.
Below is a basic app that logs the user in, retrieves their OneDrive, and lists the contents of the root folder.
One thing to note is that the regular Microsoft365R client functions like get_sharepoint_site, get_team, etc. are intended for use on a local machine. While they will still work when called in a web app, it’s a better idea to call the underlying R6 methods directly: Microsoft365R extends AzureGraph with several R6 classes and methods, which do the actual work of interacting with the Microsoft 365 REST API.
Here, we call the get_drive() method for the AzureGraph::az_user class, which retrieves the OneDrive for a user. For more information, see the online help page in R for the Microsoft365R “add_methods” topic: ?add_methods.
library(AzureAuth)
library(AzureGraph)
library(Microsoft365R)
library(shiny)

tenant <- "your-tenant-here"

# the application/client ID of your app registration
# - not to be confused with the 'object ID' or 'service principal ID'
app <- "your-app-id-here"

# the address/redirect URI of your app
# - AAD allows only HTTPS for non-localhost redirects, not HTTP
redirect <- "https://example.com/mysite"
port <- httr::parse_url(redirect)$port
options(shiny.port=if(is.null(port)) 443 else as.numeric(port))

# the client secret of your app registration
# - NEVER put secrets in code:
# - here we get it from an environment variable
# - unnecessary if you have a 'desktop & mobile' redirect
pwd <- Sys.getenv("EXAMPLE_SHINY_CLIENT_SECRET", "")
if(pwd == "") pwd <- NULL

# get the Graph permissions listed for the app, plus an ID token
resource <- c("https://graph.microsoft.com/.default", "openid")

# a simple UI: display the user's OneDrive
ui <- fluidPage(
    verbatimTextOutput("drv")
)

ui_func <- function(req)
{
    opts <- parseQueryString(req$QUERY_STRING)
    if(is.null(opts$code))
    {
        auth_uri <- build_authorization_uri(resource, tenant, app,
            redirect_uri=redirect, version=2)
        redir_js <- sprintf("location.replace(\"%s\");", auth_uri)
        tags$script(HTML(redir_js))
    }
    else ui
}

server <- function(input, output, session)
{
    opts <- parseQueryString(isolate(session$clientData$url_search))
    if(is.null(opts$code))
        return()

    token <- get_azure_token(resource, tenant, app, password=pwd,
        auth_type="authorization_code",
        authorize_args=list(redirect_uri=redirect), version=2,
        use_cache=FALSE, auth_code=opts$code)

    # display the contents of the user's OneDrive root folder
    drv <- ms_graph$
        new(token=token)$
        get_user()$
        get_drive()
    output$drv <- renderPrint(drv$list_files())
}

shinyApp(ui_func, server)
This is just a short update to my previous post. Bundestag elections are over and according to the preliminary official results the new Bundestag will have 735 members. That’s 32 more than the previous 703 members, but at least far from 841 members predicted in my previous post that was based on last week’s forecast data.
The main reason is that the Bavarian CSU performed substantially better in terms of 2nd votes than predicted by the Forsa forecast from last week. While the forecast predicted a CSU 2nd vote share of 29.3% in Bavaria among the parties entering the parliament, the CSU actually achieved 36.8%.
I copied the preliminary results from mandatsrechner.de and added them to my Github repository.
Let’s see whether we get the same seat distribution as in the official results:
source("seat_calculator.R")

dat = read.csv("results_2021.csv", encoding="UTF-8") %>%
  select(-seats.mr)
res = compute.seats(dat)
summarize.results(res)

## Total size: 734 seats
## # A tibble: 7 x 5
##   party  vote_share seat_shares seats ueberhang
##   <chr>       <dbl>       <dbl> <dbl>     <dbl>
## 1 SPD        0.282       0.281    206         0
## 2 CDU        0.207       0.206    151         0
## 3 Gruene     0.162       0.161    118         0
## 4 FDP        0.126       0.125     92         0
## 5 AfD        0.113       0.113     83         0
## 6 CSU        0.0567      0.0613    45         3
## 7 Linke      0.0536      0.0531    39         0
Those are the same results as the preliminary official ones, with one exception: as the party of the Danish minority, the SSW did not have to pass the 5% hurdle and also got one seat this election. Adding this seat leads to the total of 735 seats. Also note that all vote shares are relative only to the votes of the parties that enter the Bundestag.
Except for the district Munich-South, which went to the Green party, the CSU won all direct districts in Bavaria. But, for example, the district Munich-West was won by the CSU with a lead of only 146 votes over the 2nd-ranked candidate from the Green party. What would be the size of the parliament if the Greens had also won Munich-West?
library(dplyrExtras)

dat.mod = dat %>%
  mutate_rows(party=="CSU" & land=="Bayern", direct = direct-1) %>%
  mutate_rows(party=="Gruene" & land=="Bayern", direct = direct+1)
res = compute.seats(dat.mod)
sum(res$seats)+1
## [1] 718
We would have 17 seats less: only 718 seats. Note that in each of the 4 Munich districts the candidates of the SPD and the Green party together received substantially more direct votes than the CSU. If the voters of these two parties had coordinated better, so that the CSU got no direct seat in Munich, the size of the Bundestag could have been reduced to as few as 683 members. Normally, one would have thought the Bavarian voters would take such a chance to reduce the amount of money spent in Berlin…
There were direct seats even more expensive in terms of Bundestag size than a direct seat for the CSU: one of the three direct seats captured by the Linke. Had the Linke not won three direct seats, they would have lost all their 2nd vote seats, because they did not breach the 5% hurdle. The number of seats is not reduced automatically by the fact that a party does not enter the parliament. Yet, if the Linke did not get its second votes, the CSU share among the relevant 2nd votes would increase, and a smaller increase of the Bundestag’s size would be necessary to balance the CSU’s direct seats with its 2nd vote share.
So what would be the size of parliament if the Linke had won only 2 direct seats and one of its Berlin seats had gone to the Green party?
dat.mod = dat %>%
  filter(party != "Linke") %>%
  mutate_rows(party=="Gruene" & land=="Berlin", direct = direct+1)
res = compute.seats(dat.mod)
sum(res$seats)+3
## [1] 698
We would then have a parliament of only 698 members. So the third direct seat won by the Linke generated 37 additional seats in the Bundestag.
In R, boolean variables can take only 2 values – TRUE
and FALSE
.
For example,
# declare boolean x <- TRUE print(x) print(class(x)) # declare boolean using single character y <- F print(y) print(class(y))
Output
[1] TRUE [1] "logical" [1] FALSE [1] "logical"
Here, we have declared x and y as boolean variables. In R, Boolean variables belong to the logical
class.
You can also declare boolean variables using a single character - T
or F
. Here, T
stands for TRUE
and F
stands for FALSE
.
Comparison operators are used to compare two values.
| Operator | Description | Example |
|---|---|---|
| `>` | Greater than | `5 > 6` returns `FALSE` |
| `<` | Less than | `5 < 6` returns `TRUE` |
| `==` | Equal to | `10 == 10` returns `TRUE` |
| `!=` | Not equal to | `10 != 10` returns `FALSE` |
| `>=` | Greater than or equal to | `5 >= 6` returns `FALSE` |
| `<=` | Less than or equal to | `6 <= 6` returns `TRUE` |
The output of a comparison is a boolean value. For example, to check if two numbers are equal, you can use the ==
operator.
x <- 10 y <- 23 # compare x and y print(x == y) # FALSE
Similarly, to check if x is less than y, you can use the <
operator.
x <- 10 y <- 23 # compare x and y print(x < y) # TRUE
Since the value stored in x is less than the value stored in y, the comparison x < y results in TRUE.
Logical operators are used to compare the output of two comparisons. There are three types of logical operators in R. They are:
- AND operator (&)
- OR operator (|)
- NOT operator (!)

The AND operator & takes as input two logical values and returns the output as another logical value.
The output of the operator is TRUE only when both input logical values are TRUE or evaluate to TRUE.
Let a and b represent two operands, where 0 represents FALSE and 1 represents TRUE. Then,

| a | b | a & b |
|---|---|---|
| 1 | 1 | 1 |
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 0 |
For example,
# print & of TRUE and FALSE combinations TRUE & TRUE TRUE & FALSE FALSE & TRUE FALSE & FALSE
Output
[1] TRUE [1] FALSE [1] FALSE [1] FALSE
The input to any logical operator can also be a comparison between two or more variables. For example,
x <- 10 y <- 23 z <- 12 print(x<y & y>z)
Output
[1] TRUE
Here, the condition checks whether x is less than y and whether y is greater than z. If both conditions evaluate to TRUE, then the output is TRUE. If either of them is FALSE, the output is FALSE.
The OR operator | returns TRUE if any one of the logical inputs is TRUE or evaluates to TRUE. If all of them are FALSE, then it returns FALSE. Consider the table below.

| a | b | a \| b |
|---|---|---|
| 1 | 1 | 1 |
| 1 | 0 | 1 |
| 0 | 1 | 1 |
| 0 | 0 | 0 |
For example,
# print | of TRUE and FALSE combinations TRUE | TRUE TRUE | FALSE FALSE | TRUE FALSE | FALSE
Output
[1] TRUE [1] TRUE [1] TRUE [1] FALSE
Here, if any one of the inputs is TRUE
, then the output is TRUE
.
Similar to the AND operator, you can use any number of comparisons as input to the OR operator. For example,
w <- 54 x <- 12 y <- 25 z <- 1 print(w>x | x>y | z>w)
Output
[1] TRUE
Here, only the comparison w>x evaluates to TRUE; all the other comparisons evaluate to FALSE. Since at least one of the inputs is TRUE, the output of the entire comparison is TRUE.
The NOT operator ! is used to negate the logical value it is applied to. If the input value is TRUE, it will turn to FALSE and vice-versa.

| a | !a |
|---|---|
| 1 | 0 |
| 0 | 1 |
For example,
# print ! of TRUE and FALSE !TRUE !FALSE
Output
[1] FALSE [1] TRUE
Here, the output is the negation of the input.
We can use the ! operator with comparisons. For example, !(x > 12) is the same as x <= 12: x is not greater than 12, which means x is less than or equal to 12.
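We can check this equivalence directly in R; the two expressions agree for any value of x:

```r
x <- 10

!(x > 12)  # TRUE: 10 is not greater than 12
x <= 12    # TRUE: the equivalent comparison

identical(!(x > 12), x <= 12)
```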
You can also use the NOT operator with any in-built function that evaluates to boolean value. For example,
x <- 3 + 5i # using ! with in-built function print(!is.numeric(x))
Output
[1] TRUE
Here, since x is of type complex
, the function is.numeric(x)
evaluates to FALSE
and the negation of FALSE
is TRUE
, hence the output.
You can use all three logical operators together with comparison operators.
x <- 5 print(is.numeric(x) & (x>5 | x==5))
Output
[1] TRUE
Here, we can consider the entire operation in two parts - is.numeric(x) and (x>5 | x==5). Since there is an AND operator between them, the output is TRUE only if both parts evaluate to TRUE.
This is how the program works:
- is.numeric(x) - this evaluates to TRUE since x is of numeric type
- (x>5 | x==5) - this evaluates to TRUE since x==5 is TRUE
A data type of a variable specifies the type of data that is stored inside that variable. For example,
x <- 123L
Here, 123L
is an integer data. So the data type of the variable x is integer
.
We can verify this by printing the class of x.
x <- 123L # print value of x print(x) # print type of x print(class(x))
Output
[1] 123 [1] "integer"
Here, x is a variable of data type integer
.
In R, there are 6 basic data types:
logical
numeric
integer
complex
character
raw
Let's discuss each of these R data types one by one.
The logical
data type in R is also known as boolean data type. It can only have two values: TRUE
and FALSE
. For example,
bool1 <- TRUE print(bool1) print(class(bool1)) bool2 <- FALSE print(bool2) print(class(bool2))
Output
[1] TRUE [1] "logical" [1] FALSE [1] "logical"
In the above example, bool1 is assigned TRUE and bool2 is assigned FALSE. Here, we get "logical" when we check the type of both variables.
Note: You can also define logical variables with a single letter - T
for TRUE
or F
for FALSE
. For example,
is_weekend <- F print(class(is_weekend)) # "logical"
In R, the numeric
data type represents all real numbers with or without decimal values. For example,
# floating point values weight <- 63.5 print(weight) print(class(weight)) # real numbers height <- 182 print(height) print(class(height))
Output
[1] 63.5 [1] "numeric" [1] 182 [1] "numeric"
Here, both weight and height are variables of numeric
type.
The integer data type specifies whole numbers, i.e. values without decimal points. We use the suffix L to specify integer data. For example,
integer_variable <- 186L
print(class(integer_variable))
Output
[1] "integer"
Here, 186L is integer data, so we get "integer" when we print the class of integer_variable.
The complex data type is used to represent complex numbers in R. We use the suffix i to specify the imaginary part. For example,
# 2i represents imaginary part
complex_value <- 3 + 2i

# print class of complex_value
print(class(complex_value))
Output
[1] "complex"
Here, 3 + 2i is of complex data type because it has an imaginary part, 2i.
The character data type is used to specify character or string values in a variable.
In programming, a string is a set of characters. For example, 'A' is a single character and "Apple" is a string.
You can use single quotes '' or double quotes "" to represent strings. In general, we use:
- '' for character variables
- "" for string variables

For example,
# create a string variable
fruit <- "Apple"
print(class(fruit))

# create a character variable
my_char <- 'A'
print(class(my_char))
Output
[1] "character" [1] "character"
Here, both the variables - fruit and my_char - are of character data type.
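Base R also provides several functions for working with character data. A brief sketch using nchar(), toupper() and paste():

```r
fruit <- "Apple"

print(nchar(fruit))         # number of characters: 5
print(toupper(fruit))       # "APPLE"
print(paste(fruit, "pie"))  # "Apple pie"
```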
The raw data type specifies values as raw bytes. You can use the following methods to convert character data types to a raw data type and vice-versa:
- charToRaw() - converts character data to raw data
- rawToChar() - converts raw data to character data

For example,
# convert character to raw
raw_variable <- charToRaw("Welcome to Programiz")
print(raw_variable)
print(class(raw_variable))

# convert raw to character
char_variable <- rawToChar(raw_variable)
print(char_variable)
print(class(char_variable))
Output
[1] 57 65 6c 63 6f 6d 65 20 74 6f 20 50 72 6f 67 72 61 6d 69 7a [1] "raw" [1] "Welcome to Programiz" [1] "character"
In this program,
- we have used the charToRaw() function to convert the string "Welcome to Programiz" to raw bytes. We get "raw" as output when we print the class of raw_variable.
- we have used the rawToChar() function to convert the data in raw_variable back to character form. We get "character" as output when we print the class of char_variable.

One hundred sixty new packages covering a wide array of topics made it to CRAN in August. I thought I would emphasize the breadth of topics by expanding the number of categories organizing my "Top 40" selections beyond the core categories that appear month after month. Here are my picks in fourteen categories: Archaeology, Computational Methods, Data, Education, Finance, Forestry, Genomics, Machine Learning, Medicine, Science, Statistics, Time Series, Utilities, and Visualization. Based on informal impressions formed over the last several months, I believe a new category combining applications in forestry, animal populations, and climate change could become a regular core category.
DIGSS v1.0.2: Provides a simulation tool to estimate the rate of success that surveys including user-specific characteristics have in identifying archaeological sites given specific parameters of survey area, survey methods, and site properties. See Kintigh (1988) for background and the vignette for examples.
simlandr v0.1.1: Provides a set of tools for constructing potential landscapes for dynamical systems using Monte-Carlo simulation, which is especially suitable for formal psychological models. There are vignettes on Dynamic Models and Simulations, Constructing Potential Landscapes, and Calculating the Lowest Elevation Path.
metaboData v0.6.2: Provides access to remotely stored data sets from a variety of biological sample matrices analyzed using mass spectrometry metabolomic analytical techniques. See the vignette.
metadat v1.0-0: Contains a collection of data sets useful for teaching meta analysis. See README for more information.
nflreadr v1.1.0: Provides functions for downloading data from the GitHub repository for the nflverse project. There is a brief Introduction and several short vignettes that serve as the data dictionary for the various files Draft Picks, Rankings, etc.
OCSdata v1.0.2: Provides functions to access and download data from the Open Case Studies repositories on GitHub. See the vignette to get started.
rATTAINS v0.1.2: Implements an interface to United States Environmental Protection Agency (EPA) ATTAINS database used to track information provided by states about water quality assessments conducted under federal Clean Water Act requirements. There is a vignette.
taylor v0.2.1: Provides access to a curated data set of Taylor Swift songs, including lyrics and audio characteristics. Data comes from Genius and the Spotify API. See README for examples.
karel v0.1.0: Provides an R implementation of Karel the robot, a programming language for teaching introductory concepts about general programming in an interactive and fun way, by writing programs to make Karel achieve tasks in the world she lives in. There are several vignettes including one on Control Structures and another on Algorithmic Decomposition.
roger v0.99-0: Implements tools for grading the coding style and documentation of R scripts. This is the R component of Roger the Omni Grader, an automated grading system for computer programming projects based on Unix shell scripts. Look here for more information.
dispositionEffect v1.0.0: Implements four different methodologies to evaluate the presence of the disposition effect and other irrational investor behaviors based on investor transactions and financial market data. There is a Getting Started Guide, and vignettes on Analysis, Disposition Effects in Parallel, and Time Series Disposition Effects.
HDShOP v0.1.1: Provides functions to construct shrinkage estimators of high-dimensional mean-variance portfolios and performs high-dimensional tests on optimality of a given portfolio. See Bodnar et al. (2018), Bodnar et al. (2019), and Bodnar et al. (2020) for background.
tcsinvest v0.1.1: Implements an interface to the Tinkoff Investments API, which enables analysts and traders to interact with account and market data from within R. Clients for both REST and Streaming protocols have been implemented. There is a vignette.
APAtree v1.0.1: Provides functions to map the area potentially available (APA) using the approach from Gspaltl et al. (2012) and also aggregation functions to calculate stand characteristics based on APA-maps and the neighborhood diversity index as described in Glatthorn (2021). See the vignette for examples.
efdm v0.1.0: Implements the European Forestry Dynamics Model (EFDM), a large-scale forest model that simulates the development of a forest and estimates volume of wood harvested for any given forested area. See Packalen et al. (2015) for background and the vignette for examples.
molnet v0.1.0: Implements a network analysis pipeline that enables integrative analysis of multi-omics data including metabolomics. It allows for comparative conclusions between two different conditions, such as tumor subgroups, healthy vs. disease, or generally control vs. perturbed. The case study presented in the vignette uses data published by Krug (2020).
simtrait v1.0.21: Provides functions to simulate complex traits given a SNP genotype matrix and model parameters, with an emphasis on avoiding common biases due to the use of estimated allele frequencies. Traits can follow three models: random coefficients, fixed effect sizes, and multivariate normal. GWAS method benchmarking functions as described in Yao and Ochoa (2019) are also provided. See the vignette.
statgenIBD v1.0.1: Provides functions to calculate biparental, three and four-way crosses Identity by Descent (IBD) probabilities using Hidden Markov Models and inheritance vectors following Lander & Green (1987) and Huang (2011). See the vignette for examples.
text2map v0.1.0: Provides functions for computational text analysis for the social sciences including functions for working with word embeddings, text networks, and document-term matrices. For background on the methods used see Stoltz and Taylor (2019), Taylor and Stoltz (2020), Taylor and Stoltz (2020), and Stoltz and Taylor (2021). There is a Quick Start Guide and a vignette on Concept Class Analysis.
NPRED v1.0.5: Uses partial informational correlation (PIC) to identify the meaningful predictors from a large set of potential predictors. Details can be found in Sharma & Mehrotra (2014), Sharma et al. (2016), and Mehrotra & Sharma (2006). See the vignette for examples.
stabiliser v0.1.0: Implements an approach to variable selection through stability selection and the use of an objective threshold based on permuted data. See Lima et al (2021) and Meinshausen & Buhlmann (2010) for details and the vignette for an example.
dreamer v3.0.0: Fits longitudinal dose-response models utilizing a Bayesian model averaging approach as outlined in Gould (2019) for both continuous and binary responses. See the vignette.
smartDesign v0.72: Implements the SMART trial design, as described by He et al. (2021) which includes multiple stages of randomization where participants are randomized to an initial treatment in the first stage and then subsequently re-randomized between treatments in the following stage. There is a Dynamic Treatment Tutorial and a Sequential Design Tutorial.
bootf2 v0.4.1: Provides functions to compare dissolution profiles with confidence intervals of the similarity factor f2 and also functions to simulate dissolution profiles. There are multiple vignettes, including an Introduction and a Simulation Example.
track2KBA v1.0.1: Provides functions to prepare and analyze animal tracking data in order to identify areas of potential interest for population level conservation. See Lascelles et al. (2016) for background on the methodology employed and the vignette for examples and workflow.
chyper v0.3.1: Provides functions to work with the conditional hypergeometric distribution. See the vignette.
sprtt v0.1.0: Provides functions to perform sequential t-tests including those of Wald (1947), Rushton (1950), Rushton (1952), and Hajnal (1961). There is an Introduction to the package, a Use Case, and a vignette on the Sequential t-test.
SurvMetrics v0.3.5: Implements popular evaluation metrics commonly used in survival prediction including Concordance Index, Brier Score, Integrated Brier Score, Integrated Square Error, Integrated Absolute Error and Mean Absolute Error. For detailed information, see Ishwaran et al. (2008) and Moradian et al. (2017). The vignette offers examples.
DCSmooth v1.0.2: Implements nonparametric smoothing techniques for data on a lattice or functional time series which allow for modeling a dependency structure of the error terms of the nonparametric regression model. See Beran & Feng (2002), Mueller & Wang (1994), Feng & Schaefer (2021), and Schaefer & Feng (2021) for the background and the vignette for examples.
STFTS v0.1.0: Implements statistical hypothesis tests of functional time series including a functional stationarity test, a functional trend stationarity test and a functional unit root test.
WASP v1.4.1: Implements wavelet-based variance transformation methods for system modeling and prediction. For details see Jiang et al. (2020), Jiang et al. (2020), and Jiang et al. (2021). There is a vignette with examples.
ExpImage v0.2.0: Provides an image editing tool for researchers which includes functions for segmentation and for obtaining biometric measurements. There are several vignettes including: Contagem de bovinos, Contagem de objetos, and Como editar imagens.
meltr v1.0.0: Provides functions to read non-rectangular data, such as ragged forms of csv (comma-separated values), tsv (tab-separated values), and fwf (fixed-width format) files. See README to get started.
plumbertableau v0.1.0: Implements tools for building plumber APIs that can be used in Tableau workbooks. There is a package Introduction and vignettes on Writing Extensions, Using Extensions in Tableau, and Publishing Extensions to RStudio Connect.
string2path v0.0.2: Provides functions to extract glyph information from a font file, translate the outline curves to flattened paths or tessellated polygons, and return the results as a data.frame. See README for an example.
trackdown v1.0.0: Uses Google Drive to implement tools for collaborative writing and editing of R Markdown and Sweave documents. There are some Tech Notes and vignettes on Features and Workflow.
aRtsy v0.1.1: Provides algorithms for creating artwork in the ggplot2 language that incorporate some form of randomness. See README for examples and package use.
ggcleveland v0.1.0: Provides functions to produce ggplot2 versions of the visualization tools described in William Cleveland’s book Visualizing Data. The vignette contains several examples.
ggtikz v0.0.1: Provides tools to annotate ggplot2 plots with TikZ code using absolute data or relative coordinates. See the vignette.
tidycharts v0.1.2: Provides functions to generate charts compliant with the International Business Communication Standards (IBCS) including unified bar widths, colors, chart sizes, etc. There is a Getting Started guide and vignettes on EDA, Customization, and Joining Charts.
[Set Objective] $K$9
[To] Max
[By Changing Variable Cells] $F$4:$J$4,$F$5:$H$5
[Subject to the Constraints]
$F$4:$J$4 <= $D$4
$F$4:$J$4 >= $C$4
$F$9:$K$9 >= $C$9
[Make Unconstrained Variables Non-Negative] Checked
#========================================================#
# Quantitative ALM, Financial Econometrics & Derivatives
# ML/DL using R, Python, Tensorflow by Sang-Heon Lee
#
# https://kiandlee.blogspot.com
#--------------------------------------------------------#
# A simple Asset/Liability CF Matching problem
#========================================================#

graphics.off()  # clear all graphs
rm(list = ls()) # remove all files from your workspace

library(ROI)
library(ROI.plugin.neos)

# Gross Interest Rates
Rx = 1.01  # Xi : the line of credit
Ry = 1.02  # Yi : CP 90d
Rz = 1.003 # Zi : excess funds

# decision variables
# -> x1,2,3,4,5, y1,2,3, z1,2,3,4,5, v

# Left hand side matrix
v.LHS <- c( 1,0,0,0,0,    1,0,0,   -1,0,0,0,0,  0,
           -Rx,1,0,0,0,   0,1,0,   Rz,-1,0,0,0, 0,
            0,-Rx,1,0,0,  0,0,1,   0,Rz,-1,0,0, 0,
            0,0,-Rx,1,0, -Ry,0,0,  0,0,Rz,-1,0, 0,
            0,0,0,-Rx,1,  0,-Ry,0, 0,0,0,Rz,-1, 0,
            0,0,0,0,-Rx,  0,0,-Ry, 0,0,0,0,Rz, -1)
m.LHS <- matrix(v.LHS, nrow = 6, byrow = TRUE)

# v (14th decision variable) is the objective function
lp_obj <- L_objective(c(rep(0,13), 1))

# LHS * X = RHS
lp_con <- L_constraint(
  L = m.LHS, dir = rep("==", 6),
  rhs = c(150, 100, -200, 200, -50, -300))

# Lower & Upper bounds for decision variables
lp_bound <- V_bound(
  li = 1:14, ui = 1:14,
  lb = rep(0, 14),
  ub = c(rep(100, 5), rep(Inf, 9)))

# Set Problem
lp <- OP(objective = lp_obj, constraints = lp_con,
         bounds = lp_bound, maximum = TRUE)

# Solve Problem
opt <- ROI_solve(lp, solver = "neos", method = "mosek",
                 email = "your email address")

# Print
cat("\nResult for simple A/L CF Matching \n\n",
    "Objective Function Value = ", round(opt$objval, 4), "\n",
    "Decision Variable Xi = ", round(opt$solution[1:5], 4), "\n",
    "Decision Variable Yi = ", round(opt$solution[6:9], 4), "\n",
    "Decision Variable Zi = ", round(opt$solution[10:14], 4))
Result for simple A/L CF Matching

Objective Function Value = 92.4969
Decision Variable Xi = 0 50.9804 0 0 0
Decision Variable Yi = 150 49.0196 203.4344 0
Decision Variable Zi = 0 351.9442 0 0 92.4969
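For readers without a NEOS account, the same LP can also be sketched locally with the lpSolve package (an assumption: lpSolve is installed; it is not used in the original post). Since lp() only supports the default lower bound of 0, the upper bounds x1..x5 <= 100 are added as explicit constraint rows:

```r
library(lpSolve)

# same coefficient matrix as in the ROI formulation above
Rx <- 1.01; Ry <- 1.02; Rz <- 1.003
m.LHS <- matrix(c(
   1,0,0,0,0,    1,0,0,   -1,0,0,0,0,  0,
  -Rx,1,0,0,0,   0,1,0,   Rz,-1,0,0,0, 0,
   0,-Rx,1,0,0,  0,0,1,   0,Rz,-1,0,0, 0,
   0,0,-Rx,1,0, -Ry,0,0,  0,0,Rz,-1,0, 0,
   0,0,0,-Rx,1,  0,-Ry,0, 0,0,0,Rz,-1, 0,
   0,0,0,0,-Rx,  0,0,-Ry, 0,0,0,0,Rz, -1), nrow = 6, byrow = TRUE)

# upper bounds x1..x5 <= 100 as additional constraint rows
ub.rows <- cbind(diag(5), matrix(0, 5, 9))

res <- lp(direction    = "max",
          objective.in = c(rep(0, 13), 1),   # maximize v
          const.mat    = rbind(m.LHS, ub.rows),
          const.dir    = c(rep("=", 6), rep("<=", 5)),
          const.rhs    = c(150, 100, -200, 200, -50, -300, rep(100, 5)))

res$objval    # objective value
res$solution  # decision variables x, y, z, v
```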
25 September 2021
Yes, fellow R friends, the book list keeps growing nicely. Thank you very much to the contributors (both in submissions and for authoring these books!): Colin Fay, Robert D. Brown III, William Becker and Tony Hirst.
by Colin Fay
JavaScript in practice for Shiny users.
Freely available online.
https://www.bigbookofr.com/shiny.html#javascript-4-shiny—field-notes
Business case analysis, often conducted in spreadsheets, exposes decision makers to additional risks that arise just from the use of the spreadsheet environment. This book discusses how to use the statistical programming language R to develop a business case simulation and analysis. It presents a methodology that minimizes decision delay by focusing stakeholders on what matters most and suggests pathways for minimizing the risk in strategic and capital allocation decisions.
Paid versions
Composite indicators are aggregations of indicators which aim to measure (usually socio-economic) complex and multidimensional concepts which are difficult to define, and cannot be measured directly. Examples include innovation, human development, environmental performance, and so on. This book gives a detailed guide on building composite indicators in R, focusing on the recent COINr package, which is an end-to-end development environment for composite indicators. Although COINr is the main tool used in the book, it also gives general explanation and guidance on composite indicator construction and analysis in R, ranging from normalisation, aggregation, multivariate analysis and global sensitivity analysis.
Free
This book describes a range of data analysis and visualisation techniques that can be applied to motorsport timing and results data in general, and Formula One data in particular.
Paid
https://www.bigbookofr.com/sport-analytics.html#wrangling-f1-data-with-r-a-data-junkies-guide
Taking a simple rally route dataset, what can we do with it? This book describes a wide range of techniques for working with geodata, including routes and elevation rasters. From 2D and 3D mapping, to a wide range of route analysis techniques, the techniques described are also relevant to a wide range of other route analysis contexts, including ecological trail analysis.
Free online book
A handy guide to visualising a wide range of motorsport timing and results data, concentrating on rally data associated with the FIA World Rally Championship (WRC).
The post 6 New books added to BigBookofR: Javascript, Business Case Analysis, Indicator Dev and a bunch for Motorsports! appeared first on Oscar Baruffa.
Subscript out of bounds, Subscript out of limits in R: How to Fix?
The following is an example of a typical R error:
Error in x[6, ] : subscript out of bounds
When you try to access a column or row in a matrix that doesn’t exist, you’ll get this error.
Using the following matrix as an example, this guide explains the exact procedures to troubleshoot this error.
set.seed(123)
Let’s create a matrix with 5 rows and 3 columns
x = matrix(data = sample.int(100, 30), nrow = 5, ncol = 3)
x

     [,1] [,2] [,3]
[1,]   36   84    5
[2,]   87    3   19
[3,]  100   66   98
[4,]   26   86   74
[5,]   55   76   94
The code below tries to access the matrix’s 6th row, which does not exist.
x[6, ]
Error in x[6, ] : subscript out of bounds
We get the subscript out of bounds error because the 6th row of the matrix does not exist.
We can use the nrow() function to figure out how many rows are in the matrix if we don’t know:
nrow(x)
[1] 5
The matrix only has five rows, as we can see. As a result, we can only access the rows with integers less than or equal to 5.
The following code tries to access the matrix’s fourth column, which doesn’t exist:
x[, 4]
Error in x[, 4] : subscript out of bounds
The subscript out of bounds error occurs because the fourth column of the matrix does not exist.
If we don’t know how many columns the matrix has, we can use the ncol() function to figure it out.
ncol(x)
[1] 3
The matrix only has three columns, as we can see. As a result, we can only access the columns with integers less than or equal to three.
The following code tries to access a value in the matrix’s 6th row and 4th column that does not exist.
x[6, 4]
Error in x[6, 4] : subscript out of bounds
We get the subscript out of bounds error because neither the 6th row nor the 4th column of the matrix exists.
We can use the dim() function to figure out how many rows and columns are in the matrix if we don’t know:
dim(x)
[1] 5 3
The matrix has only 5 rows and 3 columns, as can be seen. As a result, while accessing the rows and columns, we can only utilize numbers that are less than or equal to these values.
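If out-of-bounds access is a recurring problem, a defensive helper can check indices against dim() before subscripting. The function safe_get() below is an illustrative name, not a base R function:

```r
# a defensive pattern: validate row and column indices before subscripting
safe_get <- function(m, i, j) {
  if (i >= 1 && j >= 1 && i <= nrow(m) && j <= ncol(m)) {
    m[i, j]
  } else {
    NA  # out-of-bounds indices return NA instead of raising an error
  }
}

x <- matrix(1:15, nrow = 5, ncol = 3)
safe_get(x, 2, 3)  # valid index: returns the stored value
safe_get(x, 6, 4)  # would have errored with x[6, 4]; returns NA instead
```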
The post Error in x[6, ]: subscript out of bounds appeared first on finnstats.
On Sunday (26th of September), we will have the main parliamentary elections in Germany. The DeepL translation of the election law states:
§1 (1) Subject to the deviations resulting from this Act, the German Bundestag shall consist of 598 Members of Parliament. […]”
Well, the “deviations” led in the previous elections to a Bundestag with 709 members. And while in 2020 the law was reformed with the stated goal of a smaller Bundestag, we may even end up this time with more than 800 members if current voting forecasts indeed materialize. This post first explains the problem, points out the challenges of implementing the law in R, and then uses R to study several scenarios, also to get an idea of how variants of the election law would affect the size and composition of the Bundestag.
There are 299 electoral districts in Germany. For each district one direct candidate will be elected to the Bundestag. In addition to voting for their district representative, voters have a second vote for a party.
Things would be easy if the second vote would just determine the remaining 299 of the 598 total seats that are not directly elected. But the core idea is instead that the 2nd votes determine parties’ shares of all 598 seats.
This means if a party would gain 50 seats according to the 2nd votes and also wins directly 40 electoral districts, it would just get 50-40=10 additional delegates beyond the direct candidates.
But what if the party only gets 40 seats according to the 2nd votes but wins 50 electoral districts? Then still all 50 direct candidates have to get a seat. According to the election law before 2011, the party would essentially get the 50-40=10 seats in addition: these additional seats are called Überhangmandate.
The largest number of Überhangmandate occurred in the 2009 elections: 24, all for the CDU/CSU. That was roughly 10% of CDU/CSU’s total seats. Besides distorting the proportions of seats resulting from 2nd votes, such “Überhangmandate” create other problems. In combination with the allocation of party seats across the 16 Bundesländer, the election system could lead to the paradoxical situation that an additional 2nd vote for a party in a Bundesland where it got Überhangmandate would reduce the total number of seats of that party. That became apparent in the 2005 elections where, due to the sudden death of a candidate, the election in one district took place two weeks later and the CDU had incentives to get few 2nd votes (see e.g. Section 1 in Heese 2012).
The constitutional court decided in 2012 that the election law must be designed such that the impact of Überhangmandate is limited (roughly said there should not be more than 15 Überhangmandate).
Now how could you reform the law? Two main routes come quickly to mind:
i) Reduce the number of electoral districts so that e.g. only 40% of the 598 seats are directly elected. While in theory still a large number of Überhangmandate could occur, it would be considerably less likely.
ii) If a party gets more direct seats than they would get according to their 2nd vote share, increase the size of the parliament until all parties’ seats match their seats according to the 2nd vote share.
In 2012 most parties (including the opposition) agreed to a reform of the election law that mainly followed the 2nd approach: variably increase the size of the parliament. Was such a large consensus possible because no party really minded the prospect of getting more member-of-parliament positions?
While in 2013 we still had a moderate increase to only 631 seats in the Bundestag, in 2017 there were 709 seats.
If current election forecasts actually materialize, the next Bundestag may even go substantially beyond 800 seats. The main reason is the low projection of 2nd votes for the CSU (the Bavarian version of the Christian democrats), which is still expected to win a huge number of direct seats.
Before getting into the intricacies of the election law, let us do some simple rule-of-thumb calculation.
There is a great website www.mandatsrechner.de that collects forecasts and other relevant data to allow detailed predictions for the seat distribution. I copied the data based on the Forsa forecast from 09-21 for 2nd votes and Prognos Election.de from 09-17 for the direct seats. (You can find the csv file here.)
We have separately for each Bundesland forecasts for each party that is expected to enter parliament:
dat = read.csv("prognose_2021.csv", encoding = "UTF-8") %>%
  select(-seats.mr)
head(dat, 8)

##                  land     pop  party  votes direct
## 1  Schleswig-Holstein 2659792    CDU 389588      6
## 2  Schleswig-Holstein 2659792    SPD 487012      5
## 3  Schleswig-Holstein 2659792 Gruene 394052      0
## 4  Schleswig-Holstein 2659792  Linke  80974      0
## 5  Schleswig-Holstein 2659792    FDP 221930      0
## 6  Schleswig-Holstein 2659792    AfD 122180      0
## 7             Hamburg 1537766    CDU 177921      0
## 8             Hamburg 1537766    SPD 280211      6
Let us compute for each party the share of 2nd votes and also its direct seats relative to a total number of 598 seats:
total_votes = sum(dat$votes)
dat %>%
  group_by(party) %>%
  summarize(
    votes = sum(votes),
    direct = sum(direct),
    share_votes = votes / total_votes,
    direct_over_598 = direct / 598
  ) %>%
  arrange(share_votes)

## # A tibble: 7 x 5
##   party     votes direct share_votes direct_over_598
##   <chr>     <int>  <int>       <dbl>           <dbl>
## 1 CSU     1917218     41      0.0448         0.0686
## 2 Linke   2790931      5      0.0652         0.00836
## 3 AfD     5116704     14      0.120          0.0234
## 4 FDP     5116705      0      0.120          0
## 5 Gruene  7907632      8      0.185          0.0134
## 6 CDU     8316189     93      0.194          0.156
## 7 SPD    11628872    138      0.272          0.231
We see that the CSU is projected to get only 4.48% of 2nd votes, yet its predicted 41 direct seats would be 6.86% of a parliament with 598 members. For all other parties the number of seats according to vote shares is larger than their number of direct seats. Roughly said, the key logic of the election law is that the parliament would be increased so that the 41 direct seats of the CSU correspond to their 4.48% share of second votes. This rule-of-thumb calculation would yield a size of
598 * 6.86 / 4.48 = 915.7
That would be a gigantic, very crowded and expensive parliament.
The election law is of course much more complex than the rule of thumb calculation above because it has to deal with integer problems and allocation of seats not only between parties but also between the 16 Bundesländer. Furthermore, the fear of an excessively large parliament was stated as reason for another reform of the election law in 2020 by the governing parties.
I thought it would be an interesting exercise to transform the relevant parts of current election law into R code that allows to compute the resulting size and seat distribution of the Bundestag given a data set as above as input.
Well, I utterly failed…
After an afternoon and evening of headaches I thought I had a correct interpretation of the law and implemented an algorithm. Yet, it turned out that my interpretation was wrong. I am deeply impressed by anybody who is able to implement the correct algorithm just by reading the law that describes the procedure!
Thankfully, I then got the hint that there is an example calculation by the Bundeswahlleiter. That made the procedure indeed understandable. Also, after reading what is actually done, it even seems to be kind of consistent with what is written in the law…
While for some people an understandable explanation may remove the fun of the coding challenge, there still remain some interesting steps. For example, implementing the Webster/Sainte-Laguë/Schepers method described in §6 (2) Sentences 2-7 with a minimum seat restriction.
You can find my implementation here on Github. In the following section, I use the code to explore the size and composition of the parliament under the current law and under some variations.
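For readers curious about the core apportionment step, here is a minimal sketch of the Sainte-Laguë/Webster highest-averages method without the minimum-seat restriction; sls_simple is an illustrative name, not the function from the linked implementation:

```r
# Sainte-Lague / Webster highest-averages apportionment (no minimum-seat
# restriction): repeatedly award the next seat to the party with the
# largest quotient votes / (2*seats + 1).
sls_simple <- function(votes, n_seats) {
  seats <- rep(0, length(votes))
  for (k in seq_len(n_seats)) {
    quotients <- votes / (2 * seats + 1)
    i <- which.max(quotients)
    seats[i] <- seats[i] + 1
  }
  seats
}

# e.g. distribute 15 seats among three parties
sls_simple(c(5200, 1700, 3100), 15)
```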
Let us compute the 2021 seat distribution given the actual law with our forecast data:
source("seat_calculator.R")
res = compute.seats(dat)
summarize.results(res)

## Total size: 841 seats
## # A tibble: 7 x 5
##   party  vote_share seat_shares seats ueberhang
##   <chr>       <dbl>       <dbl> <dbl>     <dbl>
## 1 SPD        0.272       0.270    227         0
## 2 CDU        0.194       0.194    163         0
## 3 Gruene     0.185       0.184    155         0
## 4 AfD        0.120       0.119    100         0
## 5 FDP        0.120       0.119    100         0
## 6 Linke      0.0652      0.0654    55         0
## 7 CSU        0.0448      0.0488    41         3
We predict now a Bundestag that is still extremely large with 841 members. One reason why we don’t reach the 915 members from our rule-of-thumb calculation is that the 2020 reform allows a total of 3 Überhangmandate. In our scenario they would all go to the CSU, which thus would achieve 4.88% of seats while it only has 4.48% of votes.
Let’s assume the SPD captures one more direct seat from the CSU. How large will the parliament then be?
library(dplyrExtras)
dat.mod = dat %>%
  mutate_rows(party=="CSU" & land=="Bayern", direct = direct-1) %>%
  mutate_rows(party=="SPD" & land=="Bayern", direct = direct+1)
res = compute.seats(dat.mod)
sum(res$seats)

## [1] 817
That would lead to a substantial reduction of 24 total seats, from 841 to 817. To me, it does not feel like a well-designed election system if a single electoral district can have such a huge impact on the size of the parliament…
What would happen if the 2020 reform had fully exhausted the limit of 15 Überhangmandate that the constitutional court seems to have imposed as a maximum cap in its 2012 decision? We can compute it:
res = compute.seats(dat, max.ueberhang = 15)
summarize.results(res)

## Total size: 612 seats
## # A tibble: 7 x 5
##   party  vote_share seat_shares seats ueberhang
##   <chr>       <dbl>       <dbl> <dbl>     <dbl>
## 1 SPD        0.272       0.265    162         0
## 2 CDU        0.194       0.190    116         0
## 3 Gruene     0.185       0.180    110         0
## 4 AfD        0.120       0.118     72         0
## 5 FDP        0.120       0.118     72         0
## 6 CSU        0.0448      0.0670    41        14
## 7 Linke      0.0652      0.0637    39         0
We now would get a nicely small Bundestag with only 612 seats, where the 14 seats above 598 are all Überhangmandate for the CSU. Hard to imagine that such a law would survive long after such an outcome, but at least we would have a nicely workable small Bundestag.
Another aspect of the 2020 reform is that Überhangmandate in some Bundesländer are offset by a seat reduction of the party in other Bundesländer. For the CSU that has no effect, since it can only be elected in Bavaria. But what would happen if the CDU and CSU were treated as a single party in the algorithm?
dat.mod = mutate_rows(dat, party=="CSU" & land=="Bayern", party="CDU")
res = compute.seats(dat.mod, max.ueberhang = 3)
res = mutate_rows(res, party=="CDU" & land=="Bayern", party="CSU")
summarize.results(res) %>% select(-ueberhang)

## Total size: 632 seats
## # A tibble: 7 x 4
##   party  vote_share seat_shares seats
##   <chr>       <dbl>       <dbl> <dbl>
## 1 SPD        0.272       0.271    171
## 2 Gruene     0.185       0.184    116
## 3 CDU        0.194       0.179    113
## 4 AfD        0.120       0.119     75
## 5 FDP        0.120       0.119     75
## 6 CSU        0.0448      0.0649    41
## 7 Linke      0.0652      0.0649    41
For the representation of results, we again distinguish between CDU and CSU. We now also find a relatively small parliament with only 632 seats. The CSU gets 6.49% of seats with only 4.48% of 2nd votes, but the additional seats now come at the cost of the CDU. For the other parties vote shares and seat shares are roughly aligned. So while this solution would lead to a small parliament, the substantial over-representation of Bavarian members of parliament might create strong tensions between the CDU/CSU sister parties.
The election reform from 2020 also determined that, starting in 2024, the number of electoral districts is reduced from 299 to 280. What if this change were already implemented in this election? One problem is that we don’t have a forecast of how the 280 direct seats would be distributed. As a guess, I simply distribute the 280 districts using the Sainte-Laguë/Schepers method, treating the original direct mandates as votes.
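The sls() function used below ships with the post's code; a minimal sketch of the Sainte-Laguë/Schepers (highest averages) method it presumably implements, written as an assumed stand-in rather than the actual helper:

```r
# Minimal sketch of a Sainte-Laguë/Schepers highest-averages allocation
# (an assumed stand-in for the post's sls() helper): each party's votes
# are divided by the odd divisors 1, 3, 5, ... and the n_seats largest
# quotients each win a seat. Ties exactly at the cutoff are not handled.
sainte_lague <- function(votes, n_seats) {
  divisors <- 2 * seq_len(n_seats) - 1       # 1, 3, 5, ...
  quotients <- outer(votes, divisors, `/`)   # one row per party
  cutoff <- sort(quotients, decreasing = TRUE)[n_seats]
  rowSums(quotients >= cutoff)               # seats per party
}

sainte_lague(c(SPD = 100, CDU = 80, Gruene = 30), 10)
```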
dat.mod = mutate(dat, direct = sls(direct, 280))
res = compute.seats(dat.mod, max.ueberhang = 3)
summarize.results(res) %>% select(-ueberhang)
## Total size: 773 seats
## # A tibble: 7 x 4
##   party  vote_share seat_shares seats
##   <chr>       <dbl>       <dbl> <dbl>
## 1 SPD        0.272       0.270    209
## 2 CDU        0.194       0.194    150
## 3 Gruene     0.185       0.184    142
## 4 AfD        0.120       0.119     92
## 5 FDP        0.120       0.119     92
## 6 Linke      0.0652      0.0647    50
## 7 CSU        0.0448      0.0492    38
As a result, we only find a moderate reduction, leading to 773 seats.
What if we reduce the districts to approximately 40% of the 598 seats? That would be 239 electoral districts:
dat.mod = mutate(dat, direct = sls(direct, 239))
res = compute.seats(dat.mod, max.ueberhang = 3)
summarize.results(res) %>% select(-ueberhang)
## Total size: 640 seats
## # A tibble: 7 x 4
##   party  vote_share seat_shares seats
##   <chr>       <dbl>       <dbl> <dbl>
## 1 SPD        0.272       0.270    173
## 2 CDU        0.194       0.194    124
## 3 Gruene     0.185       0.184    118
## 4 AfD        0.120       0.119     76
## 5 FDP        0.120       0.119     76
## 6 Linke      0.0652      0.0641    41
## 7 CSU        0.0448      0.05      32
Now we find a more workable Bundestag with 640 members. I guess that if we end up with a Bundestag of more than 800 members, such a more ambitious reduction of electoral districts is a likely outcome of future legislation.
What if we used the second vote to just determine the distribution of the 299 seats that are not directly elected in an electoral district? Then the size of the parliament could always be fixed at 598 seats. Let’s compute the resulting seat distribution given our forecasts:
dat %>%
  group_by(party) %>%
  summarize(
    direct = sum(direct),
    votes = sum(votes)
  ) %>%
  mutate(
    seats2 = sls(votes, 299),
    seats = direct + seats2
  ) %>%
  mutate(
    vote_share = votes / sum(votes),
    seat_shares = seats / sum(seats)
  ) %>%
  arrange(desc(seats))
## # A tibble: 7 x 7
##   party  direct    votes seats2 seats vote_share seat_shares
##   <chr>   <int>    <int>  <dbl> <dbl>      <dbl>       <dbl>
## 1 SPD       138 11628872     81   219     0.272       0.366
## 2 CDU        93  8316189     58   151     0.194       0.253
## 3 Gruene      8  7907632     55    63     0.185       0.105
## 4 CSU        41  1917218     13    54     0.0448      0.0903
## 5 AfD        14  5116704     36    50     0.120       0.0836
## 6 FDP         0  5116705     36    36     0.120       0.0602
## 7 Linke       5  2790931     20    25     0.0652      0.0418
While the SPD, CDU and CSU would substantially gain seats, the other parties would lose.
Would such a scheme be constitutional? The German constitution actually restricts the election law in Article 38 only as follows:
Members of the German Bundestag shall be elected in universal, direct, free, equal and secret elections. They are representatives of the whole people, not bound by orders and instructions, and subject only to their conscience.
I guess the criteria in the first sentence are also satisfied by the British system, where all members of parliament are personally elected as winners of an electoral district. As a layman, it seems a bit strange to me that the constitutional court and other institutions seem to emphasize that the 2nd votes should to such a large extent determine the proportion of parties’ seats. The constitution rather explicitly states that members of the Bundestag shall not be bound by any party orders, which seems consistent with a more prominent role of the first vote that directly elects a candidate.
Yet, it seems very unlikely that this scheme will ever be implemented.
Well, I guess many people believe that the CSU will in the end get somewhat more 2nd votes than currently predicted by traditional forecasts. But we will see on Sunday…
Yoav Raskin suggested that it would be useful to support right-to-left (RTL) text in {emayili}, so that languages like Hebrew, Arabic and Aramaic would render properly. I’ll be honest, this was not something that I had previously considered. But agreed, it would be a cool feature.
library(emayili)
packageVersion("emayili")
[1] '0.5.5'
The first step to make this happen was writing a short CSS file, rtl.css.
.rtl {
  direction: rtl;
  color: green;
}

body {
  margin: 20px 0px;
}

figure {
  margin-bottom: 20px;
}

.rtl figcaption {
  direction: inherit;
}
This makes use of the CSS direction property (which defaults to ltr), setting it to rtl. I’m also colouring the RTL text in green, just to make things super clear.
I put together a simple .Rmd file to test, rtl.Rmd, rendered it into a message and then dispatched it.
envelope() %>%
  subject("Right-to-Left Text") %>%
  render("rtl.Rmd", css_files = "rtl.css")
The extra CSS is included via the css_files parameter.
If you want this to work nicely on Gmail then you’ll also need to set include_css = c("rmd", "highlight"), because Gmail doesn’t currently appreciate Bootstrap CSS.
In the interests of brevity, the contents of rtl.Rmd have been appended to the end of the post.
And this is what the delivered email looks like:
Having an rtl class allowed me to mix both left-to-right and right-to-left content in the same message. In reality this is a rather unlikely requirement. To use RTL for the entire message, just adapt the CSS as follows:
body { direction: rtl; }
If there are other tweaks that I can make to {emayili} to ensure that it supports your requirements, please raise an issue on the repository.
The {emayili} package is developed & supported by Fathom Data.
Here are the contents of rtl.Rmd:
---
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer non magna magna. Etiam eu sapien nulla. Cras ac lectus non urna scelerisque porttitor ac non tortor.

<div class="rtl">
צחוק המעבורת, מהדובים של הכרית החופשית. מי רוצה לפספס, עצוב בחצר האחורית, כד רך. עכשיו הכאב של הכאב הוא לא הכאב של הטרדה, הצורך של אגמי החלבון של התפוז. אלמנט כלשהו, וגם לא יסוד כימי של נדל"ן, מראית עין של מגוון כלי רכב לנחות, רגע לפני הספד החיים. אניאס הוא מחבר רכיב המחיר. אין פחד אלא אם להיות כנים. אני חי במייסר העצוב של הכדורגל שלי, בסמך הזן כמובן. אבל המענה והקשתות, או קצות האורקים. אל האגם או לפניו. הזמן של העמק לוקח. אפילו הפחד מהפחד, הקשתות או המעבורת, אתת האינטרנט.
</div>

```{r pressure, out.width="50%", fig.cap="כיתוב דמות", fig.class="rtl"}
par(mar = c(5, 4, 1, 1))
plot(pressure, xlab = "טֶמפֶּרָטוּרָה", ylab = "לַחַץ", mar = c(5, 4, 0, 2))
```

<div class="rtl">
המחלה נושאת את הסבירות לסוף השבוע, והפחד מהכדורגל קשה. אבל האריה והאריה והילדים נמצאים במיטה. זה ברוטב צריך להגיד. אבל מבחינתו, צרות החיים מותירות הרבה חיים, הקריקטורה זקוקה לארוסים. עכשיו אני שונא את הגורמים, לא את הכאב ולא את הסביבה, הכניסה קלה. עכשיו הכניסה לחיי.
</div>

Cras fringilla nunc in tellus sagittis accumsan. Mauris nisi tellus, congue sit amet turpis nec, accumsan lacinia neque. Nulla facilisi.
By default <img> tags are wrapped in a tight <p></p> embrace by {knitr}. In general this works really well. However, I want to have more control over image formatting for {emayili}.
I’d like to have the <img> tags wrapped by <figure></figure>. It’d also be useful to have the option of inserting a <figcaption>. To support these requirements I added a {knitr} hook which adds in these tags. It responds to the following chunk parameters (all optional):
out.width — the figure width
fig.cap — the figure caption
fig.alt — the figure alternative text
fig.class — the class attached to the <figure> tag

First we’ll try out the figure caption with the following .Rmd file.
---
output: html_document
---

```{r fig.cap="Mercury Vapour"}
plot(pressure)
```
Rendering this file, attaching it to an envelope object and sending it with {emayili} yields the message below.
Nothing too exciting, just a caption neatly displayed below the plot.
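Under the hood, a {knitr} plot hook along the following lines could emit that markup. This is a hypothetical sketch of the idea, not {emayili}'s actual hook; x is the image path and options holds the chunk options.

```r
# Hypothetical sketch of a {knitr} plot hook that wraps the image in
# <figure>, with an optional class and <figcaption> (not the actual
# {emayili} implementation; fig.alt and out.width are omitted here).
figure_hook <- function(x, options) {
  class <- if (is.null(options$fig.class)) "" else
    sprintf(' class="%s"', options$fig.class)
  caption <- if (is.null(options$fig.cap)) "" else
    sprintf("<figcaption>%s</figcaption>", options$fig.cap)
  sprintf('<figure%s><img src="%s"/>%s</figure>', class, x, caption)
}

# It would be registered with knitr::knit_hooks$set(plot = figure_hook).
figure_hook("pressure.png", list(fig.cap = "Mercury Vapour"))
```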
Next let’s try attaching a class to the <figure> tag. We’ll use the CSS file figures.css. This file needs to be specified via the css_files argument to render().
body {
  margin: 30px 0px;
}

figure {
  margin-bottom: 20px;
}

figure.center {
  text-align: center;
}

figcaption {
  color: white;
  text-align: left;
  background-color: #9fa4a7;
  padding: 10px 20px;
}

figcaption::before {
  content: "Figure: ";
}
In the .Rmd file we use the fig.class parameter to apply the center class to the <figure>.
---
output: html_document
---

```{r out.width="75%", fig.class="center"}
plot(pressure)
```
And this is what the rendered message looks like.
The <img> is now centered in the <figure>.
Finally we’ll bring it all together by styling a figure caption.
---
output: html_document
---

```{r out.width="75%", fig.class="center", fig.cap="Mercury Vapour"}
plot(pressure)
```
And the result in your inbox.
Since many automated reports use figures, the ability to style those figures should be very useful.
The {emayili} package is developed & supported by Fathom Data.
Originally posted on Mango Solutions website
The Enterprise Applications of the R Language Conference (EARL) is a cross-sector conference focusing on the commercial use of the R programming language. The conference is dedicated to the real-world usage of R with some of the world’s leading practitioners. This year, it was held September 6-10, 2021.
Thank you to everyone who joined us for EARL 2021 – especially to all of the fantastic presenters! We were pleased to receive lots of really positive feedback from the online event and there are plenty of highlights to share.
Branka Subotic, NATS
It was great to kick off EARL 2021 with our first keynote of the day from Branka. She has worked for NATS since 2018 and is currently their Director of Analytics. Branka shared with us interesting ways to help teams to work together and also some unusual ways to upskill! Her talk was peppered with some videos showing us flight data and the impacts of Covid.
Chris Beeley, NHS – Stronger together, making healthcare open- building the NHS-R Community
We are always delighted to hear from the NHS at the EARL Conference and this year was no exception. We were treated to a passionate talk from Chris on how the NHS-R community has been built up over the years and how their conference has gone from strength to strength. We all know how supportive the R community can be, so it is great to see this in action.
Amit Kohli – Introduction to network analysis
Amit gave us an introduction to the principles of network analysis and shared several use-cases demonstrating their unique powers. Amit also included a fun way to interact with his talk with the use of a QR code – we can always rely on Amit to entertain us! Our team thought it was a really interesting topic and it felt accessible to those who perhaps don’t know much on the subject.
Emily Riederer, Capital One – How to make R packages part of your team
We loved Emily’s fun concept of making R packages a real part of your team and her use of code, and the choices she made along the way. Her talk examined how internal R packages can drive the most value for their organisation when they embrace an organisation’s context, as opposed to open source packages which thrive with increasing abstraction. Read our interview with Emily here.
Dr. Jacqueline Nolis, Saturn Cloud
We closed the day with our final keynote talk from Jacqueline Nolis. She is a data science leader with over 15 years of experience in managing data science teams and projects, at companies ranging from DSW to Airbnb. She is currently the Head of Data Science at Saturn Cloud, where she helps design products for data scientists. Jacqueline spoke to us about taking risks in your career and shared the various risks she has taken over her career and how they went! It was inspiring to hear from an experienced data scientist that it’s ok to take a risk every now and then – and refreshing to hear her honesty about what could have gone better – and how she has ultimately learned and grown from this.
These are just a few of the brilliant talks from a fantastic conference day. It was a delight to have speakers and attendees joining us from across the world – so thank you again to all that came along.
We are hoping to be back in London next year to host EARL in-person again. We are tentatively holding the 6th-8th of September 2022 as our conference dates. If you’d like to keep up-to-date on all things EARL please join our mailing list. We will open the call for abstracts in January 2022.
The post EARL ONLINE 2021: HIGHLIGHTS appeared first on R Consortium.
Data Science Conference (DSC) Austria is knocking on YOUR door, this time the theme is AI powered sustainability: Save the world through data! And the best is—we still have free tickets until Sept 25, so be quick!
DSC Austria will happen on September 27-28th and during the event, you will get a chance to listen to over 3 Keynotes, 25 high-quality talks and 6 tech tutorials on the topic of Sustainability, AI & ML, Data-Driven Decision Making and Data & AI Literacy—but that’s not all!
With the DSC Austria ticket you get:
Full access to DSC Austria 2021 talks and sessions
Entry to virtual networking sessions
Online certificate of attendance
Check it out and reserve your spot:
RESERVE FREE TICKET • CHECK FULL PROGRAM
Introducing the 30 Day Sustainability Data Challenge
As part of Quantargo’s Tech tutorial on Sept 27 at 9 AM CET we will start the 30 Day Sustainability Data Challenge. The challenge is inspired by the 30 Day Chart Challenge and asks participants to post interesting visualizations covering sustainability on Twitter. Anyone is welcome to contribute, no matter which data source or tool you use.
The only rules are:
You can also consider adding other hashtags like #rstats or #sustainability to reach more people.
At the end of the challenge we will sum the number of likes and retweets of each Twitter account that participated and posted according to the above guidelines. We will also post rankings as the challenge progresses. It is allowed and even encouraged to create scheduled Twitter bots using our Quantargo workspace (see next section).
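The scoring rule could be tallied with a few lines of base R (the tweet data below is made up purely for illustration):

```r
# Toy illustration of the challenge scoring: sum likes + retweets per
# account and rank the totals (the data here is invented for the example).
tweets <- data.frame(
  account  = c("@alice", "@bob", "@alice"),
  likes    = c(10, 4, 6),
  retweets = c(2, 1, 3)
)

totals <- aggregate(cbind(likes, retweets) ~ account, data = tweets, FUN = sum)
totals$score <- totals$likes + totals$retweets
totals[order(-totals$score), ]   # ranking, highest score first
```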
And here come the prizes:
Quantargo Workspace
In the tech tutorial during the conference on Sept 27 at 9 AM CET, we will also introduce the brand new scheduling and (encrypted) secrets features of the Quantargo workspace. With these new features it is very easy to create scheduled R Bots which tweet new messages at a specified time and interval. We will show some examples of how to create bots tweeting about sustainability.
Additionally, the workspace is great for seamlessly creating APIs. We will show an example covering an Airbnb dataset in Vienna to
So stay tuned and healthy, see you at the conference and happy to see your posts! #30DaySustainabilityDataChallenge.