(This article was first published on ** Deeply Trivial**, and kindly contributed to R-bloggers)

Two statistical indices crossed my inbox in the last week, both of which use fast food restaurants to measure a concept indirectly.

First up, in the wake of recent hurricanes, is the Waffle House Index. As *The Economist* explains:

Waffle House, a breakfast chain from the American South, is better known for reliability than quality. All its restaurants stay open every hour of every day. After extreme weather, like floods, tornados and hurricanes, Waffle Houses are quick to reopen, even if they can only serve a limited menu. That makes them a remarkably reliable if informal barometer for weather damage.

The index was invented in 2004 by Craig Fugate, a former director of the Federal Emergency Management Agency (FEMA), after a spate of hurricanes battered America’s east coast. “If a Waffle House is closed because there’s a disaster, it’s bad. We call it red. If they’re open but have a limited menu, that’s yellow,” he explained to NPR, America’s public radio network. Fully functioning restaurants mean that the Waffle House Index is shining green.

Next is the Big Mac Index, created by *The Economist*:

The Big Mac index was invented by The Economist in 1986 as a lighthearted guide to whether currencies are at their “correct” level. It is based on the theory of purchasing-power parity (PPP), the notion that in the long run exchange rates should move towards the rate that would equalise the prices of an identical basket of goods and services (in this case, a burger) in any two countries.

You might remember a discussion of the “basket of goods” in my post on the Consumer Price Index. And in fact, the Big Mac Index, which started as a way “to make exchange-rate theory more digestible,” has since become a global standard and is used in multiple studies. Now you can use it too, because the data and methodology have been made available on GitHub. R users will be thrilled to know that the code is written in R, but you’ll need to use a bit of Python to get at the Jupyter notebook they’ve put together. Fortunately, they’ve provided detailed information on installing and setting everything up.
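The PPP arithmetic behind the index is simple enough to sketch in a few lines of R. The prices below are made up for illustration, not taken from The Economist’s data:

```r
# Toy purchasing-power-parity calculation in the spirit of the Big Mac index.
# Both prices are invented for illustration only.
price_us <- 5.50   # hypothetical Big Mac price in the United States (dollars)
price_uk <- 3.25   # hypothetical Big Mac price in Britain (pounds)

# The implied PPP exchange rate: pounds per dollar that would equalise burger prices
implied_ppp <- price_uk / price_us

# Comparing implied_ppp to the actual exchange rate tells you whether the
# pound looks over- or undervalued against the dollar on this measure
implied_ppp
```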

To **leave a comment** for the author, please follow the link and comment on their blog: ** Deeply Trivial**.


(This article was first published on ** Econometrics and Free Software**, and kindly contributed to R-bloggers)

In this blog post, similar to a previous blog post, I am going to show you how we can go from an Excel workbook that contains data to a flat file. I will take advantage of the structure of the tables inside the Excel sheets by writing a function that extracts the tables, and then map it over each sheet!

Last week, on October 14th, Luxembourgish nationals went to the polls to elect the Grand Duke! No, actually, the Grand Duke does not get elected. But Luxembourgish citizens did go to the polls to elect the new members of the Chamber of Deputies (a sort of parliament, if you will).

The way elections work in Luxembourg is quite interesting: you can vote for a party, or vote for individual candidates from different parties. The candidates that get the most votes will then sit in the parliament. If you vote for a whole party, each of its candidates gets a vote. You get as many votes as there are candidates to vote for. So, for example, if you live in the capital city, also called Luxembourg, you get 21 votes to distribute. You could decide to give 10 votes to 10 candidates of party A and 11 to 11 candidates of party B.

Why 21 votes? The Chamber of Deputies is made up of 60 deputies, and the country is divided into four legislative circonscriptions. Each voter in a circonscription gets a number of votes that is proportional to the population size of that circonscription.
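The 21 votes of a Luxembourg City voter correspond to the 21 seats of the Centre circonscription. Using the seat counts that show up later in the post (Centre 21, Nord 9, Sud 23, Est 7), the arithmetic checks out:

```r
# Seats per circonscription (figures consistent with the workbook used below)
seats <- c(Centre = 21, Nord = 9, Sud = 23, Est = 7)

sum(seats)        # the 60 deputies of the Chamber
seats[["Centre"]] # a voter in Luxembourg City gets this many votes
```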

Now you are certainly wondering why I put the flag of The Gambia at the top of this post. This is because the government that was formed after the 2013 elections was made up of a coalition of 3 parties: the Luxembourg Socialist Workers’ Party (LSAP), the Democratic Party (DP) and The Greens. The LSAP got 13 seats in the Chamber, the DP got 13 and The Greens 6, meaning 32 seats out of 60. Because they formed this coalition, they could form the government, and the coalition was named the Gambia coalition after the colors of these 3 parties: red, blue and green. If you want to take a look at the ballot from 2013 for the southern circonscription, click here.

Now that you have the context, we can go back to some data science. The results of last week’s elections can be found on Luxembourg’s Open Data portal, right here. The data is trapped inside Excel sheets; just like I explained in a previous blog post, the data is easily read by humans, but not easily digested by any type of data analysis software. So I am going to show you how to go from this big Excel workbook to a flat file.

First of all, if you open the Excel workbook, you will notice that there are a lot of sheets: there is one for the whole country, named “Le Grand-Duché de Luxembourg”, one for each of the four circonscriptions, “Centre”, “Nord”, “Sud” and “Est”, and 102 more for each **commune** of the country (a commune is an administrative division). However, the tables are all very similarly shaped, and at roughly the same position.

This is good, because we can write a function to extract the data and then map it over all the sheets. First, let’s load some packages and the data for the country:

```
library("tidyverse")
library("tidyxl")
library("brotools")
```

```
# National Level 2018
elections_raw_2018 <- xlsx_cells("leg-2018-10-14-22-58-09-737.xlsx",
                                 sheets = "Le Grand-Duché de Luxembourg")
```

`{brotools}` is my own package. You can install it with:

`devtools::install_github("b-rodrigues/brotools")`

It contains a function that I will use down below. The function I wrote to extract the tables is not very complex, but requires that you are familiar with how `{tidyxl}` imports Excel workbooks. So if you are not familiar with it, study the imported data frame for a few minutes. It will make understanding the next function easier:

```
extract_party <- function(dataset, starting_col, target_rows){

  almost_clean <- dataset %>%
    filter(row %in% target_rows) %>%
    filter(col %in% c(starting_col, starting_col + 1)) %>%
    select(character, numeric) %>%
    fill(numeric, .direction = "up") %>%
    filter(!is.na(character))

  party_name <- almost_clean$character[1] %>%
    str_split("-", simplify = TRUE) %>%
    .[2] %>%
    str_trim()

  almost_clean$character[1] <- "Pourcentage"
  almost_clean$party <- party_name
  colnames(almost_clean) <- c("Variables", "Values", "Party")

  almost_clean %>%
    mutate(Year = 2018) %>%
    select(Party, Year, Variables, Values)
}
```

This function has three arguments: `dataset`, `starting_col` and `target_rows`. `dataset` is the data I loaded with `xlsx_cells` from the `{tidyxl}` package. I think the following picture easily illustrates what the function does:

So the function first filters only the rows we are interested in, then the columns. I then select the columns I want, which are called `character` and `numeric` (if an Excel cell contains characters you will find them in the `character` column; if it contains numbers you will find them in the `numeric` column). Then I fill the empty cells with the values from the `numeric` column and remove the NA’s. These last two steps might not be so clear; this is how the data looks up until the `select()` function:

```
> elections_raw_2018 %>%
+   filter(row %in% seq(11,19)) %>%
+   filter(col %in% c(1, 2)) %>%
+   select(character, numeric)
# A tibble: 18 x 2
   character                    numeric
   <chr>                          <dbl>
 1 1 - PIRATEN - PIRATEN             NA
 2 NA                            0.0645
 3 Suffrage total                    NA
 4 NA                            227549
 5 Suffrages de liste                NA
 6 NA                            181560
 7 Suffrage nominatifs               NA
 8 NA                             45989
 9 Pourcentage pondéré               NA
10 NA                            0.0661
11 Suffrage total pondéré            NA
12 NA                            13394.
13 Suffrages de liste pondéré        NA
14 NA                             10308
15 Suffrage nominatifs pondéré       NA
16 NA                             3086.
17 Mandats attribués                 NA
18 NA                                 2
```

So by filling the NA’s in the `numeric` column, the data now looks like this:

```
> elections_raw_2018 %>%
+   filter(row %in% seq(11,19)) %>%
+   filter(col %in% c(1, 2)) %>%
+   select(character, numeric) %>%
+   fill(numeric, .direction = "up")
# A tibble: 18 x 2
   character                    numeric
   <chr>                          <dbl>
 1 1 - PIRATEN - PIRATEN         0.0645
 2 NA                            0.0645
 3 Suffrage total                227549
 4 NA                            227549
 5 Suffrages de liste            181560
 6 NA                            181560
 7 Suffrage nominatifs            45989
 8 NA                             45989
 9 Pourcentage pondéré           0.0661
10 NA                            0.0661
11 Suffrage total pondéré        13394.
12 NA                            13394.
13 Suffrages de liste pondéré     10308
14 NA                             10308
15 Suffrage nominatifs pondéré    3086.
16 NA                             3086.
17 Mandats attribués                  2
18 NA                                 2
```

And then I filter out the NA’s from the character column, and that’s almost it! I simply need to add a new column with the party’s name and rename the other columns. I also add a “Year” column.
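The fill-then-filter move is the heart of the function. Here is the same idea in base R on a toy two-column extract; `tidyr::fill(numeric, .direction = "up")` does the filling step in one call:

```r
# Toy version of the fill-up-then-drop-NA trick used in extract_party()
df <- data.frame(
  character = c("Suffrage total", NA, "Suffrages de liste", NA),
  numeric   = c(NA, 227549, NA, 181560)
)

# Fill numeric upwards: each NA takes the first non-NA value below it
for (i in rev(seq_len(nrow(df) - 1))) {
  if (is.na(df$numeric[i])) df$numeric[i] <- df$numeric[i + 1]
}

# Drop the rows that carry no label; each label now sits next to its value
df[!is.na(df$character), ]
```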

Now, each party has a different starting column. The table with the data for the first party starts in column 1, the second party’s in column 4, the third party’s in column 7… So the following vector contains all the starting columns:

`position_parties_national <- seq(1, 24, by = 3)`

(If you study the Excel workbook closely, you will notice that I do not extract the last two parties. This is because these parties were not present in all 4 circonscriptions and are very, very, very small.)
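That makes eight starting columns in total, which squares with the 72 observations in the final national table (8 parties × 9 target rows):

```r
# Eight starting columns, one per extracted party
position_parties_national <- seq(1, 24, by = 3)
position_parties_national

# Eight parties times nine rows per party table
length(position_parties_national) * 9
```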

The target rows are always the same, from 11 to 19. Now, I simply need to map this function over this list of positions and I get the data for all the parties:

```
elections_national_2018 <- map_df(position_parties_national, extract_party,
                                  dataset = elections_raw_2018,
                                  target_rows = seq(11, 19)) %>%
  mutate(locality = "Grand-Duchy of Luxembourg", division = "National")
```

I also added the `locality` and `division` columns to the data. Let’s take a look:

`glimpse(elections_national_2018)`

```
## Observations: 72
## Variables: 6
## $ Party     <chr> "PIRATEN", "PIRATEN", "PIRATEN", "PIRATEN", "PIRATEN...
## $ Year      <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018...
## $ Variables <chr> "Pourcentage", "Suffrage total", "Suffrages de liste...
## $ Values    <dbl> 6.446204e-02, 2.275490e+05, 1.815600e+05, 4.598900e+...
## $ locality  <chr> "Grand-Duchy of Luxembourg", "Grand-Duchy of Luxembo...
## $ division  <chr> "National", "National", "National", "National", "Nat...
```
Very nice.

Now we need to do the same for the 4 electoral circonscriptions. First, let’s load the data:

```
# Electoral districts 2018
districts <- c("Centre", "Nord", "Sud", "Est")

elections_district_raw_2018 <- xlsx_cells("leg-2018-10-14-22-58-09-737.xlsx",
                                          sheets = districts)
```

Now things get trickier. Remember I said that the number of seats is proportional to the population of each circonscription? This means we can’t simply use the same target rows as before. For example, for the “Centre” circonscription the target rows go from 12 to 37, but for the “Est” circonscription only from 12 to 23. Ideally, we would have a function that returns the target rows. This is that function:

```
# The target rows I need to extract are different from district to district
get_target_rows <- function(dataset, sheet_to_extract, reference_address){

  last_row <- dataset %>%
    filter(sheet == sheet_to_extract) %>%
    filter(address == reference_address) %>%
    pull(numeric)

  seq(12, (11 + 5 + last_row))
}
```

This function needs a `dataset`, a `sheet_to_extract` and a `reference_address`. The reference address is a cell that contains the number of seats in that circonscription, in our case “B5”. We can now easily get the list of target rows:
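Before running it, the row arithmetic can be sanity-checked by hand: the tables start at row 12 and span 5 extra rows plus one row per seat, so the four seat counts (Centre 21, Nord 9, Sud 23, Est 7, consistent with the output below) should reproduce the target ranges:

```r
# Sanity check of seq(12, 11 + 5 + last_row) against the four seat counts
seats <- c(Centre = 21, Nord = 9, Sud = 23, Est = 7)
lapply(seats, function(s) range(seq(12, 11 + 5 + s)))
# Centre should span rows 12 to 37, Est rows 12 to 23
```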

```
# Get the target rows
list_targets <- map(districts, get_target_rows,
                    dataset = elections_district_raw_2018,
                    reference_address = "B5")
list_targets
```

```
## [[1]]
## [1] 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [24] 35 36 37
##
## [[2]]
## [1] 12 13 14 15 16 17 18 19 20 21 22 23 24 25
##
## [[3]]
## [1] 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [24] 35 36 37 38 39
##
## [[4]]
## [1] 12 13 14 15 16 17 18 19 20 21 22 23
```

Now, let’s split the data we imported into a list, where each element is a data frame with the data from one circonscription:

`list_data_districts <- map(districts, ~filter(.data = elections_district_raw_2018, sheet == .)) `

Now I can easily map the function I defined above, `extract_party`, over this list of datasets. Well, I say easily, but it’s a bit more complicated than before, because I now have a list of datasets and a list of target rows:

```
elections_district_2018 <- map2(.x = list_data_districts, .y = list_targets,
                                ~map_df(position_parties_national, extract_party,
                                        dataset = .x, target_rows = .y))
```

The way to understand this is that for each element of `list_data_districts` and `list_targets`, I map `extract_party` over each element of `position_parties_national`. This gives the intended result:

`elections_district_2018`

```
## [[1]]
## # A tibble: 208 x 4
##    Party   Year Variables             Values
##    <chr>  <dbl> <chr>                  <dbl>
## 1 PIRATEN 2018 Pourcentage 0.0514
## 2 PIRATEN 2018 CLEMENT Sven (1) 8007
## 3 PIRATEN 2018 WEYER Jerry (2) 3446
## 4 PIRATEN 2018 CLEMENT Pascal (3) 3418
## 5 PIRATEN 2018 KUNAKOVA Lucie (4) 2860
## 6 PIRATEN 2018 WAMPACH Jo (14) 2693
## 7 PIRATEN 2018 LAUX Cynthia (6) 2622
## 8 PIRATEN 2018 ISEKIN Christian (5) 2610
## 9 PIRATEN 2018 SCHWEICH Georges (9) 2602
## 10 PIRATEN 2018 LIESCH Mireille (8) 2551
## # ... with 198 more rows
##
## [[2]]
## # A tibble: 112 x 4
##    Party   Year Variables             Values
##    <chr>  <dbl> <chr>                  <dbl>
## 1 PIRATEN 2018 Pourcentage 0.0767
## 2 PIRATEN 2018 COLOMBERA Jean (2) 5074
## 3 PIRATEN 2018 ALLARD Ben (1) 4225
## 4 PIRATEN 2018 MAAR Andy (3) 2764
## 5 PIRATEN 2018 GINTER Joshua (8) 2536
## 6 PIRATEN 2018 DASBACH Angelika (4) 2473
## 7 PIRATEN 2018 GRÜNEISEN Sam (6) 2408
## 8 PIRATEN 2018 BAUMANN Roy (5) 2387
## 9 PIRATEN 2018 CONRAD Pierre (7) 2280
## 10 PIRATEN 2018 TRAUT ép. MOLITOR Angela Maria (9) 2274
## # ... with 102 more rows
##
## [[3]]
## # A tibble: 224 x 4
##    Party   Year Variables             Values
##    <chr>  <dbl> <chr>                  <dbl>
## 1 PIRATEN 2018 Pourcentage 0.0699
## 2 PIRATEN 2018 GOERGEN Marc (1) 9818
## 3 PIRATEN 2018 FLOR Starsky (2) 6737
## 4 PIRATEN 2018 KOHL Martine (3) 6071
## 5 PIRATEN 2018 LIESCH Camille (4) 6025
## 6 PIRATEN 2018 KOHL Sylvie (6) 5628
## 7 PIRATEN 2018 WELTER Christian (5) 5619
## 8 PIRATEN 2018 DA GRAÇA DIAS Yanick (10) 5307
## 9 PIRATEN 2018 WEBER Jules (7) 5301
## 10 PIRATEN 2018 CHMELIK Libor (8) 5247
## # ... with 214 more rows
##
## [[4]]
## # A tibble: 96 x 4
##    Party   Year Variables             Values
##    <chr>  <dbl> <chr>                  <dbl>
## 1 PIRATEN 2018 Pourcentage 0.0698
## 2 PIRATEN 2018 FRÈRES Daniel (1) 4152
## 3 PIRATEN 2018 CLEMENT Jill (7) 1943
## 4 PIRATEN 2018 HOUDREMONT Claire (2) 1844
## 5 PIRATEN 2018 BÖRGER Nancy (3) 1739
## 6 PIRATEN 2018 MARTINS DOS SANTOS Catarina (6) 1710
## 7 PIRATEN 2018 BELLEVILLE Tatjana (4) 1687
## 8 PIRATEN 2018 CONTRERAS Gerald (5) 1687
## 9 PIRATEN 2018 Suffrages total 14762
## 10 PIRATEN 2018 Suffrages de liste 10248
## # ... with 86 more rows
```
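The nested `map2()`-plus-`map_df()` pattern takes a moment to digest. Stripped of the election specifics, the same two-list pattern can be mimicked with base R’s `Map()` and `sapply()`; the function and data below are toys standing in for `extract_party` and the real datasets:

```r
# Toy stand-in for extract_party: combine a position with data at target rows
extract_toy <- function(position, dataset, target_rows) {
  sum(dataset[target_rows]) + position
}

positions <- c(1, 4, 7)            # stand-in for position_parties_national
datasets  <- list(1:10, 101:110)   # stand-in for the list of district datasets
targets   <- list(1:2, 3:4)        # stand-in for the list of target rows

# For each (dataset, targets) pair, map extract_toy over all positions
res <- Map(function(d, t) sapply(positions, extract_toy,
                                 dataset = d, target_rows = t),
           datasets, targets)
res
```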

I now need to add the `locality` and `division` columns:

```
elections_district_2018 <- map2(.y = elections_district_2018, .x = districts,
                                ~mutate(.y, locality = .x,
                                        division = "Electoral district")) %>%
  bind_rows()
```

We’re almost done! Now we need to do the same for the 102 remaining sheets, one for each **commune** of Luxembourg. This will go very fast, because we have all the building blocks from before:

```
communes <- xlsx_sheet_names("leg-2018-10-14-22-58-09-737.xlsx")

communes <- communes %-l%
  c("Le Grand-Duché de Luxembourg", "Centre", "Est", "Nord", "Sud", "Sommaire")
```

Let me introduce the following function: `%-l%`. This function removes elements from lists:

`c("a", "b", "c", "d") %-l% c("a", "d")`

`## [1] "b" "c"`

You can think of it as “minus for lists”; it is what is called an infix operator. This function is very useful for getting the list of communes, and is part of my package, `{brotools}`.
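For readers who don’t want to install `{brotools}`, a minimal base-R version of such an operator could look like this (a sketch; the actual implementation in the package may differ):

```r
# A hypothetical, minimal "minus for lists" infix operator
`%-l%` <- function(a, b) a[!(a %in% b)]

c("a", "b", "c", "d") %-l% c("a", "d")
# [1] "b" "c"
```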

As before, I load the data:

```
elections_communes_raw_2018 <- xlsx_cells("leg-2018-10-14-22-58-09-737.xlsx",
sheets = communes)
```

Then I get my list of target rows, but I need to change the reference address: it’s “B8” now, not “B5”.

```
# Get the target rows
list_targets <- map(communes, get_target_rows,
                    dataset = elections_communes_raw_2018,
                    reference_address = "B8")
```

I now create a list of communes by mapping a filter function to the data:

`list_data_communes <- map(communes, ~filter(.data = elections_communes_raw_2018, sheet == .)) `

And just as before, I get the data I need by using `extract_party`, and add the “locality” and “division” columns:

```
elections_communes_2018 <- map2(.x = list_data_communes, .y = list_targets,
                                ~map_df(position_parties_national, extract_party,
                                        dataset = .x, target_rows = .y))

elections_communes_2018 <- map2(.y = elections_communes_2018, .x = communes,
                                ~mutate(.y, locality = .x,
                                        division = "Commune")) %>%
  bind_rows()
```

The steps are so similar for the four circonscriptions and for the 102 **communes** that I could have written a big wrapper function and then used it for the circonscriptions and **communes** at once. But I was lazy.

Finally, I bind everything together and have a nice, tidy, flat file:

```
# Final results
elections_2018 <- bind_rows(list(elections_national_2018, elections_district_2018, elections_communes_2018))
glimpse(elections_2018)
```

```
## Observations: 15,544
## Variables: 6
## $ Party     <chr> "PIRATEN", "PIRATEN", "PIRATEN", "PIRATEN", "PIRATEN...
## $ Year      <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018...
## $ Variables <chr> "Pourcentage", "Suffrage total", "Suffrages de liste...
## $ Values    <dbl> 6.446204e-02, 2.275490e+05, 1.815600e+05, 4.598900e+...
## $ locality  <chr> "Grand-Duchy of Luxembourg", "Grand-Duchy of Luxembo...
## $ division  <chr> "National", "National", "National", "National", "Nat...
```

This blog post is already quite long, so now that R can easily ingest the data, I will analyze it in a future blog post. If you found this blog post useful, you might want to follow me on Twitter for blog post updates.


(This article was first published on ** CillianMacAodh**, and kindly contributed to R-bloggers)

It has been quite a while since I posted, but I haven’t been idle, I completed my PhD since the last post, and I’m due to graduate next Thursday. I am also delighted to have recently been added to R-bloggers.com so I’m keen to get back into it.

I have already written 2 posts about writing functions, and I will try to diversify my content. That said, I won’t refrain from sharing something that has been helpful to me. The function(s) I describe in this post is an artefact left over from before I started using R Markdown. It is a product of its time, but may still be of use to people who haven’t switched to R Markdown yet. It is a lazy (and quite imperfect) solution to a tedious task.

At the time I wrote this function I was using R for my statistics and Libreoffice for writing. I would run a test in R and then write it up in Libreoffice. Each value that needed reporting had to be transferred from my R output to Libreoffice – and for each test there are a number of values that need reporting. Writing up these tests is pretty formulaic. There’s a set structure to the sentence, for example writing up a t-test with a significant result nearly always looks something like this:

An independent samples t-test revealed a significant difference in X between the Y sample, (*M* = [ ], *SD* = [ ]), and the Z sample, (*M* = [ ], *SD* = [ ]), *t*([df]) = [ ], *p* = [ ].

And the write up of a non-significant result looks something like this:

An independent samples t-test revealed no significant difference in X between the Y sample, (*M* = [ ], *SD* = [ ]), and the Z sample, (*M* = [ ], *SD* = [ ]), *t*([df]) = [ ], *p* = [ ].

Seven values (the square [ ] brackets) need to be reported for this single test. Whether you copy and paste or type each value, the reporting of such tests can be very tedious, and leave you prone to errors in reporting.

In order to make reporting values easier (and more accurate) I wrote the `t_paragraph()` function (and the related `t_paired_paragraph()` function). This provided an output that I could copy and paste into a Word (LibreOffice) document. This function is part of the `desnum`^{1} package (McHugh, 2017).

**The `t_paragraph()` Function**

The `t_paragraph()` function runs a t-test and generates an output that can be copied and pasted into a word document. The code for the function is as follows:

```
# Create the function t_paragraph with arguments x, y, and measure
# x is the dependent variable
# y is the independent (grouping) variable
# measure is the name of the dependent variable, inputted as a string
t_paragraph <- function(x, y, measure){

  # Run a t-test and store it as an object t
  t <- t.test(x ~ y)

  # If your grouping variable has labelled levels, the next line will
  # store them for reporting at a later stage
  labels <- levels(y)

  # Create an object for each value to be reported
  tsl <- as.vector(t$statistic)
  ts <- round(tsl, digits = 3)
  tpl <- as.vector(t$p.value)
  tp <- round(tpl, digits = 3)
  d_fl <- as.vector(t$parameter)
  d_f <- round(d_fl, digits = 2)
  ml <- as.vector(tapply(x, y, mean))
  m <- round(ml, digits = 2)
  sdl <- as.vector(tapply(x, y, sd))
  sd <- round(sdl, digits = 2)

  # Use print(paste0()) to combine the objects above and create two potential
  # outputs; which one is generated depends on the result of the test

  # wording if a significant difference is observed
  if (tp < 0.05)
    print(paste0("An independent samples t-test revealed a significant difference in ",
                 measure, " between the ", labels[1], " sample, (M = ",
                 m[1], ", SD = ", sd[1], "), and the ", labels[2],
                 " sample, (M =", m[2], ", SD =", sd[2], "), t(",
                 d_f, ") = ", ts, ", p = ", tp, "."), quote = FALSE,
          digits = 2)

  # wording if no significant difference is observed
  if (tp > 0.05)
    print(paste0("An independent samples t-test revealed no difference in ",
                 measure, " between the ", labels[1], " sample, (M = ",
                 m[1], ", SD = ", sd[1], "), and the ", labels[2],
                 " sample, (M = ", m[2], ", SD =", sd[2], "), t(",
                 d_f, ") = ", ts, ", p = ", tp, "."), quote = FALSE,
          digits = 2)
}
```

When using `t_paragraph()`, `x` is your DV, `y` is your grouping variable, and `measure` is a string giving the name of the dependent variable. To illustrate the function I’ll use the `mtcars` dataset.

**Illustrating the `t_paragraph()` Function**

The `mtcars` dataset comes with R. For information on it, simply type `help(mtcars)`. The variables of interest here are `am` (transmission; 0 = automatic, 1 = manual), `mpg` (miles per gallon), and `qsec` (1/4 mile time). The two questions I’m going to look at are:

- Is there a difference in miles per gallon depending on transmission?
- Is there a difference in 1/4 mile time depending on transmission?

Before running the test it is a good idea to look at the data^{2}. Because we’re going to look at differences between groups, we want to run descriptives for each group separately. To do this I’m going to combine the `descriptives()` function, which I previously covered here (also part of the `desnum` package), and the `tapply()` function.

The `tapply()` function allows you to run a function on subsets of a dataset using a grouping variable (or index). The arguments are as follows: `tapply(vector, index, function)`. `vector` is the variable you want to pass through `function`, and `index` is the grouping variable. The examples below will make this clearer.
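In its simplest form, with base R’s `mean` (`mtcars` ships with R, so this runs as-is):

```r
# Mean miles per gallon within each transmission group (0 = automatic, 1 = manual)
tapply(mtcars$mpg, mtcars$am, mean)
# automatics average ~17.15 mpg, manuals ~24.39 mpg
```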

We want to run descriptives on `mtcars$mpg` and on `mtcars$qsec`, and for each we want to group by transmission (`mtcars$am`). This can be done using `tapply()` and `descriptives()` together as follows:

`tapply(mtcars$mpg, mtcars$am, descriptives)`

```
## $`0`
## mean sd min max len
## 1 17.14737 3.833966 10.4 24.4 19
##
## $`1`
## mean sd min max len
## 1 24.39231 6.166504 15 33.9 13
```

Recall that 0 = automatic, and 1 = manual. Replace `mpg` with `qsec` and run again:

`tapply(mtcars$qsec, mtcars$am, descriptives)`

```
## $`0`
## mean sd min max len
## 1 18.18316 1.751308 15.41 22.9 19
##
## $`1`
## mean sd min max len
## 1 17.36 1.792359 14.5 19.9 13
```

**Using `t_paragraph()`**

Now that we know the values for automatic vs manual cars, we can run our t-tests using `t_paragraph()`. Our first question:

Is there a difference in miles per gallon depending on transmission?

`t_paragraph(mtcars$mpg, mtcars$am, "miles per gallon")`

`## [1] An independent samples t-test revealed a significant difference in miles per gallon between the sample, (M = 17.15, SD = 3.83), and the sample, (M =24.39, SD =6.17), t(18.33) = -3.767, p = 0.001.`

There is a difference, and the output above can be copied and pasted into a word document with minimal changes required.

Our second question was:

Is there a difference in 1/4 mile time depending on transmission?

`t_paragraph(mtcars$qsec, mtcars$am, "quarter-mile time")`

`## [1] An independent samples t-test revealed no difference in quarter-mile time between the sample, (M = 18.18, SD = 1.75), and the sample, (M = 17.36, SD =1.79), t(25.53) = 1.288, p = 0.209.`

This time there was no significant difference, and again the output can be copied and pasted into word with minimal changes.

The function described was written a long time ago and could be updated; however, I no longer copy and paste into Word (having switched to R Markdown instead). The reporting of the p value is not always to APA standards: if p is < .001, then “p < .001” is what should be reported, not the exact value. The code for `t_paragraph()` could be updated to include the `p_report` function (described here), which would address this. Another limitation is that the formatting of the text isn’t perfect: the letters (N, M, SD, t, p) should all be italicised. But having to manually fix this formatting is still easier than manually transferring individual values.

Despite these limitations, the functions `t_paragraph()` and `t_paired_paragraph()`^{3} have made my life easier, and I still use them occasionally. I hope they can be of use to anyone who is using R but has not switched to R Markdown yet.

McHugh, C. (2017). *Desnum: Creates some useful functions*.

- To install `desnum`, just run `devtools::install_github("cillianmiltown/R_desnum")`.
- In this case this is particularly useful because there are no value labels for `mtcars$am`, so it won’t be clear from the output which values refer to the automatic group and which refer to the manual group. Running descriptives will help with this.
- If you want to see the code for `t_paired_paragraph()`, just load `desnum` and run `t_paired_paragraph` (without parentheses).


(This article was first published on ** R-posts.com**, and kindly contributed to R-bloggers)

In 2016, 2.1 million Americans were found to have an opioid use disorder (according to SAMHSA), with drug overdose now the leading cause of injury and death in the United States. But some of the country’s top minds are working to fight this epidemic, and statisticians are helping to lead the charge.

In *This is Statistics*’ second annual fall data challenge, high school and undergraduate students will use statistics to analyze data and develop recommendations to help address this important public health crisis.

The contest invites teams of two to five students to put their statistical and data visualization skills to work using the Centers for Disease Control and Prevention (CDC)’s Multiple Cause of Death (Detailed Mortality) data set, and contribute to creating healthier communities. Given the size and complexity of the CDC dataset, programming languages such as R can be used to manipulate and conduct analysis effectively.

Each submission will consist of a short essay and presentation of recommendations. Winners will be awarded for best overall analysis, best visualization and best use of external data. Submissions are due November 12, 2018.

If you or a student you know is interested in participating, get full contest details here.

Teachers, get resources about how to engage your students in the contest here.


(This article was first published on ** R-english – Freakonometrics**, and kindly contributed to R-bloggers)

Some pre-Halloween post today. It started while I was in Barcelona: the kids wanted to go back to a store we had seen on the first day, in the gothic quarter, and I could not remember where it was. I said to myself that it would take quite a while to walk every street of the neighborhood, and then I discovered that this is actually an old problem. In 1962, Meigu Guan was interested in a postman delivering mail to a number of streets such that the total distance walked by the postman was as short as possible. How could the postman ensure that the distance walked was a minimum?

A very close notion is the concept of a **traversable graph**, which is one that can be drawn without taking the pen from the paper and without retracing the same edge. In such a case the graph is said to have an **Eulerian trail** (yes, from Euler’s bridges problem). An Eulerian trail uses all the edges of a graph. For a graph to be Eulerian, that is, for the trail to return to its starting point, **all the vertices must be of even degree** (an open Eulerian trail allows exactly two vertices of odd degree).

An algorithm for finding an optimal Chinese postman route is:

- List all odd vertices.
- List all possible pairings of odd vertices.
- For each pairing find the edges that connect the vertices with the minimum weight.
- Find the pairings such that the sum of the weights is minimised.
- On the original graph add the edges that have been found in Step 4.
- The length of an optimal Chinese postman route is the sum of the weights of all the edges of the original graph, added to the total found in Step 4.
- A route corresponding to this minimum weight can then be easily found.
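Step 1 is easy to check on the 12-node network used below: counting endpoint occurrences in the edge list gives the degrees, and ten vertices turn out to be odd, which is why five extra edges get added later on:

```r
# Degrees of the 12-node network defined further down, from its edge list
edges <- c(1,2, 1,3, 2,4, 2,5, 1,5, 3,5, 4,7, 5,7, 5,8,
           3,6, 6,8, 6,9, 9,11, 8,11, 8,10, 8,12, 7,10,
           10,12, 11,12)
deg <- table(edges)

# Step 1: list all odd vertices
odd <- as.integer(names(deg)[deg %% 2 == 1])
odd   # ten odd vertices: pairing them up requires five extra edges
```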

For the first steps, we can use the code from Hurley & Oldford’s *Eulerian tour algorithms for data visualisation* and the PairViz package. First, we have to load some R packages:

```
require(igraph)
require(graph)
require(eulerian)
require(GA)
```

Then we use the following function, from Stack Overflow:

```
make_eulerian = function(graph){
  info = c("broken" = FALSE, "Added" = 0, "Successfull" = TRUE)
  is.even = function(x){ x %% 2 == 0 }
  search.for.even.neighbor = !is.even(sum(!is.even(degree(graph))))
  for(i in V(graph)){
    set.j = NULL
    uneven.neighbors = !is.even(degree(graph, neighbors(graph, i)))
    if(!is.even(degree(graph, i))){
      if(sum(uneven.neighbors) == 0){
        if(sum(!is.even(degree(graph))) > 0){
          info["Broken"] = TRUE
          uneven.candidates <- !is.even(degree(graph, V(graph)))
          if(sum(uneven.candidates) != 0){
            set.j <- V(graph)[uneven.candidates][[1]]
          }else{
            info["Successfull"] <- FALSE
          }
        }
      }else{
        set.j <- neighbors(graph, i)[uneven.neighbors][[1]]
      }
    }else if(search.for.even.neighbor == TRUE & is.null(set.j)){
      info["Added"] <- info["Added"] + 1
      set.j <- neighbors(graph, i)[ !uneven.neighbors ][[1]]
      if(!is.null(set.j)){ search.for.even.neighbor <- FALSE }
    }
    if(!is.null(set.j)){
      if(i != set.j){
        graph <- add_edges(graph, edges = c(i, set.j))
        info["Added"] <- info["Added"] + 1
      }
    }
  }
  (list("graph" = graph, "info" = info))
}
```

Then, consider some network, with 12 nodes

```
g1 = graph(c(1,2, 1,3, 2,4, 2,5, 1,5, 3,5, 4,7, 5,7, 5,8,
             3,6, 6,8, 6,9, 9,11, 8,11, 8,10, 8,12, 7,10,
             10,12, 11,12), directed = FALSE)
```

To plot that network, use

```
V(g1)$name = LETTERS[1:12]
V(g1)$color = rgb(0, 0, 1, .4)
ly = layout.kamada.kawai(g1)
plot(g1, vertex.color = V(g1)$color, layout = ly)
```

Then we convert it to a traversable graph by adding five edges (duplicates of existing ones):

```r
eulerian = make_eulerian(g1)
eulerian$info
##      broken       Added Successfull 
##           0           5           1 
g = eulerian$graph
```

as shown below

```r
ly = layout.kamada.kawai(g)
plot(g, vertex.color = V(g)$color, layout = ly)  # (fixed: newg does not exist yet)
```

We then split each of those five duplicated edges in two, which adds five artificial nodes:

```r
A  = as.matrix(as_adj(g))
A1 = as.matrix(as_adj(g1))
newA = lower.tri(A, diag = FALSE) * A1 + upper.tri(A, diag = FALSE) * A
for(i in 1:sum(newA == 2)) newA = cbind(newA, 0)
for(i in 1:sum(newA == 2)) newA = rbind(newA, 0)
s = nrow(A)
for(i in 1:nrow(A)){
  Aj = which(newA[i, ] == 2)  # which() returns integer(0) when empty, never NULL
  for(j in Aj){
    newA[i, s+1] = newA[s+1, i] = 1
    newA[j, s+1] = newA[s+1, j] = 1
    newA[i, j] = 1
    s = s + 1
  }
}
```

We get the following graph, where every node now has even degree!

```r
newg = graph_from_adjacency_matrix(newA)
newg = as.undirected(newg)
V(newg)$name = LETTERS[1:17]
V(newg)$color = c(rep(rgb(0, 0, 1, .4), 12), rep(rgb(1, 0, 0, .4), 5))
ly2 = ly
transl = cbind(c(0, 0, 0, .2, 0), c(.2, -.2, -.2, 0, -.2))
for(i in 13:17){
  j = which(newA[i, ] > 0)
  lc = ly[j, ]
  ly2 = rbind(ly2, apply(lc, 2, mean) + transl[i-12, ])
}
plot(newg, layout = ly2)
```

Our network is now the following (the new nodes are drawn small because they do not really matter; they are only there for computational reasons):

```r
plot(newg, vertex.color = V(newg)$color, layout = ly2,
     vertex.size = c(rep(20, 12), rep(0, 5)),
     vertex.label.cex = c(rep(1, 12), rep(.1, 5)))
```

Now we can get the optimal path

```r
n <- LETTERS[1:nrow(newA)]
g_2 <- new("graphNEL", nodes = n)
for(i in 1:nrow(newA)){
  for(j in which(newA[i, ] > 0)){
    g_2 <- addEdge(n[i], n[j], g_2, 1)
  }
}
etour(g_2, weighted = FALSE)
##  [1] "A" "B" "D" "G" "E" "A" "C" "E" "H" "F" "I" "K" "H" "J" "G" "P" "J" "L" "K" "Q" "L" "H" "O" "F" "C"
## [26] "N" "E" "B" "M" "A"
```

or

```r
edg = attr(E(newg), "vnames")
ET = etour(g_2, weighted = FALSE)
parcours = trajet = rep(NA, length(ET)-1)
for(i in 1:length(parcours)){
  u = c(ET[i], ET[i+1])
  ou = order(u)
  parcours[i] = paste(u[ou[1]], u[ou[2]], sep = "|")
  trajet[i] = which(edg == parcours[i])
}
parcours
##  [1] "A|B" "B|D" "D|G" "E|G" "A|E" "A|C" "C|E" "E|H" "F|H" "F|I" "I|K" "H|K" "H|J" "G|J" "G|P" "J|P"
## [17] "J|L" "K|L" "K|Q" "L|Q" "H|L" "H|O" "F|O" "C|F" "C|N" "E|N" "B|E" "B|M" "A|M"
trajet
##  [1]  1  3  8  9  4  2  6 10 11 12 16 15 14 13 26 27 18 19 28 29 17 25 24  7 22 23  5 21 20
```

Let us now try this on a real network of streets, like Missoula, Montana.

I will not try to get the shapefile of the city; I will just try to replicate the photograph above.

If you look carefully, you will see a problem: nodes 10 and 93 have odd degree (3 here), so one strategy is to connect them (which explains the grey line).

But to be more realistic, we start at 93 and end at 10. Here is the optimal (shortest) path that goes through all the edges.
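
For such an open route (start at one odd-degree node, end at the other), recent versions of igraph ship an `eulerian_path()` function that avoids the manual pairing. A minimal sketch on an invented toy graph (not the Missoula network; it assumes igraph >= 1.3):

```r
library(igraph)
# Toy graph with exactly two odd-degree vertices (1 and 4), so an open
# Eulerian path exists between them.
g <- graph(c(1,2, 2,3, 3,1, 1,4), directed = FALSE)
p <- eulerian_path(g)
as.vector(p$vpath)  # visits every edge once, starting and ending at the odd vertices
```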

Now we are ready for Halloween, to go through all the streets in the neighborhood!

To **leave a comment** for the author, please follow the link and comment on their blog: **R-english – Freakonometrics**.


(This article was first published on ** R – Fantasy Football Analytics**, and kindly contributed to R-bloggers)

Week 7 Gold Mining and Fantasy Football Projection Roundup now available. Go check out our cheat sheet for this week.

The post Gold-Mining Week 7 (2018) appeared first on Fantasy Football Analytics.

To **leave a comment** for the author, please follow the link and comment on their blog: **R – Fantasy Football Analytics**.


(This article was first published on ** DataCamp Community - r programming**, and kindly contributed to R-bloggers)

Here is the course link.

Data visualization is an integral part of the data analysis process. This course introduces rbokeh, a visualization library for interactive web-based plots. You will learn how to use rbokeh layers and options to create effective visualizations that carry your message and emphasize your ideas. We will focus on the two main parts of data visualization: wrangling data into the appropriate format, and employing the appropriate visualization tools, charts and options from rbokeh.

In this chapter you are introduced to rbokeh layers. You will learn how to specify data and arguments to create the desired plot, and how to combine multiple layers in one figure.

In this chapter you will learn how to customize your rbokeh figures using aesthetic attributes and figure options. You will see how aesthetic attributes such as color, transparency and shape can serve a purpose and add more information to your visualizations. In addition, you will learn how to activate the tooltip and specify the hover info in your figures.

In this chapter, you will learn how to put your data in the right format to fit the desired figure, and how to transform between wide and long formats. You will also see how to combine normal layers with regression lines. In addition, you will learn how to customize the interaction tools that appear with each figure.

In this chapter you will learn how to combine multiple plots in one layout using grid plots. In addition, you will learn how to create interactive maps.
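
As a taste of the rbokeh style the course teaches, here is a minimal sketch. This is my own illustration, not course material; it assumes rbokeh's `figure()`, `ly_*` layer functions and `grid_plot()`:

```r
library(rbokeh)
# Sketch only: a scatter layer with hover tooltips, a histogram layer,
# and a grid layout combining both figures.
p1 <- figure() %>%
  ly_points(Sepal.Length, Sepal.Width, data = iris,
            color = Species,
            hover = c(Sepal.Length, Sepal.Width))
p2 <- figure() %>%
  ly_hist(Sepal.Length, data = iris, breaks = 20)
grid_plot(list(p1, p2), ncol = 2)
```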

To **leave a comment** for the author, please follow the link and comment on their blog: **DataCamp Community - r programming**.


(This article was first published on ** DataCamp Community - r programming**, and kindly contributed to R-bloggers)

Here is the course link.

This course will help you take your data visualization skills beyond the basics and hone them into a powerful part of your data science toolkit. Over the lessons we will use two interesting open datasets to cover different types of data (proportions, point data, single distributions, and multiple distributions) and discuss the pros and cons of the most common visualizations. In addition, we will cover some less common alternative visualizations for these data types, and how to tweak default ggplot settings to get your message across as efficiently and effectively as possible.

In this chapter, we focus on visualizing proportions of a whole; we see that pie charts really aren’t so bad, along with discussing the waffle chart and stacked bars for comparing multiple proportions.

We shift our focus now to single-observation or point data and go over when bar charts are appropriate and when they are not, what to use when they are not, and general perception-based enhancements for your charts.

We now move on to visualizing distributional data; we expose the fragility of histograms, discuss when it is better to shift to a kernel density plot, and show how to make both plots work best for your data.

Finishing off, we take a look at comparing multiple distributions to each other. We see why traditional box plots can be dangerous and how to easily improve them, along with investigating when you should use more advanced alternatives like the beeswarm plot and violin plots.
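
That box-plot improvement can be sketched in a few lines of ggplot2 (my own illustration, not course code): overlaying a narrow box plot on a violin keeps the familiar summary while showing the full shape of each distribution.

```r
library(ggplot2)
# The violin layer shows the full distribution; the narrow box plot on
# top keeps the median/quartile summary readers expect.
p <- ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_violin(fill = "grey85") +
  geom_boxplot(width = 0.1, outlier.size = 0.5)
p
```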

To **leave a comment** for the author, please follow the link and comment on their blog: **DataCamp Community - r programming**.
