Psychwire

Blog has Moved!!

hayward — Wed, 11 May 2011 19:33:14 +0000

My site will remain here but my blog has migrated to the address below:

http://psychwire.wordpress.com/

Thanks!

Charting the Defeat of AV using R (and some ggplot2 and merge operations on top)

hayward — Sun, 08 May 2011 14:08:28 +0000

In this post, I’ll be graphing some results from a recent referendum held here in the UK and combining it with the results of a set of local elections that were held at the same time. I’ll give some examples of graphing stuff using ggplot2 and will also show some info regarding merging datasets.

At the outset, I want to point out that this isn’t intended to be a ‘using stats to be political’ post. I just like playing around with data. Don’t for a second assume that I’m trying to say anything meaningful here. It’s for entertainment purposes only.

The Vote, and the Alternative Vote

We have a coalition government in the UK, between the Conservatives and the Liberal Democrats. One thing the Lib Dems have pushed for, and in a rare instance of getting their own way have actually achieved, is having a referendum on changing the voting system here. They wanted to institute Alternative Voting. The vote was cast last week. AV was crushed.

At the same time, votes were cast for the local councils. The people who voted, myself included, were handed two exciting bits of paper to scribble on at the polling station. The Lib Dems lost the most ground in the past 30 years.

Charting the defeat of AV using R

So I saw on the Guardian website that they were offering a spreadsheet of the AV results broken down by different areas in the UK. I played around with it a bit, and then thought that it might be interesting to compare the AV results to the local council election results. Yes, there’s a load of correlation not implying causation from that idea, as people who voted in the AV referendum may not have necessarily voted in the local council elections, and furthermore, people who did vote in both may not have voted consistently with the party that they were supporting. In other words, many people may have voted Lib Dem, the party which favours AV the most, and then voted against AV. Still, cross-comparing the results from the referendum and the local elections should, at an overall level, give some basic indication of the feeling and political vibe in different areas. Again, remember, this is all for fun. I’m more than happy to admit that I’m not an expert (or even a novice) in political science, if that’s what you even call this whole “running stats on votes” thing that I’m doing here.

I took the spreadsheet regarding AV from the Guardian and then headed off in search of a similar spreadsheet containing local election results. The closest I could find was on the Telegraph website. This one only covers England I think. Most sites give a breakdown of the local election results in a format that isn’t easy to put into a spreadsheet (i.e., I’d have to sit here for hours cross-tabulating the ones that are missing), so I’m going with what I can get. I was surprised to find that our dear old government doesn’t retain a centralised copy of the results and put them on a website.

With two datasets in hand, one called av and one called les (Local council ElectionS), I was ready to start. I ran a merge of the two to get started:

combined_base = merge(av, les)

In both datasets, there is a column called name which is used to match everything up. As my AV dataset contains more rows than the local council elections dataset, I end up with only those areas in the AV dataset that also appeared in the local council elections dataset. This gave me 279 rows.
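
To see what merge() does with the shared name column, here is a minimal sketch on two made-up data frames. The area names and numbers are invented for illustration; only the column names name, yes_perc and winner come from the datasets described here. By default merge() joins on the columns the two data frames have in common and keeps only the rows that match in both:

```r
# Toy stand-ins for the real datasets; all values are invented.
av  <- data.frame(name = c("Area A", "Area B", "Area C"),
                  yes_perc = c(30.1, 41.5, 28.7))
les <- data.frame(name = c("Area A", "Area C"),
                  winner = c("C", "Lab"))

# Spelling out by = "name" makes the matching column explicit.
combined_base <- merge(av, les, by = "name")
print(combined_base)

# all.x = TRUE would keep the unmatched AV rows too, with NA for winner.
with_unmatched <- merge(av, les, by = "name", all.x = TRUE)
print(nrow(with_unmatched))
```

Area B has no local election row, so it drops out of the default merge, which is the same reason the real merge ends up with fewer rows than the AV dataset.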

Next up: select only the local councils where the Conservatives, Labour or Lib Dems gained overall council control (indicated by the winner column). Then create a new column called win_label which is a textual version of the shortened names (these are C, Lab and LD) listed in winner.

combined = combined_base[combined_base$winner=="C" | combined_base$winner=="Lab" | combined_base$winner=="LD",]

combined$win_label[combined$winner=="C"] = "Conservative"

combined$win_label[combined$winner=="Lab"] = "Labour"

combined$win_label[combined$winner=="LD"] = "Liberal Democrat"
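
As an aside, the three assignments can also be written as a single lookup. This is just a sketch of an alternative on toy data, not what I actually ran; party_names is a name I have made up for the lookup vector:

```r
# Toy data; the winner codes match those used in the post.
combined <- data.frame(winner = c("C", "LD", "Lab", "C"),
                       stringsAsFactors = FALSE)

# Named character vector acting as a lookup table: code -> full name.
party_names <- c(C = "Conservative", Lab = "Labour", LD = "Liberal Democrat")

# Indexing by name returns the label for each row; unname() drops the codes.
combined$win_label <- unname(party_names[combined$winner])
print(combined)
```

The lookup approach scales a bit better if more parties need labels later, since only the party_names vector has to change.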

Next we can do a histogram of the number of councils where each party were victorious, compared to the proportion of the electorate in those councils who voted YES to AV:

ggplot(combined)+
aes(x=yes_perc)+
geom_histogram()+
scale_x_continuous("Percentage of YES to AV votes")+
scale_y_continuous("Number of Local Councils")+
facet_wrap(facet=~win_label)

The code gives us the following:

From the histograms, the defeat of the Lib Dems in the local elections is very clear. They hardly won anything.

Ok, so let’s take a look at it from a different angle. We have information available in the datasets regarding the percentage of people who voted in each area. Here’s the R code:

ggplot(combined)+
aes(x=yes_perc, y=turnout_perc, colour=win_label)+
geom_point(size=4)+
scale_colour_manual(values = c("blue","red", "orange")) +
scale_x_continuous("Percentage of YES to AV votes")+
scale_y_continuous("Percentage of Electorate who Voted")

Note the use of scale_colour_manual there to set each of the parties to their respective colours. I also resized the points within the geom_point command because the Lib Dem orange points were hard to see with the smaller default size.

Aside from the one rare instance where there was a high YES to AV vote and also a Lib Dem council being voted in (i.e., what would be expected), it seems there is a strong clustering towards a low proportion of YES votes.

One other point about this graph that stands out. Take a look at how the councils where Labour (red) were voted in tend to fall in areas where less of the electorate voted. When 45% or more voted, the Conservatives dominated, except for three Lib Dem wins.

Summary Stats

Finally, let’s look at some descriptive stats as a summary. Here’s the code.

ddply(combined, c("Council"), summarize,
"Yes Percentage (mean)"=mean(yes_perc),
"Turnout Percentage (mean)"=mean(turnout_perc))

And here’s the table:

Is R an ideal language to teach the fundamentals of programming to researchers with no experience in programming?

hayward — Fri, 06 May 2011 11:00:00 +0000

UPDATE: I’ve modified the title of this post a bit to clarify what I was really thinking when I wrote it. What I was really thinking was which programming language to choose to teach some fellow researchers how to get into the absolute basics of programming, out of the very limited set of languages I know. The tasks they need to do need only a minimal understanding of programming, and of R, so many of the issues that can be experienced won’t even come up for them. To put things into context, it only took two days for me to work out how to do everything that I need to do in R going from scratch, so it’s not as if I’m writing packages or doing anything particularly fancy myself, and these people who I will be teaching will be doing stuff that is less complicated than what I needed to do.

That being said, I’d like to thank those who commented for pointing out why R isn’t a great language for people starting out with programming. I’m still new to using R, so obviously don’t have the depth of experience with potential problems that others do, so it’s helpful to learn from others’ experience (or should that be “misery”?)! Python, Pascal and Ruby all sound like great options for getting into programming. I’m going to leave my initial post, with all its inaccuracies, intact below: first, because I think it’s good to have a record of what I have said so I can look back at how daft I was in the future, and second because, as people took the time to post comments, I don’t want the time they spent making comments and correcting me to have gone to waste. If I deleted most of what I said or removed the post, then their comments would seem odd or incorrect.

—

I’m helping some colleagues learn programming, starting from zero experience with it in any shape or form. It’s quite a daunting task in some senses, because, well, it may not be easy! They are researchers, so they’ll need it for processing data and generating output, and perhaps processing BIG DATA at some point too.

After some debate about the best way to go ahead, I’ve settled with R as being my weapon of choice to train these lucky individuals. The choices were as follows – note that I don’t know that many programming languages, so it’s not a huge list. I thought it would be worth sharing the pros and cons of each.

~~PHP~~

Pros: Dead easy to use. Nice and easy integration with databases which can be used to deal with data processing. Can be extended to, for example, generate images (a plus for these people who study visual cognition, so often need to make pretty pictures to show to participants in experiments). There’s also an immense number of tutorials and guides on the net, and people who aren’t into research can help you out just by knowing their PHP.

Cons: Probably overkill. Running a webserver all the time can be a pain, even if XAMPP is used. It’s not easy (or even possible, as far as I am aware) to run statistical tests using PHP or any classes that can be added in.

~~Python~~

Pros: Forces users to write clean code, and again it’s very easy to use. Possible to integrate with databases to churn through datasets. Like PHP, it can be used to generate images for use in experiments (pygame), and again there are plenty of examples and tutorials. Plenty of extensions to do stats and plot graphs (NumPy and Matplotlib). Oh, and it’s named after Monty Python. Ni.

Cons: again, probably overkill. Forcing people to worry about indentation can get horribly confusing when they are barely aware of what they are doing, and they can get tripped up. Just a personal issue I guess, but I’ve not quite managed to get to grips with OOP in Python. Maybe that’s because I did it first in PHP and never could do more than crash my computer when trying to learn Java. Ho hum.

~~Javascript~~

Pros: Easy syntax, and its power is growing with the new HTML 5 specifications. I mention it because I recently saw this illustration of basic programming and it seemed worth considering. There’s no need to compile anything which is often good for beginners too.

~~Cons: not really intended for churning big datasets and the kind of things I have in mind. Quite a bit of the decent libraries out there need to be paid for to be used.~~

R

Pros: syntax is very simple, with few gotchas present in other languages (e.g., ending lines with a semicolon or forcing tabs in lines and so on). As it’s loosely typed, this can be both a blessing and a curse. It’s a blessing because users don’t have to worry about declaring variables. It’s a curse because they can slip into bad habits and not understand variable types properly. Oh, and I don’t need to say that it can work on all sorts of databases, churn through data very rapidly, generate images, run statistical tests and plot graphs that are of publication quality.

Cons: Had to really think about this, but I guess that R is a nightmare to google for any kind of help when you’re stuck. I think it’s a fundamental issue relating to the fact that calling something a letter of the alphabet probably doesn’t help SEO rankings all that much. The official documentation would benefit from being a bit more like the PHP documentation (though maybe there is a site like that for R, I’ve just not found it), with users able to comment and give better examples than those provided initially. That being said, there are more blogs on R than you can shake even a very large proverbial stick at, which more than make up for it. I always search the legendary R-bloggers.com search box before googling anything to do with R now. I’ve never had to look any further than that.

~~Is R an ideal language to teach the fundamentals of programming to beginners?~~

I think the answer is “yes”. The beginners I have in mind are researchers and have specific needs regarding data processing, and it would benefit them to learn how to run stats in R, opening up future possibilities as well (e.g., LMEs). I’ve not mentioned Matlab, which I know is a favourite for researchers, because (1) it’s a gigantic monster to download and install, (2) I don’t know it that well and (3) it’s prohibitively expensive. I was also tempted to evaluate the use of LOLCODE to see if there was any mileage in using it (“IM IN YR LOOP UPPIN YR VAR TIL BOTH SAEM VAR”).

I myself first dabbled in programming back when I had a Sinclair back in the old days, and we did some very basic BASIC at primary school. Later on, I used BASIC to make emulators that mimicked my friends’ phrases and behaviour. Some of them were spot on! I guess I’ve always been trying to model human behaviour. I’ll post up the material I use to teach my colleagues to help them out and have a permanent copy of the material we go through.

~~That’s it for now, please feel free to share any other languages you may have found to be good for beginners. I’m sure there are some things that I have missed.~~

Loops, Conditionals and Variables: A Basic Simulation in R

hayward — Tue, 03 May 2011 12:00:23 +0000

In this post, I’m going to go over some basics of using conditionals and loops in R. I’ll expand on the example I use here in future posts. The conditionals and loops will be used to create some dummy eye movement data.

Background

Before I get into the actual code itself, I should probably explain what eye movements are all about. It’s a pretty big topic but basically, the easy way to think about it is to consider that your eyes look at things in the environment that you’re interested in. Put in more scientific terminology, your eyes fixate (point to) objects or areas of the environment containing information that your brain and cognitive systems are trying to process in detail. This happens because, though you don’t realise it, the quality of the visual input from your eyes is actually very poor. You only have colour vision in the dead centre of your visual field (though you never, ever notice it), and beyond the centre of your visual field, the input not only goes from being in colour to being in black-and-white, but the clarity and resolution drops off significantly as well.

The solution to the limited quality of visual input is to (1) utilise a load of systems that make you feel comfortable and safe, with everything neatly in colour and crystal clear and (2) to move your eyes around. A lot. You make 5-6 eye movements every second that you are awake, and, though you can of course have conscious control of them, most of the time, you let your eyes scoop up information in the outside world that is relevant so you can basically just get on with your life.

Your eyes are never truly still, though there are periods when they are relatively still and information is taken in. These time periods are called fixations. The movements between fixations are called saccades (French for ‘jumps’), and, though you don’t realise it, you’re blind during these saccades. Saccades are short (around 60ms, though this depends on the task), and fixations are much longer (varies considerably, but here we’re talking about 200ms).

The Simulation

Here I’m going to simulate people looking at four different objects: a square, a circle, a star and a triangle. Imagine a display is drawn out in front of a participant, with these four objects present. The participant’s job is to locate a circle. Once they have done that, they press a button and the trial ends. However, if they don’t find the circle, they can also give up, but they won’t do that straight away.

Let’s begin!

The Code

We begin by creating a dataframe called fix_table. It has 10,000 rows, given by the seq function.

fix_table <-data.frame(seq(1:10000))

Next we set some default values and create some columns. Trial is the simulated trial number that we’re in. Object is the object the participant is looking at, be it the square, circle, star or triangle. Fix_index is the current index of the fixation – this gets reset to 1 at the start of each trial.

fix_table$trial = 0

fix_table$object = "null"

fix_table$fix_index = 0

Now we set some defaults before running the main loop of the code. Objects is the list of different objects presented in each trial to look at. Fix_index begins at 1 because of it being the first fixation in a trial. Trial_index starts at one for the first trial:

objects = list("circle", "square", "star", "triangle")

fix_index = 1

trial_index = 1

The Simulation Loop

Now for the actual loop that does the simulation itself. It’s a for loop that goes through each row of the fix_table dataframe, starting at 1 (the first row) and ending at the final row, determined by the number of rows function or nrow(fix_table).

for (row in 1:nrow(fix_table)) {
  # ... code goes here ...
}

So, what code do we want to go into the loop? We begin by setting the basic information for that row, updating the fixation index and trial index values, like this:

fix_table[row,"fix_index"] = fix_index

fix_index = fix_index + 1

fix_table[row, "trial"] = trial_index

Next we randomly sample one of the objects to be looked at by the participant:

current = sample(objects,1)[1]

fix_table[row, "object"] = current

Sample randomly selects 1 object from the objects list, and then gets assigned to current. We then update the dataframe called fix_table with the name of the current object being looked at.

After this, we need to decide whether a trial is going to end with the current fixation:

p_end <-rnorm(1, mean=1/fix_index, sd=0.3)

if (p_end>1 | current=="circle") { trial_end=TRUE } else { trial_end=FALSE }

This is some made-up code that first of all creates a sort-of random value that decides whether the participant gives up. We generate a normally distributed random number with a mean of 1/fix_index and a standard deviation of 0.3; if this value is greater than 1, the trial ends. Note that because the mean shrinks as fix_index grows, giving up actually becomes less likely as more fixations are made, so a more realistic version would want a mean that rises with fix_index. Alternatively, as participants are searching for a circle, if they look at the circle, the trial will end. This is determined by the use of the or conditional, signified by the vertical pipe, |. Otherwise, the trial continues.
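
The chance that the random draw alone ends a trial can be worked out directly with pnorm(), rather than guessed at. Here is a small sketch, assuming the same standard deviation (0.3) and threshold (1) as in the code; k is just my shorthand for the value of fix_index at the moment the draw is made:

```r
# Probability that a draw from N(mean = 1/k, sd = 0.3) exceeds 1.
k <- 1:6
p_give_up <- 1 - pnorm(1, mean = 1 / k, sd = 0.3)
print(round(data.frame(k = k, p_give_up = p_give_up), 4))
```

Because fix_index has already been incremented by the time p_end is drawn, the first fixation of a trial uses k = 2. The probability falls as k grows, and in every fixation there is also a 1 in 4 chance of landing on the circle.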

If the trial is set to end, we need to reset some important values for the new trial to begin. We do this via the following:

if (trial_end==TRUE){
  trial_index = trial_index + 1
  fix_index = 1
}

Simple!

The Full Code

Here we go:

fix_table <-data.frame(seq(1:100))
fix_table$trial = 0
fix_table$object = "null"
fix_table$fix_index = 0
objects = list("circle", "square", "star", "triangle")
fix_index = 1
trial_index = 1
for (row in 1:nrow(fix_table)) {
  # add basics
  fix_table[row,"fix_index"] = fix_index
  fix_index = fix_index + 1
  fix_table[row, "trial"] = trial_index
  # decide which object we are on this time
  current = sample(objects,1)[1]
  fix_table[row, "object"] = current
  # determine if trial ends!
  p_end <-rnorm(1, mean=1/fix_index, sd=0.3)
  if (p_end>1 | current=="circle") { trial_end=TRUE } else { trial_end=FALSE }
  # if the trial ends, reset values
  if (trial_end==TRUE) {
    trial_index = trial_index + 1
    fix_index = 1
  }
}

Finally, let’s make a histogram of how long it takes for a trial to end:

library(plyr)
summary_table <-ddply(fix_table, c("trial"), summarise, max=max(fix_index))
hist(summary_table$max)

Which gives the following:

That’s it for now! More complex aspects will be added later.
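
If plyr isn’t installed, the same per-trial summary can be sketched in base R with tapply(). The small fix_table below is hand-made for illustration rather than produced by the simulation, and fix_per_trial is a name I’ve made up:

```r
# Hand-made stand-in for the simulated fix_table (three short trials).
fix_table <- data.frame(trial     = c(1, 1, 1, 2, 3, 3),
                        fix_index = c(1, 2, 3, 1, 1, 2))

# Largest fixation index within each trial = number of fixations in that trial.
fix_per_trial <- tapply(fix_table$fix_index, fix_table$trial, max)
print(fix_per_trial)

# hist(fix_per_trial) would then draw the same kind of histogram.
```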

Data Aggregation in R: plyr, sqldf and data.table

hayward — Thu, 28 Apr 2011 11:12:03 +0000

I’ve previously put up a couple of posts about aggregating data in R. In this post, I’m going to be trying some other alternative methods for aggregating the dataset. Before I begin, I’d like to thank Matthew Dowle for highlighting these to me. It’s a bit daunting at first, deciding which method of aggregating data is best. So I decided to give them all a go to see what they were like. Let’s go for it!

For this post, I’m going to be using the lexdec dataset that comes with the languageR package. For information see here. I’ve called it full_list here, in order to play around with it. The details of the dataset are not that important; it’s just a case of getting hold of some data from human subjects (i.e., what I’m used to!).

The Target

Before we get into the functions themselves, let’s take a look at the aggregated data that we want. It has the mean, median and standard error of the RT variable (RT stands for Response Time, or time taken to press a button). I want to get some summary statistics of this variable for every level of each participant (Subject column) and for every level of the Class column. So, ultimately, the target is the following summary table (note that I’ve truncated this as there are lots of participants):

A tool by any other name: plyr

Let’s begin with plyr. The power of plyr comes from the fact that it splits up data, runs a function on the split-up data, and then sticks it all back together. It has a wide variety of useful aggregation functions, but here I’m going to use ddply. This function takes a dataframe as its input and gives another dataframe as its output. The plyr functions are written in the syntax of XYply where X is the input object type and Y is the output object type. In this case, both ds of ddply stand for dataframe. Let’s look at some initial code:

ddply(full_list, c("Subject","Class"), function(df)mean(df$RT))

This is fine, and gives us mean RT values for each Subject and Class. But there’s a problem. The “mean” column is labelled V1, which isn’t that helpful, especially if we have multiple columns computed (i.e., ending up with V1, V2, V3 makes it hard to remember which column is which). So let’s get the column labelled:

ddply(full_list, c("Subject","Class"), function(df) return(c(AVERAGE=mean(df$RT))))

Great! Now let’s add some more columns to output:

ddply(full_list, c("Subject","Class"), function(df) return(c(AVERAGE=mean(df$RT),
MEDIAN=median(df$RT),SE=sqrt(var(df$RT)/length(df$RT)))))

This then gives us the target aggregated table pictured above.
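
To make the split, apply and combine steps a bit more concrete, here is a rough base R sketch of what ddply is doing conceptually. It isn’t how plyr is implemented internally, and the response times are made up; only the Subject, Class and RT column names mirror the lexdec data:

```r
# Made-up response times; Subject, Class and RT mirror the lexdec column names.
full_list <- data.frame(
  Subject = c("A1", "A1", "A1", "A1", "A2", "A2", "A2", "A2"),
  Class   = c("animal", "animal", "plant", "plant",
              "animal", "animal", "plant", "plant"),
  RT      = c(6.2, 6.4, 6.5, 6.9, 6.0, 6.2, 6.3, 6.3)
)

# Split: one data frame per Subject/Class combination.
pieces <- split(full_list, list(full_list$Subject, full_list$Class), drop = TRUE)

# Apply: compute the three summaries on each piece.
summaries <- lapply(pieces, function(df) {
  data.frame(Subject = df$Subject[1],
             Class   = df$Class[1],
             AVERAGE = mean(df$RT),
             MEDIAN  = median(df$RT),
             SE      = sqrt(var(df$RT) / length(df$RT)))
})

# Combine: stick the pieces back together into one data frame.
result <- do.call(rbind, summaries)
rownames(result) <- NULL
print(result)
```

The ddply call does all three steps in one go, which is why it ends up so much shorter.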

It needs no sequel: sqldf

Next up is sqldf. The name gives it away slightly: it’s a library for running SQL statements on data frames. SQL stands for Structured Query Language, with data stored in tables in a database. There are a number of SQL database types, which are all reasonably similar, and sqldf uses as default the incredibly popular SQLite. To get the target aggregated data using this, it’s a case of running a simple query:

sqldf("SELECT SUBJECT, CLASS, AVG(RT) AS AVERAGE, MEDIAN(RT) AS MEDIAN,
SQRT((VARIANCE(RT)/COUNT(RT))) AS SE
FROM full_list
GROUP BY SUBJECT, CLASS")

Note that to get the number of rows involved, we need to use COUNT rather than LENGTH. Easy!

How the tables have turned: data.table

The last library to look at here is data.table. This has the benefit of being considered the roadrunner of aggregation functions. It’s damn fast! This can be achieved as follows:

dps_dt = data.table(full_list)
dps_dt[,list(AVERAGE=mean(RT), MEDIAN=median(RT),
SE= sqrt(var(RT)/length(RT))),by=list(Subject,Class)]

Note that the first line takes our data.frame called full_list and casts it as a data.table object type. Here, two lists are used to do two things: (1) create the column names and (2) group the data by Subject and Class. The first list call sets up the column names and the calculations that need to be run. The second list gets fed to the by argument, which then aggregates by Subject and Class.

Summary

So, there we have three additional ways to aggregate data using R, to be added to tapply() and aggregate() which I have covered previously. Whichever one you end up using will probably depend on your own experience with using them (or, for example, whether you are familiar with SQL in the case of sqldf), what needs you have, and how fast you need your aggregation processing to be.

Further Adventures in Visualisation with ggplot2

hayward — Mon, 25 Apr 2011 12:00:52 +0000

So I previously took a look at some data of player performance from a computer game. In this post, I’m going to do some further visualisations using ggplot2. The data consists of different types of player character, different roles for those characters, and their overall damage output (the unit here is damage per second, or DPS). To obtain the data, I took the top 40 highest scores from this website and pasted them into a spreadsheet (i.e., I didn’t try to kill their server by scraping the data, I copied it all by hand. How nice!).

So let’s begin. First, I want to take a look at some boxplots. But I don’t want them to be ordinary boxplots: I want them to be ordered by how well the players were able to score. So, I begin by sorting them by their median, and then plotting them.

ordered_spec = with(full_list, reorder(spec, DPS, median))
ggplot(full_list, aes(ordered_spec, DPS, fill = class)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 0, size=7))

The boxplot is produced from the simple geom_boxplot() command. To order the data, I used the reorder command, which reorders the spec factor according to the median of DPS . This then gets applied to the aesthetic mappings of the ggplot() command to reorder the output.
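
Here’s a tiny illustration of what reorder() does to the factor levels, using invented scores for three made-up specs called a, b and c:

```r
# Invented scores for three made-up specs.
toy <- data.frame(spec = c("a", "a", "b", "b", "c", "c"),
                  DPS  = c(30, 50, 10, 20, 25, 27))

# Default factor levels are alphabetical.
print(levels(factor(toy$spec)))

# reorder() sorts the levels by the median DPS within each spec.
ordered_spec <- with(toy, reorder(spec, DPS, median))
print(levels(ordered_spec))
```

The medians here are 40, 15 and 26 for a, b and c, so the reordered levels run b, c, a. Plots use the level order when laying out the axis, which is why reordering the factor first does the trick.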

A quick note: initially when trying to reorder factors and output for plots, I tried to do this using ggplot itself. This was a mistake, as it’s not easy to do so. After much hunting around, I saw that it’s better to reorder your factors before you put them into ggplot, then the output will come out in the right order.

Anyway, here’s the graph:

You can see that there’s quite a range of performance. The poorer-performing groups are, for the most part, those who have other roles so shouldn’t be high on DPS. That is, all apart from subtlety, which is not doing so well. That too really has another role, but it’s surprising to see it so low (I remember when it was quite good for DPS, though that was about five years ago now).

Next, let’s take a look at something slightly different. In the data, we also have the seconds column, which lets us know how many seconds a player was active for. Perhaps it’s the case that players get tired, so a plot of their performance by how long they were active for might be revealing. It may alternatively be the case that a shorter period of time will benefit players because they can use special abilities which increase their damage output – though these abilities can only be used every few minutes. This could mean that a player who uses all of their special abilities and then dies (so their time stops) may have a high DPS output.

ggplot(full_list_dps) +
aes(x=seconds, y=DPS, colour=class) +
geom_point() +
scale_colour_hue()

Here, we just need to specify the x and y axis values. The points are plotted using the geom_point() command. Colours are added using scale_colour_hue(). There are a wide variety of colour options that can be used. Here’s the graph:

There appears to be a large clustering together, though I guess it seems like there is a weak downwards trend. Let’s just run a correlation for the sake of it, shall we?

cor.test(full_list_dps$seconds, full_list_dps$DPS)

The output says there is a significant (p<.0001) negative correlation of -0.39.
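
If you want the numbers themselves rather than the printed summary, cor.test() returns an object you can pick apart. A quick sketch on made-up data (the seed, ranges and slope are arbitrary, chosen only to give a negative relationship):

```r
set.seed(1)  # arbitrary seed so the made-up data are reproducible
seconds <- runif(40, min = 200, max = 600)
DPS     <- 30000 - 20 * seconds + rnorm(40, sd = 1500)

result <- cor.test(seconds, DPS)

# The htest object stores the pieces reported in the printed output.
r_value <- unname(result$estimate)   # Pearson's r
p_value <- result$p.value
print(c(r = r_value, p = p_value))
```

This is handy when you want to drop the correlation into a table or a plot label instead of copying it from the console.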

Finally, let’s break it down and facet the output, so we can look at each class individually.

ggplot(full_list_dps)+aes(x=seconds, y=DPS, colour=spec)+
facet_wrap(facet=~class)+
geom_point()+ scale_colour_hue()

That gives us this:

That’s all for now – up next will be some methods for summarising the data, followed by statistical tests (starting with ANOVAs, then moving onto LMEs). Again, just note that this is for fun, and not intended to be an accurate account of player performance by any remote stretch of the imagination.

Sexy, Geeky Graphs using ggplot2 in R

hayward — Fri, 22 Apr 2011 08:00:15 +0000

So I’ve been looking for some data to play with while learning R, other than the data I’m analysing for various experiments and papers I’m working on. I thought to myself, “Hey, this R stuff is pretty geeky. Can I engage in a higher level of geekiness?” And I think I’ve found a way: using R to analyse player performance in a computer game.

Background: The Data

The game in question is the epic cash cow known as World of Warcraft (otherwise known as WoW to some, or pronounced “Woo” by hilarious people), made by dear old Blizzard Entertainment. I’ve been a long-time player of Blizzard games, starting with a demo of Warcraft 2 that came on a CD with a magazine (hey, CDs, remember when games came on CDs?). Since then it’s been the works… Warcraft 3, Starcraft (1 and a tiny bit of 2), Diablo 2 (for far too long). I also have in my house a copy of Lost Vikings on the SNES (my other half’s, she’s as bad with this stuff as I am, though it does mean we have two SNES machines). Sadly, I don’t get time to play games these days – though I did used to raid a lot when I was an undergrad, I don’t really have time now.

Fighting the good fight: taking on a Pome Wraith - a zombie with an apple on its head

Anyway, in plain English – for those of you who haven’t heard of this game before now- the point of a large part of the game is to take your character that you have control of and go and bash large, unpleasant creatures on the noggin. After a while, those creatures die and leave you with shiny prizes and loot. It might sound a bit simplistic, but actually it gets quite complex: there are a large number of decisions you need to make in order to maximise your performance, you need to be very fast to react to changing circumstances in the environment, and you need to work with a set of other people in parallel to get the job done. For an example of stuff people need to learn in order to do a decent job, take a look here.

All of this (and the fact that there is an enormous player base numbering many millions across the globe) has meant that there has been a drive to get the most out of what players can do. There’s a sizeable community of players who run various models and simulations to work out the best ways to do things. This has made me often wonder if the player performance could also benefit from being analysed in a post-hoc manner. Rather than using models and simulations, why not take actual player performance and see how people fare?

Well, there are problems with that: not everyone is very good at the game. Plus, that would involve a lot of data collection (which I assume Blizzard do in some shape or form, by the way, from comments they have made at various times). So, let’s go for a different approach. Let’s pick the best players and see how they manage. These best players will serve as an approximation to the ideal maximum of what can be achieved. Now, here’s where you may be thinking “hrmmmm”, but please, stick with it. This is meant to be an entertaining illustration of what various functions in R can do, rather than an analysis of data that I intend to stand by and be certain can be trusted. It’s all a bit of fun.

Fortunately, there’s an easy way to get the best scores that players have achieved: World of Logs has a ranking system for the best scores on various fights in the game. So, I went there, found an encounter, and started copying and pasting the ranks into a spreadsheet. I picked the top 40 scores for Nefarian. He’s a big dragon who was killed in a previous version of the game, but is back now with a headache or something. Actually, I remember him toasting me a few times (I was a rogue back then, and our tank didn’t understand the whole ‘rotate the giant puppy’ part of the rogue class call).

Getting into ggplot2

Now that I have my data set up, I’m going to do some basic graphs using ggplot2. Now, if you’re like me and have seen some examples of what ggplot2 can do, you might have thought “oh my, that looks sexy!”. And then you tried to work out how to make nice-looking graphs and became somewhat unstuck. Trust me, though, it’s worth persevering with, because ggplot’s power comes from its flexibility. I used to make my graphs using Sigmaplot, but now I have a graph format that I like, it’s a case of copy and pasting things around to get very nice graphs instantly.

I initially started trying to use the qplot() function, but, as I understand it, qplot() is limited in various ways compared to what the mighty ggplot() function can do. So let’s stick with ggplot(), or else you’ll have to learn how to do things twice, and that’s no fun at all.

The basic way that ggplot() works is very similar to a number of other programming languages when it comes to putting together images (e.g., pygame images and image creation in PHP – I’m sure it’s similar to others too, but those are all I’ve used). Essentially, you stack a set of options and commands on top of a blank canvas. So you start with nothing, then you say, “right, let’s make a plot”, then you start building things into it. You want points drawn? Stack them on the canvas. You want error bars? Stack them on, too. If you don’t tell it what to do, it will, in some cases, make assumptions about what you want, and go with the defaults. For some functions and programs, the defaults are horrible. This is not the case with ggplot: the defaults are awesome.

So here I’m just going to do something very simple to illustrate how you can build up options and commands to make a set of graphs. I’m basing this on an example from the ggplot documentation. I’m going to make a series of density plots of player Damage Per Second (DPS, the standard indicator of performance, and the more the better!) and compare the various specialisations (specs) which are sub-components of the various classes in the game. Depending on what you want to do, you might choose one spec over another. Similarly, depending on what you want to do, you might pick one specific class. Say you want to turn into a bear: you’d be a druid. If you want to be a skirt-wearing magician: you’d be a mage. And so on! Anyway, on with the code:

ggplot(full_list, aes(DPS, fill = spec)) +
facet_wrap(facet=~class)+
geom_density(alpha = 0.2) +
scale_x_continuous("Damage Per Second (DPS)")+
opts(axis.text.x = theme_text(angle = 90, hjust = 0, size=7))

Note the “+” symbols at the end of each line. The + is used to add additional options to the ggplot command. If you are running the code from a script and spread the options over several lines, each + has to go at the end of a line, not at the start of the next one. Otherwise R treats the first line as a complete command and the rest won’t run as intended. That took me a while to work out!
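Here is a minimal sketch of why the position of the + matters. It uses plain arithmetic rather than ggplot2, on the assumption that the parser treats both cases the same way: a line that ends in + is unfinished, so R keeps reading, whereas a line that is already complete gets evaluated on its own.

```r
# Trailing "+": the first line is incomplete, so both lines form one expression.
trailing <- parse(text = "1 +\n2")
length(trailing)       # number of expressions the parser found
eval(trailing[[1]])    # evaluates 1 + 2

# Leading "+": "1" is already a complete expression, so "+ 2" is parsed separately.
leading <- parse(text = "1\n+ 2")
length(leading)        # number of expressions the parser found
```

With a plot, the same thing happens: the first line would be treated as a finished plot, and the line starting with + would be left on its own.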

Anyway, the first line tells ggplot to use the dataset I have called full_list. The next part, aes(), outlines the aesthetic mappings for the plot to use. Here I define my x-axis by entering DPS. Next I tell it to colour the different density curves by spec, using the fill argument.

Next comes facet_wrap, which splits the plot into panels by the class factor, much as the lattice package does. This will produce one graph for each class.

The third line adds a geom_density or density plot element. The transparency (alpha) is set to 0.2 to enable you to see how the density plots overlap.

The fourth line sets the x-axis title using the scale_x_continuous command. Note that if your x-axis is a factor you need to use scale_x_discrete instead.

Finally we have the opts or options. There are a huge number of options, the best list of which I’ve found is here. Here I’ve set the x-axis text to be angled and therefore easier to read without overlapping.
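If you are on a recent version of ggplot2, note that opts() and theme_text() have since been replaced by theme() and element_text(). Below is a rough sketch of how the same plot might look with the newer functions. The full_list data frame here is a hypothetical stand-in filled with simulated numbers (the class names, spec names and DPS values are made up), not the real rankings.

```r
library(ggplot2)

# Hypothetical stand-in for full_list: simulated values, not real World of Logs data.
set.seed(1)
full_list <- data.frame(
  class = rep(c("mage", "druid"), each = 40),
  spec  = rep(c("fire", "arcane", "feral", "balance"), each = 20),
  DPS   = rnorm(80, mean = 25000, sd = 2000)
)

# Same layers as the code above, with theme()/element_text() in place of opts()/theme_text().
p <- ggplot(full_list, aes(x = DPS, fill = spec)) +
  facet_wrap(~ class) +
  geom_density(alpha = 0.2) +
  scale_x_continuous("Damage Per Second (DPS)") +
  theme(axis.text.x = element_text(angle = 90, hjust = 0, size = 7))

# print(p) draws the plot
```

Storing the plot in p and printing it afterwards is just a convenience; typing the whole expression at the console works too.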

Now, let’s take a look at the output:

The full set of plots. There may be too many specs and colours here!

You can see that some specs of different classes do better than others. Some aren’t supposed to do much damage, as they have other roles (e.g., the ones with “prot” in the name). Again, please don’t take this as a serious attempt at comparing the specs and classes, it’s just some data to play around with and explore for illustrative purposes.

The next steps will be to try out various ways of summarising the data (e.g. data.table, aggregate, plyr), after which I’ll start running some statistical tests.

Aggregate Function in R: Making your life easier, one mean at a time

hayward — Wed, 20 Apr 2011 08:00:40 +0000

I previously posted about calculating medians using R. I used tapply to do it, but I’ve since found something that feels easier to use (at least to me).

aggregated_output = aggregate(DV ~ IV1 * IV2,
                data=data_to_aggregate, FUN=median)
aggregated_output

The above code saves an aggregated dataset to aggregated_output and gives you the median in a column. The median (or mean, or whatever function you want to apply) is specified by FUN=. The value to create a median for is specified by DV (dependent variable).

The aggregate function also gives additional columns for each IV (independent variable). You can have as many of these as you like. Here, I have two, and these are specified by IV1 * IV2.
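To make that concrete, here is a small sketch on made-up data. The data frame and its values are hypothetical and only borrow the names used above. The grouping variables are written here with +, which is the form shown in the aggregate() documentation.

```r
# Hypothetical data: DV is the outcome, IV1 and IV2 are the grouping variables.
data_to_aggregate <- data.frame(
  IV1 = c("a", "a", "a", "b", "b", "b"),
  IV2 = c("x", "x", "y", "x", "y", "y"),
  DV  = c(1, 3, 10, 4, 6, 8)
)

# One row per IV1/IV2 combination, with the median of DV in the last column.
aggregated_output <- aggregate(DV ~ IV1 + IV2,
                               data = data_to_aggregate, FUN = median)
aggregated_output
```

The result is an ordinary data frame, so it can be passed straight on to other functions or written out to a file.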

Those of you who are familiar with relational databases will see immediately that this function is somewhat similar to GROUP BY (in MySQL). The bonus is that you don’t need to SELECT the IV columns that you want to be provided; those are done automatically. For example, take a look at this (it uses AVG, because MySQL has no built-in median function):

SELECT IV1, IV2, AVG(DV) FROM data_to_aggregate GROUP BY IV1, IV2

There is apparently more than one way to skin a cat (even if it’s a cat that’s made of data).

RStudio, Revolution Analytics and Deducer: A Tale of Three GUIs

hayward — Tue, 19 Apr 2011 11:59:58 +0000

I’m in the process of moving from SPSS to R at the moment. It’s not been the easiest of rides, but then learning how to do a core part of your job never really should be. It’s been fun, though – don’t get me wrong – it’s definitely been an adventure!! Here I’m going to review my (limited) experience with some of the GUIs available for R. Don’t shout at me if I haven’t fully tested them – these are the views of a newbie (n00b). This is by no means intended to be a fully-detailed or fully-researched account of the programs here. I actually think they are all great and have been using them interchangeably during the learning process. Best of all, as they are all free, it has meant that changing between the three as I learn has cost me nothing, and I’ve picked up bits and pieces of new ideas from each of them. I’m writing this in the hope that others will give them all a go and learn something too.

One of the interesting parts of my time learning R has been the increasing realisation that it’s turning into something new. When I first tried it out several years back, the first load of the default R installation was, well, not pleasant, consisting of little more than the most basic of interfaces, coupled with a console that was basically worthless to a beginner. It was about as much fun as trying to install Linux about 10 years ago: inevitably, you wish you had stayed at home, so you scurry off and hide.

And then came Revolution…

But things have changed since that time. R is turning into a new beast, full of potential and possibilities. Imagine my surprise – nay, joy – when I discovered Revolution Analytics. It’s a powerful beast built on the R code base. Users are presented with an IDE that actually makes life considerably easier. The IDE contains a console and a scripting window, which means you do get the best of all worlds – code, console, and pretty buttons which make life easier. Great. Some of the chief people behind Revolution Analytics were heavily involved in SPSS before moving to R – so they know what they are doing. I’ve seen users give the company some grief over the fact that it’s not open source. That’s not a debate I want to get into, though I am pleased that they have a free academic license, which is definitely a good thing. I think they charge business users for it, particularly for its optimisations for churning through large datasets.

Anyway, I digress. Revolution Analytics is great. Download it and give it a go.

What the hell is a Deducer?

Deducer is a slightly different beast to Revolution Analytics. The point of Deducer, it seems, is to replace the functionality of full-GUI statistical packages (hello, SPSS… PASW… or whatever you are called now). This is a brilliant goal and Deducer is making masses of headway in terms of becoming an awesome package. It has built-in functionality and buttons for producing sexy graphs using ggplot2. Keep an eye on this one. It can also do some forms of analyses already, and I’d predict that it won’t be long before it can do pretty much anything.

Deducer is also great – download it and also give it a go. It has a Data View and Variable View (like in SPSS) which eliminates the usual annoyances of R assuming what is a factor and what is a number.

I guess I should have been calling it DeduceR. Should I? No ideaR. Oh my, this R stuff is getting out of contRol…

Back to the RStudio

The other GUI I’ve been using is RStudio. This is my personal favourite. It’s the fastest to install, and the fastest to load out of the three I’m reviewing. I know, I know, loading times don’t matter, right? If something takes 10 seconds to load, that just means you’ll spend ten less seconds on Facebook, surely? Well, maybe. Loading times are often a good sign of how much bloat there is in a program, as well as how much effort has been put into optimising the program to make it obscenely fast.

There are several reasons why RStudio is my current favourite. It has options for re-colouring the editor window to a dark colour scheme (plus points for me, I like dark schemes). As with Deducer, it is easy to import files into datasets. Packages are easy to manage, graphics are easier to take a look at and export, and datasets are easy to inspect (though you can’t edit variable types when viewing datasets, at least as far as I know). Together, these points make it feel like a qualitative and quantitative shift towards something where you can still learn how to do the headache-inducing scripting stuff, but without the kind of headaches that drive you back to SPSS. Oh, and it can comment out multiple lines in a script with a single click of a button. How cool is that?

Go and give RStudio a go, now, now!

Summary

That’s my experience so far – I’m sure it will change as I learn more! Beyond the differences between these various GUIs, there is a clear point that needs to be considered. The fact that many different people are now working to turn R into something that can be used more widely can only be a good thing (TM). These GUIs, and others like them, will encourage developers to work harder to produce even better alternatives to the R base installation, so I expect that, even a year from now, the landscape will be entirely different.

UPDATE: Thanks to Tal Galili, who has just pointed out to me that Deducer and RStudio can be used together. It’s just a case of running library(Deducer) then JGR(), after which you call Deducer again from within the JGR console (this ensures everything is installed for Deducer and ready to go). I had assumed from the documentation for Deducer that it wasn’t possible to do this (no idea why I didn’t try it, my bad). Great!

Pivot Tables and Medians in R

hayward — Sat, 16 Apr 2011 18:46:27 +0000

Pivot Tables are a useful way of aggregating data into the format that you’re after. In this example, I’m going to be using R to pivot some data and calculate medians for me. This is useful because Excel can calculate medians (the =MEDIAN(values) function), but what it can’t do is calculate medians for Pivot Tables. I assume that it can’t do this because calculating the median of large groups of aggregated data can be very computationally intensive, and may take longer than you would expect.

The good news, however, is that R can do this without problems. Say that you have run an experiment and are left with the following:

participant	condition	score
1	1	95
1	1	90
1	2	105
1	2	110
2	1	64
2	1	80
2	2	90
etc.

But that’s not what you want – instead, say that you want the following:

Participant	condition_1	condition_2
1	median of score	median of score
2	median of score	median of score
etc.

Here’s the code I used to sort this out:

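# Choose the input file interactively; the first row holds the column names.
datafile = read.table(file.choose(), header = TRUE)

# Pivot: one row per participant, one column per condition, median score in each cell.
# as.character() comes first so that a factor column converts to its values, not its level codes.
median_output <- tapply(as.numeric(as.character(datafile$score)), list(datafile$participant, datafile$condition), median)

# Choose where to save the pivoted table.
write.table(median_output, file.choose())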

Using file.choose() presents you with a pop-up window asking which file to load in to use as your datafile and also asks you, at the end of the script, where you want to save your pivoted data. At this point, you can call it a text file (e.g., “medians.txt”) and save it to wherever you want.

To Pivot more complex datasets, all you need to do is add more columns from your dataset to the list function. You’ll then get the fully pivoted data out instead.
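As a rough sketch of both the basic pivot and the “more columns” case, here is the sample data from the table above typed in by hand, so that no file dialog is needed. The block column is a hypothetical extra variable that I have invented purely for illustration.

```r
# The seven sample rows shown in the table above.
datafile <- data.frame(
  participant = c(1, 1, 1, 1, 2, 2, 2),
  condition   = c(1, 1, 2, 2, 1, 1, 2),
  score       = c(95, 90, 105, 110, 64, 80, 90)
)

# Two grouping columns: rows are participants, columns are conditions.
median_output <- tapply(datafile$score,
                        list(datafile$participant, datafile$condition),
                        median)
median_output

# A third (hypothetical) grouping column added to the list.
datafile$block <- c(1, 2, 1, 2, 1, 2, 1)
three_way <- tapply(datafile$score,
                    list(datafile$participant, datafile$condition, datafile$block),
                    median)
dim(three_way)   # one dimension per grouping column
```

Note that with three grouping columns the result is a three-dimensional array rather than a flat table, and any combination with no data comes out as NA.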

Don’t forget that you can run this using functions other than the median (e.g., mean) – just replace median with whatever you need.

Note finally that I ran as.numeric() on the score column. This was done because, when reading in the raw data, R sometimes assumes that the column is a factor rather than a numeric column. Be careful here: calling as.numeric() directly on a factor returns its internal level codes rather than the original values, so as.numeric(as.character(...)) is the safer conversion. If R has assumed the wrong thing, you’ll probably get an error saying “Error in tapply… arguments must have same length”. If this happens, make sure that all of your columns which should be a factor are a factor and all of your columns which should be numeric are numeric.
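Here is a quick sketch of what can go wrong when a numeric column has been read in as a factor. The score vector is made up for illustration.

```r
# A made-up score column that has ended up as a factor.
score <- factor(c("95", "90", "105", "110"))

# Directly converting a factor gives the internal level codes, not the scores.
codes <- as.numeric(score)
codes

# Going via character first recovers the actual values.
values <- as.numeric(as.character(score))
values

# is.factor() and is.numeric() are handy for checking what R has assumed.
is.factor(score)
is.numeric(values)
```

Running str() on the whole data frame right after reading it in is an easy way to spot columns that have the wrong type.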