So rock me mama like a wagon wheel, rock me mama anyway you feel (Wagon Wheel, Old Crow Medicine Show)
This is the third iteration of the Hilbert curve. I placed points at its corners. Since the curve has a beginning and an end, I labeled each vertex with the order it occupies: dark green vertices are those labeled with prime numbers, and light green ones those with non-primes. This is the sixth iteration, colored as I described before (I removed lines and labels):
The previous plot has 4,096 points. There are 564 primes lower than 4,096. What if I color 564 points randomly instead of coloring the primes? This is an example:
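For the record, that prime count is easy to verify. Here is a quick base-R sieve (the code at the end of the post uses pracma::primes for the same job):

```r
# Sieve of Eratosthenes: return all primes up to n
sieve <- function(n) {
  is_prime <- rep(TRUE, n)
  is_prime[1] <- FALSE
  for (i in 2:floor(sqrt(n))) {
    if (is_prime[i]) is_prime[seq(i * i, n, by = i)] <- FALSE
  }
  which(is_prime)
}

length(sieve(4096))  # 564
```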
Do you see any difference? I do. Let me place both images together (on the right, the one with primes colored):
The dark points are much more ordered in the first plot; the second one is noisier. This is my particular tribute to Stanislaw Ulam and his spiral: one of the most amazing fruits of boredom in the history of mathematics.
This is the code:
library(reshape2)
library(dplyr)
library(ggplot2)
library(pracma)

opt = theme(legend.position = "none",
            panel.background = element_rect(fill = "white"),
            panel.grid = element_blank(),
            axis.ticks = element_blank(),
            axis.title = element_blank(),
            axis.text = element_blank())

hilbert = function(m, n, r) {
  for (i in 1:n) {
    tmp = cbind(t(m), m + nrow(m)^2)
    m = rbind(tmp, (2 * nrow(m))^r - tmp[nrow(m):1, ] + 1)
  }
  melt(m) %>%
    plyr::rename(c("Var1" = "x", "Var2" = "y", "value" = "order")) %>%
    arrange(order)
}

iter = 3 # Number of iterations
df = hilbert(m = matrix(1), n = iter, r = 2)
subprimes = primes(nrow(df))

df %>% mutate(prime = order %in% subprimes,
              random = sample(x = c(TRUE, FALSE), size = nrow(df),
                              prob = c(length(subprimes), nrow(df) - length(subprimes)),
                              replace = TRUE)) -> df

# Labeled (primes colored)
ggplot(df, aes(x, y, colour = prime)) +
  geom_path(color = "gray75", size = 3) +
  geom_point(size = 28) +
  scale_colour_manual(values = c("olivedrab1", "olivedrab")) +
  scale_x_continuous(expand = c(0, 0), limits = c(0, 2^iter + 1)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 2^iter + 1)) +
  geom_text(aes(label = order), size = 8, color = "white") +
  opt

# Non labeled (primes colored)
ggplot(df, aes(x, y, colour = prime)) +
  geom_point(size = 5) +
  scale_colour_manual(values = c("olivedrab1", "olivedrab")) +
  scale_x_continuous(expand = c(0, 0), limits = c(0, 2^iter + 1)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 2^iter + 1)) +
  opt

# Non labeled (random colored)
ggplot(df, aes(x, y, colour = random)) +
  geom_point(size = 5) +
  scale_colour_manual(values = c("olivedrab1", "olivedrab")) +
  scale_x_continuous(expand = c(0, 0), limits = c(0, 2^iter + 1)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 2^iter + 1)) +
  opt
This is the bi-monthly R-bloggers post (for 2016-01-25) for new R Jobs.
Just visit this link and post a new R job to the R community (it’s free and quick).
Job seekers: please follow the links below to learn more and apply for your job of interest:
In R-users.com you may see all the R jobs that are currently available.
(you may also look at previous R jobs posts).
One of the trickier tasks in clustering is determining the appropriate number of clusters. Domain-specific knowledge is always best, when you have it, but there are a number of heuristics for getting at the likely number of clusters in your data. We cover a few of them in Chapter 8 (available as a free sample chapter) of our book Practical Data Science with R.
We also came upon another cool approach, in the mixtools package for mixture model analysis. As with clustering, if you want to fit a mixture model (say, a mixture of gaussians) to your data, it helps to know how many components are in your mixture. The boot.comp function estimates the number of components (let's call it k) by incrementally testing the hypothesis that there are k+1 components against the null hypothesis that there are k components, via parametric bootstrap.
You can use a similar idea to estimate the number of clusters in a clustering problem, if you make a few assumptions about the shape of the clusters. This approach is only heuristic, and more ad-hoc in the clustering situation than it is in mixture modeling. Still, it’s another approach to add to your toolkit, and estimating the number of clusters via a variety of different heuristics isn’t a bad idea.
Suppose this is our data:
In two dimensions, it’s pretty easy to see how many clusters to try for, but in higher dimensions this gets more difficult. Let’s set as our null hypothesis that this data is broken into two clusters.
We can now estimate the mean and covariance matrices of these two clusters, for instance by using principal components analysis. If we assume that the clusters were generated by gaussian processes with the observed means and covariance matrices, then we can generate synthetic data sets, of the same size as our real data, that we know have only two clusters, and also have the same means and covariances.
In other words, the full null hypothesis is that the data is composed of two gaussian clusters, of the observed mean and covariance.
Now we want to test the hypothesis that a data set is composed of three clusters (or more) against the null hypothesis that it has only two clusters. To do this, we generate several synthetic data sets as described above, and cluster them into three clusters. We evaluate each clustering by some measure of clustering goodness; a common measure is the total within-sum-of-squares (total WSS) (see also Chapter 8 from our book). We then compare the total WSS from three-clustering the real data with the distribution of total WSS of the synthetic data sets.
For a data set with two actual gaussian clusters, we would expect the total WSS of a three clustering to be lower than that of a two clustering, because total WSS tends to decrease as we partition the data into more (and internally “tighter”) clusters. Our simulations give us a plausible distribution of the range of total WSS that we would tend to see.
If the real data is really in two (gaussian) clusters, then when we three-cluster it, we would expect a total WSS within the range of our simulations. The hope is that if the data is actually in more than two clusters, then the clustering algorithm — we used K-means for our experiments — will discover a three-clustering that is *better* (lower total WSS) than what we saw in our simulations, because the data is grouped into tighter clusters than what our null hypothesis assumed. This is what happens in our example case. Here’s a comparison of the real data to the above simulated data set when both are three-clustered.
Here’s a comparison of the total WSS of the three-clustered real data (in red) compared to 100 bootstrap simulations:
Judging by these results, we can reject the null hypothesis that the data is in two clusters, and operate on the assumption that there are three or more clusters. In general, if the total WSS of the true data’s k+1 clustering is lower than at least 95% of the total WSS from the appropriate bootstrap simulations (for a significance level of p < 0.05), then we reject the null hypothesis that there are k clusters, and assume that the data has at least k+1 clusters. This continues with increasing values of k until we can no longer reject our null hypothesis. For this example, we found four clusters — which is in fact how we generated the test data.
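For readers who want to try this, here is a minimal sketch of the procedure in base R (plus MASS, which ships with R, for multivariate gaussian simulation). The data and variable names here are illustrative, and kmeans stands in for your clustering method of choice:

```r
library(MASS)  # for mvrnorm

set.seed(2016)
# toy data: three well-separated gaussian clusters in 2-D
d <- rbind(cbind(rnorm(100, 0, 0.3), rnorm(100, 0, 0.3)),
           cbind(rnorm(100, 3, 0.3), rnorm(100, 3, 0.3)),
           cbind(rnorm(100, 0, 0.3), rnorm(100, 3, 0.3)))

total_wss <- function(x, k) kmeans(x, k, nstart = 25)$tot.withinss

k <- 2  # null hypothesis: the data is in two gaussian clusters
cl <- kmeans(d, k, nstart = 25)

# simulate from the null: gaussians with the observed means/covariances,
# then (k+1)-cluster each synthetic data set and record the total WSS
sim_wss <- replicate(100, {
  synth <- do.call(rbind, lapply(seq_len(k), function(i) {
    xi <- d[cl$cluster == i, , drop = FALSE]
    mvrnorm(nrow(xi), mu = colMeans(xi), Sigma = cov(xi))
  }))
  total_wss(synth, k + 1)
})

# reject "k clusters" if the real (k+1)-clustering beats >= 95% of sims
obs_wss <- total_wss(d, k + 1)
mean(obs_wss < sim_wss)  # near 1 here, so reject and move on to k = 3
```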
A few notes. This procedure assumes the clusters are gaussian (boot.comp does, too). Sometimes the procedure gets “stuck,” whereby it fails to reject the null hypothesis at k, but then does reject it at k+1. So alternatively, we can run the procedure for a fixed set of k (say, in the range from 1 to 10), and then take as the number of clusters the last k where we fail to reject the null hypothesis (skipping over the “stuck points”). This sometimes “unsticks” the procedure, and sometimes it doesn’t, so we default to the “immediate stop” procedure in our interactive demonstration of this method (see link below).

If your data really is a mixture of gaussians, you may prefer to model it directly with mixtools or another mixture model analysis package.

We estimated the cluster parameters directly from the observed clusters; the boot.clust function uses maximum likelihood, which generally estimates the true parameters better (assuming the data is truly gaussian), and is unbiased.

The code to generate the data sets in our experiments and run the parametric bootstrap is available at Github here. You can experiment with different dimensionalities of data and different numbers of clusters by altering the parameters in the R Markdown file kcomp.Rmd, but it is easier to play around using the Shiny app we wrote to demonstrate the method. You can download the data generated by the Shiny app for your own experiments, as well.
In case you missed them, here are some articles from January of particular interest to R users.
Animated visualizations and analysis of data from NYC's municipal bike program, created with R.
Many local R user groups are sharing materials from meetups using Github.
A detailed R tutorial on analyzing your Twitter archive and performing sentiment analysis.
How to combine R and Python in Jupyter notebooks.
Many datasets are available for analysis in R using Kaggle's online platform, including the American Community Survey.
Getting started with Markov Chains in R and even more R packages for Markov Chain analysis.
Replays are available for recent webinars on Microsoft R Open and Microsoft R Server.
Microsoft R Open 3.2.3 (formerly Revolution R Open), and new CRAN Time Machine now available at MRAN.
Overview of parallel computing in R.
R packages providing sources of data.
Visual Studio will soon support the R language.
Microsoft R Server available free to students and developers.
Revolution R is now Microsoft R.
A new ggplot2 extension avoids overlapping text labels.
R played a big part in a scientific breakthrough regarding reproducibility of results.
An online data science course using Microsoft Azure and R.
A review of the 7th R user conference in Spain.
Using network analysis in R to explore connections in the movie "Love Actually".
The most popular posts on the Revolutions blog in 2015.
General interest stories (not related to R) in the past month included: pinball skills, when walking up the escalator is inefficient, Pokemon or Big Data and mimicking famous guitar styles.
As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.
In order to do our analysis, we look at four hundred thousand itemized contributions reported to the FEC at the end of 2015. Contributions are only itemized for individuals who have donated $200 or more. Because many individuals give less than $200, we do not have itemized information for many donors. For Hillary Clinton, about 16% of contributions are not reported as itemized. For Bernie Sanders, 74% of contributions are not reported as itemized.
Figure 1: The histogram captures the distribution of contributions. Note the x axis is scaled by log10, so the same distance exists between 1 and 10 as between 10 and 100, or between 100 and 1000.
From Figure 1, we can see that throughout all of 2015, Clinton has vastly more large contributors than Sanders, with over 20,000 campaign contributors giving the maximum contribution value of $2700*. Clinton also has a large number of mid-sized donors, with another 20,000 giving between $500 and $2700.
Conversely, for the smaller donations, Sanders has many more contributors than Clinton, with nearly 35,000 contributions at $100 compared with Clinton’s 23,000. At $50, Sanders also does much better, with over double the number of donations: more than 40,000 contributions compared with Clinton’s 20,000. The difference is even more stark at $10, with Sanders receiving nearly 40,000 contributions compared with Hillary’s 12,000.
There are some ways to avoid the legal contributions limits as discussed in this NPR article.
From Figure 2 we can see some pretty shocking facts about the nature of Clinton’s contributions early in her bid. In April and May, and almost into June, the upper quartile (top 25%) of her contributions was at or above the legal maximum. This is vastly different from Sanders, who had a handful of contributions at or around the legal maximum, but nothing close to Clinton’s numbers. Overall, the difference between the two in April and May could not be more stark, with the upper quartiles for Sanders at or below the median for Clinton for nearly all of the months observed.
Overall, we can see there is a significant amount of movement in the size of donations over time. For both Clinton and Sanders, there is a bit of a race to the bottom. This is driven somewhat by the nature of the reporting laws: contributions are not reported until an individual has given at least $200, and after that, all contributions are reported. Thus many of the smaller contributions will be reported as repeat contributors keep donating.
From Figure 3, we can see that the difference in the nature of contributions by candidate is vast, with almost all of the contributions to the Sanders campaign being less than $500. For the first two months, over half of the listed contributions to the Clinton campaign were $500 or more. Over time, the average size of contributions decreased, though much faster for the Sanders campaign.
From Figures 2 and 3 we might be concerned that Sanders’s campaign is not capable of raising sufficient funds to compete with the Clinton campaign. However, this forgets that Sanders has many, many more contributors than Clinton. To estimate the number of contributions that are given but not itemized, I take the dollars reported as not itemized each quarter and assume that those contributions are on average $30 (probably a high estimate).
Table 1: Total not-itemized contributions by quarter. The # of Contributions column assumes each of these contributions averages $30.
| | $ Not-Itemized | # of Contributions |
| --- | --- | --- |
| Clinton | | |
| First Report (July) | $8,098,571 | 269,952 |
| Fall Report (October) | $5,193,811 | 173,127 |
| Year End Report (December) | $5,707,408 | 190,247 |
| Sanders | | |
| First Report (July) | $10,465,912 | 348,864 |
| Fall Report (October) | $20,187,064 | 672,902 |
| Year End Report (December) | $23,421,034 | 780,701 |
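The contribution counts in the table are just the un-itemized dollar totals divided by the assumed $30 average contribution; reproducing the year-end rows, for example:

```r
# year-end un-itemized dollars (from the table above)
not_itemized <- c(Clinton = 5707408, Sanders = 23421034)
round(not_itemized / 30)  # Clinton 190247, Sanders 780701
```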
From Table 1, we can see that while Clinton initially reported nearly as many small contributions as Sanders, those contributions have since fallen off, while Sanders’s small contributions have increased significantly, now outpacing Clinton’s by four to one.
Figure 4: Total number of contributions over time and the difference between the two.
Smoothing the number of small contributors across the months of campaigning, we end up with Figure 4, in which we quickly see how vast the difference between the number of contributors to the Sanders campaign and the Clinton campaign is.
Initially, Clinton enters the race a little earlier with a quarter million contributions. However, once Sanders enters the campaign, he quickly gains support, his total number of contributions matching Clinton’s by June 5th and continuing to grow. By September 20th Sanders has already collected twice the number of contributions that Clinton has.
So how does this map to total contributions collected over time? We already know that Clinton has a large number of big donors on her side.
Figure 5: Total quantity of dollars contributed over time.
From Figure 5, we can see that Clinton got an early and big hand up from large money: as early as July, she had raised just over $30 million more than Sanders. However, despite her continuing large contributions, this difference has not grown much since; as of the end of the year, it stands at only a little more than $35 million in individual contributions.
Overall, this is an AMAZING fact. Somehow, despite having the majority of wealthy democratic donors in her corner, Clinton has failed to out-raise Sanders since July!
Not only that, but Hillary is in a difficult position: many of her largest donors have already maxed out their ability to legally contribute to her campaign, yet very few of Sanders’s contributors have come close to maxing out their legal ability to contribute. Of course, there are always the dubiously legal contributions to candidate Super-PACs, made legal by the infamous “Citizens United” Supreme Court ruling.
However, as Sanders has campaigned against Super-PACs and Hillary is attempting to win over his supporters, it will certainly be interesting to see how fundraising changes moving forward as she risks being hamstrung by her narrow but affluent base.
When you visit a site like the LA Times’ NH Primary Live Results site and wish you had the data that they used to make the tables & visualizations on the site:
Sometimes it’s as simple as opening up your browser’s “Developer Tools” console and looking for XHR (XMLHttpRequest) calls:
You can actually see a preview of those requests (usually JSON):
While you could go through all the headers and cookies and transcribe them into httr::GET or httr::POST requests, that’s tedious, especially when most browsers present an option to “Copy as cURL”. cURL is a command-line tool (with a corresponding systems programming library) that you can use to grab data from URIs. The RCurl and curl packages in R are built with the underlying library. The cURL command line captures all of the information necessary to replicate the request the browser made for a resource. The cURL command line for the URL that gets the Republican data is:
curl 'http://graphics.latimes.com/election-2016-31146-feed.json' -H 'Pragma: no-cache' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'X-Requested-With: XMLHttpRequest' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36' -H 'Accept: */*' -H 'Cache-Control: no-cache' -H 'If-None-Match: "7b341d7181cbb9b72f483ae28e464dd7"' -H 'Cookie: s_fid=79D97B8B22CA721F-2DD12ACE392FF3B2; s_cc=true' -H 'Connection: keep-alive' -H 'If-Modified-Since: Wed, 10 Feb 2016 16:40:15 GMT' -H 'Referer: http://graphics.latimes.com/election-2016-new-hampshire-results/' --compressed
While that’s easier than manual copy/paste transcription, these requests are uniform enough that there Has To Be A Better Way. And now there is, with curlconverter.
The curlconverter package has (for the moment) two main functions:

- straighten(): returns a list with all of the necessary parts to craft an httr POST or GET call
- make_req(): actually returns a working httr call, pre-filled with all of the necessary information

By default, either function reads from the clipboard (envision the workflow where you do the “Copy as cURL”, then switch to R and type make_req() or req_params <- straighten()), but they can take in a vector of cURL command lines, too (NOTE: make_req() is currently limited to one, while straighten() can handle as many as you want).
Let’s show what happens using election results cURL command line:
REP <- "curl 'http://graphics.latimes.com/election-2016-31146-feed.json' -H 'Pragma: no-cache' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'X-Requested-With: XMLHttpRequest' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36' -H 'Accept: */*' -H 'Cache-Control: no-cache' -H 'Cookie: s_fid=79D97B8B22CA721F-2DD12ACE392FF3B2; s_cc=true' -H 'Connection: keep-alive' -H 'If-Modified-Since: Wed, 10 Feb 2016 16:40:15 GMT' -H 'Referer: http://graphics.latimes.com/election-2016-new-hampshire-results/' --compressed"

resp <- curlconverter::straighten(REP)
jsonlite::toJSON(resp, pretty=TRUE)

## [
##   {
##     "url": ["http://graphics.latimes.com/election-2016-31146-feed.json"],
##     "method": ["get"],
##     "headers": {
##       "Pragma": ["no-cache"],
##       "DNT": ["1"],
##       "Accept-Encoding": ["gzip, deflate, sdch"],
##       "X-Requested-With": ["XMLHttpRequest"],
##       "Accept-Language": ["en-US,en;q=0.8"],
##       "User-Agent": ["Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36"],
##       "Accept": ["*/*"],
##       "Cache-Control": ["no-cache"],
##       "Connection": ["keep-alive"],
##       "If-Modified-Since": ["Wed, 10 Feb 2016 16:40:15 GMT"],
##       "Referer": ["http://graphics.latimes.com/election-2016-new-hampshire-results/"]
##     },
##     "cookies": {
##       "s_fid": ["79D97B8B22CA721F-2DD12ACE392FF3B2"],
##       "s_cc": ["true"]
##     },
##     "url_parts": {
##       "scheme": ["http"],
##       "hostname": ["graphics.latimes.com"],
##       "port": {},
##       "path": ["election-2016-31146-feed.json"],
##       "query": {},
##       "params": {},
##       "fragment": {},
##       "username": {},
##       "password": {}
##     }
##   }
## ]
You can then use the items in the returned list to make a GET request manually (but still tediously).
curlconverter’s make_req() will try to do this conversion for you automagically, using httr’s little-used VERB() function. It’s easier to show than to tell:
curlconverter::make_req(REP)
VERB(verb = "GET", url = "http://graphics.latimes.com/election-2016-31146-feed.json",
     add_headers(Pragma = "no-cache", DNT = "1",
                 `Accept-Encoding` = "gzip, deflate, sdch",
                 `X-Requested-With` = "XMLHttpRequest",
                 `Accept-Language` = "en-US,en;q=0.8",
                 `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36",
                 Accept = "*/*", `Cache-Control` = "no-cache",
                 Connection = "keep-alive",
                 `If-Modified-Since` = "Wed, 10 Feb 2016 16:40:15 GMT",
                 Referer = "http://graphics.latimes.com/election-2016-new-hampshire-results/"))
You probably don’t need all of those headers, but deleting the ones you don’t need is much easier than building the request by hand through trial and error. Try assigning the output of that function to a variable and inspecting what’s returned. I think you’ll find this is a big enhancement to your workflows (if you do a lot of this “scraping without scraping”).
You can find the package on GitHub. It’s built with V8 and uses a modified version of the curlconverter Node module by Nick Carneiro.
It’s still in beta and could use some tyre kicking. Convos in the comments, issues or feature requests in GH (pls).
Learn how to produce meaningful and beautiful data visualizations with DataCamp’s ggplot2 course series. Be introduced to the principles of good visualization and the grammar of graphics plotting concept implemented in the ggplot2 package. Teach yourself to make complex exploratory plots, and build a custom plotting function to explore a large data set, combining statistics and excellent visuals. Begin learning Data Visualization with ggplot2 interactively, today.
This 5 hour course includes 5 chapters and covers the given topics below:
Chapter 1: In this chapter we’ll put you in the right frame of mind for developing meaningful visualizations. You’ll learn how to think about your audience first and be introduced to the basics of ggplot2 – the 7 layers of the grammar of graphics.
Chapter 2: In this chapter we’ll explore the iris dataset from several different perspectives to see how the data structure affects plots in ggplot2. You’ll see that making your data conform to a structure that matches the plot in mind will make the task of visualization much easier.
Chapter 3: Aesthetic mappings are the cornerstone of the grammar of graphics plotting concept. This is where the magic happens! Learn to convert continuous and categorical data into visual scales that provide access to a large amount of information in a very short time.
Chapter 4: A plot’s geometry dictates what visual elements will be used. In this chapter we’ll familiarize you with the geometries used in the three most common plot types you’ll encounter: scatter plots, bar charts and line plots.
Chapter 5: In this chapter you’ll learn about qplot; it is a quick and dirty form of ggplot2. It’s not as intuitive as the full-fledged ggplot() function, but is useful in specific instances.
Add data visualization to your skills today!
The post Durban R Users Group Meetup: 24 February 2016 @ The Green Door appeared first on Data Science Africa.
We’re kicking off the inaugural meeting of the Durban R Users Group with a live video presentation by Andrie de Vries (Senior Programme Manager, R Community Projects at Microsoft / Revolution Analytics). Andrie will be talking about “Demonstration of using R in the cloud together with Azure Machine Learning”. If you’ve kept up with Microsoft’s recent contributions to R then you’ll know that this is going to be an interesting talk.
We’ll also have Professor Bruce Page (University of KwaZulu-Natal, School of Biological and Conservation Sciences) talking about using the T-LoCoH package for tracking space/time data in R. Bruce is a recognised expert on monitoring the movements and changes in home ranges of wild animals.
Andrew Collier will give a talk entitled “From Klueless to Kaggle in 15 Minutes” which will demonstrate how you can rapidly get up to speed on Data Analysis and Predictive Modelling in R.
Get the details and sign up here.