Articles – ProgrammingR

Collecting Stock Data Using R: A Quick Guide

John — Tue, 25 Mar 2025 03:22:32 +0000

Collecting data can be a drudge for many tasks in economics or securities analysis. Fortunately, R has some good options available to streamline this task.

Popular R Packages for Stock Data

quantmod:
The quantmod package is a favorite among financial analysts.
It provides functions for quantitative financial modeling, including the retrieval of historical stock prices from sources such as Yahoo Finance.
With quantmod, you can not only automate downloading the price history for a stock but also crank out a handful of charts in your favorite format.

tidyquant:
Another useful package is tidyquant, which brings financial data into the tidyverse and its set of libraries.
These additional capabilities simplify manipulating data, running basic studies on price action and integrating other visualizations.

Example: Retrieving Stock Data

Here’s a quick example using both quantmod and tidyquant:

Using quantmod:

# Install and load quantmod if you haven't already
if (!require("quantmod")) install.packages("quantmod")
library(quantmod)

# Retrieve historical data for Microstrategy 
getSymbols("MSTR", src = "yahoo", from = "2020-01-01", to = "2025-03-01")

# Visualize the stock data using a basic chart
chartSeries(MSTR, theme = chartTheme("white"))

Using tidyquant:

# Install and load tidyquant if needed
if (!require("tidyquant")) install.packages("tidyquant")
library(tidyquant)

# Retrieve historical data for Microstrategy.
mstr_data <- tq_get("MSTR", from = "2020-01-01", to = "2025-03-01")

# Display the first few rows of the data
head(mstr_data)

These examples demonstrate the ease of accessing stock data.
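Once the prices are in a data frame, a common first study is the day-over-day return. The sketch below is a minimal illustration in base R. It uses a small hand-made data frame shaped like the output of tq_get(), and the dates and prices in it are invented so that the example runs without a network connection. The names prices and daily_return are placeholders, not part of either package.

```r
# Hypothetical data shaped like tq_get() output (values are made up)
prices <- data.frame(
  date     = as.Date(c("2025-01-02", "2025-01-03", "2025-01-06", "2025-01-07")),
  adjusted = c(100, 102, 101, 105)
)

# Simple return: today's price divided by yesterday's, minus 1
# The first day has no previous price, so its return is NA
prices$daily_return <- c(NA, prices$adjusted[-1] / head(prices$adjusted, -1) - 1)

print(prices)
```

With real data, the same line should work on the adjusted column returned by tq_get().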

In practice, which package you choose generally depends on who you work with. quantmod is popular within the trading and financial analyst communities. Analysts with a more academic background or broader programming interests have likely been pitched on the tidyverse a few times along the way. Either works, however, and can dramatically simplify your workflow.

Conclusion

We’ve shown you two ways to get the job done. The choice is up to you!

The post Collecting Stock Data Using R: A Quick Guide appeared first on ProgrammingR.

How to Remove Outliers in R

Syed — Sun, 19 Jan 2020 14:14:51 +0000

Statisticians often come across outliers when working with datasets and it is important to deal with them because of how significantly they can distort a statistical model.

Your dataset may have values that are distinguishably different from most other values; these are referred to as outliers. Usually, an outlier is an anomaly caused by measurement error, but it can also occur because the experiment being observed experiences momentary but drastic turbulence. In either case, it is important to deal with outliers because they can adversely impact the accuracy of your results, especially in regression models.

In this tutorial, I’ll be going over some methods in R that will help you identify, visualize and remove outliers from a dataset.

Looking at Outliers in R

As I explained earlier, outliers can be dangerous for your data science activities because most statistical parameters such as mean, standard deviation and correlation are highly sensitive to outliers. Consequently, any statistical calculation based on these parameters is affected by the presence of outliers.

Whether it is good or bad to remove outliers from your dataset depends on whether they affect your model positively or negatively. Remember that outliers aren’t always the result of badly recorded observations or poorly conducted experiments. They may also occur due to natural fluctuations in the experiment and might even represent an important finding of the experiment.

Whether you’re going to drop or keep the outliers requires some amount of investigation. However, it is not recommended to drop an observation simply because it appears to be an outlier.

Statisticians have devised several ways to locate the outliers in a dataset. The most common methods include the Z-score method and the Interquartile Range (IQR) method. I prefer the IQR method because it does not depend on the mean and standard deviation of a dataset, so it is the method I’ll use throughout this tutorial.
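For comparison, here is a minimal sketch of the Z-score method. It uses R’s built-in warpbreaks data, which this tutorial works with further down. The cutoff of 3 standard deviations is a common convention and an assumption here, not something prescribed by R, and the names z and z_flagged are illustrative.

```r
# Z-score sketch on the built-in warpbreaks data
breaks <- warpbreaks$breaks

# Distance of each value from the mean, in standard deviations
z <- (breaks - mean(breaks)) / sd(breaks)

# Assumed cutoff: flag values more than 3 standard deviations from the mean
z_flagged <- breaks[abs(z) > 3]
print(z_flagged)
```

On this data the cutoff flags only the largest value, 70. Because the mean and standard deviation are themselves pulled around by extreme values, the result is sensitive to the very points it is trying to detect.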

The interquartile range is the central 50% of a distribution, that is, the span between its 25th and 75th percentiles. A point is an outlier if it lies more than 1.5 times the IQR above the 75th percentile or below the 25th percentile.

For example, if

Q1 = 25th percentile

Q3 = 75th percentile

then IQR = Q3 – Q1

and an outlier is a point below [Q1 – 1.5 × IQR] or above [Q3 + 1.5 × IQR].
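The rule above can be worked through on a handful of numbers. This is a small sketch with made-up values; the variable names x, q, lower and upper are chosen for illustration only.

```r
# Toy data (made up for illustration): seven typical values and one extreme one
x <- c(10, 12, 13, 15, 16, 18, 19, 45)

# First and third quartiles; unname() drops the "25%" / "75%" labels
q <- unname(quantile(x, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences: 1.5 * IQR below Q1 and above Q3
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Values outside the fences are the outliers
outliers <- x[x < lower | x > upper]
print(outliers)
```

Here Q1 is 12.75 and Q3 is 18.25, so the IQR is 5.5 and the fences sit at 4.5 and 26.5. Only the value 45 falls outside them.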

If this didn’t entirely make sense to you, don’t fret. I’ll now walk you through the process of finding such points using R and, if necessary, removing them from your dataset. For starters, we’ll use a built-in R dataset called “warpbreaks”. It neatly shows two distinct outliers which I’ll be working with in this tutorial.

You can load this dataset on R using the data function.

# remove outliers in r - import data
data("warpbreaks")

Once loaded, you can begin working on it.

Visualizing Outliers in R

One of the easiest ways to identify outliers in R is by visualizing them in boxplots. Boxplots typically show the median of a dataset along with the first and third quartiles. They also show the limits beyond which all data values are considered as outliers. It is interesting to note that the primary purpose of a boxplot, given the information it displays, is to help you visualize the outliers in a dataset.

You can create a boxplot to identify your outliers using:

# remove outliers in R - initial boxplot
# Pass the numeric breaks column; $out returns the values drawn as outliers
boxplot(warpbreaks$breaks)$out

You can also label outliers for better visualization using the “ggbetweenstats” function which comes with the “ggstatsplot” package. If you haven’t installed it already, you can do that using the “install.packages” function.

Your code should look like this.

# install the package 
install.packages("ggstatsplot")

# Load the package
library(ggstatsplot)

# Load the dataset 
data("warpbreaks")

# Create a boxplot of the breaks column; the outliers are shown as two distinct points
boxplot(warpbreaks$breaks)$out

# Create a boxplot that labels the outliers
# (newer ggstatsplot releases may no longer accept outlier.tagging)
ggbetweenstats(warpbreaks, wool, breaks, outlier.tagging = TRUE)

Finding Outliers – Statistical Methods

Now that you have some clarity on what outliers are and how they are determined using visualization tools in R, I can proceed to some statistical methods of finding outliers in a dataset.

This is important because visualization isn’t always the most effective way of analyzing outliers. You can’t always look at a plot and say, “oh! this is an outlier because it’s far away from the rest of the points”. Your data set may have thousands or even more observations and it is important to have a numerical cut-off that differentiates an outlier from a non-outlier. This allows you to work with any dataset regardless of how big it may be.

Building on my previous discussion of the IQR method to find outliers, I’ll now show you how to implement it using R.

I’ll be using the quantile() function to find the 25th and the 75th percentiles of the dataset, and the IQR() function, which gives me the difference between the 75th and 25th percentiles.

# how to find outliers in r
Q <- quantile(warpbreaks$breaks, probs=c(.25, .75), na.rm = FALSE)

Note that the quantile() function only takes numerical vectors as input, whereas warpbreaks is a data frame. I therefore specified the relevant column by adding $breaks, which passes only the “breaks” column of “warpbreaks” as a numerical vector.

The IQR function also requires numerical vectors and therefore arguments are passed in the same way.

# how to find outliers in r - calculate Interquartile Range
iqr <- IQR(warpbreaks$breaks)

Now that you know the IQR and the quantiles, you can find the cut-off ranges beyond which all data points are outliers.

# how to find outliers in r - upper and lower range
up  <- Q[2] + 1.5 * iqr  # Upper Range
low <- Q[1] - 1.5 * iqr  # Lower Range
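To see which observations fall outside these limits before removing anything, the cut-offs can be applied to the column directly. This sketch recomputes the quartiles and IQR so that it runs on its own; the name flagged is illustrative.

```r
# Recompute the quartiles, IQR and cut-offs so this snippet runs on its own
Q   <- quantile(warpbreaks$breaks, probs = c(.25, .75))
iqr <- IQR(warpbreaks$breaks)
up  <- Q[2] + 1.5 * iqr
low <- Q[1] - 1.5 * iqr

# Values below the lower limit or above the upper limit
flagged <- warpbreaks$breaks[warpbreaks$breaks < low | warpbreaks$breaks > up]
print(flagged)
```

On warpbreaks this returns the two values 70 and 67, the same two points the boxplot draws as outliers.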

Eliminating Outliers

Using the subset() function, you can simply extract the part of your dataset between the upper and lower ranges leaving out the outliers. The code for removing outliers is:

# how to remove outliers in r (the removal)
eliminated <- subset(warpbreaks, breaks > low & breaks < up)

The boxplot without outliers can now be visualized:

# how to remove outliers in r - resulting boxplots
ggbetweenstats(eliminated, wool, breaks, outlier.tagging = TRUE)

As said earlier, outliers may or may not have to be removed, so be sure that it is necessary before eliminating them.

Other Ways of Removing Outliers

Now that you know what outliers are and how you can remove them, you may be wondering if it’s always this complicated to remove outliers. Fortunately, R gives you faster ways to get rid of them as well.

The one method that I prefer uses the boxplot() function to identify the outliers and the which() function to find and remove them from the dataset.

First, we identify the outliers:

# identify outliers in r boxplot
boxplot(warpbreaks$breaks, plot=FALSE)$out

Then save the outliers in a vector:

# how to remove outliers in r (alternative method)
outliers <- boxplot(warpbreaks$breaks, plot=FALSE)$out

This vector is to be excluded from our dataset. The which() function returns row numbers, so here it tells us the rows that do not hold an outlier; these are the rows to keep. Before removing anything, I store “warpbreaks” in a variable, say x, to ensure that I don’t destroy the original dataset.

# how to remove outliers in r (alternative method)
x <- warpbreaks
# Keep only the non-outlier rows; x[-which(...), ] would return zero rows if no outliers were found
x <- x[which(!(x$breaks %in% outliers)), ]
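As a sanity check, the two approaches can be run side by side to confirm they drop the same rows. This is a sketch; the names by_fence and by_boxplot are illustrative. One caveat: boxplot() places its box at the hinges rather than at the quantile() percentiles, so on other datasets the two methods can disagree slightly.

```r
# IQR-fence method
Q   <- quantile(warpbreaks$breaks, probs = c(.25, .75))
iqr <- IQR(warpbreaks$breaks)
by_fence <- subset(warpbreaks, breaks > Q[1] - 1.5 * iqr & breaks < Q[2] + 1.5 * iqr)

# boxplot() method
outliers <- boxplot(warpbreaks$breaks, plot = FALSE)$out
by_boxplot <- warpbreaks[!(warpbreaks$breaks %in% outliers), ]

# Compare the number of rows before and after, and whether the two results match
print(nrow(warpbreaks))
print(nrow(by_fence))
print(identical(by_fence$breaks, by_boxplot$breaks))
```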

I have now removed the outliers from my dataset using two simple commands, and this is one of the most elegant ways to go about it. Outliers are extremely common when dealing with datasets, and R gives you numerous other methods to get rid of them as well. However, being quick to remove outliers without proper investigation isn’t good statistical practice; they are essentially part of the dataset and might carry important information. Losing them could result in an inconsistent model.

The Author:

Syed Abdul Hadi is an aspiring undergrad with a keen interest in data analytics using mathematical models and data processing software. His expertise lies in predictive analysis and interactive visualization techniques. Reading, travelling and horseback riding are among his downtime activities. Visit him on LinkedIn for updates on his work.


Validate Me! Simple Test vs. Holdout Samples in R

John — Wed, 08 Jan 2020 12:54:14 +0000

In statistics, it is often necessary to not only model data but test that model as well. To do this, you need to randomly separate the data into two groups ensuring even samples regardless of the order of the original sample.

Statistical model

When a data scientist is working with a data set, he will create a statistical model. Creating a model involves finding a mathematical formula that fits the data. Once you have a model, you need to validate it. This involves comparing it to a validation sample. To get this sample, the original sample is randomly separated into two samples. One sample is a training data set used to develop the model and the other is used to validate your model.

Model validation

The process of validating a mathematical model uses a different data set than that used to develop it. Model validation is performed by comparing the model with a second data set to see if it also fits the model. If it does not fit, another model is needed. A frequent way of obtaining a validation sample is to split the original sample into two parts with one part being held for validation purposes.

Holdout sample

Part of the process of evaluating a mathematical model involves separating the original data set into training data, and a holdout sample. The training data is used to develop a mathematical model by fitting a mathematical formula to it. This mathematical formula is then applied to the holdout sample, to validate the formula. To ensure that such a comparison is valid, you must make sure that both data sets are statistically meaningful. If you only have one original data set, it is important to separate the data randomly to keep both sets statistically meaningful.

How to split data into training and testing in R

Answering the question of how to split data into training and testing in R requires using the sample function. The sample function has the format sample(x, size, replace = FALSE, prob = NULL), and it returns a vector of randomly selected values from x. The x argument is the group of values the sample function selects from. size is the number of values the sample function returns in the output vector. replace is an optional logical parameter that decides whether a value may be selected more than once. prob is an optional vector of probability weights, one for each element of x, that makes some values more likely to be selected than others.

> x=c(4,11,25,35,45,55,68,73,86,99)
 > set.seed(5)
 > a=sample(1:10,5,FALSE)
 > a
 [1] 2 9 7 3 1
 > x[a]
 [1] 11 86 68 25  4
 > x[-a]
 [1] 35 45 55 73 99

The result of the sample function is then applied to a data set, such as a vector, as the index [a] to select the sampled values and as the negative index [-a] to select the remaining ones. The set.seed function serves the purpose of ensuring that others can replicate your results.

Application

Here is a practical application using the iris flower data set, which records petal and sepal measurements.

 > data(iris)
 > smp_size = floor(0.5 * nrow(iris))
 > smp_size
 [1] 75
 > set.seed(37)
 > train_ind = sample(seq_len(nrow(iris)), size = smp_size)
 > train = iris[train_ind, ]
 > test = iris[-train_ind, ]
 > train
 > test

It demonstrates the steps needed to separate a dataset into training and testing samples. These two samples can then be used to create and test a mathematical model.
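To make the validation step concrete, here is a minimal sketch that fits a model on the training half and checks it against the holdout half. The choice of model (petal width as a linear function of petal length) and the use of root mean squared error as the comparison are assumptions made for illustration; the split itself follows the same steps as above.

```r
data(iris)

# Split iris 50/50 into training and holdout rows
set.seed(37)
train_ind <- sample(seq_len(nrow(iris)), size = floor(0.5 * nrow(iris)))
train <- iris[train_ind, ]
test  <- iris[-train_ind, ]

# Illustrative model (an assumption): petal width as a linear function of petal length
fit <- lm(Petal.Width ~ Petal.Length, data = train)

# Apply the fitted formula to the holdout sample
pred <- predict(fit, newdata = test)

# Root mean squared error on the training data and on the holdout data
rmse_train <- sqrt(mean(residuals(fit)^2))
rmse_test  <- sqrt(mean((test$Petal.Width - pred)^2))
print(c(train = rmse_train, holdout = rmse_test))
```

If the holdout error is much larger than the training error, the model fits the training sample better than it fits new data, which is the signal that another model is needed.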

The need to divide a data set to provide a testing sample is critical to validating mathematical models of data being evaluated. While there is no one function that does the entire job, the process is still quite simple in R.


Zen and The Art of Competing Against MBA’s

John — Thu, 24 Jan 2019 12:39:02 +0000

“I appreciate your ambition, but we’re looking for an MBA…”

My senior manager smiled and indicated the topic was closed. Despite the fact I was effectively running our direct mail program in the absence of my recently departed boss, the door was closed and locked. I quit two months later.

Within three years, I was promoted to her job at another company. Without an MBA.

None of this was luck. I took the other job because there was an obvious way to make this happen.

The method I’m about to lay out is intended for analysts and engineers. It’s not the only way to pull an end run around the MBA requirement, but it’s a great match with our unique abilities and passions.

Don’t Play Their Game

How do you beat Bobby Fischer, the chess grand master?

Play him in anything but chess.

The admission process for an elite MBA program is biased towards candidates with strong social skills who are coming from Finance and Business backgrounds. They are then groomed through a two year process which polishes their social skills, broadens their perspective, and expands their social network.

MBA’s are the kings and queens of being “broadly” developed and well rounded.

So be what they aren’t: deep.

This isn’t sneering. It is humanly impossible to be truly broad and deep; there’s just not enough time. You either know a little about many things or a lot about a specialty. Pick one and own it.

Hunt for High Value (Technical) Muck

No matter how many PowerPoints are unleashed in the boardroom, the performance of a business is ultimately dictated by processes and technology. Either the machine works or it doesn’t.

Be alert for messy tasks with a high impact on business performance. Especially ones which require technical skills or advanced training.

These play to your strengths.

Control the Spice, Control The Universe

Everyone wants to do marketing strategy. Marketing Analytics? Meh. Data pulls and list processing? Surely you can outsource that, right? That’s totally beneath the dignity of an MBA.

The funny thing about managing list processing:

By directly working with the data, you see things that don’t get shown in a report
By directly controlling this process, you can make changes to the data you want

Jumping back to the marketing example I shared, within three months of taking over mailing list processing, our team had access to data that nobody else was looking at. Better yet, we started reading test results in weeks instead of months. But the real value was getting visibility into the details of failed tests: we could salvage insights to improve our future bets.

Speaking as an engineer, working on this project was like being a kid in a candy store. Lots of interesting puzzles to figure out, plenty of things to optimize, and some creative coding challenges. Plus you get to show off some neat things at the end. This was totally inside my comfort zone.

That being said, this makes for good “tech guy of the year” material but we’re still not getting promoted. That takes the next step.

Translate The Win Back Into Business Terms

Go up two levels in the org chart and they probably don’t know what list processing is. They may barely understand analytics. They’re certainly not very interested in R programming.

They do understand getting things done.

So craft a story which translates the technical accomplishment into business terms and impact.

For the direct marketing example, the story looks like this:

We were able to figure out who responds to our sales and marketing programs
Since we’re selecting the right prospects and offers, our sales are up over 80%
Even better, we’re able to test in weeks rather than months, so we learn faster
Our client (a large retailer) is happy and wants to double the marketing budget

Boom. Technical result translated into MBA terms. Backed by the full power of knowing what was going on inside the box and using engineering expertise to make it run better.

When the job came open, my “unqualified” application was moved to the head of the line.

Cashing In: Getting Paid What You’re Worth

While I’m a fan of employer loyalty, there’s a key drawback in this method. Step Four addresses this.

Employers price talent based on the kind of problems they’re being hired to solve. Most of us start our career as interchangeable “commodity talent”. Corporate HR departments are good at figuring this out.

The problem with being a commodity is you’re at the mercy of the market when it comes to pricing. If you’re pricing my services as an R developer, there are a lot of other people quoting rates for similar work, at least in the eyes of an unsophisticated corporate buyer. So salaries are going to cap out at around $X.

Chances are $X is a decent wage. Good analytics developers are in short supply. But you’re going to struggle with the dysfunctional logic of companies wanting to pay top talent the “market average” rate.

Look back at the second step. You just pulled off a significant accomplishment. The value of that win is substantially higher than $X. Companies aren’t completely silly – they won’t pay the full amount. But they’re willing to pay a lot more if your job description is “can deliver specific win” than “writes code”.

My first job description at General Electric was “writes code”. My job description when I went to my next company was basically “implements change programs”. That salary bracket pays about double the first. Once you get above a certain level, you’re generally hired to solve a specific problem. Most of my recent engagements involve pricing and product development: solve this margin issue, grow this product line. The value of these improvements can be estimated and plays a role in the compensation discussion.

The problem with staying at one company is their HR organization rarely resets their view of your value. They may raise it a percentage but always compare that percentage against the rest of their employees to ensure “progress is fair”. But your market value doesn’t grow 4% per year. You’re hopping between the brackets, delivering a very different level of value – that will be priced at a higher level in the market.

The right time to move is when you can legitimately claim a significantly higher compensation bracket and your existing employer is unwilling to close the gap. It’s worth a discussion however. Be ready to explain how the value you offer has changed by an order of magnitude to justify the increase. A smart employer can offer a big adjustment to “close the gap”.

Distilling the Method

So there’s a four step process here:

Look for messy technical problems related to the area you want to play in
Solve technical problems using analytics and engineering expertise
Repackage and explain the accomplishment in terms of business impact
Once you’re established as an expert in the space, reprice your services

Sometimes the value is more than simply money. Improvements in speed, control, and visibility can often be just as valuable to a senior executive trying to drive change. The IT organizations at many big companies are incredibly slow and non-responsive; giving executives a way to get things done without traversing this swamp can earn you admission to some interesting opportunities.

Don’t ignore the value of being able to give knowledgeable advice either. Knowing how to translate the business direction into technical changes is a valued skill in product management. The reverse, being able to translate technical innovation into business change is a vital part of competitive intelligence. Your depth of expertise can set you apart for these roles.

Either way, this gives you options to skip the MBA, focus on what you love, and get the job anyway…


The First Date with your Data in R

John — Mon, 04 Jun 2018 01:01:17 +0000

So you have your data, now what? With a little R code, you can quickly get to know a lot about your dataset. By taking care of basic data hygiene, gathering summary statistics, and taking a quick look at your data through graphs first, your later analysis is strengthened and simplified. The graphs you produce in this tutorial will not only be useful for your understanding, but also for communicating your results in a report or article.

In this tutorial, we’ll cover how to do all of that in R, using the program RStudio. To keep this from getting too complicated, we’ll also stick to only using functions that exist in the base R program.

We’ve used a file already in the R library for this tutorial: co2. It’s a popular and well-supported dataset showing the concentrations of carbon dioxide in the atmosphere over several decades. If you’re already comfortable using R, you can use your own data by modifying the sample code. In the next section, we’ll show you how to add the co2 file to your code, so you can take it out on your first date.

Making Introductions

One of the reasons that R is so attractive to scientists and data analysts is the breadth of its scope. With the right extensions, its full functionality can be used on most any file type or database format. For the purposes of this tutorial, to keep it clear and simple, we’ll walk through getting to know a file already in the R library.

Uploading Data

To add the ‘co2’ dataset to your environment in RStudio, run ‘data(co2)’. The data() function loads a dataset that ships with R, and ‘co2’ names the dataset to load.

Checking for Accuracy

It’s important to check that R loaded your entire dataset with the correct headers and labels. Common practice is to print a snippet of the dataset in R and cross-reference it with the dataset as it appears in the program you created or found it in. The head and tail functions allow you to do this by printing a sample of the entries. Use ‘head(co2)’ to take a look at your first few entries.

Use ‘tail(co2)’ to have R print the last few. If these look right, you can be fairly sure that your dataset loaded correctly.
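Put together, the loading and checking steps look like the sketch below. The last four calls are an addition of ours, not part of the original walkthrough: because co2 is stored as a monthly time series, base R can report its span directly, which is a quick extra check that nothing is missing.

```r
# Load the built-in dataset into the workspace
data(co2)

# First and last few entries
head(co2)
tail(co2)

# co2 is a monthly time series, so its span can be checked directly
class(co2)    # kind of object
start(co2)    # first year and month
end(co2)      # last year and month
length(co2)   # number of monthly observations
```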

Getting to Know Your Data

Now that your dataset is loaded into R, you’re ready to get to know it. R has several quick functions you can run to visualize and understand your data enough to select the proper statistical tools later in your analysis.

Seeing your First Patterns

R offers a quick and simple tool, the summary function, for numerically checking the patterns in your data. Taking a look at your minimum, maximum, quartiles, median, and mean is as easy as running the code ‘summary(co2)’. What you’ll see next are the summary statistics for co2 concentrations in an easy-to-read table.
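As a short sketch, the summary call and two related base R functions look like this. range() and IQR() are extras we have added to show the spread; the variable name s is illustrative.

```r
data(co2)

# Six-number summary: minimum, quartiles, median, mean and maximum
s <- summary(co2)
print(s)

# The spread can be read off with two more base functions
range(co2)
IQR(co2)
```

Note that summary() does not report a mode; base R has no built-in function for the statistical mode.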

Visualizing your data

While that table is useful, it doesn’t tell you what your data actually looks like. Next, we’ll produce a quick plot in order to visually get a sense of how the data is distributed and if there are any quirks, like outliers.

Plot is a handy function built into base R just for this purpose. Since co2 concentrations are continuous, you can use the code ‘plot(co2)’ as is, without manipulation, to take your first look at the data.

You can see that the data generally trends upwards and oscillates over the course of each year. There are no obvious outliers to worry about, and it seems that the relationship is linear. These observations allow you to start making general conclusions about your data and see where to go next.

This plot was helpful for you to understand your data, but to use it for anything else requires a few updates. You can alter parameters within the plot function to produce a professional quality graphic. Run the code ‘plot(co2, xlab = "year", ylab = "CO2 (ppm)", main = "Atmospheric CO2 Concentrations from 1959 – 1997", col = "red")’ to put the finishing touches on your graph.
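Both plotting steps are shown below as a block that can be pasted into the console, written with straight quotes so R accepts it as typed. The title uses 1959 to 1997, which is the span the built-in co2 series covers.

```r
data(co2)

# Quick first look, using the defaults
plot(co2)

# The same plot with axis labels, a title and a line color
plot(co2,
     xlab = "Year",
     ylab = "CO2 (ppm)",
     main = "Atmospheric CO2 Concentrations, 1959 to 1997",
     col  = "red")
```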

Planning your Next Date

Things have gone smoothly so far. You now know enough about your data to choose further testing and even have some figures that show relevant trends in your data. With your data already loaded into R, you can explore the rest of the functionality that R has to offer.

If you’re new to R, you’re likely surprised at how efficient it is. Unlike in other statistical software, many of the functions you use constantly are already loaded in. Calling functions and printing their results can be done in one step. Plus, by using RStudio, all of your figures are quickly accessible and stored for future use.

Also, if you’re in a hurry and need a simple tool, check out our new statistics calculator (free, web based).


How To Make Your Data Analyst Resume Stand Out

John — Sun, 05 Feb 2017 23:35:11 +0000

To the typical reader, most technical resumes sound alike and share none of the unique personality behind the paper.

For example, you may know that Jane is meticulous about data quality and has an amazing knack for creatively turning business requests into statistical problems. George is easily the most “flexible” person on your team when it comes to getting up to speed on new packages. But when it comes time to write a resume, they both rattle off the same list of duties and technical keywords.

And since there’s no unique “signal” left in the resume, the recruiting process devolves into relying on other factors.

I’m going to share a couple of simple tips that can help your data analyst resume stand out.

Facts Tell, But Stories Sell

Consider, for a moment, that we are going to hire a sports star. Perhaps a football player.

Two candidates’ dossiers sit on your desk:

Candidate A’s dossier contains a long list of training programs, games played, and player statistics.

Candidate B’s dossier has a series of photographs and charts that show the best plays of their career.

Which one would you pick?

Most hiring managers would pick Candidate B, since they can visualize how they do on the job.

While I don’t suggest attaching a photograph of you coding, there are other ways to help a hiring manager see you in action. Think about a few of your best moments. What problem were you asked to solve? How did you approach it? What was the result, in terms of what you handed the customer and how it made their life better?

There’s a formula here:

Identify the specific area you were investigating
Describe what you delivered as an insight / solution
Identify who you shared the results with
Describe how it made their life better – with numbers, if possible

For example, consider the following statement:

“Led analysis of user engagement, using click tracking to understand visitor preferences and identify effective areas to promote content. Shared insights with publisher and adjusted website layout, reducing bounce rate 30% and increasing average time on site by 50%.”

If I’m hiring for a digital marketing data science role, this story about an accomplishment just captured my full attention and you are getting an interview. This is as good as a video clip of that football player intercepting a pass and running it in for a score.

Specialists Beat Generalists

Stated otherwise: are you selling parts or are you selling a car?

Analytics gives us a great collection of parts – every assignment gives us a ton of transferable skills and the typical analyst rotates around a couple of different business functions or industries in the first decade of their career. Advanced degrees give us – more parts!

The good news is these skills are rare, so starting salaries are high. The bad news – everyone else is collecting the same parts. After a decade in the workforce, we start to look alike to the business.

The cure? Stop talking about parts and show them the entire car.

Let’s step back for a moment and look at your job in a broader context. You handle part of a larger process, such as statistical modeling or data extraction. Above you is your boss, who is tasked with assembling the contributions of several different employees into a solution that is provided to the business. Ultimately, the solution should help the rest of the business, for which they fund your team’s existence.

Interesting things happen when you learn all of the pieces to deliver a particular solution. First, you can operate with limited supervision. You can also bring that process to a new company, since you know all of the parts. Also, you are more likely to find improvements. All of these things increase your value. Better yet, if you show that you have delivered similar solutions elsewhere, the risk of hiring you drops.

Executing this comes down to playing up the relevant parts of your experience and weaving them into a consistent theme across your last several roles. First, make your specialty the headline of your resume (Senior Pricing Expert!). Next, paint the picture in your introduction (Twelve years of pricing and related experience, building analytical models and pricing software). Focus on accomplishments related to your specialty. The longer (in terms of years or jobs) you can show involvement in the specialty, the better.

Don’t sell them the parts, sell them the car. For Full Price.

Less is More

Now that we’ve talked about how to focus your resume on a specialty, let’s discuss what to do about the rest.

Get Rid Of It.

I’m completely serious. The content of a resume going to an actual human should do exactly three things:

Explain the value you offer – particularly if you’re a specialist – with proof you can deliver
Give dates to convince HR you’re not a wanted criminal and were employed / studying
Present a very small number of potential “ice-breakers”

That’s it. Anything outside that reduces your chances of success.

Massive list of every duty? Cut it, doesn’t help
Every certification? Nope, doesn’t help
Every technology ever used? Gone
Eight page resume? Prioritize content.

Edit ruthlessly. What do you want the audience to focus on? Put that at the top of the page. Add a couple of interesting personal details at the bottom of your resume, to give people a way to break the ice. Everything else is a risk. You have X minutes in an interview to build a rapport with your interviewer and persuade them you are a fit for the role. Providing details on unrelated topics is a great way to get dragged into an interesting conversation that chews up time without moving you forward. For example, I spent two years hunting credit card fraud. Interesting? Absolutely. Helps close pricing jobs? Not really….

The other benefit of this approach is it keeps you out of traps. Listing every certification or technology you’ve ever been exposed to is a massive risk, especially if you haven’t recently used many of them. Listing experience unrelated to your goal risks derailing the interview. By staying tight and focused, you guide the conversation to the topics you want.

I’m a fan of the one page resume, even for senior people. Mine boils down twenty years of analytics, over half at the director level, into a single page that drives home my area of expertise (marketing and pricing analytics). It gets a decent callback rate for competitive roles.

The Revised Resume

These changes add up to a tighter, more focused resume. They give you a chance to showcase your unique experience, helping you stand out from the rest of the crowd and potentially qualifying for higher paid roles. Finally, you avoid many common ways to derail an interview by allowing the conversation to focus on the wrong topics.

Most analytics resumes are poorly written. This is an easy way to stand out from the pack.

This article is a follow-up to our earlier post on resumes and interview tips for R programmers.

The post How To Make Your Data Analyst Resume Stand Out appeared first on ProgrammingR.

Simple Anagram Finder Using R

John — Mon, 28 Nov 2016 05:19:20 +0000

One of my early programming projects in Python was a word game solver (example: word jumble solver or wordle solver) – the early version was a simple script, which grew into a web application. Since then, I’ve always enjoyed using dictionary search problems to test out a new language.

Today’s article will look at building a searchable anagram dictionary using R. We will take an English word list and create a lookup key for each word by sorting its letters into alphabetical order. We will store these keys in a hash-map data structure (um, Houston, we may have a wee problem here….does R even have such a beast?) and write a simple function to check if your letters can be unscrambled into a word.

Let’s start with the dictionary. We’re going to use the Enable open source dictionary. A copy can be downloaded from here. Please download to your local machine and access it from there.

# Read the word list, one word per line (adjust the path to your download)
wordlist <- readLines("c:\\enable1.txt")

Now, to sort the letters in a string into alphabetical order. Basically, we’re splitting the word up into letters. Then using the unlist function to convert this into a vector. Then sorting the vector into alphabetical order. Then pasting the whole mess together. I give you… the hashword.

# word is a single string, such as one entry from wordlist
hashword <- paste(sort(unlist(strsplit(word, ""))), collapse = "")

This is helpful for two reasons:

Identifies anagram families – words with the same letters
Can be used to unscramble letters into words
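
As a quick illustration of both points, here is a minimal sketch in base R. The helper name sort_letters is my own label for the one-liner above, not something from a package.

```r
# Hypothetical helper wrapping the one-liner above
sort_letters <- function(word) {
  paste(sort(unlist(strsplit(word, ""))), collapse = "")
}

# Words in the same anagram family share a key
key_goat <- sort_letters("goat")
key_toga <- sort_letters("toga")
key_goat == key_toga                      # TRUE: both reduce to "agot"

# Scrambled letters reduce to the same key as the word they hide
key_scrambled <- sort_letters("terref")
key_scrambled == sort_letters("ferret")   # TRUE
```

The sorted key carries no information about the original letter order, which is exactly what makes it useful for lookups.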

To build the hash map, we will loop through the word list and load them into the hash map. Now… to find options to use for the hash map. Unlike many other languages, R doesn’t have an obvious choice for a fast key-value lookup. My first attempt at this used lists and the name property. While this worked for small examples, it wasn’t efficient for large sets.

And then I stumbled across R environments (longer discussion here). These incorporate a hashmap structure by default, so they are a good candidate for our project. (This functionality has also been encapsulated in the “hash” package available on R-Cran – documentation here.)
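
If you haven’t used an environment this way before, here is a small sketch of the basic key-value operations. The variable names and keys are purely illustrative.

```r
# A bare environment used as a key-value store (names are illustrative)
kv <- new.env(hash = TRUE, parent = emptyenv())

# Store values under string keys, using either [[<- or assign()
kv[["agot"]] <- c("goat", "toga")
assign("eefrrt", "ferret", envir = kv)

# Look up a key; a missing key returns NULL rather than raising an error
found  <- kv[["agot"]]
absent <- kv[["zzzz"]]
is.null(absent)                                   # TRUE

# Check whether a key exists, and list all keys
has_key  <- exists("eefrrt", envir = kv, inherits = FALSE)
all_keys <- sort(ls(kv))
```

Setting parent = emptyenv() keeps lookups from falling through to other environments, so a key is only found if we put it there ourselves.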

We’re ready to write our dictionary builder:

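# An environment with hash = TRUE gives us fast key-value lookups
map <- new.env(hash = TRUE, parent = emptyenv())

for (word in wordlist) {
  # Sort the letters of the word to build its lookup key
  hashword <- paste(sort(unlist(strsplit(word, ""))), collapse = "")

  if (!is.null(map[[hashword]])) {
    # Key already seen: append the word (c = word names the new element "c")
    map[[hashword]] <- c(map[[hashword]], c = word)
  } else {
    # First word for this key: start a new list
    map[[hashword]] <- list(word)
  }
}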

This iterates across the original list of words, converting each word into a sorted string of letters. It checks the hashmap (the environment variable) to see if we’ve already seen that set of letters before. If we’ve seen it before, we append the word to the list of words stored for that hashword value. If not, we initialize a list with that word.
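
For comparison, here is a sketch of a loop-free way to build the same grouping with vapply and split. This is an alternative I’m suggesting, not the approach used above, and it stores each family as a character vector rather than a list. The short words vector is a stand-in for the full word list.

```r
# Small stand-in for the full word list
words <- c("goat", "toga", "ferret", "listen", "silent", "enlist")

# Build the sorted-letter key for every word in one pass
keys <- vapply(
  strsplit(words, ""),
  function(chars) paste(sort(chars), collapse = ""),
  character(1)
)

# Group words by key; each element is one anagram family
families <- split(words, keys)

# Optionally copy the groups into a hashed environment for fast lookup
anagram_map <- list2env(
  families,
  envir = new.env(hash = TRUE, parent = emptyenv())
)

anagram_map[["agot"]]     # "goat" "toga"
anagram_map[["eilnst"]]   # "listen" "silent" "enlist"
```

Growing a list inside a loop gets slow on large inputs, so grouping everything in one call to split may be worth trying on the full dictionary.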

And now we need a retrieval function. This is a simple lookup. Remember to sort the word before looking it up.

getAnagrams <- function(x) {
  # Sort the letters of x to rebuild the key, then look it up in the map
  key <- paste(sort(unlist(strsplit(x, ""))), collapse = "")
  map[[key]]
}

And to see the results in action. First, looking at a known “anagram family”:

> getAnagrams("goat")
[[1]]
[1] "goat"

$c
[1] "toga"

And next, trying to unscramble letters into a word (check here to confirm results):

> getAnagrams("terref")
[[1]]
[1] "ferret"

And there you have it. A simple anagram solver compressed into a couple of lines of R.

This is a decent starting point for those interested in automated puzzle solvers. A couple of useful extensions of this code include solving for partial anagrams (uses some but not all letters) – output would look like this or this. Adding a letters-to-points score mapping would enable you to calculate scores for Scrabble or Words with Friends. This works fairly well for basic searches. The next stage up from there generally will require a rewrite, with a faster language and more efficient data structure to address the complexity of more advanced word problems such as solving Boggle.
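
To make those two extensions concrete, here is a rough sketch in base R. The function names score_word and can_build are my own, the tile values are the standard English Scrabble ones, and the scoring ignores board bonuses and blank tiles.

```r
# Standard English Scrabble tile values, indexed by letter
tile_values <- c(
  a = 1, b = 3, c = 3, d = 2, e = 1, f = 4, g = 2, h = 4, i = 1,
  j = 8, k = 5, l = 1, m = 3, n = 1, o = 1, p = 3, q = 10, r = 1,
  s = 1, t = 1, u = 1, v = 4, w = 4, x = 8, y = 4, z = 10
)

# Sum the tile values for a lowercase word
score_word <- function(word) {
  sum(tile_values[strsplit(word, "")[[1]]])
}

# TRUE if `word` can be spelled using some (not necessarily all) of `letters`
can_build <- function(word, letters) {
  need <- table(strsplit(word, "")[[1]])
  have <- table(strsplit(letters, "")[[1]])
  all(names(need) %in% names(have)) &&
    all(as.vector(need) <= as.vector(have[names(need)]))
}

score_word("ferret")            # 9
can_build("tree", "terref")     # TRUE
can_build("teeter", "terref")   # FALSE: needs two t's and three e's
```

A partial anagram search could then filter the word list with can_build, although checking every word this way is slow, which is part of why the heavier puzzles call for a different data structure.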

The post Simple Anagram Finder Using R appeared first on ProgrammingR.

Webscraping with rvest: So Easy Even An MBA Can Do It!

John — Mon, 07 Nov 2016 02:44:36 +0000

This is the fourth installment in our series about web scraping with R. The series includes practical examples for the leading R web scraping packages, including RCurl and jsonlite (for JSON). This article primarily talks about using the rvest package. We will be targeting data using CSS selectors.

I read the email and my heart sank. As part of our latest project, my team was being asked to compile statistics for a large group of public companies. A rather diverse set of statistics, from about four different sources. And to make life better, the list was “subject to change”. Translated: be ready to update this mess at a moment’s notice….

The good news: most of the request was publicly available, crammed into the nooks and corners of various financial sites.

This was a perfect use case for web scraping. An old school update (aka, the intern-o-matic model) would take about three or four hours. Even worse, it would be nearly impossible to quality check. A well written web scraper would be faster and easier to check afterwards.

Getting Set Up – rvest and jsonlite

We’re going to use two external libraries for this rvest tutorial: rvest and jsonlite. To install:

# install and load rvest
install.packages("rvest")
library(rvest)

# r json - install and load jsonlite
install.packages("jsonlite")
library(jsonlite)

After installing the rvest and jsonlite libraries, I fired up Google and started looking for sources. The information we needed was available on several sites. After doing a little comparison and data validation, I settled on several preferred sources.

Important: Many websites have policies which restrict or prohibit web scraping; the same policies generally prohibit you from doing anything else useful with the data (such as compiling it). If you intend to use the scraped data for public (publication) or commercial use, you should consult a lawyer to understand your legal risks. This code should be used for educational purposes only. In practice, personal scraping is difficult to detect and rarely pursued (particularly if there is a low volume of requests). Keep this in mind if you are going to extract data from a website using R.

Back to our example. To reduce the risk of getting a snarky legal letter, we’re going to share a couple of examples using the package to grab information from Wikipedia. The same techniques can be used to pull data from other sites.

Scraping HTML Tables with rvest

In many cases, the data you want is neatly laid out on the page in a series of tables. Here’s a sample of rvest code where you target a specific page and pick the table you want (in order of appearance). This script is going after every item on the page within an HTML <table> tag and selecting the first one.

# rvest tutorial - extract data from website using r
library(rvest)

src <- "https://en.wikipedia.org/wiki/List_of_largest_employers_in_the_United_States"

# rvest web scraping - get the page
page <- read_html(src)

# rvest html table - use html_nodes to parse html in r
# rvest html_nodes will grab all tables here; you must filter later
# .[1] keeps the first table; html_table converts it to a data frame
employers <- page %>%
  html_nodes("table") %>%
  .[1] %>%
  html_table()

# html_table returned a list of length one; pull out the data frame
employers <- employers[[1]]

This will generate a nicely formatted list of the top employers in the US. This technique can be easily extended to grab data in almost any table on a web page. Basically, grab anything enclosed within a table tag and count through the tables until you find the one you want.
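
Here is a small sketch of the “count through the tables” step. To keep it self-contained, it parses a made-up page from a literal HTML string instead of hitting a live site; the table contents are invented for illustration.

```r
library(rvest)

# Hypothetical page with two tables, parsed from a literal string
page <- read_html('<html><body>
  <table>
    <tr><th>Rank</th><th>Employer</th></tr>
    <tr><td>1</td><td>Acme</td></tr>
  </table>
  <table>
    <tr><th>Year</th><th>Source</th></tr>
    <tr><td>2015</td><td>Filing</td></tr>
    <tr><td>2016</td><td>Survey</td></tr>
  </table>
</body></html>')

# Convert every table on the page to a data frame
tables <- page %>% html_nodes("table") %>% html_table()

# Inspect the count, column names and sizes to find the table you want
length(tables)
lapply(tables, names)
lapply(tables, dim)

# Pick the second table by position
wanted <- tables[[2]]
```

On a real page the position of a table can shift when the site is edited, so it is worth checking the column names before trusting the index.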

But wait, there’s more! It slices, it dices, it even finds Julianne’s fries.

Slicing and Dicing with CSS selectors

The ability to select pieces of a page using CSS selectors gives you the freedom to do some creative targeting. For example, you might want to grab the content of a specific text box on a page. For this second example, we’re going to target a non-table element of the page – the list of sources at the end of the Wikipedia article. On Julienning, of course (the cutting technique used to make Julienne Fries).

A little inspection of the page reveals the sources are organized as an ordered list (<ol>) with the class “references”.

# rvest web scraping - get the page
page <- read_html("https://en.wikipedia.org/wiki/Julienning")

# point rvest html_nodes at list elements with class = references
# parse html in r using rvest html_text
sources <- page %>%
  html_nodes(".references li") %>%
  html_text()

The output of this:

sources
[1] "^ a b Larousse Gastronomique. Hamlyn. 2000. p. 642. ISBN 0-600-60235-4. "
[2] "^ Viard, Alexandre (1820). Le Cuisinier Impérial (10th ed.). Paris. OCLC 504878002. "

Rvest Examples: rvest div class and beyond

This same technique can be used to select items based on the class, the HTML element ID field, the tag type, or the position inside another tag. In simple terms:

Target by Class => appears as <div class="target"> => you target this as: ".target"
Target by Element ID => appears as <div id="target"> => you target this as: "#target"
Target by HTML tag type => appears as <table> => you target this as: "table"
Target child of another tag => appears as <li> inside <ol class="sources"> => you target this as: ".sources li"
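
A short sketch showing the four kinds of selector side by side. The HTML snippet, class names and IDs are made up for the example; only the rvest calls are real.

```r
library(rvest)

# Hypothetical page fragment covering each targeting style
page <- read_html('<html><body>
  <div class="target">by class</div>
  <div id="main">by id</div>
  <table><tr><td>by tag</td></tr></table>
  <ol class="sources"><li>first</li><li>second</li></ol>
</body></html>')

by_class <- page %>% html_nodes(".target") %>% html_text()      # class selector
by_id    <- page %>% html_nodes("#main") %>% html_text()        # ID selector
by_tag   <- page %>% html_nodes("td") %>% html_text()           # tag selector
nested   <- page %>% html_nodes(".sources li") %>% html_text()  # child of another tag
```

Each call returns a character vector, so a selector that matches several elements (like the list items here) gives you one entry per match.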

This is just scratching the surface of what you can accomplish using CSS selector targeting. For a deeper view of the possibilities, take a look at some of the tutorials written by the jQuery community. This is a decent starting point if you need to write an R web crawler.

JSON: On a Silver Platter…

Many modern web design frameworks don’t incorporate the data request into the initial HTML document. The initial document serves as a template and the data is retrieved via a series of follow-up JavaScript calls after the page is loaded. You’ll encounter this when you look at the document and realize the data you’re after isn’t anywhere in the HTML (which is usually 80% JavaScript). The trick with these sites is to look at the “network activity” from the page. One of these calls is requesting and getting data. You’ll see a neatly formed JSON (JavaScript Object Notation) object returned by that request. Once you find it, try to reverse engineer the request.

The good news is once you’ve figured out how the request is structured, the data is usually handed to you on a silver platter. The basic design of JSON is a dictionary structure. Data is labeled (usually very well), free of display cruft, and you can filter down to the parts you want. For a deeper look at how to work with JSON, check out our article on this topic.
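
As a rough sketch of what “handed to you on a silver platter” looks like, here is jsonlite parsing a small response. The JSON body and its field names are invented for the example; a real site will have its own structure.

```r
library(jsonlite)

# Hypothetical response body, as you might see it in the browser's network tab
response <- '{
  "ticker": "XYZ",
  "employees": 1200,
  "filings": [
    {"year": 2015, "revenue": 10.5},
    {"year": 2016, "revenue": 12.1}
  ]
}'

# fromJSON turns the dictionary into a named list
parsed <- fromJSON(response)

names(parsed)
parsed$ticker

# An array of records is simplified into a data frame
filings <- parsed$filings

# Filter down to the part you want
latest <- filings[filings$year == 2016, ]
```

Once you have worked out the request, fromJSON can usually be pointed at the URL directly instead of a string.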

Other Benefits

While it is always nice to automate the boring stuff, there are a couple of other advantages to using web scraping over manual collection. The use of scripted processes makes it easier to replicate errors and fix them. You’re no longer at the whim of a (usually bored) human data collector (aka the intern-o-matic) grabbing the wrong fields or mis-coding a record.

We have also found that large scale database errors are detected faster in this approach. For example, in the corporate data collection project we mentioned earlier we noticed that the websites we were scraping generally didn’t seem to collect accurate data on certain types of companies. While this would have eventually surfaced via a manual collection effort, the process-focused element of scraping forced this issue to the surface quickly.

Finally, since the scraping script shrank our refresh cycle from several hours to under a minute, we can refresh our results much more frequently.

Scraping Websites with R

This was the latest in our series on web scraping.

The post Webscraping with rvest: So Easy Even An MBA Can Do It! appeared first on ProgrammingR.

Resume & Interview Tips For R Programmers

John — Wed, 06 Jul 2016 08:04:36 +0000

Speaking as a hiring manager, it doesn’t take much to stand out as a candidate for a statistical programming job. We just finished hiring the last of several analyst positions for a new data science unit at my day job. The final round was surprisingly less competitive than I expected; many of the candidates either failed to prepare or made basic mistakes in the job search process.

In the interests of helping others, here are a few resume and interview tips that could have improved their chances.

1 – Google me (and my company)

This one is basic, but I was shocked by the volume of candidates who didn’t even bother to learn about the company or the hiring manager.

This is surprisingly easy to fix and is a good first step in establishing yourself as a serious candidate.

First, the recruiter almost always tells you that you will be meeting with Person X from Company Y, who is in Industry Z. Fantastic, feed those three pieces of information into Google or LinkedIn and see what comes up. Usually, you get some great stuff like:

The Hiring Manager’s LinkedIn profile: full of great information about their past, with hints about the subjects which interest them. For example, I mention Python and the Python user group. Telling me about your Python projects would probably get you some extra credibility (just saying….)
Company Profile: At a minimum, read their LinkedIn or Crunchbase profile and take a peek at their website; read their most recent couple of press releases (these are listed in the “News” section of Google Finance).

If you’ve got the time, look for information about how companies in that industry are using data science or technology. Google the industry name along with “data science” and “case study”. You’ll usually find a couple of articles about projects other people have done. Think about how we might apply them at my company; these are great interview topics (“I read Company XYZ is using text analysis to data mine customer comments, what do you guys think about this?”) for the dreaded “do you have any questions for me” portion of our conversation.

Humility is important here – the hiring manager knows a lot more about their industry and company than you do. But a little creative thinking can transform our interview from a painful conversation about “your greatest weakness” to a more exciting conversation about what you could do if you got the job. Guess which candidate I’m going to hire….

2 – Focus Job Descriptions On Unique Lessons / Accomplishments

Give the hiring managers a little credit. Most entry level jobs are very similar across companies. Business analysts are generally asked to gather customer requirements. Project Managers hold meetings. Developers and Statistical Programmers write code and tests. In fact, many times we mentally boil this down to a candidate having X years of experience, sorted into buckets (analytics / technical / business).

Instead of doing a recital of your everyday duties on your resume, highlight the top few things you accomplished in that position. Write a short summary of what you did and what the project accomplished or what you learned from it. Pay close attention to anything which is different or potentially interesting to a hiring manager. Which of your accomplishments would your manager brag about to their (non-technical) VP?

For example, I expect any statistical analyst has “extracted and transformed data” and “updated standard reporting”. What would catch my eye is someone describing how they analyzed a marketing program (identified best segments to promote) or figured out how to speed up DNA sequencing or moved a bunch of standard reporting to a self-service website. This type of accomplishment sets your resume apart.

Trust me, if you can tell me a cool story about how you made things better, I’ll assume you can probably handle the customer paperwork.

3 – Don’t List Technical Skills You Don’t Know (Very Well)

This particular topic has ascended to the coveted status of ‘pet peeve’. Every technical resume contains a section which lists the computer languages and packages that you would like me to believe you can use. That last point is crucial to the success of this section.

There is a misconception that having a massive list of technologies on your resume is a good thing. For most jobs and candidates, it isn’t. The reality is that my team has standardized around a couple of core technologies (R, Python, SQL) and anyone joining the group will be required to learn any of the missing technologies PLUS our environment PLUS our data. So there’s a tiny number of perfect unicorns roaming around out there who can know everything on Day One and a much larger number of decent candidates who know most of the package and demonstrate the ability to easily bridge any gaps. Most sane hiring managers are aware of this (and are fine with it).

I get a warm fuzzy feeling you can bridge the gaps when we talk about technologies that you’ve mastered, where you can list significant projects that involved that technology and discuss the details of the project. I don’t care if they were work projects or personal projects. In fact, since the professional projects I’m involved in (marketing and pricing data science) are covered by big confidentiality agreements, I often use examples from my side project (a word game site) for technical discussions. Nobody cares if I share how to build a scrabble cheat. And many managers will give credit for expertise in a similar space. If you can master SAS, you can probably figure out R fairly quickly.

I do not, however, get that same warm fuzzy feeling when you indicate that the only exposure you had to a programming language that you list on your resume is an online course and you’ve never actually used it for a serious project. Especially if you can’t answer basic questions about the core concepts of the language. And please, if you don’t have significant recent practical experience in a technology, don’t dress it up with verbiage indicating you’re “proficient”. The interview is pretty much over once I discover a gap between your resume and reality.

There’s also a question of focus. The more stuff you list, the harder it is for your audience to understand what you’re actually good at. If you boil it down to a couple of highly relevant “preferred technologies”, a hiring manager will know exactly what you’re bringing to the table. You’re also communicating you’re serious about mastering that particular technology. Scrap the fluff skills and talk about your projects.

In summation, don’t put any technical skill on your resume without being prepared to demonstrate significant commitment to applying it.

4 – Don’t Oversell Your Online Classes

Sadly, that online class isn’t really a compelling signal you have technical skills.

First, taking an online course in data science or coding isn’t unique anymore. Most of my entry-level candidate pool claimed some form of online education or independent learning. Furthermore, these courses rarely provide an employer with an objective measure of technical aptitude. They do demonstrate that you’re interested in the craft, although I already expect that since you applied for the position.

Now – if you took that knowledge and applied it to create a useful project, that quickly flips the script. This can be anything – a website, a useful module or open-source contribution, a tutorial, or an interesting piece of data analysis posted on your blog. Our conversation will shift from a generic discussion of “the latest online course” to a more unique discussion of what you were able to accomplish with the tool. Plus you earn points for being a self-taught developer.

Relax and Have Fun

One common trait for all of our successful candidates was they were able to show our team they were genuinely interested in the mission we were asking them to perform. They spoke enthusiastically about what they could accomplish if we gave them an opportunity. This combination of technical expertise and interest in the role was what got them hired.

Hopefully you’re reading this article because you like the craft of R Programming and data science. So think about your next interview in that sense; not as some weird HR ritual to be endured, but as an opportunity to speak with the manager about how you can practice our craft. Bring your enthusiasm for R to the meeting and brainstorm with the manager about how you can use it to help them.

That will get you hired!

The post Resume & Interview Tips For R Programmers appeared first on ProgrammingR.

Calling Python from R with rPython

bryan — Mon, 13 Jan 2014 19:23:34 +0000

Python has generated a good bit of buzz over the past year as an alternative to R. Personal biases aside, an expert makes the best use of the available tools, and sometimes Python is better suited to a task. As a case in point, I recently wanted to pull data via the Reddit API. There isn’t an R package that provides easy access to the Reddit API, but there is a very well designed and documented Python module called PRAW (or, the Python Reddit API Wrapper). Using this module I was able to develop a Python-based solution to get and analyze the data I needed without too much trouble.

However, I prefer working in R, so I was glad to discover the rPython package, which enables calling Python scripts from R. After finding rPython, I was able to rewrite my purely Python script as a primarily R-based program.

If you want to use rPython there are a couple of prerequisites you’ll need to address if you haven’t already. No surprise, you’ll need to have Python installed. After that, you’ll need to install the PRAW module via pip install praw. Finally, install the rPython package from CRAN. (But see the note below first if you’re on Windows.)

After you’ve completed those steps, it’s as easy as writing your Python script and adding a line or two to your R code.

First create a Python script that imports the praw module and does the first data call:

import praw

# Set the user agent information
# IMPORTANT: Change this if you borrow this code. Reddit has very strong
# guidelines about how to report user agent information
r = praw.Reddit('Check New Articles script based on code by ProgrammingR.com')

# Create a (lazy) generator that will get the data when we call it below
new_subs = r.get_new(limit=100)

# Get the data and put it into a usable format
new_subs=[str(x) for x in new_subs]

Since the Python session is persistent, we can also create a shorter Python script that we can use to fetch updated data without reimporting the praw module:

# Create a (lazy) generator that will get the data when we call it below
new_subs = r.get_new(limit=100)

# Get the data and create a list of strings
new_subs=[str(x) for x in new_subs]

Finally, some R code that calls the Python script and gets the data from the Python variables we create:

library(rPython)

# Load/run the main Python script
python.load("GetNewRedditSubmissions.py")

# Get the variable
new_subs_data <- python.get("new_subs")

# Load/run re-fetch script
python.load("RefreshNewSubs.py")

# Get the updated variable
new_subs_data <- python.get("new_subs")

head(new_subs_data)

A few final notes:

The main drawback to the rPython package is that it currently doesn’t run on Windows. The developer (Carlos J. Gil Bellosta) is working to fix this, though. If that wrinkle gets resolved, I can see this being a very popular package.
You can use RStudio to write your Python programs, which is easier than switching to another IDE for simple scripts. However, it causes an issue with EOL characters. Namely, you need to add a blank line at the end of each .py file to get it to load properly.
The Python session rPython initiates is associated with the R session. Any Python modules you load or variables you create will be available until you remove them or close the R session.

The post Calling Python from R with rPython appeared first on ProgrammingR.