- Overview
- Classification
- Kernel Transformations
- Non-Separable Data
- Further Considerations
- Examples
- References

The exact cost is specified by the so-called **loss function**:

- The **Heaviside** loss function penalizes any instance appearing on the wrong side of the SVM curve.
- The **hinge** loss function only penalizes instances appearing outside the other margin, and the penalty increases with the distance.
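The two loss functions can be sketched in a few lines. This is only an illustration under the usual conventions, which are assumptions here: the class label `y` is ±1, `t` is the value of the decision function at the instance, and the function names are hypothetical.

```python
def heaviside_loss(y, t):
    """Unit penalty for any instance on the wrong side of the boundary."""
    return 1.0 if y * t < 0 else 0.0


def hinge_loss(y, t):
    """No penalty beyond the instance's own margin (y * t >= 1);
    otherwise the penalty grows linearly with the distance."""
    return max(0.0, 1.0 - y * t)


# (label, decision value) pairs: well classified, inside the margin,
# misclassified, and exactly on the margin.
cases = [(+1, 2.0), (+1, 0.5), (+1, -0.5), (-1, -1.0)]
losses = [(heaviside_loss(y, t), hinge_loss(y, t)) for y, t in cases]
for (y, t), (h0, h1) in zip(cases, losses):
    print(f"y={y:+d}, t={t:+.1f}: Heaviside={h0}, hinge={h1}")
```

Note that the second case is correctly classified (no Heaviside penalty) but still incurs a hinge penalty because it sits inside the margin.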

In both instances, the goal is simple: we attempt to maximize the margin while minimizing the total costs.

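Practically speaking, this has the simple effect of putting an upper bound $C$ on the weights $\alpha_i$ in the dual formulation. For a kernel $K$ (the ordinary inner product in the linear case), we solve

$$\max_{\alpha}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{N}\alpha_i\alpha_j y_i y_j\,K(\mathbf{x}_i,\mathbf{x}_j)\quad\text{subject to}\quad 0\le\alpha_i\le C,\ \sum_{i=1}^{N}\alpha_i y_i = 0,$$

to get a set of support vectors $\mathbf{x}_i$ with non-zero weights $\alpha_i$ and a decision function

$$T(\mathbf{x}) = \sum_{i=1}^{N}\alpha_i y_i\,K(\mathbf{x}_i,\mathbf{x}) + b,$$

which is scaled as in the previous posts, and which satisfies

$$y_i\,T(\mathbf{x}_i)\le 1$$

for each support vector.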

This approach introduces a new cost parameter which must be specified prior to solving the QP: a high cost means that few support vectors will violate the soft-margin (i.e. few mistakes are allowed), while a low cost means that more mistakes are allowed and the soft-margin may contain a large number of support vectors (see below for an example).
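To see the effect of the cost parameter, here is a minimal sketch on a made-up one-dimensional dataset. It assumes the standard hinge-loss form of the soft-margin objective, $\frac{1}{2}w^2 + C\sum_i \max(0, 1 - y_i(wx_i + b))$; the data, the two candidate boundaries, and the function names are all hypothetical.

```python
def hinge(y, t):
    return max(0.0, 1.0 - y * t)


def soft_margin_cost(w, b, data, C):
    """Assumed soft-margin objective: margin term plus C times the total hinge loss."""
    penalty = sum(hinge(y, w * x + b) for x, y in data)
    return 0.5 * w * w + C * penalty


# Made-up 1-D data: the negative instance at x = 1 sits close to the positives.
data = [(-3.0, -1), (-2.0, -1), (1.0, -1), (2.0, +1), (3.0, +1)]

wide = (0.5, 0.0)     # large margin (width 4), but x = 1 is misclassified
narrow = (2.0, -3.0)  # small margin (width 1), no violations at all


def preferred(C):
    """Return which of the two candidate boundaries has the lower cost."""
    cost_wide = soft_margin_cost(*wide, data, C)
    cost_narrow = soft_margin_cost(*narrow, data, C)
    return "wide" if cost_wide < cost_narrow else "narrow"


print(preferred(0.1))   # low cost: mistakes are cheap
print(preferred(10.0))  # high cost: mistakes are expensive
```

With a low cost the wide margin wins even though it tolerates a mistake; with a high cost the narrow, violation-free margin is preferred.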


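The solution is "simple": we devise a transformation $\Phi$ from the initial feature space to a higher-dimensional space, and we train a linear SVM on the transformed data (see Watson's example above).

We thus solve

$$\max_{\alpha}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{N}\alpha_i\alpha_j y_i y_j\,\Phi(\mathbf{x}_i)^\top\Phi(\mathbf{x}_j)\quad\text{subject to}\quad \alpha_i\ge 0,\ \sum_{i=1}^{N}\alpha_i y_i = 0,$$

to get a set of support vectors $\mathbf{x}_i$ with non-zero weights $\alpha_i$ and a decision function

$$T(\mathbf{x}) = \sum_{i=1}^{N}\alpha_i y_i\,\Phi(\mathbf{x}_i)^\top\Phi(\mathbf{x}) + b,$$

which is scaled as in the previous post. Class assignment for $\mathbf{x}$ proceeds exactly as in the case of the simple linear kernel given by the identity transformation $\Phi(\mathbf{x}) = \mathbf{x}$.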

Not every transformation is allowable, unfortunately: in order for the associated QP to remain solvable, the **inner product** must be positive semi-definite. Standard families of transformations do satisfy these conditions, however.

How does one choose an appropriate $\Phi$? Prior knowledge of the data structure, including hitting on the right level of complexity, comes in handy. Would you have been able to recognize that

was the "right" transformation for the data presented in the image above? What would be an appropriate transformation for the dataset below?

Obviously, selecting the perfect $\Phi$ is not always easy, or even feasible. Thankfully, there are **kernel** functions $K$ that generalize the notion of the inner product and that subsume the need to specify a function $\Phi$ in certain cases, at the cost of providing values for a few parameters.
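A small worked check may help make this concrete. The sketch below uses a textbook transformation chosen purely for illustration (it is not the one from the example above): in two dimensions, $\Phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1x_2, x_2^2)$ has an inner product equal to the squared inner product of the original vectors, so the kernel can be evaluated without ever computing $\Phi$.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def phi(x):
    """Illustrative explicit map from 2-D to 3-D (an assumption, not from the article)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def quadratic_kernel(x, z):
    """Same quantity, computed directly in the original 2-D space."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(z))    # inner product in the 3-D space
implicit = quadratic_kernel(x, z)  # kernel evaluation in the 2-D space
print(explicit, implicit)
```

Both routes give the same value (up to floating-point rounding), which is why only the kernel, and not the transformation itself, needs to be specified.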

The most common kernels include:

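- **simple linear**: $K(\mathbf{x},\mathbf{z}) = \mathbf{x}^\top\mathbf{z}$;
- **polynomial**: $K(\mathbf{x},\mathbf{z}) = (\mathbf{x}^\top\mathbf{z} + c)^d$, where $c\ge 0$ (typically 1) and $d\in\mathbb{N}$;
- **gaussian** (or radial-basis function): $K(\mathbf{x},\mathbf{z}) = \exp(-\gamma\|\mathbf{x}-\mathbf{z}\|^2)$, where $\gamma$ is positive, and
- **sigmoidal** (or perceptron): $K(\mathbf{x},\mathbf{z}) = \tanh(\kappa\,\mathbf{x}^\top\mathbf{z} - \delta)$, for allowable combinations of $\kappa$ and $\delta$.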
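The kernels listed above can be written down directly. The parametrizations and parameter names below (`c`, `d`, `gamma`, `kappa`, `delta`) are common textbook choices and should be read as assumptions; other references scale the gaussian and sigmoidal kernels differently.

```python
import math


def dot(x, z):
    return sum(xi * zi for xi, zi in zip(x, z))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, c=1.0, d=2):
    # (x.z + c)^d, with c >= 0 and d a positive integer
    return (dot(x, z) + c) ** d


def gaussian_kernel(x, z, gamma=0.5):
    # exp(-gamma * ||x - z||^2), with gamma > 0
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, z, kappa=0.1, delta=0.0):
    # tanh(kappa * x.z - delta); not every (kappa, delta) pair is allowable
    return math.tanh(kappa * dot(x, z) - delta)


x, z = (1.0, 2.0), (3.0, 4.0)
for kernel in (linear_kernel, polynomial_kernel, gaussian_kernel, sigmoid_kernel):
    print(kernel.__name__, kernel(x, z))
```

Each function takes two observations and returns a single similarity value; the extra parameters are the "few parameters" whose values must be supplied in exchange for not specifying a transformation.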

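In this case, we solve

$$\max_{\alpha}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{N}\alpha_i\alpha_j y_i y_j\,K(\mathbf{x}_i,\mathbf{x}_j)\quad\text{subject to}\quad \alpha_i\ge 0,\ \sum_{i=1}^{N}\alpha_i y_i = 0,$$

to get a set of support vectors $\mathbf{x}_i$ with non-zero weights $\alpha_i$ and a decision function

$$T(\mathbf{x}) = \sum_{i=1}^{N}\alpha_i y_i\,K(\mathbf{x}_i,\mathbf{x}) + b,$$

which is scaled as in the previous post.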


A separating hyperplane does not necessarily exist. The example provided above (the combination of the data and the suggested boundary) is in some sense “nearly minimal” – only one point is mis-labeled by the decision boundary. That is mostly an artifact of the data in itself, in that example. It is not hard to conceive of examples for which the separation is “maximally” violated, so to speak (as in the spiral example below).

At any rate, when a separating hyperplane exists, it is not necessarily unique (see below for an example derived from a modified mortgage default dataset).

Among all the separating hyperplanes, is it possible to find one which would be optimal? What might this look like?

The **maximum-margin hyperplane** is the “best” separating hyperplane in the sense that it is the one at the center of the largest strip which separates the data points cleanly.

The objective is fairly simple to state: maximize the margin to find the optimal hyperplane.

The **support vectors** are those observations which are closest to the margin on either side. Usually the number of support vectors is small, or at the very least, substantially smaller than the number of observations. In this example, the margin is **hard** (meaning that we do not allow for observations within the strip); we will see how to allow for a **soft** margin below.

Let $\mathbf{x}_i$ (vector) represent the explanatory features of each observation, with $y_i=\pm 1$ (scalar) the known classification of each of $N$ observations. The equation of a hyperplane is given by $\mathbf{w}^\top\mathbf{x} + b = 0$, where $\mathbf{w}$ is the **weight vector** and $b$ is the **bias**.

There is an infinite number of ways of expressing that equation; we scale it so that $|\mathbf{w}^\top\mathbf{x}_k + b| = 1$ for any eventual support vector $\mathbf{x}_k$ in the available training set. This choice is known as the **canonical hyperplane**.

From geometry, we know that the distance from any point $\mathbf{z}$ to the canonical hyperplane is

$$\frac{|\mathbf{w}^\top\mathbf{z} + b|}{\|\mathbf{w}\|}.$$

For a support vector $\mathbf{x}_k$, this distance simply becomes $1/\|\mathbf{w}\|$. The **margin** $M$ is twice that distance: $M = 2/\|\mathbf{w}\|$.
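As a quick numerical illustration of the distance and margin formulas, here is a sketch that assumes the hyperplane is written as $\mathbf{w}^\top\mathbf{x} + b = 0$ (the sign convention for the bias is an assumption) and uses made-up values for $\mathbf{w}$, $b$, and the point.

```python
import math


def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))


def distance_to_hyperplane(z, w, b):
    """Distance from the point z to the hyperplane w.x + b = 0."""
    return abs(sum(wi * zi for wi, zi in zip(w, z)) + b) / norm(w)


def margin(w):
    """Width of the strip for a canonical hyperplane: twice 1/||w||."""
    return 2.0 / norm(w)


w, b = (3.0, 4.0), -5.0  # made-up weight vector and bias, ||w|| = 5
d = distance_to_hyperplane((3.0, 4.0), w, b)
m = margin(w)
print(d, m)
```

Shrinking $\|\mathbf{w}\|$ widens the margin, which is why maximizing the margin amounts to minimizing the norm of the weight vector.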

Finding the parameters $(\mathbf{w}, b)$ which maximize $M = 2/\|\mathbf{w}\|$ is equivalent to solving the following quadratic optimization problem:

$$\min_{\mathbf{w},\,b}\ \frac{1}{2}\|\mathbf{w}\|^2\quad\text{subject to}\quad y_i(\mathbf{w}^\top\mathbf{x}_i + b)\ge 1,\quad i=1,\dots,N.$$

This constrained quadratic optimization problem (QP) can be solved with the use of Lagrange multipliers. While this **primal** formulation does (in theory) yield the required parameter vector $\mathbf{w}$, there can be computational hazards when the dimension of the feature space is high. Furthermore, it is not obvious from the form of the QP how to identify the support vectors before real data is plugged in.

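According to the Representer Theorem, we can find a set of weights $\alpha_i\ge 0$, $i=1,\dots,N$, for which

$$\mathbf{w} = \sum_{i=1}^{N}\alpha_i y_i\mathbf{x}_i\quad\text{and}\quad\sum_{i=1}^{N}\alpha_i y_i = 0.$$

Technically speaking we do not need to invoke the Representer Theorem in this case (the **linear kernel** case) as the weights $\alpha_i$ are simply the Lagrange multipliers of the QP and the relations can be derived directly (this does not apply in the general case, however).

Substituting the relations in the primal formulation yields the **dual** formulation, which can be shown to be equivalent to the primal formulation while by-passing the computational issues:

$$\max_{\alpha}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{N}\alpha_i\alpha_j y_i y_j\,\mathbf{x}_i^\top\mathbf{x}_j\quad\text{subject to}\quad \alpha_i\ge 0,\ \sum_{i=1}^{N}\alpha_i y_i = 0.$$

In the dual formulation, the **support vectors** are exactly those $\mathbf{x}_i$ with non-zero weights $\alpha_i$. The corresponding **decision function** is defined by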
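The link between the two formulations can be checked by hand on a toy problem. The sketch below assumes the standard relations $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ and $\sum_i \alpha_i y_i = 0$, with the hyperplane written as $\mathbf{w}^\top\mathbf{x} + b = 0$. The two-point dataset is made up, and its dual weights (0.25 each) were worked out by hand rather than produced by a solver.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


# Toy dataset: one observation per class, both are support vectors.
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [+1, -1]
alpha = [0.25, 0.25]  # hand-derived dual weights for this toy case

# Weight vector recovered from the dual weights: w = sum_i alpha_i * y_i * x_i
w = [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(2)]

# Bias from a support vector, using y_k * (w.x_k + b) = 1, i.e. b = y_k - w.x_k
b = y[0] - dot(w, X[0])

primal_value = 0.5 * dot(w, w)
dual_value = sum(alpha) - 0.5 * sum(
    alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
    for i in range(2)
    for j in range(2)
)
print(w, b, primal_value, dual_value)
```

The primal and dual objectives agree at the optimum, and both constraints $y_i(\mathbf{w}^\top\mathbf{x}_i + b)\ge 1$ hold with equality, as expected for support vectors.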

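$$T(\mathbf{x}) = \sum_{i=1}^{N}\alpha_i y_i\,\mathbf{x}_i^\top\mathbf{x} + b,$$

which is defined for any observation $\mathbf{x}$ and scaled so that $T(\mathbf{x}_k) = y_k = \pm 1$ for each support vector $\mathbf{x}_k$. Class assignment for $\mathbf{x}$ proceeds as follows:

- if $T(\mathbf{x}) < 0$, then we assign $\mathbf{x}$ to the class $y=-1$;
- if $T(\mathbf{x}) > 0$, then we assign $\mathbf{x}$ to the class $y=+1$.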
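The assignment rule can be sketched as follows, assuming a decision function of the form $T(\mathbf{x}) = \sum_i \alpha_i y_i\,\mathbf{x}_i^\top\mathbf{x} + b$. The support vectors, weights, and bias come from a made-up two-point toy problem, and returning 0 for a point lying exactly on the boundary is a choice made here, not something the rule above specifies.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


# Made-up toy model: two support vectors with hand-derived weights.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [+1, -1]
alphas = [0.25, 0.25]
bias = 0.0


def decision_function(x):
    """T(x) = sum_i alpha_i * y_i * (x_i . x) + b over the support vectors."""
    return sum(
        a * yi * dot(sv, x) for a, yi, sv in zip(alphas, labels, support_vectors)
    ) + bias


def assign_class(x):
    t = decision_function(x)
    if t > 0:
        return +1
    if t < 0:
        return -1
    return 0  # exactly on the boundary: left undecided in this sketch


print(assign_class((2.0, 0.0)), assign_class((-3.0, 1.0)))
```

Only the support vectors enter the sum, so predictions remain cheap even when the training set is large.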

This, of course, represents an idealized situation: linearly separable data with no tolerance for misclassification errors. A number of variants have been presented to handle more realistic scenarios.


Throughout, Tufte’s comments and insights are shown in block quotes; our own comments appear as regular text. Whenever possible, examples are sourced and linked to external sources, which may provide more context and detailed information.

Why do we display evidence in a report, in a newspaper article, online? What’s the fundamental reason or goal of our charts and graphs? Tufte suggests that we present evidence to assist our thinking processes (p. 137). In this regard, his principles are universal – a strong argument can be made that they are dependent neither on technology nor culture. Reasoning and communicating our thoughts are intertwined with our lives in a causal and dynamic multivariate Universe; whatever cognitive skills allow us to live and evolve can also be brought to bear on the presentation of evidence.

Tufte also highlights a particular symmetry to visual displays of evidence, that consumers should be seeking exactly what producers should be providing, namely

- meaningful comparisons;
- causal networks and underlying structure;
- multivariate links;
- integrated and relevant data;
- honest documentation, and
- primary focus on content.

Physical science displays tend to be less descriptive and verbal, more visual and quantitative; these trends tend to be reversed when dealing with evidence displays about human behaviour; in spite of this, Tufte argues that his principles of analytical design can also be applied to social science and medicine. To demonstrate the universality of his principles, Tufte describes in detail how they are applied in a visual display by Charles Joseph Minard (see figure below).

His lengthy analysis of the image is well worth the read (pp. 122–139) – it will not be repeated here (I must confess that the chart leaves me somewhat … unexcited).

Rather, I will illustrate the principles with the help of the following image from the Gapminder Foundation.

It is a bubble chart that plots the 2012 life expectancy, adjusted income per person in USD (log-scaled), population, and continental membership for 193 UN members and 5 other countries, using the latest available data (2011). A high-resolution version of the image can be found on the Gapminder website.

First Principle

Show comparisons, contrasts, differences. (p. 127)

The fundamental analytical act in statistical reasoning is to answer the question “Compared with what?” Whether we are evaluating changes over space or time, searching big data bases, adjusting and controlling for variables, designing experiments, specifying multiple regressions, or doing just about any kind of evidence-based reasoning, the essential point is to make intelligent and appropriate comparisons [emphasis added]. Thus, visual displays […] should show comparisons. (p. 127)

Comparisons come in varied flavours: for instance, one could compare a

- unit at a given time against the same unit at a later time;
- unit’s component against another of its components;
- unit against another unit,

or any number of combinations of these flavours. Not every comparison will turn out to be insightful, but avoiding comparisons altogether is equivalent to producing displays built from a single datum, and… well, what’s the point, then?

Where to begin? First, note that each bubble represents a different country, and that the location of each bubble’s centre is a precise point corresponding to the country’s life expectancy and its GDP per capita. The size of the bubble is correlated with the country’s population, while its colour is linked to continental membership.

The chart’s compass provides a handy tool for comparison:

- a bubble to the right (resp. the left) represents a wealthier (resp. poorer) country;
- a bubble above (resp. below) represents a healthier (resp. sicker) country.

For instance, a comparison between Japan, Germany and the USA shows that Japan is healthier than Germany, which is itself healthier than the USA (as determined by life expectancy), while the USA is wealthier than Germany, which is itself wealthier than Japan (as determined by GDP per capita) (see below).

While there is no reason to expect that it should be the case, it is nevertheless possible for two countries to have roughly the same health and the same wealth: consider Indonesia and Fiji, or India and Tuvalu, for instance (see below). In each pair, the centres of both bubbles overlap: any difference in the data must be found in the bubbles’ area or their colour.

Countries can also be compared against world values for life expectancy and GDP per capita (the comparisons for average continental membership or population are less obviously meaningful, in this case). The world’s mean life expectancy and income per person are traced in light blue. Wealthier, healthier, poorer, and sicker are relative terms, but we can also use them to classify the world’s nations with respect to these mean values.
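The classification with respect to the world means is a simple rule, sketched below. The function name and all the numerical values are hypothetical placeholders (they are not read off the Gapminder chart), and treating a value equal to the mean as "wealthier" or "healthier" is an arbitrary choice.

```python
def quadrant(life_expectancy, income, mean_life, mean_income):
    """Classify a country relative to the world means (ties go to the upper side)."""
    wealth = "wealthier" if income >= mean_income else "poorer"
    health = "healthier" if life_expectancy >= mean_life else "sicker"
    return wealth, health


# Placeholder world means and country values, for illustration only.
MEAN_LIFE, MEAN_INCOME = 70.0, 10_000.0
country_a = quadrant(80.0, 40_000.0, MEAN_LIFE, MEAN_INCOME)
country_b = quadrant(55.0, 20_000.0, MEAN_LIFE, MEAN_INCOME)
print(country_a, country_b)
```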

Second Principle

Show causality, mechanism, explanation, systematic structure. (p. 128)

Yet often the reason that we examine evidence is to understand causality, mechanism, dynamics, process, or systematic structure [emphasis added]. Scientific research involves causal thinking, for Nature’s laws are causal laws. […] Reasoning about reforms and making decisions also demands causal logic. To produce the desired effects, we need to know about and govern the causes; thus “policy-thinking is and must be causality-thinking”. (p. 128)

Simply collecting data may provoke thoughts about cause and effect: measurements are inherently comparative, and comparisons promptly lead to reasoning about various sources of differences and variability. (p. 128)

In essence, this is the core principle behind data visualization: the display needs to explain *something*, it needs to provide links between cause and effect, it needs to tell a meaningful story.

If the visualization could be removed at a later stage without changing the underlying message, then that chart should not have been included in the first place, no matter how pretty and modern it looks, nor how costly it was to produce.

At a quick glance, the relation between the log of the income per person and life expectancy seems to be increasing roughly linearly. The exact parameter values are not known (and I cannot estimate them analytically as I do not have access to the data), but an approximate line-of-best-fit has been added to the figure below. Charts with this form have been found in other disciplines before.
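For readers who do have access to the data, a line of best fit on the log scale takes only a few lines. The sketch below is an ordinary least-squares fit of life expectancy against the base-10 logarithm of income; the three data points are invented (and deliberately collinear so that the result is easy to verify), so the fitted parameters say nothing about the actual chart.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented values: income per person (USD) and life expectancy (years).
incomes = [1_000, 10_000, 100_000]
life_expectancies = [55.0, 70.0, 85.0]

slope, intercept = fit_line([math.log10(v) for v in incomes], life_expectancies)
print(f"life expectancy = {slope:.1f} * log10(income) + {intercept:.1f}")
```

In this made-up example, each tenfold increase in income is associated with 15 additional years of life expectancy; with real data the points would scatter around the line.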

The four quadrants created by the world’s life expectancy and its GDP per capita are not all created equal. Naively, we might have expected that each of the quadrants would contain about 25% of the world’s countries (although the large population size of giants like China and India muddles the picture somewhat); however, one quadrant is substantially under-represented (see below). Is it surprising that there should be so few “wealthier” and “sicker” countries? Could it be argued that Russia and Kazakhstan are too near to the separators to really be considered clear-cut members of the quadrant?

In the same vein, when we consider the data visualization as a whole, there seems to be one group of outliers below the main trend, to the right (and possibly one group above the main trend, to the left) which cries out for an explanation: South Africa, for instance, has a relatively high GDP per capita but a low life expectancy (potentially, income disparity between a poor majority and a substantially richer minority might help push the dot to the right, while the lower life expectancy of the majority drives the overall life expectancy to the bottom). Could wars, famines, recessions, and epidemics be recovered from the movement of the bubbles over multiple years?

Third Principle

Show multivariate data; that is, show more than 1 or 2 variables. (p. 130)

Nearly all the interesting worlds (physical, biological, imaginary, human) we seek to understand are inevitably multivariate in nature. (p. 129)

The analysis of cause and effect, initially bivariate, quickly becomes multivariate through such necessary elaborations as the conditions under which the causal relation holds, interaction effects, multiple causes, multiple effects, causal sequences, sources of bias, spurious correlation, sources of measurement error, competing variables, and whether the alleged cause is merely a proxy or a marker variable. (p. 129)

Reasoning about evidence should not be stuck in 2 dimensions, for the world we seek to understand is profoundly multivariate [emphasis added]. (p. 130)

Alert readers may question the ultimate validity of this principle: after all, doesn’t **Occam’s Razor** warn us that “it is futile to do with more things that which can be done with fewer”? Seen in the right light, that seems like a fairly strong admonition to stay away from multivariate analysis.

This interpretation depends, of course, on what it means to “do with fewer”: are we attempting to “do with **fewer**”, or to “**do** with fewer”? If it’s the former, then we can produce any number of univariate and bivariate charts to represent the data (which in itself starts to border on a multivariate display, although the number of such charts can balloon quite quickly), but any significant link between 3 or more variables is unlikely to be shown, which drastically reduces the explanatory power of the charts. If it’s the latter, the difficulty evaporates: we simply retain as many features as necessary to maintain the desired explanatory power.

Only 4 variables are represented in the display, which we could argue just barely qualifies the data as multivariate. The population size seems uncorrelated with both of the axes’ variates, unlike continental membership: there is a clear divide between the West, most of Asia, and Africa (see below). This “clustering” of the world’s nations certainly fits with common wisdom about the state of the planet, which provides some level of validation for the display.

Other variables could also be considered or added, notably the year, allowing for bubble movement: one would expect that life expectancy and GDP per capita have both been increasing over time. The Gapminder Foundation’s online tool can build bubble charts with other variates, leading to interesting inferences and conclusions.

Fourth Principle

Completely integrate words, numbers, images, diagrams. (p. 131)

The evidence doesn’t care what it is – whether word, number, image. In reasoning about substantive problems, what matters entirely is the evidence, not particular modes of evidence [emphasis added]. (p. 130)

Words, numbers, pictures, diagrams, graphics, charts, tables belong together [emphasis added]. Excellent maps, which are the heart and soul of good practices in analytical graphics, routinely integrate words, numbers, line-art, grids, measurement scales. (p. 131)

Tables of data might be thought of as paragraphs of numbers, tightly integrated with the text for convenience of reading rather than segregated at the back of a report. […] Images and tables used in public presentations should be annotated with words explaining what is going on. In exploratory data analysis, however, the integration of data needs to be thought through. Perhaps the number of data points may stand alone for a while, so we can get a clean look at the data, although techniques of layering and separation may simultaneously allow a clean look as well as bringing other information into the scene. (p. 131)

When authors and researchers select a single specific method or mode of information during the inquiries, the focus switches from “can we explain what’s happening?” to “can the method we selected explain what’s happening?”. There is an art to selecting methods, and experience and expertise can often suggest relevant methods, but “when all one has is a hammer, everything looks like a nail”, as the saying goes: the goal should be to use whatever (and all) evidence is necessary to explain “what’s happening”. If that goal is met, it makes no difference which modes of evidence were used.

The various details attached to the chart (such as country names, font sizes, axes scale, grid, and world landmarks) provide substantial benefits when it comes to consuming the display. They may become lost in the background, with the effect that they are taken for granted. Compare the display obtained from (nearly) the same data, but without integration of evidence (see below).

Not nearly as compelling, eh? What’s missing?

Fifth Principle

Thoroughly describe the evidence. Provide a detailed title, indicate the authors and sponsors, document the data sources, show complete measurement scales, point out relevant issues. (p. 133)

The credibility of an evidence presentation depends significantly on the quality and integrity of the authors and their data sources. Documentation is an essential mechanism of quality control for displays of evidence. Thus authors must be named, sponsors revealed, their interests and agenda unveiled, sources described, scales labeled, details enumerated [emphasis added]. (p. 132)

Depending on the context, questions and items to address could include:

- What is the title/subject of the visualization?
- Who did the analysis?
- Who created the visualization? (if distinct from analyst(s))
- When was the visualization published?
- Which version of the visualization is rendered here?
- Where did the underlying data come from?
- Who sponsored the display?
- What assumptions were made during data processing and clean-up?
- What colour schemes, legends, scales are in use in the chart?

It’s not obvious whether all this information can fit inside a single chart in some cases. But, keeping in mind the *Principle of Integration of Evidence*, charts should not be presented in isolation in the first place, and some of the relevant information can be provided in the text, on the webpage, or in an accompanying document. This is especially important when it comes to discussing the methodological assumptions used for data collection, processing, and analysis. An honest assessment may require sizable amounts of text, and it may not be reasonable to include that information with the display (in that case, a link to the accompanying documentation should be provided).

Publicly attributed authorship indicates to readers that someone is taking responsibility for the analysis; conversely, the absence of names signals an evasion of responsibility. […] People do things, not agencies, bureaus, departments, divisions [emphasis added]. (pp. 132–133)

**What is the title/subject of the visualization?**

The health and wealth of nations in 2012, using the latest available data (2011).

**Who did the analysis? Who sponsored the display? Who created the visualization?**

The analysis was done by the Gapminder Foundation; the map layout was created by Paulo Fausone. No data regarding the sponsor is found on the chart or in the documentation. The relevant Wikipedia article states that “the Gapminder Foundation is a non-profit venture registered in Stockholm, Sweden, that promotes sustainable global development and achievement of the United Nations Millennium Development Goals by increased use and understanding of statistics and other information about social, economic and environmental development at local, national and global levels.” It seems plausible that there is no external sponsor, but that is not certain.

**When was the visualization published? Which version of the visualization is rendered here?**

The 11th version of this chart was published in September 2012. It is the latest available version as of October 2016.

**Where did the underlying data come from? What assumptions were made during data processing and clean-up?**

Typically, the work that goes into preparing the data is swept under the carpet in favour of the visualization itself; there is no explicit source of data on this chart, for instance. However, there is a URL in the legend box that leads to detailed information.

For most countries, life expectancy data was collected from the Human Mortality database, the UN Population Division World Population Prospects, files from historian James C. Riley, the Human Life Table database, data from diverse national statistical agencies, the CIA World Factbook, the World Bank, and the South Sudan National Bureau of Statistics.

Benchmark 2005 GDP data was derived via regression analysis from International Comparison Program data for 144 countries, and extended to other jurisdictions using another regression against data from the UN Statistical Division, Maddison Online, the CIA World Factbook, and estimates from the World Bank. The 2012 values were then derived from the 2005 benchmarks using long-term growth rate estimates from Maddison Online, Barro & Ursua, the United Nations Statistical Division, the Penn World Table (mark 6.2), the International Monetary Fund’s World Economic Outlook database, the World Development Indicators, Eurostat, and national statistical offices or some other specific publications.

Population estimates were collated from the United Nations Population Division World Population Prospects, Maddison Online, Mitchell’s International Historical Statistics, the United Nations Statistical Division, the US Census Bureau, national sources, undocumented sources, and “guesstimates”. Exact figures for countries with a population below 3 million inhabitants were not needed as this marked the lower end of the chart resolution.

**What colour schemes, legends, scales are in use in the chart?**

The Legend Inset is fairly comprehensive:

Perhaps the last item of note is that the scale of the axes differs: life expectancy is measured linearly, whereas GDP per capita is measured on a logarithmic scale.

Sixth Principle

Analytical presentations ultimately stand or fall depending on the quality, relevance, and integrity of their content. (p. 136)

The most effective way to improve a presentation is to get better content [emphasis added] […] design devices and gimmicks cannot salvage failed content. (p. 136)

The first questions in constructing analytical displays are not “How can this presentation use the color purple?” Not “How large must the logotype be?” Not “How can the presentation use the Interactive Virtual Cyberspace Protocol Display Technology?” Not decoration, not production technology. The first question is “What are the content-reasoning tasks that this display is supposed to help with?” (p. 136)

A compelling narrative, which may not be the one that was initially expected to emerge from a solid analysis of sound data, is the name of the game. Simply speaking, the visual display should assist in explaining the situation at hand and in answering the original questions.

How would we answer the following questions:

- Do we observe similar patterns every year?
- Does the shape of the relationship between life expectancy and log-GDP per capita vary continuously over time?
- Do countries ever migrate large distances in the display over short periods?
- Do exceptional events affect all countries similarly?
- What are the effects of secession or annexation?

The 2012 Health and Wealth of Nations data represent a single datum in the general space of data visualizations; better content means getting data for more than just 2012.

In an ideal setting, the classes are clearly separable (using an appropriate classifier). Of course, most data in the wild cannot be separated clearly: "best" in this case refers to minimizing the cost of making an error in "messy" cases. SVMs are especially well-suited to these problems, in part due to their flexibility and ability to handle non-linear decision surfaces; this contributes to their current popularity.

In this article, we will briefly discuss the main ideas behind various SVM methods, as well as some of their applications and limitations. (As with all machine learning methods, SVMs do not exist in a vacuum. Enough data has to first be collected, cleaned-up, and processed. Ethical issues must be considered. In all likelihood, dimension reduction and scaling will be required to get the most out of the algorithms. The models must be trained, tested, and validated in order to make predictions that avoid over-fitting. These will be the focus of other articles.)

Consider an artificial dataset which contains information about three features: the age of a customer, their savings balance, and whether or not they defaulted on a mortgage loan (this example is derived from Fawcett & Provost's *Data Science for Business*); the mortgage default variable is categorical (dot for default, plus sign for no default), the explanatory variables are numerical.

A cursory look at the data suggests that (in this dataset, at least), younger borrowers with smaller savings tend to default on their mortgages, whereas that is not the case for older borrowers with larger savings, but there is some overlap.

Let us forget about the unrealistic nature of this small dataset for the moment – these variables could be replaced by height, weight and reported gender, for instance – and concentrate on the classifying task at hand. Can we come up with a rule (or a set of rules) that would help us make a prediction for a set of new observations, assuming that the data at hand is somehow representative of the general situation?

One possible **decision tree** is shown below.

The decision rule derived from this tree is simple:

- borrowers with a savings balance below 50,000$ who were 50 years old or younger defaulted on their mortgage in 100% of the cases;
- borrowers with a savings balance above 50,000$ who were 45 years old or older defaulted on their mortgage in 0% of the cases;
- borrowers with a savings balance below 50,000$ who were 50 years old or older defaulted on their mortgage in 33% of the cases, and
- borrowers with a savings balance above 50,000$ who were 45 years old or younger defaulted on their mortgage in 57% of the cases.
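Written as code, the decision rule is a pair of nested threshold tests. The sketch below returns the observed default rate in each leaf; the function name is hypothetical, and the handling of borderline cases (a balance of exactly 50,000$, or an age of exactly 45 or 50, where the rules above overlap) is an assumption.

```python
def default_rate(savings, age):
    """Observed default rate in the leaf that (savings, age) falls into.

    Assumed tie-breaking: a balance of exactly 50,000 counts as 'above',
    age 50 goes to the 'younger' leaf on the low-savings side, and
    age 45 goes to the 'older' leaf on the high-savings side.
    """
    if savings < 50_000:
        return 1.00 if age <= 50 else 0.33
    return 0.00 if age >= 45 else 0.57


for savings, age in [(20_000, 30), (20_000, 65), (80_000, 60), (80_000, 30)]:
    print(savings, age, default_rate(savings, age))
```

Every split compares a single feature against a threshold, which is why the resulting boundaries are always parallel to the axes.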

Two of the leaves are pure (meaning that all instances belong to the same category within their respective quadrant), but the possibility of making a mistake (of misclassifying the data) is somewhat high in the other two quadrants.

Can we do better? Without access to more features (variables), this is about as effective a classifier as a decision tree can get (other decision trees exist, but they are marginally more effective, at best). For this particular dataset, separating curves which are parallel to the axes are not ideal. To be sure, we could create an intricate decision tree with a large number of separating lines, but that number would be greater than , where is the number of features, but that is undesirable for a well-fitted tree.

It is easy to draw a **decision curve** which improves on the effectiveness of the decision tree:

The decision rule derived from this linear boundary is simpler than the decision tree rule:

- borrowers for whom the pair (balance, age) falls below the decision boundary defaulted on their mortgage in 100% of the cases, while
- borrowers for whom the pair (balance, age) falls above the decision boundary defaulted on their mortgage in 7% of the cases.

A single borrower is mis-classified by the simpler decision rule; for this dataset, the decision boundary method is a better classifier than the decision tree (using any reasonable metric as a measuring stick). We could easily improve the accuracy to 100% using non-linear curves, but aiming for perfect accuracy at the training level can easily lead to serious over-fitting issues.

Building on the decision boundary approach, SVMs provide a protocol to train classifiers, using non-linear hyper-surfaces as required. In contrast with decision trees and other "classical" classifiers, SVMs are trained on a small subset of the available observations, which can prove useful when dealing with large datasets.

Among other problems, SVM methods have been applied to:

- text categorization
- image classification
- hand-writing recognition
- smoothing and regression
- outlier detection

- S.Few [2012], *Show Me the Numbers: Designing Tables and Graphs to Enlighten*, Amazon.ca
- S.Few [2009], *Now You See It: Simple Visualization Techniques for Quantitative Analysis*, Amazon.ca
- D.M.Wong [2013], *The Wall Street Journal Guide to Information Graphics: The Do's And Don'ts Of Presenting Data Facts And Figures*, Amazon.ca
- C.Nussbaumer Knaflic [2015], *Storytelling with Data: A Data Visualization Guide for Business Professionals*, Amazon.ca
- N.Yau [2011], *Visualize This: The FlowingData Guide to Design, Visualization, and Statistics*, Amazon.ca
- N.Yau [2013], *Data Points: Visualization That Means Something*, Amazon.ca
- E.R.Tufte [2006], *Beautiful Evidence*, Amazon.ca
- E.R.Tufte [2001], *The Visual Display of Quantitative Information*, (2nd ed.), Amazon.ca
- E.R.Tufte [1990], *Envisioning Information*, Amazon.ca
- E.R.Tufte [1997], *Visual Explanations: Images and Quantities, Evidence and Narrative*, Amazon.ca
- D.McCandless [2012], *Visual Miscellaneum, The Revised And Updated: A Colorful Guide to the World's Most Consequential Trivia*, Amazon.ca
- D.McCandless [2014], *Knowledge Is Beautiful: A Visual Miscellaneum of Compelling Information*, Amazon.ca
- N.Iliinsky, J.Steele [2011], *Designing Data Visualizations: Representing Informational Relationships*, Amazon.ca
- H.Wainer [2009], *Picturing the Uncertain World: How to Understand, Communicate, and Control Uncertainty Through Graphical Display*, Amazon.ca
- W.Lefèvre, J.Renn, U.Schoepflin (eds.) [2003], *The Power of Images in Early Modern Science*, Amazon.ca
- P.Murrell [2006], *R Graphics*, available online
- J.Leek [2015], *The Elements of Data Analytic Style*, leanpub
- J.Avirgan [2016], *The Map That May Unmask Banksy*, FiveThirtyEight
- A.Bycoffe [2016], *The Endorsement Primary*, FiveThirtyEight
- *2016 National Primary Polls*, FiveThirtyEight
- N.Yau [2016], *Data USA makes government data easier to explore*, FlowingData
- E.Lamb [2016], *It Doesn’t Add Up*
- E.Lamb [2012], *Abandoning Algebra Is Not the Answer*, Scientific American
- E.Lamb [2016], *Andrew Hacker and the Case of the Missing Trigonometry Question*, Scientific American
- N.Yau [2016], *Data Proofer automates the data checking process*, FlowingData
- K.Dutton, D.Abrams [2016], *What Research Says about Defeating Terrorism*, Scientific American
- C.Aschwanden [2016], *Failure Is Moving Science Forward*, FiveThirtyEight
- R.Matin, R.Azizi [2015], *DEA with Missing Data: An Interval Data Assignment Approach*, JOIE
- R.Wasserstein, N.Lazar [2016], *The ASA's statement on p-values: context, process, and purpose*, The American Statistician
- T.Siegfried [2016], *Experts issue warning on problems with P values*, Science News
- R.Arthur [2016], *We Now Have Algorithms To Predict Police Misconduct*, FiveThirtyEight
- N.Yau [2016], *What I Use to Visualize Data*, FlowingData
- C.Aschwanden [2016], *Statisticians Found One Thing They Can Agree On: It’s Time To Stop Misusing P-Values*, FiveThirtyEight
- J.Honaker, G.King, M.Blackwell,
*Amelia II: A Program for Missing Data*, Gary King - M.Blackwell, J.Honaker, G.King
*A Unified Approach to Measurement Error and Missing Data: Overview and Applications*, - Y.Zhou, D.Wilkinson, R. Schreiber, R.Pan,
*Large-scale Parallel Collaborative Filtering for the Netflix Prize*, PDF - N.Yau [2016],
*Vega-Lite for quick online charts*, Flowing Data - B. D. CRAVEN, S. M. N. ISLAM [2005],
*Operations Research Methods*, Flowing Data - M.Panza, D.Napoletani, D.Struppa [2010],
*Agnostic Science. Towards a Philosophy of Data Analysis*, HAL - C.Paciorek [2014],
*An Introduction to Using Distributed File Systems and MapReduce through Spark*, - J.Cranshaw, R.Schwartz, J.Hong, N.Sadeh [2012],
*The Livehoods Project: Utilizing Social Media to Understand the Dynamics of a City*, *Life expectancy at birth*, Gapminder*Gapminder World 2012 in pdf*, Gapminder- K.Hsu, N.Pathak, J.Srivastava, G.Tschida, E.Bjorklund [2015],
*Data Mining Based Tax Audit Selection: A Case Study of a Pilot Project at the Minnesota Department of Revenue*, - N.Yau [2010],
*Think Like a Statistician – Without the Math*, Flowing Data - N.Lorang [2016],
*Data scientists mostly just do arithmetic and that’s a good thing*, - N.Yau [2010],
*Predictive policing*, - University of Minnesota Duluth
*Lectures*, *Artistic License – Statistics*, Tvtropes*23 Design, Data Visualization and Presentation Quotes from Edward Tufte*, Tvtropes- J. DeCoster [2001],
*Transforming and Restructuring Data*, - N.Yau [2016],
*Role of empathy in visualization*, Flowing Data *Center for Big Data Ethics, Law, and Policy*, Data Science Institute- A.Barry-Jester [2016],
*What Went Wrong In Flint*, FiveThirtyEight - J.Avirgan [2016],
*A History Of Data In American Politics (Part 2): Obama 2008 To The Present*, FiveThirtyEight - C.Bialik [2016],
*Why Betting Data Alone Can’t Identify Match Fixers In Tennis*, FiveThirtyEight - F.Jacobs [2016],
*A World Map of Economic Growth*, Big Think - W.Briggs [2016],
*Machine Learning, Big Data, Deep Learning, Data Mining, Statistics, Decision & Risk Analysis, Probability, Fuzzy Logic FAQ*, WILLIAM M. BRIGGS *Imagine Storing All the Worlds Archives in a Box of Seeds*, New Scientist- S.Few, [2011],
*The Chartjunk Debate*, - H.Enten [2015],
*Harry’s Guide To 2016 Election Polls*, FiveThirtyEight - A.Gefter [2015],
*A Private View of Quantum Reality*, Quanta Magazine - N.Yau [2015],
*R growth on StackOverflow reigns supreme*, Flowing Data - C.Bialik [2015],
*As A Major Retraction Shows, We’re All Vulnerable To Faked Data*, FiveThirtyEight *A European defense ministry revamps its logistics strategy and operations*, McKinsey&Company- JF.Portarrieu [2013],
*City of Toulouse*, IBM - K.Bonnes [2014],
*Predictive Analytics for Supply Chains: a Systematic Literature Review*, - J. Bencina [2011],
*Fuzzy Decision Trees as a Decision-making Framework in the Public Sector*, *The role of quantitative techniques in decision making process*, Essay UK- R.Larson [2002],
*Public Sector Operations Research: A Personal Journey*, *Real-Time Enterprise Stories*, Real Time Research REPORTS*Decision Science for Housing and Community Development: An interview with co-author Michael Johnson*, Statistics Views*City of Almere: Statistical analysis and predictive analytics allocate resources to citizens while planning for growth*, IBM*Woonbedrijf improves tenants’ quality of living*, IBM Software- M.Rockwell [2015],
*DHS to expedite data scans for foreign fighters*, FCW - M.Hansen, A.Stermberg [2015],
*NOAA’s Data Heads for the Clouds*, the White House - D.Major [2015],
*Open data, analytics key to Police Data Initiative*, GCN - L.Cornish [2015],
*Data in action: The role of data in humanitarian disasters*, Devex *Statisticians using social media to track foodborne illness and improve disaster response*, PHYS.ORG- Z.Mendelson [2015],
*Cities Can Use Big Data to Find Out What They Really Don’t Know*, Next City - N.Bishop [2015],
*Jen Q. Public: Governments can win the improper payment chase with analytics*, IBM - J.Shueh [2014],
*Minneapolis Launches Citywide Analytics Platform*, Government Technology - N.Bishop [2015],
*Public Sector News: How data and analytics promise a different future*, IBM - N.Bishop [2015],
*Public Sector News: The question of citizen's privacy*, IBM - N.Bishop [2015],
*Public Sector News: How governments can unleash the power of analytics*, IBM - B.Cortez-Neavel [2015],
*Data Analytics, Prevention Efforts Could Drive Down Child Deaths*, The Chronicle of Social Change *The Benefits of Analytics in the Public Sector*, JMP- H.Nicol
*Local Government and digital services: options for improving local services*, Public Service Transformation Network - S.Bateman [2014]
*The Data Science in Government programme: using data in new ways to improve what government does*, GOV.UK *Big Data for Development: Technocratic & Democratic Considerations*, K- A.Syvajarvi, J.Stenvall
*Data mining in public and private sectors: organizational and government applications*, Google Books - M.Gasco [2012]
*Proceedings of the 12th European Conference on e-Government*, Google Books - Y.Zhao
*R and Data Mining: Examples and Case Studies*, Google Books - G. K. GUPTA
*Introduction to data mining with case studies*, Google Books - P.Putten, G.Melli, B.Kitts
*Data Mining Case Studies*, - M.Nguyen-Nielsen, et.al,
*Existing data sources for clinical epidemiology: Danish registries for studies of medical genetic diseases*, - N.Yau [2016]
*Using information graphics to calibrate bias*, Flowing Data *Accounting for Errors with a Non-Normal Distribution*, Engineering Statistics Handbook*Opinion Research Poll*, CNN Opinion Research- G.Dvorsky [2014]
*Computers are providing solutions to math problems that we can't check*, iO9 *Missing-data imputation*, Stat Columbia- P. Allison [2012]
*Modern Methods for Missing Data*, Amstat - C.Wild [2012]
*The Wilcoxon Rank-Sum Test*, Stat Auckland - A.Pan, et.al [2013]
*Walnut Consumption Is Associated with Lower Risk of Type 2 Diabetes in Women*, The Journal of Nutrition - E.Inglis-Arkell [2012]
*Why the Exact Same Lottery Numbers Came Up Twice in One Week*, iO9 - H.Nolan [2014]
*Exonerations Are on the Rise. Justice Is Not.*, GAWKER - S.Nieuwenhuis, B.Forstmann, E.Wagenmakers [2011]
*Erroneous analyses of interactions in neuroscience: a problem of significance*, Nature NeuroScience *Significant*, Explain XKCD*Log Scale*, Explain XKCD- D.Hand [2014]
*Math Explains Likely Long Shots, Miracles and Winning the Lottery*, Scientific American - A.Koo [2013]
*A Decade After Moneyball, Have The A's Found A New Market Inefficiency?*, Regressing - M.Enserink [2012]
*Fraud Detection Method Called Credible But Used Like an 'Instrument of Medieval Torture'*, Science - R.Harder [2010]
*How To Generate Your Own Benford’s Law Numbers*, Think Harder - R.Nuzzo [2014]
*Scientific method: Statistical errors*, nature.com - I. JP [2014]
*Why most published research findings are false*, NCBI - D.Stapel, S.Lindenberg [2011]
*Coping with Chaos: How Disordered Contexts Promote Stereotyping and Discrimination*, Science - E.Callaway [2011]
*Report finds massive fraud at Dutch universities*, nature.com - E.Yong [2012]
*The data detective*, nature.com - E.Yong [2012]
*Replication studies: Bad copy*, nature.com *This Website Exposes a Scientific and Medical Cover Up*, nature.com- J.Walthoe
*This Website Exposes a Scientific and Medical Cover Up*, nature.com - J.Walthoe
*Looking out for number one*, +plus - A.Frazier et.al [2013]
*Prospective Study of Peripregnancy Consumption of Peanuts or Tree Nuts by Mothers and the Risk of Peanut or Tree Nut Allergy in Their Offspring*, JAMA Pediatric - R.Shapiro
*Prospective Study of Peripregnancy Consumption of Peanuts or Tree Nuts by Mothers and the Risk of Peanut or Tree Nut Allergy in Their Offspring*, JAMA Pediatric - J.Dempsey
*Our Army: Soldiers, Politics, and American Civil-Military Relations*, Princeton Press - K.Button et.al [2013]
*Power failure: why small sample size undermines the reliability of neuroscience*, nature.com - Public Health England [2014]
*Measles: guidance, data and analysis*, GOV.UK *The statisticians at Fox News use classic and novel graphical techniques to lead with data*, Simply Statistics- N.Yau [2011]
*Open thread: Can you spot the wrongness in this tax graph?*, Flowing Data - A. Hart
*Lies, damn lies, and the 'Y' axis*, Washington Post *A Guide for the statistically perplexed*, Polling*Lies, Damned Lies, and Statistics*, tvtropes*A Little Statistics is a Dangerous Thing*, TheNib- E.Inglis-Arkell [2014]
*The night the Gambler's Fallacy lost people millions*, iO9 - E.Inglis-Arkell [2014]
*Statistics professor challenges midwives' math on home birth safety*, iO9 - M.Cheyney, et.al [2014]
*Outcomes of Care for 16,924 Planned Home Births in the United States: The Midwives Alliance of North America Statistics Project, 2004 to 2009*, Wiley Online Library - R.Misra [2014]
*One graph explaining why you should always order a larger pizza*, iO9 - P.Clarke [2014]
*Title IX's Other Effects: Do Sports Make Women Less Religious?*, Regressing - B.Barnwell [2014]
*Bridging the Analytics Gap*, Grantland - K.Wagner [2014]
*Two Days At Sloan: How Sports Analytics Got Lost In The Fog*, Regressing - M.Bruenig [2014]
*America's Class System Across The Life Cycle*, Demos - G.Bluestone [2014]
*Casino Says World-Famous Gambler Cheated It Out of $10 Million*, GAWKER - R.Gonzalez [2014]
*Our New Favorite Website: Spurious Correlations*, iO9 - E.Inglis-Arkell [2014]
*One Mistake Fooled an Entire Nation About Who Would Be President*, iO9 - N.Yau [2014]
*Military infographic fascination*, iO9 - J.Raff [2014]
*How to Read and Understand a Scientific Paper: A Step-by-Step Guide for Non-Scientists*, Huffpost Science - J.Lepore [2014]
*The Disruption Machine*, The New Yorker - N.Yau [2014]
*Detailed UK census data browser*, Flowing Data - J.Pinto da Costa, L. Roque [2006]
*Limit Distribution for the Weighted Rank Correlation Coefficient*, REVSTAT - A.Weinstein [2014]
*Adam Weinstein’s Discussions*, GAWKER - D.Thompson [2014]
*The Misguided Freakout About Basement-Dwelling Millennials*, The Atlantic - R.Gonzalez [2014]
*Statistical Proof That Lionel Messi Is the Best Soccer Player On Earth*, iO9 - D.Mersereau [2014]
*Why Is a 30% Chance of Rain Different from a 30% Risk of Tornadoes?*, The Vane - S.Wolfram [2013]
*Data Science of the Facebook World*, Stephen Wolfram - B.Fung [2012]
*The Global Geography of HIV: 20 Years of Change—in 1 GIF*, The Atlantic - H.Brady [2013]
*Watch the Country Get Fatter in One Animated Map*, Slate - R.Gonzalez [2014]
*U.S. Remains Key Growth Market for Cigarettes, Despite Graphs Like This*, iO9 - A.Newitz [2014]
*Can Network Theory Help Explain Epic Mythology?*, iO9 - Hawkingdo [2014]
*I Solved Gerrymandering … sorta!*, GERRYMANDERING - N.Silver [2014]
*Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past?*, FiveThirtyEight - B.Morris [2014]
*Billion-Dollar Billy Beane*, FiveThirtyEight - E.Lamb [2014]
*British Rail's Shocking Defiance of Standard Metrics*, Scientific American - N.Yau [2014]
*How well we don’t understand probability*, Flowing Data - N.Silver [2010]
*BREAKING: Daily Kos to Sue Research 2000 for Fraud*, FiveThirtyEight - M.Strauss [2014]
*Statistician Creates Model To Predict What's Next In Game Of Thrones*, iO9 - A.Burneko [2014]
*Numbers One Through 12, Ranked*, The Concourse - G.Dvorsky [2014]
*Why The Sudden Surge Of Retractions At Nature Magazine?*, iO9 - S.Burtch [2014]
*Hockey Analytics: Why They Help And What's Coming Next*, SB Nation - R.Gonzalez [2014]
*How Much Would It Cost To Raise A Kid Like Calvin from Calvin and Hobbes?*, iO9 - Simply Statistics [2014]
*Data science can't be point and click*, Simply Statistics - S.Corinaldi [2015]
*I created a bot to find love online – reader, it worked*, The Guardian - N.Yau [2015]
*The Elements of Data Analytic Style*, Flowing Data - N.Yau [2015]
*The Price is Right winner and cancer survivor calculates the odds*, Flowing Data - N.Yau [2015]
*Searching for stock market spoofers*, Flowing Data - C.Bialik [2015]
*Scare Headlines Exaggerated The U.S. Crime Wave*, FiveThirtyEight - J.Asher [2015]
*Murder Rates Don’t Tell Us Everything About Gun Violence*, FiveThirtyEight - R.Ehrenberg [2015]
*Analysis gives a glimpse of the extraordinary language of lying*, Science News - N.Yau [2015]
*The Most Regional Names in US History*, Flowing Data *Thanksgiving in Charts and Graphs*, The Gentlemans Armchair- N.Yau [2014]
*Lexical distance between European languages*, Flowing Data - P.Murrell [2014]
*R Graphics*, R Graphics - N.Yau [2010]
*How to visualize data with cartoonish faces ala Chernoff*, Flowing Data *A Critique of Chernoff Faces*, eagereyes- R.Misra [2014]
*6P.M. is the most dangerous time of day to be a pedestrian*, iO9 - C.Anders [2014]
*Fascinating Chart: Top 20 Metropolitan Areas in the U.S.A., 1790-2010*, iO9 - K.Wagner [2014]
*Every NBA Team's Season, In One Chart*, Regressing - T.Ley [2014]
*Interactive Chart Finds Your New Favorite Beer For You*, FoodSpin - R.Misra [2014]
*A graph showing all the languages whose words invaded English*, iO9 - N.Yau [2014]
*How people really read and share online*, Flowing Data - R.Fischer-Baum [2014]
*Which Countries Have Produced The Most World-Famous Athletes?*, Regressing - N.Yau [2014]
*Level of road grid*, Flowing Data - N.Yau [2014]
*A visual analysis of the Boston subway system*, Flowing Data *Logistic Modeling with Categorical Predictors*, SAS*Stressed Out: Americans Tell Us About Stress In Their Lives*, NPR- N.Yau [2014]
*Polling for stress*, Flowing Data - B.Swihart, et.al
*Lasagna plots: A saucy alternative to spaghetti plots*, Lasagna plots *What’s the difference between an Infographic and a Data Visualisation?*, Jackhagley- J.Pavlus, et.al
*Infographic: If 7 Billion People Lived In One City, How Big Would It Be?*, Co.Design *Left vs Right v1.5*, Information is Beautiful- Mike [2011]
*Most Pirated Artists 2007 – 2010 Word Cloud*, The Evil Jam - M.Hahsler, S.Chelluboina

*Visualizing Association Rules: Introduction to the R-extension Package arulesViz*, Visualizing Association Rules - R.Misra [2014]
*An Interactive Chart Showing Which Jobs STEM Majors Really End Up In*, iO9 - N.Yau [2014]
*Markov Chains explained visually*, Flowing Data - M.Strauss [2014]
*Here's What Your 1.1 Million Comments On Net Neutrality Look Like*, iO9 - N.Yau [2014]
*State of birth, by state and over time*, Flowing Data - N.Yau [2014]
*Finding small villages in big cities*, Flowing Data - G.Dvorsky [2014]
*These Simple Tips Will Make Your Science Visualizations Rock*, iO9 - M.Strauss [2014]
*Transforming Data Into Beer Could Be The Greatest Idea Ever*, iO9 - R.Misra [2015]
*What Visualization Best Illustrated A Tricky Scientific Concept For You?*, iO9 - N.Yau [2014]
*Real Chart Rules to Follow*, Flowing Data - N.Yau [2015]
*Bar Chart Baselines Start at Zero*, Flowing Data - N.Yau [2015]
*Venn Diagrams: Read and Use Them the Right Way*, Flowing Data - N.Yau [2015]
*Classic 1939 book on graphs in its entirety*, Flowing Data - N.Yau [2015]
*Weight loss and life events*, Flowing Data - N.Yau [2015]
*What probability means in different fields*, Flowing Data - N.Yau [2015]
*What Does Probability Mean in Your Profession?*, Math With Bad Drawings - N.Yau [2015]
*A timeline of history*, FlowingData *Left vs Right v1.5*, Information is Beautiful- N.Yau [2015]
*Work Counts*, FlowingData - N.Yau [2015]
*Most Common Use of Time, By Age and Sex*, FlowingData - A.Crossman [2016]
*Data Cleaning*, About Education *Data Cleaning*, Analysis*Top ten ways to clean your data*, Microsoft- R.Cody, et.al,
*Data Cleaning 101*, ucla - T.Orchard, M.Woodbury,
*A MISSING INFORMATION PRINCIPLE: THEORY AND APPLICATIONS*, Project Euclid

- P.Allison,
*Modern Methods for Missing Data*, amstat

*Regression diagnostics and cautions: outliers and influential points*, uoregon

- H.Wickham
*Tidy Data*, Journal of Statistical Software - V.Powell
*Conditional probability*, Setosa *Qualities of a Good Question*, StatPac*GOOD DATA FROM BAD QUESTIONS? IMPOSSIBLE!*, Cooperative Extension*Electronic Information Resources - Myth and Reality*, stsci- M.Püschel

*Small Guide to Making Nice Tables*, Carnegie Mellon - N.Yau [2014]
*The important parts of data analysis*, FlowingData - T.Hothorn, et.al,

[2006],*party: A Laboratory for Recursive Partytioning*, R package

- Z.Weinersmith

[2014],*An artificial one-liner generator*, Scientia salon - N.Webb

[2006],*Reliability Coefficients and Generalizability Theory*, handbook of statistics - A.Cernat [2013],
*The impact of mixing modes on reliability in longitudinal studies*, ESRC - B.Tran, C.Tucker [2010],
*Using Latent Class Models to Better Understand Reliability in Measures of Labor Force Status*, JSM 2010 - R.Fischer-Baum [2014],
*Charts: Your Spending Habits Get Lamer As You Age*, Regressing - G.Dvorsky [2014],
*20 Crucial Terms Every 21st Century Futurist Should Know*, iO9 - C.Proust-Lima, et.al,

[2016],*Package ‘lcmm’-Extended Mixed Models Using Latent Classes and Latent Processes* - Z.Bursac, et.al,

[2008],*Purposeful selection of variables in logistic regression* - Y.Zhang [2011],
*Dimension Reduction*, Dimension Reduction Slides *Organisational Core Values*, Organisational Core Values- N.Yau [2013],
*Getting started with visualization after getting started with visualization*, Flowing Data - B.Fry [2004],
*Computational Information Design*, Massachusetts Institute of Technology - N.Yau [2014],
*A more visual world data portal*, Flowing Data - S.Boriah, et.al,
*Similarity Measures for Categorical Data: A Comparative Evaluation*, University of Minnesota

- P.Allison
*What’s the Best R-Squared for Logistic Regression?*, statistical horizons *The curse of dimensionality*, The Shape of Data*Decision Trees*, The Shape of Data*Duality and Coclustering*, The Shape of Data- S.Fefilatyev, et.al,
*Detection of Anomalous Particles from Deepwater Horizon Oil Spill Using SIPPER3 Underwater Imaging Platform*, Proceedings Template - WORD *Pre-Crime Data Mining*, Pre-Crime Data Mining- A.Bellaachia, E.Guven
*Predicting Breast Cancer Survivability Using Data Mining Techniques*, Predicting Breast Cancer Survivability - J.Rath [2014],
*Data Scientists Predict Oscar Winners*, Data Center Knowledge - E.Lamb [2014],
*The Saddest Thing I Know about the Integers*, scientific american - V.Velickovic,
*What Everyone Should Know about Statistical Correlation*, american scientist - N.Silver
*Rich Data, Poor Data*, fivethirtyeight - A.Hoorfar, M.Hassant [2008],
*INEQUALITIES ON THE LAMBERT W FUNCTION AND HYPERPOWER FUNCTION*, JIPAM - Investopedia Staff,
*A Beginner's Guide To Hedging*, investopedia - T.Yates,
*Practical And Affordable Hedging Strategies*, investopedia - M.Kang [2015],
*Exploring the 7 Different Types of Data Stories*, mediashift - H.Chen [2014],
*Curve Fitting & Multisensory Integration*, cogsci.ucsd.edu - T.Minka,
*Building statistical models by visualization*, Microsoft Research - Y.Zhao [2015],
*R and Data Mining: Examples and Case Studies*, r data mining - Y.Zhao [2015],
*Introduction to Data Mining with R*, r data mining - D. Meyer [2015],
*Support Vector Machines*, r-project - A.Fatahi [2010],
*TRUNCATED ZERO INFLATED BINOMIAL CONTROL CHART FOR MONITORING RARE HEALTH EVENTS*, IJRRAS

- A.Lazarevic, et.al, [2004],
*Data Mining for Analysis of Rare Events:A Case Study in Security, Financial and Medical Applications*, University of Minnesota Tutorial - D.Farace, J.Schöpfel,
*Grey Literature in Library and Information Studies*, DE GRUYTER *A Practical Guide to Statistics for Online Experiments*, optimizely

**Blogs and Sites**

- MulinBlog: a digital communication blog
- SportingCharts
- FlowingData
- Column Five Media
- Visual.ly
- eagereyes: Visualization and Visual Communication
- Quick-R: accessing the power of R
- New York Times' The Upshot
- Nate Silver's FiveThirtyEight
- Simply Statistics
- Using Visual Explanations to Create Learning: a research portfolio about visual explanations, learning and interactivity
- Martin Grandjean: Digital Humanities, Data Visualization, Network Analysis
- The Guardian US Interactive Team
- TULP Interactive
- SMBC Comics
- Daily Science Fiction
- Cracked
- NaCTeM
- Flowing Data: Books
- Visualization Books in the Queue
- Microsoft Research
- Venu's Mushings
- Nedroid
- Image & Narrative
- Visualizing Science
- PKP
- OpenDOAR
- University of Glasgow
- IBM developWorks
- SAS
- Dataversity
- London School of Hygiene & Tropical Medicine
- IBM Big Data & Analytics Hub
- Statistics without Borders
- Next City
- IBM
- EJEG
- Publications and Media Library
- GreyNet International
- AMSER
- The Grey Literature Report
- Open Grey
- GreyNet
- Information is Beautiful
- Bloomberg
- Ptable
- MathTube
- Dell Software
- WolframMathWorld
- DEA Zone
- Stat Trek
- StatSci.org
- Research Utopia
- Bad Science
- Richard D. Gill's home page
- nature.com
- deutsch29
- phD
- GOV.UK
- Data cuisine
- ESPN
- Information aesthetics
- Visualization Group
- GGobi
- ggplot2
- Foam Tree
- Smart-stats
- Lucidchart
- Fathom
- Information is Beautiful
- Gapminder
- Hive On Demand
- Information Visualization
- Lucas Infografia
- STEPHEN MCMURTRY
- eagereyes
- matplotlib
- Mike Bostock’s Blocks
- Plotly
- The Information Diet
- The University of Western Australia
- stanford vis group
- UW Interactive Data Lab
- Data is Beautiful
- R Project
- IBM Knowledge Center
- Weka
- UC Irvine Machine Learning Repository
- Datahub
- Historical Climate Data
- Ottawa's Open Data Catalogue
- Information is Beautiful
- Quantum blog
- Trading with Python
- RESEARCH UTOPIA
- A Visual Introduction to Machine Learning
- data science central
- r data mining
- William Vorhies's Blog
- Things Of Interest
- Databases covering grey literature and reports
- Grey Literature Report
- LexisNexis Searchable Directory of Online Sources
- The Quartz guide to bad data
- PennState Eberly College of Science
- Quantopian Blog
- Twiecki Github

**General**

*Data Visualization*, Wikipedia- R.Kosara [2008],
*What is a Visualization?*, eagleeyes *Rossmo's formula*, Wikipedia*Similarity measure*, Wikipedia*Benford's law*, Wikipedia*Kalman filter*, Wikipedia*IRIS Toolbox*, CodePlex*Naive Bayes classifier*, Wikipedia*K-means++*, Wikipedia*MovieLens*, Grouplens*Hands-on Exercises*, Spark*Stat 571: Statistical Methods*,*Bias*, Wikipedia*Determining the number of clusters in a data set*, Wikipedia*Time series*, Wikipedia*Category:Data clustering algorithms*, Wikipedia*Introducing Kaggle Datasets*, Kaggle*Awesome Public Datasets*, GitHub*The 2015 Data Awards*, FiveThirtyEight*Datasets for Data Mining*, The University of Edinburgh School of Informatics*The Data Science Industry: Who Does What (Infographic)*, Back to DataCamp*The Field Guide to Data Science*, Booz Allen Hamilton*Data Visualization*, FEMA*Farming Concrete Mill*, Farming Concrete Mill*Public Service Transformation Academy Launch*, FEMA*Danish Medical Data Distribution*, DMDD*UC Irvine Machine Learning Repository*, UCI*Mann–Whitney U test*, Wikipedia*Bootstrapping (statistics)*, Wikipedia*Predictive analytics*, Wikipedia*Quick-R*, Quick-R*Research Methods*, StatPac*Data analysis*, Wikipedia*Big data*, Wikipedia*Analytics*, Wikipedia*Validity (statistics)*, Wikipedia*Math Department*, Clackamas*Alcula*, Alcula*Theory of Correspondence Analysis*, Statmath- M.Bendixen [1996]
*A Practical Guide to the Use of Correspondence Analysis in Marketing Research*, Marketing Bulletin *One-way MANOVA in SPSS Statistics*, AERD Statistics*There are known knowns*, Wikipedia*Nate Silver Quotes*, goodreads*Nate Silver Quotes*, Brainy Quote*FiveThirtyEight*, Wikipedia*The Signal and the Noise*, Wikipedia*Moneyball*, Wikipedia*Fabrication (science)*, Wikipedia*Scientific misconduct*, Wikipedia*Data analysis techniques for fraud detection*, Wikipedia*How to be a Data Detective*, NPE*Bad Science (book)*, Wikipedia*Publishers withdraw more than 120 gibberish papers*, nature.com*Bokeh, a Python library for interactive visualization*, Flowing Data*Misleading graph*, Wikipedia*Treemapping*, Wikipedia*Heat map*, Wikipedia*Parallel coordinates*, Wikipedia*Box plot*, Wikipedia*Chernoff face*, Wikipedia*What is Visualization? A Definition*, eagereyes*Chernoff Face*, WolframMathWorld*Data visualization*, Wikipedia*Data grab bag*, Flowing Data*Edward Tufte*, Wikipedia*Statistics Calculator: Box Plot*, alcula*Statistics Calculator: Scatter Plot*, alcula*Statistics Calculator: Linear Regression*, alcula*50 Great Examples of Data Visualization*, Web Designer Depot*The 38 best tools for data visualization*, Creative Bloq/a>*Dominant Players*, XKCD*Treemap Basics*, Hive On Demand*What is a treemap? 5 examples and how you can create one*, Fishbowl NY*Where do college graduates work?*, United States Census Bureau*Announcing the Information is Beautiful Awards 2015*, Information is Beautiful- C.Chapman [2009]
*Data Visualization and Infographics Resources*, Smashing Magazine - V.Friedman [2007]
*Data Visualization: Modern Approaches*, Smashing Magazine - V.Friedman [2008]
*Data Visualization and Infographics*, Smashing Magazine - A.Marcelionis [2015]
*Fun With Physics In Data Visualization*, Smashing Magazine - A.Sahagun [2014]
*Data Visualization: Modern Approaches | Smashing Magazine*, ARI SAHAGÚN *Data Visualization: Modern Approaches*, Pearltrees*40 videos about data visualization*, Visualoop*Visualizations*, TED*A Tour through the Visualization Zoo*, acmqueue*Graph drawing*, wikipedia- S.Machlis [2015]
*LEARN TO USE R*, Learn to use R *Calendrier des formations*, solutionstat*Winners: Kantar Information is Beautiful Awards 2015*, Information is Beautiful*Misuse of statistics*, wikipedia*Statistics/Data Analysis/Data Cleaning*, wikibooks*Data cleansing*, wikipedia*Imputation*, wikipedia*Missing data*, wikipedia*Tidy data*, R-project*Sensitivity and specificity*, wikipedia*LaTeX/Tables*, wikibooks*Receiver operating characteristic*, wikibooks*Pearson's chi-squared test*, wikipedia*Matthews correlation coefficient*, wikipedia*Statistical classification*, wikipedia*Multiclass classification*, wikipedia*Accuracy and precision*, wikipedia*Binary classification*, wikipedia*Classification chart*, wikipedia- C.Molnar [2012],
*Conditional Trees*, linkedIn *Conditional inference trees vs traditional decision trees*, StackExchange- N.Yau [2015],
*How to Make Smoothed Density Maps in R*, Flowing Data *LaTeX table capabilities*, StackExchange*Standard Procurement Templates*, Buyandsell.gc.ca- N.Yau [2014],
*Extract CSV data from PDF files with Tabula*, Flowing Data *Growth Mixture Modeling, Path Specification*, OpenMx*Multivariate statistics*, wikipedia*Tools for making latex tables in R*, StackExchange*Introductory Statistics*, Introductory Statistics*Six Sigma*, wikipedia- N.Yau [2014]
*Large-ish data packages in R*, Flowing Data *FOOD RESILIENCE*, data.gov*9 essential LaTeX packages everyone should use*, how to tex*How to extract text based on font color from a cell with text of multiple colors*, StackExchange*DATA SCIENCE CODE OF PROFESSIONAL CONDUCT*, Data Science*Data Science + Ethics*, Data Science + Ethics*When is small data better than big?*, Data Science Central- V.Ho
*Why Small Data May Be Bigger Than Big Data*, Inc. *Category:Data clustering algorithms*, wikipedia- N.Yau [2014],
*Curse of dimensionality, interactive demo*, Flowing Data - N.Yau [2014],
*A collection of small datasets*, Flowing Data *Lean Construction Special Issue*, The Search Guide- N.Yau [2014],
*Casual visualization books for the coffee table*, Flowing Data - N.Yau [2014],
*Planets as fruit to show scale*, Flowing Data *Law of total variance*, wikipedia*Structural equation modeling*, wikipedia*Bayesian linear regression*, wikipedia*Logistic regression*, wikipedia*List of cognitive biases*, wikipedia*Scale (social sciences)*, wikipedia*Index (economics)*, wikipedia*Least squares support vector machine*, wikipedia*Multivariate normal distribution*, wikipedia*German tank problem*, wikipedia*Doomsday argument*, wikipedia*Market neutral*, wikipedia*Hedge (finance)*, wikipedia*not-for-profit academic endeavor*, spliddit*CLARIFY YOUR DECISIONS*, darkhorse*Online Web of Science to bibTeX conversion*, Lagom.nl- N.Yau [2014],
*When data gets creepy*, Flowing Data - N.Yau [2014],
*Identifying cheaters in test results, a simple method*, Flowing Data - N.Yau [2015],
*R Cheat Sheet and Guide for Graphical Parameters*, Flowing Data - R.Nuzzo [2015],
*Scientists Perturbed by Loss of Stat Tools to Sift Research Fudge from Fact*, scientific american - N.Yau [2015],
*Problems with algorithmic policy-making*, Flowing Data - A.Marcus, I.Oransky [2015],
*How the Biggest Fabricator in Science Got Caught*, nautilus - N.Yau [2015],
*Fudging the crime statistics and police misconduct*, Flowing Data - T.Minka,
*Microsoft Research*, Microsoft Research - A. Vries, J.Meys
*How to Use the Clipboard to Copy and Paste Data in R*, For Dummies *How to fix libatk-1.0-0.dll error*, wikifixes*System for Information on Grey Literature in Europe*, wikipedia*Grey literature*, wikipedia*What is Grey Literature?*, greylit*Grey Literature?*, opengrey*SSM1100Y: Research Paper Course Guide*, U of T Libraries*Meta-analysis*, wikipedia*CRAN Task View: Meta-Analysis*, r project*Why perform a meta-analysis?*, Comprehensive Meta-Analysis*Meta-Analysis*, Study Design 101- J.Deeks,
*Analysing data and undertaking meta-analyses*, Cochrane Handbook - C.Aschwanden, [2015],
*Not Even Scientists Can Easily Explain P-values*, FiveThirtyEight *TMS Recovery Program*, tmswiki

**Survey and Sampling**

- Statistics Canada, *National Population Health Survey: Household Component, Longitudinal (NPHS)*, StatCan
- Statistics Canada, *National Population Health Survey (NPHS), Cycle 1-9*, OPHID
- *National Population Health Survey Household Component*, Statistics Canada
- *National Population Health Survey Household Component Questionnaire*, Statistics Canada
- *2000 National Population Health Survey (Cycle 4) Content for June 2000*, Statistics Canada
- *National Population Health Survey Household Component-Cycle 9-Questionnaire*, Statistics Canada
- *Research Methods*, StatPac
- *Qualities of a Good Question*, StatPac
- *Good Data From Bad Questions? Impossible!*, Cooperative Extension
- M.D’Orazio [2010], *Evaluating reliability of combined responses through latent class models*, Istat
- *Air Carrier Traffic at Canadian Airports*, Statistics Canada
- *Air Carrier Traffic at Canadian Airports (51-203-X)*, Statistics Canada
- N.Diakopoulos [2013], *How Google Flu Trends Is Getting to the Bottom of Messy Data*, Harvard Business Review
- *Charitable giving by Canadians*, Statistics Canada
- *Charitable giving by Canadians Table 2*, Statistics Canada
- *Charitable giving by Canadians Table 3*, Statistics Canada
- *Charitable giving by Canadians Table 4*, Statistics Canada
- *Charitable giving by Canadians Table 7*, Statistics Canada
- *Charitable giving by Canadians Table 8*, Statistics Canada
- *Charitable giving by Canadians Table 9*, Statistics Canada
- *Charitable giving by Canadians Chart 1*, Statistics Canada
- *Charitable giving by Canadians Chart 2*, Statistics Canada
- *Charitable giving by Canadians Chart 3*, Statistics Canada
- *Charitable giving by Canadians Chart 4*, Statistics Canada

**Compilations**

- A.Shienkman [2015], *Our 47 weirdest charts from 2015*, FiveThirtyEight
- N.Yau [2015], *10 Best Data Visualization Projects of 2015*, FlowingData

**Code to Produce Graphics**

- W.Chang [2013], *R Graphics Cookbook*, Amazon.ca
- J.Lander [2013], *R for Everyone: Advanced Analytics and Graphics*, Amazon.ca
- *Star (Spider/Radar) Plots and Segment Diagrams*, R-Manual
- N.Yau [2010], *How to visualize data with cartoonish faces à la Chernoff*, FlowingData
- *Star Plots and Segment Diagrams of Multivariate Data*, Basic R package
- *Boxplots*, Quick-R
- Bokeh, a Python interactive visualization library
- D3.js, a JavaScript library for manipulating documents based on data
- J. Zhang [2012], *SUGI 29: Techniques for Generating Dynamic Code from SAS® DICTIONARY Data*
- *Commonly Used Attribute Options*, SAS
- S.Slaughter, L.Delwiche, *Using PROC SGPLOT for Quick High-Quality Graphs*
- P.Hebbar, *Off the Beaten Path: Create Unusual Graphs with GTL*
- N.Yau, *How to Make Bubble Charts*, Flowing Data
- N.Yau, *Comparing ggplot2 and R Base Graphics*, Flowing Data
- N.Yau, *Moving to the “worst” place in America*, Flowing Data
- *10 tips for making your R graphics look their best*, Revolutions
- N.Lemoine, *R for Ecologists: Putting Together a Piecewise Regression*, R Bloggers
- M.Friendly, E.Kwan, C.LaBrish [2016], *Visualizing Categorical Data with SAS and R: Exercises*, Datavis
- P.Burns [2011], *The R Inferno*, Burns Stats
- Microsoft, *Present your data in a radar chart*, Microsoft
- Quick R, *Boxplots*, Quick R
- Math UCLA, *Star Plots and Segment Diagrams of Multivariate Data*, Math UCLA
- R Documentation, *Star Plots and Segment Diagrams*, R Documentation
- B.Huang, et al., *tourrGui: A gWidgets GUI for the Tour to Explore High-Dimensional Data Using Low-Dimensional Projections*, Journal of Statistical Software

- H.Wainer, *Improving Tabular Displays, with NAEP Tables as Examples and Inspirations*, Journal of Educational and Behavioral Statistics
- D.Cook, *How, when and why to use interactive and dynamic graphics*, Iowa State University
- R.Wicklin [2011], *Visualizing correlations between variables in SAS*, The Do Loop
- N.Yau [2014], *How to Read and Use Histograms in R*, Flowing Data
- N.Yau [2014], *Accessible Web visuals and code with p5.js*, Flowing Data
- N.Yau [2015], *Horizon Graphs, with a Food Pricing Example*, Flowing Data
- N.Yau [2015], *How to Make Horizon Graphs in R*, Flowing Data
- S.Machlis [2013], *Beginner's Guide to R: Painless Data Visualization*, Computer World
- H.Wickham [2011], *ggplot2 basics*, ggplot2 basics
- K.Rodden [2016], *Sequences sunburst*, Sequences sunburst
- M.Bostock [2016], *Curved Links*, Curved Links
- N.Yau [2015], *Plotly.js, a JavaScript graphing library, open-sourced*, Flowing Data
- N.Yau [2012], *xkcd-style charts in R, JavaScript, and Python*, Flowing Data
- S.Raschka [2014], *Implementing a Principal Component Analysis in Python*, sebastianraschka
- *SAS:LOGISTIC Procedure*, SAS
- *Data-Driven Documents*, D3.js
- *Scatter Plots in Python*, Plotly
- *shapes_and_collections example code: scatter_demo.py*, matplotlib
- H.Wickham, H.Hofmann, *Intro to R*, Intro to R
- *Line Charts*, Quick-R
- *Plotting earthquake data*, r-bloggers
- J.Albert [2016], *Graphing Pitch Count Effects*, Exploring Baseball Data with R
- *Plotting the Iris Data*, warwick
- *Code to create a scatterplot matrix*, ggplot2
- *Scatter Plot Matrices in R*, Data Analysis and Visualization Using R
- *R color cheatsheet*, R color cheatsheet
- *How to Make a Histogram with ggplot2*, r-bloggers
- *Histogram and density plot*, Cookbook for R
- N.Horton [2011], *Example 9.1: Scatterplots with binning for large datasets*, r-bloggers
- K.Kleinman [2011], *Example 8.41: Scatterplot with marginal histograms*, r-bloggers
- D.Attali [2015], *ggExtra: R package for adding marginal histograms to ggplot2*, r-bloggers
- N.Zumel [2015], *Wanted: A Perfect Scatterplot (with Marginals)*, r-bloggers
- F.Veronesi [2013], *Box-plot with R – Tutorial*, r-bloggers
- *Data Structures*, python

**General Code**

- SAS User Guide
- L. Gau, *SAS Global Forum: Write SAS Code to Generate Another SAS Program*
- H.Wickham, *Optimising code*
- P.Gill, E.Wong [2014], *Methods for Convex and General Quadratic Programming*, ucsd
- C.Gohlke, *Unofficial Windows Binaries for Python Extension Packages*, University of California, Irvine
- *Visualizing the distribution of a dataset*, stanford
- *Emacs Newbie Key Reference*, emacswiki
- N.Yau [2015], *Extract data from PDF files and export to CSV*, flowing data
- J.Salvatier, et al., *Probabilistic Programming in Python using PyMC*, PyMC3
- *Scatterplots*, Quick-R
- *Adding a legend to a plot*, r-bloggers
- *How I used R to create a word cloud, step by step*, Georeferenced
- *Axes and Text*, Quick-R
- *SVM example with Iris Data in R*, github
- *Cheatsheet – 11 Steps for Data Exploration in R (with codes)*, analytics vidhya
- R.Hamer, P.Simpson, *SAS Tools for Meta-Analysis*, SAS
- C.Sheu, S.Suzuki [2001], *Meta-analysis using linear models*, citeseerx
- R.Butterfield [2009], *The Use of SAS in Meta-Analysis*, ncsu.edu
- J.Gloudemans, et al. [2011], *MV_META: A SAS Macro for Multivariate Meta-Analysis*, SESUG 2011
- M.Komaroff [2012], *APPLICATION OF META-ANALYSIS IN CLINICAL TRIALS*, PharmaSUG
- S.Kovalchik [2013], *Tutorial On Meta-Analysis In R*, R useR! Conference 2013
- A.C.Del Re [2015], *A Practical Tutorial on Conducting Meta-Analysis in R*, The Quantitative Methods for Psychology
- J.Rickert [2014], *R and Meta-Analysis*, R bloggers

**Debugging and Common Questions**

- SAS
- StackFlow
- SAS
- *Mathematics Questions*, stackexchange
- *Approximation for Lambert W function near zero*, stackexchange
- *pymc3*, github
- *How to customize lines in ggpairs*, stackoverflow
- *What are pseudo R-squareds*, Institute for Digital Research and Education
- *How do I interpret odds ratios in logistic regression*, Institute for Digital Research and Education

**Technical Details**

- M.Lin [2013], *A color palette optimized for data visualization*, MulinBlog
- *PyMC3*
- *Color Code*, Coolors
- *Color Code_R*
- *Using colors in R*

**Interactive/Dynamic/Animated Data Visualization**

- *Keeping Up With the 2014 Winter Olympics*, Washington Post (membership required for access)
- *Sochi 2014 Winter Olympic Games Calendar*, Sports Interaction
- N.Yau [2016], *How You Will Die*, FlowingData
- K.Collins [2015], *Why Infectious Bacteria are Winning*, Quartz
- Bokeh, a Python interactive visualization library
- D3.js, a JavaScript library for manipulating documents based on data
- *You Draw It: How Family Income Predicts Children's College Chances*, The Upshot, New York Times
- R.Harris, N.Popovich, K.Powell [2015], *Watch how the measles outbreak spreads when kids get vaccinated – and when they don't*, The Guardian
- S.Yee, T.Chu [2015], *A Visual Introduction to Machine Learning, part 1*, R2D3.us
- T.Randall, B.Migliozzi [2015], *2014 Was the Hottest Year on Record*, Bloomberg
- J.W.Tulp [2015], *Goldilocks*, TULP Interactive
- *This is What the Spread of Walmart Looks Like From 1962 to 2006*, Cheezburger
- *Player Usage Charts*, Hockey Abstract
- N.Yau [2015], *Automatic charts and insights in Google Sheets*, FlowingData

**Heat Maps**

**Box Plots**

*Box Plot*, Wikipedia

**Parallel Coordinates/Spaghetti Plots**

*Parallel Coordinates*, Wikipedia

**Maps**

- N.Yau [2014], *Where people run*, FlowingData
- N.Yau [2014], *Amount of snow to cancel school*, FlowingData, reporting on redditor atrubetskoy's map
- R.Masra [2014], *A map of how much snow it takes to cancel school across the U.S.*, io9, reporting on redditor atrubetskoy's map
- N.Yau [2013], *The most regional names in US history*, FlowingData
- *An Unconventional Look at the European Map*, The Dialogue
- N.Yau [2016], *Changing river path seen through satellite images*, FlowingData
- D.Walbert, *The mathematics of projections*, LEARN NC
- *This is What the Spread of Walmart Looks Like From 1962 to 2006*, Cheezburger
- C.Maria [2014], *Nine beautiful maps that will change how you see the world*, The Weather Network
- A.Newitz [2014], *Map shows which countries are contributing the most to climate change*, io9
- G.Dvorsky [2014], *An interactive map showing how baby names spread across the US*, io9
- *Many ways to see the world*, ODT Maps
- *Find all the countries of the world in the updated map*, Gapminder
- N.Yau [2014], *How to Make an Interactive Treemap*, Flowing Data
- F.Jacobs, *Current Affairs: European Electricity Exports and Imports*, Big Think
- A.Liptak [2015], *Data Visualization Shows How Segregated Our Cities Are*, io9
- L.Czerniewicz [2015], *A World Map Based on Scientific Research Papers Produced*, io9
- F.Jacobs, *The Map as Persuader*, Big Think
- *Plotting elevation maps and shaded relief images from latitude, longitude, and elevation pairs*, StackExchange
- *What Makes a Map Beautiful?*, StackExchange
- *A Model of Breast Cancer Causation*, Breast Cancer
- N.Yau [2014], *Explorations of People Movements*, Flowing Data
- S.Sayad, *An Introduction to Data Mining*, Saedsayad
- S.Lynn, *Self-Organising Maps for Customer Segmentation using R*, LinkedIn

**Text Analysis**

- K.Elliott, R.Johnson, T.Mellnik [2014], *History through the president’s words*, Washington Post (membership required for access)
- N.Yau [2016], *The Guardian analyzes 70m comments, unearthing online abuse*, Flowing Data
- P.Wong, *Visualizing Association Rules for Text Mining*, Visualizing Association Rules for Text Mining

**Queueing**

- *Queueing Delay*
- Y.Abdelkader, M.Al-Wohaibi [2011], *Computing the Performance Measures in Queueing Models via the Method of Order Statistics*, Journal of Applied Mathematics

**Data Envelopment Analysis**

**Time Series**

- O.Anava, E.Hazan, A.Zeevi [2015], *Online Time Series Prediction with Missing Data*
- D.Fung [2006], *Methods for the Estimation of Missing Values in Time Series*
- M.Vlachos [2005], *A practical Time-Series Tutorial with MATLAB*
- *Timeseries class*, MathWorks
- *Time Series Decomposition*, MathWorks
- *Parametric Trend Estimation*, MathWorks
- *Seasonal Adjustment Using S(n,m) Seasonal Filters*, MathWorks
- *Moving Average Trend Estimation*, MathWorks
- *Seasonal Adjustment Using a Stable Seasonal Filter*, MathWorks
- *Resample*, MathWorks
- *Seasonal Adjustment*, MathWorks
- Statistics Austria, T.Wien [2012], *Interactive adjustment and outlier detection of time dependent data in R*, Conference of European Statisticians
- B.Pecar [2012], *Automating Time Series Analysis*
- *PROC X12 Statement*, SAS
- J.Honaker, G.King [2010], *What to Do about Missing Values in Time-Series Cross-Section Data*
- *Mann-Kendall Test For Monotonic Trend*
- *Detrending*
- *PROC X12 Example*, SAS
- T.Jackson, M.Leonard, *Seasonal Adjustment Using the X12 Procedure*
- *Working With Time Series Data*
- R.Peng [2016], *Time Series Analysis in Biomedical Science - What You Really Need to Know*

**Bayesian Analysis**

- *Understanding empirical Bayes estimation (using baseball statistics)*
- J.Bowers, C.Davis, *Bayesian Just-So Stories in Psychology and Neuroscience*
- J.Horgan, *Are Brains Bayesian?*
- H.Thornburg, *Introduction to Bayesian Statistics*
- E.Yudkowsky, *An Intuitive Explanation of Bayes' Theorem*
- *IS THE BRAIN BAYESIAN?*
- J. Horgan [2016], *Bayes's Theorem: What's the Big Deal?*, Scientific American
- Andrew [2008], *Why I don’t like Bayesian statistics*, Statistical Modeling, Causal Inference, and Social Science
- *Bayes' Theorem with Lego*, COUNT BAYESIE
- T.Wiecki, *Bayesian data analysis with PyMC3*, Quantopian Inc.
- A.Gelman, et al. [2014], *Stan: A platform for Bayesian inference*, Columbia University
- N.Yau [2012], *Bayesian fantasy football 101*, flowing data
- *Bayesian Data Analysis with PyMC3*, github

**Star Diagrams**

- *Star (Spider/Radar) Plots and Segment Diagrams*, R-Manual
- *Star Plots and Segment Diagrams of Multivariate Data*, Basic R package

**Chernoff Faces**

- *Chernoff Faces*, Wolfram MathWorld
- R.Kosara [2007], *A Critique of Chernoff Faces*, eagereyes
- N.Yau [2010], *How to visualize data with cartoonish faces à la Chernoff*, FlowingData
- *Chernoff Face*, Wikipedia
- *The Trouble with Chernoff*, Map Hugger
- L.Golden, M.Sirdesai [1992], *Chernoff Faces: a Useful Technique For Comparative Image Analysis and Representation*, Map Hugger
- C.Morris, D.Ebert, P.Rheingans, *An Experimental Analysis of the Effectiveness of Features in Chernoff Faces*, UMBC
- *Baseball managers Chernoff faces*, information aesthetics
- A.Schwarz [2008], *Professor Puts a Face on the Performance of Baseball Managers*, The New York Times

**Network Visualizations**

- M.Grandjean [2015], *Network Visualizations: Mapping Shakespeare's Tragedies*, Martin Grandjean
- A.Newitz [2014], *Can Network Theory Help Explain Epic Mythology?*, io9
- *Visualizing Reddit Discussions*, Visualizing Reddit Discussions
- N.Hirakata [2015], *Neo4j to make svg for visualization of relationship graph*, Gappy Facets
- *global language network*, global language network

**Neural Network**

- N.Yau [2016], *Here's how a neural network works*, FlowingData
- D. Smilkov and S. Carter, GitHub
- E.Bernhardsson [2016], *Analyzing 50k fonts using deep neural networks*
- I.Zandi [2000], *Use of Artificial Neural Network as a Risk Assessment Tool in Preventing Child Abuse*, Artificial neural network
- P.RK [2000], *Applying artificial neural network models to clinical decision making*, Artificial neural network
- M.Gales [2001], *Multi-Layer Perceptron: Introduction and Training*, Multi-Layer Perceptron: Introduction and Training
- N.Yau [2015], *Neural Network for selfie analysis*, flowing data

**Association Rules**

- *Association Rules*, Association Rules
- S.Brossette, et al., *Association Rules and Data Mining in Hospital Infection Control and Public Health Surveillance*, NCBI
- E.García, et al., *Drawbacks and solutions of applying association rule mining in learning management systems*, the International Workshop
- *Association Rules*, Association Rules
- R.Sousa, F.Rodrigues [2013], *Mining association rules with rare and frequent items*, ACM Digital Library
- C.Berberidis, *Inter-Transaction Association Rules Mining for Rare Events Prediction*, Aristotle University of Thessaloniki
- Y.Koh, N.Rountree, *Rare Association Rule Mining and Knowledge Discovery*, Information Science Reference
- D.Rai, et al. [2012], *MSApriori using Total Support Tree Data Structure*, International Journal of Computer Applications
- YC.Lee, et al. [2004], *Mining association rules with multiple minimum supports using maximum constraints*, science direct
- YH.Hu, YL.C [2004], *Mining association rules with multiple minimum supports: a new mining algorithm and a support tuning mechanism*, science direct
- Wesley [2012], *Association Rule Learning and the Apriori Algorithm*, r bloggers
- H.Yun, et al. [2003], *Mining association rules on significant rare data using relative support*, science direct
- W.Lin [2003], *Collaborative Recommendation via Adaptive Association Rule Mining*, Worcester Polytechnic Institute
- U.Bhatt, P.Patel [2014], *A Recent Overview: Rare Association Rule Mining*, International Journal of Computer Applications
- C.Romero, et al., *Mining Rare Association Rules from e-Learning Data*, University of Córdoba
- M.Khan, et al., *Fuzzy Weighted Association Rule Mining with Weighted Support and Confidence Framework*, citeseerx
- E.Cohen, et al., *Finding Interesting Associations without Support Pruning*, stanford.edu
- *Confidence Based Pruning*, hyper textbook shop
- S.Kannan, R.Bhaskaran [2009], *Association Rule Pruning based on Interestingness Measures with Clustering*, International Journal of Computer Science Issues
- M.Steinbach, et al. [2007], *Objective Measures for Association Pattern Analysis*, Contemporary Mathematics
- R.Bayardo Jr, et al. [1999], *Constraint-Based Rule Mining in Large, Dense Databases*, The 15th Int’l Conf. on Data Engineering
- R.Bayardo Jr, R.Agrawal [1999], *Mining the Most Interesting Rules*, The Fifth ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining
- J.Li, *Efficient Mining of High Confidence Association Rules without Support Thresholds*, citeseerx
- J.Bailey [2002], *Fast Algorithms for Mining Emerging Patterns*, Springer Link
- A.Batbarai, D.Naidu [2014], *Approach for Rule Pruning in Association Rule Mining for Removing Redundancy*, IJIRCCE
- L.Szathmary, et al., *Towards Rare Itemset Mining*, HAL

**Classification**

- W.Loh, *Classification and regression trees*
- J.Platt, N.Cristianini [2000], *Large Margin DAGs for Multiclass Classification*
- *Statistical classification*, wikipedia
- *Multiclass classification*, wikipedia
- *Accuracy and precision*, wikipedia
- *Binary classification*, wikipedia
- *Classification chart*, wikipedia
- *Supervised vs. unsupervised learning*, valpola_thesis
- *Tree-Based Models*, Quick-R
- *Decision Trees*, r data mining
- *Classification using neural net in r*, r-bloggers
- JP.Vert, *Practical session: Introduction to SVM in R*, svmbasic_notes
- *Support Vector Regression with R*, SVM Tutorial
- J.Rickert [2013], *Draw nicer Classification and Regression Trees with the rpart.plot package*, Revolutions
- *Support Vector Machines*, scikit-learn
- *Support Vector Machines Tutorial*, NEC Labs America
- *Why use SVM?*, yaksis
- *Introduction to Support Vector Machines*, opencv

**Clustering**

- M.Meila, *Classic and Modern Data Clustering*, University of Washington
- *Clustering - spark.mllib*, Spark
- B.Bahmani, B.Moseley, A.Vattani, R.Kumar, S.Vassilvitskii, *Scalable K-Means++*
- A.Vassilaros, *ISODATA*
- *Clustering Algorithm Applications*
- *ROCK: A Robust Clustering Algorithm for Categorical Attributes*, ROCK
- M.Mampaey, J.Vreeken, *Summarizing Categorical Data by Clustering Attributes*, Summarizing Categorical Data by Clustering Attributes
- T.Chen et al., *Model-based multidimensional clustering of categorical data*, Science Direct
- P.Kudová et al., *Categorical Data Clustering Using Statistical Methods and Neural Networks*, Categorical Data Clustering
- B.Frey, D.Dueck, *Clustering by Passing Messages Between Data Points*, Science
- J.Carbonera, *Are there clustering algorithms developed for dealing naturally with nominal/conceptual/categorial data*, ResearchGate
- *Cluster Analysis - Introduction*, Clustering and Classification methods for Biologists
- H.Finch, *Comparison of Distance Measures in Cluster Analysis with Dichotomous Data*, Cluster Analysis
- *Cluster Analysis*, Cluster Analysis
- *K-means Clustering*, R-statistics blog
- B.Mehta, *IRIS Clustering using R-NNet Neural Network*, SAP

**Predictive Analytics**

- *Spine Extrapolation*
- *Predictive Analytics*, IBM
- *The Age of Predictive Analytics*, Office of the Privacy Commissioner of Canada
- *Analytics Paves the Way for Better Government*, Forbes Insights
- *Practical Predictive Analytics*, LinkedIn
- *Predictive Analytics: Context and Use Cases*, LinkedIn
- *Predictive Analytics World for Government*, Predictive Analytics World for Government
- R.Mitchell [2013], *12 predictive analytics screw-ups*, Computer World
- G.Jurman [2012], *A Comparison of MCC and CEN Error Measures in Multi-Class Prediction*, Plos One
- *Model evaluation: quantifying the quality of predictions*, scikit-learn
- J. Gorodkin [2004], *Comparing two K-category assignments by a K-category correlation coefficient*, Science Direct
- Vihinen M [2004], *How to evaluate performance of prediction methods? Measures and their interpretation in variation effect analysis.*, NCBI
- *Aviation Activity and Forecast*, Toronto Pearson
- *PredPol – Predicting crime through data mining*, Generally Thinking

**Uncertainty**

- S.Bell [1999], *A Beginner’s Guide to Uncertainty of Measurement*, National Physical Laboratory
- C.Smith, *Detecting Anomalies in Your Data Using Benford’s Law*, SUGI
- G.Iaccarino, *Uncertainty Analysis and Optimization*
- D.Kriegman [2001], *Uncertainty*
- N.Yau [2016], *An uncertain spreadsheet for estimates*, Flowing Data
- H.Wainer [2009], *Picturing the uncertain world: how to understand, communicate, and control uncertainty through graphical display*, Information Research
- E.Inglis-Arkell [2014], *How near-complete certainty can make you completely wrong*, io9
- *Almost Sure*, Almost Sure
- N.Yau [2015], *Criminal sentencing and a stat lesson on probabilities and uncertainty*, Flowing Data
- N.Yau [2015], *Lessons in statistical significance, uncertainty, and their role in science*, Flowing Data
- J.Davies [2015], *Why You’re Biased About Being Biased*, nautilus
- *Error and Uncertainty*, Whole Course Items: Error and Uncertainty

**Big Data**

- *Big data in the abstract*, CQADS
- *Big data software*, CQADS
- E. Mcnulty [2014], *Understanding the Big Data: The Seven V's*, Dataconomy
- B.Marr [2014], *Big Data: The 5 Vs Everyone Must Know*, LinkedIn
- B.Marr [2015], *Why only one of the 5 Vs of big data really matters*, IBM
- D. Lawson [2013], *Time for Vendors (and Fundraisers) to Be Big About Big Data*, Working Philanthropy
- J.Hess [2015], *From Police to Pipes: Fresno Leveraging 'Big Data' To Improve City Functions*, NPR For Central California
- *Embracing the Power of Big Data Correlation in Government*, FedTech
- N.Bishop [2015], *Public Sector News: Advancing analytics to transform cities*, IBM
- R.Delgado [2015], *The Big Data Obstacles Faced by Developing Nations*, TECHVIBES
- N.Bishop [2015], *Public Sector News: The ongoing impact of big data and analytics*, IBM
- M.Jeelani [2015], *Chicago uses new technology to solve this very old urban problem*, Fortune
- N.Bishop [2015], *Public Sector News: How analytics is changing our world*, IBM
- B.Howarth [2014], *Big data: how predictive analytics is taking over the public sector*, The Guardian
- A.Jensen et al. [2014], *Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients*, Nature
- M.Chen [2014], *Is ‘Big Data’ Actually Reinforcing Social Inequalities?*, The Nation
- J.Sullivan [2013], *Forget the needle, consider the haystack: Uncovering hidden structures in massive data collections*, Princeton University
- R.Misra [2014], *How does Big Data help us understand the vastness of space? Ask us now!*, io9
- L.Greenemeier [2014], *Why Big Data Isn't Necessarily Better Data*, Scientific American
- A.Newitz [2014], *Here's What You Need to Know About Big Data*, io9
- M.Korolov [2014], *10 big myths about Big Data*, network world
- C.Mims [2014], *Why the only thing better than big data is bigger data*, Quartz
- A.Jensen [2014], *Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients*, Nature Communications
- B.Casselman [2015], *Big Government Is Getting In The Way Of Big Data*, fivethirtyeight

**Do's and Don'ts**

- *Misleading Graph*, Wikipedia
- P.Ford [2014], *Amazing Military Infographics: an appreciation*, The Message
- [2012], *A History of Dishonest Fox Charts*, Media Matters
- Bad Graphs, Tumblr
- E.Klein [2010], *Lies, Damn Lies and the Y axis*, Washington Post
- J.Leek [2012], *The statisticians at Fox News use classic and novel graphical techniques to lead with data*, Simply Statistics
- J.Joyner [2010], *Bad Graphs Mislead More Than 1000 Words*, Outside the Beltway
- K.Drum [2011], *Fun With Graphs: Making the Rich Look Poor*, Mother Jones
- J.Chait [2011], *Does the Middle Class Have All the Money?*, New Republic
- *Obama’s Chief Data Scientist Reveals How the Government Uses Big Data*, Time
- S.Dhillon, *Researchers to study big data collection used on Canadians*, The Globe and Mail
- P.Karon [2015], *Can Big Data Help Government Do Better? This Foundation Thinks So*, Inside Philanthropy
- I.Kottasova [2015], *Europe's big data bombshell: What you need to know*, CNN
- J.Higgins [2015], *Federal Agencies Warming Up to Big Data*, Commerce Times
- J.Higgins [2015], *Federal Investment in Big Data Applications Heads for Liftoff*, Commerce Times
- C.Yiu [2015], *The Big Data Opportunity*, Policy Exchange
- *Denmark plans to preserve illegally collected medical data*, EDRi
- N.Yau [2016], *Bad Data — And Worse Decisions — Poisoned Flint*, Flowing Data
- T.Siegfried [2010], *Odds Are, It's Wrong*, ScienceNews
- *The Problem with Small Sample Sizes*, The Last Behaviorist
- *Misleading Graphs: Real Life Examples*, Statistics How To
- J.Joyner, *Bad Graphs Mislead More Than 1000 Words*, Outside the Beltway
- *Bad Graphs*, Bad Graphs
- D.Shere, H.Groch-begley [2012], *A History Of Dishonest Fox Charts*, MediaMatters
- J. Grohol [2006], *Bad Statistics: USA Today*, psychcentral
- B.Goldacre [2011], *These Guardian / Independent stories are dodgy. Traps in data journalism.*, Bad Science
- R.Parikh [2014], *How to Lie With Data Visualization*, Gizmodo
- *Don’t Let Maps Fool You*, Fake Science
- A.Balliett [2011], *The Do’s And Don’ts Of Infographic Design*, Smashing Magazine
- T.Farrant-Gonzalez [2013], *All That Glitters Is Not Gold: A Common Misconception About Designing With Data*, Smashing Magazine
- N.Veltman [2013], *Avoiding mistakes when cleaning your data*, School of Data
- S.Frankel [2015], *Data Scientists Don’t Scale*, Harvard Business Review
- J.Breaugh [2003], *Effect Size Estimation: Factors to Consider and Mistakes to Avoid*, Journal of Management
- N.Yau [2014], *CSV Fingerprint: Spot errors in your data at a glance*, Flowing Data
- J.Hassell [2014], *3 Mistaken Assumptions About What Big Data Can Do For You*, CIO
- M.Michel [2015], *6 Reasons You Can't Trust Science Anymore*, cracked

**Others**

- T.Elms [2008], *Lexical Distance Among the Languages of Europe*, Etymologikon
- *Thanksgiving in Charts and Graphs*, The Gentleman's Armchair
- Data Cuisine, experimental research on the representation of data with culinary means
- *The online world*, The online world
- J. Harris, *The Periodic Table of Storytelling*
- *Random*, SMBC
- NYtimes
- Atomic Radius
- NASA
- Web elements
- Periodic Table
- Meta-synthesis
- NASA
- Coursera
- Cheezburger
- W.Hickey [2015], FiveThirtyEight
- K.Goldsberry [2015], FiveThirtyEight
- K.Schaul [2015], *The number of ‘mass shootings’ in the U.S. depends on how you count*, The Washington Post
- Nichols [2015], *New Ways To Visualize Shot Suppression*, The Sports Daily
- *SCIENTIFIC ENGLISH APHORISMS*, SCIENTIFIC ENGLISH APHORISMS
- *Popular Quotes*, goodreads
- *Maths Quotes*, sfsu math
- A.Marinus, et al. [2014], *6 Shocking Studies That Prove Science Is Totally Broken*, Cracked
- N.Yau [2014], *Famous Movie Quotes as Charts*, Flowing Data
- N.Silver [2014], *What the Fox Knows*, FiveThirtyEight
- A.Hadhazy [2014], *HOW TO TRICK OTHERS INTO DOING YOUR BIDDING*, Popular Science
- *Air Carrier Traffic at Canadian Airports 2009*, Statistics Canada
- *Air Carrier Traffic at Canadian Airports 2012*, Statistics Canada
- M.Daniels, *The Largest Vocabulary*, Polygraph
- N.Yau [2014], *Distribution of letters in the English language*, Flowing Data
- *Word clouds*, wordle
- *Data Quotes*, Data Quotes
- *13 Really Cool Quotes About Data*, Data Quotes
- *Quotes about Data Science*, Statistics
- *Five Ws*, wikipedia
- *25 Greatest Data Quotes*, Data Quotes
- *Grapefruit*, xkcd
- *Extrapolating*, xkcd
- *No, You're Not Entitled To Your Opinion*, iflscience
- *Crimes Against Hugh's Manatees*, tumblr
- *Steven Pinker’s Sense of Style*, Scientific American
- *QUOTE OF THE DAY*, forbes
- *Quotes About Classification*, Data Quotes
- J.Markoff [2011], *Government aims to build a 'data eye in the sky'*, The New York Times
- A.Berg, *Names and Faces in the News*, UC Berkeley
- K.Poulsen [2014], *How a Math Genius Hacked OkCupid to Find True Love*, wired
- E.Yong [2008], *European genes mirror European geography*, scienceblogs
- *Will Machines Ever Think Like Humans?*, Scientific American
- N.Yau [2014], *Jeopardy! clues data*, Flowing Data
- W.Hickey [2014], *How Data Can Help You Write A Better Screenplay*, fivethirtyeight
- *Proposal*, mitacs
- N.Yau [2014], *A scaled Periodic Table of Elements*, Flowing Data
- K.Trendacosta [2014], *This Linguistic Family Tree Is Simply Gorgeous*, io9
- M.Bertin [2015], *Why Soccer's Most Popular Advanced Stat Kind Of Sucks*, regressing
- C.Aschwanden [2015], *How To Tell Good Studies From Bad? Bet On Them*, five thirty eight
- S.Wolfram [2015], *What Is Spacetime, Really?*, Stephen Wolfram blog
- *How Math Works*, Comics
- *How statisticians changed the war, and the war changed statistics*, The Economist

**Videos**

- grantwoolard,
*Classical Music Mashup*

- D.Arnold, J.Rogness,
*Mobius Transformations Revealed*

- N.Halloran
*The Fallen of World War II*

- N.Yau
*Math of crime and terrorism*

- N.Yau
*Suite of data tools for beginners, focused on fun*

- P.Boily
*The Discovery of Elements*

- T.Lehrer (music), Can YOU sing the elements? (video)
*The Element Song*

- originsX
*Discovery of the Elements - the Movie*

- FiveThirtyEight
*How A Data Scientist Who’d Never Heard Of Basketball Mastered March Madness*

- FiveThirtyEight
*How Data Helped Win The Battle Over Same-Sex Marriage*

- Reason.com
*Prying Open Government: The Sunlight Foundation's Fight for Transparency*

- N.Yau
*Data science, big data, and statistics – all together now*

- Piled Higher and Deeper
*Who owns your data?*

- N.Yau [2016]
*Algorithms for the Traveling Salesman Problem visualized*

- FiveThirtyEight
*How The NYPD Abused Citizens In The Name Of Data, And How One Cop Exposed It*

- R.Vollman [2013]
*NEW TOOL: PLAYER USAGE CHARTS*

- IBMVisualAnalytics [2013]
*The Four Pillars of Effective Visualizations*

- iNTERNSiDEA master's Chanel [2012]
*David McCandless: "The beauty of data visualization"*

- LinkedIn Tech Talks [2012]
*Designing Data Visualizations with Noah Iliinsky*

- Office Videos [2015]
*Welcome to our office: David McCandless, renowned data journalist and speaker*

- N.Yau [2015]
*US boundary evolution*

- N.Yau [2015]
*Sometimes the y-axis doesn’t start at zero, and it’s fine*

- N.Yau [2015]
*Fast image classifications in real-time*

- D.Conway [2011]
*Tidy Data*

- N.Yau [2014]
*Statistical concepts explained through dance*

*Explore visualization features*

- N.Yau [2015]
*White House appoints first US Chief Data Scientist*

- N.Yau [2015]
*Mathematics of love*

- MLSS Sydney 2015
*Bayesian Inference and MCMC with Bob Carpenter*