<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Simply Statistics</title>
    <link>https://simplystatistics.org/index.xml</link>
    <description>Recent content on Simply Statistics</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <copyright>&amp;copy; 2011 - 2020. All rights reserved.</copyright>
    <lastBuildDate>Wed, 31 Mar 2021 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://simplystatistics.org/index.xml" rel="self" type="application/rss+xml" />
    
    <item>
      <title>Streamline - tidy data as a service</title>
      <link>https://simplystatistics.org/2021/03/31/streamline-data-science/</link>
      <pubDate>Wed, 31 Mar 2021 00:00:00 +0000</pubDate>
      
      <guid>https://simplystatistics.org/2021/03/31/streamline-data-science/</guid>
      <description>&lt;p&gt;&lt;em&gt;Tldr: We started a company called Streamline Data Science &lt;a href=&#34;https://streamlinedatascience.io/&#34;&gt;https://streamlinedatascience.io/&lt;/a&gt; that offers tidy data as a service. We are looking for customers, partnerships and employees as we scale up after closing our funding round!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For most of my career, I have worked in the muck of data cleaning. In the world of genomics, a lot of my research has focused on batch effects, synthesizing big genomic data into usable formats, and generally making data easier to use. A couple of years ago, we also started a company called Problem Forward Data Science. Problem Forward offered fractional data science services to a variety of businesses around the country, from very big corporations to startups just getting off the ground. We were asked to do many different types of data work, everything from turning spreadsheets into dashboards to building complicated forecasting models. But no matter the project, whether in government, academia, or industry, we always ended up with the same problem.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We needed to clean the data before we could do the data science.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This will be no surprise to anyone who has worked in data science or analytics, but the state of the data almost always led to setbacks and frustration when we were working with our clients. Customers wanted complex AI, insightful dashboards, or easy reports, but the data just weren’t ready for that yet. And we wasted a huge amount of time cleaning the data over and over again.&lt;/p&gt;

&lt;p&gt;We realized that the most common challenge companies have is that their data processing and management pipelines aren’t ready for analytics. Or as Google so eloquently puts it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&#34;https://research.google/pubs/pub49953/&#34;&gt;“Everyone wants to do the model work, not the data work”&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We realized that this was a service that many businesses needed. They needed someone who could come in and set up a data processing pipeline for them, manage it, and make sure the data were up to date. Some people call this Extract, Load, Transform (ELT), but we found it goes a bit beyond that. It is figuring out what format is most useful for the people who rely on the data and working backward to create a customized data pipeline that gets the data ready to use.&lt;/p&gt;

&lt;p&gt;The ELT pipeline we set up is designed to consistently output &lt;a href=&#34;https://vita.had.co.nz/papers/tidy-data.pdf&#34;&gt;“tidy data”&lt;/a&gt; that makes it easy for our customers to use BI tools like Tableau or Looker and to ingest their data without having to do all the ugly data work that is painful and time-consuming.&lt;/p&gt;
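
&lt;p&gt;As a toy illustration of the kind of reshaping such a pipeline performs (this is not our actual pipeline code), here is a minimal sketch in R using the &lt;strong&gt;tidyverse&lt;/strong&gt; with made-up sales data. The tidy output has one row per observation, ready for a BI tool to consume.&lt;/p&gt;

&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)

# A &amp;quot;messy&amp;quot; table as it might arrive from a source system:
# one row per region, one column per month (made-up numbers).
raw &amp;lt;- tribble(
  ~region, ~jan_sales, ~feb_sales, ~mar_sales,
  &amp;quot;East&amp;quot;,         100,        120,         90,
  &amp;quot;West&amp;quot;,          80,         95,        110)

# Tidy it: one row per region-month observation.
tidy &amp;lt;- raw %&amp;gt;%
  pivot_longer(ends_with(&amp;quot;_sales&amp;quot;),
               names_to = &amp;quot;month&amp;quot;,
               names_pattern = &amp;quot;(.*)_sales&amp;quot;,
               values_to = &amp;quot;sales&amp;quot;)&lt;/code&gt;&lt;/pre&gt;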

&lt;p&gt;We piloted this for one of our startup customers - we built their data pipeline and provided ongoing management, maintenance, and upkeep. When they hired their first data scientist, they were able to quickly create dashboards for their whole business because they already had easy-to-use, tidy data.&lt;/p&gt;

&lt;p&gt;We got so excited about this data dry cleaning idea that we started a new company called &lt;a href=&#34;https://streamlinedatascience.io/&#34;&gt;Streamline Data Science&lt;/a&gt; that solely focuses on tidy data as a service. We just closed our seed round and are now working with our first set of customers to set up their data pipelines. The cool thing is we found that our most excited customers were the ones that already had a data scientist on the team. This seems a little counter-intuitive until you realize that we handle the painful/boring bits of data management so they can focus on the fun part.&lt;/p&gt;

&lt;p&gt;The interesting thing about Streamline is that it isn’t a product. There are a ton of complicated tools out there that you can use to set up your own data pipeline. Streamline is a service that handles all your data issues for you so the data “just works”. It can often be a lot cheaper than building out a full stack data engineering team in house.&lt;/p&gt;

&lt;p&gt;If you are a company that is worried about the state of your data - they are difficult to share, to manage, and to quality control - &lt;a href=&#34;https://streamlinedatascience.io/data-consumers&#34;&gt;we’d love to hear from you&lt;/a&gt;! We would also love to hear from you &lt;a href=&#34;https://streamlinedatascience.io/data-professionals&#34;&gt;if you are a data scientist or analyst&lt;/a&gt; at a company that is frustrated about how much time you are spending on data management and cleaning.&lt;/p&gt;

&lt;p&gt;I’ll write more in the future about how we figured out setting up a data pipeline efficiently and the problems Streamline solves. We will also be releasing our first public data Streamlines that you can play with. In the meantime, I wanted to share how excited I am to finally be working on solving the first mile of data science and building a company that can help Baltimore grow its data science community.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The Four Jobs of the Data Scientist</title>
      <link>https://simplystatistics.org/2020/11/24/the-four-jobs-of-the-data-scientist/</link>
      <pubDate>Tue, 24 Nov 2020 00:00:00 +0000</pubDate>
      
      <guid>https://simplystatistics.org/2020/11/24/the-four-jobs-of-the-data-scientist/</guid>
      <description>

&lt;p&gt;In 2019 I wrote a post about &lt;a href=&#34;https://simplystatistics.org/2019/01/18/the-tentpoles-of-data-science/&#34;&gt;The Tentpoles of Data Science&lt;/a&gt; that tried to distill the key skills of the data scientist. In the post I wrote:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When I ask myself the question “What is data science?” I tend to think of the following five components. Data science is (1) the application of design thinking to data problems; (2) the creation and management of workflows for transforming and processing data; (3) the negotiation of human relationships to identify context, allocate resources, and characterize audiences for data analysis products; (4) the application of statistical methods to quantify evidence; and (5) the transformation of data analytic information into coherent narratives and stories.&lt;/p&gt;

&lt;p&gt;My contention is that if you are a good data scientist, then you are good at all five of the tentpoles of data science. Conversely, if you are good at all five tentpoles, then you’ll likely be a good data scientist.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I still feel the same way about these skills, but my sense now is that the post actually made the job of the data scientist seem easier than it is. This is because it wrapped all of these skills into a single job when in reality data science requires being good at &lt;strong&gt;four&lt;/strong&gt; jobs. In order to explain what I mean by this, we have to step back and ask a much more fundamental question.&lt;/p&gt;

&lt;h2 id=&#34;what-is-the-core-of-data-science&#34;&gt;What is the Core of Data Science?&lt;/h2&gt;

&lt;p&gt;This is a question that everyone is asking and I think struggling to answer. With any field there’s always a distinction between the questions of&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is the core of the field?&lt;/li&gt;
&lt;li&gt;What do people in that field do on a regular basis?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In case it’s not clear, these are not the same question. For example, in Statistics, based on the curricula of most PhD programs, the core of the field involves statistical methods, statistical theory, probability, and maybe some computing. Data analysis is generally not formally taught (i.e. in the classroom), but rather picked up as part of a thesis or research project. Many classes labeled “Data Science” or “Data Analysis” simply teach more methods like machine learning, clustering, or dimension reduction. Formal software engineering techniques are also not generally taught, but in practice are often used.&lt;/p&gt;

&lt;p&gt;One could argue that data analysis and software engineering are things that statisticians &lt;em&gt;do&lt;/em&gt; but are not the core of the field. Whether that is correct or incorrect is not my point. I’m only saying that a distinction has to be made somewhere. Statisticians will always &lt;em&gt;do&lt;/em&gt; more than what would be considered the core of the field.&lt;/p&gt;

&lt;p&gt;With data science, I think we are collectively taking inventory of what data scientists tend to do. The problem is that at the moment it seems to be all over the map. Traditional statistics does tend to be central to the activity, but so does computer science, software engineering, cognitive science, ethics, communication, etc. This is hardly a definition of the core of a field but rather an enumeration of activities.&lt;/p&gt;

&lt;p&gt;The question then is can we define something that &lt;em&gt;all&lt;/em&gt; data scientists do? If we had to teach something to all data science students without knowing where they might end up afterwards, what would it be? My opinion is that at some point, all data scientists have to engage in the &lt;em&gt;basic data analytic iteration&lt;/em&gt;.&lt;/p&gt;

&lt;h2 id=&#34;data-analytic-iteration&#34;&gt;Data Analytic Iteration&lt;/h2&gt;

&lt;p&gt;The basic data analytic iteration comes in four parts. Once a question has been established and a plan for obtaining or collecting data is available, we can do the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Construct a &lt;strong&gt;set of expected outcomes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Design/Build/Apply a &lt;strong&gt;data analytic system&lt;/strong&gt; to the data&lt;/li&gt;
&lt;li&gt;Diagnose any &lt;strong&gt;anomalies&lt;/strong&gt; in the analytic system output&lt;/li&gt;
&lt;li&gt;Make a &lt;strong&gt;decision&lt;/strong&gt; about what to do next&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While this iteration might be familiar or obvious to many, its familiarity masks the complexity involved. In particular, each step of the iteration requires that the data scientist play a different role involving very different skills. It’s like a one-person play where the data scientist has to change costumes when going from one step to the next. This is what I refer to as &lt;em&gt;the four jobs of the data scientist&lt;/em&gt;.&lt;/p&gt;
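
&lt;p&gt;To make the iteration concrete, here is a minimal sketch in R with invented numbers (a sketch, not a prescribed workflow): set a range of expected outcomes, apply a simple analytic system, check for an anomaly, and decide what to do next.&lt;/p&gt;

&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(42)
x &amp;lt;- rnorm(100, mean = 10)  # stand-in for collected data

# 1. Set of expected outcomes: we expect the mean to fall in [9, 11]
expected &amp;lt;- c(low = 9, high = 11)

# 2. Data analytic system: here, simply the sample mean
output &amp;lt;- mean(x)

# 3. Diagnose anomalies: does the output meet expectations?
anomaly &amp;lt;- output &amp;lt; expected[&amp;quot;low&amp;quot;] || output &amp;gt; expected[&amp;quot;high&amp;quot;]

# 4. Decision: proceed, or investigate root causes
if (anomaly) {
  message(&amp;quot;Anomaly: diagnose the data and the analytic system&amp;quot;)
} else {
  message(&amp;quot;Output matches expectations: &amp;quot;, round(output, 2))
}&lt;/code&gt;&lt;/pre&gt;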

&lt;h2 id=&#34;the-four-jobs&#34;&gt;The Four Jobs&lt;/h2&gt;

&lt;p&gt;Each of the steps in the basic data analytic iteration requires the data scientist to be four different people: (1) Scientist; (2) Statistician; (3) Systems Engineer; and (4) Politician. Let’s take a look at each job in greater detail.&lt;/p&gt;

&lt;h3 id=&#34;scientist&#34;&gt;Scientist&lt;/h3&gt;

&lt;p&gt;The scientist develops and asks the question and is responsible for knowing the state of the science and what the key gaps are. The scientist also designs any experiments for collecting new data and executes the data collection. The scientist must work with the statistician to design a system for analyzing the data and ultimately construct a &lt;em&gt;set of expected outcomes&lt;/em&gt; from any analysis of the data being collected.&lt;/p&gt;

&lt;p&gt;The scientist plays a key role in developing the system that results in our set of expected outcomes. Components of this system might include a literature review, meta-analysis, preliminary data, or anecdotal data from colleagues. I use the term “Scientist” broadly here. In other settings this could be a policy-maker or product manager.&lt;/p&gt;

&lt;h3 id=&#34;statistician&#34;&gt;Statistician&lt;/h3&gt;

&lt;p&gt;The statistician, in concert with the scientist, designs the analytic system that will analyze the data generated by any data collection efforts. They specify how the system will operate and what outputs it will generate, and they obtain any resources needed to implement the system. The statistician draws on statistical theory and personal experience to choose the different components of the analytic system that will be applied.&lt;/p&gt;

&lt;p&gt;The statistician’s role here is to apply the data analytic system to the data and to produce the data analytic output. This output could be a regression coefficient, a mean, a plot, or a prediction. But there must be something produced that we can compare to our set of expected outcomes. If the output deviates from our set of expected outcomes, then the next task is to identify the reasons for that deviation.&lt;/p&gt;

&lt;h3 id=&#34;systems-engineer&#34;&gt;Systems Engineer&lt;/h3&gt;

&lt;p&gt;Once the analytic system is applied to the data there are only two possible outcomes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The output meets our expectations, or&lt;/li&gt;
&lt;li&gt;The output does not meet our expectations (an anomaly).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the case of an anomaly, the systems engineer’s responsibility is to diagnose the potential root causes of the anomaly, based on knowledge of the data collection process, the analytic system, and the state of scientific knowledge.&lt;/p&gt;

&lt;p&gt;Recently, Emma Vitz &lt;a href=&#34;https://twitter.com/EmmaVitz/status/1330697959156027392?s=20&#34;&gt;wrote on Twitter&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do you teach debugging to people who are more junior? I feel like it’s such an important skill and yet we seem to have no structured framework for teaching it&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For software and for data analysis alike, the challenge is that bugs or unexpected behavior can originate from anywhere. Any complex system is composed of multiple components, some of which may be your responsibility and many of which are someone else’s. But bugs and anomalies do not respect those boundaries! There may be an issue that occurs in one component that only becomes known when you see the data analytic output.&lt;/p&gt;

&lt;p&gt;So if you are responsible for diagnosing a problem, it is your responsibility to investigate the behavior of each component of the system. If it is something you are not that familiar with, then you need to &lt;em&gt;become&lt;/em&gt; familiar with it, either by learning on your own or (more likely) talking to the person who is in fact responsible.&lt;/p&gt;

&lt;p&gt;A common source of unexpected behavior in data analytic output is the data collection process, but the statistician who analyzes the data may not be responsible for that aspect of the project. Nevertheless, the systems engineer who identifies an anomaly has to go back through and talk to the statistician and the scientist to figure out exactly how each component works.&lt;/p&gt;

&lt;p&gt;Ultimately, the systems engineer is tasked with taking a broad view of all the activities that affect the output from a data analysis in order to identify any deviations from what we would expect. Once those root causes have been explained, we can then move on to decide how we should act on this new information.&lt;/p&gt;

&lt;h3 id=&#34;politician&#34;&gt;Politician&lt;/h3&gt;

&lt;p&gt;The politician’s job is to make decisions while balancing the needs of the various constituents to achieve a reasonable outcome. Most statisticians and scientists that I know would recoil at the idea of being considered a politician or that politics in any form would play a role in doing any sort of science. However, my thinking here is a bit more basic: In any data analysis iteration, we are constantly making decisions about what to do, keeping in mind a variety of conflicting factors. In order to resolve these conflicts and come to a reasonable agreement, one has to engage a key skill, which is negotiation.&lt;/p&gt;

&lt;p&gt;At various stages of the data analytic iteration the politician must negotiate about (1) the definition of success in the analysis; (2) resources for executing the analysis; and (3) the decision for what to do after we have seen the output from the analytic system and have diagnosed the root causes of any anomalies. Decisions about what to do next fundamentally involve factors outside the data and the science.&lt;/p&gt;

&lt;p&gt;Politicians have to identify who the stakeholders of the problem are and what it is that they ultimately &lt;em&gt;want&lt;/em&gt; (as opposed to what their &lt;em&gt;position&lt;/em&gt; is). For example, an investigator might say “We need a p-value less than 0.05”. That’s their position. But what they &lt;em&gt;want&lt;/em&gt; is more likely “a publication in a high profile journal”. Another example might be one investigator who needs to meet a tight publication deadline while another wants to run a time-consuming (but more robust) analysis. Clearly, the positions conflict, but arguably both investigators share the same goal, which is a rigorous high-impact publication.&lt;/p&gt;

&lt;p&gt;Identifying positions versus underlying needs is a key task in negotiating a reasonable outcome for everyone involved. Rarely, in my experience, does this process have to do with the data (although data may be used to make certain arguments). The dominating elements of this process tend to be the nature of relationships between each constituent and the constraints on resources (such as time).&lt;/p&gt;

&lt;h2 id=&#34;applying-the-iteration&#34;&gt;Applying the Iteration&lt;/h2&gt;

&lt;p&gt;If you’re reading this and find yourself saying “I’m not an X” where X is either scientist, statistician, systems engineer, or politician, then chances are that is where you are weak at data science. I think a good data scientist has to have some skill in each of these domains in order to be able to complete the basic data analytic iteration.&lt;/p&gt;

&lt;p&gt;In any given analysis, the iteration may be applied anywhere from once to dozens if not hundreds of times. If you’ve ever made two plots of the same dataset, you’ve likely done two iterations. While the exact details and frequency of the iterations may vary widely across applications, the core nature and the skills involved do not change much.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Palantir Shows Its Cards</title>
      <link>https://simplystatistics.org/2020/08/26/palantir-shows-its-cards/</link>
      <pubDate>Wed, 26 Aug 2020 00:00:00 +0000</pubDate>
      
      <guid>https://simplystatistics.org/2020/08/26/palantir-shows-its-cards/</guid>
      <description>

&lt;p&gt;File this under long-term followup, but just about four years ago &lt;a href=&#34;https://simplystatistics.org/2016/05/11/palantir-struggles/&#34;&gt;I wrote about Palantir&lt;/a&gt;, the previously secretive but now soon to be public data science company, and how its valuation was a commentary on the value of data science more generally. Well, just recently Palantir filed to go public and therefore submitted a &lt;a href=&#34;https://www.sec.gov/Archives/edgar/data/1321655/000119312520230013/d904406ds1.htm#rom904406_9&#34;&gt;registration statement (S-1)&lt;/a&gt; describing its business. It&amp;rsquo;s a fascinating read, if you&amp;rsquo;re into that kind of stuff.&lt;/p&gt;

&lt;p&gt;But the important thing is that Palantir itself summarized the question I asked more than 4 years ago. In their enumeration of risk factors, one risk factor they highlight is&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If our customers are not able or willing to accept our &lt;strong&gt;product-based business model&lt;/strong&gt;, instead of a &lt;strong&gt;labor-based business model&lt;/strong&gt;, our business and results of operations could be negatively impacted. [emphasis added]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In my original post I wrote about the &amp;ldquo;Data Science Spectrum&amp;rdquo;, which has consulting on one end and software on the other.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;./post/2020-08-26-palantir-shows-its-cards_files/DS_Spectrum2.png&#34; alt=&#34;Data Science Spectrum&#34; /&gt;&lt;/p&gt;

&lt;p&gt;The point of the diagram was that businesses on the right hand side have huge valuations while businesses on the left side have merely large valuations. The people running Palantir clearly understand this and are trying to push the company in a software-based productized direction.&lt;/p&gt;

&lt;p&gt;Here&amp;rsquo;s the rest of their summary of this risk factor:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Our platforms are generally offered on a productized basis to minimize our customers’ overall cost of acquisition, maintenance, and deployment time of our platforms. Many of our customers and potential customers are instead generally familiar with the practice of purchasing or licensing software through labor contracts, where custom software is written for specific applications, the intellectual property in such software is often owned by the customer, and the software typically requires additional labor contracts for modifications, updates, and services during the life of that specific software. Customers may be unable or unwilling to accept our model of commercial software procurement. Should our customers be unable or unwilling to accept this model of commercial software procurement, our growth could be materially diminished, which could adversely impact our business, financial condition, results of operations, and growth prospects.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Those of us who do data analysis for a living already know this to be true. Custom consulting is not scalable, and therefore, not as valuable as a piece of boxed up software, which is infinitely scalable.&lt;/p&gt;

&lt;h2 id=&#34;show-me-the-numbers&#34;&gt;Show Me The Numbers&lt;/h2&gt;

&lt;p&gt;So, how is Palantir doing?&lt;/p&gt;

&lt;p&gt;At first glance it seems they&amp;rsquo;re doing pretty well. Their gross profit (Revenue - Cost of Revenue) suggests a gross margin of about 72% for 2018 and 67% for 2019, which both seem high. These gross margin percentages are in software company territory. (For comparison, Facebook&amp;rsquo;s runs around 80%.) This suggests that each dollar of Palantir&amp;rsquo;s revenue does not have a lot of direct costs associated with it.&lt;/p&gt;
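
&lt;p&gt;For reference, gross margin is just (Revenue - Cost of Revenue) / Revenue. Here is a quick sketch in R using round figures of my own choosing that roughly reproduce the percentages above (the exact numbers are in the S-1):&lt;/p&gt;

&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Illustrative figures in $ millions, chosen to approximate the
# margins quoted above; consult the S-1 for the actual numbers.
revenue &amp;lt;- c(`2018` = 595, `2019` = 743)
cost_of_revenue &amp;lt;- c(`2018` = 168, `2019` = 242)

gross_margin &amp;lt;- (revenue - cost_of_revenue) / revenue
round(100 * gross_margin)  # roughly 72 and 67&lt;/code&gt;&lt;/pre&gt;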

&lt;p&gt;But ultimately, Palantir has posted net losses every year, indicating there are significant indirect costs to generating that revenue. In particular, their Sales and marketing costs almost equal their entire gross profit. Reading through the S-1, this is not surprising. Palantir itself admits that&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Our sales efforts involve considerable time and expense and our sales cycle is often long and unpredictable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Alas, there is some consulting to do after all. My guess is that much of the up front &amp;ldquo;sales&amp;rdquo; work comes down to Palantir having to&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Figure out a customer&amp;rsquo;s problem and what question they&amp;rsquo;re asking;&lt;/li&gt;
&lt;li&gt;Figure out how a customer&amp;rsquo;s data are organized;&lt;/li&gt;
&lt;li&gt;Figure out how to adapt their existing software products to the customer&amp;rsquo;s specific situation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This should sound familiar to seasoned data scientists. Indeed, this is &lt;strong&gt;almost all the work&lt;/strong&gt; of the data scientist. This is expensive because it requires humans to do it and there&amp;rsquo;s typically not much to generalize from customer to customer. Implementing the software and deploying it is work too, but is often more straightforward, and there are often existing solutions that can be employed.&lt;/p&gt;

&lt;h2 id=&#34;the-road-to-profits&#34;&gt;The Road to Profits&lt;/h2&gt;

&lt;p&gt;So here&amp;rsquo;s the problem: Palantir&amp;rsquo;s route to profitability involves making these costs go down&amp;hellip;a lot. Maybe not to zero, but substantially, because each new customer&amp;mdash;with their different problems and different data&amp;mdash;costs a lot to acquire. If they can do this, they&amp;rsquo;ve cracked the nut of data science scalability.&lt;/p&gt;

&lt;p&gt;Another big expense is Research and Development, which the company describes as developing new methods for data analysis (machine learning tools, etc.). While it&amp;rsquo;s nice to have room to do open-ended research on new data science tools, my guess is that this line item goes down a lot in the near future, as it often does at companies that start off with large R&amp;amp;D budgets. Ultimately, it would save Palantir ~$300 million a year.&lt;/p&gt;

&lt;p&gt;See you in another four years?&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Asymptotics of Reproducibility</title>
      <link>https://simplystatistics.org/2020/04/30/asymptotics-of-reproducibility/</link>
      <pubDate>Thu, 30 Apr 2020 00:00:00 +0000</pubDate>
      
      <guid>https://simplystatistics.org/2020/04/30/asymptotics-of-reproducibility/</guid>
      <description>&lt;p&gt;Every once in a while, I see a tweet or post that asks whether one should use tool X or software Y in order to “make their data analysis reproducible”. I think this is a reasonable question because, in part, there are so many good tools out there! This is undeniably a good thing and quite a contrast to just 10 years ago when there were comparatively few choices.&lt;/p&gt;

&lt;p&gt;The question of toolset, though, is not worth focusing on too much because it’s the wrong question to ask. Of course, you should choose a tool/software package that is reasonably usable by a large percentage of your audience. But the toolset you use will not determine whether your analysis is reproducible in the long run.&lt;/p&gt;

&lt;p&gt;I think of the choice of toolset as kind of like asking “Should I use wood or concrete to build my house?” Regardless of what you choose, once the house is built, it will degrade over time without any deliberate maintenance. Just ask any homeowner! Sure, some materials will degrade slower than others, but the slope is definitely down.&lt;/p&gt;

&lt;p&gt;Discussions about tooling around reproducibility often sound a lot like “What material should I use to build my house so that it &lt;em&gt;never&lt;/em&gt; degrades?” Such materials do not exist, and similarly, toolsets do not exist to make your analysis permanently reproducible.&lt;/p&gt;

&lt;p&gt;I’ve been reading some of the old web sites from Jon Claerbout’s group at Stanford (thanks to the Internet Archive), the home of some of the original writings about reproducible research. At the time (early 90s), the work was distributed on &lt;a href=&#34;https://web.archive.org/web/20070911061035/http://sepwww.stanford.edu/research/redoc/cdvswww.html&#34;&gt;CD-ROMs&lt;/a&gt;, which totally makes sense given that CDs could store lots of data, were relatively compact and durable, and could be mailed or given to other people without much concern about compatibility. The internet was not quite a thing yet, but it was clearly on the horizon.&lt;/p&gt;

&lt;p&gt;But ask yourself this: If you held one of those CD-ROMs in your hand right now, would you consider that work reproducible? Technically, yes, but I don’t even have a CD-ROM reader in my house, so I couldn’t actually read the data. And a larger problem is that a CD from the 90s has probably degraded to the point of being unreadable anyway.&lt;/p&gt;

&lt;p&gt;Claerbout’s group obviously knew about the web and were transitioning in that direction, but such a transition costs money. As does keeping a keen eye on emerging trends and technology usage.&lt;/p&gt;

&lt;p&gt;Hilary Parker and I &lt;a href=&#34;http://nssdeviations.com/106-podcasts-are-a-process&#34;&gt;recently discussed&lt;/a&gt; how the economics of academic research are not well-suited to support the reproducibility of scientific results. The traditional model is that a research grant pays for the conduct of research over a 3-5 year period, after which the grant is finished and there is no more funding. During (or after) that time, scientific results are published. While the funding can be used to prepare materials (data, software, and code) to make the published findings reproducible at the instant of publication, there is no funding afterwards for dealing with two key tasks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ensuring that the work &lt;em&gt;continues&lt;/em&gt; to be reproducible given changes to the software and computing environment (maintenance)&lt;/li&gt;
&lt;li&gt;Fielding questions or inquiries from others interested in reproducing the results or in building upon the published work (support)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These two activities (maintenance and support) can continue to be necessary in perpetuity for &lt;em&gt;every study&lt;/em&gt; that an investigator publishes. The mismatch between how the grant funding system works and the requirements of reproducible research is depicted in the diagram below.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;./post/2020-04-30-asymptotics-of-reproducibility_files/Drawings-1.jpg&#34; alt=&#34;Analysis Depreciation&#34; /&gt;&lt;/p&gt;

&lt;p&gt;When I say “value” in the drawing above, what I really mean is the “reproducibility value”. In the old model of publishing science, there was no reproducibility value because the work was generally not reproducible in the sense of data and code being made available. Hence, this whole discussion would be moot.&lt;/p&gt;

&lt;p&gt;Traditional paper publications held their value because the text on the page did not generally degrade much over time and copies could easily be made. Scientists did have to field the occasional question about the results but it was not the same as maintaining access to software and datasets and answering technical questions therein. As a result, the traditional economic model for funding academic research really did match the manner in which research was conducted and then published. Once the results were published, the maintenance and support costs were nominal and did not really need to be paid for explicitly.&lt;/p&gt;

&lt;p&gt;Fast forward to today and the economic model has not changed but the “business” of academic research has. Now, every publication has data and code/software attached to it which come with maintenance and support costs that can extend for a substantial period into the future. While any given publication may not require significant maintenance and support, the costs for an investigator’s publications &lt;em&gt;in aggregate&lt;/em&gt; can add up very quickly. Even a single paper that turns out to be popular can take up a lot of time and energy.&lt;/p&gt;

&lt;p&gt;If you play this movie to the end, it becomes soberingly clear that reproducible research, from an &lt;em&gt;economic&lt;/em&gt; stand point, is not really sustainable. To see this, it might help to use an analogy from the business world. Most businesses have capital costs, where they buy large expensive things &amp;ndash; machinery, buildings, etc. These things have a long life, but are thought to degrade over time (accountants call it depreciation). As a result, most businesses have “maintenance capital expenditure” costs that they report to show how much money they are investing every quarter to keep their equipment/buildings/etc. up to shape. In this context, the capital expenditure is worth it because every new building or machine that is purchased is designed to ultimately produce more &lt;em&gt;revenue&lt;/em&gt;. As long as the revenue generated exceeds the cost of maintenance, the capital costs are worth it (not to oversimplify or anything!).&lt;/p&gt;
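
&lt;p&gt;As a toy version of the depreciation analogy (all numbers invented), here is a short sketch in R of how the reproducibility value of a published analysis might decay over time unless maintenance is continually paid for:&lt;/p&gt;

&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;years &amp;lt;- 0:10
decay &amp;lt;- 0.25  # invented: fraction of reproducibility value lost per year

# Without post-grant maintenance, value depreciates steadily...
value_unmaintained &amp;lt;- (1 - decay)^years

# ...while keeping it flat requires paying a maintenance cost every year.
value_maintained &amp;lt;- rep(1, length(years))

plot(years, value_unmaintained, type = &amp;quot;b&amp;quot;, ylim = c(0, 1),
     xlab = &amp;quot;Years since publication&amp;quot;, ylab = &amp;quot;Reproducibility value&amp;quot;)
lines(years, value_maintained, type = &amp;quot;b&amp;quot;, col = &amp;quot;blue&amp;quot;)&lt;/code&gt;&lt;/pre&gt;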

&lt;p&gt;In academia, each new publication incurs some maintenance and support costs to ensure reproducibility (the “capital expenditure” here) but it’s unclear how much more “revenue” each new publication brings in to offset those costs. Sure, more publications allow one to expand the lab or get more grant funding or hire more students/postdocs, but I wouldn’t say that’s universally true. Some fields are just constrained by how much total funding there is and so the available funding cannot really be increased by “reaching more customers”. Given that the budgets for funding agencies (at least in the U.S.) have barely kept up with inflation and the number of publications increases every year, it seems the goal of making all research reproducible is simply not economically supportable.&lt;/p&gt;

&lt;p&gt;I think we have to concede that at any given moment in time, there will always be some fraction of published research for which there is no maintenance or support for reproducibility. Note that this doesn’t mean that people don’t publish their data and code (they should still do that!), it just means they don’t support or maintain it. The only question is: which fraction should &lt;em&gt;not&lt;/em&gt; be supported or maintained? Most likely, it will be older results where the investigators simply cannot keep up with maintenance and support. However, it might be worth coming up with a more systematic approach to determining which publications need to maintain their reproducibility and which don’t.&lt;/p&gt;

&lt;p&gt;For example, it might be more important to maintain the reproducibility of results from huge studies that cannot be easily replicated independently. However, for a small study conducted a decade ago that has subsequently been replicated many times, we can probably let that one go. But this isn’t the only approach. We might want to preserve the reproducibility of studies that collect unique datasets that are difficult to re-collect. Or we might want to consider term-limits on reproducibility, so an investigator commits to maintaining and supporting the reproducibility of a finding for say, 5 years, after which either the maintenance and support is dropped or longer-term funding is obtained. This doesn’t necessarily mean that the data and code suddenly disappear from the world; it just means the investigator is no longer committed to supporting the effort.&lt;/p&gt;

&lt;p&gt;Reproducibility of scientific research is of critical importance, perhaps now more than ever. However, we need to think harder about how we can support it in both the short- and long-term. Just assuming that the maintenance and support costs of reproducibility for every study are merely nominal is not realistic and simply leads to investigators not supporting reproducibility as a default.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Amplifying people I trust on COVID-19</title>
      <link>https://simplystatistics.org/2020/04/29/amplifying-people-i-trust-on-covid-19/</link>
      <pubDate>Wed, 29 Apr 2020 00:00:00 +0000</pubDate>
      
      <guid>https://simplystatistics.org/2020/04/29/amplifying-people-i-trust-on-covid-19/</guid>
      <description>

&lt;p&gt;Like a lot of people, I&amp;rsquo;ve been glued to various media channels trying to learn about the latest with what is going on with COVID-19. I have also been frustrated - like a lot of people - with misinformation and the deluge of preprints and peer reviewed material. Some of this information is critically important and some is hard to trust.&lt;/p&gt;

&lt;p&gt;As a biostatistician at a very visible school of public health I have also had a number of inquiries from the media, but I&amp;rsquo;ve been hesitant to do any interviews or talk about COVID-19. The reason is that even though I have a PhD in Biostatistics and I work in a School of Public Health, I actually know very little about infectious disease modeling and response. I think if you aren&amp;rsquo;t really deep in the field, it&amp;rsquo;s difficult to know the difference between someone like me and someone with real expertise.&lt;/p&gt;

&lt;p&gt;While I&amp;rsquo;m not an expert in the area, I know many of the real experts professionally or by reputation. So I thought I&amp;rsquo;d make a brief list of people and organizations I find credible and have been following for good information in case it is helpful to others. Many of these folks have already been found by audiences much bigger than ours, but I thought it would be useful to further amplify their work.&lt;/p&gt;

&lt;h3 id=&#34;paper-review&#34;&gt;Paper review&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://twitter.com/JHSPH_NCRC&#34;&gt;JHU Novel Coronavirus Research Compendium&lt;/a&gt; - Hopkins experts rapidly reviewing preprints and peer reviewed literature to find the gems.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&#34;infectious-disease-modeling&#34;&gt;Infectious disease modeling&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://twitter.com/trvrb&#34;&gt;Trevor Bedford&lt;/a&gt; - Fred Hutchinson Cancer Research Center expert in phylogenetic modeling of infectious disease; his viz work and sober analysis are among my go-tos.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://twitter.com/JustinLessler&#34;&gt;Justin Lessler&lt;/a&gt; - infectious disease professor and epidemiologist at Hopkins who did some of the earliest studies of contact tracing in China.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://twitter.com/KateGrabowski&#34;&gt;Kate Grabowski&lt;/a&gt; - infectious disease professor and epidemiologist at Hopkins&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://twitter.com/reichlab&#34;&gt;Nicholas Reich&lt;/a&gt; - UMass expert in infectious disease modeling, doing a great job of aggregating and evaluating disease models.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://twitter.com/nataliexdean&#34;&gt;Natalie Dean&lt;/a&gt; - University of Florida expert statistician in vaccine clinical trials - also one of my favorite pragmatic reviewers of big papers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&#34;vaccine-development&#34;&gt;Vaccine development&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://twitter.com/dereklowe&#34;&gt;Derek Lowe&lt;/a&gt; - drug discovery chemist and blogger who is one of the best out there at distilling progress on vaccines.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&#34;scicom-and-public-outreach&#34;&gt;Scicom and public outreach&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://twitter.com/EpiEllie&#34;&gt;Ellie Murray&lt;/a&gt; - Boston University epidemiology professor and expert communicator, providing clear, understandable breakdowns of the best practices.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://twitter.com/LucyStats&#34;&gt;Lucy D&amp;rsquo;Agostino McGowan&lt;/a&gt; - Vanderbilt statistics professor and communicator who does an amazing job of breaking down difficult stats and causal inference issues.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://twitter.com/CT_Bergstrom&#34;&gt;Carl Bergstrom&lt;/a&gt; - UW Biology Professor and infectious disease expert, providing sober reviews and interactions around many of the papers coming out.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&#34;policy&#34;&gt;Policy&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://twitter.com/T_Inglesby&#34;&gt;Tom Inglesby&lt;/a&gt; - Professor and director of the Johns Hopkins Center for Health Security, has been producing solid analysis and policy recommendations on when to re-open.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://twitter.com/cmyeaton&#34;&gt;Caitlin Rivers&lt;/a&gt; - Professor at the Hopkins Center for Health Security, outbreak specialist, also producing solid analysis and policy recommendations.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://twitter.com/ASlavitt&#34;&gt;Andy Slavitt&lt;/a&gt; - Ex-Obama health care head, providing solid policy reviews and ideas.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://twitter.com/drJoshS&#34;&gt;Josh Sharfstein&lt;/a&gt; - Professor of the Practice at Johns Hopkins Bloomberg School of Public Health, has a great public health podcast with lots of experts on it.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://twitter.com/DR_KMP&#34;&gt;Keshia Pollack-Porter&lt;/a&gt; - Professor of Health Policy and Management at Johns Hopkins Bloomberg School of Public Health who has a great take on mobility issues associated with Covid-19.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://twitter.com/LisaCooperMD&#34;&gt;Lisa Cooper&lt;/a&gt; - Bloomberg Professor at the Johns Hopkins Bloomberg School of Public Health who has great content on inequality of impact.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I&amp;rsquo;m sure I&amp;rsquo;ve missed great people as I&amp;rsquo;ve dashed this off pretty quickly, so apologies if I missed you!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Is Artificial Intelligence Revolutionizing Environmental Health?</title>
      <link>https://simplystatistics.org/2019/12/04/is-artificial-intelligence-revolutionizing-environmental-health/</link>
      <pubDate>Wed, 04 Dec 2019 00:00:00 +0000</pubDate>
      
      <guid>https://simplystatistics.org/2019/12/04/is-artificial-intelligence-revolutionizing-environmental-health/</guid>
      <description>

&lt;p&gt;&lt;em&gt;NOTE: This post was written by Kevin Elliott, Michigan State University; Nicole Kleinstreuer, National Institutes of Health; Patrick McMullen, ScitoVation; Gary Miller, Columbia University; Bhramar Mukherjee, University of Michigan; Roger D. Peng, Johns Hopkins University; Melissa Perry, The George Washington University; Reza Rasoulpour, Corteva Agriscience, and Elizabeth Boyle, National Academies of Sciences, Engineering, and Medicine. The full summary for the workshop on which this post is based can be obtained &lt;a href=&#34;https://www.nap.edu/catalog/25520/leveraging-artificial-intelligence-and-machine-learning-to-advance-environmental-health-research-and-decisions&#34;&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;On June 6 and 7, 2019, the National Academies of Sciences, Engineering, and Medicine (NASEM) hosted a workshop on the use of artificial intelligence (AI) in the field of &lt;a href=&#34;http://nas-sites.org/emergingscience/meetings/ai/&#34;&gt;Environmental Health&lt;/a&gt;. Rapid advances in machine learning are demonstrating the ability of machines to carry out repetitive “smart” tasks requiring discrete judgments. Machine learning algorithms are now being used to analyze large volumes of complex data to find patterns and make predictions, often exceeding the accuracy and efficiency of people attempting the same task. Driven by tremendous growth in data availability as well as computing power and accessibility, artificial intelligence and machine learning applications are rapidly growing in various sectors of society, including retail, with prediction of consumer purchases; the automotive industry, with self-driving cars; and health care, with advances in automated medical diagnoses.&lt;/p&gt;

&lt;p&gt;Building upon the major themes of the NASEM workshop, in this blog post we address the following questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;How might AI advance environmental health?&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Does AI change the standards used for conducting environmental health research?&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Does the use of AI allow us to change our established research principles?&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;How does AI impact our training programs for the next generation of environmental health scientists?&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Are there barriers within the current academic incentive structures that are hindering the full potential of AI, and how might those barriers be overcome?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&#34;how-might-ai-advance-environmental-health&#34;&gt;How might AI advance environmental health?&lt;/h2&gt;

&lt;p&gt;Environmental health is the study of how the environment affects human health. Due to the complexity of both human biology and the multiplicity of environmental factors that we encounter daily, studying environmental impacts on human health presents many data challenges. Due to the data boom of recent years, we now have a multitude of individualized data, including genetic sequencing and wearable health and activity monitors. We have also seen exponential growth in the availability of data on individual environmental exposures. Wearable sensors and personal chemical samplers are allowing for more detailed exposure models, while advancements in exposure biomonitoring in a variety of matrices, including blood and urine, are giving more granular detail about actual chemical body burdens. We have also seen an increase in available population-level data on dietary factors, the social and built environment, climate, and many other variables affected by environmental and genetic factors. Concurrently, while population data are booming, toxicology is creating a variety of experimental models to advance our understanding of how chemicals and environmental exposures may pose risks to human health. Large-scale high-throughput chemical safety screening efforts can now generate data on tens of thousands of chemicals in thousands of biological targets. Integrating these diverse data streams represents a new level of complexity.&lt;/p&gt;

&lt;p&gt;AI and machine learning provide many opportunities to make this complexity more manageable, such as highly accurate prediction methods to better assess exposures and flexible approaches to allow incorporation of exposure to complex mixtures in population health analyses. Incorporating artificial intelligence and machine learning methods in environmental health research offers the potential to transform how we analyze environmental exposures and our understanding of how these myriad factors influence our health and contribute to disease.&lt;/p&gt;

&lt;h2 id=&#34;does-ai-change-the-standards-used-for-conducting-environmental-health-research&#34;&gt;Does AI change the standards used for conducting environmental health research?&lt;/h2&gt;

&lt;p&gt;While we think the use of AI and machine learning techniques clearly holds great promise for the advancement of environmental health research, we also believe such techniques introduce new challenges and magnify existing ones. While the major standards by which we conduct scientific research do not change, our ability to adhere to them will require some adaptation. Transparency and repeatability are key. We must ensure that the computational reproducibility and replicability of our scientific findings do not suffer at the hands of complex algorithms and poorly assembled data pipelines. Complex data analyses that incorporate more diverse data types from varied sources stretch our ability to track, curate, and validate these data without robust data curation tools. Although some data curation tools that establish standard approaches for creating, managing, and maintaining data are available, they are usually field-specific, and currently there are no incentives or strict requirements to ensure that investigators use them.&lt;/p&gt;

&lt;p&gt;Machine learning and artificial intelligence algorithms have demonstrated themselves to be very powerful. At the same time, we also recognize their complexity and general opacity can be cause for concern. While investigators may be willing to overlook the opacity of these algorithms when predictions are highly accurate and precise, all is well until it isn’t. When an algorithm does not work as expected, it is critical to know why it didn’t work. With transparency and reproducibility of utmost importance, the use of machine learning algorithms must ensure that investigators and data analysts remain accountable for their analyses and that regulators have confidence in applying AI-generated results to inform public health decisions.&lt;/p&gt;

&lt;h2 id=&#34;does-the-use-of-ai-allow-us-to-change-our-established-research-principles&#34;&gt;Does the use of AI allow us to change our established research principles?&lt;/h2&gt;

&lt;p&gt;AI does not change established research principles such as sound study designs and understanding threats of bias. However, there is a need to create updated guidelines and implement best practices for choosing, cleaning, structuring, and sharing the data used in AI applications. Creating appropriate training datasets, engaging in ongoing processes of validation, and assessing the domain of applicability for the models that are generated are also important. As in all areas of science, it is crucial to clarify whether models solely provide accurate predictions or whether they also provide understanding of relevant mechanisms. The current Open Science movement’s emphasis on transparency is particularly relevant to the use of AI and machine learning. Users of these methods in environmental health should be looking for ways to be open about the model training data, to clarify validation methods, to create interpretable “models of the models” where possible, and to clarify their domains of applicability. Recent innovations like model cards (short documents that go alongside machine learning models to share information that everyone impacted by the model should know) are one way model developers can communicate their models’ strengths and weaknesses accessibly.&lt;/p&gt;

&lt;h2 id=&#34;how-does-ai-impact-our-training-programs-for-the-next-generation-of-environmental-health-scientists&#34;&gt;How does AI impact our training programs for the next generation of environmental health scientists?&lt;/h2&gt;

&lt;p&gt;As complex AI methods are increasingly applied to environmental health research, it is important to consider effective training of the workforce and its future leaders. Currently, training in the application of data science is unstandardized, as trainees learn how to apply methods to a specific research application through an apprenticeship-type model, where a trainee works with a mentor. Classroom training standardizes theory and methods, but the mentor teaches the fine details of analyzing data in a specific research area, which introduces heterogeneity into the ways in which scientists analyze data. The lack of training standards leads to a worry that analysts may apply cutting-edge computational/algorithmic approaches to data analysis without consideration of fundamental biostatistical and epidemiologic principles, such as statistical design, sampling, and inference.
Fundamental questions taught in biostatistics and epidemiology courses, such as &amp;ldquo;Who is in my sample?&amp;rdquo; and &amp;ldquo;What is my target population of inference?&amp;rdquo; are even more relevant in our current era of algorithms and machine learning. Now analysts are agnostically querying databases not designed for population-based research, such as electronic health records, medical claims, Twitter, Facebook, and Google searches, for new discoveries in environmental health. It is important to recognize that a lack of proper consideration of issues related to sampling, selection bias, correlation of multiple exposures, and exposure and outcome misclassification could lead to erroneous results and false conclusions. Training programs will need to evolve so that we do not just teach scientists and analysts how to program models and interpret their results, but also emphasize how to recognize human biases that can be inadvertently built into the data and model approaches, and the continuous need for rigor, responsibility, and reproducibility.&lt;/p&gt;

&lt;p&gt;An increased focus on mathematical theory may also improve training in the application of AI to environmental health. A greater effort in developing standardized theory about how and why a specific research area analyses data in a certain way may help adapt approaches from one research area to another. In addition, deeper mathematical exploration of AI methods will help data scientists understand when and why AI methods work well, and when they don’t.&lt;/p&gt;

&lt;h2 id=&#34;are-there-barriers-within-the-current-academic-incentive-structures-that-are-hindering-the-full-potential-of-ai-and-how-might-those-barriers-be-overcome&#34;&gt;Are there barriers within the current academic incentive structures that are hindering the full potential of AI, and how might those barriers be overcome?&lt;/h2&gt;

&lt;p&gt;Rigorous data science requires a team science approach to achieve a variety of functions such as developing algorithms, formalizing common data platforms and testing protocols, and properly maintaining and curating data sources. Over recent decades, we have witnessed how the power of team science has improved the understanding of critical health problems of our time, such as unlocking the human genome and achieving major advancements in cancer treatment. These advances have demonstrated the payoff of interdisciplinary, transdisciplinary, and multidisciplinary investigations. Despite these successes, there are still barriers to large team science projects, because these projects often have goals that do not sit precisely within a single funding agency. In order for AI to truly advance environmental health, federal agencies and institutions that fund environmental health research need to create pathways to support large multi-disciplinary and multi-institutional teams that are conducting this research. An example could be a multi-agency/multi-institute funding consortium. A ten-year investment in a well-coordinated initiative that harnesses AI data opportunities could accelerate new findings in not only the environmental causes of disease, but also in informing interventions that can prevent environmentally mediated disease and improve population health.&lt;/p&gt;

&lt;h2 id=&#34;final-thoughts&#34;&gt;Final thoughts&lt;/h2&gt;

&lt;p&gt;We believe machine learning and AI methods have tremendous potential but we also believe they cannot be used in a way that overlooks limitations or relaxes data integrity standards. With these considerations in mind, we have tempered enthusiasm for the promises of these approaches. We have to make sure that environmental health scientists stay out in front of these considerations to avoid potential pitfalls such as the allure of hype or chasing after the next new thing because it is novel rather than truly meaningful. We can do this by fostering ongoing conversations about the challenges and opportunities AI provides for environmental health research. An intentional union of the two cultures of careful (and often overly cautious) stochastic modeling and bold (and often overly optimistic) algorithmic modeling can help ensure that we do not abandon principles of proper study design when a new technology comes along, but instead explore how to use the new technology to better understand the myriad ways the environment affects health and disease.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>You can replicate almost any plot with R</title>
      <link>https://simplystatistics.org/2019/08/28/you-can-replicate-almost-any-plot-with-ggplot2/</link>
      <pubDate>Wed, 28 Aug 2019 00:00:00 +0000</pubDate>
      
      <guid>https://simplystatistics.org/2019/08/28/you-can-replicate-almost-any-plot-with-ggplot2/</guid>
<description>&lt;p&gt;Although R is great for quickly turning data into plots, it is not widely used for making publication-ready figures. But, with enough tinkering you can make almost any plot in R. For examples check out the &lt;a href=&#34;https://flowingdata.com/&#34;&gt;flowingdata blog&lt;/a&gt; or the &lt;a href=&#34;https://serialmentor.com/dataviz/index.html&#34;&gt;Fundamentals of Data Visualization book&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here I show five charts from the lay press that I use as examples in my data science courses. In the past I would show the originals, but I decided to replicate them in R to make it possible to generate class notes with just R code (there was a lot of googling involved).&lt;/p&gt;
&lt;p&gt;Below I show the original figures followed by R code and the version of the plot it produces. I used the &lt;strong&gt;ggplot2&lt;/strong&gt; package but you can achieve similar results using other packages or even just with R-base. Any recommendations on how to improve the code or links to other good examples are welcome. Please add to the comments or @ me on Twitter: &lt;a href=&#34;https://twitter.com/rafalab&#34;&gt;@rafalab&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;example-1&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example 1&lt;/h2&gt;
&lt;p&gt;The first example is from &lt;a href=&#34;https://abcnews.go.com/blogs/headlines/2012/12/us-gun-ownership-homicide-rate-higher-than-other-developed-countries/&#34;&gt;this&lt;/a&gt; ABC news article. Here is the original:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://abcnews.go.com/images/International/homocides_g8_countries_640x360_wmain.jpg&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Here is the R code for my version. Note that I copied the values by hand.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) # also loads ggplot2
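# ggflags is not on CRAN; install it from GitHub, e.g. remotes::install_github(&amp;quot;jimjam-slam/ggflags&amp;quot;)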
library(ggflags)
library(countrycode)

dat &amp;lt;- tibble(country = toupper(c(&amp;quot;US&amp;quot;, &amp;quot;Italy&amp;quot;, &amp;quot;Canada&amp;quot;, &amp;quot;UK&amp;quot;, &amp;quot;Japan&amp;quot;, &amp;quot;Germany&amp;quot;, &amp;quot;France&amp;quot;, &amp;quot;Russia&amp;quot;)),
              count = c(3.2, 0.71, 0.5, 0.1, 0, 0.2, 0.1, 0),
              label = c(as.character(c(3.2, 0.71, 0.5, 0.1, 0, 0.2, 0.1)), &amp;quot;No Data&amp;quot;),
              code = c(&amp;quot;us&amp;quot;, &amp;quot;it&amp;quot;, &amp;quot;ca&amp;quot;, &amp;quot;gb&amp;quot;, &amp;quot;jp&amp;quot;, &amp;quot;de&amp;quot;, &amp;quot;fr&amp;quot;, &amp;quot;ru&amp;quot;))

dat %&amp;gt;% mutate(country = reorder(country, -count)) %&amp;gt;%
  ggplot(aes(country, count, label = label)) +
  geom_bar(stat = &amp;quot;identity&amp;quot;, fill = &amp;quot;darkred&amp;quot;) +
  geom_text(nudge_y = 0.2, color = &amp;quot;darkred&amp;quot;, size = 5) +
  geom_flag(y = -.5, aes(country = code), size = 12) +
  scale_y_continuous(breaks = c(0, 1, 2, 3, 4), limits = c(0,4)) +   
  geom_text(aes(6.25, 3.8, label = &amp;quot;Source UNODC Homicide Statistics&amp;quot;)) + 
  ggtitle(toupper(&amp;quot;Homicide Per 100,000 in G-8 Countries&amp;quot;)) + 
  xlab(&amp;quot;&amp;quot;) + 
  ylab(&amp;quot;# of gun-related homicides\nper 100,000 people&amp;quot;) +
  ggthemes::theme_economist() +
  theme(axis.text.x = element_text(size = 8, vjust = -16),
        axis.ticks.x = element_blank(),
        axis.line.x = element_blank(),
        plot.margin = unit(c(1,1,1,1), &amp;quot;cm&amp;quot;)) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;./post/2019-08-28-you-can-replicate-almost-any-plot-with-ggplot2_files/figure-html/murder-rate-example-1-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;example-2&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example 2&lt;/h2&gt;
&lt;p&gt;The second example is from &lt;a href=&#34;https://everytownresearch.org&#34;&gt;everytown.org&lt;/a&gt;. Here is the original:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://raw.githubusercontent.com/rafalab/dsbook/master/R/img/GunTrends_murders_per_1000.png&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Here is the R code for my version. As in the previous example I copied the values by hand.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat &amp;lt;- tibble(country = toupper(c(&amp;quot;United States&amp;quot;, &amp;quot;Canada&amp;quot;, &amp;quot;Portugal&amp;quot;, &amp;quot;Ireland&amp;quot;, &amp;quot;Italy&amp;quot;, &amp;quot;Belgium&amp;quot;, &amp;quot;Finland&amp;quot;, &amp;quot;France&amp;quot;, &amp;quot;Netherlands&amp;quot;, &amp;quot;Denmark&amp;quot;, &amp;quot;Sweden&amp;quot;, &amp;quot;Slovakia&amp;quot;, &amp;quot;Austria&amp;quot;, &amp;quot;New Zealand&amp;quot;, &amp;quot;Australia&amp;quot;, &amp;quot;Spain&amp;quot;, &amp;quot;Czech Republic&amp;quot;, &amp;quot;Hungary&amp;quot;, &amp;quot;Germany&amp;quot;, &amp;quot;United Kingdom&amp;quot;, &amp;quot;Norway&amp;quot;, &amp;quot;Japan&amp;quot;, &amp;quot;Republic of Korea&amp;quot;)),
              count = c(3.61, 0.5, 0.48, 0.35, 0.35, 0.33, 0.26, 0.20, 0.20, 0.20, 0.19, 0.19, 0.18, 0.16,
                        0.16, 0.15, 0.12, 0.10, 0.06, 0.04, 0.04, 0.01, 0.01))

dat %&amp;gt;% 
  mutate(country = reorder(country, count)) %&amp;gt;%
  ggplot(aes(country, count, label = count)) +   
  geom_bar(stat = &amp;quot;identity&amp;quot;, fill = &amp;quot;darkred&amp;quot;, width = 0.5) +
  geom_text(nudge_y = 0.2,  size = 3) +
  xlab(&amp;quot;&amp;quot;) + ylab(&amp;quot;&amp;quot;) + 
  ggtitle(toupper(&amp;quot;Gun Murders per 100,000 residents&amp;quot;)) + 
  theme_minimal() +
  theme(panel.grid.major =element_blank(), panel.grid.minor = element_blank(), 
        axis.text.x = element_blank(),
        axis.ticks.length = unit(-0.4, &amp;quot;cm&amp;quot;)) + 
  coord_flip() &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;./post/2019-08-28-you-can-replicate-almost-any-plot-with-ggplot2_files/figure-html/murder-rate-example-2-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;example-3&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example 3&lt;/h2&gt;
&lt;p&gt;The next example is from the &lt;a href=&#34;http://graphics.wsj.com/infectious-diseases-and-vaccines/?mc_cid=711ddeb86e&#34;&gt;Wall Street Journal&lt;/a&gt;. The original is interactive but here is a screenshot:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://raw.githubusercontent.com/rafalab/dsbook/master/dataviz/img/wsj-vaccines.png&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Here is the R code for my version. Note I matched the colors by hand as the original does not seem to follow a standard palette.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dslabs)
data(us_contagious_diseases)
the_disease &amp;lt;- &amp;quot;Measles&amp;quot;
dat &amp;lt;- us_contagious_diseases %&amp;gt;%
  filter(!state%in%c(&amp;quot;Hawaii&amp;quot;,&amp;quot;Alaska&amp;quot;) &amp;amp; disease == the_disease) %&amp;gt;%
  mutate(rate = count / population * 10000 * 52 / weeks_reporting) 

jet.colors &amp;lt;- colorRampPalette(c(&amp;quot;#F0FFFF&amp;quot;, &amp;quot;cyan&amp;quot;, &amp;quot;#007FFF&amp;quot;, &amp;quot;yellow&amp;quot;, &amp;quot;#FFBF00&amp;quot;, &amp;quot;orange&amp;quot;, &amp;quot;red&amp;quot;, &amp;quot;#7F0000&amp;quot;), bias = 2.25)

dat %&amp;gt;% mutate(state = reorder(state, desc(state))) %&amp;gt;%
  ggplot(aes(year, state, fill = rate)) +
  geom_tile(color = &amp;quot;white&amp;quot;, size = 0.35) +
  scale_x_continuous(expand = c(0,0)) +
  scale_fill_gradientn(colors = jet.colors(16), na.value = &amp;#39;white&amp;#39;) +
  geom_vline(xintercept = 1963, col = &amp;quot;black&amp;quot;) +
  theme_minimal() + 
  theme(panel.grid = element_blank()) +
  coord_cartesian(clip = &amp;#39;off&amp;#39;) +
  ggtitle(the_disease) +
  ylab(&amp;quot;&amp;quot;) +
  xlab(&amp;quot;&amp;quot;) +
  theme(legend.position = &amp;quot;bottom&amp;quot;, text = element_text(size = 8)) +
  annotate(geom = &amp;quot;text&amp;quot;, x = 1963, y = 50.5, label = &amp;quot;Vaccine introduced&amp;quot;, size = 3, hjust = 0)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;./post/2019-08-28-you-can-replicate-almost-any-plot-with-ggplot2_files/figure-html/wsj-vaccines-example-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;example-4&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example 4&lt;/h2&gt;
&lt;p&gt;The next example is from the &lt;a href=&#34;https://www.nytimes.com/2011/02/19/nyregion/19schools.html&#34;&gt;New York Times&lt;/a&gt;. Here is the original:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://graphics8.nytimes.com/images/2011/02/19/nyregion/19schoolsch/19schoolsch-popup.gif&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Here is the R code for my version:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data(&amp;quot;nyc_regents_scores&amp;quot;)
nyc_regents_scores$total &amp;lt;- rowSums(nyc_regents_scores[,-1], na.rm=TRUE)
nyc_regents_scores %&amp;gt;% 
  filter(!is.na(score)) %&amp;gt;%
  ggplot(aes(score, total)) + 
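  # gray rectangle highlighting passing scores (65 and above)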
  annotate(&amp;quot;rect&amp;quot;, xmin = 65, xmax = 99, ymin = 0, ymax = 35000, alpha = .5) +
  geom_bar(stat = &amp;quot;identity&amp;quot;, color = &amp;quot;black&amp;quot;, fill = &amp;quot;#C4843C&amp;quot;) + 
  annotate(&amp;quot;text&amp;quot;, x = 66, y = 28000, label = &amp;quot;MINIMUM\nREGENTS DIPLOMA\nSCORE IS 65&amp;quot;, hjust = 0, size = 3) +
  annotate(&amp;quot;text&amp;quot;, x = 0, y = 12000, label = &amp;quot;2010 Regents scores on\nthe five most common tests&amp;quot;, hjust = 0, size = 3) +
  scale_x_continuous(breaks = seq(5, 95, 5), limit = c(0,99)) + 
  scale_y_continuous(position = &amp;quot;right&amp;quot;) +
  ggtitle(&amp;quot;Scraping By&amp;quot;) + 
  xlab(&amp;quot;&amp;quot;) + ylab(&amp;quot;Number of tests&amp;quot;) + 
  theme_minimal() + 
  theme(panel.grid.major.x = element_blank(), 
        panel.grid.minor.x = element_blank(),
        axis.ticks.length = unit(-0.2, &amp;quot;cm&amp;quot;),
        plot.title = element_text(face = &amp;quot;bold&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;./post/2019-08-28-you-can-replicate-almost-any-plot-with-ggplot2_files/figure-html/regents-exams-example-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;example-5&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example 5&lt;/h2&gt;
&lt;p&gt;This last one is from &lt;a href=&#34;https://projects.fivethirtyeight.com/2016-election-forecast/&#34;&gt;fivethirtyeight&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://raw.githubusercontent.com/rafalab/dsbook/master/inference/img/popular-vote-538.png&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Below is the R code for my version. Note that in this example I am essentially just drawing as I don’t estimate the distributions myself. I simply estimated parameters “by eye” and used a bit of trial and error.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;my_dgamma &amp;lt;- function(x, mean = 1, sd = 1){
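  # moment matching: mean = shape * scale, sd^2 = shape * scale^2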
  shape = mean^2/sd^2
  scale = sd^2 / mean
  dgamma(x, shape = shape, scale = scale)
}

my_qgamma &amp;lt;- function(mean = 1, sd = 1){
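  # central 80% interval: the 10th and 90th percentiles of the matched gamma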
  shape = mean^2/sd^2
  scale = sd^2 / mean
  qgamma(c(0.1,0.9), shape = shape, scale = scale)
}

tmp &amp;lt;- tibble(candidate = c(&amp;quot;Clinton&amp;quot;, &amp;quot;Trump&amp;quot;, &amp;quot;Johnson&amp;quot;), 
              avg = c(48.5, 44.9, 5.0), 
              avg_txt = c(&amp;quot;48.5%&amp;quot;, &amp;quot;44.9%&amp;quot;, &amp;quot;5.0%&amp;quot;), 
              sd = rep(2, 3), 
              m = my_dgamma(avg, avg, sd)) %&amp;gt;%
  mutate(candidate = reorder(candidate, -avg))

xx &amp;lt;- seq(0, 75, len = 300)

tmp_2 &amp;lt;- map_df(1:3, function(i){
  tibble(candidate = tmp$candidate[i],
         avg = tmp$avg[i],
         sd = tmp$sd[i],
         x = xx,
         y = my_dgamma(xx, tmp$avg[i], tmp$sd[i]))
})

tmp_3 &amp;lt;- map_df(1:3, function(i){
  qq &amp;lt;- my_qgamma(tmp$avg[i], tmp$sd[i])
  xx &amp;lt;- seq(qq[1], qq[2], len = 200)
  tibble(candidate = tmp$candidate[i],
         avg = tmp$avg[i],
         sd = tmp$sd[i],
         x = xx,
         y = my_dgamma(xx, tmp$avg[i], tmp$sd[i]))
})
         
tmp_2 %&amp;gt;% 
  ggplot(aes(x, ymax = y, ymin = 0)) +
  geom_ribbon(fill = &amp;quot;grey&amp;quot;) + 
  facet_grid(candidate~., switch = &amp;quot;y&amp;quot;) +
  scale_x_continuous(breaks = seq(0, 75, 25), position = &amp;quot;top&amp;quot;,
                     label = paste0(seq(0, 75, 25), &amp;quot;%&amp;quot;)) +
  geom_abline(intercept = 0, slope = 0) +
  xlab(&amp;quot;&amp;quot;) + ylab(&amp;quot;&amp;quot;) + 
  theme_minimal() + 
  theme(panel.grid.major.y = element_blank(), 
        panel.grid.minor.y = element_blank(),
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        strip.text.y = element_text(angle = 180, size = 11, vjust = 0.2)) + 
  geom_ribbon(data = tmp_3, mapping = aes(x = x, ymax = y, ymin = 0, fill = candidate), inherit.aes = FALSE, show.legend = FALSE) +
  scale_fill_manual(values = c(&amp;quot;#3cace4&amp;quot;, &amp;quot;#fc5c34&amp;quot;, &amp;quot;#fccc2c&amp;quot;)) +
  geom_point(data = tmp, mapping = aes(x = avg, y = m), inherit.aes = FALSE) + 
  geom_text(data = tmp, mapping = aes(x = avg, y = m, label = avg_txt), inherit.aes = FALSE, hjust = 0, nudge_x = 1) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;./post/2019-08-28-you-can-replicate-almost-any-plot-with-ggplot2_files/figure-html/fivethirtyeight-densities-1.png&#34; width=&#34;80%&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>So You Want to Start a Podcast</title>
      <link>https://simplystatistics.org/2019/08/27/so-you-want-to-start-a-podcast/</link>
      <pubDate>Tue, 27 Aug 2019 00:00:00 +0000</pubDate>
      
      <guid>https://simplystatistics.org/2019/08/27/so-you-want-to-start-a-podcast/</guid>
      <description>

&lt;p&gt;Podcasting has gotten quite a bit easier over the past 10 years, due in part to improvements in hardware and software. I wrote about how I &lt;a href=&#34;https://simplystatistics.org/2017/09/18/editing-podcasts-logic-pro-x/&#34;&gt;edit&lt;/a&gt; and &lt;a href=&#34;https://simplystatistics.org/2017/09/20/recording-podcasts-with-a-remote-cohost/&#34;&gt;record&lt;/a&gt; both of my podcasts about 2 years ago and, while not much has changed since then, I thought it might be helpful to organize the information in a better way for people just starting out with a new podcast.&lt;/p&gt;

&lt;p&gt;One frustrating problem that I find with podcasting is that the easy methods are indeed easy, and the difficult methods are indeed difficult, but the methods that are just &lt;em&gt;above&lt;/em&gt; easy, which other markets might label as “prosumer” or something like that, are&amp;hellip;kind of hard. One of the reasons is that once you start buying better hardware, everything kind of snowballs because the hardware becomes more modular. So instead of just using your phone headphones to record, you might buy a microphone, that connects to a stand, that connects to a USB interface using an XLR cable, that connects to your computer. Similarly, on the software side, there’s really not much out there that’s free. As a result of both phenomena, costs start to go up pretty quickly as soon as you step up just a little bit.&lt;/p&gt;

&lt;p&gt;I can’t do anything about costs, but I thought I could help a little bit on sorting out what’s out there and what’s genuinely valuable. There are two versions here: the free and easy plan if you’re just starting out and the next level up, which is basically what I use.&lt;/p&gt;

&lt;p&gt;The four things I’ll cover here that you need for podcasting are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Hardware&lt;/strong&gt; - this includes all recording equipment like microphones, stands, cables, etc.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Recording Software&lt;/strong&gt; - Unless you live in a recording booth you’ll need some software for your computer (which I assume you have!)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Editing Software&lt;/strong&gt; - the more complicated your podcast gets the more you’ll need to edit (beyond just trimming the beginning and end of the audio files)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hosting&lt;/strong&gt; - Unless you plan on running your own server (which is an option but I don’t recommend it) you’ll need someone to host your audio files.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&#34;free-and-easy&#34;&gt;Free and Easy&lt;/h2&gt;

&lt;p&gt;There are in fact ways to podcast for free and many people stay at this level for a long time because the quality is acceptable and cost is zero. If you want to just get started quickly here’s what you can do:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Hardware&lt;/strong&gt; - just use the headphones/microphone that came with your mobile phone.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Recording Software&lt;/strong&gt; - If you are doing a podcast by yourself, you can just use whatever app your phone has to record things like voice memos. On your computer, there should be a built-in app that just lets you record sound through the headphones.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Editing Software&lt;/strong&gt; - For editing I recommend either not editing (simpler!) or using something like &lt;a href=&#34;https://www.audacityteam.org&#34;&gt;Audacity&lt;/a&gt; to just trim the beginning and the end.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hosting&lt;/strong&gt; - SoundCloud offers free hosting for up to 3 hours of content. This is plenty for just starting out and seeing if you like it, but you will likely use it up.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you are working with a partner, it gets a little more complicated and there are some additional notes on the recording software. My go-to recommendation for recording with a partner is to use &lt;a href=&#34;https://zencastr.com/&#34;&gt;Zencastr&lt;/a&gt;. Zencastr has a free plan that lets you record high-quality audio for a max of 2 people. (If you need to record more than 2 people, you can’t use the free option.) The nice thing about Zencastr is that it uses &lt;a href=&#34;https://en.wikipedia.org/wiki/WebRTC&#34;&gt;WebRTC&lt;/a&gt; to record directly off your microphone, so you don’t need to worry too much about the quality of your internet connection. What you get is separate audio files, one for each speaker, that are synced together. Occasionally, there are some syncing glitches, but usually it works out. The files are automatically uploaded to a Dropbox account, so you’ll need one of those. Because Zencastr automatically converts to MP3 format, the files are relatively small. Also, if you have a guest who is less familiar with audio hardware/software, you can just send them a link that they can click on and they’re recording.&lt;/p&gt;

&lt;p&gt;Note that even if your partner is sitting right next to you, it’s often simpler to just go to separate spaces and record “remotely”. The primary benefit of doing this is that you can cleanly record separate/independent audio tracks. This can be useful in the editing process.&lt;/p&gt;

&lt;p&gt;If you prefer an all-in-one solution, there are services like &lt;a href=&#34;https://tryca.st&#34;&gt;Cast&lt;/a&gt; and &lt;a href=&#34;https://anchor.fm&#34;&gt;Anchor&lt;/a&gt; that offer recording, hosting, and distribution. Cast only has a free 1-month trial and so you have to pay eventually. Anchor appears to be free (I’ve never used it), but it was recently purchased by Spotify so it’s not immediately clear to me if anything will change. My guess is they’ll likely stay free because they want as many people making podcasts as possible. Anchor didn’t exist when I started podcasting but if it had I might have used it first. But it always makes me a little nervous when I can’t figure out how a company makes money.&lt;/p&gt;

&lt;p&gt;To summarize, here’s the “free and easy” workflow that I recommend:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Record your podcast using Zencastr (especially if you have a partner), which then puts audio files on Dropbox&lt;/li&gt;
&lt;li&gt;Trim beginning/ending of audio file with Audacity&lt;/li&gt;
&lt;li&gt;Upload audio to SoundCloud and add episode metadata&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And here are the pros and cons:&lt;/p&gt;

&lt;p&gt;Pros&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It’s free&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Audio quality is acceptable but not great. Earbud type microphones are not designed for high quality and you can usually tell when someone has used them to record. Given that podcasts are all about audio, it’s hard for me to trade off audio quality.&lt;/li&gt;
&lt;li&gt;Hosting limitations mean you can only get a few episodes up. But that’s a problem for down the road, right?&lt;/li&gt;
&lt;li&gt;Editing is generally a third-order issue, but there is one scenario where it can be critical&amp;mdash;when you have a bad internet connection. Bad internet connections can introduce delays and cross-talk. These problems can be mitigated when editing (I give an example &lt;a href=&#34;https://simplystatistics.org/2017/09/18/editing-podcasts-logic-pro-x/&#34;&gt;here&lt;/a&gt;) but only with better software.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&#34;beyond-free&#34;&gt;Beyond Free&lt;/h2&gt;

&lt;p&gt;Beyond the free workflow, there are a number of upgrades that you can make and you can easily start spending a lot of money. But the only real upgrade that I think you need to make is to buy a good microphone. Surprisingly, this does not need to cost much money. The best podcasting microphone for the money out there is the &lt;a href=&#34;https://www.amazon.com/Audio-Technica-ATR2100-USB-Cardioid-Dynamic-Microphone/dp/B004QJOZS4/ref=sr_1_3?crid=35TCYURP9DCY0&amp;amp;keywords=audio+technical%27s+atr2100&amp;amp;qid=1566911267&amp;amp;s=gateway&amp;amp;sprefix=audio+technical%27s+atr2100%2Caps%2C122&amp;amp;sr=8-3&#34;&gt;Audio Technica ATR2100 USB microphone&lt;/a&gt;. This is the microphone that Elizabeth uses on &lt;a href=&#34;https://effortreport.libsyn.com&#34;&gt;The Effort Report&lt;/a&gt; and Hilary uses on &lt;a href=&#34;http://nssdeviations.com&#34;&gt;Not So Standard Deviations&lt;/a&gt;. As of this writing it’s $65 on Amazon, but I’ve seen it for as low as $40. The benefits of this microphone are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The audio quality is high&lt;/li&gt;
&lt;li&gt;It isolates vocal audio really well and doesn’t pick up a lot of background audio (good for noisy rooms like my office).&lt;/li&gt;
&lt;li&gt;It connects directly to a computer via USB so you don’t need to buy a separate USB interface.&lt;/li&gt;
&lt;li&gt;It’s cheap&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem with getting “better” (i.e. more expensive) microphones is that they tend to be more sensitive, which means they pick up more high-frequency background noise. Professional microphones are designed for you to be working in a sound-proof recording studio environment in which you want to pick up as much sound as possible. But podcasting, in general, tends to take place wherever. So you want a microphone that will only pick up your voice right in front of it. Technically, you lose a little quality this way, but it’s equally annoying to have a lot of background noise.&lt;/p&gt;

&lt;p&gt;Now that you’ve got a microphone, you need to stick it somewhere. While you can always just hold the microphone, I’d recommend an adjustable stand of some sort. Desk stands like &lt;a href=&#34;https://www.amazon.com/InnoGear-Microphone-Suspension-Adjustable-Snowball/dp/B01L3LL95O/ref=sr_1_2_sspa?keywords=microphone+desk+stand&amp;amp;qid=1566911946&amp;amp;s=musical-instruments&amp;amp;sr=1-2-spons&amp;amp;psc=1&amp;amp;spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUE1OUJZT05aSEdVWkMmZW5jcnlwdGVkSWQ9QTA2OTg0ODgySElRWktJSjk1WFVRJmVuY3J5cHRlZEFkSWQ9QTAzMjI0NTVaSENFTVJaOFhZSUsmd2lkZ2V0TmFtZT1zcF9hdGYmYWN0aW9uPWNsaWNrUmVkaXJlY3QmZG9Ob3RMb2dDbGljaz10cnVl&#34;&gt;this one&lt;/a&gt; are nice because they’re adjustable but they do require you to have a semi-permanent office where you can just keep it. The main point here is that podcasting requires you to sit still and talk for a while, and you don’t want to be uncomfortable while you’re doing it.&lt;/p&gt;

&lt;p&gt;The last upgrade you’ll likely need to make is the hosting provider. SoundCloud itself offers an unlimited plan but I don’t recommend it as it’s not really designed for podcasting. I use &lt;a href=&#34;https://libsyn.com&#34;&gt;Libsyn&lt;/a&gt;, which has a $5 a month plan that should be enough for a monthly podcast. They also provide some decent analytics that you can download and read into R. What I like about Libsyn is that they do one job and they do it really well. I give them money, and they provide me a service in return. How simple is that?&lt;/p&gt;
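
&lt;p&gt;If you want to poke at those analytics in R, here is a minimal sketch; the file name and the &lt;code&gt;date&lt;/code&gt; and &lt;code&gt;downloads&lt;/code&gt; columns are hypothetical, so check what your own export actually contains:&lt;/p&gt;

&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)

# hypothetical CSV export of per-episode download stats
stats &amp;lt;- read_csv(&amp;quot;episode_downloads.csv&amp;quot;)

# downloads over time
stats %&amp;gt;%
  ggplot(aes(as.Date(date), downloads)) +
  geom_line() +
  labs(x = &amp;quot;Date&amp;quot;, y = &amp;quot;Downloads&amp;quot;)&lt;/code&gt;&lt;/pre&gt;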

&lt;p&gt;That’s it for now. I’m happy to make more recommendations regarding software and hardware (feel free to tweet me @rdpeng), but I think what I’ve got here should get you 99% of the way there.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The data deluge means no reasonable expectation of privacy - now what?</title>
      <link>https://simplystatistics.org/2019/07/23/the-data-deluge-means-no-reasonable-expectation-of-privacy-no-what/</link>
      <pubDate>Tue, 23 Jul 2019 00:00:00 +0000</pubDate>
      
      <guid>https://simplystatistics.org/2019/07/23/the-data-deluge-means-no-reasonable-expectation-of-privacy-no-what/</guid>
      <description>&lt;p&gt;Today a couple of different things reminded me about something that I suppose &lt;a href=&#34;https://www.amazon.com/Weapons-Math-Destruction-Increases-Inequality/dp/0553418815&#34;&gt;many&lt;/a&gt; &lt;a href=&#34;https://www.nytimes.com/2019/04/28/opinion/fourth-amendment-privacy.html&#34;&gt;people&lt;/a&gt; &lt;a href=&#34;https://www.sciencenews.org/article/family-tree-dna-sharing-genetic-data-police-privacy&#34;&gt;are talking about&lt;/a&gt; but has been on my mind as well.&lt;/p&gt;

&lt;p&gt;The idea is that many of our society&amp;rsquo;s social norms are based on the reasonable expectation of privacy. But the reasonable expectation of privacy is increasingly a thing of the past. Three types of data I&amp;rsquo;ve been thinking about are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Obviously identifying data&lt;/strong&gt;: Data like cellphone GPS traces and public social media posts are obviously identifiable and reduce privacy.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data that can be inferred from public data&lt;/strong&gt;: We can also now infer a lot about people from the data that is public. For example, a couple of years ago I challenged the students in my advanced data science class to predict the &lt;a href=&#34;https://bcrisktool.cancer.gov/&#34;&gt;Gail score&lt;/a&gt; - one of the most widely used measures of breast cancer risk - using only the information available from a person&amp;rsquo;s public Facebook profile. While not all of the information was available, a good fraction of it was. You might not think that posting pictures of your family, your birthday celebrations, and family life events could enable this kind of inference, but it can. I was reminded of this when hearing about this &lt;a href=&#34;https://www.nature.com/articles/s41467-019-10933-3&#34;&gt;paper&lt;/a&gt; that claims to be able to correctly re-identify up to 99.98% of Americans using only 15 demographic attributes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data other people share about us&lt;/strong&gt;: The stories around the &lt;a href=&#34;https://www.theatlantic.com/science/archive/2018/04/golden-state-killer-east-area-rapist-dna-genealogy/559070/&#34;&gt;capture of the Golden State Killer using genealogy data&lt;/a&gt; make it clear that even when you personally don&amp;rsquo;t share your data, someone else may be sharing it for you. The same can be said of photos of you that were tagged on Facebook even if you aren&amp;rsquo;t on the platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I don&amp;rsquo;t think these types of data are going to magically disappear. So, like a lot of other people, I&amp;rsquo;ve been wondering how we should adapt, individually and as a society, to a world where privacy is no longer an expectation.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>More datasets for teaching data science: The expanded dslabs package</title>
      <link>https://simplystatistics.org/2019/07/19/more-datasets-for-teaching-data-science-the-expanded-dslabs-package/</link>
      <pubDate>Fri, 19 Jul 2019 00:00:00 +0000</pubDate>
      
      <guid>https://simplystatistics.org/2019/07/19/more-datasets-for-teaching-data-science-the-expanded-dslabs-package/</guid>
      <description>&lt;div id=&#34;introduction&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;We have expanded the &lt;a href=&#34;https://cran.r-project.org/web/packages/dslabs/index.html&#34;&gt;dslabs package&lt;/a&gt;, which we &lt;a href=&#34;https://simplystatistics.org/2018/01/22/the-dslabs-package-provides-datasets-for-teaching-data-science/&#34;&gt;previously introduced&lt;/a&gt; as a package containing realistic, interesting and approachable datasets that can be used in introductory data science courses.&lt;/p&gt;
&lt;p&gt;This release adds 7 new datasets on climate change, astronomy, life expectancy, and breast cancer diagnosis. They are used in improved problem sets and new projects within the &lt;a href=&#34;https://www.edx.org/professional-certificate/harvardx-data-science&#34;&gt;HarvardX Data Science Professional Certificate Program&lt;/a&gt;, which teaches beginning R programming, data visualization, data wrangling, statistics, and machine learning for students with no prior coding background.&lt;/p&gt;
&lt;p&gt;You can install the &lt;a href=&#34;https://cran.r-project.org/web/packages/dslabs/index.html&#34;&gt;dslabs package&lt;/a&gt; from CRAN:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;install.packages(&amp;quot;dslabs&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you already have the package installed, you can add the new datasets by updating the package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;update.packages(oldPkgs = &amp;quot;dslabs&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can load the package into your workspace normally:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dslabs)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s preview these new datasets! To code along, use the following libraries and options:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# install packages if necessary
if(!require(&amp;quot;tidyverse&amp;quot;)) install.packages(&amp;quot;tidyverse&amp;quot;)
if(!require(&amp;quot;ggrepel&amp;quot;)) install.packages(&amp;quot;ggrepel&amp;quot;)
if(!require(&amp;quot;matrixStats&amp;quot;)) install.packages(&amp;quot;matrixStats&amp;quot;)


# load libraries
library(tidyverse)
library(ggrepel)
library(matrixStats)

# set colorblind-friendly color palette
colorblind_palette &amp;lt;- c(&amp;quot;black&amp;quot;, &amp;quot;#E69F00&amp;quot;, &amp;quot;#56B4E9&amp;quot;, &amp;quot;#009E73&amp;quot;,
                        &amp;quot;#CC79A7&amp;quot;, &amp;quot;#F0E442&amp;quot;, &amp;quot;#0072B2&amp;quot;, &amp;quot;#D55E00&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;climate-change&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Climate change&lt;/h2&gt;
&lt;p&gt;Three datasets related to climate change are used to teach data visualization and data wrangling. These data produce clear plots that demonstrate an increase in temperature, greenhouse gas levels, and carbon emissions from 800,000 years ago to modern times. Students can create their own impactful visualizations with real atmospheric and ice core measurements.&lt;/p&gt;
&lt;div id=&#34;modern-temperature-anomaly-and-carbon-dioxide-data-temp_carbon&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Modern temperature anomaly and carbon dioxide data: &lt;code&gt;temp_carbon&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;temp_carbon&lt;/code&gt; dataset includes annual global temperature anomaly measurements in degrees Celsius relative to the 20th century mean temperature from 1880-2018. Temperature anomalies over land and over ocean are also reported. In addition, it includes annual carbon emissions (in millions of metric tons) from 1751-2014. Temperature anomalies are from &lt;a href=&#34;https://www.ncdc.noaa.gov/cag/global/time-series&#34;&gt;NOAA&lt;/a&gt; and carbon emissions are from &lt;a href=&#34;https://cdiac.ess-dive.lbl.gov/trends/emis/tre_glob_2014.html&#34;&gt;Boden et al., 2017 via CDIAC&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data(temp_carbon)

# line plot of annual global, land and ocean temperature anomalies since 1880
temp_carbon %&amp;gt;%
    select(Year = year, Global = temp_anomaly, Land = land_anomaly, Ocean = ocean_anomaly) %&amp;gt;%
    gather(Region, Temp_anomaly, Global:Ocean) %&amp;gt;%
    ggplot(aes(Year, Temp_anomaly, col = Region)) +
    geom_line(size = 1) +
    geom_hline(aes(yintercept = 0), col = colorblind_palette[8], lty = 2) +
    geom_label(aes(x = 2005, y = -.08), col = colorblind_palette[8], 
               label = &amp;quot;20th century mean&amp;quot;, size = 4) +
    ylab(&amp;quot;Temperature anomaly (degrees C)&amp;quot;) +
    xlim(c(1880, 2018)) +
    scale_color_manual(values = colorblind_palette) +
    ggtitle(&amp;quot;Temperature anomaly relative to 20th century mean, 1880-2018&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;./post/2019-07-19-more-datasets-for-teaching-data-science-the-expanded-dslabs-package_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;greenhouse-gas-concentrations-over-2000-years-greenhouse_gases&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Greenhouse gas concentrations over 2000 years: &lt;code&gt;greenhouse_gases&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;greenhouse_gases&lt;/code&gt; data frame contains carbon dioxide (&lt;span class=&#34;math inline&#34;&gt;\(\mbox{CO}_2\)&lt;/span&gt;, ppm), methane (&lt;span class=&#34;math inline&#34;&gt;\(\mbox{CH}_4\)&lt;/span&gt;, ppb) and nitrous oxide (&lt;span class=&#34;math inline&#34;&gt;\(\mbox{N}_2\mbox{O}\)&lt;/span&gt;, ppb) concentrations every 20 years from 0-2000 CE. The data are a subset of ice core measurements from &lt;a href=&#34;https://www.ncdc.noaa.gov/paleo-search/study/9959&#34;&gt;MacFarling Meure et al., 2006 via NOAA&lt;/a&gt;. There is a clear increase in all 3 gases starting around the time of the Industrial Revolution.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data(greenhouse_gases)

# line plots of atmospheric concentrations of the three major greenhouse gases since 0 CE
greenhouse_gases %&amp;gt;%
    ggplot(aes(year, concentration)) +
    geom_line() +
    facet_grid(gas ~ ., scales = &amp;quot;free&amp;quot;) +
    xlab(&amp;quot;Year&amp;quot;) +
    ylab(&amp;quot;Concentration (CH4/N2O ppb, CO2 ppm)&amp;quot;) +
    ggtitle(&amp;quot;Atmospheric greenhouse gas concentration by year, 0-2000 CE&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;./post/2019-07-19-more-datasets-for-teaching-data-science-the-expanded-dslabs-package_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Compare this pattern with manmade carbon emissions since 1751 from &lt;code&gt;temp_carbon&lt;/code&gt;, which have risen in a similar way:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# line plot of anthropogenic carbon emissions over 250+ years
temp_carbon %&amp;gt;%
    ggplot(aes(year, carbon_emissions)) +
    geom_line() +
    xlab(&amp;quot;Year&amp;quot;) +
    ylab(&amp;quot;Carbon emissions (millions of metric tons)&amp;quot;) +
    ggtitle(&amp;quot;Annual global carbon emissions, 1751-2014&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;./post/2019-07-19-more-datasets-for-teaching-data-science-the-expanded-dslabs-package_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;carbon-dioxide-levels-over-the-last-800000-years-historic_co2&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Carbon dioxide levels over the last 800,000 years, &lt;code&gt;historic_co2&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;A common argument against the existence of anthropogenic climate change is that the Earth naturally undergoes cycles of warming and cooling governed by natural changes beyond human control. &lt;span class=&#34;math inline&#34;&gt;\(\mbox{CO}_2\)&lt;/span&gt; levels from ice cores and modern atmospheric measurements at the Mauna Loa observatory demonstrate that the speed and magnitude of natural variations in greenhouse gases pale in comparison to the rapid changes in modern industrial times. While the planet has been hotter and had higher &lt;span class=&#34;math inline&#34;&gt;\(\mbox{CO}_2\)&lt;/span&gt; levels in the distant past (data not shown), the current unprecedented rate of change leaves little time for planetary systems to adapt.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data(historic_co2)

# line plot of atmospheric CO2 concentration over 800K years, colored by data source
historic_co2 %&amp;gt;%
    ggplot(aes(year, co2, col = source)) +
    geom_line() +
    ylab(&amp;quot;CO2 (ppm)&amp;quot;) +
    scale_color_manual(values = colorblind_palette[7:8]) +
    ggtitle(&amp;quot;Atmospheric CO2 concentration, -800,000 BCE to today&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;./post/2019-07-19-more-datasets-for-teaching-data-science-the-expanded-dslabs-package_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;672&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;properties-of-stars-for-making-an-h-r-diagram-stars&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Properties of stars for making an H-R diagram: &lt;code&gt;stars&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;In astronomy, stars are classified by several key features, including temperature, spectral class (color) and luminosity (brightness). A common plot for demonstrating the different groups of stars and their properties is the Hertzsprung-Russell diagram, or H-R diagram. The &lt;code&gt;stars&lt;/code&gt; data frame compiles information for making an H-R diagram for approximately 100 named stars, including their temperature, spectral class and magnitude (which is inversely proportional to luminosity).&lt;/p&gt;
&lt;p&gt;The H-R diagram has the hottest, brightest stars in the upper left and coldest, dimmest stars in the lower right. Main sequence stars are along the main diagonal, while giants are in the upper right and dwarfs are in the lower left. Several aspects of data visualization can be practiced with these data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data(stars)

# H-R diagram color-coded by spectral class
stars %&amp;gt;%
    mutate(type = factor(type, levels = c(&amp;quot;O&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;DB&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;DA&amp;quot;, &amp;quot;DF&amp;quot;, &amp;quot;F&amp;quot;, &amp;quot;G&amp;quot;, &amp;quot;K&amp;quot;, &amp;quot;M&amp;quot;)),
           star = ifelse(star %in% c(&amp;quot;Sun&amp;quot;, &amp;quot;Polaris&amp;quot;, &amp;quot;Betelgeuse&amp;quot;, &amp;quot;Deneb&amp;quot;,
                                     &amp;quot;Regulus&amp;quot;, &amp;quot;*SiriusB&amp;quot;, &amp;quot;Alnitak&amp;quot;, &amp;quot;*ProximaCentauri&amp;quot;),
                         as.character(star), NA)) %&amp;gt;%
    ggplot(aes(log10(temp), magnitude, col = type)) +
    geom_point() +
    geom_label_repel(aes(label = star)) +
    scale_x_reverse() +
    scale_y_reverse() +
    xlab(&amp;quot;Temperature (log10 degrees K)&amp;quot;) +
    ylab(&amp;quot;Magnitude&amp;quot;) +
    labs(color = &amp;quot;Spectral class&amp;quot;) +
    ggtitle(&amp;quot;H-R diagram of selected stars&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: Removed 88 rows containing missing values (geom_label_repel).&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;./post/2019-07-19-more-datasets-for-teaching-data-science-the-expanded-dslabs-package_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;united-states-period-life-tables-death_prob&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;United States period life tables: &lt;code&gt;death_prob&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;Obtained from the &lt;a href=&#34;https://www.ssa.gov/oact/STATS/table4c6.html&#34;&gt;US Social Security Administration&lt;/a&gt;, the 2015 period life table lists the probability of death within one year at every age and for both sexes. These values are commonly used to calculate life insurance premiums. They can be used for exercises on probability and random variables. For example, the premiums can be calculated with a similar approach to that used for interest rates in this &lt;a href=&#34;https://rafalab.github.io/dsbook/random-variables.html#case-study-the-big-short&#34;&gt;case study on The Big Short&lt;/a&gt; in Rafael Irizarry’s &lt;a href=&#34;https://leanpub.com/datasciencebook&#34;&gt;Introduction to Data Science textbook&lt;/a&gt;.&lt;/p&gt;
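&lt;p&gt;As a minimal sketch of the expected value logic behind a premium (the premium and payout amounts below are made up, and we assume the table has &lt;code&gt;age&lt;/code&gt;, &lt;code&gt;sex&lt;/code&gt; and &lt;code&gt;prob&lt;/code&gt; columns; check &lt;code&gt;str(death_prob)&lt;/code&gt;):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dslabs)
library(tidyverse)
data(death_prob)

# one-year death probability for a 50-year-old female
p &amp;lt;- death_prob %&amp;gt;%
    filter(age == 50 &amp;amp; sex == &amp;quot;Female&amp;quot;) %&amp;gt;%
    pull(prob)

# expected profit per policy: collect the premium, pay out on death
# (premium and payout amounts are made up for illustration)
premium &amp;lt;- 1150
payout &amp;lt;- 150000
premium - p * payout&lt;/code&gt;&lt;/pre&gt;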
&lt;/div&gt;
&lt;div id=&#34;brexit-polling-data-brexit_polls&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Brexit polling data: &lt;code&gt;brexit_polls&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;brexit_polls&lt;/code&gt; contains vote percentages and spreads from the six months prior to the Brexit EU membership referendum in 2016 compiled from &lt;a href=&#34;https://en.wikipedia.org/w/index.php?title=Opinion_polling_for_the_United_Kingdom_European_Union_membership_referendum&amp;amp;oldid=896735054&#34;&gt;Wikipedia&lt;/a&gt;. These can be used to practice a variety of inference and modeling concepts, including confidence intervals, p-values, hierarchical models and forecasting.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data(brexit_polls)

# plot of Brexit referendum polling spread between &amp;quot;Remain&amp;quot; and &amp;quot;Leave&amp;quot; over time
brexit_polls %&amp;gt;%
    ggplot(aes(enddate, spread, color = poll_type)) +
    geom_hline(aes(yintercept = -.038, color = &amp;quot;Actual spread&amp;quot;)) +
    geom_smooth(method = &amp;quot;loess&amp;quot;, span = 0.4) +
    geom_point() +
    scale_color_manual(values = colorblind_palette[1:3]) +
    xlab(&amp;quot;Poll end date (2016)&amp;quot;) +
    ylab(&amp;quot;Spread (Proportion Remain - Proportion Leave)&amp;quot;) +
    labs(color = &amp;quot;Poll type&amp;quot;) +
    ggtitle(&amp;quot;Spread of Brexit referendum online and telephone polls&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;./post/2019-07-19-more-datasets-for-teaching-data-science-the-expanded-dslabs-package_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;672&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;breast-cancer-diagnosis-prediction-brca&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Breast cancer diagnosis prediction: &lt;code&gt;brca&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;This is the &lt;a href=&#34;https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29&#34;&gt;Breast Cancer Wisconsin (Diagnostic) Dataset&lt;/a&gt;, a classic machine learning dataset that allows classification of breast lesion biopsies as malignant or benign based on cell nucleus characteristics extracted from digitized images of fine needle aspirate cytology slides. The data are appropriate for principal component analysis and a variety of machine learning algorithms. Models can be trained to a predictive accuracy of over 95%.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# scale x values
x_centered &amp;lt;- sweep(brca$x, 2, colMeans(brca$x))
x_scaled &amp;lt;- sweep(x_centered, 2, colSds(brca$x), FUN = &amp;quot;/&amp;quot;)

# principal component analysis
pca &amp;lt;- prcomp(x_scaled) 

# scatterplot of PC2 versus PC1 with an ellipse to show the cluster regions
data.frame(pca$x[,1:2], type = ifelse(brca$y == &amp;quot;B&amp;quot;, &amp;quot;Benign&amp;quot;, &amp;quot;Malignant&amp;quot;)) %&amp;gt;%
    ggplot(aes(PC1, PC2, color = type)) +
    geom_point() +
    stat_ellipse() +
    ggtitle(&amp;quot;PCA separates breast biopsies into benign and malignant clusters&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;./post/2019-07-19-more-datasets-for-teaching-data-science-the-expanded-dslabs-package_files/figure-html/unnamed-chunk-11-1.png&#34; width=&#34;672&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;We hope that these additional datasets make the &lt;a href=&#34;https://cran.r-project.org/web/packages/dslabs/index.html&#34;&gt;dslabs package&lt;/a&gt; even more useful for teaching data science through real-world case studies and motivating examples.&lt;/p&gt;
&lt;p&gt;Are you an R programming novice but want to learn how to do all of this and more? Check out the &lt;a href=&#34;https://www.edx.org/professional-certificate/harvardx-data-science&#34;&gt;Data Science Professional Certificate Program from HarvardX&lt;/a&gt; on edX, taught by Rafael Irizarry!&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Research quality data and research quality databases</title>
      <link>https://simplystatistics.org/2019/05/29/research-quality-data-and-research-quality-databases/</link>
      <pubDate>Wed, 29 May 2019 00:00:00 +0000</pubDate>
      
      <guid>https://simplystatistics.org/2019/05/29/research-quality-data-and-research-quality-databases/</guid>
      <description>&lt;p&gt;When you are doing data science, you are doing research. You want to use data to answer a question, identify a new pattern, improve a current product, or come up with a new product. The common factor underlying each of these tasks is that you want to use the data to answer a question that you haven’t answered before. The most effective process we have come up for getting those answers is the scientific research process. That is why the key word in data science is not data, it is science.&lt;/p&gt;

&lt;p&gt;No matter where you are doing data science - in academia, in a non-profit, or in a company - you are doing research. The data is the substrate you use to get the answers you care about. The first step most people take when using data is to collect the data and store it. This is a data engineering problem and is a necessary first step before you can do data science. But the state and quality of the data you have can make a huge amount of difference in how fast and accurately you can get answers. If the data is structured for analysis - if it is research quality - then getting answers is dramatically faster.&lt;/p&gt;

&lt;p&gt;A common analogy says that &lt;a href=&#34;https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data&#34;&gt;data is the new oil&lt;/a&gt;. In this analogy, pulling the data from all of the different available sources is like mining and extracting the oil. Putting it in a data lake or warehouse is like storing the crude oil for use in different products. Research, then, is like making cars go using the oil. Crude oil extracted from the ground can be used for a lot of different products, but to make it really useful for cars you need to refine it into gas. Creating research quality data is the way you refine and structure data to make it conducive to doing science. The data is no longer as general purpose, but you can use it much, much more efficiently for the purpose you care about - getting answers to your questions.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Research quality data&lt;/em&gt; is data that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Is summarized the right amount&lt;/li&gt;
&lt;li&gt;Is formatted to work with the tools you are going to use&lt;/li&gt;
&lt;li&gt;Is easy to manipulate and use&lt;/li&gt;
&lt;li&gt;Is valid and accurately reflects the underlying data collection&lt;/li&gt;
&lt;li&gt;Has potential biases clearly documented&lt;/li&gt;
&lt;li&gt;Combines all the relevant data types you need to answer questions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s use an example to make this concrete. Suppose that you want to analyze data from an electronic health record. You want to do this to identify new potential efficiencies, find new therapies, and understand variation in prescribing within your medical system. The data that you have collected is in the form of billing records. They might be stored in a large database for a health system, where each record looks something like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;http://healthdesignchallenge.com/images/status-quo.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;An example electronic health record. Source: &lt;a href=&#34;http://healthdesignchallenge.com/&#34;&gt;http://healthdesignchallenge.com/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These data are collected incidentally during the health process and are designed for billing, not for research. Often they contain information about what treatments patients received and were billed for, but they might not include information on the health of the patient and whether they had any health complications or relapses they weren’t billed for.&lt;/p&gt;

&lt;p&gt;These data are great, but they aren’t research grade. They aren’t summarized in any meaningful way, can’t be manipulated with visualization or machine learning tools, are unwieldy and contain a lot of information we don’t need, are subject to all sorts of strange sampling biases, and aren’t merged with any of the health outcome data you might care about.&lt;/p&gt;

&lt;p&gt;So let’s talk about how we would turn this pile of crude data into research quality data.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://user-images.githubusercontent.com/1571674/58572594-f77d2080-8209-11e9-87a2-0621a13eeb03.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Turning raw data into research quality data.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summarizing the data the right amount&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To know how to summarize the data we need to know what are the most common types of questions we want to answer and what resolution we need to answer them. A good idea is to summarize things at the finest unit of analysis you think you will need - it is always easier to aggregate than disaggregate at the analysis level. So we might summarize at the patient and visit level. This would give us a data set where everything is indexed by patient and visit. If we want to answer something at a clinic, physician, or hospital level we can always aggregate there.&lt;/p&gt;

&lt;p&gt;We also need to choose what to quantify. We might record for each visit the date, prescriptions with standardized codes, tests, and other metrics. Depending on the application we may store the free text of the physician notes as a text string - for potential later processing into specific tokens or words. Or if we already have a system for aggregating physicians notes we could apply it at this stage.&lt;/p&gt;
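
&lt;p&gt;To make the aggregation point concrete, here is a minimal tidyverse sketch of visit-level data rolled up to the clinic level; the table and column names are hypothetical:&lt;/p&gt;

&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)

# hypothetical visit-level table: one row per patient visit
visits &amp;lt;- tibble(patient_id = c(1, 1, 2, 3),
                 clinic = c(&amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;),
                 n_scripts = c(2, 1, 0, 3))

# rolling up to the clinic level is one line;
# recovering visits from clinic totals is impossible
visits %&amp;gt;%
  group_by(clinic) %&amp;gt;%
  summarize(patients = n_distinct(patient_id), scripts = sum(n_scripts))&lt;/code&gt;&lt;/pre&gt;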

&lt;p&gt;&lt;strong&gt;Is formatted to work with the tools you are going to use&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Research quality data is organized so the most frequent tasks can be completed quickly and without large amounts of data processing and reformatting. Each data analytic tool has different requirements for the type of data you input. For example, many statistical modeling tools use “tidy data” so you might store the summarized data in a single tidy data set or a set of tidy data tables linked by a common set of indicators. Some software (for example, in the analysis of human genomic data) requires inputs in different formats - say, as a set of objects in the R programming language. Others, like software to fit a convolutional neural network to a set of images, might require a set of image files organized in a directory in a particular way along with a metadata file providing information about each set of images. Or we might need to &lt;a href=&#34;https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/&#34;&gt;one hot encode&lt;/a&gt; categories that need to be classified.&lt;/p&gt;

&lt;p&gt;In the case of our EHR data we might store everything in a set of tidy tables that can be used to quickly correlate different measurements. If we are going to integrate imaging, lab reports, and other documents we might store those in different formats to make integration with downstream tools easier.&lt;/p&gt;
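
&lt;p&gt;As a small example of tool-driven formatting, the one hot encoding mentioned above can be done in base R with &lt;code&gt;model.matrix&lt;/code&gt;; the &lt;code&gt;diagnosis&lt;/code&gt; variable here is hypothetical:&lt;/p&gt;

&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# hypothetical categorical variable pulled from the EHR
dx &amp;lt;- data.frame(diagnosis = factor(c(&amp;quot;asthma&amp;quot;, &amp;quot;flu&amp;quot;, &amp;quot;flu&amp;quot;)))

# one indicator column per level, no intercept
model.matrix(~ diagnosis - 1, dx)&lt;/code&gt;&lt;/pre&gt;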

&lt;p&gt;&lt;strong&gt;Is easy to manipulate and use&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This seems like just a re-hash of formatting the data to work with the tools you care about, but there are some subtle nuances. For example, if you have a huge amount of data (petabytes of images, say) you might not want to do research on all of those data at once. It would be inefficient and expensive. So you might use sampling to create a smaller research quality data set that is easier to use and manipulate. The data will also be easier to use if they are (a) stored in an easy-to-access database with well-documented security systems, (b) accompanied by a data dictionary that makes it clear what the data are and where they come from, or (c) supported by a clear set of tutorials on how to perform common tasks on the data.&lt;/p&gt;

&lt;p&gt;In our EHR example you might include a data dictionary that describes the dates of the data pull, the types of data pulled, the type of processing performed, and pointers to the scripts that pulled the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is valid and accurately reflects the underlying data collection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data can be invalid for a whole host of reasons. The data could be incorrectly formatted, input with error, could change over time, could be mislabeled, and more. All of these problems can occur on the original data pull or over time. Data can also be out of date as new data becomes available.&lt;/p&gt;

&lt;p&gt;The research quality database should include only data that has been checked, validated, cleaned and QA&amp;rsquo;d so that it reflects the real state of the world. This process is not a one-time effort, but an ongoing set of code, scripts, and processes that ensure the data you use for research are as accurate as possible.&lt;/p&gt;

&lt;p&gt;In the EHR example there would be a series of data pulls, code to perform checks, and comparisons to additional data sources to validate the values, levels, variables, and other components of the research quality database.&lt;/p&gt;
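
&lt;p&gt;In practice, those checks can start as simple as a script of assertions run after every pull. A minimal sketch, reusing the hypothetical &lt;code&gt;visits&lt;/code&gt; table from the earlier sketch:&lt;/p&gt;

&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# minimal sketch of automated checks run after every data pull
stopifnot(
  all(!is.na(visits$patient_id)),  # every visit links to a patient
  all(visits$n_scripts &amp;gt;= 0),     # counts are non-negative
  nrow(visits) &amp;gt; 0                # the pull actually returned rows
)&lt;/code&gt;&lt;/pre&gt;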

&lt;p&gt;&lt;strong&gt;Has potential biases clearly documented&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A research quality data set is by definition a derived data set. So there is a danger that problems with the data will be glossed over since it has been processed and is easy to use. To avoid this problem, there has to be documentation on where the data came from, what happened to them during processing, and any potential problems with the data.&lt;/p&gt;

&lt;p&gt;With our EHR example this could include issues about how patients come into the system, what procedures can be billed (or not), what data was ignored in the research quality database, what are the time periods the data were collected, and more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Combines all the relevant data types you need to answer questions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One big difference between a research quality data set/database and a raw database, or even a general purpose tidy data set, is that it merges all of the relevant data you need to answer specific questions, even if they come from distinct sources. Research quality data pulls together, and makes easy to access, all the information you need to answer your questions. This could still be in the form of a relational database - but the database&amp;rsquo;s organization is driven by the research question, rather than by other purposes.&lt;/p&gt;

&lt;p&gt;For example, EHR data may already be stored in a relational database. But it is stored in a way that makes it easy to understand billing and patient flow in a clinic. To answer a research question you might need to combine the billing data with patient outcome data and prescription fulfillment data, all processed and indexed so they are either already merged or can be easily merged, as in the sketch below.&lt;/p&gt;
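
&lt;p&gt;Here is a minimal sketch of that merging step with dplyr; the tables and columns are hypothetical stand-ins for the processed billing, outcome, and prescription data:&lt;/p&gt;

&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)

# hypothetical research-quality tables sharing a patient_id key
billing &amp;lt;- tibble(patient_id = c(1, 2, 3), billed = c(120, 80, 250))
outcomes &amp;lt;- tibble(patient_id = c(1, 2, 3), relapse = c(FALSE, TRUE, FALSE))
scripts &amp;lt;- tibble(patient_id = c(1, 3), filled = c(TRUE, FALSE))

# because every table is indexed by the same key, merging is routine
billing %&amp;gt;%
  left_join(outcomes, by = &amp;quot;patient_id&amp;quot;) %&amp;gt;%
  left_join(scripts, by = &amp;quot;patient_id&amp;quot;)&lt;/code&gt;&lt;/pre&gt;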

&lt;p&gt;&lt;strong&gt;Why do this?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So why build a research quality data set? It sure seems like a lot of work (and it is!). The reason is that this work will always be done, one way or the other. If you don&amp;rsquo;t invest in making a research quality data set up front, you will do it as a thousand papercuts over time. Each time you need to answer a new question or try a different model you&amp;rsquo;ll be slowed down by the friction of identifying, creating, and checking a new cleaned up data set. On the one hand this amortizes the work over the course of many projects. But by doing it piecemeal you also dramatically increase the chance of an error in processing, slow down the research process, and make the investment for any individual project much higher.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem Forward Data Science&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you want help planning or building a research quality data set or database, we can help at &lt;a href=&#34;https://simplystatistics.org/2019/05/20/i-co-founded-a-company-meet-problem-forward-data-science/&#34;&gt;Problem Forward Data Science&lt;/a&gt;. Get in touch here: &lt;a href=&#34;https://problemforward.typeform.com/to/L4h89P&#34;&gt;https://problemforward.typeform.com/to/L4h89P&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>I co-founded a company! Meet Problem Forward Data Science</title>
      <link>https://simplystatistics.org/2019/05/20/i-co-founded-a-company-meet-problem-forward-data-science/</link>
      <pubDate>Mon, 20 May 2019 00:00:00 +0000</pubDate>
      
      <guid>https://simplystatistics.org/2019/05/20/i-co-founded-a-company-meet-problem-forward-data-science/</guid>
      <description>&lt;p&gt;I have some exciting news about something I&amp;rsquo;ve been working on for the last year or so. I started a company! It&amp;rsquo;s called &lt;a href=&#34;https://www.problemforward.com/&#34;&gt;Problem Forward&lt;/a&gt; data science.  I&amp;rsquo;m pumped about this new startup for a lot of reasons.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;My co-founder is one of my family&amp;rsquo;s closest friends, Jamie McGovern, who has more than two decades of experience in the consulting world and whom I&amp;rsquo;ve known for 15 years.&lt;/li&gt;
&lt;li&gt;We are creating a cool new model of &amp;ldquo;data scientist as a service&amp;rdquo; (more on that below)&lt;/li&gt;
&lt;li&gt;We have a &lt;a href=&#34;https://simplystatistics.org/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward/&#34;&gt;problem forward, not solution backward&lt;/a&gt; approach to data science that grew out of the Hopkins philosophy of data science.&lt;/li&gt;
&lt;li&gt;We are headquartered in East Baltimore and are creating awesome new tech jobs in a place where they haven&amp;rsquo;t been historically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Problem Forward, Not Solution Backward&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We have always had a &amp;ldquo;problem forward, not solution backward&amp;rdquo; approach to statistics, machine learning and data here at Simply Stats. This has grown out of the Johns Hopkins Biostatistics philosophy of starting with the public health or medical problem you care about and working back to the statistical models, software, and tools you need to solve it.&lt;/p&gt;

&lt;p&gt;This idea is so important to us that it is in the name of the company. When we work with people, our first goal is to find out the problems and questions they genuinely care about, then work backward to figure out how to solve them. We don&amp;rsquo;t come in with a particular predetermined algorithm or strategy. One of the first questions we ask people isn&amp;rsquo;t about data at all. It is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What question do you wish you could answer about your business (ignoring if you have the data or not to answer it yet)?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;My favorite example of this is &lt;a href=&#34;https://en.wikipedia.org/wiki/Moneyball_(film)&#34;&gt;Moneyball&lt;/a&gt;. This is one of the classic stories about how the Oakland A&amp;rsquo;s used data to gain a unique advantage. But one of the key messages of this story that often gets missed is that the data weren&amp;rsquo;t unique to the A&amp;rsquo;s! Everyone had the same data; the A&amp;rsquo;s just started with a &lt;em&gt;problem&lt;/em&gt; they needed to solve. They needed to find a way to win games that wasn&amp;rsquo;t as expensive. Then they moved forward to the data and realized that on-base percentage was cheaper than home runs. So the A&amp;rsquo;s used a &amp;ldquo;problem forward, not solution backward&amp;rdquo; approach to data analysis.&lt;/p&gt;

&lt;p&gt;Using this approach we have worked with companies with a wide variety of needs. Our main capabilities are in data strategy, data cleaning and research quality database generation, modeling and machine learning, and data views through dashboards, reports, and presentations.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://user-images.githubusercontent.com/1571674/58030374-6eb90300-7aec-11e9-8d3d-bf4ef225af61.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Scientist as a Service&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are a huge number of data science platform companies out there. Some of them are producing awesome tools, but as any serious data analyst will tell you, we are years from automating real data science. We are only very recently seeing formal definitions of what &lt;a href=&#34;https://arxiv.org/abs/1904.11907&#34;&gt;success of a data analysis&lt;/a&gt; even means! So it isn&amp;rsquo;t surprising when general purpose platforms like IBM Watson &lt;a href=&#34;https://www.statnews.com/2017/09/05/watson-ibm-cancer/&#34;&gt;struggle with specific problems - the problem isn&amp;rsquo;t specified clearly enough for a platform to solve it yet&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The reason there are so many platforms is that it&amp;rsquo;s easy to sell the &amp;ldquo;cool&amp;rdquo; part of the problem - say, building an AI to classify images or drive a car. But often the deeper problem is (a) figuring out what you even want to, or can, say with a data set, (b) collecting a set of disorganized data, (c) getting buy-in from groups with different motivations and data sets, (d) organizing ugly data from different sources or finding new data you might need, and (e) putting your answers in context. These problems are more like &amp;ldquo;glue&amp;rdquo; that comes between each of the platforms. We have a phrase we like to use:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To solve your data problem you need a person, not a platform&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So we have set up a &amp;ldquo;platform&amp;rdquo; that lets you scale up and down the number of team members you have solving data problems, just like you would scale up and down the number of servers or tools that you use on AWS.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://user-images.githubusercontent.com/1571674/58030422-855f5a00-7aec-11e9-8d15-96074d74d7dd.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;This means if you are an early stage startup we can help you scale data science before you can afford to hire a whole team. Even if you are a non-profit or a small academic group we can scale up or down to suit your needs. And if you are a big company we can provide utility data science for projects with tight deadlines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Working with friends and building East Baltimore&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The thing that gets me most excited about this new adventure is working with my really close friend Jamie. It&amp;rsquo;s been huge for me to learn about the ins and outs of starting and running a business with someone who has decades of experience in the consulting industry.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s also exciting to be able to headquarter the company right in East Baltimore and to work to upskill and develop talent here in a neighborhood I care about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Like what you hear? Get in touch&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you are looking for data science work, we&amp;rsquo;d love to hear from you! Whether you are an academic, a non-profit, a small startup, or a big business, our utility model means we can work with you.&lt;/p&gt;

&lt;p&gt;If you are interested in working with us contact us here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&#34;https://problemforward.typeform.com/to/L4h89P&#34;&gt;https://problemforward.typeform.com/to/L4h89P&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Generative and Analytical Models for Data Analysis</title>
      <link>https://simplystatistics.org/2019/04/29/generative-and-analytical-models-for-data-analysis/</link>
      <pubDate>Mon, 29 Apr 2019 00:00:00 +0000</pubDate>
      
      <guid>https://simplystatistics.org/2019/04/29/generative-and-analytical-models-for-data-analysis/</guid>
      <description>

&lt;p&gt;Describing how a data analysis is created is a topic of keen interest to me and there are a few different ways to think about it. Two different ways of thinking about data analysis are what I call the “generative” approach and the “analytical” approach. Another, more informal, way that I like to think about these approaches is as the “biological” model and the “physician” model. Reading through the literature on the process of data analysis, I’ve noticed that many seem to focus on the former rather than the latter and I think that presents an opportunity for new and interesting work.&lt;/p&gt;

&lt;h2 id=&#34;generative-model&#34;&gt;Generative Model&lt;/h2&gt;

&lt;p&gt;The generative approach to thinking about data analysis focuses on the process by which an analysis is created. Developing an understanding of the decisions that are made to move from step one to step two to step three, etc. can help us recreate or reconstruct a data analysis. While reconstruction may not exactly be the goal of studying data analysis in this manner, having a better understanding of the process can open doors with respect to improving the process.&lt;/p&gt;

&lt;p&gt;A key feature of the data analytic process is that it typically takes place inside the data analyst’s head, making it impossible to directly observe. Measurements can be taken by asking analysts what they were thinking at a given time, but that can be subject to a variety of measurement errors, as with any data that depend on a subject’s recall. In some situations, partial information is available, for example if the analyst writes down the thinking process through a series of reports or if a team is involved and there is a record of communication about the process. From this type of information, it is possible to gather a reasonable picture of “how things happen” and to describe the process for generating a data analysis.&lt;/p&gt;

&lt;p&gt;This model is useful for understanding the “biological process”, i.e. the underlying mechanisms for how data analyses are created, sometimes referred to as &lt;a href=&#34;https://projecteuclid.org/euclid.ss/1009212754&#34;&gt;“statistical thinking”&lt;/a&gt;. There is no doubt that this process has inherent interest for both teaching purposes and for understanding applied work. But there is a key ingredient that is lacking and I will talk about that more below.&lt;/p&gt;

&lt;h2 id=&#34;analytical-model&#34;&gt;Analytical Model&lt;/h2&gt;

&lt;p&gt;A second approach to thinking about data analysis ignores the underlying processes that serve to generate the data analysis and instead looks at the observable outputs of the analysis. Such outputs might be an R markdown document, a PDF report, or even a slide deck (Stephanie Hicks and I refer to this as the &lt;a href=&#34;https://arxiv.org/abs/1903.07639&#34;&gt;analytic container&lt;/a&gt;). The advantage of this approach is that the analytic outputs are real and can be directly observed. Of course, what an analyst puts into a report or a slide deck typically only represents a fraction of what might have been produced in the course of a full data analysis. However, it’s worth noting that the elements placed in the report are the &lt;em&gt;cumulative result&lt;/em&gt; of all the decisions made through the course of a data analysis.&lt;/p&gt;

&lt;p&gt;I’ve used music theory as an analogy for data analysis &lt;a href=&#34;https://youtu.be/qFtJaq4TlqE&#34;&gt;many times before&lt;/a&gt;, mostly because&amp;hellip;it’s all I know, but also because it really works! When we listen to or examine a piece of music, we have essentially no knowledge of how that music came to be. We can no longer interview Mozart or Beethoven about how they wrote their music. And yet we are still able to do a few important things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Analyze and Theorize&lt;/em&gt;. We can analyze the music that we hear (and its written representation, if available) and talk about how different pieces of music differ from each other or share similarities. We might develop a sense of what is commonly done by a given composer, or across many composers, and evaluate which outputs are more successful or less successful. It’s even possible to draw connections between different kinds of music separated by centuries. None of this requires knowledge of the underlying processes.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Give Feedback&lt;/em&gt;. When students are learning to compose music, an essential part of that training is to play the music in front of others. The audience can then give feedback about what worked and what didn’t. Occasionally, someone might ask “What were you thinking?” but for the most part, that isn’t necessary. If something is truly broken, it’s sometimes possible to prescribe some corrective action (e.g. “make this a C chord instead of a D chord”).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are even two whole podcasts dedicated to analyzing music&amp;mdash;&lt;a href=&#34;https://stickynotespodcast.libsyn.com&#34;&gt;Sticky Notes&lt;/a&gt; and &lt;a href=&#34;https://www.switchedonpop.com&#34;&gt;Switched on Pop&lt;/a&gt;&amp;mdash;and they generally do not interview the artists involved (this would be particularly hard for Sticky Notes). By contrast, the &lt;a href=&#34;http://songexploder.net&#34;&gt;Song Exploder&lt;/a&gt; podcast takes a more “generative approach” by having the artist talk about the creative process.&lt;/p&gt;

&lt;p&gt;I referred to this analytical model for data analysis as the “physician” approach because it mirrors, in a basic sense, the problem that a physician confronts. When a patient arrives, there is a set of symptoms and the patient’s own report/history. Based on that information, the physician has to prescribe a course of action (usually, to collect more data). There is often little detailed understanding of the biological processes underlying a disease, but the physician may have a wealth of personal experience, as well as a literature of clinical trials comparing various treatments from which to draw. In human medicine, knowledge of biological processes is critical for designing new interventions, but may not play as large a role in prescribing specific treatments.&lt;/p&gt;

&lt;p&gt;When I see a data analysis, as a teacher, a peer reviewer, or just a colleague down the hall, it is usually my job to give feedback in a timely manner. In such situations there usually isn’t time for extensive interviews about the development process of the analysis, even though that might in fact be useful. Rather, I need to make a judgment based on the observed outputs and perhaps some brief follow-up questions. To the extent that I can provide feedback that I think will improve the quality of the analysis, it is because I have a sense of what makes for a &lt;em&gt;successful&lt;/em&gt; analysis.&lt;/p&gt;

&lt;h2 id=&#34;the-missing-ingredient&#34;&gt;The Missing Ingredient&lt;/h2&gt;

&lt;p&gt;Stephanie Hicks and I have discussed what are the elements of a data analysis as well as what might be the &lt;a href=&#34;https://arxiv.org/abs/1903.07639&#34;&gt;principles&lt;/a&gt; that guide the development of an analysis. In a &lt;a href=&#34;https://arxiv.org/abs/1904.11907&#34;&gt;new paper&lt;/a&gt;, we describe and characterize the &lt;em&gt;success&lt;/em&gt; of a data analysis, based on a matching of principles between the analyst and the audience. This is something I have touched on previously, both &lt;a href=&#34;https://simplystatistics.org/2018/04/17/what-is-a-successful-data-analysis/&#34;&gt;in this blog&lt;/a&gt; and on &lt;a href=&#34;http://nssdeviations.com/&#34;&gt;my podcast with Hilary Parker&lt;/a&gt;, but in a generally more hand-wavey fashion. Developing a more formal model, as Stephanie and I have done here, has been useful and has provided some additional insights.&lt;/p&gt;

&lt;p&gt;For both the generative model and the analytical model of data analysis, the missing ingredient was a clear definition of what made a data analysis &lt;em&gt;successful&lt;/em&gt;. The other side of that coin, of course, is knowing when a data analysis has failed. The analytical approach is useful because it allows us to separate the analysis from the analyst and to categorize analyses according to their observed features. But the categorization is “unordered” unless we have some notion of success. Without a definition of success, we are unable to formally criticize analyses and explain our reasoning in a logical manner.&lt;/p&gt;

&lt;p&gt;The generative approach is useful because it reveals potential targets of intervention, especially from a teaching perspective, in order to improve data analysis (just like understanding a biological process). However, without a concrete definition of success, we don’t have a target to strive for and we do not know how to intervene in order to make genuine improvement. In other words, there is no outcome on which we can “train our model” for data analysis.&lt;/p&gt;

&lt;p&gt;I mentioned above that there is a lot of focus on developing the generative model for data analysis, but comparatively little work developing the analytical model. Yet, both models are fundamental to improving the quality of data analyses and learning from previous work. I think this presents an important opportunity for statisticians, data scientists, and others to study how we can characterize data analyses based on observed outputs and how we can draw connections between analyses.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Tukey, Design Thinking, and Better Questions</title>
      <link>https://simplystatistics.org/2019/04/17/tukey-design-thinking-and-better-questions/</link>
      <pubDate>Wed, 17 Apr 2019 00:00:00 +0000</pubDate>
      
      <guid>https://simplystatistics.org/2019/04/17/tukey-design-thinking-and-better-questions/</guid>
      <description>&lt;p&gt;Roughly once a year, I read John Tukey’s paper &lt;a href=&#34;https://projecteuclid.org/euclid.aoms/1177704711&#34;&gt;“The Future of Data Analysis”&lt;/a&gt;, originally published in 1962 in the &lt;em&gt;Annals of Mathematical Statistics&lt;/em&gt;. I’ve been doing this for the past 17 years, each time hoping to really understand what it was he was talking about. Thankfully, each time I read it I seem to get &lt;em&gt;something&lt;/em&gt; new out of it. For example, in 2017 I wrote &lt;a href=&#34;https://youtu.be/qFtJaq4TlqE&#34;&gt;a whole talk&lt;/a&gt; around some of the basic ideas.&lt;/p&gt;
&lt;p&gt;Well, it’s that time of year again, and I’ve been doing some reading.&lt;/p&gt;
&lt;p&gt;Probably the most famous line from this paper is&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Far better an approximate answer to the &lt;em&gt;right&lt;/em&gt; question, which is often vague, than an &lt;em&gt;exact&lt;/em&gt; answer to the wrong question, which can always be made precise.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The underlying idea in this sentence arises in at least two ways in Tukey’s paper. First is his warning that statisticians should not be called upon to produce the “right” answers. He argues that the idea that statistics is a “monolithic, authoritarian structure designed to produce the ‘official’ results” presents a “real danger to data analysis”. Second, Tukey criticizes the idea that much of statistical practice centers around optimizing statistical methods around precise (and inadequate) criteria. One can feel free to identify a method that minimizes mean squared error, but that should not be viewed as the &lt;em&gt;goal&lt;/em&gt; of data analysis.&lt;/p&gt;
&lt;p&gt;But that got me thinking—what &lt;em&gt;is&lt;/em&gt; the ultimate goal of data analysis? In 64 pages of writing, I’ve found it difficult to identify a sentence or two where Tukey describes the ultimate goal, why it is we’re bothering to analyze all this data. It occurred to me in this year’s reading of the paper that maybe the reason Tukey’s writing about data analysis is often so confusing to me is that his goal is actually quite different from that of the rest of us.&lt;/p&gt;
&lt;div id=&#34;more-questions-better-questions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;More Questions, Better Questions&lt;/h2&gt;
&lt;p&gt;Most of the time in data analysis, we are trying to answer a question with data. I don’t think it’s controversial to say that, but maybe that’s the wrong approach? Or maybe, that’s what we’re &lt;em&gt;not&lt;/em&gt; trying to do at first. Maybe what we spend most of our time doing is figuring out a better question.&lt;/p&gt;
&lt;p&gt;Hilary Parker and I have discussed at length the idea of design thinking on &lt;a href=&#34;http://nssdeviations.com&#34;&gt;our podcast&lt;/a&gt;. One of the fundamental ideas from design thinking involves identifying the problem. It’s the first “diamond” in the &lt;a href=&#34;https://simplystatistics.org/2018/09/14/divergent-and-convergent-phases-of-data-analysis/&#34;&gt;“double diamond”&lt;/a&gt; approach to design.&lt;/p&gt;
&lt;p&gt;Tukey describes the first three steps in a data analysis as:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Recognition of problem&lt;/li&gt;
&lt;li&gt;One technique used&lt;/li&gt;
&lt;li&gt;Competing techniques used&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In other words, try one approach, then try a bunch of other approaches! You might be thinking, why not just try &lt;em&gt;the best&lt;/em&gt; approach (or perhaps the &lt;em&gt;right&lt;/em&gt; approach) and save yourself all that work? Well, that’s the kind of path you go down when you’re trying to answer the question. Stop doing that! There are two reasons why you should stop thinking about answering the question:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;You’re probably asking the wrong question anyway, so don’t take yourself too seriously;&lt;/li&gt;
&lt;li&gt;The “best” approach is only defined as “best” according to some arbitrary criterion that probably isn’t suitable for your problem/question.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;After thinking about all this I was inspired to draw the following diagram.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;./post/2019-04-17-tukey-design-thinking-and-better-questions_files/question_evidence.png&#34; alt=&#34;Strength of Evidence vs. Quality of Question&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Strength of Evidence vs. Quality of Question&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The goal in this picture is to get to the upper right corner, where you have a high quality question and very strong evidence. In my experience, most people assume that they are starting in the bottom right corner, where the quality of the question is at its highest. In that case, the only thing left to do is to choose the optimal procedure so that you can squeeze as much information out of your data. The reality is that we almost always start in the bottom left corner, with a vague and poorly defined question and a similarly vague sense of what procedure to use. In that case, what’s a data scientist to do?&lt;/p&gt;
&lt;p&gt;In my view, the most useful thing a data scientist can do is to devote serious effort towards improving the quality and sharpness of the question being asked. On the diagram, the goal is to move us as much as possible to the right hand side. Along the way, we will look at data, we will consider things outside the data like context, resources and subject matter expertise, and we will try a bunch of different procedures (some optimal, some less so).&lt;/p&gt;
&lt;p&gt;Ultimately, we will develop some idea of what the data tell us, but more importantly we will have a better sense of what kinds of questions we can ask of the data and what kinds of questions we actually want to have answered. In other words, we can learn more about ourselves by looking at the data.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;exploring-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Exploring the Data&lt;/h2&gt;
&lt;p&gt;It would seem that the message here is that the goal of data analysis is to explore the data. In other words, data analysis &lt;em&gt;is&lt;/em&gt; exploratory data analysis. Maybe this shouldn’t be so surprising given that Tukey &lt;a href=&#34;https://en.wikipedia.org/wiki/Exploratory_data_analysis&#34;&gt;wrote the book&lt;/a&gt; on exploratory data analysis. In this paper, at least, he essentially dismisses other goals as overly optimistic or not really meaningful.&lt;/p&gt;
&lt;p&gt;For the most part I agree with that sentiment, in the sense that looking for “the answer” in a single set of data is going to result in disappointment. At best, you will accumulate evidence that will point you in a new and promising direction. Then you can iterate, perhaps by collecting new data, or by asking different questions. At worst, you will conclude that you’ve “figured it out” and then be shocked when someone else, looking at another dataset, concludes something completely different. In light of this, discussions about p-values and statistical significance are very much beside the point.&lt;/p&gt;
&lt;p&gt;The following is from the very opening of Tukey’s book &lt;em&gt;Exploratory Data Analysis&lt;/em&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(Note that the all caps are originally his!) Given this, it’s not too surprising that Tukey seems to equate exploratory data analysis with essentially all of data analysis.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;better-questions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Better Questions&lt;/h2&gt;
&lt;p&gt;There’s one story that, for me, totally captures the spirit of exploratory data analysis. Legend has it that Tukey once asked a student what were the benefits of the &lt;a href=&#34;https://en.wikipedia.org/wiki/Median_polish&#34;&gt;median polish technique&lt;/a&gt;, a technique he invented to analyze two-way tabular data. The student dutifully answered that the benefit of the technique is that it provided summaries of the rows and columns via the row- and column-medians. In other words, like any good statistical technique, it &lt;em&gt;summarized the data&lt;/em&gt; by reducing it in some way. Tukey fired back, saying that this was incorrect—the benefit was that the technique created &lt;em&gt;more data&lt;/em&gt;. That “more data” was the residuals that are leftover in the table itself after running the median polish. It is the residuals that really let you learn about the data, discover whether there is anything unusual, whether your question is well-formulated, and how you might move on to the next step. So in the end, you got row medians, column medians, &lt;em&gt;and&lt;/em&gt; residuals, i.e. more data.&lt;/p&gt;
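&lt;p&gt;The story is easy to see in code. Base R ships the technique as &lt;code&gt;medpolish()&lt;/code&gt; in the &lt;code&gt;stats&lt;/code&gt; package; a quick sketch on a made-up two-way table shows the row effects, the column effects, &lt;em&gt;and&lt;/em&gt; the residual table Tukey was pointing to:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;# Median polish on a toy two-way table. The values are made up; the
# point is what comes back: summaries AND a full table of residuals.
x &lt;- matrix(c(14,  7,  8,
              11,  5,  9,
              40, 30, 31),
            nrow = 3, byrow = TRUE)
fit &lt;- medpolish(x)
fit$row        # row effects
fit$col        # column effects
fit$residuals  # the leftover table -- where the surprises live
&lt;/code&gt;&lt;/pre&gt;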
&lt;p&gt;If a good exploratory technique gives you more data, then maybe good exploratory data analysis gives you more questions, or &lt;em&gt;better&lt;/em&gt; questions. More refined, more focused, and with a sharper point. The benefit of developing a sharper question is that it has a greater potential to provide discriminating information. With a vague question, the best you can hope for is a vague answer that may not lead to any useful decisions. Exploratory data analysis (or maybe just &lt;em&gt;data analysis&lt;/em&gt;) gives you the tools that let the data guide you towards a better question.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Interview with Abhi Datta</title>
      <link>https://simplystatistics.org/2019/04/01/interview-with-abhi-datta/</link>
      <pubDate>Mon, 01 Apr 2019 00:00:00 +0000</pubDate>
      
      <guid>https://simplystatistics.org/2019/04/01/interview-with-abhi-datta/</guid>
      <description>&lt;p&gt;&lt;em&gt;Editor’s note: This is the next in our series of interviews with early career statisticians and data scientists. Today we are talking to Abhi Datta about his work in large scale spatial analysis and his interest in soccer! Follow him on Twitter at &lt;a href=&#34;https://twitter.com/datta_science&#34;&gt;@datta_science&lt;/a&gt;. If you have recommendations of an (early career) person in academics or industry you would like to see promoted, reach out to Jeff (@jtleek) on Twitter!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;SS: Do you consider yourself a statistician, biostatistician, data scientist, or something else?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;AD: That is a difficult question for me, as I enjoy working on theory, methods and data analysis and have co-authored diverse papers ranging from theoretical expositions to being primarily centered around a complex data analysis. My research interests also span a wide range of areas. A lot of my work on spatial statistics is driven by applications in environmental health and air pollution. Another significant area of my research is developing Bayesian models for epidemiological applications using survey data.&lt;/p&gt;

&lt;p&gt;I would say what I enjoy most is developing statistical methodology motivated by a complex application where current methods fall short, applying the method for analysis of the motivating data, and trying to see if it is possible to establish some guarantees about the method through a combination of theoretical studies and empirical experiments that will help to generalize applicability of the method for other datasets. Of course, not all projects involve all the steps, but that is my ideal workflow. Not sure what that classifies me as.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;SS: How did you get into statistics? What was your path to ending up at Hopkins?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;AD: I was born and grew up in Kolkata, India. I had the option of going for an engineering, medical, or statistics undergrad. I chose statistics, persuaded by my appreciation for mathematics and the reputation of the statistics program at the Indian Statistical Institute (ISI), Kolkata. I completed my undergrad (BStat) and Masters (MStat) in Statistics at ISI, and I’m thankful I made that choice, as those 5 years at ISI played a pivotal role in my life. Besides getting rigorous training in the foundations of statistics, most importantly, I met my wife &lt;a href=&#34;https://twitter.com/DrDebashreeRay&#34;&gt;Dr. Debashree Ray&lt;/a&gt; at ISI.&lt;/p&gt;

&lt;p&gt;After my Masters, I had a brief stint in the finance industry, working for 2 years at Morgan Stanley (in Mumbai and then in New York City) before I joined the PhD program at the Division of Biostatistics at the University of Minnesota (UMN) in 2012, where Debashree was pursuing her PhD in Biostatistics. I had initially planned to work in Statistical Genetics, as I had done a research project in that area in my Master’s. However, I explored other research areas in my first year and ended up working on spatial statistics under the supervision of my advisor Dr. &lt;a href=&#34;http://sudipto.bol.ucla.edu/&#34;&gt;Sudipto Banerjee&lt;/a&gt;, and on high-dimensional data with my co-advisor &lt;a href=&#34;http://users.stat.umn.edu/~zouxx019/&#34;&gt;Dr. Hui Zou&lt;/a&gt; from the Department of Statistics in Minnesota. I graduated from Minnesota in 2016 and joined Hopkins Biostat as an Assistant Professor in the Fall of 2016.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;SS: You work on large scale spatio-temporal modeling - how do you speed up computations for the bootstrap when the data are very large?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;AD: A main computational roadblock in spatio-temporal statistics is working with very big covariance matrices that strain the memory and computing resources typically available in personal computers. &lt;a href=&#34;https://amstat.tandfonline.com/doi/abs/10.1080/01621459.2015.1044091&#34;&gt;Previously&lt;/a&gt;, I developed nearest neighbor Gaussian Processes (NNGP) &amp;ndash; a Bayesian hierarchical model for inference on massive geospatial datasets. One issue with hierarchical Bayesian models is their reliance on long sequential MCMC runs. The bootstrap, unlike MCMC, can be implemented in an embarrassingly parallel fashion. However, for geospatial data, all observations are correlated across space, prohibiting direct resampling for the bootstrap.&lt;/p&gt;

&lt;p&gt;In a &lt;a href=&#34;https://onlinelibrary.wiley.com/doi/abs/10.1002/sta4.184&#34;&gt;recent work&lt;/a&gt; with my student Arkajyoti Saha, we proposed a semi-parametric bootstrap for inference on large spatial covariance matrices. We use sparse Cholesky factors of spatial covariance matrices to approximately decorrelate the data before resampling for bootstrap. Arkajyoti has implemented this in an R-package &lt;a href=&#34;https://cran.r-project.org/web/packages/BRISC/index.html&#34;&gt;BRISC: Bootstrap for rapid inference on spatial covariances&lt;/a&gt;. BRISC is extremely fast and at the time of publication, to my knowledge, it was the only R-package that offered inference on all the spatial covariance parameters without using MCMC. The package can also be used simply for super-fast estimation and prediction in geo-statistics.&lt;/p&gt;
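&lt;p&gt;The core idea is easy to sketch. BRISC works with sparse NNGP Cholesky factors; the dense toy version below (my own illustration, not the package’s implementation) just shows the decorrelate-then-resample logic on a simulated field:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;# Toy dense version of the decorrelate-then-resample idea (the real
# method uses sparse NNGP Cholesky factors instead of a dense chol()).
set.seed(1)
n      &lt;- 200
coords &lt;- cbind(runif(n), runif(n))
D      &lt;- as.matrix(dist(coords))
Sigma  &lt;- exp(-D / 0.2)                   # exponential covariance, range 0.2
L      &lt;- t(chol(Sigma))                  # lower-triangular Cholesky factor
y      &lt;- L %*% rnorm(n)                  # one simulated spatial field

w      &lt;- drop(solve(L, y))               # whitened: approximately independent
y_star &lt;- L %*% sample(w, replace = TRUE) # one bootstrap replicate of y
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because each replicate can be generated and re-fit independently, the procedure is embarrassingly parallel, which is what makes it a fast alternative to a long sequential MCMC run.&lt;/p&gt;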

&lt;p&gt;&lt;em&gt;SS: You have a cool paper on mapping local and global trait variation in plant distributions, how did you get involved in that collaboration? Does your modeling have implications for people studying the impacts of climate change?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;AD: In my final year of PhD at UMN, I was awarded the Inter-Disciplinary Doctoral Fellowship &amp;ndash; a fantastic initiative by the graduate school at UMN providing research and travel funding, and office space to work with an inter-disciplinary team of researchers on a collaborative project. In my IDF, mentored by &lt;a href=&#34;https://www-users.cs.umn.edu/~baner029/&#34;&gt;Dr. Arindam Banerjee&lt;/a&gt; and &lt;a href=&#34;https://www.forestry.umn.edu/people/peter-b-reich&#34;&gt;Dr. Peter Reich&lt;/a&gt;, I worked with a group of climate modelers, ecologists, and computer scientists from several institutions on a project whose eventual goal is to improve carbon projections from climate models.&lt;/p&gt;

&lt;p&gt;&lt;a href=&#34;https://www.pnas.org/content/114/51/E10937&#34;&gt;The paper&lt;/a&gt; you mention was aimed at improving the global characterization of plant traits (measurements). This is important because plant trait values are critical inputs to climate models. Even TRY, the largest plant trait database, offers poor geographical coverage, with little or no data across many large geographical regions. We used the fast NNGP approach I had been developing in my PhD to spatially gap-fill the plant trait data and create a global map of important plant traits with proper uncertainty quantification. The collaboration was a great learning experience for me in how to conduct a complex data analysis and how to communicate with scientists.&lt;/p&gt;

&lt;p&gt;Currently, we are looking at ways to incorporate the uncertainty quantified trait values as inputs to Earth System Models (ESMs) – the land component of climate models. We hope that replacing single trait values with entire trait distributions as inputs to these models will help to better propagate the uncertainty and improve the final model projections.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;SS: What project has you most excited at the moment?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;AD: There are two. I have been working with &lt;a href=&#34;https://twitter.com/ScottZeger&#34;&gt;Dr. Scott Zeger&lt;/a&gt; on a project led by &lt;a href=&#34;https://www.jhsph.edu/faculty/directory/profile/2047/agbessi-amouzou&#34;&gt;Dr. Agbessi Amouzou&lt;/a&gt; in the Department of International Health at Hopkins, aiming to estimate the cause-specific mortality fractions (CSMF) of child mortality in Mozambique using family questionnaire data (verbal autopsy). Verbal autopsies are often used as a surrogate for full autopsy in many countries, and there are software tools that use these questionnaire data to predict a cause for every death. However, these tools are usually trained on some standard training data and yield inaccurate predictions in the local context. This problem is a special case of transfer learning, where a model trained on data representing a standard population offers poor predictive accuracy when specific populations are of interest. We have developed a general approach for transfer learning of classifiers that uses the predictions from these verbal autopsy tools and limited full autopsy data from the local population to provide improved estimates of cause-specific mortality fractions. The approach is very general, offers a parsimonious model-based solution to transfer learning, and can be used in any other classification-based application.&lt;/p&gt;
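&lt;p&gt;To give a loose numeric flavor of that idea (a deliberate toy simplification on my part, not the actual Bayesian model): if limited paired data tell you how often the algorithm predicts cause j when the true cause is i, you can invert that misclassification matrix to correct the algorithm’s population-level cause fractions:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;# Simplified calibration: p_pred = t(M) %*% p_true, so invert t(M).
# All numbers are invented; the real method is a hierarchical model
# with proper uncertainty quantification.
causes &lt;- c(&#34;A&#34;, &#34;B&#34;, &#34;C&#34;)
M &lt;- matrix(c(0.8, 0.1, 0.1,   # rows: true cause; cols: predicted cause
              0.2, 0.7, 0.1,
              0.1, 0.2, 0.7),
            nrow = 3, byrow = TRUE,
            dimnames = list(true = causes, pred = causes))
p_pred &lt;- c(A = 0.50, B = 0.30, C = 0.20)  # algorithm output at population level
p_true &lt;- solve(t(M), p_pred)              # corrected CSMF estimate
p_true / sum(p_true)
&lt;/code&gt;&lt;/pre&gt;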

&lt;p&gt;The second project involves creating high-resolution space-time maps of particulate matter (PM2.5) in Baltimore. Currently a network of low-cost air pollution monitors is being deployed in Baltimore that promises to offer air pollution measurements at a much higher geospatial resolution than what is provided by EPA’s sparse regulatory monitoring network. I was awarded a Bloomberg American Health Initiative Spark award for working with &lt;a href=&#34;https://www.jhsph.edu/faculty/directory/profile/2928/kirsten-koehler&#34;&gt;Dr. Kirsten Koehler&lt;/a&gt; in the Department of Environmental Health and Engineering to combine the low-cost network data, the sparse EPA data and other land-use covariates to create uncertainty quantified maps of PM2.5 at an unprecedented spatial resolution. We have just started analyzing the first two months of data and I’m really looking forward to help create the end-product and understand how PM2.5 levels vary across the different neighborhoods in Baltimore.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;SS: You have an interest in soccer and spatio temporal models have played an increasing role in soccer analytics. Have you thought about using your statistics skills to study soccer or do you try to avoid mixing professional work and being a fan?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;AD: Yes, I’m an avid soccer fan. I travelled to Brazil in 2014 and Russia in 2018 to watch live games at the World Cups. It also unfortunately means that I set my alarm earlier on weekends than on weekdays, as the European league games start pretty early in US time.&lt;/p&gt;

&lt;p&gt;However, until recently I’ve been largely ignorant of the applications of spatio-temporal statistics in soccer analytics. I just finished teaching a Spatial Statistics course, and one of the students presented &lt;a href=&#34;https://arxiv.org/abs/1702.05662&#34;&gt;fascinating work&lt;/a&gt; he has done on predicting players’ scoring abilities using spatial statistics. I certainly plan to read more of the literature on this and maybe one day I can contribute. Till then, I remain a fan.&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>