The Berkeley Science Review

Reproducible and collaborative: Teaching the data science life

Sarah Hillenbrand — Wed, 11 Jun 2014 22:59:33 +0000

The Spring 2014 issue of the BSR made a splash with Anum Azam’s feature piece, ‘The First Rule of Data Science.’ For many PhD students in the sciences, grim job prospects in academia make the advent of data science seem like a godsend. And so far, blasting the surfeit of PhDs at Big Data like so much buckshot seems to be working out—scientists, after all, are innovators. But, as the field evolves, it will become important to think about how to train a new generation of data scientists. Like Whitney Houston, I believe the children are our future. It is up to us to teach them well and let them lead the way.

Housebreaking the puppies

Philip Stark, a professor of statistics and the department chair here at Berkeley, puts it a slightly different way: “It’s hard to teach old dogs new tricks, so let’s work with the puppies. What are new tricks for my generation are basic housebreaking for this generation.” Stark, along with Aaron Culich of the recently-launched Berkeley Research Computing program (BRC), tested the waters of data science pedagogy with their 2013 fall semester course, Reproducible & Collaborative Data Science (STAT 157). “The idea,” according to Stark, “was to have a project-oriented statistics class that was geared towards solving a real-world scientific problem.”

But the first rule of data science, as Azam wrote, is that you don’t ask how to define it. The very ethos of data science runs counter to the way most stressed-out undergraduates study. In most classes, the point is to race to the top of the curve, so most students appreciate clearly defined goals in the form of syllabi, deadlines, review sheets and rubrics. Anyone who has taught a course at Berkeley will tell you they’re all too familiar with the “Will this be on the final?” mentality—Stark included. In fact, an attempt to visit willthisbeonthefinal.com will redirect you to Stark’s homepage. “The way that our educational system is set up right now really does focus on the individual, the individual’s performance, the individual’s ability to either synthesize or regurgitate or master some corpus of material,” Stark says. Data science, however, requires a more collaborative approach.

Collaboration and reproducibility are inextricably linked. “The first step to making science reproducible is to build good habits,” Culich explains. “Your most important collaborator is your future self. It’s important to make a workflow that you can use time and time again, and even pass on to others in such a way that you don’t have to be there to walk them through it.” So how did they teach the skills required for collaboration?

Lofty goals, and no road map

In class, students’ goal was to use data provided by the Southern California Earthquake Center to create a better way of predicting earthquakes. This goal had no clear road map that could be broken down into simple steps and thrown onto a syllabus. Students knew that sufficient progress could lead to a publication, the holy grail of scientific research, but it wasn’t clear how to get there. “It was a little bit hectic in the beginning,” says Christina Ho, a student in the class. “I think a lot of us weren’t really sure what we were supposed to be doing.”

Culich would concur: “Because our goals weren’t laid out for us step-by-step, one of the first things we had to do was define them, and so we had to just jump right in and talk.” They started with the basics: what are the units? What was recorded? How was it measured? What do we mean when we say “magnitude”? Then, students were faced with decisions: how to transform, clean, and normalize the data into something meaningful and usable. Finally, they thought about how to turn data into evidence by applying the information they had towards solving a problem. And every step of the way, they had to work reproducibly. “We were willing to go super slowly and make slow progress in our science just to make sure that our work was easy for someone else to pick up,” says Teresa Tenfelder, a student.

Working reproducibly: the wave of the future

Reproducible computational research is a concept that can be traced back to John Claerbout, an emeritus professor of geophysics at Stanford. By insisting that his grad students work reproducibly, he enabled new students to become productive in 1-2 weeks instead of the common 1-2 years: they could build on the work of the previous cohort almost immediately. Before the emphasis on reproducibility, incoming students spent far more time and effort figuring out where their predecessor had left off. This timesuck, Claerbout thought, could be reduced if people developed good habits in the way they coded—producing code that was well-documented, could run on any platform, whose function and outputs were clear given a well-defined set of inputs, and that could be modified and extended easily. In short, code should be geared towards making results reproducible.

Reproducibility lies at the heart of the twin zeitgeists of data science and open access, currently revolutionizing the way science is done. “It’s not just what you know anymore, it’s how you work,” as Stark puts it. “We have largely abandoned the scientific method, especially in ‘big science.’ We’ve traded ‘show me’ for ‘trust me.’ The whole point of science is that you shouldn’t have to trust the experts. Some people are afraid of working reproducibly—exposing their work—because others might find a mistake. But here’s a different way to think about it: If I say ‘trust me’ and I’m wrong, I’m untrustworthy. If I say ‘here’s my work’ and it’s wrong, I’m honest and human.”

In the end, the class implemented a predictor that was much simpler than the current state-of-the-art Epidemic-Type Aftershock Sequence (ETAS) model, and whose performance was almost indistinguishable from that of ETAS. Their software pipeline was automated and deposited in a GitHub repository, so anyone else can easily pick up where they left off. That is an important contribution to the field, and one that can be learned and built upon quickly because of the students’ attention to detail.

Working collaboratively instead of beating the curve

Working reproducibly presented one set of challenges, while working collaboratively presented another. Realizing that harmonious, effective teamwork doesn’t simply happen, Kristina Kangas, a graduate student in Integrative Biology and the GSI for the course, administered the VARK questionnaire. This test draws on principles of team building, identifying students’ distinct preferences, motivations, goals, and working styles. The results were used to help students find roles that were right for them. This approach allows room for innovation from a group of individuals with diverse intelligences, differing markedly from typical courses where students must all learn the same thing.

Students became data curators, analyzers, visualizers, or presenters. As goals and plans emerged, groups evolved. Some students were more comfortable than others in the role of thinking through the big picture. Other students found themselves in the “I just want to be told what to do” camp. These students often had a skill to bring to the table, such as being a Python wizard, or having a strong background in statistics. Many students stepped into the roles that they felt comfortable in. Others took on roles outside their comfort zone that provided the training they sought.

To create a reproducible data analysis pipeline, teams had to have conversations about how the pieces would be handed off from one person to another. Was a data curator getting the data the analyzer needs? Was it in a usable format? Even deciding who to talk to about handing off one’s portion could be a challenge. “Working with people is hard,” Tenfelder confesses. “The collaboration was the most difficult, but at the same time, the most valuable, aspect of the class.” Seeking opportunities, discussing plans, and building collaborative relationships are skills that are crucial to success in a data science career, and yet these skills are seldom taught in the classroom. In the end, Jody Zhang, a student, found it worthwhile, adding that “the frustration turned into an opportunity to really learn and grow a lot.”

Culich believes that what students learned about teamwork in data science closely matches what he learned from a YouTube sensation known simply as Dancing Guy. The dancing guy shows us what it is to be a leader. First, a leader must be easy to follow. And second, a leader understands the value of their first follower. Without them, the leader is just a lone nut who isn’t connecting to anyone. In a rapidly evolving field like data science, identifying good ideas and cultivating good work relationships can turn out to be just as important as cultivating good work habits.

Preparing students for jobs in a changing climate

“At a minimum, students learned good work habits and how to learn new programming languages,” says Stark. “I hope the experience prepared them to work in an environment where they’re just told, ‘Here are some tools. Now go be brilliant.’” Code reviews, where students discuss a snippet of code, were a way of learning brilliance by example. “There’s a difference between learning a programming language and knowing how to program,” says Culich. Instruction tends to be geared towards the former. Seeing a problem solved in an unfamiliar programming language can empower students to copy, paste, and start modifying. Where there is no need to reinvent the wheel, it can be clunky and inefficient to do so.

Students also learned that, to be effective collaborators, they had to identify the resources available to them. Data science is all over campus; you just have to know where to look. The Geospatial Information Facility (GIF), the Statistical Computing Facility (SCF), the D-Lab, the Data Lab, the Berkeley Institute for Data Science (BIDS), and yes, Stack Overflow, were all tremendously helpful. By reaching out and navigating this rich landscape of resources, students connected with smart people all over campus and beyond. Developing a taste for proactive networking and autodidacticism is, of course, a major advantage. Culich explains, “No one takes typing classes anymore. But to say, ‘I don’t type’ wouldn’t work now. I think we’re creating habits, a culture, an ecosystem that I think will eventually be a part of the fabric of the way science is done. Data science seems chaotic and disruptive now, but best practices will emerge.”

Currently, data science is doing a bang-up job of catching what falls out of the leaky pipeline of misfit PhDs. But let’s face it. The practice of retrofitting the highly specialized skillset of your average PhD is a klugey hack. We can, and should, invest in training new data scientists, but data science courses are still in their trial-and-error phase. Student-led courses can serve as laboratories for redefining the problems at hand. And Stark & Culich’s course serves as a valuable jumping-off point. At least this much is clear: we can equip students for this climate by teaching them to think for themselves, ask for what they need, and code for the long run.

When it comes to making science reproducible and transparent, as well as ditching the “will this be on the final?” attitude, further restructuring the classroom experience may help squash some of the unwanted side effects of the educational rat race’s competitive atmosphere. As far as Big Data is concerned, working collaboratively and reproducibly is non-negotiable. Without these practices in place, we cannot turn data into evidence, making all the programming prowess in the world useless.

The post Reproducible and collaborative: Teaching the data science life appeared first on The Berkeley Science Review.

Bacteria in your frosted (snow) flakes!

David Litt — Mon, 09 Jun 2014 17:11:57 +0000

If you’re from the Northeast or the Midwest (like I am), then you probably know that children enjoy playing in the snow. Snowball fights, snow angels, snow forts, catching snowflakes on their tongues: children revel in the soft, downy, cold crystals. As scientists, our idea of fun has changed. Following a loss of innocence, there are adults who are more interested in how those flakes of ice form than in playing in them. In 2008, a study published in Science showed that snow samples collected in areas around the world (including Antarctica) have DNA-containing cells in the center of some of the flakes, and that many of these cells have ice-nucleating proteins¹. Why are there bacteria in snowflakes? What are these ice-nucleating proteins? But most importantly, what will overprotective mothers who learn that their children are being exposed to bacteria in the snow do with this new knowledge (just kidding)?

Perplexed by these questions, I dug through the literature and found a wonderful review article that discussed the “Bioprecipitation Cycle”². It turns out that plant biologists were the first to look into this phenomenon. Normally, pure water won’t turn into ice until about -40 °C (for those of you on the English system, that happens to be -40 °F). Even though water “freezes” at 0 °C (32 °F), it can’t freeze unless there is some small particle (dust, protein, etc.) that will force water molecules into the correct arrangement to form a crystal. These alien particles are called nucleators, because they help cause formation of (or “nucleate”) the ice crystal. In the presence of nucleators, ice can form at temperatures as balmy as -1 °C for some ice-nucleating proteins, and as warm as -5 °C to -11 °C for other nucleators, such as pollen or inorganic molecules.

Because these bacteria have proteins that are more effective than any other molecule at making water turn to ice at relatively mild temperatures, they catalyze the formation of frost. This frost breaks open cells on the surface of the leaf, and the bacteria can then feast on the released proteins and sugars. This adaptation has enabled several species of plant bacteria, notably Pseudomonas syringae, to banquet on plants. If causing frost and ice formation to get at the juicy insides of plants was all the bacteria could do, that would be cool enough.
So where do snowfall and rain come in?

It turns out that bacteria are not only on plant leaves and in the soil; they also float through the air, and on a windy day they are whisked up into the upper atmosphere. In fact, during the warm daylight hours (10am-2pm to be precise)², there is a net upward movement of bacteria in the atmosphere. At these altitudes bacteria are exposed to extreme weather conditions and are removed from their terrestrial food sources. Further studies have shown that there is a net downward movement of bacteria during precipitation such as rainfall. Therefore, it is in the bacteria’s best interest to somehow cause rain. The few bacteria with ice-nucleating proteins are uniquely situated to seed precipitation. In the cold atmosphere they can get into a cloud and start to freeze the droplets, causing snowflakes. If this process occurs in the summer, the snowflakes will melt into rain; in winter, they will gracefully dance into some small child’s upturned and gaping mouth. There is plenty of evidence for this theory: there is a reported large downward flux of bacteria with ice-nucleating proteins during rainfall, and rainfall is more plentiful over irrigated and farmed areas (as more bacteria are thrown into the air due to farming activities).

Not all snowflakes are nucleated by bacteria, and there are very few species of bacteria capable of this amazing feat. But it is a mind-boggling idea to think about how versatile one protein can be—ripping open plants for food, and contributing to the precipitation cycle in an important way by causing snow and rain. It leaves hope and wonder that there are many more fantastic proteins and species waiting to be discovered.

For Further Reference:
Listen to the Radiolab Podcast on NPR.

1. Christner, B. C., Morris, C. E., Foreman, C. M., Cai, R. & Sands, D. C. Ubiquity of biological ice nucleators in snowfall. Science 319, 1214 (2008).

2. Morris, C. E., Georgakopoulos, D. G. & Sands, D. C. Ice nucleation active bacteria and their potential role in precipitation. J. Phys. IV 121, 87–103 (2004).


5 Things I wish I knew about Berkeley at least 2 years ago (instead of just 1)

Kristina Kangas — Fri, 06 Jun 2014 16:50:46 +0000

UC Berkeley is a huge melting pot of interesting people, groups, communities, assemblies, departments, services…the list could continue ad nauseam. We’ve seen the impressive interplay of collaborations between these departments put together by Natalia Bilenko, with data scraped from the PubMed database. This article strives to share some of the communities that I have only learned about this past year (after being an undergraduate at Berkeley and now an upcoming fourth-year graduate student). With such valuable resources available to us, we can assuredly make this collaborative network even more impressive in the years to come.

1. Center for Studies in Higher Education (CSHE)

For those of you interested in staying onboard SS Academia, you may be intrigued by the empirical research about the culture of higher education. Parsing apart the “hearsay” from the “here: evidence”, CSHE offers a wide range of current projects, ranging from The Future of Scholarly Communication to Student Experience in the Research University. Given that this research informs policy decisions, I think it’s important to understand it and be able to reference it in all those committees you’ll be sitting on as a faculty member.

2. D-Lab

Perhaps you’ve heard about BIDS (Berkeley Institute for Data Science) from the BSR article by Mikel Delgado. Did you know that they offered classes and consulting? As a truly positive force behind the sc(i)en(c)es, I’m sure we can expect great collaborations to stem from this campus gem.

3. Office of Scholarly Communication (OSC)

The OSC impressed me with their lists of campus resources, answering questions you might have about open access and publishing in general (e.g. peer-review, metrics, and much, much more). Looks like they’re keeping it real in their evaluation and interpretation of academic practices across fields.

4. PeerLibrary (Open Access @ Cal)

Have you deposited your publication(s) in PeerLibrary yet? (Yes, it is only one word: PeerLibrary). You can like them on Facebook, follow them on Twitter, watch, star, and fork their repository on GitHub, and even watch their promotional video on Vimeo. All the early adopters are doing it.

5. Science Libraries @ UCB

Whaaat? You forgot everything about the library from orientation? You need more information? Well, if these tutorials and guides aren’t enough, maybe you would be surprised to know there is even help with data management and curation from Science Libraries @UCB, including links for general overview and even international resources.


Notes on Replication from an Un-Tenured Social Psychologist

Psych Your Mind — Wed, 04 Jun 2014 21:07:49 +0000

This week’s edition of Psych Wednesdays was written by Michael Kraus. It was originally published on Psych Your Mind on May 27, 2014.

Last week the special issue on replication at the Journal of Social Psychology arrived to an explosion of debate (read the entire issue here and read original author Simone Schnall’s commentary on her experience with the project and Chris Fraley’s subsequent examination of ceiling effects). The debate has been happening everywhere–on blogs, on Twitter, on Facebook, and in the halls of your psychology department (hopefully).

Make no mistake, this is great news for the field of social psychology: This is the first time, since I joined the field, that one of our own journals has devoted attention to examining the footing on which so much of our science is grounded (if you haven’t had a chance, please congratulate the curators of this effort [@lakens and @BrianNosek] on Twitter or elsewhere). The whole procedure involved in the replication efforts has been made publicly available. Read it here and be encouraged by the fairness and transparency of this scientific enterprise.

There are some significant short-term challenges we are facing: As we move ahead with high-quality attempts at direct replication, some of our original research will not replicate (we don’t know how much unless we conduct these replications). This is going to be personally painful for the original researchers who conducted the non-replicated work (though see this blog post for reasons why single replications are starting points, not conclusions). Ultimately, the mere act of conducting direct replications will improve our science, and we as researchers need to keep this in mind when facing non-replications of our research, or even when evaluating others’ non-replicated research. We must not allow single non-replications to damage the reputations of our colleagues. Think of our science as cumulative and the definitive answers as somewhere down the road.

There is one disappointing part of this whole event that I’d like to point out. In the past, the research integrity of social psychologists has been called into question by acts of fraud, by science journalists, and by psychologists in other fields. This is the first (or at least the most visible) time that card-carrying social psychologists are holding up our own research to scrutiny. In response to this effort by members of our own field–who have undertaken this work at some cost to their own reputations–we have resorted to engaging in debate tactics that would probably be best characterized as childish (one example explained here). I can understand defensiveness when it is people outside our discipline who comment on our methods (the discussion about research integrity is loudest among non-social psychologists at the University of Illinois, so I understand the defensive reaction), but by sniping within our field we are missing a great opportunity to put social psychology at the forefront of scientific integrity.

We should all reflect for a moment on how wonderful it is that the examination of our field is FINALLY IN THE HANDS OF SOCIAL PSYCHOLOGISTS. I think we need to welcome this change and we need to promote (not obstruct) direct replications that are supervised by experts of our own guild (Danny Kahneman suggested as much here). If we can’t support direct replication attempts that are rigorously vetted prior to data collection, open and transparent, and conducted by our own guild, then I’m not sure we’ll ever be a replicable science. That would be a tragedy.

I’d love to read your comments here or on Twitter (@mwkraus).


What are errorbars, anyway?

Chris Holdgraf — Mon, 02 Jun 2014 18:05:48 +0000

**note – this is a follow-up post to an article I wrote a few weeks back on the importance of uncertainty. A lot of you loved the idea of quantifying uncertainty, but had a lot of questions about the various ways that we can do so. This post hopes to answer some of those questions**

A few weeks back I posted a short diatribe on the merits and pitfalls of including your uncertainty, or error, in any argument you make. Some of you were quick to sing your praise of our friendly standard deviants, while others were more hesitant to jump on the confidence bandwagon.

However, one common thread amongst the responses was a general uncertainty about uncertainty. That is – what exactly we mean when we say “error bars”. It turns out that error bars are quite common, though quite varied in what they represent.
This post is a follow-up which aims to answer two distinct questions: what exactly are error bars, and which ones should you use? So, without further ado:

What the heck are error bars anyway?

Well, technically this just means “bars that you include with your data that convey the uncertainty in whatever you’re trying to show”. However, there are several standard definitions, three of which I will cover here.
First, we’ll start with the same data as before.

Ok, so this is the raw data we’ve collected. As we can see, the values seem to be spread out around a central location in each case. The question that we’d like to figure out is: are these two means different? If they are, then we’re all going to switch to banana-themed theses.

Upon first glance, you might want to turn this into a bar plot:

However, as noted before, this leaves out a crucial factor: our uncertainty in these numbers. Remember how the original set of datapoints was spread around its mean. Here, we have lost all of that information.

So, let’s add some error bars!

The standard deviation

The simplest thing that we can do to quantify variability is calculate the “standard deviation”. Basically, this tells us how much the values in each group tend to deviate from their mean. Here is its equation:

As with most equations, this has a pretty intuitive breakdown:
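As a stand-in for the equation, here is a sketch of the usual textbook definition of the sample standard deviation. The notation ($x_i$ for the individual data points, $\bar{x}$ for the group mean, $n$ for the number of points in the group) and the $n - 1$ denominator are assumptions made here, not necessarily the form the original figure used:

```latex
s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2}
```

Reading from the inside out: take each point's distance from the mean, square it so that positive and negative deviations don't cancel, average the squares (dividing by $n - 1$ rather than $n$), and take the square root to get back to the original units.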

And here’s what these bars look like when we plot them with our data:

OK, not so bad, but is standard deviation really what we want? We’ve just seen that this tells us about the variability of each point around the mean. However, we don’t really care about comparing one point to another, we actually want to compare one *mean* to another. Which brings us to…

Standard error

Closely related to the standard deviation, the standard error gets more specifically at the kinds of questions you’re usually asking with data. We want to compare means, so rather than reporting variability in the data points, let’s report the variability we’d expect in the means of our groups. This is known as the standard error.

Now, here is where things can get a little convoluted, but the basic idea is this: we’ve collected one data set for each group, which gave us one mean in each group. If we wanted to calculate the variability in the means, then we’d have to repeat this process a bunch of times, calculating the group means each time.

However, we don’t want to do this, so what can we do?

One option is to make an assumption. Specifically, we might assume that if we were to repeat this experiment many, many times, then the group means would roughly follow a normal distribution. Note – this is a big assumption, but it may be reasonable if we expect the Central Limit Theorem to hold in this case.

If we assume that the means are distributed according to a normal distribution, then the standard error (aka, the variability of group means) is defined as this:
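A sketch of the standard formula, assuming $s$ denotes the standard deviation of a group and $n$ the number of points in it (notation chosen here for illustration):

```latex
\mathrm{SE} = \frac{s}{\sqrt{n}}
```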

Basically, this just says “take the general variability of the points around their group means (the standard deviation), and divide this number by the square root of the number of points that we’ve collected”.
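To make the two quantities concrete, here is a minimal sketch in Python with NumPy. The data are made-up numbers, and the helper name `sd_and_se` is hypothetical; none of this comes from the original analysis.

```python
import numpy as np

# Hypothetical measurements for two groups (made-up numbers for illustration).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=2.0, size=40)
group_b = rng.normal(loc=6.0, scale=2.0, size=40)


def sd_and_se(values):
    """Return (standard deviation, standard error of the mean)."""
    values = np.asarray(values, dtype=float)
    sd = values.std(ddof=1)          # sample standard deviation (n - 1 denominator)
    se = sd / np.sqrt(values.size)   # standard error: SD divided by sqrt(n)
    return sd, se


for name, data in [("A", group_a), ("B", group_b)]:
    sd, se = sd_and_se(data)
    print(f"group {name}: mean={data.mean():.2f}  SD={sd:.2f}  SE={se:.2f}")
```

With 40 points per group, the standard error comes out a bit more than six times smaller than the standard deviation, since the square root of 40 is about 6.3.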

This one also makes intuitive sense. If we increase the number of samples that we take each time, then the mean will be more stable from one experiment to another. Don’t believe me? Here are the results of repeating this experiment a thousand times under two conditions: one where we take a small number of points (n) in each group, and one where we take a large number of points.

See how the means are clustered more tightly around their central number when we have a large n? This represents a low standard error. In other words, with a large n, the mean we get from any one experiment is likely to be close to the mean we’d get from any other, so it is more reliable.
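If you'd like to try the repeated-experiment idea yourself, here is a rough sketch. It assumes normally distributed data with a made-up true mean of 5 and standard deviation of 2, and arbitrary sample sizes of 10 and 200; the original simulation's settings aren't given, so these are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(42)


def simulate_means(n_points, n_experiments=1000, true_mean=5.0, true_sd=2.0):
    """Run many simulated experiments and return the mean from each one."""
    samples = rng.normal(true_mean, true_sd, size=(n_experiments, n_points))
    return samples.mean(axis=1)


means_small_n = simulate_means(n_points=10)
means_large_n = simulate_means(n_points=200)

# Spread of the means across experiments; theory predicts roughly true_sd / sqrt(n).
print("small n: SD of means =", means_small_n.std(ddof=1))
print("large n: SD of means =", means_large_n.std(ddof=1))
```

The spread of the means in the large-n condition should come out several times smaller than in the small-n condition, which is the clustering described above.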

This sounds like a much better choice for plotting along with our data, because it directly answers the question “how certain are we that the means we’ve recorded are the “true” values?”

Let’s see what this looks like:

Wahoo! We’ve made our error bars even tinier. That’s no coincidence. Look at the equation for the standard error. If we increase n, we will always make the standard error smaller. As such, whenever we have more than one data point, the standard error will always be smaller than the standard deviation.

OK, there’s one more problem that we actually introduced earlier. As I said before, we made an *assumption* that means would be roughly normally distributed across many experiments. But do we *really* know that this is the case? Is there a better way that we could give our uncertainty in group means, without assuming that things are normally distributed? Fortunately, there is…

Confidence Intervals (with bootstrapping)

Confidence intervals have been theorized for quite some time, but they’ve only become practical in the past twenty years or so as a common tool in data analysis. I’m going to talk about one way to calculate confidence intervals, a method known as “bootstrapping”. Basically, this uses the following logic:

I’m interested in finding the variability of our sample means across many experiments, but I don’t want to make too many assumptions about how the means would be distributed across many experiments. What can I do? Bootstrapping says “well, if I had the ‘full’ data set, aka every possible datapoint that I could collect, then I could just ‘simulate’ doing many experiments by taking a random sample from that ‘full dataset’.”

However, I don’t have the full dataset, but I do have the sample that I’ve collected. As such, I’m going to say that the closest thing I’ve got to the true distribution of all the data is the sample that I’ve already got. Thus, I can simulate a bunch of experiments by taking samples from my own data *with replacement*. I’ll calculate the mean of each sample, and see how variable the means are across all of these simulations.

OK, that sounds really complicated, but it’s quite simple to do on our own. Let’s try it.

We need to:

1. Take a bunch of samples from our original dataset, with replacement, each the same size as the original. “With replacement” just means that we can sample the same datapoint more than one time.
2. For each sample, calculate the mean.
3. Look at all of the means to figure out how variable they are.

Doing this requires a bit of computation, so I’m not going to go into the details here. However, at the end of the day what you get is quite similar to the standard error. Why is this? Because in this case, we know that our data are normally distributed (we created them that way). However, in real life we can’t be as sure of this, and confidence intervals will tend to be more different from standard errors than they are here.
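For readers who do want to see the computation, the three steps above can be sketched in a few lines of Python with NumPy. This is one common variant (the percentile bootstrap), not necessarily the exact procedure used for the figures; the function name `bootstrap_ci`, the number of resamples, and the data are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)


def bootstrap_ci(data, n_resamples=5000, level=0.95):
    """Percentile bootstrap confidence interval for the mean of `data`."""
    data = np.asarray(data, dtype=float)
    n = data.size
    # Step 1: draw resamples of the same size as the data, with replacement.
    idx = rng.integers(0, n, size=(n_resamples, n))
    resamples = data[idx]
    # Step 2: compute the mean of each resample.
    boot_means = resamples.mean(axis=1)
    # Step 3: summarize the variability of those means with percentiles.
    alpha = (1.0 - level) / 2.0
    lower, upper = np.percentile(boot_means, [100 * alpha, 100 * (1 - alpha)])
    return lower, upper


# Made-up, normally distributed sample for illustration.
sample = rng.normal(loc=5.0, scale=2.0, size=50)
lo, hi = bootstrap_ci(sample)
se = sample.std(ddof=1) / np.sqrt(sample.size)
print(f"bootstrap 95% CI:  ({lo:.2f}, {hi:.2f})")
print(f"mean +/- 1.96 SE:  ({sample.mean() - 1.96 * se:.2f}, {sample.mean() + 1.96 * se:.2f})")
```

Because the sample here is drawn from a normal distribution, the two printed intervals should land close to each other, which matches the point made above about normally distributed data.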

The way to interpret confidence intervals is that if we were to repeat the above process many times (collecting a sample, generating a bunch of “bootstrap” samples from that sample, then taking the 2.5th and 97.5th percentiles of the bootstrap means), then about 95% of the time, our interval would contain the “true” mean of the data.
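This interpretation can be checked by simulation, since in a simulation we get to pick the “true” mean. The sketch below assumes normally distributed data, a sample size of 30, and 400 repeated experiments, all arbitrary choices; the helper name `bootstrap_interval` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
TRUE_MEAN = 5.0   # known only because we are simulating the data ourselves


def bootstrap_interval(sample, n_resamples=500):
    """95% percentile bootstrap interval for the mean of one sample."""
    idx = rng.integers(0, sample.size, size=(n_resamples, sample.size))
    boot_means = sample[idx].mean(axis=1)
    return np.percentile(boot_means, [2.5, 97.5])


n_experiments = 400
hits = 0
for _ in range(n_experiments):
    sample = rng.normal(TRUE_MEAN, 2.0, size=30)   # one simulated experiment
    lower, upper = bootstrap_interval(sample)
    hits += int(lower <= TRUE_MEAN <= upper)       # did the interval catch the truth?

coverage = hits / n_experiments
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should land in the neighborhood of 0.95, though with samples this small the percentile bootstrap can fall a little short of its nominal level.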

So what should I use?

At the end of the day, there is no one-size-fits-all method that you should always use when showing error bars. The most important thing above all is that you’re explicit about what kind of error bars you show. The biggest confusions come when authors show standard error but readers assume it’s standard deviation, etc.

That said, in general you want to show the standard error or 95% confidence intervals rather than the standard deviation. This is because these are closer to the question you’re really asking: how reliable is the mean of my sample?

As for choosing between these two, I’ve got a personal preference for confidence intervals as it seems like they’re the most flexible and require fewer assumptions than the standard error. I’m sure that statisticians will argue this one until the cows come home, but again, being clear is often more important than being perfectly correct.

So that’s it for this short round of stats-tutorials. There are many other ways that we can quantify uncertainty, but these are some of the most common that you’ll see in the wild. If you’ve got a different way of doing this, we’d love to hear from you. Until then, may the p-values be ever in your favor.

The post What are errorbars, anyway? appeared first on The Berkeley Science Review.

So you want to be a Professor?

Adam Hill — Fri, 30 May 2014 16:52:42 +0000

Every day, another essay lands on the virtual doorstep, detailing the challenges facing academia and decrying the “failed system.” Some essays offer solutions, others prognosticate doom. After all of the essays and after the majority of a decade (or more) spent in the reality of academia, do you still want to be a professor? I’m here to give you a primer on how to do it—or at least correct some common misconceptions about the process.

How do you know? I recently started a tenure-track job myself and have been exposed to the world of the hiring committee, so some fellow scientists have come to me for advice before they head to the academic job market.

Can it be done? Professor has joined the list of jobs, along with astronaut, race car driver, and movie star, that fall into a lonely category: “no one” has a hope of landing such a job, and yet the world still has astronauts, race car drivers, movie stars, and yes, even professors. (I exaggerate a bit for comic effect here, but the truth is that, in spite of the bad odds, tenure-track positions do still exist. Put it in the category of “tough but possible.”)

Is applying worth it? If being faculty is your dream, this might be a no-brainer. But the process of preparing an application, even if you don’t wind up submitting it, is intrinsically valuable: it forces you to assemble your credentials and consider the aspects of teaching, research, and administration that matter most to you.

When do I start? There are two answers: If you see academia in your future, consider every day and every action as part of your application package. Elegant papers and studious lab work will be part of that application, but so will experience mentoring younger students, participating in committees, and networking at conferences. It’s also never too early to start building relationships with senior faculty who could someday write letters of recommendation; this means going beyond just your advisor and thesis committee. No matter your age, if you see academia in your future, focus your efforts on the skills and achievements expected of a candidate.

From an immediate standpoint, the academic application “season” typically begins in late summer and continues through the fall and early winter. From senior faculty, I’ve sometimes heard the goal: “Have your application done in April for applying in September.” I’ve typically understood this to mean that the time between “finishing” the application in April and applying in September should be spent improving it and seeking advice on research and personal statements. From a more practical standpoint, beginning in May and working consistently over the course of the summer typically works well.

What materials do I prepare? First and foremost, keep your curriculum vitae up to date; try to update it at least once per semester. Remember, there’s more to a great CV than papers. By assembling your CV, you have an opportunity to think on what aspects of your credentials seem thin. Not enough awards? Not enough papers? Not enough mentoring? These questions become a lot easier to answer when the whole list is assembled.

In addition, you’ll prepare statements describing your teaching philosophy and your research plans, as well as college and graduate school transcripts. (More about those in a bit.)

Though you won’t be writing them yourself, you’ll also want to find faculty who are familiar with you and your achievements and can recommend you. That will often mean not only writing letters, but also potentially speaking on the phone with members of the search committee.

How do I prepare? Now that you know what you need to be preparing, go forth and seek advice. I’ll offer some here, but it’s well worth your time to speak to not only faculty, but also other graduate students and post-docs who are hunting for faculty jobs themselves. Part of success comes in synthesizing the wisdom of a multitude of sometimes-contradictory sources. In the realm of more codified advice, the AAAS’s science careers site offers a great set of straightforward advice for academic job searching and job getting.

To go with these resources, I’ll offer a bit of my own advice: in writing your materials, be specific, not in the sense of scientifically over-detailed, but rather in the sense of making your documents specific to the job for which you’re applying. The teaching philosophy of someone aiming for a position at an R1 research university will likely be very different from that of a candidate aiming for a position at a small liberal arts college. (And the research statement will be thoroughly different between the two.) Show that you’ve considered the details of the position you want.

What happens next? It’s entirely within the realm of possibility that nothing else happens. Maybe you don’t quite get around to sending your materials in this year, or maybe this just isn’t the year. As a tool for figuring out your future direction, preparing your application is still fantastically useful.

But “next” might mean a phone interview and, eventually, an on-site interview. If you’re lucky enough to move forward with the process, remember: your application package gets your foot in the door, but the interview gets you the job. Once a hiring committee is satisfied that you’re qualified for the job, your challenge is to convince them that you possess the ineffable qualities that can’t be represented in a CV, and that you’re the person they want as a coworker. Another hint: part of interviewing well is being able to talk conversationally about your project and your interests. I personally like to offer a standing invitation to buy lunch for anyone who will discuss a paper (of their choice) with me.

And though I’m sure it goes without saying, interviewing means presenting a professional appearance, too: Get a haircut. Wear a suit. Make sure you can walk comfortably in your dress shoes all day. Be the person they want to hire.

So that’s it? Applying is a challenging and time-consuming process; I don’t mean the brevity of this article to imply otherwise. Ultimately, it’s a process (in part) of self-discovery.

Adam Hill graduated from UC Berkeley in 2013. He is currently Assistant Professor of Chemistry at St. Lawrence University, a small liberal arts college in northern New York. He is the blog editor emeritus of the Berkeley Science Review, and co-author of the photography blog Decaseconds.

The post So you want to be a Professor? appeared first on The Berkeley Science Review.

Love songs from a spider

Levi Gadye — Wed, 28 May 2014 15:31:59 +0000

To hear more about spider love songs from Erin Brandt herself, attend Sound Off! this Thursday, May 29th, from 7-10pm at Genentech Hall in Mission Bay.

Amongst the abundance of communication happening all around us, messages don’t always get through. Communication can be tricky to understand, especially in animals - it can be hard to pin down just why one interaction failed while a similar interaction succeeded.

Behavioral ecologists in the lab of Damian Elias study communication in a variety of spiders that exhibit some of the most complex invertebrate courtship behaviors. Male jumping spiders in the genus Habronattus perform “complex vibratory songs, tightly coupled to a coordinated dance, to give information to females,” according to Erin Brandt, a graduate student in the Elias lab. If the male’s performance passes muster, the female may become receptive to the male’s advances, but her preferences when making this choice are far from obvious – or easy to observe. Check out this short video of a male jumping spider performing its courtship ritual (and be sure to turn on your sound!):

Erin’s field work takes her into the sky islands of Arizona, where she collects as many spiders as possible from a particular species, and then brings them back to the lab for behavioral observation. These sky islands provide the spiders with a hospitable environment compared to the surrounding desert, and because of the isolated nature of every island, spider populations on each sky island exhibit their own distinct courtship rituals.

The Santa Rita Mountains, seen across the Tucson Valley from the Santa Catalina Mountains in Arizona. Erin conducts some of her field work on jumping spiders in these ‘sky islands,’ which provide a safe haven for life that cannot thrive on the desert floor. (Wikimedia Commons)

Erin is interested in how environment shapes the song and dance of each jumping spider population, a question that will continue to evolve as climate change drives the spiders further up each mountain. Interestingly, while male behavior and appearance in jumping spiders are both species-specific, females across these species look largely identical. The Elias lab must go to great lengths to ensure that mating pairs come from just one species. “If the female rejects the male, you have to make sure it’s because he’s a bad male, and not just because he’s the wrong species,” says Erin.

Erin’s spiders of interest are about half a centimeter long, and males possess tiny, colorful ornaments on their bodies that may or may not factor into the female’s decision. “We take a two-pronged approach. First, we take lots of video, hoping to quantify movement, and we also use spectrometry to try to quantify colors and look at patterns, though it’s difficult to measure a feature 0.5 mm long with a 3 mm probe,” she says.

A male peacock spider, displaying its namesake fan appendage. (Wikimedia Commons).

Parallel work being done in the Rosenblum Lab focuses on the brilliantly patterned fan structures found on peacock spiders, another genus of jumping spiders, which also use visual presentations to woo females. Maddie Girard, a graduate student in the Rosenblum lab, is attempting to relate male peacock spider behavior and ornamentation with female peacock spider responses. “Peacock spiders are very diverse in their distribution, habitat, morphology as well as courtship behavior, and I’m investigating the social and ecological factors that are driving the immense diversity seen in this group,” she says. Though the brain is hidden from experimental view in both the Elias and Rosenblum labs’ spiders, the genetic and behavioral diversity of each genus, across a range of habitats, can provide new insights into the evolution of spider communication.

But visual displays aren’t the only way that spiders communicate – many of the Elias lab’s arthropod subjects also ‘sing,’ albeit with their appendages, rather than with vocal cords. “For the vibratory component, we have a Doppler laser vibrometer, this fancy piece of equipment that measures tiny displacements in a substrate using a laser,” says Erin. “The vibrometer then converts those displacements into a sound file you can analyze just like any other sound.”

The Elias Lab’s Doppler laser vibrometer, mounted on a vibration-isolation table. This device measures minute vibrations created by male jumping spiders as they try to woo their female counterparts. (Levi Gadye)

The Doppler laser vibrometer has allowed the Elias lab to observe minute vibrations created not only by jumping spiders, but also by other invertebrates, such as scorpions. “We point the vibrometer at anything we can find because so many invertebrates use vibratory, or substrate-borne, communication,” Erin says. Other collaborators are using the Elias lab’s vibrometer to study crabs and hummingbirds, the latter of which knock their wings together to make a buzzing noise.

Erin is also interested in the organs that allow females to ‘hear’ the male vibrations. Somewhat analogous to the taste receptors found on the legs of fruit flies, female jumping spiders possess ‘ears’ on their legs, ears with a sensitivity high enough to detect minute vibrations in the ground. But Erin’s work can’t address all elements of spider communication – these spiders have two eyes that have evolved to provide vision equivalent to that of a cat, and are also likely using pheromones and smell to complement vibrations and dances during mating. “The world of invertebrates is so wide open, even if everyone switched to invertebrates there’d be plenty to go around,” she says. “Invertebrates do so many crazy things, and they do them so differently across so many species, that you could keep going on for careers and careers worth of work and never figure it all out.”

To hear more about spider love songs from Erin Brandt herself, attend Sound Off! this Thursday, May 29th, from 7-10pm at Genentech Hall in Mission Bay. Sound Off!, presented by Carry the One Radio, is open to the public and features interviews with UC Berkeley’s Dr. Frederic Theunissen and Erin Brandt, and interactive exhibits including an echolocation device and laser harp.

(Featured image: Damian Elias with a jumping spider. Credit: Tamas Szuts.)

The post Love songs from a spider appeared first on The Berkeley Science Review.

The professor is in, AMA

Holly Williams — Mon, 26 May 2014 18:17:54 +0000

In celebration of Astronomy Day, UC Berkeley professor Dr. Alex Filippenko participated in the popular Reddit interview series, Ask Me Anything (or AMA, if you’re hip to the slang). Check it out here. As part of an ongoing campaign to increase the dialog between top-notch researchers and the general public, the subreddit r/science has been asking a number of prominent research professors to participate in these discussions. This isn’t the first time Filippenko has forayed into the front page of the internet, nor is it the first time UC Berkeley professors have answered the burning questions of Redditors. Nonetheless, Filippenko’s contribution to the series serves as continued evidence for the successful marriage of big science to the everyday person.

Read on for some highlights from Filippenko’s AMA!

Dark matter and dark energy are quite mysterious.

Here’s a fun fact — atoms only make up roughly 5% of the known universe. Everything else falls into one of two categories, either dark matter (~25%) or dark energy (~70%). Unlike ordinary matter, dark matter does not interact with electromagnetic radiation and is thus extremely difficult to observe directly. The existence of dark matter is instead inferred via its gravitational effects on ordinary matter. By calculating the masses of gravitational lenses (massive cosmological objects that bend light by distorting the surrounding space-time) and comparing them to the masses of luminous matter in corresponding regions, scientists can map out the distribution of dark matter. However, just exactly what it is remains elusive. According to Filippenko, dark matter is likely a weakly interacting massive particle (WIMP), though laboratory evidence of this remains to be seen.

In the late 90s, scientists made a shocking discovery. The expansion of the universe was accelerating. As our neighboring galaxies move away, the spectra of the elements contained in their stars become “redshifted” (meaning they move to longer wavelengths). To measure a galaxy’s distance, one must use “standard candles” (such as Cepheids or supernovae). These standards have known intrinsic brightness, such that comparing their apparent brightness gives a measure of how distant the candles are. In the 1920s, Edwin Hubble compared the redshifts of galaxies to their distances and discovered that space itself was, in fact, expanding. For many years thereafter, it was assumed that the gravitational forces exerted by every galaxy onto every neighboring galaxy would eventually slow down the expansion of the Universe.

However, Nobel Prize-winning work by Dr. Saul Perlmutter and others showed that the opposite was true. To explain this phenomenon, the idea of dark energy was born. Dark energy is the energy that uniformly permeates all space and gives rise to a repulsive force that accelerates the expansion of the Universe. Presently, two popular theories offer explanations for the origins of dark energy. “Quintessence” is dynamic dark energy generated by a scalar field. In contrast, Einstein’s “Cosmological Constant” considers dark energy to be vacuum energy, the energy of empty space (which has a negative pressure), and it contributes statically to the energy density of the Universe. Filippenko’s personal preference tends towards the cosmological constant.

A lower limit on the size of the Universe can be calculated.

The Observable Universe is the tiny slice of the Universe that we are capable of seeing, given the age of the Universe, the speed of light, and the accelerating expansion of said Universe. Presently, the Universe is 13.7 billion years old, and the consequent edge of the Observable Universe is about 47 billion light years away. NASA has shown that, to the best of our knowledge, the Universe is flat (as opposed to spherical or hyperbolic). If we assume that the Universe is closed (i.e., like the surface of a sphere) and that the local curvature is very, very small, we can do a rough back-of-the-envelope calculation of the minimum size of the Universe. Per Filippenko, current estimates state that the Universe is at least 80 times the volume of the Observable Universe. However, he notes the following.

“…the entire Universe may well be far, far larger than that, according to inflation models. It could easily be the case that the ratio of the diameter of the entire Universe (all that there is) to the diameter of our observable Universe is as large as the ratio of the diameter of the observable Universe to the diameter of a proton: something like 10⁴¹ for the ratio. In other words, our observable Universe is like a proton relative to the entire Universe. Whoa!” (Dr. Alex Filippenko)

Supermassive black holes live in the centers of galaxies.

A supermassive black hole (SMBH) is the largest type of black hole, weighing in at up to billions of solar masses (1 solar mass = 1.9891 × 10^30 kg). The smallest black holes are a few solar masses. It is well established that many galaxies, including our very own, have SMBHs nestled at their core. Our SMBH, Sagittarius A* (Sgr A*), is a black hole of roughly 4 million solar masses. According to Filippenko, a gas cloud is set to be gobbled up by Sgr A* within the coming months, a rare and exciting event, so be sure to stay tuned!

Lick Observatory needs our help.

Located on Mount Hamilton, just outside of San Jose, Lick has served as the flagship observatory for all eight UC campuses, LBNL, and Livermore for over a hundred years. Lick was the first to empirically verify Einstein’s Theory of Relativity (1922), first to bounce a laser off the Moon (1969), and in the late 1980s and 90s contributed vast amounts of data on supernovae, which led to the discovery of the accelerating expansion of the Universe. Filippenko says the following about the observatory.

“Lick Observatory serves all UC astronomers, and it is especially important for young astronomers (undergraduate students, graduate students, and postdoctoral scholars) because they can get telescope time there, unlike at many bigger, more expensive facilities. Lick is also excellent for long-term studies that require many nights each year, such as searches for exoplanets and monitoring exploding stars (supernovae). A lot of new types of instruments are developed or improved at Lick, such as laser-guide-star adaptive optics (which can make the Lick 3-m telescope get images as clear as those from the Hubble Space Telescope at near-infrared wavelengths). Lick also does much public outreach and education.” (Dr. Alex Filippenko)

Unfortunately, the University of California Office of the President plans to phase out funding for Lick Observatory. If you are interested in supporting the Lick Observatory, head on over to the “Save Lick” campaign for more information.

To see Filippenko’s entire AMA, check it out here. The r/science AMA series will continue with Dr. Shaun Hotchkiss, who will be discussing “Inflation and the Large Scale Structure of the Universe” on May 28th at 1 PM.

BONUS! Here are some extra tidbits that were my personal favorites.

Q1: “How long will it take until we have ships capable of taking us to the next solar system in a reasonable amount of time?” – seismicor

A: “Oh, gosh, I don’t know. Probably not sooner than 1000 years from now, and perhaps much longer than that, even with our currently rapid growth in technology. The energy barriers are truly stupendous, especially if you want to take humans instead of computers/robots. I think it is far more likely that we will be sending computers/robots instead of humans to other worlds. Indeed, they may end up being our evolutionary descendants.”

Q2: “What is the most logical explanation of what would happen if we were to be consumed by a black hole?” – jaycrypted

A: “We would be crushed into a singularity having extremely high density — after first being torn apart (‘spaghettified’).”

Q3: “How far out should I drive to get a good view of the milky way this summer?” – 5Aces

A: “Try to go out to the Sierra Nevada range around the time of new moon, and you will be treated to a truly spectacular sight. The skies there are really dark, and you’re above the haze, so you can see a lot of faint stars. If you’re not able to drive that far, then I suggest Pt. Reyes National Seashore, about 1-1.5 hours drive. It’s the darkest place I know within a short driving distance from the SF Bay Area.” (I can personally verify this: the stargazing in Yosemite is stunning.)

The post The professor is in, AMA appeared first on The Berkeley Science Review.

Daniel’s desiderata

Daniel Freeman — Fri, 23 May 2014 18:36:16 +0000

A collection of things I find interesting:

1. Nearly a year ago, I talked about some progress on the Twin Primes conjecture through a collaborative effort spearheaded by Terry Tao and Michael Nielsen. When I wrote the last post, the record gap was at 258,728. It has since shrunk to 246 (unconditionally), or, depending on the use of some other unproved conjectures, 6. This has been an exciting story to follow due to how rapidly it has moved forward, and how seamlessly mathematicians around the world have been able to coordinate on the problem. For more technical details, see here.

2. Scott Aaronson recently posted a collection of ten new open (interesting) problems in the field of quantum information. This list (and this post) is great because Scott posted a similar list about eight years ago, and there’s been fairly significant progress on his old open questions. Many of them are fairly technical, but it’s inspiring to have so many interesting research questions all bundled together in one place.

3. About ten months ago, I discussed some drama surrounding the company D-Wave. Since then, a couple of interesting things have happened. First, a handful of publications have come out benchmarking various aspects of the D-Wave machine. Second, a classical model of the D-Wave machine was proposed by Berkeley’s own Seung Woo Shin and friends. For those keeping track at home, a truly correct classical model of the D-Wave would fairly convincingly slam the door on any interesting quantum capabilities of such a device. But, alas, the D-Wave is a slippery beast, and Daniel Lidar’s group has evidence to suggest that Shin’s model isn’t perfect.

4. In terms of things that just make me ridiculously excited, Google released a program for simulating a quantum computer on your desktop. I haven’t had the chance to play around with this too much, but I may write a longer review of the software in the future. It’s limited to simulating 22 qubits, but that should be enough to demonstrate some simple algorithms.

5. On a completely different note, Pike and friends from Imperial College London and the Max Planck Institute have come up with a way to convert light into matter. That’s the sensationalized headline, anyway. Really, they’re thinking about slamming together photons to produce electron-positron pairs (on the order of 100,000 pairs, possibly). This is a fair ways from conjuring up a cup of Earl Grey, but it’s exciting and almost too hilariously Sci-Fi sounding to be real (except it is).

The post Daniel’s desiderata appeared first on The Berkeley Science Review.

Scientific collaborations at UC Berkeley

Natalia Bilenko — Mon, 19 May 2014 18:03:56 +0000

This is the first post in an ongoing series where the BSR team sketches out their creative process. Look out for more posts in the coming weeks!

As Alexis Fedorchak pointed out in her letter from the editor (1), several articles in the latest issue of Berkeley Science Review focus on meta-science – describing many sides of being a scientist and the process of science. The cover article by Anum Azam, “The first rule of data science” (2), explores a new scientific field rising to prominence at UC Berkeley, one that can be seen as a type of meta-science itself. Data science is inherently a collaborative endeavor, bridging many areas of research and diverse sets of skills.

Data science skillset. Design: Natalia Bilenko, modified from Drew Conway; Book: MTchemik; network: Qwertyus

I wanted to capture this spirit of collaboration when illustrating Azam’s article. The image I chose for the title page fit the bill perfectly – the visualization by data scientist Olivier H. Beauchesne shows scientific collaborations across the world (3), based on co-authorships in Elsevier’s Scopus database (4). Numerous connections, depicted as sweeping arcs, evoke a web that unites the world in the scientific process. When the decision was made to feature Azam’s article on the cover, I wanted to extend the theme of scientific collaboration and take it closer to home. The article describes many connections across the UC Berkeley campus, and I thought that BSR readers might want to see UC Berkeley collaborations (I certainly did!). This is how I found myself solving a data science problem to illustrate a data science article.

Finding a dataset

The first step was to find an appropriate dataset. Though co-authorships lingered on my mind, I did not have full access to the database used by Beauchesne. I went through a few alternatives: cross-linked websites on the berkeley.edu network, cross-listed classes, faculty appointments. While interesting, none of those sources of information captured collaboration explicitly. Finally, I made a fortuitous discovery – the PubMed database (5) allows searches by affiliation. Just as I was dreading writing a tedious script to parse the HTML returned by a PubMed search, it turned out that I could export those search results in XML format, a much easier structure to understand and pull information from. We were in business.

I retrieved all articles published in the past 20 years (1994-2014) that listed “University of California, Berkeley” (and several variants) in the affiliations. PubMed’s listings dropped rapidly for articles before 1994, likely due to indexing problems, and twenty years seemed like a good range. I then loaded the data in my favorite exploratory tool – IPython Notebook (6), created by UC Berkeley’s own Fernando Perez and his team of pythonistas. I parsed the data using a Python library called BeautifulSoup (7). It’s a library for working with web data that I picked up from the excellent visualization blog FlowingData (8). I used it to get a list of affiliations for each article, and only kept affiliations that included UC Berkeley. That list included 3826 articles.
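To make the filtering step concrete, here is a minimal sketch of pulling affiliations out of an XML export and keeping only the Berkeley ones. It is not the original notebook code: it swaps in Python’s built-in `xml.etree.ElementTree` instead of BeautifulSoup so that it runs without extra installs, and the sample XML, the function name `berkeley_affiliations`, and the list of affiliation variants are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the general shape of a PubMed XML export (not real records).
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <AuthorList>
          <Author><AffiliationInfo>
            <Affiliation>Department of Statistics, University of California, Berkeley, CA.</Affiliation>
          </AffiliationInfo></Author>
          <Author><AffiliationInfo>
            <Affiliation>Department of Physics, Some Other University.</Affiliation>
          </AffiliationInfo></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <Article>
        <AuthorList>
          <Author><AffiliationInfo>
            <Affiliation>Department of Chemistry, Some Other University.</Affiliation>
          </AffiliationInfo></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

# Hypothetical spellings to match; the real search used "several variants".
BERKELEY_VARIANTS = (
    "university of california, berkeley",
    "university of california at berkeley",
    "uc berkeley",
)


def berkeley_affiliations(xml_text):
    """Map each article ID to the affiliations that mention UC Berkeley."""
    root = ET.fromstring(xml_text.strip())
    kept = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext(".//PMID")
        # Collect every affiliation string found anywhere inside this article.
        affiliations = [a.text.strip() for a in article.iter("Affiliation") if a.text]
        matches = [
            aff for aff in affiliations
            if any(variant in aff.lower() for variant in BERKELEY_VARIANTS)
        ]
        if matches:  # drop articles with no Berkeley affiliation
            kept[pmid] = matches
    return kept


result = berkeley_affiliations(SAMPLE_XML)
print(result)
```

Matching on lowercase substrings is a deliberately crude choice; as the next section describes, real affiliation strings are inconsistent enough that manual cleaning was still needed.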

Cleaning the data

It was time to switch to manual cleaning. The affiliations were quite inconsistent: non-normative department names, various ways of listing affiliations for multiple authors, even misspellings of “Department.” Next time I publish a paper, I might fill out that little field more carefully. I kept articles with multiple Berkeley affiliations, including a few major independent collaborators, such as Lawrence Berkeley National Lab and Howard Hughes Medical Institute. I then loaded the affiliations for the remaining 611 articles back into IPython Notebook and created a co-occurrence matrix characterizing co-authorships between programs. With the matrix in hand, it was time to make the first graphic draft.
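A co-occurrence matrix of this kind can be built in a few lines. The sketch below assumes each article has already been reduced to a cleaned list of program names; the function name `cooccurrence` and the three example articles are hypothetical, and the original notebook may well have done this differently.

```python
from itertools import combinations


def cooccurrence(articles):
    """Count how often each pair of programs appears on the same article.

    `articles` is a list of lists of program names, one inner list per article.
    Returns the sorted program names and a symmetric matrix of pair counts.
    """
    programs = sorted({p for article in articles for p in article})
    index = {name: i for i, name in enumerate(programs)}
    n = len(programs)
    matrix = [[0] * n for _ in range(n)]
    for article in articles:
        # Each unordered pair of distinct programs counts once per article.
        for a, b in combinations(sorted(set(article)), 2):
            i, j = index[a], index[b]
            matrix[i][j] += 1
            matrix[j][i] += 1
    return programs, matrix


# Hypothetical cleaned affiliations for three articles.
example_articles = [
    ["Physics", "Statistics"],
    ["Physics", "Statistics", "LBNL"],
    ["Chemistry"],
]

programs, matrix = cooccurrence(example_articles)
for name, row in zip(programs, matrix):
    print(name.ljust(12), row)
```

Each row and column corresponds to one program, and entry (i, j) is the number of articles shared by programs i and j. An article with only one program, like the last example, adds a row of zeros and no connections.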

Co-occurrence matrix characterizing co-authorships between UC Berkeley programs.

To display the data, I used D3.js (9), a powerful JavaScript library for interactive data visualization. D3 was created by Mike Bostock, graphic editor for the New York Times, while he was a graduate student at Stanford. It’s immensely flexible and has a very supportive community with many tutorials and resources. The idea of making a geographical map appealed to me, but the correspondence between departments and buildings at Berkeley is quite convoluted. I went with a minimal circular layout, borrowing the concept from Mike Bostock’s visualization of Uber rides by San Francisco neighborhood (10).

The first draft ended up looking colorful and overwhelming. After further data cleaning and forgoing color, the graph looked more manageable. However, it was still impossible to know what it represented. To get a glimpse at the meaning behind the lines, I included the department names in the graph, resulting in the least attractive draft of all.

Early drafts of the collaboration graphic.

Designing the graphic

Satisfied that the visualization was sensible, I focused on aesthetics. The number of programs and departments was too large to encode visually. I assigned each to one of six disciplines (physical sciences, engineering, biological sciences, social sciences, math and computer sciences, and health and medicine). This may have been the hardest part of the process – many programs don’t fit into a discipline neatly – but it was a design compromise that had to be made. I encoded each discipline with a color, and the penultimate draft was born.

Penultimate draft of the collaboration graphic – the program order is random.

In the final graph, the programs are grouped by discipline. The number of interdisciplinary connections across Berkeley is truly striking, and the colorful collaboration network made for a beautiful magazine cover. The abstract visualization in print is enriched by the interactive version on the website – hover over each program’s sector, and its name is displayed. To see the interactive graphic in action, see here.

There are of course some caveats to this visualization, as with any project based on real-world data. For example, PubMed skews heavily towards biomedical research, so the mathematical sciences are underrepresented. Further, affiliation lists reflect faculty with multiple appointments as well as work by authors from different departments, and the two cannot be told apart here. But those considerations aside, it provides a fascinating view of the strength of collaboration among scientists at UC Berkeley. Throughout my time in graduate school, I’ve felt that this feature is at the core of UC Berkeley’s academics. It is inspiring to see that hunch confirmed visually.

References

1. http://sciencereview.berkeley.edu/article/from-the-editor-spring-2014/

2. http://sciencereview.berkeley.edu/article/first-rule-data-science/

3. http://olihb.com/2011/01/23/map-of-scientific-collaboration-between-researchers/

4. http://www.elsevier.com/online-tools/scopus

5. http://www.ncbi.nlm.nih.gov

6. http://ipython.org/notebook.html

7. http://www.crummy.com/software/BeautifulSoup/bs4/doc/

8. http://flowingdata.com/

9. http://d3js.org/

10. http://bost.ocks.org/mike/uberdata/

The post Scientific collaborations at UC Berkeley appeared first on The Berkeley Science Review.