Mic Farris

The Rise and Fall of Measure E

Mic Farris — Sat, 16 Jan 2021 04:34:22 +0000

The following essay is adapted from the forthcoming book TUESDAY NIGHT FIGHTS, which details the political and self-governance history of Thousand Oaks.

Twenty-five years ago, in 1996, Thousand Oaks voters became an official part of preserving the city’s general plan, as ordinances were enacted requiring voter approval of important general plan changes before they could take effect. The two ordinances were sponsored by the political protagonists of the time: Linda Parks authored the Parks Initiative, a precursor to the successful Ventura County open space protection campaigns, while Andy Fox championed the city-sponsored Measure E.

The Parks Initiative, qualifying for the ballot as a citizens’ initiative and approved in June 1996, focused on the parks and open space lands within the city. Under the initiative, lands categorized as ‘Parks, Golf Courses, Open Space’ “shall remain so designated… unless redesignated to another General Plan land use category by vote of the people”,[1] and “[w]henever the City Council adopts an amendment requiring approval by a vote of the people…, the City Council’s action shall have no effect until after such a vote is held and a majority of the voters vote in favor of it.” [2]

The language of the Parks Initiative provides clarity as to when the voters must be included for changes to the general plan: if land is designated in a protected category and an amendment is proposed to designate that land otherwise, the amendment does not take effect until approved by the voters.

By defining the approvals that require voter concurrence plainly and specifically, the Parks Initiative established very effective protections. The clarity of language ensures that landowners and the public know the process for making any such changes. While not a prohibition, such a requirement lowers the likelihood of any such proposals being made, making for an effective measure for protecting open space.

Just weeks after enacting Parks’ open space initiative, the Council placed its own measure on the November 1996 ballot, which was approved by voters that fall. The ordinance now known as Measure E also proved effective as a tool in curbing development pressures within Thousand Oaks, but not for the reasons Measure E advocates claim. In contrast with the Parks Initiative, the measure’s implementation has shifted over time, shaped by how open to interpretation its language is and by the circumstances that gave councilmembers opportunities to modify their interpretations.

Measure E requires that “[a]ny amendment which cumulatively provides a net increase in the maximum number of residential dwelling units which could be permitted under the proposed land use designation” or “[a]ny amendment which cumulatively provides a net increase in the land designated ‘commercial’” requires voter approval.[3] The challenge: what does “cumulatively provides a net increase” mean?

When Measure E was introduced to the Council, Mayor Fox, the measure’s primary advocate, provided a way forward in a memo for such “minor changes in the General Plan…if these changes might mean that the residential and/or commercial uses might be increased. If the Council chose to do this, they would be required to do one of two things: (1) concurrently reduce densities and commercial space elsewhere in the City or (2) submit the proposed General Plan changes which would increase the City’s ultimate population or commercial areas to the voters.”[4]

To grant flexibility under Measure E without seeking voter ratification, the City Council could, for example, increase acreage designated as commercial in one part of the city, as long as an equal or greater amount of acreage were decreased elsewhere, thus resulting in no net increase of commercial acreage. If this were done, then according to Fox, the general plan amendment would not need voter approval to become effective.

Based on Fox’s 1996 memo, these offsetting shifts in residential density or commercial acreage were anticipated to occur “concurrently” – at the same time. In the succeeding years after Measure E’s passage, no issues arose regarding the measure, as only amendments decreasing residential or commercial lands were approved by the Council.

The first true test of Measure E came in 2001, when former city councilmember and county supervisor Ed Jones sent a letter to the city representing Jerry Brodecky, owner of Thousand Oaks Toyota on Thousand Oaks Boulevard. Jones proposed that a portion of Brodecky’s property, about 1.2 acres, be redesignated from “medium density residential” to “commercial” to accommodate the dealership’s expansion. Noting that this could trigger Measure E, Jones proposed that “an alternative to a vote of the people be invoked in which it is demonstrated that an equivalent amount of land within the city has been rezoned by removing it from commercial since Measure E became law…”[5] Jones went on to recommend that this could be accomplished with land that “has been or is in the process of being removed from commercial and use that acreage to demonstrate that there will be no net increase…”[6]

Regarding the request, City Attorney Mark Sellers advised the Council that “[i]f land has been removed from the ‘commercial’ designated [sic] since 1996, and no equivalent offsetting increases have occurred, there is a net reduction which can be reallocated.”[7] A further memo from the City Attorney’s office advised that “[t]he Council controls how it will reallocate those past reductions in acreage of commercial uses.” [8] The Council initiated the general plan amendment for Brodecky Toyota in 2003,[9] but also established an Ad Hoc General Plan Review Committee to evaluate the “applicability of Measure E to banking of commercial acreage and residential density”.[10] The Ad Hoc General Plan Review Committee comprised two members of the City Council, Mayor Pro Tem Bob Wilson and Councilmember Claudia Bill-de la Peña, and two members of the Planning Commission, Commissioner Tom Glancy and myself.

In the committee’s report back to the Council, one of the interpretations included processing “a simultaneous increase/decrease in commercial acreage or residential density” as a “single General Plan amendment, [which] could involve several properties” and that “[t]he scope of ‘net’ change could be clearly assessed by comparing/balancing the offsetting loss/gain in acreage or density among the properties.” The other interpretation involved creating a “bank” for any “reduced commercial acreage or the number of reduced residential units resulting from [a General Plan] amendment…approved since Measure E was adopted…” and that “[a] future request… to increase commercial acreage or residential units elsewhere within the City’s Planning Area could draw from the ‘bank,’ reallocating the reserved acreage or units to another site or sites… This option would permit greater flexibility… in that the acreage or units could be reallocated as needed, without going to the voters for approval.”[11]

Councilmember Bill-de la Peña and I supported the interpretation of evaluating each general plan amendment on its own, consistent with how Measure E was originally proposed assuming “concurrent” increases and decreases; Mayor Pro Tem Wilson and Commissioner Glancy backed the banking option, which aligned with the preferred interpretation from the City Attorney’s office, based on previous memos. Given the language of Measure E and the split among the committee, I recommended that we propose options for interpretation to the Council, as opposed to presenting a split recommendation; we all agreed that these were the options for interpreting the measure, even if we didn’t agree on which interpretation was preferred. The Council directed the City Attorney’s office to provide a deeper legal analysis, which was reported back to the Council the following year.

City Attorney Mark Sellers stepped down from his post in 2004 after 21 years of city service,[12] and Amy Albano began her tenure as his permanent replacement in 2005.[13] By mid-year, City Attorney Albano penned a formal eight-page legal opinion, guiding the city’s interpretation of Measure E. In effect, the banking interpretation would stand, allowing previous reductions in residential density or commercial acreage to be used to balance future increases.

In her analysis, Albano said that the Council should not consider reductions as being stored in a “bank” but rather a baseline to judge an amendment against.[14] She did argue, however, that some density reductions, even though approved after Measure E’s passage, shouldn’t be used to reduce the baseline, as these amendments were merely “ratifying current conditions” in the city and didn’t represent actual decisions by the Council to reduce density that could be reallocated later, since “to do so does not further the purpose of Measure E.” [15] Specifically, two “housekeeping amendments” that “in total would reduce the commercial baseline by 4 acres and the residential dwelling unit baseline by 4,102 dwelling units”[16] were to be excluded as reductions eligible for reallocation.

The Council accepted Albano’s analysis, excluding these amendments merely “ratifying current conditions”, and as a result, the Council’s 2005 policy established a baseline of reductions for future reallocation of 1 acre for commercial use and residential density allowing for 368 units. From this understanding, the Brodecky Toyota proposal was modified to reduce their commercial acreage request from 1.2 acres to 1 acre so that a popular vote was not needed.[17]
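As a rough illustration of how the two readings of “cumulatively provides a net increase” differ, the sketch below applies each to the Brodecky Toyota request. The function names are hypothetical and the logic is a simplification of the interpretations described above, not the city’s actual procedure; only the acreage figures (a 1.2-acre request, later revised to 1 acre, against a 1-acre baseline) come from the essay.

```python
from typing import Iterable

# Figures taken from the essay; everything else here is illustrative.
COMMERCIAL_BASELINE_ACRES = 1.0   # prior reductions accepted by the Council in 2005
ORIGINAL_REQUEST_ACRES = 1.2      # Brodecky Toyota's first request
REVISED_REQUEST_ACRES = 1.0       # request after it was scaled back


def needs_vote_concurrent(changes_in_amendment: Iterable[float]) -> bool:
    """Concurrent reading: only increases and decreases made within the
    same amendment offset each other. A positive sum is a net increase."""
    return sum(changes_in_amendment) > 0


def needs_vote_baseline(requested_increase: float, prior_reductions: float) -> bool:
    """Baseline reading: reductions approved since 1996 can absorb a later
    increase. A vote is needed only if the request exceeds them."""
    return requested_increase > prior_reductions


if __name__ == "__main__":
    # A stand-alone increase always triggers a vote under the concurrent reading.
    print(needs_vote_concurrent([ORIGINAL_REQUEST_ACRES]))
    # Under the baseline reading, the outcome depends on the size of the request.
    print(needs_vote_baseline(ORIGINAL_REQUEST_ACRES, COMMERCIAL_BASELINE_ACRES))
    print(needs_vote_baseline(REVISED_REQUEST_ACRES, COMMERCIAL_BASELINE_ACRES))
```

Under the concurrent reading, the request would need an offsetting decrease in the same amendment; under the baseline reading, trimming it from 1.2 acres to 1 acre is enough to avoid a vote.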

This implementation became the governing interpretation of Measure E for the next decade, establishing a political environment where only requests for minor development increases were considered acceptable. Any sudden and sizable increases in residential development without voter approval would run counter to the stated intent of Measure E per the measure’s ballot argument that voters “should have the power to vote yes or no when significant changes to the General Plan are proposed.”[18] Thousand Oaks had become a mature city where the annual development of residential units had slowed significantly; for example, only 137 residential development allotments were approved in 2005. A decade later, creative interpretations of Measure E were revisited.

In 2015, former City Attorney Sellers, who helped draft Measure E, was now a member of the Thousand Oaks Boulevard Property Business Improvement District (BID), which represents a group that owns much of the property on Thousand Oaks Boulevard, including the City of Thousand Oaks. Sellers believed the measure should be reinterpreted to allow more development without a citywide vote.[19]

Regarding Albano’s memo guiding the Council’s 2005 implementation of Measure E, Sellers said in an interview with the Thousand Oaks Acorn that “(Albano) made an interpretation that (she said) was ‘in the spirit of Measure E.’ I and a number of other people feel that was a wrong interpretation.”[20]

“What we have to do is get the city to take a new look at it. The BID would obviously love to have the new interpretation,” Sellers said. Under a new interpretation, Sellers said, “I believe the cap can be higher. In my 21 years as city attorney, I never saw a project that wasn’t underdeveloped (with) less homes than was contemplated by the zoning or the General Plan. Everything in this community has been underdeveloped. It’s a goal to use that flexibility.”[21]

Two years later, pursuing this newly conceived flexibility, the city conducted an analysis to identify areas where general plan changes could be made to create more units for reallocation and future development without seeking voter approval. In the study, city staff identified 5,400 units that were “theoretically accessible for reallocation” as they were residential areas that “were built at a density that is within an entirely lower residential land use category.” [22] As an example, a development may have been designated on the general plan as medium density residential, allowing for 4.6 to 15 units per net acre; however, if the property were ultimately developed at 3 units per net acre, the general plan designation could be reduced to low density residential, which allows between 2 and 4.5 units per net acre, and the difference in residential densities can be used for reallocation. The problem: such density reductions seem to be exactly the type the 2005 Council interpretation excluded from reallocation as “ratifying current conditions” – existing development at a lower residential land use category.
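The arithmetic behind that example can be sketched as follows. The density ranges are the ones quoted above; the 10-acre parcel is hypothetical, and the sketch assumes that the units credited for reallocation equal the drop in the maximum permitted density multiplied by the acreage, since Measure E is written in terms of the maximum number of units a designation could permit. The staff analysis may have computed the figure differently.

```python
# Density ranges in units per net acre, as quoted in the essay.
CATEGORIES = {
    "low": (2.0, 4.5),
    "medium": (4.6, 15.0),
}


def fits_category(built_density: float, category: str) -> bool:
    """True if the built density falls inside the category's range."""
    low, high = CATEGORIES[category]
    return low <= built_density <= high


def freed_units(net_acres: float, old_category: str, new_category: str) -> float:
    """ASSUMPTION: units credited for reallocation equal the drop in the
    maximum permitted density times the acreage being redesignated."""
    old_max = CATEGORIES[old_category][1]
    new_max = CATEGORIES[new_category][1]
    return (old_max - new_max) * net_acres


if __name__ == "__main__":
    built = 3.0    # units per net acre actually developed (from the essay)
    acres = 10.0   # hypothetical parcel size
    print(fits_category(built, "low"))
    print(freed_units(acres, "medium", "low"))
```

Nothing on the ground changes in this calculation: the parcel was already built at 3 units per net acre, which is why such a reduction resembles the amendments “ratifying current conditions” that were excluded in 2005.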

Although a Thousand Oaks Acorn editorial urged the Council to “[r]eject the amendment and stick with what the majority of residents feel they were promised: the right to vote ahead of the demands of developers”,[23] the Council moved ahead and amended the general plan to increase the available residential density balance by 1,088 units.[24]

Continuing the evolving interpretations of Measure E since its inception, the 2018 Council approvals raise further questions about whether the voters will be involved in the city’s general plan decision-making as intended; based on the differing political climates in 2005 and 2018, amendments of the same type – those “ratifying current conditions” – now seem to be treated differently. Under the 2005 interpretation, the 2018 amendments would not be available for further allocation; however, if 1,088 units are indeed available for reallocation, this could open up the amendments excluded in 2005 to be available as well, leading to a potential increase of an additional 4,102 units.[25]

This could easily be interpreted as a “significant change” where the voters “should have the power to vote yes or no”[26] on such a proposal. The general plan reductions approved in 2018 may not yet technically run afoul of the voter approval requirements; the real test will come when the Council entertains increases that attempt to tap into this arguably questionable reservoir of residential density.

The successes of voter approval measures are only as good as the text of the measures themselves. If a measure’s language is clear and precise such as with the Parks Initiative, the voters’ will is protected; however, if the language is vague, the measure could potentially provide no protection at all, becoming dependent on the interpretations and decisions of a narrow council majority.

In the end, the effectiveness of Measure E as a growth control instrument has merely paralleled the desire of the measure’s advocates to control growth. If a Council majority chose to argue against a particularly intense development, opponents could point to Measure E as a politically convenient limitation. However, if the Council wanted to find ways to allow for growth, it explored creative ways to navigate through Measure E’s language, interpreting away the right to approve significant changes the voters thought they were guaranteeing by passing Measure E.

Restrictions on what the Council can and cannot do come in two forms: restrictions so tight that no one tests the measure, and restrictions so loose that they can always be interpreted around when convenient. When it comes to including the voters in city decisions, the Parks Initiative qualifies as the former, and Measure E serves as the latter.

As we consider significant changes to the city’s general plan, we should keep the rightful place of Thousand Oaks voters in approving these changes top of mind. Honoring the voters’ intent is critical to healthy democratic governance.

===================

[1] Thousand Oaks Municipal Code, Section 9.2.204(b)

[2] Thousand Oaks Municipal Code, Section 9.2.204(d). Exceptions exist in subsection (c) if the redesignation is deemed necessary to avoid an unconstitutional taking of a private landowner’s property.

[3] Thousand Oaks Municipal Code, Sections 9.2.203(b)(2) and (3), as listed. However, research for this book indicates that the ordinance approved by the Council codifying Measure E contains errors. The current ordinance listed in the Municipal Code appears to be an older version of Measure E and not the version presented to the voters for approval.

[4] Memo to City Council from Andrew P. Fox, Mayor, “Subject: Growth Control,” April 22, 1996.

[5] Memo to City Council from Mark G. Sellers, City Attorney, “Subject: Request of Mr. Brodecky on Thousand Oaks Toyota Expansion,” June 5, 2001. Attached letter from Ed Jones.

[6] Ibid.

[7] Memo to City Council from Mark G. Sellers, City Attorney, “Subject: Request of Mr. Brodecky on Thousand Oaks Toyota Expansion,” June 5, 2001.

[8] Memo to Chris Ronneberg, Associate Planner, from Nancy Kierstyn Schriener, Assistant City Attorney, “Subject: LU 2002-226/Brodecky – Applicability of Measure E,” August 20, 2003.

[9] Minutes of the Thousand Oaks City Council, September 2, 2003.

[10] Minutes of the Thousand Oaks City Council, September 16, 2003.

[11] Memo to City Council from Ad Hoc General Plan Review Committee, “Subject: 1. Applicability of Measure E Limitations on General Plan Amendments that increase Commercial Acreage or Residential Density. 2. Special Use Permits for Certain Commercial Uses in all Commercial Zones,” June 22, 2004.

[12] Minutes of the Thousand Oaks City Council, January 27, 2004.

[13] Minutes of the Thousand Oaks City Council, December 14, 2004.

[14] Memo to City Council from Amy Albano, City Attorney, “Subject: Applicability of Measure E Limitations on General Plan Amendments That Increase Commercial Acreage or Residential Density,” June 14, 2005.

[15] Memo to City Council from Amy Albano, City Attorney, “Subject: Measure E Opinion,” June 6, 2005.

[16] Ibid.

[17] Minutes of the Thousand Oaks City Council, July 26, 2005.

[18] Argument in Favor of Measure “E”, Sample Ballot and Voter Information Pamphlet, County of Ventura, City of Thousand Oaks, General Election, November 5, 1996.

[19] Bitong, Anna, “Measure E: under the microscope,” Thousand Oaks Acorn, February 5, 2015.

[20] Ibid.

[21] Ibid.

[22] Memo to Andrew P. Powers, City Manager, from Mark A. Towne, Community Development Director, “Subject: Measure E Residential Baseline,” October 24, 2017.

[23] “Give residents the vote they voted for,” Thousand Oaks Acorn, April 19, 2018.

[24] Minutes of the Thousand Oaks City Council, April 24, 2018.

[25] Memo to Andrew P. Powers, City Manager, from Mark A. Towne, Community Development Director, “Subject: Measure E Residential Baseline,” October 24, 2017.

[26] Argument in Favor of Measure “E”, Sample Ballot and Voter Information Pamphlet, County of Ventura, City of Thousand Oaks, General Election, November 5, 1996.

Can Lessons from Data Science Help Journalism?

Mic Farris — Wed, 27 Jun 2018 17:22:31 +0000

New York Times newsroom 1942. Source: Library of Congress http://hdl.loc.gov/loc.pnp/cph.3c12969

You might think journalism and data science don’t really go together, but on that, I differ. Below are some thoughts on the topic and lessons we can draw from data science on how to make journalism better and more effective in these times.

To begin, my background isn’t journalism, but it is science. I believe that the goals of true journalism are the same as science — get to the truth about what is happening and inform others about that truth.

In decision theory, a topic that I enjoy and in which I have deep practical experience, we learn that there are two primary factors that lead to the decisions we make. One is the most likely explanation for the facts we observe, and the other is our own cost-benefit analysis about what to do as a result.

The first element — finding the most likely explanation for the facts — requires us to be objective about the data we have in front of us. It also requires that we are open to alternative explanations (usually referred to as hypotheses).

The data lead us to the truth, but only when we are asking appropriate questions. Some facts or data may look disparate and unintelligible when viewed collectively, but things can make perfect sense if they are put in the right context. This is where epiphanies can happen — when the most likely explanation for the data finally becomes apparent.

However, some hypotheses might fit the data perfectly yet aren’t at all robust to new information through repeated and independent examination. In data science, we call this a solution that overfits the data. In journalism, this might be called a conspiracy theory — an overly complicated explanation that likely wouldn’t hold up when exposed to new facts.

The truth comes out when the explanation we find is consistent through repeated and independent examination. This holds in science, in journalism, in criminal investigations, or in any other endeavor whose goal is actually to seek the truth.

The second element — evaluating costs and benefits — is different for each decision maker. We can conflate these two elements in our decision making — letting our desire for one outcome cloud our interpretation of the facts in a way to justify the decision.

Much like how a rainbow color scale can fool the eye into seeing sharp gradients in scientific data, journalism has the potential to color the story at the expense of the facts themselves. This can become propaganda (if used intentionally) or can serve the journalistic form itself at the expense of the truth.

Much has also been made about press “objectivity”. From a recent thread (https://twitter.com/KlasfeldReports/status/1011334068275961856) by Adam Klasfeld (@KlasfeldReports), referencing work by Walter Lippmann: “neutrality”, a device to convince the reader of one’s accuracy or fairness, is not the same as “objectivity”.

We should strive — whether in journalism or in science — to seek the most likely explanation for the facts and separate out any approaches for assessing the right decision to make as a result. Facts and values should be separate.

In a lesson from decision theory, two decision makers can agree on the facts, and even on the likelihoods of certain explanations, yet *disagree* on the decision to be made. Each decision maker has different tolerances for error or differing ways to assess costs and benefits — in other words, different values.

One example: Sales and Engineering teams within a company can agree on a product and what’s needed to improve it. The Sales team will want to ship the product sooner than the Engineering team, since Sales wants to increase sales (so delay is bad) and Engineering wants to improve the product (so delay is good).
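To make the Sales and Engineering example concrete, here is a minimal sketch of two decision makers who share the same beliefs but hold different costs. The probabilities and cost figures are invented for illustration; the only point is that the same facts can lead to different decisions once values differ.

```python
def expected_cost(probabilities, costs):
    """Probability-weighted average cost of one action."""
    return sum(p * c for p, c in zip(probabilities, costs))


def best_action(probabilities, cost_table):
    """Pick the action with the lowest expected cost."""
    return min(
        cost_table,
        key=lambda action: expected_cost(probabilities, cost_table[action]),
    )


# Both teams agree on the facts: P(product is ready), P(product has a flaw).
SHARED_BELIEF = (0.7, 0.3)

# Hypothetical costs for each action in each state (ready, flawed).
# Sales sees delay as expensive; Engineering sees a shipped flaw as expensive.
SALES_COSTS = {"ship_now": (0, 5), "delay": (4, 4)}
ENGINEERING_COSTS = {"ship_now": (0, 20), "delay": (1, 1)}

if __name__ == "__main__":
    print("Sales chooses:", best_action(SHARED_BELIEF, SALES_COSTS))
    print("Engineering chooses:", best_action(SHARED_BELIEF, ENGINEERING_COSTS))
```

Neither team is misreading the data in this sketch. The disagreement lives entirely in the cost tables, which is where values enter.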

People will still have disagreements because they apply different values to the decision making situation. However, we should *all* strive to make those evaluations on the same set of facts.

That said, this *doesn’t* mean that if there are two possible decisions, they must be given equal treatment. While reporters, investigators, and scientists (and basically all humans) have bias, the lack of equal treatment is not evidence of bias in and of itself.

Some explanations for the facts are not likely, and thus should not really be given much reporting weight. If a plausible explanation for the facts is 10x (or 1000x) more likely than a conspiracy theory, treating each equally unduly injects the values of the conspiracy theorist.
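One way to picture this is to convert “10x more likely” into shares of the total probability. The sketch below assumes that “N times more likely” means odds of N to 1 between exactly two explanations; the function name is hypothetical, and treating reporting weight as proportional to probability is a simplification, not a rule of journalism.

```python
def probability_shares(times_more_likely: float):
    """Split total probability between a plausible explanation and a
    competing one, given how many times more likely the plausible one is.
    ASSUMPTION: only these two explanations are under consideration."""
    plausible = times_more_likely / (1.0 + times_more_likely)
    return plausible, 1.0 - plausible


if __name__ == "__main__":
    for ratio in (10, 1000):
        plausible, conspiracy = probability_shares(ratio)
        print(f"{ratio}x: plausible {plausible:.4f}, conspiracy {conspiracy:.4f}")
```

Against shares like these, a 50/50 presentation gives the unlikely explanation far more weight than the evidence supports.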

Conspiracy theorists have reasons for choosing the conspiracy over the far more likely plausible explanation — self-rationalization, camaraderie with other true believers, or doubling down as a self-defense of previous choices.

However, journalism should not unwittingly conflate these value judgments in its reporting.

Again, one example is that the government may be actively lying to the public and the press about facts. The free press has an obligation to find the truth and confront the government for how it operates.

Spin around the edges has been practiced by both Democratic and Republican administrations, moving back and forth between liberal and conservative, but within the confines of normality. In science, this might be described as working within the linear regime.

Previous rules of thumb — such as “if the President says it, it’s news” or “we present what he says, the viewers can decide whether it’s true” — can work as effective operating principles in the linear regime.

However, what is happening now has broken past the linear regime. The President and his staff are lying — verifiably, directly and without remorse — on a daily basis. With this, we have entered the nonlinear regime, so our approach to coverage must evolve to remain objective to the truth.

In science, we may have had an experimental setup that was working just fine, maybe with some hiccups and tweaks along the way. However, if suddenly the data start to look very different, we should question what is happening with our setup and look for other explanations.

This doesn’t mean that we give up on our scientific objectivity. It does mean, though, that we can’t interpret the data the way we used to, since conditions on the ground have changed. We have to apply our same principles, but to a new problem. The same goes for journalism.

If we focus on the goals of journalism and the new facts on the ground, we will get to a better objective explanation of the facts and serve journalism and its audience better in these changing times.

Why Lies Spread Faster than the Truth

Mic Farris — Thu, 24 May 2018 19:18:46 +0000

With the increasing speed of information coming at us, how do we know what’s true and what’s not, or even worse – what’s fake?

Figuring out what’s true and false is tough, and then understanding what to do about it can be even tougher. But we should recognize one difference between lies and the truth.

Lies spread faster. Here’s why.

In the information age, we’re understanding our world more and more in terms of information and data. We’re also recognizing that we are (and always have been) making decisions based on this information – even to the point of automating many of these decisions.

We are explainers, building internal models in our head for describing what we see in the world, and our decisions are based upon what we come up with as that explanation.

Some explanations are good at describing the data, and others are not. However, the most likely explanation – the truth underlying it all – does well at explaining what we’ve seen and anticipating what’s to come.

One example of a bad explanation is the conspiracy theory; conspiracy theories are easy to create, and there seem to be so many of them. In data science, we call this phenomenon overfitting, where an overly complicated explanation for the data is proposed that (1) seems to fit the data we have perfectly and (2) is highly likely to fall apart with any new information.

On the other hand, a truthful explanation needs to meet two criteria: (1) faithfully explain what we have observed and (2) be consistent with any future observations. In a way, truthful explanations are consistent throughout, as well as have the power of prediction.

False explanations spread faster than the true ones, because they are easier to come up with and aren’t burdened by being consistent with new information. The false explanations will be created and disseminated quickly, because they only need to meet the first criterion.
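The two criteria can be seen in a small overfitting sketch. The data points are made up for illustration (roughly y = 2x with some noise). One model is a straight line fit by least squares; the other is a degree-5 polynomial forced through every observed point, standing in for the overly complicated explanation.

```python
def fit_line(xs, ys):
    """Ordinary least-squares straight line: the simple explanation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_exact_polynomial(xs, ys):
    """Lagrange interpolation: a polynomial passing through every
    observed point, i.e. an explanation that fits the past perfectly."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def mean_abs_error(model, xs, ys):
    """Average absolute gap between predictions and observations."""
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)


# Invented observations: roughly y = 2x plus small noise.
OBSERVED_X = [0, 1, 2, 3, 4, 5]
OBSERVED_Y = [0.5, 1.6, 4.3, 5.5, 8.4, 9.7]
# Invented "new information" arriving later.
NEW_X = [6, 7]
NEW_Y = [12.2, 13.9]

line = fit_line(OBSERVED_X, OBSERVED_Y)
exact = fit_exact_polynomial(OBSERVED_X, OBSERVED_Y)

if __name__ == "__main__":
    for name, model in (("line", line), ("exact polynomial", exact)):
        seen = mean_abs_error(model, OBSERVED_X, OBSERVED_Y)
        unseen = mean_abs_error(model, NEW_X, NEW_Y)
        print(f"{name}: error on observed {seen:.2f}, error on new {unseen:.2f}")
```

The polynomial meets the first criterion perfectly and fails the second badly, while the line is slightly wrong about the past and holds up against the new points. Producing the polynomial takes no judgment at all, which is part of why such explanations are cheap to create.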

When we search for the truth, we strive to meet both criteria, and that takes time. Plus, the more certain we want to be about the truth, the longer it takes.

When we don’t care about certainty of the truth, this incentivizes weak or false explanations which can spread faster. The intentionally false explanation – the lie – comes from malevolent intent to force a decision before the right explanation can be found.

The goal of the lie is of course not the truth, but a quicker than expected decision (usually by someone else) that might not otherwise be made if the truth were actually known.

There are times, such as in business and entrepreneurship, when quicker decisions are preferable and have positive intents. In these cases, the point of the decision is to act quickly, gaining feedback needed in order to make better subsequent decisions. “Perfect is the enemy of the good” is a mantra used to characterize this approach.

However, in areas such as science, criminal proceedings, or investigative journalism, where we are trying to find out what the truth really is, there is a bigger penalty for getting it wrong, so it’s important to gather the best information before the decision is rendered.

In these situations, beware of those who seek to hide the truth. Those who seek to obfuscate – lie – are not in search of the truth. They are trying to force an outcome that would be different from what would occur if the truth were uncovered.

Unfortunately, we are in a time where the lies have turned into deliberate attacks on the search for truth itself.

We’re seeing it with science with attacks on findings relating to climate change and theories of evolution.

We’re seeing it with lawful investigations into foreign influence and potential criminality.

We’re seeing it with attacks on the free press, investigative journalism, and the First Amendment itself.

These institutions seek truth and need a manner of protection in order to allow them to get to that truth.

Why? Because, as we’ve discussed, getting to the truth – finding the best explanation for the facts – takes more time than finding and promulgating less valid (or likely false) explanations. Thus, we find ourselves engaged with defending against these false explanations right at the time when we’re searching for the best one.

There’s no perfect solution, since getting ahead of the lies isn’t possible. However, we can build up our systems to combat the lies better. Here are several things we can do:

Let the data be collected

Whether in science or journalism or investigations, we shouldn’t be afraid of collecting more independent information. The truth comes out through repeated and independent examination, and we should strive to learn what the facts are.

Differences about what to do in light of the truth may still exist, but let our differences result from an acknowledged difference in values, not our collective lack of understanding or in differing sets of facts.

Be open about how the data was collected

Here lies the real challenge. In science, this is easier to accomplish – we need to make sure everyone analyzing the results knows where the information came from. There is no real value in hiding information about the sources providing the data.

However, in investigatory or journalistic work using confidential or unnamed sources, this becomes more difficult. There is a real penalty for the sources being as transparent as possible.

The world proves itself to be a dangerous place, and there are interests out there who don’t want information to be known. Silencing the source is in their interest, so transparency, while desired in an altruistic environment in search of the truth, could put people in danger. The costs of not being transparent about information can sometimes be outweighed by the costs of harm that could come to those sources.

The costs don’t come from our desire to be transparent. The costs come from bad actors (in autocratic countries, this is the government, and in free societies, these are liars and criminals) who wish to prevent truth from being exposed to hide their bad acts (and potentially give the impression that they are really good).

Striking that balance is key, but hopefully we will trend toward transparency and truth.

Protect the truthseekers

The scientists, investigators, and journalists who care about the truth need our protection. We can’t leave them open to attacks from those with other nefarious agendas.

Sometimes they get it wrong – those in search for the truth would readily admit that – but seeking the truth will uncover those wrongs, with understandable explanations for what led to the original errors.

Society benefits by having truthseekers, so society needs to protect them.

Allow freedom to question the explanations

We don’t question “the data” – we can question the hypothesis for why the data look that way.

For example, in investigations, DNA of the defendant at the crime scene can be powerful evidence of guilt. One explanation is that the defendant committed the crime and their DNA was left at the scene. Might there, however, be other explanations?

Of course. There could be prosecutorial misconduct or planting of evidence or a problem with chain of custody. There are many possible explanations for the data – our job is to seek the most likely explanation for the facts.

It’s not the data that are bad; it’s the explanation for what the root cause is.

We need the ability to question the explanations we are hearing. In fact, we should strongly encourage such questioning. A vigorous discussion of the facts and how likely our explanations are to be the right ones is important in getting to the truth.

We must protect freedom of thought and expression because it’s necessary for seeking the truth.

Develop new technologies that help to close the gap between when falsehoods start and when the truth is revealed

Again, by its nature, the truth takes time to be revealed, so the lies will always come first. That said, we can best enable the truth’s unveiling through technology. We need approaches that focus on connecting available information into the most likely explanation. We need a way to shrink the time gap between when the lies start being disseminated and when the truth can catch up.

Fundamentally, if we care about the truth, we should allow the search for truth to occur.

We must support the search for truth, the approaches of finding and evaluating that truth, and the institutions, such as science, investigations, and journalism, that are in that business.

3 Things We Can Do About Fake News

Mic Farris — Fri, 18 May 2018 03:42:15 +0000

FAKE NEWS!

It’s amazing that we’ve now had our collective awareness heightened to the problem of fake news. I get frustrated at times with the sheer nonsense that seems to swim in the public consciousness, but in search of what I can do about it, I figured I’d share something that happened to me recently.

I have many people in my Facebook feed that are professional, fun, and have real character, but there are some that carry a bit of an aggressive and intolerant flavor. They’re Facebook friends for other reasons, and, while I could easily unfriend these folks, I resist the urge so that I remain open to what they have to say or what they share. People have a variety of perspectives, and I feel it’s important to understand them, even if the purpose is to counter them with better information. That said, it can be a bit trying at times, and here’s one of those times.

This image was spread widely via Facebook (29,218 shares at the time of this writing), finding its way into my own feed. The clear implication was that it is unfair for Ivanka Trump to be criticized for what she wears, when Michelle Obama, who is lauded for her fashion sense, wore something ugly in an official setting.

Some of the less incendiary comments from the thread include:

but it is ok for Moochell to dress like a pinata. she was a disgusting first lady

Is there anything about the Trump family the press will not complain about. Frankly I am tired of their [expletive deleted] complaints. The press is awful

…Michelle’s dress is horrid!!

Images like these seem to spread so easily via Facebook Shares or Twitter retweets, potentially reinforcing any biases held by the people that view them. The assumption that most of us hold is that the images we see are genuine, but in this case, it’s worth questioning what our own eyes see.

When I first saw this image, it didn’t seem right to me. Did Michelle Obama really wear this dress? It didn’t seem consistent with what I’ve understood. I tried to evaluate the information with a critical eye, questioning whether it makes sense. Sure enough – with a little investigating, I found that this image of the Obamas was altered somewhere along the way to superimpose the bullseye dress on the First Lady.

I could have stopped there, satisfied with my own discovery of the truth, but I decided to inform others, sending a comment to the Facebook thread focusing on the facts that the Michelle Obama image was faked. My premise: in order to have a better informed view of the world, the people in this thread needed to be presented with facts that (1) were true and (2) countered their current view.

Here was my post:

=====

At risk of receiving a lot of intemperate responses, this post was shared in my FB feed so given the epidemic of spreading fake news, it’s worth a reply to get some facts in the mix.

The image of Michelle Obama is fake – it was photoshopped from the attached image from 2009, where the Obamas were on their way to award a posthumous Medal of Honor to a soldier killed in Afghanistan, U.S. Army Sergeant First Class Jared C. Monti. (https://www.gettyimages.com/…/obama-awards-medal-of…)

Discourse is important and it’s important to share our views and opinions, but they need to be based on the facts. None of us benefit otherwise.

Facts matter. We should exercise our freedom of expression, but let us all agree to seek the truth and honor the facts, so that our views can be better informed.

=====

If you are already predisposed to see the Obamas negatively, the doctored photo is information that, upon a quick glance, confirms your understanding, so you might determine that it is “the truth”.

However, this violates what I believe to be a more scientific definition of truth — consistency under repeated and independent examination, and this phenomenon, which is referred to as confirmation bias, becomes a violation of independent examination.

The values people hold (“I can’t stand those Obamas…”) get in the way of the information presented to them (“the image you’re viewing, which is reinforcing your negative view of the Obamas, was faked”). For us to seek the truth accurately, we need to be more open to the information presented to us, while simultaneously being critical of whether the information itself is consistent with other independent information.

This is truly the challenge of our time, that of arming us all with the ability to review information/data critically and make beneficial and well-informed decisions. We are being inundated with more and more data, and much of our society’s decision making is being automated. Much of what we’re experiencing with hand-coded bots, trolls, and disinformation campaigns is human-driven, but it really is decision-driven – automated technologies will be able to create the same kinds of chaos, only at scale, unless we understand and approach differently how we make better informed decisions in this ocean of information.

Given these challenges, here are three things we can do now (and learn to be better at):

1) Evaluate the data – not just the source

To find the actual truth – the one which serves as the best explanation for what we observe – we can’t trust the source of information without question. We can’t just assume that the New York Times is accurate because they are a trusted source for us. For those that trust the Washington Post, there are others that trust Fox News. If we only focus on the sources, we keep ourselves in our own information tribes and don’t get at the truth.

When we blindly believe what a trusted source tells us, we open ourselves to confirmation bias, or worse, manipulation and coercion. All of these effects lead to bad decision making, some with more severe impacts than others.

To counter this, we need to analyze the information, reporting, and data critically, and seek some manner of independent confirmation to see that all the information we receive is consistent with the explanation at hand.

2) Engage – Conversation and communication are needed

Admittedly, it’s easier to just turn off the noise machine and speak only with those who already share our views. But, if we are to be stronger as a nation, or even a global community, we need to share our perspectives, encourage a vigorous back-and-forth regarding our views, and jointly agree to hold that discourse in the presence of facts and in search of the truth. The alternative is to disengage, leading us to destruction of “the other side” and their points of view for our own survival. This is a natural tendency throughout all of human history, resulting in chaos, war, and worse, so the tough work of engagement and striving for the truth is critical.

Again, as we start to rely upon automated decision making algorithms and artificial intelligence, these information processing biases will not go away – they will only become part and parcel of the automated information flows, so it’s an important phenomenon of which to be aware.

The sharing of knowledge, information, and perspectives is what is needed to get to a more accurate version of the truth and the best explanation for what happens in our world.

3) Focus on the facts and choose to seek the truth

Facts matter, but values can differ. Two people can, in fact, see the same information and agree on the root causes for what they see, and yet make two different decisions regarding what to do about it.

Even though different groups may hold contrasting values regarding what our decisions should be, such as Democrats and Republicans in Congress, both sides should agree to seek a common set of facts upon which to base our collective decision making. It’s OK to disagree on the decisions, but we shouldn’t disagree on the facts.

If we focus on our tribes, we won’t seek truth; we’ll only seek “our” side winning. If we seek the truth, ultimately, in the long run, we all win.

The Advent of Analytics Engineering

Mic Farris — Fri, 01 Sep 2017 05:36:01 +0000

Data Science has become an exploding field in recent years, and depending on whether you are focusing on machine learning, artificial intelligence, or citizen data science, the discipline of data science is creating very high expectations.

There is indeed much promise for data science, where predictive models and decision engines can target skin cancer in patient imagery, presciently recommend a new product that piques your interest, or power your self-driving car to evade a potential accident.

However, promise requires much effort for it to be realized. It takes a lot of work and brand new engineering disciplines that are not yet mature or even employed on a wide scale. As there is greater recognition of the value of data science, and the generation of data is increasing at exponential rates, this engineering effort is starting and will grow beyond its adolescence soon.

This is why we are at the advent of a new engineering discipline that can truly realize the promise of data science – a discipline that I call “analytics engineering”.

Consider the parallels.

Computer science became an established academic discipline in the 1950s and 1960s, and serves as a foundation for today’s technology industry, articulating the theory and application of algorithms, computation, logic, and information representation in building real computing devices. The applications of computer science have included code breaking in World War II, the creation of ENIAC and the IBM mainframe.

However, something happened in the 1970s to democratize the development of software – the announcement of the Altair. At the time, Intel sold a microprocessor for $10,000, but the Altair was only $400. At this price, the microcomputer became accessible to individuals – geeks who wanted to build their own computer. Clubs started meeting in Silicon Valley, such as the Homebrew Computer Club and the Altair Users Group, to show what could be done with these computers and how they could be programmed.

Hackers took hold of an industry and an explosion of innovation ensued. Steve Jobs and Steve Wozniak formed Apple, Bill Gates and Paul Allen launched Microsoft, and the personal computer was born.

Eventually, as industries were created and matured, a strong engineering discipline came to computer programming – a discipline we now refer to as software engineering.

Data Science includes the analysis of data and applying solid approaches to gain meaning or insight from the data. There are several fundamentals to the field of data science, which I’ve elaborated on before.

That said, the maturity of data science as a discipline is following a trajectory similar to that of software. Top universities such as Columbia, Cornell, and University of California-Berkeley now offer programs and degrees in Data Science, establishing the academic discipline.

With prototyping languages such as R and Python, which are free to download, anyone can start programming, working with data, and applying data science principles. The barrier to entry for becoming a data scientist is now nearly zero.

However, just because someone can do something doesn’t mean they can do it well. Becoming a true practitioner is important, and learning the disciplines of a craft through experience and hard work is a must. Additionally, firms that leverage data science capabilities cannot afford to deploy a 24/7 operational capability based on a model developed for free on someone’s laptop. More engineering specialty is required, which is where the industry is heading, just like software and other engineering disciplines.

Throughout the history of innovation, this maturity curve has followed a common path, being part of a great surge in capability and creativity, supported by solid engineering practice. With mechanical engineering, the Industrial Revolution was launched. Electrical engineering led to advancements such as electricity, radio, and television. Of course, software engineering has ushered in the age of computing and the internet. What will analytics engineering bring? Possibly what’s needed to support the age of artificial intelligence.

Analytics Engineering encompasses a key set of specialties that are not yet in common practice. There is a promise for data science that all one needs to do is “just push a button and the models get developed”. Others say that there are many different models we could try, so if we try a thousand different models on the data, we’ll evaluate all of them and “pick the best one”.

These approaches are symptoms of technology hype, so we should take them with a grain of salt.

For example, even after many years of developing computer applications, software even today isn’t written with just a “push of a button”. Engineering practices are still needed (and need to be followed) for quality software to be shipped. Sure – hackers can prototype something quickly and demonstrate truly innovative capabilities. However, for this to scale, be reliable, and ultimately operational without frequent failures, engineering disciplines need to be employed.

In this new age, true analytics engineering disciplines are what is needed, tailored to the needs of analytics and decision modeling.

Data Science isn’t magic, and never will be. Yet, more focused analytics engineering disciplines can be developed to become part of decision model development and improvement moving forward. The promise of data science, machine learning, and artificial intelligence will depend on this trend, which makes this an exciting time for the industry.

Why is this important for data science? Imagine a ROC curve where the false positive rate is very low, say 1% at an acceptably high true positive rate.

Are we satisfied? Consider the case where a decision model, say to identify high risk customers in a financial institution, is run on a database of 1 million customers. A false positive rate of 1% would still yield a database of 10,000 customers that would need to be reviewed, and purely in error, since these are falsely flagged as high risk customers. When you are working at scale, with millions if not billions of tests being run through our decision models, the performance of these models needs to demonstrate incredibly low false positive rates to be worth using.
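The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the share of truly high risk customers (0.1%) and the true positive rate (90%) are assumed values that do not come from any real institution. Strictly speaking, the false positive rate applies to the customers who are not high risk, which is nearly everyone when high risk customers are rare.

```python
def expected_false_positives(n_scored, prevalence, false_positive_rate):
    """Expected count of truly low-risk customers that get flagged anyway."""
    n_low_risk = n_scored * (1.0 - prevalence)
    return n_low_risk * false_positive_rate


def flag_precision(n_scored, prevalence, true_positive_rate, false_positive_rate):
    """Fraction of flagged customers that really are high risk."""
    true_pos = n_scored * prevalence * true_positive_rate
    false_pos = expected_false_positives(n_scored, prevalence, false_positive_rate)
    return true_pos / (true_pos + false_pos)


# From the text: 1 million customers scored at a 1% false positive rate.
# Assumed for illustration: 0.1% are truly high risk and 90% of those are caught.
n_customers = 1_000_000
false_alarms = expected_false_positives(
    n_customers, prevalence=0.001, false_positive_rate=0.01
)
precision = flag_precision(
    n_customers, prevalence=0.001, true_positive_rate=0.90, false_positive_rate=0.01
)

print(f"false alarms to review: {false_alarms:,.0f}")
print(f"share of flags that are real: {precision:.1%}")
```

Under these assumed numbers, the model raises just under 10,000 false alarms, and fewer than one in ten flagged customers is actually high risk, even though the false positive rate sounds small.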

As Analytics Engineering matures, here are some of the developments that we can expect:

– New metrics will be developed to compare model performance in more accurate ways, superseding effective yet crude metrics such as Area Under the Curve (or AUC).

– New analysis techniques will be leveraged to focus on insights gained from the tails of statistical distributions, which are the true drivers of false positive rates in decision models.

– Tools and technologies will be created and matured to manage models, control versions, and track audit changes in model development and deployment.

– Standards, similar to CMMI or Agile in the software engineering world, will be developed and gain traction to provide for more explicit best practices around the creation, management, and engineering of decision models.

Companies such as Netflix, Tesla, Apple, Amazon, Google, Facebook, and others are already developing these disciplines in-house, as the success of their respective business models demands this advancement. However, other businesses will need to leverage these capabilities soon to keep pace.

It’s an exciting time to recognize and help define what this new engineering discipline will become. For data science, it’s currently like the Wild West of old – wide expanses, plenty of room to “stake your claim”, and a rush to “get in on” the field that is hot. That said, we aren’t all cowboys and the West is now being tamed.

Welcome to the advent of Analytics Engineering.

The Fundamentals of Data Science

Mic Farris — Fri, 10 Oct 2014 04:24:49 +0000

Two of the biggest buzzwords in our industry are “big data” and “data science”. Big Data seems to have a lot of interest right now, but Data Science is fast becoming a very hot topic.

I think there’s room to really define the science of data science – what are those fundamentals that are needed to make data science truly a science we can build upon?

Below are my thoughts for an outline for such a set of fundamentals:

Fundamentals of Data Science

Introduction

The easiest thing for people within the big data / analytics / data science disciplines is to say “I do data science”. However, when it comes to data science fundamentals, we need to ask the following critical questions: What really is “data”, what are we trying to do with data, and how do we apply scientific principles to achieve our goals with data?

– What is Data?
– The Goal of Data Science
– The Scientific Method

Probability and Statistics

The world is a probabilistic one, so we work with data that is probabilistic – meaning that, given a certain set of preconditions, data will appear to you in a specific way only part of the time. To apply data science properly, one must become familiar and comfortable with probability and statistics.

– The Two Characteristics of Data
– Examples of Statistical Data
– Introduction to Probability
– Probability Distributions
– Connection with Statistical Distributions
– Statistical Properties (Mean, Mode, Median, Moments, Standard Deviation, etc.)
– Common Probability Distributions (Discrete, Binomial, Normal)
– Other Probability Distributions (Chi-Square, Poisson)
– Joint and Conditional Probabilities
– Bayes’ Rule
– Bayesian Inference
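As a taste of where this section of the outline leads, here is a minimal sketch of Bayes’ Rule for a single yes/no hypothesis. The function name and the example probabilities are assumptions chosen purely for illustration.

```python
def bayes_posterior(prior, p_data_given_h, p_data_given_not_h):
    """P(H | data) for a binary hypothesis H, computed with Bayes' Rule."""
    # Total probability of seeing the data, whether or not H is true.
    evidence = p_data_given_h * prior + p_data_given_not_h * (1.0 - prior)
    return p_data_given_h * prior / evidence


# Illustrative numbers: H is true 1% of the time, the data shows up 90% of
# the time when H is true and 5% of the time when it is not.
posterior = bayes_posterior(prior=0.01, p_data_given_h=0.90, p_data_given_not_h=0.05)
print(f"P(H | data) = {posterior:.3f}")
```

With these numbers the posterior comes out to roughly 15%: the data makes the hypothesis much more plausible than the 1% prior, yet still far from certain.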

Decision Theory

This section is one of the key fundamentals of data science. Whether applied in scientific, engineering, or business fields, we are trying to make decisions using data. Data itself isn’t useful unless it’s telling us something, which means we’re making a decision about what it is telling us. How do we come up with those decisions? What are the factors that go into this decision making process? What is the best method for making decisions with data? This section tells us…

– Hypothesis Testing
– Binary Hypothesis Test
– Likelihood Ratio and Log Likelihood Ratio
– Bayes Risk
– Neyman-Pearson Criterion
– Receiver Operating Characteristic (ROC) Curve
– M-ary Hypothesis Test
– Optimal Decision Making
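To make a few of these topics concrete, here is a small sketch of a binary hypothesis test and its ROC curve. It assumes the simplest textbook setup, which is my choice for illustration: one observation x that is Gaussian with mean mu0 under the null hypothesis and mean mu1 (larger than mu0) under the alternative, with the same standard deviation in both cases. In that setup the likelihood ratio grows with x, so the likelihood ratio test reduces to comparing x against a threshold. All names and values are hypothetical.

```python
import math


def gaussian_tail(z):
    """P(Z > z) for a standard normal random variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def roc_point(threshold, mu0=0.0, mu1=2.0, sigma=1.0):
    """(false positive rate, true positive rate) when we decide H1 if x > threshold."""
    fpr = gaussian_tail((threshold - mu0) / sigma)  # H0 true, but x lands above threshold
    tpr = gaussian_tail((threshold - mu1) / sigma)  # H1 true, and x lands above threshold
    return fpr, tpr


# Sweep the threshold from -2.0 to 4.0 in steps of 0.5 to trace out the ROC curve.
roc = [roc_point(t / 2.0) for t in range(-4, 9)]
for fpr, tpr in roc:
    print(f"FPR={fpr:.4f}  TPR={tpr:.4f}")
```

Raising the threshold lowers both rates together, which is the trade-off the ROC curve displays; the Neyman-Pearson criterion amounts to picking the threshold that meets a required false positive rate.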

Estimation Theory

Sometimes we make characterizations of data – averages, parameter estimates, etc. Estimation from data is essentially an extension of decision making, a natural next section from Decision Theory.

– Estimation as Extension of M-ary Hypothesis Test
– Unbiased Estimation
– Minimum Mean Square Error (MMSE)
– Maximum Likelihood Estimation (MLE)
– Maximum A Posteriori Estimation (MAP)
– Kalman Filter

Coordinate Systems

To bring various data elements together into a common decision making framework, we need to know how to align the data. Knowledge of coordinate systems and how they are used becomes important to lay a solid foundation for bringing disparate data together.

– Introduction to Coordinate Systems
– Euclidean Spaces
– Orthogonal Coordinate Systems
– Properties of Orthogonal Coordinate Systems (angle, dot product, coordinate transformations, etc.)
– Cartesian Coordinate System
– Polar Coordinate System
– Cylindrical Coordinate System
– Spherical Coordinate System
– Transformations Between Coordinate Systems

Linear Transformations

Once we understand coordinate systems, we can learn why to transform the data to get at the underlying information. This section describes how we can transform our data into other useful data products through various types of transformations, including the popular Fourier transform.

– Introduction to Linear Transformations
– Properties of Linear Transformations
– Matrix Multiplication
– Fourier Transform
– Properties of Fourier Transforms (time-frequency relationship, shift invariance, spectral properties, Parseval’s Theorem, Convolution Theorem, etc.)
– Discrete and Continuous Fourier Transforms
– Uncertainty Principle and Aliasing
– Wavelet and Other Transforms

Effects of Computation on Data

An often overlooked aspect of data science is the impact the algorithms we apply have on the information we are seeking to find. Merely applying algorithms and computations to create analytics and other data products has an impact on the effectiveness of data-driven decision making. This section takes us on a journey through advanced aspects of data science.

– Mathematical Representation of Computation
– Reversible Computations (Bijective Mapping)
– Irreversible Computations
– Impulse Response Functions
– Transformation of Probability Distributions (due to addition, subtraction, multiplication, division, arbitrary computations, etc.)
– Impacts on Decision Making

Prototype Coding / Programming

One of the key elements to data science is the willingness of practitioners to “get their hands dirty” with data. This means being able to write programs that access, process, and visualize data in important languages in science and industry. This section takes us on a tour of these important elements.

– Introduction to Programming
– Data Types, Variables, and Functions
– Data Structures (Arrays, etc.)
– Loops, Comparisons, If-Then-Else
– Functions
– Scripting Languages vs. Compiled Languages
– SQL
– SAS
– R
– Python
– C++

Graph Theory

Graphs are ways to illustrate connections between different data elements, and they are important in today’s interconnected world.

– Introduction to Graph Theory
– Undirected Graphs
– Directed Graphs
– Various Graph Data Structures
– Route and Network Problems

Algorithms

Key to data science is understanding the use of algorithms to compute important data-derived metrics. Popular data manipulation algorithms are included in this section.

– Introduction to Algorithms
– Recursive Algorithms
– Serial, Parallel, and Distributed Algorithms
– Exhaustive Search
– Divide-and-Conquer (Binary Search)
– Gradient Search
– Sorting Algorithms
– Linear Programming
– Greedy Algorithms
– Heuristic Algorithms
– Randomized Algorithms
– Shortest Path Algorithms for Graphs

Machine Learning

No data science fundamentals course would be complete without exposure to machine learning. However, it’s important to know that these techniques build upon the fundamentals described in previous sections. This section gives practitioners an understanding of useful and popular machine learning techniques and why they are applied.

– Introduction to Machine Learning
– Linear Classifiers (Logistic Regression, Naive Bayes Classifier, Support Vector Machines)
– Decision Trees (Random Forests)
– Bayesian Networks
– Hidden Markov Models
– Expectation-Maximization
– Artificial Neural Networks and Deep Learning
– Vector Quantization
– K-Means Clustering

Question: Do you have any thoughts on the fundamentals of data science? You can leave a comment below.

A Data Science Lesson from Richard Feynman

Mic Farris — Mon, 10 Feb 2014 00:12:46 +0000

Richard Feynman

Richard Feynman is one of the greatest scientific minds, and what I love about him, aside from his brilliance, is his perspective on why we perform science. I’ve been reading the compilation of short works of Feynman titled The Pleasure of Finding Things Out, and I recently came across a section that really hit home with me.

In the world of data science, much is made about the algorithms used to work with data, such as random forests or k-means clustering. However, I believe there is a missing component – one that deals with the fundamentals underlying data science, and that is the real science of data science.

The following paragraphs are taken from The Pleasure of Finding Things Out, which I would encourage you all to read. Feynman’s way of cutting through the scientific and mathematical gobbledygook to get to the essence of what all that stuff represents is remarkable, which in my mind just demonstrates his brilliance since he’s so able to communicate what he knows to other people. I’ve written on the importance of effective communication, especially in science – the most effective scientific communicators were Albert Einstein and Stephen Hawking; I would definitely put Richard Feynman in that class.

One way, that’s kind of a fun analogy in trying to get some idea of what we’re doing in trying to understand nature, is to imagine that the gods are playing some great game like chess, let’s say, and you don’t know the rules of the game, but you’re allowed to look at the board, at least from time to time, in a little corner, perhaps, and from these observations you try to figure out what the rules of the game are, what the rules of the pieces moving are. You might discover after a bit, for example, that when there’s only one bishop around on the board that the bishop maintains its color. Later on you might discover the law for the bishop as it moves on the diagonal which would explain the law that you understood before – that it maintained its color – and that would be analogous to discovering one law and then later finding a deeper understanding of it. Then things can happen, everything’s going good, you’ve got all the laws, it looks very good, and then all of a sudden some strange phenomenon occurs in some corner, so you begin to investigate that – it’s castling, something you didn’t expect. We’re always, by the way in fundamental physics, always trying to investigate those things in which we don’t understand the conclusions. After we’ve checked them enough, we’re okay.

The thing that doesn’t fit is the thing that’s the most interesting, the part that doesn’t go according to what you expected. Also, we could have revolutions in physics: after you’ve noticed that the bishops maintain their color and they go along the diagonal and so on for such a long time and everybody knows that that’s true, then you suddenly discover one day in some chess game that the bishop doesn’t maintain its color, it changes its color. Only later do you discover a new possibility, that a bishop is captured and that a pawn went all the way down to the queen’s end to produce a new bishop – that can happen but you didn’t know it, and so it’s very analogous to the way our laws are: They sometimes look positive, they keep on working and all of a sudden some little gimmick shows that they’re wrong and then we have to investigate the conditions under which this bishop change of color happened and so forth, and gradually learn the new rule that explains it more deeply. Unlike the chess game, though, in [which] the rules become more complicated as you go along, in physics, when you discover new things, it looks more simple. It appears on the whole to be more complicated because we learn about a greater experience – that is, we learn about more particles and new things – and so the laws look complicated again. But if you realize all the time what’s kind of wonderful – that is, if we expand our experience into wilder and wilder regions of experience – every once in a while we have these integrations when everything’s pulled together into a unification, in which it turns out to be simpler than it looked before.

If you are interested in the ultimate character of the physical world, or the complete world, and at the present time our only way to understand that is through a mathematical type of reasoning, then I don’t think a person can fully appreciate, or in fact can appreciate much of, these particular aspects of the world, the great depth of character of the universality of the laws, the relationships of things, without an understanding of mathematics. I don’t know any other way to do it, we don’t know any other way to describe it accurately… or to see the interrelationships without it. So I don’t think a person who hasn’t developed some mathematical sense is capable of fully appreciating this aspect of the world – don’t misunderstand me, there are many, many aspects of the world that mathematics is unnecessary for, such as love, which are very delightful and wonderful to appreciate and to feel awed and mysterious about; and I don’t mean to say that the only thing in the world is physics, but you were talking about physics and if that’s what you’re talking about, then to not know mathematics is a severe limitation in understanding the world.

The connection here to data science is the search for understanding. Research and engineering teams use data science to explain things about the data, so that we can use that information later – maybe to make predictions, maybe for better explanations, maybe to make better products. However, the key part is the understanding and without that, data science is merely a collection of tools and techniques used to fit observations. Unless we seek to understand – trying to find “the why” – then we won’t really know whether our data science models, tools, or techniques are actually working.

If you are interested, these passages are from a television interview Feynman conducted as part of a BBC documentary Richard Feynman: No Ordinary Genius.

Question: Do you have any thoughts on the fundamental science of data science or on Richard Feynman? You can leave a comment below.

10 Things To Know When Hiring Data Scientists

Mic Farris — Wed, 04 Sep 2013 17:01:25 +0000

I’ve been performing data science before there was a field called “data science”, so I’ve had the opportunity to work with and hire a lot of great people. But if you’re trying to hire a data scientist, how do you know what to look for, and what should you consider in the interview process?

I’ve been doing what is now called “data science” since the early 1990s and have helped to hire numerous scientists and engineers over the years. The teams I’ve had the opportunity to work with are some of the best in the world, tackling some of the most challenging problems facing our country. These folks are also some of the smartest people I’ve ever had the opportunity to work with.

That said, not everyone is a good fit, and the discipline of data science requires important key elements. Hiring someone into your team is incredibly important to your business, especially if you’re a small startup or building a critical internal data science team; mistakes can be expensive in both time and money. This can be even more intimidating if you don’t have the background or experience in hiring scientists, especially someone responsible for this new discipline of working with data.

We’ve conducted many interviews with potential candidates, and not everyone has met our standards. We’ve also made some hiring mistakes that eventually got corrected over time, but might have been avoided if we had followed our own lessons earlier. Hopefully I can give you a few helpful things to look for when trying to fill out your data science team.

A note of caution: I may say some things here that seem like heresy. I’ve viewed some job descriptions that make data science seem like it’s all about coding. Be careful! Data science is not all about coding; it’s about understanding what data represents and how to convert it into reliable information. Coding is necessary to make this happen, but there are true fundamentals – the real science of data science – that aren’t being taught well right now (more on that in another post…)

Given that, here are 10 key things you should consider when filling out your data science teams:

Look for strong math backgrounds. Data science requires a background in mathematics – this is really not negotiable. We made some mistakes in the past finding candidates that had strong software backgrounds, but didn’t really have the necessary math fundamentals. What happened? The software team didn’t have a strong appreciation for the data-crunching algorithms being developed, so there was a divide within our team; it became harder for the scientists and the software engineers to work together to achieve common goals. Knowledge of statistics, linear algebra, calculus, geometry, and trigonometry is a baseline requirement. I’ve even heard stories of a company (not ours, thank goodness…) that had a programmer who implemented algorithms incorrectly and didn’t really appreciate what was being done. An algorithm that used the sum of squared values was implemented by squaring the sum of the values, because it ran faster (by the way, these two algorithms are not the same!). This is a simple example, but if your implementation team doesn’t have a strong math background, they might not know the difference, causing you real problems down the road.
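
To make the difference concrete, here is a minimal sketch in Python. The function names and the sample values are mine, chosen only for illustration; they are not taken from the company in the story.

```python
def sum_of_squares(values):
    """Square each value first, then add the squares together."""
    return sum(v * v for v in values)


def square_of_sum(values):
    """Add the values together first, then square the total."""
    total = sum(values)
    return total * total


# Illustrative values only.
values = [1, 2, 3]
print(sum_of_squares(values))  # 1 + 4 + 9 = 14
print(square_of_sum(values))   # (1 + 2 + 3) ** 2 = 36
```

The two results agree only in special cases (for example, when at most one value is nonzero), so swapping one for the other quietly changes every number downstream.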

Seek the willingness to program, not necessarily specific languages. Here’s the heretical statement – don’t pay as much mind to the actual programming languages someone has on their resume. They need to have experience programming and show that they can get their hands dirty with coding. However, data science is about learning and discovering; you need to be flexible. So, look beyond the recitation of R, Python, Pig/Hive, C/C++, Perl, MATLAB, IDL, SQL, SAS, Java, Unix shell, Ruby, Scala… Any good scientist who is willing and able to program can pick up a new language. It’s much harder to get someone who knows the ins-and-outs of a particular language to be a good scientist. Keyword searches of résumés targeting specific languages may exclude some strong candidates, while also including others that are weak yet know what to put on their résumé. Manual review can lead to the same result, since if you’re looking for Hadoop experience, you may hire that and onboard a less-than-stellar data scientist in the process.

Make sure candidates have a probabilistic/statistical view of the world. In data science, nothing is black and white. Data has two driving elements – the behavioral (characterized by what we know – our model for how the data is observed) and the statistical (everything else that we currently don’t understand). The job of the quality data science team is to characterize the behavioral and deal with the statistical. As your team gets to know the data better, they will find even more subtle drivers and nuggets of information, turning what used to be statistical into a more accurate behavioral model (this is the cool part of data science – amazing predictive ability!). Candidates must have an appreciation for a probabilistic view of the world, meaning that when a certain condition occurs, you expect the data to appear a given way with some probability (or only some of the time). A background in statistics is an absolute must in data science, so look for that on the résumé and test for it in your interviewing. With that said…

Look for people who are detail-oriented and want to get to the root cause. Statistics come from the lack of information about what drives the data we observe, which you can get at when you have more data. Sometimes there is a real root cause to what we see, and good data scientists try to figure out why. Technical staff members who aren’t detail-oriented tend to make more mistakes than those who are, leading to inaccurate results and incorrect conclusions. I’ve seen really smart people find some very confusing results in their data analysis and be stumped as to what it was. When we looked into it further, it was merely a bug in their algorithm (not necessarily in their implementation) which led to some subtle errors. A probabilistic view of the world is important, but having a taste for getting to the bottom of things is equally valuable.

Find people who can communicate effectively. An often overlooked quality for data science candidates is top communication skills. Even if someone is working alone on their data analysis, they have to communicate with someone, whether that is his boss or her colleague; no one works alone. I’ve written several articles about the importance of communicating (such as What We Can Learn From Stephen Hawking, Why Scientists Are Lousy Communicators, and tips on Job Interview Presentations), and it becomes especially important for those in the sciences. Math and science geeks think presenting is merely for marketers and sales people… Not so! If you want others to believe the results of your data science teams, your team has an obligation to communicate effectively.

Include your current scientific staff in interviews. We know that hiring is the job for managers. However, including your current staff in the interview process can yield real benefits. It can ensure that new candidates will be good fits for the organization and can even improve the company. In his 1998 letter to shareholders, Jeff Bezos, CEO of Amazon, detailed three questions that were asked of his hiring teams when evaluating candidates. Here’s what Bezos wrote about these questions:

Will you admire this person? If you think about the people you’ve admired in your life, they are probably people you’ve been able to learn from or take an example from. For myself, I’ve always tried hard to work only with people I admire, and I encourage folks here to be just as demanding. Life is definitely too short to do otherwise.
Will this person raise the average level of effectiveness of the group they’re entering? We want to fight entropy. The bar has to continuously go up. I ask people to visualize the company 5 years from now. At that point, each of us should look around and say, “The standards are so high now — boy, I’m glad I got in when I did!”
Along what dimension might this person be a superstar? Many people have unique skills, interests, and perspectives that enrich the work environment for all of us. It’s often something that’s not even related to their jobs. One person here is a National Spelling Bee champion (1978, I believe). I suspect it doesn’t help her in her everyday work, but it does make working here more fun if you can occasionally snag her in the hall with a quick challenge: “onomatopoeia!”

Bring members of your current team in with the understanding that you’re looking for people who will make their team better, and the help from your current staff will be valuable in assessing talent.

Don’t get so hung up on brainteasers – whether they can or can’t answer them. I know that some companies like to put candidates on the spot and get them to solve brainteasers during their interview. Personally, I find this to be a waste of time and an inaccurate way to tell whether someone will work well as a data scientist on your team. Some people need a little time to work through a problem, but if they have that time, they nail it. Others get to the right answer by trying out many things, learning from their mistakes, and homing in on what works. Brainteasers would make these candidates look like they can’t do the job, so they’d get weeded out. Plus, if someone happened to solve a brainteaser quickly, it may mean that they’ve been exposed to that particular problem before, which is why they know it so easily (for example, here’s one: For any prime number p > 5, show me why p²-1 is divisible by 24…). You aren’t hiring someone who can solve the problem – you are hiring someone who can find the solution to the problem. They may solve it themselves (which can be especially important when the problem has never been solved before), but if it has been solved, why would you want someone who is predisposed to solving it over again? Instead…

Ask open-ended questions that provoke how people approach problems. There is a great book, Are You Smart Enough To Work at Google?, which details how Google evaluates candidates for their teams. There is even an insightful question they have asked: You are shrunk to the height of a nickel and thrown into a blender. Your mass is reduced so that your density is the same as usual. The blades start moving in 60 seconds. What do you do? (How would you answer this?…) For their interview process, Google posts on their website how they approach it and what they look for. They generally look at four elements:

Leadership. We’ll want to know how you’ve flexed different muscles in different situations in order to mobilize a team. This might be by asserting a leadership role at work or with an organization, or by helping a team succeed when you weren’t officially appointed as the leader.
Role-Related Knowledge. We’re looking for people who have a variety of strengths and passions, not just isolated skill sets. We also want to make sure that you have the experience and the background that will set you up for success in your role. For engineering candidates in particular, we’ll be looking to check out your coding skills and technical areas of expertise.
How You Think. We’re less concerned about grades and transcripts and more interested in how you think. We’re likely to ask you some role-related questions that provide insight into how you solve problems. Show us how you would tackle the problem presented–don’t get hung up on nailing the “right” answer.
Googleyness. We want to get a feel for what makes you, well, you. We also want to make sure this is a place you’ll thrive, so we’ll be looking for signs around your comfort with ambiguity, your bias to action and your collaborative nature.

Use a group interview process when possible. Having an interview process that is back-to-back-to-back-to-back one-on-one interviews leads to repeat questions, making it tiring for the candidate. Additionally, when the interviewing team gets together to discuss the candidate (if they do at all), each member has a different perspective on the candidate because different questions may have been asked and different answers might have been given for the same repeat question. When you have multiple people (four to six) hearing the same thing as part of a group interview, you can get a better feel for the person coming on board. The information is the same, but different people pick up on different things, so it gives the team a more well-rounded perspective on the candidate. Something to keep in mind: These types of interviews can be intimidating for someone being interviewed, so it’s important to establish an environment of trust from the start. Make them comfortable so that you can get the best out of them.

Look for people who can tell you what they’ve learned, not just what they did. Machine learning algorithms are great at exploiting separations in data. But why are we looking for separations in data? To make better decisions with that data. The tools of data science are important to know, but if we don’t look for the “why” in the data science we are performing, then we are just using tools for the sake of using tools. Just because someone is an expert in hammers and nails doesn’t make him a carpenter. Extracting information out of data is all about context – what question are you asking of the data, and what drives what you see? This is about understanding, forming hypotheses, drawing conclusions. If a data scientist starts down the path of “we used this algorithm and the metrics came out like this…” without giving you some context or understanding of what it means, then you and your team could run into problems down the road (overfitting, building models that aren’t robust, etc.). It’s the difference between hearing about what you did on your summer vacation and what you learned on your summer vacation. In looking for what makes a good data scientist, DJ Patil talks about storytelling – the ability to use data to tell a story and to be able to communicate it effectively. Data scientists need to understand what they are trying to communicate and let their data science help them tell that story. No one really wants to hear what you did on your summer vacation, but they may want to know what you’ve learned and how you learned it.

With so much data being created on a daily basis, now is a wonderful opportunity for companies that can leverage key data science disciplines. Knowing what to look for when hiring can let you take a solid step forward in building a successful data science team.

Question: Have you ever hired people to fill out your data science team? Interested in sharing your experiences in “what works” and “what doesn’t”? You can leave a comment below.

The Best Way To Learn New Things

Mic Farris — Fri, 30 Aug 2013 22:01:46 +0000

Science and business seem like two very different disciplines, but is the best approach to learning any different in these two fields? These areas of life seem so unique, and the people in them can be quite varying (one with the nerdy pocket protector and the other dressed in the well-tailored suit). However, both science and business require learning, and the best approach to learning in each is really the same.

The best approach to learning is generally through failure. For example, Thomas Edison failed an astounding number of times before he invented a working lightbulb, and there are likely thousands of stories about how successes came as a result of many tries and many failures.

In many ways, this is really an application of the scientific method. I’ve written a number of posts about Stephen Wolfram (such as using Wolfram|Alpha to look at your own social network, his views on big data, computing a theory of everything, and how he created his company). In the effort to learn even more about how the world works, Wolfram has pushed scientific discovery to the next level, which he’s done with his book A New Kind of Science (NKS for short).

NKS, a discipline focused on how simple computational programs can create amazingly complex things, is really the scientific method applied by asking many, many questions and testing each question. The nature of the problems being solved requires that more questions be asked and more tests be conducted.

Science progresses by asking questions and performing tests, and some answers can’t be found unless we ask the question. NKS urges us to ask many questions and perform many tests, and in some cases, ask all possible questions and test each one. Magical and amazing things can be learned as a result.

In business, Eric Ries has written up his approach to building strong companies in his book, The Lean Startup. Ries’ approach is really the scientific method applied to business. Ries (and his mentor Steve Blank) notes that a new business venture is really a temporary organization in search of a repeatable and scalable business model. You can plan your new business all you want, but that plan usually falls apart as soon as it interacts with customers.

Ries describes how you have to pose a hypothesis (for example, customers are more focused on cost) and then perform a test to see if that is really true. It becomes so important to let real data tell you whether you’re on the right track or not. This may require performing many different types of tests and failing often. Keeping track of these failures will guide you toward the best way to succeed.

A similar take comes from Seth Godin, master of marketing (author of such books as Permission Marketing and Tribes). In this video interview, he discusses his book Poke The Box, and points out the importance of trying and failing in order to learn how to succeed – here’s an excerpt from the interview:

I’ve launched books that’ve failed. I did a book called “E-mail Addresses of the Rich and Famous” – Roger Ebert got really mad at me. I’ve made videotapes that didn’t work; I’ve made books that didn’t work.

My lesson was: If I fail more than you do, I win, because built into that lesson is this notion that you get to keep playing. If you get to keep playing, you get to keep failing, and sooner or later, you’re going to make it succeed.

The people who lose are either the ones who don’t fail at all and get stuck, or ones who fail so big they don’t get to play again…

If you’re talking to a pacemaker assemblyman or an airline pilot, they don’t try stuff; they don’t say, “I wonder what happens if I do this,” and we’re really glad they don’t do that, because the cost of failing is greater than the cost of discovering what works and what doesn’t.

But almost no one I know builds pacemakers and I don’t know airline pilots. Most of us now live in a world where the kind of failure I’m talking about isn’t fatal at all. If you post a blog post and it doesn’t resonate with people, post another one tomorrow. If you tweet something and no one retweets it, tweet again in an hour. If you’re obsessed with doing what everyone else is doing, because of someone saying “you failed,” then you’re in really big trouble.

Here’s a quick overview of what we need to do – in science and in business – to help learn more and succeed:

Set a goal. Decide what you want to do or what you want to learn. In science, this might be to find an algorithm that performs a certain task or to develop a model that describes how something in the world works. In business, this might be to come up with a product that serves a specific customer need. This is the same as asking a question, such as “Will this model predict what happens next?” or “Will this product serve my customer’s need?” In either case, you need to know what you’re trying to do first or what question you are asking.
Form a hypothesis. A hypothesis is a statement that tries to explain behavior; it’s your belief about why something happens. At this point, it’s only a guess, although an educated one, based upon your previous knowledge. An example hypothesis might be “I believe customers will buy my product because they really want to protect the environment.” How do we know if this is really true? We’ll test it out.
Predict the outcome. You need to test your hypothesis, which means understanding how feedback would come to you under two situations – (1) if your hypothesis is true and (2) if your hypothesis is false. This is a critically important step, and fundamental to being a good scientist or a solid businessperson. I could probably go into a whole other post about how critical this is (and how some really smart people aren’t as careful as they should be with this step…). Get really clear in your mind about these two things – what would the feedback look like if you are right and what would the feedback be if you are not right.
Try it out. Create an experiment that will collect the feedback. Ideally your test will give different answers if your hypothesis was correct or incorrect, and this way you’d be able to tell whether or not you’ve confirmed what you know.
Compare your results to your expectations. Once you have your feedback – data collected from your scientific or business experiment – you need to analyze it to see if it confirms your hypothesis or not. Is the feedback most consistent with your hypothesis or not? If the data is unclear, try a different experiment. If the feedback tells you that your hypothesis is wrong, great! You’ve learned something you didn’t know before, and you can ask another question and carry out another test to learn even more.
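
As a rough sketch of the last three steps, suppose the hypothesis is that customers buy because they want to protect the environment, and the experiment shows half of the visitors an environmental message. Everything in this example is an assumption made for illustration: the purchase counts are made up, the names `conversion_rate` and `permutation_p_value` are hypothetical, the 0.05 cutoff is only a common convention, and a permutation test is just one of several reasonable ways to do the comparison.

```python
import random


def conversion_rate(outcomes):
    """Fraction of visitors who bought (1 = bought, 0 = did not)."""
    return sum(outcomes) / len(outcomes)


def permutation_p_value(control, variant, trials=2000, seed=0):
    """Estimate how often a lift at least as large as the observed one
    appears when group labels are shuffled at random, which is what we
    would expect to see if the message made no difference."""
    rng = random.Random(seed)
    observed = conversion_rate(variant) - conversion_rate(control)
    pooled = list(control) + list(variant)
    n_control = len(control)
    at_least_as_large = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        lift = conversion_rate(pooled[n_control:]) - conversion_rate(pooled[:n_control])
        if lift >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / trials


# Predict the outcome BEFORE looking at the data (assumed decision rule):
#   hypothesis true  -> positive lift and a p-value below 0.05
#   hypothesis false -> lift near zero and a large p-value
THRESHOLD = 0.05

# Try it out: made-up feedback, 1 = visitor bought, 0 = visitor did not.
control = [1] * 20 + [0] * 180   # standard message: 20 of 200 bought
variant = [1] * 40 + [0] * 160   # environmental message: 40 of 200 bought

# Compare the results to the expectations.
observed_lift, p_value = permutation_p_value(control, variant)
print(f"observed lift: {observed_lift:.3f}")
print(f"estimated p-value: {p_value:.4f}")
print("consistent with hypothesis" if observed_lift > 0 and p_value < THRESHOLD
      else "not consistent with hypothesis")
```

Writing the decision rule down before running the experiment is the part that matters most; it keeps you from deciding after the fact that whatever came back confirms what you already believed.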

Testing your hypothesis and having it turn out to be incorrect might be considered by some as a failure. However, without performing the test, you would never know that your insight was misplaced. This “failure” is really one more step toward success, and as Seth Godin says, if you keep failing, “sooner or later, you’re going to make it succeed.”

This is really how we learn – ask a question, form a hypothesis, test it out, and look at the results – and fail often to get to the successes.

Question: Besides science and business, are there other areas where you think the scientific method can be helpful? You can leave a comment below.

Beating Cancer and Favoritism with Data

Mic Farris — Tue, 27 Aug 2013 14:45:10 +0000

I read a couple of items in this month’s Fortune magazine that I thought were worth passing along.

The first was a small article by Brian Dumaine about the work being done at Applied Proteomics to identify cancer before it develops. At Applied Proteomics, they use mass spectroscopy to capture and catalog 360,000 different pieces of protein found in blood plasma, and then let supercomputers crunch on the data to identify anomalies associated with cancer. The company has raised $57 million in venture capital and is backed by Microsoft co-founder Paul Allen. You can read the first bit of the article here.

The second is from the Word Check callout, showing how access to information is making the world a better place:

wasa: Pronounced [wah-SUH]

(noun) Arabic slang: A display of partiality toward a favored person or group without regard for their qualifications. A system that drives much of life in the Middle East — from getting into a good school to landing a good job.

But on the Internet, there is no wasa.

– Adapted from Startup Rising: The Entrepreneurial Revolution Remaking the Middle East by Christopher M. Schroeder
