If you’re running purely predictive models, and the relationships among the variables aren’t the focus, it’s much easier. Go ahead and run a stepwise regression model. Let the data give you the best prediction.

But if the point is to answer a research question that describes relationships, you’re going to have to get your hands dirty.

It’s easy to say “use theory” or “test your research question” but that ignores a lot of practical issues. Like the fact that you may have 10 different variables that all measure the same theoretical construct, and it’s not clear which one to use.

Or that you could, theoretically, make the case for all 40 demographic control variables. But when you put them all in together, all of their coefficients become nonsignificant.

So how do you do it? Like I said, it’s hard to give you step-by-step instructions because I’d need to look at the results from each step to tell you what to do next. But here are some guidelines to keep in mind.

**Remember that regression coefficients are marginal results.**

That means that the **coefficient for each predictor** is its unique effect on the outcome, over and above the other variables in the model.

So it matters what else is in the model: coefficients can change quite a bit depending on which other predictors are included.

If two or more predictors overlap in how they explain an outcome, that overlap won’t be reflected in either regression coefficient. It’s in the overall model **F statistic and the R-squared**, but not the coefficients.

**Start with univariate descriptives and graphs.**

Always, always, always start with descriptive statistics.

It will help you find any errors that you missed during cleaning (like the 99s you forgot to declare as missing values).

But more importantly, you have to know what you’re working with.

The first thing to do is univariate descriptives, or better yet, graphs. You’re not just looking for bell curves. You’re looking for interesting breaks in the middle of the distribution. Values where a huge number of points pile up. Surprising values that are much higher, or have less variation, than you expected.

Once you put these variables in the model, they may behave funny. If you know what they look like going in, you’ll have a much better understanding why.
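As a concrete sketch (hypothetical data and variable names), a quick numeric screen in Python can surface forgotten missing-value codes like the 99s mentioned above:

```python
import numpy as np

# Hypothetical example: screen one variable before modeling.
# 99 is a common "missing" sentinel that is easy to forget to recode.
rng = np.random.default_rng(42)
stress = rng.normal(20, 5, 500)
stress[:5] = 99  # a few forgotten missing-value codes

def univariate_screen(x, sentinel=99):
    """Summary stats that make sentinel values and odd shapes visible."""
    x = np.asarray(x, dtype=float)
    return {
        "n": x.size,
        "mean": x.mean(),
        "min": x.min(),
        "max": x.max(),
        "n_sentinel": int(np.sum(x == sentinel)),
    }

before = univariate_screen(stress)                 # sentinels inflate the mean
after = univariate_screen(stress[stress != 99])    # after declaring 99 missing
print(before["n_sentinel"], round(before["mean"], 1), round(after["mean"], 1))
```

The point is not the particular statistics but the habit: compare the summary before and after handling missing codes, and anything strange becomes visible before it distorts a model.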

**Next, run bivariate descriptives, again including graphs.**

You also need to understand how each potential predictor relates, on its own, to the outcome and to every other predictor.

Because the regression coefficients are marginal results (see #1), knowing the bivariate relationships among variables will give you insight into why certain variables lose significance in the bigger model.

I personally find that in addition to correlations or **crosstabs**, scatterplots of the relationship are extremely informative. This is where you can see if a relationship is really linear, or if it’s driven by a few outlying points.
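To illustrate the bivariate step, here is a minimal sketch with made-up data: two predictors that overlap heavily show a high pairwise correlation, a warning that their coefficients may shift or lose significance together in the full model:

```python
import numpy as np

# Hypothetical data: two predictors that overlap heavily (the marginal-
# results point above) plus an outcome driven by their shared component.
rng = np.random.default_rng(0)
n = 300
shared = rng.normal(size=n)
x1 = shared + rng.normal(scale=0.3, size=n)   # e.g. stress scale A
x2 = shared + rng.normal(scale=0.3, size=n)   # e.g. stress scale B
y = 2 * shared + rng.normal(size=n)

r_x1x2 = np.corrcoef(x1, x2)[0, 1]   # predictor-predictor correlation
r_x1y = np.corrcoef(x1, y)[0, 1]     # predictor-outcome correlation
print(round(r_x1x2, 2), round(r_x1y, 2))
```

Seeing r above .9 between two predictors before modeling tells you in advance that their individual coefficients in a joint model will be unstable.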

**Think about predictors in sets.**

In many of the models I’ve been working with recently, the predictors were in theoretically distinct sets. By building the models within those sets first, we were able to see how related variables worked together and then what happened once we put them together.

For example, think about a model that predicts binge drinking in college students. Potential sets of variables include:

- demographics (age, year in school, socio-economic status)
- history of mental health (diagnoses of mental illness, family history of alcoholism)
- current psychological health (stress, depression)
- social issues (feelings of isolation, connection to family, number of friends)

Often, the variables within a set are correlated, but not so much across sets. If you put everything in at once, it’s hard to find any relationships. It’s a big, overwhelming mess.

By building each set separately first, you can build theoretically meaningful models with a solid understanding of how the pieces fit together.
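The set-building idea can be sketched like this (hypothetical variables echoing the binge-drinking example; plain least squares via numpy, though any regression routine would do):

```python
import numpy as np

# Sketch of building predictor sets separately (hypothetical data):
# fit the demographic set, then the psychological-health set, then both,
# and watch how much R-squared each set contributes.
rng = np.random.default_rng(1)
n = 400
age = rng.normal(size=n)
ses = rng.normal(size=n)
stress = rng.normal(size=n)
depression = 0.5 * stress + rng.normal(scale=0.8, size=n)
drinking = 0.2 * age + 0.8 * stress + rng.normal(size=n)

def r_squared(predictors, y):
    """In-sample R-squared of an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_demo = r_squared([age, ses], drinking)
r2_psych = r_squared([stress, depression], drinking)
r2_full = r_squared([age, ses, stress, depression], drinking)
print(round(r2_demo, 2), round(r2_psych, 2), round(r2_full, 2))
```

Comparing the three R-squared values shows which set carries the explanatory weight before you ever interpret a single coefficient in the combined model.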

**Model building and interpreting results go hand-in-hand.**

Every model you run tells you a story. Stop and listen to it.

Look at the coefficients. Look at R-squared. Did it change? How much do coefficients change from a model with control variables to one without?

When you pause to do this, you can make better decisions on the model to run next.

**Any variable involved in an interaction must be in the model by itself.**

As you’re deciding what to leave in and what to boot from the model, it’s easy to get rid of everything that’s not significant.

And it’s usually a good idea to eliminate non-significant interactions first (the exception is if the interaction was central to the research question, and it’s important to show that it was not significant).

But if the interaction is significant, you can’t take out the terms for the component variables (the ones that make up the interaction). The **interpretation of the interaction** is only possible if the component terms are in the model.
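A minimal sketch of why the component terms must stay in (hypothetical data: a continuous predictor x, a group indicator g, and their product):

```python
import numpy as np

# A model with an interaction must keep both component ("main effect")
# terms; only then are the coefficients interpretable as group slopes.
rng = np.random.default_rng(7)
n = 500
x = rng.normal(size=n)
g = rng.integers(0, 2, size=n)           # 0/1 group indicator
y = 1.0 + 0.5 * x + 0.3 * g + 0.8 * x * g + rng.normal(scale=0.5, size=n)

# Full design: intercept, x, g, and x*g -- never drop x or g while x*g stays.
X = np.column_stack([np.ones(n), x, g, x * g])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# With the main effects present, the slope of y on x is beta[1] in group 0
# and beta[1] + beta[3] in group 1 -- that is exactly what the interaction
# coefficient means.
slope_g0 = beta[1]
slope_g1 = beta[1] + beta[3]
print(round(slope_g0, 2), round(slope_g1, 2))
```

Drop the `g` column and the remaining coefficients no longer have this clean per-group reading, which is the trap the guideline warns against.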

**The research question is central.**

Especially when you have a very large data set, it’s very easy to step off the yellow brick road and into the poppies. There are so many interesting relationships you can find (and they’re so shiny!). Months later, you’ve tested every possible predictor, categorized every which way. But you’re not making any real progress.

Keep the focus on your destination – **the research question**. Write it out and tape it to the wall if it helps.

All of these guidelines apply to any type of model–linear regression, ANOVA, logistic regression, mixed models. Keep them in mind the next time you’re doing statistical analysis.

This article first appeared at http://www.theanalysisfactor.com/7-guidelines-model-building/


The second personality is Professor David Hand. He is a senior research investigator and emeritus Professor of Mathematics at Imperial College, London. He is a Fellow of the British Academy and a recipient of the Guy Medal of the Royal Statistical Society, and has served twice as President of the Royal Statistical Society. His latest contribution is a book titled ‘The Improbability Principle’. Here is what he says: *“Statistics is: the fun of finding patterns in data; the pleasure of making discoveries; the import of deep philosophical questions; the power to shed light on important decisions, and the ability to guide decisions… in business, science, government, medicine, industry…”*

Indeed the use of statistics has changed over time, and its range of application now spans disciplines that it did not previously encompass. In Dr. Smith’s words, the discipline of statistics has moved from being grounded firmly in the world of measurement and scientific analysis into the world of exploration, comprehension and decision-making.

Statistical analysis should be used as a tool to generate insights from data. The data might come from any field of study and can be structured or unstructured, and the insights generated can aid comprehension or assessment of a particular relationship of interest, or decision-making in business and industry, among other uses depending on the nature of the problem at hand. Broadly speaking, a statistical analysis can be either descriptive or inferential. In descriptive statistics, the purpose is to represent or summarise the data in the simplest form possible to enable, for example, the observation of patterns and the reduction of the data into a few summary quantities that are easier to assimilate. In inferential statistics, further considerations such as sampling and random error come into play. The aim is to enable the user to draw conclusions about the population of units from which the data used in the analysis were drawn.

Because of the growing relevance of statistical analysis in many academic disciplines, there is demand for increased proficiency in analysis, both for individuals who are trained to be statisticians and those who are not. Don’t be left behind, learn something!


**NOTE TO READERS: The following has been excerpted from Principles and Practice of Structural Equation Modeling (Third Edition) by Rex B. Kline. My contribution was limited to some small editing solely for this post. – Kevin Gray**

There is ample evidence that many of us do not know the correct interpretation of outcomes of statistical tests, or p values. For example, at the end of a standard statistics course, most students know how to calculate statistical tests, but they do not typically understand what the results mean (Haller & Krauss, 2002). About 80% of psychology professors endorse at least one incorrect interpretation of statistical tests (Oakes, 1986). It is easy to find similar misinterpretations in books and articles (Cohen, 1994), so it seems that psychology students get their false beliefs from teachers and also from what students read. However, the situation is no better in other behavioral science disciplines (e.g., Hubbard & Armstrong, 2006).

Most misunderstandings about statistical tests involve overinterpretation, or the tendency to see too much meaning in statistical significance. Specifically, we tend to believe that statistical tests tell us what we want to know, but this is wishful thinking. Elsewhere I described statistical tests as a kind of collective Rorschach inkblot test for the behavioral sciences in that what we see in them has more to do with fantasy than with what is really there (Kline, 2004). Such wishful thinking is so pervasive that one could argue that much of our practice of hypothesis testing based on statistical tests is myth.

In order to better understand misinterpretations of p values, let us first deal with their correct meaning. Here it helps to adopt a frequentist perspective where probability is seen as the likelihood of an outcome over repeatable events under constant conditions except for chance (sampling error). From this view, a probability does not apply directly to a single, discrete event. Instead, probability is based on the expected relative frequency over a large number of trials, or in the long run. Also, there is no probability associated with whether or not a particular guess is correct in a frequentist perspective. The following mental exercises illustrate this point:

1. A die is thrown, and the outcome is a 2. What is the probability that this particular result is due to chance? The correct answer is not p = 1/6, or .17. This is because the probability .17 applies only in the long run to repeated throws of the die. In this case, we expect that .17 of the outcomes will be a 2. The probability that any particular outcome of the roll of a die is the result of chance is actually p = 1.00.

2. One person thinks of a number from 1 to 10. A second person guesses that number by saying, 6. What is the probability that the second person guessed right? The correct answer is not p = 1/10, or .10. This is because the particular guess of 6 is either correct or incorrect, so no probability (other than 0 for “wrong” or 1.00 for “right”) is associated with it. The probability .10 applies only in the long run after many repetitions of this game. That is, the second person should be correct about 10% of the time over all trials.

Let us now review the correct interpretation of statistical significance. You should know that the abbreviation p actually stands for the conditional relative-frequency probability, the likelihood of a sample result or one even more extreme (a range of results) assuming that the null hypothesis is true, the sampling method is random sampling, and all other assumptions for the corresponding test statistic, such as the normality requirement of the t-test, are tenable. Two correct interpretations for the specific case p < .05 are given next. Other correct definitions are probably just variations of the ones that follow:

1. Assuming that H0 is true (i.e., every result happens by chance) and the study is repeated many times by drawing random samples from the same population, less than 5% of these results will be even more inconsistent with H0 than the particular result observed in the researcher’s sample.

2. Less than 5% of test statistics from random samples are further away from the mean of the sampling distribution under H0 than the one for the observed result. That is, the odds are less than 1 to 19 of getting a result from a random sample even more extreme than the observed one.
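This long-run reading can be checked by simulation. The sketch below (a hypothetical setup: normal data, a true H0, a one-sample t-test with n = 30) shows that across many repeated samples, roughly 5% of test statistics land beyond the two-sided .05 critical value:

```python
import numpy as np

# Frequentist reading of p < .05 as a simulation (assumptions: H0 true,
# normal data, repeated random sampling). Over many replications, about
# 5% of samples give a test statistic in the rejection region.
rng = np.random.default_rng(123)
reps, n = 20000, 30
samples = rng.normal(loc=0.0, scale=1.0, size=(reps, n))  # H0: mu = 0 is true

# One-sample t statistic for every replication at once
t = samples.mean(axis=1) / (samples.std(axis=1, ddof=1) / np.sqrt(n))
t_crit = 2.045  # two-sided .05 critical value for t with 29 df
reject_rate = np.mean(np.abs(t) > t_crit)
print(round(reject_rate, 3))
```

The rejection rate hovers near .05 only over the long run of replications; no individual sample’s p value says anything probabilistic about that particular sample, which is exactly the point of the two exercises above.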

Described next are what I refer to as the “Big Five” false beliefs about p values. Three of the beliefs concern misinterpretation of p, but two concern misinterpretations of their complements, or 1 – p. Approximate base rates for some of these beliefs, reported by Oakes (1986) and Haller and Krauss (2002) in samples of psychology students and professors, are reported beginning in the next paragraph. What I believe is the biggest of the Big Five is the **odds-against-chance fallacy**, or the false belief that p indicates the probability that a result happened by chance (e.g., if p < .05, then the likelihood that the result is due to chance is < 5%).

Remember that p is estimated for a range of results, not for any particular result. Also, p is calculated assuming that H0 is true, so the probability that chance explains any individual result is already taken to be 1.0. Thus, it is illogical to view p as somehow measuring the probability of chance. I am not aware of an estimate of the base rate of the odds-against-chance fallacy, but I think that it is nearly universal in the behavioral sciences. It would be terrific if some statistical technique could estimate the probability that a particular result is due to chance, but there is no such thing.

The **local type I error fallacy** for the case p < .05 is expressed as follows: I just rejected H0 at the .05 level. Therefore, the likelihood that this particular (local) decision is wrong (a Type I error) is < 5% (70% approximate base rate among psychology students and professors). This belief is false because any particular decision to reject H0 is either correct or incorrect, so no probability (other than 0 or 1.00; i.e., right or wrong) is associated with it. It is only with sufficient replication that we could determine whether or not the decision to reject H0 in a particular study was correct.

The **inverse probability fallacy** goes like this: Given p < .05; therefore, the likelihood that the null hypothesis is true is < 5% (30% approximate base rate). This error stems from forgetting that p values are probabilities of data under H0, not the other way around. It would be nice to know the probability that either the null hypothesis or alternative hypothesis were true, but there is no statistical technique that can do so based on a single result.

Two of the Big Five concern 1 – p. One is the **replicability fallacy**, which for the case of p < .05 says that the probability of finding the same result in a replication sample exceeds .95 (40% approximate base rate). If this fallacy were true, knowing the probability of replication would be useful. Unfortunately, a p value is just the probability of the data in a particular sample under a specific null hypothesis. In general, replication is a matter of experimental design and whether some effect actually exists in the population. It is thus an empirical question and one that cannot be directly addressed by statistical tests in a particular study.

The last of the Big Five, the **validity fallacy**, refers to the false belief that the probability that H1 is true is greater than .95, given p < .05 (50% approximate base rate). The complement of p, or 1 – p, is also a probability, but it is just the probability of getting a result even less extreme under H0 than the one actually found. Again, p refers to the probability of the data, not to that of any particular hypothesis, H0 or H1. See Kline (2004, chap. 3) or Kline (2009, chap. 5) for descriptions of additional false beliefs about statistical significance.

It is pertinent to consider one last myth about statistical tests, and it is the view that the .05 and .01 levels of statistical significance, or α, are somehow universal or objective “golden rules” that apply across all studies and research areas. It is true that these levels of α are the conventional standards used today. They are generally attributed to R.A. Fisher, but he did not advocate that these values be applied across all studies (e.g., Fisher, 1956). There are ways in decision theory to empirically determine the optimal level of α given estimates of the costs of various types of decision errors (Type I vs. Type II error), but these methods are almost never used in the behavioral sciences. Instead, most of us automatically use α = .05 or α = .01 without acknowledging that these particular levels are arbitrary.

Even worse, some of us may embrace the **sanctification fallacy**, which refers to dichotomous thinking about p values that are actually continuous. If α = .05, for example, then a result where p = .049 versus one where p = .051 is practically identical in terms of statistical outcomes. However, we usually make a big deal about the first (it’s significant!) but ignore the second. (Or worse, we interpret it as a “trend” as though it was really “trying” to be significant, but fell just short.) This type of black-and-white thinking is out of proportion to continuous changes in p values.

There are other areas in SEM where we commit the sanctification fallacy. This thought from the astronomer Carl Sagan (1996) is apropos: “When we are self-indulgent and uncritical, when we confuse hopes and facts, we slide into pseudoscience and superstition” (p. 27). Let there be no superstition concerning statistical significance going forward from this point.

When the sample size is small (not large enough as defined above), or when the variance of the normal distribution is unknown (the other parameter being the mean), as is almost always the case, the **t-distribution** is the one normally used in testing hypotheses – hence the popular Student’s t-test.

Student’s **t-test** is a method of testing hypotheses about the mean of a small sample drawn from a normally distributed population when the population standard deviation is unknown. The t-distribution is a family of curves in which the number of degrees of freedom – the number of independent observations in the sample minus one – specifies a particular curve. As the sample size increases, the t-distribution approaches the symmetric bell shape of the standard normal distribution.

The first kind of null hypothesis to which the t-test can be applied states that there is no real difference between the observed sample mean and the hypothesized or stated population mean; that any measured difference is due only to chance. A **t-test** may be either two-sided/two-tailed, hypothesising that the means are not equal, or one-sided, stating that the observed mean is larger or smaller than the hypothesized mean. The test statistic is calculated as the difference between the sample and hypothesised means divided by the estimated standard error of the mean (the sample standard deviation divided by the square root of the sample size).

The second statement of the null hypothesis where the *t*-distribution or t-test is used posits that two independent random samples have the same mean, or that the difference between the two means is zero. In this scenario the t-statistic is calculated as the difference between the two estimated means divided by the standard error of the difference in means. The latter is calculated as the square root of the sum of the two sample variances, each divided by the sample size of its group.

The third use is in testing hypotheses about regression coefficients. The null hypothesis in this case is that the regression coefficient is equal to zero. In other words, the interest is in testing whether the effect (quantified by the regression coefficient) of the explanatory variable on the response variable could just as well be zero, meaning no influence. The t-statistic is calculated by dividing the estimated coefficient by its standard error. The resulting ratio tells us how many standard-error units the coefficient is away from zero.

Once the test statistic is calculated in each of the scenarios presented above, it is compared to a critical value determined by the t-distribution. If the observed t-statistic is larger (in absolute value, for a two-sided test) than the critical value, the null hypothesis is rejected. The critical value depends on the significance level of the test – the probability of erroneously rejecting the null hypothesis. In most cases a p-value is calculated from the test statistic and the appropriate t-distribution and then used as the basis for the decision.
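The first two scenarios can be sketched numerically (hypothetical data; the formulas are the ones described above, computed directly rather than through a library routine):

```python
import numpy as np

# The one-sample and two-sample t statistics on hypothetical data.
rng = np.random.default_rng(5)
a = rng.normal(loc=10.5, scale=2.0, size=25)   # sample tested against mu = 10
b = rng.normal(loc=9.5, scale=2.0, size=25)    # second independent sample

# Scenario 1: (sample mean - hypothesised mean) / standard error of the mean
t_one = (a.mean() - 10) / (a.std(ddof=1) / np.sqrt(len(a)))

# Scenario 2: difference in means divided by the square root of the sum of
# the two sample variances, each divided by its sample size
se_diff = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
t_two = (a.mean() - b.mean()) / se_diff
print(round(t_one, 2), round(t_two, 2))
```

Each statistic would then be compared against the critical value from the t-distribution with the appropriate degrees of freedom, exactly as described in the paragraph above.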


Depending on the problem at hand, there are several classes of ANOVA. For instance, if there is only one variable or factor that defines groups, e.g. age categorised into several levels such as <1, 1-5 and >=5 years, then we speak of one-way ANOVA when the means of a particular quantity are to be compared across the three levels of age. If there are two factors that group observations, the comparison of the means in those groups is done by two-way ANOVA. Other variants of ANOVA include repeated measures ANOVA, nested designs and Latin square designs.

This article describes the steps in performing one-way ANOVA. Some definitions of terms are needed to describe how the test statistic is calculated and, consequently, how the null hypothesis is tested:

**Degrees of freedom (DF):** This is the number of values in the final calculation of a statistic (e.g. a mean) that are free to vary. Imagine a set of four numbers whose mean is 10. There are many sets of four numbers with a mean of 10, but for any of them you can freely pick only the first three numbers. The fourth and last number is out of your hands as soon as you pick the first three, because it must be chosen so that the mean of the four is 10. This set is said to have 3 degrees of freedom: the number of elements in the set that are allowed to vary freely.

The DF for the variable (e.g. age group) in ANOVA is calculated by taking the number of group levels (called *k*) minus 1 (i.e. *k* – 1). The DF for *Error* is found by taking the total sample size, *N*, minus *k* (i.e. *N* – *k*). This is because we lose one degree of freedom every time we estimate each of the *k* group means. The DF for *Total* is *N* – 1; we lose one degree of freedom when we estimate the grand mean, the mean of all observations regardless of the group they are in.

**Sum of Squares (SS):** The between-group SS, or SSB, is a measure of the variation in the data between the groups. It is the sum of squared deviations from the overall mean of each of the *k* group means. The error SS, or SSE, or within-group sum of squares (SSW) is the sum of squared deviations of each observation from its group mean. The total SS, or SST, is the sum of squared deviation of each observation from the grand/overall mean. These values are additive such that SST = SSB + SSW.

**F-statistic:** This is the test statistic used for ANOVA. It is calculated as the Mean Square (MS) for the factor/variable (MSB = SSB/DF for the variable) divided by the MS of the error (MSE = SSW/DF for error). The F-statistic is thus a ratio of the variability *between* groups to the variability *within* groups. If this ratio is larger than 1, there is more variability between groups than within groups.

**p-value:** The meaning of the p-value has been discussed elsewhere. In the context of ANOVA, it is the probability of obtaining an F-statistic greater than the one observed if the null hypothesis were true. The null hypothesis is that all the group population means are equal, versus the alternative that at least one is not. This probability is obtained by comparing the calculated F-statistic to the theoretical F-distribution. If the p-value is less than the conventional critical value of 0.05, there is sufficient evidence to reject the null hypothesis. If the null hypothesis is rejected, further pairwise tests have to be conducted to determine which particular group means differ significantly. These are called post-hoc tests.
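Putting the definitions above together, a one-way ANOVA can be computed directly from the sums of squares (hypothetical data for three groups; note that SST = SSB + SSW holds by construction):

```python
import numpy as np

# One-way ANOVA computed from the definitions above, on hypothetical data
# for three age groups of 20 observations each.
rng = np.random.default_rng(9)
groups = [rng.normal(loc=m, scale=1.0, size=20) for m in (5.0, 5.5, 7.0)]
all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()

k, N = len(groups), all_obs.size
# Between-group SS: squared deviations of group means from the grand mean
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group SS: squared deviations of each observation from its group mean
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
# Total SS: squared deviations of each observation from the grand mean
sst = ((all_obs - grand_mean) ** 2).sum()

df_between, df_within = k - 1, N - k           # k - 1 and N - k
f_stat = (ssb / df_between) / (ssw / df_within)  # MSB / MSE
print(round(f_stat, 1))
```

The resulting F-statistic would then be compared against the F-distribution with (k – 1, N – k) degrees of freedom to obtain the p-value.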

For the results to be valid, some assumptions have to be checked before the **ANOVA** technique is applied. These are:

- Each level of the factor is applied to a random sample.
- The populations from which the samples are obtained must be normally distributed.
- The samples must be independent.
- The variances of the populations must be equal.

There are alternative methods or modifications of the base case ANOVA, which are applicable when some of these assumptions are violated but those are not discussed in this article.

People talk about ‘data being the new oil’, a natural resource that companies need to exploit and refine. But is this really true or are we in the realm of hype? Mohamed Zaki explains that, while many companies are already benefiting from big data, it also presents some tough challenges.

“To get value out of big data, organisations need to be able to capture, store, analyse, visualise and interpret it. None of which is straightforward”

Government agencies have announced major plans to accelerate big data research and, in 2013, according to a Gartner survey, 64% of companies said they were investing – or intending to invest – in big data technology. But Gartner also pointed out that while companies seem to be persuaded of the merits of big data, many are struggling to get value from it. The problem may be that they tend to focus on the technological aspects of data collection rather than thinking about *how* it can create value.

But big data is already creating value for some very large companies and some very small ones. Established companies in a number of sectors are using big data to improve their current business practices and services and, at the other end of the spectrum, start-ups are using it to create a whole raft of innovative products and business models.

At the Cambridge Service Alliance, in Cambridge’s Institute for Manufacturing, we work with a number of leading companies from a range of sectors and see first-hand both the opportunities and challenges associated with big data.

Take a company which makes, sells and leases its products and also provides maintenance and repair services for them. Its products contain sensors that collect vast amounts of data, allowing the company to monitor them remotely and diagnose any problems.

If this data is combined with existing operational data, advanced engineering analytics and forward-looking business intelligence, the company can offer a ‘condition-based monitoring service’, able to analyse and predict equipment faults. For the customer, unexpected downtime becomes a thing of the past, repair costs are reduced and the intervals between services increased. Intelligent analytics can even show them how to use the equipment at optimum efficiency. Original equipment manufacturers (OEMs) and dealers see this as a way of growing their parts and repairs business and increasing the sales of spare parts. It also strengthens relationships with existing customers and attracts new ones looking for a service maintenance contract.

In a completely different sector, an education revolution is under way. Big data is underpinning a new way of learning otherwise known as ‘competency-based education’, which is currently being developed in the USA. A group of universities and colleges is using data to personalise the delivery of their courses so that each student progresses at a pace that suits them, whenever and wherever they like.

In the old model, thousands of students arrive on campus at the start of the academic year and, regardless of their individual levels of attainment, work their way through their course until the point of graduation. In the new data-driven model, universities will be able to monitor and measure a student’s performance, see how long it takes them to complete particular assignments and with what degree of success. Their curriculum is tailored to take account of their preferences, their achievements and any difficulties they may have. For the students, this means a much more flexible way of working which suits their needs and gives them the opportunity to graduate more quickly. For the institutions, it means delivering better quality education and hence achieving better student outcomes, and being able to deploy their staff more efficiently and more in line with their skills and interests.

To get value out of big data, however, organisations need to be able to capture, store, analyse, visualise and interpret it. None of which is straightforward.

One of the main barriers seems to be the lack of a ‘data culture’, where data is wholly embedded in organisational thinking and practices. But companies also face a very long list of challenges to do with data management and processing.

Condition-monitoring services, for example, rely on data transmission, often using satellite systems or digital telephone systems: sometimes there simply is no coverage. Most organisations have vast amounts of data stored in different systems in a variety of formats: bringing these together in one place is difficult.

The whole issue of data ownership is problematic in a service contract environment, where the customer considers it to be their data, generated by their usage, while the service provider may consider it to be theirs as it is processed by their system.

In complex data landscapes, security – managing access to the data and creating robust audit trails – can also be a major challenge as, sometimes, is complying with the legislation around data protection. Many organisations also suffer from a lack of techniques such as data and text-mining models, which include statistical modelling, forecasting, predictive modelling and agent-based models (or optimisation simulations).

Where established organisations may find it hard to move away from their entrenched ways of doing things, start-ups have the luxury of being able to invent new business models at will. At the Cambridge Service Alliance we have also been looking at these new ways of doing things in order to understand what business models that rely on data really look like. The results should help companies of all sizes – not just start-ups – understand how big data may be able to transform their businesses. We have identified six distinct types of business model:

- **Free data collector and aggregator**: companies such as Gnip collect data from vast numbers of different, mostly free, sources then filter it, enrich it and supply it to customers in the format they want.
- **Analytics-as-a-service**: these are companies providing analytics, usually on data provided by their customers. Sendify, for example, provides businesses with real-time caller intelligence, so when a call comes in they see a whole lot of additional information relating to the caller, which helps them maximise the sales opportunity.
- **Data generation and analysis**: these could be companies that generate their own data through crowdsourcing, or through smartphones or other sensors. They may also provide analytics. Examples include GoSquared, Mixpanel and Spinnakr, which collect data by using a tracking code on their customers’ websites, analyse the data and provide reports on it using a web interface.
- **Free data knowledge discovery**: the model here is to take freely available data and analyse it. Gild, for example, helps companies recruit developers by automatically evaluating the code they publish and scoring it.
- **Data-aggregation-as-a-service**: these companies aggregate data from multiple internal sources for their customers, then present it back to them through a range of user-friendly, often highly visual interfaces. In the education sector, *Always Prepped* helps teachers monitor their students’ performance by aggregating data from multiple education programmes and websites.
- **Multi-source data mash-up and analysis**: these companies aggregate data provided by their customers with other external, mostly free data sources, and perform analytics on this data to enrich or benchmark customer data. Welovroi is a web-based digital marketing, monitoring and analysing tool that enables companies to track a large number of different metrics. It also integrates external data and allows benchmarking of the success of marketing campaigns.

So what does this tell us? That agile and innovative start-ups are creating entirely new business models based on big data and being hugely successful at it. These models can also inspire larger companies (SMEs as much as multinationals) to think about new ways in which they can capture value from their data.

But these more established companies face significant barriers to doing so and may have to deconstruct their current business models if they are to succeed. In the world of fleet and engines this could be by moving to a condition-based monitoring service or, in the education sector, delivering teaching in a completely new way. If companies can’t innovate when the opportunity arises, they may lose competitive advantage and be left struggling to ‘catch up’ with their competitors.

*Dr Mohamed Zaki is a Research Associate at the Cambridge Service Alliance, Institute for Manufacturing. Big data is one of the Cambridge Service Alliance’s core research themes.*

*This article first appeared in the IfM Review.*

– See more at: http://www.cam.ac.uk/research/discussion/is-big-data-still-big-news

**Probability and statistics** are closely related, and each depends on the other in a number of different ways. The two have traditionally been studied together, and justifiably so.

For example, consider a statistical experiment that studies how effective a drug is against a particular pathogen. After the experiment has been performed and the results tabulated, what then?

Surely, there should be something useful and tangible that comes out of the experiment. This usually takes the form of a probability. Assuming the sample size was large enough and representative of the entire population of applicability, the statistics should be able to predict the probability of the drug being effective against the pathogen if a person takes it. Thus the experimenter should be able to tell a patient: “If you take this drug, the probability that you will be cured is x%.” This shows the interrelation between probability and statistics.
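The idea above can be sketched in a few lines of Python. The trial numbers here are made up for illustration, and the interval uses a simple normal approximation, one of several ways to quantify the uncertainty around the estimated probability:

```python
import math

def cure_probability(cured, n, z=1.96):
    """Estimate the cure probability and a 95% normal-approximation interval."""
    p = cured / n                             # observed proportion cured
    se = math.sqrt(p * (1 - p) / n)           # standard error of the proportion
    return p, (p - z * se, p + z * se)

# Hypothetical trial: 180 of 240 patients were cured.
p, (lo, hi) = cure_probability(180, 240)
```

With these numbers, the experimenter could tell a patient the estimated cure probability is 75%, with an interval of roughly 70% to 80% reflecting sampling uncertainty.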

A lot of statistical analysis and experimental results depend on probability distributions that are either inherently assumed or found through the experiment. For example, in many social science experiments and indeed many experiments in general, we assume a normal distribution for the sample and population. The normal distribution is nothing but a probability distribution.

Thus the relationship between probability and statistics cuts both ways – statistical analysis makes use of probability and probability calculation makes use of statistical analysis.

In general, we want to know the chance of an event occurring. For example, what are the chances that it will rain today? The answer is quite complex and involves a lot of calculation, experimentation and observation. After all the analysis, the answer can still only be a probability, because the event is so complex that, even with the best tools available to us, it is next to impossible to predict it with certainty. Thus one can take data from satellites, data from measuring instruments, and so on, to arrive at a probability of whether it will rain today.

However, the same problem can also be approached in a different manner. Someone might look at past data and the surrounding conditions: if it hasn’t rained for many days, and temperatures have been consistently higher while humidity has been consistently lower, they might conclude that the probability of rain today is low.

These two methods can go hand in hand, and usually do. Most predictions are based not just on the bare facts but also on past trends. This is why sports analysts look at past records of how well one team has played against another, in addition to looking at the individual players and their records. A lot of predictions therefore involve statistics. Probability and statistics are thus intertwined, and much of the analysis and prediction we see daily involves both of them.

This article was originally published at: https://explorable.com/probability-and-statistics


- Specify the null hypothesis. This is the default position that is assumed to be true unless there is sufficient evidence against it. Typically the null hypothesis states that a parameter equals a specified value (often zero); in a one-tailed test it states that the parameter is greater than or equal to, or less than or equal to, that value.
- Specify the significance level. This is an arbitrary value compared to the probability value (p-value) so as to either reject or not reject the null hypothesis. It is the probability of making the wrong decision when the null hypothesis is true. Typical values are 0.05 and 0.01.
- Compute the probability value.
- Compare the probability value with the significance level. If the probability value is lower, you reject the null hypothesis; if it is higher, most scientists will consider your findings inconclusive. Note that failure to reject the null hypothesis does not constitute support for the null hypothesis; it means you do not have sufficient evidence from the data to reject it. Also, the lower the probability value, the more confident you can be that the null hypothesis is false, which is why some researchers prefer to interpret results on a continuous scale rather than as a binary reject / do-not-reject decision.
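The four steps above can be sketched for a simple case, a two-sided one-sample z-test with the population standard deviation assumed known. The sample values, the hypothesised mean and the assumed standard deviation are all illustrative, not from any real study:

```python
import math

def z_test(sample, mu0, sigma):
    """Two-sided one-sample z-test; sigma is the (assumed known) population sd."""
    n = len(sample)
    mean = sum(sample) / n
    z = (mean - mu0) / (sigma / math.sqrt(n))         # standardised test statistic
    phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))  # standard normal CDF
    p = 2 * (1 - phi(abs(z)))                          # two-sided p-value
    return z, p

sample = [5.1, 4.9, 5.3, 5.2, 4.8, 5.4, 5.0, 5.2]
z, p = z_test(sample, mu0=5.0, sigma=0.2)
alpha = 0.05
decision = "reject H0" if p < alpha else "fail to reject H0"
```

Here the p-value is compared with the pre-specified significance level to reach the (binary) decision, while the p-value itself carries the continuous measure of evidence discussed above.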

There are some criticisms of statistical hypothesis testing that you should keep in mind when undertaking the procedure. Significance tests tell you the probability of getting a result as extreme as that observed, given that the null hypothesis is true. But what you want to know is how likely it is that the null hypothesis is true, given the observed results. Dichotomising the decision has also been criticised in research contexts where a simple reject / do-not-reject decision is not appropriate.

Confusion also arises between statistical and substantive significance. The importance of a result depends on the size of the effect observed or estimated, and on whether it can be replicated. Statistical significance does not necessarily mean that the observed effect has contextual significance, e.g. biological significance, but these two concepts are often confused. Statistical tests of significance also almost always require making distributional assumptions about the test statistic in question, which may be violated, and rely on strictly random sampling.

As a guide, and to address some of the issues above, it is always advisable to:

- report parameter estimates with an indication of their variability, such as standard errors or confidence intervals;
- not use the term “significance” loosely when referring to statistical significance;
- report the methods and results in such a way that they can be usefully interpreted even without the significance tests.

]]>

**Big data** refers to data sets so large or complex that traditional data processing applications are inadequate. It is characterised by big volume, high velocity (it accrues fast, e.g. transaction data) and variety (it can be structured or unstructured, e.g. videos, emails); this is what is sometimes referred to as the three Vs of big data. The challenges mainly include analysis, capture, storage and visualisation.

**Data analysis** refers to the extensive use of statistics, with or without the aid of computerized tools, to gain insights/knowledge from the data.

**Data analytics** is a discipline rather than a tool. It uses data analysis and other data science tools to recommend actions or aid decision-making; it is thus concerned with the whole process of analysis to insights generated to decisions being made from the insights.

**Big data analytics** is the process of examining big data to uncover hidden patterns and other useful information that can be used to make better decisions in the application context, which is mostly a business environment.

**Why are big data analysis and analytics essential?**

First, they provide business intelligence through standard and ad hoc business reports, which might answer questions such as why consumers behave the way they do and which individual consumer factors are associated with particular product choices or purchase preferences. Big data analytics can also be proactive, through approaches such as optimisation, predictive modelling and forecasting, thus aiding decision-making for the future. Big data analytics also brings efficiency to managing the huge volume and variety of data that businesses have to deal with: it lets you extract only the relevant information from terabytes of potential data in an efficient and speedy manner.

**What are some of the tools available for big data analytics?**

Due to the sheer volume, variety and velocity associated with big data, traditional analysis tools based on relational databases are limited in their capacity for big data analysis and analytics. Institutions that need to analyse big data are adopting customised technologies, developed or in development, that offer a platform for big data analytics.

Some of the technologies in place include:

Apache Hadoop: an open-source data processing platform and framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Related tools in the Hadoop ecosystem include YARN, MapReduce, Spark, Hive and Pig.
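The MapReduce programming model that underpins Hadoop can be illustrated with a toy, single-machine word-count sketch in plain Python. This is not the Hadoop API; the function names and sample documents are invented purely to show the map, shuffle and reduce phases:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for each word in the document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group intermediate values by key (word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data has velocity"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))  # e.g. counts["big"] == 2
```

In a real Hadoop cluster, the map and reduce functions run in parallel across many machines, and the framework handles the shuffle, fault tolerance and data distribution.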

Cloudera Enterprise: its aim is to help users become information-driven by leveraging the best of the open-source community together with the enterprise capabilities they need to succeed with Apache Hadoop in their organisation. It is designed specifically for mission-critical environments and includes CDH, the world’s most popular open-source Hadoop-based platform, as well as advanced system management and data management tools.

Hortonworks Data Platform (HDP): is an enterprise-grade data management platform that enables a centralized architecture for running batch, interactive and real-time analytics and data processing applications simultaneously across a common shared dataset. It is also built on Apache Hadoop, powered by YARN.

There are a number of ways in which data analysts deal with missing data in longitudinal studies, and indeed in other study designs:

**Complete case analysis:** In this option, subjects/cases without complete information are dropped from the analysis sample. This approach results in a loss of information, because the partly complete information of some subjects is discarded, and it may introduce bias into the estimates of the model coefficients if the data are not missing completely at random.

**Last-Observation-Carried-Forward (LOCF):** This method can only be applied in a longitudinal study. The missing values, for each individual/case, are replaced by the last observed value of the variable. This way of dealing with missing values has been discouraged in the recent literature: the means and precision measures such as the variance can be biased, leading to wrong inferences. We advise against using this approach.
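LOCF is mechanically simple, which is part of its appeal despite the problems above. A minimal sketch, using `None` to stand for a missing observation (the values are illustrative):

```python
def locf(series):
    """Forward-fill missing values (None) with the last observed value."""
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v          # remember the most recent observation
        filled.append(last)   # stays None until a first value is seen
    return filled

locf([4.2, None, None, 5.0, None])  # → [4.2, 4.2, 4.2, 5.0, 5.0]
```

Note that values missing before the first observation remain missing, and every gap is filled with a value that may be badly out of date, which is exactly how the bias described above arises.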

**Mean imputation:** Under mean imputation, the missing values in a variable are replaced by the mean of that variable’s non-missing observations. It preserves the mean (the mean in the data won’t be biased) but does not preserve the relationships between variables; it can shrink or inflate the correlations between the variables being studied. This approach also fails to account for the uncertainty in the imputed values, since it adds no variance from the imputation, and is hence less preferred than techniques such as multiple imputation that do account for that uncertainty.
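A minimal sketch of mean imputation, again with `None` standing in for missing values and illustrative numbers. Because every gap gets the same value, the imputed variable has artificially low variance, which is one source of the distorted relationships mentioned above:

```python
def mean_impute(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = sum(observed) / len(observed)
    return [m if v is None else v for v in values]

mean_impute([2.0, None, 4.0, None, 6.0])  # observed mean is 4.0
```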

**Hot-deck imputation:** In this method, each missing value is replaced with an observed response from a similar unit in the same sample dataset. There are several ways of implementing hot-deck imputation: for example, randomly picking an observed response from the set of cases that are similar to the case for which the imputation is needed, or taking the mean of the variable among the set of similar cases. This article provides a detailed review of the various hot-deck imputation techniques. How well this technique preserves the relationships between variables differs according to the specific variant chosen.

**Expectation-maximisation (EM):** This iterative procedure uses other variables to impute an expected value (expectation step), then checks whether that is the most likely value (maximisation step). The EM algorithm preserves the relationships with other variables, a feature that is important in regression analysis. However, it understates standard errors and should only be used when the extent of missingness is small, for instance when the proportion of missing values is no more than 5%.

**Multiple imputation:** This approach has three stages. First, multiple copies of the dataset are generated, with the missing values replaced by imputed values sampled from their predictive distribution given the observed data. Next, standard statistical methods are used to fit the model of interest to each of the imputed datasets. Lastly, the parameter estimates from the imputed datasets are pooled to provide a single estimate for each parameter of interest. The standard errors of these pooled estimates are calculated using rules that take account of the variability between the imputed datasets. Valid inferences are obtained because results are averaged over the distribution of the missing data given the observed data. Nonetheless, there are pitfalls in multiple imputation that analysts should be aware of when contemplating this approach.
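The pooling stage is commonly done with Rubin's rules, which the standard-error description above refers to. A sketch, assuming hypothetical per-dataset estimates and variances from `m` imputed datasets:

```python
import math

def pool_estimates(estimates, variances):
    """Pool per-dataset estimates and variances using Rubin's rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m                 # pooled point estimate
    w = sum(variances) / m                     # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation
    total_var = w + (1 + 1 / m) * b            # total variance of pooled estimate
    return q_bar, math.sqrt(total_var)         # estimate and its standard error

# Hypothetical coefficient estimates and variances from m = 3 imputed datasets.
q, se = pool_estimates([1.2, 1.5, 1.1], [0.04, 0.05, 0.04])
```

The between-imputation term is what inflates the standard error to reflect the uncertainty due to the missing data, which is exactly the uncertainty that mean imputation and EM ignore.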
