On some environments, disk space usage can be pretty predictable. In this post, we will see how to do a linear regression to estimate when free space will reach zero, and how to assess the quality of such regression, all using R - the statistical software environment.

The first thing we need is the data. By running a simple
`(date --utc; df -k; echo) >> /var/dflog.txt`

everyday at 00:00 by cron, we will have more than enough, as that will store the
date along with total, free and used space for all mounted devices.

On the other hand, that is not really easy to parse in R, unless we learn more about the language. In order to keep this post short, we invite the reader to use his favorite scripting language (or python) to process that into a file with the day in the first column and the occupied space in the second, and a row for each day:

YYYY-MM-DD free space YYYY-MM-DD free space (...)

This format can be read and parsed in R with a single command.

This is the data file we will use as source for the results
provided in this article. Feel free to download it and repeat the process.
All number in the file are in MB units, and we assume an HD of 500GB. We will
call the date the free space reaches 0 as the **df0**.

After running **R** in the shell prompt, we get the usual license and basic help
information.

The first step is to import the data:

> duinfo <- read.table('duinfo.dat', colClasses=c("Date","numeric"), col.names=c("day","usd")) > attach(duinfo) > totalspace <- 500000

The variable *duinfo* is now a list with two columns: *day* and *usd*. The
`attach`

command allows us to use the column names directly. The
*totalspace* variable is there just for clarity in the code.

We can check the data graphically by issuing:

> plot(usd ~ day, xaxt='n') > axis.Date(1, day, format='%F')

That gives us an idea on how predictable the usage of our hard drive is.

From our example, we get:

We can now create and take a look at our linear model object:

> model <- lm(usd ~ day) > model

Call: lm(formula = usd ~ day) Coefficients: (Intercept) day -6424661.2 466.7

The second coefficient in the example tells us that we are consuming about 559 MB of disk space per day.

We can also plot the linear model over our data:

> abline(model)

The example plot, with the line:

R provides us with a very generic command that generates statistical information
about objects: **summary**. Let's use it on our linear model objects:

> summary(model)

Call: lm(formula = usd ~ day) Residuals: Min 1Q Median 3Q Max -3612.1 -1412.8 300.7 1278.9 3301.0 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -6.425e+06 3.904e+04 -164.6 <2e-16 *** day 4.667e+02 2.686e+00 173.7 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 1697 on 161 degrees of freedom Multiple R-squared: 0.9947, Adjusted R-squared: 0.9947 F-statistic: 3.019e+04 on 1 and 161 DF, p-value: < 2.2e-16

To check the quality of a linear regression, we focus on the **residuals**, as
they represent the error of our model. We calculate them by subtracting the
expected value (from the model) from the sampled value, for every sample.

Let's see what each piece of information above means: the first is the five-number summary of the residuals. That tells us the maximum and minimum error, and that 75% of the errors are between -1.4 GB and 1.3 GB. We then get the results of a Student's t-test of the model coefficients against the data. The last column tells us roughly how probable seeing the given residuals is, assuming that the disk space does not depend on the date - it's the p-value. We usually accept an hypothesis when the p-value is less than 5%; in this example, we have a large margin for both coefficients. The last three lines of the summary give us more measures of fit: the r-squared values - the closest to 1, the better; and the general p-value from the f-statistics, less than 5% again.

In order to show how bad a linear model can be, the summary bellow was generated by using 50GB as the disk space and adding a random value between -1GB and 1GB each day:

Call: lm(formula = drand$usd ~ drand$day) Residuals: Min 1Q Median 3Q Max -1012.97 -442.62 -96.19 532.27 1025.01 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 17977.185 33351.017 0.539 0.591 drand$day 2.228 2.323 0.959 0.340 Residual standard error: 589.7 on 84 degrees of freedom Multiple R-squared: 0.01083, Adjusted R-squared: -0.0009487 F-statistic: 0.9194 on 1 and 84 DF, p-value: 0.3404

It's easy to notice that, even though the five-number summary is narrower, the p-values are greater than 5%, and the r-squared values are very far from 1. That happened because the residuals are not normally distributed.

Now that we are (hopefully) convinced that our linear model fits our data well, we can use it to predict hard-disk shortage.

Until now, we represented disk space as a function of time, creating a model that allows us to predict the used disk space given the date. But what we really want now is to predict the date our disk will be full. In order to do that, we have to invert the model. Fortunately, all statistical properties (t-tests, f-statistics) hold in the inverted model.

> model2 <- lm(day ~ usd)

We now use the **predict** function to extrapolate the model.

> predict(model2, data.frame(usd = totalspace)) 1 14837.44

But... when is that? Well, that is the numeric representation of a day in R: the number of days since 1970-01-01. To get the human-readable day, we use:

> as.Date(predict(model2, data.frame(usd = totalspace)), origin="1970-01-01") 1 "2010-08-16"

There we are: df0 will be at the above date **if** the
current pattern holds until then.

The linear model can give us the predicted hard disk space usage at any future
date, as long as collected data pattern **is linear**. If the data we collected
has a break point - some disk cleanup or software installation - the model will
not give good results. We will usually see that in the analysis, but we should
also always look at the graph.

This article is focused on teaching R basics - data input and plotting. We skip most of the formalities of science here, and linear regression is certainly not a proper df0 prediction method in the general case.

On the other hand, in the next part of this article we will see a more robust method for df0 prediction. We will also sacrifice our ability to see the used space vs time to get a statistical distribution for the date of exhaustion, which is a lot more useful in general.

- http://www.cyclismo.org/tutorial/R/index.html: R tutorial
- http://www.r-tutor.com/: An R introduction to statistics
- http://cran.r-project.org/doc/contrib/Lemon-kickstart/index.html: Kickstarting R
- http://data.princeton.edu/R/linearModels.html: "Linear models" page of Introduction to R.
- http://www.r-bloggers.com/: daily news and tutorials about R, very good to learn the language and see what people are doing with it.

On the second article, we saw how to use a Monte Carlo simulation generate sample of disk space delta for future dates and calculate the distribution probability of zeroing free space in the future.

In this article, we will see how we can plot the evolution of predicted distribution for the occupied disk space. Instead of answering que question "how likely is that my disk space will zero before date X?," we will answer "how much disk space will I need by date X, and with what probability?"

This file has the dataset we will use as example. It's the same we used in the second part. The graph below shows it:

We now import this data into R:

duinfo <- read.table('duinfospike.dat', colClasses=c("Date","numeric"), col.names=c("day","usd")) attach(duinfo) totalspace <- 450000 today <- tail(day, 1)

We then build our simulations for the next 4 months:

# Number of Monte Carlo samples numsimulations <- 10000 # Number of days to simulate numdays <- 240 # Simulate: simulate <- function(data, ndays) { delta <- diff(data) dssimtmp0 <- replicate(numsimulations, tail(data, 1)) dssimtmp <- dssimtmp0 f <- function(i) dssimtmp <<- dssimtmp + replicate(numsimulations, sample(delta, 1, replace=TRUE)) cbind(dssimtmp0, mapply(f, seq(1, ndays))) } dssim <- simulate(usd, numdays) # Future days: fday <- seq(today, today+numdays, by='day')

What king of data have we built in our simulations? Each simulation is built by sampling from the delta samples and adding to the current disk space for each day in the simulated period. We can say that each individual simulation is a possible scenario for the next 4 months. The graph bellow shows the first 5 simulations:

plot(fday, dssim[1,], ylim=c(min(dssim[1:5,]), max(dssim[1:5,])), ylab='usd', xlab='day', xaxt='n', type='l') axis.Date(1, day, at=seq(min(fday), max(fday), 'week'), format='%F') lines(fday, dssim[2,]) lines(fday, dssim[3,]) lines(fday, dssim[4,]) lines(fday, dssim[5,]) abline(h=totalspace, col='gray')

From this graph we can clearly see that the range of possible values for the used disk space grows with time. All simulations start with the same value - the used disk space for today - and grow apart as we sample from the delta pool.

We can also plot all simulations in a single graph:

plot(fday, dssim[1,], ylim=c(min(dssim), max(dssim)), ylab='usd', xlab='', xaxt='n', type='l') axis.Date(1, day, at=seq(min(fday), max(fday), 'week'), format='%F') f <- function(i) lines(fday, dssim[i,]) mapply(f, seq(2, numdays)) abline(h=totalspace, col='gray')

This plot gives us an idea of the overall spread of the data, but it fails to show the density. There are 10000 black lines there, with many of them overlapping one another.

There is another way to look at our data: we have created, for each day, a sample of the possible used disk spaces. We can take any day of the simulation and look at the density:

dssimchosen <- list(density(dssim[,5]), density(dssim[,15]), density(dssim[,45]), density(dssim[,120])) colors <- rainbow(length(dssimchosen)) xs <- c(mapply(function(d) d$x, dssimchosen)) ys <- c(mapply(function(d) d$y, dssimchosen)) plot(dssimchosen[[1]], xlab='usd', ylab='dens', xlim=c(min(xs),max(xs)), ylim=c(min(ys),max(ys)), col=colors[1], main='') lines(dssimchosen[[2]], col=colors[2]) lines(dssimchosen[[3]], col=colors[3]) lines(dssimchosen[[4]], col=colors[4]) abline(v=totalspace, col='gray') legend('top', c('5 day', '15 days', '45 days', '120 days'), fill=colors)

By looking at this graph we can see the trend:

- The curves are getting flatter: we are getting more possible values for occupied disk space.
- The curves are moving to the right: we have more simulations with higher occupied disk space values.

So far, we have seen how we can visualize some simulations along the 4 months and how we can visualize the distribution for some specific days.

We can also plot the distribution of the values for each day in the simulated 4 months. We can't use the kernel density plot or the histogram, as they use both axes, but there are other options, most of them involving some abuse of the built-in plot functions.

We can use the *boxplot* function to create a boxplot for each day in R in a
very straightforward way:

boxplot(dssim, outline=F, names=seq(today, as.Date(today+numdays), by='day'), ylab='usd', xlab='day', xaxt='n') abline(h=totalspace, col='gray')

The boxplots glued together form a shape that shows us the distribution of our simulations at any day:

- The thick line in the middle of the graph is the median
- The darker area goes from the first quartile to the third - which means that 50% of the samples are in that range
- The lighter area has the maximum and minimum points, if they are within 1.5 IQR of the upper/lower quartile. Points out of this range are considered outliers and are not plotted.

We can use the *quantile* function to calculate the values of each
quantile per day, and plot the lines:

q <- 6 f <- function(i) quantile(dssim[,i], seq(0, 1, 1.0/q)) qvals <- mapply(f, seq(1, numdays+1)) colors <- colorsDouble(rainbow, q+1) plot(fday, qvals[1,], ylab='usd', xlab='day', xaxt='n', type='l', col=colors[1], ylim=c(min(qvals), max(qvals))) mapply(function(i) lines(fday, qvals[i,], col=colors[i]), seq(2, q+1)) axis.Date(1, day, at=seq(min(fday), max(fday), 'week'), format='%F') abline(h=totalspace, col='gray')

The advantage of this type of graph over the boxplot is that it is parameterized
by *q*. This variable tells us the number of parts that we should divide our
sample in. The lines above show us the division. If *q* is odd, the middle
line is exactly the median. If *q* is 4, the lines will draw a shape similar
to that of the boxplot, the only difference being the top and bottom line, that
will include outliers - the boxplot filters outliers by using the IQR as
explained above.

In the code above, we have used the *colorsDouble* function to generate a
sequence of colors that folds in the middle:

colorsDouble <- function(colorfunc, numcolors) { colors0 <- rev(colorfunc((1+numcolors)/2)) c(colors0, rev(if (numcolors %% 2 == 0) colors0 else head(colors0, -1))) }

We can also abuse the *barplots* function to create an area graph. We have to
eliminate the bar borders, zero the distance between them and plot a white bar
from the axis to the first quartile, if appropriate:

q <- 7 f <- function(i) { qa <- quantile(dssim[,i], seq(0, 1, 1.0/q)) c(qa[1], diff(qa)) } qvals <- mapply(f, seq(1, numdays+1)) colors <- c('white', colorsDouble(rainbow, q)) barplot(qvals, ylab='usd', xlab='day', col=colors, border=NA, space=0, names.arg=seq(min(fday), max(fday), 'day'), ylim=c(min(dssim), max(dssim))) abline(h=totalspace, col='gray')

In this case, using an odd *q* makes more sense, as we want to use the same
colors for the symmetric intervals. With an even *q*, there would either be a
larger middle interval with two quantiles or a broken symmetry. The code above
builds a larger middle interval when given an even *q*.

If we increase *q* and use *heat.colors* in a quantile area plot, we get
something similar to a heat map:

q <- 25 f <- function(i) { qa <- quantile(dssim[,i], seq(0, 1, 1.0/q)) c(qa[1], mapply(function(j) qa[j] - qa[j-1], seq(2, q+1))) } qvals <- mapply(f, seq(1, numdays+1)) colors <- c('white', colorsDouble(heat.colors, q)) barplot(qvals, ylab='usd', xlab='day', col=colors, border=NA, space=0, names.arg=seq(min(fday), max(fday), 'day'), ylim=c(min(dssim), max(dssim))) abline(h=totalspace, col='gray')

We can also plot our data in the same graph as our simulations, by extending the
axis of the *barplot* and using the *points* function:

quantheatplot <- function(x, sim, ylim) { q <- 25 simstart <- length(x) - length(sim[1,]) f <- function(i) { if (i < simstart) replicate(q+1, 0) else { qa <- quantile(sim[,i-simstart], seq(0, 1, 1.0/q)) c(qa[1], diff(qa)) } } qvals <- mapply(f, seq(1, length(x))) colors <- c('white', colorsDouble(heat.colors, q)) barplot(qvals, ylab='usd', xlab='day', col=colors, border=NA, space=0, names.arg=x, ylim=ylim) abline(h=totalspace, col='gray') }

quantheatplot(c(day, seq(min(fday), max(fday), 'day')), dssim, ylim=c(min(c(usd, dssim)), max(dssim))) points(usd) abline(h=totalspace, col='gray')

Cross-validation is a technique that we can use to validate the use of Monte Carlo on our data.

We first split our data in two sets: the training set and the validation set. We than use only the first in our simulations, and plot the second over. We can then see graphically if the data fits our simulation.

Let's use the first two months as the training set, and the other three months as the validation set:

# Number of days to use in the training set numdaysTrain <- 60 numdaysVal <- length(day) - numdaysTrain dssim2 <- simulate(usd[seq(1, numdaysTrain)], numdaysVal-1)

allvals <- c(usd, dssim2) quantheatplot(day, dssim2, c(min(allvals), max(allvals))) points(usd)

Looks like using only the first two months already gives us a fair simulation. What if we used only a single month, when no disk cleanup was performed?

# Number of days to use in the training set numdaysTrain <- 30 numdaysVal <- length(day) - numdaysTrain dssim3 <- simulate(usd[seq(1, numdaysTrain)], numdaysVal-1)

allvals <- c(usd, dssim3) quantheatplot(day, dssim3, c(min(allvals), max(allvals))) points(usd)

If we do regular disk cleanups, we must have at least one of them in our training set to get realistic results. Our training set is not representative without it.

This also tests our cross-validation code. A common mistake is using the whole data set as the training set and as the validation set. That is not cross-validation.

We can use Monte Carlo simulations not only to generate a distribution probability of an event as we did in the previous article, but also to predict a possible range of future values. In this article, disk space occupation is not the most interesting example, as we are usually more interested in knowing when our used disk space will reach a certain value than in knowing the most probable values in time. But imagine that the data represents the number of miles traveled in a road trip or race. You can then not only see when you will arrive at your destination, but also the region where you will probably be at any day.

There are plenty of other uses for this kind of prediction. Collect the data, look at it and think if it would be useful to predict future ranges, and if it makes sense with the data you have. Predictions based on the evidence can be even used to support a decision or a point of view, just keep mind that you can only use the past if you honestly don't think anything different is going to happen.

]]>On the first article, we saw a quick-and-dirty method to predict disk space exhaustion when the usage pattern is rigorously linear. We did that by importing our data into R and making a linear regression.

In this article we will see the problems with that method, and deploy a more robust solution. Besides robustness, we will also see how we can generate a probability distribution for the date of disk space exhaustion instead of calculating a single day.

The linear regression used in the first article has a serious lack of robustness. That means that it is very sensitive to even single departures from the linear pattern. For instance, if we periodically delete some big files in the hard disk, we end up breaking the sample in parts that cannot be analysed together. If we plot the line given by the linear model, we can see clearly that it does not fit our overall data very well:

We can see in the graph that the linear model gives us a line that our free disk space is increasing instead of decreasing! If we use this model, we will reach the conclusion that we will never reach df0.

If we keep analysing used disk space, there is not much we can do besides discarding the data gathered before the last cleanup. There is no way to easily ignore only the cleanup.

In fact, we can only use the linear regression method when our disk consumption pattern is linear for the analysed period - and that rarely is the case when there is human intervention. We should always look at the graph to see if the model makes sense.

Instead of using the daily used disk space as input, we will use the
daily **difference** (or delta) of used disk space. By itself, this reduces a
big disk cleanup to a single outlier instead of breaking our sample. We could
then just filter out the outliers, calculate the average daily increment in used
disk space and divide the current free space by it. That would give us the
average number of days left until disk exhaustion. Well, that would also give us
some new problems to solve.

The first problem is that filtering out the outliers is neither straightforward nor recommended. Afterall, we are throwing out data that might be meaningful: it could be a regular monthly process that we should take into account to generate a better prediction.

Besides, by averaging disk consumption and dividing free disk space by it, we would still not have the probability distribution for the date, only a single value.

Instead of calculating the number of days left from the data, we will use a technique called Monte Carlo simulation to generate the distribution of days left. The idea is simple: we sample the data we have - daily used disk space - until the sum is above the free disk space; the number of samples taken is the number of days left. By doing that repeatedly, we get the set of "possible days left" with a distribution that corresponds to the data we have collected. Let's how we can do that in R.

First, let's load the data file that we will use (same one used in the introduction) along with a variable that holds the size of the disk (500GB; all units are in MB):

duinfo <- read.table('duinfospike.dat', colClasses=c("Date","numeric"), col.names=c("day","usd")) attach(duinfo) totalspace <- 500000 today <- tail(day, 1)

We now get the delta of the disk usage. Let's take a look at it:

dudelta <- diff(usd)

plot(dudelta, xaxt='n', xlab='')

The summary function gives us the five-number summary, while the boxplot shows us how the data is distributed graphically:

summary(dudelta) Min. 1st Qu. Median Mean 3rd Qu. Max. -29580.00 5.25 301.00 123.40 713.00 4136.00

boxplot(dudelta)

The kernel density plot gives us about the same, but in another visual format:

plot(density(dudelta))

We can see the cleanups right there, as the lower points.

The next step is the creation of the sample of the number of days left until exhaustion. In order to do that, we create an R function that sums values taken randomly from our delta sample until our free space zeroes, and returns the number of samples taken:

f <- function(spaceleft) { days <- 0 while(spaceleft > 0) { days <- days + 1 spaceleft <- spaceleft - sample(dudelta, 1, replace=TRUE) } days }

By repeatedly running this function and gathering the results, we generate a set of number-of-days-until-exhaustion that is robust and corresponds to the data we have observed. This robustness means that we don't even need to remove outliers, as they will not disproportionally bias out results:

freespace <- totalspace - tail(usd, 1) daysleft <- replicate(5000, f(freespace))

plot(daysleft)

What we want now is the
empirical cumulative distribution.
This function gives us the probability that we will reach df0 **before** the
given date.

df0day <- sort(daysleft + today) df0ecdfunc <- ecdf(df0day) df0prob <- df0ecdfunc(df0day)

plot(df0day, df0prob, xaxt='n', type='l') axis.Date(1, df0day, at=seq(min(df0day), max(df0day), 'year'), format='%F')

With the cumulative probability estimate, we can see when we have to start worrying about the disk by looking at the first day that the probability of df0 is above 0:

df0day[1] [1] "2010-06-13" df0ecdfunc(df0day[1]) [1] 2e-04

Well, we can also be a bit more bold and wait until the chances of reaching df0 rise above 5%:

df0day[which(df0prob > 0.05)[1]] [1] "2010-08-16"

Mix and match and see what a good convention for your case is.

This and the previous article showed how to use statistics in R to predict when free hard-disk space will zero.

The first article was main purpose was to serve as an introduction to R. There are many reasons that make linear regression an unsuitable technique for df0 prediction - the underlying process of disk consumption is certainly not linear. But, if the graph shows you that the line fits, there is no reason to ignore it.

Monte Carlo simulation, on the other hand, is a powerful and general technique.
It assumes little about the data (non-parameterized), and it can give you
probability distributions. If you want to forecast something, you can always
start recording data and use Monte Carlo in some way to make predictions
**based on the evidence**. Personally, I think we don't do this nearly as often
as we could. Well, Joel is even using it to make schedules.

- http://www.joelonsoftware.com/items/2007/10/26.html: Joel's use of Monte Carlo to make schedules.
- http://en.wikipedia.org/wiki/Bootstrapping_%28statistics%29: Wikipedia's page on bootstrapping, which is clearer than the one on Monte Carlo simulations.
- http://www.r-bloggers.com/: daily news and tutorials about R, very good to learn the language and see what people are doing with it.

Have you ever run into a bug that, no matter how careful you are trying to reproduce it, it only happens sometimes? And then, you think you've got it, and finally solved it - and tested a couple of times without any manifestation. How do you know that you have tested enough? Are you sure you were not "lucky" in your tests?

In this article we will see how to answer those questions and the math behind it without going into too much detail. This is a pragmatic guide.

The following program is supposed to generate two random 8-bit integer and print them on stdout:

#include <stdio.h> #include <fcntl.h> /* Returns -1 if error, other number if ok. */ int get_random_chars(char *r1, char*r2) { int f = open("/dev/urandom", O_RDONLY); if (f < 0) return -1; if (read(f, r1, sizeof(*r1)) < 0) return -1; if (read(f, r2, sizeof(*r2)) < 0) return -1; close(f); return *r1 & *r2; } int main(void) { char r1; char r2; int ret; ret = get_random_chars(&r1, &r2); if (ret < 0) fprintf(stderr, "error"); else printf("%d %d\n", r1, r2); return ret < 0; }

On my architecture (Linux on IA-32) it has a bug that makes it print "error" instead of the numbers sometimes.

Every time we run the program, the bug can either show up or not. It has a non-deterministic behaviour that requires statistical analysis.

We will model a single program run as a Bernoulli trial, with success defined as "seeing the bug", as that is the event we are interested in. We have the following parameters when using this model:

- \(n\): the number of tests made;
- \(k\): the number of times the bug was observed in the \(n\) tests;
- \(p\): the unknown (and, most of the time, unknowable) probability of seeing the bug.

As a Bernoulli trial, the number of errors \(k\) of running the program \(n\) times follows a binomial distribution \(k \sim B(n,p)\). We will use this model to estimate \(p\) and to confirm the hypotheses that the bug no longer exists, after fixing the bug in whichever way we can.

By using this model we are implicitly assuming that all our tests are performed independently and identically. In order words: if the bug happens more ofter in one environment, we either test always in that environment or never; if the bug gets more and more frequent the longer the computer is running, we reset the computer after each trial. If we don't do that, we are effectively estimating the value of \(p\) with trials from different experiments, while in truth each experiment has its own \(p\). We will find a single value anyway, but it has no meaning and can lead us to wrong conclusions.

Another way of thinking about the model and the strategy is by creating a physical analogy with a box that has an unknown number of green and red balls:

- Bernoulli trial: taking a single ball out of the box and looking at its color - if it is red, we have observed the bug, otherwise we haven't. We then put the ball back in the box.
- \(n\): the total number of trials we have performed.
- \(k\): the total number of red balls seen.
- \(p\): the total number of red balls in the box divided by the total number of green balls in the box.

Some things become clearer when we think about this analogy:

- If we open the box and count the balls, we can know \(p\), in contrast with our original problem.
- Without opening the box, we can estimate \(p\) by repeating the trial. As \(n\) increases, our estimate for \(p\) improves. Mathematically: \[p = \lim_{n\to\infty}\frac{k}{n}\]
- Performing the trials in different conditions is like taking balls out of several different boxes. The results tell us nothing about any single box.

Before we try fixing anything, we have to know more about the bug, starting by the probability \(p\) of reproducing it. We can estimate this probability by dividing the number of times we see the bug \(k\) by the number of times we tested for it \(n\). Let's try that with our sample bug:

$ ./hasbug 67 -68 $ ./hasbug 79 -101 $ ./hasbug error

We know from the source code that \(p=25%\), but let's pretend that we don't, as will be the case with practically every non-deterministic bug. We tested 3 times, so \(k=1, n=3 \Rightarrow p \sim 33%\), right? It would be better if we tested more, but how much more, and exactly what would be better?

Let's go back to our box analogy: imagine that there are 4 balls in the box, one red and three green. That means that \(p = 1/4\). What are the possible results when we test three times?

Red balls | Green balls | \(p\) estimate |
---|---|---|

0 | 3 | 0% |

1 | 2 | 33% |

2 | 1 | 66% |

3 | 0 | 100% |

The less we test, the smaller our precision is. Roughly, \(p\) precision will be at most \(1/n\) - in this case, 33%. That's the step of values we can find for \(p\), and the minimal value for it.

Testing more improves the precision of our estimate.

Let's now approach the problem from another angle: if \(p = 1/4\), what are the odds of seeing one error in four tests? Let's name the 4 balls as 0-red, 1-green, 2-green and 3-green:

The table above has all the possible results for getting 4 balls out of the box. That's \(4^4=256\) rows, generated by this python script. The same script counts the number of red balls in each row, and outputs the following table:

k | rows | % |
---|---|---|

0 | 81 | 31.64% |

1 | 108 | 42.19% |

2 | 54 | 21.09% |

3 | 12 | 4.69% |

4 | 1 | 0.39% |

That means that, for \(p=1/4\), we see 1 red ball and 3 green balls only 42% of the time when getting out 4 balls.

What if \(p = 1/3\) - one red ball and two green balls? We would get the following table:

k | rows | % |
---|---|---|

0 | 16 | 19.75% |

1 | 32 | 39.51% |

2 | 24 | 29.63% |

3 | 8 | 9.88% |

4 | 1 | 1.23% |

What about \(p = 1/2\)?

k | rows | % |
---|---|---|

0 | 1 | 6.25% |

1 | 4 | 25.00% |

2 | 6 | 37.50% |

3 | 4 | 25.00% |

4 | 1 | 6.25% |

So, let's assume that you've seen the bug once in 4 trials. What is the value of \(p\)? You know that can happen 42% of the time if \(p=1/4\), but you also know it can happen 39% of the time if \(p=1/3\), and 25% of the time if \(p=1/2\). Which one is it?

The graph bellow shows the discrete likelihood for all \(p\) percentual values for getting 1 red and 3 green balls:

The fact is that, *given the data*, the estimate for \(p\)
follows a beta distribution
\(Beta(k+1, n-k+1) = Beta(2, 4)\)
(1)
The graph below shows the probability distribution density of \(p\):

The R script used to generate the first plot is here, the one used for the second plot is here.

What happens when we test more? We obviously increase our precision, as it is at most \(1/n\), as we said before - there is no way to estimate that \(p=1/3\) when we only test twice. But there is also another effect: the distribution for \(p\) gets taller and narrower around the observed ratio \(k/n\):

So, which value will we use for \(p\)?

- The smaller the value of \(p\), the more we have to test to reach a given confidence in the bug solution.
- We must, then, choose the probability of error that we want to tolerate, and
take the
*smallest*value of \(p\) that we can. A usual value for the probability of error is 5% (2.5% on each side). - That means that we take the value of \(p\) that leaves 2.5% of the area of the density curve out on the left side. Let's call this value \(p_{min}\).
- That way, if the observed \(k/n\) remains somewhat constant, \(p_{min}\) will raise, converging to the "real" \(p\) value.
- As \(p_{min}\) raises, the amount of testing we have to do after fixing the bug decreases.

By using this framework we have direct, visual and tangible incentives to test more. We can objectively measure the potential contribution of each test.

In order to calculate \(p_{min}\) with the mentioned properties, we have to solve the following equation:

\[\sum_{k=0}^{k}{n\choose{k}}p_{min} ^k(1-p_{min})^{n-k}=\frac{\alpha}{2} \]

\(alpha\) here is twice the error we want to tolerate: 5% for an error of 2.5%.

That's not a trivial equation to solve for \(p_{min}\). Fortunately, that's the formula for the confidence interval of the binomial distribution, and there are a lot of sites that can calculate it:

- http://statpages.org/confint.html: \(\alpha\) here is 5%.
- http://www.danielsoper.com/statcalc3/calc.aspx?id=85: results for \(\alpha\) 1%, 5% and 10%.
- https://www.google.com.br/search?q=binomial+confidence+interval+calculator: google search.

So, you have tested a lot and calculated \(p_{min}\). The next step is fixing the bug.

After fixing the bug, you will want to test again, in order to confirm that the bug is fixed. How much testing is enough testing?

Let's say that \(t\) is the number of times we test the bug after it is fixed.
Then, if our fix is not effective and the bug still presents itself with
a probability greater than the \(p_{min}\) that we calculated, the probability
of *not* seeing the bug after \(t\) tests is:

\[\alpha = (1-p_{min})^t \]

Here, \(\alpha\) is also the probability of making a
type I error,
while \(1 - \alpha\) is the *statistical significance* of our tests.

We now have two options:

- arbitrarily determining a standard statistical significance and testing enough times to assert it.
- test as much as we can and report the achieved statistical significance.

Both options are valid. The first one is not always feasible, as the cost of each trial can be high in time and/or other kind of resources.

The standard statistical significance in the industry is 5%, we recommend either that or less.

Formally, this is very similar to a statistical hypothesis testing.

This file has the results found after running our program 5000 times. We must never throw out data, but let's pretend that we have tested our program only 20 times. The observed \(k/n\) ration and the calculated \(p_{min}\) evolved as shown in the following graph:

After those 20 tests, our \(p_{min}\) is about 12%.

Suppose that we fix the bug and test it again. The following graph shows the statistical significance corresponding to the number of tests we do:

In words: we have to test 24 times after fixing the bug to reach 95% statistical significance, and 35 to reach 99%.

Now, what happens if we test more before fixing the bug?

Let's now use all the results and assume that we tested 5000 times before fixing the bug. The graph bellow shows \(k/n\) and \(p_{min}\):

After those 5000 tests, our \(p_{min}\) is about 23% - much closer to the real \(p\).

The following graph shows the statistical significance corresponding to the number of tests we do after fixing the bug:

We can see in that graph that after about 11 tests we reach 95%, and after about 16 we get to 99%. As we have tested more before fixing the bug, we found a higher \(p_{min}\), and that allowed us to test less after fixing the bug.

We have seen that we decrease \(t\) as we increase \(n\), as that can potentially increases our lower estimate for \(p\). Of course, that value can decrease as we test, but that means that we "got lucky" in the first trials and we are getting to know the bug better - the estimate is approaching the real value in a non-deterministic way, after all.

But, how much should we test before fixing the bug? Which value is an ideal value for \(n\)?

To define an optimal value for \(n\), we will minimize the sum \(n+t\). This objective gives us the benefit of minimizing the total amount of testing without compromising our guarantees. Minimizing the testing can be fundamental if each test costs significant time and/or resources.

The graph bellow shows us the evolution of the value of \(t\) and \(t+n\) using the data we generated for our bug:

We can see clearly that there are some low values of \(n\) and \(t\) that give us the guarantees we need. Those values are \(n = 15\) and \(t = 24\), which gives us \(t+n = 39\).

While you can use this technique to minimize the total number of tests performed (even more so when testing is expensive), testing more is always a good thing, as it always improves our guarantee, be it in \(n\) by providing us with a better \(p\) or in \(t\) by increasing the statistical significance of the conclusion that the bug is fixed. So, before fixing the bug, test until you see the bug at least once, and then at least the amount specified by this technique - but also test more if you can, there is no upper bound, specially after fixing the bug. You can then report a higher confidence in the solution.

When a programmer finds a bug that behaves in a non-deterministic way, he knows he should test enough to know more about the bug, and then even more after fixing it. In this article we have presented a framework that provides criteria to define numerically how much testing is "enough" and "even more." The same technique also provides a method to objectively measure the guarantee that the amount of testing performed provides, when it is not possible to test "enough."

We have also provided a real example (even though the bug itself is artificial) where the framework is applied.

As usual, the source code of this page (R scripts, etc) can be found and downloaded in https://github.com/lpenz/lpenz.org.

]]>