I’ve heard about a number of consulting companies that decided to use a simple linear model instead of a black-box model with higher performance, because “the client wants to understand the factors that drive the prediction”.
And usually the discussion goes as follows: “We have tried LIME for our black-box model; it is great, but it is not working in our case.” “Have you tried other explainers?” “What other explainers?”
So here you have a map of different visual explanations for black-box models. Choose one in (on average) less than three simple steps.
These are available in the DALEX package. Feel free to propose other visual explainers that should be added to this map (and the package).
Every inch of sky’s got a star
Every inch of skin’s got a scar
(Everything Now, Arcade Fire)
I think that a very good way to start with R is doing an interactive visualization of some open data, because you will train many important skills of a data scientist: loading, cleaning, transforming and combining data, and producing a suitable visualization. Doing it interactively will give you an idea of the power of R as well, because you will realise that you are able to handle other programming languages, such as JavaScript, indirectly.
That’s precisely what I’ve done today. I combined two interesting datasets:
Apart from using dplyr to manipulate data and highcharter to do the visualization, I used the tabulizer package to extract the table of probabilities of computerisation from the PDF: it makes this task extremely easy.
This is the resulting plot:
If you want to examine it in depth, here you have a full size version.
These are some of my insights (the corresponding figures are obtained directly from the dataset):
The code of this experiment is available here.
We’ve released the newest version of NIMBLE on CRAN and on our website.
Version 0.6-11 has important new features, notably support for Bayesian nonparametric mixture modeling, and more are on the way in the next few months.
New features include:
Please see the NEWS file in the installed package for more details.
In this post, I will illustrate the use of prediction intervals for the comparison of measurement methods. In the example, a new spectral method for measuring whole blood hemoglobin is compared with a reference method. But first, let's start by discussing the large difference between a confidence interval and a prediction interval.
Very often a confidence interval is misinterpreted as a prediction interval, leading to unrealistically “precise” predictions. As you will see, prediction intervals (PI) resemble confidence intervals (CI), but the width of the PI is by definition larger than the width of the CI.
Let’s assume that we measure the whole blood hemoglobin concentration in a random sample of 100 persons. We obtain the estimated mean (Est_mean), the limits of the confidence interval (CI_Lower and CI_Upper) and the limits of the prediction interval (PI_Lower and PI_Upper). (The R code to do this is in the next paragraph.)
Est_mean | CI_Lower | CI_Upper | PI_Lower | PI_Upper |
---|---|---|---|---|
140 | 138 | 143 | 113 | 167 |
A confidence interval (CI) is an interval of good estimates of the unknown true population parameter. For a 95% confidence interval for the mean, we can state that if we were to repeat our sampling process infinitely often, 95% of the constructed confidence intervals would contain the true population mean. In other words, there is a 95% chance of selecting a sample such that the 95% confidence interval calculated from that sample contains the true population mean.
Interpretation of the 95% confidence interval in our example:
A prediction interval (PI) is an estimate of an interval in which a future observation will fall, with a certain confidence level, given the observations that were already observed. For a 95% prediction interval, we can state that if we were to repeat our sampling process infinitely often, 95% of the constructed prediction intervals would contain the new observation.
Interpretation of the 95% prediction interval in the above example:
Remark: very often we will read the interpretation “The whole blood hemoglobin concentration of a new sample will be between 113 g/L and 167 g/L with a probability of 95%.” (for example on Wikipedia). This interpretation is only correct in the theoretical situation where the parameters (true mean and standard deviation) are known.
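The difference between the two interpretations can be checked by simulation. The sketch below is my addition (using the same mean and standard deviation as in the example): it repeatedly draws samples of n = 100, counts how often the CI captures the true mean and how often the PI captures one new observation, and both proportions should land near 95%.

```r
# Coverage check by simulation (illustration; parameters as in the example).
set.seed(1)
true_mean <- 139; true_sd <- 14.75; n <- 100; B <- 2000
ci_hits <- 0; pi_hits <- 0
for (i in seq_len(B)) {
  s  <- rnorm(n, true_mean, true_sd)
  m  <- mean(s)
  tq <- qt(0.975, n - 1)
  ci_lim <- m + c(-1, 1) * tq * sd(s) / sqrt(n)        # CI for the mean
  pi_lim <- m + c(-1, 1) * tq * sd(s) * sqrt(1 + 1/n)  # PI for one new observation
  ci_hits <- ci_hits + (ci_lim[1] < true_mean && true_mean < ci_lim[2])
  new_obs <- rnorm(1, true_mean, true_sd)
  pi_hits <- pi_hits + (pi_lim[1] < new_obs && new_obs < pi_lim[2])
}
c(CI_coverage = ci_hits / B, PI_coverage = pi_hits / B)  # both close to 0.95
```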
First, let's simulate some data. The sample size in the plot above was (n=100). Now, to see the effect of the sample size on the width of the confidence interval and the prediction interval, let's take a “sample” of 400 hemoglobin measurements using the same parameters:
set.seed(123)
hemoglobin<-rnorm(400, mean = 139, sd = 14.75)
df<-data.frame(hemoglobin)
Although we don't need a linear regression yet, I'd like to use the lm() function, which makes it very easy to construct a confidence interval (CI) and a prediction interval (PI). We can estimate the mean by fitting a “regression model” with an intercept only (no slope). The default confidence level is 95%.
CI<-predict(lm(df$hemoglobin~ 1), interval="confidence")
CI[1,]
## fit lwr upr
## 139.2474 137.8425 140.6524
The CI object has a length of 400. But since there is no slope in our “model”, each row is exactly the same.
PI<-predict(lm(df$hemoglobin~ 1), interval="predict")
## Warning in predict.lm(lm(df$hemoglobin ~ 1), interval = "predict"): predictions on current data refer to _future_ responses
PI[1,]
## fit lwr upr
## 139.2474 111.1134 167.3815
We get a “warning” that “predictions on current data refer to future responses”. That's exactly what we want, so no worries there. As you can see, the column names of the objects CI and PI are the same. Now, let's visualize the confidence and the prediction interval.
The code below is not very elegant but I like the result (tips are welcome :-))
library(ggplot2)
limits_CI <- aes(x=1.3 , ymin=CI[1,2], ymax =CI[1,3])
limits_PI <- aes(x=0.7 , ymin=PI[1,2], ymax =PI[1,3])
PI_CI<-ggplot(df, aes(x=1, y=hemoglobin)) +
geom_jitter(width=0.1, pch=21, fill="grey", alpha=0.5) +
geom_errorbar (limits_PI, width=0.1, col="#1A425C") +
geom_point (aes(x=0.7, y=PI[1,1]), col="#1A425C", size=2) +
geom_errorbar (limits_CI, width=0.1, col="#8AB63F") +
geom_point (aes(x=1.3, y=CI[1,1]), col="#8AB63F", size=2) +
scale_x_continuous(limits=c(0,2))+
scale_y_continuous(limits=c(95,190))+
theme_bw()+ylab("Hemoglobin concentration (g/L)") +
xlab(NULL)+
geom_text(aes(x=0.6, y=160, label="Prediction\ninterval",
hjust="right", cex=2), col="#1A425C")+
geom_text(aes(x=1.4, y=140, label="Confidence\ninterval",
hjust="left", cex=2), col="#8AB63F")+
theme(legend.position="none",
axis.text.x = element_blank(),
axis.ticks.x = element_blank())
PI_CI
The width of the confidence interval is very small, now that we have this large sample size (n=400). This is not surprising, as the estimated mean is the only source of uncertainty. In contrast, the width of the prediction interval is still substantial. The prediction interval has two sources of uncertainty: the estimated mean (just like the confidence interval) and the random variance of new observations.
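To make these two sources of uncertainty explicit, here is a sketch (my addition) that rebuilds both intervals by hand and checks them against predict(): the CI uses only the standard error of the mean, s·sqrt(1/n), while the PI adds the variance of a single new observation, s·sqrt(1 + 1/n).

```r
# Manual reconstruction of the CI and PI for an intercept-only model.
set.seed(123)
hemoglobin <- rnorm(400, mean = 139, sd = 14.75)
n  <- length(hemoglobin)
m  <- mean(hemoglobin)
s  <- sd(hemoglobin)
tq <- qt(0.975, n - 1)
ci_manual <- m + c(-1, 1) * tq * s * sqrt(1/n)      # uncertainty of the mean only
pi_manual <- m + c(-1, 1) * tq * s * sqrt(1 + 1/n)  # plus variance of a new observation
fit   <- lm(hemoglobin ~ 1)
ci_lm <- predict(fit, interval = "confidence")[1, c("lwr", "upr")]
pi_lm <- suppressWarnings(predict(fit, interval = "prediction"))[1, c("lwr", "upr")]
all.equal(ci_manual, unname(ci_lm))  # TRUE
all.equal(pi_manual, unname(pi_lm))  # TRUE
```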
A prediction interval can be useful in the case where a new method should replace a standard (or reference) method.
If we can predict well enough what the measurement by the reference method would be (given the new method), then the two methods give similar information and the new method can be used.
For example, in (Tian, 2017) a new spectral method (near-infrared) to measure hemoglobin is compared with a gold standard. In contrast to the gold-standard method, the new spectral method does not require reagents. Moreover, the new method is faster. We will investigate whether we can predict well enough, based on the concentration measured by the new method, what the measurement by the gold standard would be. (Note: the measured concentrations presented below are fictitious.)
Hb<- read.table("http://rforbiostatistics.colmanstatistics.be/wp-content/uploads/2018/06/Hb.txt",
header = TRUE)
kable(head(Hb))
New | Reference |
---|---|
84.96576 | 87.24013 |
99.91483 | 103.38143 |
111.79984 | 116.71593 |
116.95961 | 116.72065 |
118.11140 | 113.51541 |
118.21411 | 121.70586 |
plot(Hb$New, Hb$Reference,
xlab="Hemoglobin concentration (g/L) - new method",
ylab="Hemoglobin concentration (g/L) - reference method")
Now, let's fit a linear regression model predicting the hemoglobin concentrations measured by the reference method, based on the concentrations measured with the new method.
fit.lm <- lm(Hb$Reference ~ Hb$New)
plot(Hb$New, Hb$Reference,
xlab="Hemoglobin concentration (g/L) - new method",
ylab="Hemoglobin concentration (g/L) - reference method")
# Adding the regression line:
abline (a=fit.lm$coefficients[1], b=fit.lm$coefficients[2] )
# Adding the identity line:
abline (a=0, b=1, lty=2)
If both measurement methods corresponded exactly, the intercept would be zero and the slope would be one (the “identity line”, dotted). Now, let's calculate the confidence interval for this linear regression.
CI_ex <- predict(fit.lm, interval="confidence")
colnames(CI_ex)<- c("fit_CI", "lwr_CI", "upr_CI")
And the prediction interval:
PI_ex <- predict(fit.lm, interval="prediction")
## Warning in predict.lm(fit.lm, interval = "prediction"): predictions on current data refer to _future_ responses
colnames(PI_ex)<- c("fit_PI", "lwr_PI", "upr_PI")
We can combine those results in one data frame and plot both the confidence interval and the prediction interval.
Hb_results<-cbind(Hb, CI_ex, PI_ex)
kable(head(round(Hb_results),1))
New | Reference | fit_CI | lwr_CI | upr_CI | fit_PI | lwr_PI | upr_PI |
---|---|---|---|---|---|---|---|
85 | 87 | 91 | 87 | 94 | 91 | 82 | 99 |
Visualizing the CI (in green) and the PI (in blue):
plot(Hb$New, Hb$Reference,
xlab="Hemoglobin concentration (g/L) - new method",
ylab="Hemoglobin concentration (g/L) - reference method")
Hb_results_s <- Hb_results[order(Hb_results$New),]
lines (x=Hb_results_s$New, y=Hb_results_s$fit_CI)
lines (x=Hb_results_s$New, y=Hb_results_s$lwr_CI,
col="#8AB63F", lwd=1.2)
lines (x=Hb_results_s$New, y=Hb_results_s$upr_CI,
col="#8AB63F", lwd=1.2)
lines (x=Hb_results_s$New, y=Hb_results_s$lwr_PI,
col="#1A425C", lwd=1.2)
lines (x=Hb_results_s$New, y=Hb_results_s$upr_PI,
col="#1A425C", lwd=1.2)
abline (a=0, b=1, lty=2)
In (Bland, Altman 2003) it is proposed to calculate the average width of this prediction interval and see whether this is acceptable. Another approach is to compare the calculated PI with an “acceptance interval”: if the PI is inside the acceptance interval for the measurement range of interest, the agreement is acceptable (see Francq, 2016).
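A quick sketch of the average-width check (my addition; the data here are simulated stand-ins rather than the Hb dataset): compute the prediction interval over the observed range and average its width, then judge whether that width is clinically acceptable.

```r
# Average width of the prediction interval, on made-up data.
set.seed(42)
new_method <- runif(30, 80, 180)              # fictitious concentrations, g/L
reference  <- new_method + rnorm(30, 0, 4)
fit <- lm(reference ~ new_method)
pred_int  <- suppressWarnings(predict(fit, interval = "prediction"))
avg_width <- mean(pred_int[, "upr"] - pred_int[, "lwr"])
avg_width  # average PI width, in g/L
```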
In the above example, both methods do have the same measurement scale (g/L), but the linear regression with prediction interval is particularly useful when the two methods of measurement have different units.
However, the method has some disadvantages:
In contrast to Ordinary Least Square (OLS) regression, Bivariate Least Square (BLS) regression takes into account the measurement errors of both methods (the New method and the Reference method). Interestingly, prediction intervals calculated with BLS are not affected when the axes are switched (del Rio, 2001).
In 2017, a new R-package BivRegBLS was released. It offers several methods to assess the agreement in method comparison studies, including Bivariate Least Square (BLS) regression.
If the data are unreplicated but the variance of the measurement error of the methods is known, the BLS() and XY.plot() functions can be used to fit a bivariate least squares regression line and the corresponding confidence and prediction intervals.
library (BivRegBLS)
Hb.BLS = BLS (data = Hb, xcol = c("New"),
ycol = c("Reference"), var.y=10, var.x=8, conf.level=0.95)
XY.plot (Hb.BLS,
yname = "Hemoglobin concentration (g/L) - reference method",
xname = "Hemoglobin concentration (g/L) - new method",
graph.int = c("CI","PI"))
Now we would like to decide whether the new method can replace the reference method. We allow the methods to differ up to a given threshold that is not clinically relevant. Based on this threshold an “acceptance interval” is created. Suppose that differences up to 10 g/L (the threshold) are not clinically relevant; then the acceptance interval can be defined as Y = X ± Δ, with Δ equal to 10. If the PI is inside the acceptance interval for the measurement range of interest, then the two measurement methods can be considered interchangeable (see Francq, 2016).
The accept.int argument of the XY.plot() function allows for a visualization of this acceptance interval.
XY.plot (Hb.BLS,
yname = "Hemoglobin concentration (g/L) - reference method",
xname = "Hemoglobin concentration (g/L) - new method",
graph.int = c("CI","PI"),
accept.int=10)
For the measurement region 120 g/L to 150 g/L, we can conclude that the difference between both methods is acceptable. If the measurement regions below 120 g/L and above 150 g/L are important, the new method cannot replace the reference method.
In method comparison studies, it is advised to create replicates (2 or more measurements of the same sample with the same method). An example of such a dataset:
Hb_rep <- read.table("http://rforbiostatistics.colmanstatistics.be/wp-content/uploads/2018/06/Hb_rep.txt",
header = TRUE)
kable(head(round(Hb_rep),1))
New_rep1 | New_rep2 | Ref_rep1 | Ref_rep2 |
---|---|---|---|
88 | 95 | 90 | 84 |
When replicates are available, the variances of the measurement errors are calculated for both the new and the reference method and used to estimate the prediction interval. Again, the BLS() function and the XY.plot() function are used to estimate and plot the BLS regression line and the corresponding CI and PI.
Hb_rep.BLS = BLS (data = Hb_rep,
xcol = c("New_rep1", "New_rep2"),
ycol = c("Ref_rep1", "Ref_rep2"),
qx = 1, qy = 1,
conf.level=0.95, pred.level=0.95)
XY.plot (Hb_rep.BLS,
yname = "Hemoglobin concentration (g/L) - reference method",
xname = "Hemoglobin concentration (g/L) - new method",
graph.int = c("CI","PI"),
accept.int=10)
It is clear that the prediction interval is not inside the acceptance interval here. The new method cannot replace the reference method. A possible solution is to average the repeats. The BivRegBLS package can create prediction intervals for the mean of (2 or more) future values, too! More information in this presentation (presented at useR!2017).
In the plot above, averages of the two replicates are calculated and plotted. I'd like to see the individual measurements:
plot(x=c(Hb_rep$New_rep1, Hb_rep$New_rep2),
y=c(Hb_rep$Ref_rep1, Hb_rep$Ref_rep2),
xlab="Hemoglobin concentration (g/L) - new method",
ylab="Hemoglobin concentration (g/L) - reference method")
lines (x=as.numeric(Hb_rep.BLS$Pred.BLS[,1]),
y=as.numeric(Hb_rep.BLS$Pred.BLS[,2]),
lwd=2)
lines (x=as.numeric(Hb_rep.BLS$Pred.BLS[,1]),
y=as.numeric(Hb_rep.BLS$Pred.BLS[,3]),
col="#8AB63F", lwd=2)
lines (x=as.numeric(Hb_rep.BLS$Pred.BLS[,1]),
y=as.numeric(Hb_rep.BLS$Pred.BLS[,4]),
col="#8AB63F", lwd=2)
lines (x=as.numeric(Hb_rep.BLS$Pred.BLS[,1]),
y=as.numeric(Hb_rep.BLS$Pred.BLS[,5]),
col="#1A425C", lwd=2)
lines (x=as.numeric(Hb_rep.BLS$Pred.BLS[,1]),
y=as.numeric(Hb_rep.BLS$Pred.BLS[,6]),
col="#1A425C", lwd=2)
abline (a=0, b=1, lty=2)
The predict() and lm() functions of R: Chambers, J. M. and Hastie, T. J. (1992) Statistical Models in S. Wadsworth & Brooks/Cole.
Many types of machine learning classifiers, not least commonly-used techniques like ensemble models and neural networks, are notoriously difficult to interpret. If the model produces a surprising label for any given case, it's difficult to answer the question, "why that label, and not one of the others?".
One approach to this dilemma is the technique known as LIME (Local Interpretable Model-Agnostic Explanations). The basic idea is that while for highly non-linear models it's impossible to give a simple explanation of the relationship between any one variable and the predicted classes at a global level, it might be possible to assess which variables are most influential on the classification at a local level, near the neighborhood of a particular data point. A procedure for doing so is described in this 2016 paper by Ribeiro et al., and implemented in the R package lime by Thomas Lin Pedersen and Michael Benesty (a port of the Python package of the same name).
You can read about how the lime package works in the introductory vignette Understanding Lime, but this limerick by Mara Averick also sums things up nicely:
There once was a package called lime,
Whose models were simply sublime,
It gave explanations for their variations,
One observation at a time.
"One observation at a time" is the key there: given a prediction (or a collection of predictions) it will determine the variables that most support (or contradict) the predicted classification.
The lime package also works with text data: for example, you may have a model that classifies the sentiment of a paragraph of text as “negative”, “neutral” or “positive”. In that case, lime will determine the words in that paragraph which are most important to determining (or contradicting) the classification. The package also helpfully provides a shiny app making it easy to test out different sentences and see the local effect of the model.
To learn more about the lime algorithm and how to use the associated R package, a great place to get started is the tutorial Visualizing ML Models with LIME from the University of Cincinnati Business Analytics R Programming Guide. The lime package is available on CRAN now, and you can always find the latest version at the GitHub repository linked below.
GitHub (thomasp): lime (Local Interpretable Model-Agnostic Explanations)
R Tip: be wary of “...”.
The following code example contains an easy error in using the R function unique().
vec1 <- c("a", "b", "c")
vec2 <- c("c", "d")
unique(vec1, vec2)
# [1] "a" "b" "c"
Notice none of the novel values from vec2 are present in the result. Our mistake was: we (improperly) tried to use unique() with multiple value arguments, as one would use union(). Also notice no error or warning was signaled. We used unique() incorrectly and nothing pointed this out to us. What compounded our error was R’s “...” function signature feature.
In this note I will talk a bit about how to defend against this kind of mistake. I am going to apply the principle that a design that makes committing mistakes more difficult (or even impossible) is a good thing, and not a sign of carelessness, laziness, or weakness. I am well aware that every time I admit to making a mistake (I have indeed made the above mistake) those who claim to never make mistakes have a laugh at my expense. Honestly I feel the reason I see more mistakes is I check a lot more.
Data science coding is often done in a rush (deadlines, wanting to see results, and so on). Instead of moving fast, let’s take the time to think a bit about programming theory using a very small concrete issue. This lets us show how one can work a bit safer (saving time in the end), without sacrificing notational power or legibility.
A confounding issue is: unique() failed to alert me of my mistake because unique()’s function signature (like so many R functions) includes a “...” argument. I might have been careless or in a rush, but it seems like unique() was also in a rush and did not care to supply argument inspection.
In R a function that includes a “...” in its signature will accept arbitrary arguments and values in addition to the ones explicitly declared and documented. There are three primary uses of “...” in R: accepting unknown arguments that are to be passed to later functions, building variadic functions, and forcing later arguments to be bound by name (my favorite use). Unfortunately, “...” is also often used to avoid documenting arguments, and it turns off a lot of very useful error checking.
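To see how “...” turns off error checking, consider base R’s mean(), whose signature is mean(x, ...): a misspelled argument name is silently absorbed into “...” instead of being flagged.

```r
# "trm" is not a prefix of "trim", so it does not match any formal
# argument; it lands in "..." and is silently ignored.
x <- c(1:9, 100)
mean(x)              # [1] 14.5
mean(x, trim = 0.1)  # [1] 5.5   (correct trimmed mean: drops 1 and 100)
mean(x, trm = 0.1)   # [1] 14.5  (typo swallowed by "..."; no error, no warning)
```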
An example of the “accepting unknown arguments” use is lapply(), which passes what it finds in “...” on to whatever function it is working with. For example:
lapply(c("a", "b"), paste, "!", sep = "")
# [[1]]
# [1] "a!"
#
# [[2]]
# [1] "b!"
Notice the arguments `"!", sep = ""` were passed on to paste(). Since lapply() cannot know ahead of time what function the user will supply, it uses the “...” abstraction to forward arguments. Personally I never use this form and tend to write the somewhat more explicit and verbose style shown below.
lapply(c("a", "b"), function(vi) { paste(vi, "!", sep = "") })
I feel this form is more readable, as the arguments are seen where they are actually used. (Note: this is a notional example; in practice we would use `paste0(c("a", "b"), "!")` to directly produce the result as a vector.)
An example of using “...” to supply a variadic interface is paste() itself.
paste("a", "b", "c")
# [1] "a b c"
Other important examples include list() and c(). In fact I like list() and c() because they take only a “...” and no other arguments. Being variadic is so important to list() and c() that it is essentially all they do. One can often separate out the variadic intent with lists or vectors, as in:
paste(c("a", "b", "c"), collapse = " ")
# [1] "a b c"
Even I don’t write code such as the above (that is too long even for me), unless the values are coming from somewhere else (such as a variable). However, with wrapr’s reduce/expand operator we can completely separate the collection of variadic arguments from their application. The notation looks like the following:
library("wrapr")
values <- c("a", "b", "c")
values %.|% paste
# [1] "a b c"
Essentially reduce/expand calls variadic functions with items taken from a vector or list as individual arguments (allowing one to program easily over variadic functions). %.|% is intended to place values in the “...” slot of a function (the variadic term). It is designed for a more perfect world where, when a function declares “...” in its signature, that is the only user-facing part of the signature. This is hardly ever actually the case in R, as common functions such as paste() and sum() have additional optional named arguments (which we are here leaving at their default values), whereas c() and list() are pure (take only “...”).
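For comparison (my addition, not part of the original discussion), base R’s do.call() achieves a similar separation: it lets one program over a variadic function by building its argument list explicitly.

```r
# do.call() spreads a list of values into the "..." of a variadic function.
values <- c("a", "b", "c")
do.call(paste, as.list(values))
# [1] "a b c"
```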
With a few non-standard (name-capturing) and variadic value constructors, one does not in fact need other functions to be name capturing or variadic. With such tools one can have these conveniences everywhere. For example, we can convert our incorrect use of unique() into correct code using c().
unique(c(vec1, vec2))
# [1] "a" "b" "c" "d"
In the above code the roles are kept separate: c() is collecting values and unique() is applying a calculation. We don’t need a powerful “super unique” or “super union” function; unique() is good enough if we remember to use c().
In the spirit of our earlier article on function argument names, we have defined a convenience function wrapr::uniques() that enforces the use of value-carrying arguments. With wrapr::uniques(), if one attempts the mistake I have been discussing, one immediately gets a signaling error (instead of propagating incorrect calculations forward).
library("wrapr")
uniques(c(vec1, vec2))
# [1] "a" "b" "c" "d"
uniques(vec1, vec2)
# Error: wrapr::uniques unexpected arguments: vec2
With uniques() we either get the right answer, or we are immediately stopped at the mistaken step. This is a good way to work.
Throughout the last years I noticed the following happening with a number of people (one of whom was actually yours truly, a few years back). The person is aware of S3 methods in R through regular use of the print, plot and summary functions and decides to give them a go in their own work. They create a function that assigns a class to its output and then implement a bunch of methods to work on the class. Strangely, some of these methods appear to be working as expected, while others throw an error. After a confusing and painful debugging session, the person throws their hands in the air and continues working without S3 methods, which was working fine in the first place. This is a real pity, because all the person is overlooking is a very small step in the S3 chain: the generic function.
So we have a function doing all kinds of complicated stuff. It outputs a list with several elements. We assign an S3 class to it before returning, so we can subsequently implement a number of methods¹. Let's just make something up here.
my_func <- function(x) {
ret <- list(dataset = x,
d = 42,
y = rnorm(10),
z = c('a', 'b', 'a', 'c'))
class(ret) <- "myS3"
ret
}
out <- my_func(mtcars)
Perfect, now let's implement a print method. Rather than outputting the whole list, we just want to see the most vital information when printing.
print.myS3 <- function(x, ...) {
cat("Original dataset has", nrow(x$dataset), "rows and",
ncol(x$dataset), "columns\n",
"d is", x$d)
}
out
## Original dataset has 32 rows and 11 columns
## d is 42
Ha, that is working! Now we write a mean method that gives us the mean of the y variable.
mean.myS3 <- function(x, ...) {
mean(x$y)
}
mean(out)
## [1] 0.2631094
Works too! Finally, we write a count_letters method that takes z from the output and counts how often each letter occurs.
count_letters.myS3 <- function(x) {
  table(x$z)
}
count_letters(out)
## Error in count_letters(out): could not find function "count_letters"
What do you mean “could not find function”? It is right there! Maybe we made a typo. Mmmm, no, it doesn’t seem so. Maybe, mmmm, let’s look into this… Half an hour, a bit of swearing and some feelings of stupidity later: pfff, let’s not bother with S3, we were happy just using functions in the first place.
Now why are print and mean working just fine, while count_letters isn’t? Let’s look under the hood of print and mean.
print
## function (x, ...)
## UseMethod("print")
mean
## function (x, ...)
## UseMethod("mean")
They look exactly the same! They call the UseMethod function on their own function name. Looking into the help file of UseMethod, it all of a sudden starts to make sense.
“When a function calling UseMethod(“fun”) is applied to an object with class attribute c(“first”, “second”), the system searches for a function called fun.first and, if it finds it, applies it to the object. If no such function is found a function called fun.second is tried. If no class name produces a suitable function, the function fun.default is used, if it exists, or an error results.”
So by calling print and mean on the myS3 object we were not calling the methods themselves. Rather, we called the general functions print and mean (the generics), and they call the function UseMethod. This function then does the method dispatch: it looks up the method belonging to the S3 object the function was called on. We were just lucky that the print and mean generics were already in place and called our methods. However, the count_letters function indeed doesn’t exist (as the error message tells us); only the count_letters method exists, for objects of class myS3. We just learned that methods cannot be called directly, but are invoked by generics. All we need to do, thus, is build a generic for count_letters and we are all set.
count_letters <- function(x) {
UseMethod("count_letters")
}
count_letters(out)
##
## a b c
## 2 1 1
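As a side note (my addition, a self-contained sketch of the dispatch rules quoted from the UseMethod help above): the class attribute is searched left to right, and a .default method serves as the fallback, so we can also guard against calls on the wrong kind of object.

```r
# Self-contained sketch: generic, method, and a default fallback.
count_letters <- function(x) UseMethod("count_letters")
count_letters.myS3    <- function(x) table(x$z)
count_letters.default <- function(x) stop("count_letters() needs a 'myS3' object")

# An object of class c("myS3_child", "myS3"): there is no myS3_child
# method, so dispatch falls through to count_letters.myS3.
obj <- structure(list(z = c("a", "b", "a")), class = c("myS3_child", "myS3"))
count_letters(obj)   # counts: a = 2, b = 1
```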
¹ It is actually ill-advised to assign an S3 class directly to an output. Rather, use a constructor; see section 16.3.1 of Advanced R for the how and why.
Have you ever wondered how you create or use Plotly (plotly.js) on Datazar? No? Well here it is anyway.
Datazar’s live chart editor allows you to create custom D3.js charts with a JavaScript and CSS editor. Using the DatazarJS library, you can call your datasets without having to worry about file paths or URLs; simply do Datazar.dataset('someDataset.csv', () => {}).
Is your data being automatically updated via the Datazar REST API? The charts update themselves so you don’t have to worry about taking into account new data.
Plotly provides a CDN link so you can use their library without saving your own copy. That’s wonderful; let’s use that and copy it into the popup.
Let’s use one of the examples Plotly was kind enough to create on their site: https://plot.ly/javascript/histograms/
Copy and paste the JS code to the Datazar JS editor and you’re done.
Note that this example actually generates the data on the fly, but it’s the same principle as grabbing your data using the async Dataset function.
The CSS editor just makes sure the container page keeps the same color as the Plotly chart.
To use your newly minted chart on a Paper, Datazar’s new interactive research paper, create a Paper and use the option bar on the right to include it. The chart will be on the “Visualization” tab.
The cool thing about using the Plotly chart along with the interactive Paper is that it will keep its interactivity so you can zoom in and do all the fun things Plotly allows on their charts even AFTER you publish your Paper. This means your readers can play with your charts and get a better understanding; after all, that’s the whole point.
Check out more features here: https://www.datazar.com/features
Creating Custom Plotly Charts on Datazar was originally published in the Datazar Blog on Medium.