The post AI in Regulated Industries: The No-Nonsense Guide appeared first on Ayasdi.
While it still needs to grow into its lofty expectations (particularly in the enterprise), there is a growing body of evidence that the small wins are stacking up. One area where AI is particularly challenged is regulated industries. Regulated industries have the same complexity as other industries but add the dimension of requiring human-explainable models. This is not to suggest the models are simple; they can be quite complex, but they need extreme transparency and explainability (something we call justification) for the regulator to sign off on their use.
Ayasdi is a pioneer in justifiable AI, and our underlying technology supports it at an atomic level (the machine did this because of that, for any action). It is why the team at Basis Technology asked us to contribute to their handbook on integrating AI in highly regulated industries. The report covers AI from a macro perspective, the regulator's perspective and the practitioner's perspective, and is eminently consumable at just 57 pages. So hop over, fill out the form, and thank us later.
The post The Double Edged Sword of Deep Learning + The Importance of Breakthrough Science appeared first on Ayasdi.
First off, deep learning is a boon to the field. Powered by algorithmic progress from academics, systems innovations from the likes of Nvidia, storage from the cloud providers, and broad, open-source libraries of algorithms, the performance of these frameworks has grown alongside interest in the field.
But, like the graph, things have plateaued for deep learning. The fact of the matter is that DL is not a fit for many tasks, yet, like the proverbial hammer, everything looks like a nail when you are wedded to that approach.
Deep learning struggles with many of the data challenges we find in enterprises today. Perhaps most important is the fact that the majority of data in the enterprise is unlabeled, which makes it harder to assemble the massive training sets needed to train performant deep learning models.
What Demis points out is that while DL is a useful tool, it is but one avenue, one pathway – there are many others. He notes that “Deep learning is an amazing technology and hugely useful in itself, but in my opinion it’s definitely not enough to solve AI, [not] by a long shot. I would regard it as one component, maybe with another dozen or half-a-dozen breakthroughs we’re going to need like that. There’s a lot more innovations that’s required.”
We could not agree more. We think that in order to be successful with AI, you need a range of capabilities – from the unsupervised to the semi-supervised and ultimately to the supervised.
In fact, Gunnar Carlsson’s work in the deep learning space is proving this out – that to generate material improvements in deep learning, it is very useful to apply techniques like topological data analysis as a starting point.
The net of it is that the alpha and omega of AI is not deep learning. It is merely a letter – you need other concepts to truly change the world.
Speaking of changing the world, Hassabis also talks about the promise of AI to solve some of the world’s great challenges. He’s right. AI needs to be judged against its ability to solve the real problems that face mankind – from healthcare to climate change.
“This is what I’m really excited about and I think what we’re going to see over the next 10 years is some really huge, what I would call Nobel Prize-winning breakthroughs in some of these areas.”
He goes on to note that DeepMind is looking at protein folding and quantum chemistry, among other areas. We agree wholeheartedly.
Our work to date has produced over 285 peer-reviewed publications in places like Nature, Science and Cell. Our unique brand of AI (topological data analysis) has produced breakthroughs in diabetes, asthma, cancer, disease states and traumatic brain injury. This is where AI interacts with the real world, with powerful effects. Like Hassabis, we are incredibly optimistic about what the future holds for AI. If you want to learn more about why, just ask us and we will find some time.
The post Geometric Methods for Constructing Efficient Neural Net Architectures appeared first on Ayasdi.
In this post, we will discuss how to take much more general data types and adapt the architecture to them. In fact, we will demonstrate cases where we can construct an architecture for a single data set – a custom neural network. In order to do this, we must first determine a way to think about the convolutional construction and understand how it can generalize.
We recall that for images, a convolutional neural net is composed of layers of neurons, in this case thought of as pixels, each of which is laid out as a collection of regular square grids. While the grid size can vary, typically the number of points in the grid becomes smaller for deeper layers. Often, however, the convolutional neural net has consecutive layers in which the grids are of identical size. In this case, the connections are local in the sense that neurons are connected to nearby neurons. For example, one might take the 3×3 grid patch with the original neuron in the center, and have the connection pattern below repeated around the whole grid.
In a convolutional net for time series, the layers of neurons would be arranged in linear segments instead. Again, the interlayer connections between neurons are specified by the requirement that a neuron in a given layer is connected with neurons in the next higher layer that lie in a small segment around the placement of the given neuron. In this case the picture is like this.
In the case of images, the features defining a data set of images consist of pixels, each assigned a gray-scale numerical value. The identification of pixels with points in the plane equips the pixels (i.e. the set of features) with a geometric structure. In particular, the pixels carry a notion of distance between them, and pixels are connected only to pixels in the following layer that lie close to them in the grid.
This means that we have removed a large number of connections from the fully connected structure based on a geometric principle, which we will refer to as locality.
The times series situation is similar – in this case time points in one layer connect to time points in the next layer if and only if they are within a fixed distance of each other.
Why does this kind of pruning of connections make sense?
Let’s consider the case of a data set of time series, and suppose that each time series consists of 10^4 consecutive time points and measurements at those time points, for example stock ticker data. Suppose further that we are trying to train a predictor for estimating the value of the measurement at time t_{i+1} from the measurements at all values of time. That would mean that we would be attempting to search for features that include all possible combinations of the 10^4 variables. Even if we only included pairs of time measurements, this would already enlarge the search space to 10^8 combinations. If we are searching through all combinations, then the search space is of size 2^10,000, obviously a very large number.
By pruning all connections between time points that are further than one unit apart, we are restricting the way in which the neural net can modify the formulas to those combinations of features that consist only of sets of three consecutive time points, for example. This means that we are searching through a space consisting of only 10^4 combinations, one for each center of an interval of three time points.
This is an important efficiency.
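This efficiency can be made concrete with a small sketch (numpy; the window radius and the layer sizes are illustrative, chosen to mirror the stock ticker example):

```python
import numpy as np

def locality_mask(n, radius=1):
    """Boolean n x n mask: neuron i in one layer is connected to
    neuron j in the next layer iff |i - j| <= radius."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= radius

mask = locality_mask(8, radius=1)    # a small layer, for inspection
assert int(mask.sum()) == 3 * 8 - 2  # each neuron keeps at most 3 connections

# For the 10^4-point time series: local three-point windows vs. all subsets.
n = 10_000
local_windows = n          # one window per center time point
all_subsets = 2 ** n       # the unpruned search space, ~10^3010
print(local_windows)
```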
Furthermore, it can easily be justified, since we believe that it is much more likely that features that are nearby (in time) will be relevant to the prediction task at hand than time points that are far removed in time.
A similar argument works for images, where the geometrically local structure encodes the idea that the features we are looking for to distinguish images are local, i.e. concentrated in a small part of the image. Pixels that are far removed from each other are unlikely to produce combined features that are meaningful. This is the locality condition introduced above, and it makes sense in both of the situations in question.
Naturally, there is great utility in extending this idea to new situations, beyond just time series and images. This means that we want our feature space to be equipped with a notion of distance, which can be encoded in the idea of an abstract metric on a set X. A metric on X is a function that assigns a non-negative real value d(x, y) to every pair of points (x, y), with x and y members of X, subject to three properties: d(x, y) = 0 if and only if x = y; d(x, y) = d(y, x) for all x and y; and d(x, z) ≤ d(x, y) + d(y, z) for all x, y and z (the triangle inequality).
These conditions abstract three important properties of the ordinary distance function on the line, in the plane, or in space. In particular, in the case of time series and images, the data sets are equipped with an a priori metric since they are subsets of the line and the plane, respectively. To see another example, consider the circle, where we can assign to two different angles the arc length of the shortest arc from one to the other, as points on the circle. This quantity can easily be seen to be a metric on the circle. Let's now recall the features f_θ that were introduced in our previous post, one for each angular value on the circle. If we choose a discrete set of points on the circle, it can be equipped with a metric by restricting the distance on the circle. The points could be chosen to be regularly spaced, as in the diagram below.
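The arc-length distance can be written down directly, and the three metric properties checked numerically on 16 regularly spaced points (a quick Python sanity check, not part of the original derivation):

```python
import math
import itertools

def circle_dist(a, b):
    """Arc length of the shortest arc between two angles on the unit circle."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

# 16 regularly spaced points on the circle, as in the diagram.
angles = [2 * math.pi * k / 16 for k in range(16)]

# Verify the three metric properties on this discrete set.
for x, y, z in itertools.product(angles, repeat=3):
    assert circle_dist(x, y) >= 0 and (circle_dist(x, y) == 0) == (x == y)
    assert circle_dist(x, y) == circle_dist(y, x)                    # symmetry
    assert circle_dist(x, z) <= circle_dist(x, y) + circle_dist(y, z) + 1e-12
print("metric properties hold on the 16-point circle")
```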
We now have a collection of 16 features, equipped with a distance function, just as we had distance functions on the one-dimensional lattice in the time series case and the two-dimensional lattice in the case of images.
The key difference is that these features were not raw data from an image data set, but were derived features that were found to be of importance via the work in our earlier blog post. Nevertheless, it is possible to build a feed forward neural net with neurons in each layer in correspondence with the 16 points on the circle, and where connections from one such layer to the following one are local, i.e. a neuron in a layer is connected to a neuron in the following layer if and only if they are close, say adjacent to each other, in the circular geometry.
Using networks built in this way, but in combination with the grids coming from images, we are exploring how this kind of architecture (again, choices of a directed graph in the mathematical sense) can speed up the learning process in image convolutional neural nets. There are many such geometries that can be used to build neural net structures adapted to various kinds of data.
We have described how to use geometry to specify connections when one has two adjacent layers that are identical. However, an important construction in the image (and time series) examples is that of a pooling process, which has a grid in one layer and a smaller grid in the following layer. In the time series case, it looks like this.
Each neuron in the right layer is connected to two neurons in the left layer. A similar thing is done with images, having the effect of forming a new image at a lower resolution. A variant of this pooling procedure is available for the circular geometries as well. Here is a picture describing it.
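In code, this pairwise pooling is just a reshape: each output neuron combines two adjacent input neurons, halving the layer. Averaging is one common choice of pooling operation (max pooling is another); the 8-neuron circular layer below is illustrative:

```python
import numpy as np

def pool_pairs(layer):
    """Connect each pooled neuron to two adjacent neurons in the
    previous layer, here by averaging them; the layer size halves and
    the underlying geometry (segment or circle) survives at half the
    resolution."""
    layer = np.asarray(layer, dtype=float)
    return layer.reshape(-1, 2).mean(axis=1)

layer = np.array([1.0, 3.0, 2.0, 6.0, 4.0, 0.0, 5.0, 7.0])  # 8 neurons on a circle
pooled = pool_pairs(layer)                                   # 4 neurons
print(pooled)   # [2. 4. 2. 6.]
```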
The previous example shows how one can incorporate knowledge about features spaces into the construction of neural nets, and that knowledge can be a priori knowledge coming directly from the raw data (as in the case of images or time series) or it can be knowledge derived from research about the features, as in the case of the circular geometry on derived features from images.
But what about a situation in which one does not have research leading to a simple model of geometry on a space of features? There is still something one can do. The important idea is to use the methods described in our earlier blog post. Suppose that we have a data matrix D, whose rows are the data points, and whose columns are the features. Using such a matrix, one can put a metric on the set of rows in a number of ways (higher-dimensional versions of the Euclidean metric, Hamming metrics, etc.), and doing this turns out to be extremely useful.
It is the basis for Ayasdi’s software, where topological models of data sets are constructed using a metric on the data set. What was observed in our previous post, though, is that it is equally possible to put a metric on the columns of D, i.e. the features. By performing this simple transformation, we have a metric on the feature space, as we did in the cases described above.
One could use this geometry directly, but often there are many more features than the number of neurons one would prefer to have. What one can do is create an Ayasdi topological model of the set of columns, using the distance function. In this model, each node is a collection of features, and we can assign to it a new feature that is the average value of all of the features that correspond to the node. We now have a graph structure in which each node corresponds to a feature.
We can use this to create a feed forward network, with each layer having neurons in one-to-one correspondence with the set of nodes in the topological model, and where we assign an edge from the neuron in a particular layer corresponding to a vertex of the topological model to a neuron in the following layer corresponding to another vertex if and only if the two vertices form an edge in the topological model. This creates a feed forward network with identical layers. There is also a way to construct pooling layers in this context, based on varying the resolution in the topological model.
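A minimal numpy sketch of the whole pipeline: put a metric on the columns of a data matrix D, build a graph on the features (here a simple thresholded graph stands in for the Ayasdi topological model, which is more sophisticated), and use its adjacency matrix to mask the weights of a feed forward layer. The sizes and the threshold rule are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(200, 6))   # rows = data points, columns = features

# A metric on the columns: Euclidean distance between feature vectors.
cols = D.T
dist = np.linalg.norm(cols[:, None, :] - cols[None, :, :], axis=2)

# Stand-in for the topological model: connect features closer than a
# threshold (here, the median pairwise distance); each node = one feature.
adj = dist <= np.median(dist[dist > 0])

# Masked feed forward layer: weight (i, j) survives iff features i and j
# are connected in the graph, mirroring the locality pruning above.
W = rng.normal(size=(6, 6)) * adj
layer_out = np.maximum(D @ W, 0.0)   # ReLU activation
print(layer_out.shape)
```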
The activation and weight information in this net gives rise to derived features for the data set, as well as solving a classification or estimation problem on the data set. One important point in using the topological model for the feature space is that it forces the neural network to deal with the results of groups of raw features, and therefore will not be able to overfit on a single raw feature. It should mitigate the overfitting that often occurs in this technology.
What we described above is a methodology that facilitates the construction of feed forward deep learning architectures adapted to classes of data sets, such as images, time series, etc.
Having said that, this methodology also facilitates the construction of bespoke architectures for one-off data sets that fall outside the classes outlined above. As such, we have a unified approach that covers the vast majority of potential data sets.
The post Mathematical Acceleration: Incorporating Prior Information to Make Neural Nets Learn 3.5X Faster appeared first on Ayasdi.
By eliminating the need to relearn these features, one can increase the performance (speed and accuracy) of a neural network.
In this post, we describe how these ideas are used to speed up the learning process in image datasets while simultaneously improving accuracy. This work is a continuation of our collaboration with Rickard Brüel Gabrielsson.
As we recall from our first and second installments, we can discover how convolutional neural nets learn and how to use that information to improve performance, while allowing those techniques to generalize between data sets of the same type.
These findings drew from a previous paper where we analyzed a high density set of 3×3 image patches in a large database of images. That paper identified the patches visible in Figure 1 below and allowed them to be described algebraically. This permits the creation of features (for more detail, see the technical note at the end). What was found in the paper was that at a given density threshold (e.g. the top 10% densest patches) those patches were organized around a circle. The organization is summarized by the following image.
While the initial finding was via the study of a data set of patches, we have confirmed the finding through the study of a space of weight vectors in a neural network trained on data sets of images of hand-drawn digits. Each point on the circle is specified by an angle θ, and there is a corresponding patch P_{θ}.
What this allows us to do is create a feature, very simple to compute using trigonometric functions, that for each angle θ measures the similarity of a given patch to P_θ. See the technical note at the end for a precise definition. By adjoining these features to two distinct data sets and treating them as separate data channels, we have obtained improved training results on both data sets, one MNIST and the other SVHN. We'll call this approach the boosted method. We are using a simple two-layer convolutional neural net in both cases. The results are as follows.
MNIST

| Validation Accuracy | # Batch iterations (boosted) | # Batch iterations (standard) |
|---|---|---|
| .8 | 187 | 293 |
| .9 | 378 | 731 |
| .95 | 1046 | 2052 |
| .96 | 1499 | 2974 |
| .97 | 2398 | 4528 |
| .98 | 5516 | 8802 |
| .985 | 9584 | 16722 |
SVHN

| Validation Accuracy | # Batch iterations (boosted) | # Batch iterations (standard) |
|---|---|---|
| .25 | 303 | 1148 |
| .5 | 745 | 2464 |
| .75 | 1655 | 5866 |
| .8 | 2534 | 8727 |
| .83 | 4782 | 13067 |
| .84 | 6312 | 15624 |
| .85 | 8426 | 21009 |
The application to MNIST gives a rough speedup of a little under 2X. For the more complicated SVHN, we obtain improvement by a rough factor of 3.5X until hitting the .8 threshold, lowering to 2.5-3X thereafter. An examination of the graphs of validation accuracy vs. number of batch iterations, plotted to 30,000 iterations, suggests that at the higher accuracy ranges the standard method may never attain the results of the boosted method.
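The quoted speedups can be recomputed directly from the tables above (the numbers below are copied from the tables):

```python
# (boosted, standard) iterations to reach each validation accuracy.
mnist = {0.8: (187, 293), 0.9: (378, 731), 0.95: (1046, 2052),
         0.96: (1499, 2974), 0.97: (2398, 4528), 0.98: (5516, 8802),
         0.985: (9584, 16722)}
svhn = {0.25: (303, 1148), 0.5: (745, 2464), 0.75: (1655, 5866),
        0.8: (2534, 8727), 0.83: (4782, 13067), 0.84: (6312, 15624),
        0.85: (8426, 21009)}

for name, table in (("MNIST", mnist), ("SVHN", svhn)):
    ratios = {acc: round(standard / boosted, 2)
              for acc, (boosted, standard) in sorted(table.items())}
    print(name, ratios)
# MNIST speedups stay a little under 2X; SVHN runs near 3.5X up to the
# .8 threshold and settles to roughly 2.5X beyond it.
```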
These findings suggest that one should always use these features for image data sets in order to speed up learning. The results in our earlier post also suggest that they contribute to improved generalization capability from one data set to another.
These results are only a beginning. There are more complex models of spaces of high density patches in natural images, which can generate richer sets of features, and which one would expect to enable improvement, particularly when used in higher layers.
In our next post, we will describe a methodology for constructing neural net architectures especially adapted to data sets of fixed structures, or adapted to individual data sets based on information concerning their sets of features. By using a set of features equipped with a geometry or distance function, we will show how that information can be used to inform the development of architectures for various applications of deep neural nets.
Technical Note
As stated above, each point on the circle is specified by an angle θ, and there is a corresponding patch P_θ, which is in turn a discretization of a real-valued function f_θ on the unit square of the form

f_θ(x, y) = x cos(θ) + y sin(θ)

By forming the scalar product of a patch (which is a 9-vector of values, one for each pixel) with the 9-vector of values obtained by evaluating f_θ on the 9 grid points, we obtain a number, which is the feature assigned to the angle θ. By selecting down to a finite, equally spaced set of angles, we get a finite set of features we can work with.
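A short numpy sketch of this computation (the choice of grid coordinates {-1, 0, 1} for the 3×3 patch is our assumption; the note above does not pin the coordinates down):

```python
import numpy as np

def angle_feature(patch, theta, grid=(-1.0, 0.0, 1.0)):
    """Scalar product of a 3x3 patch with the discretization of
    f_theta(x, y) = x*cos(theta) + y*sin(theta) on the 3x3 grid."""
    xs, ys = np.meshgrid(grid, grid)
    f_theta = xs * np.cos(theta) + ys * np.sin(theta)
    return float(np.ravel(patch) @ np.ravel(f_theta))

# A horizontal-gradient patch should respond most strongly at theta = 0.
patch = np.array([[-1.0, 0.0, 1.0],
                  [-1.0, 0.0, 1.0],
                  [-1.0, 0.0, 1.0]])
thetas = np.linspace(0.0, 2.0 * np.pi, 16, endpoint=False)  # 16 equally spaced angles
feats = [angle_feature(patch, t) for t in thetas]
print(int(np.argmax(feats)))   # 0: the theta = 0 feature fires hardest
```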
The post Feature Rich – How Features Feature in Our 7.10 Release appeared first on Ayasdi.
The 7.10 release is one of Ayasdi’s most functionality-packed releases ever. We won’t cover everything in this post, but will touch on the highlights, of which there are many:
We covered our growing portfolio of transformation services in a recent post and so will focus on what’s new in this release.
Data is being collected at faster and faster rates, everywhere. The data we collect, however, is not very valuable, from a predictive perspective, in its raw form. The main way for companies to harness the data and make it valuable (i.e. have predictive power) is through data transformation. These transformations include merges, concatenations, summing and other transformations.
Ayasdi continues to grow its portfolio of transformations and with Release 7.10, Ayasdi has added Pivot, Binarize, Union, and Math/String Operator transformations to its already extensive list.
Let’s start with the Math/String Operator transformations. Customers now have the ability to perform mathematical operations (+ – * /) between a set of columns (for creating ratios, sums, differences, etc.) or they can concatenate string or numeric values between multiple columns using the new Math/String Operators.
Next is Union transformations. With this new feature, a number of data sources can now be united into one. For example, transactions from 2015 can be merged with those from 2016 to produce a data source that spans both years.
We also added the ability to binarize any column in a dataset. For example, for an age column, any value above 20 can be labeled with a 1 and any value 20 or below can be given a 0 value in a new binary column.
Finally, we’ve added pivot transformations. Pivoting can convert, for example, a customer-transaction source into a customer-product-number of transactions source. This is similar to Microsoft Excel Pivot functionality, except the Ayasdi system handles much larger data pivots.
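These four transformations correspond to familiar dataframe operations. Here is a sketch in plain pandas (generic pandas, not the Ayasdi API; the column names and thresholds are illustrative):

```python
import pandas as pd

tx_2015 = pd.DataFrame({"customer": ["A", "B"], "product": ["x", "y"], "amount": [10.0, 25.0]})
tx_2016 = pd.DataFrame({"customer": ["A", "A"], "product": ["x", "z"], "amount": [15.0, 5.0]})

# Union: unite the 2015 and 2016 transactions into one source.
tx = pd.concat([tx_2015, tx_2016], ignore_index=True)

# Math operator: an arithmetic combination of columns.
tx["amount_doubled"] = tx["amount"] * 2

# Binarize: values above a cutoff become 1, the rest 0, in a new column.
tx["large"] = (tx["amount"] > 12).astype(int)

# Pivot: customer-transaction rows -> customer x product transaction counts.
counts = tx.pivot_table(index="customer", columns="product",
                        values="amount", aggfunc="count", fill_value=0)
print(counts)
```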
Expect more transformations going forward as Ayasdi continues to deliver against our customer requirements in this area.
Part of the transformation puzzle involves retrieving statistical facts about the data. The raw form of the data itself might not be meaningful but understanding the distribution of the values is often quite valuable.
Release 7.10 provides new statistical functions, including Row Group Stats, Histograms, and Distributions. These and many other statistical functions can be accessed using the Ayasdi Python SDK Source.get_stats function.
Some of the highlights include Most Common Value (mode) within a group or the entire source (i.e. what song did a customer listen to the most?), the number of unique values, and the actual distinct values themselves.
Additionally, the distribution of values (i.e. percentiles) and histogram data is now easily obtained. A histogram of the number of times a customer listened to a particular song Genre might provide an interesting customer summary visual for that customer, for example.
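In generic pandas terms (again illustrative, not the Ayasdi SDK), the song-genre example looks like this:

```python
import pandas as pd

listens = pd.DataFrame({
    "customer": ["A", "A", "A", "B", "B"],
    "genre":    ["rock", "rock", "jazz", "pop", "pop"],
})

# Most common value (mode) per customer: the genre listened to most.
top_genre = listens.groupby("customer")["genre"].agg(lambda s: s.mode().iloc[0])

# Number of unique values and the distinct values themselves.
n_unique = listens["genre"].nunique()
distinct = sorted(listens["genre"].unique())

# Histogram data: listen counts per customer and genre.
hist = listens.groupby(["customer", "genre"]).size()
print(top_genre.to_dict(), n_unique, distinct)
```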
Ultimately, these features are not that impressive from a statistical perspective (frankly they are not at all) but exposing the functionality will allow a user to create robust dashboards, something that was previously more difficult (though not impossible) to do.
Features are an integral part of any machine intelligence exercise. Our technology, Topological Data Analysis, feasts on features. The more the better. That is why we spend the time we do on transformations – they allow us to increase the feature space. Where some technologies struggle with this dimensionality, we thrive.
Having said that, more features can create challenges. The greater the number, the more interaction there likely is and the more difficult the process of manual selection becomes. For an accomplished data scientist, there are methods, approaches, and intuition that can be applied. These can be time-consuming.
Release 7.10 has introduced two elements that help with this challenge. The first is Automated Feature Selection and the second is an enhanced internal search algorithm that provides improved Feature Selection and scoring results.
Users now have three options. They can either directly select features for analysis as they have traditionally done, choose the automatically created feature subset (i.e. column set), or use the scoring results from Automatic Feature Selection to choose alternative feature sets.
Automated Feature Selection, which is currently available for supervised feature selection, identifies features (columns such as age, location or transaction type) in a data source with the most predictive value with respect to a given outcome.
Ayasdi’s Feature Selection works to identify features that have high relevance to the outcome while reducing redundancy within the selected set of features. The Feature Selection returns an ordered list of features. Ayasdi’s new Automated Feature Selection now proposes feature sets based on this list and scores them for their overall ability to build good topological models that localize the outcome.
The new Automated Feature Selection functionality substantially reduces the amount of time necessary to identify the subset of features that are relevant to the outcome and facilitate the rapid discovery of high performing models.
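To make the relevance-versus-redundancy trade-off concrete, here is a greedy, mRMR-style selection sketch in numpy. To be clear, this is a simplified stand-in and not Ayasdi's algorithm; the scoring (absolute correlation with the outcome, minus mean correlation with already-selected features) is our illustrative choice:

```python
import numpy as np

def select_features(X, y, k):
    """Greedy relevance-minus-redundancy selection (an mRMR-style
    sketch): features correlated with the outcome score high, and
    correlation with already-selected features is penalised."""
    relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) for s in selected])
            if relevance[j] - redundancy > best_score:
                best, best_score = j, relevance[j] - redundancy
        selected.append(best)
    return selected

# Toy data: features 0 and 1 are exact duplicates, feature 2 is distinct;
# the outcome depends on both signals.
t = np.linspace(0.0, 1.0, 100)
X = np.column_stack([t, t, np.sin(8 * np.pi * t)])
y = t + 0.5 * np.sin(8 * np.pi * t)
chosen = select_features(X, y, 2)
print(chosen)   # picks the distinct feature plus one duplicate, never both
```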
These features (pun intended) are massive time savers – and come with a well thought-out set of guard rails to ensure the output is meaningful. We often talk about co-optimizing for efficiency and effectiveness – these features exemplify that pursuit.
Once the causal factors for decisions are selected and a classification model is created, the user typically begins to predict future outcomes. Often, classification systems become a “black box” of sorts providing a predicted classification group but not providing any insight into why that prediction was made.
Until this release, users had the ability to create one_vs_rest classification models within Ayasdi MIP. While this is a great strategy for creating performant predictive models, it suffers from two major issues when it comes to justifiability.
Ayasdi addresses this challenge through its Multiclass Model framework. The past several releases have brought major enhancements to this framework and this release is no different.
The new Ayasdi Decision Tree MultiClass Predictor provides a single set of rules to justify and explain why certain transactions/claims/genre choices belong to a particular classification group in the Topological Network (using dt.get_rules(), dt.rules, and dt.dot). This is important because users can now understand, at a granular level, the causal factors behind the groupings. It is particularly useful when a network has multiple disjoint groups and the user wants to predict which group a data point belongs to.
We have maintained for a long time that rules-based systems are ultimately flawed. However, we realize that they are exceptionally prevalent, especially in the finserve and healthcare verticals, and have justifiable use in situations where very fast, millisecond scale decisions need to be made.
A minor addition we’ve made is that results are now returned in the form of prediction probabilities and decisions. For example, a model can predict if Customer A will be a churning customer (the Decision) with a 90% probability (how confident the model is of this Decision).
This is useful in a number of constructs such as determining churn, identifying upgrade candidates and identifying failed program states just to name a few.
Often, the user would like to perform prediction using just a subset of the data. With this release, Ayasdi has made this even easier. Now, the user can construct a predictive model using a subset of the data or using particular datapoints. The new functionality is enabled through the Ayasdi Python SDK SourceSubset and DataPointList parameters.
Each DataPointList represents the values for the row corresponding to the original column set used for training. This meets a common requirement for Logistic Regression, Random Forest and Decision Tree models to be able to predict against only one or a few incoming data points (instead of a full dataset).
New or test data no longer needs to be imported into the system as a source. This abstraction means that a stream of data, as it is generated in real time from an external source system, can now be used as an input. This addition is a milestone in our constant progress towards real-time prediction capabilities.
Ayasdi is in the business of building intelligent applications. All of those applications are built on our machine intelligence platform and thus have a common analytical foundation (whether it uses TDA or not).
Often, however, multiple applications would like to access a single, platform-powered component, such as a classification model.
Previously this would require multiple simultaneous connections. With this release, we have introduced API Connection changes that allow a user to call the Python SDK in order to make connections to multiple installations, as multiple users, without having to logout and login.
This functionality is key for a number of AI applications, such as Anti-Money Laundering where multi-user (multiple authorized users at the same time), multi-tenancy (multiple users can spawn concurrent jobs) is a requirement.
Release 7.10 also offers an option for on-premise or private cloud deployment (adding Azure to our existing AWS support). In keeping with all our previous releases, we continue to support high availability in our installations.
New for Release 7.10, the installation process no longer requires administrator or super-user privileges.
Whew. To think that only covers the highlights. We are constantly looking for ways to make this information more accessible and more digestible. There is now a new Getting Started page for the Python SDK, which was introduced to help users set up their environment before launching into the Ayasdi Python SDK Tutorials. This page provides links to all the sample data and code needed for the Ayasdi Python SDK Tutorials, describes how to upload the data to the Ayasdi Machine Intelligence Platform, and explains how to start a fresh Ayasdi Notebook.
Additionally, new documentation with Jupyter Notebook samples is provided with sample code to create topological networks and produce groups and auto-groups.
Please refer to the Ayasdi Machine Intelligence Platform 7.10 Release Notes and Ayasdi Python SDK Documentation for more details on these features and more.
The post Going Deeper: Understanding How Convolutional Neural Networks Learn Using TDA appeared first on Ayasdi.
The significance of this work can be summarized as follows:
In this post we show how one might leverage this kind of understanding for practical purposes. I will build on the work that Rickard Gabrielsson and I have done while discussing three further findings. Those are:
We need to recall some ideas from the previous post. One of the ideas introduced was the use of persistent homology as a tool for measuring the shape of data. In our example, we used persistent homology to measure the size and strength or “well-definedness” of a circular shape.
We’ll first recall the notions of persistent homology. Persistent homology assigns to any data set and dimension a “barcode”, which is a collection of intervals. In dimension = 0, the barcode output reflects the decomposition of the data set into clusters or components.
In clustering, one can choose a threshold and connect any two points by an edge if their distance is less than this threshold, and compute connected components of the resulting graph. Of course, as the threshold grows, more points will be connected, and we will obtain fewer clusters. The barcode is a way of tracking this behavior. The picture below gives an indication of how this works.
On the left, we have a data set that breaks naturally into three clusters, roughly equidistant. The barcode reflects that in the presence of three bars, with only one longer than a certain threshold value, depending on the distance between the clusters. The bars represent the initial clusters, with two of them cutting off at the threshold value where the clusters merge. On the right, we have a similar situation, except that the clusters are not equidistant. In this case, we begin with three bars because at a fine level of resolution, there are three clusters. At a threshold value roughly equal to the one on the left, we see that the two adjacent clusters merge into a single one, and we are looking at two clusters for a while. This is reflected in the barcode by the fact that the first bar is relatively short, while the other two are longer. Finally, as the threshold gets large enough to merge all three, we see that two of the bars are shorter than that threshold, with only one longer.
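In dimension 0, this barcode can be computed with nothing more than single-linkage merging: sort the pairwise distances, merge components as the threshold grows, and record the threshold at which each cluster's bar ends. A self-contained Python sketch (not Ayasdi's implementation; the three-cluster point configuration is illustrative, set up like the right-hand example):

```python
import numpy as np
from itertools import combinations

def h0_barcode(points):
    """Dimension-0 persistence: every point starts its own bar at 0;
    a bar ends at the threshold where its component merges into
    another (Kruskal-style union-find). One bar never dies."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((float(np.linalg.norm(np.subtract(points[i], points[j]))), i, j)
                   for i, j in combinations(range(n), 2))
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)   # one cluster's bar ends at threshold d
    return sorted(deaths) + [float("inf")]

# Three clusters: two near each other, one far away.
pts = [(0, 0), (0.1, 0), (0, 0.1),   # cluster 1
       (1, 0), (1.1, 0),             # cluster 2, close to cluster 1
       (5, 5), (5.1, 5)]             # cluster 3, far away
bars = h0_barcode(pts)
# Four short bars (within-cluster merges at 0.1), one ending when
# clusters 1 and 2 merge (0.9), one long bar ending when cluster 3
# joins (~6.34), and one infinite bar.
print([round(b, 2) for b in bars[:-1]])
```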
For higher dimensions, persistent homology measures the presence of geometric features beyond cluster decomposition. In the case of dimension = 1, the barcode measures the presence of loops in a data set.
On the left, the barcode consists of a long bar and some much shorter ones. The long bar reflects the presence of a circle whereas the shorter ones occur due to noise. On the right, we again have the short bars corresponding to noise and two longer bars, of different lengths. These bars reflect the presence of the two loops, and the different lengths of the bars correspond to the sizes of the loops. The length of a bar can also reflect what one might refer to as the “well-definedness” of the loop.
Let’s look at these images to understand that better.
On the left we have a very well-defined loop, and its barcode. On the right, some noise has been added to the loop, making it more diffuse and less well-defined. The longest bar on the right is shorter than that on the left. The length of the longest bar can thus reflect the well-definedness of the loop.
In the earlier post, we saw that the “loopy” shapes obtained from Ayasdi’s software were in fact confirmed by the presence of a single long bar within the barcode. We now wanted to understand how the loopy shape evolved as the training progressed.
We achieved this by examining the correlation between the length of the longest bar in the barcode (which can be computed at any stage of training) and the accuracy at that point of training. We performed these computations for two data sets, MNIST and the Street View House Numbers data set, referred to as SVHN.
The results look like this:
On the left is MNIST, on the right is SVHN. The x-axis records the number of iterations in the learning process. The y-axis records the accuracy or the length of the longest bar in the barcode, respectively, after mean centering and normalization.
As one can see, the length of the longest bar is well correlated with the accuracy of the classifier for the digits. This finding enhances the precision of our observations from the earlier post. There, we had only observed that, after training, we saw a single long bar in the barcode whereas we are now observing the length of the bar (and therefore the well-definedness of the circle) increasing as the training progresses.
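The correlation computation itself is straightforward. Here is a sketch with made-up stand-in values (not our experimental numbers), using the same mean centering and normalization as the plots:

```python
import numpy as np

# Illustrative stand-ins for values recorded during training,
# NOT the numbers from our experiments.
accuracy    = np.array([0.31, 0.52, 0.71, 0.82, 0.88, 0.91])
longest_bar = np.array([0.40, 0.62, 0.95, 1.05, 1.18, 1.22])

def standardize(x):
    # Mean-center and normalize, as in the plots above.
    return (x - x.mean()) / x.std()

# Pearson correlation between accuracy and longest-bar length.
r = np.corrcoef(standardize(accuracy), standardize(longest_bar))[0, 1]
print(round(r, 3))
```

A strong positive correlation between the two series is what the figures above show for both MNIST and SVHN.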
The second finding concerned the process of generalization from one dataset to another. Specifically, we trained a CNN based on MNIST, and examined its accuracy when applied to SVHN. We performed the training using three different methods.
When we did this, we found that in the three separate cases, the accuracy of prediction on SVHN was 10% in case (1), 12% in case (2), and 22% in case (3). Of course, all of these are low accuracy numbers, but the results demonstrate that fixing the first convolutional layer to consist entirely of points from the primary circle model significantly improves the generalization from MNIST to SVHN. One could also include more complex models than the primary circle and expect further improvement in generalization.
The third finding involved an examination of the variability of the two datasets. Qualitatively, we can determine that SVHN has much more variability than MNIST. We would, in turn, expect that SVHN provides a richer and more precise set of weight vectors. Indeed, the persistence interval for SVHN is significantly longer than that for MNIST (1.27 vs. 1.10). This further confirms the observation above that there is a strong correlation between the “well-definedness” of the circle model generated and the quality of the neural network.
The reason topological analysis is useful in this type of analytical challenge is that it provides a way of compressing complicated data sets into understandable and potentially actionable form. Here, as in many other data analytic problems, it is crucial to obtain an understanding of the “frequently occurring motifs” within the data. The above observations suggest that topological analysis can be used to obtain control and understanding of the learning and generalization capabilities of CNN’s. There are many further ideas along these lines, which we will discuss in future posts.
The post Going Deeper: Understanding How Convolutional Neural Networks Learn Using TDA appeared first on Ayasdi.
]]>The post Transforming Before Your Eyes – New Transformation Services appeared first on Ayasdi.
]]>Ayasdi has been expanding its Transformation Services in order to both facilitate and expedite its data ingestion process. In Release 7.9, Ayasdi added Null Imputation, Z-Scoring, Joins, and In-Source Transformations to its rapidly growing list of supported Transformation Services:
Already available Transformation Services prior to Release 7.9:
Ayasdi Transformation Services are available through the Ayasdi Python SDK and can be built into Envision applications. Full documentation for the Ayasdi SDK can be found at https://platform.ayasdi.com/sdkdocs/
The following reviews the new Ayasdi Transformation Services recently released with the Ayasdi Python SDK Version 7.9.
Null Imputation Transformations
The Ayasdi Python SDK now supports Null Imputation Transformations, which convert null data values into a statistical calculation or a user-provided value. With NullImputationTransformationStep, null values can be replaced with the mean, minimum, maximum, or median value of the original column. This is helpful because a column often contains null values that are better replaced with a real value for analysis, and the required transformation can be done automatically. For example, the null values might be set to 0, the mean, or the median. Whether nulls should be imputed depends on the context. If you are not sure how to proceed, drop us a note at support@ayasdi.com and we will be sure to respond.
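To make the semantics concrete, here is an illustrative pandas sketch of what null imputation does. The column names are invented; the actual SDK call (NullImputationTransformationStep) is documented at the SDK link above.

```python
import pandas as pd

# Invented example column containing nulls.
df = pd.DataFrame({"age": [34, None, 51, None, 42]})

# Replace nulls with a statistic of the column, or a fixed value.
df["age_mean_imputed"]   = df["age"].fillna(df["age"].mean())
df["age_median_imputed"] = df["age"].fillna(df["age"].median())
df["age_zero_imputed"]   = df["age"].fillna(0)
```

After imputation, every variant of the column is null-free and ready for analysis.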
Z-Scoring Virtual Transformations
The Ayasdi Python SDK now supports standard scaling of a data column, or Z-scoring. Z-scoring is a standard practice of transforming numerical columns prior to applying any machine learning method. This transformation is especially useful when columns have different scales and the user would like to scale them for effective comparisons. The new StandardScalingTransformationStep method provides two standard scaling options: standard deviation (relative weight) and mean.
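For reference, the underlying computation can be sketched in a few lines of numpy (the SDK's StandardScalingTransformationStep performs this server-side; the column values here are invented):

```python
import numpy as np

def z_score(column):
    """Standard scaling: subtract the mean, divide by the standard deviation."""
    column = np.asarray(column, dtype=float)
    return (column - column.mean()) / column.std()

# Two columns on wildly different scales become comparable after scaling.
income = z_score([40_000, 55_000, 120_000, 75_000])
age    = z_score([25, 31, 58, 40])
```

Each scaled column has mean 0 and standard deviation 1, which is what makes cross-column comparisons effective.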
Ability to Join Multiple Datasets
The Ayasdi Python SDK now supports the ability to join multiple datasets, which is a key feature engineering ability. Ayasdi’s Machine Intelligence Platform Release 7.9 has added infrastructure to support merging of data from different sources for combined analysis. The new JoinTransformStep merges a primary source with any number of secondary sources.
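Conceptually, the merge behaves like a standard relational join. An illustrative pandas sketch follows, with invented sources and keys; the actual SDK call is JoinTransformStep.

```python
import pandas as pd

# Invented primary source and two secondary sources sharing a key.
primary  = pd.DataFrame({"patient_id": [1, 2, 3], "age": [64, 51, 72]})
labs     = pd.DataFrame({"patient_id": [1, 2, 3], "wbc": [8.1, 11.4, 6.9]})
pharmacy = pd.DataFrame({"patient_id": [1, 3], "on_statin": [True, False]})

combined = primary.merge(labs, on="patient_id").merge(
    pharmacy, on="patient_id", how="left"   # keep all primary rows
)
print(combined.shape)   # → (3, 4)
```

The left join keeps every row of the primary source even when a secondary source has no matching record, which is the usual requirement when enriching a primary data set.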
In-Place Source Transformations
Prior to release 7.9, the Ayasdi Machine Intelligence Platform always created a new source for transformations and, therefore, a new associated Topological Network, which would not have any previously applied Comparisons, Row Groups, Column Sets, or Colorings. Release 7.9 now supports calling the ApplyTransform function without specifying a new_source_name, supporting the addition of newly transformed columns to the existing source. If the new_source_name is set to None or left blank, the new transformed columns will be added to the original source.
Looking Forward
Transformations are critical elements: not only do they enrich the data with context, they facilitate its understanding. Keep checking back, as we will continue to roll out new Transformation Services in future releases.
The post Transforming Before Your Eyes – New Transformation Services appeared first on Ayasdi.
]]>The post Innovators are Everywhere: How a Community Hospital Set the Standard for AI appeared first on Ayasdi.
]]>They are, however, innovative.
A 2018 winner of Best Hospitals by Healthgrades, Flagler has also earned the Gold Seal of Approval from the Joint Commission for primary stroke care centers, national accreditation for its total hip and total knee replacement programs, accreditation from the American College of Surgeons’ Commission on Cancer, Center of Excellence designation for its bariatric surgery center, and ANCC magnet status for nursing excellence.
Despite this excellence, Flagler doesn’t jump off the page as a place where artificial intelligence would take root. They don’t have a data science group. They don’t have any other AI initiatives. They don’t have a dedicated innovation team.
What they do have, however, is a keen appreciation of the concepts inherent in value based care. For several years now, Flagler has quietly pursued a value based care agenda, supported enthusiastically by their board. The hospital has held organization-wide workshops on the subject. Indeed, one of the pillars of their Mind, Body + Spirit program was the economic health of their patients.
Value Based Care has many dimensions; one of the key ones is managing clinical variation. Waste, a key component of clinical variation, accounts for roughly 30% of healthcare dollars. That is more than $750 billion per year for the US healthcare system and tens of millions of dollars for a community hospital. The problem is so dimensional, however, that it continues to persist, even grow, in the face of countless initiatives, governmental incentives and penalties.
All care is defined by events. The sheer volume of events associated with any procedure (a lab, a drug, a puncture, rehab, office visits) is immense, and with the advent of EMRs we capture everything.
Next is the sequence of those events. What is the optimal order of events, and how do we manage the transitions between events, where most errors occur?
Then there is the timing of the sequence of events. When should each event in the sequence occur? Long wait times for diagnostic tests and slow follow-up of test results can lead to delayed diagnosis or even disease progression.
Once you have solved that immensely complex problem – you need to solve the challenge of institutional inertia and resistance to change. Machines are partners for doctors, not replacements – but they need to be integrated into the workflow in a way that delivers against that goal.
Finally, we have to manage adherence. Systematically created, timely and detailed reporting to physicians, nurses and pharmacists makes it possible to understand one’s performance compared to peers and change practice to help remove unwanted variation in the delivery of care.
This multi-dimensional problem demands AI. The inherent challenge for Flagler was that they did not have any dedicated data science resources. What they did have was a strong data foundation and outstanding SQL capabilities. Further, they understood their AllScripts implementation deeply.
Flagler had read about Ayasdi’s Clinical Variation Management application in an informatics journal and wanted to test it against their care process model for pneumonia. Their goals were simple:
This is evidence-based medicine at its best – in service of their patients and comprehensive in its perspective.
Flagler chose pneumonia because wide variation in practice had created an area where the hospital felt it was falling short. Flagler’s data set included 1,573 patients who were discharged with pneumonia as the principal diagnosis. The data went back to 2014.
Flagler pulled data from five different systems using SQL (EMR, Surgical, Analytics, EDW, Financials). It took 2,000 lines of SQL and three rounds of semantic and syntactic validations to get it right (this was the bulk of the 8.5 weeks taken for the first carepath – the second took only two weeks). They loaded it into Ayasdi’s CVM application using the Fast Healthcare Interoperability Resources (FHIR) standard. The Ayasdi instance ran on AWS.
Ayasdi’s CVM application created nine potential pneumonia care pathways, each with distinctive elements.
From here, Flagler identified the continuous and categorical variables that distinguished the care pathway groups (again, comprised of patients) and in particular, the “goldilocks” group. This “goldilocks” cohort had remarkable characteristics – if the events, sequence and timing of the suggested care pathway were followed patients enjoyed superior care:
Given Flagler’s admission rate for pneumonia, this represented a potential savings of more than $400K while delivering better care – the foundation of value based care.
The next step for Flagler was to review the findings with the Physician IT Group (called the PIT Crew) and to make the necessary changes to AllScripts. Physician buy-in is critical, as Adherence is a core module of Ayasdi’s Clinical Variation Management system and the ability to measure performance going forward creates data-driven conversations.
There are two interesting anecdotes from this process that bear repeating. The first is that once doctors became aware of the work that was being done, requests for membership in the PIT Crew skyrocketed and attendance at the bi-weekly meetings doubled. Doctors want access to data.
The second is that one of the more accomplished physicians remarked that the care process model for pneumonia was far lighter than what he would have used, but upon looking at the outcomes, readily agreed that it delivered the same or better care in almost every case – and that what he was doing was essentially unnecessary, or wasteful. Presented with the evidence, he committed himself to rethinking his approach.
That was it. In the span of just 8.5 weeks, Flagler had harnessed the power of AI to reshape their business without hiring a single data scientist.
More importantly, Flagler has already completed its sepsis care pathway revision and has, based on the ease of use of Ayasdi’s Clinical Variation Management application, increased its expected care pathway production over the next year by 50%, from eight to twelve, covering everything from heart surgery to childbirth.
This will have a material effect on the bottom line for Flagler – representing tens of millions in potential savings over a three year period while delivering better outcomes for their patient populations.
More importantly, it serves as a template for other small and community hospitals with the ambition and drive to adopt AI, but without dedicated resources. Indeed, the size and centrality of these community hospitals lend themselves to operationalizing AI faster. There is less politics, better communication and fewer competing initiatives.
The press often bemoans AI’s failures in healthcare, but it is the small wins that matter and those are piling up at places like Flagler. This is how AI is impacting healthcare and patient outcomes. What is different is that Flagler just proved that AI can succeed everywhere – not just the giants.
If you have the ambition to deploy AI in service of value-based care (or have a risk sharing contract that demands you get better at managing variation) reach out to us and we can talk about our approach.
The post Innovators are Everywhere: How a Community Hospital Set the Standard for AI appeared first on Ayasdi.
]]>The post Reverse Engineering Value Based Care: Payers Adopting Clinical Variation Management Software appeared first on Ayasdi.
]]>Indeed, most of our clients in the space are Providers and they use our Clinical Variation Management application to develop the optimal way to deliver care – from pneumonia to knee replacement to coronary artery bypass graft (CABG) surgery. The results are profound and the application has won numerous awards for its ability to see the subtle connections that doctors simply cannot. This is not about “eureka” moments (although they occur), this is about incremental, but continuous improvement and, just as importantly adherence. The success of Clinical Variation Management is also a function of the fact that it can deliver care paths against fine grained patient groups. This matters because there is rarely one thing contributing to a health event and the ability to find and study similar patients results in better care paths.
While these attributes contribute to Provider adoption, they also serve as the foundation for Payer adoption. This is where the similarities end, however, for the Payer has different data, different motivations and different incentives. Let’s explore them quickly.
Providers, even highly integrated ones, are focused on the patient first and foremost. This is reflected in the different operating models – non-profit, teaching, faith-based and even for-profit hospitals. Payers are ultimately responsible for their members, but they are, in healthcare, profit making enterprises. They seek to balance the well-being of their membership with the cost of that care. Delivering against that balance leads Payers to view an individual member’s health more holistically, whereas a Provider views it in the moment. While there is plenty of ambition in this area, it is how the U.S. healthcare system works.
Payers have access to longitudinal data. Providers have access to temporal slices of that data. Providers have a deeper picture but Payers have a wider one. As a physician from UCLA told us, “everything that happens in my hospital started somewhere else.” The Payer has a better picture of where and when.
Finally, the Payer’s incentive to deliver an optimal outcome for its members is generally aligned with the Provider’s – although it can be misaligned with the individual Physician’s. This is one reason why Payers are adopting Clinical Variation Management.
Let me explain.
One of the top payers in the United States used our Clinical Variation Management application to determine which doctors were the best at performing certain procedures. Then they directed their members to those doctors. In effect, they “engineered” value based care – but from the other direction.
They started with the procedures, medical administration (all pharmacy), diagnostics and authorization data for close to 8,000 patients who underwent coronary artery bypass graft surgery (CABG). The patient population included slightly more commercial members than Medicare members.
Not included in this work but available for subsequent study was detailed EMR data, detailed inpatient/outpatient data, socioeconomic data, patient experience, referrals, benchmarks and business segment/geographic data.
Using this data in conjunction with Ayasdi’s Clinical Variation Management application, the Payer discerned the evidence-based, optimal care process model for CABG based on several key outcome variables. Those included overall cost, length of stay, readmissions, complications and any ER visit post procedure.
With that optimal care process model identified, the Payer then identified providers with strong compliance to the care process model and validated the association between compliance and good outcomes (LOS, cost, adverse events). Those became “preferred” Providers for CABG. Indeed, most Payers already have this designation. This technology makes it far more efficient and effective.
This naturally leads to “learning opportunities” for those Providers with poor outcomes vs. the standard. It becomes a data-driven conversation at that point to drive changes and to identify specific opportunities for moving underperforming providers toward these superior outcomes.
In this particular case, those learning opportunities came in the form of specific medications, including beta blockers, calcium channel blockers, ACE inhibitors and ARBs. Groups with higher post-operative prescription fill rates for these medications were associated with lower cost and lower LOS. Additionally, cardiac rehabilitation utilization post CABG was associated with lower costs and lower LOS.
Finally, within Providers, certain physicians will perform better against the standard (indeed, their performance is what creates the standard) and the Payer can guide their members toward these doctors – knowing that in doing so they are creating the best potential outcome for their members, their shareholders and the healthcare system as a whole.
While some pioneering payers are already moving in this direction, the fact of the matter is that this has a far larger impact on profitability than optimizing marketing. When coupled with fraud detection, there isn’t a larger opportunity for the bottom line. More importantly, Payers have the data, talent and clinical knowledge to operationalize this at scale.
Further, Payers can also use this same technology to assess the readiness or suitability for value-based care contracts by evaluating the performance of the Provider as well as their variance to the standard of care. As more and more contracts head in this direction, this becomes a highly efficient way to assess Providers.
Finally, Payers can, and are, leveraging CVM in their population health work. As noted elsewhere, the ability to get at fine grained patient groups with targeted care plans is immensely difficult. With Ayasdi’s Population Risk Stratification application working in conjunction with CVM, this becomes a far more manageable task.
The financial impact can be profound. Consider the following scenario. If a CABG procedure runs $60k on average and you take $7.5K out of that across the 400K procedures annually, it produces savings of $3B for the healthcare industry. One would not expect to be able to move the needle with all Providers, but a third is reasonable. This results in $1B in savings. Again, on a single procedure. This cannot be expected to work for all procedures, however, we have yet to encounter a procedure where we don’t see double digit opportunities for improvement.
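The arithmetic behind that scenario, spelled out:

```python
# Back-of-the-envelope math for the CABG scenario above.
avg_cost_per_cabg = 60_000          # average cost of a CABG procedure
savings_per_case  = 7_500           # the $7.5K reduction per procedure
annual_procedures = 400_000         # CABG procedures per year in the US

savings_fraction = savings_per_case / avg_cost_per_cabg   # the "double digit" opportunity
industry_savings = savings_per_case * annual_procedures   # if every Provider moved
realistic_share  = industry_savings // 3                  # if only a third moved

print(f"{savings_fraction:.1%}, ${industry_savings:,}, ${realistic_share:,}")
# prints "12.5%, $3,000,000,000, $1,000,000,000"
```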
Stay tuned to learn more about this work and work with other innovative Payers.
The post Reverse Engineering Value Based Care: Payers Adopting Clinical Variation Management Software appeared first on Ayasdi.
]]>The post Using Topological Data Analysis to Understand the Behavior of Convolutional Neural Networks appeared first on Ayasdi.
]]>Neural networks have demonstrated a great deal of success in the study of various kinds of data, including images, text, time series, and many others. One issue that restricts their applicability, however, is the fact that it is not understood in any kind of detail how they work. A related problem is that there is often a certain kind of overfitting to particular data sets, which results in the possibility of adversarial behavior. For these reasons, it is very desirable to develop methods for developing some understanding of the internal states of the neural networks. Because of the very large number of nodes (or neurons) in the networks, this becomes a problem in data analysis, specifically for unsupervised data analysis.
In this post, we will discuss how topological data analysis can be used to obtain insight into the working of convolutional neural networks (CNN’s). Our examples are exclusively from networks trained on image data sets, but we are confident that topological modeling can just as easily explain the workings of many other convolutional nets. This post describes joint work between Rickard Gabrielsson and myself.
First, a couple of words about neural networks. A neural network is specified by a collection of nodes and directed edges. Some of the nodes are specified as input nodes, others as output nodes, and the rest as internal nodes. The input nodes are the features in a data set. For example, in the case of images, the input nodes would be the pixels in a particular image format. In the case of text analysis, they might be individual words.
Suppose that we are given a data set and a classification problem for it, such as the MNIST data set of images of hand drawn digits, where one is attempting to classify each image as one of the digits 0 through 9. Each node of the network corresponds to a variable which will take different values (called activations) depending on values assigned to the input nodes. So, each data point produces values for every internal and output node in the neural network. The values at each node of the network are determined by a system of numerical weights assigned to each of the edges. The value at a node (in Figure 1, node Z) is determined by a function of the nodes which connect to it by an edge (in Figure 1, nodes A,B,C,D).
The activation at the rightmost node (Figure 1, node Z) is computed as a function of the activations of the four nodes A,B,C, and D, based on the weights assigned to the four edges. One possible such function would be
σ(wAxA + wBxB + wCxC + wDxD)
where wA, wB, wC, and wD are the weights associated to the edges AZ, BZ, CZ, and DZ; xA, xB, xC, and xD are the activations at the nodes A, B, C, and D, respectively; and σ is a fixed function, which typically has its range between 0 and 1 and is typically monotonic. A choice of the weights determines a complex formula for each of the nodes (including the output nodes) in terms of the values at the input nodes. Given a particular output function of the inputs, an optimization procedure is then used to select all the weights in such a way as to best fit the given output function. To each node in the right hand layer one then associates its weight matrix, i.e. the matrix of the weights of all the incoming edges.
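As a concrete illustration, here is that activation computed in numpy, taking the fixed monotonic function to be the logistic sigmoid (the weight and activation values are invented):

```python
import numpy as np

def sigmoid(t):
    """Logistic sigmoid: a monotonic function with range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

weights     = np.array([0.5, -1.2, 0.8, 0.3])   # w_A..w_D (illustrative)
activations = np.array([1.0, 0.5, 0.2, 0.9])    # x_A..x_D from nodes A..D

# The activation at node Z: a weighted sum passed through the fixed function.
z_value = sigmoid(np.dot(weights, activations))
```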
There is a particular class of neural networks that are well adapted to databases of images, called convolutional neural networks. In this case, the input nodes are arranged in a square grid corresponding to the pixel array for the format of the images that comprise the data. The nodes are composed in a collection of layers, so that all edges whose initial node is in the i-th layer have their terminal node in the (i+1)-st layer. A layer is called convolutional if it is made up of a collection of square grids identical to the input layer, and it is understood that the weights at the nodes in each such square grid (a) involve only nodes in the previous layer that are very near to the corresponding node and (b) obey a certain homogeneity condition, so that for each square grid in layer i, the weights attached to a given node are identical to those for another node in the same grid, but translated to its surrounding neurons. Sometimes intermediate layers called pooling layers are introduced between convolutional layers, and in this case the higher convolutional layers are smaller square grids. Here is a schematic picture that summarizes the situation.
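The two conditions, local connectivity (a) and shared weights (b), can be illustrated with a minimal "valid" 2-D convolution; the image and kernel here are invented:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Every output node applies the SAME small weight matrix (condition b)
    to its local neighborhood of the previous layer (condition a)."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
vertical_edge = np.array([[1.0, 0.0, -1.0]] * 3)   # a simple 3x3 edge detector
feature_map = conv2d_valid(image, vertical_edge)
print(feature_map.shape)   # → (3, 3)
```

A single 3×3 weight matrix, translated across the whole grid, produces one of the square grids that make up a convolutional layer; these shared 3×3 weight matrices are exactly the data points studied below.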
In order to discern the underlying behavior of a CNN we need to understand the weight matrices. Consider the dataset where each datapoint is the weight matrix associated with a neuron in the hidden layer. We collect data from all the grids in a fixed layer, and do this over many different trainings of the same network on the same training data. Finally, we perform topological data analysis on the set of weight matrices.
By performing TDA on the weight matrices we can, for the first time, understand the behavior of the CNN, proving, independently, that the CNN faithfully represents the underlying distributions occurring in natural images.
Here is how it is done.
To start with, we need to find useful structure from a topological perspective. To achieve that, we consider only the points of sufficiently high density (computed using a simple proxy for Gaussian density, called codensity, as in the 2008 paper). We begin by looking at the 1st convolutional layer in a two-layer convolutional neural network. It produces the topological model shown in Figure 3.
Note that the model is circular. The barcodes shown on the right are persistent homology barcodes, which are topological shape signatures that show that the data set in fact has this shape, and that it is not an artifact of the model constructed using Mapper. The explanation for the shape is also shown in the image by labeling parts of the model with the mean of the corresponding set of weight matrices. What is very interesting about this model is that it is entirely consistent with what is found in a study of the statistics of 3×3 patches in grayscale natural images, as well as what is found in the so-called primary visual cortex, a component of the visual pathway that connects directly with the retina [1].
Put more simply, the topological model describes the CNN in such a way that we can independently confirm that it matches how humans see the world, and matches with density analysis of natural images.
This analysis in Figure 3 was performed on the MNIST data set. A related analysis, performed on the CIFAR 10 data set, gives the following diagram and persistence barcode.
This comes from the first convolutional layer. This model is consistent with the “three circle model” found in the 2008 paper, which incorporates lines in the middle of regions as well as the edges found in Figure 4. Neurons that are tuned to these line patches also exist in the mammalian primary visual cortex. This provides a quantitative perspective of the qualitative aspects that we have come to associate with vision. There are, however, opportunities to go deeper.
Now that we have proven, using TDA, that CNN’s can mimic the distribution of data sets occurring in natural images, we can turn our attention to the study of what happens over the course of the learning process. Figure 5 below is obtained by computing topological models in the first and second layer of a convolutional neural network on the CIFAR 10 data set used above, and then displaying models for both the first and second layers at various numbers of learning iterations.
We use the coloring of the model to obtain information about what is happening. The coloring reflects the number of data points in a node, so we can consider the red portion as the actual model, where the rest contains weight matrices that appear less often.
The top row reflects the first layer, and one observes that it quickly discovers the circular model mentioned above, after 400 and 500 iterations of the optimization algorithm. What then starts to happen, though, is that the circle devolves instead into a more complex picture, which includes the patches corresponding to horizontal and vertical patches, but now also including a more complex pattern in the center of the model after 1,000 iterations. In the second layer, on the other hand, we see that during the first rounds of the iterations, there is only a weak pattern, but that after 2,000 iterations there appears to be a well-defined circular model. Our hypothesis is that the second layer has “taken over” from the first layer, which has moved to capture more complex patches. This is an area for future potential research.
This demonstrates the capability of using topological data analysis to monitor and provide insight into the learning process of a neural network – something that has been highly elusive until now.
This method also works on deeper networks (i.e. networks including more layers). Deeper networks are organized in a way that resembles the organization of the visual pathway in humans or primates. It is understood that the pathway has a number of components, including the retina, the so-called primary visual cortex or V1, and various higher components. It is thought that the primary visual cortex acts as an edge and line detector, and that the higher components detect more complex shapes, perhaps seen at larger scales. The picture below is the result of a study of the layers in an already trained network, VGG 16, that has been trained on the ImageNet data set referenced by J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. FeiFei. Here we display the 2nd through the 13th convolutional layers, given as topological models.
Notice that the 2nd and 3rd layers are clearly very similar to the circular models found in the model trained on the MNIST data set. At the fourth layer, we have a circular model, but that also includes some lines in the middle of a background. These reflect the “secondary circles” found in the 2008 paper. At the higher levels, though, very interesting patterns are developed that include line crossings and “bulls eyes”, which are not seen in the analysis in the 2008 paper, nor in the primary visual cortex.
What these topological models tell us is that the convolutional neural network is not only mimicking the distribution of real world data sets, but is also able to mimic the development of the mammalian visual cortex.
While CNN’s are notoriously difficult to understand, topological data analysis provides a way to understand, at a macro scale, how computations within a neural network are being performed. While this work was applied to image data sets, topological data analysis can explain the computations of other kinds of neural networks as well.
By providing a compression of a large set of states into a much smaller and more comprehensible model, topological data analysis can be used to understand the behavior and function of a broad range of neural networks.
On a practical basis, this approach may facilitate our understanding of the behavior of (and subsequent debugging of) any number of vision problems from drone flight to self-driving automobiles to matters relating to GDPR. The details of this study will appear in due course.
References
The post Using Topological Data Analysis to Understand the Behavior of Convolutional Neural Networks appeared first on Ayasdi.
]]>