The post A Pretty Good Month appeared first on Ayasdi.
First, the news associated with our successful deployment at Flagler Hospital has reverberated throughout the industry, resulting in press coverage ranging from Healthcare Informatics to SearchHealthIT.
For those who are interested in hearing the story firsthand, Flagler’s CMIO did an entire webinar on the initiative. While the pneumonia results ($1,350 in savings per patient, a 2.5-day LOS reduction, a 7X readmission reduction) are the headlines, the fact that they are achieving these results without a single data scientist underscores the accessibility of the application.
Second, we took the application on the road to the Accenture HealthTech Challenge, where an esteemed panel of over thirty judges from providers to payers to life sciences chose our solution as a global finalist – putting us through to San Francisco, where we will compete against nine other companies from EMEA and APAC. Given that the competition started with over 1,200 entries, we are humbled and flattered. Helping us get to the finals was the fact that Clinical Variation Management is a $2T market opportunity and we have the best product ever made to address it. Interested in the deck we used? We posted it here.
Finally, our Clinical Variation Management (CVM) application just took home Fierce Healthcare’s Innovation Award. The award was in the data analytics/business intelligence category and speaks to the power and usability of the application. In a nice bonus, we also won the Fiercest Cost Savings Award – speaking to our ability to carve billions out of the healthcare system while simultaneously improving patient outcomes.
To top it all off, we were also named one of the top five AI Platform Innovators by IDC. This award, selected by uber-analyst Dave Schubmehl, referenced our unique technology and our application orientation.
This recognition isn’t the effort of marketing – it is the effort of engineers, product managers, data scientists and domain experts. They are the ones crafting the product into something that wins in the market and wins with those in the know. While we do it because we are passionate – a little recognition never hurt….
The post Relationships, Geometry, and Artificial Intelligence appeared first on Ayasdi.
A simple example would be images, where one considers graphs which are shaped like a square grid, with the pixel values being the attribute attached to a particular node in the grid. The authors go on to describe the framework they have developed for computing with graph structured data, in particular describing how neural networks for this kind of computation can be built.
Embedded in this work is the following question:
Where do the graphs come from?
The question recognizes the fact that most data is “sensory” in character, i.e. it is obtained by measuring numerical quantities and recording their values as coordinates of vectors in a high-dimensional space, and consequently does not come equipped with any explicit underlying graph structure.
It follows that in order to achieve human-like intelligence in studying such data, it is necessary to construct graph or relational structures from sensory data, and also that performing such constructions is a critical ingredient in human intelligence.
The goal of this post is to describe a method for performing exactly this task, and to describe how neural networks can be constructed directly from the output of such methods. We can summarize the approach as follows.
The interested reader should consult the papers Topological Data Analysis Based Approaches to Deep Learning and Topological Approaches to Deep Learning, where the constructions are discussed in detail.
We strongly agree with the point of view expressed in the DeepMind/Google paper that their work rejects the false choice between “hand-engineering” and “end-to-end” approaches to learning, and instead argues for the recognition that the approaches are complementary and should be treated as such. It follows that we should be developing a wide variety of tools and methods that enable hand-engineering of neural networks. The methods described in the DeepMind/Google paper are a step in that direction, as are the papers on TDA-based Approaches to Deep Learning and Topological Approaches to Deep Learning.
We also observe that the benefits of such an approach go beyond improving the accuracy and speed of neural net computations. Once the point of view of the DeepMind/Google paper is adopted, one will be able to create algorithms which realize the idea of machine-human interaction. In particular, they will be more transparent, so that one can understand their internal functioning, and also provide more complex “human-readable” outputs, with which humans can interact. This paper on the Exposition and Interpretation of the Topology of Neural Networks constitutes a step in that direction.
Finally, we observe that in order to fully realize this vision, it is important not only to make the computation scheme human readable, but also the input data. Without a better understanding of what is in the data, the improvements described by Battaglia and his colleagues will only get us part of the way to fully actualizing artificial intelligence. Topological methods provide tools for understanding the qualitative and relational aspects of data sets, and should be used in the understanding of the algorithms which analyze them.
Graphs from geometry
In the case of images, the graph structure on the set of attributes (pixels) is that of a square grid, with some diagonal connections included.
Our first observation is that the grid is actually contained in a continuous geometric object, namely the plane. The importance of the plane is that it is equipped with a notion of distance, i.e. a metric in the mathematical sense (see Reisel’s work here). The set of vertices is chosen so as to be well distributed through a square in the plane, so that every point in the square is close to one of the vertices. In addition, the connections in the graph can be inferred from the distance. This is so because if we let r denote the distance from a vertex to its nearest neighbor (this is independent of which vertex we choose, because of the regularity of the layout), then two vertices are connected if their distance is at most √2·r.
So, the graph is determined by:
(a) a choice of subset of the plane and
(b) a notion of distance in the plane
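As a sketch of how (a) and (b) determine the graph, the grid can be rebuilt from nothing but a set of points and the planar metric. The helper below is our own illustration, not code from any of the papers cited here.

```python
import itertools
import math

def geometric_graph(points, threshold):
    """Connect every pair of points whose Euclidean distance is <= threshold."""
    edges = []
    for (i, p), (j, q) in itertools.combinations(enumerate(points), 2):
        if math.dist(p, q) <= threshold:
            edges.append((i, j))
    return edges

# A 3x3 square grid with nearest-neighbor spacing r = 1.
grid = [(x, y) for x in range(3) for y in range(3)]

# A threshold just above sqrt(2)*r admits horizontal, vertical and
# diagonal neighbors -- exactly the image grid with diagonals included.
edges = geometric_graph(grid, math.sqrt(2) * 1.0 + 1e-9)
```

With spacing r = 1, the threshold √2·r picks up horizontal, vertical and diagonal neighbors, recovering the grid graph used for images.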
The value of this observation is that we often have sets of attributes or features that are equipped with metrics. Some interesting examples were studied in this paper, where a study of the statistical behavior of patches in natural images was performed. One of the findings was that there is an important feature that measures the “angular preference” of a patch in the direction of a particular angle θ. As we vary θ, we get a set of features which is geometrically laid out on a circle, which can be discretized and turned into the graph on the left below. A more detailed analysis showed a two-parameter set of features that are laid out in the form of a geometric object called a Klein bottle, shown with associated graph structure on the right below.
Analogous geometries can be found for 3D imaging and video imaging. The work in the Exposition paper has suggested larger spaces containing the Klein bottle which will apply to larger scale features and more abstract objects.
Purely data driven graph constructions
The constructions in the previous section rely on prior knowledge concerning the features, or on a detailed data analytic study of particular sensing technologies. To have a method that is more generally applicable, without complicated data analysis, it is necessary to have methods that can quickly produce a useful graph structure on the set of features. Such a method exists, and is described in both the TDA paper and the Topology paper.
Suppose that we have a data matrix, where the rows are the samples or observations, and the columns are features or attributes. Then the columns are vectors, and there are various notions of distance one might equip them with, including high-dimensional Euclidean distance (perhaps applied to mean centered and normalized columns), correlation, angle, or Hamming distance if the values are Boolean. There are typically many options. By choosing a distance threshold, we can build a graph whose nodes are the individual features. If the threshold is small, the graph will be sparser than if we choose a larger threshold. So, at this point we have a method for constructing a graph from sensory data. Further, the graph describes a similarity relation between the features, in the sense that if two features are connected by an edge, we know that their distance (regarded as a dissimilarity measure) is less than some fixed threshold, and therefore that they are similar to each other.
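A minimal sketch of this construction on the columns of a synthetic data matrix, using correlation distance; the function name and threshold are our choices, not a fixed API:

```python
import numpy as np

def feature_graph(X, threshold):
    """Build a graph on the columns (features) of a data matrix X.

    Two features are joined by an edge when their correlation distance
    1 - corr(f_i, f_j) falls below the chosen threshold.
    """
    corr = np.corrcoef(X, rowvar=False)   # feature-by-feature correlation
    dist = 1.0 - corr                     # dissimilarity in [0, 2]
    n = dist.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if dist[i, j] < threshold]

rng = np.random.default_rng(0)
base = rng.normal(size=100)
X = np.column_stack([base,
                     base + 0.1 * rng.normal(size=100),  # similar to base
                     rng.normal(size=100)])              # independent
edges = feature_graph(X, threshold=0.5)
```

Here features 0 and 1 are nearly identical, so they end up connected, while the independent third feature stays isolated; a larger threshold would, as described above, produce a denser graph.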
This construction certainly makes sense and generalizes the construction that is done for images. However, it produces graphs which are large in the sense that they have the same number of nodes as we have features. They may also be dense in the sense that some nodes may have a very large number of connections, which is not ideal from the point of view of computation. We will address this problem via a form of compression of the graph structure.
To understand this construction, we will discuss another aspect of the deep learning pipeline for images, namely the pooling layers. We describe it as follows.
The “pooling” refers to pooling values of subsets of pixels, specifically smaller squares. The picture below describes the situation, where we have covered the pixel grid by nine smaller subgrids, each one two by two.
The idea of pooling is to create a new graph, this time three by three, with one node for each of the subgrids. The interesting point is that the creation of the new graph can be described in terms of the intersections of the subgrids only. Specifically, the nodes corresponding to two subgrids are connected by an edge if and only if the subgrids have at least one pixel in common. The value of this observation is that this criterion generalizes to the general situation of a geometry on the space of features, provided that we can produce a covering analogous to the covering of this square by subgrids. We refer to the collection of subgrids as a covering of the set of pixels, where by a covering of a set X we mean a family U of subsets of X such that every element of X is contained in some member of U.
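The pooling criterion is a nerve construction: one node per cover set, one edge per non-empty intersection. As a sketch (our own, since the post’s figure is not reproduced here), assume a 4×4 pixel grid covered by nine overlapping 2×2 subgrids at stride 1:

```python
import itertools

def nerve(cover):
    """One node per cover set; an edge whenever two sets share an element."""
    edges = []
    for (i, a), (j, b) in itertools.combinations(enumerate(cover), 2):
        if set(a) & set(b):
            edges.append((i, j))
    return edges

def subgrid(x, y):
    """The 2x2 block of pixels whose top-left corner is (x, y)."""
    return [(x + dx, y + dy) for dx in (0, 1) for dy in (0, 1)]

# Nine overlapping 2x2 subgrids covering a 4x4 pixel grid.
cover = [subgrid(x, y) for x in range(3) for y in range(3)]
pooled_edges = nerve(cover)
```

The nine subgrids become a three-by-three pooled graph, with edges exactly where neighboring subgrids share a pixel.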
Topological data analysis provides a systematic way of producing such coverings, using a method referred to as Mapper, which is described in the TDA and Topology papers.
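A stripped-down sketch of the covering step in the spirit of Mapper: cover the range of a filter function by overlapping intervals and pull the cover back to the data. Full Mapper also splits each piece into clusters before forming the graph; we omit that step, and all names here are our own.

```python
import numpy as np

def interval_cover(values, n_intervals=4, overlap=0.3):
    """Cover the filter range by overlapping intervals; return, per
    interval, the set of data indices whose filter value lands in it."""
    lo, hi = float(values.min()), float(values.max())
    length = (hi - lo) / n_intervals
    pad = length * overlap
    cover = []
    for k in range(n_intervals):
        a = lo + k * length - pad
        b = lo + (k + 1) * length + pad
        members = set(np.where((values >= a) & (values <= b))[0].tolist())
        if members:
            cover.append(members)
    return cover

# Filter function: here simply the data values themselves.
values = np.linspace(0.0, 1.0, 101)
cover = interval_cover(values)

# The graph on the cover: one node per interval, an edge per overlap.
graph = [(i, j) for i in range(len(cover)) for j in range(i + 1, len(cover))
         if cover[i] & cover[j]]
```

On this one-dimensional filter the overlapping intervals produce a path graph, with each interval connected to its neighbors.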
The output of this construction will be a family of graphs Γ1, …, Γr with the following properties.
The construction yields a systematic way of producing graph representations of the sensory data available from the data matrix. It turns out that one can go directly from this representation to neural network architectures.
In fact, one could even imagine incrementally producing coverings using Mapper on the activation-values of the layers following the input layer. Occasionally recomputing the coverings and therefore the graph structure would allow the network to adapt its graph structure during training.
Constructing neural networks
The geometric constructions given above lead to some very simple neural network architectures adapted to them. For example, in the case of the circular geometry described above, a picture of such a construction would look as follows.
Similar constructions are available for the Klein bottle geometries and others.
The construction of neural networks from the purely data driven constructions above is also quite simple. Since each node of each graph corresponds to a set of features from the data matrix, we can declare that a node v in Γr is connected by a directed edge to a node w in Γr+1 if and only if the collection corresponding to v and the collection corresponding to w contain at least one feature in common. This describes a connection pattern for a simple feed-forward network with the nodes of Γr as the set of neurons in the r-th layer. A picture of a simple version of this construction is given below.
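This wiring rule is easy to sketch once each layer graph is given as a map from node to the set of features it represents (toy data, our own names):

```python
def layer_connections(layer_r, layer_r_plus_1):
    """Directed edges from layer r to layer r+1: node v feeds node w
    whenever their feature collections share at least one feature."""
    return [(v, w) for v, fv in layer_r.items()
                   for w, fw in layer_r_plus_1.items() if fv & fw]

# Toy layers: each node is labelled by the set of features it represents.
gamma_r = {"v1": {"f1", "f2"}, "v2": {"f2", "f3"}}
gamma_r1 = {"w1": {"f1"}, "w2": {"f3", "f4"}}

connections = layer_connections(gamma_r, gamma_r1)
```

The resulting edge list is exactly the sparse connection pattern of the feed-forward network, one layer per graph.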
Summary
We have given an approach to assigning graph theoretic information to sensory or continuous data. It depends on the assignment of geometric information in the form of a distance function on the set of features of a data set. There are many constructions based on the mathematical discipline of topology (the study of shape) that should be useful for this task. The methodology should yield the following:
The post The Accenture HealthTech Innovation Challenge and the $812B Opportunity appeared first on Ayasdi.
While we were humbled to have made the cut for the Boston round, where we competed against an extraordinary group of companies, we were even more humbled to be selected as a finalist for San Francisco, where we will go up against our fellow finalists from Boston plus six teams from the Tokyo, Sydney and Dublin rounds.
One of the reasons we made it to the final is that we are attacking a massive problem in healthcare – clinical variation management. When we say massive, we really mean massive – $812B annually in the US alone. Add another $1T for the rest of the world (using 5% of GDP for healthcare vs the 18% in the US). That’s closing in on $2 trillion per year for labs, tests, diagnostics, medications and other care that didn’t improve patient outcomes – and in many cases diminished them.
Managing clinical variation is notoriously complex. How else could an $812B-a-year problem persist for the better part of three decades?
That’s right, we have been working the clinical variation problem for almost 30 years – ever since the AMA first started to aggressively advocate for evidence-based care guidelines. The effectiveness of evidence-based care is not in question; it has been proven out in hundreds of peer-reviewed studies. Better outcomes, lower costs. The pillars of the value-based care movement.
So if we know how to do it, why aren’t we doing it?
The answer lies in scale. The cost, time and effort to produce a care process model manually is extremely high (which is why the refresh cycle is around 4 to 7 years!). Furthermore, the acceptance of those care process models (what we call adherence) is quite low, diminishing further the utility of the effort.
Let’s examine both of these separately:
The Complexity of the Problem
Any healthcare episode can be broken down into component parts. How granular you go is a function of what you are looking for. For the purposes of this post – let’s keep it at a high level:
When you combine the three it creates a picture of extraordinary complexity where the events, the sequence of those events and the timing of the sequence of events all conspire to confuse even the most competent clinician.
The result is that providers (and payers, as seen here) generally develop something rather general – something high level, often using national literature as a guide.
The Challenge of Acceptance and its Impact on Adherence
Care path adherence is a particularly thorny problem. Doctors own the patient relationship. The care path doesn’t own the patient relationship, nor does the machine. Doctors know what is best for their individual patients. They understand the exceptions, the co-morbidities, the family history, the patient’s preferences.
As a result, they often deviate from the care process model. Sometimes the deviation is warranted, and at times even innovative. Often, however, it is not warranted and is the result of habit or, in rare cases, financial gain.
Only when a care path can fulfill all of the dismissal reasons (plus some others that I have likely missed) will it get accepted.
The Opportunity: Building Acceptable Care Pathways at Scale
To create acceptable care process models at scale requires a few key elements:
The product we demonstrated in Boston delivers against much of what is outlined here – compressing man-years into hours and uncovering good variation (innovation) alongside bad variation. More importantly, the product has every bell and whistle packed into a single application interface that was designed to be used by doctors, not data scientists (although they love it too).
Achieving the elements outlined above can deliver exceptional results. At Flagler Hospital in St. Augustine, Florida, they applied our software to pneumonia, resulting in a care path that saved them $1,350 per patient ($850K per year), reduced LOS by 2.5 days and reduced readmissions by 7X. The result was a 22X ROI for the hospital. With a care pathway per month over the next 18 months, they expect to save over $20M while simultaneously improving the quality of care they deliver to their patients.
It is an amazing story that underscores the size of the opportunity. If a 335-bed hospital can save $20M, what can NewYork-Presbyterian do?
So wish us luck as we prep for the SF round. They say it is an honor to be nominated, but we would like to win – and take advantage of the platform it provides to make healthcare better, across the board.
The post Why Prediction Needs Unsupervised Learning appeared first on Ayasdi.
Even if they could deliver on a crystal ball, such a capability would obviously have enormous consequences for all aspects of human existence. In truth, even small steps in this direction have major implications for society at large, as well as for specific industries and other entities.
Chief among the methods for prediction is regression (linear and logistic), often highly tuned to the specific environment. There are dozens of others, however, including deep learning, support vector machines and decision trees.
Many of these methods have as their output the estimate of a particular quantity, such as a stock price, the outcome of an athletic contest or an election, either in binary terms (who wins and who loses) or in numerical terms (point spread). Sometimes the output even comes with an estimate of the uncertainty, for example in political polling where margin of error is often quoted. The existing methods clearly do have value, but this post concerns itself with two claims:
Let’s start with the first premise: that prediction may not be the right technique in certain circumstances.
The output of a prediction methodology is often used as input to an optimization algorithm. For example, predictions of stock prices are often used in portfolio optimization, prediction of outcomes of athletic contests are used in optimizing a betting portfolio, and results of election polling used to determine the choices made by political parties in supporting various candidates. There are many powerful methods for optimization, and applying the predictive information obtained by these methods is often a straightforward process. However, in order to use them, it is required that one starts with a well-defined objective function to which one can apply optimization methods in order to maximize or minimize it.
Herein lies the challenge.
In many situations it is unclear what defines a good objective function. Further, our attempts to create a proxy for an ill-defined objective function often make matters worse (for example, sometimes we choose a function that is easy to define and compute with, but that doesn’t make it good, it just makes it easy). Let’s consider a few examples:
In all these situations, the notion of prediction becomes significantly more difficult when the role of the objective function is unclear or debatable.
Because we are not in a position to define an objective function, we cannot rely exclusively on optimization techniques to solve the problems that face us. This means we need to take a step back and understand our data better so that we might guide ourselves to better outcomes.
This is where unsupervised learning comes in.
Unsupervised learning allows us to explore various objective functions, find methods that permit the adaptation of an objective function to individual cases, and search for potentially catastrophic outcomes.
These tasks cannot be resolved by methods whose outputs are a single number. They instead require an output with which one can interact, one equipped with more information and functionality than simple binary or numeric outputs. It is also vitally important that the methods we use let the data do the talking, because otherwise our natural inclinations are to verify what we already believe to be true, to favor an objective function we have decided on, or to decide that we already know what all the relevant factors in an analysis are.
This is, in many ways, the underappreciated beauty in unsupervised approaches.
OK, what we have established so far is that:
So what are the capabilities required of such unsupervised methods?
When we can deliver against these challenges, the performance of the prediction improves materially.
For example, we are working with a major manufacturer on a problem of particular value to them. They make very complicated widgets and failure rates are paramount. Every .01 percent change has tens of millions of dollars in bottom line impact. Their prediction capabilities had tapped out and no longer yielded valuable information. By employing unsupervised approaches, the company determined a key piece of information, namely that there were many different kinds of failure and those failures were not distributed evenly across their factories. This allowed them to take immediate steps to improve their failure rates.
So what are some techniques to create this one/two punch of unsupervised learning and supervised prediction?
There are many notions of unsupervised analysis, including clustering, principal component analysis, multidimensional scaling, and the topological data analysis (TDA) graph models.
The TDA models have by far the richest functionality and are, unsurprisingly, what we use in our work. They include all the capabilities described above. TDA begins with a similarity measure on a data set X, and then constructs a graph for X which acts as a similarity map or similarity model for it. Each node in the graph corresponds to a sub-collection of X. Pairs of points which lie in the same node or in adjacent nodes are more similar to each other than pairs which lie in nodes far removed from each other in the graph structure. The graphical model can of course be visualized, but it has a great deal of other functionality.
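As a toy sketch of that extra functionality, suppose the model is stored as node adjacencies plus a node-to-points membership map (representations of our own choosing, not a fixed API). Then “points similar to those in a node” becomes a direct lookup:

```python
def similar_points(adjacency, membership, node):
    """Points in the given node or any adjacent node: by construction
    these are the points most similar to the ones the node contains."""
    nodes = {node} | adjacency.get(node, set())
    out = set()
    for n in nodes:
        out |= membership[n]
    return out

# A toy similarity model: a path of three nodes A - B - C.
adjacency = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
membership = {"A": {1, 2}, "B": {3}, "C": {4, 5}}
```

Points 1 and 2 are similar to point 3 (adjacent nodes), while points 4 and 5, far away in the graph, are not returned for node A.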
To summarize, it is fundamentally important to move from the idea of pure numerical or binary prediction to methods that permit one to have a better understanding of one’s underlying data, in order to design better objective functions and capture small and weak phenomena that may eventually become very significant. TDA provides a methodology that is very well suited to these requirements.
The post The Winds of Change at ACAMS 2018 appeared first on Ayasdi.
Leading the group is our VP and Global Head of Financial Services, Doug Stevenson. Doug is a certified anti-money laundering specialist (CAMS) and certified financial crimes investigator (FCI) who joined Ayasdi from Pitney Bowes, where he served as the General Manager of their Global Financial Crimes and Compliance business unit.
Doug has assembled an amazing team of fellow financial crimes specialists.
Those include David Brooks who formerly ran the N.A. Financial Crimes Practice for Capgemini. His rich experience includes leadership roles at NICE Actimize, Fortent, and Mantas.
The team also includes Chief Architect, Sridhar Govindarajan who, like Doug, hails from the successful Pitney Bowes team. Prior to Pitney Bowes, Sridhar worked in AML and architecture roles at Citi, Deutsche Bank, and Credit Suisse.
The data science lead for financial crimes is also a recent addition. Lei “Ray” Mi comes to Ayasdi from HSBC where he was the Vice President for AML Transaction Monitoring and Global Risk Analytics. In that role, Ray pioneered the use of machine learning and AI for HSBC’s AML function and was a key participant in the creation of the bank’s center of excellence in AI and Machine Learning.
When you have this much talent, you generate a lot of great thinking. That was on full display when we got together as a team to talk about this year’s event. Here are the takeaways on the ACAMS Vegas show.
1. Artificial Intelligence is Going Mainstream
Though many vendors are touting the use of Artificial Intelligence (AI) in their AML offerings, production deployment remains the litmus test of acceptance. Compared to previous years, we found a far greater appetite for AI, with far more financial institutions (FIs) actively experimenting with it in their AML programs. Though the number of FIs actually moving from AI experimentation to production is still modest, we expect this number to grow considerably in the coming quarters.
This growth is a function of two drivers.
First, FIs are becoming more sophisticated and are actively sharing information and know-how within the community – which is to be expected in a community that doesn’t really “compete” the way other lines of business do. This manifests itself as a more informed AML buyer. In the past, discussions would revolve mainly around high-level subjects, such as the definition of AI and the differences between supervised and unsupervised learning. This year, however, AML experts wanted to dive deeper and learn more about specific ways in which AI was solving real problems in AML programs. This only gives us more confidence that AI adoption in AML will begin to boom in the coming years.
Second, the vendor community is becoming more adept at building deployable solutions – applications vs. use cases. This allows these POCs to move from the innovation area to the business far faster than before.
Smarter buyer + better product = more deployments of intelligence in financial crimes.
2. Segmentation as a Lens into Behavior
Over the last two years, Ayasdi has led the dialogue in the AML community, shifting it from tackling money laundering with a traditional (and increasingly ineffective) rules-based approach toward a more effective (and efficient) customer behavior-based approach.
This message is gaining widespread adoption amongst both FIs and vendors – with Verafin’s panel correctly stating, “setting a rule doesn’t indicate a behavior.” Foundational to any strong AML program is a customer’s behavior – not a static risk rating or rules-based segmentation.
Much of the work we are doing in this space revolves around collaborating with the FI to determine a) the right data, b) the right features and c) the resulting customer segments that accurately reflect behavior.
As we do this work, it is clear to the FIs (and the regulator) that accurate segmentation is critical to a successful AML program. It should also be noted that while accurate segmentation is imperative, if you can’t explain it, it borders on useless. This is one of the knocks on AI in AML and another area where Ayasdi has been a pioneer.
3. Change in Behavior is Key
One AML/BSA officer we spoke with at ACAMS looked at us and said, “Wow! That is the view I’ve always wanted to have!” What we had shown him was how Ayasdi’s AML offering could take a customer behavioral segmentation and add the variable of time, to see how customers’ behaviors changed over a certain period.
This is very important and will define the conversation in the coming years.
As noted, an accurate customer segmentation provides a lens into customer behavior. How that behavior changes over time is even more valuable – and not just to the AML team. If a customer moves from one group to another – there is information encoded into that move. Did the customer become riskier to the FI? Did they change how they used certain products and with what frequency? These are powerful insights that have eluded the industry – primarily because of the complexity involved in the challenge.
The implications are profound for the $80M-a-year KYC refresh component. Change in behavior can make this more effective and efficient – eliminating diligence on those customers whose behavior is the same while initiating diligence on those whose behavior has changed in a material fashion.
These are exciting times for the financial crimes community. The technology has never been better, the talent level never higher.
We see progress everywhere – with our clients, our partners like Navigant, the talent joining our team. We look forward to our ACAMS webinar in January where we can expand on the subject of change in behavior.
The post AI in Regulated Industries: The No-Nonsense Guide appeared first on Ayasdi.
While it still needs to grow into its lofty expectations (particularly in the enterprise), there is a growing body of evidence that the small wins are stacking up. One area where AI is particularly challenged is regulated industries. Regulated industries have the same complexity as other industries, with the added dimension of requiring human-explainable models. This is not to suggest the models are simple; they can be complex. But they need extreme transparency and explainability (something we call justification) for the regulator to sign off on their use.
Ayasdi is a pioneer in justifiable AI, and our underlying technology supports it at an atomic level (the machine did this because of that… for any action). It is why the team at Basis Technology asked us to contribute to their handbook on integrating AI in highly regulated industries. The report covers AI from a macro perspective, the regulator’s perspective and the practitioner’s perspective, and is eminently consumable at just 57 pages. So hop over, fill out the form and thank us later….
The post The Double Edged Sword of Deep Learning + The Importance of Breakthrough Science appeared first on Ayasdi.
First off, deep learning is a boon to the field. Powered by algorithmic progress from academics, systems innovations from the likes of Nvidia, storage from the cloud providers and broad, open-source libraries of algorithms, the performance of these frameworks has grown alongside the interest in the field.
But, like the graph, things have plateaued for deep learning. The fact of the matter is that DL is not a fit for many tasks, yet like the proverbial hammer, everything looks like a nail when you are wedded to that approach.
Deep learning struggles with many of the data challenges we find in enterprises today. Perhaps most important is the fact that the majority of data in the enterprise is unlabeled, which makes it harder to assemble the massive training sets needed to train performant deep learning models.
What Demis points out is that while DL is a useful tool, it is but one avenue, one pathway – there are many others. He notes that “Deep learning is an amazing technology and hugely useful in itself, but in my opinion it’s definitely not enough to solve AI, [not] by a long shot. I would regard it as one component, maybe with another dozen or half-a-dozen breakthroughs we’re going to need like that. There’s a lot more innovations that’s required.”
We could not agree more. We think that in order to be successful with AI, you need a range of capabilities – from the unsupervised to the semi-supervised and ultimately to the supervised.
In fact, Gunnar Carlsson’s work in the deep learning space is proving this out – that to generate material improvements in deep learning, it is very useful to apply techniques like topological data analysis as a starting point.
The net of it is that deep learning is not the alpha and omega of AI. It is merely one letter – you need other concepts to truly change the world.
Speaking of changing the world, Hassabis also talks about the promise of AI to solve some of the world’s great challenges. He’s right. AI needs to be judged against its ability to solve the real problems that face mankind – from healthcare to climate change.
“This is what I’m really excited about and I think what we’re going to see over the next 10 years is some really huge, what I would call Nobel Prize-winning breakthroughs in some of these areas.”
He goes on to note that DeepMind is looking at protein folding and quantum chemistry, among other areas. We agree wholeheartedly.
Our work to date has produced over 285 peer-reviewed publications in places like Nature, Science, and Cell. Our unique brand of AI (topological data analysis) has produced breakthroughs in diabetes, asthma, cancer, disease states, and traumatic brain injury. This is where AI interacts with the real world – with powerful effects. Like Hassabis, we are incredibly optimistic about what the future holds with regard to AI. If you want to learn more about why – just ask us and we will find some time.
The post Geometric Methods for Constructing Efficient Neural Net Architectures appeared first on Ayasdi.
In this post, we will discuss how to take much more general data types and adapt the architecture to them. In fact, we will demonstrate cases where we can construct an architecture for a single data set – a custom neural network. In order to do this, we must first determine a way to think about the convolutional construction and understand how it can generalize.
We recall that for images, a convolutional neural net is composed of layers of neurons, in this case thought of as pixels, each of which is laid out as a collection of regular square grids. While the grid size can vary, typically the number of points in the grid becomes smaller for deeper layers. Often, however, the convolutional neural net has consecutive layers in which the grids are of identical size. In this case, the connections are local in the sense that neurons are connected to nearby neurons. For example, one might take the 3×3 grid patch with the original neuron in the center, and have the connection pattern below repeated around the whole grid.
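To make the savings from locality concrete, here is a small illustrative sketch (pure Python, not from the original post) comparing the number of interlayer connections for a fully connected pair of 8×8 grid layers against the local 3×3 pattern just described:

```python
# Illustrative sketch: count the connections between two identical n x n grid
# layers, fully connected vs. the local 3x3 pattern, where each neuron
# connects only to neurons whose grid position differs by at most one in each
# coordinate.

def count_connections(n, local=True):
    """Number of interlayer connections between two n x n grid layers."""
    total = 0
    for r in range(n):
        for c in range(n):
            for r2 in range(n):
                for c2 in range(n):
                    if not local or (abs(r - r2) <= 1 and abs(c - c2) <= 1):
                        total += 1
    return total

full = count_connections(8, local=False)   # 64 * 64 = 4096 connections
local = count_connections(8, local=True)   # at most 9 per neuron: 484 here
```

Locality cuts the 4,096 connections down to 484 – almost 10x fewer – and the ratio only improves as the grids grow.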
In a convolutional net for time series, the layers of neurons would be arranged in linear segments instead. Again, the interlayer connections between neurons are specified by the requirement that a neuron in a given layer is connected with neurons in the next higher layer that lie in a small segment around the placement of the given neuron. In this case the picture is like this.
In the case of images, the features defining a data set of images consist of pixels, each assigned a grayscale numerical value. The identification of pixels with points in the plane equips the pixels (i.e. the set of features) with a geometric structure. In particular, the pixels acquire a notion of distance between them, and a pixel is connected only to the pixels in the following layer that lie close to it in the grid.
This means that we have removed a large number of connections from the fully connected structure based on a geometric principle, which we will refer to as locality.
The time series situation is similar – in this case, time points in one layer connect to time points in the next layer if and only if they are within a fixed distance of each other.
Why does this kind of pruning of connections make sense?
Let's consider the case of a data set of time series, and suppose that each time series consists of 10^4 consecutive time points and measurements at those time points – for example, stock ticker data. Suppose further that we are trying to train a predictor for estimating the value of the measurement at time t_{i+1} from the measurements at all earlier time points. That would mean that we would be attempting to search over features that include all possible combinations of the 10^4 variables. Even if we only included pairs of time measurements, this would already enlarge the search space to 10^8 combinations. If we are searching through all combinations, then the search space is of size 2^{10,000}, obviously a very large number.
By pruning all connections between time points that are more than one unit apart, we restrict the way in which the neural net can modify its formulas to those combinations of features that consist only of sets of three consecutive time points. This means that we are searching through a space consisting of only 10^4 combinations, one for each center of an interval of three time points.
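The arithmetic above is easy to check, and the pruned connectivity itself is just a banded mask; a minimal sketch (illustrative only, not Ayasdi code):

```python
# Search-space arithmetic for a time series with T = 10^4 time points.
T = 10 ** 4
pairs_full = T * T        # all pairs of time points: 10^8 combinations
windows = T               # locality: one window of three per center point

# Banded connectivity: neuron i connects to neuron j in the next layer
# if and only if |i - j| <= 1 (three consecutive time points).
def banded_mask(n, width=1):
    return [[abs(i - j) <= width for j in range(n)] for i in range(n)]

mask = banded_mask(5)
connections = sum(sum(row) for row in mask)  # 13 of the 25 possible
```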
This is an important efficiency.
Furthermore, it can easily be justified, since we believe that it is much more likely that features that are nearby (in time) will be relevant to the prediction task at hand than time points that are far removed in time.
A similar argument works for images, where the geometrically local structure encodes the idea that the features we are looking for to distinguish images are local, i.e. are concentrated in a small part of the image. Pixels that are far removed from each other are unlikely to produce combined features that are meaningful. We will refer to this condition imposed on connections as locality, a notion that makes sense in both of the two situations in question.
Naturally, there is great utility in extending this idea to new situations, beyond just time series and images. This means that we want our feature space to be equipped with a notion of distance, which can be encoded in the idea of an abstract metric on a set X. A metric on X is a function that assigns a non-negative real value d(x, y) to every pair of points (x, y), with x and y members of X, subject to three properties: d(x, y) = 0 if and only if x = y; d(x, y) = d(y, x) for all x and y; and d(x, z) ≤ d(x, y) + d(y, z) for all x, y, and z (the triangle inequality).
These conditions abstract three important properties of the ordinary distance function on the line, in the plane, or in space. In particular, in the case of time series and images, the data sets are equipped with an a priori metric, since they are subsets of the line and the plane, respectively. To see another example, consider the circle, where we can assign to two different angles the arc length of the shortest arc between them, as points on the circle. This quantity can easily be seen to be a metric on the circle. Let's now recall the features f_{θ} that were introduced in our previous post, one for each angular value on the circle. If we choose a discrete set of points on the circle, it can be equipped with a metric by restricting the distance on the circle. The points could be chosen to be regularly spaced, as in the diagram below.
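The arc-length metric on the circle, and its restriction to 16 regularly spaced points, can be written down in a few lines (a sketch for concreteness, not Ayasdi code):

```python
import math

def circle_distance(theta1, theta2):
    """Arc length of the shortest arc between two angles on the unit circle."""
    diff = abs(theta1 - theta2) % (2 * math.pi)
    return min(diff, 2 * math.pi - diff)

# A discrete, regularly spaced set of 16 points inherits the metric
# by restriction, as in the diagram.
points = [2 * math.pi * k / 16 for k in range(16)]
```

It is straightforward to check that this function satisfies the three metric properties above.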
We now have a collection of 16 features, equipped with a distance function, just as we had distance functions on the one-dimensional lattice in the time series case and the two-dimensional lattice in the case of images.
The key difference is that these features were not raw data from an image data set, but were derived features that were found to be of importance via the work in our earlier blog post. Nevertheless, it is possible to build a feed forward neural net with neurons in each layer in correspondence with the 16 points on the circle, and where connections from one such layer to the following one are local, i.e. a neuron in a layer is connected to a neuron in the following layer if and only if they are close, say adjacent to each other, in the circular geometry.
Using networks built in this way, but in combination with the grids coming from images, we are exploring how this kind of architecture (again, choices of a directed graph in the mathematical sense) can speed up the learning process in image convolutional neural nets. There are many such geometries that can be used to build neural net structures adapted to various kinds of data.
We have described how to use geometry to specify connections when one has two adjacent layers that are identical. However, an important construction in the image (and time series) examples is the pooling process, which has a grid in one layer and a smaller grid in the following layer. In the time series case, it looks like this.
Each neuron in the right layer is connected to two neurons in the left layer. A similar thing is done with images, having the effect of forming a new image at a lower resolution. A variant of this pooling procedure is available for the circular geometries as well. Here is a picture describing it.
The previous example shows how one can incorporate knowledge about features spaces into the construction of neural nets, and that knowledge can be a priori knowledge coming directly from the raw data (as in the case of images or time series) or it can be knowledge derived from research about the features, as in the case of the circular geometry on derived features from images.
But what about a situation in which one does not have research leading to a simple model of geometry on a space of features? There is still something one can do. The important idea is to use the methods described in our earlier blog post. Suppose that we have a data matrix D, whose rows are the data points and whose columns are the features. Using such a matrix, one can put a metric on the set of rows in a number of ways (higher-dimensional versions of the Euclidean metric, Hamming metrics, etc.), and doing this turns out to be extremely useful.
It is the basis for Ayasdi’s software, where topological models of data sets are constructed using a metric on the data set. What was observed in our previous post, though, is that it is equally possible to put a metric on the columns of D, i.e. the features. By performing this simple transformation, we have a metric on the feature space, as we did in the cases described above.
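As a concrete sketch (hypothetical data, not Ayasdi's implementation), a Euclidean metric on the features of a data matrix D is just the Euclidean distance between its columns:

```python
import math

D = [
    [1.0, 2.0, 1.1],
    [0.0, 5.0, 0.2],
    [3.0, 1.0, 2.9],
]  # 3 data points (rows) x 3 features (columns)

def column_distance(D, j, k):
    """Euclidean distance between feature columns j and k of D."""
    return math.sqrt(sum((row[j] - row[k]) ** 2 for row in D))

# Columns 0 and 2 are nearly identical features, so they are close,
# while columns 0 and 1 are far apart in feature space.
near = column_distance(D, 0, 2)
far = column_distance(D, 0, 1)
```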
One could use this geometry directly, but often there are many more features than the number of neurons one would prefer to have. What one can do is create an Ayasdi topological model of the set of columns, using the distance function. In this model, each node is a collection of features, and we can assign to it a new feature that is the average value of all of the features that correspond to the node. We now have a graph structure in which each node corresponds to a feature.
We can use this to create a feed forward network, with each layer having neurons in one-to-one correspondence with the set of nodes in the topological model, and where we assign an edge from the neuron in one layer corresponding to a given vertex of the topological model to the neuron in the following layer corresponding to another vertex if and only if those two vertices form an edge in the topological model. This creates a feed forward network with identical layers. There is also a way to construct pooling layers in this context, based on varying the resolution in the topological model.
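One way to picture such a graph-constrained layer is as a weight matrix whose entries are masked to zero wherever the topological model has no edge. A minimal sketch (hypothetical names, not the Ayasdi API):

```python
# Illustrative sketch: turn a graph (here a 4-node cycle, standing in for a
# topological model) into a sparse layer. Weights exist only on the diagonal
# (a neuron's own counterpart in the next layer) and where the graph has an
# edge; everything else is masked to zero.

import random

edges = {(0, 1), (1, 2), (2, 3), (3, 0)}   # a 4-node cycle graph
n = 4

def masked_weights(n, edges, seed=0):
    rng = random.Random(seed)
    return [
        [rng.uniform(-1, 1) if i == j or (i, j) in edges or (j, i) in edges
         else 0.0
         for j in range(n)]
        for i in range(n)
    ]

W = masked_weights(n, edges)
nonzero = sum(1 for row in W for w in row if w != 0.0)   # 4 self + 8 edge
```

Whether to include the diagonal self-connections is a design choice; the point is that the graph, not a dense grid, dictates which weights exist.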
The activation and weight information in this net gives rise to derived features for the data set, as well as solving a classification or estimation problem on the data set. One important point in using the topological model for the feature space is that it forces the neural network to deal with the results of groups of raw features, and therefore will not be able to overfit on a single raw feature. It should mitigate the overfitting that often occurs in this technology.
What we described above is a methodology that facilitates the construction of feed forward deep learning architectures adapted to classes of data sets, such as images, time series, etc.
Having said that, this methodology also facilitates the construction of bespoke architectures for one-off data sets that fall outside the classes outlined above. As such, we have a unified approach that covers the vast majority of potential data sets.
The post Mathematical Acceleration: Incorporating Prior Information to Make Neural Nets Learn 3.5X Faster appeared first on Ayasdi.
By eliminating the need to relearn these features, one can increase the performance (speed and accuracy) of a neural network.
In this post, we describe how these ideas are used to speed up the learning process on image data sets while simultaneously improving accuracy. This work is a continuation of our collaboration with Rickard Brüel Gabrielsson.
As we recall from our first and second installments, we can discover how convolutional neural nets learn and use that information to improve performance, while allowing those techniques to generalize between data sets of the same type.
These findings drew from a previous paper where we analyzed a high density set of 3×3 image patches in a large database of images. That paper identified the patches visible in Figure 1 below and allowed them to be described algebraically. This permits the creation of features (for more detail, see the technical note at the end). What was found in the paper was that at a given density threshold (e.g. the top 10% densest patches) those patches were organized around a circle. The organization is summarized by the following image.
While the initial finding was via the study of a data set of patches, we have confirmed the finding through the study of a space of weight vectors in a neural network trained on data sets of images of hand-drawn digits. Each point on the circle is specified by an angle θ, and there is a corresponding patch P_{θ}.
What this allows us to do is create a feature, very simple to compute using trigonometric functions, that for each angle θ measures the similarity of a given patch to P_{θ}. See the technical note at the end for a precise definition. By adjoining these features to two distinct data sets and treating them as separate data channels, we have obtained improved training results on both data sets, one MNIST and the other SVHN. We'll call this approach the boosted method. We are using a simple two-layer convolutional neural net in both cases. The results are as follows.
MNIST

| Validation Accuracy | # Batch Iterations (Boosted) | # Batch Iterations (Standard) |
|---------------------|------------------------------|-------------------------------|
| .8                  | 187                          | 293                           |
| .9                  | 378                          | 731                           |
| .95                 | 1046                         | 2052                          |
| .96                 | 1499                         | 2974                          |
| .97                 | 2398                         | 4528                          |
| .98                 | 5516                         | 8802                          |
| .985                | 9584                         | 16722                         |
SVHN

| Validation Accuracy | # Batch Iterations (Boosted) | # Batch Iterations (Standard) |
|---------------------|------------------------------|-------------------------------|
| .25                 | 303                          | 1148                          |
| .5                  | 745                          | 2464                          |
| .75                 | 1655                         | 5866                          |
| .8                  | 2534                         | 8727                          |
| .83                 | 4782                         | 13067                         |
| .84                 | 6312                         | 15624                         |
| .85                 | 8426                         | 21009                         |
The application to MNIST gives a rough speedup of a little under 2X. For the more complicated SVHN, we obtain improvement by a rough factor of 3.5X until hitting the .8 threshold, lowering to 2.5-3X thereafter. An examination of the graphs of validation accuracy vs. number of batch iterations, plotted out to 30,000 iterations, suggests that at the higher accuracy ranges the standard method may never attain the results of the boosted method.
These findings suggest that one should always use these features for image data sets in order to speed up learning. The results in our earlier post also suggest that they contribute to improved generalization capability from one data set to another.
These results are only a beginning. There are more complex models of spaces of high density patches in natural images, which can generate richer sets of features, and which one would expect to enable improvement, particularly when used in higher layers.
In our next post, we will describe a methodology for constructing neural net architectures especially adapted to data sets of fixed structures, or adapted to individual data sets based on information concerning their sets of features. By using a set of features equipped with a geometry or distance function, we will show how that information can be used to inform the development of architectures for various applications of deep neural nets.
Technical Note
As stated above, each point on the circle is specified by an angle θ, and there is a corresponding patch P_{θ}, which is in turn a discretization of the real-valued function ƒ_{θ} on the unit square given by

ƒ_{θ}(x, y) = x cos(θ) + y sin(θ)
By forming the scalar product of a patch (which is a 9-vector of values, one for each pixel) with the 9-vector of values obtained by evaluating ƒ_{θ} on the 9 grid points, we obtain a number, which is the feature assigned to the angle θ. By selecting down to a finite, equally spaced set of angles, we get a finite set of features we can work with.
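As an illustrative sketch of this computation (pure Python; the grid coordinates are assumed here to be {-1, 0, 1} in each axis):

```python
import math

def f_theta_vector(theta):
    """f_theta(x, y) = x cos(theta) + y sin(theta) on the 3x3 grid points."""
    grid = [-1.0, 0.0, 1.0]
    return [x * math.cos(theta) + y * math.sin(theta)
            for y in grid for x in grid]

def circular_feature(patch, theta):
    """Scalar product measuring similarity of a 9-vector patch to P_theta."""
    return sum(p * f for p, f in zip(patch, f_theta_vector(theta)))

# A patch that is itself a discretization of f_0 (a horizontal gradient)
# responds strongly at theta = 0 and not at all at theta = pi/2.
horizontal = f_theta_vector(0.0)
strong = circular_feature(horizontal, 0.0)
weak = circular_feature(horizontal, math.pi / 2)
```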
The post Feature Rich – How Features Feature in Our 7.10 Release appeared first on Ayasdi.
The 7.10 release is one of Ayasdi's most functionality-packed releases ever. We won't cover everything in this post, but will touch on the highlights, of which there are many:
We covered our growing portfolio of transformation services in a recent post and so will focus on what’s new in this release.
Data is being collected at faster and faster rates, everywhere. The data we collect, however, is not very valuable in its raw form, from a predictive perspective. The main way for companies to harness data and make it valuable (i.e. give it predictive power) is through data transformation. These include merges, concatenations, sums, and other operations.
Ayasdi continues to grow its portfolio of transformations and with Release 7.10, Ayasdi has added Pivot, Binarize, Union, and Math/String Operator transformations to its already extensive list.
Let’s start with the Math/String Operator transformations. Customers now have the ability to perform mathematical operations (+ – * /) between a set of columns (for creating ratios, sums, differences, etc.) or they can concatenate string or numeric values between multiple columns using the new Math/String Operators.
Next is Union transformations. With this new feature, a number of data sources can now be united into one. For example, transactions from 2015 can be merged with those from 2016 to produce a data source that spans over both years.
We also added the ability to binarize any column in a dataset. For example, for an age column, any value above 20 can be labeled with a 1 and any value 20 or below can be given a 0 value in a new binary column.
Finally, we’ve added pivot transformations. Pivoting can convert, for example, a customer-transaction source into a customer-product-number of transactions source. This is similar to Microsoft Excel Pivot functionality, except the Ayasdi system handles much larger data pivots.
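To make the four transformations concrete, here is what each does, illustrated with pandas (the Ayasdi API differs; this only sketches the semantics):

```python
import pandas as pd

t2015 = pd.DataFrame({"customer": ["A", "B"], "product": ["x", "y"],
                      "amount": [10, 20]})
t2016 = pd.DataFrame({"customer": ["A", "C"], "product": ["y", "x"],
                      "amount": [30, 40]})

# Union: stack two sources spanning different years into one.
txns = pd.concat([t2015, t2016], ignore_index=True)

# Math operator: a new column from arithmetic between columns.
txns["double"] = txns["amount"] * 2

# Binarize: 1 if amount is above 20, else 0.
txns["large"] = (txns["amount"] > 20).astype(int)

# Pivot: customer-transaction rows -> customer-by-product transaction counts.
counts = txns.pivot_table(index="customer", columns="product",
                          values="amount", aggfunc="count", fill_value=0)
```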
Expect more transformations going forward as Ayasdi continues to deliver against our customer requirements in this area.
Part of the transformation puzzle involves retrieving statistical facts about the data. The raw form of the data itself might not be meaningful but understanding the distribution of the values is often quite valuable.
Release 7.10 provides new statistical functions, including Row Group Stats, Histograms, and Distributions. These and many other statistical functions can be accessed using the Ayasdi Python SDK Source.get_stats function.
Some of the highlights include Most Common Value (mode) within a group or the entire source (i.e. what song did a customer listen to the most?), the number of unique values, and the actual distinct values themselves.
Additionally, the distribution of values (i.e. percentiles) and histogram data is now easily obtained. A histogram of the number of times a customer listened to a particular song Genre might provide an interesting customer summary visual for that customer, for example.
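For illustration, here is the same flavor of statistics computed with pandas (not the Ayasdi SDK's get_stats, whose interface differs):

```python
import pandas as pd

plays = pd.DataFrame({
    "customer": ["A", "A", "A", "B", "B"],
    "genre": ["rock", "rock", "jazz", "pop", "pop"],
    "minutes": [3, 4, 5, 2, 6],
})

# Most common value (mode) per group: which genre did each customer
# listen to the most?
top_genre = plays.groupby("customer")["genre"].agg(lambda s: s.mode()[0])

# Number of distinct values, and the distinct values themselves.
n_genres = plays["genre"].nunique()
genres = sorted(plays["genre"].unique())

# Distribution of values: e.g. the median listening time (50th percentile).
median_minutes = plays["minutes"].quantile(0.5)
```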
Ultimately, these features are not that impressive from a statistical perspective (frankly, they are not at all), but exposing the functionality allows a user to create robust dashboards, something that was previously more difficult (though not impossible) to do.
Features are an integral part of any machine intelligence exercise. Our technology, Topological Data Analysis, feasts on features. The more the better. That is why we spend the time we do on transformations – they allow us to increase the feature space. Where some technologies struggle with this dimensionality, we thrive.
Having said that, more features can create challenges. The greater the number, the more interaction there likely is and the more difficult the process of manual selection becomes. For an accomplished data scientist, there are methods, approaches, and intuition that can be applied. These can be time-consuming.
Release 7.10 has introduced two elements that help with this challenge. The first is Automated Feature Selection and the second is an enhanced internal search algorithm that provides improved Feature Selection and scoring results.
Users now have three options. They can directly select features for analysis as they have traditionally done, choose the automatically created feature subset (i.e. column set), or use the scoring results from Automatic Feature Selection to choose alternative feature sets.
Automated Feature Selection, which is currently available for supervised feature selection, identifies features (columns such as age, location or transaction type) in a data source with the most predictive value with respect to a given outcome.
Ayasdi’s Feature Selection works to identify features that have high relevance to the outcome while reducing redundancy within the selected set of features. The Feature Selection returns an ordered list of features. Ayasdi’s new Automated Feature Selection now proposes feature sets based on this list and scores them for their overall ability to build good topological models that localize the outcome.
The new Automated Feature Selection functionality substantially reduces the amount of time necessary to identify the subset of features that are relevant to the outcome and facilitate the rapid discovery of high performing models.
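As a greatly simplified sketch of the relevance-versus-redundancy idea (illustrative pure Python, not Ayasdi's actual Feature Selection algorithm): greedily pick features that correlate with the outcome but penalize those that correlate with features already chosen.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_features(columns, outcome, k):
    """Greedy selection scoring relevance minus redundancy."""
    chosen = []
    remaining = dict(columns)
    while remaining and len(chosen) < k:
        def score(name):
            relevance = abs(pearson(remaining[name], outcome))
            redundancy = max((abs(pearson(remaining[name], columns[c]))
                              for c in chosen), default=0.0)
            return relevance - redundancy
        best = max(remaining, key=score)
        chosen.append(best)
        del remaining[best]
    return chosen

outcome = [1.0, 2.0, 3.0, 4.0]
columns = {
    "signal":       [1.0, 2.1, 2.9, 4.0],  # highly relevant
    "signal_noisy": [2.0, 4.3, 5.7, 8.0],  # relevant but redundant
    "extra":        [1.0, 1.0, 2.0, 2.0],  # adds independent information
}
picked = select_features(columns, outcome, k=2)  # skips the redundant copy
```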
These features (pun intended) are massive time savers – and come with a well thought-out set of guard rails to ensure the output is meaningful. We often talk about co-optimizing for efficiency and effectiveness – these features exemplify that pursuit.
Once the causal factors for decisions are selected and a classification model is created, the user typically begins to predict future outcomes. Often, classification systems become a “black box” of sorts providing a predicted classification group but not providing any insight into why that prediction was made.
Until this release, users had the ability to create one_vs_rest classification models within Ayasdi MIP. While this is a great strategy for creating performant predictive models, it suffers from two major issues when it comes to justifiability.
Ayasdi addresses this challenge through its Multiclass Model framework. The past several releases have brought major enhancements to this framework and this release is no different.
The new Ayasdi Decision Tree MultiClass Predictor provides a single set of rules to justify and explain why certain transactions/claims/genre choices belong to a particular classification group in the Topological Network (using dt.get_rules(), dt.rules, and dt.dot). This is important because users can now understand, at a granular level, the causal factors behind the groupings; it is particularly useful when a network has multiple disjoint groups and the user wants to predict the group in which a data point belongs.
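scikit-learn's decision trees expose an analogous capability, which gives a feel for what a single, readable rule set looks like (a conceptual analogue with made-up data, not the Ayasdi MultiClass Predictor):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical claims data: [age, num_claims] -> risk group.
X = [[25, 1], [30, 3], [45, 0], [50, 5], [23, 0], [48, 4]]
y = ["low", "high", "low", "high", "low", "high"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# A single, human-readable rule set explaining every classification.
rules = export_text(tree, feature_names=["age", "num_claims"])
```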
We have maintained for a long time that rules-based systems are ultimately flawed. However, we realize that they are exceptionally prevalent, especially in the financial services and healthcare verticals, and have justifiable use in situations where very fast, millisecond-scale decisions need to be made.
A minor addition we’ve made is that results are now returned in the form of prediction probabilities and decisions. For example, a model can predict if Customer A will be a churning customer (the Decision) with a 90% probability (how confident the model is of this Decision).
This is useful in a number of constructs such as determining churn, identifying upgrade candidates and identifying failed program states just to name a few.
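The shape of that output is simple to sketch (hypothetical helper and values, not the Ayasdi API):

```python
def to_decision(churn_probability, threshold=0.5):
    """Turn a predicted probability into a (decision, confidence) pair."""
    decision = "churn" if churn_probability >= threshold else "retain"
    confidence = (churn_probability if decision == "churn"
                  else 1 - churn_probability)
    return decision, confidence

# Customer A: 90% predicted probability of churning.
decision, confidence = to_decision(0.9)
```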
Often, the user would like to perform prediction using just a subset of the data. With this release, Ayasdi has made this even easier. Now, the user can construct a predictive model using a subset of the data or using particular datapoints. The new functionality is enabled through the Ayasdi Python SDK SourceSubset and DataPointList parameters.
Each DataPointList represents the values for the row corresponding to the original column set used for training. This meets a common requirement for Logistic Regression, Random Forest and Decision Tree models to be able to predict against only one or a few incoming data points (instead of a full dataset).
New or test data no longer needs to be imported into the system as a source. What this abstraction provides is that a stream of data, as it is generated in real time from an external source system, can now be used as an input. This addition is a milestone in our constant progress toward real-time prediction capabilities.
Ayasdi is in the business of building intelligent applications. All of those applications are built on our machine intelligence platform and thus have a common analytical foundation (whether it uses TDA or not).
Often, however, multiple applications would like to access a single, platform-powered component, such as a classification model.
Previously this would require multiple simultaneous connections. With this release, we have introduced API Connection changes that allow a user to call the Python SDK in order to make connections to multiple installations, as multiple users, without having to logout and login.
This functionality is key for a number of AI applications, such as Anti-Money Laundering where multi-user (multiple authorized users at the same time), multi-tenancy (multiple users can spawn concurrent jobs) is a requirement.
Release 7.10 also offers an option for on-premise or private cloud deployment (adding Azure to our existing AWS support). In keeping with all our previous releases, we continue to support high availability in our installations.
New for Release 7.10, the installation process no longer requires administrator or super-user privileges.
Whew. To think that only covers the highlights. We are constantly looking for ways to make this information more accessible and more digestible. There is now a new Getting Started page for the Python SDK, which was introduced to help users set up their environment before launching into the Ayasdi Python SDK Tutorials. This page provides links to all the sample data and code needed for the Ayasdi Python SDK Tutorials, describes how to upload the data to the Ayasdi Machine Intelligence Platform, and explains how to start a fresh Ayasdi Notebook.
Additionally, new documentation with Jupyter Notebook samples is provided with sample code to create topological networks and produce groups and auto-groups.
Please refer to the Ayasdi Machine Intelligence Platform 7.10 Release Notes and Ayasdi Python SDK Documentation for more details on these features and more.