For those who have not heard, the BMW Oracle Racing team won the America's Cup sailing an incredible new boat. What even those who have been following the news on the race may not know is that Oracle Data Mining helped the performance team tune the boat.

I helped with that problem, and it was a very hard one:

Imagine standing under an avalanche of data - 2500 variables, 10 times per second and a sailing team demanding answers to design and sailing variations immediately. This was the challenge facing the BMW ORACLE Racing Performance Analysis Team every sailing day as they refined and improved their giant 90 foot wide, 115 foot long trimaran sporting the largest hard-sail wing ever made. Using ORACLE DATA MINING accessing an ORACLE DATABASE and presenting results real time using ORACLE APPLICATION EXPRESS the performance team managed to provide the information required to optimise the giant multihull to the point that it not only beat the reigning America's Cup champions Alinghi in their giant Catamaran but resoundingly crushed them in a power display of high speed sailing. After two races - and two massive winning margins - the America's Cup was heading back to America - a triumph for the team, ORACLE and American technology.

— Ian Burns, Performance Director, BMW ORACLE Racing Team

Here is a video showcasing the boat and giving some interesting information about it:

For more information, check out the BMW Oracle Racing site here.

The Oracle BIWA Summit 2008 is approaching (December 2-3). It will be held at Oracle World HQ, Redwood Shores, California. This is the second event of its kind. Last year's event was a great success and lots of fun (see details here). This year's keynotes include Jeanne Harris (co-author of "Competing on Analytics") and Usama Fayyad (legendary data miner). Here is some information and links about the event:

Sign up now to attend the Oracle BIWA Summit 2008, Dec. 2-3. Attend this unique two-day IOUG Business Intelligence, Warehousing and Analytics (BIWA) SIG (www.oraclebiwa.org) event to gain the knowledge and information critical for success in your work. Attend 65 technical talks and 10 hands-on sessions, hear keynotes from Jeanne Harris, the co-author of the best-seller Competing on Analytics, and other industry leaders, learn the latest trends in data warehousing, business intelligence and analytics best practices, learn how to overcome common challenges and network with your peers. Learn how to improve data warehouse query performance by a factor of 10x with Oracle Exadata and hear firsthand from Oracle Senior Executives and other experts about the revolutionary new HP Oracle Exadata Storage Server and HP Oracle Database Machine and how they fit into Oracle's data warehousing strategy.

- Keynotes
- Technical Talks and Hands-on Sessions (also see the BIWA Summit agenda)
- Travel and Logistics (also see the BIWA Summit invitation)
- Sponsors
- REGISTER for BIWA now! Last year's BIWA Summit 2007 was sold out, so sign up now to reserve your space.

BIWA Summit 2008 -December 2-3, 2008

For a long time I have thought that we needed data mining books written for developers. Most data mining books are written for business or data analysts. Given that, it was a pleasant surprise to read Programming Collective Intelligence: Building Smart Web 2.0 Applications by Toby Segaran. The book provides a good discussion on data mining concepts anchored with interesting examples. It also provides guidance on when to use the techniques. I still think that there is room for improvement on how data mining and analytics should be presented to developers. However, the book is a great step forward in enabling developers to use analytics.

Toby's book has many Web 2.0 examples illustrated with data obtained from different web sites (e.g., blogs, Facebook, Zebo, etc.). The examples are coded in Python or rely on freely available software for the data gathering and analytics. This got me thinking: if one wanted to illustrate how to create scalable and robust applications exploiting collective intelligence, would it not be interesting to replicate some of those examples utilizing the analytical and XML processing capabilities in the Oracle database? So I decided to do exactly that. This is the first post in a series showing how we can solve the problems discussed in the book using technology in the Oracle database. Although most of what is described in the book can be done using the Oracle database, this series will only showcase some of the examples. To implement all of them would be like writing a new book.

Before we can start mining we need to collect data and store it in a database. There are many Web 2.0 websites with interesting data for mining. XML is the de facto data format returned by these websites. This post covers the basic steps on how to get this data by showing how to build an archive of entries from a list of RSS feeds. In later posts I will describe the mining.

First the good news, the Oracle RDBMS has a great set of features supporting XML and HTTP processing. Now the bad news, before writing this post I knew nothing about those features. So, after some digging around the web, I found lots of good information in a couple of articles. I have to say that I am really impressed by what can be done with the functionality in the database.

Below I describe some preparatory steps, how to query multiple RSS feeds at once, and finally how to automate the whole process to build an RSS feed archive.

Lucas Jellema has a nice series of articles describing how to build an RSS feed reader and archive (1, 2). I used them as a starting point. The comments in the first article pointed me in the right direction to move from a PL/SQL implementation, as described in the article, to the SQL version given below.

Here is a list of things that one needs to do before running the code in Lucas' articles or the code below:

- Configure fine-grained access using ACLs (access control lists) - needed for 11g
- Configure proxy settings - needed only if you are behind a proxy server
- Configure the database character set

Oracle provides access to external network services through several PL/SQL APIs (UTL_TCP, UTL_SMTP, UTL_MAIL, UTL_HTTP and UTL_INADDR). Before Oracle 11g, access to these external services was based on whether a user was granted execute permission on a specific package. Oracle 11g introduced fine-grained access to network services using access control lists (ACL). ACLs make it possible to control which users can access which network resources regardless of package grants.

Basically we need to do the following to be able to take advantage of the features in these packages:

- Create an access control list (ACL)
- Assign the ACL to the network
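A minimal sketch of these two steps using the 11g DBMS_NETWORK_ACL_ADMIN package follows; the ACL file name and the DMUSER principal are assumptions for illustration:

```sql
-- Create an ACL granting the connect privilege to DMUSER,
-- then assign it to any host (*) on port 80.
BEGIN
  DBMS_NETWORK_ACL_ADMIN.CREATE_ACL (
    acl         => 'rss.xml',
    description => 'ACL for reading RSS feeds',
    principal   => 'DMUSER',
    is_grant    => TRUE,
    privilege   => 'connect');

  DBMS_NETWORK_ACL_ADMIN.ASSIGN_ACL (
    acl        => 'rss.xml',
    host       => '*',
    lower_port => 80,
    upper_port => 80);
  COMMIT;
END;
/
```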

The assign_acl call above is very generous. It allows the DMUSER user to connect to any server (*) through port 80. For more details take a look at this nice article by Tim Hall.

If your machine sits behind a firewall you need to account for proxies by executing the following command in your session:
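For instance, a session-level proxy setting can be sketched with UTL_HTTP.SET_PROXY (the proxy host and port below are placeholders, not real values):

```sql
-- HTTPURITYPE uses UTL_HTTP underneath, so setting the proxy
-- for the session also affects the feed retrieval queries.
BEGIN
  UTL_HTTP.SET_PROXY('http://my-proxy.example.com:80', NULL);
END;
/
```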

For more details see the documentation for the UTL_HTTP package (link).

When retrieving XML from RSS feeds it is a good idea to use a database with a Unicode character set, for example AL32UTF8. This prevents getting an error when trying to persist the XML content to a table.

First we create a table rss_feeds to hold the RSS feeds we want to track.

Next we insert a couple of RSS feeds in the table assigning a category to each feed, such as: news, technology, etc.
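A minimal sketch of these two steps; the column layout and the sample feed URLs are my own assumptions:

```sql
-- Table of feeds to track, with a category for each feed
CREATE TABLE rss_feeds (
  feed_name VARCHAR2(100),
  feed_url  VARCHAR2(500),
  category  VARCHAR2(30)
);

-- Oracle takes one row per INSERT ... VALUES statement
INSERT INTO rss_feeds
  VALUES ('Oracle Press Releases',
          'http://www.oracle.com/rss/rss_ocom_pr.xml', 'technology');
INSERT INTO rss_feeds
  VALUES ('BBC News', 'http://feeds.bbci.co.uk/news/rss.xml', 'news');
COMMIT;
```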

Now we can use the SQL statement below to query the RSS feeds and map the XML documents retrieved to a relational schema:

The above query has a number of interesting features. It uses the HTTPURITYPE function to extract the XML for each RSS feed in the rss_feeds table. The XMLTABLE statements then process the XML document returned from each feed to extract the desired pieces of information and convert the document into a relational source. The first XMLTABLE statement extracts the title and the item elements from each XML document returned by p. Because each XML document contains multiple items, a second XMLTABLE statement is used to return the pieces of each item element extracted by the first XMLTABLE. This XMLTABLE statement also returns the category as XMLTYPE. This is needed because some sites have multiple category elements for a given item. The category elements are then extracted using the XMLQUERY function so that all these elements are concatenated into a single comma-separated string. For alternative ways of doing the last bit, take a look at this post by Mark Volkmann.
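A hedged sketch of a query with this shape is shown below; the column names and XPath expressions are assumptions based on the standard RSS 2.0 layout, not the post's original code:

```sql
SELECT p.category,
       x.feed_title,
       i.item_title,
       i.item_link,
       i.pub_date,
       -- Collapse possibly-multiple <category> elements per item
       -- into one comma-separated string
       XMLQUERY('string-join(//category, ", ")'
                PASSING i.item_cats RETURNING CONTENT).getStringVal() item_cats
FROM   rss_feeds p,
       XMLTABLE('/rss/channel'
                PASSING HTTPURITYPE(p.feed_url).getXML()
                COLUMNS
                  feed_title VARCHAR2(200) PATH 'title',
                  items      XMLTYPE       PATH 'item') x,
       XMLTABLE('/item'
                PASSING x.items
                COLUMNS
                  item_title VARCHAR2(400) PATH 'title',
                  item_link  VARCHAR2(400) PATH 'link',
                  pub_date   VARCHAR2(100) PATH 'pubDate',
                  item_cats  XMLTYPE       PATH 'category') i;
```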

One problem with the above query is that if one of the RSS feed sites is down or is unreachable the whole query will fail. Given the nature of the web this is not an unlikely event if we are trying to collect feeds from many sites. To overcome this problem, we can implement a PL/SQL function that wraps the HTTPURITYPE function and validates the URL before invoking the HTTPURITYPE function. If the site for a URL does not respond, the function returns NULL. The code in the sections below follows this approach.

**Building the RSS Feed Archive**

To build the RSS feed archive we want to periodically run the above query and persist to a table only the entries that have not been stored already.

First let's create a table for holding the RSS feed entries:

Next we define a function to validate an RSS feed's URL and return the XML for the feed.
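The function might look like the sketch below; the name get_feed_xml is an assumption:

```sql
-- Wrap HTTPURITYPE so that an unreachable site yields NULL
-- instead of failing the whole query.
CREATE OR REPLACE FUNCTION get_feed_xml (p_url IN VARCHAR2)
  RETURN XMLTYPE
IS
  l_req  UTL_HTTP.REQ;
  l_resp UTL_HTTP.RESP;
BEGIN
  -- Probe the URL first; any failure means the site is unreachable
  l_req  := UTL_HTTP.BEGIN_REQUEST(p_url);
  l_resp := UTL_HTTP.GET_RESPONSE(l_req);
  UTL_HTTP.END_RESPONSE(l_resp);
  -- The site responded: fetch and parse the feed
  RETURN HTTPURITYPE(p_url).getXML();
EXCEPTION
  WHEN OTHERS THEN
    RETURN NULL;
END;
/
```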

We can now use the following MERGE statement to merge only newer entries into the archive:
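A condensed sketch of such a MERGE is shown below, keying on the item link to detect entries already archived; the rss_feed_archive table name, the columns, and the get_feed_xml wrapper are assumptions:

```sql
MERGE INTO rss_feed_archive a
USING (
  SELECT p.category, i.item_title, i.item_link, i.pub_date
  FROM   rss_feeds p,
         XMLTABLE('/rss/channel'
                  PASSING get_feed_xml(p.feed_url)
                  COLUMNS items XMLTYPE PATH 'item') x,
         XMLTABLE('/item'
                  PASSING x.items
                  COLUMNS item_title VARCHAR2(400) PATH 'title',
                          item_link  VARCHAR2(400) PATH 'link',
                          pub_date   VARCHAR2(100) PATH 'pubDate') i
) s
ON (a.item_link = s.item_link)
WHEN NOT MATCHED THEN
  INSERT (category, item_title, item_link, pub_date)
  VALUES (s.category, s.item_title, s.item_link, s.pub_date);
```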

It is very impressive that a single SQL statement can retrieve entries from multiple RSS feeds, parse the XML, and merge the newer entries into a table for archival.

Finally let's set up a DBMS_SCHEDULER job to run this merge statement every fifteen minutes. First we create a PL/SQL procedure wrapping the above MERGE statement:

Besides the MERGE statement, the above procedure also includes some code for handling proxy servers as explained above.

Next we create the job that will execute this procedure every fifteen minutes:
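As a sketch, the job definition could look as follows; the job name and the refresh_rss_archive procedure name are assumptions:

```sql
BEGIN
  DBMS_SCHEDULER.CREATE_JOB (
    job_name        => 'RSS_ARCHIVE_JOB',
    job_type        => 'STORED_PROCEDURE',
    job_action      => 'refresh_rss_archive',
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'FREQ=MINUTELY; INTERVAL=15',  -- every 15 minutes
    enabled         => TRUE);
END;
/
```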

For more details on DBMS_SCHEDULER see the Oracle documentation (link).

To check on status of the job we can use the following query:
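For example, assuming the job name used above, something along these lines:

```sql
-- Job names are stored in uppercase in the scheduler views
SELECT job_name, state, last_start_date, next_run_date
FROM   user_scheduler_jobs
WHERE  job_name = 'RSS_ARCHIVE_JOB';
```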

To manually stop and/or drop the job we can use the following calls:
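Sketched with the same assumed job name:

```sql
BEGIN
  DBMS_SCHEDULER.STOP_JOB(job_name => 'RSS_ARCHIVE_JOB');
  DBMS_SCHEDULER.DROP_JOB(job_name => 'RSS_ARCHIVE_JOB');
END;
/
```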

After a few days running, the archive has accumulated quite a few entries:

In later posts I will show how to mine the archive.

In line with this trend, Oracle has recently released a new application: Oracle Sales Prospector. This application is targeted at sales representatives. It suggests which accounts they should target with specific products or services. It also indicates which deals are most likely to close within specific time frames, and provides accompanying corporate profiles as well as likely customer references.

This is a very cool application and I have wanted to talk about it for some time now. My group has worked closely with the Social CRM group in developing the data mining driving the product, and now that it has launched I can discuss some of the data mining details.

Oracle Sales Prospector is as analytics-driven as an application can be. Almost every aspect of what you see in its very nice user interface is the product of data mining models. Oracle Data Mining (ODM) provides the algorithms and the real-time scoring driving the application.

Oracle Sales Prospector main screen

The product functionality is presented to the user through an easy-to-use screen (see main screen above). Sales opportunities are presented in a bubble chart on the left of the main screen. Each bubble is an opportunity, where an opportunity is a (customer, product) pair. The size of the bubble reflects the expected (estimated) revenue for the opportunity, the vertical placement reflects how likely it is to close that opportunity, and the horizontal placement reflects the expected number of months to close it.

Opportunity bubble chart

If the user provides no filtering criteria, the application scores all customers and products associated with the sales representative and displays the top ones (e.g., ten) ranked by expected revenue in the bubble chart. This is done in real time and scales very nicely to large numbers of products and customers. This is one of the benefits of having data mining tightly integrated with a robust and scalable database server like Oracle's.

Filtering interface (upper left side of the application main screen)

The user can filter opportunities by product or customer. It is also possible to constrain the opportunities to those with a projected revenue above a given value and/or a probability of sale above a given threshold. Once the user specifies the filtering criteria, the application scores all customers and products in real time and returns those that match the criteria. The real-time aspect of it ensures that the latest pieces of information (e.g., current sales data) are taken into account when assessing opportunities.

The opportunity bubble chart provides a lot of functionality and leverages different types of modeling techniques. It uses, in different ways, regression, clustering, and association techniques to provide scalable, accurate, and transparent recommendations (customer-product pairs) as well as estimates of time to close and expected revenue. Robustness and scalability are important principles to keep in mind when designing analytical applications. It is better to sacrifice some accuracy and select more robust and scalable techniques that behave well with a variety of data distributions and data quality conditions.

Reference pane (lower right side of the main screen)

For each opportunity (bubble) the application also returns a list of likely references. These are customers that are like the one in the opportunity but that have already purchased the product in the opportunity. Customer segmentation using clustering is used for this part of the application. Similarity between customers is computed taking into account both purchasing behavior and company information (e.g., size and industry).

For more information on Oracle Sales Prospector check out its web site here and the Social CRM blog here.

The UTL_NLA package has many different procedures for solving systems of linear equations. Most of them are designed to take advantage of special form matrices (e.g., triangular, banded, etc.). For the example below, I picked the procedure LAPACK_GELS. This procedure solves general systems of linear equations in a robust fashion. It uses QR or LQ decompositions as needed.
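As a sketch of how such a call looks (the 3x3 system and all values below are made up for illustration and are not the post's original example):

```sql
-- Solve A*x = b for a small square system with LAPACK_GELS.
-- A is packed column by column (pack => 'C').
DECLARE
  a UTL_NLA_ARRAY_DBL := UTL_NLA_ARRAY_DBL(
        1, 2, 1,    -- column 1 of A
        2, 1, 3,    -- column 2 of A
        1, 1, 1);   -- column 3 of A
  b    UTL_NLA_ARRAY_DBL := UTL_NLA_ARRAY_DBL(4, 7, 6);
  info INTEGER;
BEGIN
  UTL_NLA.LAPACK_GELS(
    trans => 'N',   -- solve A*x = b, not the transposed system
    m     => 3,     -- rows of A
    n     => 3,     -- columns of A
    nrhs  => 1,     -- one right-hand side
    a     => a, lda => 3,
    b     => b, ldb => 3,
    info  => info,
    pack  => 'C');
  -- On success (info = 0) the solution overwrites b;
  -- for this system the exact solution is x = (5, 2, -5)
  FOR i IN 1..3 LOOP
    DBMS_OUTPUT.PUT_LINE('x(' || i || ') = ' || b(i));
  END LOOP;
END;
/
```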

Here is the output for the above code:

This is the second post in a series on how to do Linear Algebra in the Oracle Database using the *UTL_NLA* package. The first illustrated how to solve a system of linear equations. As promised in part 1, I am posting the code for a couple of procedures for reading and writing data from tables to arrays and vice-versa. Links to the data and code used in the examples are at the end of the post.

In data sets with many attributes groups of attributes often move together. This is usually due to the fact that more than one attribute may be measuring the same underlying force controlling the behavior of the system. In most systems there are only a few such driving forces. In these cases, it should be possible to eliminate the redundant information by creating a new set of attributes that extract the essential characteristics of the system. In effect we would like to replace a group of attributes with a single new one. PCA is a method that does exactly this. It creates a new set of attributes, the principal components, where each one of them is a linear combination of the original attributes. It is basically a special transformation of the data. It transforms numerical data to a new coordinate system where there is no redundant information recorded in the attributes.

Figure 1: A two-dimensional data distribution.

It is easier to understand what PCA does by looking at a couple of pictures (see Technical Note 1 below for a more technical explanation). Let's assume that we have a table (data set) with two columns (X1 and X2) and that all the rows (observations) in this table fall within the gray ellipse in Figure 1. The double-edged arrows in the picture indicate the ranges for the X1 and X2 attributes. A couple of things to notice in the picture:

- The data dispersion for X1 and X2 is about the same
- The distribution of the data falls along a diagonal pattern

The first point indicates that information from both attributes is required to distinguish the observations. The second point indicates that X1 and X2 are correlated. That is, when X1 is large X2 is also usually large, and when X1 is small X2 is also usually small.

What PCA does is to compute a new set of axes that can describe the data more efficiently. The idea is to rotate the axes so that the new axes (also called the principal components, i.e., PCs for short) are such that the variance of the data on each axis goes down from axis to axis. The first new axis is called the first principal component (PC1) and it is in the direction of the greatest variance in the data. Each new axis is constructed orthogonal to the previous ones and along the direction with the largest remaining variance. As a result the axes are labeled in order of decreasing variance. That is, if we label PC1 the axis for the first principal component then the variance for PC1 is larger than that for PC2 and so on.
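The rotation just described can be stated compactly; the following is a sketch of the standard formulation, assuming the data have been centered into a matrix $X$ with $n$ rows and $p$ columns:

```latex
\Sigma = \frac{1}{n-1} X^{\top} X, \qquad
\Sigma\, v_i = \lambda_i v_i, \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0
```

The eigenvectors $v_i$ of the covariance matrix $\Sigma$ are the PC axes, the scores are the projections $T = XV$, and the variance of the data along $PC_i$ equals $\lambda_i$, which is why the axes come out ordered by decreasing variance.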

Figure 2: PC1 and PC2 axes and their ranges.

Figure 2 shows the new axes PC1 and PC2 created by PCA for the example in Figure 1. As indicated by the sizes of the double-edged arrows, there is more variability (greater variance) on the PC1 axis than on the PC2 axis. Also the PC1 variance is bigger than that in the original attributes (X1 and X2). This is not surprising as PC1 is constructed to lie on the direction of the greatest variance in the data. The fact that PC2 has less variability means that it is less important for differentiating between observations. If there were more PCs, the data projection on these axes would have less and less variability. That is, they would capture less and less important aspects of the data. This is what allows compressing the data. By ignoring higher-order PCs we can create a new version of the data with fewer attributes than the original data.

Figure 3: Coordinates for a point on the original axes and on the PC axes.

Figure 3 shows the coordinates for a point in the original coordinate system (attributes X1 and X2) and in the PC coordinate system (PC1 and PC2). We can see that, for the PC coordinate system, the majority of points can be distinguished from each other by just looking at the value of PC1. This is not true for the original coordinate system. For example, there are many fewer points with coordinate b1 than with coordinate a1.

The following are the steps commonly taken when working with PCA:

- Center and/or scale the data. If an attribute spans different orders of magnitudes then first LOG transform the attribute
- Compute PCs
- Look at the variance explained (inspection of the Scree plot)
- Select the number of PCs to keep

The selected reduced set of PCs can then be used for different types of activities depending on the task at hand. Some common applications are: data compression, faster searches, data visualization, and finding groups and outliers. These steps will become clearer in the examples below.

The PCA functionality is implemented in the package *DMT_PROJECTION* by the following procedures and functions (see code at the end of the post):

- *pca*
- *pca_eval*
- *mk_xfm_query*
- *mk_attr_map*
- *mk_pca_query*

The *pca* procedure has two signatures. The first works with arrays and is the core procedure. The second signature wraps the first one and works with tables. The wrapper procedure uses a set of utilities to convert from table to array and vice-versa (see Technical Note 3). The wrapper procedure also estimates the number of PCs to keep and produces a query string that can be used to score new data. Technical Note 2 below describes the wrapper procedure in more detail. All the examples below use the wrapper procedure.

The *pca_eval* procedure estimates the ideal number of PCs to keep (see discussion in Example 3 below). It also computes the information for a Scree plot and a Log-Likelihood plot. The latter is used to estimate the number of PCs based on the method described in [ZHU06].

The function *mk_xfm_query* generates a query string that transforms a source data set to make it more suitable to PCA. It supports three approaches: Centralize only (subtract the mean), scale only (divide by the standard deviation), and z-score (centralize and scale). The scripts for the examples below show how this function is used.

The *mk_attr_map* procedure creates 2D attribute maps for a set of attributes. The names of the columns with the two coordinates for the map are provided in the argument list. The result table contains the attribute maps for all the attributes listed in the argument list. The first coordinate labels the rows and the second coordinate labels the columns of this table. Example 2 showcases the use of this procedure for data visualization.

The function *mk_pca_query* generates a query string that can be used to compute the PC representation of data in a table, view, or query. To use this query it is only necessary to append a table name or subquery to the end of the returned string. This procedure is used by the pca wrapper procedure.

The data for this example come from François Labelle's website on dimensionality reduction. The data contain the distance of each planet to the Sun, the planet's equatorial diameter, and the planet's density. The goal in this example is to find interesting features in the data set by visualizing it in 2D using PCA.

The following script transforms the data and finds the principal components using the *pca* wrapper procedure (see comments in the *DMT_PROJECTION* package code for a description of the arguments):

Because the data span a couple orders of magnitudes I applied the LOG transformation before centering the data using the *mk_xfm_query* function.

Table 1: planet_scr table.

Figure 4: Scree plot for the planetary data.

The *planet_scr* table (see Table 1) contains data for the Scree and Log-Likelihood plots. The Scree plot (Figure 4) charts all the eigenvalues in their decreasing order (from lower to higher PCs). I was curious so I checked out why it is called a Scree plot. The word scree comes from the Old Norse term for landslide. The term is used in this context because the plot resembles the side of a mountain with debris from the mountain lying at its base. The Log-Likelihood plot is covered in Example 3 below.

Table 1 shows that the first principal component (PC1) alone accounts for about 69% of the explained variance (*PCT_VAR* column). Adding PC2 brings it up to 98.8% (*CUM_PCT_VAR* column). As a result we can describe the data quite well by only considering the first two PCs.

Figure 5: Projection of planet data onto the first two PCs.

Figure 5 shows the distribution of the planets based on their PC1 (horizontal) and PC2 (vertical) coordinates. These numbers (scores) are persisted in the planet_u table. The red arrows show the projection of the original attributes in the PC coordinate system. The arrows point in the direction of increasing values of the original attribute. The direction of the arrows can be obtained from the V matrix returned by the *pca* procedure in the *planet_v* table. Each row in this table gives the PC coordinates for one of the original attributes.

The picture reveals some interesting features about the data. Even someone not trained in Astronomy would easily notice that Pluto is quite different from the other planets. This should come as no surprise given the recent debate about Pluto being a planet or not. It is also quite evident that there are two groups of planets. The first group contains smaller dense planets. The second group contains larger less dense planets. The second group is also farther away from the sun than the first group.

The previous example highlighted how principal component projections can be used to uncover outliers and groups in data sets. However, it also showed that it is not very easy to relate these projections to the original attributes (the red arrows help but can also be confusing). An alternative approach is to create attribute maps in the lower-dimensional projection space (2D or 3D). For a given data set, attribute maps can be compared to each other and make it possible to visually relate the distributions of different attributes. Again it is easier to explain the idea with an example.

The data for this example also come from François Labelle's website on dimensionality reduction. The data contain the population, average income, and area for twenty countries.

Figure 6: Country data projection onto the first two PCs.

Figure 6 shows the projection of the *LOG* of the data onto the first two PCs (PC1 is the horizontal axis). From the picture we see that countries with large populations are in the upper left corner. Countries with large areas and large average incomes are closer to the lower right corner. It is harder to get insight on the countries in the middle. Also this type of analysis becomes more and more difficult as the number of attributes increases (more red arrows in the picture).

We can construct attribute maps with the following steps:

- Start with the PC1 and PC2 coordinates used to create a map like the one in Figure 6
- Bin (discretize) the PC1 and PC2 values into a small number of buckets
- Aggregate (e.g., *AVG*, *MAX*, *MIN*, *COUNT*) the attribute of interest over all combinations of binned PC1 and PC2 values, even those not represented in the data, so that we have a full matrix for plotting. The attribute being mapped can be one of those used in the PCA step (population, average income, area) or any other numeric attribute
- Organize the results in a tabular format suitable for plotting using a surface chart: rows index one of the binned PCs and columns the other
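The binning and aggregation steps can be sketched in plain SQL with WIDTH_BUCKET; the country_u and countries table names and their columns are assumptions here:

```sql
-- Equi-width binning of the PC scores into 4 buckets, then
-- averaging an attribute over each (pc1_bin, pc2_bin) cell.
SELECT WIDTH_BUCKET(u.pc1, s.min1, s.max1, 4) pc1_bin,
       WIDTH_BUCKET(u.pc2, s.min2, s.max2, 4) pc2_bin,
       AVG(c.population)                       avg_population
FROM   country_u u
       JOIN countries c ON c.country = u.country
       CROSS JOIN (SELECT MIN(pc1) min1, MAX(pc1) max1,
                          MIN(pc2) min2, MAX(pc2) max2
                   FROM   country_u) s
GROUP BY WIDTH_BUCKET(u.pc1, s.min1, s.max1, 4),
         WIDTH_BUCKET(u.pc2, s.min2, s.max2, 4)
ORDER BY pc1_bin, pc2_bin;
```

This sketch leaves empty cells out of the result; filling the full matrix (cells with no countries) would take an extra join against the list of bin numbers.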

The *DMT_PROJECTION.MK_ATTR_MAP* procedure implements all these steps. This procedure relies on the *DBMS_DATA_MINING_TRANSFORM* package for binning the data, so you need to have the Oracle Data Mining option installed to use it. You can always comment out this procedure or replace the binning with your own if you do not have the Data Mining option.

Using *DMT_PROJECTION.MK_ATTR_MAP* procedure I computed attribute maps for Population, Average Income, and Area. Here is the code:

The code has two parts. First it does PCA on the data. Next it creates the maps. The maps are persisted in the *COUNTRY_ATTR_MAP* table. The bin coordinates for each country are returned in the *COUNTRY_BVIEW* view. The query used for creating the maps joins the original data with the PCA projections from the *country_u* table. To create a map for attributes not in the original data we just need to edit this query and replace the original table in the join with the new table containing the data we want to plot.

For this example I set the number of bins to 4 because the data set is small. The more rows in the data the more bins we can have. This usually creates smoother maps. The procedure supports either equi-width or quantile binning. If we use quantile binning the maps display an amplification effect similar to a Self-Organizing Map (SOM). Because we create more bins in areas with higher concentration of cases, the map "zooms in" on these areas and spreads clumped regions out. This improves visualization of dense regions at the cost that the map no longer shows distances accurately. A map created with quantile binning is not as useful for visual clustering as one created with equi-width binning or using the original PC projections as in Figure 5.

Figure 7: Population attribute map.

Figure 8: Average income attribute map.

Figure 9: Area attribute map.

Figures 7-9 display the three maps using MS Excel's surface chart. The maps place the countries on a regular grid (PC1 vs. PC2, where PC1 is the horizontal axis) by binning the values of the coordinates PC1 and PC2 into four equi-width buckets. The colors provide a high to low landscape with peaks (purple) and valleys (blue). Each point in the grid can contain more than one country. I labeled a few countries per point for reference purposes. The value of the attribute at a grid node represents the average (*AVG*) value for the countries in that grid node. Values (and colors) between grid points are created automatically by interpolation by the surface chart routine in Excel.

The maps provide a visual summary of the distribution of each attribute. For example, Figure 7 shows that China and India have very large populations. As we move away from China (the population peak) population drops. Figure 8 shows high average income countries in the middle of the map and cutting across to the upper right corner.

It is also very easy to relate the attribute values in one map to the attribute values in the other maps. For example, the maps show that, in general, countries with larger average income (purple in Figure 8) have smaller populations (yellow in Figure 7), the USA being an exception. Also large countries (purple and red in Figure 9) have large populations (purple and red in Figure 7). Australia is the exception in this case.

Those interested in creating plots like Figures 7-9 with Excel's surface chart should take a look at Jon Peltier's excellent tutorial: Surface and Contour Charts in Microsoft Excel. The output produced by *DMT_PROJECTION.MK_ATTR_MAP* is already in a format that can be used for plotting with Excel or most surface charting tools. In my case I just selected from the table *p_res_tab* in SQL Developer and then pasted the selection directly into Excel.

An important question in PCA is how many PCs to retain. If we project a data set with PCA for visualization we would like to know that the few PCs we are visualizing account for a significant share of the variation in the data. Also, for compression applications we would like to select the minimum number of PCs that achieves good compression. Again this will become clearer with an example.

The data set for this example can be found in [JOH02] and Richard Johnson's web site (here). The data set gives corporate data on 22 US public utilities. There are eight measurements on each utility as follows:

- X1: Fixed-charge covering ratio (income/debt)
- X2: Rate of return on capital
- X3: Cost per KW capacity in place
- X4: Annual Load Factor
- X5: Peak KWH demand growth from 1974 to 1975
- X6: Sales (KWH use per year)
- X7: Percent Nuclear
- X8: Total fuel costs (cents per KWH)

The task for this example is to form groups (clusters) of similar utilities. One application where clustering would be useful, suggested in this lecture at MIT, is a study to predict the cost impact of deregulation. To do the required analysis, economists would need to build a detailed cost model for each utility. It would save a significant amount of time and effort if we could:

- Cluster similar utility firms
- Build detailed cost models for just a "typical" firm in each cluster
- Adjust the results from these models to estimate results for all the utility firms

Instead of applying a clustering algorithm directly to the data let's analyze the data visually and try to detect groups of similar utilities. First we use PCA to reduce the number of dimensions. Here is the code:

Next we need to determine if a small number of PCs (2 or 3) captures enough of the characteristics of the data that we can feel confident that our visual approach has merit. There are many heuristics that can be used to estimate the number of PCs to keep (see [ZHU06] for a review). A common approach is to find a cut-off point (a PC number) in the Scree plot where the change from PC to PC in the plot levels off. That is, where the mountain ends and the debris starts. Until not long ago, there were no principled, computationally cheap approaches to select the cut-off point on the Scree plot. A recent paper by Zhu and Ghodsi [ZHU06] changed that. The paper introduces the profile log-likelihood, a metric that can be used to estimate the number of PCs to keep. Basically one should keep all the PCs up to and including the one for which the Profile Log-Likelihood plot peaks.
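In outline (my paraphrase of [ZHU06], so treat the notation as an assumption): the ordered eigenvalues $\lambda_1 \ge \dots \ge \lambda_p$ are split at a candidate cut-off $q$, each side is modeled as a sample from a normal distribution with its own mean but a pooled variance, and the profile log-likelihood of the split is

```latex
\ell(q) = \sum_{i=1}^{q} \log \phi(\lambda_i \mid \hat{\mu}_1, \hat{\sigma}^2)
        + \sum_{i=q+1}^{p} \log \phi(\lambda_i \mid \hat{\mu}_2, \hat{\sigma}^2),
\qquad \hat{q} = \operatorname*{arg\,max}_{q}\, \ell(q)
```

where $\phi$ is the normal density, $\hat{\mu}_1$ and $\hat{\mu}_2$ are the sample means of the two groups, and $\hat{\sigma}^2$ is the pooled sample variance. The estimated number of PCs is the $q$ that maximizes $\ell(q)$.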

The *pca* wrapper procedure implements the approach described in [ZHU06]. The estimated number of PCs is returned in the *p_num_pc* argument. The procedure also returns the information required to create the Scree plot and the Profile Log-Likelihood plot in the table specified by the *p_scr_tab_name* argument.

Figure 10: Scree plot for electric utility data.

Figure 11: Log-likelihood profile plot for electric utility data.

Figures 10 and 11 show, respectively, the Scree and the Log-Likelihood plots using the data persisted in the *utility_scr* table. The Profile Log-Likelihood plot indicates that we should keep the first three PCs (the plot peaks at PC3). This might not have been immediately obvious from the Scree plot itself. The cumulative percentage explained variance for the first three PCs is 67.5%. PC3 accounts for about 16.5% of the variance. By keeping only the first three PCs we capture a good deal of the variance in the data.

Figure 12: Electric utility data projection onto PC1 and PC2 axes.

Let's start our analysis by looking at the projection of the data on the first two PCs, shown in Figure 12. From the picture we can see that the values of PC1 (horizontal axis) divide the data into three sections: a left one with the Green group (San Diego), a right one with the Blue group, and a middle section with the remainder of the utilities. Along the PC2 axis the middle section separates into three subsections, colored red, yellow, and black in the picture. Finally, the Black group seems to have two subgroups separated by a line in the picture. This visual partition of the data is somewhat arbitrary, but it can be quite effective.

What happens if we add the information from PC3? Do we still have the same partitions? Can it help to further separate the yellow group?

Figure 13: Electric utility data projection onto PC1 and PC3 axes.

Figure 13 illustrates the projection of the data using the first three PCs (see Technical Note 4 for details on how to construct this plot). Each vertex of the triangle represents one of the first three PCs; the closer a point is to a vertex, the higher its value for the associated PC. The numbers have been scaled so that at a vertex the associated PC has a value of 1 and the other two PCs are zero. Also, points on the line opposite a vertex have a value of zero for the PC associated with that vertex. For example, San Diego has a value of almost 0 for PC1.

Most of the structures uncovered in Figure 12 are also present in Figure 13. The Blue and Red groups are preserved. San Diego (Green group) is still very different from the rest. One of the Black subgroups (Consolid, NewEngla, Hawaiian) was also kept intact. However, the other Black subgroup was merged with the Yellow group. We can also see that adding PC3 helped split the Yellow group into two subgroups. The partition of the Yellow utility firms takes place along a line that connects PC3 with the opposite side of the triangle.

Recapping, the task was to identify groups of similar utility firms. Using PCA we were able to visually identify 6 groups (Figure 13) with ease. Trying to uncover patterns by inspecting the original data directly would have been nearly impossible. The alternative would have been to use a clustering algorithm. The latter is a more sophisticated analysis requiring greater analytical skills from the user.

In a follow-up post I will show how to implement data interpolation using cubic splines. The implementation does not use the procedures in *UTL_NLA*, only the array types. However, it is another application of linear algebra and could also be coded using the *UTL_NLA* procedures.

Given the original data matrix *X*, PCA finds a rotation matrix *P* such that the new data *Y=XP* have a diagonal covariance matrix. The *P* matrix contains the coefficients used to create the linear combination of the original attributes that make up the new ones. Furthermore, *P* is a matrix with the eigenvectors of the covariance matrix of *X* in each column and ordered by descending eigenvalue size. We can achieve compression by truncating *P*. That is, we remove the columns of *P* that have the smallest eigenvalues. As a result *Y* is not only a rotated version of *X* but it is also smaller.
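For intuition, here is a small self-contained Python sketch (not part of the post's PL/SQL) of this idea in two dimensions: it builds the covariance matrix C = X'X of centered data, finds the rotation P whose columns are the eigenvectors of C (a closed form exists for the 2x2 case), and checks that the rotated data Y = XP have a diagonal covariance matrix.

```python
import math

# Toy data: two correlated columns
X = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
     [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]]

# Center each column (PCA assumes zero-mean columns)
m = len(X)
means = [sum(row[j] for row in X) / m for j in range(2)]
X = [[row[j] - means[j] for j in range(2)] for row in X]

# Covariance matrix C = X'X (unnormalized, as in the text)
a = sum(r[0] * r[0] for r in X)
b = sum(r[0] * r[1] for r in X)
c = sum(r[1] * r[1] for r in X)

# For a symmetric 2x2 matrix, a rotation by theta with
# tan(2*theta) = 2b / (a - c) diagonalizes it; the rotation's
# columns are the eigenvectors of C, so we take P to be it.
theta = 0.5 * math.atan2(2 * b, a - c)
cos_t, sin_t = math.cos(theta), math.sin(theta)
P = [[cos_t, -sin_t], [sin_t, cos_t]]

# Rotate the data: Y = XP
Y = [[r[0] * P[0][0] + r[1] * P[1][0],
      r[0] * P[0][1] + r[1] * P[1][1]] for r in X]

# The off-diagonal entry of Y'Y should now be (numerically) zero
off_diag = sum(r[0] * r[1] for r in Y)
print(abs(off_diag) < 1e-9)  # True: the new covariance matrix is diagonal
```

The toy data values are made up for illustration; in higher dimensions the eigenvectors come from SVD, as described in the notes below.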

Let *X* be an *m* by *n* matrix where the mean of each column is zero. The covariance matrix of *X* is given by the *n* by *n* matrix *C=X'X*. The Singular Value Decomposition (SVD) of *X* is *X=USV'*.

The matrix *U* is *m* by *n* and has orthogonal columns, that is, *U'U = I*, where *I* is an *n* by *n* identity matrix. The matrix *S* is an *n* by *n* diagonal matrix. The matrix *V* is an *n* by *n* matrix with orthogonal columns, that is, *V'V = I*. Using this decomposition we can rewrite *C* as

$$ C = X'X = (USV')'(USV') = VSU'USV' = VS^2V' $$

From this expression we see that *V* is a matrix of eigenvectors of *C* and *S^2* is the matrix of eigenvalues of *C*; the columns of both matrices are also ordered by decreasing size of the associated eigenvalue. For example, let *Vi* be the *i*-th column of *V*. Then

$$ C V_i = V S^2 V' V_i = S_i^2 V_i $$

where *Si^2* is the *i*-th element of the diagonal of *S^2*. It is also the eigenvalue associated with *Vi*. Setting *P=V* also solves the PCA problem, that is, *V* is a rotation matrix with all the desired properties. To illustrate the point further, the covariance of *Y* is given by

$$ Y'Y = (XP)'(XP) = P'X'XP = P'CP $$

If we replace *P* with *V* we make *Y'Y* a diagonal matrix:

$$ Y'Y = V'CV = V'VS^2V'V = S^2 $$
So *V* is a rotation matrix that turns the covariance matrix of *Y* into a diagonal matrix.

The scores *Y* on the new coordinate system can also be obtained from the results of SVD. *Y* can be computed by multiplying each column of *U* by the respective diagonal element of *S*:

$$ Y = XP = XV = USV'V = US $$
This is what is done in the step that scales *U* in the *pca* core procedure (see Technical Note 2 below).

The actual code for computing PCA (the core PCA procedure) is very small. This procedure is a good example of how the *UTL_NLA* package allows implementing complex calculations with a few lines of code. The code is as follows:

The above code uses Singular Value Decomposition (SVD) to compute PCA. It has three parts. First it performs SVD with a single *UTL_NLA* procedure call (*UTL_NLA.LAPACK_GESVD*). Then it scales the *U* matrix to obtain the projections. Finally it scales and squares the *S* matrix to get the variances.

The *pca* wrapper procedure is a good example of how easy it is to write powerful code using the *UTL_NLA* package and the read and write procedures in *DMT_MATRIX* (see Technical Note 3 below). The full *pca* wrapper procedure is listed below:

The procedure has the following main steps:

- Read the data from a table into an array
- Perform PCA on the array
- Generate the scoring query
- Estimate the number of PCs and compute the scree and log-likelihood plots
- Persist the PCA results (eigenvalues, PCs, scores, and scree and log-likelihood plot information)

With the exception of the last one, each of these steps is implemented with a single procedure call. The last step requires a procedure call for each persisted result, for a total of three calls to the *array2table* procedure. In the code I use the scoring query, instead of the U matrix, to persist the scores. This way I am sure to get projections for all the rows in the original table, even in cases where the original table did not fit in memory.

The *DMT_MATRIX* package contains the following procedures for reading and writing data between tables and arrays:

- *table2array* - Reads data from a table and stores them in a matrix represented by an array of type *UTL_NLA_ARRAY_DBL*.
- *table2array2* - Reads data from three columns in a table (or source) and stores them in a matrix represented by an array of type *UTL_NLA_ARRAY_DBL*. The first column contains the row index, the second column the column index, and the third the value. The row and column indexes do not have to be numeric. However, there should be only one instance of a given row-column index pair. This procedure is useful for reading data generated by the *array2table2* procedure (see below). It is also useful for reading data with more than one thousand attributes, or data that are naturally stored in transactional format.
- *array2table* - Writes data from a matrix into a table. The table's column names and the matrix row IDs are either specified in the argument list or automatically generated.
- *array2table2* - Writes data from a matrix into a table with 3 or 4 columns. The first (optional) column contains an ID for the matrix. This allows storing many matrices in a single table. The ID should be unique for each matrix stored in a table. The second column contains the row index, the third the column index, and the fourth the values.

These procedures greatly simplify coding linear algebra operations using data in tables. The wrapper version of the PCA procedure uses these procedures. The code for the *DMT_MATRIX* package provides more information on the procedures, and the *dmt_matrix_sample2.sql* file has a few examples using them.

The approach used for Figure 13 is based on the chromaticity diagrams used for color spaces. These diagrams compress three-dimensional data into a two-dimensional diagram. The caveat is that points in the original three-dimensional space that are scaled versions of each other map to the same point in the two-dimensional diagram. For example, the points (0.1, 0.2, 0.2) and (0.2, 0.4, 0.4) map to the same point (0.78, 0.53) in the diagram. These points represent the same overall pattern; the difference between them is a matter of strength or scale. In the case of colors, the difference between the two points is one of luminance: the second point is a brighter version of the first one. In the case of utilities (see Example 3 above), the second utility has the same operational characteristics as the first but operates at a larger scale.

The graphic in Figure 13 was created using Excel's 2D scatter plot. The data for the graphic were generated as follows:

- Compute the pi projections (i=1, 2, 3) of the points in the cube onto the plane PC1+PC2+PC3 = 1, where (PC1, PC2, PC3) are the normalized PC projections from the previous step:

- Compute the 2D scatter plot coordinates (T1, T2), where T1 and T2 are, respectively, the horizontal and the vertical distances from the triangle's PC1 corner:

This link points to the source code for the *dmt_projection* and *dmt_matrix* packages as well as scripts for the examples in the post. There are also links to a dump file containing the data for the examples and a script to create a schema and load the data.

[JOH02] Johnson, Richard Arnold, and Wichern, Dean W. (2002). *Applied Multivariate Statistical Analysis, *Prentice Hall, 5th edition.

[ZHU06] Zhu M., and Ghodsi A. (2006), "Automatic dimensionality selection from the scree plot via the use of profile likelihood." *Comput Stat Data Anal*, 51(2), 918 - 930.

Linear algebra is a branch of mathematics with a wide range of practical applications. Many areas have tasks that can be expressed using linear algebra. Here are some examples from several fields: statistics (multiple linear regression and principal components analysis), data mining (clustering and classification), bioinformatics (analysis of microarray data), operations research (supply chain and other optimization problems), econometrics (analysis of consumer demand data), and finance (asset allocation problems). Various linear algebra libraries are freely available for anyone to use. Oracle's UTL_NLA package exposes matrix PL/SQL data types and wrapper PL/SQL subprograms for two of the most popular and robust of these libraries, BLAS and LAPACK.

Linear algebra depends on matrix manipulation. Performing matrix manipulation in PL/SQL in the past required inventing a matrix representation based on PL/SQL's native data types and then writing matrix manipulation routines from scratch. This required substantial programming effort and the performance of the resulting implementation was limited. If developers chose to send data to external packages for processing rather than create their own routines, data transfer back and forth could be time consuming. Using the UTL_NLA package lets data stay within Oracle, removes the programming effort, and delivers a fast implementation.

BLAS and LAPACK are probably the predominant linear algebra libraries out there. These libraries are extensively used by a large number of scientific programs and specialized tools.

For developers, these libraries provide the building blocks for easily implementing a large number of advanced techniques. Take, for example, a toolbox like MATLAB: many of its capabilities are built on top of linear algebra primitives provided by packages like BLAS and LAPACK. Having these libraries in the database allows developers to write compact, easy-to-read code using vector operations. Also, because these libraries have efficient and robust implementations of linear algebra operations, code using the *UTL_NLA* package inherits these qualities for free.

Besides scientific programming, Oracle's linear algebra support can also be used for business analysis. One example is multiple linear regression. The database ships with a multiple regression application built using the *UTL_NLA* package. This application is implemented in an object called *OLS_Regression*. Note that sample files for the OLS Regression object can be found in *$ORACLE_HOME/plsql/demo*. Take a look here for an example of how to use this functionality.

Another example of business analysis is solving a system of linear equations. In this post I give a couple of examples of how to solve a system of linear equations with *UTL_NLA*. In a follow up post I will show how to implement Principal Components Analysis (PCA) using the package.

Before using the UTL_NLA package for development there are a couple of things to keep in mind. Oracle documentation states that developers using this package are expected to have a sound grasp of linear algebra in general and of the BLAS and LAPACK libraries in particular. I believe basic knowledge of linear algebra is enough if you are just trying to implement a well-known algorithm using this package. With respect to BLAS and LAPACK it is important to familiarize yourself with some basic concepts (e.g., matrix storage representation: column- or row-wise) and also to better understand some of the arguments the procedures in the package take. Besides the Oracle documentation some other useful references are:

The *LAPACK Users' Guide*, and the BLAS and LAPACK chapters in the *CRC Handbook of Linear Algebra*.
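The column- versus row-wise storage concept mentioned above trips up many first-time BLAS/LAPACK users, so here is a quick Python sketch (illustration only, not UTL_NLA code) of how the same 2x3 matrix flattens into the two layouts:

```python
def flatten(matrix, column_wise=True):
    """Flatten a list-of-rows matrix into the flat array layout
    expected by BLAS/LAPACK-style routines."""
    rows, cols = len(matrix), len(matrix[0])
    if column_wise:  # column-major ("Fortran") order
        return [matrix[i][j] for j in range(cols) for i in range(rows)]
    return [matrix[i][j] for i in range(rows) for j in range(cols)]

M = [[1, 2, 3],
     [4, 5, 6]]
print(flatten(M, column_wise=True))   # [1, 4, 2, 5, 3, 6]
print(flatten(M, column_wise=False))  # [1, 2, 3, 4, 5, 6]
```

Passing an array in the wrong layout silently transposes the matrix, which is why the storage arguments of the *UTL_NLA* procedures deserve attention.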

The *UTL_NLA* package currently only supports matrices with up to 1,000,000 elements. For example, if we think of a table as a matrix, where table columns map to matrix columns and table rows to matrix rows, then for a table with 100 columns we can store up to 10,000 rows in a single matrix. It is useful to think in these terms because the number of rows is usually larger than the number of columns in a table. Furthermore, for many applications, we can obtain good results working on a sample of the data. This can be easily done in the database.

In order to use *UTL_NLA*, matrices have to be represented as either *UTL_NLA_ARRAY_DBL* or *UTL_NLA_ARRAY_FLT*. These types are PL/SQL *VARRAY*s. However, many applications work with data that reside in tables. It would be a nice addition to the package to be able to create a matrix from a query and to persist matrices to tables as well. This would alleviate the need for developers to code their own data read and write routines and would increase usability. In the next post, along with the PCA example, I will include a couple of procedures to help with these tasks.

Systems of linear equations (see here for a less technical description) are of great importance in mathematics and in its applications to the physical sciences, economics, engineering, and many other areas. The goal in solving a system of linear equations is to find a set of values for the unknowns that satisfies all the equations in the system. Let me illustrate this with a couple of examples.

You are planning a 7 day trip to the East Coast. You estimate that it will cost $300 per day in Boston and $675 per day in New York City. Your total budget for the 7 days is $2850. How many days should you spend in each location?

This problem can be mapped to the following system of linear equations:

$$\left\{ \begin{array}{rcrcr} x_1 & + & x_2 & = & 7 \\ 300x_1 & + & 675x_2 & = & 2850 \end{array} \right. $$

where x1 is the number of days in Boston and x2 is the number of days in New York City. x1 and x2 are the unknowns, the values we want to compute. The first equation requires that the total time be equal to 7 days. The second equation requires that the total amount spent on the trip be equal to the available budget. In matrix form this system can be expressed as A*X = B, where the matrices A, X, and B are as follows:

$$ A=\left( \begin{array}{cc} 1 & 1 \\ 300 & 675 \end{array} \right) ,\; X=\left( \begin{array}{c} x_1\\ x_2 \end{array} \right) ,\; B=\left( \begin{array}{c} 7\\ 2850 \end{array} \right) $$

We can use the LAPACK_GESV procedure in UTL_NLA to solve this problem. This procedure computes the solution to a real system of linear equations A*X=B, where A is an n by n matrix and X and B are n by 1 matrices. Here is a code snippet that solves this problem:

The above code returns:

The arrays A and B store, respectively, matrices A and B in column order. It takes only a single procedure call to solve the problem. On exit, if the argument info = 0 (success), the procedure has overwritten the array B with the n by 1 solution matrix X. For this example, the solution is x1=5 and x2=2.
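For readers without a database handy, the same answer can be checked with a few lines of Python (a sketch of the math only, not of the LAPACK_GESV call):

```python
# Solve A*X = B for the trip-planning system using Cramer's rule.
# A is stored as a list of rows; this is a 2x2 illustration only.
A = [[1.0, 1.0],
     [300.0, 675.0]]
B = [7.0, 2850.0]

det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
x1 = (B[0] * A[1][1] - A[0][1] * B[1]) / det
x2 = (A[0][0] * B[1] - B[0] * A[1][0]) / det
print(x1, x2)  # 5.0 2.0 -> five days in Boston, two in New York City
```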

*(originally in the October 1991 edition of **Mathematics Teacher **- adapted from **here**)*

Joan King is marketing director for the BurgerRama restaurant chain. BurgerRama has decided to have a cartoon-character doll made to sell at a premium price at participating BurgerRama locations. The company can choose from several different versions of the doll that sell at different prices. King’s problem is to decide which selling price will best suit the needs of BurgerRama’s customers and store managers. King has data (Table 1) from previous similar promotions to help her make a decision.

Table 1

| Selling price per doll ($) | Quantity supplied | Quantity demanded |
|---|---|---|
| 1.00 | 35 | 530 |
| 2.00 | 130 | 400 |
| 4.00 | 320 | 140 |

To solve this problem we need to proceed in two steps. First we need to estimate the supply and demand equations from the above data. Then we solve a system of linear equations to find the market equilibrium price where supply equals demand.

For the supply equation we want to find a linear equation relating supply (S) to price (P) that fits the data in Table 1. Let us write this equation as S = c1 P + c2, where c1 and c2 are the parameters we want to estimate. Replacing S and P in the supply equation with the values in Table 1, we obtain the following system of three linear equations:

$$\left\{ \begin{array}{rcrcr} c_1 & + & c_2 & = & 35 \\ 2c_1 & + & c_2 & = & 130 \\ 4c_1 & + & c_2 & = & 320 \end{array}\right. $$

In matrix form this system can be expressed as A*X = B, where the matrices A, X, and B are as follows:

$$ A=\left( \begin{array}{cc} 1 & 1 \\ 2 & 1 \\ 4 & 1 \end{array} \right) ,\; X=\left( \begin{array}{c} c_1\\ c_2 \end{array} \right) ,\; B=\left( \begin{array}{c} 35 \\ 130 \\ 320 \end{array} \right) $$

This type of system, with more equations (3) than unknowns (c1, c2), is called an overdetermined system. Solving this problem is the same as solving a multivariate linear regression problem.

We can use the procedure LAPACK_GELS to solve this system. The LAPACK_GELS procedure solves overdetermined and underdetermined real linear systems. The relevant code snippet is:

The above code returns:

We have just implemented multivariate linear regression with a single PL/SQL procedure call! On entry, the array B contains the values of the matrix B above. On exit, the first two elements of B are the values of c1 and c2, the coefficients of the supply equation. The third element is the residual sum of squares for the solution, that is, the sum of the squared differences between the values predicted for S and the actual values in the data. Replacing the obtained values of c1 and c2 in the supply equation we obtain: S = 95 P - 60.
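The same fit can be reproduced in a short Python sketch (again, the math only, not the LAPACK_GELS call) by solving the 2x2 normal equations A'A X = A'B:

```python
# Supply data from Table 1: (price, quantity supplied)
data = [(1.0, 35.0), (2.0, 130.0), (4.0, 320.0)]

# Build the normal equations A'A * [c1, c2]' = A'B for S = c1*P + c2,
# where each row of A is (P, 1) and B holds the observed supplies.
s_pp = sum(p * p for p, _ in data)
s_p = sum(p for p, _ in data)
n = float(len(data))
s_ps = sum(p * s for p, s in data)
s_s = sum(s for _, s in data)

# Solve the 2x2 system with Cramer's rule
det = s_pp * n - s_p * s_p
c1 = (s_ps * n - s_p * s_s) / det
c2 = (s_pp * s_s - s_ps * s_p) / det
print(c1, c2)  # 95.0 -60.0, i.e. S = 95P - 60
```

LAPACK_GELS solves the least-squares problem by QR factorization rather than normal equations, but for this tiny, well-conditioned system both routes give the same coefficients.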

Setting up a similar problem for the demand data gives a system with the same A matrix and a different B matrix (B=(530, 400, 140)). Solving this system of equations (see code at the end of the post) gives the following demand equation: D = -130 P + 660.

In order to find the market equilibrium price (step 2) we create a new system of linear equations combining the supply and demand equations with a new equation requiring that supply equals demand. The system looks like this:

$$\left\{ \begin{array}{rcrcrcr} S & & & - & 95P & = & -60 \\ & & D & + & 130P & = & 660 \\ S & - & D & & & = & 0 \end{array}\right. $$

The first two equations are, respectively, the supply and demand equations we obtained in step 1 above. The third equation is the market equilibrium condition (S=D). The matrix representation for the system is as follows:

$$ A=\left( \begin{array}{crr} 1 & 0 & -95 \\ 0 & 1 & 130\\ 1 & -1 & 0 \end{array} \right) ,\; X=\left( \begin{array}{c} S \\ D \\ P \end{array} \right) ,\; B=\left( \begin{array}{r} -60 \\ 660 \\ 0 \end{array} \right) $$

The solution of this system of linear equations is S = D = 244 units and P = $3.20.
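A quick Python check of the equilibrium (a sketch of the arithmetic, not of the PL/SQL code): with S = D, the two fitted lines can be equated and solved directly for the price.

```python
# Supply: S = 95P - 60;  Demand: D = -130P + 660
# (the demand intercept 660 is consistent with S = D = 244 at P = 3.20).
# At equilibrium S = D, so 95P - 60 = -130P + 660.
P = (660 + 60) / (95 + 130)  # 720 / 225
S = 95 * P - 60
D = -130 * P + 660
print(P, S, D)  # 3.2 244.0 244.0
```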

Here is the code for all the steps in this example:

The above code has only three procedure calls. It is even possible to solve steps 1a and 1b with a single call by combining the B matrices used in both steps into a single one. This would require only two procedure calls to solve Example 2.

In the next post in this series I will show how to do Principal Components Analysis (PCA) using the *UTL_NLA* package.