**Cluster Shrinkage**

There are of course some true giants in the field of portfolio theory. Aside from timeless luminaries like Markowitz, Black, Sharpe, Thorp and Litterman, we perceive thinkers like Thierry Roncalli, Attilio Meucci, and Yves Choueifaty to be modern giants. We also admire the work of David Varadi for his contributions in the field of heuristic optimization, and his propensity to introduce concepts from fields outside of pure finance and mathematics. Also, Michael Kapler has created a truly emergent phenomenon in finance with his Systematic Investor Toolbox, which has served to open up the previously esoteric field of quantitative finance to a much wider set of practitioners. I (Adam) know I’ve missed many others, for which I deeply apologize and take full responsibility. I never was very good with names.

In this article, we would like to integrate the cluster concepts we introduced in our article on Robust Risk Parity with some ideas proposed and explored by Varadi and Kapler in the last few months (see here and here). Candidly, as so often happens with the creative process, we stumbled on these ideas in the process of designing a Monte-Carlo based robustness test for our production algorithms, which we intend to explore in greater detail in a future post.

In a recent article series, Varadi and Kapler proposed and validated some novel approaches to the ‘curse of dimensionality’ in correlation/covariance matrices for high dimensional problems with limited data histories. Varadi used the following slide from R. Gutierrez-Osuna to illustrate this concept.

Figure 1. Curse of Dimensionality

Source: R. Gutierrez-Osuna

The ‘curse of dimensionality’ sounds complicated but is actually quite simple. Imagine you seek to derive volatility estimates for a universe of 10 assets based on 60 days of historical data. The volatility of each asset is held in a 1 x 10 vector, where each of the 10 elements of the vector holds the volatility for one asset class. From a data density standpoint, we have 600 observations (60 days x 10 assets) contributing to 10 estimates, so our data density is 600/10 = 60 pieces of data per estimate. From a statistical standpoint, this is a meaningful sample size.

Now let’s instead consider trying to estimate the variance-covariance matrix (VCV) for this universe of 10 assets, which we require in order to estimate the volatility of a portfolio constituted from this universe. The covariance matrix is symmetrical along the diagonal, so that values in the bottom-left half of the matrix are repeated in the upper-right half. So how might we calculate the number of independent elements in a covariance matrix with 10 assets?

For those who are interested in such things, the generalized formula for the number of independent elements of a symmetric tensor of rank M built from N assets is:

$$\binom{N+M-1}{M} = \frac{(N+M-1)!}{M!\,(N-1)!}$$

For a rank 2 tensor (such as a covariance matrix) the number of independent elements is:

$$\frac{N(N+1)}{2}$$

Therefore, accounting for the diagonal, the covariance matrix generates (10 * 11) / 2 = 55 independent pairwise variance and covariance estimates from the same 600 data points. In this case, each estimate is derived from an average of 600/55 = 10.9 data points per estimate.

Now imagine projecting the same 60 days into a rank 3 tensor (like the 3 dimensional cube in the figure above), like that used to derive the third moment (skewness) of a portfolio of assets. Now we have 10 x 10 x 10 = 1000 elements. The tensor is also symmetric under any permutation of its indices, so we can calculate the number of independent elements using the generalized equation above, which reduces to the following expression for rank = 3:

$$\frac{N(N+1)(N+2)}{6}$$

Plugging in N=10, we easily calculate that there are (10 * 11 * 12)/6 = 220 *independent* estimates in this co-skewness tensor. Given that we have generated these estimates from the same 600 data points, we now have a data density of 600/220 = 2.7 pieces of data per estimate.

You can see how, even with just 10 assets to work with, the amount of historical data required to generate meaningful estimates for covariance, and especially for higher order estimates like co-skewness and co-kurtosis (a rank 4 tensor with (10 * 11 * 12 * 13)/24 = 715 independent elements, for a data density of 600/715 = 0.84 observations per estimate), quickly grows too large to be practical. For example, to achieve the same 60 data points per estimate for our covariance matrix as we have for our volatility vector would require 60*55 / 10 = 330 days of data per asset.
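For readers who want to reproduce these counts, here is a minimal sketch in R (the names are ours, and the formula is just the symmetric-tensor count given above):

```r
# Number of independent elements in a symmetric tensor of rank m over n assets,
# and the resulting data density for 60 days of history on 10 assets.
independent_elements <- function(n, m) choose(n + m - 1, m)

n_assets <- 10
n_days   <- 60
for (m in 1:4) {
  k <- independent_elements(n_assets, m)
  cat(sprintf("rank %d: %4d independent estimates, %6.2f observations each\n",
              m, k, n_assets * n_days / k))
}
```

For ranks 1 through 4 this reproduces the 60, 10.9, 2.7 and 0.84 observations per estimate quoted above.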

In finance, we are often faced with a tradeoff between informational decay (or availability for testing purposes) and estimation error. On the one hand, we need a large enough data sample to derive statistically meaningful estimates. But on the other hand, price signals from long ago may carry less meaningful information than near term price signals.

For example, a rule of thumb in statistics is that you need at least 30 data points in a sample to test for statistical significance. For this reason, when simulating methodologies with monthly data, many researchers will use the past 30 months of data to derive their estimates for covariance, volatility, etc. While the sample may be meaningful from a density standpoint (enough data points to be meaningful), it may not be quite as meaningful from an ‘economic’ standpoint, because price movements 2.5 years ago may not materially reflect current relationships.

To overcome this common challenge, researchers have proposed several ways to reduce the dimensionality of higher order estimates. For example, the concept of ‘shrinkage’ is often applied to covariance estimates for large dimensional universes in order to ‘shrink’ the individual estimates in a covariance matrix toward the average of all estimates in the matrix. Ledoit and Wolf pioneered this domain with their whitepaper, Honey, I Shrunk the Sample Covariance Matrix. Varadi and Kapler explore a variety of these methods, and propose some novel and exciting new methods in their recent article series. Overall, our humble observation from these analyses and a quick survey of the literature is that while shrinkage methods help overcome some theoretical hurdles involved with time series parameter estimation, empirical results demonstrate mixed practical improvement.

Despite the mixed results of shrinkage methods in general, we felt there might be some value in proposing a slightly different type of shrinkage method which represents a sort of ‘compromise’ between traditional shrinkage methods and estimates derived from the sample matrix with no adjustments. The compromise arises from the fact that our method introduces a layer of shrinkage that is more granular than the average of all estimates, but less granular than the sample matrix, by shrinking toward clusters.

Clustering is a method of dimensionality reduction because it segregates assets into groups with similar qualities based on information in the correlation matrix. As such, an asset universe of several dozens or even hundreds of securities can be reduced to a handful of significant moving parts. I would again direct readers to a thorough exploration of clustering methods by Varadi and Kapler here, and how clustering might be applied to robust risk parity in our previous article, here.

Figure 2 shows the major market clusters for calendar year 2013 and year-to-date 2014 derived using k-means, and where the number of relevant clusters is determined using the percentage of variance method (p>0.90) (find code here from Kapler).
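For illustration, here is a minimal sketch of that selection rule in R. It is not Kapler’s SIT code (follow his link above for that); it simply assumes `returns` is a matrix of daily returns with one column per asset, and picks the smallest k whose between-cluster variance explains at least 90% of the total:

```r
# Cluster assets on their correlation profiles with k-means, choosing the
# smallest number of clusters that explains at least 90% of the variance.
select_clusters <- function(returns, max_k = 9, threshold = 0.90) {
  features <- cor(returns)  # describe each asset by its correlations to the others
  set.seed(42)              # k-means is sensitive to its random starting points
  for (k in 1:max_k) {
    fit <- kmeans(features, centers = k, nstart = 50)
    if (fit$betweenss / fit$totss >= threshold) return(fit)
  }
  fit
}
```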

Figure 2. Major market clusters in 2013-2014

In this universe there appear to have been 4 significant clusters over this period, which we might broadly categorize thusly:

- Bond cluster (IEF, TLT)
- Commodity cluster (GLD, DBC)
- Global equity cluster (EEM,EWJ,VGK,RWX,VTI)
- U.S. Real Estate cluster (ICF)

Now that we have the clusters, we can think about each cluster as a new asset which captures a meaningful portion of the information from each of the constituents of the cluster. As such, once we choose a scheme for weighting the assets inside each cluster, we can form a correlation matrix from the 4 cluster ‘assets’, and this matrix will contain a meaningful portion of the information contained in the sample correlation matrix.

Figure 3. Example cluster correlation matrix

Once we have the cluster correlation matrix, the next step is to map each of the original assets to its respective cluster. Then we will ‘shrink’ each pairwise estimate in the sample correlation matrix toward the correlation estimate derived from the assets’ respective clusters. Where two assets are from the same cluster, we will shrink the sample pairwise correlation toward the *average* of all the pairwise correlations between assets of that cluster.

An example should help to cement the logic. Let’s assume the sample pairwise correlation between IEF and VTI is -0.1. Then we would shrink this pairwise correlation toward the correlation between the clusters to which IEF (bond cluster) and VTI (global equity cluster) respectively belong. From the table, we can see that the correlation between the bond and global equity clusters is 0.05, so the ‘shrunk’ pairwise correlation estimate for IEF and VTI becomes mean(-0.1, 0.05) = -0.025.

Next let’s use an example of two assets from the same cluster, say EWJ and VTI which both belong to the global equity cluster. Let’s assume the sample pairwise correlation between these assets is 0.6, and that the average of all pairwise correlations between all of the assets in the global equity cluster is 0.75. Then the ‘shrunk’ pairwise correlation estimate between EWJ and VTI becomes mean(0.6, 0.75) = 0.675.
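To make the procedure concrete, here is a stripped-down sketch of the logic in R. It assumes `sample_cor` is the N x N sample correlation matrix, `cluster` is a vector of integer cluster labels (1 through K, aligned with the rows of `cluster_cor`), and `cluster_cor` is the K x K correlation matrix of the cluster ‘assets’; the 50/50 averaging mirrors the examples above.

```r
cluster_shrink <- function(sample_cor, cluster, cluster_cor) {
  shrunk <- sample_cor
  n <- ncol(sample_cor)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      if (cluster[i] == cluster[j]) {
        # same cluster: shrink toward the average intra-cluster correlation
        members <- which(cluster == cluster[i])
        block   <- sample_cor[members, members]
        target  <- mean(block[upper.tri(block)])
      } else {
        # different clusters: shrink toward the correlation of the two clusters
        target  <- cluster_cor[cluster[i], cluster[j]]
      }
      shrunk[i, j] <- shrunk[j, i] <- mean(c(sample_cor[i, j], target))
    }
  }
  shrunk
}
```

In the IEF/VTI example above the function returns mean(c(-0.1, 0.05)) = -0.025, and in the EWJ/VTI example mean(c(0.6, 0.75)) = 0.675.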

We have coded up the full logic for this method in R for use in Kapler’s Systematic Investor Toolbox backtesting environment. The following tables offer a comparison of results on two universes. We ran each of the weighting methods listed below with and without the application of our cluster shrinkage method, using a 250 day lookback window. All portfolios were rebalanced quarterly.

EW = Equal Weight (1/N)

MV = Minimum Variance

MD = Maximum Diversification

ERC = Equal Risk Contribution

MVA = David Varadi’s Heuristic Minimum Variance Algorithm

Results with cluster shrinkage are denoted by a ‘.CS’ suffix to the right of the weighting algorithm at the top of each performance table.

Table 1. 10 Global Asset Classes (DBC, EEM, EWJ, GLD, ICF, IEF, RWX, TLT, VGK, VTI)

Data from Bloomberg (extended with index or mutual fund data from 1995-)

Table 2. 9 U.S. sector SPDR ETFs (XLY, XLP, XLE, XLF, XLV, XLI, XLB, XLK, XLU)

Data from Bloomberg

We can make some broad conclusions from these performance tables. At the very least we have achieved golden rule number 1: first, do no harm. Most of the CS methods at least match the raw sample versions in terms of Sharpe ratio and MAR, with comparable returns.

In fact, we might suggest that cluster shrinkage delivers meaningful improvement relative to the unadjusted versions, producing a noticeably higher Sharpe ratio for minimum variance, maximum diversification, and heuristic MVA algorithms for both universes, and for ERC as well with the sector universe. Further, we observe a material reduction in turnover as a result of the added stability of the shrinkage overlay, especially for the maximum diversification based simulations, where turnover was lower by 30-35% for both universes.

Cluster shrinkage appears to deliver a more consistent improvement for the sector universe than the asset class universe. This may be due to the fact that sector correlations are less stable than asset class correlations, and thus benefit from the added stability. If so, we should see even greater improvement on larger and noisier datasets such as individual stocks. We look forward to investigating this in the near future.


**The Evolution of Optimal Lookback Horizon**

“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” – John von Neumann

We’ve previously written about the potential for even “simple” investment systems to be deceptively complex. Here is one example.

Many asset class rotational systems are optimized on lookback horizon (to our observation, most use 120 days), so we thought it would be interesting to investigate the evolution of the optimal lookback horizon through time. This allows us to put ourselves in the position of an analyst at different points in history and speculate on the choices he might have made given the information at his disposal at the time. It’s important to conduct this sort of exercise because if, looking back at past optima, an analyst would have chosen wildly different lookbacks at different points in history, that should call into question the stability of any lookback chosen today.

To perform this analysis we individually tested momentum systems with lookback horizons from 20 days (~1 month) to 260 days (~1 year) in increments of 20 days on a 10 asset class universe (ETF symbols DBC, EEM, EWJ, GLD, ICF, IEF, RWX, TLT, VGK, VTI back-extended with index data). In order to isolate lookback horizon and minimize the potential impact of varying the number of holdings, we averaged the results across independent simulations for systems holding top 2, 3, 4, and 5 assets. For example, we ran systems with top 2, 3, 4, and 5 assets using a 20 day lookback, and then averaged the results for a composite performance at the 20 day horizon.
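As a rough illustration of the mechanics (not our production code), the sketch below computes the weights for a single rebalance date: rank assets on trailing total return over the lookback, equally weight the top N, and average across N = 2 through 5. Here `prices` is assumed to be a plain matrix of daily closes with one column per asset; averaging the weights this way is approximately equivalent to averaging the four systems’ returns when they rebalance on the same dates.

```r
momentum_weights <- function(prices, lookback, top_n) {
  n   <- nrow(prices)
  mom <- prices[n, ] / prices[n - lookback, ] - 1  # trailing total return
  rnk <- rank(-mom, ties.method = "first")         # 1 = strongest momentum
  w   <- ifelse(rnk <= top_n, 1 / top_n, 0)        # equal weight the top N assets
  setNames(w, colnames(prices))
}

# Composite at one horizon: average of the top-2, top-3, top-4 and top-5 systems
composite_weights <- function(prices, lookback)
  rowMeans(sapply(2:5, function(k) momentum_weights(prices, lookback, k)))
```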

Astute readers will note that this is *almost* the equivalent of a rank-weighted portfolio of five assets – almost because in this case the top 2 assets receive the same weight. In a rank weighting scheme, each asset carries a weight of:

$$w_i = \frac{\mathrm{rank}_i^{-1}}{\sum_{j=1}^{N}\mathrm{rank}_j^{-1}}$$

The term in the numerator with rank raised to -1 signifies that the ranks are reversed, so that higher-ranking assets have a higher absolute rank: the top asset out of 5 assets has a rank of 5, not 1. The weighting scheme is a way of expressing the view that assets with higher momentum have a relatively higher expected return over the next period, but where the magnitude of the momentum differential carries no information. In our experiment, with the top two ranks tied, the rank weights work out to:

Top 1 and 2: 28.6% each

3rd ranked: 21.4%

4th ranked: 14.3%

5th ranked: 7.1%
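A two-line check of those numbers, under the assumption noted above that the top two assets share the top rank:

```r
ranks <- c(4, 4, 3, 2, 1)           # top two tied, then 3rd, 4th and 5th
round(100 * ranks / sum(ranks), 1)  # 28.6 28.6 21.4 14.3 7.1
```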

Figure 1. shows the calendar year performance of systems constructed with each lookback horizon, as well as the U.S. Total Stock Market index (Vanguard ETF symbol VTI).

Figure 1. Calendar year returns for momentum systems using different lookback horizons

Source: Bloomberg

First note the column on the right, which shows the cumulative return to each of the lookback systems over the entire period. The 100 day lookback delivered the strongest performance, while the 40 day system lagged. Interestingly, all of the systems exceeded the performance of U.S. stocks over the full 19 year period, though the S&P turned in top performance in 6 years, or over 30% of the time. Of course, it also turned in the worst performance in 8 years, or 42% of the time. And there’s the rub. You see, while US stocks turned in positive performance in more than 80% of calendar years vs. 70% positive years for the momentum systems, the worst calendar year performance by any momentum system was -11.5% for the 60 day system in 2001, while the worst calendar year for US stocks was 2008, when they lost 37%.

Momentum systems in general delivered a narrower distribution of outcomes, especially on the downside. We’d call these results pretty compelling at first glance.

Of equal interest, notice the dispersion in performance across the momentum systems from year to year. While the 100 day system was the peak performer on average over the entire history, it ranked near the bottom for the first few years, and its performance was also below average for the most recent two years. That said, keen observers will notice a potentially interesting performance ‘plateau’ that crests near the 100-120 day mark, and slowly decays as the lookback horizon moves further away in both directions.

Figure 2. illustrates how the top lookback horizons evolved through time by showing the cumulative annualized performance for each system through the end of each calendar year. Interestingly, an analyst investigating a simple momentum based asset rotation system in the late 1990s might have been forgiven for concluding that the concept was a bust, as **U.S. stocks crushed every momentum system we tested from 1995 to 2000 on a cumulative annualized basis**.

Figure 2. Cumulative annualized calendar year returns for momentum systems using different lookback horizons

Source: Bloomberg

It’s also interesting to note that the top performing lookback horizon over the entire testing period, 100 days, didn’t even creep into the top half of cumulative performance until 2002 – after the bear market. Also, the 60- and 80-day systems looked pretty grim until 2008, ranking near the bottom most years. However, in 2009 they did a better job of identifying the change in trend off the bear market low and, as a result, they leaped from the bottom quartile to the top quartile in short order, and remain there to this day. Simply stated, if our lookback horizon is too long, we are likely to be adversely affected by rapidly deteriorating bear markets similar to 2008-9 (notice the drop in the 240-260 day lookbacks in Figure 2). On the other hand, if it’s too short, we are likely to miss out on bullish reversals (which can be observed in the 20-40 day lookbacks’ choppy performance in Figure 1).

So what might we conclude from this analysis? Is there a ‘sweet spot’ parameter that we should zero in on for trading purposes? If so, the 100 day lookback horizon seems like a good candidate. But let’s not be too hasty. To us, what’s most clear from this analysis is that different market regimes carry different optimal lookback periods. In fact, it is likely that different baskets of securities have dominated during each historical regime, and it is actually the underlying securities which respond to momentum with different optimal lookback horizons. The volatility of the regime must also impact the optimal momentum horizon.

In an effort to avoid over-optimizing on lookback horizon, some choose to use several lookback horizons spanning the well established range over which momentum manifests: about 1 month through 12 months (see Faber). Choosing several lookbacks, such as 20, 60, 120, 180 and 250 days (~ 1, 3, 6, 9 and 12 months), is the mathematical equivalent of assigning higher weights to recent periods than to distant ones, because the most recent days’ returns are contained in every window while the oldest days appear only in the longest. In a general sense, this allows a system to capture a portion of trend acceleration. Also, it’s interesting to note that the average of the horizons above is 126 days, which is pretty close to the optimum observed in Figures 1 and 2.
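To see the implicit weighting, note that a blend of simple total-return momentum signals counts each day’s return roughly once for every window that contains it. A minimal sketch:

```r
lookbacks <- c(20, 60, 120, 180, 250)
day_age   <- 1:250                                 # 1 = the most recent trading day
weight    <- sapply(day_age, function(d) mean(d <= lookbacks))
weight[c(1, 50, 150, 250)]  # recent days sit in all five windows, the oldest in only one
```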

Some well known and respectable managers utilize dynamic lookback weighting. For example, one shop changes the horizon based on current risk estimates. The absorption ratio might also be used to shorten or lengthen lookback horizon in response to changes in observed measures of systemic risk.

That said, it pays to remember the over-arching goal to make our approach as simple as possible, but no simpler. In this context, simplicity relates very specifically to the number of ‘moving parts’ or degrees of freedom in the model. More degrees of freedom results in a complex model where we can have less faith in how the system will perform out of sample. As a result, we want to minimize the number of degrees of freedom while doing our best to preserve the performance character.

On the other hand, it is sometimes useful to apply fairly advanced methods to derive parameter estimates. GARCH, which stands for Generalized Autoregressive Conditional Heteroskedasticity, is a mouthful to pronounce and a bit of a bear to implement, but the literature is full of support for this model’s ability to forecast volatility estimates. Again, the goal is to be as simple as possible, **but no simpler**.

At heart, this series of posts is meant to draw attention to the *art* of system development in an effort to balance off the overwhelming focus on infinite layers of technical nuance that we observe around the blogosphere. It’s no great challenge to derive an eye-popping backtest with the right combination of indicators: just ask John von Neumann about his elephant. The trick is to use just a few really good tools, with some novel tricks few others have hit upon, to deliver a balance of return and risk that looks as compelling in real-time as it does in pixels.

Gladwell said it takes 10,000 hours to be an expert. That sounds about right to us. There are no shortcuts in investing. Do your homework, or caveat emptor.


**Why Skill Never Prevails in Your NCAA March Madness Office Pool**

As quants and sports fans we often find ourselves analyzing statistics from the sports world. And seeing as college basketball dominates the sports landscape for the next few weeks, it’s no surprise we are inspired to write about the NCAA Men’s Basketball Tournament, aka March Madness.

One of the great sports traditions is to participate in an office pool, whereby participants complete a tournament bracket. In doing so, they select a winner for each game, and ultimately the “best” bracket wins.

But there’s a problem: most March Madness bracket challenges reward only some random idiot; the “best” picker – *you, obviously* – is rarely victorious. You spend hours analyzing teams, weighing matchups and seeking out that perennial “Cinderella” only to find that on top of your entrance fee, you’ve sunk a lot of time into a losing effort. Here’s the good news: it’s not your fault. Really, it’s not. The scoring system is to blame.

Before we tackle specifics, it makes sense for us to come to a philosophical understanding about how we would go about identifying the best picker in a bracket challenge. Here are our basic criteria:

- We should allow for the largest sample size possible.
- We should create “matchup parity.”

On maximizing the sample size, there are two considerations. First, if we are *really *concerned with identifying the most talented picker, then it stands to reason that each game ought to be scored in the same way. Increasing points as the rounds pass has the effect of rewarding pickers who are lucky enough to select teams that go far in the tournament, regardless of whether or not they picked the most correct games overall. Stated another way, using a standard scoring system, the picker with the most correct picks in the early round could easily lose to someone who did relatively worse early on, but happened to pick the eventual tournament champion correctly. So, in order to maximize sample size, every game ought to be treated in the same way, regardless of when it happens in the tournament.

Second, and more troublingly, we would ideally have every entrant pick every game in the tournament *after* the matchups were known in each round. In other words, pickers wouldn’t make their 2nd round picks until the entire 1st round was completed, and so on. In this way, every person could pick every game, regardless of whether or not the teams they selected in the previous round advanced.

Adopting the standard bracket rules is undesirable because every incorrect choice has the effect of reducing the sample size upon which we judge the best picker. These incorrect picks stay on your bracket as legacy errors, eliminating every subsequent game from the set upon which you are judged. This reduces the sample size, and in the world of statistical reliability, smaller sample sizes increase the randomness of possible outcomes.

And let’s be perfectly clear, here: random outcomes in March Madness bracket challenges will *never, ever, ever go your way.* If you want to be lauded as the “truly talented” picker that you are, the legacy errors have got to go, and thus, so do the old-time scoring systems. Every matchup should be selected, but only after every matchup is known.

Unfortunately, anyone who has ever had the misfortune of running an office pool understands the logistical impossibility that this imposes. If one used the alternate system we are proposing, then a new set of picks would need to be submitted after every round, for 5 rounds! Some people wouldn’t get their picks in on time, others would be frustrated by such a system, and everyone would hate you: the unenviable plight of the lowly pool manager.

Moving on to “matchup parity,” it comes down to this: we want the picker to be completely neutral with regards to which team is chosen to win. Ideally, if the rules are set right, half the people in your bracket would choose one team, and half would choose the other, even in the most lopsided games. How do we encourage this distribution of picks? By appropriately rewarding those who correctly predict an unlikely outcome – upsets!

As an extreme example, let’s think about the all-but-overlooked #1 seed versus #16 seed in the first round. In the entire history of the NCAA tournament, a #16 has *never* defeated a #1. Not ever. Of course this doesn’t mean it’s impossible, simply that it’s highly improbable. In order to entice half of the pool to choose something that has literally never happened before, we must create a powerful incentive to do so. To wit, we want to make the expected returns equal regardless of which team is selected. To see how this might work, imagine that the #1 seed has a 99% chance of winning, meaning the #16 seed has a 1% chance. From the perspective of expected returns, it might make sense to award 99 points to anyone correctly selecting the #16 seed in that matchup and 1 point for anyone correctly selecting the #1 seed. Either pick then carries the same expected value: 0.99 x 1 = 0.01 x 99 = 0.99 points.

To make the expected return of each team equal, we simply set the payoff for correctly choosing the favorite equal to the underdog’s chance of winning and the payoff for correctly choosing the underdog equivalent to the favorite’s odds. In the real world, the odds for each team can be backed out by a simple examination of the betting lines. It might not be perfect, but if you believe in the wisdom of crowds, the “sharp money,” or the completely accurate notion that book-makers are profit seeking enterprises with a vested interest in getting lines “right,” it’s a good enough proxy.
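As a sketch of how one might operationalize this (the moneylines below are made up for illustration), back the implied probabilities out of the lines, strip the bookmaker’s margin, and award each side of the game the other side’s win probability:

```r
# Implied win probability from an American moneyline (e.g. -900 or +650)
implied_prob <- function(ml) ifelse(ml < 0, -ml / (-ml + 100), 100 / (ml + 100))

score_game <- function(ml_favorite, ml_underdog) {
  p <- implied_prob(c(ml_favorite, ml_underdog))
  p <- p / sum(p)                      # remove the bookmaker's vigorish
  c(points_if_favorite_wins = p[2],    # payoff equals the opponent's win probability
    points_if_underdog_wins = p[1])
}

score_game(-900, +650)  # a lopsided matchup: both picks now carry the same expected value
```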

Therefore, if the goal is to actually reward the most talented picker in your pool the ideal system might look something like this:

- Score each game relative to the odds that the selected team’s opponent will win.
- Have each game picked only after the exact matchup is known.
- Have every game scored via the same system without regard to the tournament round.

Of course, we’re not stupid: **Nobody does this, and nobody is going to do this because it’s tedious and more importantly, it’s BORING.**

As with picking an NCAA Tournament bracket, the hope in all endeavors is that true skill bears out over time. In the investing world, time is our sample size. Any manager can look like a genius over a year or two, but it is the truly talented ones whose ability bears out over much longer and more significant periods of time. We want the odds on our side as often as possible, and we want the rules of the game to reward those with a true informational edge. Understanding the virtuous-spiral-inducing recipe of large sample size, statistical robustness and compound growth, we’re happy to win thousands of small bets over our investing lifetimes even if the “action” in the interim isn’t nearly as thrilling. Indeed, the recipe for long-term success is to be on the proper side of a small win over and over again. If you’re excited about your investments – even if it’s for the right reason, like great performance – you may want to think twice about whether or not that strategy is appropriate for you, since the investment’s evocative nature stands a good chance of undermining your success down the line.

In the world of NCAA March Madness brackets, however, we are more often excitement-seeking. And that’s quite problematic to our goal of identifying the best picker, because even we must admit that in the case of March Madness brackets, *excitement adds to our overall enjoyment even when it diminishes our chances of winning*. And there’s the rub: investors often feel the same way, seeking thrilling investments that ultimately undermine their odds of success. And while it might be alright for your office pool, it’s not going to help you achieve your financial goals.

The process whereby you identify the best picker is mutually exclusive from the process by which you maximize overall pool excitement; the process whereby you maximize your odds of financial success is mutually exclusive from the process by which you maximize the thrill of investing.

In both cases, the decision is yours.

In the case of the NCAA tournament, most people will go with the excitement angle. We understand; sports are exciting, and the idea of winning gloriously and just *owning* your colleagues is certainly appealing. But the most likely outcome of a standard bracket challenge is that you’ll have once again contributed your hard-earned money to someone else’s bank account.

Hopefully, though, once you’ve repeated the embarrassing annual ritual of awarding the championship money to the person in your office who knows the least about basketball, you’ll think twice about making similar mistakes with your investments.


**The Black Box: Eyewitness Testimony and Investment Models**

Multiple discovery suggests that the most valuable, achievable advances in a field are often being examined simultaneously – yet independently – by many people at the same time. It stands to reason that on these occasions, leaps in logic can often occur at the same time by independent parties. And even in the cases where an individual makes the leap, it doesn’t take long for intelligent, competing parties to use reverse engineering to “catch up.”

Degrees of freedom relates to the counterintuitive notion that the more independent variables a model has – that is, the more complicated it is in terms of the number of independent ‘moving parts’ - the less reliable a back-test generally is. This is because more independent variables create a larger number of potential model states, each of which needs to meet its own standard of statistical significance. A model that integrates a great many variables seems like it would be robust; to the contrary, it is likely to be highly fragile.

Today, we endeavor to broaden the topic of degrees of freedom by adding a layer that is all-too-often ignored: your (our) behavior. As quantitative investors and researchers, we generally don’t like to work in “squishier” areas of social science. As trusted financial advisors, however, we know that when we sit across from clients, they often need more than just an evidence-based approach to investing. Often, they need encouragement, nurturing and coaching.

People use facts as factors in decision making, but they *take action* on emotion. We engineer investment strategies that not only work *in silico*, but that also work in practice with real clients whose behaviours and actions are *never* completely removed from their emotional state. In other words, the rules based approaches we apply in practice need to be compatible with the much less predictable black box inside your (and our) skull.

In the world of statistics, there are classifications for different types of variables. We spend a great deal of time on this blog talking about “system variables.” These are the rules which guide our research and investing. They relate to how we examine and stress test models to achieve statistically significant results and how we ultimately make investment decisions. These variables are procedural, and they are eminently controllable.

We’ve spent less time on this blog – at least recently – discussing “context variables.” These variables relate to the investor specifically, and to their cognitive and behavioral responses to a given set of circumstances. For example, each investor has a slightly different reaction to gains and losses of different magnitudes. These are sometimes called “estimator variables” but we think “context variables” is a more intuitive term.

If the difference is unclear, imagine that you are brought in to a police station for the purposes of providing eyewitness testimony. The police will have a procedure that they put you through. Will you look at a lineup of real people? Will you look at pictures? If you do look at pictures, will they be sequential? In sets of 6? While you’re doing all of this, will an officer be looking over your shoulder? What gender, race and age will the officer be relative to the witness? Or relative to the suspect?

All of the variables in this process are “system variables” because they are under the direct control of the person managing the system. It’s a choice to do an eyewitness identification using one procedure versus another.

Now imagine your specific mental state while sitting in the police station. How might your mental and emotional status change depending on the nature of the crime? What if there was a weapon involved – would you focus on the weapon or the assailant? How confident would you be in your identification if the crime happened an hour ago versus a day ago versus a week ago? In your neighbourhood vs. a neighbourhood far from your home? Are you more or less likely to select a picture reflective of your own race or sex? Are people with tattoos miscreants or creative types?

All of these are “context variables” because they relate to you and the context surrounding your individual decision making process, in this case your ability to provide accurate eyewitness testimony.

To be clear, there is a relationship between system variables and context variables; they are not completely independent of each other. For example, optimizing system variables by implementing procedures that decrease anxiety and decrease the amount of time between the crime and the identification can help stabilize otherwise volatile context variables, leading to more accurate eyewitness testimony.

Investing operates in much the same way. We constantly endeavor to explore new investment methods, integrating ideas where appropriate and putting into production system variables that show strong statistical significance. And we know that if we are successful, we will likely have a muting effect on otherwise volatile context variables. In plain English, if we design a system that delivers stability and growth, we know our clients are likely to make more rational financial decisions and generally show higher levels of commitment to their long-term investment plan.

Unfortunately, this won’t always be the case, which brings us back to performance decay. Every year, DALBAR releases their updated Quantitative Analysis of Investor Behavior (QAIB). Predictably, it shows that the average investor does significantly worse than a simple buy-and-hold investor. Much of this performance gap is attributed to behavioral deficiencies (aka context variables); a great many investors trapped by cycles of fear and greed buy high and sell low.

One issue that isn’t addressed by the QAIB is the notion that there exists a connection between system and context variables. If you are investing in the S&P 500 where 6-month price volatility since 2000 has had zen-like lows near 7% and mania-inducing highs above 58%, it seems almost natural that your responses would follow a predictable downward spiral of doing the exact wrong thing at the exact wrong time. In other words, the investment system isn’t completely blameless in the examination of emotionally flawed investing.

Volatile investment results induce volatile emotional responses, almost always to the investor’s detriment. 1987 notwithstanding, “buy and hold” was relatively easy to do from 1982-2000; it’s been an emotional roller coaster ever since.

We have a motto in our office: “We’d rather lose 50% of our clients near the peak of a runaway bull market, than 50% of our clients’ assets during the inevitable bear markets.” If most Advisors are honest with themselves, they will admit that their advice leads to precisely the opposite outcome. To wit, an Advisor advocating “Strategic Asset Allocation” – or a “buy and hold” philosophy - with a large equity component is definitionally acting in a way that is inconsistent with our philosophy. That’s because this type of portfolio can expect a 30% – 50% loss in value about once every 7 years. This Advisor will collect most of his clients near the end of a long bull market when his near-term performance has necessarily been strong. Soon after these clients will endure a major loss. This is a nearly universal cycle in wealth management.

…hence, why our motto stands out.

In our Adaptive Asset Allocation method, we’ve endeavored to deliver impressive results while focusing intensely on risk controls. Because of this, we know that there will be times when our model underperforms whatever stock index is in the headlines. We know that sometimes this underperformance will endure for extended periods of time. Further, we know we will almost certainly *lose some clients* near the end of this bull run. It’s happened before, and it will happen again.

But the difference is that we find it impossible to look our clients in the face when the stock market is down 50% and say that we succeeded by only losing 45%. That’s in our DNA. And it’s why we encourage investors to analyze the performance of any Advisor under consideration over an entire market cycle, which includes both bull and bear markets. In doing so, we also believe that we’re helping our clients “short circuit” the vicious cycle that the QAIB annually revisits.

After all, should we judge sports teams only on how they perform in the first half of the game? Or does the back half matter?

It’s true that we don’t spend as much time on this blog as our colleagues might discussing behavioral finance. Now you know why: the best way we know to limit the adverse effects of such behaviors is to provide our clients with a return profile that doesn’t compel them to make bad choices under duress.


**NFL Parity, Sample Size and Manager Selection**

In this post, we will continue to look at issues of statistical significance. In doing so, we hope to simultaneously provide some small measure of solace to our American readers, most of whom are in the doldrums.

For our neighbors south of the border, February is perhaps the most depressing month of the year. This has little to do with the fact that large swaths of the country are frozen solid and covered from dusk until dawn with a thick layer of grey clouds, though that certainly doesn’t help. Nor does it have to do with any political or economic issue that one might find in the headlines. To the contrary, at this moment, and at this time every year, the source of their collective misery is that the NFL season is over.

Now this may be only one person’s opinion but, at least observationally, it seems like one of the reasons that the NFL is so popular is that it has a much-deserved reputation for promoting inter-season mean reversion (in other words there is a tremendous amount of competitive balancing that goes on from year to year). In fact, if you look at the four major American sports (football, baseball, basketball and hockey), football has the highest mobility of team rankings. Therefore, if you have the compounded misfortune of having to simultaneously cheer for both a terrible football and baseball team, it’s far more likely that the football team will fare better next year than the baseball team. The flip side is also true; if your football team and hockey team were both exceedingly successful last year (a situation that is quite alien to us living in Toronto – at least with regards to hockey), it’s far more likely that the football team will fail to repeat its strong performance than the hockey team.

The following graphics bear this out. They show that, despite the tendency for teams to perform about as well next season as they did last season, football has the highest mobility.

Figure 1. Season-to-Season Winning Consistency among Sports Teams

Via Visual Statistix, Twitter @VisualStatistix

It is commonly assumed that qualitative forces such as league policies are the driving force behind this phenomenon. And indeed, different leagues have different rules around revenue sharing between teams, salary caps, luxury taxes and so on. But while the specifics of these policies are beyond the scope of this article, even a cursory comparison between football and baseball is sufficient to make the point.

In 2013, the NFL had 25 of 32 teams with payrolls between $100 and $125 million, with the largest payroll – $124.9 million – being paid by the Seattle Seahawks. If you need to re-read that sentence I don’t blame you. The highest spending team in the NFL last year was the Seattle Seahawks, who are clearly a mid-market team (albeit with an incredible defense). The fact that the Seahawks had the highest payroll also highlights another significant point: in the NFL, team payroll is largely disassociated from the size/population/concentration of wealth within the team’s home market. According to the Census Bureau, Seattle has the 15th largest metropolitan population in the US. This is a decidedly different situation from what can be found in any other major North American sport.

Take Major League Baseball for example. The MLB has an unreasonably wide range of payrolls. In 2013, two teams had payrolls north of $216 million, with two additional teams having payrolls north of $150 million. At the other end of the range, fully 16 teams (more than half the league) had payrolls less than $100 million.

And unlike the NFL, it’s also easy to see a relatively strong connection between market size and payroll. By a substantial margin, New York and Los Angeles are the most populous metropolitan areas in the US; to wit, the Yankees and Dodgers had 2013 payrolls of $229,000,000 and $216,000,000 respectively. Now the question is how does the disparity in terms of payroll between teams translate into the competitiveness of the product on the field? It would stand to reason that given additional financial resources a team would be able to acquire better players, which would ultimately translate into more wins (unless of course you’re the 2013 Los Angeles Angels). Thus, it stands to reason that a relatively tighter dispersion of payrolls across a sport should lead to greater competitive balance.

However, the idea that the tighter dispersion of payrolls is what is responsible for the NFL’s competitive balance ignores, or at least obfuscates, a key point. That is, is the NFL season actually long enough for any team’s win-loss record to be statistically significant? Putting it another way, is the NFL season long enough for “true talent” to prevail?

If the NFL season and its playoff structure are such that we can’t glean any meaningful statistical conclusions from it, then the idea that payroll parity promotes competitive balance is really unfounded and the inter-season mean reversion we observe is more a result of the random outcomes that can occur with too small a sample size and not from any characteristic of how the league operates.

In a recent post on the MIT Sloan Sports Analytics Conference website, “Exploring Consistency in Professional Sports: How the NFL’s Parity is Somewhat of a Hoax,” Brown University Doctoral Candidate Michael Lopez dissected several measures of parity in sports. As the title suggests, NFL parity is largely a mirage.

After several technical data transforms which make comparisons between sports more consistent, Lopez gets to the heart of the matter: the NFL suffers from a small sample size. The NFL regular season has only 16 games, whereas basketball and hockey have 82 and baseball has an incredible 162. Because of the lesser number of games, it is more likely in the NFL that the regular season record will not reflect the “true talent” of the team.

For example, Figure 2. shows a cumulative distribution function for win percentage of a theoretical team in the NFL and MLB.

Figure 2. Comparison of Potential Win Percentages Between Theoretically Average NFL and MLB Team

The chart shows the possible outcomes for a team given a 50% true talent (in other words, a team whose ability would suggest they *should* win half of their games). The standard deviations of team wins are gleaned from historical data and are 1.56 games for football and 10 games for baseball. Even with the larger standard deviation in baseball (6.4x larger), the *even larger* sample size in baseball (10.1x larger) imposes a central tendency to the possible outcomes. In plain English, the number of games played in baseball makes us significantly more confident that teams with the highest level of true talent will ultimately succeed in a given season.

With 90% fewer games, football is unable to make such guarantees. In fact, looking at the teams that actually made the playoffs since 2002, a perfectly average team will win enough games to make the playoffs almost 20% of the time. While this may not seem so out of the ordinary, remember that an average team has no business being in the playoffs at all.

But such is the way of the world when you suffer from small sample sizes; the error term dominates the outcomes and weird things happen more often than your intuition would lead you to believe.

The world of investing has a clear analog, though the situation is more complex. Consider two investment teams where one team – Alpha Manager – has genuine skill while the other team – Beta Manager – is a closet indexer with no skill. After fees Alpha Manager expects to deliver a mean return of 10% per year with 16% volatility, while Beta Manager expects to deliver 8% with 18% volatility. Both managers are diversified equity managers, so the correlation of monthly returns is 0.95.

With some simple math, and assuming a risk free rate of 1.5%, we can determine that Alpha Manager expects to deliver about 3% in traditional alpha relative to Beta Manager. This is the investment measure of ‘raw talent’.

Beta of Alpha Manager with Beta Manager (closet indexer) = (0.95 x 16% x 18%) / (18%^2) = 0.84

CAPM expected return of Alpha (skilled) manager = 1.5% + 0.84 * (8% – 1.5%) = 7%

Expected Alpha for Alpha Manager = 10% – 7% = 3%
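The same arithmetic in R, for anyone who wants to vary the assumptions:

```r
rf      <- 0.015
beta_ab <- (0.95 * 0.16 * 0.18) / 0.18^2  # beta of Alpha Manager on Beta Manager, ~0.84
capm    <- rf + beta_ab * (0.08 - rf)     # CAPM expected return, about 7%
alpha   <- 0.10 - capm                    # expected alpha, about 3% per year
```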

The question is, how long would we need to observe the performance of these managers in order to confidently identify Alpha Manager’s skill relative to Beta Manager? Without going too far down the rabbit hole with complicated statistics, Figure 3. charts the probability that Alpha Manager will have delivered higher compound performance than Beta Manager at time horizons from 1 year through 50 years. [If you want the worksheet, email us and I *may* consider sending it out.]

Figure 3.

You can see from the chart that there is a 61% chance that Alpha Manager will outperform Beta Manager in year 1 of our observation period. Over any random 5 year period Beta Manager will outperform Alpha Manager about a quarter of the time, and over 10 years Beta will outperform Alpha almost 15% of the time. Only after 20 years can we finally reject the probability that Alpha Manager has no skill at the traditional level of statistical significance (5%). *[Note this version corrects a slight miscalculation in the original draft].*
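For readers who would rather simulate than email for the worksheet, here is a minimal Monte Carlo sketch under the same assumptions (normally distributed, independent years, correlation of 0.95). It reproduces the general shape of Figure 3, though the exact figures depend on the distributional assumptions.

```r
set.seed(1)
mu_a <- 0.10; sd_a <- 0.16  # Alpha Manager, net of fees
mu_b <- 0.08; sd_b <- 0.18  # Beta Manager, the closet indexer
rho   <- 0.95
n_sim <- 10000

prob_ahead <- sapply(c(1, 5, 10, 20, 50), function(years) {
  z1 <- matrix(rnorm(n_sim * years), n_sim, years)
  z2 <- rho * z1 + sqrt(1 - rho^2) * matrix(rnorm(n_sim * years), n_sim, years)
  ra <- mu_a + sd_a * z1                         # Alpha Manager's annual returns
  rb <- mu_b + sd_b * z2                         # Beta Manager's annual returns
  mean(rowSums(log1p(ra)) > rowSums(log1p(rb)))  # P(Alpha ahead on compound return)
})
names(prob_ahead) <- paste0(c(1, 5, 10, 20, 50), "y")
prob_ahead
```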

Figure 4. demonstrates the same concept but in a different way. The red line represents the expected cumulative log returns to Alpha Manager relative to Beta Manager; note how it shows a nice steady accumulation of alpha as Alpha Manager outperforms Beta Manager each and every year. But this line is a unicorn. In reality, 90% of the time (assuming a normal distribution, which is naive) Alpha’s performance relative to Beta will fall between the green line at the high end (if Alpha Manager gets really lucky AND Beta Manager is very unlucky) and the blue line at the low end (if Alpha Manager is really unlucky AND Beta Manager is really lucky). Note how in 5% of possible scenarios Alpha Manager is still underperforming Beta Manager after 17 years of observation!

Figure 4. 90% range of log cumulative relative returns between Alpha Manager and Beta Manager at various horizons

These results should blow your mind. They should also prompt a material overhaul of your manager selection process. And it gets worse. That’s because the results above make very simplistic assumptions about the distribution of annual returns. Specifically, they assume that returns are independent and identically distributed which, as we’ve mentioned in previous posts, they decidedly are not. In addition, certain equity factors go in and out of style, persisting very strongly for 5 to 7 years and then vanishing for similarly long periods. Dividend stocks are this cycle’s darlings, but previous cycles saw investors fall in love with emerging markets (mid-naughts), large cap growth stocks (late 1990s), large cap ‘nifty fifty’ stocks (60s and 70s), etc.

Sometimes investment managers don’t fade with a whimper, but rather go out with a bang. Ken Heebner’s CGM Focus Fund was the top performing fund of the decade in 2007, having delivered 18% per year over the 10 years prior, a full 3% ahead of any other U.S. equity mutual fund (source: WSJ). You might be tempted to believe that Ken is possessed of a supernatural investment talent; after all, ten years is a fairly long horizon to deliver persistent alpha. And indeed, investors did flock to Ken in droves. Unfortunately, as so often happens, most investors jumped into his fund in 2007 – $2.6 billion of new assets were invested in CGM Focus in 2007 (source: WSJ).

Inevitably, Ken’s performance peaked in mid 2008 and proceeded to deal these investors a mind melting 66% drop to its eventual month-end trough in early 2009. If you don’t have a calculator handy, I’ll point out that at the fund’s 2009 trough it had wiped out over 12% in annualized returns over the now almost 11 year period, bringing its annual return down under 6%.

What’s an investor to do if she can’t make meaningful decisions on the basis of track records? Well, that’s the trillion dollar question, isn’t it. Unfortunately **the only information that is meaningful to investment allocation decisions is the process that a manager follows in order to harness one or more factors that have delivered persistent performance for many years**. The best factors have demonstrable efficacy back for many decades, and perhaps even centuries. For example, the momentum factor was recently shown to have existed for 212 years in stocks, and over 100 years for other asset classes. Now that’s something you can count on.

That’s why we spend so much time on process – because we know that in the end, that’s the only thing that an investor can truly base her decision on.

For the same reason, we are never impressed solely by the stated performance of any backtest – even our own. Rather, we are much more impressed by the ability of a model to stand up under intense statistical scrutiny: many variations of investment universes tested in multiple currencies under several regimes, along with a wide range of strong parameters with few degrees of freedom.

Often, we see firms advertising excellent medium-term results built on flimsy statistical grounds. Without understanding their process in great detail, these results are meaningless. Less commonly, we see impressive shorter-term sims, but that are clearly based on robust, statistically-significant long-term foundations. In those cases, we sit up and take note because statistically-significant, stable, long-term results are much rarer and much more important than most investors imagine.

NFL parity – and far too often, investment results – are both mirages. Small sample sizes in any given NFL season and high levels of covariance between many investment strategies make it almost impossible to distinguish talent from luck over most investors’ investment horizons. Marginal teams creep into the playoffs and go on crazy runs, and average investment managers have extended periods of above-average performances.

The next time you observe a team or a manager on what appears to be a streak, it’s important to remember that looks can be deceiving.

If you don’t believe us, just wait until next season.


**Faber’s Ivy Portfolio: As Simple as Possible, But No Simpler**

Albert Einstein is oft credited with suggesting that problems should be made ‘as simple as possible, but not simpler’. In fact, a poster with this very phrase and a picture of Einstein’s unmistakable visage adorned the inside of my bedroom door for much of my adolescence. However, readers might be interested to learn that this particular phrase has never been found in any of his published works. Rather, it’s surmised that this statement is actually a distilled version of a slightly less accessible quotation from a lecture entitled, “On the Method of Theoretical Physics” delivered at Oxford in 1933. The actual quote from Einstein was, “*It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience.*”

In any event, the distillation is a useful heuristic, and nowhere more so than in the field of empirical finance. To wit, it is attractive to think that an already simple approach, such as Mebane Faber’s ‘Ivy Portfolio,’ a 5-asset, 10-month moving average methodology, which requires monthly attention as originally proposed, might be equally effective with annual rebalancing. For those who aren’t already acquainted with it, Faber’s Ivy Portfolio approach was first proposed in a paper entitled “A Quantitative Approach to Tactical Asset Allocation” in 2007. It has been updated several times since including a recent update in early 2013 which extended the results through 2012. I’m not ashamed to admit that this paper was a primary catalyst for our own interest in quantitative approaches to asset allocation.

The mechanics of Faber’s approach is quite simple. First, compose a diversified portfolio from each of the major asset classes held in equal weight: bonds, U.S. stocks, international stocks, real estate, and commodities. Next compute a moving average (MA) of closing prices over the prior 10 months for each asset. Observe the portfolio at the end of each month, and where an asset closes out the month below the level of its moving average, sell the asset and hold cash, repurchasing only when it closes back above its moving average at the end of any subsequent month.
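A minimal sketch of that rule in R, assuming `prices` is a plain matrix of month-end closes with one column per asset (this is an illustration of the mechanics, not Faber’s code):

```r
# Hold an asset for the coming month if it closed above its 10-month moving
# average, otherwise hold that sleeve in cash; invested sleeves are equal weighted.
ivy_signal <- function(prices, n_ma = 10) {
  ma <- apply(prices, 2, function(p) stats::filter(p, rep(1 / n_ma, n_ma), sides = 1))
  (prices > ma) * 1  # 1 = invested, 0 = in cash; NA during the warm-up period
}

# Each asset's portfolio weight is its signal divided by the number of assets,
# with any remainder sitting in cash.
ivy_weights <- function(prices, n_ma = 10) ivy_signal(prices, n_ma) / ncol(prices)
```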

Our analysis will attempt to answer several questions:

- Is it possible to make this approach even simpler by only rebalancing the portfolio every 12 months rather than at the end of every month?
- What can we discover by backtesting a strategy which only trades the portfolio on the last trading day of the year?
- If this backtest did yield results that were comparable to the monthly approach, how statistically significant is this result?
- How might we improve our understanding of the true distribution of risk and return for an annually rebalanced strategy relative to a monthly version?

Before diving into our quantitative analysis, please recall that the cornerstone of robust system development is statistical significance. Furthermore, statistical significance is largely a function of the number of observations. It’s difficult to achieve statistical significance with only a few trades, as each trade constitutes one observation. As a result, an annually traded approach starts out with a large hurdle to overcome, which is that we are only able to generate one observation per year per instrument. For example, using just the original Ivy Portfolio’s 5 asset classes – US stocks, EAFE stocks, US real estate, US Treasuries and commodities – if we have 40 years of data we will have about 40*5 = 200 observations.

Granted, 40 years of time series data is meaningful because it covers several secular market regimes, such as the 1970s stagflation, the 2000 tech bubble, the emerging market and commodity boom of the mid ‘naughts, and the Global Financial Crisis of 2008. Even so, 200 total observations is not enough to generate meaningful statistical confidence, as we will demonstrate below.

To answer the questions we raised above, we ran several tests. Before we explain the tests however, note that we altered the original Ivy 5 concept in some subtle ways:

- We added emerging market equities (EEM), Japanese equities (EWJ), gold (GLD), international real estate (RWX) and long-duration Treasuries (TLT) to the original 5-asset universe. The broader universe generated more observations, and allowed us to test whether the parameters in the original Ivy Portfolio approach were optimized to work on just those 5 assets, or whether the rules are more universally applicable.
- We used daily data for our tests rather than monthly data. As a result, our tests only go back to 1995, because daily data for all of the indexes was unavailable prior to that time. However, daily data allows us to test trading on each of the roughly 252 trading days of the calendar year, which increases our number of observations more than 250-fold.
- We used the daily equivalent of the monthly moving averages applied in the original report. For example, rather than using a 12 month moving average, we used a 252 day MA. Any performance deviations due to our use of daily vs. monthly moving average calculations are statistically immaterial to the analysis.
- We tested both the 200 day (~10 month) and the 252 day (~12 month) moving averages as filters to see if there was a material difference in results from varying the length of the moving average.

We approached our analysis from three directions. First, we ran tests using a 252 day (~12 month) moving average rule with annual rebalancing, but where the annual rebalance occurs in months other than December. We also compared each of the annually rebalanced systems to results from a system that is rebalanced at the end of every month, and another system that is observed for rebalancing every day. Next we performed the same analyses, but using a 200 day (~10 month) moving average filter instead of a 252 day MA filter.
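
A sketch of that test harness, under simplifying assumptions (daily closes in a pandas DataFrame indexed by date, cash earning 0%, and no transaction costs at this stage), might look like the following; the monthly system is simply the special case where every month-end is an observation day.

```python
import pandas as pd

def ma_filter_returns(daily_close: pd.DataFrame, ma_days: int, obs_mask: pd.Series) -> pd.Series:
    """Equal-weight portfolio: each asset is held only if it closed above its moving
    average on the most recent observation day, and sits in cash (0%) otherwise."""
    rets = daily_close.pct_change()
    above_ma = (daily_close > daily_close.rolling(ma_days).mean()).astype(float)
    # Freeze signals between observation days, then lag one day so trades follow the close.
    held = above_ma[obs_mask].reindex(daily_close.index).ffill().shift(1)
    return (held * rets).sum(axis=1) / daily_close.shape[1]

def rebalance_timing_comparison(daily_close: pd.DataFrame, ma_days: int = 252) -> dict:
    """Annual systems observed at each calendar month-end vs. a monthly-observed system."""
    months = pd.Series(daily_close.index.month, index=daily_close.index)
    month_end = months != months.shift(-1)
    results = {f"annual_{m:02d}": ma_filter_returns(daily_close, ma_days, month_end & (months == m))
               for m in range(1, 13)}
    results["monthly"] = ma_filter_returns(daily_close, ma_days, month_end)
    return results
```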

**Note that we imposed onerous all-in transaction costs of 100 bps for annual rebalancing, 150 bps for monthly rebalancing, and 200 bps for daily rebalancing.**

Figures 1. and 2. show the dispersion of performance results for these moving average systems for rebalances that occur on the last trading day of each calendar month. In other words, the results for January assume annual rebalancing on the last trading day of January in each calendar year.

The red bars in the charts show the average results for annually rebalanced models across all of the months in the calendar year. The green bar shows the results of the traditional monthly rebalanced system, and the orange bar demonstrates the performance of a system that is rebalanced daily. For those who aren’t familiar with MAR, it is simply the return divided by the maximum drawdown.
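
For reference, the MAR and maximum drawdown figures behind the charts can be computed along these lines; the sketch assumes a daily return series and roughly 252 trading days per year.

```python
import pandas as pd

def max_drawdown(returns: pd.Series) -> float:
    """Largest peak-to-trough decline of the equity curve, expressed as a positive fraction."""
    equity = (1 + returns).cumprod()
    return float((1 - equity / equity.cummax()).max())

def mar_ratio(returns: pd.Series, periods_per_year: int = 252) -> float:
    """MAR ratio: compound annual return divided by maximum drawdown."""
    years = len(returns) / periods_per_year
    cagr = (1 + returns).prod() ** (1 / years) - 1
    return cagr / max_drawdown(returns)
```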

Figure 1. Performance results for 252 day moving average system, annual rebalancing at the end of each calendar month

Data source: Bloomberg

Figure 2. Performance results for 200 day moving average system, annual rebalancing at the end of each calendar month

Data source: Bloomberg

First, note that the 252 day (~12 month) and 200 day (~10 month) versions deliver statistically indistinguishable results on all relevant metrics. So we can safely assert that the 12 month MA used in the original report is relatively robust. However, the validations end there.

Recall that we penalized annually rebalanced models by 1% per year, the monthly rebalanced system by 1.5% per year, and the daily observed system by 2% per year. To our thinking, the most relevant comparisons are between red and green bars, because they illustrate the average results of all the annually rebalanced systems, and the monthly system, respectively. It’s clear from the charts that results from the monthly rebalanced system are demonstrably better in every performance metric than the *average* of all annually rebalanced systems. Indeed, the monthly version is better than even the best annually rebalanced systems in most respects.

Results for annual systems rebalanced in certain months – June and July for 252 day MA systems, and July and August for 200 day MA systems – show only slightly lower Sharpe ratios and higher MARs than the monthly system. Nascent quants might be tempted to conclude that you would do just as well trading an annual 252 day MA system so long as you trade in June or July, or an annual 200 day MA system in July or August. But this is an illusion.

Recall that there were really just 2 bear markets over the test horizon: the 2000 bursting of the technology bubble, and the 2008 Global Financial Crisis. Further, only the 2008 Global Financial Crisis really qualifies as a true multi-asset class crash. It just so happens that in 2008 most assets, with the exception of U.S. stocks and real estate, delivered strong returns until June, and the crash didn’t get going in earnest until September. Annual strategies that rebalanced in June, July or August were therefore able to capture the strong returns delivered through mid-2008, avoid almost all of the ensuing crash, sidestep the January whipsaw and volatile V-bottom of early 2009, and still harness a substantial portion of the 2009 rebound. Lucky stuff, and not likely to be repeated in the same way next time.

For fun, we took the next natural step in this analysis by examining the performance of annually rebalanced systems traded on each day of the calendar year. There are typically 252 trading days in a calendar year, so we examined the results for systems that trade annually on day 1, day 2, day 3…day 251, day 252. Trade day 1 will fall on a slightly different calendar date each year, depending on where New Year’s Day falls in the week, but in all we have 252 different annually rebalanced systems from which to compare results. Figures 3. and 4. show these results separately for 252 day MA and 200 day MA systems. Rather than show the results for each annual trade day (which would have made for a very wide chart), we sorted results into quantiles; this better illustrates the distribution of performance across all of the individual systems.

The numbers at the bottom of each chart represent percentile values. For example, the bar above 0.1 in any chart describes the 10th percentile observation; that is, the observation that is exceeded by 90% of all observations. Among 250 observations, this would be the 25th lowest value. The 0.5 bar is highlighted in red because it represents the median value, or the 50th percentile. 50% of all results exceed this value, and 50% are below.
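
The quantile summaries in Figures 3. and 4. can be reproduced along these lines; the sketch below assumes you have already computed one performance metric (say, the Sharpe ratio) for each of the 252 annual-trade-day systems.

```python
import numpy as np

def percentile_summary(metric_by_trade_day,
                       qs=(0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99)):
    """Percentiles of one performance metric across the ~252 annual-trade-day systems."""
    values = np.asarray(metric_by_trade_day, dtype=float)
    return {q: float(np.quantile(values, q)) for q in qs}

# Hypothetical usage: `sharpe_by_day` is an array of 252 Sharpe ratios, one per annual trade day.
# summary = percentile_summary(sharpe_by_day)
# summary[0.50]  # median annual system, to compare against the monthly and daily versions
```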

Figure 3. Quantile analysis of 252 annually rebalanced 252 day MA systems vs. monthly and daily traded systems

Figure 4. Quantile analysis of 252 annually rebalanced 200 day MA systems vs. monthly and daily traded systems

It is useful to compare the median performance (red bars) among all possible annually rebalanced models against the performance of the monthly rebalanced and daily rebalanced versions (green and orange bars, respectively). Note again that in every case the monthly rebalanced system outperforms the median annually rebalanced system.

Somewhat surprisingly, the Sharpe ratios of the monthly rebalanced systems exceed the Sharpe ratios for 99% of the annually rebalanced versions. You can observe this for yourself by comparing the 0.99 bar in the charts to the green and orange bars. You’d have to be incredibly lucky to trade an annually rebalanced system and exceed the performance of the monthly model; less than 1 in 100 who try are likely to be successful.

Some readers may have been wondering whether there was anything magical about the fact that the monthly traded approach always executes on the last trading day of the month. Would results vary if we traded monthly, but on the 8th day of the month, or perhaps day 17? To satisfy your curiosity, we ran the monthly traded system with trading days from day 1 to day 20 in each month to see if this made a large difference to results. Figure 5 summarizes the output.

Figure 5. Performance results for monthly systems rebalanced at each trading day of the month, 10 month MA

Data Source: Bloomberg

Some of you may be surprised to learn that rebalancing on the last day of the month carries no advantage, and may in fact be disadvantageous. Keen systematicians may choose to divide their capital and trade each fraction on a different day of the month to further stabilize results without impacting turnover (though smaller investors may incur more trading costs).
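
A minimal sketch of that tranching idea, assuming you already have the daily return series of the monthly system rebalanced on each trading day of the month (the series behind Figure 5.), with the tranche days chosen arbitrarily for illustration:

```python
import pandas as pd

def tranche_blend(returns_by_trade_day: dict, trade_days=(1, 6, 11, 16)) -> pd.Series:
    """Split capital into equal tranches, each running the same monthly system but
    rebalancing on a different trading day of the month, and average their returns."""
    tranches = [returns_by_trade_day[d] for d in trade_days]
    return pd.concat(tranches, axis=1).mean(axis=1)
```

Blending tranches in this way smooths out the trade-date luck visible in Figure 5. without changing the underlying rules.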

The goal of this article was not to conclude whether annual, monthly, or daily rebalancing is optimal for Faber’s ‘Ivy-5’ portfolio. Indeed, quite the opposite. Rather, the goal was to provide a framework for judging the statistical robustness of a simple systematic asset allocation strategy. In doing so, it’s important to test how sensitive a strategy is to small changes in important features of the system. In this case, while our tests were very consistent with the spirit of the original analysis of the Ivy 5 method, small changes to the asset universe, moving average window, and in particular trade dates resulted in material dispersion in results. For example, the 5th percentile worst outcome for all annually rebalanced approaches, per Figure 3., was a compound return under 4%, a Sharpe ratio under 0.15, and a maximum drawdown of over 25%. In contrast, the 95th percentile outcome was a compound return over 6%, a Sharpe over 0.5 and a maximum drawdown under 10%. Pretty significant.

It also became clear through our analysis that an annually rebalanced approach to an Ivy 5 type methodology is very unlikely to generate the same absolute or risk-adjusted performance as the monthly rebalanced approach, even after accounting for fairly onerous transaction cost assumptions. On the other hand, more frequent daily rebalancing incurs transaction costs that swamp any potential benefits and may be vulnerable to more frequent whipsaws which have the potential to amplify drawdowns.

As simple as possible, but no simpler!

The post Faber’s Ivy Portfolio: As Simple as Possible, But No Simpler appeared first on GestaltU.

The current article series deals with the concept of performance decay, which occurs when the performance of a systematic trading strategy is materially worse in application than it appeared during testing. We dealt with the concept of arbitrage in our last post, drawing a parallel with the phenomenon of ‘multiple discovery’ in science. Essentially, we hypothesized that many developers drawing from a similar body of research will stumble upon similar applications at approximately the same time. As these investors compete to harvest the same or similar anomalies, each investor will harvest a smaller share of the available alpha.

We also touched on reasons why we are confident that *thoughtful* active asset allocation strategies are likely to preserve their strong risk-adjusted return profile for the foreseeable future. Recall that a variety of structural impediments prevent contemporary ‘big money interests’ like pensions, endowments, and other large institutions from exploiting this arbitrage opportunity. At root, these large capital pools are constrained by group-think, corporate structure, and slow-moving governance procedures. These constraints preclude them from migrating their focus from traditional sources of alpha (i.e. security selection) to tactical sources.

This post begins our exploration of the concept of ‘degrees of freedom’ in system development. The term ‘degrees of freedom’ has slightly different meanings depending on whether the context is formal statistics or mechanical systems. While investment system design often draws from both contexts, for the purpose of this series we will skew much closer to the latter. Essentially, the number of degrees of freedom in a system refers to the number of independent parameters in the system that may impact results.

When I first discovered systematic investing, my intuition was to find as many ways to measure and filter time series as could fit on an Excel worksheet. I was like a boy who had tasted an inspired bouillabaisse for the first time, and just *had to* try to replicate it myself. But rather than explore the endless nuance of French cuisine, I just threw every conceivable French herb into the pot at once.

To wit, one of my early designs had no fewer than 37 classifiers, including filters related to regressions, moving averages, raw momentum, technical indicators like RSI and stochastics, as well as fancier trend and mean reversion filters like TSI, DVI, DVO, and a host of other three and four letter acronyms. Each indicator was finely tuned to optimal values in order to maximize historical returns, and these values changed as I optimized against different securities. At one point I designed a system to trade IWM with a historical return above 50% and a Sharpe ratio over 4.

These are the kinds of systems that perform incredibly well in hindsight and then blow up in production, and that’s exactly what happened. My partner applied the IWM system to time US stocks for a few weeks, and lost 25%. Dozens of hours and weeks of late nights at the computer down the drain.

The problem with complicated systems with many moving parts is that they require you to find the exact perfect point of optimization in many different dimensions – in my case, 37. To understand what I mean by that, imagine trying to create a tasty dish with 37 different ingredients. How could you ever find the perfect combination? A little more salt may bring out the flavour of the rosemary, but might overpower the truffle oil. What to do? Add more salt and more truffle oil? But more truffle oil may not complement the earthiness of the chanterelles.

You see, it isn’t enough to simply find the local optimum for each classifier individually, any more than you can decide on the optimal amount of any ingredient in a dish without considering its impact on the other ingredients. That’s because, in most cases, the signal from one classifier interacts with other classifiers in non-linear ways. For example, if you operate with two filters in combination – say a moving average cross and an oscillator – you are no longer concerned with the optimal length of the moving average(s) or the lookback periods for the oscillator independently; rather, you must examine the results of the oscillator during periods where the price is above the moving average, and again when the price is below the moving average. You may find that the oscillator behaves quite differently when the moving average filter is in one state than it does in another.
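
To make the interaction concrete, the sketch below conditions next-day returns on an oscillator reading separately within the above-MA and below-MA states. The RSI implementation and the 30/70 thresholds are standard illustrative choices, not the actual classifiers from the system described above.

```python
import pandas as pd

def rsi(close: pd.Series, lookback: int = 14) -> pd.Series:
    """Simple-moving-average RSI, used here only as a stand-in oscillator."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(lookback).mean()
    loss = (-delta.clip(upper=0)).rolling(lookback).mean()
    return 100 - 100 / (1 + gain / loss)

def oscillator_by_trend_state(close: pd.Series, ma_days: int = 200, lookback: int = 14) -> pd.DataFrame:
    """Average next-day return by oscillator bucket, split by trend state, to expose
    how the two classifiers interact rather than acting independently."""
    next_day = close.pct_change().shift(-1)
    state = (close > close.rolling(ma_days).mean()).map({True: "above_ma", False: "below_ma"})
    bucket = pd.cut(rsi(close, lookback), bins=[0, 30, 70, 100],
                    labels=["oversold", "neutral", "overbought"])
    return next_day.groupby([state, bucket], observed=True).mean().unstack()
```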

To give you an idea of the scope of this challenge, consider a simplification where each classifier has just 12 possible settings, say a lookback range of 1 to 12 months. 37 classifiers with 12 possible choices per classifier represents 12^37, or roughly 8.5 x 10^39, possible permutations. And while a number with 40 digits may hardly seem like a simplification, consider that many of the classifiers in my 37 dimension IWM system had two or three parameters of their own (short lookback, long lookback, z score, p value, etc.), and each of those parameters was also optimized. Never mind finding a needle in a haystack; this is like finding one particular grain of sand on the beach.

There is another problem as well: each time you divide the system into two or more states you definitionally reduce the number of observations in each state. To illustrate, imagine if each of the 37 classifiers in my IWM system had just 2 states – long or cash. Then there would be 2^37 ≈ 137 billion possible system states. Recall that statistical significance depends on the number of observations, so reducing the number of observations per state of the system reduces the statistical significance of the observed results for each state, and also for the system in aggregate. For example, take a daily traded system with 20 years of testing history. If you divide a 20 year (~5,000 day) period into 137 billion possible states, each state will have on average only 5,000 / 137 billion ≈ 0.00000004 observations! Clearly 20 years of history isn’t enough to have any confidence in this system; you would need on the order of half a billion years of daily data just to log a single observation per state.
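
The arithmetic behind those two paragraphs is easy to verify; the figures below are back-of-envelope estimates under the stated assumptions (12 settings per classifier, binary long/cash states, roughly 252 trading days per year).

```python
n_classifiers = 37
settings_per_classifier = 12

# Parameter search space: every classifier setting can vary independently.
parameter_combinations = settings_per_classifier ** n_classifiers  # ~8.5e39 permutations

# Joint signal states when each classifier is reduced to a binary long/cash vote.
signal_states = 2 ** n_classifiers                                  # ~1.37e11 states

observations = 20 * 252                                             # ~20 years of daily data
obs_per_state = observations / signal_states                        # ~3.7e-8 observations per state
years_for_one_obs_per_state = signal_states / 252                   # roughly half a billion years

print(f"{parameter_combinations:.2e}  {obs_per_state:.1e}  {years_for_one_obs_per_state:,.0f}")
```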

As a rule, the more degrees of freedom your model has, the greater the sample size that is required to prove statistical significance. The converse is also true: given the same sample size, a model with fewer degrees of freedom is likely to have higher statistical significance. In the investing world, if you are looking at back-tested results of two investment models with similar performance, you should generally have more confidence in the model with fewer degrees of freedom. At the very least, we can say that the results from that model would have greater statistical significance, and a higher likelihood of delivering results in production that are consistent with what was observed in simulation.

How many bowls of bouillabaisse would you have to sample to be sure you’d found the perfect combination of ingredients?

Because of this, optimization, like cooking, must be conducted in an integrated way that accounts for all of the dimensions of the problem at once. And this is the driving force behind the strange reality that, in the investing world as in cooking, *novices seek complexity, while veterans seek simplicity.* This is counterintuitive – even for investment professionals – which is why system design has a strange learning curve: the tendency is to move very quickly away from the simple approach that introduced you to systematic trading in the first place (in our case Faber’s work, along with The Chartist and Dorsey Wright) toward extremely complex designs, each with a very precise optimal setting.

Eventually you recognize the folly of this pursuit, and work backward toward coherence and simplicity. Of course, simple doesn’t mean easy, any more than a novice can follow a simple recipe to recreate a culinary masterpiece. As you will discover, thoughtful simplicity can be deceptively complex. We will give you an example of that in our next article. For now, please pass the salt and pepper.

The post Toward a Simpler Palate appeared first on GestaltU.

One of the most interesting phenomena observed over the centuries in science is ‘multiple discovery’. This phenomenon, so named by noted sociologist Robert K. Merton in 1963 (not to be confused with Robert C. Merton, who won the Nobel Prize in Economics for his contributions to the Black-Scholes-Merton option pricing model), occurs when two or more researchers stumble on the same discovery at nearly the same time, but without any prior collaboration or contact. Historically, these discoveries happened concurrently in completely different parts of the world, despite little shared scientific literature, and significant language barriers.

For example, Newton and Leibniz each independently developed calculus within about 20 years of each other in the late 17th century (building on earlier groundwork by Fermat, among others). Within 15 years of each other in the 16th century, del Ferro and Tartaglia independently discovered methods for solving cubic equations. Robert Boyle and Edme Mariotte independently discovered the fundamental basis for the Ideal Gas Law within 14 years of each other in the late 17th century. Carl Wilhelm Scheele discovered oxygen in Uppsala, Sweden in 1773, just 1 year before Joseph Priestley discovered it in southern England. Both Laplace and Michell proposed the concept of ‘black holes’ toward the end of the 18th century.

The 19th and 20th centuries also saw a wide variety of multiple discoveries, from electromagnetic induction (Faraday and Henry), to the telegraph (Wheatstone and Morse in the same year!), evolution by natural selection (Darwin and Wallace), and the periodic table of the elements (Mendeleev and Meyer). Alan Turing and Emil Post both proposed the ‘universal computing machine’ in 1936. Jonas Salk, Albert Sabin and Hilary Koprowski independently formulated vaccines for polio between 1950 and 1963. Elisha Gray and Alexander Graham Bell filed rival telephone claims at the patent office on the same day in 1876!

Altogether, Wikipedia has catalogued well over 100 instances of multiple discovery in just the past two centuries. If the frequency of multiple discovery is related to both the speed of communication and the number of linked nodes in a research community (a hypothesis for which I have no proof, but that is logically appealing), then the concept of ‘multiple discovery’ has important implications for current investors in the age of the Internet.

For us, there is a clear analog in quantitative finance: researchers operating independently, but sourcing ideas from a common reservoir will almost certainly stumble on similar discoveries at approximately the same time. This dynamic will almost certainly lead to some performance decay once these strategies are put to work out of sample, and with real money, as all of these investors will be attempting to draw from the same well of alpha. Indeed, in a recent paper Jing-Zhi Huang and Zhijian (James) Huang demonstrate that published anomalies do exhibit meaningful performance decay after publication, though they do in aggregate preserve some of their pre-publishing lustre out of sample. Interestingly, they also identify some simple filters that help to identify which anomalies are ‘working’ over time as they pop into and out of existence.

Note that the anomalies explored by Huang and Huang relate specifically to equity selection. We believe active approaches to global asset allocation have several advantages over strategies aimed at selecting securities within a specific asset class, and that they are less vulnerable to decay as a result.

For example, most investors have a strong home bias and are not open to approaches that stray too far from stocks and bonds of their country of residence. Strategies that propose to be agnostic to home bias, and spend substantial periods invested in unfamiliar assets are unlikely to gain mass adoption.

More importantly, major asset classes represent enormous pockets of capital, on the order of hundreds of billions, or even trillions, of dollars. Markets this deep require equally deep sources of capital to arbitrage. Yet the current large sources of capital in global markets – pensions, endowments, and other institutions – are constrained in their ability to take advantage of the opportunity in this space in three important ways:

- Many of these institutions are structured along asset class lines, with resources dedicated to each asset class silo individually. Dynamic asset allocation might see one silo receive very little capital allocation for many months or years; it is difficult to lay off employees and recall them when their asset class is back in favour.
- Where asset allocation is implemented through outside managers, dynamically shifting across asset classes would require frequent redemptions and reallocations, which may not align with the managers’ longer-term security selection strategies. Many of the more successful managers would not accept such active rotation in and out of their funds, and may choose to limit access for institutions that frequently reallocate.
- Many, if not most, institutions are managed by large boards with diverse experience and skill-sets. These boards meet infrequently, but are responsible for approving large shifts in strategy. It would represent a large departure from convention for a board to approve a meaningful shift into such a novel approach, in which most if not all of the board members have little to no experience.

For these and other reasons, we feel global multi-asset active allocation strategies have many strong years ahead of them, in contrast to many other strategies which may live and die very quickly because they do not possess the above characteristics.

In the next article(s) we will explore the impact of overlooked sources of investment returns, paying special attention to the impact of interest rates, and examine the myriad ways in which quantitative researchers ignore sources of potential bias in their models. We will also offer some thoughts on how to address these shortcomings.

The post Sources of Performance Decay appeared first on GestaltU.

The post SlideShare: Portfolio Optimization Under Uncertainty appeared first on GestaltU.

The post BPG Podcast with Preet Banerjee appeared first on GestaltU.