The way the BASIC models work is very simple math: just use a transition matrix for the various situations, one that does NOT consider the score or the time remaining. This is the way I do it for baseball. But I must admit, I never actually tested it.

In football, we may consider that if a team has a 21+ point lead, the two teams are going to play radically differently. I know this is somewhat true in hockey, where the team that is DOWN by 2 goals is more likely to score the next goal than the team that is up by 2. This happens because, for the leading team, giving up a goal (so they are only up by 1) costs more than gaining a third goal. So, what they basically want to do is reduce the goals they allow by say 20%, at the cost of reducing their own scoring by say 30%. It’s a “small ball” kind of tactic. At the same time, the team that is behind wants to increase the goals they score by 20%, even if it means increasing the goals they allow by 30%. The net effect is that it does NOT cancel out.
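A quick back-of-envelope, using the illustrative 20%/30% trade-offs above (these percentages are for argument's sake, not measured rates), shows why the effects don't cancel:

```python
# Both teams score at the same baseline rate when the game is tied.
BASE = 1.0

# Team UP by 2: cuts goals allowed by 20%, at the cost of 30% of its own scoring.
# Team DOWN by 2: boosts its scoring by 20%, at the cost of 30% more goals allowed.
leader_scoring  = BASE * 0.70 * 1.30  # leader shoots less, trailer defends worse
trailer_scoring = BASE * 1.20 * 0.80  # trailer shoots more, leader defends better

p_trailer_next = trailer_scoring / (trailer_scoring + leader_scoring)
print(f"P(trailing team scores next) = {p_trailer_next:.3f}")  # → 0.513
```

So the trailing team scores the next goal about 51% of the time, even though both teams were equals at baseline.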

So, we know hockey teams play differently. We suspect that maybe football teams play differently, hence the idea that the Patriots had a 99.5% chance of winning is probably wrong. Indeed, someone tweeted out that when you look at teams where the model said they had a greater than 95% chance of winning, they actually ended up winning less than 90% of the time. The models, therefore, were too basic.

How about baseball? Well, that was a great idea, so I applied it to baseball, 2010-2016. I looked for all games where the home team FIRST had a 95%+ chance of winning, prior to the 5th inning. Remember, my Markov chain is based just on the run expectancy, and so is unaware of any change in strategy. Does it matter? There were 1122 games that met the criteria. The average estimated win probability was 0.958. The actual number of wins was 1082 and actual losses was 40, for a win% of 0.964. So, that one works.

How about a 99%+ chance of winning, in the 8th or later innings? The average estimated win probability was 0.994. The actual was 4268 wins and 11 losses, for an actual win% of 0.997.
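For anyone who wants to reproduce this kind of calibration check, the structure is simple. Here's a sketch on simulated games (not the actual 2010-2016 data):

```python
import random

def calibration(games, threshold):
    """Among games whose estimated win probability reached the threshold,
    compare the average estimate to the actual win rate."""
    sel = [(p, w) for p, w in games if p >= threshold]
    est = sum(p for p, _ in sel) / len(sel)
    act = sum(w for _, w in sel) / len(sel)
    return est, act, len(sel)

# Simulated games whose outcomes follow the stated probabilities exactly,
# so a well-calibrated model should show est ≈ act.
random.seed(1)
games = [(p, 1 if random.random() < p else 0)
         for p in (random.uniform(0.5, 1.0) for _ in range(20000))]

est, act, n = calibration(games, 0.95)
print(f"n={n}  estimated={est:.3f}  actual={act:.3f}")
```

The real test substitutes the Markov chain estimates and actual game outcomes for the simulated pairs.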

It seems therefore that in baseball, when we say that the chance of winning is 99%, we actually do mean it is 99%.

***

My preference would be to use a reaction point based on a constant TIME to home, but BPro went with a constant distance. My expectation is that I’ll be doing something along these lines as well, so it’s great to see so much great work already out there.

***

I’ll offer Phil even more relevant cases: MLB AL v NL, 2005-present (but since I cherry-picked 2005, you should start in 2004). You can look at NBA West v East. And closer to home: CFL West v East, and NHL West v East.

It offers a good look at the “leakage” issue that Phil brings up, especially since inter-conference games are much fewer in MLB than NBA. IIRC, I think I looked at the CFL, and it seemed that the proportion of games was just about right, such that their win% represented their true talent levels. I think.

Anyway, would love to see the Phil-touch applied here.

***

What he did was try to forecast a future matchup, using: (a) their actual past matchup and (b) their season-to-date totals. And he found that BOTH added value. To which I say: well, obviously, both HAVE to add value. I’ll explain in a sec.

He also said that the season to date gets weighted at twice their past matchup. To which I say: well, I don’t know what that means unless I know the average number of PA in each group. I’ll explain in two secs.

First, the first one: adding data is always good. So, no one is saying that matchup data should get tossed. It just needs to get added to the pile, basically, unweighted.

The second one: When you use “season to date”, you are using 0 to 600 PA for each hitter and 0 to 900 PA for each pitcher, more or less. The average for a starting hitter would mean about 300 PA, but since he includes ALL hitters, the average is probably… I dunno… 150 PA? Similarly, for SP, the average is probably 400 and for relievers it’s 50 or something, so maybe it’s 200 PA as the average? I dunno. Let’s say the harmonic mean of the average number of PA for each pairing being forecasted is around 175 PA.

In addition to that, we have the matchup sample, which has about 50 PA. From my standpoint therefore, using these pure guesses, we’d expect season-to-date to be weighted at 3:1 or 4:1, simply because one sample has 3x to 4x more PA than the other sample. Instead, we find it’s only 2:1. That is, matchup data gets “overweighted” by a factor of 2 relative to non-matchup data. That’s a significant finding.(*) But, I’d like to see Pizza fill in all the numbers I’m guessing.

(*) Note that in my unpublished research, with a methodology that is different from Pizza’s, I can confirm that the weighting is definitely higher than 1x. So, if anyone wants to keep researching this, you will find something.
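The back-of-envelope arithmetic, laid out in code (every PA figure below is one of the guesses from the text, not a measured value):

```python
def harmonic_mean(a, b):
    return 2 / (1 / a + 1 / b)

# Guessed season-to-date samples: ~150 PA per hitter, ~200 PA per pitcher.
season_to_date = harmonic_mean(150, 200)
matchup = 50  # guessed matchup sample size

print(f"harmonic mean = {season_to_date:.0f} PA")  # 171, call it ~175
print(f"naive weight ratio = {season_to_date / matchup:.1f}:1")  # 3.4:1
```

That 3.4:1 naive expectation versus the reported 2:1 is where the "overweighted by a factor of 2" claim comes from.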

Two points: as MGL brought up in the comments, you have to account for the handedness issue (batter and pitcher). That’s an easy variable to include. The other is that Pizza uses “season to date”, which means he starts at game 1. Which means it’s kind of a mess in terms of some samples having only 1 PA in season to date (actually, it starts at 0) and others at 600 PA.

Really, in order to do an apples-to-apples comparison, he should simply limit it to match each player’s actual number of trials. If, let’s say, Rice/Guidry faced each other 53 times, then I’d look at Rice’s first 53 PA in the season against LHP (which, by the way, MUST exclude Guidry), and Guidry’s first 53 PA in the season against RHH. What this does is make clear we have two equal-sized samples, and we’ve neatly handled the handedness issue. Then we just figure out the weights.
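Here's a sketch of that matched-sample construction; the record fields and structure are my own invention for illustration, not anyone's actual code:

```python
def matched_samples(pa_log, batter, pitcher, n):
    """pa_log: season-ordered PA records, each a dict with keys
    batter, pitcher, bat_side, throws. Returns the batter's first n PA
    vs same-handed pitchers (excluding this pitcher), and the pitcher's
    first n PA vs same-handed batters (excluding this batter)."""
    bat_side = next(pa["bat_side"] for pa in pa_log if pa["batter"] == batter)
    throws   = next(pa["throws"]   for pa in pa_log if pa["pitcher"] == pitcher)

    bat_pa = [pa for pa in pa_log
              if pa["batter"] == batter
              and pa["throws"] == throws
              and pa["pitcher"] != pitcher][:n]
    pit_pa = [pa for pa in pa_log
              if pa["pitcher"] == pitcher
              and pa["bat_side"] == bat_side
              and pa["batter"] != batter][:n]
    return bat_pa, pit_pa
```

With n set to the actual number of matchup PA, the two control samples are the same size as the matchup sample by construction.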

And if Pizza does this, we may find we need to weight matchup data at 2x relative to non-matchup recent data.

***

I’m a little confused/disbelieving of the “start at 70 games for regression, gradually go to 35”. That said, I highly respect Adam’s work, so I’m keen to learn more.

There was also discussion of the decay rate. For players, I basically use a decay rate of 0.999, or 999/1000, which is a far cry from the numbers he was discussing (albeit at a team level). One thing that is different with teams is that the team makeup changes in response to their current record. So, it would make sense that the decay rate is different. To get around it, or at least “be fair”, you can “reverse” the season, and start at game 162 and go backward. The point of the decay rate is to tie recent performance to recent talent. But, the other point however is to tie it to a team’s future talent level (and change in roster). So, a couple of ways to go here.
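For concreteness, here's what a 0.999 per-game decay rate does, as a sketch (games are treated as equally spaced, which is a simplification):

```python
def decayed_win_pct(outcomes, decay=0.999):
    """Decay-weighted win%. outcomes[-1] is the most recent game
    (1 = win, 0 = loss); each game further back is discounted by `decay`."""
    num = den = 0.0
    for age, won in enumerate(reversed(outcomes)):
        weight = decay ** age
        num += weight * won
        den += weight
    return num / den

# A game from a full season ago (~162 games back) still carries
# 0.999**162 of the weight of yesterday's game:
print(round(0.999 ** 162, 2))  # → 0.85
```

To "reverse" the season as described, you'd just feed the outcomes in the opposite order.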

What I’d like to see anyway is how the regression model and the Elo model each work best with actual MLB (and preferably NBA) teams. NBA because talent is FAR more tightly, and FAR more quickly, tied to a team’s record.

Love the whole article!

***

Dave notes that when Posada plays as a sub, the Yanks had a .385 win% (72 wins, 115 losses). We can guess what happens: a catcher is given a day off, but he comes in to “play”, which really means he’s coming in as a PH because the team is down, and his backup catcher is a terrible hitter. This is a huge selection bias. You can’t look at the end result (W/L) by assuming that they had a reasonable shot at winning the game, as you would at the START of the game. What you’d actually have to do is figure out the win expectancy AT THE POINT HE ENTERED THE GAME. I’ll leave that for someone else to do.

More simply, what you really should do is break out the W/L record based on whether Posada started the game or not. In games he started, the Yanks were .602. In games he didn’t start, the Yanks were .594. By this simple calculation, the Yanks were about 12 games better with Posada starting than without. No big shakes, but at least it’s more honest than looking at it by games played.

Thanks, Dave, that was fun.

***

More interestingly, MGL’s control group is pitchers who had the (similar) bad ERA overall, without a big number of bad starts. Again, his forecast did not look for number of bad starts. And their overall forecast was the same as the studied group. Except this control group actually outperformed their forecast.

This tells me that either:

- MGL’s forecasting system is not good enough (say 5% likelihood)
- Number of bad starts is actually a real indicator, but pointing the other way: “fewer bad starts, given the same ERA as someone with many bad starts” is the positive sign (say 20% likelihood)
- Sample sizeitis (say 75% likelihood)

But for those hanging their hat on “if not for those really bad starts…”, they won’t find it here.

***

This study suffers from selection bias. This is not to pick on Rob, but it permeates most studies of this kind. Even saber-watcher MGL is not immune to it, which you can see based on his study of rookie and veteran pitcher forecasts. And I’m just as guilty as everyone else, as I suspect a good portion of my studies have this bias that needs to be accounted for.

I think the clearest way to describe the problem is this way: a great NFL team is favored to win its first game (against an ordinary team) by 10 points. They lose by 1. In the second game, they are favored to win by 9 points (again against an ordinary team). They lose by 3. In the third game, they are favored to win by 6 points. They lose by 6. In the fourth game, they are treated as even odds against an average team. They lose by 12. In the fifth game, the oddsmakers start to realize their forecasts are all wrong. They are now underdogs in their fifth game by 10 points. They win by 1. Undeterred mostly, they are underdogs in the 6th game by 9, but they win by 3. In the 7th game, they are underdog by 6, but win by 6. And in the 8th game, they are even odds, but they win by 12.

Overall? The oddsmakers are 0-8 in terms of direction. Overall, this NFL team won 4 games and lost 4 games. When they were favored, they were favored to win by 25 points in 3 games. When they were underdogs, they were favored to lose by 25 points in 3 games. Overall, their expected margin of victory was exactly 0 points in 8 games. And in fact their outcome was that they in fact scored as many points as they allowed in 8 games. If you look overall therefore, it seems as if the oddsmakers nailed it. Except. Well. Basically, the oddsmakers were trying to play catchup, and overshot, only to end up at the same place. Except of course, they set horrible lines game by game. (Presumably.)
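The bookkeeping in that example checks out; here it is, laid out (positive line = favored, positive margin = won):

```python
# The eight games: (point spread, actual margin) for the team in question.
games = [(+10, -1), (+9, -3), (+6, -6), (0, -12),
         (-10, +1), (-9, +3), (-6, +6), (0, +12)]

total_line   = sum(line for line, _ in games)
total_margin = sum(margin for _, margin in games)
# Direction misses: the sign of the margin never agreed with the line
# (the two pick'em games, missed by 12 points each, count as misses too).
misses = sum(1 for line, margin in games if line * margin <= 0)

print(total_line, total_margin, misses)  # → 0 0 8
```

Aggregated, the lines look perfect; game by game, they were wrong every single time. That is the selection-bias trap in miniature.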

Getting back to Rob’s article: if you drastically change your approach, the guys who get good results will continue that approach. The guys who don’t will stop and go back to what worked for them. In the end, if you see a drastic change in approach using the final season stats, then you can bet that he was successful at it.

This is why for example Marcel sets ALL rookies to be league average hitters, even guys in A ball. Why? Because if they end up with 150 plate appearances in MLB, you can bet that they were overall close to average hitters. And if these A ball hitters end up with 30 PA and a .100 wOBA? Well, not much harm with only 30 PA. The rookies with 600 PA (and their undoubtedly above-average wOBA) will cover them. Overall, rookies end up looking good. It’s not average actually. It’s more like 20 wOBA points below average. If you look at forecasts from ZiPS and Oliver and Steamer, you will see they’ll forecast rookies from a .200 wOBA to a .400 wOBA. And they’d be “right”. But, there’s no way to test their forecasts for most of them! The point is that we end up with a self-selecting sample. Which is not what you want from a study.

In order to correct Rob’s study, and it’s similar to what I tweeted to MGL: you have to base the change in approach on the data through June 30. THEN, you look at the results from July 1 onwards. Of course, he may have changed his approach back to his usual in August. So, if you REALLY want to do this right, you do it game by game, whereby you have his “presumed forecasted” approach for the UPCOMING game. And then see the results for that presumed approach. And you keep doing that, game by game.

The same idea applies to what MGL did. The same idea applies in trying to match manager choices to his reliever talent in terms of getting high LI for his best relievers. Doing anything else simply biases the end results.

***

Even without that, we still end up with the issue that balls that are severely pulled, or sliced the other way, will not be typical batted balls, as the hitter loses exit speed, and the launch angle will be much different than he intended. In effect, when you compare a hitter, home and away, you can’t just look at his exit speed, or his exit speed and launch angle, but also his spray direction.

So, for my first crack at creating Park Impact numbers for each park, I did the following. First, for each hitter, I broke up his data as to whether he was at his home park or away. In addition, I broke up each batted ball as having been hit close to the LF line (-30 degrees or “greater” in magnitude), left to straightaway center (0 to -30), straightaway center to right (0 to +30) or close to the RF line (+30 degrees or greater). As well as whether he did that as a RHH or LHH.

I then matched that grouping (batter, batside, slice) at home to the same grouping away. So, we are controlling our variables as much as we can. For each grouping, I determine the median exit speed and median launch angle. Median works better than average, because of the really wide range of mishits. I weight each pair by the lesser of the two numbers of contacted balls. So, if Ortiz has 70 batted balls at Fenway that he hit to the right-CF slice and 80 away from Fenway, I weight that pair as 70. I do this for every hitter, and total things at the venue/batside/slice level. (I set the minimum to 10 contacted balls for whatever slice I’m looking at, for each hitter.)
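A sketch of that pairing-and-weighting scheme in code; the data structure is my own, purely for illustration:

```python
import statistics
from collections import defaultdict

MIN_BALLS = 10  # minimum contacted balls per hitter per slice

def slice_of(spray_deg):
    # Four horizontal slices; negative angles are toward LF
    if spray_deg <= -30: return "LF line"
    if spray_deg < 0:    return "left-CF"
    if spray_deg <= 30:  return "right-CF"
    return "RF line"

def park_impact_mph(balls):
    """balls: (batter, batside, spray_deg, at_home, exit_speed) tuples for
    hitters whose home park is the venue being measured. Returns the
    home-minus-away difference in median exit speed, with each
    (batter, batside, slice) pair weighted by the LESSER sample size."""
    groups = defaultdict(lambda: {"home": [], "away": []})
    for batter, side, spray, at_home, ev in balls:
        key = (batter, side, slice_of(spray))
        groups[key]["home" if at_home else "away"].append(ev)

    num = den = 0.0
    for g in groups.values():
        home, away = g["home"], g["away"]
        if len(home) < MIN_BALLS or len(away) < MIN_BALLS:
            continue  # enforce the 10-ball minimum on BOTH sides
        weight = min(len(home), len(away))
        num += weight * (statistics.median(home) - statistics.median(away))
        den += weight
    return num / den if den else 0.0
```

The same loop run on launch angle instead of exit speed gives the angle impact; run on the wOBA value of each speed+angle pair, it gives the wOBA impact.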

Here for example is Oakland: there were 7 RHH, totalling 242 contacted balls, who hit balls to the right-CF slice. Their median launch speed was 91.8 mph in Oakland, while those exact same Oakland hitters had a launch speed of 91.3 mph away from Oakland. The net impact therefore is +0.5 mph in Oakland. Their launch angle was 25 degrees in Oakland and 22 degrees away, for +3 degrees of impact. The total net effect, if we look at the speed+angle pairing, was a .375 wOBA in Oakland and .440 away from Oakland, for a wOBA impact of negative .065.

When I look at the righty hitters pulling the ball to the left-CF slice, we get the same +3 degree impact, except we also get a -2.2 mph hit. The net effect is negative .052 in wOBA.

When I look at the 2 slices for LHH, I don’t get similar results, with a net effect of less than 10 wOBA points.

Overall, these 4 slices (2 slices for LHH and 2 for RHH) give us these totals: +1.1 degrees of launch in Oakland compared to the same hitters away from Oakland, and -0.4 mph. The wOBA drop is 34 points in Oakland, which is the largest drop in MLB.

The park where the launch+speed pairing helps batters the most is in Arizona.

More to come…

***

That profile, as you see at the bottom, is consistent with allowing a wOBA of .348 on contacted balls, about 20 points better than league average. I looked at other similar pitchers. The 10 pitchers whose profiles were most similar to McHugh’s were estimated to have allowed contacted balls consistent with a .358 wOBA. Their actual wOBA on contact was .360.

So, what happened? In each of the six zones above, he ended up with a higher wOBA than the league average. His 25 Barrels should have resulted in an estimated wOBA of 1.447, but instead it was an actual 1.620. That is, what could have been a couple of long fly outs instead became HR.

The 25 Solid Contacts should have resulted in a wOBA of .688, but instead, he got saddled with an actual 1.060. He really got hurt here.

His 129 Flares and Burners should have resulted in a wOBA of .617, but instead it was a bit worse at .680.

His 123 balls that he hit under should have resulted in a wOBA of only .117, but again, he couldn’t catch a break, and they resulted in a wOBA of .160.

His 145 balls that he topped: shouldabeen .185, actuallywas .230.

And even the 71 weakly contacted balls should have been a .079 wOBA, but instead were at .150.

Up and down, across the board, McHugh ended up with an observed .426 wOBA, even though his profile would suggest .348.
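Adding up the six zones above reproduces those totals, to within the rounding of the published per-zone wOBAs:

```python
# McHugh's zones from the text: (contacted balls, estimated wOBA, actual wOBA)
zones = [
    ("Barrels",         25, 1.447, 1.620),
    ("Solid Contact",   25, 0.688, 1.060),
    ("Flares/Burners", 129, 0.617, 0.680),
    ("Hit under",      123, 0.117, 0.160),
    ("Topped",         145, 0.185, 0.230),
    ("Weak contact",    71, 0.079, 0.150),
]

n   = sum(balls for _, balls, _, _ in zones)
est = sum(balls * e for _, balls, e, _ in zones) / n
act = sum(balls * a for _, balls, _, a in zones) / n
print(f"{n} contacted balls: estimated {est:.3f}, actual {act:.3f}")
```

The weighted averages land at roughly .347 estimated and .422 actual; the article's .348 and .426 were presumably computed from the unrounded zone values.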

Whether this is because of his horizontal spray pattern, or fielder alignment, or fielding talent, or just plain ole Random Variation, I don’t know (yet). But whatever it was, it hurt McHugh more than any other pitcher.

***

- The red zone is Barrels, where you get a 1.433 wOBA.
- The lighter zone enveloping it is Solid Contact, where you get around a .692 wOBA.
- The Flares and Burners zone is that odd combination of trading plenty of speed for loft, or vice versa, so the ball drops in/between fielders, for a wOBA of .630.
- The yellow zone is the weak contacts, at a .046 wOBA.
- The blue zone is the sky-high balls, anything you hit under, for a low .095 wOBA.
- The green zone is the grounders in the grass, balls you topped a bit too much, for a somewhat low .206 wOBA.

What this shows is the vertical launch angle, as an angle, with the exit speed as the radius. The inside yellow line is 60mph, and the outside yellow line is 100mph.

The colors represent performance, in the form of wOBA. The dark red is wOBA over .900, with the light red above .400 and the blue under .250, or thereabouts. We’ve basically mapped out every combination of speed and launch angle. And as a batter, you want to get the combination of speed and angle that is toward the red, and away from the blue.

All we have to do therefore is figure out the wOBA for each pair, and add up each hitter’s pairs of angle and speed, and voila, we will have his estimated wOBA.
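As a toy version of that lookup (the bucket boundaries and wOBA values here are invented for illustration; the real grid is far finer than this):

```python
# League wOBA by (exit speed, launch angle) bucket: invented numbers.
LEAGUE_WOBA = {
    ("100+", "20-35"): 1.40,  # barrel-ish territory
    ("100+", "other"): 0.45,
    ("<100", "20-35"): 0.35,
    ("<100", "other"): 0.25,
}

def bucket(speed, angle):
    return ("100+" if speed >= 100 else "<100",
            "20-35" if 20 <= angle <= 35 else "other")

def estimated_woba(batted_balls):
    """Average the league wOBA over a hitter's (speed, angle) pairs."""
    vals = [LEAGUE_WOBA[bucket(s, a)] for s, a in batted_balls]
    return sum(vals) / len(vals)

print(round(estimated_woba([(104, 28), (88, 12), (95, 30)]), 3))  # → 0.667
```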

But, is that the most interesting thing we can do here? What would you prefer, a hitter’s batting average or the individual components of 1B, 2B, 3B, HR? How about his slugging, or the 4 individual components? How about wOBA or his 4 individual components?

Some people may suggest they’d prefer the wOBA. And that’s certainly possible and understandable. Because sometimes you don’t want to know more. Whenever you present a single-number encapsulation of a set of components, the conversation ends. What you will have done is taken a series of data points, merged them into one, and… presented it. It just ends there, because you can no longer unravel it.

Suppose you wanted to do FutureWOBA. The chance that the formula for FutureWOBA matches wOBA is zero. wOBA says to give a weight of 0.9 to the 1B, 1.25 to the 2B, 1.6 to the 3B and 2.0 to the HR. To do FutureWOBA, given only wOBA, you can only do something like .8*wOBA + (1-.8)*lgWOBA. That .8 is either a flat constant, or you can get better results by replacing it with PA/(PA+250).
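In code, that's really all you can do when handed only the single number (the league average here is an assumed .320):

```python
LG_WOBA = 0.320  # assumed league average, for illustration

def future_woba_from_woba(woba, pa):
    """Regress the single number toward league average,
    with weight PA/(PA+250) as described above."""
    w = pa / (pa + 250)
    return w * woba + (1 - w) * LG_WOBA

print(round(future_woba_from_woba(0.400, 500), 3))  # → 0.373
```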

But the actual FutureWOBA would almost certainly look more like this: 0.6 1B, 1.0 2B, 1.2 3B, 1.6 HR (and PLUS 0.1 outs!). The point is that the weights won’t be maintained. And that’s because the HR is more indicative of talent than 1B. It simply tells you more about a hitter, so it’s better to know that as a component, so when it comes time to figuring out the talent of your player, you can get the better weights. But even if it didn’t, you’d still want to know the components separately so you know whether you have Wade Boggs or Jim Rice putting up a .400 wOBA. You want to know the profile of the hitter, just to know his profile.

So, getting back to the above image: it’s likely that the information contained in the barrels will need to get more weight than in the other zones. In order to understand the hitter, you need to know his profile. This is true whether you want to know about his future, or simply want to know about what you have on hand as a hitter. Knowing the profile of the hitter is better than knowing the single end point. The profile keeps the conversation going, while the single end point ends up being a single data point.

Yes, there’s no reason to choose between the two, and we may as well present it all. The point remains however that the value is greater at the component level, whereas at the summary level, it becomes a summary opinion with some evidence. The fun though is in the evidence.

Data tomorrow.

***

So, many exciting things will come of this. First, on a hitter level, we can see how each hitter attacks each part of the zone, and by count. We can of course do even more exciting things than just the plate count: the sequence of the plate count. He’s in a 2-1 count: did he get there from 2-0 or 1-1? Given how he got there, what is his swing zone? If he got to the 2-1 count from a 2-0 count, did he get there on a swinging strike or a called strike? And based on that, what is his swing zone? We can include whether the pitch was a fastball or curve. We can do it based on the previous pitch being a fastball and the current being a fastball. Really, there’s no end to the combinations. Or, you can simply just look at the most high-level view. The user can decide how micro- or macro- to look at the splits.

Then, think of it from the pitcher’s perspective: compare the swing zones of Verlander and Felix. And repeat all the above variables.

This is the first step. Lots more to come.

***

The overall numbers: .561 win% for the home team, which is above the .540 regular season average, but just one standard deviation above. Breaking it up though based on who started the series:

- .605 home win% if they started the series at home
- .514 home win% if they didn’t start the series at home

The first is over 2 standard deviations away. Naturally, look hard enough, and you’ll find something that is 2 standard deviations away. That by itself doesn’t mean anything. But I think we would have been more shocked if those numbers had been flipped. That is, we probably had a prior that given that the overall was .540, that the “true” split might have been .550/.530 perhaps based on whether they started game 1 at home.
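The standard-deviation claims are easy to check with the binomial approximation. Note the game count below is hypothetical, since the post doesn't give the actual counts:

```python
import math

def z_score(wins, games, p0=0.540):
    """How many binomial standard deviations the observed home win%
    sits from the expected p0."""
    p_obs = wins / games
    sd = math.sqrt(p0 * (1 - p0) / games)
    return (p_obs - p0) / sd

# e.g. a .605 win% over a hypothetical 400 started-the-series-at-home games:
print(round(z_score(242, 400), 2))  # → 2.61
```

At a sample size in that neighborhood, a .605 observed win% against a .540 expectation is indeed a bit over 2 standard deviations.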

Of course, we don’t need to rely on wins, but rather on runs, or just plain ole wOBA.

But we also need to know the underlying talent in those games, since the Game 1 starters would likely be disproportionately better than the Game 2 and 3 starters.

Anyway, fascinating premise, and we just need more data at this point.

***

If you do runs at the seasonal level, then why not runs at the game level? Or at the inning level. That in fact is what RE24 is: it makes sure all the runs are accounted for at the inning level (indeed at the play level). It is actually the closest bridge we have between sabermetrics and the mainstream.

But what about wins? Matt Cain gives up no runs in 9 innings, while in the same game Cliff Lee gives up no runs in TEN innings, a game that the Phillies lose. If the intent is to use wins and losses as a natural end point to make sure things add up, a checksum so to speak, then we want things to add up at the game level. We shouldn’t come up with something that says that the Phillies had 0.45 wins and 0.55 losses in a game they lost, and similarly, the Giants shouldn’t add up to 0.55 wins and 0.45 losses in a game they won.

Well, maybe YOU do. Maybe you actually don’t care about who actually won and lost. You just care about what the players did, and a margin of victory of 1 run and 10 runs should lead to different answers at the team level. Suddenly, to YOU, it’s not just about wins and losses but also about margin of victory. And if you lose two games 1-0 and you win one game 10-0, you don’t have a won-loss record of 1-2, but a won-loss record of 1.9-1.1.

So, this preamble is to set up the article that Bill James wrote here, and you can see his responses in the comments area, as well as mine. For those who aren’t members of his site, I’ll copy/paste all of my comments below, plus a tiny snippet from Bill that directly relates to my comments:

A reader wrote:

The “Luck or Timing” element, as I understand it, is completely ignored in WAR calculations.

Actually…. WAR is a FRAMEWORK. Baseball Reference has its own IMPLEMENTATION as does Fangraphs. Much like not all houses look alike, but they all follow the same standards. You supply all the building materials, the SAME building materials, and some will build a house one way, and some will build it another way. Some won’t use all the materials.

So, you definitely could include luck/timing in YOUR implementation of WAR. The WAR framework is there. It’s solid. If you don’t like the houses that BR and Fangraphs built, then no problem! Build your own. And you might be surprised how similar it will look to the others.

***

I like the way Bill James characterizes these things: these are all estimates, from various different viewpoints. There’s a great deal of commonality or overlap. That Win Shares and the various WARs out there will generally agree on, say, the top 10 nonpitchers and top 10 pitchers is a point in favor of having different approaches. It proves that there are multiple ways to estimate the same thing.

If you prefer a different analogy: the “inflation” rate is not something that is just handed down, and it’s not something where there’s exactly one way to calculate it. It’s an estimate. We are trying to model reality with the limited data we have, which itself is subject to potentially a great deal of bias.

That’s all we’re doing here, trying to come up with the best truth we can.

***

As for the consideration of wins: the way Win Shares does it, it allocates whatever it can’t account for in some proportional sense. So, if it’s short 10 wins, then it’ll assign those 10 wins in some manner. Is that necessarily a good thing? Could it be a bad thing? Would we be better off having a bucket that says “timing”? I don’t know.

But you can do this with WAR right now. Just build your own WAR version based off of Baseball Reference version. And whatever gap is remaining, just distribute that gap say by adding +10 wins to the players based on playing time. Or, anything really that you want.

That’s EXACTLY how I do it here:

http://tangotiger.net/wonloss/index3.php?teamid=SEA&yearid=2001

There you go, a WAR-based system where the wins and losses add up to the team win totals.
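Mechanically, that distribution step looks something like this sketch (the function name, the replacement baseline, and the numbers are invented for illustration):

```python
def reconcile_to_team(war, playing_time, team_wins, baseline_wins):
    """Spread whatever gap remains between actual team wins and the
    wins already accounted for, proportionally to playing time."""
    gap = team_wins - (baseline_wins + sum(war.values()))
    total_pt = sum(playing_time.values())
    return {p: war[p] + gap * playing_time[p] / total_pt for p in war}

# A team that "should" have won 93 (89 baseline + 4 player wins) but won 95:
adjusted = reconcile_to_team({"A": 3.0, "B": 1.0}, {"A": 600, "B": 200},
                             team_wins=95, baseline_wins=89)
print(adjusted)  # → {'A': 4.5, 'B': 1.5}
```

By construction, the adjusted player wins plus the baseline now add up to the team's actual win total.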

***

The question is how to bridge that gap. And does bridging that gap automatically make it better? The method Bill applies, and the method I apply, is simply to distribute that gap with no real meaning. Bill says “let’s make it proportionate to claim points”, and I say “let’s make it proportionate to playing time”. But for all I know, both those choices are worse than simply creating a “I dunno” bucket, because the reality is, we don’t know!

But people don’t want that. Since they know the players played, they want that gap filled… by the players. So, Bill gave it to them, and I give it to them. I don’t know that it’s a good choice. I don’t know it’s a bad choice. But it is the most palatable choice.

So, I challenge the assertion that a system that adds things up to wins is necessarily good. It’s only good if you can somehow prove that doing so is done in a way that reflects the wins.

As much as people don’t like WPA, WPA is likely the best way to bridge that gap. But, people come out guns blazing on WPA. So, instead of doing it that way, we simply fill the gap the very simple way.

***

Bill responded, of which I’ll copy a tiny snippet:

... To adjust for it, as Tom says, is speculative—but NOT to adjust for it is equally speculative, and certain to be wrong.

...

I don’t adjust for that over-achievement because the public prefers it; in fact, my perception is that the public would prefer that I NOT adjust for it. I adjust for it because I think you HAVE to adjust for it. It’s too large to ignore, and it creates large-scale inaccuracies if you ignore it—just as much as if you ignore fielding or ignore base running. It’s part of the game; you can’t ignore it.

...

Is the won-lost record “real”, or are the individual stats the end point of the line, the real accomplishments?

What I was trying to get at in this article is, in part, that if you treat the individual stats as the end point of the line, then you’ve wiped out the games. The games no longer exist; only the individual accomplishments. I don’t think that’s a viable position. I think that we HAVE to treat the outcomes of the games as real events demanding acknowledgement in the analysis.

***

Bill, would you agree therefore, to be consistent with your goal, that the accounting should work at the game level? For example, regardless of what other Red Sox starters do, we should account for Porcello’s performance not only across his starts but start by start. And make sure that in games the Sox won, it adds up. And same with the losses.

As an extreme example, Matt Cain pitched 9 innings of no runs in a game that went into extra innings. Cliff Lee pitched TEN innings of no runs… and the Phillies eventually lost.

http://www.baseball-reference.com/boxes/SFN/SFN201204180.shtml

If we truly want to encapsulate this game to account for the 1 win for the Giants and 1 loss for the Phillies by assigning all the accounting to the players, we’re going to get into some fairly tough situations.

If you want to take a higher level view, and then just treat this one game as part of a 162 game season, and so, do everything at the seasonal level, then I think Bill is going away from his point that the game is what we have to account for. We’re adding a layer of abstraction by taking advantage of luck (mostly) evening out… and when it doesn’t, we’ll just have less difficult decisions to make than doing it game by game.

So, I don’t know that it’s more wrong or less wrong to just create a “timing” bucket, and dump everything that we don’t know in there.
