I’ll start out with a warning. This is a post about posts about math. If that’s not your cup of tea, that’s great – you sound well adjusted. If trying to figure out what new pitching metrics are telling us in baseball terms sounds like fun, read on.
On Wednesday, Jonathan Judge released a new public pitching measure, Context-FIP, or cFIP for short. There’ve been many run estimators over the years, including the mostly descriptive ERA, RA/9 and FIP, and a series that regress past results in order to capture a pitcher’s elusive “true talent” and thus help predict his future results. For a number of reasons, ERA is exceptionally poor as a predictive measure. It conflates what a pitcher does with the contributions of his team, it ignores parks and opponents, and its attempt to strip errors out causes some odd effects that make it even less predictive than plain old RA/9. More predictive metrics like xFIP strip out defense and replace actual HRs allowed with an expected total based on fly balls – effectively regressing HR/FB rate all the way to league average. xFIP is generally pretty close to FIP, but it’s more stable from year to year. Importantly, that extra stability is either a feature or a bug, depending on what you’re using it for. What Judge’s statistic attempts to do is bridge the gap between predictive models and descriptive ones. It’s not necessarily the best at predicting future runs allowed, but given the noise involved in *that*, Judge argues that we need to evaluate not only how well a metric predicts runs out of sample, but how well it predicts *itself* in future years.
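For reference, the FIP family is simple enough to sketch in a few lines of Python. The constant varies by league and season (it’s chosen to put FIP on the same scale as ERA); the 3.10 below is just a placeholder:

```python
def fip(hr, bb, hbp, k, ip, const=3.10):
    """Standard FIP: fielding-independent events only, scaled to look
    like an ERA. The constant is league/season-dependent (placeholder here)."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + const

def xfip(fb, lg_hr_fb, bb, hbp, k, ip, const=3.10):
    """xFIP: identical, except actual HRs are swapped for expected HRs
    (the pitcher's fly balls times the league HR/FB rate) -- i.e., HR/FB
    is fully regressed to league average."""
    return (13 * (fb * lg_hr_fb) + 3 * (bb + hbp) - 2 * k) / ip + const
```

A pitcher whose HR/FB rate happened to match the league’s would post identical FIP and xFIP; the two diverge exactly when a pitcher’s HRs run hot or cold relative to his fly balls.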
One of the tantalizing aspects of cFIP is Judge’s use of “mixed models” to calculate it. Instead of ignoring everything from batter handedness to ballpark to umpire, the model incorporates them as “random effects,” keeping them segregated from the fixed effects (things with only a set number of possibilities; no matter how many observations you make, pitchers will throw with either their left or right hands). The model can then examine the random effects to see how they affect runs, gaining certainty as you add more and more observations/data. So while FIP treats all HRs the same, and xFIP strips out all actual HRs, cFIP is an early example of a cool hybrid. A HR to Troy Tulowitzki in Colorado is *different* than a HR to Brendan Ryan in Safeco, and it’d be cool to incorporate that information into a FIP-like statistic. Judge is a great writer, and the explanation of the approach is surprisingly readable – he outlined mixed models in this great post on catcher framing, and his description of their application to cFIP is lucid even to a non-gory-math person like me.
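To make the random-effects intuition concrete, here’s a toy back-of-the-envelope sketch – emphatically not Judge’s actual model: the estimate for any individual effect is pulled toward the population mean, and the pull weakens as the sample grows. The stabilization constant `k` below is invented for illustration.

```python
def shrunk_rate(events, n, league_rate, k=400):
    """Empirical-Bayes-style shrinkage: the estimate is a weighted average
    of the observed rate and the population mean, with the weight on the
    observation growing with sample size n. k is a made-up stabilization
    constant, not a fitted value."""
    return (events + k * league_rate) / (n + k)

# Two pitchers with the same observed 5% HR rate, different samples,
# against a (hypothetical) 2.5% league rate:
small = shrunk_rate(5, 100, 0.025)    # small sample: stays near 2.5%
large = shrunk_rate(50, 1000, 0.025)  # large sample: moves toward 5%
```

That asymmetry – the same observed rate producing different estimates depending on how much evidence backs it – is the basic appeal of mixed models for noisy baseball events like HRs.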
In the spirit of embracing context instead of ignoring it, Judge doesn’t test the various metrics on how well they predict future runs, but on how well they predict RE24. This is a run-expectancy-based stat that credits changes in run expectancy (based on the runners on base and the number of outs) as well as actual runs scored. A 2-out bases-loaded hit is more damaging to a pitcher’s RE24 than it is to his FIP, which is uninterested in the base-out state, and uninterested in the hit itself. That’s an interesting change, though it’s something to keep in mind when you look at Judge’s table of correlations – he compares in-season correlations to RE24/PA for a bunch of run estimators, like FIP, xFIP, SIERA, ERA, RA, etc. Correlating to runs is hard enough, but adding context makes it even tougher – cFIP isn’t particularly good at the task, though it squeaks past SIERA, xFIP and the like. That said, a contextual measure being better correlated with another contextual measure than those that explicitly and intentionally *ignore* context isn’t all that impressive. cFIP moves to last place when he runs a weighted correlation on all pitchers, not just those who meet a batters-faced floor. Unsurprisingly, the ones that do well in this test are the *most* context-dependent, namely RA – a measure that lugs a pitcher’s defense around with him.
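The RE24 bookkeeping itself is straightforward; here’s a sketch with illustrative run-expectancy values, roughly in line with published RE matrices but not an official table:

```python
# Run expectancy by (base state, outs). Values are illustrative only.
RE = {
    ("empty", 0): 0.48,
    ("loaded", 2): 0.75,
    ("first", 2): 0.22,
    ("empty", 2): 0.10,
}

def re24(start, end, runs_scored):
    """RE24 charged for one event: change in run expectancy across the
    event, plus any runs that actually scored."""
    return RE[end] - RE[start] + runs_scored

# Two-out, bases-loaded single scoring two runs, leaving a runner on first:
delta = re24(("loaded", 2), ("first", 2), 2)  # 0.22 - 0.75 + 2 = 1.47 runs
```

FIP, by contrast, would record that single as nothing at all – which is exactly why correlating context-blind estimators against RE24 stacks the deck in favor of contextual measures.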
That said, cFIP shines when predicting RE24 in the following year. Shines may not be a particularly good term here; its three-year average correlation is under .4, leaving it all but tied with SIERA, xFIP and kwERA (a Tom Tango metric that uses strikeouts and walks *only*). It’s really, really hard to predict a measure with as much noise as RE24, given that RE24 is so dependent on sequencing. That’s not a knock on cFIP (or kwERA), but it’s worth noting that we’re talking about very small differences in what is ultimately not-so-hot predictive power. cFIP also ranks #1 in how well the measure correlates with itself from year to year, again finishing ahead of kwERA and SIERA.
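For the curious, kwERA is about as simple as a run estimator gets – Tango’s formula is just strikeouts minus walks per plate appearance, rescaled to look like an ERA:

```python
def kwera(k, bb, pa):
    """Tom Tango's kwERA: strikeouts and walks only.
    kwERA = 5.40 - 12 * (K - BB) / PA."""
    return 5.40 - 12 * (k - bb) / pa
```

That a two-input formula like this hangs with far more elaborate models when predicting next-year RE24 is the recurring theme of this whole discussion.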
So…is it, you know, good? It’s promising, but I’m not sure that *this version* gets us all that far. As we’ve seen, it’s quite close to kwERA, an extremely simple measure that does a bit better as a descriptor if a tiny bit worse as a predictor (of RE24, mind you). That cFIP is stable *could* be an indication that it’s homed in on true talent, but it could be an artifact of all the regression the model is doing. A measure can reduce big errors by squishing everyone towards the mean, but that can obscure or underestimate the gaps between great pitchers and their not-so-great colleagues. Adding more regression may give you more stability, but it does so by ignoring what actually happened; it may not be getting you closer to true talent, it may just be minimizing the importance of true talent. A number of much-smarter-than-I analysts have already pointed these concerns out – here’s Neil Weinberg on the former, and Peter Jensen on the latter.
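That spread-versus-information worry is easy to demonstrate with simulated data (every number here is invented): squeezing all estimates toward the mean lowers error against a noisy target even though it adds no knowledge about any individual pitcher.

```python
import random

random.seed(1)
# Invented "true talent" ERAs, plus two independent noisy seasons.
talent = [random.gauss(4.0, 0.5) for _ in range(500)]
observed = [t + random.gauss(0, 1.0) for t in talent]
next_season = [t + random.gauss(0, 1.0) for t in talent]
mean = sum(observed) / len(observed)

def rmse(preds, actual):
    return (sum((p - a) ** 2 for p, a in zip(preds, actual)) / len(actual)) ** 0.5

# "Regressed" estimates: squeeze everyone 60% of the way toward the mean.
shrunk = [mean + 0.4 * (o - mean) for o in observed]

# The shrunk estimates beat the raw ones against next season's noisy
# results -- not because they know more, but because they spread less.
```

So a low out-of-sample error (or high self-correlation) isn’t by itself proof that a metric has found true talent; it can simply reflect a conservative spread.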
For me, I find it somewhat odd that it’s *so close* to measures like xFIP and kwERA that ignore HRs entirely. It’s utilizing actual results, but I’m not seeing much of an effect from that. Let’s turn to a Mariner-centric example. James Paxton had an injury-shortened 2014, but he was effective thanks to a high GB% (and thus very few HRs allowed) and a low batting average on balls in play. ERA loved him; FIP loved the few-HRs thing, but wasn’t blown away by his K:BB ratio; xFIP liked the GB%, but thought he got a bit lucky. Taking all of that into account and regressing the real results, what does cFIP think? Hmm, a touch *below* league average. Not even xFIP was willing to go that far. One possibility is that the model is putting a lot of weight on parks and the specific batters Paxton faced – maybe he didn’t give up a lot of HRs because he faced a disproportionate share of Eric Sogards and few Mike Trouts. Baseball Prospectus tracks both the quality of opposing hitters and a pitcher-specific park factor, so we can check. Paxton did indeed benefit from a great run environment thanks to pitching in Seattle and Anaheim, but the quality-of-opponent metric is extreme – in the opposite direction. Paxton faced an extremely difficult slate of hitters; facing Anaheim four times, and then adding in Baltimore, Toronto and Oakland, will do that. Among Mariners pitchers, only Carson Smith’s average opponent had a higher OPS, and Paxton was in the top few starters by this metric (relievers typically face tougher hitters for obvious reasons).
Another example is ex-Angels reliever Ernesto Frieri. Judge notes that Frieri is the player with the largest gap between his cFIP and FIP-, both of which are park-adjusted. Last year, Frieri had a lovely K:BB ratio, and well over a strikeout per inning. However, a barrage of HRs and an abysmal strand rate got him shipped out of town. Frieri ended the year with an ERA well over 7, and a FIP of around 5.5 – giving up 11 HRs in just over 40 innings will do that to a guy. His xFIP was better, but with such an extreme fly-ball ratio, it’s still not great (it’s lower than Paxton’s, for example). cFIP sees past all that, giving him a 90, or 10% better than league average. So Paxton was slightly worse than average, while Frieri – who faced a slightly *worse* set of hitters, and also enjoyed HR-suppressing park environments – was better. Batter handedness? Nah, Paxton faced *five times* as many righties as he did lefties. This is a very anecdotal way to analyze a statistic, but whatever the model is doing with actual HRs allowed and actual hitters, it can’t be much. If some set of circumstances completely outweighs the actual results, that’s fine, but then the complexity of adding all of those actual results to the model doesn’t seem to have been worth it.
The model’s promise is the ability to bridge the gap between descriptive and predictive, but it’s not immediately clear what all of the “actual results” are doing. Maybe the model regresses them away, as they don’t have the stability of good old strikeouts and walks. That’s fine, that’s interesting, but if so, it doesn’t seem to offer a lot beyond kwERA/kwFIP. Instead of building a bridge between the two classes of metrics, it certainly *looks* like cFIP is setting up camp with the predictive models. It appears to be more stable, but again, if it’s more stable solely because its spread is much lower than FIP’s, xFIP’s, etc. (to say nothing of ERA’s), then that limits cFIP’s utility. It would be interesting to see the correlation between cFIP and kwERA, or between cFIP and SIERA. My guess is that they’re very, very high.
At this point, we’ve seen two innovative approaches to integrating actual results into predictive models: SIERA, and now cFIP. Just as an outside observer, those actual results seem to get regressed away pretty quickly. Both seem, on paper, to take some pretty important things into account – velocity for SIERA and umpire for cFIP. And despite that, or rather *because* of that, they end up looking like a souped-up xFIP. ERA is clearly, and increasingly widely, seen as inadequate, but every new pitching metric seems to train its guns on FIP. If you’re looking to better describe *actual* results, RA/9’s place in the pantheon isn’t imperiled by cFIP. To the degree that we learn something new about the game of baseball – and every new metric should attempt to illuminate some aspect of the game – what we learn (or re-learn) is the central insight of DIPS: that strikeouts and walks matter so, so much more than hits. We’ve added tons of data to FIP, or rather xFIP, and we’ve moved the needle, but by frustratingly little. That’s interesting in itself, if frustrating. At this point, it seems like we’re not going to get a noticeably more predictive/descriptive model by adding a bit more data. Multiple smart people have added tons, and the gains are marginal. If we’re going to break actual new ground, it seems like we might need to add tons more data: don’t just incorporate umpire or velo, but incorporate the type and location of every pitch, and which pitches precede and follow it. These models are already frightfully complicated, and I hope/fear they’re going to get exponentially more complicated.
Ultimately, I think the mixed-model approach has so much potential, and my skepticism (or confusion!) about cFIP isn’t based on a low ranking of Paxton, but on the fact that I can’t immediately see how the model uses actual results, especially HRs. FIP is *so* HR-dependent, and that leads it to underestimate guys like Hisashi Iwakuma; other measures drop HRs entirely. We need something in between, but it may be that there’s simply too much variability in HRs to do this effectively or neutrally. As Neil Weinberg says, the star of the show may be kwERA – knowing a pitcher’s Ks and BBs tells you about as much about future runs allowed as metrics that are light-years more complex. Still, cFIP is something to watch. I’m excited to see what Judge does with it, and how analysts might utilize it – and I’m even more excited to see what Judge does next.