marc w · March 15, 2015 at 11:58 pm · Filed Under Mariners 

I’ll start out with a warning. This is a post about posts about math. If that’s not your cup of tea, that’s great – you sound well adjusted. If trying to figure out what new pitching metrics are trying to tell us in baseball terms sounds interesting, read on.

On Wednesday, Jonathan Judge released a new public pitching measure, Context-FIP, or cFIP for short. There’ve been many run estimators over the years, including the mostly descriptive ERA, RA/9 and FIP, and a series that regress past results in order to capture a pitcher’s elusive “true talent” and thus help predict his future results. For a number of reasons, ERA is exceptionally poor as a predictive measure. It conflates what a pitcher does with the contributions of his team, it ignores parks and opponents, and its attempt to strip errors out causes some odd effects that make it even less predictive than plain old RA/9. More predictive metrics like xFIP strip out defense and then replace actual HRs allowed with the number expected from the pitcher’s fly balls. xFIP is generally pretty close to FIP, but more stable from year to year – and that stability is either a feature or a bug, depending on what you’re using it for. What Judge’s statistic attempts to do is bridge the gap between predictive models and descriptive ones. It’s not necessarily the best at predicting future runs allowed, but given the noise involved in *that*, Judge argues that we need to evaluate not only how well a metric predicts runs out of sample, but how well it predicts *itself* in future years.
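For reference, FIP is just a weighted tally of the three true outcomes, and xFIP has the same shape with the HR term swapped out. A rough sketch (the leaguewide constant, around 3.10, is recalculated every season, so treat it as approximate):

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """FIP: only HRs, walks, HBP and strikeouts count. The constant
    (roughly 3.10; recalculated each season) puts FIP on the league
    ERA scale."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant

def xfip(fb, lg_hr_per_fb, bb, hbp, k, ip, constant=3.10):
    """xFIP: identical shape, but actual HRs are replaced with the
    total expected from the pitcher's fly balls at the league HR/FB
    rate -- the HR replacement described above."""
    return (13 * fb * lg_hr_per_fb + 3 * (bb + hbp) - 2 * k) / ip + constant
```

Feed both the same walks and strikeouts and they only disagree when a pitcher’s actual HR total diverges from his fly-ball-expected one – which is exactly the disagreement cFIP tries to adjudicate.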

One of the tantalizing aspects of cFIP is Judge’s use of “mixed models” to calculate it. Instead of ignoring everything from batter handedness to ballpark to umpire, the model incorporates them as “random effects,” kept segregated from the fixed effects (things with only a set number of possibilities; no matter how many observations you make, pitchers will throw with either their left or right hands). The model can then examine the random effects to see how they affect runs, adding certainty as you add more and more observations/data. So while FIP treats all HRs the same, and xFIP strips out all actual HRs, cFIP is an early example of a cool hybrid. A HR to Troy Tulowitzki in Colorado is *different* than a HR to Brendan Ryan in Safeco, and it’d be cool to incorporate that information into a FIP-like statistic. Judge is a great writer, and the explanation of the approach is surprisingly readable – he outlined mixed models in this great post on catcher framing, and his description of their application to cFIP is lucid even to a non-gory-math person like me.
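To make the random-effects idea concrete, here’s a toy illustration – my own sketch, not Judge’s actual model: a context effect like a ballpark gets shrunk toward zero (the league norm), and the shrinkage loosens as the observations pile up.

```python
import statistics

def random_effect_estimate(observed_effects, prior_weight=50):
    """Toy partial pooling, in the spirit of a mixed model's random
    effects: the raw average effect for, say, a ballpark is shrunk
    toward zero, with the shrinkage fading as the sample grows.
    `prior_weight` is an invented number, not anything from cFIP."""
    n = len(observed_effects)
    raw = statistics.mean(observed_effects)
    return raw * n / (n + prior_weight)

# Ten observations barely move the estimate off zero;
# five hundred let the park keep most of its raw effect.
little_data = random_effect_estimate([1.0] * 10)
lots_of_data = random_effect_estimate([1.0] * 500)
```

This is why the model can afford to include dozens of contexts: a rarely seen umpire or park just stays pinned near zero until the data justifies moving it.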

In the spirit of embracing context instead of ignoring it, Judge’s test of the various metrics isn’t how well they predict future runs, but how well they predict RE24. That’s a run-expectancy-based stat, one that captures changes in run expectancy (based on runners on base and outs) as well as actual runs scored. A 2-out, bases-loaded hit is more damaging to a pitcher’s RE24 than it is to his FIP, which is uninterested in the base-out state and uninterested in the hit itself. That’s an interesting change, though it’s something to keep in mind when you look at Judge’s table of correlations – he compares in-season correlations to RE24/PA for a bunch of run estimators, like FIP, xFIP, SIERA, ERA, RA, etc. Correlating to runs is hard enough, but adding context makes it even tougher – cFIP isn’t particularly good at the task, though it squeaks past SIERA, xFIP and the like. That said, a contextual measure being better correlated with another contextual measure than those that explicitly and intentionally *ignore* context isn’t all that impressive. cFIP moves to last place when he runs a weighted correlation on all pitchers, not just those who meet a batters-faced floor. Unsurprisingly, the ones that do well in this test are the *most* context-dependent, namely RA – a measure that lugs a pitcher’s defense around with him.
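For the curious, RE24 for a single play is just the change in the run-expectancy table plus whatever actually scored. A sketch with made-up (but plausible) expectancy values:

```python
# Illustrative run-expectancy values (NOT an official table): the
# average runs a team scores from each (bases, outs) state through the
# end of the inning. "111" is bases loaded, "101" is first and third.
RUN_EXP = {
    ("111", 2): 0.78,
    ("101", 2): 0.50,
    ("end", 3): 0.00,
}

def re24(before, after, runs_scored):
    """RE24 for one play, from the batting team's perspective; the
    pitcher is charged the same amount."""
    return RUN_EXP[after] - RUN_EXP[before] + runs_scored

# Two-out, bases-loaded single scoring two, runners at the corners:
single = re24(("111", 2), ("101", 2), 2)   # 0.50 - 0.78 + 2 = 1.72
# The strikeout that would have ended the inning instead:
whiff = re24(("111", 2), ("end", 3), 0)    # 0.00 - 0.78 = -0.78
```

FIP sees that strikeout and that single as worth the same fixed amounts in *any* base-out state; RE24 swings by two and a half runs between them here.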

That said, cFIP shines when predicting RE24 in the following year. Shines may not be a particularly good term here; its three-year average correlation is under .4, leaving it all but tied with SIERA, xFIP and kwERA (a Tom Tango metric that uses strikeouts and walks *only*). It’s really, really hard to predict a measure with as much noise as RE24, given that RE24 is so dependent on sequencing. That’s not a knock on cFIP (or kwERA), but it’s worth noting that we’re talking about very small differences in what is ultimately not-so-hot predictive power. cFIP also ranks #1 in how well the measure correlates with itself from year to year, again finishing ahead of kwERA and SIERA.
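Since kwERA comes up repeatedly below, here’s the whole thing – strikeouts and walks and nothing else. The constants are the commonly cited ones (around 5.40 and 12), and the exact calibration drifts with the run environment, so treat this as approximate:

```python
def kwera(k, bb, pa):
    """Tango's kwERA, approximately: a linear function of K and BB
    rates alone. Constants are the commonly cited ~5.40 and ~12;
    they get recalibrated as league scoring changes."""
    return 5.40 - 12 * (k - bb) / pa

# A 25% K rate against a 7% BB rate works out to roughly a 3.24 kwERA.
example = kwera(250, 70, 1000)
```

That a one-line formula hangs with mixed models in these correlation tests is the recurring punchline of this post.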

So…is it, you know, good? It’s promising, but I’m not sure that *this version* gets us all that far. As we’ve seen, it’s quite close to kwERA, an extremely simple measure that does a bit better as a descriptor if a tiny bit worse as a predictor (of RE24, mind you). That cFIP is stable *could* be an indication that it’s homed in on true talent, but it could also be an artifact of all the regression the model is doing. A heavily regressed measure will reduce big errors by squishing everyone towards the mean, but that can obscure or underestimate the gaps between great pitchers and their not-so-great colleagues. Adding more regression may give you more stability, but it does so by ignoring what actually happened; it may not be getting you closer to true talent, it may just be minimizing the importance of true talent. A number of much-smarter-than-I analysts have already pointed these concerns out – here’s Neil Weinberg on the former, and Peter Jensen on the latter.
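The error-versus-spread tradeoff is easy to simulate. In this toy setup (my numbers, nothing to do with Judge’s model), every pitcher has a true talent and an observed season is talent plus noise; shrinking toward the mean cuts the big misses, but it also compresses the spread well below the true talent spread:

```python
import random
import statistics

random.seed(1)

# True talent ~ N(4.00, 0.50); an observed season adds noise ~ N(0, 0.70).
true_talent = [random.gauss(4.00, 0.50) for _ in range(1000)]
observed = [t + random.gauss(0, 0.70) for t in true_talent]

mean_obs = statistics.mean(observed)
shrink = 0.50**2 / (0.50**2 + 0.70**2)  # MSE-optimal factor for these variances
regressed = [mean_obs + shrink * (o - mean_obs) for o in observed]

def rmse(estimates):
    """Root-mean-square error against true talent."""
    return (sum((e - t) ** 2 for e, t in zip(estimates, true_talent))
            / len(estimates)) ** 0.5

# Regression shrinks the errors...
print(round(rmse(observed), 3), "->", round(rmse(regressed), 3))
# ...but also the spread between good and bad pitchers.
print(round(statistics.pstdev(observed), 3), "->",
      round(statistics.pstdev(regressed), 3))
```

The smaller errors and the compressed spread arrive as a package deal, which is exactly why stability alone can’t tell you whether a metric has found true talent – the tradeoff Weinberg and Jensen are pointing at.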

For me, I find it somewhat odd that it’s *so close* to measures like xFIP and kwERA that ignore HRs entirely. It’s utilizing actual results, but I’m not seeing much of an effect from that. Let’s turn to a Mariner-centric example. James Paxton had an injury-shortened 2014, but he was effective thanks to a high GB% (and thus very few HRs allowed) and a low batting average on balls in play. ERA loved him; FIP loved the few-HRs thing but wasn’t blown away by his K:BB ratio; xFIP liked the GB% but thought he got a bit lucky. Taking all of that into account and regressing the real results, what does cFIP think? Hmm, a touch *below* league average. Not even xFIP was willing to go that far. One thing that could indicate is that the model is putting a lot of weight on parks and the specific batters Paxton faced – maybe he didn’t give up a lot of HRs because he faced a disproportionate share of Eric Sogards and few Mike Trouts. Baseball Prospectus tracks the quality of opposing hitters AND a pitcher-specific park factor, so that can help us check this. Paxton did indeed benefit from a great run environment thanks to pitching in Seattle and Anaheim, but the quality-of-opponent metric is extreme – in the opposite direction. Paxton faced an extremely difficult slate of hitters; facing Anaheim four times, and then adding in Baltimore, Toronto and Oakland, will do that. Among Mariners, only Carson Smith faced hitters with a higher average OPS, and Paxton was in the top few starters by this metric (relievers typically face tougher hitters for obvious reasons).

Another example is ex-Angels reliever Ernesto Frieri. Judge notes that Frieri is the player with the largest gap between his cFIP and FIP-, both of which are park-adjusted. Last year, Frieri had a lovely K:BB ratio and well over a strikeout per inning. However, a barrage of HRs and an abysmal strand rate got him shipped out of town. Frieri ended the year with an ERA well over 7 and a FIP of around 5.5 – giving up 11 HRs in just over 40 innings will do that to a guy. His xFIP was better, but with such an extreme fly-ball ratio, it’s still not great (it’s lower than Paxton’s, for example). cFIP sees past all that, giving him a 90, or 10% better than league average. Paxton’s slightly worse than average; Frieri – who faced a slightly *worse* set of hitters, and also enjoyed HR-suppressing park environments – was better. Batter handedness? Nah, Paxton faced *five times* as many righties as he did lefties. This is a very anecdotal way to analyze a statistic, but whatever the model is doing with actual HRs allowed and actual hitters faced, it can’t be much. If some set of circumstances completely outweighs the actual results, that’s fine, but then the complexity of adding all of those actual results to the model doesn’t seem to have been worth it.

The model’s promise is the ability to bridge the gap between descriptive and predictive, but it’s not immediately clear what all of the “actual results” are doing. Maybe the model regresses them away, as they don’t have the stability of good old strikeouts and walks. That’s fine, that’s interesting, but if so, it doesn’t seem to offer a lot beyond kwERA/kwFIP. Instead of building a bridge between the two classes of metrics, it certainly *looks* like cFIP is setting up camp with the predictive models. It appears to be more stable, but again, if it’s more stable solely because the spread is much lower than it is for FIP, xFIP, etc. (to say nothing of ERA), then that limits cFIP’s utility. What would be interesting is to show the correlation between cFIP and kwERA, or cFIP and SIERA. My guess is that they’re going to be very, very high.

At this point, we’ve seen two innovative approaches to integrating actual results into predictive models: SIERA and now cFIP. Just as an outside observer, those actual results seem to get regressed away pretty quickly. Both seem, on paper, to take some pretty important things into account – batted-ball data for SIERA and umpire for cFIP. And despite that, or rather *because* of that, they end up looking like a souped-up xFIP. ERA is clearly and increasingly widely seen as inadequate, but every new pitching metric seems to train its guns on FIP. If you’re looking to better describe *actual* results, RA/9’s place in the pantheon isn’t imperiled by cFIP. To the degree that we learn something new about the game of baseball – and every new metric should attempt to illuminate some aspect of the game – what we learn (or re-learn) is the central insight of DIPS: that strikeouts and walks matter so, so much more than hits. We’ve added tons of data to FIP, or rather xFIP, and we’ve moved the needle, but by frustratingly little. That’s interesting in itself. At this point, it seems like we’re not going to get a noticeably more predictive/descriptive model by adding a bit more data. Multiple smart people have added tons, and the gains are marginal. If we’re going to break actual new ground, it seems like we might need to add tons more data: don’t just incorporate umpire or velo, but incorporate the pitch type and location of every pitch, and what pitches precede and follow each one. These models are already frightfully complicated, and I hope/fear they’re going to get exponentially more complicated.

Ultimately, I think the mixed-model approach has so much potential, and my skepticism (or confusion!) about cFIP isn’t based on a low ranking of Paxton, but on the fact that I can’t immediately see how the model uses actual results, especially HRs. FIP is *so* HR-dependent, and that leads it to underestimate guys like Hisashi Iwakuma. Other measures drop HRs entirely. We need something in between, but it may be that there’s simply too much variability in HRs to do this effectively or neutrally. As Neil Weinberg says, the star of the show may be kwERA – knowing a pitcher’s Ks and BBs gives you about as much information about future runs allowed as metrics that are light years more complex. Still, cFIP is something to watch. I’m excited to see what Judge does with it, and how analysts might utilize it – and I’m even more excited to see what Judge does next.


7 Responses to “On cFIP”

  1. Eastside Crank on March 16th, 2015 9:25 am

    Looking at the relative rankings of Felix, Iwakuma, and Paxton, the Mariners’ pitching staff does not look so formidable. Pitching in Safeco still does wonders for ERA, but it is for all pitchers and not just the Mariners.

    I am not sure what you mean by using actual results. cFIP uses FIP data but reorders it based on additional factors. It tries to solve the question: given a certain set of circumstances, what is the likely outcome? Pitchers who do a better job of stopping runners from scoring in all situations will look better. For example, a reliever coming in with bases loaded and allowing all runners to score but not allowing any additional runs will look worse than one who does not allow any runs. The ERA would look the same in both instances. The predictive portion of the statistic makes the underlying assumption that pitchers are consistent from one year to the next if we just measured the right parameters. That may be more difficult to assess without knowing detailed medical histories and psychological aspects like clubhouse dynamics.

  2. PackBob on March 16th, 2015 5:55 pm

    Perhaps mixed analytical systems are needed to arrive at comparative worth for all pitchers. If FIP undervalues Iwakuma, maybe use another system that more fairly rates him, and then calibrate it to match the relative FIP output. Or maybe find out how much less than average a fly ball pitcher gives up home runs and apply that as an add-on to WAR.

    Regarding context, good/bad hitter is too simple, as a good hitter can be bad or a bad hitter good against certain pitchers or repertoires. Mike Trout is one of the best hitters in baseball, but a pitcher that pitches up in the strike zone may have a better contextual situation than a low ball pitcher.

    To get to a true comparative talent for all pitchers would seem to require all available context, difficult in itself to identify and quantify.

  3. heyoka on March 17th, 2015 6:16 am

    I just invented PERA – Perceptive Earned Run Average.

    We just poll everyone on what they think/feel the pitcher’s Earned Run Average is, and that’s what it is.

  4. MrZDevotee on March 17th, 2015 4:09 pm

    I know it’s preseason, and stats need to go through the “grain of salt” strainer, but they still have to make good contact to put up stats and our starters seem to be in the zone… Preseason stats so far:

    Batting averages for spring:

    A. Jackson .333
    Seth Smith .278
    Robbie .357
    Cruz .412
    Seager .300
    Morrison .320
    Zunino .354
    Ackley .421
    Miller .389

    OBP for starters: .346
    OPS for starters: .792

    Again, this is against inferior players and not MLB action, but they still have to perform and it’s way nicer to see HIGH numbers than bad/scuffling numbers…

    And the standings don’t really reflect what’s going on, but for anyone who’s been following spring games, the M’s have been giving up most of their runs in the late innings, when the youngsters are pitching. (Like today… 4-1 after 6 innings, now 5-5 in the 9th)

    Things are feeling pretty positive so far… Knock on wood. Go M’s.

  5. Longgeorge1 on March 17th, 2015 7:49 pm

    ARGH! my head hurts. Baseball was so much more fun when we would just go to the ballpark, line up a couple of Rainiers and bask in the warm sun (or the freezing rain). Really the team that scores the most runs wins and that is whether your starter is Felix or Hector (ugh). It’s not like I get to set the roster; my opinion on Taylor vs Miller makes for a nice discussion (or did), but really. I fully understand the value of knowing what really makes a great player. If I was Z I’d be all over this shit. Do I really care if Paxton’s effectiveness defies his xFIP? Am I worried when Zunino drives in the winning run, but it defies what he should do because he now has an unsustainable BABIP? Is it OK if I just get a ticket and enjoy the beauty and pace of the game? Do those screaming little kids falling in love with the game care about FIPs, WHIPs and DIPs? There is a place in Redmond where they appreciate that sort of thought. Otherwise – PLAY BALL, GO M’s.

  6. LongDistance on March 18th, 2015 12:29 am

    Ultimately, and Marc said it himself:

    “… knowing a pitcher’s Ks and BBs can give you as much information as you’re likely to get …”

  7. djw on March 18th, 2015 11:50 am

    ^^ Paying attention to efforts to improve our ability to evaluate baseball players is, of course, entirely optional, as you know. For some of us, intellectual curiosity accompanies and complements fandom; as such we find this interesting. This will likely never be the case for most fans, which is obviously fine. I don’t really understand your frustration; there’s no obligation whatsoever to try to keep up with this stuff if it’s not your thing.
