A Critique of fWAR: Inferential and Sensitivity Studies of the Baseball Wins Above Replacement Metric

Section 1: Introduction

Wins Above Replacement (WAR) is an estimate of the number of wins a baseball player contributes to his team above what a designated replacement-level player would provide. Any given WAR equation is composed of dozens of calculations that incorporate dozens of existing baseball metrics. WAR gained popularity on fangraphs.com as a tool for the media and fans to compare the value of different players across history. It has since evolved into a tool used by nearly everyone, including those in professional baseball, to summarize the value of active players with a single number. The metric was also quickly adapted to serve financial purposes, namely evaluating the quality of players' contracts. Because of this, it is reasonable to assume that this metric, or a similar one, is being used as a tool in contract negotiations.

The purpose of this research is to inferentially test WAR to determine its quality and precision, because there is no public inferential testing of any WAR. While there is plenty of analysis of WAR as a finished product through a theoretical lens, there is no public inferential testing of any WAR equation in any detail. For this testing, I chose fWAR, the WAR created and calculated by fangraphs.com, to annotate, analyze, and inferentially test. I will only be analyzing fWAR for position players, not pitchers, because WAR for pitchers is traditionally less trusted and less used, and the pitching equations are typically very straightforward, using one or two counting stats and relating them to a replacement player.

To properly analyze fWAR for position players, it would be wise to inferentially compare the different popular equations for position player WAR, perform sensitivity analysis on a variety of changes to the fWAR equation, find a confidence interval for fWAR, and perform exploratory modelling through machine learning with baseball theory set aside.

Section 2: Literature Search

The first step in this research was to search the existing literature for inferential testing or statistical criticism of any WAR. I did this by searching the SPORTDiscus database, all the St. John Fisher College databases, and the online SABR research database. The details of these searches are included in Appendix A. The result was two moderately relevant documents, but not a single document inferentially testing any WAR in any way.

The first of the moderately relevant published documents is a paper that creates a new WAR equation called openWAR (Baumer, Jensen, and Matthews, 2015), with the intention of making a reproducible and open-source WAR equation. In explaining the reasoning for creating such an equation, the authors express many of the same concerns and frustrations with popular WAR equations as stated above. Our projects take different roads once the authors proceed to create their own, new WAR equation without testing previous iterations. The other moderately relevant document is a blog post that attempts to explain the presence of multiple WAR equations by comparing them to the different equations used to measure Gross Domestic Product (GDP) in economics. The difference, once again, is that the author does not test or analyze any of the WAR equations being referenced.

Section 3: Analysis

All of the following computational analysis was done in RStudio; the code is available at https://github.com/robertweber98

Section 3.1: Inferential Testing

Before testing and learning anything about fWAR specifically, it was first necessary to inferentially test how it differs from other popular WARs. The other producers of WAR selected were baseball-reference.com (referred to hereafter as "brWAR") and baseballprospectus.com (referred to hereafter as "bpWAR"). These were chosen because they are, along with fWAR, the most popular and most used WARs in the media.

The data gathered consisted of each website's WAR for every player in the 2017 season. Then, the sum of each WAR was calculated and stored for all 30 teams. To inferentially test the difference between these similar metrics, a ridge regression was fit with team wins predicted by all three WARs. A ridge regression was chosen over an ordinary linear regression because the WAR values, all serving as independent variables in this model, are very similar in value and rank, which produces collinearity; a ridge regression effectively mitigates that collinearity. The regression yielded the results shown in Table 1.
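
As a rough illustration of this step, the following is a minimal R sketch of such a ridge regression, assuming a data frame named team_war_2017 with one row per team and columns wins, fWAR, brWAR, and bpWAR; the data frame name, column names, and the cross-validated choice of penalty are assumptions, not details published with this analysis.

    library(glmnet)

    # Team totals for 2017: assumed columns wins, fWAR, brWAR, bpWAR
    x <- as.matrix(team_war_2017[, c("fWAR", "brWAR", "bpWAR")])
    y <- team_war_2017$wins

    # alpha = 0 gives ridge regression, which shrinks the collinear WAR coefficients
    fit <- cv.glmnet(x, y, alpha = 0)   # cross-validation picks the penalty lambda
    coef(fit, s = "lambda.min")         # coefficients comparable to Table 1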

WAR            Coefficient
(Intercept)    58.4455942
fWAR           0.3479014
brWAR          0.4217119
bpWAR          0.3538059
Table 1

The coefficients are reasonably similar, but also slightly different. So, the next step was to investigate whether the different metrics are significantly different predictors of team wins. To do so, the ridge regression was bootstrapped and run 100,000 times, each time choosing 20 random teams from the league and storing the absolute values of the coefficients to obtain their magnitudes. Operating under the understanding that one WAR is significantly different from another at predicting team wins only if the 95% confidence intervals for the magnitudes of their coefficients do not overlap, and that each 95% confidence interval is the middle 95% of the bootstrap distribution of the absolute value of the coefficient, none of the WARs are significantly different. This is shown clearly in Figure 1, and the 95% confidence intervals are included in Table 2.
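
As a sketch of how such a bootstrap can be coded, the following R snippet repeats the ridge fit on 20 randomly chosen teams and keeps the coefficient magnitudes; the data frame name, the use of a single fixed penalty, and the reduced number of replicates are assumptions made for brevity.

    library(glmnet)
    set.seed(1)

    x_full <- as.matrix(team_war_2017[, c("fWAR", "brWAR", "bpWAR")])
    lam    <- cv.glmnet(x_full, team_war_2017$wins, alpha = 0)$lambda.min

    n_boot <- 10000   # the analysis above used 100,000 replicates
    coefs  <- matrix(NA, n_boot, 3, dimnames = list(NULL, c("fWAR", "brWAR", "bpWAR")))

    for (i in seq_len(n_boot)) {
      samp <- team_war_2017[sample(nrow(team_war_2017), 20), ]   # 20 random teams
      fit  <- glmnet(as.matrix(samp[, c("fWAR", "brWAR", "bpWAR")]),
                     samp$wins, alpha = 0, lambda = lam)
      coefs[i, ] <- abs(as.numeric(coef(fit))[-1])               # drop intercept, keep magnitudes
    }

    # Middle 95% of each bootstrap distribution, comparable to Table 2
    apply(coefs, 2, quantile, probs = c(0.025, 0.975))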

Figure 1

WAR      Lower Bound    Upper Bound
fWAR     0.04316666     0.5349429
brWAR    0.20057710     0.8139445
bpWAR    0.12356722     0.5740127
Table 2

So, we can say with 95% confidence that none of the popular versions of WAR are significantly different from each other in predictive ability. This conclusion has especially interesting implications through the lens of baseball theory. All three iterations of WAR have entirely different equations relying on privately created metrics; the only similarity is that they all, to some extent, rely on the same general data and counting stats associated with player performance.

Section 3.2: Investigating Possible Adjustments

Since it has been established that one can be 95% confident that none of the popular WARs are significantly different from each other, the next step was to investigate whether any reasonable changes and adjustments to the equation would make a WAR that is significantly different. As far as possible without any proprietary data or information, the equation for fWAR was deduced from the fangraphs.com glossary pages and then written in R code.

The first adjustment to fWAR was to use On-Base Percentage (OBP) in the Batting-Runs section of the equation instead of Weighted On-Base Average (wOBA). wOBA is used there to get a more accurate representation of a hitter's batting performance than OBP by weighting different batting results with their run values. The adjustment was made to test whether this distinction makes a significant difference.
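
To make the distinction concrete, here is a minimal R sketch of the two statistics; the function names are mine, and the linear weights are illustrative values roughly in the neighborhood of the published 2017 figures rather than the exact constants used by fangraphs.com.

    # OBP: every time on base counts the same
    OBP <- function(H, BB, HBP, AB, SF) {
      (H + BB + HBP) / (AB + BB + HBP + SF)
    }

    # wOBA: each batting event is weighted by (roughly) its run value
    wOBA <- function(uBB, HBP, X1B, X2B, X3B, HR, AB, BB, IBB, SF) {
      (0.69 * uBB + 0.72 * HBP + 0.88 * X1B + 1.23 * X2B + 1.56 * X3B + 2.01 * HR) /
        (AB + BB - IBB + SF + HBP)
    }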

The second adjustment was to use a manual calculation of Baserunning-Runs instead of the fangraphs.com published values. The website publishes a step-by-step explanation of how to calculate two of the three parts of Baserunning-Runs (Ultimate Baserunning (UBR) and Weighted Grounded-into-Double-Plays (wGDP)) along with the finished values, but the origin of the data used in those calculations is kept private. So, the data used instead was the full 2017 Statcast data set, a spreadsheet describing every event in baseball in the 2017 season in detail. The described calculations were then followed as closely as possible to manually calculate UBR and wGDP. The values produced were not at all similar to the fangraphs.com values, even though both calculations appear to be the same. Then, the last piece of Baserunning-Runs, Weighted Stolen Bases (wSB), was also adjusted. This metric acts as a stolen-base rate with linear weights for the run values of a stolen base and of being caught stealing. Manual calculations of the run values of those events were different, so a new wSB was created and used in this adjustment. Testing the predictive ability of this adjusted fWAR against regular fWAR will therefore tell whether the two calculations produce significantly different results.
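
As an illustration of the piece being replaced, the following R sketch follows the general published structure of wSB, with the run values left as arguments because this adjustment substituted manually calculated values for them; the function name, the example season, and the example run values are hypothetical.

    # wSB: stolen-base runs above what a league-average player would have produced
    # given the same opportunities (singles, walks, and hit-by-pitches)
    wSB <- function(SB, CS, X1B, BB, HBP, IBB, lg_wSB_rate, runSB, runCS) {
      SB * runSB + CS * runCS - lg_wSB_rate * (X1B + BB + HBP - IBB)
    }

    # Hypothetical player season, with run values near the commonly published ones
    wSB(SB = 20, CS = 5, X1B = 110, BB = 45, HBP = 5, IBB = 3,
        lg_wSB_rate = 0.002, runSB = 0.2, runCS = -0.41)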

The third adjustment was using a real Runs-per-Win (rRPW) instead of the fangraphs.com estimated Runs-per-Win (RPW). The site uses an equation to estimate how many runs result in a win for the team in the long run; for 2017, that value was 10.048. The interesting part is that the average number of runs scored by the winning team in 2017 was 6.368. So, this adjustment was made to test whether the two RPW values produce significantly different WARs.
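
A minimal sketch of the "real" Runs-per-Win used here, assuming a 2017 game log data frame games_2017 with home_score and away_score columns (the data frame and column names are illustrative):

    # Average runs scored by the winning team across all 2017 games
    rRPW <- mean(pmax(games_2017$home_score, games_2017$away_score))
    rRPW  # roughly 6.368 per the text, versus the estimated 10.048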

The fourth adjustment was removing the value of Replacement-Runs (Rep) from the equation. Rep is the piece of the equation that adds the "Above Replacement" part to Wins Above Replacement. Since Rep is merely added at the very end for interpretive ease, it does not seem strictly necessary, so the fWAR produced without Rep will test whether its presence significantly affects the equation's predictive ability.

The fifth adjustment was taking the equation's second league adjustment out of the fWAR equation. The calculation for Batting-Runs adjusts for the league (AL or NL) the player plays in; then, the final calculation appears to adjust for the league again. So, this adjustment will test whether the second league adjustment significantly changes the predictive ability of fWAR.

The sixth adjustment was the removal of the equation's positional adjustment. The positional adjustment is meant to account for the fact that certain positions are harder to play than others, so it adds or subtracts runs based on the position played most by the player. This adjustment to the fWAR equation will simply test whether this piece of the equation significantly changes its predictive ability.

The seventh, and last, adjustment was cumulative, using the manual Baserunning-Runs calculation, rRPW, the removal of the second league adjustment, and the removal of replacement runs. This serves as a test of whether an fWAR equation with logical adjustments will produce a value with significantly different predictive abilities.

First, a ridge regression was run with team wins predicted by the team totals of the normalized values of all the adjusted WARs. The values were normalized so that they are on the same scale and no WAR looks more or less predictive simply because its magnitude is consistently larger or smaller than the others in the model. The summary of this regression is included below in Table 3.
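
A short sketch of the normalization step, assuming a data frame adj_wars_2017 of team totals with a wins column and one (illustratively named) column per WAR variant:

    library(glmnet)

    war_cols <- c("fWAR", "WAR_OBP", "WAR_baseruns", "WAR_rRPW",
                  "WAR_no_rep", "WAR_no_lg", "WAR_no_pos", "WAR_cumulative")

    x_scaled <- scale(as.matrix(adj_wars_2017[, war_cols]))   # mean 0, sd 1 per column
    fit <- cv.glmnet(x_scaled, adj_wars_2017$wins, alpha = 0)
    coef(fit, s = "lambda.min")   # coefficients comparable to Table 3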

WAR                    Coefficient
(Intercept)            81.00
fWAR                   0.11900271
WAR (OBP)              0.16138025
WAR (Dif. Base-Runs)   0.08590853
WAR (rRPW)             0.07798246
WAR (No Rep-Runs)      0.34591196
WAR (No lg-Adj)        0.07829564
WAR (No Pos-Adj)       0.15694370
WAR (Cumulative)       0.29568868
Table 3

Next, similar to before, to find whether any of the predictors are significantly different from each other at predicting team wins, the ridge regression was bootstrapped and run 100,000 times. Each time, 20 teams were randomly chosen and the absolute values of the coefficients were stored. The results are shown graphically in Figure 2, and the 95% confidence intervals of the magnitudes of the coefficients are shown in Table 4. It is very clear that none of these adjusted WARs are significantly better or worse at predicting team wins than any other. By extension, they are also not significantly better or worse at predicting team wins than any of the other popular WARs.

Figure 2

WAR      Lower Bound    Upper Bound
fWAR     0.01624789     0.3018428
WAR_1    0.02530891     0.9661601
WAR_2    0.01910745     0.8620905
WAR_3    0.01612610     0.5576981
WAR_4    0.08734407     1.3361446
WAR_5    0.01600016     0.5541393
WAR_6    0.03071950     0.4285664
WAR_7    0.07814147     0.9782936
Table 4

Section 3.3: Sensitivity Analysis

Since not one WAR has been significantly different in its predictive ability, the next step was to perform sensitivity analysis to determine whether any change to the WAR equation, even an unreasonable one, will yield a product with significantly different predictive ability. The process for this analysis was very similar to that of the last step. Three impossible changes were made to the equation, and the three WARs produced were put in a ridge regression, along with fWAR, predicting team wins. The team totals of each WAR were found and stored. Then, the regression was bootstrapped and run 10,000 times, choosing 20 random teams each time.

The first WAR was produced using an impossibly scaled wOBA in its equation, meaning the values given to each event were deliberately greater than the number of runs the event could possibly yield. For example, walks were given a weight of 2.0, even though the most runs that can be scored on a walk is one. Likewise, a home run was given a value of 5.0, even though the most runs that can be scored on a home run is four, and so on. The second WAR was produced using an absurdly high 20.0 RPW in its equation. The third WAR was produced using the regular fWAR equation, except Fielding-Runs was given five times the weight by being multiplied by 5.0.

It can be seen, once again and quite clearly, that none of the four WARs involved are significantly different in their ability to predict team wins. The 95% confidence intervals for the magnitudes of the coefficients are shown in Table 5, and the plots of the distributions are shown in Figure 3.

Figure 3

WAR                Lower Bound    Upper Bound
fWAR               0.06096210     0.4437420
WAR (Crazy wOBA)   0.09439956     1.0675372
WAR (Crazy RPW)    0.06741467     0.5704427
WAR (5x Fld)       0.06595599     0.5689341
Table 5

Section 3.4: Exploratory Machine Learning Analysis

After not finding a single WAR with significantly different predictive abilities, the next step was to find out if there was any way to use baseball counting-stats to predict wins and achieve a significant difference. This was done with machine learning through a ridge regression.

The data collected consisted of MLB team totals and team wins for the 2006-2016 seasons. Then, a ridge regression was fit with team wins predicted by unintentional walks (uBB), times hit-by-pitch (HBP), singles (1B), doubles (2B), triples (3B), home runs (HR), stolen bases (SB), times caught stealing (CS), and times grounded into double plays (GDP). The coefficients of this regression were then stored and are shown in Table 6.

Stat   Coefficient
uBB    0.036370460
HBP    0.099830785
1B     0.035686716
2B     -0.007193859
3B     0.005149822
HR     0.093983374
SB     0.027315687
CS     -0.099715357
GDP    -0.025207731
Table 6

The next step was to use these coefficients as weights on the corresponding stats for individual players in the 2017 season, total the results by team, and test the result against fWAR. Then, a ridge regression with team wins predicted by fWAR and the machine-learning quasi-WAR (ml_WAR) was bootstrapped and run 1,000 times, choosing 20 teams randomly each time. The 95% confidence intervals of the magnitudes of the coefficients are included in Table 7, and the plot of the distributions of the absolute values of the coefficients is included in Figure 4. Since the 95% confidence intervals overlap, the two metrics are not significantly different in their ability to predict team wins.
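
As a sketch of how the Table 6 coefficients become a player-level quasi-WAR, assuming a 2017 player data frame players_2017 with a team column and the nine counting stats (column names such as X1B standing in for 1B are illustrative):

    library(dplyr)

    # Weights taken directly from Table 6 (X1B/X2B/X3B stand in for 1B/2B/3B)
    wts <- c(uBB = 0.036370460, HBP = 0.099830785, X1B = 0.035686716,
             X2B = -0.007193859, X3B = 0.005149822, HR = 0.093983374,
             SB = 0.027315687, CS = -0.099715357, GDP = -0.025207731)

    # Weighted sum of each player's counting stats
    players_2017$ml_WAR <- drop(as.matrix(players_2017[, names(wts)]) %*% wts)

    # Team totals of ml_WAR, ready to regress against 2017 team wins alongside fWAR
    team_ml_WAR <- players_2017 %>%
      group_by(team) %>%
      summarise(ml_WAR = sum(ml_WAR))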

Metric   Lower Bound   Upper Bound
fWAR     0.3218949     1.421170
ml_WAR   0.9571414     4.786799
Table 7

Figure 4

Section 3.5: A Confidence Interval for fWAR for Position Players

The next step, acknowledging that none of the metrics have been significantly different in their predictive abilities, is to find a confidence interval for fWAR. Since nothing has been better or worse, it would be helpful, at least, to have an estimate of its accuracy.

To do this, 50 years of fangraphs.com position player fWARs were taken and narrowed down to players who started their careers between 1966 and 2016. Then, it was investigated which stretch of players' careers, on average, had the least variation. It was found, from the scatterplot shown in Figure 5, that from years 5 to 10 in the league, players' fWAR varied the least of any stretch.

Figure 5

From there, years 5 through 10 for each player were isolated and, because combinatorial probability theory states that 94% of the data will fall between the second and the fifth sorted value out of six values, the fWAR of the player's sixth year was stored as their estimated lower bound and the fWAR of the player's ninth year was stored as their estimated upper bound. The average of those two values was taken for every player to find the plus-minus value of their estimated confidence interval. Then, the average of those averages among all players sampled was taken and found to be 0.8763538, meaning that it is estimated, with 94% confidence, that the true value of a player's fWAR lies within about 0.876 above or below the recorded value.

Section 4: Discussion

After the given analysis, one could take away, with reasonable confidence, three coinciding conclusions: none of the most popular WARs are significantly different in their ability to predict team wins, there are no reasonable changes to the WAR equation for position players that will make it significantly different at predicting team wins, and the confidence interval is too wide to compare any two similar players with any confidence.

So, it seems as though, while the equation for position player WAR has many different pieces that can take on different values with different philosophical derivations, the construct of WAR is rigid: one can change the value of any given piece, and it may change the ordering of the finished product slightly, but it is extremely difficult to significantly change its overall predictive ability.

Along with that, the confidence interval for fWAR, which, given what has been shown about the differences between the popular WARs, is most likely very similar for all position player WARs, is reasonably large. Most position player WARs are on a scale from about -2 to about 8 or 10, and most players will be at the bottom of that scale. So, as an example, suppose that, going into a season, a team's starting second baseman had 1.4 WAR last year but a second baseman the team just acquired had 1.6 WAR. While it seems like the second player had the better season, their WARs are only 0.2 apart, well within the roughly ±0.876 interval, so they cannot be compared with any amount of confidence. Also, using fWAR numbers from 2017, the market rate was about $8 million per win of WAR; the interval half-width of about 0.876 WAR (roughly 88% of one win) therefore corresponds to about $7 million. This shows that, if a team were to rely completely on fWAR to decide the contracts of their position players for any single season, they could be mis-spending about $14 million (±$7 million) on salary. While it is obvious that teams do not rely on one single metric for their contracts, and the actual dollar value can be affected by other players on the market at the time of signing, if front offices are putting a lot of weight on some in-house WAR metric in decision making, they could be misled.

The machine learning section of the analysis brings a separate set of possible conclusions. The exploratory analysis done in that section yielded a regression with coefficients that made little sense. For example, the coefficient for doubles was negative, implying that hitting a double is bad, which is obviously not true. But the bootstrapping showed that even this nonsensical metric did not yield significantly different predictive ability than fWAR. What this may show is that the analytics community has reached the end of counting stats. All of the metrics used in the analysis rely on traditional and non-traditional counting stats, metrics composed of a weighted or un-weighted number of times, sometimes in the form of a percentage, that a given event occurs. Given that none of the metrics used were significantly different in their influence over wins, it may be that no collection or composition of counting stats will fare differently.

So, the next logical question is: what else is going to be better? It remains to be seen whether anything else will be, but there are a lot of options for future research. For example, run expectancy, a concept used often throughout the baseball analytics community, is a very interesting idea. While it can be used to tally up contributions, it is, at its core, the value of any given action, and that value can be totaled, averaged, and so on. One possibility would be to make the number of opportunities neutral and simply look at player performance in all the possible situations on offense and defense. To go further, it would be possible to use the average run expectancy added by the types (found through clustering) of the players involved in the play and find the win probability.

The point being that there are loads of possibilities in non-counting-stat measuring of player performance, and that appears to be the next step in the process of estimating a player’s value.

Appendix

A: Database Searches

  1. SPORTDiscus
    • Date Accessed: 06/23/2018
    • Searched “wins above replacement” AND baseball
    • 107 results
  2. St. John Fisher Library: Articles, Books, and more
    • Date Accessed: 07/16/2018
    • Searched “wins above replacement” in Title OR “wins above replacement” in Description
    • 58 results
  3. sabr.org
    • Date Accessed: 07/16/2018
    • Searched “Wins Above Replacement”
    • 238 results

References

  • Baumer, B. S., Jensen, S. T., & Matthews, G. J. (2015). OpenWAR: An open source system for evaluating overall player performance in major league baseball. Journal of Quantitative Analysis in Sports, 11(2). doi:10.1515/jqas-2014-0098
