Sunday, August 28, 2016

Binary Components

If a hitter with a .320 batting average faces a pitcher who allows a .230 batting average in a league with a .260 batting average, what would the batting average of their match-up be?

The way I solved this on my old spreadsheet (that I built to win money on FanDuel) was to take the batter's rate and convert everything else to a factor of the league-average rate, and then multiply the batter rate by those factors - the park factor, the platoon factor, the home field advantage factor, and the opposing pitcher factor.

The pitcher in the example above allows a batting average of .230, or 88.5% of the league average (.230 / .260), so the opposing pitcher factor is .885. Multiply that by the batter's rate (.320), and you get the batting average for the match-up: .283. A .320 hitter would hit .283 against a .230 pitcher when the league average is .260 (and assuming all other factors are 1).
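The arithmetic above can be sketched in a few lines of Python. The function name `matchup_rate` and its `other_factors` parameter are made up for illustration and are not taken from the old spreadsheet; the park, platoon, and home field factors are all assumed to be 1 here.

```python
def matchup_rate(batter_rate, pitcher_rate, league_rate, other_factors=()):
    """Multiply the batter's rate by the pitcher's factor of league average.

    other_factors is a hypothetical hook for park, platoon, and
    home field factors; an empty tuple means they are all 1.
    """
    rate = batter_rate * (pitcher_rate / league_rate)
    for factor in other_factors:
        rate *= factor
    return rate


# A .320 hitter against a .230 pitcher in a .260 league
avg = matchup_rate(0.320, 0.230, 0.260)
print(f"{avg:.3f}")  # 0.283
```

Passing something like `other_factors=(1.05,)` would apply one extra multiplier, for example a hitter-friendly park.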

That's how I would have figured it on my old spreadsheet. Actually, there's a much better way (more on that later).

And that's if I figured batting average. But I didn't, even on my old spreadsheet. Never cared about it.

Batting average is the rate of hits - any kind of hit - per plate appearance that isn't a base on balls, hit batsman, sacrifice fly, or sacrifice bunt. That's not a very intuitive way to look at things.

Back around 2000, Voros McCracken turned the sabermetric community upside-down with his DIPS theory (Defense-Independent Pitching Statistics), which proposed that a pitcher has no control over hits on balls in play. If contact is made, and it's not a homerun, the outcome is entirely determined by the batter and the fielders.

Voros overstated his case slightly - pitchers have some ability to limit the number of hits they give up on balls in play, as other sabermetricians quickly found out - but it takes a long time (years) to separate a pitcher's true BABIP (batting average on balls in play) ability from random variance. Therefore a pitcher's rates of walks, strikeouts, and homeruns allowed are much more reliable than their rates of other hits allowed and in-play outs.

Bill James mentioned Voros in a comment in his New Historical Abstract, which was published soon after. Bill James taught me how to think, as he did nearly all sabermetricians, but recently I've gleaned more knowledge from comments left by Tom Tango on other writers' blogs than I have from entire books written by James.

I was already toying with the idea of separating batting events into individual yes-or-no components, starting with the three true outcomes - the walk, the strikeout, and the homerun - the events entirely determined by the batter and the pitcher. After weeding out the three true outcomes, you can work your way down to fielding-dependent outcomes, like base hits and extra-base hits.

I was inspired either by James' comment on Voros and DIPS, or by some subsequent articles or comments by Bill James or Tom Tango or some other sabermetrician, or by a combination of all of the above. (Like Bill James, Voros McCracken was hired to consult for the Red Sox in 2003. Unlike James, he left the Sox in '05 and never wrote about baseball again.)

But a comment made by Tango on this Bill James article (subscription required) affirmed my line of thinking:

"When I look at the data, I break it down into binary components, which is a method that I've adopted from Voros.

"For example, Voros would do:
$SO = SO/(PA-BB)
$nonHRH = (H-HR)/(PA-BB-SO-HR) or inplay BA
And you can continue
$2B3B = (2b+3b)/(H-HR)
$3B = 3b / (2b+3b)

"At every step, Voros would remove one component, so that each metric is independent of the others, a very binary tree approach."

Here is my modified version:

1. $BB = (BB + HBP) / PA

Did the pitcher miss the strike zone (or hit the batter), and did the batter lay off? (If so, then go to #7). If not, then...

2. $SO = SO / (PA - BB - HBP)

Did the batter make contact? If so, then...

3. $HR = HR / (PA - BB - SO - HBP)

Did the batter hit a homerun? If not, then...

4. $H = (H - HR) / (PA - HR - BB - SO - HBP)

Did the batter get a base hit? (This is roughly the same as BABIP). If so, then...

5. $XBH = (2B + 3B) / (H - HR)

Did the hit go for extra bases? (If not, go to #7). If so, then...

6. $3B = 3B / (2B + 3B)

Was the extra-base hit a triple?

7. $SA = (SB + CS) / (H - 2B - 3B - HR + BB + HBP)

If the batter is on first, did he attempt to steal? (I know not all steal attempts are from first base, but most of them are - unless Billy Hamilton is involved. Nor does a batter need a hit or a walk or a HBP to reach first, and sometimes when he's on base the next base is blocked. But this is a good approximation of stolen base "opportunities".)

If the batter attempted to steal, then:

8. $SB = SB / (SB + CS)

Was the steal successful?

This accounts for all statistics in Steamer projections for batters. I don't worry about sac flies, sac bunts, reached on errors, GIDPs, etc.

For an example, let's use...who else? Joey Votto.

Here is Steamer's up-to-date projection for Votto, per 600 PA:

Name        AB   H 2B 3B HR  R RBI  BB  SO HBP SB CS  AVG  OBP  SLG
Joey Votto 478 139 28  2 22 84  73 111 120   5  6  4 .290 .426 .496

Plugging those numbers into the binary component formulas:

1. $BB = (BB + HBP) / PA = (111 + 5) / 600 = 19%

2. $SO = SO / (PA - BB - HBP) = 120 / (600 - 111 - 5) = 25%

3. $HR = HR / (PA - BB - SO - HBP) = 22 / (600 - 111 - 120 - 5) = 6%

4. $H = (H - HR) / (PA - HR - BB - SO - HBP) = (139 - 22) / (600 - 22 - 111 - 120 - 5) = 34%

5. $XBH = (2B + 3B) / (H - HR) = (28 + 2) / (139 - 22) = 26%

6. $3B = 3B / (2B + 3B) = 2 / (28 + 2) = 7%

7. $SA = (SB + CS) / (H - 2B - 3B - HR + BB + HBP) = (6 + 4) / (139 - 28 - 2 - 22 + 111 + 5) = 5%

8. $SB = SB / (SB + CS) = 6 / (6 + 4) = 60%
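The eight calculations above can be sketched as one small Python function. The name `binary_components` and its argument names are made up for illustration, and the sketch assumes every denominator is non-zero (a line with no steal attempts, for example, would need a guard).

```python
def binary_components(pa, h, doubles, triples, hr, bb, so, hbp, sb, cs):
    """Return the eight binary component rates for one batting line."""
    free_passes = bb + hbp                 # walks plus hit-by-pitches
    contact = pa - free_passes - so        # PAs not ending in a BB, HBP, or SO
    singles = h - doubles - triples - hr
    return {
        "$BB": free_passes / pa,
        "$SO": so / (pa - free_passes),
        "$HR": hr / contact,
        "$H": (h - hr) / (contact - hr),
        "$XBH": (doubles + triples) / (h - hr),
        "$3B": triples / (doubles + triples),
        "$SA": (sb + cs) / (singles + free_passes),
        "$SB": sb / (sb + cs),
    }


# Votto's Steamer line per 600 PA, copied from the table above
votto = binary_components(pa=600, h=139, doubles=28, triples=2, hr=22,
                          bb=111, so=120, hbp=5, sb=6, cs=4)
for name, rate in votto.items():
    print(f"{name:>4} {rate:.0%}")
```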

Against average pitching, playing his home games at GABP, Joey Votto will walk or get hit by a pitch in 19% of his plate appearances. If he doesn't walk or get hit by a pitch, he will strike out 25% of the time. When he connects, 6% of the balls he hits will leave the yard, and of the ones that stay in the park, 34% will go for hits. 26% of those hits will go for extra bases, and of those, 7% will be triples. When Votto is on first, he will attempt to steal 5% of the time, and he will be successful in 60% of those attempts.
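Because each rate is conditional on the branches before it, multiplying down the tree should recover the share of all plate appearances that end in each event. Here is a minimal sketch of that check on Votto's line. The variable names are made up for illustration, and the "other in play" bucket is an assumption of this sketch: it lumps in-play outs together with the sac flies, reached-on-errors, and so on that the components ignore.

```python
# Votto's projected line per 600 PA, from the table above
PA, H, D, T, HR, BB, SO, HBP = 600, 139, 28, 2, 22, 111, 120, 5

# Conditional rate at each branch of the tree
p_bb = (BB + HBP) / PA
p_so = SO / (PA - BB - HBP)
p_hr = HR / (PA - BB - SO - HBP)
p_h = (H - HR) / (PA - HR - BB - SO - HBP)
p_xbh = (D + T) / (H - HR)
p_3b = T / (D + T)

# Multiply down the branches to get unconditional per-PA probabilities
no_bb = 1 - p_bb                 # reached the strikeout question
contact = no_bb * (1 - p_so)     # reached the homerun question
in_play = contact * (1 - p_hr)   # reached the base hit question
per_pa = {
    "BB+HBP": p_bb,
    "SO": no_bb * p_so,
    "HR": contact * p_hr,
    "1B": in_play * p_h * (1 - p_xbh),
    "2B": in_play * p_h * p_xbh * (1 - p_3b),
    "3B": in_play * p_h * p_xbh * p_3b,
    "other in play": in_play * (1 - p_h),
}

for event, p in per_pa.items():
    print(f"{event:<14} {p:.3f}  ({p * PA:.0f} per {PA} PA)")
```

The probabilities sum to 1, and multiplying each by 600 gives back the counts in the projection (120 strikeouts, 22 homeruns, and so on), which is a quick way to confirm that no plate appearance is counted twice.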

Here are the components for Votto and Billy Hamilton based on their Steamer projections, along with the rates for the 2016 MLB season to-date:

                 $BB $SO $HR  $H $XBH $3B $SA $SB
Votto (proj.)    19% 25%  6% 34%  26%  7%  5% 60%
Hamilton (proj.)  8% 20%  2% 30%  21% 15% 56% 76%
2016 MLB          9% 23%  4% 30%  25% 10%  8% 72%

I think Steamer's stolen base success rates ($SB) are a little pessimistic - Billy Hamilton's actual stolen base success rate in 2016 is 88%; the MLB average of 72% is almost as good as his projected 76%.

Anyway, to find the rates for pitchers, you can just plug in their batting-against statistics. Unfortunately, Steamer's pitcher projections only show standard pitching stats (innings pitched instead of plate appearances and at bats, and no doubles, triples, stolen bases, etc.).

But we can still figure the first four components from what we have. The formulas are different - I have to estimate batters faced by multiplying innings by 3 to get outs and adding hits and walks:

1. $BB = BB / ((IP * 3) + H + BB)

2. $SO = SO / ((IP * 3) + H)

3. $HR = HR / ((IP * 3) + H - SO)

4. $H = (H - HR) / ((IP * 3) + H - HR - SO)
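The four pitcher formulas above can be sketched the same way. The function name is made up for illustration, the sketch assumes innings are given as a true decimal (65.333 for 65 1/3, not box-score notation like 65.1), and the sample line is invented purely to exercise the code. It is not Chapman's projection or anyone else's.

```python
def pitcher_components(ip, h, bb, so, hr):
    """First four binary components from a standard pitching line."""
    outs = ip * 3
    # Estimated batters faced: outs plus hits plus walks.
    # This ignores HBP, errors, double plays, and runners caught stealing.
    batters_faced = outs + h + bb
    return {
        "$BB": bb / batters_faced,
        "$SO": so / (batters_faced - bb),
        "$HR": hr / (batters_faced - bb - so),
        "$H": (h - hr) / (batters_faced - bb - so - hr),
    }


# Made-up pitching line for illustration only
example = pitcher_components(ip=60, h=40, bb=20, so=100, hr=5)
for name, rate in example.items():
    print(f"{name:>4} {rate:.0%}")
```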

Here are Aroldis Chapman's Steamer-projected components, along with the MLB rates for 2016:

                $BB $SO $HR  $H
Chapman (proj.)  8% 42%  3% 29%
2016 MLB         8% 23%  4% 29%

Notice the MLB components are slightly different for batting totals and pitching totals due to the different formulas used, but a pitcher's components can easily be put back on the batting scale by multiplying each one by the MLB rate from batting totals and dividing by the MLB rate from pitching totals.
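As a rough sketch of that rescaling, here it is applied to Chapman's four components. The function name `to_batting_scale` is made up for illustration, and the inputs are the whole-percent figures from the tables above, so the results are only as precise as that rounding allows.

```python
# 2016 MLB rates from the two tables above, rounded to whole percents
mlb_batting = {"$BB": 0.09, "$SO": 0.23, "$HR": 0.04, "$H": 0.30}
mlb_pitching = {"$BB": 0.08, "$SO": 0.23, "$HR": 0.04, "$H": 0.29}
chapman = {"$BB": 0.08, "$SO": 0.42, "$HR": 0.03, "$H": 0.29}


def to_batting_scale(pitcher, league_batting, league_pitching):
    """Rescale pitcher components computed from pitching totals."""
    return {
        name: rate * league_batting[name] / league_pitching[name]
        for name, rate in pitcher.items()
    }


rescaled = to_batting_scale(chapman, mlb_batting, mlb_pitching)
for name, rate in rescaled.items():
    print(f"{name:>4} {rate:.0%}")
```

With these rounded inputs only $BB and $H move, since the league $SO and $HR rates round to the same value on both scales.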

The usefulness of this is in solving batter/pitcher matchups. Say Chapman is pitching to Votto, and he doesn't walk him. If Votto strikes out in 25% of non-walk PAs, and Chapman strikes out batters he doesn't walk at a rate of 1.83 (42% / 23%) times the league average, then Votto would strike out in 46% (.25 * 1.83) of his non-walk PAs against Chapman (and that's without multiplying in a platoon factor, which would be over 1 for lefty vs. lefty). You can work your way through each of the first four components this way.
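Working through all four components this way might look like the sketch below. It uses the rounded rates from the tables above, takes the league line from pitching totals so that Chapman's factors are on his own scale, and leaves out platoon, park, and home field factors entirely. The dictionary names are made up for illustration.

```python
# Rounded component rates from the tables above
votto = {"$BB": 0.19, "$SO": 0.25, "$HR": 0.06, "$H": 0.34}
chapman = {"$BB": 0.08, "$SO": 0.42, "$HR": 0.03, "$H": 0.29}
mlb_pitching = {"$BB": 0.08, "$SO": 0.23, "$HR": 0.04, "$H": 0.29}

matchup = {}
for name, batter_rate in votto.items():
    # Pitcher's rate as a factor of the league-average rate
    pitcher_factor = chapman[name] / mlb_pitching[name]
    matchup[name] = batter_rate * pitcher_factor

for name, rate in matchup.items():
    print(f"{name:>4} {rate:.1%}")
```

The $SO entry reproduces the 46% worked out above.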

But like I said earlier, this method (converting pitcher components to factors of the league average, and multiplying them by batter components) is how I used to do it. In my next post, I'll talk about why this method doesn't work, and discuss a better method to turn my batter and pitcher components into match-up probabilities - the Odds Ratio Method.


  1. Just stumbled on your blog today. Loved this post.

    1. Thanks for reading. The binary components are what I used to find the actual baseball generations, which I will be writing about here in the very near future.