Bill James' similarity scores use traditional baseball statistics, which include both counting stats (hits, homeruns, strikeouts) and rate stats (batting average, on-base percentage, earned run average).
As Kerry Whisnant explained, using counting stats in similarity scores will mean that "players can be similar only if they are similar hitters and have similar numbers of plate appearances" (emphasis mine). That's not necessarily a bad thing, for Bill's purposes. The length of a player's career could be considered part of his "statistical image", so it might be appropriate to use counting stats for player comparisons. Likewise, the total number of plate appearances in a season, which is normally relatively stable from year to year unless there's an expansion or a work stoppage, could be considered part of a season's "statistical image".
But cohorts vary so widely in size from birthyear to birthyear that raw counting stats lose all meaning. Strauss & Howe were much more interested in a cohort's "peer personality" than in its total numbers. To find the peer personality of baseball cohorts, we need rate-based similarity scores.
But the traditional rate stats are redundant, as Jim Albert explained in an article in By the Numbers:
"...Traditional hitting statistics confound...talents. A batting average confounds three batter talents: the talent not to strikeout, the talent to hit a home run, and the talent to hit an in-play ball for a hit. An on-base percentage confounds the batter’s talent to draw a walk with his talent to get an in-play hit and his talent to hit a home run" (page 24).
In 2001, Voros McCracken rocked the sabermetric community by figuring out how to sort out responsibility between pitchers and fielders. Using a "divide and conquer" approach, he isolated, one at a time, defense-independent statistics (walks, strikeouts, and homeruns) from the defense-dependent ones (everything else)... and realized that pitchers have little-to-no control over the "everything else": balls batted into the field of play.
Jim Albert recreated Voros's approach (on pages 23 and 24), but applied it to batting statistics, breaking them down into four basic skills:
"A player comes to bat for a plate appearance. Either the player walks or doesn’t walk – his chance of walking is estimated by the walk rate (BB+HBP)/PA. (Note that we combine walks and hit by pitches in the formula since each event has the same result of getting the batter to first base without creating an at-bat.)
"Removing walks from the plate appearances, we next record if the batter strikes out or not. We define the strikeout rate as the fraction of strikeouts to the number of at-bats or SO/AB.
"With walks and strikeouts removed, we next record if the batter hits a home run or not. The home run rate is defined to be the fraction of home runs for all plate appearances where contact is made by the bat. That is, HR rate = HR/(AB – SO).
"With the walks, strikeouts, and homeruns removed, we only have plate appearances where the ball is hit in the park. Of these balls put in-play, we record the fraction that fall in for hits – we call this the in-play hit rate or 'hit rate' (H-HR)/(AB-SO-HR)....
"Since these rates are defined by sequentially removing walks, strikeouts, and home runs from the plate appearances, they measure distinct qualities of a hitter. Specifically, these rates measure (1) the talent to draw a walk, (2) the talent to avoid a strikeout, (3) the talent to hit a ball out of the park (a home run), and (4) the talent to hit a ball 'where they ain’t'."
And unlike traditional statistics, which have different rate categories for hitting (average, OBP) and pitching (ERA, WHIP), these four rates can be used for hitters AND pitchers. Pitchers are trying to accomplish the exact opposite of what hitters are trying to accomplish. Hitters are trying to draw walks, make contact, hit homeruns, and convert batted balls into hits. Pitchers are trying to throw strikes, miss bats, prevent homeruns, and (with the help of their defense) convert batted balls into outs. So it makes sense then to use the same statistics to measure hitters and pitchers.
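Albert's four sequential rates, as quoted above, can be sketched in a few lines of Python. This is only an illustration, not Albert's own code: the function name, the dictionary keys, and the season line in the example are made up for the purpose.

```python
def albert_rates(pa, bb, hbp, ab, so, hr, h):
    """Four sequential rates, following the formulas quoted from Albert.

    Each denominator removes the events already accounted for by the
    previous rate (walks, then strikeouts, then home runs).
    """
    return {
        "walk": (bb + hbp) / pa,                   # (BB+HBP)/PA
        "strikeout": so / ab,                      # SO/AB
        "home_run": hr / (ab - so),                # HR/(AB-SO)
        "in_play_hit": (h - hr) / (ab - so - hr),  # (H-HR)/(AB-SO-HR)
    }


# Hypothetical season line, for illustration only (not a real player).
rates = albert_rates(pa=600, bb=60, hbp=5, ab=535, so=100, hr=25, h=150)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Because each rate has its own denominator, a batter can change one rate (say, strike out more often) without mechanically changing the others.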
Dave Studeman also wrote about the four rates:
"...The beauty of these ratios is that they build on each other. The impact of each event is removed from the denominator of the next event. A batter can’t hit a home run unless he actually hits a ball, so walks and strikeouts aren’t considered in the home run rate.
"For pitchers, you often see strikeout and walk rates, but as a proportion of total plate appearances. Here, they are treated sequentially. The strikeout ratio implies that a pitcher can’t strike out a batter if he walks him first, so the formula takes walks out of consideration when calculating strikeout rates."
Tom Tango expanded on Voros's "four horsemen", breaking the in-play hit component down into two additional components: extra-base hits (doubles and triples) per in-play hit, and triples per extra-base hit. He also added stolen bases as a rate of opportunities (singles and walks). "Each rate describes something specific," he wrote, and he used the seven rates to study the aging patterns of different skills in batters.
So, the idea then is that there are four basic outcomes of every batter/pitcher match-up: a walk (or HBP), a strikeout, a homerun, or a ball in play. A ball in play can result in a hit, a hit can be an extra-base hit, and an extra-base hit can be a triple. Also, a batter who has reached base (usually via single or walk) can steal a base.
But there's another, even-more-obvious framework for baseball events: GAMES. Every batter-pitcher match-up happens within the context of a game. In 1876, batters and pitchers had the same roles, and thus the same usage: starters played almost every game, and almost every inning, and reserves were basically emergency back-ups. Aside from the practice of platooning, a batter's role hasn't changed much in the last 140 years, whereas the pitcher's role has evolved through the generations and fragmented into a variety of roles - from a complete game starter who pitches once every five days to a one-out reliever who can pitch several days in a row, and everything in between.
Since a pitcher's role is part of his personality/statistical image, I am adding one more rate to the seven Voros/Tango components: batters faced per game. And since a batter's role is NOT really a defining part of his personality (and since it would render the rate meaningless if I confounded hitters' plate appearances with pitchers' batters faced) it is a pitchers-only rate.
So here are the eight rates - the eight attributes that define the "personality" of a baseball player or cohort or cohort-group - and the formulas I use to calculate them, grouped by what statistics they draw from (batting, pitching, or both):
Pitchers only:
BF / G
Batters and Pitchers (the "Three True Outcomes"):
$BB = (BB + HBP) / PA
$SO = SO / (PA - BB - HBP)
$HR = HR / (PA - BB - SO - HBP)
Batters only:
$H = (H - HR) / (PA - HR - BB - SO - HBP)
$XBH = (2B + 3B) / (H - HR)
$3B = 3B / (2B + 3B)
$SB = SB / (H - 2B - 3B - HR + BB + HBP)
(Unlike Albert, I use (PA - BB - HBP) instead of at-bats, because at-bats weren't recorded for pitchers until 1930, and because it keeps sac flies and sac hits included in the denominator.)
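As a rough sketch, the seven batting-side formulas above translate directly into Python. The function name, argument names, and the sample stat line are assumptions made for illustration; only the formulas themselves come from the list above.

```python
def batting_rates(pa, bb, hbp, so, hr, h, doubles, triples, sb):
    """Seven rates from a batting line, using PA-based denominators."""
    walks = bb + hbp              # walks and hit-by-pitches combined
    no_walk = pa - walks          # PA - BB - HBP
    contact = no_walk - so        # PA - BB - SO - HBP
    in_play = contact - hr        # PA - HR - BB - SO - HBP
    in_play_hits = h - hr         # H - HR
    xbh = doubles + triples       # 2B + 3B
    singles = in_play_hits - xbh  # H - 2B - 3B - HR
    return {
        "$BB": walks / pa,
        "$SO": so / no_walk,
        "$HR": hr / contact,
        "$H": in_play_hits / in_play,
        "$XBH": xbh / in_play_hits,
        "$3B": triples / xbh,
        "$SB": sb / (singles + walks),
    }


# Hypothetical stat line, for illustration only (not a real player).
line = batting_rates(pa=600, bb=60, hbp=5, so=100, hr=25, h=150,
                     doubles=30, triples=5, sb=10)
for name, value in line.items():
    print(f"{name} = {value:.3f}")
```

The sketch does no guarding against zero denominators, which can occur in very small samples (for example, a player with no extra-base hits has no defined $3B).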
Since the "three true outcomes" ($BB, $SO, and $HR) combine batting AND pitching statistics, the formula for $BB (as an example) is actually
$BB = (BBb + HBPb + BBp + HBPp) / (PA + BF)
where BBb is batters' bases on balls, BBp is pitchers' bases on balls, etc.
This means that when 1991-born Mike Trout faces 1991-born Trevor Bauer, it counts as TWO plate appearances for the 1991 cohort. And if Trout strikes out, it counts as two strikeouts, and if he homers, it counts as two homeruns. (Actually, a homerun would go down as an 0-for-2 in the $BB component, an 0-for-2 in the $SO component, and a 2-for-2 in the $HR component. But if Trout instead hits a single, it would be an 0-for-2 in $HR and a 1-for-1 in $H, since the $H rate doesn't draw from pitcher statistics.)
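The double counting in the Trout/Bauer example can be checked with a small tally. This sketch returns (successes, opportunities) pairs rather than ratios, to mirror the "0-for-2" wording. Only the combined $BB formula is spelled out above; extending the same pooling to $SO and $HR is an assumption here, as are the function name and the dictionary layout.

```python
def combined_tallies(bat, pit):
    """(successes, opportunities) for the three true outcomes of a cohort.

    bat: batting totals with keys PA, BB, HBP, SO, HR
    pit: pitching totals with keys BF, BB, HBP, SO, HR
    Missing keys are treated as zero.
    """
    opportunities = bat.get("PA", 0) + pit.get("BF", 0)
    walks = (bat.get("BB", 0) + bat.get("HBP", 0)
             + pit.get("BB", 0) + pit.get("HBP", 0))
    strikeouts = bat.get("SO", 0) + pit.get("SO", 0)
    homers = bat.get("HR", 0) + pit.get("HR", 0)
    return {
        "$BB": (walks, opportunities),
        "$SO": (strikeouts, opportunities - walks),
        "$HR": (homers, opportunities - walks - strikeouts),
    }


# One matchup between two players of the same cohort, ending in a homerun:
# the batter's line and the pitcher's line each record the same event.
homer = combined_tallies(bat={"PA": 1, "HR": 1}, pit={"BF": 1, "HR": 1})
print(homer)  # {'$BB': (0, 2), '$SO': (0, 2), '$HR': (2, 2)}

# The same matchup ending in a strikeout instead.
strikeout = combined_tallies(bat={"PA": 1, "SO": 1}, pit={"BF": 1, "SO": 1})
print(strikeout)  # {'$BB': (0, 2), '$SO': (2, 2), '$HR': (0, 0)}
```

Keeping the pairs separate also avoids dividing by zero when a rate has no opportunities, as with $HR after a strikeout.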
Here is a "binary tree" diagram of the eight attribute rates. The "headers" are the denominators, or opportunities:
[Diagram: binary tree of the eight attribute rates. Headers (denominators): per G, per PA, per H, per XBH, per 1B or BB.]
Concluded in part three.