Saturday, December 31, 2016

The HOF Case for (Almost) Every Player on the 2017 Ballot

The title's misleading: I've made a case for 19 players, and there are 34 players on the 2017 ballot. And I haven't really "made a case" for any of them (other than Barry Bonds, who I consider a no-brainer).

All I've done is list each player's career ranks within his generation. There are two generations of players on the ballot, Boomers (two) and Gen-Xers (thirty-two). Players who don't rank in the top 10 in their generation in any reasonably-meaningful statistical category aren't listed.

A couple more caveats: players' ranks are by generation, not by position. So even though Pudge Rodriguez is probably the greatest catcher of Generation X, he only ranks in the top 10 of a couple categories, and then just barely, because he's competing with all batters in his generation.

Also, pitchers need 1,000 innings to qualify for rate statistics, which is a total only a handful of career relievers reach anymore. So even though Billy Wagner only pitched 186 fewer innings than Trevor Hoffman, Wagner is on the wrong side of the 1,000-inning threshold, and thus doesn't qualify for rate stats (even though many of his rate stats are better than Hoffman's).

Boomers


Tim Raines - 1st in SB% (84.7%), 2nd in Stolen Bases (808), 6th in Triples (113), 8th in Runs (1,571), 8th in Walks (1,330), 10th in OBP (.385)

Lee Smith - 1st in Saves (478), 2nd in FIP (2.93), 2nd in K/9 (8.73), 4th (tied) in ERA+ (132), 10th in ERA (3.03)

Generation X


Barry Bonds - 1st in Runs (2,227), 1st in Homeruns (762), 1st in Walks (2,558), 1st in Intentional Walks (688), 1st in OBP (.444), 1st in Slugging (.607), 1st in OPS (1.051), 1st in OPS+ (182), 1st in Total Bases (5,976), 1st in Extra Base Hits (1,440), 1st in Times on Base (5,599), 1st in WAA (123.5), 1st in WAR (162.4), 2nd in RBI (1,996), 2nd in Isolated Power (.309), 3rd in Stolen Bases (514), 4th in Doubles (601), 7th in Hits (2,935)

Gary Sheffield - 4th (tied) in Sac Flies (111), 6th in Walks (1,475), 6th in Times on Base (4,299), 9th in Runs (1,636), 10th in RBI (1,676)

Ivan Rodriguez - 9th in Hits (2,844), 10th in Doubles (572)

Sammy Sosa - 5th in Homeruns (609), 10th in Isolated Power (.261)

Manny Ramirez - 2nd in OPS (.996), 3rd in Slugging (.585), 4th in Isolated Power (.273), 5th in RBI (1,831), 5th in Intentional Walks (216), 5th in OBP (.411), 5th in OPS+ (154), 6th (tied) in Batting Avg. (.312), 7th in Extra Base Hits (1,122), 8th in Total Bases (4,826), 9th in Homeruns (555)

Jeff Bagwell - 6th in OBP (.408), 6th in OPS+ (149), 6th in WAA (51.8), 7th in Walks (1,401), 7th in WAR (79.6), 9th in OPS (.948), 10th (tied) in Sac Flies (102)

Vladimir Guerrero - 1st in Batting Avg. (.318), 3rd in Intentional Walks (250), 10th in Slugging (.553)

Jeff Kent - 8th (tied) in Sac Flies (103)

Larry Walker - 3rd (tied) in Batting Avg. (.313), 5th in Slugging (.565), 5th (tied) in OPS (.965), 7th in WAA (48.2), 10th (tied) in OBP (.400), 10th in WAR (72.6)

Edgar Martinez - 3rd in OBP (.418), 6th (tied) in Batting Avg. (.312), 7th (tied) in OPS+ (147)

Magglio Ordonez - 9th (tied) in Batting Avg. (.309)

Roger Clemens - 1st in Complete Games (118), 1st in Shutouts (46), 1st in WAA (94.5), 1st in WAR (139.4), 2nd in Wins (354), 2nd in Innings (4,916 2/3), 2nd in Strikeouts (4,672), 3rd in ERA+ (143), 4th in W-L% (.658), 4th in ERA (3.12), 4th in FIP (3.09), 9th in HR/9 (0.66), 9th in H/9 (7.66)

Mike Mussina - 5th in Wins (270), 5th in WAR (82.7), 6th in Shutouts (23), 6th in Innings (3,562 2/3), 6th in WAA (48.6), 7th in W-L% (.638), 7th in Strikeouts (2,813), 10th (tied) in Strikeouts per Walk (3.58)

Trevor Hoffman - 2nd in Saves (601), 2nd in ERA (2.87), 2nd in H/9 (6.99), 3rd in FIP (3.08), 3rd in WHIP (1.058), 4th in K/9 (9.36), 5th in ERA+ (141), 6th in Strikeouts per Walk (3.69)

Billy Wagner - 3rd in Saves (422)

Curt Schilling - 1st in Strikeouts per Walk (4.38), 4th in Complete Games (83), 5th in Strikeouts (3,116), 5th in WHIP (1.137), 5th in WAA (54.1), 6th in FIP (3.23), 6th in WAR (80.7), 9th (tied) in Shutouts (20), 10th (tied) in K/9 (8.60)

Arthur Rhodes - 8th (tied) in K/9 (8.73)

Friday, December 30, 2016

Hall of Fame Players by Generation

The table below shows the eight (semi-)retired MLB-playing generations. For each generation, I have listed the number of members with 10-year MLB careers (who ended their careers by 2010), the number enshrined in the Hall of Fame as players, and the percentage of eligible players enshrined. (Addie Joss is the only Hall-of-Fame player with fewer than 10 years.)

| Generation | Birthyears | 10 years | HOF | HOF% | First Members Inducted |
|---|---|---|---|---|---|
| National | 1835-1856 | 54 | 5 | 9.3% | Cap Anson & Old Hoss Radbourn (1939) |
| American | 1857-1873 | 182 | 27 | 14.8% | Cy Young (1937) |
| Dead-Ball | 1874-1892 | 264 | 35 | 13.3% | Cobb, Johnson, Mathewson & Wagner (1936) |
| Live-Ball | 1893-1911 | 340 | 50 | 14.7% | Babe Ruth (1936) |
| G.I. | 1912-1928 | 308 | 29 | 9.4% | Joe DiMaggio (1955) |
| Silent | 1929-1944 | 434 | 31 | 7.1% | Sandy Koufax (1972) |
| Boom | 1945-1961 | 686 | 29 | 4.2% | Catfish Hunter (1987) |
| Generation X | 1962-1980 | 621 | 11 | 1.8% | Roberto Alomar (2011) |
| TOTAL | | 2,889 | 217 | 7.5% | |

As you can see, the Hall of Fame is suffering from a severe anti-recency bias. Over 14% of players born from 1857 to 1911 are enshrined, but the percentage then plummets with each successive generation, down to less than 2% of eligible Gen-Xers (players born from 1962 to 1980).

Overall, 7.5% of eligible players are enshrined in the Hall of Fame. (This table considers banned players eligible.) To get up to that 7.5%, the Silent Generation would need two more players to get inducted (how about Pete Rose and Dick Allen?), the Boom Generation would need 23 more players, and Gen X would need 36 more currently-eligible players. (But of course, the number of Gen-Xers eligible for enshrinement is increasing every year as well.)
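The arithmetic behind those three numbers can be sketched in a few lines of Python. This is only an illustration: the counts are copied from the table above, the overall rate is taken as 217 out of 2,889, and the function name `inductees_needed` is made up for this example.

```python
import math

# (players with 10-year careers, Hall of Famers), copied from the table above
generations = {
    "Silent": (434, 31),
    "Boom": (686, 29),
    "Generation X": (621, 11),
}

# Overall rate across all eight generations: 217 of 2,889 (about 7.5%)
OVERALL_RATE = 217 / 2889


def inductees_needed(eligible, inducted, rate=OVERALL_RATE):
    """Smallest number of extra inductees that lifts a generation to `rate`."""
    target = math.ceil(eligible * rate)
    return max(0, target - inducted)


needed = {name: inductees_needed(*counts) for name, counts in generations.items()}
print(needed)
```

Rounding up with `math.ceil` is a choice made for this sketch: it asks for the first whole number of inductees that reaches or passes the overall rate.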

Gen X was also relatively late breaking into the Hall. Starting with Babe Ruth's generation, the age of each generation's firstborn cohort when its first player was enshrined was 43, 43, 43, 42, and 49. (For instance, firstborn Gen-Xers like Darren Daulton and Kevin Mitchell were 49 years old when Robbie Alomar became the first of their peers to be inducted into the Hall.)

Wednesday, December 28, 2016

The All-Time, All-Generations Team

I identified the baseball generations and the best player of each generation. But it didn't occur to me until I was almost finished with the leaderboards that there are exactly nine MLB-playing generations. Nine positions, nine innings, nine generations. This wasn't intentional, I promise; it just sort of worked out that way. Also, from the first National cohort (1835, birthyear of Harry Wright) to the last Millennial cohort (1996, birthyear of Julio Urias) there are 162 birthyear-cohorts: an average of exactly 18 birthyears per generation. (162 games per season, 18 half-innings per game...)

Anyways, what if we made an all-time team, where we have not only one player for each position, but one player for each generation?

Using Adam Darowski's Hall Rating as a guideline, I assembled the best possible all-time starting line-up that represents all nine generations (one player from each generation). I even sorted them into a batting order:

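| Pos | Player | Generation | Hall Rating |
|---|---|---|---|
| CF | Willie Mays | Silent | 336 |
| SS | Honus Wagner | Dead-Ball | 284 |
| RF | Babe Ruth | Live-Ball | 399 |
| LF | Ted Williams | G.I. | 279 |
| 3B | Alex Rodriguez | Generation X | 245 |
| 1B | Cap Anson | National | 215 |
| 2B | Robinson Cano | Millennial | 123 |
| C | Johnny Bench | Boom | 181 |
| P | Cy Young | American | 336 |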

The Hall of Fame Case for Barry Bonds

I place Barry Bonds (born 1964) in Generation X...also known as the Steroid Generation because many (most, according to Jose Canseco) of its members used (or were accused of using) performance-enhancing drugs.

This generation includes all players born between 1962 and 1980 - 3,600 players with MLB experience. You could call these 3,600 players the peers or contemporaries of Barry Bonds.

Out of those 3,600 players, Bonds ranks 1st in runs scored, 1st in homeruns, 1st in walks, 1st in on-base percentage, 1st in slugging %, 1st in OPS, and 1st in OPS+. He's his generation's all-time leader in runs created (by over 600), even though he's only 8th in outs. He leads his peers in total bases, extra-base hits, times on base, Wins Above Average, and Wins Above Replacement.

He's also 2nd in runs batted in, 3rd in stolen bases (but 8th in caught stealing), 4th in doubles, and 7th in hits.

Let's compare Bonds' accomplishments to those of one of his more-likable peers: Ken Griffey Jr., who was just inducted into the Hall of Fame with the highest-ever percentage of votes from the BBWAA.

Among Gen-Xers, Griffey ranks 3rd in homeruns, 3rd in RBI, 4th in total bases, 4th (tied) in XBH, 7th in runs created, 7th in outs, 8th in runs scored, and 9th in times on base. He's not in the top 10 in hits, OBP, slugging, or stolen bases. He ranks 8th in WAA and 6th in WAR, so he's arguably not even the best position player in his generation who's untainted by steroid allegations.

For sheer performance, Generation X might have been the best generation of baseball players ever, and Barry Bonds was the best player of that generation.

Yeah, he was almost certainly using PEDs, and Griffey almost certainly wasn't. But, as I said, Bonds was far from the only player using; he belonged to a PED-enhanced generation. But between the surely PED-enhanced (Bonds, McGwire, A-Rod) and the surely natural (Griffey), there's the vast majority of this generation's players, and they're stuck in a murky limbo of allegations and uncertainty about whether or not they used and to what extent using improved their performance.

The only thing we know for sure is the record book. And we know two things about Bonds: he was a first-ballot Hall of Famer (400 homers, 400 steals, .400 OBP, 8 Gold Gloves) BEFORE he (allegedly) started using in 1999, and once he started using, for about five years he was the greatest (and most feared) hitter the game has ever seen.

Monday, December 19, 2016

The Leaderboards are Finished

...for now. I may add some (although I doubt it). I may delete some. There're currently 76 statistical categories, and each has a page both for single-season and for career leaders.

After some more spottiness, the Play Index Finder started working great, and allowed me to fly through most of the pitching leaders. I owe a big thanks to Sean Forman and baseball-reference.

At some point I started using the Strauss & Howe names for a couple of the generations, and restored the "Steroid Generation" to Generation X. While Integration and Expansion are important things that happened during the respective careers of the 1912-1928 and 1929-1944 cohort-groups, I decided that they just weren't very good names for the groups themselves. I didn't go back and change the pages I had already completed, though, which is something I'll do once I've settled on names. So currently, some generations are called different names on different pages.

Observations on record-setting (mostly Millennial) pitchers:

Chris Archer and James Shields BOTH tied Jeremy Bonderman's 13-year-old Millennial record for losses with 19. Bonderman's teammate on the '03 Tigers, Mike Maroth, holds the Gen X record, and is the only pitcher since Brian Kingman in 1980 to lose 20 games.

Shields had a rough year - he also set a new Millennial record with 40 homeruns allowed. Jered Weaver and Josh Tomlin cracked the top 5.

* * * * * * * * * * * * * *

Jose Rijo is 7th among Gen-Xers in career ERA, at 3.24.

Johnny Cueto is 7th among Millennials in career ERA, at 3.23.

* * * * * * * * * * * * * *

Why was 1988 the a-Balk-alypse??

The top nine seasons for balks by Boomers all occurred in 1988 (including an all-time record 16 by Dave Stewart). So did six of the top ten Gen X seasons (and three of the other four occurred in 1987 or 1989). Apparently a "subtle" rule change caused an explosion in the number of balks called. The MLB record was broken... six weeks into the 1988 season. After the season ended the baseball rules committee wisely changed the rule back to its former wording, and this insanity was never spoken of again. Indeed, I had no idea the a-balk-alypse was a thing until I saw the leaderboards. This blog post goes into great detail about it.

* * * * * * * * * * * * * *

In his final season (sadly), Jose Fernandez set a Millennial single-season record for K/9.

The baseball community lost an incredible young man. He will be missed.

* * * * * * * * * * * * * *

Noah Syndergaard broke Jon Lester's Millennial record (set in 2015) for stolen bases allowed. Surprisingly, Lester is the career leader for caught stealing, with 74. I guess it makes sense - Lester's complete lack of ability to hold runners tempts more runners to steal. Sometimes they get gunned down.

With leaderboards complete, I can start on some more interesting work: generational biographies.

Tuesday, December 6, 2016

The Leaderboards are Being Uncluttered...

I have most of the batting leaderboards done; after they're finished I'll start on the pitching leaders. It's a slow process, made more so by the occasional lack of cooperation by the Play Index Finder (maybe it's a problem with my PC or internet, who knows).

Some observations thus far:

Bryce Harper's epic 2015 season set Millennial records in OBP, SLG, OPS, OPS+, and runs created. It's already looking like a fluke season... or, more likely (as Bill James opined), like Reggie Jackson's 1969 season: a performance he will never top even if he goes on to a long Hall-of-Fame career.

Mike Trout crossed the 3,000-PA threshold in 2016, meaning he now qualifies for career leaderboards (Harper is still 230 PA short). Among Millennials, Trout ranks 7th in BA, 2nd in OBP (to Joey Votto), 2nd in SLG (to Miguel Cabrera), and 1st in OPS. In OPS+ (which adjusts for league and ballpark effects), Trout is way ahead; Votto is a distant 2nd, followed by Miggy.

The Reds' Billy Hamilton ignominiously set the Millennial caught stealing record his rookie year, with 23. He's been a much more efficient base-stealer since then (he currently ranks 8th among his peers in career SB%), but he still has yet to crack the Millennial top 10 in stolen bases, simply because he hasn't played a full season since his rookie year of 2014. The only thing keeping him from 80 steals and the Millennial single-season record is durability.

Sunday, November 20, 2016

The Generations of Baseball, part three

Continued from part two.
"In some respects, a peer personality gives heavy focus to the attitudes and experiences of the generational elite...('the heads of society, the kings of thought, the lords of the generation'). But while they commonly express the tone of a generation's peer personality, the personality itself is often established by non-elites" (Generations page 64).
The following table shows PA and BF totals and attribute rates for the 1964 cohort. I listed all batters with at least 8,000 PA and all pitchers with at least 8,000 BF ("the generational elite"), all other batters and pitchers (the "non-elites"), and the cohort totals. As you can see, the "non-elites" as a whole overwhelm "elites" in raw numbers of PA and BF, but elites still obviously have a lot of impact on their cohort's attribute rates (Barry Bonds accounted for roughly 6% of his cohort's total plate appearances, while Kenny Rogers pitched to about 8% of its batters faced).

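| Player | PA | BF | BF/G | $BB | $SO | $HR | $H | $XBH | $3B | $SB |
|---|---|---|---|---|---|---|---|---|---|---|
| Barry Bonds | 12,606 | | | .211 | .155 | .091 | .284 | .312 | .114 | .124 |
| Rafael Palmeiro | 12,046 | | | .120 | .127 | .061 | .282 | .254 | .061 | .030 |
| Mark Grace | 9,290 | | | .119 | .078 | .023 | .308 | .245 | .081 | .025 |
| B.J. Surhoff | 9,106 | | | .074 | .099 | .025 | .289 | .225 | .087 | .061 |
| Barry Larkin | 9,057 | | | .110 | .101 | .027 | .304 | .241 | .147 | .145 |
| Will Clark | 8,283 | | | .120 | .163 | .047 | .325 | .257 | .097 | .028 |
| Ellis Burks | 8,177 | | | .104 | .183 | .059 | .312 | .265 | .135 | .084 |
| Jose Canseco | 8,129 | | | .122 | .272 | .089 | .299 | .250 | .040 | .098 |
| Kenny Rogers | | 14,280 | 18.7 | .091 | .152 | .031 | | | | |
| Dwight Gooden | | 11,705 | 27.2 | .088 | .215 | .025 | | | | |
| John Burkett | | 11,324 | 25.4 | .070 | .168 | .029 | | | | |
| Bobby Witt | | 11,003 | 25.6 | .129 | .204 | .033 | | | | |
| Bret Saberhagen | | 10,421 | 26.1 | .051 | .173 | .027 | | | | |
| Kevin Tapani | | 9,600 | 26.6 | .063 | .165 | .035 | | | | |
| All other batters | 138,971 | | | .091 | .169 | .023 | .279 | .221 | .128 | .081 |
| All other pitchers | | 114,129 | 9.3 | .101 | .168 | .032 | | | | |
| Totals | 215,665 | 182,462 | 12.0 | .099 | .166 | .032 | .286 | .234 | .115 | .079 |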

(Batting stats for pitchers are included in "All other batters"; pitching stats for position players are included in "All other pitchers").

Here are stats for seven consecutive cohorts, 1961-1967 (the first wave of Generation X):

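| Cohort | Sample Member | PA | BF | BF/G | $BB | $SO | $HR | $H | $XBH | $3B | $SB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1961 | Don Mattingly | 114,102 | 114,888 | 11.8 | .088 | .170 | .029 | .283 | .221 | .127 | .082 |
| 1962 | Roger Clemens | 162,446 | 181,147 | 13.7 | .093 | .179 | .031 | .289 | .234 | .112 | .073 |
| 1963 | Randy Johnson | 189,945 | 163,102 | 11.3 | .096 | .183 | .035 | .292 | .229 | .107 | .073 |
| 1964 | Barry Bonds | 215,665 | 182,462 | 12.0 | .099 | .166 | .032 | .286 | .234 | .115 | .079 |
| 1965 | Craig Biggio | 192,592 | 167,783 | 12.7 | .093 | .177 | .033 | .285 | .233 | .111 | .065 |
| 1966 | Greg Maddux | 138,533 | 224,287 | 11.1 | .093 | .180 | .034 | .288 | .244 | .088 | .052 |
| 1967 | John Smoltz | 215,318 | 197,169 | 10.9 | .095 | .177 | .033 | .289 | .236 | .109 | .090 |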

And here is Generation X divided into three cohort-groups:

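| Cohort-Group | Sample Member | PA | BF | BF/G | $BB | $SO | $HR | $H | $XBH | $3B | $SB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1961-1967 | Barry Bonds | 1,228,601 | 1,230,838 | 11.8 | .094 | .176 | .033 | .288 | .233 | .110 | .074 |
| 1968-1974 | Pedro Martinez | 1,252,518 | 1,167,274 | 10.1 | .097 | .183 | .037 | .295 | .244 | .091 | .060 |
| 1975-1981 | Alex Rodriguez | 1,168,441 | 1,206,186 | 10.7 | .093 | .194 | .038 | .292 | .251 | .098 | .065 |
| Generation X | | 3,649,560 | 3,604,298 | 10.9 | .095 | .184 | .036 | .292 | .243 | .099 | .066 |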

And here again are the eleven baseball-playing Strauss & Howe generations - the nine MLB generations (Gilded through Millennial), the purely amateur Transcendentals, and the (so far) purely little-league Homelanders:

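| GENERATION | BIRTH YEARS | PA | BF | BF/G | $BB | $SO | $HR | $H | $XBH | $3B | $SB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Transcendental | 1792-1821 | 0 | 0 | | | | | | | | |
| Gilded | 1822-1842 | 10,042 | 5,945 | 35.4 | .028 | .016 | .002 | .283 | .142 | .293 | .039 |
| Progressive | 1843-1859 | 653,547 | 512,027 | 35.9 | .051 | .067 | .005 | .272 | .210 | .277 | .093 |
| Missionary | 1860-1882 | 1,822,804 | 1,926,470 | 31.5 | .083 | .091 | .006 | .276 | .201 | .294 | .143 |
| Lost | 1883-1900 | 1,868,490 | 1,941,129 | 22.2 | .084 | .095 | .009 | .281 | .216 | .248 | .086 |
| G.I. | 1901-1924 | 2,265,039 | 2,225,348 | 19.1 | .093 | .100 | .019 | .280 | .224 | .185 | .036 |
| Silent | 1925-1942 | 1,989,655 | 1,939,356 | 15.3 | .091 | .152 | .028 | .272 | .203 | .159 | .040 |
| Boom | 1943-1960 | 2,915,943 | 2,904,739 | 14.4 | .091 | .152 | .026 | .279 | .212 | .134 | .071 |
| Generation X | 1961-1981 | 3,649,560 | 3,604,298 | 10.9 | .095 | .184 | .036 | .292 | .243 | .099 | .066 |
| Millennial | 1982-2004 | 1,154,483 | 1,234,298 | 10.0 | .090 | .218 | .036 | .298 | .249 | .103 | .068 |
| Homelanders | 2005-? | 0 | 0 | | | | | | | | |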

But like I said in part one, these are social generations. Strauss & Howe were trying to identify generations based on how they shape and react to history and each other.

My goal is much simpler: I want to identify baseball generations based on how (if at all) they played major league baseball. In part two I defined my tool for identifying baseball generations: similarity scores based on attribute rates.

I'll use the Strauss & Howe generations as a starting point. Then I can ask of each cohort, does it actually belong in this generation? or should it be in a neighboring one?

Let's look again at the first wave of Generation X, the 1961-1967 cohorts. Here are the similarity scores of each cohort to its own generation (Gen X) and to its next-older generation (Boom):

| Cohort | Boom | Gen X |
|---|---|---|
| 1961 | 929 | 870 |
| 1962 | 893 | 940 |
| 1963 | 856 | 968 |
| 1964 | 878 | 916 |
| 1965 | 894 | 939 |
| 1966 | 847 | 961 |
| 1967 | 858 | 953 |

As you can see, each cohort is in fact more similar to Generation X than it is to the Boom Generation, except for the first one, 1961, which is more similar to the Boomers (929 to 870). So based on these numbers, the 1961 cohort belongs in the Boom Generation instead of Generation X.
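That "which generation is this cohort closest to?" check is easy to express in code. Here is a minimal Python sketch using the seven rows from the table above; the dictionary layout and the name `best_generation` are assumptions made for illustration, not anything from the actual worksheet.

```python
# Similarity scores copied from the table above: cohort -> {generation: score}
scores = {
    1961: {"Boom": 929, "Gen X": 870},
    1962: {"Boom": 893, "Gen X": 940},
    1963: {"Boom": 856, "Gen X": 968},
    1964: {"Boom": 878, "Gen X": 916},
    1965: {"Boom": 894, "Gen X": 939},
    1966: {"Boom": 847, "Gen X": 961},
    1967: {"Boom": 858, "Gen X": 953},
}


def best_generation(cohort_scores):
    """Return the generation with the highest similarity score."""
    return max(cohort_scores, key=cohort_scores.get)


assignments = {cohort: best_generation(s) for cohort, s in scores.items()}
for cohort, generation in assignments.items():
    print(cohort, generation)
```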

So what I did is compare every cohort to every generation. Starting with the firstborn major leaguer (Nate Berkenstock, born 1831) and ending with 2016's youngest player (Julio Urias, born 1996), I have 166 MLB cohorts. And there are eleven generations, although the first and last (Transcendentals and Homelanders) are statistically identical, because their members had no major league experience.

To find the statistical similarity of a cohort and a generation (or of any two players or groups), start at 1,000 and subtract a penalty for each attribute rate. The penalty is the absolute difference between the cohort's rate and the generation's rate, times a multiplier. The multiplier is the weight I assigned to the attribute rate, divided by four times its standard deviation.

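| | BF/G | $BB | $SO | $HR | $H | $XBH | $3B | $SB |
|---|---|---|---|---|---|---|---|---|
| St. Dev. | 8.3 | .013 | .047 | .012 | .011 | .019 | .071 | .040 |
| Weight | 90 | 180 | 180 | 180 | 90 | 90 | 90 | 90 |
| Multiplier | 2.7 | 3,483 | 955 | 3,806 | 2,017 | 1,166 | 315 | 569 |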

I have 166 total cohorts, but for standard deviations, I only wanted to include cohorts with at least 10,000 total PA and 10,000 total BF, so I limited my population to the 143 cohorts born between 1850 and 1992.

I wanted the weights of the "three true outcomes" rates ($BB, $SO, and $HR) to be double the other weights, because they use both hitting AND pitching stats, while the other rates only use one or the other. And I wanted the weights to add up to close to 1,000, so that if two groups are exactly four standard deviations apart in every category, they will have a similarity score near zero.
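For anyone who wants to check the multipliers, here is a small Python sketch of the weight / (4 x standard deviation) calculation. It uses the rounded standard deviations shown in the table, so some results come out a little different from the published multipliers (which were presumably computed from unrounded values); the variable and function names are invented for this example.

```python
# Rounded standard deviations and weights, copied from the table above
st_dev = {"BF/G": 8.3, "$BB": 0.013, "$SO": 0.047, "$HR": 0.012,
          "$H": 0.011, "$XBH": 0.019, "$3B": 0.071, "$SB": 0.040}
weight = {"BF/G": 90, "$BB": 180, "$SO": 180, "$HR": 180,
          "$H": 90, "$XBH": 90, "$3B": 90, "$SB": 90}


def multiplier(w, sd, spread=4):
    """Weight divided by (spread x standard deviation)."""
    return w / (spread * sd)


multipliers = {rate: multiplier(weight[rate], st_dev[rate]) for rate in weight}
total_weight = sum(weight.values())

for rate, m in multipliers.items():
    print(f"{rate:5s} {m:10.1f}")
print("total weight:", total_weight)
```

The weights total 990, which is the "close to 1,000" mentioned above.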

Don Mattingly's 1961 cohort has a $HR rate of 2.9%, while the $HR rate of Generation X overall is 3.6%. For the similarity score of that cohort to its generation, the $HR penalty is .007 (.036 - .029) times the rate multiplier of 3,806, or about 27.

Add up all eight penalties and subtract from 1,000, and that is the similarity score.
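Put together, the whole calculation looks something like the following Python sketch. The multipliers and attribute rates are the rounded figures from the tables in this post (with the Strauss & Howe birthyears for Boom and Gen X), so the results only approximate the 929 and 870 listed earlier for the 1961 cohort. The function name `similarity` and the dictionary layout are assumptions for illustration.

```python
# Published multipliers from the table above (weight / (4 x st. dev.))
MULTIPLIERS = {"BF/G": 2.7, "$BB": 3483, "$SO": 955, "$HR": 3806,
               "$H": 2017, "$XBH": 1166, "$3B": 315, "$SB": 569}


def similarity(a, b, multipliers=MULTIPLIERS):
    """Start at 1,000 and subtract multiplier x absolute difference per rate."""
    penalty = sum(m * abs(a[rate] - b[rate]) for rate, m in multipliers.items())
    return 1000 - penalty


# Attribute rates copied from the cohort and generation tables in this post
cohort_1961 = {"BF/G": 11.8, "$BB": .088, "$SO": .170, "$HR": .029,
               "$H": .283, "$XBH": .221, "$3B": .127, "$SB": .082}
gen_x = {"BF/G": 10.9, "$BB": .095, "$SO": .184, "$HR": .036,
         "$H": .292, "$XBH": .243, "$3B": .099, "$SB": .066}  # 1961-1981
boom = {"BF/G": 14.4, "$BB": .091, "$SO": .152, "$HR": .026,
        "$H": .279, "$XBH": .212, "$3B": .134, "$SB": .071}   # 1943-1960

print("1961 vs Gen X:", round(similarity(cohort_1961, gen_x)))
print("1961 vs Boom: ", round(similarity(cohort_1961, boom)))
```

Because every input here is rounded to three decimals, expect the output to land within a few points of the worksheet's scores rather than match them exactly.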

I made a worksheet of the similarity scores of every cohort to every generation, and used conditional formatting to create a "heat map" of the scores, where 1,000 is green and zero (or negative) is red, and everything in between is on a gradient. Below is a screenshot of the portion of the sheet showing the Gen X cohorts (and the first Millennial cohort):


And next to it, I made another table. For each cohort I added a variable to its similarity scores so that the highest score always equals 1,000, and then I highlighted all the 1,000's. It's not as pretty as the above table, but it's more useful for showing which generation each cohort belongs in.
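The "add a variable so the highest score equals 1,000" step amounts to shifting each cohort's row by a constant. A minimal Python sketch, using the 1961 cohort's two scores from earlier in this post (the function name `normalize` is made up for the example):

```python
def normalize(cohort_scores):
    """Shift all of a cohort's scores so that the highest one equals 1,000."""
    shift = 1000 - max(cohort_scores.values())
    return {generation: score + shift for generation, score in cohort_scores.items()}


# 1961 cohort: Boom 929, Gen X 870
print(normalize({"Boom": 929, "Gen X": 870}))
```

The shift keeps the gaps between generations intact, so the 59-point edge the Boomers hold over Gen X for the 1961 cohort is still there after normalizing.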


Generation X's last-born cohort, 1981, also jumps ship, going over to the Millennials. And this is in fact where I end up for the birthyears of baseball's Generation X - 1962 to 1980. But I didn't start here. I started at the beginning.


The 1831-1834 (and 1837) Gilded cohorts are much more similar to the Transcendentals (and to the Homelanders, who, as I said, are statistically identical to the Transcendentals). So I'll move the 1831-1834 cohorts to the Transcendental Generation (the pre-MLB 1822-1830 cohorts go too). That pushes the Transcendental/Gilded boundary from 1821-1822 to 1834-1835:

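| GENERATION | BIRTH YEARS | PA | BF | BF/G | $BB | $SO | $HR | $H | $XBH | $3B | $SB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Transcendental | 1792-1834 | 7 | 0 | 0.0 | .000 | .429 | .000 | .000 | .000 | .000 | .000 |
| Gilded | 1835-1842 | 10,035 | 5,945 | 35.4 | .028 | .016 | .002 | .283 | .142 | .293 | .039 |
| Progressive | 1843-1859 | 653,547 | 512,027 | 35.9 | .051 | .067 | .005 | .272 | .210 | .277 | .093 |
| Missionary | 1860-1882 | 1,822,804 | 1,926,470 | 31.5 | .083 | .091 | .006 | .276 | .201 | .294 | .143 |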

You can tell from the above table of highlighted 1,000's that most of the 1835 to 1851 cohorts were already more similar to the Gilded Generation. But before I even look at the similarity scores again, I'm enforcing a minimum length for these generations. The shortest Strauss & Howe generation is 17 years; I'll relax that by one year and say the baseball generations must be at least 16 years.

The Gilded Generation is currently reduced to eight years (1835-1842). So it gets pushed out eight more years to 1850, which means the Progressive Generation gets extended, too (which shortens the Missionary Generation to 16 years exactly):

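| GENERATION | BIRTH YEARS | PA | BF | BF/G | $BB | $SO | $HR | $H | $XBH | $3B | $SB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Transcendental | 1792-1834 | 7 | 0 | 0.0 | .000 | .429 | .000 | .000 | .000 | .000 | .000 |
| Gilded | 1835-1850 | 127,179 | 81,279 | 37.4 | .027 | .031 | .003 | .285 | .179 | .256 | .058 |
| Progressive | 1851-1866 | 1,119,915 | 916,498 | 35.4 | .070 | .078 | .007 | .271 | .210 | .287 | .137 |
| Missionary | 1867-1882 | 1,239,292 | 1,446,665 | 30.4 | .083 | .094 | .006 | .277 | .200 | .294 | .131 |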
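The minimum-length rule cascades: stretching one generation to 16 years bumps the start of the next, which may then need stretching too. Here is a rough Python sketch of just that cascade (not the similarity-score comparisons that decide where boundaries move in the first place). The function name and the tuple layout are assumptions for illustration.

```python
MIN_YEARS = 16


def enforce_min_length(generations, min_years=MIN_YEARS):
    """Push end years later until every generation spans at least min_years.

    `generations` is a list of (name, first_birthyear, last_birthyear),
    oldest first. Each generation starts the year after the previous one ends.
    """
    adjusted = []
    start = generations[0][1]
    for name, _, end in generations:
        end = max(end, start + min_years - 1)
        adjusted.append((name, start, end))
        start = end + 1
    return adjusted


before = [("Gilded", 1835, 1842), ("Progressive", 1843, 1859), ("Missionary", 1860, 1882)]
for name, first, last in enforce_min_length(before):
    print(f"{name}: {first}-{last}")
```

Run on the eight-year Gilded Generation (1835-1842), this sketch gives the same birthyears as the table above: the Gildeds stretch to 1850, the Progressives to 1866, and the Missionaries end up at exactly 16 years.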

Since the generations' birthyears have changed, so too have their attribute rates. Which means the similarity scores will be different:


Since the Gilded birthyears are now 1835-1850, that generation "pulls in" not only most of those cohorts, but 1851-1853 and 1855 as well. So I can add the 1851-1853 cohorts to the Gildeds, which means I have to extend the Progressives, Missionaries, and Lost (to get them back to 16 years), and cut into the G.I. birthyears:

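| GENERATION | BIRTH YEARS | PA | BF | BF/G | $BB | $SO | $HR | $H | $XBH | $3B | $SB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gilded | 1835-1853 | 205,755 | 138,212 | 37.0 | .031 | .039 | .003 | .279 | .181 | .251 | .053 |
| Progressive | 1854-1869 | 1,275,211 | 1,226,755 | 34.8 | .076 | .080 | .007 | .274 | .211 | .294 | .144 |
| Missionary | 1870-1885 | 1,290,112 | 1,313,810 | 28.2 | .082 | .101 | .005 | .274 | .199 | .289 | .127 |
| Lost | 1886-1901 | 1,663,176 | 1,787,799 | 22.0 | .084 | .092 | .010 | .283 | .219 | .240 | .078 |
| G.I. | 1902-1924 | 2,185,661 | 2,144,343 | 19.0 | .094 | .101 | .019 | .280 | .224 | .184 | .036 |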

Which changes the similarity scores again:


The 1854 cohort is still most similar to the Progressives, but now it's more similar to the Gildeds than the 1855 cohort is to the Progressives. So I'll move both cohorts to the Gildeds, which pushes each of the next three generations out another two years:

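| GENERATION | BIRTH YEARS | PA | BF | BF/G | $BB | $SO | $HR | $H | $XBH | $3B | $SB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gilded | 1835-1855 | 303,360 | 210,479 | 36.4 | .036 | .049 | .004 | .274 | .189 | .254 | .055 |
| Progressive | 1856-1871 | 1,314,692 | 1,253,848 | 34.5 | .079 | .080 | .007 | .276 | .211 | .296 | .149 |
| Missionary | 1872-1887 | 1,390,129 | 1,429,228 | 27.3 | .082 | .103 | .006 | .274 | .200 | .287 | .126 |
| Lost | 1888-1903 | 1,615,472 | 1,732,268 | 21.6 | .084 | .090 | .011 | .285 | .224 | .231 | .066 |
| G.I. | 1904-1924 | 1,996,262 | 1,985,096 | 18.9 | .094 | .103 | .019 | .278 | .222 | .181 | .035 |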

Which pulls not only the 1854 cohort but also the 1856 cohort into the Gilded Generation:


I think you get the idea by now. After I add the 1856 cohort to the Gilded group, the 1857 and later cohorts remain more similar to the Progressives, which means the Gilded group isn't pulling in any more cohorts. The birthyears have "locked", giving me an opportunity to paraphrase page 82 of Generations:

All things have a beginning, and so must the story of baseball generations.

I start with the cohort-group of 1835 through 1856. I call it the "National Generation." 469 members of this group appeared in at least one major league game. (For the purposes of this study, I consider the National Association, 1871-1875, a "major" league; it certainly was to Harry Wright and his peers). It includes every member of the 1869 Cincinnati Red Stockings (the first openly professional team), nearly every player in the National Association (the first professional league), and most of the players in William Hulbert's National League (1876-1881).

A couple of earlier-born players appeared in the National Association, but they were both forty-somethings who played just one game each. Besides them, the firstborn MLB players were Harry Wright (born 1835) and Dickey Pearce (born 1836). Both played seven years between the NA and NL, and both were key pioneers of the professional game.

After applying the method described in this post through to the end, I arrive at these birthyears for the generations of baseball:

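| GENERATION | BIRTH YEARS | PA | BF | BF/G | $BB | $SO | $HR | $H | $XBH | $3B | $SB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Transcendental | 1792-1834 | 7 | 0 | 0.0 | .000 | .429 | .000 | .000 | .000 | .000 | .000 |
| Gilded | 1835-1856 | 379,169 | 319,457 | 36.3 | .038 | .055 | .004 | .272 | .195 | .262 | .063 |
| Progressive | 1857-1873 | 1,397,585 | 1,336,700 | 34.1 | .082 | .080 | .007 | .278 | .209 | .297 | .149 |
| Missionary | 1874-1892 | 1,801,743 | 1,875,106 | 25.4 | .082 | .104 | .006 | .274 | .204 | .275 | .115 |
| Lost | 1893-1911 | 1,829,043 | 1,928,192 | 20.6 | .086 | .089 | .014 | .287 | .229 | .204 | .046 |
| G.I. | 1912-1928 | 1,618,936 | 1,487,750 | 17.4 | .098 | .113 | .022 | .273 | .215 | .176 | .032 |
| Silent | 1929-1944 | 1,893,201 | 1,941,919 | 15.2 | .090 | .158 | .028 | .272 | .202 | .154 | .046 |
| Boom | 1945-1961 | 2,719,938 | 2,680,778 | 14.1 | .091 | .152 | .027 | .280 | .214 | .133 | .073 |
| Generation X | 1962-1980 | 3,396,963 | 3,326,589 | 10.8 | .095 | .184 | .036 | .292 | .243 | .098 | .065 |
| Millennial | 1981-1996 | 1,292,978 | 1,397,119 | 10.1 | .090 | .217 | .036 | .298 | .249 | .105 | .070 |
| Homelanders | 1997-? | 0 | 0 | | | | | | | | |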

Here is the same table again, but with a few changes. I gave the earlier generations more appropriate names. Also, I dropped the first 20 cohorts of the generation formerly known as the Transcendentals, and started it at 1812 (birthyear of Duncan Curry, first Knickerbocker president). Lastly, since this study is about grown men playing organized baseball at the highest level (which is what sets the Knickerbocker Generation apart from earlier generations), I omitted the Homelanders (for now):

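| GENERATION | BIRTH YEARS | PA | BF | BF/G | $BB | $SO | $HR | $H | $XBH | $3B | $SB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Knickerbocker | 1812-1834 | 7 | 0 | 0.0 | .000 | .429 | .000 | .000 | .000 | .000 | .000 |
| National | 1835-1856 | 379,169 | 319,457 | 36.3 | .038 | .055 | .004 | .272 | .195 | .262 | .063 |
| American | 1857-1873 | 1,397,585 | 1,336,700 | 34.1 | .082 | .080 | .007 | .278 | .209 | .297 | .149 |
| Dead-Ball | 1874-1892 | 1,801,743 | 1,875,106 | 25.4 | .082 | .104 | .006 | .274 | .204 | .275 | .115 |
| Live-Ball | 1893-1911 | 1,829,043 | 1,928,192 | 20.6 | .086 | .089 | .014 | .287 | .229 | .204 | .046 |
| G.I. | 1912-1928 | 1,618,936 | 1,487,750 | 17.4 | .098 | .113 | .022 | .273 | .215 | .176 | .032 |
| Silent | 1929-1944 | 1,893,201 | 1,941,919 | 15.2 | .090 | .158 | .028 | .272 | .202 | .154 | .046 |
| Boom | 1945-1961 | 2,719,938 | 2,680,778 | 14.1 | .091 | .152 | .027 | .280 | .214 | .133 | .073 |
| Generation X | 1962-1980 | 3,396,963 | 3,326,589 | 10.8 | .095 | .184 | .036 | .292 | .243 | .098 | .065 |
| Millennial | 1981-1996 | 1,292,978 | 1,397,119 | 10.1 | .090 | .217 | .036 | .298 | .249 | .105 | .070 |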

Sunday, October 23, 2016

Cubs vs Indians

Of the four teams playing in the two league championship series, the Toronto Blue Jays have the most recent world championship - 1993, 23 years ago. They were dispatched by the Indians in five games.

The Los Angeles Dodgers have the next-shortest drought, 28 years. They fell to the Cubs last night in the sixth and final game of the NLCS.

If this trend continues, the Chicago Cubs (108 years) will defeat the Cleveland Indians (68 years) to win the World Series.

Wednesday, October 19, 2016

The Generations of Baseball, part two

Continued from part one.

Bill James' similarity scores use traditional baseball statistics, which include both counting stats (hits, homeruns, strikeouts) and rate stats (batting average, on-base percentage, earned run average).

As Kerry Whisnant explained, using counting stats in similarity scores will mean that "players can be similar only if they are similar hitters and have similar numbers of plate appearances" (emphasis mine). That's not necessarily a bad thing, for Bill's purposes. The length of a player's career could be considered part of his "statistical image", so it might be appropriate to use counting stats for player comparisons. Likewise, the total number of plate appearances in a season, which is normally relatively stable from year to year unless there's an expansion or a work stoppage, could be considered part of a season's "statistical image".

But cohorts vary so widely in size from birthyear to birthyear that raw counting stats lose all meaning. Strauss & Howe were much more interested in a cohort's "peer personality" than in its total numbers. To find the peer personality of baseball cohorts, we need rate-based similarity scores.

But the traditional rate stats are redundant, as Jim Albert explained in an article in By the Numbers:
"...Traditional hitting statistics confound...talents. A batting average confounds three batter talents: the talent not to strikeout, the talent to hit a home run, and the talent to hit an in-play ball for a hit. An on-base percentage confounds the batter’s talent to draw a walk with his talent to get an in-play hit and his talent to hit a home run" (page 24).
In 2001, Voros McCracken rocked the sabermetric community by figuring out how to sort out responsibility between pitchers and fielders. Using a "divide and conquer" approach, he isolated, one at a time, defense-independent statistics (walks, strikeouts, and homeruns) from the defense-dependent ones (everything else)... and realized that pitchers have little-to-no control over the "everything else": balls batted into the field of play.

Jim Albert recreated Voros's approach (on pages 23 and 24), but applied it to batting statistics, breaking them down into four basic skills:
"A player comes to bat for a plate appearance. Either the player walks or doesn’t walk – his chance of walking is estimated by the walk rate (BB+HBP)/PA. (Note that we combine walks and hit by pitches in the formula since each event has the same result of getting the batter to first base without creating an at-bat.)
"Removing walks from the plate appearances, we next record if the batter strikes out or not. We define the strikeout rate as the fraction of strikeouts to the number of at-bats or SO/AB.
"With walks and strikeouts removed, we next record if the batter hits a home run or not. The home run rate is defined to be the fraction of home runs for all plate appearances where contact is made by the bat. That is, HR rate = HR/(AB – SO).
"With the walks, strikeouts, and homeruns removed, we only have plate appearances where the ball is hit in the park. Of these balls put in-play, we record the fraction that fall in for hits – we call this the in-play hit rate or 'hit rate' (H-HR)/(AB-SO-HR)....
"Since these rates are defined by sequentially removing walks, strikeouts, and home runs from the plate appearances, they measure distinct qualities of a hitter. Specifically, these rates measure (1) the talent to draw a walk, (2) the talent to avoid a strikeout, (3) the talent to hit a ball out of the park (a home run), and (4) the talent to hit a ball 'where they ain’t'."
And unlike traditional statistics, which have different rate categories for hitting (average, OBP) and pitching (ERA, WHIP), these four rates can be used for hitters AND pitchers. Pitchers are trying to accomplish the exact opposite of what hitters are trying to accomplish. Hitters are trying to draw walks, make contact, hit homeruns, and convert batted balls into hits. Pitchers are trying to throw strikes, miss bats, prevent homeruns, and (with the help of their defense) convert batted balls into outs. So it makes sense then to use the same statistics to measure hitters and pitchers.

Dave Studeman also wrote about the four rates:
"...The beauty of these ratios is that they build on each other. The impact of each event is removed from the denominator of the next event. A batter can’t hit a home run unless he actually hits a ball, so walks and strikeouts aren’t considered in the home run rate.
"For pitchers, you often see strikeout and walk rates, but as a proportion of total plate appearances. Here, they are treated sequentially. The strikeout ratio implies that a pitcher can’t strike out a batter if he walks him first, so the formula takes walks out of consideration when calculating strikeout rates."
Tom Tango expanded on Voros's "four horsemen", breaking the in-play hit component down into two additional components: extra-base hits (doubles and triples) per in-play hit, and triples per extra-base hit. He also added stolen bases as a rate of opportunities (singles and walks). "Each rate describes something specific," he wrote, and he used the seven rates to study the aging patterns of different skills in batters.

So, the idea then is that there are four basic outcomes of every batter/pitcher match-up: a walk (or HBP), a strikeout, a homerun, or a ball in play. A ball in play can result in a hit, a hit can be an extra-base hit, and an extra-base hit can be a triple. Also, a batter who has reached base (usually via single or walk) can steal a base.

But there's another, even-more-obvious framework for baseball events: GAMES. Every batter-pitcher match-up happens within the context of a game. In 1876, batters and pitchers had the same roles, and thus the same usage: starters played almost every game, and almost every inning, and reserves were basically emergency back-ups. Aside from the practice of platooning, a batter's role hasn't changed much in the last 140 years, whereas the pitcher's role has evolved through the generations and fragmented into a variety of roles - from a complete game starter who pitches once every five days to a one-out reliever who can pitch several days in a row, and everything in between.

Since a pitcher's role is part of his personality/statistical image, I am adding one more rate to the seven Voros/Tango components: batters faced per game. And since a batter's role is NOT really a defining part of his personality (and since it would render the rate meaningless if I confounded hitters' plate appearances with pitchers' batters faced), it is a pitchers-only rate.

So here are the eight rates - the eight attributes that define the "personality" of a baseball player or cohort or cohort-group - and the formulas I use to calculate them, grouped by what statistics they draw from (batting, pitching, or both):

Pitchers Only:
BF / G

Batters and Pitchers (the "Three True Outcomes"):
$BB = (BB + HBP) / PA
$SO = SO / (PA - BB - HBP)
$HR = HR / (PA - BB - SO - HBP)

Batters Only:
$H = (H - HR) / (PA - HR - BB - SO - HBP)
$XBH = (2B + 3B) / (H - HR)
$3B = 3B / (2B + 3B)
$SB = SB / (H - 2B - 3B - HR + BB + HBP)

(Unlike Albert, I use (PA - BB - HBP) instead of at-bats, because at-bats weren't recorded for pitchers until 1930, and because it keeps sac flies and sac hits included in the denominator.)
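
For anyone who wants to play along, here is a minimal Python sketch of the formulas above. It is an illustration, not the code behind these posts: the function names and the stat line are hypothetical, and it makes no attempt to handle zero denominators (a batter with no extra-base hits, for example).

```python
def batter_rates(pa, bb, hbp, so, h, doubles, triples, hr, sb):
    """Sequential rates for one batting line (all arguments are counting stats)."""
    walks = bb + hbp                 # free passes: walks plus hit-by-pitches
    no_walk = pa - walks             # PA left once walks and HBP are removed
    contact = no_walk - so           # PA left once strikeouts are removed
    in_play = contact - hr           # balls hit into the field of play
    hits_in_play = h - hr            # hits that stayed in the park
    xbh = doubles + triples          # extra-base hits that stayed in the park
    singles = hits_in_play - xbh
    return {
        "$BB": walks / pa,
        "$SO": so / no_walk,
        "$HR": hr / contact,
        "$H": hits_in_play / in_play,
        "$XBH": xbh / hits_in_play,
        "$3B": triples / xbh,
        "$SB": sb / (singles + walks),
    }


def batters_faced_per_game(bf, g):
    """The pitchers-only rate: BF / G."""
    return bf / g


# Hypothetical season line, invented for illustration only.
rates = batter_rates(pa=700, bb=90, hbp=10, so=120, h=180,
                     doubles=35, triples=5, hr=30, sb=20)
for name, value in rates.items():
    print(f"{name:5s} {value:.3f}")
```

Each denominator is the previous denominator minus the previous numerator, which is what keeps the rates from overlapping.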

Since the "three true outcomes" ($BB, $SO, and $HR) combine batting AND pitching statistics, the formula for $BB (as an example) is actually

$BB = (BBb + HBPb + BBp + HBPp) / (PA + BF)

where BBb is batters' bases on balls, BBp is pitchers' bases on balls, etc.

This means that when 1991-born Mike Trout faces 1991-born Trevor Bauer, it counts as TWO plate appearances for the 1991 cohort. And if Trout strikes out, it counts as two strikeouts, and if he homers, it counts as two homeruns. (Actually, a homerun would go down as an 0-for-2 in the $BB component, an 0-for-2 in the $SO component, and a 2-for-2 in the $HR component. But if Trout instead hits a single, it would be an 0-for-2 in $HR and a 1-for-1 in $H, since the $H rate doesn't draw from pitcher statistics.)
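
The double counting can be sketched in a few lines of Python. This is a hypothetical tally function, not the actual bookkeeping behind the study, and the outcome codes ("BB", "SO", "HR", "1B" and so on) are labels invented for the example.

```python
HITS_IN_PLAY = {"1B", "2B", "3B"}


def tally(outcome, batter_in_cohort=True, pitcher_in_cohort=True):
    """Return {rate: (successes, opportunities)} credited to one cohort
    for a single plate appearance.

    outcome is one of "BB" (walk or HBP), "SO", "HR", "1B", "2B", "3B", "OUT".
    """
    # $BB, $SO and $HR draw on batting AND pitching stats, so a match-up
    # between two cohort-mates counts twice.
    both = int(batter_in_cohort) + int(pitcher_in_cohort)
    # $H draws on batting stats only.
    batter_only = int(batter_in_cohort)

    result = {}
    for rate, event in (("$BB", "BB"), ("$SO", "SO"), ("$HR", "HR")):
        happened = outcome == event
        result[rate] = (both if happened else 0, both)
        if happened:
            return result  # later rates never see this plate appearance
    result["$H"] = (batter_only if outcome in HITS_IN_PLAY else 0, batter_only)
    return result


print(tally("HR"))   # Trout homers off Bauer
print(tally("1B"))   # Trout singles off Bauer
```

The home run comes back as 0-for-2 in $BB, 0-for-2 in $SO and 2-for-2 in $HR; the single adds an 0-for-2 in $HR and a 1-for-1 in $H.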

Here is a "binary tree" diagram of the eight attribute rates. The "headers" are the denominators, or opportunities:

per G   per PA  per H  per 1B or BB

BF/G---->$BB-------------->$SB
          |                 ^
          |                /
          v               /
         $SO             /
          |             /
          |            /
          v           /
         $HR         /
          |         /
          |        /   per XBH
          v       /
         $H---->$XBH---->$3B

Concluded in part three.

Sunday, October 16, 2016

The Generations of Baseball, part one

"A generation, like an individual, merges many different qualities, no one of which is definitive standing alone. But once all the evidence is assembled, we can build a persuasive case for identifying (by birthyear) eighteen generations over the course of American history. All Americans born over the past four centuries have belonged to one or another of these generations."
That is the claim made by authors William Strauss and Neil Howe on page 68 of Generations. The book, published in 1991, "retells the history of America as a series of generational biographies going back to 1584."

Below is a table of the last ten of those generations - the generations that played BASEBALL - along with their total MLB-playing population (through 2016) and best (or best-known) player:

GENERATION      BIRTH YEARS  MLB POP.  FAMOUS MEMBER
Transcendental  1792-1821    0         Alexander Cartwright
Gilded          1822-1842    21        Harry Wright
Progressive     1843-1859    688       Cap Anson
Missionary      1860-1882    2,129     Cy Young
Lost            1883-1900    2,573     Babe Ruth
G.I.            1901-1924    2,693     Ted Williams
Silent          1925-1942    1,818     Willie Mays
Boom            1943-1960    2,602     Rickey Henderson
Generation X    1961-1981    3,926     Barry Bonds
Millennial      1982-2004    2,332     Mike Trout
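
Because the birth-year ranges in the table are contiguous, placing a player in his social generation is a simple lookup. Here is a small Python sketch of one way to do it; the names `GENERATIONS` and `generation_of` are made up for the example.

```python
from bisect import bisect_right

# (first birth year, last birth year, generation), copied from the table above.
GENERATIONS = [
    (1792, 1821, "Transcendental"),
    (1822, 1842, "Gilded"),
    (1843, 1859, "Progressive"),
    (1860, 1882, "Missionary"),
    (1883, 1900, "Lost"),
    (1901, 1924, "G.I."),
    (1925, 1942, "Silent"),
    (1943, 1960, "Boom"),
    (1961, 1981, "Generation X"),
    (1982, 2004, "Millennial"),
]
_STARTS = [start for start, _, _ in GENERATIONS]


def generation_of(birth_year):
    """Return the generation name for a birth year, or None if out of range."""
    i = bisect_right(_STARTS, birth_year) - 1   # last range starting on or before the year
    if i < 0:
        return None
    _, end, name = GENERATIONS[i]
    return name if birth_year <= end else None


print(generation_of(1991))  # a player born in 1991, like Mike Trout
```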

But these are the generations of AMERICAN history - social generations. Their members encountered "the same national events, moods, and trends at similar ages" (page 48) and felt "the ebb and flow of history from basically the same age or phase-of-life perspective" (page 64).

Baseball has its own national events, moods, and trends; its own ebbs and flows. Does it have its own generations, then, too? And can we find them by applying Strauss & Howe's methodology to sabermetrics?
"...We have to start with the building-block of generations: the ‘cohort.’ Derived from the Latin word for an ordered rank of soldiers, ‘cohort’ is used by modern social scientists to refer to any set of persons born in the same year; ‘cohort-group’ means any wider set of persons born in a limited set of consecutive years" (page 44).
"A GENERATION is a cohort-group whose length approximates the span of a phase of life and whose boundaries are fixed by peer personality" (page 60). There are four "phases of life" (on page 56), each 22 years long: youth (ages 0 to 21), rising adulthood (22 to 43), midlife (44 to 65), and elderhood (66 to 87).
"But how do we actually identify [a generation]? For that, we have to focus our attention on ‘peer personality’ - the element in our definition that distinguishes a generation as a cohesive cohort-group with its own unique biography. The peer personality of a generation is essentially a caricature of its prototypical member. It is, in its sum of attributes, a distinctly personlike creation" (page 63).
"A PEER PERSONALITY is a generational persona recognized and determined by (1) common age location; (2) common beliefs and behavior; and (3) perceived membership in a common generation" (page 64). But while Strauss & Howe (on page 67) "can apply no reductive rules for comparing the beliefs and behavior of one cohort-group with those of its neighbors," we can apply similarity scores.

Bill James devised similarity scores in the 1980s as a means to identify groups of comparable players based on their career statistics. "To compare one player to another, start at 1000 points and then subtract points based on the statistical differences of each player."
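
The "start at 1000 and subtract" idea fits in a few lines of Python. Treat this as a sketch under loud assumptions: the points-per-difference values below are placeholders chosen for the example, not Bill's published weights, and only three categories are compared.

```python
def similarity_score(a, b, units, start=1000.0):
    """Start at `start` and subtract one point for every `units[stat]`
    of difference between the two stat lines."""
    score = start
    for stat, unit in units.items():
        score -= abs(a[stat] - b[stat]) / unit
    return score


# ASSUMPTION: placeholder weights for illustration only.
UNITS = {"HR": 2, "RBI": 10, "SB": 20}

# Two real seasons (both appear in the age-20 table in the Mike Trout post).
trout_2012 = {"HR": 30, "RBI": 83, "SB": 49}
arod_1996 = {"HR": 36, "RBI": 123, "SB": 15}

print(round(similarity_score(trout_2012, arod_1996, UNITS), 1))
```

With these placeholder weights the two seasons lose 3 points for home runs, 4 for RBI and 1.7 for stolen bases, leaving a score of about 991.3. The same function works on rates instead of counting stats if the units are scaled to match.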

In July 2015, Bill used similarity scores to compare seasons (like MLB 2009 and MLB 2010) and organize them into eras and epochs (subscription required to read either article).
"My thinking was that if I did these studies in an organized, systematic way, 'fault lines' would appear. Natural groups of seasons would be apparent in the data, I thought; we could call the smaller groups, 5 to 10 years, 'epochs' and the larger groups 'eras'....And, in fact, this approach does, in most cases, make it very clear where the lines should be drawn."
For Bill, a smaller group of seasons is an epoch and a larger group is an era. For Strauss & Howe, a smaller group of cohorts is a cohort-group and a larger group is a generation.

Bill "measured the Similarity of Seasons...by the similarity of their statistical image" to identify "natural groups of seasons" and find the "fault lines" separating them.

Strauss & Howe "focus...on 'peer personality'" to identify "a cohesive cohort-group" "and find the boundaries separating it from its neighbors" (page 64).

For a baseball player, then, his "personality" IS his statistical image.

So, ignoring the third part of peer personality ("perceived membership"), Strauss & Howe on page 64 offer us two possible paths to identifying baseball generations:

We can "look first at [a cohort-group's] chronology: its common age location, where its lifecycle is positioned against the background chronology of historic trends and events," meaning we can identify eras first, then place cohorts in generations based on which era they primarily played in.

Or, we can "look at [a cohort-group's] attributes: objective measures of its common beliefs and behavior," meaning we can apply Bill James' method to cohorts directly; that is, use similarity scores to find "natural groups" of cohorts - generations. I'm going with this approach.

Continued in part two.

Saturday, October 15, 2016

538 Podcast on the History of the Shift

"Well, I'm Ted Williams, pilgrim, and I don't change my swing for nobody!"
Okay, that's my John-Wayne-esque paraphrasing of Ted Williams, but this is a fantastic podcast about Williams, Lou Boudreau, Joe Maddon, and the rise and fall and rise again of the defensive shift in baseball. Enjoy!

http://fivethirtyeight.com/features/ahead-of-their-time-why-baseball-revived-a-60-year-old-strategy-designed-to-stop-ted-williams/

Sunday, October 9, 2016

Mike Trout - The Greatest of Millennials

http://www.slate.com/articles/sports/sports_nut/2016/10/why_doesn_t_anyone_care_about_mike_trout.html
"Mickey Mantle is alive. He’s right out there in centerfield, every day, running down balls in the alleys, taking over games with his offense, defense, and speed. And this is prime Mantle, too—not later-years, shot-kneed, wee-hours-at-the-Copacabana Mantle. He debuted at 19, just turned 25, and he’s been the best player in baseball every year since he—"
Great article from Mike Schur on Slate on the underappreciated greatness of Mike Trout.

I liked the comparison between Trout and Mantle. Mantle played in New York, partied, and was seldom healthy and sober (although the author overstated his case a little - Mantle had nine seasons of at least 620 PA). Trout plays in southern California, plays every day and with a smile on his face, doesn't do or say anything newsworthy off the field, and his biggest passion besides baseball is...weather.

Schur makes this compelling argument:
"Here is what you should know: Mike Trout is the best player in baseball. There’s an argument to be made that he’s had the greatest season ever, for a player his age, in every year he has played."
I've already established (if you agree that WAR is at least in the ballpark in estimating player value) that Mike Trout is the greatest player ever through the age of 24. But that doesn't mean he had the best season ever at each individual age. Or does it?

Age 20:

Rk Player WAR/pos Year HR RBI SB BA OBP SLG
1 Mike Trout 10.8 2012 30 83 49 .326 .399 .564
2 Alex Rodriguez 9.4 1996 36 123 15 .358 .414 .631
3 Al Kaline 8.2 1955 27 102 6 .340 .421 .546
4 Mel Ott 7.4 1929 42 151 6 .328 .449 .635
5 Ty Cobb 6.8 1907 5 119 53 .350 .380 .468
6 Manny Machado 6.7 2013 14 71 6 .283 .314 .432
7 Ted Williams 6.7 1939 31 145 2 .327 .436 .609
8 Vada Pinson 6.5 1959 20 84 21 .316 .371 .509
9 Frank Robinson 6.5 1956 38 83 8 .290 .379 .558
10 Mickey Mantle 6.5 1952 23 87 4 .311 .394 .530
Provided by Baseball-Reference.com: View Play Index Tool Used
Generated 10/9/2016.

Mike Trout DID have the greatest season ever for a 20-year-old, besting A-Rod's '96 season by a fairly wide margin.

Age 21:

Rk Player WAR/pos Year HR RBI SB BA OBP SLG
1 Rogers Hornsby 9.9 1917 8 66 17 .327 .385 .484
2 Mike Trout 9.3 2013 27 97 33 .323 .432 .557
3 Rickey Henderson 8.8 1980 9 53 100 .303 .420 .399
4 Eddie Mathews 8.3 1953 47 135 1 .302 .406 .627
5 Cesar Cedeno 8.0 1972 22 82 55 .320 .385 .537
6 Jimmie Foxx 7.9 1929 33 118 9 .354 .463 .625
7 Andruw Jones 7.4 1998 31 90 27 .271 .321 .515
8 Ken Griffey 7.1 1991 22 100 18 .327 .399 .527
9 Arky Vaughan 7.0 1933 9 97 3 .314 .388 .478
10 Frank Robinson 6.9 1957 29 75 10 .322 .376 .529

Trout is 2nd to Hornsby. But already, Trout and Frank Robinson are the only players in the top 10 of both lists - greatest 20-year-olds AND greatest 21-year-olds. Trout is 1st and 2nd; Robinson is 9th and 10th.

Age 22:

Rk Player WAR/pos Year HR RBI SB BA OBP SLG
1 Ted Williams 10.6 1941 37 120 2 .406 .553 .735
2 Bryce Harper 9.9 2015 42 99 6 .330 .460 .649
3 Ty Cobb 9.8 1909 9 107 76 .377 .431 .517
4 Eddie Collins 9.7 1909 3 56 63 .347 .416 .450
5 Stan Musial 9.4 1943 13 81 9 .357 .425 .562
6 Dick Allen 8.8 1964 29 91 3 .318 .382 .557
7 Alex Rodriguez 8.5 1998 42 124 46 .310 .360 .560
8 Cal Ripken 8.2 1983 27 102 0 .318 .371 .517
9 Joe DiMaggio 8.2 1937 46 167 3 .346 .412 .673
10 Mike Trout 7.9 2014 36 111 16 .287 .377 .561

This was probably Trout's worst season, and also the only year so far he's taken home MVP honors. Harper's 2015 season is ahead of a Ty Cobb Triple Crown and A-Rod's 40-40, and second only to the year Teddy Ballgame hit .406.

Age 23:

Rk Player WAR/pos Year HR RBI SB BA OBP SLG
1 Willie Mays 10.6 1954 41 110 8 .345 .411 .667
2 Ted Williams 10.6 1942 36 137 3 .356 .499 .648
3 Ty Cobb 10.5 1910 8 91 65 .383 .456 .551
4 Eddie Collins 10.5 1910 3 81 81 .324 .382 .418
5 Cal Ripken 10.0 1984 27 86 2 .304 .374 .510
6 Mookie Betts 9.6 2016 31 113 26 .318 .363 .534
7 Mickey Mantle 9.5 1955 37 99 8 .306 .431 .611
8 Mike Trout 9.4 2015 41 90 11 .299 .402 .590
9 Reggie Jackson 9.2 1969 47 118 13 .275 .410 .608
10 Arky Vaughan 9.2 1935 19 99 4 .385 .491 .607

23-year-old Mookie Betts (the 2016 AL MVP-favorite) is slightly ahead of 23-year-old Mike Trout (but NOT ahead of 2016 Mike Trout).

Age 24:

Rk Player WAR/pos Year HR RBI SB BA OBP SLG
1 Lou Gehrig 11.8 1927 47 173 10 .373 .474 .765
2 Mickey Mantle 11.2 1956 52 130 10 .353 .464 .705
3 Ty Cobb 10.7 1911 8 127 83 .420 .467 .621
4 Mike Trout 10.6 2016 29 100 30 .315 .441 .550
5 Jimmie Foxx 10.5 1932 58 169 3 .364 .469 .749
6 Alex Rodriguez 10.4 2000 41 132 15 .316 .420 .606
7 Tris Speaker 10.1 1912 10 90 52 .383 .464 .567
8 Mike Schmidt 9.7 1974 36 116 23 .282 .395 .546
9 Rogers Hornsby 9.6 1920 9 94 12 .370 .431 .559
10 Shoeless Joe Jackson 9.6 1912 3 90 35 .395 .458 .579

Trout is 4th, behind the career years of Lou Gehrig, Mickey Mantle, and Ty Cobb.

So there you have it. Trout didn't literally have "the greatest season ever, for a player his age, in every year he has played." But he was the best at age 20, the second-best at age 21 (to a player in the Dead-ball Era), in the top five at age 24, and in the top 10 at ages 22 and 23.

Ty Cobb, whom Trout surpassed as the "greatest young player of all time," finished in the top five in four of the years, but missed the top 10 entirely at age 21 and doesn't rank higher than third at any age.

Mantle was only in the top 10 at three of the ages - 20, 23 and 24. Mantle didn't really become "great" until he was 23, and then he was the best player in the American League for about ten years.
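
The tallies in the last three paragraphs can be double-checked with a short Python snippet. The ranks are copied by hand from the five tables above; the names `RANKS` and `summary` are just for this example.

```python
# Rank in each age table above (a missing age means outside the top 10).
RANKS = {
    "Mike Trout":    {20: 1, 21: 2, 22: 10, 23: 8, 24: 4},
    "Ty Cobb":       {20: 5, 22: 3, 23: 3, 24: 3},
    "Mickey Mantle": {20: 10, 23: 7, 24: 2},
}


def summary(player):
    """Ages at which the player made the top 10 and top 5, plus his best rank."""
    ranks = RANKS[player]
    return {
        "top10_ages": sorted(ranks),
        "top5_ages": sorted(age for age, rank in ranks.items() if rank <= 5),
        "best_rank": min(ranks.values()),
    }


for player in RANKS:
    print(player, summary(player))
```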

Mike Trout has been the best player in baseball since he was a 20-year-old rookie.