Friday, August 18, 2017

Generations of Baseball Players

"A generation, like an individual, merges many different qualities, no one of which is definitive standing alone. But once all the evidence is assembled, we can build a persuasive case for identifying (by birthyear) eighteen generations over the course of American history. All Americans born over the past four centuries have belonged to one or another of these generations" (Generations page 68).
William Strauss and Neil Howe wrote the book on generations (literally). The aim of this post - and, indeed, this entire blog - is to apply their theory to baseball. You can call it my manifesto.

Therefore, by the end of this post, I hope to have built a persuasive case for identifying (by birthyear) NINE generations over the course of baseball history. I will claim that all ballplayers born over the past two centuries belong to one or another of these generations.

Why do we need definitive baseball generations? For one, to provide context for "best of their generation" conversations and arguments (best player, best pitcher, best 3rd baseman, best leadoff batter, etc.). Beyond that, having definitive generations restores meaning to baseball's hallowed leaderboards. For example, Roger Connor hit 138 home runs in his career - a modest total by today's standards, but the most ever hit by a player born before 1887. Similarly, Miguel Cabrera's career batting average of .318 ranks 55th all-time (as of this writing), but it ranks first among players born after 1960.

So if we can agree that sorting baseball players by generation is useful, how exactly should we go about doing it? I'll start with Strauss & Howe's definition:
"A GENERATION is a cohort-group whose length approximates the span of a phase of life and whose boundaries are fixed by peer personality" (page 60).
Earlier in the book (page 44), Strauss & Howe defined a "cohort" as "any set of persons born in the same year" and a "cohort-group" as "any wider set of persons born in a limited set of consecutive years."

The authors laid out (on page 56) four "phases of life," each 22 years long: youth (age 0 to 21), rising adulthood (age 22 to 43), midlife (age 44 to 65), and elderhood (age 66 to 87). Obviously, most major league careers fall almost entirely within the second phase, rising adulthood. And since the longest careers tend to last about 22 years, we can say then that the length of a baseball generation approximates the span of a very long major league career.

And that brings us to "peer personality" - "the element in our definition that distinguishes a generation as a cohesive cohort-group" (page 63). Strauss & Howe measured the similarity of cohorts by the similarity of their peer personality. In a pair of articles published to his website in July 2015 (and placed behind a subscription paywall), Bill James "measured the Similarity of Seasons...by the similarity of their statistical image."

Strauss & Howe "use peer personality to identify a generation and find the boundaries separating it from its neighbors" (page 64). Bill James used similarity scores to identify "natural groups of seasons" and find the "fault lines" separating them.

And while Strauss & Howe could "apply no reductive rules for comparing the beliefs and behavior of one cohort-group with those of its neighbors" (page 67), we have baseball's rich statistical record at our disposal for comparing the "behavior" of baseball cohort-groups.

So here then is my modified definition:
A BASEBALL GENERATION is a cohort-group whose length approximates the span of a very long major league career and whose boundaries are fixed by statistical image.
Bill James' similarity scores for seasons used 30 statistical categories, including both counting stats (hits, home runs, strikeouts) and rate stats (batting average, on-base percentage, earned run average). But, as Kerry Whisnant explained, using counting stats to compare cohorts, like using them to compare players or seasons, will mean that only cohorts with "similar numbers of plate appearances" will be similar.

And the traditional rate stats "confound" talents, as Jim Albert explained on page 24 of an article in By the Numbers. "A batting average confounds three batter talents: the talent not to strikeout, the talent to hit a home run, and the talent to hit an in-play ball for a hit."

Peer personality has nothing to do with the raw numbers of a generation, but rather the collective behavior of its members. Strauss & Howe elaborated (on page 63 of Generations):
"The peer personality of a generation is essentially a caricature of its prototypical member. It is, in its sum of attributes, a distinctly personlike creation. A generation...can be safe or reckless, calm or aggressive, self-absorbed or outer-driven, generous or selfish, spiritual or secular, interested in culture or interested in politics."
Likewise, the "statistical image" of a baseball generation (like the statistical image of a team or a league) is essentially a caricature of its average player. It is, in its sum of attributes, like an individual player. A baseball generation can be patient or free-swinging, adept at making contact or prone to striking out, powerful or light-hitting, good at hitting the ball "where they ain't" or bad at avoiding defenders, aggressive on the base-paths or station-to-station.

So I'll use eight rate statistics - what I'll call the "attribute rates" - to measure the similarity of baseball cohorts. ("Each rate describes something specific," Tom Tango wrote, describing four of the rates.) These eight rates - representing eight different skills, or tools - taken together, reveal the "statistical image" or "peer personality" of a baseball generation; how its members collectively played the game.

The first rate, BF/G, uses pitching stats only. The next three, $BB, $SO, and $HR - the "three true outcomes" - draw from both batting and pitching stats (for the formulas listed below, I differentiate between batting and pitching with a small 'b' or 'p' in the variables). The last four use batting stats only.

BF/G - the number of batters a pitcher faces per game.
= BF / G

$BB - the percentage of plate appearances that end in a walk or a hit by pitch.
= (bBB + bHBP + pBB + pHBP) / (PA + BF)

$SO - the percentage of plate appearances ending in a strike (called, swung and missed, or batted) that are strikeouts.
= (bSO + pSO) / (PA + BF - bBB - bHBP - pBB - pHBP)

$HR - the percentage of batted balls that are home runs.
= (bHR + pHR) / (PA + BF - bBB - bSO - bHBP - pBB - pSO - pHBP)

$H - the percentage of balls batted into the field of play that are hits.
= (H - HR) / (PA - HR - BB - SO - HBP)

$XBH - the percentage of base hits that go for extra bases (doubles or triples).
= (2B + 3B) / (H - HR)

$3B - the percentage of extra-base hits that are triples.
= 3B / (2B + 3B)

$SB - the percentage of successful stolen bases per (approximate) times on first.
= SB / (H - 2B - 3B - HR + BB + HBP)
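For anyone who would rather script this than build it in a spreadsheet, here is a minimal Python sketch of the eight formulas above. The function name and the dictionary layout are my own assumptions, not anything from the sources quoted here; the sample totals are the 1980 cohort's career totals that appear in the table later in this post.

```python
def attribute_rates(bat, pit):
    """Return the eight attribute rates for one set of batting and pitching totals."""
    free_passes = bat["BB"] + bat["HBP"] + pit["BB"] + pit["HBP"]
    strikeouts = bat["SO"] + pit["SO"]
    home_runs = bat["HR"] + pit["HR"]
    trips = bat["PA"] + pit["BF"]          # plate appearances seen from both sides
    hits_in_play = bat["H"] - bat["HR"]    # hits that stayed in the park
    extra_bases = bat["2B"] + bat["3B"]    # doubles and triples
    return {
        "BF/G": pit["BF"] / pit["G"],
        "$BB": free_passes / trips,
        "$SO": strikeouts / (trips - free_passes),
        "$HR": home_runs / (trips - free_passes - strikeouts),
        "$H": hits_in_play / (bat["PA"] - bat["HR"] - bat["BB"] - bat["SO"] - bat["HBP"]),
        "$XBH": extra_bases / hits_in_play,
        "$3B": bat["3B"] / extra_bases,
        "$SB": bat["SB"] / (hits_in_play - extra_bases + bat["BB"] + bat["HBP"]),
    }

# Career totals for the 1980 cohort, copied from the totals table in this post.
bat_1980 = {"PA": 151352, "H": 34688, "2B": 7083, "3B": 671, "HR": 4575,
            "BB": 13621, "SO": 28195, "HBP": 1570, "SB": 2243}
pit_1980 = {"G": 13657, "BB": 11496, "SO": 24982, "HR": 3630,
            "BF": 137580, "HBP": 1321}

rates_1980 = attribute_rates(bat_1980, pit_1980)
for name, value in rates_1980.items():
    print(name, round(value, 3))
```

The printed values should land within rounding of the 1980 row in the attribute-rate table further down.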

So, now all we need are the career totals for every batter and every pitcher in MLB history. Then, on a separate sheet in Excel, it's just a matter of using SUMIF formulas to add up the necessary batting and pitching totals for each cohort. From the firstborn player (Nate Berkenstock, 1831) to the lastborn (Julio Urias, 1996), there are 166 MLB cohort birthyears (through 2016; I'm typing this just days after Ozzie Albies became the first 1997-born major-leaguer). The table below shows the career totals for 1980s cohorts:

Batting Totals
Born       PA       H      2B    3B     HR      BB      SO    HBP     SB
1980  151,352  34,688   7,083   671  4,575  13,621  28,195  1,570  2,243
1981  138,495  32,250   6,336   834  3,409  10,444  25,592  1,068  2,766
1982  167,402  39,289   7,957   813  4,302  13,465  30,397  1,597  2,514
1983  210,540  49,437  10,073   981  5,721  18,420  38,815  1,721  3,422
1984  144,812  33,278   6,555   716  3,553  11,529  28,456  1,187  2,414
1985  109,480  25,513   5,128   610  2,836   8,077  22,168  1,007  1,882
1986  124,542  27,408   5,397   595  3,345  10,219  27,121  1,122  1,777
1987  121,873  27,901   5,613   699  3,340   9,785  25,865    976  1,872
1988   62,593  14,235   2,607   404  1,251   4,442  12,814    510  1,392
1989   69,496  15,655   3,024   348  1,752   5,221  14,455    623    859

Pitching Totals
Born       G      BB      SO     HR       BF    HBP
1980  13,657  11,496  24,982  3,630  137,580  1,321
1981  14,427  14,339  29,760  4,482  162,821  1,476
1982  16,642  13,586  28,085  4,128  153,448  1,434
1983  19,557  16,301  36,786  5,174  197,787  1,708
1984  16,438  14,779  34,708  4,425  175,098  1,496
1985  16,635  11,104  26,232  3,256  128,041  1,199
1986  12,899  11,477  29,364  3,770  147,602  1,205
1987  12,667   9,868  23,021  3,000  115,816  1,004
1988  10,079   8,432  22,006  2,518  102,863    801
1989   7,291   6,304  16,419  2,066   79,846    717

All the batters born in 1980 combined for 151,352 plate appearances, 34,688 hits, and 4,575 home runs. All the pitchers born that year combined for 11,496 bases on balls, 24,982 strikeouts, etc.
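The SUMIF step has a simple equivalent outside of Excel. The sketch below is only an illustration: the three career lines are invented, and the field names (`born`, `PA`, and so on) are assumptions rather than the layout of any real data source.

```python
from collections import defaultdict

# Hypothetical career lines; the players and numbers are invented for illustration.
careers = [
    {"player": "Batter A", "born": 1980, "PA": 600, "H": 150, "HR": 20},
    {"player": "Batter B", "born": 1980, "PA": 400, "H": 90, "HR": 5},
    {"player": "Batter C", "born": 1981, "PA": 500, "H": 120, "HR": 12},
]

def cohort_totals(rows, stats):
    """Add up each stat over every player who shares a birth year (one SUMIF per cell)."""
    totals = defaultdict(lambda: dict.fromkeys(stats, 0))
    for row in rows:
        for stat in stats:
            totals[row["born"]][stat] += row[stat]
    return dict(totals)

totals = cohort_totals(careers, ["PA", "H", "HR"])
print(totals[1980])
```

Pitching totals would be summed the same way from a separate list of pitcher career lines.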

Then I can calculate the attribute rates for each cohort:

Born  BF/G   $BB   $SO   $HR    $H  $XBH   $3B   $SB
1980  10.1  .097  .204  .039  .291  .257  .087  .060
1981  11.3  .091  .202  .036  .294  .249  .116  .083
1982   9.2  .094  .201  .036  .297  .251  .093  .061
1983  10.1  .093  .204  .037  .300  .253  .089  .065
1984  10.7  .091  .217  .035  .297  .245  .098  .069
1985   7.7  .090  .224  .036  .301  .253  .106  .072
1986  11.4  .088  .228  .037  .291  .249  .099  .060
1987   9.1  .091  .226  .038  .300  .257  .111  .065
1988  10.2  .086  .230  .032  .298  .232  .134  .093
1989  11.0  .086  .226  .036  .293  .243  .103  .052

Next I need the standard deviations of each rate. I have 166 cohort birthyears, but many of the very early and very recent cohorts were not (or aren't yet) well-represented in the major leagues. So I'll set minimum requirements of 10,000 total plate appearances and 10,000 total batters faced, and therefore only include the 143 cohorts from 1850 (Al Spalding) through 1992 (Bryce Harper) in the population for my standard deviations.
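A rough Python sketch of that filtering step follows. Two things here are assumptions on my part: the sample numbers are invented, and I use the population standard deviation (`statistics.pstdev`), since the text above doesn't say whether the population or sample formula was used.

```python
from statistics import pstdev

def qualifying_years(totals, min_pa=10_000, min_bf=10_000):
    """Keep birth years whose cohorts meet both playing-time minimums."""
    return sorted(year for year, t in totals.items()
                  if t["PA"] >= min_pa and t["BF"] >= min_bf)

def rate_std_dev(rates, years, name):
    """Standard deviation of one attribute rate across the qualifying cohorts."""
    return pstdev([rates[year][name] for year in years])

# Invented totals and rates, for illustration only.
sample_totals = {
    1990: {"PA": 50_000, "BF": 60_000},
    1991: {"PA": 40_000, "BF": 45_000},
    1992: {"PA": 30_000, "BF": 35_000},
    1996: {"PA": 900, "BF": 4_000},      # too little playing time, gets dropped
}
sample_rates = {1990: {"$BB": 0.09}, 1991: {"$BB": 0.10},
                1992: {"$BB": 0.11}, 1996: {"$BB": 0.20}}

years = qualifying_years(sample_totals)
bb_sd = rate_std_dev(sample_rates, years, "$BB")
print(years, round(bb_sd, 4))
```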

Also, I'll need to assign weights to each rate. I wanted the "three true outcomes" rates ($BB, $SO, and $HR) to weigh double the other rates, because they use both hitting and pitching statistics, and I wanted the $XBH and $3B rates to weigh half the other rates, because they both deal with breakdowns of base hits. Finally, I wanted the weights to add up to 1,000, so that if two groups are exactly four standard deviations apart in every category, their similarity score will be zero.

            BF/G    $BB    $SO    $HR     $H   $XBH    $3B    $SB
St. Dev.     8.3   .013   .047   .012   .011   .019   .071   .040
Weight       100    200    200    200    100     50     50    100
Multiplier   3.0  3,870  1,061  4,229  2,241    648    175    633

To find the similarity score between two groups, start at 1,000 and subtract a penalty for each attribute rate. The penalty is the difference between the two groups, times a multiplier. The multiplier is the rate's weight divided by (4 times its standard deviation).

For example, the 1980 cohort has a $BB rate of .097 and the 1981 cohort has a $BB rate of .091, a difference of .006. So the $BB penalty for 1980 and 1981 would be the difference (.006) times the multiplier (3,870), which is about 23. Add up the penalties for all eight rates and subtract from 1,000, and that is the similarity score.
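Here is a short Python sketch of that calculation, using the rounded 1980 and 1981 rates and the multipliers from the tables above. The function names are my own, and I'm assuming "difference" means the absolute difference. Because the inputs are rounded, the resulting score is only approximate; note too that recomputing a multiplier from the rounded standard deviation (200 / (4 x .013), or about 3,846) won't exactly match the 3,870 in the table, presumably because the table was built from unrounded values.

```python
RATES = ["BF/G", "$BB", "$SO", "$HR", "$H", "$XBH", "$3B", "$SB"]

# Weights and multipliers as listed in the table above.
WEIGHT = dict(zip(RATES, [100, 200, 200, 200, 100, 50, 50, 100]))
MULTIPLIER = dict(zip(RATES, [3.0, 3870, 1061, 4229, 2241, 648, 175, 633]))

def multiplier(weight, std_dev):
    """A rate's multiplier: its weight divided by four standard deviations."""
    return weight / (4 * std_dev)

def similarity(a, b, multipliers=MULTIPLIER):
    """Start at 1,000 and subtract one penalty per attribute rate."""
    penalties = {r: abs(a[r] - b[r]) * multipliers[r] for r in RATES}
    return 1000 - sum(penalties.values()), penalties

# Rounded attribute rates for the 1980 and 1981 cohorts, from the table above.
c1980 = dict(zip(RATES, [10.1, .097, .204, .039, .291, .257, .087, .060]))
c1981 = dict(zip(RATES, [11.3, .091, .202, .036, .294, .249, .116, .083]))

score, penalties = similarity(c1980, c1981)
print(round(penalties["$BB"]))  # the $BB penalty worked out in the text, about 23
print(round(score))
```

Passing a different `multipliers` dictionary makes it easy to check the four-standard-deviations property: two groups that differ by exactly four standard deviations in every rate are penalized by the full weight in each category, 1,000 points in all.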

To find "Epochs and Eras," Bill James asked of "every season in baseball history: Is it more like the season before it, or more like the season after it?" He then made two-year comparisons, three-year comparisons, four-year comparisons... comparing "each season to every other season in baseball history within 15 years before or after."

Instead of comparing each baseball cohort to other neighboring cohorts, I'm comparing them to the 15-year cohort-groups before and after. To find Baseball Generations, I'm asking of every baseball cohort: Is it more like the 15-year cohort-group before it, or more like the 15-year cohort-group after it?

Is the 1980 cohort more similar to the 1965-1979 cohort-group, or more similar to the 1981-1995 cohort-group?

1980 to 1965-1979 - 943
1980 to 1981-1995 - 916

The 1980 cohort is backward-looking, more similar to the cohort-group before it (943) than the cohort-group after it (916). What about 1981?

1981 to 1966-1980 - 938
1981 to 1982-1996 - 954

The 1981 cohort is forward-looking, more similar to the cohort-group after it (954) than the cohort-group before it (938).

To get previous and next cohort-groups for all 166 cohorts, I calculated attribute rates for every possible 15-year group, from the group before the first cohort (1816-1830) to the group after the last cohort (1997-2011), and all 180 groups in between. Attribute rates are calculated from batting and pitching totals. Cohort batting and pitching totals are found by adding up the career totals of the individual batters and pitchers belonging to each cohort; cohort-group batting and pitching totals are found by adding up the cohort totals of the 15 individual cohorts belonging to each cohort-group. (The 1816-1830 and 1997-2011 groups both have totals and rates of zero across the board, of course.)
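The windowing can be sketched in a few lines of Python. The helper names are assumptions of mine, and for the demonstration I shrink the window to three years so that it fits inside the ten 1980s cohorts whose plate-appearance totals are tabulated above; the real comparison uses the default 15.

```python
def neighbor_groups(year, span=15):
    """Birth-year ranges of the cohort-groups just before and just after a cohort."""
    return (year - span, year - 1), (year + 1, year + span)

def group_total(cohort_values, first_year, span=15):
    """Add up one stat over `span` consecutive cohorts; missing cohorts count as zero."""
    return sum(cohort_values.get(y, 0) for y in range(first_year, first_year + span))

def neighbor_totals(cohort_values, year, span=15):
    """Totals for the previous and next cohort-groups around one cohort."""
    return (group_total(cohort_values, year - span, span),
            group_total(cohort_values, year + 1, span))

# Plate appearances by birth year, from the batting totals table above.
pa_by_year = {1980: 151352, 1981: 138495, 1982: 167402, 1983: 210540, 1984: 144812,
              1985: 109480, 1986: 124542, 1987: 121873, 1988: 62593, 1989: 69496}

print(neighbor_groups(1980))                      # the two 15-year ranges for 1980
print(neighbor_totals(pa_by_year, 1984, span=3))  # 1981-1983 vs. 1985-1987
```

Each stat would be summed this way, and the group's attribute rates then calculated from the summed totals rather than by averaging the cohorts' rates.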

The table below shows the similarity scores of the 1973-1993 cohorts to their respective previous and next 15-year groups. I also calculated a "forward score" for each cohort, which is simply its next-group similarity score MINUS its previous-group similarity score. The forward score shows just HOW forward- or backward-looking a cohort is. A positive forward score indicates a cohort is forward-looking and a negative score indicates it is backward-looking, and a score above +50 (or below -50) means that the cohort is VERY forward- (or backward-) looking.

Born  Prev.  Next  Forward
1973    957   950       -7
1974    953   919      -34
1975    950   941       -9
1976    951   945       -7
1977    955   944      -12
1978    940   948        8
1979    939   931       -8
1980    943   916      -27
1981    938   954       16
1982    958   948      -10
1983    956   941      -15
1984    932   961       28
1985    915   965       50
1986    927   955       27
1987    934   941        7
1988    878   942       65
1989    926   933        7
1990    884   892        7
1991    886   938       52
1992    891   943       52
1993    881   906       25

I've shaded the positive forward scores green and the negative forward scores red. Every cohort from 1973 to 1983, except for 1978 and 1981, is backward-looking. Every cohort from 1984 to 1993 is forward-looking.
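In code, the forward score is a one-line subtraction. The sketch below also attaches the labels I've been using; the function names and the handling of a score of exactly zero are my own assumptions. A few forward scores in the table differ by a point from the difference of the two rounded similarity scores shown beside them, presumably because they were calculated before rounding.

```python
def forward_score(prev_similarity, next_similarity):
    """Next-group similarity minus previous-group similarity."""
    return next_similarity - prev_similarity

def describe(score, strong=50):
    """Label a forward score; a score of exactly zero is treated as neutral (an assumption)."""
    if score == 0:
        return "neutral"
    direction = "forward-looking" if score > 0 else "backward-looking"
    return "very " + direction if abs(score) > strong else direction

print(describe(forward_score(943, 916)))  # the 1980 cohort
print(describe(forward_score(938, 954)))  # the 1981 cohort
```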

While Bill James declined to develop a "specific protocol...based on this method," he did state, as a general rule, that "an 'epoch' is formed by a series of forward-looking seasons, followed by a series of backward-looking seasons." But what he was really looking for was the "hard break" between epochs - a series of backward-looking seasons (the end of one epoch) followed by a series of forward-looking seasons (the beginning of a new epoch). He was looking for "fault lines" separating "natural groups of seasons," just as Strauss & Howe looked for boundaries separating cohesive cohort-groups.

I showed the 1973-1993 cohorts in the table above, not because those cohorts form a cohesive group, but because they're halves of two different groups; the second half of one group and the first half of the next group, with the boundary between the two groups appearing to fall between 1983 and 1984. But it's not a clean break; not all of the 1973-1983 cohorts are backward-looking, and the 1981-1983 cohorts are all fairly similar to both their respective groups.

I know if a cohort is forward- or backward-looking, and how forward- or backward-looking it is; now I need a way to determine if a cohort is part of a forward- or backward-looking trend. And since I do want a specific protocol for defining generations by an objective process, I'm adding what I'll call a "trend score" for each cohort. The trend score - as its name implies - checks each cohort's forward score to see if it is part of a trend. If a cohort's forward score is positive (forward-looking), the trend score adds to it the forward scores of the next two cohorts. If its forward score is negative (backward-looking), the trend score adds to it the forward scores of the previous two cohorts.
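A minimal Python sketch of the trend score, under two assumptions of mine: a missing neighbor counts as zero, and a forward score of exactly zero is handled like a negative one. The sample forward scores are the 1979-1988 values from the tables in this post. Because the slice is cut off at both ends, only the cohorts with both neighbors present (1981 through 1986 here) can be compared with the published trend scores, and even those can be off by a point because the inputs are rounded.

```python
def trend_scores(forward):
    """Map each birth year to its forward score plus two neighboring forward scores.

    Forward-looking cohorts add the next two cohorts; backward-looking cohorts
    add the previous two. Neighbors missing from `forward` count as zero.
    """
    trend = {}
    for year, score in forward.items():
        step = 1 if score > 0 else -1
        trend[year] = score + forward.get(year + step, 0) + forward.get(year + 2 * step, 0)
    return trend

# Forward scores for the 1979-1988 cohorts, as tabulated in this post.
forward_1979_1988 = {1979: -8, 1980: -27, 1981: 16, 1982: -10, 1983: -15,
                     1984: 28, 1985: 50, 1986: 27, 1987: 7, 1988: 65}

trend = trend_scores(forward_1979_1988)
for year in range(1981, 1987):
    print(year, trend[year])
```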

When at least three backward-trending cohorts are followed by at least three forward-trending cohorts, I draw a generational boundary between the last backward-trending cohort and the first forward-trending cohort.
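That rule can be sketched as a small Python function. It assumes the trend scores cover consecutive birth years and treats a trend score of exactly zero as backward-trending; both are my assumptions. The sample data are the 1918-1933 trend scores from the tables later in this post, which contain one stretch that fails the rule and one that passes.

```python
def find_boundaries(trend, min_run=3):
    """Return (last backward year, first forward year) pairs that satisfy the rule."""
    years = sorted(trend)
    forward = [trend[y] > 0 for y in years]  # True means forward-trending
    boundaries = []
    for i in range(1, len(years)):
        if forward[i] and not forward[i - 1]:
            # Count the backward-trending run that ends just before position i.
            back_run, j = 0, i - 1
            while j >= 0 and not forward[j]:
                back_run, j = back_run + 1, j - 1
            # Count the forward-trending run that starts at position i.
            fwd_run, k = 0, i
            while k < len(years) and forward[k]:
                fwd_run, k = fwd_run + 1, k + 1
            if back_run >= min_run and fwd_run >= min_run:
                boundaries.append((years[i - 1], years[i]))
    return boundaries

# Trend scores for the 1918-1933 cohorts, as tabulated in this post.
trend_1918_1933 = {1918: -27, 1919: -112, 1920: -26, 1921: -98, 1922: -26,
                   1923: 52, 1924: 46, 1925: -13, 1926: -28, 1927: -13,
                   1928: -28, 1929: 94, 1930: 90, 1931: 59, 1932: 69, 1933: 59}

print(find_boundaries(trend_1918_1933))
```

The 1922/1923 candidate is rejected because only two forward-trending cohorts follow it, while 1928/1929 passes with four backward-trending cohorts followed by five forward-trending ones.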

Rather than trying to explain any further how or why trend scores work now, I'll go ahead and start locating generational boundaries and explain them as I go. Here are the first ten baseball cohorts, 1831 to 1840:

Born  Prev.  Next  Forward   Trend
1831    204  -788     -992    -992
1832    204   -71     -275  -1,268
1833    545   -59     -604  -1,872
1834    545   -82     -627  -1,507
1835   -424   806    1,230     929
1836    821   732      -89     514
1837     97  -115     -211     929
1838    834   732     -101    -402
1839    109   -70     -179    -492
1840    736   665      -71    -352

Bill James gave a couple of caveats to his rule for defining epochs:
"1) Sometimes it is not a series of backward-looking years that ends an epoch, but just one year, and 2) Sometimes what ends an epoch is not a backward-looking phase, but rather a large difference between two adjacent seasons."
I take it to also be true that sometimes it is just one cohort, or a large difference between two adjacent cohorts, that STARTS a generation, and I tried to build these caveats into my trend scores. Even though the 1835 cohort is the only forward-looking cohort in its group, it is SO different from the cohorts that came before that it should be the start of a new generation. (The 1835 cohort consists of Harry Wright, the firstborn player to have a real major league career; the two players older than him appeared in just one game each as forty-somethings.) So even though the 1836 and 1837 cohorts are backward-looking, they're forward-trending because 1835's forward score is so high it overwhelms their negative scores.

The next several generational boundaries are easy to spot, without trend scores. We can draw one between 1856 and 1857:

Born  Prev.  Next  Forward   Trend
1852    877   756     -121    -576
1853    878   624     -254    -578
1854    799   795       -4    -380
1855    893   735     -158    -417
1856    861   758     -103    -265
1857    783   842       59     206
1858    826   844       18     174
1859    787   916      129     206
1860    847   875       28     165
1861    842   891       49     238

And 1873 and 1874:

Born  Prev.  Next  Forward   Trend
1869    940   933       -7      27
1870    900   884      -16      -3
1871    909   886      -22     -45
1872    923   885      -38     -76
1873    943   917      -26     -86
1874    919   929       10      44
1875    853   885       32      82
1876    920   921        1      58
1877    877   925       48      76
1878    911   920        9      42

And 1892 and 1893:

Born  Prev.  Next  Forward   Trend
1888    897   909       12     -67
1889    935   894      -42     -97
1890    930   892      -38     -67
1891    926   863      -63    -142
1892    922   862      -61    -161
1893    884   920       36     132
1894    905   917       11     185
1895    850   934       84     225
1896    864   954       89     171
1897    871   922       51     110

And 1911 and 1912:

Born  Prev.  Next  Forward   Trend
1907    904   952       48     -83
1908    956   903      -53     -49
1909    959   880      -79     -83
1910    963   879      -84    -216
1911    944   882      -61    -224
1912    886   944       58     212
1913    875   947       72     177
1914    881   963       82     116
1915    904   927       23      47
1916    912   924       11     -27

It looks like there might be a boundary between 1922 and 1923:

Born  Prev.  Next  Forward   Trend
1918    927   876      -51     -27
1919    941   867      -74    -112
1920    915   921        6     -26
1921    926   896      -30     -98
1922    898   896       -2     -26
1923    902   920       18      52
1924    882   911       29      46
1925    919   924        5     -13
1926    913   925       12     -28
1927    936   905      -31     -13

There are several mostly backward-looking cohorts followed by several forward-looking cohorts. But the 1925 and 1926 cohorts aren't forward-looking enough to be forward-trending, so the forward trend fizzles after the 1923 and 1924 cohorts, which means it doesn't meet my standard of at least three forward-trending cohorts. Besides, a boundary here would mean a generation of just 11 cohort birthyears (1912-1922), which is too short to be a true generation.

The actual boundary is six years later, between 1928 and 1929:

Born  Prev.  Next  Forward   Trend
1924    882   911       29      46
1925    919   924        5     -13
1926    913   925       12     -28
1927    936   905      -31     -13
1928    932   922      -10     -28
1929    882   943       61      94
1930    884   912       28      90
1931    885   889        4      59
1932    878   935       57      69
1933    931   929       -2      59

This time the backward-trending cohorts (1925-1928) are followed by a sustained forward trend. The 1929 cohort is the MOST forward-looking cohort since 1914, in the first wave of the previous generation.

The forward trend lasts through the 1941 cohort, and is then followed by 20 consecutive backward-trending cohorts. The next generational boundary isn't until 1961/62, 33 birthyears after the last one.

Born  Prev.  Next  Forward   Trend
1957    959   863      -95    -185
1958    921   896      -24    -154
1959    928   877      -51    -170
1960    938   906      -32    -107
1961    934   875      -59    -142
1962    905   938       33     144
1963    864   975      110     108
1964    906   907        1      48
1965    926   924       -3     108
1966    895   945       50     137

And finally, we're back to the boundary between the two currently-active generations:

Born  Prev.  Next  Forward   Trend
1979    939   931       -8     -12
1980    943   916      -27     -27
1981    938   954       16      -9
1982    958   948      -10     -21
1983    956   941      -15      -9
1984    932   961       28     106
1985    915   965       50      85
1986    927   955       27      99
1987    934   941        7      79
1988    878   942       65      79

And it looks like the boundary is indeed between 1983 and 1984, at least for now. These cohorts are still adding to their batting and pitching totals. The 1981-1983 group could possibly slip into the younger generation (I hope it does, anyway; it's hard to imagine the baby-faced Miguel Cabrera of the 1983 cohort being in the same generation as Clemens and Bonds).

So that's eight boundaries, which divide every MLB player in history into nine generations. Leaving out the first and last (partial) generations, the middle seven baseball generations range from 17 to 33 birthyears in length and average 21.3. This average nearly matches Strauss & Howe's 22-year "phase of life," or the length of a very long major league career.

Generation     Birth Years  Best Player
Knickerbocker  1831-1834
National       1835-1856    Cap Anson
American       1857-1873    Cy Young
Deadball       1874-1892    Ty Cobb
Ruthian        1893-1911    Babe Ruth
G.I.           1912-1928    Ted Williams
Expansion      1929-1961    Willie Mays
Steroid        1962-1983    Barry Bonds
Millennial     1984-1996    Mike Trout
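As a quick check on the generation lengths quoted above, here is a short Python sketch built from the birth years in the table. The variable names are my own; lengths are counted inclusively, so 1835-1856 is 22 birthyears.

```python
# Generations and their first and last birth years, from the table above.
generations = [
    ("Knickerbocker", 1831, 1834), ("National", 1835, 1856), ("American", 1857, 1873),
    ("Deadball", 1874, 1892), ("Ruthian", 1893, 1911), ("G.I.", 1912, 1928),
    ("Expansion", 1929, 1961), ("Steroid", 1962, 1983), ("Millennial", 1984, 1996),
]

# Leave out the first and last (partial) generations.
middle = generations[1:-1]
lengths = {name: last - first + 1 for name, first, last in middle}
average = sum(lengths.values()) / len(lengths)

print(lengths)
print(min(lengths.values()), max(lengths.values()), round(average, 1))
```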
