"A generation, like an individual, merges many different qualities, no one of which is definitive standing alone. But once all the evidence is assembled, we can build a persuasive case for identifying (by birthyear) eighteen generations over the course of American history. All Americans born over the past four centuries have belonged to one or another of these generations" (Generations page 68).William Strauss and Neil Howe wrote the book on generations - literally. The aim of this post - and, indeed, this entire blog - is to apply their theory to baseball. You can call it my manifesto.
Therefore, by the end of this post, I hope to have built a persuasive case for identifying (by birthyear) NINE generations over the course of baseball history. I will claim that all ballplayers born over the past two centuries belong to one or another of these generations.
Why do we need definitive baseball generations? For one, to provide context for "best of their generation" conversations and arguments (best player, best pitcher, best 3rd baseman, best leadoff batter, etc.). Beyond that, having definitive generations restores meaning to baseball's hallowed leaderboards. For example, Roger Connor hit 138 home runs in his career - a modest total by today's standards, but the most ever hit by a player born before 1887. Similarly, Miguel Cabrera's career batting average of .318 ranks 55th all-time (as of this writing), but it ranks first among players born after 1960.
So if we can agree that sorting baseball players by generation is useful, how exactly should we go about doing it? I'll start with Strauss & Howe's definition:
"A GENERATION is a cohort-group whose length approximates the span of a phase of life and whose boundaries are fixed by peer personality" (page 60).Earlier in the book (page 44), Strauss & Howe defined a "cohort" as "any set of persons born in the same year" and a "cohort-group" as "any wider set of persons born in a limited set of consecutive years."
The authors laid out (on page 56) four "phases of life," each 22 years long: youth (age 0 to 21), rising adulthood (age 22 to 43), midlife (age 44 to 65), and elderhood (age 66 to 87). Obviously, most major league careers fall almost entirely within the second phase, rising adulthood. And since the longest careers tend to last about 22 years, we can say then that the length of a baseball generation approximates the span of a very long major league career.
And that brings us to "peer personality" - "the element in our definition that distinguishes a generation as a cohesive cohort-group" (page 63). Strauss & Howe measured the similarity of cohorts by the similarity of their peer personality. In a pair of articles published to his website in July 2015 (and placed behind a subscription paywall), Bill James "measured the Similarity of Seasons...by the similarity of their statistical image."
Strauss & Howe "use peer personality to identify a generation and find the boundaries separating it from its neighbors" (page 64). Bill James used similarity scores to identify "natural groups of seasons" and find the "fault lines" separating them.
And while Strauss & Howe could "apply no reductive rules for comparing the beliefs and behavior of one cohort-group with those of its neighbors" (page 67), we have baseball's rich statistical record at our disposal for comparing the "behavior" of baseball cohort-groups.
So here then is my modified definition:
A BASEBALL GENERATION is a cohort-group whose length approximates the span of a very long major league career and whose boundaries are fixed by statistical image.
Bill James' similarity scores for seasons used 30 statistical categories, including both counting stats (hits, homeruns, strikeouts), and rate stats (batting average, on-base percentage, earned run average). But, as Kerry Whisnant explained, using counting stats to compare cohorts, like using them to compare players or seasons, will mean that only cohorts with "similar numbers of plate appearances" will be similar.
And the traditional rate stats "confound" talents, as Jim Albert explained on page 24 of an article in By the Numbers. "A batting average confounds three batter talents: the talent not to strikeout, the talent to hit a home run, and the talent to hit an in-play ball for a hit."
Peer personality has nothing to do with the raw numbers of a generation, but rather the collective behavior of its members. Strauss & Howe elaborated (on page 63 of Generations):
"The peer personality of a generation is essentially a caricature of its prototypical member. It is, in its sum of attributes, a distinctly personlike creation. A generation...can be safe or reckless, calm or aggressive, self-absorbed or outer-driven, generous or selfish, spiritual or secular, interested in culture or interested in politics."Likewise, the "statistical image" of a baseball generation (like the statistical image of a team or a league) is essentially a caricature of its average player. It is, in its sum of attributes, like an individual player. A baseball generation can be patient or free-swinging, adept at making contact or prone to striking out, powerful or light-hitting, good at hitting the ball "where they ain't" or bad at avoiding defenders, aggressive on the base-paths or station-to-station.
So I'll use eight rate statistics - what I'll call the "attribute rates" - to measure the similarity of baseball cohorts. ("Each rate describes something specific," Tom Tango wrote, describing four of the rates.) These eight rates - representing eight different skills, or tools - taken together, reveal the "statistical image" or "peer personality" of a baseball generation; how its members collectively played the game.
The first rate, BF/G, uses pitching stats only. The next three, $BB, $SO, and $HR - the "three true outcomes" - draw from both batting and pitching stats (for the formulas listed below, I differentiate between batting and pitching with a small 'b' or 'p' in the variables). The last four use batting stats only.
BF/G - the number of batters a pitcher faces per game.
= BF / G
$BB - the percentage of plate appearances that end in a walk or a hit by pitch.
= (bBB + bHBP + pBB + pHBP) / (PA + BF)
$SO - the percentage of plate appearances ending in a strike (called, swung and missed, or batted) that are strikeouts.
= (bSO + pSO) / (PA + BF - bBB - bHBP - pBB - pHBP)
$HR - the percentage of batted balls that are homeruns.
= (bHR + pHR) / (PA + BF - bBB - bSO - bHBP - pBB - pSO - pHBP)
$H - the percentage of balls batted into the field of play that are hits.
= (H - HR) / (PA - HR - BB - SO - HBP)
$XBH - the percentage of base hits that go for extra bases (doubles or triples).
= (2B + 3B) / (H - HR)
$3B - the percentage of extra-base hits that are triples.
= 3B / (2B + 3B)
$SB - the percentage of successful stolen bases per (approximate) times on first.
= SB / (H - 2B - 3B - HR + BB + HBP)
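For anyone who would rather script the eight formulas above than build them in a spreadsheet, here is a minimal Python sketch. The function name, the dictionary keys, and the sample totals are all assumptions made for illustration (the numbers are made up, not real cohort data); only the formulas themselves come from the list above.

```python
def attribute_rates(bat, pit):
    """Compute the eight attribute rates from batting and pitching totals.

    `bat` and `pit` are dicts of summed career totals. The key names are
    hypothetical; rename them to match your own data.
    """
    walks = bat["BB"] + bat["HBP"] + pit["BB"] + pit["HBP"]
    strikeouts = bat["SO"] + pit["SO"]
    homers = bat["HR"] + pit["HR"]
    trips = bat["PA"] + pit["BF"]        # plate appearances seen from both sides
    non_hr_hits = bat["H"] - bat["HR"]
    extra = bat["2B"] + bat["3B"]        # extra-base hits excluding home runs
    on_first = (bat["H"] - bat["2B"] - bat["3B"] - bat["HR"]
                + bat["BB"] + bat["HBP"])
    in_play = bat["PA"] - bat["HR"] - bat["BB"] - bat["SO"] - bat["HBP"]
    return {
        "BF/G": pit["BF"] / pit["G"],
        "$BB": walks / trips,
        "$SO": strikeouts / (trips - walks),
        "$HR": homers / (trips - walks - strikeouts),
        "$H": non_hr_hits / in_play,
        "$XBH": extra / non_hr_hits,
        "$3B": bat["3B"] / extra,
        "$SB": bat["SB"] / on_first,
    }

# Made-up totals, only to exercise the function.
bat = {"PA": 1000, "H": 250, "2B": 50, "3B": 10, "HR": 30,
       "BB": 90, "HBP": 10, "SO": 200, "SB": 20}
pit = {"BF": 1000, "G": 40, "BB": 80, "HBP": 20, "SO": 180, "HR": 20}
rates = attribute_rates(bat, pit)
print(rates["BF/G"], round(rates["$BB"], 3))  # 25.0 0.1
```

Each denominator strips out the outcomes already accounted for by the previous rate, which is why the three true-outcome rates are computed in the order walks, strikeouts, home runs.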
So, now all we need are the career totals for every batter and every pitcher in MLB history. Then, on a separate sheet in Excel, it's just a matter of using SUMIF formulas to add up the necessary batting and pitching totals for each cohort. From the firstborn player (Nate Berkenstock, 1831) to the lastborn (Julio Urias, 1996), there are 166 MLB cohort birthyears (through 2016; I'm typing this just days after Ozzie Albies became the first 1997-born major-leaguer). The table below shows the career totals for 1980s cohorts:
|Batting Totals||Pitching Totals|
All the batters born in 1980 combined for 151,352 plate appearances, 34,688 hits, and 4,575 homeruns. All the pitchers born that year combined for 11,496 bases on balls, 24,982 strikeouts, etc.
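The SUMIF step can also be sketched outside of Excel. The Python below is only an illustration of the idea (sum every stat by birth year); the function name, field names, and the three player rows are hypothetical and are not real career lines.

```python
from collections import defaultdict

def cohort_totals(players, stats):
    """Sum each stat by birth year, like one SUMIF column per stat."""
    totals = defaultdict(lambda: dict.fromkeys(stats, 0))
    for player in players:
        for stat in stats:
            totals[player["birth_year"]][stat] += player[stat]
    return dict(totals)

# Made-up rows purely to show the mechanics.
batters = [
    {"birth_year": 1980, "PA": 600, "H": 150, "HR": 20},
    {"birth_year": 1980, "PA": 400, "H": 90, "HR": 5},
    {"birth_year": 1981, "PA": 500, "H": 120, "HR": 10},
]
batting = cohort_totals(batters, ["PA", "H", "HR"])
print(batting[1980])  # {'PA': 1000, 'H': 240, 'HR': 25}
```

The same function would be run once over the batting lines and once over the pitching lines, giving two sets of totals per birth year.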
Then I can calculate the attribute rates for each cohort:
Next I need the standard deviations of each rate. I have 166 cohort birthyears, but many of the very early and very recent cohorts were not (or aren't yet) well-represented in the major leagues. So I'll set minimum requirements of 10,000 total plate appearances and 10,000 total batters faced, and therefore only include the 143 cohorts from 1850 (Al Spalding) through 1992 (Bryce Harper) in the population for my standard deviations.
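As a rough sketch of that filtering step in Python: keep only the cohorts that clear both minimums, then take one standard deviation per rate across them. I'm assuming a population standard deviation here (`statistics.pstdev`), since the text speaks of a "population"; a sample standard deviation would give slightly larger values. The function names and dictionary layout are hypothetical.

```python
from statistics import pstdev

MIN_PA = 10_000   # minimum total plate appearances for a cohort
MIN_BF = 10_000   # minimum total batters faced for a cohort

def qualifying_years(cohorts):
    """Birth years whose cohort meets both minimums.

    `cohorts` maps birth year -> {"PA": ..., "BF": ...} (hypothetical layout).
    """
    return sorted(year for year, c in cohorts.items()
                  if c["PA"] >= MIN_PA and c["BF"] >= MIN_BF)

def rate_std_devs(rates_by_year, years):
    """Population standard deviation of each rate across the given years."""
    names = list(rates_by_year[years[0]])
    return {name: pstdev([rates_by_year[y][name] for y in years])
            for name in names}

# Made-up cohorts: the first falls short on plate appearances.
cohorts = {1: {"PA": 9_000, "BF": 12_000},
           2: {"PA": 10_000, "BF": 10_000},
           3: {"PA": 20_000, "BF": 15_000}}
years = qualifying_years(cohorts)
print(years)  # [2, 3]
```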
Also, I'll need to assign weights to each rate. I wanted the "three true outcomes" rates ($BB, $SO, and $HR) to weigh double the other rates, because they use both hitting and pitching statistics, and I wanted the $XBH and $3B rates to weigh half the other rates, because they both deal with breakdowns of base hits. Finally, I wanted the weights to add up to 1,000, so that if two groups are exactly four standard deviations apart in every category, their similarity score will be zero.
To find the similarity score between two groups, start at 1,000 and subtract a penalty for each attribute rate. The penalty is the difference between the two groups, times a multiplier. The multiplier is the rate's weight divided by (4 times its standard deviation).
For example, the 1980 cohort has a $BB rate of .097 and the 1981 cohort has a $BB rate of .091, a difference of .006. So the $BB penalty for 1980 and 1981 would be the difference (.006) times the multiplier (3,870), which is about 23. Add up the penalties for all eight rates and subtract from 1,000, and that is the similarity score.
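Here is a hedged Python sketch of the scoring rule. The individual weights are not listed in the text, so the ones below are my reading of the stated constraints (true-outcome rates doubled, $XBH and $3B halved, total of 1,000), which works out to 200, 50, and 100 points per rate. The function names are hypothetical.

```python
# Relative weights implied by the text, then scaled to total 1,000.
RELATIVE = {"BF/G": 1, "$BB": 2, "$SO": 2, "$HR": 2,
            "$H": 1, "$XBH": 0.5, "$3B": 0.5, "$SB": 1}
SCALE = 1000 / sum(RELATIVE.values())
WEIGHTS = {name: rel * SCALE for name, rel in RELATIVE.items()}

def multiplier(name, std_dev):
    """Weight divided by four standard deviations of that rate."""
    return WEIGHTS[name] / (4 * std_dev)

def similarity(rates_a, rates_b, std_devs):
    """Start at 1,000 and subtract one penalty per attribute rate."""
    penalty = sum(abs(rates_a[n] - rates_b[n]) * multiplier(n, std_devs[n])
                  for n in WEIGHTS)
    return 1000 - penalty

# The $BB example from the text, using the quoted multiplier of 3,870.
bb_penalty = abs(0.097 - 0.091) * 3870
print(round(bb_penalty))  # 23
```

Under these assumed weights, a $BB multiplier of 3,870 would correspond to a standard deviation of roughly .0129 (200 divided by 4 times 3,870).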
To find "Epochs and Eras," Bill James asked of "every season in baseball history: Is it more like the season before it, or more like the season after it?" He then made two-year comparisons, three-year comparisons, four-year comparisons... comparing "each season to every other season in baseball history within 15 years before or after."
Instead of comparing each baseball cohort to its neighboring cohorts, I'm comparing it to the 15-year cohort-groups before and after it. To find Baseball Generations, I'm asking of every baseball cohort: Is it more like the 15-year cohort-group before it, or more like the 15-year cohort-group after it?
Is the 1980 cohort more similar to the 1965-1979 cohort-group, or more similar to the 1981-1995 cohort-group?
1980 to 1965-1979 - 943
1980 to 1981-1995 - 916
The 1980 cohort is backward-looking, more similar to the cohort-group before it (943) than the cohort-group after it (916). What about 1981?
1981 to 1966-1980 - 938
1981 to 1982-1996 - 954
The 1981 cohort is forward-looking, more similar to the cohort-group after it (954) than the cohort-group before it (938).
The table below shows the similarity scores of the 1973-1993 cohorts to their respective previous and next 15-year groups. I also calculated a "forward score" for each cohort, which is simply its next-group similarity score MINUS its previous-group similarity score. The forward score shows just HOW forward- or backward-looking a cohort is. A positive forward score indicates a cohort is forward-looking and a negative score indicates it is backward-looking, and a score above +50 (or below -50) means that the cohort is VERY forward- (or backward-) looking.
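The window and forward-score arithmetic is simple enough to sketch in a few lines of Python. The function names are hypothetical, and the "neither" label for a score of exactly zero is my own assumption; the similarity scores plugged in are the 1980 and 1981 figures quoted above.

```python
def neighbor_groups(year, span=15):
    """(first, last) birth years of the cohort-groups before and after."""
    return (year - span, year - 1), (year + 1, year + span)

def forward_score(prev_similarity, next_similarity):
    """Next-group similarity minus previous-group similarity."""
    return next_similarity - prev_similarity

def label(score, strong=50):
    """Describe a forward score in words."""
    if score == 0:
        return "neither"  # assumption: the text does not cover a tie
    direction = "forward" if score > 0 else "backward"
    return ("very " if abs(score) > strong else "") + direction + "-looking"

print(neighbor_groups(1980))           # ((1965, 1979), (1981, 1995))
print(forward_score(943, 916))         # -27
print(label(forward_score(938, 954)))  # forward-looking
```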
I've shaded the positive forward scores green and the negative forward scores red. Every cohort from 1973 to 1983, except for 1978 and 1981, is backward-looking. Every cohort from 1984 to 1993 is forward-looking.
While Bill James declined to develop a "specific protocol...based on this method," he did state, as a general rule, that "an 'epoch' is formed by a series of forward-looking seasons, followed by a series of backward-looking seasons." But what he was really looking for was the "hard break" between epochs - a series of backward-looking seasons (the end of one epoch) followed by a series of forward-looking seasons (the beginning of a new epoch). He was looking for "fault lines" separating "natural groups of seasons," just as Strauss & Howe looked for boundaries separating cohesive cohort-groups.
I showed the 1973-1993 cohorts in the table above, not because those cohorts form a cohesive group, but because they're halves of two different groups; the second half of one group and the first half of the next group, with the boundary between the two groups appearing to fall between 1983 and 1984. But it's not a clean break; not all of the 1973-1983 cohorts are backward-looking, and the 1981-1983 cohorts are all fairly similar to both their respective groups.
I know if a cohort is forward- or backward-looking, and how forward- or backward-looking it is; now I need a way to determine if a cohort is part of a forward- or backward-looking trend. And since I do want a specific protocol for defining generations by an objective process, I'm adding what I'll call a "trend score" for each cohort. The trend score - as its name implies - checks each cohort's forward score to see if it is part of a trend. If a cohort's forward score is positive (forward-looking), the trend score adds to it the forward scores of the next two cohorts. If its forward score is negative (backward-looking), the trend score adds to it the forward scores of the previous two cohorts.
When at least three backward-trending cohorts are followed by at least three forward-trending cohorts, I draw a generational boundary between the last backward-trending cohort and the first forward-trending cohort.
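A minimal Python sketch of the trend-score rule and the boundary test follows. Two details are my own assumptions, since the text does not specify them: a missing neighbor at either end of the timeline counts as zero, and a forward score of exactly zero gets no neighbors added. The function names and the forward scores in the example are made up.

```python
def trend_scores(forward):
    """Map each year to its trend score, given year -> forward score."""
    trend = {}
    for year, score in forward.items():
        if score > 0:
            neighbors = (year + 1, year + 2)   # look ahead
        elif score < 0:
            neighbors = (year - 1, year - 2)   # look back
        else:
            neighbors = ()                     # assumption: zero stands alone
        # Assumption: neighbors off the edge of the data contribute 0.
        trend[year] = score + sum(forward.get(n, 0) for n in neighbors)
    return trend

def find_boundaries(trend, run=3):
    """Pairs (last backward-trending year, first forward-trending year)."""
    years = sorted(trend)
    found = []
    for i in range(run, len(years) - run + 1):
        before, after = years[i - run:i], years[i:i + run]
        if (all(trend[y] < 0 for y in before)
                and all(trend[y] > 0 for y in after)):
            found.append((years[i - 1], years[i]))
    return found

# Made-up forward scores for eight consecutive cohorts.
forward = {1: -10, 2: -20, 3: -5, 4: -15, 5: 30, 6: 10, 7: 20, 8: 5}
print(find_boundaries(trend_scores(forward)))  # [(4, 5)]
```

Because a negative forward score pulls in the two previous cohorts, a mildly backward-looking cohort that follows a strongly forward-looking one can still come out forward-trending.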
Rather than trying to explain any further how or why trend scores work now, I'll go ahead and start locating generational boundaries and explain them as I go. Here are the first ten baseball cohorts, 1831 to 1840:
Bill James gave a couple of caveats to his rule for defining epochs:
"1) Sometimes it is not a series of backward-looking years that ends an epoch, but just one year, and 2) Sometimes what ends an epoch is not a backward-looking phase, but rather a large difference between two adjacent seasons."I take it to also be true that sometimes it is just one cohort, or a large difference between two adjacent cohorts, that STARTS a generation, and I tried to build these caveats into my trend scores. Even though the 1835 cohort is the only forward-looking cohort in its group, it is SO different from the cohorts that came before that it should be the start of a new generation. (The 1835 cohort consists of Harry Wright, the firstborn player to have a real major league career; the two players older than him appeared in just one game each as forty-somethings.) So even though the 1836 and 1837 cohorts are backward-looking, they're forward-trending because 1835's forward score is so high it overwhelms their negative scores.
The next several generational boundaries are easy to spot, without trend scores. We can draw one between 1856 and 1857:
And 1873 and 1874:
And 1892 and 1893:
And 1911 and 1912:
It looks like there might be a boundary between 1922 and 1923:
There are several mostly backward-looking cohorts followed by several forward-looking cohorts. But the 1925 and 1926 cohorts aren't forward-looking enough to be forward-trending, so the forward trend fizzles after the 1923 and 1924 cohorts, which means it doesn't meet my standard of at least three forward-trending cohorts. Besides, a boundary here would mean a generation of just 11 cohort birthyears (1912-1922), which is too short to be a true generation.
The actual boundary is six years later, between 1928 and 1929:
This time the backward-trending cohorts (1925-1928) are followed by a sustained forward trend. The 1929 cohort is the MOST forward-looking cohort since 1914, in the first wave of the previous generation.
The forward trend lasts through the 1941 cohort, and is then followed by 20 consecutive backward-trending cohorts. The next generational boundary isn't until 1961/62, 33 birthyears after the last one.
And finally, we're back to the boundary between the two currently-active generations:
And it looks like the boundary is indeed between 1983 and 1984, at least for now. These cohorts are still adding to their batting and pitching totals. The 1981-1983 group could possibly slip into the younger generation (I hope it does, anyway; it's hard to imagine the baby-faced Miguel Cabrera of the 1983 cohort being in the same generation as Clemens and Bonds).
So that's eight boundaries, which divide all ballplayers born between 1831 and 1996 into nine generations. Ignoring the first and last (partial) generations, the seven generations in the middle have an average length of 21.3 years, which nearly matches Strauss & Howe's 22-year "phase of life," or the length of a very long major league career.
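The average can be checked with a short Python sketch. The boundary list is taken from the walkthrough above; the first pair (1834/1835) is my reading of the discussion of the 1835 cohort starting a new generation. The variable and function names are hypothetical.

```python
# (last year of one generation, first year of the next)
BOUNDARIES = [(1834, 1835), (1856, 1857), (1873, 1874), (1892, 1893),
              (1911, 1912), (1928, 1929), (1961, 1962), (1983, 1984)]
FIRST_YEAR, LAST_YEAR = 1831, 1996

def generation_spans(boundaries, first, last):
    """Turn boundary pairs into (start, end) birth-year spans."""
    starts = [first] + [after for _, after in boundaries]
    ends = [before for before, _ in boundaries] + [last]
    return list(zip(starts, ends))

spans = generation_spans(BOUNDARIES, FIRST_YEAR, LAST_YEAR)
lengths = [end - start + 1 for start, end in spans]
middle = lengths[1:-1]   # drop the partial first and last generations
print(len(spans))                           # 9
print(middle)                               # [22, 17, 19, 19, 17, 33, 22]
print(round(sum(middle) / len(middle), 1))  # 21.3
```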
|Generation||Birth Years||Best Player|