Sunday, November 20, 2016

The Generations of Baseball, part three

Continued from part two.
"In some respects, a peer personality gives heavy focus to the attitudes and experiences of the generational elite...('the heads of society, the kings of thought, the lords of the generation'). But while they commonly express the tone of a generation's peer personality, the personality itself is often established by non-elites" (Generations page 64).
The following table shows PA and BF totals and attribute rates for the 1964 cohort. I listed all batters with at least 8,000 PA and all pitchers with at least 8,000 BF ("the generational elite"), all other batters and pitchers (the "non-elites"), and the cohort totals. As you can see, the "non-elites" as a whole overwhelm "elites" in raw numbers of PA and BF, but elites still obviously have a lot of impact on their cohort's attribute rates (Barry Bonds accounted for roughly 6% of his cohort's total plate appearances, while Kenny Rogers pitched to about 8% of its batters faced).

Barry Bonds          12,606.
Rafael Palmeiro      12,
Mark Grace           9,
B.J. Surhoff         9,
Barry Larkin         9,
Will Clark           8,
Ellis Burks          8,
Jose Canseco         8,
Kenny Rogers         14,280   18.
Dwight Gooden        11,705   27.
John Burkett         11,324   25.
Bobby Witt           11,003   25.
Bret Saberhagen      10,421   26.
Kevin Tapani         9,600    26.
All other batters    138,971.
All other pitchers   114,129  9.

(Batting stats for pitchers are included in "All other batters"; pitching stats for position players are included in "All other pitchers").
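As a quick check on the "roughly 6%" and "about 8%" figures, here is a minimal sketch that divides each player's total by the 1964 cohort total. All four numbers are taken from the tables in this post; the variable and function names are just illustrative.

```python
# Totals quoted in this post for the 1964 cohort.
cohort_pa = 215_665   # plate appearances, whole cohort
cohort_bf = 182_462   # batters faced, whole cohort
bonds_pa = 12_606     # Barry Bonds, career PA
rogers_bf = 14_280    # Kenny Rogers, career BF


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


bonds_share = share(bonds_pa, cohort_pa)
rogers_share = share(rogers_bf, cohort_bf)

print(f"Bonds: {bonds_share:.1f}% of cohort PA")
print(f"Rogers: {rogers_share:.1f}% of cohort BF")
```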

Here are stats for seven consecutive cohorts, 1961-1967 (the first wave of Generation X):

Cohort  Sample Member  PA       BF       BF/G  $BB  $SO  $HR  $H  $XBH  $3B  $SB
1961    Don Mattingly  114,102  114,888  11.
1962    Roger Clemens  162,446  181,147  13.
1963    Randy Johnson  189,945  163,102  11.
1964    Barry Bonds    215,665  182,462  12.
1965    Craig Biggio   192,592  167,783  12.
1966    Greg Maddux    138,533  224,287  11.
1967    John Smoltz    215,318  197,169  10.

And here is Generation X divided into three cohort-groups:

Cohort-Group  Sample Member   PA         BF         BF/G  $BB  $SO  $HR  $H  $XBH  $3B  $SB
1961-1967     Barry Bonds     1,228,601  1,230,838  11.
1968-1974     Pedro Martinez  1,252,518  1,167,274  10.
1975-1981     Alex Rodriguez  1,168,441  1,206,186  10.
Generation X                  3,649,560  3,604,298  10.

And here again are the eleven baseball-playing Strauss & Howe generations - the nine MLB generations (Gilded through Millennial), the purely amateur Transcendentals, and the (so far) purely little-league Homelanders:

Generation X  1961-1981  3,649,560  3,604,298  10.

But like I said in part one, these are social generations. Strauss & Howe were trying to identify generations based on how they shape and react to history and each other.

My goal is much simpler: I want to identify baseball generations based on how (if at all) they played major league baseball. In part two I defined my tool for identifying baseball generations: similarity scores based on attribute rates.

I'll use the Strauss & Howe generations as a starting point. Then I can ask of each cohort: does it actually belong in this generation, or should it be in a neighboring one?

Let's look again at the first wave of Generation X, the 1961-1967 cohorts. Here are the similarity scores of each cohort to its own generation (Gen X) and to its next-older generation (Boom):

Cohort  Boom  Gen X

As you can see, each cohort is in fact more similar to Generation X than it is to the Boom Generation, except for the first one, 1961, which is more similar to the Boomers (929 to 870). So based on these numbers, the 1961 cohort belongs in the Boom Generation instead of Generation X.

So what I did is compare every cohort to every generation. Starting with the firstborn major leaguer (Nate Berkenstock, born 1831) and ending with 2016's youngest player (Julio Urias, born 1996), I have 166 MLB cohorts. And there are eleven generations, although the first and last (Transcendentals and Homelanders) are statistically identical, because their members had no major league experience.

To find the statistical similarity of a cohort and a generation (or of any two players or groups), start at 1,000 and subtract a penalty for each attribute rate. The penalty is the absolute difference between the cohort's rate and the generation's rate, times a multiplier. The multiplier is the weight I assigned to the attribute rate, divided by four times its standard deviation.
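One way to write that rule as a formula is below. The symbols are shorthand introduced here, not notation from the study: r with subscripts c,i and g,i are the cohort's and the generation's values for attribute rate i, w_i is the weight assigned to that rate, and sigma_i is its standard deviation across cohorts.

```latex
S(c, g) = 1000 - \sum_{i} m_i \,\lvert r_{c,i} - r_{g,i} \rvert,
\qquad
m_i = \frac{w_i}{4\,\sigma_i}
```

With this form, a gap of exactly four standard deviations in rate i costs exactly w_i points.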

St. Dev.

I have 166 total cohorts, but for standard deviations, I only wanted to include cohorts with at least 10,000 total PA and 10,000 total BF, so I limited my population to the 143 cohorts born between 1850 and 1992.
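The filtering step can be sketched roughly as follows. The cohort records here are made-up toy values, the field names are hypothetical, and using the population standard deviation (`statistics.pstdev`) rather than the sample version is an assumption; the post does not say which was used.

```python
from statistics import pstdev

MIN_PA = 10_000  # minimum cohort plate appearances
MIN_BF = 10_000  # minimum cohort batters faced


def qualifying(cohorts):
    """Keep only cohorts that meet both playing-time thresholds."""
    return [c for c in cohorts if c["pa"] >= MIN_PA and c["bf"] >= MIN_BF]


def rate_sd(cohorts, rate):
    """Standard deviation of one attribute rate across qualifying cohorts."""
    return pstdev([c[rate] for c in qualifying(cohorts)])


# Toy data only: three invented cohorts, one of them too small to count.
toy_cohorts = [
    {"pa": 500, "bf": 300, "$HR": 0.001},
    {"pa": 120_000, "bf": 115_000, "$HR": 0.030},
    {"pa": 160_000, "bf": 180_000, "$HR": 0.034},
]

hr_sd = rate_sd(toy_cohorts, "$HR")

# The cohort counts quoted in the post follow from inclusive year ranges.
all_cohorts = len(range(1831, 1996 + 1))
sd_cohorts = len(range(1850, 1992 + 1))
```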

I wanted the weights of the "three true outcomes" rates ($BB, $SO, and $HR) to be double the other weights, because they use both hitting AND pitching stats, while the other rates only use one or the other. And I wanted the weights to add up to close to 1,000, so that if two groups are exactly four standard deviations apart in every category, they will have a similarity score near zero.
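Here is a small sketch of how such a weighting scheme behaves. The base weight of 91 is a hypothetical value chosen only so the total lands near 1,000; the actual weights are not given in this section. The eight rate names are taken from the table headers above.

```python
TRUE_OUTCOMES = ("$BB", "$SO", "$HR")              # double weight
OTHER_RATES = ("BF/G", "$H", "$XBH", "$3B", "$SB")  # single weight


def make_weights(base):
    """Give the three true outcomes twice the base weight."""
    weights = {name: 2 * base for name in TRUE_OUTCOMES}
    weights.update({name: base for name in OTHER_RATES})
    return weights


def score_at_k_sd(weights, k):
    """Score when two groups differ by k standard deviations in every rate.

    Each penalty is (k * sd) * weight / (4 * sd), so sd cancels out
    and the penalty is simply weight * k / 4.
    """
    return 1000 - sum(w * k / 4 for w in weights.values())


weights = make_weights(91)     # hypothetical base weight
total = sum(weights.values())  # 3 * 182 + 5 * 91

print(total)
print(score_at_k_sd(weights, 4))
```

Because the standard deviation cancels, the score at four standard deviations depends only on the sum of the weights, which is why a total near 1,000 puts that score near zero.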

Don Mattingly's 1961 cohort has a $HR rate of 2.9%, while the $HR rate of Generation X overall is 3.6%. For the similarity score of that cohort to its generation, the $HR penalty is .007 (.036 - .029) times the rate multiplier of 3,806, or about 27.

Add up all eight penalties and subtract from 1,000, and that is the similarity score.
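The $HR example above, written out as a minimal sketch. The rates and the multiplier of 3,806 come from the text; the function names are illustrative, and the other seven penalties are not reproduced here because their values are not given.

```python
def penalty(rate_a, rate_b, multiplier):
    """Penalty for one attribute rate: absolute difference times multiplier."""
    return abs(rate_a - rate_b) * multiplier


def similarity_score(penalties):
    """Start at 1,000 and subtract every penalty."""
    return 1000 - sum(penalties)


# 1961 cohort $HR = 2.9%, Generation X $HR = 3.6%, multiplier = 3,806.
hr_penalty = penalty(0.029, 0.036, 3806)
print(round(hr_penalty))  # about 27, matching the text
```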

I made a worksheet of the similarity scores of every cohort to every generation, and used conditional formatting to create a "heat map" of the scores, where 1,000 is green and zero (or negative) is red, and everything in between is on a gradient. Below is a screenshot of the portion of the sheet showing the Gen X cohorts (and the first Millennial cohort):

And next to it, I made another table. For each cohort I added a constant to all of its similarity scores so that the highest score always equals 1,000, and then I highlighted all the 1,000's. It's not as pretty as the above table, but it's more useful for showing which generation each cohort belongs in.
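That shift-and-highlight step amounts to something like the sketch below. The two scores are the 1961 cohort's values quoted earlier (929 to the Boom, 870 to Gen X); the function names are illustrative, and the original was done in a spreadsheet, not in code.

```python
def shift_to_1000(scores):
    """Add one constant to every score so the highest becomes 1,000."""
    offset = 1000 - max(scores.values())
    return {gen: score + offset for gen, score in scores.items()}


def best_generation(scores):
    """Return the generation with the highest similarity score."""
    return max(scores, key=scores.get)


scores_1961 = {"Boom": 929, "Gen X": 870}

shifted = shift_to_1000(scores_1961)
print(shifted)
print(best_generation(scores_1961))
```

The shift does not change which generation wins; it only makes the winner easy to spot in every row.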

Generation X's last-born cohort, 1981, also jumps ship, going over to the Millennials. And this is in fact where I end up for the birthyears of baseball's Generation X - 1962 to 1980. But I didn't start here. I started at the beginning.

The 1831-1834 (and 1837) Gilded cohorts are much more similar to the Transcendentals (and to the Homelanders, who, as I said, are statistically identical to the Transcendentals). So I'll move the 1831-1834 cohorts to the Transcendental Generation (the pre-MLB 1822-1830 cohorts go too). That pushes the Transcendental/Gilded boundary from 1821-1822 to 1834-1835:


You can tell from the above table of highlighted 1,000's that most of the 1835 to 1851 cohorts were already more similar to the Gilded Generation. But before I even look at the similarity scores again, I'm enforcing a minimum length for these generations. The shortest Strauss & Howe generation is 17 years; I'll relax that by one year and say the baseball generations must be at least 16 years.

The Gilded Generation is currently reduced to eight years (1835-1842). So it gets pushed out eight more years to 1850, which means the Progressive Generation gets extended, too (which shortens the Missionary Generation to 16 years exactly):
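The minimum-length rule and its knock-on effect can be sketched as a single forward pass. Only the Gilded years (1835-1842) and the 16-year minimum come from this post; the starting birthyears for the other generations are assumed here to be Strauss & Howe's published ranges, and the function name is hypothetical.

```python
MIN_YEARS = 16  # minimum generation length used in the post


def enforce_min_length(generations, min_years=MIN_YEARS):
    """Walk forward in birth order, stretching any generation that is too short.

    Each generation starts right after the previous one ends, and its last
    year is pushed out only as far as needed. The final generation is not
    protected from shrinking; that is left out of this sketch.
    """
    adjusted = []
    next_start = generations[0][1]
    for name, first, last in generations:
        first = max(first, next_start)
        last = max(last, first + min_years - 1)
        adjusted.append((name, first, last))
        next_start = last + 1
    return adjusted


before = [
    ("Gilded", 1835, 1842),       # from the post
    ("Progressive", 1843, 1859),  # assumed Strauss & Howe range
    ("Missionary", 1860, 1882),   # assumed Strauss & Howe range
    ("Lost", 1883, 1900),         # assumed Strauss & Howe range
    ("G.I.", 1901, 1924),         # assumed Strauss & Howe range
]

after = enforce_min_length(before)
for name, first, last in after:
    print(f"{name}: {first}-{last} ({last - first + 1} years)")
```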


Since the generations' birthyears have changed, so too have their attribute rates. Which means the similarity scores will be different:

Since the Gilded birthyears are now 1835-1850, that generation "pulls in" not only most of those cohorts, but 1851-1853 and 1855 as well. So I can add the 1851-1853 cohorts to the Gildeds, which means I have to extend the Progressives, Missionaries, and Lost (to get them back to 16 years), and cut into the G.I. birthyears:


Which changes the similarity scores again:

The 1854 cohort is still most similar to the Progressives, but now it's more similar to the Gildeds than the 1855 cohort is to the Progressives. So I'll move both cohorts to the Gildeds, which pushes each of the next three generations out another two years:


Which pulls not only the 1854 cohort but also the 1856 cohort into the Gilded Generation:
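The repeat-until-nothing-moves pattern in these steps can be illustrated with a deliberately simplified toy. It uses a single made-up rate per cohort and plain averages, whereas the actual method uses eight weighted rates, pooled generation totals, and the 16-year minimum. All names and values below are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def lock_boundary(rates, start, boundary, end):
    """Move the older group's last cohort later until the next cohort resists.

    rates: one toy rate per cohort, indexed by position.
    Older group = start..boundary, younger group = boundary+1..end.
    Both group averages are recomputed after every move, which is why
    one move can trigger another.
    """
    while boundary + 1 < end:
        older = mean(rates[start:boundary + 1])
        younger = mean(rates[boundary + 1:end + 1])
        candidate = rates[boundary + 1]
        if abs(candidate - older) < abs(candidate - younger):
            boundary += 1  # the older group pulls this cohort in
        else:
            break          # the boundary has locked
    return boundary


# Toy rates for ten consecutive cohorts (invented values).
toy_rates = [1.0, 1.0, 1.0, 1.0, 1.2, 1.3, 3.0, 3.0, 3.0, 3.0]

locked = lock_boundary(toy_rates, start=0, boundary=3, end=9)
print(locked)
```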

I think you get the idea by now. After I add the 1856 cohort to the Gilded group, the 1857 and later cohorts remain more similar to the Progressives, which means the Gilded group isn't pulling in any more cohorts. The birthyears have "locked", giving me an opportunity to paraphrase page 82 of Generations:

All things have a beginning, and so must the story of baseball generations.

I start with the cohort-group of 1835 through 1856. I call it the "National Generation." 469 members of this group appeared in at least one major league game. (For the purposes of this study, I consider the National Association, 1871-1875, a "major" league; it certainly was to Harry Wright and his peers). It includes every member of the 1869 Cincinnati Red Stockings (the first openly professional team), nearly every player in the National Association (the first professional league), and most of the players in William Hulbert's National League (1876-1881).

A couple of earlier-born players appeared in the National Association, but they were both forty-somethings who played just one game each. Besides them, the firstborn MLB players were Harry Wright (born 1835) and Dickey Pearce (born 1836). Both played seven years between the NA and NL, and both were key pioneers of the professional game.

After applying the method described in this post through to the end, I arrive at these birthyears for the generations of baseball:

Generation X  1962-1980  3,396,963  3,326,589  10.

Here is the same table again, but with a few changes. I gave the earlier generations more appropriate names. Also, I dropped the first 20 cohorts of the generation formerly known as the Transcendentals, and started it at 1812 (birthyear of Duncan Curry, first Knickerbocker president). Lastly, since this study is about grown men playing organized baseball at the highest level (which is what sets the Knickerbocker Generation apart from earlier generations), I omitted the Homelanders (for now):

Generation X  1962-1980  3,396,963  3,326,589  10.