Sunday, October 23, 2016

Cubs vs Indians

Of the four teams playing in the two league championship series, the Toronto Blue Jays have the most recent world championship - 1993, 23 years ago. They were dispatched by the Indians in five games.

The Los Angeles Dodgers have the next-shortest drought, 28 years. They fell to the Cubs last night in the sixth and final game of the NLCS.

If this trend continues, the Chicago Cubs (108 years) will defeat the Cleveland Indians (68 years) to win the World Series.

Wednesday, October 19, 2016

The Generations of Baseball, part two

Continued from part one.

Bill James' similarity scores use traditional baseball statistics, which include both counting stats (hits, homeruns, strikeouts) and rate stats (batting average, on-base percentage, earned run average).

As Kerry Whisnant explained, using counting stats in similarity scores will mean that "players can be similar only if they are similar hitters and have similar numbers of plate appearances" (emphasis mine). That's not necessarily a bad thing, for Bill's purposes. The length of a player's career could be considered part of his "statistical image", so it might be appropriate to use counting stats for player comparisons. Likewise, the total number of plate appearances in a season, which is normally relatively stable from year to year unless there's an expansion or a work stoppage, could be considered part of a season's "statistical image".

But cohorts vary so widely in size from birthyear to birthyear that raw counting stats lose all meaning. Strauss & Howe were much more interested in a cohort's "peer personality" than in its total numbers. To find the peer personality of baseball cohorts, we need rate-based similarity scores.

But the traditional rate stats are redundant, as Jim Albert explained in an article in By the Numbers:
"...Traditional hitting statistics confound...talents. A batting average confounds three batter talents: the talent not to strikeout, the talent to hit a home run, and the talent to hit an in-play ball for a hit. An on-base percentage confounds the batter’s talent to draw a walk with his talent to get an in-play hit and his talent to hit a home run" (page 24).
In 2001, Voros McCracken rocked the sabermetric community by figuring out how to sort out responsibility between pitchers and fielders. Using a "divide and conquer" approach, he isolated, one at a time, defense-independent statistics (walks, strikeouts, and homeruns) from the defense-dependent ones (everything else)... and realized that pitchers have little-to-no control over the "everything else": balls batted into the field of play.

Jim Albert recreated Voros's approach (on pages 23 and 24), but applied it to batting statistics, breaking them down into four basic skills:
"A player comes to bat for a plate appearance. Either the player walks or doesn’t walk – his chance of walking is estimated by the walk rate (BB+HBP)/PA. (Note that we combine walks and hit by pitches in the formula since each event has the same result of getting the batter to first base without creating an at-bat.)
"Removing walks from the plate appearances, we next record if the batter strikes out or not. We define the strikeout rate as the fraction of strikeouts to the number of at-bats or SO/AB.
"With walks and strikeouts removed, we next record if the batter hits a home run or not. The home run rate is defined to be the fraction of home runs for all plate appearances where contact is made by the bat. That is, HR rate = HR/(AB – SO).
"With the walks, strikeouts, and homeruns removed, we only have plate appearances where the ball is hit in the park. Of these balls put in-play, we record the fraction that fall in for hits – we call this the in-play hit rate or 'hit rate' (H-HR)/(AB-SO-HR)....
"Since these rates are defined by sequentially removing walks, strikeouts, and home runs from the plate appearances, they measure distinct qualities of a hitter. Specifically, these rates measure (1) the talent to draw a walk, (2) the talent to avoid a strikeout, (3) the talent to hit a ball out of the park (a home run), and (4) the talent to hit a ball 'where they ain’t'."
And unlike traditional statistics, which have different rate categories for hitting (average, OBP) and pitching (ERA, WHIP), these four rates can be used for hitters AND pitchers. Pitchers are trying to accomplish the exact opposite of what hitters are trying to accomplish. Hitters are trying to draw walks, make contact, hit homeruns, and convert batted balls into hits. Pitchers are trying to throw strikes, miss bats, prevent homeruns, and (with the help of their defense) convert batted balls into outs. So it makes sense then to use the same statistics to measure hitters and pitchers.

Dave Studeman also wrote about the four rates:
"...The beauty of these ratios is that they build on each other. The impact of each event is removed from the denominator of the next event. A batter can’t hit a home run unless he actually hits a ball, so walks and strikeouts aren’t considered in the home run rate.
"For pitchers, you often see strikeout and walk rates, but as a proportion of total plate appearances. Here, they are treated sequentially. The strikeout ratio implies that a pitcher can’t strike out a batter if he walks him first, so the formula takes walks out of consideration when calculating strikeout rates."
Tom Tango expanded on Voros's "four horsemen", breaking the in-play hit component down into two additional components: extra-base hits (doubles and triples) per in-play hit, and triples per extra-base hit. He also added stolen bases as a rate of opportunities (singles and walks). "Each rate describes something specific," he wrote, and he used the seven rates to study the aging patterns of different skills in batters.

So, the idea then is that there are four basic outcomes of every batter/pitcher match-up: a walk (or HBP), a strikeout, a homerun, or a ball in play. A ball in play can result in a hit, a hit can be an extra-base hit, and an extra-base hit can be a triple. Also, a batter who has reached base (usually via single or walk) can steal a base.

But there's another, even-more-obvious framework for baseball events: GAMES. Every batter-pitcher match-up happens within the context of a game. In 1876, batters and pitchers had the same roles, and thus the same usage: starters played almost every game, and almost every inning, and reserves were basically emergency back-ups. Aside from the practice of platooning, a batter's role hasn't changed much in the last 140 years, whereas the pitcher's role has evolved through the generations and fragmented into a variety of roles - from a complete game starter who pitches once every five days to a one-out reliever who can pitch several days in a row, and everything in between.

Since a pitcher's role is part of his personality/statistical image, I am adding one more rate to the seven Voros/Tango components: batters faced per game. And since a batter's role is NOT really a defining part of his personality (and since it would render the rate meaningless if I confounded hitters' plate appearances with pitchers' batters faced) it is a pitchers-only rate.

So here are the eight rates - the eight attributes that define the "personality" of a baseball player or cohort or cohort-group - and the formulas I use to calculate them, grouped by what statistics they draw from (batting, pitching, or both):

Pitchers Only:
BF / G

Batters and Pitchers (the "Three True Outcomes"):
$BB = (BB + HBP) / PA
$SO = SO / (PA - BB - HBP)
$HR = HR / (PA - BB - SO - HBP)

Batters Only:
$H = (H - HR) / (PA - HR - BB - SO - HBP)
$XBH = (2B + 3B) / (H - HR)
$3B = 3B / (2B + 3B)
$SB = SB / (H - 2B - 3B - HR + BB + HBP)

(Unlike Albert, I use (PA - BB - HBP) instead of at-bats, because at-bats weren't recorded for pitchers until 1930, and because it keeps sac flies and sac hits included in the denominator.)
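As a rough illustration, the formulas above could be computed along these lines in Python. This is only a sketch: the `BattingLine` class, its field names, and the sample totals are made up for the example, and the three true outcomes are computed from batting totals alone rather than pooled with pitching totals.

```python
from dataclasses import dataclass


@dataclass
class BattingLine:
    """Raw batting totals for a player or cohort (hypothetical field names)."""
    pa: int
    bb: int
    hbp: int
    so: int
    h: int
    doubles: int
    triples: int
    hr: int
    sb: int


def batting_rates(t: BattingLine) -> dict:
    """Apply the sequential rate formulas to one set of batting totals.

    No zero-denominator guards: a line with no plate appearances,
    no in-play hits, etc. will raise ZeroDivisionError.
    """
    free_passes = t.bb + t.hbp
    no_walk = t.pa - free_passes       # PA - BB - HBP
    contact = no_walk - t.so           # PA - BB - SO - HBP
    in_play = contact - t.hr           # PA - HR - BB - SO - HBP
    in_play_hits = t.h - t.hr          # H - HR
    xbh = t.doubles + t.triples        # 2B + 3B
    singles = in_play_hits - xbh       # H - 2B - 3B - HR
    return {
        "$BB": free_passes / t.pa,
        "$SO": t.so / no_walk,
        "$HR": t.hr / contact,
        "$H": in_play_hits / in_play,
        "$XBH": xbh / in_play_hits,
        "$3B": t.triples / xbh,
        "$SB": t.sb / (singles + free_passes),
    }


def batters_faced_per_game(bf: int, g: int) -> float:
    """The pitchers-only rate: BF / G."""
    return bf / g


# Made-up season line, for illustration only.
example = BattingLine(pa=600, bb=60, hbp=5, so=100, h=150,
                      doubles=30, triples=5, hr=20, sb=10)
print(batting_rates(example))
```

Each denominator is built by subtracting the previous step's events from the one before it, which mirrors the "sequentially removing" idea in the Albert and Studeman quotes.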

Since the "three true outcomes" ($BB, $SO, and $HR) combine batting AND pitching statistics, the formula for $BB (as an example) is actually

$BB = (BBb + HBPb + BBp + HBPp) / (PA + BF)

where BBb is batters' bases on balls, BBp is pitchers' bases on balls, etc.

This means that when 1991-born Mike Trout faces 1991-born Trevor Bauer, it counts as TWO plate appearances for the 1991 cohort. And if Trout strikes out, it counts as two strikeouts, and if he homers, it counts as two homeruns. (Actually, a homerun would go down as an 0-for-2 in the $BB component, an 0-for-2 in the $SO component, and a 2-for-2 in the $HR component. But if Trout instead hits a single, it would be an 0-for-2 in $HR and a 1-for-1 in $H, since the $H rate doesn't draw from pitcher statistics.)
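The double counting is easier to see in a small Python sketch. It assumes a cohort's batting and pitching totals are kept in two `Counter` objects; the key names and the `pooled_rates` helper are hypothetical, and a rate with no opportunities comes back as `None`.

```python
from collections import Counter


def _rate(successes, opportunities):
    """Return successes / opportunities, or None when there are no opportunities."""
    return successes / opportunities if opportunities else None


def pooled_rates(batting: Counter, pitching: Counter) -> dict:
    """Pool a cohort's batting and pitching totals for the three true outcomes."""
    pa = batting["PA"] + pitching["BF"]
    free_passes = (batting["BB"] + batting["HBP"]
                   + pitching["BB"] + pitching["HBP"])
    so = batting["SO"] + pitching["SO"]
    hr = batting["HR"] + pitching["HR"]
    return {
        "$BB": _rate(free_passes, pa),
        "$SO": _rate(so, pa - free_passes),
        "$HR": _rate(hr, pa - free_passes - so),
    }


# One plate appearance: a batter homers off a pitcher from the same cohort.
# The event is recorded once on the batting side and once on the pitching side.
cohort_batting = Counter(PA=1, HR=1)
cohort_pitching = Counter(BF=1, HR=1)
print(pooled_rates(cohort_batting, cohort_pitching))
# {'$BB': 0.0, '$SO': 0.0, '$HR': 1.0}
```

That output is the 0-for-2, 0-for-2, 2-for-2 line described above. If the same plate appearance had ended in a strikeout, `$HR` would have no opportunities left and would come back as `None`.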

Here is a "binary tree" diagram of the eight attribute rates. The "headers" are the denominators, or opportunities:

[Diagram only partly recoverable. Column headers: per G, per PA, per H, per XBH, per 1B or BB. Surviving nodes: $SO, with $HR below it.]

Concluded in part three.

Sunday, October 16, 2016

The Generations of Baseball, part one

"A generation, like an individual, merges many different qualities, no one of which is definitive standing alone. But once all the evidence is assembled, we can build a persuasive case for identifying (by birthyear) eighteen generations over the course of American history. All Americans born over the past four centuries have belonged to one or another of these generations."
That is the claim made by authors William Strauss and Neil Howe on page 68 of Generations. The book, published in 1991, "retells the history of America as a series of generational biographies going back to 1584."

Below is a table of the last ten of those generations - the generations that played BASEBALL - along with their total MLB-playing population (through 2016) and best (or best-known) player:

Generation      Birth years  MLB players  Best (or best-known) player
Transcendental  1792-1821    0            Alexander Cartwright
Gilded          1822-1842    21           Harry Wright
Progressive     1843-1859    688          Cap Anson
Missionary      1860-1882    2,129        Cy Young
Lost            1883-1900    2,573        Babe Ruth
G.I.            1901-1924    2,693        Ted Williams
Silent          1925-1942    1,818        Willie Mays
Boom            1943-1960    2,602        Rickey Henderson
Generation X    1961-1981    3,926        Barry Bonds
Millennial      1982-2004    2,332        Mike Trout

But these are the generations of AMERICAN history - social generations. Their members encountered "the same national events, moods, and trends at similar ages" (page 48) and felt "the ebb and flow of history from basically the same age or phase-of-life perspective" (page 64).

Baseball has its own national events, moods, and trends; its own ebbs and flows. Does it have its own generations, then, too? And can we find them by applying Strauss & Howe's methodology to sabermetrics?
"...We have to start with the building-block of generations: the ‘cohort.’ Derived from the Latin word for an ordered rank of soldiers, ‘cohort’ is used by modern social scientists to refer to any set of persons born in the same year; ‘cohort-group’ means any wider set of persons born in a limited set of consecutive years" (page 44).
"A GENERATION is a cohort-group whose length approximates the span of a phase of life and whose boundaries are fixed by peer personality" (page 60). There are four "phases of life" (on page 56), each 22 years long: youth (ages 0 to 21), rising adulthood (22 to 43), midlife (44 to 65), and elderhood (66 to 87).
"But how do we actually identify [a generation]? For that, we have to focus our attention on ‘peer personality’ - the element in our definition that distinguishes a generation as a cohesive cohort-group with its own unique biography. The peer personality of a generation is essentially a caricature of its prototypical member. It is, in its sum of attributes, a distinctly personlike creation" (page 63).
"A PEER PERSONALITY is a generational persona recognized and determined by (1) common age location; (2) common beliefs and behavior; and (3) perceived membership in a common generation" (page 64). But while Strauss & Howe (on page 67) "can apply no reductive rules for comparing the beliefs and behavior of one cohort-group with those of its neighbors," we can apply similarity scores.

Bill James devised similarity scores in the 1980s as a means to identify groups of comparable players based on their career statistics. "To compare one player to another, start at 1000 points and then subtract points based on the statistical differences of each player."
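The "start at 1000 and subtract" idea can be sketched in a few lines of Python. The function name, the stat keys, the career totals, and the size of a one-point difference for each stat are all placeholders chosen for illustration; they should not be read as Bill James' actual categories or weights.

```python
def similarity_score(a, b, units_per_point, start=1000.0):
    """Start at `start` and subtract points for each statistical difference.

    `units_per_point` maps a stat name to the size of the difference
    that costs one point (illustrative values, not James' own).
    """
    score = start
    for stat, unit in units_per_point.items():
        score -= abs(a[stat] - b[stat]) / unit
    return score


# Hypothetical career totals for two players.
player_a = {"H": 2000, "HR": 300}
player_b = {"H": 1850, "HR": 280}

# Assumed weights: one point per 15 hits and one point per 2 home runs.
weights = {"H": 15, "HR": 2}

print(similarity_score(player_a, player_b, weights))  # 980.0
```

Two identical stat lines score 1000, and the score falls as the lines diverge. Counting stats like these make career length part of the comparison.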

In July 2015, Bill used similarity scores to compare seasons (like MLB 2009 and MLB 2010) and organize them into eras and epochs (subscription required to read either article).
"My thinking was that if I did these studies in an organized, systematic way, 'fault lines' would appear. Natural groups of seasons would be apparent in the data, I thought; we could call the smaller groups, 5 to 10 years, 'epochs' and the larger groups 'eras'....And, in fact, this approach does, in most cases, make it very clear where the lines should be drawn."
For Bill, a smaller group of seasons is an epoch and a larger group is an era. For Strauss & Howe, a smaller group of cohorts is a cohort-group and a larger group is a generation.

Bill "measured the Similarity of the similarity of their statistical image" to identify "natural groups of seasons" and find the "fault lines" separating them.

Strauss & Howe "focus...on 'peer personality'" to identify "a cohesive cohort-group" "and find the boundaries separating it from its neighbors" (page 64).

For a baseball player, then, his "personality" IS his statistical image.

So, ignoring the third part of peer personality ("perceived membership"), Strauss & Howe on page 64 offer us two possible paths to identifying baseball generations:

We can "look first at [a cohort-group's] chronology: its common age location, where its lifecycle is positioned against the background chronology of historic trends and events," meaning we can identify eras first, then place cohorts in generations based on which era they primarily played in.

Or, we can "look at [a cohort-group's] attributes: objective measures of its common beliefs and behavior," meaning we can apply Bill James' method to cohorts directly; that is, use similarity scores to find "natural groups" of cohorts - generations. I'm going with this approach.

Continued in part two.

Saturday, October 15, 2016

538 Podcast on the History of the Shift

"Well, I'm Ted Williams, pilgrim, and I don't change my swing for nobody!"
Okay, that's my John-Wayne-esque paraphrasing of Ted Williams, but this is a fantastic podcast about Williams, Lou Boudreau, Joe Maddon, and the rise and fall and rise again of the defensive shift in baseball. Enjoy!

Sunday, October 9, 2016

Mike Trout - The Greatest of Millennials
"Mickey Mantle is alive. He’s right out there in centerfield, every day, running down balls in the alleys, taking over games with his offense, defense, and speed. And this is prime Mantle, too—not later-years, shot-kneed, wee-hours-at-the-Copacabana Mantle. He debuted at 19, just turned 25, and he’s been the best player in baseball every year since he—"
Great article from Mike Schur on Slate on the underappreciated greatness of Mike Trout.

I liked the comparison between Trout and Mantle. Mantle played in New York, partied, and was seldom healthy and sober (although the author overstated his case a little - Mantle had nine seasons of at least 620 PA). Trout plays in southern California, plays every day and with a smile on his face, doesn't do or say anything newsworthy off the field, and his biggest passion besides baseball

Schur makes this compelling argument:
"Here is what you should know: Mike Trout is the best player in baseball. There’s an argument to be made that he’s had the greatest season ever, for a player his age, in every year he has played."
I've already established (if you agree that WAR is at least in the ballpark at correctly estimating player value) that Mike Trout is the greatest player ever through the age of 24. But that doesn't mean he was the best ever at his age for each individual season. Or was he?

Age 20:

Rk Player WAR/pos Year HR RBI SB BA OBP SLG
1 Mike Trout 10.8 2012 30 83 49 .326 .399 .564
2 Alex Rodriguez 9.4 1996 36 123 15 .358 .414 .631
3 Al Kaline 8.2 1955 27 102 6 .340 .421 .546
4 Mel Ott 7.4 1929 42 151 6 .328 .449 .635
5 Ty Cobb 6.8 1907 5 119 53 .350 .380 .468
6 Manny Machado 6.7 2013 14 71 6 .283 .314 .432
7 Ted Williams 6.7 1939 31 145 2 .327 .436 .609
8 Vada Pinson 6.5 1959 20 84 21 .316 .371 .509
9 Frank Robinson 6.5 1956 38 83 8 .290 .379 .558
10 Mickey Mantle 6.5 1952 23 87 4 .311 .394 .530
Provided by View Play Index Tool Used
Generated 10/9/2016.

Mike Trout DID have the greatest season ever for a 20-year-old, besting A-Rod's '96 season by a fairly wide margin.

Age 21:

Rk Player WAR/pos Year HR RBI SB BA OBP SLG
1 Rogers Hornsby 9.9 1917 8 66 17 .327 .385 .484
2 Mike Trout 9.3 2013 27 97 33 .323 .432 .557
3 Rickey Henderson 8.8 1980 9 53 100 .303 .420 .399
4 Eddie Mathews 8.3 1953 47 135 1 .302 .406 .627
5 Cesar Cedeno 8.0 1972 22 82 55 .320 .385 .537
6 Jimmie Foxx 7.9 1929 33 118 9 .354 .463 .625
7 Andruw Jones 7.4 1998 31 90 27 .271 .321 .515
8 Ken Griffey 7.1 1991 22 100 18 .327 .399 .527
9 Arky Vaughan 7.0 1933 9 97 3 .314 .388 .478
10 Frank Robinson 6.9 1957 29 75 10 .322 .376 .529

Trout is 2nd to Hornsby. But already, Trout and Frank Robinson are the only players in the top 10 of both lists - greatest 20-year-olds AND greatest 21-year-olds. Trout is 1st and 2nd; Robinson is 9th and 10th.

Age 22:

Rk Player WAR/pos Year HR RBI SB BA OBP SLG
1 Ted Williams 10.6 1941 37 120 2 .406 .553 .735
2 Bryce Harper 9.9 2015 42 99 6 .330 .460 .649
3 Ty Cobb 9.8 1909 9 107 76 .377 .431 .517
4 Eddie Collins 9.7 1909 3 56 63 .347 .416 .450
5 Stan Musial 9.4 1943 13 81 9 .357 .425 .562
6 Dick Allen 8.8 1964 29 91 3 .318 .382 .557
7 Alex Rodriguez 8.5 1998 42 124 46 .310 .360 .560
8 Cal Ripken 8.2 1983 27 102 0 .318 .371 .517
9 Joe DiMaggio 8.2 1937 46 167 3 .346 .412 .673
10 Mike Trout 7.9 2014 36 111 16 .287 .377 .561

This was probably Trout's worst season, and also the only year so far he's taken home MVP honors. Harper's 2015 season is ahead of a Ty Cobb Triple Crown and A-Rod's 40-40, and second only to the year Teddy Ballgame hit .406.

Age 23:

Rk Player WAR/pos Year HR RBI SB BA OBP SLG
1 Willie Mays 10.6 1954 41 110 8 .345 .411 .667
2 Ted Williams 10.6 1942 36 137 3 .356 .499 .648
3 Ty Cobb 10.5 1910 8 91 65 .383 .456 .551
4 Eddie Collins 10.5 1910 3 81 81 .324 .382 .418
5 Cal Ripken 10.0 1984 27 86 2 .304 .374 .510
6 Mookie Betts 9.6 2016 31 113 26 .318 .363 .534
7 Mickey Mantle 9.5 1955 37 99 8 .306 .431 .611
8 Mike Trout 9.4 2015 41 90 11 .299 .402 .590
9 Reggie Jackson 9.2 1969 47 118 13 .275 .410 .608
10 Arky Vaughan 9.2 1935 19 99 4 .385 .491 .607

23-year-old Mookie Betts (the 2016 AL MVP-favorite) is slightly ahead of 23-year-old Mike Trout (but NOT ahead of 2016 Mike Trout).

Age 24:

Rk Player WAR/pos Year HR RBI SB BA OBP SLG
1 Lou Gehrig 11.8 1927 47 173 10 .373 .474 .765
2 Mickey Mantle 11.2 1956 52 130 10 .353 .464 .705
3 Ty Cobb 10.7 1911 8 127 83 .420 .467 .621
4 Mike Trout 10.6 2016 29 100 30 .315 .441 .550
5 Jimmie Foxx 10.5 1932 58 169 3 .364 .469 .749
6 Alex Rodriguez 10.4 2000 41 132 15 .316 .420 .606
7 Tris Speaker 10.1 1912 10 90 52 .383 .464 .567
8 Mike Schmidt 9.7 1974 36 116 23 .282 .395 .546
9 Rogers Hornsby 9.6 1920 9 94 12 .370 .431 .559
10 Shoeless Joe Jackson 9.6 1912 3 90 35 .395 .458 .579

Trout is 4th, behind the career years of Lou Gehrig, Mickey Mantle, and Ty Cobb.

So there you have it. Trout didn't literally have "the greatest season ever, for a player his age, in every year he has played." But he was the best at age 20, the second-best at age 21 (to a player in the Dead-ball Era), in the top five at age 24, and in the top 10 at ages 22 and 23.

Ty Cobb, whom Trout surpassed as the "greatest young player of all time," finished in the top five in four of the years, but missed the top 10 entirely at age 21 and doesn't rank higher than third at any age.

Mantle was only in the top 10 at three of the ages - 20, 23 and 24. Mantle didn't really become "great" until he was 23, and then he was the best player in the American League for about ten years.
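For anyone who wants to check those tallies, here is a small Python sketch that looks up each player's rank in the five top-10 lists above. The `TOP10` dictionary simply copies the player names in table order (so ties in WAR keep the order shown in the tables), and the `ranks` helper is a hypothetical name.

```python
# Player names copied from the age 20 through age 24 tables, in rank order.
TOP10 = {
    20: ["Mike Trout", "Alex Rodriguez", "Al Kaline", "Mel Ott", "Ty Cobb",
         "Manny Machado", "Ted Williams", "Vada Pinson", "Frank Robinson",
         "Mickey Mantle"],
    21: ["Rogers Hornsby", "Mike Trout", "Rickey Henderson", "Eddie Mathews",
         "Cesar Cedeno", "Jimmie Foxx", "Andruw Jones", "Ken Griffey",
         "Arky Vaughan", "Frank Robinson"],
    22: ["Ted Williams", "Bryce Harper", "Ty Cobb", "Eddie Collins",
         "Stan Musial", "Dick Allen", "Alex Rodriguez", "Cal Ripken",
         "Joe DiMaggio", "Mike Trout"],
    23: ["Willie Mays", "Ted Williams", "Ty Cobb", "Eddie Collins",
         "Cal Ripken", "Mookie Betts", "Mickey Mantle", "Mike Trout",
         "Reggie Jackson", "Arky Vaughan"],
    24: ["Lou Gehrig", "Mickey Mantle", "Ty Cobb", "Mike Trout",
         "Jimmie Foxx", "Alex Rodriguez", "Tris Speaker", "Mike Schmidt",
         "Rogers Hornsby", "Shoeless Joe Jackson"],
}


def ranks(player):
    """Map age -> rank (1-based) for every list the player appears on."""
    return {age: names.index(player) + 1
            for age, names in TOP10.items() if player in names}


for name in ("Mike Trout", "Ty Cobb", "Mickey Mantle"):
    print(name, ranks(name))

# Players who appear on all five lists.
everyone = {p for names in TOP10.values() for p in names}
print(sorted(p for p in everyone if len(ranks(p)) == len(TOP10)))
```

Run against these tables, the last line lists only Mike Trout.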

Mike Trout has been the best player in baseball since he was a 20-year-old rookie.