Can We Predict MLB Draft Bonuses?



            In this series of articles, we’ll be trying to predict a player’s signing bonus based on their college stats. For now, we’re going to limit this to the top 10 rounds of the draft. We choose to predict bonus rather than draft position because, given how the MLB draft is set up, bonus is sometimes more telling of how a team views a player than draft slot. For now, we’ll also be limiting ourselves to college players and to the years 2014-2019, sidestepping the potential difficulties of dealing with 2020.

            For pitchers we will be using GS% (games started percentage), K%, BB%, HR%, Height, Grade, and league rating. For batters we’ll use a college wRC+ based on MLB weights, HR%, BB%, SB%, Grade, and league rating. League rating comes from something I found on Chris Long’s GitHub from 2017. For Grade, we’ve simplified all class years into a single “Senior” or “Not Senior” category.

            Conference adjustments are something I have experimented with in the past. One approach is to treat the conference as a factor; another is to include the conference level (a numeric weight) as an input itself, which is what I did here. I’ve also tried multiplying the stat by the conference rate. One could also use strength-of-schedule adjustments, which would better account for year-to-year variation. Future analysis could look into how conferences vary in specific categories. For example, maybe walk rate doesn’t need to be conference or strength-of-schedule adjusted, but something like strikeout rate does.
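The two approaches can be sketched quickly; the stat and league-rating values below are made-up placeholders, not the actual data used here.

```python
# Two ways to use a conference's league rating, with made-up numbers.

def features_with_rating(k_pct: float, league_rating: float) -> dict:
    """Approach 1: keep the stat and the conference strength as separate inputs."""
    return {"k_pct": k_pct, "league_rating": league_rating}

def rating_scaled_stat(k_pct: float, league_rating: float) -> float:
    """Approach 2: fold conference strength into the stat itself."""
    return k_pct * league_rating

# A 25% strikeout rate in a strong conference (rated 1.10) vs a weak one (0.90):
print(features_with_rating(0.25, 1.10))
print(rating_scaled_stat(0.25, 1.10))
print(rating_scaled_stat(0.25, 0.90))
```

The second approach bakes the adjustment into the feature itself, while the first lets the model learn how much conference strength matters on its own.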

            For this paper we’ll be using a Bayesian regression model. Doing so allows us to look at the probability that a model parameter (the coefficient on an input such as BB%) lies in a given interval, meaning we can get a good idea of how each of our inputs impacts bonus. For any given player we can also look at a distribution of possible bonuses, giving us a much better sense of the wide range of outcomes that can occur in the draft.
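The actual modeling here was done with rstanarm. As a self-contained illustration of what a Bayesian regression gives you, here is a toy random-walk Metropolis sampler for a one-input regression on fabricated data (not the bonus model; rstanarm's real sampler, NUTS, is far more efficient):

```python
import math
import random

random.seed(7)

# Fabricated data: an outcome driven by a single standardized stat.
xs = [i / 10 for i in range(-20, 21)]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.5) for x in xs]

def log_post(slope: float, intercept: float) -> float:
    """Log posterior: Gaussian likelihood (sigma fixed at 0.5), flat priors."""
    return -sum((y - (slope * x + intercept)) ** 2
                for x, y in zip(xs, ys)) / (2 * 0.5 ** 2)

# Random-walk Metropolis: propose a small move, accept with the usual ratio.
slope, intercept, draws = 0.0, 0.0, []
for step in range(5000):
    ps = slope + random.gauss(0, 0.05)
    pi = intercept + random.gauss(0, 0.05)
    diff = log_post(ps, pi) - log_post(slope, intercept)
    if diff >= 0 or random.random() < math.exp(diff):
        slope, intercept = ps, pi
    if step >= 1000:                       # discard warmup draws
        draws.append(slope)

draws.sort()
median_slope = draws[len(draws) // 2]
# The posterior draws let us quote intervals, not just a point estimate.
print(f"posterior median slope: {median_slope:.2f}, "
      f"95% interval: ({draws[100]:.2f}, {draws[3899]:.2f})")
```

The point is that the output is a set of posterior draws rather than a single coefficient, which is what makes the interval and distribution summaries later in this article possible.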

            One assessment of our model is making sure our chains converge. I’m still new to the concept, but in Bayesian regression we sample from the posterior distribution (explained below) in separate runs known as chains. We can measure how stable our estimates are using the Rhat statistic, which compares the variance between chains to the variance within chains. If the chains have all converged, the two should be roughly equal, so we look for values near one and below 1.1. In the images below we see that they are in fact all near one.
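As a rough sketch of what Rhat measures, here is the basic between/within-chain version (rstanarm's actual split-Rhat computation is more involved):

```python
# Basic Rhat: compare between-chain variance (B) to within-chain variance (W).
# If the chains have converged to the same distribution, B is small relative
# to W and Rhat is close to 1.

def rhat(chains: list[list[float]]) -> float:
    m = len(chains)          # number of chains
    n = len(chains[0])       # draws per chain
    chain_means = [sum(c) / n for c in chains]
    grand_mean = sum(chain_means) / m
    # Between-chain variance, scaled from the variance of the per-chain means.
    b = n / (m - 1) * sum((cm - grand_mean) ** 2 for cm in chain_means)
    # Average within-chain sample variance.
    w = sum(sum((x - cm) ** 2 for x in c) / (n - 1)
            for c, cm in zip(chains, chain_means)) / m
    var_hat = (n - 1) / n * w + b / n   # pooled posterior variance estimate
    return (var_hat / w) ** 0.5

# Two chains exploring the same region -> Rhat near 1.
print(rhat([[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]))
# Two chains stuck in different regions -> Rhat far above 1.1.
print(rhat([[0.0, 0.1, 0.2, 0.3], [10.0, 10.1, 10.2, 10.3]]))
```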

            Upon meeting that criterion, we can also look at the effective sample size (n_eff), which tells us the number of effectively independent draws from the posterior distribution. Originally I didn’t have a large enough n_eff. In rstanarm, four chains with 2,000 iterations each are the default, but I increased the number of iterations to 2,500 to get more samples. Here we are looking for an n_eff of roughly 5,000 (double the number of iterations), which we have for pitchers but still fall short of for batters.
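The idea behind n_eff can be sketched the same way: autocorrelated draws carry less information than independent ones, so we discount the nominal sample size by the chain's autocorrelation (again, a simplification of what rstanarm actually computes):

```python
def effective_sample_size(chain: list[float], max_lag: int = 100) -> float:
    """Rough n_eff: N / (1 + 2 * sum of positive autocorrelations)."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    acf_sum = 0.0
    for lag in range(1, min(max_lag, n - 1)):
        acov = sum((chain[i] - mean) * (chain[i + lag] - mean)
                   for i in range(n - lag)) / n
        rho = acov / var
        if rho <= 0:          # truncate once autocorrelation dies out
            break
        acf_sum += rho
    return n / (1 + 2 * acf_sum)

# A chain that lingers on each value is highly autocorrelated, so its
# effective sample size is far below its nominal length of 200.
sticky = [float(i // 10) for i in range(200)]   # 0,0,...,0,1,1,...
print(effective_sample_size(sticky))
```

This is why running more iterations helps: more raw draws means more effectively independent ones, even when the chains mix slowly.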



[Figure: Rhat and n_eff diagnostics for the pitcher and batter models]


            It’s also extremely important that our model fits the data; that’s the whole point of this exercise. We can check this by simulating data according to our fitted model and comparing the simulations to the observed data. To quote from one paper I read on this topic, “To generate these simulations, we need to sample from the posterior predictive distribution, which is the distribution of the outcome variable implied by the posterior distribution of the model parameters. Each time the MCMC draws from the posterior distribution, we generate a new dataset according to the data generating process used in our model.”[1] In other words, we can compare our actual data to datasets simulated from the model’s parameters. This is done by plotting the density of our original data along with densities estimated from our posterior simulations.
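A stripped-down version of that posterior predictive check, assuming a simple normal model with fabricated posterior draws (not the fitted bonus model):

```python
import random

random.seed(42)

# Pretend posterior draws for a one-parameter model: each draw is a
# (mu, sigma) pair sampled by the MCMC. These values are made up.
posterior_draws = [(1.0 + random.gauss(0, 0.05),
                    0.5 + abs(random.gauss(0, 0.02)))
                   for _ in range(200)]

observed = [random.gauss(1.0, 0.5) for _ in range(100)]  # stand-in data

# For each posterior draw, simulate a replicated dataset y_rep and record
# a summary statistic; comparing these to the observed statistic is the
# posterior predictive check.
replicated_means = []
for mu, sigma in posterior_draws:
    y_rep = [random.gauss(mu, sigma) for _ in range(len(observed))]
    replicated_means.append(sum(y_rep) / len(y_rep))

obs_mean = sum(observed) / len(observed)
# Fraction of replicated datasets whose mean exceeds the observed mean;
# values far from 0.5 would flag a misfit for this statistic.
p_value = sum(m > obs_mean for m in replicated_means) / len(replicated_means)
print(round(obs_mean, 3), round(p_value, 3))
```

The density overlays in the figures are the same idea applied to the whole distribution rather than a single summary statistic.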

[Figure: density of observed bonuses overlaid with posterior predictive simulations]

These graphs look at the distributions of our predictions in a different way, with the x-axis representing the bonus in millions of dollars. For pitchers, we appear to be overpredicting the likelihood of a low-bonus player, while we are right on the mark for batters. Our model also appears to consistently slightly underpredict the likelihood of a $500,000 to $1 million player, and it may slightly overpredict above $1 million. It is hard to say exactly why this is the case, but in our next article we’ll get into some of the flaws behind this approach.

 
[Figure: distributions of actual (top) and median predicted (bottom) signing bonuses]

In these graphs I show the distributions of signing bonuses for batters and pitchers, with actual bonuses on top and median predicted bonuses for all players on the bottom. For pitchers I cut off the top 13 players (over $3 million) and for batters the top 18 bonuses (over $3.5 million) in order to better visualize these predictions. We’ve also already established that the model is missing a bit on very high-end players. With these graphs, we can see that past $1.5 to $2 million, predictions are virtually nonexistent. The graphs also show we may be giving a bit too much credit to lower-end players, possibly because we don’t have many data points with very low bonuses in the middle rounds.

 

While the model has areas for improvement, the ability to see the distribution of possible outcomes in something with as much randomness as the draft is very helpful. If desired, one could also zero in on a player and make a distribution graph for just their predictions. Another big benefit of the Bayesian modeling approach is that we can look at the probability that a parameter (a coefficient in our model) lies within a given range. This gives us a better sense of how certain stats impact bonus than a traditional frequentist approach would.

[Figure: posterior means, standard deviations, and 2.5%/97.5% quantiles for the model parameters]

Here we are looking at the means and standard deviations of our parameter estimates. We take the 2.5% and 97.5% columns and say that we are 95% confident the value lies within this range. Looking at hitters, we see that most of our inputs don’t actually have much impact, other than being a senior (negative) and a better league rating (positive). We also see a slightly positive impact from HR%. For pitchers, the parameters have more pronounced impacts, but with wider ranges. Starting more games, striking batters out, and playing in a good conference are all impactful, and there is a slightly positive effect from a player being tall. A higher BB%, meanwhile, negatively impacts bonus.
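Those interval summaries come directly from the posterior draws of each coefficient; a sketch with made-up draws:

```python
def quantile(sorted_xs: list[float], q: float) -> float:
    """Linear-interpolated quantile of pre-sorted data."""
    pos = q * (len(sorted_xs) - 1)
    lo, hi = int(pos), min(int(pos) + 1, len(sorted_xs) - 1)
    frac = pos - lo
    return sorted_xs[lo] * (1 - frac) + sorted_xs[hi] * frac

# Hypothetical posterior draws for one coefficient (e.g., a BB% effect).
draws = sorted(0.01 * (i - 30) for i in range(101))   # -0.30 ... 0.70

lower, upper = quantile(draws, 0.025), quantile(draws, 0.975)
prob_positive = sum(d > 0 for d in draws) / len(draws)

print(f"95% interval: ({lower:.3f}, {upper:.3f}), "
      f"P(coef > 0) = {prob_positive:.2f}")
```

The last number is the kind of statement a frequentist interval can’t make directly: the probability, given the model and data, that the coefficient is positive.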

[Figure: coefficient plot with 95% intervals]

            The graphs above show the same coefficients and 95% intervals discussed in the table; they are just another way of visualizing how each model input impacts bonus.

            In our next article we’ll look at some of our model’s predictions for specific groups of players and some of our best and worst predictions.
