Hockey Shot Blocking with R and the Stattleship API

The other day, the Yhat blog posted a nice article on shot blocking in the NHL, with emphasis on the differences that may occur between the regular season and the playoffs. The author Ross demonstrated how to collect the data from NHL website by crawling the website using python, and later commented that building out the shot blocking dataset was more difficult than expected.

At Stattleship, we have an API that makes accessing sports data extremely simple. While we currently only include this season's data for the NHL, the API will consistently and reliably return these data for you; no need to worry about changing webpages or building out parsers in your favorite language!

Below we will use our R package stattleshipR to work with the API to replicate the Yhat post. This post will walk through the various code samples in bite-sized pieces, but the full script is included in the Appendix as well.

Requirements

To follow along with the code in this post, you will need an API token. If you do not have one already, you can go here to sign up.

Assuming that you already have R installed, you will need the devtools package to install our R package, as it is not currently yet on CRAN.

install.packages("devtools")  

That's it!

Getting Started

Below we are going to setup our R session to ensure that we are good to go.

## factors are the devil
options(stringsAsFactors=FALSE)

## install a dev branch of the package with devtools
devtools::install_github("stattleship/stattleship-r", ref="helpers")

## load the packages
library(stattleshipR)  
library(dplyr)  
library(plotly)  

Above, we disable the default behavior of R to treat strings as factors. Beyond that, we are installing a development version our package. This is using a sneak peak of the package that will include a number of functions that make it even easier to interface with the API. Lastly, because the Yhat used plotly, we are loading up the equivalent R package.

Now we need to initialize our API token. Below we are using an environment variable on the local machine, but you could just as easily set it via TOKEN <- "your_token_here".

TOKEN <-Sys.getenv("STATTLE_TOKEN")  
set_token(TOKEN)  

From here, the call below will get the game logs for each team in the regular season using the team_game_logs endpoint. For more information on how to use the API and to get a sense of what is available, check out our developers reference.

The call below will automatically page through the results of the API.

logs_reg <- ss_team_logs(team_id="")  

It will take a minute or so to get the results, but yes, it's that easy. To understand the defaults used in the function above, take a look at our documentation using ?ss_team_logs.

The object logs_reg is a data.frame and has 2460 rows and 146 of data!

Below we will get the data from the first two rounds of the playoffs to round out the data that we need.

logs_14 <- ss_team_logs(team_id="", interval_type="conferencequarterfinals")  
logs_semi <- ss_team_logs(team_id="", interval_type="conferencesemifinals")  

Above we needed to specifiy an additional parameter in our call to the API. For more information on how the API uses interval_type, check out the documentation here.

We now have the core data that we need to analyze blocked shots, but like any data analysis project, we need to do some simple cleanup.

## put the data together into one dataframe
gls <- bind_rows(logs_reg, logs_14)  
gls <- bind_rows(gls, logs_semi)

## keep just the columns of interest
cols <- c("team_nickname", "team_division_name", "game_interval_type","player_blocked_shots", "player_hits")  
gls <- ss_keep_cols(gls, cols)  

Last but not least, to mirror the Yhat post, I am going to segment the data by regular season and the playoffs.

## create a regular seasons/playoff flag
gls <- transform(gls, game_type = ifelse(game_interval_type=="regularseason", "regular", "playoffs"))  

A quick look into what we have to ensure it's what we need.

glimpse(gls)  
Observations: 2,604  
Variables: 6  
$ team_nickname        (chr) "Sharks", "Flames", "Senators", "Panthers...
$ team_division_name   (chr) "Pacific", "Pacific", "Atlantic", "Atlant...
$ game_interval_type   (chr) "regularseason", "regularseason", "regula...
$ player_blocked_shots (int) 16, 15, 14, 25, 6, 15, 17, 5, 16, 18, 7, ...
$ player_hits          (int) 8, 14, 30, 30, 35, 26, 9, 16, 21, 25, 22,...
$ game_type            (chr) "regular", "regular", "regular", "regular...

Looks good, but right now the there is one row for every team by interval_type, and in the Yhat post, the data were captured as 1 row per team. Aggregating the data using dplyr is a breeze.

gls_agg <- gls %>%  
  group_by(team_nickname, team_division_name) %>% 
  summarise(games_reg = sum(ifelse(game_type=="regular", 1, 0)),
            hits_reg = sum(ifelse(game_type=="regular", player_hits, 0)),
            blocks_reg = sum(ifelse(game_type=="regular", player_blocked_shots, 0)),
            games_post = sum(ifelse(game_type=="playoffs", 1, 0)),
            hits_post = sum(ifelse(game_type=="playoffs", player_hits, 0)),
            blocks_post = sum(ifelse(game_type=="playoffs", player_blocked_shots, 0))) %>% 
  mutate(bpg_reg = blocks_reg/games_reg,
         hpg_reg = hits_reg/games_reg,
         bpg_post = blocks_post/games_post,
         hpg_post = hits_post/games_post,
         bpg_diff = bpg_post - bpg_reg,
         hpg_diff = hpg_post - hpg_reg)

Above may look complicated if you are new to R, but we are simply grouping the data by team info using group_by, isolating games played, hits, and blocks for both the regular season and postseason with summarise, and adding a few additional metric columns with mutate.

Another quick look look at the data

glimpse(gls_agg)  
Observations: 30  
Variables: 14  
$ team_nickname      (chr) "Avalanche", "Blackhawks", "Blue Jackets", ...
$ team_division_name (chr) "Central", "Central", "Metropolitan", "Cent...
$ games_reg          (dbl) 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, 82,...
$ hits_reg           (dbl) 1848, 1389, 2145, 1937, 2164, 1932, 1533, 1...
$ blocks_reg         (dbl) 1401, 1133, 1260, 1127, 1184, 1144, 1058, 1...
$ games_post         (dbl) 0, 7, 0, 14, 0, 0, 0, 12, 0, 0, 7, 0, 6, 0,...
$ hits_post          (dbl) 0, 206, 0, 493, 0, 0, 0, 432, 0, 0, 223, 0,...
$ blocks_post        (dbl) 0, 109, 0, 224, 0, 0, 0, 194, 0, 0, 104, 0,...
$ bpg_reg            (dbl) 17.08537, 13.81707, 15.36585, 13.74390, 14....
$ hpg_reg            (dbl) 22.53659, 16.93902, 26.15854, 23.62195, 26....
$ bpg_post           (dbl) NA, 15.57143, NA, 16.00000, NaN, NaN, NaN, ...
$ hpg_post           (dbl) NA, 29.42857, NA, 35.21429, NaN, NaN, NaN, ...
$ bpg_diff           (dbl) NA, 1.7543554, NA, 2.2560976, NaN, NaN, NaN...
$ hpg_diff           (dbl) NA, 12.489547, NA, 11.592334, NaN, NaN, NaN...

We can see that some team's have NaN for some of columns. This is because they didn't make the posteason.

Plotting

In the post, the author compares the distributions of blocks per game in the regular season and the playoffs.

plot_ly(x = gls_agg$bpg_reg, opacity = 0.66, type = "histogram", name="Regular Season") %>%  
  add_trace(x = gls_agg$bpg_post, type="histogram", opacity=.55, name="Playoffs") %>%
  layout(barmode="overlay", 
         bargap = .25,
         title="2015-16 NHL Shot Blocks Per Game",
         xaxis = list(title="Blocks Per Game"))

We can see that the data for the 2015-16 season mirror the original post in that the shot blocking appears to increase in the postseason.

Below, the plot isolates the distribution of the difference in blocks-per-game for teams that are currently in the playoffs.

## isolate playoff teams and then a simple distribution
gls_post <- filter(gls_agg, games_post > 0)  
gls_post %>% plot_ly(x = bpg_diff, opacity=.66, type="histogram", name="Delta") %>%  
  layout(bargap = .25, xaxis = list(title="Difference"), title="Difference in Blocks Per Game")

Outside of two teams that blocked few shots, it's clear that the majority of the teams really start to lay it on the line come playoffs.

Just like in the original post, we can use R to run a simple test of normality and test that the blocks-per-game are statistically different in the playoffs relative to the regular season.

## normality test
shapiro.test(gls_post$bpg_diff)  
    Shapiro-Wilk normality test

data:  gls_post$bpg_diff  
W = 0.95895, p-value = 0.6428  
## 1-sample t-test
t.test(gls_post$bpg_diff, mu = 0)  
    One Sample t-test

data:  gls_post$bpg_diff  
t = 3.9824, df = 15, p-value = 0.001202  
alternative hypothesis: true mean is not equal to 0  
95 percent confidence interval:  
 1.213646 4.008847
sample estimates:  
mean of x  
 2.611246 

Because we are using only one season, the sample size is smaller. We don't have to go into the discussions around how you should test for normality and the impact of sample size on these tests (see this post for more detail), but above we fail to reject that null hypothesis that the data are not normal, just like in Yhat's post. Similarly, even with just one year of data, the results suggest that shot blocking increases statistically in the playoffs.

Last but not least, instead of using boxplots to look at the distribution of blocks per game over a set of seasons, below we are isolating the differences in shot blocking in the regular season by division for the 2015-16 season.

gls_agg %>%  
  plot_ly(y = bpg_reg, color = team_division_name, type="box") %>% 
  layout(title = "2015-16 Regular Season Blocks per Game by Division",
         yaxis = list(title="Blocks per Game"))

Appendix

The code as one script

## install devtools if you do not already have it
install.packages("devtools")

## factors are the devil
options(stringsAsFactors=FALSE)

## install a dev branch of the package with devtools
devtools::install_github("stattleship/stattleship-r", ref="helpers")

## load the packages
library(stattleshipR)  
library(dplyr)  
library(plotly)

## set the token
TOKEN <-Sys.getenv("STATTLE_TOKEN")  
set_token(TOKEN)

## get the regular season game logs for every team
logs_reg <- ss_team_logs(team_id="")

## and the rest for the playofs
logs_14 <- ss_team_logs(team_id="", interval_type="conferencequarterfinals")  
logs_semi <- ss_team_logs(team_id="", interval_type="conferencesemifinals")

## put the data together into one dataframe
gls <- bind_rows(logs_reg, logs_14)  
gls <- bind_rows(gls, logs_semi)

## keep just the columns of interest
cols <- c("team_nickname", "team_division_name", "game_interval_type","player_blocked_shots", "player_hits")  
gls <- ss_keep_cols(gls, cols)

## create a regular seasons/playoff flag
gls <- transform(gls, game_type = ifelse(game_interval_type=="regularseason", "regular", "playoffs"))

## A look at the data
glimpse(gls)

## summarize the data into 1 row per team
gls_agg <- gls %>%  
  group_by(team_nickname, team_division_name) %>% 
  summarise(games_reg = sum(ifelse(game_type=="regular", 1, 0)),
            hits_reg = sum(ifelse(game_type=="regular", player_hits, 0)),
            blocks_reg = sum(ifelse(game_type=="regular", player_blocked_shots, 0)),
            games_post = sum(ifelse(game_type=="playoffs", 1, 0)),
            hits_post = sum(ifelse(game_type=="playoffs", player_hits, 0)),
            blocks_post = sum(ifelse(game_type=="playoffs", player_blocked_shots, 0))) %>% 
  mutate(bpg_reg = blocks_reg/games_reg,
         hpg_reg = hits_reg/games_reg,
         bpg_post = blocks_post/games_post,
         hpg_post = hits_post/games_post,
         bpg_diff = bpg_post - bpg_reg,
         hpg_diff = hpg_post - hpg_reg)

## another look
glimpse(gls_agg)

## plot the distributions of blocks per game by regular season and playoffs
plot_ly(x = gls_agg$bpg_reg, opacity = 0.66, type = "histogram", name="Regular Season") %>%  
  add_trace(x = gls_agg$bpg_post, type="histogram", opacity=.55, name="Playoffs") %>%
  layout(barmode="overlay", 
         bargap = .25,
         title="2015-16 NHL Shot Blocks Per Game",
         xaxis = list(title="Blocks Per Game"))

## isolate playoff teams and then a simple distribution
gls_post <- filter(gls_agg, games_post > 0)  
gls_post %>% plot_ly(x = bpg_diff, opacity=.66, type="histogram", name="Delta") %>%  
  layout(bargap = .25, xaxis = list(title="Difference"), title="Difference in Blocks Per Game")

## normality test
shapiro.test(gls_post$bpg_diff)

## 1-sample t-test
t.test(gls_post$bpg_diff, mu = 0)

## Regular season blocks per game by team dvision
gls_agg %>%  
  plot_ly(y = bpg_reg, color = team_division_name, type="box") %>% 
  layout(title = "2015-16 Regular Season Blocks per Game by Division",
         yaxis = list(title="Blocks per Game"))