How we calculate Expected Goals (xG)
By Fantasy Football Fix
Modelling Expected Goals
A more sophisticated approach is to bucket the data according to shot type (e.g. inside versus outside the box, headed versus non-headed) and assign a different probability to each class. Indeed, this is how some xG models work, and they perform better than the naive all-shots-equal model. Others perform regression on various shot properties, which can lead to quite a complex set of equations. However, with large datasets and many variables describing each shot, xG modelling is an ideal application for machine learning/artificial intelligence (AI).
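To make the bucketing idea concrete, here is a minimal sketch in Python. The two bucket keys (in the box, headed) come from the example above; the function names and the toy shot data are made up purely for illustration and are not taken from any real xG model.

```python
from collections import defaultdict

def fit_bucket_model(shots):
    """Estimate one goal probability per (in_box, headed) bucket.

    Each shot is a tuple (in_box, headed, goal) with goal equal to 0 or 1.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket -> [goals, shots]
    for in_box, headed, goal in shots:
        counts[(in_box, headed)][0] += goal
        counts[(in_box, headed)][1] += 1
    return {bucket: goals / total for bucket, (goals, total) in counts.items()}

def bucket_xg_for(model, in_box, headed):
    """Look up the xG of a new shot from its bucket."""
    return model[(in_box, headed)]

# Made-up toy data, for illustration only.
toy_shots = (
    [(True, False, 1)] * 2 + [(True, False, 0)] * 2   # footed, in the box
    + [(True, True, 1)] * 1 + [(True, True, 0)] * 3   # headed, in the box
    + [(False, False, 0)] * 4                          # footed, outside the box
)
bucket_xg = fit_bucket_model(toy_shots)
```

Every shot in a bucket receives the same xG, which is why this kind of model cannot distinguish, say, a tap-in from a tight-angle shot inside the box.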
We have built our own AI-based xG model, using an algorithm called a Gradient Boosted Model (GBM). In a GBM, the AI learns a set of decision trees and traverses each tree to determine the xG value. The model also tells you which shot properties (or features, in machine learning terminology) carry the most weight in determining the prediction. We find the properties described in the sections below to be the most important.
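As a rough illustration of how a GBM works, the sketch below implements a toy version in plain Python. It is not our production model: the trees are one-split "stumps", the leaf values are simple residual means, the importance score is a crude sum of squared-error reductions, and the two features (distance, headed) and all data values are invented for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, residuals):
    """Find the single (feature, threshold) split that best fits the residuals."""
    mean_all = sum(residuals) / len(residuals)
    sse_all = sum((r - mean_all) ** 2 for r in residuals)
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, left_mean, right_mean)
    sse, j, t, left_mean, right_mean = best
    return j, t, left_mean, right_mean, sse_all - sse

def fit_gbm(X, y, n_trees=50, learning_rate=0.3):
    """Boost stumps on the log-loss gradient (goal minus predicted probability)."""
    p0 = sum(y) / len(y)
    base = math.log(p0 / (1 - p0))  # start from the overall log-odds of a goal
    scores = [base] * len(y)
    stumps, importance = [], [0.0] * len(X[0])
    for _ in range(n_trees):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, t, lv, rv, gain = fit_stump(X, residuals)
        stumps.append((j, t, learning_rate * lv, learning_rate * rv))
        importance[j] += gain  # crude importance: total error reduction per feature
        scores = [s + learning_rate * (lv if row[j] <= t else rv)
                  for s, row in zip(scores, X)]
    return {"base": base, "stumps": stumps, "importance": importance}

def predict_xg(model, row):
    """Traverse every stump and sum its contribution to the log-odds."""
    score = model["base"]
    for j, t, left_value, right_value in model["stumps"]:
        score += left_value if row[j] <= t else right_value
    return sigmoid(score)

# Made-up toy shots: [distance in metres, headed (1) or not (0)].
X = [[5, 0], [6, 1], [8, 0], [10, 0], [12, 1],
     [16, 0], [20, 1], [22, 0], [25, 0], [30, 1]]
y = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
model = fit_gbm(X, y)
```

Calling `predict_xg(model, [5, 0])` returns the toy xG for a close-range footed shot, and `model["importance"]` holds one score per feature. Real GBM libraries use deeper trees, regularisation and more refined leaf values than this sketch.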
Shot distance and opening angle
Two features that quantify shot location are the distance and the opening angle. The further away the shot, the less likely it is to be a goal. The shot opening angle quantifies how much of the goal is visible from the shot position. It is a better feature than the shot angle (measured from the centre of the goal): a shot taken on the goal line, just offset from the centre, has a large shot angle, yet large shot angles would normally be correlated with a low probability of scoring.
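The geometry can be sketched in a few lines of Python. The coordinate system here is an assumption made for the example (x is the distance out from the goal line, y is the sideways offset from the goal centre, both in metres), as is the particular definition of shot angle; Opta's own coordinates are different.

```python
import math

GOAL_WIDTH = 7.32  # metres between the posts

def shot_distance(x, y):
    """Distance from the shot to the centre of the goal."""
    return math.hypot(x, y)

def shot_angle(x, y):
    """Angle (radians) between the shot and the line straight out from the goal centre."""
    return math.atan2(abs(y), x)

def opening_angle(x, y):
    """Angle (radians) between the two posts as seen from the shot position."""
    half = GOAL_WIDTH / 2
    to_left = (-x, half - y)     # vector from the shot to one post
    to_right = (-x, -half - y)   # vector from the shot to the other post
    cross = to_left[0] * to_right[1] - to_left[1] * to_right[0]
    dot = to_left[0] * to_right[0] + to_left[1] * to_right[1]
    return abs(math.atan2(cross, dot))

# A shot from the penalty spot, and a shot on the goal line 1 m from the centre.
penalty_opening = opening_angle(11.0, 0.0)
goal_line_opening = opening_angle(0.0, 1.0)
goal_line_shot_angle = shot_angle(0.0, 1.0)
```

For the goal-line shot, the opening angle is 180 degrees (the whole goal is open), while the shot angle under this definition is 90 degrees, the same value a shot from the corner flag would get. This is the ambiguity that the opening angle avoids.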
Pattern of play
Our data provider, Opta, classifies the pattern of play as regular play, fast break, from a corner, penalty, direct free kick, set piece or throw-in set piece.
Body part
The body part used for the shot can be left foot, right foot, head or other body part.
Other shot labels
Opta provides a variety of labels for each shot. We found the most important ones are whether the shot was labelled as a big chance, followed an error and/or came from an intentional assist. One of the weaknesses of xG models (which is a limitation of the available data) is the lack of information on the exact state of play (i.e. the positions of all players on the pitch) at the time of the shot. These labels can be a useful proxy for factors such as defensive pressure on the shot.
One unique aspect of our AI model is that it also takes into account the two actions preceding the shot (we found that including further actions did not improve the model). The features for each preceding action include the type of event (we use more than 30 types of action, such as a pass, open-play cross, corner, save, interception or take-on), the position on the pitch of the action, the team and the time difference in seconds between actions.
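One way to turn the preceding actions into model features is sketched below. The event dictionaries, their field names and the choice to encode the team as "same team as the shooter" are all assumptions made for this example; they are not Opta's data format or necessarily how our model encodes it.

```python
def preceding_action_features(events, shot_index, n_actions=2):
    """Flatten the n actions before a shot into a dictionary of features.

    `events` is a time-ordered list of dicts with hypothetical keys
    "type", "team", "x", "y" and "time" (seconds).
    """
    shot = events[shot_index]
    features = {}
    for k in range(1, n_actions + 1):
        i = shot_index - k
        prefix = f"prev{k}_"
        if i < 0:
            # Not enough history (e.g. a shot straight from kick-off).
            features[prefix + "type"] = "none"
            continue
        action = events[i]
        features[prefix + "type"] = action["type"]
        features[prefix + "x"] = action["x"]
        features[prefix + "y"] = action["y"]
        features[prefix + "same_team"] = action["team"] == shot["team"]
        # Time gap between this action and the one that followed it.
        features[prefix + "seconds_to_next"] = events[i + 1]["time"] - action["time"]
    return features

# Made-up sequence: interception by the opposition, a pass, then the shot.
toy_events = [
    {"type": "interception", "team": "B", "x": 40.0, "y": 30.0, "time": 100.0},
    {"type": "pass", "team": "A", "x": 70.0, "y": 45.0, "time": 103.5},
    {"type": "shot", "team": "A", "x": 88.0, "y": 50.0, "time": 105.0},
]
toy_features = preceding_action_features(toy_events, shot_index=2)
```

Categorical values such as the event type would still need encoding (for example as integer codes or one-hot columns) before being passed to a tree-based model.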
Our dataset consists of 57,000 shots in the Premier League between the 2013/14 and 2018/19 seasons. We randomly split 80% of the data into a training set (used only to train the model) and 20% into a validation set (used only to quantify performance). We perform 5-fold cross-validation, in which the dataset is split into 5 equal-sized parts and each part is used in turn as the validation set to assess performance.
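The splitting scheme can be sketched as follows, under one reading of the paragraph above: the shuffled shots are divided into 5 parts, and each part in turn serves as the 20% validation set while the other four form the 80% training set. The function name and the fixed random seed are choices made for this example.

```python
import random

def five_fold_splits(n_shots, k=5, seed=0):
    """Return k (training_indices, validation_indices) pairs over shuffled shots."""
    indices = list(range(n_shots))
    random.Random(seed).shuffle(indices)  # random assignment, reproducible via seed
    folds = [indices[i::k] for i in range(k)]  # k near-equal parts
    splits = []
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits

splits = five_fold_splits(57000)
```

With 57,000 shots, each split has 45,600 training shots and 11,400 validation shots, and every shot appears in exactly one validation set.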
Previous analyses of xG models have aggregated data over the entire season and calculated the r² correlation coefficient between actual and expected goals. However, this loses important information on the accuracy of each individual shot. For this reason, we use the Root Mean Square Error (RMSE) of the validation dataset, defined by