Although it now features in mainstream media such as Match of the Day, expected goals (xG) is quite a divisive topic in the FPL community. Some regard it as the most important metric for evaluating goal-scoring potential, others recognise its usefulness in conjunction with other statistics and the eye test, while a few have gone so far as to mute the very word from their Twitter timeline! If you are unconvinced of its use, but value shots as an attacking metric, it's worth considering that in the simplest xG model (where all shots are treated equally), expected goals is given by
$$\mathrm{xG} = 0.1 \times \text{number of shots}$$
since, on average, there is a 10% chance of scoring from a single shot in the Premier League. Of course, not all shots have an equal chance of resulting in a goal, so more advanced models take into account other factors such as shot location. Opta Sports give the following definition of xG:
Expected goals (xG) measures the quality of a shot based on several variables such as assist type, shot angle and distance from goal, whether it was a headed shot and whether it was defined as a big chance. Adding up a player or team’s expected goals can give us an indication of how many goals a player or team should have scored on average, given the shots they have taken.
Modelling Expected Goals
A more sophisticated approach is to bucket the data according to shot type (e.g. inside versus outside the box, headed versus non-headed) and assign a different probability to each class. Indeed, this is how some xG models work, and they perform better than the naive all-shots-equal model. Others perform regression on various shot properties, which can end up as quite a complex set of equations. With large datasets and many variables describing each shot, however, this is an ideal application for machine learning/artificial intelligence (AI).
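To make the bucketing idea concrete, here is a minimal sketch in Python. The shots are invented toy data (not Premier League data), and the function names and the choice of buckets are assumptions made purely for illustration. Each bucket's xG is simply the fraction of shots in that bucket that were scored.

```python
from collections import defaultdict

# Hypothetical toy shots as (in_box, headed, is_goal). Invented for illustration.
shots = (
    [(True, False, 1)] * 1 + [(True, False, 0)] * 3      # footed, inside the box
    + [(True, True, 1)] * 1 + [(True, True, 0)] * 4      # headed, inside the box
    + [(False, False, 1)] * 1 + [(False, False, 0)] * 9  # footed, outside the box
)

def fit_naive_model(shots):
    """All shots equal: one conversion rate for every shot."""
    return sum(is_goal for _, _, is_goal in shots) / len(shots)

def fit_bucket_model(shots):
    """One conversion rate per (in_box, headed) bucket."""
    goals = defaultdict(int)
    counts = defaultdict(int)
    for in_box, headed, is_goal in shots:
        bucket = (in_box, headed)
        counts[bucket] += 1
        goals[bucket] += is_goal
    return {bucket: goals[bucket] / counts[bucket] for bucket in counts}

naive_xg = fit_naive_model(shots)
bucket_xg = fit_bucket_model(shots)
for bucket, value in sorted(bucket_xg.items()):
    print(bucket, round(value, 3))
print("all shots equal:", round(naive_xg, 3))
```

The bucketed model gives each shot the rate of its own class, whereas the naive model gives every shot the same value regardless of where or how it was taken.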
We have built our own AI-based xG model, using an algorithm called a Gradient Boosted Model (GBM). In a GBM, the AI learns a sequence of decision trees, each one trained to correct the errors of those before it. To determine the xG value of a shot, it traverses every tree and combines their outputs. This model also tells you which shot properties (or features, in machine learning terminology) carry the most weight in the prediction. We find the following properties are the most important.
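As a rough illustration of how gradient boosting works, the sketch below implements a toy version in pure Python. It is not our production model: it uses one-split trees ("stumps"), a simplified leaf update (the mean residual), synthetic shots with only two made-up features (distance and a headed flag), and arbitrary settings for the number of rounds and learning rate. All names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, residuals):
    """Least-squares fit of a one-split tree to the residuals.
    Returns (feature_index, threshold, left_value, right_value)."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm = sum(left) / len(left)
            rm = sum(right) / len(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:]

def predict_xg(model, row):
    """Traverse every stump, sum the scaled outputs, map to a probability."""
    base, learning_rate, stumps = model
    score = base + learning_rate * sum(
        lv if row[j] <= t else rv for j, t, lv, rv in stumps)
    return sigmoid(score)

def fit_gbm(X, y, n_rounds=20, learning_rate=0.5):
    base_rate = sum(y) / len(y)
    # Start from the log-odds of the overall conversion rate.
    model = (math.log(base_rate / (1 - base_rate)), learning_rate, [])
    for _ in range(n_rounds):
        # For log loss, the negative gradient is (outcome - predicted xG).
        residuals = [g - predict_xg(model, row) for row, g in zip(X, y)]
        model[2].append(fit_stump(X, residuals))
    return model

def log_loss(model, X, y):
    total = 0.0
    for row, g in zip(X, y):
        p = predict_xg(model, row)
        total -= g * math.log(p) + (1 - g) * math.log(1 - p)
    return total / len(y)

def split_counts(model, n_features):
    """Crude feature importance: how often each feature is split on."""
    counts = [0] * n_features
    for j, _, _, _ in model[2]:
        counts[j] += 1
    return counts

# Synthetic shots: [distance in metres, headed flag]. Invented for illustration.
rng = random.Random(0)
X, y = [], []
for _ in range(400):
    distance, headed = rng.randint(2, 30), rng.randint(0, 1)
    p_goal = sigmoid(1.0 - 0.15 * distance - 0.8 * headed)
    X.append([distance, headed])
    y.append(1 if rng.random() < p_goal else 0)

model = fit_gbm(X, y)
print("xG close range:", round(predict_xg(model, [3, 0]), 3))
print("xG long range:", round(predict_xg(model, [28, 0]), 3))
print("splits per feature [distance, headed]:", split_counts(model, 2))
```

Production GBM libraries grow deeper trees, use more refined leaf updates and report importance in more careful ways, but the loop is the same in spirit: fit a tree to the current errors, add it to the ensemble, repeat.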
Shot distance and opening angle
Two features that quantify shot location are the distance and the opening angle. The further away the shot, the less likely it is to be a goal. The opening angle is the angle between the two goalposts as seen from the shot location, so it quantifies how much of the goal is visible. It is a better feature than the shot angle (measured from the centre of the goal). For example, a shot taken on the goal line but just offset from the centre has a large shot angle, and large shot angles would normally be correlated with a low goal-scoring probability, even though the goal is wide open from there. The opening angle for the same shot is large, which correctly reflects how much of the goal is available.
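The geometry can be sketched in a few lines of Python. The coordinate convention here is an assumption for illustration: `x` is the perpendicular distance from the goal line in metres, `y` is the sideways offset from the centre of the goal, and the posts sit 7.32 m (8 yards) apart. The function names are hypothetical, and Opta's own coordinate system differs.

```python
import math

GOAL_WIDTH = 7.32  # metres between the posts (8 yards)

def shot_distance(x, y):
    """Straight-line distance from the shot to the centre of the goal."""
    return math.hypot(x, y)

def opening_angle(x, y):
    """Angle (radians) between the two posts as seen from the shot location."""
    half = GOAL_WIDTH / 2
    # Vectors from the shot to each post: (-x, half - y) and (-x, -half - y).
    cross = GOAL_WIDTH * x                 # magnitude of their cross product
    dot = x * x + y * y - half * half      # their dot product
    return math.atan2(cross, dot)

def shot_angle(x, y):
    """Angle (radians) between the line to the goal centre and the
    perpendicular to the goal line."""
    return math.atan2(abs(y), x)

# A central shot from 11 m, and a shot on the goal line 0.5 m off centre.
for label, (x, y) in {"11 m central": (11.0, 0.0),
                      "on the goal line": (0.0, 0.5)}.items():
    print(label,
          "| opening angle:", round(math.degrees(opening_angle(x, y)), 1),
          "| shot angle:", round(math.degrees(shot_angle(x, y)), 1))
```

For the shot on the goal line, the shot angle is at its maximum of 90 degrees, which on its own would suggest a very tight chance, while the opening angle is 180 degrees because the whole goal mouth is available.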
Pattern of play
Our data provider Opta classifies the pattern of play as one of: regular play, fast break, from a corner, penalty, direct free kick, set piece or throw-in set piece.
Body part
The body part is one of: left foot, right foot, head or other body part.
Other shot labels
Opta provides a variety of labels for each shot. We found the most important ones are whether the shot was labelled as a big chance, followed an error and/or came from an intentional assist. One of the weaknesses of xG models (which is a limitation of the available data) is the lack of information on the exact state of play (i.e. the positions of all players on the pitch) at the time of the shot. These labels can be a useful proxy for factors such as defensive pressure on the shot.
One unique aspect of our AI model is that it also takes into account the 2 actions preceding the shot (we found that including further actions did not improve the model). The features for each preceding action are the type of event (we use more than 30 types of action, such as a pass, open-play cross, corner, save, interception or take-on), its position on the pitch, the team and the time difference in seconds between actions.
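As a sketch of how the preceding-action features might be assembled, the Python below walks back through a time-ordered list of events. The event format (dictionaries with `type`, `x`, `y`, `team` and `t` keys), the feature names and the handling of missing actions are all assumptions for illustration, not Opta's schema. Here the time difference is taken between each action and the one that follows it.

```python
def preceding_action_features(events, shot_index, n_prev=2):
    """Build features from the n_prev actions before the shot at shot_index.

    events is a time-ordered list of dicts with hypothetical keys:
    type, x, y, team, t (seconds).
    """
    shot = events[shot_index]
    names = ("type", "x", "y", "same_team", "dt")
    features = {}
    for k in range(1, n_prev + 1):
        idx = shot_index - k
        prefix = f"prev{k}_"
        if idx < 0:
            # Not enough history (e.g. a shot straight from kick-off).
            features.update({prefix + name: None for name in names})
            continue
        action, following = events[idx], events[idx + 1]
        features[prefix + "type"] = action["type"]
        features[prefix + "x"] = action["x"]
        features[prefix + "y"] = action["y"]
        features[prefix + "same_team"] = action["team"] == shot["team"]
        features[prefix + "dt"] = following["t"] - action["t"]
    return features

# Invented example: interception, then a pass, then the shot.
events = [
    {"type": "interception", "x": 40.0, "y": 30.0, "team": "A", "t": 100.0},
    {"type": "pass", "x": 70.0, "y": 45.0, "team": "A", "t": 103.0},
    {"type": "shot", "x": 88.0, "y": 50.0, "team": "A", "t": 105.0},
]
features = preceding_action_features(events, shot_index=2)
for name, value in features.items():
    print(name, value)
```

Setting `n_prev` higher would add further actions in the same way, which makes it easy to test whether a longer history helps.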
Our dataset consists of 57,000 shots in the Premier League between the 2013/14 and 2018/19 seasons. We randomly split 80% of the data into a training set (used only to train the model) and 20% into a validation set (used only to quantify performance). We perform 5-fold cross-validation, in which the dataset is split into 5 equal-sized parts and each part is used in turn to assess performance.
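The splitting scheme can be sketched with shot indices alone. This is a minimal Python illustration under assumptions: the function names, the seed and the exact way the folds are dealt out are hypothetical. One way to read the description is that each of the 5 folds in turn plays the role of the 20% validation set, with the other 80% used for training.

```python
import random

def train_validation_split(n_samples, validation_fraction=0.2, seed=0):
    """Randomly split shot indices into training and validation lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_validation = round(n_samples * validation_fraction)
    return indices[n_validation:], indices[:n_validation]

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (training, held_out) index lists, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i
                    for idx in fold]
        yield training, folds[i]

n_shots = 57000
train, validation = train_validation_split(n_shots)
print(len(train), len(validation))

for fold_number, (training, held_out) in enumerate(k_fold_indices(n_shots), 1):
    print("fold", fold_number, len(training), len(held_out))
```

Averaging the validation score over the 5 folds gives a more stable estimate than a single split, because every shot is used for validation exactly once.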
Previous analyses of xG models have aggregated data over the entire season and calculated the r² correlation coefficient between actual and expected goals. However, this loses important information on the accuracy of each individual shot. For this reason, we use the Root Mean Square Error (RMSE) on the validation dataset, defined by
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\mathrm{xG}_i - G_i\right)^2}$$
where xG_i is the prediction by the model (a probability from 0 to 1) for the shot labelled by the index i, G_i is the true outcome (0 for a non-goal, 1 for a goal) and N is the number of shots in the validation set. A lower RMSE indicates better performance.
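The shot-level RMSE described above is straightforward to compute. The sketch below uses a hypothetical helper name and ten invented shots, one of which is a goal, scored with the all-shots-equal value of 0.1.

```python
import math

def rmse(xg_values, outcomes):
    """Root mean square error between per-shot xG and outcomes (0 or 1)."""
    if not xg_values or len(xg_values) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(xg - g) ** 2 for xg, g in zip(xg_values, outcomes)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy data: ten shots, one goal, every shot given xG = 0.1.
outcomes = [1] + [0] * 9
naive_xg = [0.1] * len(outcomes)
print(round(rmse(naive_xg, outcomes), 3))
```

More generally, predicting a constant p for every shot, on data where a fraction p of shots are goals, gives an RMSE of sqrt(p(1 - p)). At a 10% conversion rate this is 0.3, which is in line with the all-shots-equal value of 0.304 reported in the table below, and it shows why improvements in shot-level RMSE look small in absolute terms.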
In the following table we give our RMSE values for different models trained with variable numbers of features.
|Model (features used)||RMSE|
|All shots equal||0.304|
|Basic (following error, big chance and intentional assist labels)||0.274|
|Standard (as above plus pattern of play, body part and shot distance/opening angle)||0.265|
|Full (as above plus the 2 actions preceding the shot)||0.262|
One can see that adding more input features improves the performance of the model, with the RMSE falling from 0.304 to 0.262. There is a small but noticeable improvement (0.265 to 0.262) from considering the actions preceding the shot.
The RMSE of our full model (0.262) is highly competitive with existing xG models. To ensure a fair comparison, however, one needs to use the same training and validation data. We summarise our performance against Opta's own model in the table below. Our RMSE is around 0.04 lower (better) than Opta's.
|Training data||Validation data||xG model||RMSE|
|2013/14 to 2016/17||2017/18||Our full model||0.270|
|2013/14 to 2017/18||2018/19||Our full model||0.272|
With our trained model we can now make predictions. In the following interactive charts we show xG values for every shot for Mohamed Salah and Sergio Agüero in the 2018/19 season. You can hover over each shot to see some of the features of the shot.
Full expected goals information is available within the premium membership options of the site.