SYNOPSIS
My task is to predict the outcome of soccer games from attributes of the match: the two teams playing, the date, the tournament, the location of the game, and whether the venue is neutral. The model handles only international matches, whether friendlies or games in major tournaments. The task is relevant because of the upcoming 2018 Soccer World Cup. Such a historic sporting event will fuel the large gambling market. Avid soccer fans and casual followers alike will be making guesses and placing bets, with low accuracy and a low success rate. Will machine learning make better soccer predictions than humans?
I approached this project in two different ways. At first, I transformed the data to obtain a numerical label for the outcome of each game. This value was calculated from the score difference between the two teams. A negative value means that the away team won, whereas a positive value means that the home team won. I then tested different learners and concluded that a linear regression model, evaluated with 10-fold cross-validation, had a reliably high correlation coefficient of 0.5777.
However, this approach did not seem to be the best suited to the task. A numerical label made the predictions harder to interpret, because the model predicted a score difference rather than the winner of the game. Thus, I transformed the data once again to obtain a nominal label with three possible values: 'H' if the home team won, 'A' if the away team won, and 'D' for a draw. With this new dataset, several classifiers performed well, reaching accuracies above 50%, whereas the ZeroR baseline achieved an accuracy of only 48.6241%. As expected, the most important features are the home team and the away team. A summary of the results can be found in the graph below.
The graph shows that linear regression and logistic regression perform best, but Bayes Net is reliably accurate as well; logistic regression and Bayes Net reach accuracy scores of 57.2585% and 53.1823%, respectively.
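The two labelings described in the synopsis can be illustrated with a short Python sketch. This is not the code used in the project; the function names, the tuple layout, and the example matches are all hypothetical, and the only assumption taken from the text is that the numerical label is the home score minus the away score.

```python
def score_difference(home_score: int, away_score: int) -> int:
    """Numerical label: positive if the home team won, negative if the away team won."""
    return home_score - away_score


def nominal_outcome(home_score: int, away_score: int) -> str:
    """Nominal label: 'H' (home win), 'A' (away win), or 'D' (draw)."""
    diff = score_difference(home_score, away_score)
    if diff > 0:
        return "H"
    if diff < 0:
        return "A"
    return "D"


# Hypothetical matches: (home_team, away_team, home_score, away_score)
matches = [
    ("Team A", "Team B", 1, 4),
    ("Team C", "Team D", 2, 0),
    ("Team E", "Team F", 3, 3),
]

# Each entry pairs the numerical label with the nominal label for one match.
labels = [(score_difference(h, a), nominal_outcome(h, a)) for _, _, h, a in matches]
```

Deriving the nominal label from the numerical one keeps the two datasets consistent: a draw is exactly a score difference of zero.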
FIRST APPROACH
DATA
RESULTS & ANALYSIS
UPDATED APPROACH
DATA
RESULTS & ANALYSIS
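ZeroR, the baseline mentioned in the synopsis, always predicts the most frequent class seen in the training data, so its accuracy is roughly the share of that class in the evaluation data. The Python sketch below illustrates the idea on hypothetical labels; the function name and the toy data are assumptions made for illustration and do not reproduce the project's dataset or its results.

```python
from collections import Counter
from typing import List, Tuple


def zero_r_accuracy(train_labels: List[str], test_labels: List[str]) -> Tuple[str, float]:
    """Return the majority class of the training labels and the accuracy
    obtained by predicting that class for every test example."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority)
    return majority, correct / len(test_labels)


# Hypothetical nominal labels ('H', 'A', 'D'), not taken from the project data.
train = ["H", "H", "A", "D", "H"]
test = ["H", "A", "H", "D"]

majority_class, accuracy = zero_r_accuracy(train, test)
```

Any classifier worth keeping should beat this figure, which is why the accuracies above 50% are compared against the ZeroR score.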