When you think about it, the standard categories into which we group football players aren’t very satisfactory.
Midfielder, defender, striker.
Those say little about a player, beyond where he stands as the game kicks off.
So we add some more detail.
Full back, wing back, defensive midfielder, “number 10”, “false 9”.
And things quickly get out of hand. John Barnes spent a good while on Radio Five last weekend, arguing about the definition of a winger. My idea of a winger and yours might be different. We could point at two quite different players and label them both “winger”.
We’re left with an unwieldy array of names that may mean different things to different people.
What about a manager who’s looking for a new player? He’s not just going to say “find me a midfielder”. Well, he might, but he’ll probably add “a midfielder like x” and name a player, either in his own team or in somebody else’s.
Barcelona’s Andrés Iniesta is 31. What if Luis Enrique said, “in a couple of years, I’ll need a younger Iniesta. Who’s out there?”
Could stats help with that?
I’m sure they could.
First, some data.
We need a way to identify similar players, and we’ll need to do it based on a few different factors.
1. Where they go on the pitch
This will help to describe a broad set of playing positions: full back, striker, defensive midfielder etc.
I’ve split the pitch into two lane types (centre and wings), using lines that run along its length, and into four bands from goal to goal (deep, own half, opposition half, and advanced).
That gives eight areas of the pitch: advanced-wide, advanced-centre, deep-centre etc. Note that this does mean my model won’t distinguish between left- and right-sided players; it will group them up as “wide players”. This is deliberate.
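As a rough sketch of how an eight-zone scheme like this could be coded: the 0-100 coordinate scale, the 25/50/75 cut-offs and the function names below are all assumptions for illustration, not the boundaries used for the tool.

```python
from collections import Counter


def pitch_zone(x: float, y: float) -> str:
    """Map an event location to one of eight zones.

    Assumed scale: x runs 0-100 from a team's own goal line to the
    opposition's, y runs 0-100 from one touchline to the other.
    """
    # Four bands along the length of the pitch (hypothetical cut-offs).
    if x < 25:
        depth = "deep"
    elif x < 50:
        depth = "own-half"
    elif x < 75:
        depth = "opposition-half"
    else:
        depth = "advanced"
    # Two lane types across the width; left and right are merged on purpose.
    lane = "centre" if 25 <= y <= 75 else "wide"
    return f"{depth}-{lane}"


def zone_shares(events):
    """Share of a player's events that fall in each zone."""
    counts = Counter(pitch_zone(x, y) for x, y in events)
    total = sum(counts.values())
    return {zone: n / total for zone, n in counts.items()}


# Made-up event locations for one hypothetical player.
shares = zone_shares([(90, 10), (90, 90), (10, 50), (60, 50)])
```

Both touchline events land in the same "advanced-wide" zone, which is the left/right merging described above.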
2. What they do when they have the ball
Pass, shoot, take on a player… what do they choose and how often?
3. Success rates
Do they succeed with the choices they make? It’s no good me saying that a player stands in the same areas of the pitch as Iniesta, if he passes at 70% accuracy.
I’ve brought together these data points for every player in the big five European leagues for the 2014/15 season. That’s just under 1500 players, who played 900 minutes or more last season.
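To make the second and third factors concrete, here is a minimal sketch of turning raw season totals into "what they choose" shares and success rates. The field names, the two players and their numbers are invented. Rescaling each column to z-scores is my assumption about a sensible preparation step before measuring similarity, not something the data description above specifies.

```python
from statistics import mean, pstdev


def ratio(part, whole):
    """Safe division: a player with no attempts gets 0.0."""
    return part / whole if whole else 0.0


def player_features(c):
    """Action mix and success rates from raw counts (hypothetical field names)."""
    actions = c["passes"] + c["shots"] + c["take_ons"]
    return {
        "pass_share": ratio(c["passes"], actions),
        "shot_share": ratio(c["shots"], actions),
        "take_on_share": ratio(c["take_ons"], actions),
        "pass_accuracy": ratio(c["passes_completed"], c["passes"]),
        "take_on_success": ratio(c["take_ons_won"], c["take_ons"]),
    }


def standardise(rows):
    """Convert each column to z-scores so no single stat dominates."""
    out = [dict() for _ in rows]
    for key in rows[0]:
        column = [row[key] for row in rows]
        centre, spread = mean(column), pstdev(column)
        for target, value in zip(out, column):
            target[key] = (value - centre) / spread if spread else 0.0
    return out


# Made-up season totals for two hypothetical players.
RAW = {
    "playmaker": {"passes": 1800, "passes_completed": 1620, "shots": 40,
                  "take_ons": 160, "take_ons_won": 96},
    "poacher": {"passes": 500, "passes_completed": 350, "shots": 120,
                "take_ons": 30, "take_ons_won": 12},
}
features = {name: player_features(c) for name, c in RAW.items()}
scaled = standardise(list(features.values()))
```

The two invented players have similar raw pass shares in absolute terms but very different accuracy (90% against 70%), which is the kind of gap the success-rate columns are there to catch.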
Clustering the players.
Once you’ve got a load of data, how do you go about identifying similar players?
A commonly used technique would be K-Means clustering.
I started with K-Means and tweeted a few output images, but as I dug deeper, the results were a bit unsatisfactory.
K-Means can get confused by outliers, because it worries most about big clusters of data. You get some odd classifications, which are obviously wrong. For a while (and based only on small data), K-Means was insisting that some goalkeepers looked like centre backs.
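A toy illustration of the outlier problem, using made-up one-dimensional numbers rather than any player data: this is a bare-bones version of the standard K-Means loop (assign each point to its nearest centre, then move each centre to the mean of its points), with fixed starting centres so the result is repeatable.

```python
from statistics import mean


def kmeans_1d(values, centroids, rounds=20):
    """Plain K-Means on 1-D data with fixed starting centroids."""
    groups = []
    for _ in range(rounds):
        # Assignment step: each value joins its nearest centroid.
        groups = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Update step: each centroid moves to the mean of its group.
        updated = [mean(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return groups


# Two obvious groups, clustered with k = 2.
clean = kmeans_1d([1, 2, 3, 11, 12, 13], [1, 13])

# The same data plus a single extreme value.
with_outlier = kmeans_1d([1, 2, 3, 11, 12, 13, 100], [1, 13])
```

In this toy case the clean data splits into its two natural groups, but once the extreme value is added it takes a cluster to itself and the two genuine groups get lumped together. That is the flavour of misclassification described above, shown under very simplified assumptions.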
Hierarchical clustering gives a better set of results.
It works like this.
You’ve got six players, [a] to [f], and the clustering algorithm compares the data that you feed it, grouping some players together because they have similar data. At the first stage, it says [b/c] are similar and [d/e] are similar. [a] and [f] are different enough not to be grouped yet.
Then it runs again and groups up to the next level. Now [d/e/f] all sit in the same (higher level) group, [b/c] are still together and [a] is still different enough to be on his own.
Then it groups again, and again, and keeps on grouping until all of the players are in one big set.
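The six-player walk-through can be sketched in a few lines. The coordinates are invented so that the merges happen in the order described, and "single linkage" (measuring the gap between two groups by their closest pair of members) is an assumption on my part; other linkage rules exist and the tool may use a different one.

```python
from itertools import combinations
from math import dist

# Made-up two-feature data for six hypothetical players.
PLAYERS = {
    "a": (0, 0),
    "b": (30, 10),
    "c": (33, 14),
    "d": (80, 50),
    "e": (86, 58),
    "f": (98, 63),
}


def agglomerate(points):
    """Repeatedly merge the two closest groups; return the merge history."""

    def gap(pair):
        # Single linkage: distance between the closest members of two groups.
        left, right = pair
        return min(dist(points[p], points[q]) for p in left for q in right)

    clusters = [frozenset([name]) for name in points]
    history = []
    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=gap)
        merged = left | right
        history.append((sorted(merged), round(gap((left, right)), 2)))
        clusters = [c for c in clusters if c not in (left, right)] + [merged]
    return history


history = agglomerate(PLAYERS)
for members, distance in history:
    print(members, distance)
# The first three merges are b/c (5.0), d/e (10.0) and d/e/f (13.0).
```

Each entry in the history is one level of the tree, ending with a single group that contains all six players.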
When we look at the clustering output, at the lowest level, we’ll have lots of pairs of players. That’s quite fun but not very useful – when we ask for a midfielder, we want a few choices to look at, not just one statistically similar option.
Trimming back the tree gives some more useful output. You select a player and the algorithm returns a small list of similar players from across Europe’s top five leagues.
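One way to picture the trimming, as a sketch only: stop merging once the closest remaining groups are further apart than some cut-off, then look up whichever group the selected player ended up in. The player data, the `max_gap` parameter and the function names are hypothetical, and single linkage is again an assumption.

```python
from itertools import combinations
from math import dist

# Made-up two-feature data for six hypothetical players.
PLAYERS = {
    "a": (0, 0),
    "b": (30, 10),
    "c": (33, 14),
    "d": (80, 50),
    "e": (86, 58),
    "f": (98, 63),
}


def cut_tree(points, max_gap):
    """Merge closest groups until the next merge would exceed max_gap."""

    def gap(pair):
        # Single linkage: distance between the closest members of two groups.
        left, right = pair
        return min(dist(points[p], points[q]) for p in left for q in right)

    clusters = [frozenset([name]) for name in points]
    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=gap)
        if gap((left, right)) > max_gap:
            break  # the tree is trimmed here
        clusters = [c for c in clusters if c not in (left, right)] + [left | right]
    return clusters


def similar_players(points, name, max_gap):
    """Everyone who shares a group with `name` after trimming."""
    for cluster in cut_tree(points, max_gap):
        if name in cluster:
            return sorted(cluster - {name})
    return []


tight = similar_players(PLAYERS, "d", max_gap=15)
loner = similar_players(PLAYERS, "a", max_gap=15)
relaxed = similar_players(PLAYERS, "a", max_gap=40)
```

With the tight cut-off, player d gets a short list of two and player a gets nobody at all; loosening the cut-off is what lets a pick up some company. Where to cut is a judgment call between lists that are too short to be useful and lists too long to mean much.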
Give it a try…
Edit: This post still gets a fair bit of traffic, so I’m adding a quick reminder that it was written in August 2015. The tool below uses data that is somewhat out of date.
An updated and improved version of the tool exists and was presented at OptaPro’s 2016 conference, but it isn’t currently public, sorry.
So, what about Iniesta?
If you select Iniesta, only one player pops up as being comparable, which tends to happen with truly elite players. Try selecting Messi – the tool will tell you that it can’t find any comparable players at all, and fair enough. You want another player like Messi? Sorry, there isn’t one.
It’s easy enough to relax the segmentation and get a bigger list of players, but let’s go with the option that we have. The tool thinks that Mateo Kovacic of Inter plays similarly enough to Iniesta to end up in the same group.
I have no personal opinion on this comparison at all. I don’t watch Italian football and that’s partly the point of the exercise. I don’t need to know Serie A for the tool to flag a player in Italy, who we might want to take a look at.
This tool would be an early input to a longer process of investigating the player’s detailed stats… Watching videos… Scouting him… Enquiring about his availability and price.
On the dashboard above, you can link through to Squawka and Whoscored to see a player’s stats, but just for Iniesta and Kovacic, let’s have a look using my player profiles.
They’re not exactly the same, but both are very high standard passers, playing large numbers of attacking balls. Neither creates large numbers of assists. They occupy similar areas of the pitch. Their take-on stats are similar.
Kovacic is more of a goal threat. He’s also 21.
Is Kovacic an ideal replacement for Iniesta? I repeat that I have no idea; I’ve never seen either play in the flesh. However, big data and a clustering algorithm have certainly raised an interesting question that bears further investigation.
(I promise faithfully that I wrote this whole piece before running this Google search.)
(And before the Guardian wrote, just hours ago, that Kovacic is going to Real.)
Actually, what those Google searches probably mean is that stats have identified something that people who follow European football more closely than I do already knew.
That other people have reached a similar conclusion through observation suggests that this technique can process very large amounts of data very quickly and whittle thousands of players down to a short-list to investigate. We could throw it at geographically remote leagues, at lower-level leagues, and at volumes of players that just wouldn’t be feasible to assess live.
As is very often the case with data analysis, this isn’t the whole answer, but it does have the potential to make one part of the recruitment process faster, easier and more accurate.
Why not try the player search and let me know on Twitter what you think of its suggestions?