The objective was to pick the eleven players likely to perform best, using data from past IPL seasons. This repository showcases Data Wrangling, Feature Engineering and Data Visualisation. Principal Component Analysis (PCA) was used to create a "ranking" variable based on each player's stats.
First, we split the dataset for each year into batsmen, bowlers, wicket-keepers and all-rounders, which simplified the later analysis and feature engineering. We kept all features originally provided in the dataset, except that each subset dropped the features irrelevant to it. For example, for Batsmen we dropped “Balls Bowled”, “Runs Conceded”, “Wickets Taken” and “Economy”, and for Bowlers we dropped “Runs”, “Balls”, “Strike Rate”, “Four”, “Six” and “Highest Run”. For All Rounders we kept all features, and for Wicket Keepers we dropped the same bowling attributes as for Batsmen.
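The split described above can be sketched in plain Python. This is only an illustration: the record layout (one dict per player), the "Player Type" key and its values are assumptions, and the column names are taken from the feature names quoted above rather than from the actual dataset files.

```python
# Column names follow the features quoted in the text; the dict-per-player
# layout and the "Player Type" labels are assumptions for illustration.
BOWLING_COLS = {"Balls Bowled", "Runs Conceded", "Wickets Taken", "Economy"}
BATTING_COLS = {"Runs", "Balls", "Strike Rate", "Four", "Six", "Highest Run"}

# Features to drop for each player type.
DROP_BY_TYPE = {
    "Batsman": BOWLING_COLS,
    "Wicket Keeper": BOWLING_COLS,
    "Bowler": BATTING_COLS,
    "All Rounder": set(),  # all-rounders keep every feature
}


def split_by_type(rows):
    """Group player records by type, dropping the features irrelevant to each type."""
    groups = {ptype: [] for ptype in DROP_BY_TYPE}
    for row in rows:
        drop = DROP_BY_TYPE[row["Player Type"]]
        groups[row["Player Type"]].append(
            {key: value for key, value in row.items() if key not in drop}
        )
    return groups
```

Calling `split_by_type` once per season would give the per-year, per-type subsets that the rest of the analysis works on.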
We also added some extra features, which helped the dimensionality reduction technique used later in our approach (PCA) to rank the players better. The features are listed below:
Batting:
Hard Hitting Ability = (4 × Fours + 6 × Sixes) / Balls Played by Batsman
Finisher = Not Out innings / Total Innings played
Fast Scoring Ability = Total Runs / Balls Played by Batsman
Running Between Wickets = (Total Runs - (4 × Fours + 6 × Sixes)) / (Total Balls Played - Boundary Balls)
Bowling:
Wicket Taking Ability = Number of balls bowled / Wickets Taken
Consistency = Runs Conceded / Wickets Taken
For All-Rounders all of the above features were added, and for Wicket Keepers only the batting features were added.
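The derived features above translate directly into code. The sketch below makes two assumptions that the text does not state: "Boundary Balls" is taken to be Fours + Sixes, and a zero denominator (for example a bowler with no wickets) yields `None`. The function and argument names are hypothetical.

```python
def safe_div(num, den):
    """Return num / den, or None for a zero denominator (handling is an assumption)."""
    return num / den if den else None


def batting_features(runs, balls, fours, sixes, not_outs, innings):
    """Derived batting features, following the formulas listed above."""
    boundary_runs = 4 * fours + 6 * sixes
    boundary_balls = fours + sixes  # assumption: one ball per boundary
    return {
        "Hard Hitting": safe_div(boundary_runs, balls),
        "Finisher": safe_div(not_outs, innings),
        "Fast Scoring": safe_div(runs, balls),
        "Running Between Wickets": safe_div(
            runs - boundary_runs, balls - boundary_balls
        ),
    }


def bowling_features(balls_bowled, runs_conceded, wickets):
    """Derived bowling features; for both, a lower value is better."""
    return {
        "Wicket Taking": safe_div(balls_bowled, wickets),
        "Consistency": safe_div(runs_conceded, wickets),
    }
```

For an all-rounder both functions would be applied, and for a wicket keeper only `batting_features`.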
Batsmen:
Snapshot of 2018 Batsmen with 2 PCA components
Variation in the data explained by the components:
Component 1 = 14.82703404
Component 2 = 7.38806297
Component 1 explains the larger share of the variation, so we use it as the Ranking Index to rank all the players.
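The idea of using the first principal component as a ranking index can be sketched with NumPy. This is not the repository's code: it computes PCA through an SVD of the standardized feature matrix, and the function name and the `orient_by` parameter are hypothetical. Because the sign of a principal component is arbitrary, the sketch flips it so that a chosen reference feature (for example Runs) has a positive weight.

```python
import numpy as np


def first_component_scores(X, orient_by=0):
    """Standardize X (players x features) and project it onto the first principal component.

    Returns (scores, weights, explained_variance).
    """
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    Z = (X - X.mean(axis=0)) / std

    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(Z, full_matrices=False)
    weights = Vt[0]

    # The sign of a component is arbitrary; make the reference feature positive.
    if weights[orient_by] < 0:
        weights = -weights

    explained_variance = singular_values ** 2 / (len(Z) - 1)
    return Z @ weights, weights, explained_variance
```

Sorting players by `scores` in descending order then gives the ranking for that player type and season, and `weights` plays the role of the coefficients assigned to each feature.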
Below is an example of the coefficients assigned by PCA to our features:
Component1/RankingIndex = 0.351 × Runs + 0.338 × Balls + 0.297 × StrikeRate + 0.338 × Fours + 0.324 × Sixes + 0.325 × HighestRunScored + 0.200 × Ct_St + 0.247 × RunOuts + 0.296 × MatchesPlayed + 0.265 × HardHitting + 0.297 × FastScoring + 0.02 × RunningBetweenWickets
Bowlers:
Snapshot of 2018 Bowlers with 2 PCA components
Variation in the data explained by the components:
Component 1 = 13.25160512
Component 2 = 9.08744071
Component 1 explains the larger share of the variation, so we use it as the Ranking Index to rank all the players.
Below is an example of the coefficients assigned by PCA to our features:
Component1/RankingIndex = 0.45589429 × BallsBowled + 0.43463612 × RunsConceded - 0.22240624 × Economy + 0.45347261 × Wickets + 0.30987845 × Ct_St + 0.44498349 × MatchesPlayed - 0.19604465 × Consistency - 0.12522551 × WicketTaking
Note: A negative weight means that a lower value of that feature is better, e.g. Economy.
All-Rounders:
Snapshot of 2018 All-Rounders with 2 PCA components
Variation in the data explained by the components:
Component 1 = 12.31900172
Component 2 = 9.43282814
Component 1 explains the larger share of the variation, so we use it as the Ranking Index to rank all the players.
Below is an example of the coefficients assigned by PCA to our features:
Component1/RankingIndex = 0.351 × Runs + 0.326 × Balls + 0.338 × StrikeRate + 0.297 × Fours + 0.338 × Sixes + 0.324 × HighestRunScored + 0.200 × Ct_St + 0.247 × RunOuts + 0.296 × MatchesPlayed + 0.265 × HardHitting + 0.297 × FastScoring + 0.02 × RunningBetweenWickets + 0.45589429 × BallsBowled + 0.43463612 × RunsConceded - 0.22240624 × Economy + 0.45347261 × Wickets + 0.30987845 × Ct_St + 0.44498349 × MatchesPlayed - 0.19604465 × Consistency - 0.12522551 × WicketTaking
Note: A negative weight means that a lower value of that feature is better, e.g. Economy.
Similarly, all the Batsmen, Wicket Keepers and Bowlers were ranked for each year, and players with a consistent rank throughout the years (note: consistent, not necessarily the highest) were chosen, keeping the constraints of the competition in mind.
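One possible reading of "consistent rank" is sketched below. The text does not say how consistency was measured, so the criterion used here (mean yearly rank, with ties broken by the spread of the ranks, and a minimum number of seasons played) is an assumption, as are the function names and the `min_years` parameter.

```python
from statistics import mean, pstdev


def yearly_ranks(scores_by_year):
    """Convert {year: {player: score}} into {player: {year: rank}}; rank 1 is the highest score."""
    ranks = {}
    for year, scores in scores_by_year.items():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for position, player in enumerate(ordered, start=1):
            ranks.setdefault(player, {})[year] = position
    return ranks


def consistent_players(scores_by_year, min_years=3):
    """Order players by mean yearly rank, breaking ties by the spread of their ranks.

    Players who appear in fewer than min_years seasons are skipped.
    """
    table = []
    for player, by_year in yearly_ranks(scores_by_year).items():
        values = list(by_year.values())
        if len(values) < min_years:
            continue
        table.append((mean(values), pstdev(values), player))
    return [player for _, _, player in sorted(table)]
```

Here `scores_by_year` would hold the Ranking Index of every player of one type for each season; the competition constraints would still have to be applied to the resulting list by hand or in a separate step.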
Key Difficulties:
Some players missed a season, some had one year in which their performance dropped, some stopped bowling or batting from a particular year onwards (though still designated as All Rounders), and some played for the first time in 2018. Our team was nevertheless chosen on the basis of consistent performance across all years from 2010 to 2018. Data from before 2010 was not used because most players active in 2019 had not played in 2009 or earlier. We also noticed that the data for 2010 and 2011 were identical, so we used only one of the two.
Data Wrangling:
● The same player's name is written in different formats across years. Some years abbreviate the first name while others do not.
● Player Type was not present in the dataset before 2017.
Both of these problems had to be fixed before any further analysis.
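Both fixes can be sketched as follows. The text does not describe how they were actually done, so this is only one possible approach under stated assumptions: names are matched on (first initial, surname), which can wrongly merge two different players who share both, and missing player types are backfilled from seasons that include them. All function names and the "Player" and "Player Type" keys are hypothetical.

```python
def name_key(name):
    """Reduce a name to (first initial, surname); 'V Kohli' and 'Virat Kohli' share a key."""
    parts = name.replace(".", " ").split()
    if not parts:
        return ("", "")
    return (parts[0][0].lower(), parts[-1].lower())


def canonical_names(names):
    """Map every spelling to the longest spelling that shares its key."""
    longest = {}
    for name in names:
        key = name_key(name)
        if len(name) > len(longest.get(key, "")):
            longest[key] = name
    return {name: longest[name_key(name)] for name in names}


def backfill_player_type(rows, type_by_key):
    """Fill a missing 'Player Type' from a lookup built on seasons that have it."""
    for row in rows:
        if not row.get("Player Type"):
            row["Player Type"] = type_by_key.get(name_key(row["Player"]))
    return rows
```

A lookup such as `{name_key(r["Player"]): r["Player Type"] for r in rows_2017}` (with `rows_2017` standing in for a season that has the column) could serve as `type_by_key`; players who never appear in such a season would remain untyped and need manual labelling.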
Our playing eleven, deduced from the PCA rankings:
1. Batsmen: Virat Kohli, Kane Williamson, Shikhar Dhawan, KL Rahul
2. All Rounders: Hardik Pandya, Shane Watson
3. Bowlers: Rashid Khan, Jasprit Bumrah, Bhuvneshwar Kumar, Sunil Narine
4. Wicket Keeper: MS Dhoni
Our choice of captain is MS Dhoni and Vice-Captain is Virat Kohli.
Note: All the constraints were kept in mind during team formation.
The PCA was run separately on the batsmen, bowlers, wicket-keepers and all-rounders datasets after the split. Keep that in mind when running the code.