Data Analysis project with Python
Welcome to this football data analysis project in this project we're going to learn techniques used in that analysis and also in data science such as data collection data cleaning and preparation to build a simple model that predicts the winner of the FIFA World Cup 2022 so first we're going to strike football matches from all the World Cups played so far then we're going to strike the fixture of the World Cup 2022 and then we're going to clean all this data once the data is clean and transformed we're going to build a simple model to predict the FIFA World Cup 2022 and to do this we're going to use Python libraries such as pandas selenium and sci-fi alright now let's get started in this video I'm going to show you how to easily do web scraping using Python and pandas with pandas we can extract data from some websites like Wikipedia and do this in a couple of minutes in this case I'm going to track data from the FIFA World Cup Qatar 2022 and we're gonna scrape these tables that you can see here and on the next video I'm going to show you how to scrape data from the World Cup 1930 to 2018 and also the fixture from this coming World Cup Qatar 2022 so let's get started all right to easily script tables from websites such as Wikipedia what we have to do first is install pandas so we have to write peep install pandas if you're using Jupiter notebooks like me you have to add this uh this sign and then you have to run this in my case I already have pandas so I can read I can write only uh import pandas as PD so this is how you do this then what we have to do is use a method called read HTML so I write read underscore HTML and then we have to write the name of the website in this case I already have the name of the website but what you have to do is go to the website I had before and copy and paste it so you have to go here and copy and paste this website so in my case I'm just going to paste the website and it's here all right after we do this I'm going to give a name to this expression so I'm going to name it as all underscore tables so we're going to scrape tables only and I name it all underscore tables so I run this and now we can see the content inside this variable so I run all underscore tables and we see that we have a lot of content so not all of them are tables because if we see the content here I'm opening this square bracket 0 to 4 to see the first element so we can see that this is a table but this is not that table that we're looking for we're only looking for tables from the World Cup which is group a b c until H so what we have to do here is to uh go and write one by one until we find the group a so I already know that group a is the number 11.
So if I write 11 we have the table of group eight so is this one and then if we want the table of Group B we have to go until I believe is uh 18. so if we run this we have the the table of Group B which is this one so we see here England uh Iran United States Wales and it's here so it's the table so you can see that we have to go uh every seven tables so 11 plus 7 18 so 18 plus 7 25 and we should get a table of Group C which is Argentina Saudi Arabia Mexico and Poland so is this one so we successfully found the pattern to found to get all the tables so that's basically what we have to do to scrape all the tables of this website so the last table that we want I believe is table 16 if I'm not wrong yeah this is the last table which belongs to This one belongs to Group H I believe so is Group H so this is the last table we want so that's it now we can do some modifications to customize our tables for example we can change the column name which is this one uh I want this to rename as team and also we can get rid of this table that I'm not sorry this column that I'm not gonna use in this case qualification I'm not going to use this column in the future for a future project so I want to get rid of this and to do this we can use a for Loop so to modify each data frame so what I can do here is first right for I in and then I write a range so first we have to write the first element which is 11 which is this one then the last element which is 60.
And then the step in this case I'm going after seven tables so seven so we get 11 18 25 and until actually it's 67 because the last one is not uh included so 67 is not going to be included but 16. so that's it if we print this we see all of this so 11 is group 8 18 is Group B and so on so we have this now I want to link each number which the table that I'm going to scrape so for example the first table is group a the second is Group B and the last one is Group H so to link this number with the table what we have to do is use another library in this case I'm going to use the string Library so I write a string from string import and now I import ASCII so I write ASCII in this case I want it in uppercase so I write underscore uppercase and I'm going to name this as the alphabet because this actually gives me the alphabet so I run this and now if I copy and paste this one here we can see that we have the alphabet which is this one so this is my alphabet and this is going to help me link the letters with the numbers and that's very good if I want to better organize my tables because this is very ugly I don't want 11 18 25 that doesn't mean anything but group a b and c until H actually has a meaning so I'm gonna link them and to do this I'm gonna use a function called zip so if I write zip here and open parenthesis now we have to link these numbers with the letters so I can write alphabet as my first element and then this range as the second element now I can write letter and now I can print both so letter and I then I run this and as we can see now we linked this letter with this number so 11 is for a and 60 is for H so we successfully linked this too now I'm going to clean uh these tables or these data frames and then I'm going to create a dictionary so first I'm going to clean this data frame because it has some dirty data here so first what I'm gonna do is I'm going to use these old tables I'm going to copy and paste it here so you can see it much better and now I'm going to use this variables so now I write all underscore tables and I'm gonna um open these square brackets and use I so for each iteration I'm gonna get a data frame and it's gonna be uh for example data frame of group a then of Group B until Group H so I'm gonna get one by one then I'm gonna rename each uh each column that is named Team BTE I don't want this name I want to rename it so I use DF that rename open this parenthesis then columns are equal to and we open these curly braces then I want this uh column name rename as team so that's what I'm gonna do then I write in place equal to true so we save all the changes that we're making here so if we run this we're gonna have an issue because not all the all the data frames have this thin VTA column if we see here uh data frame 60 this one has this one and probably 25 yeah it has the same column and 18.
Yeah it has but 11 it has this weird uh name which is huge and that's gonna cost me a problem here because that data frame which is the first one doesn't have this name so I'm gonna get an error saying that the column name is not there so to avoid that issue what I'm gonna do is use a method called um uh columns actually is an attribute I believe so here I'm going to write uh I'm going to open here and I'm going to write uh I'm going to use the one of these data frames so you understand me much better and I'm gonna just copy this one for example and write it here then this is my data frame of course then I write columns and if I run this you can see that I get all the columns post team VT pld I'm going to run this here so you can see so all of this here are here all the all the letters in both that represent the columns are in my list so now what I can do is um go to 11. and see what I get so we see that we get all the columns and yeah but this looks ugly but we see that the second element is team and if we go to for example 18 we see that the second element is still team and if we go to 60 we see that it has the same position so what we can do is just get the second column which is represented by this index one in Python it starts the index with zero so the second will be one and if we run this we have this name of the column so we don't have to write the name of the column but only use this attribute and we get the name of the column we want uh independently of the of that name so if we write 11 we get that one uh actually actually 11 here sorry I made a mistake 11 here and here uh one so yeah this is the ugly or huge uh column name so now what we have to do is just copy this you know uh put it here so instead of writing this column name I'm going to use uh DF dot columns one so this is this represents the second column of each data frame that we have here so group a until Group H so now I delete this and now it's zero so now finally I'm gonna delete this column as I said I don't want this qualification column so I use the pop method and write the name of this column in this case all the data frames have this name I checked this before and all of them have this column name are the name is the same and finally I'm gonna write the I want to create that dictionary and to do this I have used to write a an empty dictionary so I write did underscore table equal to and I have these curly braces so this indicates that I'm creating an empty dictionary and then I write the underscore table and then I open these uh square brackets sorry this uh yeah square brackets to create a new element so for example I want my new key in this case name as group a and then this is equal to the value which is going to be that DF so in the first iteration we get a element a and then we get a data frame in this case table a or group a and this is going to be in this key and to create different keys not only group a but group a b until H we have to use uh this F string and then open these curly brightness so these curly braces in the F string allows me to put a variable in this case I'm gonna write here the letter which is this that I'm I obtained from the alphabet and yeah this is going to help me get from a to H so that's pretty much it now we can run this and see the results so now I run this and yeah it successfully run you know I can see the content of my dictionary so I write the underscore table and now I can use the keys method to see all the keys so I have from group a to Group H great now I can see the content of each key in this case group a so I see I write group a and now I see the table of group a now if I write Group B I get the table of Group B and so on if I change to Group H this is the table of Group H and that's how we successfully scrape all the data of this website and also we created a dictionary to better manage all this data finally uh one extra step that you can do in case you want to use this this dictionary in a project as I'm gonna do in this series of videos is uh use the pickle library to export this dictionary so to export this dictionary and use it in another file we can use pickled so we only have to write import pickle in case you don't have it you just have to install it with pip pip install Pico in case you don't have it I already have it installed so I'm gonna use this one so to do this to export this dictionary in a file what we have to do is write the following we write with then right open then we write the name that we want to export this file so the name I want arrow is digged underscore table you can write whatever you want but I'm going to name it like that then I write w b which stands for write bytes I believe and then I write as and the name as uh or the name I want to sign so this represents all of this represents outputs and now I'm going to use pickle so I write pickle that dump to dump all these data in my dictionary so I write dig underscore table and then I write the name um output here so basically what I'm saying here is all the data that is inside this underscore table that we created uh all this data put it in this uh file that I'm creating so put it inside this and now we only have to run this and that's it if we go to our working uh directory we can see that there is a new file and it's digged underscore table and that's how we export our dictionary and that's it that's it for this video in the following video I'm going to show you how to scrape the data of all the matches uh that are going to be played in the World Cup so basically we're gonna get the fixture and also I'm gonna show you how to extract all the historical data from the from all the World Cups from 1930 to 2018 so you can have a database of all the matches but that's going to be on the next video in this video I'm going to show you how to scrape data of football matches from the FIFA World Cup so we're going to scrape all the matches played so far from 1930 to 2018 and also we're going to script the fixture of the FIFA World Cup that is in Qatar which is this one and we're going to use python and a library called Beautiful soup so let's get started okay to script the data of football matches from all the World Cups first we need a list with all the years that represent a World Cup for example from 1930 to 2018 so these are the years that had a World Cup all right I'm after that what we have to do is install the library to scrape a website in this case I'm going to use beautiful soup and to do this what we have to do first is install beautiful soup so what we have to do is open the terminal and write pip install bs4 which represents beautiful soup so we press enter and we install beautiful soup and after that what we have to do is import this Library so we write from bs4 import beautiful soup so now we have beautiful soup and after we do this we have to get the link that we're going to scrape in this case I'm going to start with the uh the 2014 World Cup and I'm gonna paste it here so I'm Gonna Leave the link in the description and I'm gonna give you the name which is going to be web so this is my the website that I'm gonna scrape after this what we have to do is send a request to this website and to do this we have to use the request Library so what we have to do is import requests I'm not sure if this uh Library comes with python in case it doesn't what we have to do is open up a terminal again and right peep install requests so we press enter and then we can continue with this so we need requests to send requests to the website that we're going to scrape so I'm going to copy these requests and then use the get method so I write get and I have to write the website where I want to send my request in this case this website that I uh I created the the variable so after this when we send a request to our website we get always a response so I'm going to name this variable as response and after we do this we're gonna get a response and we want the text of this response so we use the text attribute to get the text of this response so what I'm going to do now is print the this response so you can see what uh this looks like so I printed this and what we have here is the HTML of this website so the HTML behind this website is printed here so if you know some HTML you can recognize some elements like the tags and all of that but you don't need to know HTML to scrape a website but it helps so after we do this I'm going to create a variable name content so this content is gonna be parsed and to parse this content we need a parser and we're going to use a parser named lxml but first i'm gonna use uh this beautiful soup that we imported and then this uh accepts to um two two elements the first one is the content and the second one is the parser so content and then the parser which is lxml this lxml we need to install so we're not we need to open up the terminal and write pip install and then lxml then press enter and you install this parser so now I'm going to name this as soup this is the typical name that we use for uh for this and this soup is gonna help us extract data from this website we cannot extract data if we don't have this soup so now that I have the soup I can start scraping this website so basically what we did here is the the common uh steps that we need to get started with the beautiful soup now we can start scraping this website so I write soup dot find in this case to get an element we need to use either find or find all the first one help us get uh one element in particular in the second one find all helps us get uh multiple elements that match the name of the tag and the class name that we're going to write in a couple of seconds so in this case I'm going to use find all because I wanna get all the matches here uh all the matches that were played in the group stage in The Knockouts the semi-finals and the finals so I'm gonna get multiple matches so I'm gonna just find underscore all and then we need to follow a syntax but before I tell you this syntax we need to go to the website and we need to inspect this website to see the HTML uh behind this website so to do that I have to right click and then click on inspect so after this we get this uh developer tools and here we can see the HTML document so this is the HTML that represents this website and to get one specific element what we have to do is click here on this button on the left and then select any element that we want all right now to scrape a website what we have to do is find a pattern every time that we want to scrape a website we need to find a pattern that we can use in our code so here a pattern that I discovered is uh there is something with the matches with the rows actually so this match is inside a row so if I for example I select Brazil here I click on it I get this element but I don't want only Brazil I want the whole row that gives me that match so what I have to do is in this HTML document just scroll up a little bit until I find the row that contains all the data so I'm going to scroll up and finally I found this row as you can see this element highlights all the rule in blue and that tells me that this is the the road that I was looking for so if I now um click here in this Arrow I can minimize this element and I can see if there is a pattern here so first let's identify these uh the elements in this HTML element so first we have a div this is the tag then we have this uh item type it's a class name and then we have another class name which is class with the class uh with the the value football box so this is one Element let's see the next row so we have this one and as we can see it has the same elements is exactly the same so we have div and we have class and then we have football box is the same and if we scroll down we have the same for the next match so we can see these three rows have the same elements so we can use this to get all the rows in this website so we're gonna get from group a to group b c d until Group H and then semi-finals Knockouts and well the final so we're gonna get all the matches and that's good so now I'm gonna show you the syntax I was talking about before so to extract the data inside this element we have to follow a syntax and this syntax tells us that we need to uh to write first the name of the tag so here the tag and then uh the name of the class so we write class this is an a parameter class and then the name of the class so let's identify this in this HTML document or actually in this element the tag is div as we can see the name is div here in blue so here instead of that I write div and then in the class we have to write uh the name of the class which is football box so I copy this and then I paste it here this is the name of the class all right so now we have uh this element that represents this row so this is a match so I'm going to name this as match but since we're gonna get all the matches because it's gonna be like this one and this one and this one because as you might remember we're not using find find will give us only the first one but find all gives us all of them so fine all gives me all the matches so I'm gonna name it matches because we're gonna get all the matches inside a list so when we use find all we get a list with multiple elements matches is a list all right now what we have to do is Loop through this list which is matches so I have a list and I'm gonna Loop through it so I write 4 match in matches uh I write the colon and then I'm gonna extract each data that I want from this row so in this case I just want to get the name of the home team which is in this case Brazil the score in this case three one and the away team in this case Croatia so to do that I'm going to open this element with this uh click in this little arrow I think and then what we have to do is identify this element so I'm gonna scroll um uh scroll down a little bit and then I'm going to identify the one that gives me the home team which is this one F home so we have this element th with Tac th and then with class name f hum so this is the one I'm looking for then for the score is this one F score and for that way team is this one F away so we have all these elements now we only need to get them inside this for Loop so what I'm gonna do is write match in this case match represents the soup that we created before as I told you we cannot extract data without the soup but since this Zoop is inside this matches list we can use the match uh as a replacement we can say so this match is gonna get us all the data or it's gonna help us uh get inside of these elements that are here in the rows so we are now in the rows that is here now we want the data inside the row so we're going for f home so match Dot find in this case because I just want one element inside this row and then I'm gonna write first attack in this case is th and then I'm gonna write the name of the class which is in this case f hum so first th then Kuma and then class underscore equal to F hom so I have now this and now to get the text inside this element what we have to do is use the get text function so I'm going to write get underscore text parenthesis and with this I have the name of Brazil so the text inside this element is Brazil and I'm gonna get that so now I'm gonna duplicate this twice and I'm gonna change the name of the class the second one is f score and the last one is f away so now F score and F away great so now we have all these elements and now to see uh the results what we can do is print this so I'm going to print each of them to see the results that we have so far to see if everything is working fine so now I print this and now I can run it so I right click and run this to see what we have so far and as you can see all these elements were printed so we have from from the first match which was Brazil Croatia 3-1 which is this one until that last match which is uh Germany Argentina this was the final and as you can is if you scroll down you're gonna see that is here the last match so one nil and it's here one nil so we successfully extracted all this data and now what we have to do is um put all this data into a table and to do this in Python what we have to do is use a library called pandas pandas is a package that allows us to better manage the data that we we have so basically in pandas we work with data frames that is the equivalent of a table in Excel but it helps us manage our data efficiently in Python and to use pandas what we have to do is import pendants but first of course we need to install it so we have to go to the terminal right pip install pandas and yeah that's pretty much it we press enter and we install pendants so in this case I'm going to import pandas as PD so PD is gonna represent pandas and now to create a data frame which is a table basically it's simply a table what we have to do is create lists to build this data frame so what I'm gonna do here is grade 3 lists one for the home team the other for the score and the last one for the away team so I'm gonna write home equal to this empty list and then here the score then away and that's it now I have to append each element to this list so I'm gonna delete this print one by one so I don't get any uh any error so it's ready so now I'm gonna write home that a pant open parenthesis and with this we append each home team to the home list and then we do the same with the score and also we do the same with the away team so away that append then we open the parenthesis and that's it so we have each element in each list so with this we have the lists now what we have to do is create a data frame and something I like to do is first create a dictionary to organize my my data or my lists in this case so I'm going to create a dictionary called dig underscore football and I'm gonna open these curly braces and write the keys in this case you can name it as you want I'm gonna name it uh the same name I I put for the for the list so a home score in a way and uh values are gonna be the lists that we created so the home list which is this one then this core list which is this one and finally the away lists again these names can be different from the keys and finally what I'm gonna do is create a data frame so I write PD which represents pandas that data frame this is the method that we're gonna use and then I'm going to use this uh dictionary that I created and I have here so with this we create the data frame and I'm going to name it as DF underscore football so this is my football data frame and finally what I'm going to do is uh also create a column that identifies the year of this data frame for example this is the World Cup 2014 from the Brazil uh This World Cup belongs to Brazil but we don't care about the country but the year 2014.
So to do that to register that the Year we're gonna create a column named year and then I'm going to write it equal to in this case it's going to be only for Brazil and it's going to be 2014 which represents that World Cup uh 2014 Brazil so that's the final World Cup and now I'm going to print so you can see the data frame so I print this then I right click and run and we get this data so we have all the data that we extracted better organized in a data frame so we have from the first match Brazil Croatia the year 2014 and the last match Germany Argentina the final in 2014.
So that's how we track data from uh this Wikipedia website but we all destructed one single workout this 2014 but what we can also do is extract data from all the World Cups from 1930 to 2018 and then also we can extract the feature of the 2022 World Cup and to do this I'm going to create a function so what I'm gonna do is first Define a function which I'm going to name get underscore matches and this is gonna uh have as input the year because as you can see uh the year is going to change is the only thing that changes then I'm gonna put all of this inside this function and then we have to identify another pattern in web scripting we always look for patterns most of the time and here we have to see the website so I'm going to the website again and something interesting that Wikipedia has in this website in particular is that it organizes uh the let's say the links based on the or the workups based on the on the year so this website Remains the Same but the only thing that changes from for example the World Cup 2014 uh with the World Cup 2018 is the year for example if I write here instead of 2014 if I write 2018 and press enter voila we have the World Cup 2018 and that works for all the World Cups also for the 2022 which is the the coming World Cup uh Qatar and also for example here 1930 which is the first World Cup we can also get it by just right in here instead of 2022 1930 and we have the data of this world cup so now what we have to do is just implement this here in our code and to do that I'm gonna use something called F string F string is basically uh some kind of a string that allows us to put variables inside the string easily so to do this we only have to write the F in front of the quotes that we have here and then we have to uh to introduce a variable inside this string we only need to write these curly braces so now instead of writing 2014 we can write year so this represents the input that I have here so this is going to be for all the details that we have in this uh dealers list and then we have to look for the changes that we have to make because we're not using that year variable so everything here Remains the Same the same the same and here we have another change so instead of writing 2014 we have to write gear which represents my your input and that's pretty much it now we can test this function by writing up but there is something else here I need to return something so instead of printing this I'm going to use return and we're going to return this data frame DF underscore football so this is the data frame and finally we can test this function so I can write the name of the function get matches and then we can write the uh the year that we want in this case I'm going to write I know 2018 which is the last World Cup so 2018 and we can print this to see the results so we print this and now I'm gonna right click and then run this and see what happens so we have the result I scroll up and we have from the first match which was Russia uh Saudi Arabia 5 nil and the last match France Croatia 4-2 so this is our old data on in this world cup and as we can see this uh this function is working just fine but there is a little detail I want to mention there are some uh some gears or actually some workouts where we don't have the same pattern so for example let's go to 1930.
In this world cup actually it's here 1930 if we inspect this website we can see that we don't have the same pattern that we use here for the for the home score in a way so for example if we use that function and we strike data from 1930 and we press run actually uh we don't get all the matches I'm not sure if it was for 1930 I think it was for the World Cup 1982 I'm gonna run and as you can see we only have uh four matches and this is uh very weird because in the World Cup 1982 there were more matches so if we check check out here we see that there are a lot of matches but we could instruct those matches because we don't have the same HTML document so if we inspect this we see something different for the for example the home team it doesn't have this uh this thing that we use the f-home class name it has a different um a different class name and different tag name I think so what I'm trying to say is that for the Old World Cups it uses it we might not extract all the data but we can use another approach that I'm gonna show you in the next video so our function is working just fine but for some years we cannot just extract all the data because they just don't follow the same pattern but as I said before we're gonna deal with that later in the next in the next video actually and now I'm going to finish this by extracting all the the data of all these uh years so what I'm gonna do is first use a for Loop so I'm gonna Loop through this gears list so I write 4 year and years and then I'm going to use the function to get all this data so get matches and then I have to write the name of the year actually this year variable that is inside this Loop and we're gonna get all the data of all the workups so now we need to store this inside another list uh so to create this list easily I'm going to use list comprehension which is basically the same but with a different syntax so I'm going to use get matches here I'm going to put this here and then I'm gonna put this here this might look uh weird if you don't know this comprehension but it's exactly the same but this helps me create a list uh faster so here I can name the list which is I'm gonna name it FIFA and this is basically the same I can use this approach that I uh used here with home and empty list and then using a pen but this is another approach that does exactly the same but it uh it's shorter we can say so after we do this we get all the uh we get one data frame for each gear so we get a data frame from 1930 1934 and so on until 2018.
So what we have to do next is concatenate each uh each data frame so we only have one data frame or one table and when I say concatenate I mean join all the tables and to join all these tables we have to use that concat method in Panda so I write PD that concat to join all the data frames or all the tables of all the World Cups so now I have to write FIFA because this is the list that has all the tables and then I'm gonna add an extra parameter which is ignore underscore index and set it this equal to true this basically ignores the index of each data frame which is not absolutely necessary but I just want to add it then I'm gonna write DF underscore FIFA which is gonna be the final data frame which is a combination of all the data frames that we extracted here from 1930 to 2018 and finally I'm going to export this to excel actually to a CSV file so I write two CSV and then I'm going to write um a name for this file in this case I'm going to name it uh as FIFA underscore World Cup underscore historical data that's CSV we need to write this tension and I'm gonna add a parameter which is index equal to false this uh allows me avoid this index that are here so for example here I have the index 0 1 2 and 3. if I don't want to export this into the Excel file or the CSV file I have to set index equal to false so I only export The Columns home score away in year so that's why I write index equal to false and with this we have all the history critical data from 1930 to 2018 and finally I'm gonna also get a picture of 2022. so if we go here to 2022 we're gonna see that we have exactly the same uh the same structure but we have a single small difference here we have the matches but we don't have the results because the the World Cup uh Qatar 2022 hasn't been hasn't started actually so we only have the home team the white team but we don't have the result of the match so that's why this is going to be only the fixture we're gonna get uh instead of the score we're gonna get match one so here I'm going to write um get matches which is the name of my function and then write the year 2022.
So with this we get the fixture so I'm going to name it as DF underscore fixture so this is the fixture and now I'm going to export this as a CSV file so I write two underscore CSV and then I'm going to name it uh something like this that I have here but instead of historical data I'm going to name it uh underscore picture so that's CSV and then index equal to false so we don't support the index and that's it now we can run to see all the results so I'm gonna right click run this script and wait a couple seconds all right the script finished running and now we can open the files that we supported this FIFA World Cup historical data and the fixture you can open it with excel in this case I'm using Python and I have a plugin that allows me to read these CSV files inside of pycharm so I have these two files fixture and historical data and I'm just gonna open this first one historical data and I have this uh this file so this is the CSV file that I'm opening in pyterm and this uh these letters and bsp is just something that the plugin doesn't read well but actually this is uh well this was extracted successfully so we have all the matches from the first match in 1930 to the last match in 2018 so the final friends Croatia for two then we can also open up the the fixture of the World Cup Qatar 2022 so we have the first match which is going to be Qatar Ecuador and the last match which is going to be the the winner uh I don't know winner one with winner two so this is the final match and that's it that's how we scrape data using beautiful soup and in the next video I'm gonna show you how to with scrape the missing data that we couldn't scrape using beautiful soup because we successfully scraped data from most World Cups but there are some workups like 1982 where we couldn't scrape the data because uh it has a different pattern so in the next video I'm going to show you how to extract that missing data using Python and another Library called selenium in this video we're going to learn how to clean data using the pandas Library so in the previous video we create real world data and when we work with real world data sometimes we can have some missing data or some blank spaces in some words or anything that needs to be cleaned and to do this we're going to use pandas and python okay to start with this tutorial we need to have pandas installed on our computer so in case you don't have pandas you have to run peep install pendants on the terminal or if you have Jupiter like me you have to add this exclamation mark and then run this on the on this cell once we have pandas you can run this and import pandas as PD then we're going to read three CSV files and these three CSV files are the the data that we script in the previous videos so this uh the first one the World Cup underscore matches data this one is from a video where we scrape the data from Wikipedia and in this one we use beautiful soup and well this is the first one then we have the workout picture and we use pandas to scrape this data and finally we have the World Cup missing data which is the third one and this one was like the complement for that for the first one because when we scrape this uh Wikipedia with beautiful soup we didn't uh struck all the data because of some inconsistencies in the website so we have to scrape again with selenium to complete the missing data and well this is the data set or the CSV file that we got so now let's run this and let's have a look at these data sets very quickly so here is the historical data this has the data from the World Cup 1930 to the workout 2018 then the missing data which completes that uh the previous data frame and then the picture and the fixture are all the games that are going to be played in the coming World Cup Qatar 2022 so these three data frames are going to help me predict the winner of the World Cup 2022 and before I do the prediction and run any uh any model or anything we want what we have to do first is clean the data and this is the uh something we must do always in any data analysis or data science project okay now let's start cleaning the DF underscore fixture data frame so this is the data frame and if we have a look at this data frame we see that everything looks fine but if you uh you check this in detail this data frame we're going to see that the strings inside the home and away columns have some blank spaces so this is not visible right now it's very hard to see but if you work with the with the data as I did before making the video you're going to see that there are blank spaces for example we can have as an example the word Qatar and then a blank space and we don't want that we want to clean this data so what we're going to do in this case is use the strip method so we have to write a DF fixture home we have to select the column and then we have to use the Str attribute and then use the strip method as I mentioned before and with this we make sure that uh we delete the leading and trailing spaces uh the blank spaces so if we run this we only get the words we're interested in so only that name of the countries and we do the same for the away team and then to save this uh these changes I'm gonna do uh just equal to and then update the values so that's how you clean that DF underscore feature so if I run out this uh sorry if I run this you're gonna see that it looks the same but actually we got rid of the trailing and leading spaces so that's something we have to do so that DF underscore fixture is clean so everything uh everything else is fine so now let's continue with that DF missing data and let's clean this DF missing data and then add it to the DF historical data because as I mentioned before this historical data and missing data are the same both are the World Cup matches but in the first one we couldn't extract all the workouts I think there are some workouts missing and these workouts are considered in the missing data data frame so let's clean the missing data data frame and then let's add it to the historical data data frame so to do this first something that I want to show you is that this missing data has some new data and this is why we have to clean it so if I do uh DF missing that home and then use that is new method and then show this we can see that there are 64 rows that have missing data so it has the n a n value which means another number and indicates that it is missing data so to get rid of this missing data or this null data what we have to do is use the drop in a method and to do this first I'm gonna I'm going to copy this and then I'm gonna write drop n a so I write drop n a and now what I have to do is write in place equal to true to successfully drop the values and save the changes and update the data frame so with this we uh we drop the N A values so if I run this let me uh comment this out email run this we're going to see that if we run this one again the line of code that returns missing data we're going to see that now we don't have anything so we got an empty table and this indicates that we successfully drop the missing data with this line of code okay now let's continue this and for the next step I'm going to concatenate actually I'm going to delete this one and yeah I'm going to concatenate uh the data frame so I'm going to add the missing data to my historical data so now that this data frame is clean I'm going to add it and to do this I'm going to use the concat method and this method allows me to concat two data frames or more data frames so here I'm going to write uh where is it is DF underscore missing data and the order is DF underscore historical underscore data so it's here okay now I'm gonna add one extra parameter which is ignore underscore index and I'm going to set it to true so I do this because I want to ignore that index of data frame 1 and data frame 2.
So if data frame 1 has index from 1 to 100 and data Frame 2 has indexed from 1 to 300. with this I'm ignoring this uh original indexes so when I combine them this is going to start from 0 to the to the end so it's not going to consider the original indexes and that's good so now I'm going to concatenate in the name I'm going to set for this concatenation is going to be historical data so this is my final data frame and now uh what I'm gonna do is uh after uh combining these two data frames I'm gonna drop some potential duplicates that we could get so maybe when we extracted again the data here the missing data we could have extracted also some data that was already included here and this is not good because we don't want a repetitive data and to avoid that to make sure everything is going to be fine what we have to do is write this the name of the data frame which is the the result of the concatenation and then we have to write a drop duplicates method so I'm not sure if there is duplicated values but just in case I'm going to use drop duplicates then I open parenthesis in place equal to true so with this we drop the duplicates in the historical data column uh sorry data frame then what I'm going to do is sort this data frame by the year so we write sort values allows me to to sort a data frame uh by a year or by a column in this case the year and then I write in place equal to true and finally I'm going to show this data frame so we're dropping the duplicate first we're concatenating then dropping the duplicates and then sorting by the gear so I'm going to run all of this and now we have this data frame and now we see that is this data frame is the final historical data data frame after the concatenation we don't have duplicates and this data frame was sorted by a year or so it starts with the first workout 1930 and it ends with the last workout played so far which is 2018 and great so we successfully cleaned the uh the missing data and then we added it to the historical data okay now let's continue cleaning uh DF historical data because so far we only cleaned the missing data and then we added it to the historical data but this historical data the final data frame is not clean we have to make some modifications to to clean it and also we have to make some modifications to have it ready for the processing that we we're going to do later before you run any machine learning or anything you want this data frame has to be ready and right now it's not ready because for example the score column has this uh these numeric data but it's not only numeric data we have this hyphen in the middle and this is not good when we want to process the data usually we only want numeric data and we don't want anything else so we're going to get rid of this High thing in the middle and we're going to create two columns from for the home uh for the goals score uh from the home team and the goal score from the away team so let's do this and first what I'm gonna do to clean this historical data is uh do something that is uh a bit uh particular in this project but it's necessary so I was exploring this data frame and I found that this data frame or actually in the World Cups there was a match between Sweden and Austria that ended in a walk over and this means that one team was not in the was didn't wasn't part of the match so when one team the same play the match doesn't want to play the match or cannot play the match for some reason this ends in a walk over so it's like three nil for the team that that was actually waiting for the other team in the stadium and this is not something good because uh I want to predict the winner of the World Cup based on goals and if I say that one team has three nil uh without actually scoring those goals doesn't help me this doesn't help me predict uh the the winner of the World Cup because it's not like they deserve those three goals so I'm gonna change this data actually I'm gonna show it right now so I'm gonna copy this one and show you much better what I mean so I'm gonna copy and show you here so I run this and we have this uh this match Sweden Austria 1938 and this is uh not numeric data this is another thing that is not good for us so as you can see score here in the column we have numeric data for one for example here and four two here but here we have a w slash o that stands uh for walk over and this is not something good we have to either change this to three nil or uh delete it and in this case I'm gonna delete this row and to do this first I'm going to get the index so here first um I got this with the Str contains which allows me to get only the values that contain this this particular word in this case Sweden and then once I got the row I want I can use index and this index uh gets the index of this uh of the whole thing so when this condition is through we're going to get this index and this is the index that we're going to drop so this all of this is here and I name this as delete underscore index and we're gonna delete it right now so to do this I'm going to use the drop uh method so I write first the name of the data frame and then I write drop then I open parenthesis and with drop we can drop either columns or rows in this case I'm going to drop a row so I write index equal to m delete index then in place equal to true and this is ready so if we run this we can see that this is not here anymore so just to verify this I'm gonna copy and paste it so we can verify that we actually drop this uh this row so if I run this we got uh we get this empty data frame which indicates that this is not anymore so great we drop this this match that was uh wasn't good for my analysis so now let's continue cleaning the rest of the data okay now something that I want to show you is that some columns like the score columns actually only this color the score columns not only has this uh this clean uh score that you see here for example for one but some of them have not only digits but has some other things for example here I'm using this uh regex that stands for regular expression that gets the the matches that are not in the matches the characters not in Brackets so in this case it's matching anything that is not a digit and that it's not uh this hyphen so anything that is not a digit and is not a hyphen is not the score right it's maybe a text maybe it's something else but that's what we want to identify so here I'm using the string contains and let's find out what's not a digit and what's not a hyphen in our score column so I run this now I need to add this one here to show and okay now we got this so for some reason some scores as you can see here have this parenthesis and then a e t I don't know what that means but we don't want that we only want the uh digit hyphen digit that represents the score and to get rid of this I'm going to use the replace method so this method allows me to replace one string with another and to do this instead of writing here contains I'm going to write uh the name of the column first Str and then I'm going to write replace so I'm going to replace this which represents something that is not a digit and not a hyphen so anything uh but deeds and hyphen so all of this in this example and here all of this and we're going to replace this with uh an empty string which is this one so after we do this we're gonna uh turn on regex so I write reg x equal to true and that's it so with this we replace anything that is at a digit in hyphen with an empty space so we can run this and see the results so I'm gonna run and now we can see that uh we replace this value successfully so now what I'm going to do is write this and I set this equal to I mean the DF historical data score I set this equal to the expression we had before and with this we're updating all these columns so great now if I run this we shouldn't have uh the those parentheses with a e t anymore great now let's uh also uh clean the names of the the words inside the home in a way columns so to do this I'm just going to copy this and I'm going to strip this because it might have some some data with blank spaces and we have to get rid of the leading and trailing spaces so what I'm gonna do is write this actually I'm gonna do it below and I'm gonna write it here and actually here and then what I'm gonna do is change the name so instead of uh instead of DF underscore picture what I'm going to write is this so I can replace each of them and clean it easily so if we run this we can see that actually we won't be able to see it because as I mentioned before those plug spaces are kind of invisible but now we have uh we're certain that we don't have either uh leading or trailing spaces so that's good because that can spoil our analysis okay now uh next we have to split the score columns into Home and Away goals and then drop the score column so with this we're gonna get rid of The Hyphen in the middle and we're gonna split the home the the goals score from the home team and the goals scored from the away team so this is just to have this data better organized and also have only numeric data so I'm gonna do this and to do this I'm going to use the split method so I'm going to write here first the name of that data frame then the column I want to work with then Str and then split and inside I'm gonna write this element which is the hyphen and then I'm splitting with a hyphen and if I run this we see that we get a list for each row and this list represents only the elements that are numeric so for example four one and four two here four one and four two so we have an at least with these two elements and now if I want to put each element in a column in a separate column what I have to do is add expand equal to true so with this we get two columns and one data frame so for one for two and this is what we wanted now if we want to add this to the historical data data frame what we have to do is write this the name of the data frame and then specify the new names of the column so I want to name the column 0 as home goals and I want to name that column one as away goals so that's it that's everything we have to do so if we run now this we should get um the two columns so I'm gonna run this and I'm going to show you now the value here so here's the value we have this uh the home score away in year uh the original data frame and then we have Chrome goals and Away goals and this is what we wanted so instead of four hyphen one now we have four in one column and one in another column and this helps us a lot when we want to do some processing later so that's what we want it and now we have to get rid of the score column because we're not going to use it anymore so I'm gonna drop this uh the score column using the drop method and right in here score and then here setting axis equal to 1.
So with axis equal to 1 we indicate that we want to drop a column so you have to write axis equal to one for columns then in place equal to true we'll run this and we have a look at the data frame we'll see that we have now all the columns but not the score column because we dropped it okay now to continue this let's rename the names of the columns here home away in year and also let's change the data types so let's uh I'm going to delete this and let's do this here so let's rename the columns and to do this we have to use the rename method and we have to write our name then columns equal to here columns because we want to rename columns and then we have to write the original name on the left and on the right the new name so instead of home I'm going to name it as a home team because it represents the home team and then instead of away I'm gonna write away team and the last one is going to be the year so I'm gonna write here near and I'm gonna uh write the first one in upper case so that's uh the year and finally in place equal to true to update this data frame so in place equal to true and that's it now I'm gonna change the data types because here if we see again this data frame apparently this uh the year home goals in away goals have numeric data I mean the data types will be uh int which is int which stands for integer but that might not be the case if we check D types which which is a attribute that Returns the data types we'll see that only the gear has integer values so only the year has actually numeric values uh that in this case is an integer and home goals in away goals have object and this represents strings and we don't want the strings we want that home goals and Away goals we want it to be uh we want them to be only uh integers not object or strings so what we have to do here is write this and then we have to write as type as type allows us to change the data type of any column we want so now we open a dictionary and we specify any column we want so we want to change home rules to int then we want to change away goals and it's going to be the same in this case int int integer and then we could also set here to integer we don't need to do that but I just wrote it here and well that's it so I'm going to add the in place equal to true to save this uh this these values are actually what I can do to here is just write DF historical data equal to this and with this we can also update the whole data frame that's another way instead of writing in place equal to true so now I run this and now if I show again this data frame we see that we have this new uh new columns or actually this new column names and then if we see the data types with the types we'll see that now home goals and Away goals are integers and that's what we want it okay finally we're going to create a new column named total goals that's not necessary but let's just create it in case we need it so the total goals is going to be the home goals plus the away codes and that's uh what we're doing right now so away goals plus home goals and this is going to be my total goal so I write this equal 2 and then I add a new column which is total gold so we'll run this and I got an error because I didn't write here the square bracket so now I run and if I show you the data frame we'll see that we have a new column named total goals which represents the sum of these two columns here okay that's it for the cleaning part so we have now all our data cleaned and to finish this we have to support this cling data frame so we have to export it to to CSV files if we want it so to do this I'm gonna write first historical data this is the last data frame we cleaned and I'm gonna write two underscore csb and then we have to write the name of the data frame in this case I'm going to use the original name which was let me see here historical data it was this one and I'm just gonna add the word clean in front of that name so clean underscore FIFA World Cup matches and that's it so now I can uh I can also add index equal to false and with this we're not going to support the indexes in this case 0 17 16 and so on but we're going to support only the data inside these columns that I'm highlighting right right now and well now let's continue with the feature data frame which is this one and for the fixture we have this name so I'm going to add also the name uh I'm going to use the same name and I'm going to add the word clean in front so I Write Clean underscore and now this name so now I copy this and then I'm gonna paste it below so this is the name and then I'm gonna write your index equal to false and with this we're going to export these two data frames and now I'm not exporting this data frame name missing data because we already include this data frame inside the historical data data frame before when we concatenated these two data frames so when we use the PD concat we put them together and we actually uh put all of this in the data frame historical data and well if we run this and then we check our directory I have it here now we see that we have two files this clean FIFA World Cup picture and then clean FIFA World Cup matches and that's it that's how you clean data using pandas and before I finish this video I want to show you some uh some practices that we often use to do some verifications sometimes we can verify if we successfully clean the data by checking the number of elements that we have because by mistake we might have got rid of some rows or maybe we didn't get rid of some rows that should have been deleted and we could verify this uh using the Len function and in this case with the Jiro because we can verify how many matches we have per World Cup if we run this we can see each workout I mean each year and then the number of matches because I'm using the length of the data frame and this Returns the number of rows so the number of matches so if you know something about football you will know that in 2018 there were 64 matches between the among the group stage The Knockouts the quarterfinals semi-final and the final there were 64 matches but this was not the case for the all World Cups for example 1994 had less matches but if for some reason for example you got here in 2018 30 matches you will see that you may be by mistake uh dropped some rows here in 2018 because you should have 64 and you had in the example you had 30 for example and that's something that could help you and verify that you successfully cleaned your data so that's something that you can do sometimes when it's possible and this is something I recommend you always to do to verify that the final data makes sense so there are many ways to do this and that depends on your your data set and also on the information that you have available and that's it in this video we clean the data with script in previous videos now that the data is clean we can start making our predictions with python in this video we're going to predict the winner of the World Cup Qatar 2022 and we're going to use Python for this we're going to use the pandas library and also we're going to use scipy to import that poisson distribution so let's get started okay to start with this video we have first to import pandas pickle and uh the poisson from scipy so to do this first we have to install these libraries so in case you don't have any of them you have to write PIV install and then either pandas or pink hole or sci-fi so if your in Jupiter notebooks like me you have to write this exclamation mark and if you're on any other IDE you have to open the terminal and run this this command pip installed and the name of the library okay once you have the libraries installed on your computer we have to run this and we successfully imported these libraries okay now for this video we're going to use this uh this CSV file we're gonna import this CSV file and we're going to read it with pandas so let's have a look at the CSV files first we're going to use the Dig table actually this dick table I web script a website to get the tables of the groups in the world cup and then I export this uh these tables with pickle and now I'm reading this table again with pickle so all these files this uh the underscore table and these two CSV files are going to be in the description and you can download uh you can download these files just uh just for the record we extracted this data in previous videos so if you are interested in learning web scraping how to extract this data in your room you can watch the videos but if you're not interested in that you can just download these files and these files has that clean data and you can read it in with these lines of codes okay so now then we have the clean FIFA World Cup matches and basically this has the data from the World Cup 1930 to 2018.
So we have all the historical data all the matches played in those World Cups and finally we have the World Cup fixture this is the feature of the World Cup Qatar 2022 and now we can read all this data and have a look to this to this tab data so I'm gonna show you first that dick table and this one has the groups as I said before so we have the tables and right now it has only the value of zero in each uh in each team because we have an uh run any model to predict who is going to be the winner between uh the matches that are going to be on the World Cup but we're gonna take care of that later so we have here all the groups from group a to Group H okay then we have the historical data as I said before from the matches from 1932 2018. okay then the fixture of the 2022 World Cup and this is the picture okay now let's see uh the next next thing we have to do and here we have to split the data frame historical data into two uh data frames home and away and to do this first I'm going to show you again this data frame and here we have uh this data with all the matches and what we have to do is to split this uh into two so we have to write for example DF home this is the one of the data frames and this is going to be equal to DF historical data and then we have to select only some columns and the columns that we're going to select are the home team then the home goes and then away goes so three columns and we're going to do the same for the the other one for the way team so I write DF away and then a weight team then home goals two and Away goals and that's it so with this we have these two um these two data frames and we can run this so we ran these uh lines of code and then for the next step we're going to rename some columns so I'm going to show you the name of this data frame uh the content of this data frame actually and this is the content so we have the home team home rules and Away goals and now we're gonna do uh we're gonna rename some columns for some particular reason and I'm going to explain to you right now but first i'm gonna write rename which is the method that allows me to rename a column and then write the column parameter or actually columns then I open a dictionary and we can rename any column we want in this case we're going to rename the three columns that we have here so first home team and then home goals so I'm gonna write here first original name then this colon and then the new name so instead of home team I just want to name it as team then we have home goals and I want to name this as goals scored so I don't care if it's a from the home or the way I just want to name it as home escort actually goes is called sorry goals escort and for away goals we're gonna name this as ghost considered so I'm going to open this uh quotes and then write goals considered so we have team for the home team then goes scored and then goals considered and this is basically because when the home the home team scores a goal those are the goals they scored and when the weight team scores a goal for the home team those are the goals that the home team concedes so that's why I put here I wrote here ghost is scored and goals considered so that's the reason why I'm writing this then we're gonna do the same for the way uh data frame but in this case I'm going to write DF underscore away now rename and here away team so it's something like this and then we're going to use again team instead of a weight team and then here there is som
So if I write 11 we have the table of group eight so is this one and then if we want the table of Group B we have to go until I believe is uh 18. so if we run this we have the the table of Group B which is this one so we see here England uh Iran United States Wales and it's here so it's the table so you can see that we have to go uh every seven tables so 11 plus 7 18 so 18 plus 7 25 and we should get a table of Group C which is Argentina Saudi Arabia Mexico and Poland so is this one so we successfully found the pattern to found to get all the tables so that's basically what we have to do to scrape all the tables of this website so the last table that we want I believe is table 16 if I'm not wrong yeah this is the last table which belongs to This one belongs to Group H I believe so is Group H so this is the last table we want so that's it now we can do some modifications to customize our tables for example we can change the column name which is this one uh I want this to rename as team and also we can get rid of this table that I'm not sorry this column that I'm not gonna use in this case qualification I'm not going to use this column in the future for a future project so I want to get rid of this and to do this we can use a for Loop so to modify each data frame so what I can do here is first right for I in and then I write a range so first we have to write the first element which is 11 which is this one then the last element which is 60.
And then the step in this case I'm going after seven tables so seven so we get 11 18 25 and until actually it's 67 because the last one is not uh included so 67 is not going to be included but 16. so that's it if we print this we see all of this so 11 is group 8 18 is Group B and so on so we have this now I want to link each number which the table that I'm going to scrape so for example the first table is group a the second is Group B and the last one is Group H so to link this number with the table what we have to do is use another library in this case I'm going to use the string Library so I write a string from string import and now I import ASCII so I write ASCII in this case I want it in uppercase so I write underscore uppercase and I'm going to name this as the alphabet because this actually gives me the alphabet so I run this and now if I copy and paste this one here we can see that we have the alphabet which is this one so this is my alphabet and this is going to help me link the letters with the numbers and that's very good if I want to better organize my tables because this is very ugly I don't want 11 18 25 that doesn't mean anything but group a b and c until H actually has a meaning so I'm gonna link them and to do this I'm gonna use a function called zip so if I write zip here and open parenthesis now we have to link these numbers with the letters so I can write alphabet as my first element and then this range as the second element now I can write letter and now I can print both so letter and I then I run this and as we can see now we linked this letter with this number so 11 is for a and 60 is for H so we successfully linked this too now I'm going to clean uh these tables or these data frames and then I'm going to create a dictionary so first I'm going to clean this data frame because it has some dirty data here so first what I'm gonna do is I'm going to use these old tables I'm going to copy and paste it here so you can see it much better and now I'm going to use this variables so now I write all underscore tables and I'm gonna um open these square brackets and use I so for each iteration I'm gonna get a data frame and it's gonna be uh for example data frame of group a then of Group B until Group H so I'm gonna get one by one then I'm gonna rename each uh each column that is named Team BTE I don't want this name I want to rename it so I use DF that rename open this parenthesis then columns are equal to and we open these curly braces then I want this uh column name rename as team so that's what I'm gonna do then I write in place equal to true so we save all the changes that we're making here so if we run this we're gonna have an issue because not all the all the data frames have this thin VTA column if we see here uh data frame 60 this one has this one and probably 25 yeah it has the same column and 18.
Yeah it has but 11 it has this weird uh name which is huge and that's gonna cost me a problem here because that data frame which is the first one doesn't have this name so I'm gonna get an error saying that the column name is not there so to avoid that issue what I'm gonna do is use a method called um uh columns actually is an attribute I believe so here I'm going to write uh I'm going to open here and I'm going to write uh I'm going to use the one of these data frames so you understand me much better and I'm gonna just copy this one for example and write it here then this is my data frame of course then I write columns and if I run this you can see that I get all the columns post team VT pld I'm going to run this here so you can see so all of this here are here all the all the letters in both that represent the columns are in my list so now what I can do is um go to 11. and see what I get so we see that we get all the columns and yeah but this looks ugly but we see that the second element is team and if we go to for example 18 we see that the second element is still team and if we go to 60 we see that it has the same position so what we can do is just get the second column which is represented by this index one in Python it starts the index with zero so the second will be one and if we run this we have this name of the column so we don't have to write the name of the column but only use this attribute and we get the name of the column we want uh independently of the of that name so if we write 11 we get that one uh actually actually 11 here sorry I made a mistake 11 here and here uh one so yeah this is the ugly or huge uh column name so now what we have to do is just copy this you know uh put it here so instead of writing this column name I'm going to use uh DF dot columns one so this is this represents the second column of each data frame that we have here so group a until Group H so now I delete this and now it's zero so now finally I'm gonna delete this column as I said I don't want this qualification column so I use the pop method and write the name of this column in this case all the data frames have this name I checked this before and all of them have this column name are the name is the same and finally I'm gonna write the I want to create that dictionary and to do this I have used to write a an empty dictionary so I write did underscore table equal to and I have these curly braces so this indicates that I'm creating an empty dictionary and then I write the underscore table and then I open these uh square brackets sorry this uh yeah square brackets to create a new element so for example I want my new key in this case name as group a and then this is equal to the value which is going to be that DF so in the first iteration we get a element a and then we get a data frame in this case table a or group a and this is going to be in this key and to create different keys not only group a but group a b until H we have to use uh this F string and then open these curly brightness so these curly braces in the F string allows me to put a variable in this case I'm gonna write here the letter which is this that I'm I obtained from the alphabet and yeah this is going to help me get from a to H so that's pretty much it now we can run this and see the results so now I run this and yeah it successfully run you know I can see the content of my dictionary so I write the underscore table and now I can use the keys method to see all the keys so I have from group a to Group H great now I can see the content of each key in this case group a so I see I write group a and now I see the table of group a now if I write Group B I get the table of Group B and so on if I change to Group H this is the table of Group H and that's how we successfully scrape all the data of this website and also we created a dictionary to better manage all this data finally uh one extra step that you can do in case you want to use this this dictionary in a project as I'm gonna do in this series of videos is uh use the pickle library to export this dictionary so to export this dictionary and use it in another file we can use pickled so we only have to write import pickle in case you don't have it you just have to install it with pip pip install Pico in case you don't have it I already have it installed so I'm gonna use this one so to do this to export this dictionary in a file what we have to do is write the following we write with then right open then we write the name that we want to export this file so the name I want arrow is digged underscore table you can write whatever you want but I'm going to name it like that then I write w b which stands for write bytes I believe and then I write as and the name as uh or the name I want to sign so this represents all of this represents outputs and now I'm going to use pickle so I write pickle that dump to dump all these data in my dictionary so I write dig underscore table and then I write the name um output here so basically what I'm saying here is all the data that is inside this underscore table that we created uh all this data put it in this uh file that I'm creating so put it inside this and now we only have to run this and that's it if we go to our working uh directory we can see that there is a new file and it's digged underscore table and that's how we export our dictionary and that's it that's it for this video in the following video I'm going to show you how to scrape the data of all the matches uh that are going to be played in the World Cup so basically we're gonna get the fixture and also I'm gonna show you how to extract all the historical data from the from all the World Cups from 1930 to 2018 so you can have a database of all the matches but that's going to be on the next video in this video I'm going to show you how to scrape data of football matches from the FIFA World Cup so we're going to scrape all the matches played so far from 1930 to 2018 and also we're going to script the fixture of the FIFA World Cup that is in Qatar which is this one and we're going to use python and a library called Beautiful soup so let's get started okay to script the data of football matches from all the World Cups first we need a list with all the years that represent a World Cup for example from 1930 to 2018 so these are the years that had a World Cup all right I'm after that what we have to do is install the library to scrape a website in this case I'm going to use beautiful soup and to do this what we have to do first is install beautiful soup so what we have to do is open the terminal and write pip install bs4 which represents beautiful soup so we press enter and we install beautiful soup and after that what we have to do is import this Library so we write from bs4 import beautiful soup so now we have beautiful soup and after we do this we have to get the link that we're going to scrape in this case I'm going to start with the uh the 2014 World Cup and I'm gonna paste it here so I'm Gonna Leave the link in the description and I'm gonna give you the name which is going to be web so this is my the website that I'm gonna scrape after this what we have to do is send a request to this website and to do this we have to use the request Library so what we have to do is import requests I'm not sure if this uh Library comes with python in case it doesn't what we have to do is open up a terminal again and right peep install requests so we press enter and then we can continue with this so we need requests to send requests to the website that we're going to scrape so I'm going to copy these requests and then use the get method so I write get and I have to write the website where I want to send my request in this case this website that I uh I created the the variable so after this when we send a request to our website we get always a response so I'm going to name this variable as response and after we do this we're gonna get a response and we want the text of this response so we use the text attribute to get the text of this response so what I'm going to do now is print the this response so you can see what uh this looks like so I printed this and what we have here is the HTML of this website so the HTML behind this website is printed here so if you know some HTML you can recognize some elements like the tags and all of that but you don't need to know HTML to scrape a website but it helps so after we do this I'm going to create a variable name content so this content is gonna be parsed and to parse this content we need a parser and we're going to use a parser named lxml but first i'm gonna use uh this beautiful soup that we imported and then this uh accepts to um two two elements the first one is the content and the second one is the parser so content and then the parser which is lxml this lxml we need to install so we're not we need to open up the terminal and write pip install and then lxml then press enter and you install this parser so now I'm going to name this as soup this is the typical name that we use for uh for this and this soup is gonna help us extract data from this website we cannot extract data if we don't have this soup so now that I have the soup I can start scraping this website so basically what we did here is the the common uh steps that we need to get started with the beautiful soup now we can start scraping this website so I write soup dot find in this case to get an element we need to use either find or find all the first one help us get uh one element in particular in the second one find all helps us get uh multiple elements that match the name of the tag and the class name that we're going to write in a couple of seconds so in this case I'm going to use find all because I wanna get all the matches here uh all the matches that were played in the group stage in The Knockouts the semi-finals and the finals so I'm gonna get multiple matches so I'm gonna just find underscore all and then we need to follow a syntax but before I tell you this syntax we need to go to the website and we need to inspect this website to see the HTML uh behind this website so to do that I have to right click and then click on inspect so after this we get this uh developer tools and here we can see the HTML document so this is the HTML that represents this website and to get one specific element what we have to do is click here on this button on the left and then select any element that we want all right now to scrape a website what we have to do is find a pattern every time that we want to scrape a website we need to find a pattern that we can use in our code so here a pattern that I discovered is uh there is something with the matches with the rows actually so this match is inside a row so if I for example I select Brazil here I click on it I get this element but I don't want only Brazil I want the whole row that gives me that match so what I have to do is in this HTML document just scroll up a little bit until I find the row that contains all the data so I'm going to scroll up and finally I found this row as you can see this element highlights all the rule in blue and that tells me that this is the the road that I was looking for so if I now um click here in this Arrow I can minimize this element and I can see if there is a pattern here so first let's identify these uh the elements in this HTML element so first we have a div this is the tag then we have this uh item type it's a class name and then we have another class name which is class with the class uh with the the value football box so this is one Element let's see the next row so we have this one and as we can see it has the same elements is exactly the same so we have div and we have class and then we have football box is the same and if we scroll down we have the same for the next match so we can see these three rows have the same elements so we can use this to get all the rows in this website so we're gonna get from group a to group b c d until Group H and then semi-finals Knockouts and well the final so we're gonna get all the matches and that's good so now I'm gonna show you the syntax I was talking about before so to extract the data inside this element we have to follow a syntax and this syntax tells us that we need to uh to write first the name of the tag so here the tag and then uh the name of the class so we write class this is an a parameter class and then the name of the class so let's identify this in this HTML document or actually in this element the tag is div as we can see the name is div here in blue so here instead of that I write div and then in the class we have to write uh the name of the class which is football box so I copy this and then I paste it here this is the name of the class all right so now we have uh this element that represents this row so this is a match so I'm going to name this as match but since we're gonna get all the matches because it's gonna be like this one and this one and this one because as you might remember we're not using find find will give us only the first one but find all gives us all of them so fine all gives me all the matches so I'm gonna name it matches because we're gonna get all the matches inside a list so when we use find all we get a list with multiple elements matches is a list all right now what we have to do is Loop through this list which is matches so I have a list and I'm gonna Loop through it so I write 4 match in matches uh I write the colon and then I'm gonna extract each data that I want from this row so in this case I just want to get the name of the home team which is in this case Brazil the score in this case three one and the away team in this case Croatia so to do that I'm going to open this element with this uh click in this little arrow I think and then what we have to do is identify this element so I'm gonna scroll um uh scroll down a little bit and then I'm going to identify the one that gives me the home team which is this one F home so we have this element th with Tac th and then with class name f hum so this is the one I'm looking for then for the score is this one F score and for that way team is this one F away so we have all these elements now we only need to get them inside this for Loop so what I'm gonna do is write match in this case match represents the soup that we created before as I told you we cannot extract data without the soup but since this Zoop is inside this matches list we can use the match uh as a replacement we can say so this match is gonna get us all the data or it's gonna help us uh get inside of these elements that are here in the rows so we are now in the rows that is here now we want the data inside the row so we're going for f home so match Dot find in this case because I just want one element inside this row and then I'm gonna write first attack in this case is th and then I'm gonna write the name of the class which is in this case f hum so first th then Kuma and then class underscore equal to F hom so I have now this and now to get the text inside this element what we have to do is use the get text function so I'm going to write get underscore text parenthesis and with this I have the name of Brazil so the text inside this element is Brazil and I'm gonna get that so now I'm gonna duplicate this twice and I'm gonna change the name of the class the second one is f score and the last one is f away so now F score and F away great so now we have all these elements and now to see uh the results what we can do is print this so I'm going to print each of them to see the results that we have so far to see if everything is working fine so now I print this and now I can run it so I right click and run this to see what we have so far and as you can see all these elements were printed so we have from from the first match which was Brazil Croatia 3-1 which is this one until that last match which is uh Germany Argentina this was the final and as you can is if you scroll down you're gonna see that is here the last match so one nil and it's here one nil so we successfully extracted all this data and now what we have to do is um put all this data into a table and to do this in Python what we have to do is use a library called pandas pandas is a package that allows us to better manage the data that we we have so basically in pandas we work with data frames that is the equivalent of a table in Excel but it helps us manage our data efficiently in Python and to use pandas what we have to do is import pendants but first of course we need to install it so we have to go to the terminal right pip install pandas and yeah that's pretty much it we press enter and we install pendants so in this case I'm going to import pandas as PD so PD is gonna represent pandas and now to create a data frame which is a table basically it's simply a table what we have to do is create lists to build this data frame so what I'm gonna do here is grade 3 lists one for the home team the other for the score and the last one for the away team so I'm gonna write home equal to this empty list and then here the score then away and that's it now I have to append each element to this list so I'm gonna delete this print one by one so I don't get any uh any error so it's ready so now I'm gonna write home that a pant open parenthesis and with this we append each home team to the home list and then we do the same with the score and also we do the same with the away team so away that append then we open the parenthesis and that's it so we have each element in each list so with this we have the lists now what we have to do is create a data frame and something I like to do is first create a dictionary to organize my my data or my lists in this case so I'm going to create a dictionary called dig underscore football and I'm gonna open these curly braces and write the keys in this case you can name it as you want I'm gonna name it uh the same name I I put for the for the list so a home score in a way and uh values are gonna be the lists that we created so the home list which is this one then this core list which is this one and finally the away lists again these names can be different from the keys and finally what I'm gonna do is create a data frame so I write PD which represents pandas that data frame this is the method that we're gonna use and then I'm going to use this uh dictionary that I created and I have here so with this we create the data frame and I'm going to name it as DF underscore football so this is my football data frame and finally what I'm going to do is uh also create a column that identifies the year of this data frame for example this is the World Cup 2014 from the Brazil uh This World Cup belongs to Brazil but we don't care about the country but the year 2014.
So to do that to register that the Year we're gonna create a column named year and then I'm going to write it equal to in this case it's going to be only for Brazil and it's going to be 2014 which represents that World Cup uh 2014 Brazil so that's the final World Cup and now I'm going to print so you can see the data frame so I print this then I right click and run and we get this data so we have all the data that we extracted better organized in a data frame so we have from the first match Brazil Croatia the year 2014 and the last match Germany Argentina the final in 2014.
So that's how we track data from uh this Wikipedia website but we all destructed one single workout this 2014 but what we can also do is extract data from all the World Cups from 1930 to 2018 and then also we can extract the feature of the 2022 World Cup and to do this I'm going to create a function so what I'm gonna do is first Define a function which I'm going to name get underscore matches and this is gonna uh have as input the year because as you can see uh the year is going to change is the only thing that changes then I'm gonna put all of this inside this function and then we have to identify another pattern in web scripting we always look for patterns most of the time and here we have to see the website so I'm going to the website again and something interesting that Wikipedia has in this website in particular is that it organizes uh the let's say the links based on the or the workups based on the on the year so this website Remains the Same but the only thing that changes from for example the World Cup 2014 uh with the World Cup 2018 is the year for example if I write here instead of 2014 if I write 2018 and press enter voila we have the World Cup 2018 and that works for all the World Cups also for the 2022 which is the the coming World Cup uh Qatar and also for example here 1930 which is the first World Cup we can also get it by just right in here instead of 2022 1930 and we have the data of this world cup so now what we have to do is just implement this here in our code and to do that I'm gonna use something called F string F string is basically uh some kind of a string that allows us to put variables inside the string easily so to do this we only have to write the F in front of the quotes that we have here and then we have to uh to introduce a variable inside this string we only need to write these curly braces so now instead of writing 2014 we can write year so this represents the input that I have here so this is going to be for all the details that we have in this uh dealers list and then we have to look for the changes that we have to make because we're not using that year variable so everything here Remains the Same the same the same and here we have another change so instead of writing 2014 we have to write gear which represents my your input and that's pretty much it now we can test this function by writing up but there is something else here I need to return something so instead of printing this I'm going to use return and we're going to return this data frame DF underscore football so this is the data frame and finally we can test this function so I can write the name of the function get matches and then we can write the uh the year that we want in this case I'm going to write I know 2018 which is the last World Cup so 2018 and we can print this to see the results so we print this and now I'm gonna right click and then run this and see what happens so we have the result I scroll up and we have from the first match which was Russia uh Saudi Arabia 5 nil and the last match France Croatia 4-2 so this is our old data on in this world cup and as we can see this uh this function is working just fine but there is a little detail I want to mention there are some uh some gears or actually some workouts where we don't have the same pattern so for example let's go to 1930.
In this world cup actually it's here 1930 if we inspect this website we can see that we don't have the same pattern that we use here for the for the home score in a way so for example if we use that function and we strike data from 1930 and we press run actually uh we don't get all the matches I'm not sure if it was for 1930 I think it was for the World Cup 1982 I'm gonna run and as you can see we only have uh four matches and this is uh very weird because in the World Cup 1982 there were more matches so if we check check out here we see that there are a lot of matches but we could instruct those matches because we don't have the same HTML document so if we inspect this we see something different for the for example the home team it doesn't have this uh this thing that we use the f-home class name it has a different um a different class name and different tag name I think so what I'm trying to say is that for the Old World Cups it uses it we might not extract all the data but we can use another approach that I'm gonna show you in the next video so our function is working just fine but for some years we cannot just extract all the data because they just don't follow the same pattern but as I said before we're gonna deal with that later in the next in the next video actually and now I'm going to finish this by extracting all the the data of all these uh years so what I'm gonna do is first use a for Loop so I'm gonna Loop through this gears list so I write 4 year and years and then I'm going to use the function to get all this data so get matches and then I have to write the name of the year actually this year variable that is inside this Loop and we're gonna get all the data of all the workups so now we need to store this inside another list uh so to create this list easily I'm going to use list comprehension which is basically the same but with a different syntax so I'm going to use get matches here I'm going to put this here and then I'm gonna put this here this might look uh weird if you don't know this comprehension but it's exactly the same but this helps me create a list uh faster so here I can name the list which is I'm gonna name it FIFA and this is basically the same I can use this approach that I uh used here with home and empty list and then using a pen but this is another approach that does exactly the same but it uh it's shorter we can say so after we do this we get all the uh we get one data frame for each gear so we get a data frame from 1930 1934 and so on until 2018.
So what we have to do next is concatenate each uh each data frame so we only have one data frame or one table and when I say concatenate I mean join all the tables and to join all these tables we have to use that concat method in Panda so I write PD that concat to join all the data frames or all the tables of all the World Cups so now I have to write FIFA because this is the list that has all the tables and then I'm gonna add an extra parameter which is ignore underscore index and set it this equal to true this basically ignores the index of each data frame which is not absolutely necessary but I just want to add it then I'm gonna write DF underscore FIFA which is gonna be the final data frame which is a combination of all the data frames that we extracted here from 1930 to 2018 and finally I'm going to export this to excel actually to a CSV file so I write two CSV and then I'm going to write um a name for this file in this case I'm going to name it uh as FIFA underscore World Cup underscore historical data that's CSV we need to write this tension and I'm gonna add a parameter which is index equal to false this uh allows me avoid this index that are here so for example here I have the index 0 1 2 and 3. if I don't want to export this into the Excel file or the CSV file I have to set index equal to false so I only export The Columns home score away in year so that's why I write index equal to false and with this we have all the history critical data from 1930 to 2018 and finally I'm gonna also get a picture of 2022. so if we go here to 2022 we're gonna see that we have exactly the same uh the same structure but we have a single small difference here we have the matches but we don't have the results because the the World Cup uh Qatar 2022 hasn't been hasn't started actually so we only have the home team the white team but we don't have the result of the match so that's why this is going to be only the fixture we're gonna get uh instead of the score we're gonna get match one so here I'm going to write um get matches which is the name of my function and then write the year 2022.
So with this we get the fixture so I'm going to name it as DF underscore fixture so this is the fixture and now I'm going to export this as a CSV file so I write two underscore CSV and then I'm going to name it uh something like this that I have here but instead of historical data I'm going to name it uh underscore picture so that's CSV and then index equal to false so we don't support the index and that's it now we can run to see all the results so I'm gonna right click run this script and wait a couple seconds all right the script finished running and now we can open the files that we supported this FIFA World Cup historical data and the fixture you can open it with excel in this case I'm using Python and I have a plugin that allows me to read these CSV files inside of pycharm so I have these two files fixture and historical data and I'm just gonna open this first one historical data and I have this uh this file so this is the CSV file that I'm opening in pyterm and this uh these letters and bsp is just something that the plugin doesn't read well but actually this is uh well this was extracted successfully so we have all the matches from the first match in 1930 to the last match in 2018 so the final friends Croatia for two then we can also open up the the fixture of the World Cup Qatar 2022 so we have the first match which is going to be Qatar Ecuador and the last match which is going to be the the winner uh I don't know winner one with winner two so this is the final match and that's it that's how we scrape data using beautiful soup and in the next video I'm gonna show you how to with scrape the missing data that we couldn't scrape using beautiful soup because we successfully scraped data from most World Cups but there are some workups like 1982 where we couldn't scrape the data because uh it has a different pattern so in the next video I'm going to show you how to extract that missing data using Python and another Library called selenium in this video we're going to learn how to clean data using the pandas Library so in the previous video we create real world data and when we work with real world data sometimes we can have some missing data or some blank spaces in some words or anything that needs to be cleaned and to do this we're going to use pandas and python okay to start with this tutorial we need to have pandas installed on our computer so in case you don't have pandas you have to run peep install pendants on the terminal or if you have Jupiter like me you have to add this exclamation mark and then run this on the on this cell once we have pandas you can run this and import pandas as PD then we're going to read three CSV files and these three CSV files are the the data that we script in the previous videos so this uh the first one the World Cup underscore matches data this one is from a video where we scrape the data from Wikipedia and in this one we use beautiful soup and well this is the first one then we have the workout picture and we use pandas to scrape this data and finally we have the World Cup missing data which is the third one and this one was like the complement for that for the first one because when we scrape this uh Wikipedia with beautiful soup we didn't uh struck all the data because of some inconsistencies in the website so we have to scrape again with selenium to complete the missing data and well this is the data set or the CSV file that we got so now let's run this and let's have a look at these data sets very quickly so here is the historical data this has the data from the World Cup 1930 to the workout 2018 then the missing data which completes that uh the previous data frame and then the picture and the fixture are all the games that are going to be played in the coming World Cup Qatar 2022 so these three data frames are going to help me predict the winner of the World Cup 2022 and before I do the prediction and run any uh any model or anything we want what we have to do first is clean the data and this is the uh something we must do always in any data analysis or data science project okay now let's start cleaning the DF underscore fixture data frame so this is the data frame and if we have a look at this data frame we see that everything looks fine but if you uh you check this in detail this data frame we're going to see that the strings inside the home and away columns have some blank spaces so this is not visible right now it's very hard to see but if you work with the with the data as I did before making the video you're going to see that there are blank spaces for example we can have as an example the word Qatar and then a blank space and we don't want that we want to clean this data so what we're going to do in this case is use the strip method so we have to write a DF fixture home we have to select the column and then we have to use the Str attribute and then use the strip method as I mentioned before and with this we make sure that uh we delete the leading and trailing spaces uh the blank spaces so if we run this we only get the words we're interested in so only that name of the countries and we do the same for the away team and then to save this uh these changes I'm gonna do uh just equal to and then update the values so that's how you clean that DF underscore feature so if I run out this uh sorry if I run this you're gonna see that it looks the same but actually we got rid of the trailing and leading spaces so that's something we have to do so that DF underscore fixture is clean so everything uh everything else is fine so now let's continue with that DF missing data and let's clean this DF missing data and then add it to the DF historical data because as I mentioned before this historical data and missing data are the same both are the World Cup matches but in the first one we couldn't extract all the workouts I think there are some workouts missing and these workouts are considered in the missing data data frame so let's clean the missing data data frame and then let's add it to the historical data data frame so to do this first something that I want to show you is that this missing data has some new data and this is why we have to clean it so if I do uh DF missing that home and then use that is new method and then show this we can see that there are 64 rows that have missing data so it has the n a n value which means another number and indicates that it is missing data so to get rid of this missing data or this null data what we have to do is use the drop in a method and to do this first I'm gonna I'm going to copy this and then I'm gonna write drop n a so I write drop n a and now what I have to do is write in place equal to true to successfully drop the values and save the changes and update the data frame so with this we uh we drop the N A values so if I run this let me uh comment this out email run this we're going to see that if we run this one again the line of code that returns missing data we're going to see that now we don't have anything so we got an empty table and this indicates that we successfully drop the missing data with this line of code okay now let's continue this and for the next step I'm going to concatenate actually I'm going to delete this one and yeah I'm going to concatenate uh the data frame so I'm going to add the missing data to my historical data so now that this data frame is clean I'm going to add it and to do this I'm going to use the concat method and this method allows me to concat two data frames or more data frames so here I'm going to write uh where is it is DF underscore missing data and the order is DF underscore historical underscore data so it's here okay now I'm gonna add one extra parameter which is ignore underscore index and I'm going to set it to true so I do this because I want to ignore that index of data frame 1 and data frame 2.
So if data frame 1 has index from 1 to 100 and data Frame 2 has indexed from 1 to 300. with this I'm ignoring this uh original indexes so when I combine them this is going to start from 0 to the to the end so it's not going to consider the original indexes and that's good so now I'm going to concatenate in the name I'm going to set for this concatenation is going to be historical data so this is my final data frame and now uh what I'm gonna do is uh after uh combining these two data frames I'm gonna drop some potential duplicates that we could get so maybe when we extracted again the data here the missing data we could have extracted also some data that was already included here and this is not good because we don't want a repetitive data and to avoid that to make sure everything is going to be fine what we have to do is write this the name of the data frame which is the the result of the concatenation and then we have to write a drop duplicates method so I'm not sure if there is duplicated values but just in case I'm going to use drop duplicates then I open parenthesis in place equal to true so with this we drop the duplicates in the historical data column uh sorry data frame then what I'm going to do is sort this data frame by the year so we write sort values allows me to to sort a data frame uh by a year or by a column in this case the year and then I write in place equal to true and finally I'm going to show this data frame so we're dropping the duplicate first we're concatenating then dropping the duplicates and then sorting by the gear so I'm going to run all of this and now we have this data frame and now we see that is this data frame is the final historical data data frame after the concatenation we don't have duplicates and this data frame was sorted by a year or so it starts with the first workout 1930 and it ends with the last workout played so far which is 2018 and great so we successfully cleaned the uh the missing data and then we added it to the historical data okay now let's continue cleaning uh DF historical data because so far we only cleaned the missing data and then we added it to the historical data but this historical data the final data frame is not clean we have to make some modifications to to clean it and also we have to make some modifications to have it ready for the processing that we we're going to do later before you run any machine learning or anything you want this data frame has to be ready and right now it's not ready because for example the score column has this uh these numeric data but it's not only numeric data we have this hyphen in the middle and this is not good when we want to process the data usually we only want numeric data and we don't want anything else so we're going to get rid of this High thing in the middle and we're going to create two columns from for the home uh for the goals score uh from the home team and the goal score from the away team so let's do this and first what I'm gonna do to clean this historical data is uh do something that is uh a bit uh particular in this project but it's necessary so I was exploring this data frame and I found that this data frame or actually in the World Cups there was a match between Sweden and Austria that ended in a walk over and this means that one team was not in the was didn't wasn't part of the match so when one team the same play the match doesn't want to play the match or cannot play the match for some reason this ends in a walk over so it's like three nil for the team that that was actually waiting for the other team in the stadium and this is not something good because uh I want to predict the winner of the World Cup based on goals and if I say that one team has three nil uh without actually scoring those goals doesn't help me this doesn't help me predict uh the the winner of the World Cup because it's not like they deserve those three goals so I'm gonna change this data actually I'm gonna show it right now so I'm gonna copy this one and show you much better what I mean so I'm gonna copy and show you here so I run this and we have this uh this match Sweden Austria 1938 and this is uh not numeric data this is another thing that is not good for us so as you can see score here in the column we have numeric data for one for example here and four two here but here we have a w slash o that stands uh for walk over and this is not something good we have to either change this to three nil or uh delete it and in this case I'm gonna delete this row and to do this first I'm going to get the index so here first um I got this with the Str contains which allows me to get only the values that contain this this particular word in this case Sweden and then once I got the row I want I can use index and this index uh gets the index of this uh of the whole thing so when this condition is through we're going to get this index and this is the index that we're going to drop so this all of this is here and I name this as delete underscore index and we're gonna delete it right now so to do this I'm going to use the drop uh method so I write first the name of the data frame and then I write drop then I open parenthesis and with drop we can drop either columns or rows in this case I'm going to drop a row so I write index equal to m delete index then in place equal to true and this is ready so if we run this we can see that this is not here anymore so just to verify this I'm gonna copy and paste it so we can verify that we actually drop this uh this row so if I run this we got uh we get this empty data frame which indicates that this is not anymore so great we drop this this match that was uh wasn't good for my analysis so now let's continue cleaning the rest of the data okay now something that I want to show you is that some columns like the score columns actually only this color the score columns not only has this uh this clean uh score that you see here for example for one but some of them have not only digits but has some other things for example here I'm using this uh regex that stands for regular expression that gets the the matches that are not in the matches the characters not in Brackets so in this case it's matching anything that is not a digit and that it's not uh this hyphen so anything that is not a digit and is not a hyphen is not the score right it's maybe a text maybe it's something else but that's what we want to identify so here I'm using the string contains and let's find out what's not a digit and what's not a hyphen in our score column so I run this now I need to add this one here to show and okay now we got this so for some reason some scores as you can see here have this parenthesis and then a e t I don't know what that means but we don't want that we only want the uh digit hyphen digit that represents the score and to get rid of this I'm going to use the replace method so this method allows me to replace one string with another and to do this instead of writing here contains I'm going to write uh the name of the column first Str and then I'm going to write replace so I'm going to replace this which represents something that is not a digit and not a hyphen so anything uh but deeds and hyphen so all of this in this example and here all of this and we're going to replace this with uh an empty string which is this one so after we do this we're gonna uh turn on regex so I write reg x equal to true and that's it so with this we replace anything that is at a digit in hyphen with an empty space so we can run this and see the results so I'm gonna run and now we can see that uh we replace this value successfully so now what I'm going to do is write this and I set this equal to I mean the DF historical data score I set this equal to the expression we had before and with this we're updating all these columns so great now if I run this we shouldn't have uh the those parentheses with a e t anymore great now let's uh also uh clean the names of the the words inside the home in a way columns so to do this I'm just going to copy this and I'm going to strip this because it might have some some data with blank spaces and we have to get rid of the leading and trailing spaces so what I'm gonna do is write this actually I'm gonna do it below and I'm gonna write it here and actually here and then what I'm gonna do is change the name so instead of uh instead of DF underscore picture what I'm going to write is this so I can replace each of them and clean it easily so if we run this we can see that actually we won't be able to see it because as I mentioned before those plug spaces are kind of invisible but now we have uh we're certain that we don't have either uh leading or trailing spaces so that's good because that can spoil our analysis okay now uh next we have to split the score columns into Home and Away goals and then drop the score column so with this we're gonna get rid of The Hyphen in the middle and we're gonna split the home the the goals score from the home team and the goals scored from the away team so this is just to have this data better organized and also have only numeric data so I'm gonna do this and to do this I'm going to use the split method so I'm going to write here first the name of that data frame then the column I want to work with then Str and then split and inside I'm gonna write this element which is the hyphen and then I'm splitting with a hyphen and if I run this we see that we get a list for each row and this list represents only the elements that are numeric so for example four one and four two here four one and four two so we have an at least with these two elements and now if I want to put each element in a column in a separate column what I have to do is add expand equal to true so with this we get two columns and one data frame so for one for two and this is what we wanted now if we want to add this to the historical data data frame what we have to do is write this the name of the data frame and then specify the new names of the column so I want to name the column 0 as home goals and I want to name that column one as away goals so that's it that's everything we have to do so if we run now this we should get um the two columns so I'm gonna run this and I'm going to show you now the value here so here's the value we have this uh the home score away in year uh the original data frame and then we have Chrome goals and Away goals and this is what we wanted so instead of four hyphen one now we have four in one column and one in another column and this helps us a lot when we want to do some processing later so that's what we want it and now we have to get rid of the score column because we're not going to use it anymore so I'm gonna drop this uh the score column using the drop method and right in here score and then here setting axis equal to 1.
So with axis equal to 1 we indicate that we want to drop a column so you have to write axis equal to one for columns then in place equal to true we'll run this and we have a look at the data frame we'll see that we have now all the columns but not the score column because we dropped it okay now to continue this let's rename the names of the columns here home away in year and also let's change the data types so let's uh I'm going to delete this and let's do this here so let's rename the columns and to do this we have to use the rename method and we have to write our name then columns equal to here columns because we want to rename columns and then we have to write the original name on the left and on the right the new name so instead of home I'm going to name it as a home team because it represents the home team and then instead of away I'm gonna write away team and the last one is going to be the year so I'm gonna write here near and I'm gonna uh write the first one in upper case so that's uh the year and finally in place equal to true to update this data frame so in place equal to true and that's it now I'm gonna change the data types because here if we see again this data frame apparently this uh the year home goals in away goals have numeric data I mean the data types will be uh int which is int which stands for integer but that might not be the case if we check D types which which is a attribute that Returns the data types we'll see that only the gear has integer values so only the year has actually numeric values uh that in this case is an integer and home goals in away goals have object and this represents strings and we don't want the strings we want that home goals and Away goals we want it to be uh we want them to be only uh integers not object or strings so what we have to do here is write this and then we have to write as type as type allows us to change the data type of any column we want so now we open a dictionary and we specify any column we want so we want to change home rules to int then we want to change away goals and it's going to be the same in this case int int integer and then we could also set here to integer we don't need to do that but I just wrote it here and well that's it so I'm going to add the in place equal to true to save this uh this these values are actually what I can do to here is just write DF historical data equal to this and with this we can also update the whole data frame that's another way instead of writing in place equal to true so now I run this and now if I show again this data frame we see that we have this new uh new columns or actually this new column names and then if we see the data types with the types we'll see that now home goals and Away goals are integers and that's what we want it okay finally we're going to create a new column named total goals that's not necessary but let's just create it in case we need it so the total goals is going to be the home goals plus the away codes and that's uh what we're doing right now so away goals plus home goals and this is going to be my total goal so I write this equal 2 and then I add a new column which is total gold so we'll run this and I got an error because I didn't write here the square bracket so now I run and if I show you the data frame we'll see that we have a new column named total goals which represents the sum of these two columns here okay that's it for the cleaning part so we have now all our data cleaned and to finish this we have to support this cling data frame so we have to export it to to CSV files if we want it so to do this I'm gonna write first historical data this is the last data frame we cleaned and I'm gonna write two underscore csb and then we have to write the name of the data frame in this case I'm going to use the original name which was let me see here historical data it was this one and I'm just gonna add the word clean in front of that name so clean underscore FIFA World Cup matches and that's it so now I can uh I can also add index equal to false and with this we're not going to support the indexes in this case 0 17 16 and so on but we're going to support only the data inside these columns that I'm highlighting right right now and well now let's continue with the feature data frame which is this one and for the fixture we have this name so I'm going to add also the name uh I'm going to use the same name and I'm going to add the word clean in front so I Write Clean underscore and now this name so now I copy this and then I'm gonna paste it below so this is the name and then I'm gonna write your index equal to false and with this we're going to export these two data frames and now I'm not exporting this data frame name missing data because we already include this data frame inside the historical data data frame before when we concatenated these two data frames so when we use the PD concat we put them together and we actually uh put all of this in the data frame historical data and well if we run this and then we check our directory I have it here now we see that we have two files this clean FIFA World Cup picture and then clean FIFA World Cup matches and that's it that's how you clean data using pandas and before I finish this video I want to show you some uh some practices that we often use to do some verifications sometimes we can verify if we successfully clean the data by checking the number of elements that we have because by mistake we might have got rid of some rows or maybe we didn't get rid of some rows that should have been deleted and we could verify this uh using the Len function and in this case with the Jiro because we can verify how many matches we have per World Cup if we run this we can see each workout I mean each year and then the number of matches because I'm using the length of the data frame and this Returns the number of rows so the number of matches so if you know something about football you will know that in 2018 there were 64 matches between the among the group stage The Knockouts the quarterfinals semi-final and the final there were 64 matches but this was not the case for the all World Cups for example 1994 had less matches but if for some reason for example you got here in 2018 30 matches you will see that you may be by mistake uh dropped some rows here in 2018 because you should have 64 and you had in the example you had 30 for example and that's something that could help you and verify that you successfully cleaned your data so that's something that you can do sometimes when it's possible and this is something I recommend you always to do to verify that the final data makes sense so there are many ways to do this and that depends on your your data set and also on the information that you have available and that's it in this video we clean the data with script in previous videos now that the data is clean we can start making our predictions with python in this video we're going to predict the winner of the World Cup Qatar 2022 and we're going to use Python for this we're going to use the pandas library and also we're going to use scipy to import that poisson distribution so let's get started okay to start with this video we have first to import pandas pickle and uh the poisson from scipy so to do this first we have to install these libraries so in case you don't have any of them you have to write PIV install and then either pandas or pink hole or sci-fi so if your in Jupiter notebooks like me you have to write this exclamation mark and if you're on any other IDE you have to open the terminal and run this this command pip installed and the name of the library okay once you have the libraries installed on your computer we have to run this and we successfully imported these libraries okay now for this video we're going to use this uh this CSV file we're gonna import this CSV file and we're going to read it with pandas so let's have a look at the CSV files first we're going to use the Dig table actually this dick table I web script a website to get the tables of the groups in the world cup and then I export this uh these tables with pickle and now I'm reading this table again with pickle so all these files this uh the underscore table and these two CSV files are going to be in the description and you can download uh you can download these files just uh just for the record we extracted this data in previous videos so if you are interested in learning web scraping how to extract this data in your room you can watch the videos but if you're not interested in that you can just download these files and these files has that clean data and you can read it in with these lines of codes okay so now then we have the clean FIFA World Cup matches and basically this has the data from the World Cup 1930 to 2018.
So we have all the historical data all the matches played in those World Cups and finally we have the World Cup fixture this is the feature of the World Cup Qatar 2022 and now we can read all this data and have a look to this to this tab data so I'm gonna show you first that dick table and this one has the groups as I said before so we have the tables and right now it has only the value of zero in each uh in each team because we have an uh run any model to predict who is going to be the winner between uh the matches that are going to be on the World Cup but we're gonna take care of that later so we have here all the groups from group a to Group H okay then we have the historical data as I said before from the matches from 1932 2018. okay then the fixture of the 2022 World Cup and this is the picture okay now let's see uh the next next thing we have to do and here we have to split the data frame historical data into two uh data frames home and away and to do this first I'm going to show you again this data frame and here we have uh this data with all the matches and what we have to do is to split this uh into two so we have to write for example DF home this is the one of the data frames and this is going to be equal to DF historical data and then we have to select only some columns and the columns that we're going to select are the home team then the home goes and then away goes so three columns and we're going to do the same for the the other one for the way team so I write DF away and then a weight team then home goals two and Away goals and that's it so with this we have these two um these two data frames and we can run this so we ran these uh lines of code and then for the next step we're going to rename some columns so I'm going to show you the name of this data frame uh the content of this data frame actually and this is the content so we have the home team home rules and Away goals and now we're gonna do uh we're gonna rename some columns for some particular reason and I'm going to explain to you right now but first i'm gonna write rename which is the method that allows me to rename a column and then write the column parameter or actually columns then I open a dictionary and we can rename any column we want in this case we're going to rename the three columns that we have here so first home team and then home goals so I'm gonna write here first original name then this colon and then the new name so instead of home team I just want to name it as team then we have home goals and I want to name this as goals scored so I don't care if it's a from the home or the way I just want to name it as home escort actually goes is called sorry goals escort and for away goals we're gonna name this as ghost considered so I'm going to open this uh quotes and then write goals considered so we have team for the home team then goes scored and then goals considered and this is basically because when the home the home team scores a goal those are the goals they scored and when the weight team scores a goal for the home team those are the goals that the home team concedes so that's why I put here I wrote here ghost is scored and goals considered so that's the reason why I'm writing this then we're gonna do the same for the way uh data frame but in this case I'm going to write DF underscore away now rename and here away team so it's something like this and then we're going to use again team instead of a weight team and then here there is som