Exploring and Cleaning Data for Analysis in Python


Hello everyone. In this lesson we're going to learn how to perform data exploration and cleaning in Python. The first part of almost any data analysis or predictive modeling task is an initial exploration and cleaning of the data. When you first load a dataset into Python, you don't yet have a good sense of the variables it contains, the range of values those variables might take on, or whether there are errors or other oddities in the data you'll have to take care of before you can carry out your analysis or modeling. So this is going to be a fairly lengthy lesson where we load in a real dataset and go through the sorts of exploration and cleaning tasks you might have to perform when first looking at a dataset. For this lesson we're going to use the Titanic disaster training set on Kaggle as our motivating example. We'll start by loading some required packages and then reading in the data. After first loading a dataset, when you don't really have any idea of what it looks like yet, it's good to start by getting some basic information using functions we've learned about in previous lessons. For instance, we can check the .shape attribute just to see how big the data is. That's a good first step, because if the data is really large we might have to think harder about how to work with it, since some operations can be slow on a large dataset. This dataset turns out to be relatively small, so we shouldn't have to worry about anything running particularly slowly. Let's also run .info(), which gives us an idea of the different variables the data contains and what types they are. Looking at the .info() output, we can see there are 12 data columns and 891 total records. Next, let's look at some actual records by running titanic_train.head() to see the first five entries. By inspecting a few records we get a better idea of what sort of data we're working with: it looks like we have PassengerId, Survived, and passenger class columns, the name of each passenger, the fare they paid, their ticket, and a Cabin variable.
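A minimal sketch of these first steps, assuming the Kaggle training file is saved as train.csv in the working directory:

```python
import numpy as np
import pandas as pd

# Assumption: the Kaggle Titanic training file "train.csv" is in the working directory
titanic_train = pd.read_csv("train.csv")

print(titanic_train.shape)  # (891, 12): small enough that nothing should run slowly
titanic_train.info()        # column names, dtypes, and non-null counts
titanic_train.head(5)       # inspect the first five records
```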

I'm noticing right away that the Cabin variable contains some NaN (not a number) entries, so those are missing; already we can see that this dataset contains missing values we'll probably have to figure out how to deal with. After gaining some sense of the types of records we have, let's run another summary function, .describe(), to get an idea of the distribution of the numeric columns. This pulls up summary statistics on each numeric column. For instance, PassengerId just seems to be a number running from 1 to 891, so essentially each row has been given a unique passenger ID; that's not particularly interesting for us. The Survived column is actually the prediction target for this competition: it tells us whether a given passenger survived the Titanic disaster or not, so it's very important for us to look at. The mean of Survived is 0.38, which means approximately 38% of these passengers survived, and therefore most of them did not.
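The summary step might look like this:

```python
# Summary statistics for the numeric columns (non-numeric columns are skipped)
titanic_train.describe()

# The mean of a 0/1 indicator is the share of 1s, so ~0.38 means ~38% survived
titanic_train["Survived"].mean()
```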

Notice that .describe() only kept the numeric columns, because it can't compute these summary statistics on categorical columns, so we'll look at those separately. To do that, we'll make a list of the categorical columns with the construction shown below.
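A sketch of that construction:

```python
# Columns whose dtype is "object" hold non-numeric (categorical) data;
# grab their names and describe only those columns
categorical = titanic_train.dtypes[titanic_train.dtypes == "object"].index
titanic_train[categorical].describe()
```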

Basically, all this is doing is finding the columns whose dtype is object (things other than numbers), taking the index of those columns, and saving those as the categorical columns. We then use those column names to index back into titanic_train to get only the categorical columns and rerun .describe(). Since we're now only running it on categorical columns, pandas uses a different describe operation that works on categories instead of the one for numeric columns. Running this, we see the categorical columns are Name, Sex, Ticket, Cabin, and Embarked, and .describe() gives us a table showing, for each column, the count, the number of unique entries, the top (most frequently seen) entry, and the frequency of that top entry. For instance, the Sex column has two unique values, presumably male and female; the most common is male, with 577 male entries.

As I alluded to earlier, when you load in a dataset it's sometimes hard to make sense of the different columns without some external explanation of what they mean, especially if the column titles aren't very descriptive. With this competition, and with many other Kaggle competitions and datasets you might find out in the wild, you'll often have an accompanying document that provides explanations of what the columns actually mean. This competition has that, and I've included the variable descriptions here. For instance, for the Survived variable, 0 means no, the person didn't survive, and 1 means yes, they did. The importance of this kind of documentation cannot be overstated: a good understanding of the variables you're working with is a great help in any data analysis or prediction task.

After loading in data and doing some basic exploration for the first time, there are several questions you should ask yourself to get the data ready for analysis. We'll list some of those and then go through each in turn, looking at different methods for addressing it.

One thing to always think about after loading a new dataset is: do you actually need all of the variables? Often you'll have columns that are just junk data that tell you nothing, or data that aren't useful for the task at hand; by getting rid of them you can free up memory and perhaps reduce unnecessary computation. So, for the Titanic training set we've loaded, are there any variables we should consider removing? The point of this dataset is predicting whether passengers survived the Titanic disaster, and in terms of that goal we can already see one variable we could remove: PassengerId. As we saw in the describe output, it's simply a counter running over the rows from 1 through 891, an arbitrary number assigned to each record that has nothing to do with whether the person survived. For this task we can safely remove it, and probably would want to, so let's start by running del titanic_train['PassengerId'].

The numeric columns we looked at, as well as categorical ones that describe passengers in broad categories such as sex, class, and age, would probably be somewhat useful for prediction, so I'd say we should keep those. But a few other categorical variables, Name, Ticket, and Cabin, are more questionable, because they have a lot of different unique values and it's unclear whether they'd be useful for prediction. Let's start by looking at the Name variable a little closer: we'll sort it and look at the first 15 sorted names. It's just a bunch of full names of passengers, and what does a person's name really tell you about whether they're going to survive? In fact, if we run .describe() on Name, we see there are 891 unique names, which is no surprise, since every person has their own name. So unless we wanted to run some sort of text processing on these names, this is a variable we could probably remove for this prediction task. In this case, though, it's nice to have a unique identifier for each record, and we already removed PassengerId, so for now we'll keep Name; it's just not something we'd want to predict with directly.

Next, let's look at the Ticket variable. The first 15 entries really don't seem to have any rhyme or reason to their structure: some are pure numbers, some are mixtures of letters and numbers, and one even has a slash in it. Running .describe() on it, we see there are 681 unique tickets out of 891 records. When a categorical variable has almost as many unique values as there are total records, it's usually suspect in terms of whether it will be useful for prediction. We could try to reduce the number of levels by combining entries into groups, but this one is messy enough, with enough unique values, that I don't think we're going to use it, so let's delete the Ticket column as well.

Finally, let's look at the Cabin column. Pulling up the first 15 rows, we see quite a few missing (NaN) values, and running .describe() shows fewer unique values this time, only 147, but also only 204 counted values, because most are missing. Since most of the non-missing cabin values are unique, we might think this won't be useful for prediction either. But notice that the cabins carry a letter designation that is common across many different cabin numbers: here we see C85, C123, C103. This letter designation could be a way of grouping cabins together that might actually be logical.
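Before going further, here's a sketch of the inspection and removal steps just described (column names as they appear in the Kaggle data):

```python
# PassengerId is an arbitrary row counter, so remove it
del titanic_train["PassengerId"]

# Name: all 891 values are unique, so keep it only as an identifier
titanic_train["Name"].sort_values().head(15)
titanic_train["Name"].describe()

# Ticket: 681 unique values with no obvious structure, so remove it
titanic_train["Ticket"].head(15)
titanic_train["Ticket"].describe()
del titanic_train["Ticket"]

# Cabin: mostly missing, but keep it for now because of the letter prefixes
titanic_train["Cabin"].head(15)
titanic_train["Cabin"].describe()
```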
Returning to that letter designation: there might be, say, a section C on the ship where all of these C-numbered cabins are located, and depending on how the crash happened and where a given section was, people in that section might have been more or less likely to survive. So this variable might still have some value for prediction if we can extract that information into meaningful groups. We'll leave the Cabin variable here for now and deal with it later.

Now that we've considered whether to remove variables, we should also ask: should you transform any variables? When you first load a dataset, some variables may be encoded as data types that don't fit well with what you're trying to do. For instance, if your data had dates that were loaded in as strings or numbers and you wanted them as dates, you might have to convert them to actual date objects. In our dataset, the Survived variable was loaded in as zeros and ones, but that's not such a nice way of thinking about it; it would be easier if we just had the full strings "Died" or "Survived", so we don't have to think in terms of zeros and ones. As a simple example of a transformation, we can turn the Survived column into a categorical variable. To cast something as categorical, we use the pandas function pd.Categorical, passing in the Survived column we want to transform. After saving the result as a new variable, we run .rename_categories() on it to rename the categories to "Died" and "Survived". Then we can run .describe() on the new variable to look at a table of it. Now we have some nice categories that are easy to understand, and we can see the frequencies: about 61% of people died and about 38% survived.
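A sketch of that transformation:

```python
# Recode the 0/1 Survived column as labeled categories
new_survived = pd.Categorical(titanic_train["Survived"])
new_survived = new_survived.rename_categories(["Died", "Survived"])
new_survived.describe()  # counts and frequencies: ~61% Died, ~38% Survived
```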

We also noticed earlier another variable with questionable data encoding: the Pclass variable. Pclass takes three different values, 1 for first class, 2 for second class, and 3 for third class, and it was loaded in as an integer, even though we know intuitively that passenger classes are really categories, so it doesn't make a lot of sense to encode it as a numeric variable. What's more, first class would be considered above or higher than second class, but first class was encoded as the integer 1, which sorts before or lower than 2. We can address this by transforming Pclass into an ordered categorical variable. Again we run pd.Categorical on the Pclass column, and to make it ordered we add the additional argument ordered=True. After creating this new categorical Pclass variable, we run .rename_categories() again to give the levels names we can easily understand: Class1, Class2, Class3. Now, instead of a numeric summary that wasn't particularly useful, we get a summary of the categorical version of Pclass, which is a bit more interesting: about 25% of the passengers were in first class, 20% were in second class, and 55% were in third class. Let's go ahead and overwrite the original Pclass variable with this new categorical version.

Now it's time to revisit the Cabin variable that we didn't remove earlier. If we run .unique() on Cabin, we see it has a lot of unique values, but many of them start with the same letter designation, like C, E, B, and A. So let's group the cabins by their letter designation; hopefully we can reduce the number of unique categories and extract something useful from this variable. What we want to do is extract the first letter of each cabin string and create a new categorical variable based on that letter instead of the numerical portion. We start by getting the Cabin variable and casting it to type string, since everything was loaded in as object types. Then we run a list comprehension over the column to extract the first letter of each record: cabin[0] for each cabin in the variable we just saved strips off the first letter of every single record. We wrap the whole thing in np.array() to turn it into a NumPy array, run pd.Categorical() on the array to transform it into a new categorical variable, and then run .describe() to check what we got. We've managed to condense the Cabin variable down into far fewer categories. Although many of the values were missing (77% didn't have a cabin listed), some of these groups might actually be useful for prediction: there are almost 60 people in the C area of the ship, 47 in the B area, and quite a number in the D and E areas. Potentially that could be useful. For instance, if area D was the spot on the ship where the crash occurred, perhaps everyone in that whole section didn't survive, and in that case the D category could be a powerful predictor for not surviving.
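A sketch of both transformations, assuming the Kaggle column names Pclass and Cabin:

```python
# Pclass as an ordered categorical with readable level names
new_Pclass = pd.Categorical(titanic_train["Pclass"], ordered=True)
new_Pclass = new_Pclass.rename_categories(["Class1", "Class2", "Class3"])
new_Pclass.describe()
titanic_train["Pclass"] = new_Pclass  # overwrite the original integer column

# Reduce Cabin to its leading letter
char_cabin = titanic_train["Cabin"].astype(str)           # NaN becomes the string "nan"
new_Cabin = np.array([cabin[0] for cabin in char_cabin])  # first letter of each cabin
new_Cabin = pd.Categorical(new_Cabin)
new_Cabin.describe()  # missing cabins all land in the "n" group (from "nan")
titanic_train["Cabin"] = new_Cabin  # keep the grouped version
```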

Since this grouped version could potentially be useful, let's go ahead and save over the original Cabin variable with our new one, as in the last line of the sketch above. Now that we've done some variable transformations, another question to consider is: are there NaN (missing) values, outliers, or other strange values we should deal with before proceeding with any analysis? In Python, using the pandas library, you can detect missing values with pd.isnull(). As an example, we'll create a dummy vector of data that contains some None (missing) values and run .isnull() on it. This produces logical output: True where a value is missing or null, False where it is not. In this case, positions 1 and 3 come back True, because we put missing values there. Identifying missing values is the easy part; it's much harder to decide exactly how to handle them. For instance, if we run .describe() on the Age variable in the Titanic data, we see that the count is actually fewer than the total number of records. Let's identify the indices where these missing age values occur. We can use the same isnull() construction as before, titanic_train['Age'].isnull(), check where it is true, and wrap the whole thing in np.where() to extract the corresponding indices. Running this gives an array of indices where the corresponding age value is missing, and there are quite a few, so we definitely need to do something about this before continuing with our prediction task. When you see missing values in a dataset, there are a number of things you could consider doing: you could replace all the null values with zeros, if you're working with numeric data anyway; you could replace them with some central value like the mean or median of the column; you could impute some other value (imputation means using some algorithm to fill in the values; mean and median are simple forms of imputation, but there are more complicated forms, say using data in other columns to guess reasonable values for the Age column); or you could split the dataset into two parts, one where Age is not missing and one where it is. Setting all the missing ages to a central number like the median or mean could be a reasonable thing to do, except we don't yet know what the distribution of the Age variable looks like. So let's pull up a quick histogram of Age to see whether it looks fairly normally distributed; if it does, using the mean or median would be reasonable. To make a simple histogram in pandas, you call .hist() on the data frame, passing column= whatever column you want to plot, along with arguments specifying the figure size and the number of bins. Running this creates a histogram of the Age column, and we can see that while there is some concentration of ages at the low end (there were some younger people on the ship), in general the data follows a somewhat normal pattern, so perhaps setting the missing values to some middle age between, say, 25 and 30 would be a reasonable thing to do.
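Put together, those steps might look something like this sketch:

```python
# pd.isnull() / Series.isnull() flags missing entries as True
dummy_vector = pd.Series([1, None, 3, None, 7, 8])
dummy_vector.isnull()  # True at positions 1 and 3

# Indices of the records with a missing Age
missing = np.where(titanic_train["Age"].isnull())[0]
len(missing)  # quite a few records have no age listed

# Histogram of Age to check whether a central fill-in value is sensible
titanic_train.hist(column="Age", figsize=(9, 6), bins=20)
```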
So let's go ahead and set the missing age values to the median value of 28. We can use np.where() to do this: np.where takes a logical check, in this case whether titanic_train['Age'] is null; where this is true we set the value to 28, and where it's false we keep the original value. That gives us our new age variable, and we overwrite the original Age column with it. Running .describe() confirms we now have a count of 891, so all of the missing values were overwritten. Let's also rerun our histogram to get a sense of what the data look like now. You can see that by assigning all the missing values a value of 28, the bin where 28 lies is much taller, because there's a whole bunch of extra values in there. Clearly some of the age values we imputed with the median will be quite a bit off from their actual values; on the other hand, doing this might be better than throwing entire records away. These are the sorts of decisions you'll often have to make in data analysis projects, and there's often no one right answer as to what the best thing to do is; it's often a matter of experience and knowing a lot about the domain you're working in.

Now let's consider outliers. Outliers are extreme values that lie far from the typical values within a distribution. If a dataset has some extreme outliers, they can have significant negative impacts on various predictive modeling techniques, so it's good to identify them and perhaps do something to deal with them. To look for outliers, let's check the Fare variable: we'll run .plot(kind='box') on it to make a box plot, a sort of plot that's good for detecting outliers. On the box plot, the green line is the median value and the blue box contains the middle 50% of values, and some of the circles above the box are way above the median. Out of interest, let's determine who actually paid so much for their tickets. To do that, we can run np.where() on the Fare column, checking where the fare equals the max of that column, since we want the biggest one; we save that index and use it to index into the dataset to find those people. It turns out there are actually three people who all paid the same amount, 512 (dollars or pounds or whatever the fare unit is), and all of them were in first class; you'd hope they would be, if they paid that much. Similar to missing values, there's no single cure for outliers: we could keep them, delete them, transform them in some way to reduce their impact, or select modeling techniques that aren't badly affected by outliers. But it's still worth identifying them and keeping them in the back of your mind, because they can have a disproportionately large influence on your results. We're going to keep these high rollers in our dataset, but it's a good thing we know they're there.
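A sketch of the imputation and the outlier check, using positional indexing to pull out the matching rows:

```python
# Impute the median age (28) wherever Age is missing, keeping real values as-is
titanic_train["Age"] = np.where(titanic_train["Age"].isnull(), 28,
                                titanic_train["Age"])
titanic_train["Age"].describe()  # count should now be 891

# Box plot of Fare to spot outliers
titanic_train["Fare"].plot(kind="box", figsize=(9, 9))

# Who paid the most? Index back into the data with the matching positions
highest = np.where(titanic_train["Fare"] == titanic_train["Fare"].max())[0]
titanic_train.iloc[highest]  # three first-class passengers who paid ~512
```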
One final thing to consider is: should we create any new variables? When you first load a dataset, the variables you're given aren't always the most useful ones for prediction. Creating new variables that are derivations or combinations of the existing ones is a common step that can help you create more useful information for whatever task you're working on; this process of creating new variables is known as feature engineering. Creating a new variable can be as simple as adding, subtracting, multiplying, or dividing two numeric variables. For example, this dataset has some variables related to family: one that tells how many siblings a person had on board, and another that tells how many parents they had on board. If we add those two together, we get an overall metric of how many family members each passenger had on board, so we can easily create a new variable called Family that's just the sum of the two. Now that we have it, for interest's sake, let's see who had the most family on board, as in the sketch below.
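A sketch of both steps, assuming the Kaggle column names SibSp (siblings/spouses) and Parch (parents/children):

```python
# Family: total family members aboard = siblings/spouses + parents/children
titanic_train["Family"] = titanic_train["SibSp"] + titanic_train["Parch"]

# Find the passengers with the most family aboard
most_family = np.where(titanic_train["Family"] == titanic_train["Family"].max())[0]
titanic_train.iloc[most_family]  # a group of siblings, none of whom survived
```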

To do that, we check where Family equals its own max, run np.where() on that check to find the indices where it's true, and then use those indices to index back into the dataset and extract those records. When we run this, we can see there are seven people who each had ten family members on board, eight of them siblings, so these people were actually all siblings of each other; sadly, all seven have a zero listed under the Survived column, which means none of these siblings survived the Titanic disaster.

To wrap up: in this lesson we covered several general questions you should think about when you first inspect a dataset. Your first goal when you load in some data is to explore its structure and then prepare the variables for your analysis. Only after you've cleaned the data and gotten it into a format you can work with can you move on to more complicated analyses and prediction tasks. Data cleaning and formatting is a very important and often time-consuming part of data analysis.

It's important that we spend some time learning how to work with different types of data and how to manipulate data better, so over the next few lessons we'll go more in depth on how to clean and preprocess different sorts of data, including text data, numeric data, and dates.

If you found this video useful, you can drop a like and hit subscribe, and I'll see you again next time.