Pandas, Python and Data Analysis


Pandas known for their excellent camouflaging skills ability to climb even the tallest trees here to talk about talk about the data analysis.

Library pandas in the Python programming language we can use pandas to do data cleansing data analysis and data.

Transformations you can think of pandas as a collection of power query pivot tables and formulas in Excel in this video.

I'm gonna show you how we can use pandas in a real world situation recently I was doing a training program for one of the largest hotel chains in the world and they showed me this sample data that they get for the hotel bookings and it had some really weird issues so we are going to use pandas to load up the data cleanse it and do some quick analysis let's get into python for this example we will use the hotelbookingdata dot txt file like you can see here we have got some columns and some rows of data but in between some text values are there which kind of make it hard for us to combine everything in analyze it together so let's understand how we can kind of load this data into pandas and then how we can clean it up how we can understand this data to get started either open a new tab or new window and then choose open a different editor option and then select jupyter notebook this way you can do the interactive notebook style programming within vs code directly this is very simple all we have to do is write some code execute it to see the results and then add comments or documentation along way so most of the time in the world of data analytics and data science people use interactive notebook style programming when building or developing the understanding once things are working as intended then you might have just one python code rather than the interactive notebook style things but this is a fairly common format so you have either a code block or a markdown block code is where your code goes and Mark Downing is where you could write some documentation or explanation of what is happening so the very first thing that you want to do is import the libraries that we will need to work with this data so we start with import pandas as PD of course you can call this as anything for example you can call it DC but it's a convention to just use PD there another package that we normally use when we are analyzing data is the nem P or nem Pi package so import as NP and that's pretty much it when you run this those packages will be imported into your workspace and now it's ready so at this point we can add a markdown block just to say uh let's load up the hotel booking data and then we will add a code block here in this code block let's just load it up.

Pandas uses a type of data structure called Data frame we can think of data frame as excel's range or Excel Stables it's basically a tabular structure where your data is maintained but the data frames are a bit more flexible and optimize it for performance so we can declare a variable DF is equal to and then PD dot read and when you see the read you have seen that you can see that you know there's lots of different functions that we can use I'm going to use the read CSV even though our file is a text file we can read it as a CSV because it's a tab delimited file so I'm just gonna say the file path because it's the same folder as our rest of the workbook we can simply say hotelbookingdata.txt and load it up and then in the next line I'm just gonna print DF dot head this will just print the first few rows for me let's run this and we get this now I thought we are gonna get four different columns but here you could clearly see that everything is read up as a single row with the backslash t as the separator this is a problem because we don't want this to be treated as one of the characters it should be treated as a separator so here I can use an extra parameter called delim and then like that let's try again and now our data frame looks so much better with four columns and the first five rows are printed here with the information this looks great but as you can see here we have got n a n wherever that text value is I'm just going to quickly keep our hotel booking data here so you can see that this is what is happening in between we have got these things which has only the text value and nothing else and that is what really causes the problem pandas offers many different functions and operations on the data using which you can kind of deal with these kind of scenarios we will deal with this in a minute but first let's understand what else can be done here on the data so for example if I want to understand what is in my data I can use df.describe which will basically does some description of the data it will generate some counts and other things on the usually the number our columns so it will give you room number because room number is the only number column it will do the analysis this 153 records there the mean is this standard deviation is that Etc of course some of these statistics have no meaning because we can't really add up or average room numbers but that's what describe does so if you have got multiple columns where this kind of analysis can be done the describe will go and add all of those columns here for you let's add one more piece of code and then let's just see DF and you can use the square bracket and then kind of select a field from your data so company is one of the columns so we can say DF dot company and then you can use the value counts option which will basically tell you how many items are there so for example uh leakso company made three bookings five chat company made three bookings and Twitter Works company made only one booking just a reminder here this data is all randomly made up but uh that's really what value counts does one way to think about value count says it's kind of like a basic pivot table uh where we are taking each unique value in the company column and then telling you how many counts are there again feel free to go ahead and experiment with this most of this is especially with the interactive style of programming that we are doing here where you write a line of code and then run this and see the results it's really easy to experiment and try out different things and work on it all right so now let's go ahead and fix the problem with our data which is we have got these uh text values that are kind of breaking the pattern so one solution for this problem is we don't want any of this itself and here we can just use the DF dot drop n a and and then you can kind of specify how you want to do it there are multiple parameters for many of these functions and as I'm still learning uh pandas I myself have no idea what other things are possible so normally this is what I have been doing all this while when I'm learning and figuring it out I would open the bracket and vs code conveniently shows me the help information about all the parameters so I would try to read this to understand exactly what is going on uh and for example how any or all default is any determine if a row or column is removed from the data mode when we have at least one n a or all n name so I think any is what works and if I don't say that that is the default so that's what is going to be you know we just run this and then it's going to drop that and now it has 134 rows by four columns and as you can see the hotels thing is gone while this is good this is not really a ideal situation many times you may want to retain this data but instead of having it here I want to move it to a separate column here so this is where you need to understand a little bit more about manipulating the data frames again pandas offers many ways to do this will understand some of the basics first so that you can get familiarity with this again.

I want to put a warning here that I am also a learner of pandas.

I'm just figuring out these things as we as I go along so whatever method or technique that I'm showing may not be the optimal or the best way of doing this but you know we are all learning to understand this concept let's first add a column which is twice the room number again that would be a meaningless type of a column but it will give you an idea of how such things can be added so it's a simple process of DF square brackets and then the new column name I'm going to call this as 2x room and this is equal to DF of square brackets room number and then twice two that's pretty much it we will now have a column called DX df2x rule I can see this and when I run this I'll have this extra column here to X room again our Nan appears here this is because when we drop it does generate a new data frame but that just gets printed on the screen we are not updating the drop room number into the data frame to update it you would need to use DF dot drop n a and then in place is equal to true now I don't want to do this because we do need to treat the data and reuse the hotel's information as a new column so this is why I didn't do that but if you use the in place option then it will update the data frame and create the new one another way of doing this is you can simply say DF drop n a and then df2 is equal to DF dot drop n a what this will do is it will create a new data frame which doesn't have the n a option and then we can just say df2.head and that will show you new data frame df2 which doesn't have those n arrows anyhow now that we understand some of the basics let's go ahead and make a column that will have this value added as a separate column to do this we will first need to understand what these values are these are n a n values and internally we can access them using the n a related functions so for example I want this value every time one of these columns has n a n so to know if one of these columns is n a n i can use for example DF and then any column I'm just going to go with room number because it will consistently have a value or n a n so rule number is n a now is n a is going to be a bunch of true or false values and this will produce a data frame that is a single column data frame that has true or false values so you can see that here it will be false false false and then true for the fourth row because the fourth one is hotels and it is n a n here right so this is the basic approach and what we will now need to do is if this is false then we don't need to do anything if it is true then we want the corresponding value to be added as a new column we'll create a variable called mask and then store DF room number is an a so the mask will contain the true or false values next up we will say DF and then text value this is our new column that we are going to create is equal to so this is where the dampy module that we brought in is helpful so NP dot where this is a function that that is part of the NP or the numpy library and you can specify the condition so the condition is mask and then based on the mask True Value false value is what we need to provide so it's kind of like if formula but it goes in a loop for each value in The Mask it will apply the conditions so NP wear of mask and then if it is true I want DF of the date and if it is false likes and let's just Sprint to the DF here we can see how that looks so we will have either hotels or all these values as blacks now while the blank value is all right it does kind of open up a problem whereas down the line so ideally what I want is instead of leaving them as blank I want them to be a n a value so here I'm just gonna change my code to np.net so this is how you can kind of access the Nan variable and then print that there you might think how does chandu really know all of these things well I didn't really know any of these things yesterday but as I was trying to solve this problem I went and searched online and read up on the pandas website as well as went through many many stack Overflow articles and posts to kind of understand how to solve this problem so now when I run this I can see my DF has a text value column that would be either the value or the Nan depending on what it is now wherever this is Nan I want to take this value and fill it up in that direction so this is called filling it up and you can do that using the fill functions that are available in the data frames or Badness so we will say DF of text value you can specify the parameters so you can see that there are different types of filling methods that are available B fill pad F fill Etc so B fill will go and fill in the back so it will kind of go in this direction F field will fill in the forward Direction so we need the B fill method because we need to take the hotels and then fill it up there so method equals B fill that will basically backfill the values and you now have hotels all the way through again this is I don't think it is does it does that in in place so if I try to add another code block and then print the DF we will still have Nan values here so we need to be doing this in place so here we will say DF text value fill in a method is equal to B fill in place is equal to true so this would basically update the existing data frame text value column and then let's just run this DF again and you can see that the values are now reused this is perfect at this point if I now drop all these Nan values our data frame would be clean so we'll just say DF dot drop in a um in place is equal to true okay and here is our final data frame our clean data is now ready and I can use this for doing some analysis like for example I can see uh value counts by text value or something else so for example I can say DF dot value counts uh I'm not even sure if this is the correct way of doing this but we'll just say text value yeah and then it will give you how many times each of the booking sites or the booking methods have been used to make the bookings pandas also has a a simple plotting Library so if you want to create a plot you can make this this is something that I have not tried I only read the examples so I'm gonna try this one out DF 3 is equal to so we make a new data frame that has these value cons and then df3 Dot Plot so this will create hopefully a plot that shows how many bookings are there I think the default is a line so this is showing me a line probably we could use a bar chart here this is something that I need to learn so I'll leave that to you as homework why don't you learn a little bit more about the Python Programming itself here is a long video that goes into greater detail about python on the other hand you can also solve these kind of problems using power Query in Excel and power bi and that requires writing no lines of code at all so check out this video for that I'll catch you in one of these two places bye.