Data Analytics with Python


Welcome to data analytics using Python. My name is Richard Kirschner with the Simplilearn team, that's www.simplylearn.com, get certified, get ahead. We're going to cover data analytics with Python: what data analytics is, applications of data analytics, types of data analytics, the data analytics process steps, and why Python for data analytics. Then we'll dive into a use case demo so you can actually see some script and what it looks like in Python code.

What is data analytics? Data analytics is a process of exploring and analyzing large datasets to make predictions and help data-driven decision-making. Now, the definition of large datasets keeps changing, so this can range from just about anything to anything, but in today's world we're usually talking about amounts of data significantly larger than you can glance at and figure out yourself. The two steps are: analyze the data, and then make decisions based on the data.

Applications of data analytics. The sky's the limit on this. In today's world almost every business and aspect of life, down to your music on Spotify, is driven by data analytics. But some of the big players when you go job hunting are going to be things like fraud analysis. If you want to make a lot of money, you're good at it, and you like dealing with numbers, go join the banks and track down the criminals who are stealing money. It's a big thing to protect credit cards, protect sales purchases, and catch bad checks; tracking any of those down is huge.

Healthcare is exploding. There's everything from trying to find cures for the COVID virus, or any of the viruses out there, to using your cell phone to diagnose different ailments so that you don't have to go and see the doctor. You can just take a picture of the funky growth on your arm (hopefully it's not too big) and send it in, and the data analytics looks at it and says, oh, this
is what this is, this is a professional you need to go see, or you don't need to see anyone. And that's just one aspect of healthcare. The databases being generated by healthcare, getting the right doctors, and helping the doctors analyze whether something is benign or malignant, whether it's cancerous: all those things are now part of the ongoing healthcare growth in data analytics.

Inventory management. Think of one of those huge warehouses where they're shipping out all the goods. How do you inventory that so that the stuff being purchased the most is near the entrance and all the other stuff is towards the back, or even pre-ship it? It's huge to be able to manage inventory, and pretty soon they'll just have a drone come in there, pick up some of those boxes, and move them around.

Delivery and logistics. Again, this is about getting from point A to point B. You can combine it with inventory, so you pre-ship stuff if you know a certain area is more likely to purchase it. How do you get the delivery to the most destinations in the shortest amount of time? They even pre-stack the trucks going out, and that's all done with data analytics: how do we stack all that stuff so it comes out in the right order?

Targeted marketing is a huge industry. Any kind of marketing, whether you're generating the right content, deciding who you're targeting with that marketing, or researching people and what they want so you know what products to market: all those things are huge. And these are just a few examples. You can go way beyond this, from tracking forest fires to astronomy and studying the stars. All of this is part of data analytics now, and it plays a huge role in all these different areas.

City planning is another one. You can see a nice organized city like this one, where you can get in and out of the neighborhoods. If you're a fire truck, or police officers, you need to be able to get in and out. You want
your tourists to be able to come in, you still want the place to look nice, and you want the right commercial development, the right industrial development, and enough residences for people to stay. All those things are part of your city planning, which again is huge in data analytics. So the sky's the limit on what you use it for.

Let's take a look at types of data analytics. This can be broken up in so many ways, but we're going to start with the most basic questions you're going to be asking in data analytics. The first one is descriptive analytics: what has happened? This is hindsight. What is the sales-per-call ratio coming out of the call center? If we have 500 tourists in a forest at a certain temperature, how many fires were started? How many times did the police have to show up at certain houses? All of that is descriptive.

The next one is predictive analytics: what will happen next? This is great if you have an ice cream store and you want to predict how many people need to work at the store on a certain day, based on the upcoming temperature and the time of year.

And then one of the fastest growing and most important parts of the industry is prescriptive analytics, and you can think of that as combining the first two. We have descriptive and we have predictive, and then you get prescriptive analytics: how can we make it happen? This is foresight. What can we change to make this work better? In all the industries we looked at before we can start asking these questions. City development is a good one. If we want our city to generate more income, and we want that income to be commercially based, what kind of commercial buildings do we need to build in that area to bring people over? Do we need huge warehouse stores, Costco-style buildings, or do we need little mom-and-pop joints that are going to bring in people from the country to come shop there? Or do you want an industrial setup? What do you need to bring that
industry in there? Is the car industry available in that area? If it's not the car industry, what other industries are in that area? All those things are prescriptive. We're guessing: what can we do to fix it? What can we do to fix crime in an area with education? What kind of education are we going to use to help people understand what's going on, so that we lower the rate of crime and help our communities grow? That's all prescriptive. It's all guessing. We want foresight into how we can make it happen, how we can make this better.

We really can't go into enough detail on these three, because a lot of people stumble on this when they come in and start doing analytics. Whether you're the manager, the shareholder, or the data scientist, you really need to understand them. In descriptive analytics you're studying the total units of furniture sold and the profit that was made in the past. In predictive analytics you're predicting the total units that will sell and the profit we can expect in the future, so we can gear up for how many employees we need and how much money we're going to make. And in prescriptive analytics you're finding ways to improve the sales and the profit, so maybe we sell a different kind of furniture. We're going to guess at what the area is looking for and how that marketing is going to change.

Data analytics process steps. Let's take a look at some of the basic processing and what it looks like when you're working with this data. There are five basic steps. This changes, and there's a lot that goes on when people talk about agile programming; the whole concept of agile is that you take some kind of framework like this and then build on it depending on what your business needs.

The first step is data collection. In a large company you might have somebody who is responsible for database management. You may have another person who is calling APIs and pulling data off of maybe
the Census Bureau, or maybe something very domain specific. If you're analyzing cancerous growths and how to understand them, then the data collection is going to be the measurements they take from the MRI, or maybe even the MRI images themselves. So there's a lot involved in data collection: how to control it, and how to make sure it has what you need, is clean, and doesn't have misinformation coming in.

Once you have the data collected, there's data preparation. In stage two we take that data and format it into something we can use. Probably one of the biggest formats you see is when you're processing text. How do you process text? Well, you use what they call a one-hot encoder, where each word is represented by a yes/no kind of setup, like a long array of bits. That's one way to prepare it: bit number one is "the", bit number two is "has", or whatever it is. Other preparations might be, if you're using neural networks, taking integers or float numbers and converting them to a value between zero and one, so that you don't have one of them creating a bias. So a lot of different things go into data preparation, and that is eighty percent of data science. We talk about data analytics as being a little more on the math side, and people usually talk about the data scientist as the overall preparer of this stuff. You're going to spend 80 percent of your time on data preparation.

Data exploration is the fun part. This is where you're exploring things, and it is maybe 10 to 15 percent of the time you spend with the data. It is probably the most important step, because this is where you've got to start asking questions. If you ask your questions wrong, you're going to get wrong information. If you're working with a company and they want to know the marketing values, then you've really got to focus on: how do we generate money for this company? Or, for fraud,
how do we lower the fraud rate while still generating a profit?

Step four is data modeling. This is where we actually start getting into the code: which model do we use to predict what's going to happen? And then there's result interpretation. We want to be able to interpret those results. You usually see that in Matplotlib, where we create nice images that show up on the dashboard for the marketing manager or the CEO, so they can take a quick look and say, hey, I can see what's going on there. You want to reduce it to something they can easily read. They don't want to hear the scientific terms; they want to see something they can use. We'll talk about that a little more when we start looking at some of this in the demo.

Since this is data analysis with Python, we've got to ask the question: why Python for data analytics? I mean, there's C++, there's Java, there's .NET from Microsoft. Why do people go to Python for it? There are a number of reasons. One, it's easy to learn, with simple syntax. You don't have strict typing like you do in Java and other languages, so it allows you to be a little lazy in your programming. That doesn't mean it can't be set up that way, or that you don't have to be careful; it just means you can spin up code much quicker in Python. The code to do something in Python is a lot of times one, two, three, or four lines, where when I did the same thing in, say, Java, I found myself with 10, 12, 13, or 20 lines, depending on what it was.

It's very scalable and flexible. It's flexible because you can do a lot with it, and you can easily scale it up: you can go from something on your machine to using PySpark under the Spark environment and spread that across hundreds if not thousands of servers, across terabytes or petabytes of data.

There's a huge collection of libraries. This one's always interesting, because Java has a huge collection of libraries, C has a huge collection of libraries, .NET does, and
they're always in competition to get those libraries out. Scala for your Spark, all of those have huge collections of libraries. This is always changing, but because Python's open source, you almost always have easy-to-access libraries that anybody can use. You don't have to go check your licensing or get special licensing like you do with some packages.

Graphics and visualization: Python has a really powerful package for that, so it's easy to create nice displays for people to read. And community support: because Python is open source, it has a huge community that supports it. You can do a quick Google search and probably find a solution for almost anything you're working on.

Python libraries. Let's bring it together. We have data analytics and we have Python, so we're talking about Python libraries for data analytics. The big five players are NumPy, pandas, Matplotlib, SciPy (which is going to be in the background, so we're not going to talk too much about the scientific formulas inside SciPy), and scikit-learn.

NumPy supports n-dimensional arrays and provides numerical computing tools useful for linear algebra and Fourier transforms. You can think of this as just a grid of numbers. It doesn't even have to be numbers, because you can also put words and characters and just about anything into that array. But think of a grid, and then you can have a grid inside a grid, and you end up with a nice three-dimensional array. If you want a three-dimensional example, think of images: you have your three channels of color (four if you have an alpha), and then you have your x-y coordinates for the image. So you go x, y, and then the three channels that generate the color. And NumPy isn't restricted to three dimensions. Imagine watching a movie: now you have your movie clips, each has some number of frames, each of those frames has some number of x-y coordinates for the pictures in
each frame, and then you have your three channels for the colors. So NumPy is just a great way to work with n-dimensional arrays.

Closely tied to NumPy is pandas. It's useful for handling missing data, performs mathematical operations, and provides functions to manipulate data. Pandas is becoming huge because it is basically a data frame, and if you're working with big data, in Spark or any of the other major packages out there, you realize that the data frame is central to a lot of that. You can look at it as an Excel spreadsheet: you have your columns, you have your rows or indexes, and you can do all kinds of manipulations of the data within, including filling in missing data. That's a big thing when you're dealing with large pools or lakes of data that might be collected differently from different locations.

And Matplotlib. We did skip over SciPy, which does a lot of mathematical computation and usually runs in the background of NumPy and pandas, although you do use it directly and it's useful for a lot of other things. But Matplotlib is the final part; it's what you want to show people. This is your plotting library in Python. Several toolkits extend Matplotlib's functionality. There are something like a hundred different toolkits, ranging from how to properly display star constellations for astronomy (there's a very specific one built just for that) all the way to some very generic ones. We'll add Seaborn in when we do the labs in a minute. Matplotlib also creates interactive visualizations, so there are all kinds of cool things you can do beyond just displaying graphs. We won't do the interactive graphs, but you'll get a pretty good grasp of some of the different things you can do in Matplotlib.

Let's jump over to the demo, which is my favorite part: roll up
our sleeves and get our hands on what we're doing. Now, there are a lot of options when we're dealing with Python. PyCharm is a really popular one, and you'll see it all over the place, so it's one of the main ones out there. There are a lot of other ones; I used to use NetBeans, which has kind of lost favor, and I don't even have it installed on my new computer. PyCharm is really popular for general Python development, but for data science we usually go to Jupyter Notebook or Anaconda. We're going to jump into Anaconda because that's my favorite one to go to, since it has a lot of external tools for us. We're not going to dig into those, but we will pop in there so you can see what it looks like.

With Anaconda we have our JupyterLab and we have our Notebook. These are essentially identical; JupyterLab is an upgrade to the notebooks with multiple tabs, that's all it is. We'll be using the Notebook. You can see that PyCharm is so popular with Python that it's even highlighted here in Anaconda as part of the setup. Jupyter Notebook can also be a standalone, so we're actually going to be running the Jupyter Notebook.

Then you have your different environments. We're going to be under the main py36 one. There's a root one, and I usually label mine py36. The reason is that, as of this writing, TensorFlow only works in 3.6 and not in 3.7 or 3.8 for doing neural networks. But you can have multiple environments, which is nice. They separate the kernels, so it helps protect your computer when you're doing development. And this is just a great way to do a display or a demo, especially if you're looking for that job: pull up your laptop and open it up, or if you're in a meeting, get it broadcast up to the big screen so that the CEO can see what you're looking at.

When we launch the notebook, it opens up a file browser in whatever web browser you have. This happens to be
Chrome. Then you can go under New (there are a lot of different options depending on what you have installed) and pick Python 3, and this creates an untitled notebook. You can see here I'm in a Simplilearn folder for other work I've done for Simplilearn, and that's where I save all my stuff. I can browse through other folders, making it really easy to jump from one project to another. We'll go ahead and rename this one "data analytics", just so I can remember what I was doing, which probably describes about 50 of the files in here right now.

So let's jump in and take a look at some of the tools we were talking about. As we go through the demo, let's start with NumPy, the least visually exciting. I'm going to zoom in here so you can see what we're doing. The first thing we want to do is import numpy, and we'll import it as np, which is the most common NumPy convention. Let's also change the view so we have line numbers. We probably won't need them, but they make for easy reference. Then we'll create a one-dimensional array. We'll call this arr1, and it equals np.array, and you put your array information in here. In this case we'll spell it out; there are lots of ways to generate these arrays, with a range and in other ways, but we'll just do 1, 2, 3, so three integers. If we print arr1 and run this, you can see right here it prints 1 2 3. You can see why the Jupyter notebook is a really nice interface for showing other people what you're doing. So this is the basic: we've created a one-dimensional array, and the array is 1, 2, 3.

One of the nice things about the Jupyter notebook is that whatever ran in this first cell is still in the kernel. It still has numpy imported as np, and it still has our variable arr1 equal to np.array of 1, 2, 3. So when we go to the next cell, we can check the type of the array. We'll print type of arr1, run that, and it says class numpy.ndarray. So it's its own class; all we're doing is checking what that class is.

If you're going to look at the array class, probably the biggest thing you'll do is check the shape. I don't know how many times I find myself doing this, because I forget I'm working with a three-dimensional or four-dimensional array and I have to reformat it somehow so it works with whatever else I have. So we do arr1.shape. The shape is just 3, because it has three members and it's a one-dimensional array. That's all that is.

With the NumPy array we can easily access elements. We'll stick with the print statement here, but if you put a variable in a Jupyter notebook cell and it's the last thing in the cell, it displays the same as a print statement. So arr1[2] on its own is the same as print(arr1[2]); in our Jupyter notebook those are identical statements. We'll go ahead and stick with the print on
this one, and it's 3. There's our position 2: we count 0, 1, 2, and position 2 equals 3. We can easily change that, so we set arr1[2] equal to 5, and then if we display arr1 you can see right down here it comes out as 1, 2, and 5. There I left the print statement off, because it's the last expression in the cell, and the notebook will always display the value if you just put it in like that. That's a Jupyter notebook thing; don't do that in PyCharm. I've forgotten before while doing a demo.

We talked about multiple dimensions, so we'll do a two-dimensional array. This is again a NumPy array, and in it we need our first row, 1, 2, 3, and our second row, 3, 4, 5. We'll call it array two, and when we run that, there's our array two: 1 2 3 and 3 4 5. We can also do array two at 1, and then, it doesn't really matter which one, let's do 2. If I run this it prints out 5, because 1, 2, 3 is row 0 and 3, 4, 5 is row 1, and then within that row we count 0, 1, 2, which takes us to the 5.
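
As a recap of the cells so far, here is a minimal sketch of the same steps in one place. The name arr2 for the two-dimensional array is an assumption (the recording only calls it "array two"), and outside a notebook you need the explicit print calls.

```python
import numpy as np

# One-dimensional array of three integers, as in the demo.
arr1 = np.array([1, 2, 3])
print(arr1)          # [1 2 3]
print(type(arr1))    # <class 'numpy.ndarray'>
print(arr1.shape)    # (3,)

# Indexing is zero-based, so position 2 is the third element.
print(arr1[2])       # 3
arr1[2] = 5          # arrays can be modified in place
print(arr1)          # [1 2 5]

# Two-dimensional array: row 0 is [1, 2, 3], row 1 is [3, 4, 5].
# arr2 is an assumed name for what the demo calls "array two".
arr2 = np.array([[1, 2, 3], [3, 4, 5]])
print(arr2[1][2])    # 5  (row 1, then position 2 within that row)
```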

And then maybe we forgot what we were working with, so we'll do array two dot shape. If we run that, we see we have two rows and each row has three elements: a two-dimensional array, (2, 3). If you looked up here when we did it before, it just had 3, comma, nothing. When a tuple has a single element, Python always writes it with a trailing comma. But you can see right here we have 2, 3. And if you remember from up here, we did array two at 1, 2; we run that and we get the 5. You'll see I switched something on you, because you can also write it as 1 comma 2 inside one pair of brackets to get to the same spot.

You can also count backwards, which is kind of fun. Now, 2 is the last one, 0, 1, 2, so we can count backwards and use minus 1. If we run this we get the same answer whether we count forwards as 0, 1, 2 or backwards as minus 1, minus 2, minus 3. And you can see that if I change this minus 1 to a minus 2 and run that, I get 4, which is going backwards: minus 1, minus 2.
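
The shape check and the two indexing styles can be sketched as follows. This is a small illustration under the assumption that the two-dimensional array is named arr2; the recording does not show the exact identifier.

```python
import numpy as np

arr2 = np.array([[1, 2, 3], [3, 4, 5]])   # assumed name for "array two"

print(arr2.shape)                # (2, 3): two rows, three elements each
print(np.array([1, 2, 3]).shape) # (3,): a one-element tuple has a trailing comma

# Chained brackets and a single comma-separated index reach the same element.
print(arr2[1][2], arr2[1, 2])    # 5 5

# Negative indices count from the end of the axis.
print(arr2[1, -1])               # 5  (last element of row 1)
print(arr2[1, -2])               # 4  (second to last)
```
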
So there are a lot of different ways to reference what we're working on inside the NumPy array. It's a really cool tool with a lot of things you can do with it.

We talked about the fact that it can also hold things that are not numbers. We'll call this array s, for strings, and it equals np.array, with our brackets in there, and let's go China, India, USA, Mexico. It doesn't matter; we can put whatever we want in here. If we print that out and run it, you can see we get another NumPy array: China, India, USA, Mexico. It even gives us our dtype of U6.

A lot of times when you're messing with data you want a range, so we'll call this array r, for range, just to keep it uniform, and use np.arange. This is a command inside NumPy to create a range of numbers. If you're testing data, maybe you have time increments that are spaced equally apart, but in this case we're just going to do integers, from 0 to 20, skipping every other one. We'll print it out and see what that looks like, and you can see here we have 0, 2, 4, 6, 8, 10, 12, 14, 16, 18. As expected, it skips every other one. Just a quick note: there's no 20 on here. Why? Well, this starts at 0 and counts up to 20 without including it. You may be used to another language where you explicitly say less than, or less than or equal to, 20, like for x equals 0, x plus plus, x is less than 20.

That's what this is: it just assumes x is less than 20. And what if we want to create a very uniform set? You know, 0, 2, 4, 6 is one thing, but what happens if I want numbers from 0 to 10 and I need 20 increments in there? We can do that with linspace. We'll call this one l; I don't think we'll actually use any of these again, so I don't know why I'm creating unique identifiers for them. We'll do np.linspace, from 0 to 10, and remember, this one does go up to 10. Then let's say we want 20 different increments in there. So say we have a data set, we know it covers a certain time period, and we need to divide that time period into 20 pieces. Here we go: you can see right here it has 20 pieces, but it's over 10 years, divided evenly. You can see the first step is 0.526, and unlike the others, the 10 is on the end, so it goes all the way up to 10.
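
To make the difference between arange and linspace concrete, here is a minimal sketch. The variable names arr_s, arr_r, and arr_l are assumptions standing in for the "array s", "array r", and "l" mentioned in the recording.

```python
import numpy as np

# Arrays can hold strings; the dtype records the longest string length.
arr_s = np.array(["China", "India", "USA", "Mexico"])
print(arr_s.dtype)            # <U6  (unicode strings of up to 6 characters)

# arange(start, stop, step): the stop value is excluded, like range().
arr_r = np.arange(0, 20, 2)
print(arr_r.tolist())         # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

# linspace(start, stop, num): num evenly spaced values, stop included.
arr_l = np.linspace(0, 10, 20)
print(len(arr_l), arr_l[0], arr_l[-1])   # 20 0.0 10.0
step = arr_l[1] - arr_l[0]               # 10 / 19, roughly 0.526
```

The spacing is 10 / 19 rather than 10 / 20 because 20 points have only 19 gaps between them.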

And then we can also do random: there's np.random. If you're doing neural networks, you usually start by seeding them with random numbers. We'll just do np.random.random, and we'll just call this arr; we'll stop giving everything unique names. We print that out and run it, and you can see we have random numbers. They are between zero and one, so all these numbers are under one. You can easily alter that by multiplying them out; if you want 0 to 100 you multiply by 100, and you can also round them if you need integers from 0 to 100. There are all kinds of things you can do, but this generates random floats between zero and one. You have a couple of options: you could reshape the result, or you can generate the numbers in whatever shape you want. Here we did 3 and 4, so you can see three rows by four values, the same thing as reshaping 12 values to three by four.

If you're going to do that, you might need an empty data set. I have had this come up many times, where I need to start off with zeros because I'm going to be adding stuff in there. Or it might be zeros and ones: if you're removing the background of an image, you might want the background to be zero, and then you figure out where the subject is, set all those boxes to one, and you've created a mask. Creating masks over images is really big, and you do that with a NumPy array of zeros. We can also give it a shape, and we'll do this all in one shot this time: np.zeros, and in this case we'll do 2 comma 3. I forgot the extra parentheses around the shape; I knew I was forgetting something. There we go. When we run this, you can see we have our zeros, and maybe this is a mask for an image. It has two rows of three values in it, so it's a very small image, just a few tiny pixels. And maybe you're looking to do something the opposite way: instead of creating a mask of zeros and filling
in with ones, maybe you want to create a mask of ones and fill it in with zeros. We'll do np.ones just like before, with 3 comma 4, and when we run this you'll see it's all ones. We could even do, say, a 10 by 10 icon, and then you have your three colors, so it creates quite a large array for doing pictures and the like when you add that third dimension in. If we take that off, it's a little easier to see. We'll do 10 by 10 again, and you can easily see how we have 10 rows of 10 ones.

You can also create an array, and we'll do 0, 1, 2, and then repeat it. You can do a repeat of the array, and let's repeat it three times. If we run this, you'll see we have 0, 0, 0, 1, 1, 1, 2, 2, 2. Whenever I think of a repeat, I don't really think of the first element three times, then the second; I always think of it as 0, 1, 2, 0, 1, 2, 0, 1, 2. It catches me every time. The actual command for that one is tile. Again, if we do a range of 3 and run this, you can see how it generates 0, 1, 2, 0, 1, 2, 0, 1, 2.
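
The array-generation calls from this stretch can be sketched together. The names below (arr, reshaped, mask, icon) are illustrative assumptions, and setting one mask element to 1 is only there to show the fill-in idea.

```python
import numpy as np

# Random floats in [0, 1), generated directly in a 3x4 shape...
arr = np.random.random((3, 4))
# ...or generated flat and reshaped to the same layout.
reshaped = np.random.random(12).reshape(3, 4)

# The shape is passed as a tuple, hence the extra parentheses.
mask = np.zeros((2, 3))
mask[0, 1] = 1                 # mark one "pixel" of the mask

icon = np.ones((10, 10, 3))    # 10x10 image with three color channels

# repeat duplicates each element in place; tile repeats the whole sequence.
print(np.repeat([0, 1, 2], 3))     # [0 0 0 1 1 1 2 2 2]
print(np.tile(np.arange(3), 3))    # [0 1 2 0 1 2 0 1 2]
```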

And if you're dealing with an identity matrix, we can do that also, if you're big on doing your matrices. We'll call this one identity matrix; I guess we'll go ahead and spell it out today. The command we're looking for is eye, e-y-e, and we'll do 3, and then we'll print this out. There's our identity matrix. It comes out as a three-by-three array, and it puts the ones down the diagonal, ready for doing matrix math.

We can manipulate that a little bit too. When we talk about matrices, we might not want ones across the middle, in which case we have the diagonal. So we can do an np.diag, and let's put in the diagonal 1, 2, 3, 4, 5. When we run this, again, it generates a value, and just putting that expression in the cell is the same as putting print around it, or assigning it to a variable and then printing that. You can see it generates a matrix with 1, 2, 3, 4, 5 on the diagonal, and there's the beginning of your array for working with matrices.

We can also go in reverse. Let's create an array equal to, remember our random, np.random.random, and we'll do a five-by-five array. Oops, it helps if I don't mistype the numbers; in this case I just need to take out the brackets. There we go, you have your five-by-five array set up in there. Because we're working with matrices, we might want to do this in reverse and extract the diagonal, which would be the 0.79, the 0.678, and so on. We simply type in np.diagonal and put our array in there, and this will of course display it, because it returns it as a value. You can see here's our diagonal going across from our matrix.

We did talk about shape earlier; if you remember, you can print the shape out. You can also do the dimensions with ndim. It's very similar to shape; it comes out and just says two dimensions. We can also look at the size. If we do size on here and run that, you can see it has a size of 25, two dimensions, and of course five by five. That last one was from the shape we looked at earlier; there's our 5x5 shape.

And if you remember, earlier we did random. I talked a little bit about manipulating the 0-to-1 values to get different ranges, but you can also go straight for the integer version. We'll do minus 10, 10, 4, so we're going to generate random integers between minus 10 and 10.
We're going to generate four of those, and when we run that we have 7, minus 3, minus 6, minus 3. They're all between minus 10 and 10, and there are four of them.

Now we jump into some of the functionality of arrays, which is really great, because this is where they come into their own. Here's your array, and you can add 10 to it. If I run this, it takes my original array from up here with the integers and adds 10 to all of those values. Oh, this is the decimal one, that's right; I had the random decimals stored in arr. So this takes the random numbers I had from 0 to 1 and adds 10 to them, and we can just as easily do minus 10.
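
Here is a minimal sketch of the matrix helpers and the integer generator from this part of the demo. It assumes np.diag for building the diagonal matrix and np.random.randint for the integers, since the recording does not spell out the exact function names; the variable names are also illustrative.

```python
import numpy as np

identity = np.eye(3)                    # 3x3 identity, ones on the diagonal
diag_matrix = np.diag([1, 2, 3, 4, 5])  # 5x5 matrix with these values on the diagonal

arr = np.random.random((5, 5))          # 5x5 array of floats in [0, 1)
main_diag = np.diagonal(arr)            # extract arr[0, 0], arr[1, 1], ...

print(arr.ndim, arr.size, arr.shape)    # 2 25 (5, 5)

# Assumed call: four random integers from -10 up to, but not including, 10.
ints = np.random.randint(-10, 10, 4)
```

Note that np.diag works in both directions: given a one-dimensional sequence it builds a matrix, and given a two-dimensional array it returns the diagonal.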

We could even do times two, and we could do divide by 2, which takes the random numbers we generated and cuts them in half, so now all these numbers are under 0.5. That's another way you can change the numbers to what you need. As you dig deeper into numpy we can also do exponentials, so np.exp is an exponential function, which generates some interesting numbers off of the random ones. I don't even remember what the original numbers in the array were, because we did the random numbers up there, so here are our original numbers, and if you apply the exponential to them you get e to the x for each value. And just like you can do e to the x, you can also do the log, so if you're doing logarithmic functions, say in reinforcement learning, you might have some kind of log setup in there, and you can see the natural logarithm of these different array numbers. If you're working with log base 2 you can just change it to np.log2. You do have to look these up, because it is not log 1, 2, 3, 4, 5; it is log and log2. So just a quick note: that 2 is not a variable going in, it is part of the actual function name. There's a number of them in there and you'll have to go look at the documentation, but you can also do log10, so here's the log base 10.
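The functions just mentioned can be sketched like this. The input values and variable names are assumptions, picked so the base 2 logs come out as whole numbers; the demo itself ran these on random decimals.

```python
import numpy as np

arr = np.array([0.5, 1.0, 2.0, 4.0])   # assumed sample values

doubled = arr * 2
halved = arr / 2

exp_vals = np.exp(arr)       # e**x for each element
ln_vals = np.log(arr)        # natural log (base e)
log2_vals = np.log2(arr)     # base 2: -1, 0, 1, 2 for these inputs
log10_vals = np.log10(arr)   # base 10

print(exp_vals)
print(ln_vals)
print(log2_vals)
print(log10_vals)
```

Note that log2 and log10 are separate function names rather than np.log with a base argument.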
Some other really cool functions you can do with this: your sine, so we can take the sine value of all of our different values in there, and if you have sine you of course have cosine. We can run that, so here's the cosine of those. And there's your tangent as well. If you're doing activations on your numpy array, note that the tangent-style activation from neural networks is the hyperbolic tangent, np.tanh, rather than the plain tangent: it's one of the ways you can activate a node because it forms a nice curve between negative one and one with the transition in the middle. That's just jumping a little bit into neural networks. Then we just put the array back out there so we can see it while we're doing this. You can also sum the values, so we have np.sum, and you can do a summation of all the values in this array; you'll see that if you added all these together they'd equal 12.519 and so on.

One of the things you can also do is sum by axis. So we could do axis equals zero, and you can think of that in numpy as working down the rows, so we're summing each of these columns. You can also change this to 1, and now we're summing across each row, so that is the summation of this row, and so forth going down. Maybe you don't need to know the summation, maybe what you're looking for is the minimum, so here's our minimum. This comes up a lot, because you have, say, your errors, and we want to find the minimum error inside of this array. Just like the other one we can do axis equals zero, and you can see here the smallest number in this first column is 0.0645, and so on. And if you have a minimum, well, you might also want to know the max; maybe we're looking for the maximum profit. Here we go, you can see 0.79 is the maximum on this first column, and just like we did before you can change this to a 1 on axis, or you can take the axis out of here and just find the max value for the whole array, and the max value in here was 0.8344 and so on.

Since we're talking data analytics, we want to go ahead and look at the mean, which is the same as the average. This is the mean across the whole thing, and just like before we could also do axis equals zero, and then you'll see the mean along that axis. If we have the mean we might want to know the median, and there's our median, the middle value of the sorted numbers. If we have the average, a lot of times you pair the mean with the standard deviation; we can run that, and there's our standard deviations along the axis. We can also do it across the whole array. If we're going to do standard deviations there's also variance, which is your var, and there's our variance across the different columns. So we looked at variance, standard deviation, the median and the mean. There are more, but those are the most common ones used in data analytics when you're going through your data and figuring out what you're going to present to the shareholders.

Something else we can do is take slices; you'll hear that terminology. We have a five by five array, but maybe we don't want the whole array. Maybe on the first part we want from one on, so we don't want the zero row in there and we go up to four, and on the second part we just want two up to three. This notation right here says rows one to the end, and if we run this you can see it gives us row one to the end and, from each row, column two up to three. Now remember, it doesn't include three, and that's why we only get the one column. If you wanted columns two and three you would need to go two to four, so it goes up to but not including four. We could also do this counting from the end, just like we learned earlier: we can go minus one, whoops, and it's the same thing, because we have zero, one, two, three, four, so two to minus one is the same as two to four; it goes from two up to the last one. Also very common with arrays is
you're going to want to sort them. So we still have our array up here that we randomly generated, and we might want to sort it. We'll throw an axis back in there, axis equals one, and if we run this you can see that it sorts each row from the lowest value, the point two, to the highest. We can also change this of course to axis 0 if you're sorting by column, so maybe your values are based on columns. And then of course you can sort the whole array. I don't usually do that, but I guess sometimes it might come up, and you can see right here we have a nice sorted array. Now let's just go ahead and reprint our array so we can look at it again; we're starting to get too many boxes up there.

Something else you can do with an array is transpose it. This comes up more than you would think. When you transpose it, you'll see that the rows and the columns are swapped, so where 0.79, 0.57, 0.064 was a column, now we've switched it and the column starts 0.79, 0.42. You can see this more dramatically if we take a slice: we'll take just the first few rows with all the columns and transpose that. We'll also run the same slice without the transpose up here so you can see how the two look next to each other. There we go. The plain slice comes up with three rows and one, two, three, four, five columns; transposed, we have one, two, three, four, five rows and three columns. When they first started putting this together, the original version was a function, transpose, and this still works: you can see it generates the same value as just a capital T. Many times we flip data like this because we'll have an x-y value or an image or something like that, and it's being read one way out of one process and the next process needs it the opposite way. This actually happens a lot, so you need to know how to transpose the data really quickly.

We'll just stick with the transpose on here, and instead of doing it this way we might need to do something called flattening. Why would you flatten your data? If this is an array going into a neural network, you might want to send it in as one set of values instead of as rows and columns, and you can see here all the values come out as a single array; it just flattens it down into one dimension. So we covered our statistics like the mean and median, transpose, and some different variations on here. Some of the other things we want to do: what happens if we want to append to our array? Let's create a new array, because I'm getting tired of looking at the same set of random numbers we generated earlier. We'll create something a little simpler so it's easier to see what we're doing: four five six seven eight, that's good enough, let's do four five six
seven eight, and if we print this array, there it is. Now we might want to append something to the array: we have our array and we need to extend it. You've got to be very careful about appending things to your array, and there's a number of reasons for that. One is run time: because of the way the numpy array is set up, a lot of times you build your data first and then push it into the numpy array instead of continually adding on to the array. Append also returns a new copy rather than changing the original, which protects your data. So there's a lot of reasons to be careful about appending this way, but you can certainly do it. We take our array and create a new array, array 1, by appending 8 to it, and if we print array 1 you'll see 4 5 6 7 and then there's our 8 appended onto the end. Whoops, array one, let's try that again. There we go, now we have the eight appended onto the end.

And if you're going to append something, you might instead want to insert; it might be that you need to keep a certain order. We do the same thing with our array, but we insert at the beginning, and let's insert one two three. We print our array two, run it, and you can see one two three is inserted at the beginning. Insert is a lot more powerful in that you can put the values anywhere in the array. We can move it to the one spot, and there we go, one two three now sits after the first value. We can do a minus one just for fun, and you'll see one two three comes up just before the last value, because we're counting backwards by one. Try a minus zero and run this, and it turns out that minus zero puts it back at the beginning, because minus zero is just zero; it takes the minus sign off.

And just like we add numbers on, we might want to delete numbers, so let's do an np.delete. Let's make it a little easy to watch: we'll create an array three with np.delete, working on array two, and what we want to do is delete position zero. If you look at this, array two starts with one, and when we delete that position and print it out, we've deleted the one right out of there. We can also pass a list of positions rather than a single one: let's do one comma three, and if we run that you'll see we've deleted position one and position three, which deleted our 2 and 4.
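The append, insert and delete steps above can be sketched as follows. The starting values and the names arr, arr1, arr2, arr3 follow the spoken description as best it can be reconstructed, so treat them as assumptions; the extra names (at_one, before_last, at_start, without_1_and_3) are added here purely for illustration.

```python
import numpy as np

arr = np.array([4, 5, 6, 7])

# np.append returns a new array; arr itself is left untouched.
arr1 = np.append(arr, 8)                  # 4 5 6 7 8

# Insert the values 1, 2, 3 before position 0.
arr2 = np.insert(arr1, 0, [1, 2, 3])      # 1 2 3 4 5 6 7 8

# Other positions: index 1, index -1 (just before the last value),
# and -0, which is simply 0.
at_one = np.insert(arr1, 1, [1, 2, 3])        # 4 1 2 3 5 6 7 8
before_last = np.insert(arr1, -1, [1, 2, 3])  # 4 5 6 7 1 2 3 8
at_start = np.insert(arr1, -0, [1, 2, 3])     # same as arr2

# Delete a single position, or a list of positions.
arr3 = np.delete(arr2, 0)                     # 2 3 4 5 6 7 8
without_1_and_3 = np.delete(arr2, [1, 3])     # 1 3 5 6 7 8

print(arr1, arr2, arr3, without_1_and_3, sep="\n")
```

All three functions build a new array each time, which is why repeated appends inside a loop get slow on large data.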

Now keep in mind, when you're adding and deleting values you have to be really careful, because there's a time element involved as far as where the data is coming from, and it's really easy to delete the wrong data and corrupt what you're working on, or to insert stuff where you don't want it. So there's always a warning when we talk about manipulating numpy arrays. And just like anything else we're doing, we can make a copy: we'll create an array c, which equals our numpy array three dot copy. Maybe you want to protect your original data, or maybe you're making a mask, so you copy the array and then in the new array make all the alterations, changing the values to zeros and ones to mask over the first one. And of course if we print array c, since it's a copy of array three, it's the same thing: one three five six seven eight.

Now we're getting into combining and splitting arrays. I end up doing a lot of this, and I don't know how many times I end up fiddling with it and having a mess, but you do it a lot: you combine your arrays, you split them, you might need one set of data for one thing and another set for the other. So let's create two arrays, array 1 and array 2. The terminology to look for is concatenate. We'll call the result arr cat, our concatenated array, and we're taking array one and array two. It's very important to pay attention to your axes and your counts. I can't merge two arrays whose shapes don't line up; if I'm merging on axis 0 and the other dimension doesn't match, it's going to give me an error and I'll have to reshape them. So you've got to make sure that whatever you're concatenating together works. As you can see here, we have one two three four, one two three four, and then five six seven eight, five six seven eight stacked along the zero axis. Each array is two rows of four values, a two by four. If we switch this to axis one, you can see how that flips it a little bit, so now each row reads one two three four five six seven eight.

Now if I change the arrays so they have different widths, concatenate them and run this on axis one, it gives me an answer, because both still have two rows. But if I switch this to axis zero, where now it's got three and five, it gives me an error. So you've got to be really careful to make sure that, for whatever axis you are putting together, the other dimensions match. Like I said, both of these have two rows, and since we're joining on axis one, it lets me merge the second array right onto the end of each row. You could imagine this as an x-y plot, with the x value going in and the predicted y value coming out, and then you have another prediction and you want to combine them; this works really easily for that. Let's just put this back to where we had it; oops, I forgot how many changes I made, and I messed up my concatenation order here. There we go. Okay, so you can see the axis is really important when you're doing your concatenation. We'll switch this back to one just because I like the looks of that better: there we go, two rows.

Now there are other commands in here. We can do cat v equals np.vstack. This is nothing more than your concatenation, but we don't have to put the axis in there, because the v stands for vertical. If we print out cat v and run this, you can see we get one two three four, one two three four on top, and that is the same as making this axis zero. And if you have a vertical stack you can also have an hstack, so if we change this from vstack to, oops, here we go, hstack, and change the name from cat v to cat h, and I run this, it's the same as doing axis 1.
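A short sketch of the concatenation behavior described above. The array contents follow what is read out in the demo; the names cat0, cat1, cat_v, cat_h, b1 and b2, and the exact mismatched shapes, are assumptions added for illustration.

```python
import numpy as np

a1 = np.array([[1, 2, 3, 4], [1, 2, 3, 4]])
a2 = np.array([[5, 6, 7, 8], [5, 6, 7, 8]])

cat0 = np.concatenate((a1, a2), axis=0)   # stack rows: shape (4, 4)
cat1 = np.concatenate((a1, a2), axis=1)   # extend each row: shape (2, 8)

cat_v = np.vstack((a1, a2))   # same result as axis=0
cat_h = np.hstack((a1, a2))   # same result as axis=1 for 2-D arrays

# Mismatched widths: joining on axis 1 works because both arrays have
# two rows, but axis 0 fails because the row lengths (3 and 5) differ.
b1 = np.array([[1, 2, 3], [1, 2, 3]])
b2 = np.array([[4, 5, 6, 7, 8], [4, 5, 6, 7, 8]])
ok = np.concatenate((b1, b2), axis=1)     # shape (2, 8)

try:
    np.concatenate((b1, b2), axis=0)
    failed = False
except ValueError as err:
    failed = True
    print("axis=0 failed:", err)

print(cat0.shape, cat1.shape, ok.shape)
```

The rule of thumb: every dimension except the one you are joining along has to match.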

The process is identical in the background. Your vstack and your hstack are like a legacy setup; most people just use concatenate and put the axis in there, because it has a lot more clarity and is more commonly used nowadays. The last section in numpy we're going to cover is kind of data exploration, and that'll make a little bit more sense in just a moment; sometimes they call them set operations. Let's say we have an array, one two three four five six three, whatever it is, so we generate a nice little array here, and what I want to do is find the unique values in that array. Maybe I'm building what they call a one-hot encoder and I need to know how long my bit array is going to be; or each word is represented by a number and I want to know how many of each word are in there, if we're doing a word count, which is a very popular thing to do. You can see here when we do np.unique we have one two three four five six; those are our unique values. Instead of just the unique values, we can also get the counts of each unique value. This is very similar to what we just did up here with np.unique, but we add a little bit more to the arguments: return counts equals true. So instead of just returning the unique values, it also tells us how many times each one occurs. We'll print our uniques and print our counts, and when we run that you can see our unique values, one two three four five six, just like before, and then there's two ones, two twos, two threes, two fours, one five, two sixes. You can go through and count them if you want, but this is a quick way to find out your distribution of different values.
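As a sketch of the unique-values step: the array below is an assumption, built to match the counts read out above (two ones, two twos, two threes, two fours, one five, two sixes), since the exact array in the demo is not shown.

```python
import numpy as np

# Assumed input, chosen to reproduce the counts described in the demo.
arr = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 6])

# Just the distinct values, sorted.
only_uniques = np.unique(arr)

# With return_counts=True, np.unique returns two arrays:
# the distinct values and how many times each one occurs.
uniques, counts = np.unique(arr, return_counts=True)

for value, count in zip(uniques, counts):
    print(f"{value} appears {count} time(s)")
```

If each number stood for a word, counts would be the word frequencies and len(uniques) the vocabulary size.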
So you might want to know how often the word "the" is used versus the word "and", if each word is represented as a unique number. Along with the set operations, and let me just put a note up here, we're going to look at intersection, and we might also want to know the difference and what's in neither. What we're looking at now is: where do these two arrays intersect? We have one two three four five and three four five six seven, and we might want to know what is common between the two arrays. For that we have np.intersect1d, the 1d being for a one-dimensional array, and we put in array one and array two. If we run this we can see they intersect at three four five; that's what they have in common. We're going to go through a couple of different options, so let's change this from intersect1d and do the same thing, and we'll print it. Besides the intersection, where they have commonalities, another useful one is union1d: instead of the intersection we want all the values that are in either of them. So here's our union1d, and when we run that you can see we have one two three four five six seven, so that's all the different values in there. We have two more to go. We want to know what the set difference is, and if you remember, sets are what they call these things. So here's setdiff1d, and when we run that you can see one and two, the values that are in array 1 but not in array 2. And we might want to know both what is in array 1 but not 2 and what's in 2 but not 1, and this would be setxor1d. So we have the four different options here: an intersection, what do they both have in common; a union, what are all the unique values in both arrays; the difference, what's in array one but not array two, so setdiff1d; and then setxor1d, what is in one but not in two together with what is in two but not in one.

So we dug a lot into numpy, because there's a lot of different little mathematical things going on in numpy. A lot of this can also be done in pandas, although usually the heavy lifting is left for numpy because that's what it's designed for. Let's go ahead and open up another Python 3 notebook. We want to explore what happens when you want to display this. This is where it starts getting, in my opinion, a little fun, because you're actually playing with it and you have something to show people. We'll rename this; we're going to call it pandas pyplot, just so we can remember for next time. We want to import the necessary libraries. We're going to import pandas as pd. Remember, this is a data frame, so we're talking rows and columns, and you'll see how nicely pandas works when you're actually showing data to people. Then we're going to have numpy in the background; numpy works with pandas, so a lot of times you just import them both by default. Seaborn sits on top of the matplotlib library, so sometimes we use seaborn because it extends it; it's one of the 100 packages that extend matplotlib, probably the most commonly used because it has a lot of built-in functionality. Almost by default I just put seaborn in there in case I need it. And of course we have matplotlib's pyplot as plt. Note we have as pd, as np, as sns, as plt. Those are pretty standard, so when you're doing your imports I would keep those, just so other people can read your code and it makes sense to them. Then we have the strange line here that says percent sign matplotlib inline. That is for Jupyter notebook only, so if you're running this in a different environment it will have a pop-up when it goes to display the matplotlib plot. With the most current version of Jupyter you can usually leave that out and it will still display right on the page as we go, and we'll see what that looks like. Then we're going to do the seaborn setup, sns.set, and we're going to set color codes equals true and otherwise just keep the defaults so we don't have to think about it too much.

We of course have to run this. The reason we run it is that these names all get set here; if we don't run this cell and I access one of them afterward, it'll fail with an error. The cool thing about Jupyter notebooks is that if you forgot to install one of these, because you do have to install them under your Anaconda setup or whatever setup you're in, you can flip over to Anaconda, run your install, and then just come back and run the cell. You don't have to close anything out. We'll paste this one in here real quick, where we have car equals pd dot read underscore csv and then the actual path. This path of course will vary depending on where you saved the file; you can see here I have my OneDrive, documents, simply learn, python, data analytic using python, slash car csv. It's quite a long file path. When we open that file up, what we get is a csv file, and we have the make, the model, the year, the engine fuel type, engine horsepower, cylinders and so on. This is just a comma separated file: think of it as a spreadsheet, where each line is a row of data and each comma-separated field is a column. As you can see right here, the first line has the make, model and so on, so it has a header with the column names. Now pandas does an excellent job of automatically pulling a lot of this in, so when you start using pandas you realize that you are already halfway done with getting your data in. I just love pandas for that reason. Numpy also has it, you can load a csv directly into numpy, but we're working with pandas. And this is where it really gets cool: I can come down here, and remember our print statement, we can actually get rid of it and just do car dot head, because the notebook is going to print that out. The head is going to print the top values of that data file
we just read in, and you can see right here it does a nice printout. It's all nice and inline because we're in Jupyter notebook; I can scroll back and forth and look at the different data. Just like we expected, we have our columns and it brought the header right in. One thing to note is the index: it automatically created an index, 0 1 2 3 4 and so on, and since we're just looking at the head we got zero one two three four. You can change this; you might want to just look at the top two. We can run that, and there's our top two BMWs. Another thing we can do is, instead of head, do tail and look at the last three values in that data file, and you can see right here it numbered them all the way up to 11,913. Oh my goodness, they put a lot of data in this file; I didn't even look to see how big the file was. So you can really easily get through and view the different data in here. When you're talking about big data you almost never just print out car. In fact, let's see what happens if we just run car: it's huge, so big that pandas automatically truncates it and just shows the head plus the tail. So we really don't want to look at the whole thing; I'm going to go back and stick with the head for displaying our data. There we go, there's the head of our data, which gives us a quick look at what's actually in there. I can zoom out so you get a better view, although we'll keep it zoomed in so you can see the code I'm working on. From the data standpoint we of course want to look at data types: what's going on with our data, what does it look like? When you're talking to your shareholders they like to see these nice easy-to-read tables that look like a spreadsheet, so it's a nice way of displaying pieces of the data. With the data types we're getting into the data science side of it: what are we working with? Well, we have make and model, we have an integer 64 for
the year, and engine fuel type is an object. If we go up here you can see that most of them are things like manual, rear wheel drive, so there might be a very limited number of values in there, and so forth. It's going to read each column as either a float64, an integer or an object. The next thing you're going to want to know is your columns, and since it loaded the columns automatically, we have here the make, the model, the year, the engine, the size, all the way up to the MSRP. Something you'll see come up a lot is that whenever you're in pandas and you type in dot values, it converts from the pandas object to a numpy array, and that's true of any of these, so you'll see a little switch in the way the data is actually stored. In this case we want car dot columns, and you have a total list of your car columns. Like any good data scientist, we want to start looking at an analytical summary of the data set, what's going on with our data, so we can start piecing it together. So we can do car dot describe, and we'll do include equals all. Describe is a nice pandas command for summarizing your data; if you're working with R this should start looking familiar. We come down here and you can see, for the make, the model, the year and so on: the count, how many unique values each one has, the top value of each one, which is the most common, and its frequency. Then the mean: clearly on some of these, it's an object, so it really can't tell you what the average is, you just get the top ones, but for the year, say, you get the average year. All this stuff comes down here: your standard deviation, your minimum value, your maximum value, what's at the lower quarter, the 50 percent mark, where that line is, and the 75 percent mark, with the top 25 percent going up to the max. Now this next part is just cool. This is what we always wanted computers to be back in
the 90s, instead of 5,000 lines of code to do this; well, maybe not 5,000.
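To make the inspection steps above concrete without the CSV file, here is a minimal sketch on a tiny hand-made data frame. Every value in it is made up for illustration, and the three column names are assumptions modeled on the headers mentioned in the demo; in the real notebook car comes from pd.read_csv on the author's file.

```python
import pandas as pd

# Tiny stand-in for the car data (all values are invented).
car = pd.DataFrame({
    "Make": ["BMW", "BMW", "Audi", "Ford"],
    "Year": [2011, 2011, 2015, 2017],
    "Engine HP": [335.0, 300.0, 220.0, None],
})

top_two = car.head(2)        # first two rows
last_three = car.tail(3)     # last three rows
print(top_two)
print(last_three)

print(car.dtypes)            # one dtype per column
print(car.columns)           # the column labels

as_array = car.values        # the same data as a numpy array

# include="all" summarizes text columns (unique, top, freq)
# alongside numeric ones (mean, std, min, quartiles, max).
summary = car.describe(include="all")
print(summary)
```

In the summary, statistics that do not apply to a column show up as NaN, which is why the text columns have no mean and the numeric ones have no top value.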

All right, I built my own plot library back in '95, and the amount of code for doing a simple plot was, I don't know, probably about 100 lines of code. This is being done in one line of code. We have our car, which is the pandas data frame we generated, and we have dot hist for histogram. Dot hist is a pandas method that draws with matplotlib, and seaborn, which sits on top of matplotlib, supplies the styling we set earlier. Then we can set the figure size, just so it fits nicely on the page. We do something simple like this, and you can see where it comes up and mentions the matplotlib subplots and everything, but we're looking at a histogram of all the different numeric columns in our data set. We have our engine cylinders, and that's always a good one, because you can see some of them had a null on there so they came out as zero, maybe one of them had a two-cylinder engine way back when, four is common, six a little less common, and then you see the eight-cylinder; the 12-cylinder engines, well, that's got to be a speedster or something. You can see right here it just breaks it down, so now you have how many cars with how many cylinders, how much horsepower and so on, and it does a nice job displaying it. If you're going into your demo it's really nice just to be able to type that in and boom, there it is, and you can see it all the way across. We might want to zero in and use something like a box plot, and this time we'll call seaborn, sns dot boxplot, and we're going to do vehicle size versus engine horsepower as x and y, and the data comes from car. If we run this we end up with a nice box plot. You see our midsize, compact and large, and you can see the variation. There's an outlier showing up on the compact, which must be a high-end sports car; a large car might have a couple engines, and again we have all these outliers, and then the spread on each of them. It's a very
powerful and quick way to zero in on one small piece of data and display it for people who need to have it reduced to something they can see, look at and understand. That's our seaborn box plot, our sns dot boxplot. If we back out and want a quick look at what they call pair plotting, we can run that, and you can see that with seaborn it just does all the work for you. It takes a moment to pull the data in and compile it, and once it does it creates a nice grid. In this grid, if you look at this one space here, and you might not be able to see the small label, it says engine horsepower: this is engine horsepower against the year the car was built. The same pair appears flipped on the other side, so everything to the right of the middle diagonal is just the mirror image of what's on the left. As you'd expect, the engine horsepower gets bigger and bigger as time goes on, so the later the year it was built, the more likely you are to have a high-horsepower engine. You can quickly look at trends with our pair plot, and look how fast that was: it took a moment to process, but right away I get a nice view of all this different information, which I can look at visually and see how things group. Now if I was doing a meeting I probably wouldn't show all the data. One of the things I've learned over the years is that people, myself included, love to show all our work. We're taught in school: show all your work, prove what you know. The CEO doesn't want to see a huge grid of graphs, I guarantee it. So what we want to do is drop the stuff they might not be interested in. I'm not really a car person; a guy in the back obviously is. So you have your engine fuel type, we're going to drop that; we're going to drop market category, vehicle style, popularity, number of doors and vehicle size. And we have the axis in here: if you remember from numpy, we have to include that axis
to make it clear what we're working on, and that's also true with pandas. Then we'll look at what it looks like from the head, and you can see that we dropped out those categories: now we have the make, model, year and so forth, and we took out the engine fuel type, market category, etc. This should look familiar to you now. When you start working with pandas, and I just love pandas for this reason, look how easy it is: it displays the data as a nice spreadsheet you can look at and view very easily. It's also the same kind of view you're going to get if you're working in Spark or PySpark, which is Python for Spark across big data. This is why pandas is so powerful. We may look at this and decide we don't like these column names, so we can go in here and rename the columns. It's a simple command: car equals car dot rename, with columns equals a mapping from engine horsepower to horsepower. This is just your standard Python dictionary, so it maps old names to new ones. Instead of the lengthy name engine horsepower we just want horsepower; we don't need to be told it's the engine's horsepower, and for engine cylinders we don't need to know that it's for the engine, because there's only one thing we're describing if we're talking about cars, so that becomes cylinders. We'll just run this, and again here's our car head, and you can see how that changed: we have model, year and horsepower versus m