Python and Data Analysis
Welcome to our Data Analysis with Python tutorial. My name is Santiago and I will be your instructor. This is a joint initiative between freeCodeCamp and RMOTR. In this tutorial we'll explore the capabilities of Python and the entire PyData stack to perform data analysis. We'll learn how to read data from multiple sources such as databases, CSV and Excel files, how to clean and transform it by applying statistical functions, and how to create beautiful visualizations. We'll show you all the important tools of the PyData stack: pandas, matplotlib, Seaborn and many others. This tutorial will be useful both for Python beginners who want to learn how to manage data with Python, and for traditional data analysts coming from Excel, Tableau, etc., who will learn how programming can power up their day-to-day analysis. So let's get started. Welcome to our Data Analysis with Python tutorial. My name is Santiago and I'm an instructor at RMOTR, an online data science academy. This tutorial is the result of a joint effort by RMOTR and freeCodeCamp, and it's totally free: it includes slides, Jupyter notebooks and coding exercises. Let me tell you a little bit more about RMOTR. We're an online, hands-on data science academy; we specialize in data science, including data analysis, programming and machine learning. We have a complete course catalog and we're adding more content every month, so if you're interested in learning data science or data analysis, check us out. As part of this joint effort between freeCodeCamp and RMOTR, you can get a 10% discount on your first month by using the following discount coupon. Let's quickly review the contents of this tutorial. In the description of this video we have included direct links to each section, so you can jump between them. This is the first section, and we are going to discuss what data analysis is. We'll also talk about data analysis with Python and why programming tools like Python, SQL and pandas are important. In the following section we'll show you a real example of data analysis using Python,
so you can see the power of it. We will not explain the tools in detail; it's just a quick demonstration for you to understand what this tutorial is about. The following sections will be the ones explaining each tool in detail. There are two more sections I want to especially point out. The first one is section number three, the Jupyter tutorial. It's not mandatory, and you can skip it if you already know how to use Jupyter notebooks. The last section, Python in Under 10 Minutes, is just a recap of Python; if you're coming from other languages, you might want to take it first. If that's the case, again, you can use the links in the video description to jump straight to it. All right, now let's define what data analysis is. I think the Wikipedia article summarizes it perfectly: the process of inspecting, cleansing, transforming and modeling data with the goal of discovering useful information, informing conclusions and supporting decision-making. Let's analyze this definition piece by piece. The first part of the process of data analysis is usually tedious: it starts by gathering the data, cleaning it and transforming it for further analysis. This is where Python and the PyData tools excel; we're going to be using pandas to read, clean and transform our data. Modeling data means adapting real-life scenarios to information systems, using inferential statistics to see if any pattern or model arises. For this we're going to be using the statistical analysis features of pandas, and visualizations from matplotlib and Seaborn. Once we have processed the data and created models out of it, we'll try to draw conclusions from it, finding interesting patterns or anomalies that might arise. The word "information" here is key: we're trying to transform data into information. Our data might be a huge list of all the purchases made in Walmart in the last year; the information will be something like "Pop-Tarts sell better on Tuesdays."
This is the final objective of data analysis: we need to provide evidence of our findings, create readable reports and dashboards, and aid other departments with the information we've gathered. Multiple actors will use your analysis: marketing, sales, accounting, executives, etc. They might each need to see a different view of the same information, different reports, or a different level of detail. What tools are available today for data analysis? We've broken them down into two main categories. Managed tools are closed products, tools you can buy and start using right out of the box; Excel is a good example, and Tableau and Looker are probably the most popular ones for data analysis. At the other extreme we have what we call programming languages, or open tools. These are not sold by an individual vendor; they are a combination of languages, open-source libraries and products. Python, R and Julia are the most popular ones in this category. Let's explore their advantages and disadvantages. The main advantage of closed tools like Tableau or Excel is that they are generally easy to learn: there is a company writing documentation, providing support and driving the development of the product. The biggest disadvantage is that the scope of the tool is limited; you can't cross its boundaries. In contrast, using Python and the universe of PyData tools gives you amazing flexibility. Do you need to read data from a closed API using secret-key authentication, for example? You can do it. Do you need to consume data directly from AWS Kinesis? You can do it. A programming language is the most powerful tool you can learn. Another important advantage is the general scope of a programming language: what happens if Tableau, for example, goes out of business? Or if you just get bored with it and feel like you need a career change? Learning how to process data using a programming language gives you freedom. The main disadvantage of a programming language is that it's not as simple to learn as a point-and-click tool: you need to learn the basics of coding first, and that takes time.
Why are we choosing Python for data analysis? Python is the best programming language to learn to code: it's simple, intuitive and readable, and it includes thousands of libraries to do virtually anything, from cryptography to IoT. Python is free and open source, which means there are thousands of very smart people looking at the internals of the language and its libraries. From Google to Bank of America, major institutions rely on Python every day, which makes it very hard for the language to go away. Finally, Python has a great open-source spirit: the community is amazing, the documentation is exhaustive, and there are a lot of free tutorials around. Check out conferences in your area; it's very likely that there is a local group of Python developers in your city. We couldn't talk about data analysis without mentioning R. R is also a great programming language; we prefer Python because it's easier to get started with and more general in the libraries and tools it includes. R has a huge library of statistical functions, and if you're in a highly technical discipline, you should check it out. Let's quickly review the data analysis process. The process starts by getting the data. Where is your data coming from? Usually it's in your own database, but it could also come from files stored in a different format, or a web API. Once you've collected the data, you will need to clean it. If the source of the data is your own database, then it's probably in decent shape; if you're using more extreme sources like web scraping, the process will be more tedious. With your data clean, you'll now need to rearrange and reshape it for better analysis: transforming fields, merging tables, combining data from multiple sources, etc. The objective of this process is to get the data ready for the next step. The process of analysis involves extracting patterns from the data that is now clean and in shape,
capturing trends or anomalies. Statistical analysis will be fundamental in this process. Finally, it's time to do something with the analysis. If this were a data science project, we might be ready to implement machine learning models; if we focus strictly on data analysis, we'll probably need to build reports, communicate our results and support decision-making. Let's finish by saying that in real life this process isn't so linear: we're usually jumping back and forth between the steps, and it looks more like a cycle than a straight line. What is the difference between data analysis and data science? The boundaries between them are not very clear. The main differences are that data scientists usually have more programming and math skills, which they apply in machine learning and ETL processes, while data analysts usually have better communication skills, creating better reports with stronger storytelling abilities. By the way, the diagram you're seeing right here is available in the notes, in case you want to check out the source code. Let's explore the Python and PyData ecosystem: all the tools and libraries we will be using. The most important libraries we'll use are pandas for data analysis, and matplotlib and Seaborn for visualizations, but the ecosystem is large and there are many useful libraries for specific use cases. How do Python data analysts think? If you're coming from a traditional data analysis background, using tools like Excel and Tableau, you're probably used to having a constant visual reference of your data. All these tools are point-and-click. This works great for a small amount of data, but it's less useful when the number of records grows: it's just impossible for humans to visually reference too much data, and the processing gets incredibly slow. In contrast, when we work with Python, we don't have a constant visual reference of the data we're working with.
We know it's there, we know what it looks like, and we know its main statistical properties, but we're not constantly looking at it. This allows us to work with millions of records incredibly fast. It also means you can move your data analysis processes from one computer to another, for example to the cloud, without much overhead. And finally, why would you want to add Python to your data analysis skills? Aside from the advantages of freedom and power, there is another important reason: according to PayScale, data analysts who know Python and SQL are better paid than the ones who don't know how to use programming tools. So that's it, let's get started. In our following section we'll show you a real-world example of data analysis with Python; we want you to see right away what you will be able to do after this tutorial. We're going to start this tutorial by working with a real example of data analysis and data processing with Python. We're not going to get into the details yet: the following sections will explain what each one of the tools does, the best way to apply and combine them, and the details of each. This is just for you to have a quick, high-level reference of the day-to-day processes of data analysts, data managers and data scientists using Python. The first dataset we're going to use is a CSV file that has this form; you can find it right here, under the data directory. I have also transformed it into a spreadsheet, so we can look at it from a more visual perspective. But remember, as we said in the introduction, as data analysts we are not constantly looking at the data; we don't have a constant visual reference. We are driven by an understanding of the data that sits in the back of our head: we understand what the data looks like and what shape it has, and that's what guides our analysis.
So the first thing we're going to do is read this CSV into Python, and you can see how simple it is: just one line of code gets the CSV read into Python. Then we take a quick look, and this is what the DataFrame we have created looks like. "DataFrame" is a special data structure we use in pandas, and again, we're going to see it in detail in the pandas part of this tutorial. The DataFrame is pretty much the CSV representation, but with a few more enforced rules; for example, each column has a strict data type, and we will not be able to change it arbitrarily. It's a better way to conduct our analysis. The shape of our DataFrame tells us how many rows and how many columns we have. You can imagine that with this number of rows it's not so simple to follow a visual representation of the data; at 100,000 rows, scrolling through it is pretty much endless. The way we work is that, immediately after we load our data, we want some sort of reference for the shape and properties of the data we're working with. For that, we first call info() to quickly understand the columns. In this case we have Date, which is a datetime field; Day, Month and Year, which are just complementary to Date; Customer_Age, which is an integer, which makes sense; Age_Group; Customer_Gender; and so on. We get an idea of the entire dataset: we know the columns we have, and we also know how large it is, and we don't care what's in between. We will probably be cleaning it, but we don't need to look at it row by row with our very limited eyes; this gives us a better understanding of the structure of our data. And going one step further, we can also get a better understanding of the statistical properties of the DataFrame with the describe() method.
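Before moving on to describe(), here is a minimal sketch of those first steps; the file name and column names are assumptions based on the dataset described above:

```python
import pandas as pd

# Read the CSV into a DataFrame: one line of code
sales = pd.read_csv('data/sales_data.csv', parse_dates=['Date'])

sales.shape   # (rows, columns), e.g. roughly (100000, 18)
sales.info()  # column names, dtypes and non-null counts
sales.head()  # first five rows, just as a quick reference
```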
For all the numeric fields, describe() gives me an idea of their statistical properties. For example, I know that the average age in this dataset, which is sales data, is 35 years old; I also know that the maximum age is 87, and the minimum is 17. And again, I can start building my understanding of the statistical properties of the data. In this case the median of the age is very close to the mean, and that's telling me something. The same thing goes for each one of the columns we're using. For example, we have a negative profit here, and very large values there; are these correct, or is there maybe a mistake? By having a quick statistical view of our data, we can drive the process of analysis without the need to constantly look at all the rows we have. It's a more general, holistic overview. So we're going to start with unit cost; let's see what it looks like. We're going to call describe() on just Unit_Cost, which is pretty much what we had right here: in the previous line we did it for the entire DataFrame, and in this case we're focusing only on the Unit_Cost column. The mean, the median, all the fields we pretty much know already, and we're going to quickly plot them. It's the same tool, pandas, that is doing the plotting, but it's using matplotlib on top; the visualization is created with matplotlib, even though we're calling it directly from pandas. And again, don't worry, this is all explained in the pandas lessons. So this is Unit_Cost: this is the box plot we have just created, with the whiskers, the box showing the first and third quartiles, and the median, and then we see all the outliers we have right here. We see that a product that is around $500 is considered to be an outlier.
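A sketch of that column-level summary and box plot, under the same assumed column names:

```python
# Summary statistics for a single column
sales['Unit_Cost'].describe()   # count, mean, std, quartiles, min/max
sales['Unit_Cost'].mean()
sales['Unit_Cost'].median()

# Box plot drawn through pandas, rendered by matplotlib under the hood
sales['Unit_Cost'].plot(kind='box', vert=False, figsize=(12, 4))
```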
And the same thing if we do a density plot; this is what it looks like. We're going to draw two more charts, in which we point out the mean and the median on the distribution chart, and we're going to do a quick histogram of the costs of our products. Moving forward, we're going to talk about age groups, based on the age of the customer. At any moment we can always do a quick sort to get a reference. We know that the age of the customer is expressed in actual years, but the customers have also been categorized into four age groups: seniors, youth, young adults and adults. These categories were created to better understand the groups, and we can count them with value_counts(). From that we can quickly get a pie chart, or a bar chart, as you can see right here. We're doing an analysis of our data, and we see that adults are the largest group, at least in our data. Moving forward, what about a correlation analysis? What is the correlation between some of our properties? We will probably have a high correlation, for example, between profit and unit cost, or order quantity; that's kind of expected, but it's something we can check right here. This is a correlation matrix. The diagonal is blue, where the correlation equals one, so high positive correlation is shown in blue and strong negative correlation in dark red. We see that profit has a lot of positive correlation with unit cost and unit price, and, interestingly, a negative correlation with order quantity; we would want to dig deeper into that. Of course, profit has a high positive correlation with revenue. Again, it's just a quick correlation analysis.
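A sketch of the categorical counts and the correlation matrix; the column names, color map and exact plot styling are assumptions:

```python
import matplotlib.pyplot as plt

# How many sales fall in each age group?
sales['Age_Group'].value_counts()                    # adults are the largest group
sales['Age_Group'].value_counts().plot(kind='bar')   # or kind='pie'

# Correlation matrix of the numeric columns
corr = sales.corr(numeric_only=True)      # numeric_only needs pandas >= 1.5
plt.matshow(corr, cmap='RdBu')            # blue ~ +1, dark red ~ -1
plt.colorbar()
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
```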
We can also do a quick scatter plot to analyze customer age against revenue, to see if there is any correlation there, and the same thing for revenue and profit. That one is obvious: we can quickly draw a diagonal here, so there is a strong linear dependency between these variables. Then a few more box plots, in this case showing the profit per age group, so we can see how the profit changes depending on the customer's age, and a grid of box plots for customer age, unit cost, and a few other things. Moving forward, something we can quickly do when we're working with Python, and especially with pandas, is to reshape our data or derive new columns from other columns. This is pretty common in Excel: we can create a "revenue per age" column. If you were doing it here in Google Sheets, you would create a revenue-per-age column and type something like "= revenue / age" (I don't remember if that's the exact formula, it's just for you to have a reference), and then extend that formula to the whole column. There we go; it's processing, and with 100,000 rows you can see how slow it is. Let's compare that to the way Python works: I'm going to execute this line, and it was instant, extremely fast, and it was all calculated, with the same results as expected. And we can quickly plot it, both as a density plot and as a histogram, as you can see right there. It's not that this revenue-per-age value is going to be relevant; it's just to show you the capabilities of what we can do. Next we're going to create another new column, calculated cost, which is the quantity of the order times the unit cost: an extremely simple formula, and a very fast process.
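A sketch of deriving those columns vectorially; the column names are assumptions:

```python
# New columns are computed for all 100,000 rows at once (vectorized)
sales['Revenue_per_Age'] = sales['Revenue'] / sales['Customer_Age']
sales['Calculated_Cost'] = sales['Order_Quantity'] * sales['Unit_Cost']

# Quick look at the new column's distribution
sales['Revenue_per_Age'].plot(kind='density')
sales['Revenue_per_Age'].plot(kind='hist')
```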
And right here we're going to check how many rows have a value different from what was provided by the Cost column. What we're doing is quickly checking whether the cost provided by the dataset ever fails to align with the cost we are calculating: were there any mistakes made by the original system, or by the people doing data entry? If this new column is different from Cost, we want to know about it, and it turns out that doesn't happen. Then a quick regression plot; in this case it's very obvious that there is a linear dependency between calculated cost and profit. More formulas: in this case, cost plus profit. There is no difference between the revenue and the calculated revenue we get, so it all makes sense. We're going to do a quick histogram of the revenue. We can also, for example, add 3% to all the prices we're using. Say we need to increase prices; how are we going to do that? Well, it's very simple with Python: we just multiply everything by 1.03, and now all the prices have changed. What else? We can do quick filtering. Let's get all the sales from the state of Kentucky; these are all of them. We can get the average revenue of the sales for a given age group. All these filtering options are extremely simple to express in Python. In this case we say: give me all the sales from this age group and also from this country, and we get the average revenue of the group we're selecting. And to modify the data, we can make a few quick changes; in this case, we're going to take all the sales from one country and increase their revenue by a factor of 1.1.
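Here is a hedged sketch of those checks, global updates and boolean-mask filters; the column names and the literal values are assumptions:

```python
# Sanity check: does the provided Cost ever disagree with our calculation?
(sales['Calculated_Cost'] != sales['Cost']).sum()   # 0 means they always agree

# Raise every price by 3%
sales['Unit_Price'] *= 1.03

# Boolean-mask filtering
sales.loc[sales['State'] == 'Kentucky']
sales.loc[(sales['Age_Group'] == 'Adults (35-64)') &
          (sales['Country'] == 'United States'), 'Revenue'].mean()

# Conditional modification: scale revenue for one country by 1.1
sales.loc[sales['Country'] == 'France', 'Revenue'] *= 1.1
```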
I'm choosing that number arbitrarily; it's just to show you how it works. So far, so good. Again, we've done a couple of things, and you don't need to know the details yet: we will go through them in the NumPy and pandas sections of this tutorial. This is just for you to have a quick reference. There are exercises associated with these lectures, so if you want to pause right now and get into the exercises, that's going to be very helpful. We're going to move forward now with the second lecture, in which we will be using a database, the Sakila database, and we're going to be reading data from a database instead of from a CSV file, as we did before. Reading data from a SQL database is as simple as reading it from an Excel or CSV file, as we were doing in our previous example, and once you've read the data, which is what we're going to do now, the process is the same. What we have right here is a SQL query; if you don't know about SQL, you can check our courses or other courses online. Basically, we're pulling the data from the database. This is one of the advantages of Python: there are connectors for pretty much every database provider out there, Oracle, Postgres, MySQL, SQL Server, etc. In this particular example we're going to be using MySQL. Once you construct the query and pull the data from the database, the process is the same: we have just converted this outside data into a DataFrame that we can use with our Python skills. The first step, as usual, is to check the shape, info and description of our DataFrame. We want to understand its structure: we want to know how many rows we have (16,000), a little bit more about our columns, how many records we have for each one of them, and the type of each of these columns.
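A hedged sketch of pulling a query into a DataFrame. The Sakila sample database really has film, inventory and rental tables, but the connection string and this exact query are assumptions; the query used in the video may differ:

```python
import pandas as pd
from sqlalchemy import create_engine

# Adjust driver/user/password/host for your own MySQL setup
engine = create_engine('mysql+pymysql://user:password@localhost/sakila')

query = """
SELECT f.title, f.rental_rate, f.replacement_cost, f.rating
FROM rental r
JOIN inventory i ON r.inventory_id = i.inventory_id
JOIN film f ON i.film_id = f.film_id;
"""

df = pd.read_sql(query, engine)
df.shape    # about 16,000 rows in this example
df.info()
```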
And we also want a better statistical understanding of our data, so we do a quick describe() and get more details about it. If we want to focus on individual columns, we can do that too; in this case we're going to focus on the film rental rate, pretty much how much you pay to rent a film. We're going to look at the kind of distribution we have. You can barely call it a distribution; it's pretty much a categorical field in this case: the rentals are divided into three main price points, 0.99, 2.99 and 4.99. That's why this box plot looks so perfect, like nothing you'll ever see in real life: it's just those three prices. Moving forward, we can also very quickly do a categorical analysis, understanding the distribution of rentals between cities. We have two cities, and it's pretty much even, as you can see right here. Creating new columns and reshaping the data for further analysis is also relatively simple. In this case, we're going to analyze the return on rentals: which films are more profitable for the company? We divide the rental rate, how much we charge, by the replacement cost, how much it costs us to acquire the film. We can see the distribution of that: most rentals are here at the beginning, and then we have more profitable rentals, making up to 60% above the cost. And we can quickly check the mean and the median to get an idea of all that.
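A sketch of that derived return column; the column names follow the Sakila schema, and the new column's name is an assumption:

```python
# Return on each rental: what we charge relative to what the film cost us
df['rental_gain_return'] = df['rental_rate'] / df['replacement_cost']

df['rental_gain_return'].plot(kind='density')
df['rental_gain_return'].mean(), df['rental_gain_return'].median()
```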
Finally, selection and indexing. If you want to zoom in on the data to get a better understanding, you start filtering. In this case we filter by customer, but you could do it per city, per state, per film, per price category, etc. It's very simple to filter and zoom in on one particular characteristic of your data, so you can perform a more detailed analysis. In this case, we have all the films rented by customers with the last name Hansen, which doesn't mean it's the same person; but again, it's very simple to filter that. Here we can very quickly see which films have the highest replacement cost: basically, we isolate those films. And we can also see, just for you to have an idea, all the films in the category PG or PG-13. It's very simple to filter that data, as the sketch below shows.
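A sketch of those selections; the customer_lastname column name is an assumption about how the query aliased it:

```python
# Zoom in on one slice of the data
df.loc[df['customer_lastname'] == 'HANSEN']

# Films with the highest replacement cost
df.loc[df['replacement_cost'] == df['replacement_cost'].max(), 'title'].unique()

# Films rated PG or PG-13
df.loc[df['rating'].isin(['PG', 'PG-13'])]
```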
So this is the process we usually follow: we import the data, we reshape it somehow, we create columns. There is also an important process of cleaning; we're not highlighting that part here, but we'll talk about it later in the tutorial. There's cleaning, then reshaping, creating new columns, combining data and creating visualizations. This is the process we're following here with our Python skills, but there is a ton more to it, as you might imagine, from creating reports to running machine learning processes, creating linear regressions, etc. For now, this is just a quick overview of the process we follow. Starting now, we're going to move forward with the details of each one of the individual tools: we're going to talk about Jupyter notebooks, NumPy, pandas, matplotlib, Seaborn, etc. The first thing we're going to see is this whole thing I've been using, the Jupyter Notebook. If you don't have experience with it, I want you to get an idea of how it works, and then we'll move on to the individual tools, NumPy, pandas, etc. Remember, there are exercises associated with this particular lecture too, so you can always go back and work on them once you have a better understanding of the tools we're using. Before we jump into the actual data analysis content, and we start talking about Python, pandas and all the tools we're going to use to import files, read data from databases, etc., I want to show you the environment we work with. It's our primary environment, the tool we use 99% of the time: the Jupyter Notebook. I'm going to use a few different terms here, and I'll mostly refer to it as "Jupyter Notebook", but as you're going to see in this part of our tutorial, Jupyter is actually a whole ecosystem of tools, and a very interesting project. Jupyter is a free and open-source ecosystem of multiple tools. We're going to talk first about what a Jupyter Notebook is, the thing you're seeing right here, which I'll show you live in a second. We're also going to talk about JupyterLab, which is the evolution of the regular Jupyter Notebook. I think this could be familiar to you already, and the usual question is: what's the difference between Jupyter Notebook and JupyterLab? The difference is that JupyterLab is a nicer interface on top of Jupyter notebooks. It's not just the plain notebook (that's what I'm scrolling through right now); it adds a tree view, git tools, a command palette and multiple other things, and you can open some files with a nice preview, etc. So Jupyter Notebook and JupyterLab are similar; JupyterLab is, again, the evolution of the Jupyter Notebook, and that's what we're using. Jupyter is a free and open-source project, so anybody can install it, anybody can download it; it's very simple to get it set up on your local computer. In this case, we're using something called Notebooks AI, a project that provides a Jupyter environment for free in the cloud.
That means you don't need to install things locally, you don't need to keep things in sync on your own hard drive, and you don't need to back anything up, for example, because it's a service: it all works in the cloud. With that said, I want to tell you that we have compiled a very quick list of everything we're going to cover in this part of the tutorial. It's a thread with multiple hints on how to use Jupyter notebooks, so after the video, if you forget some of these concepts, you can always go back to it; it's a quick reference for you to have. So let's get started. Why do we use a Jupyter Notebook? Because it's an interactive, real-time environment to explore our data and do our data analysis. It's a tool where you fire commands and it immediately responds with something back; a very interactive tool for working with data analysis. And this is the main difference from some other tools like Excel, Tableau, etc.: we are not constantly looking at the data, there is no visual reference like you have in Excel. In Excel you're constantly looking at the data: you have it in front of you, there are 100,000 cells, and you can scroll and see them. The problem is that that's not scalable; nobody can hold 100,000 rows in their mind, and we will always forget something. The way we work with Python in data analysis is by always having a reference of what our data looks like in the back of our head, without constantly looking at it. We're like the operator in The Matrix, the one who watches the screens and directs people in and out: we're basically asking questions of the data, and keeping a picture in our mind of how it all fits together. We're not constantly looking at it; we just keep a reference in the back of our heads of what our data looks like.
So that's why this tool is very useful. This tool is also useful if you're just training your Python skills, or your programming skills in general, because what you're seeing is just a regular Python interpreter. In this case I can execute some code, actually one plus three, there we go, and the result is four. So this is a fully featured Python interpreter. The good thing is that, again, it responds pretty much immediately: I enter a command and I immediately get a response. I can do a print here, "hello world", and I immediately get a response; I can do "hello world" times three. Again, it's a fully featured Python interpreter, but it's not being accessed from a terminal. Mind you, that's another good thing about JupyterLab: it does have a terminal, so I can run python, type two times three, and get an answer back. But a plain terminal is not convenient for working with our data; we need something a little more interactive, where we can also mix in documents, and that's going to be the advantage of a Jupyter notebook. So what's the way we work with Jupyter notebooks? There are a few very important concepts that we are going to follow. A Jupyter notebook is just a sequence of multiple cells; everything is a cell. As you can see, when I click on a cell, even if it doesn't look like a cell, it is one: this blue marker right here is pretty much following me, because by clicking on the cell I'm selecting that particular cell. Everything happens within a cell. If I want to execute some code, I can do, again, one plus five, and I get a result back. That's how it works: I'm creating a cell, I'm deleting a cell, I create another cell again.
So everything happens within a cell, and I'm going to tell you how to add cells, how to remove them, how to execute code, etc. The interesting thing about a cell is that it can either be code, in Python or any other programming language you're using (in this case it's a Python data analysis course, so it's Python code, as we were doing before: one plus three is Python code), or it can be what we call Markdown, a text format that gets rendered as a sort of HTML at the output. This is what the source code of the Markdown looks like. In Markdown, any line that starts with a pound sign is a title: one pound sign gives you the biggest title you can have, and you keep adding pound signs to reduce the size, so in this case this is a level-three title. Then you can have, for example, a quote, bold text, italics, a link. Let me actually copy the cell and open the source code; there we go, this is a link, and it's rendered as a link. So what Markdown is, is a text formatting tool, or protocol, we could say: we follow some rules in our text, and Markdown knows how to interpret them and return a formatted document. For example, here we have a green divider, which is a picture, and we know it's a picture because it starts with an exclamation mark; that's what you're seeing right here. So again, a cell can be either Python code or Markdown. Markdown is an entire thing on its own; you can find plenty of free tutorials online, and it's fairly simple to get started with. It's also very important, because when you're creating your reports you want them to look pretty, and you can use Markdown for that; a quick sketch of the syntax follows.
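A minimal sketch of the Markdown rules just described (the divider image path is an assumption):

```markdown
# Level-1 title
### Level-3 title

> This is a quote

**This is bold**, *this is italics*, and [this is a link](https://jupyter.org).

![green-divider](img/green-divider.png)
```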
And as we're going to see later, you can export these notebooks and they will generate PDFs, so this whole thing can become a PDF or an HTML page. After you're done with your data analysis, you can hand over a PDF report to whoever asked for the analysis, which is pretty neat. So, moving forward: again, any cell is going to be either Markdown or code, like this one right here, and you can switch the modes. You can say this cell is code, or actually, let's make it Markdown. So right now, if it has code in it, it doesn't matter; it's not executing anything, because the cell is interpreted as Markdown. Now I've switched it back to code, and now it works again, as I said. A cell can also be "raw", but to be honest we don't use raw very often. So you have this general cell type: for the cell you're using, what type is it? Is it code, is it Markdown? You can switch it with this selector right here. Now, a few more things that I have to tell you right away so you can start internalizing them. It's going to take some time to get used to, but once you do, you're going to move very fast in your data analysis with Python and Jupyter notebooks. The first thing is, as you're seeing right here, every cell is given an execution number. Cells will be moved around, you will be moving them around, but you will always know which one executed before another one, because every execution you run is assigned an execution number. In this case, this is the seventh time I have executed code; if I execute code again, for example two times two, this is the eighth time I've executed code. And if I move this thing right here, and you're reading the notebook top-down, you will not be fooled: the cell was moved, the structure of the notebook changed, but you can tell this cell was executed after this other one, because this one is eight
and this one is seven. So the execution order is always preserved; that's an important thing. Something else: you've seen me change the structure and do things with the notebook without using any menu, and that's because I know the keyboard shortcuts to run most of these commands. For example, how can I add a new cell? Here I have a Markdown cell and a code cell; if I need a cell before this one, the command I'm going to issue is the letter A. I just type A, and there is a new cell created. How can I delete the cell? By hitting the D key two times. And again, this is all in the reference we built. You can press A to create a new cell above, and you can press B to create a new cell, as we call it, below. So let me put something here as a reference; I'm going to press the letter B, and it's going to create a cell below the currently selected one. The selection here is the blue one. Let me delete this one; I hit B, and again it creates a cell below the previously selected one. If I hit A, it creates a cell above the previously selected one. Those are the mnemonics of cell creation: A for above, B for below. Something else, and it's very important: when I'm on this cell and I hit the letter A, literally just the letter A on my keyboard, no Control, no Command, just A, it creates a new cell, and it doesn't type an "a" inside the document. But right here, if I type A, it adds an actual "a" character to the cell. Why didn't that happen before? You're going to notice that when I change to command mode in a second, the content of the cell is grayed out, so when I press the letter A it actually creates a cell, and it's not adding content to the cell itself.
If I go back again to the other mode (I'm going to give you a better explanation in a second) and I type anything, in this case "a", it's actually appended to the text within the cell. So this is my introduction to cell modes, and this is very important: the Jupyter Notebook is a mode-based editor. There are other mode-based editors, for example vim or vi, where the behavior of your keystrokes changes depending on the mode that is currently active. In this case, I am in edit mode, because any character I type will be appended to the cell: a, b, c, d, etc. If I switch out of editing mode to what we're going to call command mode, the cell is grayed out, and any key I hit will do something different, associated with that key: A creates a new cell above, B creates a new cell below, double D deletes the cell. That's one of the most important things to understand about working with Jupyter notebooks: the mode you're currently working in. And there are only two modes, so it's fairly simple. This is command mode, and we recognize command mode because the cell is grayed out. When we're in edit mode, there is a regular prompt, as you saw before, and the cell is actually subject to editing; that's the way we can tell. How are you going to switch between modes? In this case I'm in edit mode. If I'm using my mouse, I can click outside the cell, and I get out of edit mode into command mode; if I click inside, I'm going back to edit mode. But let me tell you something right away: we don't like to use our mouse. We don't like to point and click, because that's very slow; we like to use our keyboard, because we move much faster with it. So how are you going to switch from edit mode back to command mode? That's the Escape key: hitting Escape in edit mode switches you out of it, into command mode.
And if you actually want to make modifications to the cell, basically to get into edit mode, you're going to hit the Return key; that gets you into edit mode again. So we have tackled multiple things already. Again, in Jupyter notebooks we're going to use Python code to interact with our data very quickly; we need a real-time, "I ask, you answer" type of editor, and that's what the Jupyter Notebook is. The Jupyter Notebook has these two modes, edit mode and command mode. And then the cells, which are pretty much everything, are the fundamental part of the notebook; a cell can have two types, either code or Markdown. Now I'm going to start showing you more features, the most important commands, and of course what the keyboard shortcuts for those commands are, so you can move freely and work with Jupyter notebooks in the most efficient way. So let's get started. First of all, among the most important commands is moving around. It's very simple to navigate: just use your arrow keys, up and down, and you're going to move around in your notebook. If you want to switch the cell type, going from Markdown to code and back, you can use this dropdown, or you can press a specific key. For Markdown, you hit the M key; that makes it Markdown. For Python code, you hit the Y key; that makes it code. So M and Y switch you back and forth. Keep an eye on the selector: as I hit Y, M, Y, M, it switches from code to Markdown and back. What else? How can you execute code? Once you've typed your code and you want to execute it, there are two types of executions you can run.
The first one keeps the selection in place: the currently selected, active cell stays where it is. That's done by keeping the Ctrl key pressed and hitting Return; it runs the code in the cell, and the prompt, the currently selected cell, remains the same. I'm running this thing a couple of times already, and the selection, the currently highlighted cell, stays the same. I can change that by using Shift+Return: I keep the Shift key pressed, hit Return, and it executes the code but immediately moves the prompt, the currently selected cell, to the following one. That's useful when you have multiple cells you want to execute one after the other: you keep hitting Shift+Return, and it keeps you moving from top to bottom. All right, so Ctrl+Return or Shift+Return: the execution is the same; the difference is just what happens to the currently selected cell. We already saw how to create cells: with the A key we create a cell above, with the B key we create a cell below. To delete a cell, you hit the D key two times, one after the other, very quickly: D, D deletes the cell. What happens if you made a mistake and you want to undo the previously issued command? The mnemonic here is Ctrl+Z; but it's only the mnemonic, not the command. You only need to press the Z key, no Ctrl, and it will undo whatever you did with your previous command. All right: A, B, double D for deletion, and Z to undo. All the commands we're seeing have a correspondence in this toolbar, or in the command palette. For example, right here I could run this code by pressing this play button; you see the execution number is changing.
There are multiple ones, and you can search them right here if you don't remember them. The neat thing about it is that you also get the shortcuts to issue the same commands. Let's say you don't remember how to "execute and stay in the same cell", or "execute and move": you can search for "run", and you can see the name of each command and its actual shortcut right there. For at least your first week or month working with Jupyter notebooks, you will usually need to go back to these commands and try to remember the quick shortcuts; with time and practice, they will just come naturally. Moving forward, we have a few other commands. There is something to cut and paste a cell somewhere else: X cuts it (you can also use the scissors icon here), and to paste it, you can use this button, or just press the V key; V pastes it wherever you're currently standing. So I cut it, removing it from here, and I paste it below there. You can also copy instead of cutting: press the C key to copy, and then choose where you want to paste it. In this case we have duplicated the same cell, and notice something interesting here: the execution count remains the same. Again, there is a unique identifier for your executions, which means you know when and where something was executed. Moving forward, we're going to use some code here; we're going to import some tools, so you can see some characteristics and advantages of Jupyter notebooks, and why we use them so often compared to, for example, the regular Python terminal. One very important thing is visualizations. As data analysts, we're constantly taking data and expressing it through images, or sometimes animations, but most commonly images. The main library we use in Python is matplotlib.
And matplotlib is a first-class citizen in Jupyter notebooks, which means that figures from matplotlib will just show up directly in your notebook, without the need to do anything crazy. Can you imagine showing this beautiful picture in a terminal? That's very hard, of course. So again, that's one of the main advantages of a Jupyter Notebook. Moving forward, what we're going to do first is get some data from a public API. There is this Cryptowatch service, which basically has crypto information, Bitcoin, Ether, etc., and you can check the docs; we can actually open them. It gives you market data. You can check how to get, in this case, BTC (Bitcoin) to EUR; let's see if we can change it to the USD price. There we go: this is the current price of Bitcoin. And we're actually going to use the markets endpoint: we have Kraken, BTC/USD, and we'll issue the same query we're going to use in code, which is open-high-low-close, OHLC. Don't worry, this looks ugly, but it's actually what we'll be using: there's a list of results, one for each "candle", as we call them, and for each we get the open price, close price, high price and low price. So we're going to issue these requests over the internet to the Cryptowatch API to get information about Bitcoin and do some analysis with it, and you can do the same for Ether or other cryptocurrencies. The function we're defining is get_historic_price. It's a very simple function that uses pandas, one of the most important tools we'll be using in this course, and the requests library, which is also a very famous Python library. And what we're going to do here is get the Bitcoin and Ether prices for an entire week.
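A hedged sketch of what such a helper might look like. The Cryptowatch OHLC endpoint is real (though the service has since been retired), but the exact function in the video may differ, and the column names here are assumptions:

```python
import requests
import pandas as pd

def get_historic_price(symbol, exchange='bitstamp', after='2020-02-25'):
    """Fetch hourly OHLC candles from the Cryptowatch public API."""
    url = f'https://api.cryptowat.ch/markets/{exchange}/{symbol}usd/ohlc'
    resp = requests.get(url, params={
        'periods': '3600',  # candle size in seconds: 3600 = 1 hour
        'after': str(int(pd.Timestamp(after).timestamp())),
    })
    resp.raise_for_status()
    data = resp.json()['result']['3600']
    df = pd.DataFrame(data, columns=[
        'CloseTime', 'OpenPrice', 'HighPrice', 'LowPrice',
        'ClosePrice', 'Volume', 'NA'])
    df['CloseTime'] = pd.to_datetime(df['CloseTime'], unit='s')
    return df.set_index('CloseTime')

btc = get_historic_price('btc', 'bitstamp', after='2020-02-25')
eth = get_historic_price('eth', 'bitstamp', after='2020-02-25')
```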
Right, so from February 25th up to today, depending on when I'm shooting this video. And we get a quick reference of the prices: open, high, low, close. In this case we have that information per hour. This is something you can actually change in the request you're making to the API: you can choose the candle size, and in this case we're keeping it per hour. So we have, by the hour, information about Bitcoin in this particular market, which is Bitstamp: we have this day and this hour, and the open, close, highest and lowest prices, plus the volume that was traded within that time period. And we immediately plot the price. We see that in this period, which is actually a few days, an entire week, the price dropped from $9,600 to below $9,000; it was a pretty significant drop. Let's see how Ether performed: we have all the records here, and how it moved. This is what I keep telling you: when you're doing data analysis with a programming tool like Python or R, you're not constantly looking at the data. What I'm showing you right here are just the first five records. How many do we actually have? Let's check: 169 records. And this is per hour, so if we divide 169 hours by 24, we get seven days; we have seven days of data, 169 records, and then a little more information, which I'll get to in a second. With 169 records this is, to be honest, something you could be seeing in a spreadsheet, but I want you to get the concept here: we're not just looking at our data, we have it in our brain. We know what shape it has, we know how many records it has, we know the standard deviation, the mean, the median of the close price. So we have information about our data.
It's sitting in the back of our brain, but we're not looking at it. And this is a very simple example, with only 169 records; in real life we're dealing with millions of records, so it's impossible to see them all. Have you ever tried scrolling through millions of records in an Excel spreadsheet? It's crazy; it's not possible, it's just unusable. So that's, again, the way we work in data analysis with Python, R and other tools: we don't constantly keep an eye on the data. We know the shape of it, and we just take quick references: show me the first five records, show me the last five records, show me this chunk down there. But that's it. So again, these are the visualizations we're creating in Jupyter notebooks; it's just very simple to get the plot done right there. We're also going to see a few other pretty neat things in Jupyter notebooks. The first one is that we can use another library, which is called Bokeh, and the difference is that Bokeh produces charts that are interactive. I'm moving it right here; it uses JavaScript, and it's interactive. Look back at what we had before: that was a static chart, just a PNG. You can export it as a PNG, but there is nothing else you can do with it. A Bokeh chart, in contrast, is dynamically generated and interactive: I can zoom in on a piece of data, I can move it around, I can do whatever I want with it, and I can refresh and reset it to whatever it was. So if you're exploring your data dynamically in your analysis, Bokeh is a great tool, because you can zoom in.
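A minimal sketch of an interactive Bokeh line chart in a notebook, assuming the btc DataFrame from the sketch above (recent Bokeh uses width/height; older versions call these plot_width/plot_height):

```python
from bokeh.plotting import figure, show
from bokeh.io import output_notebook

output_notebook()  # render Bokeh charts inline in the notebook

p = figure(x_axis_type='datetime', title='Bitcoin close price',
           width=900, height=300)
p.line(btc.index, btc['ClosePrice'], line_width=2)
show(p)  # interactive: pan, zoom and reset from the chart toolbar
```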
For example, what's going on here? Let's look at this piece. If we're working on a mean-reverting strategy, we see a high here, a low there, and the mean is going to be around here, so we see some mean reversion in there; it's very interesting. If, instead, you need to export a PDF, or a huge HTML file, then static images are probably better. So that's the difference between them. To be honest, matplotlib is a lot more popular than Bokeh; we use matplotlib a lot more, because there are a few other tools, like Seaborn, that make it very easy to access and use. What else? Jupyter notebooks work very well with all the usual file formats: CSVs, XML, Excel files, etc. That's also part of the value of JupyterLab: JupyterLab can immediately interpret and open CSV files and, with some extensions, XLS files, XML files and JSON files; it has a very nice editor and tree view for JSON. So the JupyterLab environment, combined with Python and Jupyter notebooks, will give you a good idea of Jupyter in general. In this case, we have just saved (I'm not going to execute it here, but you can try it out, running what we have just done) this crypto data as an Excel spreadsheet. You can just click on it here and basically download it, open it, and see what it has. There we go; let me reduce the size of this thing. You can see that we have just exported two sheets, in this case Bitcoin and Ether, with the data we had in our previous notebook.
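A sketch of that Excel export, assuming the btc and eth DataFrames from before:

```python
# Write both DataFrames to one Excel workbook, one sheet each
with pd.ExcelWriter('cryptos.xlsx') as writer:
    btc.to_excel(writer, sheet_name='Bitcoin')
    eth.to_excel(writer, sheet_name='Ether')
```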
So that's, again, the combination of Python, Jupyter notebooks and JupyterLab; these tools just work very well together. We're going to keep moving forward in this tutorial, talking about more data analysis in general. We're also going to do a quick review of Python: maybe when I was running these commands you felt a little bit lost about what I was doing, so we'll do a quick review of Python and all that. And of course, we're going to get deep into data analysis with pandas and some other tools. But I want to tell you something before we finish this chapter: it's very important for you to get familiar with Jupyter notebooks, because you're going to spend a ton of time with them, and it's a very valuable skill to get proficient and comfortable with them: creating cells, deleting cells, cutting, pasting, moving things around, etc. For generating reports, Jupyter notebooks are excellent. So keep an eye on it and keep practicing; it's the only way to learn it. Keep the command palette open, so if you forget, for example, how to cut a cell, well, there it is, Command-X; it tells you upfront. Keep working with it and practicing, and once you get familiar with Jupyter notebooks, you're going to move very, very fast. Remember, we have this nice compiled list of commands and references you can always access if you need extra help. And now we're going to keep moving forward with more data analysis. It's time to talk about NumPy, one of the most important libraries in the Python ecosystem for data processing in general. It's the one that got pretty much everything started: if you trace NumPy back, it's a very old library, 20 years old maybe. It's an extremely important library; I'm not going to say popular, and I'll explain why in just a second, but it's a very, very important library in the Python ecosystem for data processing. NumPy is a numeric computing library: it just processes numbers, calculates things with numbers, and that's it. So NumPy has a very limited scope, we could say, and it's on purpose a very simple library, with a very consistent API when you look at it. Why is NumPy so important? Well, in Python, numeric processing, processing numbers in pure Python, is very slow.
Okay, Python is not slow in itself compared to other programming languages. But when you go down to very deep levels of performance, when you are processing large amounts of data and you need to squeeze every last byte out of your pipeline, every FLOP from your CPU, then Python as a pure programming language is not the right tool. NumPy solves that: NumPy is a very efficient numeric processing library that sits on top of Python and gives you an API that feels like writing plain Python code, as you're seeing here, but at a low level it performs high-performance numeric computations over arrays of numbers. That's it. That's NumPy. It's extremely simple from an API perspective, but extremely powerful. Why did I say that it's not so popular, but it is so important? Well, because in reality we don't usually employ NumPy directly. You will not see yourself using NumPy so often; you will be using other tools in Python, like for example pandas and matplotlib, and they are all working on top of NumPy, all relying on NumPy for their numeric processing. That's why NumPy is so important. So, at least for this part of the tutorial, I'm going to divide NumPy into two pieces. The first one is going to be a very detailed, low-level explanation of how NumPy works, why we need NumPy, and what the differences are between different byte sizes for numbers. We're going to talk about integers, but this also applies to decimals and other data types, and why you need a very low-level, optimized way to store numbers. Now, you can skip this part; you're going to find in the description of this tutorial the precise moment in time where it ends.
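To give you a feel for the difference, here is a small sketch you could run in a notebook cell; the array size is an arbitrary choice, and exact figures will vary by machine:

```python
import sys
import numpy as np

# One million integers, in pure Python and in NumPy
py_list = list(range(1_000_000))
np_array = np.arange(1_000_000, dtype=np.int32)

# Memory: a Python list stores full objects; NumPy packs raw 32-bit ints
print(sys.getsizeof(py_list))   # size of the list object alone, ~8 MB of pointers
print(np_array.nbytes)          # 4_000_000 bytes: exactly 4 bytes per element

# Speed: the vectorized sum runs in optimized C, not the Python interpreter
%timeit sum(py_list)     # IPython magic, run in a Jupyter cell
%timeit np_array.sum()   # typically one or two orders of magnitude faster
```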
So you can just skip ahead and go directly to the second part, which is when we actually start using NumPy and I show you how to create arrays, how to make computations, etc. So for now, we're going to divide it into two parts, and we're going to start with the low-level explanation, which you can skip if you want, because it's not going to be crucial; you can easily use NumPy without it. We have found that for some of our students it's important to understand the low-level basics, especially if you don't have a computer science background; it can help raise your level of understanding of computers and of how to make your computations more efficient. But don't worry if you don't want to go through it now; that's fine, you can skip this part and come back later, at any other moment. You don't need this to use NumPy, seriously. It's going to be beneficial, but you don't absolutely need it, so you can skip it and come back later. So with that said, let's actually get into a deep understanding and explanation of how computers store integers in memory, and what bytes and bits are, in order to understand why NumPy is so important. We have to go back to the basics: what numbers are, how they are represented in computers, etc. As you might know already, a computer can only process ones and zeros, bits; it can't process decimal numbers directly. A computer is always storing and processing ones and zeros; it's a binary machine. Your memory, the random access memory in your computer, is the central place where your computer stores the data it's actively processing. You have, for example, a hard drive, which stores long-term data, but the computer can't process data directly from your hard drive. Before doing that, it has to load it into your RAM, your random access memory. Usually a computer is going to have what, eight gigabytes of memory, 16, 32, it doesn't matter.
Let's say you have eight gigabytes of memory; at some point that translates to a number of bits that your computer can store. So if you follow the math we have right here, you can see the total number of bits available in a regular computer with eight gigabytes of memory. Why is this important? Because, again, the objective of this tutorial, or of this part at least, is to explain how you can squeeze every single bit you can out of your computer. How can you make it more efficient for your numeric processing, both in storage, using less memory for the same data, and also in speed, making your calculations faster? So in terms of memory storage: how can we optimize to use the least amount of memory for a given problem? To do that, we need to understand how integers in the decimal numeric system are represented in binary. This table right here shows you the first few numbers, 0, 1, 2, 3, 4, etc., and their binary representation in your computer. Let's say you want to store the age of a user, which is 32. You can't store the decimal digits "32" directly, because your computer, again, doesn't know about decimals; it only knows about binary. To do that, you need to find the correct representation in ones and zeros of 32, which is not this one, to be honest; I'm just making it up as we go. But again, you need to know the correct binary representation of this number in order to store that data. How can you know that? Well, there is a whole part of math dedicated to binary arithmetic; it doesn't matter for now, but I'm going to give you the intuition of it so you can have a better understanding, and if you're interested, you can dig deeper later.
So basically, any decimal number needs to be stored in a binary format, which of course only takes ones and zeros, and what we usually do is keep adding positions of zeros and ones. In this case, we have the number zero and the number one; that's fine. Once we need to store the number two, we now need to add a position: we go from 1 to 10 (one-zero). The number three is 11 (one-one), and for the number four we need to add a position again, because we only have two symbols, zero and one. So as you're seeing right here, up to this level we need only one position; up to this level, two positions; at this level, three positions; and this level is going to need four positions. And you see how the size of each of these groups keeps increasing, and there is an explanation behind that which we're going to see in a second. So the question is: how many decimal numbers can you store with n bits? Let's say we have n bits, and let's say n equals three. That means you only have three positions, three bits. How many total decimal numbers can you store with that? Well, we can store 000, which is zero; we can store 001; we can store 010, and so on. With this size we can store numbers up to seven: 111 equals seven. Once we've filled all the positions, we've reached the limit of what those bits can represent.
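Here is that same intuition as a quick sketch you can try in a cell; the 8 GB figure mirrors the example above:

```python
# Binary representation of a decimal number
print(bin(32))            # '0b100000': 32 needs six bits
print(int('100000', 2))   # 32: back from binary to decimal

# With n bits you can represent 2**n distinct values: 0 .. 2**n - 1
n = 3
print(2 ** n)        # 8 distinct values
print(2 ** n - 1)    # 7, the largest one: 0b111

# A computer with 8 GB of memory, expressed in bits
print(8 * 1024**3 * 8)   # 68,719,476,736 bits
```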
This is the final objective of data analysis: we need to provide evidence of our findings, create readable reports and dashboards, and aid other departments with the information we've gathered. Multiple actors will use your analysis: marketing, sales, accounting, executives, etc. They might need to see different views of the same information, and they might all need different reports or levels of detail. What tools are available today for data analysis? We've broken these down into two main categories. First are managed tools: closed products, tools you can buy and start using right out of the box. Excel is a good example; Tableau and Looker are probably the most popular ones for data analysis. At the other extreme, we have what we call programming languages, or we could call them open tools. These are not sold by an individual vendor, but are a combination of languages, open source libraries and products. Python, R and Julia are the most popular ones in this category. Let's explore their advantages and disadvantages. The main advantage of closed tools like Tableau or Excel is that they are generally easy to learn. There is a company writing documentation, providing support and driving the creation of the product. The biggest disadvantage is that the scope of the tool is limited: you can't cross its boundaries. In contrast, using Python and the universe of PyData tools gives you amazing flexibility. Do you need to read data from a closed API using secret-key authentication, for example? You can do it. Do you need to consume data directly from AWS Kinesis? You can do it. A programming language is the most powerful tool you can learn. Another important advantage is the general scope of a programming language. What happens if Tableau, for example, goes out of business? Or if you just get bored with it and feel like your career is stuck and you need a change? Learning how to process data using a programming language gives you freedom. The main disadvantage of a programming language is that it's not as simple to learn as a point-and-click tool: you need to learn the basics of coding first, and it takes time.
Why are we choosing Python for data analysis? Python is the best programming language to learn to code: it's simple, intuitive and readable. It includes thousands of libraries to do virtually anything, from cryptography to IoT. Python is free and open source, which means there are thousands of very smart people looking at the internals of the language and its libraries. From Google to Bank of America, major institutions rely on Python every day, which means it's very hard for it to go away. Finally, Python has a great open source spirit. The community is amazing, the documentation is exhaustive, and there are a lot of free tutorials around. Check for conferences in your area; it's very likely that there is a local group of Python developers in your city. We couldn't talk about data analysis without mentioning R. R is also a great programming language. We prefer Python because it's easier to get started with and more general in the libraries and tools it includes. R has a huge library of statistical functions, and if you're in a highly technical discipline, you should check it out. Let's quickly review the data analysis process. The process starts by getting the data. Where is your data coming from? Usually it's in your own database, but it could also come from files stored in a different format, or from a web API. Once you've collected the data, you will need to clean it. If the source of the data is your own database, then it's probably in the right shape. If you're using more extreme sources, like web scraping, then the process will be more tedious. With your data clean, you'll now need to rearrange and reshape it for better analysis: transforming fields, merging tables, combining data from multiple sources, etc. The objective of this process is to get the data ready for the next step. The process of analysis involves extracting patterns from the data that is now clean and in shape.
Capturing trends or anomalies: statistical analysis will be fundamental in this process. Finally, it's time to do something with that analysis. If this were a data science project, we could be ready to implement machine learning models. If we focus strictly on data analysis, we'll probably need to build reports, communicate our results, and support decision making. Let's finish by saying that in real life this process isn't so linear; we're usually jumping back and forth between the steps, and it looks more like a cycle than a straight line. What is the difference between data analysis and data science? The boundaries between data analysis and data science are not very clear. The main differences are that data scientists usually have more programming and math skills, which they then apply in machine learning and ETL processes. Data analysts, on the other hand, have better communication skills, creating better reports with stronger storytelling abilities. By the way, the diagram you're seeing right here is available in the notes, in case you want to check out the source code. Let's explore the Python and PyData ecosystem, all the tools and libraries that we will be using. The most important libraries we will be using are pandas for data analysis, and matplotlib and Seaborn for visualizations. But the ecosystem is large, and there are many useful libraries for specific use cases. How do Python data analysts think? If you're coming from a traditional data analysis place, using tools like Excel and Tableau, you're probably used to having a constant visual reference of your data. All these tools are point and click. This works great for a small amount of data, but it's less useful when the number of records grows. It's just impossible for humans to visually reference too much data, and the processing gets incredibly slow. In contrast, when we work with Python, we don't have a constant visual reference of the data we're working with.
We know it's there, we know what it looks like, we know its main statistical properties, but we're not constantly looking at it. This allows us to work with millions of records incredibly fast. It also means you can move your data analysis processes from one computer to another, and for example to the cloud, without much overhead. And finally, why would you want to add Python to your data analysis skills? Aside from the advantages of freedom and power, there is another important reason: according to PayScale, data analysts who know Python and SQL are better paid than the ones who don't know how to use programming tools. So that's it, let's get started. In the following section we'll show you a real-world example of data analysis with Python; we want you to see right away what you will be able to do after this tutorial. We're going to start this tutorial by working with a real example of data analysis and data processing with Python. We're not going to get into the details yet; the following sections will explain what each one of the tools does, the best way to apply and combine them, and all their details. This is just for you to have a quick, high-level reference of the day-to-day processes of data analysts, data managers and data scientists using Python. The first dataset we're going to use is a CSV file that has this form; you can find it right here, under the data directory. I have also transformed it into a spreadsheet, so we can look at it from a more visual perspective. But remember, as we said in the introduction, as data analysts we are not constantly looking at the data; we don't have a constant visual reference. We are driven more by our understanding of the data in the back of our head: we understand what the data looks like and what its shape is, and that's what conducts our analysis.
So the first thing we're going to do is read this CSV into Python, and you can see how simple it is: just one line of code gets the CSV read into Python. Then we take a quick look, and this is what the DataFrame we have created looks like. DataFrame is a special word: it's a special data structure we use in pandas, and again, we're going to see it in detail in the pandas part of this tutorial. The DataFrame is pretty much the CSV representation, but it has a few more enforced things; for example, each column has a strict data type, and we will not be able to change it arbitrarily. It's a better way to conduct our analysis. The shape of our DataFrame tells us how many rows and how many columns we have. So you can imagine that with this number of rows it's not so simple to follow a visual representation of it; it's pretty much infinite scrolling at this point, 100,000 rows. The way we work is, immediately after we load our data, we want to find some sort of reference for the shape and the properties of the data we're working with. For that, we first run an info to quickly understand the columns we're working with. In this case we have date, which is a datetime field; we have day, month and year, which are just complementary to date; we have the customer age, which is an integer, which makes sense, right? Age group, you can see it right here, age group "Youth"; customer gender. We get an idea of the entire dataset: we know the columns we have, but we also know how large it is, and we don't care what's in between. We will probably be cleaning it, but we don't need to start looking at it row by row with our very limited eyes; we get a better understanding of the structure of our data this way. And going one step further, we'll also get a better understanding of the statistical properties of this DataFrame with the describe method.
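In code, that first part looks roughly like this; the file path and the parsed date column are assumptions based on the dataset described above:

```python
import pandas as pd

# Read the CSV into a DataFrame: one line of code
sales = pd.read_csv('data/sales_data.csv', parse_dates=['Date'])

sales.head()    # first five rows, a quick visual reference
sales.shape     # (rows, columns), e.g. roughly (100_000, 18)
sales.info()    # column names, dtypes and non-null counts
sales.describe()  # summary statistics for the numeric columns
```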
For all the numeric fields, I can get an idea of their statistical properties. For example, I know that the average age in this dataset, which is sales data, is 35 years old. I also know that the maximum age is 87 years old, and the minimum is 17 years old. And again, I can start building my understanding of the statistical properties of the data. In this case, the median of the age is very close to the mean, so that's telling me something, and the same thing is going to happen for each one of the columns we are using. For example, we have a negative profit here, and we have very large values here. Are these correct? Is there maybe a mistake? Again, by having a quick statistical view of our data, we're going to be driving the process of analysis without the need to constantly look at all the rows we have. It's a more general, holistic overview. So we're going to start with unit cost; let's see what it looks like. We're going to do a describe of only the unit cost, which is pretty much what we had right here: in the previous line, we did it for the entire DataFrame, and in this case we're just focusing on the unit cost column. The mean, the median, all the fields we pretty much know already from this. And we're going to quickly plot them; we're going to use these tools to visualize them. And it's the same tool: it's pandas, working on top of matplotlib. The visualization is created with matplotlib, but we're doing it directly from pandas. And again, don't worry, this is all explained in the pandas lessons. So this is unit cost; this is the box plot we have just created. We have the box showing us the first and third quartiles and the median, the whiskers, and then we see all the outliers that we have right here. So we see that a product that is around $500 is considered to be an outlier.
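A sketch of those two steps, assuming the column is named Unit_Cost:

```python
# Summary statistics for a single column
sales['Unit_Cost'].describe()   # count, mean, std, quartiles, min/max
sales['Unit_Cost'].mean()
sales['Unit_Cost'].median()

# Plots straight from pandas (matplotlib does the actual drawing)
sales['Unit_Cost'].plot(kind='box', vert=False, figsize=(14, 6))
sales['Unit_Cost'].plot(kind='density', figsize=(14, 6))   # the density plot mentioned next
```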
And the same thing if we do a density plot; this is what it looks like. We're going to draw two more charts, in which we pretty much point out the mean and the median on the distribution chart, and we're going to do a quick histogram of the costs of our products. Moving forward, we're going to talk about age groups, the age of the customers, and at any moment we can always do a quick sort here to get a reference. We know that the age of the customer is expressed in the actual years old they were, but they have also been categorized into, actually, four age groups: seniors, youth, young adults and adults. So we have been given categories that were created to better understand these groups, and we can count them with value counts. We can quickly get a pie chart out of it, or we could get a bar chart out of it. As you can see right here, as we do this analysis of our data, we see that adults are the largest group, for our data at least. Moving forward, what about a correlation analysis? What is the correlation between some of our properties? We will probably have high correlation, for example, between profit and unit cost, or order quantity; that's kind of expected, but it's all something we can check right here. This is a correlation matrix. The diagonal, which is dark blue, is correlation equal to one, so high positive correlation is blue and negative correlation is dark red. We see that profit has a lot of positive correlation with unit cost and with unit price. And let's see, for example, here: profit has a negative correlation with order quantity, which is interesting; we would want to dig deeper into that. And of course, profit has a high positive correlation with revenue. Again, it's just a quick correlation analysis.
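Roughly, those cells could look like this; Age_Group is an assumed column name, and the heat map is drawn with plain matplotlib:

```python
import matplotlib.pyplot as plt

# Categorical breakdown of the age groups
sales['Age_Group'].value_counts()
sales['Age_Group'].value_counts().plot(kind='pie', figsize=(6, 6))
sales['Age_Group'].value_counts().plot(kind='bar', figsize=(14, 6))

# Correlation matrix of the numeric columns, drawn as a heat map
corr = sales.corr(numeric_only=True)
fig = plt.figure(figsize=(8, 8))
plt.matshow(corr, cmap='RdBu', fignum=fig.number)
plt.xticks(range(len(corr.columns)), corr.columns, rotation='vertical')
plt.yticks(range(len(corr.columns)), corr.columns)
plt.colorbar()
```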
We can also do a quick scatter plot to analyze customer age against revenue, to see if there is any correlation there, and the same thing for revenue and profit. This one is obvious: we can quickly draw a diagonal here, so there is a clear linear dependency between these variables. Then a few more box plots, in this case to understand the profit per age group, so we can see how the profit changes depending on the customer's age; and a few more box plots where we create this grid of customer age, unit cost, etc., for multiple variables. Moving forward, something we can do very quickly when we're working with Python, and especially with pandas, is to reshape our data or derive new columns from other ones. This is pretty common in Excel: we can create this revenue-per-age column. If you're here in Google Spreadsheets, you're going to create a revenue-per-age column and type something like "equals revenue divided by age"; I don't remember if this is the correct formula we're using, but it's just for you to have a reference. And we're going to extend this formula down the whole sheet. There we go; oh well, it's processing, and I have 100,000 rows, so you can see how slow it is. Let's compare that to the way Python works: I'm going to execute this thing. It was instant, extremely fast, and it was all calculated; it seems we have the same results as expected. And we can quickly plot it both in a density plot and in a histogram, as you can see right there. Not that this revenue-per-age value is going to be relevant; in any case, it's just to show you the capabilities of what we can do. Next we're going to create a new column, calculated cost, which is the quantity of the order times the unit cost: an extremely simple formula, a very fast process.
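A sketch of that derived-column step; the column names are assumptions:

```python
# A derived column: one vectorized operation over all 100,000 rows
sales['Revenue_per_Age'] = sales['Revenue'] / sales['Customer_Age']

sales['Revenue_per_Age'].plot(kind='density', figsize=(14, 6))
sales['Revenue_per_Age'].plot(kind='hist', bins=100, figsize=(14, 6))

# The calculated cost mentioned next: order quantity times unit cost
sales['Calculated_Cost'] = sales['Order_Quantity'] * sales['Unit_Cost']
```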
And we're going to check right here how many rows had a value different from what was provided by the cost column. So what we're doing is quickly checking whether the cost provided by the dataset ever fails to line up with the cost we are calculating. Were there any mistakes made by, I don't know, the original system, or people doing data entry? If this new column is different from cost, we want to know about it; and it turns out that doesn't happen. So again, a quick regression plot: in this case it's very obvious that there is a linear dependency between calculated cost and profit. Then more formulas, in this case calculated revenue, which is cost plus profit; we're adding a little bit more, and there is no difference between the revenue column and the calculated revenue we get, so that all makes sense. We're going to do a quick histogram of the revenue. We can, for example, add 3% to all the prices we are using; say we need to increase prices. How are we going to do that? Well, it's very simple with Python: we're just going to multiply everything by 1.03, and now all the prices have changed. What else are we able to do? Quick filtering. Let's get all the sales from the state of Kentucky. So these are all the sales from the state of Kentucky; and we can get only the average revenue of the sales for a given age group. All these filtering options are extremely simple to get with Python. In this case we say: give me all the sales from this age group and also from this country, and we get the average revenue of the group we are selecting. And again, to modify the data we can make a few quick modifications; in this case we're going to say: for all the sales from this country, we're going to multiply the revenue by 1.1.
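Those checks and modifications could look roughly like this; every column name and label here is an assumption for illustration:

```python
# Sanity check: does our calculated cost ever disagree with the Cost column?
(sales['Calculated_Cost'] != sales['Cost']).sum()   # 0 means they always match

# Increase every unit price by 3%
sales['Unit_Price'] *= 1.03

# Filtering: all sales from the state of Kentucky
sales.loc[sales['State'] == 'Kentucky']

# Average revenue for one age group in one country
sales.loc[(sales['Age_Group'] == 'Adults (35-64)') &
          (sales['Country'] == 'United States'), 'Revenue'].mean()

# Bump revenue by 10% for one country's sales
sales.loc[sales['Country'] == 'France', 'Revenue'] *= 1.1
```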
That 1.1 factor is arbitrary; it's just for me to show you how it works. So far, so good. Again, we've done a couple of things, and you don't need to know the details; we will actually go through them in the NumPy and pandas sections of this tutorial. This is just for you to have a quick reference. There are exercises associated with these lectures, so if you want to pause right now and get into the exercises, that's going to be very helpful. We're going to move forward now with the second lecture, in which we will be using a database, the Sakila database, and we're going to be reading data from a database now, instead of from a CSV file as we did before. Reading data from a SQL database is as simple as it is from an Excel file or a CSV file, as in our previous example, and once you've read the data, which is what we're going to do now, the process is the same. So what we have right here is a SQL query. If you don't know about SQL, you can check our courses or other courses online. Basically, we're pulling the data from the database. This is one of the advantages of Python: there are connectors for pretty much every database provider out there, Oracle, Postgres, MySQL, SQL Server, etc. In this particular example, we're going to be using MySQL. So once you construct the query and you pull the data from the database, the process is the same: we have just converted this outside data into a DataFrame that we can use with our Python skills. The first step, as usual, is to check the shape, info and description of our DataFrame. In this case we want to, again, understand its structure. We want to know how many rows we have, 16,000; we want to know a little bit more about our columns, how many records we have for each one of them, and the type of each one of these columns.
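A minimal sketch of that database step, assuming a MySQL Sakila database and a SQLAlchemy connection; the connection string and the exact query are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Adjust user, password, host and database for your own setup
engine = create_engine('mysql+pymysql://user:password@localhost/sakila')

# Pull rentals joined with film data straight into a DataFrame
query = """
SELECT rental.rental_id, rental.rental_date,
       film.title, film.rental_rate, film.replacement_cost, film.rating
FROM rental
JOIN inventory USING (inventory_id)
JOIN film USING (film_id);
"""
df = pd.read_sql(query, engine)

df.shape    # how many rows and columns we pulled
df.info()   # column types and record counts
```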
We also want a better statistical understanding of our data, so we do a quick describe and we get more details about it. If we want to focus on individual columns, we can do that too; in this case we're going to focus on the film rental rate, pretty much how much you pay to rent a film. We're going to see the kind of distribution we have, if we can even call it a distribution; it's pretty much a categorical field in this case. Basically, the rentals are divided into three main price categories: $0.99, $2.99 and $4.99. So this box plot, which looks almost too perfect, never seen in real life, gives you those prices. Moving forward, we can also very quickly check a categorical analysis, understanding the distribution of rentals between cities; we have two cities, and it's pretty much even, as you can see right here. Creating new columns, reshaping the data for further analysis, etc., is relatively simple. In this case, we're going to analyze the return on rentals, which films are more profitable for the company, dividing the rental rate, how much we charge, by the cost, how much it costs us to acquire the film. So in this case we can see the distribution of that: most rentals are here at the beginning, and then we have more profitable rentals, where we're making up to 60% above the cost. And we can quickly check the mean and the median of it, to get a quick idea of all that. Finally, selection and indexing: if you want to start focusing, if you want to zoom in on part of the data to get a better understanding, you start filtering. In this case we can filter by customer, but you could do it per city, per state, per film, per price category, etc. It's very simple to filter and zoom in on one particular characteristic of your data.
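Sketched out, with column names following the Sakila schema and an assumed name for the derived column:

```python
# Price distribution: effectively three categories
df['rental_rate'].value_counts()            # 0.99, 2.99, 4.99
df['rental_rate'].plot(kind='box', vert=False, figsize=(14, 4))

# Return on each rental: rate charged over replacement cost, as a percentage
df['rental_gain_return'] = df['rental_rate'] / df['replacement_cost'] * 100

df['rental_gain_return'].plot(kind='density', figsize=(14, 6))
df['rental_gain_return'].mean()
df['rental_gain_return'].median()
```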
So you can perform a more detailed analysis. In this case we have all the films rented by customers with the last name Hanson, which doesn't mean it's the same person; but again, it's very simple to filter that. And here we can very quickly see which films have the highest replacement cost: basically, we isolate those films whose replacement cost is the highest. And we can also see right here, just for you to get an idea, all the films that are in the category PG or PG-13. It's very simple to filter that data. So this is the process we usually follow: we import the data and we reshape it somehow, creating columns. There is also an important process of cleaning that I'm not highlighting in this part of the tutorial; we're going to talk about it in the tutorial itself. So there's the process of cleaning, then reshaping, creating new columns, combining data, and creating visualizations. This is the process we're following here with our Python skills, but there is a ton more to it, as you might imagine, from creating reports to running machine learning processes, creating linear regressions, etc. For now, this is just a quick understanding of the process we follow. Starting now, we're going to move forward with the details of each one of the individual tools. We're going to talk about Jupyter notebooks, we're going to talk about NumPy, we're going to talk about pandas, matplotlib, Seaborn, etc. The first thing we're going to see is what this whole thing I've been using is, this Jupyter Notebook; if you don't have experience with it, I want you to get an idea of how it works, and then we'll move on to the individual tools, NumPy, pandas, etc. Remember, there are exercises also associated with this particular lecture, so you can always go back and work with them once you get a better understanding of the tools we are using.
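To recap those selection examples before we move on; customer_lastname is a hypothetical column name from a query that also joins the customer table:

```python
# Rentals by customers whose last name is Hanson
df.loc[df['customer_lastname'] == 'HANSON']

# Films with the highest replacement cost
df.loc[df['replacement_cost'] == df['replacement_cost'].max(), 'title'].unique()

# All films rated PG or PG-13
df.loc[df['rating'].isin(['PG', 'PG-13'])]
```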
Before we jump into the actual data analysis course, and we start talking about Python, pandas, and all the tools we're going to use to import files, read data from databases, etc., I want to show you the environment we work with. It's our primary environment, the tool we use 99% of the time, and it's the Jupyter Notebook. There are going to be different terms here; I'm going to be referring to it as Jupyter Notebook, but as you're going to see in this part of our tutorial, Jupyter is actually a whole ecosystem of tools, and it's a very interesting project. Jupyter is a free and open source, again, ecosystem of multiple tools. Primarily, we're going to talk first about what a Jupyter Notebook is, this thing you're seeing right here and are going to see live in a second, and we're also going to talk about JupyterLab, which is the evolution of the regular Jupyter Notebook. I think this could be familiar to you already. Usually the question is: what's the difference between Jupyter Notebook and JupyterLab? Well, the difference is that JupyterLab is a nicer interface on top of Jupyter notebooks. It's not just the plain notebook; this is a notebook I'm scrolling through right now, but it's also the addition of a tree view, git tools, a command palette, and multiple other things. You can open some files with a nice preview in it, etc. So Jupyter Notebook and JupyterLab are similar; JupyterLab is, again, the evolution of the Jupyter Notebook, and that's what we're using. Again, Jupyter is a free and open source project, so anybody can install it, anybody can download it; it's very simple to get it set up on your local computer. In this case, we're using something called Notebooks AI, a project that provides a Jupyter environment for free in the cloud.
So you don't need to install anything locally, and you don't need to keep things in sync on your own hard drive; that means you don't need to back it up, for example, because it's a service, it all works in the cloud. With that said, I want to tell you that we have compiled a very quick list of everything we're going to cover in this part of the tutorial; it's a thread with multiple hints on how to use Jupyter notebooks. So after the video, after the course, if you forget some of these concepts, you can always go back to it; it's a quick reference for you to have. So let's get started. Why do we use a Jupyter Notebook? Because it's an interactive, real-time environment to explore our data and do our data analysis. It's a tool where you fire commands and it immediately responds with something back; a very interactive tool for working with data analysis. And this is the main difference from some other tools, like Excel, Tableau, etc.: we are not constantly looking at the data; there is no visual reference like you have in Excel. In Excel, you're constantly looking at the data; you have it in front of you, there are 100,000 cells and you can scroll and see them. The problem is that that's not scalable; nobody can keep 100,000 rows in their mind, we will always forget something. So the way we work with Python in data analysis is by always having a reference of what our data looks like, but in the back of our head; we're not constantly looking at it. We're like that person from The Matrix, you know, the operator of the Matrix who guides people in and out. We're basically asking questions of the data, and keeping a picture in our mind of how that's going to work. We're not constantly looking at it; we just keep a reference in the back of our heads of what our data looks like.
So that's why this tool is very useful. This tool is also useful if you're just training your Python skills, or your programming skills in general, because what you're going to see is just a regular Python interpreter. In this case, I can execute some code, let's say one plus three, there we go, and the result is four. So this is a fully featured Python interpreter. The good thing is that, again, it's going to respond pretty much immediately: I issue a command and I immediately get a response. I can do a print here, "hello world", and I immediately get a response; I can do "hello world" times three. Again, it's a fully featured Python interpreter, but it's not being accessed from a terminal. Although you can; this is the good thing about JupyterLab: you can have a terminal, where I can run python, evaluate two times three, and get an answer back. But a terminal is not convenient for working with our data; we need something a little more interactive, which we can also mix with documents, and that's going to be the advantage of a Jupyter notebook. So what's the way we work with Jupyter notebooks? There are a few very important concepts that we are going to follow. A Jupyter Notebook is just a sequence of multiple cells; everything is a cell. And as you can see, when I click on these cells, even when it doesn't look like a cell, it is one; you will see that this blue marker right here is pretty much following me, because I'm clicking on a cell and selecting that particular cell. Everything happens within a cell. If I want to execute some code, I can do, again, one plus five, and I get a result back. That's how it works. So I'm creating a cell, I'm deleting a cell, I create another cell again.
So everything happens within a cell, and I'm going to tell you how to add cells, how to remove them, how to execute code, etc. The interesting thing about a cell is that it can either be code, Python code in this case since this is a Python data analysis course, or any other programming language you're using. It can be Python code, as we were doing before, one plus three; or it can be what we call Markdown, which is a text formatting convention used to create text that will be rendered, sort of like HTML, at the output. In this case, this is what the source code of the Markdown looks like. In Markdown, any line that starts with a hash sign is going to be a title; in this case, just one hash is the biggest title you can have, and you keep adding hashes to reduce the size, in this case a level-three title. And then you can have, for example: this is a quote, this is bold, this is italics, this is a link. So let me actually copy the cell and open the source code. There we go. So this is a link; it's rendered as a link. What Markdown is, then, is a text formatting tool, a protocol we could say: we have some rules to use in our text, and Markdown knows how to interpret them and return a formatted document. So for example, here we have this green divider, which is a picture, and we know it's a picture because it starts with an exclamation mark, which is what you're seeing right here. So again, a cell can be either Python code or Markdown. Markdown is an entire thing on its own; you can find any tutorial online for free, and it's fairly simple to get started with. It's also very important, because when you're creating your reports, you want them to look pretty, and you can use Markdown for that. And as we're going to see later, you can export these notebooks, and they will generate PDFs.
So this whole thing can be a PDF or an HTML page. After you're done with your data analysis, you can hand whoever asked for the analysis a PDF report, which is pretty neat. So moving forward, again, any cell is going to be either Markdown or code, as right here. This one is code, and you can switch the types: you can say this cell is code, or actually, let's make it Markdown. So right now, if it's set to Markdown, it doesn't execute anything, because the cell is interpreted as Markdown. Now I switch it back to code, and now it works again, as I said. A cell can also be raw, but to be honest, we don't use raw very often. So again, every cell has this general cell type, and for the cell we're using, you can ask: what type is it, is it code, is it Markdown? You can switch it with this selector right here. Now, a few more things that I have to tell you right away, so you can start internalizing them. It's going to take some time to get used to it, but once you do, you're going to move very fast in your data analysis with Python and Jupyter notebooks. The first thing is, as you're seeing right here, every cell is given an execution number. The cells will be moved around, you will be moving them around, but you will always know which one executed before another one, and that's because every execution you run is assigned an execution number. In this case, this is the seventh time I have executed code. If I execute code again, for example two times two, this is the eighth time I've executed code. And if I move this thing right here, and you're reading the notebook top down, you will not be fooled: you will understand that the cell was moved, the structure of the notebook changed, but this cell was executed after this other cell.
This one is eight and this one is seven, so the execution order is always preserved. That's an important thing. Something else: you're seeing me change the structure and do things with the notebook without using any menu, and that's because I know the keyboard shortcuts to run most of these commands. So for example, how can I add a new cell? Here I have a Markdown cell and a code cell. If I need a cell before this one, what's the command I'm going to issue in order to create it? In this case, the command is the letter A: I just type A, and there is a new cell created. How can I delete a cell? It's pressing the D key two times. And again, this is all in the reference we built, so you can always check it. Here, you can press A to create a new cell above, and you can press B to create a new cell, as we call it, below. So let me put something here as a reference: I'm going to press the letter B, and it's going to create a cell below the currently selected one. The selection here is in blue; let me delete this one. I hit B, and again, it creates a cell below the previously selected one; if I hit A, it creates a cell above it. So these are the mnemonics of cell creation: A for above, B for below. Something else, and it's very important: when I'm in this cell and I hit the letter A, literally just the letter A on my keyboard, no Control, no Command, just A, it creates a new cell, and it doesn't type an "a" inside the document. But right here, if I type A, it adds an actual "a" character to the cell. Why didn't that happen before? You'll notice that when I switch to what I'm going to call command mode in a second, the content of the cell is grayed out, so now when I press the letter A, it actually creates a cell instead of adding content to the cell itself.
If I go back to the other mode, and I'm going to give you a better explanation in a second, anything I type, in this case an "a", is actually appended to the text within the cell. So this is my introduction to cell modes, and this is very important: the Jupyter Notebook is a mode-based editor. There are multiple mode-based editors, for example vim or vi; in a mode-based editor, the behavior of your keystrokes changes depending on the mode that is currently active. So for example, in this case I am in edit mode, because any character I type is appended to the cell: a, b, c, d, etc. If I switch out of edit mode to what we're going to call command mode, the cell is grayed out, and any key I hit is going to do something different associated with that key: A is going to create a new cell above, B is going to create a new cell below, and D pressed twice is going to delete the cell. So that's one of the most important parts of understanding how to work with Jupyter notebooks: the mode you're currently working in. And there are only two modes, so it's fairly simple. This is command mode, and we recognize command mode because the cell is grayed out. When we get into edit mode, there is a regular prompt, as you saw before; the content of the cell is actually subject to editing. That's the way we can tell them apart. How are you going to switch between modes? In this case, I'm in edit mode. If I'm using my mouse, just pointing, I can click outside the cell, and I get out of edit mode into command mode; if I click inside, I'm going back into edit mode. But let me tell you something right away: we don't like to use our mouse, we don't like to point and click, because that's very slow. We like to use our keyboard; we move very fast with our keyboard. So how are you going to switch from edit mode back to command mode? That's going to be with the Escape key: Escape switches you out of edit mode into command mode.
And if you actually want to make modifications to the cell, basically to get into edit mode, you're going to hit the Return key; that gets you into edit mode again. So we have already tackled multiple things. Again, with Jupyter notebooks we're going to use Python code to interact with our data very quickly; we need a real-time, I-ask-you-answer type of editor, and that's what the Jupyter Notebook is. The Jupyter Notebook has these two modes, edit mode and command mode. And then the cell, which is pretty much everything, is the fundamental part of the notebook, and a cell has two types: it can be either code or Markdown. Now I'm going to start showing you more features: the most important commands and, of course, what the keyboard shortcuts for those commands are, so you can move freely and work with Jupyter notebooks in the most efficient way. So let's get started. First of all, one of the most important commands is moving around, navigating. It's very simple to navigate: just use your arrow keys, up and down, and you're going to move around your notebook. If you want to switch a cell's type, going from Markdown to code, etc., you can use this dropdown, or you can press a specific key to switch to either Markdown or code. For Markdown, you're going to hit the M key; that's going to make it Markdown. For Python code, you're going to hit the Y key; that's going to make it Python code. So M and Y are going to switch you back and forth; keep an eye on the selector as you hit Y, M, Y, M, and it switches from code to Markdown and back. What else? How can you execute code? Once you have typed your code and you want to execute it, there are two types of executions you can run.
The first one is going to keep the currently selected, active cell the same; you stay in the same place. That's done by keeping the Ctrl key pressed and hitting Return: that runs the code in the cell, and the prompt, the currently selected cell, remains the same. So I'm running this thing a couple of times, and the selection, the currently highlighted cell, stays the same. I can change that by using Shift Return: I keep the Shift key pressed and hit Return, and it's going to execute the code, but it will immediately switch the prompt, the currently selected cell, to the following one. That's useful when you have multiple cells and you want to execute one after the other: you can keep hitting Shift Return, Return, Return, and it keeps you moving from top to bottom. Alright, so Ctrl Return or Shift Return; the execution is the same, the difference is just what happens with the currently selected cell afterwards. We already saw how to create cells: with the A key we create a cell above, with the B key we create a cell below. To delete a cell, you're going to hit the D key two times, one after the other, very quickly; DD deletes the cell. What happens if you made a mistake and you want to undo the previously issued command? Well, the mnemonic here is going to be Ctrl Z, but it's not the actual command: you only need to press the Z key, no Ctrl, and it's going to undo whatever you did in your previous command. Alright, so A, B, DD for deletion, and then Z to undo. All the commands we're seeing have a correspondence in this toolbar, or in this command palette. So for example, right here I could run this code by pressing this play button; you see it, the execution is changing.
There are multiple commands, and you can search for them right here if you don't remember them. And the neat thing about it is that you actually get the shortcuts to issue the same command. So let's say you don't remember how to execute and stay in the same cell, or execute and move on; you can search for "run", and you can see the name and the actual shortcut for the command, right there. So, at least for your first days or month working with Jupyter notebooks, you will usually need to go back to these commands and try to remember the quick shortcuts; with time and practice, they will just come naturally. So moving forward, what else? We have a few other commands. In this case, we have something to cut and paste a cell somewhere else: that's going to be X to cut it, or you can also use the scissors here. And to paste it, you can use this button, or you can just press the V key: V is going to paste it wherever you're currently standing. So I'm going to cut it, removing it from here, and I'm going to paste it below, there. Or you can also copy it: instead of cutting, you can press the C key, which is going to copy, and then you can say where you want to paste it. In this case, we have duplicated the same cell, and notice something interesting here: the execution count remains the same. So again, there is this unique identifier for your executions, which means that you know when and where something was executed. Moving forward, we're going to use some code here; we're going to import some tools so you can see some characteristics and advantages of Jupyter notebooks, and why we use them so often compared to, for example, the regular Python terminal. One very important thing is visualizations. As data analysts, we're constantly taking data and expressing it through images, or animations sometimes, but most commonly images. The main library we use in Python is matplotlib.
And matplotlib is a first-class citizen in Jupyter notebooks, which means that you can just render the figures from matplotlib and they will show up directly in your notebook, without the need to do anything crazy. Can you imagine showing this beautiful picture in a terminal? That's very hard, of course. So again, that's one of the main advantages of a Jupyter Notebook. Moving forward, what we're going to do first is get some data from a public API. There is this Cryptowatch service, which basically has crypto market information, Bitcoin, ether, etc. You can check the docs; we can actually open them. It's going to give you market data. You can check the docs and see how you can get, in this case, BTC, Bitcoin, to euro; let's actually see if we can change it to the USD price. There we go. So this is the current price of Bitcoin: result, price, etc. And we're actually going to check the markets: do we have Kraken, BTC/USD? Let's actually issue the same query we're going to use, which is open, high, low, close: OHLC. And don't worry, this looks ugly, but it's actually what we're using. There's a list of results for all the different candles, as we call them, where we get the open price, close price, high price and low price. So we're going to issue these requests to the internet, to this API, the Cryptowatch API, to get information about Bitcoin and do some analysis; and as I said, you can actually get it for ether too, or for other types of cryptocurrencies. The function we're defining is get_historic_price. It's a very simple function that uses pandas, one of the most important tools we're going to be using in this course, and the requests library, which is also a very famous library for Python. And what we're going to do here is get the Bitcoin and ether prices for an entire week.
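The function isn't reproduced verbatim here, but a sketch of it could look like the following. The endpoint and response layout follow the Cryptowatch API docs as used at the time of the course (the service has since been discontinued, so treat this purely as an illustration); the parameter values and the year in the date are assumptions:

```python
import requests
import pandas as pd

def get_historic_price(symbol, exchange='bitstamp', after='2020-02-25'):
    """Fetch hourly OHLC candles for a crypto pair and return a DataFrame."""
    url = f'https://api.cryptowat.ch/markets/{exchange}/{symbol}usd/ohlc'
    resp = requests.get(url, params={
        'periods': '3600',                                   # 1-hour candles
        'after': str(int(pd.Timestamp(after).timestamp())),  # Unix timestamp
    })
    data = resp.json()
    df = pd.DataFrame(data['result']['3600'], columns=[
        'CloseTime', 'OpenPrice', 'HighPrice', 'LowPrice',
        'ClosePrice', 'Volume', 'NA'])
    df['CloseTime'] = pd.to_datetime(df['CloseTime'], unit='s')
    return df.set_index('CloseTime')

btc = get_historic_price('btc', 'bitstamp', after='2020-02-25')
eth = get_historic_price('eth', 'bitstamp', after='2020-02-25')
```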
Right, so from, sorry, February 25 up to today, depending on when I'm shooting this video, and we get a quick reference of the prices: open, high, low, close. In this case, we have the information per hour. This is something you can actually change in the request you're making to the API; you can adjust the candle size. In this case, we're keeping it per hour, so we have by-the-hour information about Bitcoin in this particular market, which is Bitstamp. Here we have this day, this hour in the morning: the open, the close, the highest price and the lowest price, and also the volume that was traded within this time period. And we immediately plot the price. We see that in this period, which is actually a few days, an entire week, the price dropped from $9,600 to below $9,000, so it was a pretty significant drop. Let's see how ether performed: we have here all the records, and how it moved. So this is what I tell you: when you're doing data analysis with a programming tool like Python or R, you're not constantly looking at the data. What I'm showing you right here are only the first five records; we actually have 169 records. Let's check that: we have 169 records, and this is per hour, so if we do 169 hours divided by 24 hours, we have seven days. So we have seven days of data, 169 records, and then we have a little bit more information if this keeps going; I'm going to get to that in a second. But basically, this is what I tell you: 169 records is, to be honest, something you could be seeing in a spreadsheet, but I want you to get the concept here. We're not just looking at our data; we have it in our brain. We know what it did, we know what shape it has, we know how many records it had, and we know its statistics: the standard deviation of the close price, the average, the mean, the median. So we have information about our data.
It's sitting back there, you know, in our brain, but we're not looking at it. And that's with a very simple example, with only 169 records; in real life we're dealing with millions of records, so it's impossible to see all of it. Have you ever tried scrolling through millions of records in an Excel spreadsheet? It's crazy. It's not possible; it's just unusable. So that's, again, the way we work with data analysis in Python and R and other tools. We don't constantly keep an eye on the data. We know the shape of it, and we just use these quick references: show me the first five records, show me the last five records, show me this chunk down there. But that's it. So again, these are the visualizations we're creating in Jupyter notebooks; it's just very simple to get the plot done right there. We're also going to see a few other pretty neat things in Jupyter notebooks. The first one is that we can use another library, which is called Bokeh. The difference is that Bokeh charts are interactive. As I'm moving it right here: it uses JavaScript, and it's interactive. If you look back at what we had before, that was a static chart; it's just a PNG. You can actually export it as a PNG, but there is nothing else you can do with it. With Bokeh, it's a dynamically generated, interactive chart. I can zoom in on a piece of data, I can move it around, I can do whatever I want with it, and I can reset it to whatever it was. The difference is: if you're working with your data dynamically, in your exploration, then Bokeh is a compelling tool, because you can zoom in and poke around.
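A minimal sketch of that interactive version, again assuming the btc DataFrame from earlier; Bokeh's figure/show API is real, though the exact styling in the notebook may differ:

from bokeh.io import output_notebook, show
from bokeh.plotting import figure

output_notebook()  # render Bokeh charts inline in the notebook

# width/height are the Bokeh 3.x names; older versions use plot_width/plot_height
p = figure(x_axis_type='datetime', title='Bitcoin close price',
           width=800, height=300)
p.line(btc.index, btc['ClosePrice'], line_width=2)
show(p)  # interactive: pan, zoom and reset from the toolbar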
So what's going on here? Let's look at these things. If we're working on a mean-reverting strategy, for example, we see a high point, we see a low point, the mean is going to be somewhere in between, and we see some mean reversion in there; it's very interesting. If, instead, you need to export a PDF or a huge HTML file, then static images are probably going to be better. So that's the difference between them. To be honest, matplotlib is a lot more popular than Bokeh. We use matplotlib a lot more, partly because we have a few other tools, like Seaborn, that make it very easy to access and use. What else? Jupyter notebooks work very well with all the common file formats: CSVs, XML, Excel files, etc. And that's also where JupyterLab comes in. JupyterLab can immediately interpret and open CSV files, and, with some extensions, XLS files, XML files and JSON files; it has a very nice editor and tree view for JSON. So the JupyterLab environment, combined with Python and Jupyter notebooks, gives you a good idea of Jupyter in general. In this case, we have just saved the data. I'm not going to execute this, but you can try it out: you can run what we have just done and export this crypto data as an Excel spreadsheet. You can just click here and basically download it, open it and see what it has. There we go; let me reduce the size of this thing. You can see that we have just exported two sheets, in this case Bitcoin and Ether, with the data that we had in our previous notebook. So that's, again, the combination of Jupyter, Python and JupyterLab; these tools just work very well together.
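The export itself is a one-liner per sheet with pandas' ExcelWriter; a minimal sketch, again assuming the btc and eth DataFrames from earlier (the file name is made up):

import pandas as pd

# Requires an Excel engine such as openpyxl: pip install openpyxl
with pd.ExcelWriter('cryptos.xlsx') as writer:
    btc.to_excel(writer, sheet_name='Bitcoin')
    eth.to_excel(writer, sheet_name='Ether')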
We're going to keep moving forward in this tutorial, talking about more data analysis in general. We're going to do a quick review of Python: maybe when I was running these commands you felt a little bit lost about what I was doing, so we're going to review Python and all that. And of course, we're going to get deep into data analysis with pandas and some other tools. But I want to tell you something before we finish this chapter. It's very important for you to get familiar with Jupyter notebooks, because you're going to spend a ton of time with them. It's a very valuable skill to get proficient and comfortable with Jupyter notebooks, you know: creating cells, deleting cells, cutting, pasting, moving things around, etc. For generating reports, Jupyter notebooks are going to be excellent. So keep an eye on it. Keep practicing; it's the only way to learn it. Keep the command palette open, so if you ever forget, say, how to cut a cell, well, here it is: the X command. It will just tell you upfront. Keep working with it and practicing, and once you get familiar with Jupyter notebooks, you're going to move very, very fast. Remember, there's a nice compiled list of commands and a reference you can always access if you need extra help. And we're going to keep moving forward now with more data analysis. Now it's time to talk about NumPy, one of the most important libraries in the Python ecosystem for data processing in general. It's the one that got pretty much everything started, and if you trace it back, NumPy is a very old library, with maybe 20 years of development. It's an extremely important library; I'm not going to say popular, and I'm going to explain why in just a second. But it's a very, very important library in the Python ecosystem for data processing. NumPy is a numeric computing library: it's there to process numbers, to calculate things with numbers, and that's it. So NumPy has a very limited scope, we could say, and this is on purpose. It's a very simple library when you look at it, with an API that is very consistent, by the way. Why is NumPy so important? Well, in Python, numeric processing, just processing numbers in pure Python, is very slow.
Okay, Python is not slow in itself compared to other programming languages. But when you go down to very deep levels of performance, when you are processing large amounts of data and you need to squeeze out every last bit at the end of your pipeline, every FLOP from your CPU, then pure Python is not the right tool. NumPy solves that: NumPy is a very efficient numeric processing library that sits on top of Python and gives you an API that feels like writing regular Python code, as you're seeing here. But at a low level, it's using high-performance numeric computations over packed arrays of numbers with efficient representations. That's it. That's NumPy. It's extremely simple from an API perspective, but it's extremely powerful. Why did I say that it's not so popular, but it's so important? Well, because in reality we don't usually employ NumPy directly; you will not see yourself using NumPy that often. But you will be using other tools in Python, like for example pandas and matplotlib, and they are all working on top of NumPy; they're all relying on NumPy for their numeric processing. So that's why NumPy is so important.
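To make the speed claim concrete, here's a tiny, hedged benchmark you can run yourself; exact timings will vary by machine, but the vectorized NumPy version is typically orders of magnitude faster:

import timeit
import numpy as np

nums = list(range(1_000_000))   # pure-Python list of int objects
arr = np.arange(1_000_000)      # packed NumPy array

# Pure-Python loop: interpreted, one boxed object at a time
py_time = timeit.timeit(lambda: sum(nums), number=100)

# NumPy: a single vectorized call over contiguous memory
np_time = timeit.timeit(lambda: arr.sum(), number=100)

print(f'pure Python: {py_time:.3f}s, NumPy: {np_time:.3f}s')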
For this part of the tutorial, I'm going to divide NumPy into two pieces. The first one is a very detailed, low-level explanation of how NumPy works, why we need to use NumPy, and what the differences are between different byte sizes for numbers. We're going to talk about integers, but this also applies to decimals and other data types, and to why you need a very low-level, optimized way to store numbers. You can skip this part: you're going to find the precise moment in time in the description of this tutorial, so you can jump directly to the second part, which is when we actually start using NumPy and I show you how to create arrays, how to make computations, etc. So we're going to start with the low-level explanation, which you can skip if you want, because it's not crucial; you can easily use NumPy without it. We have found that for some of our students it's important to understand the low-level basics, especially if you don't have a computer science background; it can help raise your level of understanding of computers and how to make your computations more efficient. But don't worry if you don't want to go through that now; it's fine. You can skip this part and come back later, at any other moment. You don't need it to use NumPy, seriously, you don't need it. It's going to be beneficial, but it's not absolutely needed, so you can just skip it and come back later. With that said, let's actually go into a deep understanding and explanation of how computers store integer numbers in memory, and what bytes and bits are. In order to understand why NumPy is so important, we have to go back to the basics: what numbers are, how they are represented in computers, etc. As you might know already, a computer can only process ones and zeros, bits; it can't process decimal numbers directly, to be more correct. A computer is always storing and processing ones and zeros. It's a binary machine. Your memory, the random access memory in your computer, is the central place where your computer stores the data it's actively processing. You have, for example, a hard drive, which stores long-term data, but the computer can't process data directly from your hard drive. Before doing that, it has to load the data into your RAM, your random access memory. Usually a computer is going to have, what, eight gigabytes of memory? Sixteen, thirty-two, it doesn't matter.
Let's say you have eight gigabytes of memory; at some point that translates to a number of bits that your computer can store. If you follow this calculation we have right here, you can see the total number of bits available in a regular computer with eight gigabytes of memory. Why is this important? Because, again, the objective of this part of the tutorial, at least, is to explain how you can squeeze out every single bit you can from your computer. How can you make it more efficient for your numeric processing, both in storage, using less memory for the same data, and in speed for your calculations? So in terms of memory storage: how can we optimize to use the least amount of memory for a given problem? To do that, we need to understand how integers in the decimal numeric system are represented in binary. This table right here shows you the first few numbers, 0, 1, 2, 3, 4, etc., and their binary representation in your computer. Let's say you want to store the age of a user, which is 32. You can't store "32" as-is, because your computer, again, doesn't know about decimals; it only knows about binary. To do that, you need to find the correct representation of 32 in ones and zeros, which is not this one, to be honest; I'm just making it up as we go. But again, you need to know the correct binary representation of this number in order to store that data. How can you know that? Well, there is this whole binary arithmetic, a whole part of math dedicated to binary. It doesn't matter for now, but I'm going to give you the intuition of it so you have a better understanding, and if you're interested, you can dig deeper later.
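A couple of quick one-liners to check these numbers yourself in Python:

# Total bits in 8 GiB of RAM: 8 * 1024^3 bytes, 8 bits per byte
print(8 * 1024**3 * 8)    # 68719476736 bits

# Binary representation of the age 32
print(bin(32))            # 0b100000
print(format(32, '08b'))  # 00100000, padded to one byte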
So basically, any decimal number needs to be stored in a binary format, which of course only takes ones and zeros. And what we usually do is keep adding positions of zeros and ones. In this case, we have the number zero and the number one; those are fine with one position. Once we need to store the number two, we now need to add a position: two becomes "10", three becomes "11", and then for the number four we need to add a position again, because we only have two symbols, zero and one. So as you're seeing right here, up to this level we need only one position; up to this level, two positions; at this level, three positions; and this level is going to need four positions. And you see how the size of each of these groups keeps growing; there's an explanation behind that, which we're going to see in a second. So the question is: how many decimal numbers can you store with n bits? Let's say we have n bits, and let's say n equals three. That means you only have three positions, three bits. How many total decimal numbers can you store with that? Well, we can store 000, which is zero; we can store 001, which is one; we can store 010, which is two; and so on. With this size we can store up to seven: 111 equals seven. Once we've filled all the positions, we've reached the limit.
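A quick sketch to enumerate this yourself: with n bits you get 2**n distinct values, from 0 up to 2**n - 1.

n = 3
print(2**n)  # 8 distinct values with 3 bits
for i in range(2**n):
    print(format(i, f'0{n}b'), '=', i)  # 000 = 0 ... 111 = 7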