Data Analysis and Exploration with Pandas: Calculating Boolean Statistics|packtpub.com


Hi, welcome to the next section, Boolean Indexing. In this section we will calculate boolean statistics, construct multiple boolean conditions, filter with boolean indexing, and replicate boolean indexing with index selection. Then we will move ahead to select rows with unique and sorted indexes, gain perspective on stock prices, and translate SQL WHERE clauses. Further, we will determine the normality of stock market returns, improve the readability of boolean indexing with the query method, and preserve Series with the where method. At the end, we will explore masking DataFrame rows and selecting with booleans, integer location, and labels.

Let's move to the first video of this section, where we will learn about calculating boolean statistics. In this video we will create a boolean Series by applying a condition to a column of data and then calculate summary statistics from it.

Filtering data from a dataset is one of the most common and basic operations. There are numerous ways to filter (or subset) data in pandas with boolean indexing. Boolean indexing, which is also known as boolean selection, can be a confusing term, but for the purposes of pandas it refers to selecting rows by providing a boolean value, either True or False, for each row. These boolean values are usually stored in a Series or a NumPy ndarray, and are usually created by applying a boolean condition to one or more columns in the DataFrame. We begin by creating boolean Series and calculating statistics on them, and then move on to creating more complex conditionals before using boolean indexing in a wide variety of ways to filter data.

When first getting introduced to boolean Series, it can be informative to calculate basic summary statistics on them. Each value of a boolean Series evaluates to 0 or 1, so all the Series methods that work with numerical values also work with booleans. Now let's get started with some code to see how it works.
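As a minimal sketch of that idea, the snippet below builds a small boolean Series by hand and applies two numeric methods to it. The name flags and its values are made up for illustration and do not come from the movie dataset.

```python
import pandas as pd

# Hypothetical boolean Series; the values are invented for illustration.
flags = pd.Series([True, False, True, True, False])

# True counts as 1 and False as 0, so numeric methods work directly.
total_true = flags.sum()    # number of True values
share_true = flags.mean()   # fraction of True values

print(total_true, share_true)
```

Here sum counts the True values and mean gives the fraction of values that are True, which is the pattern used throughout the rest of this video.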
First of all, read in the movie dataset, set the index to the movie title, and inspect the first few rows with the head method.
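A rough sketch of this step is shown below. The transcript does not give the file path or the exact column names, so the snippet reads a few made-up rows from an in-memory string instead of the real file, and the column names movie_title, duration, actor_1_facebook_likes, and actor_2_facebook_likes are assumptions.

```python
import io
import pandas as pd

# Stand-in for the movie file: a few invented rows, not the real dataset.
# The column names are assumptions made for this sketch.
csv_text = """movie_title,duration,actor_1_facebook_likes,actor_2_facebook_likes
Movie A,178,1000,936
Movie B,95,40000,5000
Movie C,148,11000,393
Movie D,110,500,700
Movie E,,131,12
"""

# index_col sets the movie title as the index while reading.
movie = pd.read_csv(io.StringIO(csv_text), index_col="movie_title")

# head shows the first few rows (five by default).
print(movie.head())
```

With the real dataset, the in-memory string would be replaced by the path to the movie file.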

Most DataFrames, like our movie dataset, will not have columns of booleans. The most straightforward method to produce a boolean Series is to apply a condition to one of the columns using one of the comparison operators. So let's determine whether the duration of each movie is longer than two hours by using the greater-than comparison operator with the duration Series.

We can now use this Series to determine the number of movies that are longer than two hours. For this, use the sum method. In all, there are 1039 movies that are longer than two hours. To find the percentage of movies in the dataset longer than two hours, use the mean method. So around 20% of the movies meet this criterion.

We calculated two important quantities from a boolean Series: its sum and its mean. These methods are possible because Python evaluates False and True as 0 and 1, respectively. You can prove to yourself that the mean of a boolean Series represents the percentage of True values. To do this, use the value_counts method with the normalize parameter set to True to get its distribution. This returns the proportion of False values and the proportion of True values.

Unfortunately, this result is misleading. The duration column has a few missing values. If you look back at the DataFrame output from our first step, you will see that the last row is missing a value for duration. Any comparison against a missing value evaluates to False, so those rows are counted as not being longer than two hours. We need to drop the missing values first, then evaluate the condition and take the mean. Dropping these missing values will allow us to calculate the correct statistic. So let's do it in one step through method chaining with the dropna method.

After this, we use the describe method to output a few summary statistics on the boolean Series. pandas treats boolean columns similarly to how it treats object data types, by displaying frequency information.
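The steps above can be sketched on a tiny stand-in Series. The durations below are invented, so the numbers differ from the 1039 movies and roughly 20% reported for the real dataset, and the variable names are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# Invented durations in minutes; NaN marks a missing value.
duration = pd.Series(
    [178, 95, 148, 110, np.nan],
    index=["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    name="duration",
)

# Comparison operator produces a boolean Series; NaN > 120 is False.
movie_2_hours = duration > 120

count_long = movie_2_hours.sum()     # movies longer than two hours
naive_share = movie_2_hours.mean()   # share, with missing rows counted as False

# Distribution of False and True values as proportions.
distribution = movie_2_hours.value_counts(normalize=True)

# Drop missing values first, then compare and take the mean, in one chain.
corrected_share = duration.dropna().gt(120).mean()

# describe reports frequency information for a boolean Series.
summary = movie_2_hours.describe()

print(count_long, naive_share, corrected_share)
print(distribution)
print(summary)
```

In this toy data the naive share is 2 out of 5, while the corrected share is 2 out of 4, which shows how a missing value quietly lowers the result when it is not dropped first.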
A frequency summary is a natural way to think about a boolean Series, rather than displaying quantiles as pandas does with numeric data.

It is also possible to compare two columns from the same DataFrame to produce a boolean Series. For instance, we could determine the percentage of movies in which actor 1 has more Facebook likes than actor 2. To do this, we would select both of these columns and then drop any rows that have a missing value in either column. Then we would make the comparison and calculate the mean. Here is the code for this; let's run it. In around 97% of movies, actor 1 has more Facebook likes than actor 2. Quite easy, isn't it?
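A minimal sketch of the two-column comparison follows. The column names actor_1_facebook_likes and actor_2_facebook_likes are assumed rather than taken from the transcript, and the values are invented, so the result will not match the roughly 97% seen on the real data.

```python
import numpy as np
import pandas as pd

# Invented like counts; the column names are assumptions for this sketch.
actors = pd.DataFrame({
    "actor_1_facebook_likes": [1000, 40000, 11000, 500, np.nan],
    "actor_2_facebook_likes": [936, 5000, 393, 700, 12],
})

# Drop rows where either column is missing before comparing.
complete = actors.dropna()

# Compare the two columns row by row, then take the mean of the booleans.
actor_1_more = (
    complete["actor_1_facebook_likes"] > complete["actor_2_facebook_likes"]
)
share_actor_1_more = actor_1_more.mean()

print(share_actor_1_more)
```

The comparison aligns the two columns row by row, so each row contributes a single True or False to the mean.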