Interview Questions for Data Analysts
Hello and welcome to Data Analytics Interview Questions. My name is Richard Kirschner with the Simplilearn team — that's www.simplylearn.com, get certified, get ahead. Today we're going to jump into some common questions you might see on NumPy arrays, pandas DataFrames, and Python, along with some Excel, Tableau, and SQL.

Let's start with our first question: what is the difference between data mining and data profiling? It's really important to note that data mining is the process of finding relevant information which has not been found before; it is the way raw data is turned into valuable information. You can think of this as anything from pulling sales stats out of a SQL server all the way to web scraping and Census Bureau information — where do you mine it from, where do you get all this data? Data profiling, on the other hand, is usually done to assess a data set for its uniqueness, consistency, and logic; it cannot identify incorrect or inaccurate data values. So when you're talking about data mining, you need to look at the integrity of what you're bringing in and where it's coming from, while data profiling is looking at the data and asking: how is this going to work, what's the logic, what's the consistency, is it related to what I'm working with?

Question two: define the terms data wrangling in data analytics. Data wrangling is the process of cleaning, structuring, and enriching raw data into a desired, usable format for better decision making, and you can see a nice chart here. We discover the data; we structure it how we want it; we clean it up and get rid of all those null values; we enrich it — we might reformat some of the fields, so instead of having five different terms for someone's height we clean that up, and we might do a calculation to bring some fields together; and then we validate — as I was just saying, make sure you have a solid data source. Then, of course, it goes into the analysis. It's very important to note that eighty percent of data analytics is usually in this wrangling part, getting the data to fit correctly. Don't confuse that with data cooking, which is what you do going into neural networks: scaling the data so it all lies between 0 and 1.

Question three: what are common problems that data analysts encounter during analysis? Handling duplicate and missing values; collecting the meaningful, right data at the right time; making data secure and dealing with compliance issues; and handling data purging and storage problems. Again, we're talking about data wrangling here — 80 percent of most jobs is wrangling that data, getting it in the right format, and making sure it's good data to use.

Question four: what are the various steps involved in any analytics project? One, understand the problem — we might spend 80 percent of our time wrangling, but you had better understand the problem first, because if you don't, you'll spend all your time heading in the wrong direction; this is probably the most important part of the process, and everything after it falls in behind. Two, data collection. Three, data cleaning. Four, data exploration and analysis. Five, interpret the results — a close second for most important, because if you can't interpret what you bring to the table for your clients, you're in trouble. So when this question comes up, you probably want to focus on those two, noting that 80 percent of the work is in steps two, three, and four, while one and five are the most important parts.

Question five: which technical tools have you used for analysis and presentation purposes? As a data analyst you are expected to know tools like SQL Server, MySQL, Excel, SPSS (the IBM platform), Tableau, and Python. Certainly a lot of jobs will be narrowed in on just a few of these — you're not going to run both a Microsoft SQL Server and a MySQL server — but you had better understand how to do basic SQL pulls, and understand Excel, the different column formats, and how to set those up.

Question six: what are the best practices for data cleaning? This is really important to go through in detail; it always comes up, because 80 percent of most data analysis is cleaning the data. Make a data cleaning plan by understanding where the common errors take place, and keep communications open. Identify and remove duplicates before working with the data; this will lead to an effective data analysis process. Focus on the accuracy of the data: maintain the value types, provide mandatory constraints, and set cross-field validation. Standardize the data at the point of entry so that it is less chaotic and all the information is consistent, leading to fewer errors on entry.

Question seven: how can you handle missing values in a data set? Listwise deletion: the entire record is excluded from analysis if any single value is missing. Remember, a record could be a single line in a database — if your SQL query comes back with 15 columns and one of them has a missing value, you might just drop the row to keep things easy, if you already have enough data for the processing. Average imputation: use the average value of the responses from the other participants to fill in the missing value — and they will ask you why these are useful, I guarantee it. Regression substitution: use multiple regression analysis to estimate the missing value; in a regression model you actually generate a prediction of what you think the value should be, based on the data you do have. Multiple imputation: create plausible values based on the correlations for the missing data, and then average the simulated data sets, incorporating random errors into your predictions.

Question eight: what do you understand by the term normal distribution? The second you hear "normal distribution," you should be thinking of a bell curve like the one shown here. A normal distribution is a type of continuous probability distribution that is symmetric about the mean, and on a graph it appears as a bell curve. The mean, median, and mode are equal — that's a quick way to check whether you have a normal distribution — and all of them are located at the center of the distribution. 68% of the data lies within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.

Question nine: what is time series analysis? Time series analysis is a statistical method that deals with an ordered sequence of values of a variable at equally spaced time intervals. Here we have time series data on COVID-19 cases, spaced by days, for a couple of different countries including the United States; if we graph it, a time series chart always looks really nice. It's time sensitive: the next result depends on what the last one was, and COVID-19 is an excellent example. Any time you do word analytics, where what someone said before makes a huge difference to what they'll say next, that's another form of time series analysis.

Question ten: how is joining different from blending in Tableau? Now we jump into the Tableau package. Data joining can only be done when the data comes from the same source — combining two tables from the same database, or two or more worksheets from the same Excel file — and all the combined tables or sheets contain a common set of dimensions and measures. Data blending is used when the data comes from two or more different sources — combining an Oracle table with a SQL Server table, two sheets from Excel, or an Excel sheet with an Oracle table — and in data blending, each data source contains its own set of dimensions and measures.

Question eleven: how is overfitting different from underfitting? Always a good one — overfitting is probably the biggest danger in data analytics today. An overfit model trains on the data too well using the training set, and its performance drops significantly on the test set; it happens when the model learns the noise and random fluctuations in the training data in detail. An underfit model neither trains on the data well nor generalizes to new data, performing poorly on both the training and test sets; it happens when there is too little data to build an accurate model, or when we try to fit a linear model to non-linear data.

Question twelve: in Microsoft Excel, a numeric value can be treated as a text value if it is preceded by an apostrophe — definitely not an exclamation mark, and if you're used to programming in Python, don't go looking for the hash symbol or an ampersand. As you can see here, if you enter the value 10 into a cell but put an apostrophe in front of it, it will be read as text, not as a number.

Question thirteen: what is the difference between COUNT, COUNTA, COUNTBLANK, and COUNTIF in Excel? You can see that when we run COUNT(D1:D23) we get 19, and you'll notice there are 19 numbers coming down the column — a straight COUNT doesn't count the "Cost of Each" header at the top, and it doesn't count the blank cells either. When you do COUNTA, you get the answer 20.
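If you spend more time in pandas than in Excel, the same four counting behaviors can be mimicked on a column. This is only a sketch with made-up data (the column values and variable names are invented, not the worksheet from the example):

```python
import pandas as pd

# Invented stand-in for a worksheet column: a header-like string,
# some numbers, and a couple of blanks (None -> NaN).
col = pd.Series(["Cost of Each", 10, 25, None, 7, 42, None], dtype=object)

# Coerce to numeric: text becomes NaN, so only real numbers survive.
nums = pd.to_numeric(col, errors="coerce")

count_numeric = nums.notna().sum()   # like COUNT: numbers only -> 4
count_nonblank = col.notna().sum()   # like COUNTA: text counts too -> 5
count_blank = col.isna().sum()       # like COUNTBLANK: empty cells -> 2
count_if = (nums > 10).sum()         # like COUNTIF(...,">10") -> 2

print(count_numeric, count_nonblank, count_blank, count_if)
```

The key difference carries over directly: COUNT only sees numbers, COUNTA sees anything non-empty, and COUNTIF applies a condition.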
So COUNTA counts all non-empty cells, even the "Cost of Each" title. COUNTBLANK returns 3 — why? Because there are three blank fields. And finally COUNTIF: if we do COUNTIF(E1:E23, ">10"), there are 11 values greater than 10. It's basic counting of whatever is in your column; the table lays it out pretty solidly.

Question fourteen: explain how VLOOKUP works in Excel. VLOOKUP is used when you need to find things in a table or a range by row. The syntax has four parts: the lookup value (the value you want to look up), the table array (the range where the lookup value is located), the column index number (the number of the column in that range that contains the return value), and the range lookup (TRUE for an approximate match, FALSE for an exact match). Here we see VLOOKUP(F3, A2:C8, 2, 0) for Prince. The sheet doesn't show F3 directly, but F3 is the cell that "Prince" is in — that's where the lookup value comes from; A2:C8 is the data we're looking into; and 2 is the column within that data — we count Name as column 1 and Age as column 2. Keep in mind this is Excel, where columns are counted 1, 2, 3, versus Python and most programming languages, where you start at 0. The 0 at the end means FALSE, an exact match, versus 1 for an approximate match — in this example either would work, so we don't need to worry about it too much. With the Angela lookup, her name is in the F column — that's what the F4 stands for, where "Angela" was pulled from — then the range is A1:C8, and the column index is 3: Name is column 1, Age is column 2, and Height is column 3, so it pulls in her height, 5.8.

Question fifteen: now we jump over to SQL — how do you subset or filter data in SQL? To subset or filter data in SQL, we use the WHERE and HAVING clauses. We have a nice table on the left with title, director, year, and duration, and we want to filter the table for movies directed by Brad Bird — why? Just because we want to know what Brad Bird did. So we write SELECT * FROM movies WHERE director = 'Brad Bird'. You should know that the star refers to all columns — what are we going to return? All of title, director, year, and duration — from movies, our table, where the director equals Brad Bird. And back it comes: he did The Incredibles and Ratatouille.

We can also filter aggregated data. Filter the table for directors whose movies have an average duration greater than 115 minutes — there are a lot of really cool things in this SQL query, and these queries can get pretty crazy: SELECT director, SUM(duration) AS total_duration, AVG(duration) AS average_duration FROM movies GROUP BY director HAVING average_duration > 115. So again, what do we return? Whatever we put in our SELECT: the director, the total duration (the sum of the durations), and the average duration. Then, of course, we group by director, and we keep only the groups having an average duration greater than 115.
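Both filters are easy to try from Python with an in-memory SQLite database. The movie rows below are invented stand-ins for the table in the example, and the aggregate is repeated inside HAVING for portability across SQL engines:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, director TEXT, year INT, duration INT)")
# Invented rows standing in for the table from the example.
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?, ?)",
    [
        ("The Incredibles", "Brad Bird", 2004, 116),
        ("Ratatouille", "Brad Bird", 2007, 111),
        ("Cars", "John Lasseter", 2006, 117),
        ("A Bug's Life", "John Lasseter", 1998, 120),
    ],
)

# WHERE filters individual rows before any grouping happens.
rows = conn.execute(
    "SELECT * FROM movies WHERE director = 'Brad Bird'"
).fetchall()
print(rows)  # both Brad Bird films

# HAVING filters whole groups after aggregation.
groups = conn.execute(
    "SELECT director, SUM(duration), AVG(duration) FROM movies "
    "GROUP BY director HAVING AVG(duration) > 115"
).fetchall()
print(groups)
```

With this made-up data, Brad Bird's average duration is 113.5, so his group is filtered out by HAVING even though both of his rows pass the WHERE example.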
These SQL queries are so important — I can't tell you how many times SQL comes up, and not just MySQL or Microsoft SQL Server; the SQL language appears in so many other places, especially with Hadoop and other areas. So you really should know your basic SQL, and it doesn't hurt to keep a little cheat sheet handy and glance over it to double-check the different features.

Question sixteen: what is the difference between the WHERE and HAVING clauses in SQL? The WHERE clause works on row data: the filter occurs before any groupings are made, and aggregate functions cannot be used. The syntax is SELECT columns FROM table WHERE condition. The HAVING clause works on aggregated data: HAVING is used to filter values from a group, and aggregate functions can be used. The syntax is SELECT column_names FROM table WHERE condition GROUP BY columns HAVING condition ORDER BY column_names.

Question seventeen: what is the correct syntax for the reshape function in NumPy? Jumping to NumPy, the syntax is numpy.reshape(array, new_shape) — a lot of the time you'll do import numpy as np and call it through np. In the example here, we reshape the array a into shape (2, 5), and the printout shows two rows with five values in each.

Question eighteen: what are the different ways to create a DataFrame in pandas? One way is by initializing a list: import pandas as pd, set data = [['Tom', 30], ['Jerry', 20], ['Angela', 35]], and then call pd.DataFrame(data, columns=['Name', 'Age']). You can designate your columns, and you can also set an index — always remember that; maybe you want the index to be the date they signed up instead of 0, 1, 2, or who knows what. That generates a nice pandas DataFrame with Tom, Jerry, and Angela. Another way is to initialize the DataFrame from a dictionary: here we have a dictionary where Name is Tom, Jerry, Angela, Mary and Age is 20, 21, 19, 18, and df = pd.DataFrame(data) gives the same kind of setup with your names and ages.

Question nineteen: write the Python code to create an employees DataFrame from the emp.csv file and display its head and summary. To create the DataFrame, import the pandas library and use the read_csv function to load the file: import pandas as pd, then employees = pd.read_csv() with the path to that CSV file. There are a number of settings in read_csv — you can tell it which row is the header, set the columns, skip rows, all kinds of things worth double-checking in the read_csv documentation — but the most basic call just reads a basic CSV.

Question twenty: how will you select the department and age columns from an employees DataFrame? With import pandas as pd, we create our employees DataFrame on the left, and on the right, to select department and age, we do employees[['department', 'age']] with the brackets around it. If you're selecting just one column, you could do employees['department'], but for multiple columns you need a second set of brackets — it has to be a list within the selection brackets.

Question twenty-one: what are the criteria to say whether a developed data model is good or not? A good model should be intuitive, insightful, and self-explanatory — follow the old saying KISS: keep it simple. The model developed should be easily consumed by clients for actionable and profitable results; if they can't read it, what good is it? A good model should easily adapt to changes according to business requirements — we live in quite a dynamic world nowadays, so that's pretty self-evident. And if the data gets updated, the model should scale to the new data — ideally you have a nice data pipeline going, so that when new data comes in, you don't have to rewrite all the code.

Question twenty-two: what is the significance of exploratory data analysis? Exploratory data analysis (EDA) is an important step in any data analysis process. It helps you understand the data better; it helps you gain confidence in your data to the point where you're ready to engage a machine learning algorithm; it allows you to refine your selection of feature variables for later model building; and it lets you discover hidden trends and insights in the data.

Question twenty-three: how do you treat outliers in a data set? An outlier is a data point that is distant from other similar points; outliers may be due to variability in the measurement, or may indicate experimental errors. One, you can drop the outlier records — pretty straightforward. Two, you can cap your outlier data so it doesn't go past a certain value. Three, you can assign the outlier a new value. Four, you can try a new transformation and see whether the outliers come back into line when the data is transformed slightly differently.

Question twenty-four: explain descriptive, predictive, and prescriptive analytics. Descriptive analytics provides insights into the past to answer "what has happened," using data aggregation and data mining techniques — for example, an ice cream company can analyze how much ice cream was sold, which flavors were sold, and whether more or less ice cream was sold than before. Predictive analytics looks at the future to answer "what could happen," using statistical models and forecasting techniques — for example, predicting the sale of ice cream during summer, spring, and rainy days. This is always interesting, because businesses are always looking to know what happened — hey, did we have good sales last quarter, and what are we expecting next quarter? Then we have a huge jump with prescriptive analytics, which suggests various courses of action to answer "what should you do," using optimization and simulation algorithms to advise on possible outcomes — for example, lowering prices to increase sales of ice cream, or producing more or less of certain flavors. We can certainly relate this to today's world with the COVID-19 virus, which we had on our earlier graph: descriptive is what has happened — how many people have been infected, how many have died in an area; predictive is where we predict it will go — do we see it getting worse or better, how many hospital beds do we predict we'll need; and prescriptive is what we can change to get a better outcome — maybe more social distancing, maybe tracking the virus — how do these things affect the outcome, and can we create a better ending by changing some underlying criteria?

Question twenty-five: what are the different types of sampling techniques used by data analysts? Sampling is a statistical method of selecting a subset of data from an entire data set — the population — to estimate the characteristics of the whole population. One, simple random sampling: just pick out, say, 500 random people in the United States to sample. (In statistics we call the whole group a population; in regular data work we use the same term, since it came mainly from census work.) Two, systematic sampling: use a very systematic approach for pulling samples, like taking every 5th, 10th, 15th, and 20th record. Three, cluster sampling: some things just naturally group together — with population data, a nice way of looking at this is clustering by zip code; we take everybody by zip code and let it cluster naturally that way. Four, stratified sampling: look for something the group shares, like income — if you're studying poverty, you might group people by income first and then study the individuals within each income group to find out what traits they have. Five, judgmental or purposive sampling: the researcher very carefully selects each member of the sample based on their own personal knowledge.

Jumping on to question twenty-six: what are the different types of hypothesis testing? Hypothesis testing is a procedure used by statisticians and scientists to accept or reject a statistical hypothesis. We start with two hypotheses. The null hypothesis states that there is no relation between the predictor and the outcome variables in the population; it is denoted H0 — for example, there is no association between a patient's BMI and diabetes. The alternative hypothesis states that there is some relation between the predictor and outcome variables in the population; it is denoted H1 — for example, there could be an association between a patient's BMI and diabetes. (That's body mass index, if you didn't catch "BMI" and you're not in medicine.)

Question twenty-seven: describe univariate, bivariate, and multivariate analysis. Univariate analysis is the simplest form of data analysis, where the data being analyzed contains only one variable — for example, studying the heights of players in the NBA. Because it's so simple, it can be described using central tendencies, dispersion, quartiles, bar charts, histograms, pie charts, and frequency distribution tables. Bivariate analysis involves two variables, to find causes, relationships, and correlations between them — for example, analyzing ice cream sales based on the outside temperature. It can be explained using correlation coefficients, linear regression, logistic regression, scatter plots, and box plots. Multivariate analysis involves three or more variables, to understand the relationship of each variable with the others — for example, analyzing revenue based on expenditure: with TV ads, newspaper ads, social media ads, and revenue, we can compare all of those together. Multivariate analysis can be performed using multiple regression, factor analysis, classification and regression trees, cluster analysis, principal component analysis, clustering, bar charts, and dual-axis charts.

Question twenty-eight: what function would you use to get the current date and time in Excel? In Excel you can use the TODAY and NOW functions, as in the two examples shown down here: just =TODAY() or =NOW().

Question twenty-nine: using the SUMIFS function in Excel, find the total quantity sold by sales representatives whose names start with A, where the cost of each item they sold is greater than 10.
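Before walking through the worksheet, here is how the same conditional sum could be sketched in pandas. The column names and values are invented stand-ins, not the actual spreadsheet from the example:

```python
import pandas as pd

# Invented stand-in for the worksheet: rep names, cost of each item,
# and quantity sold.
df = pd.DataFrame({
    "rep": ["Alice", "Andrew", "Bob", "Anna", "Carl"],
    "cost": [12.0, 8.0, 15.0, 20.0, 11.0],
    "quantity": [5, 4, 7, 3, 6],
})

# SUMIFS analog: build each condition as a boolean mask, AND them
# together, and sum the matching quantities.
mask = df["rep"].str.startswith("A") & (df["cost"] > 10)
total = df.loc[mask, "quantity"].sum()
print(total)  # Alice (5) + Anna (3) = 8
```

The idea is the same as SUMIFS: one sum range plus any number of criteria, each of which becomes a boolean condition.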
And you can see here on the left we have our actual table, and then we go ahead with SUMIFS: we sum E2:E20 with criteria over B2:B20 and the "greater than 10" condition. This basically says: take everything in the E column (the quantities) and sum it up, but only for the rows where the rep's name starts with A and the cost column is greater than 10 — that's what it means there.

Question thirty: is the below query correct? If not, how will you rectify it? SELECT customer_id, YEAR(order_date) AS order_year FROM orders WHERE order_year >= 2016. Hopefully you caught it right there — the devil's in the details: we cannot use the alias name while filtering data in the WHERE clause. The correct format is the same except the WHERE clause becomes WHERE YEAR(order_date) >= 2016, instead of using the order_year alias we assigned in the SELECT.

Question thirty-one: how are UNION, INTERSECT, and EXCEPT used in SQL? The UNION operator combines the results of two or more SELECT statements: SELECT * FROM region1 UNION SELECT * FROM region2 basically takes both tables and combines them to form one full new result — that's your union, bringing everything together. The INTERSECT operator returns the common records from two or more SELECT statements: SELECT * FROM region1 INTERSECT SELECT * FROM region2 returns only the records that are shared, that have the same data in them. And — hopefully you jumped ahead to it — the EXCEPT operator returns the uncommon records from two or more SELECT statements: the records that are not shared between the two tables.

Question thirty-two: using the product price table, write an SQL query to find the record with the fourth-highest market price. Here we have a bit of a brain teaser — they're always fun. Looking at the script on the left, we really want the fourth one down, so first we SELECT TOP 4 FROM product_price ORDER BY market_price DESC, which gives us the four greatest values. Then we reverse the order: from that result we take TOP 1 ORDER BY market_price ASC, which gives us the lowest of those four — which is the fourth-greatest value in the list.

Question thirty-three: from the product price table, find the total and average market price for each currency where the average market price is greater than 100 and the currency is INR or AUD — Indian rupees or Australian dollars. You can see the SQL query over here; if you have trouble putting it together, you might actually build some of it in reverse, and right here you can see the condition on the average market price.
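The fourth-highest trick is easy to sanity-check from Python. SQLite doesn't support TOP, but LIMIT with OFFSET expresses the same idea directly (sort descending, skip three); the table contents here are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_price (product TEXT, market_price REAL)")
# Invented prices standing in for the product price table.
conn.executemany(
    "INSERT INTO product_price VALUES (?, ?)",
    [("A", 50.0), ("B", 200.0), ("C", 120.0), ("D", 90.0), ("E", 300.0)],
)

# Sort descending and skip the top three rows: what's left first
# is the record with the fourth-highest market price.
row = conn.execute(
    "SELECT product, market_price FROM product_price "
    "ORDER BY market_price DESC LIMIT 1 OFFSET 3"
).fetchone()
print(row)
```

On engines with TOP, the nested version from the walkthrough (TOP 4 descending, then TOP 1 ascending) produces the same record.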
Remember, we use HAVING rather than WHERE at the end, because the filter applies to the group: GROUP BY currency, because we want those two currencies, and the currency has to be in INR or AUD. Keep working backwards and you get the SELECT: we select the currency, SUM(market_price) AS total_price, and AVG(market_price) AS average_price — there's our select — from product_price, which is just our table over there, with the WHERE clause restricting the currency using IN. Like I said, you can put it together however you want, but hopefully you got to the end there.

Question thirty-four: this question tests your knowledge of Tableau — exploring its different features and creating a suitable graph to solve a business problem. Tableau is very visual in its use, so it's very hard to test without actually getting hands-on; if you can't visualize some of this and how to do it, you should go back and refresh yourself. Using the Sample Superstore data set, create a view to analyze the sales, profits, and quantities sold across different sub-categories of items under each category. The first step is to load the Sample Superstore data set — make sure you know how to load it; it's under either the Connect button in the upper left or the Tableau icon up there. Once you've done that, drag Category and Sub-Category onto Rows and Sales onto Columns, which results in a horizontal bar chart. Then drag Profit onto Color and Quantity onto Label, and sort the sales axis in descending order of the sum of sales within each sub-category. If you're at home doing this, you'll see that Chairs, under the Furniture category, had the highest sales and profit while Tables had the lowest profit; for the Office Supplies category, Binders made the highest profit even though Storage had the highest sales; and under the Technology category, Copiers made the highest profit even though they had the least sales.

Question thirty-five: let's work to create a dual-axis chart in Tableau to present sales and profits across different years, using the Sample Superstore data set. Load the Orders sheet from the data set, drag the Order Date field from Dimensions onto Columns, and convert it to continuous Month. Drag Sales onto Rows, and drag Profit to the right corner of the view until you see a light green rectangle — this is one of those things where, if you haven't done it hands-on, you won't know what you're doing; you'll just be dropping it somewhere and wondering what happened. Synchronize the right axis by right-clicking on the Profit axis, then finalize it under the Marks card: change SUM(Sales) to Bar and SUM(Profit) to Line, and adjust the size. Then we have a nice display we can either print out or save and send off to the shareholders.

Question thirty-six: let's do one more Tableau one — design a view in Tableau to show state-wise sales and profits using the Sample Superstore data set. Here you drag the Country field onto the view section and expand it to see the states, drag the States field onto Size and Profit onto Color, increase the size of the bubbles, and add a border and a halo color. States like Washington, California, and New York have the highest sales and profits, while Texas, Pennsylvania, and Ohio have a good amount of sales but the least profit.

Question thirty-seven: we'll skip back to Python and NumPy. Suppose there is an array number = np.array(...) — np for NumPy, depending on how you set up the import — holding 1 through 9 broken into three rows. Extract the value 8 using 2D indexing. On the left we have import numpy as np and number equals our np.array; if we print the number, we see one through nine. Since the value 8 sits at row index 2, column index 1, we use that index position and pass it to the array: number[2, 1] returns 8. And remember, we're in Python, so indexing starts at 0, not 1 like it does in Excel — that always gets me when I'm working between Excel and Python, where I just kind of flip; usually it's Excel I mess up, because I do a lot more programming.

Question thirty-eight: suppose there's an array that has the values 0, 1, all the way up to 9.
How will you display the following values from the array: 1 3 5 7 9? First we create the array with np.arange(10), which goes from 0 to 9 — there are 10 numbers in it, but we don't include the 10 — and we print it out. So what's going on with 1 3 5 7 9? Well, if we divide by 2 there's going to be a remainder equal to 1, and from Python we remember that the percent sign gives you the remainder. So the remainder is 1, and with your numpy array we just use a logical statement selecting all values that have a remainder of 1, and that generates our nice 1 3 5 7 9. There are two arrays, a and b: stack the arrays a and b horizontally. Boy, these horizontal/vertical questions will get you every time. In numpy we've created two different arrays, a and b. The first approach is np.concatenate((a, b), axis=1), and that is the same as np.hstack((a, b)) — in the back end they're identical and they run the same; for 2-D arrays, all hstack is is a concatenate with axis=1. How can you add a column to a pandas DataFrame? Suppose there's an emp DataFrame that has information about a few employees; let's add an Address column to that DataFrame. You can see on the left we have our basic DataFrame — you should know your DataFrames very well; one basically looks like an Excel spreadsheet. It's really simple: once you've assigned values to the address list, you just do df['Address'] = address. Using the below given data, create a pivot table to find the total sales made by each sales representative for each item, and display the sales as a percentage of the grand total. So we're in Excel: select the entire table range, click on the Insert tab and choose PivotTable, then select the table range and the worksheet where you want to place the pivot table, and it will return a pivot table where you can analyze your data. Drag Sale Total onto Values and Sales Rep and Item onto Row Labels; it will give the sum of the sales made by each representative for each item they have sold. Finally, right-click on Sum of Sale Total and expand Show Values As to select % of Grand Total. It's really important just to understand what a pivot table is — we're pivoting the data between rows and columns and switching direction — and finally we have our final pivot table, where you can see the row labels and the sum of total sales. Now we're going to take a product table — this is SQL, so we can do some SQL here — and we're going to use the Product and SalesOrderDetail tables: find the products that have total units sold greater than 1.5 million. Here's our SalesOrderDetail table, so we have a Product table and a SalesOrderDetail table, two separate tables in the database, and what we're going to do is put together the SQL query. We want SELECT pp.Name, SUM(sod.UnitPrice) AS sales, pp.ProductID FROM Production.Product AS pp INNER JOIN SalesOrderDetail AS sod ON pp.ProductID = sod.ProductID GROUP BY pp.Name, pp.ProductID HAVING SUM(sod.UnitPrice) > 1500000. That's a mouthful, and again, these SQL queries start looking really crazy until you break them apart and do them step by step. What they're really looking for is the inner join and how you do the GROUP BY — this comes up so much in SQL: how do you pull in the ID from one table and the information from another table, plus the sum totals on that table? How do you write a stored procedure in SQL? Let's create a stored procedure to find the sum of the squares of the first n natural numbers. Here we have our formula, n times (n plus 1) times (2n plus 1) over 6, and from the command prompt, or whatever setup you have depending on your login, the commands are CREATE PROCEDURE SquareSum1, declare our variable @n as an integer, AS BEGIN, then we declare @sum as an integer and SET @sum = @n * (@n + 1) * (2 * @n + 1) / 6.
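Before the procedure prints its output, it's worth noting you can sanity-check the closed-form identity it relies on. Here's a quick sketch in Python rather than T-SQL, just to verify the math behind the stored procedure:

```python
# Check the closed-form identity used by the stored procedure:
# 1^2 + 2^2 + ... + n^2 == n * (n + 1) * (2n + 1) / 6
def sum_of_squares(n: int) -> int:
    """Sum of squares of the first n natural numbers, via the formula."""
    return n * (n + 1) * (2 * n + 1) // 6

# Brute-force comparison for a range of n values.
for n in range(1, 21):
    assert sum_of_squares(n) == sum(i * i for i in range(1, n + 1))

print(sum_of_squares(4))  # the transcript's example: first 4 naturals -> 30
```

The same arithmetic is what the T-SQL `SET @sum = ...` line computes.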
And then of course we can print those out: PRINT 'first ' + CAST(@n AS VARCHAR(20)) + ' natural numbers', then PRINT 'the sum of the squares is ' + CAST(@sum AS VARCHAR(40)), and END. Then we do the output — display the sum of the squares for the first four natural numbers. We have EXECUTE SquareSum1, we pass in 4, and you can see it brings up: the first 4 natural numbers' sum of squares is 30. Write a stored procedure to find the total even numbers between two user-given numbers. A couple of things to note here. First we create our procedure with its two variables, @n1 and @n2, and BEGIN. We declare our variable @count as an integer and set @count = 0, and then, WHILE @n1 is less than @n2, we BEGIN, and IF @n1 % 2 = 0 — we're dividing by 2, so an even number — we BEGIN, set @count = @count + 1, and PRINT 'even number ' + CAST(@n1 AS VARCHAR(10)) + ' count is ' + CAST(@count AS VARCHAR(10)), END, ELSE PRINT 'odd number ' + CAST(@n1 AS VARCHAR(10)). Then we increment @n1 by one so it goes from @n1 all the way up to @n2, and at the end we print the total number of even numbers. You can see here we executed it to count the even numbers between 30 and 45, and the total comes out to 8. What is the difference between treemaps and heat maps in Tableau? Now, if you've worked in Python or other programming you should automatically know what a heat map is, but treemaps are used to display data in nested rectangles: you use dimensions to define the structure of the treemap and measures to define the size or color of the individual rectangles. Treemaps are a relatively simple data visualization that can provide insight in a visually attractive format, and you can see the rectangles over here — this is our treemap, and each block also carries information inside its different sub-blocks. A heat map helps you visualize measures against dimensions with the help of colors and size, to compare one or more dimensions and up to two measures. The layout is similar to a text table, with variations in values encoded as colors. In a heat map you can quickly see a wide array of information, and in this one you can see they use the color to denote one thing and the size of the little square to denote something else — a lot of times you can even graph this into a three-dimensional graph with other data so it pops out. But again, a heat map is the color and the size. Using the Sample Superstore dataset, display the top five and bottom five customers based on their profit. You start by dragging the Customer Name field onto Rows and Profit onto Columns. Right-click on the Customer Name column to create a set, give a name to the set, and select the Top tab to choose the top 5 customers by sum of profit. Similarly, create a set for the bottom 5 customers by sum of profit. Select both sets and right-click to create a combined set; give a name to the set and choose all members in both sets. Then you can drag the top and bottom customers set onto Filters and the Profit field onto Color to get the desired results. As we get down to the end of our list we're going to try to keep you on your toes, so we're going to skip back to numpy: how do you print four random integers between 1 and 15 using numpy? To generate random integers with numpy we use the random.randint function. You can see here we did import numpy as np, then randomArray = np.random.randint(1, 15, 4). From the below DataFrame — we can jump again on you, now we're into pandas — how will you find the unique values for each column, and subset the data for Age less than 35 and Height greater than 6?
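A quick aside on that random-integer call: the actual numpy function is np.random.randint, and its upper bound is exclusive, so randint(1, 15, 4) never produces 15 itself — pass 16 if 15 should be included. A minimal sketch:

```python
import numpy as np

# Draw four random integers from [1, 15) -- the upper bound is exclusive,
# so this produces values from 1 through 14.
random_array = np.random.randint(1, 15, 4)
print(random_array)  # e.g. [ 3 11  7 14] -- values vary from run to run
```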
To find the unique values and the number of unique elements, use the unique() and nunique() functions. You see here we just did df['Height'] — we're selecting just the Height column — and we call unique(), which returns an array, whereas nunique() on Height or Age returns just the number of unique values. Then we can subset the data for Age less than 35 and Height greater than 6. If we look over here we have a new df — remember, this takes slices of our original DataFrame; it doesn't actually change the DataFrame — so new_df equals the DataFrame where Age is less than 35 and Height is greater than 6: df[(df['Age'] < 35) & (df['Height'] > 6)]. And in case you're not using Tableau, which has a lot of its own mapping programs, make sure you understand the basics of matplotlib: plot a sine graph using numpy and matplotlib in Python. The way we did this is we generate an x, and we know our y equals np.sin(x) — if you print out x you'll see the whole array of values. We import matplotlib.pyplot as plt. If you are working in a Jupyter notebook, make sure you understand %matplotlib inline — that little percent-sign magic renders the plot on the page in the Jupyter notebook. The newer versions of Jupyter Notebook or JupyterLab do that for you automatically, but I usually put it in there just in case I end up on an older version. If you print y you can see our different y values against our different x values; you simply put in plt.plot(x, y) and do plt.show(). And before we go, let's get one more in — we're going to do pandas. Using the below pandas DataFrame, find the company with the highest average sales, derive the summary statistics for the Sales column, and transpose those statistics. That's a mouthful, and just like any of these computer problems, break it apart. First of all we're looking for the highest average sales, so group the Company column and use the mean function to find the average sales — you see here by_company = df.groupby('Company').mean(). Once we've done that, using the describe function we can look at the summary statistics — use the describe function to find the summary. By company those are the groups, so we're just going to describe them, and you could actually bundle those together if you wanted and do it all in one line. So here we go: by_company.describe(), and you can see we have a nice breakout. It's always good to remember, whether you're using Tableau, pandas in Python, or even R or some other package, that being able to quickly look at and describe your data is very important. Then we just apply the transpose function over the describe method to transpose the statistics. All we've done here is flip the index with the column names, but a lot of times it's easier to follow the numbers across one line, or maybe you want to average out the count — there are all kinds of reasons to do that. Well, that wraps it up. I want to thank you for joining us today. I hope you're ready for those data analytics interview questions coming your way and that great job coming right down the line for you. You can always contact us for more information and visit www.simplylearn.com. Again, my name is Richard Kirschner with the Simply Learn team — get certified, get ahead. Hi there, if you like this video, subscribe to the Simply Learn YouTube channel and click here to watch similar videos. Turn it up and get certified — click here.
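Before we leave that last pandas question, the whole groupby/describe/transpose sequence can be bundled into one runnable sketch — the company names and sales figures here are invented stand-ins, since the transcript's DataFrame isn't fully shown:

```python
import pandas as pd

# Invented stand-in data for the transcript's sales DataFrame.
df = pd.DataFrame({
    "Company": ["A", "A", "B", "B"],
    "Sales": [100, 150, 80, 120],
})

# 1. Average sales per company, and the company with the highest average.
by_company = df.groupby("Company")["Sales"].mean()
best = by_company.idxmax()

# 2. Summary statistics for the Sales column.
summary = df["Sales"].describe()

# 3. Transpose the statistics (flip index and columns) with .T
transposed = df.describe().T

print(best)             # company with the highest average sales
print(summary["mean"])  # overall mean of the Sales column
```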
7 percent of the data lies within three standard deviations of the mean. What is time series analysis? Time series analysis is a statistical method that deals with an ordered sequence of values of a variable at equally spaced time intervals. Here's time series data on COVID-19 cases, and you can see we're looking at it by day — our spacing is days. If we graph it, a time series graph always looks really nice; in this case they picked a couple of different countries, with the United States among them. And it is time-sensitive: the next result is based on what the last one was. COVID is an excellent example of this, and any time you do word analytics, where you're figuring out what someone's saying, what they said before makes a huge difference in what they're going to say next — another form of time series analysis. How is joining different from blending in Tableau? So now we're going to jump into the Tableau package. Data joining: data joining can only be done when the data comes from the same source — combining two tables from the same database, or two or more worksheets from the same Excel file. All the combined tables or sheets contain a common set of dimensions and measures. Data blending: data blending is used when the data comes from two or more different sources — combining an Oracle table with a SQL Server table, or two sheets from Excel, or combining an Excel sheet and an Oracle table. In data blending, each data source contains its own set of dimensions and measures. How is overfitting different from underfitting? Always a good one. Overfitting is probably the biggest danger in data analytics today: the model trains on the data too well using the training set, and the performance drops significantly over the test set. It happens when the model learns the noise and random fluctuations in the training data set in detail, and again, the performance on the test set drops way below what it was on the training set. With underfitting, the model neither trains on the data well nor can generalize to new data; it performs poorly on both the train and the test set. It happens when there is too little data to build an accurate model, and also when we try to fit a linear model to non-linear data. In Microsoft Excel, a numeric value can be treated as a text value if it is preceded by an apostrophe — definitely not an exclamation mark, and if you're used to programming in Python, you'll look for that hash sign; it's not an ampersand either. We can see here that if you enter the value 10 into a cell but put an apostrophe in front of it, it will be read as text, not as a number. What is the difference between COUNT, COUNTA, COUNTBLANK, and COUNTIF in Excel? We can see here that when we run just COUNT(D1:D23) we get 19, and you'll notice there are 19 numbers coming down here — the straight COUNT doesn't count the 'Cost of each' header at the top, and it doesn't count the blank cells either. When you do COUNTA, you'll get the answer 20.
So now, when you do COUNTA it counts all of them, even the 'Cost of each' title. When you do COUNTBLANK we get 3 — why? There are three blank cells. And finally COUNTIF: if we do COUNTIF(E1:E23, ">10") there are 11 values in there. Basic counting of whatever's in your column — be pretty solid on that. Explain how VLOOKUP works in Excel. VLOOKUP is used when you need to find things in a table or a range by row. The syntax has four parts: the lookup value, the value you want to look up; the table array, the range where the lookup value is located; the column index number, the number of the column in the range that contains the return value; and the range lookup — specify TRUE if you want an approximate match or FALSE if you want an exact match of the return value. So here we see VLOOKUP(F3, A2:C8, 2, 0) for Prince. Now, they don't show F3 — F3 is the actual cell the name Prince is in; that's what we're looking up. A2:C8 is the data we're looking into, and then 2 is the column in that data — in this case we're looking for age, and we count Name as 1, so Age is 2. Keep in mind this is Excel, versus a lot of your Python and programming languages where you start at 0 — in Excel we always count the columns as 1, 2, 3. And 0 is FALSE, for an exact match, versus 1 for an approximate match; we don't actually need to worry about that too much here, since zero or one would both work with this example. You can see with the Angela lookup, her name is in the F column, row 4 — that's what the F4 stands for, where they pulled Angela from — then A1:C8, and then we're looking at column 3, Name being 1, Age 2, and Height 3, and you'll see it pulls in her height, 5.8. So we're going to jump over to SQL: how do you subset or filter data in SQL? To subset or filter data in SQL we use the WHERE and HAVING clauses. You can see we have a nice table on the left with the title, the director, the year, and the duration, and we want to filter the table for movies that were directed by Brad Bird — why? Just because we want to know what Brad Bird did. So we do SELECT * — you should know that the star refers to all; in this case we're returning all of title, director, year, and duration — FROM movies, movies being our table, WHERE director = 'Brad Bird', and you can see it comes back with The Incredibles and Ratatouille. To subset or filter data in SQL we can also use the HAVING clause, so let's take a closer look at the different ways we can filter: filter the table for directors whose movies have an average duration greater than 115 minutes. There are a lot of really cool things in this SQL query, and these SQL queries can get pretty crazy: SELECT director, SUM(duration) AS total_duration, AVG(duration) AS average_duration FROM movies GROUP BY director HAVING average_duration > 115. So again, what are we going to return? Whatever we put in our SELECT — which in this case is director; total_duration, the sum of the duration; and average_duration, the average duration. Then of course we GROUP BY director, and we keep only the groups having an average duration greater than 115.
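The WHERE-versus-HAVING distinction above is easy to reproduce end to end. Here's a sketch using Python's built-in sqlite3 module with invented movie rows — the titles and durations are stand-ins, not the transcript's actual table:

```python
import sqlite3

# Recreate the movies example in an in-memory SQLite database.
# Titles and durations are invented stand-ins for the transcript's table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, director TEXT, duration INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("The Incredibles", "Brad Bird", 125),
        ("Ratatouille", "Brad Bird", 111),
        ("Up", "Pete Docter", 96),
        ("WALL-E", "Andrew Stanton", 98),
    ],
)

# WHERE filters individual rows, before any grouping happens.
brad = conn.execute(
    "SELECT title FROM movies WHERE director = 'Brad Bird'"
).fetchall()

# HAVING filters whole groups, after GROUP BY has aggregated them.
long_avg = conn.execute(
    "SELECT director, AVG(duration) AS average_duration "
    "FROM movies GROUP BY director HAVING average_duration > 115"
).fetchall()

print(brad)      # both Brad Bird titles
print(long_avg)  # only directors whose average duration exceeds 115
```

With these made-up durations, Brad Bird's average is 118, so he is the only director surviving the HAVING filter.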
These SQL queries are so important — I don't know how many times SQL comes up, and there are so many different flavors beyond MySQL and Microsoft SQL Server; the SQL language comes up all over, especially with Hadoop and other areas, so you really should know your basic SQL. It doesn't hurt to get a little cheat sheet and glance over it to double-check some of the different features in SQL. What is the difference between the WHERE and HAVING clauses in SQL? The WHERE clause works on row data: in the WHERE clause the filter occurs before any groupings are made, and aggregate functions cannot be used. The syntax is SELECT columns FROM table WHERE condition. The HAVING clause works on aggregated data: HAVING is used to filter values from a group, and aggregate functions can be used. The syntax is SELECT column_names FROM table WHERE condition GROUP BY columns HAVING condition ORDER BY column_names. What is the correct syntax for the reshape function in numpy? So we're going to jump back to numpy, and what you come up with is numpy.reshape — a lot of times you do import numpy as np, then np.reshape(array, new_shape). You can see in the actual example the reshape is on a, and we reshape it to (2, 5); the printout shows two rows with five values in each. What are the different ways to create a DataFrame in pandas? Well, we can do it by initializing a list. You import pandas as pd — very common — and data = [['Tom', 30], ['Jerry', 20], ['Angela', 35]]; then we create the DataFrame with pd.DataFrame(data, columns=['Name', 'Age']), so you can designate your columns. You can also set an index in there — you should always remember the index; in this case maybe you want the index, instead of 0, 1, 2, to be the date they signed up, or whatever. And you can see it generates a nice pandas DataFrame with Tom, Jerry, and Angela. Another way you can initialize a DataFrame is from a dictionary. You can see here we have a dictionary where data = {'Name': ['Tom', 'Jerry', 'Angela', 'Mary'], 'Age': [20, 21, 19, 18]}, and if we do df = pd.DataFrame(data) you get the same kind of setup: Name and Age for Tom, Jerry, Angela, and Mary. Write the Python code to create an employees DataFrame from the emp.csv file and display the head and summary of it. To create a DataFrame in Python you need to import the pandas library and use the read_csv function to load the CSV file, and here you can see we have import pandas as pd, then employees = pd.read_csv() with the path to that CSV file. There are a number of settings in read_csv where you can tell it which row is the header, set the columns, skip rows — all kinds of things you can double-check in the read_csv options — but the most basic use is just to read a basic CSV. How will you select the Department and Age columns from an employees DataFrame? We have import pandas as pd, and you can see we've created our data and our employees DataFrame on the left; then on the right, to select Department and Age from the DataFrame we just do employees[['Department', 'Age']]. Now, if you're selecting just one column you could do employees['Department'], but if you're doing multiple columns you've got to have those in a second set of brackets — it has to be a list within the reference. What are the criteria to say whether a developed data model is good or not? A good model should be intuitive, insightful, and self-explanatory — follow the old saying KISS, keep it simple. The model developed should be easily consumed by the clients for actionable and profitable results — if they can't read it, what good is it? A good model should easily adapt to changes according to business requirements — we live in quite a dynamic world nowadays, so that's pretty self-evident — and if the data gets updated, the model should be able to scale accordingly to the new data. You want a nice data pipeline going, so that when new data comes in you don't have to rewrite the whole code. What is the significance of exploratory data analysis? Exploratory data analysis is an important step in any data analysis process. Exploratory data analysis (EDA) helps you understand the data better. It helps you obtain confidence in your data, to the point where you're ready to engage a machine learning algorithm. It allows you to refine your selection of feature variables that will be used later for model building, and you can discover hidden trends and insights in the data. How do you treat outliers in a data set? An outlier is a data point that is distant from other similar points. They may be due to variability in the measurement, or may indicate experimental errors. One, you can drop the outlier records — pretty straightforward. You can cap your outlier data so it doesn't go past a certain value. You can assign it a new value. You can also try a transformation, to see whether those outliers come back in line when you transform the data slightly differently. Explain descriptive, predictive, and prescriptive analytics. Descriptive provides insights into the past, to answer what has happened; it uses data aggregation and data mining techniques. Example: an ice cream company can analyze how much ice cream was sold, which flavors were sold, and whether more or less ice cream was sold than before. Predictive understands the future, to answer what could happen; it uses statistical models and forecasting techniques. Example: predicting the sale of ice creams during the summer, spring, and rainy days. This is always interesting, because with descriptive your businesses are always looking to know what happened — hey, did we have good sales last quarter, and what are we expecting next quarter in sales? — and we take a huge jump when we get to prescriptive: it suggests various courses of action to answer what you should do, using optimization and simulation algorithms to advise on possible outcomes. Example: lower prices to increase sales of ice creams, or produce more or less of a certain flavor of ice cream. We can certainly relate this to today's world with the COVID virus, since we had that on our earlier graph: descriptive is what's happened — how many people have been infected, how many people have died in an area; predictive is where we predict it to go — do we see it getting worse, is it going to get better, what do we predict we're going to need in hospital beds; and prescriptive is what we can change in our setup to have a better outcome — maybe if we did more social distancing, or if we tracked the virus; how do these different things directly affect the end, and can we create a better ending by changing some underlying criteria? What are the different types of sampling techniques used by data analysts? Sampling is a statistical method to select a subset of data from an entire data set (population) to estimate the characteristics of the whole population. One, we can do simple random sampling — we can just pick out 500 random people in the United States to sample; they call it a population, and in regular data we also call it a population just because that's where the term came from, mainly from doing censuses. Then there's systematic sampling, cluster sampling, stratified sampling, and judgmental or purposive sampling. Systematic sampling is where you sample, say, every fifth record — 1, 5, 10, 15, 20 — a very systematic approach for pulling samples from the set. Cluster sampling is where we look at the data and say, hey, some of these things just naturally group together; if you were talking about a population, which is really a nice way of looking at this, cluster sampling might be by zip code — we take everybody's zip code and naturally cluster it that way. Stratified sampling looks more at something a group shares, like income: if you're studying something on poverty, you might naturally group people based on income to begin with, and then study the individuals within each income group to find out what kind of traits they have. And judgmental is where the researcher very carefully selects each member of the group themselves, so it's very much based on their personal knowledge. Jumping on to number 26: what are the different types of hypothesis testing? Hypothesis testing is a procedure used by statisticians and scientists to accept or reject a statistical hypothesis. In hypothesis testing we have the null hypothesis and the alternative hypothesis. The null hypothesis states that there is no relation between the predictor and the outcome variables in the population; it is denoted by H0. Example: there is no association between a patient's BMI and diabetes. The alternative hypothesis states that there is some relation between the predictor and outcome variables in the population; it is denoted by H1. Example: there could be an association between a patient's BMI and diabetes — and that's the body mass index, if you didn't catch the BMI and you're not in medicine. Describe univariate, bivariate, and multivariate analysis. Univariate analysis is the simplest form of data analysis, where the data being analyzed contains only one variable — an example is studying the heights of players in the NBA. Because it's so simple, it can be described using central tendencies, dispersion, quartiles, bar charts, histograms, pie charts, and frequency distribution tables. Bivariate analysis involves the analysis of two variables to find causes, relationships, and correlations between the variables — example: analyzing the sale of ice creams based on the temperature outside. Bivariate analysis can be explained using correlation coefficients, linear regression, logistic regression, scatter plots, and box plots. Multivariate analysis involves the analysis of three or more variables to understand the relationship of each variable with the others — example: analyzing revenue based on expenditure, so if we have our TV ads, our newspaper ads, our social media ads, and our revenue, we can now compare all of those together. Multivariate analysis can be performed using multiple regression, factor analysis, classification and regression trees, cluster analysis, principal component analysis, clustering, bar charts, and dual-axis charts. What function would you use to get the current date and time in Excel? In Excel you can use the TODAY and NOW functions to get the current date and time — you can see down here the two examples, just =TODAY() or =NOW(). Using the SUMIFS function in Excel, find the total quantity sold by sales representatives whose names start with A, where the cost of each item they have sold is greater than 10.
And you can see here on the left we have our actual table, and then we want to do SUMIFS: we sum E2:E20, with criteria ranges on B2:B20 for the names starting with A and on the cost column for ">10". This basically just says, hey, we're going to take everything in the E column and sum it up, but only those rows where the name starts with A and the D column is greater than 10 — that's what that means. Is the below query correct? If not, how will you rectify it? SELECT customer_id, YEAR(order_date) AS order_year FROM orders WHERE order_year >= 2016. Hopefully you caught it right there — the devil's in the details: we cannot use an alias name while filtering data using the WHERE clause. So the correct format is all the same, except it says WHERE YEAR(order_date) >= 2016, instead of using the order_year alias we assigned in the SELECT. How are UNION, INTERSECT, and EXCEPT used in SQL? The UNION operator is used to combine the results of two or more SELECT statements, and you can see here we have SELECT * FROM region1 UNION SELECT * FROM region2 — it basically takes both of these SQL tables and combines them to form a full new table; that's your union, bringing everything together. The INTERSECT operator returns the common records resulting from the two or more SELECT statements, so SELECT * FROM region1 INTERSECT SELECT * FROM region2 comes up with only those records that are shared — that have the same data in them. And hopefully you jumped ahead to EXCEPT: the EXCEPT operator returns the uncommon records from the results of two or more SELECT statements, so these are the records that are not shared between the two tables. Using the product price table, write an SQL query to find the record with the fourth-highest market price. Here we have a little bit of a brain teaser — they're always fun. If you look at the script on the left, we really want the fourth one down, so we SELECT the TOP 4 FROM product_price ORDER BY market_price DESC — that gives us the four greatest values — and then, in the outer query on that result (aliased as sp), we reverse the order with ORDER BY market_price ASC and take the TOP 1, which gives us the lowest value of those four: the fourth-greatest one in the list. From the product price table, find the total and average market price for each currency, where the average market price is greater than 100 and the currency is INR or AUD — so INR or AUD, Indian Rupee or Australian Dollar. You can see the SQL query over here — if you had trouble putting this together, you might actually work through some of it in reverse — and you can see right here where the average market price is greater than 100.
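That nested-TOP trick for the fourth-highest price is worth seeing end to end. Since SQLite has no TOP keyword, this sketch — with invented products and prices — expresses the same idea using LIMIT and OFFSET:

```python
import sqlite3

# Invented product/price rows standing in for the product price table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_price (product TEXT, market_price REAL)")
conn.executemany(
    "INSERT INTO product_price VALUES (?, ?)",
    [("a", 500), ("b", 400), ("c", 300), ("d", 200), ("e", 100)],
)

# Sort descending, skip the three highest, take the next one:
# that row holds the fourth-highest market price.
fourth = conn.execute(
    "SELECT product, market_price FROM product_price "
    "ORDER BY market_price DESC LIMIT 1 OFFSET 3"
).fetchone()
print(fourth)  # ('d', 200.0)
```

In T-SQL the same result comes from the nested TOP 4 / TOP 1 reversal described above.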
Remember, we use HAVING, not WHERE, at the end, because it's part of the group — so GROUP BY currency, because we want those two currencies, and we want the currency IN ('INR', 'AUD'). As you keep working backwards, we're selecting the currency, SUM(market_price) AS total_price, and AVG(market_price) AS average_price — there's our SELECT — and it comes FROM product_price, which is just our table over there, and then we have our WHERE currency IN clause. Like I said, you can put it together however you want, but hopefully you got to the end there. This next question will test your knowledge in Tableau, exploring the different features of Tableau and creating a suitable graph to solve a business problem. Of course Tableau is very visual in its use, so it's very hard to test without actually getting your hands on it, and if you can't visualize some of this and how to do it, you should go back and refresh yourself. Using the Sample Superstore dataset, create a view to analyze the sales, profits, and quantities sold across different subcategories of items present under each category. The first step is to load the Sample Superstore dataset — make sure you know how to load it; that's under either the Connect pane in the upper left or the Tableau icon up there — and be able to pull in the data. Once you've done that, you just drag Category and Sub-Category onto Rows and Sales onto Columns, which will result in a horizontal bar chart. Then we drag Profit onto Color and Quantity onto Label, and sort the Sales axis in descending order of sum of sales within each subcategory. If you're at home doing this, you'll see that chairs under the Furniture category had the highest sales and profit while tables had the lowest profit; for the Office Supplies subcategories, binders made the highest profit even though storage had the highest sales; and under the Technology category, copiers made the highest profit even though they had the least amount of sales. Let's work to create a dual-axis chart in Tableau to present sales and profits across different years using the Sample Superstore dataset. Load the Orders sheet from the Sample Superstore dataset, drag the Order Date field from Dimensions onto Columns and convert it into continuous month, then drag Sales onto Rows, and drag Profit to the right corner of the view until you see a light green rectangle. This is one of those things where, if you haven't done it hands-on, you won't know what you're doing — you're going to run into a bind, just kind of dropping it and wondering what happened. Synchronize the right axis by right-clicking on the Profit axis, and then let's finalize it: under the Marks card, change SUM(Sales) to Bar and SUM(Profit) to Line, and adjust the size. Then we have a nice display that we can either print out or save and send off to the shareholders. Let's go and do one more Tableau one: design a view in Tableau to show statewide sales and profits using the Sample Superstore dataset. Here you drag the Country field onto the view section and expand it to see the states, drag the States field onto Size and Profit onto Color, increase the size of the bubbles, and add a border and a halo color. States like Washington, California, and New York have the highest sales and profits, while Texas, Pennsylvania, and Ohio have a good amount of sales but the least amount of profit. We'll go ahead and skip back to Python numpy. Suppose there is an array num = np.array — or numpy.array, depending on how you set it up — holding 1 to 9 broken up into three groups. Extract the value 8 using 2D indexing. You can see on the left we have import numpy as np and num equals our np.array; if we print num we have 1 2 3, 4 5 6, 7 8 9. Since the value 8 is present in the third row and second column of the array — remembering that indexing starts at zero — we use that index position and pass it
to the array you just have number two comma one and you get eight and remember we're in python so you start at zero not one like you do in excel always gets me if i'm working between excel and python where i just kind of flip and usually see excel that messes up because i do a lot more programming suppose there's an array that has values 0 1 all the way up to 9.
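As a quick runnable sketch of the 2D-indexing answer above — and of the odd-value selection discussed next — here are both in one snippet, assuming the same small arrays the slides use:

```python
import numpy as np

# The 3x3 array from the 2D-indexing question.
num = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# 8 sits in the third row, second column -> zero-based index [2, 1].
print(num[2, 1])  # 8

# The 0..9 array from the next question: keep values whose
# remainder after dividing by 2 is 1.
arr = np.arange(10)
odds = arr[arr % 2 == 1]
print(odds)  # [1 3 5 7 9]
```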
How will you display the following values from the array: 1, 3, 5, 7, 9? First we create the array with np.arange(10), which goes from 0 to 9 — there are 10 numbers in it, but the 10 itself isn't included — and print it out. Now, what's going on with 1, 3, 5, 7, 9? If we divide each by 2, the remainder equals 1, and from Python we remember that the percent sign gives you the remainder. So we take our NumPy array and apply the logical condition num % 2 == 1 — all values with a remainder of 1 — and that generates our nice 1 3 5 7 9. Next: there are two arrays, a and b — stack the arrays a and b horizontally. Boy, these horizontal/vertical questions will get you every time. In NumPy we've created two 2-D arrays, a and b; the first option is np.concatenate((a, b), axis=1), and that is the same as np.hstack((a, b)) — on the back end they're identical and run the same, because hstack is just concatenate with axis=1. How can you add a column to a pandas DataFrame? Suppose there's an emp DataFrame that has information about a few employees, and let's add an Address column to it. On the left we have our basic DataFrame — you should know your DataFrames very well; one basically looks like an Excel spreadsheet. It's really simple: you just do df['Address'] = address, once you've assigned values to the address list. Using the below given data, create a pivot table to find the total sales made by each sales representative for each item, and display the sales as a percentage of the grand total. So now we're in Excel: select the entire table range, click on the Insert tab, and choose PivotTable; select the table range and the worksheet where you want to place the pivot table, and it will return a pivot table where you can analyze your data. Drag Sale Total onto Values and Sales Rep and Item onto Row Labels; that gives the sum of the sales made by each representative for each item they've sold. Finally, right-click on Sum of Sale Total and expand Show Values As to select % of Grand Total. The really important thing is to understand what a pivot table is — we're just pivoting the data between rows and columns, switching the direction — and then we have our final pivot table with the row labels and the sum of total sale shown as percentages. Next we're going to take a product table — this is in SQL, so we can do some SQL here — and use the Product and SalesOrderDetail tables to find the products whose total unit price sold is greater than 1.5 million. We have a Product table and a SalesOrderDetail table, two separate tables in the database, and we're going to put together the SQL query: SELECT pp.Name, SUM(sod.UnitPrice) AS Sales, pp.ProductID FROM Production.Product AS pp INNER JOIN Sales.SalesOrderDetail AS sod ON pp.ProductID = sod.ProductID GROUP BY pp.Name, pp.ProductID HAVING SUM(sod.UnitPrice) > 1500000. That's a mouthful, and again, these SQL queries start looking really crazy until you break them apart and do them step by step. What the interviewer is really looking for here is the inner join and the group by — this comes up so much in SQL: how do you pull in the ID from one table, the information from another table, and the sum totals across them? How do you write a stored procedure in SQL? Let's create a stored procedure to find the sum of the squares of the first n natural numbers. Here we have our formula, n(n + 1)(2n + 1) / 6, and from the command prompt — or whatever setup you have, depending on your login — the commands are: CREATE PROCEDURE squaresum1, declare our parameter @n as an integer, AS BEGIN, then declare @sum as an integer and SET @sum = @n * (@n + 1) * (2 * @n + 1) / 6.
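The inner-join / group-by / having pattern from that query can be sketched with Python's built-in sqlite3 module on a toy pair of tables — the table names, column names, and numbers here are simplified stand-ins for the AdventureWorks-style tables on the slide, not the real schema:

```python
import sqlite3

# In-memory database with simplified stand-ins for Production.Product
# and Sales.SalesOrderDetail.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales_order_detail (product_id INTEGER, unit_price REAL);
INSERT INTO product VALUES (1, 'Chair'), (2, 'Table'), (3, 'Copier');
INSERT INTO sales_order_detail VALUES
  (1, 900.0), (1, 700.0), (2, 200.0), (3, 1500.0), (3, 100.0);
""")

# Inner join the detail rows onto products, group per product,
# and keep only groups whose summed unit price clears a threshold
# (a toy 1000 here instead of the 1.5 million in the question).
rows = con.execute("""
SELECT p.name, SUM(sod.unit_price) AS sales, p.product_id
FROM product AS p
INNER JOIN sales_order_detail AS sod
  ON p.product_id = sod.product_id
GROUP BY p.name, p.product_id
HAVING SUM(sod.unit_price) > 1000
""").fetchall()

print(sorted(rows))  # [('Chair', 1600.0, 1), ('Copier', 1600.0, 3)]
```

Note that 'Table', with only 200.0 in summed unit price, is filtered out by the HAVING clause — which is exactly the filter-after-aggregation behavior the interviewer is probing for.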
And then of course we print the results: PRINT 'first ' + CAST(@n AS VARCHAR(20)) + ' natural numbers', then PRINT 'sum of the squares is ' + CAST(@sum AS VARCHAR(40)), and END. For the output — display the sum of the squares for the first four natural numbers — we run EXECUTE squaresum1 4, and you can see it prints that for the first four natural numbers the sum of the squares is 30 (1 + 4 + 9 + 16 = 30). Write a stored procedure to find the total number of even numbers between two user-given numbers. A couple of things to note here: first we create the procedure with its two parameters, @n1 and @n2, and BEGIN. We declare our variable @count as an integer and set @count = 0. Then, WHILE @n1 < @n2, we BEGIN, and if @n1 % 2 = 0 — dividing by two to check for an even number — we BEGIN, set @count = @count + 1, and PRINT 'even number ' + CAST(@n1 AS VARCHAR(10)) + ', count is ' + CAST(@count AS VARCHAR(10)), END; ELSE we PRINT 'odd number ' + CAST(@n1 AS VARCHAR(10)). Then we increment @n1 by one so it walks from @n1 all the way up to @n2, and finally we print the total number of even numbers. You can see here we executed it to count the even numbers between 30 and 45, and the total comes out to eight. What is the difference between treemaps and heat maps in Tableau? If you've worked in Python or other programming languages you should already know what a heat map is. A treemap is used to display data in nested rectangles: you use dimensions to define the structure of the treemap and measures to define the size or color of the individual rectangles. Treemaps are a relatively simple data visualization that can provide insight in a visually attractive format — you can see the squares here; this is our treemap, and each block also carries information inside its nested blocks. A heat map helps you visualize measures against dimensions with the help of colors and size, to compare one or more dimensions and up to two measures. The layout is similar to a text table, with variations in values encoded as colors. In a heat map you can quickly see a wide array of information: in this one they use the color to denote one thing and the size of the little square to denote something else — a lot of times you can even graph this into a three-dimensional graph with other data so it pops out. But again, a heat map is the color and the size. Using the sample superstore data set, display the top five and bottom five customers based on their profit. You start by dragging the Customer Name field onto Rows and Profit onto Columns. Right-click on the Customer Name column to create a set, give the set a name, and select the Top tab to choose the top 5 customers by sum of profit; similarly, create a set for the bottom 5 customers by sum of profit. Select both sets, right-click to create a combined set, give it a name, and choose all members in both sets. Then drag the combined top-and-bottom-customers set onto Filters and the Profit field onto Color to get the desired result. As we get down to the end of our list, we're going to try to keep you on your toes and skip back to NumPy: how do you print four random integers between 1 and 15 using NumPy? To generate random integers with NumPy we use the np.random.randint function — you can see here we did import numpy as np, then rand_num = np.random.randint(1, 15, 4). From the below DataFrame — we can jump again on you; now we're into pandas — how will you find the unique values for each column, and subset the data for Age less than 35 and Height greater than 6?
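The randint answer above, runnable (the actual numbers vary per run; note that NumPy's upper bound is exclusive, so randint(1, 15, 4) never produces 15 itself — pass 16 if 15 should be possible):

```python
import numpy as np

# Four random integers drawn from [1, 15) -- upper bound exclusive.
rand_num = np.random.randint(1, 15, 4)
print(rand_num)  # e.g. [ 3 11  1  9] -- varies per run
```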
To find the unique values and the number of unique elements, use the unique and nunique functions. You see here we did df['Height'] — selecting just the Height column — and .unique() on it returns an array of the distinct values, whereas .nunique() on Height or Age returns just the number of unique values. Then we can subset the data for Age less than 35 and Height greater than 6. Looking over here, we have new_df = df[(df['Age'] < 35) & (df['Height'] > 6)] — remember, this takes a slice of our original DataFrame; it doesn't actually change the DataFrame. And in case you're not using Tableau, which has a lot of its own mapping and charting built in, make sure you understand the basics of matplotlib: plot a sine graph using NumPy and matplotlib in Python. The way we did this is we generated an x, and we know y = np.sin(x) — if you print x you'll see the whole array of values. We import matplotlib.pyplot as plt, and if you're working in a Jupyter Notebook, make sure you understand %matplotlib inline — that little percent-sign magic prints the plot right on the page in the notebook. The newer versions of Jupyter Notebook and JupyterLab do that automatically, but I usually put it in anyway in case I end up on an older version. If you print y you can see our different y values against the x values, and then you simply run plt.plot(x, y) and plt.show(). Before we go, let's get one more in — pandas again: using the below pandas DataFrame, find the company with the highest average sales, derive the summary statistics for the sales column, and transpose those statistics. That's a mouthful, but just like any of these problems, break it apart. First, we're looking for the highest average sales, so group by the Company column and use the mean function to find the average sales: by_company = df.groupby('Company'). Once we've done that, the describe function gives us the summary statistics — by_company.describe() describes those groups, and you could actually bundle it all together in one line if you wanted. You can see we get a nice breakout; whatever package you're using — Tableau, pandas in Python, even R — being able to quickly look at and describe your data is very important. Then we just apply the transpose function over the describe method to transpose the statistics; all we've done is flip the index with the column names, but a lot of times it's easier to follow across one line, or maybe you want to average out the count — there are all kinds of reasons to do that. Well, that wraps it up. I want to thank you for joining us today; I hope you're ready for those data analytics interview questions coming your way, and that great job is coming right down the line for you. You can always contact us for more information and visit www.simplilearn.com. Again, my name is Richard Kirschner with the Simplilearn team — get certified, get ahead. Hi there, if you liked this video, subscribe to the Simplilearn YouTube channel and click here to watch similar videos; to nerd up and get certified, click here.
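The pandas answers above can be collected into one runnable sketch. The employee and company data here are made-up stand-ins for the slides' DataFrames, just to show the unique/nunique, boolean-subsetting, and groupby/describe/transpose patterns:

```python
import pandas as pd

# Made-up employee data standing in for the slide's DataFrame.
df = pd.DataFrame({
    "Name":   ["Ann", "Bob", "Cal", "Dee"],
    "Age":    [28, 41, 33, 25],
    "Height": [5.9, 6.2, 6.4, 5.7],
})

# Unique values vs. the count of unique values.
print(df["Height"].unique())   # array of the distinct heights
print(df["Age"].nunique())     # 4

# Boolean subsetting: Age under 35 AND Height over 6.
# This returns a new frame; the original df is unchanged.
new_df = df[(df["Age"] < 35) & (df["Height"] > 6)]
print(new_df)                  # just Cal

# Group-by / describe / transpose, as in the sales question.
sales = pd.DataFrame({
    "Company": ["A", "A", "B", "B"],
    "Sales":   [100, 300, 150, 250],
})
by_company = sales.groupby("Company")["Sales"]
print(by_company.mean())        # average sales per company
print(by_company.describe().T)  # summary stats, transposed
```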