Statistical tests for data analysis in research


Hi, welcome to our channel, Talks on Management and Research. In today's session we are going to cover an important topic in research: types of data analysis techniques. Broadly, we can classify data analysis techniques into two categories: descriptive techniques and inferential techniques. These are the two types of statistical techniques that are predominantly used. Let us discuss them one by one.

Descriptive statistics, as the name suggests, describes the characteristics of the population; in other words, it represents whatever data we get from the population in a summarized form. Several techniques fall under this heading: mean, median, mode, charts, and tables all come under descriptive techniques. Here we do not draw a sample from the population; we take each and every unit, the complete data set, and we describe the characteristics of the population through that data.

Let me give you an example. On the slide I have shown the placement statistics of SIBM, a B-school in Pune. Almost 180 students attended the placement; the highest package is 29 lakhs per annum, the median CTC is 16.425 lakhs per annum, and the average CTC is 17.48 lakhs per annum. So the highest, the average, and the median package are all represented through the data available from the 180 students. This is a typical example of descriptive statistics: I have not taken a sample, I have considered the data of the entire population, and I am talking about the population through the entire data that is
available. This is what we call descriptive statistics.

Now let us discuss the other technique, inferential statistics. Look at the word carefully: it uses "inference", which means I am inferring something through something else. Here we predict the characteristics of the population through the sample that is available, and I use something called a hypothesis for this. So I take a sample and predict the characteristics of the population.

Say, for example, that the legislative assembly election is expected to happen in Tamil Nadu in 2021. There are 234 constituencies in Tamil Nadu, and I am not sure which political party is going to win; it might be the DMK, the ADMK, the Congress, or the BJP. If I want to know which party is going to come to power in the 2021 MLA election, I need not check with each and every one of the 5.8 crore voters in Tamil Nadu. Instead, I can take a shortcut: I survey 100 people in each of the 234 constituencies, which comes to 23,400 people, and from that survey I predict the possible winner of the 2021 election. And I say this with 95 percent confidence and a 5 percent margin of error.

Look at the difference in the statement. When I told you about descriptive statistics, I did not use anything like a confidence interval or a margin of error; normally we do not use them there. With inferential statistics I am not 100 percent sure. I might be 90 to 95 percent sure, and I also have a tolerance of 5 to 10 percent, which in statistical terms I call the margin of error. This is the understanding we need to have of inferential statistics.

Let us go deeper into inferential statistics, because most of the time we will be doing analysis in this category so that we can support or reject our hypothesis. Inferential tests fall into two broad categories: parametric tests and non-parametric tests. What is meant by a parametric test? To make it simple, parametric tests are those where you have some assumption about the population. I have already told you that inferential statistics is about predicting the characteristics of the population through the sample. A parametric test is one where you have some assumption about the population from which you are drawing the
sample, and this assumption relates to normality: you assume that the population you are drawing from is normally distributed. When I assume that the population is normal and my sample data is also normal, I have the choice of using a parametric test. When these assumptions are not met, I use a non-parametric test. So we classify tests into two broad categories depending on the distribution. Let us discuss them one by one so that we get some idea of which test to use for our analysis.

First, as I have already told you, a parametric test is appropriate for data that is normal, whereas a non-parametric test does not require normality; it is a distribution-free test. So the first difference is that in a parametric test the data is normal, and in a non-parametric test the data need not be normal.

Second, the type of data. In a parametric test the data is continuous, either ratio or interval. This mostly concerns the dependent variable: with techniques like ANOVA or Student's t-test, it is only the dependent variable that needs a continuous scale. It applies to the independent variable as well if we are planning to use something like regression or correlation. In a non-parametric test the data is typically nominal or ordinal, and this applies to both the independent and the dependent variable.

Third, the measure of central tendency. For a parametric test we can use the mean, whereas for a non-parametric test, because the data that we have obtained
is only nominal or ordinal, we can use a measure like the median; we cannot use the mean as we do with parametric tests.

Fourth, we can draw many conclusions from parametric tests, but not as many from non-parametric tests.

Fifth, sample size. Parametric tests require a bigger sample size; non-parametric tests do not need as big a sample and may give a good result with a small sample as well.

The last difference, and the most important in today's session, is the tests themselves. Among parametric tests I can use the independent-samples t-test, the paired-samples t-test, and so on; the equivalent non-parametric tests include the Mann-Whitney U test, the Kruskal-Wallis test, and the chi-square test. These are the two broad classifications we need to understand when deciding whether to use a parametric or a non-parametric test.

I have told you the difference between parametric and non-parametric tests. Now let us discuss the assumptions of parametric tests one by one, so that it is clear which type of test I need to use for my analysis.

The first and most important assumption is that the data I have should be continuous, that is, on an interval or a ratio scale. I cannot have both the independent and the dependent variable as categorical variables in a parametric test. This is the first and foremost assumption, no matter whether you are using correlation, regression, or
Student's t-test.

The second important assumption is normality: the population from which I am drawing the sample should be normally distributed. What is meant by normality? To make it simple, it is about the distribution of the data: most of the data is clustered around the mean, and very few values fall on either side. Say I am collecting heights, the participants range from five feet to six feet, and the average is about 5 feet 4 inches. Normality says that most of the values will be clustered around the mean; very few people will be five feet and very few will be six feet. How do I check the normality of the data? We have several tests: the Shapiro-Wilk test, the Kolmogorov-Smirnov (KS) test, and also skewness and kurtosis. We can use any of these to find out whether the data is normal, but the test I would always recommend is the Shapiro-Wilk test.

The third assumption is independence of the sample. I do not think this needs much explanation: whenever I draw from the population, each draw should be independent of every other draw.

The fourth assumption is homogeneity of variance: the variance of the dependent variable should be equal across the groups of the independent variable. In other words, the standard deviation of the dependent variable across the two or three groups of the independent variable should be equal. This relates to ANOVA and Student's t-test, and we can explain it further when we cover those topics. To test this assumption we usually use Levene's test: we have a cutoff for the p-value (commonly 0.05), above which the assumption is taken as fulfilled and below which we reject it.

The fifth assumption is linearity and homoscedasticity. Homoscedasticity and homogeneity of variance are almost the same thing; we use the term homogeneity of variance when it comes to ANOVA and
Student's t-test, and we use the term homoscedasticity when it comes to regression. Linearity is about how the variables are linearly related. If I say that height and weight are related to each other, linearity is the extent to which height is related to weight in a linear form. On the slide you can see the dots plotted; if the line I draw through neighbouring dots is a straight line, that is what we call linearity. Homoscedasticity is about whether the variance is the same along the line: are the dots at almost the same distance from the line throughout? If so, we call it homoscedasticity, and the plot should look like a tubular structure.

The sixth assumption of parametric tests concerns outliers, which is equally important. For example, if I am interested in calculating the average income of Mumbai, Mukesh Ambani would be an outlier. It depends on the kind of test you are performing and the objective of your research, but usually we try to remove outliers. On the slide you can see where outliers are found. As a rule of thumb, we categorize a value as an outlier when it is 3.29 standard deviations from the mean. Why is an outlier a problem? Because with an outlier we will most of the time not get normal data, and the skewness or kurtosis might not fall into the acceptable range. So we sometimes take out the outliers so that we have a normal curve and can use a parametric test.

The last assumption is that there should be no multicollinearity. What is meant by multicollinearity? The independent variables may be correlated with each other, but the correlation should not be very high; very high correlation is always dangerous. It is like Cronbach's alpha, where a value above 0.7 is good, but if it goes above 0.95 there is a doubt that my data has a multicollinearity problem. We give due importance to multicollinearity especially when we are using a technique like regression.

So before I decide whether to use a parametric or a non-parametric test, I need to check all seven assumptions, covering normality, homogeneity of variance, linearity, outliers, and multicollinearity. In the upcoming slides let us discuss the different parametric tests and the equivalent non-parametric tests that I can use, focusing on their application.

The first technique I am going to discuss is Pearson correlation. It measures the linear relationship between two continuous variables. For example, if I have two continuous variables like height and weight, I can use Pearson correlation because both variables are on a ratio scale. On this slide I have taken the example of studying the relationship between GDP and the stock market. In the short term, GDP and the stock market may not have a positive correlation, but in the long run, if you take decades or 100 years of data, you can see that the stock market is a reflection of the country's GDP. So I can say that the stock market and the country's GDP are positively related. The correlation takes a value ranging from -1 to +1: -1 indicates a perfect negative correlation, +1 indicates a perfect positive correlation, and zero
indicates that there is no correlation between the two variables. On the slide I have also mentioned what type of data the independent and dependent variables should be, whether the data should be normal, and the equivalent non-parametric test. To make it simple, the independent and dependent variables should both be continuous, meaning either interval or ratio. Second, the data should be normal: most of the data should be clustered around the mean, with very few values on either side. How do I check this? As I have already told you, I can use the Shapiro-Wilk test, or I can check skewness and kurtosis; if the values come out near zero, I can conclude that my data is approximately normal. When these two important conditions are met, I can check the other conditions as well and gain confidence that I can use Pearson correlation. If any of these assumptions is violated, don't worry: we have an equivalent non-parametric test called Spearman rank correlation. I am not going to explain Spearman rank correlation fully, but understand that if my data is not normal, I convert the data into ranks and apply the correlation to those ranks.

Let's get into the second technique, linear regression. Here I predict the dependent variable from the available independent variables; most of the time I use linear regression for prediction. There are two types: simple linear regression and multiple linear regression. What is meant by simple linear regression? When I am trying to find the impact of one independent
variable on one dependent variable, I use simple linear regression. Say I am studying the relationship between salary and job satisfaction: salary could be the independent variable and job satisfaction the dependent variable, and I am trying to find the strength of the impact. That is when I use simple linear regression. What is multiple linear regression? When I am trying to find the strength of the impact of multiple independent variables, two or more, I use multiple linear regression. For example, let's say we want to study the impact of study hours, attendance, and gender on examination score. Here there are three independent variables, study hours, attendance, and gender, and we look at their impact on examination score.

On this slide I have mentioned what the characteristics of the independent and dependent variables should be, what the distribution should be, and so on. The dependent variable should be continuous, on either an interval or a ratio scale. The independent variable can be on an interval, ratio, nominal, or ordinal scale; we have that flexibility. We use the dummy variable technique, through which we can find the strength of the impact of the independent variable on the dependent variable even if the independent variable is categorical. And the distribution should be normal; there is no compromise on that. What if these assumptions are not met, and not only these but all the seven
assumptions that I discussed at the beginning of the session? All seven have to be met before we can use this type of test. If they are not met, the best way is to try converting your dependent variable, if it is not normal, to a normal variable. How do I do that? We use a technique like z-scores, which is available in SPSS and other software, with the aim of getting a normal distribution so that the normality assumption is satisfied. I can also take out some of the outliers and then use linear regression. Otherwise, another technique that researchers use is bootstrapping. Bootstrapping cannot be explained in this session; we will have a separate session on it. That is all about linear regression.

The third technique is the one-sample t-test. It compares the mean of a single sample to a predetermined value to determine whether the sample mean is significantly greater or less than that value. Say I have 100 men and I want to check whether their average height is equal to the national average. The national average for men is 5 feet 4.9 inches, and let's say I have calculated the average height of these men as 5 feet 3 inches. I want to check whether this matches the national average. If this is my application, the test I can use is the one-sample t-test. I have also mentioned on this slide what the scale of the variables should be. The dependent variable should be continuous, either interval or ratio; its data should of course be normal; and it should also fulfil all the other assumptions of parametric tests. Then I can use the one-sample t-test. If I am not able to meet the assumptions of parametric tests, I have an equivalent test called the one-sample Wilcoxon signed-rank test. So don't get disheartened if your data does not fulfil the assumptions. One way is to convert your data to normal data using z-scores; another way is to use the equivalent non-parametric test instead of the parametric test. The choice is yours.

Let us get into another technique, the independent-samples t-test. It compares the means of two independent groups to determine whether one mean is significantly different from the other. To make it very simple, it compares the means across two different groups of the population. Note that I said two different populations: I am not talking about the same population. Let me give you an example. Say I have a new medicine for corona and I want to check whether it is effective. I take gender as the criterion, give the medicine to 100 males and 100 females, and check the recovery rate among the men and among the women. That means I am checking who is recovering faster: are men recovering faster, or are women recovering faster?
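Before the two-group example continues, here is a minimal sketch of the one-sample t-test from the height example a little earlier, written in Python with only the standard library. The ten heights are made-up numbers chosen for illustration (the session gives only the averages), and the function name `one_sample_t` is hypothetical. It computes only the t statistic and compares it with the tabulated critical value; a statistics package such as SPSS would also report the p-value.

```python
import math
import statistics


def one_sample_t(sample, reference_mean):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    sample_sd = statistics.stdev(sample)      # uses the n - 1 denominator
    standard_error = sample_sd / math.sqrt(n)
    t_stat = (sample_mean - reference_mean) / standard_error
    return t_stat, n - 1


# Hypothetical heights in inches for ten men (illustration only);
# their mean works out to 63 inches, i.e. 5 feet 3 inches.
heights = [63.0, 62.5, 64.0, 63.5, 61.5, 62.0, 64.5, 63.0, 62.5, 63.5]
national_average = 64.9   # 5 feet 4.9 inches, the value used in the example

t_stat, df = one_sample_t(heights, national_average)

# Two-tailed critical value of t for 9 degrees of freedom at the 5% level.
CRITICAL_T_DF9 = 2.262
differs = abs(t_stat) > CRITICAL_T_DF9

print(f"t = {t_stat:.2f}, df = {df}, differs from national average: {differs}")
```

With real data the sample would be the 100 measured heights, and the critical value would be looked up for 99 degrees of freedom instead of 9.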

How do I check it? I check their body temperature. When I compare the mean body temperature across two different groups like this, the test I use is the independent-samples t-test. On the slide I have given the characteristics the independent and dependent variables should have, and I have laid it out the way you would feed the data into SPSS: one grouping variable with two groups from different populations. I have also mentioned the equivalent non-parametric test, the Mann-Whitney U test.

Let us discuss the other technique, the paired-samples t-test. Here we compare means within the same population. Look at the difference: in the independent-samples t-test I compare the means across two different populations; in the paired-samples t-test I compare the means across two samples of the same population. Let me explain with the corona example again: I am checking the new medicine on patients before and after the treatment. Most of the time I use the paired-samples t-test for conditions or situations where the population stays the same. For example, what was your knowledge of data analysis before this session, and what is it going to be after? If that is what I want to assess, the test I use is the paired-samples t-test. In the medicine case, I give a treatment and check the body temperature before the treatment and after the treatment for the same patients, not for different people as in the independent-samples t-test. I have also mentioned the characteristics of the scale that the independent and dependent variables need to have, and the equivalent non-parametric test, the paired-samples Wilcoxon test.

Now let us discuss one-way ANOVA. Here I compare the means of not just two populations but two or more. I use both ANOVA and the t-test for comparing means. When I compare the means across two populations I use the t-test. With ANOVA I can compare two groups or more than two groups, but typically I use one-way ANOVA to compare three or more groups. Why, when it can also compare two groups? Studies say that Student's t-test is more powerful than ANOVA when comparing two groups. So for two groups I prefer Student's t-test, and for three or more groups I prefer ANOVA. In one-way ANOVA I compare the means of two or more groups, but I have only one independent variable; I will explain this further when we compare it with two-way ANOVA. Here is an example: let's say we are comparing CAT exam scores across students from various streams, including commerce, engineering, and science. For this comparison the test I use is ANOVA, because the CAT score is the dependent variable and the streams are three different groups. So one-way ANOVA fits this example perfectly, and the equivalent non-parametric test is the Kruskal-Wallis test. To sum up, when I am comparing means across three or more groups I use one-way ANOVA; though it is also capable of testing two groups, I prefer Student's t-test for two groups and one-way ANOVA for more than two.

Now for two-way ANOVA; here is the difference. In one-way ANOVA I had only one nominal variable, with its groups, and one interval variable. In two-way ANOVA I have two nominal variables, with two groups each, and one interval variable that is normal. So if the application is the same, but instead of one nominal independent variable I have two, each with its own subgroups, I use two-way ANOVA. When will I use it? Basically, to determine the effect of both nominal variables on the dependent variable; in other words, when I want to find the interaction effect of the independent variables on the dependent variable. Let us say I want to study the interaction effect of gender and educational level on interest in politics. I collect a person's interest in politics as a rating from 0 to 100, so it is on an interval or ratio scale. I record whether the person is male or female, and I ask one more question: what is his or her qualification, 10th, 12th, UG, PG, whatever it is. Now I am finding the combined effect: the effect of gender on interest in politics, and how this relationship is affected by a third variable, whether the person is an undergraduate or a postgraduate. It could be vice versa: I may want to find how education level affects interest in politics, and how the strength of that impact is affected by gender. This is what we call an interaction effect; we have explained it further in the previous videos, which you can always refer to. If this is your situation, you can use two-way ANOVA, provided the data fulfils the assumptions of parametric tests. If it does not fulfil any of the assumptions we discussed, not to worry: we have an equivalent test, either the Kruskal-Wallis test if you are using independent groups or the Friedman test if you are using dependent groups.

The last one is the chi-square test. There is nothing complicated about it: it is not a parametric test, so there is no assumption of normality, and you need not worry about fulfilling the other assumptions of parametric tests that I have discussed. It is also a very popular technique among researchers, because here both the independent and the dependent variable are on a nominal scale, not a continuous scale. Let's say you want to study the relationship between gender and smoking, and you think that smoking is more common among men than among women. How do you do that? You ask two questions: please indicate your gender, and do you smoke or not. Then you perform a chi-square test, which uses a cross-tabulation, and you find out whether smoking and gender are related to each other. A normal distribution is not mandatory for this, because both variables are nominal.

On this slide I have mentioned clearly what type of data goes with what type of technique. You can take a screenshot of it, I don't mind, so that it becomes a quick reference for which type of test to use for which type of application.

Let me conclude this very long session with a small note. Refer to the previous two videos and recollect what we have discussed today. The choice of the right data analysis technique is affected by four important things. First, what kind of variable have I identified: is it an independent variable, a dependent variable, a mediating variable, or a moderating variable? Second, what type of measurement scale do I have: nominal, ordinal, interval, or ratio? Third, does my data fulfil the assumptions of parametric tests, the six or seven assumptions I have already mentioned? Fourth, what is the application, what is the situation? Based on these four things I come to a conclusion about which type of test to use for which application. I hope you have found this session useful. Thank you very much.
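As a quick reference in place of the screenshot, the pairings mentioned in this session can be sketched as a small lookup in Python. This is only an illustration: the application labels and the function name `suggest_test` are hypothetical, and the table lists only the tests named in the session, not every option a statistician might consider.

```python
# Application -> (parametric test, non-parametric equivalent), as paired in
# this session. The dictionary keys are made-up labels for illustration.
TEST_TABLE = {
    "relationship_two_continuous": (
        "Pearson correlation", "Spearman rank correlation"),
    "sample_mean_vs_known_value": (
        "one-sample t-test", "one-sample Wilcoxon signed-rank test"),
    "two_independent_groups": (
        "independent-samples t-test", "Mann-Whitney U test"),
    "same_group_before_after": (
        "paired-samples t-test", "paired-samples Wilcoxon test"),
    "three_or_more_groups_one_factor": (
        "one-way ANOVA", "Kruskal-Wallis test"),
    "two_factors_independent_groups": (
        "two-way ANOVA", "Kruskal-Wallis test"),
    "two_factors_dependent_groups": (
        "two-way ANOVA", "Friedman test"),
    # Both variables nominal: there is no parametric option.
    "both_variables_nominal": (
        None, "chi-square test"),
}


def suggest_test(application, assumptions_met):
    """Pick a test for the application, given whether the assumptions of
    parametric tests (normality, equal variances, and so on) were met."""
    parametric, non_parametric = TEST_TABLE[application]
    if parametric is not None and assumptions_met:
        return parametric
    return non_parametric


print(suggest_test("two_independent_groups", assumptions_met=True))
print(suggest_test("two_independent_groups", assumptions_met=False))
print(suggest_test("both_variables_nominal", assumptions_met=True))
```

The lookup covers only the application and the assumption check; the other two factors from the closing note, the role of each variable and its measurement scale, still have to be settled before choosing the application label.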