# What is data in data analysis

What is data? Right, I'm pretty sure that's data. Is this data, you know, this picture? Or is that data? What is data?

So we talked a lot about data in the last video, and about why it's important that we can analyze and understand data. But what is data? Everybody has data and everybody's generating it. Companies are generating it on us, and we're generating it ourselves, you know, when we use social media and so on. But what is it? Understanding what it is is a prerequisite for being able to use it properly.

Perhaps the most important thing as far as we're concerned, as people who are trying to analyze data sort of scientifically, is that the data has to be measurable, right? So the idea is, if you're going to do a survey on what people like, everyone's got to be using the same scale and the same rating system. Otherwise it doesn't make any sense. We can't have someone rating things from one to five and someone else saying "I thought it was good", because which one of one to five is "good"? We don't know. So everyone has to be doing the same thing, and your data's got to be in a consistent format. Once that's achieved, at least we're a little bit closer to being able to make some sense of it.

Broadly speaking, when we talk about data we have four different types, and we summarize them with this nice word "noir": N, O, I, R. With each of these types of data we can do different things.

So N, that's the first type: this is nominal data. Nominal data is where we have no distance between the values that we can measure, because they're not really quantities, and we can't order them. A good example would be colors. Maybe your favorite color is red and my favorite color is blue. I don't know which is better than the other, and there is no measurement between them. Is blue closer to green? You know, that doesn't make any sense, right? We're not talking about wavelengths, we're just talking about the colors.

Another good example would be, let's say in football, the player numbers on your back. Now symbolically, sometimes certain player numbers have a meaning, but you can't compare and contrast them. You can't say that 8 is 2 times better than 4.

All right, that doesn't make any sense, right? You also can't really order them in general: player 16 doesn't go before or after player 13 in a list. That doesn't make any sense either. So nominal data is useful, right? It could be really important, but it's data where we have labels and no way of ordering those labels. You can still analyze it, but you can't, for example, calculate the mean average. That wouldn't make any sense. What you can do is calculate the mode, the most common value. So you could say that more people prefer red to blue, but you couldn't say that the average color people like is a sort of muddy brown. That doesn't make any sense at all.

As we go down this list, we get types of data that are, in some sense, more and more informative. The next one is ordinal. In ordinal data we have an order, but we can't measure distances between things. A good example would be the positions people finished in a race. Maybe I finished first, because I'm super quick, and you finished third. But how far apart we were isn't included in that kind of data; you'd have to have a separate value for that. Another example that we're all familiar with is rating systems. Perhaps I rate a film from one to five stars and you rate the film from one to five stars, but you can't really say that a film that's got four stars is two times better than one that scored two, because that's very subjective and there's no real measurable distance between these stars.

If you have ordinal data, you can calculate the mode again, the most common value of all the values that were returned, or you can calculate the median, the one that sits in the middle. So with fifty runners in a race, the 25th position is, roughly speaking, around the median. So it's still not hugely useful, right?
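
As a rough illustration of the mode and the median described above, here is a small sketch. The talk itself uses R; this uses Python's standard library instead, and the survey answers are made up purely for the example.

```python
from collections import Counter
from statistics import median_low

# Nominal data: favourite colours (made-up survey answers).
# These are labels with no order, so the mode is the only sensible summary.
favourite_colours = ["red", "blue", "red", "green", "red", "blue"]
colour_mode = Counter(favourite_colours).most_common(1)[0][0]
print(colour_mode)  # red

# Ordinal data: finishing positions 1..50 in a race.
# There is an order but no measurable distance, so we can take a median.
positions = list(range(1, 51))
median_position = median_low(positions)  # lower of the two middle values
print(median_position)  # 25
```

`median_low` is used here rather than `median` so that the result is always one of the actual positions, which suits ordinal data where averaging two ranks has no clear meaning.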

Then we have interval data. With interval data we have an order and we have a distance, but we have no absolute zero for the scale. A good example would be degrees Celsius or degrees Fahrenheit. Zero degrees Celsius isn't "no temperature"; it's a specific temperature. So we can't say that fifty degrees is half of a hundred degrees. As numbers it's a half, but that doesn't really make sense for temperature. We can say that a hundred degrees is hotter than fifty, which is hotter than zero. So this is interval data.

Interval data lets us do a few more things than we could with ordinal. As well as being able to calculate the mode and median, we can now calculate the mean temperature. We could also calculate things like the range, the minimum and maximum temperatures for a certain window. So that's pretty useful. Another good example of interval data would be pH level. Again, a pH of zero means very acidic; it doesn't mean there is no acidity at all, or no pH at all. We can say that a pH of 13 is higher than a pH of 7, which is higher than a pH of 3, and we know how far apart these numbers are, but we can't necessarily say that one is double another.

The final kind of data we're going to look at is ratio data. This is exactly like interval, except we now have a true zero value. A good example would be the Kelvin scale. Kelvin has an absolute zero, which is the absence of any kind of heat, and it goes upwards from there. So in terms of Kelvin we can say that a hundred is half of two hundred, and so on, and we can get to zero. Another example would be number of children, right?
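
Pausing on the temperature examples for a moment, the claims above can be checked with a little arithmetic. This is only a sketch in Python (the talk uses R), using the zero, fifty and one hundred degree values from the discussion and the standard offset of 273.15 between Celsius and Kelvin.

```python
from statistics import mean

def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to the Kelvin scale (0 C = 273.15 K)."""
    return celsius + 273.15

temps_c = [0.0, 50.0, 100.0]
temps_k = [celsius_to_kelvin(t) for t in temps_c]

# Interval data supports the mean and the range.
mean_c = mean(temps_c)
range_c = max(temps_c) - min(temps_c)

# Distances are the same on both scales...
gap_c = temps_c[2] - temps_c[1]
gap_k = temps_k[2] - temps_k[1]

# ...but ratios are not: "100 is double 50" only holds for the Celsius numbers.
ratio_c = temps_c[2] / temps_c[1]
ratio_k = temps_k[2] / temps_k[1]

print(mean_c, range_c)                       # 50.0 100.0
print(round(gap_c, 2), round(gap_k, 2))      # 50.0 50.0
print(round(ratio_c, 3), round(ratio_k, 3))  # 2.0 1.155
```

The ratio changes when the zero point moves, which is why "twice as hot" is only meaningful on a scale with a true zero.
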
Zero children means the absence of any children, and you can also say that, let's say, four children is double the amount of two children. And too many to look after, in my opinion. So that is an example of ratio data. Ratio data is quite similar to interval in terms of what you can calculate, but because it has a true zero it also allows measures that depend on ratios between values, such as the coefficient of variation.

So these are the types of data. Now actually, it's quite important how you structure your data in general. We can't just have it sitting in some massive spreadsheet with no thought given to where everything is. Data comes in lots of forms: different types of measurements, different experiments, and people are going to collect it in different ways. But there's a very standard way that we use to represent data once it's actually on a computer. We almost always represent our data in a matrix like this, a two-dimensional table, because it's much easier to work with.

Along the top we're going to have our attributes, which are the things we've been measuring. So maybe we're collecting data on people: we could have name, which would be nominal data, and then age and height. So the columns are attributes, all the things we've been measuring. The rows are the instances or the samples we've got, so that's all the individual people. Here's person 1, person 2 and person 3, and person 3 is called John, and they're 54 and 5 foot 11 or whatever, and so on. You can have as many rows as you want. So when we talk about attributes, we're talking about the columns. People use lots of different terms for these.
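
The layout described above, with attributes as columns and instances as rows, might be sketched like this in plain Python (the talk uses R). Only John's row comes from the talk; the other two people and their values are hypothetical placeholders.

```python
# Column names: the attributes (things being measured).
attributes = ["name", "age", "height"]

# Each inner list is one instance (one person), in the same column order.
people = [
    ["Person 1", 31, "5 ft 6"],  # hypothetical values
    ["Person 2", 47, "6 ft 0"],  # hypothetical values
    ["John", 54, "5 ft 11"],     # person 3, as in the talk
]

n_instances = len(people)       # number of rows
n_attributes = len(attributes)  # number of columns

# Looking across a row gives one instance.
person_3 = dict(zip(attributes, people[2]))

# Looking down a column gives one attribute for every instance.
age_column = attributes.index("age")
ages = [row[age_column] for row in people]

print(n_instances, n_attributes)  # 3 3
print(person_3)
print(ages)                       # [31, 47, 54]
```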

I like to think of them as features; attributes is another term. And we have instances or samples down the rows. Now quite often, in the very last column of your data (sometimes separated out, but that's not really important), we'll have our output. Maybe we're trying to make a decision based on these people. Maybe these are candidates for a football team and we're saying: are they going to be on the team or not? So this is yes or no. John's made it, so yes, then no, no and so on. That way we could perhaps analyze our decision-making process and decide whether any aspect of these attributes informs it.

We always structure data in this way. If we don't, it becomes a huge problem, because you end up spending all this time formatting and trying to work out what's what. Why is John listed down the table and not across the table? Nothing makes any sense anymore.

So let's look at an actual data set and we'll see all this in action. We have here a data set of whether someone goes to play tennis, and whether or not they go is going to depend a little bit on what the weather conditions are. We don't like to play, for example, when it's too hot. The tennis data set has just the same structure as the data set we looked at already. We're going to load it into R. It's held in a CSV file, so `tennis` is `read.csv` of the tennis file.

Now, we're using R for this because it's free and it has a load of decent functions for analyzing, examining and visualizing data. So we're going to be using it throughout these videos. Obviously you could use MATLAB or Python or some other library if you wanted to; I think you should use whatever you're most comfortable with.

Looking at these rows and columns, it looks a lot like something like Microsoft Excel. Could you do this data analysis in Excel? Some people would disagree, but Excel is perfectly good for what it does, and you could do data analysis in it. I think that Excel doesn't enforce anything to do with observations versus variables and things like that. These are distinctions that are not really made in Excel. Obviously, if you enforce those rules yourself that's going to work, but you have to be a little bit more regimented and rule-based about it. I think the consensus would be that if you really want to get into data analysis and start doing things like principal component analysis or more advanced statistical measures, something like R or Python is going to help a lot more.

Okay, so I've loaded the data set, and if we look at the top few rows of the data you'll see that there are 6 different variables, or 6 attributes, and this data set has 14 instances or observations. R calls them observations.
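
For readers following along in Python rather than R, loading a CSV and counting its variables and observations could look roughly like this. The real tennis file is not reproduced here: the column names and the four sample rows are assumptions made for illustration (the actual data set has 14 observations), and `read_table` is a hypothetical helper name.

```python
import csv
import io

# Stand-in for the tennis CSV file. Column names and values are assumed.
SAMPLE_CSV = """day,outlook,temp,humidity,wind,play
1,sunny,25,high,10,no
2,sunny,28,high,20,no
3,overcast,5,high,0,yes
4,rainy,15,normal,5,yes
"""

def read_table(handle):
    """Read a CSV with a header row into a list of dicts, one per observation."""
    return list(csv.DictReader(handle))

# With a real file this would be: with open(path, newline="") as f: read_table(f)
tennis = read_table(io.StringIO(SAMPLE_CSV))

variables = list(tennis[0].keys())
print(len(variables), "variables:", variables)
print(len(tennis), "observations")

# Peek at the top few rows, much like looking at the head of the table in R.
for row in tennis[:3]:
    print(row)
```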

So what we're saying is we have six columns and fourteen rows in our data set, and this data set is structured exactly like the people data set that I was looking at a minute ago. We can examine a single instance: what is it about day three? So let's have a look at row 3 of `tennis`. On day three it was overcast, the temperature was only five degrees, the humidity was high and there wasn't any wind, so they decided to play tennis. It's a bit chilly, but I guess they gave it a go.

We could also look at all the different temperatures or, for example, all the different forecasts: `tennis$outlook`. We can look at all the outlooks in the data set, so we've got sunny, sunny, overcast, rainy, rainy, rainy and so on, and we can get a feel for what kind of weather we're looking at here as well. Using something like R, you can examine the instances, you can examine the individual attributes, you can group them together or not as you see fit, and then you can start to drill into what this dataset means.

Now, this dataset has the final column, which is whether they actually played, so you could use something like machine learning to predict that final column based on the other columns. That's something you could do.

One other thing about this dataset that's quite interesting is that it has examples of the different kinds of data we were looking at earlier. Remember we have nominal, ordinal, interval and ratio. So, for example, Outlook is really a nominal field; it's a nominal data type. You could perhaps suggest that you could order it from rainy through to sunny, but then where does cloudy or overcast go? It doesn't really make any sense, so this is nominal. You could calculate, for example, the mode and say that most of the days were rainy, or something like this.

Temperature, as we discussed before, is in Celsius, so this is going to be interval. We can order the data, and we can say how far one of them is from another one, but we can't say what kind of difference that is. Is that double the temperature or half the temperature? We can't really say.

Humidity is ordinal: we can say high is more humidity than normal, but we can't really say how much more. That's going to depend on who was measuring it and where their differences lie.

And finally, wind in kilometers per hour. Well, zero is no wind, and you can't have negative wind, so this is ratio. You can say that a 20 mile an hour wind, or 20 kilometers an hour wind, is two times more than ten. That's something you can say.

This little dataset contains all the kinds of data, and the statistics and measures you can calculate are going to depend on what kind of data they are. So we can see that even a very simple data set like this has loads of different kinds of data and different ways we could interpret it. If you make a decision to play based only on whether the outlook is good, you're maybe not going to solve the whole problem. These are the kinds of things we'll be looking at as we go forward.

One thing we might do next is to visualize this data, to start to try and understand some patterns or extract some kind of knowledge. Visualization is a very important tool, but you've got to use it properly. You can't just plot anything and everything. Every chart you use has got to support your hypothesis, or it's got to try and show the story you're trying to tell. You don't just plot something because it could be plotted; there's got to be a point to it. There are a lot of problems with using inappropriate graphs and with only picking subsets of your data. That's a huge problem.
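
Going back to the tennis table, the row and column lookups and the "which statistic fits which kind of data" point can be tied together in one rough sketch. It is written in Python rather than the R used in the talk. Apart from day 3 (overcast, five degrees, high humidity, no wind, played), the column names and values are assumed for illustration only, and `HUMIDITY_ORDER` is a hypothetical helper.

```python
import csv
import io
from collections import Counter
from statistics import mean, median_low

# Stand-in for the tennis CSV file. Column names and most values are assumed.
SAMPLE_CSV = """day,outlook,temp,humidity,wind,play
1,sunny,25,high,10,no
2,sunny,28,high,20,no
3,overcast,5,high,0,yes
4,rainy,15,normal,5,yes
"""
tennis = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# One instance: day three (index 2, because Python counts from zero).
day_3 = tennis[2]

# One attribute: every outlook in the table.
outlooks = [row["outlook"] for row in tennis]

# Nominal (outlook): only the mode makes sense.
outlook_mode = Counter(outlooks).most_common(1)[0][0]

# Ordinal (humidity): spell out the order, then take a median of the ranks.
HUMIDITY_ORDER = ["normal", "high"]  # lowest to highest
humidity_ranks = [HUMIDITY_ORDER.index(row["humidity"]) for row in tennis]
median_humidity = HUMIDITY_ORDER[median_low(humidity_ranks)]

# Interval (temperature in Celsius): mean and range are fine, ratios are not.
temps = [float(row["temp"]) for row in tennis]
mean_temp = mean(temps)
temp_range = max(temps) - min(temps)

# Ratio (wind speed): zero means no wind, so "twice as windy" is meaningful.
winds = [float(row["wind"]) for row in tennis]
wind_ratio = winds[1] / winds[0]  # day 2 compared with day 1

print(day_3["outlook"], day_3["temp"], day_3["play"])  # overcast 5 yes
print(outlook_mode, median_humidity)                   # sunny high
print(mean_temp, temp_range, wind_ratio)               # 18.25 23.0 2.0
```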
