Introduction to data analysis


Okay, so: artificial intelligence, machine learning, data mining, data analysis, clustering, classification, data pre-processing, big data. It's hard to go anywhere now without hearing about AI and machine learning and data. Data, particularly: it's everywhere. Research has suggested that every two years we generate more data than ever existed before, so the amount of data is doubling every two years. Now, that is an absolutely, you know, astronomical amount of data, but the thing is, of course, that this data doesn't necessarily mean anything. You can create tables of data, but unless you understand what's in them and what they mean, you haven't got any knowledge, right? So there's a distinction between having data and having knowledge.

It's all very well saying, yes, as a species we're producing a huge amount of data, but actually a lot of it doesn't get used. A lot of it sits there on a hard disk waiting for someone to look at it, and that's kind of what we're talking about here. If we want to extract knowledge from data, we're going to need some tools and processes to do this in a formal way, and that's what data science is, right? And things like machine learning and AI have a place within it.

So perhaps if you do this for your job, then data analysis is going to be useful for you. Maybe your company's generating data and you want to analyze this data. But on the other hand, perhaps you're just a consumer and companies are using data on you. They're generating data on you, and actually they're profiting from data on you. These are sometimes life-changing decisions that are being made on your data, and so it's empowering to know how this process works.

A very simple example, which you might even do yourself: suppose you go online to book some flights for a holiday, and then you decide that actually two flights via an intermediate airport is cheaper than a single flight, right? You're doing data analysis. You're taking lots of different data sources and working out the optimal route, and this of course happens automatically as well, depending on the flight website that you're using.

All right, so this kind of stuff, you're already doing it. It's just a case of trying to formalize this process. So what do any of the things I listed at the beginning mean? Well, one problem is that everyone's definitions differ slightly, but also I think that a lot of these terms are used completely interchangeably. AI is the classic example. AI is everywhere, right? You can't buy a product without it having AI added to it. A lot of the time when you see "AI", we're actually talking about machine learning.

Machine learning is the idea that we're training a machine to perform a task without explicitly programming it to do so. A good example of AI that isn't machine learning would be, let's say, a mouse in a maze, where all you're doing is telling it to turn left or right at random. It's not learning anything, it doesn't understand what the maze is, but it will eventually get to the end, right? That's a kind of rudimentary artificial intelligence that doesn't involve learning anything. Machine learning is about not giving it conditions, not saying "if you're here, turn left; if you're here, turn right". It's just giving it examples and hoping it will learn to perform those tasks itself, right? So machine learning is a subset of AI, but the two terms shouldn't be used interchangeably.

If we're using machine learning, often what we'll do is train it based on samples of data. So we'll have some existing data set that we're trying to train on, and we're trying to use the machine learning to either tease out information or make predictions on this data. The problem is that not all data is sort of made equal. Some of it's noisy and messy. Maybe we don't know what it is, and don't know whether we can apply a certain technique to it. And so we need to clean this data up. We need to take this data, understand what it is, and extract some knowledge, so that we can then apply these AI or machine learning techniques to it. So this combination of things that can take data and prepare it in a way that we can then use it or understand it, that's data science.

There are quite a few ways we could do this data analysis throughout this course. We could use R, we could use Python, we could use MATLAB.

They all have their pros and cons. We're going to use R because it's free and it's really good for statistical analysis. It's got loads of great libraries. If you're really familiar with Python, then maybe that's what you want to start with for this kind of stuff, but, you know, we're going to be working with R.

We have our script area here, where we can write scripts and run scripts. You can save them and then come back to them later. Then there's the console, where we're going to be putting in, you know, specific commands. We have our environment, which is where all our variables and our data are held, and we can look at them there. And then we have plots: any plots, and you can do quite a lot of different plots in R, it's very versatile, are going to appear down here.

Okay, so you've probably got everything you need to get started with data analysis. In my opinion, the best way to get into R is just to kind of have a go. So I'm going to look at a few of the most obvious things that it does. It has a little bit of a learning curve, only because its syntax is slightly unusual. If you can program you'll be fine, but even if not you should get there pretty quickly.

Most of the time in R we'll be using either matrices, or vectors, which are kind of a special case of matrices, or maybe data frames. Data frames are a really nice aspect of R, which you can kind of think of like a table that you might have in Excel, except you've also got headings for your columns. So let's have a look at some of these things, and just a few of the things we can do with them, before we perhaps go into a little bit more detail in other videos.

So for example, we might look at our variable x, which I've created. x is a sequence going from 0 all the way up to a few multiples of pi, which I used to create this plot. That was only one line of code that produced that, and I've used it to create my plot by essentially saying y equals sine x and then just simply plotting that.

If you want to get a little bit more complicated, we can start looking at matrix data. So I created a CSV file with a Gaussian function in it.
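Going back to the sine plot for a moment, here is a minimal sketch of what those few lines might look like. The upper limit of 4*pi and the step of 0.1 are assumptions; the video only says "a few multiples of pi".

```r
# x: a sequence from 0 up to a few multiples of pi.
# The end point (4 * pi) and the step (0.1) are assumed values for illustration.
x <- seq(0, 4 * pi, by = 0.1)

# y equals sine x, computed element by element over the whole vector.
y <- sin(x)

# Plot y against x as a line; the result shows up in the plots pane.
plot(x, y, type = "l", xlab = "x", ylab = "sin(x)")
```

Note that sin() is applied to the whole vector at once, which is why no loop is needed.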

So essentially a two-dimensional array of values that get bigger in the center. Very straightforward. The CSV file is essentially a text file with commas separating those values. It's very easy to read and write these out of Excel and other packages, and so you'll often find data is passed around in this way, at least moderately sized data, if it isn't too huge.

I can load this in using the read.csv function. So I can say namedata... now, the arrow operator is essentially the assignment operator in R. Equals will often work, but I tend to try and use this one. So namedata, I'm going to assign read.csv, and the file is going to be norm.csv. And I've got no header for this file, so I don't want it to use the top row for the labels, so I'm going to say header equals FALSE. And that's loaded in namedata, and we can have a look. If we click on namedata here, you can see we've got the rows and the columns of our data.

We can look at individual elements in this array, so we can say data at position 3, 4, and that's going to be the third row down and the fourth value across. We can also leave one index empty and just have an entire row, or conversely an entire column, like this. And so it's very easy to take ranges of values. If you've got a huge table of data, you can be selecting certain columns, looking at certain columns, plotting certain columns. This is one of the reasons why R is very popular.

Quite often when you're looking at data, we'll actually be looking at something called a data frame. Now a data frame (I've got to load one up) is in essence a table of values, but they don't all have to be the same type. In an array, normally they'll all be floats or they'll all be integers.
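The matrix loading and indexing steps above can be sketched roughly as follows. To keep it self-contained, this first writes a small Gaussian grid to a temporary norm.csv; the 9x9 grid size, the temporary path and the variable name gauss_data are assumptions for illustration, not what the video uses.

```r
# Build a small 2D Gaussian "bump" (largest in the centre) and save it as a
# headerless CSV. Grid size and file location are assumed for this sketch.
grid <- seq(-2, 2, length.out = 9)
gauss <- outer(grid, grid, function(a, b) exp(-(a^2 + b^2) / 2))
csv_path <- file.path(tempdir(), "norm.csv")
write.table(gauss, csv_path, sep = ",", row.names = FALSE, col.names = FALSE)

# Load it with the arrow assignment operator. header = FALSE stops R from
# treating the first row of numbers as column labels.
gauss_data <- read.csv(csv_path, header = FALSE)

gauss_data[3, 4]        # third row down, fourth value across
gauss_data[3, ]         # column index left empty: the entire third row
gauss_data[, 4]         # row index left empty: the entire fourth column
gauss_data[2:4, 1:3]    # a range of rows and a range of columns
```

Because there is no header, R names the columns V1, V2 and so on by itself.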

In a data frame, there can be different things. So you could have first and last name next to age, for example. So I've just created a tiny little CSV file with some random people in it. Let's load this up. I'm going to say namedata, assign, read.csv, names.csv, and if I look at namedata, you can see that it's got three columns (first name, surname and age) and five rows, so there's five people in this data set.

And then you can index just like I did before, but now we can also index by the names of these columns. So I could say I want all of the first names, for example: I can say namedata dollar first name, and I can see all the different first names. So you can start to look at this data set in more and more detail. Obviously this is an absolutely tiny data set, but you get the idea. You could also look at individual instances, so we could say namedata and I want just the second row, for example. There we go: Bill Jones, and he's 18 years old. As we move through these videos, it's going to be very common for us to load in data sets like this, in this format, and then start to process them based on these data frames.

So perhaps an example, right? Let's imagine you're an online retailer, and someone comes into your shop and buys some things, and you're trying to understand what it is they do so that you can, let's say, send them emails to try and get them to buy more products, or show them recommended products, and things like this. So you want to try and build up a pattern of their behavior, right? And all you've got is what they click on, what they add to their basket and what they buy. So you've learned that they're looking at these kinds of items, and they look at these ones regularly, and then sometimes they just buy something completely random, seemingly, and that goes in their basket and gets bought straight away. Maybe it's a present, right? So maybe it's not tied to them as a person.

So you're taking all of this data, all of these purchases, all of these products they're looking at, and you're turning this into a kind of picture of this person. You're clustering that person in with other consumers that bought similar things, and trying to predict what they want to buy next, right? And that's when you send them an email to say: you should look at this one, because this one's really good, and you didn't buy it last time, but you'll definitely want to buy it this time.
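To make the earlier data-frame steps concrete (loading names.csv, picking a column by name, picking a row), here is a minimal sketch. The column names and four of the five people are invented for illustration; only "Bill Jones, 18" in the second row comes from the video.

```r
# Write a tiny names.csv to a temporary folder so the example is self-contained.
# Column names and all rows except Bill Jones are made up for this sketch.
csv_path <- file.path(tempdir(), "names.csv")
writeLines(c(
  "firstname,surname,age",
  "Anna,Smith,34",
  "Bill,Jones,18",
  "Cara,Patel,52",
  "Dev,Nguyen,27",
  "Ella,Brown,41"
), csv_path)

# header defaults to TRUE in read.csv, so the first row names the columns.
namedata <- read.csv(csv_path)

str(namedata)         # 5 rows, 3 columns, text and numbers side by side
namedata$firstname    # every first name, indexed by column name with $
namedata[2, ]         # just the second row: Bill Jones, aged 18
namedata[, "age"]     # the older bracket style still works too
```

The point of the data frame is visible in str(): the name columns hold text while age holds numbers, which a plain matrix would not allow.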

So we've got some data, and we want to extract some knowledge. What's the first thing we do? Well, we have to start to look at it and try and tease out some kind of information, or analyze this data. Data analysis is the idea of using statistical measures to try and work out what's going on.

This is kind of a cycle. We're going to analyze the data, so we're going to do a data analysis, and perhaps sometimes just using statistics to analyze the data isn't enough. You can't really learn everything about it. Yes, you can learn, you know, mathematically how it works, but you might not understand what it all means. So visualizing the data can be really helpful. What we'll also do is visualize the data. Visualization is going to be charting it, plotting it, trying to work out trends and links between different variables and things like this. And these kind of go back and forth, right? You could do both of these things numerous times and work out what we've got.

And then what we're going to do is pre-process the data. Often you'll find you're recording much more data than you actually need. This is certainly true of an online shop. I'm going to be looking at a lot of products that I don't end up buying, and that I was never really going to buy, you know, maybe a pipe dream, and they've got to sort of weed out this information to work out what it is that they might actually be able to convince me to buy, right? So you're going to pre-process the data, remove the nonsense and drill right down to the stuff that's really useful. This is pre-processing, and it's going to be a kind of cycle of analysis and visualization and pre-processing. We can repeat these things, and then we can really drill down and whittle down our data into the most usable sort of core of knowledge that we can, and get the most out of it.
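One pass around that cycle might look something like the sketch below. The browsing log is entirely invented, and the "viewed more than once" rule is just an assumed stand-in for whatever pre-processing a real shop would use.

```r
# Hypothetical browsing log for one shopper; every value here is made up.
views <- data.frame(
  product      = c("saw", "drill", "yacht", "hammer", "sports car", "sander"),
  times_viewed = c(12, 9, 1, 7, 1, 5),
  purchased    = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Analyse: simple statistical measures of what is going on.
summary(views$times_viewed)
avg_views_bought <- mean(views$times_viewed[views$purchased])

# Visualise: chart the same numbers to spot the odd ones out.
barplot(views$times_viewed, names.arg = views$product, las = 2,
        ylab = "times viewed")

# Pre-process: drop the pipe-dream items that were only glanced at once.
# The threshold of 1 is an assumption chosen for this toy data.
useful <- views[views$times_viewed > 1, ]
useful
```

In practice you would loop back: look at the summary and the chart again on the reduced table, and adjust the filtering if it threw away too much or too little.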

Now it may be that just analysing the data is enough, right? You've now obtained some knowledge, you kind of understand what the trends are, and maybe that was all you wanted to do. That's sometimes the case. But maybe what we actually want to do is take things a little bit further. We're going to use machine learning or modeling to try and model this system and predict what's going to happen next. So for example, in the case of an online shop, we might want to start predicting what people are going to buy next, and if we can do that, that's when we can send out these emails or flag things in their recommended items and get many more sales.

As an example, let's imagine that someone has spent a lot of time looking at DIY tools. I've, you know, recently moved house. I spend a lot of time doing DIY, and I'm always trying to buy new tools because it just seems like a good idea. So, you know, maybe I buy a certain kind of saw, and then a few months later they're starting to recommend me a slightly different kind of saw, that serves a slightly different purpose, that suddenly I definitely need. And I think, oh yeah, maybe I will buy that. And then in the end I have 10 saws and I don't know how to use any of the saws, but, you know, the retailer's job is done.

So if we want to extract this data, we're going to use machine learning or modeling to model this system and make predictions, right? Now, for example, we could cluster the data together.
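As a rough illustration of clustering shoppers together, here is a sketch using k-means from base R. The video does not name a particular algorithm, so k-means is an assumption, as are the two spending columns and all the numbers.

```r
# Hypothetical spend per shopper in two product categories (invented values).
shoppers <- data.frame(
  tools_spend = c(120, 150, 135, 10, 5, 20),
  toys_spend  = c(5, 10, 0, 90, 110, 95)
)

# k-means starts from random centres, so fix the seed for repeatable results.
set.seed(1)

# Ask for two groups; nstart = 10 tries several random starts and keeps the best.
fit <- kmeans(shoppers, centers = 2, nstart = 10)

# Attach each shopper's cluster label so similar shoppers can be looked up.
shoppers$cluster <- fit$cluster
shoppers

# The cluster centres describe the "typical" shopper in each group.
fit$centers
```

A retailer could then recommend to one shopper the things that other members of the same cluster have bought. The cluster numbers themselves (1 or 2) are arbitrary labels.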

We could link my purchase history with similar people. What are they buying? Can I be tempted to buy those things as well? Or maybe I'm very different from someone else, and so it's not a good idea to recommend me certain products, because I'm unlikely to buy those things.

Perhaps to use a different example: in the medical domain, it's quite common to classify people into kind of risk categories, so that we can maybe use preventative treatments. So every time I go to a doctor, they're going to collect data on me: on what's currently wrong with me, and what was wrong with me before. Combine that with, you know, standard data like how much exercise someone does, their family history, what their stress levels are, and things like this. We can combine all these things to make a prediction as to what they're at risk of in the future, so, you know, heart disease or something else like this. It could save someone's life if you spot that they're at risk of a certain thing and you can really advise that person to, you know, increase their level of exercise or alter their diet.

There are two other terms that we come across a lot, right? There's data mining and big data. Now, I'm not really sure what data mining is, because I don't think anyone is. It's a bit of a buzzword. Really, what data mining is, is a combination of pre-processing your data and maybe using clustering to extract some knowledge from it. So it's sort of a word that's come to be used in place of those things. If someone says they're doing data mining, that's what they're doing: they're pre-processing and extracting some knowledge from their data. It's a nice, it's a cool-sounding word. You're not actually mining anything, right? You're just doing what everyone else does on data.

Big data is the idea that maybe we've collected a lot of examples of something, you know, a huge number, or each of our examples is quite complicated and has a lot of variables. In that case, the amount of data we've got is sort of unwieldy. So I would argue, perhaps, that big data is data that you can't process on your laptop: you might be using cloud compute infrastructure, or certainly parallel processing in some way, to pre-process and analyze this data. So exactly where is the line, how big is big?

I don't know, but exactly where we draw the line, in some ways, is not really important. The idea is just that, with the amount of data we as a species are now producing, more and more of our data is becoming big data. But, you know, exactly where the cutoff is doesn't really matter.

What is data, right? I'm pretty sure that's data. Is this data? This picture, or that data? Is this data? What is data?