Exploratory data analysis


Exploratory data analysis, or EDA, is a method used by data scientists to analyze datasets and summarize their main characteristics. It helps determine how best to manipulate data sources to get the answers you need, making it easier to discover patterns, spot anomalies, test hypotheses, or check assumptions. In fact, it's quite a lot like hunting for buried treasure. Let me explain.

Meet Nate, the treasure hunter, and Sophie, the data scientist. When it comes to treasure and insights, they both go about things in much the same way. You see, Nate, the treasure hunter, starts out by identifying a potential treasure trove location. In the same way, Sophie, the data scientist, starts by identifying a dataset that looks promising. Nate then scopes out the area, looking for clues that there is indeed treasure to be found. And in the same way, Sophie looks at the dataset, looking for patterns or anomalies that could be exploited. Our treasure hunter then starts digging, looking for the treasure. The data scientist starts manipulating the data, looking for hidden patterns. And finally, on a good day, Nate finds the treasure and brings it back to be enjoyed. And Sophie? Well, Sophie finds the insights from the dataset and brings them back to the business to be used. So when it comes to finding what they're looking for, treasure and insights, you could say that Nate and Sophie, well, they have a lot in common.

So the main purpose of exploratory data analysis, or EDA, is to analyze and summarize datasets. Now, there are four primary types of EDA, which we can classify into two subgroups. There's univariate as the first subgroup, and then there's multivariate as the second subgroup. Univariate data is data that can be described using just one variable, while multivariate data is described using multiple variables. Now, within univariate there are actually two other classifications: non-graphical and graphical. The same split applies to multivariate, which is how we get to four types.
The main purpose of univariate analysis is to describe the data and find patterns that exist within it.
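To make that a little more concrete, here is a minimal sketch of univariate EDA in Python, using only the standard library. The `ages` sample and the helper names `summarize` and `histogram` are made up for illustration; they are not from any particular tool. The first helper covers the non-graphical side (summary statistics), and the second gives a rough text version of the graphical side by counting values per bin.

```python
import statistics
from collections import Counter

# Hypothetical sample of a single variable (ages), invented for illustration.
ages = [23, 25, 25, 29, 31, 34, 35, 35, 35, 41, 47, 52]


def summarize(values):
    """Non-graphical univariate EDA: basic descriptive statistics."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),
    }


def histogram(values, bin_width):
    """Count how many values fall into each bin, keyed by the bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


summary = summarize(ages)
bins = histogram(ages, 10)

for name, value in summary.items():
    print(f"{name}: {value}")

# A rough text histogram: one '#' per observation in each bin.
for start, freq in bins.items():
    print(f"{start:>3}-{start + 9:<3} {'#' * freq}")
```

In practice a plotting library would usually draw the histogram, but the bin counts are the same information a histogram's bars represent.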

And since univariate analysis deals with a single variable, it doesn't deal with causes or relationships. Now, common types of univariate graphics include stem-and-leaf plots, which show all the data values and the shape of the distribution. And there are also histograms, which are bar plots in which each bar represents the frequency or proportion of cases for a range of values.

Multivariate non-graphical EDA, well, that typically covers techniques that show the relationship between two or more variables of the data through cross-tabulation or statistics. And then multivariate graphics... Well, some examples of that include grouped bar charts, in which each group represents one level of one of the variables, and each bar within a group represents a level of the other variable. There are also bubble charts, heatmaps, and run charts as well.

Now, some of the most common data science tools that we have available for EDA, well, those include Python and R. Python and EDA can be used together to identify missing values in a dataset, which is important so you can decide how to handle missing values for machine learning. And the other language, R, is widely used among statisticians in data science for developing statistical observations and data analysis.

Using EDA, data scientists can identify obvious errors, better understand patterns within the data, detect outliers, and find interesting relations among the variables. They can also use exploratory analysis to make sure the results they produce are valid and applicable to any desired business outcomes and goals. And once EDA is complete and the insights are drawn, its features can then be used for more sophisticated data analysis or modeling. Like, well, like helping Nate find that buried treasure.

If you have any questions, please drop us a line below. And if you want to see more videos like this in the future, please like and subscribe. Thanks for watching.
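As a follow-up to the Python point above, here is a minimal sketch of two of the steps mentioned: counting missing values per column, and a simple cross-tabulation of two variables. It uses only the standard library, and the `records` data and the helper names `missing_counts` and `crosstab` are invented for illustration. A real project would more likely lean on a data analysis library, so treat this as one possible way to express the idea, not a recommended implementation.

```python
from collections import Counter

# Hypothetical records, invented for illustration; None marks a missing value.
records = [
    {"region": "north", "plan": "basic", "spend": 20.0},
    {"region": "north", "plan": "premium", "spend": None},
    {"region": "south", "plan": "basic", "spend": 18.5},
    {"region": "south", "plan": "basic", "spend": 22.0},
    {"region": "south", "plan": None, "spend": 40.0},
    {"region": "north", "plan": "premium", "spend": 45.0},
]


def missing_counts(rows):
    """Count missing (None) entries in each column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            counts[column] += value is None  # True adds 1, False adds 0
    return dict(counts)


def crosstab(rows, row_key, col_key):
    """Multivariate non-graphical EDA: count co-occurrences of two variables.

    Rows where either variable is missing are skipped.
    """
    table = Counter()
    for row in rows:
        a, b = row[row_key], row[col_key]
        if a is not None and b is not None:
            table[(a, b)] += 1
    return dict(table)


missing = missing_counts(records)
table = crosstab(records, "region", "plan")

print("Missing values per column:", missing)
for (region, plan), count in sorted(table.items()):
    print(f"{region:<6} {plan:<8} {count}")
```

Skipping incomplete rows in the cross-tabulation is just one choice. Knowing how many values are missing, and where, is what lets you decide whether to drop, fill, or flag them before modeling.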