# Problems of Data Analysis

Module 1, Lecture 2: the two biggest problems with data analysis. Before getting into the two biggest problems, let's first review a couple of crucial concepts. A model and data set yield accurate estimates if the estimates are close to the true value. Similarly, a model and data set yield valid estimates if the estimates capture what they're supposed to be measuring. These two terms, accuracy and validity, are often used interchangeably. A model and data set yield precise estimates if different estimates generated from the same model and data are close together. Similar to precision, a model and data set yield reliable estimates if they provide similar estimates when you repeat the analysis. These two terms, precision and reliability, are often used interchangeably.

I like using the image of a bull's eye to visualize the difference between accuracy and validity on the one hand and precision and reliability on the other. If your analysis provides estimates that are all close to one another, then you've done a good job of providing precise, or reliable, estimates. However, your clustered estimates might still all be far away from the actual number you're trying to get at. In this case, your estimates are precise but inaccurate. Alternatively, you can come up with estimates that are not so close to one another, but on average they get you to the true value. These estimates are accurate but imprecise.

It's important to start off your analysis with a good understanding of the situation that you face: the data you have, where you are, and where you want to go. This is more likely to get you close to where you want to be. That is, you're more likely to end up with higher accuracy, as in panel B. Then you can start to hone your methods to get even closer, that is, increase precision.
If you jump right into the analysis, on the other hand, before taking stock, you're much more likely to end up with initial estimates that are off the mark, and then waste your time trying to hone your model to increase precision.

You end up at panel A, with precision but low accuracy.

Let's consider a specific example that illustrates accuracy versus precision. Suppose your task is to predict sales volumes for your company's latest gadget over the next six months. You have two different data sets from two different sources, which are supposed to be capturing the same information. You come up with a model to predict future sales using the data contained in the two data sets. You use your model with the first data set and get an estimate of 376,392 units; using that same model on the second set of information gives you an estimate of 50,000 units. You then tweak the model a bit and do a second round of estimations. The second round on the first data set gives you an estimate of 375,467 units, and using the second set of data you get an estimate of 55,000 units. Actual sales end up being 52,457 units. The first data set gives you estimates that are precise but inaccurate, whereas the second data set gives you estimates that are accurate but imprecise. There's probably some mismatch in the first data set between the data you have and the data you need. The second data set, on the other hand, appears to contain data that are much more aligned with what you need.

Now that we have a better understanding of the difference between accuracy and precision, we can move on to the two biggest problems with data analysis. The first problem is "torture the data and it will confess to anything," and the second problem is "you can have data without information, but you cannot have information without data." Both problems, torturing the data and meaningless information, lead to inaccurate and/or imprecise results. Let's start with the first big problem.
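The sales example can be sketched with a few lines of Python, using the lecture's numbers. The `spread` and `bias` helpers are hypothetical names introduced here for illustration: spread (the gap between repeated estimates) stands in for precision, and bias (the distance of the average estimate from the truth) stands in for accuracy.

```python
# Worked numbers from the sales example
true_sales = 52457
estimates_a = [376392, 375467]   # first data set: tightly clustered but far off
estimates_b = [50000, 55000]     # second data set: scattered but near the truth

def spread(xs):
    # Precision: how far apart the estimates are from one another
    return max(xs) - min(xs)

def bias(xs, truth):
    # Accuracy: how far the average estimate lands from the true value
    return abs(sum(xs) / len(xs) - truth)

print(spread(estimates_a), bias(estimates_a, true_sales))  # 925 323472.5
print(spread(estimates_b), bias(estimates_b, true_sales))  # 5000 43.0
```

The first data set wins on spread (925 vs. 5,000 units) but misses the truth by over 323,000 units, while the second misses by only 43 on average: precise-but-inaccurate versus accurate-but-imprecise.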
Torturing the data. Torturing the data is when you keep massaging the data, or modifying your analyses, until you get significant results.
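A minimal simulation can show why torturing the data eventually "works." Assuming pure-noise samples with no real effect, about 5% of analyses will still come out significant at the conventional 0.05 threshold, so an analyst who keeps re-running variations will eventually find one. The `p_value` helper is a hypothetical name for a simple two-sided z-test with known standard deviation 1.

```python
import math
import random

random.seed(42)

def p_value(sample):
    # Two-sided z-test p-value for "true mean = 0", assuming known sd = 1
    n = len(sample)
    z = abs(sum(sample) / n) * math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# Simulate 200 "studies" in which the null hypothesis is true:
# every sample is pure noise, so there is no real effect to find.
significant = sum(
    1
    for _ in range(200)
    if p_value([random.gauss(0, 1) for _ in range(30)]) < 0.05
)

# Roughly 5% of the noise-only studies come out "significant" anyway.
print(significant)
```

Run enough analyses on enough slices of the data and the false positives pile up, which is exactly the confession the torturer was after.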

The image at the bottom of the slide is my bad attempt at getting at the idea of having pieces of a jigsaw puzzle that don't quite fit together, but you make them fit by pounding them into place. Similarly, if you look hard enough at the data, that is, if you keep adjusting your data or your analyses, you'll eventually find something. But just because you've managed to pound the puzzle pieces together doesn't mean you've created the true picture. The next time you hear or read about the results of some study, remember this: most published research findings are false. The significant findings reported in most studies result from analysts torturing their data until they get significant results. John Ioannidis's paper "Why Most Published Research Findings Are False" is the most downloaded file from the PLoS website. The study's been downloaded two and a half million times, and it has been cited over 3,000 times. Torturing the data is a real problem in the world of data analysis.

The second big problem in data analysis is data without information. A 2014 IDC report indicated that a mere 1.5 percent of total data is target-rich; that means 98.5 percent of data are not very meaningful. It's really important to understand that just because you have a set of data doesn't mean it contains any real information. Only a very small portion of all data collected have any real value. Most data are garbage, and as the saying goes, garbage in, garbage out: if you use meaningless data in your analyses, your results will also be meaningless.

So, to summarize the two biggest problems with data analysis: torturing the data will not make up for false hypotheses or meaningless information; it only wastes time and gives you a false sense of generating an answer. And if you have meaningless data, you'll get meaningless results regardless of the quality or sophistication of your analyses.
