# 01 Data Analytics: Statistics

Alright, let's get started with a lecture on fundamental statistics concepts. What you should take from this lecture is the background on statistics and its importance. I'm not going to cover the data in earth sciences as part of this, but I will talk about sampling, bias, and concepts for mitigation. In a separate lecture I'll talk specifically about the data that we have available in earth sciences.

So let's put some definitions out there first. Statistics is the science of collecting and pooling samples and making inferences. If we don't make a decision with our methodologies, they don't have any value, so we should say: making inferences to support decision making. Geostatistics is a branch of statistics with a focus on the geologic context; the spatial context, with spatial correlations; the scale of all the data and of the estimates we're trying to make; and the accuracy of the measurements, i.e., the uncertainty associated with everything we're working with, our data, our estimates and so forth. So that's geostatistics. It's a branch of statistics.

So how do we apply statistics in the subsurface in order to answer questions? Ideally we would start with some type of design. This is where we look at the fundamental question we want to answer and decide what information we need to collect to answer it. It's going to be a balancing act: how much can we afford to collect, and how much time do we have for the study? You'd hope that's how we proceed. Many times, in my experience, it's more the case that the data has already been collected and you need to go ahead and do something with that data.
In fact, part of the problem we deal with is that our data is so expensive in the subsurface. We often collect so little of it because of that cost, and we collect it to answer a variety of different questions, as I'll discuss when I talk about bias and sampling later on.

You'll find that other drivers, answering questions about the size of the prize and so forth, definitely do take precedence, and we may not have thoughtful design when it comes to collecting the data to assess reservoirs the way we want to model them. So sometimes we work with what we have.

Next is description. This is where you just look at the data and try to understand it, summarizing and analyzing the obtained sample data. This is data cleaning: looking for obvious errors, or perhaps subtle errors, in the data due to the way it was handled or collected. It's summary statistics: finding out, in general, how the data behaves. Check for trends and changes over time and in space (we'll talk later about stationarity), and segment the data, perhaps into distinct regions, if things are changing enough. This step often is 80% of the work in a reservoir characterization study, or in most subsurface-related geostatistical or spatial statistical studies: 80% of the time may be spent on data cleaning, summary statistics and so forth.

Then modeling. Here's where we take the data and try to move a little bit beyond it. We use the physics, interpretation, and proxies (proxy modeling, I should say) in order to understand the data better. We move beyond just the statistics and the descriptions, and we incorporate engineering and geoscience information to extract more from the data and, probably more importantly, to check the data. This is where we use our subject matter expertise and realize that this data has an issue, or that there's something we need to look at more closely, or that we need to go back and sample further, and so forth. That's the modeling side.
So the next step is statistical inference. This is the opportunity to basically look at the data and try to learn something from it. If it's multivariate, you have a bunch of different variables you're working with; if it's spatial, you've got things located at different locations over your space.

You can take the sample statistics from the description and the modeling and try to work out what's going on. The most complicated, difficult part of inference is to try to understand what's going on with the population: to truly go back and try to understand what's going on in the subsurface at all the different locations. Or it could be as simple as trying to understand the complicated interactions of all of the variables with regard to each other. This is a chance to kind of spend time with your data and try to learn about it. In the previous step we were using the engineering and geoscience more; in this step we're using more of the statistics.

Step number five, we get into prediction. We're trying to forecast at unsampled locations. This could be over space, so it could be spatial, or it could be temporal: we could be looking at what could go on in the future, specifically when dealing with flow simulation and such.

Step number six is where we try to develop models of uncertainty. We'll have a whole different video where we get into the details of subsurface uncertainty; there's a lot to deal with there. We'll try to develop a model of uncertainty for the variables of interest, and we'll try to account for all the different sources of uncertainty. There's spatial uncertainty, model parameter uncertainty, sampling uncertainty in the measurements, and so forth, and we have to combine all of these together.

Step seven, we take that uncertainty model and we make decisions: optimum decisions in the presence of uncertainty. We have some type of criterion that we're trying to maximize, the net present value, flow rates or something, and we're trying to pick the decision for development in the subsurface such that we maximize that result in the presence of uncertainty, which is generally represented by multiple realizations of the subsurface. Once again, that's a whole different topic to get into.
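As a rough illustration of step seven, here is a minimal sketch of picking a decision across multiple realizations. Everything specific in it is an assumption made for illustration: the `npv_for_decision` function, its cost and recovery numbers, the use of a single pore-volume number per realization, and the choice of expected (mean) NPV as the criterion. None of these come from the lecture.

```python
import random
import statistics


def npv_for_decision(n_wells, pore_volume, price_per_unit=1.0, well_cost=15.0):
    """Hypothetical NPV model: recovery saturates with well count, cost is linear."""
    recovery_fraction = 1.0 - 0.7 ** n_wells  # diminishing return per extra well
    return price_per_unit * pore_volume * recovery_fraction - well_cost * n_wells


def best_decision(decisions, realizations):
    """Return the decision with the highest mean NPV over all realizations."""
    expected = {
        d: statistics.mean([npv_for_decision(d, pv) for pv in realizations])
        for d in decisions
    }
    return max(expected, key=expected.get), expected


# Each realization is reduced to one uncertain quantity (an assumed pore volume).
rng = random.Random(73)
realizations = [rng.gauss(100.0, 25.0) for _ in range(200)]

choice, expected_npv = best_decision(range(0, 11), realizations)
print("wells to drill:", choice, "expected NPV:", round(expected_npv[choice], 1))
```

The point of the sketch is only the structure: every candidate decision is evaluated on every realization, and the decision is chosen on the summary over realizations rather than on any single model of the subsurface.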

As I've said before, and I'll probably continue to repeat myself, our models only add value when they impact a decision. If you're off in your corner at some company, or working for some agency, doing your modeling, and it's never used to impact a decision, then your modeling does not add any value. An example in the subsurface, for natural resource exploitation, would be: how many wells and where? What injection rates should we be using for a waterflood? It could be natural resources like oil, water and gas. It could be environmental remediation. This could be used for anything dealing with the subsurface, even geotechnical design, if you're concerned about building tunnels in mining and so forth.

So let me give you a really simple example. One thing I want you to notice is that, in general, we don't always use exactly all of the steps. We may improvise; we may simplify our workflows. The first thing is that we want to answer a question about the subsurface: we want to understand the spatial distribution of porosity over this area of interest. We have this space right here, X and Y, all in meters, and we have this data that's given to us. Perhaps, up front, we were involved in the discussion or the decision about where these data were collected. In the subsurface it's going to be wells or drill holes, and perhaps we were part of deciding where to drill. It's always going to come down to: are we trying to test a hypothesis about the subsurface? We may hypothesize that there is a better area, with higher quality porosity, in this region right here, so we sampled here and then we sampled around it in order to test that hypothesis. We might look at various control variables. Maybe we're in fact trying to test porosity.
But at the same time, we're going to remove the effects of compaction trends or other types of features from it, so we'll look at holding those constant, or standardizing or normalizing for them.

We're going to pull all the available samples together, and the next step is statistical description. Here we're simply looking at a frequency distribution. This is a binned version of the probability density function, because the axis right here is in probability. We can look at the shape and the overall form, the min and the max. We might look at the mean and the variance, but given that the distribution is multimodal, the mean and the variance (the measure of spread) won't be as meaningful to us. We have looked at this and determined that it is clearly multimodal. In fact, it may be the combination of the two Gaussian distributions shown here: one with a higher frequency but lower porosity values, one with a lower frequency but higher porosity values.

So you might turn to modeling. You might ask: what would be the cause of that type of distribution? Maybe there's a natural break in porosity due to some type of physical process. Maybe we have two different types of deposition within this area, maybe some type of cementation of grains, maybe some type of compaction trend on the flank. We would then be able to use that to assure ourselves, or to build a reasonable defense, for recognizing that there are two separate segments in this data set, two different things going on.

Then we get to inference and prediction. We'd recognize that there is in fact this relationship in the data, and then we start to have a predictive model, where we'd say that this area right here would have systematically higher porosities, and this area right here is the lower porosity distribution, which is out here. So now we're mapping two distinct regions.
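To make the description step concrete, here is a small sketch that builds a binned frequency distribution for a synthetic bimodal porosity sample. The data are invented for illustration (two Gaussian modes with assumed means, spreads and proportions); they are not the values from the slide.

```python
import random
import statistics


def frequency_histogram(values, lo, hi, n_bins):
    """Bin values into n_bins equal-width bins; return fractions that sum to 1."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # a value equal to hi goes in the last bin
        counts[idx] += 1
    return [c / len(values) for c in counts]


# Synthetic sample: a frequent low-porosity mode and a less frequent high-porosity mode.
rng = random.Random(13)
low_mode = [rng.gauss(0.08, 0.015) for _ in range(700)]
high_mode = [rng.gauss(0.20, 0.02) for _ in range(300)]
porosity = [min(max(p, 0.0), 0.30) for p in low_mode + high_mode]  # clamp to the plot range

freq = frequency_histogram(porosity, 0.0, 0.30, 15)
for i, f in enumerate(freq):
    print(f"{i * 0.02:.2f}-{(i + 1) * 0.02:.2f} {'#' * round(f * 100)}")

print("mean:", round(statistics.mean(porosity), 3))
print("standard deviation:", round(statistics.stdev(porosity), 3))
```

In this synthetic case the mean lands in the sparse interval between the two modes, which is why the mean and variance alone say little about a multimodal distribution.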

We're breaking up our reservoir, our subsurface setting, into two distinct regions, which provides us with a pretty strong prediction model.

So let me go ahead and comment on this; it's a really fun read. Hadley Wickham is chief scientist at RStudio and is known for the development of open-source statistical packages for R, specifically around the idea of making statistics accessible and fun. You can go ahead and check out his short paper; I put the link. I guarantee you it's a very short read. It's "Teaching Safe-Stats, Not Statistical Abstinence." So there's a little bit of tongue-in-cheek here, a little bit of fun with it, but you'll see a really great message in it, which I really appreciate, and that's why I put a slide in and mention it in the class.

If we're involved in teaching statistics (I'm doing that right now, and that's part of what I do here at the University of Texas at Austin), we need to rethink the statistics curriculum, or we risk becoming irrelevant. Statistics tends to be taught as: avoid, unless you are a statistician (or maybe a geostatistician, I hope) or have one available to support you. Otherwise you could cause great harm. Danger, risk, abstain. But there are not enough professional statisticians. In my professional career I've only encountered probably two co-workers within the oilfield, within the energy sector, who in fact had PhDs in statistics, and one of them was actually working in a mining group. So they're not going to be very common. Rather than stigmatize the amateur, we need to provide tools that are safer to use. We need tools that are easy and fun to use and that encourage the use of statistics. They need to have flexible grammars.
In other words, they have basic building blocks, a minimal set of independent components, that you can put together into workflows to get the job done. And I would suggest that they need to be somewhat foolproofed, in the sense that they have the ability to detect when you're just doing things wrong. Are you using too few data for a highly parameterized model fit, i.e., are you overfitting, and so forth?

The other thing is that we should teach and understand that coding is really central to much of what people do in the scientific and engineering communities. We need people to go for it. Teaching them programming, even in the first courses, is achievable. So in this class we will definitely be teaching R coding, and Python too at the same time.

So what's my job? To teach safe methods for using geostatistics and statistics. We're going to use R. The great thing about R is that so many of the methods are really well documented, and with a single command you can complete really important tasks, with a lot of different outputs for interpretation and for understanding what happened with your model. So it's very powerful. The same goes for Python, and we'll use these packages in order to do it. Okay, so next class we'll need to install Anaconda and RStudio on your laptops, and we'll be getting started, probably in the next week or so, doing some coding and working with workflows.

Okay, so let's talk about some sampling definitions. A variable is any property that's been measured or observed in the study. It could be porosity, permeability, mineral concentration, saturation, contaminant concentration, and so on. In data mining and machine learning this is known as a feature. If we're dealing with prediction, then we'll break up our variables into predictors, which tell us something, and the response, the thing that we're trying to predict with the predictors.

The population is the exhaustive, finite list of the properties of interest over the area of interest. Generally the entire population will not be accessible. In the subsurface you would literally have to strip-mine it and lidar it, image it at the resolution that you require, in order to capture the entire population.

That's not possible. We work at great depths. We generally sample one trillionth of the reservoir, but the population is the entire reservoir. It would be a three-dimensional representation of the reservoir, at the scale that you need to work at, with all of your variables defined at every single location, like a very fine mesh, so that you'd fully understand what's going on in the subsurface. That's the population. But we don't have access to the population. It's too bad we don't.

Instead we have the sample: the set of data that has actually been measured. You drilled into the reservoir, you imaged it with seismic, you did something in order to measure the subsurface, and now you have it. An example is porosity data: you may have a set of well logs that measure the density and so forth, and the fluids within the rock, and you've used that to assess porosity along the trajectory of a well.

A parameter is a summary measure of the population: the population mean, the population standard deviation. But we don't ever have access to this; we only have access to an estimate of it. So we don't know the mean porosity over an entire reservoir. We don't know the mean permeability or, maybe more importantly, the variance of permeability over the reservoir. What we have, in fact, are statistics. Statistics are summary measures of the sample. That's what's available to us. So we have the sample mean and the sample standard deviation, and we use these statistics to make estimates of the parameters for the entire population that we don't have access to.

Let's make some comments about data cleaning and show a really nice illustrative example. I went online, to the Bakken in North Dakota, and downloaded the historical data sets for gas produced per month from the Bakken and the amount of gas that's flared. Orange is the gas that's produced, blue is the amount that is flared, and this is on the basis of every six months or so.

Yeah, maybe every four months or so. Okay, so we have a nice temporal data set showing what's going on over the entire Bakken. What types of questions can we ask from this? If you look at it, it might be very difficult to discern too much. Well, we can say that production has been increasing systematically; there's been a leveling off and then it increased again. We could say that gas flaring increased and then decreased a little bit, perhaps.

What if we were to do a little bit of a calculation here and just take the ratio of the flared to the produced, so what percentage of the produced gas was flared? This is a very simple manipulation of the data. What do we learn from the data when we apply it? This is part of data cleaning: learning about the data, checking it, seeing what's going on, asking whether it makes sense.

So if we do that, what do we see? Well, you might actually be able to start inferring some very interesting things about this temporal data. First of all, we have a general trend towards a decreasing proportion of produced gas being flared. This might be early utilization: you started producing gas, didn't have the infrastructure in place everywhere, and just started to figure that out. Then we could have a second phase where we have stable, low-level production. We've got infrastructure in place and we're not flaring that much. Then we have a sudden increase in production. Things are starting to ramp up, the facilities for gas handling have not kept up, and there's a higher flaring proportion. Then we have a fourth stage, where infrastructure catches up again and we're back to being able to reduce the flaring. This was a very simple exercise of just dividing one variable by the other, calculating a ratio, and what we found was that clear patterns emerged.
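The ratio calculation described above can be sketched in a few lines. The numbers here are made up for illustration and are not the Bakken data; the last record is deliberately inconsistent so that the check has something to find, and the 40% cutoff in `flag_suspect` is an assumed threshold, not an established rule.

```python
def flared_fraction(produced, flared):
    """Return flared / produced for each period; None where production is zero."""
    return [f / p if p > 0 else None for p, f in zip(produced, flared)]


def flag_suspect(fractions, max_plausible=0.4):
    """Return indices of periods whose flared fraction is missing or implausible."""
    return [
        i for i, r in enumerate(fractions)
        if r is None or r < 0.0 or r > max_plausible
    ]


# Hypothetical gas volumes per period (arbitrary units).
produced = [10.0, 20.0, 40.0, 45.0, 90.0, 120.0, 150.0, 160.0]
flared = [3.5, 5.0, 4.0, 4.5, 27.0, 30.0, 15.0, 80.0]

fractions = flared_fraction(produced, flared)
for period, r in enumerate(fractions):
    print(period, f"{r:.0%}")

print("periods to double check:", flag_suspect(fractions))
```

A flagged period is not automatically an error. It is a prompt to go back to the source data and to the people who know the field.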

These are patterns that we may not have seen otherwise. Maybe, if we had done this, we would have identified that there were unreasonable values. Maybe the ratio went up to 40 percent and it just doesn't make any sense at all, and so we might go back and say: hey, there's something wrong with this data set. These two variables are not agreeing with each other, so we need to look at whether there's an error in the data. So this could be part of data cleaning.

Now, forecasting is a very interesting thing. Forecasting future production, moving beyond the sample data set, would be very hard to do. You could treat it like a purely statistical problem. You could just fit a trend line. You could get kind of complicated about it and fit some type of trend that tries to capture different shapes and features, or you could just apply a linear trend. But maybe we don't have enough data. What else would we need to do a good job of forecasting into the future? We'd probably want to know the number and location of new wells. What's the schedule for new drilling? When are those wells going to be completed and come online? What's the decline rate for the available wells right now? What's going on as far as downtime and reworking? All of this information would greatly improve our forecast into the future. So context and domain knowledge are essential. We should never treat our statistical workflows like they're just statistics. We should use statistics to support our subject matter, our domain knowledge, in trying to answer the various types of scientific questions we have with regard to the subsurface.

So what do we actually sample? When you're working with the subsurface you might be working with just one-dimensional data sets, and that might be enough. Perhaps it's one-dimensional along a vector, or it might be some type of temporal data set like we just showed. In space, you might be concerned about vertical variations.

Are there trends vertically? Are things changing vertically in mechanical properties, mineralogy, porosity and so forth? Or you might be looking at a single well, at how things cycle and change along the well, and asking whether that's related to some type of information that would help you predict away from the well, which would be really important.

You could also be working with 2D sampling for spatial interpretation. In this case geologic maps are the most common two-dimensional data set we have. This is because many of our reservoirs are thin relative to their areal extent. You may have a reservoir unit that's only 10 meters thick, or maybe a unit that's 100 meters or so, but the reservoir extends for kilometers in both directions. In that case you may find it very useful to simplify the data set, treat it like a map, and do all your modeling in two dimensions. You may also be working with spatial analysis of a thin section, where we take rock and cut it down really thin, to the point that we can put it on a microscope, shine light through the very thinly cut crystals and rock structure, and analyze the void and rock surfaces and spaces and so forth.

You might be working with three-dimensional samples for spatial interpretation. In this case you have three-dimensional seismic volumes, or you might be working with sets of well logs that have been correlated, so now we have a full three-dimensional representation of what's going on.

So let's talk about data and try to classify it based on what we are dealing with, what specifically we have. We could have categorical or continuous data. Categorical data takes on discrete values, and we can talk about two different types of categorical data: nominal and ordinal. Nominal is something in which there's no natural ordering. A perfect example of that would be mineralogy categories.

It could be quartz. It could be feldspar. It could be some other type of mineral component, and there's really no way to say that quartz is higher than feldspar, which is higher than maybe some carbonates, limestone or dolostone, dolomite or whatever. There's no natural ordering in that. They're just different things.

Categorical ordinal: there's ordering in the categories. One example is geologic age. It could be Paleozoic, Mesozoic, Cenozoic and so forth, and you could break it down even finer than that. But there's an order: we know that the Mesozoic came after the Paleozoic and that the Cenozoic was after that. There's hierarchy within it, and we know the order. Hardness is another good example. The Mohs hardness scale is pretty useful for determining what type of mineral you're working with if you have a hand specimen. The Mohs hardness scale is basically just a number from one to ten, where you assess hardness based on what scratches what when you compare two minerals to each other. You have type minerals, quartz and such, and a whole bunch of other ones, and then you determine: okay, this has this hardness, it scratched this mineral, it didn't scratch that mineral. Talc was the one I was thinking of and had blanked on. Okay, so that's categorical.

Continuous: there are two different types of continuous. The first thing about continuous is to think of a continuum, like a timeline. Okay, so my timeline goes one, two, three, four, five, six. Those were six different numbers, but between one and six you could have 1.2, or 1.2354, and so forth. You could have any value in between. It's a continuous representation.

There are no discrete categories. There's no jump. Anything could happen in between one and six. That would be a continuous variable.

Now, an interval variable is one where the intervals are equal, and to be continuous a variable has to have at least this. What does that mean? A perfect example is the temperature scale that we're all used to, the Celsius scale. We have zero degrees, and then we have 10 degrees, 20 degrees, 30 degrees. The difference in temperature between 0 and 10 is the same as the difference from 10 to 20 and from 20 to 30. The scale doesn't change; it remains the same amount of change. If you look at the fundamental physics of temperature, the kinetic energy of molecules and so forth, it is a continuous scale, and it has equal intervals.

Ratio means that the numerical value truly indicates the quantity being measured. If you think about the Celsius scale or, if you're American, the Fahrenheit scale, the choice of the datum was arbitrary. The Celsius approach is to base it on the melting of water for zero and the boiling of water for a hundred, and then they made everything else work. The Fahrenheit scale, well, I'm Canadian, I'm not going to defend or explain the Fahrenheit scale, but you can see that they assigned a different datum and put a different magnitude on each one of those increments. On the Kelvin scale, zero Kelvin is related to a fundamental physical process, the energy of the molecules and their movement and so forth. I'm not a physicist, but there is a physical meaning to it. Porosity: zero percent porosity means no void in the rock; 100 percent porosity means you have just voids. Permeability, saturation and so forth are ratio continuous variables. They have physical meaning in their exact values.

Okay, other types of data: quantitative data and qualitative data.
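Before moving on, the four measurement scales described above (nominal, ordinal, interval, ratio) can be sketched in code. This is only one possible way to encode them; the class names and the choice of Python's `Enum` and `IntEnum` are assumptions made for illustration.

```python
from enum import Enum, IntEnum


class Mineral(Enum):
    """Nominal: categories with no natural ordering."""
    QUARTZ = "quartz"
    FELDSPAR = "feldspar"
    DOLOMITE = "dolomite"


class Era(IntEnum):
    """Ordinal: the order is meaningful, the numeric gaps are not."""
    PALEOZOIC = 1
    MESOZOIC = 2
    CENOZOIC = 3


def celsius_to_kelvin(t_celsius):
    """Shift from the arbitrary Celsius datum to the physical zero of Kelvin."""
    return t_celsius + 273.15


# Ordinal categories can be compared.
print(Era.MESOZOIC > Era.PALEOZOIC)  # True

# Nominal categories cannot: plain Enum members do not support "<".
try:
    Mineral.QUARTZ < Mineral.FELDSPAR
except TypeError:
    print("minerals have no order")

# Interval: equal steps are equal amounts of change on the Celsius scale.
print((20.0 - 10.0) == (30.0 - 20.0))  # True

# Ratio: a ratio is only meaningful when zero is physical (Kelvin, porosity).
ratio_c = 20.0 / 10.0  # 2.0, but 20 C is not "twice as hot" as 10 C
ratio_k = celsius_to_kelvin(20.0) / celsius_to_kelvin(10.0)
print(ratio_c, round(ratio_k, 4))
```

The practical consequence is which operations make sense: counting for nominal data, ranking for ordinal data, differences for interval data, and differences plus ratios for ratio data.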

There's a very simple way to explain quantitative and qualitative data. Quantitative data is something that can be written in numbers; you put numbers on it. Most of the things we work with as engineers are going to be quantitative. Qualitative data is information about quantities that you cannot directly measure; it requires an interpretation of the measurement. A really good example for the subsurface would be rock type and facies. You would look at the rock, and you'd probably look at porosity, permeability and other types of quantitative data, but you make an interpretation. You say this rock is this thing, but it's not like we put a number on it.

If you want examples, right here there's a well log. We probably have gamma ray and spontaneous potential over here, and we can look at actual numbers from the subsurface. That's quantitative. As soon as we say, well, this is sandstone, or this is a certain rock type, or dolostone, wackestone or boundstone for carbonate reservoirs, we now have a qualitative type of data. I'm not suggesting that qualitative data is not valuable, just that it has a layer of interpretation, and we need to understand the uncertainties related to that interpretation.

Another way to consider data is as hard data and soft data. Hard data is data for which there's a high degree of certainty. Usually hard data is something that we measured in the subsurface: we collected a sample, or we have a tool for indirect measurement that has a high degree of precision. An example is well core. If you extract a well core and subject it to a porosity test, that's pretty certain. We have a pretty good understanding of what's going on with porosity. We could also have a combination of log-based information that provides us with a pretty accurate measure of porosity.

At times we could consider that to be hard data too, depending on its level of precision. We might also feel that our lithofacies, our definition of lithology-related facies or categories of rock, are very accurate, and we think that they're pretty good from the well log suite that we're working with. We could also consider that to be hard data.

Soft data is data that provides indirect measures of the property of interest, so it includes a significant degree of uncertainty. For example, you could imagine that if you had seismic information all over your reservoir, like shown in this example, it may provide you, away from the wells, some indication of the quality of the rock, but there would still be a significant degree of uncertainty with regard to that. So at each location there wouldn't be a single value. There would be a distribution of possible values for porosity, informed by seismic. So that's soft data. It's uncertain.

Another way to describe data is as primary data and secondary data. Simply stated, primary data is the variable of interest. That's the variable you're working with. It's the target that we're trying to model, the target we're trying to understand. Secondary data is any other variables or features that are used to provide information about the primary data, in order to support the process of understanding the primary data. It's any type of secondary support for understanding the primary data. An example could be porosity. You have measured it in cores and logs and you're trying to build a full three-dimensional model of porosity, but you don't have porosity at all locations. You do have acoustic impedance, which has been measured by seismic, remotely and indirectly, in the rock, and so you could support the modeling of porosity away from the wells using this indirect measurement. That is a secondary variable.

Now let's talk about the data types that are available to us.
We're concerned about coverage, scale or support size, and information type.

Coverage is what proportion of the reservoir, or the population, this data has actually sampled, that is, the extent over which this data is available. So you can imagine, if you're dealing with well logs, they're only providing you information probably a couple of feet or meters into the subsurface away from a well location. If you're dealing with seismic, its coverage can in fact be very good. You can have almost complete coverage of the reservoir, but at a very low resolution everywhere in that case.

The scale or support size is the scale of the individual data measures. Pore scale would be measures that tell us about micrometers or millimeters of the subsurface: individual grains and voids. We could have logs that provide us with cubic-centimeter scale, but more likely on the order of a cubic meter or so, table scale. And we could also have measurements like seismic. If we're dealing with subsalt, we've got low resolution because we don't have a lot of high frequencies involved, and we could be in a situation where our vertical resolution is on the order of tens of meters. In that case we're talking about reservoir-unit scale information, very coarse scale information.

Information type: what does the data tell us about the subsurface? It's telling us about the grain size, the mineralogy, the flow type, the layering, the overall directionality in the subsurface, barriers, baffles and conduits for flow, and so forth.

Right. Let's talk about how we get a representative sample. What methods would be used to get a sample for which we could calculate statistics, such that those statistics would be unbiased? They'd provide us the best indication of the population for the amount of data we can gather. The best way to do that is random sampling, which means that every item in the population has an equal chance of being chosen and there's no correlation between the locations that you choose: randomly sampling throughout the reservoir. Well, is random sampling sufficient for the subsurface?

Is it something that would benefit us? Is it even something we could do? Just imagine going into your boss's office and telling them that you would like to drill the next well, which costs, subsea in a deep marine setting, one hundred fifty million dollars, at random within the reservoir. You may have to look for a new job. Random sampling is usually not available, and it would in fact not be economic.

We don't want to change the way that we sample. Data is collected to answer questions within the subsurface. Specifically, we drill our initial wells to understand: is there a reservoir? We drill our subsequent wells to test the limits spatially: given we had reservoir there, how far can we go away and still have reservoir? That's important because it helps us assess the size of the prize in order to book reserves and so forth, and that adds value to the company. So we can't change that; this type of sampling approach is very useful. The wells are located to maximize future production, to provide maximum value to the project. Wells can also be dual purpose, for appraisal and for injection or production. And so random sampling would not be a good idea.

Regular sampling could be used to try to get representative samples. This is where the samples are drilled on some type of regular spacing. But there's a warning there: if you have regular sampling at some interval, imagine there was some type of cyclicity within the reservoir that just happened to resonate with your sampling frequency, the intervals in space. You could create a bias like that. So what do we have when we're working? Well, we have to just accept it: we have biased data. In fact, this applies whenever you're working with the subsurface and working directly with the raw data.

You should question that. You should be concerned about just working with raw data to calculate statistics and to make any types of decisions. We also have opportunity sampling: we may have issues with access to the subsurface, we may not be able to get to certain depths, and we may have something getting in the way, drilling hazards and so forth. So we have to account for this bias.

We'll talk more about debiasing later, but let's just give a couple of ideas. First, let's go ahead and recognize that this is aggravated even further. If you drill a well that's located in a biased manner, then you're going to extract core from that well, and that core is also going to be extracted in a biased manner. In fact, you're never going to take a section of the core that's just shale and send that off for very expensive standard core analysis or any type of advanced analysis. You're not going to do that. Core plugs are then often extracted from core samples for additional analysis, and those in turn are going to be sampled in a biased manner. So you see there are three different orders, a hierarchy of bias, going on with our sampling.

So how do we address this bias? Let's take this very simple example right here. We have a two-dimensional map. It's in feet, so a couple of miles by a couple of miles, and we have our wells. We'll assume they're vertical wells, and we've averaged porosity over the reservoir interval. I don't have a color scale here, but blue is going to be low porosity and orange is going to be high porosity. Would it be fair to take the average of all of these samples and to suggest that that is in fact the average porosity of this entire reservoir? I think we can agree that up here in the high porosities, and even here in the high porosities, we sampled much more, while here in the low porosities we haven't sampled as much. And so we have a biased-high estimate of porosity if we just take the average of the samples.
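A small numerical sketch may help make this clustering effect concrete, together with the kind of weighted average that is discussed next. The porosity values and the weights below are hypothetical, chosen only for illustration; in practice the weights would come from a declustering method applied to the actual well spacing, and the function names are assumptions rather than any particular library's API.

```python
# Hypothetical well-averaged porosities (fractions); illustrative values only.
# Four closely spaced wells in the high-porosity area and two wells in the
# sparsely drilled low-porosity area.
porosity = [0.22, 0.23, 0.21, 0.24, 0.10, 0.12]

# Assumed declustering weights: the clustered wells share the area they
# represent, so each gets a small weight; each sparse well represents more area.
weights = [0.5, 0.5, 0.5, 0.5, 2.0, 2.0]


def weighted_mean(values, weights):
    """Weighted average of the values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_cdf(values, weights):
    """Return (value, cumulative probability) pairs using the weights."""
    total = sum(weights)
    running = 0.0
    cdf = []
    for v, w in sorted(zip(values, weights)):
        running += w
        cdf.append((v, running / total))
    return cdf


naive_mean = sum(porosity) / len(porosity)
declustered_mean = weighted_mean(porosity, weights)
cdf = weighted_cdf(porosity, weights)

print(f"naive mean:       {naive_mean:.3f}")
print(f"declustered mean: {declustered_mean:.3f}")
```

With these made-up numbers the equal-weighted average is pulled toward the densely drilled high-porosity wells, and the weighted average is lower. The same weights carry through to any other statistic, including the cumulative distribution.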

So what do we do? The first defense against this type of bias is good geologic mapping. You map from the data and you identify that there's a low-quality, low-porosity area, a medium-quality, medium-porosity area, and a high-porosity region, and you deal with each of them separately. You break up the model into subsets, and for each one you calculate the average porosity over that region. If you were to use the average porosity in each one of these regions and put those together to get an average porosity over the entire area of interest, that would be a pretty good estimate. You've probably done a pretty good job of avoiding some of the bias in the sampling.

Another way to do it is to look at the spacing of the data, just the amount of space between the well data, and then to assign a weight to each datum. In this part of the reservoir the data is very sparse, so it should receive a greater weight: it's representative of more area or volume within the reservoir. This data right here should receive a low weight, and these data right here may receive a nominal, more medium weight. If you do that, you can calculate any sample statistic, in fact the entire CDF (cumulative distribution function), using the declustering weights. So now you have data values at all these locations, and at each location you're also going to have a weight. You have an additional variable to deal with, but this is very powerful. This is a good way of accounting for bias.

Bias is not just due to sampling. Bias is everywhere. It turns out that these brains, which developed in hunter-gatherer society while trying not to be eaten by larger predators, are full of all kinds of biases that actually helped us back then but may not be helping us right now as we're trying to do scientific, statistical studies of the subsurface.

I'll give you some examples right here. For instance, you have anchoring bias: too much emphasis put on the first piece of information. It's very interesting, actually. They've shown that I could ask you a question that you may not know very much about, like what is the age of Kevin Bacon? I don't know, maybe you're a Bacon fan. But suppose that before I asked, I first said 13 is my favorite number, and then I said, what's the age of Kevin Bacon, and I asked 40 students. Then I repeated that exercise with another group, where I first said 90 and then asked, what's the age of Kevin Bacon? Statistically speaking, that second group would actually have a higher estimate on average than the first group. And it's because anchoring works even if it makes no sense, and that's the scary part of it. You heard 13 and you thought, that's a very small number, that's a very young age, that doesn't make sense. And then when you thought about Kevin Bacon you thought, well, he's not 13, he's older. And so you anchored and you went up from there. With 90 the opposite thing happens, and you're going to estimate a higher age.

Availability heuristic: this is where we put too much weight on things that are easy to observe. This happens all the time, with all kinds of anecdotes, and it is very dangerous. Bandwagon effect. Blind spot effect: you don't even know your own biases. Choice-supportive effect. Clustering illusion. Confirmation bias is huge in a subsurface team: you get a hypothesis, you have a theory now and everybody's excited, and you start to maybe ignore data that contradicts the theory. New information is only used if it supports the current model. Recency bias: you favor the most recently collected data. And survivorship bias, a really important one, where we focus on the success cases and there was some type of filtering in the data that we're not accounting for.

So that was the end of our discussion around statistics.

I hope this was helpful. As usual, go ahead and let me know in class or email me. I'm easy to find: I'm Michael Pyrcz, a professor at the University of Texas at Austin. Contact me if you have any ideas for improvement.

You'll find that in fact other drivers, answering questions about the size of the prize and so forth, definitely do take precedence, and we may not have thoughtful design when it comes to the data in order to assess reservoirs the way we want to model them. So we work with what we have sometimes.

Description. This is where you just look at the data and try to understand it, summarizing and analyzing the obtained sample data. This is data cleaning: looking for obvious errors, or perhaps subtle errors, in the data due to the way that it was handled or collected. It's summary statistics: finding out, in general, how the data behaves. Check for trends and changes over time and in space (we'll talk later about stationarity), and segment the data, perhaps into distinct regions, if things are changing enough. This data cleaning step is often 80% of the work in a reservoir characterization study, or in most subsurface-related geostatistical or spatial statistical studies: 80% of the time may be in this step of data cleaning, summary statistics, and so forth.

Modeling. Here's where we take the data and we try to move a little bit beyond the data. We use the physics, interpretation, and proxies (proxy modeling, I should say) in order to try to understand the data better. We move beyond just the statistics, the descriptions, and we incorporate engineering and geoscience information to extract more from the data and, probably more importantly, to check the data. This is where we use our subject matter expertise and realize that in fact this data has an issue, or that there's something we need to look at more, or that we need to go back and sample further, and so forth. That's the modeling side.
So the next step is statistical inference. This is the opportunity to basically look at the data and try to learn something from the data. If it's multivariate, you have a bunch of different variables you're working with; if it's spatial, you've got things located at different locations over your space.

You can take the sample statistics from the description and the modeling and try to work out what's going on. The most complicated, difficult part of inference is to try to understand what's going on with the population: to truly go back and try to understand what's going on in the subsurface at all different locations. Or it could be as simple as just trying to understand the complicated interactions of all of the variables with regard to each other. This is a chance to spend time with your data and try to learn about it. In the previous step we were using the engineering and geoscience more; in this step we're using more of the statistics.

Step number five, we get into prediction. We're trying to forecast at unsampled locations. This could be over space, so spatial, or it could be temporal: we could be looking at what could go on in the future, specifically if we're dealing with flow simulation and such.

Step number six is where we're trying to develop models of uncertainty. We'll have a whole different video where we'll get into the details of subsurface uncertainty; there's a lot to deal with. We'll try to develop a model of uncertainty for the variables of interest, and we're going to try to account for all the different sources of uncertainty. There's spatial uncertainty, model parameter uncertainty, sampling uncertainty in the measurements that we sample, and so forth, and we have to combine all these together.

Step seven, we take that uncertainty model and now we make decisions, optimum decisions in the presence of uncertainty. We'll have some type of criterion that we're trying to maximize, the net present value, flow rates, or something, and we're trying to pick the decision for development in the subsurface such that we maximize that result in the presence of uncertainty, generally represented by multiple realizations of the subsurface. Once again, that's a whole different topic to get into.

These models, as I've said before and I'll probably continue to repeat myself, only add value when they impact a decision. If you're off in your corner in some company, or working for some agency, doing your modeling, and it's never used to impact a decision, then in fact your modeling does not add any value. An example in the subsurface, for natural resource exploitation, would be: how many wells, and where? What injection rates should we be using for a water flood? The natural resources could be oil, water, and gas. It could be used for environmental remediation, anything dealing with the subsurface. In fact this could even be used for geotechnical design, if you're concerned about building tunnels, in mining, and so forth.

So let me give you a really simple example. One thing I want you to notice is that, in general, we don't always use exactly all of the steps. We may improvise; we may simplify our workflows. So the first thing is that we want to answer this question about the subsurface: we want to understand the spatial distribution of porosity over this area of interest. We have this space right here, X and Y, all in meters, and we have this data that's given to us. Perhaps, up front, we were involved in the discussion or the decision about where these data were collected. In the subsurface it's going to be wells or drill holes, and perhaps we were part of deciding where to go ahead and drill. It's always going to come down to: are we trying to test a hypothesis about the subsurface? We may hypothesize that there is a better area, higher-quality porosity, in this region right here, and so we sampled here and then we sampled around it in order to test that hypothesis. We might look at various control variables. Maybe we're in fact trying to test porosity.
But at the same time we're going to remove the effects of compaction trends or other types of features from it, and so we'll look at holding those constant, or standardizing or normalizing for them.

We're going to pool all the available samples together, and the next step is description, statistical description. Here we're simply looking at a frequency distribution. This is going to be a binned probability density function, because this axis right here is in probability. We can go ahead and look at the shape and the overall form, the min, the max. We might look at the mean value and the variance, but given the fact that the distribution is multimodal, the mean and the variance (the measure of spread) won't be as meaningful to us. We have looked at this and determined that it is clearly multimodal; in fact it may actually be the combination of two Gaussian distributions shown here, one with a higher frequency but lower porosity values, and one with a lower frequency but higher porosity values.

And so you might turn to modeling. You might say, well, what would be the cause of that type of distribution? Maybe it has something to do with a natural break in porosity due to some type of physical process. Maybe we have two different types of deposition within this area, maybe we had some type of cementation of grains, maybe we had some type of compaction trends on the flank or something. We would then be able to use that to assure ourselves, or to build up a reasonable defense, for recognizing that there are two separate segments to this data set; there are two different things going on.

So, the inference and prediction part. We'd recognize the fact that there is in fact this relationship in the data, and then we start to have a predictive model where we would go ahead and say that this area right here would have systematically higher porosities, and this area right here, out here, is the lower porosity distribution. And so now we're mapping two distinct regions.
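To illustrate why the mean is a weak summary of a bimodal porosity distribution like the one described above, here is a minimal sketch using only Python's standard library. Everything in it is assumed for illustration: the two Gaussian populations, their means, spreads, and proportions, the bin width, and the threshold used to separate the modes. It is not the data set from the lecture.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Assumed mixture: a more frequent low-porosity population and a less
# frequent high-porosity population. All parameters are illustrative.
low = [rng.gauss(0.10, 0.015) for _ in range(700)]
high = [rng.gauss(0.22, 0.015) for _ in range(300)]
porosity = low + high


def binned_probability(values, bin_width, n_bins):
    """Fraction of samples in each bin; bins start at zero, ends are clamped."""
    counts = [0] * n_bins
    for v in values:
        index = min(max(int(v / bin_width), 0), n_bins - 1)
        counts[index] += 1
    return [c / len(values) for c in counts]


probabilities = binned_probability(porosity, bin_width=0.02, n_bins=16)
overall_mean = statistics.mean(porosity)

# Segment at an assumed threshold between the two modes, then summarize
# each segment separately.
threshold = 0.16
low_part = [v for v in porosity if v < threshold]
high_part = [v for v in porosity if v >= threshold]
low_mean = statistics.mean(low_part)
high_mean = statistics.mean(high_part)

print(f"overall mean:   {overall_mean:.3f}")
print(f"low-mode mean:  {low_mean:.3f}")
print(f"high-mode mean: {high_mean:.3f}")
```

The overall mean lands between the two modes, where relatively few samples actually fall, while the per-segment means describe each population far better. That is the statistical motivation for treating the two populations separately.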

We're breaking up our reservoir, or our subsurface setting, into two distinct regions, which provides us with a pretty strong prediction model.

So, let me just go ahead and comment on this, which is a really fun read. Hadley Wickham is chief scientist at RStudio and is known for the development of open-source statistical packages for R, specifically around the idea of making statistics accessible and fun. You can go ahead and check it out; I put the link to his short paper. I guarantee you it's a very short read. It's "Teaching Safe-Stats, Not Statistical Abstinence." So there's a little bit of tongue-in-cheek here, a little bit of fun with it, but what you'll see is a really great message in it, which I really appreciate, and that's why I put a slide on it and mention it in the class.

If we're involved in teaching statistics (I'm doing that right now, and that's part of what I do here at the University of Texas at Austin), we need to rethink the statistics curriculum; we risk becoming irrelevant. Statistics tends to be taught as: avoid it unless you are a statistician, or maybe a geostatistician I hope, or have one available to support you; otherwise you could cause great harm. Danger, risk, abstain. But there are not enough professional statisticians. In my professional career I've only encountered probably two coworkers within the oilfield, within the energy sector, who in fact had PhDs in statistics, and one of them was actually working in a mining group. So they're not going to be very common. Rather than stigmatize the amateur, we need to provide tools that are safer to use. We need tools that are easy and fun to use and that encourage the use of statistics. They need to have flexible grammars.
In other words, they have basic building blocks that you can put together into workflows to get the job done, a minimal set of independent components. And I would suggest that they need to be somewhat idiot-proofed, from the standpoint that they have the ability to detect mistakes.

That is, they should detect when you're just doing things wrong. Are you using too few data for a highly parameterized model fit, i.e., are you overfit, and so forth? The other thing is that we should teach and understand that coding is really central to much of what people do in the scientific and engineering communities. We need people to go for it. Teaching them programming, even in the first courses, is achievable. And so in this class we will definitely be teaching R coding, and Python too, at the same time.

So what's my job? Teach safe methods for using geostatistics and statistics. So we're going to use R. The great thing about R is that so many of the methods are really well documented, and with a single command you can complete really important tasks, with a lot of different outputs for interpretation and for understanding what happened with your model. So it's very powerful. Also in Python, we'll use these packages in order to do it. OK, so next class we'll need to install Anaconda and RStudio on your laptops, and we'll be getting started, probably in the next week or so, doing some coding and working with workflows.

OK, so let's talk about some sampling definitions. A variable is any property that's been measured or observed in the study. It could be porosity, permeability, mineral concentration, saturation, contaminant concentration, and so on. In data mining and machine learning, this is known as a feature. If we're dealing with prediction, then we'll break up our variables into predictors, which tell us something, and the response, the thing that we're trying to predict with the predictors.

The population is the exhaustive, finite list of the property of interest over the area of interest. Generally the entire population will not be accessible. If it's the subsurface, you would have to literally strip-mine it and lidar it, image it at the resolution that you require, in order to capture the entire population.

That's not possible. We work at great depths. We sample one trillionth of the reservoir, generally, but the population is the entire reservoir. It would be a three-dimensional representation of the reservoir, at the scale that you need to work at, with all of your variables defined at every single location, like a very fine mesh, so that you'd fully understand what's going on in the subsurface. That's the population. But we don't have access to the population. It's too bad we don't.

Instead we have the sample: the set of data that has actually been measured. You drilled into the reservoir, you imaged it with seismic, you did something in order to measure the subsurface, and now you have it. For example, porosity data: you may have measured it with a set of well logs that measure the density and so forth, the fluids within the rock, and you've used that to assess porosity along the trajectory of a well.

A parameter is a summary measure of the population: the population mean, the population standard deviation. But we don't ever have access to this. We have access only to an estimate of it. So we don't know the mean porosity over an entire reservoir; we don't know the mean permeability, or maybe more importantly the variance of permeability, over the reservoir. What we have in fact are statistics. Statistics are summary measures of the sample. That's what's available to us. So we have the sample mean and the sample standard deviation, and we use these statistics to make estimates of the parameters for the entire population that we don't have access to.

Data cleaning. Let's make some comments about data cleaning and show a really nice illustrative example. What I did was go online, and for the Bakken in North Dakota I downloaded the historical data sets for gas produced per month and the amount of gas flared.
Orange is the gas that's produced and blue is the amount that is flared, and this is plotted on a time basis of six months or so.

Or four months or so. OK, so we have a nice temporal data set of what's going on over the entire Bakken. So what types of questions can we ask from this? If you look at this, it might be very difficult to discern too much from it. Well, we can say that production has been increasing systematically; there's been a leveling off and then it increased again. We could say that gas flaring increased and then decreased a little bit, perhaps.

What if we were to do a little bit of a calculation here and just take the ratio of the flared to the produced, so what percentage of the produced gas was flared? This is a very simple manipulation of the data. What do we learn from the data when we apply that? This is part of data cleaning: learning about the data, checking it, seeing what's going on, does it make sense? And if we do that, what do we see? Well, you might actually be able to start inferring some very interesting things about this temporal data. First of all, we have a general trend toward a decreasing proportion of produced gas being flared. So this might be early utilization: you started producing gas, didn't have the infrastructure in place everywhere, and we just started to figure that out. Then we could have a second phase where we have stable, low-level production. We've got infrastructure in place and we're not flaring that much. Then we have a sudden increase in production. Things are starting to ramp up, the facilities for gas handling have not kept up, and there's a higher flaring proportion. Then we have a fourth stage where infrastructure catches up again and we're back to being able to reduce the flaring. So this was a very simple exercise of just dividing the two variables, calculating a ratio, and what we found was that clear patterns emerged.
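The ratio calculation described above can be sketched in a few lines of Python. The monthly volumes here are hypothetical numbers made up for illustration, not the Bakken data, and the function names and the physical-range check (a flared fraction should lie between 0 and 1) are assumptions added to show how the same ratio can double as a data-cleaning check.

```python
# Hypothetical monthly gas volumes in arbitrary but consistent units.
produced = [10.0, 12.0, 15.0, 20.0, 30.0, 32.0]
flared = [4.0, 3.6, 3.0, 3.0, 9.0, 6.4]


def flared_fraction(produced, flared):
    """Fraction of produced gas flared per period; None if nothing was produced."""
    return [f / p if p > 0 else None for p, f in zip(produced, flared)]


def suspect_periods(fractions):
    """Indices where the ratio is undefined or outside the range [0, 1]."""
    return [
        i for i, r in enumerate(fractions)
        if r is None or not 0.0 <= r <= 1.0
    ]


fractions = flared_fraction(produced, flared)

for month, r in enumerate(fractions):
    label = "n/a" if r is None else f"{r:.0%}"
    print(f"month {month}: flared fraction = {label}")

print("suspect periods:", suspect_periods(fractions))
```

In this made-up series the flared fraction falls, rises when production jumps, and falls again, which mirrors the phases described in the lecture. A period where more gas was flared than produced would be returned by `suspect_periods` and would be a reason to go back and check the source data.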

These are patterns that we may not otherwise have seen. Maybe if we had done this we would have identified that there were unreasonable values. Maybe the ratio went outside a reasonable range, maybe 40 percent, and it just doesn't make any sense at all. And so we might go back and say, hey, there's something wrong with this data set; these two variables are not agreeing with each other. We need to look at whether there's an error in the data, and so this could be part of data cleaning.

Now, forecasting is a very interesting thing. Forecasting future production, moving beyond the sample data set, would be very hard to do. You could treat it like a statistical problem: you could just fit a trend line. You could get kind of complicated about that and fit some type of trend that's trying to fit different shapes and features, or you could just apply a linear trend. You could do something like that, but maybe we don't have enough data. What else would we need to do a good job of forecasting into the future? We'd probably want to know the number and location of new wells. What's the schedule for new drilling? When are those going to be completed and come online? What's the decline rate for the available wells right now? What's going on as far as downtime and reworking? All of this information would greatly improve our estimate, our forecast into the future. And so context and domain knowledge are essential. We should never treat our statistical workflows like they're just statistics. We should use statistics to support our subject matter, our domain knowledge, in trying to answer the questions, to address the various types of scientific questions we have with regard to the subsurface.

What do we actually sample? When you're working with the subsurface, you might be working with just one-dimensional data sets, and that might be enough. Perhaps it's one-dimensional along a vector, or it might be some type of temporal data set like we just showed. You might be concerned, in space, about vertical variations.

Are there trends vertically, are things changing vertically in mechanical properties, mineralogy, porosity, and so forth? Or you might be just looking at a single well, looking at how things cycle and change along the well, and whether that's related to some information that would help you predict away from the well, which would be really important.

You could also be working with 2D sampling for spatial interpretation. In this case geologic maps are the most common two-dimensional data set we have. This is because many of our reservoirs are thin relative to their areal extent. You may have a reservoir unit that's only 10 meters thick, or maybe you have a unit that's 100 meters or so, but then the reservoir extends 4 kilometers in both directions. In that case you may find it very useful to simplify the data set, treat it like a map, and do all your modeling in two dimensions. You may also be working with spatial analysis of a thin section, where we take rock and cut it down really thin, to the point that we can put it on a microscope, shine light through the very thinly cut crystals and rock structure, and analyze the void and rock surfaces and spaces and so forth.

You might be working with three-dimensional samples for spatial interpretation. In this case you have three-dimensional seismic volumes. You might be working with sets of well logs that have been correlated, so now we have a full three-dimensional representation of what's going on.

If we talk about data and try to classify it based on what we're dealing with, what specifically we have, we could have categorical or continuous data. Categorical data takes on discrete values, and we can talk about two different types of categorical data: nominal and ordinal. Nominal would be something in which there's no natural ordering. A perfect example of that would be mineralogy categories.

It could be quartz. It could be feldspar. It could be some other type of mineral component, and really there's no way to say, OK, quartz is higher than feldspar, which is higher than maybe some carbonates, limestone or dolostone (dolomite) or whatever. There's no natural ordering in that; they're just different things.

Categorical ordinal: there's ordering in the categories. An example is geologic age. It could be Paleozoic, Mesozoic, Cenozoic, Pleistocene, and so forth; you could break it down even finer than that. But there's an order: we know that the Mesozoic came after the Paleozoic and that the Cenozoic was after that, and there's a hierarchy within it, and so forth. We know the order. Hardness is another good example. The Mohs hardness scale is pretty useful for determining what type of mineral you're working with if you have a hand specimen. The Mohs hardness scale is basically just a number from one to ten, where you're able to assess hardness based on what scratches what if you compare two minerals to each other. You have type minerals, quartz and a whole bunch of other ones, and then you determine: OK, this has this hardness; it scratched this mineral, it didn't scratch that mineral. Talc was the one I was thinking of and had forgotten; I blanked on it. OK, so that's categorical.

Continuous. There are two different types of continuous. The first thing about continuous: think of a continuum, like a timeline. So my timeline goes one, two, three, four, five, six. Those were six different numbers, but between one and six you could have 1.2, or 1.2354, and so forth. You could have any value in between. It's a continuous representation.

There are no discrete categories. There's no jump. Anything could happen in between one and six. That would be a continuous variable.

Now, an interval variable is one where the intervals are equal, and for a variable to be continuous it has to at least have this. What does that mean? A perfect example is the temperature scale that we're all used to, the Celsius scale. We have zero degrees, and then we have 10 degrees, 20 degrees, 30 degrees. The difference in temperature between 0 and 10 is the same as the difference from 10 to 20 and from 20 to 30. The scale doesn't change; it remains the same amount of change. If you look at the fundamental physics of temperature, the kinetic energy of molecules and so forth, it is a continuous scale that has equal intervals.

Ratio means that the numerical value truly indicates the quantity being measured. If you think about the Celsius scale, or if you're American the Fahrenheit scale, the decision of the datum was arbitrary. The Celsius approach is that it's based on the melting of water for zero and the boiling of water for a hundred, and then they made everything else work. The Fahrenheit scale, well, I'm Canadian, so I'm not going to defend or explain the Fahrenheit scale, but you can see that they assigned a different datum and put a different magnitude on each one of those increments. The Kelvin scale: zero kelvin is related to a fundamental physical process, the energy of the molecules and their movement and so forth. I'm not a physicist, but there is a physical meaning to it. Porosity: zero percent porosity means no void in the rock; 100 percent porosity means you have just voids. Permeability, saturation, and so forth: these are ratio continuous variables. They have physical meaning to the exact values.

OK, types of data: quantitative data and qualitative data.

There's a very simple way to explain this. Quantitative data is something that can be written in numbers. You put numbers on it. Most of the things we work with as engineers are going to be quantitative. Qualitative data is information about quantities that you cannot directly measure; it requires an interpretation of the measurement. A really good example for the subsurface would be rock type and facies. You would look at the rock, you'd probably look at porosity, permeability and other types of quantitative data, but you make an interpretation. You say this rock is this thing. It's not like we put a number on it.

Quantitative, qualitative: if you want examples, right here there's a well log. We have probably gamma ray and spontaneous potential over here, and we can look at actual numbers from the subsurface. That's quantitative. As soon as we say, well, this is sandstone, or this is a certain rock type, or a dolostone, wackestone or boundstone for carbonate reservoirs, we now have a qualitative type of data. I'm never suggesting that qualitative data is not valuable, just that it has a layer of interpretation, and of course we need to understand the uncertainties related to that interpretation.

Another way to talk about data is to consider hard and soft data. Hard data is data for which there's a high degree of certainty. Usually hard data is something that we measured in the subsurface: we collected a sample, or we have a tool for indirect measurement that has a high degree of precision. An example is well core. If you extract a well core and subject it to a porosity test, that's pretty certain. We have a pretty good understanding of what's going on with porosity. We could also have a combination of lithology and log-based information that provides us a pretty accurate measure of porosity at times.

We could consider that to be hard data too, depending on its level of precision. We might also feel that our lithofacies are very accurate, our definition of lithology-related facies or categories of rock, and that they're pretty good from the well log suite that we're working with. We could also consider that to be hard data.

Soft data is data that provides indirect measures of the property of interest, and so it includes a significant degree of uncertainty. For example, you could imagine that if you had seismic information all over your reservoir, like shown in this example right now, away from the wells it may provide you some indication of the quality of the rock, but there would still be a significant degree of uncertainty. At each location there wouldn't be a single value; there would be a distribution of possible values for porosity informed by seismic. So that's soft data. It's uncertain.

Another way to describe data is primary data and secondary data. Simply stated, primary data is the variable of interest. That's the variable you're working with. It's the target that we're trying to model, the target we're trying to understand. Secondary data is any other variables or features that are used to provide information about the primary data. It's any type of secondary support to understand the primary data. So the example could be porosity: you have it measured from cores and logs, and you try to build a full three-dimensional model of porosity, but you don't have porosity at all locations. But you have acoustic impedance, which has been measured by seismic, remotely and indirectly in the rock, and so you could support the modeling of porosity away from the wells using this indirect measurement. That is a secondary variable.

If we're talking about the data types that are available to us, we're concerned about coverage, scale or support size, and information type.

Coverage is what proportion of the reservoir, or the population, this data has actually sampled: the extent over which this data is available. So you can imagine if you're dealing with well logs, they're only providing you information probably a couple of feet or meters into the subsurface away from a well location. If you're dealing with seismic, its coverage can in fact be very good. You can have almost complete coverage of the reservoir, but at a very low resolution everywhere in that case.

The scale or support size is the scale of the individual data measures. Pore scale would be measures that tell us about micrometers or millimeters of the subsurface, individual grains and voids. We could have logs that provide us with cubic-centimeter scale, but probably more likely on the order of, you know, a meter cubed or so, table scale. And we could also have measurements like seismic. If we're dealing with subsalt, we've got low resolution because we don't have a lot of high frequencies involved, and we could be in a situation where our vertical resolution is on the order of tens of meters. In that case we're talking about reservoir-unit scale information, very coarse scale information.

Information type: what does the data tell us about the subsurface? It's telling us about the grain size, the mineralogy, the flow type, the layering, you know, the overall directionality in the subsurface, the barriers, baffles and conduits for fluid flow and so forth.

All right, let's talk about how we get a representative sample. What methods would be used to get a sample for which we could calculate statistics, and those statistics would be unbiased? They would provide us the best indication about the population for the amount of data we can gather. The best way to do that is random sampling, which means that every item in the population has an equal chance of being chosen and there's no correlation between the locations that you choose: randomly sampling throughout the reservoir. Well, is random sampling sufficient for the subsurface?

Is it something that would benefit us? Is it even something we could do? Just imagine going into your boss's office and telling them that the next well, which costs, you know, subsea in a deep marine setting, one hundred fifty million dollars, you want to drill at random within the reservoir. You may have to look for a new job. Random sampling is usually not available. It would in fact not be economic.

We don't want to change the way that we sample. Data is collected to answer questions about the subsurface. Specifically, we drill our initial wells to understand: is there a reservoir? We drill our subsequent wells to test the limits spatially: given we had reservoir there, how far can we go away and still have reservoir? That's important because it helps us assess the size of the prize in order to book reserves and so forth, and that adds value to the company. So we can't change that. This type of sampling approach is very useful. The wells are located to maximize future production. They're located to provide maximum value to the project. Wells can also be dual purpose, for appraisal and for injection or production and so forth. And so random sampling would not be a good idea.

Regular sampling could be used to try to get representative samples. This is where the samples are drilled on some type of regular spacing. But there's a warning there, because if you have regular sampling at some interval, imagine there was some type of cyclicity within the reservoir that just happened to resonate with your sampling frequency, the intervals in space. You could create a bias like that.

So what do we have when we're working? Well, we have to just accept it: we have biased data. In fact, if you're working with the subsurface and you're working directly with the raw data, you should question that. You should be concerned about just working with raw data to calculate statistics and to make any types of decisions. We also have opportunity sampling. We may have issues with access to the subsurface. We may not be able to get to certain depths, we may have something getting in the way, drilling hazards and so forth. So we have to account for this bias.

We'll talk more about debiasing later, but let's just give a couple of ideas. First, let's go ahead and recognize that this is aggravated even further. If you drill a well that's located in a biased manner, then you're going to extract core from that well, and that core is also going to be extracted in a biased manner. In fact, you're never going to take a section of the core that's just shale and send that off for very expensive typical or standard core analysis, or any type of advanced analysis of the core. You're not going to do that. Core plugs are then often extracted from core samples for additional analysis, and those in turn are going to be sampled in a biased manner from the core. So you see there are three different orders, a hierarchy of bias, going on with our sampling.

So how do we address this bias? Let's take this example right here, a very simple example. We have a two-dimensional map. It's in feet, so a couple of miles by a couple of miles, and we have our wells. We'll assume they're vertical wells. We've averaged over the reservoir interval and we've got porosity. I don't have a color scale here, but blue is going to be low porosity and orange is going to be high porosity. Would it be fair to take the average of all of these samples and suggest that that is in fact the average porosity of this entire reservoir? I think we can agree: look up here at the high porosities, and even here at the high porosities. Here in the low porosities we haven't sampled as much, but here we sampled much more, and so we have a biased-high estimate of porosity if we just take the average.
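To get a feel for the size of this effect, here is a minimal sketch on a synthetic map. Everything in it is an assumption made for illustration: the linear west-to-east porosity trend (`porosity`), the map size, the well counts and the random seed are made up and do not come from the example on the slide. Because the field is synthetic we know the true average, so we can compare it with the plain average of preferentially located wells.

```python
import random

random.seed(73)  # fixed seed so the sketch is repeatable

MAP_SIZE_FT = 10_000.0  # hypothetical map width, about a couple of miles

def porosity(x_ft: float) -> float:
    """Hypothetical trend: porosity rises linearly from 5% (west) to 25% (east)."""
    return 0.05 + 0.20 * x_ft / MAP_SIZE_FT

# Average over the whole map, known only because the field is synthetic.
# For a linear trend it equals the value at the midpoint.
true_mean = porosity(MAP_SIZE_FT / 2.0)

# Preferential sampling: 30 wells crowded into the high-porosity east,
# only 5 wells spread over the rest of the map.
wells_x = [random.uniform(7_000.0, MAP_SIZE_FT) for _ in range(30)]
wells_x += [random.uniform(0.0, 7_000.0) for _ in range(5)]

samples = [porosity(x) for x in wells_x]
naive_mean = sum(samples) / len(samples)

print(f"true average porosity : {true_mean:.3f}")
print(f"naive sample average  : {naive_mean:.3f}")
print(f"bias (naive - true)   : {naive_mean - true_mean:+.3f}")
```

The trend only varies in one direction, so the sketch ignores the second map coordinate. The point is just that the plain average lands well above the true value when most of the wells sit in the good rock.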

So what do we do? The first defense against this type of bias is good geologic mapping. You map from the data and you identify that there's a low quality, low porosity area, a medium quality, medium porosity area, and a high porosity region, and you deal with each of them separately. You break up the model into subsets, and in each one of them you calculate the average porosity: the average over this region, the average over this region, the average over this region. If you were to use the average porosity in each one of these regions and put that together to get an average porosity over the entire area of interest, that would be a pretty good estimate. You've probably done a pretty good job of avoiding some of the bias in the sampling.

Another way to do it is to look spatially at the spacing of the data, just look at the amount of space between the well data, and then assign a weight to the data. In this part of the reservoir the data is very sparse, so it should receive a high weight. It's given greater weight because it's representative of more area or volume within the reservoir. This data right here should receive a low weight, and these data right here maybe receive a nominal weight, something that's more of a medium weight. If you do that, you can calculate any sample statistic, in fact the entire CDF, the cumulative distribution function, all just using the declustering weights. So now you have data values at all these locations, and at each location you're also going to have a weight. You have an additional variable to deal with now, but this is very powerful. This is a good way of accounting for biases.

And bias is not just due to sampling. Bias is everywhere. It turns out that these brains, which developed in hunter-gatherer society while trying not to be eaten by larger predators, are full of all kinds of biases that actually helped us back then but may not be helping us right now as we're trying to do scientific, statistical studies of the subsurface.

I'll give you some examples right here. For instance, you have anchoring bias: too much emphasis put on the first piece of information. It's very interesting, actually. They've shown that I could ask you a question that you may not know very much about, like what is the age of Kevin Bacon? I don't know, maybe you're a Bacon fan. But if before I asked it I first said "13 is my favorite number," and then I said "what's the age of Kevin Bacon?" and I asked 40 students, and then I repeated that exercise where I first said 90 and then asked "what's the age of Kevin Bacon?", statistically speaking that second group would actually have a higher estimate on average than the first group. And it's because anchoring works even if it makes no sense, and that's the scary part of it. You heard 13 and you thought, that's a very small number, that's a very young age, that doesn't make sense. And then when you thought about Kevin Bacon you thought, well, he's not 13, he's older. And so you anchored and you went up from there. With 90, the opposite thing happens, and you're going to estimate a higher age.

Availability heuristic: this is where we put too much emphasis on things that are easy to observe. This happens all the time, with all kinds of anecdotes, and it is very dangerous. Bandwagon effect. Blind spot effect: you don't even know your own biases. Choice-supportive effect. Clustering illusion. Confirmation bias is huge in a subsurface team. You get a hypothesis, you have a theory now, everybody's excited, and you start to maybe ignore data that contradicts the theory. New information is only used if it supports the current model. Recency bias: favoring the most recently collected data. And survivorship bias, a really important one, where we focus on the success cases and there was some type of filtering in the data that we're not accounting for.

So that was the end of our discussion around statistics.

I hope this was helpful. As usual, go ahead and let me know in class or email me. I'm easy to find: I'm Michael Pyrcz, a professor at the University of Texas at Austin. Contact me if you have any ideas.