Efficient data analysis and reporting: DRY workflows in R
First of all, I'd like to begin by acknowledging the traditional owners of the land on which we meet today, the people of the Turrbal and Jagera nations, and pay my respects to elders past and present. Today I want to talk about efficient data analysis and reporting, and especially about "don't repeat yourself" (DRY) workflows. I do have a package about this, but it's not just about packages; my interest is more generic, and, as we heard in the discussions just a moment ago, there's no one-size-fits-all. I come from inland Australia, and I thought my international colleagues here might appreciate this photo: it's a creek bed near Broken Hill, where I'm from, and I think this is about the first time in 30 years there was ever any water in it, because we'd just had rain when I took it. It is usually quite dry. And the thing about DRY is that we're talking about "don't repeat yourself", or, as Hadley Wickham sometimes jokes, "do repeat yourself": reuse your code. That's in contrast to WET, which is where I come from and where maybe a lot of people started out: "write everything twice", or "we enjoy typing" (that's a good one), or "waste everyone's time", because we don't know what we cut and pasted from the report.
I think DRY approaches are much better, and I want to talk about that. But first I want to say something about reproducibility in general. I feel like I'm probably preaching to the converted here, but certainly in my area of work, which is medical statistics these days (and was agricultural and genetic statistics prior to that), not everyone is aware of the problems.
I'd like to talk briefly about DRY workflows for data analysis projects. I've been a consultant statistician for about thirty years, and during that time I realised I was repeating myself an awful lot, even across quite different projects. J. Scott Long has a book about using Stata for this sort of work, in which he talks about the plan, document, organise, carry out cycle. We're all quite familiar with that, and it's quite an iterative process. On top of that, I started using version control for data analysis projects in the 90s, and also make at the same time; my colleague Bob introduced those to me in the early 90s (I'll acknowledge people as I go along rather than on a separate slide), and quite a few statisticians have used those tools for a long time, so it was fantastic when RStudio introduced these things to a wider audience. The other thing is writing your own R packages and functions: if you're doing the same thing over and over again (once again, I'm preaching to the converted), you should make a function if you possibly can. Then I'll briefly talk about conclusions. The main thing is that when you've been working in this game a long time, you see that there is a reproducibility crisis, but we never quite know its extent. Monya Baker (no relation) surveyed about 1,500 scientists for Nature, asking: do you think there's a reproducibility crisis? More than half said yes, there is a significant crisis, and another 38% thought there is at least a slight crisis, so around 90% thought there probably is a crisis of some sort; only a few disagreed. That's only a survey, but in my own work I've been asked, even by the research integrity officer at UQ, to look at theses where the data doesn't back up the results.
It's confidential, of course, but I've seen theses where the data doesn't match up with the results. We're all familiar with that, and I think sometimes it's a lack of training more than intentional fraud. And John Ioannidis, in his 2005 paper, argued that most published scientific findings are false; the figure was just over 50% at least.
Part of the problem of reproducibility, and the scientists in that survey recognised this, is that the methods and code aren't available, and the raw data is not available; there are a lot of problems if you try to reproduce people's papers. I think the R community, because of its open-source roots, has a rather different view about these things. The other crucial thing is quality assurance. The survey was more about lab studies, but from what I've seen in general, things are probably worse in data analysis. Some 34 percent said that, even though they're aware of the issue, they still haven't established procedures for reproducibility; that's in the lab, and I suspect it's higher in data analysis. As statisticians and data scientists we can contribute to study design and analysis, and to understanding variability, which was another item on that list. But what I want to talk about for the remainder of this time is reproducible analysis and reporting. The workflow that I've followed for quite some time, since I realised my own work needed some improving, is plan, document, organise, and carry out. If you're working with clinical trials, say, you certainly know that you need to do analysis plans, and often you've got some idea of the analysis even before you apply for the grant. And you certainly need codebooks to document variables: labels, levels, ranges,
what instruments were used, and what data you'd expect. Once you get the data, you carry out the analysis plan, but it's an iterative process, because the data may not be what you thought it was, or there are some problems with it. I have found that for large projects, some of which go over a number of years, I tend to modularise: I might have one directory where I'm reading, merging, and cleaning data; another directory where I'm doing analyses and testing things; and another area where I do reports. This modularisation is very common in computing. Just having a single markdown file will work quite nicely in small projects, but not in large ones. The traditional manual approach to this cycle, which I've seen a lot, especially in medical research, is menu-driven copy and paste; it's error-prone and it's not auditable, which is important if you want reproducible results. The more modern DRY approaches use syntax and scripts, like I used 30 years ago, plus automation, version control, and make for build systems. My preferences are for consistent,
informative project names, file names, and so on; to modularise the project unless it's very small; and to have consistent directory names. Once you start doing a whole lot of projects that look similar, even though they're not the same, it's easier if you follow the same sort of pattern. I wouldn't really want to dictate that, and everyone works in a different way, but we can still use standard, well-established tools, like writing our own R functions and saving some of the results. If we've done an analysis, R is great: we can save the objects, we can save the plots, we can save the tables, and then we can print them out later.
We might just massage them, but we don't have to go through that process again, and we don't have to think back: "two years ago I used this bit of code to do this analysis; where is it?" If you use something like make (and I know there are other systems around) for regenerating the output, you go through logical steps, each dependent on the previous one; and if you use git right from day one, then you do have a record of everything, and it is relatively easy to go back if you want to change something. So here's a fairly standard directory structure that I might use. The main thing I'd point out is that for the data, I've got an original directory, and I try not to touch the original data, because you really don't want to do that; and we've got some derived data. I've also got codebook data, because I want to use the codebook to actually check my data. You can write functions to do that; you can use the tidyverse. I've written a package that is moderately elementary, but it did what I wanted to do. Then we've got other directories for reading and cleaning, and for our .Rmd or .Rnw files. Our own functions and packages are a really helpful thing too, so I put them in a standard place so I know where they are. That's me working as a solo consultant; naturally, with git we can put them in a repository and work in teams when we need to. And we can automate this whole process: setting up directories, moving files to the appropriate directories, and so on. We can also generate Makefiles; there is the option in R to write your own syntax, and I've written some.
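To give the flavour of the codebook idea, here is a minimal sketch of checking data against documented ranges. The column names (variable, min, max) and the example data are hypothetical, just to illustrate the approach, not the actual format any particular package uses:

```r
## Hypothetical codebook: each variable with its documented allowed range
codebook <- data.frame(
  variable = c("age", "sbp"),
  min      = c(0, 60),
  max      = c(110, 250)
)

## Made-up data with two out-of-range values (age 140, sbp 300)
dat <- data.frame(age = c(34, 57, 140), sbp = c(120, 300, 80))

## Report rows whose values fall outside the documented range
check_codebook <- function(data, codebook) {
  for (i in seq_len(nrow(codebook))) {
    v   <- codebook$variable[i]
    bad <- which(data[[v]] < codebook$min[i] | data[[v]] > codebook$max[i])
    if (length(bad) > 0)
      cat(v, ": rows", paste(bad, collapse = ", "), "out of range\n")
  }
}

check_codebook(dat, codebook)
```

The point is simply that once the codebook is machine-readable, range and level checks become reusable functions rather than ad hoc eyeballing.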
I've written Makefile rules so that you don't really need to know too much about how make works. You need to know a little, but you can generate these things using a package, or you can use them directly and build up your experience. And I can use the information in codebooks to check data; that forms the basis of the dryworkflow package, but I probably won't talk that much about it today, as I realise I'm going to run out of time fairly soon. In terms of make: I know some people may avoid it, and that's perfectly fine too. drake is certainly a very interesting development in this space: basically, if you know make you can use make directly, but if you don't, you can use drake's R functions and do these sorts of things quite nicely. I still prefer to be in control, I guess, so I have some Makefile definitions for all sorts of software. And you can look at the directed acyclic graph: the files in green are what I put in the git repository (not necessarily GitHub, because I've got confidential data in medical studies, so I put it on a server somewhere), and all the other files can be regenerated. I think I might skip over git, because an earlier speaker gave an excellent overview. As for R Markdown, Sweave, and knitr, I personally think these things have revolutionised the way we do reproducible reporting, and they're not really available to a similar extent in other packages.
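To show the flavour of such rules, here is a minimal, hypothetical Makefile sketch. The file names are made up, and these are not the actual rules from the dryworkflow package; they just illustrate pattern rules plus a dependency chain, so that `make report.html` reruns only the steps whose inputs changed:

```make
## Pattern rules: any .R script produces a .Rout log; any .Rmd renders to .html
%.Rout: %.R
	R CMD BATCH --vanilla $< $@

%.html: %.Rmd
	Rscript -e "rmarkdown::render('$<')"

## Dependency chain: the report needs the cleaning step,
## which needs the raw data (all names hypothetical)
report.html: report.Rmd clean_data.Rout
clean_data.Rout: clean_data.R data/original/raw.csv
```

With rules like these in place, editing `clean_data.R` and typing `make` regenerates the log and the report, while untouched steps are left alone.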
I know SAS has got its ODS system, and that's pretty good, but I do think this has changed the way we think about how we can produce reproducible reports, and people who don't use this sort of system probably don't realise it's possible at all. One thing I will say is that you can load objects created and saved in previous steps, so you can have quite a complicated process.
Save them into a binary file, and then load them back when you're doing the report; that way you're concentrating just on the report, or someone else is, if they're doing that part. And there's a whole lot of R packages we can use. For codebooks, we can code it up ourselves using the tidyverse, and there are packages such as testthat for testing, as well as some great workflow packages; we just saw one that does a lot of this for us. The dryworkflow package can create the directory structure, move the files around, initialise the git repository, and, if you've got templates, create template LaTeX and markdown files and so on. So we have a whole lot of tools at our disposal, and I'm sure there's a lot more around. Now, in this project I've got two CSV files, just as a demonstration; let me make this a bit bigger, and if I run just this... ah, "could not create". I shan't do the demo then. Sorry about that; it worked before. But basically you can use any sort of methods you like, whether you write your own scripts in whatever language, or use reproducible tools in R.
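The save-and-load idea can be sketched in a few lines of R. The model, the list structure, and the file path here are hypothetical, just to illustrate separating the analysis step from the reporting step:

```r
## Analysis step: run something slow once and save the results as a binary file
dir.create("derived", showWarnings = FALSE)
fit <- lm(mpg ~ wt, data = mtcars)                     # stand-in analysis
saveRDS(list(fit = fit, coefs = coef(fit)),
        file = "derived/model_results.rds")

## Reporting step (say, a chunk in an .Rmd file): just load the saved objects,
## so writing or re-knitting the report never reruns the analysis itself
results <- readRDS("derived/model_results.rds")
round(results$coefs, 2)
```

Under make, the .rds file becomes a dependency of the report, so the analysis reruns only when its own inputs change.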
R really does lend itself to writing lots of functions, packages, and code, and to automating all sorts of parts of this, and we'll see more of that today. I'll end by saying that there really is a reproducibility crisis in science, and it's not just medical science or bioinformatics or whatever field you might be in, and we often need very thorough planning, documentation, and organisation, and all these sorts of aspects to our cycle of data analysis.
But we can automate quite a lot of it. It is a personal choice, and it's very hard, when someone comes along, to suggest exactly what they should do, because every project is different; but there are a lot of similarities, and if we can use don't-repeat-yourself methods, that has a lot of advantages. As I say, there are quite a few R packages around, including workflowr, which must have fallen off the list, and we can use tools like GNU Make; there are alternatives around, but it really is often quite simple to use make in the background, especially when your project gets a bit larger, together with version control, markdown, unit tests, and your own functions and packages. I think I might stop there, so thanks very much. Any questions? Well, I think Karl Broman's got some good tutorials online, and there's also my Journal of Statistical Software paper, whenever it comes out, when I get back to it, so there are some resources around. I do think that if you use pattern rules (I've got some on GitHub) you can bypass a lot of the worries about make. Yes, make can be intimidating, especially when you get to a really modularised, huge workflow on a fairly long-term project, but it's quite easy to build it up slowly. And assuming my workflow package works (I'm sure I just changed directory back),
then you can get it to set things up for you fairly straightforwardly. drake is another one where you've got tools at your disposal to avoid make. Make does seem intimidating, but it's a bit like people saying "we should use Python instead": that means learning a whole new language. So I'd stick with make, I guess; that's my advice, but your mileage may vary.