ACCRE Pizza and Programming: Big Data Analysis with Spark
Okay, so if you check the YouTube stream... it says we're good. Alright. Welcome everybody, thanks a lot for coming by and braving the heat; hopefully you enjoy the pizza today. Josh is going to be talking about big data analysis. This is something we've received support from the University to do, and that is to build a cluster that's optimized for big data. Now, some of you might say, well, I'm already doing big data analysis in the current environment, so what's going on? Our current environment absolutely supports data analysis through many of the conventional scientific computing tools you've grown to know and love, like Python and R, but that cluster environment was designed with a different sort of focus in mind: high-performance computing simulation, molecular modeling and the like, not big data analysis. The tools we're going to begin to talk about today are from the Hadoop ecosystem. Anyone heard of Hadoop? Okay, awesome, about half. This was born out of the big web companies like Google and Yahoo in the early 2000s, which were trying to cope with the millions and millions of web pages and files they had to constantly manage, so they developed HDFS and MapReduce to help solve those problems. The interesting thing is that it was born in industry and is only now beginning to work its way into academia. Hadoop and Spark, which Josh will talk about today, are huge players in the data science industry in general, so if you're interested in becoming a data scientist, especially going into industry, it's very likely that the company will be using some subset of the tools Josh covers today. A few other things first: we apologize for the brief downtime this morning. It was not planned at all, and we're still digging into it, but things should be available and up. If there is anything odd still going on from your home directory or elsewhere, feel free to open a ticket.
Let us know. There was one more thing I wanted to say. Josh is going to be talking about this new programming environment today called Spark, and you can use Spark from Python and from R, which I think will be really convenient for a lot of users. But if you are interested in porting your current code over to Spark and you feel like it's an insurmountable obstacle, feel free to ping Josh or ping us, and we're happy to sit down with you, look at your current code, and help you convert it to this new environment. Another thing we'll probably be setting up is Spark and Hadoop office hours, so maybe starting next week we'll add a one or two hour block once a week where you can come by, sit down with Josh, and we can help evaluate your current workflows and see if what you're doing is a good candidate for this new environment. So with that, I will turn it over to Josh.

All right, thank you very much. Well, thank you all for coming today and enjoying the pizza. This is an intro to using Spark on the big data cluster, and I'm going to ask first of all that you bear with me a little bit, because today's talk might seem a bit more academic and less applied. I'm going to get to some really typical use cases toward the end of the presentation, and I'm going to invite you to discuss some of the ideas you have about the kinds of problems you face with big data. But I do want to emphasize the mechanics of what's going on in Spark, because I think that's fundamental to understanding why big data works and why this computing approach is useful, and it is also different from traditional HPC. With that in mind, because this is very new, I want to be upfront first of all about how to get help with these issues. Will of course alluded to office hours, but there are also the normal channels of submitting a helpdesk ticket, and I strongly encourage you to join the ACCRE forum Slack team; we have a big data channel there.
We also have a GitHub organization, parallel to the ACCRE GitHub account, which is devoted strictly to big data (bigdata-vandy), and we have a blog that piggybacks off that GitHub site with some step-by-step instructions. Finally, we do take the big data roadshow around campus and meet with research groups, if that's something you're interested in.

So let's talk about Hadoop. I put this slide up here intentionally to confuse and wow you with bright colors. Hadoop is the core of big data these days; at its core, Hadoop is the Hadoop distributed file system, but a huge ecosystem of applications has developed around it. I'm not going to go through all the names, catchy as they may be, or the graphics; I just want to give you the idea that it is a very diverse ecosystem. So how is it different from HPC? The fundamental idea in HPC has traditionally been that we have code running on fairly performant hardware and we move data to that code. Big data is just the opposite: it's born out of the idea that it can be more efficient to move your code to wherever the data is located, and by taking advantage of that data locality you can speed up the analysis. Typically what this means in terms of storage is that on the traditional cluster you use GPFS, while on Hadoop we have HDFS, the distributed file system. So it's not that every piece of data is accessible everywhere in the big data cluster; it's that data resides on different nodes within the cluster, and we'll get a little more into the specifics there. In HPC there is an emphasis on more performant languages like C, C++, and, God forbid, Fortran, if anyone's using that (I kid, it's a good language), but the emphasis in big data is really on being able to program quickly, allowing developers to get to solutions quickly, so you have much more of the niceties of Java, Scala, Python, and so on. And this correlates with the idea of imperative versus functional programming: in imperative programming, traditionally, you think about moving through a for loop and visiting each element:
I'm going to visit each piece of data once, and when I get to the end of the data, I'm going to do something else. That's imperative programming. In functional programming, the idea is more: I have this data that needs to be transformed in a certain way, and I don't really care how I access it, but it's going to be transformed into this and distilled into that. It's worth reading up on; it's an interesting paradigm coming out of programming. So with that in mind, what are candidates for big data workflows that perhaps should transition from traditional HPC to big data? The obvious first answer is lots of data: if you have lots of data, this is going to work better in general. Typically it's data tasks that are embarrassingly parallel but more or less computationally simple. Protein folding, for example, would be a poor example of a big data application, because there you are CPU bound. Big data targets problems that are not CPU bound, that is to say, the bulk of the processing time is not the time it takes to calculate something but rather the time it takes to move data back and forth. So we're typically talking about IO-bound applications. Where this has really caught hold in the last decade or so is machine learning, because machine learning in general gets better the more data you have, and there have been very clever ways to make machine learning algorithms work within the context of Hadoop; that's something we'll talk about toward the end of the lecture today. But there's also something you might not think about that might actually be useful to a lot of you, and that is network analysis. You can think about why this might have come up in the last 15 years or so in the context of industry: social networks, right?
There's definitely a need to compute properties of networks quickly and efficiently, and the algorithms to do this kind of thing have become very mature. Lastly, and we won't talk about this too much, there's processing data that comes in as a real-time stream, that is, applying machine learning, network analysis, or any of the above to streams of data that you're not necessarily storing as is; this is another key component of what Spark can do. I'm not going to go through this slide in too much detail, but I did want to point out that the tools for traditional scientific computing versus big data analysis are different and meet different needs. That's something you can check out in your own time, I suppose, because I want to get into the meat of the presentation, which is the hardware we have available here at ACCRE. As Will mentioned, we have a test environment set up on recycled hardware, which has been in use for over a year now; that's actually one of the big selling points of big data, that it can run on commodity hardware, so you're not investing huge amounts of money into the computation itself. What I do want to point out is that the new hardware purchase, the new cluster that will be coming online this fall, has some pretty beefy data nodes: 80 terabytes of storage, 32 cores, and half a terabyte of RAM. I don't think anyone here has a laptop with half a terabyte of RAM in it, so we're super excited about being able to use these. And I should mention again that this is free for the campus to use; it has been funded, so it's the perfect opportunity to transition to a big data workflow. Since we're talking about hardware, I'll just mention that Cloudera Manager is the tool we use to manage the cluster, and that's probably all you need to know about it, but I did want to show just a snapshot
of our ecosystem and the tools we're managing with Cloudera, many of which you saw in the earlier graphic. I'm not going to go through all of these, but I did want to get you to Spark, and in order to do that I need to talk about three specific services: HDFS, which we've already touched on; YARN, which is the scheduling utility for our big data cluster; and Spark, which is the application development software. So with that said, what is HDFS? It stands for the Hadoop distributed file system. It's a distributed file system designed to run on commodity hardware, and it's highly fault tolerant, because by default data is replicated three times: if one node fails you still have two copies, and the other nodes are aware that a node has failed, so HDFS goes ahead and makes another copy from the remaining two. It does all this behind the scenes, and as a programmer that's great, right? You don't have to worry about it at all. So it's certainly suitable for applications with large data sets. Here is HDFS in a nutshell: you take a big file and copy it to HDFS, and pieces of that file go everywhere in the cluster, so different nodes get different pieces of data, but again, it's replicated three times, so you're essentially guaranteed to have redundant copies of the data. That's pretty much the gist of it. So how do you interact with it? Well, it's not quite like traditional Unix, but it's close, so I'm just going to live demo it here and probably regret that decision, but let's go for it. I'm logged into our cluster here, and the syntax for querying the Hadoop file system is hadoop fs, and then you provide commands that mimic the Unix commands but are actually arguments to this hadoop fs call. So if I do an ls, which everyone of course knows means list the files in a directory, it will list the files in my home directory on the cluster, that is, on the Hadoop big data cluster. Obviously I'm streaming right now, so it's going to be a little slower, but there it is.
I've got a lot of stuff in there, right? So it works essentially the same as Unix, but not quite Unix. I'll give you a hint here: your home directory when you log in will be /user/<your netid>, and when you list that you'll see the exact same files. For one more demo, I'll show this data drive. This is not the same as the data drive on GPFS; this is a directory called data that I created on Hadoop, which hosts some specific data sets, Google n-grams for instance, and New York City taxi information. So, enough of that; this is the command-line interface to the Hadoop file system. The hadoop fs command has a bunch of arguments, 30 or so I suppose, but here are some of the more common ones. You can view files and the contents of files, which is very useful. These two commands here are really the workhorses: if I have data on the normal disk and I want to get it to HDFS, what I need to use is hadoop fs -copyFromLocal, which does just what our graphic shows: it takes a local file, shards it up, and distributes it, and you don't have to keep track of the pieces; you can just reference that file at any point and HDFS knows where to go to get it. There are some more here I won't go through, because I do want to get to the Spark stuff, but here's a really useful one. Question? [How do you get data from your own machine into HDFS?] So the first thing would be to copy it over to the cluster through scp or FileZilla or another copy utility, and from there what you would use is hadoop fs -copyFromLocal or -moveFromLocal, depending on whether you want to keep the local Unix copy or not. Right, precisely. So, for example, hadoop fs -cat /user/fido/myfile will print the contents of that file, and it actually accepts globbing expressions, wildcards for people familiar with those, and you can pipe the contents of a file into another Unix command like head and just get the first hundred lines or so. Question? [Is that an actual copy? Yep. And are there desktop client tools for HDFS, something along the lines of scp or FileZilla?]
I don't believe so at this point, but yes, we're working on a solution for that. So, everyone's an expert on hadoop fs now; let's move on to YARN. YARN stands for Yet Another Resource Negotiator, and its jobs are basically resource management and job scheduling. We never actually connect directly to the data nodes on the cluster; we operate as clients who submit jobs to YARN, and YARN schedules those jobs appropriately. This should sound eerily familiar, because it's effectively the same role that Slurm plays on the traditional cluster. But it's a bit different, because you don't have to write a batch job script, a Slurm script, in order to specify a job; it simply works behind the scenes as is, and resource allocation is essentially transparent to the user. That should come with a big asterisk on it, because you can actually tweak these things, but that's a little more advanced than what we want to talk about today. So YARN is the intermediary between you and the worker nodes. Let's get on to Spark. This is straight from Spark's documentation: it's a fast, general-purpose cluster computing system, which means it supports high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. We'll talk a little bit about what general execution graphs mean, but this graph is not one; this shows the ancestry of Spark. Reading left to right, you see that PySpark and SparkR are sort of wrappers that let Python and R execute Spark code. These get interpreted into Spark; Spark itself is written in Scala, so it's essentially a specialized library of Scala, and Scala itself is built on top of Java (it runs on the JVM).
The interesting thing about this is that any valid Java code is also valid Scala code, and in turn valid Spark code. Long story short, you have many different ways to program Spark, which is very useful, and again that plays into the idea that you want to be able to develop solutions quickly, not spend your time writing really complex code. The other thing that makes Spark general-purpose is that it can run in a number of different ways. If you're interested, we actually have a demo app on the ACCRE GitHub site which shows how you can set up a Spark cluster on top of GPFS. This wouldn't be the recommended way to do it; it's more, what's the word, educational, just to show that you can do such a thing. But as we mentioned, it runs on YARN, and it also runs on Mesos if anyone's interested. Let's get down to the nitty-gritty: I want to run a Spark job, how do I do it? Well, the first way, and probably the best teaching tool, is to run the Spark REPL. A REPL is a read-evaluate-print-loop tool; it's essentially a shell that runs Spark code for you. The general way to call this is to point to your installation of Spark and then run the spark-shell command. On our cluster we actually have two different versions of Spark going, and I want to emphasize here that, since most everyone is picking this up as a new tool, stick with Spark 2; it's got some optimizations and more functionality built in. So one can start a Spark shell with spark2-shell or pyspark2, which gives you the option of interacting with the entire cluster via a Scala Spark interface or through a Python Spark interface. This is what we'll use for the rest of the presentation, but there's also a more general way to submit jobs, for jobs where you've really fleshed out what the code is supposed to do, and that's through spark-submit. spark-submit, or rather spark2-submit, will accept as its argument either a jar file, that is, compiled Java or Scala code, or a Python script.
Super useful, and if you read the documentation for spark-submit you'll see there are many different ways to configure it, but in general, if you run spark2-submit or spark2-shell on our cluster, you're already configured to use YARN, which means you're going to use as many cores as you're allowed to, which is really great. So now that we know how to invoke Spark, let's talk about a simple application: word count. Word count is the hello world application for Hadoop, MapReduce, Spark, everything. The idea is simple: you're going to count the occurrences of all the words in a text file, and for this we're actually going to use the Scala REPL. If you're interested, this content basically comes straight from the Spark quick start guide, which is linked here. I do want to mention that for this demo I'm going to be using Scala Spark, and you're probably thinking, well man, I already know a little bit of Python, or I know a lot of Python, why is this guy trying to teach me Scala? There are some very good reasons to do this. It's becoming sort of the lingua franca, if you will, of data science, but also there's some really nice type checking built into it, so I think you're going to be pleasantly surprised: it's pretty easy to understand and it's very much like Python in some ways. Having said all that, if you look through the documentation for Spark, you'll see example code in both Scala and Python essentially everywhere you go, so you're able to switch back and forth between the two, and it's quite simple to pick up, so I think you'll be able to follow along. Don't yell at me too much, please. The first thing we're going to do is read in a text file using the Spark context. The Spark context, conventionally written sc, is the entry point for the Spark API, and it's already created for us in the REPL. So why don't I switch over and just demo this in the REPL: all I've done here is call spark2-shell, so I'm in the shell, and what I want to do is create a new value.
It's called linesRDD, and it's equal to sc.textFile(...). Okay, so, amazing, right? We see that we've created this value called linesRDD, and the REPL is telling us exactly what it is: linesRDD is an RDD of String that's coming from this file. You're probably thinking at this point, what in the world is an RDD? And where does the file live? It could be on HDFS; in fact it's in my home directory on the cluster, so it's a normal file. We'll talk a little bit about the mechanics of that in just a minute. So: resilient distributed data set, everyone say it with me, resilient distributed data set. This is the core idea of Spark: an RDD is a fault-tolerant collection of elements that can be operated on in parallel. Resilient: why is that important? Well, if we're talking about hundreds of gigabytes or terabytes of data and we're operating on them, we're going to need a few nodes to be able to do that, right? And if one node fails, we don't want to restart the whole process. So the idea of resilient is that parts of this data set can crash or whatever, and we don't care, we can recover from that; that's baked into the programming of the RDD itself. The other thing is that the parallelism on this RDD happens magically and transparently to you, which, if you think about it, is a very big departure from traditional HPC, where you have to specify parallelism through job arrays or MPI calls to different nodes, things like that. Here in the Spark world, we know we're going to be operating in parallel, so we have RDDs to do all this hard work for us and we can just let them do their thing. So: resilient distributed data set. If you don't take away anything else, that's what you should remember. Now, because an RDD is distributed, we can't actually view it directly; to view an RDD we have to send all the data back to a single node.
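To make that concrete, here's roughly what this first step looks like inside spark2-shell; this is a minimal sketch, and README.txt just stands in for whatever file you point it at, local or on HDFS:

val linesRDD = sc.textFile("README.txt")   // an RDD[String], one element per line of the file
linesRDD.collect().foreach(println)        // collect() ships every partition back to the driver, so only do this for small data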
Okay, so going back to Will's question here: where did this file live? The Spark README.txt happened to be in my home directory, so I'm pointing at it relative to my home directory. What Spark does when I create this text file RDD, though, is send pieces of that file to different nodes in the cluster: it partitions the file however it wants to and sends the pieces out. So I can't just read it; I have to grab it back, and I can do that with this call here, linesRDD.collect. Collect is going to send all the data back to a single node, and then I run this foreach function: for each element in the RDD, print the line. So what we'll see is this; it does take a second, because the data has been distributed, so it has to actually fetch it from the different partitions. Here we've printed all the output, and it's pretty big, but this is the README for Apache Spark. That's just the dummy file we're playing with. We can see that this RDD happens to be an RDD of String, so each element is its own string, and you'll notice too that it stayed in order; in this case it reads as it should, and here's the same content duplicated on the slide. So that's great, we have an RDD; what can we do with this thing? Well, many of you have probably heard of map, so we're going to delve into the world of MapReduce here. Mapping simply means that I'm going to transform the elements in this RDD into something else. Usually our data is not exactly the way it needs to be, or doesn't give us the information we want as is, so we need to transform it; that's the entire idea behind mapping. Mapping is a common function in Python, and it's common in Scala as well. So here, if I just have a Scala string called foo, a simple line of text, I can map this string into an array of strings using the .split function, and those of you who know Python know that this works exactly the same way there: when I call split on a string, the result is an array of strings. So essentially what we want to do is apply this to every element of our RDD.
Okay, so let's just do that. If we want to know the words per line, we can apply this splitting function to each line in the RDD. Let's take a second and let that sink in: we started with linesRDD, which is an RDD of String, and we're going to map each string with this anonymous function, which takes a string and splits it. Those of you familiar with Python will know that for anonymous functions in Python you use lambda x, right, a lambda function; it's the same idea here, we're just constructing the function that tells Spark how to map each element. The result, which we call words, is an RDD whose elements are arrays of strings. And if we want to know how many words are in each line, we just take the size of each of those arrays. So I have this words RDD here and I'm going to map it: I know each element is an array of strings, so I take the array and apply the method size, which gives me the total number of elements. If I do this, I can store the result as wordsPerLine, and I'll just print out the first five elements of this RDD. Let's go back for a second and look at what our original data was: it was '# Apache Spark', then a blank line, and then 'Spark is a...' blah blah blah, about thirteen or fourteen words in that line. And that's exactly the result we have here: the first line, which was just '# Apache Spark'; then a blank line, which counts as one in this case; then fourteen words; and so on. Questions? Does it make sense at this point, or what doesn't make sense? The mapping transforms each element of the RDD independently, so you can think of it as a completely independent transformation of whatever the element is in your RDD.
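As a minimal sketch, continuing in the same spark2-shell session with the linesRDD from before, those two mapping steps look roughly like this:

val words = linesRDD.map(line => line.split(" "))   // RDD[Array[String]]: each line becomes an array of words
val wordsPerLine = words.map(arr => arr.size)       // RDD[Int]: the number of words in each line
wordsPerLine.take(5).foreach(println)               // peek at the first five counts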
So here we've mapped a string of words into an array of words, and then we've mapped that array into its size, an integer. We started with a string, went to an array of strings, and then to an integer, so wordsPerLine is now an RDD of integers, precisely. Yes, that's a very good point. [Audience question about whether smaller, regular data files also need to be distributed as RDDs.] Sure, so here's a good example, and it's actually a real one. Say I'm parsing through 500 gigabytes of website data, text and content from blogs, some food bloggers, and I have a dictionary of food words that tells me the flavor profile for a given word, so if I see bacon in the dictionary, it's going to be delicious or terrible or whatever. The dictionary itself is not going to be a huge file, right? It's going to be maybe a few megabytes. In that case you can certainly store it locally, read it in through Python or Scala as a normal text file, and use that dictionary as is, as a non-distributed data set. There is also a mechanism to broadcast that data set to every node in the cluster, so every node gets a copy and can reference that read-only data, if you will. So yes, you're right that in that case you wouldn't want an RDD; our example here is purely educational, and for a text file of a few kilobytes, no, it would not make sense. But for the half a terabyte of website data you wanted to analyze, you certainly want that to be an RDD. Good question. And as Davide mentioned here, this is sort of a key point, maybe not truly necessary to get things to work but necessary to understand: each time you transform an RDD, you're actually creating a new RDD; you're not mutating the existing RDD in place. The reason for this has everything to do with resiliency: the nodes themselves don't know exactly when they're going to access a piece of data, and they're all working in parallel, so if they were all trying to modify the same piece of data, they'd get into race conditions and it would be very difficult to guarantee a consistent result every time you ran your program. So the idea is that these structures are immutable.
So it's an important point, not necessarily critical to get things working, but a good question. That was the mapping phase. If we want to get the total number of words in this document, we can sum up our wordsPerLine RDD. Remember, it's an RDD of Ints, and we're going to apply a method on it, .reduce. This should sound familiar: MapReduce. It also takes a function as its argument, a lambda function that takes two elements (I haven't told you yet what those are) and tells me how to combine those two elements together. So, we just said mapping is transforming data; reducing is combining data. And that's the entire MapReduce paradigm in a nutshell. This would be useful if you want to compute the mean of a data set, or the standard deviation, or something like that: you have to have a way to combine values together, and that's exactly what this reduce mechanism does. If we look at this function, it takes two parameters, a and b, and I'll give you a hint: they're both integers. This is not the normal way you would calculate a sum in, say, C, which would be: create a for loop, for i equal to 0 to n, step through each element, keep a variable called total, and add to total each time. Here the model is different.
Here's what we're saying: if we had a set that contained only two members, how would we combine them? We'd just add them together. And it doesn't actually matter the order in which we combine the two, which again harkens back to this whole idea of parallelism: you can't really know the order in which your operations are going to happen. Luckily for us, a plus b is equal to b plus a, so this is a commutative operation and it's perfectly valid as a reduce step. You can think of others that might not be: trying to directly calculate a mean, for example, you don't know what the denominator is as you go, so you have to calculate means in two steps. I might be getting a little into the weeds here, but let's take a look: we started with an RDD of Ints, we applied our reduce method on it, and our result is this value, total, which is an integer. That's another hallmark of the reduce step: we take this distributed thing and collapse it back down into our normal Scala objects. This thing isn't distributed anymore; it's an integer that lives on the current node, the master node as it were, and so it's directly viewable. This is a rehash of what I just said, but it is critical that you understand that the order in which you do this reduce operation cannot matter. That's admittedly maybe a tough way to think about things at first, but that's why we're doing the hello world example, so you all understand the mechanics of it. No problem. So let's just recap: the MapReduce paradigm is that you have some data, it's typically distributed, and you transform it in some way. Here I've put some examples of transformations: we saw map, but you could certainly apply a filter function, say if we want to get rid of lines that contain the word spark, we can totally do that with a filter on our RDD. A minimal sketch of the reduce step and that filter follows.
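Continuing with wordsPerLine and linesRDD from the earlier sketches (again, just a sketch):

val totalWords = wordsPerLine.reduce((a, b) => a + b)           // pairwise combine; fine because addition is commutative and associative
println(totalWords)                                             // an ordinary Int back on the driver, no longer distributed
val noSpark = linesRDD.filter(line => !line.contains("spark"))  // drop lines containing "spark" (case-sensitive here)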
Beyond map and filter, we'll talk about flatMap in a minute, and there are also more SQL-esque operations, if you're familiar with databases or set theory: join, union, intersection. These are all ways to transform your data. And then you reduce: the common ones are reduce, which we just saw; count, which simply counts the number of elements; and sample, where you generate a random sample of the data. So in a nutshell, that's it. There are certainly nuances to these functions, and there are many more that Spark exposes for doing this kind of analysis. Let's take it a little further. All we did there was count the total number of words, but now let's say we want a list of the most common words in this document; that truly is what word count is. Let's think in pseudocode about how we would do such a thing. The first step would be to map each line into an array of words: we're just going to split it so that we're looking at words, not sentences; you've seen that already. The second step is that we're going to map each word into a (word, value) pair, or tuple. So, who here can tell me what a tuple is? Who doesn't want to tell me but knows? Is that a hand? No? Okay, sure: it can be ordered, it can contain different types or the same type, and yes, it is immutable, that's very good. A tuple is really a generalization of a double, a triple, a quadruple: it's an immutable, list-like thing of a certain size, size n, say. This one happens to be a tuple of size two, and generally in Python and in Scala you denote tuples with parentheses; this is a tuple of a string and an integer. That's exactly what we're going to build from the words we pull out of our text file. In this tuple, the first element is the key, which is a unique identifier for the tuple. So let me ask you: could an integer be the key of a tuple, or serve as a key? Just abstractly,
it doesn't matter what language. We see here that a string works; how about an integer? I'm seeing some yeses. What about a float? Well, yeah, but not really; essentially the key has to uniquely identify the element, it has to be hashable, that is to say, if you run a certain function on it, it gives you the same output every time. But that's a little farther than I need to go. So: the first element is the key, the second element is a value, or payload, and this idea of key-value pairs is sort of special for RDDs; they actually have a name, pair RDDs. The reason is that sometimes we want to group data together, and the key is the thing by which we group our data. So if we wanted to count all the occurrences of the word 'the' in our corpus, then every time we see the word 'the' we would emit the key 'the' and the value 1, and we can group all those occurrences by the key and sum them together, and that gives us the total occurrences of the word in the corpus. I'll take a second and say that again: key-value pairs are the mechanism that allows group-by operations, and if you think about it, this is the way it kind of has to work. We're going to use a special function called reduceByKey to do this for us on this particular RDD. I know I've thrown a lot at you here, and this is pretty messy code; I probably wouldn't recommend writing it this way anyway, but I want to show you that these commands can be chained, and once you understand the flow of data through it, it sort of makes sense. So let's start back at linesRDD, which is an RDD of type String. We're going to apply this mapping function, flatMap, to it, and we'll get to what that is in just a minute, but we know that the function it takes has to take a string, because each element is a string in lines
RDD. So for each string, we split it on spaces again, and the result will be an array of strings, right? Then, for each element in that array, we map the word to its key-value tuple, which is (word, 1). So what's the significance of putting a 1 here? Why not 10 or 20? Yes, because it's a single word: it's occurred one time. Every time I see it, I know it's occurred one time. Each mapping function is working very locally; it says, okay, I see this word here once, so I'm going to emit that value and let some other function take care of summing them together. I'm not keeping track of the total number of words as I go along, I'm just passing on the information that I have. So the result of the expression I'm highlighting here would be an array of tuples. flatMap is a special function that says: I know you're going to give me an array when you apply this function, but I don't want that array, just flatten it out, so that every element in that array becomes its own element of the new RDD. It's sort of like exploding; you see this happen in other contexts in computer science too, and it's one that comes up quite a bit. So in this context we're creating that RDD and flattening the whole thing out, so that the RDD we're looking at up to this point is an RDD of tuples. Does that make sense, everyone? Questions? [Inaudible question.] Good question. So we have an RDD of tuples there, and the last step of the pipeline, chained on with this dot here, is reduceByKey. This is essentially the same reduce we were using before, but now we're reducing by key, which means we group all the like keys together, and the key is just the first element of the tuple, the word itself; that guarantees that all the 'the's, all the 'Spark's, and all the 'Apache's that occur in the document get grouped together and summed. Pulled together, the chained version looks roughly like the sketch below.
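Reusing linesRDD from earlier (a sketch, not exactly the messy version on the slide):

val counts = linesRDD
  .flatMap(line => line.split(" "))   // one RDD element per word
  .map(word => (word, 1))             // pair RDD of (word, 1)
  .reduceByKey((a, b) => a + b)       // group by word and sum the 1s
counts.take(10).foreach(println)      // a few (word, count) pairs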
Question: wouldn't you need another step to expand the arrays? Yes, that's a good question; without the flattening it would be an RDD of arrays of tuples. And in this case it would be different, because the only transformation we've done is to split on white space, actually on single spaces. So yes, this mapping function could be made more refined: we could capitalize everything, or lowercase everything, or take the root of each word but not the ending. Good question. Others? Okay, let's wrap it up. So here's the nice graph, the directed acyclic graph, the general execution graph we talked about earlier; this is actually what's going on under the hood of MapReduce. I encourage you to take a look at it; we'll have this posted on our website. I just wanted to give you an idea of the kinds of things you can do. This gets much more complex, but I think you'll have a firm understanding now of the mechanics, and the mechanics don't really change depending on the API. You can read CSV files, for sure, and you can query them using SQL-like syntax; I just wanted to show you a little of that, and there's a rough sketch of what it looks like below. There's also a whole machine learning library that Spark maintains and has developed, which is extremely useful; it has things like logistic regression, clustering, and random forests. I have an example on YouTube of how to classify SEM images, so SEM and X-ray data, which some of you may have; you can use these built-in classifiers, which you don't have to program yourself, and create pipelines of data to do things like that. You can also do computation on taxicab data, for instance: here we've calculated the most popular taxi pickup and drop-off areas in New York, and we did that using the PageRank algorithm, which comes baked into Spark and is what Google has used for years to direct your web searches to web pages. So this is really cutting edge from an algorithmic standpoint.
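We didn't demo the CSV and SQL side today, but as a minimal sketch, assuming Spark 2's SparkSession (available as spark in spark2-shell) and a hypothetical taxi.csv file with a header row and a pickup_zone column:

val df = spark.read.option("header", "true").option("inferSchema", "true").csv("taxi.csv")  // DataFrame with named columns
df.createOrReplaceTempView("taxi")                                                          // register it for SQL queries
spark.sql("SELECT pickup_zone, COUNT(*) AS n FROM taxi GROUP BY pickup_zone ORDER BY n DESC").show(10)  // most common pickup zones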
So I'll try to finish up here, five minutes over. My tips: read the documentation, for sure; check the GitHub repositories for example code; and the biggest one, I think, is to use a good integrated development environment, which will prevent a lot of your mistakes before they happen. So again, apologies for going over, but we are here to help you; I'm certainly here to get you up and running. Don't hesitate to contact us through any of the channels, and we'll be happy to consult with you. Thank you very much.
Let us know there was one more thing. I wanted to say yeah. We're going to start mostly so josh is going to be talking about this new programming environment today called spark and you can use spark from python you can use it from R which will be really convenient. I think for a lot of users but if you are interested in importing your court code over to spark and you feel like it's an insurmountable obstacle feel free to ping Joshua to ping us and we're happy to sit down with you and look at your current code and help you convert into this new environment and another thing I think will probably be setting up is basically spark and padieu office hours so maybe starting next week we'll just add a one or two hour block once a week. You can come by sit down with Josh. And we can help evaluate your current workflows and see if what you're doing is a good candidate for this new environment so with that. I will turn it over to Josh. All right thank you very much. Well thank you all for coming today enjoying the pizza. This is an intro to using spark on the Big Data cluster and. I'm going to ask first of all to bear with me a little bit because today's talk might seem a bit more academic and unless less applied. I'm going to get to some really typical use cases toward the end of the presentation and I'm going to actually invite you guys to talk and discuss. Maybe some of the ideas you're having on how to the kinds of problems you face with big data but I do want to emphasize the mechanics of what's going on in spark because I think it's sort of fundamental to understanding why big data works and why this sort of computing approach is useful but it is also different from traditional HPC so with that in mind because this is very new I want to be upfront. First of all about how to get help with with these issues. So we'll of course is alluded to office hours. But there's the normal channels of submitting helpdesk ticket. I strongly encourage you to join the Aker forum slack team which we do have a big data channel there.
We also have a github organization in parallel to the acre github account which is devoted strictly to big data so big data is -. Vande we have a blog that sort of piggybacking off that github site there with some sports step by step instructions and then finally we do take the big data. Roadshow around campus and talk to meet with research groups if that's something you're interested in so let's talk about. Hadoop I put this up here intentionally to confuse. And wow you with you know bright colors. Hadoop is the core of Big Data these days and essentially Hadoop is the Hadoop distributed file system but a huge ecosystem of applications have developed around it now. I'm not going to go through all the names. Catchy as they may be and/or the graphics. I just want to give you the idea that it is a very diverse ecosystem so how is it different from HPC well the fundamental idea in HPC is usually traditionally been that we have a code running on fairly performant hardware and we're going to move data to that code but big data is just the opposite. It's the born out of the idea that it can be more efficient to move your code to wherever the data is located and by taking advantage of that data locality you can speed up the analysis so typically what this means in terms of storage is that on our cluster on the traditional cluster where you use GPFS on Hadoop we have HDFS the distributed file system so it's not that every piece of data is accessible everywhere in the big data cluster. It's that data resides on different nodes within that cluster. And we'll get a little bit more into the specifics there. There is an emphasis on more performant languages like C C++ and God forbid Fortran if anyone's using that I kid it's a good language but really the emphasis in Big Data is sort of being able to program quickly and get allowing developers to get to solutions quickly so you have much more the the niceties in Java Scala Python etc and really this the sort of correlates with this idea of imperative programming where I traditionally you think about moving through a for loop and visiting.
I'm going to eat visit each piece of data once and when I get to the end of the data. I'm going to do something else. That's imperative programming in functional programming. The idea is more okay. I have this idea of data that needs to be transformed in a certain way. I don't really care how I access it. But it's going to be transformed to this and it's going to be distilled to this so it's really it's worth reading up on. It's sort of an interesting new paradigm coming out of programming so with that in mind what are candidates for big data workflows that perhaps need to transition from traditional eighth piece. HPC to Big Data. Well the obvious first answer is lots of data if you have lots of data this is going to work better in general so typically data tasks that are embarrassingly parallel but more or less computationally simple. So it's not so much. Protein folding would be a poor example of a big data application because we're not really optimizing for we're not CPU bound that is to say the chunk of processing is not coming from the time that it takes to calculate something but rather the time it takes to move data back and forth. Okay so we're talking about typically. IO bound applications and really where this is caught hold in the last decade or so is machine learning because machine learning in general always gets better with the more data you have and there have been very clever ways to sort of make machine learning algorithms work within the context of Hadoop and so this is something we'll talk about towards the end of the lecture today but also something that you might not think about that actually might be useful to a lot of you this network analysis so you could think about why this might have come up in the last 15 years or so in the context of industry social networks right.
There's definitely a need to compute properties of networks quickly and efficiently and it's become actually very mature the algorithms to do this kind of thing and lastly we won't talk about this too much but processing data that's coming in in real a real-time stream so actually applying machine learning network analysis or any of the above two streams of data that you're not actually necessarily storing as is this is another key component of what spark can do so. I'm not going to go through this slide to in too much detail but I just did want to sort of point out that the tools for traditional scientific computing versus big data analysis are different in sort of meeting different needs. But that's something you can check out in your own time. I suppose because I want to get into the meat of the presentation which is talking about the saw the hardware that we have here available at Aker so as will mentioned we have a test environment set up on recycled hardware now which has been in use for over a year now as opposed which is actually one of the big selling points of big data is that it can run on commodity hardware so you're not investing huge amounts of money into the computation itself now what. I want to point out here. Is that the new hardware purchase that so the new cluster. That will be coming online. This fall actually has a pretty beefy data nodes there so 80 terabytes of storage 32 cores and half terabyte of RAM. I don't think anyone here probably has a laptop with half a terabyte of RAM in it so this could be. We're super excited about being able to use these and I should mention again that this is free for the campus to use this has been funded. So it's this perfect opportunity to transition to a big data workflow so I do need to since we're talking about hardware. I'll just mention that. Cloud era manager is the tool that we use to manage the cluster. And that's probably all you need to know about it but I did want to show just a nap.
Shot of our ecosystem and sort of the the tools that we're managing with cloud era so many of the ones you saw in the graphic previously. I'm not going to go through all these but I did want to sort of take you to spark and in order to do that. I need to talk about these three specific services so HDFS that we've already sort of talked about yarn is the scheduling utility for big data for our cluster and SPARC is the application development software so with that said what is HDFS well it stands for the Hadoop distributed file system. It's a distributed file system designed to run on commodity hardware. It's highly fault tolerant. Because by default data is replicated three times so with one node fails you have two copies and your other nodes are aware that a node is failed so it goes ahead and makes a copy of the other two data. It does this all behind the scenes and as a programmer. It's great right. You don't have to worry about it at all. So certainly suitable for applications with large data sets so here is HDFS in a nutshell. You take a big file and you copy it to. HDFS and pieces of those files go. Everywhere in the cluster. So for different nodes you'll get different pieces of data but again it's it's replicated three times. So chances are you'll have or you sort of guaranteed to have redundant copies of the data everywhere. That's that's pretty much. The gist of it. So how do you interact with it. Well it's not quite like traditional. UNIX but it's close so I'm just going to live demo it here and probably regret that decision but let's let's go for it so I'm here logged into our cluster and so the syntax for querying the Hadoop file system is Hadoop FS but then you can provide commands specifically for that sort of mimic the Unix commands but they're actually arguments to this Hadoop FS call so if I do an LS everyone of course does it means list the files in the directory. I will list the files in my home. Directory on the cluster yes from the Hadoop Big Data cluster and obviously I'm streaming now so it's going to be a little slower but there it is.
I've got a lot of stuff in there right so it works essentially the same as UNIX. But but not quite unix so. I'll give you a hint here your home directory when you log in here will be slash user slash your the net. ID and when you do that you'll see the exact same files there so I'll just for one more demo. I'll show this data drive. This is again this is not the same drives data on. GPFS this is. Drive called data that I created on Hadoop which hosts some specific data sets so Google engrams for instance York City. Taxi information. So enough of that. This is sort of the the command-line interface to the Hadoop file system so this Hadoop FS. Command has a bunch of arguments 30 or so. I suppose but here are some of the more common ones. So you can view files the contents of files which is very useful. These two commands here really what the workhorses would be so if I have data on the normal disk on and I want to get it to HDFS what I'll need to do is use this Hadoop FS - copy from local which will do just what our graphic here says it will take a local file shard it up distributed and you don't have to keep track of it you can just call that file at any point and HDFS knows where to go to get that file so there's some more here I won't go through them all because I do want to get to the spark stuff but here here's a really useful one actually question so the first thing would be to copy it over to the cluster through SCP or FileZilla a copy utility and from there what you would want to use is the HD Hadoop FS - copy from local or move from local depending on how whether you want to keep the UNIX copy or not right precisely so so for example Hadoop FS - cat user Fido my file will concatenate the contents of that file and it actually uses globbing expressions - so wild cards for people familiar and you can actually pipe the contents of a file into another UNIX command like head and just get the first hundred lines or so question it's an actual copy yep like popular client OS clusters like an lattice CP.
Don't believe so at this point well yes working on a solution for them so everyone's an expert on. Hadoop FS now so. Let's move on to yarn. So yarn stands for yet another resource negotiator and its jobs are basically resource resource management and job scheduling so. We're never actually directly connecting to the data nodes on the cluster. We're operating as clients who submit jobs to yarn and yarn schedules. Those jobs appropriately so this should sound eerily familiar because this is effectively the same role that's learn place on the traditional cluster. Okay but it's a bit different because one doesn't really have to well first of all you don't have to write a batch job script a slurm script in order to specify a job it simply works behind the scenes as is and so resource allocation is essentially transparent to the user. That should come with a big asterisk asterisk on it because you can actually tweak these things but that's pretty. It's a little more advanced than what we want to talk about today. So yarn is the intermediary between you and the worker nodes so. Let's get on to spark. This is straight from sparks website documentation. It's a fast general-purpose. Palestra computing system which means that it supports high-level api's in java scala Python and R and an optimized engine that supports general execution graphs. Okay we'll talk a little bit about what general execution jack graphs mean but this graph is not a general execution graph. This shows the ancestry of spark as it is okay so leading reading - right you see that. PI spark and spark are our sort of wrappers for Python to execute spark code. Okay so these get interpreted into spark spark itself is written in scala. So it's essentially a specialized library of Scala and Scala is itself a specialized library of Java.
So the interesting thing about this. Is that any valid. Java code is also valid Scala code and invalid Java is also valid spark code long story short. You have many ways for different ways to program spark which is very very useful and again that sort of plays into this idea that you want to be able to develop solutions quickly not necessarily spend your time writing really complex code and so the other thing that really makes spark general-purpose is that it can run in a number of different ways so if you're interested we actually have a demo app on the acre github site which shows how you can set up a spark cluster on top of GPFS. So this wouldn't be the recommended way to do it is it's more. What's the word more educational just to be able to do such a thing. But as we mentioned it runs on yarn and it runs on mezzos if anyone's interested but let's get down to the nitty-gritty of it so I want to run a spark job. How do I do it. Well the first way and probably the best teaching tool is to run the spark repple so repple is a read evaluate print loop tool. It's essentially a shell that runs spark code for you so the general way to call this is to point to the your installation of spark and then the command spark shell so on our cluster we actually have two different versions of spark going and I want to emphasize here that since most everyone is picking it up as a new tool stick with spark too. It's really it's got some optimizations and some more functionality built into it so. I'll just mention here that one can start a spark show with spark to shell or PI spark to so this gives you the option of interacting with the entire cluster via a Scala spark interface or through a Python spark interface. And this is what we'll use for the rest of the presentation but there's also a more general way to submit jobs or jobs that you've really fleshed out what the code is supposed to do. And that's through spark submit so sparks submit or rather spark to submit will accept his argument either a jar file so compiled Java or Scala code or a Python script.
Okay so super useful and you'll see if you read the documentation for spark submit. You'll see that. There's many different ways to configure it. But in general if you run spark submit or spark to shell on our cluster. You're already configured to use yarn which means you're going to use as many cores as as you're allowed to which is really great so now that we know how to use to invoke spark let's talk about a simple application so word count it's part word count is the hello world application for Hadoop and MapReduce spark everything the idea is simple you're going to count the occurrences all the words in a text file and for this we're actually going to use the Scala repple so if you're interested this content basically comes straight from the spark QuickStart guide which is here and I do want to mention that for this demo. I'm going to be using Scala spark and you're probably thinking well man. I already know a little bit of Python or I know a lot of Python why is this guy trying to teach me Scott. There are some very good reasons to do this. It's becoming sort of the lingua franca if is a say of data science but also there's some really nice type check and built into it so. I think you're going to be pleasantly surprised that it's pretty easy to understand it's very much like Python in some ways but having said all that if you look through the documentation for SPARC you'll see that you have example code in both scala and python essentially everywhere you go so you're able to switch back and forth between the two and it's quite simple to pick up on so I think you'll be able to follow along. Don't yell at me too much please. So the first thing we're going to do is read in a text file using the spark context so the spark context traditionally you're conventionally written. SC is the entry point for the spark API so it's already created for us in the repple so why don't I switch here and just demo this in the repple so all I've done here is call spark to shell okay so I'm in the shell what I want to do is create a new value.
It's called linesRDD, and it's equal to sc.textFile pointed at the file. Okay, so, amazing, right? We see that we've created this value called linesRDD, and the REPL is telling us exactly what it is: linesRDD is an RDD of String coming from this file. So you're probably thinking at this point, what in the world is an RDD? And where does this file live? It could be on HDFS; in fact it's just in my home directory on the cluster, so it's a normal file. We'll talk a little bit about the mechanics of that in just a minute. So: resilient distributed dataset. Everyone say it with me: resilient distributed dataset. This is the core idea of Spark, and it's that an RDD is a fault-tolerant collection of elements that can be operated on in parallel. Resilient: why is that important? Well, if we're talking about hundreds of gigabytes or terabytes of data and we're operating on them, we're going to need a few nodes to be able to do that, right? And if one node fails, we don't want to restart the whole process. So the idea of resilient is that parts of this data set can crash, or whatever, and we don't care; we can recover from that, and that's baked into the programming of the RDD itself. The other thing is that the parallelism on this RDD happens magically and transparently to you, which, if you think about it, is a very big departure from traditional HPC, where you have to specify job arrays or MPI calls to different nodes, things like that. Here in Spark world we know we're going to be operating in parallel, so we have RDDs to do all that hard work for us, and we can just let them do it. So: resilient distributed dataset. If you don't take anything else away, that's what you should remember. And as such, because an RDD is distributed, we can't actually view it directly; to view an RDD we have to send all the data back to a single node.
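Here's that line as you'd actually type it in the REPL, a minimal sketch assuming the Spark README sitting in the home directory (the exact path on your system may differ):

```scala
// Create an RDD of lines from a plain text file on the shared filesystem;
// the same call works for a file on HDFS.
val linesRDD = sc.textFile("README.md")
// linesRDD: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile ...
```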
Okay, so going back to Will's point here: where did this file live? The Spark README happened to be in my home directory, so I'm pointing at that file relative to my home directory. What Spark does when I create this text file RDD, though, is send pieces of that file to different nodes in the cluster; it partitions the file however it wants and sends the pieces out. So as such, I can't just read it; I have to grab it back, and I can do that with this call, linesRDD.collect. Collect is going to send all the data back to a single node, and then I'm going to run a foreach: for each element in this RDD, print the line. What we'll see is this. It does take a second, because it's been distributed, so Spark has to go and actually fetch it from the different partitions. Here we've just printed all the output, and it's pretty big, but this is the README for Apache Spark; that's just the dummy file we're playing with. We can see that this RDD happens to be an RDD of String, so each element is its own string, and you'll notice too that it stayed in order in this case; it reads as it should, so here's the same content on the screen. So that's great, we have an RDD; what can we do with this thing? Well, many of you have probably heard of map, so we're going to delve into the world of MapReduce here. Mapping simply means that I'm going to transform the elements of this RDD into something else. Usually our data is not exactly the way it needs to be, or doesn't give us the information we want as is, so we need to transform it; that's the entire idea behind mapping. Mapping is a common function in Python, and it's common in Scala as well. So here, if I just have a Scala string called foo, just a simple line of text, I can map this string into an array of strings using the .split method, and those of you who know Python know that this works exactly the same way there: when I call split on a string, the result is an array of strings. So essentially what we want to do is do this to every element of our RDD.
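In the REPL, those steps look roughly like this (a sketch; collecting and printing everything is only sensible for a small file like the README):

```scala
// Bring the distributed elements back to the driver and print them, in order.
linesRDD.collect().foreach(line => println(line))

// Plain Scala: split a single string into an array of words with .split
val foo = "A simple line of text"
foo.split(" ")    // res: Array[String] = Array(A, simple, line, of, text)
```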
Okay, so let's just do that. If we want to know the words per line, we can apply that splitting function to each line in the RDD. Let's take a second and let that sink in: we started with linesRDD, which is an RDD of String, and we're going to map each string with an anonymous function that takes a string and splits it. Those of you familiar with Python will know that for anonymous functions in Python you use lambda x; it's the same idea here, we're just constructing the function that tells Spark how to map each element. The result, which we call words, is an RDD where each element is an array of strings. And if we want to know how many words are in each line, all we need to do is take the size of each of those arrays. So I have this words RDD, and I'm just going to map it: I know each element is an array of strings, so I take the array and apply the method size, which gives me the total number of elements. If I do this, I can store the result as wordsPerLine, and I'll just print out the first five elements of this RDD. Now let's go back for a second and look at what our original data was: it was "# Apache Spark", then a blank line, then "Spark is a..." blah blah blah, which has what, thirteen or fourteen words in it. And that's exactly the result we have here: the first line, which was just "# Apache Spark"; then the blank line, which counts as one in this case; then fourteen words; and so on. Questions? Does it make sense at this point? What doesn't make sense? The mapping is transforming each element of the RDD independently, so you can think of this as a completely independent transformation of whatever the element is in your RDD.
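The two mapping steps just described, sketched in the REPL (the counts you see will depend on the contents of the README):

```scala
// Map each line to an array of words, then each array to its length.
val words = linesRDD.map(line => line.split(" "))    // RDD[Array[String]]
val wordsPerLine = words.map(arr => arr.size)        // RDD[Int]

// Peek at the first five per-line word counts.
wordsPerLine.take(5).foreach(println)
```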
So here we've mapped a string into an array of words, and then we've mapped that array into its size, an integer. We started with a string, went to an array of strings, and then to an integer, so wordsPerLine is now an RDD of integers, precisely. Yes? Very good point; the question is about smaller, regular data: if your data set is small, or split across many smaller files, is it even worth distributing it like this? Sure, so here's a good example, and it's actually a real one. Say I'm parsing through 500 gigabytes of website data, text and content from blogs, from food bloggers, and I have a dictionary of food words that tells me the flavor profile for a given word; if I see "bacon" in the dictionary, it's going to be delicious, or terrible, or whatever. The dictionary itself is not going to be a huge file, right? It's going to be maybe a few megabytes. In that case you can certainly store it locally, read it in through Python or Scala as a normal text file, and use that dictionary as is, as a non-distributed data set. There is also a mechanism to broadcast that data set to every node in the cluster, so you get a copy of it everywhere, and every node can look at it and reference that read-only data, if you will. So yes, you're right: for a text file of a few kilobytes, no, it would not make sense to make it an RDD; our README example is purely educational. But for the half a terabyte of website data that you want to analyze, you certainly want that to be an RDD. Good question. And as Davide mentioned, here's a point that's sort of key; it's maybe not truly necessary to get things to work, but it is necessary to understand: each time you transform an RDD, you're actually creating a new RDD, you're not mutating the RDD that's there in place. The reason for this has everything to do with resiliency. I can't know for sure when I'm going to access a piece of data, or rather, the nodes themselves don't know exactly when they're going to access a piece of data, and they're all doing this in parallel. If they were all trying to modify the same piece of data, they'd get into race conditions, and it would be very difficult to guarantee a consistent result every time you ran your program. So the idea is that these structures are immutable.
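To make the broadcast idea from that answer concrete, here's a minimal sketch; the dictionary contents, variable names, and file path are all made up for illustration:

```scala
// A small, read-only lookup table lives on the driver...
val flavorDict: Map[String, String] = Map("bacon" -> "savory", "kale" -> "bitter")
// ...and gets broadcast once to every executor instead of shipped with each task.
val flavorBC = sc.broadcast(flavorDict)

// The big data set stays an RDD; note that each transformation returns a NEW RDD,
// the original is never modified in place.
val posts = sc.textFile("blog_posts.txt")
val flavors = posts
  .flatMap(line => line.split(" "))
  .filter(word => flavorBC.value.contains(word.toLowerCase))
  .map(word => (word.toLowerCase, flavorBC.value(word.toLowerCase)))
```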
So that immutability point is important, though not necessarily critical to getting things to work; good question. Okay, that was the mapping phase. If we want to get just the total number of words in this document, we can sum up our wordsPerLine RDD. Remember, it's an RDD of ints, and we're going to apply a method called reduce. This should sound familiar: MapReduce. It also takes a function as its argument, a lambda function that takes two elements, and I haven't told you what those are yet, but it tells Spark how to combine those two elements together. So we just said mapping is transforming data; reducing is combining data. And that's the entire MapReduce paradigm in a nutshell. This would be useful if you want to compute the mean of a data set, or the standard deviation, or something like that: you have to have a way to combine values together, and that's exactly what this reduce mechanism is doing. If we look at this function, it takes two parameters, a and b, and I'll just give you a hint here, they're both integers. Now, the normal way you would calculate this sum in, say, C would be: okay, let's create a for loop, for i equal to 0 to n, step through each element, keep a variable called total, and add to total every time. Here the model is different.
Here's what we're saying: if we had a set that contained only two members, how would we combine them? We'd just add them together. And it doesn't actually matter the order in which we combine the two, which again harkens back to this whole idea of parallelism: you can't really know the order in which your operations are going to happen. Luckily for us, a plus b is equal to b plus a, so this is a commutative operation, and it's perfectly valid as a reduce step. You can think of others that might not be; for example, trying to directly calculate a mean, where you don't know the denominator up front, so you have to calculate the mean in two steps. I might be getting a little into the weeds here, but let's take a look: we started with an RDD of ints, we applied our reduce method on it, and our result is this value total, which is an integer. That's another hallmark of the reduce step: we take this distributed thing and collapse it back down into our normal Scala objects. This thing isn't distributed anymore; it's an integer that lives on the current node, the master node as it were, so it's directly viewable. This is sort of a rehash of what I just said, but it is critical that you understand that the order in which you do the reduce operation cannot matter. That's admittedly maybe a tough way to think about things at first, but that's why we're doing the hello world example, so you all understand the mechanics of it. No problem. So let's just recap: the MapReduce paradigm is that you have some data, it's typically distributed, and you transform it in some way. Here I've just put some examples of transformations. We saw map, but certainly you could apply a filter function; say we want to get rid of lines that contain the word Spark, we can totally do that with a filter on our RDD.
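Here's the reduce step from the demo, plus a filter along the lines just mentioned (a sketch; the filter predicate is only illustrative):

```scala
// Collapse the distributed per-line counts into one plain Int on the driver.
// The combining function must not depend on ordering; addition is commutative
// and associative, so it's safe here.
val total = wordsPerLine.reduce((a, b) => a + b)

// A filter transformation: keep only the lines that do NOT mention "Spark".
val linesWithoutSpark = linesRDD.filter(line => !line.contains("Spark"))
```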
We'll talk about flatMap in a minute, but there are also more SQL-ish operations; if you're familiar with databases or set theory, then join, union, intersection, these are all things you can do to transform your data. And then you reduce, with the common ones being reduce, which we just saw; count, which simply counts the number of elements; and sample, where you just generate a random sample of the data. In a nutshell, that's it. There are certainly nuances to these functions, and there are actually many more functions that Spark exposes for doing this kind of analysis. So let's take it a little further. All we did there was count the total number of words, but now let's say we want to create a list of the most common words in this document; that really is what word count means. Let's think, just in pseudocode, about how we'd do such a thing. The first step would be to map each line into an array of words; we're just going to split it so we're not looking at sentences but at words. You've seen that already. The second step is to map each word into a word-value pair, a tuple. So who here can tell me what a tuple is? Who doesn't want to tell me but knows? Is that a hand? No? Okay, sure: it can be ordered, it can hold different types or the same type, and yes, it is immutable. Very good. It's really a generalization of a double, a triple, a quadruple; a tuple is this immutable, list-like thing of a certain size, size n, say. This one happens to be a tuple of size two, and generally in Python and in Scala you denote tuples with parentheses. So this is a tuple of a string and an integer, and that's exactly what we're going to build from the words we pull out of our text file. In this tuple, the first element is the key, which is a unique identifier for the tuple. So let me ask you: could an integer be the key of a tuple, or serve as a key, just abstractly?
It doesn't matter what language. We see here that a string would work; how about an integer? I'm seeing some yeses. What about a float? Well, yeah, there's technically a discrete number of floats, but they don't make good keys. Essentially the key has to uniquely identify something; it has to be hashable, that is to say, if you run a certain function on it, it's going to give you the same output every time. That's a little farther than I need to go there. So: the first element is the key, the second element is the value, or payload, and this idea of key-value pairs is sort of special for RDDs; they actually have a name for them, pair RDDs. The reason is that sometimes we want to group data together, and the key is the thing by which we're going to group our data. So if we wanted to count all the occurrences of the word "the" in our corpus, then every time we see the word "the" we would emit the key "the" and the value 1, and we can group all those occurrences by the key and sum them together, and that gives us the total occurrences of the word in the corpus. Let me take a second and say that again: key-value pairs are the mechanism that allows group-by operations, and if you think about it, this is the way it kind of has to work. We're going to use a special function called reduceByKey to do this for us on this particular RDD. I know I've thrown a lot at you here, and this is pretty messy code; I probably wouldn't recommend writing it exactly this way anyway, but I just want to show you that these commands can be chained, and once you understand the flow of data through it, it sort of makes sense. So let's start back at linesRDD, which is an RDD of type String. We're going to apply this mapping function flatMap to it, and we'll get to what that is in just a minute, but we know the function has to take a string, because the elements of linesRDD are strings.
For each string, we're going to split it on spaces again. We do that, and the result will be an array of strings, right? Well, for each element in that array, we're going to map the word to its tuple, a key-value pair which is the word and the integer one. So what's the significance of putting a 1 here, why not 10 or 20? A single word, yes: it's occurred one time. Every time I see it, I know that it's occurred one time, and that's all I know. Each mapping function is working very locally; it says, okay, I see this word here once, so I'm going to emit that value, and I'll let some other function take care of summing those together. I'm not keeping track of the total number of words as I go along, I'm just passing on the information that I have. So the result of the expression I'm highlighting here would be an array of tuples. flatMap is a special function that says: okay, I know you're going to give me an array when you apply this function, but I don't want that array, just flatten it out; for every element in that array, create a new RDD element. It's sort of like exploding; you see this happen in other contexts in computer science too, and it's one that comes up quite a bit. In this context, what we're doing is creating that RDD and flattening the whole thing out, so the RDD we're looking at up to this point would be an RDD of tuples, right? Does that make sense, everyone? Questions? Good question; so we have an RDD of tuples there. The last step of the pipeline, and it's actually chained with this dot here, is to reduceByKey. This is essentially the same reduce function we were using before, but now we're saying we're going to reduce by key, which means we're going to group all the like keys together, and the key is just the first element of the tuple.
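Putting the whole pipeline together as one chained expression, roughly as it appears in the Spark quick start (a sketch of the steps just described):

```scala
// flatMap: one RDD element per word; map: pair each word with a 1;
// reduceByKey: group the pairs by word and sum the 1s for each distinct word.
val wordCounts = linesRDD
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)

wordCounts.collect()    // Array of (word, count) tuples for the whole file
```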
The key is whatever the word is, so that guarantees that all the "the"s and all the "Spark"s and all the "Apache"s that occur in the document are going to be grouped together, and we can sum them up. Question: would you otherwise need another step to expand the arrays, since it would be an RDD of arrays of tuples? Yes, that's a good question; in this case it stays simple because the only transformation of the data we've done is to split on whitespace, and actually on single spaces. We could make this mapping a more refined function, say capitalize everything, or lowercase everything, or take the root of each word but drop the ending. So, good question. Others? Okay, let's wrap it up. So here's the nice graph, the directed acyclic graph, the generalized execution graph we talked about earlier; this is actually what's going on under the hood of MapReduce. I encourage you to take a look at this; we'll have it posted on our website. I just wanted to give you an idea of the kinds of things you can do. This gets much more complex, but I think you'll have a firm understanding now of the mechanics, and the mechanics don't really change depending on the API. You can read CSV files, for sure, and you can query them using SQL-like syntax; there's a short sketch of that after this paragraph. There's also a whole machine learning library that the Spark project maintains and develops, which is extremely useful: it has things like logistic regression, clustering, and random forests. I have an example on YouTube of how to classify SEM images, so SEM and X-ray data, which some of you may have; you can use these built-in classifiers, which you don't have to program yourself, and create pipelines of data to do things like that. You can do computation on taxicab data, for instance: here we've calculated the most popular taxi pickup and drop-off areas in New York, and we did that using the PageRank algorithm, which comes baked into Spark and is what Google has used for years to direct web searches to web pages. So this is really cutting edge from an algorithmic standpoint.
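Here is the CSV-plus-SQL sketch mentioned above, using the Spark 2 DataFrame API; the file name and column names are hypothetical, just to show the shape of the calls:

```scala
// In the Spark 2 shell, a SparkSession is already available as `spark`.
val trips = spark.read
  .option("header", "true")       // first line holds column names
  .option("inferSchema", "true")  // guess column types from the data
  .csv("taxi_trips.csv")          // hypothetical file

// Register the DataFrame as a temporary view and query it with SQL.
trips.createOrReplaceTempView("trips")
spark.sql("""
  SELECT pickup_zone, COUNT(*) AS n
  FROM trips
  GROUP BY pickup_zone
  ORDER BY n DESC
""").show(5)
```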
So I'll try to finish up, since I'm five minutes over here. My tips: read the documentation, for sure; check the GitHub repositories for example code; and the biggest one, I think, is to use a good integrated development environment, which will prevent a lot of your mistakes before they happen. So again, apologies for going over, but we are here to help you; I'm certainly here to get you up and running. Don't hesitate to contact us through any of the channels, and we'll be happy to consult with you. So thank you very much.