How to use Python as a Data Analyst

What up dead nerds let's see how I use this as a data analyst now python is a popular tool you can install for free and here are some recent poll results from my YouTube channel let's analyze them first.

I collect these results into this script next I perform a simple calculation to analyze it then I plot these findings to share in three simple lines of code we just performed the three major steps of any data analytics project and this multi-purpose programming language is nearly Limitless in what it can do from building Bots to scrape the internet for more data to using Advanced machine learning to predict results or even building a website to share your favorite data nerd content speaking of content my subscribers have been Relentless in asking about top skills they should be learning as a data analyst so like any lazy data nerd I'm going to automate this data analysis and build an app to tell them just this first we need real-time data whenever a subscriber accesses the app it needs today's results next we'll have to perform data analysis to find the top skills within this data and last we'll share these results via an easy accessible app now we're going to do this entire project with python and I've chose this over other tools as python makes it super easy to not only share all of this code with you but also so that way you can contribute and improve this code because frankly I'm still improving my skills as well and I've been spending the last number of years doing this so you're not going to master this language in this 15 minute video instead I reached out to Coursera to sponsor this video and I'll be sharing course from them that I've taken to improve my python skills that I'll link below alright so let's get in the first phase of this project of collecting the data that we need now there are a number of ways to get data if you work for a company they'll typically house their data in databases or if they're less Advanced they'll store them in Excel sheets no God please no no for both of these scenarios python can connect to these data sources and a lot more but unfortunately I don't have any access to any company databases so we need to search for publicly available data sites like Google and even kaggle are great starting places but searching them I found they didn't have what I needed another popular option to get data with python is web scraping now I attempted this last year with a python bot I built that collected job postings from LinkedIn with this data collected.

I then analyzed it for skill requirements unfortunately I ended up getting banned on LinkedIn from doing this so this wouldn't satisfy our rule number one of collecting real-time data now I will admit web scraping is great if you need something once or even twice but most likely if you're not getting actively blocked by websites your code will eventually break due to web pages changing so where the heck are we going to get this data well there's actually a more reliable source than web scraping and that is apis or application programming interface application programming interface these apis allow you to send code from your computer the client to an another computer called the server in order to request the information that you need in a more sustainable manner.

You've Got Mail say for example I wanted to collect the key information from a favorite programming languages from Wikipedia I can install something like the Wikipedia API with python so I can tell Wikipedia what page I want and then even specify things like page title and sections even in the future if Wikipedia updates the layout of their page this API will continue to work and there are actually a ton of apis out there to use luckily for us I found this one which is serp API conveniently which provides results from a search engine result Pages like Bing Yahoo or even YouTube but I would say that for searching Google this API really has some special features specifically I can go in and actually search job results on Google so say I was to search data analyst jobs in the United States this is what I would get so we can use serp API to get all these different job postings so check this out first I can install serp API then I can specify things like like job title and search location from there I call the API with this information in less than a second we have it and from there I can actually print out these results and this has all the information we need such as title location description and salary if we actually went back to that web page itself we can see that this job right here is requesting things like SQL Python and power bi and we'll be able to extract this information from all these different job postings for our dashboard so this is pretty amazing right we're using python in order to call an API and get the job postings that we want what we'll need to do next is set up some Automation in order to call this API daily and then store these results now what I just did in Python really only requires the basic skills but unfortunately there's a lot to the basics specifically for python more than I can cover in this video so if you're a complete beginner to python I recommend checking out the python for everybody specialization Dr Chuck is the instructor and he breaks down the basics of python all the way from installing it understanding data types calling apis and finally accessing data via SQL databases which is actually what we're going to do next so right now we're collecting around 100 jobs per day but I'm planning to up ramp this to around 10 000 jobs a day if I do the math for that that's around 3 million jobs per year wow so storing this data in something like a CSV or Excel file is not an option so we need to store this data in a sustainable solution something that's designed for large amounts of data so we're going to use a SQL database now because I want this data to be accessible by everybody we're going to use a cloud-based solution specifically bigquery from the Google Cloud platform those that took the Google analytics certificate are familiar with this solution all right so jumping back to my computer I've set up a blank database in bigquery that has all the different fields that we need to import into it once we have those query results from serpi API we can then connect to this bigquery database and insert those new results after running this script I can then go back to that bigquery database and look in and actually see that all of this data was imported into it now remember the first rule this project we need to have real-time data because of this I need to execute this python script daily in order to grab the jobs from serp API and insert them into bigquery now if we have to rely on me to run this python code daily we're going to be in big trouble as I'm going to probably forget instead I took all this python code and put it into Google Cloud to be automated and run daily on its own so now this data pipeline is fully automated and will continue to collect into the future but that's not even the best part I've taken it a step further to export this data from bigquery into kaggle so that way all my subscribers have access to this real-time data which I don't know of many public data sets that actually do this now knowing Cloud technology such as Google Cloud Azure even AWS is not something that I think an entry level data analyst needs to know however I do think it is a growing skill for those that have a few years of experience a dead analysts should start to learn because of this I've invested more time in learning Google cloud in my studies potentially saved me a bill of three thousand dollars let me explain when I first started setting up these services in Google Cloud I made a big mistake and started a service I didn't need called Google data flow come to find out there's an expensive service which cost me almost 10 bucks a day that escalated quickly so if this would have ran all year it would have cost me over 3 000 bucks and Google Cloud doesn't have any alerts or notifications to let you know when you're using more of their service luckily for my studies I learned the basics of monitoring services and checking balances so I was able to find this issue before escalating further now my bill is back to normal and it's only costing about a few cents a day which isn't that bad all right so let's actually get into cleaning and analyzing the data and this is where I spend the majority of my time when I first started as a data analyst my expectation was that I'd spend the majority of my time analyzing data in reality I I spend most my time trying to clean up the data what the remember our goal we're trying to find what are the top skills of that analyst inspecting one of the job descriptions we can see buried inside of it is a list of tools that are required for this job so the great thing about python is we can just tell it what to look for within these descriptions and extract those values out so I put together a pretty hefty list of keywords that python needs to look for focusing on programming languages and also On Tools with these lists of keywords I then created a loop to go through each one of the job descriptions and extract these keywords out of it so now looking at this updated data set we can see that python went through and pulled out those keywords that were in each of those job descriptions but we're not done with data cleaning just yet I'd like to clean one more column and that's salary right now it's in an unusable format sometimes it's hourly or yearly sometimes there's ranges I can go through and create rules to clean up all of those different salary columns so now I have the salary in a much more usable format having it in columns for the average Min and Max along with the rate alright so I know that took me like less than a minute to explain but that actually took me a few days actually to figure out and actually clean so don't underestimate how long it's going to take for the data cleaning portion all right let's move into the Eda now or exploratory data analysis so looking at some basic statistics it looks like we've collected around 1300 jobs already and are averaging about a hundred jobs a day since we started this let's actually go in now and visualize a lot of the different columns and see what values they have in them we can see that data analyst is one of the most frequent titles for the job type it looks like there's mostly full-time jobs available with some contractor but what I think is most interesting is that most of the jobs are coming from LinkedIn the place that I wanted to scrape originally but also from these other ones so upwork monster and talent so we get an assortment all right so let's now look at some of the statistics of the salary this is a histogram showing the salaries are distributed between fifty thousand and two hundred thousand dollars it looks like we have a high clumping around a hundred thousand dollars we'll probably need to investigate this at some point so next is the hourly pay it looks like it's ranging from 20 to 80 bucks I actually dove into this further and found that the majority of these postings came from upwork which is a freelance site so for both these salary plots this includes not only entry level but also those that have more experience so this is why this is such a big range for both of these all right so let's actually dive into that final analytical question of what is the top skill of data analysts for this we need a metric because we have skills and a job posting We can evaluate the likelihood of a skill appearing in a job post if python is in two of three job postings its likelihood is 66 so once again python does this really conveniently I can go through each of those keywords and calculate a percentage for each one to determine their likelihood to be in a job posting which drum roll please is SQL and then Excel and finally python but I'll be honest.

I'm not really surprised by these results they're actually very similar to what I found last year scraping LinkedIn job data now if you're curious to seeing all this python code that I just went through put this notebook on kaggle so you can go through it download it and try it out and if you're looking at the code and have no idea what's going on there then after you've gone through and learned the basics with python for everybody I recommend next diving into the applied data science with python specialization this course is packed full of information it has all those tactics and information to do exactly what I just did with popular libraries such as numpy matplotlib and even pandas and the Highlight at the end of the course is that it even gets into machine learning as I'd like to build some models to predict salary and skills but that's for another video alright let's actually jump back into this python notebook I now have this script in an automated fashion so say I wanted updated results tomorrow I could just go and run this script actually let's go see future Luke to see what the results are tomorrow future Luke here and I guess I'll just stop editing to answer this question anyway I ran this python script and it looks like SQL is still in the lead but that we also have python catching up nice thanks future Luke cool story bro I'm gonna get back to editing okay so this is good for me and for future me but how the heck do I get these results to my subscribers well this is where dashboard Solutions come in typically in the past I've used Solutions like power bi and Tableau for this we're going to use Python which I've never really done before now python has a lot of different support and packages to use for this I decided to just go with the most popular the most user-friendly option which was streamlit I was able to reuse a lot of the work I had done already from importing and cleaning the data performing all those calculations and then visualizing it all the best part is that streamlit also deploys your app for free so let's check it out on the first page it has where you can see the top skills for data analysts also have options to sort by different languages and then also you can go in to even see a daily Trend analysis of each of these skills now I also wanted to include the salary data so I included it to where you can see the annual hourly and also a standardized method that combines both the annual and the hourly into a common unit there's also other features as well to show the health of the data collection something to explore the actual data set itself and then a page just to find out more about this project itself so I'm really proud of this this is a really crazy solution we're using python for everything from Gathering our data from Google jobs to putting it into a database at bigquery and then extracting it into providing in this dashboard that anybody can access and this is hopefully only the beginning of this project right now we're only extracting about a hundred jobs a day for data analysts in the United States I want to grow this further to those beyond that analysts such as data scientists and data engineers and thankfully serp API has agreed to help with this for providing some free credits and with that I need your help if you find any interesting insights in this data set I'd be Beyond grateful if you share them to my subreddit r slash data nerd.

I'm hoping that by open sourcing this project we can get more contributions and make it easier for those that are aspiring to become data analysts Engineers or scientists to land their jobs in this field all right as always if you got value out of this video smash that like button with that.

I'll see you in the next one.