Single cell data analysis using VisR: Part2 - Seurat


Hello everyone, thank you for coming. Today will be a demo of the Seurat R app in VisR. In case you didn't come to the last session: VisR is a Java-based platform that contains a lot of Java apps as well as many R apps. These R apps allow users to use R packages and R functions without having to go into the R console, so it's a more user-friendly way of using R. Today I'll be talking about the Seurat R app.

First, a brief overview of the Seurat R package, which is developed by the Satija lab. It's a very commonly used single-cell sequencing analysis tool, and the tutorials for this R package can be found at this website. The general workflow is very similar to the Cell Ranger R Kit package. It takes as input an expression matrix, which typically contains either the read counts or the UMI counts. After inputting the expression matrix we do some data pre-processing: filtering the data, normalizing the data, scaling the data, and so on. After that we reduce all the genes into a few dimensions that capture most of the features of each cell. Then we use the reduced dimensions to cluster the cells, and perform differential expression analysis on these clusters to help us draw meaningful biological inferences.

Since some steps in this workflow take a long time, we will use a combination of a live demo of this R app and some pre-generated figures to illustrate how the app works. First we open VisR, and we can search for Seurat here. Alternatively, we can go to the sequencing folder and then single cell, where we can see all the single-cell R apps. This is a test version, so it might not be on the website yet; sorry about that. We have the Cell Ranger R Kit, which was demonstrated in the last session, and Seurat here, as well as Monocle, which is still a work in progress.
So we drag Seurat into the main panel. The main panel will display the plots generated after each run, and the right-hand side contains the parameters and options we can set for this R app.

First we choose the workflow. We can either perform the standard dimension reduction, clustering and differential expression analysis of one single data set, or we can do an integrated analysis in case we have two different data sets with different experimental conditions. It currently supports only up to two different data sets. To start with, we'll use the analysis of one data set. For the input options, we can either load the Cell Ranger pipeline output, which is the folder that contains an outs folder, or we can load a Seurat object. Since we don't have any Seurat object now, we load the Cell Ranger pipeline output. We click on the three dots, which lets us browse our directories. I have a folder with the PBMC 3k data set, which is provided by 10x Genomics. When I open the folder we can see the outs folder. We do not select the outs folder; we use the folder that's one level above. Okay, click open. It appears here, and the next section is expanded.

We choose the output folder; we can store our outputs in a folder just called output. Here there are two options. First, we can either create a new subdirectory or not. This checkbox is checked by default so as to prevent old files from being overwritten by files generated from the new run, and the subdirectory will be named after the current date and time. The other option is whether to save the Seurat object or not. We recommend saving it every time, but for steps that don't make any changes to the object we can deselect this option to save some time, because if the object is very large it can take quite some time to save. After specifying the output directory, we can now create the Seurat object from the Cell Ranger pipeline output.
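The "create a new subdirectory" option described above can be sketched in a few lines of Python. This is only an illustration of the idea, not VisR's actual code: the function name `make_run_directory` and the timestamp format are assumptions, since the talk only says the folder is named after the current date and time.

```python
from datetime import datetime
from pathlib import Path
import tempfile


def make_run_directory(output_root, create_subdirectory=True, now=None):
    """Return the folder a run should write to, creating it if needed."""
    root = Path(output_root)
    if create_subdirectory:
        # Assumed naming scheme; the real app's format is not stated in the talk.
        stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H-%M-%S")
        run_dir = root / stamp
    else:
        # Without a subdirectory, a new run writes next to (and may overwrite)
        # the files of earlier runs.
        run_dir = root
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir
```

Calling `make_run_directory("output")` twice, a second or more apart, gives two different folders, which is why results from an earlier run are not overwritten.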

These are the parameters that we can adjust. The no-limit option basically just means that there is no upper bound on this value; if we uncheck it, we can change the value. Here we click run, and it starts to create a Seurat object as well as perform the filtering step based on these parameters. We can see the progress here in the printout: it's now loading the data and creating the object. Another place to see the outputs is the console, which shows both the printed outputs and the code that's being executed. In fact, we can even see the source code from the code section here.

After the run finishes, the main screen displays a plot. If multiple plots are generated from a single run, or if we selected multiple steps and ran them at the same time so that multiple plots are generated, this plot will be only the last one generated during the run. As for the rest of the plots, there is a PDF file generated that contains all of them.

As a brief summary, all the output files can now be found inside this output folder, in a subfolder named with the current date and time. If we go inside this folder we see several files. This one is basically just the plots that were previously shown. This one contains all the parameters that were specified during this run. The cells file contains all the information that's available for each cell; as we progress and more information is added to the Seurat object, we'll see more columns being added to this file. Another one is seurat.RDS, which is the Seurat object. You can load this object into your own customized R script using the readRDS function for your customized analysis. The object can also be loaded directly into the VisR Seurat R app using the load Seurat object option, so that if we already performed a lot of analysis steps before, we can load this object directly and don't need to rerun the previous steps that have already been computed. So now we have this Seurat object.

We can change the input method to load Seurat object and again browse from the three dots, go to the output folder from the previous run, and load the Seurat object. Now we can choose which subsequent step we want to run. The first is find variable genes. That's a quick step, so I'll just show it here. Again, the corresponding parameters are exposed, so we can change them as we want. I'll click run and it starts running. We can see the output PDF file is already generated, and the screen shows the plot for the variable genes, where the dots labelled with a gene name are the variable genes that have been selected. If we are happy with the result we can proceed with the subsequent steps; if not, we can just adjust the parameters here and rerun this step. It's similar for each step we run. Once we are content with the results of one step, we can go to load Seurat object and change our input Seurat object to the one that we want.

The following steps are the dimensionality reduction and clustering the cells. They take some time to run, so I will just use my slides to show the results. Here is the dimensionality reduction. We use PCA for that, and we can specify the number of PCs here; this is the initial number of PCs to be computed. The output plots show the cells on the PC1 and PC2 projections. To determine how many of these PCs are actually significant, Seurat provides three methods for choosing the number of PCs. The first is JackStraw, in which each PC has an associated p-value, so we can choose where to cut off based on the p-value. Another is the scree plot: we plot the variance of each PC and try to see where the elbow of this plot is, that is, where it starts to level off. The third is the PC heatmap, which plots how each PC drives the extreme genes and cells apart. One can see that as the PC number goes up, the difference between the two extremes becomes less obvious. For all three of these steps we input an object containing the PCA results; this object can be generated from a previous run of run PCA, or it can be generated from your own script.

After that we run t-SNE, which makes it easier to visualize the cells. I'd just like to mention that in the run t-SNE step we provide an option to automatically calculate the number of PCs to use. If you just want a quick run-through, you can choose this option so that you don't have to specify the number of PCs. If you chose to run the previous three steps and want to pick the best PCs yourself, you can uncheck this option and specify the number here.

After that we cluster the cells based on the PCA results. The object should also contain the t-SNE results, so that the clusters can be plotted on the t-SNE 1 and t-SNE 2 projections. If we didn't run t-SNE before, the cells will be projected on PC1 and PC2 instead, where they are harder to visualize, since the cells are better separated on the t-SNE projections. Again, for the clustering step we have the option to automatically compute the number of PCs, but it's recommended that you don't use that and specify the number of PCs yourself.

After we have the cell clusters, we can perform differential expression analysis on these clusters to help us identify which cell type each cluster corresponds to. We have two options here. The first is to compare each cluster to the rest of the cells.

That is done for every cluster. Another option is to choose specific clusters to compare against each other, or you can specify two groups of clusters. Here there's only cluster 3, but if we want to compare all the cells in clusters 3 and 5 against all the cells in clusters 4 and 6, we can just use a comma-separated list of cluster names. If you don't specify anything for group 2, it is assumed that we are comparing group 1 against all the other cells in the data set. There are some other options as well. For example, there is the differential expression test method: as we can see here, it offers the option to choose between the different statistical tests provided by the Seurat package.

After running the differential expression analysis step, a new file will be generated in the results folder. Its name is self-explanatory. This file is tab-separated, so it is in a table format, and we can load it into VisR to view the content. Let me show you by loading some pre-computed results. To load a table you can use the add table option and add the table here, from our example output. We can then use a built-in app called table view to see the content of this file. Here we can filter the table. If we filter the table by cluster, we can see the genes that are differentially expressed in cluster one, or five, or eight. We can also adjust the filter for, say, the p-value or the adjusted p-value, or the percentage difference between our target group and the comparison group. This TSV file can also be manipulated in any spreadsheet software, such as Excel, so it is quite easy to work with in the tool of your choice.
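The same kind of filtering can be done outside VisR with a short script. The sketch below is only illustrative: the column names (`gene`, `cluster`, `p_val_adj`, `pct.1`, `pct.2`) are assumptions about the file layout, and the rows are invented toy values, not results from the demo.

```python
import csv
import io

# Toy stand-in for the differential expression TSV; all values are made up.
EXAMPLE_TSV = "\n".join(
    "\t".join(fields)
    for fields in [
        ["gene", "cluster", "p_val_adj", "pct.1", "pct.2"],
        ["GENE_A", "1", "0.001", "0.90", "0.10"],
        ["GENE_B", "1", "0.200", "0.80", "0.30"],
        ["GENE_C", "1", "0.010", "0.50", "0.40"],
        ["GENE_D", "5", "0.0001", "0.95", "0.05"],
    ]
)


def filter_markers(tsv_text, cluster, max_adj_p=0.05, min_pct_diff=0.0):
    """Return the genes of one cluster that pass the two thresholds."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        if row["cluster"] != str(cluster):
            continue  # keep only the requested cluster
        if float(row["p_val_adj"]) > max_adj_p:
            continue  # adjusted p-value too large
        # Difference in the fraction of expressing cells: target minus comparison.
        if float(row["pct.1"]) - float(row["pct.2"]) < min_pct_diff:
            continue
        kept.append(row["gene"])
    return kept
```

For a real file, the text would come from `open(path).read()` instead of `EXAMPLE_TSV`, with the column names adjusted to whatever the file actually contains.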

We also want to introduce UpSet. I think we introduced it in our last session about the Cell Ranger R app; we have added some new features to it, so I'd like to show it here. UpSet is an alternative to the Venn diagram that helps us view the sizes of the intersections between different sets. With a Venn diagram, when the number of sets exceeds, say, three, it becomes very messy to visualize; UpSet can provide a much clearer view. For example, in this plot, the first part shows the number of genes that are differentially expressed in each cluster only; that is, the number of genes that are differentially expressed in this cluster but not in any other cluster. The following ones show the number of genes that are differentially expressed in exactly these two clusters.

To generate such a plot, we search for UpSet and drag it to the middle. We can drag the table to the middle as well. Then I can choose the input type, which is the format in which the genes and the clusters are laid out. I provide an example here. For the Seurat output, if we choose to perform differential expression on clusters, the output is a table in this format, where one column holds the gene names and another column contains their corresponding clusters. We call this the ID/set combination format. The other format, which is the Cell Ranger R app's output for differential expression, is something like a truth table, where each row is a gene and each column is a cluster. If the entry is one, it means the gene is differentially expressed in this cluster; if it's zero, it means the gene is not differentially expressed in this cluster. In this case we use the output from Seurat, so we choose the ID/set combination as the input type; the ID column will be the genes and the sets are the clusters.
I'll also choose to output a binary encoded table, just in case you want an alternative view of the data.
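The relationship between the two input formats can be shown with a small conversion sketch. It takes (gene, cluster) pairs, as in the ID/set combination format, and builds the truth-table style binary encoding. The function name and the toy gene and cluster labels are made up for illustration; this is not the app's own code.

```python
def to_binary_table(pairs):
    """Convert (gene, cluster) pairs into a gene-by-cluster 0/1 table.

    Returns the sorted cluster names (the column order) and a dict mapping
    each gene to its row of 0/1 entries.
    """
    clusters = sorted({cluster for _, cluster in pairs})
    membership = {}
    for gene, cluster in pairs:
        membership.setdefault(gene, set()).add(cluster)
    table = {
        gene: [1 if cluster in members else 0 for cluster in clusters]
        for gene, members in membership.items()
    }
    return clusters, table


# Toy input: GENE_A is differentially expressed in clusters 1 and 5.
TOY_PAIRS = [
    ("GENE_A", "1"),
    ("GENE_A", "5"),
    ("GENE_B", "1"),
    ("GENE_C", "8"),
]
```

With `TOY_PAIRS`, the columns come out as clusters 1, 5 and 8, and the row for `GENE_A` is `[1, 1, 0]`.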

After running, we get this result. Currently the maximum number of intersections being plotted is 40; we can adjust this value as needed. We can also choose a different way of ordering these intersections. Currently it orders by degree first, so the intersections that involve one set only come first, then the intersections between two sets only, and so on. If we order by frequency rather than by degree, they are ordered by the count in each intersection.

Oh, sorry. This axis represents the sets, or in this case the clusters. So here it's cluster 2; the text is a little bit small. Oh yeah, that's the axis option. These are the clusters generated from the clustering step, so we can see clusters 2, 0, 4, 3, 8, 7, 6, 5. I should mention that this bar represents the count of all the differentially expressed genes in this cluster. These are all similar; for example, this is the number of differentially expressed genes in cluster 8. For the intersections, this one means the number of genes that are differentially expressed only in clusters 1 and 5, but not in any of the other clusters. So it's a strict intersection. Here we can also see that there are three dots, which represents the number of genes that are differentially expressed in these three clusters but not in any of the other clusters. The bars here represent the number of genes that are differentially expressed in, say, cluster 2, without considering whether they are differentially expressed in other clusters or not. Does that explain it? Okay, thanks.

So this is the UpSet plot. I think it could be a useful way of visualizing the relationship between different clusters. If, say, there are a lot of shared differentially expressed genes between two clusters, then maybe we can say that these two clusters are very closely related.
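The difference between the strict intersection counts and the per-cluster totals can be made concrete with a short sketch. It assumes (gene, cluster) pairs as input; the function names and toy data are invented for illustration, and the tie-breaking rules in the ordering are an arbitrary choice rather than the app's documented behavior.

```python
from collections import Counter


def exclusive_intersections(pairs):
    """Count genes by the exact set of clusters they are reported in."""
    membership = {}
    for gene, cluster in pairs:
        membership.setdefault(gene, set()).add(cluster)
    return Counter(frozenset(members) for members in membership.values())


def set_sizes(pairs):
    """Count genes per cluster, ignoring membership in other clusters."""
    return Counter(cluster for _, cluster in set(pairs))


def order_intersections(counts, by="degree", max_intersections=40):
    """Order intersections by degree (number of sets) or by frequency."""
    if by == "degree":
        key = lambda item: (len(item[0]), -item[1], sorted(item[0]))
    else:
        key = lambda item: (-item[1], len(item[0]), sorted(item[0]))
    return sorted(counts.items(), key=key)[:max_intersections]


# Toy data: GENE_A and GENE_D are shared by clusters 1 and 5 only.
TOY_PAIRS = [
    ("GENE_A", "1"), ("GENE_A", "5"),
    ("GENE_B", "1"),
    ("GENE_C", "5"),
    ("GENE_D", "1"), ("GENE_D", "5"),
    ("GENE_E", "8"),
]
```

Here cluster 1 has a total of three genes, but only one of them is exclusive to cluster 1; the other two are counted in the strict intersection of clusters 1 and 5.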

Or maybe we should even put them into the same cluster. So this is all for the differential expression, and with that we have finished the general workflow for single-cell analysis.

In the following parts we have some additional options. One of them is rename the clusters. Maybe after we finish the differential expression analysis we already know the identity of some clusters and want to rename them to the corresponding cell types. We can use this option and provide a list of the current cluster IDs, which are usually numbers from 0 up to however many clusters there are, and rename them by cell type. After that, in the output you can see that clusters 3 and 4 have been renamed to B cells and CD8 T cells.

Another option we provide is visualize the genes. There are five gene visualization functions provided by Seurat, so we put them all as options here. We can provide a list of genes that we want to visualize; maybe they are the marker genes that we found during the differential expression analysis step. We put a comma-separated list of genes here, and the following section will be expanded to show the five different plotting options. The first is the scatter plot, where we plot the cells on the t-SNE projections, colored by their expression. Then there is the violin plot, which is useful for visualizing the distribution of the gene's expression within each cluster. We also have a dot plot, where each dot is colored by the average expression within the cluster and the size of the dot represents the percentage of cells within the cluster that express this gene. We also have a ridge plot, which is very similar to the violin plot, just flipped to become horizontal. And we have a heatmap, where the cells are grouped by their clusters and we can visualize the expression of each gene by color.
So this is all for the first workflow that we'll show, which is the analysis of one data set. The following part will show the workflow for two data sets.
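The two quantities a dot plot encodes, as described above, can be computed by hand for one gene. The sketch below is a plain-Python illustration with invented toy values; it treats any value above zero as "expressed" and takes a simple mean, which is an assumption and may differ from how Seurat averages and scales expression internally.

```python
from collections import defaultdict


def dot_plot_stats(expression, clusters):
    """Per cluster: (average expression, percent of cells expressing the gene).

    `expression` holds one value per cell for a single gene, and `clusters`
    holds the cluster label of the same cells, in the same order.
    """
    if len(expression) != len(clusters):
        raise ValueError("expression and clusters must have the same length")
    grouped = defaultdict(list)
    for value, cluster in zip(expression, clusters):
        grouped[cluster].append(value)
    stats = {}
    for cluster, values in grouped.items():
        mean_expression = sum(values) / len(values)           # dot color
        expressing = sum(1 for value in values if value > 0)  # cells with signal
        stats[cluster] = (mean_expression, 100.0 * expressing / len(values))
    return stats


# Toy data: four cells labelled "B cells" and two labelled "CD8 T cells".
TOY_EXPRESSION = [0.0, 2.0, 4.0, 2.0, 0.0, 0.0]
TOY_CLUSTERS = ["B cells"] * 4 + ["CD8 T cells"] * 2
```

With the toy data, the "B cells" dot would have an average expression of 2.0 and a size corresponding to 75 percent of cells, while the "CD8 T cells" dot would be at zero for both.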

Actually, there's one more thing. After we run the visualization of gene expression step, we'll see that the gene expression values have been added to the cells.csv file that we have seen in the output folder. So we can try to load that file into VisR. As an example, we can again check this file inside the table view app in VisR. I see that now, since this Seurat object already contains the PCA results, the t-SNE results and the clustering information, the file includes all these columns as well as the expression values of the genes that we selected to plot. If we want to do a customized plot, more specifically a customized scatter plot, we can use another built-in Java app in VisR. Search for scatter and you'll see the scatter plot here. Drag the table in, and here we can select which axes we want to have. We'll put, say, the two t-SNE projections: t-SNE 2 here and t-SNE 1 here. So now these are our t-SNE projections, and we can color the cells by, say, their cluster ID, or alternatively by the gene expression level, and also change the colors here. This is all for the visualization, as also shown on the slides.

So this is all for the workflow for one single data set. Next we can quickly go through the integrated analysis workflow. We will use this workflow if we have two data sets that consist of mostly very similar cell types but have undergone different experimental conditions. To do that, we want to merge the two data sets together and find the common cell types between the two data sets. The first step is to merge the two data sets. Here we have the option of loading two different Seurat objects. Each of them should already contain the variable genes information. If we don't have that, we can go to the first workflow to generate the variable genes information: we go here and click on find variable genes.

Then we get a Seurat object containing the variable genes. Once we have these objects we can load them separately into the second workflow, and the output will be a merged Seurat object containing the data from both input objects.

Simply merging the two Seurat objects is really not enough for this integrated analysis. As we can see in the example here, these cells are plotted on the t-SNE space generated from the PCA results, and the cells from the two data sets are actually very well separated already. That means the batch effect will become a driving force in the clustering: instead of being clustered based on cell type, the cells will be clustered first by their experimental conditions, which is not what we want. So in order to remove the batch effect, the Seurat workflow provides the option to use CCA instead of PCA for dimensionality reduction. This is the output of the CCA results, where the cells are plotted on CC1 versus CC2. To choose the number of CCs for the subsequent steps, we use a bicor saturation plot, which is very similar to a scree plot: we want to find the elbow, or where the line starts to level off. Then we align the subspaces using the chosen CCs. These steps are very similar to the previous workflow for one data set; we also cluster based on the CCA results, and here are the clusters.

The additional option provided by this workflow is to find the genes that are differentially expressed across the experimental conditions, that is, between the two data sets. We can specify a single cluster here, and it will find the genes that are differentially expressed between the cells from the two different experimental conditions within the same cluster. So that's all for today, and thank you very much for coming.
I think one of the main ... and then VisR also gives you some options: if you know R, you can bypass ... And the other nice thing is that the VisR platform gives you an interactive workspace to actually plot data together with the R tools. Hopefully in the next month or so there will be the third installment of this VisR presentation, looking at Monocle. But a big part is also getting feedback. The last demo went over the ... Yeah, so here, under help, you can report bugs or suggest features. It's a survey form that you can fill out so that we can see it. And you may have noticed we were also recording, so we'll try to make these resources, the slides and the video, available for you to reference when you actually start getting data, because probably the best way to learn this stuff is to actually try to put data through, play with it, and see what you can do.

... Okay, feedback on that for sure. There are lots of parameters and so on, like the resolution, the scaling, everything else. That's also why we break this into distinct steps, so that after you perform one step you can adjust the parameters until you are satisfied, and still keep the information from the previous run. We'll hang out for a little bit.