This is the last exercise set of R language learning. This paper mainly introduces how to use R language for data visualization to help us intuitively see the meaning of data. Drawing functions are so changeable that it is impossible to write down all the functions. To be proficient in using help documents, you won't turn over the code at any time. It takes a long time to master r drawing.
Original title: http://www.bio-info-trainee.com/4387.html
Reference answer: https://www.jianshu.com/p/fab27c63af94
Reference answer: https://www.jianshu.com/p/8fce9d2ad562
1, Basic drawing
# Prepare data rm(list = ls()) options(stringsAsFactors = F) library(airway) data("airway") airway RNAseq_expr <- assay(airway) dim(RNAseq_expr) colnames(RNAseq_expr) RNAseq_expr[1:4,1:4] RNAseq_gl <- colData(airway)[,3] table(RNAseq_gl)
1. For rnaseq_ Plot boxplot for each column of expr
boxplot(RNAseq_expr)
2. For rnaseq_ Draw the density graph for each column of expr
# Remove useless values (data with column sum less than or equal to 1) e1 <- RNAseq_expr[apply(RNAseq_expr, 1, function(x) sum(x>0)>1), ] dim(RNAseq_expr) # [1] 64102 8 dim(e1) # [1] 28877 8 plot(density(RNAseq_expr)) plot(density(e1))
3. For rnaseq_ Draw a bar chart for each column of expr
# An unprocessed picture can be taken at a glance. It's meaningless barplot(RNAseq_expr)
4. For rnaseq_ After taking log2 for each column of expr, re draw boxplot, density chart and bar chart
e2 <- log2(e1+1) # The data drawing after log2 looks much more comfortable # When the source data value is large and the difference is large, consider log boxplot(e2) plot(density(e2)) barplot(e2)
5. Add trt and untrt group colors to the three diagrams of Q4 to distinguish them
# Box diagram boxplot(e2, main = 'Boxplot of RNAseq-expr', xlab = 'samples',ylab = 'expression',col = RNAseq_gl) # Density map # Generates a list of current drawing parameters that can be modified opar <- par(no.readonly=T) par(mfrow = c(3,3)) for (i in c(1:8)) { plot(density(e2[,i]), col=as.integer(RNAseq_gl)[i], main = paste("Density", i)) } # Resets the parameter to its previous value par(opar) # If you accidentally modify par(), restart RStudio to restore the default value # histogram barplot(e2, main = 'Barplot of RNAseq-expr', xlab = 'samples',ylab = 'expression', border = NA, col = RNAseq_gl) # At this time, the picture is very strange. Take a subset to see what happens e3 <- e2[1:10,] barplot(e3, main = 'Barplot of RNAseq-expr', xlab = 'samples',ylab = 'expression', border = NA, col = RNAseq_gl)
As you can see, the desired result is that the color of each column changes in turn according to the grouping, but the color of the graph changes in turn in each column
The reason is that if the data in the barplot is a matrix and the best is FALSE (default), each column in the figure is stacked one by one, so the color is given one by one
The explanation is a little messy. Attach the original text
barplot(height, ...)
height: either a vector or matrix of values describing the bars which make up the plot. If height is a vector, the plot consists of a sequence of rectangular bars with heights given by the values in the vector. If height is a matrix and beside is FALSE then each bar of the plot corresponds to a column of height, with the values in the column giving the heights of stacked sub-bars making up the bar. If height is a matrix and beside is TRUE, then the values in each column are juxtaposed rather than stacked.
6. For rnaseq_ The first two columns of expr draw a scatter diagram and calculate the linear regression equation
e4 <- as.data.frame(e2) # 'data' must be a data.frame, not a matrix or an array fit <- lm(e4[,1] ~ e4[,2], data = e4) fit # Call: # lm(formula = e4[, 1] ~ e4[, 2], data = e4) # # Coefficients: # (Intercept) e4[, 2] # 0.2105 0.9868 # The difference between the original data is large, and the data processed with log2 plot(RNAseq_expr[,1:2]) plot(e2[,1:2]) abline(fit, col = "red")
7. For rnaseq_ The correlation coefficient is calculated between all columns of expr, and the heat map is visualized
M <- cor(e2) pheatmap::pheatmap(M)
8. Take RNAseq_expr the first line of expression to draw a line graph
plot(e2[1,], type="b", xlab = "gene", ylab="expression", col="red") # type="b" is the most common line chart # type p Only point l Only line o Solid points and lines (that is, lines cover points) b,c Line connection point( c (do not draw points when) s,S Step line h Vertical line of the square schema n No points and lines are generated (usually used to create axes for later commands)
9. Take rnaseq_ The lines of the 10 genes with the highest expr expression were drawn in a multi line broken line diagram
top10 <- e2[names(tail(sort(rowSums(e2)), 10)), ] top10 # Set the interval of abscissa and ordinate library(reshape2) yrange <- range(melt(top10)[,3]) yrange # [1] 16.15977 18.97075 yrange <- c(16,19) xrange <- c(1,8) # mapping # The plot() function creates a new graph when called # The lines() function adds information to an existing graph and cannot generate a graph by itself. # Therefore, the lines() function is usually called after the plot() function generates a graph. plot(xrange, yrange, type="n", xlab = "gene", ylab="expression") for(i in c(1:10)){ lines(top10[i,], type="b", xlab = "gene", ylab="expression", pch = i) }
10. Line by line operation
https://github.com/jmzeng1314/5years/blob/master/learn-R/tasks/2-chunjuan-600.R code
The code is complete and runs smoothly.
2, Gglot drawing
# Geometric functions commonly used in ggplot2 geom_bar() Bar chart color,fill,alpha geom_boxplot() Box diagram color,fill,alpha,notch,width geom_density() Density map color,fill,alpha,linetype geom_histogram() histogram color,fill,alpha,linetype,binwidth geom_hline() level color,alpha,linetype,size geom_jitter() Jitter point color,size,alpha,shape geom_line() Line diagram colorvalpha,linetype,size geom_point() Scatter diagram color,alpha,shape,size geom_rug() Carpet map color,side geom_smooth() Fitting curve method,formula,color,fill,linetype,size geom_text() There are many text annotations. See the "help" of the function geom_violin() Violin picture color,fill,alpha,linetype geom_vline() vertical color,alpha,linetype,size # Common options for geometric functions color Shade the boundaries of points, lines, and filled areas fill Shade filled areas, such as bars and density areas alpha The transparency of the color, from 0 (fully transparent) to 1 (opaque). linetype Line of pattern (1)=Solid line, 2=Dotted line, 3=Point, 4=Dot dash, 5=Long dash, 6=Double dash) size The size of the point and the width of the line shape The shape of the point (and pch Same, 0=Open square, 1=Open circle, 2=Open triangles, etc.), see Figure 3-4 position Draw the location of objects such as bars and points. For a bar chart,"dodge"Group bars side by side,"stacked"Stacked grouped bars,"fill"Stack grouped bars vertically and make them equal in height. For this point,"jitter"Reduce point overlap binwidth Width of histogram notch Indicates whether the block diagram should be a gap( TRUE/FALSE) sides Placement of carpet drawings("b"=Bottom,"l"=Left,"t"=Top,"r"=Right,"bl"=Lower left, etc.) width Width of box diagram
1. Rewrite the Q1-5 exercise of the above basic drawing with ggplot code
# Data preparation rm(list = ls()) options(stringsAsFactors = F) library(airway) data("airway") airway RNAseq_expr <- assay(airway) RNAseq_gl <- colData(airway)[,3] e1 <- RNAseq_expr[apply(RNAseq_expr, 1, function(x) sum(x>0)>1), ] e2 <- log2(e1+1) # When drawing with ggplot, the data should be in the format of data.frame library(reshape2) me2 <- melt(e2) colnames(me2) <- c("gene", "sample", "expression") tmp <- data.frame(group_list=RNAseq_gl) rownames(tmp) <- colnames(RNAseq_expr) tmp$sample <- rownames(tmp) e3 <- merge(me2, tmp, by="sample") group <- as.data.frame(colData(airway)[,c(3,5)]) group # Question 5 contains the previous four questions ### 5 ### library(ggplot2) # Box diagram ggplot(e3, aes(sample, expression, fill = group_list)) + geom_boxplot() # Density map # Group according to sample ggplot(e3, aes(expression, color = sample)) + geom_density() # Group according to trt and untrt ggplot(e3, aes(expression, color = group_list)) + geom_density() # Bar chart # geom_bar() uses stat_count() by default: it counts the number of cases at each x position. # geom_col() uses stat_identity(): it leaves the data as is. ggplot(e3, aes(sample, expression, fill = group_list)) + geom_bar(stat="identity")
2. Rewrite the Q6-9 exercise of the above basic drawing with ggplot code
### 6 ### ggplot(as.data.frame(e2[, 1:2]), aes(x = SRR1039508, y = SRR1039509)) + geom_point() + geom_smooth(method = "lm") ### 7 ### # Like a heat map, but not exactly a doge M <- cor(e2) meltM <- melt(M) # If you want to draw arbitrary rectangles, use geom_tile() or geom_rect() ggplot(meltM, aes(x = Var1, y = Var2, fill = value)) + geom_tile() ### 8 ### # Line chart e4 <- data.frame(expression = e2[1, ]) e4$sample <- rownames(e4) ggplot(e4, aes(x = sample, y = expression, group = 1)) + geom_line() + geom_point() ### 9 ### top10 <- e2[names(tail(sort(rowSums(e2)), 10)), ] top10 <- melt(top10) colnames(top10) <- c("gene", "sample", "expression") ggplot(top10, aes(x = sample, y = expression, color = gene, group = gene)) + geom_line() + geom_point()
Explanation to questions 8 and 9:
For line graphs, the data points must be grouped so that it knows which points to connect. In this case, it is simple – all points should be connected, so group=1. When more variables are used and multiple lines are drawn, the grouping for lines is usually done by variable.
Q10: line by line operation: http://biotrainee.com/jmzeng/markdown/ggplot-in-R.html code
3, Bioinformatics mapping
Need reference https://github.com/jmzeng1314/GEO/blob/master/airway_RNAseq/DEG_rnsseq.R
1. Line by line operation:
https://github.com/jmzeng1314/5years/blob/master/learn-R/tasks/top50ggplot.Rmd
The code is smooth except that several data links fail.
2. Select the expression matrix of 100 genes with the largest MAD value for RNAseq_expr and draw the heat map
# Take the 100 gene names with the largest mad value top100_mad <- names(tail(sort(apply(e1, 1, mad)), 100)) # top100_mad # Data standardization refers to the value minus the mean and then divided by the standard deviation z_score <- t(scale(t(e2))) # Take top100 matrix top100 <- z_score[rownames(z_score) %in% top100_mad,] pheatmap::pheatmap(top100)
3. Principal component analysis and mapping of RNAseq_expr
# The PCA diagram shall be drawn using the z-score matrix library(ggplot2) library(ggfortify) dat <- z_score df <- as.data.frame(t(dat)) # Add a column to facilitate group drawing group_list <- RNAseq_gl df$group <- group_list autoplot(prcomp(df[,1:(ncol(df)-1)]), data = df, colour = 'group') + theme_bw() # install.packages("FactoMineR") # install.packages("factoextra") library("FactoMineR") library("factoextra") # Reset df df <- as.data.frame(t(dat)) # ?PCA # graph: boolean, if TRUE a graph is displayed dat.pca <- PCA(df, graph = FALSE) # ?fviz_pca_ind fviz_pca_ind(dat.pca, geom.ind = "point", col.ind = group_list, addEllipses = TRUE, legend.title = "Groups")
4. Analyze the difference of RNAseq_expr and draw volcano map
# Difference analysis # BiocManager::install("DESeq2") suppressMessages(library(DESeq2)) colData <- data.frame(row.names = colnames(RNAseq_expr), group_list = group_list) # ?DESeqDataSetFromMatrix() # Rows of colData correspond to columns of countData dds <- DESeqDataSetFromMatrix(countData = RNAseq_expr, colData = colData, design = ~ group_list) # ?DESeq dds <- DESeq(dds) res <- results(dds, contrast=c("group_list","trt","untrt")) resOrdered <- res[order(res$padj),] head(resOrdered) # output # log2 fold change (MLE): group_list trt vs untrt Wald test p-value: group_list trt vs untrt DataFrame with 6 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> ENSG00000152583 997.440 4.60253 0.2117708 21.7335 9.89036e-105 1.83911e-100 ENSG00000148175 11193.719 1.45147 0.0848249 17.1113 1.22198e-65 1.13614e-61 ENSG00000179094 776.597 3.18386 0.2015154 15.7996 3.13247e-56 1.94161e-52 ENSG00000134686 2737.982 1.38714 0.0915842 15.1461 8.04404e-52 3.73947e-48 ENSG00000125148 3656.253 2.20344 0.1474087 14.9478 1.60924e-50 5.98476e-47 ENSG00000120129 3409.029 2.94898 0.2016136 14.6269 1.89198e-48 5.86358e-45 # Map volcanoes DEG <- as.data.frame(resOrdered) nrDEG <- na.omit(DEG) DEseq_DEG <- nrDEG nrDEG <- DEseq_DEG[,c(2,6)] colnames(nrDEG) <- c('log2FoldChange','pvalue') logFC_cutoff <- with(nrDEG,mean(abs(log2FoldChange)) + 2*sd(abs( log2FoldChange))) # &Compare the corresponding elements of the two vectors in turn, while & & only compare the first element of the two vectors nrDEG$change <- as.factor(ifelse(nrDEG$pvalue < 0.05 & abs(nrDEG$log2FoldChange) > logFC_cutoff, ifelse(nrDEG$log2FoldChange > logFC_cutoff ,'UP','DOWN'),'NOT')) this_title <- paste0('Cutoff for logFC is ',round(logFC_cutoff,3), '\nThe number of up gene is ',nrow(nrDEG[nrDEG$change =='UP',]) , '\nThe number of down gene is ',nrow(nrDEG[nrDEG$change =='DOWN',])) volcano <- ggplot(data=nrDEG, aes(x=log2FoldChange, y=-log10(pvalue), color=change)) + geom_point(alpha=0.4, size=1.75) + xlab("log2 fold change") + ylab("-log10 p-value") + ggtitle(this_title) + theme(plot.title = element_text(size=15,hjust = 0.5)) + scale_colour_manual(values = c('blue','black','red')) volcano
5. Analyze the difference of RNAseq_expr and draw the (mean value VS change multiple) diagram
plotMA(res,ylim=c(-5,5)) # Wrong report # 'coef' should specify same coefficient as in results 'res' resLFC <- lfcShrink(dds,coef = 2,res=res) plotMA(resLFC, ylim=c(-5,5))
I don't understand this part of knowledge yet. I suggest you look at the analysis of reference answers.
- The expression of one differential gene in the two groups was plotted boxplot, and the statistical significance index was added
- Obtain the chromosome information of all RNAseq_expr genes through the org.Hs.eg.db package, and draw the bar graph of gene number of chromosomes
- The bar graph of the number of genes on the upper chromosome is superimposed with the bar graph of the number of differential genes
- Get the expression of GUL5 gene in BRCA data set and patient survival data in oncolnc web tool, and draw the survival analysis map locally
- Get the expression of GUL5 gene in BRCA dataset and PAM50 classification of patients in xena web tool, and draw the classified boxplot