Student letter learning - 30 visual exercises based on R (with detailed answer interpretation)


This is the last exercise set of R language learning. This paper mainly introduces how to use R language for data visualization to help us intuitively see the meaning of data. Drawing functions are so changeable that it is impossible to write down all the functions. To be proficient in using help documents, you won't turn over the code at any time. It takes a long time to master r drawing.

Original title: http://www.bio-info-trainee.com/4387.html
Reference answer: https://www.jianshu.com/p/fab27c63af94
Reference answer: https://www.jianshu.com/p/8fce9d2ad562

1, Basic drawing

# Prepare data
rm(list = ls())
options(stringsAsFactors = F)

library(airway)
data("airway")
airway 

RNAseq_expr <- assay(airway)
dim(RNAseq_expr)
colnames(RNAseq_expr) 
RNAseq_expr[1:4,1:4] 

RNAseq_gl <- colData(airway)[,3]
table(RNAseq_gl)

1. For rnaseq_ Plot boxplot for each column of expr

boxplot(RNAseq_expr)



2. For rnaseq_ Draw the density graph for each column of expr

# Remove useless values (data with column sum less than or equal to 1)
e1 <- RNAseq_expr[apply(RNAseq_expr, 1, function(x) sum(x>0)>1), ] 

dim(RNAseq_expr)
# [1] 64102     8
dim(e1)
# [1] 28877     8
                        
plot(density(RNAseq_expr))
plot(density(e1))                        






3. For rnaseq_ Draw a bar chart for each column of expr

# An unprocessed picture can be taken at a glance. It's meaningless
barplot(RNAseq_expr)




4. For rnaseq_ After taking log2 for each column of expr, re draw boxplot, density chart and bar chart

e2 <- log2(e1+1)

# The data drawing after log2 looks much more comfortable
# When the source data value is large and the difference is large, consider log
boxplot(e2)
plot(density(e2))
barplot(e2)








5. Add trt and untrt group colors to the three diagrams of Q4 to distinguish them

# Box diagram
boxplot(e2, main = 'Boxplot of RNAseq-expr',
        xlab = 'samples',ylab = 'expression',col = RNAseq_gl)

# Density map
# Generates a list of current drawing parameters that can be modified
opar <- par(no.readonly=T)
par(mfrow = c(3,3))
for (i in c(1:8)) {
  plot(density(e2[,i]), col=as.integer(RNAseq_gl)[i], main = paste("Density", i))
}
# Resets the parameter to its previous value
par(opar)

# If you accidentally modify par(), restart RStudio to restore the default value

# histogram
barplot(e2, main = 'Barplot of RNAseq-expr',
        xlab = 'samples',ylab = 'expression', border = NA, col = RNAseq_gl)

# At this time, the picture is very strange. Take a subset to see what happens
e3 <- e2[1:10,]
barplot(e3, main = 'Barplot of RNAseq-expr',
        xlab = 'samples',ylab = 'expression', border = NA, col = RNAseq_gl)

As you can see, the desired result is that the color of each column changes in turn according to the grouping, but the color of the graph changes in turn in each column
The reason is that if the data in the barplot is a matrix and the best is FALSE (default), each column in the figure is stacked one by one, so the color is given one by one

The explanation is a little messy. Attach the original text
barplot(height, ...)
height: either a vector or matrix of values describing the bars which make up the plot. If height is a vector, the plot consists of a sequence of rectangular bars with heights given by the values in the vector. If height is a matrix and beside is FALSE then each bar of the plot corresponds to a column of height, with the values in the column giving the heights of stacked sub-bars making up the bar. If height is a matrix and beside is TRUE, then the values in each column are juxtaposed rather than stacked.










6. For rnaseq_ The first two columns of expr draw a scatter diagram and calculate the linear regression equation

e4 <- as.data.frame(e2)
# 'data' must be a data.frame, not a matrix or an array
fit <- lm(e4[,1] ~ e4[,2], data = e4)
fit

# Call:
# lm(formula = e4[, 1] ~ e4[, 2], data = e4)
# 
# Coefficients:
# (Intercept)      e4[, 2]  
#      0.2105       0.9868 

# The difference between the original data is large, and the data processed with log2
plot(RNAseq_expr[,1:2])
plot(e2[,1:2])

abline(fit, col = "red")






7. For rnaseq_ The correlation coefficient is calculated between all columns of expr, and the heat map is visualized

M <- cor(e2)
pheatmap::pheatmap(M)




8. Take RNAseq_expr the first line of expression to draw a line graph

plot(e2[1,], type="b", xlab = "gene", ylab="expression", col="red")

# type="b" is the most common line chart
# type
p     Only point
l     Only line
o     Solid points and lines (that is, lines cover points)
b,c  Line connection point( c (do not draw points when)
s,S  Step line
h     Vertical line of the square schema
n     No points and lines are generated (usually used to create axes for later commands)




9. Take rnaseq_ The lines of the 10 genes with the highest expr expression were drawn in a multi line broken line diagram

top10 <- e2[names(tail(sort(rowSums(e2)), 10)), ]
top10

# Set the interval of abscissa and ordinate
library(reshape2)
yrange <- range(melt(top10)[,3])
yrange 
# [1] 16.15977 18.97075
yrange <- c(16,19)
xrange <- c(1,8)

# mapping
# The plot() function creates a new graph when called
# The lines() function adds information to an existing graph and cannot generate a graph by itself.
# Therefore, the lines() function is usually called after the plot() function generates a graph.
plot(xrange, yrange, type="n", xlab = "gene", ylab="expression")
for(i in c(1:10)){
  lines(top10[i,], type="b", xlab = "gene", ylab="expression", pch = i)
}




10. Line by line operation

https://github.com/jmzeng1314/5years/blob/master/learn-R/tasks/2-chunjuan-600.R code

The code is complete and runs smoothly.


2, Gglot drawing

# Geometric functions commonly used in ggplot2
geom_bar() Bar chart color,fill,alpha 
geom_boxplot() Box diagram color,fill,alpha,notch,width 
geom_density() Density map color,fill,alpha,linetype 
geom_histogram() histogram color,fill,alpha,linetype,binwidth 
geom_hline() level color,alpha,linetype,size 
geom_jitter() Jitter point color,size,alpha,shape 
geom_line() Line diagram colorvalpha,linetype,size 
geom_point() Scatter diagram color,alpha,shape,size 
geom_rug() Carpet map color,side 
geom_smooth() Fitting curve method,formula,color,fill,linetype,size 
geom_text() There are many text annotations. See the "help" of the function
geom_violin() Violin picture color,fill,alpha,linetype 
geom_vline() vertical color,alpha,linetype,size 


# Common options for geometric functions
color Shade the boundaries of points, lines, and filled areas
fill Shade filled areas, such as bars and density areas
alpha The transparency of the color, from 0 (fully transparent) to 1 (opaque).
linetype Line of pattern (1)=Solid line, 2=Dotted line, 3=Point, 4=Dot dash, 5=Long dash, 6=Double dash)
size The size of the point and the width of the line
shape The shape of the point (and pch Same, 0=Open square, 1=Open circle, 2=Open triangles, etc.), see Figure 3-4 
position Draw the location of objects such as bars and points. For a bar chart,"dodge"Group bars side by side,"stacked"Stacked grouped bars,"fill"Stack grouped bars vertically and make them equal in height. For this point,"jitter"Reduce point overlap
binwidth Width of histogram
notch Indicates whether the block diagram should be a gap( TRUE/FALSE)
sides Placement of carpet drawings("b"=Bottom,"l"=Left,"t"=Top,"r"=Right,"bl"=Lower left, etc.)
width Width of box diagram

1. Rewrite the Q1-5 exercise of the above basic drawing with ggplot code

# Data preparation
rm(list = ls())
options(stringsAsFactors = F)

library(airway)
data("airway")
airway 

RNAseq_expr <- assay(airway)
RNAseq_gl <- colData(airway)[,3]

e1 <- RNAseq_expr[apply(RNAseq_expr, 1, function(x) sum(x>0)>1), ] 
e2 <- log2(e1+1)

# When drawing with ggplot, the data should be in the format of data.frame                        
library(reshape2)
me2 <- melt(e2)
colnames(me2) <- c("gene", "sample", "expression")
tmp <- data.frame(group_list=RNAseq_gl)
rownames(tmp) <- colnames(RNAseq_expr)
tmp$sample <- rownames(tmp)
e3 <- merge(me2, tmp, by="sample")

group <- as.data.frame(colData(airway)[,c(3,5)])
group

                        
# Question 5 contains the previous four questions                        
### 5 ###
library(ggplot2)

# Box diagram                        
ggplot(e3, aes(sample, expression, fill = group_list)) + geom_boxplot()

# Density map
# Group according to sample                        
ggplot(e3, aes(expression, color = sample)) + geom_density()
# Group according to trt and untrt                        
ggplot(e3, aes(expression, color = group_list)) + geom_density()

# Bar chart                      
# geom_bar() uses stat_count() by default: it counts the number of cases at each x position. 
# geom_col() uses stat_identity(): it leaves the data as is.
ggplot(e3, aes(sample, expression, fill = group_list)) + geom_bar(stat="identity")









2. Rewrite the Q6-9 exercise of the above basic drawing with ggplot code

### 6 ###
ggplot(as.data.frame(e2[, 1:2]), aes(x = SRR1039508, y = SRR1039509)) + geom_point() + geom_smooth(method = "lm")



### 7 ###
# Like a heat map, but not exactly a doge
M <- cor(e2)
meltM <- melt(M)
# If you want to draw arbitrary rectangles, use geom_tile() or geom_rect()
ggplot(meltM, aes(x = Var1, y = Var2, fill = value)) + geom_tile()



### 8 ###
# Line chart
e4 <- data.frame(expression = e2[1, ])
e4$sample <- rownames(e4)

ggplot(e4, aes(x = sample, y = expression, group = 1)) + geom_line() + geom_point()



### 9 ###
top10 <- e2[names(tail(sort(rowSums(e2)), 10)), ]
top10 <- melt(top10)
colnames(top10) <- c("gene", "sample", "expression")

ggplot(top10, aes(x = sample, y = expression, color = gene, group = gene)) + geom_line() + geom_point()

Explanation to questions 8 and 9:
For line graphs, the data points must be grouped so that it knows which points to connect. In this case, it is simple – all points should be connected, so group=1. When more variables are used and multiple lines are drawn, the grouping for lines is usually done by variable.










Q10: line by line operation: http://biotrainee.com/jmzeng/markdown/ggplot-in-R.html code

3, Bioinformatics mapping

Need reference https://github.com/jmzeng1314/GEO/blob/master/airway_RNAseq/DEG_rnsseq.R

1. Line by line operation:

https://github.com/jmzeng1314/5years/blob/master/learn-R/tasks/top50ggplot.Rmd

The code is smooth except that several data links fail.

2. Select the expression matrix of 100 genes with the largest MAD value for RNAseq_expr and draw the heat map

# Take the 100 gene names with the largest mad value
top100_mad <- names(tail(sort(apply(e1, 1, mad)), 100))
# top100_mad

# Data standardization refers to the value minus the mean and then divided by the standard deviation
z_score <- t(scale(t(e2)))

# Take top100 matrix
top100 <- z_score[rownames(z_score) %in% top100_mad,]

pheatmap::pheatmap(top100)  




3. Principal component analysis and mapping of RNAseq_expr

# The PCA diagram shall be drawn using the z-score matrix
library(ggplot2)
library(ggfortify)

dat <- z_score
df <- as.data.frame(t(dat))

# Add a column to facilitate group drawing
group_list <- RNAseq_gl
df$group <- group_list

autoplot(prcomp(df[,1:(ncol(df)-1)]), data = df, colour = 'group') + theme_bw()


# install.packages("FactoMineR")
# install.packages("factoextra")
library("FactoMineR")
library("factoextra") 

# Reset df
df <- as.data.frame(t(dat))

# ?PCA
# graph: boolean, if TRUE a graph is displayed
dat.pca <- PCA(df, graph = FALSE)

# ?fviz_pca_ind
fviz_pca_ind(dat.pca, geom.ind = "point", col.ind = group_list, addEllipses = TRUE, 
             legend.title = "Groups")






4. Analyze the difference of RNAseq_expr and draw volcano map

# Difference analysis
# BiocManager::install("DESeq2")
suppressMessages(library(DESeq2))

colData <- data.frame(row.names = colnames(RNAseq_expr), group_list = group_list)

# ?DESeqDataSetFromMatrix()
# Rows of colData correspond to columns of countData
dds <- DESeqDataSetFromMatrix(countData = RNAseq_expr, colData = colData, design = ~ group_list)

# ?DESeq
dds <- DESeq(dds)
res <- results(dds, contrast=c("group_list","trt","untrt"))
resOrdered <- res[order(res$padj),]
head(resOrdered)

# output #
log2 fold change (MLE): group_list trt vs untrt 
Wald test p-value: group_list trt vs untrt 
DataFrame with 6 rows and 6 columns
                 baseMean log2FoldChange     lfcSE      stat       pvalue         padj
                <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
ENSG00000152583   997.440        4.60253 0.2117708   21.7335 9.89036e-105 1.83911e-100
ENSG00000148175 11193.719        1.45147 0.0848249   17.1113  1.22198e-65  1.13614e-61
ENSG00000179094   776.597        3.18386 0.2015154   15.7996  3.13247e-56  1.94161e-52
ENSG00000134686  2737.982        1.38714 0.0915842   15.1461  8.04404e-52  3.73947e-48
ENSG00000125148  3656.253        2.20344 0.1474087   14.9478  1.60924e-50  5.98476e-47
ENSG00000120129  3409.029        2.94898 0.2016136   14.6269  1.89198e-48  5.86358e-45


# Map volcanoes
DEG <- as.data.frame(resOrdered)
nrDEG <- na.omit(DEG)
DEseq_DEG <- nrDEG
nrDEG <- DEseq_DEG[,c(2,6)]
colnames(nrDEG) <- c('log2FoldChange','pvalue') 


logFC_cutoff <- with(nrDEG,mean(abs(log2FoldChange)) + 2*sd(abs( log2FoldChange)))

# &Compare the corresponding elements of the two vectors in turn, while & & only compare the first element of the two vectors
nrDEG$change <-  as.factor(ifelse(nrDEG$pvalue < 0.05 & abs(nrDEG$log2FoldChange) > logFC_cutoff, ifelse(nrDEG$log2FoldChange > logFC_cutoff ,'UP','DOWN'),'NOT'))

this_title <- paste0('Cutoff for logFC is ',round(logFC_cutoff,3),
                    '\nThe number of up gene is ',nrow(nrDEG[nrDEG$change =='UP',]) ,
                    '\nThe number of down gene is ',nrow(nrDEG[nrDEG$change =='DOWN',]))

volcano <- ggplot(data=nrDEG, aes(x=log2FoldChange, y=-log10(pvalue), color=change)) +
  geom_point(alpha=0.4, size=1.75) + xlab("log2 fold change") + ylab("-log10 p-value") +
  ggtitle(this_title) + theme(plot.title = element_text(size=15,hjust = 0.5)) +
  scale_colour_manual(values = c('blue','black','red'))

volcano




5. Analyze the difference of RNAseq_expr and draw the (mean value VS change multiple) diagram

plotMA(res,ylim=c(-5,5))

# Wrong report
# 'coef' should specify same coefficient as in results 'res'
resLFC <- lfcShrink(dds,coef = 2,res=res)

plotMA(resLFC, ylim=c(-5,5))




I don't understand this part of knowledge yet. I suggest you look at the analysis of reference answers.

  1. The expression of one differential gene in the two groups was plotted boxplot, and the statistical significance index was added
  2. Obtain the chromosome information of all RNAseq_expr genes through the org.Hs.eg.db package, and draw the bar graph of gene number of chromosomes
  3. The bar graph of the number of genes on the upper chromosome is superimposed with the bar graph of the number of differential genes
  4. Get the expression of GUL5 gene in BRCA data set and patient survival data in oncolnc web tool, and draw the survival analysis map locally
  5. Get the expression of GUL5 gene in BRCA dataset and PAM50 classification of patients in xena web tool, and draw the classified boxplot

Tags: R Language data visualization

Posted on Tue, 23 Nov 2021 18:21:59 -0500 by heyjohnlim