Next Generation Sequencing RNAseq Data Analysis (Part 2)

Analysis of library size, Normalization and DGE analysis with edgeR

Previous article

In the first article, we obtained NGS data and made the necessary preparations for further analysis. This involved processing and organizing the data in a way that would facilitate downstream analysis. Additionally, we included annotations that provide valuable context and information for the interpretation of the results.

Introduction

There are many ways to perform analysis of differential gene expression (DGE) in R. I’m going to look at a few of them. In this article, we will analyze library distribution and perform a full DGE pipeline using edgeR package.

In writing this article, I have utilized resources from both the User’s Guide for edgeR and the article of Yunshun Chen et.al.1.

Initial preparation

Add necessary packages

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
library(conflicted)

library(tidyverse)

library(knitr)

library(edgeR)

library(ComplexHeatmap)

library(ggvenn)

As the original GEO entry lacked a table of conditions, I manually created one using the metadata from the samples. You can find the resulting table presented in the code chunk below.

1
2
3
4
RNA_Conditions <- read_tsv("./../../../data/input/data/RNA_conditions.tsv")

RNA_Conditions <- RNA_Conditions %>%
    dplyr::mutate(Line = factor(Line, levels = c("Normal", "KK", "KL", "KKL")))
1
RNA_Conditions %>% print(n=nrow(RNA_Conditions))

Expand the code chank to see the table.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
## # A tibble: 35 × 5
##    Sample          Sex   Line   Replicate   Nodulus
##    <chr>           <chr> <fct>  <chr>       <chr>  
##  1 norm_1_A_F_run2 F     Normal replicate 1 NO     
##  2 norm_2_A_F_run3 F     Normal replicate 2 NO     
##  3 norm_3_A_F_run1 F     Normal replicate 3 NO     
##  4 norm_4_A_M_run4 M     Normal replicate 4 NO     
##  5 norm_5_A_M_run5 M     Normal replicate 5 NO     
##  6 norm_6_A_M_run1 M     Normal replicate 6 NO     
##  7 KK_1_A_F_run3   F     KK     replicate 1 tumor A
##  8 KK_1_B_F_run3   F     KK     replicate 1 tumor B
##  9 KK_2_A_F_run1   F     KK     replicate 2 tumor A
## 10 KK_2_B_F_run1   F     KK     replicate 2 tumor B
## 11 KK_3_A_M_run5   M     KK     replicate 3 tumor A
## 12 KK_3_B_M_run5   M     KK     replicate 3 tumor B
## 13 KK_4_A_M_run2   M     KK     replicate 4 tumor A
## 14 KK_4_B_M_run2   M     KK     replicate 4 tumor B
## 15 KK_5_A_M_run4   M     KK     replicate 5 tumor A
## 16 KK_5_B_M_run4   M     KK     replicate 5 tumor B
## 17 KL_1_A_F_run2   F     KL     replicate 1 tumor A
## 18 KL_1_B_F_run2   F     KL     replicate 1 tumor B
## 19 KL_2_A_F_run5   F     KL     replicate 2 tumor A
## 20 KL_2_B_F_run5   F     KL     replicate 2 tumor B
## 21 KL_3_A_M_run1   M     KL     replicate 3 tumor A
## 22 KL_3_B_M_run1   M     KL     replicate 3 tumor B
## 23 KL_4_A_M_run3   M     KL     replicate 4 tumor A
## 24 KL_4_B_M_run3   M     KL     replicate 4 tumor B
## 25 KL_5_A_M_run4   M     KL     replicate 5 tumor A
## 26 KL_5_B_M_run4   M     KL     replicate 5 tumor B
## 27 KKL_1_A_F_run4  F     KKL    replicate 1 tumor A
## 28 KKL_1_B_F_run4  F     KKL    replicate 1 tumor B
## 29 KKL_2_A_F_run2  F     KKL    replicate 2 tumor A
## 30 KKL_2_B_F_run2  F     KKL    replicate 2 tumor B
## 31 KKL_3_A_M_run3  M     KKL    replicate 3 tumor A
## 32 KKL_3_B_M_run3  M     KKL    replicate 3 tumor B
## 33 KKL_4_A_M_run5  M     KKL    replicate 4 tumor A
## 34 KKL_5_A_M_run1  M     KKL    replicate 5 tumor A
## 35 KKL_5_B_M_run1  M     KKL    replicate 5 tumor B

For this tutorial, we will utilize the RNA counts table that was generated in the preceding one.

1
2
3
RNA_counts <- read_tsv("./../../../data/input/data/RNA_counts_ann.tsv")

RNA_counts %>% head %>% kable

Ensuring that the metadata of the table of conditions is in the same order as the column names of the counts table is of utmost importance. This cannot be emphasized enough. To preempt any issues, I will proactively reorder the columns of the counts table based on the sample sequence specified in the condition table.

1
2
RNA_counts <- RNA_counts %>% 
  relocate(c(RNA_Conditions$Sample), .before = !c(RNA_Conditions$Sample))
1
table(colnames(RNA_counts)[1:length(RNA_Conditions$Sample)] == RNA_Conditions$Sample)
## 
## TRUE 
##   35

Furthermore, I will rename the row names in the counts table with the gene names to provide a clearer understanding of which genes are differentially expressed. I will also include a row index since the table may contain duplicates.

1
2
RNA_counts <- as.data.frame(RNA_counts)
rownames(RNA_counts) <- paste(rownames(RNA_counts), RNA_counts$symbol, sep = "_")

create edgeR object - DEGList()

The data in the edgeR package is stored in a straightforward, list-based data structure known as a DEGlist. The main field of DEGlist is a matrix of counts. But it also stores information about groups of conditions and gene metadata. Additionally, DEGlist enables data subsetting based on column names. Personally, I prefer edgeR over DESeq2 due to its ease of data manipulation and the fact that many other packages utilize its DEGlist for subsequent analyses.

To utilize DEGlist, we need to provide the counts variable, which is the matrix of counts, as well as information about the sample groups. The remaining gene metadata can be sent to the genes variable for subsequent analysis.

1
2
3
4
y <- DGEList(counts = RNA_counts[, colnames(RNA_counts) %in% RNA_Conditions$Sample], 
             group  = RNA_Conditions$Line,
             genes  = RNA_counts[, !(colnames(RNA_counts) %in% RNA_Conditions$Sample), 
                                     drop = FALSE])

Gene counts per sample

Now we can start exploring your data. Below are the raw counts for each sample.

1
2
gene_counts <- apply(y$counts, 2, sum)  
base::sort(gene_counts)
## norm_3_A_F_run1   KL_1_B_F_run2   KK_1_B_F_run3   KK_4_B_M_run2   KK_4_A_M_run2 
##        11675586        16437820        17094871        18095347        18856436 
##   KK_1_A_F_run3   KL_1_A_F_run2 norm_4_A_M_run4  KKL_1_B_F_run4   KL_4_A_M_run3 
##        19165012        21035973        21078774        22050764        23239353 
##   KL_4_B_M_run3  KKL_3_A_M_run3 norm_6_A_M_run1  KKL_2_A_F_run2 norm_1_A_F_run2 
##        24216392        24698461        26001408        26095600        26796419 
##  KKL_1_A_F_run4  KKL_2_B_F_run2   KK_5_B_M_run4   KK_2_B_F_run1 norm_2_A_F_run3 
##        26891471        26990324        29037583        29203933        30186292 
##  KKL_5_B_M_run1 norm_5_A_M_run5  KKL_4_A_M_run5   KL_5_A_M_run4  KKL_5_A_M_run1 
##        31352424        31740941        32095794        32467669        33098042 
##   KL_2_A_F_run5   KL_5_B_M_run4   KK_2_A_F_run1   KL_3_B_M_run1   KK_3_B_M_run5 
##        33736324        35007584        36198774        36452365        37099116 
##   KK_5_A_M_run4   KL_2_B_F_run5   KL_3_A_M_run1   KK_3_A_M_run5  KKL_3_B_M_run3 
##        37509192        39368111        41240569        57689274        65373448
1
dim(y)
## [1] 50931    35
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
# adjusting the plot dimensions
par(mar=c(9,4,3,0))

# sort by library sizes 
sorted_lib.size <- y$samples[base::order(y$samples$lib.size),]

# The names argument tells the barplot to use the sample names on the x-axis
# The las argument rotates the axis names
barplot(sorted_lib.size$lib.size,
        names = rownames(sorted_lib.size), 
        las = 2, 
        col="#69b3a2")

abline(h = median(sorted_lib.size$lib.size), col="blue")

# Add a title to the plot
title("Barplot of library sizes")

Design matrix

To fit the linear model and perform differential gene expression analysis, we need to specify the design matrix, which includes the treatment conditions for each sample. Each column in this matrix represents a group of conditions, while each row represents whether the sample belongs to the group or not.

To create this matrix, we use the model.matrix function, which links the groups and samples together.

1
2
3
4
5
design <- model.matrix(~ 0 + RNA_Conditions$Line)

colnames(design) <- levels(RNA_Conditions$Line)

design %>% head %>% kable
Normal KK KL KKL
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0

Filtering to remove low counts

Next, we need to remove genes with low counts. There are several strategies to accomplish this, one of which involves keeping genes that have at least 10 counts in at least some libraries before they are considered expressed.

EdgeR offers a convenient way to filter out low count genes using the filterByExpr function, which attempts to retain the maximum number of genes possible.

Regardless of the chosen method of filtration, all libraries must be filtered in the same way to avoid bias.

1
2
keep <- filterByExpr(y, design =  design)
table(keep)
## keep
## FALSE  TRUE 
## 30175 20756
1
y <- y[keep, keep.lib.sizes=FALSE]

explore data

We can now visualize the distribution of gene counts for the filtered data.

1
2
3
4
par(mar=c(9,4,3,0))

AveLogCPM <- aveLogCPM(y,normalized.lib.sizes = TRUE, prior.count = 1)
hist(AveLogCPM,col="#69b3a2")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
y_logcounts <- cpm(y, prior.count = 1, log=TRUE)
y_logcounts <- y_logcounts[ ,rownames((sorted_lib.size))]

# Check distributions of samples using boxplots
boxplot(y_logcounts, xlab="", 
        ylab="Log2 counts per million",
        las=2, 
        col="#69b3a2")

abline(h=median(y_logcounts), col="blue")
title("Boxplots of logCPMs (Filtered)\nordered by library size from min to max")

Library size normalization

For downstream analysis, we need to normalize the library size. We can use the calcNormFactors function to create normalization factors for each library using the trimmed mean of M values (TMM) method (Robinson and Oshlack 2010). Next, the effective library size will be generated based on the library size and normalization factor.

1
2
3
4
y <- calcNormFactors(y, method = "TMM")

# norm.factors for each sample
y$samples$norm.factors
##  [1] 1.0762076 1.0970443 1.0723530 1.0490546 1.0851532 1.0647561 0.9858153
##  [8] 1.0520711 0.9667595 0.9556910 0.8821953 0.9538502 0.9061506 0.9576947
## [15] 0.9127112 0.9194349 1.0086732 0.9587406 1.0272246 1.0083948 1.0333428
## [22] 1.0270089 1.0126018 1.0023800 0.9785801 0.9255397 0.9963523 1.0345062
## [29] 1.0208527 1.0451356 1.0225006 0.9792565 1.0522012 0.9800223 0.9999865

The plotMDS function can be used to display differences between expression profiles of different samples. The plot has two dimensions that correspond to the first and second components of the PCA analysis.

1
2
3
4
pch <- c(0,1,2,15)
colors <- c("darkgreen", "blue", "red", "black")
plotMDS(y, col=colors[RNA_Conditions$Line], pch=pch[RNA_Conditions$Line], var.explained = T)
legend("bottomleft", legend=levels(RNA_Conditions$Line), pch=pch, col=colors, ncol=1)

The plotMD shows mean difference of expression data.

1
2
plotMD(y, column = 1) # column = sample from DEGlist
abline(h = 0, col = "red", lty = 2, lwd = 2)
1
2
3
# It is good practice to make MD plots for all the samples as a quality check.
plotMD(y, column = 18)
abline(h = 0, col="red", lty = 2, lwd = 2)

Dispersion estimation

Next, we need to estimate the dispersion. EdgeR utilizes the negative binomial (NB) distribution to model the read counts for each gene in each sample.

We can estimate the dispersion using the estimateDisp function, which calculates the common dispersion, trended dispersion, and tagwise dispersion, and adds fields corresponding to the estimated dispersions to a new DEGlist object.

1
2
3
y <- estimateDisp(y, design = design, robust=TRUE)

plotBCV(y)

Differential expression analysis

We can now proceed to analyze differential expression. In this article, we will construct and compare three very similar models for performing this analysis: regular glmFit, glmQLFit, and glmQLFit with additional filtration. The glmFit model is based on likelihood ratio tests (LRT) to identify differentially expressed genes, while the glmQLFit uses quasi-likelihood (QL) methods to account for overdispersion in the data, also glmQLFit model uses F-tests to calculate the probability of differential expression.

If there are many levels in the comparison group we should start by comparing conditions 1 vs 1.

Let’s examine which factors are present in the data:

1
2
y_cond_column <- levels(factor(colnames(y$design)))
y_cond_column %>% print
## [1] "KK"     "KKL"    "KL"     "Normal"

Create a table for performing one-to-one comparisons.

1
2
all.combos <- as.data.frame(t(combn(y_cond_column,2)))
print(all.combos)
##    V1     V2
## 1  KK    KKL
## 2  KK     KL
## 3  KK Normal
## 4 KKL     KL
## 5 KKL Normal
## 6  KL Normal

Select a pair to analyze.

1
2
3
row_id <- 1
single_cond <- unname(unlist(all.combos[row_id,]))
print(single_cond)
## [1] "KK"  "KKL"
1
2
3
4
5
6
# make new column for subsequent annotation
RNA_Conditions <- RNA_Conditions %>% 
  mutate(pair = case_when(
    Line == single_cond[1] ~ single_cond[1],
    Line == single_cond[2] ~ single_cond[2]
  ))

Generate a simple contrast for the selected pair. While it is possible to create more complex contrasts, interpreting the results can become more difficult.

1
2
y_contrast <- makeContrasts(base::paste(single_cond[1], '-', single_cond[2]), 
                            levels = design)

glmFit

Regular model – less strict then other models.

1
2
fit.glm <- glmFit(y, design, robust=TRUE)
res.glm <- glmLRT(fit.glm, contrast = y_contrast)

View top tags

1
2
3
4
topTags(res.glm, n = 6, sort.by = "PValue") %>% 
  as.data.frame %>% 
  select(GeneID, symbol, logFC, logCPM, PValue, FDR, gene_biotype, description) %>% 
  kable
GeneID symbol logFC logCPM PValue FDR gene_biotype description
5436_Ptprn ENSMUSG00000026204.15 Ptprn -6.777845 7.642472 0 0 protein_coding protein tyrosine phosphatase, receptor type, N [Source:MGI Symbol;Acc:MGI:102765]
3526_Gzme ENSMUSG00000022156.8 Gzme -8.556547 5.058337 0 0 protein_coding granzyme E [Source:MGI Symbol;Acc:MGI:109265]
17410_Rpl10a-ps2 ENSMUSG00000061988.3 Rpl10a-ps2 -5.969254 2.377657 0 0 processed_pseudogene ribosomal protein L10A, pseudogene 2 [Source:MGI Symbol;Acc:MGI:3646227]
9394_Idi2 ENSMUSG00000033520.2 Idi2 -7.699325 4.560845 0 0 protein_coding isopentenyl-diphosphate delta isomerase 2 [Source:MGI Symbol;Acc:MGI:2444315]
8466_Star ENSMUSG00000031574.8 Star -2.072882 3.101177 0 0 protein_coding steroidogenic acute regulatory protein [Source:MGI Symbol;Acc:MGI:102760]
2957_Gm9745 ENSMUSG00000021148.11 Gm9745 -7.842162 2.343155 0 0 protein_coding predicted gene 9745 [Source:MGI Symbol;Acc:MGI:3704398]

Add a new column with the adjusted p-values and create results table.

1
2
3
4
5
res.glm$table$p.adjust <- p.adjust(res.glm$table$PValue, method = "BH")

res.glm.table <- res.glm$table

res.glm.table$symbol <- res.glm$genes$symbol

Examine the number of differentially expressed genes.

1
2
is.de.res.glm <- decideTestsDGE(res.glm)
summary(is.de.res.glm)
##        1*KK -1*KKL
## Down          2671
## NotSig       15852
## Up            2233
1
2
3
4
# selected genes from fit glmFit()
res.glm.de.genes <- rownames(res.glm.table)[which(is.de.res.glm != 0)]

res.glm.de.genes %>% length() 
## [1] 4904

Enhance the plotMD with information regarding differential expression for the selected pair.

1
2
plotMD(res.glm, status = is.de.res.glm)
abline(h = c(-2, 2), col = "darkgreen")

Creating heatmap with selected genes

Simple heatmap. Shows top 200 DE genes by p-value. Groups: KK vs KKL

1
2
# log -	if TRUE then log2 values are returned.
logCPM <- cpm(y, prior.count = 1, log=TRUE) 
1
2
3
4
5
6
7
8
selected <- res.glm.table[res.glm.de.genes,] %>% 
  arrange(p.adjust)

logCPM.gr <- logCPM

colnames(logCPM.gr) <- y$samples$group

coolmap(logCPM.gr[rownames(selected)[1:200], ])

Create more complex heatmap

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
y_data_subset <- logCPM[rownames(selected)[1:60], ] 

# arrange columns
conditions_sorted <- RNA_Conditions %>% arrange(Line, Sex)

# create legend for heatmap
df_legend <- as.data.frame(conditions_sorted)[, c("Sample", 
                                                  "Sex", 
                                                  "Line", 
                                                  "pair"), 
                                              drop = FALSE] %>%
  column_to_rownames(var="Sample") %>% 
  `colnames<-`(c("Sex", "Line", "Condition")) 


row_lables <-  base::paste(round(selected$logFC[1:60], digits = 2),
                                            sapply(str_split(
                                              rownames(y_data_subset)[1:60], pattern = "_", n = 2),
                                              function(x) x[2]),
                                            round(selected$p.adjust[1:60], digits = 2)
                                            )
    
pheatmap::pheatmap(y_data_subset, 
                   cluster_rows = F,
                   cluster_cols = F,
                   show_rownames = T,
                   legend = F,
                   fontsize_col = 8,
                   fontsize_row = 8,
                   annotation_col = df_legend,
                   labels_row = row_lables)

glmQLFit

Model glmQLFit is similar to glmFit. It also negative binomial generalized log-linear model but implement the quasi-likelihood (QL) methods for dispersion estimation. This model is more strict than regular glmFit because it gives the same or grater p-value.

1
2
3
4
fit.glmQL <- glmQLFit(y, design, robust=TRUE)

# get the result - DE genes
res.glmQL <- glmQLFTest(fit.glmQL, contrast = y_contrast)
1
plotQLDisp(fit.glmQL)
1
2
3
4
topTags(res.glmQL, n = 6, sort.by = "PValue") %>% 
  as.data.frame %>% 
  select(GeneID, symbol, logFC, logCPM, PValue, FDR, gene_biotype, description) %>% 
  kable
GeneID symbol logFC logCPM PValue FDR gene_biotype description
5436_Ptprn ENSMUSG00000026204.15 Ptprn -6.789561 7.642472 0 0 protein_coding protein tyrosine phosphatase, receptor type, N [Source:MGI Symbol;Acc:MGI:102765]
9394_Idi2 ENSMUSG00000033520.2 Idi2 -7.646714 4.560845 0 0 protein_coding isopentenyl-diphosphate delta isomerase 2 [Source:MGI Symbol;Acc:MGI:2444315]
17410_Rpl10a-ps2 ENSMUSG00000061988.3 Rpl10a-ps2 -5.972805 2.377657 0 0 processed_pseudogene ribosomal protein L10A, pseudogene 2 [Source:MGI Symbol;Acc:MGI:3646227]
8466_Star ENSMUSG00000031574.8 Star -2.074807 3.101177 0 0 protein_coding steroidogenic acute regulatory protein [Source:MGI Symbol;Acc:MGI:102760]
3526_Gzme ENSMUSG00000022156.8 Gzme -8.523256 5.058337 0 0 protein_coding granzyme E [Source:MGI Symbol;Acc:MGI:109265]
2957_Gm9745 ENSMUSG00000021148.11 Gm9745 -7.800929 2.343155 0 0 protein_coding predicted gene 9745 [Source:MGI Symbol;Acc:MGI:3704398]
1
2
3
4
5
6
7
# write new column with p.adjust
res.glmQL$table$p.adjust <- p.adjust(res.glmQL$table$PValue, method = "BH")

# Access results tables
res.glmQL.table <- res.glmQL$table

res.glmQL.table$symbol <- res.glmQL$genes$symbol

Show DE genes

1
2
is.de.res.glmQL <- decideTestsDGE(res.glmQL)
summary(is.de.res.glmQL)
##        1*KK -1*KKL
## Down          2487
## NotSig       16208
## Up            2061
1
2
3
4
# selected genes from fit glmQLFit()
res.glmQL.de.genes <- rownames(res.glmQL.table)[which(is.de.res.glmQL != 0)]

res.glmQL.de.genes %>% length() 
## [1] 4548
1
2
3
4
# The magnitude of the differential expression changes can be visualized 
# with a fitted model MD plot:
plotMD(res.glmQL, status = is.de.res.glmQL)
abline(h = c(-2, 2), col = "darkgreen")

glmQLFit with more robust filtration method

If we are only interested in genes with relatively large expression changes, we can set a fold-change threshold for differential expression analysis. This can be a more biologically meaningful approach when dealing with a large number of differentially expressed genes. To detect expression changes that are greater than a specified threshold, we can modify the statistical test accordingly. For instance, we can test if the differential expression fold changes are significantly greater than 1.5, or if the logFCs are significantly greater than log2(1.5).

1
res.glmQL.tr <- glmTreat(fit.glmQL, contrast = y_contrast, lfc = log2(1.5))
1
2
3
4
topTags(res.glmQL.tr, n = 6, sort.by = "PValue") %>% 
  as.data.frame %>% 
  select(GeneID, symbol, logFC, logCPM, PValue, FDR, gene_biotype, description) %>% 
  kable
GeneID symbol logFC logCPM PValue FDR gene_biotype description
5436_Ptprn ENSMUSG00000026204.15 Ptprn -6.789561 7.642472 0 0 protein_coding protein tyrosine phosphatase, receptor type, N [Source:MGI Symbol;Acc:MGI:102765]
9394_Idi2 ENSMUSG00000033520.2 Idi2 -7.646714 4.560845 0 0 protein_coding isopentenyl-diphosphate delta isomerase 2 [Source:MGI Symbol;Acc:MGI:2444315]
17410_Rpl10a-ps2 ENSMUSG00000061988.3 Rpl10a-ps2 -5.972805 2.377657 0 0 processed_pseudogene ribosomal protein L10A, pseudogene 2 [Source:MGI Symbol;Acc:MGI:3646227]
2957_Gm9745 ENSMUSG00000021148.11 Gm9745 -7.800929 2.343155 0 0 protein_coding predicted gene 9745 [Source:MGI Symbol;Acc:MGI:3704398]
3526_Gzme ENSMUSG00000022156.8 Gzme -8.523256 5.058337 0 0 protein_coding granzyme E [Source:MGI Symbol;Acc:MGI:109265]
5732_Cubn ENSMUSG00000026726.10 Cubn -3.092687 1.459152 0 0 protein_coding cubilin (intrinsic factor-cobalamin receptor) [Source:MGI Symbol;Acc:MGI:1931256]
1
2
is.de.res.glmQL.tr <- decideTestsDGE(res.glmQL.tr)
summary(is.de.res.glmQL.tr)
##        1*KK -1*KKL
## Down           495
## NotSig       20027
## Up             234
1
2
3
4
5
6
# write new column with p.adjust
res.glmQL.tr$table$p.adjust <- p.adjust(res.glmQL.tr$table$PValue, method = "BH")

# Access results tables
res.glmQL.tr.table <- res.glmQL.tr$table
res.glmQL.tr.table$symbol <- res.glmQL.tr$genes$symbol
1
2
3
# selected genes from fit glmQLFit()
res.glmQL.tr.de.genes <- rownames(res.glmQL.tr.table)[which(is.de.res.glmQL.tr != 0)]
res.glmQL.tr.de.genes %>% length() 
## [1] 729
1
2
selected <- res.glmQL.tr.table[res.glmQL.tr.de.genes,] %>% 
  arrange(p.adjust)
1
2
3
4
# The magnitude of the differential expression changes can be visualized 
# with a fitted model MD plot:
plotMD(res.glmQL.tr, status = is.de.res.glmQL.tr)
abline(h = c(-2, 2), col = "darkgreen")
1
2
3
4
5
ordered <- order(res.glmQL.tr.table$p.adjust)
logCPM.tr.ordered <- logCPM[ordered[1:200],]
colnames(logCPM.tr.ordered) <- res.glmQL.tr$samples$group

coolmap(logCPM.tr.ordered)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
y_data_subset <- logCPM[rownames(selected)[1:60], ] 

# arrange columns
conditions_sorted <- RNA_Conditions %>% arrange(Line, Sex)

# create legend for heatmap
df_legend <- as.data.frame(conditions_sorted)[, c("Sample", 
                                                  "Sex", 
                                                  "Line", 
                                                  "pair"), 
                                              drop = FALSE] %>%
  column_to_rownames(var="Sample") %>% 
  `colnames<-`(c("Sex", "Line", "Condition")) 


row_lables <-  base::paste(round(selected$logFC[1:60], digits = 2),
                                            sapply(str_split(
                                              rownames(y_data_subset)[1:60], pattern = "_", n = 2),
                                              function(x) x[2]),
                                            round(selected$p.adjust[1:60], digits = 2)
                                            )
    
pheatmap::pheatmap(y_data_subset, 
                   cluster_rows = T,
                   cluster_cols = F,
                   show_rownames = T,
                   legend = F,
                   fontsize_col = 8,
                   fontsize_row = 8,
                   annotation_col = df_legend,
                   labels_row = row_lables)

We can now save the table of differentially expressed genes for further analysis. For instance, we can investigate pathways that are associated with one or the other condition.

1
2
3
4
5
6
7
8
final.table.tr <- res.glmQL.tr[res.glmQL.tr.de.genes,]

final.table.tr <- cbind(final.table.tr$table, final.table.tr$genes) %>% 
  as.data.frame %>% 
  rownames_to_column %>% 
  select(GeneID,gene_name,logFC,logCPM,PValue,p.adjust,gene_biotype,description)

final.table.tr %>% head %>% kable
GeneID gene_name logFC logCPM PValue p.adjust gene_biotype description
ENSMUSG00000000182.9 Fgf23 -3.554532 -1.113700 0.0000018 0.0002005 protein_coding fibroblast growth factor 23 [Source:MGI Symbol;Acc:MGI:1891427]
ENSMUSG00000000184.12 Ccnd2 -1.180055 6.861950 0.0006287 0.0223847 protein_coding cyclin D2 [Source:MGI Symbol;Acc:MGI:88314]
ENSMUSG00000000386.16 Mx1 -1.198251 3.864102 0.0007030 0.0242781 protein_coding MX dynamin-like GTPase 1 [Source:MGI Symbol;Acc:MGI:97243]
ENSMUSG00000000544.14 Gpa33 -3.692302 4.101342 0.0000005 0.0000727 protein_coding glycoprotein A33 (transmembrane) [Source:MGI Symbol;Acc:MGI:1891703]
ENSMUSG00000000562.5 Adora3 -1.388302 1.475210 0.0005006 0.0187889 protein_coding adenosine A3 receptor [Source:MGI Symbol;Acc:MGI:104847]
ENSMUSG00000000562.5 Adora3 -1.388302 1.475210 0.0005006 0.0187889 protein_coding adenosine A3 receptor [Source:MGI Symbol;Acc:MGI:104847]

Compare results

At the end of this article, we will compare the selected genes from our models to identify if there are any differences. We will visualize this using a Venn diagram, which will display the intersections and differences between our models.

1
2
3
4
5
ggvenn(list(glmFit = res.glm.de.genes,
            glmQLFit = res.glmQL.de.genes, 
            glmQLFit_tr = res.glmQL.tr.de.genes),
       fill_color = c("#009E73", "#E69F00", "#CD534CFF"),
       stroke_size = 0.5, set_name_size = 5, text_size = 3)
1
2
# number of found tegs by each model
length(res.glm.de.genes)
## [1] 4904
1
length(res.glmQL.de.genes)
## [1] 4548
1
length(res.glmQL.tr.de.genes)
## [1] 729

Conclusion

Throughout this article, we explored and compared three similar models to identify differentially expressed genes. In the next article, we will continue to explore additional methods for detecting DE genes.

References:

1. Chen Y, Lun ATL, Smyth GK. From reads to genes to pathways: Differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline. F1000Res [Internet]. 2016 Aug 2 [cited 2023 Feb 14];5:1438. Available from: https://f1000research.com/articles/5-1438/v2

0%