Next Generation Sequencing RNAseq Data Analysis (Part 3)

Compare edreR classic to Wilcoxon test and NB Quasi-Likelihood GLM model

Previous article

In the previous article we have discussed how to analyze RNA NGS data using the commonly used technique of negative binomial GLM. In the following article, we will proceed into differential gene expression analysis by utilizing alternative approaches, such as the edgeR classic method that is based on negative binomial exact test analysis, as well as the Wilcoxon ranked sum test for read counts. Furthermore, we will compare the outcomes of these methods to those of the quasi-likelihood GLM model.

Iport and prepare data

Add necessary packages

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
library(conflicted)

library(tidyverse)

library(knitr)

library(edgeR)

library(ComplexHeatmap)

library(ggvenn)

We will be using the same dataset that was originally generated in our initial post.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
# Import conditions table
RNA_Conditions <- read_tsv("./../../../data/input/data/RNA_conditions.tsv")

RNA_Conditions <- RNA_Conditions %>%
    dplyr::mutate(Line = factor(Line, levels = c("Normal", "KK", "KL", "KKL")))

# Import counts data
RNA_counts <- read_tsv("./../../../data/input/data/RNA_counts_ann.tsv")

# Reorder columns
RNA_counts <- RNA_counts %>%
    relocate(c(RNA_Conditions$Sample), .before = !c(RNA_Conditions$Sample))

RNA_counts <- as.data.frame(RNA_counts)
rownames(RNA_counts) <- paste(rownames(RNA_counts), RNA_counts$symbol, sep = "_")
1
2
# Ensure the columns are in the right order 
table(colnames(RNA_counts)[1:length(RNA_Conditions$Sample)] == RNA_Conditions$Sample)
## 
## TRUE 
##   35

Create DEGlist object.

1
2
3
4
y <- DGEList(counts = RNA_counts[,   colnames(RNA_counts) %in% RNA_Conditions$Sample], 
             group  = RNA_Conditions$Line,
             genes  = RNA_counts[, !(colnames(RNA_counts) %in% RNA_Conditions$Sample),
                                 drop = FALSE])

Create design matrix.

1
2
design <- model.matrix(~ 0 + RNA_Conditions$Line)
colnames(design) <- levels(RNA_Conditions$Line)

Remove low expressed genes.

1
2
keep <- filterByExpr(y, design =  design)
y <- y[keep, keep.lib.sizes=FALSE]

Calculate norm factors.

1
y <- calcNormFactors(y, method = "TMM")

Estimate dispersion.

1
y <- estimateDisp(y, design = design, robust=TRUE)

Create table of comparison pairs

1
2
y_cond_column <- levels(factor(colnames(y$design)))
all.combos <- as.data.frame(t(combn(y_cond_column,2)))
1
2
3
4
5
6
7
8
9
row_id <- 1
single_cond <- unname(unlist(all.combos[row_id,]))

# make new column for subsequent annotation
RNA_Conditions <- RNA_Conditions %>% 
  mutate(pair = case_when(
    Line == single_cond[1] ~ single_cond[1],
    Line == single_cond[2] ~ single_cond[2]
  ))

Differential Expression

Edger classic using exactTest function.

The EdgeR classic approach implements the exact test proposed by Robinson and Smyth (2008) to measure the difference in mean between two sets of negative binomial stochastic variables. Two groups of count libraries are entered into the function, and a test is performed for each data row. The test is dependent on the cumulative total of counts for that row and is an extension of the conventional exact binomial test (utilized in binomTest), customized for over-dispersed counts.

1
2
# perform test
res.et <- exactTest(y, pair = single_cond) # compare groups 1 and 2

The exactTest function generates a DGEExact class object that contains all the necessary components required for subsequent analysis. This comprises a results table with columns for log2-fold-change (logFC), average log2-counts-per-million (logCPM), two-sided p-value (PValue), and a genes table that consists of the annotations for each gene provided during the construction of the initial DEGList object.

Furthermore, we will determine the adjusted p-values, using the Benjamini-Hochberg approach, derived from the exact test p-value scores.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
# write new column with p.adjust
res.et$table$p.adjust <- p.adjust(res.et$table$PValue, method = "BH")

# Access results tables
res.et.table <- res.et$table

res.et.table$symbol <- res.et$genes$symbol

# View the table
res.et.table %>% head %>% kable
logFC logCPM PValue p.adjust symbol
1_Gnai3 0.0685377 7.3130136 0.5046907 0.6772364 Gnai3
3_Cdc45 -0.0631151 2.7380599 0.6886230 0.8140483 Cdc45
4_H19 -0.0185511 1.2750368 0.9643894 0.9847420 H19
5_Scml2 0.4565841 -0.6971466 0.0950176 0.2262964 Scml2
7_Narf 0.2022848 4.9929162 0.1933061 0.3666175 Narf
8_Cav2 0.0613371 8.2812087 0.6106915 0.7587401 Cav2
1
2
3
4
topTags(res.et, n = 6, sort.by = "PValue") %>% 
  as.data.frame %>% 
  select(GeneID, symbol, logFC, logCPM, PValue, FDR, gene_biotype, description) %>% 
  kable
GeneID symbol logFC logCPM PValue FDR gene_biotype description
5436_Ptprn ENSMUSG00000026204.15 Ptprn 6.777902 7.642472 0 0 protein_coding protein tyrosine phosphatase, receptor type, N [Source:MGI Symbol;Acc:MGI:102765]
3526_Gzme ENSMUSG00000022156.8 Gzme 8.558893 5.058337 0 0 protein_coding granzyme E [Source:MGI Symbol;Acc:MGI:109265]
17410_Rpl10a-ps2 ENSMUSG00000061988.3 Rpl10a-ps2 5.969637 2.377657 0 0 processed_pseudogene ribosomal protein L10A, pseudogene 2 [Source:MGI Symbol;Acc:MGI:3646227]
9394_Idi2 ENSMUSG00000033520.2 Idi2 7.699590 4.560845 0 0 protein_coding isopentenyl-diphosphate delta isomerase 2 [Source:MGI Symbol;Acc:MGI:2444315]
8466_Star ENSMUSG00000031574.8 Star 2.072902 3.101177 0 0 protein_coding steroidogenic acute regulatory protein [Source:MGI Symbol;Acc:MGI:102760]
2957_Gm9745 ENSMUSG00000021148.11 Gm9745 7.843513 2.343155 0 0 protein_coding predicted gene 9745 [Source:MGI Symbol;Acc:MGI:3704398]
1
2
is.de.res.et <- decideTestsDGE(res.et)
summary(is.de.res.et)
##        KKL-KK
## Down     2203
## NotSig  15899
## Up       2654
1
2
3
4
# selected genes from exactTest()
res.et.de.genes <- rownames(res.et.table)[which(is.de.res.et != 0)]

res.et.de.genes %>% length() 
## [1] 4857
1
2
selected <- res.et.table[res.et.de.genes,] %>% 
  arrange(p.adjust)
1
2
3
4
# The magnitude of the differential expression changes can be visualized 
# with a fitted model MD plot:
plotMD(res.et, status = is.de.res.et)
abline(h = c(-2, 2), col = "darkgreen")
1
2
# log -	if TRUE then log2 values are returned.
logCPM <- cpm(y, prior.count = 1, log=TRUE) 
1
2
3
4
5
ordered <- order(res.et.table$p.adjust)
logCPM.et.ordered <- logCPM[ordered[1:200],]
colnames(logCPM.et.ordered) <- RNA_Conditions$Line

coolmap(logCPM.et.ordered)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
y_data_subset <- logCPM[rownames(selected)[1:60], ] 

# arrange columns
conditions_sorted <- RNA_Conditions %>% arrange(Line, Sex)

# create legend for heatmap
df_legend <- as.data.frame(conditions_sorted)[, c("Sample", 
                                                  "Sex", 
                                                  "Line", 
                                                  "pair"), 
                                              drop = FALSE] %>%
  column_to_rownames(var="Sample") %>% 
  `colnames<-`(c("Sex", "Line", "Condition")) 


row_lables <-  base::paste(round(selected$logFC[1:60], digits = 2),
                                            sapply(str_split(
                                              rownames(y_data_subset)[1:60], pattern = "_", n = 2),
                                              function(x) x[2]),
                                            round(selected$p.adjust[1:60], digits = 2)
                                            )
    
pheatmap::pheatmap(y_data_subset, 
                   cluster_rows = T,
                   cluster_cols = F,
                   show_rownames = T,
                   legend = F,
                   fontsize_col = 8,
                   fontsize_row = 8,
                   annotation_col = df_legend,
                   labels_row = row_lables)

Wilkoxon test

The Wilcoxon rank-sum test is deemed to be more impervious to outliers, largely because of its different null hypothesis. In essence, a gene’s measurement under one condition is equally likely to be greater or less than its measurement under another condition. As a result, the Wilcoxon rank-sum test attaches more importance to ranks than the magnitudes of measurements, making it highly robust to outliers.

For this reason, the test is highly recommended for studies with extensive sample sizes.

1
2
3
4
5
6
7
8
# Run the Wilcoxon rank-sum test for each gene
t.Wilcoxon <- function (index, log_counts, conditions) {
  test_data <- cbind.data.frame(gene = as.numeric(t(log_counts[index,])), 
                                pair = conditions$pair)
  p <- wilcox.test(gene ~ pair, data = test_data, exact = FALSE)$p.value
}
  
pvalues <- sapply(1:nrow(logCPM), function(x) t.Wilcoxon(x, logCPM, RNA_Conditions))
1
2
3
4
5
6
7
8
9
fdr <- p.adjust(pvalues, method = "fdr")

# Calculate the fold-change for each gene
conditionsLevel <- levels(factor(RNA_Conditions$pair))

dataCon1 <- logCPM[,c(which(RNA_Conditions$pair==conditionsLevel[1]))]
dataCon2 <- logCPM[,c(which(RNA_Conditions$pair==conditionsLevel[2]))]

foldChanges <- log2(rowMeans(dataCon2)/rowMeans(dataCon1))
## Warning: NaNs produced
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
# Output results based on FDR threshold
res.wt <- data.frame(logFC = foldChanges, PValue = pvalues, FDR = fdr) %>% 
  na.omit %>%
  dplyr::filter(FDR < 0.05) %>% 
  rownames_to_column(var = "rowname") %>% 
  arrange(FDR) 

res.wt.de.genes <- res.wt %>% 
  select(rowname) %>% unname %>% unlist

# Results table
results.table <- RNA_counts %>% select(!c(RNA_Conditions$Sample)) %>% 
  rownames_to_column(var = "rowname") %>% 
  left_join(res.wt, ., by = "rowname") %>% 
  dplyr::select(GeneID, symbol, logFC, PValue, FDR, gene_biotype, description) 

results.table %>% head %>% kable
GeneID symbol logFC PValue FDR gene_biotype description
ENSMUSG00000000194.13 Gpr107 0.0615766 0.0002797 0.0057651 protein_coding G protein-coupled receptor 107 [Source:MGI Symbol;Acc:MGI:2139054]
ENSMUSG00000000308.14 Ckmt1 0.1666067 0.0002797 0.0057651 protein_coding creatine kinase, mitochondrial 1, ubiquitous [Source:MGI Symbol;Acc:MGI:99441]
ENSMUSG00000000320.10 Alox12 -0.6603288 0.0002797 0.0057651 protein_coding arachidonate 12-lipoxygenase [Source:MGI Symbol;Acc:MGI:87998]
ENSMUSG00000000340.10 Dbt -0.1512780 0.0002797 0.0057651 protein_coding dihydrolipoamide branched chain transacylase E2 [Source:MGI Symbol;Acc:MGI:105386]
ENSMUSG00000000386.16 Mx1 0.4073687 0.0002797 0.0057651 protein_coding MX dynamin-like GTPase 1 [Source:MGI Symbol;Acc:MGI:97243]
ENSMUSG00000000420.15 Galnt1 0.0750379 0.0002797 0.0057651 protein_coding polypeptide N-acetylgalactosaminyltransferase 1 [Source:MGI Symbol;Acc:MGI:894693]
1
2
3
4
5
6
ordered <- order(res.et.table$p.adjust)

logCPM.wt.ordered <- logCPM[res.wt$rowname[1:200],]
colnames(logCPM.wt.ordered) <- RNA_Conditions$Line

coolmap(logCPM.wt.ordered)

glmQLFit

Compare DE genes to glmQL model

1
2
y_contrast <- makeContrasts(base::paste(single_cond[1], '-', single_cond[2]), 
                             levels = design)
1
2
fit.glmQL <- glmQLFit(y, design, robust=TRUE)
res.glmQL <- glmQLFTest(fit.glmQL, contrast = y_contrast)

Show DE genes

1
2
3
res.glmQL.table <- res.glmQL$table
is.de.res.glmQL <- decideTestsDGE(res.glmQL)
res.glmQL.de.genes <- rownames(res.glmQL.table)[which(is.de.res.glmQL != 0)]

Compare results

We can observe that the edgeR classic approach and the edgeR NB quasi-likelihood model produce comparable results. In terms of the glmQL genes, there is a 1.1 percent difference relative to the exactTest model, and for the exactTest model, there are 7.4 percent unique genes. In contrast, the Wilcoxon test yields a 22 percent unique gene count compared to the other two methods.

1
2
3
4
5
ggvenn(list("edgeR\nclassic" = res.et.de.genes,
            "Wilcoxon\ntest" = res.wt.de.genes,
            glmQLFit = res.glmQL.de.genes),
       fill_color = c("#009E73", "#E69F00", "#CD534CFF"),
       stroke_size = 0.5, set_name_size = 5, text_size = 3)
1
2
# number of found tegs by each model
length(res.et.de.genes)
## [1] 4857
1
length(res.wt.de.genes)
## [1] 5437
1
length(res.glmQL.de.genes)
## [1] 4548
0%