Previous article
In the previous article we have discussed how to analyze RNA NGS data using the commonly used technique of negative binomial GLM. In the following article, we will proceed into differential gene expression analysis by utilizing alternative approaches, such as the edgeR classic method that is based on negative binomial exact test analysis, as well as the Wilcoxon ranked sum test for read counts. Furthermore, we will compare the outcomes of these methods to those of the quasi-likelihood GLM model.
Iport and prepare data
Add necessary packages
1
2
3
4
5
6
7
8
9
10
11
|
library(conflicted)
library(tidyverse)
library(knitr)
library(edgeR)
library(ComplexHeatmap)
library(ggvenn)
|
We will be using the same dataset that was originally generated in our initial post.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
|
# Import conditions table
RNA_Conditions <- read_tsv("./../../../data/input/data/RNA_conditions.tsv")
RNA_Conditions <- RNA_Conditions %>%
dplyr::mutate(Line = factor(Line, levels = c("Normal", "KK", "KL", "KKL")))
# Import counts data
RNA_counts <- read_tsv("./../../../data/input/data/RNA_counts_ann.tsv")
# Reorder columns
RNA_counts <- RNA_counts %>%
relocate(c(RNA_Conditions$Sample), .before = !c(RNA_Conditions$Sample))
RNA_counts <- as.data.frame(RNA_counts)
rownames(RNA_counts) <- paste(rownames(RNA_counts), RNA_counts$symbol, sep = "_")
|
1
2
|
# Ensure the columns are in the right order
table(colnames(RNA_counts)[1:length(RNA_Conditions$Sample)] == RNA_Conditions$Sample)
|
##
## TRUE
## 35
Create DEGlist
object.
1
2
3
4
|
y <- DGEList(counts = RNA_counts[, colnames(RNA_counts) %in% RNA_Conditions$Sample],
group = RNA_Conditions$Line,
genes = RNA_counts[, !(colnames(RNA_counts) %in% RNA_Conditions$Sample),
drop = FALSE])
|
Create design matrix.
1
2
|
design <- model.matrix(~ 0 + RNA_Conditions$Line)
colnames(design) <- levels(RNA_Conditions$Line)
|
Remove low expressed genes.
1
2
|
keep <- filterByExpr(y, design = design)
y <- y[keep, keep.lib.sizes=FALSE]
|
Calculate norm factors.
1
|
y <- calcNormFactors(y, method = "TMM")
|
Estimate dispersion.
1
|
y <- estimateDisp(y, design = design, robust=TRUE)
|
Create table of comparison pairs
1
2
|
y_cond_column <- levels(factor(colnames(y$design)))
all.combos <- as.data.frame(t(combn(y_cond_column,2)))
|
1
2
3
4
5
6
7
8
9
|
row_id <- 1
single_cond <- unname(unlist(all.combos[row_id,]))
# make new column for subsequent annotation
RNA_Conditions <- RNA_Conditions %>%
mutate(pair = case_when(
Line == single_cond[1] ~ single_cond[1],
Line == single_cond[2] ~ single_cond[2]
))
|
Differential Expression
Edger classic using exactTest
function.
The EdgeR classic
approach implements the exact test proposed by Robinson and Smyth (2008) to measure the difference in mean between two sets of negative binomial stochastic variables. Two groups of count libraries are entered into the function, and a test is performed for each data row. The test is dependent on the cumulative total of counts for that row and is an extension of the conventional exact binomial test (utilized in binomTest
), customized for over-dispersed counts.
1
2
|
# perform test
res.et <- exactTest(y, pair = single_cond) # compare groups 1 and 2
|
The exactTest
function generates a DGEExact
class object that contains all the necessary components required for subsequent analysis. This comprises a results table with columns for log2-fold-change (logFC), average log2-counts-per-million (logCPM), two-sided p-value (PValue), and a genes table that consists of the annotations for each gene provided during the construction of the initial DEGList
object.
Furthermore, we will determine the adjusted p-values, using the Benjamini-Hochberg approach, derived from the exact test p-value scores.
1
2
3
4
5
6
7
8
9
10
|
# write new column with p.adjust
res.et$table$p.adjust <- p.adjust(res.et$table$PValue, method = "BH")
# Access results tables
res.et.table <- res.et$table
res.et.table$symbol <- res.et$genes$symbol
# View the table
res.et.table %>% head %>% kable
|
|
logFC |
logCPM |
PValue |
p.adjust |
symbol |
1_Gnai3 |
0.0685377 |
7.3130136 |
0.5046907 |
0.6772364 |
Gnai3 |
3_Cdc45 |
-0.0631151 |
2.7380599 |
0.6886230 |
0.8140483 |
Cdc45 |
4_H19 |
-0.0185511 |
1.2750368 |
0.9643894 |
0.9847420 |
H19 |
5_Scml2 |
0.4565841 |
-0.6971466 |
0.0950176 |
0.2262964 |
Scml2 |
7_Narf |
0.2022848 |
4.9929162 |
0.1933061 |
0.3666175 |
Narf |
8_Cav2 |
0.0613371 |
8.2812087 |
0.6106915 |
0.7587401 |
Cav2 |
1
2
3
4
|
topTags(res.et, n = 6, sort.by = "PValue") %>%
as.data.frame %>%
select(GeneID, symbol, logFC, logCPM, PValue, FDR, gene_biotype, description) %>%
kable
|
|
GeneID |
symbol |
logFC |
logCPM |
PValue |
FDR |
gene_biotype |
description |
5436_Ptprn |
ENSMUSG00000026204.15 |
Ptprn |
6.777902 |
7.642472 |
0 |
0 |
protein_coding |
protein tyrosine phosphatase, receptor type, N [Source:MGI Symbol;Acc:MGI:102765] |
3526_Gzme |
ENSMUSG00000022156.8 |
Gzme |
8.558893 |
5.058337 |
0 |
0 |
protein_coding |
granzyme E [Source:MGI Symbol;Acc:MGI:109265] |
17410_Rpl10a-ps2 |
ENSMUSG00000061988.3 |
Rpl10a-ps2 |
5.969637 |
2.377657 |
0 |
0 |
processed_pseudogene |
ribosomal protein L10A, pseudogene 2 [Source:MGI Symbol;Acc:MGI:3646227] |
9394_Idi2 |
ENSMUSG00000033520.2 |
Idi2 |
7.699590 |
4.560845 |
0 |
0 |
protein_coding |
isopentenyl-diphosphate delta isomerase 2 [Source:MGI Symbol;Acc:MGI:2444315] |
8466_Star |
ENSMUSG00000031574.8 |
Star |
2.072902 |
3.101177 |
0 |
0 |
protein_coding |
steroidogenic acute regulatory protein [Source:MGI Symbol;Acc:MGI:102760] |
2957_Gm9745 |
ENSMUSG00000021148.11 |
Gm9745 |
7.843513 |
2.343155 |
0 |
0 |
protein_coding |
predicted gene 9745 [Source:MGI Symbol;Acc:MGI:3704398] |
1
2
|
is.de.res.et <- decideTestsDGE(res.et)
summary(is.de.res.et)
|
## KKL-KK
## Down 2203
## NotSig 15899
## Up 2654
1
2
3
4
|
# selected genes from exactTest()
res.et.de.genes <- rownames(res.et.table)[which(is.de.res.et != 0)]
res.et.de.genes %>% length()
|
## [1] 4857
1
2
|
selected <- res.et.table[res.et.de.genes,] %>%
arrange(p.adjust)
|
1
2
3
4
|
# The magnitude of the differential expression changes can be visualized
# with a fitted model MD plot:
plotMD(res.et, status = is.de.res.et)
abline(h = c(-2, 2), col = "darkgreen")
|
1
2
|
# log - if TRUE then log2 values are returned.
logCPM <- cpm(y, prior.count = 1, log=TRUE)
|
1
2
3
4
5
|
ordered <- order(res.et.table$p.adjust)
logCPM.et.ordered <- logCPM[ordered[1:200],]
colnames(logCPM.et.ordered) <- RNA_Conditions$Line
coolmap(logCPM.et.ordered)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
|
y_data_subset <- logCPM[rownames(selected)[1:60], ]
# arrange columns
conditions_sorted <- RNA_Conditions %>% arrange(Line, Sex)
# create legend for heatmap
df_legend <- as.data.frame(conditions_sorted)[, c("Sample",
"Sex",
"Line",
"pair"),
drop = FALSE] %>%
column_to_rownames(var="Sample") %>%
`colnames<-`(c("Sex", "Line", "Condition"))
row_lables <- base::paste(round(selected$logFC[1:60], digits = 2),
sapply(str_split(
rownames(y_data_subset)[1:60], pattern = "_", n = 2),
function(x) x[2]),
round(selected$p.adjust[1:60], digits = 2)
)
pheatmap::pheatmap(y_data_subset,
cluster_rows = T,
cluster_cols = F,
show_rownames = T,
legend = F,
fontsize_col = 8,
fontsize_row = 8,
annotation_col = df_legend,
labels_row = row_lables)
|
Wilkoxon test
The Wilcoxon rank-sum test is deemed to be more impervious to outliers, largely because of its different null hypothesis. In essence, a gene’s measurement under one condition is equally likely to be greater or less than its measurement under another condition. As a result, the Wilcoxon rank-sum test attaches more importance to ranks than the magnitudes of measurements, making it highly robust to outliers.
For this reason, the test is highly recommended for studies with extensive sample sizes.
1
2
3
4
5
6
7
8
|
# Run the Wilcoxon rank-sum test for each gene
t.Wilcoxon <- function (index, log_counts, conditions) {
test_data <- cbind.data.frame(gene = as.numeric(t(log_counts[index,])),
pair = conditions$pair)
p <- wilcox.test(gene ~ pair, data = test_data, exact = FALSE)$p.value
}
pvalues <- sapply(1:nrow(logCPM), function(x) t.Wilcoxon(x, logCPM, RNA_Conditions))
|
1
2
3
4
5
6
7
8
9
|
fdr <- p.adjust(pvalues, method = "fdr")
# Calculate the fold-change for each gene
conditionsLevel <- levels(factor(RNA_Conditions$pair))
dataCon1 <- logCPM[,c(which(RNA_Conditions$pair==conditionsLevel[1]))]
dataCon2 <- logCPM[,c(which(RNA_Conditions$pair==conditionsLevel[2]))]
foldChanges <- log2(rowMeans(dataCon2)/rowMeans(dataCon1))
|
## Warning: NaNs produced
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
# Output results based on FDR threshold
res.wt <- data.frame(logFC = foldChanges, PValue = pvalues, FDR = fdr) %>%
na.omit %>%
dplyr::filter(FDR < 0.05) %>%
rownames_to_column(var = "rowname") %>%
arrange(FDR)
res.wt.de.genes <- res.wt %>%
select(rowname) %>% unname %>% unlist
# Results table
results.table <- RNA_counts %>% select(!c(RNA_Conditions$Sample)) %>%
rownames_to_column(var = "rowname") %>%
left_join(res.wt, ., by = "rowname") %>%
dplyr::select(GeneID, symbol, logFC, PValue, FDR, gene_biotype, description)
results.table %>% head %>% kable
|
GeneID |
symbol |
logFC |
PValue |
FDR |
gene_biotype |
description |
ENSMUSG00000000194.13 |
Gpr107 |
0.0615766 |
0.0002797 |
0.0057651 |
protein_coding |
G protein-coupled receptor 107 [Source:MGI Symbol;Acc:MGI:2139054] |
ENSMUSG00000000308.14 |
Ckmt1 |
0.1666067 |
0.0002797 |
0.0057651 |
protein_coding |
creatine kinase, mitochondrial 1, ubiquitous [Source:MGI Symbol;Acc:MGI:99441] |
ENSMUSG00000000320.10 |
Alox12 |
-0.6603288 |
0.0002797 |
0.0057651 |
protein_coding |
arachidonate 12-lipoxygenase [Source:MGI Symbol;Acc:MGI:87998] |
ENSMUSG00000000340.10 |
Dbt |
-0.1512780 |
0.0002797 |
0.0057651 |
protein_coding |
dihydrolipoamide branched chain transacylase E2 [Source:MGI Symbol;Acc:MGI:105386] |
ENSMUSG00000000386.16 |
Mx1 |
0.4073687 |
0.0002797 |
0.0057651 |
protein_coding |
MX dynamin-like GTPase 1 [Source:MGI Symbol;Acc:MGI:97243] |
ENSMUSG00000000420.15 |
Galnt1 |
0.0750379 |
0.0002797 |
0.0057651 |
protein_coding |
polypeptide N-acetylgalactosaminyltransferase 1 [Source:MGI Symbol;Acc:MGI:894693] |
1
2
3
4
5
6
|
ordered <- order(res.et.table$p.adjust)
logCPM.wt.ordered <- logCPM[res.wt$rowname[1:200],]
colnames(logCPM.wt.ordered) <- RNA_Conditions$Line
coolmap(logCPM.wt.ordered)
|
glmQLFit
Compare DE genes to glmQL model
1
2
|
y_contrast <- makeContrasts(base::paste(single_cond[1], '-', single_cond[2]),
levels = design)
|
1
2
|
fit.glmQL <- glmQLFit(y, design, robust=TRUE)
res.glmQL <- glmQLFTest(fit.glmQL, contrast = y_contrast)
|
Show DE genes
1
2
3
|
res.glmQL.table <- res.glmQL$table
is.de.res.glmQL <- decideTestsDGE(res.glmQL)
res.glmQL.de.genes <- rownames(res.glmQL.table)[which(is.de.res.glmQL != 0)]
|
Compare results
We can observe that the edgeR classic approach and the edgeR NB quasi-likelihood model produce comparable results. In terms of the glmQL genes, there is a 1.1 percent difference relative to the exactTest model, and for the exactTest model, there are 7.4 percent unique genes. In contrast, the Wilcoxon test yields a 22 percent unique gene count compared to the other two methods.
1
2
3
4
5
|
ggvenn(list("edgeR\nclassic" = res.et.de.genes,
"Wilcoxon\ntest" = res.wt.de.genes,
glmQLFit = res.glmQL.de.genes),
fill_color = c("#009E73", "#E69F00", "#CD534CFF"),
stroke_size = 0.5, set_name_size = 5, text_size = 3)
|
1
2
|
# number of found tegs by each model
length(res.et.de.genes)
|
## [1] 4857
1
|
length(res.wt.de.genes)
|
## [1] 5437
1
|
length(res.glmQL.de.genes)
|
## [1] 4548