Next Generation Sequencing RNAseq Data Analysis (Part 3)

2023-02-19 1441 words 7 minutes

Contents

In the previous article we have discussed how to analyze RNA NGS data using the commonly used technique of negative binomial GLM. In the following article, we will proceed into differential gene expression analysis by utilizing alternative approaches, such as the edgeR classic method that is based on negative binomial exact test analysis, as well as the Wilcoxon ranked sum test for read counts. Furthermore, we will compare the outcomes of these methods to those of the quasi-likelihood GLM model.

Iport and prepare data

Add necessary packages

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


library(conflicted)

library(tidyverse)

library(knitr)

library(edgeR)

library(ComplexHeatmap)

library(ggvenn)

We will be using the same dataset that was originally generated in our initial post.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


# Import conditions table
RNA_Conditions <- read_tsv("./../../../data/input/data/RNA_conditions.tsv")

RNA_Conditions <- RNA_Conditions %>%
    dplyr::mutate(Line = factor(Line, levels = c("Normal", "KK", "KL", "KKL")))

# Import counts data
RNA_counts <- read_tsv("./../../../data/input/data/RNA_counts_ann.tsv")

# Reorder columns
RNA_counts <- RNA_counts %>%
    relocate(c(RNA_Conditions$Sample), .before = !c(RNA_Conditions$Sample))

RNA_counts <- as.data.frame(RNA_counts)
rownames(RNA_counts) <- paste(rownames(RNA_counts), RNA_counts$symbol, sep = "_")

1
2


# Ensure the columns are in the right order 
table(colnames(RNA_counts)[1:length(RNA_Conditions$Sample)] == RNA_Conditions$Sample)

## 
## TRUE 
##   35

Create DEGlist object.

1
2
3
4


y <- DGEList(counts = RNA_counts[,   colnames(RNA_counts) %in% RNA_Conditions$Sample], 
             group  = RNA_Conditions$Line,
             genes  = RNA_counts[, !(colnames(RNA_counts) %in% RNA_Conditions$Sample),
                                 drop = FALSE])

Create design matrix.

1
2


design <- model.matrix(~ 0 + RNA_Conditions$Line)
colnames(design) <- levels(RNA_Conditions$Line)

Remove low expressed genes.

1
2


keep <- filterByExpr(y, design =  design)
y <- y[keep, keep.lib.sizes=FALSE]

Calculate norm factors.

1

y <- calcNormFactors(y, method = "TMM")

Estimate dispersion.

1

y <- estimateDisp(y, design = design, robust=TRUE)

Create table of comparison pairs

1
2


y_cond_column <- levels(factor(colnames(y$design)))
all.combos <- as.data.frame(t(combn(y_cond_column,2)))

1
2
3
4
5
6
7
8
9


row_id <- 1
single_cond <- unname(unlist(all.combos[row_id,]))

# make new column for subsequent annotation
RNA_Conditions <- RNA_Conditions %>% 
  mutate(pair = case_when(
    Line == single_cond[1] ~ single_cond[1],
    Line == single_cond[2] ~ single_cond[2]
  ))

Differential Expression

Edger classic using `exactTest` function.

The EdgeR classic approach implements the exact test proposed by Robinson and Smyth (2008) to measure the difference in mean between two sets of negative binomial stochastic variables. Two groups of count libraries are entered into the function, and a test is performed for each data row. The test is dependent on the cumulative total of counts for that row and is an extension of the conventional exact binomial test (utilized in binomTest), customized for over-dispersed counts.

1
2


# perform test
res.et <- exactTest(y, pair = single_cond) # compare groups 1 and 2

The exactTest function generates a DGEExact class object that contains all the necessary components required for subsequent analysis. This comprises a results table with columns for log2-fold-change (logFC), average log2-counts-per-million (logCPM), two-sided p-value (PValue), and a genes table that consists of the annotations for each gene provided during the construction of the initial DEGList object.

Furthermore, we will determine the adjusted p-values, using the Benjamini-Hochberg approach, derived from the exact test p-value scores.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


# write new column with p.adjust
res.et$table$p.adjust <- p.adjust(res.et$table$PValue, method = "BH")

# Access results tables
res.et.table <- res.et$table

res.et.table$symbol <- res.et$genes$symbol

# View the table
res.et.table %>% head %>% kable

	logFC	logCPM	PValue	p.adjust	symbol
1_Gnai3	0.0685377	7.3130136	0.5046907	0.6772364	Gnai3
3_Cdc45	-0.0631151	2.7380599	0.6886230	0.8140483	Cdc45
4_H19	-0.0185511	1.2750368	0.9643894	0.9847420	H19
5_Scml2	0.4565841	-0.6971466	0.0950176	0.2262964	Scml2
7_Narf	0.2022848	4.9929162	0.1933061	0.3666175	Narf
8_Cav2	0.0613371	8.2812087	0.6106915	0.7587401	Cav2

1
2
3
4


topTags(res.et, n = 6, sort.by = "PValue") %>% 
  as.data.frame %>% 
  select(GeneID, symbol, logFC, logCPM, PValue, FDR, gene_biotype, description) %>% 
  kable

	GeneID	symbol	logFC	logCPM	gene_biotype	description
5436_Ptprn	ENSMUSG00000026204.15	Ptprn	6.777902	7.642472	protein_coding	protein tyrosine phosphatase, receptor type, N [Source:MGI Symbol;Acc:MGI:102765]
3526_Gzme	ENSMUSG00000022156.8	Gzme	8.558893	5.058337	protein_coding	granzyme E [Source:MGI Symbol;Acc:MGI:109265]
17410_Rpl10a-ps2	ENSMUSG00000061988.3	Rpl10a-ps2	5.969637	2.377657	processed_pseudogene	ribosomal protein L10A, pseudogene 2 [Source:MGI Symbol;Acc:MGI:3646227]
9394_Idi2	ENSMUSG00000033520.2	Idi2	7.699590	4.560845	protein_coding	isopentenyl-diphosphate delta isomerase 2 [Source:MGI Symbol;Acc:MGI:2444315]
8466_Star	ENSMUSG00000031574.8	Star	2.072902	3.101177	protein_coding	steroidogenic acute regulatory protein [Source:MGI Symbol;Acc:MGI:102760]
2957_Gm9745	ENSMUSG00000021148.11	Gm9745	7.843513	2.343155	protein_coding	predicted gene 9745 [Source:MGI Symbol;Acc:MGI:3704398]

1
2


is.de.res.et <- decideTestsDGE(res.et)
summary(is.de.res.et)

##        KKL-KK
## Down     2203
## NotSig  15899
## Up       2654

1
2
3
4


# selected genes from exactTest()
res.et.de.genes <- rownames(res.et.table)[which(is.de.res.et != 0)]

res.et.de.genes %>% length() 

## [1] 4857

1
2


selected <- res.et.table[res.et.de.genes,] %>% 
  arrange(p.adjust)

1
2
3
4


# The magnitude of the differential expression changes can be visualized 
# with a fitted model MD plot:
plotMD(res.et, status = is.de.res.et)
abline(h = c(-2, 2), col = "darkgreen")

1
2


# log -	if TRUE then log2 values are returned.
logCPM <- cpm(y, prior.count = 1, log=TRUE) 

1
2
3
4
5


ordered <- order(res.et.table$p.adjust)
logCPM.et.ordered <- logCPM[ordered[1:200],]
colnames(logCPM.et.ordered) <- RNA_Conditions$Line

coolmap(logCPM.et.ordered)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31


y_data_subset <- logCPM[rownames(selected)[1:60], ] 

# arrange columns
conditions_sorted <- RNA_Conditions %>% arrange(Line, Sex)

# create legend for heatmap
df_legend <- as.data.frame(conditions_sorted)[, c("Sample", 
                                                  "Sex", 
                                                  "Line", 
                                                  "pair"), 
                                              drop = FALSE] %>%
  column_to_rownames(var="Sample") %>% 
  `colnames<-`(c("Sex", "Line", "Condition")) 


row_lables <-  base::paste(round(selected$logFC[1:60], digits = 2),
                                            sapply(str_split(
                                              rownames(y_data_subset)[1:60], pattern = "_", n = 2),
                                              function(x) x[2]),
                                            round(selected$p.adjust[1:60], digits = 2)
                                            )
    
pheatmap::pheatmap(y_data_subset, 
                   cluster_rows = T,
                   cluster_cols = F,
                   show_rownames = T,
                   legend = F,
                   fontsize_col = 8,
                   fontsize_row = 8,
                   annotation_col = df_legend,
                   labels_row = row_lables)

Wilkoxon test

The Wilcoxon rank-sum test is deemed to be more impervious to outliers, largely because of its different null hypothesis. In essence, a gene’s measurement under one condition is equally likely to be greater or less than its measurement under another condition. As a result, the Wilcoxon rank-sum test attaches more importance to ranks than the magnitudes of measurements, making it highly robust to outliers.

For this reason, the test is highly recommended for studies with extensive sample sizes.

1
2
3
4
5
6
7
8


# Run the Wilcoxon rank-sum test for each gene
t.Wilcoxon <- function (index, log_counts, conditions) {
  test_data <- cbind.data.frame(gene = as.numeric(t(log_counts[index,])), 
                                pair = conditions$pair)
  p <- wilcox.test(gene ~ pair, data = test_data, exact = FALSE)$p.value
}
  
pvalues <- sapply(1:nrow(logCPM), function(x) t.Wilcoxon(x, logCPM, RNA_Conditions))

1
2
3
4
5
6
7
8
9


fdr <- p.adjust(pvalues, method = "fdr")

# Calculate the fold-change for each gene
conditionsLevel <- levels(factor(RNA_Conditions$pair))

dataCon1 <- logCPM[,c(which(RNA_Conditions$pair==conditionsLevel[1]))]
dataCon2 <- logCPM[,c(which(RNA_Conditions$pair==conditionsLevel[2]))]

foldChanges <- log2(rowMeans(dataCon2)/rowMeans(dataCon1))

## Warning: NaNs produced

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17


# Output results based on FDR threshold
res.wt <- data.frame(logFC = foldChanges, PValue = pvalues, FDR = fdr) %>% 
  na.omit %>%
  dplyr::filter(FDR < 0.05) %>% 
  rownames_to_column(var = "rowname") %>% 
  arrange(FDR) 

res.wt.de.genes <- res.wt %>% 
  select(rowname) %>% unname %>% unlist

# Results table
results.table <- RNA_counts %>% select(!c(RNA_Conditions$Sample)) %>% 
  rownames_to_column(var = "rowname") %>% 
  left_join(res.wt, ., by = "rowname") %>% 
  dplyr::select(GeneID, symbol, logFC, PValue, FDR, gene_biotype, description) 

results.table %>% head %>% kable

GeneID	symbol	logFC	PValue	FDR	gene_biotype	description
ENSMUSG00000000194.13	Gpr107	0.0615766	0.0002797	0.0057651	protein_coding	G protein-coupled receptor 107 [Source:MGI Symbol;Acc:MGI:2139054]
ENSMUSG00000000308.14	Ckmt1	0.1666067	0.0002797	0.0057651	protein_coding	creatine kinase, mitochondrial 1, ubiquitous [Source:MGI Symbol;Acc:MGI:99441]
ENSMUSG00000000320.10	Alox12	-0.6603288	0.0002797	0.0057651	protein_coding	arachidonate 12-lipoxygenase [Source:MGI Symbol;Acc:MGI:87998]
ENSMUSG00000000340.10	Dbt	-0.1512780	0.0002797	0.0057651	protein_coding	dihydrolipoamide branched chain transacylase E2 [Source:MGI Symbol;Acc:MGI:105386]
ENSMUSG00000000386.16	Mx1	0.4073687	0.0002797	0.0057651	protein_coding	MX dynamin-like GTPase 1 [Source:MGI Symbol;Acc:MGI:97243]
ENSMUSG00000000420.15	Galnt1	0.0750379	0.0002797	0.0057651	protein_coding	polypeptide N-acetylgalactosaminyltransferase 1 [Source:MGI Symbol;Acc:MGI:894693]

1
2
3
4
5
6


ordered <- order(res.et.table$p.adjust)

logCPM.wt.ordered <- logCPM[res.wt$rowname[1:200],]
colnames(logCPM.wt.ordered) <- RNA_Conditions$Line

coolmap(logCPM.wt.ordered)

glmQLFit

Compare DE genes to glmQL model

1
2


y_contrast <- makeContrasts(base::paste(single_cond[1], '-', single_cond[2]), 
                             levels = design)

1
2


fit.glmQL <- glmQLFit(y, design, robust=TRUE)
res.glmQL <- glmQLFTest(fit.glmQL, contrast = y_contrast)

Show DE genes

1
2
3


res.glmQL.table <- res.glmQL$table
is.de.res.glmQL <- decideTestsDGE(res.glmQL)
res.glmQL.de.genes <- rownames(res.glmQL.table)[which(is.de.res.glmQL != 0)]

Compare results

We can observe that the edgeR classic approach and the edgeR NB quasi-likelihood model produce comparable results. In terms of the glmQL genes, there is a 1.1 percent difference relative to the exactTest model, and for the exactTest model, there are 7.4 percent unique genes. In contrast, the Wilcoxon test yields a 22 percent unique gene count compared to the other two methods.

1
2
3
4
5


ggvenn(list("edgeR\nclassic" = res.et.de.genes,
            "Wilcoxon\ntest" = res.wt.de.genes,
            glmQLFit = res.glmQL.de.genes),
       fill_color = c("#009E73", "#E69F00", "#CD534CFF"),
       stroke_size = 0.5, set_name_size = 5, text_size = 3)

1
2


# number of found tegs by each model
length(res.et.de.genes)

## [1] 4857

1

length(res.wt.de.genes)

## [1] 5437

1

length(res.glmQL.de.genes)

## [1] 4548