Previous article
In the first article, we obtained NGS data and made the necessary preparations for further analysis. This involved processing and organizing the data in a way that would facilitate downstream analysis. Additionally, we included annotations that provide valuable context and information for the interpretation of the results.
Introduction
There are many ways to perform analysis of differential gene expression (DGE) in R. I’m going to look at a few of them. In this article, we will analyze library distribution and perform a full DGE pipeline using edgeR package.
In writing this article, I have utilized resources from both the User’s Guide for edgeR and the article of Yunshun Chen et.al.1.
Initial preparation
Add necessary packages
1
2
3
4
5
6
7
8
9
10
11
|
library(conflicted)
library(tidyverse)
library(knitr)
library(edgeR)
library(ComplexHeatmap)
library(ggvenn)
|
As the original GEO entry lacked a table of conditions, I manually created one using the metadata from the samples. You can find the resulting table presented in the code chunk below.
1
2
3
4
|
RNA_Conditions <- read_tsv("./../../../data/input/data/RNA_conditions.tsv")
RNA_Conditions <- RNA_Conditions %>%
dplyr::mutate(Line = factor(Line, levels = c("Normal", "KK", "KL", "KKL")))
|
1
|
RNA_Conditions %>% print(n=nrow(RNA_Conditions))
|
Expand the code chank to see the table.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
|
## # A tibble: 35 × 5
## Sample Sex Line Replicate Nodulus
## <chr> <chr> <fct> <chr> <chr>
## 1 norm_1_A_F_run2 F Normal replicate 1 NO
## 2 norm_2_A_F_run3 F Normal replicate 2 NO
## 3 norm_3_A_F_run1 F Normal replicate 3 NO
## 4 norm_4_A_M_run4 M Normal replicate 4 NO
## 5 norm_5_A_M_run5 M Normal replicate 5 NO
## 6 norm_6_A_M_run1 M Normal replicate 6 NO
## 7 KK_1_A_F_run3 F KK replicate 1 tumor A
## 8 KK_1_B_F_run3 F KK replicate 1 tumor B
## 9 KK_2_A_F_run1 F KK replicate 2 tumor A
## 10 KK_2_B_F_run1 F KK replicate 2 tumor B
## 11 KK_3_A_M_run5 M KK replicate 3 tumor A
## 12 KK_3_B_M_run5 M KK replicate 3 tumor B
## 13 KK_4_A_M_run2 M KK replicate 4 tumor A
## 14 KK_4_B_M_run2 M KK replicate 4 tumor B
## 15 KK_5_A_M_run4 M KK replicate 5 tumor A
## 16 KK_5_B_M_run4 M KK replicate 5 tumor B
## 17 KL_1_A_F_run2 F KL replicate 1 tumor A
## 18 KL_1_B_F_run2 F KL replicate 1 tumor B
## 19 KL_2_A_F_run5 F KL replicate 2 tumor A
## 20 KL_2_B_F_run5 F KL replicate 2 tumor B
## 21 KL_3_A_M_run1 M KL replicate 3 tumor A
## 22 KL_3_B_M_run1 M KL replicate 3 tumor B
## 23 KL_4_A_M_run3 M KL replicate 4 tumor A
## 24 KL_4_B_M_run3 M KL replicate 4 tumor B
## 25 KL_5_A_M_run4 M KL replicate 5 tumor A
## 26 KL_5_B_M_run4 M KL replicate 5 tumor B
## 27 KKL_1_A_F_run4 F KKL replicate 1 tumor A
## 28 KKL_1_B_F_run4 F KKL replicate 1 tumor B
## 29 KKL_2_A_F_run2 F KKL replicate 2 tumor A
## 30 KKL_2_B_F_run2 F KKL replicate 2 tumor B
## 31 KKL_3_A_M_run3 M KKL replicate 3 tumor A
## 32 KKL_3_B_M_run3 M KKL replicate 3 tumor B
## 33 KKL_4_A_M_run5 M KKL replicate 4 tumor A
## 34 KKL_5_A_M_run1 M KKL replicate 5 tumor A
## 35 KKL_5_B_M_run1 M KKL replicate 5 tumor B
|
For this tutorial, we will utilize the RNA counts table that was generated in the preceding one.
1
2
3
|
RNA_counts <- read_tsv("./../../../data/input/data/RNA_counts_ann.tsv")
RNA_counts %>% head %>% kable
|
Ensuring that the metadata of the table of conditions is in the same order as the column names of the counts table is of utmost importance. This cannot be emphasized enough. To preempt any issues, I will proactively reorder the columns of the counts table based on the sample sequence specified in the condition table.
1
2
|
RNA_counts <- RNA_counts %>%
relocate(c(RNA_Conditions$Sample), .before = !c(RNA_Conditions$Sample))
|
1
|
table(colnames(RNA_counts)[1:length(RNA_Conditions$Sample)] == RNA_Conditions$Sample)
|
##
## TRUE
## 35
Furthermore, I will rename the row names in the counts table with the gene names to provide a clearer understanding of which genes are differentially expressed. I will also include a row index since the table may contain duplicates.
1
2
|
RNA_counts <- as.data.frame(RNA_counts)
rownames(RNA_counts) <- paste(rownames(RNA_counts), RNA_counts$symbol, sep = "_")
|
create edgeR object - DEGList()
The data in the edgeR
package is stored in a straightforward, list-based data structure known as a DEGlist
. The main field of DEGlist
is a matrix of counts. But it also stores information about groups of conditions and gene metadata. Additionally, DEGlist
enables data subsetting based on column names. Personally, I prefer edgeR
over DESeq2
due to its ease of data manipulation and the fact that many other packages utilize its DEGlist
for subsequent analyses.
To utilize DEGlist
, we need to provide the counts variable, which is the matrix of counts, as well as information about the sample groups. The remaining gene metadata can be sent to the genes
variable for subsequent analysis.
1
2
3
4
|
y <- DGEList(counts = RNA_counts[, colnames(RNA_counts) %in% RNA_Conditions$Sample],
group = RNA_Conditions$Line,
genes = RNA_counts[, !(colnames(RNA_counts) %in% RNA_Conditions$Sample),
drop = FALSE])
|
Gene counts per sample
Now we can start exploring your data. Below are the raw counts for each sample.
1
2
|
gene_counts <- apply(y$counts, 2, sum)
base::sort(gene_counts)
|
## norm_3_A_F_run1 KL_1_B_F_run2 KK_1_B_F_run3 KK_4_B_M_run2 KK_4_A_M_run2
## 11675586 16437820 17094871 18095347 18856436
## KK_1_A_F_run3 KL_1_A_F_run2 norm_4_A_M_run4 KKL_1_B_F_run4 KL_4_A_M_run3
## 19165012 21035973 21078774 22050764 23239353
## KL_4_B_M_run3 KKL_3_A_M_run3 norm_6_A_M_run1 KKL_2_A_F_run2 norm_1_A_F_run2
## 24216392 24698461 26001408 26095600 26796419
## KKL_1_A_F_run4 KKL_2_B_F_run2 KK_5_B_M_run4 KK_2_B_F_run1 norm_2_A_F_run3
## 26891471 26990324 29037583 29203933 30186292
## KKL_5_B_M_run1 norm_5_A_M_run5 KKL_4_A_M_run5 KL_5_A_M_run4 KKL_5_A_M_run1
## 31352424 31740941 32095794 32467669 33098042
## KL_2_A_F_run5 KL_5_B_M_run4 KK_2_A_F_run1 KL_3_B_M_run1 KK_3_B_M_run5
## 33736324 35007584 36198774 36452365 37099116
## KK_5_A_M_run4 KL_2_B_F_run5 KL_3_A_M_run1 KK_3_A_M_run5 KKL_3_B_M_run3
## 37509192 39368111 41240569 57689274 65373448
## [1] 50931 35
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
# adjusting the plot dimensions
par(mar=c(9,4,3,0))
# sort by library sizes
sorted_lib.size <- y$samples[base::order(y$samples$lib.size),]
# The names argument tells the barplot to use the sample names on the x-axis
# The las argument rotates the axis names
barplot(sorted_lib.size$lib.size,
names = rownames(sorted_lib.size),
las = 2,
col="#69b3a2")
abline(h = median(sorted_lib.size$lib.size), col="blue")
# Add a title to the plot
title("Barplot of library sizes")
|
Design matrix
To fit the linear model and perform differential gene expression analysis, we need to specify the design matrix, which includes the treatment conditions for each sample. Each column in this matrix represents a group of conditions, while each row represents whether the sample belongs to the group or not.
To create this matrix, we use the model.matrix
function, which links the groups and samples together.
1
2
3
4
5
|
design <- model.matrix(~ 0 + RNA_Conditions$Line)
colnames(design) <- levels(RNA_Conditions$Line)
design %>% head %>% kable
|
Normal |
KK |
KL |
KKL |
1 |
0 |
0 |
0 |
1 |
0 |
0 |
0 |
1 |
0 |
0 |
0 |
1 |
0 |
0 |
0 |
1 |
0 |
0 |
0 |
1 |
0 |
0 |
0 |
Filtering to remove low counts
Next, we need to remove genes with low counts. There are several strategies to accomplish this, one of which involves keeping genes that have at least 10 counts in at least some libraries before they are considered expressed.
EdgeR
offers a convenient way to filter out low count genes using the filterByExpr
function, which attempts to retain the maximum number of genes possible.
Regardless of the chosen method of filtration, all libraries must be filtered in the same way to avoid bias.
1
2
|
keep <- filterByExpr(y, design = design)
table(keep)
|
## keep
## FALSE TRUE
## 30175 20756
1
|
y <- y[keep, keep.lib.sizes=FALSE]
|
explore data
We can now visualize the distribution of gene counts for the filtered data.
1
2
3
4
|
par(mar=c(9,4,3,0))
AveLogCPM <- aveLogCPM(y,normalized.lib.sizes = TRUE, prior.count = 1)
hist(AveLogCPM,col="#69b3a2")
|
1
2
3
4
5
6
7
8
9
10
11
|
y_logcounts <- cpm(y, prior.count = 1, log=TRUE)
y_logcounts <- y_logcounts[ ,rownames((sorted_lib.size))]
# Check distributions of samples using boxplots
boxplot(y_logcounts, xlab="",
ylab="Log2 counts per million",
las=2,
col="#69b3a2")
abline(h=median(y_logcounts), col="blue")
title("Boxplots of logCPMs (Filtered)\nordered by library size from min to max")
|
Library size normalization
For downstream analysis, we need to normalize the library size. We can use the calcNormFactors
function to create normalization factors for each library using the trimmed mean of M values (TMM) method (Robinson and Oshlack 2010). Next, the effective library size will be generated based on the library size and normalization factor.
1
2
3
4
|
y <- calcNormFactors(y, method = "TMM")
# norm.factors for each sample
y$samples$norm.factors
|
## [1] 1.0762076 1.0970443 1.0723530 1.0490546 1.0851532 1.0647561 0.9858153
## [8] 1.0520711 0.9667595 0.9556910 0.8821953 0.9538502 0.9061506 0.9576947
## [15] 0.9127112 0.9194349 1.0086732 0.9587406 1.0272246 1.0083948 1.0333428
## [22] 1.0270089 1.0126018 1.0023800 0.9785801 0.9255397 0.9963523 1.0345062
## [29] 1.0208527 1.0451356 1.0225006 0.9792565 1.0522012 0.9800223 0.9999865
The plotMDS
function can be used to display differences between expression profiles of different samples. The plot has two dimensions that correspond to the first and second components of the PCA analysis.
1
2
3
4
|
pch <- c(0,1,2,15)
colors <- c("darkgreen", "blue", "red", "black")
plotMDS(y, col=colors[RNA_Conditions$Line], pch=pch[RNA_Conditions$Line], var.explained = T)
legend("bottomleft", legend=levels(RNA_Conditions$Line), pch=pch, col=colors, ncol=1)
|
The plotMD
shows mean difference of expression data.
1
2
|
plotMD(y, column = 1) # column = sample from DEGlist
abline(h = 0, col = "red", lty = 2, lwd = 2)
|
1
2
3
|
# It is good practice to make MD plots for all the samples as a quality check.
plotMD(y, column = 18)
abline(h = 0, col="red", lty = 2, lwd = 2)
|
Dispersion estimation
Next, we need to estimate the dispersion. EdgeR
utilizes the negative binomial (NB) distribution to model the read counts for each gene in each sample.
We can estimate the dispersion using the estimateDisp
function, which calculates the common dispersion
, trended dispersion
, and tagwise dispersion
, and adds fields corresponding to the estimated dispersions to a new DEGlist
object.
1
2
3
|
y <- estimateDisp(y, design = design, robust=TRUE)
plotBCV(y)
|
Differential expression analysis
We can now proceed to analyze differential expression. In this article, we will construct and compare three very similar models for performing this analysis: regular glmFit, glmQLFit, and glmQLFit with additional filtration. The glmFit model is based on likelihood ratio tests (LRT) to identify differentially expressed genes, while the glmQLFit uses quasi-likelihood (QL) methods to account for overdispersion in the data, also glmQLFit model uses F-tests to calculate the probability of differential expression.
If there are many levels in the comparison group we should start by comparing conditions 1 vs 1.
Let’s examine which factors are present in the data:
1
2
|
y_cond_column <- levels(factor(colnames(y$design)))
y_cond_column %>% print
|
## [1] "KK" "KKL" "KL" "Normal"
Create a table for performing one-to-one comparisons.
1
2
|
all.combos <- as.data.frame(t(combn(y_cond_column,2)))
print(all.combos)
|
## V1 V2
## 1 KK KKL
## 2 KK KL
## 3 KK Normal
## 4 KKL KL
## 5 KKL Normal
## 6 KL Normal
Select a pair to analyze.
1
2
3
|
row_id <- 1
single_cond <- unname(unlist(all.combos[row_id,]))
print(single_cond)
|
## [1] "KK" "KKL"
1
2
3
4
5
6
|
# make new column for subsequent annotation
RNA_Conditions <- RNA_Conditions %>%
mutate(pair = case_when(
Line == single_cond[1] ~ single_cond[1],
Line == single_cond[2] ~ single_cond[2]
))
|
Generate a simple contrast for the selected pair. While it is possible to create more complex contrasts, interpreting the results can become more difficult.
1
2
|
y_contrast <- makeContrasts(base::paste(single_cond[1], '-', single_cond[2]),
levels = design)
|
glmFit
Regular model – less strict then other models.
1
2
|
fit.glm <- glmFit(y, design, robust=TRUE)
res.glm <- glmLRT(fit.glm, contrast = y_contrast)
|
View top tags
1
2
3
4
|
topTags(res.glm, n = 6, sort.by = "PValue") %>%
as.data.frame %>%
select(GeneID, symbol, logFC, logCPM, PValue, FDR, gene_biotype, description) %>%
kable
|
|
GeneID |
symbol |
logFC |
logCPM |
PValue |
FDR |
gene_biotype |
description |
5436_Ptprn |
ENSMUSG00000026204.15 |
Ptprn |
-6.777845 |
7.642472 |
0 |
0 |
protein_coding |
protein tyrosine phosphatase, receptor type, N [Source:MGI Symbol;Acc:MGI:102765] |
3526_Gzme |
ENSMUSG00000022156.8 |
Gzme |
-8.556547 |
5.058337 |
0 |
0 |
protein_coding |
granzyme E [Source:MGI Symbol;Acc:MGI:109265] |
17410_Rpl10a-ps2 |
ENSMUSG00000061988.3 |
Rpl10a-ps2 |
-5.969254 |
2.377657 |
0 |
0 |
processed_pseudogene |
ribosomal protein L10A, pseudogene 2 [Source:MGI Symbol;Acc:MGI:3646227] |
9394_Idi2 |
ENSMUSG00000033520.2 |
Idi2 |
-7.699325 |
4.560845 |
0 |
0 |
protein_coding |
isopentenyl-diphosphate delta isomerase 2 [Source:MGI Symbol;Acc:MGI:2444315] |
8466_Star |
ENSMUSG00000031574.8 |
Star |
-2.072882 |
3.101177 |
0 |
0 |
protein_coding |
steroidogenic acute regulatory protein [Source:MGI Symbol;Acc:MGI:102760] |
2957_Gm9745 |
ENSMUSG00000021148.11 |
Gm9745 |
-7.842162 |
2.343155 |
0 |
0 |
protein_coding |
predicted gene 9745 [Source:MGI Symbol;Acc:MGI:3704398] |
Add a new column with the adjusted p-values and create results table.
1
2
3
4
5
|
res.glm$table$p.adjust <- p.adjust(res.glm$table$PValue, method = "BH")
res.glm.table <- res.glm$table
res.glm.table$symbol <- res.glm$genes$symbol
|
Examine the number of differentially expressed genes.
1
2
|
is.de.res.glm <- decideTestsDGE(res.glm)
summary(is.de.res.glm)
|
## 1*KK -1*KKL
## Down 2671
## NotSig 15852
## Up 2233
1
2
3
4
|
# selected genes from fit glmFit()
res.glm.de.genes <- rownames(res.glm.table)[which(is.de.res.glm != 0)]
res.glm.de.genes %>% length()
|
## [1] 4904
Enhance the plotMD with information regarding differential expression for the selected pair.
1
2
|
plotMD(res.glm, status = is.de.res.glm)
abline(h = c(-2, 2), col = "darkgreen")
|
Creating heatmap with selected genes
Simple heatmap. Shows top 200 DE genes by p-value. Groups: KK vs KKL
1
2
|
# log - if TRUE then log2 values are returned.
logCPM <- cpm(y, prior.count = 1, log=TRUE)
|
1
2
3
4
5
6
7
8
|
selected <- res.glm.table[res.glm.de.genes,] %>%
arrange(p.adjust)
logCPM.gr <- logCPM
colnames(logCPM.gr) <- y$samples$group
coolmap(logCPM.gr[rownames(selected)[1:200], ])
|
Create more complex heatmap
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
|
y_data_subset <- logCPM[rownames(selected)[1:60], ]
# arrange columns
conditions_sorted <- RNA_Conditions %>% arrange(Line, Sex)
# create legend for heatmap
df_legend <- as.data.frame(conditions_sorted)[, c("Sample",
"Sex",
"Line",
"pair"),
drop = FALSE] %>%
column_to_rownames(var="Sample") %>%
`colnames<-`(c("Sex", "Line", "Condition"))
row_lables <- base::paste(round(selected$logFC[1:60], digits = 2),
sapply(str_split(
rownames(y_data_subset)[1:60], pattern = "_", n = 2),
function(x) x[2]),
round(selected$p.adjust[1:60], digits = 2)
)
pheatmap::pheatmap(y_data_subset,
cluster_rows = F,
cluster_cols = F,
show_rownames = T,
legend = F,
fontsize_col = 8,
fontsize_row = 8,
annotation_col = df_legend,
labels_row = row_lables)
|
glmQLFit
Model glmQLFit is similar to glmFit. It also negative binomial generalized log-linear model but implement the quasi-likelihood (QL) methods for dispersion estimation. This model is more strict than regular glmFit because it gives the same or grater p-value.
1
2
3
4
|
fit.glmQL <- glmQLFit(y, design, robust=TRUE)
# get the result - DE genes
res.glmQL <- glmQLFTest(fit.glmQL, contrast = y_contrast)
|
1
2
3
4
|
topTags(res.glmQL, n = 6, sort.by = "PValue") %>%
as.data.frame %>%
select(GeneID, symbol, logFC, logCPM, PValue, FDR, gene_biotype, description) %>%
kable
|
|
GeneID |
symbol |
logFC |
logCPM |
PValue |
FDR |
gene_biotype |
description |
5436_Ptprn |
ENSMUSG00000026204.15 |
Ptprn |
-6.789561 |
7.642472 |
0 |
0 |
protein_coding |
protein tyrosine phosphatase, receptor type, N [Source:MGI Symbol;Acc:MGI:102765] |
9394_Idi2 |
ENSMUSG00000033520.2 |
Idi2 |
-7.646714 |
4.560845 |
0 |
0 |
protein_coding |
isopentenyl-diphosphate delta isomerase 2 [Source:MGI Symbol;Acc:MGI:2444315] |
17410_Rpl10a-ps2 |
ENSMUSG00000061988.3 |
Rpl10a-ps2 |
-5.972805 |
2.377657 |
0 |
0 |
processed_pseudogene |
ribosomal protein L10A, pseudogene 2 [Source:MGI Symbol;Acc:MGI:3646227] |
8466_Star |
ENSMUSG00000031574.8 |
Star |
-2.074807 |
3.101177 |
0 |
0 |
protein_coding |
steroidogenic acute regulatory protein [Source:MGI Symbol;Acc:MGI:102760] |
3526_Gzme |
ENSMUSG00000022156.8 |
Gzme |
-8.523256 |
5.058337 |
0 |
0 |
protein_coding |
granzyme E [Source:MGI Symbol;Acc:MGI:109265] |
2957_Gm9745 |
ENSMUSG00000021148.11 |
Gm9745 |
-7.800929 |
2.343155 |
0 |
0 |
protein_coding |
predicted gene 9745 [Source:MGI Symbol;Acc:MGI:3704398] |
1
2
3
4
5
6
7
|
# write new column with p.adjust
res.glmQL$table$p.adjust <- p.adjust(res.glmQL$table$PValue, method = "BH")
# Access results tables
res.glmQL.table <- res.glmQL$table
res.glmQL.table$symbol <- res.glmQL$genes$symbol
|
Show DE genes
1
2
|
is.de.res.glmQL <- decideTestsDGE(res.glmQL)
summary(is.de.res.glmQL)
|
## 1*KK -1*KKL
## Down 2487
## NotSig 16208
## Up 2061
1
2
3
4
|
# selected genes from fit glmQLFit()
res.glmQL.de.genes <- rownames(res.glmQL.table)[which(is.de.res.glmQL != 0)]
res.glmQL.de.genes %>% length()
|
## [1] 4548
1
2
3
4
|
# The magnitude of the differential expression changes can be visualized
# with a fitted model MD plot:
plotMD(res.glmQL, status = is.de.res.glmQL)
abline(h = c(-2, 2), col = "darkgreen")
|
glmQLFit with more robust filtration method
If we are only interested in genes with relatively large expression changes, we can set a fold-change threshold for differential expression analysis. This can be a more biologically meaningful approach when dealing with a large number of differentially expressed genes. To detect expression changes that are greater than a specified threshold, we can modify the statistical test accordingly. For instance, we can test if the differential expression fold changes are significantly greater than 1.5, or if the logFCs are significantly greater than log2(1.5).
1
|
res.glmQL.tr <- glmTreat(fit.glmQL, contrast = y_contrast, lfc = log2(1.5))
|
1
2
3
4
|
topTags(res.glmQL.tr, n = 6, sort.by = "PValue") %>%
as.data.frame %>%
select(GeneID, symbol, logFC, logCPM, PValue, FDR, gene_biotype, description) %>%
kable
|
|
GeneID |
symbol |
logFC |
logCPM |
PValue |
FDR |
gene_biotype |
description |
5436_Ptprn |
ENSMUSG00000026204.15 |
Ptprn |
-6.789561 |
7.642472 |
0 |
0 |
protein_coding |
protein tyrosine phosphatase, receptor type, N [Source:MGI Symbol;Acc:MGI:102765] |
9394_Idi2 |
ENSMUSG00000033520.2 |
Idi2 |
-7.646714 |
4.560845 |
0 |
0 |
protein_coding |
isopentenyl-diphosphate delta isomerase 2 [Source:MGI Symbol;Acc:MGI:2444315] |
17410_Rpl10a-ps2 |
ENSMUSG00000061988.3 |
Rpl10a-ps2 |
-5.972805 |
2.377657 |
0 |
0 |
processed_pseudogene |
ribosomal protein L10A, pseudogene 2 [Source:MGI Symbol;Acc:MGI:3646227] |
2957_Gm9745 |
ENSMUSG00000021148.11 |
Gm9745 |
-7.800929 |
2.343155 |
0 |
0 |
protein_coding |
predicted gene 9745 [Source:MGI Symbol;Acc:MGI:3704398] |
3526_Gzme |
ENSMUSG00000022156.8 |
Gzme |
-8.523256 |
5.058337 |
0 |
0 |
protein_coding |
granzyme E [Source:MGI Symbol;Acc:MGI:109265] |
5732_Cubn |
ENSMUSG00000026726.10 |
Cubn |
-3.092687 |
1.459152 |
0 |
0 |
protein_coding |
cubilin (intrinsic factor-cobalamin receptor) [Source:MGI Symbol;Acc:MGI:1931256] |
1
2
|
is.de.res.glmQL.tr <- decideTestsDGE(res.glmQL.tr)
summary(is.de.res.glmQL.tr)
|
## 1*KK -1*KKL
## Down 495
## NotSig 20027
## Up 234
1
2
3
4
5
6
|
# write new column with p.adjust
res.glmQL.tr$table$p.adjust <- p.adjust(res.glmQL.tr$table$PValue, method = "BH")
# Access results tables
res.glmQL.tr.table <- res.glmQL.tr$table
res.glmQL.tr.table$symbol <- res.glmQL.tr$genes$symbol
|
1
2
3
|
# selected genes from fit glmQLFit()
res.glmQL.tr.de.genes <- rownames(res.glmQL.tr.table)[which(is.de.res.glmQL.tr != 0)]
res.glmQL.tr.de.genes %>% length()
|
## [1] 729
1
2
|
selected <- res.glmQL.tr.table[res.glmQL.tr.de.genes,] %>%
arrange(p.adjust)
|
1
2
3
4
|
# The magnitude of the differential expression changes can be visualized
# with a fitted model MD plot:
plotMD(res.glmQL.tr, status = is.de.res.glmQL.tr)
abline(h = c(-2, 2), col = "darkgreen")
|
1
2
3
4
5
|
ordered <- order(res.glmQL.tr.table$p.adjust)
logCPM.tr.ordered <- logCPM[ordered[1:200],]
colnames(logCPM.tr.ordered) <- res.glmQL.tr$samples$group
coolmap(logCPM.tr.ordered)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
|
y_data_subset <- logCPM[rownames(selected)[1:60], ]
# arrange columns
conditions_sorted <- RNA_Conditions %>% arrange(Line, Sex)
# create legend for heatmap
df_legend <- as.data.frame(conditions_sorted)[, c("Sample",
"Sex",
"Line",
"pair"),
drop = FALSE] %>%
column_to_rownames(var="Sample") %>%
`colnames<-`(c("Sex", "Line", "Condition"))
row_lables <- base::paste(round(selected$logFC[1:60], digits = 2),
sapply(str_split(
rownames(y_data_subset)[1:60], pattern = "_", n = 2),
function(x) x[2]),
round(selected$p.adjust[1:60], digits = 2)
)
pheatmap::pheatmap(y_data_subset,
cluster_rows = T,
cluster_cols = F,
show_rownames = T,
legend = F,
fontsize_col = 8,
fontsize_row = 8,
annotation_col = df_legend,
labels_row = row_lables)
|
We can now save the table of differentially expressed genes for further analysis. For instance, we can investigate pathways that are associated with one or the other condition.
1
2
3
4
5
6
7
8
|
final.table.tr <- res.glmQL.tr[res.glmQL.tr.de.genes,]
final.table.tr <- cbind(final.table.tr$table, final.table.tr$genes) %>%
as.data.frame %>%
rownames_to_column %>%
select(GeneID,gene_name,logFC,logCPM,PValue,p.adjust,gene_biotype,description)
final.table.tr %>% head %>% kable
|
GeneID |
gene_name |
logFC |
logCPM |
PValue |
p.adjust |
gene_biotype |
description |
ENSMUSG00000000182.9 |
Fgf23 |
-3.554532 |
-1.113700 |
0.0000018 |
0.0002005 |
protein_coding |
fibroblast growth factor 23 [Source:MGI Symbol;Acc:MGI:1891427] |
ENSMUSG00000000184.12 |
Ccnd2 |
-1.180055 |
6.861950 |
0.0006287 |
0.0223847 |
protein_coding |
cyclin D2 [Source:MGI Symbol;Acc:MGI:88314] |
ENSMUSG00000000386.16 |
Mx1 |
-1.198251 |
3.864102 |
0.0007030 |
0.0242781 |
protein_coding |
MX dynamin-like GTPase 1 [Source:MGI Symbol;Acc:MGI:97243] |
ENSMUSG00000000544.14 |
Gpa33 |
-3.692302 |
4.101342 |
0.0000005 |
0.0000727 |
protein_coding |
glycoprotein A33 (transmembrane) [Source:MGI Symbol;Acc:MGI:1891703] |
ENSMUSG00000000562.5 |
Adora3 |
-1.388302 |
1.475210 |
0.0005006 |
0.0187889 |
protein_coding |
adenosine A3 receptor [Source:MGI Symbol;Acc:MGI:104847] |
ENSMUSG00000000562.5 |
Adora3 |
-1.388302 |
1.475210 |
0.0005006 |
0.0187889 |
protein_coding |
adenosine A3 receptor [Source:MGI Symbol;Acc:MGI:104847] |
Compare results
At the end of this article, we will compare the selected genes from our models to identify if there are any differences. We will visualize this using a Venn diagram, which will display the intersections and differences between our models.
1
2
3
4
5
|
ggvenn(list(glmFit = res.glm.de.genes,
glmQLFit = res.glmQL.de.genes,
glmQLFit_tr = res.glmQL.tr.de.genes),
fill_color = c("#009E73", "#E69F00", "#CD534CFF"),
stroke_size = 0.5, set_name_size = 5, text_size = 3)
|
1
2
|
# number of found tegs by each model
length(res.glm.de.genes)
|
## [1] 4904
1
|
length(res.glmQL.de.genes)
|
## [1] 4548
1
|
length(res.glmQL.tr.de.genes)
|
## [1] 729
Conclusion
Throughout this article, we explored and compared three similar models to identify differentially expressed genes. In the next article, we will continue to explore additional methods for detecting DE genes.
References:
1. Chen Y, Lun ATL, Smyth GK. From reads to genes to pathways: Differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline. F1000Res [Internet]. 2016 Aug 2 [cited 2023 Feb 14];5:1438. Available from: https://f1000research.com/articles/5-1438/v2