Preface
Here I am going to write a series of posts dedicated to high throughput RNA sequencing data analysis. As an essential technology, RNA sequencing is applied to many fields of biology and life science. Knowing how to process RNA seq data is important for researchers in these fields. This series of articles will explain the steps involved in the process.
In our first article, we will obtain real-world Next Generation Sequencing (NGS) data, learn about the data format required for further analysis, and annotate the data accordingly.
Where to get a data
For this project we will get data from NCBI Gene Expression Omnibus Database
(GEO) which is a public repository of functional genomics data. This database contains thousands of datasets related to analysis of gene expression. We will download one of them (Series GSE193895). Results of this research were published here1.
In series of my articles I’m not going to replicate the original research but describe the ways to analyse NGS RNA seq data.
Downloading data
We need to download the archive with raw counts for each sample GSE193895_RAW.tar
. It contains 35 separate files named after sample id followed by treatment conditions. Extract them tar -xf GSE193895_RAW.tar
. The files are also archived so use gzip -dv *
to unarchive all of them.
Data wrangling
Load all needed packages:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
|
# run explicit error when packages contain functions with same names
library(conflicted)
# for easier data manipulations
library(tidyverse)
# for pretty table printing
library(knitr)
# for splitting strings
library(stringr)
# for annotation
library(AnnotationHub)
library(ensembldb)
|
Get list of all files with raw counts for each sample:
1
|
count_files <- list.files("./../../../data/input/GSE193895_RAW/", full.names = T)
|
Each file contains three columns: GeneID
– mouse Ensembl gene Id, Length
– transcript length and column, named after sample Id and filled with counts for each gene
1
|
read_tsv(count_files[1]) %>% head %>% kable
|
## Rows: 53815 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): GeneID
## dbl (2): Length, norm_1_A_F_run2
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
GeneID |
Length |
norm_1_A_F_run2 |
ENSMUSG00000000001.4 |
3262 |
3923 |
ENSMUSG00000000003.15 |
902 |
0 |
ENSMUSG00000000028.15 |
3506 |
62 |
ENSMUSG00000000031.16 |
2460 |
18 |
ENSMUSG00000000037.17 |
6397 |
32 |
ENSMUSG00000000049.11 |
1594 |
3 |
So, let’s read all the files and create full table of counts:
1
2
3
4
5
6
|
RNA_counts <- lapply(count_files, function(x) read_tsv(x)) %>%
purrr::reduce(dplyr::full_join, by = 'GeneID') %>%
rowwise() %>%
mutate(Glength = max(c_across(starts_with("Length")))) %>%
dplyr::select(!starts_with("Length")) %>%
ungroup()
|
View the first few rows of the table:
1
|
RNA_counts %>% head %>% kable
|
GeneID |
norm_1_A_F_run2 |
norm_2_A_F_run3 |
norm_3_A_F_run1 |
norm_4_A_M_run4 |
norm_5_A_M_run5 |
norm_6_A_M_run1 |
KK_1_A_F_run3 |
KK_1_B_F_run3 |
KK_2_A_F_run1 |
KK_2_B_F_run1 |
KK_3_A_M_run5 |
KK_3_B_M_run5 |
KK_4_A_M_run2 |
KK_4_B_M_run2 |
KK_5_A_M_run4 |
KK_5_B_M_run4 |
KL_1_A_F_run2 |
KL_1_B_F_run2 |
KL_2_A_F_run5 |
KL_2_B_F_run5 |
KL_3_A_M_run1 |
KL_3_B_M_run1 |
KL_4_A_M_run3 |
KL_4_B_M_run3 |
KL_5_A_M_run4 |
KL_5_B_M_run4 |
KKL_1_A_F_run4 |
KKL_1_B_F_run4 |
KKL_2_A_F_run2 |
KKL_2_B_F_run2 |
KKL_3_A_M_run3 |
KKL_3_B_M_run3 |
KKL_4_A_M_run5 |
KKL_5_A_M_run1 |
KKL_5_B_M_run1 |
Glength |
ENSMUSG00000000001.4 |
3923 |
4526 |
1551 |
2506 |
4117 |
3244 |
2894 |
2393 |
5744 |
4507 |
8217 |
5462 |
2036 |
2400 |
6116 |
4723 |
3164 |
1719 |
6633 |
9681 |
8198 |
6701 |
4246 |
4065 |
6813 |
6148 |
4381 |
3396 |
3476 |
4187 |
4163 |
10262 |
6042 |
6049 |
5417 |
3262 |
ENSMUSG00000000003.15 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
902 |
ENSMUSG00000000028.15 |
62 |
83 |
32 |
60 |
86 |
78 |
147 |
160 |
268 |
177 |
344 |
233 |
95 |
111 |
228 |
180 |
175 |
177 |
204 |
610 |
299 |
194 |
200 |
171 |
332 |
247 |
180 |
136 |
160 |
225 |
157 |
522 |
274 |
172 |
163 |
3506 |
ENSMUSG00000000031.16 |
18 |
19 |
12 |
29 |
25 |
24 |
45 |
59 |
76 |
38 |
150 |
78 |
21 |
52 |
180 |
60 |
123 |
15 |
57 |
65 |
247 |
132 |
92 |
38 |
57 |
29 |
62 |
29 |
85 |
57 |
36 |
133 |
256 |
52 |
53 |
2460 |
ENSMUSG00000000037.17 |
32 |
30 |
14 |
23 |
27 |
20 |
12 |
8 |
14 |
18 |
19 |
16 |
4 |
7 |
17 |
15 |
7 |
8 |
6 |
5 |
29 |
7 |
9 |
8 |
9 |
10 |
15 |
17 |
11 |
28 |
21 |
60 |
23 |
7 |
11 |
6397 |
ENSMUSG00000000049.11 |
3 |
3 |
0 |
7 |
6 |
3 |
6 |
2 |
1 |
2 |
6 |
0 |
2 |
1 |
1 |
2 |
2 |
0 |
3 |
0 |
3 |
0 |
0 |
1 |
2 |
2 |
2 |
4 |
1 |
0 |
0 |
8 |
1 |
3 |
2 |
1594 |
Fortunately, this data table does not contain any missing values. All samples were processed at the same time and we do not have to deal with them.
1
|
length(RNA_counts[is.na(RNA_counts)])
|
## [1] 0
We need this table for all kinds of subsequent analyzes. So, we should write it on the disk:
1
2
3
4
|
write.table(RNA_counts,
file = "./../../../data/input/data//RNA_counts.tsv",
row.names = F,
sep = "\t")
|
Add annotation to our counts DT
The GeneID
column of our data contains version of each gene. We need gene index without version. So, let’s create new index column:
1
2
3
4
|
RNA_counts <- RNA_counts %>%
mutate(gene_id = sapply(stringr::str_split(RNA_counts$GeneID,
pattern = "\\.", n = 2),
function(.) .[1]))
|
Now we need annotations. AnnotationHub provides interface for all most useful data annotation systems.
## snapshotDate(): 2022-10-31
We could retrieve information about species and data providers:
1
|
unique(hub$species) %>% head %>% kable
|
x |
Homo sapiens |
Vicugna pacos |
Dasypus novemcinctus |
Otolemur garnettii |
Papio hamadryas |
Papio anubis |
1
|
hub$species[grep("Mus", hub$species)] %>% head
|
## [1] "Mus musculus" "Mus musculus" "Mus musculus" "Mus musculus" "Mus musculus"
## [6] "Mus musculus"
1
2
3
|
u <- unique(hub$dataprovider)
o <- order(unique(hub$dataprovider))
u[o] %>% head %>% kable
|
x |
AmoebaDB |
BioMart |
Broad Institute |
BroadInstitute |
ChEA |
CHM13 |
Obtain list of tables for Mus musculus from Ensembl:
1
|
query(hub, c("Mus musculus", "Ensembl"))
|
## AnnotationHub with 2485 records
## # snapshotDate(): 2022-10-31
## # $dataprovider: Ensembl, BioMart, UCSC, FANTOM5,DLRP,IUPHAR,HPRD,STRING,SWI...
## # $species: Mus musculus, mus musculus wsbeij, mus musculus pwkphj, mus musc...
## # $rdataclass: TwoBitFile, GRanges, EnsDb, data.frame, SQLiteFile, list, OrgDb
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## # rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH6066"]]'
##
## title
## AH6066 | Ensembl Genes
## AH6093 | Ensembl Genes
## AH6165 | Ensembl Genes
## AH6204 | Ensembl Genes
## AH7567 | Mus_musculus.GRCm38.70.gtf
## ... ...
## AH106502 | Mus_musculus_wsbeij.WSB_EiJ_v1.dna_sm.toplevel.2bit
## AH106503 | Mus_musculus_wsbeij.WSB_EiJ_v1.ncrna.2bit
## AH107060 | org.Mm.eg.db.sqlite
## AH107223 | LRBaseDb for Mus musculus (Mouse, v004)
## AH109367 | Ensembl 108 EnsDb for Mus musculus
We’ll use the latest one – AH109367 | Ensembl 108 EnsDb for Mus musculus
:
1
|
ensdb <- hub[["AH109367"]]
|
## loading from cache
Obtain metadata from EnsemblDB for each gene:
1
2
|
gns <- genes(ensdb)
gns %>% head %>% kable
|
seqnames |
start |
end |
width |
strand |
gene_id |
gene_name |
gene_biotype |
seq_coord_system |
description |
gene_id_version |
canonical_transcript |
symbol |
entrezid |
1 |
3143476 |
3144545 |
1070 |
+ |
ENSMUSG00000102693 |
4933401J01Rik |
TEC |
chromosome |
RIKEN cDNA 4933401J01 gene [Source:MGI Symbol;Acc:MGI:1918292] |
ENSMUSG00000102693.2 |
ENSMUST00000193812 |
4933401J01Rik |
NA |
1 |
3172239 |
3172348 |
110 |
+ |
ENSMUSG00000064842 |
Gm26206 |
snRNA |
chromosome |
predicted gene, 26206 [Source:MGI Symbol;Acc:MGI:5455983] |
ENSMUSG00000064842.3 |
ENSMUST00000082908 |
Gm26206 |
115487594 |
1 |
3276124 |
3741721 |
465598 |
- |
ENSMUSG00000051951 |
Xkr4 |
protein_coding |
chromosome |
X-linked Kx blood group related 4 [Source:MGI Symbol;Acc:MGI:3528744] |
ENSMUSG00000051951.6 |
ENSMUST00000070533 |
Xkr4 |
497097 |
1 |
3322980 |
3323459 |
480 |
+ |
ENSMUSG00000102851 |
Gm18956 |
processed_pseudogene |
chromosome |
predicted gene, 18956 [Source:MGI Symbol;Acc:MGI:5011141] |
ENSMUSG00000102851.2 |
ENSMUST00000192857 |
Gm18956 |
NA |
1 |
3435954 |
3438772 |
2819 |
- |
ENSMUSG00000103377 |
Gm37180 |
TEC |
chromosome |
predicted gene, 37180 [Source:MGI Symbol;Acc:MGI:5610408] |
ENSMUSG00000103377.2 |
ENSMUST00000195335 |
Gm37180 |
NA |
1 |
3445779 |
3448011 |
2233 |
- |
ENSMUSG00000104017 |
Gm37363 |
TEC |
chromosome |
predicted gene, 37363 [Source:MGI Symbol;Acc:MGI:5610591] |
ENSMUSG00000104017.2 |
ENSMUST00000192336 |
Gm37363 |
NA |
Add to our data table information about gene and transcript length from Encembl.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
# get transcripts
# transcriptsBy includes the UTR and exon regions
txs <- transcriptsBy(ensdb)
# max transcript widths (RNA + UTRs)
# width() is a getter for IRanges objects (<IRanges>)
txwid <- sapply(width(txs), max)
# full DNA gene widths, with names
gnwid <- setNames(width(gns), names(gns))
# txwid and gnwid have different orders
# create table of "TxLen", "GeneLen", "ens_id"
ens.annot <- data.frame(TxLen = txwid, GeneLen = gnwid[names(txwid)]) %>%
rownames_to_column(var = "gene_id")
ens.annot %>% head %>% kable
|
gene_id |
TxLen |
GeneLen |
ENSMUSG00000000001 |
38867 |
38867 |
ENSMUSG00000000003 |
15723 |
15723 |
ENSMUSG00000000028 |
31526 |
31541 |
ENSMUSG00000000031 |
2625 |
2625 |
ENSMUSG00000000037 |
139708 |
175689 |
ENSMUSG00000000049 |
52552 |
71043 |
Merge tables:
1
|
RNA_counts_ann <- left_join(RNA_counts, ens.annot, by = "gene_id")
|
Because TxLen
contains not only cDNA sequence it is not useful for us now.
1
2
3
|
RNA_counts_ann %>% dplyr::select(gene_id, Glength, TxLen, GeneLen) %>%
head %>%
kable
|
gene_id |
Glength |
TxLen |
GeneLen |
ENSMUSG00000000001 |
3262 |
38867 |
38867 |
ENSMUSG00000000003 |
902 |
15723 |
15723 |
ENSMUSG00000000028 |
3506 |
31526 |
31541 |
ENSMUSG00000000031 |
2460 |
2625 |
2625 |
ENSMUSG00000000037 |
6397 |
139708 |
175689 |
ENSMUSG00000000049 |
1594 |
52552 |
71043 |
Add rest of metadata to our table:
1
2
|
RNA_counts_ann <- left_join(RNA_counts_ann, as.data.frame(gns), by = "gene_id") %>%
unnest(entrezid)
|
Let’s view annotations of our table:
1
2
3
4
|
RNA_counts_ann %>%
dplyr::select(1,37:53) %>%
head %>%
kable
|
GeneID |
Glength |
gene_id |
TxLen |
GeneLen |
seqnames |
start |
end |
width |
strand |
gene_name |
gene_biotype |
seq_coord_system |
description |
gene_id_version |
canonical_transcript |
symbol |
entrezid |
ENSMUSG00000000001.4 |
3262 |
ENSMUSG00000000001 |
38867 |
38867 |
3 |
108014596 |
108053462 |
38867 |
- |
Gnai3 |
protein_coding |
chromosome |
guanine nucleotide binding protein (G protein), alpha inhibiting 3 [Source:MGI Symbol;Acc:MGI:95773] |
ENSMUSG00000000001.5 |
ENSMUST00000000001 |
Gnai3 |
14679 |
ENSMUSG00000000003.15 |
902 |
ENSMUSG00000000003 |
15723 |
15723 |
X |
76881507 |
76897229 |
15723 |
- |
Pbsn |
protein_coding |
chromosome |
probasin [Source:MGI Symbol;Acc:MGI:1860484] |
ENSMUSG00000000003.16 |
ENSMUST00000000003 |
Pbsn |
54192 |
ENSMUSG00000000028.15 |
3506 |
ENSMUSG00000000028 |
31526 |
31541 |
16 |
18599197 |
18630737 |
31541 |
- |
Cdc45 |
protein_coding |
chromosome |
cell division cycle 45 [Source:MGI Symbol;Acc:MGI:1338073] |
ENSMUSG00000000028.16 |
ENSMUST00000000028 |
Cdc45 |
12544 |
ENSMUSG00000000031.16 |
2460 |
ENSMUSG00000000031 |
2625 |
2625 |
7 |
142129262 |
142131886 |
2625 |
- |
H19 |
lncRNA |
chromosome |
H19, imprinted maternally expressed transcript [Source:MGI Symbol;Acc:MGI:95891] |
ENSMUSG00000000031.18 |
ENSMUST00000247195 |
H19 |
14955 |
ENSMUSG00000000037.17 |
6397 |
ENSMUSG00000000037 |
139708 |
175689 |
X |
159865521 |
160041209 |
175689 |
+ |
Scml2 |
protein_coding |
chromosome |
Scm polycomb group protein like 2 [Source:MGI Symbol;Acc:MGI:1340042] |
ENSMUSG00000000037.18 |
ENSMUST00000077375 |
Scml2 |
107815 |
ENSMUSG00000000049.11 |
1594 |
ENSMUSG00000000049 |
52552 |
71043 |
11 |
108234180 |
108305222 |
71043 |
+ |
Apoh |
protein_coding |
chromosome |
apolipoprotein H [Source:MGI Symbol;Acc:MGI:88058] |
ENSMUSG00000000049.12 |
ENSMUST00000000049 |
Apoh |
11818 |
So, this is the time to write our table on disk for subsequent uses.
1
2
3
4
|
write.table(RNA_counts_ann,
file = "./../../../data/input/data//RNA_counts_ann.tsv",
row.names = F,
sep = "\t")
|
Afterword
In this article we’ve found out how to obtain and prepare RNA seq data. In the next post we will create table of conditions and perform analysis of differential gene expression.
References:
1. Best SA, Gubser PM, Sethumadhavan S, Kersbergen A, Negrón Abril YL, Goldford J, Sellers K, Abeysekera W, Garnham AL, McDonald JA, Weeden CE, Anderson D, Pirman D, Roddy TP, Creek DJ, Kallies A, Kingsbury G, Sutherland KD. Glutaminase inhibition impairs CD8 T cell activation in STK11-/Lkb1-deficient lung cancer. Cell Metab. 2022 Jun 7;34(6):874–887.e6.