Next Generation Sequencing RNAseq Data Analysis (Part 1)

Obtaining NGS data, preprocessing and annotation

Preface

Here I am going to write a series of posts dedicated to high throughput RNA sequencing data analysis. As an essential technology, RNA sequencing is applied to many fields of biology and life science. Knowing how to process RNA seq data is important for researchers in these fields. This series of articles will explain the steps involved in the process.

In our first article, we will obtain real-world Next Generation Sequencing (NGS) data, learn about the data format required for further analysis, and annotate the data accordingly.

Where to get a data

For this project we will get data from NCBI Gene Expression Omnibus Database (GEO) which is a public repository of functional genomics data. This database contains thousands of datasets related to analysis of gene expression. We will download one of them (Series GSE193895). Results of this research were published here1.

In series of my articles I’m not going to replicate the original research but describe the ways to analyse NGS RNA seq data.

Downloading data

We need to download the archive with raw counts for each sample GSE193895_RAW.tar. It contains 35 separate files named after sample id followed by treatment conditions. Extract them tar -xf GSE193895_RAW.tar. The files are also archived so use gzip -dv * to unarchive all of them.

Data wrangling

Load all needed packages:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
# run explicit error when packages contain functions with same names
library(conflicted)

# for easier data manipulations
library(tidyverse)

# for pretty table printing
library(knitr)

# for splitting strings
library(stringr)

# for annotation
library(AnnotationHub)

library(ensembldb)

Get list of all files with raw counts for each sample:

1
count_files <- list.files("./../../../data/input/GSE193895_RAW/", full.names = T)

Each file contains three columns: GeneID – mouse Ensembl gene Id, Length – transcript length and column, named after sample Id and filled with counts for each gene

1
read_tsv(count_files[1]) %>% head %>% kable
## Rows: 53815 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): GeneID
## dbl (2): Length, norm_1_A_F_run2
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
GeneID Length norm_1_A_F_run2
ENSMUSG00000000001.4 3262 3923
ENSMUSG00000000003.15 902 0
ENSMUSG00000000028.15 3506 62
ENSMUSG00000000031.16 2460 18
ENSMUSG00000000037.17 6397 32
ENSMUSG00000000049.11 1594 3

So, let’s read all the files and create full table of counts:

1
2
3
4
5
6
RNA_counts <- lapply(count_files, function(x) read_tsv(x)) %>% 
              purrr::reduce(dplyr::full_join, by = 'GeneID') %>% 
              rowwise() %>%
              mutate(Glength = max(c_across(starts_with("Length")))) %>% 
              dplyr::select(!starts_with("Length")) %>% 
              ungroup()

View the first few rows of the table:

1
RNA_counts %>% head %>% kable
GeneID norm_1_A_F_run2 norm_2_A_F_run3 norm_3_A_F_run1 norm_4_A_M_run4 norm_5_A_M_run5 norm_6_A_M_run1 KK_1_A_F_run3 KK_1_B_F_run3 KK_2_A_F_run1 KK_2_B_F_run1 KK_3_A_M_run5 KK_3_B_M_run5 KK_4_A_M_run2 KK_4_B_M_run2 KK_5_A_M_run4 KK_5_B_M_run4 KL_1_A_F_run2 KL_1_B_F_run2 KL_2_A_F_run5 KL_2_B_F_run5 KL_3_A_M_run1 KL_3_B_M_run1 KL_4_A_M_run3 KL_4_B_M_run3 KL_5_A_M_run4 KL_5_B_M_run4 KKL_1_A_F_run4 KKL_1_B_F_run4 KKL_2_A_F_run2 KKL_2_B_F_run2 KKL_3_A_M_run3 KKL_3_B_M_run3 KKL_4_A_M_run5 KKL_5_A_M_run1 KKL_5_B_M_run1 Glength
ENSMUSG00000000001.4 3923 4526 1551 2506 4117 3244 2894 2393 5744 4507 8217 5462 2036 2400 6116 4723 3164 1719 6633 9681 8198 6701 4246 4065 6813 6148 4381 3396 3476 4187 4163 10262 6042 6049 5417 3262
ENSMUSG00000000003.15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 902
ENSMUSG00000000028.15 62 83 32 60 86 78 147 160 268 177 344 233 95 111 228 180 175 177 204 610 299 194 200 171 332 247 180 136 160 225 157 522 274 172 163 3506
ENSMUSG00000000031.16 18 19 12 29 25 24 45 59 76 38 150 78 21 52 180 60 123 15 57 65 247 132 92 38 57 29 62 29 85 57 36 133 256 52 53 2460
ENSMUSG00000000037.17 32 30 14 23 27 20 12 8 14 18 19 16 4 7 17 15 7 8 6 5 29 7 9 8 9 10 15 17 11 28 21 60 23 7 11 6397
ENSMUSG00000000049.11 3 3 0 7 6 3 6 2 1 2 6 0 2 1 1 2 2 0 3 0 3 0 0 1 2 2 2 4 1 0 0 8 1 3 2 1594

Fortunately, this data table does not contain any missing values. All samples were processed at the same time and we do not have to deal with them.

1
length(RNA_counts[is.na(RNA_counts)])
## [1] 0

We need this table for all kinds of subsequent analyzes. So, we should write it on the disk:

1
2
3
4
write.table(RNA_counts, 
            file = "./../../../data/input/data//RNA_counts.tsv", 
            row.names = F, 
            sep = "\t")

Add annotation to our counts DT

The GeneID column of our data contains version of each gene. We need gene index without version. So, let’s create new index column:

1
2
3
4
RNA_counts <- RNA_counts %>% 
  mutate(gene_id = sapply(stringr::str_split(RNA_counts$GeneID, 
                                             pattern = "\\.", n = 2),  
                          function(.)  .[1]))

Now we need annotations. AnnotationHub provides interface for all most useful data annotation systems.

1
hub <- AnnotationHub()
## snapshotDate(): 2022-10-31

We could retrieve information about species and data providers:

1
unique(hub$species) %>% head %>% kable
x
Homo sapiens
Vicugna pacos
Dasypus novemcinctus
Otolemur garnettii
Papio hamadryas
Papio anubis
1
hub$species[grep("Mus", hub$species)] %>% head
## [1] "Mus musculus" "Mus musculus" "Mus musculus" "Mus musculus" "Mus musculus"
## [6] "Mus musculus"
1
2
3
u <- unique(hub$dataprovider)
o <- order(unique(hub$dataprovider))
u[o] %>% head %>% kable
x
AmoebaDB
BioMart
Broad Institute
BroadInstitute
ChEA
CHM13

Obtain list of tables for Mus musculus from Ensembl:

1
query(hub, c("Mus musculus", "Ensembl"))
## AnnotationHub with 2485 records
## # snapshotDate(): 2022-10-31
## # $dataprovider: Ensembl, BioMart, UCSC, FANTOM5,DLRP,IUPHAR,HPRD,STRING,SWI...
## # $species: Mus musculus, mus musculus wsbeij, mus musculus pwkphj, mus musc...
## # $rdataclass: TwoBitFile, GRanges, EnsDb, data.frame, SQLiteFile, list, OrgDb
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH6066"]]' 
## 
##              title                                              
##   AH6066   | Ensembl Genes                                      
##   AH6093   | Ensembl Genes                                      
##   AH6165   | Ensembl Genes                                      
##   AH6204   | Ensembl Genes                                      
##   AH7567   | Mus_musculus.GRCm38.70.gtf                         
##   ...        ...                                                
##   AH106502 | Mus_musculus_wsbeij.WSB_EiJ_v1.dna_sm.toplevel.2bit
##   AH106503 | Mus_musculus_wsbeij.WSB_EiJ_v1.ncrna.2bit          
##   AH107060 | org.Mm.eg.db.sqlite                                
##   AH107223 | LRBaseDb for Mus musculus (Mouse, v004)            
##   AH109367 | Ensembl 108 EnsDb for Mus musculus

We’ll use the latest one – AH109367 | Ensembl 108 EnsDb for Mus musculus:

1
ensdb <- hub[["AH109367"]]
## loading from cache

Obtain metadata from EnsemblDB for each gene:

1
2
gns <- genes(ensdb)
gns %>% head %>% kable
seqnames start end width strand gene_id gene_name gene_biotype seq_coord_system description gene_id_version canonical_transcript symbol entrezid
1 3143476 3144545 1070 + ENSMUSG00000102693 4933401J01Rik TEC chromosome RIKEN cDNA 4933401J01 gene [Source:MGI Symbol;Acc:MGI:1918292] ENSMUSG00000102693.2 ENSMUST00000193812 4933401J01Rik NA
1 3172239 3172348 110 + ENSMUSG00000064842 Gm26206 snRNA chromosome predicted gene, 26206 [Source:MGI Symbol;Acc:MGI:5455983] ENSMUSG00000064842.3 ENSMUST00000082908 Gm26206 115487594
1 3276124 3741721 465598 - ENSMUSG00000051951 Xkr4 protein_coding chromosome X-linked Kx blood group related 4 [Source:MGI Symbol;Acc:MGI:3528744] ENSMUSG00000051951.6 ENSMUST00000070533 Xkr4 497097
1 3322980 3323459 480 + ENSMUSG00000102851 Gm18956 processed_pseudogene chromosome predicted gene, 18956 [Source:MGI Symbol;Acc:MGI:5011141] ENSMUSG00000102851.2 ENSMUST00000192857 Gm18956 NA
1 3435954 3438772 2819 - ENSMUSG00000103377 Gm37180 TEC chromosome predicted gene, 37180 [Source:MGI Symbol;Acc:MGI:5610408] ENSMUSG00000103377.2 ENSMUST00000195335 Gm37180 NA
1 3445779 3448011 2233 - ENSMUSG00000104017 Gm37363 TEC chromosome predicted gene, 37363 [Source:MGI Symbol;Acc:MGI:5610591] ENSMUSG00000104017.2 ENSMUST00000192336 Gm37363 NA

Add to our data table information about gene and transcript length from Encembl.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
# get transcripts
# transcriptsBy includes the UTR and exon regions
txs <- transcriptsBy(ensdb)

# max transcript widths (RNA + UTRs)
# width() is a getter for IRanges objects (<IRanges>)
txwid <- sapply(width(txs), max)

# full DNA gene widths, with names
gnwid <- setNames(width(gns), names(gns))

# txwid and gnwid have different orders
# create table of "TxLen", "GeneLen", "ens_id"
ens.annot <- data.frame(TxLen = txwid, GeneLen = gnwid[names(txwid)]) %>% 
  rownames_to_column(var = "gene_id") 

ens.annot %>% head %>% kable
gene_id TxLen GeneLen
ENSMUSG00000000001 38867 38867
ENSMUSG00000000003 15723 15723
ENSMUSG00000000028 31526 31541
ENSMUSG00000000031 2625 2625
ENSMUSG00000000037 139708 175689
ENSMUSG00000000049 52552 71043

Merge tables:

1
RNA_counts_ann <- left_join(RNA_counts, ens.annot,  by = "gene_id")

Because TxLen contains not only cDNA sequence it is not useful for us now.

1
2
3
RNA_counts_ann %>% dplyr::select(gene_id, Glength, TxLen, GeneLen) %>% 
  head %>% 
  kable
gene_id Glength TxLen GeneLen
ENSMUSG00000000001 3262 38867 38867
ENSMUSG00000000003 902 15723 15723
ENSMUSG00000000028 3506 31526 31541
ENSMUSG00000000031 2460 2625 2625
ENSMUSG00000000037 6397 139708 175689
ENSMUSG00000000049 1594 52552 71043

Add rest of metadata to our table:

1
2
RNA_counts_ann <- left_join(RNA_counts_ann, as.data.frame(gns), by = "gene_id") %>% 
                  unnest(entrezid)

Let’s view annotations of our table:

1
2
3
4
RNA_counts_ann %>% 
  dplyr::select(1,37:53) %>% 
  head %>% 
  kable
GeneID Glength gene_id TxLen GeneLen seqnames start end width strand gene_name gene_biotype seq_coord_system description gene_id_version canonical_transcript symbol entrezid
ENSMUSG00000000001.4 3262 ENSMUSG00000000001 38867 38867 3 108014596 108053462 38867 - Gnai3 protein_coding chromosome guanine nucleotide binding protein (G protein), alpha inhibiting 3 [Source:MGI Symbol;Acc:MGI:95773] ENSMUSG00000000001.5 ENSMUST00000000001 Gnai3 14679
ENSMUSG00000000003.15 902 ENSMUSG00000000003 15723 15723 X 76881507 76897229 15723 - Pbsn protein_coding chromosome probasin [Source:MGI Symbol;Acc:MGI:1860484] ENSMUSG00000000003.16 ENSMUST00000000003 Pbsn 54192
ENSMUSG00000000028.15 3506 ENSMUSG00000000028 31526 31541 16 18599197 18630737 31541 - Cdc45 protein_coding chromosome cell division cycle 45 [Source:MGI Symbol;Acc:MGI:1338073] ENSMUSG00000000028.16 ENSMUST00000000028 Cdc45 12544
ENSMUSG00000000031.16 2460 ENSMUSG00000000031 2625 2625 7 142129262 142131886 2625 - H19 lncRNA chromosome H19, imprinted maternally expressed transcript [Source:MGI Symbol;Acc:MGI:95891] ENSMUSG00000000031.18 ENSMUST00000247195 H19 14955
ENSMUSG00000000037.17 6397 ENSMUSG00000000037 139708 175689 X 159865521 160041209 175689 + Scml2 protein_coding chromosome Scm polycomb group protein like 2 [Source:MGI Symbol;Acc:MGI:1340042] ENSMUSG00000000037.18 ENSMUST00000077375 Scml2 107815
ENSMUSG00000000049.11 1594 ENSMUSG00000000049 52552 71043 11 108234180 108305222 71043 + Apoh protein_coding chromosome apolipoprotein H [Source:MGI Symbol;Acc:MGI:88058] ENSMUSG00000000049.12 ENSMUST00000000049 Apoh 11818

So, this is the time to write our table on disk for subsequent uses.

1
2
3
4
write.table(RNA_counts_ann, 
            file = "./../../../data/input/data//RNA_counts_ann.tsv", 
            row.names = F, 
            sep = "\t")

Afterword

In this article we’ve found out how to obtain and prepare RNA seq data. In the next post we will create table of conditions and perform analysis of differential gene expression.

References:

1. Best SA, Gubser PM, Sethumadhavan S, Kersbergen A, Negrón Abril YL, Goldford J, Sellers K, Abeysekera W, Garnham AL, McDonald JA, Weeden CE, Anderson D, Pirman D, Roddy TP, Creek DJ, Kallies A, Kingsbury G, Sutherland KD. Glutaminase inhibition impairs CD8 T cell activation in STK11-/Lkb1-deficient lung cancer. Cell Metab. 2022 Jun 7;34(6):874–887.e6.

0%