Next Generation Sequencing RNAseq Data Analysis (Part 1)

2023-02-13 1714 words 9 minutes

Contents

Preface

Here I am going to write a series of posts dedicated to high throughput RNA sequencing data analysis. As an essential technology, RNA sequencing is applied to many fields of biology and life science. Knowing how to process RNA seq data is important for researchers in these fields. This series of articles will explain the steps involved in the process.

In our first article, we will obtain real-world Next Generation Sequencing (NGS) data, learn about the data format required for further analysis, and annotate the data accordingly.

Where to get a data

For this project we will get data from NCBI Gene Expression Omnibus Database (GEO) which is a public repository of functional genomics data. This database contains thousands of datasets related to analysis of gene expression. We will download one of them (Series GSE193895). Results of this research were published here¹.

In series of my articles I’m not going to replicate the original research but describe the ways to analyse NGS RNA seq data.

Downloading data

We need to download the archive with raw counts for each sample GSE193895_RAW.tar. It contains 35 separate files named after sample id followed by treatment conditions. Extract them tar -xf GSE193895_RAW.tar. The files are also archived so use gzip -dv * to unarchive all of them.

Data wrangling

Load all needed packages:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


# run explicit error when packages contain functions with same names
library(conflicted)

# for easier data manipulations
library(tidyverse)

# for pretty table printing
library(knitr)

# for splitting strings
library(stringr)

# for annotation
library(AnnotationHub)

library(ensembldb)

Get list of all files with raw counts for each sample:

1

count_files <- list.files("./../../../data/input/GSE193895_RAW/", full.names = T)

Each file contains three columns: GeneID – mouse Ensembl gene Id, Length – transcript length and column, named after sample Id and filled with counts for each gene

1

read_tsv(count_files[1]) %>% head %>% kable

## Rows: 53815 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): GeneID
## dbl (2): Length, norm_1_A_F_run2
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

GeneID	Length	norm_1_A_F_run2
ENSMUSG00000000001.4	3262	3923
ENSMUSG00000000003.15	902	0
ENSMUSG00000000028.15	3506	62
ENSMUSG00000000031.16	2460	18
ENSMUSG00000000037.17	6397	32
ENSMUSG00000000049.11	1594	3

So, let’s read all the files and create full table of counts:

1
2
3
4
5
6


RNA_counts <- lapply(count_files, function(x) read_tsv(x)) %>% 
              purrr::reduce(dplyr::full_join, by = 'GeneID') %>% 
              rowwise() %>%
              mutate(Glength = max(c_across(starts_with("Length")))) %>% 
              dplyr::select(!starts_with("Length")) %>% 
              ungroup()

View the first few rows of the table:

1

RNA_counts %>% head %>% kable

GeneID	norm_1_A_F_run2	norm_2_A_F_run3	norm_3_A_F_run1	norm_4_A_M_run4	norm_5_A_M_run5	norm_6_A_M_run1	KK_1_A_F_run3	KK_1_B_F_run3	KK_2_A_F_run1	KK_2_B_F_run1	KK_3_A_M_run5	KK_3_B_M_run5	KK_4_A_M_run2	KK_4_B_M_run2	KK_5_A_M_run4	KK_5_B_M_run4	KL_1_A_F_run2	KL_1_B_F_run2	KL_2_A_F_run5	KL_2_B_F_run5	KL_3_A_M_run1	KL_3_B_M_run1	KL_4_A_M_run3	KL_4_B_M_run3	KL_5_A_M_run4	KL_5_B_M_run4	KKL_1_A_F_run4	KKL_1_B_F_run4	KKL_2_A_F_run2	KKL_2_B_F_run2	KKL_3_A_M_run3	KKL_3_B_M_run3	KKL_4_A_M_run5	KKL_5_A_M_run1	KKL_5_B_M_run1	Glength
ENSMUSG00000000001.4	3923	4526	1551	2506	4117	3244	2894	2393	5744	4507	8217	5462	2036	2400	6116	4723	3164	1719	6633	9681	8198	6701	4246	4065	6813	6148	4381	3396	3476	4187	4163	10262	6042	6049	5417	3262
ENSMUSG00000000003.15	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	902
ENSMUSG00000000028.15	62	83	32	60	86	78	147	160	268	177	344	233	95	111	228	180	175	177	204	610	299	194	200	171	332	247	180	136	160	225	157	522	274	172	163	3506
ENSMUSG00000000031.16	18	19	12	29	25	24	45	59	76	38	150	78	21	52	180	60	123	15	57	65	247	132	92	38	57	29	62	29	85	57	36	133	256	52	53	2460
ENSMUSG00000000037.17	32	30	14	23	27	20	12	8	14	18	19	16	4	7	17	15	7	8	6	5	29	7	9	8	9	10	15	17	11	28	21	60	23	7	11	6397
ENSMUSG00000000049.11	3	3	0	7	6	3	6	2	1	2	6	0	2	1	1	2	2	0	3	0	3	0	0	1	2	2	2	4	1	0	0	8	1	3	2	1594

Fortunately, this data table does not contain any missing values. All samples were processed at the same time and we do not have to deal with them.

1

length(RNA_counts[is.na(RNA_counts)])

## [1] 0

We need this table for all kinds of subsequent analyzes. So, we should write it on the disk:

1
2
3
4


write.table(RNA_counts, 
            file = "./../../../data/input/data//RNA_counts.tsv", 
            row.names = F, 
            sep = "\t")

Add annotation to our counts DT

The GeneID column of our data contains version of each gene. We need gene index without version. So, let’s create new index column:

1
2
3
4


RNA_counts <- RNA_counts %>% 
  mutate(gene_id = sapply(stringr::str_split(RNA_counts$GeneID, 
                                             pattern = "\\.", n = 2),  
                          function(.)  .[1]))

Now we need annotations. AnnotationHub provides interface for all most useful data annotation systems.

1

hub <- AnnotationHub()

## snapshotDate(): 2022-10-31

We could retrieve information about species and data providers:

1

unique(hub$species) %>% head %>% kable

x
Homo sapiens
Vicugna pacos
Dasypus novemcinctus
Otolemur garnettii
Papio hamadryas
Papio anubis

1

hub$species[grep("Mus", hub$species)] %>% head

## [1] "Mus musculus" "Mus musculus" "Mus musculus" "Mus musculus" "Mus musculus"
## [6] "Mus musculus"

1
2
3


u <- unique(hub$dataprovider)
o <- order(unique(hub$dataprovider))
u[o] %>% head %>% kable

x
AmoebaDB
BioMart
Broad Institute
BroadInstitute
ChEA
CHM13

Obtain list of tables for Mus musculus from Ensembl:

1

query(hub, c("Mus musculus", "Ensembl"))

## AnnotationHub with 2485 records
## # snapshotDate(): 2022-10-31
## # $dataprovider: Ensembl, BioMart, UCSC, FANTOM5,DLRP,IUPHAR,HPRD,STRING,SWI...
## # $species: Mus musculus, mus musculus wsbeij, mus musculus pwkphj, mus musc...
## # $rdataclass: TwoBitFile, GRanges, EnsDb, data.frame, SQLiteFile, list, OrgDb
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH6066"]]' 
## 
##              title                                              
##   AH6066   | Ensembl Genes                                      
##   AH6093   | Ensembl Genes                                      
##   AH6165   | Ensembl Genes                                      
##   AH6204   | Ensembl Genes                                      
##   AH7567   | Mus_musculus.GRCm38.70.gtf                         
##   ...        ...                                                
##   AH106502 | Mus_musculus_wsbeij.WSB_EiJ_v1.dna_sm.toplevel.2bit
##   AH106503 | Mus_musculus_wsbeij.WSB_EiJ_v1.ncrna.2bit          
##   AH107060 | org.Mm.eg.db.sqlite                                
##   AH107223 | LRBaseDb for Mus musculus (Mouse, v004)            
##   AH109367 | Ensembl 108 EnsDb for Mus musculus

We’ll use the latest one – AH109367 | Ensembl 108 EnsDb for Mus musculus:

1

ensdb <- hub[["AH109367"]]

## loading from cache

Obtain metadata from EnsemblDB for each gene:

1
2


gns <- genes(ensdb)
gns %>% head %>% kable

seqnames	start	end	width	strand	gene_id	gene_name	gene_biotype	seq_coord_system	description	gene_id_version	canonical_transcript	symbol	entrezid
1	3143476	3144545	1070	+	ENSMUSG00000102693	4933401J01Rik	TEC	chromosome	RIKEN cDNA 4933401J01 gene [Source:MGI Symbol;Acc:MGI:1918292]	ENSMUSG00000102693.2	ENSMUST00000193812	4933401J01Rik	NA
1	3172239	3172348	110	+	ENSMUSG00000064842	Gm26206	snRNA	chromosome	predicted gene, 26206 [Source:MGI Symbol;Acc:MGI:5455983]	ENSMUSG00000064842.3	ENSMUST00000082908	Gm26206	115487594
1	3276124	3741721	465598	-	ENSMUSG00000051951	Xkr4	protein_coding	chromosome	X-linked Kx blood group related 4 [Source:MGI Symbol;Acc:MGI:3528744]	ENSMUSG00000051951.6	ENSMUST00000070533	Xkr4	497097
1	3322980	3323459	480	+	ENSMUSG00000102851	Gm18956	processed_pseudogene	chromosome	predicted gene, 18956 [Source:MGI Symbol;Acc:MGI:5011141]	ENSMUSG00000102851.2	ENSMUST00000192857	Gm18956	NA
1	3435954	3438772	2819	-	ENSMUSG00000103377	Gm37180	TEC	chromosome	predicted gene, 37180 [Source:MGI Symbol;Acc:MGI:5610408]	ENSMUSG00000103377.2	ENSMUST00000195335	Gm37180	NA
1	3445779	3448011	2233	-	ENSMUSG00000104017	Gm37363	TEC	chromosome	predicted gene, 37363 [Source:MGI Symbol;Acc:MGI:5610591]	ENSMUSG00000104017.2	ENSMUST00000192336	Gm37363	NA

Add to our data table information about gene and transcript length from Encembl.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17


# get transcripts
# transcriptsBy includes the UTR and exon regions
txs <- transcriptsBy(ensdb)

# max transcript widths (RNA + UTRs)
# width() is a getter for IRanges objects (<IRanges>)
txwid <- sapply(width(txs), max)

# full DNA gene widths, with names
gnwid <- setNames(width(gns), names(gns))

# txwid and gnwid have different orders
# create table of "TxLen", "GeneLen", "ens_id"
ens.annot <- data.frame(TxLen = txwid, GeneLen = gnwid[names(txwid)]) %>% 
  rownames_to_column(var = "gene_id") 

ens.annot %>% head %>% kable

gene_id	TxLen	GeneLen
ENSMUSG00000000001	38867	38867
ENSMUSG00000000003	15723	15723
ENSMUSG00000000028	31526	31541
ENSMUSG00000000031	2625	2625
ENSMUSG00000000037	139708	175689
ENSMUSG00000000049	52552	71043

Merge tables:

1

RNA_counts_ann <- left_join(RNA_counts, ens.annot,  by = "gene_id")

Because TxLen contains not only cDNA sequence it is not useful for us now.

1
2
3


RNA_counts_ann %>% dplyr::select(gene_id, Glength, TxLen, GeneLen) %>% 
  head %>% 
  kable

gene_id	Glength	TxLen	GeneLen
ENSMUSG00000000001	3262	38867	38867
ENSMUSG00000000003	902	15723	15723
ENSMUSG00000000028	3506	31526	31541
ENSMUSG00000000031	2460	2625	2625
ENSMUSG00000000037	6397	139708	175689
ENSMUSG00000000049	1594	52552	71043

Add rest of metadata to our table:

1
2


RNA_counts_ann <- left_join(RNA_counts_ann, as.data.frame(gns), by = "gene_id") %>% 
                  unnest(entrezid)

Let’s view annotations of our table:

1
2
3
4


RNA_counts_ann %>% 
  dplyr::select(1,37:53) %>% 
  head %>% 
  kable

GeneID	Glength	gene_id	TxLen	GeneLen	seqnames	start	end	width	strand	gene_name	gene_biotype	seq_coord_system	description	gene_id_version	canonical_transcript	symbol	entrezid
ENSMUSG00000000001.4	3262	ENSMUSG00000000001	38867	38867	3	108014596	108053462	38867	-	Gnai3	protein_coding	chromosome	guanine nucleotide binding protein (G protein), alpha inhibiting 3 [Source:MGI Symbol;Acc:MGI:95773]	ENSMUSG00000000001.5	ENSMUST00000000001	Gnai3	14679
ENSMUSG00000000003.15	902	ENSMUSG00000000003	15723	15723	X	76881507	76897229	15723	-	Pbsn	protein_coding	chromosome	probasin [Source:MGI Symbol;Acc:MGI:1860484]	ENSMUSG00000000003.16	ENSMUST00000000003	Pbsn	54192
ENSMUSG00000000028.15	3506	ENSMUSG00000000028	31526	31541	16	18599197	18630737	31541	-	Cdc45	protein_coding	chromosome	cell division cycle 45 [Source:MGI Symbol;Acc:MGI:1338073]	ENSMUSG00000000028.16	ENSMUST00000000028	Cdc45	12544
ENSMUSG00000000031.16	2460	ENSMUSG00000000031	2625	2625	7	142129262	142131886	2625	-	H19	lncRNA	chromosome	H19, imprinted maternally expressed transcript [Source:MGI Symbol;Acc:MGI:95891]	ENSMUSG00000000031.18	ENSMUST00000247195	H19	14955
ENSMUSG00000000037.17	6397	ENSMUSG00000000037	139708	175689	X	159865521	160041209	175689	+	Scml2	protein_coding	chromosome	Scm polycomb group protein like 2 [Source:MGI Symbol;Acc:MGI:1340042]	ENSMUSG00000000037.18	ENSMUST00000077375	Scml2	107815
ENSMUSG00000000049.11	1594	ENSMUSG00000000049	52552	71043	11	108234180	108305222	71043	+	Apoh	protein_coding	chromosome	apolipoprotein H [Source:MGI Symbol;Acc:MGI:88058]	ENSMUSG00000000049.12	ENSMUST00000000049	Apoh	11818

So, this is the time to write our table on disk for subsequent uses.

1
2
3
4


write.table(RNA_counts_ann, 
            file = "./../../../data/input/data//RNA_counts_ann.tsv", 
            row.names = F, 
            sep = "\t")

Afterword

In this article we’ve found out how to obtain and prepare RNA seq data. In the next post we will create table of conditions and perform analysis of differential gene expression.

References:

1. Best SA, Gubser PM, Sethumadhavan S, Kersbergen A, Negrón Abril YL, Goldford J, Sellers K, Abeysekera W, Garnham AL, McDonald JA, Weeden CE, Anderson D, Pirman D, Roddy TP, Creek DJ, Kallies A, Kingsbury G, Sutherland KD. Glutaminase inhibition impairs CD8 T cell activation in STK11-/Lkb1-deficient lung cancer. Cell Metab. 2022 Jun 7;34(6):874–887.e6.