Z Tutorial

# RNA-Seq analysis pipeline

### Introduction

• Based on the capabilities of high-throughput sequencing methods, RNA sequencing (RNA-Seq) provides information on the transcriptome of a sample. The data generated by RNA-Seq allows scientists to quantitatively analyze gene expression.
• The data generated by RNA-Seq can also help the scientists to uncover novel transcripts, identify alternatively spliced genes, and detect allele-specific expression.

### QC data

• Before analysing RNA-seq data, you should perform quality control checks to ensure the quality of the raw data.
• FastQC (Andrews, 2010) can generate a QC report which can spot problems which originate either in the sequencer or in the starting library material.
• According to the QC report, cutadapt (Martin, 2011) is recommended to trim low-quality data.

### Alignment

• After getting high-quality sequencing data, to determine where on the reference genome your reads originated from, you can align our reads to the reference genome using STAR (Dobin et al., 2013) (Spliced Transcripts Alignment to a Reference).
• Creating STAR genome index is necessary before reads mapping, you can download reference genome in fasta format and annotations files from ensembl. To learn more about how to create STAR genome index, refer to the STAR manual.
• Now you can perform alignments with STAR to the reference genome. STAR will generate several output files, such as alignments (SAM/BAM), mapping summary statistics etc. Mapping is controlled by a variety of input parameters (options) which are described in the STAR manual.

### Expression quantification

• The most common application of RNA-seq is to estimate gene and transcript expression.
• The simplest and most commonly used method to quantify gene expression is by analysing the expression count matrix from BAM/SAM files generated by STAR using programs such as featureCounts (Liao et al., 2014.). The input parameters should also be taken into account.

### Differential gene expression

• The differential expression analysis is used to identify differences in the transcriptome (gene expression) across a cohort of samples.
• There are many tools available to identify differentially expressed genes (DEGs) between multiple biological conditions (e.g. wt vs ko). DEseq2 (Love et al., 2014. ) is a popular Bioconductor package. DESeq2 builds negative binomial generalized linear models and contains methods to test for differential expression. It will generate several summary statistics such as baseMean, log2FoldChange, p-value between conditions. The exported result is recommended for our RAD web tools usage.

# ChIP/ATAC-seq analysis pipeline

### ChIP-seq introduction

• ChIP-seq is a method used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation with DNA sequencing to infer the possible binding sites of DNA-associated proteins.
• The Assay for Transposase-Accessible Chromatin followed by sequencing (ATAC-seq) experiment provides genome-wide profiles of chromatin accessibility.

### QC data

• Please refer to RNA-seq QC data.

### Alignment

• To determine where on the reference genome your reads originated from, you can align your reads to the reference genome using STAR (Spliced Transcripts Alignment to a Reference). When you use STAR to align DNA-seq data, it is recommended to set alignIntroMax as 1 and alignEndsType as EndToEnd to disable spliced alignments.
• Then you can do some data cleaning to improve the libraries quality. First, duplicated reads are removed using samtools (Li et al., 2009. ). Second, the processed bam files are sorted and indexed for downstream analysis.

### Peak calling

• Peak calling is a computational method used to identify areas in the genome that have been enriched with aligned reads as a consequence of performing a ChIP-seq or ATAC-seq experiment.
• To determine which DNA fragments are enriched, Model-based Analysis of ChIP-seq (MACS2) (Zhang et al., 2008.) is a commonly used tool to identify enriched sites. It captures the influence of genome complexity to evaluate the significance of enriched regions.
• Before you run MACS2, you need to look at parameters as there are several of them affecting peak calling as well as reporting the results. It is important to understand them to be able to modify the command to the needs of your data set.
• The output of MACS2 is recommended for RAD web tools use.

# WGBS analysis pipeline

### WBGS introduction

• Whole-genome bisulfite sequencing (WGBS) is the most commonly used approach to unbiasedly profile genome-wide single-cytosine methylation level.

### QC data

• Refer to RNA-seq QC data step.

### Alignment

• To determine where on the reference genome your reads originated from, you will align your reads to the reference genome using BSMAP (Xi and Li, 2009). The input parameters should also be taken into account.
• WGBS data is so large that it is recommended to split bam files into each chromosome to calculate methylation ratio using python script from BSMAP.

### Call DMR

• Differentially methylated regions (DMRs) are genomic regions with different DNA methylation pattern across different biological samples. They are hypothesized to have a function in transcriptional regulation of gene expression.
• DMRcaller (Catoni et al., 2018.) uses Bisulfite sequencing data in two conditions and identifies differentially methylated regions between the conditions in CG and non-CG context. If you use BSMAP to align reads, you should convert it into bismark format. Finally, the output is differentially methylated regions in bedgraph format, which is recommended for RAD web tools.