PAVED - A software suite for the analysis of epigenome-derived next-generation sequencing data
This user manual is designed to help users run the software "PAVED" introduced in the manuscript J.S. Shaik, B.A. Anderson and S.M. Beverley, "PAVED-A software suite for the analysis of epigenome-derived next-generation sequencing data", Under Preparation. Please cite this manuscript if you use this software.

Peak And VallEy Detector (PAVED) is a software suite written in Java that implements a computational pipeline for detecting peaks or valleys in epigenomic next-generation sequencing (NGS) datasets such as ChIP-, FAIRE-, and MNase-seq. PAVED takes as input short-read alignments to the reference genome in the standard BAM format and generates outputs in wiggle and bed formats that are compatible with genome browsers such as IGV and software suites such as bedtools.

PAVED employs a fragment depth ratio (FragDR) based approach to extract peaks and valleys from NGS data. FragDR facilitates normalization of the treatment against the control sample(s) and accounts for coverage-based artifacts such as those due to aneuploidies, copy number variations, noise from repeats and multi-gene families, and those introduced by experimental design, methodology and sequencing. PAVED is platform independent, easily portable and runs on any computer platform that supports Java version 1.5 or higher.
Prerequisites

PeakAndValleyDetector is a software suite written in Java and therefore requires a suitable Java environment to run.
1) Check the system requirements.
2) See the instructions on how to run the software.

List of Available Utilities

The list of available utilities and pipelines, with instructions. A generic pipeline to extract valleys from epigenomic data and a generic pipeline to extract peaks from epigenomic data are provided, along with pipelines to extract valleys from MNase-seq and peaks from FAIRE-seq using the sample datasets described in the manuscript.

The first step in most NGS analyses is to align the raw reads in fastq format to the reference genome using an aligner of choice such as BWA, Novoalign or Bowtie. The aligned reads in sequence alignment map (SAM) format can be converted to binary alignment map (BAM) format and sorted by genomic position using tools such as samtools. PAVED then takes these sorted alignments in BAM format as input, compares the treatment against the control(s) and detects peaks or valleys, depending on the type of epigenomic NGS data. Broadly, the steps involved are as follows:

1) Construct fragments using the raw sorted alignments. The fragments are constructed from the raw alignments based on user-specified minimum and maximum insert sizes. Alignments whose insert size falls outside the specified range are eliminated. This has two advantages: i) because the reads are aligned independently, the fact that a read and its mate both align within the specified insert size reaffirms their localization, and ii) because coverage is computed from whole fragments, the part of a fragment that was not sequenced is scored the same as the part that was.

2) Find the fragment depth/read depth. Using the fragments constructed in step 1, the fragment depth is determined.
However, if the coverage is to be determined from individual reads, as in single-end sequencing, the read depth utility in PAVED may be used to find the coverage instead.

3) Perform read count normalization. The control and experimental datasets might have different numbers of reads and different fragment lengths. The coverage values must therefore be normalized to account for this bias.

4) Find the fragment depth ratio. The FragDR is found by normalizing the coverage values in the experimental data by the coverage values in the control dataset(s).

5) Find peaks or valleys. PAVED uses a sliding window of user-defined size to find peaks or valleys that exceed a specified threshold. The detected peaks or valleys are reported in .bed format.

6) Find annotations. PAVED finds annotations for the regions of interest using a .gff file. This facilitates categorization of these regions by known parameters such as transcription type, gene class or epigenetic state.
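The arithmetic behind steps 1 through 5 can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not PAVED's actual implementation. The class name FragDrSketch, its method names, the mean-depth normalization, the pseudocount and the window-merging rule are all hypothetical choices made for the example. Fragments are given directly as {start, end} spans rather than read from a BAM file, and only peak detection is shown.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; names and formulas are assumptions, not PAVED's code. */
class FragDrSketch {

    /** Step 1: keep read-pair spans {start, end} (end exclusive) whose length is in [minInsert, maxInsert]. */
    static int[][] buildFragments(int[][] pairSpans, int minInsert, int maxInsert) {
        List<int[]> kept = new ArrayList<int[]>();
        for (int[] s : pairSpans) {
            int len = s[1] - s[0];
            if (len >= minInsert && len <= maxInsert) kept.add(s);
        }
        return kept.toArray(new int[0][]);
    }

    /** Step 2: per-base fragment depth; fragments must lie within [0, chromLength]. */
    static double[] fragmentDepth(int[][] fragments, int chromLength) {
        double[] diff = new double[chromLength + 1];
        for (int[] f : fragments) {
            diff[f[0]] += 1;   // depth rises at the fragment start
            diff[f[1]] -= 1;   // and falls just past the fragment end
        }
        double[] depth = new double[chromLength];
        double running = 0;
        for (int i = 0; i < chromLength; i++) {
            running += diff[i];
            depth[i] = running;
        }
        return depth;
    }

    /** Step 3 (assumed scheme): divide by the mean depth so every sample has mean 1. */
    static double[] normalize(double[] depth) {
        double sum = 0;
        for (double d : depth) sum += d;
        double mean = sum / depth.length;
        double[] out = new double[depth.length];
        for (int i = 0; i < depth.length; i++) out[i] = mean == 0 ? 0 : depth[i] / mean;
        return out;
    }

    /** Step 4: treatment/control ratio; the pseudocount is an assumption to avoid division by zero. */
    static double[] fragDr(double[] treatment, double[] control, double pseudocount) {
        double[] ratio = new double[treatment.length];
        for (int i = 0; i < treatment.length; i++) {
            ratio[i] = (treatment[i] + pseudocount) / (control[i] + pseudocount);
        }
        return ratio;
    }

    /** Step 5: report merged {start, end} regions covered by windows whose mean ratio exceeds the threshold. */
    static List<int[]> findPeaks(double[] ratio, int window, double threshold) {
        List<int[]> peaks = new ArrayList<int[]>();
        if (window <= 0 || ratio.length < window) return peaks;
        double sum = 0;
        for (int i = 0; i < window; i++) sum += ratio[i];
        for (int start = 0; ; start++) {
            if (sum / window > threshold) {
                int end = start + window;
                int[] last = peaks.isEmpty() ? null : peaks.get(peaks.size() - 1);
                if (last != null && start <= last[1]) last[1] = end;   // overlapping or adjacent: extend
                else peaks.add(new int[] {start, end});
            }
            if (start + window >= ratio.length) break;
            sum += ratio[start + window] - ratio[start];   // slide the window by one base
        }
        return peaks;
    }

    public static void main(String[] args) {
        int chromLength = 12;   // toy "chromosome"
        int[][] treatmentSpans = {{0, 4}, {4, 8}, {4, 8}, {5, 9}, {8, 12}, {1, 2}};
        int[][] controlSpans = {{0, 4}, {4, 8}, {8, 12}};
        double[] t = normalize(fragmentDepth(buildFragments(treatmentSpans, 2, 10), chromLength));
        double[] c = normalize(fragmentDepth(buildFragments(controlSpans, 2, 10), chromLength));
        for (int[] p : findPeaks(fragDr(t, c, 0.5), 3, 1.2)) {
            System.out.println("chr1\t" + p[0] + "\t" + p[1]);   // BED-style: chrom, start, end
        }
    }
}
```

Valleys could be found the same way by reporting windows whose mean ratio falls below a threshold. In a real run, the insert-size limits, window size and threshold are the user-specified values described in the steps above.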