Plant Genome

ENPG Tutorial 2018

This ENPG tutorial was presented by Dr. Chen Chen in June 2018

1 Usage of ENPG
1.1 Search and visualize ChIP-seq data in ENPG

For areas restricted to access YouTube, please watch the video with following link:

1.2 Compare ChIP-seq signals across samples

Visualizing and comparing the changes in ChIP signals among different treatments and genetic backgrounds are commonly performed during ChIP-seq analyses. As the data provided in the ENPG were normalized to 1x genome content (see the Data processing section), the signal intensity of each dataset may affect by sequencing depth and efficiency of immunoprecipitation (IP). Currently, the intensity of ChIP-seq signals is presented as the height of peaks and the default maximal height for each track is set to 30 (Figure 1).

Figure1 Height setting of each track

The vertical rule in each track represents the height of the signal. The maximal value is set as 30 for each track, indicated with red arrow.

Ideally, if the IP efficiency and sequencing depth are same among different data tracks, the height of each peak represents the real intensity of the ChIP signal and can be directly compared. However, in real case, it is even hard to keep the same IP efficiency and sequencing depth in individual experiment. Thus, to avoid the potential difference among samples, it is recommended to selecting the datasets which are produced from same study (based on the sra_study ID in the track selector). And then, the datasets with a similar number of uniquely mapped reads (provided in the track selector) likely had a similar level of sequencing depth. With the two conditions above, the signals of each track may be directly compared.

However,in most cases, the interested datasets were generated from different studies and the sequencing depth may vary a lot. Therefore, the user can proportionally adjust the maximal height of each track based on the sequencing depth of each dataset. The value setting can be found in track configuration panel.

Step1: click the track configuration

Step2: click “Edit config”

Step3: find the “max_score” and change the value

Step4: click “Apply” to accept the change. Do not directly type “Enter” after changing the value.

In addition to sequencing depth,the user also can adjust the track setting based on the ChIP signals at control locus. For example, H3K27me3, a histone modification related to gene repression, should have low ChIP signal at the loci of housekeeping genes. And, any detected signals at housekeeping genes can be considered as background signals.

Step1: Some weak signals of H3K27me3 were detected at FLC locus in clf28 swn7 background. As CLF and SWN are only two methyltransferases to catalyze H3K27me3, the H3K27me3 ChIP signal should diminished completely.

Step2: To distinguish real and background signals, the H3K27me3 signals at ACT7 locus were examined and high level of signals was found at ACT7 locus.

Step3: The signals at ACT7 locus are considered as background and adjust the track setting to minimize the peak height. The blue arrow indicate the maximal height is changed to 200

Step4: Back to FLC locus and the originally weak H3K27me3 signals were not visible.

Similar to actin, any locus with unchanged signals can be used for signal adjusting.

Finally, I would like to point out that quantitatively comparing ChIP-seq signals among samples is still a problem in ChIP-seq experiment. Any sequencing depth related normalization is assuming same efficiency of IP procedure. Sometimes, ChIP-seq experiment was used for discovering binding sites, it is hard to define good control loci. One method published in 2014 providing a method to normalize signals between samples by adding alien reference chromatins (Bonhoure N, et al. Quantifying ChIP-seq data: a spiking method providing an internal reference for sample-to-sample normalization. Genome research 24, 1157-1168, 2014). I think it may be a choice to determine the efficiency of the IP step .

1.3 Downloading processed Bigwig file for each dataset

Step1: Select the “Assayed Protein “mode and corresponding species

Step2: Search the database with your interested proteins or histone modifications

Step3: Find the dataset you are interested and click the arrow button to download the bigwig file

1.4 Evaluation of ChIP-seq data quality

ENPG reports sequencing depth, global enrichment, peak saturation and median enrichment of peaks for ChIP-seq datasets in histone modifications and nonhistone proteins. In detail, sequencing depth is reported as the number of uniquely mapped reads; Global enrichment is indicated by fraction of reads in peaks (FRiP), which calculates the proportion of reads contributed in identified peaks; Peak saturation is presented using fraction of total called peaks as a function of subset of uniquely mapped reads used for peak calling, which indicates whether the sequencing depth is deep enough to uncover all the potential peaks in genome-wide; Median enrichment measures the enrichment level of newly identified peaks with increasing reads.

In ENPG, any dataset with less than 1 million uniquely mapped reads will not report data quality. To access data quality reports of each dataset, users can follow the steps listed as below:

Step 1: Select the interested protein or histone modification in “Assayed Protein” mode and click “Search”

Step2: Click the “Detail” button of individual dataset

Step3: Sequencing depth and quality figures are included in “Unique Reads” and “Quality” section, respectively.

Step 4: By clicking each button in “Quality”, the quality figure will be displayed

Step5: Quality figures also can be accessed from Jbrowse

At this time, ENPG does not make any judgment on the data quality of each dataset. By checking quality information, the user can make their own judgment during data selection. It needs to bear in mind that the value of FRiP may affect by the nature of examined proteins and a relatively low level of FRiP will be expected if the protein naturally bound to limited regions. The method of generating those figures can be checked in section 2.1.3

2 Data processing

This section provides the detailed steps how the individual ENPG dataset was processed from raw data. All those information will help users to evaluate ENPG’s data and also provide beginners one way to deal with high-throughput sequencing data

2.1 ChIP-seq
2.1.1 Running platform

The platform and used software are listed as follow:

Ubuntu 14.04 LTS

sratoolkit (v2.9.0)
FastQC (v0.10.1)
Bowtie2 (v2.3.4)
samtools (v1.7)

2.1.2 Obtain data from public resources and map reads to the genome

All the raw data were obtained from public resources centers, Gene Expression Omnibus (GEO) repository and Sequence Read Archive (SRA), by using sratoolkit.

$ ./fastq-dump -I SRRxxxxx (for single-end sequencing) $ ./fastq-dump -I --split-files SRRxxxxx (for pair-end sequencing)

Reads qualities were analyzed with FastQC (v0.10.1) and the low quality bases (less than 30) were ignored in the following mapping step.

All the raw reads were mapped to corresponding genomes with Bowtie2 (v2.3.4) and compressed to binary files using samtools (v1.7).

$ bowtie2 -5 xx -3 xx -x /directory/tair10 -U file.fastq -S file.sam

Here, xx representsthe the position of the base will be trimmed from the reads before mapping. -5 and -3 indicates the direction of trimming (5’ end and 3’end, respectively); -x and tair10 indicate the mapping index. The index has to be built before mapping. Please see detail in bowtie2 manual.

Converting sam file to indexed bam file:

$ samtools view -bS file.sam > file.bam $ samtools sort -T file -o file_sort.bam file.bam $ samtools index file_sort.bam

Finally, the mapped reads were converted to bigwig format for visualization by using bamcoverage (deeptools2) with the following settings: binsize 10, normalizing with 1x genome content (RPGC, Arabidopsis thaliana, 135,000,000; Glycine max, 978,000,000; Oryza sativa, 375,000,000; Zea mays, 2,135,000,000), ignorDuplicates, samFlagExclude 1796 and 1924 for single and pair-end sequencing, respectively. The following command will convert single-end Arabidopsis thaliana sorted bam file to bigwig file.

$ bamCoverage -b flie_sort.bam -o file_sort.bw -bs 10 --effectiveGenomeSize 135000000 --normalizeUsing RPGC --ignoreDuplicates -e 100 --samFlagExclude 1796
2.1.3 Generating peak plateau, FRiP, and median enrichment

The following shell and three R scripts are used for generating figures of peak plateau, FRiP, and median enrichment from mapped binary file (bam file). A list of file name needs to be provided before running the script. All the files including scripts, bam files, and listshould be in the same directory.


Before running the script, several tools need to be installed and added to your executable path.

ggplot2 package