update docs

macs3-project · Nov 2, 2023 · af8362c · af8362c
1 parent 4a7f2ba
commit af8362c
Show file tree

Hide file tree

Showing 5 changed files with 163 additions and 35 deletions.
diff --git a/docs/bdgcmp.md b/docs/bdgcmp.md
@@ -1,24 +1,24 @@
-# Bdgcmp
+docs/bdgcmp.md# bdgcmp
 
 ## Overview
-The `bdgcmp` command is part of the MACS3 suite of tools and is used to compare bedGraph files. It is particularly useful for comparing two bedGraph files. The typical usage is to calculate pvalue or qvalue using poisson model for each basepair given a treatment pileup signal file in bedGraph format and a control lambda bedGraph file. But we provides more functions rather than pvalue and qvalue, including subtract, division (FE) and more. 
+The `bdgcmp` command is part of the MACS3 suite of tools and is used to compare two bedGraph files in each basepair that are commonly covered by the two files. The typical use case is to calculate pvalue or qvalue using Poisson model for each basepair given a treatment pileup signal file in bedGraph format and a control lambda bedGraph file. But we provides more functions rather than pvalue and qvalue, including subtract, division (FE) and more.
 
 ## Detailed Description
 
-The `bdgcmp` command takes two input bedGraph files (e.g. a control and a treatment bedgraph) and produces an output bedGraph of comparison scores for each genomic position involved in the bedGraph files. The `bdgcmp` command normally is used to deduct noise by comparing two signal tracks in bedGraph. Note: All regions on the same chromosome in the bedGraph file should be continuous so we recommand you use the bedGraph files from MACS3. We provide the following function to 'compare two tracks':
+The `bdgcmp` command takes two input bedGraph files (e.g. a control and a treatment bedgraph) and produces an output bedGraph of comparison scores for each genomic position involved in the bedGraph files. The `bdgcmp` command normally is used to deduct noise from a signal track in bedGraph (e.g. ChIP treatment) over another signal track in bedGraph (e.g. control). Note: All regions on the same chromosome in the bedGraph file should be continuous so we recommand you use the bedGraph files from MACS3. We provide the following function to 'compare two tracks':
 
-- ppois Poisson p-value (-log10(pvalue) form) using control as lambda and treatment as observation
+- ppois Poisson p-value (-log10(pvalue) form) using the second file (-c) as lambda and treatment (-t) as observation
 - qpois Q-value through a BH process for poisson pvalues
 - subtract Subtraction from treatment
 - FE linear scale fold enrichment, or the score from file A divided by the score from file B
 - logFE log10 fold enrichment(need to set pseudocount)
-- logLR log10 likelihood between ChIP-enriched model and open chromatin model(need to set pseudocount)
-- symmetric log10 likelihood between two ChIP-enrichment models using Poison distribution
+- logLR log10 likelihood between ChIP-enriched model and open chromatin model (need to set pseudocount)
+- symmetric log10 likelihood between two ChIP-enrichment models using Poison distribution, and this can be used to compare ChIP signals from two differen conditions (differential binding)
 - max Maximum value between the two tracks.
 
 ## Command Line Options
 
-The command line options for `bdgcmp` are defined in `/MACS3/Commands/bdgcmp_cmd.py` and `/bin/macs3` files. Here is a brief overview of these options:
+Here is a brief description of the command line options for `bdgcmp` :
 
 - `-t` or `--tfile`: Treatment bedGraph file, e.g. *_treat_pileup.bdg from MACS. REQUIRED
 - `-c` or `--cfile`: Control bedGraph file, e.g. *_control_lambda.bdg from MACS. REQUIRED

diff --git a/docs/bdgdiff.md b/docs/bdgdiff.md
@@ -1,14 +1,15 @@
-# Bdgdiff
+# bdgdiff
 
 ## Overview
-The `bdgdiff` command is part of the MACS3 suite of tools and is used to call differential peaks from four bedGraph tracks for scores. It is particularly useful in ChIP-Seq analysis for identifying differential peaks in the data.
+The `bdgdiff` command is part of the MACS3 suite of tools and is used to call differential peaks from four bedGraph tracks of scores, including treatment and control track for each condition. It is particularly useful in ChIP-Seq analysis for identifying differential peaks in the data.
 
 ## Detailed Description
 
-The `bdgdiff` command takes four input bedGraph files (two treatment and two control files) and produces three output files with differential peaks called. It uses an efficient algorithm to detect and call differential peaks, greatly improving the quality of your data for further analysis.
+The `bdgdiff` command takes four input bedGraph files (two treatment and two control files) and produces three output files with differential peaks called. Users should provide paired four bedgraph files: for each condition, a treatment pileup signal track in bedGraph format, and a control lambda track in bedGraph format. This differential calling can only handle one replicate per condition, so if you have multiple replicates, you should either combine the replicates when `callpeak`, or choose other tool that can consider within-group variation (such as DESeq2 or edgeR). The method we use to define the differential peaks is based on multiple likelihood tests, based on poisson distribution. Suppose that we have two conditions A and B, the unique binding sites in condition A over condition B should be *more likely* to be a binding event in treatment A over treatment B, and also *more likely* to be a real binding site in condition A while comparing treatment A over control A; the unique binding sites in condition B is defined in the same way; the common peaks of both condition should be *more likely* to be a real binding sites in condition A while comparing treatment A and control A, and in condition B while comparing treatment B over control B, and also the likelihood test while comparing treatment A and treatment B can't decide which condition is stronger.
 
-Differential peak detection based on paired four bedgraph files. Note: All regions on the same chromosome in the bedGraph file
-                        should be continuous so only bedGraph files from MACS3 are accpetable.
+The likelihood function we used is: 
+
+Note: All regions on the same chromosome in the bedGraph file should be continuous so only bedGraph files from MACS3 are acceptable.
 
 ## Command Line Options
 
@@ -39,3 +40,60 @@ macs3 bdgdiff -t1 treatment1.bedGraph -c1 control1.bedGraph -t2 treatment2.bedGr
 
 In this example, the program will call differential peaks from the two pairs of treatment and control bedGraph files and write the result to `output.bedGraph`. The depth of the first and second condition is set to 1.0, the minimum length of differential peaks is set to 500, the maximum gap between differential peaks is set to 1000, and the cutoff for LLR to call differential peaks is set to 1.0.
 
+## Step-by-step Instruction for calling differential peaks
+
+In this chatper, we will describe how to use MACS3 to identify differential regions by comparing pileup tracks of two conditions, starting from the alignment files. Two modules will be involved: `callpeak` and `bdgdiff` ( `predictd` is optional ). We will use human ChIP-seq data as example, and filenames of raw data are: cond1_ChIP.bam and cond1_Control.bam for condition 1; cond2_ChIP.bam and cond2_Control.bam for condition 2.
+
+### Step 1: Generate pileup tracks using callpeak module
+
+Purpose of this step is to use `callpeak` with -B option to generate bedGraph files for both conditions. There are several things to be remember: 1. `--SPMR` is not compatible with `bdgdiff`, so avoid using it; 2. prepare a pen to write down the number of non-redundant reads of both conditions -- you will find such information in runtime message  or xls output from `callpeak`; 3. keep using the same `--extsize` for both conditions ( you can get it from `predictd` module).
+
+To get a uniform extension size for running `callpeak`, run `predictd`:
+
+```
+ $ macs3 predictd -i cond1_ChIP.bam
+
+ $ macs3 predictd -i cond2_ChIP.bam
+```
+
+An easy solution is to use the average of two 'fragment size' predicted in `callpeak`, however any reasonable value will work. For example, you can use `200` for most ChIP-seq datasets for transcription factors, or ''147'' for most histone modification ChIP-seq. The only requirement is that you have to keep using the same extsize for the following commands:
+
+```
+ $ macs3 callpeak -B -t cond1_ChIP.bam -c cond1_Control.bam -n cond1 --nomodel --extsize 120
+ 
+ $ macs3 callpeak -B -t cond2_ChIP.bam -c cond2_Control.bam -n cond2 --nomodel --extsize 120
+```
+
+Pay attention to runtime message, or extract the "tags after filtering in treatment" and "tags after filtering in control" lines from xls to see the effective sequencing depths for both conditions. In our previous command lines, '--to-large' is not used, so the effective sequencing depth is the smaller number of treatment and control. For example:
+
+ $ egrep "tags after filtering in treatment|tags after filtering in control" cond1_peaks.xls
+ # tags after filtering in treatment: 19291269
+ # tags after filtering in control: 12914669
+
+ $ egrep "tags after filtering in treatment|tags after filtering in control" cond2_peaks.xls
+ # tags after filtering in treatment: 19962431
+ # tags after filtering in control: 14444786
+
+Then actual effective depths of condition 1 and 2 are: 12914669 and 14444786. Keep record of these two numbers and we will use them later. After successfully running '''callpeak''', you will have ''cond1_treat_pileup.bdg'', ''cond1_control_lambda.bdg'', ''cond2_treat_pileup.bdg'', and ''cond2_control_lambda.bdg'' in the working directory.
+
+### Step 2: Call differential regions
+
+The purpose of this step is to do a three ways comparisons to find out where in the genome has differential enrichment between two conditions. A basic requirement is that this region should be at least enriched in either condition. A log10 likelihood ratio cutoff (C) will be applied in this step. Three types of differential regions will be reported: 1. those having more enrichment in condition 1 over condition 2 ( cond1_ChIP > cond1_Control and cond1_ChIP > cond2_ChIP ); 2. those having more enrichment in condition 2 over condition 1 ( cond2_ChIP > cond2_Control and cond2_ChIP > cond1_ChIP ); those having similar enrichment in both conditions ( cond1_ChIP > cond1_Control and cond2_ChIP > cond2_Control and cond1_ChIP ≈ cond1_ChIP ).
+
+Run this:
+
+```
+ $ macs3 bdgdiff --t1 cond1_treat_pileup.bdg --c1 cond1_control_lambda.bdg --t2 cond2_treat_pileup.bdg\
+   --c2 cond2_control_lambda.bdg --d1 12914669 --d2 14444786 -g 60 -l 120 --o-prefix diff_c1_vs_c2
+```
+
+You will get the following three files in working directory:
+
+ 1. diff_c1_vs_c2_c3.0_cond1.bed
+ This file stores regions that are highly enriched in condition 1 comparing to condition 2. The last column in the file represent the log10 likelihood ratio to show how likely the observed signal in condition 1 in this region is from condition 1 comparing to condition 2. Higher the value, bigger the difference.
+
+ 2. diff_c1_vs_c2_c3.0_cond2.bed
+ This file stores regions that are highly enriched in condition 2 comparing to condition 1. The last column in the file represent the log10 likelihood ratio to show how likely the observed signal in condition 2 in this region is from condition 2 comparing to condition 1. Higher the value, bigger the difference.
+
+ 3. diff_c1_vs_c2_c3.0_common.bed
+ This file stores regions that are highly enriched in both condition 1 and condition 2, and the difference between condition 1 and condition 2 is not significant. The last column in the file represent the difference between condition 1 and condition 2 in log10 likelihood ratios.
diff --git a/docs/bdgpeakcall.md b/docs/bdgpeakcall.md
@@ -1,39 +1,109 @@
-# Bdgpeakcall
+# bdgpeakcall
 
 ## Overview
-The `bdgpeakcall` command is part of the MACS3 suite of tools and is used to call peaks from a single bedGraph track for scores. It is particularly useful in ChIP-Seq analysis for identifying peaks in the data.
+The `bdgpeakcall` command is part of the MACS3 suite of tools and is
+used to call peaks from a single bedGraph track for scores.
 
 ## Detailed Description
+The `bdgpeakcall` command takes an input bedGraph file, a cutoff
+value, and the options to control peak width, then produces an output
+file with peaks called. This tool can be used as a generic peak caller
+that can be applied to any scoring system in a BedGraph file, no
+matter the score is the pileup, the p-value, or the fold change -- or
+any other measurement to decide the 'significancy' of the genomic
+loci.
 
-The `bdgpeakcall` command takes an input bedGraph file and produces an output file with peaks called. It uses an efficient algorithm to detect and call peaks, greatly improving the quality of your data for further analysis.
-
-
-
-Call peaks from bedGraph output. Note: All regions on the same chromosome in the bedGraph file should be continuous so only
-                        bedGraph files from MACS3 are accpetable.
+Note: All regions on the same chromosome in the bedGraph file should
+be continuous.
 
 ## Command Line Options
 
-The command line options for `bdgpeakcall` are defined in `/MACS3/Commands/bdgpeakcall_cmd.py` and `/bin/macs3` files. Here is a brief overview of these options:
+Here is a brief overview of the command line options for `bdgpeakcall`:
 
-- `-i` or `--ifile`: MACS score in bedGraph. REQUIRED
-- `-c` or `--cutoff`: Cutoff depends on which method you used for the score track. If the file contains p-value scores from MACS3, score 5 means pvalue 1e-5. Regions with signals lower than the cutoff will not be considered as enriched regions. DEFAULT: 5
-- `-l` or `--min-length`: Minimum length of peak, better to set it as d value. DEFAULT: 200
-- `-g` or `--max-gap`: Maximum gap between significant points in a peak, better to set it as the tag size. DEFAULT: 30
-- `--cutoff-analysis`: While set, bdgpeakcall will analyze the number or total length of peaks that can be called by different cutoff then output a summary table to help the user decide a better cutoff. Note, minlen and maxgap may affect the results. DEFAULT: False
-- `--no-trackline`: Tells MACS not to include a trackline with bedGraph files. The trackline is required by UCSC.
-- `--verbose`: Set the verbose level of runtime messages. 0: only show critical messages, 1: show additional warning messages, 2: show process information, 3: show debug messages. DEFAULT: 2
-- `--outdir`: If specified, all output files will be written to that directory. Default: the current working directory
-- `-o` or `--ofile`: Output file name. Mutually exclusive with --o-prefix.
-- `--o-prefix`: Output file prefix. Mutually exclusive with -o/--ofile.
+- `-i` or `--ifile`: MACS score, or any numerical measurement score in
+   bedGraph. The only assumption on the score is that higher the score
+   is, more significant the genomic loci is. REQUIRED
+- `-c` or `--cutoff`: Cutoff depends on which method you used for the
+   score track. If the file contains -log10(p-value)  scores from
+   MACS3, score 5 means pvalue 1e-5. Regions with signals lower than
+   the cutoff will not be considered as enriched regions. DEFAULT: 5 
+- `-l` or `--min-length`: Minimum length of peak, better to set it as
+   d value if we will deal with MACS3 generated scores. DEFAULT: 200 
+- `-g` or `--max-gap`: Maximum gap between significant points in a 
+   peak, better to set it as the tag size if we will deal with MACS3
+   generated scores. DEFAULT: 30 
+- `--cutoff-analysis`: While set, bdgpeakcall will analyze the number
+   or total length of peaks that can be called by different cutoff
+   then output a summary table to help the user decide a better
+   cutoff. Note, minlen and maxgap may affect the results. Also, if
+   this option is on, `bdgpeakcall` will analyze the outcome of
+   different cutoff values and then quit without calling any peaks.
+   DEFAULT: False
+- `--no-trackline`: Tells MACS not to include a trackline with
+   bedGraph files. The trackline is used by UCSC for the options for
+   display.  
+- `--verbose`: Set the verbose level of runtime messages. 0: only show
+   critical messages, 1: show additional warning messages, 2: show
+   process information, 3: show debug messages. DEFAULT: 2
+- `--outdir`: If specified, all output files will be written to that
+   directory. Default: the current working directory 
+- `-o` or `--ofile`: Output file name. Mutually exclusive with
+   `--o-prefix`. 
+- `--o-prefix`: Output file prefix. Mutually exclusive with
+   `-o/--ofile`. 
 
 
 ## Example Usage
 
 Here is an example of how to use the `bdgpeakcall` command:
 
 ```bash
-macs3 bdgpeakcall -i input.bedGraph -o output.narrowPeak --cutoff 1.0 --minlen 500 --maxgap 1000 --cutoff-analysis
+macs3 bdgpeakcall -i input.bedGraph -o output.narrowPeak --cutoff 1.0 --minlen 500 --maxgap 1000
 ```
 
-In this example, the program will call peaks from the `input.bedGraph` file and write the result to `output.narrowPeak`. The cutoff for calling peaks is set to 1.0, the minimum length of peaks is set to 500, the maximum gap between peaks is set to 1000, and the cutoff-analysis option is enabled.
+In this example, the program will call peaks from the `input.bedGraph`
+file and write the result to `output.narrowPeak`. The cutoff for
+calling peaks is set to 1.0, the minimum length of peaks is set to
+500, and the maximum gap between peaks is set to 1000.
+
+## Cutoff Analysis
+
+The cutoff analysis function is provided by `--cutoff-analysis` option
+in `callpeak`, `bdgpeakcall`, and `hmmratac`. However, the function is
+`bdgpeakcall` is more flexible and can be applied on any scoring
+scheme. We will sperate this function into a dedicated subcommand in
+the future. 
+
+Please note that if this `--cutoff-anlaysis` option is on, the
+`bdgpeakcall` won't write any results of the peaks into narrowPeak
+format file, ignoring `-c` you specified. Instead, it will write a
+cutoff analysis report (`-o`) and quit.
+
+When the option is on, we will first calculate the window of step for
+increasing the cutoff from the lowest (`min_v`) to the highest
+(`max_v`), using the `--cutoff-analysis-steps`:
+
+`win_step = (max_v-min_v)/steps`. 
+
+Then for each cutoff we plan to investigate, we will check the number
+of peaks that can be called, their average peak length, and their
+total length. The smallest cutoff (close to `min_v`) and any cutoff
+that can't lead to any peak will be excluded in the final report.
+
+The report consists of four columns:
+
+1. score: the possible fold change cutoff value.
+2. npeaks: the number of peaks under this cutoff.
+3. lpeaks: the total length of all peaks.
+4. avelpeak: the average length of peaks.
+
+While there's no universal rule to suggest the best cutoff, here are a
+few suggestions:
+
+- You can use elbow analysis to find the cutoff that dramatically
+   change the trend of npeaks, lpeaks, or avelpeak. But you need to
+   think about how to define 'dramatical change'.
+- You can use some common expectation to decide the cutoff. For
+   example, the number of peaks should be thousands/ or the avelpeak
+   should be around 500bps. Of course, it's arbitrary but the table
+   will give you some insight.
diff --git a/docs/callpeak.md b/docs/callpeak.md
@@ -1,4 +1,4 @@
-# Call peaks
+# callpeak
 
 This is the main function in MACS3. It can be invoked by `macs3
 callpeak` . If you type this command with `-h`, you will see a full

diff --git a/docs/hmmratac.md b/docs/hmmratac.md
@@ -1,4 +1,4 @@
-# Hidden Markov Model peak caller for ATAC-seq -- HMMRATAC
+# hmmratac
 
 HMMRATAC (`macs3 hmmratac`) is a dedicated peak calling algorithm
 based on Hidden Markov Model for ATAC-seq data. The basic idea behind