Skip to content

Commit

Permalink
update docs
Browse files Browse the repository at this point in the history
  • Loading branch information
taoliu committed Nov 2, 2023
1 parent 4a7f2ba commit af8362c
Show file tree
Hide file tree
Showing 5 changed files with 163 additions and 35 deletions.
14 changes: 7 additions & 7 deletions docs/bdgcmp.md
Original file line number Diff line number Diff line change
@@ -1,24 +1,24 @@
# Bdgcmp
docs/bdgcmp.md# bdgcmp

## Overview
The `bdgcmp` command is part of the MACS3 suite of tools and is used to compare bedGraph files. It is particularly useful for comparing two bedGraph files. The typical usage is to calculate pvalue or qvalue using poisson model for each basepair given a treatment pileup signal file in bedGraph format and a control lambda bedGraph file. But we provides more functions rather than pvalue and qvalue, including subtract, division (FE) and more.
The `bdgcmp` command is part of the MACS3 suite of tools and is used to compare two bedGraph files in each basepair that are commonly covered by the two files. The typical use case is to calculate pvalue or qvalue using Poisson model for each basepair given a treatment pileup signal file in bedGraph format and a control lambda bedGraph file. But we provides more functions rather than pvalue and qvalue, including subtract, division (FE) and more.

## Detailed Description

The `bdgcmp` command takes two input bedGraph files (e.g. a control and a treatment bedgraph) and produces an output bedGraph of comparison scores for each genomic position involved in the bedGraph files. The `bdgcmp` command normally is used to deduct noise by comparing two signal tracks in bedGraph. Note: All regions on the same chromosome in the bedGraph file should be continuous so we recommand you use the bedGraph files from MACS3. We provide the following function to 'compare two tracks':
The `bdgcmp` command takes two input bedGraph files (e.g. a control and a treatment bedgraph) and produces an output bedGraph of comparison scores for each genomic position involved in the bedGraph files. The `bdgcmp` command normally is used to deduct noise from a signal track in bedGraph (e.g. ChIP treatment) over another signal track in bedGraph (e.g. control). Note: All regions on the same chromosome in the bedGraph file should be continuous so we recommand you use the bedGraph files from MACS3. We provide the following function to 'compare two tracks':

- ppois Poisson p-value (-log10(pvalue) form) using control as lambda and treatment as observation
- ppois Poisson p-value (-log10(pvalue) form) using the second file (-c) as lambda and treatment (-t) as observation
- qpois Q-value through a BH process for poisson pvalues
- subtract Subtraction from treatment
- FE linear scale fold enrichment, or the score from file A divided by the score from file B
- logFE log10 fold enrichment(need to set pseudocount)
- logLR log10 likelihood between ChIP-enriched model and open chromatin model(need to set pseudocount)
- symmetric log10 likelihood between two ChIP-enrichment models using Poison distribution
- logLR log10 likelihood between ChIP-enriched model and open chromatin model (need to set pseudocount)
- symmetric log10 likelihood between two ChIP-enrichment models using Poison distribution, and this can be used to compare ChIP signals from two differen conditions (differential binding)
- max Maximum value between the two tracks.

## Command Line Options

The command line options for `bdgcmp` are defined in `/MACS3/Commands/bdgcmp_cmd.py` and `/bin/macs3` files. Here is a brief overview of these options:
Here is a brief description of the command line options for `bdgcmp` :

- `-t` or `--tfile`: Treatment bedGraph file, e.g. *_treat_pileup.bdg from MACS. REQUIRED
- `-c` or `--cfile`: Control bedGraph file, e.g. *_control_lambda.bdg from MACS. REQUIRED
Expand Down
68 changes: 63 additions & 5 deletions docs/bdgdiff.md
Original file line number Diff line number Diff line change
@@ -1,14 +1,15 @@
# Bdgdiff
# bdgdiff

## Overview
The `bdgdiff` command is part of the MACS3 suite of tools and is used to call differential peaks from four bedGraph tracks for scores. It is particularly useful in ChIP-Seq analysis for identifying differential peaks in the data.
The `bdgdiff` command is part of the MACS3 suite of tools and is used to call differential peaks from four bedGraph tracks of scores, including treatment and control track for each condition. It is particularly useful in ChIP-Seq analysis for identifying differential peaks in the data.

## Detailed Description

The `bdgdiff` command takes four input bedGraph files (two treatment and two control files) and produces three output files with differential peaks called. It uses an efficient algorithm to detect and call differential peaks, greatly improving the quality of your data for further analysis.
The `bdgdiff` command takes four input bedGraph files (two treatment and two control files) and produces three output files with differential peaks called. Users should provide paired four bedgraph files: for each condition, a treatment pileup signal track in bedGraph format, and a control lambda track in bedGraph format. This differential calling can only handle one replicate per condition, so if you have multiple replicates, you should either combine the replicates when `callpeak`, or choose other tool that can consider within-group variation (such as DESeq2 or edgeR). The method we use to define the differential peaks is based on multiple likelihood tests, based on poisson distribution. Suppose that we have two conditions A and B, the unique binding sites in condition A over condition B should be *more likely* to be a binding event in treatment A over treatment B, and also *more likely* to be a real binding site in condition A while comparing treatment A over control A; the unique binding sites in condition B is defined in the same way; the common peaks of both condition should be *more likely* to be a real binding sites in condition A while comparing treatment A and control A, and in condition B while comparing treatment B over control B, and also the likelihood test while comparing treatment A and treatment B can't decide which condition is stronger.

Differential peak detection based on paired four bedgraph files. Note: All regions on the same chromosome in the bedGraph file
should be continuous so only bedGraph files from MACS3 are accpetable.
The likelihood function we used is:

Note: All regions on the same chromosome in the bedGraph file should be continuous so only bedGraph files from MACS3 are acceptable.

## Command Line Options

Expand Down Expand Up @@ -39,3 +40,60 @@ macs3 bdgdiff -t1 treatment1.bedGraph -c1 control1.bedGraph -t2 treatment2.bedGr

In this example, the program will call differential peaks from the two pairs of treatment and control bedGraph files and write the result to `output.bedGraph`. The depth of the first and second condition is set to 1.0, the minimum length of differential peaks is set to 500, the maximum gap between differential peaks is set to 1000, and the cutoff for LLR to call differential peaks is set to 1.0.

## Step-by-step Instruction for calling differential peaks

In this chatper, we will describe how to use MACS3 to identify differential regions by comparing pileup tracks of two conditions, starting from the alignment files. Two modules will be involved: `callpeak` and `bdgdiff` ( `predictd` is optional ). We will use human ChIP-seq data as example, and filenames of raw data are: cond1_ChIP.bam and cond1_Control.bam for condition 1; cond2_ChIP.bam and cond2_Control.bam for condition 2.

### Step 1: Generate pileup tracks using callpeak module

Purpose of this step is to use `callpeak` with -B option to generate bedGraph files for both conditions. There are several things to be remember: 1. `--SPMR` is not compatible with `bdgdiff`, so avoid using it; 2. prepare a pen to write down the number of non-redundant reads of both conditions -- you will find such information in runtime message or xls output from `callpeak`; 3. keep using the same `--extsize` for both conditions ( you can get it from `predictd` module).

To get a uniform extension size for running `callpeak`, run `predictd`:

```
$ macs3 predictd -i cond1_ChIP.bam
$ macs3 predictd -i cond2_ChIP.bam
```

An easy solution is to use the average of two 'fragment size' predicted in `callpeak`, however any reasonable value will work. For example, you can use `200` for most ChIP-seq datasets for transcription factors, or ''147'' for most histone modification ChIP-seq. The only requirement is that you have to keep using the same extsize for the following commands:

```
$ macs3 callpeak -B -t cond1_ChIP.bam -c cond1_Control.bam -n cond1 --nomodel --extsize 120
$ macs3 callpeak -B -t cond2_ChIP.bam -c cond2_Control.bam -n cond2 --nomodel --extsize 120
```

Pay attention to runtime message, or extract the "tags after filtering in treatment" and "tags after filtering in control" lines from xls to see the effective sequencing depths for both conditions. In our previous command lines, '--to-large' is not used, so the effective sequencing depth is the smaller number of treatment and control. For example:

$ egrep "tags after filtering in treatment|tags after filtering in control" cond1_peaks.xls
# tags after filtering in treatment: 19291269
# tags after filtering in control: 12914669

$ egrep "tags after filtering in treatment|tags after filtering in control" cond2_peaks.xls
# tags after filtering in treatment: 19962431
# tags after filtering in control: 14444786

Then actual effective depths of condition 1 and 2 are: 12914669 and 14444786. Keep record of these two numbers and we will use them later. After successfully running '''callpeak''', you will have ''cond1_treat_pileup.bdg'', ''cond1_control_lambda.bdg'', ''cond2_treat_pileup.bdg'', and ''cond2_control_lambda.bdg'' in the working directory.

### Step 2: Call differential regions

The purpose of this step is to do a three ways comparisons to find out where in the genome has differential enrichment between two conditions. A basic requirement is that this region should be at least enriched in either condition. A log10 likelihood ratio cutoff (C) will be applied in this step. Three types of differential regions will be reported: 1. those having more enrichment in condition 1 over condition 2 ( cond1_ChIP > cond1_Control and cond1_ChIP > cond2_ChIP ); 2. those having more enrichment in condition 2 over condition 1 ( cond2_ChIP > cond2_Control and cond2_ChIP > cond1_ChIP ); those having similar enrichment in both conditions ( cond1_ChIP > cond1_Control and cond2_ChIP > cond2_Control and cond1_ChIP ≈ cond1_ChIP ).

Run this:

```
$ macs3 bdgdiff --t1 cond1_treat_pileup.bdg --c1 cond1_control_lambda.bdg --t2 cond2_treat_pileup.bdg\
--c2 cond2_control_lambda.bdg --d1 12914669 --d2 14444786 -g 60 -l 120 --o-prefix diff_c1_vs_c2
```

You will get the following three files in working directory:

1. diff_c1_vs_c2_c3.0_cond1.bed
This file stores regions that are highly enriched in condition 1 comparing to condition 2. The last column in the file represent the log10 likelihood ratio to show how likely the observed signal in condition 1 in this region is from condition 1 comparing to condition 2. Higher the value, bigger the difference.

2. diff_c1_vs_c2_c3.0_cond2.bed
This file stores regions that are highly enriched in condition 2 comparing to condition 1. The last column in the file represent the log10 likelihood ratio to show how likely the observed signal in condition 2 in this region is from condition 2 comparing to condition 1. Higher the value, bigger the difference.

3. diff_c1_vs_c2_c3.0_common.bed
This file stores regions that are highly enriched in both condition 1 and condition 2, and the difference between condition 1 and condition 2 is not significant. The last column in the file represent the difference between condition 1 and condition 2 in log10 likelihood ratios.
112 changes: 91 additions & 21 deletions docs/bdgpeakcall.md
Original file line number Diff line number Diff line change
@@ -1,39 +1,109 @@
# Bdgpeakcall
# bdgpeakcall

## Overview
The `bdgpeakcall` command is part of the MACS3 suite of tools and is used to call peaks from a single bedGraph track for scores. It is particularly useful in ChIP-Seq analysis for identifying peaks in the data.
The `bdgpeakcall` command is part of the MACS3 suite of tools and is
used to call peaks from a single bedGraph track for scores.

## Detailed Description
The `bdgpeakcall` command takes an input bedGraph file, a cutoff
value, and the options to control peak width, then produces an output
file with peaks called. This tool can be used as a generic peak caller
that can be applied to any scoring system in a BedGraph file, no
matter the score is the pileup, the p-value, or the fold change -- or
any other measurement to decide the 'significancy' of the genomic
loci.

The `bdgpeakcall` command takes an input bedGraph file and produces an output file with peaks called. It uses an efficient algorithm to detect and call peaks, greatly improving the quality of your data for further analysis.



Call peaks from bedGraph output. Note: All regions on the same chromosome in the bedGraph file should be continuous so only
bedGraph files from MACS3 are accpetable.
Note: All regions on the same chromosome in the bedGraph file should
be continuous.

## Command Line Options

The command line options for `bdgpeakcall` are defined in `/MACS3/Commands/bdgpeakcall_cmd.py` and `/bin/macs3` files. Here is a brief overview of these options:
Here is a brief overview of the command line options for `bdgpeakcall`:

- `-i` or `--ifile`: MACS score in bedGraph. REQUIRED
- `-c` or `--cutoff`: Cutoff depends on which method you used for the score track. If the file contains p-value scores from MACS3, score 5 means pvalue 1e-5. Regions with signals lower than the cutoff will not be considered as enriched regions. DEFAULT: 5
- `-l` or `--min-length`: Minimum length of peak, better to set it as d value. DEFAULT: 200
- `-g` or `--max-gap`: Maximum gap between significant points in a peak, better to set it as the tag size. DEFAULT: 30
- `--cutoff-analysis`: While set, bdgpeakcall will analyze the number or total length of peaks that can be called by different cutoff then output a summary table to help the user decide a better cutoff. Note, minlen and maxgap may affect the results. DEFAULT: False
- `--no-trackline`: Tells MACS not to include a trackline with bedGraph files. The trackline is required by UCSC.
- `--verbose`: Set the verbose level of runtime messages. 0: only show critical messages, 1: show additional warning messages, 2: show process information, 3: show debug messages. DEFAULT: 2
- `--outdir`: If specified, all output files will be written to that directory. Default: the current working directory
- `-o` or `--ofile`: Output file name. Mutually exclusive with --o-prefix.
- `--o-prefix`: Output file prefix. Mutually exclusive with -o/--ofile.
- `-i` or `--ifile`: MACS score, or any numerical measurement score in
bedGraph. The only assumption on the score is that higher the score
is, more significant the genomic loci is. REQUIRED
- `-c` or `--cutoff`: Cutoff depends on which method you used for the
score track. If the file contains -log10(p-value) scores from
MACS3, score 5 means pvalue 1e-5. Regions with signals lower than
the cutoff will not be considered as enriched regions. DEFAULT: 5
- `-l` or `--min-length`: Minimum length of peak, better to set it as
d value if we will deal with MACS3 generated scores. DEFAULT: 200
- `-g` or `--max-gap`: Maximum gap between significant points in a
peak, better to set it as the tag size if we will deal with MACS3
generated scores. DEFAULT: 30
- `--cutoff-analysis`: While set, bdgpeakcall will analyze the number
or total length of peaks that can be called by different cutoff
then output a summary table to help the user decide a better
cutoff. Note, minlen and maxgap may affect the results. Also, if
this option is on, `bdgpeakcall` will analyze the outcome of
different cutoff values and then quit without calling any peaks.
DEFAULT: False
- `--no-trackline`: Tells MACS not to include a trackline with
bedGraph files. The trackline is used by UCSC for the options for
display.
- `--verbose`: Set the verbose level of runtime messages. 0: only show
critical messages, 1: show additional warning messages, 2: show
process information, 3: show debug messages. DEFAULT: 2
- `--outdir`: If specified, all output files will be written to that
directory. Default: the current working directory
- `-o` or `--ofile`: Output file name. Mutually exclusive with
`--o-prefix`.
- `--o-prefix`: Output file prefix. Mutually exclusive with
`-o/--ofile`.


## Example Usage

Here is an example of how to use the `bdgpeakcall` command:

```bash
macs3 bdgpeakcall -i input.bedGraph -o output.narrowPeak --cutoff 1.0 --minlen 500 --maxgap 1000 --cutoff-analysis
macs3 bdgpeakcall -i input.bedGraph -o output.narrowPeak --cutoff 1.0 --minlen 500 --maxgap 1000
```

In this example, the program will call peaks from the `input.bedGraph` file and write the result to `output.narrowPeak`. The cutoff for calling peaks is set to 1.0, the minimum length of peaks is set to 500, the maximum gap between peaks is set to 1000, and the cutoff-analysis option is enabled.
In this example, the program will call peaks from the `input.bedGraph`
file and write the result to `output.narrowPeak`. The cutoff for
calling peaks is set to 1.0, the minimum length of peaks is set to
500, and the maximum gap between peaks is set to 1000.

## Cutoff Analysis

The cutoff analysis function is provided by `--cutoff-analysis` option
in `callpeak`, `bdgpeakcall`, and `hmmratac`. However, the function is
`bdgpeakcall` is more flexible and can be applied on any scoring
scheme. We will sperate this function into a dedicated subcommand in
the future.

Please note that if this `--cutoff-anlaysis` option is on, the
`bdgpeakcall` won't write any results of the peaks into narrowPeak
format file, ignoring `-c` you specified. Instead, it will write a
cutoff analysis report (`-o`) and quit.

When the option is on, we will first calculate the window of step for
increasing the cutoff from the lowest (`min_v`) to the highest
(`max_v`), using the `--cutoff-analysis-steps`:

`win_step = (max_v-min_v)/steps`.

Then for each cutoff we plan to investigate, we will check the number
of peaks that can be called, their average peak length, and their
total length. The smallest cutoff (close to `min_v`) and any cutoff
that can't lead to any peak will be excluded in the final report.

The report consists of four columns:

1. score: the possible fold change cutoff value.
2. npeaks: the number of peaks under this cutoff.
3. lpeaks: the total length of all peaks.
4. avelpeak: the average length of peaks.

While there's no universal rule to suggest the best cutoff, here are a
few suggestions:

- You can use elbow analysis to find the cutoff that dramatically
change the trend of npeaks, lpeaks, or avelpeak. But you need to
think about how to define 'dramatical change'.
- You can use some common expectation to decide the cutoff. For
example, the number of peaks should be thousands/ or the avelpeak
should be around 500bps. Of course, it's arbitrary but the table
will give you some insight.
2 changes: 1 addition & 1 deletion docs/callpeak.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Call peaks
# callpeak

This is the main function in MACS3. It can be invoked by `macs3
callpeak` . If you type this command with `-h`, you will see a full
Expand Down
2 changes: 1 addition & 1 deletion docs/hmmratac.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Hidden Markov Model peak caller for ATAC-seq -- HMMRATAC
# hmmratac

HMMRATAC (`macs3 hmmratac`) is a dedicated peak calling algorithm
based on Hidden Markov Model for ATAC-seq data. The basic idea behind
Expand Down

0 comments on commit af8362c

Please sign in to comment.