Pipeline for extracting TCR using TraCeR from Smart-seq3 sequencing data processed by zUMIs.
An HPC cluster running Linux with Singularity installed.
├── bin
│ ├── data_functions.py
│ ├── objects.py
│ └── tracer.conf
├── complete_pipeline.sh
├── data
│ ├── 00_SS3_raw_data
│ │ ├── Plate_1
│ │ │ ├── Plate_1.filtered.tagged.Aligned.out.bam
│ │ │ ├── Plate_1.filtered.tagged.unmapped.bam
│ │ │ └── Plate_1.barcodes.csv
│ │ ├── Plate_2
│ │ │ ├── Plate_2.filtered.tagged.Aligned.out.bam
│ │ │ ├── Plate_2.filtered.tagged.unmapped.bam
│ │ │ └── Plate_2.barcodes.csv
│ │ └── Plate_3
│ │ │ ├── Plate_3.filtered.tagged.Aligned.out.bam
│ │ │ ├── Plate_3.filtered.tagged.unmapped.bam
│ │ │ └── Plate_3.barcodes.csv
│ ├── 01_SS3_splitted_bams
│ ├── 02_SS3_merged_fastq
│ ├── 03_SS3_trimmed_fastq
│ ├── 04_SS3_Tracer_assembled_cells
│ └── 05_SS3_collected_TCRs
├── env
│ ├── 01_pysam_SS3.def
│ ├── 02_samtools_SS3.def
│ ├── 03_trimgalore_SS3.def
│ ├── 04_tracer_SS3.def
│ └── figlet.def
├── merge_plates_with_clonality.sh
├── README.md
├── results
└── src
  ├── 01_split_bam_by_tag_and_condition_file.py
  ├── 02_bam2fastq.sh
  ├── 03_run_trim_galore.sh
  ├── 04_assemble_trimmed_cells.sh
  ├── 05_collect_assemble.py
  └── 06_clonality_analysis.py
- Place the zUMIs output in the folder data/00_SS3_raw_data/, in a subfolder named after the plate (Plate_1, for example). Make sure that the following 3 files are in the plate subfolder:
  1.1 BAM file for aligned reads, named Plate_1.filtered.tagged.Aligned.out.bam.
  1.2 BAM file for unmapped reads, named Plate_1.filtered.tagged.unmapped.bam.
  1.3 Barcodes per cell in .csv format, named Plate_1.barcodes.csv.
- Run the pipeline with ./complete_pipeline.sh Plate_1, repeating the command for every plate. The TCR dataset will be saved in data/05_SS3_collected_TCRs/, in a folder named after the plate.
  Optional: Specify the number of nodes for parallel execution after the plate name. By default it runs on 10 nodes.
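To process every plate in one go, the per-plate command can be wrapped in a loop. The following is a minimal sketch, assuming that each subfolder of data/00_SS3_raw_data/ is one plate. The function name run_all_plates and the DRY_RUN switch are illustrative and not part of the repository; by default the sketch only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: run complete_pipeline.sh once per plate subfolder.
# run_all_plates and DRY_RUN are hypothetical helpers, not part of the repository.
# DRY_RUN defaults to 1 (print the commands only); set DRY_RUN=0 to execute them.
run_all_plates() {
    raw_dir="$1"        # folder holding one subfolder per plate
    nodes="${2:-10}"    # the pipeline itself defaults to 10 nodes
    for plate_dir in "$raw_dir"/*/; do
        [ -d "$plate_dir" ] || continue      # nothing to do if no plate folder exists
        plate="$(basename "$plate_dir")"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "./complete_pipeline.sh $plate $nodes"
        else
            ./complete_pipeline.sh "$plate" "$nodes"
        fi
    done
}

run_all_plates data/00_SS3_raw_data 10
```

Running it with DRY_RUN=0 from the repository root executes the plates one after another, in the order the shell lists the folders.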
- Once the pipeline has been run for the desired plates, merge them and calculate the clones and their frequency with ./merge_plates_with_clonality.sh. This saves the clonality dataset in results/TCR_clonality.tsv.
  Optional: To change the output name, use the flag --out_file followed by the path to the output file in .csv, .tsv or .xlsx format. To specify another input directory, use the flag --input_dir.
After cloning this repository, build the Singularity images in the env folder. The command singularity build requires administrator rights or membership in a fakeroot list. Consult your local IT department for information on how to build Singularity containers. Other options include building the containers remotely, or building them on a local device and copying the image to your HPC system with scp. More information on building containers can be found in the Singularity documentation. To build the containers from the Singularity definition files using the --fakeroot option:
singularity build --fakeroot env/01_pysam_SS3.sif env/01_pysam_SS3.def
singularity build --fakeroot env/02_samtools_SS3.sif env/02_samtools_SS3.def
singularity build --fakeroot env/03_trimgalore_SS3.sif env/03_trimgalore_SS3.def
singularity build --fakeroot env/04_tracer_SS3.sif env/04_tracer_SS3.def
singularity build --fakeroot env/figlet.sif env/figlet.def
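The five build commands follow one pattern, so they can also be generated from the definition files. This is a minimal sketch, assuming every .def file in env/ should produce a .sif image of the same name. The function build_images and the DRY_RUN switch are hypothetical and not part of the repository; by default the commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch: build one image per definition file, skipping images that already exist.
# build_images and DRY_RUN are hypothetical helpers, not part of the repository.
# DRY_RUN defaults to 1 (print the commands only); set DRY_RUN=0 to build.
build_images() {
    env_dir="${1:-env}"
    for def in "$env_dir"/*.def; do
        [ -f "$def" ] || continue            # no definition files found
        sif="${def%.def}.sif"                # env/x.def -> env/x.sif
        if [ -f "$sif" ]; then
            echo "skipping $sif (already built)"
        elif [ "${DRY_RUN:-1}" = "1" ]; then
            echo "singularity build --fakeroot $sif $def"
        else
            singularity build --fakeroot "$sif" "$def"
        fi
    done
}

build_images env
```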
The output files of interest from the sequencing facility are two large .bam files per plate, produced by zUMIs:
<plate_name>.filtered.tagged.Aligned.out.bam
<plate_name>.filtered.tagged.unmapped.bam
and a barcode file, <plate_name>.barcodes.csv, which is a tabular file with one column for the barcodes and one column for the corresponding cell names.
Move your Smart-seq3 data to the folder data/00_SS3_raw_data/, into a sub-folder named after the plate.
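Before starting a run, it can help to confirm that the plate folder holds the three expected files. The following is a minimal sketch; the function check_plate_inputs is hypothetical and not part of the repository, and it only relies on the file naming described above.

```shell
#!/usr/bin/env bash
# Sketch: verify that a plate folder contains the three zUMIs files the pipeline expects.
# check_plate_inputs is a hypothetical helper, not part of the repository.
check_plate_inputs() {
    plate_dir="${1%/}"                       # e.g. data/00_SS3_raw_data/Plate_1
    plate="$(basename "$plate_dir")"         # file names start with the plate name
    missing=0
    for suffix in filtered.tagged.Aligned.out.bam \
                  filtered.tagged.unmapped.bam \
                  barcodes.csv; do
        if [ ! -f "$plate_dir/$plate.$suffix" ]; then
            echo "missing: $plate_dir/$plate.$suffix" >&2
            missing=1
        fi
    done
    return "$missing"                        # 0 if all files exist, 1 otherwise
}

check_plate_inputs data/00_SS3_raw_data/Plate_1 || echo "Plate_1 is not ready"
```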
This section uses a container with Python 3.9 and pysam that calls the Python script src/01_split_bam_by_tag_and_condition_file.py internally. The script extracts one .bam file per cell from each large .bam file, given the barcode and name of each cell.
./env/01_pysam_SS3.sif bam_in condition_csv bam_out --condition_tag_col <barcode_column> --condition_name_col <cell_name_column> --bam_tag_flag BC
Inputs:
Parameter | Type | Description |
---|---|---|
bam_in | string | Relative path to the directory holding the multiplexed .bam file. |
condition_csv | string | Relative path to the .csv file containing the mapping between the barcodes and the names of the cells. |
bam_out | string | Relative path to the folder where the output .bam files per cell are to be saved. |
condition_tag_col | string (optional) | Name of the column containing the barcodes in 'condition_csv'. Defaults to 'Barcode'. |
condition_name_col | string (optional) | Name of the column containing the cell name in 'condition_csv'. Defaults to 'Name'. |
bam_tag_flag | string (optional) | The tag in the bam file that contains the sample barcode. Defaults to 'BC' for zUMIs output. |
name_part_filer | string (optional) | Restricts processing to sample names that contain a particular substring. Defaults to None. |
Example:
./env/01_pysam_SS3.sif data/00_SS3_raw_data/Plate_1/ Plate_1.filtered.tagged.Aligned.out.bam data/00_SS3_raw_data/Plate_1/Plate_1.barcodes.csv data/01_SS3_splitted_bams/Aligned/Plate_1/ --condition_tag_col Barcode --condition_name_col Name --bam_tag_flag BC
For detailed help, type ./env/01_pysam_SS3.sif --help or singularity run-help env/01_pysam_SS3.sif.
- Execution time: ~20 seconds per cell.
- The previous procedure has to be done for both the .Aligned.out.bam file and the .unmapped.bam file.
- The output directory data/01_SS3_splitted_bams/Aligned/Plate_1/ has to be created before running the script.
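Since the split is run twice per plate, both output directories can be created up front. This is a minimal sketch; the function prepare_split_dirs is hypothetical, and the unmapped/ folder name is an assumption made by analogy with the Aligned/ path above.

```shell
#!/usr/bin/env bash
# Sketch: create the per-plate output folders for the two split runs.
# prepare_split_dirs is a hypothetical helper; "unmapped" as a folder name is an
# assumption by analogy with data/01_SS3_splitted_bams/Aligned/Plate_1/.
prepare_split_dirs() {
    out_base="$1"   # e.g. data/01_SS3_splitted_bams
    plate="$2"      # e.g. Plate_1
    for kind in Aligned unmapped; do
        mkdir -p "$out_base/$kind/$plate"    # -p: no error if the folder exists
        echo "$out_base/$kind/$plate"
    done
}

prepare_split_dirs data/01_SS3_splitted_bams Plate_1
```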
The individual .bam files from the Aligned and unmapped files have to be converted to fastq.gz and concatenated.
This section uses a container with samtools that executes the bash script src/02_bam2fastq.sh, which converts the .bam files to fastq.gz format and then concatenates the Aligned and unmapped reads of each cell.
./env/02_samtools_SS3.sif INPUT_DIR OUTPUT_DIR NODES
Inputs to the bash script:
Parameter | Type | Description |
---|---|---|
INPUT_DIR | string | Directory with the aligned single cell bam files. |
OUTPUT_DIR | string | Directory where the fastq.gz files will be written. |
NODES | int | Number of nodes to use in samtools fastq. |
Example:
./env/02_samtools_SS3.sif data/01_SS3_splitted_bams/Plate_1/ data/02_SS3_merged_fastq/Plate_1/ 40
For detailed help, type ./env/02_samtools_SS3.sif --help or singularity run-help env/02_samtools_SS3.sif.
- Execution time on 40 nodes: ~20 seconds per cell.
- The script creates temporary folders that are deleted if the script terminates with exit status 0.
- The output directory data/02_SS3_merged_fastq/Plate_1/ has to be created before running the script.
The untrimmed fastq files of the cells are all saved in the directory data/02_SS3_merged_fastq. This step trims the adapters and saves the output of each cell in a separate folder named after the cell.
This section uses a container with TrimGalore 0.6.7 that calls the bash script src/03_run_trim_galore.sh, which trims the adapters of the concatenated fastq files and writes the output to a new folder named after the cell.
./env/03_trimgalore_SS3.sif INPUT_DIR OUTPUT_DIR NODES
Inputs to the bash script:
Parameter | Type | Description |
---|---|---|
INPUT_DIR | string | Directory with the merged fastq files. |
OUTPUT_DIR | string | Directory where the trimmed fastq.gz files will be written. |
NODES | int | Number of nodes to use in trim_galore. |
Example:
./env/03_trimgalore_SS3.sif data/02_SS3_merged_fastq/Plate_1/ data/03_SS3_trimmed_fastq/Plate_1/ 8
For detailed help, type ./env/03_trimgalore_SS3.sif --help or singularity run-help env/03_trimgalore_SS3.sif.
- Execution time with 8 nodes: < 10 seconds per cell.
- trim_galore appears not to accept more than 8 cores. If given more, it truncates the value to 8.
- The output directory data/03_SS3_trimmed_fastq/Plate_1/ has to be created before running the script.
TraCeR is a package that uses Bowtie, Trinity, IgBlast and Kallisto, among other tools, to assemble T cell receptors (TCRs) from fastq files.
This section uses a container with TraCeR that calls the bash script src/04_assemble_trimmed_cells.sh, which calls tracer assemble iteratively over the files of multiple cells.
./env/04_tracer_SS3.sif INPUT_DIR OUTPUT_DIR NODES LOCI
Inputs to the bash script:
Parameter | Type | Description |
---|---|---|
INPUT_DIR | string | Relative path to the directory containing the trimmed fastq files. |
OUTPUT_DIR | string | Relative path to the directory where the TCR files will be saved. |
NODES | int | Number of nodes to use in tracer assemble. |
LOCI | string | 'AB' for assembling alpha-beta chains and 'GD' for assembling gamma-delta chains. |
Example:
./env/04_tracer_SS3.sif data/03_SS3_trimmed_fastq/Plate_1/ \
data/04_SS3_Tracer_assembled_cells/Plate_1/AB 20 'AB'
For detailed help, type ./env/04_tracer_SS3.sif --help or singularity run-help env/04_tracer_SS3.sif.
- Execution time with 40 nodes: ~5 days for a 384 cell plate.
- The output directory data/04_SS3_Tracer_assembled_cells/Plate_1/AB has to be created before running the script.
- The above commands should be run separately for each pair of loci, namely for alpha-beta (AB) and for gamma-delta (GD).
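Running the assembly for both pairs of loci can be sketched as a short loop. This is a minimal example under stated assumptions: the function assemble_all_loci and the DRY_RUN switch are hypothetical, and the GD output folder is assumed to sit next to the AB folder shown above. By default the sketch only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: run the TraCeR container once for alpha-beta and once for gamma-delta.
# assemble_all_loci and DRY_RUN are hypothetical helpers, not part of the repository.
# The GD output folder is assumed to mirror the AB folder from the example above.
# DRY_RUN defaults to 1 (print the commands only); set DRY_RUN=0 to execute them.
assemble_all_loci() {
    in_dir="$1"     # trimmed fastq files of one plate
    out_base="$2"   # e.g. data/04_SS3_Tracer_assembled_cells/Plate_1
    nodes="$3"
    for loci in AB GD; do
        out_dir="$out_base/$loci"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "mkdir -p $out_dir"
            echo "./env/04_tracer_SS3.sif $in_dir $out_dir $nodes $loci"
        else
            mkdir -p "$out_dir"              # the script needs the folder to exist
            ./env/04_tracer_SS3.sif "$in_dir" "$out_dir" "$nodes" "$loci"
        fi
    done
}

assemble_all_loci data/03_SS3_trimmed_fastq/Plate_1/ \
    data/04_SS3_Tracer_assembled_cells/Plate_1 20
```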
The output of TraCeR is a nested folder named after the cell, containing several files. This step goes through the output, reads the TCRs and condenses the information from all cells into a tabular dataset.
This step uses the same Python container as in step 1, calling it with the singularity exec command to run the script src/05_collect_assemble.py.
singularity exec env/01_pysam_SS3.sif ./src/05_collect_assemble.py in_path out_path
Inputs to the Python script:
Parameter | Type | Description |
---|---|---|
in_path | string | Relative path to the directory containing the TraCeR output for the plate. |
out_path | string | Relative path to the file (including extension) where the TCR dataset is going to be saved. Accepted formats are .csv, .tsv and .xlsx. |
Example:
singularity exec env/01_pysam_SS3.sif ./src/05_collect_assemble.py data/04_SS3_Tracer_assembled_cells/Plate_1/ data/05_SS3_collected_TCRs/Plate_1/Plate_1.tsv
For detailed help, type singularity exec env/01_pysam_SS3.sif --help.
- Execution time: < 1 minute for a 384 cell plate
- The output directory data/05_SS3_collected_TCRs/Plate_1/ has to be created before running the script.
Once the TCR dataset of every plate has been exported, it is of interest to know whether any TCR appears more than once. Because TCRs are generated through V(D)J recombination, it is very unlikely that two identical TCRs arise independently, so repeated TCRs most likely come from clonally expanded cells. We call these TCRs expanded clones. This section merges the TCR datasets from the different plates and identifies the expanded clones and their frequency.
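The idea of an expanded clone can be illustrated by counting how often each TCR value occurs in a tabular file. This is only a sketch of the concept, not the logic of src/06_clonality_analysis.py: the function count_clones, the two-column layout and the toy values are all hypothetical, and the real script may define TCR identity differently.

```shell
#!/usr/bin/env bash
# Sketch: report values of one column that occur more than once in a TSV file.
# count_clones is a hypothetical helper; it does not reproduce 06_clonality_analysis.py.
# Usage: count_clones FILE COLUMN   (COLUMN is 1-based; the first line is a header)
# Output: "<count><TAB><value>", most frequent first.
count_clones() {
    awk -F '\t' -v col="$2" '
        NR > 1 && $col != "" { n[$col]++ }            # skip header and empty cells
        END { for (k in n) if (n[k] > 1) printf "%d\t%s\n", n[k], k }
    ' "$1" | sort -k1,1nr -k2,2
}

# Toy example with made-up values: TCR "A" is seen in 3 cells, "B" in 2, "C" in 1.
toy="$(mktemp)"
printf 'cell\ttcr\n' > "$toy"
printf 'c1\tA\nc2\tB\nc3\tA\nc4\tA\nc5\tB\nc6\tC\n' >> "$toy"
count_clones "$toy" 2    # prints "3<TAB>A" and "2<TAB>B"; "C" is not expanded
rm -f "$toy"
```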
This step uses the same Python container as in step 1, calling it with the singularity exec command to run the script src/06_clonality_analysis.py. The script merge_plates_with_clonality.sh wraps this call and the inputs to the script for ease of use:
./merge_plates_with_clonality.sh --input_dir <input_dir> --out_file <out_file>
Inputs to the bash script:
Parameter | Type | Description |
---|---|---|
input_dir | string (optional) | Relative path to the directory containing the TCR datasets. Defaults to data/05_SS3_collected_TCRs. |
out_file | string (optional) | Relative path to the file (including extension) where the clonality dataset is going to be saved. Accepted formats are .csv, .tsv and .xlsx. Defaults to results/TCR_clonality.tsv. |
Example:
./merge_plates_with_clonality.sh
For detailed help, type ./merge_plates_with_clonality.sh --help.
- Execution time: < 1 minute for a 384 cell plate