### Example dataset (with default config, should take about 5 minutes)
- Download data from NCBI given a list of accessions, or alternatively, use your own fasta files
- Define a set of training intervals, e.g. full chromosomes, only exons (requires an annotation), etc.
- Shard the dataset for efficient loading with Hugging Face libraries
- Optional: upload to Hugging Face Hub
Requirements:
- GPN
- Snakemake
- If you want to automatically download data from NCBI, install NCBI Datasets CLI (e.g. `conda install -c conda-forge ncbi-datasets-cli`)
Steps:
- Manually download assembly metadata from NCBI Genome. You can choose a set of taxa (e.g. mammals, plants) and apply filters such as annotation level and assembly level.
- Check out the script `gpn/ss/filter_assemblies.py` for more details, such as how to subsample or how to keep only one assembly per genus (a rough illustration of this kind of filtering follows this list).
- See `config/config.yaml` and `config/assemblies.tsv`.
- Check the notes in `workflow/Snakefile` for running the workflow with your own set of fasta files.
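The exact filtering logic lives in `gpn/ss/filter_assemblies.py`; as a rough, hypothetical sketch of the kind of filtering it performs (column names such as `genus` and `accession`, and the subsample size, are assumptions for illustration, not the script's actual schema):

```python
import pandas as pd

# Hypothetical sketch of assembly filtering; the real logic is in
# gpn/ss/filter_assemblies.py. Column names are assumptions.
assemblies = pd.read_csv("assemblies.tsv", sep="\t")

# Keep only one assembly per genus.
assemblies = assemblies.drop_duplicates(subset="genus")

# Subsample to a manageable number of genomes (the size here is arbitrary).
n = min(len(assemblies), 100)
assemblies = assemblies.sample(n=n, random_state=42)

assemblies.to_csv("config/assemblies.tsv", sep="\t", index=False)
```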
To run the workflow:

```bash
snakemake --cores all
```
The dataset will be created at `results/dataset`.
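Before uploading, you can sanity-check the result with the Hugging Face `datasets` library. A minimal sketch, assuming `load_dataset` can auto-discover the sharded files in `results/dataset` and that they are exposed as a `train` split:

```python
from datasets import load_dataset

# Assumption: the shards under results/dataset are in a format that
# load_dataset can auto-discover (e.g., parquet), with a "train" split.
dataset = load_dataset("results/dataset", split="train", streaming=True)

# Peek at the first example to confirm the schema looks right.
print(next(iter(dataset)))
```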
For easy distribution and deployment, the dataset can be uploaded to the HF Hub (optionally, as a private dataset). It can then be automatically streamed during training, with no need to fully download the data locally. Make sure to first install the HF Hub client library (e.g. `pip install huggingface_hub`).
```python
from huggingface_hub import HfApi

api = HfApi()
private = False  # set to True to keep the dataset private
repo_id = "gonzalobenegas/example_dataset"  # replace with your username, dataset name
folder_path = "results/dataset"

# Create the dataset repo on the Hub, then upload the sharded dataset.
api.create_repo(repo_id=repo_id, repo_type="dataset", private=private)
api.upload_folder(repo_id=repo_id, folder_path=folder_path, repo_type="dataset")
```
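Once uploaded, the dataset can be streamed during training directly from the Hub. A minimal sketch (the `train` split name is an assumption; for a private repo, authenticate first, e.g. with `huggingface-cli login`):

```python
from datasets import load_dataset

# Stream examples on the fly; nothing is fully downloaded locally.
dataset = load_dataset(
    "gonzalobenegas/example_dataset",  # replace with your repo_id
    split="train",  # assumption: shards are exposed as a "train" split
    streaming=True,
)

# Inspect a few streamed examples.
for example in dataset.take(3):
    print(example)
```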