using Google Colab's GPU #914

Open
Ryosuke-254 opened this issue Dec 3, 2024 · 1 comment

@Ryosuke-254

I want to run the following code on a Google Colab GPU. The GPU is utilized briefly, but it sits idle for most of the run, which is causing problems. Could you suggest any improvements?

# Install the required tools

!apt-get update -qq
!apt-get install -y -qq wget tar cmake build-essential

# Download and extract MMseqs2 (GPU build)

!wget https://mmseqs.com/latest/mmseqs-linux-gpu.tar.gz -O mmseqs-linux-gpu.tar.gz
!tar xvzf mmseqs-linux-gpu.tar.gz
!mv mmseqs/bin/mmseqs /usr/local/bin/

# Install the CUDA toolkit

!apt-get install -y -qq nvidia-cuda-toolkit
!nvcc --version # Check that CUDA is installed

# Install PyCUDA and other Python libraries

!pip install -q pycuda biopython pandas

# Check GPU status on Google Colab

!nvidia-smi

# Create the MMseqs2 working directory

import os
work_dir = "./mmseqs_work"
os.makedirs(work_dir, exist_ok=True)

# Specify the input FASTA file

input_fasta = "/content/Book2test.fasta" # Change the file path as needed

# Create the MMseqs2 database (one time only)

!mmseqs createdb {input_fasta} {work_dir}/db

# Convert the database to the GPU-compatible format (using makepaddedseqdb)

!mmseqs makepaddedseqdb {work_dir}/db {work_dir}/db_gpu

# Pairwise search of the database against itself (using the GPU)

search_result_path = os.path.join(work_dir, "search_result")
tmp_dir = os.path.join(work_dir, "tmp")
os.makedirs(tmp_dir, exist_ok=True)

!mmseqs search {work_dir}/db {work_dir}/db_gpu {search_result_path} {tmp_dir} \
    --min-seq-id 0.8 --threads 4 --search-type 3 --gpu 1 || echo "Search failed!"

# Check whether the .m8 file was generated

!ls {search_result_path}.m8 || echo "No .m8 file found!"

# Parse the output

import pandas as pd
from Bio import SeqIO

# Specify the MMseqs2 output file

search_result_m8 = f"{search_result_path}.m8" # Path to the MMseqs2 output file

# Load the MMseqs2 output format

columns = ["query", "target", "pident", "alnlen", "mismatch", "gapopen", "qstart", "qend", "tstart", "tend", "evalue", "bits"]

try:
    results = pd.read_csv(search_result_m8, sep="\t", names=columns)

    # Extract query sequences with sequence identity below 80%
    filtered_results = results[results["pident"] < 80]
    unique_query_ids = set(filtered_results["query"])

    # Extract the matching sequences from the original FASTA
    filtered_sequences = {rec.id: rec for rec in SeqIO.parse(input_fasta, "fasta") if rec.id in unique_query_ids}
    output_fasta = "/content/filtered_sequences.fasta"

    # Save the extracted sequences in FASTA format
    with open(output_fasta, "w") as f:
        SeqIO.write(filtered_sequences.values(), f, "fasta")

    print(f"Saved the filtered sequences to: {output_fasta}")

except FileNotFoundError:
    print(f"Error: MMseqs2 output file not found at {search_result_m8}")
except Exception as e:
    print(f"Unexpected error: {e}")
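
For reference, the identity filter in the parsing step can be tried without an MMseqs2 run. This is a minimal sketch using only the standard library on two made-up rows. The rows, the sequence IDs, and the helper name `queries_below_identity` are hypothetical. It also assumes the third column holds identity as a percentage; depending on the output format in use, that column may instead be a fraction between 0 and 1, in which case the threshold would need to be 0.8.

```python
import csv
import io

# Column names copied from the script above (BLAST-tab style, 12 columns).
COLUMNS = ["query", "target", "pident", "alnlen", "mismatch", "gapopen",
           "qstart", "qend", "tstart", "tend", "evalue", "bits"]


def queries_below_identity(handle, threshold=80.0):
    """Return the set of query IDs with at least one hit below `threshold`."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    return {row["query"] for row in reader if float(row["pident"]) < threshold}


# Made-up rows for illustration only; this is not real MMseqs2 output.
SAMPLE = (
    "seqA\tseqB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200\n"
    "seqC\tseqD\t72.5\t80\t22\t0\t1\t80\t1\t80\t1e-20\t90\n"
)

low_identity_queries = queries_below_identity(io.StringIO(SAMPLE))
print(low_identity_queries)
```

Since the search is run with `--min-seq-id 0.8`, hits below that identity are not expected in the result, so a "below 80%" filter applied to that output may well come back empty.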

low deletions false
Filter MSA 1
Use filter only at N seqs 0
Maximum seq. id. threshold 0.9
Minimum seq. id. 0.0
Minimum score per column -20
Minimum coverage 0
Select N most diverse seqs 1000
Pseudo count mode 0
Profile output mode 0
Min codons in orf 30
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Overlap between sequences 0
Sequence split mode 1
Header split mode 0
Chain overlapping alignments 0
Merge query 1
Search type 3
Search iterations 1
Start sensitivity 4
Search steps 1
Exhaustive search mode false
Filter results during exhaustive search 0
Strand selection 1
LCA search mode false
Disk space limit 0
MPI runner
Force restart with latest tmp false
Remove temporary files false
Translation mode 0

ungappedprefilter ./mmseqs_work/db ./mmseqs_work/db_gpu ./mmseqs_work/tmp/14843528504956813129/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -c 0 -e 0.001 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --min-ungapped-score 15 --max-seqs 300 --db-load-mode 0 --gpu 1 --gpu-server 0 --prefilter-mode 1 --threads 4 --compressed 0 -v 3

[=================================================================] 100.00% 25.33K 3m 2s 739ms
Time for merging to pref_0: 0h 0m 0s 4ms
Time for processing: 0h 3m 2s 790ms
align ./mmseqs_work/db ./mmseqs_work/db_gpu ./mmseqs_work/tmp/14843528504956813129/pref_0 ./mmseqs_work/search_result --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.8 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 4 --compressed 0 -v 3

Compute score and coverage
Query database size: 25329 type: Aminoacid
Target database size: 25329 type: Aminoacid
Calculation of alignments
^C
ls: cannot access './mmseqs_work/search_result.m8': No such file or directory
No .m8 file found!
Error: MMseqs2 output file not found at ./mmseqs_work/search_result.m8

@martin-steinegger
Member

martin-steinegger commented Jan 2, 2025

align ./mmseqs_work/db ./mmseqs_work/db_gpu ./mmseqs_work/tmp/14843528504956813129/pref_0 ./mmseqs_work/search_result --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.8 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 4 --compressed 0 -v 3

Compute score and coverage
Query database size: 25329 type: Aminoacid
Target database size: 25329 type: Aminoacid
Calculation of alignments

It seems to be computing the Smith-Waterman (SW) alignment at this point. That step can be slow on Colab because its CPU cores are very weak. What exactly do you need: the score only, or the full alignment?
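
To make the two knobs in that question concrete, here is a minimal sketch of assembling the `mmseqs search` call from Python, with the thread count taken from the runtime instead of being fixed at 4 and with `--alignment-mode` (the flag visible in the `align` line of the log) exposed as an optional argument. The helper name `build_search_cmd` is hypothetical, the sketch leaves out `--search-type`, and it does not claim which alignment mode is appropriate together with `--min-seq-id`; `mmseqs search -h` lists what each value means in the installed build.

```python
import os
import shlex


def build_search_cmd(query_db, target_db, result_db, tmp_dir,
                     min_seq_id=0.8, use_gpu=True, alignment_mode=None,
                     threads=None):
    """Assemble an `mmseqs search` argument list (hypothetical helper)."""
    # Default to the number of CPUs the runtime reports.
    if threads is None:
        threads = os.cpu_count() or 1
    cmd = ["mmseqs", "search", query_db, target_db, result_db, tmp_dir,
           "--min-seq-id", str(min_seq_id),
           "--threads", str(threads)]
    if use_gpu:
        cmd += ["--gpu", "1"]
    # Only pass --alignment-mode when the caller asks for a specific mode.
    if alignment_mode is not None:
        cmd += ["--alignment-mode", str(alignment_mode)]
    return cmd


cmd = build_search_cmd("./mmseqs_work/db", "./mmseqs_work/db_gpu",
                       "./mmseqs_work/search_result", "./mmseqs_work/tmp",
                       threads=2)
# Print the command as a single shell-quoted string for inspection.
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where `mmseqs` is installed; building it as a list avoids shell quoting problems with paths.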
