profile hapestry memory usage and reduce it #55

Open · Tracked by #15
rlorigro opened this issue Oct 22, 2024 · 10 comments

rlorigro commented Oct 22, 2024

[images]

Hapestry has poor CPU utilization and excessive RAM usage when run on small chunks of the 1074 AoU VCFs.

It appears that the majority of the RAM usage actually comes from the alignment+optimization step, so more work is needed to reduce the input size of the problem to the solver. More fine-grained profiling is also needed to fully understand the source of the allocations.

rlorigro commented

[image]

This diagram roughly summarizes the data structures in use. Targets for optimization in order of (intuitive) priority are:

  • Solver: reduce the redundancy in the input data to the model to prevent excessive branching in the solver
  • Reads: the sequences stored in the transmap as plain strings can be compressed into 2-bit integer vectors
  • GraphAligner: could be replaced with a better alternative
  • Transmap: has a backend which encodes the relationships between samples, reads, and haps as a graph with sink nodes to access each node type. This could be simplified (see the sketch after this list).
  • Ref sequence: could be compressed (at most 3 GB of savings)
  • VcfRecords: are kept in memory upfront for each window; could this be reduced?
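
For illustration only, here is a minimal sketch of the kind of graph backend described in the Transmap item above, with one sink node per node type; the names (Node, TransmapSketch, add_node) are hypothetical and not hapestry's actual code:

    // Hypothetical transmap-style backend: samples, reads and haps are all nodes in
    // one adjacency structure, and one "sink" node per type gives access to all
    // nodes of that type.
    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    enum class NodeType {SAMPLE, READ, HAP, SINK};

    struct Node {
        NodeType type;
        std::string name;
        std::unordered_set<int64_t> neighbors;   // edges to other node ids
    };

    struct TransmapSketch {
        std::vector<Node> nodes;                          // id = index into this vector
        std::unordered_map<std::string,int64_t> id_of;    // name -> id
        int64_t sample_sink;
        int64_t read_sink;
        int64_t hap_sink;

        TransmapSketch(){
            sample_sink = add_node(NodeType::SINK, "sample_sink", -1);
            read_sink   = add_node(NodeType::SINK, "read_sink",   -1);
            hap_sink    = add_node(NodeType::SINK, "hap_sink",    -1);
        }

        int64_t add_node(NodeType type, const std::string& name, int64_t sink){
            nodes.push_back({type, name, {}});
            int64_t id = int64_t(nodes.size()) - 1;
            id_of[name] = id;
            if (sink >= 0){ add_edge(sink, id); }   // reachable from its type's sink
            return id;
        }

        void add_edge(int64_t a, int64_t b){
            nodes[a].neighbors.insert(b);
            nodes[b].neighbors.insert(a);
        }
    };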

rlorigro commented

I implemented a BinarySequence class which encodes each sequence as a vector<int> of packed 2-bit bases plus an int32_t length. It has an asymptotic compression ratio of 1:4 relative to plain strings. Profiling a small local test gives these results:
[image]
Orange shows chr1 of the reference FASTA being loaded as a string; green then shows it additionally allocated as a bit vector.
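
For reference, a minimal sketch of this kind of 2-bit packing (the class and method names below are illustrative, not the actual BinarySequence implementation, which stores a vector<int> plus an int32_t length):

    // Illustrative 2-bit packed DNA sequence: 16 bases per 32-bit word, so the
    // asymptotic footprint is ~1/4 of a plain std::string (1 byte per base).
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    class PackedSequence {
        std::vector<uint32_t> words;   // 16 bases per 32-bit word
        int32_t length = 0;

        static uint32_t encode(char c){
            switch (c){
                case 'A': return 0;
                case 'C': return 1;
                case 'G': return 2;
                case 'T': return 3;
                default: throw std::runtime_error("non-ACGT base");
            }
        }

    public:
        void push_back(char c){
            size_t word  = size_t(length) / 16;
            size_t shift = (size_t(length) % 16) * 2;
            if (word >= words.size()){ words.push_back(0); }
            words[word] |= encode(c) << shift;
            length++;
        }

        char at(int32_t i) const {
            uint32_t code = (words[size_t(i)/16] >> ((size_t(i)%16)*2)) & 3u;
            return "ACGT"[code];
        }

        int32_t size() const { return length; }
    };

Note that a scheme like this only covers ACGT; ambiguous bases (N, IUPAC codes) would need separate handling.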

rlorigro commented

However, when running on Terra on the entirety of chr1 with 47 samples, a much less dramatic difference in RAM usage is observed:

Binary: 18.89 GB
String: 23.40 GB

Also, with the additional RAM checkpoint I added, it appears that GraphAligner and other postprocessing take a significant amount of RAM.

Preprocessing windows (loading reads, VcfRecords):
[0h 1m 30s 637ms] Peak memory usage: 11.21 GB

Overall peak (after performing alignments and storing the CSVs, no solver):
[0h 41m 27s 998ms] Peak memory usage: 18.89 GB

This indicates approximately:

  • 1 GB used by the compressed reads
  • 7 GB used by GraphAligner and other postprocessing
  • 3 GB used by reference sequence
  • ~8 GB unaccounted for (could be Transmap backend, VcfRecords, etc)

More profiling is needed
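
For context, a peak-memory checkpoint like the ones logged above can be read on Linux from /proc/self/status; the following is a minimal standalone sketch, not necessarily how hapestry's checkpoint is implemented:

    // Read a peak-memory field from /proc/self/status on Linux.
    // "VmHWM" is the peak resident set size and "VmPeak" the peak virtual size,
    // both reported in kB.
    #include <fstream>
    #include <iostream>
    #include <string>

    long read_status_kb(const std::string& key){
        std::ifstream file("/proc/self/status");
        std::string token;
        while (file >> token){
            if (token == key + ":"){
                long kb = 0;
                file >> kb;          // value in kB
                return kb;
            }
        }
        return -1;                   // field not found
    }

    int main(){
        std::cout << "Peak RSS: " << read_status_kb("VmHWM") << " kB\n";
        std::cout << "Peak virtual: " << read_status_kb("VmPeak") << " kB\n";
        return 0;
    }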

rlorigro commented Oct 22, 2024

Profiling the Transmap, assuming fully connected read-paths (worst-case), with these parameters:

    size_t coverage = 8;                   // reads per sample
    size_t n_reads = n_samples*coverage;   // total reads
    size_t n_paths = 100;                  // candidate paths (haps)

across a range of n_samples, I get these results:

n_samples   RAM used (MB)
1           6.656
10          6.912
100         9.984
1000        37.628
10000       352.064
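
As a back-of-envelope check, the fully connected worst case above implies n_reads * n_paths = n_samples * coverage * n_paths read-hap edges; a small illustrative calculation (not part of hapestry itself):

    // Worst-case (fully connected) read-hap edge counts for the parameters above.
    #include <cstddef>
    #include <cstdio>

    int main(){
        const size_t coverage = 8;
        const size_t n_paths  = 100;

        for (size_t n_samples : {1, 10, 100, 1000, 10000}){
            size_t n_reads = n_samples * coverage;   // reads in the transmap
            size_t n_edges = n_reads * n_paths;      // every read connected to every path
            printf("n_samples=%zu -> %zu read-hap edges\n", n_samples, n_edges);
        }
        return 0;
    }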

In particular, it appears that the majority of the RAM usage comes from the fully connected edges:
[image]
The small blip at the end shows the size of the Transmap without the read-hap edges. Luckily, these edges are not allocated until the thread function reaches each window, and then they should (hopefully) be deallocated.

More profiling is still needed. The next step will be to run a full hapestry job with valgrind...

rlorigro commented Oct 22, 2024

Profiling a very small local test region with 47 samples yields some useful information:
[image]

🟥 - (persistent portion) mysterious overhead omitted at the resolution/threshold used for this run
🟧 - VcfRecords appear to be a large initial source of RAM usage, but they are properly deallocated during thread processing
🟨 - (std::__cxx11::basic_string) ref sequence stays allocated until end of executable (could be avoided?)
🟩 - (mm_allocator_...) WFA2 initially consumes significant RAM (at this scale) but quickly deallocates
🟩 🟦 🟪 - (BMS...) Ortools dominates the memory usage during late thread processing

Transmap/Heterograph/BinarySequence all appear negligible (at this scale) but will consume more for larger chunks and more samples.

rlorigro commented

Unfortunately, when running on a full task on Terra, valgrind slows the program down to the point where it will likely never finish. A much faster alternative should be tried, such as https://github.com/KDE/heaptrack, or some other profiler.

rlorigro commented

  1. Trying out different profilers for full-scale runs (47 HPRC)
    X - Valgrind massif: awesome but way too slow
    X - Heaptrack: faster but broken
    X - Tracy: doesn't compile on Ubuntu (maybe possible)

rlorigro commented

I added a step that will remove contigs from the reference sequences if they are not needed for the windows that were generated. Since we often run hapestry in chunks, we rarely need all the contigs.

[image]
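
A minimal sketch of this kind of contig filtering (the Window struct and function name are assumptions for illustration, not the actual hapestry code):

    // Erase reference contigs that no window refers to, freeing their sequences.
    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    // Hypothetical window type: only the contig name matters here.
    struct Window {
        std::string contig;
        int64_t start;
        int64_t stop;
    };

    void drop_unused_contigs(
            std::unordered_map<std::string,std::string>& ref_sequences,
            const std::vector<Window>& windows){
        std::unordered_set<std::string> needed;
        for (const auto& w : windows){ needed.insert(w.contig); }

        for (auto it = ref_sequences.begin(); it != ref_sequences.end(); ){
            if (needed.count(it->first) == 0){
                it = ref_sequences.erase(it);   // free the unused contig sequence
            } else {
                ++it;
            }
        }
    }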

rlorigro commented

After working on the following optimizations, the run time and memory usage of the solver have been reduced:

  1. Normalization simplification - do not find n_min given d_min; instead accept whatever n value d_min yields
  2. Pruning haps based on the d_min solution - no hap that is not the optimum of some sample is considered for the joint model (see the sketch after this list)
  3. Not loading the entire reference FASTA - only keep in memory the FASTA contigs that are observed in the windows/VCF
  4. Binarizing read seqs - use 2-bit nt encoding for all extracted read sequences
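
To illustrate item 2, a hedged sketch of the pruning step (the per-sample data layout and function names are assumptions, not the actual solver code):

    // Keep only haps that were optimal for at least one sample in the d_min
    // solution; all other haps are excluded from the joint model.
    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    // Hypothetical input: for each sample, the hap ids that were optimal in the
    // per-sample d_min solution.
    std::unordered_set<int64_t> collect_optimal_haps(
            const std::unordered_map<std::string, std::vector<int64_t>>& optima_per_sample){
        std::unordered_set<int64_t> keep;
        for (const auto& item : optima_per_sample){
            keep.insert(item.second.begin(), item.second.end());
        }
        return keep;
    }

    std::vector<int64_t> prune_haps(
            const std::vector<int64_t>& candidate_haps,
            const std::unordered_set<int64_t>& keep){
        std::vector<int64_t> pruned;
        for (int64_t hap_id : candidate_haps){
            if (keep.count(hap_id)){ pruned.push_back(hap_id); }
        }
        return pruned;
    }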

Total time to solve
[image]

Summary stats from each of the 8 chunks of chr1 on HPRC 47

Pre-optimization

Time     Peak memory (kB)
26m 28s  10397388
18m 23s  10359451
17m 34s  10518802
20m 57s  11414675
22m 1s   26375712
19m 56s  13903462
18m 56s  17339011
27m 57s  10310639

Post-optimization

Time     Peak memory (kB)
21m 10s  5266212
14m 21s  3704496
14m 24s  4310080
16m 5s   3538088
18m 21s  14570548
15m 28s  4172416
14m 35s  4924260
21m 53s  4517428

*These RAM numbers may not be directly comparable because I switched from measuring VmPeak (peak virtual memory) to VmHWM (peak resident set size) between the two runs.

rlorigro commented

In addition to this, no negative effect on accuracy/parsimony was observed:

vcfdist results from a parameter sweep on the chr1:100-110Mbp region
[image]

Graph evaluation using the all_hprc_04n_07_rescale parameter choice on CHM13 chr1, HPRC 47, 10 bp min SV length
[image]
