profile hapestry memory usage and reduce it #55

Open · Tracked by #15
rlorigro opened this issue Oct 22, 2024 · 10 comments

rlorigro commented Oct 22, 2024

[images]

Hapestry has poor CPU utilization and excessive RAM usage when run on small chunks of the 1074 AoU VCFs.

It appears that the majority of the RAM usage actually comes from the alignment+optimization step, so more work is needed to reduce the input size of the problem to the solver. More fine-grained profiling is also needed to fully understand the source of the allocations.

rlorigro commented

[image]

This diagram roughly summarizes the data structures in use. Targets for optimization in order of (intuitive) priority are:

  • Solver: reduce the redundancy in the input data to the model to prevent excessive branching in the solver
  • Reads: the sequences stored in the transmap as plain strings can be compressed into 2-bit integer vectors
  • GraphAligner: could be replaced with a better alternative
  • Transmap: has a backend which encodes the relationships between samples, reads, and haps as a graph with sink nodes to access each node type. This could be simplified (see the sketch after this list).
  • Ref sequence: could be compressed (at most 3 GB of savings)
  • VcfRecords: are kept in memory upfront for each window; could this be reduced?
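
For illustration only, here is a minimal sketch of the kind of graph backend described in the Transmap item above, with one sink node per node type; the names (Node, TransmapSketch, add_node) are hypothetical and not hapestry's actual code:

    // Hypothetical transmap-style backend: samples, reads and haps are all nodes in
    // one adjacency structure, and one "sink" node per type gives access to all
    // nodes of that type.
    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    enum class NodeType {SAMPLE, READ, HAP, SINK};

    struct Node {
        NodeType type;
        std::string name;
        std::unordered_set<int64_t> neighbors;   // edges to other node ids
    };

    struct TransmapSketch {
        std::vector<Node> nodes;                          // id = index into this vector
        std::unordered_map<std::string,int64_t> id_of;    // name -> id
        int64_t sample_sink;
        int64_t read_sink;
        int64_t hap_sink;

        TransmapSketch(){
            sample_sink = add_node(NodeType::SINK, "sample_sink", -1);
            read_sink   = add_node(NodeType::SINK, "read_sink",   -1);
            hap_sink    = add_node(NodeType::SINK, "hap_sink",    -1);
        }

        int64_t add_node(NodeType type, const std::string& name, int64_t sink){
            nodes.push_back({type, name, {}});
            int64_t id = int64_t(nodes.size()) - 1;
            id_of[name] = id;
            if (sink >= 0){ add_edge(sink, id); }   // reachable from its type's sink
            return id;
        }

        void add_edge(int64_t a, int64_t b){
            nodes[a].neighbors.insert(b);
            nodes[b].neighbors.insert(a);
        }
    };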

rlorigro commented

I implemented a BinarySequence class which encodes each sequence as a vector<int> of packed 2-bit bases plus an int32_t length. It has an asymptotic compression ratio of 1:4 relative to plain strings. Profiling a small local test gives these results:
[image]
Orange shows chr1 of the reference FASTA being loaded as a string; green then shows it additionally allocated as a bit vector.
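
For reference, a minimal sketch of this kind of 2-bit packing (the class and method names below are illustrative, not the actual BinarySequence implementation, which stores a vector<int> plus an int32_t length):

    // Illustrative 2-bit packed DNA sequence: 16 bases per 32-bit word, so the
    // asymptotic footprint is ~1/4 of a plain std::string (1 byte per base).
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    class PackedSequence {
        std::vector<uint32_t> words;   // 16 bases per 32-bit word
        int32_t length = 0;

        static uint32_t encode(char c){
            switch (c){
                case 'A': return 0;
                case 'C': return 1;
                case 'G': return 2;
                case 'T': return 3;
                default: throw std::runtime_error("non-ACGT base");
            }
        }

    public:
        void push_back(char c){
            size_t word  = size_t(length) / 16;
            size_t shift = (size_t(length) % 16) * 2;
            if (word >= words.size()){ words.push_back(0); }
            words[word] |= encode(c) << shift;
            length++;
        }

        char at(int32_t i) const {
            uint32_t code = (words[size_t(i)/16] >> ((size_t(i)%16)*2)) & 3u;
            return "ACGT"[code];
        }

        int32_t size() const { return length; }
    };

Note that a scheme like this only covers ACGT; ambiguous bases (N, IUPAC codes) would need separate handling.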

rlorigro commented

However, when running on Terra on the entirety of chr1 with 47 samples, a much less dramatic difference in RAM usage is observed:

Binary: 18.89 GB
String: 23.40 GB

Also, with the additional RAM checkpoint I added, it appears that GraphAligner and other postprocessing take a significant amount of RAM.

Preprocessing windows (loading reads, VcfRecords):
[0h 1m 30s 637ms] Peak memory usage: 11.21 GB

Overall peak (after performing alignments and storing the CSVs, no solver):
[0h 41m 27s 998ms] Peak memory usage: 18.89 GB

This indicates approximately:

  • 1 GB used by the compressed reads
  • 7 GB used by GraphAligner and other postprocessing
  • 3 GB used by reference sequence
  • ~8 GB unaccounted for (could be Transmap backend, VcfRecords, etc)

More profiling is needed
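
For context, a peak-memory checkpoint like the ones logged above can be read on Linux from /proc/self/status; the following is a minimal standalone sketch, not necessarily how hapestry's checkpoint is implemented:

    // Read a peak-memory field from /proc/self/status on Linux.
    // "VmHWM" is the peak resident set size and "VmPeak" the peak virtual size,
    // both reported in kB.
    #include <fstream>
    #include <iostream>
    #include <string>

    long read_status_kb(const std::string& key){
        std::ifstream file("/proc/self/status");
        std::string token;
        while (file >> token){
            if (token == key + ":"){
                long kb = 0;
                file >> kb;          // value in kB
                return kb;
            }
        }
        return -1;                   // field not found
    }

    int main(){
        std::cout << "Peak RSS: " << read_status_kb("VmHWM") << " kB\n";
        std::cout << "Peak virtual: " << read_status_kb("VmPeak") << " kB\n";
        return 0;
    }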

rlorigro commented Oct 22, 2024

Profiling the Transmap, assuming fully connected read-paths (worst-case), with these parameters:

    size_t coverage = 8;                   // reads per sample
    size_t n_reads = n_samples*coverage;   // total reads
    size_t n_paths = 100;                  // candidate paths (haps)

across a range of n_samples, I get these results:

n_samples   RAM used (MB)
1           6.656
10          6.912
100         9.984
1000        37.628
10000       352.064
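
As a back-of-envelope check, the fully connected worst case above implies n_reads * n_paths = n_samples * coverage * n_paths read-hap edges; a small illustrative calculation (not part of hapestry itself):

    // Worst-case (fully connected) read-hap edge counts for the parameters above.
    #include <cstddef>
    #include <cstdio>

    int main(){
        const size_t coverage = 8;
        const size_t n_paths  = 100;

        for (size_t n_samples : {1, 10, 100, 1000, 10000}){
            size_t n_reads = n_samples * coverage;   // reads in the transmap
            size_t n_edges = n_reads * n_paths;      // every read connected to every path
            printf("n_samples=%zu -> %zu read-hap edges\n", n_samples, n_edges);
        }
        return 0;
    }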

In particular, it appears that the majority of the RAM usage comes from the fully connected edges:
[image]
The small blip at the end shows the size of the Transmap without the read-hap edges. Luckily, these edges are not allocated until the thread function reaches each window, and then they should (hopefully) be deallocated.

More profiling is still needed. The next step will be to run a full hapestry job with valgrind...

rlorigro commented Oct 22, 2024

Profiling a very small local test region with 47 samples yields some useful information:
[image]

🟥 - (persistent portion) mysterious overhead omitted at the resolution/threshold used for this run
🟧 - VcfRecords appear to be a large initial source of RAM usage, but they are properly deallocated during thread processing
🟨 - (std::__cxx11::basic_string) ref sequence stays allocated until end of executable (could be avoided?)
🟩 - (mm_allocator_...) WFA2 initially consumes significant RAM (at this scale) but quickly deallocates
🟩 🟦 🟪 - (BMS...) Ortools dominates the memory usage during late thread processing

Transmap/Heterograph/BinarySequence all appear negligible (at this scale) but will consume more for larger chunks and more samples.

rlorigro commented

Unfortunately, when running on a full task on Terra, valgrind slows the program down to the point where it will likely never finish. A much faster alternative should be tried, such as https://github.com/KDE/heaptrack, or some other profiler.

rlorigro commented

  1. Trying out different profilers for full-scale runs (47 HPRC)
    X - Valgrind massif: awesome but way too slow
    X - Heaptrack: faster but broken
    X - Tracy: doesn't compile on Ubuntu (maybe possible)

rlorigro commented

I added a step that will remove contigs from the reference sequences if they are not needed for the windows that were generated. Since we often run hapestry in chunks, we rarely need all the contigs.

[image]
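
A minimal sketch of this kind of contig filtering (the Window struct and function name are assumptions for illustration, not the actual hapestry code):

    // Erase reference contigs that no window refers to, freeing their sequences.
    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    // Hypothetical window type: only the contig name matters here.
    struct Window {
        std::string contig;
        int64_t start;
        int64_t stop;
    };

    void drop_unused_contigs(
            std::unordered_map<std::string,std::string>& ref_sequences,
            const std::vector<Window>& windows){
        std::unordered_set<std::string> needed;
        for (const auto& w : windows){ needed.insert(w.contig); }

        for (auto it = ref_sequences.begin(); it != ref_sequences.end(); ){
            if (needed.count(it->first) == 0){
                it = ref_sequences.erase(it);   // free the unused contig sequence
            } else {
                ++it;
            }
        }
    }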

rlorigro commented

After working on the following optimizations, the run time and memory usage of the solver have been reduced:

  1. Normalization simplification - do not find n_min given d_min; instead accept whatever n value d_min yields
  2. Pruning haps based on the d_min solution - no hap that is not the optimum of some sample is considered for the joint model (see the sketch after this list)
  3. Not loading the entire reference FASTA - only keep in memory the FASTA contigs that are observed in the windows/VCF
  4. Binarizing read seqs - use 2-bit nt encoding for all extracted read sequences
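
To illustrate item 2, a hedged sketch of the pruning step (the per-sample data layout and function names are assumptions, not the actual solver code):

    // Keep only haps that were optimal for at least one sample in the d_min
    // solution; all other haps are excluded from the joint model.
    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    // Hypothetical input: for each sample, the hap ids that were optimal in the
    // per-sample d_min solution.
    std::unordered_set<int64_t> collect_optimal_haps(
            const std::unordered_map<std::string, std::vector<int64_t>>& optima_per_sample){
        std::unordered_set<int64_t> keep;
        for (const auto& item : optima_per_sample){
            keep.insert(item.second.begin(), item.second.end());
        }
        return keep;
    }

    std::vector<int64_t> prune_haps(
            const std::vector<int64_t>& candidate_haps,
            const std::unordered_set<int64_t>& keep){
        std::vector<int64_t> pruned;
        for (int64_t hap_id : candidate_haps){
            if (keep.count(hap_id)){ pruned.push_back(hap_id); }
        }
        return pruned;
    }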

Total time to solve
[image]

Summary stats from each of the 8 chunks of chr1 on HPRC 47

Pre-optimization

Time     Peak memory (kB)
26m 28s  10397388
18m 23s  10359451
17m 34s  10518802
20m 57s  11414675
22m 1s   26375712
19m 56s  13903462
18m 56s  17339011
27m 57s  10310639

Post-optimization

Time     Peak memory (kB)
21m 10s  5266212
14m 21s  3704496
14m 24s  4310080
16m 5s   3538088
18m 21s  14570548
15m 28s  4172416
14m 35s  4924260
21m 53s  4517428

*These RAM numbers may not be directly comparable because I switched from measuring VmPeak (peak virtual memory) to VmHWM (peak resident set size) between the two runs.

rlorigro commented

In addition to this, no negative effect on accuracy/parsimony was observed:

vcfdist results from a parameter sweep on the chr1:100-110Mbp region
[image]

Graph evaluation using the all_hprc_04n_07_rescale parameter choice on CHM13 chr1, HPRC 47, 10 bp min SV length
[image]
