
Suggestion: Maintain order of input to KRAKENUNIQ_PRELOADEDKRAKENUNIQ across runs #570

Open
muniheart opened this issue Jan 20, 2025 · 2 comments
Labels
enhancement Improvement for existing functionality

Comments

@muniheart

Description of feature

Thank you for writing and maintaining this excellent workflow. I suggest the following improvements to make profiling even friendlier.

  1. The grouping of input datasets into batches is not guaranteed to be the same across workflow runs, which prevents retrieval of cached results when resuming a run. Could you sort the datasets prior to batching?

  2. The task identifier is just '(db1)', as in
    name: NFCORE_TAXPROFILER:TAXPROFILER:PROFILING:KRAKENUNIQ_PRELOADEDKRAKENUNIQ (db1). It would be nice to have a more specific identifier, including the batch number (provided the issue above is resolved) or at least the sample identifier of one dataset in the batch.

  3. Might it be better to split the input datasets into batches based on dataset size or read count? This may help balance the run times across batches.
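To make points 1 and 2 concrete, here is a minimal Python sketch of the requested behaviour: sorting samples by a stable key before chunking makes the batches identical across runs (so `-resume` can hit the cache), and the batch index plus one sample id gives a more informative task tag. The sample ids, batch size, and tag format below are invented for illustration, not taken from the pipeline.

```python
# Deterministic batching: sort by a stable key before chunking so the same
# samples always land in the same batches, regardless of arrival order.
def make_batches(samples, batch_size):
    ordered = sorted(samples)  # stable, input-order independent
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

# The same samples arriving in a different order still yield the same batches:
run1 = make_batches(["s3", "s1", "s2", "s4"], batch_size=2)
run2 = make_batches(["s1", "s4", "s2", "s3"], batch_size=2)
assert run1 == run2 == [["s1", "s2"], ["s3", "s4"]]

# A more informative tag (point 2) could combine batch index and first sample:
tags = [f"batch{i}:{batch[0]}" for i, batch in enumerate(run1)]
print(tags)  # ['batch0:s1', 'batch1:s3']
```

In Nextflow terms this would correspond to collecting and sorting the channel before splitting it into batches, at the cost of waiting for all upstream steps, as discussed below.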

@muniheart muniheart added the enhancement Improvement for existing functionality label Jan 20, 2025
@jfy133
Member

jfy133 commented Jan 22, 2025

For 1 and 2 that should be possible I think, right @Midnighter ?

However, it would risk slowing down the pipeline, since forcing an order means waiting for all prior steps to complete.

For 3, it will be more difficult, and I'm not sure it's worth the effort (read count/dataset size doesn't necessarily correspond linearly to run time, as it depends on how many hits are actually valid versus reads that don't match anything).

@Midnighter
Collaborator

Midnighter commented Jan 22, 2025

Agreed, 1 and 2 are definitely possible. I think 2 is achievable with minimal changes and could be a good first-time contributor feature 😉

As you say @jfy133, 1 requires a global ordering, meaning all samples will have to have finished their upstream processing steps. I don't see that as a huge issue; it simply means that krakenuniq will be one of the last profilers to complete. I'd say that's worth it for caching. This change should also be quick to make, if you're up for submitting a PR 😉

In my experience with kraken2, every input read is tested, so there is an almost perfect correlation between number of reads and processing time. I assume krakenuniq will be similar. However, this is a larger change and I'm not sure it's worth it. Rather than trying to distribute your samples so that each node receives exactly one batch and they all complete at a similar time, try to use the batch size option so that each node receives multiple batches while you still observe speed-ups compared to the overhead involved. Does that make sense?
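For reference, the size-balanced batching of point 3 could be sketched with a greedy longest-processing-time heuristic: assign the largest dataset to the currently lightest batch. This is a hypothetical illustration only (the read counts and function name are made up), and it shows the extra machinery the comment above suggests avoiding in favour of simply tuning the batch size.

```python
import heapq

# Greedy LPT heuristic: process samples from largest to smallest, always
# adding the next sample to the batch with the lowest running total.
def balanced_batches(read_counts, n_batches):
    heap = [(0, i, []) for i in range(n_batches)]  # (total reads, batch index, samples)
    heapq.heapify(heap)
    for sample, reads in sorted(read_counts.items(), key=lambda kv: -kv[1]):
        total, i, members = heapq.heappop(heap)
        members.append(sample)
        heapq.heappush(heap, (total + reads, i, members))
    return sorted(heap, key=lambda t: t[1])

# Invented read counts; both batches end up with 140 reads total.
counts = {"a": 90, "b": 60, "c": 50, "d": 40, "e": 30, "f": 10}
for total, _, members in balanced_batches(counts, 2):
    print(total, members)  # 140 ['a', 'd', 'f'] / 140 ['b', 'c', 'e']
```

Note the caveat above: read count only approximates run time, so even a perfectly balanced split by reads may not balance wall-clock time.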
