Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BQSR: avoid throwing an error when read group is missing in the recal table, and some refactoring. #9020

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

takutosato
Copy link
Contributor

Addresses #6242.

Current behavior: when all the reads in a read group are filtered in the base recalibration step, the read group is not logged in the recal table. Then ApplyBQSR encounters these reads, can't find the read group in the recal table, and throws an error.

New behavior: if --allow-read-group flag is set to true, then ApplyBQSR outputs the original quantities (after quantizing).

I avoided the alternative approach of collapsing (marginalizing) across the read groups, mostly because it would require a complete overhaul of the code. I also think that using recal data from other read groups might not be a good idea. In any case, using OQ should be good enough; I assume that these "missing" read groups are low enough quality to be filtered out and are likely to be thrown out by downstream tools.

I also refactored the BQSR code, mostly to update the variable and class names to be more accurate and descriptive. For instance:

ReadCovariates.java -> PerReadCovariateMatrix.java
EstimatedQReported -> ReportedQuality

@davidbenjamin davidbenjamin self-requested a review November 4, 2024 14:55
@davidbenjamin davidbenjamin self-assigned this Nov 4, 2024
Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just needs another pass at some of your comments and TODOs.

build.gradle Outdated
@@ -188,7 +188,7 @@ configurations.all {
}

tasks.withType(JavaCompile) {
options.compilerArgs = ['-proc:none', '-Xlint:all', '-Werror', '-Xdiags:verbose']
options.compilerArgs = ['-proc:none', '-Xlint:all', '-Xdiags:verbose']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is beyond my job description but I trust there's a reason?

* If set to true, do not throw an error upon encountering a read with a read group that's not in the recalibration table.
* Instead, simply set the quantized original base qualities as the recalibrated base qualities.
*/ // tsato: should this be in the *unique* ApplyBQSRArgumentCollection?
@Argument(fullName = ALLOW_MISSING_READ_GROUPS_LONG_NAME, doc = "", optional = true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty doc string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -90,7 +90,7 @@ public final class ApplyBQSR extends ReadWalker{
*/
@Argument(fullName=StandardArgumentDefinitions.BQSR_TABLE_LONG_NAME, shortName=StandardArgumentDefinitions.BQSR_TABLE_SHORT_NAME, doc="Input recalibration table for BQSR")
@WorkflowInput
public File BQSR_RECAL_FILE;
public File bqsrRecalFile; // tsato: change to GATKPath?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Address this TODO

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -26,14 +28,14 @@

public final class BQSRReadTransformer implements ReadTransformer {
private static final long serialVersionUID = 1L;

// tsato: remove?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Address

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


//clear indel qualities
// clear indel qualities // tsato: why? Is this ok? Do we update them?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Address

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

for ( final NestedIntegerArray.Leaf<RecalDatum> leaf : byQualTable.getAllLeaves() ) {
final int rgKey = leaf.keys[0];
final int eventIndex = leaf.keys[2];
final int rgKey = leaf.keys[readGroupIndex]; // tsato: ??? What are these indices?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -32,7 +32,7 @@
public final class RecalUtils {
public static final String ARGUMENT_REPORT_TABLE_TITLE = "Arguments";
public static final String QUANTIZED_REPORT_TABLE_TITLE = "Quantized";
public static final String READGROUP_REPORT_TABLE_TITLE = "RecalTable0";
public static final String READGROUP_REPORT_TABLE_TITLE = "RecalTable0"; // TODO: make them more descriptive
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

private static boolean warnUserNullPlatform = false;

private static final String SCRIPT_FILE = "BQSR.R";
public static final int EMPIRICAL_QUAL_DECIMAL_PLACES = 4;
public static final int EMPIRICAL_Q_REPORTED_DECIMAL_PLACES = 4;
public static final int REPORTED_QUALITY_DECIMAL_PLACES = 4; // tsato: "estimated" q reported...we need to rename (DONE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

public final class ContextCovariate implements Covariate {
private static final long serialVersionUID = 1L;
private static final Logger logger = LogManager.getLogger(ContextCovariate.class);

private final int mismatchesContextSize;
private final int mismatchesContextSize; // TODO: rename "mismatch" here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -136,27 +152,30 @@ public double getEmpiricalErrorRate() {
return doubleMismatches / doubleObservations;
}
}

public void setEmpiricalQuality(final double empiricalQuality) {
// tsato: int-double
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clean up comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@takutosato I took a look. Definitely want to put back the gradle build to how it was. If you need help with the warning we can help. I'm not sure I totally understand the motivation for changing the internal quality value from a double to an int? I think you changed if to being PHRED scaled, but it seems like that might change the precision of the calculation which could change the results? Or is it always output as a PHRED value before further calculations are made? It's not clear to me.

Thank you for the clarifying comments.

Lots of unaddressed TODOs. Some of them probably should be removed but if you think that some of them are useful to future readers than that's fine.

build.gradle Outdated
@@ -188,7 +188,7 @@ configurations.all {
}

tasks.withType(JavaCompile) {
options.compilerArgs = ['-proc:none', '-Xlint:all', '-Werror', '-Xdiags:verbose']
options.compilerArgs = ['-proc:none', '-Xlint:all', '-Xdiags:verbose']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What warning are we seeing that made you take this out?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets fix the warning or supress it instead of turning this off.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been reverted.

* If set to true, do not throw an error upon encountering a read with a read group that's not in the recalibration table.
* Instead, simply set the quantized original base qualities as the recalibrated base qualities.
*/ // tsato: should this be in the *unique* ApplyBQSRArgumentCollection?
@Argument(fullName = ALLOW_MISSING_READ_GROUPS_LONG_NAME, doc = "", optional = true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot the doc string.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this note. I think this argument should be moved up. to ApplyBQSRUniqueArgumentCollection. It is only relevant to the apply stage and not to the recalibration stage so that makes sense for it to be there. Also, I think you need to update the method
ApplyBQSRUniqueArgumentCollection.toApplyBQSRArgumentCollection to pass this argument to the newly created collection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can do away with this UniqueArgumentCollection construction --- for another time.

@@ -90,7 +90,7 @@ public final class ApplyBQSR extends ReadWalker{
*/
@Argument(fullName=StandardArgumentDefinitions.BQSR_TABLE_LONG_NAME, shortName=StandardArgumentDefinitions.BQSR_TABLE_SHORT_NAME, doc="Input recalibration table for BQSR")
@WorkflowInput
public File BQSR_RECAL_FILE;
public File bqsrRecalFile; // tsato: change to GATKPath?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be a good thing to do but we could do it in a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@@ -26,14 +28,14 @@

public final class BQSRReadTransformer implements ReadTransformer {
private static final long serialVersionUID = 1L;

// tsato: remove?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it can be removed.

private byte[] staticQuantizedMapping;
private final CovariateKeyCache keyCache;

// tsato: new flag, needs to be moved to apply BQSR argument
private boolean allowMissingReadGroup;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. allowMissingReadGroup -> allowMissingReadGroups
  2. make this final
  3. I think the comment is resolved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -92,6 +92,8 @@ public abstract class GATKBaseTest extends BaseTest {
// Variants from a DBSNP 138 VCF form the first 65Mb of chr1
public static final String dbsnp_138_b37_1_65M_vcf = largeFileTestDir + "dbsnp_138.b37.1.1-65M.vcf";

public static final String BQSR_TEST_RESOURCE_DIR = toolsTestDir + "BQSR/";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like the wrong place for this constant. I'd put it in a BQSR test or maybe just use the CommandLineProgram.getTestDataDir() if that works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

IllegalStateException.class);
spec.executeTest("testMissingReadGroup", this);
public void testMissingReadGroup() {
// TODO: there are two copies of this file (one under Validation and one under BQSR) in the repo; conslidate.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine. Duplicated test files are alright. Git deduplicates them internally for transfer so they don't really take up meaningful amounts of room.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good!

final byte[] oldQuals = ReadUtils.getOriginalBaseQualities(originalRead);

if (ReadUtils.getPlatformUnit(originalRead, originalBamHeader).equals(readGroupToFilterOut)) {
// These are the read groups that are not in teh recal table.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


if (ReadUtils.getPlatformUnit(originalRead, originalBamHeader).equals(readGroupToFilterOut)) {
// These are the read groups that are not in teh recal table.
// final Random random = new Random();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be deleted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@@ -97,12 +97,12 @@ public void testRecalDatumCopyAndCombine(RecalDatumTestProvider cfg) {
@Test(dataProvider = "RecalDatumTestProvider")
public void testRecalDatumModification(RecalDatumTestProvider cfg) {
RecalDatum datum = cfg.makeRecalDatum();
datum.setEmpiricalQuality(10.1);
Assert.assertEquals(datum.getEmpiricalQuality(), 10.1);
datum.setEmpiricalQuality(10); // tsato: 10.1 -> 10. Downstream data might be affected.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does changing the quality from double -> int potentially change the results of the recalibration table by rounding earlier? I'm not sure what the advantage of that is?

@takutosato
Copy link
Contributor Author

@lbergelson thanks for the review! I will address the changes as soon as I can.

The motivation for the double -> int change is in RecalDatum::bayesianEstimateOfEmpiricalQuality(). In the master branch now, this method always returns a (phred-scaled integer) / RESOLUTION_BINS_PER_QUAL, and RESOLUTION_BINS_PER_QUAL is a final constant set to 1. So this method really returns an integer, and my change attempts to make this clear.

@gatk-bot
Copy link

gatk-bot commented Nov 19, 2024

Github actions tests reported job failures from actions build 11903509453
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud 17.0.6+10 11903509453.10 logs
unit 17.0.6+10 11903509453.12 logs
integration 17.0.6+10 11903509453.11 logs
unit 17.0.6+10 11903509453.1 logs
integration 17.0.6+10 11903509453.0 logs

@gatk-bot
Copy link

gatk-bot commented Jan 10, 2025

Github actions tests reported job failures from actions build 12702971619
Failures in the following jobs:

Test Type JDK Job ID Logs
unit 17.0.6+10 12702971619.12 logs
unit 17.0.6+10 12702971619.1 logs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants