[WIP] ncmec: store checkpoint occasionally when start, end diff is one second #1731

Open
wants to merge 6 commits into
base: main

prenner
Contributor

@prenner prenner commented Jan 9, 2025

Summary

Sometimes the NCMEC fetch fails to make progress after hitting a second with a large number of results: #1679. When that happens (the difference between end and start is one second and there is a lot of data), store checkpoints occasionally via a next pointer.

Test Plan

confirmed that resuming from a checkpoint works around the cursed second

@prenner prenner force-pushed the prenner/checkpoint-ncmec branch from 928ddce to 4f12e50 Compare January 9, 2025 16:14
@prenner prenner changed the title ncmec: store checkpoint occasionally when start, end diff is one second [WIP] ncmec: store checkpoint occasionally when start, end diff is one second Jan 9, 2025
@prenner prenner force-pushed the prenner/checkpoint-ncmec branch 4 times, most recently from 2965a46 to 5270515 Compare January 9, 2025 17:12
@prenner prenner force-pushed the prenner/checkpoint-ncmec branch from 5270515 to d7f207e Compare January 9, 2025 17:54
Contributor

@Dcallies Dcallies left a comment

Overall looking good, thanks for making this change, and I think it will help a lot!

I am slightly suspicious that the paging URLs can go sour (e.g. I have noticed that the NCMEC API tends to throw exceptions near the very end of the paging list, which makes me think the URLs are being invalidated), so I think adding the time-based invalidation logic is a requirement.

As part of your test plan, can you also attempt fetching past an extremely dense time segment in the NCMEC API and confirm the behavior works as expected?
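A minimal sketch of what time-based invalidation could look like, assuming the checkpoint stores a paging URL together with the time it was obtained. The names `PagingCheckpoint`, `usable_next_url`, and `MAX_NEXT_URL_AGE_SEC` are hypothetical, and the 12-hour threshold is a placeholder: this thread does not say how long NCMEC paging URLs actually stay valid.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical threshold: how long a stored paging URL is trusted.
MAX_NEXT_URL_AGE_SEC = 12 * 60 * 60


@dataclass
class PagingCheckpoint:
    """Illustrative stand-in for a checkpoint, not the real NCMECCheckpoint."""

    max_timestamp: int
    next_url: str = ""
    last_fetch_time: int = field(default_factory=lambda: int(time.time()))


def usable_next_url(cp: PagingCheckpoint, now: Optional[int] = None) -> Optional[str]:
    """Return the stored paging URL only if it is recent enough to trust."""
    if now is None:
        now = int(time.time())
    if not cp.next_url:
        return None
    if now - cp.last_fetch_time > MAX_NEXT_URL_AGE_SEC:
        # Too old: the caller should fall back to a time-range fetch
        # starting from max_timestamp instead of following the URL.
        return None
    return cp.next_url
```

With this shape, a stale pointer degrades to the existing time-range behavior rather than producing an API error on resume.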


updates.extend(entry.updates)

if i % 100 == 0:
Contributor

blocking: by changing this from elif to if, I think it will now print the large update warning on every update, which is incorrect, no?

Contributor Author

@prenner prenner Jan 22, 2025

it would print for the 0th iteration, which we would not want. I updated this to be (i + 1) % 100 == 0, so it fires on every 100th iteration

we need to extend updates every time, regardless of i, so this was cleaner than other things I thought of
but please suggest alternatives
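A small sketch of the loop shape described in this reply: extend on every iteration, log only on every 100th. `collect_updates` and its `batches` and `log` parameters are hypothetical stand-ins for the real fetch loop.

```python
from typing import Callable, Iterable, List


def collect_updates(
    batches: Iterable[List[int]], log: Callable[[str], None] = print
) -> List[int]:
    """Accumulate every batch, logging progress on the 100th, 200th, ... batch."""
    updates: List[int] = []
    for i, batch in enumerate(batches):
        updates.extend(batch)  # always accumulate, regardless of i
        # `i % 100 == 0` would also fire at i == 0; shifting by one avoids that.
        if (i + 1) % 100 == 0:
            log(f"large fetch ({i}), up to {len(updates)}")
    return updates
```

For 250 single-entry batches this logs twice, at `i == 99` and `i == 199`.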

log(f"large fetch ({i}), up to {len(updates)}")
updates.extend(entry.updates)
# so store the checkpoint occasionally
log(f"large fetch ({i}), up to {len(updates)}. storing checkpoint")
Contributor

nit: You don't actually store the checkpoint by yielding; technically the caller can decide whether to keep calling or to store.

Contributor Author

@prenner prenner Jan 21, 2025

ah so the original elif block doesn't need to change? the only real change that's needed is to use the next_url in the for loop on L283?

edit: I think the yield is still needed, just the comment might be incorrect.. let me know if not

Contributor Author

updated the comment 👍
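To illustrate the distinction raised in this thread, here is a rough sketch in which the generator only yields a resumable position and the caller chooses when to persist it. `fetch_pages`, `run`, and the `store` dict are hypothetical and much simpler than the real fetch API.

```python
from typing import Dict, Iterator, List, Tuple


def fetch_pages(pages: List[List[str]]) -> Iterator[Tuple[List[str], int]]:
    """Yield (updates, resume_index) after each page; stores nothing itself."""
    for idx, page in enumerate(pages):
        yield page, idx + 1


def run(pages: List[List[str]], store: Dict[str, int], store_every: int = 2) -> List[str]:
    """Consume pages, persisting the resume position only every few pages."""
    seen: List[str] = []
    for n, (updates, resume_index) in enumerate(fetch_pages(pages), start=1):
        seen.extend(updates)
        if n % store_every == 0:
            # Storing is the caller's decision, not the generator's.
            store["resume_index"] = resume_index
    return seen
```

Under this reading, a comment next to the `yield` is more accurate if it says the checkpoint is offered to the caller rather than stored.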

start_timestamp=current_start, end_timestamp=current_end
start_timestamp=current_start,
end_timestamp=current_end,
next_=current_next_fetch,
Contributor

blocking: Danger! It's actually very easy to mess up this argument and accidentally trigger an endless loop. It may be that you have done so in the current code, but it's hard to tell.

The only time current_next_fetch should be populated is when you are resuming from checkpoint, and you need to explicitly disable the overfetch check (L290) then.

There might be a refactoring of this code that makes this easier, or now that we are switching over to the next pointer version we can get rid of the probing behavior, which simplifies the implementation quite a bit.

Contributor Author

Yeah, as I mentioned in Slack, it looks like we need the probing behavior, so I wasn't able to simplify. I added a check to disable the overfetch check when resuming from a checkpoint.
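One way to picture the guard discussed in this thread, as a sketch only: the overfetch probe runs on the first batch of a fresh time-range fetch and is skipped when a resume URL from a checkpoint is present. `first_batch_action`, `OVERFETCH_LIMIT`, and the returned action strings are hypothetical and do not mirror the actual code in the PR.

```python
from typing import Optional

# Hypothetical cap on how many entries one time window may contain.
OVERFETCH_LIMIT = 5000


def first_batch_action(estimated_entries: int, resume_url: Optional[str]) -> str:
    """Decide what to do after the first batch of a fetch comes back."""
    if resume_url:
        # Resuming from a checkpoint: skip the probe. The concern raised in
        # review is that re-probing here can lead to an endless loop.
        return "consume"
    if estimated_entries > OVERFETCH_LIMIT:
        return "shrink_window"
    return "consume"
```

Keeping the decision in one small function makes it easier to test that a populated resume URL never leads back to the probing path.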

start_timestamp=current_start, end_timestamp=current_end
start_timestamp=current_start,
end_timestamp=current_end,
next_=current_next_fetch,
)
):
if i == 0: # First batch, check for overfetch
Contributor

As a comment, it turns out my implementation for estimation of the entries in range was completely off, and so this is basically always overly cautious. Not sure what to do about it, since the alternatives that I can think of are complicated.

@prenner prenner force-pushed the prenner/checkpoint-ncmec branch 8 times, most recently from 82bc20b to c4a004e Compare January 22, 2025 17:02
@prenner prenner force-pushed the prenner/checkpoint-ncmec branch 2 times, most recently from 3488550 to 83ebd79 Compare January 22, 2025 19:03
@prenner prenner force-pushed the prenner/checkpoint-ncmec branch from 83ebd79 to b0f7997 Compare January 22, 2025 19:04
# note: the default_factory value was not being set correctly when
# reading from pickle
if "last_fetch_time" not in d:
    d["last_fetch_time"] = int(time.time())
Contributor Author

I was getting AttributeError: 'NCMECCheckpoint' object has no attribute 'last_fetch_time' without this in the test_state_compatibility test

seems sort of related to pydantic/pydantic#7821, since default was working (but wouldn't work if we want to set it to the current time)
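A possible explanation, sketched under the assumption that the checkpoint is a standard-library dataclass: unpickling restores attributes without calling `__init__`, so a `default_factory` never runs for state saved before the field existed. A plain `default` still appears to work because dataclasses keep it as a class attribute, which attribute lookup falls back to. `CheckpointSketch` below is hypothetical and is not the real `NCMECCheckpoint`.

```python
import time
from dataclasses import dataclass, field


@dataclass
class CheckpointSketch:
    """Hypothetical checkpoint used only to illustrate the pickle behavior."""

    max_timestamp: int
    last_fetch_time: int = field(default_factory=lambda: int(time.time()))

    def __setstate__(self, d: dict) -> None:
        # pickle calls this with the saved attribute dict instead of __init__,
        # so fill in any field that older saved state does not contain.
        state = dict(d)
        state.setdefault("last_fetch_time", int(time.time()))
        self.__dict__.update(state)
```

Calling `__setstate__` on an instance created with `CheckpointSketch.__new__` mimics what unpickling an old state does, which is a convenient way to test the compatibility path without a stored pickle file.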
