[SUPPORT]Unable to read new data in streaming mode with specific timestamp #12661
Comments
Sounds like batch reading behavior. Does the job finish after reading, or does it keep running idle?
I can confirm this issue and have traced the source code to understand why.
Root Cause Analysis:
    public List<HoodieInstant> filterInstantsWithRange(
        HoodieTimeline commitTimeline,
        @Nullable final String issuedInstant) {
      // For continuous streaming read
      if (issuedInstant != null) {
        return completedTimeline
            .getInstantsAsStream()
            .filter(s -> HoodieTimeline.compareTimestamps(s.getTimestamp(), GREATER_THAN, issuedInstant))
            .collect(Collectors.toList());
      }
      // For initial read
      Stream<HoodieInstant> instantStream = completedTimeline.getInstantsAsStream();
      if (OptionsResolver.hasNoSpecificReadCommits(this.conf)) {
        // snapshot read - only reads the latest commit
        return completedTimeline.lastInstant().map(Collections::singletonList).orElseGet(Collections::emptyList);
      }
      // With specific start commit time
      if (OptionsResolver.isSpecificStartCommit(this.conf)) {
        final String startCommit = this.conf.get(FlinkOptions.READ_START_COMMIT);
        instantStream = instantStream
            .filter(s -> HoodieTimeline.compareTimestamps(s.getTimestamp(), GREATER_THAN_OR_EQUALS, startCommit));
      }
    }
Currently, this only works correctly with
Question:
Looking forward to your suggestions on this.
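To make the three branches of the filtering logic quoted above easier to reason about, here is a minimal, dependency-free sketch. It is not Hudi code: the class name `InstantFilterSketch` and the method `filterInstants` are hypothetical, instants are modeled as plain timestamp strings compared lexicographically, and "no specific read commits" is simplified to `startCommit == null`. These are all assumptions of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

/** Simplified model of the instant filtering discussed in this thread (not Hudi code). */
public class InstantFilterSketch {

    /**
     * @param completed     completed instant timestamps, sorted ascending
     * @param issuedInstant last instant already consumed, or null on the first read
     * @param startCommit   configured start commit, or null if none is set (assumption)
     */
    static List<String> filterInstants(List<String> completed, String issuedInstant, String startCommit) {
        if (issuedInstant != null) {
            // Continuous read: everything strictly newer than what was already issued.
            return completed.stream()
                .filter(ts -> ts.compareTo(issuedInstant) > 0)
                .collect(Collectors.toList());
        }
        if (startCommit == null) {
            // First read without a start commit: only the latest instant.
            return completed.isEmpty()
                ? Collections.emptyList()
                : Collections.singletonList(completed.get(completed.size() - 1));
        }
        // First read with a start commit: everything at or after it.
        return completed.stream()
            .filter(ts -> ts.compareTo(startCommit) >= 0)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> timeline = new ArrayList<>(
            Arrays.asList("20240115120000", "20240116080000", "20240117090000"));

        // First poll with a specific start commit.
        List<String> first = filterInstants(timeline, null, "20240116000000");
        System.out.println(first); // [20240116080000, 20240117090000]

        // A new commit lands; the next poll passes the last issued instant.
        timeline.add("20240118100000");
        String issued = first.get(first.size() - 1);
        System.out.println(filterInstants(timeline, issued, "20240116000000")); // [20240118100000]
    }
}
```

In this simplified model the second poll does return the new commit, as long as it is called again with a non-null `issuedInstant`. That may be worth checking in the real job: whether the source keeps polling after the first read, and which `issuedInstant` it passes.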
Did you execute your job in streaming mode?
Yes, we did execute the job in streaming mode by setting:
    options.put(FlinkOptions.READ_AS_STREAMING.key(), "true");
    options.put(FlinkOptions.READ_START_COMMIT.key(), "20240116000000"); // specific timestamp
    options.put(FlinkOptions.READ_STREAMING_CHECK_INTERVAL.key(), "5");
We've found that:
We tried different configurations but still couldn't make it work with a specific timestamp. The job can only read new data when using
I mean the Flink streaming execution mode, not just the option.
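For reference, a minimal sketch of what requesting the Flink streaming execution mode looks like with the DataStream API. It assumes Flink 1.12 or later on the classpath; the class name `StreamingModeSketch` is hypothetical, and the Hudi source setup is left out on purpose.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamingModeSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Request the STREAMING runtime mode explicitly. This is the job-level
        // execution mode, separate from the Hudi read.streaming option.
        env.setRuntimeMode(RuntimeExecutionMode.STREAMING);

        // ... build the Hudi source and the rest of the pipeline here,
        // then call env.execute(...) ...
    }
}
```

The same mode can also be set through the `execution.runtime-mode` configuration key instead of in code.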
Yes, I’m sure. This is my current complete write code.
@xiearthur is the problem still there? If so, several things need to be confirmed:
And you can do some debugging locally,
Describe the problem you faced
When using Flink to read a Hudi COW table in streaming mode with a specific start timestamp, the job cannot read new data written after the Flink job starts. The streaming job only reads data up to its start time.
To Reproduce
Expected behavior
The streaming job should continuously read new data written after the job starts, regardless of whether a specific start timestamp is used.
Environment Description