[HUDI-8889] Trim unnecessary columns during MoR snapshot read #12677

Open
wants to merge 2 commits into
base: master


TheR1sing3un
Member

Consider the following case when performing a snapshot read on a MoR table:

  1. full schema: (A, B, C, D)
  2. primary key field: A
  3. pre-combine key field: B
  4. query with select D from table

for each file group

| data file | log files | operation | payload allow projection pushdown | actual read schema |
|---|---|---|---|---|
| exist | empty | ... | ... | (_hoodie_record_key, B, D) |
| empty | exist | ... | ... | (_hoodie_record_key, B, D) |
| exist | exist | skip_merge | ... | (_hoodie_record_key, B, D) |
| exist | exist | payload_combine | Y | (_hoodie_record_key, B, D) |
| exist | exist | payload_combine | N | (_hoodie_record_key, A, B, D) |

However, only the last two cases (payload_combine) actually merge records. In every other case no merge happens, so we only need to read column D from the files.
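The intended rule can be sketched in a few lines of Python. This is an illustration of the case analysis in this description, not the code in this PR: `FileGroupState`, `needs_merge`, and `required_read_schema` are hypothetical names, and the merge-type strings simply mirror the labels used above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Meta column named in the PR description.
RECORD_KEY_META_FIELD = "_hoodie_record_key"


@dataclass(frozen=True)
class FileGroupState:
    """Hypothetical summary of one file group, for illustration only."""
    has_base_file: bool
    has_log_files: bool
    merge_type: Optional[str] = None          # "skip_merge" or "payload_combine"
    payload_allows_projection: bool = False   # the Y/N column above


def needs_merge(fg: FileGroupState) -> bool:
    # Per the description, records are merged only when both a base file
    # and log files exist and the operation is payload_combine.
    return (
        fg.has_base_file
        and fg.has_log_files
        and fg.merge_type == "payload_combine"
    )


def required_read_schema(
    query_columns: List[str],
    fg: FileGroupState,
    record_key_fields: List[str],
    precombine_field: str,
) -> List[str]:
    """Return the columns to read for this file group (assumed rule)."""
    if not needs_merge(fg):
        # No merge: the merge-only columns can be trimmed.
        return list(query_columns)
    if fg.payload_allows_projection:
        mandatory = [RECORD_KEY_META_FIELD, precombine_field]
    else:
        # The payload needs the primary key fields as well.
        mandatory = [RECORD_KEY_META_FIELD, *record_key_fields, precombine_field]
    # De-duplicate while preserving order.
    return list(dict.fromkeys([*mandatory, *query_columns]))


if __name__ == "__main__":
    base_only = FileGroupState(has_base_file=True, has_log_files=False)
    print(required_read_schema(["D"], base_only, ["A"], "B"))
```

With schema (A, B, C, D), primary key A, pre-combine key B, and `select D`, this sketch returns only D for the three non-merging cases and keeps the wider schemas for the two payload_combine cases.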

Change Logs

  1. Trim unnecessary columns during MoR snapshot read

Impact

Improves the performance of MoR snapshot reads for file groups that do not actually need to be merged.

Risk level (write none, low medium or high below)

low

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Jan 20, 2025
@TheR1sing3un TheR1sing3un force-pushed the feat_trim_cols_mor_snapshot_read branch 2 times, most recently from 39f927a to 1f6e7dc Compare January 20, 2025 10:33
@TheR1sing3un TheR1sing3un marked this pull request as draft January 20, 2025 14:27
@TheR1sing3un TheR1sing3un marked this pull request as ready for review January 21, 2025 02:47
1. Trim unnecessary columns during MoR snapshot read

Signed-off-by: TheR1sing3un <[email protected]>
@TheR1sing3un TheR1sing3un force-pushed the feat_trim_cols_mor_snapshot_read branch from 4842151 to 18c1648 Compare January 21, 2025 03:03
1. fix wrong embed internal schema

Signed-off-by: TheR1sing3un <[email protected]>
@TheR1sing3un
Member Author

@hudi-bot run azure

@hudi-bot

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build
