-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SUPPORT] Queries are very memory intensive due to low read parallelism in HoodieMergeOnReadRDD #12434
Comments
@mzheng-plaid The parallelism should be equal to the number of file groups as we can have one task reading one parquet file. Want to understand more if the parquet files / log files are properly sized then whey you will face bottleneck at task level? |
We set these to:
So with parquet there is |
@mzheng-plaid I dont think HoodieMergeOnReadRDD has a way to split filegroups further during read. Any way it will be difficult with snapshot read, as log files has to be applied on the parquet records. |
This is problematic even on the read optimized table (ie. just the base parquet files), which is really surprising I tried:
And just reading the parquet files directly was much less memory intensive and faster (ie. not spilling to disk) when I tuned |
@ad1happy2go bump on this, is there any workaround for read-optimized queries? That behavior is surprising |
Describe the problem you faced
We have jobs that read from a MOR table using the following pyspark pseudo-code (
event_table_rt
is the MOR table):We're running into a bottleneck on
HoodieMergeOnReadRDD
(https://github.com/apache/hudi/blob/release-0.14.2/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala#L37) where the number of tasks in the stage readingevent_df
seems to be non-configurable and (I think) equal to the number of files being read. This is causing massive disk/memory spill and bottlenecking performance.Is it possible to configure the read parallelism to be higher or is this a fundamental limitation of Hudi with MOR tables? What is the recommendation for how to tune resourcing for readers of MOR tables?
Environment Description
Hudi version : 0.14.1-amzn-1 (EMR 7.2.0)
Spark version : 3.5.1
Hive version : 3.1.3
Hadoop version : 3.3.6
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
The text was updated successfully, but these errors were encountered: