[SUPPORT] IndexOutOfBoundsException when running Hudi job #12593

Open

ennox108 opened this issue Jan 7, 2025 · 1 comment
Labels: priority:major (degraded perf; unable to move forward; potential bugs), version-compatibility

Comments

ennox108 commented Jan 7, 2025

We upgraded EMR from 6.11.1 to 7.2.0 and Hudi from 0.13 to 0.14.1-amzn-1.

I am running a Hudi job against four data sources. The job completes for three of them, but it keeps failing for one source with the error below:

[screenshot of the error attached in the original issue]

I have tried re-ingesting the source tables used for this job, as well as re-creating the table where the data is written.

I am using the following Hudi options:

```python
hudi_options = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.table.type': table_type or 'MERGE_ON_READ',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.payload.class': payload_class,
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
    'hoodie.datasource.write.recordkey.field': primary_keys.replace(' ', ''),
    'hoodie.datasource.write.precombine.field': precombine_key,
    'hoodie.datasource.write.partitionpath.field': 'src_db_id:SIMPLE',
    'hoodie.embed.timeline.server': False,
    'hoodie.index.type': 'BLOOM',
    'hoodie.parquet.compression.codec': 'snappy',
    'hoodie.clean.async': True,
    'hoodie.clean.max.commits': 3,
    'hoodie.parquet.max.file.size': 125829120,
    'hoodie.parquet.small.file.limit': 104857600,
    'hoodie.parquet.block.size': 125829120,
    'hoodie.metadata.enable': not overwrite,
    'hoodie.metadata.validate': True,
    'hoodie.allow.empty.commit': True,
    'hoodie.datasource.write.hive_style_partitioning': True,
    'hoodie.datasource.hive_sync.support_timestamp': True,
    'hoodie.datasource.hive_sync.jdbcurl': hive_jdbcurl,
    'hoodie.datasource.hive_sync.username': hive_username,
    'hoodie.datasource.hive_sync.password': hive_password,
    'hoodie.datasource.hive_sync.database': cdm_db,
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.hive_sync.partition_fields': 'src_db_id',
    'hoodie.datasource.hive_sync.enable': True,
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.compact.inline': True,
    'hoodie.compact.inline.trigger.strategy': 'NUM_OR_TIME',
    'hoodie.compact.inline.max.delta.commits': 1,
    'hoodie.compact.inline.max.delta.seconds': 3600
}
```
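For context, an options dict like this is typically passed to the PySpark DataFrame writer. A minimal sketch, assuming a handful of illustrative values (the names `df`, `base_path`, and all literal values below are assumptions for illustration, not taken from the issue):

```python
# Minimal sketch of how a hudi_options dict is usually applied in PySpark.
# All literal values here are illustrative assumptions, not from the issue.
hudi_options = {
    'hoodie.table.name': 'example_table',                      # assumed table name
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.datasource.write.recordkey.field': 'id',           # assumed record key
    'hoodie.datasource.write.precombine.field': 'updated_at',  # assumed precombine field
}

# In a live job (df and base_path would come from the pipeline):
# df.write.format('hudi').options(**hudi_options).mode('append').save(base_path)
```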

Applications being used:
EMR 7.2.0
Spark 3.5.1
Hadoop 3.3.6
Hudi 0.14.1-amzn-1

The same job works without any issue on the old EMR.

Another suggestion I found from AWS was to use Java 8 instead of 17, but the issue persists with Java 8 as well.

@danny0405 (Contributor) commented:
Looks like a known issue caused by Avro version incompatibility: #6621 (comment)
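One quick way to investigate an Avro mismatch like this is to check which Avro jar(s) Spark actually has on its classpath. A hedged diagnostic sketch (the helper `find_avro_jars` and the example classpath string are hypothetical, not from this issue):

```python
# Hedged diagnostic: list Avro jars in a colon-separated classpath string,
# to spot version mismatches like the one suspected in this issue.
# The example classpath below is hypothetical.
def find_avro_jars(classpath: str) -> list[str]:
    """Return jar file names in `classpath` whose name mentions 'avro'."""
    names = (entry.rsplit('/', 1)[-1] for entry in classpath.split(':'))
    return [n for n in names if 'avro' in n and n.endswith('.jar')]

example_cp = ('/usr/lib/spark/jars/avro-1.11.1.jar:'
              '/usr/lib/spark/jars/spark-core_2.12-3.5.1.jar')
print(find_avro_jars(example_cp))  # -> ['avro-1.11.1.jar']
```

On EMR the same idea can be applied to the value of `spark.driver.extraClassPath` or the contents of the Spark jars directory, to confirm exactly one Avro version is present.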

danny0405 added the version-compatibility and priority:major labels on Jan 8, 2025
github-project-automation moved this to ⏳ Awaiting Triage in Hudi Issue Support on Jan 8, 2025