[SUPPORT] IndexOutOfBoundsException when running Hudi job #12593

Open

ennox108 opened this issue Jan 7, 2025 · 1 comment
Labels: priority:major (degraded perf; unable to move forward; potential bugs), version-compatibility

Comments

ennox108 commented Jan 7, 2025

We upgraded EMR from 6.11.1 to 7.2.0 and Hudi from 0.13 to 0.14.1-amzn-1.

I am running a Hudi job against four data sources. The job completes for three of them, but it keeps failing for one source with the error below:

[screenshot of the error attached in the original issue]

I have tried re-ingesting the source tables used for this job, as well as re-creating the table where the data is written.

I am using the following Hudi options:

```python
hudi_options = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.table.type': table_type or 'MERGE_ON_READ',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.payload.class': payload_class,
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
    'hoodie.datasource.write.recordkey.field': primary_keys.replace(' ', ''),
    'hoodie.datasource.write.precombine.field': precombine_key,
    'hoodie.datasource.write.partitionpath.field': 'src_db_id:SIMPLE',
    'hoodie.embed.timeline.server': False,
    'hoodie.index.type': 'BLOOM',
    'hoodie.parquet.compression.codec': 'snappy',
    'hoodie.clean.async': True,
    'hoodie.clean.max.commits': 3,
    'hoodie.parquet.max.file.size': 125829120,
    'hoodie.parquet.small.file.limit': 104857600,
    'hoodie.parquet.block.size': 125829120,
    'hoodie.metadata.enable': not overwrite,
    'hoodie.metadata.validate': True,
    'hoodie.allow.empty.commit': True,
    'hoodie.datasource.write.hive_style_partitioning': True,
    'hoodie.datasource.hive_sync.support_timestamp': True,
    'hoodie.datasource.hive_sync.jdbcurl': hive_jdbcurl,
    'hoodie.datasource.hive_sync.username': hive_username,
    'hoodie.datasource.hive_sync.password': hive_password,
    'hoodie.datasource.hive_sync.database': cdm_db,
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.hive_sync.partition_fields': 'src_db_id',
    'hoodie.datasource.hive_sync.enable': True,
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.compact.inline': True,
    'hoodie.compact.inline.trigger.strategy': 'NUM_OR_TIME',
    'hoodie.compact.inline.max.delta.commits': 1,
    'hoodie.compact.inline.max.delta.seconds': 3600
}
```
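For context, an options dict like this is typically passed to the PySpark DataFrame writer. A minimal sketch, assuming a handful of illustrative values (the names `df`, `base_path`, and all literal values below are assumptions for illustration, not taken from the issue):

```python
# Minimal sketch of how a hudi_options dict is usually applied in PySpark.
# All literal values here are illustrative assumptions, not from the issue.
hudi_options = {
    'hoodie.table.name': 'example_table',                      # assumed table name
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.datasource.write.recordkey.field': 'id',           # assumed record key
    'hoodie.datasource.write.precombine.field': 'updated_at',  # assumed precombine field
}

# In a live job (df and base_path would come from the pipeline):
# df.write.format('hudi').options(**hudi_options).mode('append').save(base_path)
```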

Applications being used:
EMR 7.2.0
Spark 3.5.1
Hadoop 3.3.6
Hudi 0.14.1-amzn-1

The same job works without any issue on the old EMR.

Another suggestion I found from AWS was to use Java 8 instead of 17, but the issue persists with Java 8 as well.

@danny0405 (Contributor) commented:
Looks like a known issue caused by Avro version incompatibility: #6621 (comment)
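One quick way to investigate an Avro mismatch like this is to check which Avro jar(s) Spark actually has on its classpath. A hedged diagnostic sketch (the helper `find_avro_jars` and the example classpath string are hypothetical, not from this issue):

```python
# Hedged diagnostic: list Avro jars in a colon-separated classpath string,
# to spot version mismatches like the one suspected in this issue.
# The example classpath below is hypothetical.
def find_avro_jars(classpath: str) -> list[str]:
    """Return jar file names in `classpath` whose name mentions 'avro'."""
    names = (entry.rsplit('/', 1)[-1] for entry in classpath.split(':'))
    return [n for n in names if 'avro' in n and n.endswith('.jar')]

example_cp = ('/usr/lib/spark/jars/avro-1.11.1.jar:'
              '/usr/lib/spark/jars/spark-core_2.12-3.5.1.jar')
print(find_avro_jars(example_cp))  # -> ['avro-1.11.1.jar']
```

On EMR the same idea can be applied to the value of `spark.driver.extraClassPath` or the contents of the Spark jars directory, to confirm exactly one Avro version is present.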

danny0405 added the version-compatibility and priority:major labels on Jan 8, 2025
github-project-automation moved this to ⏳ Awaiting Triage in Hudi Issue Support on Jan 8, 2025