Skip to content

Commit

Permalink
[SPARK-48816][SQL] Shorthand for interval converters in UnivocityParser
Browse files Browse the repository at this point in the history
### What changes were proposed in this pull request?

Directly call `IntervalUtils.castStringToDTInterval/castStringToYMInterval` instead of creating Cast expressions to evaluate.

- Benchmarks indicated a 10% time-saving.
- Bad record recording might not work if the cast handles the exceptions early

### Why are the changes needed?

- pref improvement
- Bugfix for bad record recording

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

passing existing tests and benchmark tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#47227 from yaooqinn/SPARK-48816.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
  • Loading branch information
yaooqinn authored and attilapiros committed Oct 4, 2024
1 parent 76ab05e commit ea0b7af
Show file tree
Hide file tree
Showing 4 changed files with 134 additions and 85 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@ import com.univocity.parsers.csv.CsvParser
import org.apache.spark.SparkUpgradeException
import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.{InternalRow, NoopFilters, OrderedFilters}
import org.apache.spark.sql.catalyst.expressions.{Cast, EmptyRow, ExprUtils, GenericInternalRow, Literal}
import org.apache.spark.sql.catalyst.expressions.{ExprUtils, GenericInternalRow}
import org.apache.spark.sql.catalyst.util._
import org.apache.spark.sql.catalyst.util.LegacyDateFormats.FAST_DATE_FORMAT
import org.apache.spark.sql.errors.{ExecutionErrors, QueryExecutionErrors}
Expand Down Expand Up @@ -260,12 +260,14 @@ class UnivocityParser(

case ym: YearMonthIntervalType => (d: String) =>
nullSafeDatum(d, name, nullable, options) { datum =>
Cast(Literal(datum), ym).eval(EmptyRow)
IntervalUtils.castStringToYMInterval(
UTF8String.fromString(datum), ym.startField, ym.endField)
}

case dt: DayTimeIntervalType => (d: String) =>
nullSafeDatum(d, name, nullable, options) { datum =>
Cast(Literal(datum), dt).eval(EmptyRow)
IntervalUtils.castStringToDTInterval(
UTF8String.fromString(datum), dt.startField, dt.endField)
}

case udt: UserDefinedType[_] =>
Expand Down
89 changes: 48 additions & 41 deletions sql/core/benchmarks/CSVBenchmark-jdk21-results.txt
Original file line number Diff line number Diff line change
Expand Up @@ -2,69 +2,76 @@
Benchmark to measure CSV read/write performance
================================================================================================

OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
AMD EPYC 7763 64-Core Processor
Parsing quoted values: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
One quoted string 23353 23432 75 0.0 467067.4 1.0X
One quoted string 23962 24182 316 0.0 479231.3 1.0X

OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
AMD EPYC 7763 64-Core Processor
Wide rows with 1000 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Select 1000 columns 56825 57244 679 0.0 56825.1 1.0X
Select 100 columns 20482 20568 86 0.0 20481.7 2.8X
Select one column 16968 17000 36 0.1 16967.7 3.3X
count() 3366 3378 11 0.3 3366.4 16.9X
Select 100 columns, one bad input field 28347 28379 30 0.0 28346.6 2.0X
Select 100 columns, corrupt record field 32401 32450 42 0.0 32401.2 1.8X
Select 1000 columns 56724 57115 570 0.0 56724.1 1.0X
Select 100 columns 20740 20855 115 0.0 20739.7 2.7X
Select one column 17304 17377 114 0.1 17304.3 3.3X
count() 3719 3740 21 0.3 3719.0 15.3X
Select 100 columns, one bad input field 24943 24999 69 0.0 24943.2 2.3X
Select 100 columns, corrupt record field 28306 28341 31 0.0 28306.2 2.0X

OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
AMD EPYC 7763 64-Core Processor
Count a dataset with 10 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Select 10 columns + count() 11174 11195 18 0.9 1117.4 1.0X
Select 1 column + count() 7666 7694 24 1.3 766.6 1.5X
count() 2042 2048 5 4.9 204.2 5.5X
Select 10 columns + count() 10977 10982 5 0.9 1097.7 1.0X
Select 1 column + count() 7406 7554 131 1.4 740.6 1.5X
count() 1550 1558 9 6.5 155.0 7.1X

OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
AMD EPYC 7763 64-Core Processor
Write dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Create a dataset of timestamps 854 882 27 11.7 85.4 1.0X
to_csv(timestamp) 6166 6174 13 1.6 616.6 0.1X
write timestamps to files 6480 6575 158 1.5 648.0 0.1X
Create a dataset of dates 948 949 1 10.6 94.8 0.9X
to_csv(date) 4471 4474 3 2.2 447.1 0.2X
write dates to files 4599 4616 15 2.2 459.9 0.2X
Create a dataset of timestamps 845 847 3 11.8 84.5 1.0X
to_csv(timestamp) 5546 5597 57 1.8 554.6 0.2X
write timestamps to files 5760 5768 8 1.7 576.0 0.1X
Create a dataset of dates 1053 1064 9 9.5 105.3 0.8X
to_csv(date) 4115 4122 9 2.4 411.5 0.2X
write dates to files 4102 4108 5 2.4 410.2 0.2X

OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
AMD EPYC 7763 64-Core Processor
Read dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-----------------------------------------------------------------------------------------------------------------------------------------------------
read timestamp text from files 1200 1213 12 8.3 120.0 1.0X
read timestamps from files 11576 11601 22 0.9 1157.6 0.1X
infer timestamps from files 23234 23253 16 0.4 2323.4 0.1X
read date text from files 1115 1162 44 9.0 111.5 1.1X
read date from files 10978 11006 43 0.9 1097.8 0.1X
infer date from files 22588 22604 13 0.4 2258.8 0.1X
timestamp strings 1224 1236 21 8.2 122.4 1.0X
parse timestamps from Dataset[String] 13566 13595 41 0.7 1356.6 0.1X
infer timestamps from Dataset[String] 25057 25094 36 0.4 2505.7 0.0X
date strings 1618 1626 7 6.2 161.8 0.7X
parse dates from Dataset[String] 12784 12816 34 0.8 1278.4 0.1X
from_csv(timestamp) 12008 12088 69 0.8 1200.8 0.1X
from_csv(date) 11930 11938 12 0.8 1193.0 0.1X
infer error timestamps from Dataset[String] with default format 14366 14394 35 0.7 1436.6 0.1X
infer error timestamps from Dataset[String] with user-provided format 14380 14412 52 0.7 1438.0 0.1X
infer error timestamps from Dataset[String] with legacy format 14439 14453 21 0.7 1443.9 0.1X
read timestamp text from files 1107 1119 16 9.0 110.7 1.0X
read timestamps from files 9511 9553 49 1.1 951.1 0.1X
infer timestamps from files 19084 19114 27 0.5 1908.4 0.1X
read date text from files 1036 1046 14 9.7 103.6 1.1X
read date from files 8299 8309 15 1.2 829.9 0.1X
infer date from files 17290 17294 4 0.6 1729.0 0.1X
timestamp strings 1188 1197 7 8.4 118.8 0.9X
parse timestamps from Dataset[String] 11442 11458 14 0.9 1144.2 0.1X
infer timestamps from Dataset[String] 21076 21116 39 0.5 2107.6 0.1X
date strings 1651 1659 10 6.1 165.1 0.7X
parse dates from Dataset[String] 10181 10186 5 1.0 1018.1 0.1X
from_csv(timestamp) 10023 10062 34 1.0 1002.3 0.1X
from_csv(date) 9335 9351 15 1.1 933.5 0.1X
infer error timestamps from Dataset[String] with default format 11187 11205 16 0.9 1118.7 0.1X
infer error timestamps from Dataset[String] with user-provided format 11201 11216 13 0.9 1120.1 0.1X
infer error timestamps from Dataset[String] with legacy format 11210 11227 17 0.9 1121.0 0.1X

OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
AMD EPYC 7763 64-Core Processor
Filters pushdown: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters 4302 4383 137 0.0 43020.6 1.0X
pushdown disabled 4206 4220 13 0.0 42058.8 1.0X
w/ filters 776 784 10 0.1 7756.3 5.5X
w/o filters 4365 4377 13 0.0 43653.8 1.0X
pushdown disabled 4348 4370 22 0.0 43477.7 1.0X
w/ filters 695 713 29 0.1 6950.2 6.3X

OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
AMD EPYC 7763 64-Core Processor
Interval: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Read as Intervals 7089 7096 7 0.4 2362.1 1.0X
Read Raw Strings 2071 2075 6 1.4 690.1 3.4X


Loading

0 comments on commit ea0b7af

Please sign in to comment.