xe: jit: gemm: Add k-parallelism parameter to Xe2 gemm kernels #2477

Open
wants to merge 1 commit into
base: main

Simonsays095
Contributor

Addresses MFDNN-13067. Several parameters in the Xe2 dynamic quantization kernel strategies need to be adjusted:

  • ar -> sr br: this results in faster execution and should be the default in all cases
  • Two k-parallel kernels need ikr added to their strategy to be valid
  • di cc can be removed from two kernels

With these changes, a few cases that previously failed correctness checks (example below) should now pass, and execution should be slightly faster.

$ benchdnn --mode=C --matmul --engine=gpu --dt=s8:s4:f16 --stag=ab --wtag=ba --dtag=ab --attr-scales=src0:per_ocic:f16:1x128+wei:per_ocic:f16:128x1 --attr-fpmath=f16:true 1x4096:4096x4096
[   0][DST][0:0] exp_f32:     -213.25 exp:     -213.25 got:    -53.3125 diff: 159.938 rdiff:    0.75
[   1][DST][0:1] exp_f32:    -137.375 exp:    -137.375 got:    -68.6875 diff: 68.6875 rdiff:     0.5
[   2][DST][0:2] exp_f32:    -109.625 exp:    -109.625 got:     -219.25 diff: 109.625 rdiff:       1
[   4][DST][0:4] exp_f32:     798.812 exp:         799 got:      199.75 diff:  599.25 rdiff:    0.75
[   6][DST][0:6] exp_f32:      36.375 exp:      36.375 got:       145.5 diff: 109.125 rdiff:       3
[   7][DST][0:7] exp_f32:      415.75 exp:      415.75 got:     207.875 diff: 207.875 rdiff:     0.5
[   9][DST][0:9] exp_f32:    -464.938 exp:        -465 got:       -1860 diff:    1395 rdiff:       3
[  10][DST][0:10] exp_f32:      89.875 exp:      89.875 got:      179.75 diff:  89.875 rdiff:       1
[  11][DST][0:11] exp_f32:    -466.938 exp:        -467 got:        -934 diff:     467 rdiff:       1
[  12][DST][0:12] exp_f32:     -46.375 exp:     -46.375 got:    -11.5938 diff: 34.7812 rdiff:    0.75
0:FAILED (errors:4092 total:4096) __REPRO: --matmul --engine=gpu --dt=s8:s4:f16 --stag=ab --wtag=ba --dtag=ab --attr-scales=src:per_ocic:f16:1x128+wei:per_ocic:f16:128x1 --attr-fpmath=f16:true 1x4096:4096x4096
tests:1 passed:0 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:1 listed:0
total: 0.43s; fill: 0.09s (21%); compute_ref: 0.15s (36%); compare: 0.00s (0%);
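
For reading the failure log above, here is a minimal Python sketch that parses a few of the printed rows and recomputes the relative difference. It assumes `rdiff` is `|exp - got| / |exp|`, which matches the printed numbers but is inferred from this output rather than taken from the benchdnn sources. The names `LOG`, `ROW`, `parse_rows`, and `summarize` are made up for this illustration.

```python
import re

# Four rows copied verbatim from the failing benchdnn output above.
LOG = """\
[   0][DST][0:0] exp_f32:     -213.25 exp:     -213.25 got:    -53.3125 diff: 159.938 rdiff:    0.75
[   2][DST][0:2] exp_f32:    -109.625 exp:    -109.625 got:     -219.25 diff: 109.625 rdiff:       1
[   6][DST][0:6] exp_f32:      36.375 exp:      36.375 got:       145.5 diff: 109.125 rdiff:       3
[   7][DST][0:7] exp_f32:      415.75 exp:      415.75 got:     207.875 diff: 207.875 rdiff:     0.5
"""

# Captures the exp, got, diff and rdiff columns; "exp_f32:" is skipped
# because the pattern requires the literal text "exp:".
ROW = re.compile(r"exp:\s*(\S+)\s+got:\s*(\S+)\s+diff:\s*(\S+)\s+rdiff:\s*(\S+)")


def parse_rows(text):
    """Return a list of (exp, got, diff, rdiff) float tuples."""
    rows = []
    for line in text.splitlines():
        match = ROW.search(line)
        if match:
            rows.append(tuple(float(g) for g in match.groups()))
    return rows


def summarize(rows):
    """Recompute rdiff (assumed formula) and the got/exp ratio per row."""
    summary = []
    for exp, got, _diff, reported_rdiff in rows:
        summary.append({
            "ratio": got / exp,
            "rdiff": abs(exp - got) / abs(exp),
            "reported_rdiff": reported_rdiff,
        })
    return summary


if __name__ == "__main__":
    for entry in summarize(parse_rows(LOG)):
        print(entry)
```

In these sample rows each `got` value is the expected value scaled by 1/4, 1/2, 2, or 4. That is only an observation about the printed numbers; the PR description does not state a root cause.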

Simonsays095 added the bug label (A confirmed library bug) on Jan 22, 2025
Simonsays095 requested a review from a team as a code owner on January 22, 2025 06:17
github-actions bot added the platform:gpu-intel label (Codeowner: @oneapi-src/onednn-gpu-intel) on Jan 22, 2025
@Simonsays095
Contributor Author

make test
disable test_device_cpu
enable test_device_gpu

@petercad
Contributor

@Simonsays095 can you run Xe2 perf testing?

{{'G', "gemm", {"F", "H", "S"}, {"T", "N", "N"}}, {-1, -1, {-1, 25, -1}, {-1, 32, -1}, {-1, 25, -1}, {-1, 32, -1}, {16, 16, 1}, "IAB"}, "at32+m128@80 am32+m128@80 aB wg 8x1x4 ikr wx2 xaf vav hi pt sr br sb128 bk0 sm sn bm0 nmk sys", {16, (LoopType) 255, 128, {(LoopType) 209, (LoopType) 255, (LoopType) 2}, {16777216, 262144, 16777216}, {262144, 262144, 16777216}, {16, 16, 128}, {8, 1, 4}, 2, (WGType) 1, 4357, 0, 8192, {16, 16, 4}, {true, true, true}}, {'W', 1, {256}}},
{{'G', "gemm", {"F", "H", "S"}, {"T", "N", "N"}}, {-1, -1, {-1, 33, -1}, {-1, 48, -1}, {-1, 33, -1}, {-1, 48, -1}, {16, 16, 1}, "ABI"}, "at64+m64@48 am32+m16@48 aB wg 4x1 xaf rr vav hi pt sr br sb64 bk0 sm grf256 sys np", {16, (LoopType) 255, 256, {(LoopType) 208, (LoopType) 255, (LoopType) 255}, {524288, 524288, 16777216}, {524288, 524288, 16777216}, {32, 32, 64}, {4, 1, 1}, 1, (WGType) 1, 257, 0, 0, {16, 16, 4}, {true, true, true}}, {'W', 1, {1024}}},
{{'G', "gemm", {"F", "O", "S"}, {"T", "N", "N"}}, {-1, -1, {-1, -1, -1}, {-1, -1, -1}, {-1, -1, -1}, {-1, -1, -1}, {16, 16, 1}, "ABI"}, "at32+m128@96 am32x2+m64@96 aB wg 2x16 vav hi pt sr br sb128 bk0 grf256 sys acb cr16", {16, (LoopType) 255, 256, {(LoopType) 208, (LoopType) 255, (LoopType) 255}, {2097152, 262144, 16777216}, {2097152, 262144, 16777216}, {128, 16, 32}, {2, 16, 1}, 1, (WGType) 1, 257, 0, 0, {16, 16, 4}, {true, true, true}}, {'E', 17, {879529, 62860.9, 0, 0, 0, 0, 1.12572, 1.9182, 3.81465, 7.84556, 0.00532516, 0.00532516, 0, 1, 1.01261, 1.00705, -3.00232e-14}}},
{{'G', "gemm", {"F", "O", "S"}, {"T", "N", "N"}}, {-1, -1, {-1, 1, -1}, {-1, 1, -1}, {-1, -1, -1}, {-1, -1, -1}, {16, 16, 1}, "ABI"}, "at128 am128 ab wg 2x1x16 sys ikr sr br", {16, (LoopType) 255, 128, {(LoopType) 0, (LoopType) 1, (LoopType) 255}, {8192, 8192, 16777216}, {8192, 8192, 16777216}, {32, 1, 128}, {2, 1, 1}, 1, (WGType) 0, 257, 0, 0, {16, 16, 4}, {true, true, true}}, {'E', 17, {533005, 706.931, 0, 0, 0, 0, 1.03522, 1.49979, 2.9056, 6.09078, 0.0666521, -0.0162066, 0.0674277, 0.261398, 1.07943, 0, 0}}},

After modifying strategies in the catalog "by hand," you'll want to update the embedded DriverInfo structs with ktool --reinfo. In this case only the strategies you added ikr to really need it.
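
To spot hand-edited entries that might still be missing `ikr`, a rough check over the strategy strings could look like the sketch below. It assumes that a three-part `wg AxBxC` token with `C > 1` marks a k-parallel strategy; that rule is inferred from the two entries in this PR that carry `ikr`, not from the strategy parser. The helpers `wg_dims` and `missing_ikr` are hypothetical and not part of oneDNN or ktool.

```python
import re


def wg_dims(strategy):
    """Return the workgroup dimensions following the 'wg' token, or None."""
    tokens = strategy.split()
    for i, tok in enumerate(tokens[:-1]):
        if tok == "wg" and re.fullmatch(r"\d+(?:x\d+){1,2}", tokens[i + 1]):
            return tuple(int(d) for d in tokens[i + 1].split("x"))
    return None


def missing_ikr(strategy):
    """True if the strategy looks k-parallel (assumed rule) but lacks 'ikr'."""
    dims = wg_dims(strategy)
    # Assumption: a third workgroup dimension greater than 1 means k-parallel.
    k_parallel = dims is not None and len(dims) == 3 and dims[2] > 1
    return k_parallel and "ikr" not in strategy.split()


if __name__ == "__main__":
    # Strategy strings copied from the catalog entries quoted above.
    strategies = [
        "at32+m128@80 am32+m128@80 aB wg 8x1x4 ikr wx2 xaf vav hi pt sr br "
        "sb128 bk0 sm sn bm0 nmk sys",
        "at64+m64@48 am32+m16@48 aB wg 4x1 xaf rr vav hi pt sr br sb64 bk0 "
        "sm grf256 sys np",
        "at128 am128 ab wg 2x1x16 sys ikr sr br",
    ]
    for s in strategies:
        print(wg_dims(s), "missing ikr:", missing_ikr(s))
```

A check like this only flags candidates. The embedded DriverInfo structs still need to be regenerated with `ktool --reinfo`, as noted in the comment above.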
