From f82360b9a80ccedb64b601f91cd35612ee4946d3 Mon Sep 17 00:00:00 2001 From: Diptorup Deb Date: Mon, 18 Mar 2024 22:48:00 -0500 Subject: [PATCH] Restructure and fill in the kernel programming guide. --- docs/source/ext_links.txt | 3 + .../kernel_programming/call-kernel-async.rst | 4 + .../kernel_programming/call-kernel.rst | 117 ++++++++++++++ .../kernel_programming/device-functions.rst | 46 ++++-- .../user_guide/kernel_programming/index.rst | 149 ++++-------------- .../kernel_programming/operators.csv | 35 ++++ .../kernel_programming/operators.rst | 6 + .../writing-ndrange-kernel.rst | 129 +++++++++++++++ .../writing-range-kernel.rst | 105 +++--------- 9 files changed, 373 insertions(+), 221 deletions(-) create mode 100644 docs/source/user_guide/kernel_programming/call-kernel-async.rst create mode 100644 docs/source/user_guide/kernel_programming/call-kernel.rst create mode 100644 docs/source/user_guide/kernel_programming/operators.csv create mode 100644 docs/source/user_guide/kernel_programming/operators.rst create mode 100644 docs/source/user_guide/kernel_programming/writing-ndrange-kernel.rst diff --git a/docs/source/ext_links.txt b/docs/source/ext_links.txt index 3bb02afdfd..4c369fbd92 100644 --- a/docs/source/ext_links.txt +++ b/docs/source/ext_links.txt @@ -31,3 +31,6 @@ .. _oneAPI GPU optimization guide: https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2024-0/general-purpose-computing-on-gpu.html .. _dpctl.tensor.usm_ndarray: https://intelpython.github.io/dpctl/latest/docfiles/dpctl/usm_ndarray.html#dpctl.tensor.usm_ndarray .. _dpnp.ndarray: https://intelpython.github.io/dpnp/reference/ndarray.html + +.. _Dispatcher: https://numba.readthedocs.io/en/stable/reference/jit-compilation.html#dispatcher-objects +.. _Unboxes: https://numba.readthedocs.io/en/stable/extending/interval-example.html#boxing-and-unboxing diff --git a/docs/source/user_guide/kernel_programming/call-kernel-async.rst b/docs/source/user_guide/kernel_programming/call-kernel-async.rst new file mode 100644 index 0000000000..b3a657890d --- /dev/null +++ b/docs/source/user_guide/kernel_programming/call-kernel-async.rst @@ -0,0 +1,4 @@ +.. _launching-an-async-kernel: + +Async kernel execution +====================== diff --git a/docs/source/user_guide/kernel_programming/call-kernel.rst b/docs/source/user_guide/kernel_programming/call-kernel.rst new file mode 100644 index 0000000000..2655027460 --- /dev/null +++ b/docs/source/user_guide/kernel_programming/call-kernel.rst @@ -0,0 +1,117 @@ +.. _launching-a-kernel: + +Launching a kernel +================== + +A ``kernel`` decorated kapi function produces a ``KernelDispatcher`` object that +is a type of a Numba* `Dispatcher`_ object. However, unlike regular Numba* +Dispatcher objects a ``KernelDispatcher`` object cannot be directly invoked from +either CPython or another compiled Numba* ``jit`` function. To invoke a +``kernel`` decorated function, a programmer has to use the +:func:`numba_dpex.core.kernel_launcher.call_kernel` function. + +To invoke a ``KernelDispatcher`` the ``call_kernel`` function requires three +things: the ``KernelDispatcher`` object, the ``Range`` or ``NdRange`` object +over which the kernel is to be executed, and the list of arguments to be passed +to the compiled kernel. Once called with the necessary arguments, the +``call_kernel`` function does the following main things: + +- Compiles the ``KernelDispatcher`` object specializing it for the provided + argument types. + +- `Unboxes`_ the kernel arguments by converting CPython objects into Numba* or + numba-dpex objects. + +- Infer the execution queue on which to submit the kernel from the provided + kernel arguments. (TODO: Refer compute follows data.) + +- Submits the kernel to the execution queue. + +- Waits for the execution completion, before returning control back to the + caller. + +.. important:: + Programmers should note the following two things when defining the global or + local range to launch a kernel. + + * Numba-dpex currently limits the maximum allowed global range size to + ``2^31-1``. It is due to the capabilities of current OpenCL GPU backends + that generally do not support more than 32-bit global range sizes. A + kernel requesting a larger global range than that will not execute and a + ``dpctl._sycl_queue.SyclKernelSubmitError`` will get raised. + + The Intel dpcpp SYCL compiler does handle greater than 32-bit global + ranges for GPU backends by wrapping the kernel in a new kernel that has + each work-item perform multiple invocations of the original kernel in a + 32-bit global range. Such a feature is not yet available in numba-dpex. + + * When launching an nd-range kernel, if the number of work-items for a + particular dimension of a work-group exceeds the maximum device + capability, it can result in undefined behavior. + + The maximum allowed work-items for a device can be queried programmatically + as shown in :ref:`ex_max_work_item`. + + .. code-block:: python + :linenos: + :caption: **Example:** Query maximum number of work-items for a device + :name: ex_max_work_item + + import dpctl + import math + + d = dpctl.SyclDevice("gpu") + d.print_device_info() + + max_num_work_items = ( + d.max_work_group_size + * d.max_work_item_sizes1d[0] + * d.max_work_item_sizes2d[0] + * d.max_work_item_sizes3d[0] + ) + print(max_num_work_items, f"(2^{int(math.log(max_num_work_items, 2))})") + + cpud = dpctl.SyclDevice("cpu") + cpud.print_device_info() + + max_num_work_items_cpu = ( + cpud.max_work_group_size + * cpud.max_work_item_sizes1d[0] + * cpud.max_work_item_sizes2d[0] + * cpud.max_work_item_sizes3d[0] + ) + print(max_num_work_items_cpu, f"(2^{int(math.log(max_num_work_items_cpu, 2))})") + + The output for :ref:`ex_max_work_item` on a system with an Intel Gen9 integrated + graphics processor and a 9th Generation Coffee Lake CPU is shown in + :ref:`ex_max_work_item_output`. + + .. code-block:: bash + :caption: **OUTPUT:** Query maximum number of work-items for a device + :name: ex_max_work_item_output + + Name Intel(R) UHD Graphics 630 [0x3e98] + Driver version 1.3.24595 + Vendor Intel(R) Corporation + Filter string level_zero:gpu:0 + + 4294967296 (2^32) + Name Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz + Driver version 2023.16.12.0.12_195853.xmain-hotfix + Vendor Intel(R) Corporation + Filter string opencl:cpu:0 + + 4503599627370496 (2^52) + + +The ``call_kernel`` function can be invoked both from CPython and from another +Numba* compiled function. Note that the ``call_kernel`` function supports only +synchronous execution of kernel and the ``call_kernel_async`` function should be +used for asynchronous mode of kernel execution (refer +:ref:`launching-an-async-kernel`). + + +.. seealso:: + + Refer the API documentation for + :func:`numba_dpex.core.kernel_launcher.call_kernel` for more details. diff --git a/docs/source/user_guide/kernel_programming/device-functions.rst b/docs/source/user_guide/kernel_programming/device-functions.rst index 1387e4a0da..b9dd914a0a 100644 --- a/docs/source/user_guide/kernel_programming/device-functions.rst +++ b/docs/source/user_guide/kernel_programming/device-functions.rst @@ -1,29 +1,33 @@ Numba-dpex provides a decorator to express auxiliary device-only functions that can be called from a kernel or another device function, but are not callable -from the host. This decorator :func:`numba_dpex.experimental.device_func` has no -direct analogue in SYCL and primarily is provided to help programmers make their -kapi applications modular. +from the host. This decorator :func:`numba_dpex.core.decorators.device_func` has +no direct analogue in SYCL and primarily is provided to help programmers make +their kapi applications modular. :ref:`ex_device_func1` shows a simple usage of +the ``device_func`` decorator. .. code-block:: python + :linenos: + :caption: **Example:** Basic usage of device_func + :name: ex_device_func1 import dpnp - from numba_dpex import experimental as dpex_exp + import numba_dpex as dpex from numba_dpex import kernel_api as kapi # Array size N = 10 - @dpex_exp.device_func + @dpex.device_func def a_device_function(a): - """A device callable function that can be invoked from a ``kernel`` or + """A device callable function that can be invoked from a kernel or another device function. """ return a + 1 - @dpex_exp.kernel + @dpex.kernel def a_kernel_function(item: kapi.Item, a, b): """Demonstrates calling a device function from a kernel.""" i = item.get_id(0) @@ -34,31 +38,45 @@ kapi applications modular. a = dpnp.ones(N, dtype=dpnp.int32) b = dpnp.zeros(N, dtype=dpnp.int32) - dpex_exp.call_kernel(a_kernel_function, dpex.Range(N), a, b) + dpex.call_kernel(a_kernel_function, dpex.Range(N), a, b) - @dpex_exp.device_func - def increment_value(nd_item: NdItem, a): +.. code-block:: python + :linenos: + :caption: **Example:** Using kapi functionalities in a device_func + :name: ex_device_func2 + + import dpnp + + import numba_dpex as dpex + from numba_dpex import kernel_api as kapi + + + @dpex.device_func + def increment_value(nd_item: kapi.NdItem, a): """Demonstrates the usage of group_barrier and NdItem usage in a device_func. """ i = nd_item.get_global_id(0) a[i] += 1 - group_barrier(nd_item.get_group(), MemoryScope.DEVICE) + kapi.group_barrier(nd_item.get_group(), kapi.MemoryScope.DEVICE) if i == 0: for idx in range(1, a.size): a[0] += a[idx] - @dpex_exp.kernel - def another_kernel(nd_item: NdItem, a): + @dpex.kernel + def another_kernel(nd_item: kapi.NdItem, a): """The kernel does everything by calling a device_func.""" increment_value(nd_item, a) - dpex_exp.call_kernel(another_kernel, dpex.NdRange((N,), (N,)), b) + N = 16 + b = dpnp.ones(N, dtype=dpnp.int32) + + dpex.call_kernel(another_kernel, dpex.NdRange((N,), (N,)), b) A device function does not require the first argument to be an index space id diff --git a/docs/source/user_guide/kernel_programming/index.rst b/docs/source/user_guide/kernel_programming/index.rst index 0606fabee1..a28564c813 100644 --- a/docs/source/user_guide/kernel_programming/index.rst +++ b/docs/source/user_guide/kernel_programming/index.rst @@ -26,7 +26,6 @@ the concepts needed to write data-parallel kernels in numba-dpex. Core concepts ************* - Writing a *range* kernel ======================== @@ -35,130 +34,18 @@ Writing a *range* kernel Writing an *nd-range* kernel ============================ -In a range kernel, the kernel execution is scheduled over a set of work items -without any explicit grouping of the work items. The basic form of parallelism -that can be expressed using a range kernel does not allow expressing any notion -of locality within the kernel. To get around that limitation, kapi provides a -second form of expressing a parallel kernel that is called an *nd-range* kernel. -An nd-range kernel represents a data-parallel execution of the kernel by a set -of explicitly defined groups of work items. An individual group of work items is -called a *work group*. :ref:`ex_matmul_kernel` demonstrates an nd-range kernel -and some of the advanced features programmers can use in this type of kernel. - -.. code-block:: python - :linenos: - :caption: **Example:** Sliding window matrix multiplication as an nd-range kernel - :name: ex_matmul_kernel - - from numba_dpex import kernel_api as kapi - import numba_dpex.experimental as dpex_exp - import numpy as np - import dpctl.tensor as dpt - - square_block_side = 2 - work_group_size = (square_block_side, square_block_side) - dtype = np.float32 - - - @dpex_exp.kernel - def matmul( - nditem: kapi.NdItem, - X, # IN READ-ONLY (X_n_rows, n_cols) - y, # IN READ-ONLY (n_cols, y_n_rows), - X_slm, # SLM to store a sliding window over X - Y_slm, # SLM to store a sliding window over Y - result, # OUT (X_n_rows, y_n_rows) - ): - X_n_rows = X.shape[0] - Y_n_cols = y.shape[1] - n_cols = X.shape[1] - - result_row_idx = nditem.get_global_id(0) - result_col_idx = nditem.get_global_id(1) - - local_row_idx = nditem.get_local_id(0) - local_col_idx = nditem.get_local_id(1) - - n_blocks_for_cols = n_cols // square_block_side - if (n_cols % square_block_side) > 0: - n_blocks_for_cols += 1 - - output = dtype(0) - - gr = nditem.get_group() - - for block_idx in range(n_blocks_for_cols): - X_slm[local_row_idx, local_col_idx] = dtype(0) - Y_slm[local_row_idx, local_col_idx] = dtype(0) - if (result_row_idx < X_n_rows) and ( - (local_col_idx + (square_block_side * block_idx)) < n_cols - ): - X_slm[local_row_idx, local_col_idx] = X[ - result_row_idx, local_col_idx + (square_block_side * block_idx) - ] - - if (result_col_idx < Y_n_cols) and ( - (local_row_idx + (square_block_side * block_idx)) < n_cols - ): - Y_slm[local_row_idx, local_col_idx] = y[ - local_row_idx + (square_block_side * block_idx), result_col_idx - ] - - kapi.group_barrier(gr) - - for idx in range(square_block_side): - output += X_slm[local_row_idx, idx] * Y_slm[idx, local_col_idx] - - kapi.group_barrier(gr) - - if (result_row_idx < X_n_rows) and (result_col_idx < Y_n_cols): - result[result_row_idx, result_col_idx] = output - - - def _arange_reshaped(shape, dtype): - n_items = shape[0] * shape[1] - return np.arange(n_items, dtype=dtype).reshape(shape) - - - X = _arange_reshaped((5, 5), dtype) - Y = _arange_reshaped((5, 5), dtype) - X = dpt.asarray(X) - Y = dpt.asarray(Y) - device = X.device.sycl_device - result = dpt.zeros((5, 5), dtype, device=device) - X_slm = kapi.LocalAccessor(shape=work_group_size, dtype=dtype) - Y_slm = kapi.LocalAccessor(shape=work_group_size, dtype=dtype) +.. include:: ./writing-ndrange-kernel.rst - dpex_exp.call_kernel(matmul, kapi.NdRange((6, 6), (2, 2)), X, Y, X_slm, Y_slm, result) - - -When writing an nd-range kernel, a programmer -defines a set of groups of work items instead of a flat execution range - -An nd-range kernel needs to be launched with -an instance of the :py:class:`numba_dpex.kernel_api.NdRange` class and the first -argument to an nd-range kernel has to be an instance of -:py:class:`numba_dpex.kernel_api.NdItem`. An ``NdRange`` object defines a set of -work groups each with it own set of work items. +.. Launching a kernel +.. ================== +.. include:: ./call-kernel.rst The ``device_func`` decorator ============================= .. include:: ./device-functions.rst -Supported mathematical operations -================================= - -.. include:: ./math-functions.rst - -Supported Python operators -========================== - -Supported general Python features -================================= - -.. include:: ./supported-python-features.rst Supported types of kernel argument ================================== @@ -172,7 +59,7 @@ arguments are passed by value. Supported array types --------------------- - `dpctl.tensor.usm_ndarray`_ : A SYCL-based Python Array API complaint tensor. -- `dpnp.ndarray`_ : A ``numpy.ndarray`` type container that supports SYCL USM memory allocation. +- `dpnp.ndarray`_ : A ``numpy.ndarray``-like array container that supports SYCL USM memory allocation. Scalar types ------------ @@ -236,8 +123,24 @@ users should first convert their input tensor or ndarray object into either of the two supported array types, both of which support DLPack. -Launching a kernel -================== +Supported Python features +************************* + +Mathematical operations +======================= + +.. include:: ./math-functions.rst + +Operators +========= + +.. include:: ./operators.rst + +General Python features +======================= + +.. include:: ./supported-python-features.rst + Advanced concepts ***************** @@ -254,8 +157,10 @@ Group barrier synchronization Atomic operations ================= -Async kernel execution -====================== +.. Async kernel execution +.. ====================== + +.. include:: ./call-kernel-async.rst Specializing a kernel or a device_func ====================================== diff --git a/docs/source/user_guide/kernel_programming/operators.csv b/docs/source/user_guide/kernel_programming/operators.csv new file mode 100644 index 0000000000..71e855cedb --- /dev/null +++ b/docs/source/user_guide/kernel_programming/operators.csv @@ -0,0 +1,35 @@ +Name, Operator, Note +Addition, ``+``, +Multiplication, ``*``, +Subtraction, ``-``, +Division, ``/``, +Floor Division, ``//``, +Modulo, ``%``, +Exponent, ``**``, +In-place Addition, ``+=``, +In-place Subtraction, ``-=``, +In-place Division, ``/=``, +In-place Floor Division, ``//=``, +In-place Modulo, ``%=``, +In-place Exponent, ``**=``, Only supported on OpenCL CPU devices +Bitwise And, ``&``, +Bitwise Left Shift, ``<<``, +Bitwise Right Shift, ``>>``, +Bitwise Or, ``|``, +Bitwise Exclusive Or, ``^``, +In-place Bitwise And, ``&=``, +In-place Bitwise Left Shift, ``<<=``, +In-place Bitwise Right Shift, ``>>=``, +In-place Bitwise Or, ``|=``, +In-place Bitwise Exclusive Or, ``^=``, +Negation, ``-``, +Complement, ``~``, +Pos, ``+``, +Less Than, ``<``, +Less Than Equal, ``<=``, +Greater Than, ``>``, +Greater Than Equal, ``>=``, +Equal To, ``==``, +Not Equal To, ``!=``, +Matmul, ``@``, **Not supported** +In-place Matmul, ``@=``, **Not supported** diff --git a/docs/source/user_guide/kernel_programming/operators.rst b/docs/source/user_guide/kernel_programming/operators.rst new file mode 100644 index 0000000000..f07ef4c986 --- /dev/null +++ b/docs/source/user_guide/kernel_programming/operators.rst @@ -0,0 +1,6 @@ +List of supported Python operators that can be used in a ``kernel`` or +``device_func`` decorated function. + +.. csv-table:: Current support matrix of Python operators + :file: ./operators.csv + :header-rows: 1 diff --git a/docs/source/user_guide/kernel_programming/writing-ndrange-kernel.rst b/docs/source/user_guide/kernel_programming/writing-ndrange-kernel.rst new file mode 100644 index 0000000000..43632176d3 --- /dev/null +++ b/docs/source/user_guide/kernel_programming/writing-ndrange-kernel.rst @@ -0,0 +1,129 @@ + +In a range kernel, the kernel execution is scheduled over a set of work-items +without any explicit grouping of the work-items. The basic form of parallelism +that can be expressed using a range kernel does not allow expressing any notion +of locality within the kernel. To get around that limitation, kapi provides a +second form of expressing a parallel kernel that is called an *nd-range* kernel. +An nd-range kernel represents a data-parallel execution of the kernel by a set +of explicitly defined groups of work-items. An individual group of work-items is +called a *work-group*. :ref:`ex_matmul_kernel` demonstrates an nd-range kernel +and some of the advanced features programmers can use in this type of kernel. + +.. code-block:: python + :linenos: + :caption: **Example:** Sliding window matrix multiplication as an nd-range kernel + :name: ex_matmul_kernel + + from numba_dpex import kernel_api as kapi + import numba_dpex as dpex + import numpy as np + import dpctl.tensor as dpt + + square_block_side = 2 + work_group_size = (square_block_side, square_block_side) + dtype = np.float32 + + + @dpex.kernel + def matmul( + nditem: kapi.NdItem, + X, # IN READ-ONLY (X_n_rows, n_cols) + y, # IN READ-ONLY (n_cols, y_n_rows), + X_slm, # SLM to store a sliding window over X + Y_slm, # SLM to store a sliding window over Y + result, # OUT (X_n_rows, y_n_rows) + ): + X_n_rows = X.shape[0] + Y_n_cols = y.shape[1] + n_cols = X.shape[1] + + result_row_idx = nditem.get_global_id(0) + result_col_idx = nditem.get_global_id(1) + + local_row_idx = nditem.get_local_id(0) + local_col_idx = nditem.get_local_id(1) + + n_blocks_for_cols = n_cols // square_block_side + if (n_cols % square_block_side) > 0: + n_blocks_for_cols += 1 + + output = dtype(0) + + gr = nditem.get_group() + + for block_idx in range(n_blocks_for_cols): + X_slm[local_row_idx, local_col_idx] = dtype(0) + Y_slm[local_row_idx, local_col_idx] = dtype(0) + if (result_row_idx < X_n_rows) and ( + (local_col_idx + (square_block_side * block_idx)) < n_cols + ): + X_slm[local_row_idx, local_col_idx] = X[ + result_row_idx, local_col_idx + (square_block_side * block_idx) + ] + + if (result_col_idx < Y_n_cols) and ( + (local_row_idx + (square_block_side * block_idx)) < n_cols + ): + Y_slm[local_row_idx, local_col_idx] = y[ + local_row_idx + (square_block_side * block_idx), result_col_idx + ] + + kapi.group_barrier(gr) + + for idx in range(square_block_side): + output += X_slm[local_row_idx, idx] * Y_slm[idx, local_col_idx] + + kapi.group_barrier(gr) + + if (result_row_idx < X_n_rows) and (result_col_idx < Y_n_cols): + result[result_row_idx, result_col_idx] = output + + + def _arange_reshaped(shape, dtype): + n_items = shape[0] * shape[1] + return np.arange(n_items, dtype=dtype).reshape(shape) + + + X = _arange_reshaped((5, 5), dtype) + Y = _arange_reshaped((5, 5), dtype) + X = dpt.asarray(X) + Y = dpt.asarray(Y) + device = X.device.sycl_device + result = dpt.zeros((5, 5), dtype, device=device) + X_slm = kapi.LocalAccessor(shape=work_group_size, dtype=dtype) + Y_slm = kapi.LocalAccessor(shape=work_group_size, dtype=dtype) + + dpex.call_kernel(matmul, kapi.NdRange((6, 6), (2, 2)), X, Y, X_slm, Y_slm, result) + + +When writing an nd-range kernel, a programmer defines a set of groups of +work-items instead of a flat execution range.There are several semantic rules +associated both with a work-group and the work-items in a work-group: + +- Each work-group gets executed in an arbitrary order by the underlying + runtime and programmers should not assume any implicit ordering. + +- Work-items in different wok-groups cannot communicate with each other except + via atomic operations on global memory. + +- Work-items within a work-group share a common memory region called + "shared local memory" (SLM). Depending on the device the SLM maybe mapped to a + dedicated fast memory. + +- Work-items in a work-group can synchronize using a + :func:`numba_dpex.kernel_api.group_barrier` operation that can additionally + guarantee memory consistency using a *work-group memory fence*. + +.. note:: + + The SYCL language provides additional features for work-items in a + work-group such as *group functions* that specify communication routines + across work-items and also implement patterns such as reduction and scan. + These features are not yet available in numba-dpex. + +An nd-range kernel needs to be launched with an instance of the +:py:class:`numba_dpex.kernel_api.NdRange` class and the first +argument to an nd-range kernel has to be an instance of +:py:class:`numba_dpex.kernel_api.NdItem`. Apart from the need to provide an +```NdItem`` parameter, the rest of the semantic rules that apply to a range +kernel also apply to an nd-range kernel. diff --git a/docs/source/user_guide/kernel_programming/writing-range-kernel.rst b/docs/source/user_guide/kernel_programming/writing-range-kernel.rst index 376ffc34d1..70daf10975 100644 --- a/docs/source/user_guide/kernel_programming/writing-range-kernel.rst +++ b/docs/source/user_guide/kernel_programming/writing-range-kernel.rst @@ -1,6 +1,6 @@ A *range* kernel represents the simplest form of parallelism that can be expressed in numba-dpex using kapi. Such a kernel represents a data-parallel -execution over a set of work items with each work item representing a logical +execution over a set of work-items with each work-item representing a logical thread of execution. :ref:`ex_vecadd_kernel` shows an example of a range kernel written in numba-dpex. @@ -11,7 +11,7 @@ written in numba-dpex. :emphasize-lines: 9,17 import dpnp - import numba_dpex.experimental as dpex + import numba_dpex as dpex from numba_dpex import kernel_api as kapi @@ -29,81 +29,20 @@ written in numba-dpex. dpex.call_kernel(vecadd, kapi.Range(N), a, b, c) The highlighted lines in the example demonstrate the definition of the execution -range on **line 17** and extraction of every work items' *id* or index position +range on **line 17** and extraction of every work-items' *id* or index position via the ``item.get_id`` call on **line 10**. An execution range comprising of -1024 work items is defined when calling the kernel and each work item then -executes a single addition. Note that the array sizes for the input and output -arguments are equal to the size of the execution range. For very large arrays, -the design will not scale as there is usually an upper bound for the range size -depending on device. For most current Intel GPU devices, the maximum range size -is 2^32 and a kernel requesting more work items than that bound will not -execute. As such, programmers need to consider the size of the data and the -access patterns for their kernels before scheduling a range kernel. The maximum -number of work items can be queried programmatically as shown in -:ref:`ex_max_work_item`. - -.. code-block:: python - :linenos: - :caption: **Example:** Query maximum number of work items for a device - :name: ex_max_work_item - - import dpctl - import math - - d = dpctl.SyclDevice("gpu") - d.print_device_info() - - max_num_work_items = ( - d.max_work_group_size - * d.max_work_item_sizes1d[0] - * d.max_work_item_sizes2d[0] - * d.max_work_item_sizes3d[0] - ) - print(max_num_work_items, f"(2^{int(math.log(max_num_work_items, 2))})") - - cpud = dpctl.SyclDevice("cpu") - cpud.print_device_info() - - max_num_work_items_cpu = ( - cpud.max_work_group_size - * cpud.max_work_item_sizes1d[0] - * cpud.max_work_item_sizes2d[0] - * cpud.max_work_item_sizes3d[0] - ) - print(max_num_work_items_cpu, f"(2^{int(math.log(max_num_work_items_cpu, 2))})") - -The output for :ref:`ex_max_work_item` on a system with an Intel Gen9 integrated -graphics processor and a 9th Generation Coffee Lake CPU is shown in -:ref:`ex_max_work_item_output`. - -.. code-block:: bash - :caption: **OUTPUT:** Query maximum number of work items for a device - :name: ex_max_work_item_output - - Name Intel(R) UHD Graphics 630 [0x3e98] - Driver version 1.3.24595 - Vendor Intel(R) Corporation - Filter string level_zero:gpu:0 - - 4294967296 (2^32) - Name Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz - Driver version 2023.16.12.0.12_195853.xmain-hotfix - Vendor Intel(R) Corporation - Filter string opencl:cpu:0 - - 4503599627370496 (2^52) - - +1024 work-items is defined when calling the kernel and each work-item then +executes a single addition. There are a few semantic rules that have to be adhered to when writing a range kernel: * Analogous to the API of SYCL a range kernel can execute only over a 1-, 2-, or - a 3-dimensional set of work items. + a 3-dimensional set of work-items. * Every range kernel requires its first argument to be an instance of the :class:`numba_dpex.kernel_api.Item` class. The ``Item`` object is an - abstraction encapsulating the index position (id) of a single work item in the + abstraction encapsulating the index position (id) of a single work-item in the global execution range. The id will be a 1-, 2-, or a 3-tuple depending the dimensionality of the execution range. @@ -119,28 +58,24 @@ kernel: kernel. Scalar values are always passed by value. * At least one argument of a kernel should be an array. The requirement is so - that the kernel launcher (:func:`numba_dpex.experimental.call_kernel`) can - determine the execution queue on which to launch the kernel. Refer - the "Launching a kernel" section for more details. - -A range kernel has to be executed by calling the -:py:func:`numba_dpex.experimental.launcher.call_kernel` function. The execution -range for the kernel is specified by creating an instance of a -:class:`numba_dpex.kernel_api.Range` class and passing the ``Range`` object as -an argument to ``call_kernel``. The ``call_kernel`` function does three things: -compiles the kernel if needed, "unboxes" all kernel arguments by converting -CPython objects into numba-dpex objects, and finally submitting the kernel to an -execution queue with the specified execution range. Refer the -:doc:`../../autoapi/index` for further details. + that the kernel launcher (:func:`numba_dpex.core.kernel_launcher.call_kernel`) + can determine the execution queue on which to launch the kernel. Refer to the + :ref:`launching-a-kernel` section for more details. + +A range kernel has to be executed via the +:py:func:`numba_dpex.core.kernel_launcher.call_kernel` function by passing in +an instance of the :class:`numba_dpex.kernel_api.Range` class. Refer to the +:ref:`launching-a-kernel` section for more details on how to launch a range +kernel. A range kernel is meant to express a basic `parallel-for` calculation that is -ideally suited for embarrassingly parallel kernels such as elementwise +ideally suited for embarrassingly parallel kernels such as element-wise computations over n-dimensional arrays (ndarrays). The API for expressing a -range kernel does not allow advanced features such as synchronization of work -items and fine-grained control over memory allocation on a device. For such +range kernel does not allow advanced features such as synchronization of +work-items and fine-grained control over memory allocation on a device. For such advanced features, an nd-range kernel should be used. .. seealso:: Refer API documentation for :class:`numba_dpex.kernel_api.Range`, :class:`numba_dpex.kernel_api.Item`, and - :func:`numba_dpex.experimental.launcher.call_kernel` for more details. + :func:`numba_dpex.core.kernel_launcher.call_kernel` for more details.