From f82360b9a80ccedb64b601f91cd35612ee4946d3 Mon Sep 17 00:00:00 2001
From: Diptorup Deb <diptorup.deb@intel.com>
Date: Mon, 18 Mar 2024 22:48:00 -0500
Subject: [PATCH] Restructure and fill in the kernel programming guide.

---
 docs/source/ext_links.txt                     |   3 +
 .../kernel_programming/call-kernel-async.rst  |   4 +
 .../kernel_programming/call-kernel.rst        | 117 ++++++++++++++
 .../kernel_programming/device-functions.rst   |  46 ++++--
 .../user_guide/kernel_programming/index.rst   | 149 ++++--------------
 .../kernel_programming/operators.csv          |  35 ++++
 .../kernel_programming/operators.rst          |   6 +
 .../writing-ndrange-kernel.rst                | 129 +++++++++++++++
 .../writing-range-kernel.rst                  | 105 +++---------
 9 files changed, 373 insertions(+), 221 deletions(-)
 create mode 100644 docs/source/user_guide/kernel_programming/call-kernel-async.rst
 create mode 100644 docs/source/user_guide/kernel_programming/call-kernel.rst
 create mode 100644 docs/source/user_guide/kernel_programming/operators.csv
 create mode 100644 docs/source/user_guide/kernel_programming/operators.rst
 create mode 100644 docs/source/user_guide/kernel_programming/writing-ndrange-kernel.rst

diff --git a/docs/source/ext_links.txt b/docs/source/ext_links.txt
index 3bb02afdfd..4c369fbd92 100644
--- a/docs/source/ext_links.txt
+++ b/docs/source/ext_links.txt
@@ -31,3 +31,6 @@
 .. _oneAPI GPU optimization guide: https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2024-0/general-purpose-computing-on-gpu.html
 .. _dpctl.tensor.usm_ndarray: https://intelpython.github.io/dpctl/latest/docfiles/dpctl/usm_ndarray.html#dpctl.tensor.usm_ndarray
 .. _dpnp.ndarray: https://intelpython.github.io/dpnp/reference/ndarray.html
+
+.. _Dispatcher: https://numba.readthedocs.io/en/stable/reference/jit-compilation.html#dispatcher-objects
+.. _Unboxes: https://numba.readthedocs.io/en/stable/extending/interval-example.html#boxing-and-unboxing
diff --git a/docs/source/user_guide/kernel_programming/call-kernel-async.rst b/docs/source/user_guide/kernel_programming/call-kernel-async.rst
new file mode 100644
index 0000000000..b3a657890d
--- /dev/null
+++ b/docs/source/user_guide/kernel_programming/call-kernel-async.rst
@@ -0,0 +1,4 @@
+.. _launching-an-async-kernel:
+
+Async kernel execution
+======================
diff --git a/docs/source/user_guide/kernel_programming/call-kernel.rst b/docs/source/user_guide/kernel_programming/call-kernel.rst
new file mode 100644
index 0000000000..2655027460
--- /dev/null
+++ b/docs/source/user_guide/kernel_programming/call-kernel.rst
@@ -0,0 +1,117 @@
+.. _launching-a-kernel:
+
+Launching a kernel
+==================
+
+A ``kernel`` decorated kapi function produces a ``KernelDispatcher`` object that
+is a type of a Numba* `Dispatcher`_ object. However, unlike regular Numba*
+Dispatcher objects a ``KernelDispatcher`` object cannot be directly invoked from
+either CPython or another compiled Numba* ``jit`` function. To invoke a
+``kernel`` decorated function, a programmer has to use the
+:func:`numba_dpex.core.kernel_launcher.call_kernel` function.
+
+To invoke a ``KernelDispatcher`` the ``call_kernel`` function requires three
+things: the ``KernelDispatcher`` object, the ``Range`` or ``NdRange`` object
+over which the kernel is to be executed, and the list of arguments to be passed
+to the compiled kernel. Once called with the necessary  arguments, the
+``call_kernel`` function does the following main things:
+
+- Compiles the ``KernelDispatcher`` object specializing it for the provided
+  argument types.
+
+- `Unboxes`_  the kernel arguments by converting CPython objects into Numba* or
+   numba-dpex objects.
+
+- Infer the execution queue on which to submit the kernel from the provided
+  kernel arguments. (TODO: Refer compute follows data.)
+
+- Submits the kernel to the execution queue.
+
+- Waits for the execution completion, before returning control back to the
+  caller.
+
+.. important::
+    Programmers should note the following two things when defining the global or
+    local range to launch a kernel.
+
+    * Numba-dpex currently limits the maximum allowed global range size to
+      ``2^31-1``. It is due to the capabilities of current OpenCL GPU backends
+      that generally do not support more than 32-bit global range sizes. A
+      kernel requesting a larger global range than that will not execute and a
+      ``dpctl._sycl_queue.SyclKernelSubmitError`` will get raised.
+
+      The Intel dpcpp SYCL compiler does handle greater than 32-bit global
+      ranges for GPU backends by wrapping the kernel in a new kernel that has
+      each work-item perform multiple invocations of the original kernel in a
+      32-bit global range. Such a feature is not yet available in numba-dpex.
+
+    * When launching an nd-range kernel, if the number of work-items for a
+      particular dimension of a work-group exceeds the maximum device
+      capability, it can result in undefined behavior.
+
+    The maximum allowed work-items for a device can be queried programmatically
+    as shown in :ref:`ex_max_work_item`.
+
+    .. code-block:: python
+        :linenos:
+        :caption: **Example:** Query maximum number of work-items for a device
+        :name: ex_max_work_item
+
+        import dpctl
+        import math
+
+        d = dpctl.SyclDevice("gpu")
+        d.print_device_info()
+
+        max_num_work_items = (
+            d.max_work_group_size
+            * d.max_work_item_sizes1d[0]
+            * d.max_work_item_sizes2d[0]
+            * d.max_work_item_sizes3d[0]
+        )
+        print(max_num_work_items, f"(2^{int(math.log(max_num_work_items, 2))})")
+
+        cpud = dpctl.SyclDevice("cpu")
+        cpud.print_device_info()
+
+        max_num_work_items_cpu = (
+            cpud.max_work_group_size
+            * cpud.max_work_item_sizes1d[0]
+            * cpud.max_work_item_sizes2d[0]
+            * cpud.max_work_item_sizes3d[0]
+        )
+        print(max_num_work_items_cpu, f"(2^{int(math.log(max_num_work_items_cpu, 2))})")
+
+    The output for :ref:`ex_max_work_item` on a system with an Intel Gen9 integrated
+    graphics processor and a 9th Generation Coffee Lake CPU is shown in
+    :ref:`ex_max_work_item_output`.
+
+    .. code-block:: bash
+        :caption: **OUTPUT:** Query maximum number of work-items for a device
+        :name: ex_max_work_item_output
+
+            Name            Intel(R) UHD Graphics 630 [0x3e98]
+            Driver version  1.3.24595
+            Vendor          Intel(R) Corporation
+            Filter string   level_zero:gpu:0
+
+        4294967296 (2^32)
+            Name            Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
+            Driver version  2023.16.12.0.12_195853.xmain-hotfix
+            Vendor          Intel(R) Corporation
+            Filter string   opencl:cpu:0
+
+        4503599627370496 (2^52)
+
+
+The ``call_kernel`` function can be invoked both from CPython and from another
+Numba* compiled function. Note that the ``call_kernel`` function supports only
+synchronous execution of kernel and the ``call_kernel_async`` function should be
+used for asynchronous mode of kernel execution (refer
+:ref:`launching-an-async-kernel`).
+
+
+.. seealso::
+
+    Refer the API documentation for
+    :func:`numba_dpex.core.kernel_launcher.call_kernel` for more details.
diff --git a/docs/source/user_guide/kernel_programming/device-functions.rst b/docs/source/user_guide/kernel_programming/device-functions.rst
index 1387e4a0da..b9dd914a0a 100644
--- a/docs/source/user_guide/kernel_programming/device-functions.rst
+++ b/docs/source/user_guide/kernel_programming/device-functions.rst
@@ -1,29 +1,33 @@
 Numba-dpex provides a decorator to express auxiliary device-only functions that
 can be called from a kernel or another device function, but are not callable
-from the host. This decorator :func:`numba_dpex.experimental.device_func` has no
-direct analogue in SYCL and primarily is provided to help programmers make their
-kapi applications modular.
+from the host. This decorator :func:`numba_dpex.core.decorators.device_func` has
+no direct analogue in SYCL and primarily is provided to help programmers make
+their kapi applications modular. :ref:`ex_device_func1` shows a simple usage of
+the ``device_func`` decorator.
 
 .. code-block:: python
+    :linenos:
+    :caption: **Example:** Basic usage of device_func
+    :name: ex_device_func1
 
     import dpnp
 
-    from numba_dpex import experimental as dpex_exp
+    import numba_dpex as dpex
     from numba_dpex import kernel_api as kapi
 
     # Array size
     N = 10
 
 
-    @dpex_exp.device_func
+    @dpex.device_func
     def a_device_function(a):
-        """A device callable function that can be invoked from a ``kernel`` or
+        """A device callable function that can be invoked from a kernel or
         another device function.
         """
         return a + 1
 
 
-    @dpex_exp.kernel
+    @dpex.kernel
     def a_kernel_function(item: kapi.Item, a, b):
         """Demonstrates calling a device function from a kernel."""
         i = item.get_id(0)
@@ -34,31 +38,45 @@ kapi applications modular.
     a = dpnp.ones(N, dtype=dpnp.int32)
     b = dpnp.zeros(N, dtype=dpnp.int32)
 
-    dpex_exp.call_kernel(a_kernel_function, dpex.Range(N), a, b)
+    dpex.call_kernel(a_kernel_function, dpex.Range(N), a, b)
 
 
-    @dpex_exp.device_func
-    def increment_value(nd_item: NdItem, a):
+.. code-block:: python
+    :linenos:
+    :caption: **Example:** Using kapi functionalities in a device_func
+    :name: ex_device_func2
+
+    import dpnp
+
+    import numba_dpex as dpex
+    from numba_dpex import kernel_api as kapi
+
+
+    @dpex.device_func
+    def increment_value(nd_item: kapi.NdItem, a):
         """Demonstrates the usage of group_barrier and NdItem usage in a
         device_func.
         """
         i = nd_item.get_global_id(0)
 
         a[i] += 1
-        group_barrier(nd_item.get_group(), MemoryScope.DEVICE)
+        kapi.group_barrier(nd_item.get_group(), kapi.MemoryScope.DEVICE)
 
         if i == 0:
             for idx in range(1, a.size):
                 a[0] += a[idx]
 
 
-    @dpex_exp.kernel
-    def another_kernel(nd_item: NdItem, a):
+    @dpex.kernel
+    def another_kernel(nd_item: kapi.NdItem, a):
         """The kernel does everything by calling a device_func."""
         increment_value(nd_item, a)
 
 
-    dpex_exp.call_kernel(another_kernel, dpex.NdRange((N,), (N,)), b)
+    N = 16
+    b = dpnp.ones(N, dtype=dpnp.int32)
+
+    dpex.call_kernel(another_kernel, dpex.NdRange((N,), (N,)), b)
 
 
 A device function does not require the first argument to be an index space id
diff --git a/docs/source/user_guide/kernel_programming/index.rst b/docs/source/user_guide/kernel_programming/index.rst
index 0606fabee1..a28564c813 100644
--- a/docs/source/user_guide/kernel_programming/index.rst
+++ b/docs/source/user_guide/kernel_programming/index.rst
@@ -26,7 +26,6 @@ the concepts needed to write data-parallel kernels in numba-dpex.
 Core concepts
 *************
 
-
 Writing a *range* kernel
 ========================
 
@@ -35,130 +34,18 @@ Writing a *range* kernel
 Writing an *nd-range* kernel
 ============================
 
-In a range kernel, the kernel execution is scheduled over a set of work items
-without any explicit grouping of the work items. The basic form of parallelism
-that can be expressed using a range kernel does not allow expressing any notion
-of locality within the kernel. To get around that limitation, kapi provides a
-second form of expressing a parallel kernel that is called an *nd-range* kernel.
-An nd-range kernel represents a data-parallel execution of the kernel by a set
-of explicitly defined groups of work items. An individual group of work items is
-called a *work group*. :ref:`ex_matmul_kernel` demonstrates an nd-range kernel
-and some of the advanced features programmers can use in this type of kernel.
-
-.. code-block:: python
-    :linenos:
-    :caption: **Example:** Sliding window matrix multiplication as an nd-range kernel
-    :name: ex_matmul_kernel
-
-    from numba_dpex import kernel_api as kapi
-    import numba_dpex.experimental as dpex_exp
-    import numpy as np
-    import dpctl.tensor as dpt
-
-    square_block_side = 2
-    work_group_size = (square_block_side, square_block_side)
-    dtype = np.float32
-
-
-    @dpex_exp.kernel
-    def matmul(
-        nditem: kapi.NdItem,
-        X,  # IN READ-ONLY    (X_n_rows, n_cols)
-        y,  # IN READ-ONLY    (n_cols, y_n_rows),
-        X_slm,  # SLM to store a sliding window over X
-        Y_slm,  # SLM to store a sliding window over Y
-        result,  # OUT        (X_n_rows, y_n_rows)
-    ):
-        X_n_rows = X.shape[0]
-        Y_n_cols = y.shape[1]
-        n_cols = X.shape[1]
-
-        result_row_idx = nditem.get_global_id(0)
-        result_col_idx = nditem.get_global_id(1)
-
-        local_row_idx = nditem.get_local_id(0)
-        local_col_idx = nditem.get_local_id(1)
-
-        n_blocks_for_cols = n_cols // square_block_side
-        if (n_cols % square_block_side) > 0:
-            n_blocks_for_cols += 1
-
-        output = dtype(0)
-
-        gr = nditem.get_group()
-
-        for block_idx in range(n_blocks_for_cols):
-            X_slm[local_row_idx, local_col_idx] = dtype(0)
-            Y_slm[local_row_idx, local_col_idx] = dtype(0)
-            if (result_row_idx < X_n_rows) and (
-                (local_col_idx + (square_block_side * block_idx)) < n_cols
-            ):
-                X_slm[local_row_idx, local_col_idx] = X[
-                    result_row_idx, local_col_idx + (square_block_side * block_idx)
-                ]
-
-            if (result_col_idx < Y_n_cols) and (
-                (local_row_idx + (square_block_side * block_idx)) < n_cols
-            ):
-                Y_slm[local_row_idx, local_col_idx] = y[
-                    local_row_idx + (square_block_side * block_idx), result_col_idx
-                ]
-
-            kapi.group_barrier(gr)
-
-            for idx in range(square_block_side):
-                output += X_slm[local_row_idx, idx] * Y_slm[idx, local_col_idx]
-
-            kapi.group_barrier(gr)
-
-        if (result_row_idx < X_n_rows) and (result_col_idx < Y_n_cols):
-            result[result_row_idx, result_col_idx] = output
-
-
-    def _arange_reshaped(shape, dtype):
-        n_items = shape[0] * shape[1]
-        return np.arange(n_items, dtype=dtype).reshape(shape)
-
-
-    X = _arange_reshaped((5, 5), dtype)
-    Y = _arange_reshaped((5, 5), dtype)
-    X = dpt.asarray(X)
-    Y = dpt.asarray(Y)
-    device = X.device.sycl_device
-    result = dpt.zeros((5, 5), dtype, device=device)
-    X_slm = kapi.LocalAccessor(shape=work_group_size, dtype=dtype)
-    Y_slm = kapi.LocalAccessor(shape=work_group_size, dtype=dtype)
+.. include:: ./writing-ndrange-kernel.rst
 
-    dpex_exp.call_kernel(matmul, kapi.NdRange((6, 6), (2, 2)), X, Y, X_slm, Y_slm, result)
-
-
-When writing an nd-range kernel, a programmer
-defines a set of groups of work items instead of a flat execution range
-
-An nd-range kernel needs to be launched with
-an instance of the :py:class:`numba_dpex.kernel_api.NdRange` class and the first
-argument to an nd-range kernel has to be an instance of
-:py:class:`numba_dpex.kernel_api.NdItem`. An ``NdRange`` object defines a set of
-work groups each with it own set of work items.
+.. Launching a kernel
+.. ==================
 
+.. include:: ./call-kernel.rst
 
 The ``device_func`` decorator
 =============================
 
 .. include:: ./device-functions.rst
 
-Supported mathematical operations
-=================================
-
-.. include:: ./math-functions.rst
-
-Supported Python operators
-==========================
-
-Supported general Python features
-=================================
-
-.. include:: ./supported-python-features.rst
 
 Supported types of kernel argument
 ==================================
@@ -172,7 +59,7 @@ arguments are passed by value.
 Supported array types
 ---------------------
 - `dpctl.tensor.usm_ndarray`_ : A SYCL-based Python Array API complaint tensor.
-- `dpnp.ndarray`_ :  A ``numpy.ndarray`` type container that supports SYCL USM memory allocation.
+- `dpnp.ndarray`_ :  A ``numpy.ndarray``-like array container that supports SYCL USM memory allocation.
 
 Scalar types
 ------------
@@ -236,8 +123,24 @@ users should first convert their input tensor or ndarray object into either of
 the two supported array types, both of which support DLPack.
 
 
-Launching a kernel
-==================
+Supported Python features
+*************************
+
+Mathematical operations
+=======================
+
+.. include:: ./math-functions.rst
+
+Operators
+=========
+
+.. include:: ./operators.rst
+
+General Python features
+=======================
+
+.. include:: ./supported-python-features.rst
+
 
 Advanced concepts
 *****************
@@ -254,8 +157,10 @@ Group barrier synchronization
 Atomic operations
 =================
 
-Async kernel execution
-======================
+.. Async kernel execution
+.. ======================
+
+.. include:: ./call-kernel-async.rst
 
 Specializing a kernel or a device_func
 ======================================
diff --git a/docs/source/user_guide/kernel_programming/operators.csv b/docs/source/user_guide/kernel_programming/operators.csv
new file mode 100644
index 0000000000..71e855cedb
--- /dev/null
+++ b/docs/source/user_guide/kernel_programming/operators.csv
@@ -0,0 +1,35 @@
+Name, Operator, Note
+Addition, ``+``,
+Multiplication, ``*``,
+Subtraction, ``-``,
+Division, ``/``,
+Floor Division, ``//``,
+Modulo, ``%``,
+Exponent, ``**``,
+In-place Addition, ``+=``,
+In-place Subtraction, ``-=``,
+In-place Division, ``/=``,
+In-place Floor Division, ``//=``,
+In-place Modulo, ``%=``,
+In-place Exponent, ``**=``, Only supported on OpenCL CPU devices
+Bitwise And, ``&``,
+Bitwise Left Shift, ``<<``,
+Bitwise Right Shift, ``>>``,
+Bitwise Or, ``|``,
+Bitwise Exclusive Or, ``^``,
+In-place Bitwise And, ``&=``,
+In-place Bitwise Left Shift, ``<<=``,
+In-place Bitwise Right Shift, ``>>=``,
+In-place Bitwise Or, ``|=``,
+In-place Bitwise Exclusive Or, ``^=``,
+Negation, ``-``,
+Complement, ``~``,
+Pos, ``+``,
+Less Than, ``<``,
+Less Than Equal, ``<=``,
+Greater Than, ``>``,
+Greater Than Equal, ``>=``,
+Equal To, ``==``,
+Not Equal To, ``!=``,
+Matmul, ``@``, **Not supported**
+In-place Matmul, ``@=``, **Not supported**
diff --git a/docs/source/user_guide/kernel_programming/operators.rst b/docs/source/user_guide/kernel_programming/operators.rst
new file mode 100644
index 0000000000..f07ef4c986
--- /dev/null
+++ b/docs/source/user_guide/kernel_programming/operators.rst
@@ -0,0 +1,6 @@
+List of supported Python operators that can be used in a ``kernel`` or
+``device_func`` decorated function.
+
+.. csv-table:: Current support matrix of Python operators
+   :file: ./operators.csv
+   :header-rows: 1
diff --git a/docs/source/user_guide/kernel_programming/writing-ndrange-kernel.rst b/docs/source/user_guide/kernel_programming/writing-ndrange-kernel.rst
new file mode 100644
index 0000000000..43632176d3
--- /dev/null
+++ b/docs/source/user_guide/kernel_programming/writing-ndrange-kernel.rst
@@ -0,0 +1,129 @@
+
+In a range kernel, the kernel execution is scheduled over a set of work-items
+without any explicit grouping of the work-items. The basic form of parallelism
+that can be expressed using a range kernel does not allow expressing any notion
+of locality within the kernel. To get around that limitation, kapi provides a
+second form of expressing a parallel kernel that is called an *nd-range* kernel.
+An nd-range kernel represents a data-parallel execution of the kernel by a set
+of explicitly defined groups of work-items. An individual group of work-items is
+called a *work-group*. :ref:`ex_matmul_kernel` demonstrates an nd-range kernel
+and some of the advanced features programmers can use in this type of kernel.
+
+.. code-block:: python
+    :linenos:
+    :caption: **Example:** Sliding window matrix multiplication as an nd-range kernel
+    :name: ex_matmul_kernel
+
+    from numba_dpex import kernel_api as kapi
+    import numba_dpex as dpex
+    import numpy as np
+    import dpctl.tensor as dpt
+
+    square_block_side = 2
+    work_group_size = (square_block_side, square_block_side)
+    dtype = np.float32
+
+
+    @dpex.kernel
+    def matmul(
+        nditem: kapi.NdItem,
+        X,  # IN READ-ONLY    (X_n_rows, n_cols)
+        y,  # IN READ-ONLY    (n_cols, y_n_rows),
+        X_slm,  # SLM to store a sliding window over X
+        Y_slm,  # SLM to store a sliding window over Y
+        result,  # OUT        (X_n_rows, y_n_rows)
+    ):
+        X_n_rows = X.shape[0]
+        Y_n_cols = y.shape[1]
+        n_cols = X.shape[1]
+
+        result_row_idx = nditem.get_global_id(0)
+        result_col_idx = nditem.get_global_id(1)
+
+        local_row_idx = nditem.get_local_id(0)
+        local_col_idx = nditem.get_local_id(1)
+
+        n_blocks_for_cols = n_cols // square_block_side
+        if (n_cols % square_block_side) > 0:
+            n_blocks_for_cols += 1
+
+        output = dtype(0)
+
+        gr = nditem.get_group()
+
+        for block_idx in range(n_blocks_for_cols):
+            X_slm[local_row_idx, local_col_idx] = dtype(0)
+            Y_slm[local_row_idx, local_col_idx] = dtype(0)
+            if (result_row_idx < X_n_rows) and (
+                (local_col_idx + (square_block_side * block_idx)) < n_cols
+            ):
+                X_slm[local_row_idx, local_col_idx] = X[
+                    result_row_idx, local_col_idx + (square_block_side * block_idx)
+                ]
+
+            if (result_col_idx < Y_n_cols) and (
+                (local_row_idx + (square_block_side * block_idx)) < n_cols
+            ):
+                Y_slm[local_row_idx, local_col_idx] = y[
+                    local_row_idx + (square_block_side * block_idx), result_col_idx
+                ]
+
+            kapi.group_barrier(gr)
+
+            for idx in range(square_block_side):
+                output += X_slm[local_row_idx, idx] * Y_slm[idx, local_col_idx]
+
+            kapi.group_barrier(gr)
+
+        if (result_row_idx < X_n_rows) and (result_col_idx < Y_n_cols):
+            result[result_row_idx, result_col_idx] = output
+
+
+    def _arange_reshaped(shape, dtype):
+        n_items = shape[0] * shape[1]
+        return np.arange(n_items, dtype=dtype).reshape(shape)
+
+
+    X = _arange_reshaped((5, 5), dtype)
+    Y = _arange_reshaped((5, 5), dtype)
+    X = dpt.asarray(X)
+    Y = dpt.asarray(Y)
+    device = X.device.sycl_device
+    result = dpt.zeros((5, 5), dtype, device=device)
+    X_slm = kapi.LocalAccessor(shape=work_group_size, dtype=dtype)
+    Y_slm = kapi.LocalAccessor(shape=work_group_size, dtype=dtype)
+
+    dpex.call_kernel(matmul, kapi.NdRange((6, 6), (2, 2)), X, Y, X_slm, Y_slm, result)
+
+
+When writing an nd-range kernel, a programmer defines a set of groups of
+work-items instead of a flat execution range.There are several semantic rules
+associated both with a work-group and the work-items in a work-group:
+
+- Each work-group gets executed in an arbitrary order by the underlying
+  runtime and programmers should not assume any implicit ordering.
+
+- Work-items in different wok-groups cannot communicate with each other except
+  via atomic operations on global memory.
+
+- Work-items within a work-group share a common memory region called
+  "shared local memory" (SLM). Depending on the device the SLM maybe mapped to a
+  dedicated fast memory.
+
+- Work-items in a work-group can synchronize using a
+  :func:`numba_dpex.kernel_api.group_barrier` operation that can additionally
+  guarantee memory consistency using a *work-group memory fence*.
+
+.. note::
+
+    The SYCL language provides additional features for work-items in a
+    work-group such as *group functions* that specify communication routines
+    across work-items and also implement patterns such as reduction and scan.
+    These features are not yet available in numba-dpex.
+
+An nd-range kernel needs to be launched with an instance of the
+:py:class:`numba_dpex.kernel_api.NdRange` class and the first
+argument to an nd-range kernel has to be an instance of
+:py:class:`numba_dpex.kernel_api.NdItem`. Apart from the need to provide an
+```NdItem`` parameter, the rest of the semantic rules that apply to a range
+kernel also apply to an nd-range kernel.
diff --git a/docs/source/user_guide/kernel_programming/writing-range-kernel.rst b/docs/source/user_guide/kernel_programming/writing-range-kernel.rst
index 376ffc34d1..70daf10975 100644
--- a/docs/source/user_guide/kernel_programming/writing-range-kernel.rst
+++ b/docs/source/user_guide/kernel_programming/writing-range-kernel.rst
@@ -1,6 +1,6 @@
 A *range* kernel represents the simplest form of parallelism that can be
 expressed in numba-dpex using kapi. Such a kernel represents a data-parallel
-execution over a set of work items with each work item representing a logical
+execution over a set of work-items with each work-item representing a logical
 thread of execution. :ref:`ex_vecadd_kernel` shows an example of a range kernel
 written in numba-dpex.
 
@@ -11,7 +11,7 @@ written in numba-dpex.
     :emphasize-lines: 9,17
 
     import dpnp
-    import numba_dpex.experimental as dpex
+    import numba_dpex as dpex
     from numba_dpex import kernel_api as kapi
 
 
@@ -29,81 +29,20 @@ written in numba-dpex.
     dpex.call_kernel(vecadd, kapi.Range(N), a, b, c)
 
 The highlighted lines in the example demonstrate the definition of the execution
-range on **line 17** and extraction of every work items' *id* or index position
+range on **line 17** and extraction of every work-items' *id* or index position
 via the ``item.get_id`` call on **line 10**. An execution range comprising of
-1024 work items is defined when calling the kernel and each work item then
-executes a single addition. Note that the array sizes for the input and output
-arguments are equal to the size of the execution range. For very large arrays,
-the design will not scale as there is usually an upper bound for the range size
-depending on device. For most current Intel GPU devices, the maximum range size
-is 2^32 and a kernel requesting more work items than that bound will not
-execute. As such, programmers need to consider the size of the data and the
-access patterns for their kernels before scheduling a range kernel. The maximum
-number of work items can be queried programmatically as shown in
-:ref:`ex_max_work_item`.
-
-.. code-block:: python
-    :linenos:
-    :caption: **Example:** Query maximum number of work items for a device
-    :name: ex_max_work_item
-
-    import dpctl
-    import math
-
-    d = dpctl.SyclDevice("gpu")
-    d.print_device_info()
-
-    max_num_work_items = (
-        d.max_work_group_size
-        * d.max_work_item_sizes1d[0]
-        * d.max_work_item_sizes2d[0]
-        * d.max_work_item_sizes3d[0]
-    )
-    print(max_num_work_items, f"(2^{int(math.log(max_num_work_items, 2))})")
-
-    cpud = dpctl.SyclDevice("cpu")
-    cpud.print_device_info()
-
-    max_num_work_items_cpu = (
-        cpud.max_work_group_size
-        * cpud.max_work_item_sizes1d[0]
-        * cpud.max_work_item_sizes2d[0]
-        * cpud.max_work_item_sizes3d[0]
-    )
-    print(max_num_work_items_cpu, f"(2^{int(math.log(max_num_work_items_cpu, 2))})")
-
-The output for :ref:`ex_max_work_item` on a system with an Intel Gen9 integrated
-graphics processor and a 9th Generation Coffee Lake CPU is shown in
-:ref:`ex_max_work_item_output`.
-
-.. code-block:: bash
-    :caption: **OUTPUT:** Query maximum number of work items for a device
-    :name: ex_max_work_item_output
-
-        Name            Intel(R) UHD Graphics 630 [0x3e98]
-        Driver version  1.3.24595
-        Vendor          Intel(R) Corporation
-        Filter string   level_zero:gpu:0
-
-    4294967296 (2^32)
-        Name            Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
-        Driver version  2023.16.12.0.12_195853.xmain-hotfix
-        Vendor          Intel(R) Corporation
-        Filter string   opencl:cpu:0
-
-    4503599627370496 (2^52)
-
-
+1024 work-items is defined when calling the kernel and each work-item then
+executes a single addition.
 
 There are a few semantic rules that have to be adhered to when writing a range
 kernel:
 
 * Analogous to the API of SYCL a range kernel can execute only over a 1-, 2-, or
-  a 3-dimensional set of work items.
+  a 3-dimensional set of work-items.
 
 * Every range kernel requires its first argument to be an instance of the
   :class:`numba_dpex.kernel_api.Item` class. The ``Item`` object is an
-  abstraction encapsulating the index position (id) of a single work item in the
+  abstraction encapsulating the index position (id) of a single work-item in the
   global execution range. The id will be a 1-, 2-, or a 3-tuple depending
   the dimensionality of the execution range.
 
@@ -119,28 +58,24 @@ kernel:
   kernel. Scalar values are always passed by value.
 
 * At least one argument of a kernel should be an array. The requirement is so
-  that the kernel launcher (:func:`numba_dpex.experimental.call_kernel`) can
-  determine the execution queue on which to launch the kernel. Refer
-  the "Launching a kernel" section for more details.
-
-A range kernel has to be executed by calling the
-:py:func:`numba_dpex.experimental.launcher.call_kernel` function. The execution
-range for the kernel is specified by creating an instance of a
-:class:`numba_dpex.kernel_api.Range` class and passing the ``Range`` object as
-an argument to ``call_kernel``. The ``call_kernel`` function does three things:
-compiles the kernel if needed, "unboxes" all kernel arguments by converting
-CPython objects into numba-dpex objects, and finally submitting the kernel to an
-execution queue with the specified execution range. Refer the
-:doc:`../../autoapi/index` for further details.
+  that the kernel launcher (:func:`numba_dpex.core.kernel_launcher.call_kernel`)
+  can determine the execution queue on which to launch the kernel. Refer to the
+  :ref:`launching-a-kernel` section for more details.
+
+A range kernel has to be executed via the
+:py:func:`numba_dpex.core.kernel_launcher.call_kernel` function by passing in
+an instance of the :class:`numba_dpex.kernel_api.Range` class. Refer to the
+:ref:`launching-a-kernel` section for more details on how to launch a range
+kernel.
 
 A range kernel is meant to express a basic `parallel-for` calculation that is
-ideally suited for embarrassingly parallel kernels such as elementwise
+ideally suited for embarrassingly parallel kernels such as element-wise
 computations over n-dimensional arrays (ndarrays). The API for expressing a
-range kernel does not allow advanced features such as synchronization of work
-items and fine-grained control over memory allocation on a device. For such
+range kernel does not allow advanced features such as synchronization of
+work-items and fine-grained control over memory allocation on a device. For such
 advanced features, an nd-range kernel should be used.
 
 .. seealso::
     Refer API documentation for :class:`numba_dpex.kernel_api.Range`,
     :class:`numba_dpex.kernel_api.Item`,  and
-    :func:`numba_dpex.experimental.launcher.call_kernel` for more details.
+    :func:`numba_dpex.core.kernel_launcher.call_kernel` for more details.