Buffer uses signed bytes with v2 compressors #2735

QuLogic · 2025-01-20T09:58:17Z

Zarr version

3.0.1

Numcodecs version

0.15.0

Python Version

3.13.1

Operating System

Fedora Rawhide

Installation

Fedora package

Description

I'm looking at cgohlke/imagecodecs#123 and after fixing some imports and setting zarr_format=2, I can run many more tests, but several are failing with mismatched types, namely that imagecodecs compressors are expecting uint8_t, but are getting signed char.

I have traced this to Buffer requiring dtype='b', along with casts in cpu.Buffer.from_bytes.

If I modify those checks/casts to use the unsigned dtype='B', then I can get imagecodecs tests to pass.

I see this came in with GPU support in #1967. Was this actually intentional, or was it more that no-one noticed that the NumPy b dtype is a signed byte? It would seem odd to me that bytes would be treated as signed as in regular Python they are treated as unsigned (e.g., b'\xff'[0] == 255, not -1).
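To make the signedness difference concrete, here is a minimal sketch that uses only NumPy's `np.frombuffer`. It does not touch zarr's `Buffer` class; it just reinterprets the same single byte under both dtype codes, so the variable names are illustrative only.

```python
import numpy as np

raw = b"\xff"

# NumPy's "b" code is a signed 8-bit integer (int8),
# while "B" is an unsigned 8-bit integer (uint8).
signed = np.frombuffer(raw, dtype="b")
unsigned = np.frombuffer(raw, dtype="B")

print(signed.dtype, int(signed[0]))      # int8 -1
print(unsigned.dtype, int(unsigned[0]))  # uint8 255
print(raw[0])                            # 255, matching the unsigned view
```

The underlying byte is identical in both arrays; only the integer interpretation differs.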

If this is intentional, then it seems like something that should be documented in the migration guide that would break compressors for zarr_format=2.

Steps to reproduce

Install imagecodecs, modify tests to use zarr.storage.MemoryStore instead of zarr.MemoryStore, and set zarr_format=2, then run its tests.

Additional output

No response

QuLogic · 2025-01-20T10:06:47Z

> I see this came in with GPU support in #1967.

Oops, I did not go far back enough in the blame. It actually came in with Buffer's first implementation in #1826. I don't see any specific discussion on signedness other than pointing out Buffer is a specialization of NDBuffer: #1826 (comment)

d-v-b · 2025-01-20T10:20:29Z

This is an educational experience for me, as I had no idea that a "byte" (in the sense of 8 bits) could be signed. But it seems like in numpy dtype parlance, a byte is an 8-bit integer? I can see how those could be signed, but I'm not aware of anything in zarr-python that depends on a particular signing. Assuming nothing breaks if we switch to the signing that works for imagecodecs, then we should consider that change.

@madsbk any insight here?

madsbk · 2025-01-20T10:28:42Z

> Was this actually intentional, or was it more that no-one noticed that the NumPy "b" dtype is a signed byte?

This is an oversight, the buffers shouldn't care about signedness.
I am fine moving to "B" exclusively, but we could also accept both "B" and "b"?

d-v-b · 2025-01-20T10:49:06Z

To allow "B" and "b", we would need some way of passing that value for the dtype used in np.frombuffer. The simplest solution would be to give Buffer a dtype attribute, at which point we are halfway to admitting that the Buffer class is really just NDBuffer with 1 dimension and dtype constrained to "B" or "b".

madsbk · 2025-01-20T15:04:56Z

Also I just noticed that concatenating a "B" and "b" array in numpy returns int16. Thus, if we want to support both, we would need to change the CPU and GPU implementation of Buffer.__add__():

    def __add__(self, other: core.Buffer) -> Self:
        """Concatenate two buffers"""

        other_array = other.as_array_like()
        assert other_array.dtype == np.dtype("b")
        return self.__class__(
            np.concatenate((np.asanyarray(self._data), np.asanyarray(other_array)))
        )
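The promotion described above can be checked in plain NumPy. This is a sketch outside of zarr (the array names are made up for illustration), showing both the mixed-dtype result and one possible way to avoid it by reinterpreting one operand first.

```python
import numpy as np

signed = np.array([-1, 1], dtype="b")     # int8
unsigned = np.array([255, 1], dtype="B")  # uint8

# Mixing the two dtypes promotes to a type that can hold both ranges.
mixed = np.concatenate((signed, unsigned))
print(mixed.dtype)  # int16

# Viewing the signed operand as unsigned first keeps one byte per element.
same = np.concatenate((signed.view("B"), unsigned))
print(same.dtype)     # uint8
print(same.tolist())  # [255, 1, 255, 1]
```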

QuLogic · 2025-01-21T01:30:49Z

> This is an educational experience for me, as I had no idea that a "byte" (in the sense of 8 bits) could be signed.

All fixed-width integers can be signed or unsigned; it just determines how the most-significant bit is interpreted.

> Also I just noticed that concatenating a "B" and "b" array in numpy returns int16.

Signed bytes are -128 to 127, unsigned bytes are 0 to 255. The only way to represent that whole range -128 to 255 is casting upward to 16 bits (which is -32768 to 32767).

I don't know that it makes sense to allow both, but if you want to, then I would suggest allowing both as input, but changing internals to always be unsigned (using a view should be copy-free).
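As a rough sketch of that suggestion, the snippet below accepts either byte dtype and always hands back an unsigned view. The helper name `as_unsigned_bytes` is hypothetical and is not part of zarr-python; it only illustrates that `ndarray.view` reinterprets the data without copying.

```python
import numpy as np

def as_unsigned_bytes(arr: np.ndarray) -> np.ndarray:
    """Hypothetical helper: accept "b" or "B" input, return a uint8 view."""
    if arr.dtype not in (np.dtype("b"), np.dtype("B")):
        raise TypeError(f"expected a byte dtype, got {arr.dtype}")
    # view() reinterprets the existing memory; no data is copied.
    return arr.view("B")

signed = np.array([-2, -1, 0, 1], dtype="b")
normalised = as_unsigned_bytes(signed)

print(normalised.tolist())                       # [254, 255, 0, 1]
print(np.shares_memory(signed, normalised))      # True
print(signed.tobytes() == normalised.tobytes())  # True
```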

d-v-b · 2025-01-21T09:28:59Z

> All fixed-width integers can be signed or unsigned; it just determines how the most-significant bit is interpreted.

But as far as I understood it, the buffer API shouldn't store bytes with integer semantics -- Buffer contains stuff like JSON documents, so the contents of Buffer are supposed to be just piles of bits, with no meaning attached. We happen to be using numpy arrays to back Buffer instances, but the fact that those arrays have a dtype is an implementation detail that should not affect the public API of Buffer.

do I have this right @madsbk ?

madsbk · 2025-01-21T11:13:59Z

Yes, a Buffer is just a contiguous blob of memory.

d-v-b · 2025-01-21T11:18:48Z

> I can run many more tests, but several are failing with mismatched types, namely that imagecodecs compressors are expecting uint8_t, but are getting signed char.

it sounds like imagecodecs compressors are assuming a numpy array. but Buffer is not a numpy array (despite using one under the hood). So while I think the fix proposed here makes sense, there is a larger issue if imagecodecs assumes that Buffer instances are full of integers.

madsbk · 2025-01-21T11:25:53Z

Maybe just reinterpret the data, like this?

    buf.view(dtype="uint8")

d-v-b · 2025-01-21T11:27:03Z

yes, I think the expectation for consumers of Buffer should be that they are responsible for imposing dtype semantics on it

d-v-b · 2025-01-21T16:04:24Z

> I can run many more tests, but several are failing with mismatched types, namely that imagecodecs compressors are expecting uint8_t, but are getting signed char.

> it sounds like imagecodecs compressors are assuming a numpy array. but Buffer is not a numpy array (despite using one under the hood). So while I think the fix proposed here makes sense, there is a larger issue if imagecodecs assumes that Buffer instances are full of integers.

@cgohlke could you provide some context for the " 👎 "? Did I get something wrong here?

cgohlke · 2025-01-21T18:47:10Z

> it sounds like imagecodecs compressors are assuming a numpy array

Imagecodecs makes no such assumption. It requires that decode functions for ->bytes codecs receive an input that can be mapped to a contiguous uint8_t memoryview in Cython, such as bytes, bytearray, contiguous numpy.uint8 or void arrays, or other objects implementing buffer protocol with default type or 'B'. Other inputs are purposely rejected.

> there is a larger issue if imagecodecs assumes that Buffer instances are full of integers.

I think the issue is that Zarr 3 Buffers are surprisingly full of signed integers while bytes, bytearray, and default type buffer protocol objects contain unsigned integers, for example:

>>> b'\xff'[0]
255
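The difference is also visible through the buffer protocol itself. The following sketch (plain Python and NumPy, with made-up labels) prints the format code each kind of object advertises via `memoryview`, which is the information a typed consumer inspects before accepting the input.

```python
import numpy as np

candidates = {
    "bytes": b"\xff",
    "bytearray": bytearray(b"\xff"),
    "uint8 array": np.array([255], dtype="B"),
    "int8 array": np.array([-1], dtype="b"),
}

# memoryview.format is the struct-style code exported by each object.
formats = {name: memoryview(obj).format for name, obj in candidates.items()}

for name, fmt in formats.items():
    print(f"{name}: {fmt!r}")
# bytes: 'B'
# bytearray: 'B'
# uint8 array: 'B'
# int8 array: 'b'
```

Only the int8-backed array reports the signed code `'b'`, which is what a consumer requiring an unsigned byte buffer would reject.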

d-v-b · 2025-01-21T19:38:27Z

that makes sense, thanks for the clarification. I don't anticipate any big problems with switching to the "B" dtype, which I think will resolve this matter, but I am surprised that our current use of "b" resulted in a representation that could not be mapped to a contiguous uint8_t memoryview.

As far as I know, switching to "B" will make absolutely no difference in the literal bytes that are represented in a Buffer instance. Only the interpretation of those bytes as integers will change, but this should be meaningless for the Buffer API, and its consumers.

QuLogic · 2025-01-22T04:16:46Z

> But as far as I understood it, the buffer API shouldn't store bytes with integer semantics

Bytes are integers; I'm not sure what distinction you're making here. If you are specifically referring to the Python bytes type, then that is comprised of unsigned 8-bit integers, as @cgohlke mentioned:

>>> b = b'\x01\x02\xfe\xff'
>>> b[0]
1
>>> b[1]
2
>>> b[2]
254
>>> b[3]
255

and the reverse works:

>>> bytes([1, 2, 254, 255])
b'\x01\x02\xfe\xff'

and it specifically maps to unsigned 8-bit as a memory view:

>>> mv = memoryview(b)
>>> mv.format
'B'

Whether you then interpret those bytes as text or multi-byte integers is a higher-level concern (and I agree Buffer shouldn't specify those), but it's still a collection of integers.

> I am surprised that our current use of "b" resulted in a representation that could not be mapped to a contiguous uint8_t memoryview.

Note this is specifically about compressors for the v2 file format. Perhaps v3 has sufficiently hidden this, but that hasn't been implemented in imagecodecs yet.

You can cast a memoryview from int8 to uint8, but that requires an explicit call to .cast. With imagecodecs, the uint8_t is the type of the function parameter, and I suspect Cython is being very strict here by not allowing the implicit cast.
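For reference, the explicit cast looks roughly like this in pure Python. This is only an illustration of `memoryview.cast` on an int8-backed NumPy array; it says nothing about what Cython or imagecodecs do internally.

```python
import numpy as np

signed = np.array([-1, 1], dtype="b")

mv = memoryview(signed)
print(mv.format, mv[0])  # b -1

# cast() reinterprets the same memory under a new format code.
unsigned_mv = mv.cast("B")
print(unsigned_mv.format, unsigned_mv[0])     # B 255
print(bytes(unsigned_mv) == signed.tobytes())  # True
```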

> As far as I know, switching to "B" will make absolutely no difference in the literal bytes that are represented in a Buffer instance. Only the interpretation of those bytes as integers will change, but this should be meaningless for the Buffer API, and its consumers.

Okay, if we're on the same page here, then I'll try and finish up #2738.

d-v-b · 2025-01-22T09:09:59Z

> Bytes are integers; I'm not sure what distinction you're making here. If you are specifically referring to the Python bytes type, then that is comprised of unsigned 8-bit integers, as @cgohlke mentioned:

Recall that the job of the Buffer class is to model stored objects. Would you describe stored objects (like files on a file system) as composed of integers, or bytes? I would say "bytes".

Bytes are collections of 8 bits, which can be used to represent integers, or characters, or bools, or parts of larger elements. The Buffer class gets used to encode JSON documents, compressed and uncompressed N-dimensional arrays, etc into a format suitable for storage, and for this reason the Buffer class should be totally agnostic to the encoding scheme used by the consumer of those bytes. Because the memoryview API assumes that its values have an encoding, it is more specific than our Buffer api.

> it sounds like imagecodecs compressors are assuming a numpy array

> Imagecodecs makes no such assumption. It requires that decode functions for ->bytes codecs receive an input that can be mapped to a contiguous uint8_t memoryview in Cython, such as bytes, bytearray, contiguous numpy.uint8 or void arrays, or other objects implementing buffer protocol with default type or 'B'. Other inputs are purposely rejected.

so I would amend my statement to say "it sounds like imagecodecs compressors are assuming an array of bytes with dtype semantics". The spirit of the statement is unchanged. The Buffer class models a contiguous range of bytes, not an array of bytes with dtype semantics. It is the job of consumers of the Buffer class to impose the concept of "data type" on the contiguous array of bytes represented by the buffer.

In practical terms, this means imagecodecs should cast the contents of Buffer into whatever dtype is most convenient for imagecodecs. Obviously changing the dtype to "B" makes things easier for imagecodecs, but that doesn't change the fact that imagecodecs is relying on an implementation detail of zarr-python by treating Buffer objects as typed arrays.

QuLogic added the bug Potential issues with the zarr-python library label Jan 20, 2025

QuLogic mentioned this issue Jan 20, 2025

Compatibility with Zarr 3 cgohlke/imagecodecs#123

Closed

QuLogic added a commit to QuLogic/zarr that referenced this issue Jan 21, 2025

Use unsigned bytes to back Buffer

0754e86

This makes compressors consistent with v2, and seems more correct than signed bytes. Fixes zarr-developers#2735

QuLogic linked a pull request Jan 21, 2025 that will close this issue

Use unsigned bytes to back Buffer #2738

Open

6 tasks

QuLogic added a commit to QuLogic/zarr that referenced this issue Jan 21, 2025

Use unsigned bytes to back Buffer

01c6e35

This makes compressors consistent with v2, and seems more correct than signed bytes. Fixes zarr-developers#2735

QuLogic added a commit to QuLogic/zarr that referenced this issue Jan 22, 2025

Use unsigned bytes to back Buffer

610689e

This makes compressors consistent with v2, and buffers consistent with `bytes` types. Fixes zarr-developers#2735
