Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Buffer uses signed bytes with v2 compressors #2735

Open
QuLogic opened this issue Jan 20, 2025 · 16 comments · May be fixed by #2738
Open

Buffer uses signed bytes with v2 compressors #2735

QuLogic opened this issue Jan 20, 2025 · 16 comments · May be fixed by #2738
Labels
bug Potential issues with the zarr-python library

Comments

@QuLogic
Copy link
Contributor

QuLogic commented Jan 20, 2025

Zarr version

3.0.1

Numcodecs version

0.15.0

Python Version

3.13.1

Operating System

Fedora Rawhide

Installation

Fedora package

Description

I'm looking at cgohlke/imagecodecs#123 and after fixing some imports and setting zarr_format=2, I can run many more tests, but several are failing with mismatched types, namely that imagecodecs compressors are expecting uint8_t, but are getting signed char.

I have traced this to Buffer requiring dtype='b', along with casts in cpu.Buffer.from_bytes.

If I modify those checks/casts to use the unsigned dtype='B', then I can get imagecodecs tests to pass.

I see this came in with GPU support in #1967. Was this actually intentional, or was it more that no-one noticed that the NumPy b dtype is a signed byte? It would seem odd to me that bytes would be treated as signed as in regular Python they are treated as unsigned (e.g., b'\xff'[0] == 255, not -1).

If this is intentional, then it seems like something that should be documented in the migration guide that would break compressors for zarr_format=2.

Steps to reproduce

Install imagecodecs, modify tests to use zarr.storage.MemoryStorage instead of zarr.MemoryStorage, and set zarr_format=2, then run its tests.

Additional output

No response

@QuLogic QuLogic added the bug Potential issues with the zarr-python library label Jan 20, 2025
@QuLogic
Copy link
Contributor Author

QuLogic commented Jan 20, 2025

I see this came in with GPU support in #1967.

Oops, I did not go far back enough in the blame. It actual came in with Buffer's first implementation in #1826. I don't see any specific discussion on signedness other than pointing out Buffer is a specialization of NDBuffer: #1826 (comment)

@d-v-b
Copy link
Contributor

d-v-b commented Jan 20, 2025

This is an educational experience for me, as I had no idea that a "byte" (in the sense of 8 bits) could be signed. But it seems like in numpy dtype parlance, a byte is an 8-bit integer? I can see how those could be signed, but I'm not aware of anything in zarr-python that depends on a particular signing. Assuming nothing breaks if we switch to the signing that works for imagecodecs, then we should consider that change.

@madsbk any insight here?

@madsbk
Copy link
Contributor

madsbk commented Jan 20, 2025

Was this actually intentional, or was it more that no-one noticed that the NumPy "b" dtype is a signed byte?

This is an oversight, the buffers shouldn't care about signedness.
I am fine moving to "B" exclusively, but we could also accept both "B" and "b"?

@d-v-b
Copy link
Contributor

d-v-b commented Jan 20, 2025

To allow "B" and "b", we would need some way of passing that value for the dtype used in np.frombuffer. The simplest solution would be to give Buffer a dtype attribute, at which point we are halfway to admitting that the Buffer class is really just NDBuffer with 1 dimension and dtype constrained to "B" or "b".

@madsbk
Copy link
Contributor

madsbk commented Jan 20, 2025

Also I just noticed that concatenating a "B" and "b" array in numpy returns int16. Thus, if we want to support both, we would need to change the CPU and GPU implementation of Buffer.__add__():

    def __add__(self, other: core.Buffer) -> Self:
        """Concatenate two buffers"""

        other_array = other.as_array_like()
        assert other_array.dtype == np.dtype("b")
        return self.__class__(
            np.concatenate((np.asanyarray(self._data), np.asanyarray(other_array)))
        )

@QuLogic
Copy link
Contributor Author

QuLogic commented Jan 21, 2025

This is an educational experience for me, as I had no idea that a "byte" (in the sense of 8 bits) could be signed.

All fixed-width integers can be signed or unsigned; it just determines how the most-significant bit is interpreted.

Also I just noticed that concatenating a "B" and "b" array in numpy returns int16.

Signed bytes are -128 to 127, unsigned bytes are 0 to 255. The only way to represent that whole range -128 to 255 is casting upward to 16 bits (which is -32768 to 32767).

I don't know that it makes sense to allow both, but if you want to, then I would suggest allowing both as input, but changing internals to always be unsigned (using a view should be copy-free).

QuLogic added a commit to QuLogic/zarr that referenced this issue Jan 21, 2025
This makes compressors consistent with v2, and seems more correct than
signed bytes.

Fixes zarr-developers#2735
@QuLogic QuLogic linked a pull request Jan 21, 2025 that will close this issue
6 tasks
QuLogic added a commit to QuLogic/zarr that referenced this issue Jan 21, 2025
This makes compressors consistent with v2, and seems more correct than
signed bytes.

Fixes zarr-developers#2735
@d-v-b
Copy link
Contributor

d-v-b commented Jan 21, 2025

All fixed-width integers can be signed or unsigned; it just determines how the most-significant bit is interpreted.

But as far as I understood it, the buffer API shouldn't store bytes with integer semantics -- Buffer contains stuff like JSON documents, so the contents of Buffer are supposed to just piles of bits, with no meaning attached. We happen to be using numpy arrays to back Buffer instances, but the fact that those arrays have a dtype is an implementation detail that should not affect the public API of Buffer.

do I have this right @madsbk ?

@madsbk
Copy link
Contributor

madsbk commented Jan 21, 2025

Yes, a Buffer just a contiguous blob of memory.

@d-v-b
Copy link
Contributor

d-v-b commented Jan 21, 2025

I can run many more tests, but several are failing with mismatched types, namely that imagecodecs compressors are expecting uint8_t, but are getting signed char.

it sounds like imagecodecs compressors are assuming a numpy array. but Buffer is not a numpy array (despite using one under the hood). So while I think the fix proposed here makes sense, there is a larger issue if imagecodecs assumes that Buffer instances are full of integers.

@madsbk
Copy link
Contributor

madsbk commented Jan 21, 2025

Maybe just reinterpret the data like?

buf.view(dtype="uint8")

@d-v-b
Copy link
Contributor

d-v-b commented Jan 21, 2025

yes, I think the expectation for consumers of Buffer should be that they are responsible for imposing dtype semantics on it

@d-v-b
Copy link
Contributor

d-v-b commented Jan 21, 2025

I can run many more tests, but several are failing with mismatched types, namely that imagecodecs compressors are expecting uint8_t, but are getting signed char.

it sounds like imagecodecs compressors are assuming a numpy array. but Buffer is not a numpy array (despite using one under the hood). So while I think the fix proposed here makes sense, there is a larger issue if imagecodecs assumes that Buffer instances are full of integers.

@cgohlke could you provide some context for the " 👎 "? Did I get something wrong here?

@cgohlke
Copy link
Contributor

cgohlke commented Jan 21, 2025

it sounds like imagecodecs compressors are assuming a numpy array

Imagecodecs makes no such assumption. It requires that decode functions for ->bytes codecs receive an input that can be mapped to a contiguous uint8_t memoryview in Cython, such as bytes, bytearray, contiguous numpy.uint8 or void arrays, or other objects implementing buffer protocol with default type or 'B'. Other inputs are purposely rejected.

there is a larger issue if imagecodecs assumes that Buffer instances are full of integers.

I think the issue is that Zarr 3 Buffers are surprisingly full of signed integers while bytes, bytearray, and default type buffer protocol objects contain unsigned integers, for example:

>>> b'\xff'[0]
255

@d-v-b
Copy link
Contributor

d-v-b commented Jan 21, 2025

that makes sense, thanks for the clarification. I don't anticipate any big problems with switching to the "B" dtype, which I think will resolve this matter, but I am surprised that our current use of "b" resulted in a representation that could not be mapped to a contiguous uint8_t memoryview.

As far as I know, switching to "B" will make absolutely no difference in the literal bytes that are represented in a Buffer instance. Only the interpretation of those bytes as integers will change, but this should be meaningless for the Buffer API, and its consumers.

@QuLogic
Copy link
Contributor Author

QuLogic commented Jan 22, 2025

But as far as I understood it, the buffer API shouldn't store bytes with integer semantics

Bytes are integers; I'm not sure what distinction you're making here. If you are specifically referring to the Python bytes type, then that is comprised of unsigned 8-bit integers, as @cgohlke mentioned:

>>> b = b'\x01\x02\xfe\xff'
>>> b[0]
1
>>> b[1]
2
>>> b[2]
254
>>> b[3]
255

and the reverse works:

>>> bytes([1, 2, 254, 255])
b'\x01\x02\xfe\xff'

and it specifically maps to unsigned 8-bit as a memory view:

>>> mv = memoryview(b)
>>> mv.format
'B'

Whether you then interpret those bytes as text or multi-byte integers is a higher-level concern (and I agree Buffer shouldn't specify those), but it's still a collection of integers.

I am surprised that our current use of "b" resulted in a representation that could not be mapped to a contiguous uint8_t memoryview.

Note this is specifically about compressors for the v2 file format. Perhaps v3 has sufficiently hidden this, but that hasn't been implemented in imagecodecs yet.

You can cast a memoryview from int8 to uint8, but that requires an explicit call to .cast. With imagecodecs, the uint8_t is the type of the function parameter, and I suspect Cython is being very strict here by not allowing the implicit cast.

As far as I know, switching to "B" will make absolutely no difference in the literal bytes that are represented in a Buffer instance. Only the interpretation of those bytes as integers will change, but this should be meaningless for the Buffer API, and its consumers.

Okay, if we're on the same page here, then I'll try and finish up #2738.

QuLogic added a commit to QuLogic/zarr that referenced this issue Jan 22, 2025
This makes compressors consistent with v2, and buffers consistents with
`bytes` types.

Fixes zarr-developers#2735
@d-v-b
Copy link
Contributor

d-v-b commented Jan 22, 2025

Bytes are integers; I'm not sure what distinction you're making here. If you are specifically referring to the Python bytes type, then that is comprised of unsigned 8-bit integers, as @cgohlke mentioned:

Recall that the job of the Buffer class is to model stored objects. Would you describe stored objects (like files on a file system) as composed of integers, or bytes? I would say "bytes".

Bytes are collections of 8 bits, which can be used to represent integers, or characters, or bools, or parts of larger elements. The Buffer class gets used to encode JSON documents, compressed and uncompressed N-dimensional arrays, etc into a format suitable for storage, and for this reason the Buffer class should be totally agnostic to the encoding scheme used by the consumer of those bytes. Because the memoryview API assumes that its values have an encoding, it is more specific than our Buffer api.

it sounds like imagecodecs compressors are assuming a numpy array

Imagecodecs makes no such assumption. It requires that decode functions for ->bytes codecs receive an input that can be mapped to a contiguous uint8_t memoryview in Cython, such as bytes, bytearray, contiguous numpy.uint8 or void arrays, or other objects implementing buffer protocol with default type or 'B'. Other inputs are purposely rejected.

so I would amend my statement to say "it sounds like imagecodecs compressors are assuming an array of bytes with dtype semantics". The spirit of the statement is unchanged. The Buffer class models a contiguous range of bytes, not an array of bytes with dtype semantics. It is the job of consumers of the Buffer class to impose the concept of "data type" on the contiguous array of bytes represented by the buffer.

In practical terms, this means imagecodecs should cast the contents of Buffer into whatever dtype is most convenient for imagecodecs. Obviously changing the dtype to "B" makes things easier for imagecodecs, but that doesn't change the fact that imagecodecs is relying on an implementation detail of zarr-python by treating Buffer objects as typed arrays.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Potential issues with the zarr-python library
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants