Buffer uses signed bytes with v2 compressors #2735
Comments
Oops, I did not go far back enough in the blame. It actually came in with |
This is an educational experience for me, as I had no idea that a "byte" (in the sense of 8 bits) could be signed. But it seems like in numpy dtype parlance, a byte is an 8-bit integer? I can see how those could be signed, but I'm not aware of anything in zarr-python that depends on a particular signing. Assuming nothing breaks if we switch to the signing that works for imagecodecs, then we should consider that change. @madsbk any insight here? |
This is an oversight, the buffers shouldn't care about signedness. |
To allow |
Also I just noticed that concatenating a "B" and "b" array in numpy returns
def __add__(self, other: core.Buffer) -> Self:
    """Concatenate two buffers"""
    other_array = other.as_array_like()
    assert other_array.dtype == np.dtype("b")
    return self.__class__(
        np.concatenate((np.asanyarray(self._data), np.asanyarray(other_array)))
    ) |
All fixed-width integers can be signed or unsigned; it just determines how the most-significant bit is interpreted.
Signed bytes are -128 to 127, unsigned bytes are 0 to 255. The only way to represent that whole range -128 to 255 is casting upward to 16 bits (which is -32768 to 32767). I don't know that it makes sense to allow both, but if you want to, then I would suggest allowing both as input, but changing internals to always be unsigned (using a view should be copy-free). |
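The ranges and the copy-free view mentioned above can be checked with a short NumPy sketch. It uses only standard NumPy calls; the array contents are made up for illustration.

```python
import numpy as np

# Ranges of the two 8-bit integer dtypes.
signed = np.iinfo(np.int8)     # NumPy dtype code "b"
unsigned = np.iinfo(np.uint8)  # NumPy dtype code "B"
print(signed.min, signed.max)      # -128 127
print(unsigned.min, unsigned.max)  # 0 255

# Mixing the two promotes to the smallest integer type that holds both ranges.
promoted = np.result_type(np.int8, np.uint8)
print(promoted)  # int16

# Reinterpreting signed bytes as unsigned is a view: same memory, new meaning.
raw = np.array([-1, -128, 127], dtype="b")
as_unsigned = raw.view("B")
print(as_unsigned.tolist())                # [255, 128, 127]
print(np.shares_memory(raw, as_unsigned))  # True
```

Because the view shares memory with the original array, normalizing to unsigned costs no extra allocation.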
This makes compressors consistent with v2, and seems more correct than signed bytes. Fixes zarr-developers#2735
But as far as I understood it, the buffer API shouldn't store bytes with integer semantics -- do I have this right @madsbk ? |
Yes, a Buffer is just a contiguous blob of memory. |
it sounds like |
Maybe just reinterpret the data, like buf.view(dtype="uint8")? |
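A minimal sketch of that reinterpretation, assuming the data is held as a NumPy array. The helper name `as_unsigned_bytes` is hypothetical and not part of zarr-python; it accepts either signedness as input and always hands back unsigned bytes.

```python
import numpy as np

def as_unsigned_bytes(data):
    """Hypothetical helper: return a uint8 view of an int8 or uint8 array."""
    arr = np.asanyarray(data)
    if arr.dtype not in (np.dtype("b"), np.dtype("B")):
        raise TypeError(f"expected an 8-bit integer dtype, got {arr.dtype}")
    # .view() reinterprets the existing memory; no bytes are copied.
    return arr.view(np.uint8)

signed_buf = np.frombuffer(b"\x00\x7f\x80\xff", dtype="b")
out = as_unsigned_bytes(signed_buf)
print(out.dtype, out.tolist())  # uint8 [0, 127, 128, 255]
```

The bit patterns are untouched; only the dtype label that downstream consumers see changes.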
yes, I think the expectation for consumers of |
@cgohlke could you provide some context for the " 👎 "? Did I get something wrong here? |
Imagecodecs makes no such assumption. It requires that
I think the issue is that Zarr 3 Buffers are surprisingly full of signed integers while
>>> b'\xff'[0]
255 |
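To make the contrast concrete, here is a small standard-library sketch showing the same byte read as unsigned (the `bytes` default) and as signed (what a `'b'`-typed buffer exposes). It does not touch zarr or imagecodecs.

```python
payload = b"\xff"

# Indexing a bytes object always yields an unsigned value.
unsigned_value = payload[0]
print(unsigned_value)  # 255

# Casting the same memory to the signed-char format changes the reading.
signed_value = memoryview(payload).cast("b")[0]
print(signed_value)  # -1
```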
that makes sense, thanks for the clarification. I don't anticipate any big problems with switching to the As far as I know, switching to |
Bytes are integers; I'm not sure what distinction you're making here. If you are specifically referring to the Python
and the reverse works:
and it specifically maps to unsigned 8-bit as a memory view:
Whether you then interpret those bytes as text or multi-byte integers is a higher-level concern (and I agree
Note this is specifically about compressors for the v2 file format. Perhaps v3 has sufficiently hidden this, but that hasn't been implemented in You can cast a
Okay, if we're on the same page here, then I'll try and finish up #2738. |
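The `bytes` behaviour discussed above can be illustrated with plain Python. This is a sketch written for this thread, not the snippets from the original comment.

```python
data = bytes([0, 127, 128, 255])

# Iterating over bytes yields integers in the range 0..255.
values = list(data)
print(values)  # [0, 127, 128, 255]

# The reverse works: a sequence of such integers builds the same bytes.
round_trip = bytes(values)
print(round_trip == data)  # True

# As a memory view, bytes are exposed with the unsigned-char format code.
fmt = memoryview(data).format
print(fmt)  # B

# Negative values are not valid bytes.
try:
    bytes([-1])
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```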
This makes compressors consistent with v2, and buffers consistent with `bytes` types. Fixes zarr-developers#2735
Recall that the job of the Bytes are collections of 8 bits, which can be used to represent integers, or characters, or bools, or parts of larger elements. The
so I would amend my statement to say "it sounds like imagecodecs compressors are assuming an array of bytes with dtype semantics". The spirit of the statement is unchanged. The In practical terms, this means |
Zarr version
3.0.1
Numcodecs version
0.15.0
Python Version
3.13.1
Operating System
Fedora Rawhide
Installation
Fedora package
Description
I'm looking at cgohlke/imagecodecs#123 and after fixing some imports and setting `zarr_format=2`, I can run many more tests, but several are failing with mismatched types, namely that `imagecodecs` compressors are expecting `uint8_t`, but are getting `signed char`.
I have traced this to `Buffer` requiring `dtype='b'`, along with casts in `cpu.Buffer.from_bytes`.
If I modify those checks/casts to use the unsigned `dtype='B'`, then I can get `imagecodecs` tests to pass.
I see this came in with GPU support in #1967. Was this actually intentional, or was it more that no-one noticed that the NumPy `b` dtype is a signed byte? It would seem odd to me that `bytes` would be treated as signed as in regular Python they are treated as unsigned (e.g., `b'\xff'[0] == 255`, not -1).
If this is intentional, then it seems like something that should be documented in the migration guide that would break compressors for `zarr_format=2`.
Steps to reproduce
Install `imagecodecs`, modify tests to use `zarr.storage.MemoryStorage` instead of `zarr.MemoryStorage`, and set `zarr_format=2`, then run its tests.
Additional output
No response
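The kind of mismatch described in the report can be mimicked without imagecodecs. In the sketch below, `require_uint8` is a hypothetical stand-in for a compressor that only accepts unsigned bytes; it is an assumption for illustration and not the imagecodecs or zarr API.

```python
import numpy as np

def require_uint8(data):
    """Hypothetical consumer: accept only buffers exposed as unsigned bytes."""
    view = memoryview(data)
    if view.format != "B":
        raise TypeError(
            f"buffer dtype mismatch: expected 'B' (uint8), got {view.format!r}"
        )
    return len(view)

payload = b"\x00\x7f\x80\xff"
signed_chunk = np.frombuffer(payload, dtype="b")  # what a 'b'-typed buffer holds
unsigned_chunk = signed_chunk.view("B")           # same memory, unsigned dtype

try:
    require_uint8(signed_chunk)
    signed_ok = True
except TypeError as exc:
    signed_ok = False
    print(exc)

unsigned_len = require_uint8(unsigned_chunk)
print(signed_ok, unsigned_len)  # False 4
```

Switching the dtype label from `'b'` to `'B'` is enough for the strict consumer to accept the data, which matches the observation in the report.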