[ZipStore] Unable to read multiple Zarr ZipStore files #2752
Comments
Thanks for the report and the very nice reproducer @zhiweigan. I reran this on my machine and got the following output:
❯ ipython
Python 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:49:36) [Clang 16.0.6 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.17.2 -- An enhanced Interactive Python. Type '?' for help.
In [1]: from zarr.storage import ZipStore
   ...: import zarr
   ...:
   ...: for i in range(3):
   ...:     with ZipStore(f"test{i}.zip", mode="w") as store:
   ...:         zarr.create_array(store, shape=(2,), dtype="float64")
   ...: print("Written Stores")
   ...:
   ...: from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
   ...:
   ...: print("Opening Stores")
   ...: with ProcessPoolExecutor() as executor:
   ...:     futures = [
   ...:         executor.submit(zarr.open_array, ZipStore(f"test{i}.zip", mode="r"), mode="r")
   ...:         for i in range(3)
   ...:     ]
   ...:     datasets = [future.result() for future in futures]
   ...:     print("Opened Stores") # Prints with ThreadPoolExecutor but not ProcessPoolExecutor
Written Stores
Opening Stores
---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/jhamman/miniforge3/envs/zarr-dev/lib/python3.11/concurrent/futures/process.py", line 261, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jhamman/Library/CloudStorage/Dropbox/src/zarr-python/src/zarr/api/synchronous.py", line 1051, in open_array
sync(
File "/Users/jhamman/Library/CloudStorage/Dropbox/src/zarr-python/src/zarr/core/sync.py", line 142, in sync
raise return_result
File "/Users/jhamman/Library/CloudStorage/Dropbox/src/zarr-python/src/zarr/core/sync.py", line 98, in _runner
return await coro
^^^^^^^^^^
File "/Users/jhamman/Library/CloudStorage/Dropbox/src/zarr-python/src/zarr/api/asynchronous.py", line 1232, in open_array
store_path = await make_store_path(store, path=path, mode=mode, storage_options=storage_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jhamman/Library/CloudStorage/Dropbox/src/zarr-python/src/zarr/storage/_common.py", line 318, in make_store_path
result = await StorePath.open(store, path=path_normalized, mode=mode)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jhamman/Library/CloudStorage/Dropbox/src/zarr-python/src/zarr/storage/_common.py", line 79, in open
if store.read_only and mode != "r":
^^^^^^^^^^^^^^^
File "/Users/jhamman/Library/CloudStorage/Dropbox/src/zarr-python/src/zarr/abc/store.py", line 155, in read_only
return self._read_only
^^^^^^^^^^^^^^^
AttributeError: 'ZipStore' object has no attribute '_read_only'
"""
The above exception was the direct cause of the following exception:
AttributeError Traceback (most recent call last)
Cell In[1], line 17
12 with ProcessPoolExecutor() as executor:
13 futures = [
14 executor.submit(zarr.open_array, ZipStore(f"test{i}.zip", mode="r"), mode="r")
15 for i in range(3)
16 ]
---> 17 datasets = [future.result() for future in futures]
18 print("Opened Stores") # Prints with ThreadPoolExecutor but not ProcessPoolExecutor
Cell In[1], line 17, in <listcomp>(.0)
12 with ProcessPoolExecutor() as executor:
13 futures = [
14 executor.submit(zarr.open_array, ZipStore(f"test{i}.zip", mode="r"), mode="r")
15 for i in range(3)
16 ]
---> 17 datasets = [future.result() for future in futures]
18 print("Opened Stores") # Prints with ThreadPoolExecutor but not ProcessPoolExecutor
File ~/miniforge3/envs/zarr-dev/lib/python3.11/concurrent/futures/_base.py:456, in Future.result(self, timeout)
454 raise CancelledError()
455 elif self._state == FINISHED:
--> 456 return self.__get_result()
457 else:
458 raise TimeoutError()
File ~/miniforge3/envs/zarr-dev/lib/python3.11/concurrent/futures/_base.py:401, in Future.__get_result(self)
399 if self._exception:
400 try:
--> 401 raise self._exception
402 finally:
403 # Break a reference cycle with the exception in self._exception
404 self = None
AttributeError: 'ZipStore' object has no attribute '_read_only'
So looks like a bug in the implementation of
Ahh, interesting that it simply hung when running on my Linux box. Something to note is that when simply reading the files sequentially:
for i in range(3):
    with ZipStore(f"test{i}.zip", mode="r") as store:
        zarr.open_array(store, mode="r")
I then also get an expected output of
I think this is an issue with pickling:
import pickle
from zarr.storage import ZipStore

store = ZipStore("test0.zip", mode="r")
print(type(store))
print(store._read_only)
pickled = pickle.dumps(store)
unpickled = pickle.loads(pickled)
print(type(unpickled))
print(unpickled._read_only)
fails for me with
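For context on why only the ProcessPoolExecutor variant would be affected: a process pool has to pickle every argument it sends to a worker, while a thread pool hands over the same in-memory object. Below is a minimal sketch of that difference. `Probe` and `identity` are hypothetical names made up for this example (not part of zarr), and a plain pickle round trip stands in for the serialization step a process pool performs.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor


class Probe:
    """Hypothetical stand-in for a store; counts how often it is pickled."""

    pickle_calls = 0

    def __init__(self, path):
        self.path = path

    def __getstate__(self):
        # pickle calls this hook when it serializes the instance.
        Probe.pickle_calls += 1
        return self.__dict__.copy()


def identity(obj):
    return obj


probe = Probe("test0.zip")

# Threads share memory, so the worker receives the very same object.
with ThreadPoolExecutor(max_workers=1) as executor:
    from_thread = executor.submit(identity, probe).result()
calls_after_thread = Probe.pickle_calls

# A process pool must serialize each argument; this round trip mimics that step.
from_pickle = pickle.loads(pickle.dumps(probe))
calls_after_pickle = Probe.pickle_calls
```

So any bug in a store's pickle support would stay invisible with threads and only surface once the store crosses a process boundary.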
Aha, bingo, this is because zarr-python/src/zarr/storage/_zip.py Lines 110 to 111 in 40da497
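As a general illustration of how an attribute can go missing after unpickling, here is a minimal sketch. `BaseStore`, `LossyStore`, `FixedStore`, and `roundtrip` are hypothetical names invented for this example; they are not zarr's implementation and make no claim about what the referenced lines contain. The only assumption is that custom pickle hooks restore part of the state and never set `_read_only`.

```python
import pickle


class BaseStore:
    """Hypothetical base class whose property depends on `_read_only`."""

    def __init__(self, read_only=False):
        self._read_only = read_only

    @property
    def read_only(self):
        return self._read_only


class LossyStore(BaseStore):
    """Hypothetical subclass whose pickle hooks forget the base-class state."""

    def __init__(self, path, read_only=False):
        super().__init__(read_only=read_only)
        self.path = path

    def __getstate__(self):
        # Only `path` is saved; `_read_only` is silently dropped.
        return {"path": self.path}

    def __setstate__(self, state):
        # __init__ is not called on unpickling, so nothing sets `_read_only`.
        self.path = state["path"]


class FixedStore(LossyStore):
    """Same class, but the hooks carry `_read_only` across the round trip."""

    def __getstate__(self):
        return {"path": self.path, "read_only": self._read_only}

    def __setstate__(self, state):
        self.path = state["path"]
        self._read_only = state["read_only"]


def roundtrip(obj):
    return pickle.loads(pickle.dumps(obj))


try:
    roundtrip(LossyStore("test0.zip", read_only=True)).read_only
    lossy_error = None
except AttributeError as exc:
    lossy_error = str(exc)

fixed = roundtrip(FixedStore("test0.zip", read_only=True))
```

In this sketch the lossy variant raises an AttributeError naming `_read_only` on first access after the round trip, while the fixed variant keeps both `path` and `read_only`.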
Zarr version
v3.0.1
Numcodecs version
v0.15.0
Python Version
3.12.8
Operating System
Linux
Installation
using pip into fresh virtual environment
Description
I was trying to migrate an internal tool from Zarr 2 to Zarr 3, but ran into an issue with reading from different ZipStore files in a multiprocessing context. When reading several files using a ProcessPoolExecutor, the reads stall (possibly a deadlock). However, the same code does not stall with a ThreadPoolExecutor.
Adapting to Zarr 2 syntax, the same code succeeds with no issue.
Steps to reproduce
Additional output
No response