[ZipStore] Unable to read multiple Zarr ZipStore files #2752

Open
zhiweigan opened this issue Jan 22, 2025 · 4 comments · May be fixed by #2762
Labels
bug Potential issues with the zarr-python library help wanted Issue could use help from someone with familiarity on the topic

Comments

zhiweigan commented Jan 22, 2025

Zarr version

v3.0.1

Numcodecs version

v0.15.0

Python Version

3.12.8

Operating System

Linux

Installation

using pip into fresh virtual environment

Description

I was trying to migrate an internal tool from Zarr 2 to Zarr 3, but ran into an issue when reading from different ZipStore files in a multiprocessing context. When several files are read using a ProcessPoolExecutor, the reads stall (possibly a deadlock). The same code does not stall when using a ThreadPoolExecutor.

Adapted to Zarr 2 syntax, the same code succeeds with no issue.

Steps to reproduce

  1. Activate venv
  2. Install zarr
  3. Run the following script:
from zarr.storage import ZipStore
import zarr

for i in range(3):
    with ZipStore(f"test{i}.zip", mode="w") as store:
        zarr.create_array(store, shape=(2,), dtype="float64")
print("Written Stores")

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

print("Opening Stores")
with ProcessPoolExecutor() as executor:
    futures = [
        executor.submit(zarr.open_array, ZipStore(f"test{i}.zip", mode="r"), mode="r")
        for i in range(3)
    ]
    datasets = [future.result() for future in futures]
print("Opened Stores") # Prints with ThreadPoolExecutor but not ProcessPoolExecutor

Additional output

No response

@zhiweigan zhiweigan added the bug Potential issues with the zarr-python library label Jan 22, 2025
@zhiweigan zhiweigan changed the title [xarray] [ZipStore] Unable to read multiple Zarr ZipStore files [ZipStore] Unable to read multiple Zarr ZipStore files Jan 22, 2025

jhamman commented Jan 23, 2025

Thanks for the report and the very nice reproducer @zhiweigan. I reran this on my machine and got the following output:

ipython
Python 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:49:36) [Clang 16.0.6 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.17.2 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from zarr.storage import ZipStore
   ...: import zarr
   ...:
   ...: for i in range(3):
   ...:     with ZipStore(f"test{i}.zip", mode="w") as store:
   ...:         zarr.create_array(store, shape=(2,), dtype="float64")
   ...: print("Written Stores")
   ...:
   ...: from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
   ...:
   ...: print("Opening Stores")
   ...: with ProcessPoolExecutor() as executor:
   ...:     futures = [
   ...:         executor.submit(zarr.open_array, ZipStore(f"test{i}.zip", mode="r"), mode="r")
   ...:         for i in range(3)
   ...:     ]
   ...:     datasets = [future.result() for future in futures]
   ...: print("Opened Stores") # Prints with ThreadPoolExecutor but not ProcessPoolExecutor
Written Stores
Opening Stores
---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/jhamman/miniforge3/envs/zarr-dev/lib/python3.11/concurrent/futures/process.py", line 261, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jhamman/Library/CloudStorage/Dropbox/src/zarr-python/src/zarr/api/synchronous.py", line 1051, in open_array
    sync(
  File "/Users/jhamman/Library/CloudStorage/Dropbox/src/zarr-python/src/zarr/core/sync.py", line 142, in sync
    raise return_result
  File "/Users/jhamman/Library/CloudStorage/Dropbox/src/zarr-python/src/zarr/core/sync.py", line 98, in _runner
    return await coro
           ^^^^^^^^^^
  File "/Users/jhamman/Library/CloudStorage/Dropbox/src/zarr-python/src/zarr/api/asynchronous.py", line 1232, in open_array
    store_path = await make_store_path(store, path=path, mode=mode, storage_options=storage_options)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jhamman/Library/CloudStorage/Dropbox/src/zarr-python/src/zarr/storage/_common.py", line 318, in make_store_path
    result = await StorePath.open(store, path=path_normalized, mode=mode)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jhamman/Library/CloudStorage/Dropbox/src/zarr-python/src/zarr/storage/_common.py", line 79, in open
    if store.read_only and mode != "r":
       ^^^^^^^^^^^^^^^
  File "/Users/jhamman/Library/CloudStorage/Dropbox/src/zarr-python/src/zarr/abc/store.py", line 155, in read_only
    return self._read_only
           ^^^^^^^^^^^^^^^
AttributeError: 'ZipStore' object has no attribute '_read_only'
"""

The above exception was the direct cause of the following exception:

AttributeError                            Traceback (most recent call last)
Cell In[1], line 17
     12 with ProcessPoolExecutor() as executor:
     13     futures = [
     14         executor.submit(zarr.open_array, ZipStore(f"test{i}.zip", mode="r"), mode="r")
     15         for i in range(3)
     16     ]
---> 17     datasets = [future.result() for future in futures]
     18 print("Opened Stores") # Prints with ThreadPoolExecutor but not ProcessPoolExecutor

Cell In[1], line 17, in <listcomp>(.0)
     12 with ProcessPoolExecutor() as executor:
     13     futures = [
     14         executor.submit(zarr.open_array, ZipStore(f"test{i}.zip", mode="r"), mode="r")
     15         for i in range(3)
     16     ]
---> 17     datasets = [future.result() for future in futures]
     18 print("Opened Stores") # Prints with ThreadPoolExecutor but not ProcessPoolExecutor

File ~/miniforge3/envs/zarr-dev/lib/python3.11/concurrent/futures/_base.py:456, in Future.result(self, timeout)
    454     raise CancelledError()
    455 elif self._state == FINISHED:
--> 456     return self.__get_result()
    457 else:
    458     raise TimeoutError()

File ~/miniforge3/envs/zarr-dev/lib/python3.11/concurrent/futures/_base.py:401, in Future.__get_result(self)
    399 if self._exception:
    400     try:
--> 401         raise self._exception
    402     finally:
    403         # Break a reference cycle with the exception in self._exception
    404         self = None

AttributeError: 'ZipStore' object has no attribute '_read_only'

So this looks like a bug in the implementation of ZipStore.read_only.

@jhamman jhamman added the help wanted Issue could use help from someone with familiarity on the topic label Jan 23, 2025

zhiweigan commented Jan 23, 2025

Ahh, interesting that it simply hung when running on my Linux box. One thing to note: if I simply read the files sequentially:

for i in range(3):
    with ZipStore(f"test{i}.zip", mode="r") as store:
        zarr.open_array(store, mode="r")

then I get the expected output:

(.venv_zarr3) [zgan@fpip3-login0002 nimbus]$ python3 test_multi.py 
Written Stores
Opening Stores
Opened Stores
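
One difference that may explain why the sequential and thread-based versions behave differently from the process-based one: a ThreadPoolExecutor hands the very same object to the worker, while a ProcessPoolExecutor sends a pickled copy. The sketch below shows that difference using only the standard library. `Handle` and `passthrough` are hypothetical stand-ins and not part of zarr.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor


class Handle:
    """Hypothetical object standing in for a store (not a zarr class)."""

    def __init__(self, path: str) -> None:
        self.path = path


def passthrough(obj):
    # Return the argument unchanged so the caller can inspect what arrived.
    return obj


handle = Handle("test0.zip")

# A thread receives a reference to the same object; nothing is serialized.
with ThreadPoolExecutor(max_workers=1) as executor:
    via_thread = executor.submit(passthrough, handle).result()

# A worker process would instead receive a copy rebuilt from pickled state.
via_pickle = pickle.loads(pickle.dumps(handle))

print(via_thread is handle)  # True
print(via_pickle is handle)  # False
```

Any attribute that does not survive pickling is therefore only a problem on the process-based path.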

@dstansby

I think this is an issue with pickling ZipStore (which is how ProcessPoolExecutor hands objects between processes):

import pickle

from zarr.storage import ZipStore

store = ZipStore("test0.zip", mode="r")
print(type(store))
print(store._read_only)

pickled = pickle.dumps(store)
unpickled = pickle.loads(pickled)

print(type(unpickled))
print(unpickled._read_only)

fails for me with

<class 'zarr.storage._zip.ZipStore'>
True
<class 'zarr.storage._zip.ZipStore'>
Traceback (most recent call last):
  File "/Users/dstansby/software/zarr/zarr-python/test_pickle.py", line 13, in <module>
    print(unpickled._read_only)
          ^^^^^^^^^^^^^^^^^^^^
AttributeError: 'ZipStore' object has no attribute '_read_only'. Did you mean: 'read_only'?

@dstansby

Aha, bingo, this is because ZipStore has a custom __getstate__, which doesn't contain all of its attributes:

def __getstate__(self) -> tuple[Path, ZipStoreAccessModeLiteral, int, bool]:
    return self.path, self._zmode, self.compression, self.allowZip64
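
To illustrate the failure mode and one possible shape of a fix, here is a minimal standard-library sketch. `BrokenStore`, `FixedStore` and `round_trip` are hypothetical and only model the pattern; this is not zarr's code, and the linked pull request may solve it differently.

```python
import pickle


class BrokenStore:
    """Hypothetical store whose pickled state omits an attribute."""

    def __init__(self, path: str, read_only: bool = True) -> None:
        self.path = path
        self._read_only = read_only

    @property
    def read_only(self) -> bool:
        return self._read_only

    def __getstate__(self) -> tuple[str]:
        # _read_only is left out, mirroring the omission described above.
        return (self.path,)

    def __setstate__(self, state: tuple[str]) -> None:
        # Unpickling does not call __init__, so only what is set here exists.
        (self.path,) = state


class FixedStore(BrokenStore):
    """Same class, but the state tuple also carries _read_only."""

    def __getstate__(self) -> tuple[str, bool]:
        return self.path, self._read_only

    def __setstate__(self, state: tuple[str, bool]) -> None:
        self.path, self._read_only = state


def round_trip(obj):
    # Roughly what happens to an argument sent to a worker process.
    return pickle.loads(pickle.dumps(obj))


try:
    round_trip(BrokenStore("test0.zip")).read_only
    broken_ok = True
except AttributeError:
    broken_ok = False

fixed_value = round_trip(FixedStore("test0.zip")).read_only
print(broken_ok, fixed_value)  # False True
```

The key point is that `__getstate__` and `__setstate__` have to agree on every attribute the object needs after unpickling, because `__init__` is not run again.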

@dstansby dstansby linked a pull request Jan 24, 2025 that will close this issue