
[Feature request] Indexing datasets by a custom-defined id field to enable random access to dataset items via the id #6532

Open
Yu-Shi opened this issue Dec 25, 2023 · 9 comments
Labels
enhancement New feature or request

Comments

@Yu-Shi

Yu-Shi commented Dec 25, 2023

Feature request

Some datasets may contain an id-like field, for example the id field in wikimedia/wikipedia and the _id field in BeIR/dbpedia-entity. HF Datasets supports efficient random access by row, but not by these kinds of id fields. I wonder if it would be possible to add support for indexing by a custom "id-like" field to enable random access via such ids. The ids may be numbers or strings.

Motivation

In some cases, especially during inference/evaluation, I may want to look up the item that has a specific id, as defined by the dataset itself.

For example, in a typical re-ranking setting in information retrieval, the user may want to re-rank the set of candidate documents of each query. The input is usually presented in a TREC-style run file, with the following format:

<qid> Q0 <docno> <rank> <score> <tag>

The re-ranking program should be able to fetch the queries and documents according to <qid> and <docno>, which are the original ids defined in the query/document datasets. To accomplish this, I currently have to iterate over the whole HF dataset to build the mapping from real ids to row ids every time I start the program, which is time-consuming. Thus I would like HF Datasets to provide an option to index by a custom id column, not only by row.
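For illustration, here is a rough sketch of what I currently do (the run file path is a placeholder):

from datasets import load_dataset

# Loading is fast, and random access by row id is efficient.
docs = load_dataset("BeIR/dbpedia-entity", "corpus", split="corpus")

# This full pass over the dataset is the slow part I would like to avoid
# on every program start: it maps the dataset's own _id to the row id.
docno_to_row = {docno: row for row, docno in enumerate(docs["_id"])}

# Re-rank candidates from a TREC-style run file:
# <qid> Q0 <docno> <rank> <score> <tag>
with open("run.trec") as f:  # placeholder path
    for line in f:
        qid, _, docno, rank, score, tag = line.split()
        doc = docs[docno_to_row[docno]]  # random access via the dataset-defined id
        ...  # score the (query, document) pair here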

Your contribution

I'm not an expert in this project, so I'm afraid I'm not able to contribute to the code.

@Yu-Shi Yu-Shi added the enhancement New feature or request label Dec 25, 2023
@Yu-Shi Yu-Shi changed the title Indexing datasets by a custom-defined id field to enable random access to dataset items via the id [Feature request] Indexing datasets by a custom-defined id field to enable random access to dataset items via the id Dec 25, 2023
@lhoestq
Member

lhoestq commented Jan 2, 2024

You can simply use a Python dict as an index:

>>> from datasets import load_dataset
>>> ds = load_dataset("BeIR/dbpedia-entity", "corpus", split="corpus")
>>> index = {key: idx for idx, key in enumerate(ds["_id"])}
>>> ds[index["<dbpedia:Pikachu>"]]
{'_id': '<dbpedia:Pikachu>',
 'title': 'Pikachu',
 'text': 'Pikachu (Japanese: ピカチュウ) are a fictional species of Pokémon.  Pokémon are fictional creatures that appear in an assortment of comic books, animated movies and television shows, video games, and trading card games licensed by The Pokémon Company, a Japanese corporation. The Pikachu design was conceived by Ken Sugimori.'}

@Yu-Shi
Author

Yu-Shi commented Jan 2, 2024

Thanks for your reply. Yes, I can do that, but it is time-consuming to rebuild the index every time I launch the program (some datasets are extremely big). HF Datasets has a nice feature that supports instant data loading and efficient random access via row ids. I'm curious whether this beneficial feature could be further extended to custom id columns.

@davidmrau

+1 on the issue, I think it would be extremely useful.

@GeeYangML

+1. This could be very useful.

@ruth-ann

ruth-ann commented Jun 5, 2024

+1 - currently having to manually implement this

@davidmrau

davidmrau commented Aug 3, 2024

If anyone has an idea of how to do this the right way (perhaps @albertvillanova?), I would be happy to implement it.

@SwayStar123

This would be very helpful to implement aspect ratio bucketing for image and video datasets

@gar1t

gar1t commented Jan 8, 2025

What you're asking for is an index that's provided with the dataset data and happens to be optimized for your retrieval use case. Those who don't have this problem would then need to download more bytes. It's bad for the planet.

If you want to avoid the indexing work, you can serialize the index dict to a file to be loaded on subsequent runs. You might also use sqlite to lazily create and use a db as your index.
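For example, something along these lines could work (file names and the id column are just placeholders):

import json
import os
import sqlite3

from datasets import load_dataset

ds = load_dataset("BeIR/dbpedia-entity", "corpus", split="corpus")

# Option 1: build the dict once, serialize it, and reload it on later runs.
if os.path.exists("id_index.json"):
    with open("id_index.json") as f:
        index = json.load(f)
else:
    index = {key: row for row, key in enumerate(ds["_id"])}
    with open("id_index.json", "w") as f:
        json.dump(index, f)

print(ds[index["<dbpedia:Pikachu>"]]["title"])

# Option 2: keep the mapping in a SQLite database, built once and reused,
# so it never has to be held fully in memory.
con = sqlite3.connect("id_index.db")
con.execute("CREATE TABLE IF NOT EXISTS idx (id TEXT PRIMARY KEY, row_idx INTEGER)")
if con.execute("SELECT COUNT(*) FROM idx").fetchone()[0] == 0:
    con.executemany(
        "INSERT INTO idx VALUES (?, ?)",
        ((key, row) for row, key in enumerate(ds["_id"])),
    )
    con.commit()

(row_idx,) = con.execute(
    "SELECT row_idx FROM idx WHERE id = ?", ("<dbpedia:Pikachu>",)
).fetchone()
print(ds[row_idx]["title"])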

@carlos-lassance

> If you want to avoid the indexing work, you can serialize the index dict to a file to be loaded on subsequent runs. You might also use sqlite to lazily create and use a db as your index.

Sure, but then can't we have a default implementation that is well optimized and part of the library, instead of a thousand different implementations of this? It does not need to be something that gets downloaded with the dataset (in the same way that when you load a dataset and run map, the mapped version is not saved and downloaded by everyone else).
