
huge csv file occupies huge memory while indexing document #12893

Open
rayshaw001 opened this issue Jan 21, 2025 · 1 comment · May be fixed by #12882
Labels
🐞 bug Something isn't working

Comments

@rayshaw001
Contributor

Self Checks

  • This is only for bug reports; if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit issues in English, or they will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

0.15.1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

  1. Run Dify on a 4-core / 8 GB instance.
  2. Prepare a huge CSV file (350,000 lines).
  3. Upload the CSV file.
  4. indexing_processor consumes a huge amount of memory while embedding the segments.
  5. The Docker container restarts after running out of memory.

✔️ Expected Behavior

The document status turns to "available".
[screenshot]

❌ Actual Behavior

The Docker container restarts after running out of memory, and the document status stays pending at "indexing".
[screenshot]

About 10,000 embeddings (10,076 in the log below) take 10,076 × 1536 × 8 bytes / 1024 / 1024 ≈ 118 MB of memory (a Python float stores an 8-byte value), so 350,000 embeddings take roughly 35 × that, about 4 GB of memory.
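For reference, this back-of-envelope estimate can be reproduced in a few lines of Python. It is a lower bound: real Python float objects and the lists holding them add further per-object overhead on top of the 8-byte payload.

```python
# Lower-bound memory estimate for holding embeddings in RAM;
# dimensions match the issue (1536-d vectors, 8-byte floats).
EMBEDDING_DIM = 1536
BYTES_PER_FLOAT = 8

def embeddings_memory_mb(num_embeddings: int) -> float:
    """MB needed for the float payloads alone."""
    return num_embeddings * EMBEDDING_DIM * BYTES_PER_FLOAT / 1024 / 1024

print(embeddings_memory_mb(10_076))   # ~118 MB, matching the log below
print(embeddings_memory_mb(350_000))  # ~4101 MB, i.e. ~4 GB
```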

  • log of embeddings memory usage:

[screenshot]

  • objgraph result for the embeddings:

key result:

[screenshot]

full result:

[screenshot]

@dosubot dosubot bot added the 🐞 bug Something isn't working label Jan 21, 2025
@rayshaw001 rayshaw001 linked a pull request Jan 21, 2025 that will close this issue

dosubot bot commented Jan 21, 2025

The issue you're experiencing with high memory usage during the indexing of a large CSV file in Dify is likely due to the memory requirements for handling a large number of embeddings. Here are a few suggestions to help manage memory usage:

  1. Batch Processing: Instead of processing all 350,000 lines at once, consider processing the CSV file in smaller batches. This reduces peak memory usage by keeping only a subset of the data in memory at any given time (see the sketch after this list).

  2. Increase Memory Allocation: If possible, increase the memory allocation for your Docker container. This can be done by adjusting the Docker settings to allow more memory for the container running Dify.

  3. Optimize Embedding Size: If feasible, consider reducing the dimensionality of the embeddings. This will decrease the memory required for each embedding, thus reducing the overall memory footprint.

  4. Use Swap Space: Configure your system to use swap space, which can help handle memory spikes by temporarily using disk space as additional memory. However, this may slow down processing due to the slower read/write speeds of disk storage compared to RAM.

  5. Monitor and Log Memory Usage: Continue using tools like objgraph to monitor memory usage and identify potential memory leaks or inefficiencies in the code.
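A minimal sketch of the batching idea from point 1, assuming a generic embed_texts callable that stands in for the embedding-model client (the name and signature are illustrative, not Dify's actual API):

```python
from typing import Callable, Iterable, Iterator

def embed_in_batches(
    segments: Iterable[str],
    embed_texts: Callable[[list[str]], list[list[float]]],  # hypothetical client
    batch_size: int = 100,
) -> Iterator[list[float]]:
    """Yield embeddings batch by batch so only one batch sits in RAM."""
    batch: list[str] = []
    for segment in segments:
        batch.append(segment)
        if len(batch) >= batch_size:
            yield from embed_texts(batch)
            batch = []
    if batch:  # flush the final partial batch
        yield from embed_texts(batch)
```

Writing each batch to the vector store before requesting the next keeps peak memory at roughly one batch of vectors instead of all 350,000.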

Implementing these strategies can help mitigate the memory issues you're facing during the document indexing process.
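For point 5, objgraph (the tool already used in this report) can diff object counts around the embedding step, which points at the types that are accumulating:

```python
import objgraph

objgraph.show_growth(limit=10)  # baseline snapshot of object counts
# ... run the embedding / indexing step under test ...
objgraph.show_growth(limit=10)  # prints the types whose counts grew most
```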

To continue talking to Dosu, mention @dosu.


