
huge csv file occupies huge memory while indexing document #12893

Open
rayshaw001 opened this issue Jan 21, 2025 · 1 comment · May be fixed by #12882
Labels
🐞 bug Something isn't working

Comments

@rayshaw001
Contributor

Self Checks

  • This is only for bug reports; if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit issues in English, or they will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

0.15.1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

  1. Run Dify on a 4-core / 8 GB instance.
  2. Prepare a huge CSV file (350,000 lines).
  3. Upload the CSV file.
  4. indexing_processor consumes a huge amount of memory while embedding the segments.
  5. The Docker container restarts after running out of memory.

✔️ Expected Behavior

The document status turns to "available".
[screenshot]

❌ Actual Behavior

The Docker container restarts after running out of memory, and the document status stays pending at "indexing".
[screenshot]

About 10,000 embeddings (10,076 in the log below) take 10,076 × 1536 × 8 bytes / 1024 / 1024 ≈ 118 MB of memory (a Python float stores an 8-byte value), so 350,000 embeddings take roughly 35 × that, about 4 GB of memory.
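For reference, this back-of-envelope estimate can be reproduced in a few lines of Python. It is a lower bound: real Python float objects and the lists holding them add further per-object overhead on top of the 8-byte payload.

```python
# Lower-bound memory estimate for holding embeddings in RAM;
# dimensions match the issue (1536-d vectors, 8-byte floats).
EMBEDDING_DIM = 1536
BYTES_PER_FLOAT = 8

def embeddings_memory_mb(num_embeddings: int) -> float:
    """MB needed for the float payloads alone."""
    return num_embeddings * EMBEDDING_DIM * BYTES_PER_FLOAT / 1024 / 1024

print(embeddings_memory_mb(10_076))   # ~118 MB, matching the log below
print(embeddings_memory_mb(350_000))  # ~4101 MB, i.e. ~4 GB
```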

  • log of embeddings memory usage:

[screenshot]

  • objgraph result for the embeddings:

key result:

[screenshot]

full result:

[screenshot]

@dosubot dosubot bot added the 🐞 bug Something isn't working label Jan 21, 2025
@rayshaw001 rayshaw001 linked a pull request Jan 21, 2025 that will close this issue

dosubot bot commented Jan 21, 2025

The issue you're experiencing with high memory usage during the indexing of a large CSV file in Dify is likely due to the memory requirements for handling a large number of embeddings. Here are a few suggestions to help manage memory usage:

  1. Batch Processing: Instead of processing all 350,000 lines at once, consider processing the CSV file in smaller batches. This reduces peak memory usage by keeping only a subset of the data in memory at any given time (see the sketch after this list).

  2. Increase Memory Allocation: If possible, increase the memory allocation for your Docker container. This can be done by adjusting the Docker settings to allow more memory for the container running Dify.

  3. Optimize Embedding Size: If feasible, consider reducing the dimensionality of the embeddings. This will decrease the memory required for each embedding, thus reducing the overall memory footprint.

  4. Use Swap Space: Configure your system to use swap space, which can help handle memory spikes by temporarily using disk space as additional memory. However, this may slow down processing due to the slower read/write speeds of disk storage compared to RAM.

  5. Monitor and Log Memory Usage: Continue using tools like objgraph to monitor memory usage and identify potential memory leaks or inefficiencies in the code.
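A minimal sketch of the batching idea from point 1, assuming a generic embed_texts callable that stands in for the embedding-model client (the name and signature are illustrative, not Dify's actual API):

```python
from typing import Callable, Iterable, Iterator

def embed_in_batches(
    segments: Iterable[str],
    embed_texts: Callable[[list[str]], list[list[float]]],  # hypothetical client
    batch_size: int = 100,
) -> Iterator[list[float]]:
    """Yield embeddings batch by batch so only one batch sits in RAM."""
    batch: list[str] = []
    for segment in segments:
        batch.append(segment)
        if len(batch) >= batch_size:
            yield from embed_texts(batch)
            batch = []
    if batch:  # flush the final partial batch
        yield from embed_texts(batch)
```

Writing each batch to the vector store before requesting the next keeps peak memory at roughly one batch of vectors instead of all 350,000.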

Implementing these strategies can help mitigate the memory issues you're facing during the document indexing process.
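For point 5, objgraph (the tool already used in this report) can diff object counts around the embedding step, which points at the types that are accumulating:

```python
import objgraph

objgraph.show_growth(limit=10)  # baseline snapshot of object counts
# ... run the embedding / indexing step under test ...
objgraph.show_growth(limit=10)  # prints the types whose counts grew most
```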

To continue talking to Dosu, mention @dosu.


