[Question]: Adding documents to existing chroma index #17530
Comments
Hello there, @srsingh24! I'm a bot here to help you with bug fixes, questions, and contributing. Let's tackle this issue together!

You can add documents to an existing ChromaVectorStore after initialization by calling its add method:

import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core.schema import TextNode
# Creating a Chroma client
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("quickstart")
# Construct vector store
vector_store = ChromaVectorStore(
    chroma_collection=chroma_collection,
)
# Defining some sample data
nodes = [
    TextNode(
        text=(
            "Michael Jordan is a retired professional basketball player,"
            " widely regarded as one of the greatest basketball players of all"
            " time."
        ),
        metadata={
            "category": "Sports",
            "country": "United States",
        },
    ),
    # Add more TextNode instances as needed
]
# Adding documents to the ChromaVectorStore
for node in nodes:
    vector_store.add(node)

This method allows you to add new documents to the vector store without needing a mode argument.
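One caveat, hedged because it depends on how your nodes were produced: the low-level add call stores precomputed vectors, so each TextNode should carry an embedding before it is added. A minimal sketch, reusing the nodes list from the snippet above and assuming the llama-index-embeddings-openai package is installed:

from llama_index.core.schema import MetadataMode
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding()

# Attach an embedding to each node before handing it to the vector store
for node in nodes:
    node.embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode=MetadataMode.EMBED)
    )

# add() accepts a list of nodes and returns the ids it stored
vector_store.add(nodes)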
@dosu thank you. How do I initialize the ChromaVectorStore with an OpenAI embedding model?
To initialize the ChromaVectorStore and use it with an OpenAI embedding model, configure the embedding model and pass it in when building the index. Here's a code snippet to guide you:

# Set up OpenAI
import os
import getpass
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
import openai
openai.api_key = os.environ["OPENAI_API_KEY"]
# Install Chroma
# pip install chromadb
# Create client and a new collection
import chromadb
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
# Load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
# Initialize client, setting path to save data
db = chromadb.PersistentClient(path="./chroma_db")
# Create collection
chroma_collection = db.get_or_create_collection("quickstart")
# Define embedding function using OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding()  # LlamaIndex wrapper; requires llama-index-embeddings-openai
# Assign Chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Create your index
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)
# Query Data
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)

This setup configures the index to embed documents and queries with the OpenAI embedding model while storing the vectors in Chroma.
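Alternatively, if you prefer not to pass embed_model at every call site (an optional pattern, not something the snippet above requires), you can register it once as the global default via Settings, reusing the documents and storage_context defined above:

from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

# Any index or query engine created afterwards picks this up by default
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)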
@dosu I want to initialize an empty Chroma vector store with the LlamaIndex OpenAI Embeddings: https://docs.llamaindex.ai/en/stable/examples/embeddings/OpenAI/ Next, I will use the following code that you provided to add documents as I receive them:

import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core.schema import TextNode
# Creating a Chroma client
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("quickstart")
# Construct vector store
vector_store = ChromaVectorStore(
    chroma_collection=chroma_collection,
)
# Defining some sample data
nodes = [
    TextNode(
        text=(
            "Michael Jordan is a retired professional basketball player,"
            " widely regarded as one of the greatest basketball players of all"
            " time."
        ),
        metadata={
            "category": "Sports",
            "country": "United States",
        },
    ),
    # Add more TextNode instances as needed
]
# Adding documents to the ChromaVectorStore
for node in nodes:
    vector_store.add(node)
To initialize an empty Chroma vector store with LlamaIndex OpenAI embeddings and add documents as they are received, follow these steps:
Ensure you have the necessary packages installed:

pip install chromadb llama-index-vector-stores-chroma llama-index-embeddings-openai
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.schema import TextNode
# Initialize Chroma client
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("quickstart")
# Define embedding model
embed_model = OpenAIEmbedding(model="text-embedding-3-large")
# Assign Chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Initialize an empty VectorStoreIndex
index = VectorStoreIndex([], storage_context=storage_context, embed_model=embed_model)
# Assuming `new_documents` is a list of TextNode instances you receive
for node in new_documents:
    index.insert_nodes([node])

This setup allows you to initialize an empty Chroma vector store with OpenAI embeddings and dynamically add documents to it as they are received [1][2].
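If you also want the inserted nodes to survive a process restart (an assumption about your goal; the ephemeral client above keeps everything in memory), the same setup can be backed by a persistent Chroma client:

import chromadb

# Persist vectors to disk instead of keeping them in memory
db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)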
@dosu, thanks! And then once I have loaded in documents, how do I load my ChromaDB without having to re-initialize the vector index and without losing all my old documents?
To load your ChromaDB without re-initializing the vector index and without losing your old documents, you can load the index directly from the stored vectors instead of re-loading your documents or creating a new index from scratch:

import chromadb
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
# Initialize client
db = chromadb.PersistentClient(path="./chroma_db")
# Get the existing collection
chroma_collection = db.get_or_create_collection("quickstart")
# Wrap the existing collection in a ChromaVectorStore
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
# Use the same embedding model that was used when the vectors were written,
# so queries are embedded into the same vector space
embed_model = OpenAIEmbedding(model="text-embedding-3-large")
# Load your index from the stored vectors
index = VectorStoreIndex.from_vector_store(
    vector_store, embed_model=embed_model
)
# Create a query engine
query_engine = index.as_query_engine()
response = query_engine.query("What is llama2?")
print(response)

This approach allows you to load the existing vector index without re-initializing it or losing your previously stored documents [1].
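As a quick follow-up sketch (optional, and assuming the reloaded chroma_collection and index from the snippet above), you can verify the old vectors are still present and keep appending new nodes to the same collection:

from llama_index.core.schema import TextNode

# The collection count reflects everything stored in earlier sessions
print(chroma_collection.count())

# New nodes are embedded with the index's embed_model and written to the same collection
index.insert_nodes([TextNode(text="Llama 2 is a family of open-weight LLMs released by Meta.")])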
@dosu, I tried your approach of adding documents to the ChromaVectorStore:
and I get the following error:
The error you're encountering is due to passing a single TextNode to the add method, which expects a list of nodes. Pass the list directly instead:

# Defining some sample data
nodes = [
    TextNode(
        text=(
            "Michael Jordan is a retired professional basketball player,"
            " widely regarded as one of the greatest basketball players of all"
            " time."
        ),
        metadata={
            "category": "Sports",
            "country": "United States",
        },
    ),
    # Add more TextNode instances as needed
]
# Adding documents to the ChromaVectorStore
vector_store.add(nodes)  # Pass the list of nodes directly

By passing the entire list of TextNode objects in one call, the add method receives the list it expects.
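One more caveat, hedged because it depends on how your nodes were produced: the low-level vector_store.add(nodes) call assumes each node already carries an embedding. If your nodes are raw text, the index route from earlier in this thread computes the embeddings for you:

# `index` is the VectorStoreIndex built earlier with an embed_model attached;
# insert_nodes embeds the nodes before writing them to the Chroma collection.
index.insert_nodes(nodes)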
Question
I was trying to figure out how to add documents to ChromaVectorStore after the initial initialization. I came across an answer on StackOverflow saying that I should set the mode argument to append. While this was helpful, I wanted to cross-check the docs and the source code but could not find any mention of the mode argument. Where can I find more information on it?