- Apolo CLI (see its installation instructions).
- HuggingFace access to the model you want to deploy, for example DeepSeek.
Note: this setup is mostly for POC purposes. For a production-ready setup, you'll need to replace some of its components with production-ready Apps.
- `$ git clone` this repo && `$ cd` into the root of it.
- Build the image for the web app with `$ apolo-flow build privategpt`.
- Create a secret with your HuggingFace token to pull models: `$ apolo secret add HF_TOKEN <token>` (see https://huggingface.co/settings/tokens).
- `$ apolo-flow run pgvector` -- start the vector store.
- `$ apolo-flow run tei` -- start the embeddings server.
- `$ apolo-flow run vllm` -- start the LLM inference server. Note: if you want to change the LLM hosted there, change it in `live.yaml` under `defaults.env.VLLM_MODEL` (see the sketch after this list).
- `$ apolo-flow run pgpt` -- start the PrivateGPT web server.
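For reference, here is a minimal sketch of the relevant part of `live.yaml`, assuming the flow keeps its shared settings under `defaults.env` as the note above suggests. The model reference and the `secret:` syntax for injecting `HF_TOKEN` are illustrative and should be checked against the actual file in this repo.

```yaml
# Sketch only: layout and values are assumptions, not copied from this repo's live.yaml.
defaults:
  env:
    VLLM_MODEL: deepseek-ai/DeepSeek-R1-Distill-Llama-8B  # example HuggingFace model reference to serve
    HF_TOKEN: secret:HF_TOKEN                             # assumed secret reference for the token added via `apolo secret add`
```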
Configuration
Currently, we support only the deployment case with vLLM as the LLM inference server, PGVector as the vector store, and TextEmbeddingsInference as the embeddings server.
Use the following environment variables to configure the PrivateGPT instance.
Each entry follows the scheme: env name (value type, required/optional) -- description.
Shared among all the jobs:
- `VLLM_MODEL` (Hugging Face model reference, required) -- LLM model name to use (must be available at the inference server).
- `VLLM_TOKENIZER` (Hugging Face model reference, required) -- tokenizer to use while sending requests to the LLM.
- `VLLM_CONTEXT_WINDOW` (int, required) -- controls the context size that will be sent to the LLM.
LLM config section:
- `VLLM_API_BASE` (URL, required) -- HTTP endpoint for LLM inference.
- `VLLM_TEMPERATURE` (float, 0 < x < 1, optional) -- temperature ('creativeness') parameter for the LLM. A lower value penalizes going outside the provided context more strictly.
Embeddings config section:
- `EMBEDDING_MODEL` (str, optional) -- embeddings model to use.
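Putting the sections above together, the following is a hedged sketch of how these variables could be wired up in `live.yaml`. The job name `pgpt`, the model references, the endpoint URL, and all values are placeholders, not values taken from this repo.

```yaml
# Sketch only: variable names follow the list above; every value here is an example placeholder.
defaults:
  env:
    VLLM_MODEL: deepseek-ai/DeepSeek-R1-Distill-Llama-8B      # shared: model served by the vLLM job
    VLLM_TOKENIZER: deepseek-ai/DeepSeek-R1-Distill-Llama-8B  # shared: tokenizer used to format requests
    VLLM_CONTEXT_WINDOW: "8192"                               # shared: context size sent to the LLM

jobs:
  pgpt:
    env:
      VLLM_API_BASE: http://vllm.example.internal:8000/v1     # LLM: inference endpoint (placeholder URL)
      VLLM_TEMPERATURE: "0.2"                                 # LLM: lower values stay closer to the provided context
      EMBEDDING_MODEL: BAAI/bge-small-en-v1.5                 # embeddings: example TEI-compatible model
```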
Other platform-related configuration options, such as `--life-span`, also work here.
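For instance, a job's lifetime can be capped in `live.yaml`; the `life_span` attribute below follows the usual Apolo/neuro-flow job schema, so verify the exact spelling against your apolo-flow version.

```yaml
# Sketch: capping the PrivateGPT job's lifetime (attribute name assumed from the apolo-flow job schema).
jobs:
  pgpt:
    life_span: 1d   # stop the web server automatically after one day
```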