This project implements a query processing and vector search application using various machine learning models and a ClickHouse database. The application processes queries, generates embeddings, and ranks or retrieves relevant sections from a database.
This project was developed by Difinative Technologies in collaboration with SOCHARA.
- Connection to a ClickHouse database
- Initialization of various tokenizers and models
- Generation of embeddings for input queries
- Ranking of database sections based on cosine similarity
- Retrieval of database sections using different search methods (cosine distance, Euclidean distance, etc.)
- Structuring of text using OpenAI's GPT-3.5 language model
- Python 3.7 or higher
- ClickHouse server
- Necessary environment variables stored in a .env file
- Clone the repository:

      git clone https://github.com/aparna23na/Proj.git
      cd your-repo-name

- Create a virtual environment and activate it:

      python -m venv venv
      source venv/bin/activate  # On Windows, use venv\Scripts\activate

- Set up the .env file with the necessary environment variables:

      OPENAI_API_KEY=your_openai_api_key
      CLICKHOUSE_HOST=your_clickhouse_host
      CLICKHOUSE_PORT=your_clickhouse_port
      CLICKHOUSE_USER=your_clickhouse_user
      CLICKHOUSE_PASSWORD=your_clickhouse_password
      CLICKHOUSE_DATABASE=your_clickhouse_database
      METADATA_URL=https://example.com/metadata_api
      OUTPUT_DIRECTORY=/path/to/pdf_downloads
      ARCHIVE_BASE_URL=https://example.com/archive_base
      PDF_DIRECTORY=/path/to/data_directory
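The ClickHouse variables listed above could be collected at startup roughly as follows. This is a minimal sketch, not the project's actual loader: the function name `load_clickhouse_config` and the shape of the returned dictionary are assumptions made for illustration, and it expects the .env file to have been loaded into the environment already (for example with python-dotenv).

```python
import os

# Variable names taken from the .env listing above.
REQUIRED_KEYS = (
    "CLICKHOUSE_HOST",
    "CLICKHOUSE_PORT",
    "CLICKHOUSE_USER",
    "CLICKHOUSE_PASSWORD",
    "CLICKHOUSE_DATABASE",
)


def load_clickhouse_config(env=None):
    """Hypothetical helper: gather ClickHouse settings from a mapping.

    `env` defaults to os.environ; passing a plain dict makes testing easy.
    """
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if key not in env]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "host": env["CLICKHOUSE_HOST"],
        "port": int(env["CLICKHOUSE_PORT"]),  # ports arrive as strings
        "user": env["CLICKHOUSE_USER"],
        "password": env["CLICKHOUSE_PASSWORD"],
        "database": env["CLICKHOUSE_DATABASE"],
    }
```

Failing early on a missing variable tends to give a clearer error than a connection failure later on.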
By default, the application will be accessible at http://127.0.0.1:5000/.
Flask Application
- Home Page: Enter a query to receive a structured response based on the most relevant section from the database.
- About Page: Information about the project.
This Python script provides functionality for querying and processing PDF documents using natural language processing and vector search techniques.
- PDF text extraction and embedding generation
- Vector search using various methods (cosine similarity, Euclidean distance, ANN)
- Integration with ClickHouse database for efficient data storage and retrieval
- Multi-stage query processing for improved search results
- PDF description retrieval
- Result deduplication
initialize_clickhouse_connection() Initializes a connection to the ClickHouse database using the credentials and configurations stored in environment variables.
initialize_tokenizer_and_model() Loads a pre-trained BERT tokenizer and model from the Hugging Face library. If the tokenizer lacks a padding token, it assigns a default one.
initialize_llama_model() Loads a pre-trained LLaMA tokenizer and model from the Hugging Face library for causal language modeling tasks.
generate_embeddings(tokenizer, model, query) Generates embeddings for a given query text using the provided tokenizer and model. Returns the pooled embedding vector.
extract_important_words(query_text) Extracts important words from a given query text, excluding common stop words. Returns a list of significant words.
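A stop-word filter of this kind can be sketched as below. This is an illustration under stated assumptions rather than the project's code: the tiny `STOP_WORDS` set is hypothetical (a real list, such as the one shipped with nltk, is much larger), and the regular-expression tokenization is a simplification.

```python
import re

# Hypothetical, deliberately small stop-word list for illustration only.
STOP_WORDS = {
    "a", "an", "the", "is", "are", "of", "in", "on",
    "for", "to", "and", "what", "how",
}


def extract_important_words(query_text):
    """Return the lowercase words of the query that are not stop words."""
    words = re.findall(r"[a-z0-9]+", query_text.lower())
    return [word for word in words if word not in STOP_WORDS]
```

For example, `extract_important_words("What is the role of community health?")` keeps only `role`, `community`, and `health`.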
get_surrounding_chunks(client, id, summary_id, window_size=2) Fetches chunks of text surrounding a specific chunk from the database. The window_size parameter determines how many chunks before and after should be included.
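The effect of window_size can be illustrated without a database. The sketch below assumes the chunks of one document are already available as an ordered Python list; the helper name `surrounding_window` is hypothetical, and the real function queries ClickHouse instead.

```python
def surrounding_window(chunks, index, window_size=2):
    """Return the chunk at `index` plus up to `window_size` neighbours per side.

    The window is clipped at the start and end of the document.
    """
    if not 0 <= index < len(chunks):
        raise IndexError("index out of range")
    start = max(0, index - window_size)
    end = min(len(chunks), index + window_size + 1)
    return chunks[start:end]
```

With six chunks and `window_size=2`, asking for index 3 returns the chunks at positions 1 through 5, while index 0 returns only positions 0 through 2.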
get_original_filename(client, summary_id) Retrieves the original filename associated with a given summary ID from the database, processes it, and constructs the full file URL.
cosine_similarity(client, question_embedding) Performs a cosine similarity search using the provided question embedding. Returns the most similar chunk of text and the associated original filename.
vector_search_cosine_distance(client, question_embedding) Performs a cosine distance search using the provided question embedding. Returns the most similar chunk of text and the associated original filename.
ann_search(client, query_embedding, window_size=2, top_n=5) Performs an approximate nearest neighbor (ANN) search using the provided query embedding. Returns the top matching chunks and their descriptions if they are PDF files.
euclidean_search(client, question_embedding) Performs a Euclidean distance search using the provided question embedding. Returns the most similar chunk of text and the associated original filename.
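The cosine and Euclidean searches differ only in the distance measure used to rank chunks. The following pure-Python sketch shows both measures on in-memory data; the helper names and the `(chunk_text, embedding)` tuple layout are assumptions, and in the application itself the ranking is described as happening inside ClickHouse.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; depends only on the angle between the vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance; sensitive to vector length as well as direction."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_chunk(query_embedding, chunks, distance):
    """Return the text of the chunk whose embedding has the smallest distance.

    `chunks` is a list of (chunk_text, embedding) tuples (assumed layout).
    """
    return min(chunks, key=lambda item: distance(query_embedding, item[1]))[0]
```

The two measures can disagree: a chunk pointing in exactly the query's direction wins under cosine distance even when a slightly rotated but closer vector wins under Euclidean distance.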
query_clickhouse_word_with_multi_stage(client, important_words, query_embedding, top_n=5) Executes a multi-stage search combining keyword matching and semantic similarity. Returns the top matching chunks and descriptions for PDF files.
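One way to picture the two stages is a keyword filter followed by a similarity ranking. The sketch below works on in-memory dictionaries rather than ClickHouse, and several details are assumptions: the `text` and `embedding` keys, the substring-based keyword match, and the fallback to all chunks when no keyword matches.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def multi_stage_search(chunks, important_words, query_embedding, top_n=5):
    """Hypothetical in-memory version of a keyword-then-semantic search."""
    # Stage 1: keep chunks that mention at least one important word.
    lowered = [word.lower() for word in important_words]
    candidates = [
        chunk for chunk in chunks
        if any(word in chunk["text"].lower() for word in lowered)
    ]
    # Assumed fallback: with no keyword hits, rank every chunk instead.
    if not candidates:
        candidates = list(chunks)
    # Stage 2: order the candidates by semantic similarity to the query.
    ranked = sorted(
        candidates,
        key=lambda chunk: cosine_sim(query_embedding, chunk["embedding"]),
        reverse=True,
    )
    return ranked[:top_n]
```

Filtering first keeps the more expensive similarity computation to a smaller candidate set.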
get_pdf_description(filename) Retrieves a brief description of a PDF file based on its content from the database. The description is truncated to 300 characters.
deduplicate_results(closest_chunks, top_n) Removes duplicate results from a list of chunks based on their filenames. Returns unique chunks and their filenames.
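Filename-based deduplication can be sketched as follows. The input layout is an assumption: `closest_chunks` is taken to be a list of `(chunk_text, filename)` tuples ordered best match first, so the first chunk seen for each file is the one kept.

```python
def deduplicate_results(closest_chunks, top_n):
    """Keep the first chunk per filename, stopping after `top_n` unique files."""
    seen = set()
    unique_chunks = []
    unique_filenames = []
    for chunk_text, filename in closest_chunks:
        if len(unique_chunks) >= top_n:
            break
        if filename in seen:
            continue  # a better-ranked chunk from this file is already kept
        seen.add(filename)
        unique_chunks.append(chunk_text)
        unique_filenames.append(filename)
    return unique_chunks, unique_filenames
```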
structure_sentence_with_llama(query, chunk_text, llama_tokenizer, llama_model) Structures a given chunk of text in response to a query using the LLaMA model. Returns the structured text.
structure_chunk_text(query, chunk_text) Formats a given chunk of text without making any changes to its content using OpenAI's GPT-3.5-turbo model. Returns the formatted text.
process_query_clickhouse(query_text, search_method='ann_search') Processes a query by generating embeddings and running the specified search method against the ClickHouse database. Returns the most relevant chunk of text and its filename.
process_query_clickhouse_pdf(query_text, top_n=5) Processes a query by extracting important words, generating embeddings, performing a multi-stage search, and retrieving descriptions for the top PDF files. Returns the most relevant contexts, filenames, and descriptions.
This file contains the main Flask application for the chat interface.
Imports: Imports necessary modules and functions, including Flask and custom functions.
Flask App Initialization: Creates the Flask application instance.
Logging and Warning Configuration: Sets up basic logging and suppresses specific deprecation warnings.
Conversation History: Initializes an empty list to store the chat history.
Main Route ('/')
- Handles both GET and POST requests for the main chat interface.
- POST request processing:
  - Retrieves the user's query.
  - Calls process_query_clickhouse_pdf to get responses and PDF information.
  - Formats the response with bot reply, PDF URLs, and descriptions.
  - Updates the conversation history.
- GET request:
  - Renders the main chat interface template.
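The POST steps above can be sketched independently of Flask. Everything in this example is illustrative: the helper names `build_response` and `handle_post`, the dictionary layout, and the fallback messages are assumptions, and `search` stands in for process_query_clickhouse_pdf, which is assumed to return three parallel lists (contexts, filenames, descriptions).

```python
conversation_history = []  # in-memory chat history, as described above


def build_response(query, contexts, filenames, descriptions):
    """Combine the search output into one chat entry (assumed layout)."""
    return {
        "user": query,
        "bot": contexts[0] if contexts else "No relevant section found.",
        "pdfs": [
            {"url": url, "description": description}
            for url, description in zip(filenames, descriptions)
        ],
    }


def handle_post(query, search):
    """Validate the query, run the search, and record the exchange."""
    query = query.strip()
    if not query:
        return {"error": "Invalid query"}
    contexts, filenames, descriptions = search(query)
    entry = build_response(query, contexts, filenames, descriptions)
    conversation_history.append(entry)
    return entry
```

Keeping the formatting logic in plain functions like these makes it possible to test it without starting the web server.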
About Route ('/about'): Renders the About Us page.
Error Handling: Provides basic error responses for invalid queries or processing errors.
Application Runner: Starts the Flask application in debug mode when the script is run directly.
index(): Main route handler for the chat interface.
about(): Handler for the About page.
index.html: Main chat interface template.
about.html: About page template.
This script sets up a web application that allows users to interact with a chat interface, process queries, and receive responses along with relevant PDF information. It maintains a conversation history and provides a simple About page.
The HTML structure is divided into several main sections:
Header - Contains the title of the chatbot and navigation buttons.
Chat Container - Encloses the chat header, chat body, and chat footer.
Chat Body - Displays the conversation history between the user and the bot.
Chat Footer - Contains the input form for users to type and send their messages.
Loading Animation - Displays a spinner animation while the chatbot processes the user query.
The CSS styles enhance the appearance and layout of the interface.
Body - Background image, font settings, and overall layout.
Chat Header - Fixed position with background color and text alignment.
Header Buttons - Style for the navigation buttons with hover effects.
Chat Container and Body - Layout settings for chat display, including flexbox settings.
Messages - Styles for user and bot messages with different background colors.
Loading Animation - Spinner animation to indicate processing.
The JavaScript handles user interaction and communication with the server.
Scroll to Bottom - Ensures the chat scrolls to the latest message.
Form Submission - Handles user input and sends it to the server using XMLHttpRequest.
Loading Animation - Displays a spinner while the server processes the query.
Response Handling - Updates the chat body with the user's message, bot's response, and any PDF links and descriptions.
The About Us page provides information about the organization, its mission, and focus areas. The page includes a header with navigation buttons, and a content section with details about the NGO.
The HTML structure is divided into several main sections:
- Document Type and Language
  - Defines the document type and language of the page.
- Head Section
  - Contains meta information and links to stylesheets.
- Body Section
  - Encloses the chat header and content wrapper.
This script processes PDF files, extracts text, generates embeddings, and stores the data in a ClickHouse database.
- Extracts text from PDF files
- Generates text embeddings using BERT
- Stores PDF summaries and text chunks in ClickHouse
- Handles both text-based and scanned PDFs
- Avoids duplicate processing of PDFs
- Clone the repository

- Install dependencies:

      pip install -r requirements.txt

- Set up a .env file with the following variables:

      CLICKHOUSE_HOST=your_clickhouse_host
      CLICKHOUSE_PORT=your_clickhouse_port
      CLICKHOUSE_USER=your_clickhouse_user
      CLICKHOUSE_PASSWORD=your_clickhouse_password
      CLICKHOUSE_DATABASE=your_clickhouse_database
The script will:
- Create necessary ClickHouse tables if they don't exist
- Process all PDF files in the specified directory
- Extract text, generate embeddings, and store data in ClickHouse
create_clickhouse_tables(): Sets up required tables in ClickHouse
insert_pdf_summary(): Inserts a new PDF summary into the database
extract_text_from_pdf(): Extracts text from PDF files (including OCR for scanned PDFs)
insert_chunks(): Processes text chunks, generates embeddings, and inserts into database
process_pdf_file(): Orchestrates the processing of a single PDF file
main(): Main function to process all PDFs in the specified directory
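Before embeddings are generated, the extracted text has to be split into chunks. The README does not say how the chunks are formed, so the sketch below is only one plausible approach: fixed-size word windows with a small overlap. The function name `split_into_chunks` and the `chunk_size` and `overlap` defaults are hypothetical.

```python
def split_into_chunks(text, chunk_size=200, overlap=20):
    """Split text into word windows of `chunk_size` words.

    Consecutive chunks share `overlap` words so that sentences cut at a
    boundary still appear whole in at least one chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Each resulting chunk would then be embedded and inserted as one row, alongside the ID of the summary it belongs to.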
- abc_table: Stores PDF summaries
  - Columns: id (UUID), user_name, original_filename, summarized_text
- abc_chunks: Stores text chunks and their embeddings
  - Columns: id, summary_id, chunk_text, embeddings
An index is created on the embeddings column for efficient similarity searches.
This script is designed to handle large numbers of PDF files. It uses BERT for generating text embeddings, which are stored for later use in search or analysis tasks.
The script includes error handling to manage issues that may arise during PDF processing or database operations. Check the console output for any error messages.
This script automates the process of downloading PDF files based on metadata retrieved from a specified API.
Environment Setup
- Uses dotenv to load environment variables from a .env file.
- Retrieves crucial URLs and directory paths from environment variables:
  - METADATA_URL: API endpoint for metadata
  - OUTPUT_DIRECTORY: Where PDFs will be saved
  - ARCHIVE_BASE_URL: Base URL for constructing PDF download links
- download_pdfs_from_metadata(metadata_url, output_dir)
  - Fetches metadata from the specified API
  - Extracts PDF identifiers from the metadata
  - Constructs individual PDF URLs
  - Initiates download for each PDF
- download_pdf(pdf_url, pdf_filename)
  - Downloads a single PDF file from the given URL
  - Saves the PDF to the specified output directory
  - Prints confirmation message upon successful download
Execution Flow
- Creates the output directory if it doesn't exist
- Calls download_pdfs_from_metadata() to start the download process
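The URL construction and download steps might look roughly like this, using only the standard library. This is a sketch under stated assumptions: the URL layout `<base>/<identifier>.pdf` in `build_pdf_url` is hypothetical (the real layout depends on the archive), and `download_pdf` here takes an extra `output_dir` argument that the script's own signature does not show.

```python
import os
import urllib.request


def build_pdf_url(archive_base_url, identifier):
    """Hypothetical URL layout: <base>/<identifier>.pdf."""
    return archive_base_url.rstrip("/") + "/" + identifier + ".pdf"


def download_pdf(pdf_url, pdf_filename, output_dir):
    """Download one PDF and save it under output_dir; return the saved path."""
    os.makedirs(output_dir, exist_ok=True)  # create the directory if needed
    target = os.path.join(output_dir, pdf_filename)
    with urllib.request.urlopen(pdf_url) as response, open(target, "wb") as handle:
        handle.write(response.read())
    print(f"Downloaded {pdf_filename}")
    return target
```

A caller would loop over the identifiers extracted from the metadata, build each URL with `build_pdf_url`, and pass it to `download_pdf`.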
- Flask: Web framework for creating the application's interface and handling HTTP requests.
- torch: Deep learning library used for neural network operations and tensor computations.
- transformers: Provides pre-trained models like BERT for natural language processing tasks.
- clickhouse-driver: Client library for interacting with the ClickHouse database.
- scipy: Scientific computing library, used here for distance calculations in vector searches.
- python-dotenv: Loads environment variables from a .env file for configuration management.
- nltk: Natural Language Toolkit for text processing tasks like tokenization.
- openai: Client library for interacting with OpenAI's API, used for GPT-3.5 text generation.
- PyPDF2: Library for reading and extracting text from PDF files.
- uuid: Python standard-library module that generates unique identifiers for database entries (no separate installation needed).
- pytesseract: OCR tool for extracting text from images (used for scanned PDFs).
- pdf2image: Converts PDF pages to images for OCR processing.