AI Hadith Retriever

Discovering Authentic Islamic Teachings in ChatGPT

We’ll develop a ChatGPT plugin that fetches authentic hadiths from a vector database.

Introduction

I’ve been using ChatGPT nearly every day, in both my professional and personal life. It has become an invaluable tool that boosts my productivity and offers countless conveniences in my daily routines.

A known issue with GPT is hallucination, which can cause the AI to output incorrect information in a way that appears correct and valid to the user.

This poses a risk, as it implies that GPT cannot always be trusted to provide accurate information. Consequently, users need to cross-check and cross-reference the information provided by GPT with other sources, which can negate the productivity boost offered by GPT.

As a Muslim, I have a keen interest in Islamic studies. In this field, a significant amount of effort is dedicated to validating the authenticity of a hadith. Ensuring the authenticity of hadiths is crucial to maintain religious and historical accuracy, prevent misinterpretation, guard against innovation, and preserve Islamic law.

The problem arises when ChatGPT is queried for Islamic information. Given the sensitive nature of sourcing authentic information about Islam, any incorrect information provided by GPT could have wide-ranging consequences in a person’s life.

I was fortunate to gain early access to ChatGPT plugins, and it occurred to me: this is exactly what plugins are for!

We could create a plugin that provides authentic hadith sources based on the user’s query. It would output the hadith and cite its source, instilling confidence in the user that the information provided is authentic.

The plugin has been approved and published in the ChatGPT plugin store! You can access it right now by visiting ChatGPT, clicking on the Alpha tab, then the Plugin Store, and searching for the Sahih AI plugin to install it.

An example prompt for the plugin could be something like, “What are some hadiths related to performing acts of charity?”

The response template would be:

1. English: {english}
Original Arabic: {original_arabic}
Source: {source}

2. English: {english}
Original Arabic: {original_arabic}
Source: {source}
...

Or you can continue reading to develop and run it yourself!

Technology stack

We will use a Quart server (an async Python web framework with a Flask-compatible API) to serve the ChatGPT plugin’s API requests. Additionally, we’ll employ a vector database that contains embeddings of the Sahih hadith data.

Furthermore, we’ll utilize the LangChain framework, which provides convenient helpers for building a vector database, populating it with data, and querying it.
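
To make this concrete, here is a minimal sketch (not the repository’s actual code) of how a Quart route could answer the plugin’s queries with a LangChain similarity search. The /query route name, the metadata keys, and the tiny in-memory Chroma store are assumptions for illustration only (the in-memory store needs the chromadb package from the Chroma setup below), and the import paths assume the classic langchain package layout:

    from quart import Quart, request, jsonify
    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.vectorstores import Chroma

    app = Quart(__name__)

    # A tiny in-memory store so this sketch runs on its own; the real project
    # points this at the Pinecone or Chroma database configured later.
    vectorstore = Chroma.from_texts(
        texts=["Example hadith text about charity."],
        embedding=HuggingFaceEmbeddings(),
        metadatas=[{"arabic": "...", "source": "Sahih al-Bukhari"}],
    )

    @app.route("/query")
    async def query():
        user_query = request.args.get("q", "")
        docs = vectorstore.similarity_search(user_query, k=3)  # top 3 matches
        return jsonify([
            {
                "english": doc.page_content,
                "original_arabic": doc.metadata.get("arabic", ""),
                "source": doc.metadata.get("source", ""),
            }
            for doc in docs
        ])

    if __name__ == "__main__":
        app.run(port=8000)

ChatGPT handles the conversational side: it decides when to call the endpoint, passes along the user’s question, and formats the returned hadiths using the response template shown earlier.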

Two main requirements need to be addressed for the vector database and embedding function:

  • We want to host the setup remotely, to keep the load off our own server.
  • We want to keep costs down and avoid paying for the services we use.

To meet these requirements, we have chosen Pinecone as our vector database solution. It is a popular and dependable vector database that is hosted remotely, and it offers one free index, which is sufficient for our needs.

As for our embedding function, we have opted for Sentence Transformers instead of OpenAI embeddings. Sentence Transformers is free, although it does consume some local compute resources.

I have also added support for the Chroma vector database and OpenAI embeddings. If you wish to experiment with these, I will guide you through setting them up. Keep in mind that you will need to run a local server for Chroma DB and create an OpenAI account to obtain an API key.


Setup

Clone the repo

To clone the repository, run the following command:

git clone https://github.com/that-one-arab/sahih-ai

Libraries setup

  • RECOMMENDED: Install virtualenvwrapper and use it to create a Python virtual environment:
    mkvirtualenv sahih-ai
    
  • Install the required libraries using the following command:
    pip install -r requirements.txt
    
  • Copy the .env.example file and rename the copy as .env.

Embedding function setup

An embedding function is responsible for creating vector representations of complex data. A vector database does not store your data in its original form; instead, the data is transformed into a vector (embedding), which is essentially a list of numbers. That list of numbers is unreadable to us humans, but it captures the meaning of the text in a form the database can compare, enabling it to find similar items efficiently.
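
For a quick illustration, here is what an embedding looks like when we use the sentence-transformers library directly. The model name below is a popular small model from that library, not necessarily the one the project uses by default:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    vector = model.encode("Charity does not decrease wealth.")

    print(len(vector))  # 384 numbers for this particular model
    print(vector[:5])   # a slice of the vector: unreadable to us, meaningful to the database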

You are free to choose one of the following embedding functions; just follow the steps for the one you select:

Sentence Transformer Embeddings

  • Install the Sentence Transformers library:
    pip install sentence_transformers
    
  • There is no need to edit the .env file, since Sentence Transformers is the default embedding function.

OpenAI Embeddings

  • Run pip install openai tiktoken
  • Go to OpenAI, sign up for an account, and create an API key.
  • Edit the .env file and modify the following fields (a sketch of how this variable drives the choice of embedding function follows below):
    export EMBEDDINGS=openai
    export OPENAI_API_KEY="YOUR_OPENAI_KEY"
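
Whichever option you pick, the EMBEDDINGS variable is what tells the code which embedding function to construct. The helper below is only a sketch of how that selection could look with LangChain; the function name and structure are illustrative, not necessarily how the repository implements it:

    import os

    from langchain.embeddings import HuggingFaceEmbeddings, OpenAIEmbeddings

    def get_embeddings():
        # Pick the embedding function based on the EMBEDDINGS environment variable
        if os.getenv("EMBEDDINGS", "").lower() == "openai":
            # Requires OPENAI_API_KEY; produces 1536-dimensional vectors
            return OpenAIEmbeddings()
        # Default: Sentence Transformers; produces 768-dimensional vectors
        return HuggingFaceEmbeddings()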
    

Vector Database Setup

To store our Hadith data, we need to set up a vector database. The project supports both Pinecone and Chroma vector databases. The tradeoff is that Pinecone is hosted remotely, eliminating the need for a local database dependency, while Chroma DB is hosted locally. Follow either the “Pinecone setup” steps or the “Chroma setup” steps.

Pinecone Setup

Install Pinecone client
  • Run pip install pinecone-client
Create a Pinecone account
  • Visit Pinecone and create an account.
Create a Pinecone index
  • Create an index named sahih-ai with the following properties:
    • Dimensions: Set it to 768 if you chose Sentence Transformers embeddings, or to 1536 if you chose OpenAI embeddings (these match the size of the vectors each embedding function produces).
    • Metric: cosine (similarity is measured by the angle between vectors rather than their magnitude).
Connect to the Pinecone index
  • Click on the API Keys section.
  • Reveal the created API key and copy it into .env; also note the index environment and copy it as well (a connection sketch follows after these steps):
    # Vector Database configuration. Default is pinecone
    export PINECONE_API_KEY="YOUR_PINECONE_API_KEY"
    export PINECONE_ENVIRONMENT="YOUR_PINECONE_ENVIRONMENT"
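
Once those two values are in .env, connecting from Python looks roughly like the sketch below. It assumes the pre-v3 pinecone-client (the one that uses an environment value) and the default Sentence Transformers embeddings; variable names are illustrative:

    import os

    import pinecone
    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.vectorstores import Pinecone

    pinecone.init(
        api_key=os.environ["PINECONE_API_KEY"],
        environment=os.environ["PINECONE_ENVIRONMENT"],
    )

    # Wrap the existing sahih-ai index (created with 768 dimensions above)
    # in a LangChain vector store so we can call similarity_search on it.
    vectorstore = Pinecone.from_existing_index(
        index_name="sahih-ai",
        embedding=HuggingFaceEmbeddings(),
    )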
    

Chroma Setup

For Chroma, we can either use an in-memory database or run it as a server. We will go with the latter, since we are dealing with a large amount of data.

Install Chroma DB
  • Run pip install chromadb to install the Chroma DB client
  • Run git clone git@github.com:chroma-core/chroma.git to clone the Chroma DB server, which we will spin up in a Docker container
Install Docker

Install the latest version of Docker here. Click on the installation guide for your OS and follow the steps. Docker Desktop ships with the docker and docker compose CLI tools that allow us to spin up Docker containers.

Start the ChromaDB server

After Docker finishes installation:

  • cd into the chroma directory.
  • Edit the docker-compose.yml file and add ALLOW_RESET=TRUE under environment:
    ...
      command: uvicorn chromadb.app:app --reload --workers 1 --host 0.0.0.0 --port 8000 --log-config log_config.yml
      environment:
        - IS_PERSISTENT=TRUE
        - ALLOW_RESET=TRUE
      ports:
        - 8000:8000
    ...
    
  • Run the following command to start the Chroma server Docker container (a client connection sketch follows below):
    docker compose up -d --build
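
With the server running, the application can talk to it over HTTP. The sketch below shows one way to wire the dockerized Chroma server into a LangChain vector store; the collection name is an assumption for illustration:

    import chromadb
    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.vectorstores import Chroma

    # Connect to the Chroma server started by docker compose (port 8000 above)
    client = chromadb.HttpClient(host="localhost", port=8000)

    vectorstore = Chroma(
        client=client,
        collection_name="sahih-ai",
        embedding_function=HuggingFaceEmbeddings(),
    )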
    

Initialize Vector DB

Now that we have configured the vector database that we want to use, all we need to do is initialize it. Run python vectordb.py init to initialize the vector DB. The script will take a while to finish due to the amount of data. After the script is done, the data will be populated, and our server will be ready to serve query requests.
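
Under the hood, an init step like this boils down to reading the hadith dataset, turning each row into a text plus its metadata, and pushing the records into the vector store in batches. The sketch below is only an outline of that idea; the file path, column names, and batch size are assumptions, not the repository’s actual values:

    import pandas as pd

    df = pd.read_csv("data/hadiths.csv")  # hypothetical path to the Kaggle dataset

    texts = df["text_en"].tolist()  # hypothetical English-text column
    metadatas = [
        {"arabic": row["text_ar"], "source": row["source"]}  # hypothetical columns
        for _, row in df.iterrows()
    ]

    # `vectorstore` is the Pinecone or Chroma store from the setup above.
    batch_size = 100
    for i in range(0, len(texts), batch_size):
        vectorstore.add_texts(
            texts=texts[i : i + batch_size],
            metadatas=metadatas[i : i + batch_size],
        )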


Usage

Run python main.py to start the server in development mode. You can test it by sending a request to localhost:8000/hello and verifying that the server responds correctly.
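
For example, a quick check with the requests library (assuming the development server listens on port 8000, as above):

    import requests

    response = requests.get("http://localhost:8000/hello")
    print(response.status_code, response.text)  # expect a 200 response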

For a production deployment, run export PRODUCTION=True && gunicorn -k uvicorn.workers.UvicornWorker main:app to start a more robust, production-grade web server.

If you want to delete the data in your chosen vector database, run python vectordb.py reset. Be careful, as you cannot undo this operation.


Data Sources

  • Hadith dataset: https://www.kaggle.com/datasets/fahd09/hadith-dataset
  • Hadith narrators: https://www.kaggle.com/datasets/fahd09/hadith-narrators
  • Quran dataset: https://www.kaggle.com/datasets/imrankhan197/the-quran-dataset