Custom ChatGPT on private data

With the rise of Large Language Models (LLMs) like ChatGPT and GPT-4, many are questioning whether it’s possible to train a private ChatGPT with their corporate data. Is this feasible? Can such language models offer these capabilities?
As we have done for some customers before, we will explain the architecture and data requirements step by step that you need to create your own Q&A engine with ChatGPT/LLMs, leveraging your own data.
The Drawbacks of Fine-Tuning an LLM with your own data
In the world of natural language processing, fine-tuning large language models (LLMs) with custom data has been a promising avenue for enhancing the capabilities of pretrained models. However, this approach comes with its own set of challenges and limitations.
Some common drawbacks when fine-tuning an LLM:
- Factual correctness and traceability: where does the answer come from?
- Access control: it is impossible to limit certain documents to specific users or groups
- Costs: new documents require retraining of the model, and you pay for model hosting
- Knowledge cutoff: the model only knows what it was trained on (recently updated to September 2023!)
This makes it extremely hard, close to impossible here in our little Belgium, to use fine-tuning for the purpose of Question Answering (QA). How can we overcome these limitations and still benefit from these LLMs?
Separate your knowledge from your language model
To ensure that users receive accurate answers, we need to separate our language model from our knowledge base. This allows us to leverage the semantic understanding of our language model while also providing our users with the most relevant information. All of this happens in real-time, and no model training is required.
It might seem like a good idea to feed all documents to the model at run-time, but this isn’t feasible due to the token limit of the model; on top of that, it would be too expensive and take too long.

The approach would be as follows (and is visualized below):
- User asks a question: the process begins when a user submits a question, such as “What is the lease budget for a level 59 employee?” to the application.
- Application finds relevant text: the application’s search algorithm scans through the available documents and data sources to identify the most relevant text or document that is likely to contain the answer to the user’s question. In this case, it might locate an employee compensation document.
- Concise prompt sent to LLM: once the relevant text or document is identified, a concise prompt is generated. For our example question, it might look like this: “What is the lease budget for a level 59 employee?” This prompt is paired with the relevant text that pertains to employee compensation and is sent to the Language Model (LLM).
- LLM processes the prompt: the LLM, which could be a model like GPT-4, processes the prompt and the relevant text, seeking an answer. It utilizes its knowledge and understanding of language to provide a response.
- User receives an answer or a ‘No answer found’ response: there are two possible outcomes:
  - If the LLM successfully generates an answer based on the prompt and the relevant text, the application presents the answer to the user. For our example, it might respond with, “The lease budget for a level 59 employee is $50,000.”
  - If the LLM is unable to find a suitable answer within the provided text, the application responds with a ‘No answer found’ message, indicating that it couldn’t locate relevant information for the user’s query.
This step-by-step process ensures that users can receive accurate and informative responses to their questions while also accounting for scenarios where the LLM may not have access to the required information.
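The same flow fits in a few lines of code. Below is a minimal sketch using the openai Python package (v1+); `search_relevant_chunks` is a hypothetical placeholder for whatever retrieval backend you end up using, and the model name and prompt wording are illustrative only.

```python
from openai import OpenAI  # openai>=1.0

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def search_relevant_chunks(question: str) -> list[str]:
    """Placeholder: return the document chunks most relevant to the question."""
    raise NotImplementedError  # implemented later with Cognitive Search or embeddings


def answer(question: str) -> str:
    # Step 2: find the relevant text for this question.
    chunks = search_relevant_chunks(question)
    context = "\n\n".join(chunks)
    # Steps 3-4: send a concise prompt plus the relevant text to the LLM.
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic answers for Q&A
        messages=[
            {"role": "system", "content": (
                "Answer the question using only the sources below. "
                "If the answer is not in the sources, reply 'No answer found'."
            )},
            {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {question}"},
        ],
    )
    # Step 5: return the answer (or the 'No answer found' response) to the user.
    return response.choices[0].message.content


# answer("What is the lease budget for a level 59 employee?")
```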

Now that we understand the high-level architecture required to start building such a scenario, it is time to dive into the technicalities. If you want to know straight away how you can validate whether it is possible in your business, scroll down!
Retrieve the most relevant data
Context is key. To ensure the language model has the right information to work with, we need to build a knowledge base that can be used to find the most relevant documents through semantic search. This will enable us to provide the language model with the right context, allowing it to generate the right answer.
Chunk and split your data
To stay within token limits, we must segment our documents into smaller parts; the sections relevant to answering a question may also be spread across several documents. Begin by dividing the documents per page or by using a token-based text splitter. Once the documents are in this more accessible format, build a searchable index for user queries. Enrich the index with metadata such as the source and page numbers for source linkage, plus additional fields for access control and filtering.
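Below is a minimal sketch of such a token-based splitter with metadata, assuming OpenAI’s tiktoken tokenizer is available; the chunk size and the metadata fields (including the access-control field) are illustrative assumptions.

```python
import tiktoken  # assumption: tiktoken is installed for token counting


def chunk_page(text: str, source: str, page: int, max_tokens: int = 500) -> list[dict]:
    """Split one page into token-limited chunks and attach metadata for the index."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    for start in range(0, len(tokens), max_tokens):
        chunks.append({
            "content": enc.decode(tokens[start:start + max_tokens]),
            "source": source,   # used later for citations/footnotes
            "page": page,       # used for source linkage
            "group": "hr",      # illustrative field for access control / filtering
        })
    return chunks
```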
Improve relevancy with different chunking strategies
To be able to find the most relevant information, it is important that we understand your data and the potential user queries. What kind of data do we need to answer the question? The answers to these questions will help us decide how to split the data.
Some common patterns that we use:
- Use a sliding window: chunking per page or per token can have the unwanted effect of losing context. Overlapping content between consecutive chunks increases the chance that the most relevant information ends up together in one chunk (see the sketch after this list).
- Provide more context: a very structured document with sections nested multiple levels deep (e.g. section 1.3.3.7) can benefit from extra context such as the chapter and section titles. You can parse these sections and prepend that context to every chunk.
- Summarization: create chunks that contain a summary of a larger document section. This captures the most essential text and brings it all together in one chunk.
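A minimal sketch of the sliding-window idea, reusing the tiktoken encoding from the splitter above; the window size and overlap are illustrative assumptions.

```python
import tiktoken


def sliding_window_chunks(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Overlapping chunks: each chunk shares `overlap` tokens with the previous one."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = size - overlap
    chunks = []
    for start in range(0, max(len(tokens), 1), step):
        chunks.append(enc.decode(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks
```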
Let’s search!
Now that our data is prepared, we need to find the right documents when needed. When faced with the choice of building a semantic search index for a customer, we explored two primary options:
Option 1: Utilizing a Search Product
Our first recommendation for the customer was to consider the ease and efficiency of leveraging an existing Search as a Service platform. We proposed using platforms like Cognitive Search available on Azure, which offers a managed document ingestion pipeline and harnesses the power of language models from Bing. This option allows for quick implementation and reliable results. It can work with both vector search and keyword search, or combine them.
Option 2: Implementing Custom Semantic Search with Embeddings
For customers who prioritize the latest semantic models and desire more control over their search index, we suggested the implementation of custom semantic search using embeddings. We explained that embeddings are essentially lists of floating-point numbers, and the proximity between these vectors measures their relatedness. By using text embedding models from OpenAI, the customer can achieve cutting-edge semantic search capabilities. However, this option comes with the requirement to precompute and store embeddings for all document sections.
There are various ways of storing these embeddings, including managed options like Azure Cache for Redis (RediSearch), Pinecone, and Delta table search (in Databricks preview), as well as open-source alternatives like FAISS (Facebook AI Similarity Search) and Weaviate. During the application’s runtime, the user’s question is converted into an embedding as well, which enables comparing the cosine similarity between the question’s embedding and the previously generated document embeddings.
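A minimal sketch of this custom approach, assuming the openai Python package and a simple in-memory store; the embedding model name, the example chunks, and the top-k value are illustrative, and in practice the vectors would live in one of the stores mentioned above.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# Illustrative chunks; in a real system these come from the splitter above.
chunks = [
    {"content": "Level 59 employees have a lease budget of ...", "source": "comp-plan.pdf"},
    {"content": "The healthcare plan covers ...", "source": "handbook.pdf"},
]


def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Precompute once and store alongside each chunk.
chunk_embeddings = [(chunk, embed(chunk["content"])) for chunk in chunks]


def top_k(question: str, k: int = 3) -> list[dict]:
    """Rank all chunks by cosine similarity to the question embedding."""
    q_emb = embed(question)
    ranked = sorted(chunk_embeddings,
                    key=lambda pair: cosine_similarity(q_emb, pair[1]),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```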
Based on the goals and needs, we opted for Hybrid Search, using Azure Cognitive Search. You can find the complete architecture in the visual below.
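For reference, a hybrid query combining keyword and vector search could look roughly like this with the azure-search-documents Python SDK (version 11.4+ assumed); the service endpoint, index name, and field names such as contentVector are assumptions about your setup.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",  # assumption
    index_name="knowledge-index",                                  # assumption
    credential=AzureKeyCredential("<search-api-key>"),
)


def hybrid_search(question: str, question_embedding: list[float], k: int = 5) -> list[dict]:
    """Combine keyword search (search_text) with vector search over the same index."""
    results = search_client.search(
        search_text=question,  # keyword part of the hybrid query
        vector_queries=[VectorizedQuery(
            vector=question_embedding,
            k_nearest_neighbors=k,
            fields="contentVector",  # assumed name of the vector field in the index
        )],
        select=["content", "source", "page"],
        top=k,
    )
    return [dict(r) for r in results]
```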

Last but not least: concise prompt to avoid hallucination
Your ChatGPT implementation heavily relies on the prompt to ensure accurate responses and prevent undesired output. Prompt engineering, often considered a distinct skill, is a critical aspect of this process. We acknowledge that it’s a comprehensive field, and here, we highlight the key points you need to consider.
In prompts, it’s essential to convey specific instructions to the model. Make it clear that the model should provide concise answers solely based on the context provided. If the model can’t generate a valid response, it should provide a predefined ‘no answer’ response. Additionally, it’s vital to include citations, typically as footnotes, to the original documents. This allows users to verify the factual accuracy by referring to the sources.
One-shot learning, showing the model a single worked example, further enhances responses. During runtime, a placeholder like {q} is populated with the user’s question, and {retrieved} contains the relevant sections from your knowledge base.
Don’t forget to configure your parameters for temperature according to your desired response style: lower values for more deterministic responses, and higher values for more creative and unexpected ones.
Here’s an example of such a prompt:
“You are an intelligent assistant helping Contoso Inc employees with their healthcare plan questions and employee handbook questions. Please use ‘you’ to refer to the individual asking the questions, even if they ask with ‘I.’ Answer the following question using only the data provided in the sources below. When presenting tabular information, format it as an HTML table, avoiding markdown. Each source should be clearly identified with a source name followed by a colon and the relevant information. If the answer cannot be found in the sources below, respond with ‘I don’t know.’”
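Putting the pieces together, a call that combines this system prompt with a one-shot example, the retrieved sources ({retrieved}), the user’s question ({q}), and a temperature setting could look roughly like this; the model name, the one-shot content, and the source names are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are an intelligent assistant helping Contoso Inc employees with their "
    "healthcare plan questions and employee handbook questions. Answer the question "
    "using only the data provided in the sources below. Each source has a name "
    "followed by a colon and the relevant information; cite the source name for each "
    "fact you use. If the answer cannot be found in the sources, say 'I don't know.'"
)

# One-shot example shown to the model before the real question.
ONE_SHOT_USER = (
    "Sources:\nhandbook.pdf: Employees get 20 vacation days per year.\n\n"
    "Question: How many vacation days do I get?"
)
ONE_SHOT_ASSISTANT = "You get 20 vacation days per year [handbook.pdf]."


def ask(q: str, retrieved: str) -> str:
    """q = user question, retrieved = relevant sections from the knowledge base."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0.1,  # low temperature for deterministic, factual answers
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ONE_SHOT_USER},
            {"role": "assistant", "content": ONE_SHOT_ASSISTANT},
            {"role": "user", "content": f"Sources:\n{retrieved}\n\nQuestion: {q}"},
        ],
    )
    return response.choices[0].message.content
```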
Can I use it in my organization?
Frequently, we encounter queries about an organization’s readiness to deploy an internal ChatGPT or whether it can effectively address their specific issues. The check is straightforward: begin by assessing whether the standard ChatGPT can answer the question when provided with the appropriate context. This corresponds to step 3 in our previously outlined architecture.
For instance, suppose you have the status info about an order in a table, and you receive the question: “When will my order be delivered?” Paste the info and the question together into ChatGPT and see if it is able to answer correctly. If so, the problem is solvable, of course with the right expertise.
Conclusion
In summary, placing complete reliance on a language model for the generation of factual content is a flawed approach. Fine-tuning the model doesn’t rectify this issue, as it doesn’t equip the model with new knowledge and lacks a mechanism for response verification. To construct a robust Question & Answer (Q&A) system using a Large Language Model (LLM), it’s advisable to decouple your knowledge base from the LLM. Generate answers exclusively within the confines of the provided context, thereby enhancing the system’s accuracy and reliability.