The Generative AI Revolution
Those of us who tried OpenAI’s GPT-3 API when it was launched in 2020 knew that something fundamental and transformative had happened. A human and a machine had a conversation. As ChatGPT and GPT-4 came along, that initial impression turned into a mixed feeling: hope, as countless positive outcomes became possible, combined with the realization that knowledge work is eventually going to be outsourced to machines.
In my opinion, this takeover is going to take five to ten years. Industries that are more tolerant of errors will be the first to automate knowledge work, while highly sensitive industries will always keep humans in the loop. But one thing is definite: AI will greatly increase the productivity of existing workers. Companies will benefit from this revolution since they will be able to achieve more and deliver better products and services with lower cost and shorter time to market. For the next ten years, AI assistants will work alongside their human counterparts, helping them solve problems and take on new challenges before full automation arrives.
Benefits of AI Assistants in Business
- Onboard new team members more rapidly and efficiently. New team members can interact with the AI daily and independently understand the system and business with the AI’s assistance. It will answer their questions, provide guidance, and suggest resources to learn from.
- Developers will be able to write and refactor code faster and get code review feedback from AI. The AI can help developers write clean, efficient, and bug-free code by providing suggestions, corrections, and best practices. It will also review the code and give feedback on style, performance, security, and compatibility.
- Developers will be able to write unit tests with the AI’s help. It will generate test cases, inputs, and outputs based on the code and the specifications, checking the test coverage and quality to report any issues or gaps.
- QA will be able to write automated tests using AI. The AI will help QA engineers write robust and reliable automated tests by generating test scripts, scenarios, and data. It will also run the tests, analyze the results, and report any failures or defects.
- Support will be able to interact with AI to ask triage questions and try to resolve customer issues. The AI will help Support staff diagnose and troubleshoot customer problems by asking relevant questions, accessing the logs and the database, and suggesting possible solutions or workarounds.
- Everyone can use the assistant to write stories, tickets, and emails. The AI will help the team communicate effectively and clearly by writing concise and accurate stories, tickets, and emails. It will also check the grammar, spelling, and tone of the messages and offer improvements.
How to Evaluate and Compare AI Assistants?
Chatbots have been around for a long time, but their power and applications have been fairly limited. The new generative and multimodal AI revolution has significantly improved their potential. There is going to be a flood of AI assistants saturating the market, but not all of them will be of equal quality.
Here are some desirable qualities that should be considered in an AI assistant.
- Security and privacy: Conversations with AI assistants will become more personal and sensitive than ever. The cost of conversation leaks can be quite high.
- Fresh and up-to-date knowledge base: A poor knowledge base means poor and incorrect information provided to the users who will trust the AI assistant and might not fact-check the information. This can lead to productivity loss and even worse outcomes. Dealing with frequently changing data and source code may require live querying and frequent indexing.
- Scope Definition: AI Assistants should have defined boundaries for conversations. Users shouldn’t be able to chat on any topic. For example, a user shouldn’t be able to use a company’s AI assistant for personal legal advice. It is also desirable to minimize jailbreaks.
- Cost Efficiency: An AI assistant should be frugal with resources and require minimal operational upkeep or maintenance.
- Modularity: This is not a required property, but it is beneficial for an AI Assistant not to depend heavily on any particular LLM. LLMs and multimodal models will iterate every year, and it is useful to be able to swap one model for another without significant changes.
- Capabilities: The following capabilities are desirable. Some are important, others are optional.
- Understanding of the Real World: An AI Assistant can’t be very helpful if it can’t converse with the user at the same level of understanding of the world as the user.
- Logical Problem-Solving Ability: To be useful, an AI Assistant must have at least a basic level of logical problem-solving ability.
- Multimodal Support: The model may be capable of understanding images or audio natively, without a separate OCR (Optical Character Recognition) step.
- Execution Environment/Workspace: Does the AI Assistant have access to an execution environment where it can process data files with its generated code? Also, how does it generate the output of those processes and provide it to the user? This capability is helpful since sometimes the user wants to use the assistant to operate on large files and direct the assistant using a multi-step interaction while the data files stay in the workspace accessible to the assistant. The assistant can generate code and execute it within the workspace to be able to create intermediate files.
- Code Verification: Is the assistant capable of verifying code compilation and execution before emitting the code to the user? In the future, it is conceivable that AI assistants will be able to see users’ screens in real time or have access to a video feed from a camera.
- Long-term Context: Does the AI Assistant have some long-term context and memory across chat sessions?
Introduction to Large Language Models (LLM)
A Large Language Model (LLM) is a type of deep learning artificial neural network with a vast number of parameters, also known as neural connections. An LLM is trained on extensive text data collected from various sources, including the internet, books, forums, and journals. The training data can also include source code and synthetic data generated by other models. Furthermore, LLMs are trained on a variety of human languages and computer programming languages. This way they can not only understand languages but also translate between them, including translating human languages into computer programs.
Examples of some recent LLM models include OpenAI’s GPT-4 and Google’s Bard and Gemini. There are several other models, some of which are open-source.
More recently, multimodal models are being developed that can natively understand not only text but also other modalities such as images and audio (a standout being Google’s Gemini).
LLM Context Window
An LLM’s neural network can only see a certain length of text and use it as an input. That length of text is called its context window. It is typically measured in the number of tokens where each token is a few characters of text. The entire context window is processed through an inference process and an output token is produced which is basically the next most likely token based on the input sequence. The LLM algorithm now concatenates this predicted token to the original context window and then treats it as input to the neural network again to predict the next token. This process keeps repeating until the algorithm thinks that no more output needs to be generated. This is a very compute-intensive operation and usually makes up much of the cost.
Newer models like GPT-4 Turbo have a context window of 128,000 tokens and can generate an output of about 4,000 tokens. Each token is roughly four characters of text on average. This means that GPT-4 can keep a few hundred pages’ worth of information in its “head” while it is trying to generate output. Currently, the cost of GPT-4 is around $0.01 per 1,000 tokens of input (though it varies across versions), which means that if you provide 100,000 tokens to generate one inference, you will pay $1 for that inference alone. It is easy to see that the power of LLMs comes at a cost. Context window management and efficient utilization are the keys to achieving the optimal balance between power and cost.
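As a rough illustration, the back-of-the-envelope math above can be captured in a few lines of Python. The prices below are the illustrative figures from this section (the output price is an assumption), not current list prices, and the 4-characters-per-token rule is only an approximation.

```python
# Minimal sketch of estimating per-inference cost from token counts.
# Prices are illustrative assumptions, not current provider pricing.

def estimate_tokens(text: str) -> int:
    """Rough token count, assuming ~4 characters per token."""
    return max(1, len(text) // 4)

def estimate_inference_cost(input_tokens: int, output_tokens: int,
                            input_price_per_1k: float = 0.01,
                            output_price_per_1k: float = 0.03) -> float:
    """Approximate USD cost of a single inference."""
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

# A 100,000-token prompt with a 1,000-token answer costs about $1.03.
print(estimate_inference_cost(100_000, 1_000))
```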
System Prompt
The system prompt is a text string that is put within the context window of an LLM. It is clearly delineated to be separate from the regular messages which are put in the context window. LLMs are trained to pay close attention to the system prompt part of the context window. The system prompt is always added to the context window for every inference request to the LLM.
Having a system prompt allows the organization or person deploying the AI Assistant to define its persona and set some ground rules and boundaries for it to follow. Any useful domain knowledge or additional context can also be put in the system prompt. There are several best practices and prompt engineering techniques for system prompts that allow the AI assistant to be personalized and perform better. A poor system prompt will give you meager results whereas an excellent prompt will allow the same underlying LLM to produce excellent output. A system prompt shouldn’t be more than a few thousand tokens since it will add static cost to every inference.
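As a concrete illustration, here is a minimal sketch of how a system prompt is sent alongside the user’s message on every request, using the OpenAI Python client as one example. The model name, prompt text, and persona are illustrative assumptions, not a recommendation.

```python
# Minimal sketch: the system prompt is included with every inference request.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are Acme Corp's internal engineering assistant. "
    "Only answer questions about Acme's products, code, and processes. "
    "If you are unsure, say so and cite the source document."
)

response = client.chat.completions.create(
    model="gpt-4-turbo",  # assumed model name
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # sent on every request
        {"role": "user", "content": "How do I run the nightly build?"},
    ],
)
print(response.choices[0].message.content)
```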
How Does Chat Work Using LLMs?
How does an LLM, which predicts one token at a time, hold a conversation? The entire conversation history, along with information about each speaker, is stored in the context window of the LLM and passed to it with every message. So, when the LLM is predicting one token at a time, it is generating the likely response given that particular conversation history. This “one token at a time” prediction is repeated until the model decides that the response is complete. See the context window section above for details.
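The following sketch shows this idea in code: because the model is stateless, the full history (system prompt plus all prior turns) is re-sent on every turn. The OpenAI client and model name are used purely as an example.

```python
# Minimal sketch of a chat loop: the whole conversation history is passed
# to the model with every message.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user_message = input("You: ")
    if not user_message:
        break
    history.append({"role": "user", "content": user_message})

    # The entire history goes into the context window for this one inference.
    reply = client.chat.completions.create(model="gpt-4-turbo", messages=history)
    assistant_message = reply.choices[0].message.content
    history.append({"role": "assistant", "content": assistant_message})
    print("Assistant:", assistant_message)
```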
Overcoming the Limitations of LLMs
- Hallucinations: Hallucinations can be reduced by grounding the model, i.e., augmenting its context window with real-world data, facts, knowledge, and documentation. Output validation and fact-checking can also be performed using the same or a different model. This results in a significant reduction of hallucinations.
- Limited Context Window: New models have a context window of about half a million characters, which is reasonably large for many applications. We expect the next generation of LLMs to have even bigger context windows. The good news is that we can use LLMs themselves to distill and rank the important information so that it fits within the limited context window (see the sketch after this list).
- Limited Logical Problem-Solving Abilities: Even though current models aren’t great at logical problem solving, their performance can be significantly improved by using prompting techniques such as chain of thought, and by generating multiple candidate solutions to a problem and then using an LLM to pick the best one.
- Limited Modalities: LLMs such as OpenAI’s GPT-4 with Vision and Google’s Gemini are exploring multimodal model development and showing promising results. These models can already understand images such as diagrams, documents, and screenshots. AI assistants built on these multimodal models can help the user with tasks that are beyond the capabilities of text-only models.
- Static Models and No On-the-fly Learning: Today’s models are not capable of on-the-fly learning, but this behavior can be approximated by accumulating newly learned knowledge in the context window. Context window size limits constrain this ability significantly; as context sizes increase, the technique can be applied more freely. In the long run, however, a breakthrough in on-the-fly learning ability is needed.
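Here is a minimal sketch of the distillation idea from the “Limited Context Window” item above: when source material is too large for the context window, the LLM itself can compress each piece first and only the summaries are kept. `llm_complete` is a hypothetical helper standing in for your LLM provider’s API, and the token budget is an illustrative assumption.

```python
# Minimal sketch: distill long documents so they fit in a limited context window.

def llm_complete(prompt: str) -> str:
    raise NotImplementedError("wrap your LLM provider's API here")

def distill_for_context(documents: list[str], question: str,
                        budget_tokens: int = 8_000) -> str:
    """Summarize each document with respect to the question, then keep
    adding summaries until the (rough) token budget is reached."""
    summaries = [
        llm_complete(f"Summarize the parts of this text relevant to: "
                     f"{question}\n\n{doc}")
        for doc in documents
    ]
    context, used = [], 0
    for summary in summaries:
        tokens = len(summary) // 4  # ~4 characters per token
        if used + tokens > budget_tokens:
            break
        context.append(summary)
        used += tokens
    return "\n\n".join(context)
```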
Introduction to Retrieval Augmented Generation (RAG)
RAG and Grounding
Retrieval Augmented Generation (RAG) is a technique for retrieving relevant information or knowledge from one or more data sources and bringing it within the context window of the LLM. This allows the LLM to use that knowledge to generate useful responses. The technique is especially effective for data that changes frequently or could not be made part of the LLM’s training dataset.
Grounding is a more general term where an LLM is instructed to respond based on one or more datasets. That data can be static or dynamic. Sometimes, the terms Grounding and RAG are used interchangeably. Grounding and RAG techniques allow the model to cite sources of its responses. This allows the user to fact-check the response if they feel like it, but at minimum it significantly increases the user’s confidence in the response from the model.
Grounding and RAG allow the model to respond based on facts rather than its intuition alone. This can significantly reduce hallucinations, where the LLM makes confusing or incorrect statements.
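To make the flow concrete, here is a minimal RAG sketch: retrieve the most relevant chunks for the user’s question, place them in the prompt with citation numbers, and ask the model to answer from those sources only. `search_index` and `llm_complete` are hypothetical helpers standing in for your search service and LLM API.

```python
# Minimal sketch of Retrieval Augmented Generation with source citations.

def search_index(query: str, top_k: int = 5) -> list[str]:
    raise NotImplementedError("query your lexical/semantic index here")

def llm_complete(prompt: str) -> str:
    raise NotImplementedError("call your LLM provider here")

def answer_with_rag(question: str) -> str:
    chunks = search_index(question, top_k=5)
    sources = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    prompt = (
        "Answer the question using only the sources below. "
        "Cite sources by number so the user can fact-check the answer.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
    return llm_complete(prompt)
```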
Fine-Tuned LLMs
Fine-tuning a model means providing it with more examples of input text and desired outputs in training mode. The model then adjusts some of the weights in its neural network to produce the desired outputs. Through fine-tuning, the model can acquire new knowledge, reinforce certain traits in its responses, or steer away from undesired responses.
However, once fine-tuned, the new model needs to be deployed as a dedicated instance to be used; unlike foundation models, its cost can’t be shared by multiple tenants. Also, this training can’t be easily undone, so any knowledge that changes with time shouldn’t be taught to the model. For example, you shouldn’t try to teach a model the exact architecture of your particular system, since the architecture will evolve over time. It is fine to teach it domain-specific design criteria or best practices. If you really want to give it knowledge of transient things that can change with time, it is best to use the context window for that, via either the system prompt or RAG (Retrieval Augmented Generation).
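For illustration, here is a minimal sketch of what fine-tuning training data can look like, using the OpenAI chat fine-tuning JSONL format as one example. The domain content is made up; note that it teaches a stable convention rather than transient system details, in line with the guidance above.

```python
# Minimal sketch: writing fine-tuning examples to a JSONL file.
import json

examples = [
    {
        "messages": [
            {"role": "system", "content": "You are Acme's engineering assistant."},
            {"role": "user", "content": "How should I name a new database migration?"},
            {"role": "assistant", "content": "Use the <date>_<ticket>_<summary> convention, "
                                             "e.g. 20240115_ACME-123_add_orders_index."},
        ]
    },
    # ... more examples of the desired style and domain conventions
]

with open("training_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```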
RAG vs Fine-Tuned LLMs
System prompt and RAG both utilize the context window which the model has to process with every chat message. This incurs token processing costs. Even though models are getting much cheaper, this cost can add up for a large context window. LLMs have to reprocess the entire context window for responding.
As I explained earlier, fine-tuning a model can take some of the subject matter expertise and let the model incorporate that expertise into its weights instead of retrieving the knowledge into its context window. A fine-tuned model may not need as long a context window to achieve the same performance, but striking the right balance between the two approaches is more of an art.
RAG and Fine-tuning have their areas where they shine. They can be used together to achieve optimal performance and cost efficiency. However, just using a powerful foundation model such as GPT-4 combined with RAG can still result in a very useful domain-specific AI assistant. Its delivered value can still be higher than its cost for many industries and domains.
RAG Data Sources
- Documents such as Word, PowerPoint, PDF, TXT, HTML, etc.
- Data such as Excel, CSV, JSON, XML, etc.
- Source code such as C#, JavaScript, HTML, CSS, SQL, config files, etc.
- Database schema and stored procedures
- Meeting transcripts
- Chat history
- Company website dump
- Product guides such as user guides, installation guides, etc.
- Reference docs such as relevant standards and specifications
- Live data querying using either APIs or databases.
- Synthetic data
Data Chunking
Chunking is a technique where you divide long text, documents, or source code into smaller files or text pieces. It has multiple benefits, such as the following.
- It reduces the portion of a source document that needs to be brought into the context, so that only the most relevant portions of a longer document are considered by the LLM. This reduces the LLM’s token processing cost and potentially increases quality, since the LLM’s processing and attention are focused on the relevant portions.
- Semantic search in particular works better on smaller chunks, since a high semantic match score on a small chunk is more focused; in a longer document, meanings and semantics get diluted.
- It allows the ranking algorithm to do a better job of bringing the most relevant information to the top since in RAG you don’t want more than a handful of sources included in the context.
Data Chunking Challenges and Trade-offs
Chunking may sound trivial, but it isn’t. Blindly chunking without any regard for the type of document causes a lot of issues. You can’t just break a sentence midway and put half in one chunk and the other half in a different chunk. Similarly, it is not a good idea to split paragraphs, slide decks, or programming language functions and classes across multiple chunks. Techniques such as overlapping text between chunks can help with certain kinds of documents, but not all. It is a combination of art and science: there are trade-offs involved, and experimentation is needed to fine-tune an approach for each kind of data source.
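Here is a minimal sketch of one such approach, assuming plain prose split on blank lines: chunks are built from whole paragraphs (never mid-sentence), and a small paragraph overlap is carried between chunks. The size parameters are illustrative and would need tuning per data source.

```python
# Minimal sketch of paragraph-aware chunking with overlap.

def chunk_document(text: str, max_chars: int = 2_000,
                   overlap_paragraphs: int = 1) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, size = [], [], 0
    for para in paragraphs:
        if len(current) > overlap_paragraphs and size + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the tail of the previous chunk forward as overlap.
            current = current[-overlap_paragraphs:] if overlap_paragraphs else []
            size = sum(len(p) for p in current)
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```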
Lexical Search
Lexical search is traditionally known as keyword search, but various techniques are employed to handle variations such as singular and plural nouns in queries and source data, hyphenated terms, quoted phrases, numbers, etc.
Lexical search has been around for a long time and it is useful for searching for specific information such as serial numbers, model numbers, proper nouns, etc.
Semantic Search
Semantic search finds relevant results based on the meanings and topics discussed in the data. It has the advantage of finding data even when the search query doesn’t contain the exact keywords present in it. For example, suppose a passage in a document describes a customer experiencing random reboots on their device. If a user of the AI assistant searches for periodic crashes, semantic search will still find the passage about random reboots, because in the real world a reboot and a crash are related phenomena.
This is quite powerful, since users often have only a vague idea of what they are asking and rarely know the exact words used in the document. Another use case of semantic search is finding relevant code sections across an entire source code repository. If you inquire about an undocumented feature, the AI assistant will still be able to find the relevant code and understand it if it helps answer the question. This is impossible to do with the lexical search approach.
Semantic search doesn’t perform particularly well when the user’s query is specifically asking about a particular entity such as a particular model number.
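The core mechanic is straightforward: embed the query and every chunk as vectors and rank by similarity. The sketch below assumes a hypothetical `embed` wrapper around whatever embedding model or service you use.

```python
# Minimal sketch of semantic search via embeddings and cosine similarity.
import math

def embed(text: str) -> list[float]:
    raise NotImplementedError("call your embedding model here")

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_search(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # "periodic crashes" can match a chunk about "random reboots"
    # even though no keywords overlap.
    return [chunk for _, chunk in scored[:top_k]]
```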
Hybrid Search
Lexical search is still used in RAG today, but it is often combined with the more modern semantic search approach so that, depending on the query, at least one approach will bring back relevant data. For certain queries, an AI system may look up the specifications of a model or product from the catalog and also use semantic search to find troubleshooting steps relevant to the problem.
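One common way to merge the two result lists is Reciprocal Rank Fusion (RRF); the sketch below shows that idea, with `lexical_search` and `semantic_search` as hypothetical helpers that each return document IDs in ranked order.

```python
# Minimal sketch of hybrid search via Reciprocal Rank Fusion (RRF).

def lexical_search(query: str) -> list[str]:
    raise NotImplementedError("query your keyword index here")

def semantic_search(query: str) -> list[str]:
    raise NotImplementedError("query your vector index here")

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query: str, top_k: int = 5) -> list[str]:
    lexical = lexical_search(query)    # exact model numbers, serials, names
    semantic = semantic_search(query)  # topical, meaning-based matches
    return reciprocal_rank_fusion([lexical, semantic])[:top_k]
```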
Ranking the Search Results and Semantic Ranking
Ranking is the process of assigning a rank score to each result and picking the best matches. The rank score can be based on the frequency and clustering of keywords in lexical search, or on the closeness of the semantic meaning of the query to the semantic meaning of the search results. It is also possible to use more advanced AI-based approaches to rank the results, as sketched below. Ranking is important, and compute spent on ranking usually pays off, since less relevant search results pollute the AI assistant’s context window and may negatively affect its performance.
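As one example of an AI-based reranking pass, the sketch below asks an LLM to score each candidate chunk’s relevance to the query and keeps only the best-scoring ones, so that less relevant text never reaches the assistant’s context window. `llm_complete` is a hypothetical LLM call and the prompt wording is an assumption.

```python
# Minimal sketch of LLM-based reranking of search results.

def llm_complete(prompt: str) -> str:
    raise NotImplementedError("call your LLM provider here")

def rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    scored = []
    for chunk in chunks:
        reply = llm_complete(
            "On a scale of 0 to 10, how relevant is this passage to the query?\n"
            f"Query: {query}\nPassage: {chunk}\nAnswer with a number only."
        )
        try:
            score = float(reply.strip())
        except ValueError:
            score = 0.0  # treat unparseable replies as irrelevant
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```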
Conclusion
Putting It All Together
Developing an AI Assistant that works for your organization is not trivial. It requires specialized knowledge and experience with LLMs, Search Services & Indexing, Architecture, Databases, Security, etc.
Self-Hosting an AI Assistant Solution
- Host the front-end chat UI in a web application.
- Host the back-end assistant orchestrator which brings together System Prompt, RAG and LLM functionality.
- Use a foundation LLM model or, optionally, host a fine-tuned model.
- Host a lexical and semantic search service including its indexing capabilities. Also, host a semantic vectorization model to allow semantic vectorization of chunks.
- Host a knowledge store where the documents are stored before chunking and indexing.
- Host a synchronization solution to synchronize and pre-process various data sources to the knowledge store such as for source code, SharePoint, etc.
- Host a database for storing conversation history for the users.
To keep the solution low-cost, manageable, secure, and easy to maintain, it is crucial for the organization to use serverless and managed services for the above and minimize any custom development.
Implementation Time
We have developed an AI Assistant solution based on tried and tested services and approaches. We can rapidly put together a solution within your environment that will follow best practices and will be easy to maintain, resulting in less total cost of ownership while providing you with the peace of mind of a self-hosted solution.
We can implement a working solution within a two-week time frame from start to finish. Our base package includes configuring and hosting the services described above within your Azure Subscription. It will utilize both the lexical and semantic search capabilities and will use RAG to allow the AI Assistant to use your domain-specific knowledge from source code and documents. We will further optimize your System Prompt to achieve better performance.
Cost of Our AI Assistant Services
We provide free consultation to determine if our solution is suitable for your needs.
We will charge $1500 USD for the basic AI Assistant setup described above. If you need further services, maintenance, or enhancements later then we can do that at an additional cost.
What is NOT Included In Our Base Package
We have the following limitations in our base package, but we can include these services on a case-by-case basis at an additional cost.
- No live querying of the web, any database, or API is supported.
- No voice recognition or text-to-speech is supported.
- No fine-tuning or fine-tuned LLM is included. Only the foundation model is utilized as part of the base package.
- Only GPT-4 with a 128K Token context window will be utilized in our base package.
- Only text modality is supported and no image generation or analysis will be supported.
- No code interpreter/workspace will be supported.
Cost to Self-host an AI Assistant
The cost of deploying a self-hosted AI Assistant can vary greatly depending on scalability, security, and high-availability requirements. But to get started with about 20 users, the total monthly cost is about $200 USD. It will go up from there as the number of users and consumption increases. For a large deployment, the cost can be around $500 per month. This cost is small compared to the amount of time it saves each employee. While it is possible to use external AI-Assistant-as-a-Service providers, they may not offer the same level of control or customizability as a self-hosted solution.
Results and Limitations
Based on our experience, AI Assistants can boost productivity significantly, but the results will depend on your domain, the quality of your documentation, and the knowledge store. One thing is certain: you can’t ignore this technology. Organizations that harness the power of AI will have an edge over their competitors.