The use cases for personal AI chatbots will continue to grow as freely available models become more powerful and the larger players (Google Bard, OpenAI Chat-GPT) continue to apply more restrictions to their platforms.
This post builds on our previous post detailing how to get started with the Mistral-7B-Instruct model using Python. We recommend reading that post first: How To Get Started With Mistral-7B-Instruct-v0.2 Tutorial
In this post, we will cover why you would want to create your own chatbot. Then we will explain the differences between the Mistral-7B and Mistral-7B-Instruct models and why one is better than the other for a chatbot use case. The code is then provided and explained in detail. Finally, we will cover additional details on how to record the chatbot’s conversational history and how to use it to build relevant context for more accurate and useful responses.
Why Create A Custom Chatbot?
The most popular AI chatbots today are OpenAI’s Chat-GPT and Google’s Bard/Gemini. While these systems are the most advanced in the field and are publicly available, they do have certain drawbacks.
Security
The biggest risk to users and companies that use public AI systems is the exposure of confidential information. The companies that provide free AI chatbots typically use the information entered by users to improve those systems and to train new ones. Users should always be aware of the sensitivity of what they input into these AI systems. A personal AI chatbot that runs on a local computer or a secure server allows companies and users to build automated systems without exposing sensitive information to third parties.
Creativity
The companies that provide these AI systems also limit how much customization can be achieved. This is necessary to ensure the AI models provide a consistent user experience. While Chat-GPT allows the user to adjust the 'temperature' (a variable controlling how creative the output is), Google's Bard has no such parameter. Thus, users are confined to the limits of creativity set by these companies. One advantage of having a personal chatbot is complete control over how 'truthful' or 'creative' the output is. Configuring a chatbot to produce more creative or less predictable output can be useful for brainstorming tasks, for example.
Content Moderation / Restrictions
Another drawback of publicly available chatbot systems is content moderation. As the user base of these systems grows, the companies are forced to apply more content controls and restrictions to reduce bias or misinformation. Unfortunately, this can impact the usefulness of these AIs when researching certain topics such as finance. If you ask Google’s Bard to provide a list of which stocks to invest in, it will refuse to provide an answer, as shown below:
While it may not be the best advice to follow, it would be interesting to know what Google’s Bard would recommend.
Additional Reasons To Have Your Own Chatbot
Some other advantages to having your own personal AI chatbot are:
- Unfiltered creativity and exploration: You could access unfiltered information, engage in taboo or controversial topics, and explore creative avenues without external restrictions. This could be useful for brainstorming, artistic expression, or simply satisfying intellectual curiosity.
- Personalized learning and assistance: The chatbot could learn your preferences and interests to a deeper level, tailoring responses and assistance to your unique needs and desires, potentially exceeding the limitations of pre-trained models.
- Enhanced productivity and problem-solving: By leveraging the chatbot’s ability to process information and generate ideas, you could potentially boost your productivity in various tasks, from writing and research to planning and problem-solving.
- Control and privacy: Keeping the chatbot on your personal computer grants you complete control over its development, data, and outputs, ensuring maximum privacy and avoiding the concerns of external biases or data breaches.
While some downsides are:
- Misinformation and harmful content: Without restrictions, the chatbot could generate misinformation, perpetuate harmful stereotypes, or even produce illegal or offensive content. You would be solely responsible for ensuring its reliability and ethical use.
- Bias and manipulation: Personal biases in your training data or interactions could lead the chatbot to generate biased or manipulative responses, potentially hindering your decision-making or affecting your interactions with others.
- Addiction and negative impact: Unrestricted access to potentially addictive or harmful content could pose risks to your mental health and well-being. You would need to exercise discipline and self-awareness to manage your interactions with the chatbot responsibly.
- Limited real-world applicability: Content generated without external validation or grounding in reality might not translate well to practical applications or interactions with the outside world.
Be sure to weigh these pros and cons when creating a personal chatbot for yourself or your company.
Difference Between Mistral-7B and Mistral-7B-Instruct Models
As mentioned in the post How To Get Started With Mistral-7B-Instruct-v0.2 Tutorial, the Mistral-7B-Instruct model was fine-tuned on an instruction/response format. This is essentially the same structure as a chat between two people, or between a chatbot and a user.
Because the Mistral-7B-Instruct model is better suited to follow this conversation structure, we will use that model for our chatbot.
Python Code
Before beginning, see the instructions in our previous article for ensuring the correct Python packages and drivers are installed on your system: How To Get Started With Mistral-7B-Instruct-v0.2 Tutorial
Here is the complete code. Feel free to copy and paste into your favorite Python editor.
We break down each section in detail and explain what it does below.
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
from datetime import datetime
import json
# Suppress warning messages
from transformers.utils import logging
logging.set_verbosity(40)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
# Program variables
filename = f"{datetime.now().strftime('%Y_%m_%d_%H_%M_%S')}.txt"
max_iterations = 10
conversation_history = list()
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
# Load model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32, device_map="cpu", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, padding_side="left")
streamer = TextStreamer(tokenizer, skip_prompt=True)
# Load conversational history from a previous context file
# context_filename = "./*.txt"
# with open(context_filename, 'r') as f:
#     data = json.load(f)
#     conversation_history = data
# Function to capture keyboard input and add to conversational history
def capture_input():
    input_text = input("User: ")
    conversation_history.append({"role": "user", "content": input_text})
    conversation_history.append({"role": "assistant", "content": ""})
    print("Assistant: ", end='')  # Prints without newline
# Start program by asking for initial input from user.
capture_input()
# Limit maximum iterations for conversation
for iteration in range(max_iterations):
    # Convert conversational history into chat template and tokenize
    inputs = tokenizer.apply_chat_template(conversation_history, return_tensors="pt", return_attention_mask=False)
    # Generate output
    generated_ids = model.generate(inputs,
                                   streamer=streamer,
                                   max_new_tokens=2048,
                                   do_sample=True,
                                   top_k=50,
                                   top_p=0.92,
                                   pad_token_id=tokenizer.eos_token_id
                                   )
    # Get complete output from model including input prompt
    output = tokenizer.batch_decode(generated_ids)[0]
    # Filter only new output information using '</s>' delimiter, then strip starting and trailing whitespace
    output_filtered = output.split('</s>')[-2].strip()
    # Update conversation history with the latest output
    conversation_history[-1]["content"] = output_filtered
    # Save entire conversation history to text file for debugging or use for loading conversational context
    with open(filename, 'w') as f:
        json.dump(conversation_history, f, ensure_ascii=False, indent=4)
    # Capture input before start of next iteration
    capture_input()
How The Code Works
In this code, we use the Mistral-7B-Instruct model and the HuggingFace.co libraries to handle the model download, configuration and execution.
Warning & Logging Output Suppression
The functions that operate the models have many layers of logging built in. By default, the logging is turned on and will fill up the terminal when running the models. While this has no effect on the output of the code, it does obscure the actual output from the model and makes it harder to read.
To keep the terminal output clean, we place the following code at the start of the script to suppress the logging outputs from the transformers and tensorflow libraries:
import os
# Suppress warning messages
from transformers.utils import logging
logging.set_verbosity(40)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
Variables
We define the variables next.
We create a filename using the datetime library – so each file has a unique name. This filename will be used to create a JSON text file later to store the conversational history developed while the chatbot is running.
The max_iterations variable controls the length of the conversation. We coded the chatbot to operate in a ‘for loop’ to limit the maximum number of conversational turns. Each iteration consists of a single user input followed by a response generated by the model. You may increase or decrease the number of iterations to what works best for you. Keep in mind that as the conversation history grows, it takes the model more time to process the input.
The conversation_history variable is defined as a list. This variable will hold the components that make up the conversation and thus the prompt that is submitted to the model. Later in the code, this variable is formatted into a JSON string and saved to a text file.
The model_id variable is passed to the functions that load the model and tokenizer.
# Program variables
filename = f"{datetime.now().strftime('%Y_%m_%d_%H_%M_%S')}.txt"
max_iterations = 10
conversation_history = list()
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
Loading & Configuring The Model
We use the transformers library to load the model, tokenizer and text streamer functions.
As is, this code will operate the model on the CPU. It is possible to configure the code to load the model to a GPU for faster processing. However, in our experience, the Mistral-7B models are too large to fit onto our 10GB RTX 3080 graphics card.
The last line of this code block sets up the “streamer” function. The streamer forces the model to output each word it generates directly to the terminal. This gives the impression of the chatbot responding in real-time. If the streamer functionality is not used, the user will have to wait until the model generates its complete response before it outputs to the terminal.
Set the parameter “skip_prompt=True” in the TextStreamer initializer to prevent the entire chat history from being printed to the terminal each time the model iterates.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
# Load model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32, device_map="cpu", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, padding_side="left")
streamer = TextStreamer(tokenizer, skip_prompt=True)
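If you would like to try the GPU anyway, one possible workaround is to quantize the model using the bitsandbytes integration in transformers. The sketch below is not part of the original tutorial and assumes the bitsandbytes and accelerate packages are installed; if you use it, remember to move the tokenized inputs to the model's device (for example, inputs.to(model.device)) before calling model.generate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# 4-bit quantization so the 7B model fits in a smaller amount of GPU memory
# (assumes: pip install bitsandbytes accelerate, and a CUDA-capable GPU)
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quant_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")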
Loading Conversational History From External Files
As will be explained below in more detail, this code is configured to both save to and load a conversation history from a text file. The first time this code is executed, the lines below should be commented out. However, after running the code for the first time, you will have a conversation saved to a file. You may load this file the next time the code is run by un-commenting the lines and ensuring the context_filename variable has the correct filepath to your conversation history file.
import json
# Load conversational history from a previous context file
# context_filename = "./*.txt"
# with open(context_filename, 'r') as f:
#     data = json.load(f)
#     conversation_history = data
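If you save a new history file on every run, it can be convenient to load the most recent one automatically. Here is a minimal sketch, assuming the history files use the timestamped .txt naming from this script and live in the working directory:
import glob
import json
import os
# Pick the most recently modified history file
# (assumption: only this script's history files use the .txt extension here)
history_files = glob.glob("./*.txt")
if history_files:
    context_filename = max(history_files, key=os.path.getmtime)
    with open(context_filename, 'r') as f:
        conversation_history = json.load(f)
    print(f"Loaded {len(conversation_history)} messages from {context_filename}")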
Building The Prompt Structure
The Mistral-7B-Instruct model requires a strict prompting format to ensure the model works at peak performance. According to the model’s page on HuggingFace.co, deviating from this format results in sub-optimal performance.
The prompt is structured as a list of dictionaries in Python. The dictionaries contain two fields: role and content. The values for role can be either “user” or “assistant”. The value for content will be the string containing the question or instruction.
The system builds a conversational history by alternating between the user and assistant. Here is an example of how the prompt is constructed:
[
    {"role": "user", "content": "Write a poem about a dog named Chance."},
    {"role": "assistant", "content": "There was once a dog named Chance. He liked to dance."},
    {"role": "user", "content": "That was great! Now write a poem about a cat named Homer."},
    {"role": "assistant", "content": ""}
]
The model takes the place of the assistant while we take the place of the user.
The conversation is constructed so the last dictionary in the list is an assistant role with an empty string for content. When this conversation is converted to a prompt and fed into the model, the model responds by generating the text in place of the empty content string for the assistant.
In the code above, the conversation_history variable is the list that holds the dictionaries in chronological order – alternating between user and assistant roles.
Zero-Shot, One-Shot or Few-Shot Prompts
The most basic prompt is called a zero-shot prompt. It simply includes a user role with an instruction and an assistant role with an empty string. When run through the model, the model responds based solely on its training to generate the text for the assistant’s content.
Sometimes, it is helpful to provide examples to the model demonstrating how it should respond to user content/instructions. Providing one example of how the model should respond is called one-shot prompting, while providing more than one example is called few-shot prompting.
While providing more examples, in theory, allows the model to better follow your instructions and respond as you wish, the model may also find unintentional patterns and act on those. For example, if you provide three examples and only the third one ends in a period, the model may learn to end every third response with a period.
We suggest that you experiment with different prompts and read up on some prompting techniques to improve your results.
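As an illustration (this is not part of the chatbot code, and the summaries below are made up), a few-shot conversation_history might seed two worked examples before the real request:
# Few-shot prompt: two made-up worked examples followed by the real request
conversation_history = [
    {"role": "user", "content": "Summarize: The cat sat on the mat."},
    {"role": "assistant", "content": "A cat rested on a mat."},
    {"role": "user", "content": "Summarize: The dog barked at the mail carrier all morning."},
    {"role": "assistant", "content": "A dog spent the morning barking at the mail carrier."},
    {"role": "user", "content": "Summarize: The chatbot generated a long answer to a short question."},
    {"role": "assistant", "content": ""}  # left empty for the model to fill in
]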
Capturing The User’s Input
As described above, the prompt structure must follow a specific format. Here, we’ve created a function that handles the process of taking the user’s input, formatting it, and then appending it to the conversation_history.
The function creates a prompt in the terminal with the text “User:”. It then waits for the user to enter text and hit the return key. After the user’s input is captured, it is wrapped in a user role dictionary and appended to the conversation_history list. Then, because the conversation_history list must have an assistant role as its last item, we create an assistant role dictionary with an empty string for its content and append that to the end of the conversation_history list.
The last line of this function prints a starting prompt to the terminal for the “Assistant” output that the model generates through the streaming pipeline. This helps the user see that it’s the model’s turn to respond.
# Function to capture keyboard input and add to conversational history
def capture_input():
    input_text = input("User: ")
    conversation_history.append({"role": "user", "content": input_text})
    conversation_history.append({"role": "assistant", "content": ""})
    print("Assistant: ", end='')  # Prints without newline
This function is called just before entering the for-loop. This allows the user to start the conversation by typing first.
# Start program by asking for initial input from user.
capture_input()
Entering The For-Loop & Formatting The Prompt
As mentioned above, this code is designed with a for-loop to limit the length of a conversation. This can be removed and replaced with a while loop that loops forever. However, by limiting the length of the conversation, we can prevent the size of the conversation_history from growing too large and subsequently slowing the model down.
Once inside the for-loop, the first line performs two important operations:
First, the .apply_chat_template() function takes our conversation_history variable (a list of dictionaries) and converts it into a prompt string with the [INST] and [/INST] tags placed in the proper locations.
Second, the tokenizer converts this prompt string into tokens to be processed by the model.
# Limit maximum iterations for conversation
for iteration in range(max_iterations):
    # Convert conversational history into chat template and tokenize
    inputs = tokenizer.apply_chat_template(conversation_history, return_tensors="pt", return_attention_mask=False)
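If you are curious what the formatted prompt actually looks like, you can ask apply_chat_template for the plain string instead of token IDs (tokenize=False is a standard option of this function). This is purely a debugging aid and is not required by the chatbot:
# Inspect the prompt string (with the [INST]/[/INST] tags) instead of token IDs
prompt_text = tokenizer.apply_chat_template(conversation_history, tokenize=False)
print(prompt_text)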
Generating The Output From The Model
The model.generate function is where the magic happens. This function starts the process of running the prompt through the model and generating its output. We pass several arguments in this function:
- inputs – the tokens generated from our prompt by the tokenizer function
- streamer – streamer function we configured earlier in the script; it allows the model to print directly to the terminal as it generates each new token
- max_new_tokens – this sets the maximum size of the response the model can generate
- do_sample – setting this to True enables sampling, which makes the output more varied and creative
- top_k – limits sampling to the k most likely next tokens; lower values make the output more conservative
- top_p – limits sampling to the smallest set of tokens whose cumulative probability exceeds p (nucleus sampling); lower values make the output more conservative
- pad_token_id = tokenizer.eos_token_id – this is done to suppress a warning and configure the model for open-end generation
# Generate output
generated_ids = model.generate(inputs,
                               streamer=streamer,
                               max_new_tokens=2048,
                               do_sample=True,
                               top_k=50,
                               top_p=0.92,
                               pad_token_id=tokenizer.eos_token_id
                               )
Feel free to experiment with the top_k and top_p values to see how they affect the model’s level of creativity.
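As a rough starting point (the values below are illustrative, not tuned recommendations), lowering top_k and top_p generally produces more conservative, repeatable output, while setting do_sample=False disables sampling entirely for near-deterministic responses:
# More conservative, repeatable output (illustrative values only)
generated_ids = model.generate(inputs,
                               streamer=streamer,
                               max_new_tokens=2048,
                               do_sample=True,
                               top_k=10,    # consider only the 10 most likely next tokens
                               top_p=0.7,   # ...restricted further to 70% cumulative probability
                               pad_token_id=tokenizer.eos_token_id)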
Capturing The Model’s Output
As explained above, the addition of the streamer function will cause the model to print its output directly to the terminal as it’s being generated. However, when the generation function completes, the model will return its entire output.
We want to capture this output and append it to the conversation_history. As our conversation_history grows, the model can build a larger context around the conversation. As the context grows, the model is able to output responses that are more in-line and on-topic to the conversation.
When the generate function completes, it returns a sequence of tokens, stored in the ‘generated_ids’ variable. We pass the generated_ids to another tokenizer function that decodes them back into readable text. The result is stored in the ‘output’ variable.
Because the model outputs the entire chat history including the new portion it generates, we want to strip off all the old text – leaving us with only the newly generated response from the model.
To do this, we use a combination of the .split() function to separate the text string by delimiter and the .strip() function to clean up leading and trailing whitespace.
First, we use the .split() function with the text sequence ‘</s>’ as the delimiter. The .split() function returns a list of the text segments between the delimiters. Because the model’s output ends with ‘</s>’, the last item in that list is an empty string, so we select the second-to-last item, which is the newly generated response. Feel free to print the entire output from the model to see how the strings are arranged.
Finally, after we get the final output string, we use the .strip() function to remove any leading or trailing whitespace.
The filtered output from the model gets placed in the ‘output_filtered’ variable.
# Get complete output from model including input prompt
output = tokenizer.batch_decode(generated_ids)[0]
# Filter only new output information using '</s>' delimiter, then strip starting and trailing whitespace
output_filtered = output.split('</s>')[-2].strip()
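As an alternative to string splitting, you can also slice off the prompt tokens and decode only what the model added. This is a sketch that assumes inputs still holds the tokenized prompt for the current iteration:
# Decode only the newly generated tokens by skipping the prompt portion
new_tokens = generated_ids[0][inputs.shape[-1]:]
output_filtered = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()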
Updating The Conversation
Previously, we discussed the format of the prompt and how the last dictionary item in the list is the assistant role with an empty string as its content.
Now that we have the output from the model, we update the content field of the last assistant dictionary item in the conversation_history list.
# Update conversation history with the latest output
conversation_history[-1]["content"] = output_filtered
Now, the conversation history is updated and complete.
Saving The Conversation
At this point, we can take the conversation_history list variable, convert it to a JSON string and save it to a file.
Later, if we want to restart the script, we can load this file into the conversation_history variable near the beginning of the script and continue the chat where it left off.
# Save entire conversation history to text file for debugging or use for loading conversational context
with open(filename, 'w') as f:
    json.dump(conversation_history, f, ensure_ascii=False, indent=4)
We save the file using the json.dump function. We include the indent=4 argument to format the text file in a way that is easy to read (and modify if we choose to do so).
Starting The Next Iteration
Finally, we end the for-loop by calling the capture_input() function again to get the user’s input from the keyboard. After the user presses the return key, the loop starts over from the top.
# Capture input before start of next iteration
capture_input()
Growing Context
After each iteration of the for-loop the entire conversation_history, along with the newest input from the user, is put back into the model. This grows the conversation context.
The model does not have the ability to remember. It cannot maintain state. Instead, every time the model is called, it only knows the data it was trained on and the information contained within the prompt.
For this reason, we are responsible for building a conversation ‘history’ and feeding it back into the model each time we call it.
As the conversation continues, more information is contained in the next prompt – this is what helps the model determine the next tokens to generate in relation to your input.
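One practical consequence: the longer the history, the longer each generation takes. If you want to keep the prompt size in check, one simple option (a sketch, not part of the original script) is to keep only the most recent messages before calling the model:
# Keep only the last N messages so the prompt stays a manageable size (illustrative limit)
max_messages = 20  # ten user/assistant pairs
if len(conversation_history) > max_messages:
    conversation_history = conversation_history[-max_messages:]
    # The chat template expects the history to start with a user message
    if conversation_history[0]["role"] == "assistant":
        conversation_history = conversation_history[1:]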
Because this code saves the conversation history, you can store different contexts and use or modify them as best suits you.
For example, you can create a conversation history of a specific character from a movie, how that character writes, what that character’s history is and other details. Then you can load that conversation into the script and start chatting with that character.
You can also build conversation histories full of information about your business, FAQs about your product, details about your marketing, etc…and then use the chatbot to answer questions.
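For example, here is how you might write a small seed history for a product-support assistant and save it in the same JSON format the script loads at start-up (the company name, file name, and policy details are placeholders for illustration):
import json
# Hypothetical seed context for a product-support chatbot (all names and details are placeholders)
seed_history = [
    {"role": "user", "content": "You are the support assistant for Acme Widgets. Answer briefly and politely. FAQ: widgets ship within 5 business days; returns are accepted for 30 days."},
    {"role": "assistant", "content": "Understood. I will answer customer questions about Acme Widgets using the shipping and returns policy you provided."}
]
# Save it in the same JSON format the chatbot script loads
with open("acme_widgets_context.txt", 'w') as f:
    json.dump(seed_history, f, ensure_ascii=False, indent=4)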
This all requires experimentation, but it produced good results in our testing.
Conclusion
Hopefully you found this tutorial helpful. With the increasing performance of free Large Language Models like Mistral-7B-Instruct, we should expect more helpful and useful chatbot use cases in the future.
Feel free to copy this code and build upon it. If you have any questions or would like to share your experiences creating new context histories, please leave a comment below.