This post describes the process of working with the Mistral-7B-Instruct-v0.2 model using Python. The following steps also work for the mistralai/Mistral-7B-Instruct-v0.1 model.
The key difference between this model and Mistral-7B (How To Get Started With Mistral-7B Tutorial) is that this model was fine-tuned to follow instructions. Its instruction-following ability makes it better suited for chat applications.
Because this model has been fine-tuned to accept structured prompts, it produces more structured outputs. Typically, LLMs produce unstructured outputs that are somewhat random in nature, and this unpredictability has made them difficult to use in an automated fashion.
The Mistral-7B-Instruct model is better positioned for use in an automated fashion.
In this post, we will describe the process to get this model up and running. Then we will cover some important details for properly prompting the model for best results. Finally, we will discuss streaming, the ability of the model to write its response to the terminal as it is being generated, and how to configure the code to enable this feature.
What is Mistral-7B-Instruct-v0.2?
This is the second version of the Mistral-7B-Instruct model. It is based on Mistral-7B; however, it has been fine-tuned to follow instructions. Because chat models typically operate in a back-and-forth sequence in which the user provides instructions for the model to respond to, this model is well suited for chat-bot applications.
To learn more about the base model, Mistral-7B, see our tutorial on using the Mistral-7B model: How To Get Started With Mistral-7B Tutorial
This model has been released under the Apache 2.0 license, which allows it to be used with very few restrictions.
This model may be used as-is, or it can be further fine-tuned to improve its performance on more specific tasks.
It should be noted that this model has no built-in moderation mechanisms. It is simply a large language model trained on data taken from the web. Additional fine-tuning would be necessary to have more control over what the model generates.
Where To Find
The Mistral-7B-Instruct-v0.2 model may be found on HuggingFace.co for download. The model’s page on HuggingFace.co can be found here: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
Prerequisites
Using this model with Python is similar to working with Mistral-7B and requires the same packages to be installed in your environment.
You will need to have PyTorch installed on your system. Additionally, you will need to install the following Python packages:
pip install transformers
It is always a good idea to install the accelerate package as well to help speed up processing:
pip install accelerate
You can read more about installing PyTorch and transformers from our previous post: How To Get Started With Mistral-7B Tutorial
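If you want to confirm everything is installed before downloading the model, a quick sanity check such as the one below can save some time. This is just a sketch and is not required for the rest of the tutorial:

import torch
import transformers

# confirm the packages import correctly and report whether a CUDA GPU is visible
print("PyTorch version:", torch.__version__)
print("Transformers version:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())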
Python Code
The following code is ‘basic’ and will be used to ensure the model downloads successfully to your hard drive and all necessary packages are installed. Later in the article we will show more complex code to prompt the model and generate the streaming output.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

# download (on the first run) and load the tokenizer and model weights
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# tokenize a short prompt and generate a 20-token continuation
text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
The first time this code is run on your machine, it will download all the files associated with the model. This may take a while, as the model is large and its repository contains over 20GB of files.
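The files are stored in the Hugging Face cache (by default under ~/.cache/huggingface/hub), so later runs load the model from disk instead of downloading it again. If you would rather keep the files somewhere else, from_pretrained accepts a cache_dir argument; the path below is just a placeholder:

# store the downloaded model files in a custom location (placeholder path)
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir="/path/to/model/cache")
model = AutoModelForCausalLM.from_pretrained(model_id, cache_dir="/path/to/model/cache")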
After downloading the files, the code will run the model and print its output to the terminal. When I executed the code on my machine, it output the following:
Loading checkpoint shards: 100%|||||||| 3/3 [00:05<00:00 1.84s/it]
Setting 'pad_token_id' to 'eos_token_id':2 for open-end generation.
Hello my name is Kieran and I am a 21 year old student from the UK. I am currently
Process finished with exit code 0
Prompting Technique
This model was fine-tuned to work with a specific prompting format that surrounds instructions with [INST] and [/INST] tags. Furthermore, the prompt should begin with the beginning-of-sentence (BOS) token <s>. When the model finishes a response, it emits the end-of-sentence (EOS) token </s>.
Because the model was trained on data following this format, deviating from it drastically reduces the model's performance. However, the process of formatting the prompt correctly by hand can be error-prone and time consuming with longer prompts.
Thankfully, the model designers have included a prompt template that allows us to use a Python list of dictionaries to organize the sections of our prompt. The template automatically converts the list of dictionaries into a single prompt string that includes the <s>, [INST], and [/INST] tokens.
This template is known as a chat template and is applied via the apply_chat_template() method.
See the following code, which includes a more complex prompt, following the chat format:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# the conversation so far, as a list of role/content dictionaries
messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

# apply_chat_template inserts the <s>, [INST] and [/INST] tokens and returns the token ids
input = tokenizer.apply_chat_template(messages, return_tensors="pt")
generated_ids = model.generate(input, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
NOTE: The code above executes the model on the CPU. The model can be configured to run on a GPU; however, if the GPU does not have enough memory, the process will fail. When testing models, I first run them on the CPU to ensure they work properly.
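If you do want to try the GPU and you installed the accelerate package, a variation like the one below should work. This is a sketch rather than part of the original example; it assumes a CUDA-capable GPU with enough memory for the model in half precision (roughly 16GB):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" (provided by accelerate) places the weights on the GPU when one is available;
# float16 roughly halves the memory footprint compared to the default float32
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

messages = [{"role": "user", "content": "Do you have mayonnaise recipes?"}]

# move the tokenized prompt to the same device as the model before generating
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
generated_ids = model.generate(input_ids, max_new_tokens=200, do_sample=True)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])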
Looking back at the messages list in the chat-template example, you can see the format the prompt takes before being passed to the apply_chat_template() method.
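If you want to see the exact string the chat template builds, you can call it with tokenize=False, which returns the formatted prompt as plain text instead of token ids. Continuing with the messages list and tokenizer from the chat-template example (the exact spacing may vary slightly between tokenizer versions):

# inspect the formatted prompt as a string rather than token ids
prompt_text = tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt_text)
# prints something along the lines of:
# <s>[INST] What is your favourite condiment? [/INST]Well, I'm quite partial to ... </s>[INST] Do you have mayonnaise recipes? [/INST]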
This allows for easy prompt generation, and even dynamic prompt generation, by appending more user/assistant messages to the list as the program runs, as sketched below.
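For example, continuing from the chat-template example, you could append the model's reply and a follow-up question to the messages list, then re-apply the chat template for the next turn. The follow-up question here is just an illustration:

# keep only the newly generated tokens and add them back into the conversation
reply = tokenizer.decode(generated_ids[0][input.shape[-1]:], skip_special_tokens=True)
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "Can you make a version without eggs?"})

# re-apply the chat template and generate the next response
input = tokenizer.apply_chat_template(messages, return_tensors="pt")
generated_ids = model.generate(input, max_new_tokens=1000, do_sample=True)
print(tokenizer.batch_decode(generated_ids)[0])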
Lastly, although this model works well with zero-shot prompting, one-shot or few-shot prompting helps improve its performance.
NOTE: One-shot and few-shot prompting include one or more example instructions and answers directly in the prompt to show the model how it should respond, while zero-shot prompting provides no examples at all. This process can be as simple or as complex as you would like; however, be careful, as it can have unintended effects on how the model responds. You can read more about prompting here: https://huggingface.co/docs/transformers/main/tasks/prompting
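As a concrete illustration, a few-shot prompt fits naturally into the same messages format: you simply include one or more example question/answer pairs before the real question. The sketch below reuses the model and tokenizer loaded earlier; the example content is my own and purely illustrative:

few_shot_messages = [
    # example pair showing the desired style of answer
    {"role": "user", "content": "Summarize in one sentence: The Eiffel Tower is located in Paris, France, and was completed in 1889."},
    {"role": "assistant", "content": "The Eiffel Tower, finished in 1889, stands in Paris, France."},
    # the actual request we want answered in the same style
    {"role": "user", "content": "Summarize in one sentence: Mistral-7B-Instruct-v0.2 is an instruction fine-tuned large language model released under the Apache 2.0 license."},
]

input_ids = tokenizer.apply_chat_template(few_shot_messages, return_tensors="pt")
generated_ids = model.generate(input_ids, max_new_tokens=100, do_sample=True)
print(tokenizer.batch_decode(generated_ids)[0])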
Configuring Streaming Output
If you run the two sets of code above, you will notice the code hangs until the model completes generating the entire output. After the model finishes, the code continues and the model’s output is printed to the terminal.
Commercial chat bots are configured to provide a more natural conversational experience by outputting each word to the screen as the model generates it. This allows the user to read the model's output as it is being generated and decide whether to wait for the complete response or terminate it early.
The streaming output feature requires only a few lines of code.
To configure the code above to stream the model's output, we use the TextStreamer class from transformers. We pass the tokenizer to the TextStreamer constructor and then pass the resulting streamer object to the streamer parameter of the model.generate function. The code below has been updated to stream the model's output directly to the terminal.
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

input = tokenizer.apply_chat_template(messages, return_tensors="pt")

# the streamer decodes and prints tokens to the terminal as they are generated
streamer = TextStreamer(tokenizer)
generated_ids = model.generate(input, streamer=streamer, max_new_tokens=1000, do_sample=True)

# decoding and printing afterwards is no longer needed; the streamer already printed the output
# decoded = tokenizer.batch_decode(generated_ids)
# print(decoded[0])
When this code is executed, the model's output is printed directly to the terminal word by word as it is generated.
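One optional tweak: by default, TextStreamer also echoes the prompt before printing the generated text. The class accepts a skip_prompt argument (plus decode keyword arguments such as skip_special_tokens), so if you only want to see the new tokens, you can construct the streamer like this:

# print only the newly generated tokens, skipping the prompt and special tokens
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
generated_ids = model.generate(input, streamer=streamer, max_new_tokens=1000, do_sample=True)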
Improving The Model’s Output
HuggingFace.co has a great blog post that walks through several ways large language models may be adjusted to generate more creative and coherent outputs without getting stuck in a repeating loop.
Read more here: https://huggingface.co/blog/how-to-generate
For this article, we will use the top_k and top_p arguments.
The model.generate function can accept many arguments. Keep in mind, however, that some arguments are designed to work together, and if one is included, the other related arguments may need to be included as well. In this case, we are going to use the do_sample, top_k and top_p arguments.
In the code above, change the following line:
model.generate(input, streamer=streamer, max_new_tokens=1000, do_sample=True)
to include top_k and top_p arguments like this:
model.generate(input, streamer=streamer, max_new_tokens=1000, do_sample=True, top_k=50, top_p=0.95)
By setting do_sample=True, the model stops being deterministic. Instead of always choosing the most likely token (greedy decoding), the next word is sampled from the model's probability distribution.
Top_k sampling is a technique where, instead of sampling from the entire vocabulary for the next word in the sequence, only the top_k most probable words (here, the 50 most likely) are made available for selection. This reduces the chances of selecting a low-probability word while still allowing a level of randomness (i.e., a lower likelihood of repeating).
Top_p sampling is a technique that chooses from the smallest possible set of words whose cumulative probability exceeds the top_p threshold.
Using top_k and top_p together helps filter out low-probability words while still maintaining a high level of randomness. These arguments help create outputs that sound more natural and have a lower likelihood of repeating sequences.
Play around with the top_k and top_p values to see how they affect your model's output.
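A quick way to experiment is to loop over a few settings and compare the outputs. The sketch below continues from the earlier example (without the streamer, so the outputs can be printed one after another) and uses the set_seed helper from transformers so that differences come from the decoding settings rather than the random state; the specific values are just examples:

from transformers import set_seed

for top_p in (0.5, 0.8, 0.95):
    set_seed(42)  # reset the random state so each setting is compared fairly
    generated_ids = model.generate(input, max_new_tokens=200, do_sample=True, top_k=50, top_p=top_p)
    print(f"--- top_p={top_p} ---")
    print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])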
Conclusion
The instruct version of the Mistral-7B model has the potential to generate more predictable outputs thanks to its ability to follow instructions. This is an important feature that, combined with the performance and accessibility of this model, will open up automation to many highly customized use cases.
The code provided in this post should serve as a good foundation for building more advanced automation programs.
Be sure to experiment with the different prompting techniques (zero-shot, one-shot, etc.) to see how they affect the output.
Let me know in the comments if you experience any issues and I will include a troubleshooting section with solutions.