Mistral AI is a European start-up with a global focus that specializes in generative artificial intelligence. The company has created and trained several LLMs (large language models) and released them for public use. Some of these models achieve performance that rivals or even outperforms OpenAI’s GPT-3.5, the model originally behind ChatGPT.
This post is specifically about the Mistral-7B model, however the same steps may be applied to other Mistral models such as Mistral-7B-Instruct and Mixtral-8x7B-v0.1.
This post will guide you through the process of getting this model set up and running on your personal computer – no need to pay for expensive cloud processing platforms.
What is Mistral-7B?
Mistral-7B is a large language model containing 7.3 billion parameters. It was trained on a large amount of data extracted from the open web. At the time of its release (September 27, 2023), it was the most powerful language model of its size.
According to the company’s blog post, Mistral-7B:
- Outperforms Llama 2 13B on all benchmarks
- Outperforms Llama 1 34B on many benchmarks
- Approaches CodeLlama 7B performance on code, while remaining good at English tasks
- Uses Grouped-query attention (GQA) for faster inference
- Uses Sliding Window Attention (SWA) to handle longer sequences at smaller cost
The model has been released under the Apache 2.0 license and it can be used without restrictions.
This model may be used as a base model or it can be ‘fine-tuned’ to provide improved ability on more specific tasks.
It should be noted that this model has no built-in moderation mechanisms. It is simply a large language model trained on data taken from the web. Additional fine-tuning would be necessary to have more control over what the model generates.
Read the company’s blog post announcing Mistral-7B for more information on its performance metrics: https://mistral.ai/news/announcing-mistral-7b/
Where To Find It
The Mistral-7B model may be found on HuggingFace.co for download. The model’s page can be found here: https://huggingface.co/mistralai/Mistral-7B-v0.1
HuggingFace.co is an online hub for AI models, datasets, and more – all free. It is a large community of contributors from around the world helping to advance AI by providing free datasets and models for research and development.
Every model on HuggingFace has a dedicated page containing information specific to that model. This page gives the model’s authors a place to tell users how to use the model, its pitfalls, its license, and anything else the authors want to share. There is also a tab for the repository where the model files are located. Lastly, there is a community tab that allows users to post and answer questions related to the model.
To learn more about HuggingFace.co see here: https://huggingface.co/
Prerequisites
We will be using Python to run this model.
This model is over 14GB in size. Be sure to have at least 15GB of free space before running the code below.
First, you will need to install the transformers package. Use: pip install transformers
The transformers package was created by HuggingFace.co. It includes many powerful functions that reduce setup time when working with AI models. One of its best features is that it handles connecting to the HuggingFace repository, downloading the model, caching the files locally, and loading the model – all from a single line of Python code!
Second, you will need to install PyTorch.
PyTorch is a library that provides the tools for running and accelerating models on either the CPU or GPU. The transformers functions used here require PyTorch to be installed, so it is important to install the correct PyTorch version for your system.
Click here for instructions to install PyTorch: https://pytorch.org/get-started/locally/
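On many systems a plain pip install torch will give you a working build, but the right command depends on your operating system and whether you have a CUDA-capable GPU, so it is safest to use the command generated by the selector on that page.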
After these packages are installed, open your favorite Python editor and copy the code below into a blank file.
Python Code
from transformers import AutoModelForCausalLM, AutoTokenizer

# The HuggingFace repository ID for the model
model_id = "mistralai/Mistral-7B-v0.1"

# Download (on the first run) and load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Turn the prompt into token IDs and generate up to 20 new tokens
text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
# Decode the generated token IDs back into readable text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Running The Code For The First Time
The first time this code is run, transformers will start downloading the model files from the HuggingFace.co repository. The Mistral-7B-v0.1 repository contains over 14GB of files, so depending on how fast your internet connection is, the download may take a while. After the download is complete, the model will load and run automatically.
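The downloaded files are cached locally (by default in a .cache/huggingface folder inside your home directory), so subsequent runs will skip the download and load the model straight from disk.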
If everything works, you should see an output similar to:
Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying
Process finished with exit code 0
If the code does not run, or you get an error message, proceed to the troubleshooting section below.
How It Works
The model takes the text input, “Hello my name is” and tries to complete the sentence.
Large language models work by calculating a probability for every word (token) in their vocabulary, given the input context. Out of the box, model.generate greedily selects the token with the highest probability as the next one in the sequence, appends it to the input, and repeats until it hits the max_new_tokens limit of 20 set in the generate function.
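To make this concrete, here is a minimal sketch of a single greedy step, reusing the model, tokenizer, and inputs from the code above. It is shown only for illustration – model.generate does all of this for you:

import torch

# Score every token in the vocabulary for the next position (illustration only)
with torch.no_grad():
    logits = model(**inputs).logits      # shape: (batch, sequence_length, vocab_size)

# Greedy selection: take the single highest-scoring token for the last position
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode([next_token_id]))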
Selecting the single most likely word at every step is easy to implement and works reasonably well. However, as the model is asked to generate longer outputs for the same input, this greedy approach shows a strong tendency to repeat sequences.
To prevent this, there are several generation parameters that can reduce or even eliminate repetition – leading to more creative content generation. One such approach is discussed below.
Improving The Output
HuggingFace has a great blog post that walks through several ways large language models may be adjusted to generate more creative and coherent outputs without getting stuck in a repeating loop.
Read more here: https://huggingface.co/blog/how-to-generate
For this article, we will use the top_k and top_p arguments.
The model.generate function accepts many arguments. Keep in mind, however, that some arguments are designed to work together, and if one is included, the other related arguments may be required as well. In this case, we are going to use the do_sample, top_k, and top_p arguments.
In the code above, change the following line:
outputs = model.generate(**inputs, max_new_tokens=20)
to include top_k and top_p arguments like this:
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=50, top_p=0.95)
By setting do_sample=True, the model stops always picking the single most likely word. Instead, the next word is sampled at random according to the probability distribution.
Top_k sampling is a technique where, instead of sampling from the entire vocabulary, only the top_k most likely words (here, the 50 highest-probability words) are kept as candidates for the next word in the sequence. This reduces the chance of selecting a very low-probability word while still allowing some randomness (i.e., a lower likelihood of repeating).
Top_p sampling (also called nucleus sampling) is a technique that chooses from the smallest possible set of words whose cumulative probability exceeds the top_p threshold.
Using top_k and top_p together helps filter out low-probability words while still maintaining a good amount of randomness. These arguments help create outputs that sound more natural and are less likely to repeat sequences.
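As a rough illustration, here is how top_k and top_p narrow down a made-up probability distribution. This is a simplified toy example (with a lower top_p than the 0.95 used above, just to make the effect visible), not the exact filtering code transformers uses internally:

import torch

# Made-up probabilities for seven candidate tokens (illustration only)
probs = torch.tensor([0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.02])

# top_k: keep only the k most likely candidates
top_k = 5
topk_probs, topk_ids = probs.topk(top_k)

# top_p: of those, keep the smallest set whose cumulative probability reaches top_p
top_p = 0.85
cumulative = topk_probs.cumsum(dim=0)
cutoff = int((cumulative < top_p).sum()) + 1   # include the token that crosses the threshold
print(topk_ids[:cutoff])                       # token indices still available for sampling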
Play around with the top_k and top_p values to see how it affects your model’s output.
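If you want your experiments to be repeatable while you tweak these values, transformers provides a set_seed helper that fixes the random number generators used during sampling:

from transformers import set_seed

# Fix the random seed so repeated runs with the same settings produce the same sampled output
set_seed(42)
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))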
Troubleshooting
Mistral-7B is a very large model and consumes an enormous amount of memory while running. Several machines I’ve tested it on have had a difficult time loading the model into memory. In my experience, the model is too large to fit on my GPU, so I have been running it on the CPU instead, which takes much longer to produce results.
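One thing worth trying first (not covered in the steps above) is loading the weights in half precision, which roughly halves the memory needed compared to the default full-precision load. This mainly helps if you have a GPU with enough memory to hold the smaller model:

import torch
from transformers import AutoModelForCausalLM

# Load the weights as 16-bit floats instead of the default 32-bit floats (about half the memory)
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16)

# If you have a GPU with enough memory, move the model onto it
# model = model.to("cuda")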
A possible alternative is the amazon/MistralLite model. It can be found here: https://huggingface.co/amazon/MistralLite
The amazon/MistralLite model is based on the Mistral-7B-v0.1 model and has been fine-tuned by Amazon engineers (according to its model card, primarily to handle much longer contexts efficiently). The MistralLite model should still deliver performance similar to the original model.
To use the MistralLite model, simply change this line:
model_id = "mistralai/Mistral-7B-v0.1"
to this:
model_id = "amazon/MistralLite"
Then run the code again; the new model will download and should run automatically.
Conclusion
Once you have the full power of a large language model like Mistral-7B running on your personal computer, there are many ways you can harness that power to create new programs and automations.
Let me know in the comments if you run into any issues and I will try to update the troubleshooting section with more solutions.