Google’s Gemini AI Aims to Break the Barrier Towards True Artificial Reasoning
Today (December 6th, 2023), Google DeepMind unveiled its most advanced AI system yet: Gemini.
Unlike the PaLM 2 text-based models, Gemini was designed and built from the ground up to be multi-modal, which means it can generalize, operate across, and combine different types of information, including text, code, audio, images and video. This is expected to deliver a significant advancement in reasoning capabilities.
Google DeepMind will roll out 3 versions of Gemini, each optimized for specific applications: Ultra, Pro and Nano.
So far, Gemini Ultra (the most advanced version of Gemini) surpasses OpenAI’s GPT-4 on the MMLU (massive multitask language understanding) benchmark, and it beat GPT-4 on several other benchmarks as well (see more below). Gemini Pro outperformed GPT-3.5 on the MMLU.
Starting today, Bard will begin using a fine-tuned version of Gemini Pro. Gemini Ultra will come to Bard early next year in a new experience called Bard Advanced.
Optimized For Different Tasks
The Ultra version is the most powerful model, built for highly complex tasks. It is designed to run in data centers and will likely be monetized through a cloud computing application layer. Google says this version will handle the most demanding workloads, making it the ideal choice for businesses and organizations that need to process large amounts of data or run heavy applications.
The Pro version will be integrated into the free version of Google’s Bard – providing Bard with a boosted set of capabilities. This model will be optimized for a wider – more generalized – set of tasks.
The Nano version is designed for mobile platforms – devices that run on batteries or have lower processing power. This version will be integrated into the Google Pixel phone.
Multi-Modal Capabilities
Previously, AI systems often relied on separate applications to convert images, audio, and other data into formats the core AI model could understand. This approach limited the model’s ability to learn and make connections across different information types.
Google built Gemini differently. It’s trained to interpret text, images, audio, and video directly, without needing intermediate conversion. This allows the model to develop a deeper “understanding” of the information it is given and create richer connections between different modalities.
This approach is similar to what Google DeepMind used with AlphaFold, their AI system for protein folding. Initially, AlphaFold relied on sub-components for specific tasks. However, DeepMind found that training the model to learn directly from raw data led to better performance – leading to a significant breakthrough in the field of biology.
By utilizing multi-modal capabilities, Gemini should be able to glean more details from the information it is prompted with.
This may finally lead to Bard being able to read and summarize PDFs, Word documents and other files (fingers crossed!).
With its ability to process multiple mediums, Gemini would be expected to significantly increase productivity across many use cases.
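To make the multi-modal idea concrete, here is a minimal sketch of what a combined text-and-image prompt might look like. It assumes the google-generativeai Python SDK and the gemini-pro-vision model name from the launch documentation; the file name and prompt are placeholders, not a definitive integration guide.

```python
import google.generativeai as genai
import PIL.Image

# Placeholder key; real keys are issued through Google AI Studio.
genai.configure(api_key="YOUR_API_KEY")

# The vision-capable model accepts text and images mixed in a single prompt.
model = genai.GenerativeModel("gemini-pro-vision")
chart = PIL.Image.open("quarterly_sales.png")  # hypothetical local image

response = model.generate_content(
    ["Describe the trend shown in this chart in two sentences.", chart]
)
print(response.text)
```

The point is that the image is passed to the model directly, alongside the text, rather than being run through a separate captioning or OCR step first.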
Reasoning Performance
As LLMs become more advanced, they require ever-larger amounts of data to train on. Expanding to multi-modal capabilities opens a whole new world of training data: instead of being limited to text, newer AI models can be built on a variety of information sources, including text, images, audio, code and video. These additional forms of information give the model more relationships to learn from, which helps deepen its understanding of the outside world.
Current AI systems do not exhibit a true capability for reasoning. Instead, the models use statistical methods to generate the outputs that are most likely to be correct based on their training data. This is why most AI systems perform poorly on mathematical reasoning exams.
One simple test that highlights the shortcomings of current AI models is to ask them: “What weighs more, 5 pounds of feathers or a 1-pound hammer?” Both ChatGPT and Google Bard, for instance, would incorrectly respond that the hammer weighs more, likely because the question superficially resembles the classic “pound of feathers vs. pound of bricks” riddle seen in their training data. This flawed reasoning stems from their reliance on statistical patterns rather than any genuine understanding of physics or logic.
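For anyone who wants to try the probe themselves, a few lines of scripting are enough. The sketch below assumes the OpenAI Python SDK (v1+) and the gpt-3.5-turbo model purely as an illustration of how such a test is run; it is not meant to reproduce the exact responses described above.

```python
from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment.
client = OpenAI()

question = "What weighs more, 5 pounds of feathers or a 1-pound hammer?"

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": question}],
)

# Inspect the answer; a correct response should pick the feathers.
print(response.choices[0].message.content)
```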
Gemini is Google’s ambitious attempt to develop an AI system capable of true reasoning. Google does not explain how it achieved this; instead, it claims the model will take more time when answering complex questions to improve the likelihood of a correct answer. My guess is that the model will perform some real-time search to gather more information related to the question, and then determine which answer has the higher likelihood of being correct.
Here is a quote from the Google Blog explaining how Gemini performed on some AI benchmark tests:
From natural image, audio and video understanding to mathematical reasoning, Gemini Ultra’s performance exceeds current state-of-the-art results on 30 of the 32 widely-used academic benchmarks used in large language model (LLM) research and development.
With a score of 90.0%, Gemini Ultra is the first model to outperform human experts on MMLU (massive multitask language understanding), which uses a combination of 57 subjects such as math, physics, history, law, medicine and ethics for testing both world knowledge and problem-solving abilities.
https://blog.google/technology/ai/google-gemini-ai/#performance
Other Announcements
Starting on December 13, developers and enterprise customers can access Gemini Pro via the Gemini API in Google AI Studio or Google Cloud Vertex AI.
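As a rough sketch of what that access might look like, the snippet below uses the google-generativeai Python SDK with an API key from Google AI Studio; the model and function names follow the launch documentation, but treat them as assumptions rather than a definitive guide (Vertex AI uses its own client libraries).

```python
import google.generativeai as genai

# Placeholder key; real keys come from Google AI Studio.
genai.configure(api_key="YOUR_API_KEY")

# Text-only requests go to the gemini-pro model.
model = genai.GenerativeModel("gemini-pro")
response = model.generate_content(
    "Summarize the differences between Gemini Ultra, Pro and Nano in one sentence each."
)
print(response.text)
```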
Early next year, Google plans to launch Bard Advanced, a new, cutting-edge AI experience that gives more users access to Gemini Ultra.
Learn More
Learn more about Gemini’s integration with Bard: https://blog.google/products/bard/google-bard-try-gemini-ai/
Learn more at the Google blog post: https://blog.google/technology/ai/google-gemini-ai/
Also check out Gemini’s page on Google DeepMind: https://deepmind.google/technologies/gemini/