In one day: Google’s Gemini 1.5 and OpenAI’s Sora

By: Tamer Karam | Feb. 16, 2024


On Thursday, Google announced Gemini 1.5, which improves on its predecessor Gemini 1.0 in several ways, including significant gains in performance and speed as well as in the amount of data it can process at once. For the first time, Gemini looked like a real threat to ChatGPT's leadership.

OpenAI did not stay silent, but Sam Altman did not record a four-minute video declaring ChatGPT the best, as Zuckerberg did with the Vision Pro. Instead, shortly after Google's announcement, OpenAI unveiled a stunning new model that generates video from text descriptions, blunting the impact of Gemini 1.5 and signaling that OpenAI is still the leading and pioneering company in artificial intelligence.

Gemini 1.5 can process massive amounts of information at once: more than 700,000 words, 11 hours of audio, or an hour of video. You can then ask it any question about that input, such as finding the clip in a video that discusses a specific topic, or even locating a scene that matches a simple sketch supplied as part of the prompt.
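To make that concrete, here is a minimal sketch of what long-context video querying could look like with Google's google-generativeai Python SDK. Treat it as illustrative only: at the time of writing, Gemini 1.5 is available in limited preview, and the model name "gemini-1.5-pro-latest", the file name, and the upload flow shown here are assumptions, not confirmed details of the preview.

```python
import time
import google.generativeai as genai

# Assumes an API key with access to a long-context Gemini model.
genai.configure(api_key="YOUR_API_KEY")

# Upload an hour-long video via the SDK's File API.
# "lecture.mp4" is a hypothetical local file.
video = genai.upload_file(path="lecture.mp4")

# Wait until the service finishes processing the upload.
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

# Ask a question about a specific moment in the video.
# The model name is an assumption for the preview release.
model = genai.GenerativeModel("gemini-1.5-pro-latest")
response = model.generate_content([
    video,
    "Find the clip where the speaker explains the main idea, "
    "and give its approximate timestamp.",
])
print(response.text)
```

The point of such a setup is that the entire video fits inside the model's context window, so no retrieval or chunking layer is needed between the user and the content.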

Benchmarks in the technical report show that Gemini 1.5 Pro performs on par with Gemini 1.0 Ultra, the largest and most capable model Google has released so far.

What about Sora?

OpenAI's response was to launch Sora ("sky" in Japanese), its new model for converting text to video. It can generate clips full of detail and realism, up to 60 seconds long and in high resolution.

The company showed cherry-picked videos generated by the model, which appears to outperform everything we have seen so far, with the possible exception of Runway's Gen-2 model, a comparison that will need wider testing to assess.

Sora generates coherent clips that closely follow the prompt and can contain multiple characters moving in different directions, while respecting how objects physically occupy the real world. In other words, the model shows some common sense and understanding of the physical world.

However, the company acknowledges that Sora does not always produce the expected results and sometimes makes silly mistakes. For example, a person might bite into an apple, yet the apple shows no trace of the bite.

It looks like 2024 will be an exciting year for artificial intelligence: in its first weeks alone, the best text model and the best video generation model to date have been released. So what's next?

