Google’s NEW AI ‘Dreamix’ Takes the Industry By STORM!
Now, Google is once again showing why they’re one step ahead of the major competition when it comes to certain AI features. Recently, Google’s AI team released a research paper documenting their approach to text-guided video, and honestly, the results are completely mind-blowing. Take a look at this: this is Google’s new text-to-video. So essentially, what we have here is the input video, and then the generated video. If we read the description, it pretty much describes exactly what’s going on, but I do need to provide some more context.
It says that given a video and a text prompt, Dreamix edits the video while maintaining fidelity to color, posture, object size, and camera pose, resulting in a temporally consistent video. So here it turns the monkey, which you can see on the left, into a dancing bear, which you can see on the right, given the prompt “a bear dancing and jumping to upbeat music while moving his body.” So essentially, what this is is a text-to-video editor.
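To make the idea concrete, here’s a rough Python sketch of the general recipe the paper describes: heavily corrupt the input clip so only coarse structure survives (color, layout, motion), then let a text-conditioned diffusion model fill the details back in according to the prompt. This is not Dreamix’s released code (there isn’t any), and the function names are mine; it’s only meant to show the shape of the inputs and outputs.

```python
import numpy as np

def degrade(video: np.ndarray, noise_level: float) -> np.ndarray:
    """Keep only the coarse structure of the clip; drown out fine detail in noise."""
    noise = np.random.normal(scale=noise_level, size=video.shape)
    return (1.0 - noise_level) * video + noise_level * noise

def denoise_with_text(noisy_video: np.ndarray, prompt: str) -> np.ndarray:
    """Placeholder for a text-conditioned video diffusion model."""
    # A real model would iteratively denoise these frames while attending
    # to the prompt; here we just clamp values so the sketch runs.
    return np.clip(noisy_video, 0.0, 1.0)

input_video = np.random.rand(16, 64, 64, 3)   # frames x height x width x RGB
edited = denoise_with_text(
    degrade(input_video, noise_level=0.7),
    prompt="a bear dancing and jumping to upbeat music",
)
print(edited.shape)   # (16, 64, 64, 3)
```

The point of the sketch is just that the edit starts from the original footage rather than from scratch, which is why the camera motion and framing carry over into the generated video.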
Now, of course, you might be thinking this just isn’t basic text-to-video, because most basic text-to-video would just be where you simply enter a text prompt and get a video out. But this is something that many models have actually struggled with, and I’m going to show you more examples from Google’s Dreamix as to why it’s really good. I’m going to be comparing it to Runway’s Gen 2, which is another text-to-video editor that has gained some popularity recently. So let’s take a look at some more text-to-video examples from Google’s Dreamix using the different modes this model supports.
So you can see right here that this is another example. Dreamix can generate videos based on image and text inputs. So it can actually insert motion into a static image, and this is very different from plain text-to-video, because it opens up a whole other level of possibilities. You can see right here that this is just an image, but then they’ve changed it to “an underwater shot of a sea turtle with a shark approaching from behind.” So this is definitely really good, because everybody knows about Midjourney and how good it is at image generation, but imagine you could take your image generation from Midjourney, put it into Google’s Dreamix, and then simply say “a futuristic landscape, cinematic shot.” I’m pretty sure that with just images combined with this tool, we could be generating some full-scale movies, which is why I say this is far more advanced than people think. Because, of course, Google, as you all know, is working on Bard and many other projects such as PaLM, but this is definitely going to change the game once it does get integrated, and I do hope they release it soon, because the other examples I’m about to show you later in the video are truly breathtaking.
This one is pretty cool, as you can all see right here. It says that given a small collection of images showing the same subject, which is, of course, the same Lego toy character, Dreamix can generate new videos with the subject in motion. In this example, given a small number of images of the toy fireman, Dreamix is able to extract the visual features, then animate it to lift weights while maintaining fidelity and temporal consistency, which just means that it looks normal and not strange at all. So this is really good because we know this can then be applied to many other things. For example, maybe you want to generate something moving with a Midjourney character, or maybe you want to pre-visualize something. There are honestly many different applications, but I think the toy fireman lifting weights is really, really good, because the generated video does look pretty realistic compared to some of the other AI text-to-video platforms that exist right now.
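As a hedged sketch of that two-step workflow, the toy loop below stands in for the fine-tuning stage: a handful of photos of one subject lightly adjust the model’s weights, and only then does a text prompt animate that subject. The weights, the loss, and the generate_video call are all stand-ins of my own, not anything from the paper.

```python
import numpy as np

def finetune_on_subject(weights: np.ndarray, subject_images: list,
                        steps: int = 100, lr: float = 1e-2) -> np.ndarray:
    """Toy stand-in for fine-tuning: nudge the weights toward the subject's statistics."""
    for _ in range(steps):
        image = subject_images[np.random.randint(len(subject_images))]
        # A real pipeline would use a diffusion denoising loss here; this is
        # just a mean-squared nudge so the loop does something concrete.
        gradient = weights - image.mean()
        weights = weights - lr * gradient
    return weights

weights = np.zeros(8)                                    # pretend model weights
photos = [np.random.rand(64, 64, 3) for _ in range(5)]   # a few subject photos
weights = finetune_on_subject(weights, photos)
# generate_video(weights, "the toy fireman lifting weights")  # hypothetical call
```

The design point is that the subject’s appearance gets baked into the model first, which is why the same Lego character shows up consistently across frames instead of drifting.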
Now what I want to do quickly is compare this to something that was recently released, which is Runway’s Gen 2. If you don’t know what Runway’s Gen 1 and Gen 2 are, essentially, this is text-to-video. Gen 1 was where you had a video and a driving image, which you’re seeing on the screen, and the driving image essentially remakes that video in the style of that image. So you can see right here that one’s Lego, that one’s some kind of fire effect, and Gen 1 has a massive Discord, which is where everyone was talking about it. So it’s something that does very well at what it’s designed for, but I do think that in the instance we’re looking at, Google’s text-to-video does a lot better.

But you can see here with Gen 2 that what we have is many different outputs that don’t actually need a driving image. I think that if this company is able to just focus on this, it’s going to be at Midjourney level soon, where you’ll be able to get complete footage from a single prompt, which is what you can see right here: an apartment, an extreme close-up of an eye, and these are generated without driving images. Driving images are essentially images that prompt the scene to have a certain specific style. So, for example, let’s take a look at this right here. You can see that this is the driving image, and what happens here is that this was given the prompt of a low-angle shot of a man walking down a street illuminated by the neon signs of the bars around him. What’s good about this is that if you want to prompt this in a certain way, or you want a certain style, you can actually use that image in order to craft that story, which does make this, in some aspects, much more effective. But like I said, there are some other examples I want to show you from Google Dreamix on the video editing side.
So you can see right here, the input video is them cooking some onions. Then they added the text prompt of “stirring noodles in a pot,” and you can see right here that they are indeed stirring noodles in a pot. So when it comes to the video editing aspect, these examples that I’m showing you right now showcase how powerful this new software is. You can also see that it says, “Moving through a field on a wooden path with fire on all sides,” and from the input video, the generated video is honestly really, really good. And I’m not taking shots at Runway here. I think they’ve done something absolutely insane, but this does look like it is a clear step ahead of Runway. So it’ll be interesting to see where the software goes.
You can also see that the input video produces some outstanding results in the final generated video of this example, where an old pickup truck is carrying wood logs. It definitely looks really, really realistic, and it shows the multiple applications where this stuff can be used. Now you can also see right here that it says, “A small brown dog and a large white dog are rolling a soccer ball on the kitchen floor,” and that’s generated from an input video of these two animals. It just goes to show that even with an input video, you can quickly edit stuff with this software, and I just wonder how much this is going to develop.
This one right here is another great example of how you can use natural language to edit a video just by describing what you want on screen, and honestly, this is truly great because it shows us what kinds of results we’re going to be getting in the future once this stuff is fine-tuned. And this one right here is really cool as well. It’s a beach with palm trees and swans in the water, and comparing the input video to the generated video, you can see that it definitely does look pretty realistic. Definitely one of the more realistic ones. Of course, some of them may have some small artifacts, but you have to remember this is new software.
Now you can see right here, this one even manages to do the water reflections really well. We can see that an orangutan with orange hair bathing in a beautiful bathroom looks pretty close to what I would expect if I saw an orangutan bathing. Now you can also see right here, a deer rolling on a skateboard. This one isn’t as accurate, but it’s still very interesting to see how the model puts certain pieces together and what the final output looks like. And we can see the input video is there, and the output video definitely looks interesting.
Now, of course, we need to take a look at image-to-video, because this is the section which is even crazier. Take a look at this. We can see that the input image is on the left and the generated video is on the right-hand side, and the text prompt is “a camel walking in the sand dunes.” Honestly, this looks pretty much perfect. I mean, it doesn’t look like the highest quality, but it definitely looks great. We have another input image right here. You can see that we just have something that looks like a Christmas tree, and then we have Bigfoot walking in the snowstorm. Honestly, these examples that I’m about to show you are truly incredible. They look about as realistic as possible, and the video generated from each input image looks about as realistic as you could expect at this early stage.
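If you’re wondering how a single still becomes a clip at all, a reasonable mental model (again, my own sketch, not Google’s code) is to repeat the image across every frame and then run the same corrupt-and-denoise editing pass over that static clip, so the prompt is what supplies the motion.

```python
import numpy as np

def image_to_clip(image: np.ndarray, num_frames: int = 16) -> np.ndarray:
    """Tile a single still image into a static clip for the model to animate."""
    return np.repeat(image[np.newaxis, ...], num_frames, axis=0)

still = np.random.rand(64, 64, 3)        # e.g. the camel photo
static_clip = image_to_clip(still)
print(static_clip.shape)                 # (16, 64, 64, 3)
# From here the same corrupt-then-denoise pass as in the earlier sketch would
# animate it, e.g.:
# edited = denoise_with_text(degrade(static_clip, 0.9), "a camel walking in the sand dunes")
```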
Now, of course, we have the input image and the generated video of the emperor penguins returning to their home, and you can see that this one definitely looks pretty accurate, which is pretty nice. And we have one here which showcases how you can actually zoom out on certain images and create different landscapes and different perspectives. It’s really interesting to see the dynamic fluidity of these models and just how accurate they really are at depicting exactly what we want from the text prompt and the input image.
You can also see right here a unicorn running in the foggy forest while zooming out. That definitely does look really realistic and really accurate. And I only have one question right now, which is: when is Google going to release this to the public? Because this is better than a lot of things we have seen on the internet, and if they release it soon, I’m pretty sure this is going to honestly take the entire industry by storm, because something like this isn’t currently available, especially not at this level of accuracy. I mean, just take a look. Look at this one right here: a grizzly bear walking, combined with these input images, honestly gives us some of the best results. Let me know what you all think about Google Dreamix. Is it something that is really good, or is it just something that you don’t think is that impressive? I truly think this is impressive, and honestly, the next few years in AI video generation are going to be truly incredible.