青青草a国产免费观看|91麻豆精品国产福利|国产av五无码一级毛片|亚洲爆乳精品无码一区二区|久久亚洲AV成人无码国产|91无码人妻一区二区三区|色婷婷av一区二区三区性色|国产制服91一区二区三区制服,女人书籍排行榜,盗墓笔记小说txt下载,玄幻小说排行榜完本

position: EnglishChannel  > AI ripples> Chinese AI Model Emu3 Handles Text, Image, Video Seamlessly

Chinese AI Model Emu3 Handles Text, Image, Video Seamlessly

Source: Science and Technology Daily | 2024-12-17 15:44:35 | Author: Gong Qian

On October 21, the Beijing Academy of Artificial Intelligence (BAAI), a Chinese non-profit organization engaged in AI R&D, released Emu3, a multimodal AI model that seamlessly integrates text, image, and video modalities into a single, unified framework.

The BAAI research team said Emu3 is expected to be used in scenario applications such as robot brains, autonomous driving, multimodal dialogue and inference.

Emu3, based solely on next-token prediction, proves that next-token prediction can be a powerful paradigm for multimodal models.

The existing multimodal AI models are mostly designed for specific tasks. Each has its corresponding architecture and methods. For instance, in the field of video generation, many developers use the diffusion in time (DiT) architecture, as referenced by Sora. Other models such as Stable Diffusion are used for text-to-image synthesis, Sora for text-to-video conversion, and GPT-4V for image-to-text generation.

In contrast to these models, which have a combination of isolated skills rather than an inherently unified ability, Emu3, eliminates the need for diffusion or compositional approaches. By tokenizing images, text, and videos into a discrete space, BAAI has developed a single transformer from scratch.

Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA.

In September, BAAI open-sourced the key technologies and models of Emu3 including the chat model and generation model after supervised fine-tuning.

Emu3 has been receiving rave reviews from overseas developers. "For researchers, a new opportunity has emerged to explore multimodality through a unified architecture, eliminating the need to combine complex diffusion models with large language models. This approach is akin to the transformative impact of transformers in vision-related tasks," AI consultant Muhammad Umair said on social media platform Meta.

While next-token prediction is considered a promising path towards artificial general intelligence, it struggled to excel in multimodal tasks, which were dominated by diffusion models such as Stable Diffusion and compositional approaches like CLIP combined with large language models.

Raphael Mansuy, co-founder of QuantaLogic, an AI agent platform, thinks that Em3 has significant implications for Al development. Mansuy wrote on X that Em3's success suggests several key insights: Next-token prediction as a viable path to general multimodal Al; potential for simplified and more scalable model architectures; challenge to the dominance of diffusion and compositional approaches.

Editor:GONG Qian

抱歉,您使用的瀏覽器版本過(guò)低或開(kāi)啟了瀏覽器兼容模式,這會(huì)影響您正常瀏覽本網(wǎng)頁(yè)

您可以進(jìn)行以下操作:

1.將瀏覽器切換回極速模式

2.點(diǎn)擊下面圖標(biāo)升級(jí)或更換您的瀏覽器

3.暫不升級(jí),繼續(xù)瀏覽

繼續(xù)瀏覽
马尔康县| 金塔县| 平乡县| 金堂县| 綦江县| 巫山县| 托里县| 临澧县| 罗山县| 紫云| 滕州市| 得荣县| 潜江市| 龙陵县| 黔西| 正安县| 大化| 尼玛县| 中卫市| 遵义县| 曲阜市| 双桥区| 云梦县| 留坝县| 永登县| 肃宁县| 阿瓦提县| 阳曲县| 将乐县| 社旗县| 峨边| 合阳县| 惠来县| 蕲春县| 平阳县| 武冈市| 军事| 贵州省| 镇雄县| 定结县| 筠连县|