New Era of AI: “Kosmos-1”

A group of researchers has announced a new artificial general intelligence (AGI) model named “KOSMOS-1”. KOSMOS-1 is designed to carry out a wide range of high-level tasks with better accuracy and efficiency than the AGI models in use today. With current models we can already tap into vast stores of knowledge in just a few steps: we can write documents, find inspiration, build tools, projects and much more… But these models accept only one type of input if you want a proper output.
For example, if you want an AI to write a document for you, you give it text as input. If you want it to create an image, you have to switch to a separate model that was trained to pair words with images. Either way, whenever you ask an AI for something, the request goes in as text.
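To make that concrete, here is a minimal sketch of today’s “one model per modality” workflow. It assumes the Hugging Face transformers and diffusers libraries, and the model names are only illustrative choices, not anything tied to KOSMOS-1:

```python
# Sketch of the current "one model per modality" workflow described above.
# Assumes: pip install transformers diffusers torch; model names are illustrative.
from transformers import pipeline
from diffusers import StableDiffusionPipeline

# Text in, text out: a language model drafts the document.
writer = pipeline("text-generation", model="gpt2")
draft = writer("A short announcement for our new product:", max_new_tokens=80)
print(draft[0]["generated_text"])

# Text in, image out: a *separate* model, trained to pair words with pictures.
painter = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = painter("a minimalist poster for a new product").images[0]
image.save("poster.png")
```

Two different models, two different libraries, and both of them are still driven purely by text prompts.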
Well, it seems like KOSMOS-1 changed the game a bit.
Transition from LLM to MLLM

Until now, we have been using impressive AI models built on Large Language Models (LLMs), complex systems we know under names such as OpenAI’s GPT-3 and Google’s GShard, BERT and T5. These have been incredibly successful as a general-purpose interface for various natural language tasks.
However, these models struggle to handle multimodal data such as images and audio natively, because they cannot perceive anything beyond text without extra training, and multimodal perception is a critical component of artificial general intelligence. To address this limitation, the researchers introduced KOSMOS-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, follow instructions, and learn in context. The model was trained on web-scale multimodal corpora, including plain text, documents with interleaved images and text, and image-caption pairs.
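The “interleaved” training data is easiest to picture with a small sketch. This is not the official KOSMOS-1 code; it only shows how a document that mixes sentences and pictures can be flattened into one input sequence, with images wrapped in <image>…</image> markers as the paper describes (the ImageRef placeholder stands in for the embeddings a real vision encoder would produce):

```python
# Illustrative sketch only, not KOSMOS-1 source code.
# An interleaved document is serialized into a single stream; the <image> markers
# show where a vision encoder's embeddings would be spliced in.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class ImageRef:
    path: str  # stands in for the actual image embeddings

def serialize(segments: List[Union[str, ImageRef]]) -> str:
    parts = []
    for seg in segments:
        if isinstance(seg, ImageRef):
            parts.append(f"<image>{seg.path}</image>")
        else:
            parts.append(seg)
    return " ".join(parts)

doc = [
    "Here is a photo from our trip:",
    ImageRef("beach.jpg"),
    "The picture shows a sunset over the sea.",
]
print(serialize(doc))
# -> Here is a photo from our trip: <image>beach.jpg</image> The picture shows a sunset over the sea.
```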

KOSMOS-1 natively supports language, perception-language, and vision tasks, allowing it to handle a wide range of perception-intensive tasks such as visual dialogue, image captioning, and zero-shot image classification with descriptions (classifying images it was never specifically trained to classify). In addition, the model’s ability to perceive multimodal input lets it acquire commonsense knowledge that goes beyond text descriptions, opening new opportunities for robotics and document intelligence.
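KOSMOS-1 itself is not packaged as a simple library call, but the zero-shot idea, scoring an image against textual class descriptions without task-specific training, can be illustrated with OpenAI’s CLIP, a different but openly available model, via Hugging Face:

```python
# Illustrative only: CLIP is not KOSMOS-1, but it shows the same zero-shot principle
# of matching an image against natural-language class descriptions.
# Assumes: pip install transformers torch pillow, and a local file animal.jpg.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("animal.jpg")
labels = [
    "a photo of a cat, a small domesticated feline",
    "a photo of a dog, a loyal domesticated canine",
]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```

The descriptions attached to each label play a role similar to the textual hints KOSMOS-1 uses to improve its zero-shot classification.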
Better Than Ever?

In summary, the KOSMOS-1 multimodal large language model shows promising results across a wide range of language and multimodal tasks. The researchers demonstrate that moving from LLMs to MLLMs unlocks new capabilities and opportunities. Scaling up KOSMOS-1 and integrating speech capability are among the researchers’ plans for future work. In addition, KOSMOS-1 could serve as a unified interface for multimodal learning, enabling text-to-image generation to be controlled with instructions and examples.
It is hard to imagine everything this new era will make possible, but we suggest our readers prepare themselves for the future.
You can find more info in these links:
https://arxiv.org/abs/2302.14045
https://github.com/microsoft/unilm
You can subscribe to not miss any posts!
Be cool, be safe and be YAPE.