Talk To Your Image — A Step-by-Step Guide to LLaVA-1.5


What is LLaVA?

LLaVA (Large Language-and-Vision Assistant) is a model trained end-to-end by combining a vision encoder with an LLM. The vision encoder processes visual data such as images and transforms it into a latent representation. The LLM then takes both the vision encoder's output and the text input to generate a response. LLaVA trains these two components jointly to enable multimodal visual-linguistic understanding. As an early study in visual instruction tuning, LLaVA demonstrated strong visual reasoning ability.

LLaVA's challenges

However, LLaVA underperformed on academic benchmarks that demand short-form responses,…
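The encoder-plus-LLM pipeline described above can be sketched in a few lines: the vision encoder yields one embedding per image patch, a projector maps those embeddings into the LLM's token-embedding space, and the projected visual tokens are prepended to the text-token embeddings before being fed to the LLM. The sketch below uses NumPy stand-ins rather than the real models; the dimensions follow LLaVA-1.5's published defaults (CLIP ViT-L/14 at 336px gives 24×24 = 576 patches of dim 1024, and Vicuna-7B has hidden size 4096), but the function names are illustrative, not the actual LLaVA API.

```python
import numpy as np

# Illustrative dimensions (LLaVA-1.5 defaults):
# CLIP ViT-L/14 @ 336px -> 24x24 = 576 patch embeddings of dim 1024;
# Vicuna-7B hidden size = 4096.
NUM_PATCHES, VISION_DIM, LLM_DIM = 576, 1024, 4096

rng = np.random.default_rng(0)

def vision_encoder(image: np.ndarray) -> np.ndarray:
    """Stand-in for the CLIP encoder: one embedding per image patch."""
    return rng.standard_normal((NUM_PATCHES, VISION_DIM))

# LLaVA-1.5 upgraded the projector from a single linear layer (LLaVA-1.0)
# to a two-layer MLP; weights here are random placeholders.
W1 = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.01
W2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.01

def project(patches: np.ndarray) -> np.ndarray:
    """Map patch embeddings into the LLM's token-embedding space."""
    hidden = np.maximum(patches @ W1, 0.0)  # ReLU here; the real model uses GELU
    return hidden @ W2

image = rng.standard_normal((336, 336, 3))        # dummy image
text_embeds = rng.standard_normal((32, LLM_DIM))  # 32 prompt-token embeddings

visual_tokens = project(vision_encoder(image))
llm_input = np.concatenate([visual_tokens, text_embeds], axis=0)
print(llm_input.shape)  # (608, 4096): 576 visual tokens + 32 text tokens
```

Training end-to-end means gradients flow through the projector (and, in instruction tuning, the LLM) so the visual tokens land in a space the language model can reason over.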


TARIK KAOUTAR 🙋‍♂️ Hey, my name is Kaoutar Tarik (高達烈). I split my time between university and a security lab, and I mostly write about AI, data analysis, blockchain, data science, machine learning, and internet entrepreneurship.