You are aware that those are often called LMMs, Large Multimodal Models? And one of the modes that makes them multimodal is language. All LMMs are or contain an LLM.
LLMs are not called LMMs, they’re called LLMs LOL
But thank you for moving the goalposts and making it clear you don’t know what you’re talking about and have no interest in an honest discussion. Goodbye.
https://github.com/haotian-liu/LLaVA
I don’t think Google actually uses LLaVA, but the concept is the same: a vision encoder turns the image into embeddings, and a learned projection maps those into the language model’s token-embedding space so the LLM can process them alongside text tokens.
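For anyone curious, here’s a rough sketch of the LLaVA-style wiring: image → vision encoder → linear projection → "image tokens" concatenated with text-token embeddings. The dimensions, names, and the dummy encoder below are all illustrative, not LLaVA’s actual code.

```python
import numpy as np

# Hypothetical sizes, chosen for illustration only.
VISION_DIM = 512   # e.g. a CLIP-like image-embedding size
LLM_DIM = 4096     # e.g. the LLM's hidden/token-embedding size
NUM_PATCHES = 16   # image patches produced by the vision encoder

rng = np.random.default_rng(0)

def encode_image(image: np.ndarray) -> np.ndarray:
    """Stand-in for a real vision encoder: image -> per-patch embeddings."""
    return rng.standard_normal((NUM_PATCHES, VISION_DIM))

# LLaVA's key trick: a small learned projection maps vision embeddings
# into the LLM's token-embedding space. Here it's just a random matrix.
W_proj = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.01

def build_llm_input(image: np.ndarray,
                    text_token_embeds: np.ndarray) -> np.ndarray:
    """Prepend projected image 'tokens' to the text-token embeddings."""
    image_tokens = encode_image(image) @ W_proj   # (NUM_PATCHES, LLM_DIM)
    return np.concatenate([image_tokens, text_token_embeds], axis=0)

text_embeds = rng.standard_normal((5, LLM_DIM))   # 5 embedded text tokens
seq = build_llm_input(np.zeros((224, 224, 3)), text_embeds)
print(seq.shape)  # 16 image tokens + 5 text tokens, each in LLM_DIM
```

The point is that the image never becomes literal text; it becomes vectors that live in the same space as the LLM’s token embeddings, so the transformer processes both in one sequence.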
How do you convert text to images?