Is Medusa(-2) compatible with vision-language models (VLMs)?

The repo contains code and examples for tuning Medusa heads for text-only LLMs. Is the Medusa(-2) code directly compatible with VLMs as well? I assume it should be, since VLMs do standard next-token prediction just like text-only LLMs, but I wonder how many code changes would be needed to tune a VLM with Medusa.
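
To make the assumption concrete, here is a rough sketch of how I imagine the per-head training targets would be built for a VLM. It is not taken from the repo: `medusa_targets` is a hypothetical helper, and I am assuming that (a) Medusa head `k` is trained to predict the token `k + 2` positions ahead, since the base LM head already covers `t + 1`, and (b) image-patch and prompt positions carry the label `-100` so the loss skips them, as is common in LLaVA-style training.

```python
IGNORE_INDEX = -100  # assumed convention: labels with this value are excluded from the loss


def medusa_targets(labels, num_heads):
    """Pair each position t with the label that hypothetical Medusa head k
    would be trained on, i.e. labels[t + k + 2], dropping masked targets."""
    targets = []
    for k in range(num_heads):
        shift = k + 2  # assumed offset: the base LM head already predicts t + 1
        pairs = [
            (t, labels[t + shift])
            for t in range(len(labels) - shift)
            if labels[t + shift] != IGNORE_INDEX
        ]
        targets.append(pairs)
    return targets


# Toy sequence: three masked positions (prompt / image patches), then three text tokens.
labels = [IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 11, 12, 13]
for k, pairs in enumerate(medusa_targets(labels, num_heads=2)):
    print(f"head {k}: {pairs}")
```

If this picture is right, the heads themselves would not need to know about images at all. The open question is whether the data pipeline and the forward pass in the training script can handle the extra image inputs.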