✨✨ #QuickRead tl;dr✨✨
✨✨ Research Overview:
The research focuses on improving the long-context capabilities of Multi-modal Large Language Models (MLLMs) to handle complex tasks such as video understanding, high-resolution image processing, and more.
✨✨ Key Contributions:
– Hybrid Architecture: a hybrid model combining Mamba and Transformer blocks to efficiently process long-context multi-modal data, particularly in scenarios involving multiple images.
– Image Token Compression: the model applies 2D pooling to reduce the number of image tokens, cutting computational cost while maintaining performance.
– Training Strategy: training proceeds in three stages (Single-image Alignment, Single-image Instruction-tuning, and Multi-image Instruction-tuning), allowing the model to incrementally build its ability to handle long multi-modal contexts.
– The model can process nearly 1,000 images on a single 80GB GPU, a significant improvement over existing models.
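To make the token-compression idea concrete, here is a minimal NumPy sketch of 2D pooling over a square grid of image tokens. The grid size (24×24, as in common CLIP-style encoders), the 2×2 stride, and the use of average pooling are assumptions for illustration, not details confirmed by the summary above:

```python
import numpy as np

def pool_image_tokens(tokens, grid=24, stride=2):
    """Compress a square grid of image tokens with 2D average pooling.

    tokens: (grid*grid, dim) array of visual tokens from the encoder.
    Returns ((grid // stride)**2, dim) pooled tokens -- a 4x reduction
    for stride=2, shrinking the per-image context cost accordingly.
    """
    dim = tokens.shape[1]
    # Restore the 2D spatial layout of the flat token sequence.
    x = tokens.reshape(grid, grid, dim)
    # Group each stride x stride neighborhood and average it.
    x = x.reshape(grid // stride, stride, grid // stride, stride, dim)
    return x.mean(axis=(1, 3)).reshape(-1, dim)

tokens = np.random.rand(576, 1024)   # e.g. a 24x24 grid of 1024-d tokens
pooled = pool_image_tokens(tokens)
print(pooled.shape)                  # (144, 1024)
```

At 144 tokens per image after pooling, roughly 1,000 images would occupy on the order of 144k context tokens, which gives a sense of why compression is needed to fit a multi-image context on one GPU.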