Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection • Paper • 2409.08513 • Published 23 days ago • 10
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale • Paper • 2409.08264 • Published 23 days ago • 42
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution • Paper • 2409.12191 • Published 17 days ago • 69
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning • Paper • 2409.12568 • Published 17 days ago • 46
Imagine yourself: Tuning-Free Personalized Image Generation • Paper • 2409.13346 • Published 16 days ago • 66
YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models • Paper • 2409.13592 • Published 15 days ago • 45
A Case Study of Web App Coding with OpenAI Reasoning Models • Paper • 2409.13773 • Published 17 days ago • 4