Neilblaze (Pratyay Banerjee)

upvoted an article 1 day ago

Article

ColPali: Efficient Document Retrieval with Vision Language Models 👀

By

•

Jul 5

• 110

upvoted a paper 5 days ago

ReFT: Representation Finetuning for Language Models

Paper • 2404.03592 • Published Apr 4 • 89

upvoted a paper 9 days ago

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

Paper • 2402.13753 • Published Feb 21 • 111

upvoted a collection 9 days ago

Llama 3.2

Collection

This collection hosts the transformers and original repos of the Llama 3.2 and Llama Guard 3 • 11 items • Updated 10 days ago • 327

upvoted 4 papers 12 days ago

YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models

Paper • 2409.13592 • Published 15 days ago • 45

upvoted a paper 17 days ago

NVLM: Open Frontier-Class Multimodal LLMs

Paper • 2409.11402 • Published 18 days ago • 66

upvoted 8 papers 19 days ago

LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation

Paper • 2408.15881 • Published Aug 28 • 20

Efficient LLM Scheduling by Learning to Rank

Paper • 2408.15792 • Published Aug 28 • 19

Law of Vision Representation in MLLMs

Paper • 2408.16357 • Published Aug 29 • 92

LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA

Paper • 2409.02897 • Published Sep 4 • 44

MemoRAG: Moving towards Next-Gen RAG Via Memory-Inspired Knowledge Discovery

Paper • 2409.05591 • Published 27 days ago • 26

LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Paper • 2409.06666 • Published 25 days ago • 54

GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering

Paper • 2409.06595 • Published 25 days ago • 37

Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers

Paper • 2409.04109 • Published 30 days ago • 41

upvoted 11 papers about 1 month ago

Tora: Trajectory-oriented Diffusion Transformer for Video Generation

Paper • 2407.21705 • Published Jul 31 • 25

SAM 2: Segment Anything in Images and Videos

Paper • 2408.00714 • Published Aug 1 • 105

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Paper • 2408.03314 • Published Aug 6 • 33

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Paper • 2408.02718 • Published Aug 5 • 60

WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models

Paper • 2408.03837 • Published Aug 7 • 17

Transformer Explainer: Interactive Learning of Text-Generative Models

Paper • 2408.04619 • Published Aug 8 • 154

VITA: Towards Open-Source Interactive Omni Multimodal LLM

Paper • 2408.05211 • Published Aug 9 • 46

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Paper • 2408.06292 • Published Aug 12 • 115

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Paper • 2408.10188 • Published Aug 19 • 51

Building and better understanding vision-language models: insights and future directions

Paper • 2408.12637 • Published Aug 22 • 111

Diffusion Models Are Real-Time Game Engines

Paper • 2408.14837 • Published Aug 27 • 121

upvoted an article 2 months ago

Article

Everything About Long Context Fine-tuning

By

•

May 10

• 27

upvoted 2 papers 2 months ago

Mixture of Nested Experts: Adaptive Processing of Visual Tokens

Paper • 2407.19985 • Published Jul 29 • 34

From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting

Paper • 2309.04269 • Published Sep 8, 2023 • 32

upvoted an article 2 months ago

Article

Google releases Gemma 2 2B, ShieldGemma and Gemma Scope

Jul 31

• 58

upvoted 28 papers 2 months ago

Depth Anything V2

Paper • 2406.09414 • Published Jun 13 • 91

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

Paper • 2406.11931 • Published Jun 17 • 56

EvTexture: Event-driven Texture Enhancement for Video Super-Resolution

Paper • 2406.13457 • Published Jun 19 • 16

LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs

Paper • 2406.15319 • Published Jun 21 • 60

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Paper • 2406.19280 • Published Jun 27 • 59

RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network

Paper • 2406.18284 • Published Jun 26 • 19

Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems

Paper • 2407.01370 • Published Jul 1 • 85

Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks

Paper • 2407.02855 • Published Jul 3 • 10

DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning

Paper • 2407.04078 • Published Jul 4 • 16

Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge

Paper • 2407.03958 • Published Jul 4 • 18

MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?

Paper • 2407.04842 • Published Jul 5 • 52

VIMI: Grounding Video Generation through Multi-modal Instruction

Paper • 2407.06304 • Published Jul 8 • 9

Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions

Paper • 2407.06723 • Published Jul 9 • 10

MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions

Paper • 2407.06358 • Published Jul 8 • 17

Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence

Paper • 2407.07061 • Published Jul 9 • 26

Vision language models are blind

Paper • 2407.06581 • Published Jul 9 • 81

Video-to-Audio Generation with Hidden Alignment

Paper • 2407.07464 • Published Jul 10 • 16

Controlling Space and Time with Diffusion Models

Paper • 2407.07860 • Published Jul 10 • 16

PaliGemma: A versatile 3B VLM for transfer

Paper • 2407.07726 • Published Jul 10 • 65

Inference Performance Optimization for Large Language Models on CPUs

Paper • 2407.07304 • Published Jul 10 • 52

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Paper • 2407.07895 • Published Jul 10 • 40

Generalizable Implicit Motion Modeling for Video Frame Interpolation

Paper • 2407.08680 • Published Jul 11 • 8

Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models

Paper • 2407.08701 • Published Jul 11 • 10

GTA: A Benchmark for General Tool Agents

Paper • 2407.08713 • Published Jul 11 • 14

Self-Recognition in Language Models

Paper • 2407.06946 • Published Jul 9 • 24

MambaVision: A Hybrid Mamba-Transformer Vision Backbone

Paper • 2407.08083 • Published Jul 10 • 27

MAVIS: Mathematical Visual Instruction Tuning

Paper • 2407.08739 • Published Jul 11 • 30

Understanding Retrieval Robustness for Retrieval-Augmented Image Captioning

Paper • 2406.02265 • Published Jun 4 • 6

Pratyay Banerjee

AI & ML interests

Organizations

Neilblaze's activity

ColPali: Efficient Document Retrieval with Vision Language Models 👀

Everything About Long Context Fine-tuning

Google releases Gemma 2 2B, ShieldGemma and Gemma Scope