VLM Selector

Edge-first model selection

Find Your Perfect VLM in 30 Seconds

Stop guessing. Enter your constraints (latency, memory, power, and target hardware) and get ranked model recommendations with trade-off explanations.

Start Decision Tree

Step 1

Select Your Use Case

Choose the primary vision-language task for your application.

Step 2

Set Your Constraints

Define the hard limits for your deployment environment.

Max Latency: 200ms

Range: 20ms (real-time) to 500ms (batch)

Max Memory: 8.0GB

Range: 500MB to 16GB

Max Power Draw: 25W

Range: 3W (mobile) to 50W (desktop GPU)

Target Hardware
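The limits above can be thought of as a small, validated settings object. The following is a minimal sketch, not the tool's actual implementation: the `Constraints` class, its field names, and the decision to reject out-of-range values with a `ValueError` are all assumptions made for illustration. Only the default values and the slider ranges come from this page.

```python
from dataclasses import dataclass

# Slider bounds as displayed on the page (milliseconds, gigabytes, watts).
LATENCY_RANGE_MS = (20.0, 500.0)
MEMORY_RANGE_GB = (0.5, 16.0)
POWER_RANGE_W = (3.0, 50.0)


@dataclass(frozen=True)
class Constraints:
    """Hypothetical container for the Step 2 limits (names are illustrative)."""

    max_latency_ms: float = 200.0  # default shown on the page
    max_memory_gb: float = 8.0     # default shown on the page
    max_power_w: float = 25.0      # default shown on the page

    def __post_init__(self) -> None:
        # Assumption: values outside the slider range are rejected, not clamped.
        checks = (
            ("max_latency_ms", self.max_latency_ms, LATENCY_RANGE_MS),
            ("max_memory_gb", self.max_memory_gb, MEMORY_RANGE_GB),
            ("max_power_w", self.max_power_w, POWER_RANGE_W),
        )
        for name, value, (low, high) in checks:
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")


defaults = Constraints()
mobile = Constraints(max_latency_ms=50, max_memory_gb=2.0, max_power_w=5)
```

Making the object immutable (`frozen=True`) is one possible design choice: each change to a slider would produce a new `Constraints` value, which keeps a set of recommendations tied to the exact limits that produced it.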

Step 3

Ranked Recommendations

7 models match all your constraints.

Florence-2-base

Microsoft

accuracy

Unified vision foundation model excelling at detection and captioning tasks.

Latency

28ms

Pass

Memory

1.2GB

Pass

Power

Pass

Hardware

0.23B

Pass

Strengths

+ Very small
+ Fastest inference
+ Runs anywhere

Trade-offs

- Weak on reasoning tasks
- Limited VQA capability

View on HuggingFace →

Edge Ready

PaliGemma 3B

Google

accuracy

Versatile VLM built on SigLIP and Gemma. Strong across multiple vision tasks.

Latency

85ms

Pass

Memory

4.5GB

Pass

Power

12W

Pass

Hardware

2.92B

Pass

Strengths

+ Multi-task capable
+ Good accuracy-size ratio
+ Fine-tunable

Trade-offs

- Requires GPU for real-time
- Moderate memory usage

View on HuggingFace →

Edge Ready

InternVL2-2B

OpenGVLab

accuracy

Competitive small VLM with strong multi-task vision performance.

Latency

68ms

Pass

Memory

3.6GB

Pass

Power

10W

Pass

Hardware

2.21B

Pass

Strengths

+ Competitive at small scale
+ Multi-task versatile
+ Active development

Trade-offs

- Requires GPU
- Mid-range accuracy

View on HuggingFace →

Edge Ready

Phi-3.5-Vision

Microsoft

accuracy

Efficient multimodal model balancing capability with deployability.

Latency

110ms

Pass

Memory

7.2GB

Pass

Power

18W

Pass

Hardware

4.15B

Pass

Strengths

+ Good balance of size and capability
+ Strong reasoning for size
+ Efficient architecture

Trade-offs

- Needs decent GPU
- Mid-range on detection

View on HuggingFace →

Edge Ready

Qwen2-VL-2B

Alibaba

accuracy

Compact multimodal model with strong OCR and document understanding.

Latency

72ms

Pass

Memory

3.8GB

Pass

Power

10W

Pass

Hardware

2.21B

Pass

Strengths

+ Best-in-class OCR
+ Good document understanding
+ Multi-language support

Trade-offs

- Higher memory than alternatives
- GPU recommended

View on HuggingFace →

Edge Ready

MobileVLM-3B

Meituan

accuracy

Purpose-built for mobile and edge deployment with optimized architecture.

Latency

55ms

Pass

Memory

3.2GB

Pass

Power

Pass

Hardware

2.96B

Pass

Strengths

+ Mobile-optimized
+ Low power consumption
+ Fast on-device

Trade-offs

- Lower accuracy ceiling
- Limited reasoning

View on HuggingFace →

Edge Ready

Moondream2

vikhyatk

accuracy

Tiny but capable VLM optimized for edge deployment. Excellent latency-to-accuracy ratio.

Latency

45ms

Pass

Memory

2.8GB

Pass

Power

Pass

Hardware

1.86B

Pass

Strengths

+ Extremely lightweight
+ Fast inference
+ Runs on CPU

Trade-offs

- Limited complex reasoning
- Lower accuracy on OCR

View on HuggingFace →

Qwen2-VL-7B

Alibaba

accuracy

Powerful multimodal model with state-of-the-art performance on vision-language benchmarks.

Latency

180ms

Pass

Memory

12.0GB

Fail

Power

35W

Fail

Hardware

7.61B

Pass

Strengths

+ Top-tier accuracy
+ Excellent OCR
+ Strong reasoning

Trade-offs

- Large model size
- High memory requirement
- Not edge-friendly

View on HuggingFace →

LLaVA-1.6-7B

LLaVA Team

accuracy

Popular open-source VLM with strong visual conversation and reasoning abilities.

Latency

165ms

Pass

Memory

11.5GB

Fail

Power

32W

Fail

Hardware

7.06B

Pass

Strengths

+ Strong conversational ability
+ Good visual reasoning
+ Large community

Trade-offs

- Large footprint
- Weaker on detection tasks

View on HuggingFace →

IDEFICS2-8B

Hugging Face

accuracy

Open multimodal model with strong document and chart understanding.

Latency

200ms

Pass

Memory

13.5GB

Fail

Power

38W

Fail

Hardware

8.36B

Pass

Strengths

+ Excellent document understanding
+ Strong chart/table parsing
+ Open weights

Trade-offs

- Very large
- Slow inference
- Not edge-deployable

View on HuggingFace →
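The Pass and Fail labels on the cards above are consistent with a simple comparison of each model's figures against the Step 2 limits (200ms, 8.0GB, 25W). The sketch below reproduces that check under two assumptions the page does not state: each limit is treated as an inclusive upper bound, and a card that shows no power figure is treated as passing the power check. The `Model` type and `check` function are hypothetical names. The figures are copied from the cards.

```python
from typing import NamedTuple, Optional


class Model(NamedTuple):
    name: str
    latency_ms: float
    memory_gb: float
    power_w: Optional[float]  # None where the card shows no power figure


# Figures copied from the cards in this section.
MODELS = [
    Model("Florence-2-base", 28, 1.2, None),
    Model("PaliGemma 3B", 85, 4.5, 12),
    Model("InternVL2-2B", 68, 3.6, 10),
    Model("Phi-3.5-Vision", 110, 7.2, 18),
    Model("Qwen2-VL-2B", 72, 3.8, 10),
    Model("MobileVLM-3B", 55, 3.2, None),
    Model("Moondream2", 45, 2.8, None),
    Model("Qwen2-VL-7B", 180, 12.0, 35),
    Model("LLaVA-1.6-7B", 165, 11.5, 32),
    Model("IDEFICS2-8B", 200, 13.5, 38),
]

# Step 2 settings shown on the page.
MAX_LATENCY_MS = 200
MAX_MEMORY_GB = 8.0
MAX_POWER_W = 25


def check(model: Model) -> dict:
    """Return one Pass/Fail flag per constraint (True means Pass)."""
    return {
        "latency": model.latency_ms <= MAX_LATENCY_MS,
        "memory": model.memory_gb <= MAX_MEMORY_GB,
        # Assumption: a missing power figure counts as a pass.
        "power": model.power_w is None or model.power_w <= MAX_POWER_W,
    }


matching = [m.name for m in MODELS if all(check(m).values())]
print(f"{len(matching)} models match all your constraints.")
```

Under these assumptions the three larger models (Qwen2-VL-7B, LLaVA-1.6-7B, IDEFICS2-8B) pass the latency check but fail on memory and power, which agrees with the labels on their cards.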

Select Your Use Case

Object Detection

Image Captioning

Visual Question Answering

OCR & Document Understanding

Visual Reasoning
