Phi-4-multimodal-instruct
Microsoft
Overview
Phi-4-multimodal-instruct is a lightweight (5.57B parameters) open multimodal foundation model that builds on research and datasets from Phi-3.5 and 4.0. It processes text, image, and audio inputs to generate text outputs, and supports a 128K token context length. The model was enhanced via supervised fine-tuning (SFT), direct preference optimization (DPO), and RLHF for instruction following and safety.
Phi-4-multimodal-instruct was released on February 1, 2025. API access is available through DeepInfra.
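Since the model accepts image input through an OpenAI-compatible chat API on providers such as DeepInfra, a request can be sketched as below. This is a minimal, hedged sketch: the exact model identifier (`microsoft/Phi-4-multimodal-instruct`) and the `image_url` content-part convention are assumptions to verify against your provider's documentation.

```python
import base64

# Hypothetical helper: build an OpenAI-style multimodal chat payload.
# The model id below is an assumption; check the provider's model list.
def build_vision_request(
    prompt: str,
    image_bytes: bytes,
    model: str = "microsoft/Phi-4-multimodal-instruct",
) -> dict:
    # Encode the image as a base64 data URI, the common way to inline
    # images in OpenAI-compatible chat requests.
    data_uri = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": data_uri}},
                ],
            }
        ],
        "max_tokens": 256,
    }

# Placeholder bytes stand in for a real PNG file.
req = build_vision_request("Describe this image.", b"\x89PNG-placeholder")
print(req["model"])
```

The resulting dict would be POSTed to the provider's chat-completions endpoint with an API key; only the payload construction is shown here.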
Performance
Compare Phi-4-multimodal-instruct to other models by quality (GPQA score) vs cost. Higher scores and lower costs represent better value.
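One simple way to read "quality vs cost" as a single number is score per dollar of blended token cost. The sketch below uses entirely hypothetical model names, scores, and prices purely to illustrate the ranking idea.

```python
# Illustrative value ranking: GPQA score points per $/M tokens of blended cost.
# All names, scores, and prices are hypothetical placeholders, not real data.
models = [
    {"name": "model-a", "gpqa": 55.0, "blended_cost_per_m": 0.30},
    {"name": "model-b", "gpqa": 40.0, "blended_cost_per_m": 0.075},
    {"name": "model-c", "gpqa": 60.0, "blended_cost_per_m": 1.50},
]

for m in models:
    # Higher score and lower cost both push this ratio up.
    m["value"] = m["gpqa"] / m["blended_cost_per_m"]

ranked = sorted(models, key=lambda m: m["value"], reverse=True)
print([m["name"] for m in ranked])
```

A cheap mid-tier model can out-rank a stronger but much pricier one on this metric, which is the trade-off the chart visualizes.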
Benchmarks
Phi-4-multimodal-instruct Performance Across Datasets
Scores sourced from the model's scorecard, paper, or official blog posts
Pricing
Pricing, performance, and capabilities for Phi-4-multimodal-instruct across different providers:
| Provider | Input ($/M) | Output ($/M) | Max Input | Max Output | Latency (s) | Throughput | Quantization | Input Modalities | Output Modalities |
|---|---|---|---|---|---|---|---|---|---|
| DeepInfra | $0.05 | $0.10 | 128.0K | 128.0K | 0.5 | 25.0 tok/s | — | Text, Image, Audio | Text |
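The per-token prices above translate into per-request cost straightforwardly; a quick sketch using the DeepInfra rates from the table:

```python
# Estimate per-request cost from the per-million-token prices in the table.
INPUT_PRICE_PER_M = 0.05   # $ per 1M input tokens (DeepInfra, from the table)
OUTPUT_PRICE_PER_M = 0.10  # $ per 1M output tokens (DeepInfra, from the table)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed rates."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# e.g. a 2,000-token prompt with a 500-token completion:
cost = request_cost(2_000, 500)
print(f"${cost:.6f}")  # $0.000150
```

At these rates, even a million such requests would cost about $150, which is why the model sits at the low-cost end of comparison charts.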
Example Outputs
Recent Posts
Recent Reviews
API Access
API Access Coming Soon
API access for Phi-4-multimodal-instruct will be available soon through our gateway; in the meantime, the model can be accessed directly through DeepInfra.
FAQ
Common questions about Phi-4-multimodal-instruct
