
Phi-4-multimodal-instruct

Overview

Phi-4-multimodal-instruct is a lightweight (5.57B parameters) open multimodal foundation model that builds on the research and datasets behind the Phi-3.5 and Phi-4 models. It processes text, image, and audio inputs to generate text outputs, and supports a 128K-token context length. The model is enhanced through supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning from human feedback (RLHF) for instruction following and safety.
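For local experimentation, the model can be run with Hugging Face Transformers. The sketch below is illustrative only: it assumes the microsoft/Phi-4-multimodal-instruct checkpoint with remote code enabled, and the <|user|>/<|image_1|>/<|end|>/<|assistant|> chat tags from the model card; check the official model card for the exact prompt format and processor arguments.

```python
# Minimal sketch: image + text in, text out with Phi-4-multimodal-instruct.
# Assumes the Hugging Face checkpoint "microsoft/Phi-4-multimodal-instruct"
# and the chat tags documented on its model card.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda", torch_dtype="auto", trust_remote_code=True
)

# Prompt format assumed from the model card: <|image_1|> marks where the image goes.
prompt = "<|user|><|image_1|>Describe this image.<|end|><|assistant|>"
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)  # hypothetical URL

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
# Strip the prompt tokens before decoding so only the reply is printed.
reply = processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(reply)
```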

Phi-4-multimodal-instruct was released on February 1, 2025. API access is available through DeepInfra.
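Since DeepInfra exposes an OpenAI-compatible endpoint, a text-only call can look like the sketch below. The base URL is DeepInfra's documented OpenAI-compatible endpoint; the model ID is assumed to match the Hugging Face name and should be verified against DeepInfra's catalog.

```python
# Sketch: text-only chat completion via DeepInfra's OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPINFRA_API_KEY",                # placeholder key
    base_url="https://api.deepinfra.com/v1/openai",  # DeepInfra's OpenAI-compatible endpoint
)

resp = client.chat.completions.create(
    model="microsoft/Phi-4-multimodal-instruct",  # assumed model ID; verify in the catalog
    messages=[
        {"role": "user", "content": "Give a one-sentence summary of the Phi-4 model family."}
    ],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```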

Timeline

Released: February 1, 2025
Knowledge Cutoff: June 2024

Specifications

Parameters: 5.6B
License: MIT
Training Data: Unknown
Tags: tuning:instruct

Benchmarks

[Chart: Phi-4-multimodal-instruct Performance Across Datasets]

Scores sourced from the model's scorecard, paper, or official blog posts


Pricing

Pricing, performance, and capabilities for Phi-4-multimodal-instruct across different providers:

Provider: DeepInfra
Input: $0.05 / M tokens
Output: $0.10 / M tokens
Max Input: 128.0K tokens
Max Output: 128.0K tokens
Latency: 0.5 s
Throughput: 25.0 c/s
Quantization: not listed
Input modalities: Text, Image, Audio, Video
Output modalities: Text, Image, Audio, Video
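
At the listed rates, per-request cost is simple arithmetic. A quick helper, illustrative only and using the prices from the table above:

```python
# Cost at the listed rates: $0.05 per million input tokens, $0.10 per million output tokens.
INPUT_RATE = 0.05 / 1_000_000   # dollars per input token
OUTPUT_RATE = 0.10 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-token rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: a 10,000-token prompt with a 1,000-token reply costs $0.0006.
print(f"${request_cost(10_000, 1_000):.6f}")
```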

API Access

API Access Coming Soon

API access for Phi-4-multimodal-instruct will be available soon through our gateway.


FAQ

Common questions about Phi-4-multimodal-instruct

Q: When was Phi-4-multimodal-instruct released?
A: Phi-4-multimodal-instruct was released on February 1, 2025 by Microsoft.

Q: Who created Phi-4-multimodal-instruct?
A: Phi-4-multimodal-instruct was created by Microsoft.

Q: How many parameters does Phi-4-multimodal-instruct have?
A: Phi-4-multimodal-instruct has 5.6 billion parameters.

Q: What license is Phi-4-multimodal-instruct released under?
A: Phi-4-multimodal-instruct is released under the MIT license, an open-source/open-weight license.

Q: What is the knowledge cutoff of Phi-4-multimodal-instruct?
A: Phi-4-multimodal-instruct has a knowledge cutoff of June 2024. The model was trained on data up to this date and may not have information about later events.

Q: Is Phi-4-multimodal-instruct multimodal?
A: Yes. Phi-4-multimodal-instruct is a multimodal model that can process text, images, and audio as input.