
Phi-4-multimodal-instruct

Microsoft

Overview

Phi-4-multimodal-instruct is a lightweight (5.6B parameters) open multimodal foundation model that builds on the research and datasets behind the Phi-3.5 and Phi-4.0 models. It processes text, image, and audio inputs to generate text outputs and supports a 128K-token context length. The model was further enhanced with supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning from human feedback (RLHF) for instruction following and safety.

Phi-4-multimodal-instruct was released on February 1, 2025. API access is available through DeepInfra.
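
Since DeepInfra serves the model behind an OpenAI-compatible chat API, a request can be sketched roughly as below. The base URL, the DEEPINFRA_API_KEY environment variable, the provider-side model ID microsoft/Phi-4-multimodal-instruct, and the example image URL are assumptions based on DeepInfra's usual conventions, not details confirmed on this page.

```python
# Minimal sketch of calling Phi-4-multimodal-instruct through DeepInfra's
# OpenAI-compatible endpoint. Base URL, model ID, and env var name are assumptions.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPINFRA_API_KEY"],         # assumed environment variable name
    base_url="https://api.deepinfra.com/v1/openai",  # assumed OpenAI-compatible base URL
)

response = client.chat.completions.create(
    model="microsoft/Phi-4-multimodal-instruct",     # assumed provider-side model ID
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder URL
            ],
        }
    ],
    max_tokens=256,
)

print(response.choices[0].message.content)
```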

Performance

Timeline

Release Date: February 1, 2025
Knowledge Cutoff: Unknown

Other Details

Parameters: 5.6B
License: MIT
Training Data: Unknown
Tags: tuning:instruct

Related Models

Compare Phi-4-multimodal-instruct to other models by quality (GPQA score) vs cost. Higher scores and lower costs represent better value.

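As a rough sketch of the comparison described above, the snippet below divides a GPQA score by a blended per-million-token price to get a crude "quality per dollar" figure; the value_score helper and every number in it are hypothetical placeholders, not benchmark results or prices from this page.

```python
# Crude "quality per dollar" sketch matching the chart's premise that higher
# GPQA scores and lower costs mean better value. All values are hypothetical.
def value_score(gpqa: float, input_price: float, output_price: float) -> float:
    """GPQA points per blended $/M tokens (simple average of input and output price)."""
    blended_price = (input_price + output_price) / 2
    return gpqa / blended_price

# Hypothetical placeholder entries: (name, GPQA score, $/M input, $/M output)
models = [
    ("model-a", 40.0, 0.05, 0.10),
    ("model-b", 55.0, 1.00, 3.00),
]

for name, gpqa, inp, outp in sorted(models, key=lambda m: -value_score(*m[1:])):
    print(f"{name}: {value_score(gpqa, inp, outp):.1f} GPQA points per blended $/M tokens")
```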

Benchmarks

Phi-4-multimodal-instruct Performance Across Datasets

Scores sourced from the model's scorecard, paper, or official blog posts


Pricing

Pricing, performance, and capabilities for Phi-4-multimodal-instruct across different providers:

DeepInfra
  Input price: $0.05 / M tokens
  Output price: $0.10 / M tokens
  Max input: 128.0K tokens
  Max output: 128.0K tokens
  Latency: 0.5 s
  Throughput: 25.0 tok/s
  Quantization: not listed
  Input modalities: Text, Image, Audio, Video
  Output modalities: Text, Image, Audio, Video
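
As a quick worked example of the listed DeepInfra rates ($0.05 per million input tokens, $0.10 per million output tokens), the sketch below estimates the cost of a single request; the token counts are illustrative values, not usage data from this page.

```python
# Estimate request cost from the listed DeepInfra prices:
# $0.05 per million input tokens, $0.10 per million output tokens.
INPUT_PRICE_PER_M = 0.05
OUTPUT_PRICE_PER_M = 0.10

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of one request at the listed rates."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Illustrative request: 10,000 input tokens and 1,000 output tokens
# -> 0.01 * $0.05 + 0.001 * $0.10 = $0.0005 + $0.0001 = $0.0006
print(f"${request_cost(10_000, 1_000):.4f}")
```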


API Access

API access for Phi-4-multimodal-instruct will be available soon through the llm-stats.com gateway; in the meantime, the model can be reached directly through DeepInfra.

FAQ

Common questions about Phi-4-multimodal-instruct

Q: When was Phi-4-multimodal-instruct released?
A: Phi-4-multimodal-instruct was released on February 1, 2025.

Q: How many parameters does Phi-4-multimodal-instruct have?
A: Phi-4-multimodal-instruct has 5.6 billion parameters.