IBM Announces Granite 3.3 AI Models Featuring Enterprise-Ready Speech-to-Text Capabilities

Granite Speech 3.3 includes a speech encoder to process English audio, a speech projector to convert that into features for the language model, and the model itself to generate accurate transcriptions or translations. LoRA adapters are included for efficient fine-tuning.

By Donna Joseph
April 18, 2025 10:00 PM

ARMONK, N.Y., April 18, 2025 — IBM has taken a meaningful step in the AI space with the release of its Granite 3.3 model lineup, highlighting Granite Speech 3.3 8B. This speech-to-text model is built for real business use. It prioritizes accuracy, flexibility, and open access over marketing hype.

A Model Built for the Real World

Granite Speech 3.3 8B is the central release. It transcribes spoken English into clear, reliable text. IBM designed it for practical applications, not showpieces. After transcription, the model can translate English text into major languages such as French, Spanish, Italian, German, Portuguese, Japanese, and Mandarin. This supports teams working across regions or serving multilingual customers. However, the model currently accepts only English as spoken input. Transcription in other languages isn’t supported yet.

Flexible Deployment

Granite Speech 3.3 is built on IBM’s Granite 3.3 8B Instruct large language model. For organizations with smaller infrastructure, IBM also offers a 2B version of the Instruct model. Both sizes balance performance against efficiency, which makes them viable across different deployment settings.

The model is compact and cost-efficient, suited for enterprise-grade transcription and translation without locking users into proprietary systems.

Open Source Access

All Granite 3.3 models are released under the Apache 2.0 license. Developers can use, modify, and redistribute them, including in commercial products, without paying license fees, provided they meet the license’s attribution and notice requirements. This open approach addresses growing concerns in the enterprise space about vendor lock-in and escalating subscription costs.

How It Works

IBM has provided a straightforward look at how the model functions. Granite Speech 3.3 includes a speech encoder to process English audio, a speech projector to convert that into features for the language model, and the model itself to generate accurate transcriptions or translations. LoRA adapters are included for efficient fine-tuning.
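
The three stages described above can be pictured as a simple chain. What follows is a minimal conceptual sketch in Python, not IBM's implementation: the class name `SpeechPipeline` and the `toy_*` functions are hypothetical placeholders, and the stand-in stages only mimic the shape of the data handed from encoder to projector to language model.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class SpeechPipeline:
    """Hypothetical three-stage chain: encoder -> projector -> language model."""
    encoder: Callable[[List[float]], List[Vector]]       # audio samples -> acoustic frames
    projector: Callable[[List[Vector]], List[Vector]]    # acoustic frames -> LM input features
    language_model: Callable[[List[Vector], str], str]   # features + text prompt -> output text

    def run(self, audio: List[float], prompt: str) -> str:
        frames = self.encoder(audio)
        features = self.projector(frames)
        return self.language_model(features, prompt)


def toy_encoder(audio: List[float]) -> List[Vector]:
    # Placeholder: group raw samples into frames of four samples each.
    return [audio[i:i + 4] for i in range(0, len(audio), 4)]


def toy_projector(frames: List[Vector]) -> List[Vector]:
    # Placeholder: map each frame to a small fixed-size feature vector.
    return [[sum(f) / len(f), max(f)] for f in frames]


def toy_language_model(features: List[Vector], prompt: str) -> str:
    # Placeholder: a real model would generate text conditioned on the features.
    return f"{prompt} -> <text generated from {len(features)} feature vectors>"


pipeline = SpeechPipeline(toy_encoder, toy_projector, toy_language_model)
print(pipeline.run([0.0] * 8, "transcribe"))
```

In this sketch the text prompt is what would select between transcription and translation, which reflects how the article describes a single model producing either output.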

According to IBM, the model delivers translation quality comparable to OpenAI’s GPT-4o and Google’s Gemini 2.0 Flash. The company says its benchmark testing shows Granite Speech 3.3 outperforming both open and closed competitors in transcription accuracy.

Developer Resources

IBM has also released the 8B and 2B Base models for teams that want to build their own solutions. In addition, retrieval-augmented generation (RAG) LoRA adapters from the earlier Granite 3.2 release are available on Hugging Face under Granite Experiments. These tools support deeper customization without starting from scratch.
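
For readers unfamiliar with LoRA (low-rank adaptation), the idea is that the original weight matrix stays frozen while two much smaller matrices are trained and added on top. The sketch below illustrates the general technique in plain Python. It is not taken from IBM's adapters, and the function names and toy dimensions are assumptions chosen for illustration.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector x (length cols)."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector,
                 alpha: float, rank: int) -> Vector:
    """Compute W x + (alpha / rank) * B (A x).

    w: frozen base weights, shape d_out x d_in
    a: trainable down-projection, shape rank x d_in
    b: trainable up-projection, shape d_out x rank
    """
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]


def trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Number of adapter weights trained in place of the d_in * d_out base weights."""
    return rank * (d_in + d_out)
```

For a hypothetical 4096-by-4096 layer with rank 8, `trainable_params(4096, 4096, 8)` comes to 65,536 adapter weights, against roughly 16.8 million in the full matrix. That gap is why adapters allow customization without retraining a model from scratch.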

Limitations and Next Steps

The current model transcribes only English. Speech input in other languages isn’t yet supported, though IBM plans to add that in future updates. Improvements are also planned for training data quality, audio feature integration during training, and speech emotion recognition.

IBM is also developing Granite 4.0. The next release is expected to improve speed, expand context handling, and increase model capacity to support more demanding use cases.

Why It Matters

Many AI tools are long on promise but short on practical value. IBM has focused on delivering something reliable, open, and usable in production environments. Granite Speech 3.3 provides accurate transcription, strong translation capabilities, and real tools for developers. It’s built for teams who need AI that works, not just demos that impress.

Final Word

IBM’s Granite Speech 3.3 stands out for its practical design and transparent approach. It delivers clear value without complexity. In a space crowded with overbuilt solutions, it offers what enterprise users actually need—dependable performance, flexible licensing, and room to build.
