Machine Learning Engineer, Dubbing

Sarvam AI

Bengaluru · Engineering

About Sarvam

Sarvam is building the bedrock of sovereign AI for India: a full-stack platform spanning research, models, infrastructure, and applications, with a singular focus on making AI genuinely work for India. Sarvam works with leading enterprises and public institutions, with partners including Tata Capital, SBI Life, CRED, IDFC, and LIC, and is backed by Lightspeed, Peak XV, and Khosla Ventures.

About the Role

You will own the ML integration layer for Sarvam’s dubbing and live translation products — building production pipelines that connect ASR, translation, TTS, and voice cloning into seamless end-to-end systems. The scope spans offline video dubbing (batch processing across 12+ Indian languages) and real-time speech-to-speech translation in multi-participant environments where latency budgets are measured in hundreds of milliseconds. The team’s roadmap evolves with the field; we want engineers who are comfortable with that.

What You’ll Do

- Build and optimise the real-time speech-to-speech translation pipeline — streaming ASR with server-side VAD, low-latency translation, and TTS synthesis delivered as live audio streams

- Design fan-out architectures where a single ASR stream serves multiple concurrent listeners, each receiving personalised translated audio

- Implement voice cloning in streaming and batch contexts — reference audio selection heuristics, handling short vs. long utterances, and maintaining speaker identity across a session

- Optimise end-to-end latency across the ASR → translation → TTS chain, including transcript buffering, segmentation strategies, and flush timing for continuous speech

- Integrate ML pipelines with real-time media infrastructure (WebRTC, RTMP, SRT) for live broadcast and conferencing use cases

- Own the automated QC loop — designing multi-stage verification pipelines that catch and correct quality issues before delivery

- Build evaluation harnesses for speech quality — WER/CER tracking, tempo analysis, pronunciation verification, and automated QC scoring

- Optimise inference pipelines — quantisation, batching strategies, model server configuration, and runtime acceleration for VAD and vocal separation

- Design and maintain audio data pipelines — segment extraction, filtering, deduplication, and quality assurance

- Build robust integrations across multiple ASR, TTS, and translation backends — managing fallbacks, retries, and quality routing

- Debug and improve deployed speech systems — latency, audio artifacts, code-mixed content, regional dialect handling, and edge cases in production

- Translate real-world dubbing problems (timing preservation, naturalness, register matching, multi-speaker scenarios) into well-scoped ML tasks with the right data and evaluation strategy

What We’re Looking For

- Strong Python and PyTorch — comfortable reading model internals, profiling inference, and debugging
