Embedded Data Scientist, Chanakya
Sarvam AI
About Sarvam
Sarvam is building the bedrock of sovereign AI for India: a full-stack platform spanning research, models, infrastructure, and applications, with a singular focus on making AI genuinely work for India. The company works with leading enterprises and public institutions, including Tata Capital, SBI Life, CRED, IDFC, and LIC, and is backed by Lightspeed, Peak XV, and Khosla Ventures.
About the Role
Embedded Data Scientists transform complex client data into structures that AI systems can reliably reason over. You will be deployed alongside Strategic Deployment Engineers at client sites, working directly in client data environments to understand, structure, and operationalise large-scale datasets.
This means working with heterogeneous, multimodal data — including documents, images, audio, geospatial data, and structured records — and designing the semantic structures that allow AI systems to interpret and reason over that data.
You will define how data is represented inside the AI system: how documents are segmented, how metadata is defined, how entities and relationships are represented, and how different data modalities connect. You will design ontologies, tagging systems, and knowledge graph structures that allow the reasoning engine to operate effectively.
You will often work with classified or operationally sensitive datasets in environments where standard tooling may not exist. You will own the quality of the data layer in your assigned accounts, ensuring the system is built on a foundation that enables reliable reasoning at scale.
What You'll Do
• Understand the client's data landscape across documents, imagery, audio, geospatial data, and structured records — including data sources, formats, workflows, and domain terminology
• Design domain ontologies representing entities, relationships, hierarchies, and operational concepts within the client's data environment
• Define document segmentation and chunking strategies that preserve semantic meaning and support effective retrieval
• Work with heterogeneous datasets and define how different modalities should be indexed, embedded, and linked
• Collaborate with Strategic Deployment Engineers to translate semantic structures into operational data ingestion pipelines
• Evaluate how well the AI system retrieves and reasons over client data, and refine structures to improve performance
• Collaborate with the models team and other teams to define benchmarks and evaluation criteria that reflect real-world deployment conditions
• Translate insights from client data environments into structured signals for product and engineering teams
What We're Looking For
• 2–5 years in data science, applied machine learning, or large-scale data analysis roles
• Strong Python skills including pandas, NumPy, and modern NLP or LLM tooling
• Solid grounding in ML fundamentals — enough to understand model behav