Nuance Labs

Member of Technical Staff — Model Optimization and Inference

Location Seattle, United States
Workplace On-site
Seniority Senior
Category IT
IT Category Data Science & ML
Salary USD 250,000 - 350,000 / yearly
Language English
Posted June 5, 2026
Last verified June 7, 2026

About Nuance Labs

Nuance Labs is building photorealistic, real-time AI avatars with emotional intelligence: a full-duplex audiovisual system that can listen, speak, react, interrupt, and respond like a real person.

We're a Series A company ($60M raised) backed by Lightspeed, Accel, South Park Commons, NVentures, and Define Ventures. Our team includes PhDs from MIT, UW, Oxford, CMU, and Johns Hopkins, with industry experience from Apple, Meta, Amazon AGI, and Discord. The team is small, the work is real, and the problems are unsolved.

How Nuance Differentiates

Most conversational AI avatars today are hacks: a face slapped on a speech-to-speech pipeline, stuck in the uncanny valley, emotionless, mechanical, one turn at a time. Current systems take 2–5 seconds to respond; natural conversation requires under 500 ms. Closing that gap means a 4–10x speedup, and it demands rethinking the entire stack.

That rethinking starts with full-duplex: an AI that listens and speaks simultaneously, perceives emotion in real time, and responds with a face that actually reflects it. It's an extremely hard problem, and we're developing foundation models designed for it from the ground up.

About the Role

We can train a great model. The next problem is making it fast enough to actually use in a real-time conversation — and that gap is enormous. A model that responds in 3 seconds is a demo. A model that responds in under 500ms is a product.

We're looking for someone who specializes in taking trained models and squeezing every last millisecond out of them. You understand the full stack from model weights to serving infrastructure — quantization, KV cache optimization, kernel-level acceleration, batching strategies — and you know which lever to pull for which problem. You've worked with vLLM, SGLang, or similar frameworks and have opinions about where they fall short.

Our stack is more complex than a standard LLM deployment: we're serving a full-duplex multimodal system that must satisfy strict real-time latency constraints. There's a lot of unsolved optimization work here, and we need someone who finds that genuinely exciting.

What You'll Do

  • Own end-to-end inference optimization across our model stack — LLMs, audio models, and diffusion-based components
  • Implement and tune KV cache strategies for long-context conversations, including eviction policies, compression, and memory-efficient attention
  • Evaluate, deploy, and extend inference serving frameworks (vLLM, SGLang, TensorRT-LLM, etc.) for our specific workloads
  • Profile and benchmark end-to-end latency and throughput; identify and systematically eliminate bottlenecks
  • Build internal tooling that makes optimization work faster and more rigorous — profiling viewers, end-to-end inference test harnesses, and other infrastructure that helps the team move quickly
  • Accelerate diffusion model inference — consistency models, step distillation, caching strategies, and custom kernel optimizations
  • Apply and develop quantization techniques (INT8, INT4, GPTQ, AWQ, and beyond) to reduce memory footprint and increase throughput without meaningfully degrading quality
  • Work closely with research and infrastructure to ensure new models ship with optimized serving from day one

What We're Looking For

  • Deep expertise in LLM inference optimization — you've worked on KV caching, memory layout, attention kernels, or batching strategies in a production or research context
  • Proficiency with inference serving frameworks — vLLM, SGLang, TensorRT-LLM, or similar — including the ability to go beyond default configurations and adapt them to non-standard use cases
  • Experience optimizing diffusion model inference (latency reduction, step distillation, caching, or kernel-level work)
  • Strong Python and PyTorch skills; comfort reading and writing CUDA or Triton kernels is a significant plus
  • A systematic approach to profiling and optimization — you measure first, then optimize
  • Familiarity with speculative decoding or other inference-time acceleration techniques

Bonus Points

  • Hands-on experience with post-training quantization (GPTQ, AWQ, or similar) and understanding of quality/performance tradeoffs
  • Familiarity with multimodal or streaming inference architectures
  • Experience deploying real-time AI systems with hard latency SLAs
  • Prior work at an AI lab, inference startup, or on a high-traffic model serving platform
  • Contributions to open-source inference frameworks

Compensation

$250,000 – $350,000 base salary, plus meaningful equity. We think long-term ownership matters and structure equity accordingly.

Logistics

  • Location: In-person in Seattle, 5 days a week — we believe in the compounding value of working shoulder-to-shoulder
  • Health: HSA plan with ~$2,000 in company contributions — about 2x what most big tech companies offer
  • PTO: 15 days + public holidays, and we close for a full week over the holidays
  • Lunch, beverages, and snacks: On us, every workday (the kind of thing that makes you actually look forward to coming in)
  • Commuter benefits
  • 401(k): In the works

Nuance Labs is an equal opportunity employer. We believe diverse teams build better AI.