Remote Full Time
Confidential

Job Details

Keep large language models serving reliably at scale — fast, observable, and always up. This is the production-engineering backbone of an AI platform: deployment, autoscaling, observability, and the SLOs that keep inference humming under real traffic. You'll join a small, senior team at an established enterprise software company building LLM-powered capabilities into its products.
What you'll do:
Own the reliability and scalability of LLM serving in production
Build and run deployment, autoscaling, and orchestration on Kubernetes
Instrument serving with real observability: time-to-first-token, tokens/sec, latency percentiles, and error budgets
Set and defend SLOs; load-test and harden the platform against traffic spikes
Operate the serving stack (vLLM / Triton / TensorRT-LLM) as a dependable production system
What you'll bring:
Strong production, SRE, or platform-engineering experience running services at scale
Deep Kubernetes and cloud infrastructure skills
Observability and reliability discipline (SLOs, monitoring, incident response)
Comfort operating GPU-backed or ML workloads (a deep ML background is not required)
Solid software engineering fundamentals
Nice to have:
Experience serving ML/LLM models specifically
Familiarity with inference frameworks (vLLM, Triton, TensorRT-LLM)
A performance or load-testing background
If you've kept high-scale systems alive and want to move into AI infrastructure, this is a clean on-ramp.

About Confidential
EMEA
Computer Hardware

An emerging information and communications technology company looking for people enthusiastic about new ideas. Contact: hally_aseeri@hotmail.com