Principal Engineer, On-Device AI Inference & Systems
The opportunity

We are building the next generation of AI-driven game experiences, running generative models on-device, right where the players are: on phones, tablets, laptops, and desktops. Our games run inside a modern, browser-native runtime (built on technologies such as WebGPU and WebNN), so the models that power these experiences must be deployed and accelerated entirely within that runtime. As our Principal Engineer for On-Device AI Inference & Systems, you will be the foremost engineering authority on taking state-of-the-art multi-modal models (transformers and diffusion networks) and making them run fast, small, and reliably within that runtime, fully integrated into a production game engine.

This is a deeply hands-on, high-impact engineering role. You will own the inference and integration stack end-to-end, from the moment a trained checkpoint leaves research, through export, optimization, and kernel-level tuning, to a shipped feature running inside the engine at interactive frame rates within a fixed memory and power budget. You will set the engineering standards, drive the architecture of the runtime and integration layers, and mentor a team of senior and mid-level engineers. Your work directly determines the latency, quality, memory footprint, and battery profile of AI features experienced by players worldwide.

This role is for an engineer who is energized by the gap between a research model and a shipping, AI-based product. If you love profilers, frame captures, op-fusion, and shaving milliseconds and megabytes, this is your role.

What you'll be doing

Inference & On-Device Optimization
- Own the end-to-end optimization pipeline: model export, graph transformation, operator fusion, memory-layout planning, and hardware-specific kernel tuning across NPU, mobile GPU, and desktop/laptop GPU.
- Make authoritative decisions on quantization (INT4/INT8/FP16), weight sharing, structured/unstructured pruning, and knowledge distillation to hit hard latency, memory, and power budgets, and validate them against quality bars.
- Drive low-level performance work: write and tune WebGPU compute shaders (WGSL) and, where relevant, native kernels (Metal, Vulkan/SPIR-V compute, D3D12, CUDA); profile with browser and platform tools (Chrome/Dawn GPU traces, PIX, Instruments/Metal System Trace, Snapdragon Profiler, Nsight, RenderDoc); and eliminate bottlenecks at the op and memory-bandwidth level.
- Apply efficiency techniques (dynamic resolution, token reduction, cross-frame caching/reuse, reduced-step diffusion samplers) as engineering levers to meet budgets on target SKUs.

Runtime & Systems Integration
- Evaluate, select, and drive adoption of WebGPU-targeted inference runtimes (ONNX Runtime Web, Transformers.js, WebLLM, TensorFlow.js) alongside native options (CoreML, ONNX Runtime, TFLite, ExecuTorch), and extend or build runtime/glue code where off-the-shelf options fall short of our diffusion workloads.
- Design and own the integration between the ML runtime and the game engine: real-time scheduling, threading, memory pooling, zero-copy buffer sharing between the inference path and the render path, and frame-budget management alongside the renderer.
- Architect inference systems that handle diverse inputs (images, text, primitives, metadata) and produce pixel-level outputs with real-time performance, robust to the messy realities of production (cold starts, thermal throttling, device fragmentation, backgrounding).
- Build the supporting engineering: model packaging and asset pipelines, on-device fallbacks and SKU-aware capability tiers, crash/quality telemetry, and automated on-device benchmarking in CI.

Research Productionization
- Partner closely with research scientists to turn novel architectures into implementations that are deployable, debuggable, and fast on device.
- Provide the feedback loop back into research: surface hardware constraints, op-support gaps, and cost models early so model design and deployment converge.
- Track breakthroughs in efficient inference (efficient attention, distillation, reduced-step diffusion) and assess them pragmatically: what actually moves latency/memory/power on our target devices, and what is worth the engineering cost.

Engineering Leadership
- Lead and mentor a team of engineers; set engineering best practices, code-review standards, performance-regression gates, and on-device benchmarking methodology.
- Champion a culture of measurement: define and enforce KPIs for latency, quality, memory, and power, and ensure they are tracked rigorously across the device matrix.
- Partner with platform engineers, product managers, and runtime teams to align ML capabilities with device-SKU constraints and product roadmaps.

What we're looking for
- 8+ years in software/ML engineering, with at least 4 years focused on on-device / edge inference or real-time, performance-critical systems.
- Proven production deployment of transformer- and/or diffusion-based models (e.g., ViT, Stable Diffusion) on mobile, desktop, or embedded hardware: shipped, not just prototyped.
- Hands-on experience deploying models through WebGPU (e.g., ONNX Runtime Web (WebGPU EP), Transformers.js, WebLLM, or TensorFlow.js), including writing/tuning WGSL compute shaders and working within WebGPU's adapter, device-limits, and binding model. Equivalent deep experience with a native GPU/compute API plus a clear path to WebGPU will also be considered.
- Hands-on expertise with at least one major inference runtime (ONNX Runtime / ORT Web, CoreML, TFLite, ExecuTorch) and deep understanding of operator fusion, memory layout, and runtime scheduling.
- Low-level performance engineering: strong command of at least one GPU/compute API (WebGPU/WGSL, Metal, Vulkan, D3D12, or CUDA) and the profiling tools to go with it. You can read a frame capture and a kernel trace and know where the time and memory go.
- Working knowledge of model-optimization techniques, including quantization (INT4/INT8/FP16), weight sharing, pruning, and distillation ...
Original posting: unity.com