Master's Thesis Project - 2026 (Fall)

Modulai AB, Stockholm, Stockholms län, Sweden. Published 13 May 2026

Full-time, hybrid, mid

1. Reinforcement Learning for Large Language Models (LLMs)

Background and description

Modulai is offering a master's thesis opportunity focused on applying Reinforcement Learning (RL) to improve the capabilities of large language models (LLMs). RL first became pivotal in aligning LLMs with human preferences, but recent work shows that its role now extends much further: RL has become the dominant paradigm for eliciting reasoning, enabling models to acquire advanced problem-solving strategies and adapt to complex, multi-step tasks.

Recent advancements highlight the transformative role of RL in LLM post-training:

- DeepSeek-R1 demonstrated that reasoning ability can be induced through large-scale RL with verifiable rewards, including a pure-RL variant (R1-Zero) trained with no supervised fine-tuning at all, popularizing RL as the central tool for building reasoning models.
- DeepSeekMath explored how reinforcement learning can enable models to handle multi-step mathematical reasoning, and introduced Group Relative Policy Optimization (GRPO), the RL method now widely used across the field.
- Tulu 3 introduced a family of fully open post-trained models, leveraging Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and a technique dubbed Reinforcement Learning with Verifiable Rewards (RLVR).
- DAPO released a fully open-source, large-scale RL system that refines GRPO with techniques such as Clip-Higher and dynamic sampling to stabilize long chain-of-thought training, surpassing R1-Zero-level results with substantially fewer training steps.
- ReTool introduced reinforcement learning for tool use, showing how LLMs can learn to combine text-based reasoning and code interpreters for complex tasks.

This project aims to investigate RL approaches for improving LLMs in specialized domains (such as reasoning and tool use).
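To give a concrete feel for two of the ideas mentioned above, the sketch below pairs a rule-based verifiable reward with the group-relative advantage normalization at the core of GRPO: several completions are sampled for the same prompt, each is scored, and each score is normalized against the mean and standard deviation of its group. This is a minimal illustration, not the implementation from any of the cited papers. The function names, the exact-match reward rule, the use of the population standard deviation, and the `eps` constant are all assumptions made for the example.

```python
from statistics import mean, pstdev


def verifiable_reward(completion: str, reference: str) -> float:
    """Hypothetical rule-based reward: 1.0 if the last line of the
    completion exactly matches the reference answer, else 0.0."""
    lines = completion.strip().splitlines()
    final = lines[-1].strip() if lines else ""
    return 1.0 if final == reference.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group (GRPO-style).

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every completion gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of sampled completions for one prompt (made-up examples).
completions = ["2 + 2 = 4\n4", "I think\n5", "4", "four"]
rewards = [verifiable_reward(c, "4") for c in completions]
advantages = group_relative_advantages(rewards)

for completion, reward, advantage in zip(completions, rewards, advantages):
    print(f"{completion!r}: reward={reward}, advantage={advantage:+.3f}")
```

Completions that beat their group average receive a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline. When all completions in a group receive the same reward, every advantage is zero and that group contributes no learning signal.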
You will explore open-weight models, implement and compare RL methods inspired by the latest research, and evaluate how reinforcement learning impacts model capabilities. Through this work, you will contribute to the growing understanding of how RL can shape the next generation of LLMs.

ML techniques and tools

- Open-weight LLMs
- Reinforcement learning for LLMs
- Python, PyTorch, Git, Hugging Face

References

- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-AI, 2025): https://arxiv.org/abs/2501.12948
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (Shao et al., 2024): https://arxiv.org/abs/2402.03300
- Tulu 3: Pushing Frontiers in Open Language Model Post-Training (Lambert et al., Allen Institute for AI, 2024): https://arxiv.org/abs/2411.15124
- DAPO: An Open-Source LLM Reinforcement Learning System at Scale (Yu et al., ByteDance Seed & Tsinghua University, 2025): https://arxiv.org/abs/2503.14476
- ReTool: Reinforcement Learning for Strategic Tool Use in LLMs (Feng et al., ByteDance Seed, 2025): https://arxiv.org/abs/2504.11536

2. Vision-Language-Action models (Stockholm)

Background and description

We also offer a master's thesis project in the emerging field of Vision-Language-Action (VLA) models for robotics. VLA models unify computer vision, natural language processing, and robotic control into end-to-end systems, enabling robots to understand visual scenes, interpret human instructions, and execute tasks without manual programming. Recent research (e.g., Liang et al., 2024) shows that VLA models can perform complex tasks such as “pick up the red mug from the cluttered table.” This thesis invites students to explore and advance these models, contributing to one of the most actively researched directions in AI-powered robotics.

Original posting: modulai.teamtailor.com