Master's Thesis Project - 2026 (Fall)

Modulai, Stockholm, Stockholm County, Sweden. Published 13 May 2026
Permanent, hybrid
<h4><strong>1. Reinforcement Learning for Large Language Models (LLMs)</strong></h4><p><br><span>Background and description</span><br><br><span>Modulai is offering a master's thesis opportunity focused on applying reinforcement learning (RL) to improve the capabilities of large language models (LLMs). RL was initially pivotal in aligning LLMs with human preferences, but recent work shows that its role now extends much further: RL has become the dominant paradigm for eliciting reasoning, enabling models to acquire advanced problem-solving strategies and adapt to complex, multi-step tasks.</span><br><br><span>Recent advances highlight the transformative role of RL in LLM post-training:</span><br><br>- <span>DeepSeek-R1 demonstrated that reasoning ability can be induced through large-scale RL with verifiable rewards, including a pure-RL variant (R1-Zero) trained with no supervised fine-tuning at all, popularizing RL as the central tool for building reasoning models.</span><br><br>- <span>DeepSeekMath explored how RL can enable models to handle multi-step mathematical reasoning, and introduced Group Relative Policy Optimization (GRPO), the RL method now widely used across the field.</span><br><br>- <span>Tulu 3 introduced a family of fully open post-trained models, leveraging Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and a technique dubbed Reinforcement Learning with Verifiable Rewards (RLVR).</span><br><br>- <span>DAPO released a fully open-source, large-scale RL system that refines GRPO with techniques such as Clip-Higher and dynamic sampling to stabilize long chain-of-thought training, surpassing R1-Zero-level results with substantially fewer training steps.</span><br><br>- <span>ReTool introduced RL for tool use, showing how LLMs can learn to combine text-based reasoning and code interpreters for complex tasks.</span><br><br><span>This project aims to investigate RL approaches for
improving LLMs in specialized domains (such as reasoning and tool use). You will explore open-weight models, implement and compare RL methods inspired by the latest research, and evaluate how reinforcement learning impacts model capabilities. Through this work, you will contribute to the growing understanding of how RL can shape the next generation of LLMs.</span><br><br><span>ML techniques and tools</span></p><ul><li><p><span>Open-weight LLMs</span></p></li><li><p><span>Reinforcement learning for LLMs</span></p></li><li><p><span>Python, PyTorch, Git, Hugging Face</span><br></p></li></ul><p>References</p><p>- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-AI, 2025):<span> </span><a target="_blank" href="https://arxiv.org/abs/2501.12948">https://arxiv.org/abs/2501.12948</a><span> </span><a target="_blank" href="https://arxiv.org/abs/2501.12948v2"><span>arXiv</span></a><br>- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (Shao et al., 2024):<span> </span><a target="_blank" href="https://arxiv.org/abs/2402.03300">https://arxiv.org/abs/2402.03300</a><span> </span><a target="_blank" href="https://arxiv.org/abs/2402.03300"><span>arXiv</span></a><br>- Tulu 3: Pushing Frontiers in Open Language Model Post-Training (Lambert et al., Allen Institute for AI, 2024):<span> </span><a target="_blank" href="https://arxiv.org/abs/2411.15124">https://arxiv.org/abs/2411.15124</a><span> </span><a target="_blank" href="https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B"><span>Hugging Face</span></a><br>- DAPO: An Open-Source LLM Reinforcement Learning System at Scale (Yu et al., ByteDance Seed & Tsinghua University, 2025):<span> </span><a target="_blank" href="https://arxiv.org/abs/2503.14476">https://arxiv.org/abs/2503.14476</a><span> </span><a target="_blank" href="https://arxiv.org/pdf/2503.14476"><span>arXiv</span></a><br>- ReTool: Reinforcement Learning for Strategic Tool Use in LLMs (Feng et al., 
ByteDance Seed, 2025):<span> </span><a target="_blank" href="https://arxiv.org/abs/2504.11536">https://arxiv.org/abs/2504.11536</a><span> </span><a target="_blank" href="https://arxiv.org/abs/2504.11536v1"><span>arXiv</span></a><br></p><h4>2. Vision-Language-Action models (Stockholm)</h4><p><br>Background and description<br><br>We also offer a master's thesis project in the emerging field of Vision-Language-Action (VLA) models for robotics. VLA models unify computer vision, natural language processing, and robotic control into end-to-end systems, enabling robots to understand visual scenes, interpret human instructions, and execute tasks without manual programming.<br><br>Recent research (e.g.,<span> </span><em>Liang et al., 2024</em>) shows that VLA models can perform complex tasks such as “pick up the red mug from the cluttered table.” This thesis invites students to explore and advance these models, contributing to one of the most actively researched dir


Original posting: modulai.teamtailor.com