Mehmet Turan Yardimci

Incoming M.Sc. Computer Science, Karlsruhe Institute of Technology · Robot Learning

I work on reinforcement learning fine tuning of vision language action policies, and on hierarchical control for humanoid robots. Most of it runs on a single consumer GPU, which is a constraint I treat as part of the problem rather than an excuse: if a method only works on a cluster, very few people can check it. The hardware platforms are the Unitree G1 and a UR5e arm; the software stack is NVIDIA Isaac Lab and LeRobot.

mehmetturanyardimci@hotmail.com GitHub LinkedIn Google Scholar arXiv

Download CV (Europass format, last updated July 2026)

Critic Architecture Matters: Dual vs. Unified Critics for Humanoid Loco-Manipulation

Mehmet Turan Yardımcı

ICRA 2026 · Workshop on Reinforcement Learning in the Era of Imitation Learning (RL4IL) · Poster, Vienna, June 2026

The paper isolates one design choice in multi objective reinforcement learning: whether a humanoid policy should share a single critic across locomotion and manipulation, or use separate critics with disjoint reward signals. Three policies are trained on a 23 degree of freedom Unitree G1 in Isaac Lab with the same observations, the same curriculum and the same hardware, so that only the critic architecture varies. A secondary result is that once the architecture is fixed, adding anti gaming reward machinery buys nothing further.
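To make the design choice concrete, here is a minimal sketch of how the two setups can differ at the advantage computation. It is an illustration under stated assumptions, not the paper's implementation: the function names, the per stream normalization and the equal mixing weights are all hypothetical, and plain lists stand in for tensors.

```python
from statistics import mean, pstdev


def discounted_returns(rewards, gamma=0.99):
    """Discounted return-to-go for a single episode."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


def normalize(xs, eps=1e-8):
    """Zero mean, unit variance (eps guards against a constant input)."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / (s + eps) for x in xs]


def unified_advantages(r_loco, r_manip, values, gamma=0.99):
    """One critic: the reward streams are summed before the baseline is subtracted."""
    total = [a + b for a, b in zip(r_loco, r_manip)]
    returns = discounted_returns(total, gamma)
    return normalize([g - v for g, v in zip(returns, values)])


def dual_advantages(r_loco, r_manip, v_loco, v_manip, gamma=0.99, weights=(0.5, 0.5)):
    """Two critics: each stream has its own baseline and normalization (assumed mixing rule)."""
    adv_l = normalize([g - v for g, v in zip(discounted_returns(r_loco, gamma), v_loco)])
    adv_m = normalize([g - v for g, v in zip(discounted_returns(r_manip, gamma), v_manip)])
    return [weights[0] * a + weights[1] * b for a, b in zip(adv_l, adv_m)]
```

With a dense, large locomotion reward and a sparse, small manipulation reward, the unified version lets the larger stream dominate the advantage, while the dual version gives each stream a comparable scale before mixing.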

Project page arXiv OpenReview Code

smolvla_flow_rl

Online reinforcement learning fine tuning of a flow matching vision language action policy, on a single 12GB consumer GPU. Open source, Apache 2.0.


Two LIBERO Long episodes from my own runs, side by side. It is footage, not a measurement: two episodes, chosen, with no repetition and no statistics behind either panel.

No results are reported in that repository, and none are reported here. Everything published so far describes the machinery: what it trains, what it weighs and how long it takes. Measurements will be released with the accompanying paper.

A flow matching policy produces an action chunk by integrating a learned velocity field from noise. Integrating it deterministically gives no action likelihood, so a policy gradient has nothing to act on. The repository takes the standard route around that, following the approach of pi_RL: one step of the integration is treated as stochastic, so the sampled chunk has a tractable log probability. The pretrained action head is left bit identical and driven through a callback. What is built around that is the part that makes such a run trustworthy rather than merely runnable:

Stochastic sampler for flow matching policies, leaving the pretrained action head untouched.
Policy gradient update that grades the action the environment actually executed, over the executed part of the chunk only.
Critic head with adaptive target rescaling, kept outside the action head.
Parameter manifest and learning rate proof printed before the first update, so what is training and at what rate is a logged fact rather than an assumption.
Process isolated simulator harness with a sparse terminal reward and no shaping.
Deterministic evaluation with a completeness gate that refuses to produce a number from an unfinished evaluation.
Verification protocol whose mechanical subset runs from a single gate entry point.
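The idea of making one integration step stochastic can be sketched in a few lines. This is a toy illustration, not the repository's code: the two dimensional `velocity` field, the noise scale `sigma`, the choice of `noisy_step` and all function names are assumptions made for the example. The returned value is the log probability of the single noisy transition, which is the quantity a policy gradient can differentiate.

```python
import math
import random


def velocity(x, t):
    """Toy stand-in for a learned velocity field: pulls x toward a fixed target."""
    target = [1.0, -1.0]
    return [tg - xi for tg, xi in zip(target, x)]


def gaussian_log_prob(x, mean, std):
    """Log density of an isotropic Gaussian, summed over dimensions."""
    return sum(
        -0.5 * ((xi - mi) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)
        for xi, mi in zip(x, mean)
    )


def sample_chunk(num_steps=10, noisy_step=0, sigma=0.5, rng=None):
    """Euler integration from noise; exactly one step adds Gaussian noise."""
    rng = rng or random.Random(0)
    dt = 1.0 / num_steps
    x = [rng.gauss(0.0, 1.0) for _ in range(2)]  # start from noise
    log_prob = None
    for k in range(num_steps):
        t = k * dt
        mean = [xi + vi * dt for xi, vi in zip(x, velocity(x, t))]
        if k == noisy_step:
            # Stochastic step: sample around the Euler update and record its likelihood.
            std = sigma * math.sqrt(dt)
            x = [m + std * rng.gauss(0.0, 1.0) for m in mean]
            log_prob = gaussian_log_prob(x, mean, std)
        else:
            # Deterministic step: plain Euler update, no likelihood contribution.
            x = mean
    return x, log_prob
```

Calling `sample_chunk()` returns the final sample together with the log probability of the noisy step. In this sketch the velocity function itself is never modified, only the integration loop around it, which mirrors the goal of leaving a pretrained action head untouched.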

SmolVLA LeRobot Flow matching PPO LIBERO PyTorch RLinf 12GB VRAM

Project page → GitHub repository →

Research Interests

RL fine tuning of VLA policies: turning a pretrained vision language action policy into one that improves from its own experience, without discarding what imitation gave it

Flow matching and diffusion policies: sampling, likelihoods and policy gradients for generative action models

Humanoid loco manipulation: whole body control where locomotion and manipulation share one body and compete for it

Verification and reproducibility: making a training run report enough about itself that a silent failure cannot pass for a result

Isaac G1 Humanoid VLM-RL Loco-Manipulation

Hierarchical reinforcement learning for the Unitree G1 humanoid: whole body control with vision language model task planning on top.

43 Degrees of Freedom

4,096 Parallel Environments

17K+ Steps / Second

Multi stage curriculum training pipeline:

Standing → Locomotion → Torso Control → Dual Arm Reaching → Grasping → Loco-Manipulation

Triple actor critic architecture: separate policies for locomotion (legs and waist), arm reaching (shoulders, elbows, wrists), and hand control (DEX3 three finger hands)
Dual actor critic loco manipulation: decoupled locomotion and arm policies trained sequentially with curriculum learning, for coordinated walking and reaching
Anti gaming mechanisms: absolute target sampling, three condition reach validation, movement centric rewards, so the curriculum cannot be exploited by a policy that stands still
29 DoF whole body model: 12 leg, 3 waist, 8 arm and 6 wrist joints, plus 14 DEX3 finger joints for grasping, 43 degrees of freedom in total
VLM planning layer: Qwen3-VL running locally through Ollama turns a natural language instruction into a skill sequence, with replanning from updated world state
Goal task: squat, pick an object off the ground, stand, walk to a table, place it in a box

NVIDIA Isaac Lab RSL-RL PyTorch PPO Qwen3-VL CUDA Curriculum RL Domain Randomization

GitHub repository → Hierarchical control →

Projects

G1 Unitree Locomotion Control (ULC)

Multi stage PPO pipeline for Unitree G1 whole body locomotion: flat walking, velocity tracking, terrain adaptation, torso stabilization and arm coordination, across a five stage curriculum. This is the codebase behind the ICRA workshop paper above.

Isaac Lab RSL-RL PPO CUDA

View on GitHub →

G1 Vision Language Action pipeline

An RL to IL to VLA pipeline for the G1: expert demonstrations collected from trained RL policies, converted to a LeRobot dataset, then distilled into end to end visuomotor policies with ACT, Diffusion Policy and GR00T N1.6. In progress, repository not public yet.

LeRobot ACT Diffusion Policy GR00T

Go2 VLM-RL Navigation

Language conditioned quadruped navigation on the Unitree Go2, combining a vision language model with reinforcement learning for instruction following and problem solving in Isaac Lab.

Isaac Lab VLM Navigation

View on GitHub →

Isaac Lab Anymal-C Quadruped Locomotion

PPO implemented from scratch for the ANYmal-C quadruped, reaching 17,000 or more steps per second on an RTX 5070 Ti with domain randomization and reward shaping across 4,096 parallel environments.

Isaac Lab PPO PyTorch CUDA

View on GitHub →

MuJoCo Ant-v5 PPO from Scratch

PPO and SAC written with NumPy and PyTorch alone for MuJoCo Ant-v5, reaching a return above 2,700 by shaping away the hopping gait that the default reward favors. Sixteen parallel environments, GAE, observation normalization and learning rate annealing.
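For readers unfamiliar with GAE, a minimal sketch of generalized advantage estimation over one rollout segment follows. It is not taken from the project's code: the function name and default coefficients are assumptions, and episode termination inside the segment is deliberately not handled.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for a single uninterrupted rollout segment.

    rewards[t] and values[t] are aligned per step; last_value bootstraps the
    state that follows the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        # One step TD error, then an exponentially weighted sum of later TD errors.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

Setting `lam=1` recovers Monte Carlo returns minus the value baseline, and `lam=0` recovers the one step TD error, which is the bias and variance trade off the method is designed to expose.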

MuJoCo PPO SAC NumPy

View on GitHub →

BARN Benchmark: Local Path Planners

A comparative benchmark of TEB, DWA, MPC and Lattice local planners on the BARN navigation dataset in ROS and Gazebo, and the subject of my undergraduate thesis. The paper is under review.

ROS Gazebo Navigation Python

View on GitHub →

Live Actor-Critic Training (CartPole)

An interactive Streamlit application that trains an actor critic agent in the browser while you change its hyperparameters, built to make the learning dynamics visible rather than described.

Streamlit Actor-Critic Gymnasium

View on GitHub →

YOLO Fixed Wing UAV Detection

Real time detection for autonomous fixed wing UAV operations, running on a Jetson Nano alongside a Pixhawk flight controller, built and flown for TEKNOFEST competitions.

YOLO OpenCV Jetson Nano Pixhawk

View on GitHub →

PID Control with NXT Robot

A PID controller on LEGO Mindstorms NXT hardware in NXC, with real time sensor feedback closing the loop on motor control. The first control system I tuned by hand rather than by reading about it.
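As an illustration of the loop being closed, here is a minimal discrete PID sketch, written in Python rather than NXC. The gains, the output limits and the first order plant used to exercise it are assumptions chosen for the example, not values from the NXT project.

```python
class PID:
    """Discrete PID controller with output clamping."""

    def __init__(self, kp, ki, kd, out_min=-100.0, out_max=100.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint, measurement, dt):
        error = setpoint - measurement
        self.integral += error * dt
        # No derivative term on the first call, when there is no previous error.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        out = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(self.out_min, min(self.out_max, out))


def simulate(pid, setpoint=10.0, steps=5000, dt=0.01, tau=0.5):
    """Drive a toy first order plant (hypothetical motor model) and return its final output."""
    y = 0.0
    for _ in range(steps):
        u = pid.update(setpoint, y, dt)
        y += (u - y) * dt / tau
    return y
```

For example, `simulate(PID(kp=2.0, ki=1.0, kd=0.0))` settles close to the setpoint of 10, because the integral term removes the steady state error that a proportional only controller would leave on this plant.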

PID NXC Control Systems

View on GitHub →

Publications

Critic Architecture Matters: Dual vs. Unified Critics for Humanoid Loco-Manipulation

Yardımcı, M.T.

ICRA 2026 Workshop on Reinforcement Learning in the Era of Imitation Learning (RL4IL), poster, Vienna, June 2026

arXiv:2606.11891 OpenReview Project page

Benchmarking Local Path Planners in ROS using the BARN Dataset

Yardımcı, M.T., Çoğurcu, Y.E.

Cukurova University Journal of the Faculty of Engineering, under review, 2026

Code and data

smolvla_flow_rl: a training substrate for online RL fine tuning of flow matching VLA policies

Yardımcı, M.T.

Software, Apache 2.0, 2026. The accompanying paper is in preparation and is where measurements will appear.

Project page Repository

Education

Jul 2026 — Sep 2028 (expected)

M.Sc. in Computer Science

Karlsruhe Institute of Technology (KIT), Germany

Incoming student. Intended focus: robot learning, reinforcement learning and vision language action models

Oct 2021 — Oct 2025

B.Sc. in Computer Engineering (English)

Cukurova University, Adana, Turkey

Honor student, Fall 2023 to 2024. High honor student, Fall 2024 to 2025
Thesis: Benchmarking Local Path Planners in ROS using the BARN Dataset
Coursework in artificial intelligence, pattern recognition, optimal control, reinforcement learning, robotics and human computer interaction

Sep 2024 — Feb 2025

Erasmus+ Exchange, Computer Science (English)

Bialystok University of Technology, Poland

GPA 4.9 out of 5.0
Robotics and automation, PID control, sensor integration, robot programming, computer graphics

Experience

Aug 2025 — Sep 2025

Computer Engineering Intern

Kivanc Tekstil, Adana

In house software development with .NET and C#

Jul 2025 — Aug 2025

Computer Engineering Intern

Medcem Cement Group, Silifke

Algorithmic solutions for production side software, with .NET and C#
Software architecture, debugging and interface design

May 2023 — Jul 2025

Team Leader, then Software Manager

1.5 Adana AGM Alkar UAV Team

Led a team of more than ten people building autonomous fixed wing UAV systems
TUBITAK and TEKNOFEST competition projects across three years
Vision based autonomy on embedded hardware: YOLO detection on a Jetson Nano, Pixhawk flight control, PID tuning

Skills

Robot learning

PPO SAC Actor-Critic GAE Reward shaping Curriculum learning Domain randomization Imitation learning Flow matching policies Diffusion policy

Vision language action

SmolVLA LeRobot ACT GR00T N1.6 LIBERO Qwen3-VL Florence-2

Simulation

NVIDIA Isaac Lab Isaac Sim MuJoCo Gazebo RViz ROS / ROS2

Engineering

PyTorch CUDA NumPy Weights & Biases TensorBoard Linux and WSL2 Git Python C / C++

Hardware

Unitree G1 Unitree Go2 UR5e RealSense D435i NVIDIA Jetson Pixhawk RTX 5070 Ti

Languages, Tests and Certificates

Turkish native

English professional working proficiency

TOEFL iBT 110 out of 120, March 2026

YDS 86.25, November 2025

Duolingo English Test 130, January 2026

Coursera Supervised Machine Learning: Regression and Classification, DeepLearning.AI and Stanford University