Aaryamonvikram Singh

Research Engineer at MBZUAI focused on shipping reasoning-first and multilingual LLMs with rigorous evaluation. I build the eval harnesses, data pipelines, and release tooling behind models like K2 and Nanda — and I care about making them safe, reliable, and multilingual.

Experience

Research Engineer Sep 2025 – Present
MBZUAI — Institute of Foundation Models, Abu Dhabi
  • Co-authored K2-V2 (70B) and K2-Think (32B); supported the K2-Think V2 (70B) release and evaluation
  • Built evaluation tooling for long-context, math, code, and safety benchmarks — prompting, deterministic scoring, reporting
  • Added regression tests and automated reports to catch quality/safety regressions pre-release
  • Delivered technical talks and office hours for the K2-Think Hackathon series
Research Assistant Oct 2024 – Aug 2025
MBZUAI — Institute of Foundation Models, Abu Dhabi
  • Led development and release of the Nanda family (10B/87B) models (EACL 2026); drove the bilingual Hindi–English data strategy end-to-end
  • Contributed dataset curation and evaluation for Jais-2 (Arabic) and Sherkala-Chat (Kazakh, COLM 2025)
  • Curated Suraksha Eval, a Hindi safety benchmark, and built the Hindi TxT360 pretraining dataset
  • Co-developed FinChain, a financial reasoning benchmark spanning 12 domains, with 30+ LLMs benchmarked
Research Fellow Mar 2024 – Sep 2024
SimPPL (supervised by Swapneel Mehta), Remote
  • Designed multi-agent experiments to measure and reduce fake-news sharing between LLM agents
  • Collaborated with postdocs at MIT, Princeton, and Oxford on intervention design and evaluation
NLP Intern Jun 2023 – Jan 2024
MBZUAI — Prof. Preslav Nakov, Abu Dhabi
  • Built a multithreaded Python pipeline that collected 160K+ news articles from 5K+ sources
  • Shipped an end-to-end media factuality and bias scoring system (Streamlit, FastAPI, SQLite)
  • Trained transformers and NELA+CatBoost ensembles for article-level prediction and source profiling

Publications

  • 70B reasoning model with open weights — arXiv 2025
  • 32B reasoning without massive compute — arXiv 2025
  • 10B and 87B models with bilingual data strategy — EACL 2026
  • LLM for a moderately-resourced language — COLM 2025
  • Open model for Hindi conversation — arXiv 2025
  • Framing analysis on raw text — arXiv 2025
  • Open-weight models and datasets on HuggingFace — Open Source

Expertise

Reasoning LLMs — chain-of-thought, parameter-efficient reasoning, 70B scale, K2 series
Multilingual NLP — Hindi, Arabic, Kazakh; bilingual data curation and strategy
Evaluation & Safety — eval harnesses, safety benchmarks, regression testing, long-context
Release Engineering — CI/CD for models, automated reporting, HuggingFace, open-weight releases
Data Pipelines — large-scale collection, curation, quality filtering, pretraining data
Misinformation Research — multi-agent simulation, factuality scoring, bias detection

Open to research engineering & applied ML roles.