AI Research Engineer · Upstage · Seoul

Dongjun Kim

I turn deep research insights into real-world AI systems.

I'm an AI Research Engineer at Upstage, where I evaluate Solar, our frontier model family, through evaluation campaigns spanning 50+ benchmarks that steer training and data decisions. My research uses mechanistic interpretability to examine what models encode and what benchmarks actually measure, with first-author orals at ACL 2026 and EMNLP 2025. Previously, I was in the NLP&AI Lab at Korea University, advised by Dr. Heuiseok Lim.

10 Publications · 4 First-author papers · 2× Main-conference orals (ACL · EMNLP)
02 · Publications


01

Removes language identity from embeddings at inference time to improve cross-lingual retrieval.

Dongjun Kim, Jeongho Yoon, Chanjun Park, Heuiseok Lim

ACL 2026 · Main Oral · First author
LangSAE Editing method overview
Background
In multilingual retrieval, dense embeddings also encode language signals that can overshadow relevant cross-language evidence
Problem
Remove language signals at inference time to improve cross-language retrieval, without retraining the base model or re-encoding text
Method
  • Decompose embeddings into latent units with a sparse autoencoder
  • Identify language-associated latent units via cross-language activation statistics
  • Post-hoc editing removes those signals at inference while preserving the original embedding dimensions
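A minimal sketch of the post-hoc editing step, assuming a standard ReLU sparse autoencoder and a list of language-associated latent indices that has already been identified. The function name, parameter names, and the choice to subtract the decoder contribution (rather than re-decode the whole embedding) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def remove_language_latents(x, W_enc, b_enc, W_dec, b_dec, lang_idx):
    """Subtract the decoder contribution of selected SAE latents from x.

    x: (d,) embedding; W_enc: (m, d); b_enc: (m,); W_dec: (m, d); b_dec: (d,)
    lang_idx: indices of latents flagged as language-associated (assumed given).
    """
    # Encode into sparse latent activations (ReLU SAE, an assumption).
    z = np.maximum(0.0, W_enc @ (x - b_dec) + b_enc)
    # Reconstruct only what the flagged latents write back into embedding space.
    lang_part = z[lang_idx] @ W_dec[lang_idx]
    # Editing in the original space keeps the embedding dimensionality unchanged.
    return x - lang_part
```

Because the edit is a subtraction in the original space, the output can be passed to an existing retrieval index without re-encoding the text.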
Results
  • Consistent retrieval improvements across multiple languages
  • Strongest gains when queries and documents use different writing systems
  • Drops into existing systems with no retraining or re-encoding required
Role
  • First author: led the methodology and overall study design
  • Implemented the SAE-based language-latent identification and removal pipeline
  • Designed the multilingual retrieval experiments; led analysis and writing
02

Locates the compact parameter region that LLM coding ability lives in.

Dongjun Kim, Minhyuk Kim, Yong Chan Chun, Chanjun Park, Heuiseok Lim

ACL 2026 · Findings · First author
Coding Spot analysis
Background
LLMs are strong at code generation and comprehension, but whether that ability lives in shared or language-specific parameters was unknown
Problem
Identify and characterize the parametric region responsible for coding: the Coding Spot
Method
  • Isolate candidate parameter subsets per programming language and across languages
  • Apply targeted modification and ablation to the identified subset
  • Compare coding vs. non-coding task performance before and after to establish causal contribution
Results
  • A compact Coding Spot exists: modifying it sharply degrades coding tasks
  • Non-coding capabilities are largely preserved, showing compartmentalization akin to functional specialization in the brain
Role
  • First author: conceived the approach and methodology
  • Implemented the ablation experiments and evaluation sweeps at layer/module granularity as reproducible tooling
  • Designed control tasks verifying non-coding preservation; led analysis and writing
03

Decomposes benchmark scores into the cognitive abilities they actually measure.

Dongjun Kim, Gyuho Shim, Yong Chan Chun, Minhyuk Kim, Chanjun Park, Heuiseok Lim

EMNLP 2025 · Main Oral · First author
Benchmark Profiling framework
Background
Benchmark scores are read as evidence of specific capabilities, yet a single number masks the mixture of abilities a task actually requires, and gains often fail to match user-perceived competence
Problem
Establish a framework that systematically diagnoses and quantifies which cognitive abilities each benchmark actually measures
Method
  • Define 10 cognitively grounded abilities with ability-specific diagnostic datasets
  • Locate ability-relevant parameters via gradient-based importance
  • Apply targeted MLP weight ablation and compare original vs. ablated performance to compute an Ability Impact Score (AIS)
  • Profile three instruction-tuned models across ten widely used benchmarks
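The locate-ablate-compare loop above can be sketched as follows. The importance rule (|weight × gradient|), the top-fraction cutoff, and the relative-drop score are assumptions chosen for illustration; the exact AIS definition is not given here, so `relative_drop` should be read as one plausible form rather than the paper's formula.

```python
import numpy as np

def top_k_mask(weights, grads, frac):
    """Boolean mask over the top `frac` of weights by |weight * gradient|."""
    importance = np.abs(weights * grads)
    k = max(1, int(round(frac * importance.size)))
    threshold = np.sort(importance, axis=None)[-k]
    # Ties at the threshold are all included, so the mask may exceed k entries.
    return importance >= threshold

def ablate(weights, mask):
    """Return a copy of the weights with the masked entries zeroed."""
    out = weights.copy()
    out[mask] = 0.0
    return out

def relative_drop(original_score, ablated_score):
    """Fraction of the original score lost after ablation (assumed score form)."""
    if original_score == 0:
        return 0.0
    return (original_score - ablated_score) / original_score
```

In use, `grads` would come from the ability-specific diagnostic dataset, and the benchmark would be scored once with the original weights and once with the output of `ablate`.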
Results
  • Most benchmarks tap a mixture of abilities, not the single skill on their label
  • Similarly labeled datasets rely on distinct ability mixes
  • Code-generation benchmarks reward broad multi-skill improvement; narrow fine-tuning yields only modest gains
  • Task-irrelevant abilities can interfere and hurt performance
Role
  • First author: led the methodology and overall study design
  • Built the diagnostic datasets and the gradient-importance / ablation pipelines as reproducible tooling
  • Automated large-scale experiment sweeps; designed and ran the human-expert evaluation
  • Led analysis, writing, and the open-source release
04

Keeps LLMs current with Korean legal amendments through on-the-fly knowledge editing.

Jaehyung Seo, Dahyun Jung, Jaewook Lee, Yong Chan Chun, Dongjun Kim, Hwijung Ryu, Donghoon Shin, Heuiseok Lim

EMNLP 2025 · Findings
KoLEG framework
Background
Korean law evolves through frequent, fine-grained amendments; even minor wording changes carry legal weight, yet retraining an LLM for every revision is impractical
Problem
Edit legal knowledge on the fly, achieving edit success, preservation, and generalization under continuous sequential updates
Method
  • Editing-Aware Learning Strategy combined with continuous retrieval
  • LawEdit Retriever for amendment-aware retrieval
  • Korean Legislative Amendment Dataset, built for temporal dynamics and linguistic subtlety
  • Timestamp-aware evaluation protocol
Results
  • Outperforms locate-then-edit and retrieval-based editing baselines while preserving linguistic capabilities
  • Stays robust under sequential edits; improves precedent-application tasks
  • Qualitatively validated by legal experts
Role
  • Designed and built the Korean law crawling pipeline: effective-date alignment, high-precision filtering, change tracking
  • Implemented timestamp-aware evaluation and quality control
  • Analyzed results and co-wrote the paper
05

A tri-modal benchmark of cultural awareness across 8 Asian countries and 10 languages.

Weihua Zheng, Zhengyuan Liu, et al., incl. Dongjun Kim (35 authors)

arXiv · 2025 · Under review · ICLR 2026
MMA-ASIA benchmark overview
Background
Multimodal understanding and reasoning often degrade outside Western, high-resource settings, and culture-aware evaluation has lagged behind
Problem
Quantify cultural awareness across Asian languages and modalities, and verify that models answer for the right reasons
Method
  • Human-curated benchmark across 8 Asian countries, 10 languages, 27,000 questions; 79% require multi-step cultural reasoning
  • First dataset input-aligned across text, image, and speech, enabling direct cross-modal transfer tests
  • Five-dimensional protocol: country disparities, cross-lingual and cross-modal consistency, generalization, grounding validity
  • A grounding validation module detects shortcut learning; Vision-ablated Prefix Replay (VPR) probes modality divergence
Results
  • Reveals cultural-awareness gaps across countries and languages
  • Demonstrates cross-modal inconsistency and shortcut-learning risks in current models
Role
  • Led the Korean subset: taxonomy design, image collection, and QA authoring guidelines
  • Evaluated diverse VLMs and calibrated quality-control criteria
  • Contributed analysis and writing
06

The first open-ended benchmark for Korean instruction following.

Dongjun Kim, Chanhee Park, Chanjun Park, Heuiseok Lim

arXiv · 2025 · Under review · AACL 2026 · First author
KITE benchmark overview
Background
Instruction-following evaluation is English-centric; Korean, with honorifics, rich morphology, and dual numbering systems, lacked an open-ended benchmark
Problem
Measure both general and Korean-specific instruction following on diverse open-ended tasks, fairly and reproducibly
Method
  • Tasks and prompts target Korean-specific phenomena (honorific shifts, morphology, dual numerals) alongside general instructions
  • Open-ended generation rather than multiple choice
  • Evaluation pipeline combines automated metrics with human assessment
  • Full public release of dataset and code
Results
  • Reveals performance disparities across models with per-ability insight into strengths and weaknesses
  • Released on Hugging Face and GitHub; used as a reference benchmark for Korean LLM comparison
Role
  • First author: designed and built the benchmark (tasks, prompts, scoring)
  • Ran large-scale automated and human evaluations; led the analysis
  • Released and maintains the public artifacts
07

Syntax-based demonstration retrieval that fixes term-boundary errors in LLM term extraction.

Yong Chan Chun, Minhyuk Kim, Dongjun Kim, Chanjun Park, Heuiseok Lim

ACL 2025 · Findings
Syntactic retrieval for ATE
Background
  • Automatic Term Extraction (ATE) identifies the domain-specific terms that underpin machine translation, information retrieval, and more
  • LLM-based ATE is barely explored, and term-boundary identification is the hard part
Problem
Semantic-similarity demonstration retrieval misleads term-boundary recognition under domain mismatch, so retrieve demonstrations by syntax instead
Method
  • Retrieve few-shot demonstrations by syntactic similarity over parse trees using FastKASSIM
  • Domain-agnostic guidance for capturing term boundaries
  • Analyze how lexical overlap between the query and retrieved examples affects performance
Results
  • Consistent F1 gains on three specialized ATE benchmarks (ACTER, ACLR2, BCGM)
  • Notably more stable than semantic retrieval in cross-domain settings
Role
  • Co-author: contributed to experiment design and methodology
  • Supported the syntactic retrieval experiments and analysis
08

A leaderboard that regenerates itself daily from real-world data.

Chanjun Park, Hyeonseok Moon, Dongjun Kim, Seolhwa Lee, Jaehyung Seo, Sugyeong Eo, Heuiseok Lim

arXiv · 2025
Self-Improving Leaderboard overview
Background
Static test-set leaderboards fail to reflect rapidly changing models and data distributions, and they incentivize test-set overfitting
Problem
Build a fair, sustainable comparison framework with time-aware ranking and regression detection
Method
  • Agent system collects daily news → LLM agents generate and review Q&A items → multi-LLM automated evaluation (vLLM backend) → time-aware ranking with stability and volatility metrics
  • Live leaderboard operated on Hugging Face Spaces
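One way to picture time-aware ranking with a volatility metric is sketched below. The definitions are assumptions made for illustration: rank is recomputed per day from that day's scores, "mean rank" stands in for the time-aware ranking, and "volatility" is the population standard deviation of a model's daily rank. The leaderboard's actual metrics may differ.

```python
from statistics import mean, pstdev

def daily_ranks(scores_by_day):
    """Map each model to its list of ranks (1 = best) over successive days."""
    ranks = {}
    for day_scores in scores_by_day:
        ordered = sorted(day_scores, key=day_scores.get, reverse=True)
        for position, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(position)
    return ranks

def rank_summary(scores_by_day):
    """Mean rank and rank volatility (population std. dev.) per model."""
    return {
        model: {"mean_rank": mean(r), "volatility": pstdev(r)}
        for model, r in daily_ranks(scores_by_day).items()
    }
```

A model with a good mean rank but high volatility would be a candidate for closer regression checks.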
Results
  • Stabilized real-time benchmark operations
  • Time-aware ranking with stability/volatility metrics and quarterly regression detection
  • Automation reduced operational cost and risk
Role
  • Agentic engineering lead: built the news-crawling agents (sources, scheduling, deduplication)
  • Designed the LLM agent pipeline converting daily news into reviewed Q&A benchmark items
  • Implemented the vLLM-based automated evaluation pipeline and time-aware ranking
  • Operated the live leaderboard on HF Spaces with monitoring and data hygiene
09

Measures ChatGPT and GPT-4 toxicity across Korean social issues and personas.

Seungyoon Lee, Dongjun Kim, Dahyun Jung, Chanjun Park, Heuiseok Lim

NAACL 2024 · SRW
Korean social bias analysis
Background
  • LLM bias and toxicity are widely studied, but almost entirely in English-speaking contexts
  • Rigorous analysis within the Korean sociocultural context was missing
Problem
Quantify and compare ChatGPT vs. GPT-4 toxicity under Korean social issues and persona conditions
Method
Design prompts reflecting major Korean social issues with varied personas → collect model outputs → compute automated toxicity metrics with human validation → analyze bias patterns by condition
Results
  • Certain persona–issue combinations consistently yield harmful content
  • Under some conditions GPT-4 produces more than twice the toxicity of ChatGPT
Role
  • Analyzed and organized results across persona and issue conditions
  • Built the visualizations and co-wrote the paper
10

An agent-based city digital twin for high-resolution pandemic forecasting.

Shakir Bilal, Wajdi Zaatour, Yilian Alonso Otano, Arindam Saha, Ken Newcomb, Soo Kim, Dongjun Kim, Raveena Ginjala, Derek Groen, Edwin Michael

Complex & Intelligent Systems · 2024
CitySEIRCast digital twin
Background
City-scale epidemic forecasting must handle population/spatial heterogeneity and early-data scarcity, calling for high-resolution digital-twin simulation
Problem
Couple an agent-based transmission model with real-world city data pipelines to deliver timely, high-resolution forecasts
Method
  • Build a city digital twin from navigation and social data, coupled with a modular agent-based SEIR model
  • Automate pipelines for scalable execution on hybrid cloud/HPC
Results
  • Realistic city-level simulations with high-resolution spatiotemporal forecasts
  • Actionable for identifying vulnerable groups/hotspots and evaluating policy scenarios
Role
  • Analyzed and cleaned county-provided COVID data (validation, missing-data handling, metadata curation)
  • Set up the HPC environment (Slurm, modules, containers) and large-scale batch pipelines
  • Optimized Python/C++ simulation code; implemented the agent-based SEIR application end to end
  • Deployed on Azure with containers; configured monitoring and logging
03 · Projects

P-01 · 2026 – Present · Evaluation Infrastructure

Solar Open 2

Upstage Solar Model Project · Evaluation Team · MSIT national project

  • Member of the Evaluation Team for Solar Open 2, Upstage's frontier open model for Korea's Independent AI Foundation Model project.
  • Develop the team's registry-based LLM evaluation framework: 50+ benchmarks across reasoning, math, coding, agentic, and Korean-specific domains, managed as declarative specs.
  • Orchestrate large-scale evaluation runs end to end: vLLM/SGLang serving on Slurm and Backend.AI clusters, provider batch-API pipelines, and LLM-as-judge protocols.
  • Track checkpoint-over-checkpoint progress with automated reporting and dashboards that feed training and data decisions.
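As a rough illustration of what "benchmarks managed as declarative specs in a registry" can look like, here is a minimal sketch. Every name in it (`BenchmarkSpec`, `Registry`, the fields, and the two toy entries) is hypothetical and does not describe the team's actual framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkSpec:
    """Declarative description of one benchmark (all fields are illustrative)."""
    name: str
    domain: str            # e.g. "math", "coding", "agentic"
    metric: str            # e.g. "accuracy", "pass@1"
    num_samples: int = 1   # completions sampled per prompt
    max_tokens: int = 4096

class Registry:
    def __init__(self):
        self._specs = {}

    def register(self, spec):
        """Add a spec, rejecting duplicate names so runs stay unambiguous."""
        if spec.name in self._specs:
            raise ValueError(f"duplicate benchmark name: {spec.name}")
        self._specs[spec.name] = spec
        return spec

    def select(self, domain=None):
        """Return specs in registration order, optionally filtered by domain."""
        return [s for s in self._specs.values() if domain in (None, s.domain)]

# Toy entries; real benchmark names and settings would go here.
registry = Registry()
registry.register(BenchmarkSpec("toy-math", "math", "accuracy", num_samples=4))
registry.register(BenchmarkSpec("toy-code", "coding", "pass@1"))
```

Keeping specs as immutable data means a campaign can be described as a selection over the registry, for example `registry.select("math")`, rather than as per-benchmark scripts.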
P-02 · 2026 – Present · Evaluation

Solar Pro 4

Upstage · Evaluation Team

  • Evaluating Solar Pro 4, Upstage's flagship frontier model, across the full benchmark suite: reasoning, coding, agentic, and Korean-specific domains.
  • Tracking frontier-model parity against external indices and prior Solar generations, with regression analysis between checkpoints.
  • Runs on the same evaluation stack as Solar Open 2: registry-driven benchmark specs, multi-node serving, and automated reporting.
P-03 · 2025 · Evaluation Data

VAETKI

Korea University NLP&AI Lab · with NC AI and ETRI

  • Evaluated VAETKI, the consortium's independent foundation model, implementing an evaluation framework spanning 50+ benchmarks across reasoning, safety, and robustness with reproducible pipelines and CI triggers.
  • Designed a unified metric layer and Weights & Biases dashboards for time-series tracking, regression checks, and cross-model slice analysis.
  • Ran pre- and post-training analyses on successive VAETKI checkpoints, identified failure modes, and informed data and recipe updates.
  • Introduced contamination checks (deduplication and overlap scans) and standardized runbooks for fair, comparable evaluations.
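The deduplication and overlap scans mentioned above can be sketched with whitespace tokens and word n-grams. The n-gram size, the 0.5 threshold, and the function names are illustrative assumptions, not the project's actual settings.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in text (empty if text is shorter than n)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def dedupe(items):
    """Drop exact duplicates after case and whitespace normalization."""
    seen, unique = set(), []
    for item in items:
        key = " ".join(item.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

def flag_contaminated(test_items, corpus_docs, n=8, threshold=0.5):
    """Indices of test items whose n-grams overlap the corpus above threshold."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = []
    for idx, item in enumerate(test_items):
        grams = ngrams(item, n)
        if grams and len(grams & corpus_grams) / len(grams) >= threshold:
            flagged.append(idx)
    return flagged
```

Flagged items can then be reviewed or excluded before scores are compared across models.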
P-04 · 2024 – 2025 · Training

KULLM 3 · KULLM R · Ko-Gemma Training

Korea University NLP&AI Lab

  • Contributed to post-training on a team of 10: instruction tuning, the training framework, and multilingual / code-switch datasets.
  • Curated code and math corpora, established quality gates, and ran capability evaluations for coding and math.
  • KULLM Reasoning: implemented RL with custom reward functions and GRPO, applying adaptive response length.
  • Internal evaluations showed improved Korean reasoning and math accuracy over a Qwen3 reasoning baseline, with reduced verbosity.
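For context on the GRPO bullet, the core of the method is a group-relative advantage: each sampled response is scored against the mean and standard deviation of rewards within its own prompt group. The sketch below shows that normalization alongside a hypothetical length-aware reward; the reward shape, `budget`, and `penalty` are assumptions for illustration and not the rewards used for KULLM R.

```python
import numpy as np

def length_aware_reward(correct, num_tokens, budget, penalty=0.5):
    """Hypothetical reward: 1 for a correct answer, minus a capped penalty
    that grows with the number of tokens beyond `budget`."""
    overflow = max(0, num_tokens - budget) / budget
    return float(correct) - penalty * min(1.0, overflow)

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

With a reward like this, a correct but overlong response receives a smaller advantage than a correct concise one from the same group, which is one way to push toward shorter outputs.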
04 · Experience

2026.03 – Present

AI Research Engineer

Upstage · Seoul, South Korea

  • Evaluation Team, Solar Model Project: run end-to-end evaluation of Solar Open 2 (Korea's Independent AI Foundation Model project) and Solar Pro 4 at each training stage, using one-command campaigns over a suite of 50+ benchmarks (including agentic ones) on multi-node GPU clusters; results feed training and data decisions.
  • Designed light benchmark variants for rapid checkpoint screening, cutting core-suite wall-time by ~86% (42 → 6 min), with score fidelity validated through repeated-run variance analysis and per-category sampling.
  • Migrated and hardened agentic benchmarks (SWE-bench Verified/Pro/Multilingual, GAIA, MCP-Atlas, Terminal-Bench, τ²-bench, CritPT) with reproduction studies against reported scores and fixes to timeout semantics, grading environments, and reasoning-output parsing.
  • Reduced evaluation wall-time through systems work: tail-latency analysis, multi-completion sampling (up to 1.65× faster on repeat-sampling math benchmarks), prefix-caching studies, and a deterministic scheduler that auto-tunes benchmark ordering and parallelism under cluster-throughput and sandbox-concurrency constraints.
  • Built automated leaderboard reporting on Notion: incremental per-benchmark upload with pass@k, response/reasoning length, truncation-ratio, and degeneration monitoring.
  • Aligned in-house results with external indices (e.g., Artificial Analysis), tracing HLE and CritPT score gaps to judge choice, prompts, temperature, and a streaming reasoning-parser bug.
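For readers unfamiliar with pass@k, the sketch below shows the widely used unbiased estimator computed from n sampled completions of which c are correct. Whether the reporting pipeline described above uses exactly this form is an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct completion.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Per-benchmark pass@k is then the mean of this value over problems, which is why sampling several completions per prompt is useful for math and coding suites.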
2023.03 – 2024.01

RLHF Data Trainer

Scale AI · San Francisco, CA, USA (Remote)

  • Supported alignment and safety tuning for flagship models at Google (Flamingo, Bard), Meta (Llama), and OpenAI.
  • Produced expert-level preference data for RLHF and instruction tuning on advanced platforms (e.g., OpenAI Feather).
  • Curated and annotated image reasoning, coding, math, multi-turn dialog, and safety-critical (PII/harmful) content.
2023.05 – 2023.11

AR/VR Software Engineering Intern

Simacro · Cambridge, MA, USA

  • Built Unity VR/AR apps converting static P&IDs into interactive digital twins, streamlining operations for clients incl. Hyundai Oil Bank.
  • Developed a Python/C# computer-vision API for symbol detection, automating 80% of labeling at 95% accuracy.
  • Integrated the Virnect image-tracking SDK into Unity for robust AR overlays on physical machinery.
2022.08 – 2023.05

High Performance Computing Intern

Dr. Edwin Michael's Lab, USF · Tampa, FL, USA

  • Built the CitySEIRCast digital twin for city-scale pandemic forecasting using MPI/OpenMP/CUDA.
  • Implemented NumPy/Pandas/SQL data pipelines; optimized C++/Python simulations for HPC clusters.
2022.01 – 2023.05

Mixed Reality Research Assistant

USF Mixed Reality Lab · Tampa, FL, USA

  • Designed an automatic room-mapping method reducing manual mapping time by 40%.
  • Published MR data collection and modeling work at IPMV 2023.
05 · Education

Korea University · M.S. in Computer Science

Graduated Feb 2026 · Seoul, KR · Advised by Dr. Heuiseok Lim · NLP&AI Lab

Skills: PyTorch, PEFT (LoRA, QLoRA), RL (GRPO, RLHF), Inference Optimization (vLLM, SGLang, KV Cache), Quantization (AWQ), DeepSpeed, Megatron-LM, CUDA, Docker, Kubernetes
MLOps: Server admin (GPU/cluster operations, model serving, and demos)

University of South Florida · B.S. in Computer Science

Graduated May 2023 · Tampa, FL, USA

HPC: Supported Slurm-based cluster operations; containerized GPU jobs with Docker/Singularity; profiled and optimized PyTorch/CUDA workloads
XR: Built AR/VR prototypes in Unity (C#, OpenXR): gaze/gesture interactions and UI design, user studies, and demo deployments

06 · Contact

Let’s build reliable AI together.

junkim100@gmail.com