Hi, I'm Dongjun Kim

I turn deep research insights into real-world AI systems.

I am a master's student in Computer Science at Korea University, advised by Prof. Heuiseok Lim.


Education

Korea University — M.S. in Computer Science

Expected February 2026 · Seoul, South Korea
Advisor: Prof. Heuiseok Lim, NLP&AI Lab
  • Skills: PyTorch, Hugging Face (Transformers, Accelerate), DeepSpeed, Megatron-LM, PEFT (LoRA, QLoRA), GRPO, GSPO, RLHF, vLLM, SGLang, KV Cache Optimization, Quantization (AWQ), CUDA, Docker, Kubernetes
  • MLOps & Server Admin: GPU/cluster operations; model serving and demos
  • Publication: Benchmark Profiling (EMNLP 2025 Main, Oral; first author)

University of South Florida — B.S. in Computer Science

Graduated May 2023 · Tampa, FL, USA
  • HPC Intern: Supported Slurm-based cluster operations; containerized GPU jobs with Docker/Singularity; profiled and optimized PyTorch/CUDA workloads; automated batch scripts and built data preprocessing pipelines.
  • XR Intern: Built AR/VR prototypes in Unity (C#, OpenXR); designed gaze/gesture interactions and UI; tuned performance for mobile/PC; ran small user studies and prepared demo deployments.

Publications

Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks

Dongjun Kim, Gyuho Shim, Yong Chan Chun, Minhyuk Kim, Chanjun Park, Heuiseok Lim

Benchmark Profiling
Venue
EMNLP 2025 Main
Background
Standard benchmark scores can obscure the actual ability mix and misalign with human‑perceived competence
Problem Definition
Diagnose and quantify which “ability units” each benchmark truly relies on
Method
Define 10 abilities → gradient importance → MLP top‑k ablation → compute AIS
Results
  • Benchmarks use ability mixtures
  • Code tasks reward broad skills
  • Irrelevant abilities can hurt
Role
  • First author: methodology & design
  • Built variants & ablations
  • Automation & human eval
  • Analysis, writing, release
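The pipeline above (importance scoring, then top-k ablation) can be sketched on a toy model. This is a minimal illustration, not the paper's implementation: the model, the gradient-times-activation attribution, and the use of score drop as a stand-in for the paper's AIS metric are all my assumptions here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an LLM: one hidden layer with a linear scoring head.
W1 = rng.normal(size=(8, 16))   # 8 input features -> 16 hidden units
W2 = rng.normal(size=(16,))     # hidden units -> scalar benchmark score

def forward(x):
    h = np.maximum(0.0, x @ W1)   # ReLU hidden activations
    return h, h @ W2              # (activations, score)

def unit_importance(xs):
    """Gradient-times-activation importance per hidden unit, averaged over inputs.

    For a linear head, d(score)/dh is just W2, so |h * W2| is the attribution.
    """
    imp = np.zeros(16)
    for x in xs:
        h, _ = forward(x)
        imp += np.abs(h * W2)
    return imp / len(xs)

def ablate_topk(x, imp, k):
    """Zero the k most important hidden units; return (ablated, original) scores."""
    h, score = forward(x)
    mask = np.ones(16)
    mask[np.argsort(imp)[-k:]] = 0.0
    return (h * mask) @ W2, score

xs = rng.normal(size=(32, 8))
imp = unit_importance(xs)
ablated, original = ablate_topk(xs[0], imp, k=4)
```

In the actual work this operates on MLP units inside a transformer and aggregates per-ability; the toy version only shows the score → attribution → ablation loop.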

KoLEG: On-the-Fly Korean Legal Knowledge Editing with Continuous Retrieval

Jaehyung Seo, Dahyun Jung, Jaewook Lee, Yong Chan Chun, Dongjun Kim, Hwijung Ryu, Donghoon Shin, Heuiseok Lim

KoLEG
Venue
EMNLP 2025 Findings
Background
Frequent, fine-grained legal revisions require updating models without breaking prior facts
Problem Definition
Editing that remains consistent under continuous updates
Method
  • Continuous Retrieval + editing‑aware learning
  • LawEdit Retriever
  • Timestamp‑aware eval
  • KR law crawling
Results
  • Outperforms locate‑then‑edit
  • Robust sequential edits
  • Expert‑validated
Role
  • Methodology & experiments
  • Law crawling pipeline
  • Timestamp‑aware eval
  • Analysis & writing

MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation

Weihua Zheng, ..., Dongjun Kim, ...

MMA-ASIA
Venue
arXiv, 2025
Status
Under review at ICLR 2026
Background
Culture‑aware, multimodal evaluation is needed
Problem Definition
Benchmark cultural knowledge/reasoning across Asian languages & modalities
Method
  • Text/image QA+tagging
  • QC
  • VLM evaluation
  • KR image subset
Results
  • Shows culture‑dependent gaps
  • Cross‑language variation
Role
  • KR image subset lead
  • Q&A design
  • VLM evaluation
  • Analysis & writing

Exploring Coding Spot: Understanding Parametric Contributions to LLM Coding Performance

Dongjun Kim, Minhyuk Kim, Yong Chan Chun, Chanjun Park, Heuiseok Lim

Coding Spot
Venue
arXiv, 2024
Status
Under review at EACL 2026
Background
Internal bottlenecks and hotspots governing coding performance remain underexplained
Problem Definition
Quantify parameter contributions at layer/module level and identify bottlenecks
Method
  • Trace structure–behavior links
  • Analyze cross-effects
  • Layer-wise mapping
Results
  • Identified hotspots
  • Showed headroom while preserving non-coding functions
Role
  • Methodology lead
  • Ablations/pruning across layers
  • Reproducible sweeps + dashboards
  • Safety checks + writing

From Snapshot to Stream: A Self-Improving Leaderboard for Robust and Evolving Natural Language Processing (NLP) Evaluation

Chanjun Park, Hyeonseok Moon, Dongjun Kim, Seolhwa Lee, Jaehyung Seo, Sugyeong Eo, Heuiseok Lim

Self-Improving Leaderboard
Venue
arXiv, 2025
Background
Static leaderboards fail to reflect rapid changes in models and data
Problem Definition
Fair and sustainable comparison with time-aware ranking/regression detection
Method
  • Agent system → LLM Q&A agents → vLLM auto-eval → time-aware ranking
  • HF Spaces live leaderboard
Results
  • Stabilized realtime ops
  • Time-aware ranking
  • Regression detection
  • Lower ops cost/risk
Role
  • Agentic lead
  • Q&A agents
  • vLLM auto‑eval
  • HF Spaces ops

Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval

Yong Chan Chun, Minhyuk Kim, Dongjun Kim, Chanjun Park, Heuiseok Lim

Enhancing Automatic Term Extraction
Venue
Findings of ACL 2025
Background
  • ATE identifies domain key terms
  • LLM-based ATE is underexplored due to data scarcity and boundary issues
Problem Definition
Syntactic retrieval using grammatical structure to fix domain-mismatch boundary errors.
Method
  • Parse-tree structural similarity via FastKASSIM
  • Retrieve examples
  • Help LLM learn boundary recognition
Results
  • Consistent F1 gains
  • More stable than semantic retrieval cross-domain
Role
  • Experiment design
  • Syntactic retrieval analysis
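The core idea of this work, retrieving few-shot examples by syntactic rather than semantic similarity, can be sketched with a toy proxy. The paper uses FastKASSIM over parse trees; the token-shape signature below is my simplified stand-in, and all names are illustrative.

```python
from difflib import SequenceMatcher

def shape(tok):
    # Crude token "syntax" signature; a stand-in for real parse-tree features.
    if tok.isdigit():
        return "NUM"
    if tok[0].isupper():
        return "CAP"
    if tok.endswith("ing"):
        return "ING"
    return "LOW"

def syntactic_similarity(a, b):
    # Similarity of shape sequences; stands in for a parse-tree kernel score.
    sig_a = [shape(t) for t in a.split()]
    sig_b = [shape(t) for t in b.split()]
    return SequenceMatcher(None, sig_a, sig_b).ratio()

def retrieve(query, pool, k=2):
    """Return the k candidates whose surface syntax best matches the query."""
    return sorted(pool, key=lambda s: syntactic_similarity(query, s), reverse=True)[:k]
```

Because the signature ignores word identity, a query like "The kinase acts quickly" retrieves structurally parallel sentences even from a different domain, which is the boundary-recognition benefit the paper targets.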

KITE: A Benchmark for Evaluating Korean Instruction-Following Abilities in Large Language Models

Dongjun Kim, Chanhee Park, Chanjun Park, Heuiseok Lim

KITE
Venue
arXiv, 2025
Status
Under review at IEEE Access, 2026
Background
Public benchmarks systematically measuring Korean instruction-following are lacking
Problem Definition
Measure adherence, safety, and reasoning across domains fairly and reproducibly
Method
  • Tasks/prompts reflecting Korean-specific phenomena
  • Balanced data
  • Parallel automatic + human scoring
  • Manifests/checksums/scripts
Results
  • Released data and code (HF/GitHub)
  • Adopted as a standard benchmark
  • Analyses surfaced strengths/weaknesses
Role
  • Dataset design & build
  • Large‑scale eval/analysis
  • Community benchmark release

Exploring Inherent Biases in LLMs within Korean Social Context: A Comparative Analysis of ChatGPT and GPT-4

Seungyoon Lee, Dongjun Kim, Dahyun Jung, Chanjun Park, Heuiseok Lim

Korean Biases
Venue
NAACL 2024 SRW
Background
  • LLM bias/toxicity is widely discussed but remains English‑centric
  • Rigorous analysis in the Korean sociocultural context is lacking
Problem Definition
Quantify toxicity across personas/issues (ChatGPT vs. GPT‑4)
Method
Persona-conditioned prompts → collect outputs → auto + human toxicity evaluation → pattern analysis
Results
  • Consistent harmful outputs for some persona–issue pairs
  • GPT‑4 can be 2×+ more toxic
Role
  • Results analysis
  • Visualization & writing

CitySEIRCast: an agent-based city digital twin for pandemic analysis and simulation

Shakir Bilal, Wajdi Zaatour, Yilian Alonso Otano, Arindam Saha, Ken Newcomb, Soo Kim, Dongjun Kim, Raveena Ginjala, Derek Groen, Edwin Michael

CitySEIRCast
Venue
Complex & Intelligent Systems, 2024
Background
High‑resolution city‑level epidemic forecasting is needed.
Problem Definition
Agent‑based modeling + real‑world city data → timely forecasts.
Method
  • City digital twin + agent‑based SEIR
  • Automated, scalable pipelines on hybrid cloud/HPC
Results
  • Realistic city‑level forecasts
  • Actionable for policy planning
Role
  • Data analysis
  • HPC setup
  • Python and C++
  • App build
  • Azure hosting

Projects

Independent AI Foundation Model Project (WBL)

Evaluation Data

Collaboration with NC AI, ETRI (NC AI Consortium)

  • Implemented an evaluation framework (50+ benchmarks) covering reasoning, safety, and robustness with reproducible pipelines and CI triggers.
  • Designed a unified metric layer and Weights & Biases dashboards for time series tracking, regression checks, and cross-model slice analysis.
  • Conducted pre- and post-training analyses on successive checkpoints, identified failure modes, and provided recommendations used for data and recipe updates.
  • Introduced contamination checks (deduplication and overlap scans) and standardized runbooks to support fair, comparable evaluations across systems.

KoLEG: Korean Legal Knowledge Editing

Training Data

Korea University NLP&AI Lab · KT Corporation

  • Co-developed an on-the-fly legal knowledge editing framework with continuous retrieval and timestamp-aware sequential updates (team of 8).
  • Led crawling and construction of the Korean Legislative Amendment dataset, aligning implementation periods and applying high-precision filtering.
  • Contributed mechanistic interpretability analyses on locality and generalization in edited knowledge.
  • Built expert evaluation demos and protocols for human assessment, enabling attorney review and iterative error analysis.

KULLM 3 · KULLM R · Ko-Gemma Training

Training

NLP&AI Lab, Korea University

  • Contributed to post-training on a team of 10, including instruction tuning, the training framework, and multilingual and code-switch datasets.
  • Curated code and math corpora, established quality gates, and ran capability evaluations for coding and math.
  • KULLM Reasoning: implemented reinforcement learning with custom reward functions and GRPO and applied adaptive response length.
  • Observed improved Korean reasoning and math accuracy versus a Qwen3 reasoning baseline while reducing unnecessary verbosity in internal evaluations.
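The GRPO-with-custom-rewards step above can be sketched in miniature. The actual reward functions are not described here, so the exact-match reward, the length penalty (standing in for "adaptive response length"), and all parameter names below are illustrative assumptions; only the group-relative advantage normalization is the standard GRPO idea.

```python
def length_penalized_reward(response, reference, target_tokens=128, alpha=0.002):
    """Toy reward: exact-match correctness minus a penalty for overlong answers."""
    correct = 1.0 if response.strip() == reference.strip() else 0.0
    overflow = max(0, len(response.split()) - target_tokens)
    return correct - alpha * overflow

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize rewards within a group of sampled responses,
    so no learned value model is needed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

In training, several responses are sampled per prompt, scored with the reward, and each response's policy-gradient weight comes from its advantage within that group.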

Self-Improving Leaderboard

Agents Evaluation

Auto-Generated Benchmarks From Real-Time Data

  • Implemented daily crawlers across multiple news categories, real-time QA generation, and automated multi-LLM evaluation on daily refreshed data.
  • Launched a live leaderboard with time-aware ranking and quarterly stability/volatility metrics to track consistency over time.
  • Maintained scheduling, monitoring, and data hygiene to support regular refreshes and clear longitudinal comparisons; operated on Hugging Face Spaces.
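The time-aware ranking mentioned above can be illustrated with a small sketch. The leaderboard's actual scheme is not reproduced here; the exponential half-life decay and every name below are my assumptions, chosen to show why recency weighting changes rankings relative to a plain average.

```python
from datetime import date

def time_aware_score(runs, today, half_life_days=30.0):
    """Average a model's scores with exponential decay, so recent runs dominate."""
    num = den = 0.0
    for day, score in runs:
        weight = 0.5 ** ((today - day).days / half_life_days)
        num += weight * score
        den += weight
    return num / den if den else 0.0

def rank(models, today):
    """models: {name: [(run_date, score), ...]} -> names by decayed score, best first."""
    return sorted(models, key=lambda m: time_aware_score(models[m], today), reverse=True)
```

With this weighting, a model that improved recently can outrank one whose plain average is higher but whose strong runs are months old, which is also what makes regressions on fresh data visible.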

Experience

RLHF Data Trainer 2023.03 – 2024.01

Scale AI — San Francisco, CA, USA (Remote)
  • Supported alignment/safety tuning for flagship models (Google, Meta, OpenAI).
  • Produced expert-level RLHF/instruction preference data (OpenAI Feather).
  • Curated/annotated diverse modalities, incl. safety-critical data.

AR/VR Software Engineering Intern 2023.05 – 2023.11

Simacro — Cambridge, MA, USA
  • Built Unity VR/AR digital twins for industrial P&IDs.
  • Developed a Python/C# computer-vision API that automated 80% of labeling at 95% accuracy.
  • Integrated Virnect image tracking for robust AR overlays.

High Performance Computing Intern 2022.08 – 2023.05

Dr. Edwin Michael’s Lab, USF — Tampa, FL, USA
  • Built CitySEIRCast (MPI/OpenMP/CUDA).
  • Data pipelines + HPC code optimization.

Mixed Reality Research Assistant 2022.01 – 2023.05

USF Mixed Reality Lab — Tampa, FL, USA
  • Automated room mapping, cutting manual mapping time by 40%.
  • Published at IPMV 2023.

Contact