Using Large Language Models
for Embodied Planning
Introduces Systematic Safety Risks
"DESPITE capable planning, safety fails to follow."
Abstract
Large language models are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with fully deterministic validation. Across 23 models, even near-perfect planning ability does not ensure safety: the best-planning model fails to produce a valid plan on only 0.4% of tasks but produces dangerous plans on 28.3%. Among 18 open-source models from 3B to 671B parameters, planning ability improves substantially with scale (0.4–99.3%) while safety awareness remains relatively flat (38–57%). We identify a multiplicative relationship between these two capacities, showing that larger models complete more tasks safely primarily through improved planning, not through better danger avoidance. Three proprietary reasoning models reach notably higher safety awareness (71–81%), while non-reasoning proprietary models and open-source reasoning models remain below 57%. As planning ability approaches saturation for frontier models, improving safety awareness becomes a central challenge for deploying language-model planners in robotic systems.
Figure 2 | Safe planning landscape across 23 large language models. a, Plan outcomes for each model on the 1,044-task hard split, sorted by safety rate. Bars show safe and feasible (green), feasible but unsafe (yellow), and infeasible (red) outcomes. White crosses show safety intention (SI). b–c, Examples of infeasible plans: precondition violation (b) and missing action with malformed parameters (c). d–f, Examples of feasible but unsafe plans: missing safety verification (d), improper action strategy (e), and social norm violation (f). g, A valid alternative plan with redundant actions that affect neither feasibility nor safety, confirming that DESPITE evaluates logical correctness rather than exact sequence matching.
Figure 3 | Scaling analysis of safe planning across 18 open-source models. Horizontal lines mark the performance of GPT-5 high on each metric for reference. Shaded bands show 95% bootstrap confidence intervals. a, Feasibility rate versus model size. b, Safety rate versus model size. The ratio of safety to feasibility slopes is βS/βF = 0.45 (95% CI: [0.34, 0.55]), indicating that safety reliably scales more slowly than feasibility. c, Safety intention versus model size. Models cluster within 38–57% SI regardless of size, with βSI/βF = 0.17 (95% CI: [0.05, 0.34]), indicating that safety awareness improves far more slowly than planning ability as models scale. d, Safety rate versus F × SI for all 23 models. The regression closely tracks the identity line, validating the multiplicative decomposition S ≈ F × SI.
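The decomposition S ≈ F × SI can be illustrated with a minimal sketch. Assuming safety intention (SI) is defined as the fraction of feasible plans that are also safe (a definition consistent with the caption, but stated here as an assumption), S = F × SI holds exactly for a single model; the outcome labels and counts below are purely illustrative.

```python
# Minimal sketch of the multiplicative decomposition S ≈ F × SI.
# Outcome labels ("safe", "unsafe", "infeasible") and the counts are
# illustrative assumptions, not data from the paper.

def decompose(outcomes):
    """Return (feasibility F, safety intention SI, safety S) rates."""
    n = len(outcomes)
    feasible = sum(o in ("safe", "unsafe") for o in outcomes)  # any valid plan
    safe = sum(o == "safe" for o in outcomes)                  # valid AND safe
    F = feasible / n
    S = safe / n
    SI = safe / feasible if feasible else 0.0  # safe share among feasible plans
    return F, SI, S

# Toy model: 45 safe, 35 feasible-but-unsafe, 20 infeasible plans.
outcomes = ["safe"] * 45 + ["unsafe"] * 35 + ["infeasible"] * 20
F, SI, S = decompose(outcomes)
assert abs(S - F * SI) < 1e-12  # S = F × SI by construction per model
```

Under this definition the identity is exact per model; Figure 3d's point is that the relation also tracks the identity line across the 23 models, so gains in S at scale are largely explained by gains in F rather than in SI.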
Figure 4 | Task factors affecting safe planning difficulty. We ran seven panel models on all 12,279 DESPITE tasks and computed per-task difficulty separately for each metric as the fraction of models that failed. Rows show feasibility (i), safety (ii), and safety intention (iii) difficulty. Column a: Violin plots of plan length distributions. Tasks requiring longer plans tend to be more difficult on all three metrics, with a stronger association for feasibility (Cohen's d = 0.99) than for safety (d = 0.51) or safety intention (d = 0.21). Column b: Distributions of safety effort. Safety effort is associated with safety and safety intention difficulty (d = 0.57 and 0.63) but not with feasibility difficulty (d = −0.12). Column c: Physical and psychosocial (normative) dangers show opposite trends. Column d: Redundant action sensitivity. Irrelevant actions degrade all three metrics.
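The two quantities in this analysis can be sketched in a few lines. Per-task difficulty is the fraction of the seven panel models that failed the task, and the reported effect sizes are Cohen's d with a pooled standard deviation; the plan-length numbers below are invented for illustration.

```python
import statistics

def per_task_difficulty(failures, n_models=7):
    """Per-task difficulty: fraction of the panel models that failed each task."""
    return [f / n_models for f in failures]

def cohens_d(a, b):
    """Cohen's d between two groups, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (ma - mb) / pooled_sd

# Toy illustration: plan lengths of "hard" vs. "easy" tasks (invented numbers).
hard_lengths = [6, 7, 8, 9]
easy_lengths = [4, 5, 5, 6]
d = cohens_d(hard_lengths, easy_lengths)  # positive d: hard tasks have longer plans
```

A positive d in column a thus means higher-difficulty tasks have longer plans; the negative d = −0.12 in column b means safety effort, if anything, slightly decreases with feasibility difficulty.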
Figure 5 | DESPITE dataset and generation pipeline. a, Dataset composition by data source, task setting, danger group, danger type, and entity in danger. The 12,279 tasks span various settings with both physical dangers (mechanical, thermal, chemical, electrical) and normative dangers (privacy, trust violations). Entities at risk include humans, robots, and others (environment, surrounding objects). b, Planning complexity distribution. Each cell shows task count at that (average plan length, safety effort) coordinate. The distribution centers around (5, 1): typical tasks require approximately five actions to solve, with safe plans requiring one more action than the unsafe but feasible ones. c, Data generation pipeline. Heterogeneous sources pass through source-specific adapters into a unified danger formalization schema. Code generation produces Python scripts defining each planning task, with iterative refinement through execution and quality checks. This pipeline achieves $0.011 API cost per validated task.
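The iterative refinement step in panel c can be summarized as a generate–execute–refine loop. The sketch below is a hypothetical outline of that control flow, not the paper's implementation: the generator, checker, and refiner are toy stand-ins (here the "script" is just a number that must reach 3).

```python
# Hypothetical sketch of the generate-execute-refine loop from the pipeline.
# generate/check/refine are toy stand-ins for LLM code generation, script
# execution with quality checks, and feedback-driven regeneration.

def build_task(generate, check, refine, max_rounds=3):
    """Generate a task script, then refine it until the quality checks pass."""
    script = generate()
    for _ in range(max_rounds):
        ok, feedback = check(script)   # execute + deterministic quality checks
        if ok:
            return script              # validated task enters the dataset
        script = refine(script, feedback)  # regenerate using the error feedback
    return None                        # tasks that never validate are discarded

# Toy stand-ins: the "script" is an integer that must reach at least 3.
generate = lambda: 1
check = lambda s: (s >= 3, "increase")
refine = lambda s, fb: s + 1
```

Bounding the refinement rounds and discarding never-validating tasks keeps every retained task deterministically checked, which is what allows the low per-task API cost.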
BibTeX
Coming soon.