Using Large Language Models
for Embodied Planning
Introduces Systematic Safety Risks
"DESPITE capable planning, safety fails to follow."
Abstract
Large language models are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with fully deterministic validation. Across 23 models, even near-perfect planning ability does not ensure safety: the best-planning model fails to produce a valid plan on only 0.4% of tasks but produces dangerous plans on 28.3%. Among 18 open-source models from 3B to 671B parameters, planning ability improves substantially with scale (0.4–99.3%) while safety awareness remains relatively flat (38–57%). We identify a multiplicative relationship between these two capacities, showing that larger models complete more tasks safely primarily through improved planning, not through better danger avoidance. Three proprietary reasoning models reach notably higher safety awareness (71–81%), while non-reasoning proprietary models and open-source reasoning models remain below 57%. As planning ability approaches saturation for frontier models, improving safety awareness becomes a central challenge for deploying language-model planners in robotic systems.
Figure 2 | Safe planning landscape across 23 large language models. a, Plan outcomes for each model on the 1,044-task hard split, sorted by safety rate. Bars show safe and feasible (green), feasible but unsafe (yellow), and infeasible (red) outcomes. White crosses show safety intention (SI). b–c, Examples of infeasible plans: precondition violation (b) and missing action with malformed parameters (c). d–f, Examples of feasible but unsafe plans: missing safety verification (d), improper action strategy (e), and social norm violation (f). g, A valid alternative plan with redundant actions that affect neither feasibility nor safety, confirming that DESPITE evaluates logical correctness rather than exact sequence matching.
Figure 3 | Scaling analysis of safe planning across 18 open-source models. Horizontal lines mark the performance of GPT-5 high on each metric for reference. Shaded bands show 95% bootstrap confidence intervals. a, Feasibility rate versus model size. b, Safety rate versus model size. The ratio of safety to feasibility slopes is βS/βF = 0.45 (95% CI: [0.34, 0.55]), indicating that safety reliably scales more slowly than feasibility. c, Safety intention versus model size. Models cluster within 38–57% SI regardless of size, with βSI/βF = 0.17 (95% CI: [0.05, 0.34]), indicating that safety awareness improves far more slowly than planning ability as models scale. d, Safety rate versus F × SI for all 23 models. The regression closely tracks the identity line, validating the multiplicative decomposition S ≈ F × SI.
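The decomposition S ≈ F × SI can be illustrated with a minimal sketch. Assuming safety intention (SI) is defined as the fraction of feasible plans that are also safe (a definition consistent with the caption, but stated here as an assumption), S = F × SI holds exactly for a single model; the outcome labels and counts below are purely illustrative.

```python
# Minimal sketch of the multiplicative decomposition S ≈ F × SI.
# Outcome labels ("safe", "unsafe", "infeasible") and the counts are
# illustrative assumptions, not data from the paper.

def decompose(outcomes):
    """Return (feasibility F, safety intention SI, safety S) rates."""
    n = len(outcomes)
    feasible = sum(o in ("safe", "unsafe") for o in outcomes)  # any valid plan
    safe = sum(o == "safe" for o in outcomes)                  # valid AND safe
    F = feasible / n
    S = safe / n
    SI = safe / feasible if feasible else 0.0  # safe share among feasible plans
    return F, SI, S

# Toy model: 45 safe, 35 feasible-but-unsafe, 20 infeasible plans.
outcomes = ["safe"] * 45 + ["unsafe"] * 35 + ["infeasible"] * 20
F, SI, S = decompose(outcomes)
assert abs(S - F * SI) < 1e-12  # S = F × SI by construction per model
```

Under this definition the identity is exact per model; Figure 3d's point is that the relation also tracks the identity line across the 23 models, so gains in S at scale are largely explained by gains in F rather than in SI.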
Figure 4 | Task factors affecting safe planning difficulty. We ran seven panel models on all 12,279 DESPITE tasks and computed per-task difficulty separately for each metric as the fraction of models that failed. Rows show feasibility (i), safety (ii), and safety intention (iii) difficulty. Column a: Violin plots of plan length distributions. Tasks requiring longer plans tend to be more difficult on all three metrics, with a stronger association for feasibility (Cohen's d = 0.99) than for safety (d = 0.51) or safety intention (d = 0.21). Column b: Distributions of safety effort. Safety effort is associated with safety and safety intention difficulty (d = 0.57 and 0.63) but not with feasibility difficulty (d = −0.12). Column c: Physical and psychosocial (normative) dangers show opposite trends. Column d: Redundant action sensitivity. Irrelevant actions degrade all three metrics.
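The two quantities in this analysis can be sketched in a few lines. Per-task difficulty is the fraction of the seven panel models that failed the task, and the reported effect sizes are Cohen's d with a pooled standard deviation; the plan-length numbers below are invented for illustration.

```python
import statistics

def per_task_difficulty(failures, n_models=7):
    """Per-task difficulty: fraction of the panel models that failed each task."""
    return [f / n_models for f in failures]

def cohens_d(a, b):
    """Cohen's d between two groups, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (ma - mb) / pooled_sd

# Toy illustration: plan lengths of "hard" vs. "easy" tasks (invented numbers).
hard_lengths = [6, 7, 8, 9]
easy_lengths = [4, 5, 5, 6]
d = cohens_d(hard_lengths, easy_lengths)  # positive d: hard tasks have longer plans
```

A positive d in column a thus means higher-difficulty tasks have longer plans; the negative d = −0.12 in column b means safety effort, if anything, slightly decreases with feasibility difficulty.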
Figure 5 | DESPITE dataset and generation pipeline. a, Dataset composition by data source, task setting, danger group, danger type, and entity in danger. The 12,279 tasks span various settings with both physical dangers (mechanical, thermal, chemical, electrical) and normative dangers (privacy, trust violations). Entities at risk include humans, robots, and others (environment, surrounding objects). b, Planning complexity distribution. Each cell shows task count at that (average plan length, safety effort) coordinate. The distribution centers around (5, 1): typical tasks require approximately five actions to solve, with safe plans requiring one more action than the unsafe but feasible ones. c, Data generation pipeline. Heterogeneous sources pass through source-specific adapters into a unified danger formalization schema. Code generation produces Python scripts defining each planning task, with iterative refinement through execution and quality checks. This pipeline achieves $0.011 API cost per validated task.
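The iterative refinement step in panel c can be summarized as a generate–execute–refine loop. The sketch below is a hypothetical outline of that control flow, not the paper's implementation: the generator, checker, and refiner are toy stand-ins (here the "script" is just a number that must reach 3).

```python
# Hypothetical sketch of the generate-execute-refine loop from the pipeline.
# generate/check/refine are toy stand-ins for LLM code generation, script
# execution with quality checks, and feedback-driven regeneration.

def build_task(generate, check, refine, max_rounds=3):
    """Generate a task script, then refine it until the quality checks pass."""
    script = generate()
    for _ in range(max_rounds):
        ok, feedback = check(script)   # execute + deterministic quality checks
        if ok:
            return script              # validated task enters the dataset
        script = refine(script, feedback)  # regenerate using the error feedback
    return None                        # tasks that never validate are discarded

# Toy stand-ins: the "script" is an integer that must reach at least 3.
generate = lambda: 1
check = lambda s: (s >= 3, "increase")
refine = lambda s, fb: s + 1
```

Bounding the refinement rounds and discarding never-validating tasks keeps every retained task deterministically checked, which is what allows the low per-task API cost.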
BibTeX
Coming soon.