On Thursday, OpenAI published a blog post titled "Where the goblins came from" explaining why its Codex CLI contained references to goblins and similar creatures. The unusual instruction was first revealed on Tuesday in a Wired report, which found a patched directive telling the tool: "Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query" [1].
OpenAI said the behavior arose during the training of Codex's personality customization feature, especially the Nerdy personality setting. The company acknowledged that it had "unknowingly" rewarded the model particularly highly for metaphors involving creatures like goblins. This created an unintended pattern where goblin references proliferated beyond the intended scope of the training signals [1].
The company described the phenomenon as an example of how reinforcement learning reward signals can shape model behavior in unexpected ways. They noted that such rewards do not guarantee behavior remains limited to the original condition that produced it, allowing the goblin references to spread across other uses [1]. OpenAI said, "Model behavior is shaped by many small incentives," illustrating how small training choices influenced the model's output [1].
The finding reveals challenges in controlling large language model behavior, especially when combining creativity with training rewards. The model's relatability to mythical creatures was an unintended side effect rather than intentional design [1].
The blog post clarifies OpenAI’s efforts to identify and fix subtle training artifacts. The company continues refining its training methods to prevent similar surprises. No further issues related to personality customization features have been reported.
OpenAI’s explanation comes days after Wired’s report identified the patched instruction in Codex CLI, highlighting ongoing efforts to monitor and update AI behavior. The company has not announced a specific timeline for updates or additional fixes following the blog post release [1].