robertknight 49 minutes ago [-]
This post is full of Claude's irritating verbal tics ("It's not X, it's Y"). It is fine to use Claude (or another LLM) to help with this kind of reverse engineering, but I would prefer the write-up be done by hand.
I do wonder if Anthropic is modifying output via prompt modification or some of the Fable style weights adjustments for requests that contain these sentinel values. That would be one way to try to prevent distillation, and they have shown a willingness to silently modify model behavior for user input they deem dangerous.