What do you recommend for prompt library and version control?
You have to treat your most valuable prompts like code, and so they go in Git. The challenge, as this post tried to get across, is that we now have a CI issue with model changes "breaking" prompt functionality. At the moment, I don't have a concrete answer for you as I'm currently working on solving this myself. Definitely a future article once I have something to share.
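The rough shape I'm experimenting with (file names and the pinned snapshot below are purely illustrative, not a finished framework) is to keep each prompt as a versioned file and run a cheap regression test against a pinned model snapshot in CI, so a provider-side change surfaces as a failing test instead of a silent behaviour shift:

```python
# test_prompts.py -- illustrative regression check, not a finished framework.
from pathlib import Path
from openai import OpenAI

client = OpenAI()
PROMPTS = Path("prompts")            # prompt files live in Git next to the code
PINNED_MODEL = "gpt-4o-2024-08-06"   # pin a dated snapshot, never a moving alias

def run_prompt(name: str, **kwargs) -> str:
    prompt = (PROMPTS / f"{name}.txt").read_text().format(**kwargs)
    resp = client.chat.completions.create(
        model=PINNED_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def test_summarize_prompt_still_returns_json():
    out = run_prompt("summarize_v3", article="Example input text.")
    # Crude structural assertion -- enough to catch the loudest kinds of drift.
    assert out.strip().startswith("{"), "summarize_v3 no longer returns JSON"
```

It doesn't solve the underlying "model changed under me" problem, but it at least turns silent breakage into a red build.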
Thanks, Chris. I'm not a coder, so I didn't think of using Git. I'm using Google Docs. I'll try Git and see if it's better than Docs, but I'll keep my eyes open for a better option and will share if I find one.
Fascinating. Now yesterday’s frustrations make sense. While writing and editing a Python script built around the OpenAI Agents SDK, Claude 4.5 kept changing my model configuration from gpt-5 to 4.1 or o4-mini every time it touched the code. At first, I simply added an instruction to never edit the “model” variable. After it ignored this, I added a MUST NOT. …ignored. I duplicated the instruction, placing instances near the beginning and at the end. …back to gpt-4.1.
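What seems more robust than repeating the instruction is taking the rule out of the prompt entirely and enforcing it with a deterministic check. A rough sketch (the file name and pinned value are illustrative, not from my actual script) that fails the run whenever the agent rewrites the model string:

```python
# guard_model.py -- hypothetical pre-commit / CI check, purely illustrative.
# Fails loudly if anything (human or agent) changes the pinned model identifier.
import re
import sys
from pathlib import Path

PINNED_MODEL = "gpt-5"               # the model the script is supposed to use
TARGET_FILE = Path("agent_app.py")   # illustrative file name

def main() -> int:
    source = TARGET_FILE.read_text()
    match = re.search(r'model\s*=\s*["\']([^"\']+)["\']', source)
    if match is None:
        print("No model assignment found -- check the file manually.")
        return 1
    if match.group(1) != PINNED_MODEL:
        print(f"Model was changed to {match.group(1)!r}; expected {PINNED_MODEL!r}.")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```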
I love how you map this trend into the future, cutting through the noise represented by my irritation in the moment; a human stuck in an execution loop with a loss of all prior conviction or confidence that I am working on the right thing, or in the right manner.
Yours is one of my most valuable paid subscriptions on the stack
Kind words sir, glad this was of help. Working with AI is beautifully painful eh?!
Dunno if my first comment went through, but again, thanks for sharing. This is super valuable.
It also reads almost like the diametrical opposite of what happened with OpenAI's GPT-4o -> GPT-5 transition.
The more we want AIs to behave like humans, the less predictable they should be. That means stripping any deterministic instructions ("must", "always", "never") out of our prompts and handing those to traditional machines. There's a reason we run a lot of code that exists to keep unreliable humans safe on deterministic machines, not stochastic ones.
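Concretely, that can be as small as a validator sitting outside the model (a rough sketch; the retry policy is my own assumption): the prompt stays loose, and ordinary code decides whether the "must" actually held.

```python
# Sketch of handing the "must" to a traditional machine: a deterministic
# validator plus a retry/fail policy, instead of a MUST NOT in the prompt.
import xml.etree.ElementTree as ET
from typing import Callable

def is_well_formed_xml(text: str) -> bool:
    """Deterministic check: does the reply parse as XML?"""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

def ask_until_valid(ask: Callable[[str], str], prompt: str, max_tries: int = 3) -> str:
    """`ask` is whatever function calls your model; the rule is enforced here."""
    for _ in range(max_tries):
        reply = ask(prompt)
        if is_well_formed_xml(reply):
            return reply
    raise RuntimeError("No well-formed XML after retries; failing deterministically.")
```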
While I agree that separation of concerns for design and build with this technology needs to account for deterministic and non-deterministic features, no machine should refuse a legitimate directive from a user. If I prompt Claude that it "must" return its answer to me in XML or in Spanish, then it must do so; there is no excuse not to. The fact that models are being designed to be more "helpful" is very problematic for those of us building around them.
But real humans sometimes must adhere to strict instructions. "No smoking near the gasoline pump" is not a suggestion. Creativity is not acceptable there. "High voltage do not open" is an instruction meant for humans.
Correct. And we have safety measures, security people and law enforcement because people are unreliable in following such simple instructions.
Right, but law enforcement and safety people are also human. At some point you will have humans following instructions.
All humans are unreliable. We can reduce this intrinsic unreliability by having people check on each other. But there are no guarantees. For example, in cultures with high corruption, people don't follow the rules, they follow the money.
Who said anything about guarantees? But at some level we do count on people to follow instructions. If we could not, then there would be no reason to have instructions.
This is simply not true. Just look at traffic and you'll see that people are very unreliable rule followers. They break the law all the time. We can NOT count on others to follow the traffic laws. But that does NOT mean that there is no reason to have traffic instructions.
I repeat: all humans are unreliable (intentionally or unintentionally). The best we can do is for everyone to watch out for each other. We are not computers.
Devil's advocate: maybe this is intentional?
The foundation labs are all bleeding cash in the hundreds of millions and billions per year (nothing remotely close to profitable), and ALWAYS adhering to our prompts may exacerbate their deteriorating P&L situation. They all seem to be moving upmarket with higher-cost tiers for their models and services, as well as 7-8 figure consulting offerings to enterprises.
Best bet is to diversify your use of models by trying the open-weight and open-source models (Llama, DeepSeek, Qwen, Mistral, Granite), as you are more in control of your destiny, and don't rush to start using the latest and greatest models from OpenAI and Anthropic.
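One practical note if you go that route: most open-weight serving stacks (Ollama, vLLM, and the like) expose an OpenAI-compatible endpoint, so switching can be as small as pointing the client at a different base URL. A minimal sketch, assuming a local Ollama server with a Llama model already pulled (names are illustrative):

```python
# Hypothetical swap: same client code, local open-weight model behind an
# OpenAI-compatible endpoint (here Ollama's default port).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # placeholder; the local server ignores it
)

resp = client.chat.completions.create(
    model="llama3.1",                      # any model you've pulled locally
    messages=[{"role": "user", "content": "Reply with a one-line summary of RAG."}],
)
print(resp.choices[0].message.content)
```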
It should ask humans for clarification when facing conflicting instructions.