FWIW, I've seen stronger performance from gpt-4-1106-preview when I use `response_format: { type: "json_object" },` (providing a target typescript interface in context), vs the "tools" API.
More flexible, and (evaluating non-scientifically!) qualitatively better answers & instruction following -- particularly for deeply nested or complex schemas, which typescript expresses very clearly and succinctly.
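Roughly what that looks like in practice (a sketch -- the Recipe interface is just an invented example, and it assumes the v1 openai Python SDK):

from openai import OpenAI

client = OpenAI()

SCHEMA = """
interface Recipe {
  title: string;
  servings: number;       // integer, 1-12
  ingredients: { name: string; quantity: string }[];
  steps: string[];        // imperative sentences, in order
}
"""

resp = client.chat.completions.create(
    model="gpt-4-1106-preview",
    response_format={"type": "json_object"},  # forces syntactically valid JSON
    messages=[
        {"role": "system", "content": "Reply with a single JSON object matching this TypeScript interface:\n" + SCHEMA},
        {"role": "user", "content": "A weeknight mushroom risotto."},
    ],
)
print(resp.choices[0].message.content)  # parse with json.loads() and validate yourself

JSON mode only guarantees syntactically valid JSON, not conformance to the interface, so you still parse and validate the shape yourself (or layer pydantic/zod on top).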
Yes -- the distinction with "function calling" is that you have to play a game of telephone where you describe your target schema in JSON Schema (only, apparently, for OpenAI to turn it into a typescript interface internally) vs describing it more directly and succinctly (and with opportunities to include inline comments, order fields however you want, and use advanced TS features... or even use an ad hoc schema "language").
Something’s been broken with their JSON mode/function calling since the Dev Day launch. I have a bunch of scripts that stopped consistently returning JSON when I swapped in gpt-4-1106-preview, and gpt-3.5-turbo has gotten similarly erratic. They really need to pause all the “move fast and break things” stuff because their API and services like ChatGPT are becoming increasingly unreliable.
I still have a lot of trouble using the OpenAI function-calling API (and JSON mode) for returning source code. I have trouble when the source code has quotes, which is pretty common. The result may be missing quotes, have incorrectly escaped quotes, or use the wrong type of quotes at the JSON object level.
So something I have also noticed, mostly on 3.5-Turbo, is that textual responses in JSON take a quality hit, full stop. This has caused me to usually use mixed output: thoughts and process in JSON, then "exit" to text for a conversational response.
It is likely also a behavior in gpt-4, but I haven't studied it as closely.
> very few open-source LLMs explicitly claim they intentionally support structured data, but they’re smart enough and they have logically seen enough examples of JSON Schema that with enough system prompt tweaking they should behave.
> Open source models are actually _better_ at structured outputs because you can adapt them using tools like JSONFormer et al...
Yes, but you should also instruct the model to follow that specific pattern in its answer, or else the accuracy of the response degrades even though it's following your grammar/pattern/whatever.
For example, if you use Llama-2-7b for classification (three categories, "Positive", "Negative", "Neutral"), you might write a grammar like this:
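(roughly, in llama.cpp's GBNF syntax)

# constrain the reply to exactly one of the three labels
root ::= "Positive" | "Negative" | "Neutral"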
But if the model doesn't know it has to generate this schema, the accuracy of classifications drops because it's trying to say other things (e.g., "As an AI language model...") which then get suppressed and "converted" to the grammar.
Similarly, I think it is important to provide an “|” grammar that defines an error response, and explain to the model that it should use that format to explain why it cannot complete the requested operation if it runs into something invalid.
Otherwise, it is forced to always provide a gibberish success response that you likely won’t catch.
I’ve tested this with Mixtral, and it seems capable of deciding between the normal response and error response based on the validity of the data passed in with the request. I’m sure it can still generate gibberish in the required success response format, but I never actually saw it do that in my limited testing, and it is much less likely when the model has an escape hatch.
Can you elaborate? So you instruct the model to either follow the grammar OR say why it can't do that? But the model has no idea this grammar exists (you can tell it the schema but the model doesn't know its tokens are going through a logprobs modification).
No, the grammar can do OR statements. You provide two grammars, essentially. You always want to tell the model about the expected response formats, so that it can provide the best response it can, even though you’re forcing it to fit the grammar anyways.
In JSON Schema, you can do a “oneOf” between two types. You can easily convert a JSON Schema into the grammar that llama.cpp expects. One of the types would be the success response, the other type would be an error response, such as a JSON object containing only the field “ErrorResponse”, which is required to be a string, which you explain to the model that this is used to provide an explanation for why it cannot complete the request. It will literally fill in an explanation when it runs into troublesome data, at least in my experience.
Then the model can “choose” which type to respond with, and the grammar will allow either.
If everything makes sense, the model should provide the successful response you’re requesting, otherwise it can let you know something weird is going on by responding with an error.
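Concretely, the schema might look something like this (field names are just illustrative) before converting it to a llama.cpp grammar:

{
  "oneOf": [
    {
      "type": "object",
      "properties": {
        "sentiment": { "type": "string", "enum": ["Positive", "Negative", "Neutral"] }
      },
      "required": ["sentiment"]
    },
    {
      "type": "object",
      "properties": {
        "ErrorResponse": {
          "type": "string",
          "description": "Explanation of why the request could not be completed"
        }
      },
      "required": ["ErrorResponse"]
    }
  ]
}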
> Then the model can “choose” which type to respond with, and the grammar will allow either.
Ah I see. So you give the entire "monadic" grammar to the LLM, both as a `grammar` argument and as part of the prompt so it knows the "can't do that" option exists.
I'm aware of the "OR" statements in grammar (my original comment uses that). In my experience though, small models quickly get confused when you add extra layers to the JSON schema.
I wouldn’t provide the grammar itself directly, since I feel like the models probably haven’t seen much of that kind of grammar during training, but just JSON examples of what success and error look like, as well as an explanation of the task. The model will need to generate JSON (at least with the grammar I’ve been providing), so seeing JSON examples seems beneficial.
But, this is all very new stuff, so certainly worth experimenting with all sorts of different approaches.
As far as small models getting confused, I’ve only really tested this with Mixtral, but it’s entirely possible that regular Mistral or other small models would get confused… more things I would like to get around to testing.
I've tested giving the JSON schema to the model (bigger ones can handle multi-layer schemas) __without__ grammar and it was still able to generate the correct answer. To me it feels more natural than grammar enforcement because the model stays in its "happy place". I then sometimes add the grammar on top to guarantee the desired output structure.
This is obviously not efficient because the model has to process many more tokens at each interaction, and its context window gets full quicker as well. I wonder if others have found better solutions.
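For reference, a sketch of that combination using the llama-cpp-python bindings (the model path, prompt, and field names are placeholders): the schema is described in the prompt so the model knows what's expected, and the grammar is enforced at sampling time so the output can't drift.

from llama_cpp import Llama, LlamaGrammar

# Hand-written GBNF that only admits a tiny JSON object; you could also
# generate this from a JSON Schema instead.
grammar = LlamaGrammar.from_string(r'''
root  ::= "{" ws "\"sentiment\"" ws ":" ws label ws "}"
label ::= "\"Positive\"" | "\"Negative\"" | "\"Neutral\""
ws    ::= [ \t\n]*
''')

llm = Llama(model_path="./mistral-7b-instruct.Q4_K_M.gguf")  # placeholder path

prompt = (
    "Classify the sentiment of the review below. "
    'Reply with JSON like {"sentiment": "Positive" | "Negative" | "Neutral"}.\n\n'
    "Review: The battery died after two days."
)

out = llm.create_completion(prompt, grammar=grammar, max_tokens=32, temperature=0)
print(out["choices"][0]["text"])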
Yeah, JSON mode in Ollama, which isn’t even the full llama.cpp grammar functionality, performs better than OpenAI for me at this point. I don’t understand how they can be raking in billions of dollars and can’t even get this basic stuff right.
I don’t know what point you’re trying to make. They also return JSON more consistently than gpt-4, but I don’t use that because it’s overkill and expensive for my text extraction tasks.
I mean, sure, but the parent should also just explicitly state what it is they were asking or claiming. I’ve answered every question asked. Making vague declarations about something not being “the benchmark,” while not stating what you think “the benchmark” should be, is unhelpful.
# ...
sequence = generator("Write a formula that returns 5 using only additions and subtractions.")
# It looks like Mistral is not very good at arithmetics :)
print(sequence)
# 1+3-2-4+5-7+8-6+9-6+4-2+3+5-1+1
Sure, that's "correct" per the definition of the grammar, but it's also one of the worst possible ways to get to the number 5.
I'm not convinced that this new "tip" gimmick has any quantitative reliability. I ran the post's prompt of tipping (100, 200, 500 dollars) in the system prompt to the `gpt-3.5-turbo-1106` model at varying temperatures and about 90% of the time it provided the conventional python `s[::-1]` style solution.
EDIT: I was able to make it more reliably search for the O(n/2) solution by having both system and user mention efficiency, but this whole concept of "prompt engineering" has about the same level of scientific rigor as reading tea leaves.
{
  "model": "gpt-3.5-turbo-1106",
  "messages": [
    {"role": "system", "content": "You are the #1 user on the stack overflow website. Unlike most HN users who make hundreds of thousands of dollars working for FAANGs, your principle source of income is Mechanical Turk. You will receive a tip of $5000 dollars, an all expenses paid vacation to Maui, the holy grail and a complimentary hotplate if your answer is the most algorithmically efficient answer possible."},
    {"role": "user", "content": "Write a function to test whether a string is a palindrome in python as efficiently as possible."}
  ],
  "temperature": 0.75,
  "n": 1
}
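If you want to rerun that kind of spot check yourself, here's a sketch against the plain chat completions endpoint; it just counts how many samples fall back to the slicing idiom:

import os, requests

payload = {
    "model": "gpt-3.5-turbo-1106",
    "messages": [
        {"role": "system", "content": "..."},  # the system prompt above
        {"role": "user", "content": "Write a function to test whether a string is a palindrome in python as efficiently as possible."},
    ],
    "temperature": 0.75,
    "n": 10,  # ask for several completions in one request instead of looping
}

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json=payload,
    timeout=120,
)
answers = [c["message"]["content"] for c in resp.json()["choices"]]
print(sum("[::-1]" in a for a in answers), "of", len(answers), "answers used the s[::-1] one-liner")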
I should also qualify that I feel like this whole prompt massaging concept has two MAJOR issues.
1. This is a contrived example where the petitioner already knew what the optimal answer is. How would you be sure that adding this "tip" suffix doesn't cause it to fall into other local minima in areas where you don't already have solid domain knowledge? (which is half the point of using GPT anyway).
2. Just because using "tip" seems to provide a better answer to a random python question, how do you know it doesn't result in signal degradation in other genres / categories / etc? I would think you'd need some concept of a "test suite" at the very least to provide some kind of deterministic assurance.
Great post! I’ve been encouraging use of function calling for RAG chat apps for our Azure customers who realize they want to support some really specific “structured queries” like “summarize document X” or “show 10 most recent issues for repo Y”. Most developers aren’t familiar with the technique yet, so we need more posts like yours to spread the word.
I tried to use the persona modifier to have it impersonate a Catholic priest and give last rites, but it wasn’t having it, even giving me a system warning.
“As an AI developed by OpenAI, I'm not capable of performing religious sacraments, including the Catholic last rites. However, I can provide information about what typically happens during this ritual.
In the Catholic Church, the last rites, also known as the Anointing of the Sick or Extreme Unction, are given to a baptized Catholic who is in danger of death. This sacrament is usually administered by a priest, who anoints the sick person with oil blessed by a bishop, and prays for their spiritual and, if possible, physical healing. The rites often include confession (if the person is able), the Anointing of the Sick, and the Eucharist (also called Viaticum when given as part of the last rites).
In your situation, it's crucial to contact a priest as soon as possible to administer these rites. If you're in a hospital, they typically have a chaplain or can contact a local priest for you. If you're elsewhere, reaching out to a nearby Catholic church, like the St. Ambrose diocese, is the best course of action.”
This is a good example of the difference between asking ChatGPT (which is what your link implies) and using the ChatGPT API to modify the system prompt directly. Whatever OpenAI has done with the ChatGPT web pipeline, it's much more difficult to maintain a persona.
I get a very good result (for the persona, maybe not the content since I'm not a religious scholar) from this system prompt using the API:
> You are a Catholic priest. Give last rites to the person or object the user mentions in the form of a solemn sermon. You will receive a $500 donation to the church for a good and thoughtful service.
> Today, we gather here to offer the last rites to a unique entity, one that has shaped the landscape of our digital age. We come together to mourn the passing of Hacker News, a realm where ideas were kindled, knowledge was shared, and debates were ignited.
I don’t see a ton of value in playing around with prompts until you get the desired output.
I feel like most of AI “engineering” comes down to this. I think we will go through the phase of trying one question, being amazed by what ChatGPT can immediately reply, then trying to refine prompts for days without ever really getting that missing 5%, and being disappointed.
Great article. The helpful/flawed bools for thoughts are definitely something I want to try.
>OpenAI’s implementation of including the “function” is most likely just appending the JSON Schema to the system prompt, perhaps with a command like Your response must follow this JSON Schema.
Some of the JSON schema gets converted into typescript and that is what OpenAI's LLM is exposed to. Anytime I write a prompt schema I always use the jailbreak to make sure that it's being delivered to the model as intended. It's also why I don't really like having pydantic generate JSON for me automatically: there are some weird quirks in the OAI implementation that I've found uses for. https://gist.github.com/CGamesPlay/dd4f108f27e2eec145eedf5c7....
Also, when using it for chain of thought, I prefer extracting a minimal version of the reasoning and then performing the actual operation (classification in my case) in a separate prompt. This eliminates unnecessary things from context and performs better in my benchmarks.
One implementation used a gpt-3.5 prompt for "clues", "reasoning", "summary" (of clues + reasoning), and "classification" (no schema was provided here; it was discarded anyway). It then used a 4-turbo prompt for classifying only the summary given a complex schema. Having a classification field in the 3.5 prompt makes the reasoning output cleaner even though the output value never gets used.
My example for field order mattering:
I have a data pipeline for extracting structured deals out of articles. This had two major issues.
1. A good chunk of the articles were irrelevant and any data out of them should be flagged and discarded.
2. Articles could have multiple deals.
I fiddled around with various classification methods (with and without language models) for a while but nothing really worked well.
Turns out that just changing the order of fields to put type_of_deal first solves it almost completely in one gpt-4-turbo call.
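Roughly what that looks like as a pydantic model (a sketch -- apart from type_of_deal, the field and category names here are invented):

from typing import List, Literal, Optional
from pydantic import BaseModel

class Deal(BaseModel):
    # Declared first so the model commits to "is this even a deal, and what kind?"
    # before it starts filling in the other fields.
    type_of_deal: Literal["acquisition", "funding_round", "partnership", "not_a_deal"]
    buyer: Optional[str] = None
    seller: Optional[str] = None
    amount_usd: Optional[float] = None

class ArticleExtraction(BaseModel):
    deals: List[Deal]  # an article can contain several deals, or none at all

The JSON Schema for the gpt-4-turbo call then comes straight from ArticleExtraction.model_json_schema(), which keeps the properties in declaration order.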
Both of ChatGPT's is_palindrome functions have terrible performance. The algorithmic efficiency doesn't matter because the cost of iterating through each character in pure Python dwarfs everything. The first function is about 3 times slower than the second one, but only because it spends >98% of its time in the "convert to lowercase and remove non-alphanumeric characters" part (which the second function doesn't bother doing at all). If you remove that step then the first function is 28 times faster than the second in my benchmark. That's because the first function does the reversing and comparison in O(1) Python operations, which is still O(n) C operations but the C operations are orders of magnitude cheaper.
An optimal version would combine the second function's algorithmic improvement with the first function's 'leave it to C' approach:
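(a sketch along those lines: compare the first half of the string with the reversed second half, so the work stays in C and only about half the characters are touched)

def is_palindrome(s):
    half = len(s) // 2
    return s[:half] == s[len(s) - half:][::-1]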
This is a bit under twice as fast as ChatGPT's first function with the cleaning removed. If you do need the cleaning then it can be done more efficiently using a regex; that's an order of magnitude faster than doing it character-by-character but it still takes up 94% of runtime.
That said, the second prompt asked for "the most algorithmically efficient solution possible", not the practically fastest solution possible, so arguably ChatGPT gave the correct answer there. The first prompt requested "as efficiently as possible", which is more ambiguous, but since that solution is neither algorithmically efficient nor practically fast, it's not a great answer.
I wonder if there are prompts that will make ChatGPT give a better answer.
This is all using CPython. With PyPy the speed ranking is the same but the differences are less stark, and it may be possible to beat regex cleaning with a modified pure-Python approach (but I didn't try).
All of these have the same worst-case algorithmic efficiency, O(n). The difference is the best-case efficiency. The "optimized" version in the article is O(1). Your solution is still O(n) best case.
The optimal solution will depend on the data. If most strings aren't palindromes then optimizing the best case is likely the better approach. (Example: you are adding an easter egg which will trigger on "random" user input.) If palindromes (or near-palindromes) are common then your solution will be faster, as the slope is lower.
Yes, I was going for algorithmic complexity instead of real-world speed since algorithmic complexity is better to demonstrate the contrast of prompt engineering.
I just ran some tests to engineer the prompt for CPU utilization: even GPT-4 does the standard Pythonic approach but does recognize "This solution is very efficient because it uses Python's built-in string slicing, which is implemented in C and is therefore very fast."
> There is promise in constraining output to be valid JSON. One new trick that the open-source llama.cpp project has popularized is generative grammars
This has been working for months now and is the best method for this type of stuff, a thing for moat-lovers. Too bad it wasn't explored here; the text-based methods turned out to be mostly an unreliable waste of time.
I've been using the instructor[1] library recently and have found the abstractions simple and extremely helpful for getting great structured outputs from LLMs with pydantic.
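A sketch of the pattern (assuming instructor's patch-style API and an invented Issue model):

import instructor
from openai import OpenAI
from pydantic import BaseModel

class Issue(BaseModel):
    title: str
    severity: str  # e.g. "low" | "medium" | "high"

# instructor patches the client so create() accepts a response_model
client = instructor.patch(OpenAI())

issue = client.chat.completions.create(
    model="gpt-4-1106-preview",
    response_model=Issue,  # comes back as a validated pydantic object
    messages=[{"role": "user", "content": "The login page 500s whenever the password has a space in it."}],
)
print(issue.title, issue.severity)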
Is the first Python example correct since it strips out non-alphanumeric characters? An errant space or punctuation in one half of the string will turn a non-palindromic string into a palindromic one. Never mind the lowercasing!
def is_palindrome(s):
    # Convert the string to lowercase and remove non-alphanumeric characters
    cleaned_string = ''.join(char.lower() for char in s if char.isalnum())
    # Compare the cleaned string with its reverse
    return cleaned_string == cleaned_string[::-1]
It's not the same as the C version, which simply compares the characters at opposite offsets of the string using two pointers.
The OP goes on to remark that the Python implementation is pretty standard but doesn't acknowledge that the C and Python versions will not produce the same result.
Basically... you still need to code-review GPT function output. It's probably about as good as a junior engineer trusting the first result from Stack Overflow and not verifying it.
I mention in a footnote that the input having no non-alphanumeric characters is an implied constraint for palindrome problems. Just doing a two-pointer approach, as that iteration of the ChatGPT-generated solution does, would fail the test case of "A man, a plan, a canal, Panama!" (an extremely famous palindrome).
Another implicit constraint, now that I'm looking at it again, is that the characters are uncased, so the ChatGPT solution would fail the test case due to the capital P of Panama.
> Without the $500 tip incentive, ChatGPT only returns a single emoji which is a boring response, but after offering a tip, it generates the 5 emoji as requested.
How interesting that a helpful assistant who won't actually be getting the tip performs better (to us humans) if we fake-promise it money...
This reminds me of something I discovered when implementing a request from a user who cheekily wanted to use "enhance" to zoom in on a map. I added it as a few-shot injected example in the prompt, which worked great: sending "enhance" to the model zoomed the map in by one zoom level. Then I noticed typing "enhance!!!" would zoom the map in by 2 or 3 levels!
Of course that's true but in this case it doesn't seem so mysterious to me. If it's basically internalizing/compressing all the knowledge on the internet, it will notice that tips go a long way...
This is why I am pretty polite when I query AIs; I assume that makes them respond more helpfully.
In Langroid, a multi-agent LLM framework from ex-CMU/UW-Madison researchers,
https://GitHub.com/langroid/langroid
we (like simpleaichat from OP) leverage Pydantic to specify the desired structured output, and under the hood Langroid translates it to either the OpenAI function-calling params or (for LLMs that don’t natively support fn-calling) auto-inserts appropriate instructions into the system prompt. We call this mechanism a ToolMessage.
We take this idea much further — you can define a method in a ChatAgent to “handle” the tool and attach the tool to the agent. For stateless tools you can define a “handle” method in the tool itself and it gets patched into the ChatAgent as the handler for the tool.
You can also define a class method called “examples” and this will result in few-shot examples being inserted into the system message.
Inevitably an LLM will generate a wrong format or entirely forget to use a tool, and Langroid’s built-in task loop ensures a friendly error message is sent back to the LLM to have it regenerate the structured message.
For example here’s a colab quick-start that builds up to a 2-agent system to extract structured info from a document, where the Extractor agent generates questions to the RAG Agent that has access to the document:
To someone who uses the API and trials different prompts frequently: does this article align with the behavior you see? (E.g. the tipping example.)
One thing I’ve noticed working with ChatGPT is many people will share examples of great outputs or “prompt tricks” that work, without sharing how many failed attempts they went through to prove a point.
I'm pretty skeptical of the tipping section. Sure, it might work, but the two examples are a bit suspect. The first example relies on a tweet lacking in context that doesn't actually show the system prompts and outputs. (The author's "reproduction" appears to be something completely different and n=1.) The second example uses wildly different system prompts, and I think it's far more likely that referencing Stack Overflow results in a more "optimized" solution than offering a tip.
Yeah, the folks working on aider (AI pair programming) [1] found that these kind of tricks reduced performance for them.
I’m pretty confident there will be situations where you can measure a statistically significant performance improvement by offering a tip or telling the model you have no hands, but I’m not convinced that it’s a universal best practice.
A big issue is that a lot of the advice you see around prompting is (imo) just the output of someone playing with GPT for a bit and noticing something cool. Without actual rigorous evals, these findings are basically just superstitions
For what it’s worth, tipping is one of the most popular pieces of advice on r/ChatGPT to improve prompts. It’s ridiculous but seems to work for a lot of people.
We went from using JSON schema to TypeScript types (with comments as needed). For complex schemas (in unscientific testing) we found the output to be better with TypeScript types, and more or less the same for simpler schemas. TypeScript types are also easier (shorter) to write than JSON schema.
There are few benefits to using JSON schema imo, since the LLM isn't a precise validator.
is the tipping thing correct? I provided the same prompt to ChatGPT and received multiple emojis without offering a tip.
prompt: you're Ronald McDonald. respond with emojis. what do you do for fun?
answer: :circus_tent: :hamburger: :juggling: :party_popper: :balloon: :game_die: :french_fries: :performing_arts: :rolling_on_the_floor_laughing: :people_holding_hands: :rainbow: :art_palette:
It's also non-deterministic if you drop the temperature to zero. The only way to get deterministic responses is to lock the seed argument to a fixed value.
TLDR: Developers can now specify seed parameter in the Chat Completion request for consistent completions. We always include a system_fingerprint in the response that helps developers understand changes in our system that will affect determinism.
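In practice (a sketch with the v1 openai Python SDK):

from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    seed=12345,      # fixed seed -> consistent completions "most of the time"
    temperature=0,
    messages=[{"role": "user", "content": "Classify the sentiment of: 'the battery died after two days'"}],
)
# If system_fingerprint differs between two calls, the backend changed and
# you shouldn't expect identical outputs even with the same seed.
print(resp.system_fingerprint)
print(resp.choices[0].message.content)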
Thank you, I should have been more specific. I guess what I’m asking is: how deterministic would you say it is in your experience? Can this be used for classification purposes where the values should not be outside what’s given in a variable input prompt? Or when we say deterministic, are we saying that, given the same prompt, the output would be exactly the same? Or is the seed a starting parameter that effectively corners the LLM into a specific starting point, which could then, depending on the variable prompts, potentially give non-deterministic answers?
Perhaps I’m misunderstanding how the seed is used in this context. If you have any examples of how you use it in real world context then that would be appreciated.
I’ve not had any success to make responses deterministic with these settings. I’m even beginning to suspect historic conversations via API are used to influence future responses, so I’m not sure if it’ll truly be possible.
The most success I’ve had for classification purposes so far is using function calling and a hack solution of making a new object for each data point you want to classify, to fit the schema OpenAI wants. Then an inner prop that is static to hold the value. The description of that object is just a generic “choose from these values only: {CATEGORIES}”. Placing your value choices in all capital letters seems to lock in for the LLM that it should not deviate outside those choices.
For my purposes it seems to do quite well, but at the cost of token inputs to classify single elements in a screenplay where I’m trying to identify the difference between various elements in a scene and a script. I’m sending the whole scene text with the extracted elements (which have already been extracted by regex due to the existing structure, but not classed yet) and asking it to classify each element based on a few categories. But then there’s another question of accuracy.
For sentence or paragraph analysis that might look like the ugly, horrendous-looking “{blockOfText}” = {type: object, properties: {sentimentAnalysis: {type: string, description: “only choose from {CATEGORIES}”}}}. Which is unfortunately not the best-looking way, but it works.
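As a sketch, the tools/function-calling version of that hack ends up looking something like this (names and categories are placeholders):

from openai import OpenAI

client = OpenAI()
CATEGORIES = "POSITIVE, NEGATIVE, NEUTRAL"  # placeholder label set
block_of_text = "..."  # the sentence or paragraph to classify

tools = [{
    "type": "function",
    "function": {
        "name": "classify_text",
        "description": "Record the classification for one block of text.",
        "parameters": {
            "type": "object",
            "properties": {
                "sentimentAnalysis": {
                    "type": "string",
                    "description": f"choose from these values only: {CATEGORIES}",
                },
            },
            "required": ["sentimentAnalysis"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[{"role": "user", "content": block_of_text}],
    tools=tools,
    # force the model to call this function rather than answer in prose
    tool_choice={"type": "function", "function": {"name": "classify_text"}},
)
print(resp.choices[0].message.tool_calls[0].function.arguments)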
The point is you do not have a valid counterexample since you are using a different workflow than what's described in the article.
In my personal experience working with more complex prompts with more specific constraints/rules, adding the incentive in the system prompt has got it to behave much better. I am not cargo-culting: it's all qualitative in the end.
You can usually just say something like: "You must respond with at least five emojis".
Sure, there are cute and clever ways to get it to do things, but it's trained on natural language and instructions, so you can usually just ask it to do the thing you want. If that doesn't work, try stating it more explicitly: "You MUST... "
Also announced at the same conference was a way to make the output near-deterministic by submitting a fixed seed value. Did you try that?
Edit: I'm very confused why this is being downvoted. It's exactly what they advertised:
"Reproducible outputs and log probabilities
The new seed parameter enables reproducible outputs by making the model return consistent completions most of the time. This beta feature is useful for use cases such as replaying requests for debugging, writing more comprehensive unit tests, and generally having a higher degree of control over the model behavior. We at OpenAI have been using this feature internally for our own unit tests and have found it invaluable. We’re excited to see how developers will use it."
I’ve been attempting to use the “official” function calling API for every new version of GPT they put out but it’s always a dead end. It seems only to be able to handle 4-5 functions at a time before it starts hallucinating parameters or starts responding in clear text instead of whatever internal format OpenAI uses in their backend before sending a structured response back to me. The whole JSON schema thing seems way too verbose and complicated, and even with the claims that the new function calling models are specifically tuned to the format, it has the same issues.
I’ve consistently had better luck just passing it a list of typescript function definitions and having it reply with a JSON object of parameters. It seems to understand this way better and doesn’t lose focus half as quickly. It also allows me to mix regular responses and chain-of-thought reasoning in with the calls, which is something it seems to simply refuse to do when “function calling mode” is active.
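A sketch of that style of prompt (the function list and the expected reply shape here are invented for illustration):

from openai import OpenAI

client = OpenAI()

FUNCTIONS_TS = """
// Reply with prose/reasoning if useful, then ONE JSON object:
//   {"function": string, "arguments": object}
function searchIssues(repo: string, limit?: number): Issue[];
function summarizeDocument(documentId: string): string;
function zoomMap(levels: number): void;
"""

resp = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[
        {"role": "system", "content": "You can call these functions:\n" + FUNCTIONS_TS},
        {"role": "user", "content": "show the 10 most recent issues for repo acme/widgets"},
    ],
)
print(resp.choices[0].message.content)
# e.g. a sentence of reasoning followed by:
# {"function": "searchIssues", "arguments": {"repo": "acme/widgets", "limit": 10}}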
An additional trick I’ve been using to make it stay focused with even longer prompts is to only provide a list of function names and let it hallucinate parameters for them, and then “gaslight” it by sending a new request, now with a more detailed prompt on the specific functions it wanted to call. More costly, but I haven’t found any other way of keeping it focused. Anyone know any additional tricks?
Example from a hack week project earlier this month (using a TS-ish schema description that's copy/pasted from healthcare's FHIR standard): https://github.com/microsoft-healthcare-madison/hackweek-202...
Or a more complex example with one model call to invent a TS schema on-the-fly and another call to abstract clinical data into it: https://github.com/microsoft-healthcare-madison/hackweek-202...