Large Language Models (LLMs) have shown remarkable performance on a variety of basic natural language tasks, which raises hopes for achieving Artificial General Intelligence. To complete complex tasks more reliably, we need LLMs to first write a program for the task and then follow that program to generate a specific solution for each test sample. We propose using natural language as a new programming language to describe task procedures, making them easily understandable to both humans and LLMs. LLMs are capable of directly generating natural language programs, but these programs may still contain factual errors or incomplete steps. We therefore propose the Learning to Program (LP) method, which asks LLMs themselves to learn natural language programs from the training data of complex tasks and then use the learned program to guide inference. Our experiments on the AMPS (high school math) and MATH (competition mathematics) datasets demonstrate the effectiveness of our approach. When testing ChatGPT on 10 tasks from the AMPS dataset, our LP method outperformed direct zero-shot prompting by 18.3% on average.
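As I read the abstract, the pipeline boils down to two kinds of LLM calls per task: one that distills a natural-language "program" (a step-by-step procedure) from solved training examples, and one per test question that prepends the learned program so the model follows it instead of answering zero-shot. Here is a minimal sketch of that loop, assuming a placeholder call_llm() wrapper and illustrative prompt wording; the authors' actual prompts and code will differ:

```python
# Sketch of the two-phase "Learning to Program" idea as described in the
# abstract. call_llm() is a placeholder for whatever chat-completion client
# is actually used; the prompt text here is purely illustrative.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text reply."""
    raise NotImplementedError

def learn_program(training_examples: list[str]) -> str:
    """Phase 1: distill a natural-language procedure for the task
    from a handful of solved training examples."""
    prompt = (
        "Here are solved examples of a task:\n\n"
        + "\n\n".join(training_examples)
        + "\n\nWrite a general, step-by-step procedure (in plain English) "
          "for solving any instance of this task."
    )
    return call_llm(prompt)

def solve_with_program(program: str, test_question: str) -> str:
    """Phase 2: prepend the learned program to each test question so the
    model follows the procedure rather than answering zero-shot."""
    prompt = (
        f"Procedure:\n{program}\n\n"
        f"Follow the procedure above to solve:\n{test_question}"
    )
    return call_llm(prompt)
```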
I'm curious why it does worse on some problems, and in some cases much worse.
I find myself struggling to connect all of the dots without seeing the entire log. I understand the need to editorialize to show your specific research and implementation. However, I cannot fully grok what is being sent to the LLM without seeing an unedited version. It's probably very stupid, but I need to step through the LLM prompts one inference at a time to see exactly what is being described.
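For what it's worth, even a thin logging wrapper around the model call would answer my question. Something hypothetical like this, dumping the exact string before and after each call, is all I'm after:

```python
# Hypothetical logging wrapper (not from the paper): record the exact,
# unedited prompt and reply of every model call to a JSONL file so each
# inference can be inspected or replayed step by step.
import json
import time

def logged_call(call_llm, prompt: str, log_path: str = "prompts.jsonl") -> str:
    """Wrap any call_llm(prompt) -> str function and append each exchange to a log."""
    reply = call_llm(prompt)
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"ts": time.time(), "prompt": prompt, "reply": reply}) + "\n")
    return reply
```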