Exploring the Impact of Feedback Loops on LLM Code Generation
Published April 5, 2023
Christopher Ostrouchov
Overview
Large Language Models (LLMs) have been around for several years but recent advances have revolutionized the field of natural language processing (NLP) and artificial intelligence (AI), opening up a world of possibilities across various domains. OpenAI's ChatGPT has taken the world by storm and shown remarkable ability to generate human-like responses to a wide range of queries.
These models require training on vast amounts of data and in some sense can be thought of as a way to summarize and expose the information in the training dataset. It is important to remember, however, that the information returned in response to a query is a statistical answer that does not result from the same logical reasoning steps humans are capable of. This leads to the so-called hallucination effect, in which the model gives a confidently written but factually wrong response to a query.
Between discussion sites like StackOverflow and code repositories like GitHub and GitLab, there is a tremendous amount of source code available online for potential training. Efforts like GitHub Copilot show the power of AI-assisted coding to boost productivity. Our testing has shown, however, that while Copilot can greatly assist a developer, it is also prone to hallucinations: it invents methods that don't exist or uses outdated APIs that no longer exist in recent versions of the code.
We decided to experiment to see if we could add a feedback loop into the process of code generation to reduce model hallucinations and iterate towards a working solution faster. We also experimented with adding conversational feedback to the results rather than the auto-complete mechanism that Copilot currently implements (Note: Copilot just announced a preview release of a conversational, chat-based mode).
As an experiment, we developed a Python package, pseudocode, which allows developers to use an LLM chat session to produce correct and tested code simply by providing code annotations and tests. We believe that pseudocode is a "higher level language" for writing Python code. Below you will find an example. We emphasize that the interface consists of type annotations and function docstrings, and that it provides easy ways to include automated tests. The process works as follows:
- We define the code to be generated by providing a function signature with type annotations
- We define tests that must pass
- We submit these to the OpenAI ChatGPT 3.5 turbo API with some instructions
- If the user provides feedback on the generated code, we send that feedback back to ChatGPT and repeat the process
- We run the resulting code, and if the tests fail, we resubmit the errors to the ChatGPT session
- All feedback goes back into the same ChatGPT session, so each round continues refining the code
Below we walk through a demonstration of this process.
Example
Let's use a concrete, slightly non-trivial task as an example: getting GitHub issues.
We feel this is a good example because it is hard to remember how to use the GitHub API exactly and inevitably requires some back and forth using the PyGithub Python library, API requests, and viewing API documentation. LLMs show promise for these applications since there is a large amount of surrounding documentation but no one example to do exactly what you need. pseudocode enforces the practice of creating an interface specification which:
- declares the function signature
def get_issues(repository: str) -> typing.List[typing.Tuple[str, int, datetime.datetime]]
- uses the function's docstring to specify what the function should be doing
- automates tests run by pseudocode to provide feedback to the LLM on the quality and correctness of the generated code
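To make the three parts of such a specification concrete, here is a minimal plain-Python sketch that is independent of the article's own example and does not use the pseudocode API. The docstring wording, the meaning of the tuple fields (title, number, creation time), and the helper check_issues are assumptions made for illustration; only the signature comes from the text above.

```python
import datetime
import typing

Issue = typing.Tuple[str, int, datetime.datetime]


def get_issues(repository: str) -> typing.List[Issue]:
    """Return the issues of a GitHub repository.

    Assumed meaning of each tuple (not stated in the article):
    (issue title, issue number, creation time).
    """
    ...  # body intentionally empty: this is the part the LLM would generate


def check_issues(issues: typing.Any) -> None:
    """A test that any generated implementation's result must satisfy."""
    assert isinstance(issues, list)
    for title, number, created_at in issues:
        assert isinstance(title, str)
        assert isinstance(number, int)
        assert isinstance(created_at, datetime.datetime)


# Hand-written sample data standing in for a real API response.
sample_issues = [("Example issue", 1, datetime.datetime(2023, 4, 5))]
check_issues(sample_issues)
```

The signature and docstring tell the LLM what to build, and a check like check_issues gives it a pass or fail signal that can be reported back automatically.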
The specification is essential for helping the LLM avoid hallucinating, and we have found that it is highly effective for the examples we've experimented with. When you run the example above, you will see the following.
pseudocode generates appropriate prompts to the LLM asking it to provide code in response to the user's requirements. In this case we specified review=True in the decorator, which prompted pseudocode to ask for user feedback on the generated code.
After some quick initial feedback, we can already see that the LLM takes the user's needs into account. Once the user has provided feedback, the automated tests are run. If those tests fail or run into exceptions, the errors are automatically passed back to the LLM for corrections.
The final generated code after passing all tests.
Conclusion
There are several key ideas here we want to highlight about why we think the approaches taken in pseudocode are unique.
- Autogenerated code needs guidance. Feedback is generated from the user and the rest is guided by automated testing and exception reporting back to the LLM.
- Interface design is naturally a high level task which is the key to composable code. Declaring interfaces for LLMs to operate within is important.
- Separation of code declarations and generated code. Similar to .py and .pyc files, we should separate the interfaces from the autogenerated code.
We mentioned earlier that this was an experiment in the interaction between automated testing, user feedback, and LLMs. We want to explore this more and already have additional ideas, such as applying formatting and linting to LLM-generated code via black, ruff, and isort, and caching LLM-generated code to avoid repetitive API calls and to store a database of reusable code. Similar to pseudocode.pseudo_function(...) shown below, it would be nice to have an equivalent for arbitrary files. This would provide a feel somewhat like cookiecutter on steroids.
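As a rough illustration of the caching idea, the sketch below keys generated code on a hash of the specification so that an unchanged specification never triggers a second API call. This is one possible design and not part of pseudocode: cached_generate, fake_generate, and the on-disk layout are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, List


def cached_generate(
    specification: str, generate: Callable[[str], str], cache_dir: Path
) -> str:
    """Return generated code for a specification, reusing a cached copy if present."""
    key = hashlib.sha256(specification.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.py"
    if path.exists():
        return path.read_text(encoding="utf-8")
    source = generate(specification)  # the expensive LLM call happens only here
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(source, encoding="utf-8")
    return source


calls: List[str] = []


def fake_generate(specification: str) -> str:
    """Hypothetical stand-in for an LLM call; records how often it is invoked."""
    calls.append(specification)
    return "def add(a, b):\n    return a + b\n"


with tempfile.TemporaryDirectory() as directory:
    first = cached_generate("add two numbers", fake_generate, Path(directory))
    second = cached_generate("add two numbers", fake_generate, Path(directory))
```

Here the second call is served from disk, so fake_generate runs only once. Hashing the full specification (signature, docstring, and tests) would invalidate the cache whenever any of them changes.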
To see the full example, have a look at this video.
Check out the project at Quansight/pseudocode.