“Many developers use the same context repeatedly across multiple API calls when building AI applications, like when making edits to a codebase or having long, multi-turn conversations with a chatbot,” OpenAI explained, adding that the rationale is to reduce token consumption when sending a request to the LLM.
In practice, when a new request comes in, the system checks whether any portion of the prompt has already been processed and cached. If it has, the cached computation is reused; otherwise, the full request is processed from scratch.
OpenAI’s new prompt caching capability works on this same fundamental principle, which could help developers save on both cost and latency.
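To illustrate the idea, here is a minimal sketch using the OpenAI Python SDK: a long, identical prompt prefix (such as codebase context or chat history) is sent with every call, and the response's usage details indicate how much of it was served from cache. The exact field names for cache metrics may vary by SDK version, and the context string here is purely hypothetical.

```python
# Sketch: reuse a long, identical prompt prefix across calls so the repeated
# portion can be served from cache. Assumes the OpenAI Python SDK and an
# OPENAI_API_KEY in the environment; cache-metric field names may differ
# across SDK versions.
from openai import OpenAI

client = OpenAI()

# A long, static system prompt (e.g. project context) shared by every call.
shared_context = (
    "You are a code assistant. Here is the project context:\n" + "... " * 2000
)

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": shared_context},  # identical prefix each call
            {"role": "user", "content": question},          # only this part changes
        ],
    )
    usage = response.usage
    # Recent SDK versions report cached prompt tokens under
    # usage.prompt_tokens_details.cached_tokens (0 on a cache miss).
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) if details else 0
    print(f"prompt tokens: {usage.prompt_tokens}, served from cache: {cached}")
    return response.choices[0].message.content

ask("Summarize the project structure.")  # first call: full prompt processed
ask("List the public API endpoints.")    # repeated prefix: cached tokens reused
```

Because only the changing suffix of the prompt needs fresh processing, repeated calls with the same prefix should show a growing share of cached tokens, which is where the cost and latency savings come from.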