Researchers from Stanford University's Scaling Intelligence Lab introduced a new inference framework that could help large language models (LLMs) go through potential responses faster.
The framework, Archon, uses an inference-time architecture search (ITAS) algorithm to improve LLM performance without additional training. It is model-agnostic, open-source and designed to be plug-and-play for large and small models.
Archon could ideally help developers design AI systems that combine multiple inference-time techniques, rather than relying on larger, more expensive models, to determine responses. The Scaling Intelligence Lab said techniques like Archon would help cut down on the costs of building models and running inference. As LLM development turns toward larger parameter counts or more advanced reasoning, those costs could keep rising, even as companies like OpenAI anticipate greater affordability.
According to the researchers, Archon automatically designs architectures that improve task generalization, enabling models to perform tasks beyond those they were initially trained on.
“Our Archon framework and ITAS algorithm draw inspiration from neural architectures and neural architecture search, respectively,” the researchers said in their paper. “Archon is constructed of layers of LLMs, in which models in the same layer run in parallel but each layer runs sequentially.”
These layers perform different inference-time techniques, “either transforming the number of candidate responses through generation and fusion (like linear transformations) or reducing the number of candidate responses to improve quality (like non-linearities).”
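To make that analogy concrete, the structure can be sketched roughly as follows. The Layer class, the build_prompt callback and the call_llm placeholder below are illustrative assumptions, not Archon's actual API: models within a layer are called in parallel on the same candidates, and the layers themselves run one after another.

```python
# Hypothetical sketch of the layered structure described above -- not Archon's API.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a real chat-completion call (any API client would work here)."""
    raise NotImplementedError

class Layer:
    """One layer in the stack: its models all run in parallel on the same input."""

    def __init__(self, models: List[str], build_prompt: Callable[[str, List[str]], str]):
        self.models = models              # LLMs that run in parallel within this layer
        self.build_prompt = build_prompt  # turns (query, current candidates) into a prompt

    def run(self, query: str, candidates: List[str]) -> List[str]:
        prompt = self.build_prompt(query, candidates)
        with ThreadPoolExecutor() as pool:
            # Parallel calls within a single layer
            return list(pool.map(lambda model: call_llm(model, prompt), self.models))

def run_architecture(layers: List[Layer], query: str) -> List[str]:
    """Layers run sequentially; each one transforms or reduces the candidate
    responses produced by the layer before it."""
    candidates: List[str] = []
    for layer in layers:
        candidates = layer.run(query, candidates)
    return candidates
```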
Archon outperformed GPT-4o and Claude 3.5 Sonnet by 15.1 percentage points in benchmark tests such as MT-Bench, Arena-Hard-Auto, AlpacaEval 2.0, MixEval, MixEval Hard, MATH and CodeContests. When Archon faced open-source LLMs, it outperformed them by 11.2 percentage points.
Archon components
The ITAS algorithm comprises several LLM components, each of which performs a different inference-time technique.
The first component is the Generator, which creates possible answers for the model. The second component, the Fuser, takes these responses and combines them into one. For example, if a model is asked for the capital of France, the Fuser will take the generated responses “the capital of France is Paris” and “France is in Europe” and turn them into “the capital of France, a country in Europe, is Paris.”
Next, Archon moves to the Ranker component, which ranks the best answers. A Critic component evaluates the ranked answers to determine whether they’re good or bad. The Verifier checks for logic and correctness before moving on to the Unit Test Generator and Evaluator, which generate small tests to see whether the response works and check the test results.
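Strung together, those components form a pipeline of prompt-based stages. The hypothetical Python sketch below shows one way such a pipeline could be wired up; the function names, prompt wording and final answer() flow are assumptions made for illustration, not Archon's actual implementation.

```python
# Hypothetical component pipeline inspired by the description above -- not Archon's code.
from typing import List

def call_llm(model: str, prompt: str) -> str:
    """Placeholder chat-completion call, as in the earlier sketch."""
    raise NotImplementedError

def generate(model: str, query: str, n: int = 5) -> List[str]:
    """Generator: sample several candidate answers to the same query."""
    return [call_llm(model, f"Answer the question:\n{query}") for _ in range(n)]

def fuse(model: str, query: str, candidates: List[str]) -> str:
    """Fuser: merge the candidates into a single combined answer."""
    listed = "\n".join(f"- {c}" for c in candidates)
    return call_llm(
        model,
        f"Question: {query}\nCandidate answers:\n{listed}\n"
        "Combine these into one accurate answer.",
    )

def rank(model: str, query: str, candidates: List[str], top_k: int = 3) -> List[str]:
    """Ranker: ask a model to order the candidates, then keep the best few."""
    listed = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    order = call_llm(
        model,
        f"Question: {query}\nCandidates:\n{listed}\n"
        "List the candidate numbers from best to worst, comma-separated.",
    )
    picks = [int(tok) - 1 for tok in order.split(",") if tok.strip().isdigit()]
    return [candidates[i] for i in picks if 0 <= i < len(candidates)][:top_k]

def verify(model: str, query: str, answer: str) -> bool:
    """Critic/Verifier: check a candidate for logic and correctness."""
    verdict = call_llm(
        model,
        f"Question: {query}\nAnswer: {answer}\n"
        "Is this answer logically sound and correct? Reply YES or NO.",
    )
    return verdict.strip().upper().startswith("YES")

def answer(model: str, query: str) -> str:
    """One possible end-to-end flow: generate, fuse, rank, then verify.
    (A unit-test stage for code tasks would slot in after verification.)"""
    candidates = generate(model, query)
    candidates.append(fuse(model, query, candidates))
    ranked = rank(model, query, candidates)
    return next((c for c in ranked if verify(model, query, c)),
                ranked[0] if ranked else candidates[-1])
```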
By building Archon this way, the researchers said the framework improves the quality of LLMs’ responses faster and without additional fine-tuning.
Archon’s limitations
So far, the Archon framework works best with LLMs of 70B parameters or more, such as Meta’s Code Llama 70B, which puts it out of reach for most LLMs right now. The researchers said much of the challenge comes from smaller models’ limited ability to follow instructions, owing to their smaller context windows.
“When we utilize the Archon architecture with only 7B open-source models, we get a notable decrease of 16%” in performance, the paper stated.
Smaller models using the Archon framework lagged behind single-turn models by 15.7%.
The Stanford lab also said Archon “is not ideal for tasks that prefer the latency of a single LLM call,” such as chatbots. The framework makes multiple LLM calls because of the different operations it performs, so single question-and-answer queries won’t benefit from its capabilities. Archon may work better for tasks involving complex instructions, like solving equations, programming, or even complicated customer service issues.
Despite its limitations, the researchers behind Archon said they hope it can accelerate the development of high-performing models without requiring more capital for inference and training.