Authors: Junyi Sha, Renfei Tan

Affiliation: MIT

Published: Dec. 11, 2024

💡 This blog explores an innovative approach to enhance decision-making with large language models (LLMs) through self-reflection and adaptive learning. By integrating strategic reasoning into game-playing contexts like Tic-Tac-Toe, we demonstrate how strategic LLMs can achieve strong performance with minimal training.

Introduction

Large Language Models (LLMs) have revolutionized how we interact with technology, excelling in natural language understanding, code generation, and more. However, one area where they still struggle is multi-step reasoning - a critical skill for mastering complex tasks like strategy games. Games such as Tic-Tac-Toe might appear simple at first glance, but they require planning, foresight, and the ability to evaluate possible future outcomes—skills that challenge even the most advanced LLMs.

In this blog post, we dive into an innovative approach to bridge this gap. By integrating a self-reflection process, we aim to enhance LLMs' ability to reason over multiple steps, making their gameplay strategies more robust and intelligent. Our work focuses on Tic-Tac-Toe as a testing ground, exploring how reinforcement-learning-like techniques can enable LLMs to make smarter moves, anticipate an opponent’s strategy, and achieve better performance in this game.

Playing against GPT-4o: a game of Tic-Tac-Toe where the human wins—highlighting the need for enhanced multi-step reasoning in AI. Even GPT-o1, with its advanced reasoning, would lose against a human player.

Through this exploration, we aim not only to improve LLMs for games but also to uncover broader insights into improving reasoning tasks for AI systems in general.

Advancing Multistep Reasoning in Large Language Models

Comparison of Chain-of-Thought, Tree of Thoughts, and Graph of Thoughts. (Image source: Besta et al., 2024)

Existing literature has explored various techniques to enhance large language models (LLMs) for multistep reasoning tasks. One prominent approach is chain-of-thought prompting [1], which guides the model to break down complex problems into smaller, logical steps, thereby improving its reasoning capabilities. Feng et al. (2023) [7] provide a theoretical perspective on its mechanisms, while Wang et al. (2023) [8] introduce Self-Consistency, an approach that aggregates multiple reasoning paths to improve the robustness and accuracy of Chain-of-Thought outputs, further solidifying its role in solving complex tasks. Another framework, Tree of Thoughts [2], simulates branching decision trees to help LLMs evaluate multiple possible paths before committing to an answer. Additionally, the Graph of Thoughts framework [3] structures reasoning into graphs where "thoughts" are vertices and dependencies are edges, enabling operations like aggregation and refinement to improve reasoning efficiency and solution quality. While these techniques have shown promise in enhancing reasoning performance, they do not inherently include mechanisms for genuine self-reflection.
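To make the Self-Consistency idea concrete, here is a minimal sketch of the voting step, assuming a hypothetical `sample_cot_answer` function that wraps a single chain-of-thought completion and returns only its final answer; the function name and sampling setup are our own illustration, not part of the cited work.

```python
from collections import Counter

def self_consistent_answer(sample_cot_answer, question, num_samples=5):
    """Sample several independent chain-of-thought completions and return
    the final answer that the majority of reasoning paths agree on."""
    answers = [sample_cot_answer(question) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]
```

Tree of Thoughts and Graph of Thoughts extend this idea by searching over intermediate thoughts rather than only voting on final answers.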

Building on this idea of self-improvement, the Self-Play Fine-Tuning (SPIN) framework [4] introduces a novel method for enhancing LLMs without relying on external labeled data. Using the process of self-play, SPIN enables models to iteratively generate, evaluate, and refine their outputs, aligning them with desired target distributions. This approach highlights the potential of self-teaching mechanisms, where models improve autonomously through internal feedback loops. Our study shares a similar philosophy by incorporating a self-reflective structure into LLM gameplay scenarios. However, while SPIN focuses on fine-tuning LLMs for general tasks, our work adapts these self-teaching principles to strategic game-playing contexts, emphasizing adaptive reasoning and decision-making. In addition, we design a self-reflection approach inspired by traditional reinforcement learning, where rewards are conceptualized as key messages instead of probabilistic scores. This allows the LLM to interpret and internalize feedback in a more structured and human-readable form, providing meaningful improvements in gameplay strategies.
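As an illustration of what such message-based feedback might look like in code, the sketch below assumes a generic `query_llm` chat call and a `play_one_game` helper that returns a transcript and an outcome; these names, the prompt wording, and the memory format are hypothetical simplifications rather than our exact implementation.

```python
def play_with_reflection(play_one_game, query_llm, num_rounds=5):
    """Carry human-readable 'reward' messages across games instead of numeric scores."""
    reflections = []  # lessons accumulated from earlier games
    for _ in range(num_rounds):
        # Play one game with all previous reflections prepended to the prompt.
        transcript, outcome = play_one_game(context="\n".join(reflections))
        # Ask the model to distill the game into a single strategic lesson.
        lesson = query_llm(
            f"You just played Tic-Tac-Toe and the result was: {outcome}.\n"
            f"Game transcript:\n{transcript}\n"
            "State, in one sentence, the key strategic lesson to remember."
        )
        reflections.append(lesson)
    return reflections
```

The accumulated lessons act as a lightweight, human-readable counterpart to a reward signal, which the model can condition on in later games.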

Traditional Architectures for Game-Playing AI

Game-playing AI has a long history of using traditional architectures that have evolved significantly over time, particularly for strategic games like Chess and Go. These architectures demonstrate a range of approaches, from reinforcement learning techniques to sophisticated deep learning models.

This algorithm describes the Q-Learning process for playing Tic-Tac-Toe, a reinforcement learning approach where an agent learns optimal actions through trial and error.

Q-Learning: a tabular reinforcement learning method in which an agent learns a value for every state-action pair through trial and error; it is well suited to small state spaces such as Tic-Tac-Toe (a minimal sketch follows this list).

AlphaGo: DeepMind's Go-playing system, which combines deep neural networks with Monte Carlo tree search and was the first program to defeat top human professionals at Go.

Self-Play Architectures: successors such as AlphaZero that dispense with human game records entirely, improving by repeatedly playing against themselves and learning from the outcomes.
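For reference, a minimal tabular Q-learning sketch for Tic-Tac-Toe is shown below; the state encoding, hyperparameter values, and helper names are illustrative assumptions, not a specific published implementation.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # assumed, untuned hyperparameters
Q = defaultdict(float)                   # Q[(state, action)] -> value estimate

def choose_action(state, empty_cells):
    """Epsilon-greedy selection: explore occasionally, otherwise pick the best-known move."""
    if random.random() < EPSILON:
        return random.choice(empty_cells)
    return max(empty_cells, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, next_empty_cells, done):
    """One Q-learning step: Q <- Q + alpha * (reward + gamma * max_a' Q(s', a') - Q)."""
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in next_empty_cells)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```

A tabular agent like this typically needs a large number of self-play games before its value table stabilizes, which contrasts with the limited-trial setting our self-reflective approach targets.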

These traditional architectures have paved the way for significant advancements in game-playing AI. While methods like Q-Learning provide robust foundations for small-scale games, models like AlphaGo and AlphaZero demonstrate the power of integrating deep learning and search. Our approach draws inspiration from these methods, focusing on a lightweight, self-reflective mechanism for LLMs, which is computationally efficient and adaptable to scenarios with limited trial opportunities.

Game Board Setup and Player Models

The foundation of our experiments is a simple and efficient Tic-Tac-Toe board implementation. The board is a 3x3 grid that supports basic operations such as resetting to an initial state, placing pieces, and checking for a game's outcome. It is designed to facilitate interaction between players and ensure valid gameplay mechanics. The board identifies game-ending conditions, including wins (rows, columns, or diagonals), losses and draws, and provides a shared environment for testing various player strategies.
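The snippet below is a minimal sketch of such a board; the class and method names are our own illustrative choices rather than the exact interface used in the experiments.

```python
class TicTacToeBoard:
    """Minimal 3x3 board: reset, place pieces, and check for a game's outcome."""

    WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
                 (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
                 (0, 4, 8), (2, 4, 6)]              # diagonals

    def __init__(self):
        self.reset()

    def reset(self):
        """Return the board to its initial, empty state."""
        self.cells = [" "] * 9

    def empty_cells(self):
        return [i for i, c in enumerate(self.cells) if c == " "]

    def place(self, index, mark):
        """Place 'X' or 'O' at a cell index 0-8, rejecting occupied cells."""
        if self.cells[index] != " ":
            raise ValueError("cell already occupied")
        self.cells[index] = mark

    def winner(self):
        """Return 'X' or 'O' if a line is completed, else None."""
        for a, b, c in self.WIN_LINES:
            if self.cells[a] != " " and self.cells[a] == self.cells[b] == self.cells[c]:
                return self.cells[a]
        return None

    def is_draw(self):
        return self.winner() is None and not self.empty_cells()
```

Each player model described next interacts with the board through this small shared interface.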

To evaluate a range of strategies for playing Tic-Tac-Toe, we designed five distinct player models, each representing a unique approach to decision-making and gameplay: