The article examines a peculiar performance disparity among large language models (LLMs) at playing chess, focusing on the unexpected strength of gpt-3.5-turbo-instruct compared to newer models. It explores theories for why this occurs, including the influence of training-data quality and model architecture, and reports experiments with prompting and fine-tuning aimed at improving LLMs' chess play.
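
For context, a common way to elicit chess moves from a completion-style model like gpt-3.5-turbo-instruct is to present the game so far as text and let the model continue it. The sketch below illustrates one plausible shape of such a prompt experiment, assuming the openai>=1.0 Python client; the PGN-style framing, headers, and parameters are illustrative assumptions, not the article's exact setup.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Frame the game as a PGN-style move list and ask the model to
# continue it; a completion model emits the next move as the most
# likely continuation of the text. (Illustrative prompt framing.)
pgn_prefix = (
    '[White "Player A"]\n'
    '[Black "Player B"]\n\n'
    "1. e4 e5 2. Nf3 Nc6 3. "
)

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # completion (non-chat) model
    prompt=pgn_prefix,
    max_tokens=5,      # enough for one SAN move, e.g. "Bb5"
    temperature=0.0,   # greedy decoding for reproducible moves
)

print(response.choices[0].text.strip())  # e.g. "Bb5"
```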