Ruta

  • Home
  • Blogs
  • Papers
  • News
  • Projects
  • Music
  • About Me & Contacts
  • Breaking the illusion: Revisiting LLM anthropomorphism

    C Sypherd, W Tang, V Belle

    The 4th International Conference on Human and Artificial Rationalities, 1-19

    As LLMs have demonstrated remarkable performance across diverse domains, researchers have often utilized human categories to describe and evaluate their behavior. Such anthropomorphism results in the application of expectations, benchmarks, and interpretations typically reserved for humans to LLMs. LLM anthropomorphism has a number of benefits, such as facilitating understanding of LLMs, but risks misrepresenting fundamental differences between humans and LLMs and the reality of the progress being made. With that dichotomy in mind, we explore practical taxonomies for the application of anthropomorphic terms and human benchmarks to LLMs that mitigate the risks of LLM anthropomorphism.

    Published Jul 2025
  • Knowledge-free and knowledge-based Theory of Mind reasoning in Large Language Models

    W Tang, V Belle

    The 4th International Conference on Human and Artificial Rationalities, 1-17

    Large Language Models (LLMs) have recently shown promise in, and even the emergence of, Theory of Mind (ToM) ability, outperforming humans on certain ToM tasks. To evaluate and extend the boundaries of the ToM reasoning ability of LLMs, we propose a novel concept, taxonomy, and framework, Knowledge-Free and Knowledge-Based ToM reasoning, and develop a multi-round text-based game, Pick the Right Stuff, as a benchmark. We evaluated seven LLMs with this game and found that their performance on Knowledge-Free tasks is consistently better than on Knowledge-Based tasks. In addition, one of the models with a small parameter size, mistral:7b-instruct, performs similarly to the evaluated models with large parameter sizes and even outperforms several of them; on Knowledge-Based tasks it nearly matches the performance of gpt-4o. These results raise a thought-provoking question: does increasing model parameter size effectively enhance LLM capabilities, at least in the context of ToM reasoning? We expect this work to offer insights into the ToM reasoning ability of LLMs, to pave the way for future ToM benchmarks, and to support the development of more complex AI agents and systems that must be equipped with more sophisticated ToM reasoning ability.

    Published Jul 2025
  • Lyria: A General LLM-Driven Genetic Algorithm Framework for Problem Solving

    W Tang, K Nuamah, V Belle

    While Large Language Models (LLMs) have demonstrated impressive abilities across various domains, they still struggle with complex problems characterized by multi-objective optimization, precise constraint satisfaction, and immense solution spaces. To address this limitation, drawing on the superior semantic understanding of LLMs and the strong global search and optimization capability of genetic algorithms, we capitalize on their respective strengths and introduce Lyria, a general LLM-driven genetic algorithm framework comprising seven essential components. Through extensive experiments with four LLMs across three types of problems, we demonstrate the efficacy of Lyria. With seven additional ablation experiments, we further systematically analyze and elucidate the factors that affect its performance.

    Preprint Jul 2025
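    The abstract describes the general pattern of an LLM-driven genetic algorithm: the LLM proposes semantically informed edits while the GA supplies selection and global search. The seven components and real prompts are not given here, so the following is an illustrative toy sketch of that pattern only, with the LLM call stubbed out by a random string edit.

    ```python
    import random

    TARGET = "aaaaaaaa"  # toy objective: match this string

    def fitness(candidate: str) -> int:
        """Number of positions matching the target (higher is better)."""
        return sum(c == t for c, t in zip(candidate, TARGET))

    def llm_mutate(candidate: str) -> str:
        """Stand-in for an LLM-proposed mutation; a real system would
        prompt the model to edit the candidate toward the objective."""
        i = random.randrange(len(candidate))
        return candidate[:i] + random.choice("ab") + candidate[i + 1:]

    def crossover(a: str, b: str) -> str:
        """Single-point crossover between two parents."""
        cut = random.randrange(1, len(a))
        return a[:cut] + b[cut:]

    def evolve(pop_size: int = 20, generations: int = 50) -> str:
        pop = ["".join(random.choice("ab") for _ in TARGET)
               for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)
            parents = pop[: pop_size // 2]      # selection + elitism
            children = [llm_mutate(crossover(random.choice(parents),
                                             random.choice(parents)))
                        for _ in range(pop_size - len(parents))]
            pop = parents + children
        return max(pop, key=fitness)

    best = evolve()
    ```

    Because the top half of the population is carried over unchanged, the best fitness is non-decreasing across generations; the LLM-as-mutation-operator slot is where Lyria's semantic guidance would plug in.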
  • HyGenar: An LLM-Driven Hybrid Genetic Algorithm for Few-Shot Grammar Generation

    W Tang, Y Li, C Sypherd, E Polgreen, V Belle

    ACL 2025 Findings

    Grammar plays a critical role in natural language processing and text/code generation by enabling the definition of syntax, the creation of parsers, and the guiding of structured outputs. Although large language models (LLMs) demonstrate impressive capabilities across domains, their ability to infer and generate grammars has not yet been thoroughly explored. In this paper, we study and improve the ability of LLMs at few-shot grammar generation, where grammars are inferred from small sets of positive and negative examples and generated in Backus-Naur Form. To explore this, we introduce a novel dataset comprising 540 structured grammar-generation challenges, devise six metrics, and evaluate eight LLMs on it. Our findings reveal that existing LLMs perform sub-optimally in grammar generation. To address this, we propose an LLM-driven hybrid genetic algorithm, HyGenar, to optimize grammar generation. HyGenar achieves substantial improvements in both the syntactic and semantic correctness of generated grammars across LLMs.

    Published May 2025
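    Scoring a candidate grammar against positive and negative examples, as in the few-shot setting above, amounts to a membership check per example. HyGenar's actual representation and fitness are not given in the abstract; the following is a hypothetical sketch using a plain Python dict for productions and a brute-force derivation check on a toy grammar for a^n b^n.

    ```python
    # Toy grammar (an assumed representation, not HyGenar's):
    # S ::= "a" S "b" | ""        -- the language a^n b^n
    GRAMMAR = {"S": [["a", "S", "b"], []]}

    def derives(symbols, text, grammar):
        """True if the symbol sequence can derive exactly `text`.
        Brute-force leftmost expansion; fine for short examples."""
        if not symbols:
            return text == ""
        head, rest = symbols[0], symbols[1:]
        if head in grammar:  # nonterminal: try each production
            return any(derives(prod + rest, text, grammar)
                       for prod in grammar[head])
        # terminal: must match a prefix of the remaining text
        return text.startswith(head) and derives(rest, text[len(head):], grammar)

    def fitness(grammar, positives, negatives):
        """Fraction of positives accepted plus negatives rejected."""
        ok_pos = sum(derives(["S"], s, grammar) for s in positives)
        ok_neg = sum(not derives(["S"], s, grammar) for s in negatives)
        return (ok_pos + ok_neg) / (len(positives) + len(negatives))
    ```

    A GA over grammars would then use this score as its selection signal, distinguishing syntactic validity (the grammar parses at all) from semantic correctness (it accepts exactly the intended language).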
  • LTLBench: Towards benchmarks for evaluating temporal logic reasoning in large language models

    W Tang, V Belle

    Temporal reasoning (TR) is a critical component of artificial intelligence, encompassing the understanding and processing of temporal information and of relationships between events. To study the TR ability of Large Language Models (LLMs), various datasets have been constructed in different ways to evaluate different aspects of TR. We propose a novel pipeline for constructing datasets that evaluate the TR ability of LLMs, leveraging random directed graph generation, LTL formulae, and the NuSMV model checker. Using this pipeline, we construct a benchmark dataset, LTLBench, consisting of 2,000 TR challenges, and evaluate six LLMs with it. Furthermore, we conduct additional experiments on how increasing the number of events and formula operators affects the complexity of TR problems and the performance of LLMs. We demonstrate that although LLMs exhibit some promise in handling TR challenges, they still struggle with complex TR. We expect this work to offer insights into the TR ability of LLMs while also providing a valuable tool for future TR evaluations.

    Preprint Jul 2024
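    The pipeline above chains random directed graph generation with LTL formula generation and NuSMV checking. A minimal sketch of the first two stages might look as follows; the graph parameters, SMV encoding, and the sample `F` (eventually) property are all illustrative assumptions, and the actual NuSMV invocation is omitted.

    ```python
    import random

    def random_digraph(n_events: int, seed: int = 0):
        """Random directed graph over events e0..e{n-1}; edges model
        allowed transitions between events."""
        rng = random.Random(seed)
        nodes = [f"e{i}" for i in range(n_events)]
        edges = [(a, b) for a in nodes for b in nodes
                 if a != b and rng.random() < 0.4]
        return nodes, edges

    def to_smv(nodes, edges) -> str:
        """Render the graph as a NuSMV model whose single state variable
        walks the graph; an LTLSPEC line then poses one TR challenge."""
        succ = {n: [b for a, b in edges if a == n] or [n] for n in nodes}
        lines = ["MODULE main",
                 f"VAR state : {{{', '.join(nodes)}}};",
                 "ASSIGN",
                 f"  init(state) := {nodes[0]};",
                 "  next(state) := case"]
        for n in nodes:
            lines.append(f"    state = {n} : {{{', '.join(succ[n])}}};")
        lines += ["    TRUE : state;",
                  "  esac;",
                  f"LTLSPEC F state = {nodes[-1]};"]
        return "\n".join(lines)

    model = to_smv(*random_digraph(4))
    ```

    Feeding such a model to NuSMV yields a ground-truth verdict for the property, which can then be paired with a natural-language rendering of the same challenge for the LLM.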
  • ToM-LM: Delegating Theory Of Mind Reasoning to External Symbolic Executors in Large Language Models

    W Tang, V Belle

    18th International Conference on Neural-Symbolic Learning and Reasoning

    Theory of Mind (ToM) refers to the ability of individuals to attribute mental states to others. While Large Language Models (LLMs) have shown some promise with ToM ability, they still struggle with complex ToM reasoning. Our approach leverages an external symbolic executor, specifically the SMCDEL model checker, together with fine-tuning, to improve the ToM reasoning ability of LLMs. In our approach, an LLM is first fine-tuned on pairs of natural-language and symbolic-formulation representations of ToM problems, and is then instructed to generate the symbolic formulation given a one-shot in-context example. The generated symbolic formulation is executed by the SMCDEL model checker to perform transparent and verifiable ToM reasoning and give the final result. We demonstrate that our approach, ToM-LM, shows a significant improvement over all the constructed baselines. Our study proposes a novel view …

    Published Apr 2024