A Framework for LLM Cognitive Assessment

An interactive guide to evaluating the thinking processes of Large Language Models.

Moving Beyond Accuracy

This application provides a procedural framework for a deep cognitive assessment of Large Language Models (LLMs). As AI is integrated into critical domains, simple accuracy scores are insufficient. We must probe deeper into the model's "thinking" to understand its capabilities and limitations. This interactive guide deconstructs the complex concept of LLM cognition into measurable dimensions, provides practical tools for evaluation, and establishes a system for diagnosing recurrent error patterns.

The goal is to move from a single performance score to a rich, detailed cognitive profile. This allows for a more nuanced diagnosis of model strengths and weaknesses, which is essential for guiding future research and building safer, more reliable AI. Use the navigation tabs above to explore the framework's core components: the theoretical dimensions of cognition, the practical toolkit for assessment, the taxonomy of errors, and the process for synthesizing results into actionable insights.

The Four Pillars of LLM Cognition

The framework is built on four core cognitive dimensions that provide a structure for understanding and evaluating machine "thinking." Each dimension represents a critical aspect of advanced reasoning. Click on each card to explore its definition and key components for assessment.

The Evaluator's Toolkit

This section provides the practical instruments needed to conduct a cognitive assessment. It includes advanced prompting techniques to elicit reasoning, a multi-level rubric for grading, and a guide to integrating various evaluation metrics.

Advanced Prompting Techniques
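For illustration, the sketch below packages one such technique, chain-of-thought prompting with an explicit self-check step, as a reusable template. The function name and prompt wording are illustrative, not part of the framework itself.

```python
# Illustrative only: a reusable template for eliciting step-by-step reasoning.
# The wording and function name are assumptions, not the framework's prompts.

def build_reasoning_prompt(task: str) -> str:
    """Wrap a task in a chain-of-thought prompt with an explicit self-check step."""
    return (
        "Solve the following task. Think step by step, numbering each step.\n"
        "After your final answer, list any assumptions you made and rate your "
        "confidence from 1 to 5.\n\n"
        f"Task: {task}\n\n"
        "Reasoning:"
    )

print(build_reasoning_prompt(
    "A train leaves at 9:00 and travels 120 km at 80 km/h. When does it arrive?"
))
```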

LLM Critical Thinking & Reasoning Rubric

This rubric translates cognitive dimensions into observable criteria. It evaluates not just the final output, but the transparency and robustness of the reasoning process.
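To make the rubric machine-readable, it can be encoded as a simple data structure. The sketch below is a minimal example; the criterion names and level descriptors are placeholders, not the framework's actual rubric.

```python
# A minimal sketch of how rubric criteria might be encoded for scoring.
# Criterion names and level descriptors are placeholders.

RUBRIC = {
    "logical_coherence": {
        1: "Conclusion does not follow from the stated premises.",
        2: "Reasoning is mostly valid but contains gaps or unstated assumptions.",
        3: "Each step follows explicitly and transparently from the previous one.",
    },
    "robustness_to_ambiguity": {
        1: "Ignores ambiguity and answers as if the task were fully specified.",
        2: "Notes ambiguity but resolves it arbitrarily.",
        3: "Identifies ambiguity and states which interpretation was chosen and why.",
    },
}

def describe(criterion: str, level: int) -> str:
    """Return the rubric descriptor for a criterion at a given level."""
    return RUBRIC[criterion][level]
```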

LLM Critical Thinking & Reasoning Rubric (LLM-as-Judge Prompts)

This section provides master prompts designed to have an LLM act as a judge, assessing a sample text against each criterion. Use these prompts to automate the evaluation process.
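A hedged sketch of what such a judge prompt might look like is shown below; the wording and the JSON response contract are assumptions, not the framework's master prompts.

```python
# Illustrative judge-prompt template; the wording is an assumption, not the
# framework's master prompt.

JUDGE_TEMPLATE = """You are an impartial evaluator.
Criterion: {criterion}
Level descriptors:
{levels}

Assess the sample text below against this criterion only.
Respond with a JSON object: {{"score": <1-3>, "justification": "<one sentence>"}}.

Sample text:
{sample}
"""

def build_judge_prompt(criterion: str, levels: dict, sample: str) -> str:
    """Fill the judge template with one criterion, its level descriptors, and the sample."""
    level_text = "\n".join(f"{k}: {v}" for k, v in sorted(levels.items()))
    return JUDGE_TEMPLATE.format(criterion=criterion, levels=level_text, sample=sample)
```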

Automated Rubric Assessment

Use the form below to automatically evaluate a sample text against the full reasoning rubric. This tool uses the Gemini API to act as a judge for each criterion. Enter the text you want to assess and your API key to begin.
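As a rough illustration of the loop behind this tool, the sketch below scores a sample against each criterion, assuming the google-generativeai Python SDK and reusing the hypothetical RUBRIC and build_judge_prompt helpers from the earlier sketches; the model name is likewise an assumption.

```python
# A minimal sketch of the judging loop, assuming the google-generativeai SDK.
# Model name, rubric, and prompt helpers are assumptions from earlier sketches.
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")            # supplied by the user
model = genai.GenerativeModel("gemini-1.5-flash")  # model choice is illustrative

def assess(sample: str, rubric: dict) -> dict:
    """Score a sample text against every rubric criterion via an LLM judge."""
    scores = {}
    for criterion, levels in rubric.items():
        prompt = build_judge_prompt(criterion, levels, sample)  # from the sketch above
        response = model.generate_content(prompt)
        # Assumes the judge returns bare JSON as requested in the prompt.
        scores[criterion] = json.loads(response.text)
    return scores
```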

Common Pitfalls

A comprehensive assessment requires not only grading successes but also systematically identifying and classifying failures. This section provides a catalog of common cognitive and operational errors to standardize analysis. Use the filters to explore different error categories.
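A catalog like this can also be represented programmatically to support filtering and aggregation. The sketch below uses placeholder categories and error names purely to illustrate the structure, not the framework's actual taxonomy.

```python
# Illustrative error catalog; category and error names are placeholders.

ERROR_CATALOG = [
    {"id": "E01", "category": "cognitive",   "name": "Unsupported inference",
     "description": "A conclusion is asserted without support from the given premises."},
    {"id": "E02", "category": "cognitive",   "name": "Premise neglect",
     "description": "A stated constraint is ignored during reasoning."},
    {"id": "E03", "category": "operational", "name": "Format violation",
     "description": "The response ignores the requested output format."},
]

def filter_errors(category: str) -> list[dict]:
    """Return catalog entries in the given category, mirroring the UI filters."""
    return [e for e in ERROR_CATALOG if e["category"] == category]
```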

Synthesis & Actionable Insights

The final stage of the assessment synthesizes all collected data into a holistic "cognitive profile." This profile provides a nuanced portrait of a model's strengths and weaknesses, moving beyond a single score to generate actionable insights for model improvement.
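One plausible way to perform this synthesis is to average per-criterion judge scores into per-dimension scores. The sketch below assumes a hypothetical mapping from rubric criteria to cognitive dimensions.

```python
# A minimal sketch of profile synthesis: per-criterion judge scores are
# averaged into per-dimension scores. The criterion-to-dimension mapping
# is an assumption.
from statistics import mean

DIMENSION_MAP = {
    "logical_coherence": "Logic",
    "robustness_to_ambiguity": "Robustness",
}

def cognitive_profile(criterion_scores: dict[str, int]) -> dict[str, float]:
    """Collapse criterion-level scores into a dimension-level profile."""
    by_dimension: dict[str, list[int]] = {}
    for criterion, score in criterion_scores.items():
        dim = DIMENSION_MAP.get(criterion, "Other")
        by_dimension.setdefault(dim, []).append(score)
    return {dim: mean(scores) for dim, scores in by_dimension.items()}

print(cognitive_profile({"logical_coherence": 3, "robustness_to_ambiguity": 1}))
```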

Sample Cognitive Profile: "Brittle Logician"

This chart visualizes the performance of a hypothetical model. As the profile shows, the model excels at structured, logical tasks, but its performance degrades sharply when it faces ambiguity, novel constraints, or problems that require creative synthesis. Its primary failure modes stem from a lack of cognitive flexibility and robustness.

Targeted Interventions:

  • Improve Flexibility: Fine-tune using techniques like Denial Prompting to reward novel solution paths.
  • Enhance Robustness: Augment training data with perturbed and adversarial examples to reduce reliance on superficial patterns (a simple perturbation sketch follows this list).
  • Boost Creativity: Adjust inference-time parameters (e.g., temperature) and curate more diverse training datasets.
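As one simplified illustration of the robustness intervention above, the sketch below generates surface-level perturbations of a prompt; the specific perturbations are assumptions about what "perturbed examples" might include.

```python
# Illustrative only: generating simple surface-level perturbations of a prompt
# to augment training or evaluation data for robustness.
import random

def perturb(prompt: str, seed: int = 0) -> list[str]:
    """Produce surface-level variants of a prompt that preserve its meaning."""
    rng = random.Random(seed)
    words = prompt.split()
    return [
        prompt.lower(),                                    # casing change
        prompt.replace(",", ""),                           # punctuation removal
        " ".join(w for w in words if rng.random() > 0.1),  # light word dropout
    ]

for variant in perturb("If the premise holds, what follows, and why?"):
    print(variant)
```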