Human vs. AI: Poster Categorization
Comparing 5 LLMs against Human Evaluator (Ira) on 18 Student Projects
1. The "Data Visualization" Trap
Trend:
There was a massive divergence on the project "SMOKE".
Insight:
The human (Ira) classified it as Not Computing, likely because using standard tools to make a graph does not constitute computing research. However, every single LLM classified it as Computing in the Disciplines. The AIs saw charts and data and assumed "Computing," whereas the human required deeper computational methodology.
2. Agreement on Manual Labor
Trend:
100% consensus on physical projects like Cottonwood Comeback and Spaghetti Feed.
Insight:
Both the human and the AIs easily recognize that planting trees and cooking pasta are "Not Computing."
3. Defining "Society"
The category Computing and Society caused the most confusion. For "Accounting for Expression," the human saw it as Disciplines, Claude and ChatGPT saw Society, and Gemini, Perplexity, and Grok saw Not Computing. The distinction between "using tech for social good" and "standard office work" is difficult for the models.
4. Strictness vs. Hallucination
ChatGPT and Grok were the strictest, marking the highest number of projects as Not Computing (13/18), closely matching Ira's count. Gemini and Claude were more generous, often finding "Society" or "Discipline" angles where others didn't.
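The "strictness" measure described above reduces to a simple tally of "Not Computing" labels per model. A minimal sketch, using hypothetical placeholder labels rather than the actual 18-project data:

```python
from collections import Counter

# Per-model strictness = number of projects labeled "Not Computing".
# These label lists are hypothetical placeholders, NOT the real data.
labels_by_model = {
    "ChatGPT": ["Not Computing"] * 13 + ["Society"] * 3 + ["Disciplines"] * 2,
    "Gemini":  ["Not Computing"] * 10 + ["Society"] * 5 + ["Disciplines"] * 3,
}

# Count "Not Computing" labels for each model.
strictness = {
    name: Counter(labels)["Not Computing"]
    for name, labels in labels_by_model.items()
}
print(strictness)  # {'ChatGPT': 13, 'Gemini': 10}
```

With the real project-by-project labels substituted in, this tally recovers each model's "Not Computing" count for direct comparison against Ira's.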
5. The "Unseen/Seed" Ambiguity
One poster (Project 03/09) caused an identity crisis. Some LLMs read the title as "Unseen Enemies," others as "Seed Structures," and they categorized it differently based on which title they hallucinated or focused on.
| Project Title | Ira (Human) | Gemini | Claude | ChatGPT | Perplexity | Grok |
|---|---|---|---|---|---|---|
* Note: Project titles regarding "Unseen Enemies" and "Seed Structures" were merged due to source ambiguity.
Evaluation: Are LLMs Ready?
Accuracy < 80%. The Verdict: No model achieved the 80% accuracy threshold required for reliable autonomous grading. The highest alignment with the human evaluator was approximately 66% (ChatGPT & Claude).
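The alignment metric used here is simple label agreement: the fraction of the 18 projects where a model's category matches the human's. A minimal sketch, using hypothetical placeholder labels (not the actual project data):

```python
# Agreement = fraction of projects where the model's label matches the human's.
def agreement(human, model):
    """Return the share of positions where the two label lists agree."""
    return sum(h == m for h, m in zip(human, model)) / len(human)

# Hypothetical placeholders: a model matching the human on 12 of 18 projects.
human = ["Not Computing"] * 13 + ["Society"] * 3 + ["Disciplines"] * 2
model = ["Not Computing"] * 7 + ["Society"] * 9 + ["Disciplines"] * 2

print(f"Agreement: {agreement(human, model):.1%}")  # Agreement: 66.7%
```

A 12-of-18 match works out to roughly 66.7%, which is the order of the best-performing models here and well short of the 80% bar.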
Recommendation: Do not use these models as the final judge. However, ChatGPT is effective as a "First Pass Filter" because it closely mimics the human's strictness on non-computing projects. Claude is useful for detecting subtle "Society" projects but is prone to over-categorization.