Human vs. AI: Poster Categorization
Comparing 5 LLMs against Human Evaluator (Ira) on 18 Student Projects (Round 2: Skeptical Prompt)
1. The "SMOKE" Correction
Trend:
In Round 1, all AI models hallucinated "Computing" because they saw charts. In Round 2, Grok, ChatGPT, and Gemini correctly aligned with the Human, marking the SMOKE poster Not Computing.
Insight:
The new "Skeptical" prompt successfully taught the top models to distinguish between "Graphic Design" and "Computational Science."
2. The Pendulum Swing (Hyper-Strictness)
Trend:
Grok and ChatGPT rejected almost everything, marking 17/18 projects as Not Computing.
Insight:
They became too strict. Where the Human saw "Societal Impact" in mental health flyers, the Skeptical AIs only saw "Administrative Work," resulting in false negatives for valid non-technical projects.
3. Claude's Precision
Claude was the only model to correctly identify "The Art of Scientific Visualization" as Fundamental Computing (improving the visualization tool itself rather than merely using it). It was the most balanced model, avoiding the extreme strictness of Grok/ChatGPT while still filtering out the basic posters.
| Project Title | Ira (Human) | Gemini | Claude | ChatGPT | Perplexity | Grok |
|---|---|---|---|---|---|---|
* Note: Cells with Green Borders indicate a match with the Human Evaluator.
Evaluation: Round 2 Results
Top Performer: Claude & Gemini (~66% agreement with the Human)
The Verdict: The "Skeptical" prompt fixed the hallucination issues but introduced a new "Blind Spot" regarding Societal projects. ChatGPT and Grok are excellent filters for rejecting non-computing work (High Precision) but fail to detect subtle nuance (Low Recall).
Recommendation: Use Claude for the most human-like balance of technical vs. societal understanding. Use ChatGPT if your only goal is to strictly filter out non-technical posters.
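For anyone re-scoring these results, here is a minimal sketch of how the agreement and precision/recall figures can be computed against the human labels; the label lists below are hypothetical placeholders, since the per-project table is not reproduced here in text form.

```python
# Minimal sketch: scoring one model against the human labels.
# The label lists are hypothetical placeholders, not the study's data.
human = ["Not Computing", "Computing", "Not Computing", "Computing"]
model = ["Not Computing", "Not Computing", "Not Computing", "Computing"]

matches = sum(h == m for h, m in zip(human, model))
agreement = matches / len(human)  # e.g. 12/18 ≈ 66% in Round 2

# Treat "Computing" as the positive class to expose the precision/recall trade-off.
tp = sum(h == m == "Computing" for h, m in zip(human, model))
fp = sum(m == "Computing" and h != "Computing" for h, m in zip(human, model))
fn = sum(h == "Computing" and m != "Computing" for h, m in zip(human, model))

precision = tp / (tp + fp) if (tp + fp) else 0.0  # strict models score high here
recall = tp / (tp + fn) if (tp + fn) else 0.0     # strict models score low here

print(f"agreement={agreement:.0%} precision={precision:.2f} recall={recall:.2f}")
```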