New research shows LLMs struggle with cyber threat intelligence and cognitive tasks. See how GPT-4 and other models fail as data sequence lengths increase.
Researchers at the Rochester Institute of Technology have launched CTIBench, a new evaluation suite designed to measure how large language models perform in cyber threat intelligence (CTI) applications [1]. The benchmark addresses a critical gap in AI testing, as existing tools are either too generic or lack the specific focus required to analyze threat data, vulnerability severity, and actor attribution [1].
CTIBench consists of 2,500 multiple-choice questions derived from authoritative sources like NIST, MITRE ATT&CK, and the Common Weakness Enumeration database [1]. Beyond knowledge testing, the benchmark requires models to perform practical reasoning tasks, such as mapping vulnerabilities to specific weakness categories and predicting severity scores [1]. In initial tests, GPT-4 outperformed other models in most categories, though Gemini 1.5 proved more accurate at predicting vulnerability severity [1]. While LLM-based tools aim to automate incident response and reduce triage time, the researchers warn that the models remain prone to hallucinations and misunderstandings that could produce unreliable intelligence in high-stakes security environments [1].
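To make the two kinds of task concrete, here is a minimal Python sketch of how answers on a multiple-choice set and predictions of severity scores might be scored. It is an illustration, not CTIBench's own evaluation code: the function names are hypothetical, and mean absolute error is an assumed metric for the severity task, since the coverage summarized here does not say which metric the researchers used.

```python
def mcq_accuracy(predicted, gold):
    """Fraction of questions where the predicted option letter matches the answer key."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one question is required")
    # Compare case-insensitively so "b" and "B" count as the same option.
    correct = sum(p.strip().upper() == g.strip().upper()
                  for p, g in zip(predicted, gold))
    return correct / len(gold)


def severity_mae(predicted_scores, gold_scores):
    """Mean absolute error between predicted and reference severity scores.

    Assumption: severity is a numeric score (for example on a 0-10 scale)
    and a lower error means a better prediction.
    """
    if len(predicted_scores) != len(gold_scores):
        raise ValueError("score lists must have the same length")
    if not gold_scores:
        raise ValueError("at least one score is required")
    total = sum(abs(p - g) for p, g in zip(predicted_scores, gold_scores))
    return total / len(gold_scores)


if __name__ == "__main__":
    # Made-up answers and scores, purely for illustration.
    print(mcq_accuracy(["a", "B", "c", "d"], ["A", "B", "D", "D"]))
    print(severity_mae([7.5, 9.0], [8.0, 9.8]))
```

Scoring the two tasks separately matters because a model can lead on one and trail on the other, as the GPT-4 and Gemini 1.5 results suggest.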
Separate research published in PNAS Nexus highlights a deeper, structural limitation in how these models process information. By applying the psychological "Stroop task," which tests the ability to inhibit automatic responses, investigators found that frontier models like GPT-4o and Claude 3.5 Sonnet, and even newer systems like GPT-5 and Gemini 2.5, suffer from a "cognitive collapse" as data sequences grow longer [2].
While models performed well on short lists of five words, accuracy plummeted as the lists grew longer [2]. GPT-4o, for instance, dropped from 91% accuracy at five words to just 15% at 40 words [2]. The study suggests that while humans can exert top-down executive control to suppress automatic impulses, transformer-based models default to the behavior their training favors most, reading the text, when faced with complex or mismatched data [2]. In trials involving mixed lists of matching and mismatched colors, accuracy for these models fell to near 0% [2].
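The setup described above can be sketched in a few lines of Python. This is a hedged illustration rather than the study's protocol: the color set, the `(word, ink)` text encoding of each item, and all function names are assumptions made for the example. It shows only how a list of a given length might be built and scored, and why a responder that simply reads the words gets every mismatched item wrong.

```python
import random

COLORS = ["red", "green", "blue", "yellow"]  # hypothetical color set


def make_stroop_list(length, congruent_fraction=0.0, seed=0):
    """Build a list of (word, ink) pairs; the correct answer is always the ink."""
    rng = random.Random(seed)  # seeded so the list is reproducible
    items = []
    for _ in range(length):
        ink = rng.choice(COLORS)
        if rng.random() < congruent_fraction:
            word = ink  # matching item: word and ink agree
        else:
            # mismatched item: the word names a different color than the ink
            word = rng.choice([c for c in COLORS if c != ink])
        items.append((word, ink))
    return items


def accuracy(responses, items):
    """Share of responses that name the ink color rather than the written word."""
    if len(responses) != len(items):
        raise ValueError("one response per item is required")
    if not items:
        raise ValueError("at least one item is required")
    correct = sum(r == ink for r, (_, ink) in zip(responses, items))
    return correct / len(items)


def read_words(items):
    """Stand-in for the 'automatic' response: repeat the written word."""
    return [word for word, _ in items]


def name_inks(items):
    """Stand-in for a controlled response: report the ink color."""
    return [ink for _, ink in items]


if __name__ == "__main__":
    for n in (5, 40):
        trial = make_stroop_list(n)  # fully mismatched list
        print(n, accuracy(read_words(trial), trial), accuracy(name_inks(trial), trial))
```

In a real evaluation, the `read_words` and `name_inks` stand-ins would be replaced by a model's actual responses, and accuracy would be tracked as the list length grows.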
These findings suggest that the current architecture of synthetic attention lacks the sustained focus required to resist training biases when handling long or complex sequences [2]. As developers push to integrate LLMs into critical fields like cybersecurity, the tension between the models' ability to process vast amounts of data and their tendency to fail under specific logical or length-based pressures remains a significant hurdle for reliable deployment [1, 2].
AI-assisted synthesis by the TrendWatcher Editorial Desk · sourced from 2 outlets · Jun 14, 2026 · How we report