Search results for "GPT-4O"
Today
04:54

Perplexity Discloses Web Search Agent Post-Training Method; Qwen3.5-Based Model Outperforms GPT-5.4 on Accuracy and Cost

Perplexity uses SFT followed by RL with Qwen3.5 models, leveraging a multi-hop QA dataset and rubric checks to boost search accuracy and efficiency, achieving best-in-class FRAMES performance. Abstract: Perplexity's post-training workflow for web-search agents combines supervised fine-tuning (SFT) to enforce instruction-following and language consistency with online reinforcement learning (RL) via the GRPO algorithm. The RL stage uses a proprietary multi-hop verifiable QA dataset and rubric-based conversational data to prevent SFT drift, with reward gating and within-group efficiency penalties. Evaluation shows Qwen3.5-397B-SFT-RL achieving top FRAMES performance, 57.3% accuracy with a single tool call and 73.9% with four calls at $0.02 per query, outperforming GPT-5.4 and Claude Sonnet 4.6 on these metrics. Pricing is API-based and excludes caching.
More
12:05

Kimi K2.6 Tops OpenRouter Programming Benchmark, Outperforms Claude and GPT Series

Kimi K2.6 tops OpenRouter leaderboard, outperforming Claude, GPT, and open-source rivals, signaling domestic AI progress and narrowing the gap with global leaders. Abstract: Kimi.ai announced that its latest model, Kimi K2.6, ranked first on the OpenRouter programming-ability leaderboard, leading developer evaluations. Benchmarks indicate K2.6 delivers superior performance across programming tasks relative to Claude, GPT-series, and other open-source models, highlighting gains in code generation and development-task handling and signaling progress for domestic AI toward international leaders.
More
07:05

Anthropic's Claude Code Removal Sparks Developer Backlash; OpenAI Gains Community Support

Anthropic drops Claude Code from Pro plan, drawing criticism as developers migrate to OpenAI; Codex remains free/basic, GPT-5.4 and Image 2.0 boost performance, driving large user migration. Abstract: The article examines Anthropic's removal of Claude Code from the $20 Pro plan, which triggers backlash from developers who call it a hidden price increase and a reliability risk. It contrasts this move with OpenAI's policy of keeping Codex in free and basic tiers, while highlighting strong model performance from GPT-5.4 and ChatGPT Images 2.0, and notes a rapid migration of users to OpenAI, with Codex reportedly exceeding 4 million weekly active users.
More
23:49

OpenAI Rolls Out ChatGPT Images 2.0 with Thinking Capabilities

OpenAI rolls out ChatGPT Images 2.0 with improved rendering, multilingual script support, real-time web search, and multiple outputs per prompt, plus refined accuracy and pricing benchmarks against Imagen. Abstract: The article reports OpenAI's ChatGPT Images 2.0 rollout, noting improved rendering of text and UI elements, broader language support, and better adherence to instructions. It describes new capabilities for real-time search, multiple outputs, and output refinement, while offering pricing context and noting the architecture remains undisclosed.
More