<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:blogger='http://schemas.google.com/blogger/2008' xmlns:georss='http://www.georss.org/georss' xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-6408904051479439567</id><updated>2026-06-03T20:59:07.177+08:00</updated><category term="Life"/><category term="Blogging"/><category term="Thoughts"/><category term="Me"/><category term="Work"/><category term="Buzz"/><category term="Lovin"/><category term="Fun Stuf"/><category term="Large Language Models"/><category term="Occasions"/><category term="Davao"/><category term="Enterprise AI"/><category term="Reinforcement Learning"/><category term="hunee"/><category term="Multimodal AI"/><category term="Photography"/><category term="Making Money Online"/><category term="Tech"/><category term="AI Agents"/><category term="AI Research"/><category term="Agentic AI"/><category term="Jobs"/><category term="Moving On"/><category term="Open-Source AI"/><category term="OpenAI"/><category term="Marketing"/><category term="Medical Transcription"/><category term="Road Trip"/><category term="Success Stories"/><category term="rant"/><category term="Other Blogs"/><category term="Anthropic"/><category term="Generative AI"/><category term="MCP"/><category term="AI Development"/><category term="AI Models"/><category term="Code Generation"/><category term="Retrieval-Augmented Generation"/><category term="Developer Tools"/><category term="Food Trip"/><category term="Google AI"/><category term="Google DeepMind"/><category term="Multi-Agent Systems"/><category term="Music"/><category term="AI Benchmarks"/><category term="Gemini 2.5 Pro"/><category term="Hugging Face"/><category term="Machine Learning"/><category term="Mistral AI"/><category term="Twitter"/><category term="long-context"/><category term="Edge AI"/><category term="Google"/><category term="NVIDIA"/><category term="open-source LLM"/><category term="AI Safety"/><category term="Artificial Intelligence"/><category term="ChatGPT"/><category term="GPT-4.1"/><category term="GPT-4o"/><category term="Meta AI"/><category term="Microsoft"/><category term="Mixture-of-Experts"/><category term="Model Context Protocol"/><category term="Open Source AI"/><category term="Vertex AI"/><category term="Vision-Language Model"/><category term="super speed learning"/><category term="AI Ethics"/><category term="AI Integration"/><category term="AIME"/><category term="Alibaba"/><category term="Coffee"/><category term="Google Gemini"/><category term="LLM Agents"/><category term="LLMs"/><category term="Math Reasoning"/><category term="Photoshop Designs"/><category term="accelerated learning"/><category term="internet"/><category term="open source"/><category term="tool use"/><category term="AI Assistants"/><category term="AI Coding Assistant"/><category term="AI Infrastructure"/><category term="AI productivity"/><category term="Anthropic Claude"/><category term="Beach"/><category term="Claude AI"/><category term="Coding AI"/><category term="GAIA benchmark"/><category term="Gemini"/><category term="Hybrid Reasoning"/><category term="Image Generation"/><category term="LLM"/><category term="LiveCodeBench"/><category term="Tsinghua University"/><category term="UI Design"/><category term="Web Traffic"/><category term="context engineering"/><category term="on-device AI"/><category term="parallel thinking"/><category term="test-time scaling"/><category term="AI"/><category term="AI Benchmarking"/><category term="AI Coding"/><category term="AI Collaboration"/><category term="AI Deployment"/><category term="AI Tools"/><category term="AI Training"/><category term="AI Video Generation"/><category term="AIME benchmark"/><category term="Alibaba Cloud"/><category term="Autonomous AI"/><category term="Autonomous Agents"/><category term="Autonomous Systems"/><category term="BrowseComp"/><category term="Business Transcriptions"/><category term="Consumer AI"/><category term="Conversational AI"/><category term="DeepMind"/><category term="DeepSeek-R1"/><category term="Embodied AI"/><category term="Family"/><category term="Fiction Stories"/><category term="Fine-Tuning"/><category term="Foundation Models"/><category term="GPT-5"/><category term="Gated Memory Unit"/><category term="Gemini 2.5"/><category term="LangChain"/><category term="LangGraph"/><category term="Language Models"/><category term="MLLM"/><category term="Movie"/><category term="Poetry"/><category term="Reasoning"/><category term="Robotics"/><category term="Tagging"/><category term="Tencent Hunyuan"/><category term="blogs"/><category term="chain-of-thought"/><category term="conversation"/><category term="function calling"/><category term="guestblog"/><category term="image editing"/><category term="instruction following"/><category term="multimodal LLM"/><category term="reasoning models"/><category term="small language model"/><category term="survey"/><category term="vLLM"/><category term="AGI"/><category term="AI Alignment"/><category term="AI Assistant"/><category term="AI Developer Tools"/><category term="AI Inference"/><category term="AI Innovation"/><category term="AI Interoperability"/><category term="AI Mode"/><category term="AI Performance"/><category term="AI Reasoning"/><category term="AI Standards"/><category term="AI Strategy"/><category term="AI UX"/><category term="AI Workflow"/><category term="AI for science"/><category term="AI research agents"/><category term="API"/><category term="API Integration"/><category term="Agentic RAG"/><category term="Agents API"/><category term="Andrej Karpathy"/><category term="Andrew Ng"/><category term="Apache 2.0"/><category term="Asynchronous Coding"/><category term="Benchmarking"/><category term="Bug Fixing"/><category term="C-EVAL"/><category term="Cartoons"/><category term="Chinese AI"/><category term="Claude"/><category term="Claude 3.7 Sonnet"/><category term="Claude 4"/><category term="Claude Code"/><category term="Codestral Embed"/><category term="Cost-Effective AI"/><category term="Curriculum Learning"/><category term="DINOv3"/><category term="Databricks"/><category term="Deep Learning"/><category term="Deep Think"/><category term="DeepSeek"/><category term="DeepSeek R1"/><category term="Devstral"/><category term="Digital Transformation"/><category term="Enterprise AI Solutions"/><category term="Fidji Simo"/><category term="Figma Integration"/><category term="Financial Freedom"/><category term="FutureHouse"/><category term="GPT"/><category term="GRPO"/><category term="Gemini 2.0"/><category term="Gemini API"/><category term="Gemini CLI"/><category term="Gemini Deep Think"/><category term="Gemma 3"/><category term="GitHub Copilot"/><category term="Google AI Studio"/><category term="Google Developers"/><category term="Google I/O 2025"/><category term="Google Research"/><category term="Google Vertex AI"/><category term="Holiday Gifts"/><category term="Information Retrieval"/><category term="Jules"/><category term="Kimi K2"/><category term="Kuaishou"/><category term="LIBERO"/><category term="Literature"/><category term="Llama 4"/><category term="LlamaCon"/><category term="Long-Term Memory"/><category term="MCP Integration"/><category term="Machine Learning Engineering"/><category term="Medical AI"/><category term="MetaStone-S1"/><category term="MiniMax-M1"/><category term="MoE architecture"/><category term="ModelBest"/><category term="Multilingual AI"/><category term="NTU"/><category term="OpenRouter"/><category term="Phi-4-mini-flash-reasoning"/><category term="Productivity Tools"/><category term="Python library"/><category term="Qwen3"/><category term="Qwen3-32B"/><category term="RAG"/><category term="RLVR"/><category term="SWE-bench"/><category term="SambaY architecture"/><category term="Stanford University"/><category term="Tongyi Lab"/><category term="Tool Orchestration"/><category term="Twitter Travels"/><category term="Ultra-FineWeb"/><category term="Vibe Coding"/><category term="Vision Transformer"/><category term="Visual Reasoning"/><category term="Web search"/><category term="Windsurf"/><category term="ZeroSearch"/><category term="advanced reasoning"/><category term="benchmarks"/><category term="code analysis"/><category term="code embedding"/><category term="code understanding"/><category term="computer vision"/><category term="computer vision research"/><category term="data sovereignty"/><category term="dataset"/><category term="detection"/><category term="developer guide"/><category term="developer productivity"/><category term="diffusion models"/><category term="efficient inference"/><category term="efficient reasoning"/><category term="evaluation"/><category term="fast reasoning"/><category term="file creation"/><category term="intelligent agents"/><category term="knowledge graphs"/><category term="linear attention"/><category term="long-context LLM"/><category term="memory"/><category term="microblogging"/><category term="mobile AI"/><category term="multi-agent architecture"/><category term="multimodal embeddings"/><category term="natural language processing"/><category term="online RL"/><category term="open weights"/><category term="open-source"/><category term="open-source models"/><category term="open-source robotics"/><category term="open‑source AI"/><category term="privacy-preserving AI"/><category term="promotion"/><category term="prompt engineering"/><category term="reflective generative model"/><category term="research paper"/><category term="sandboxed compute"/><category term="segmentation"/><category term="self-supervised learning"/><category term="semantic code retrieval"/><category term="sparse attention"/><category term="text-to-video"/><category term="tool integration"/><category term="tool-integrated reasoning"/><category term="tweetstats"/><category term="web search agents"/><category term="world models"/><category term="zero-shot"/><category term="1M context window"/><category term="1 million‑token context"/><category term="2 GB RAM"/><category term="256K context"/><category term="256k vocabulary"/><category term="3-D RoPE"/><category term="32B model"/><category term="3D mesh generation"/><category term="40B LLM"/><category term="424B model"/><category term="64K context"/><category term="7B model"/><category term="7B parameters"/><category term="A2A communication"/><category term="AAA games"/><category term="AAO BCSC"/><category term="ADK"/><category term="AGI roadmap"/><category term="AI API"/><category term="AI Accuracy"/><category term="AI Acquisitions"/><category term="AI Agent"/><category term="AI App"/><category term="AI App Development"/><category term="AI Applications"/><category term="AI Architecture"/><category term="AI Coding Tools"/><category term="AI Consistency"/><category term="AI Cost-Benefit Analysis"/><category term="AI Customization"/><category term="AI Democratization"/><category term="AI Design Patterns"/><category term="AI Edge Gallery"/><category term="AI Efficiency"/><category term="AI Engineering"/><category term="AI Features"/><category term="AI Framework"/><category term="AI Gaming"/><category term="AI Governance"/><category term="AI Hallucinations"/><category term="AI Hardware Optimization"/><category term="AI IDE"/><category term="AI Implementation"/><category term="AI Industry Trends"/><category term="AI Leadership"/><category term="AI Meeting Notes"/><category term="AI Memory"/><category term="AI Model"/><category term="AI Partnerships"/><category term="AI Personality"/><category term="AI Pricing"/><category term="AI Project Management"/><category term="AI Protocols"/><category term="AI Reliability"/><category term="AI Research Assistant"/><category term="AI Search"/><category term="AI Search Paradigm"/><category term="AI Software"/><category term="AI Software Agent"/><category term="AI Startups"/><category term="AI Studio"/><category term="AI Sycophancy"/><category term="AI Takeaways"/><category term="AI Trustworthiness"/><category term="AI Updates"/><category term="AI Weather Models"/><category term="AI accelerator"/><category term="AI adoption"/><category term="AI agent stack"/><category term="AI business"/><category term="AI censorship"/><category term="AI chatbot"/><category term="AI chemistry tools"/><category term="AI coding assistants"/><category term="AI coding performance"/><category term="AI comprehension"/><category term="AI creativity"/><category term="AI design tools"/><category term="AI economics"/><category term="AI education"/><category term="AI entrepreneurship"/><category term="AI evaluation"/><category term="AI for Beginners"/><category term="AI hallucination reduction"/><category term="AI hardware"/><category term="AI image generation"/><category term="AI in Development"/><category term="AI in HR"/><category term="AI in Retail"/><category term="AI in Search"/><category term="AI in Software Development"/><category term="AI in finance"/><category term="AI latency"/><category term="AI literature review"/><category term="AI math reasoning"/><category term="AI policy"/><category term="AI programming"/><category term="AI terminal"/><category term="AI voice technology"/><category term="AI workflow automation"/><category term="AI-Enabled Websites"/><category term="AI-Powered Design"/><category term="AI-assisted coding"/><category term="AIME 2025"/><category term="AIME 24"/><category term="AIRA"/><category term="AIRA-dojo"/><category term="AI Studio"/><category term="AI developer tools"/><category term="AI quality assurance"/><category term="AMC"/><category term="API deprecation"/><category term="API release"/><category term="API-Bank Benchmark"/><category term="API-GUI paradigm"/><category term="AR-DF"/><category term="ARAG"/><category term="ARC-AGI"/><category term="ARC‑Hunyuan‑Video‑7B"/><category term="ARMs"/><category term="ARR"/><category term="ASR"/><category term="AWS SageMaker"/><category term="AWS competition"/><category term="AceReason-Nemotron"/><category term="Adobe Photoshop"/><category term="Agent Foundation Model"/><category term="Agent RL"/><category term="Agentic Science"/><category term="Agentic Web"/><category term="Air Quality Prediction"/><category term="Aker"/><category term="Alibaba AI"/><category term="Alibaba DAMO"/><category term="Alibaba Qwen"/><category term="Align Evals"/><category term="Allen Institute for AI"/><category term="All Tools model"/><category term="AlpacaEval"/><category term="AlphaEarth Foundations"/><category term="AlphaEvolve"/><category term="AlphaGenome"/><category term="Amazon Bedrock"/><category term="Amperity"/><category term="Android"/><category term="Anemoi"/><category term="Apache 2.0 License"/><category term="App Development"/><category term="Apple"/><category term="Apple AI research"/><category term="Arcana TTS"/><category term="Archer"/><category term="Asia"/><category term="Audio Overviews"/><category term="Audiobooks"/><category term="Augmented Fine-Tuning"/><category term="Aurora AI"/><category term="AutoCodeBench"/><category term="AutoCodeBench-Complete"/><category term="AutoCodeBench-Lite"/><category term="AutoCodeGen"/><category term="AutoGLM"/><category term="Automatic Speech Recognition"/><category term="Autonomy"/><category term="Azure AI"/><category term="Azure AI Foundry"/><category term="BFCL Benchmark"/><category term="BM25"/><category term="BPTT-free training"/><category term="Baidu"/><category term="Baidu Research"/><category term="Beijing Academy of AI"/><category term="Bing Video Creator"/><category term="Business Integration"/><category term="ByteDance"/><category term="ByteDance Seed"/><category term="CDP"/><category term="CEO of Applications"/><category term="CGM"/><category term="CLI tools"/><category term="COVID-19 forecasting"/><category term="CRMArena"/><category term="CS education"/><category term="CTM"/><category term="Canva"/><category term="Cerebras Systems"/><category term="Chain-of-Agents"/><category term="ChatGLM"/><category term="ChatGPT Integration"/><category term="ChatGPT Search"/><category term="ChatGPT comparison"/><category term="China AI market"/><category term="China tech"/><category term="Chinese‑English bilingual"/><category term="Claude 3.7"/><category term="Claude 4 Opus"/><category term="Claude Sonnet 3.7"/><category term="Claude Sonnet 4"/><category term="Claude-4 Sonnet"/><category term="Cloud-Based Development"/><category term="CoT fragility"/><category term="Code Execution"/><category term="Code Graph Model"/><category term="Code Reasoning"/><category term="CodeAct Agent"/><category term="Codestral"/><category term="Codex"/><category term="Codex-1 Model"/><category term="Coffee Shop"/><category term="Cognitive AI"/><category term="ComputerRL"/><category term="Conceptual Taxonomy"/><category term="Content Integration"/><category term="Context Awareness"/><category term="Contextual AI"/><category term="Continuous Thought Machines"/><category term="Conversational AI 2.0"/><category term="Conversational Interfaces"/><category term="Coral Protocol"/><category term="Cosmos-Reason1"/><category term="Cost Reduction"/><category term="CriticLean"/><category term="CriticLeanBench"/><category term="CriticLeanGPT"/><category term="Crow AI"/><category term="Culture"/><category term="Cursor"/><category term="Customer Data Platform"/><category term="Customer Engagement"/><category term="DAC-VAE"/><category term="DALL·E 3"/><category term="DGLM"/><category term="DM"/><category term="DNA variant prediction"/><category term="DUPO"/><category term="Data Efficiency"/><category term="Data Engineering"/><category term="Dataset Filtering"/><category term="Deep Research"/><category term="DeepMesh"/><category term="DeepResearcher"/><category term="DeepSWE"/><category term="DeepSeek V3"/><category term="Demis Hassabis"/><category term="DiT"/><category term="DiffuCoder"/><category term="Digital Times"/><category term="Direct Preference Optimization"/><category term="Discover Feed"/><category term="Dr. Wang Jian"/><category term="EPFL study"/><category term="ERNIE 4.5"/><category term="EU AI Act"/><category term="EViP++"/><category term="Earth Engine"/><category term="Education and AI"/><category term="EgoPlan‑Bench2"/><category term="ElevenLabs"/><category term="Embodied Reasoning"/><category term="Emory University"/><category term="Emotional Intelligence"/><category term="Enterprise Search"/><category term="Entropulse"/><category term="Environmental Forecasting"/><category term="Ether0"/><category term="Europe AI"/><category term="EvalPlus"/><category term="Everyone’s AI"/><category term="Excel formulas"/><category term="Expander agent"/><category term="F1 benchmarks"/><category term="FP8"/><category term="FP8 Precision"/><category term="Falcon AI"/><category term="FastConformer"/><category term="FineLeanCorpus"/><category term="FireFox"/><category term="Flash Image"/><category term="FlashMask"/><category term="Flowers"/><category term="Foley generation"/><category term="FreeMorph"/><category term="Front-End Development"/><category term="Frontend Code"/><category term="Future of Work"/><category term="GAIR Lab"/><category term="GIFT-Eval"/><category term="GLM-4-9B"/><category term="GLM-4.5"/><category term="GLM-4.5-Air"/><category term="GLM‑4"/><category term="GLM‑4‑9B"/><category term="GLM‑4‑Air"/><category term="GPT alternative"/><category term="GPT-4"/><category term="GPT-4 alternative"/><category term="GPT-4.1 Alternative"/><category term="GPT-4.1 Mini"/><category term="GPT-4.1 nano"/><category term="GPT-4.5"/><category term="GPT-4.5 competitor"/><category term="GPT-4o judge"/><category term="GPT-4o vision"/><category term="GPT-OSS"/><category term="GPT‑4o competitor"/><category term="GRAMMAR dataset"/><category term="GRIT"/><category term="GRPO-GR"/><category term="GUI Automation"/><category term="GUI simulation"/><category term="GUI understanding"/><category term="Game Arena"/><category term="Game-playing AI"/><category term="Games"/><category term="Gemini 1.5 Pro"/><category term="Gemini 2 Pro"/><category term="Gemini 2.0 Flash"/><category term="Gemini 2.5 Flash"/><category term="Gemini Flash"/><category term="Gemini on premises"/><category term="Gemini rival"/><category term="Gemini 2.5 Flash‑Lite"/><category term="Gemma 3 270M"/><category term="Gemma 3n"/><category term="GenAI Processors"/><category term="GenEval benchmark"/><category term="Genie 3"/><category term="Georgia Tech"/><category term="GitHub Actions"/><category term="GitHub Integration"/><category term="Good–Turing"/><category term="Google AI Mode"/><category term="Google AI Ultra"/><category term="Google Brain"/><category term="Google Cloud"/><category term="Google Distributed Cloud"/><category term="Google I/O"/><category term="Google Integration"/><category term="Google Maps"/><category term="Google Search"/><category term="Google Stitch"/><category term="Google Cloud"/><category term="GraphRAG"/><category term="Grok 3 Beta"/><category term="Groq"/><category term="Guardian Agents"/><category term="Gym-Style Framework"/><category term="HIPAA compliance"/><category term="HRM"/><category term="HallusionBench"/><category term="Harmony format"/><category term="Healthcare Applications"/><category term="Huawei Noah’s Ark"/><category term="HumanEval"/><category term="Hunyuan-GameCraft"/><category term="HunyuanVideo"/><category term="HunyuanVideo-Foley"/><category term="IBM"/><category term="IBM report"/><category term="IDK"/><category term="IM"/><category term="IMO 2025"/><category term="INT4"/><category term="IO-aware attention"/><category term="Identity Resolution"/><category term="In-Context Learning"/><category term="InfLLM v2"/><category term="Input-Output Analysis"/><category term="Instacart"/><category term="International Math Olympiad"/><category term="International Mathematical Olympiad"/><category term="Interoperability"/><category term="IoT"/><category term="JSON Schema"/><category term="Jagged Intelligence"/><category term="Jet-Nemotron"/><category term="JetBlock"/><category term="Jetson"/><category term="Jina Search"/><category term="Jony Ive"/><category term="Junjie Yan"/><category term="KAT‑V1"/><category term="KC-MMBench"/><category term="KV cache"/><category term="KV-cache"/><category term="Kaggle"/><category term="Kaggle Challenges"/><category term="Kaggle automation"/><category term="Kaggle benchmark"/><category term="Keye-VL"/><category term="Kiro"/><category term="Kuaishou Technology"/><category term="Kwai‑AutoThink"/><category term="LIMIT dataset"/><category term="LLM Customization"/><category term="LLM Infrastructure"/><category term="LLM Tool Use"/><category term="LLM Training"/><category term="LLM autograder"/><category term="LLM benchmarks"/><category term="LLM code generation"/><category term="LLM efficiency"/><category term="LLM evaluation"/><category term="LLM hallucinations"/><category term="LLM optimization"/><category term="LLM safety"/><category term="LLM-as-a-judge"/><category term="LLM‑as‑a‑judge"/><category term="LLM evaluation"/><category term="LLaDA"/><category term="LLaDA-V"/><category term="LLaMA3-V"/><category term="LM Arena"/><category term="LPU"/><category term="LRMs"/><category term="LangExtract"/><category term="LangSmith"/><category term="Language Self-Play"/><category term="Latte"/><category term="LeRobot"/><category term="Lean 4"/><category term="Lightning AI"/><category term="Linear integration"/><category term="Llama 4 Maverick"/><category term="Llama API"/><category term="Llama Nemotron Nano 4B"/><category term="Llama Nemotron Nano VL"/><category term="Llama-3.1-8B-Instruct"/><category term="Llama-3.2-3B-Instruct"/><category term="Llama3"/><category term="Local Search"/><category term="Loki Listens"/><category term="LongAnimation"/><category term="LongText-Bench"/><category term="Lumos-1"/><category term="M&amp;A"/><category term="MASS"/><category term="MATH-500"/><category term="MCP client"/><category term="MCP orchestration"/><category term="MCP server"/><category term="MIRAGE"/><category term="ML agents"/><category term="ML ops"/><category term="MLE-Bench-Lite"/><category term="MLE-Dojo"/><category term="MLE-STAR"/><category term="MLE-bench"/><category term="MLLM evaluation"/><category term="MLLMs"/><category term="MM-RoPE"/><category term="MMDiT"/><category term="MMEB benchmark"/><category term="MMLU-Pro"/><category term="MMMU"/><category term="MMMU-Pro"/><category term="MORL"/><category term="MTEB"/><category term="MTP distillation"/><category term="Machine Learning Strategy"/><category term="Manus AI"/><category term="Master-Planner-Executor-Writer"/><category term="MedGemma"/><category term="MedXpertQA"/><category term="MediaPipe"/><category term="Medical Billing"/><category term="MegaScience"/><category term="Mem0"/><category term="Mem0g"/><category term="Memento"/><category term="Meta"/><category term="Meta FAIR"/><category term="Meteorology"/><category term="MiMo-7B"/><category term="MiMo-VL-7B"/><category term="Microsoft Azure AI"/><category term="Microsoft Build"/><category term="Microsoft Research"/><category term="Microsoft Research Asia"/><category term="MiniCPM4"/><category term="MiniMax"/><category term="MiniMax Agent"/><category term="Minimal Supervision"/><category term="Mirix"/><category term="Mistral Agents"/><category term="Mistral Code"/><category term="Mistral Medium 3"/><category term="Mistral Small 3.2"/><category term="Mistral‑24B"/><category term="MoCa"/><category term="Mobile App"/><category term="Mobile Apps"/><category term="Model Coordination Protocol"/><category term="Model Generalization"/><category term="Model Integration"/><category term="Modern Tool Use"/><category term="MolmoAct"/><category term="Mono-InternVL-1.5"/><category term="Monte Carlo Tree Search"/><category term="Moonshot AI"/><category term="Morph4Data"/><category term="Mozilla"/><category term="Multi-Agent Workflow"/><category term="Multi-LLM Platforms"/><category term="Multi-Token Prediction"/><category term="Multi-turn"/><category term="Multilingual Embedding"/><category term="Multimodal Models"/><category term="My Tweeple"/><category term="NLP"/><category term="NLWeb"/><category term="NVIDIA GB200"/><category term="NVIDIA GPUs"/><category term="NVIDIA Research"/><category term="Nanjing University"/><category term="Nano Banana"/><category term="Narvik"/><category term="Natural Language Interface"/><category term="NeMo Toolkit"/><category term="Nemotron-Tool-N1"/><category term="Neural Dynamics"/><category term="Neural Networks"/><category term="NeuralOS"/><category term="Niece"/><category term="No Code AI"/><category term="No-Code Tools"/><category term="NotebookLM"/><category term="Notion"/><category term="Nscale"/><category term="OCI Generative AI"/><category term="OCR"/><category term="OCRBench"/><category term="OCRBench v2"/><category term="OSWorld"/><category term="OWL baseline"/><category term="Office Politics"/><category term="Offline Access"/><category term="On-Premises AI"/><category term="Open Reasoning Model"/><category term="Open-Source AI Tools"/><category term="OpenAI Codex"/><category term="OpenAI for Countries"/><category term="OpenAI o1"/><category term="OpenAI research"/><category term="OpenBMB"/><category term="OpenGVLab"/><category term="OpenTelemetry"/><category term="Opus 4"/><category term="Oracle"/><category term="Oracle Fusion Applications"/><category term="Orion model"/><category term="Owl AI"/><category term="PDF Export"/><category term="PDFs"/><category term="PII Tagging"/><category term="PII masking"/><category term="PRM"/><category term="Palmyra X5"/><category term="ParaStudent"/><category term="ParaThinker"/><category term="Parakeet-TDT-0.6B-v2"/><category term="Parallel Task Execution"/><category term="Parallel-R1"/><category term="Pass@1"/><category term="Percy Liang"/><category term="Perplexity"/><category term="Perplexity AI"/><category term="Personalized Marketing"/><category term="Phi family"/><category term="Phi-4-Reasoning-Plus"/><category term="Phi4-mini-Flash"/><category term="Philippines"/><category term="Phoenix AI"/><category term="PhyWorldBench"/><category term="Physical AI"/><category term="Pika"/><category term="Pokémon Blue"/><category term="PostNAS"/><category term="PowerPoint"/><category term="PowerPoint slides"/><category term="ProRL"/><category term="Processor interface"/><category term="Product Search"/><category term="Prolonged Training"/><category term="PyVision"/><category term="Python"/><category term="Python tool generation"/><category term="Python tutorial"/><category term="QWQ-32B"/><category term="Qwen Chat"/><category term="Qwen VLo"/><category term="Qwen-2.5"/><category term="Qwen-2.5 Coder"/><category term="Qwen-3-4B"/><category term="Qwen2-VL"/><category term="Qwen2.5"/><category term="Qwen2.5-14B"/><category term="Qwen2.5-Omni-3B"/><category term="Qwen2.5-VL"/><category term="Qwen3 32B"/><category term="Qwen3-235B"/><category term="Qwen3-Embedding"/><category term="Qwen3-Reranker"/><category term="QwenLong-L1"/><category term="R1-0528"/><category term="R1‑0528"/><category term="R2E-Gym"/><category term="RAG architecture"/><category term="RAG evaluation"/><category term="REPA loss"/><category term="RFT"/><category term="RL-trained coding agent"/><category term="RLHF"/><category term="RNN state tracker"/><category term="RTX GPUs"/><category term="ReAct Agent"/><category term="ReVisual‑R1"/><category term="ReaGAN"/><category term="React frontend"/><category term="Real-Time Information"/><category term="Real-Time Speech Generation"/><category term="RealtimeAgent"/><category term="Reinforcement Fine-Tuning"/><category term="Research Mode"/><category term="Responses API"/><category term="Responsible AI"/><category term="Retrieval-Augmented Generation (RAG)"/><category term="Rime"/><category term="RoGuard 1.0"/><category term="RoGuard-Eval"/><category term="Roblox"/><category term="RoboBrain 2.0"/><category term="RoboVQA"/><category term="Rule-Based Systems"/><category term="Rutgers University"/><category term="SFR-Embedding"/><category term="SIGIR 2025"/><category term="SIMPLE Dataset"/><category term="SMILES generation"/><category term="SMS"/><category term="SPRM"/><category term="STEM AI"/><category term="SWE-bench Verified"/><category term="SWEBench-Verified"/><category term="Sakana AI"/><category term="Sales Conversion"/><category term="Salesforce"/><category term="Sam Altman"/><category term="SambaY"/><category term="Sapient Intelligenc"/><category term="Scalable Memory"/><category term="Scientific Computing"/><category term="Search Engine Alternatives"/><category term="Search Simulation"/><category term="Search-o1"/><category term="Seed1.5-VL"/><category term="Self-Reflection"/><category term="Sentry integration"/><category term="Shopping Cart Software"/><category term="ShortVid‑Bench"/><category term="SimplerEnv"/><category term="Simulation Training"/><category term="Skype"/><category term="Skywork Reward V2"/><category term="Small Language Models"/><category term="SmolVLA"/><category term="Social AI"/><category term="Software 3.0"/><category term="Software Development"/><category term="Software Development Automation"/><category term="Software Engineering Automation"/><category term="Sonnet 4"/><category term="Sora"/><category term="Speech-to-Text"/><category term="Stable Diffusion"/><category term="Stack Exchange"/><category term="Stanford"/><category term="Stargate Norway"/><category term="Step‑SRPO"/><category term="Stitch"/><category term="Street Children"/><category term="Sudoku"/><category term="Supervised Learning"/><category term="Synthetic Data"/><category term="TDT Decoder"/><category term="TMRoPE"/><category term="Team-Based AI"/><category term="Tech Ethics"/><category term="Ted Padova"/><category term="Tencent ARC"/><category term="TensorFlow Lite"/><category term="Text Ranking"/><category term="Text-to-Speech"/><category term="TextbookReasoning"/><category term="ThinkAct"/><category term="Thinker-Talker Architecture"/><category term="To Quote"/><category term="Together AI"/><category term="Token Monster"/><category term="Tool Calling"/><category term="TraDo-4B"/><category term="TraDo-8B"/><category term="TraceRL"/><category term="Tree-of-Thoughts"/><category term="Tsinghua AIR"/><category term="Tunnel Vision"/><category term="Tweetbar"/><category term="Twitter Karma"/><category term="Twittervision"/><category term="TypeScript SDK"/><category term="Typhoon Prediction"/><category term="UCL"/><category term="UQ"/><category term="USMLE"/><category term="Ubuntu dataset"/><category term="Universal Deep Research"/><category term="University of Illinois Urbana-Champaign"/><category term="User Feedback"/><category term="VBench"/><category term="VLA"/><category term="VLA reasoning"/><category term="VLM"/><category term="VLM2Vec successor"/><category term="VLMsAreBlind"/><category term="VQA-RAD"/><category term="VRAM Optimization"/><category term="VeRL"/><category term="Vectara"/><category term="Veo 3"/><category term="Vertex AI"/><category term="ViDoRe-v2"/><category term="ViT-7B"/><category term="Video Understanding"/><category term="Vision-Language-Action model"/><category term="Vision-SR1"/><category term="Visual Grounding"/><category term="Voice AI"/><category term="Voice Interaction"/><category term="Voice Mode"/><category term="Voronoi"/><category term="Voxtral"/><category term="WAIC 2025"/><category term="Walmart Global Tech"/><category term="Wanx"/><category term="Web Protocols"/><category term="Web Search API"/><category term="WebSailor"/><category term="WebShaper"/><category term="WebWalkerQA"/><category term="Whisper alternative"/><category term="Whistleblower AI"/><category term="Wide Research"/><category term="Working from Home"/><category term="Workload Identity Federation"/><category term="Writer"/><category term="X-Omni"/><category term="Xcode"/><category term="Xiaomi AI"/><category term="Yejin Choi"/><category term="ZAPBench"/><category term="Zen Agents"/><category term="Zencoder"/><category term="Zhipu AI"/><category term="Zhipu AI"/><category term="ablation studies"/><category term="abstention scoring"/><category term="adaptive halting"/><category term="agent attention economy"/><category term="agent collaboration"/><category term="agent hooks"/><category term="agent infrastructure"/><category term="agent orchestration"/><category term="agent software"/><category term="agent-free pipelines"/><category term="agent-to-agent protocols"/><category term="agentic RL"/><category term="agentic coding"/><category term="agentic systems"/><category term="agentic vision"/><category term="algorithm design"/><category term="animation colorization"/><category term="anime production"/><category term="annual recurring revenue"/><category term="anti-physics prompts"/><category term="application layer"/><category term="apps"/><category term="arXiv 2505.16901"/><category term="artificial intelligence news"/><category term="artist-quality topology"/><category term="asynchronous pipelines"/><category term="audio Q&amp;A"/><category term="audio fidelity"/><category term="audio‑visual reasoning"/><category term="auto-regressive transformer"/><category term="automated design"/><category term="automated machine learning"/><category term="automated theorem proving"/><category term="automation"/><category term="autonomous discovery"/><category term="autonomous improvement"/><category term="autoregressive image generation"/><category term="autoregressive video generation"/><category term="behavioral calibration"/><category term="benchmark"/><category term="benchmark SOTA"/><category term="benchmark design"/><category term="benchmark reform"/><category term="bidirectional attention"/><category term="binary grading"/><category term="bit quantization"/><category term="block diffusion"/><category term="bring your own model"/><category term="business automation"/><category term="camera vectors"/><category term="case-based reasoning"/><category term="causality vs. structure"/><category term="chain-of-thought monitoring"/><category term="character consistency"/><category term="chatbot integration"/><category term="chemical AI"/><category term="chemistry AI"/><category term="chess hacking"/><category term="claude code&#xa;AI workflow"/><category term="climate monitoring"/><category term="clinical decision support"/><category term="cloud computing"/><category term="cloud partnership"/><category term="code agents"/><category term="code completion"/><category term="code quality"/><category term="code refactoring"/><category term="code style diversity"/><category term="coding"/><category term="coding agents"/><category term="coding benchmarks"/><category term="command-line tool"/><category term="communication"/><category term="community datasets"/><category term="community verification"/><category term="computational biology"/><category term="compute vs creativity"/><category term="computer vision tasks"/><category term="concrete ideas"/><category term="confidential computing"/><category term="content moderation"/><category term="context compression"/><category term="continual learning"/><category term="continual pre-training"/><category term="control tokens"/><category term="cost control"/><category term="cost-accuracy"/><category term="cost‑efficient AI"/><category term="coupled-GRPO"/><category term="creative AI"/><category term="cross-chain verification"/><category term="customer service automation"/><category term="dacort"/><category term="data analysis"/><category term="data center"/><category term="data curation"/><category term="data filtering"/><category term="data leakage"/><category term="data privacy"/><category term="data synthesis"/><category term="data-free training"/><category term="deep research agents"/><category term="delta tuning"/><category term="dense retrieval"/><category term="desktop agents"/><category term="dev workflow"/><category term="development automation"/><category term="diffusion LLM"/><category term="diffusion LLMs"/><category term="diffusion language models"/><category term="diffusion renderer"/><category term="diffusion transformers"/><category term="digital design"/><category term="discrete visual tokens"/><category term="distributed training"/><category term="document analysis"/><category term="document editing"/><category term="document intelligence"/><category term="document understanding"/><category term="documents"/><category term="dual-token constraints"/><category term="dual‑regime dataset"/><category term="dual‑system framework"/><category term="dynamic memory"/><category term="dynamic reasoning"/><category term="dynamic tooling"/><category term="edge computing"/><category term="educational LLMs"/><category term="efficient AI"/><category term="efficient LLMs"/><category term="embedding"/><category term="embedding fields"/><category term="embedding retrieval"/><category term="emergent misalignment"/><category term="empirical software"/><category term="encryption"/><category term="ensemble methods"/><category term="enterprise AI applications"/><category term="enterprise research"/><category term="enterprise software development"/><category term="enterprise tooling"/><category term="entropy-aware training"/><category term="environmental mapping"/><category term="evaluation benchmarks"/><category term="evaluation design"/><category term="evaluators"/><category term="evolutionary algorithms"/><category term="evolutionary search"/><category term="experimental design"/><category term="fastText Classifier"/><category term="feedback loops"/><category term="few-shot learning"/><category term="few‑shot adaptation"/><category term="formalization"/><category term="foundation model"/><category term="free AI tools"/><category term="frontier models"/><category term="frozen LLM"/><category term="full-attention DLMs"/><category term="full-stack AI"/><category term="game video generation"/><category term="gene regulation"/><category term="general-purpose AI"/><category term="generative HCI"/><category term="generative art"/><category term="generative AI"/><category term="generator–validator gap"/><category term="genomics AI"/><category term="geospatial AI"/><category term="gold medal"/><category term="governance"/><category term="gram anchoring"/><category term="graph machine learning"/><category term="graph neural networks"/><category term="graph-aware attention"/><category term="grouped attention"/><category term="guardrails"/><category term="guidance-aware spherical interpolation"/><category term="healthcare AI"/><category term="healthcare benchmarks"/><category term="heat reuse"/><category term="heterogeneous contrastive learning"/><category term="hierarchical reasoning"/><category term="human-computer interaction"/><category term="human-in-the-loop AI"/><category term="human alignment"/><category term="hybrid AI deployment"/><category term="hybrid architecture"/><category term="hybrid architectures"/><category term="hybrid decoder"/><category term="hybrid history conditioning"/><category term="hybrid reasoning"/><category term="hypothesis generation"/><category term="image morphing"/><category term="image understanding"/><category term="inference cost"/><category term="inference speed"/><category term="information extraction"/><category term="information-seeking"/><category term="information-seeking agents"/><category term="innovation cycles"/><category term="instruction-following"/><category term="instruction-following IR"/><category term="intelligent automation"/><category term="intelligent document processing"/><category term="intelligent operations"/><category term="intelligent teammate"/><category term="interactive video"/><category term="io Products"/><category term="issue triage"/><category term="keyboard/mouse control"/><category term="knowledge projections"/><category term="language model"/><category term="language shortcuts"/><category term="language-model scaling"/><category term="large reasoning models"/><category term="large-scale research"/><category term="large language models"/><category term="latency-sensitive"/><category term="latent plan"/><category term="latent reasoning"/><category term="learning trajectory modelling"/><category term="lifelong agents"/><category term="lightning attention"/><category term="lightweight LLM"/><category term="liquid cooling"/><category term="local deployment"/><category term="local-global aggregation"/><category term="long chain-of-thought"/><category term="long context window"/><category term="long-context LLMs"/><category term="long-context processing"/><category term="long-context reasoning"/><category term="long-text rendering"/><category term="majority vote"/><category term="masked diffusion"/><category term="math &amp; logic"/><category term="math AI"/><category term="math contests"/><category term="mathematical formalization"/><category term="mathematical reasoning"/><category term="maze solving"/><category term="medical AI benchmarks"/><category term="medical QA"/><category term="memory hierarchies"/><category term="memory management"/><category term="memory-augmented MDP"/><category term="million-token context"/><category term="minimal data training"/><category term="minority correctness"/><category term="misalignment risk"/><category term="mixture‑of‑experts"/><category term="mobile AI tools"/><category term="mobile deployment"/><category term="model distillation"/><category term="model efficiency"/><category term="model release"/><category term="model update"/><category term="model deployment"/><category term="modular AI frameworks"/><category term="molecular design"/><category term="monolithic MLLM"/><category term="multi-agent collaboration"/><category term="multi-agent distillation"/><category term="multi-agent framework"/><category term="multi-agent pipeline"/><category term="multi-chain reasoning"/><category term="multi-dimensional evaluation"/><category term="multi-model AI"/><category term="multi-stage training"/><category term="multi-step reasoning"/><category term="multilingual ASR"/><category term="multilingual benchmark"/><category term="multilingual model"/><category term="multimodal A"/><category term="multimodal MoE"/><category term="multimodal benchmark"/><category term="multimodal deep learning"/><category term="multimodal diffusion"/><category term="multimodal medical reasoning"/><category term="multimodal model"/><category term="multimodal perception"/><category term="multimodal reasoning"/><category term="namespaces"/><category term="natural language coding"/><category term="natural language programming"/><category term="neural operating system"/><category term="node-as-agent"/><category term="non-parametric memory"/><category term="non‑coding genome"/><category term="o3 model"/><category term="o3-high"/><category term="o3-mini competitor"/><category term="o4-mini"/><category term="offline AI"/><category term="offline installation"/><category term="on-premise AI"/><category term="on‑prem AI"/><category term="open dataset"/><category term="open-source LLMs"/><category term="open-source agent"/><category term="open-source model"/><category term="open-source speech model"/><category term="open‑source LLMs"/><category term="open‑source model"/><category term="operating systems"/><category term="operator design"/><category term="ophthalmology QA"/><category term="oversight research"/><category term="overthinking problem"/><category term="parallel agents"/><category term="parallel content processing"/><category term="partial autonomy apps"/><category term="path branching"/><category term="personalization"/><category term="personalized recommendation"/><category term="physical realism"/><category term="physics adherence"/><category term="planning &amp; memory"/><category term="policy optimization"/><category term="polling"/><category term="positional embeddings"/><category term="post‑training datasets"/><category term="precision editing"/><category term="product design"/><category term="product feedback"/><category term="product-market fit"/><category term="productivity AI"/><category term="productivity software"/><category term="programming"/><category term="progressive generation"/><category term="prompt chaining"/><category term="prompt design"/><category term="prompt filtering"/><category term="prompt optimization"/><category term="prompt engineering"/><category term="pull request review"/><category term="quantization aware training"/><category term="qwen3‑coder"/><category term="rapid prototyping"/><category term="real-time data"/><category term="real-time information retrieval"/><category term="reasoning AI"/><category term="reasoning benchmarks"/><category term="reasoning effort"/><category term="reasoning model"/><category term="reasoning verification"/><category term="recombination"/><category term="recommendation systems"/><category term="recurrent neural networks"/><category term="red teaming"/><category term="regulatory variants"/><category term="reinforced visual planning"/><category term="remote MCP"/><category term="renewable energy"/><category term="repository-level LLM"/><category term="reproducibility"/><category term="research"/><category term="research assistant"/><category term="response filtering"/><category term="reward design"/><category term="reward hacking"/><category term="robot manipulation"/><category term="robot planning"/><category term="robotic agents"/><category term="robotics foundation model"/><category term="rule-based rewards"/><category term="s3"/><category term="safe completion"/><category term="safety &amp; ethics"/><category term="safety &amp; security"/><category term="safety benchmarks"/><category term="safety research"/><category term="sampling strategy"/><category term="sandbox execution"/><category term="sandboxing"/><category term="satellite imagery"/><category term="scRNA-seq batch integration"/><category term="scalable retrieval"/><category term="scalable search"/><category term="scaling laws"/><category term="scientific reasoning"/><category term="scientific reasoning model"/><category term="scientific research automation"/><category term="search agents"/><category term="secure AI"/><category term="secure OAuth"/><category term="see-think-answer"/><category term="self-evolving agents"/><category term="self-play RL"/><category term="self-reward"/><category term="self-rewarding VLM"/><category term="self-supervised process reward model"/><category term="semantic AI"/><category term="semantic vs geometric"/><category term="semi-centralized planning"/><category term="set-theoretic task design"/><category term="short CoT efficiency"/><category term="short-video understanding"/><category term="short‑form video"/><category term="shutdown avoidance"/><category term="sign-rank"/><category term="small LLMs"/><category term="software engineering"/><category term="software engineering AI"/><category term="software evolution"/><category term="software-engineering benchmarks"/><category term="solution aggregation"/><category term="source grounding"/><category term="sovereign compute"/><category term="spatial planning"/><category term="spatial reasoning"/><category term="spec-driven development"/><category term="spreadsheets"/><category term="staged reinforcement learning"/><category term="startup speed"/><category term="state-space models"/><category term="step-oriented variation trend"/><category term="strategy compiler"/><category term="streaming AI"/><category term="structured output"/><category term="structured video comprehension"/><category term="student-like code"/><category term="supervised fine-tuning"/><category term="survey paper"/><category term="synthetic datasets"/><category term="taxonomy"/><category term="template apps"/><category term="temporal alignment"/><category term="temporal grounding"/><category term="test-time compute"/><category term="text-to-image"/><category term="theoretical limits"/><category term="theory"/><category term="thinking models"/><category term="throughput"/><category term="timestamped captioning"/><category term="token efficiency"/><category term="tokenization algorithm"/><category term="tool design"/><category term="tool evaluation"/><category term="tooling ecosystem"/><category term="tools"/><category term="tool‑augmented LLM"/><category term="topology search"/><category term="trajectory traces"/><category term="transcription"/><category term="transparency"/><category term="tree search"/><category term="trillion‑token training"/><category term="tuning-free AI"/><category term="turn-taking AI"/><category term="tutoring systems"/><category term="tweeterboard"/><category term="twitterfeed"/><category term="twitterfone"/><category term="uncertainty reduction"/><category term="unified multimodal LLM"/><category term="unified multimodal model"/><category term="unsolved questions"/><category term="unstructured data"/><category term="validator pipelines"/><category term="value model"/><category term="vector databases"/><category term="venture building"/><category term="video QA"/><category term="video benchmarks"/><category term="video generation benchmark"/><category term="video-to-audio"/><category term="virtual agents"/><category term="virtual satellite"/><category term="vision-language"/><category term="vision-language-action"/><category term="visual hallucinations"/><category term="visual instruction tuning"/><category term="visual pre-training"/><category term="visualization"/><category term="visual‑text understanding"/><category term="voice assistants"/><category term="watson x"/><category term="wearable agents"/><category term="web agent"/><category term="web agents"/><category term="web architecture"/><category term="world-modeling"/><category term="xLAM"/><title type='text'>Wandering Nomad</title><subtitle type='html'>Today is your day, your mountain is waiting, so get on your way. Time to move on, time to forget</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default?redirect=false'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default?start-index=26&amp;max-results=25&amp;redirect=false'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>689</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-6925371224701760360</id><published>2026-04-05T10:39:00.002+08:00</published><updated>2026-04-05T10:39:13.152+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Andrej Karpathy"/><category scheme="http://www.blogger.com/atom/ns#" term="claude code&#xa;AI workflow"/><title type='text'>I&#39;m Stealing This AI Researcher&#39;s Workflow for My Own Projects</title><content type='html'>&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 12pt; margin-top: 12pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;Karpathy doesn&#39;t use a fancy app to manage his research. He uses a folder, Obsidian, and an AI — and I want to copy it.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;He posted about it last week. The short version: he dumps raw material — articles, notes, papers, images — into a folder, then lets a large language model (LLM — the AI brain behind tools like Claude or ChatGPT) build a wiki from it automatically. The LLM writes the summaries, creates the links between ideas, organizes everything into categories. He barely touches the wiki himself. When it gets big enough, he asks it questions and gets answers pulled from his own research.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;I&#39;ve been sitting with this for a few days, thinking about what it would look like for my work.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;---&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;h1 dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 20pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;What My Work Actually Looks Like&lt;/span&gt;&lt;/span&gt;&lt;/h1&gt;&lt;div&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 20pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;I build things. Agents, content apps, Claude Code workflows, automation scripts. A lot of what I do involves figuring something out — what tool does what, how to wire two things together, what prompt pattern produces the right output, what broke last time and why.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;Most of that knowledge lives in my head, or in scattered notes, or in past conversations I can&#39;t find anymore.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;That&#39;s the problem. Every time I start something new, I spend time re-learning things I already know. What flags to use in Claude Code. What agent structure works for what kind of task. What API response format caused everything to break last month.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;Karpathy&#39;s idea is simple: stop keeping that knowledge in your head. Dump it in a folder. Let the AI organize it. Ask it back when you need it.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;—&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;h1 dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 20pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;The Specific Thing I Keep Thinking About&lt;/span&gt;&lt;/span&gt;&lt;/h1&gt;&lt;div&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 20pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;He mentioned that his wiki grows and gets more useful with every question he asks. He asks something, the AI goes through his notes and answers it — and then he saves that answer back into the wiki. So every session adds something. Nothing gets lost.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;That hit me because the opposite is true for how I work right now. Every build session ends and most of the small things I figured out just disappear. The next session starts almost from scratch on some of the same ground.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;If I had a knowledge base for my Claude Code workflows alone — prompts that worked, structures that didn&#39;t, patterns I figured out, error fixes — and an AI that could surface the right piece when I needed it, I&#39;d stop repeating myself.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;font-family: Arial, sans-serif; font-size: 12pt; white-space-collapse: preserve;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;—&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;h1 dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 20pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;The Part That Actually Excited Me&lt;/span&gt;&lt;/span&gt;&lt;/h1&gt;&lt;div&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 20pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;He also runs &quot;health checks&quot; on his wiki. He asks the AI to find gaps, spot inconsistencies, and find connections between ideas he hadn&#39;t noticed yet. The AI suggests new things to look into.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;That&#39;s the part I can&#39;t stop thinking about.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;Not just a system that stores what I know. A system that notices what&#39;s missing. For someone building content automation apps, that means the system isn&#39;t just remembering what tools I&#39;ve used — it&#39;s noticing when two things I built separately could be connected. It&#39;s pointing to the next piece.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;That changes how building feels. Less like starting from zero every time, more like picking up a thread.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;—&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;font-family: Arial, sans-serif; font-size: 20pt; white-space-collapse: preserve;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;What I&#39;m Going to Test&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;I&#39;m starting with one folder. My Claude Code workflows — the scripts, prompts, notes, fixes, things that broke and how I solved them.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;I&#39;ll ask Claude to read through everything and build an index: summaries of each file, links between related ideas, a map of what I already know.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;From there, I&#39;ll ask it questions mid-project. &quot;What pattern did I use last time for a multi-step agent?&quot; &quot;What was the issue I kept hitting with streaming output?&quot; Instead of digging through old files or trying to remember, I just ask.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;&quot;&gt;&lt;span style=&quot;background-color: transparent; font-family: Arial, sans-serif; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;&quot;&gt;&lt;span style=&quot;color: white;&quot;&gt;I&#39;m not building the full Karpathy setup yet. I&#39;m testing whether the core idea holds: does having a searchable, AI-organized version of my own work actually save time and reduce the re-learning?&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/6925371224701760360/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/6925371224701760360?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/6925371224701760360'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/6925371224701760360'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2026/04/im-stealing-this-ai-researchers.html' title='I&#39;m Stealing This AI Researcher&#39;s Workflow for My Own Projects'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-3075108691801965462</id><published>2026-03-14T17:12:00.004+08:00</published><updated>2026-03-14T17:12:48.607+08:00</updated><title type='text'>I Told an AI to Build a Full Newsletter. One Sentence. No Code.</title><content type='html'>&lt;blockquote&gt;&lt;p&gt;I used to spend half a day on client newsletters. Last week I did it in 15 minutes — and I&#39;ll show you exactly what I typed.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr /&gt;
&lt;h2&gt;🤖 What Claude Code Actually Is&lt;/h2&gt;
&lt;p&gt;Claude Code is an extension you install inside VS Code — a free code editor. Once it&#39;s set up, you have an AI agent sitting inside your project that can read files, write code, call APIs, fix its own errors, and run automations — all through a chat window.&lt;/p&gt;
&lt;p&gt;You talk to it. It builds things. That&#39;s the whole loop. You don&#39;t need to know how to code. You need to know what you want.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;🧪 What I Actually Tested&lt;/h2&gt;
&lt;p&gt;I set up a project, gave the agent a brand logo and a color guide, connected a few API keys (one for research, one for Gmail), and typed one prompt. That was it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What came back:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Research pulled from live sources&lt;/li&gt;
&lt;li&gt;Three branded infographics generated with AI images&lt;/li&gt;
&lt;li&gt;Full HTML newsletter formatted to the brand&#39;s colors and fonts&lt;/li&gt;
&lt;li&gt;Email sent directly to a list via Gmail&lt;/li&gt;
&lt;li&gt;Sources linked at the bottom, clickable&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It hit errors along the way. An API endpoint had changed. A Gmail formatting issue broke the layout. Both times, it found the problem, explained what happened, fixed the tool, and kept running. I didn&#39;t touch anything.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;🏗️ How the Whole Thing Is Organized&lt;/h2&gt;
&lt;p&gt;The system behind this is called&lt;/p&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;What It Is&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pipeline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The instructions, written in plain English — like a step-by-step recipe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tools&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The specific actions the agent can take (research, generate image, send email)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Orchestrator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Code itself, reading the pipeline and running each tool in order&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;You also give it a file called &lt;code&gt;CLAUDE.md&lt;/code&gt; — a standing brief that tells the agent what the project is, where things live, and what it&#39;s supposed to do. Every time you open the project, it reads that first.&lt;/p&gt;
&lt;p&gt;Once you&#39;ve built and tested a pipeline, you can schedule it to run on its own — every Monday morning, every time a form is submitted, whatever you need. At that point the agent isn&#39;t running live. The code it built is. It runs predictably, like any other automation.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;💼 What This Means for VAs and Small Business Owners&lt;/h2&gt;
&lt;p&gt;You don&#39;t need to learn to code. You need to be able to describe a process clearly.&lt;/p&gt;
&lt;p&gt;If you manage newsletters, reports, client updates, or social content for clients — build a pipeline, test it once, let it run. If you spend hours every week on the same repetitive tasks in your own business, same idea.&lt;/p&gt;
&lt;p&gt;What makes this different from something like Zapier is what happens during the build. The agent adapts. It hits an error, investigates, fixes the tool, and updates the instructions so the same thing doesn&#39;t break again. The final automation is more reliable — and you got there faster than mapping every node by hand.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;💰 If You&#39;re Thinking About Offering This as a Service&lt;/h2&gt;
&lt;p&gt;Businesses aren&#39;t paying for a pipeline or a blueprint. They&#39;re paying for a solved problem.&lt;/p&gt;
&lt;p&gt;Don&#39;t open with &lt;em&gt;&quot;do you want AI automation?&quot;&lt;/em&gt; Ask: where are you losing the most time or money right now? Find that, build the thing that fixes it, and price it on what it&#39;s worth — hours saved, errors cut, costs gone — not on how long you spent building it.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A pipeline that saves a client 20 hours a week isn&#39;t a half-day job. Over a year, that&#39;s tens of thousands of dollars in time they&#39;re not paying someone else to cover. Price it that way.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr /&gt;
&lt;h2&gt;🚀 Where to Start&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Code — runs inside VS Code (free to download)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Paid Claude subscription — $17/month on the Pro plan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~1 hour, most of it connecting API keys&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Pick one task you do every single week. Something repetitive, something predictable. Let the agent build the first version, watch it run, fix what doesn&#39;t land, and build from there.&lt;/p&gt;
&lt;p&gt;The hardest part isn&#39;t the tool. It&#39;s deciding what to automate first.&amp;nbsp;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/3075108691801965462/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/3075108691801965462?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/3075108691801965462'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/3075108691801965462'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2026/03/i-told-ai-to-build-full-newsletter-one.html' title='I Told an AI to Build a Full Newsletter. One Sentence. No Code.'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-3401045395438576451</id><published>2026-03-14T16:49:00.006+08:00</published><updated>2026-03-14T16:49:41.967+08:00</updated><title type='text'>This Paper Says We&#39;ve Been Fine-Tuning the Hard Way</title><content type='html'>&lt;p&gt;I was reading a paper that dropped in March 2026 and about three paragraphs in, I had to stop. The claim seemed too simple: you can adapt a large AI model to a specific task without gradient descent at all. Just random sampling and a vote.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What Fine-Tuning Actually Is&lt;/h2&gt;
&lt;p&gt;When an AI model gets trained, it learns general patterns from a massive dataset. Fine-tuning is the step where you take that general model and push it toward a specific task — coding, reasoning, following instructions — using reinforcement learning or optimization algorithms.&lt;/p&gt;
&lt;p&gt;Methods like PPO and GRPO (types of reinforcement learning commonly used to fine-tune large language models) work well. But they&#39;re expensive, require careful setup, and involve a lot of iteration.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Paper&#39;s Core Claim&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Title:&lt;/strong&gt; Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights
&lt;strong&gt;Authors:&lt;/strong&gt; Yulu Gan and Phillip Isola
&lt;strong&gt;Published:&lt;/strong&gt; March 12, 2026 · arXiv: 2603.12228&lt;/p&gt;
&lt;p&gt;The idea: in large, well-pretrained models, the weight space around the original parameters is already densely packed with useful task-specific solutions. The authors call these clusters &lt;em&gt;&quot;neural thickets.&quot;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In smaller models, good solutions are scattered — you need gradient-based search to find them. In large models, they&#39;re close to where you already are.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Method — Remarkably Simple&lt;/h2&gt;
&lt;p&gt;They call it &lt;strong&gt;RandOpt&lt;/strong&gt;. Here&#39;s how it works:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Take the pretrained model weights&lt;/li&gt;
&lt;li&gt;Randomly sample N small perturbations of those weights&lt;/li&gt;
&lt;li&gt;Evaluate each perturbation on your task&lt;/li&gt;
&lt;li&gt;Keep the top K performers&lt;/li&gt;
&lt;li&gt;Combine them with a majority vote&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;No gradients. No reward model. No RL training loop. Perturb, evaluate, vote.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;💡 Code at &lt;code&gt;github.com/sunrainyg/RandOpt&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr /&gt;
&lt;h2&gt;What the Results Showed&lt;/h2&gt;
&lt;p&gt;RandOpt kept up with PPO, GRPO, and evolutionary strategies on the tasks they tested — and this held on large-scale contemporary models.&lt;/p&gt;
&lt;p&gt;That stopped me. These optimization methods have entire research communities, years of papers, and significant infrastructure built around them. The idea that randomly sampling neighbors of the original weights and taking a vote can match them says something about what&#39;s already sitting inside a well-trained model — waiting, not absent.&lt;/p&gt;
&lt;p&gt;The way they frame it: small models have sparse expert solutions, so you need search to find them. Large models have dense ones. The pretrained weights aren&#39;t just a starting point. They&#39;re already rich.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What I Took From This&lt;/h2&gt;
&lt;p&gt;This paper shifted something in how I think about pretraining. We usually assume the pretrained model is a rough starting point and fine-tuning is where the real work happens. This flips that. If the solution space is already dense around the initial weights, the pretrained model is doing more work than we give it credit for.&lt;/p&gt;
&lt;p&gt;There&#39;s also a question I don&#39;t have a full answer to yet: if random sampling with ensemble voting matches expensive RL fine-tuning, what does that mean for how we should be spending compute? I&#39;m still working through the paper and the code, but it&#39;s the kind of result that sits with you.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Full paper:&lt;/strong&gt; arXiv 2603.12228&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/3401045395438576451/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/3401045395438576451?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/3401045395438576451'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/3401045395438576451'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2026/03/this-paper-says-weve-been-fine-tuning.html' title='This Paper Says We&#39;ve Been Fine-Tuning the Hard Way'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-3243187776728086948</id><published>2026-03-12T10:18:00.002+08:00</published><updated>2026-03-12T10:18:10.490+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="embedding"/><category scheme="http://www.blogger.com/atom/ns#" term="Gemini"/><category scheme="http://www.blogger.com/atom/ns#" term="multimodal embeddings"/><title type='text'>Google Just Replaced Five AI Search Tools With One</title><content type='html'>&lt;p data-end=&quot;507&quot; data-start=&quot;347&quot;&gt;Have you ever tried searching through a client’s content library where &lt;strong data-end=&quot;506&quot; data-start=&quot;418&quot;&gt;videos are in one folder, PDFs in another, and audio recordings scattered everywhere&lt;/strong&gt;?&lt;/p&gt;
&lt;p data-end=&quot;555&quot; data-start=&quot;509&quot;&gt;That’s the reality for most content libraries.&lt;/p&gt;
&lt;p data-end=&quot;691&quot; data-start=&quot;557&quot;&gt;Until now, AI search tools struggled with this kind of setup because each type of content needed a &lt;strong data-end=&quot;690&quot; data-start=&quot;656&quot;&gt;different system to process it&lt;/strong&gt;.&lt;/p&gt;
&lt;p data-end=&quot;718&quot; data-start=&quot;693&quot;&gt;But that may be changing.&lt;/p&gt;
&lt;p data-end=&quot;934&quot; data-start=&quot;720&quot;&gt;&lt;span class=&quot;hover:entity-accent entity-underline inline cursor-pointer align-baseline&quot;&gt;Gemini Embedding 2&lt;/span&gt; — recently released by &lt;span class=&quot;hover:entity-accent entity-underline inline cursor-pointer align-baseline&quot;&gt;Google&lt;/span&gt; — can search across &lt;strong data-end=&quot;896&quot; data-start=&quot;839&quot;&gt;text, images, audio, video, and PDFs at the same time&lt;/strong&gt;, without converting everything first.&lt;/p&gt;
&lt;p data-end=&quot;1067&quot; data-start=&quot;936&quot;&gt;For anyone managing &lt;strong data-end=&quot;1037&quot; data-start=&quot;956&quot;&gt;knowledge bases, course content, research archives, or client media libraries&lt;/strong&gt;, this could be a major shift.&lt;/p&gt;
&lt;hr data-end=&quot;1072&quot; data-start=&quot;1069&quot; /&gt;
&lt;h2 data-end=&quot;1114&quot; data-section-id=&quot;1la6deu&quot; data-start=&quot;1074&quot;&gt;What an Embedding Model Actually Does&lt;/h2&gt;
&lt;p data-end=&quot;1206&quot; data-start=&quot;1116&quot;&gt;Before explaining why this matters, it helps to understand &lt;strong data-end=&quot;1205&quot; data-start=&quot;1175&quot;&gt;what an embedding model is&lt;/strong&gt;.&lt;/p&gt;
&lt;p data-end=&quot;1386&quot; data-start=&quot;1208&quot;&gt;When AI systems search through content, they don’t read information the same way humans do. Instead, they convert content into &lt;strong data-end=&quot;1364&quot; data-start=&quot;1335&quot;&gt;numerical representations&lt;/strong&gt; that capture meaning.&lt;/p&gt;
&lt;p data-end=&quot;1400&quot; data-start=&quot;1388&quot;&gt;For example:&lt;/p&gt;
&lt;ul data-end=&quot;1453&quot; data-start=&quot;1402&quot;&gt;
&lt;li data-end=&quot;1430&quot; data-section-id=&quot;qvs5g2&quot; data-start=&quot;1402&quot;&gt;
&lt;p data-end=&quot;1430&quot; data-start=&quot;1404&quot;&gt;A sentence about a &lt;strong data-end=&quot;1430&quot; data-start=&quot;1423&quot;&gt;cat&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;1453&quot; data-section-id=&quot;817scb&quot; data-start=&quot;1431&quot;&gt;
&lt;p data-end=&quot;1453&quot; data-start=&quot;1433&quot;&gt;A photo of a &lt;strong data-end=&quot;1453&quot; data-start=&quot;1446&quot;&gt;cat&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-end=&quot;1552&quot; data-start=&quot;1455&quot;&gt;Both produce &lt;strong data-end=&quot;1495&quot; data-start=&quot;1468&quot;&gt;similar number patterns&lt;/strong&gt;, which allows the AI to recognize that they are related.&lt;/p&gt;
&lt;p data-end=&quot;1600&quot; data-start=&quot;1554&quot;&gt;That’s how modern &lt;strong data-end=&quot;1593&quot; data-start=&quot;1572&quot;&gt;AI-powered search&lt;/strong&gt; works.&lt;/p&gt;
&lt;p data-end=&quot;1716&quot; data-start=&quot;1602&quot;&gt;The tool responsible for converting content into these numerical representations is called an &lt;strong data-end=&quot;1715&quot; data-start=&quot;1696&quot;&gt;embedding model&lt;/strong&gt;.&lt;/p&gt;
&lt;hr data-end=&quot;1721&quot; data-start=&quot;1718&quot; /&gt;
&lt;h2 data-end=&quot;1766&quot; data-section-id=&quot;1a5dxqt&quot; data-start=&quot;1723&quot;&gt;The Problem With Older AI Search Systems&lt;/h2&gt;
&lt;p data-end=&quot;1848&quot; data-start=&quot;1768&quot;&gt;Until recently, every type of content required a &lt;strong data-end=&quot;1847&quot; data-start=&quot;1817&quot;&gt;different embedding system&lt;/strong&gt;.&lt;/p&gt;
&lt;p data-end=&quot;1882&quot; data-start=&quot;1850&quot;&gt;Typical setups looked like this:&lt;/p&gt;
&lt;ul data-end=&quot;2152&quot; data-start=&quot;1884&quot;&gt;
&lt;li data-end=&quot;1934&quot; data-section-id=&quot;9rklca&quot; data-start=&quot;1884&quot;&gt;
&lt;p data-end=&quot;1934&quot; data-start=&quot;1886&quot;&gt;&lt;strong data-end=&quot;1894&quot; data-start=&quot;1886&quot;&gt;Text&lt;/strong&gt; → processed by a text embedding model&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2000&quot; data-section-id=&quot;ev1nnc&quot; data-start=&quot;1935&quot;&gt;
&lt;p data-end=&quot;2000&quot; data-start=&quot;1937&quot;&gt;&lt;strong data-end=&quot;1947&quot; data-start=&quot;1937&quot;&gt;Images&lt;/strong&gt; → processed by image models such as CLIP or SigLIP&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2061&quot; data-section-id=&quot;p5ip8u&quot; data-start=&quot;2001&quot;&gt;
&lt;p data-end=&quot;2061&quot; data-start=&quot;2003&quot;&gt;&lt;strong data-end=&quot;2012&quot; data-start=&quot;2003&quot;&gt;Audio&lt;/strong&gt; → first transcribed using systems like Whisper&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2111&quot; data-section-id=&quot;1buok6m&quot; data-start=&quot;2062&quot;&gt;
&lt;p data-end=&quot;2111&quot; data-start=&quot;2064&quot;&gt;&lt;strong data-end=&quot;2073&quot; data-start=&quot;2064&quot;&gt;Video&lt;/strong&gt; → broken into frames or transcripts&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2152&quot; data-section-id=&quot;6y5rg6&quot; data-start=&quot;2112&quot;&gt;
&lt;p data-end=&quot;2152&quot; data-start=&quot;2114&quot;&gt;&lt;strong data-end=&quot;2122&quot; data-start=&quot;2114&quot;&gt;PDFs&lt;/strong&gt; → converted into plain text&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-end=&quot;2182&quot; data-start=&quot;2154&quot;&gt;This created several issues:&lt;/p&gt;
&lt;ul data-end=&quot;2309&quot; data-start=&quot;2184&quot;&gt;
&lt;li data-end=&quot;2213&quot; data-section-id=&quot;1bt0t4u&quot; data-start=&quot;2184&quot;&gt;
&lt;p data-end=&quot;2213&quot; data-start=&quot;2186&quot;&gt;Multiple models to manage&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2242&quot; data-section-id=&quot;11wjr57&quot; data-start=&quot;2214&quot;&gt;
&lt;p data-end=&quot;2242&quot; data-start=&quot;2216&quot;&gt;Several conversion steps&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2279&quot; data-section-id=&quot;ttkdq4&quot; data-start=&quot;2243&quot;&gt;
&lt;p data-end=&quot;2279&quot; data-start=&quot;2245&quot;&gt;More chances for things to break&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2309&quot; data-section-id=&quot;11ihzgu&quot; data-start=&quot;2280&quot;&gt;
&lt;p data-end=&quot;2309&quot; data-start=&quot;2282&quot;&gt;Slower search performance&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-end=&quot;2404&quot; data-start=&quot;2311&quot;&gt;In many cases, &lt;strong data-end=&quot;2354&quot; data-start=&quot;2326&quot;&gt;five different pipelines&lt;/strong&gt; were required just to search one content library.&lt;/p&gt;
&lt;hr data-end=&quot;2409&quot; data-start=&quot;2406&quot; /&gt;
&lt;h2 data-end=&quot;2445&quot; data-section-id=&quot;1s64m9y&quot; data-start=&quot;2411&quot;&gt;What Gemini Embedding 2 Changes&lt;/h2&gt;
&lt;p data-end=&quot;2545&quot; data-start=&quot;2447&quot;&gt;Gemini Embedding 2 solves this by creating &lt;strong data-end=&quot;2544&quot; data-start=&quot;2490&quot;&gt;one shared search space for multiple content types&lt;/strong&gt;.&lt;/p&gt;
&lt;p data-end=&quot;2702&quot; data-start=&quot;2547&quot;&gt;Instead of converting everything separately, the model processes different media formats directly and places them into the &lt;strong data-end=&quot;2701&quot; data-start=&quot;2670&quot;&gt;same semantic search system&lt;/strong&gt;.&lt;/p&gt;
&lt;p data-end=&quot;2754&quot; data-start=&quot;2704&quot;&gt;That means a single query can return results from:&lt;/p&gt;
&lt;ul data-end=&quot;2821&quot; data-start=&quot;2756&quot;&gt;
&lt;li data-end=&quot;2769&quot; data-section-id=&quot;dvcxg4&quot; data-start=&quot;2756&quot;&gt;
&lt;p data-end=&quot;2769&quot; data-start=&quot;2758&quot;&gt;Documents&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2780&quot; data-section-id=&quot;gmtinw&quot; data-start=&quot;2770&quot;&gt;
&lt;p data-end=&quot;2780&quot; data-start=&quot;2772&quot;&gt;Images&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2796&quot; data-section-id=&quot;1df8gu3&quot; data-start=&quot;2781&quot;&gt;
&lt;p data-end=&quot;2796&quot; data-start=&quot;2783&quot;&gt;Audio clips&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2812&quot; data-section-id=&quot;1am26ng&quot; data-start=&quot;2797&quot;&gt;
&lt;p data-end=&quot;2812&quot; data-start=&quot;2799&quot;&gt;Video files&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2821&quot; data-section-id=&quot;3w42x5&quot; data-start=&quot;2813&quot;&gt;
&lt;p data-end=&quot;2821&quot; data-start=&quot;2815&quot;&gt;PDFs&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-end=&quot;2835&quot; data-start=&quot;2823&quot;&gt;All at once.&lt;/p&gt;
&lt;p data-end=&quot;2860&quot; data-start=&quot;2837&quot;&gt;For example, you could:&lt;/p&gt;
&lt;ul data-end=&quot;3027&quot; data-start=&quot;2862&quot;&gt;
&lt;li data-end=&quot;2910&quot; data-section-id=&quot;gizjwl&quot; data-start=&quot;2862&quot;&gt;
&lt;p data-end=&quot;2910&quot; data-start=&quot;2864&quot;&gt;Upload a &lt;strong data-end=&quot;2882&quot; data-start=&quot;2873&quot;&gt;photo&lt;/strong&gt; and find related &lt;strong data-end=&quot;2910&quot; data-start=&quot;2900&quot;&gt;videos&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2973&quot; data-section-id=&quot;1iiw06f&quot; data-start=&quot;2911&quot;&gt;
&lt;p data-end=&quot;2973&quot; data-start=&quot;2913&quot;&gt;Submit a &lt;strong data-end=&quot;2941&quot; data-start=&quot;2922&quot;&gt;voice recording&lt;/strong&gt; and find matching &lt;strong data-end=&quot;2973&quot; data-start=&quot;2960&quot;&gt;documents&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;3027&quot; data-section-id=&quot;85km9e&quot; data-start=&quot;2974&quot;&gt;
&lt;p data-end=&quot;3027&quot; data-start=&quot;2976&quot;&gt;Search inside &lt;strong data-end=&quot;3027&quot; data-start=&quot;2990&quot;&gt;PDF files without converting them&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr data-end=&quot;3032&quot; data-start=&quot;3029&quot; /&gt;
&lt;h2 data-end=&quot;3058&quot; data-section-id=&quot;8n53w6&quot; data-start=&quot;3034&quot;&gt;Supported Input Types&lt;/h2&gt;
&lt;p data-end=&quot;3133&quot; data-start=&quot;3060&quot;&gt;Gemini Embedding 2 currently supports multiple media types in one system:&lt;/p&gt;
&lt;p data-end=&quot;3175&quot; data-start=&quot;3135&quot;&gt;&lt;strong data-end=&quot;3143&quot; data-start=&quot;3135&quot;&gt;Text&lt;/strong&gt;&lt;br data-end=&quot;3146&quot; data-start=&quot;3143&quot; /&gt;
Up to roughly &lt;strong data-end=&quot;3175&quot; data-start=&quot;3160&quot;&gt;8,000 words&lt;/strong&gt;&lt;/p&gt;
&lt;p data-end=&quot;3225&quot; data-start=&quot;3177&quot;&gt;&lt;strong data-end=&quot;3187&quot; data-start=&quot;3177&quot;&gt;Images&lt;/strong&gt;&lt;br data-end=&quot;3190&quot; data-start=&quot;3187&quot; /&gt;
Up to &lt;strong data-end=&quot;3225&quot; data-start=&quot;3196&quot;&gt;six images in one request&lt;/strong&gt;&lt;/p&gt;
&lt;p data-end=&quot;3286&quot; data-start=&quot;3227&quot;&gt;&lt;strong data-end=&quot;3236&quot; data-start=&quot;3227&quot;&gt;Audio&lt;/strong&gt;&lt;br data-end=&quot;3239&quot; data-start=&quot;3236&quot; /&gt;
Raw audio files — &lt;strong data-end=&quot;3286&quot; data-start=&quot;3257&quot;&gt;no transcription required&lt;/strong&gt;&lt;/p&gt;
&lt;p data-end=&quot;3332&quot; data-start=&quot;3288&quot;&gt;&lt;strong data-end=&quot;3297&quot; data-start=&quot;3288&quot;&gt;Video&lt;/strong&gt;&lt;br data-end=&quot;3300&quot; data-start=&quot;3297&quot; /&gt;
Clips up to &lt;strong data-end=&quot;3332&quot; data-start=&quot;3312&quot;&gt;two minutes long&lt;/strong&gt;&lt;/p&gt;
&lt;p data-end=&quot;3413&quot; data-start=&quot;3334&quot;&gt;&lt;strong data-end=&quot;3342&quot; data-start=&quot;3334&quot;&gt;PDFs&lt;/strong&gt;&lt;br data-end=&quot;3345&quot; data-start=&quot;3342&quot; /&gt;
Original files can be processed &lt;strong data-end=&quot;3413&quot; data-start=&quot;3377&quot;&gt;without converting to plain text&lt;/strong&gt;&lt;/p&gt;
&lt;p data-end=&quot;3492&quot; data-start=&quot;3415&quot;&gt;All of this works through &lt;strong data-end=&quot;3491&quot; data-start=&quot;3441&quot;&gt;one model instead of multiple specialized ones&lt;/strong&gt;.&lt;/p&gt;
&lt;hr data-end=&quot;3497&quot; data-start=&quot;3494&quot; /&gt;
&lt;h2 data-end=&quot;3541&quot; data-section-id=&quot;13tt989&quot; data-start=&quot;3499&quot;&gt;Combining Multiple Inputs in One Search&lt;/h2&gt;
&lt;p data-end=&quot;3642&quot; data-start=&quot;3543&quot;&gt;One interesting feature is the ability to &lt;strong data-end=&quot;3641&quot; data-start=&quot;3585&quot;&gt;combine different types of input into a single query&lt;/strong&gt;.&lt;/p&gt;
&lt;p data-end=&quot;3672&quot; data-start=&quot;3644&quot;&gt;For example, you might have:&lt;/p&gt;
&lt;ul data-end=&quot;3742&quot; data-start=&quot;3674&quot;&gt;
&lt;li data-end=&quot;3700&quot; data-section-id=&quot;njnk2u&quot; data-start=&quot;3674&quot;&gt;
&lt;p data-end=&quot;3700&quot; data-start=&quot;3676&quot;&gt;A &lt;strong data-end=&quot;3700&quot; data-start=&quot;3678&quot;&gt;photo of a product&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;3742&quot; data-section-id=&quot;ll6ja6&quot; data-start=&quot;3701&quot;&gt;
&lt;p data-end=&quot;3742&quot; data-start=&quot;3703&quot;&gt;A &lt;strong data-end=&quot;3725&quot; data-start=&quot;3705&quot;&gt;text description&lt;/strong&gt; of what you want&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-end=&quot;3868&quot; data-start=&quot;3744&quot;&gt;Both can be submitted together, and the system generates &lt;strong data-end=&quot;3827&quot; data-start=&quot;3801&quot;&gt;one combined embedding&lt;/strong&gt; representing the meaning of both inputs.&lt;/p&gt;
&lt;p data-end=&quot;3951&quot; data-start=&quot;3870&quot;&gt;This allows searches that were previously impossible using single-modality tools.&lt;/p&gt;
&lt;hr data-end=&quot;3956&quot; data-start=&quot;3953&quot; /&gt;
&lt;h2 data-end=&quot;3992&quot; data-section-id=&quot;uyk51n&quot; data-start=&quot;3958&quot;&gt;Easy Integration for Developers&lt;/h2&gt;
&lt;p data-end=&quot;4065&quot; data-start=&quot;3994&quot;&gt;Another surprising detail is how quickly developers can start using it.&lt;/p&gt;
&lt;p data-end=&quot;4157&quot; data-start=&quot;4067&quot;&gt;Gemini Embedding 2 launched with support for popular AI development frameworks, including:&lt;/p&gt;
&lt;ul data-end=&quot;4211&quot; data-start=&quot;4159&quot;&gt;
&lt;li data-end=&quot;4172&quot; data-section-id=&quot;1gind9d&quot; data-start=&quot;4159&quot;&gt;
&lt;p data-end=&quot;4172&quot; data-start=&quot;4161&quot;&gt;LangChain&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;4187&quot; data-section-id=&quot;vhi1az&quot; data-start=&quot;4173&quot;&gt;
&lt;p data-end=&quot;4187&quot; data-start=&quot;4175&quot;&gt;LlamaIndex&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;4200&quot; data-section-id=&quot;1mb03mc&quot; data-start=&quot;4188&quot;&gt;
&lt;p data-end=&quot;4200&quot; data-start=&quot;4190&quot;&gt;ChromaDB&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;4211&quot; data-section-id=&quot;ycj5jo&quot; data-start=&quot;4201&quot;&gt;
&lt;p data-end=&quot;4211&quot; data-start=&quot;4203&quot;&gt;QDrant&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-end=&quot;4371&quot; data-start=&quot;4213&quot;&gt;Because many AI applications are already built on these frameworks, developers can integrate the model &lt;strong data-end=&quot;4370&quot; data-start=&quot;4316&quot;&gt;without building a new infrastructure from scratch&lt;/strong&gt;.&lt;/p&gt;
&lt;p data-end=&quot;4396&quot; data-start=&quot;4373&quot;&gt;It’s available through:&lt;/p&gt;
&lt;ul data-end=&quot;4494&quot; data-start=&quot;4398&quot;&gt;
&lt;li data-end=&quot;4454&quot; data-section-id=&quot;q2yskh&quot; data-start=&quot;4398&quot;&gt;
&lt;p data-end=&quot;4454&quot; data-start=&quot;4400&quot;&gt;&lt;strong data-end=&quot;4420&quot; data-start=&quot;4400&quot;&gt;Google AI Studio&lt;/strong&gt; (free tier for experimentation)&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;4494&quot; data-section-id=&quot;17w6ds5&quot; data-start=&quot;4455&quot;&gt;
&lt;p data-end=&quot;4494&quot; data-start=&quot;4457&quot;&gt;&lt;strong data-end=&quot;4470&quot; data-start=&quot;4457&quot;&gt;Vertex AI&lt;/strong&gt; (enterprise deployment)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr data-end=&quot;4499&quot; data-start=&quot;4496&quot; /&gt;
&lt;h2 data-end=&quot;4564&quot; data-section-id=&quot;1ds2ng2&quot; data-start=&quot;4501&quot;&gt;Why This Matters for Virtual Assistants and Content Managers&lt;/h2&gt;
&lt;p data-end=&quot;4619&quot; data-start=&quot;4566&quot;&gt;Think about the kinds of content many clients manage.&lt;/p&gt;
&lt;p data-end=&quot;4652&quot; data-start=&quot;4621&quot;&gt;A &lt;strong data-end=&quot;4640&quot; data-start=&quot;4623&quot;&gt;podcast brand&lt;/strong&gt; might have:&lt;/p&gt;
&lt;ul data-end=&quot;4719&quot; data-start=&quot;4654&quot;&gt;
&lt;li data-end=&quot;4672&quot; data-section-id=&quot;1szl9os&quot; data-start=&quot;4654&quot;&gt;
&lt;p data-end=&quot;4672&quot; data-start=&quot;4656&quot;&gt;Audio episodes&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;4687&quot; data-section-id=&quot;1xtvpu0&quot; data-start=&quot;4673&quot;&gt;
&lt;p data-end=&quot;4687&quot; data-start=&quot;4675&quot;&gt;Show notes&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;4696&quot; data-section-id=&quot;3w42x5&quot; data-start=&quot;4688&quot;&gt;
&lt;p data-end=&quot;4696&quot; data-start=&quot;4690&quot;&gt;PDFs&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;4719&quot; data-section-id=&quot;1eokm4i&quot; data-start=&quot;4697&quot;&gt;
&lt;p data-end=&quot;4719&quot; data-start=&quot;4699&quot;&gt;Promotional images&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-end=&quot;4751&quot; data-start=&quot;4721&quot;&gt;A &lt;strong data-end=&quot;4741&quot; data-start=&quot;4723&quot;&gt;course creator&lt;/strong&gt; may have:&lt;/p&gt;
&lt;ul data-end=&quot;4808&quot; data-start=&quot;4753&quot;&gt;
&lt;li data-end=&quot;4770&quot; data-section-id=&quot;2pyp5u&quot; data-start=&quot;4753&quot;&gt;
&lt;p data-end=&quot;4770&quot; data-start=&quot;4755&quot;&gt;Video lessons&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;4786&quot; data-section-id=&quot;29jf6t&quot; data-start=&quot;4771&quot;&gt;
&lt;p data-end=&quot;4786&quot; data-start=&quot;4773&quot;&gt;Slide decks&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;4808&quot; data-section-id=&quot;1mi8vvp&quot; data-start=&quot;4787&quot;&gt;
&lt;p data-end=&quot;4808&quot; data-start=&quot;4789&quot;&gt;Written summaries&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-end=&quot;4842&quot; data-start=&quot;4810&quot;&gt;A &lt;strong data-end=&quot;4826&quot; data-start=&quot;4812&quot;&gt;consultant&lt;/strong&gt; might maintain:&lt;/p&gt;
&lt;ul data-end=&quot;4901&quot; data-start=&quot;4844&quot;&gt;
&lt;li data-end=&quot;4862&quot; data-section-id=&quot;fa3r3p&quot; data-start=&quot;4844&quot;&gt;
&lt;p data-end=&quot;4862&quot; data-start=&quot;4846&quot;&gt;Recorded calls&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;4880&quot; data-section-id=&quot;1833o25&quot; data-start=&quot;4863&quot;&gt;
&lt;p data-end=&quot;4880&quot; data-start=&quot;4865&quot;&gt;Presentations&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;4901&quot; data-section-id=&quot;1596pd8&quot; data-start=&quot;4881&quot;&gt;
&lt;p data-end=&quot;4901&quot; data-start=&quot;4883&quot;&gt;Research reports&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-end=&quot;4982&quot; data-start=&quot;4903&quot;&gt;Searching across all of that in a &lt;strong data-end=&quot;4952&quot; data-start=&quot;4937&quot;&gt;single step&lt;/strong&gt; has been extremely difficult.&lt;/p&gt;
&lt;p data-end=&quot;5089&quot; data-start=&quot;4984&quot;&gt;With models like Gemini Embedding 2, developers can build search tools where one query instantly returns:&lt;/p&gt;
&lt;ul data-end=&quot;5174&quot; data-start=&quot;5091&quot;&gt;
&lt;li data-end=&quot;5118&quot; data-section-id=&quot;uqvk0z&quot; data-start=&quot;5091&quot;&gt;
&lt;p data-end=&quot;5118&quot; data-start=&quot;5093&quot;&gt;the right video segment&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;5140&quot; data-section-id=&quot;ivel08&quot; data-start=&quot;5119&quot;&gt;
&lt;p data-end=&quot;5140&quot; data-start=&quot;5121&quot;&gt;the correct slide&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;5174&quot; data-section-id=&quot;1b8a49w&quot; data-start=&quot;5141&quot;&gt;
&lt;p data-end=&quot;5174&quot; data-start=&quot;5143&quot;&gt;the relevant document section&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-end=&quot;5204&quot; data-start=&quot;5176&quot;&gt;All from &lt;strong data-end=&quot;5203&quot; data-start=&quot;5185&quot;&gt;one search bar&lt;/strong&gt;.&lt;/p&gt;
&lt;hr data-end=&quot;5209&quot; data-start=&quot;5206&quot; /&gt;
&lt;h2 data-end=&quot;5232&quot; data-section-id=&quot;1xqx32k&quot; data-start=&quot;5211&quot;&gt;The Bigger Picture&lt;/h2&gt;
&lt;p data-end=&quot;5295&quot; data-start=&quot;5234&quot;&gt;You probably won’t interact with Gemini Embedding 2 directly.&lt;/p&gt;
&lt;p data-end=&quot;5368&quot; data-start=&quot;5297&quot;&gt;Instead, it will power the &lt;strong data-end=&quot;5359&quot; data-start=&quot;5324&quot;&gt;next generation of search tools&lt;/strong&gt; used in:&lt;/p&gt;
&lt;ul data-end=&quot;5478&quot; data-start=&quot;5370&quot;&gt;
&lt;li data-end=&quot;5402&quot; data-section-id=&quot;g4hh6n&quot; data-start=&quot;5370&quot;&gt;
&lt;p data-end=&quot;5402&quot; data-start=&quot;5372&quot;&gt;knowledge management systems&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;5425&quot; data-section-id=&quot;1pjs687&quot; data-start=&quot;5403&quot;&gt;
&lt;p data-end=&quot;5425&quot; data-start=&quot;5405&quot;&gt;research databases&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;5446&quot; data-section-id=&quot;1vvlhnd&quot; data-start=&quot;5426&quot;&gt;
&lt;p data-end=&quot;5446&quot; data-start=&quot;5428&quot;&gt;course platforms&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;5478&quot; data-section-id=&quot;lxw3l9&quot; data-start=&quot;5447&quot;&gt;
&lt;p data-end=&quot;5478&quot; data-start=&quot;5449&quot;&gt;internal company search tools&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-end=&quot;5575&quot; data-start=&quot;5480&quot;&gt;But knowing that technology like this exists helps you &lt;strong data-end=&quot;5574&quot; data-start=&quot;5535&quot;&gt;understand what’s becoming possible&lt;/strong&gt;.&lt;/p&gt;
&lt;p data-end=&quot;5717&quot; data-start=&quot;5577&quot;&gt;That knowledge can make a big difference when clients start asking about &lt;strong data-end=&quot;5716&quot; data-start=&quot;5650&quot;&gt;AI-powered search, automation, or content organization systems&lt;/strong&gt;.&lt;/p&gt;
&lt;hr data-end=&quot;5722&quot; data-start=&quot;5719&quot; /&gt;
&lt;p data-end=&quot;5854&quot; data-start=&quot;5724&quot;&gt;If you manage &lt;strong data-end=&quot;5805&quot; data-start=&quot;5738&quot;&gt;content libraries, research archives, or client knowledge bases&lt;/strong&gt;, this is a technology worth paying attention to.&lt;/p&gt;
&lt;p&gt;The tools many teams will rely on in the near future are already being built on models like this.&amp;nbsp;&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/3243187776728086948/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/3243187776728086948?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/3243187776728086948'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/3243187776728086948'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2026/03/google-just-replaced-five-ai-search.html' title='Google Just Replaced Five AI Search Tools With One'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-6486846373151962702</id><published>2026-02-28T18:19:00.003+08:00</published><updated>2026-02-28T18:19:41.066+08:00</updated><title type='text'>Google Just Had a Huge Week — Nano Banana 2, Opal&#39;s Agent Step and a New Developer Ecosystem for ADK</title><content type='html'>&lt;p&gt;&amp;nbsp;There&#39;s been a lot happening at Google lately, and honestly, the updates from this past week alone are worth talking about. In just a few days, Google dropped a new default image model, upgraded their Opal workflow tool with real AI agent capabilities, and announced a brand-new integrations ecosystem for developers building AI agents. That&#39;s a lot to unpack — and I&#39;ve been digging into all three.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;I&#39;ve been paying close attention to how Google is quietly (and sometimes not so quietly) building out their AI stack. These three updates feel connected in a bigger way. They&#39;re all pushing toward the same idea: faster, smarter, more autonomous AI tools that don&#39;t require you to be an engineer to use — or that supercharge you if you are one.&lt;/p&gt;&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;Nano Banana 2 — Pro-Quality Images at Flash Speed&lt;br /&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEgixP68GLGMCN7fl-C4APIJ9yxt8DIvu6-E8-i2TCN4arN0a7gJIqbAFFfW0b8Gzph6fsGBKlX--W4lJazZivSSRamnTDDRAP7Z9CcwK0rKPA7Yam9gwuUgrBZkj8Y4GOPmPTSAnrW7DURIh3UYeodgrN55tOiNsIb8bajQIVdo5NAr-0OC68rUuAdGx-vt&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img data-original-height=&quot;731&quot; data-original-width=&quot;1300&quot; height=&quot;180&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEgixP68GLGMCN7fl-C4APIJ9yxt8DIvu6-E8-i2TCN4arN0a7gJIqbAFFfW0b8Gzph6fsGBKlX--W4lJazZivSSRamnTDDRAP7Z9CcwK0rKPA7Yam9gwuUgrBZkj8Y4GOPmPTSAnrW7DURIh3UYeodgrN55tOiNsIb8bajQIVdo5NAr-0OC68rUuAdGx-vt=w320-h180&quot; title=&quot;Nano Banana 2&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;/h2&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;One of the things that&#39;s always frustrated me about AI image generation is the trade-off. You either get fast and mediocre, or slow and beautiful. Nano Banana 2 — Google&#39;s new default image model built on Gemini 3.1 Flash Image — is trying to change that entirely.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;What makes this interesting is the combination it&#39;s pulling off. It&#39;s generating at Flash speed while delivering what Google is calling Pro-level quality. To put that in practical terms: you&#39;re not waiting around for results, but you&#39;re also not getting blurry, inconsistent images. And the resolution range is wide — from 512px all the way up to 4K. That means the same model works whether you&#39;re making a quick thumbnail or something that needs to look polished and production-ready.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Here&#39;s the detail that really caught my attention: it can consistently handle up to 5 characters and 14 objects in a single image. If you&#39;ve ever tried to generate a scene with multiple people or items and watched AI completely fall apart — characters blending together, objects disappearing — you know why this matters. Consistency across complex compositions has been a real weak spot in image generation, so this feels like a genuine step forward.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Nano Banana 2 is rolling out across the Gemini app and Google Workspace, and it&#39;s also available for enterprise use through Google Cloud. It&#39;s already the new default — which means if you&#39;re using Gemini for image generation, you&#39;re already on it.&lt;/p&gt;&lt;hr class=&quot;border-border-200 border-t-0.5 my-3 mx-1.5&quot; /&gt;&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;Google Labs Opal Gets an Agent Step — Workflows Just Got Smarter&lt;/h2&gt;&lt;div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEiEAfnpGHV75qpobWdYSKxibt5GU-qn58xG7lol12X-wkK-Kek5RuIhvNM_Ks87kXILwK_MKSz_7Oh79SXBzGcXR0eIRa0Is5BXu5Cdm1H3U6rKkNMRI62MuWe58eozHJT22PDC0qENNUIzFdJrrLDyyiN_RSkYHoF-dtX0GZt_FYZFqhhQ7u13O16K3mlP&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img alt=&quot;&quot; data-original-height=&quot;733&quot; data-original-width=&quot;1300&quot; height=&quot;180&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEiEAfnpGHV75qpobWdYSKxibt5GU-qn58xG7lol12X-wkK-Kek5RuIhvNM_Ks87kXILwK_MKSz_7Oh79SXBzGcXR0eIRa0Is5BXu5Cdm1H3U6rKkNMRI62MuWe58eozHJT22PDC0qENNUIzFdJrrLDyyiN_RSkYHoF-dtX0GZt_FYZFqhhQ7u13O16K3mlP&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;I&#39;ve been curious about Opal since Google Labs introduced it as a workflow builder. The idea is that you set up a series of steps — like a recipe — and Opal runs through them to help you create something, whether that&#39;s a video, a piece of content, or a research brief. It&#39;s been useful, but the steps were static. You&#39;d set it up once and it would just follow the same path every time.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The new agent step changes that completely. Instead of a fixed sequence, Opal now has a step where an actual AI agent takes over — it understands your goal, picks the right tools (like Veo for video generation, or web search for research), manages memory across the workflow, and routes dynamically based on what&#39;s needed. Think of it like the difference between following a printed map and having a navigation app that can reroute you in real time when things change.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;This is part of a bigger shift we&#39;re seeing across the AI space, where &quot;agentic&quot; capabilities — AI that can reason, decide, and act rather than just respond — are becoming the new baseline. Google adding this to Opal means even people without a coding background can now build workflows that genuinely adapt. You don&#39;t have to anticipate every scenario upfront; the agent figures it out.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;You can try Opal right now at opal.google — the agent step is available for all users.&lt;/p&gt;&lt;hr class=&quot;border-border-200 border-t-0.5 my-3 mx-1.5&quot; /&gt;&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;Google ADK&#39;s New Integrations Ecosystem — Big News for Developers&lt;/h2&gt;&lt;div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEiOioX4_8MgoQpWaKWrqNYoMYrdD2ko1gnOVP_9vFbeNvBZWNHrUSSR0E2aNBzPSgecRp09mYvtBYPfTaEULeD3DSppPZTNXfuL9MgMKzJMSK2AqFSE-4-dvl11ryccsgLKEgPaKpEesHDdS5S8BydhNUq_fij5MTgF6w8sCYP_hkSQ2hR7lTh7BQGvs5L4&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img data-original-height=&quot;1536&quot; data-original-width=&quot;2752&quot; height=&quot;224&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEiOioX4_8MgoQpWaKWrqNYoMYrdD2ko1gnOVP_9vFbeNvBZWNHrUSSR0E2aNBzPSgecRp09mYvtBYPfTaEULeD3DSppPZTNXfuL9MgMKzJMSK2AqFSE-4-dvl11ryccsgLKEgPaKpEesHDdS5S8BydhNUq_fij5MTgF6w8sCYP_hkSQ2hR7lTh7BQGvs5L4=w400-h224&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;This one is more developer-focused, but it&#39;s worth knowing about even if you&#39;re not building apps yourself. Google announced a new integrations ecosystem for their Agent Development Kit, or ADK — which is the framework developers use to build AI agents.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The idea is to make it easier for developers to connect their AI agents with external tools and services, so agents can actually do useful things in the real world instead of just talking about them. It&#39;s similar to how apps on your phone connect to different services — except here, we&#39;re talking about AI agents that can go off and complete tasks on your behalf.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;














&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;What I find fascinating is how this fits into the broader picture. Between Nano Banana 2 (powerful image generation, accessible to everyone), Opal&#39;s agent step (autonomous workflows without code), and the ADK ecosystem (tools for developers to build custom agents), Google is building out every layer of the stack at once. There&#39;s something for the casual user, the content creator, and the professional developer — all in the same week.&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/6486846373151962702/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/6486846373151962702?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/6486846373151962702'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/6486846373151962702'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2026/02/google-just-had-huge-week-nano-banana-2.html' title='Google Just Had a Huge Week — Nano Banana 2, Opal&#39;s Agent Step and a New Developer Ecosystem for ADK'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/a/AVvXsEgixP68GLGMCN7fl-C4APIJ9yxt8DIvu6-E8-i2TCN4arN0a7gJIqbAFFfW0b8Gzph6fsGBKlX--W4lJazZivSSRamnTDDRAP7Z9CcwK0rKPA7Yam9gwuUgrBZkj8Y4GOPmPTSAnrW7DURIh3UYeodgrN55tOiNsIb8bajQIVdo5NAr-0OC68rUuAdGx-vt=s72-w320-h180-c" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-8333798111022601314</id><published>2026-02-21T19:51:00.002+08:00</published><updated>2026-02-21T19:51:25.422+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Anthropic"/><category scheme="http://www.blogger.com/atom/ns#" term="Anthropic Claude"/><title type='text'>AI Is Finally Fighting Back — And Anthropic Just Made It Official</title><content type='html'>&lt;p&gt;&amp;nbsp;I&#39;ve been watching the cybersecurity space for a while now, and I have to be honest — it&#39;s one of those areas that used to feel completely out of reach for someone like me. No coding background, no deep technical knowledge of exploits or patches. Just a person who&#39;s curious about what AI can actually &lt;em&gt;do&lt;/em&gt; in the real world.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;But here&#39;s the thing I kept noticing over the years: the good guys were always playing catch-up.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Think back to how security worked — and honestly, how it &lt;em&gt;still&lt;/em&gt; works for most teams. You&#39;d have a tool scan your code, it would match against a list of known bad patterns, and spit out a report. The problem? The sneaky stuff, the subtle logic flaws, the vulnerabilities that had been hiding in open-source code for &lt;em&gt;decades&lt;/em&gt; — those never showed up. Because rule-based tools can&#39;t reason. They can only recognize what they&#39;ve already been told to look for.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Meanwhile, attackers got smarter. And faster.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;That gap — between what automated tools could catch and what skilled human researchers could catch — was always the weak point. And there just aren&#39;t enough human security researchers to close it. That&#39;s not a criticism, that&#39;s just math. The attack surface keeps growing. The backlogs keep piling up.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;This is why what Anthropic announced on February 20, 2026 actually stopped me mid-scroll.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;&lt;strong&gt;Claude Code Security&lt;/strong&gt; is now in limited research preview, and what it does is genuinely different from what I&#39;d seen before. Instead of scanning for known patterns, Claude reads your code the way a human security researcher would — tracing how data moves, understanding how different parts of an application talk to each other, and catching the complex, context-dependent vulnerabilities that traditional tools walk right past.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;What really got me is the verification layer. Claude doesn&#39;t just flag something and move on. It goes back and tries to &lt;em&gt;disprove&lt;/em&gt; its own findings, filtering out false positives before anything reaches a developer. Every validated finding comes with a severity rating and a confidence score, so teams know what to prioritize. And nothing gets applied automatically — a human always has to approve the fix. I love that. It&#39;s AI as a sharp, tireless assistant, not a rogue decision-maker.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;But here&#39;s the connection I keep thinking about: Anthropic&#39;s Frontier Red Team has been quietly building toward this for over a year. They entered Claude in cybersecurity competitions. They partnered with the Pacific Northwest National Laboratory to test AI on critical infrastructure defense. They used Claude to review their own internal code. This wasn&#39;t a product announcement that came from nowhere — it&#39;s the result of real, careful work testing what Claude could actually do before putting it in the hands of others.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;And the results of that work? Using Claude Opus 4.6, their team found over 500 vulnerabilities in production open-source codebases. Bugs that had survived years of expert human review, undetected.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;That&#39;s the part that really lands for me. These weren&#39;t theoretical vulnerabilities. They were sitting in real code, in real projects that real people depend on — sometimes for decades.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The reason I find this so meaningful isn&#39;t just the technology. It&#39;s the timing and the intent. Anthropic is releasing this in a limited preview specifically because the same capabilities that help defenders could help attackers. They&#39;re being deliberate about who gets access first — Enterprise and Team customers, plus open-source maintainers who can apply for free expedited access. They&#39;re working &lt;em&gt;with&lt;/em&gt; the community to get this right before it scales.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;That&#39;s a different posture than &quot;ship it and see what happens.&quot;&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;We&#39;re at a point where AI is going to scan a significant share of the world&#39;s code — that&#39;s not speculation anymore, it&#39;s the direction things are clearly heading. The question has always been who benefits from that first. Attackers who use AI to find weaknesses faster? Or defenders who use it to find and patch those same weaknesses before they&#39;re exploited?&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Claude Code Security is Anthropic&#39;s answer to that question.&amp;nbsp;&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/8333798111022601314/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/8333798111022601314?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/8333798111022601314'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/8333798111022601314'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2026/02/ai-is-finally-fighting-back-and.html' title='AI Is Finally Fighting Back — And Anthropic Just Made It Official'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-7980872738809541873</id><published>2026-02-20T17:00:00.001+08:00</published><updated>2026-02-20T17:00:00.115+08:00</updated><title type='text'>Gemini 3.1 Pro Changed How I Actually Use AI — And It Has Nothing to Do With Benchmarks</title><content type='html'>&lt;p&gt;I&#39;ve been using AI tools almost every day for the past couple of years now, and if there&#39;s one thing I&#39;ve learned, it&#39;s that a smarter model doesn&#39;t automatically mean you&#39;re getting smarter results. How you use the model matters just as much as what the model can do. And with Google&#39;s Gemini 3.1 Pro, there&#39;s something that I think is being really underrated in all the release coverage — the thinking level system, and what it actually means for the way you work.&lt;/p&gt;
&lt;p&gt;This isn&#39;t a conversation about whether 77% on ARC-AGI beats 31%. It&#39;s about something more practical: the moment you realize you&#39;ve been using these tools wrong, and how this new release hands you a dial you probably didn&#39;t know you needed.&lt;/p&gt;
&lt;h2&gt;What Even Is a &quot;Thinking Level&quot;?&lt;/h2&gt;
&lt;p&gt;Here&#39;s the quick version for anyone who hasn&#39;t come across this yet. Gemini 3.1 Pro lets you set how much thinking the model does before it responds. With the previous Gemini 3 Pro, you had two choices: low or high. With 3.1 Pro, there are now three — low, medium, and high.&lt;/p&gt;
&lt;p&gt;Think of it like choosing between a quick gut reaction, a considered opinion, and a deep research session. The model isn&#39;t just choosing between &quot;fast&quot; and &quot;slow&quot; — it&#39;s choosing how much internal reasoning to do before it gives you an answer. At high, the model essentially behaves like a mini version of Gemini Deep Think, which is Google&#39;s most powerful reasoning model. That&#39;s a significant thing to have access to in a general-purpose assistant.&lt;/p&gt;
&lt;p&gt;What surprised me when I started playing around with this is how much the choice actually matters. For a tricky math problem, setting thinking to high produced the right answer after several minutes of work. Setting it to low gave a fast but wrong answer. Same prompt, same model, completely different outcome.&lt;/p&gt;
&lt;h2&gt;The Problem Nobody Is Talking About&lt;/h2&gt;
&lt;p&gt;Here&#39;s what I find really fascinating about this. Most people who use AI tools have never really thought about what mode they should be in for a given task. We all tend to just fire off a prompt and expect the model to figure it out. But the thinking level system kind of forces you to be intentional, and that intentionality is where the real upgrade lives.&lt;/p&gt;
&lt;p&gt;I started thinking about all the times I&#39;ve used AI for tasks that fell into two completely different buckets. There&#39;s the stuff I need quickly — drafting a quick reply, summarizing a short article, brainstorming a list of ideas, generating a social post. And then there&#39;s the stuff where I actually want the model to sit with a problem — writing a full script outline, analyzing something complex, working through a nuanced question. Those two categories have always existed. What&#39;s new is that now I have a setting that actually reflects that difference.&lt;/p&gt;
&lt;p&gt;Before 3.1 Pro, running everything at the same compute level was a bit like always driving in the same gear. Sometimes it worked. Sometimes it didn&#39;t. Now there&#39;s a gearshift.&lt;/p&gt;
&lt;h2&gt;How This Actually Changes My Workflow&lt;/h2&gt;
&lt;p&gt;When I started being intentional about thinking levels, a few things shifted for me pretty quickly.&lt;/p&gt;
&lt;p&gt;For anything where I need a fast creative spark — like coming up with a hook for a video, finding synonyms, or doing a quick rewrite of a sentence — low thinking is more than enough. It&#39;s snappy, it&#39;s responsive, and frankly it&#39;s exactly what you want when you&#39;re in flow and don&#39;t want to wait. Speed matters there.&lt;/p&gt;
&lt;p&gt;For medium tasks — things like drafting a structured outline, explaining a concept clearly, or building a content calendar — medium thinking has become my go-to. It takes a little longer, but the output feels more considered. Less surface-level. Like the model actually thought about the structure before it started writing.&lt;/p&gt;
&lt;p&gt;And then there&#39;s high. I&#39;ve started reserving high thinking for the things that actually deserve it. Complex analysis, tricky research questions, anything where getting the answer wrong would cost me time. The wait is longer — we&#39;re talking several minutes in some cases — but the quality of what comes back is on a different level. It&#39;s not just more text. It&#39;s more thoughtful text.&lt;/p&gt;
&lt;h2&gt;Why This Matters Even If You&#39;re Not Technical&lt;/h2&gt;
&lt;p&gt;I know a lot of people who use AI tools but feel like they&#39;re not getting as much out of them as they should. And honestly, after thinking about this thinking level system, I wonder if part of that frustration is just a mismatch between the task and the mode.&lt;/p&gt;
&lt;p&gt;If you&#39;ve ever asked an AI a complicated question and gotten a shallow answer, it might not be a model quality problem. It might be a compute budget problem. The model didn&#39;t spend enough time thinking. And now, for the first time, you have direct control over that.&lt;/p&gt;
&lt;p&gt;That&#39;s actually a pretty big shift. Instead of just hoping the model figures out when to try harder, you get to tell it. It puts a little more responsibility on the user, sure. But it also puts a lot more power in your hands.&lt;/p&gt;
&lt;h2&gt;The Bigger Picture&lt;/h2&gt;
&lt;p&gt;What I keep coming back to is this: Gemini 3.1 Pro isn&#39;t just a smarter model. It&#39;s a model that respects the fact that not every question deserves the same amount of effort. And it&#39;s the first time I&#39;ve felt like a general-purpose AI assistant is actually designed around how I naturally work — some things fast, some things slow, some things in between.&lt;/p&gt;
&lt;p&gt;The AI tools that stick around aren&#39;t always the ones with the highest benchmark numbers. They&#39;re the ones that fit into how people actually think and work. This thinking level system feels like a step in that direction — and it&#39;s one I don&#39;t think enough people are paying attention to yet.&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/7980872738809541873/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/7980872738809541873?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/7980872738809541873'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/7980872738809541873'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2026/02/gemini-31-pro-changed-how-i-actually.html' title='Gemini 3.1 Pro Changed How I Actually Use AI — And It Has Nothing to Do With Benchmarks'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-390664768085574388</id><published>2026-02-20T14:22:00.004+08:00</published><updated>2026-02-20T14:22:27.326+08:00</updated><title type='text'>Grok 4.20: What Nobody Is Actually Talking About</title><content type='html'>&lt;div style=&quot;text-align: left;&quot;&gt;There&#39;s been a lot of noise about Grok 4.20 this week. The four agents, the multi-agent setup, the &quot;AI brains arguing with each other&quot; framing. I get it — it&#39;s a fun story. But the more I dug into this release, the more I felt like everyone was covering the surface and missing the stuff underneath that&#39;s actually worth paying attention to.&lt;/div&gt;&lt;p&gt;So here&#39;s what caught my eye.&lt;/p&gt;&lt;hr /&gt;&lt;h2&gt;The model solved a real math problem nobody had solved before&lt;/h2&gt;&lt;p&gt;I came across this one almost by accident and it stopped me completely.&lt;/p&gt;&lt;p&gt;Paata Ivanisvili, a math professor at UC Irvine, had been working on an open problem in harmonic analysis — trying to find the exact maximum of something called a Bellman function. He and his student had already published a paper on it. They had a partial answer. The problem was still technically unsolved.&lt;/p&gt;&lt;p&gt;He got early access to the Grok 4.20 beta, fed it the problem, and within five minutes it handed him an explicit formula — a cleaner, sharper result than what he and his student had managed. His reaction on X was basically just: &quot;Wow.&quot;&lt;/p&gt;&lt;p&gt;What makes this different from the usual &quot;AI is smart&quot; story is that benchmarks test problems with known answers. This didn&#39;t. Grok wasn&#39;t pulling a solution from somewhere — it was reasoning toward something that hadn&#39;t been written down yet. That&#39;s a genuinely different category of thing, and it happened because the Benjamin agent (the math and logic one) is specifically built for this kind of step-by-step proof reasoning.&lt;/p&gt;&lt;p&gt;One result doesn&#39;t prove everything. But it happened, it&#39;s documented, and it made me think differently about what &quot;AI for research&quot; actually looks like in practice.&lt;/p&gt;&lt;hr /&gt;&lt;h2&gt;It was already winning competitions before anyone knew it existed&lt;/h2&gt;&lt;p&gt;Here&#39;s something that got buried in the launch coverage: Grok 4.20 didn&#39;t launch when the public beta dropped on February 17. It had been running — quietly, under fake names — for weeks before that.&lt;/p&gt;&lt;p&gt;On LMArena it showed up as &quot;Theta-hat&quot; and &quot;Slateflow.&quot; On DesignArena it appeared as &quot;Pearl&quot; and &quot;Obsidian.&quot; And in Alpha Arena&#39;s live AI stock trading competition, a &quot;Mystery Model&quot; was quietly the only entry making money while every other AI lost. That Mystery Model was Grok 4.20.&lt;/p&gt;&lt;p&gt;xAI&#39;s last official model announcement on their news page is still Grok 4.1 from November 2025. There was no blog post for this launch. No announcement thread. Just posts from employees on X and a new option appearing quietly in the grok.com dropdown. By the time most people heard about it, it had already been competing for weeks.&lt;/p&gt;&lt;hr /&gt;&lt;h2&gt;The &quot;four brains&quot; framing is wrong — and the real version is more interesting&lt;/h2&gt;&lt;p&gt;Everyone&#39;s been describing this as four separate AI models arguing with each other. That&#39;s not quite what&#39;s happening, and I think the actual explanation is more impressive.&lt;/p&gt;&lt;p&gt;Grok 4.20 is one model — estimated around 3 trillion parameters — with four different roles running from the same base weights. Grok the coordinator, Harper the real-time researcher, Benjamin the math and logic brain, Lucas the contrarian wildcard. They&#39;re not four separate models. They&#39;re four ways of prompting the same model, each with different tool access and context framing.&lt;/p&gt;&lt;p&gt;Why does this matter? Cost. Running four truly separate models would be four times as expensive. Because these agents share weights and the same input context, xAI puts the overhead at 1.5x to 2.5x — not 4x. That&#39;s what makes this available at a consumer price point rather than Grok Heavy&#39;s $300/month tier. The agents don&#39;t have long debates either. The rounds are short, structured, trained through reinforcement learning to be efficient. They send targeted corrections and move on.&lt;/p&gt;&lt;p&gt;I found it interesting that nobody in the coverage this week mentioned this. &quot;Four brains&quot; is a better headline, sure, but one brain wearing four hats is actually a harder technical trick to pull off.&lt;/p&gt;&lt;hr /&gt;&lt;h2&gt;The trading competition result deserves more scrutiny than it got&lt;/h2&gt;&lt;p&gt;Alpha Arena Season 1.5 was a live stock trading competition — real transactions, verified on the blockchain. GPT-5, Gemini, Claude, open-source models, Chinese models — everyone participated in multiple strategy variants over several weeks. When it ended, every single configuration of every other model finished in the red. Grok 4.20&#39;s four variants were the only ones in profit, with the best posting around 35% returns from a $10,000 start.&lt;/p&gt;&lt;p&gt;A few people mentioned this. Nobody really asked why.&lt;/p&gt;&lt;p&gt;My read: it wasn&#39;t that Grok out-thought everyone on finance. It&#39;s that Harper — the agent that pulls from X&#39;s real-time feed of roughly 68 million English tweets per day — gave it a live information edge the other models didn&#39;t have. The competition fed other models summarized news prompts. Grok was processing raw market sentiment as it happened.&lt;/p&gt;&lt;p&gt;That&#39;s worth being honest about, because it means the result says more about data access than raw intelligence. But here&#39;s the thing — that data access is the architecture. That&#39;s what the Harper agent was built to do. So the win was real, and it was structural, and it came from a deliberate design choice.&lt;/p&gt;&lt;hr /&gt;&lt;h2&gt;Grok 4.20 is the proof of concept. Grok 5 is the actual bet.&lt;/h2&gt;&lt;p&gt;While everyone&#39;s been testing the 4.20 beta, xAI is already training its successor. Grok 5 is reportedly sitting around 6 trillion parameters — roughly double the size of 4.20 — with a target launch somewhere between April and June 2026.&lt;/p&gt;&lt;p&gt;For context, Grok 4.20 arrived in a crowded week. OpenAI shipped GPT-5.3-Codex on February 5. Anthropic released Claude Sonnet 5 on February 3. By that measure, this was a delayed point release catching up while competitors were already moving to the next generation. Even xAI treated it like a soft launch.&lt;/p&gt;&lt;p&gt;Grok 5 will not be a soft launch. xAI is spending roughly $1 billion a month. SpaceX formally acquired them weeks ago. An IPO is on the table. A 6-trillion-parameter release is exactly the kind of moment you build that story around.&lt;/p&gt;&lt;p&gt;What I keep thinking about is that the ecosystem is already being built before most people have noticed. Perplexity is reportedly running Grok 4.20 under the hood for a new search mode. There are signs of something called Grok Build — a coding tool with up to 8 parallel agents. The multi-agent architecture isn&#39;t a feature. It&#39;s the foundation, and xAI is building on top of it fast.&lt;/p&gt;&lt;p&gt;Grok 4.20 showed the approach works. The math problem, the trading competition, the benchmark trajectory — all of it points in the same direction. Grok 5 is where we find out if xAI can actually deliver on the scale of that promise.&lt;/p&gt;&lt;p&gt;I&#39;ll have a full video on this on YouTube soon — I want to actually get hands-on time with it first before I say anything definitive. But if you&#39;re following AI closely, this release is worth paying more attention to than the &quot;four agents&quot; headline suggests.&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/390664768085574388/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/390664768085574388?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/390664768085574388'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/390664768085574388'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2026/02/grok-420-what-nobody-is-actually.html' title='Grok 4.20: What Nobody Is Actually Talking About'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-7055115045929644944</id><published>2026-02-19T22:11:00.001+08:00</published><updated>2026-02-19T22:11:41.822+08:00</updated><title type='text'>Claude Sonnet 4.6 Is Here — And There&#39;s Way More Going On Than the Benchmarks Show</title><content type='html'>&lt;p&gt;There&#39;s been a lot of buzz this week about Claude Sonnet 4.6, Anthropic&#39;s latest model release. And honestly? Most of the coverage I&#39;ve seen is stopping at the surface level — benchmarks, pricing, computer use scores. All of that is interesting, sure. But when I started looking closer at what actually shipped alongside this model, I found four things that nobody&#39;s really talking about that I think matter a lot more for people who are actually using Claude day to day.&lt;/p&gt;&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;That 1 Million Token Context Window Has a Catch&lt;/h2&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Okay, so everyone is excited about the 1 million token context window. And I get it — the number sounds incredible. A million tokens is enough to hold entire codebases, thousands of documents, months of conversations. But here&#39;s what I didn&#39;t realize until I looked into this more carefully: a big context window doesn&#39;t automatically mean a useful context window.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;There&#39;s a real issue in AI called context rot. The basic idea is that as you keep filling up a model&#39;s memory during a conversation, its ability to actually use older information starts to degrade. It&#39;s a bit like trying to hold too many things in your head at once — the stuff from earlier starts slipping away. Anthropic&#39;s previous model, Sonnet 4.5, scored just 18.5% on a benchmark called MRCR v2, which specifically tests whether a model can find information buried deep in a long context. That&#39;s... not great.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Sonnet 4.6 is genuinely better at this. But there are two things I think people need to know before they get too excited. First, the 1 million token window is still in beta — it&#39;s not just available to everyone right now, and you need to explicitly turn it on. Second, and this one surprised me, the pricing changes dramatically once you cross 200,000 tokens. Below that threshold you pay the standard rate. Above it, the pricing structure shifts significantly and applies to your entire request, not just the extra tokens. So if you&#39;re planning to throw huge amounts of text at this model, make sure you understand what that&#39;s going to cost before you build a workflow around it.&lt;/p&gt;&lt;hr class=&quot;border-border-200 border-t-0.5 my-3 mx-1.5&quot; /&gt;&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;Here&#39;s the Part That Actually Impressed Me&lt;/h2&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Now here&#39;s where it gets interesting, because Anthropic didn&#39;t just train a better model and call it a day. They shipped two features alongside Sonnet 4.6 that directly address the context rot problem — and almost nobody is talking about them.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The first is dynamic web search filtering. When Claude searches the web, it now writes and runs its own code to filter search results before loading them into context. Instead of pulling in a full HTML page and reasoning over all of it — ads, navigation menus, irrelevant paragraphs and everything — it strips out the noise first and only keeps what&#39;s relevant. Anthropic tested this across two benchmarks and found it improved accuracy by an average of 11% while actually cutting input tokens by 24%. That&#39;s a really meaningful result. Quora&#39;s Poe platform tested it against all the major frontier models and said Opus 4.6 with this feature &quot;achieved the highest accuracy on our internal evals&quot; — specifically because it approaches research the way a human researcher would, filtering information programmatically instead of reasoning over raw noise.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The second is programmatic tool calling, which just hit general availability. This one is more for developers, but the idea is fascinating. When Claude needs to call multiple tools — say, querying a database five times to compare regions, then sorting and summarizing the results — it can now write code that does all of that inside a sandboxed container without bringing each intermediate result back into the conversation context. The result only shows up once everything is done. Think of it like doing all your rough math on a scratch pad before writing the final answer on the paper — the scratch work never clutters the main context.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Together, these two features tell a really clear story: Anthropic&#39;s answer to context rot isn&#39;t just a bigger bucket. It&#39;s smarter filtering so less noise goes in.&lt;/p&gt;&lt;hr class=&quot;border-border-200 border-t-0.5 my-3 mx-1.5&quot; /&gt;&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;Should You Actually Cancel Opus?&lt;/h2&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;This is the question I keep seeing pop up and I wanted to think through it properly. Sonnet 4.6 costs $3 per million input tokens and $15 per million output tokens. Opus 4.6 is roughly 1.7x more expensive on input and output. For most individual users that difference feels abstract — but if you&#39;re building on the API and processing a lot of requests, it compounds fast.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;What made me look at this differently is where Sonnet 4.6 actually beats Opus. On agentic financial analysis — think tasks like researching stock data, calculating ratios, pulling together a market summary — Sonnet 4.6 scored 63.3% versus Opus 4.6&#39;s 60.1%. On office tasks, Sonnet leads again. On computer use benchmarks, the gap is almost nothing: 72.5% for Sonnet versus 72.7% for Opus. For everyday knowledge work, Sonnet 4.6 is genuinely at the same level for a meaningfully lower price.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The places where Opus still wins are specific: novel problem-solving, deep codebase work, complex situations where getting it exactly right is more important than getting it fast and cheap. Anthropic themselves describe Opus as the right choice for &quot;codebase refactoring, coordinating multiple agents in a workflow, and problems where getting it just right is paramount.&quot; That&#39;s a real distinction — but it&#39;s a narrower one than it used to be.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;And here&#39;s one more thing that matters for API developers. Because of programmatic tool calling, tool results from multi-step workflows don&#39;t count toward your token usage at all — only the final output does. So if you have workflows that currently make eight or ten tool calls in sequence, each one loading results back into context, you may be spending significantly more than you need to. That changes the cost math even further in Sonnet&#39;s favor for the right use cases.&lt;/p&gt;&lt;hr class=&quot;border-border-200 border-t-0.5 my-3 mx-1.5&quot; /&gt;&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;The Security Problem Nobody Mentioned&lt;/h2&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;I want to talk about prompt injection because every video I watched mentioned it in one sentence and moved on. I think it deserves more attention than that — especially now that Claude has computer use and can take real actions on your behalf.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Here&#39;s what prompt injection actually means in practice. Imagine you ask Claude to read through your emails and schedule any meeting requests it finds. One of those emails was crafted by someone who knew an AI would read it. Inside that email, hidden in the text, are instructions that tell Claude to forward any email with the word &quot;confidential&quot; to an external address before it drafts your replies. You&#39;d never see it happen. That&#39;s the attack. Anthropic has described exactly this scenario in their own research on browser-agent security, and it&#39;s not hypothetical — it&#39;s an active concern for anyone building or using agentic AI.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Anthropic says Sonnet 4.6 is significantly better at detecting and resisting these attacks compared to its predecessor, and their evaluations show it performs similarly to Opus 4.6 in this area. That&#39;s meaningful progress. But independent testing on earlier Claude models found that while the model actively resists simple injection attempts, it can still be confused when the malicious instructions are buried inside what looks like a legitimate document or data structure.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;What I didn&#39;t see anyone mention is a second security issue that comes with programmatic tool calling. When Claude calls your tools programmatically and gets results back, those results come back as raw strings — and those strings can contain anything, including code snippets that might get processed by the execution environment. If your tools are pulling data from external sources or user inputs, there&#39;s a real code injection risk if you&#39;re not validating what comes back before acting on it. This is separate from prompt injection — it&#39;s a layer deeper, and it&#39;s something every developer building agentic workflows needs to think about before shipping.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The honest summary is this: Sonnet 4.6 is more secure than what came before. But &quot;more secure&quot; and &quot;fully solved&quot; are very different things. The more autonomous you make your agents, the more carefully you need to think about what they can be tricked into doing.&lt;/p&gt;&lt;hr class=&quot;border-border-200 border-t-0.5 my-3 mx-1.5&quot; /&gt;&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;What This All Means&lt;/h2&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;I find the Sonnet 4.6 release genuinely exciting — not just because of the model improvements, but because of what the surrounding features tell us about where Anthropic is heading. They&#39;re building a system where Claude reasons over less noise, not more. Dynamic filtering, programmatic tool calling, context compaction, memory tools — these are all solving the same underlying problem. And the fact that accuracy went up while token usage went down on web search benchmarks suggests this approach is actually working.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;If you&#39;re a knowledge worker using Claude through the app, the takeaway is pretty straightforward: this model is fast, capable, and meaningfully better at the kind of office work most people actually do. If you&#39;re a developer building on the API, there are real architectural decisions to make now — about when to use Sonnet versus Opus, how to take advantage of programmatic tool calling, and how to think about security in agentic workflows.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;What I&#39;m watching next is how the 1 million token context window performs in real production use cases over the next few weeks. The beta label is still on it for a reason. But the direction is clear — and it&#39;s more interesting than most of the coverage I&#39;ve seen this week.&lt;/p&gt;&lt;hr class=&quot;border-border-200 border-t-0.5 my-3 mx-1.5&quot; /&gt;&lt;p&gt;




























&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;&lt;em&gt;Sources: &lt;a class=&quot;underline underline underline-offset-2 decoration-1 decoration-current/40 hover:decoration-current focus:decoration-current&quot; href=&quot;https://www.anthropic.com/news/claude-sonnet-4-6&quot;&gt;Anthropic Claude Sonnet 4.6 announcement&lt;/a&gt; · &lt;a class=&quot;underline underline underline-offset-2 decoration-1 decoration-current/40 hover:decoration-current focus:decoration-current&quot; href=&quot;https://claude.com/blog/improved-web-search-with-dynamic-filtering&quot;&gt;Dynamic Web Search Filtering&lt;/a&gt; · &lt;a class=&quot;underline underline underline-offset-2 decoration-1 decoration-current/40 hover:decoration-current focus:decoration-current&quot; href=&quot;https://platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling&quot;&gt;Programmatic Tool Calling Docs&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/7055115045929644944/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/7055115045929644944?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/7055115045929644944'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/7055115045929644944'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2026/02/claude-sonnet-46-is-here-and-theres-way.html' title='Claude Sonnet 4.6 Is Here — And There&#39;s Way More Going On Than the Benchmarks Show'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-7155437406346206266</id><published>2026-02-13T11:19:00.005+08:00</published><updated>2026-02-13T11:19:33.873+08:00</updated><title type='text'>Google&#39;s Gemini 3 Deep Think Just Got a Major Upgrade — And It&#39;s Designed for Real Science</title><content type='html'>&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;There&#39;s been this interesting trend in AI lately where models are getting better at reasoning through complex problems. We&#39;ve seen it with OpenAI&#39;s o1 and o3, DeepSeek&#39;s R1, and now Google is making a serious push into this space with a major update to Gemini 3 Deep Think.&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: justify;&quot;&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;What makes this update different from the reasoning models we&#39;ve seen before is that Google specifically built it for scientists, researchers, and engineers working on real-world problems. This isn&#39;t just about solving math competitions anymore—though it still does that incredibly well. It&#39;s about tackling messy, incomplete data and problems without clear solutions, which is what actual research looks like.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;I&#39;ve been following the development of reasoning models closely, and Deep Think&#39;s focus on practical scientific applications is a shift I find really interesting. This is part of a larger movement where AI is moving from being a general-purpose tool to something more specialized for specific domains.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Google AI Ultra subscribers can access the updated Deep Think in the Gemini app starting today, and for the first time, researchers and enterprises can apply for early access to use it via the Gemini API.&lt;/p&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;iframe allowfullscreen=&#39;allowfullscreen&#39; webkitallowfullscreen=&#39;webkitallowfullscreen&#39; mozallowfullscreen=&#39;mozallowfullscreen&#39; width=&#39;395&#39; height=&#39;266&#39; src=&#39;https://www.blogger.com/video.g?token=AD6v5dwJqrpM0lr6UbsTfcubUVFb1KlMIVFJsj0hR3nvdUKyk3ZwtjCu7gsBlS4IJkarPZYCWiu2azJrf0KOehM1VA&#39; class=&#39;b-hbp-video b-uploaded&#39; frameborder=&#39;0&#39;&gt;&lt;/iframe&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;Why Deep Think Focuses on Science and Engineering&lt;/h2&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: justify;&quot;&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The way Google approached this update is pretty clever. Instead of just making a model that&#39;s good at abstract reasoning, they worked directly with scientists and researchers to understand what kinds of problems they actually face in their work.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Real research isn&#39;t like solving textbook problems. You&#39;re dealing with incomplete data, messy information, and questions that don&#39;t have single right answers. Traditional AI models often struggle with this kind of ambiguity, but Deep Think was specifically trained to handle it.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;What caught my attention in the announcement were the real-world examples. A mathematician at Rutgers University used Deep Think to review a highly technical paper and it found a logical flaw that had passed through human peer review. At Duke University, researchers used it to optimize crystal growth methods for semiconductor materials, hitting a precise target that previous methods couldn&#39;t achieve.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;These aren&#39;t just impressive demos—they&#39;re solving actual research bottlenecks.&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;The Numbers Are Genuinely Impressive&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Deep Think continues to push what&#39;s possible on academic benchmarks. It scored 48.4% on Humanity&#39;s Last Exam, a benchmark specifically designed to test the limits of frontier models. That&#39;s without using any external tools, just pure reasoning.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;It also achieved 84.6% on ARC-AGI-2, which tests abstract reasoning abilities that supposedly indicate progress toward artificial general intelligence. The ARC Prize Foundation verified this result, which gives it more credibility.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;On Codeforces, a competitive programming platform, Deep Think reached an Elo rating of 3455. To put that in perspective, that&#39;s gold medal territory at international programming competitions.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The really interesting part is that Deep Think now also excels at chemistry and physics olympiad problems, achieving gold medal-level performance on both. It scored 50.5% on CMT-Benchmark, which tests advanced theoretical physics understanding.&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;Built for Practical Engineering Applications&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Beyond benchmarks, what makes Deep Think stand out is how it&#39;s being used in practice. Google designed it to interpret complex data and model physical systems through code, which means engineers can actually use it for real work.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;One example they showed is turning a sketch into a 3D-printable file. You draw something, Deep Think analyzes it, models the complex shape, and generates a file ready for 3D printing. That&#39;s the kind of practical application that makes this more than just an impressive reasoning model—it&#39;s a tool people can actually use.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Google&#39;s also making this available through the Gemini API for researchers and enterprises, which is significant. Previous versions of Deep Think were mostly limited to the consumer app, but opening it up via API means developers can integrate it into their own workflows and tools.&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;What This Means for AI Reasoning Models&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;This release is part of a broader competition happening right now in the reasoning model space. OpenAI has o1 and o3, DeepSeek released R1, Anthropic has been working on extended thinking capabilities, and now Google is pushing hard with Deep Think.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;What&#39;s interesting is how these companies are differentiating their approaches. OpenAI focuses on general reasoning, DeepSeek emphasizes efficiency and open-source access, and Google is positioning Deep Think as the model for scientific and engineering work.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The practical difference here is that Deep Think isn&#39;t trying to be everything to everyone. It&#39;s specialized for domains where deep reasoning through complex, messy problems actually matters—research, engineering, advanced mathematics, theoretical physics.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;For anyone working in these fields, having a model that understands the nuances of scientific work rather than just being good at logic puzzles could be genuinely transformative.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The fact that Google worked directly with scientists to build this, and that early testers are already finding real research applications, suggests this is more than just benchmark chasing. It&#39;s an attempt to make AI actually useful for advancing human knowledge in concrete ways.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;If you&#39;re a researcher, engineer, or working in a technical field, Deep Think might be worth keeping an eye on—especially if you can get into the early access program for the API. This could be one of those tools that changes how certain kinds of work get done.&lt;/p&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&amp;nbsp;&lt;p&gt;&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/7155437406346206266/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/7155437406346206266?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/7155437406346206266'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/7155437406346206266'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2026/02/googles-gemini-3-deep-think-just-got.html' title='Google&#39;s Gemini 3 Deep Think Just Got a Major Upgrade — And It&#39;s Designed for Real Science'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-6953280822562933508</id><published>2026-02-13T10:55:00.005+08:00</published><updated>2026-02-13T10:55:26.259+08:00</updated><title type='text'>Google Chrome&#39;s WebMCP is About to Change How AI Agents Browse the Web</title><content type='html'>&lt;p&gt;&amp;nbsp;There&#39;s been this ongoing challenge with AI agents: when they visit a website, they&#39;re basically tourists who don&#39;t speak the language. Whether you&#39;re using LangChain, Claude Code, or tools like OpenClaw, your agent is stuck guessing which buttons to press, scraping HTML, or processing thousands of tokens worth of screenshots just to figure out what&#39;s on a page. If you&#39;ve been building with agents for a while, you know exactly how painful this is.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;That&#39;s what makes Google Chrome&#39;s new WebMCP preview so interesting. Earlier this week, the Chrome team shipped an early version of what could be the most important change to how agents interact with the web in years. Instead of treating every website like a foreign language that needs translation, WebMCP lets websites expose structured tools directly to AI agents. No more scraping. No more processing endless screenshots. Your agent just calls functions.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;This is part of a bigger shift we&#39;re seeing where the web itself is becoming more agent-friendly, not just more human-friendly. And honestly, it&#39;s about time.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;I&#39;ve been following the development of browser-based agents, and WebMCP caught my attention because it solves problems most people aren&#39;t even talking about yet. Watch my YouTube video on it below.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;&lt;a class=&quot;underline underline underline-offset-2 decoration-1 decoration-current/40 hover:decoration-current focus:decoration-current&quot; href=&quot;https://www.youtube.com/watch?v=35oWt7u2b-g&quot;&gt;WebMCP Begins Rollout&lt;/a&gt;&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;iframe allowfullscreen=&quot;&quot; class=&quot;BLOG_video_class&quot; height=&quot;290&quot; src=&quot;https://www.youtube.com/embed/35oWt7u2b-g&quot; width=&quot;493&quot; youtube-src-id=&quot;35oWt7u2b-g&quot;&gt;&lt;/iframe&gt;&lt;/div&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;Why Current Web Interaction Is So Inefficient&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Right now, agents interact with websites in two main ways. The first is through screenshots—you take an image of the page, feed it to a multimodal model, and hope it can identify buttons, form fields, and interactive elements. The problem? You&#39;re burning through thousands of tokens for every single image you process.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The second approach is accessing the DOM directly and parsing raw HTML and JavaScript code. While this uses fewer tokens than images, you&#39;re still translating from one language to another. The agent has to sift through paragraph tags, CSS styling, and all sorts of presentation markup that doesn&#39;t actually matter for understanding what actions it can take.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Both methods feel like working through a translator when you could just speak the same language.&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;How WebMCP Actually Works&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The idea behind WebMCP is beautifully simple: let each webpage act like an MCP server that agents can query directly. The page basically tells the agent, &quot;Here&#39;s what you can read. Here&#39;s what you can click. Here&#39;s what you can fill in.&quot;&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;This isn&#39;t entirely new—academics and companies have been proposing versions of this for a while. But in the second half of last year, Microsoft and Google actually got together to build a real spec for how this would work. The timing makes sense too—this was right around when we saw Perplexity release Comet and OpenAI release Atlas, when web interaction was clearly heating up.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;What makes Chrome&#39;s approach interesting is that it&#39;s designed for human-in-the-loop workflows first. The agent works with the user, not just autonomously. So normal people still use websites normally, but agents can help speed things up and improve the experience.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Google presented three core pillars at the Web AI Summit: context (understanding what the user is doing beyond just the current screen), capabilities (taking actions on the user&#39;s behalf), and coordination (managing the handoff between agent and user when needed).&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;The Two APIs You Need to Know&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Chrome has structured WebMCP around two main APIs. The Declarative API handles standard actions—think HTML forms with added tool names and descriptions. If you&#39;ve already got well-structured forms on your site, you&#39;re apparently about 80% of the way there.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The Imperative API is for more complex, dynamic interactions that require JavaScript execution. This is where you&#39;d define custom tools, similar to how you&#39;d structure function calls for OpenAI or Anthropic&#39;s API endpoints.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The practical difference here is huge. Instead of dozens of interactions clicking through filters and scrolling pages, a single tool call could return structured results. Imagine your agent calling a &quot;search products&quot; function and getting back organized data instead of trying to parse a visual search interface.&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;What This Means Going Forward&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;While WebMCP is still behind a flag in Chrome, it&#39;s already in the browser. This isn&#39;t a theoretical spec anymore—it&#39;s actually happening. Google will likely roll this out fully at Google Cloud Next or Google IO in the coming months, and I expect things to move quickly from there.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;We&#39;ll probably see tools and maybe even Claude skills that help convert existing websites to expose their own WebMCPs. For anyone building AI agents or websites that want agents to use them, this is definitely something to have on your radar.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The shift from agents guessing their way through the web to websites speaking the agent&#39;s language directly? That&#39;s the kind of change that makes everything else possible.&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/6953280822562933508/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/6953280822562933508?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/6953280822562933508'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/6953280822562933508'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2026/02/google-chromes-webmcp-is-about-to.html' title='Google Chrome&#39;s WebMCP is About to Change How AI Agents Browse the Web'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://img.youtube.com/vi/35oWt7u2b-g/default.jpg" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-6536922179224630976</id><published>2026-02-08T11:50:00.001+08:00</published><updated>2026-02-08T11:50:59.328+08:00</updated><title type='text'>Claude Opus 4.6 Fast Mode — Up to 2.5x Faster Responses at Premium Pricing</title><content type='html'>&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Anthropic launched Fast Mode for Claude Opus 4.6 in research preview. The feature delivers up to 2.5x higher output tokens per second from the same model at a higher cost per token.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Fast Mode is available now in Claude Code for users with extra usage enabled and through a waitlist for API access. The feature is also rolling out to GitHub Copilot, Cursor, and other platforms.&lt;/p&gt;&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;How Fast Mode Works&lt;/h2&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Fast Mode is not a different model. It uses the same Opus 4.6 with a different API configuration that prioritizes speed over cost efficiency. You get identical quality and capabilities, just faster responses.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The speed improvement focuses on output tokens per second, not time to first token. The same model weights and behavior remain unchanged.&lt;/p&gt;&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;Accessing Fast Mode&lt;/h2&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;In Claude Code, toggle Fast Mode on or off by typing &lt;code class=&quot;bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]&quot;&gt;/fast&lt;/code&gt; in the CLI or VS Code extension. You can also enable it in your user settings file by setting &lt;code class=&quot;bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]&quot;&gt;&quot;fastMode&quot;: true&lt;/code&gt;. Fast Mode persists across sessions.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;When enabled, Claude Code automatically switches to Opus 4.6 if you&#39;re on a different model. A small ↯ icon appears next to the prompt while Fast Mode is active.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;For API users, set &lt;code class=&quot;bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]&quot;&gt;speed: &quot;fast&quot;&lt;/code&gt; in your API request to enable Fast Mode. The feature is currently in limited research preview with waitlist access.&lt;/p&gt;&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;Pricing and Availability&lt;/h2&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Fast Mode pricing starts at $30 per million input tokens and $150 per million output tokens. This is 6x the standard Opus pricing of $5 per million input and $25 per million output.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;A 50% discount is available for all plans until February 16, 2026, bringing the cost to 3x standard pricing during the discount period.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Fast Mode usage is billed directly to extra usage, even if you have remaining usage on your plan. Fast Mode tokens do not count against your plan&#39;s included usage.&lt;/p&gt;&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;Requirements and Limitations&lt;/h2&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Fast Mode requires extra usage enabled on your account. For individual accounts, enable this in Console billing settings. For Teams and Enterprise, an admin must enable both extra usage and Fast Mode for the organization.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Fast Mode is not available on third-party cloud providers including Amazon Bedrock, Google Vertex AI, or Microsoft Azure Foundry. It&#39;s only available through the Anthropic Console API and Claude subscription plans using extra usage.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Fast Mode has separate rate limits from standard Opus 4.6. When you hit the rate limit or run out of extra usage credits, Fast Mode automatically falls back to standard Opus 4.6.&lt;/p&gt;&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;When to Use Fast Mode&lt;/h2&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Fast Mode works best for interactive workflows where speed matters more than cost. Use it for rapid iteration, live debugging, or real-time agent interactions.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Toggle it off when cost efficiency is more important than latency. You can combine Fast Mode with lower effort levels for maximum speed on straightforward tasks.&lt;/p&gt;&lt;p&gt;



















&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;For API users, note that switching between fast and standard speed invalidates the prompt cache. Requests at different speeds do not share cached prefixes.&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/6536922179224630976/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/6536922179224630976?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/6536922179224630976'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/6536922179224630976'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2026/02/claude-opus-46-fast-mode-up-to-25x.html' title='Claude Opus 4.6 Fast Mode — Up to 2.5x Faster Responses at Premium Pricing'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-7326809917051569052</id><published>2026-02-05T16:47:00.002+08:00</published><updated>2026-02-05T16:47:21.249+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Google"/><category scheme="http://www.blogger.com/atom/ns#" term="Google DeepMind"/><title type='text'>Google DeepMind&#39;s Evo-Memory Redefines AI Agent Memory — Cutting Task Steps by 50% Without Retraining</title><content type='html'>&lt;p&gt;&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEiwQkyyCD7SQGK0JzKKyVUpHTLnv_dBd_mAwVe6zlsIYZadBSuyyTcARWtnZycnzJbrTo6ueus2VQVXjrTAAb4BdEVvjxpX5OeHmXyLjt0cXkGBhqN41wBJjN-DM5m3cscSvIBEaFVc2LKiuXi-K04vkd8y2WSzxqtdM7WjPKa966e59jzm2M2bJVxbkwXP&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img alt=&quot;&quot; data-original-height=&quot;761&quot; data-original-width=&quot;997&quot; height=&quot;489&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEiwQkyyCD7SQGK0JzKKyVUpHTLnv_dBd_mAwVe6zlsIYZadBSuyyTcARWtnZycnzJbrTo6ueus2VQVXjrTAAb4BdEVvjxpX5OeHmXyLjt0cXkGBhqN41wBJjN-DM5m3cscSvIBEaFVc2LKiuXi-K04vkd8y2WSzxqtdM7WjPKa966e59jzm2M2bJVxbkwXP=w640-h489&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;The gap between how AI agents &lt;em&gt;remember&lt;/em&gt; and how they actually &lt;em&gt;learn&lt;/em&gt; from experience has long been a fundamental limitation. While chatbots can recall what you said in a previous conversation, they typically can&#39;t leverage that experience to solve similar problems faster or smarter. A new research collaboration between Google DeepMind and the University of Illinois Urbana-Champaign proposes a solution: &quot;Test-Time Evolution&quot; — where agents actively Search, Synthesize, and Evolve their memory after every interaction.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;This isn&#39;t just another benchmark paper. Evo-Memory introduces a comprehensive streaming evaluation framework alongside ReMem, an action-think-memory refine pipeline that fundamentally changes how we think about agent memory. The results are striking: active memory refinement reduced task completion steps by roughly 50% on ALFWorld (from 22.6 steps down to 11.5), and smaller models like Gemini Flash achieved gains that often rivaled larger static models. The success hinges not on storing more information, but on the agent&#39;s ability to refine and delete irrelevant experiences.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;For anyone building AI agents, personal assistants, or autonomous systems, this research signals a shift in how we should approach memory architecture. Current RAG systems and long-context models excel at passive retrieval, but they don&#39;t learn from what worked and what didn&#39;t. Evo-Memory closes that gap by treating memory as something that evolves during deployment rather than remaining frozen after training.&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;The Core Problem: Remembering vs. Learning&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The paper identifies a critical distinction that often gets overlooked. Current LLM memory systems focus on &lt;em&gt;conversational recall&lt;/em&gt; — retrieving facts from dialogue history to answer queries. But this misses the more valuable capability of &lt;em&gt;experience reuse&lt;/em&gt;, where agents abstract reasoning strategies from past tasks to improve future performance.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Think about it this way: if you ask a math tutor the same type of problem twice, they shouldn&#39;t solve it from scratch the second time. They should recognize the pattern and apply the successful strategy faster. Yet most AI agents today do exactly that — they recall context but fail to adapt across sessions. The researchers demonstrate this limitation persists even in sophisticated systems using retrieval-augmented generation, hierarchical memory, and workflow-based approaches.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The benchmark transforms static datasets into streaming task sequences, explicitly testing whether LLMs can accumulate knowledge and refine strategies during deployment. This reframing from isolated task evaluation to continuous adaptation assessment reveals significant weaknesses in current memory architectures.&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;ReMem: The Think-Act-Refine Loop&lt;/h2&gt;&lt;div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEjWgIW-zTpaPv7vhJU8EIa2g_Go0fmGSWyUMyT8N7rH0O6_7Ng_oYx06fB_XlkISEsjxYaBZ9ImU1Uo8NaKAF1OmcwfSOpQX7de_5ZC4HD1DQu0im5cM0XlbjGmo3SYrZOholqlYI96Hw4pJLC5yXmaAUOD2eEN1RWbxAaQguqz0vYqgOQ-kxT-KImCUDk6&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img alt=&quot;&quot; data-original-height=&quot;1164&quot; data-original-width=&quot;1892&quot; height=&quot;394&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEjWgIW-zTpaPv7vhJU8EIa2g_Go0fmGSWyUMyT8N7rH0O6_7Ng_oYx06fB_XlkISEsjxYaBZ9ImU1Uo8NaKAF1OmcwfSOpQX7de_5ZC4HD1DQu0im5cM0XlbjGmo3SYrZOholqlYI96Hw4pJLC5yXmaAUOD2eEN1RWbxAaQguqz0vYqgOQ-kxT-KImCUDk6=w640-h394&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The proposed solution introduces a three-operation framework that goes beyond traditional ReAct-style agents. At each step, the agent chooses between Think (internal reasoning traces), Act (execute an operation or output a response), and Refine (meta-reasoning over memory to exploit useful experiences, prune noise, and reorganize stored knowledge).&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;This creates what the researchers describe as a Markov decision process where memory becomes an adaptive component that interacts with reasoning in real time rather than remaining passive context. The agent can loop between Think and Refine arbitrarily before committing to an action, forming a lightweight but powerful paradigm for continual adaptation.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;A concrete example from the paper: when solving a household task like &quot;put a hot apple in the fridge,&quot; the ReMem agent thinks about needing a heat source, searches memory for relevant experiences with microwaves, prunes an obsolete entry about stoves, executes the microwave action, then creates a new memory entry capturing the successful &quot;hot→fridge = cooldown&quot; strategy. This completed in 9 steps versus 19 for vanilla ReAct.&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;Benchmark Results That Challenge Assumptions&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The research evaluated over ten representative memory modules across 10 diverse datasets spanning embodied reasoning (ALFWorld, BabyAI, PDDL, ScienceWorld) and single-turn tasks (AIME-24/25, GPQA, MMLU-Pro, ToolBench). The results reveal several important findings.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;ReMem on Claude 3.7 Sonnet achieved 0.92 success rate and 0.96 progress on ALFWorld, 0.83 success and 0.95 progress on PDDL planning tasks. On Gemini 2.5 Flash, the average success reached 0.50 with 0.64 progress, consistently outperforming history baselines and ReAct-style approaches across all four multi-turn environments.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Perhaps most notably, the performance gains correlate strongly with task similarity within datasets. The researchers found a Pearson correlation of 0.72 on Gemini 2.5 Flash and 0.56 on Claude 3.7 Sonnet between ReMem&#39;s improvement margin and within-dataset coherence. Structured domains like PDDL and ALFWorld with higher intra-task similarity showed larger improvements, while diverse datasets like AIME-25 or GPQA showed smaller gains.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Step efficiency improvements proved equally significant. In ALFWorld, average steps to complete tasks dropped from 22.6 for history baselines to 11.5 for ReMem. ScienceWorld showed similar gains, going from 20.5 steps down to 14.0. The researchers note this represents a direct compute-cost win without any fine-tuning.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEipjnXScFWyOwPoKf1NlMgaCHVDSHkhwcAodagDVvfG0ex5aQzGvvxOMROMMLxbaGK6Fk9QVYXGODHHmE94dkoOSaRPcVUgyk-fSYW_UTWh88QbbRSk_JwwwssVIY7aOiRBeoBhWoi8ph5AbyEq6H0ujcQZhI3jV1VyWk_0ysTf_m8_zar3zHgMwgy_f5xD&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img alt=&quot;&quot; data-original-height=&quot;309&quot; data-original-width=&quot;640&quot; height=&quot;309&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEipjnXScFWyOwPoKf1NlMgaCHVDSHkhwcAodagDVvfG0ex5aQzGvvxOMROMMLxbaGK6Fk9QVYXGODHHmE94dkoOSaRPcVUgyk-fSYW_UTWh88QbbRSk_JwwwssVIY7aOiRBeoBhWoi8ph5AbyEq6H0ujcQZhI3jV1VyWk_0ysTf_m8_zar3zHgMwgy_f5xD&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;The Surprising Power of Simple Approaches&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;One unexpected finding deserves attention: ExpRAG, a simple retrieval-based baseline, outperformed several more complex designs. This baseline stores each task interaction as structured experience text and retrieves similar experiences for new tasks using basic embedding similarity.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Even ExpRecent, which simply maintains condensed traces of recent task trajectories, performed competitively. This suggests that explicit task-level utilization during test-time evolution represents a promising and underexplored direction, and that architectural complexity isn&#39;t always the answer.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The research also tested how agents handle both successful and failed experiences in memory. Baseline methods experienced clear performance drops when exposed to unfiltered failures, indicating that naive memory accumulation introduces noise. ReMem remained robust by actively refining stored experiences, achieving the highest overall success rates under both Claude and Gemini backbones when fed mixed feedback.&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;Why This Matters for AI Development&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The implications extend beyond benchmark scores. Evo-Memory demonstrates that test-time evolution — the ability to retrieve, integrate, and update memory continuously during deployment — represents a viable path to more capable AI agents without additional training.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Smaller models particularly benefit from self-evolving memory, suggesting this approach could democratize access to more sophisticated agent capabilities. The correlation between task similarity and memory effectiveness provides practical guidance: domains with structured, recurring task patterns stand to gain the most from implementing these techniques.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEiQFuYuJee-z2EOLLhgoqqxrDomA3TbcD99xggN-hBV4WwJMaZ7pWD7sYcCeTX4l0Pvow8q89DsbbplkUTFO4h0a4VYWdYyQOGSINO5DAK-Ty33wsgcB9577NTQitft8i4v47RV9U2Cq1e11AaLxhbjFLDdll-4YKuTrwtuNci5T-a8YcPjqC39SZtLTQ2o&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img alt=&quot;&quot; data-original-height=&quot;329&quot; data-original-width=&quot;640&quot; height=&quot;330&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEiQFuYuJee-z2EOLLhgoqqxrDomA3TbcD99xggN-hBV4WwJMaZ7pWD7sYcCeTX4l0Pvow8q89DsbbplkUTFO4h0a4VYWdYyQOGSINO5DAK-Ty33wsgcB9577NTQitft8i4v47RV9U2Cq1e11AaLxhbjFLDdll-4YKuTrwtuNci5T-a8YcPjqC39SZtLTQ2o=w640-h330&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;For developers building production AI systems, the key insight is that memory architecture matters as much as model capability. Simply increasing context windows or adding retrieval doesn&#39;t capture the adaptive, self-improving behavior that humans naturally exhibit when learning from experience.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The researchers have indicated plans to release all code and configurations for reproducibility, making this a practical resource for the AI community rather than just a research contribution. As we move toward agents that operate autonomously over extended periods, the shift from static recall to dynamic evolution may prove foundational for the next generation of AI systems.&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/7326809917051569052/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/7326809917051569052?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/7326809917051569052'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/7326809917051569052'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2026/02/google-deepminds-evo-memory-redefines.html' title='Google DeepMind&#39;s Evo-Memory Redefines AI Agent Memory — Cutting Task Steps by 50% Without Retraining'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/a/AVvXsEiwQkyyCD7SQGK0JzKKyVUpHTLnv_dBd_mAwVe6zlsIYZadBSuyyTcARWtnZycnzJbrTo6ueus2VQVXjrTAAb4BdEVvjxpX5OeHmXyLjt0cXkGBhqN41wBJjN-DM5m3cscSvIBEaFVc2LKiuXi-K04vkd8y2WSzxqtdM7WjPKa966e59jzm2M2bJVxbkwXP=s72-w640-h489-c" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-8993418098212198123</id><published>2026-02-05T13:37:00.002+08:00</published><updated>2026-02-05T13:37:12.587+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="AI Research"/><category scheme="http://www.blogger.com/atom/ns#" term="research paper"/><title type='text'>PaperBanana: The AI That&#39;s Automating Academic Illustration (And It&#39;s Kind of Mind-Blowing)</title><content type='html'>&lt;p&gt;If you&#39;ve ever written a research paper, you know the pain: you&#39;ve done the hard work, written thousands of words explaining your groundbreaking methodology, and then... you need to create diagrams. Beautiful, publication-ready diagrams that somehow capture your complex ideas in a single visual. For many researchers, this becomes the most time-consuming part of the entire process.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Enter PaperBanana, a revolutionary framework from researchers at Peking University and Google Cloud AI Research that&#39;s tackling this exact bottleneck. And yes, they named it PaperBanana because even serious AI research deserves a smile.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEinDssNC_41FOKGXQgGwscP0D7N6i_TRQLY3uvoVW31M1XJJqgIT5JglOQTgecB_7hGsrSA_Te8a2ZRM5JnVrhggQMqm_SST7-5I69ReeeXKzGDF74az6isLu1nPR7WNpA1vfJSvUpv_ZEAV4Y7N0t92kaYsJ3ULPp82mG1IjVmosjJ23NNGQcfyn0gv1DS&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img alt=&quot;&quot; data-original-height=&quot;768&quot; data-original-width=&quot;1376&quot; height=&quot;357&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEinDssNC_41FOKGXQgGwscP0D7N6i_TRQLY3uvoVW31M1XJJqgIT5JglOQTgecB_7hGsrSA_Te8a2ZRM5JnVrhggQMqm_SST7-5I69ReeeXKzGDF74az6isLu1nPR7WNpA1vfJSvUpv_ZEAV4Y7N0t92kaYsJ3ULPp82mG1IjVmosjJ23NNGQcfyn0gv1DS&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;What Makes PaperBanana Special?&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Think of PaperBanana as your personal illustration team, but instead of humans, it&#39;s five specialized AI agents working together. Each agent has a specific role: the Retriever finds relevant reference examples from existing papers, the Planner translates your research context into detailed visual descriptions, the Stylist ensures everything looks professionally polished, the Visualizer creates the actual diagrams, and the Critic reviews and refines the output until it meets publication standards.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;This isn&#39;t just about slapping together some boxes and arrows. PaperBanana generates diagrams that are faithful to your research, concise enough to be readable, aesthetically pleasing, and sophisticated enough to appear in top-tier conferences like NeurIPS.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEioqpnTbSm_7KZ5bodzMReleK6ZxiubStXSMAkErDpt4KW3YmDPuLbYYjwmFZFhqgxkCE4Glgt15juL4CZ8oPSmOoHKvHSpuHKK7TkLTgRQKeoGD2lGzjhLRexG2fYarvsaC7Fu_PRrO1mUA1x70hykjCVC2pmLQygz_0u2nGeypaUnldbnofHUZA7MWKXm&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;img alt=&quot;&quot; data-original-height=&quot;1344&quot; data-original-width=&quot;3168&quot; height=&quot;272&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEioqpnTbSm_7KZ5bodzMReleK6ZxiubStXSMAkErDpt4KW3YmDPuLbYYjwmFZFhqgxkCE4Glgt15juL4CZ8oPSmOoHKvHSpuHKK7TkLTgRQKeoGD2lGzjhLRexG2fYarvsaC7Fu_PRrO1mUA1x70hykjCVC2pmLQygz_0u2nGeypaUnldbnofHUZA7MWKXm&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;PaperBanana&#39;s architecture: Five specialized AI agents collaborate to transform research content into publication-ready illustrations.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;The Secret Sauce: Reference-Driven Intelligence&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;What sets PaperBanana apart is its reference-driven approach. Instead of generating illustrations from scratch with no context, it learns from the visual language already established in academic publishing. The system analyzes methodology diagrams from recent NeurIPS papers, understanding not just what makes a diagram functional, but what makes it beautiful and publication-ready.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The results speak for themselves. In comprehensive testing against leading baselines, PaperBanana consistently outperformed competitors across all evaluation dimensions: faithfulness, conciseness, readability, and aesthetics. It&#39;s not just good—it&#39;s setting a new standard.&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;Beyond Methodology Diagrams&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;But here&#39;s where it gets even more interesting: PaperBanana doesn&#39;t just do methodology diagrams. It also generates high-quality statistical plots. The researchers tested both code-based and image generation approaches for creating visualizations, revealing fascinating trade-offs. Image generation creates more visually appealing plots, but code-based methods maintain better content fidelity. Understanding these nuances helps researchers choose the right approach for their needs.&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;The Benchmark That Changes Everything&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;To properly evaluate automated illustration generation, the team created PaperBananaBench—a rigorous benchmark comprising 292 test cases curated from NeurIPS 2025 publications. This benchmark captures the sophisticated aesthetics and diverse logical compositions of modern AI research, spanning multiple research domains and illustration styles.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The average source context contains over 3,000 words, proving that PaperBanana can handle the complexity of real research papers, not just simplified examples.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEjLLeBFAzDpa4nMIHLlo0Rx3Ml7D-DqSq3DMvQx7JP_j8fBs3vV3-WJ2a3sxbPMeFDDO1CsxhcrKs0ikUC07wHv6jcgi0leVFuBlVmRfIS-spboK0_hLmYg5IN0BhztOFt76j77bfRQr1bE3x6WppypjSgIjLPVKeUkHT-MjRcK9FCOcCf4gV9PGANvyu46&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;img alt=&quot;&quot; data-original-height=&quot;1458&quot; data-original-width=&quot;2988&quot; height=&quot;312&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEjLLeBFAzDpa4nMIHLlo0Rx3Ml7D-DqSq3DMvQx7JP_j8fBs3vV3-WJ2a3sxbPMeFDDO1CsxhcrKs0ikUC07wHv6jcgi0leVFuBlVmRfIS-spboK0_hLmYg5IN0BhztOFt76j77bfRQr1bE3x6WppypjSgIjLPVKeUkHT-MjRcK9FCOcCf4gV9PGANvyu46&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;PaperBananaBench statistics showing 292 test cases with average source context of 3,020 words per diagram.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEh0GO1-NNaXZaPByqd-jY9Pur4CWDp4FOijKv2lP3wlee29Jxq0xcgyGFVjuQp825eUaOqFR8KrB4iC4CLHDN0WpY5fa3RU0nVsFDoyzU7aWr5ZrG6t1UM3uAFAv47DJmxdBZw1z3lPaWTeRjmyhuQ7y1InTaHBf_YGNXofHjtn6DmSQRsYFCYigC-hcVcJ&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;img alt=&quot;&quot; data-original-height=&quot;622&quot; data-original-width=&quot;1868&quot; height=&quot;213&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEh0GO1-NNaXZaPByqd-jY9Pur4CWDp4FOijKv2lP3wlee29Jxq0xcgyGFVjuQp825eUaOqFR8KrB4iC4CLHDN0WpY5fa3RU0nVsFDoyzU7aWr5ZrG6t1UM3uAFAv47DJmxdBZw1z3lPaWTeRjmyhuQ7y1InTaHBf_YGNXofHjtn6DmSQRsYFCYigC-hcVcJ&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;PaperBanana consistently outperforms baselines across all evaluation dimensions: faithfulness, conciseness, readability, and aesthetics.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;Real-World Applications&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The practical applications extend beyond just generating new diagrams. PaperBanana can enhance the aesthetics of existing human-drawn diagrams, applying automatically summarized style guidelines to elevate visual quality. Imagine taking a rough sketch and having it instantly transformed into a polished, publication-ready illustration that maintains your original intent while looking professionally designed.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEiVHg3nMTl9yWIa-Ro6pZDU6y9b3y4d_b9e_yvHal67SuW5JIs-UHyBXKL3XjrVZsV6Ths5gjpBuos6yd_7emVWy4Dr0qvc0Rw1fC0R6CWBlkZwxwh_X1TkefozsRLsJJi6gxvv3AL-QnLJoixpNFoH5Ms0hTSBy-0Y863xgqtYZo2cnZf66q-jjrVBv_y3&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;img alt=&quot;&quot; data-original-height=&quot;1814&quot; data-original-width=&quot;1814&quot; height=&quot;480&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEiVHg3nMTl9yWIa-Ro6pZDU6y9b3y4d_b9e_yvHal67SuW5JIs-UHyBXKL3XjrVZsV6Ths5gjpBuos6yd_7emVWy4Dr0qvc0Rw1fC0R6CWBlkZwxwh_X1TkefozsRLsJJi6gxvv3AL-QnLJoixpNFoH5Ms0hTSBy-0Y863xgqtYZo2cnZf66q-jjrVBv_y3&quot; width=&quot;480&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Before and after: PaperBanana transforms verbose, outdated diagrams into concise, aesthetically modern illustrations while maintaining accuracy.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;The Road Ahead&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Of course, no system is perfect. The researchers openly acknowledge failure modes, particularly around connection errors in complex diagrams. But this transparency is refreshing—they&#39;re not claiming to have solved everything, just to have made a significant leap forward.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;For AI researchers, content creators, and anyone involved in scientific communication, PaperBanana represents something bigger than just a tool. It&#39;s a glimpse into a future where the tedious parts of research communication are automated, freeing scientists to focus on what they do best: pushing the boundaries of knowledge.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The code is available on GitHub, the paper is on arXiv, and the framework is ready to explore. As AI continues to augment scientific workflows, tools like PaperBanana remind us that automation isn&#39;t about replacing human creativity—it&#39;s about amplifying it, one beautifully generated diagram at a time.&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/8993418098212198123/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/8993418098212198123?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/8993418098212198123'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/8993418098212198123'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2026/02/paperbanana-ai-thats-automating.html' title='PaperBanana: The AI That&#39;s Automating Academic Illustration (And It&#39;s Kind of Mind-Blowing)'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/a/AVvXsEinDssNC_41FOKGXQgGwscP0D7N6i_TRQLY3uvoVW31M1XJJqgIT5JglOQTgecB_7hGsrSA_Te8a2ZRM5JnVrhggQMqm_SST7-5I69ReeeXKzGDF74az6isLu1nPR7WNpA1vfJSvUpv_ZEAV4Y7N0t92kaYsJ3ULPp82mG1IjVmosjJ23NNGQcfyn0gv1DS=s72-c" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-3880651063871096930</id><published>2026-02-04T15:14:00.001+08:00</published><updated>2026-02-04T15:16:14.813+08:00</updated><title type='text'>Qwen Just Dropped a Coding AI That Runs on Your Laptop — And It&#39;s Competing with Models 20x Larger</title><content type='html'>&lt;p&gt;Okay, I need to tell you about something that just happened in the AI coding world that has me genuinely excited. The Qwen team just released &lt;strong&gt;Qwen3-Coder-Next&lt;/strong&gt;, and if you&#39;ve been following the whole &quot;local AI coding assistant&quot; conversation, this one&#39;s a big deal.&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEhoRKsMAF08sCn5VITXPpY81k1wj2G-FdNQdrCEER4Do0_w17r99-1BmLOAwDJSQ7tfTmMeQ_pRHaR20EH-_wMKth0b2uXa8b_jGZIJVMVhdTPGsTgfTmnVjbrGIepWbicg5k8n1JbLZMj6Gi5TwfvBH6XuLFAsI4sPA30Xm4dyR7bUenBRbwwajUo5izid&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img alt=&quot;&quot; data-original-height=&quot;768&quot; data-original-width=&quot;1376&quot; height=&quot;358&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEhoRKsMAF08sCn5VITXPpY81k1wj2G-FdNQdrCEER4Do0_w17r99-1BmLOAwDJSQ7tfTmMeQ_pRHaR20EH-_wMKth0b2uXa8b_jGZIJVMVhdTPGsTgfTmnVjbrGIepWbicg5k8n1JbLZMj6Gi5TwfvBH6XuLFAsI4sPA30Xm4dyR7bUenBRbwwajUo5izid=w640-h358&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Here&#39;s why: This model has &lt;strong&gt;80 billion parameters&lt;/strong&gt; but only uses &lt;strong&gt;3 billion at a time&lt;/strong&gt;. And somehow, it&#39;s matching the performance of models with 10-20x more active parameters. Yeah, you read that right.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Let me break down what this actually means for people like us who want powerful AI coding tools that don&#39;t require sending all our code to the cloud.&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;What Makes This Different from Other Coding Models?&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Most coding AI models you&#39;ve heard about—like GitHub Copilot, ChatGPT for coding, or Claude—run on massive cloud servers. They&#39;re great, but you&#39;re always dependent on an internet connection, you&#39;re sharing your code with a third party, and there are costs involved.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Qwen3-Coder-Next is built specifically for &lt;strong&gt;local development and coding agents&lt;/strong&gt;. That means it&#39;s designed to run on your own machine (yes, even a beefy laptop or desktop), keep your code private, and work with tools like Claude Code, Cline, and other IDE integrations.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;But here&#39;s where it gets interesting: unlike traditional models that need all their billions of parameters active to work well, Qwen3-Coder-Next uses something called a &lt;strong&gt;Mixture-of-Experts (MoE) architecture with sparse activation&lt;/strong&gt;.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Think of it like having a team of 80 billion specialists, but for any given task, you only need to consult 3 billion of them. This makes it incredibly efficient—you get the intelligence of a huge model with the speed and memory requirements of a much smaller one.&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;The Architecture: Hybrid Attention That Actually Makes Sense&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Now, I know &quot;hybrid attention&quot; and &quot;sparse MoE&quot; sound like buzzwords, but stick with me because this is actually pretty clever.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Traditional transformer models have a problem: as you give them more context (like a large codebase), the computational cost grows exponentially. It&#39;s called the &quot;quadratic scaling problem,&quot; and it&#39;s why most models struggle when you try to feed them an entire repository&#39;s worth of code.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Qwen3-Coder-Next solves this by combining three different types of attention mechanisms:&lt;/p&gt;
&lt;ul class=&quot;[li_&amp;amp;]:mb-0 [li_&amp;amp;]:mt-1 [li_&amp;amp;]:gap-1 [&amp;amp;:not(:last-child)_ul]:pb-1 [&amp;amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3&quot;&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;Gated DeltaNet&lt;/strong&gt; (for efficient linear attention)&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;Gated Attention&lt;/strong&gt; (for focused reasoning)&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;Mixture-of-Experts layers&lt;/strong&gt; (for specialized knowledge)&lt;/li&gt;
&lt;/ul&gt;&lt;div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEjrH_DyDj2GWKicf-EP3PeUD7QkGL5bmvmITuHGEGEl95clhjxsd2p6wEsQK5Rb9iLIc9NjXhUi1Ye7LyJkDwWdPNceCc3P2bx1_AC2HVtv-ACDQnRBoBVNeTfz08ARsjt60QE8Kvc6uHaPS_Mwj4JTUF1RnzP7RUWqyYInoppcofPIibEBB0HYZTiPW_2k&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img alt=&quot;&quot; data-original-height=&quot;768&quot; data-original-width=&quot;1376&quot; height=&quot;358&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEjrH_DyDj2GWKicf-EP3PeUD7QkGL5bmvmITuHGEGEl95clhjxsd2p6wEsQK5Rb9iLIc9NjXhUi1Ye7LyJkDwWdPNceCc3P2bx1_AC2HVtv-ACDQnRBoBVNeTfz08ARsjt60QE8Kvc6uHaPS_Mwj4JTUF1RnzP7RUWqyYInoppcofPIibEBB0HYZTiPW_2k=w640-h358&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;The model has 48 layers total, and the layout repeats a pattern: three DeltaNet-MoE blocks followed by one Attention-MoE block. Each MoE layer has &lt;strong&gt;512 expert networks&lt;/strong&gt;, but only &lt;strong&gt;10 experts plus 1 shared expert&lt;/strong&gt; activate for each token you process.&lt;/div&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;What this means practically: You can give this model a &lt;strong&gt;256,000 token context window&lt;/strong&gt; (that&#39;s roughly 200,000 words or a massive codebase), and it won&#39;t choke. It&#39;ll keep reasoning through your entire project without slowing to a crawl.&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;Trained Like an Actual Coding Agent, Not Just a Code Generator&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Here&#39;s where Qwen3-Coder-Next really stands out from other coding models: how it was trained.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Most coding AI is trained on static code snippets—just reading code and learning patterns. Qwen3-Coder-Next went through what the team calls &lt;strong&gt;&quot;agentic training at scale.&quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;They created &lt;strong&gt;800,000 executable coding tasks&lt;/strong&gt; with real environments. These weren&#39;t simple &quot;write a function&quot; exercises. They were actual bug-fixing scenarios pulled from GitHub, complete with test suites, containerized environments, and the ability to execute code and see if it works.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;During training, the model:&lt;/p&gt;
&lt;ol class=&quot;[li_&amp;amp;]:mb-0 [li_&amp;amp;]:mt-1 [li_&amp;amp;]:gap-1 [&amp;amp;:not(:last-child)_ul]:pb-1 [&amp;amp;:not(:last-child)_ol]:pb-1 list-decimal flex flex-col gap-1 pl-8 mb-3&quot;&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;Receives a coding task&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;Writes code to solve it&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;Runs the code in a real environment&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;Gets feedback if it fails&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;Learns to recover from errors and try again&lt;/li&gt;
&lt;/ol&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;This is reinforcement learning applied to real-world coding workflows. The model learned to &lt;strong&gt;plan, use tools, run tests, and recover from failures&lt;/strong&gt;—not just spit out code and hope for the best.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The team even trained specialized &quot;expert models&quot; for specific domains:&lt;/p&gt;
&lt;ul class=&quot;[li_&amp;amp;]:mb-0 [li_&amp;amp;]:mt-1 [li_&amp;amp;]:gap-1 [&amp;amp;:not(:last-child)_ul]:pb-1 [&amp;amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3&quot;&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;A &lt;strong&gt;Web Development Expert&lt;/strong&gt; for full-stack UI work (tested by actually rendering pages in a browser)&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;A &lt;strong&gt;User Experience Expert&lt;/strong&gt; for CLI tool interactions across different frameworks&lt;/li&gt;
&lt;/ul&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;This training approach is why Qwen3-Coder-Next excels at &lt;strong&gt;long-horizon coding tasks&lt;/strong&gt;—the kind where you need to make multiple changes across several files, run tests, fix errors, and iterate until everything works.&lt;/p&gt;&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEhaMIq7jwBW8aHccgLotb1wpbaV84a5R0ng_kkFifBio4rEWBb-1wPpAJeAZ9VAul4FtC9wEZNwV0Yh_YohFL-qdaKeh4BGcl2V9OVg-QcPgJ0mkBRboY6-H26aTVv4RKjWnf8fsooKCrmbm3MXVfvtvmztp2X-Wdx5BLgPcM-cDIItukRzowX4I1wymzhH&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img alt=&quot;&quot; data-original-height=&quot;768&quot; data-original-width=&quot;1376&quot; height=&quot;358&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEhaMIq7jwBW8aHccgLotb1wpbaV84a5R0ng_kkFifBio4rEWBb-1wPpAJeAZ9VAul4FtC9wEZNwV0Yh_YohFL-qdaKeh4BGcl2V9OVg-QcPgJ0mkBRboY6-H26aTVv4RKjWnf8fsooKCrmbm3MXVfvtvmztp2X-Wdx5BLgPcM-cDIItukRzowX4I1wymzhH=w640-h358&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;The Benchmarks: Punching Way Above Its Weight Class&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Let me show you where this gets really impressive. On &lt;strong&gt;SWE-Bench Verified&lt;/strong&gt; (a benchmark that tests how well models can solve real GitHub issues), here&#39;s how Qwen3-Coder-Next compares:&lt;/p&gt;
&lt;ul class=&quot;[li_&amp;amp;]:mb-0 [li_&amp;amp;]:mt-1 [li_&amp;amp;]:gap-1 [&amp;amp;:not(:last-child)_ul]:pb-1 [&amp;amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3&quot;&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;Qwen3-Coder-Next&lt;/strong&gt; (3B active params): &lt;strong&gt;70.6%&lt;/strong&gt;&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;DeepSeek-V3.2&lt;/strong&gt; (671B total params): &lt;strong&gt;70.2%&lt;/strong&gt;&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;GLM-4.7&lt;/strong&gt; (358B total params): &lt;strong&gt;74.2%&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div&gt;&lt;b&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEjEa9k7p-4h_Py4nYgWmoz1slsuDgKdoDDvqqiIrnavWI1A5Z5ZQD55NflE3fe2UJNXcB4JZFqgHc_k508F-nk-rPUxxTJKAJIce-7MsvFh5y4VURIGT0XrMv3vTAI5ceAl-PAZgIaSLOhHGcbfUdoilVUwx40DnZfSsQ4giHUR4ePxC1midGG958j9K963&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img alt=&quot;&quot; data-original-height=&quot;768&quot; data-original-width=&quot;1376&quot; height=&quot;357&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEjEa9k7p-4h_Py4nYgWmoz1slsuDgKdoDDvqqiIrnavWI1A5Z5ZQD55NflE3fe2UJNXcB4JZFqgHc_k508F-nk-rPUxxTJKAJIce-7MsvFh5y4VURIGT0XrMv3vTAI5ceAl-PAZgIaSLOhHGcbfUdoilVUwx40DnZfSsQ4giHUR4ePxC1midGG958j9K963=w640-h357&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;So a model with only 3 billion active parameters is matching or beating models with hundreds of billions of active parameters. That&#39;s insane efficiency.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;On &lt;strong&gt;SWE-Bench Pro&lt;/strong&gt; (an even harder benchmark):&lt;/p&gt;
&lt;ul class=&quot;[li_&amp;amp;]:mb-0 [li_&amp;amp;]:mt-1 [li_&amp;amp;]:gap-1 [&amp;amp;:not(:last-child)_ul]:pb-1 [&amp;amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3&quot;&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;Qwen3-Coder-Next&lt;/strong&gt;: &lt;strong&gt;44.3%&lt;/strong&gt;&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;DeepSeek-V3.2&lt;/strong&gt;: &lt;strong&gt;40.9%&lt;/strong&gt;&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;GLM-4.7&lt;/strong&gt;: &lt;strong&gt;40.6%&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;And on &lt;strong&gt;Terminal-Bench 2.0&lt;/strong&gt; (testing CLI agent capabilities) and &lt;strong&gt;Aider&lt;/strong&gt; (a coding assistant benchmark), it continues to perform at the level of much larger models.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The takeaway: You&#39;re getting elite coding assistant performance in a package that can actually run locally on consumer hardware.&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;You Can Actually Run This on Your Own Machine&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;This is where things get practical. The Qwen team didn&#39;t just release model weights and say &quot;good luck.&quot; They&#39;ve made this genuinely deployable for real people.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;&lt;strong&gt;For Server Deployment:&lt;/strong&gt;&lt;/p&gt;
&lt;ul class=&quot;[li_&amp;amp;]:mb-0 [li_&amp;amp;]:mt-1 [li_&amp;amp;]:gap-1 [&amp;amp;:not(:last-child)_ul]:pb-1 [&amp;amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3&quot;&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;Works with &lt;strong&gt;SGLang&lt;/strong&gt; and &lt;strong&gt;vLLM&lt;/strong&gt; (industry-standard inference engines)&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;Supports OpenAI-compatible API endpoints&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;Can handle the full 256K token context window&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;Requires multiple GPUs for full performance (2-4 GPUs with tensor parallelism)&lt;/li&gt;
&lt;/ul&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;&lt;strong&gt;For Local Deployment:&lt;/strong&gt;&lt;/p&gt;
&lt;ul class=&quot;[li_&amp;amp;]:mb-0 [li_&amp;amp;]:mt-1 [li_&amp;amp;]:gap-1 [&amp;amp;:not(:last-child)_ul]:pb-1 [&amp;amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3&quot;&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;Unsloth&lt;/strong&gt; provides GGUF quantizations (compressed versions)&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;4-bit quantization&lt;/strong&gt;: Needs about &lt;strong&gt;46 GB of RAM&lt;/strong&gt;&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;8-bit quantization&lt;/strong&gt;: Needs about &lt;strong&gt;85 GB of RAM&lt;/strong&gt;&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;Works with &lt;strong&gt;llama.cpp&lt;/strong&gt; and &lt;strong&gt;llama-server&lt;/strong&gt;&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;Compatible with Apple Silicon unified memory (yes, you can run this on an M-series Mac)&lt;/li&gt;
&lt;/ul&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The Unsloth team has even created guides showing how to plug Qwen3-Coder-Next into frameworks that mimic OpenAI Codex and Claude Code, but running entirely on your local machine.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;For most modern development machines—especially those with 64GB+ of unified memory or dedicated GPUs—this is totally feasible. You can have a production-grade coding assistant running locally.&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;What You Can Actually Do With It&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Qwen3-Coder-Next isn&#39;t just for autocompleting code. It&#39;s designed for &lt;strong&gt;agentic workflows&lt;/strong&gt;, meaning it can:&lt;/p&gt;
&lt;ul class=&quot;[li_&amp;amp;]:mb-0 [li_&amp;amp;]:mt-1 [li_&amp;amp;]:gap-1 [&amp;amp;:not(:last-child)_ul]:pb-1 [&amp;amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3&quot;&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;Understand entire codebases&lt;/strong&gt; (thanks to the 256K context window)&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;Plan multi-step refactoring tasks&lt;/strong&gt; across multiple files&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;Execute code and interpret results&lt;/strong&gt;&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;Call external tools&lt;/strong&gt; (linters, formatters, test runners, debuggers)&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;Recover from errors&lt;/strong&gt; by analyzing stack traces and trying different approaches&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;Work with different IDE scaffolds&lt;/strong&gt; (Claude Code, Qwen Code, Cline, Kilo, etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;It supports tool calling natively, meaning it can interact with your development environment like a junior developer would—running commands, reading outputs, and making decisions based on results.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;One important note: This model does &lt;strong&gt;NOT&lt;/strong&gt; use &quot;thinking&quot; mode (it doesn&#39;t generate &lt;code class=&quot;bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]&quot;&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags). It goes straight to action. This makes it more predictable for agent workflows where you want direct tool calls and responses.&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;Who Should Care About This?&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;&lt;strong&gt;If you&#39;re a developer who:&lt;/strong&gt;&lt;/p&gt;
&lt;ul class=&quot;[li_&amp;amp;]:mb-0 [li_&amp;amp;]:mt-1 [li_&amp;amp;]:gap-1 [&amp;amp;:not(:last-child)_ul]:pb-1 [&amp;amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3&quot;&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;Works with sensitive codebases that can&#39;t go to cloud APIs&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;Wants a powerful coding assistant without monthly subscription fees&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;Has a decent development machine (64GB+ RAM or multiple GPUs)&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;Prefers open-source tools over proprietary solutions&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;Wants to experiment with AI coding agents&lt;/li&gt;
&lt;/ul&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;&lt;strong&gt;Then Qwen3-Coder-Next is worth checking out.&lt;/strong&gt;&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;&lt;strong&gt;If you&#39;re a company that:&lt;/strong&gt;&lt;/p&gt;
&lt;ul class=&quot;[li_&amp;amp;]:mb-0 [li_&amp;amp;]:mt-1 [li_&amp;amp;]:gap-1 [&amp;amp;:not(:last-child)_ul]:pb-1 [&amp;amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3&quot;&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;Needs coding assistance for proprietary code&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;Wants to avoid data leaving your infrastructure&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;Has the compute resources to host models locally or on-premises&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;Wants to customize the model for specific frameworks or languages&lt;/li&gt;
&lt;/ul&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;&lt;strong&gt;This is a compelling option&lt;/strong&gt; under the Apache 2.0 license (meaning you can use it commercially without restrictions).&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;The Bigger Picture: Local AI Coding Assistants Are Here&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;What excites me most about Qwen3-Coder-Next isn&#39;t just the model itself—it&#39;s what it represents.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;For the past couple of years, the best AI coding tools have been cloud-only. You had to use GitHub Copilot, Claude, or ChatGPT, all of which require sending your code to external servers.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;But in just the past few weeks, we&#39;ve seen an explosion of local coding assistants:&lt;/p&gt;
&lt;ul class=&quot;[li_&amp;amp;]:mb-0 [li_&amp;amp;]:mt-1 [li_&amp;amp;]:gap-1 [&amp;amp;:not(:last-child)_ul]:pb-1 [&amp;amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3&quot;&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;Anthropic&#39;s Claude Code&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;OpenAI&#39;s Codex app&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;Various open-source frameworks like OpenClaw&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;And now Qwen3-Coder-Next&lt;/li&gt;
&lt;/ul&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The tech has matured to the point where you can have GPT-4-class coding assistance running entirely on your own hardware. For privacy-conscious developers and companies working on proprietary systems, this is huge.&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;How to Get Started&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;If you want to try Qwen3-Coder-Next:&lt;/p&gt;
&lt;ol class=&quot;[li_&amp;amp;]:mb-0 [li_&amp;amp;]:mt-1 [li_&amp;amp;]:gap-1 [&amp;amp;:not(:last-child)_ul]:pb-1 [&amp;amp;:not(:last-child)_ol]:pb-1 list-decimal flex flex-col gap-1 pl-8 mb-3&quot;&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;Check the model weights&lt;/strong&gt; on Hugging Face: Look for &lt;code class=&quot;bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]&quot;&gt;Qwen/Qwen3-Coder-Next&lt;/code&gt;&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;Read the technical report&lt;/strong&gt; on GitHub: &lt;a class=&quot;underline underline underline-offset-2 decoration-1 decoration-current/40 hover:decoration-current focus:decoration-current&quot; href=&quot;https://github.com/QwenLM/Qwen3-Coder&quot;&gt;QwenLM/Qwen3-Coder&lt;/a&gt;&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;Follow the Unsloth guide&lt;/strong&gt; for local deployment: &lt;a class=&quot;underline underline underline-offset-2 decoration-1 decoration-current/40 hover:decoration-current focus:decoration-current&quot; href=&quot;https://unsloth.ai/docs/models/qwen3-coder-next&quot;&gt;unsloth.ai/docs/models/qwen3-coder-next&lt;/a&gt;&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;Try it with your favorite agent framework&lt;/strong&gt;: Claude Code, Cline, or Qwen Code all support it&lt;/li&gt;
&lt;/ol&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;The setup isn&#39;t quite as simple as installing a VSCode extension (yet), but the documentation is solid, and the community is already building tooling around it.&lt;/p&gt;
&lt;h2 class=&quot;text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold&quot;&gt;Final Thoughts&lt;/h2&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;I think we&#39;re at an inflection point with AI coding tools. The cloud-based options are still excellent and will continue to improve, but now we have legitimate alternatives that run locally, respect privacy, and don&#39;t require subscriptions.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;Qwen3-Coder-Next proves that you don&#39;t need to activate hundreds of billions of parameters to get strong coding assistance. With clever architecture (sparse MoE with hybrid attention) and smart training (agentic training on executable tasks), you can build something powerful enough to rival the big proprietary models.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;For me, this opens up possibilities for experimentation, customization, and building coding tools that work the way I want them to—without worrying about API costs or data privacy.&lt;/p&gt;
&lt;p class=&quot;font-claude-response-body break-words whitespace-normal leading-[1.7]&quot;&gt;If you&#39;ve been curious about local AI coding assistants, now&#39;s the time to dive in. The tech is finally here.&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/3880651063871096930/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/3880651063871096930?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/3880651063871096930'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/3880651063871096930'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2026/02/qwen-just-dropped-coding-ai-that-runs.html' title='Qwen Just Dropped a Coding AI That Runs on Your Laptop — And It&#39;s Competing with Models 20x Larger'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/a/AVvXsEhoRKsMAF08sCn5VITXPpY81k1wj2G-FdNQdrCEER4Do0_w17r99-1BmLOAwDJSQ7tfTmMeQ_pRHaR20EH-_wMKth0b2uXa8b_jGZIJVMVhdTPGsTgfTmnVjbrGIepWbicg5k8n1JbLZMj6Gi5TwfvBH6XuLFAsI4sPA30Xm4dyR7bUenBRbwwajUo5izid=s72-w640-h358-c" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-933945640325056562</id><published>2025-09-12T20:00:00.001+08:00</published><updated>2025-09-12T20:00:00.114+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Anthropic Claude"/><category scheme="http://www.blogger.com/atom/ns#" term="ChatGPT comparison"/><category scheme="http://www.blogger.com/atom/ns#" term="document editing"/><category scheme="http://www.blogger.com/atom/ns#" term="Enterprise AI"/><category scheme="http://www.blogger.com/atom/ns#" term="Excel formulas"/><category scheme="http://www.blogger.com/atom/ns#" term="file creation"/><category scheme="http://www.blogger.com/atom/ns#" term="Google Gemini"/><category scheme="http://www.blogger.com/atom/ns#" term="PowerPoint slides"/><category scheme="http://www.blogger.com/atom/ns#" term="productivity AI"/><category scheme="http://www.blogger.com/atom/ns#" term="sandboxed compute"/><title type='text'>Claude’s new file creation tools vs. ChatGPT and Gemini: who’s ahead on real productivity</title><content type='html'>&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;h2 data-end=&quot;289&quot; data-start=&quot;264&quot;&gt;What Claude offers now&lt;/h2&gt;
&lt;p data-end=&quot;322&quot; data-start=&quot;291&quot;&gt;From Anthropic’s announcements:&lt;/p&gt;
&lt;ul data-end=&quot;1159&quot; data-start=&quot;324&quot;&gt;
&lt;li data-end=&quot;491&quot; data-start=&quot;324&quot;&gt;
&lt;p data-end=&quot;491&quot; data-start=&quot;326&quot;&gt;Creates and edits real files &lt;strong data-end=&quot;376&quot; data-start=&quot;355&quot;&gt;directly in chats&lt;/strong&gt; or the desktop app: Excel (.xlsx), Word (.docx), PowerPoint (.pptx), PDFs.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;684&quot; data-start=&quot;492&quot;&gt;
&lt;p data-end=&quot;684&quot; data-start=&quot;494&quot;&gt;Users can upload data or supply shared input, then ask Claude to build files from scratch (e.g. spreadsheets with formulas, documents or slide decks).&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;903&quot; data-start=&quot;685&quot;&gt;
&lt;p data-end=&quot;903&quot; data-start=&quot;687&quot;&gt;The outputs are downloadable, usable “ready-to-use” artifacts. Claude can also convert document formats (e.g. PDF→slides) and do statistical/analysis tasks within spreadsheets.&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;998&quot; data-start=&quot;904&quot;&gt;
&lt;p data-end=&quot;998&quot; data-start=&quot;906&quot;&gt;File size limits: up to &lt;strong data-end=&quot;939&quot; data-start=&quot;930&quot;&gt;30 MB&lt;/strong&gt; uploads/downloads.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;1159&quot; data-start=&quot;999&quot;&gt;
&lt;p data-end=&quot;1159&quot; data-start=&quot;1001&quot;&gt;The feature is currently a preview for certain paid plans (Max, Team, Enterprise), with Pro plans getting access “soon.”&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr data-end=&quot;1164&quot; data-start=&quot;1161&quot; /&gt;
&lt;h2 data-end=&quot;1213&quot; data-start=&quot;1166&quot;&gt;What ChatGPT currently supports (vs. Claude)&lt;/h2&gt;
&lt;p data-end=&quot;1236&quot; data-start=&quot;1215&quot;&gt;Based on public info:&lt;/p&gt;
&lt;ul data-end=&quot;2349&quot; data-start=&quot;1238&quot;&gt;
&lt;li data-end=&quot;1443&quot; data-start=&quot;1238&quot;&gt;
&lt;p data-end=&quot;1443&quot; data-start=&quot;1240&quot;&gt;&lt;strong data-end=&quot;1285&quot; data-start=&quot;1240&quot;&gt;File uploads &amp;amp; summarization / extraction&lt;/strong&gt;: ChatGPT can accept PDFs, presentations, plaintext documents, etc., and then respond to queries about their contents.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;1684&quot; data-start=&quot;1444&quot;&gt;
&lt;p data-end=&quot;1684&quot; data-start=&quot;1446&quot;&gt;&lt;strong data-end=&quot;1540&quot; data-start=&quot;1446&quot;&gt;Data analysis / code execution environment (“Code Interpreter” / “Advanced Data Analysis”)&lt;/strong&gt;: for spreadsheets or CSVs, you can upload, have it run code, do charts/visualizations, clean data, etc.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;1991&quot; data-start=&quot;1685&quot;&gt;
&lt;p data-end=&quot;1991&quot; data-start=&quot;1687&quot;&gt;&lt;strong data-end=&quot;1727&quot; data-start=&quot;1687&quot;&gt;File editing or direct file creation&lt;/strong&gt;: ChatGPT so far does &lt;em data-end=&quot;1754&quot; data-start=&quot;1749&quot;&gt;not&lt;/em&gt; create or modify Excel/Word/PPTX/PDF files as downloadable artifacts via a “create new file + edit” flow in chat (at least very broadly marketed). There are plugins and workflows, but not a core feature announced the way Claude’s was.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2349&quot; data-start=&quot;1993&quot;&gt;
&lt;p data-end=&quot;2349&quot; data-start=&quot;1995&quot;&gt;&lt;strong data-end=&quot;2015&quot; data-start=&quot;1995&quot;&gt;Canvas interface&lt;/strong&gt;: ChatGPT introduced “Canvas,” allowing inline editing of texts or code alongside chat—helpful for refining, rewriting, collaborating. But this is about editing text/code drafts in the interface, not necessarily generating formal document files with formatting and exporting to PPTX, XLSX, etc.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr data-end=&quot;2354&quot; data-start=&quot;2351&quot; /&gt;
&lt;h2 data-end=&quot;2398&quot; data-start=&quot;2356&quot;&gt;What we know less about Gemini (Google)&lt;/h2&gt;
&lt;ul data-end=&quot;3149&quot; data-start=&quot;2400&quot;&gt;
&lt;li data-end=&quot;2871&quot; data-start=&quot;2400&quot;&gt;
&lt;p data-end=&quot;2871&quot; data-start=&quot;2402&quot;&gt;Public info is less detailed for Gemini’s ability to &lt;em data-end=&quot;2484&quot; data-start=&quot;2455&quot;&gt;generate downloadable files&lt;/em&gt; like PowerPoints, spreadsheets with formulas, etc. Gemini &lt;em data-end=&quot;2548&quot; data-start=&quot;2543&quot;&gt;can&lt;/em&gt; export Deep Research reports as Google Docs, which implies some document generation + formatting functionality. But whether it handles real &lt;code data-end=&quot;2696&quot; data-start=&quot;2689&quot;&gt;.xlsx&lt;/code&gt; spreadsheets or retains formula logic is less clear. (This comes from secondary sources referencing export of reports as Google Docs.)&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;3149&quot; data-start=&quot;2873&quot;&gt;
&lt;p data-end=&quot;3149&quot; data-start=&quot;2875&quot;&gt;Gemini does well with research-style reports, text generation, multimedia input/output; but direct file editing workflows (upload file, edit content, download in formatted artifact) are not obviously at parity with Claude’s newly announced capability as of now (publicly).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr data-end=&quot;3154&quot; data-start=&quot;3151&quot; /&gt;
&lt;h2 data-end=&quot;3188&quot; data-start=&quot;3156&quot;&gt;Side-by-side strengths &amp;amp; gaps&lt;/h2&gt;
&lt;div class=&quot;_tableContainer_1rjym_1&quot;&gt;&lt;div class=&quot;group w-fit _tableWrapper_1rjym_13 flex flex-col-reverse&quot; tabindex=&quot;-1&quot;&gt;&lt;table class=&quot;w-fit min-w-(--thread-content-width)&quot; data-end=&quot;5447&quot; data-start=&quot;3190&quot;&gt;&lt;thead data-end=&quot;3299&quot; data-start=&quot;3190&quot;&gt;&lt;tr data-end=&quot;3299&quot; data-start=&quot;3190&quot;&gt;&lt;th data-col-size=&quot;md&quot; data-end=&quot;3200&quot; data-start=&quot;3190&quot;&gt;Feature&lt;/th&gt;&lt;th data-col-size=&quot;lg&quot; data-end=&quot;3234&quot; data-start=&quot;3200&quot;&gt;Claude’s new file creation/edit&lt;/th&gt;&lt;th data-col-size=&quot;xl&quot; data-end=&quot;3265&quot; data-start=&quot;3234&quot;&gt;ChatGPT’s current capacities&lt;/th&gt;&lt;th data-col-size=&quot;lg&quot; data-end=&quot;3299&quot; data-start=&quot;3265&quot;&gt;Google Gemini (publicly known)&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody data-end=&quot;5447&quot; data-start=&quot;3318&quot;&gt;&lt;tr data-end=&quot;3602&quot; data-start=&quot;3318&quot;&gt;&lt;td data-col-size=&quot;md&quot; data-end=&quot;3386&quot; data-start=&quot;3318&quot;&gt;Create + download Word / PPTX / Excel / PDF from scratch via chat&lt;/td&gt;&lt;td data-col-size=&quot;lg&quot; data-end=&quot;3403&quot; data-start=&quot;3386&quot;&gt;✅ Yes (Claude)&lt;/td&gt;&lt;td data-col-size=&quot;xl&quot; data-end=&quot;3519&quot; data-start=&quot;3403&quot;&gt;❓ Mostly no / limited; chat drafts or upload → extract, but not full artifact creation with formatting &amp;amp; formulas&lt;/td&gt;&lt;td data-col-size=&quot;lg&quot; data-end=&quot;3602&quot; data-start=&quot;3519&quot;&gt;✅ Some doc export (e.g. Google Doc), but file formats &amp;amp; formula support unclear&lt;/td&gt;&lt;/tr&gt;&lt;tr data-end=&quot;3894&quot; data-start=&quot;3603&quot;&gt;&lt;td data-col-size=&quot;md&quot; data-end=&quot;3679&quot; data-start=&quot;3603&quot;&gt;Edit existing files (spreadsheets, slide decks, PDFs) by specifying edits&lt;/td&gt;&lt;td data-col-size=&quot;lg&quot; data-end=&quot;3721&quot; data-start=&quot;3679&quot;&gt;✅ Yes (Claude can modify uploaded file)&lt;/td&gt;&lt;td data-col-size=&quot;xl&quot; data-end=&quot;3871&quot; data-start=&quot;3721&quot;&gt;Partial: you can ask ChatGPT to suggest edits, maybe produce updated content; but usually via text, not editing the actual file artifact internally&lt;/td&gt;&lt;td data-col-size=&quot;lg&quot; data-end=&quot;3894&quot; data-start=&quot;3871&quot;&gt;Less clear publicly&lt;/td&gt;&lt;/tr&gt;&lt;tr data-end=&quot;4262&quot; data-start=&quot;3895&quot;&gt;&lt;td data-col-size=&quot;md&quot; data-end=&quot;3961&quot; data-start=&quot;3895&quot;&gt;Formulas / spreadsheet logic, charts, data analysis within file&lt;/td&gt;&lt;td data-col-size=&quot;lg&quot; data-end=&quot;4067&quot; data-start=&quot;3961&quot;&gt;✅ Claude supports formulas, chart generation in Excel sheets etc.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/td&gt;&lt;td data-col-size=&quot;xl&quot; data-end=&quot;4233&quot; data-start=&quot;4067&quot;&gt;✅ ChatGPT’s Advanced Data Analysis / Code Interpreter can run code, generate charts etc., but output often image or code rather than in Excel with working formulas&lt;/td&gt;&lt;td data-col-size=&quot;lg&quot; data-end=&quot;4262&quot; data-start=&quot;4233&quot;&gt;Unknown detail for Gemini&lt;/td&gt;&lt;/tr&gt;&lt;tr data-end=&quot;4708&quot; data-start=&quot;4263&quot;&gt;&lt;td data-col-size=&quot;md&quot; data-end=&quot;4334&quot; data-start=&quot;4263&quot;&gt;Format preservation / bulk edits (e.g. replace terms, style, layout)&lt;/td&gt;&lt;td data-col-size=&quot;lg&quot; data-end=&quot;4477&quot; data-start=&quot;4334&quot;&gt;Claude claims it preserves formatting and supports direct editing without opening the file manually.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/td&gt;&lt;td data-col-size=&quot;xl&quot; data-end=&quot;4627&quot; data-start=&quot;4477&quot;&gt;ChatGPT can manipulate content, but not always preserve all formatting when exporting to external files; often conversion‐based or re‐rendered text&lt;/td&gt;&lt;td data-col-size=&quot;lg&quot; data-end=&quot;4708&quot; data-start=&quot;4627&quot;&gt;Gemini likely similar to document export, with less file format variety known&lt;/td&gt;&lt;/tr&gt;&lt;tr data-end=&quot;5050&quot; data-start=&quot;4709&quot;&gt;&lt;td data-col-size=&quot;md&quot; data-end=&quot;4730&quot; data-start=&quot;4709&quot;&gt;File size &amp;amp; limits&lt;/td&gt;&lt;td data-col-size=&quot;lg&quot; data-end=&quot;4810&quot; data-start=&quot;4730&quot;&gt;Claude: upload/download up to ~30 MB.&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/td&gt;&lt;td data-col-size=&quot;xl&quot; data-end=&quot;4994&quot; data-start=&quot;4810&quot;&gt;ChatGPT file uploads also have size limits (for large files or large images), but the limit &amp;amp; support for editing artifacts with formulas or presentation layouts is more constrained&lt;/td&gt;&lt;td data-col-size=&quot;lg&quot; data-end=&quot;5050&quot; data-start=&quot;4994&quot;&gt;Not fully disclosed / varies across features / tools&lt;/td&gt;&lt;/tr&gt;&lt;tr data-end=&quot;5447&quot; data-start=&quot;5051&quot;&gt;&lt;td data-col-size=&quot;md&quot; data-end=&quot;5086&quot; data-start=&quot;5051&quot;&gt;Availability / plan restrictions&lt;/td&gt;&lt;td data-col-size=&quot;lg&quot; data-end=&quot;5191&quot; data-start=&quot;5086&quot;&gt;Preview for paid tiers in Claude; not yet general free access.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/td&gt;&lt;td data-col-size=&quot;xl&quot; data-end=&quot;5310&quot; data-start=&quot;5191&quot;&gt;Many advanced features gated to Plus / Pro / Teams; “Canvas” is in beta etc.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/td&gt;&lt;td data-col-size=&quot;lg&quot; data-end=&quot;5447&quot; data-start=&quot;5310&quot;&gt;Gemini similarly has tiered and regionized feature access; some users may have access, but not universally confirmed for all features&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;&lt;/div&gt;
&lt;hr data-end=&quot;5452&quot; data-start=&quot;5449&quot; /&gt;
&lt;h2 data-end=&quot;5487&quot; data-start=&quot;5454&quot;&gt;Implications &amp;amp; what this means&lt;/h2&gt;
&lt;ul data-end=&quot;6335&quot; data-start=&quot;5489&quot;&gt;
&lt;li data-end=&quot;5700&quot; data-start=&quot;5489&quot;&gt;
&lt;p data-end=&quot;5700&quot; data-start=&quot;5491&quot;&gt;Claude’s added file creation/edit increases its utility for &lt;em data-end=&quot;5587&quot; data-start=&quot;5551&quot;&gt;document + presentation work flows&lt;/em&gt;, especially for business / enterprise use, where formatted deliverables (slides, reports, spreadsheets) are key.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;5923&quot; data-start=&quot;5702&quot;&gt;
&lt;p data-end=&quot;5923&quot; data-start=&quot;5704&quot;&gt;If ChatGPT (or Gemini) wants to match this, they&#39;d need to support not just text/coding, but full artifact generation + editing with retention of formula logic/formatting + download/export in common office file formats.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;6115&quot; data-start=&quot;5925&quot;&gt;
&lt;p data-end=&quot;6115&quot; data-start=&quot;5927&quot;&gt;Users whose workflows involve formatting, layout, bulk edits, or converting between formats will benefit more from Claude’s new feature—less manual reformatting and fewer copy-paste hacks.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;6335&quot; data-start=&quot;6117&quot;&gt;
&lt;p data-end=&quot;6335&quot; data-start=&quot;6119&quot;&gt;For many use cases, existing tools (ChatGPT + Code Interpreter) suffice, especially when output is data or charts. But for file artifacts that are meant to be “finished” or shared, Claude’s offering tightens the gap.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/933945640325056562/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/933945640325056562?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/933945640325056562'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/933945640325056562'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2025/09/claudes-new-file-creation-tools-vs.html' title='Claude’s new file creation tools vs. ChatGPT and Gemini: who’s ahead on real productivity'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-977760614037978007</id><published>2025-09-12T19:50:00.001+08:00</published><updated>2025-09-12T19:50:00.118+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="AI productivity"/><category scheme="http://www.blogger.com/atom/ns#" term="Anthropic Claude"/><category scheme="http://www.blogger.com/atom/ns#" term="data analysis"/><category scheme="http://www.blogger.com/atom/ns#" term="documents"/><category scheme="http://www.blogger.com/atom/ns#" term="Enterprise AI"/><category scheme="http://www.blogger.com/atom/ns#" term="file creation"/><category scheme="http://www.blogger.com/atom/ns#" term="PDFs"/><category scheme="http://www.blogger.com/atom/ns#" term="PowerPoint"/><category scheme="http://www.blogger.com/atom/ns#" term="sandboxed compute"/><category scheme="http://www.blogger.com/atom/ns#" term="spreadsheets"/><title type='text'>Claude’s Leap: From Chat to File Factory</title><content type='html'>&lt;p&gt;&amp;nbsp;Anthropic just upgraded Claude to be more than a conversational assistant. A fresh feature preview lets users &lt;strong data-end=&quot;567&quot; data-start=&quot;537&quot;&gt;create and edit real files&lt;/strong&gt;—Excel sheets, Word docs, PowerPoint decks, and PDFs—directly through Claude.ai and the desktop app. Rather than simply getting text output, you can describe what you need, upload data, and receive usable files already formatted and ready to share or export.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;hr data-end=&quot;868&quot; data-start=&quot;865&quot; /&gt;
&lt;h3 data-end=&quot;884&quot; data-start=&quot;870&quot;&gt;What’s New&lt;/h3&gt;
&lt;ul data-end=&quot;1585&quot; data-start=&quot;886&quot;&gt;
&lt;li data-end=&quot;1049&quot; data-start=&quot;886&quot;&gt;
&lt;p data-end=&quot;1049&quot; data-start=&quot;888&quot;&gt;&lt;strong data-end=&quot;912&quot; data-start=&quot;888&quot;&gt;File types supported&lt;/strong&gt;: &lt;code data-end=&quot;921&quot; data-start=&quot;914&quot;&gt;.xlsx&lt;/code&gt;, &lt;code data-end=&quot;930&quot; data-start=&quot;923&quot;&gt;.docx&lt;/code&gt;, &lt;code data-end=&quot;939&quot; data-start=&quot;932&quot;&gt;.pptx&lt;/code&gt;, &lt;code data-end=&quot;947&quot; data-start=&quot;941&quot;&gt;.pdf&lt;/code&gt;—spreadsheet, word-processed, slide, and presentation formats.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;1343&quot; data-start=&quot;1050&quot;&gt;
&lt;p data-end=&quot;1343&quot; data-start=&quot;1052&quot;&gt;&lt;strong data-end=&quot;1081&quot; data-start=&quot;1052&quot;&gt;Complex workflows enabled&lt;/strong&gt;: You can ask Claude to build financial models with formulas and multiple sheets, convert PDFs into slides, clean raw data, run statistical analyses, produce charts, or stitch together reports—all via natural instructions.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;1585&quot; data-start=&quot;1344&quot;&gt;
&lt;p data-end=&quot;1585&quot; data-start=&quot;1346&quot;&gt;&lt;strong data-end=&quot;1369&quot; data-start=&quot;1346&quot;&gt;Sandboxed computing&lt;/strong&gt;: Claude now operates in a restricted internal computing environment. It can run code (e.g. Python), load libraries, and generate artifacts without exposing your local machine.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr data-end=&quot;1590&quot; data-start=&quot;1587&quot; /&gt;
&lt;h3 data-end=&quot;1616&quot; data-start=&quot;1592&quot;&gt;Availability &amp;amp; Plans&lt;/h3&gt;
&lt;ul data-end=&quot;2011&quot; data-start=&quot;1618&quot;&gt;
&lt;li data-end=&quot;1735&quot; data-start=&quot;1618&quot;&gt;
&lt;p data-end=&quot;1735&quot; data-start=&quot;1620&quot;&gt;&lt;strong data-end=&quot;1645&quot; data-start=&quot;1620&quot;&gt;Already available now&lt;/strong&gt; for those on &lt;strong data-end=&quot;1688&quot; data-start=&quot;1659&quot;&gt;Max, Team, and Enterprise&lt;/strong&gt; plans.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;1817&quot; data-start=&quot;1736&quot;&gt;
&lt;p data-end=&quot;1817&quot; data-start=&quot;1738&quot;&gt;&lt;strong data-end=&quot;1751&quot; data-start=&quot;1738&quot;&gt;Pro users&lt;/strong&gt; will get access &lt;strong data-end=&quot;1776&quot; data-start=&quot;1768&quot;&gt;soon&lt;/strong&gt;.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2011&quot; data-start=&quot;1818&quot;&gt;
&lt;p data-end=&quot;2011&quot; data-start=&quot;1820&quot;&gt;It’s currently a &lt;strong data-end=&quot;1856&quot; data-start=&quot;1837&quot;&gt;feature preview&lt;/strong&gt;—opt-in required per user via the Claude settings (“Upgraded file creation and analysis”) and may still be tweaked.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr data-end=&quot;2016&quot; data-start=&quot;2013&quot; /&gt;
&lt;h3 data-end=&quot;2048&quot; data-start=&quot;2018&quot;&gt;Use Cases: What You Can Do&lt;/h3&gt;
&lt;ul data-end=&quot;2461&quot; data-start=&quot;2050&quot;&gt;
&lt;li data-end=&quot;2172&quot; data-start=&quot;2050&quot;&gt;
&lt;p data-end=&quot;2172&quot; data-start=&quot;2052&quot;&gt;Transform raw data into polished reports (CSV → charts → formatted Word or PDF).&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2298&quot; data-start=&quot;2173&quot;&gt;
&lt;p data-end=&quot;2298&quot; data-start=&quot;2175&quot;&gt;Build project trackers, scenario models, dashboards in Excel with working formulas.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2461&quot; data-start=&quot;2299&quot;&gt;
&lt;p data-end=&quot;2461&quot; data-start=&quot;2301&quot;&gt;Convert existing documents from one format to another: e.g., meeting notes → slide decks; PDF reports → editable docs.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr data-end=&quot;2466&quot; data-start=&quot;2463&quot; /&gt;
&lt;h3 data-end=&quot;2490&quot; data-start=&quot;2468&quot;&gt;Risks &amp;amp; Safeguards&lt;/h3&gt;
&lt;ul data-end=&quot;3003&quot; data-start=&quot;2492&quot;&gt;
&lt;li data-end=&quot;2789&quot; data-start=&quot;2492&quot;&gt;
&lt;p data-end=&quot;2789&quot; data-start=&quot;2494&quot;&gt;&lt;strong data-end=&quot;2506&quot; data-start=&quot;2494&quot;&gt;Security&lt;/strong&gt;: Because Claude gets limited internet access in order to import packages or execute code, there is risk of malicious content or prompt injection. Users are encouraged to monitor outputs and disable the feature if suspicious behavior arises.&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;3003&quot; data-start=&quot;2790&quot;&gt;
&lt;p data-end=&quot;3003&quot; data-start=&quot;2792&quot;&gt;&lt;strong data-end=&quot;2813&quot; data-start=&quot;2792&quot;&gt;Sandbox isolation&lt;/strong&gt;: Enterprise settings allow admins to enable or disable file creation organization-wide. Team users must opt in; individuals can toggle the feature.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr data-end=&quot;3008&quot; data-start=&quot;3005&quot; /&gt;
&lt;h3 data-end=&quot;3028&quot; data-start=&quot;3010&quot;&gt;Why It Matters&lt;/h3&gt;
&lt;p data-end=&quot;3607&quot; data-start=&quot;3030&quot;&gt;This move shifts Claude (and similar models) further into hands-on productivity automation. Rather than merely advising, Claude can now execute parts of what used to require manual effort: formatting, data manipulation, cross-format conversion. That reduces friction for users who want to go from idea → usable artifact in fewer steps. It’s also a more natural way to blend AI into workflows: you stay in chat, give instructions, and get back files—not just text dumps you have to reformat. It’s a signal of what’s next: smarter agents embedded in the tools people use daily.&lt;/p&gt;&lt;p data-end=&quot;3607&quot; data-start=&quot;3030&quot;&gt;&lt;a href=&quot;https://www.anthropic.com/news/create-files&quot;&gt;Blog Link&lt;/a&gt;&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/977760614037978007/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/977760614037978007?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/977760614037978007'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/977760614037978007'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2025/09/claudes-leap-from-chat-to-file-factory.html' title='Claude’s Leap: From Chat to File Factory'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-6052383390916263022</id><published>2025-09-12T19:00:00.001+08:00</published><updated>2025-09-12T19:00:00.119+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="DeepSeek-R1"/><category scheme="http://www.blogger.com/atom/ns#" term="large reasoning models"/><category scheme="http://www.blogger.com/atom/ns#" term="LRMs"/><category scheme="http://www.blogger.com/atom/ns#" term="policy optimization"/><category scheme="http://www.blogger.com/atom/ns#" term="Reinforcement Learning"/><category scheme="http://www.blogger.com/atom/ns#" term="reward design"/><category scheme="http://www.blogger.com/atom/ns#" term="RLVR"/><category scheme="http://www.blogger.com/atom/ns#" term="rule-based rewards"/><category scheme="http://www.blogger.com/atom/ns#" term="sampling strategy"/><category scheme="http://www.blogger.com/atom/ns#" term="survey paper"/><title type='text'>A Survey of Reinforcement Learning for Large Reasoning Models: mapping the promise and the gaps</title><content type='html'>&lt;p&gt;&amp;nbsp;Reinforcement learning (RL) isn’t new—but as Large Language Models (LLMs) evolve into &lt;em data-end=&quot;660&quot; data-start=&quot;640&quot;&gt;reasoning machines&lt;/em&gt;, RL is taking a central role not just in alignment, but in &lt;strong data-end=&quot;749&quot; data-start=&quot;720&quot;&gt;building reasoning itself&lt;/strong&gt;. A new survey, &lt;em data-end=&quot;825&quot; data-start=&quot;765&quot;&gt;“Reinforcement Learning for Large Reasoning Models (LRMs)”&lt;/em&gt; by a large group from Tsinghua, Shanghai AI Lab, SJTU, and others, lays out an exhaustive map of the nascent field: what’s working, what’s risky, and what future architects need to solve.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;hr data-end=&quot;1058&quot; data-start=&quot;1055&quot; /&gt;
&lt;h2 data-end=&quot;1085&quot; data-start=&quot;1060&quot;&gt;What the survey covers&lt;/h2&gt;
&lt;p data-end=&quot;1548&quot; data-start=&quot;1087&quot;&gt;The paper dives into the core building blocks of using RL in reasoning-centered LLMs (often called LRMs): how to define rewards, what training algorithms are in play, how sampling strategies are evolving, and how infrastructure and task domains factor into the picture. It considers both alignment-adjacent RL (e.g. RLHF, preference learning) &lt;em data-end=&quot;1435&quot; data-start=&quot;1430&quot;&gt;and&lt;/em&gt; RL whose goal is reasoning performance (accuracy, planning, reflection).&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;hr data-end=&quot;1553&quot; data-start=&quot;1550&quot; /&gt;
&lt;h2 data-end=&quot;1581&quot; data-start=&quot;1555&quot;&gt;Key themes and insights&lt;/h2&gt;
&lt;ol data-end=&quot;3987&quot; data-start=&quot;1583&quot;&gt;
&lt;li data-end=&quot;2189&quot; data-start=&quot;1583&quot;&gt;
&lt;p data-end=&quot;1658&quot; data-start=&quot;1586&quot;&gt;&lt;strong data-end=&quot;1603&quot; data-start=&quot;1586&quot;&gt;Reward design&lt;/strong&gt;&lt;br data-end=&quot;1606&quot; data-start=&quot;1603&quot; /&gt;
The survey classifies rewards into several types:&lt;/p&gt;
&lt;ul data-end=&quot;2189&quot; data-start=&quot;1662&quot;&gt;
&lt;li data-end=&quot;1754&quot; data-start=&quot;1662&quot;&gt;
&lt;p data-end=&quot;1754&quot; data-start=&quot;1664&quot;&gt;&lt;em data-end=&quot;1684&quot; data-start=&quot;1664&quot;&gt;Verifiable rewards&lt;/em&gt; (e.g. test correctness, unit tests, exact checks) when tasks allow.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;1830&quot; data-start=&quot;1758&quot;&gt;
&lt;p data-end=&quot;1830&quot; data-start=&quot;1760&quot;&gt;&lt;em data-end=&quot;1796&quot; data-start=&quot;1760&quot;&gt;Generative / learned reward models&lt;/em&gt; for subjective or open domains.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;1935&quot; data-start=&quot;1834&quot;&gt;
&lt;p data-end=&quot;1935&quot; data-start=&quot;1836&quot;&gt;&lt;em data-end=&quot;1851&quot; data-start=&quot;1836&quot;&gt;Dense rewards&lt;/em&gt; vs outcome-only reward schemes—bringing signal into intermediate reasoning steps.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2189&quot; data-start=&quot;1939&quot;&gt;
&lt;p data-end=&quot;2189&quot; data-start=&quot;1941&quot;&gt;Unsupervised or weak rewards when neither full correctness metrics nor human feedback are feasible.&lt;br data-end=&quot;2043&quot; data-start=&quot;2040&quot; /&gt;
The authors emphasize that &lt;strong data-end=&quot;2148&quot; data-start=&quot;2073&quot;&gt;tasks with strong verifiability tend to yield more reliable RL learning&lt;/strong&gt;.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2672&quot; data-start=&quot;2191&quot;&gt;
&lt;p data-end=&quot;2672&quot; data-start=&quot;2194&quot;&gt;&lt;strong data-end=&quot;2239&quot; data-start=&quot;2194&quot;&gt;Policy optimization &amp;amp; sampling strategies&lt;/strong&gt;&lt;br data-end=&quot;2242&quot; data-start=&quot;2239&quot; /&gt;
There’s a broad sweep of algorithms: policy gradients, off-policy methods, regularized RL, hybrid approaches, critic-based vs critic-free methods. Sampling strategies—how you gather candidate outputs or intermediate chains—have big effects both on performance and on compute cost. Dynamic / structured sampling (e.g. adaptively adjusting paths, beam vs sampling) is becoming more common.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;3186&quot; data-start=&quot;2674&quot;&gt;
&lt;p data-end=&quot;2744&quot; data-start=&quot;2677&quot;&gt;&lt;strong data-end=&quot;2711&quot; data-start=&quot;2677&quot;&gt;Foundational problems and gaps&lt;/strong&gt;&lt;br data-end=&quot;2714&quot; data-start=&quot;2711&quot; /&gt;
Several of these stand out:&lt;/p&gt;
&lt;ul data-end=&quot;3186&quot; data-start=&quot;2748&quot;&gt;
&lt;li data-end=&quot;2817&quot; data-start=&quot;2748&quot;&gt;
&lt;p data-end=&quot;2817&quot; data-start=&quot;2750&quot;&gt;Distinguishing when RL improves &lt;em data-end=&quot;2793&quot; data-start=&quot;2782&quot;&gt;reasoning&lt;/em&gt; vs just memorization.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2941&quot; data-start=&quot;2821&quot;&gt;
&lt;p data-end=&quot;2941&quot; data-start=&quot;2823&quot;&gt;Balancing weak model priors: does your base LLM already encode reasoning bias, or do you need to train from scratch?&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;3008&quot; data-start=&quot;2945&quot;&gt;
&lt;p data-end=&quot;3008&quot; data-start=&quot;2947&quot;&gt;Trap of over-rewarding narrow achievements; reward hacking.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;3073&quot; data-start=&quot;3012&quot;&gt;
&lt;p data-end=&quot;3073&quot; data-start=&quot;3014&quot;&gt;Challenges in reward specification in subjective domains.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;3186&quot; data-start=&quot;3077&quot;&gt;
&lt;p data-end=&quot;3186&quot; data-start=&quot;3079&quot;&gt;Scaling issues: compute, infrastructure, verifying many candidates.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li data-end=&quot;3590&quot; data-start=&quot;3188&quot;&gt;
&lt;p data-end=&quot;3590&quot; data-start=&quot;3191&quot;&gt;&lt;strong data-end=&quot;3230&quot; data-start=&quot;3191&quot;&gt;Training resources &amp;amp; infrastructure&lt;/strong&gt;&lt;br data-end=&quot;3233&quot; data-start=&quot;3230&quot; /&gt;
The survey catalogues the spectrum of environments and corpora used: from static datasets to dynamic environments (interactive tasks, tool usage), from single-task to multi-agent setups. It also considers RL frameworks and infrastructure tools (e.g. RL pipeline libraries) that enable reproducible LLM+RL research.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;3987&quot; data-start=&quot;3592&quot;&gt;
&lt;p data-end=&quot;3646&quot; data-start=&quot;3595&quot;&gt;&lt;strong data-end=&quot;3611&quot; data-start=&quot;3595&quot;&gt;Applications&lt;/strong&gt;&lt;br data-end=&quot;3614&quot; data-start=&quot;3611&quot; /&gt;
RL for LRMs has been used in:&lt;/p&gt;
&lt;ul data-end=&quot;3987&quot; data-start=&quot;3650&quot;&gt;
&lt;li data-end=&quot;3707&quot; data-start=&quot;3650&quot;&gt;
&lt;p data-end=&quot;3707&quot; data-start=&quot;3652&quot;&gt;&lt;strong data-end=&quot;3662&quot; data-start=&quot;3652&quot;&gt;Coding&lt;/strong&gt;: unit tests, code correctness, reflection.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;3778&quot; data-start=&quot;3711&quot;&gt;
&lt;p data-end=&quot;3778&quot; data-start=&quot;3713&quot;&gt;&lt;strong data-end=&quot;3730&quot; data-start=&quot;3713&quot;&gt;Agentic tasks&lt;/strong&gt;: agents using tools, web retrieval, planning.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;3847&quot; data-start=&quot;3782&quot;&gt;
&lt;p data-end=&quot;3847&quot; data-start=&quot;3784&quot;&gt;&lt;strong data-end=&quot;3808&quot; data-start=&quot;3784&quot;&gt;Multimodal reasoning&lt;/strong&gt;: vision-language tasks, code+images.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;3987&quot; data-start=&quot;3851&quot;&gt;
&lt;p data-end=&quot;3987&quot; data-start=&quot;3853&quot;&gt;&lt;strong data-end=&quot;3888&quot; data-start=&quot;3853&quot;&gt;Robotics / medical / scientific&lt;/strong&gt; domains. Each has its own reward/verification constraints.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;hr data-end=&quot;3992&quot; data-start=&quot;3989&quot; /&gt;
&lt;h2 data-end=&quot;4032&quot; data-start=&quot;3994&quot;&gt;Why it matters &amp;amp; what to watch next&lt;/h2&gt;
&lt;ul data-end=&quot;4872&quot; data-start=&quot;4034&quot;&gt;
&lt;li data-end=&quot;4205&quot; data-start=&quot;4034&quot;&gt;
&lt;p data-end=&quot;4205&quot; data-start=&quot;4036&quot;&gt;&lt;strong data-end=&quot;4072&quot; data-start=&quot;4036&quot;&gt;Reasoning as an explicit target.&lt;/strong&gt; RL is being woven into models &lt;em data-end=&quot;4145&quot; data-start=&quot;4103&quot;&gt;not just to be more “helpful” or “safe,”&lt;/em&gt; but to &lt;em data-end=&quot;4173&quot; data-start=&quot;4153&quot;&gt;reason more deeply&lt;/em&gt;: plan, reflect, self-correct.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;4371&quot; data-start=&quot;4207&quot;&gt;
&lt;p data-end=&quot;4371&quot; data-start=&quot;4209&quot;&gt;&lt;strong data-end=&quot;4244&quot; data-start=&quot;4209&quot;&gt;Verifiability is a power lever.&lt;/strong&gt; Where tasks allow for exact or semi-exact verification, RL works well. When reward is fuzzy, progress is slower and riskier.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;4659&quot; data-start=&quot;4373&quot;&gt;
&lt;p data-end=&quot;4659&quot; data-start=&quot;4375&quot;&gt;&lt;strong data-end=&quot;4428&quot; data-start=&quot;4375&quot;&gt;Cost and scalability are fundamental constraints.&lt;/strong&gt; As LRMs become larger and used with more test-time compute (more chain-of-thought, more candidate generations), RL training and inference costs balloon; infrastructure and sampling strategy choices can make or break feasibility.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;4872&quot; data-start=&quot;4661&quot;&gt;
&lt;p data-end=&quot;4872&quot; data-start=&quot;4663&quot;&gt;&lt;strong data-end=&quot;4716&quot; data-start=&quot;4663&quot;&gt;Hybrid and co-evolving reward models are growing.&lt;/strong&gt; There’s increasing interest in reward models that both learn and evolve alongside the LLM, or in having the model itself critique or verify its own work.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr data-end=&quot;4877&quot; data-start=&quot;4874&quot; /&gt;
&lt;h2 data-end=&quot;4920&quot; data-start=&quot;4879&quot;&gt;Takeaways for researchers and builders&lt;/h2&gt;
&lt;ul data-end=&quot;5583&quot; data-start=&quot;4922&quot;&gt;
&lt;li data-end=&quot;5069&quot; data-start=&quot;4922&quot;&gt;
&lt;p data-end=&quot;5069&quot; data-start=&quot;4924&quot;&gt;If you’re designing RL for reasoning tasks, aim for &lt;em data-end=&quot;5003&quot; data-start=&quot;4976&quot;&gt;verifiable reward signals&lt;/em&gt; where possible—they give cleaner gradients and fewer surprises.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;5217&quot; data-start=&quot;5070&quot;&gt;
&lt;p data-end=&quot;5217&quot; data-start=&quot;5072&quot;&gt;Pay attention to sampling strategy—generating more candidates or reasoning branches helps, but only when combined with selective reinforcement.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;5397&quot; data-start=&quot;5218&quot;&gt;
&lt;p data-end=&quot;5397&quot; data-start=&quot;5220&quot;&gt;For subjective or “open” tasks (creative writing, alignment, etc.), you likely need sophisticated reward models, rubric-based or generative rewards, and strong regularization.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;5583&quot; data-start=&quot;5399&quot;&gt;
&lt;p data-end=&quot;5583&quot; data-start=&quot;5401&quot;&gt;Infrastructure matters: your ability to scale RL—from having candidate generation, verifiers, tool execution environments, caching, etc.—significantly affects what you can achieve.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr data-end=&quot;5588&quot; data-start=&quot;5585&quot; /&gt;
&lt;p data-end=&quot;5977&quot; data-start=&quot;5590&quot;&gt;&lt;strong data-end=&quot;5606&quot; data-start=&quot;5590&quot;&gt;Bottom line:&lt;/strong&gt; This survey is a timely, comprehensive lookup table for anyone playing at the intersection of LLMs, RL, and reasoning. It confirms that reward design and verifiability are major levers, that RL is now essential for pushing reasoning as a capability, but also that many technical, infrastructural, and algorithmic challenges remain before “reasoning superintelligence.”&lt;/p&gt;&lt;p data-end=&quot;5977&quot; data-start=&quot;5590&quot;&gt;Paper link: &lt;a class=&quot;decorated-link&quot; data-end=&quot;6050&quot; data-start=&quot;5992&quot; href=&quot;https://arxiv.org/pdf/2509.08827&quot; rel=&quot;noopener&quot; target=&quot;_new&quot;&gt;arXiv 2509.08827 (PDF)&lt;/a&gt;&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/6052383390916263022/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/6052383390916263022?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/6052383390916263022'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/6052383390916263022'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2025/09/a-survey-of-reinforcement-learning-for.html' title='A Survey of Reinforcement Learning for Large Reasoning Models: mapping the promise and the gaps'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-106974029640690644</id><published>2025-09-12T09:40:00.001+08:00</published><updated>2025-09-12T09:40:04.143+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="agent software"/><category scheme="http://www.blogger.com/atom/ns#" term="Claude Code"/><category scheme="http://www.blogger.com/atom/ns#" term="developer guide"/><category scheme="http://www.blogger.com/atom/ns#" term="LLM Agents"/><category scheme="http://www.blogger.com/atom/ns#" term="MCP"/><category scheme="http://www.blogger.com/atom/ns#" term="namespaces"/><category scheme="http://www.blogger.com/atom/ns#" term="prompt engineering"/><category scheme="http://www.blogger.com/atom/ns#" term="token efficiency"/><category scheme="http://www.blogger.com/atom/ns#" term="tool design"/><category scheme="http://www.blogger.com/atom/ns#" term="tool evaluation"/><title type='text'>How to Build High-Quality Tools for LLM Agents — Lessons from Anthropic</title><content type='html'>&lt;p&gt;&amp;nbsp;As agents become more central to AI workflows, what separates a good agent from a great one often comes down to the &lt;strong data-end=&quot;630&quot; data-start=&quot;621&quot;&gt;tools&lt;/strong&gt; it has—and how well those tools are designed. In &lt;em data-end=&quot;733&quot; data-start=&quot;680&quot;&gt;“Writing effective tools for agents — with agents,”&lt;/em&gt; Anthropic shares a practical roadmap for building better tools powered by tools themselves, using Claude and the Model Context Protocol (MCP) as real-use labs.&lt;/p&gt;
&lt;hr data-end=&quot;900&quot; data-start=&quot;897&quot; /&gt;
&lt;h2 data-end=&quot;945&quot; data-start=&quot;902&quot;&gt;What are “tools” in the agentic context?&lt;/h2&gt;
&lt;p data-end=&quot;1489&quot; data-start=&quot;947&quot;&gt;Unlike conventional software APIs—deterministic functions that always give the same output for the same input—tools for agents must be built to coexist with non-deterministic systems. Agents like Claude must decide when to use tools, how to parse their output, and how to call them responsibly. A tool here is not just an API call; it&#39;s part of an interface contract between predictable software and unpredictable agent behavior. Tools are the mechanisms by which agents expand what they can reliably do.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;hr data-end=&quot;1494&quot; data-start=&quot;1491&quot; /&gt;
&lt;h2 data-end=&quot;1552&quot; data-start=&quot;1496&quot;&gt;Key workflows: prototyping, evaluating, and iterating&lt;/h2&gt;
&lt;p data-end=&quot;1597&quot; data-start=&quot;1554&quot;&gt;Anthropic emphasizes an iterative workflow:&lt;/p&gt;
&lt;ol data-end=&quot;2608&quot; data-start=&quot;1599&quot;&gt;
&lt;li data-end=&quot;1889&quot; data-start=&quot;1599&quot;&gt;
&lt;p data-end=&quot;1889&quot; data-start=&quot;1602&quot;&gt;&lt;strong data-end=&quot;1622&quot; data-start=&quot;1602&quot;&gt;Prototype early:&lt;/strong&gt; Build simple versions of your tools. Use MCP servers or desktop extensions to connect your tool to Claude Code, allowing rapid experimentation and detection of rough edges. Include clear documentation that the agent can consume.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2260&quot; data-start=&quot;1891&quot;&gt;
&lt;p data-end=&quot;2260&quot; data-start=&quot;1894&quot;&gt;&lt;strong data-end=&quot;1924&quot; data-start=&quot;1894&quot;&gt;Run realistic evaluations:&lt;/strong&gt; Create evaluation tasks that reflect real-world usage (multiple tool calls, complex chains, integration with other services). Use verifiable outcomes, not just “it seems right.” Capture metrics such as tool calls, token consumption, runtime, errors. Avoid toy tasks that underrepresent complexity.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2608&quot; data-start=&quot;2262&quot;&gt;
&lt;p data-end=&quot;2608&quot; data-start=&quot;2265&quot;&gt;&lt;strong data-end=&quot;2297&quot; data-start=&quot;2265&quot;&gt;Use agents to improve tools:&lt;/strong&gt; Let Claude analyze transcripts and feedback to suggest refinements—maybe better prompt descriptions, more efficient tool outputs, clearer schemas. Anthropic reports improvements even for tools built by internal experts, purely by letting agents inspect tools’ performance.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;hr data-end=&quot;2613&quot; data-start=&quot;2610&quot; /&gt;
&lt;h2 data-end=&quot;2655&quot; data-start=&quot;2615&quot;&gt;Best practices and guiding principles&lt;/h2&gt;
&lt;p data-end=&quot;2736&quot; data-start=&quot;2657&quot;&gt;Anthropic distills the lessons into a set of design principles. Key among them:&lt;/p&gt;
&lt;ul data-end=&quot;3948&quot; data-start=&quot;2738&quot;&gt;
&lt;li data-end=&quot;2980&quot; data-start=&quot;2738&quot;&gt;
&lt;p data-end=&quot;2980&quot; data-start=&quot;2740&quot;&gt;&lt;strong data-end=&quot;2771&quot; data-start=&quot;2740&quot;&gt;Choosing tools selectively:&lt;/strong&gt; Not every API needs to become a tool. Tools should cover high-impact, repeated workflows—not wrapping every possible existing endpoint. Also, consolidate when possible.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;3226&quot; data-start=&quot;2981&quot;&gt;
&lt;p data-end=&quot;3226&quot; data-start=&quot;2983&quot;&gt;&lt;strong data-end=&quot;3017&quot; data-start=&quot;2983&quot;&gt;Namespaces and naming clarity:&lt;/strong&gt; Clear, consistent naming helps agents pick the right tool. Avoid ambiguous names or overlapping functionality. Group related tools under logical prefixes or categories.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;3478&quot; data-start=&quot;3227&quot;&gt;
&lt;p data-end=&quot;3478&quot; data-start=&quot;3229&quot;&gt;&lt;strong data-end=&quot;3268&quot; data-start=&quot;3229&quot;&gt;Return meaningful, concise context:&lt;/strong&gt; Tools should return high-signal info. Avoid overwhelming the agent with technical IDs, long metadata unless necessary. Also allow “concise” vs “detailed” response modes.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;3719&quot; data-start=&quot;3479&quot;&gt;
&lt;p data-end=&quot;3719&quot; data-start=&quot;3481&quot;&gt;&lt;strong data-end=&quot;3515&quot; data-start=&quot;3481&quot;&gt;Optimize for token efficiency:&lt;/strong&gt; Use truncation, filtering, pagination. Prompt agents to use fewer tool calls or more precise queries. Efficient context limits make downstream tasks more reliable.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;3948&quot; data-start=&quot;3720&quot;&gt;
&lt;p data-end=&quot;3948&quot; data-start=&quot;3722&quot;&gt;&lt;strong data-end=&quot;3760&quot; data-start=&quot;3722&quot;&gt;Clear tool specs and descriptions:&lt;/strong&gt; Explicit parameter naming, clear input/output formats, good examples. Prompt engineering of tool descriptions can significantly impact performance.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr data-end=&quot;3953&quot; data-start=&quot;3950&quot; /&gt;
&lt;h2 data-end=&quot;3974&quot; data-start=&quot;3955&quot;&gt;Why this matters&lt;/h2&gt;
&lt;p data-end=&quot;4339&quot; data-start=&quot;3976&quot;&gt;Tools shape what agents &lt;em data-end=&quot;4005&quot; data-start=&quot;4000&quot;&gt;can&lt;/em&gt; do. When tools are poorly described, overly broad, or return huge dumps of irrelevant context, agents waste resources, produce hallucinations, or fail to successfully orchestrate workflows. On the other hand, well-designed tools reduce ambiguity, reduce token use, reduce error, and let agents scale reliably across real-world tasks.&lt;/p&gt;
&lt;p data-end=&quot;4657&quot; data-start=&quot;4341&quot;&gt;Especially as agents connect to many tools (hundreds via MCP servers), these design principles become the difference between brittle behavior and something that feels reliable and intuitive. Anthropic’s experience shows that many improvements come not from changing the LLM itself but refining the tools around it.&lt;/p&gt;
&lt;hr data-end=&quot;4662&quot; data-start=&quot;4659&quot; /&gt;
&lt;p data-end=&quot;4917&quot; data-start=&quot;4664&quot;&gt;If you’re building agent tools or service/tool APIs for agents, following Anthropic’s workflow—prototype → evaluate → iterate—and using clear naming, context-efficient returns, and good documentation will set you up for tools agents actually use well.&lt;/p&gt;&lt;p data-end=&quot;5073&quot; data-start=&quot;4919&quot;&gt;
Link: &lt;a class=&quot;decorated-link&quot; data-end=&quot;5073&quot; data-start=&quot;5011&quot; href=&quot;https://www.anthropic.com/engineering/writing-tools-for-agents?utm_source=chatgpt.com&quot; rel=&quot;noopener&quot; target=&quot;_new&quot;&gt;https://www.anthropic.com/engineering/writing-tools-for-agents&lt;/a&gt;&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/106974029640690644/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/106974029640690644?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/106974029640690644'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/106974029640690644'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2025/09/how-to-build-high-quality-tools-for-llm.html' title='How to Build High-Quality Tools for LLM Agents — Lessons from Anthropic'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-3309453393596501930</id><published>2025-09-11T17:55:00.001+08:00</published><updated>2025-09-11T17:55:19.373+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="AIME"/><category scheme="http://www.blogger.com/atom/ns#" term="causality vs. structure"/><category scheme="http://www.blogger.com/atom/ns#" term="Curriculum Learning"/><category scheme="http://www.blogger.com/atom/ns#" term="Math Reasoning"/><category scheme="http://www.blogger.com/atom/ns#" term="parallel thinking"/><category scheme="http://www.blogger.com/atom/ns#" term="Parallel-R1"/><category scheme="http://www.blogger.com/atom/ns#" term="path branching"/><category scheme="http://www.blogger.com/atom/ns#" term="Qwen-3-4B"/><category scheme="http://www.blogger.com/atom/ns#" term="reasoning verification"/><category scheme="http://www.blogger.com/atom/ns#" term="Reinforcement Learning"/><title type='text'>Parallel-R1: Teaching LLMs to reason from multiple angles—permanently</title><content type='html'>&lt;p&gt;&amp;nbsp;Modern large language models (LLMs) often reason sequentially—one thought chain at a time. &lt;strong data-end=&quot;634&quot; data-start=&quot;613&quot;&gt;Parallel thinking&lt;/strong&gt;, in contrast, involves spawning multiple reasoning paths (or perspectives), then merging the insights. While prompting tricks can induce this behavior at inference, they carry heavy overhead and brittle generalization. &lt;em data-end=&quot;867&quot; data-start=&quot;854&quot;&gt;Parallel-R1&lt;/em&gt;, a new paper by Tencent AI Lab Seattle with collaborators, pioneers a &lt;strong data-end=&quot;968&quot; data-start=&quot;938&quot;&gt;training-time RL framework&lt;/strong&gt; for instilling parallel thinking as a native reasoning strategy.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;hr data-end=&quot;1076&quot; data-start=&quot;1073&quot; /&gt;
&lt;h2 data-end=&quot;1100&quot; data-start=&quot;1078&quot;&gt;What is Parallel-R1&lt;/h2&gt;
&lt;p data-end=&quot;1235&quot; data-start=&quot;1102&quot;&gt;The key idea: don’t just &lt;em data-end=&quot;1135&quot; data-start=&quot;1127&quot;&gt;prompt&lt;/em&gt; models to use parallel paths—&lt;strong data-end=&quot;1174&quot; data-start=&quot;1165&quot;&gt;train&lt;/strong&gt; them to do so. Parallel-R1 has a &lt;strong data-end=&quot;1234&quot; data-start=&quot;1208&quot;&gt;progressive curriculum&lt;/strong&gt;:&lt;/p&gt;
&lt;ol data-end=&quot;1891&quot; data-start=&quot;1237&quot;&gt;
&lt;li data-end=&quot;1519&quot; data-start=&quot;1237&quot;&gt;
&lt;p data-end=&quot;1519&quot; data-start=&quot;1240&quot;&gt;&lt;strong data-end=&quot;1280&quot; data-start=&quot;1240&quot;&gt;Cold start (format learning via SFT)&lt;/strong&gt; — teach the model the syntax/tags of parallel blocks (e.g. &lt;code data-end=&quot;1352&quot; data-start=&quot;1340&quot;&gt;&amp;lt;Parallel&amp;gt;&lt;/code&gt;, &lt;code data-end=&quot;1372&quot; data-start=&quot;1354&quot;&gt;&amp;lt;Path&amp;gt;...&amp;lt;/Path&amp;gt;&lt;/code&gt;, &lt;code data-end=&quot;1385&quot; data-start=&quot;1374&quot;&gt;&amp;lt;Summary&amp;gt;&lt;/code&gt;), using easier math problems (GSM8K) where high-quality parallel traces are easy to generate.&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;1717&quot; data-start=&quot;1520&quot;&gt;
&lt;p data-end=&quot;1717&quot; data-start=&quot;1523&quot;&gt;&lt;strong data-end=&quot;1568&quot; data-start=&quot;1523&quot;&gt;Reinforcement learning (RL) on easy tasks&lt;/strong&gt;, to explore usage of parallel thinking, with reward that combines correctness + usage of parallel structure.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;1891&quot; data-start=&quot;1718&quot;&gt;
&lt;p data-end=&quot;1891&quot; data-start=&quot;1721&quot;&gt;&lt;strong data-end=&quot;1754&quot; data-start=&quot;1721&quot;&gt;RL on more difficult problems&lt;/strong&gt; (e.g. DAPO, AMC, AIME), so the model generalizes both performance and the parallel thinking style.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p data-end=&quot;2317&quot; data-start=&quot;1893&quot;&gt;The architecture has two variants: a &lt;strong data-end=&quot;1961&quot; data-start=&quot;1930&quot;&gt;causal (structure-agnostic)&lt;/strong&gt; version and a &lt;strong data-end=&quot;1990&quot; data-start=&quot;1976&quot;&gt;structured&lt;/strong&gt; version. The structured version modifies the attention mechanism (via path-window masking, separate position encodings) so paths are more isolated during reasoning. But structured variants show trade-offs—good for generalization in some settings, but less robust under distribution shift.&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;hr data-end=&quot;2322&quot; data-start=&quot;2319&quot; /&gt;
&lt;h2 data-end=&quot;2342&quot; data-start=&quot;2324&quot;&gt;Results &amp;amp; gains&lt;/h2&gt;
&lt;p data-end=&quot;2449&quot; data-start=&quot;2344&quot;&gt;On a battery of math benchmarks (MATH, AMC23, AIME24, AIME25), Parallel-R1 shows consistent improvements:&lt;/p&gt;
&lt;ul data-end=&quot;3114&quot; data-start=&quot;2451&quot;&gt;
&lt;li data-end=&quot;2641&quot; data-start=&quot;2451&quot;&gt;
&lt;p data-end=&quot;2641&quot; data-start=&quot;2453&quot;&gt;The “Seen” variant (causal) achieves &lt;strong data-end=&quot;2508&quot; data-start=&quot;2490&quot;&gt;~48.9% average&lt;/strong&gt; across benchmarks (Mean@16 / Pass@16, etc.), beating baseline GRPO RL on general math tasks.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2823&quot; data-start=&quot;2642&quot;&gt;
&lt;p data-end=&quot;2823&quot; data-start=&quot;2644&quot;&gt;In particular, on &lt;strong data-end=&quot;2673&quot; data-start=&quot;2662&quot;&gt;AIME’25&lt;/strong&gt;, Parallel-R1 raises accuracy by ~8.4% over a purely sequential RL model trained on the harder tasks directly.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;3114&quot; data-start=&quot;2824&quot;&gt;
&lt;p data-end=&quot;3114&quot; data-start=&quot;2826&quot;&gt;The structured (Unseen) variant also performs well under certain reward schedules; the “alternating ACC/PAR” reward schedule (switching between rewarding correctness and parallel structure periodically) helps balance parallel usage and performance.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-end=&quot;3524&quot; data-start=&quot;3116&quot;&gt;Beyond numerical gains, the authors observe a &lt;strong data-end=&quot;3182&quot; data-start=&quot;3162&quot;&gt;behavioral shift&lt;/strong&gt;: early in training, the model heavily uses parallel paths as an &lt;em data-end=&quot;3265&quot; data-start=&quot;3247&quot;&gt;exploration tool&lt;/em&gt;, branching in many places; as the model becomes stronger, it shifts to using parallel paths more conservatively, mostly for &lt;em data-end=&quot;3404&quot; data-start=&quot;3390&quot;&gt;verification&lt;/em&gt; near the end of reasoning. This shift correlates with stronger final performance.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;hr data-end=&quot;3529&quot; data-start=&quot;3526&quot; /&gt;
&lt;h2 data-end=&quot;3550&quot; data-start=&quot;3531&quot;&gt;Why this matters&lt;/h2&gt;
&lt;ul data-end=&quot;4453&quot; data-start=&quot;3552&quot;&gt;
&lt;li data-end=&quot;3780&quot; data-start=&quot;3552&quot;&gt;
&lt;p data-end=&quot;3780&quot; data-start=&quot;3554&quot;&gt;&lt;strong data-end=&quot;3592&quot; data-start=&quot;3554&quot;&gt;Performance &amp;amp; efficiency trade-off&lt;/strong&gt;: Parallel-R1 shows that training models for parallel thinking can yield higher reasoning ability without ballooning inference cost (since only when needed are parallel paths triggered).&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;4072&quot; data-start=&quot;3781&quot;&gt;
&lt;p data-end=&quot;4072&quot; data-start=&quot;3783&quot;&gt;&lt;strong data-end=&quot;3808&quot; data-start=&quot;3783&quot;&gt;Better than imitation&lt;/strong&gt;: Many earlier works used supervised fine-tuning on synthetic parallel reasoning traces under teacher forcing; but those often over-fit to particular patterns. RL in Parallel-R1 helps models learn to &lt;em data-end=&quot;4016&quot; data-start=&quot;4008&quot;&gt;decide&lt;/em&gt; when parallel paths help, not just how to mimic them.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;4262&quot; data-start=&quot;4073&quot;&gt;
&lt;p data-end=&quot;4262&quot; data-start=&quot;4075&quot;&gt;&lt;strong data-end=&quot;4102&quot; data-start=&quot;4075&quot;&gt;Scaffolding exploration&lt;/strong&gt;: The cold-start + easy tasks + alternating reward strategy functions as a scaffold, enabling RL to find a stronger policy space than direct RL on hard tasks.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;4453&quot; data-start=&quot;4263&quot;&gt;
&lt;p data-end=&quot;4453&quot; data-start=&quot;4265&quot;&gt;&lt;strong data-end=&quot;4296&quot; data-start=&quot;4265&quot;&gt;Architecture designs matter&lt;/strong&gt;: The structured variant shows that attention masking and position encodings can help or hurt depending on how well training data matches deployment tasks.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr data-end=&quot;4458&quot; data-start=&quot;4455&quot; /&gt;
&lt;h2 data-end=&quot;4494&quot; data-start=&quot;4460&quot;&gt;Limitations &amp;amp; future directions&lt;/h2&gt;
&lt;ul data-end=&quot;5107&quot; data-start=&quot;4496&quot;&gt;
&lt;li data-end=&quot;4603&quot; data-start=&quot;4496&quot;&gt;
&lt;p data-end=&quot;4603&quot; data-start=&quot;4498&quot;&gt;The gains, though significant, still leave room before human-level performance in very hard math tasks.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;4741&quot; data-start=&quot;4604&quot;&gt;
&lt;p data-end=&quot;4741&quot; data-start=&quot;4606&quot;&gt;The structured variants can struggle under domain shift; care needed in architectural changes that assume particular path structures.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;4900&quot; data-start=&quot;4742&quot;&gt;
&lt;p data-end=&quot;4900&quot; data-start=&quot;4744&quot;&gt;Triggering parallel thinking (using &lt;code data-end=&quot;4792&quot; data-start=&quot;4780&quot;&gt;&amp;lt;Parallel&amp;gt;&lt;/code&gt; blocks) costs some token and compute overhead, though the model learns to use it more sparsely over time.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;5107&quot; data-start=&quot;4901&quot;&gt;
&lt;p data-end=&quot;5107&quot; data-start=&quot;4903&quot;&gt;There’s a balance tension between pushing for parallel structure (which encourages exploration) and maximizing accuracy (which sometimes pushes toward fewer divergences). Reward engineering is delicate.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr data-end=&quot;5112&quot; data-start=&quot;5109&quot; /&gt;
&lt;p data-end=&quot;5580&quot; data-start=&quot;5114&quot;&gt;&lt;strong data-end=&quot;5130&quot; data-start=&quot;5114&quot;&gt;Bottom line:&lt;/strong&gt; Parallel-R1 is a breakthrough toward training LLMs that &lt;em data-end=&quot;5223&quot; data-start=&quot;5187&quot;&gt;think in parallel, not just deeper&lt;/em&gt;. By combining curriculum learning, structured or causal variants, and reinforcement learning with rewards for both correctness and reasoning style, it unlocks better performance on challenging math tasks. As reasoning benchmarks and applications demand both correctness &lt;em data-end=&quot;5499&quot; data-start=&quot;5494&quot;&gt;and&lt;/em&gt; robustness, methods like this will likely become a standard part of the toolkit.&lt;/p&gt;
&lt;p data-end=&quot;5654&quot; data-start=&quot;5582&quot;&gt;&lt;em data-end=&quot;5654&quot; data-start=&quot;5582&quot;&gt;Paper link: &lt;a class=&quot;decorated-link&quot; data-end=&quot;5653&quot; data-start=&quot;5595&quot; href=&quot;https://arxiv.org/pdf/2509.07980&quot; rel=&quot;noopener&quot; target=&quot;_new&quot;&gt;arXiv 2509.07980 (PDF)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/3309453393596501930/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/3309453393596501930?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/3309453393596501930'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/3309453393596501930'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2025/09/parallel-r1-teaching-llms-to-reason.html' title='Parallel-R1: Teaching LLMs to reason from multiple angles—permanently'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-5561906254129119233</id><published>2025-09-11T17:10:00.009+08:00</published><updated>2025-09-11T17:10:56.517+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="majority vote"/><category scheme="http://www.blogger.com/atom/ns#" term="math contests"/><category scheme="http://www.blogger.com/atom/ns#" term="Meta FAIR"/><category scheme="http://www.blogger.com/atom/ns#" term="minority correctness"/><category scheme="http://www.blogger.com/atom/ns#" term="model efficiency"/><category scheme="http://www.blogger.com/atom/ns#" term="Reasoning"/><category scheme="http://www.blogger.com/atom/ns#" term="Reinforcement Learning"/><category scheme="http://www.blogger.com/atom/ns#" term="solution aggregation"/><category scheme="http://www.blogger.com/atom/ns#" term="test-time scaling"/><title type='text'>The Majority Isn’t Always Right: AggLM Learns to Aggregate Better Than Voting</title><content type='html'>&lt;p&gt;&amp;nbsp;When logic is tricky, the most common answer isn’t always the correct one. A new Meta/Fair &amp;amp; CMU paper titled &lt;em data-end=&quot;688&quot; data-start=&quot;614&quot;&gt;“The Majority is not always right: RL training for solution aggregation”&lt;/em&gt; challenges the standard practice of combining LLM outputs via voting or reward-scored selection. Their method—&lt;strong data-end=&quot;808&quot; data-start=&quot;799&quot;&gt;AggLM&lt;/strong&gt;—trains a dedicated aggregator model to &lt;em data-end=&quot;866&quot; data-start=&quot;848&quot;&gt;review, correct,&lt;/em&gt; and &lt;em data-end=&quot;883&quot; data-start=&quot;871&quot;&gt;synthesize&lt;/em&gt; among multiple LLM-generated candidate solutions via reinforcement learning from verifiable rewards (RLVR), yielding big gains over majority voting and reward model baselines.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;hr data-end=&quot;1104&quot; data-start=&quot;1101&quot; /&gt;
&lt;h2 data-end=&quot;1158&quot; data-start=&quot;1106&quot;&gt;Solving it: learned reconciliation vs. counting&lt;/h2&gt;
&lt;p data-end=&quot;1891&quot; data-start=&quot;1160&quot;&gt;Standard aggregation in LLM reasoning often works like this: sample many candidate solutions, then pick the answer that&#39;s most frequent (&lt;em data-end=&quot;1314&quot; data-start=&quot;1297&quot;&gt;majority voting&lt;/em&gt;) or highest scored by some reward model. While effective in many settings, these methods have a blind spot—when correct answers exist &lt;strong data-end=&quot;1457&quot; data-start=&quot;1449&quot;&gt;only&lt;/strong&gt; among minority solutions. In contrast, AggLM treats aggregation itself as a reasoning task. It takes a set of candidate solutions, &lt;strong data-end=&quot;1606&quot; data-start=&quot;1589&quot;&gt;analyzes them&lt;/strong&gt;, spots mistakes or partial correctness, then &lt;strong data-end=&quot;1664&quot; data-start=&quot;1652&quot;&gt;combines&lt;/strong&gt; ideas or corrects missing steps to produce a final solution. Importantly, it’s trained using &lt;strong data-end=&quot;1780&quot; data-start=&quot;1758&quot;&gt;verifiable rewards&lt;/strong&gt;—i.e. only when the aggregated output matches a known correct solution.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;hr data-end=&quot;1896&quot; data-start=&quot;1893&quot; /&gt;
&lt;h2 data-end=&quot;1932&quot; data-start=&quot;1898&quot;&gt;Key ingredients &amp;amp; experiments&lt;/h2&gt;
&lt;ul data-end=&quot;2922&quot; data-start=&quot;1934&quot;&gt;
&lt;li data-end=&quot;2329&quot; data-start=&quot;1934&quot;&gt;
&lt;p data-end=&quot;2329&quot; data-start=&quot;1936&quot;&gt;&lt;strong data-end=&quot;1958&quot; data-start=&quot;1936&quot;&gt;Dataset &amp;amp; training&lt;/strong&gt;: Using Qwen3-1.7B as the solution generator, AggLM-1.7B is trained on ~446,000 examples drawn from a mixture of “easy” and “hard” sets. Hard sets are those where the majority answer among candidates is actually incorrect; the mix helps the model learn both to follow the majority and to rescue correctness from minority solutions.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2618&quot; data-start=&quot;2331&quot;&gt;
&lt;p data-end=&quot;2618&quot; data-start=&quot;2333&quot;&gt;&lt;strong data-end=&quot;2357&quot; data-start=&quot;2333&quot;&gt;Aggregation via RLVR&lt;/strong&gt;: The model uses &lt;strong data-end=&quot;2419&quot; data-start=&quot;2374&quot;&gt;Group-Relative Policy Optimization (GRPO)&lt;/strong&gt;, with a binary reward (1 for matching the ground truth, 0 otherwise). The aggregator is initialized from the Qwen3-1.7B model but is tuned via this RL signal.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2922&quot; data-start=&quot;2620&quot;&gt;
&lt;p data-end=&quot;2922&quot; data-start=&quot;2622&quot;&gt;&lt;strong data-end=&quot;2636&quot; data-start=&quot;2622&quot;&gt;Benchmarks&lt;/strong&gt;: Evaluated on four math contest datasets: &lt;strong data-end=&quot;2713&quot; data-start=&quot;2679&quot;&gt;AIME24, AIME25, HMMT24, HMMT25&lt;/strong&gt;. AggLM was tested aggregating candidate solutions from both the same generator model (Qwen3-1.7B) and stronger ones (Qwen3-8B), in both thinking and non-thinking modes.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr data-end=&quot;2927&quot; data-start=&quot;2924&quot; /&gt;
&lt;h2 data-end=&quot;2960&quot; data-start=&quot;2929&quot;&gt;Results &amp;amp; token-efficiency&lt;/h2&gt;
&lt;ul data-end=&quot;4052&quot; data-start=&quot;2962&quot;&gt;
&lt;li data-end=&quot;3437&quot; data-start=&quot;2962&quot;&gt;
&lt;p data-end=&quot;3437&quot; data-start=&quot;2964&quot;&gt;On solutions from Qwen3-1.7B in thinking mode, AggLM-1.7B lifts accuracy significantly. For example, on &lt;strong data-end=&quot;3078&quot; data-start=&quot;3068&quot;&gt;AIME25&lt;/strong&gt;, majority voting with 8 candidates yields ~67.9%, while AggLM pushes it to &lt;strong data-end=&quot;3163&quot; data-start=&quot;3154&quot;&gt;50.0%&lt;/strong&gt; in a different benchmark context (depending on the exact evaluation variant). More striking, when aggregating from the stronger 8B model, AggLM still outperforms majority voting, weighted voting, and reward-model selection baselines.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;3722&quot; data-start=&quot;3439&quot;&gt;
&lt;p data-end=&quot;3722&quot; data-start=&quot;3441&quot;&gt;In &lt;strong data-end=&quot;3460&quot; data-start=&quot;3444&quot;&gt;non-thinking&lt;/strong&gt; modes (i.e. when the candidate-generating model is weaker or does not use chain-of-thought reasoning), AggLM retains its lead—showing that it generalizes beyond just cherry-picking strong or specifically-formatted inputs.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;4052&quot; data-start=&quot;3724&quot;&gt;
&lt;p data-end=&quot;4052&quot; data-start=&quot;3726&quot;&gt;Regarding cost, AggLM is more &lt;strong data-end=&quot;3775&quot; data-start=&quot;3756&quot;&gt;token efficient&lt;/strong&gt;: instead of needing large numbers of candidate solutions (i.e. very large &lt;em data-end=&quot;3853&quot; data-start=&quot;3850&quot;&gt;k&lt;/em&gt;) for majority voting to reach high accuracy, AggLM achieves similar or better accuracy with fewer candidate solutions, saving both inference time and compute.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr data-end=&quot;4057&quot; data-start=&quot;4054&quot; /&gt;
&lt;h2 data-end=&quot;4090&quot; data-start=&quot;4059&quot;&gt;Implications &amp;amp; what’s next&lt;/h2&gt;
&lt;p data-end=&quot;4126&quot; data-start=&quot;4092&quot;&gt;AggLM shifts thinking in two ways:&lt;/p&gt;
&lt;ol data-end=&quot;5141&quot; data-start=&quot;4128&quot;&gt;
&lt;li data-end=&quot;4375&quot; data-start=&quot;4128&quot;&gt;
&lt;p data-end=&quot;4375&quot; data-start=&quot;4131&quot;&gt;&lt;strong data-end=&quot;4160&quot; data-start=&quot;4131&quot;&gt;Aggregation as reasoning.&lt;/strong&gt; Aggregation isn’t just picking among options—it’s an opportunity to correct, synthesize, and integrate partial truths. Models that can do that perform better, especially in instances where majority answers mislead.&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;4604&quot; data-start=&quot;4377&quot;&gt;
&lt;p data-end=&quot;4604&quot; data-start=&quot;4380&quot;&gt;&lt;strong data-end=&quot;4410&quot; data-start=&quot;4380&quot;&gt;Balancing examples is key.&lt;/strong&gt; Training on a mix of easy and hard cases was essential. If you train only on “easy” majority-correct groups, or only on “hard” ones, performance suffers.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;4889&quot; data-start=&quot;4606&quot;&gt;
&lt;p data-end=&quot;4889&quot; data-start=&quot;4609&quot;&gt;&lt;strong data-end=&quot;4655&quot; data-start=&quot;4609&quot;&gt;Generalization beyond training generators.&lt;/strong&gt; AggLM works well even when aggregating from stronger models than those used during training—implying aggregation skills are transferable, not just overfitted to particular output distributions.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;5141&quot; data-start=&quot;4891&quot;&gt;
&lt;p data-end=&quot;5141&quot; data-start=&quot;4894&quot;&gt;&lt;strong data-end=&quot;4919&quot; data-start=&quot;4894&quot;&gt;Efficiency trade-off.&lt;/strong&gt; Instead of scaling &lt;em data-end=&quot;4942&quot; data-start=&quot;4939&quot;&gt;k&lt;/em&gt; (number of solutions) to very high values, using a learned aggregator yields larger gains per additional candidate, meaning happier ceilings on tokens/time.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;hr data-end=&quot;5146&quot; data-start=&quot;5143&quot; /&gt;
&lt;p data-end=&quot;5711&quot; data-start=&quot;5148&quot;&gt;&lt;strong data-end=&quot;5164&quot; data-start=&quot;5148&quot;&gt;Bottom line:&lt;/strong&gt; AggLM demonstrates that “the majority vote” should not be the default in reasoning aggregation. Models that are trained to &lt;em data-end=&quot;5301&quot; data-start=&quot;5288&quot;&gt;look across&lt;/em&gt; candidate solutions—identify hidden truth, correct errors, and combine the best ideas—do better than simple heuristics. Especially in math and logic tasks where minority correct answers exist, learned aggregation via RL with verifiable reward is a strong lever. If you’re designing agents or reasoning pipelines, integrating an aggregator like AggLM can be a powerful performance boost with reasonable cost.&lt;/p&gt;
&lt;p data-end=&quot;5785&quot; data-start=&quot;5713&quot;&gt;&lt;em data-end=&quot;5785&quot; data-start=&quot;5713&quot;&gt;Paper link: &lt;a class=&quot;decorated-link&quot; data-end=&quot;5784&quot; data-start=&quot;5726&quot; href=&quot;https://arxiv.org/pdf/2509.06870&quot; rel=&quot;noopener&quot; target=&quot;_new&quot;&gt;arXiv 2509.06870 (PDF)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/5561906254129119233/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/5561906254129119233?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/5561906254129119233'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/5561906254129119233'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2025/09/the-majority-isnt-always-right-agglm.html' title='The Majority Isn’t Always Right: AggLM Learns to Aggregate Better Than Voting'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-2849186173751043698</id><published>2025-09-11T10:57:00.006+08:00</published><updated>2025-09-11T10:57:00.116+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="AIME"/><category scheme="http://www.blogger.com/atom/ns#" term="AMC"/><category scheme="http://www.blogger.com/atom/ns#" term="chain-of-thought"/><category scheme="http://www.blogger.com/atom/ns#" term="control tokens"/><category scheme="http://www.blogger.com/atom/ns#" term="MATH-500"/><category scheme="http://www.blogger.com/atom/ns#" term="parallel thinking"/><category scheme="http://www.blogger.com/atom/ns#" term="ParaThinker"/><category scheme="http://www.blogger.com/atom/ns#" term="positional embeddings"/><category scheme="http://www.blogger.com/atom/ns#" term="test-time compute"/><category scheme="http://www.blogger.com/atom/ns#" term="Tsinghua AIR"/><category scheme="http://www.blogger.com/atom/ns#" term="Tunnel Vision"/><title type='text'>ParaThinker: parallel minds beat longer monologues</title><content type='html'>&lt;p&gt;&amp;nbsp;LLMs have ridden &lt;strong data-end=&quot;572&quot; data-start=&quot;551&quot;&gt;test-time compute&lt;/strong&gt;—“think longer” chains of thought—but returns taper as early tokens lock models into bad trajectories. Tsinghua’s &lt;strong data-end=&quot;701&quot; data-start=&quot;686&quot;&gt;ParaThinker&lt;/strong&gt; calls this &lt;strong data-end=&quot;730&quot; data-start=&quot;713&quot;&gt;Tunnel Vision&lt;/strong&gt; and proposes &lt;strong data-end=&quot;774&quot; data-start=&quot;744&quot;&gt;native thought parallelism&lt;/strong&gt;: generate several independent reasoning paths &lt;strong data-end=&quot;839&quot; data-start=&quot;821&quot;&gt;simultaneously&lt;/strong&gt;, then fuse them into one answer.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;1370&quot; data-start=&quot;912&quot;&gt;Instead of external voting, ParaThinker trains the model itself to branch and merge: &lt;strong data-end=&quot;1027&quot; data-start=&quot;997&quot;&gt;specialized control tokens&lt;/strong&gt; (&lt;code data-end=&quot;1040&quot; data-start=&quot;1029&quot;&gt;&amp;lt;think i&amp;gt;&lt;/code&gt;) trigger distinct trajectories, &lt;strong data-end=&quot;1112&quot; data-start=&quot;1073&quot;&gt;path-specific positional embeddings&lt;/strong&gt; keep streams separate, and a &lt;strong data-end=&quot;1170&quot; data-start=&quot;1142&quot;&gt;two-phase attention mask&lt;/strong&gt; enforces independence during thinking and controlled integration during summarization. The KV cache from the thinking stage is reused, avoiding re-prefill costs.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;1807&quot; data-start=&quot;1372&quot;&gt;On AIME-24/25, AMC-23 and MATH-500, ParaThinker with &lt;strong data-end=&quot;1445&quot; data-start=&quot;1425&quot;&gt;8 parallel paths&lt;/strong&gt; boosts accuracy by &lt;strong data-end=&quot;1485&quot; data-start=&quot;1465&quot;&gt;+12.3 pts (1.5B)&lt;/strong&gt; and &lt;strong data-end=&quot;1507&quot; data-start=&quot;1490&quot;&gt;+7.5 pts (7B)&lt;/strong&gt; over sequential baselines under the same token budget, and still &lt;strong data-end=&quot;1598&quot; data-start=&quot;1573&quot;&gt;beats majority voting&lt;/strong&gt; by &lt;strong data-end=&quot;1619&quot; data-start=&quot;1602&quot;&gt;+4.3/+2.0 pts&lt;/strong&gt;—with only &lt;strong data-end=&quot;1639&quot; data-start=&quot;1630&quot;&gt;~7.1%&lt;/strong&gt; latency overhead. Generating up to &lt;strong data-end=&quot;1681&quot; data-start=&quot;1675&quot;&gt;16&lt;/strong&gt; paths costs &lt;strong data-end=&quot;1701&quot; data-start=&quot;1694&quot;&gt;&amp;lt;2×&lt;/strong&gt; single-path latency, thanks to better arithmetic intensity on GPUs.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;2097&quot; data-start=&quot;1809&quot;&gt;The takeaway: &lt;strong data-end=&quot;1854&quot; data-start=&quot;1823&quot;&gt;scale width, not just depth&lt;/strong&gt;. ParaThinker shows that orchestrating compute across &lt;strong data-end=&quot;1938&quot; data-start=&quot;1908&quot;&gt;diverse, parallel thoughts&lt;/strong&gt; unlocks latent reasoning ability and makes smaller models out-punch larger sequential ones. Code is available on GitHub.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;2171&quot; data-start=&quot;2099&quot;&gt;&lt;em data-end=&quot;2171&quot; data-start=&quot;2099&quot;&gt;Paper link: &lt;a class=&quot;decorated-link&quot; data-end=&quot;2170&quot; data-start=&quot;2112&quot; href=&quot;https://arxiv.org/pdf/2509.04475&quot; rel=&quot;noopener&quot; target=&quot;_new&quot;&gt;arXiv 2509.04475 (PDF)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/2849186173751043698/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/2849186173751043698?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/2849186173751043698'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/2849186173751043698'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2025/09/parathinker-parallel-minds-beat-longer.html' title='ParaThinker: parallel minds beat longer monologues'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-395633679661135047</id><published>2025-09-10T18:00:00.000+08:00</published><updated>2025-09-10T18:00:00.117+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="block diffusion"/><category scheme="http://www.blogger.com/atom/ns#" term="coding"/><category scheme="http://www.blogger.com/atom/ns#" term="diffusion language models"/><category scheme="http://www.blogger.com/atom/ns#" term="full-attention DLMs"/><category scheme="http://www.blogger.com/atom/ns#" term="KV-cache"/><category scheme="http://www.blogger.com/atom/ns#" term="long chain-of-thought"/><category scheme="http://www.blogger.com/atom/ns#" term="Math Reasoning"/><category scheme="http://www.blogger.com/atom/ns#" term="TraceRL"/><category scheme="http://www.blogger.com/atom/ns#" term="TraDo-4B"/><category scheme="http://www.blogger.com/atom/ns#" term="TraDo-8B"/><category scheme="http://www.blogger.com/atom/ns#" term="value model"/><title type='text'>TraceRL puts diffusion LLMs on the reasoning map</title><content type='html'>&lt;p&gt;&amp;nbsp;Autoregressive (AR) giants have dominated reasoning benchmarks, while diffusion language models (DLMs) were seen as “fast samplers” with limited logic chops. A new paper from &lt;strong data-end=&quot;782&quot; data-start=&quot;769&quot;&gt;Princeton&lt;/strong&gt; and &lt;strong data-end=&quot;799&quot; data-start=&quot;787&quot;&gt;UChicago&lt;/strong&gt; argues that’s mostly a training-objective problem—and offers &lt;strong data-end=&quot;872&quot; data-start=&quot;861&quot;&gt;TraceRL&lt;/strong&gt;, a &lt;strong data-end=&quot;896&quot; data-start=&quot;876&quot;&gt;trajectory-aware&lt;/strong&gt; reinforcement learning framework that aligns what a DLM learns with how it actually &lt;strong data-end=&quot;992&quot; data-start=&quot;981&quot;&gt;samples&lt;/strong&gt;. The team also releases code and ready-to-run models under the &lt;strong data-end=&quot;1065&quot; data-start=&quot;1056&quot;&gt;TraDo&lt;/strong&gt; banner.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h2 data-end=&quot;1126&quot; data-start=&quot;1113&quot;&gt;What’s new&lt;/h2&gt;
&lt;ul data-end=&quot;1824&quot; data-start=&quot;1127&quot;&gt;
&lt;li data-end=&quot;1553&quot; data-start=&quot;1127&quot;&gt;
&lt;p data-end=&quot;1553&quot; data-start=&quot;1129&quot;&gt;&lt;strong data-end=&quot;1162&quot; data-start=&quot;1129&quot;&gt;Trajectory-aware RL for DLMs.&lt;/strong&gt; Instead of scoring randomly masked sequences, &lt;strong data-end=&quot;1220&quot; data-start=&quot;1209&quot;&gt;TraceRL&lt;/strong&gt; optimizes against the model’s &lt;strong data-end=&quot;1284&quot; data-start=&quot;1251&quot;&gt;intermediate inference traces&lt;/strong&gt;, matching the left-to-right / blockwise behavior used at decode time. A &lt;strong data-end=&quot;1388&quot; data-start=&quot;1357&quot;&gt;diffusion-based value model&lt;/strong&gt; stabilizes training by reducing variance. Crucially, the method works for &lt;strong data-end=&quot;1481&quot; data-start=&quot;1463&quot;&gt;full-attention&lt;/strong&gt; &lt;em data-end=&quot;1487&quot; data-start=&quot;1482&quot;&gt;and&lt;/em&gt; &lt;strong data-end=&quot;1507&quot; data-start=&quot;1488&quot;&gt;block-attention&lt;/strong&gt; DLMs.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;1824&quot; data-start=&quot;1554&quot;&gt;
&lt;p data-end=&quot;1824&quot; data-start=&quot;1556&quot;&gt;&lt;strong data-end=&quot;1571&quot; data-start=&quot;1556&quot;&gt;Open stack.&lt;/strong&gt; The release includes a framework to build/train/deploy DLMs across architectures, with &lt;strong data-end=&quot;1684&quot; data-start=&quot;1659&quot;&gt;KV-cache acceleration&lt;/strong&gt;, inference engines, SFT + RL recipes for &lt;strong data-end=&quot;1743&quot; data-start=&quot;1726&quot;&gt;math and code&lt;/strong&gt;, and links to &lt;strong data-end=&quot;1773&quot; data-start=&quot;1758&quot;&gt;TraDo-4B/8B&lt;/strong&gt; checkpoints.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 data-end=&quot;1841&quot; data-start=&quot;1826&quot;&gt;The receipts&lt;/h2&gt;
&lt;p data-end=&quot;2024&quot; data-start=&quot;1842&quot;&gt;On headline benchmarks (dynamic vs. static sampling shown in the paper), the TraDo models post the strongest DLM numbers to date and &lt;strong data-end=&quot;2013&quot; data-start=&quot;1975&quot;&gt;overtake AR peers at similar scale&lt;/strong&gt; on math:&lt;/p&gt;
&lt;ul data-end=&quot;2631&quot; data-start=&quot;2025&quot;&gt;
&lt;li data-end=&quot;2261&quot; data-start=&quot;2025&quot;&gt;
&lt;p data-end=&quot;2261&quot; data-start=&quot;2027&quot;&gt;&lt;strong data-end=&quot;2049&quot; data-start=&quot;2027&quot;&gt;TraDo-8B-Instruct:&lt;/strong&gt; &lt;strong data-end=&quot;2066&quot; data-start=&quot;2050&quot;&gt;MATH500 78.5&lt;/strong&gt;, &lt;strong data-end=&quot;2084&quot; data-start=&quot;2068&quot;&gt;AIME’24 13.3&lt;/strong&gt;, &lt;strong data-end=&quot;2101&quot; data-start=&quot;2086&quot;&gt;LCB-V2 25.9&lt;/strong&gt;—a &lt;strong data-end=&quot;2113&quot; data-start=&quot;2104&quot;&gt;+6.1%&lt;/strong&gt; relative lift over &lt;strong data-end=&quot;2156&quot; data-start=&quot;2133&quot;&gt;Qwen2.5-7B-Instruct&lt;/strong&gt; and &lt;strong data-end=&quot;2171&quot; data-start=&quot;2161&quot;&gt;+51.3%&lt;/strong&gt; over &lt;strong data-end=&quot;2202&quot; data-start=&quot;2177&quot;&gt;Llama-3.1-8B-Instruct&lt;/strong&gt; on math reasoning.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2424&quot; data-start=&quot;2262&quot;&gt;
&lt;p data-end=&quot;2424&quot; data-start=&quot;2264&quot;&gt;&lt;strong data-end=&quot;2286&quot; data-start=&quot;2264&quot;&gt;TraDo-4B-Instruct:&lt;/strong&gt; &lt;strong data-end=&quot;2303&quot; data-start=&quot;2287&quot;&gt;MATH500 75.6&lt;/strong&gt;, &lt;strong data-end=&quot;2321&quot; data-start=&quot;2305&quot;&gt;AIME’24 10.3&lt;/strong&gt;, &lt;strong data-end=&quot;2338&quot; data-start=&quot;2323&quot;&gt;LCB-V2 18.7&lt;/strong&gt;, consistently edging 7B AR baselines on math.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2631&quot; data-start=&quot;2425&quot;&gt;
&lt;p data-end=&quot;2631&quot; data-start=&quot;2427&quot;&gt;&lt;strong data-end=&quot;2460&quot; data-start=&quot;2427&quot;&gt;TraDo-8B-Thinking (long-CoT):&lt;/strong&gt; first &lt;strong data-end=&quot;2492&quot; data-start=&quot;2467&quot;&gt;long chain-of-thought&lt;/strong&gt; diffusion LLM, hitting &lt;strong data-end=&quot;2532&quot; data-start=&quot;2516&quot;&gt;MATH500 87.4&lt;/strong&gt;, &lt;strong data-end=&quot;2550&quot; data-start=&quot;2534&quot;&gt;AIME’24 35.5&lt;/strong&gt;, &lt;strong data-end=&quot;2567&quot; data-start=&quot;2552&quot;&gt;LCB-V2 34.6&lt;/strong&gt; with very long answers.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-end=&quot;2938&quot; data-start=&quot;2633&quot;&gt;The authors attribute gains to objective/trajectory alignment and show smoother curves with the &lt;strong data-end=&quot;2744&quot; data-start=&quot;2729&quot;&gt;value model&lt;/strong&gt; vs. policy-only RL. They also document a &lt;strong data-end=&quot;2814&quot; data-start=&quot;2786&quot;&gt;speed/accuracy trade-off&lt;/strong&gt;: &lt;strong data-end=&quot;2827&quot; data-start=&quot;2816&quot;&gt;dynamic&lt;/strong&gt; sampling is faster; &lt;strong data-end=&quot;2858&quot; data-start=&quot;2848&quot;&gt;static&lt;/strong&gt; top-1 decoding squeezes out extra points.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h2 data-end=&quot;2957&quot; data-start=&quot;2940&quot;&gt;Why it matters&lt;/h2&gt;
&lt;ol data-end=&quot;3457&quot; data-start=&quot;2958&quot;&gt;
&lt;li data-end=&quot;3457&quot; data-start=&quot;2958&quot;&gt;
&lt;p data-end=&quot;3457&quot; data-start=&quot;2961&quot;&gt;&lt;strong data-end=&quot;3005&quot; data-start=&quot;2961&quot;&gt;DLMs aren’t just “fast”—they can reason.&lt;/strong&gt; With the right RL target, parallel generation stacks clear long-form math and coding hurdles previously ceded to AR. 2) &lt;strong data-end=&quot;3146&quot; data-start=&quot;3126&quot;&gt;Unifies the zoo.&lt;/strong&gt; One RL recipe spans &lt;strong data-end=&quot;3185&quot; data-start=&quot;3167&quot;&gt;full-attention&lt;/strong&gt; and &lt;strong data-end=&quot;3209&quot; data-start=&quot;3190&quot;&gt;block-diffusion&lt;/strong&gt;, and even helps &lt;strong data-end=&quot;3248&quot; data-start=&quot;3226&quot;&gt;enlarge block size&lt;/strong&gt; for more flexible sampling. 3) &lt;strong data-end=&quot;3299&quot; data-start=&quot;3280&quot;&gt;Practical path.&lt;/strong&gt; The open framework + KV-cache tricks make DLM post-training and deployment feel product-ready, not just a lab exercise.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 data-end=&quot;3473&quot; data-start=&quot;3459&quot;&gt;Setup notes&lt;/h2&gt;
&lt;p data-end=&quot;3681&quot; data-start=&quot;3474&quot;&gt;Math RL uses &lt;strong data-end=&quot;3493&quot; data-start=&quot;3487&quot;&gt;8k&lt;/strong&gt; hard MATH tasks; coding RL uses &lt;strong data-end=&quot;3532&quot; data-start=&quot;3526&quot;&gt;6k&lt;/strong&gt; verified problems from PrimeIntellect. Long-CoT training mixes &lt;strong data-end=&quot;3607&quot; data-start=&quot;3596&quot;&gt;TraceRL&lt;/strong&gt; with long-form SFT as a curriculum.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;3910&quot; data-start=&quot;3683&quot;&gt;Bottom line: &lt;strong data-end=&quot;3707&quot; data-start=&quot;3696&quot;&gt;TraceRL&lt;/strong&gt; reframes diffusion LLMs as credible &lt;strong data-end=&quot;3757&quot; data-start=&quot;3744&quot;&gt;reasoners&lt;/strong&gt;, not just fast generators—and &lt;strong data-end=&quot;3809&quot; data-start=&quot;3788&quot;&gt;TraDo-8B-Thinking&lt;/strong&gt; plants the first long-CoT flag on the DLM side of the field.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;3984&quot; data-start=&quot;3912&quot;&gt;&lt;em data-end=&quot;3984&quot; data-start=&quot;3912&quot;&gt;Paper link: &lt;a class=&quot;decorated-link&quot; data-end=&quot;3983&quot; data-start=&quot;3925&quot; href=&quot;https://arxiv.org/pdf/2509.06949&quot; rel=&quot;noopener&quot; target=&quot;_new&quot;&gt;arXiv 2509.06949 (PDF)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/395633679661135047/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/395633679661135047?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/395633679661135047'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/395633679661135047'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2025/09/tracerl-puts-diffusion-llms-on.html' title='TraceRL puts diffusion LLMs on the reasoning map'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-686431685480840724</id><published>2025-09-10T14:00:00.000+08:00</published><updated>2025-09-10T14:00:00.119+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="AlpacaEval"/><category scheme="http://www.blogger.com/atom/ns#" term="data-free training"/><category scheme="http://www.blogger.com/atom/ns#" term="GPT-4o judge"/><category scheme="http://www.blogger.com/atom/ns#" term="GRPO"/><category scheme="http://www.blogger.com/atom/ns#" term="Language Self-Play"/><category scheme="http://www.blogger.com/atom/ns#" term="Llama-3.2-3B-Instruct"/><category scheme="http://www.blogger.com/atom/ns#" term="Meta AI"/><category scheme="http://www.blogger.com/atom/ns#" term="self-play RL"/><category scheme="http://www.blogger.com/atom/ns#" term="self-reward"/><category scheme="http://www.blogger.com/atom/ns#" term="Skywork Reward V2"/><title type='text'>Language Self-Play: training an LLM without adding data actually works</title><content type='html'>&lt;p&gt;&amp;nbsp;LLMs keep getting better by eating more data—until the data well runs dry. A new paper from Meta Superintelligence Labs proposes &lt;strong data-end=&quot;648&quot; data-start=&quot;620&quot;&gt;Language Self-Play (LSP)&lt;/strong&gt;: turn training into a game where a single model plays &lt;em data-end=&quot;715&quot; data-start=&quot;703&quot;&gt;both sides&lt;/em&gt;—a &lt;strong data-end=&quot;732&quot; data-start=&quot;718&quot;&gt;Challenger&lt;/strong&gt; that generates tougher prompts and a &lt;strong data-end=&quot;780&quot; data-start=&quot;770&quot;&gt;Solver&lt;/strong&gt; that answers them—so the system improves &lt;strong data-end=&quot;856&quot; data-start=&quot;822&quot;&gt;without ingesting new datasets&lt;/strong&gt;. In tests on AlpacaEval using &lt;strong data-end=&quot;912&quot; data-start=&quot;887&quot;&gt;Llama-3.2-3B-Instruct&lt;/strong&gt;, LSP matches a strong data-driven RL baseline and even &lt;strong data-end=&quot;988&quot; data-start=&quot;968&quot;&gt;pushes beyond it&lt;/strong&gt; when used as a follow-on stage.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h2 data-end=&quot;1099&quot; data-start=&quot;1060&quot;&gt;How it works: one model, two roles&lt;/h2&gt;
&lt;p data-end=&quot;1582&quot; data-start=&quot;1100&quot;&gt;LSP frames training as a &lt;strong data-end=&quot;1141&quot; data-start=&quot;1125&quot;&gt;minimax game&lt;/strong&gt;: Challenger tries to &lt;strong data-end=&quot;1175&quot; data-start=&quot;1163&quot;&gt;minimize&lt;/strong&gt; reward by making hard queries; Solver tries to &lt;strong data-end=&quot;1235&quot; data-start=&quot;1223&quot;&gt;maximize&lt;/strong&gt; reward by answering them. Crucially, both roles are instantiated by &lt;strong data-end=&quot;1320&quot; data-start=&quot;1304&quot;&gt;the same LLM&lt;/strong&gt; via a role-selecting prompt (e.g., a special &lt;em data-end=&quot;1378&quot; data-start=&quot;1366&quot;&gt;challenger&lt;/em&gt; prompt), avoiding the instability and memory overhead of training an external adversary. KL regularization keeps the Challenger from devolving into nonsense prompts.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;2088&quot; data-start=&quot;1584&quot;&gt;Under the hood, LSP borrows &lt;strong data-end=&quot;1640&quot; data-start=&quot;1612&quot;&gt;group-relative baselines&lt;/strong&gt; from GRPO: Challenger generates &lt;em data-end=&quot;1676&quot; data-start=&quot;1673&quot;&gt;N&lt;/em&gt; queries, Solver samples &lt;em data-end=&quot;1704&quot; data-start=&quot;1701&quot;&gt;G&lt;/em&gt; answers per query, and the &lt;strong data-end=&quot;1750&quot; data-start=&quot;1732&quot;&gt;average reward&lt;/strong&gt; defines both a per-answer advantage (for Solver) and a “difficulty” signal (for Challenger). A practical variant, &lt;strong data-end=&quot;1877&quot; data-start=&quot;1865&quot;&gt;LSP-Zero&lt;/strong&gt;, runs as a pure zero-sum game; the full &lt;strong data-end=&quot;1925&quot; data-start=&quot;1918&quot;&gt;LSP&lt;/strong&gt; adds a &lt;strong data-end=&quot;1956&quot; data-start=&quot;1933&quot;&gt;quality self-reward&lt;/strong&gt; scored by a reference model to prevent reward-hacking (e.g., answering everything in Python).&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h2 data-end=&quot;2148&quot; data-start=&quot;2090&quot;&gt;Results: data-free ≈ data-driven—and sometimes better&lt;/h2&gt;
&lt;p data-end=&quot;2242&quot; data-start=&quot;2149&quot;&gt;Using &lt;strong data-end=&quot;2174&quot; data-start=&quot;2155&quot;&gt;GPT-4o as judge&lt;/strong&gt; on AlpacaEval, the team compares models trained from the same base:&lt;/p&gt;
&lt;ul data-end=&quot;2735&quot; data-start=&quot;2244&quot;&gt;
&lt;li data-end=&quot;2486&quot; data-start=&quot;2244&quot;&gt;
&lt;p data-end=&quot;2486&quot; data-start=&quot;2246&quot;&gt;&lt;strong data-end=&quot;2270&quot; data-start=&quot;2246&quot;&gt;From base (no data):&lt;/strong&gt; Overall win rates vs. the base model—&lt;strong data-end=&quot;2334&quot; data-start=&quot;2308&quot;&gt;GRPO (with data) 40.9%&lt;/strong&gt;, &lt;strong data-end=&quot;2354&quot; data-start=&quot;2336&quot;&gt;LSP-Zero 40.1%&lt;/strong&gt;, &lt;strong data-end=&quot;2369&quot; data-start=&quot;2356&quot;&gt;LSP 40.6%&lt;/strong&gt;. Translation: self-play &lt;strong data-end=&quot;2417&quot; data-start=&quot;2394&quot;&gt;without any RL data&lt;/strong&gt; keeps pace with standard RL.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;2735&quot; data-start=&quot;2487&quot;&gt;
&lt;p data-end=&quot;2735&quot; data-start=&quot;2489&quot;&gt;&lt;strong data-end=&quot;2519&quot; data-start=&quot;2489&quot;&gt;From RL (as a next stage):&lt;/strong&gt; Starting from the GRPO model and continuing with self-play, &lt;strong data-end=&quot;2619&quot; data-start=&quot;2580&quot;&gt;LSP lifts overall win rate to 43.1%&lt;/strong&gt;, with large gains on &lt;strong data-end=&quot;2651&quot; data-start=&quot;2641&quot;&gt;Vicuna&lt;/strong&gt;-style conversational tasks (28.7% → 46.3%).&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-end=&quot;3096&quot; data-start=&quot;2737&quot;&gt;The setup uses &lt;strong data-end=&quot;2786&quot; data-start=&quot;2752&quot;&gt;Skywork-Reward-V2-Llama-3.2-3B&lt;/strong&gt; as the reward model; the authors note that &lt;strong data-end=&quot;2837&quot; data-start=&quot;2830&quot;&gt;LSP&lt;/strong&gt; (with the added quality reward) avoids the degradation seen with &lt;strong data-end=&quot;2915&quot; data-start=&quot;2903&quot;&gt;LSP-Zero&lt;/strong&gt; in some splits, and acknowledge dips on “chatbot-y” &lt;strong data-end=&quot;2977&quot; data-start=&quot;2968&quot;&gt;Koala&lt;/strong&gt; prompts—likely because Challenger skews toward structured, orderly instructions.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h2 data-end=&quot;3119&quot; data-start=&quot;3098&quot;&gt;Why this matters&lt;/h2&gt;
&lt;ul data-end=&quot;3665&quot; data-start=&quot;3120&quot;&gt;
&lt;li data-end=&quot;3321&quot; data-start=&quot;3120&quot;&gt;
&lt;p data-end=&quot;3321&quot; data-start=&quot;3122&quot;&gt;&lt;strong data-end=&quot;3149&quot; data-start=&quot;3122&quot;&gt;Data bottleneck relief.&lt;/strong&gt; If you can translate “more practice data” into a &lt;strong data-end=&quot;3228&quot; data-start=&quot;3199&quot;&gt;self-generated curriculum&lt;/strong&gt;, you can keep improving without chasing new corpora.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;3506&quot; data-start=&quot;3322&quot;&gt;
&lt;p data-end=&quot;3506&quot; data-start=&quot;3324&quot;&gt;&lt;strong data-end=&quot;3352&quot; data-start=&quot;3324&quot;&gt;A clean follow-on stage.&lt;/strong&gt; Even after data-based RL, &lt;strong data-end=&quot;3406&quot; data-start=&quot;3379&quot;&gt;self-play adds headroom&lt;/strong&gt;—useful when further high-quality preference data is scarce.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li data-end=&quot;3665&quot; data-start=&quot;3507&quot;&gt;
&lt;p data-end=&quot;3665&quot; data-start=&quot;3509&quot;&gt;&lt;strong data-end=&quot;3537&quot; data-start=&quot;3509&quot;&gt;Single-model simplicity.&lt;/strong&gt; One backbone serves both roles, avoiding adversary models and the instability they bring.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 data-end=&quot;3698&quot; data-start=&quot;3667&quot;&gt;Caveats and open questions&lt;/h2&gt;
&lt;p data-end=&quot;4127&quot; data-start=&quot;3699&quot;&gt;Self-play can &lt;strong data-end=&quot;3727&quot; data-start=&quot;3713&quot;&gt;degenerate&lt;/strong&gt; without the quality self-reward; reward choice caps the ceiling (a weak reward model means weak training signal); and Challenger diversity remains an open knob to broaden beyond the structured style seen in examples. Still, the authors argue the method should work &lt;strong data-end=&quot;4008&quot; data-start=&quot;3993&quot;&gt;even better&lt;/strong&gt; on tasks with &lt;strong data-end=&quot;4045&quot; data-start=&quot;4023&quot;&gt;verifiable rewards&lt;/strong&gt; (e.g., code tests), not just preferences.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;4348&quot; data-start=&quot;4129&quot;&gt;If your roadmap hits a data wall, &lt;strong data-end=&quot;4185&quot; data-start=&quot;4163&quot;&gt;Language Self-Play&lt;/strong&gt; is a compelling new leg in the post-training pipeline: spin up a Challenger inside your own model, let it stress-test itself, and learn—no fresh dataset required.&lt;/p&gt;
&lt;p data-end=&quot;4422&quot; data-start=&quot;4350&quot;&gt;&lt;em data-end=&quot;4422&quot; data-start=&quot;4350&quot;&gt;Paper link: &lt;a class=&quot;decorated-link&quot; data-end=&quot;4421&quot; data-start=&quot;4363&quot; href=&quot;https://arxiv.org/pdf/2509.07414&quot; rel=&quot;noopener&quot; target=&quot;_new&quot;&gt;arXiv 2509.07414 (PDF)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/686431685480840724/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/686431685480840724?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/686431685480840724'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/686431685480840724'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2025/09/language-self-play-training-llm-without.html' title='Language Self-Play: training an LLM without adding data actually works'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6408904051479439567.post-187563340130470359</id><published>2025-09-10T11:25:00.005+08:00</published><updated>2025-09-10T11:25:47.738+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="COVID-19 forecasting"/><category scheme="http://www.blogger.com/atom/ns#" term="empirical software"/><category scheme="http://www.blogger.com/atom/ns#" term="GIFT-Eval"/><category scheme="http://www.blogger.com/atom/ns#" term="Google Research"/><category scheme="http://www.blogger.com/atom/ns#" term="Kaggle benchmark"/><category scheme="http://www.blogger.com/atom/ns#" term="LLM code generation"/><category scheme="http://www.blogger.com/atom/ns#" term="recombination"/><category scheme="http://www.blogger.com/atom/ns#" term="scRNA-seq batch integration"/><category scheme="http://www.blogger.com/atom/ns#" term="tree search"/><category scheme="http://www.blogger.com/atom/ns#" term="ZAPBench"/><title type='text'>An AI that writes expert-level scientific software—and often beats the leaderboard</title><content type='html'>&lt;p&gt;&amp;nbsp;A large Google team is pushing past “chatty copilot” and into &lt;strong data-end=&quot;657&quot; data-start=&quot;614&quot;&gt;AI that authors working scientific code&lt;/strong&gt;. Their system pairs a &lt;strong data-end=&quot;721&quot; data-start=&quot;680&quot;&gt;large language model with tree search&lt;/strong&gt; to iteratively write, run, and score programs for &lt;em data-end=&quot;782&quot; data-start=&quot;772&quot;&gt;scorable&lt;/em&gt; research problems—then learns to &lt;strong data-end=&quot;835&quot; data-start=&quot;816&quot;&gt;recombine ideas&lt;/strong&gt; from papers and prior algorithms. In benchmarks, it discovered &lt;strong data-end=&quot;937&quot; data-start=&quot;899&quot;&gt;40 new single-cell RNA-seq methods&lt;/strong&gt; that outperformed the top human-made entries on OpenProblems, and produced &lt;strong data-end=&quot;1056&quot; data-start=&quot;1013&quot;&gt;14 COVID-19 hospitalization forecasters&lt;/strong&gt; that beat the &lt;strong data-end=&quot;1089&quot; data-start=&quot;1071&quot;&gt;CDC’s ensemble&lt;/strong&gt; and every individual competitor during the study window.&amp;nbsp;&lt;/p&gt;
&lt;p data-end=&quot;1876&quot; data-start=&quot;1186&quot;&gt;&lt;strong data-end=&quot;1203&quot; data-start=&quot;1186&quot;&gt;How it works.&lt;/strong&gt; Researchers frame a scientific task as “maximize a quality metric,” let the LLM generate code variants, and use &lt;strong data-end=&quot;1331&quot; data-start=&quot;1316&quot;&gt;tree search&lt;/strong&gt; to expand promising branches while pruning the rest. The agent can ingest &lt;strong data-end=&quot;1424&quot; data-start=&quot;1406&quot;&gt;research ideas&lt;/strong&gt; from literature (summarized with &lt;strong data-end=&quot;1476&quot; data-start=&quot;1458&quot;&gt;Gemini 2.5 Pro&lt;/strong&gt;) and also tries &lt;strong data-end=&quot;1521&quot; data-start=&quot;1493&quot;&gt;automatic recombinations&lt;/strong&gt; of methods, plus proposals from &lt;strong data-end=&quot;1578&quot; data-start=&quot;1554&quot;&gt;Gemini Deep Research&lt;/strong&gt; and &lt;strong data-end=&quot;1602&quot; data-start=&quot;1583&quot;&gt;AI co-scientist&lt;/strong&gt; tools. In head-to-head tests on nine published algorithms, the system’s implementations &lt;strong data-end=&quot;1723&quot; data-start=&quot;1691&quot;&gt;beat eight of nine baselines&lt;/strong&gt;; its best run—BBKNN(TS)—improved the bioinformatics leaderboard by &lt;strong data-end=&quot;1798&quot; data-start=&quot;1791&quot;&gt;14%&lt;/strong&gt; over the long-standing ComBat approach.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;2332&quot; data-start=&quot;1878&quot;&gt;&lt;strong data-end=&quot;1906&quot; data-start=&quot;1878&quot;&gt;Bioinformatics at scale.&lt;/strong&gt; The team evaluates on &lt;strong data-end=&quot;1952&quot; data-start=&quot;1929&quot;&gt;OpenProblems v2.0.0&lt;/strong&gt;, spanning &lt;strong data-end=&quot;1982&quot; data-start=&quot;1963&quot;&gt;1,747,937 cells&lt;/strong&gt; and 13 metrics across six datasets. Beyond re-implementing published methods, &lt;strong data-end=&quot;2078&quot; data-start=&quot;2061&quot;&gt;recombination&lt;/strong&gt; mattered: among &lt;strong data-end=&quot;2118&quot; data-start=&quot;2095&quot;&gt;55 pairwise hybrids&lt;/strong&gt;, &lt;strong data-end=&quot;2126&quot; data-start=&quot;2120&quot;&gt;24&lt;/strong&gt; outperformed &lt;em data-end=&quot;2146&quot; data-start=&quot;2140&quot;&gt;both&lt;/em&gt; parents and most others beat at least one—evidence that the search can synthesize competitive, &lt;em data-end=&quot;2249&quot; data-start=&quot;2242&quot;&gt;novel&lt;/em&gt; ideas rather than just tune hyperparameters.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;2814&quot; data-start=&quot;2334&quot;&gt;&lt;strong data-end=&quot;2364&quot; data-start=&quot;2334&quot;&gt;Public-health forecasting.&lt;/strong&gt; For U.S. COVID-19 hospitalization forecasting (the CDC’s &lt;strong data-end=&quot;2438&quot; data-start=&quot;2422&quot;&gt;Forecast Hub&lt;/strong&gt;), the system generated models that were &lt;strong data-end=&quot;2520&quot; data-start=&quot;2479&quot;&gt;consistently lower-error (better WIS)&lt;/strong&gt; than the official ensemble in &lt;strong data-end=&quot;2573&quot; data-start=&quot;2551&quot;&gt;most jurisdictions&lt;/strong&gt;; in an aggregate comparison, &lt;strong data-end=&quot;2620&quot; data-start=&quot;2603&quot;&gt;14 strategies&lt;/strong&gt; (10 recombinations, plus two Deep Research, one AI co-scientist, and one replicated baseline) surpassed the ensemble across the three-week hold-out period.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;3152&quot; data-start=&quot;2816&quot;&gt;&lt;strong data-end=&quot;2837&quot; data-start=&quot;2816&quot;&gt;Not just biology.&lt;/strong&gt; The abstract lists additional wins in &lt;strong data-end=&quot;2968&quot; data-start=&quot;2876&quot;&gt;geospatial image segmentation, zebrafish neural activity prediction, general time-series&lt;/strong&gt;, and &lt;strong data-end=&quot;2999&quot; data-start=&quot;2974&quot;&gt;numerical integration&lt;/strong&gt;, arguing the approach generalizes to diverse “empirical software” problems where code can be scored automatically.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;3654&quot; data-start=&quot;3154&quot;&gt;&lt;strong data-end=&quot;3191&quot; data-start=&quot;3154&quot;&gt;Engineering notes—and guardrails.&lt;/strong&gt; To &lt;strong data-end=&quot;3216&quot; data-start=&quot;3195&quot;&gt;avoid overfitting&lt;/strong&gt;, bio experiments hill-climb on a separate CELLxGENE dataset and report on the &lt;strong data-end=&quot;3307&quot; data-start=&quot;3295&quot;&gt;held-out&lt;/strong&gt; OpenProblems benchmark; metrics that fail to compute are clamped to worst-case—making robustness part of the score. The team also ran &lt;strong data-end=&quot;3465&quot; data-start=&quot;3442&quot;&gt;multiple replicates&lt;/strong&gt; to show stability, and reports practical budgets: &lt;strong data-end=&quot;3530&quot; data-start=&quot;3516&quot;&gt;≈500 nodes&lt;/strong&gt; (~&lt;strong data-end=&quot;3544&quot; data-start=&quot;3533&quot;&gt;7 hours&lt;/strong&gt;) per scRNA-seq search and &lt;strong data-end=&quot;3586&quot; data-start=&quot;3571&quot;&gt;≈2000 nodes&lt;/strong&gt; per COVID run on their infra.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;4105&quot; data-start=&quot;3656&quot;&gt;&lt;strong data-end=&quot;3675&quot; data-start=&quot;3656&quot;&gt;Why it matters.&lt;/strong&gt; Rather than waiting for domain-specific code to be hand-crafted over months, this “AI co-scientist” &lt;em data-end=&quot;3786&quot; data-start=&quot;3776&quot;&gt;produces&lt;/em&gt; working software, &lt;strong data-end=&quot;3845&quot; data-start=&quot;3805&quot;&gt;tests it against public leaderboards&lt;/strong&gt;, and &lt;strong data-end=&quot;3863&quot; data-start=&quot;3851&quot;&gt;composes&lt;/strong&gt; new hybrids from the literature. If those patterns hold beyond the reported tasks, the future of scientific computing looks less like prompt engineering—and more like &lt;strong data-end=&quot;4066&quot; data-start=&quot;4031&quot;&gt;searching the space of programs&lt;/strong&gt;.&amp;nbsp;&lt;span class=&quot;&quot; data-state=&quot;closed&quot;&gt;&lt;span class=&quot;ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]&quot; data-testid=&quot;webpage-citation-pill&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;4179&quot; data-start=&quot;4107&quot;&gt;&lt;em data-end=&quot;4179&quot; data-start=&quot;4107&quot;&gt;Paper link: &lt;a class=&quot;decorated-link&quot; data-end=&quot;4178&quot; data-start=&quot;4120&quot; href=&quot;https://arxiv.org/pdf/2509.06503&quot; rel=&quot;noopener&quot; target=&quot;_new&quot;&gt;arXiv 2509.06503 (PDF)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;Read more....&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://huneesoinsane.blogspot.com/feeds/187563340130470359/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/6408904051479439567/187563340130470359?isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/187563340130470359'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6408904051479439567/posts/default/187563340130470359'/><link rel='alternate' type='text/html' href='http://huneesoinsane.blogspot.com/2025/09/an-ai-that-writes-expert-level.html' title='An AI that writes expert-level scientific software—and often beats the leaderboard'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>