PyCharm : The only Python IDE you need. | The JetBrains Blog

Improving Accessibility in JetBrains IDEs: What’s New and What’s Next in 2026

Ekaterina Valeeva — Thu, 21 May 2026 06:45:13 +0000

Making software accessible often comes down to removing small but repeated points of friction in everyday workflows. Today, on Global Accessibility Awareness Day, we’re sharing recent improvements in JetBrains IDEs across several areas: compatibility with assistive technologies on various platforms, keyboard navigation, and non-visual feedback. Some of these improvements are already available, and some are coming later this year.

You can use the audio player below to listen to this blog post.

Accessibility Blog Post Audio

Better compatibility with assistive technologies

One of the key areas we’ve been working on is improving how JetBrains IDEs interact with OS-level accessibility tools.

Improved Magnifier support on Windows

Screen magnifiers are among the most commonly used assistive technologies in JetBrains IDEs. Until recently, the built-in Windows Magnifier didn’t reliably follow the text cursor in the editor, making navigation and editing more difficult for low-vision users. We’ve implemented support for cursor tracking so Magnifier follows text as you type, just as it does in other applications.

This builds on earlier work on macOS, where we addressed text cursor tracking with macOS Zoom. Now, the same support is being extended to Windows.

Orca and GNOME Magnifier support on Linux

With version 2026.2, coming this summer, JetBrains IDEs will allow you to use the Orca screen reader and GNOME Magnifier in supported Linux environments.

This is an active area of work, with multiple related tasks already underway. Accessibility shouldn’t depend on your operating system, and we’re continuing to improve support across platforms.

More predictable keyboard navigation

We’ve also been making it easier to move through the IDE without relying on a mouse.

Main menu access with Alt on Windows

In native Windows applications, pressing Alt moves the focus to the main menu, allowing you to navigate it with the keyboard. This behavior was previously missing from JetBrains IDEs, and screen readers, such as NVDA, would sometimes announce the system menu instead.

Now, the main menu behaves in a way that feels familiar and predictable for keyboard-only and screen-reader users, and the bright focus indicator helps low-vision users identify the selected item.

Navigating between major parts of the IDE

Another focus area is the experience of moving between different parts of the IDE interface, such as toolbars, panels, and the editor. We’re working on a more structured model for navigating through the big component groups:

Tab and Shift+Tab move the focus within the current area.
A dedicated shortcut lets you jump between larger sections of the IDE.

This reduces the effort required to reach essential controls and makes the overall layout easier to navigate. For the current iteration, we made it possible to bring the main toolbar and status bar into focus, and we fixed the Project and Git toolbar widgets, which were not selectable by screen readers, even though other elements already were.

As the next step, we’ll polish specific controls and include tool window bars on both sides of the IDE frame in the navigation flow.

Exploring richer non-visual feedback with audio cues

Accessibility is not only about reaching controls, but also about understanding what’s happening while you work. We’re exploring ways to provide richer audio feedback in the IDE. Two directions we’re currently investigating:

Contextual signals when the caret lands on lines with errors, warnings, breakpoints, or version control changes. We want the IDE to provide immediate, non-visual feedback in context.
More general audio notifications for IDE actions and state changes.

The goal is to reduce the need to rely on visual indicators or switch contexts just to understand what changed. Instead, we want the IDE to provide that information more directly.

Accessibility as an ongoing effort

We’re improving accessibility in JetBrains IDEs across multiple areas at once, including by providing compatibility with assistive technologies like screen readers and magnifiers, as well as by offering more consistent keyboard navigation and clearer feedback for events that are otherwise mostly visual.

These improvements build on earlier updates, such as support for VoiceOver and NVDA, a high-contrast UI theme, and color schemes for red-green vision deficiency. There’s still more to do, and we’ll continue working in this direction.

We’d love to hear from you

We’re eager to hear from developers who rely on accessibility features, as well as from anyone interested in improving the experience of using them.

If you have ideas or feedback about accessibility in JetBrains IDEs, you can reach us directly at accessibility@jetbrains.com. You can also report issues through YouTrack or the support request form.

If you’d like to stay informed about accessibility improvements, you can subscribe to updates here.

LLM Evaluation and AI Observability for Agent Monitoring

Evgenia Verbina — Tue, 19 May 2026 09:46:54 +0000

This is a guest post from Naa Ashiorkor, a data scientist and tech community builder.

Artificial intelligence keeps evolving at a rapid pace. The latest major application of AI, specifically of LLMs, is AI agents. These are systems that use their perception of their environment, processes, and input to take action to achieve specific goals, and they are built on LLMs.

Increasingly, complex AI agents are being used in real-world applications. While simpler agentic applications that use only one agent to achieve a goal still exist, organizations are now shifting towards multi-agent systems that use multiple subagents coordinated by a main agent. These are more adaptable and can mimic human teams when it comes to performing specialized tasks such as data analysis, compliance, customer support, and more. The reasoning and autonomy of AI agents have improved; consequently, they can gather data, conduct cross-references, and generate analysis.

As we move towards these complex, real-world applications of agents, an ever-stronger spotlight is being shone both on how we observe AI agents and how we evaluate the LLMs they’re built upon. The complexity, interactions, and autonomous processes under the surface of AI agents make rigorous monitoring and assessment an essential part of building and maintaining these applications. LLM evaluation determines if the AI agent can work, while AI agent observability determines if it is working. LLM evaluation tests an agent’s basic capabilities before and during deployment, while agent observability provides deep, real-time visibility into an agent’s internal reasoning and operational health once it is live. It is pretty obvious that having just one of these is a loss and a formula for failure.

In this blog post, we’ll explore how to evaluate agents using advanced metrics and observability tools. It’s designed as a practical, end-to-end reference for teams that want to move beyond demos and actually run AI agents in live, real-world environments, avoiding the common pitfalls that cause failure in production.

Core LLM evaluation metrics for modern AI systems

As LLMs are now applied to a wide range of use cases, it is important that their evaluation covers both the tasks they may perform and their potential risks. Evaluation metrics give a better understanding of the strengths and weaknesses of LLMs, influence the guidance of human-LLM interactions, and highlight the importance of ensuring LLM safety and reliability. Hence, LLM evaluation metrics for assessing the performance of an LLM are indispensable in modern AI systems. Without well-defined evaluation metrics, assessing model quality becomes subjective.

There are several key evaluation metrics, each with a different purpose, and the table below provides a summary of some of them.

Evaluation Metric	What the metric evaluates
Hallucination rate	Factual accuracy and truthfulness of generated content
Toxicity scores	Harmful, offensive, or inappropriate content
RAGAS (Retrieval Augmented Generation Assessment)	Measures whether the RAG system retrieves the right documents and generates answers that are faithful to those sources
DeepEval	Tests everything from basic accuracy and safety to complex agent behaviors and security vulnerabilities across the entire LLM application

Hallucination rate

Hallucinations in LLMs produce outputs that seem convincing yet are factually unsupported and can be categorized as either intrinsic, where the output contradicts the source content, or extrinsic, where it simply cannot be verified. They can stem from a range of factors across data, training, and inference, from quality issues in the large datasets used for initial training and the data used to fine-tune model behavior to post-training techniques that make models overly eager to provide responses to imperfect decoding strategies at inference. Because hallucination is an unsolved challenge cutting across every stage of model development, measuring and assessing it remains a vital part of LLM evaluation.

There is a wide variety of techniques for detecting hallucinations. These include:

Fact-checking: Extracting independent factual statements from the model’s outputs (fact extraction) and then verifying these against trusted knowledge sources (fact verification).
Uncertainty estimation: Using the certainty provided in the model’s internal state to estimate how likely a piece of factual content is to be a hallucination.
Faithfulness hallucination detection: Ensures the faithfulness of LLMs to provide context or user instructions.

There are several metrics for hallucination detection. Some of the most commonly used metrics include:

Fact-based metrics: Assessing faithfulness by measuring the overlap of facts between the generated content and the source content.
Classifier-based metrics: Utilizing trained classifiers to distinguish between the level of entailment between the generated content and the source content.
QA-based metrics: Using question-answering systems to validate the consistency of information between the source content and the generated content.
Uncertainty-based metrics: Assessing faithfulness by measuring the model’s confidence in its generated outputs.
LLM-based metrics: Using LLMs as evaluators to assess the faithfulness of generated content through specific prompting strategies.

PyCharm’s Hugging Face integration lets you discover evaluation models and datasets without leaving the IDE. Use the Insert HF Model feature to search for hallucination or toxicity classifiers, and hover over any model or dataset name in your code to instantly preview its model card, including training data, intended use, and limitations. This means you can import a dataset, evaluate your LLM, and verify the tools you’re using, all from one place.

Opening the Hugging Face model browser in PyCharm from the Code menu, then selecting Insert HF Model.

Searching for a specific hallucination model and selecting one. Use Model inserts a ready-to-use code snippet into the editor.

A ready-to-use code snippet of the Vectara hallucination evaluation model is inserted into the editor.

Hovering over the Vectara hallucination evaluation model in the code to preview its model card within PyCharm.

Trust is imperative in the acceptance and adoption of technology. Trust in AI is especially important in areas such as healthcare, finance, personal assistance, autonomous vehicles, and others. Hallucinations have a huge impact on users’ trust in LLMs.

In 2023, a story went viral about a Manhattan lawyer who submitted a legal brief largely generated by ChatGPT. The judge quickly noticed how different it was from a human-written submission, revealing clear signs of hallucination. Incidents like this highlight the real-world risks of LLM errors and their impact on user trust. As people encounter more examples of hallucination, skepticism around LLM reliability continues to grow.

Toxicity scores

LLMs that have been pretrained on large datasets from the web have the tendency to generate harmful, offensive, and disrespectful content as well as toxic language, such as hate speech, harassment, threats, and biased language, which have a negative impact on their safe deployment. Toxicity detection is the process of identifying and flagging toxic content by integrating open-source tools or APIs into the LLM workflow to analyze both the user input and the LLM output. Some of the available toxicity tools include the OpenAI Moderation API, which is free, works with any text, and has a quick implementation. Perspective API by Google is also widely used with a transparent methodology, but will no longer be in service after 2026. Detoxify, which is open source, has no API costs, and is Python-friendly, and Azure AI Content Safety by Microsoft, which is customizable and best for enterprise deployments and existing Azure users. Hugging Face Toxicity Models have many model options and easy integration with Transformers.

Toxicity detection has become a guardrail; hence, it is important in public-facing applications. They prevent toxic content from reaching users, which protects both individuals and organizations. In public-facing applications, toxicity detection operates by input filtering, output monitoring, and real-time scoring. This prevents attacks where users intentionally train AI to produce toxic content through coordinated toxic inputs; toxic content will never reach the user, even if produced by the underlying AI, so systems can adjust their behavior dynamically based on conversation content and escalating risks. Unguarded AI can be exploited, which leads to reputational damage.

For toxicity evaluation, PyCharm’s Hugging Face Insert HF Model feature helps you discover classifiers like s-nlp/roberta_toxicity_classifier directly in the IDE. Hovering over the model name reveals its model card, where you can see it was trained on the Jigsaw toxic comment datasets, helping you understand what the model can and can’t detect before you write a single line of evaluation code.

Opening the Hugging Face model browser in PyCharm from the Code menu, then selecting the Insert HF Model.

Searching for a specific toxicity model and selecting one. Use Model inserts a ready-to-use code snippet into the editor.

A ready-to-use code snippet of the roberta_toxicity_classifier is inserted into the editor.

Hovering over the roberta_toxicity_classifier in the code to preview its model card within PyCharm.

Frameworks for LLM evaluation

Frameworks for LLM evaluation have changed the game; teams don’t have to rely on manual reviews, gut instinct, and subjective judgment to assess model quality. These frameworks automate the measurement of model quality using standardized, quantifiable metrics. They assign numerical scores to outputs that measure faithfulness, relevancy, toxicity, and other important dimensions. This automation results in reproducibility, speed, and objectivity.

Consequently, the same input always produces the same score; evaluation runs 10–100 times faster, so in minutes instead of days; and there are no more debates on the quality of the output. Some of these frameworks include DeepEval and Retrieval Augmented Generation Assessment (Ragas). DeepEval is an open-source evaluation framework built with seven principles in mind, such as the ability to easily “unit test” LLM outputs in a similar way to Pytest and plug in and use over 50 LLM-evaluated metrics, most of which are backed by research and all of which are multimodal.

It is extremely easy to build and iterate on LLM applications with two modes of evaluation, namely, end-to-end LLM evals and component-level LLM evals. It is used for comprehensive testing across RAG, agents, and chatbots. Ragas is a framework for reference-free evaluation of RAG pipelines. There are several dimensions to consider, such as the ability of the retrieval system to identify relevant and focused context passages, as well as the capability of the LLM to exploit such passages in a faithful way; hence, it is challenging to evaluate RAG systems. Ragas provides a suite of metrics for evaluating these dimensions without relying on ground-truth human annotations.

The limits of static prompt evaluation

Traditional LLM evaluation methods are useful for single prompt-response pairs, measuring output quality, RAG systems with straightforward retrieval, and static evaluation with fixed inputs. But they are limited for multi-step agents because LLM evaluation focuses on the final output quality, not the decision-making process that produced it. Multi-step agents exhibit a different kind of complexity, as they chain multiple decisions.

Why traditional LLM evaluation isn’t enough for agents

Agents operate independently within complex workflows, and this independence can introduce challenges such as deviation from expected behavior, errors in production, and more failure points than in traditional software applications. Hence, an agent can perform well in testing but fail in production. Traditional LLM evaluations don’t have the capacity to test such use cases. Testing is usually done in a controlled environment with limited scenarios, but production involves real users, edge cases, unpredictable inputs, and scale. This means that agents can make decisions that are not seen in testing, and in production, tasks could be completed, though incorrectly, without generating an error signal. This is where advanced evaluation and monitoring practices come to the rescue! They provide the visibility and systematic measurement needed to deploy agents confidently, rather than relying on trial and error.

The complexity of agent behavior

Traditional LLM evaluation measures single prompt-response pairs: provide an input prompt, receive an output response, and measure quality through metrics such as accuracy, relevance, and faithfulness. Due to the complexity and non-deterministic, multi-step reasoning of AI agents, they cannot be reliably evaluated using traditional evaluation metrics.

Agent behavior is complex, and this complexity introduces challenges. Agents operate in dynamic environments where APIs might be down, databases change between queries, and the “right” answer depends on current conditions. They can use external tools and APIs to complete tasks, and may either use the wrong tool or use the right tool with the wrong parameters or input type. Their internal reasoning traces remain hidden unless they are logged explicitly, so it might be challenging to determine whether an agent was successful through logic or chance. An agent’s output could be perfectly correct despite poor internal decisions, or the entire task could fail despite correct step execution.

This is where observability tooling becomes essential. PyCharm’s AI Agents Debugger breaks open the black box of agentic systems, letting you trace LangGraph workflows and inspect each agent node’s inputs, outputs, and reasoning directly in the IDE, with zero extra code. Just install the plugin, run your agent, and the debugger automatically captures execution traces. Click the Graph button to visualize the full workflow, making it easy to spot where an agent chose the wrong tool, passed bad parameters, or succeeded by luck rather than logic.

To see this in action, I built a simple travel-planning agent using LangGraph in two steps: a research node that suggests summer destinations based on my preferences, and a plan node that picks the best option and builds a three-day itinerary. With the AI Agents Debugger, you can trace exactly what information flowed between these two steps – what the research node suggested and how the planner used those suggestions to build the final itinerary.

The AI Agents Debugger shows how the agent moves from initialization to the research stage, displaying the data passed in and out, and the LLM call used to generate the research results.

The AI Agents Debugger shows how the planning step processes inputs and produces outputs, using an LLM call to construct the final travel itinerary.

The Graph viewprovides a high-level overview of the agent’s workflow, mapping how it progresses from the initial step through research and planning to the final result.

Advanced agent evaluation metrics

The complexity of AI agents demands evaluation that goes beyond considering the final output quality, that is, measuring whether it is accurate, relevant, and grounded. Specialized agent evaluation assesses the complete decision-making process, including the planning logic, tool selection, parameter construction, reasoning coherence, and resource efficiency that led to the final output. Hence, the advanced agent evaluation metrics are designed to make such a process visible and measurable. Some of them are task completion rate, tool usage, reasoning quality, efficiency, and error handling.

Task completion rate

Task completion rate measures the percentage of tasks where an agent successfully achieves the end goal. This is calculated as the number of completed tasks divided by the total number of tasks attempted. The context of “completed” differs by use case. There are real-world use cases for task completion rate. Let’s start with a basic use case. Consider a customer service agent handling a specific food delivery order: “Where is my order #0001? It has not been delivered to me.” Completion rate means successfully looking up the order ID, retrieving the tracking information, and providing an accurate delivery estimate, so all three steps must succeed. If the agent retrieves the wrong order or fails to assess the tracking system, that is a failed task, even if it produces the same output.

Next, let us look at a medium-complexity use case, sequential API calls. Consider an agent tasked with creating a Jira support ticket and notifying the relevant team in Slack. The agent calls the Jira API to create a ticket, parses the response to get the ticket ID, calls the Slack API with the ticket link, and finally verifies the success of both. If the agent successfully creates the Jira ticket, but the Slack notification fails, that is considered a failed task even if the ticket exists in Jira, since the team wasn’t notified.

Finally, let’s examine a high-complexity use case: An agent is given the task of completing an online purchase, which means it must handle everything from checkout to order confirmation. Six steps are involved: Verify the item is still in stock, process the payment with a credit or debit card, reserve or decrement inventory, create an order record, generate an order confirmation number, and send a confirmation email to the customer. If the agent successfully charges the customer’s card but the confirmation email fails to send, that’s a failed task, even if the payment was processed and the order was created. In such a situation, the customer has no proof of purchase, so they will likely contact support or attempt to purchase again.

Tool usage correctness

Tool usage correctness assesses whether an agent correctly identifies and invokes the relevant tools and APIs. It is a deterministic measure that is assessed using techniques such as LLM as a judge, like most LLM evaluation metrics. It has three dimensions:

Did the agent choose the right tool for the task (tool selection)?
Were the parameters constructed correctly (input parameters)?
Did the agent properly use the tool results (output handling)?

Hence, it is important for reliability and functional correctness.

Step-by-step reasoning accuracy

In real-world use cases, an LLM agent’s reasoning is shaped by much more than just the model itself. Modern frameworks such as LangChain expose the agent’s internal “thoughts” through structured logging of intermediate reasoning steps. This is done using the ReAct (Reasoning and Acting) pattern, which involves the agent thinking about what to do, using a tool, observing the tool result, and then repeating until the task is complete. Each “thought” is logged as text, which creates a complete trace of the reasoning process from initial query to final answer. These traces can be extracted programmatically and evaluated to assess whether the agent’s logic is sound even when the final output appears correct. Evaluating planning steps involves assessing aspects such as the overall approach’s logic, the ordering of steps, and whether any steps are unnecessary or redundant. Evaluating execution assesses whether the implementation worked, such as whether tools were called with correct parameters, whether each step was completed successfully, whether errors were handled appropriately, and whether the output was interpreted correctly. This can be done seamlessly in PyCharm using the AI Agents Debugger.

Groundedness (faithfulness)

Groundedness, also known as faithfulness, is the most critical metric for retrieval-augmented generation (RAG), which is a common component of agentic applications. It assesses whether the agent’s response is actually supported by the retrieved source documents or whether, instead, the model hallucinated information. Different evaluation techniques include:

Atomic claim verification: Breaks up the response into atomic claims and checks each claim against the retrieved context. It is slow but best for production RAG and thorough evaluation.
Semantic similarity: Compares the embeddings of the response and source documents. It is fast, so it is best for quick checks and first-pass filtering.
LLM-as-Judge: works by prompting the LLM to score groundedness by extracting factual statements from the response and then checking each statement against the retrieved context. It offers medium speed and is best for flexible, custom criteria.

AI observability and why it matters

AI observability is about visibility into what the agent is doing. This covers recording everything that happens when a task is executed, including the agent’s reasoning at each step, which tools were called with what parameters, what data was retrieved, and how decisions were made from start to finish. With such a transparent system where every decision can be logged and traced, teams are able to understand why an agent fails, behaves unexpectedly, or becomes expensive to run because issues can be debugged and behavior can be audited. Consequently, system design improves, and guesswork is eliminated.

Definition of AI observability

AI observability is the real-time monitoring of agent actions, thoughts, and environmental interactions: what went in, what came out, how the agent thought through the problem, and which tools, APIs, and data were used. AI observability builds on the three pillars of DevOps observability – that is, metrics, logs, and traces – but extends each one for AI’s unique needs. DevOps metrics track CPU and latency, while AI metrics track token usage and cost per interaction. DevOps logs capture system errors, while AI logs capture reasoning traces and decision points. DevOps traces follow requests through services, while AI traces follow reasoning through agent steps, tool calls, and observations.

Benefits for agent monitoring

Agent monitoring has immense benefits – here are some of the most important:

It debugs reasoning errors: When an agent fails or gives an unexpected output, monitoring provides a complete trace of its decision-making process, which shows exactly where the logic broke down. Hence, there is no need to spend hours guessing the causes.
It measures performance and latency over time: Since metrics such as average latency, token usage, cost per interaction, and completion rates across all queries are tracked, degradation patterns can be identified before they affect users. As a result, performance issues can be identified and resolved before users file any complaints.
It identifies regressions after model or prompt updates: Baseline metrics such as completion rate, faithfulness scores, latency, and cost are established and then monitored for deviations after deployments. If a new prompt drops the compilation rate or a model update increases the hallucination rate, automated alerts catch it immediately. Hence, issues are caught before users are affected.

Popular tools for agent monitoring

Several frameworks and platforms have emerged to provide built-in observability for AI agents, with each having different strengths and integration approaches and matching different features and requirements. The choice of the right tool depends on the framework, deployment preferences, and primary needs. The table below shows some popular tools and whether they match different features and requirements.

Tool	Traces agent steps?	Tracks costs?	Detects regressions?	Self-hostable?	Open source?	Easy integration?
Helicone	Yes	Yes	Yes	Yes	Yes	Yes
LangSmith	Yes	Yes	Yes	Limited	No	Yes
LangFuse	Yes	Yes	Yes	Yes	Yes	Moderate
OpenLLMetry	Yes	Limited	Limited	Yes	Yes	Moderate
Phoenix	Yes	Limited	Yes	Yes	Yes	Moderate
TruLens	Yes	Limited	Yes	Yes	Yes	Moderate
DataDog	Limited	Yes	Yes	No	No	Moderate

Best practices for evaluating agents in production

Evaluation does not end after deployment; rather, it is intensified. This continuous evaluation tracks how much the system costs to run, how quickly it responds under various loads, and how it handles errors or unusual inputs. Without such evaluation, problems can only be identified after the users are affected. An agent can pass all the quality checks with excellent faithfulness scores, high completion rates, and strong reasoning but fail in production if costs spiral, latency increases, or edge cases cause instability. Hence, there is a critical need for ongoing evaluation and monitoring, which will lead to systems that are reliable, scalable, and financially sustainable.

Monitor cost and latency

Monitoring cost and latency is critical for production sustainability. Token usage and response time must be tracked continuously because small inefficiencies compound dramatically over time, and the cost per token of the powerful reasoning models used for agents can be high. Production workloads require cost and latency monitoring to identify problems before user experience and budget are impacted. Cost monitoring tracks token usage at different levels, such as per request, per query type, and over time. Without visibility into patterns generated by these, teams end up discovering cost problems through surprise bills. With monitoring, they can proactively cache common queries and optimize prompts to reduce token use. Latency monitoring reveals track response time and component breakdowns to identify bottlenecks.

Cost control in production workloads is important because production costs can spiral quickly, unmonitored systems can exceed budgets, and latency impacts user experience and retention.

Combine offline and online evaluation

Effective agent evaluation requires combining offline and online evaluation, where each addresses gaps the other leaves. Offline evaluation uses fixed test databases for reproducible benchmarking, which enables fast iteration on prompts and models in controlled environments without production risk. Online evaluation monitors real user interactions in production, which reveals edge cases in testing that were never expected, so it is useful for real-time feedback, user data, and observability tools. A combination of both results in an optimal strategy where offline evaluation validates changes before deployment, then online evaluation monitors production reality.

Use human-in-the-loop when necessary

LLM agents are appreciated for how they have played a positive role in the different ecosystems, but not every agent should run autonomously since they can misinterpret prompts, cross boundaries, or make dreadful errors that can’t be caught by automation alone. Hence, the need for human-in-the-loop failsafes. Human-in-the-loop is also essential during initial setup: Unless teams already have domain-specific evaluation datasets for monitoring the agent, these will need to be created manually by assessing the agent’s performance. A hybrid approach is required when critical decisions require human validation, such as approving transactions, modifying sensitive data, or triggering irreversible workflows. In this approach, it is important that decisions are routed through a human checkpoint before proceeding. The intention is not to slow automation but rather to ensure that the right decisions involve the right oversight. A well-designed human-in-the-loop system delivers compound returns over time. Every human correction becomes feedback, which improves the agent’s accuracy and gradually reduces the need for manual review. Human oversight isn’t treated as a failure but rather as a safety net that makes the system better with use.

Final thoughts

Fundamentally, AI agents are different from single-prompt LLMs. They navigate multi-step workflows, make autonomous decisions, and use external tools, which introduces complexities that demand continuous evaluation, not just static testing. Evaluation must evolve from pre-deployment checkpoints to ongoing monitoring. Production-ready agents aren’t just well-tested; they’re continuously observed and improved based on real behavior. LLM evaluation and AI observability enable faster, safer iteration by catching issues early and feeding production insights back into development.

PyCharm streamlines agent development with integrated debugging, profiling, and testing. Step through reasoning with breakpoints, find cost bottlenecks, and iterate on evaluation tests rapidly. These workflows transform hours of debugging into minutes of systematic investigation. Explore PyCharm for AI development to see how integrated tools can help you build, evaluate, and deploy reliable AI agents.

About the author

Naa Ashiorkor

Naa Ashiorkor is a data scientist and tech community builder. She is deeply involved in the Python community and serves as an organizer for various conferences, including EuroPython. She is currently building PyLadies Tampere.

Pyrefly LSP Integration with Type Engine in PyCharm 2026.1.2

Cheuk Ting Ho — Fri, 15 May 2026 15:31:27 +0000

In PyCharm 2026.1.2, you can enable Pyrefly as an external type provider, dramatically increasing the speed of the IDE’s code insight features.

What is the Pyrefly LSP?

“LSP” stands for the Language Server Protocol – a standardized protocol that allows code editors and IDEs to communicate with language servers. The LSP enables language servers to provide code intelligence features, such as:

Code completion
Information on hover (for example, quick documentation)
Go to definition and other actions
Error checking and type-related diagnostics

The key benefit of the LSP is that it allows a single language server to be used across multiple tools. This means that language-specific intelligence does not have to be implemented separately in every editor, IDE, or CI pipeline.

Pyrefly is Meta’s next-generation Python type checker, engineered from the ground up in Rust to replace its predecessor, Pyre (written in OCaml). With the move to Rust, Pyrefly achieves significantly faster performance and improved cross-platform portability. More than just a rewrite, it is designed to be more capable and robust, offering an efficient toolset for maintaining large-scale Python codebases with high precision and minimal overhead.

Pyrefly provides the following benefits:

Higher performance and efficiency – Thanks to its Rust-based architecture, Pyrefly achieves significantly faster speeds and improves cross-platform portability.
Enhanced code intelligence – As an external type provider, Pyrefly powers essential code insight features in the IDE, including type inference, type-related diagnostics, quick documentation, and inlay hints.
Scalability – Pyrefly is designed to handle large-scale Python codebases with high precision and minimal overhead.

Pyrefly is highly beneficial for projects and developers dealing with large, complex Python codebases that prioritize performance and robust typing. Integrating Pyrefly via the LSP is part of our ongoing work to enhance code insight performance in PyCharm.

Using Pyrefly in PyCharm

Once enabled, Pyrefly powers all code insight functionality in PyCharm, including type inference and type-related diagnostics, quick documentation, and inlay hints. Delegating analysis to this faster engine delivers significantly improved performance.

To start using Pyrefly in your PyCharm project, go to the Type widget at the bottom of the window. By default, the IDE uses the built-in type engine. Click on the widget and select the option to use Pyrefly. If you do not have Pyrefly installed yet, PyCharm will install it automatically.

Once you’ve switched to the Pyrefly type engine, you will see a Pyrefly icon at the bottom, which you can hover over to check the version being used.

Please note that the integration currently works for local interpreter configurations. Support for Docker, Docker Compose, WSL, SSH, and multi-module projects is planned for future releases.

Pyrefly vs. the built-in type engine

Now let’s look at how Pyrefly and the built-in type engine behave in a complex Python project. In this FastAPI example, multiple files are typed, but in this file, the variable ref is incorrectly typed, causing four errors. When using the built-in type engine, the IDE identifies that something is wrong, but it suggests running further analysis to fix the problem, which requires an extra step.

Using Pyrefly as the type engine, the IDE reports errors immediately and highlights where they originate. However, it is worth noting that, in our example, there are four errors, but Pyrefly picks up only three of them. It misses the one in self._storage[ref].

Download the latest version of PyCharm and try it out

Ready to experience a dramatic leap in Python development performance? The Pyrefly type engine in PyCharm 2026.1.2 delivers the next generation of type checking. Engineered in Rust for unparalleled speed, it resolves files in as little as 0.5–1 seconds, significantly faster than the built-in engine. If you maintain large, complex Python codebases and prioritize robust typing, this feature is essential, as it allows you to delegate analysis to a faster engine and receive immediate type-related diagnostics. Download the latest version of PyCharm (2026.1.2) to unlock superior efficiency, scalability, and code insight.

Support for uv, Poetry, and Hatch Workspaces (Beta)

Antonina Belianskaya — Wed, 13 May 2026 12:28:25 +0000

Workspaces are increasingly the go-to choice for companies and open-source teams aiming to manage shared code, enforce consistency, and simplify dependency management across multiple services. Working within massive codebases often means juggling many interdependent Python projects simultaneously.

To streamline this experience, PyCharm 2026.1.1 introduced built-in support for uv workspaces, as well as those managed by Poetry and Hatch. This new functionality – currently in Beta – allows the IDE to automatically manage dependencies and environments across your entire workspace.

Intelligent workspace detection

When you open a workspace, PyCharm can now derive its entire structure and all its dependencies directly from your pyproject.toml files. This allows the IDE to understand relationships between projects deeply, significantly reducing the amount of configuration you have to do manually.

Because this is a fundamental change to how PyCharm handles your workspace, we’ve implemented it as an opt-in feature. Here is what you need to know about the transition:

Opt-in dialog: When you open a project, PyCharm may suggest enabling automatic detection for uv workspaces and Poetry/Hatch setups.

Manual configuration: You can toggle workspace detection in Settings | Project Structure.

Configuration note: If you previously manually edited settings in .idea files, those settings may be reset when you agree to the new model.

Managing workspaces and their projects

PyCharm now provides an integrated experience that handles the complexities of multi-package setups in uv workspaces automatically. When you open a uv workspace, the IDE identifies the individual projects and their interdependencies, ensuring the project structure is ready for you to work with.

Visualizing workspace dependencies

Once the workspace is loaded, you can verify how your projects relate to one another. PyCharm presents these dependencies in Settings | Project Dependencies.

These relationships are derived directly from your configuration and are shown as read-only in the UI. To make changes to the dependency graph, you can edit the pyproject.toml file manually – PyCharm will then update its internal model.

Automatic environment configuration

PyCharm prioritizes a zero-config approach to your Python SDK. When you open a .py or pyproject.toml file within a project, the IDE performs an immediate check.

If a compatible environment already exists on your system, PyCharm automatically configures it as the SDK for that project. If no environment is detected, a file-level notification will appear suggesting that you create a new uv environment and install the necessary dependencies for that project.

Maintaining environment consistency

Beyond the initial setup, PyCharm continuously monitors the health of your environment to ensure it stays in sync with your defined requirements.

If a dependency is not defined in your pyproject.toml file but is imported in your code, PyCharm will trigger a warning with a Sync project quick-fix to resolve these discrepancies.

Import management

PyCharm also assists when you are actively writing code by identifying gaps in your project configuration.

If you import a package that isn’t present in the environment and is not yet listed in the project’s pyproject.toml, the IDE will detect the omission. A quick-fix will suggest adding the package to the environment and updating the corresponding .toml file simultaneously.

Transparency via the Python Process Output tool window

While PyCharm automates the backend execution of commands – such as uv sync –all-packages – it still remains fully transparent.

You can track all executed commands and their live output in the Python Process Output tool window. If synchronization fails for an environment, you can analyze the specific error logs to quickly identify the root cause.

Poetry and Hatch workspaces

The logic for Poetry and Hatch workspaces follows this exact same workflow. PyCharm detects projects via their pyproject.toml files and manages the environments with the same automated precision.

The only minor difference is in tool selection – the suggested environment tool is determined by what you have specified in your pyproject.toml. If no tool is specified, PyCharm will prioritize uv (if installed) or a standard virtual environment to get you up and running quickly.

Looking ahead

This Beta version of the functionality is just the beginning of our focus on supporting complex workspace structures. We are already working on expanding the UI to allow creating new projects, linking dependencies, and activating the terminal for specific projects.

As we refine these features, your feedback is our best guide – please share your thoughts or report any issues on our YouTrack issue tracker.

Python Unplugged on PyTV: Key Takeaways From Our Community Conference

Evgenia Verbina — Thu, 07 May 2026 11:27:22 +0000

AI is changing how Python developers learn, build, and contribute to open source. At the same time, long-standing questions around community, sustainability, data workflows, and web development are becoming even more important.

Python Unplugged on PyTV, a free online conference hosted by JetBrains PyCharm, brought these conversations together in over seven hours of talks and discussions with developers, maintainers, educators, and tool builders from across the Python ecosystem.

Don’t have time to watch the full event? This blog post gives you a quick overview of what’s happening in Python today, based on talks from 13 Python experts – from AI-assisted development and open-source sustainability to modern data processing, Django, and community building.

Watch the recap video

Want to see the highlights from Python Unplugged on PyTV? Watch the full recap video below.

JetBrains’ Dr. Jodie Burchell, Data Scientist and Python Advocacy Team Lead; Cheuk Ting Ho, Data Scientist and Developer Advocate; and Will Vincent, Python Developer Advocate, discuss the key talking points from the day.

Need a quick overview? Here are the highlights

If you’d rather get the key takeaways in a written format, we’ve broken down the biggest insights from the day below. From the evolving role of AI to the importance of the Python community, these are the moments that stood out most from Python Unplugged on PyTV.

Highlight 1: Python goes beyond scripts and prototypes

Many developers first come to Python through a specific use case: automating tasks, building prototypes, learning data science, or experimenting with AI and machine learning. That accessibility is one of Python’s strengths, but it’s only the entry point.

In her session, AI Practitioners Are Only Getting Half the Goodness of Python, Deb Nicholson, Executive Director at the PSF, discussed how many AI and ML practitioners use Python mainly as a scripting or prototyping language. But Python is also used to build and maintain real-world software, supported by frameworks, data tools, testing workflows, packaging standards, and an active open-source community.

This broader context matters for learning, too. In his How to Learn Python session, Mark Smith, Head of Python Ecosystem at JetBrains, focused on what comes after the fundamentals: building real projects, reading other people’s code, and developing the habits needed to move past “tutorial hell.”

AI can help, but it shouldn’t replace hands-on practice. As Cheuk noted in the recap video, one useful tip from Mark’s talk was to turn off AI features while learning, so beginners still build the judgment needed to understand and improve the code they work with.

Highlight 2: The continuing role of community in Python

Python’s success has always been rooted in its community, and that remains as true as ever. Georgi Ker, Director and Fellow at the PSF; Una Galyeva, Head of AI at Geobear Global; and Jessica Greene, Senior ML Engineer at Ecosia, showcased this in their How PyLadies Is Shaping the Future of Python discussion.

PyLadies is an international mentorship group focused on helping more women become active participants and leaders in the Python community. The success of initiatives like PyLadies highlights how inclusive spaces can broaden participation and shape the future of the language.

As Will noted in our recap video, “Being part of the community is not just the code. It’s the conferences, it’s the people, it’s the live events – that’s what makes Python special.”

Python depends on a culture of shared responsibility, and contributors play a vital role. As AI brings more people into the ecosystem, preserving these values becomes even more important. Travis Oliphant, creator of NumPy, touched on this in his insightful session, Community is More Than Code: People Are What Make Python Thrive, and Why That Will Continue in an AI-Enabled Era.

There’s also a strong link between community and innovation, as Carol Willing, Core Developer at JupyterLab, explained in her session, Conversation, Computation, and Community: Key Principles for Solving Scientific Problems With Jupyter Notebooks and AI Tools. Tools like Jupyter have thrived in part because they enable conversation, collaboration, and knowledge sharing among people.

Highlight 3: AI poses both a threat and an opportunity for Python open source

AI is fundamentally changing how developers interact with open source.

On the positive side, AI coding tools lower the barrier to entry and allow more people to contribute. However, this increased accessibility comes with trade-offs. Maintainers are now dealing with a higher volume of contributions, many of which require significant review or refinement. Deb Nicholson, Executive Director at the PSF, discussed this trade-off in more detail in her session, AI Practitioners Are Only Getting Half the Goodness of Python.

This shift places additional pressure on those responsible for maintaining open-source projects. While AI can accelerate development, it also risks introducing poorly structured or low-quality code at scale.

Paul Everitt, Developer Advocate at JetBrains; Georgi Ker, Director and Fellow at the PSF; and Carol Willing, Core Developer at JupyterLab, pondered this in their Open Source in the Age of Coding Agents discussion. Ultimately, AI can’t replace the human systems that sustain open source. Trust, collaboration, and shared ownership remain essential, and arguably become even more important as contribution volumes increase. The real challenge lies in ensuring communities remain healthy and resilient as they scale.

Highlight 4: AI has also revolutionized how Python practitioners work

Beyond its impact on open source, AI is transforming day-to-day development workflows.

As Marlene Mhangami, Senior Developer Advocate at Microsoft Agentic, explained in her A Practical Guide to Agentic Coding session, coding is emerging as a new paradigm in which developers delegate tasks to AI systems capable of planning, executing, and refining code. This means the developer’s role is moving toward orchestration and validation, requiring new skills in guiding and evaluating AI outputs.

At the same time, development is becoming more conversational and exploratory. In environments like Jupyter, AI tools help users iterate faster, test ideas more easily, and move more fluidly between thinking and coding.

AI is also having a tangible impact on frameworks like Django, as discussed by Sheena O’Connell, Board Member at the PSF, in her talk, Powering Up Django Development With Claude Code. AI tools can speed up development in Django by handling repetitive tasks such as boilerplate generation and debugging. However, this comes with a caveat – developers must remain critical and treat AI as a collaborator, not a source of truth.

For beginners, AI can be a powerful learning aid, but over-reliance can limit deeper understanding. Building projects, reading code, and actively solving problems remain essential for developing real expertise.

Highlight 5: The importance of open-source AI

The open-source AI ecosystem is expanding rapidly, bringing with it a growing landscape of models, datasets, and tools.

This openness drives collaboration, transparency, and innovation, making it easier for developers to experiment and build on existing work. At the same time, it introduces challenges around fragmentation and long-term sustainability.

As Merve Noyan, ML Engineer at Hugging Face, explained in her Open-Source AI Ecosystem session, platforms like Hugging Face play a key role in organizing this ecosystem and making it more accessible, while Python continues to connect tools, communities, and technologies.

Highlight 6: Context is key for effective AI agents

As AI systems become more advanced, the way they interact with their input data is becoming increasingly important. Tuana Çelik, Developer Relations Engineer at LlamaIndex, covered this in detail in her insightful Orchestrating Document-Centric Agents With LlamaIndex talk.

LlamaIndex enables developers to build document-centric AI agents that retrieve, index, and reason over large collections of information. By structuring how documents are ingested and queried, it provides the LLM with much more context for the text it is processing, helping produce more accurate, context-aware responses.

This is particularly valuable in knowledge bases and enterprise assistants, where understanding relationships between pieces of information is as important as accessing the data itself.

Highlight 7: How Polars is refining high-performance data processing

Polars is pushing Python data processing toward a more scalable, production-ready future, as Polars creator Ritchie Vink explained in his Towards Query Profiling in Polars session.

Its high-performance, lazy execution model allows queries to be optimized automatically behind the scenes. However, this level of abstraction can make it harder for developers to fully understand performance.

To address this, there’s a growing need for better tooling, particularly around query profiling. By exposing execution plans, memory usage, and bottlenecks, developers can make informed decisions and build more efficient data workflows.

With features like streaming execution, Polars is helping bridge the gap between local data processing and large-scale systems.

As Jodie highlighted in the recap discussion, this shift is bringing more advanced data concepts into everyday Python workflows. She commented, “It’s really interesting to see more big data ideas coming to local Python data processing.”

Highlight 8: The power of typing in modern Python

Typing in Python continues to evolve, with a growing focus on flexibility rather than rigid enforcement. Open-source Django projects creator Carlton Gibson shed more light on this during his talk, Static Islands, Dynamic Sea: Some Thoughts on Incremental Typing.

The talk highlighted how developers are increasingly adopting an incremental approach. By creating “static islands” within a dynamic codebase, they can improve reliability, maintainability, and tooling without sacrificing Python’s core strengths.

In our recap video, Will agreed with this sentiment, adding, “It doesn’t have to be all-or-nothing. We don’t have to turn Python into something that it’s not.”

This approach is particularly useful in large frameworks like Django, where typing can help define clearer boundaries while still preserving developer ergonomics.

Highlight 9: The Django renaissance: Debunking aging myths

Django remains a modern, actively developed framework, as Django Fellow Sarah Boyce revealed in her session, Django Has a Marketing Problem: Debunking the Myths That Won’t Die.

Many of the criticisms that it’s outdated or unscalable don’t reflect the current reality. In practice, Django continues to evolve and power a wide range of applications.

The challenge is less about Django’s capabilities and more about perception, as the Django community was called to champion its strengths, ongoing evolution, and real-world impact.

Shifting this narrative will be key to ensuring its continued relevance and adoption in the years ahead.

What’s next for Python Unplugged on PyTV?

Python Unplugged on PyTV was our first step in reimagining what a fully online community conference can look like, and the response was incredible.

Looking at the numbers, more than 5,500 people joined us during the livestream. Since then, we’ve had a further 110,000 watch the event recording, showing just how global and engaged the Python community really is.

We’d love to bring Python Unplugged on PyTV back next year. What would you like to see more of? Who should we invite as speakers? Are there topics we didn’t cover that you’d love to explore?

Drop your suggestions in the comments and help shape the future of Python Unplugged on PyTV.

PyTorch vs. TensorFlow: Choosing the Right Framework in 2026

Evgenia Verbina — Mon, 04 May 2026 10:07:20 +0000

Choosing between PyTorch and TensorFlow isn’t about finding the “better” framework – it’s about finding the right fit for your project. Both power cutting-edge AI systems, but they excel in different domains. PyTorch dominates research and experimentation, while TensorFlow leads in production deployment at scale.

The frameworks have evolved significantly since their early days, each building tools and capabilities to support research and production. Despite these improvements, fundamental differences remain in their philosophies, ecosystems, and ideal use cases, which will naturally influence which framework will best fit your project.

This guide examines where each framework shines, compares them across key dimensions, and helps you choose the right tool for your natural language processing, computer vision, and reinforcement learning projects.

What sets PyTorch and TensorFlow apart?

PyTorch and TensorFlow took different approaches from day one. Google launched TensorFlow in 2015, focusing on production deployment and enterprise scalability. Meta released PyTorch in 2016, prioritizing research flexibility and Pythonic development. These roots still shape each framework today.

The key difference between the two lies in computational graphs. PyTorch uses dynamic graphs that execute operations immediately, making debugging natural – you use standard Python tools and inspect tensors at any point. TensorFlow originally required static graphs defined before execution, though version 2.x now defaults to eager execution while retaining optional graph compilation for performance.

Market data shows TensorFlow holds a 37% market share, while PyTorch commands 25%. But the research tells a different story: PyTorch powers 85% of deep learning papers presented at top AI conferences.

PyTorch: Strengths and weaknesses

PyTorch’s Pythonic API treats models like regular Python code, making development feel intuitive from the start. The framework’s dynamic computational graphs execute operations immediately rather than requiring upfront model definition, fundamentally changing how you approach debugging and experimentation.

This design philosophy has made PyTorch the dominant choice in research, where flexibility matters more than deployment infrastructure. However, this research-first design means production deployment tools remain less mature than TensorFlow’s enterprise infrastructure.

PyTorch strengths

Intuitive, Pythonic API: Models use standard Python syntax with minimal framework-specific concepts, reducing the learning curve dramatically compared to other frameworks.
Dynamic graphs enable natural debugging: Set breakpoints in training loops, inspect tensor values mid-execution, and modify architectures on the fly using tools you already know.
Priority access to the latest techniques: Because of its research dominance, when cutting-edge architectures or methods emerge, they’re implemented in PyTorch before anywhere else.
Strong ecosystem: Libraries like PyTorch Lightning handle training loops and best practices automatically, letting you focus on model architecture.

PyTorch weaknesses

Production deployment tools are less mature: Deployment options lag behind TensorFlow’s battle-tested infrastructure, so you need to do more setup work for production systems.
Mobile and edge deployment is limited: PyTorch Mobile is functional but less polished than TensorFlow Lite for smartphones and IoT devices.
Dynamic nature complicates optimization: The flexibility that aids development can make optimization for production performance harder without additional tools like TorchScript.
Smaller enterprise adoption: Fewer production patterns and case studies compared to TensorFlow’s extensive enterprise documentation.

TensorFlow: Strengths and weaknesses

TensorFlow’s production ecosystem provides you with a comprehensive infrastructure for deploying models at scale. Google built the framework specifically for enterprise environments where reliability, performance, and deployment flexibility matter most.

This production-first approach created mature tooling for serving, mobile optimization, and MLOps that PyTorch is still catching up to. The trade-off comes in development experience – TensorFlow’s API can feel more complex and less intuitive than PyTorch’s streamlined approach.

TensorFlow strengths

Mature production deployment tools: Battle-tested infrastructure with TensorFlow Serving for high-throughput serving, TensorFlow Lite for mobile, and TensorFlow.js for browsers.
Superior mobile and edge optimization: TensorFlow Lite delivers industry-standard performance and comprehensive device support for smartphones and edge devices.
Strong enterprise adoption: Proven production patterns used by thousands of companies, with extensive documentation for scaling systems serving millions of predictions.
Comprehensive MLOps tooling: TensorFlow Extended (TFX) gives you end-to-end pipelines for production ML workflows, from data validation through model monitoring.
TPU support for large-scale training: Access to Google’s specialized Tensor Processing Units for training at massive scale with performance advantages over GPU infrastructure.

TensorFlow weaknesses

Steeper learning curve: More complexity when implementing custom models or debugging issues, even with Keras integration simplifying high-level operations.
More verbose code for custom work: Novel architectures or training procedures require significantly more code compared to PyTorch’s streamlined approach.
Larger, less cohesive API: Broader API surface with multiple ways to accomplish the same task creates confusion and longer learning curves.
Debugging can be challenging: Graph-related issues may require you to understand TensorFlow’s internal execution model despite eager execution improvements.
Slower adoption of research techniques: New methods from research papers typically take longer to appear in TensorFlow compared to PyTorch.

If you’re new to TensorFlow and want a hands-on starting point, check out How to Train Your First TensorFlow Model in PyCharm, where you’ll build and train a simple model step by step using Keras and visualize the results.

PyTorch vs. TensorFlow: Head-to-head comparison

Choosing between PyTorch and TensorFlow isn’t always straightforward, and there are many factors to consider.

The table below provides a high-level head-to-head comparison of PyTorch and TensorFlow so you can quickly assess which framework generally fits your needs. We’ll later consider project-specific scenarios and provide a detailed decision matrix to guide your choice.

Dimension	PyTorch	TensorFlow
Learning curve	Easier: Pythonic and intuitive	Steeper: more complex API despite Keras
Debugging	Excellent: standard Python tools work naturally	Good: improved with eager execution
Production deployment	Improving: TorchServe and TorchScript available	Excellent: mature ecosystem (Serving, Lite, JS)
Research/experimentation	Dominant: 85% of deep‑learning research papers	Present: but trailing PyTorch in adoption
Community ecosystem	Research-focused: Hugging Face, PyTorch Lightning	Enterprise-focused: TFX, strong cloud integration
Performance at scale	Strong: DDP for distributed training	Strong: graph optimization, TPU support
Industry adoption	Growing: used by 15,800+ companies	Established: used by more than 23,000 companies

PyTorch vs. TensorFlow for different use cases and applications

Your framework choice depends heavily on what you’re building. Here’s how PyTorch and TensorFlow stack up for major machine learning domains.

Natural language processing

PyTorch dominates NLP with no signs of slowing. The Hugging Face Transformers library – the de facto standard for working with language models – started as a PyTorch-only framework and later added TensorFlow support as a secondary option. When you’re fine-tuning transformers, implementing custom attention mechanisms, or experimenting with novel architectures, PyTorch’s flexibility accelerates your iteration.

Verdict: PyTorch leads NLP decisively. Choose TensorFlow only if you have specific mobile deployment requirements that override all other considerations.

Computer vision

Computer vision presents a more balanced landscape for your projects. PyTorch benefits from research momentum – when you’re developing novel detection algorithms or experimenting with architectures, you’ll find state-of-the-art implementations appear in PyTorch first. TensorFlow excels for building production CV systems, especially for mobile object detection or on-device image classification, where TensorFlow Lite’s optimization matters most.

For a hands-on example, watch this video on how to build a TensorFlow object detection app to see how to take a pre-trained model and turn it into a real-time object detection app running on a robot in PyCharm:

Verdict: Use case dependent. Choose PyTorch for research and novel architectures, TensorFlow when your deployment priorities favor mobile and edge devices.

Reinforcement learning

PyTorch holds a slight edge in reinforcement learning, driven by the research community’s preference for it. When you’re implementing custom RL algorithms, modifying reward functions dynamically, or debugging agent behavior, PyTorch’s flexibility serves you better. TensorFlow offers solid capabilities through TF-Agents for production RL systems at scale.

Verdict: Choose PyTorch for RL research and experimentation or TensorFlow for building large-scale production-grade RL systems like recommendation engines.

Tooling and developer experience in PyCharm

PyCharm provides comprehensive support for both frameworks, streamlining your development workflow regardless of which you choose.

Debugging: Set breakpoints in training loops, inspect tensor values, and step through model forward passes using the integrated debugger that works naturally with PyTorch’s dynamic graphs and TensorFlow’s eager execution.
Jupyter notebook support: Prototype in notebooks, inspect data transformations visually, then move to scripts for production training with seamless integration.
Package management: Handle complex dependency trees and CUDA requirements using virtual environment management to prevent conflicts between frameworks.
Remote interpreters: Connect to remote GPU servers, develop locally while training remotely, and sync code automatically to take advantage of powerful hardware without leaving your IDE.
TensorBoard integration: Track training metrics, visualize model graphs, and compare experiments within PyCharm using native TensorFlow support or torch.utils.tensorboard for PyTorch.
Code completion: Get framework-specific suggestions for layer definitions, optimizer configurations, and data pipeline operations that reduce errors and accelerate development.

Performance, scalability, and deployment

Training performance barely differs between frameworks for most workloads – both handle GPU training efficiently with comparable speeds. TensorFlow gains an edge when you need TPU support for large-scale training, offering more mature integration with Google’s specialized hardware. For multi-GPU scaling, both deliver strong performance with PyTorch’s DDP and TensorFlow’s MirroredStrategy.

Deployment scenarios differentiate the frameworks more clearly. TensorFlow Serving handles production model serving at scale with built-in versioning and A/B testing that PyTorch’s TorchServe can’t yet match in maturity. When deploying to mobile devices or edge hardware, TensorFlow Lite provides industry-standard optimization through quantization and pruning. For browser deployment, TensorFlow.js offers more integrated, optimized inference compared to serving PyTorch models via ONNX Runtime.

Memory management affects development experience – PyTorch’s caching allocator handles GPU memory efficiently with dynamic batch sizes, causing fewer surprises when experimenting with different model configurations.

Community, ecosystem, and library support

PyTorch’s research dominance created a vibrant, innovation-focused community that accelerates development. The PyTorch Conference 2024 saw triple the registrations versus 2023, and when cutting-edge techniques emerge, they appear in PyTorch first. The Hugging Face ecosystem amplifies this advantage – more than 220,000 PyTorch-compatible models versus around 15,000 for TensorFlow makes a tangible difference in development speed.

TensorFlow’s community skews toward production engineering, providing comprehensive enterprise-grade documentation and proven deployment patterns. Google’s backing ensures strong cloud platform integrations, particularly with Google Cloud, offering managed services that reduce operational complexity. The Model Garden provides production-ready implementations optimized for deployment rather than research experimentation.

Learning resources reflect these different audiences – PyTorch tutorials emphasize research workflows and novel implementations, while TensorFlow documentation prioritizes production deployment patterns and enterprise-scale systems.

Choosing the right framework for your project

Many successful teams use both frameworks strategically – researching and experimenting in PyTorch, then deploying in TensorFlow. The frameworks aren’t mutually exclusive. You can use ONNX to enable model conversion between them when needed.

When making a choice, it helps to prioritize factors most relevant to your project: Mobile deployment requirements may override other considerations, research-heavy work might make PyTorch essential, and enterprise support with MLOps integration could tip the scales toward TensorFlow.

Use the table below to match your project requirements with the framework strengths.

Decision Factor	PyTorch	TensorFlow
By use case
Natural language processing	✅ NLP standard choice	Only if mobile deployment is critical
Computer vision	✅ Research/novel architectures	✅ Production mobile/edge apps
Reinforcement learning	✅ Research and experimentation	✅ Large-scale production RL
By experience level
Beginner	✅ More intuitive API	Keras simplifies learning
Intermediate/Advanced	✅ Research and prototyping	✅ Production systems at scale
By project phase
Research/Experimentation	✅ Dynamic graphs aid iteration	Graph compilation for optimization
Rapid prototyping	✅ Fast experimentation	Keras for simple models
Production deployment	TorchServe improving	✅ Mature deployment tools
By deployment target
Cloud/Server	Strong performance	✅ Strong performance, slight GCP advantage
Mobile/Edge devices	Basic support via PyTorch Mobile	✅ TensorFlow Lite industry standard
Web Applications	Via ONNX Runtime	✅ TensorFlow.js optimized
By team context
Research-focused team	✅ Natural fit for researchers	If already using TensorFlow
Production-focused team	If comfortable with tooling	✅ Proven enterprise patterns

Using Bag-of-Words With PyCharm

Jodie Burchell — Wed, 29 Apr 2026 17:42:41 +0000

Have you ever wondered how machine learning models actually work with text? After all, these models require numerical input, but text is, well, text.

Natural language processing (NLP) offers many ways to bridge this gap, from the large language models (LLMs) that are dominating headlines today all the way back to the foundational techniques of the 1950s. Those early methods fall under what we now call the bag-of-words (BoW) model, and despite their age, they remain remarkably effective for a wide range of language problems.

In this post, we’ll unpack how the bag-of-words model works, explore the techniques it uses to convert text into numerical representations, and look at where it fits relative to more modern NLP approaches. We’ll also build a text classification project using BoW techniques, and see how PyCharm’s specific features make the whole process faster and easier.

What is the bag-of-words model?

The bag-of-words model is a text representation technique that converts unstructured text into numerical vectors by tracking which words appear across a corpus (a collection of texts). Rather than preserving grammar or word order, it simply represents each document as a “bag” of its words, recording how often each one appears. The result is a vector of counts that captures what a text is about, even if it discards how that content is expressed.

This apparent limitation turns out to matter less than you might expect. For many tasks, such as text classification and sentiment analysis, the presence of certain words is often a stronger signal than their arrangement, and BoW captures that signal efficiently.

How does bag-of-words work?

To use the bag-of-words model, we need to convert each text in a corpus into a numerical vector. Let’s walk through how that works, starting with what that vector actually looks like.

Take the following sentence:

When diving into natural language processing, it is natural for beginners to feel overwhelmed by the complexity of sentiment analysis, which involves distinguishing negative from positive text. However, as you practice with libraries like NLTK or spaCy, the concepts naturally start to click.

A vector representation of this text using the BoW model might look something like this.

…	natural	naturally	nausea	near	neared	nearing	necessary	negative	…
	2	1	0	0	0	0	0	1

If we think of this vector as a table, you may have noticed that each column represents a word in the corpus, and the row contains a number from 0 to 2. This number is a count of how many times the word occurs in the text, as we can see below:

When diving into natural language processing, it is natural for beginners to feel overwhelmed by the complexity of sentiment analysis, which involves distinguishing negative from positive text. However, as you practice with libraries like NLTK or spaCy, the concepts naturally start to click.

Each column represents a word in the vocabulary; each value records how many times that word appears. Here, “natural” appears twice, while “naturally” and “negative” each appear once.

Tokenization

Before we can build this vector, we need to split our text into tokens. In BoW modeling, this is typically straightforward: We split on whitespace, so “When diving into natural language processing,” becomes seven tokens: ["When", "diving", "into", "natural", "language", "processing", ","]. This is considerably simpler than the tokenization used in LLMs.

Vocabulary creation

Applying tokenization across every text in the corpus produces a long list of words. Deduplicating this list gives us our vocabulary, which we can see in the set of columns in the vector above. This process does introduce some noise: “Natural” and “natural”, for instance, would be treated as two separate tokens. We’ll look at preprocessing steps to address this shortly.

Encoding

With a vocabulary in hand, we create a vector for each text with one element per vocabulary word. Encoding is then the process of filling in those elements by checking each vocabulary word against the text.

The simplest approach is binary vectorization: 0 if a word is absent, 1 if present. More common, however, is count vectorization, which records the actual number of occurrences, as we saw in the example above. Count vectorization carries more information, since it helps distinguish texts that merely mention a topic from those that focus on it heavily.

One practical consequence of this approach is sparsity. If a corpus contains thousands of unique words, each vector will have thousands of elements, but any individual text will only use a small fraction of them, leaving most values at zero. This signal-to-noise issue is something we’ll return to.

Advantages of the bag-of-words model

The bag-of-words model has remained a staple in NLP for good reason. Its greatest strength is its simplicity: Because text is represented as a collection of word counts, the approach is easy to understand and straightforward to implement, making it a natural baseline before reaching for more complex architectures.

Beyond simplicity, BoW is computationally efficient. As you saw above, the underlying math is lightweight, which means it scales well to large text collections without demanding significant computing resources. For tasks where the presence of specific words is sufficient to capture meaning, with sentiment analysis and topic categorization being the clearest examples, it remains a highly effective tool.

Applications of bag-of-words

Like many NLP approaches, the bag-of-words model can be applied to many natural language problems. These potential applications include:

Document classification, where encoded documents are sorted into predefined categories. A classic example of this is automatically sorting incoming news articles into distinct categories such as sports, politics, or technology, as we’ll see in the project we do in this post.
Sentiment analysis, where the presence of certain words strongly indicates the overall tone of a text, allows models to easily determine whether a piece of writing expresses a positive, negative, or neutral sentiment. If you’re interested in learning more about BoW and other approaches to sentiment analysis, you can see a prior blog post I wrote on this topic.
Spam detection, which relies heavily on BoW to identify and filter out unwanted emails or messages by learning to recognize the distinct, high-frequency word patterns characteristic of spam.
Retrieval systems, where it helps to efficiently find the most relevant documents from an immense corpus based on a user’s search query.
Topic modeling, which aims to group similar text vectors in order to discover and extract the hidden, latent topics present within a large collection of documents.

As you can see, the number of potential applications is broad, making bag-of-words modeling a popular first approach to natural language problems.

Why use PyCharm for NLP?

PyCharm is particularly well-suited to bag-of-words modeling because it supports the iterative, detail-oriented workflow that text processing requires. As you’ll soon see, building a reliable BoW pipeline involves multiple steps, such as cleaning text, tokenizing, vectorizing, and validating outputs, and PyCharm’s code intelligence makes each of these smoother. Autocompletion, parameter hints, and quick navigation through specialized NLP libraries reduce friction when experimenting with different vectorizer settings, and help you understand how each component behaves.

Debugging and data inspection are equally important here, since small preprocessing mistakes can have an outsized effect on results. PyCharm lets you step through your code and examine intermediate states of things such as token lists and vocabulary at runtime, making it straightforward to verify that your feature extraction is working as intended. This visibility is especially useful when diagnosing issues like unexpected vocabulary sizes or missing terms.

PyCharm also supports exploratory work through its excellent Jupyter Notebook integration and scientific tooling. BoW modeling often involves trying different preprocessing strategies and observing their effects immediately, so the ability to run code interactively and inspect outputs inline is a genuine advantage. Combined with built-in virtual environment and package management support, this keeps experiments reproducible and well-organized.

As projects grow, PyCharm’s refactoring tools, project navigation, and version control integration help manage the added complexity. BoW models rarely exist in isolation, and they’re often embedded in larger ML pipelines. In such contexts, PyCharm’s features for working with larger applications mean you spend less time managing code and more time improving your models.

Setting up the project

To see these components in action, let’s build an actual bag-of-words project. We’ll use a classic text classification dataset and the AG News dataset, and then use the model to classify news articles into one of four categories: World, Sports, Business, or Science/Technology.

To get started in PyCharm, open the Projects and Files tool window and select New… > New Project…. Since this is a data science project, we can use PyCharm’s built-in Jupyter project type, which sets up a sensible default structure for us.

During project configuration, you’ll be asked to choose a Python interpreter. By default, PyCharm uses uv and lets you select from a range of Python versions, though all major dependency management systems are supported: pip, Anaconda, Pipenv, Poetry, and Hatch. Every project is automatically created with an attached virtual environment, so your setup will be ready to go each time you reopen the project.

With the project configured, we can install our dependencies via the Python Packages tool window. Simply search for a package by name, select it from the list, and install your desired version directly into the virtual environment. You can also see the same information about the package you’d find on PyPI directly within the IDE. For this project, we’ll need pandas and Numpy, along with datasets from Hugging Face, scikit-learn, Pytorch, and spaCy.

Implementing bag-of-words with PyCharm

There are many versions of this dataset online. We’ll be using one of the versions hosted on Hugging Face Hub.

Loading and preparing the data

We’ll use Hugging Face’s datasets package to download this dataset.

from datasets import load_dataset
ag_news_all = load_dataset("sh0416/ag_news")

This gives us a Hugging Face DatasetDict object. If we look at it, we can see it contains a training dataset with 120,000 news articles, and a test dataset containing 7,600 articles.

ag_news_all

DatasetDict({
    train: Dataset({
        features: ['label', 'title', 'description'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['label', 'title', 'description'],
        num_rows: 7600
    })
})

As we’ll be training a model, we also need a validation set. We’ll convert the training and test sets to pandas DataFrames, and use the train_test_split method from scikit-learn to create the validation set from the training data.

import pandas as pd
from sklearn.model_selection import train_test_split

ag_news_train = ag_news_all["train"].to_pandas()
ag_news_test = ag_news_all["test"].to_pandas()

ag_news_train, ag_news_val = train_test_split(
   ag_news_train,
   test_size=0.1,     
   random_state=456,   
   stratify=ag_news_train['label'] 
)

print(f"Training set: {len(ag_news_train)} samples")
print(f"Validation set: {len(ag_news_val)} samples")

We now have a validation set with 12,000 articles, and a training set with 108,000 articles.

Training set: 108000 samples
Validation set: 12000 samples

For those of you new to machine learning, you might be wondering why we need all of these different datasets. The reason for this is to make sure we have a good idea that our model will generalize well and perform as expected on unseen data. The training set is the only data the model ever learns from directly. The validation set is used to monitor how the model is performing on unseen data as we make modeling decisions, such as choosing how many epochs to train for, how large to make the hidden layer, or which preprocessing steps to apply (we’ll see all of this later). This means that we look at validation performance repeatedly while building the model, and this increases the risk that our choices gradually become tuned to the quirks of that particular split. This is why we need a third set (the test set), which we keep completely locked away until we’ve finished all modeling decisions and want a single, unbiased estimate of how well our model will perform on new data. Using the test set for anything other than this final evaluation would give us an overly optimistic picture of our model’s real-world performance.

Let’s now inspect our datasets. PyCharm Pro has a lot of built-in features that make working with DataFrames easier, a few of which we’ll see soon. In this DataFrame, we have three columns: The article title and description, the article text, and the label indicating which of the four news categories the article belongs to. You can open any of the DataFrame cells in the Value Editor to see its full text, or widen the column to prevent truncation, both of which are useful for a quick visual inspection.

At the top of each column, PyCharm displays column statistics, giving you an at-a-glance summary of the data. Switching from Compact to Detailed mode via Show Column Statistics gives you rich summary statistics about each column, and saves you from writing a lot of pandas boilerplate to get it! From these statistics, we can see that our training set is evenly split across the news categories (which is very handy when training a model). We can also see that some headlines and descriptions are not unique, which may introduce noise when classifying these duplicates.

The first step in preparing the data is basic string cleaning, which normalizes the text and reduces meaningless token variation. For instance, without cleaning, “Natural” and “natural” would be treated as two separate vocabulary entries, as we noted earlier.

We’ll apply four cleaning steps: lowercasing, punctuation removal, number removal, and whitespace normalization. There are different string cleaning steps you can apply depending on the language and use case, but for English-language texts, these tend to be very standard. Let’s go ahead and write a function to do this.

def apply_string_cleaning(dataset: pd.Series) -> pd.Series:
   patterns_to_remove = [
       r"[^a-zA-Z\s]",
   ]

   cleaned = dataset.str.lower()

   for pattern in patterns_to_remove:
       cleaned = cleaned.str.replace(pattern, " ", regex=True)

   cleaned = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()

   return cleaned

ag_news_train["title_clean"] = apply_string_cleaning(ag_news_train["title"])
ag_news_train["description_clean"] = apply_string_cleaning(ag_news_train["description"])

This mostly works, but there’s one issue: The regex strips apostrophes entirely, turning contractions like “you’re” into “you re” and possessives like “Canada’s” into “Canada s”. The cleanest fix is a regex that preserves apostrophes in contractions while removing possessive endings, but this is not the most enjoyable thing to write by hand.

This is where PyCharm’s built-in AI Assistant comes in. Open the chat window via the AI Chat icon on the right-hand side of the IDE and enter the following prompt:

Can you please alter the @apply_string_cleaning function so that it retains apostrophes inside words when they’re used for contractions (e.g., “you’re”), but removes them when they’re used for possessives (e.g., “Canada’s” into “Canada”).

The @ notation lets you reference specific files or objects in your IDE without copying and pasting code into the prompt, including Jupyter variables like datasets and functions.

I ran this against Claude Sonnet 4.5, though JetBrains AI supports a wide range of models from OpenAI, Anthropic, Google, and xAI, as well as open models via Ollama, LM Studio, and OpenAI-compatible APIs. Below is the updated function it returned:

def apply_string_cleaning(dataset: pd.Series) -> pd.Series:
    cleaned = dataset.str.lower()
    
    # Remove possessive apostrophes (word's -> word)
    # This pattern matches: letter(s) + 's + word boundary
    cleaned = cleaned.str.replace(r"(\w+)'s\b", r"\1", regex=True)
    
    # Remove all non-letter characters except apostrophes within words
    cleaned = cleaned.str.replace(r"[^a-zA-Z'\s]", " ", regex=True)
    
    # Clean up any apostrophes at the start or end of words
    cleaned = cleaned.str.replace(r"\s'|'\s", " ", regex=True)
    
    # Remove multiple spaces and trim
    cleaned = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()
    
    return cleaned

ag_news_train["title_clean"] = apply_string_cleaning(ag_news_train["title"])
ag_news_train["description_clean"] = apply_string_cleaning(ag_news_train["description"])

We can insert this into our Jupyter notebook directly by clicking on Insert Snippet as Jupyter Cell in the AI chat.

Once we run this updated function on our raw text, we get the correct result:

text	text_clean
Don’t stand for racism – football chief	don’t stand for racism football chief
Canada’s Barrick Gold acquires nine per cent stake in Celtic Resources (Canadian Press)	canada barrick gold acquires nine per cent stake in celtic resources canadian press

We can see the contraction “don’t” is correctly preserved in the first example, but the possessive “Canada’s” has been removed. We apply this to both the training and validation datasets using the same function, so that the cleaning is consistent across both splits:

ag_news_val["title_clean"] = apply_string_cleaning(ag_news_val["title"])
ag_news_val["description_clean"] = apply_string_cleaning(ag_news_val["description"])

Creating the bag-of-words model

Now that we have clean text, we need to build our vocabulary and encode it. We’ll use scikit-learn’s CountVectorizer for this:

from sklearn.feature_extraction.text import CountVectorizer

countVectorizerNews = CountVectorizer()
countVectorizerNews.fit(ag_news_train["text_clean"])
ag_news_train_cv = countVectorizerNews.transform(ag_news_train["text_clean"]).toarray()

The process has two distinct steps. First, .fit() scans the training data and builds a vocabulary by identifying every unique word and assigning it a fixed index position (for example, “government” = column 8,901). The result is a mapping of 59,544 unique words, which you can think of as the column headers for our eventual matrix.

Second, .transform() uses that vocabulary to convert each headline into a numerical vector, counting how many times each vocabulary word appears and placing that count at the corresponding index position.

The reason these are two separate steps is important: When we later process our validation and test data, we’ll call .transform() using the vocabulary learned from the training set. This ensures that all three splits share a consistent feature space. If we re-ran .fit() on the test data, we’d get a different vocabulary, and the model’s predictions would be meaningless.

With the vectorizer fitted and our training data transformed, we can start exploring what we’ve actually built. Let’s first take a look at the vocabulary. CountVectorizer stores it as a dictionary mapping each word to its index position, accessible via vocabulary_:

countVectorizerNews.vocabulary_

{'fed': 18461,
 'up': 55833,
 'with': 58324,
 'pension': 38929,
 'defaults': 13156,
 'citing': 9475,
 'failure': 18077,
 'of': 36704,
 'two': 54804,
 'big': 5269,
 'airlines': 1139,
 'to': 53531,
 'make': 31397,
 'payments': 38686,
 'their': 52947,
...}

len(countVectorizerNews.vocabulary_)

This confirms that our vocabulary contains 59,544 unique words. Browsing through it, you can start to guess what kinds of terms appear frequently in the different types of news. Country names feature heavily in the “world” news category, terms like “football” and “cricket” in the “sports” news category, terms like “profit” and “losses” in the “business” news category, and company names like “Google” and “Microsoft” in the “science/technology” category.

Next, let’s inspect the feature matrix itself. ag_news_train_cv is a NumPy array with one row per headline and one column per vocabulary word, giving us a matrix of shape (108,000 × 59,544). We can wrap it in a DataFrame to make it easier to inspect in PyCharm’s DataFrame viewer:

pd.DataFrame(ag_news_train_cv, columns=countVectorizerNews.get_feature_names_out())

As expected, the matrix is very sparse. Most values are zero, since any individual headline only contains a small fraction of the full vocabulary. In fact, you might have noticed that the number of columns is two-thirds of the number of rows, which is never good for a feature matrix. We’ll explore how to reduce the dimensionality of the feature space in a later section.

Note that we also need to apply this vectorization to the validation dataset before moving on to modeling. Importantly, we are only applying the .transform method to the validation set, as we already trained it on the training dataset.

ag_news_val_cv = countVectorizerNews.transform(ag_news_val["text_clean"]).toarray()

Visualizing the results

Before we move onto reducing down the dimensionality of our feature space, let’s explore the distribution of the words in our corpus. This can help us to understand the most common and rare words, and how we might use this to further process our data to amplify the signal-to-noise ratio.

Word frequency plots

We’ll start by creating a DataFrame that aggregates word counts across all headlines and ranks them by frequency:

import numpy as np

vocab = countVectorizerNews.get_feature_names_out()
counts = np.asarray(ag_news_train_cv.sum(axis=0)).flatten()

pd.DataFrame({
  'vocab': vocab,
  'count': counts,
}).sort_values('count', ascending=False).reset_index(drop=True)

First, we retrieve the vocabulary in index order using get_feature_names_out(), so each word lines up with its corresponding column in the feature matrix. We then sum the matrix column-wise (that is, across all documents) to get the total number of times each word appears in the training set. Finally, we wrap these two arrays into a DataFrame and sort by count, giving us a ranked list of the most frequent terms.

Once this DataFrame is displayed in PyCharm, we can easily turn it into a visualization without writing a single line of code. By clicking on the Chart View button in the top left-hand corner of the DataFrame, we can explore a range of ways of visualizing our data. Go to Show Series Settings in the top right-hand corner, and adjust the parameters to get the count frequencies of the words: we set the X axis value to “vocab” (and change group and sort to none), the Y axis value to “count”, and the chart type to “Bar”.

Hovering over this chart, we can see that it has a very long-tailed distribution, which is very typical of vocabulary frequencies (this is actually so typical that it is described in something called Zipf’s law). This means that the majority of our words very rarely occur in the text, and in fact, if we hover over the right-hand side of the chart, it looks like around a third of our vocabulary terms are only used once!

On the other hand, when we hover over the left-hand side of the chart, we can see that this is dominated by very common words, prepositions, and articles, such as “to”, “in”, “the”, and “you”. These words don’t really carry any meaning and pretty much occur in every text, so they’re unlikely to be useful for our classification task.

Let’s have a look at some things we can do to clean up our feature space and help our semantically meaningful words stand out a bit more.

Advanced bag-of-words techniques

The basic BoW pipeline we’ve built so far is a solid foundation, but there are several techniques that can meaningfully improve its quality. This section walks through the most important ones. We’ll only be using a selection of them in our project, but you can investigate which of these seem appropriate when building your own project.

Stop word removal

Stop words are extremely common words that appear frequently across all kinds of text but carry little meaningful information. This includes words like “the”, “is”, “and”, “of”, as we saw in the histogram in the previous section. They inflate vocabulary size without adding signal, so removing them is one of the most straightforward ways to improve your BoW representation. NLTK provides a built-in stop word list for English and many other languages.

Stemming and lemmatization

Another issue you might have noticed in our vocabulary is that words that are semantically equivalent have different syntactic forms, meaning that while they should be treated as the same token, they occupy additional token slots. We can resolve this through two techniques: stemming and lemmatization. Stemming reduces words to their root form using simple rule-based truncation (e.g. “running” → “run”), while lemmatization takes a linguistic approach, mapping words to their dictionary base form. Lemmatization is slower but generally produces cleaner results, particularly for irregular word forms.

TF-IDF

Term frequency-inverse document frequency (TF-IDF) is an extension of basic count vectorization that weights each word by how informative it actually is. A word that appears frequently in one document but rarely across the corpus receives a high weight; a word that appears everywhere receives a low one. This neatly addresses one of the core weaknesses of raw count vectors: common but uninformative words can dominate the feature space even after stop-word removal.

N-grams

Standard BoW treats each word independently, which means it misses phrases whose meaning depends on word combinations. A classic example of this is “machine learning”, which has a distinct meaning to “machine” + “learning”. N-grams address this by treating sequences of adjacent words as single tokens, so a bigram model would capture “machine learning” as a feature in its own right. The trade-off is a much larger vocabulary, so in practice, bigrams are most commonly used, with trigrams reserved for cases where capturing longer phrases is important.

Handling out-of-vocabulary words

When you apply your fitted vectorizer to new data, any words not present in the training vocabulary are silently ignored by default. For many tasks, this is acceptable, but if your production data is likely to continue introducing new terms that carry meaningful signal, it’s worth considering alternatives. One common approach is to reserve a special token to represent unseen words, which at least preserves the information that something unfamiliar appeared, even if its identity is unknown and multiple (perhaps unrelated) words are collapsed onto the same token.

However, LLMs, with their more flexible approach to tokenization, tend to be a better choice if out-of-vocabulary words will be a major issue for your model once it is in production.

Dimensionality reduction

Even after stop word removal and other cleaning steps, BoW feature matrices are typically very high-dimensional and sparse. Two widely used techniques can help. Reducing to the top-N most frequent terms is the simplest approach, discarding low-frequency words that are unlikely to generalize well. For a more principled reduction, techniques like principal component analysis (PCA) or latent semantic analysis (LSA) project the feature matrix into a lower-dimensional space, compressing the representation while preserving as much of the meaningful variance as possible.

Feature selection techniques

Rather than reducing dimensionality arbitrarily, feature selection methods identify and retain only the features most relevant to your specific task. Chi-squared testing measures the statistical dependence between each term and the target label, making it well-suited to classification tasks. Mutual information takes a similar approach, scoring each feature by how much it reduces uncertainty about the class. Both methods can substantially reduce vocabulary size while preserving model performance.

Applying bag-of-words to a real-world problem

Let’s now continue the example we started earlier. We’re going to take the work we’ve done on our AG News text classification task and take it to its completion by building a model.

A common way to build a model using encoded text is neural networks, where each of the words in the vocabulary is treated as a feature, and the categories we want to predict (in our case, the news category) are the output. We’ll start by building a baseline model that applies only string cleaning and encoding to the text.

I had originally written this model in Keras, as part of a previous BoW project from a couple of years ago. However, that code was now out of date. In order to update it and adapt it to Pytorch, I asked JetBrains AI to do the following:

Please update this neural network from Keras to Pytorch, making improvements to make the code as reusable as possible.

This gave us the following successful port of the code:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

class MulticlassClassificationModel(nn.Module):
   def __init__(self, input_size: int, hidden_layer_size: int, num_classes: int = 4):
       super(MulticlassClassificationModel, self).__init__()
       self.fc1 = nn.Linear(input_size, hidden_layer_size)
       self.relu = nn.ReLU()
       self.fc2 = nn.Linear(hidden_layer_size, num_classes)

   def forward(self, x):
       x = self.fc1(x)
       x = self.relu(x)
       x = self.fc2(x)
       return x

def train_text_classification_model(
       train_features: np.ndarray,
       train_labels: np.ndarray,
       validation_features: np.ndarray,
       validation_labels: np.ndarray,
       input_size: int,
       num_epochs: int,
       hidden_layer_size: int,
       num_classes: int = 4,
       batch_size: int = 1920,
       learning_rate: float = 0.001) -> MulticlassClassificationModel:

   # Convert labels to 0-indexed (AG News has labels 1,2,3,4 -> need 0,1,2,3)
   train_labels_indexed = train_labels - 1
   validation_labels_indexed = validation_labels - 1

   # Convert numpy arrays to PyTorch tensors
   X_train = torch.FloatTensor(train_features.copy())
   y_train = torch.LongTensor(train_labels_indexed.copy())
   X_val = torch.FloatTensor(validation_features.copy())
   y_val = torch.LongTensor(validation_labels_indexed.copy())

   # Create datasets and dataloaders
   train_dataset = TensorDataset(X_train, y_train)
   train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

   # Initialize model, loss function, and optimizer
   model = MulticlassClassificationModel(input_size, hidden_layer_size, num_classes)
   criterion = nn.CrossEntropyLoss()
   optimizer = optim.RMSprop(model.parameters(), lr=learning_rate)

   # Training loop
   for epoch in range(num_epochs):
       model.train()
       train_loss = 0.0
       correct_train = 0
       total_train = 0

       for batch_features, batch_labels in train_loader:
           # Forward pass
           outputs = model(batch_features)
           loss = criterion(outputs, batch_labels)

           # Backward pass and optimization
           optimizer.zero_grad()
           loss.backward()
           optimizer.step()

           # Calculate training metrics
           train_loss += loss.item()
           _, predicted = torch.max(outputs, 1)
           correct_train += (predicted == batch_labels).sum().item()
           total_train += batch_labels.size(0)

       # Validation
       model.eval()
       with torch.no_grad():
           val_outputs = model(X_val)
           val_loss = criterion(val_outputs, y_val)
           _, val_predicted = torch.max(val_outputs, 1)
           correct_val = (val_predicted == y_val).sum().item()
           total_val = y_val.size(0)

       # Print epoch metrics
       train_acc = correct_train / total_train
       val_acc = correct_val / total_val
       print(f'Epoch [{epoch+1}/{num_epochs}], '
             f'Train Loss: {train_loss/len(train_loader):.4f}, '
             f'Train Acc: {train_acc:.4f}, '
             f'Val Loss: {val_loss:.4f}, '
             f'Val Acc: {val_acc:.4f}')

   return model

def generate_predictions(model: MulticlassClassificationModel,
                       validation_features: np.ndarray,
                       validation_labels: np.ndarray) -> list:
   model.eval()

   # Convert to tensors
   X_val = torch.FloatTensor(validation_features.copy())

   with torch.no_grad():
       outputs = model(X_val)
       _, predicted = torch.max(outputs, 1)

   # Convert back to 1-indexed labels to match original dataset
   predicted_labels = (predicted.numpy() + 1)

   print("Confusion Matrix:")
   print(pd.crosstab(validation_labels, predicted_labels,
                     rownames=['Actual'], colnames=['Predicted']))
   return predicted_labels.tolist()

Let’s walk through this code step-by-step to understand how we’re going to train our text classifier.

The model architecture

MulticlassClassificationModel is a simple two-layer feedforward neural network. It takes a BoW vector as input, with each feature being a vocabulary word, and passes it through two sequential transformations to produce a prediction. The first layer (fc1) compresses this high-dimensional input down to a smaller intermediate representation, whose size we control via hidden_layer_size. A ReLU activation is then applied, which introduces a small amount of mathematical complexity that allows the model to learn patterns that a simple weighted sum couldn’t capture. The second layer (fc2) takes this intermediate representation and maps it down to four output values, one per news category, where the category with the highest value becomes the model’s prediction.

Training and validation

train_text_classification_model handles the full training loop. It starts with a small amount of housekeeping: The AG News labels run from 1 to 4, but PyTorch expects 0-indexed classes, so these are shifted down by 1. The features and labels are then converted to PyTorch tensors, and a DataLoader is created to feed the training data to the model in batches.

Each epoch, the model processes the training data batch by batch. For each batch, it runs a forward pass to generate predictions, computes the cross-entropy loss against the true labels, and then runs a backward pass to update the model weights via the RMSprop optimizer. At the end of every epoch, the model switches into evaluation mode and runs inference over the full validation set, printing the training and validation loss and accuracy so we can monitor how training is progressing.

Generating predictions

Once training is complete, generate_predictions runs the trained model on a held-out dataset and returns the predicted class for each article. It also prints a confusion matrix, which gives us a breakdown of which categories the model is getting right and where it’s getting confused, which is a much more informative picture than accuracy alone.

Running the baseline

We can now train the baseline model. We pass in the raw count-vectorized training and validation features, specify an input size equal to the vocabulary size (59,544 columns), train for two epochs, and use a hidden layer of 5,000 nodes.

baseline_model = train_text_classification_model(
    ag_news_train_cv,
    ag_news_train["label"].to_numpy(),
    ag_news_val_cv,
    ag_news_val["label"].to_numpy(),
    ag_news_train_cv.shape[1],
    5,
    5000
)

predictions = generate_predictions(
    baseline_model,
    ag_news_val_cv,
    ag_news_val["label"].to_numpy()
)

Epoch [1/2], Train Loss: 0.3553, Train Acc: 0.8813, Val Loss: 0.2307, Val Acc: 0.9243
Epoch [2/2], Train Loss: 0.1217, Train Acc: 0.9587, Val Loss: 0.2352, Val Acc: 0.9240

Confusion Matrix:
Predicted     1     2     3     4
Actual                           
1          2774    65    89    72
2            37  2944     9    10
3           112    20  2694   174
4            97    20   207  2676

Even with the very basic data preparation we did, we can see we’ve performed very well on this prediction task, with around 92% accuracy. The confusion matrix shows that the model seems to have the easiest time distinguishing between category two (sports) and the other topics, and the hardest time distinguishing between category three (business) and category four (science/technology). This makes sense, as the words used to describe sports are very distinct and unlikely to be used in other contexts (things like football), whereas there is likely to be overlapping vocabulary between business and technology (especially company names).

As we saw above, there is a lot we can do to improve the signal-to-noise ratio in BoW modeling. Let’s apply four commonly used techniques to our data and see whether this improves our predictions: lemmatization, stop word removal, limiting our vocabulary to the top N terms, and TF-IDF weighting. As you’ll see, all of these can be done relatively simply using inbuilt functions in packages such as spaCy and scikit-learn.

Lemmatization

As we discussed earlier, lemmatization collapses inflected word forms into a single vocabulary entry by mapping each word to its dictionary base form, which both shrinks the vocabulary and concentrates the signal for each concept into a single feature. We’ll use spaCy for this, which first requires downloading its small English language model:

!python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

Our lemmatise_text function passes each text through spaCy’s NLP pipeline using nlp.pipe(), which processes them in batches of 1,000 for efficiency. For each document, it extracts the .lemma_ attribute of every token and joins them back into a single string. One small detail worth noting: we preserve the original DataFrame index when constructing the output Series, so that rows stay correctly aligned when we assign the results back.

We apply lemmatization before string cleaning, since spaCy needs the original casing and punctuation to correctly identify grammatical structure. For example, “running” and “Running” lemmatize to the same thing, but removing punctuation first can confuse the parser. Once lemmatized, we pass the output through apply_string_cleaning as before:

ag_news_train["title_clean"] = apply_string_cleaning(lemmatise_text(ag_news_train["title"]))
ag_news_train["description_clean"] = apply_string_cleaning(lemmatise_text(ag_news_train["description"]))

ag_news_val["title_clean"] = apply_string_cleaning(lemmatise_text(ag_news_val["title"]))
ag_news_val["description_clean"] = apply_string_cleaning(lemmatise_text(ag_news_val["description"]))

ag_news_train["text_clean"] = ag_news_train["title_clean"] + " " + ag_news_train["description_clean"]

ag_news_val["text_clean"] = ag_news_val["title_clean"] + " " + ag_news_val["description_clean"]

We apply this separately to the title and description columns before concatenating them into a single text_clean field. As you can see, we do this for both the training and validation sets using the same function, so that lemmatization is applied consistently across both splits.

Removing stop words

As with lemmatization, we covered the motivation for stop word removal earlier: Words like “the”, “is”, and “of” appear so frequently across all texts that they add noise rather than signal to our feature matrix. Here we’ll actually apply it to our data.

def remove_stopwords(texts: pd.Series) -> pd.Series:
   texts = texts.fillna("").astype(str)

   filtered_texts = []
   for doc in nlp.pipe(texts, batch_size=1000):
       filtered_texts.append(
           " ".join(token.text for token in doc if not token.is_stop)
       )

   return pd.Series(filtered_texts, index=texts.index)

Our remove_stopwords function again uses nlp.pipe() to process texts in batches. For each document, it filters out any token where spaCy’s is_stop attribute is True, and joins the remaining tokens back into a string. Conveniently, spaCy handles stop word detection using the same pipeline we already loaded for lemmatization, so no additional setup is needed.

We apply this to the already-cleaned and lemmatized text_clean column for both the training and validation sets, so the stop word removal builds directly on our previous preprocessing steps and is applied consistently across both splits.

ag_news_train["text_no_stopwords"] = remove_stopwords(ag_news_train["text_clean"])
ag_news_val["text_no_stopwords"] = remove_stopwords(ag_news_val["text_clean"])

Top N terms and TF-IDF vectorization

The final two improvements we’ll apply are limiting the vocabulary size and switching from raw count vectorization to TF-IDF weighting. Conveniently, scikit-learn’s TfidfVectorizer handles both in a single step.

Recall from earlier that TF-IDF downweights words that appear frequently across many documents while upweighting words that are distinctive to particular documents. This cleans up uninformative words that don’t quite qualify as stopwords, but add little useful information to our dataset. The max_features=20000 argument caps the vocabulary at the 20,000 most frequent terms after TF-IDF scoring, which discards the long tail of rare words that are unlikely to generalize well and brings our feature matrix down to a much more manageable size. (The choice of 20,000 words is arbitrary. We could have easily used a smaller or larger number, depending on our dataset and use case.)

As with CountVectorizer, we fit only on the training data and then use that fixed vocabulary to transform both the training and validation sets:

TfidfVectorizerNews = TfidfVectorizer(max_features=20000)
TfidfVectorizerNews.fit(ag_news_train["text_no_stopwords"])

ag_news_train_tfidf = TfidfVectorizerNews.transform(ag_news_train["text_no_stopwords"]).toarray()
ag_news_val_tfidf = TfidfVectorizerNews.transform(ag_news_val["text_no_stopwords"]).toarray()

We can inspect the resulting vocabulary and feature matrix exactly as we did before:

TfidfVectorizerNews.vocabulary_

{'fed': np.int64(6243),
 'pension': np.int64(13134),
 'default': np.int64(4469),
 'cite': np.int64(3200),
 'failure': np.int64(6109),
 'big': np.int64(1787),
 'airline': np.int64(401),
 'payment': np.int64(13051),
 'plan': np.int64(13424),
 'government': np.int64(7306),
 'official': np.int64(12453),
 'tuesday': np.int64(18437),
 'congress': np.int64(3691),
 'hard': np.int64(7689),
 'corporation': np.int64(3901),
...}

pd.DataFrame(ag_news_train_tfidf, columns=TfidfVectorizerNews.get_feature_names_out())

Compared to our baseline feature matrix of 59,544 columns filled almost entirely with zeros, this is considerably leaner. We now have 20,000 columns of weighted scores that better reflect each word’s actual importance to the document it appears in. It is still relatively sparse, but we can see from both the feature matrix and the vocabulary list that it is much more focused on semantically rich words.

Fitting the revised model

With our improved features in hand, we can now retrain the model. The call is identical to before, except we pass in the TF-IDF feature matrices instead of the raw count vectors, and the input size is now 20,000 rather than 59,544:

baseline_model = train_text_classification_model(
    ag_news_train_tfidf,
    ag_news_train["label"].to_numpy(),
    ag_news_val_tfidf,
    ag_news_val["label"].to_numpy(),
    ag_news_train_tfidf.shape[1],
    2,
    5000
)

predictions = generate_predictions(
    baseline_model,
    ag_news_val_tfidf,
    ag_news_val["label"].to_numpy()
)

Epoch [1/2], Train Loss: 0.3183, Train Acc: 0.8932, Val Loss: 0.2301, Val Acc: 0.9225
Epoch [2/2], Train Loss: 0.1512, Train Acc: 0.9475, Val Loss: 0.2332, Val Acc: 0.9243
Confusion Matrix - Raw Counts:
Predicted     1     2     3     4
Actual                           
1          2703    71   121   105
2            20  2955    13    12
3            68    19  2691   222
4            77    17   163  2743

The results are actually very encouraging! Our overall validation accuracy is essentially unchanged at around 92%, but we’ve achieved this with a feature matrix that is less than a third of the size. This suggests that the extra vocabulary in the baseline (including the stop words) was contributing to noise rather than signal. Reducing the size of the feature matrix makes our model more stable, less prone to overfitting, and much more manageable to deploy.

Looking at the confusion matrix, the pattern of errors is similar to before: Sports (category two) is the easiest category to classify, with 98.5% accuracy, while Business (category three) and Science/Technology (category four) remain the hardest to separate, with around 7% of articles in each category being misclassified as the other. This is consistent with what we saw in the baseline, so it seems that the preprocessing improvements have tightened things up at the margins, but the fundamental difficulty of the Business/Technology boundary is a property of the data rather than the feature representation.

Applying our model to the test set

Finally, we need to validate that our model performs as well on the test set as it does on the validation set. Up to this point, we’ve deliberately kept the test set locked away. As mentioned earlier, if we had been making modeling decisions based on test set performance, we’d risk inadvertently overfitting our choices to it, and our final accuracy estimate would be optimistic.

The preprocessing steps must be applied in exactly the same order as for the training and validation data: lemmatization, string cleaning, concatenation of title and description, and stop-word removal. Crucially, we also call .transform() rather than .fit_transform() on the test text, using the vocabulary learned from the training data:

ag_news_test["title_clean"] = apply_string_cleaning(lemmatise_text(ag_news_test["title"]))
ag_news_test["description_clean"] = apply_string_cleaning(lemmatise_text(ag_news_test["description"]))
ag_news_test["text_clean"] = ag_news_test["title_clean"] + " " + ag_news_test["description_clean"]
ag_news_test["text_no_stopwords"] = remove_stopwords(ag_news_test["text_clean"])

ag_news_test_tfidf = TfidfVectorizerNews.transform(ag_news_test["text_no_stopwords"]).toarray()

We can then generate predictions and evaluate accuracy on the test set:

test_predictions = generate_predictions(
    baseline_model,
    ag_news_test_tfidf,
    ag_news_test["label"].to_numpy()
)

test_accuracy = accuracy_score(ag_news_test["label"].to_numpy(), test_predictions)
print(f"Test Accuracy: {test_accuracy:.4f}")

Test Accuracy: 0.9183

Confusion Matrix - Raw Counts:
Predicted     1     2     3     4
Actual                           
1          1710    54    78    58
2            13  1870    10     7
3            51    12  1676   161
4            53     9   115  1723

The test accuracy of 91.8% is very close to the 92.4% we saw on the validation set, which is a reassuring sign that our model has generalized well rather than overfitting to the validation data. The confusion matrix tells the same story as before: Sports (category two) remains the easiest category to classify, with only 30 misclassified articles out of 1,900, while the Business/Technology boundary continues to be the main source of errors, with around 8% of articles in each category being misclassified as the other. The consistency between validation and test results gives us confidence that these error patterns reflect genuine properties of the data rather than artifacts of any particular split.

Limitations and alternatives

Loses word order information

The most fundamental limitation of the bag-of-words model is right there in the name: it treats text as an unordered collection of words, discarding all sequence information. This means “the dog bit the man” and “the man bit the dog” produce identical vectors, even though they describe very different events. For many classification tasks, this doesn’t matter much, but for tasks that require understanding the relationship between words, such as question answering or natural language inference, the loss of word order is a serious handicap.

Ignores semantics and context

BoW has no notion of word meaning or context. Each word is simply a column in a matrix, entirely independent of every other word. This creates two related problems. First, synonyms are treated as completely distinct features: “cheap” and “inexpensive” contribute nothing to each other’s signal, even though they mean the same thing. Second, words with multiple meanings are treated as a single feature regardless of context: “bank” means the same thing whether it appears in a sentence about rivers or finance. Both of these issues limit how well BoW representations can capture the actual semantics of a text.

Can result in large, sparse vectors

As we saw in our own example, even a moderately sized corpus of news headlines can produce a vocabulary of nearly 60,000 unique terms. The resulting feature matrix has one column per vocabulary word, but any individual document only uses a tiny fraction of them, leaving the vast majority of values at zero. This sparsity creates two practical problems: The matrices can consume a large amount of memory if stored densely, and the high dimensionality can make it harder for models to find meaningful patterns, a phenomenon sometimes called the curse of dimensionality.

Alternatives

If BoW’s limitations are a bottleneck for your task, there are several well-established alternatives worth considering.

Word embeddings (Word2Vec and GloVe) address the semantics problem by representing each word as a dense vector in a continuous space, where similar words are geometrically close to each other. They capture distributional meaning far more richly than BoW, and are a natural next step when synonym handling or word similarity matters. Doc2Vec extends this idea to produce embeddings for entire documents rather than individual words.
Transformer-based models (BERT and GPT) go further still, generating contextual representations where the same word receives a different vector depending on the surrounding text. This handles polysemy directly and captures complex long-range dependencies between words. The trade-off is substantially higher computational cost and complexity compared to BoW.
Topic models like latent Dirichlet allocation (LDA) take a different angle entirely. Rather than encoding documents for downstream classification, they are generative models that discover latent thematic structure in a corpus. This is useful when your goal is exploration and interpretation rather than prediction.

For tasks where BoW already performs well, as we saw here with AG News, the added complexity of these approaches may not be worth the cost. BoW remains a strong baseline, and it’s always worth establishing how far it can take you before reaching for heavier machinery.

Get started with PyCharm today

In this post, we’ve covered a lot of ground: from the fundamentals of the bag-of-words model and how it converts text into numerical vectors, through to building and iteratively improving a real text classification pipeline on the AG News dataset. Along the way, we’ve seen how preprocessing steps like lemmatization, stop word removal, vocabulary capping, and TF-IDF weighting can meaningfully improve the efficiency of your feature representation, and how PyCharm’s DataFrame viewer, column statistics, chart view, and AI Assistant make each of these steps faster and easier to inspect and debug.

If you’d like to try this yourself, PyCharm Pro comes with a 30-day trial. As we saw in this tutorial, its built-in support for Jupyter notebooks, virtual environments, and scientific libraries means you can go from a blank project to a working NLP pipeline with minimal setup friction, leaving you free to focus on the fun parts.

You can find the full code for this project on GitHub. If you’re interested in exploring more NLP topics, check out our recent blogs here.

PyCharm for Django Fundraiser: Why Django Matters in the AI Era – And Why We’re Supporting It

Valeria Letusheva — Wed, 22 Apr 2026 12:15:34 +0000

Spend a few minutes around developer content, and it’s easy to come away with the impression that web apps now appear to almost write themselves.

Everything that follows – review, verification, refactoring, debugging, and the open-source frameworks that make those apps dependable – gets less attention. AI can speed up code generation, but it does not remove the need for stable foundations. A lot of AI-generated code works because it’s built on top of mature open-source frameworks, libraries, and documentation.

AI can scaffold a web app in thirty seconds. Django is what keeps it running for ten years. That gap is only getting more valuable.
Will Vincent, former Django Board Member, co-host of the Django Chat podcast and co-writer of the weekly Django News newsletter

As AI makes OSS easier to consume, it can also make the work behind it easier to overlook. But OSS still needs support – perhaps more than ever.

PyCharm for Django Fundraiser

PyCharm has supported Django through fundraising campaigns and ongoing collaboration with the Django Software Foundation (DSF). This year, we’re doing it again.

Together with the Django community, this campaign raised $350,000 for Django from 2016 to 2025. That support helps keep Django secure, stable, relevant, and sustainable, while also supporting community programs such as Django Girls and official events. Previous PyCharm fundraisers accounted for approximately 25% of the DSF budget, according to Django’s official blog.

Django is the rare framework that rewards you the longer you use it: mature, dependable, and still innovating. Best-in-class software, matched by one of the most welcoming communities in open source.
Will Vincent, former Django Board Member, co-host of the Django Chat podcast and co-writer of the weekly Django News newsletter

If Django has helped you learn, ship, or maintain real web products, this is a direct way to give back.

You can donate to the Django Software Foundation directly, or you can support Django through this fundraiser and get a tool you’ll rely on every day.

Django’s ‘batteries included’ philosophy was built for humans who wanted to ship fast. Turns out it’s perfect for AI agents too — fewer decisions, fewer dependencies, and fewer ways to go wrong.
Will Vincent, former Django Board Member, co-host of the Django Chat podcast and co-writer of the weekly Django News newsletter

The offer

During this campaign, get 30% off PyCharm Pro, with 100% of the proceeds going to the DSF. Or you can bundle PyCharm Pro with the JetBrains AI Pro plan and get 40% off PyCharm Pro.

This campaign ends in less than two weeks, so act now!

Get the offer

Why PyCharm Pro

Perfect for your workflow

The hard part of modern development is often not writing code from scratch – it’s understanding the whole project well enough to change it safely.

That’s where PyCharm Pro proves its value:

Navigate and refactor across your entire Django project, from templates to databases.
Work with databases without leaving the IDE.
Build and debug Django templates with full awareness of your context.
Develop frontend code with built-in support for JavaScript, TypeScript, and major frameworks.
Run and debug remote and Docker-based environments with ease

No editor understands Django like PyCharm does — from template tags to ORM queries to migrations, it sees the whole stack the way you do.
Will Vincent, former Django Board Member, co-host of the Django Chat podcast and co-writer of the weekly Django News newsletter

For Django work, I think PyCharm is one of the best tools available. I use it every day. If you haven’t given it a try, this campaign is a great opportunity – AND it supports the Django Software Foundation!
Sarah Boyce, Django Fellow and Djangonaut Space co-organizer

AI on your terms

If you want AI in PyCharm, you can start with JetBrains AI directly in the IDE. You can also shape it to fit your workflow. Bring your own key, sign in with a supported provider, use third-party or local models, or connect compatible agents such as Claude Code and Codex via ACP.

That gives you more control over how you work with AI, instead of locking you into a single workflow, model, or provider. And if AI isn’t what you need, you can simply turn it off.

Support the framework you use every day

If Django is part of how you build, this purchase can improve your workflow while also investing in the framework behind it.

Get the offer

Happy coding!

How (Not) to Learn Python

Cheuk Ting Ho — Fri, 10 Apr 2026 14:21:21 +0000

While listening to Mark Smith’s inspirational talk for Python Unplugged on PyTV about How to Learn Python, what caught my attention was that Mark suggested turning off some of PyCharm’s AI features to help you learn Python more effectively.

As a PyCharm user myself, I’ve found the AI-powered features beneficial in my day-to-day work; however, I never considered that I could turn certain features on or off to customize my experience. This can be done from the settings menu under Editor | General | Code Completion | Inline.

While we are at it, let’s have a look at these features and investigate in more detail why they are great for professional developers but may not be ideal for learners.

Local full line code completion suggestions

JetBrains AI credits are not consumed when you use local line completion. The completion prediction is performed using a built-in local deep learning model. To use this feature, make sure the box for Enable inline completion using language models is checked, and choose either Local or Cloud and local in the options. To show the complete results using the local model alone, we will look at the predictions when only Local is selected.

When it’s selected, you see that the only code completion available out of the box in PyCharm is for Python. To make suggestions available for CSS or HTML, you need to download additional models.

When you are writing code, you will see suggestions pop up in grey with a hint for you to use Tab to complete the line.

After completing that line, you can press Enter to go to the next one, where there may be a new suggestion that you can again use Tab to complete. As you see, this can be very convenient for developers in their daily coding, as it saves time that would otherwise be spent typing obvious lines of code that follow the flow naturally.

However, for beginners, mindlessly hitting Tab and letting the model complete lines may discourage them from learning how to use the functions correctly. An alternative is to use the hint provided by PyCharm to help you choose an appropriate method from the available list, determine which parameters are needed, check the documentation if necessary, and write the code yourself. Here is what the hint looks like when code completion is turned off:

Cloud-based completion suggestions

Let’s have a look at cloud-based completion in contrast to local completion. When using cloud-based completion, next-edit suggestions are also available (which we will look at in more detail in the next section).

Cloud-based completion comes with support for multiple languages by default, and you can switch it on or off for each language individually.

Cloud-based completion provides more functionality than local model completion, but you need a JetBrains AI subscription to use it.

You may also connect to a third-party AI provider for your cloud-based completion. Since this support is still in Beta in PyCharm 2026.1, it is highly recommended to keep your JetBrains AI subscription active as a backup to ensure all features are available.

After switching to cloud-based completion, one of the differences I noticed was that it is better at multiple-line completion, which can be more convenient. However, I have also encountered situations where the completion provided too much for me, and I had to jump in to make my own modifications after accepting the suggestions.

For learners of Python, again, you may want to disable this functionality or have to audit all the suggestions in detail yourself. In addition to the danger of relying too heavily on code completion, which removes opportunities to learn, cloud code completion poses another risk for learners. Because larger suggestions require active review from the developer, learners may not be equipped to fully audit the wholesale suggestions they are accepting. Disabling this feature for learners not only encourages learning, but it can also help prevent mistakes.

Next edit suggestions

In addition to cloud-based completion, JetBrains AI Pro, Ultimate, and Enterprise users are able to take advantage of next edit suggestions.

When they are enabled, every time you make changes to your code, for example, renaming a variable, you will be given suggestions about other places that need to be changed.

And when you press Tab, the changes will be made automatically. You can also customize this behavior so you can see previews of the changes and jump continuously to the next edit until no more are suggested.

This is, no doubt, a very handy feature. It can help you avoid some careless mistakes, like forgetting to refactor your code when you make changes. However, for learners, thinking about what needs to be done is a valuable thought exercise, and using this feature can deprive them of some good learning opportunities.

Conclusion

PyCharm offers a lot of useful features to smooth out your day-to-day development workflow. However, these features may be too powerful, and even too convenient, for those who have just started working with Python and need to learn by making mistakes. It is good to use AI features to improve our work, but we also need to double-check the results and make sure that we want what the AI suggests.

To learn more about how to level up your Python skills, I highly recommend watching Mark’s talk on PyTV and checking out all the AI features that JetBrains AI has to offer. I hope you will find the perfect way to integrate them into your work while remaining ready to turn them off when you plan to learn something new.

How to Train Your First TensorFlow Model in PyCharm

Evgenia Verbina — Tue, 07 Apr 2026 10:36:35 +0000

This is a guest post from Iulia Feroli, founder of the Back To Engineering community on YouTube.

TensorFlow is a powerful open-source framework for building machine learning and deep learning systems. At its core, it works with tensors (a.k.a multi‑dimensional arrays) and provides high‑level libraries (like Keras) that make it easy to transform raw data into models you can train, evaluate, and deploy.

TensorFlow helps you handle the full pipeline: loading and preprocessing data, assembling models from layers and activations, training with optimizers and loss functions, and exporting for serving or even running on edge devices (including lightweight TensorFlow Lite models on Raspberry Pi and other microcontrollers).

If you want to make data-driven applications, prototyping neural networks, or ship models to production or devices, learning TensorFlow gives you a consistent, well-supported toolkit to go from idea to deployment.

If you’re brand new to TensorFlow, start by watching the short overview video where I explain tensors, neural networks, layers, and why TensorFlow is great for taking data → model → deployment, and how all of this can be explained with a LEGO-style pieces sorting example.

In this blog post, I’ll walk you through a first, stripped-down TensorFlow implementation notebook so we can get started with some practical experience. You can also watch the walkthrough video to follow along.

We’ll be exploring a very simple use case today: load the Fashion MNIST dataset, build two very simple Keras models, train and compare them, then dig into visualizations (predictions, confidence bars, confusion matrix). I kept the code minimal and readable so you can focus on the ideas – and you’ll see how PyCharm helps along the way.

Training TensorFlow models step by step

Getting started in PyCharm

We’ll be leveraging PyCharm’s native Notebook integration to build out our project. This way, we can inspect each step of the pipeline and use some supporting visualization along the way. We’ll create a new project and generate a virtual environment to manage our dependencies.

If you’re running the code from the attached repo, you can install directly from the requirements file. If you wish to expand this example with additional visualizations for further models, you can easily add more packages to your requirements as you go by using the PyCharm package manager helpers for installing and upgrading.

Load `Fashion MNIST` and inspect the data

Fashion MNIST is a great starter because the images are small (28×28 pixels), visually meaningful, and easy to interpret. They represent various garment types as pixelated black-and-white images, and provide the relevant labels for a well-contained classification task. We can first take a look at our data sample by printing some of these images with various matplotlib functions:

```
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for i, ax in enumerate(axes.flat):
    ax.imshow(x_train[i], cmap='gray')
    ax.set_title(class_names[y_train[i]])
    ax.axis('off')
plt.show()
```
# Two simple models (a quick experiment)
```
model1 = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(128, activation='relu'),
    layers.Dense(10, activation='softmax')
])
model2 = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(128, activation='relu'),
    layers.Dense(128, activation='relu'),
    layers.Dense(10, activation='softmax')
])
```

Compile and train your first model

From here, we can compile and train our first TensorFlow model(s). With PyCharm’s code completion features and documentation access, you can get instant suggestions for building out these simple code blocks.

For a first try at TensorFlow, this allows us to spin up a working model with just a few presses of Tab in our IDE. We’re using the recommended standard optimizer and loss function, and we’re tracking for accuracy. We can choose to build multiple models by playing around with the number or type of layers, along with the other parameters.

```
model1.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
model1.fit(x_train, y_train, epochs=10)
model2.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
model2.fit(x_train, y_train, epochs=15)
```

Evaluate and compare your TensorFlow model performance

```
loss1, accuracy1 = model1.evaluate(x_test, y_test)
print(f'Accuracy of model1: {accuracy1:.2f}')
loss2, accuracy2 = model2.evaluate(x_test, y_test)
print(f'Accuracy of model2: {accuracy2:.2f}')
```

Once the models are trained (and you can see the epochs progressing visually as each cell is run), we can immediately evaluate the performance of the models.

In my experiment, model1 sits around ~0.88 accuracy, and while model2 is a little higher than that, it took 50% longer to train. That’s the kind of trade‑off you should be thinking about: Is a tiny accuracy gain worth the additional compute and complexity?

We can dive further into the results of the model run by generating a DataFrame instance of our new prediction dataset. Here we can also leverage built-in functions like `describe` to quickly get some initial statistical impressions:

```
predictions = model1.predict(x_test)
import pandas as pd
df_pred = pd.DataFrame(predictions, columns=class_names)
df_pred.describe()
```

However, the most useful statistics will compare our model’s prediction with the ground truth “real” labels of our dataset. We can also break this down by item category:

```
y_pred = model1.predict(x_test).argmax(axis=1)
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8,6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_names, yticklabels=class_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
print('Classification report:')
print(classification_report(y_test, y_pred, target_names=class_names))
```

From here, we can notice that the accuracy differs quite a bit by type of garment. A possible interpretation of this is that trousers are quite a distinct type of clothing from, say, t-shirts and shirts, which can be more commonly confused.

This is, of course, the type of nuance that, as humans, we can pick up by looking at the images, but the model only has access to a matrix of pixel values. The data does seem, however, to confirm our intuition. We can further build a more comprehensive visualization to test this hypothesis.

```
import numpy as np
import matplotlib.pyplot as plt
# pick 8 wrong examples
y_pred = predictions.argmax(axis=1)
wrong_idx = np.where(y_pred != y_test)[0][:8]  # first 8 mistakes
n = len(wrong_idx)
fig, axes = plt.subplots(n, 2, figsize=(10, 2.2 * n), constrained_layout=True)
for row, idx in enumerate(wrong_idx):
    p = predictions[idx]
    pred = int(np.argmax(p))
    true = int(y_test[idx])
    axes[row, 0].imshow(x_test[idx], cmap="gray")
    axes[row, 0].axis("off")
    axes[row, 0].set_title(
        f"WRONG  P:{class_names[pred]} ({p[pred]:.2f})  T:{class_names[true]}",
        color="red",
        fontsize=10
    )
    bars = axes[row, 1].bar(range(len(class_names)), p, color="lightgray")
    bars[pred].set_color("red")
    axes[row, 1].set_ylim(0, 1)
    axes[row, 1].set_xticks(range(len(class_names)))
    axes[row, 1].set_xticklabels(class_names, rotation=90, fontsize=8)
    axes[row, 1].set_ylabel("conf", fontsize=9)
plt.show()
```

This table generates a view where we can explore the confidence our model had in a prediction: By exploring which weight each class was given, we can see where there was doubt (i.e. multiple classes with a higher weight) versus when the model was certain (only one guess). These examples further confirm our intuition: top-types appear to be more commonly confused by the model.

Conclusion

And there we have it! We were able to set up and train our first model and already drive some data science insights from our data and model results. Using some of the PyCharm functionalities at this point can speed up the experimentation process by providing access to our documentation and applying code completion directly in the cells. We can even use AI Assistant to help generate some of the graphs we’ll need to further evaluate the TensorFlow model performance and investigate our results.

You can try out this notebook yourself, or better yet, try to generate it with these same tools for a more hands-on learning experience.

Where to go next

This notebook is a minimal, teachable starting point. Here are some practical next steps to try afterwards:

Replace the dense baseline with a small CNN (Conv2D → MaxPooling → Dense).
Add dropout or batch normalization to reduce overfitting.
Apply data augmentation (random shifts/rotations) to improve generalization.
Use callbacks like EarlyStopping and ModelCheckpoint so training is efficient, and you keep the best weights.
Export a SavedModel for server use or convert to TensorFlow Lite for edge devices (Raspberry Pi, microcontrollers).

Frequently asked questions

When should I use TensorFlow?

TensorFlow is best used when building machine learning or deep learning models that need to scale, go into production, or run across different environments (cloud, mobile, edge devices).

TensorFlow is particularly well-suited for large-scale models and neural networks, including scenarios where you need strong deployment support (TensorFlow Serving, TensorFlow Lite). For research prototypes, TensorFlow is viable, but it’s more commonplace to use lightweight frameworks for easier experimentation.

Can TensorFlow run on a GPU?

Yes, TensorFlow can run GPUs and TPUs. Additionally, using a GPU can significantly speed up training, especially for deep learning models with large datasets. The best part is, TensorFlow will automatically use an available GPU if it’s properly configured.

What is loss in TensorFlow?

Loss (otherwise known as loss function) measures how far a model’s predictions are from the actual target values. Loss in TensorFlow is a numerical value representing the distance between predictions and actual target values. A few examples include:

MSE (mean squared error), used in regression tasks.
Cross-entropy loss, often used in classification tasks.

How many epochs should I use?

There’s no set number of epochs to use, as it depends on your dataset and model. Typical approaches cover:

Starting with a conservative number (10–50 epochs).
Monitoring validation loss/accuracy and adjusting based on the results you see.
Using early stopping to halt training when improvements decrease.

An epoch is one full pass through your training data. Not enough passes through leads to underfitting, and too many can cause overfitting. The sweet spot is where your model generalizes best to unseen data.

About the author

Iulia Feroli

Iulia’s mission is to make tech exciting, understandable, and accessible to the new generation.

With a background spanning data science, AI, cloud architecture, and open source, she brings a unique perspective on bridging technical depth with approachability.

She’s building her own brand, Back To Engineering, through which she creates a community for tech enthusiasts, engineers, and makers. From YouTube videos on building robots from scratch, to conference talks or keynotes about real, grounded AI, and technical blogs and tutorials – Iulia shares her message worldwide on how to turn complex concepts into tools developers can use every day.

What’s New in PyCharm 2026.1

Ilia Afanasiev — Mon, 30 Mar 2026 15:31:08 +0000

Welcome to PyCharm 2026.1. This release doesn’t just add features – it rethinks how you build, debug, and scale Python projects. From a brand-new debugging engine powered by debugpy to first-class uv support on remote targets and expanded JavaScript support in the free tier, this version is all about removing friction and letting you focus on your code. Whether you’re working locally, over SSH, or inside Docker, PyCharm now adapts to your setup instead of the other way around.

In this post, we’ll explore the highlights of this update and show you how these improvements can streamline your daily workflow.

Standardizing the future of debugging with debugpy

PyCharm now offers the option to use debugpy as the default debugger backend, providing the industry-standard Debug Adapter Protocol (DAP) that aligns the IDE with the broader Python ecosystem. By replacing complex, legacy socket-waiting logic with a more stable connection model, race conditions and timing edge cases will no longer interfere with your debugging experience.

A modern foundation for Python development

The new engine provides full native support for PEP 669, utilizing Python 3.12’s low-impact monitoring API to significantly reduce debugger overhead compared to the legacy sys.settrace() approach. This ensures that your debugging sessions are faster and less intrusive. Furthermore, the migration introduces comprehensive asyncio support. You can now use the full suite of debugger tools, such as the debug console and expression evaluation, directly within async contexts for modern frameworks like FastAPI and aiohttp.

Reliability across environments

Beyond performance improvements, debugpy simplifies the Attach to Process experience by providing a standardized approach for Docker containers, remote servers on AWS, Azure, or GCP, and local running processes. For specialized workflows, we have introduced a new Attach to DAP run configuration. This allows you to connect to targets using the debugpy.listen() command, eliminating the friction of manual connection management and allowing you to focus on your code instead of debugging infrastructure.

Support for uv as a remote interpreter

Many developers work on projects where the code and dependencies live on a remote server – whether via SSH, in WSL, or inside Docker. By connecting PyCharm to a remote machine and using uv as the interpreter, you can keep the environment fully synchronized, ensure package management works as expected, and run projects smoothly – just as if everything were local.

Free professional web development for everyone

With PyCharm 2026.1, the core IDE experience continues to evolve as we bring a broader set of professional-grade web tools to all users for free. Everyone, from beginners to backend-first developers, now has access to a substantial set of JavaScript, TypeScript, and CSS features, as well as advanced navigation and code intelligence previously available only with a Pro subscription.

For a complete breakdown of all new features, check out this blog post.

Advancements in AI integration

PyCharm is evolving into an open platform that gives you the freedom to bring the AI tools of your choice directly into your professional development workflow. This release focuses on providing a flexible ecosystem where you can orchestrate the best models and agents available today.

The ACP Registry: Your gateway to new agents

Keeping up with the rapid pace of AI development can be a challenge, with new coding agents appearing almost daily. To help you navigate this dynamic landscape, we’ve launched the ACP Registry – a built-in directory of AI coding agents integrated directly into your IDE via the Agent Client Protocol.

Whether you want to experiment with open-source agents like OpenCode or specialized tools like Gemini CLI, you can now discover and install them in just a few clicks. If you have a custom setup or an agent that isn’t listed yet, you can easily add it via the acp.json configuration, giving you the flexibility to use your favorite tools, with no strings attached.

Native OpenAI Codex integration and BYOK

OpenAI Codex is now natively integrated into the JetBrains AI chat. This means you can tackle complex development tasks without switching to a browser or copy-pasting code between windows.

We’ve also introduced Bring Your Own Key (BYOK) support. You can now connect your own API keys from OpenAI, Anthropic, or other compatible providers – including local models – directly in the IDE settings. This allows you to choose the setup that fits your workflow and budget best, while keeping all your AI-powered development inside PyCharm.

Stay in the flow with next edit suggestions

Small changes in your code often trigger a cascade of mechanical follow-up edits. Adding a parameter to a function or renaming a symbol can lead to errors popping up across your entire file.

Next edit suggestions (NES) offer a smarter, lightweight alternative to asking an AI agent for a full rewrite. As you modify your code, PyCharm proactively predicts the most likely next changes and suggests them inline.

Effortless consistency: Update all call sites across a file with a simple Tab Tab experience.
Stay in control: Move step by step through changes rather than reviewing large, automated diffs.
No quota required: Use NES without consuming AI credits – available without consuming the AI quota of your JetBrains AI Pro subscription.

This natural evolution of code completion keeps you in the flow, making those small cascading fixes feel almost effortless.

All of the updates mentioned above are just a glimpse of what’s new in PyCharm 2026.1.

There is even more under the hood, including performance improvements, stability upgrades, and thoughtful refinements across the IDE that make everyday development smoother and faster.

To explore the full list of updates, check out our What’s New page.

As always, we would love to hear your feedback. Your insights help us shape the future of PyCharm – and we cannot wait to see what you build next.

Expanding Our Core Web Development Support in PyCharm 2026.1

Ilia Afanasiev — Wed, 25 Mar 2026 15:01:54 +0000

With PyCharm 2026.1, our core IDE experience continues to evolve as we’re bringing a broader set of professional-grade web tools to all users for free. Everyone, from beginners to backend-first developers, is getting access to a substantial set of JavaScript, TypeScript, and CSS features that were previously only available with a Pro subscription.

React, JavaScript, TypeScript, and CSS support

Leverage a comprehensive set of editing and formatting tools for modern web languages within PyCharm, including:

Basic React support with code completion, component and attribute navigation, and React component and prop rename refactorings.
Advanced import management:
- Enjoy automatic JavaScript and TypeScript imports as you work.
- Merge or remove unnecessary references via the Optimize imports feature.
- Get required imports automatically when you paste code into the editor.

Enhanced styling: Access CSS-tailored code completion, inspections, and quick-fixes, and view any changes in real time via the built-in web preview.
Smart editor behavior: Utilize smart keys, code vision inlay hints, and postfix code completions designed for web development.

Navigation and code intelligence

Finding your way around web projects is now even more efficient with tools that allow for:

Pro-grade navigation: Use dedicated gutter icons for Jump to… actions, recursive calls, and TypeScript source mapping.
Core web refactorings: Perform essential code changes with reliable Rename refactorings and actions (Introduce variable, Change signature, Move members, and more).
Quality control: Maintain high code standards with professional-grade inspections, intentions, and quick-fixes.
Code cleanup: Identify redundant code blocks through JavaScript and TypeScript duplicate detection.

Frameworks and integrated tools

With the added essential support for some of the most popular frontend frameworks and tools, you will have access to:

Project initialization: Create new web projects quickly using the built-in Vite generator.
Standard tooling: Standardize code quality with integrated support for Prettier, ESLint, TSLint, and StyleLint.
Script management: Discover and execute NPM scripts directly from your package.json.
Security: Check project dependencies for security vulnerabilities.

We’re excited to bring these tried and true features to the core PyCharm experience for free! We’re certain these tools will help beginners, students, and hobbyists tackle real-world tasks within a single, powerful IDE. Best of all, core PyCharm can be used for both commercial and non-commercial projects, so it will grow with you as you move from learning to professional development.