<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: PyCharm</title>
    <description>The latest articles on DEV Community by PyCharm (@pycharm).</description>
    <link>https://dev.to/pycharm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F10300%2F3ec3046c-353a-4634-8d18-8637962a97df.png</url>
      <title>DEV Community: PyCharm</title>
      <link>https://dev.to/pycharm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pycharm"/>
    <language>en</language>
    <item>
      <title>PyCharm, the Only Python IDE You Need</title>
      <dc:creator>Evgenia Verbina</dc:creator>
      <pubDate>Wed, 16 Apr 2025 12:11:26 +0000</pubDate>
      <link>https://dev.to/pycharm/pycharm-the-only-python-ide-you-need-45gj</link>
      <guid>https://dev.to/pycharm/pycharm-the-only-python-ide-you-need-45gj</guid>
      <description>&lt;p&gt;&lt;em&gt;Estimated reading time: 3 minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0tay6grff42s1mrtrkci.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0tay6grff42s1mrtrkci.png" alt="One PyCharm for Everyone" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/" rel="noopener noreferrer"&gt;PyCharm&lt;/a&gt; is now one powerful, unified product! Its core functionality, including Jupyter Notebook support, will be free, and a Pro subscription will be available with additional features. Starting with the 2025.1 release, every user will get instant access to a free one-month Pro trial, so you’ll be able to access all of PyCharm’s advanced features right away. After the trial, you can choose whether to continue with a Pro subscription or keep using the core features for free.&lt;/p&gt;

&lt;p&gt;Previously, PyCharm was offered as two separate products: the free Community Edition and the Professional Edition with extended capabilities. Now, with a single streamlined product, you no longer need to choose. Everything is in one place, and you can seamlessly switch between core and advanced features within the same installation whenever you need to.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;💡 What’s new?&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;✅ One product for all developers&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You no longer need to worry about additional downloads or switching between editions. PyCharm is now a single product. Start with a month of full Pro access for free, and then keep using the core features at no cost. Upgrade to Pro anytime within the same installation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;🎓 Free Jupyter Notebook support&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;PyCharm now offers free Jupyter support, including running, debugging, output rendering, and intelligent code assistance in notebooks. It’s perfect for data workflows, no Pro subscription required. However, a Pro subscription does offer more advanced capabilities, including remote notebooks, dynamic tables, SQL cells, and others.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;🚀 Seamless access to Pro&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With every new major PyCharm release (currently three times a year), you will get instant access to a free one-month Pro trial. Once it ends, you can continue using the core features for free.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;🛠️ One product, better quality&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Focusing on a single PyCharm product will help us improve overall quality, streamline updates, and deliver new features faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does it mean for me?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🐍 I’m a PyCharm Community Edition user
&lt;/h3&gt;

&lt;p&gt;First of all, &lt;strong&gt;thank you&lt;/strong&gt; for being part of our amazing community! Your feedback, passion, and contributions have helped shape PyCharm into the tool it is today.&lt;/p&gt;

&lt;p&gt;Nothing is going to change for you right away – you can upgrade to PyCharm Community 2025.1 as usual. Alternatively, you can switch to the new PyCharm manually right away and keep using everything you have now for free, plus Jupyter Notebook support.&lt;/p&gt;

&lt;p&gt;Starting with PyCharm 2025.2, we’ll offer a smooth migration path that preserves your current setup and preferences. PyCharm Community 2025.2 will be the final standalone version, and, from 2025.3 onward, all Community Edition users will transition to the unified PyCharm experience.&lt;/p&gt;

&lt;p&gt;Rest assured – our commitment to open-source development remains as strong as ever. The Community Edition codebase will stay public on GitHub, and we’ll continue to maintain and update it. We’ll also provide an easy way to build PyCharm from source via GitHub Actions.&lt;/p&gt;

&lt;p&gt;Have more questions about what’s next? Read &lt;a href="https://www.jetbrains.com/pycharm/download#faq" rel="noopener noreferrer"&gt;our extended FAQ&lt;/a&gt; for more details.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;👥 I’m a PyCharm Professional Edition user&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Nothing changes! Your license will automatically work with the new single PyCharm product. Simply &lt;a href="https://www.jetbrains.com/pycharm/download/" rel="noopener noreferrer"&gt;upgrade to PyCharm 2025.1&lt;/a&gt; and continue enjoying everything Pro has to offer.&lt;/p&gt;

&lt;h3&gt;
  
  
&lt;strong&gt;🆕 I’m new to PyCharm&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You can start right away with the new single PyCharm product. You’ll get a free one-month Pro trial with full functionality. After that, you can purchase a Pro subscription and keep using PyCharm with its full capabilities, or you can continue using just the core features – including Jupyter Notebook support – for free. &lt;a href="https://www.jetbrains.com/pycharm/download/" rel="noopener noreferrer"&gt;Download PyCharm&lt;/a&gt; now.&lt;/p&gt;

</description>
      <category>news</category>
      <category>releases</category>
    </item>
    <item>
      <title>Which Is the Best Python Web Framework: Django, Flask, or FastAPI?</title>
      <dc:creator>Evgenia Verbina</dc:creator>
      <pubDate>Tue, 18 Feb 2025 10:00:17 +0000</pubDate>
      <link>https://dev.to/pycharm/which-is-the-best-python-web-framework-django-flask-or-fastapi-5el4</link>
      <guid>https://dev.to/pycharm/which-is-the-best-python-web-framework-django-flask-or-fastapi-5el4</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllwzo0jjw3uica9qikgy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllwzo0jjw3uica9qikgy.png" alt="Which Is the best Python web framework: Django, Flask, or FastAPI?" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Search for Python web frameworks, and three names will consistently come up: &lt;a href="https://www.jetbrains.com/pycharm/web-development/django/" rel="noopener noreferrer"&gt;Django&lt;/a&gt;, Flask, and FastAPI. Our latest &lt;a href="https://lp.jetbrains.com/python-developers-survey-2023/" rel="noopener noreferrer"&gt;Python Developer Survey Results&lt;/a&gt; confirm that these three frameworks remain developers’ top choices for backend web development with Python.&lt;/p&gt;

&lt;p&gt;All three frameworks are open-source and compatible with the latest versions of Python. &lt;/p&gt;

&lt;p&gt;But how do you determine which web framework is best for your project? Here, we’ll look at the pros and cons of each and compare how they stack up against one another.&lt;/p&gt;

&lt;h2&gt;
  
  
  Django
&lt;/h2&gt;

&lt;p&gt;Django is a “batteries included”, full-stack web framework used by the likes of Instagram, Spotify, and Dropbox. Pitched as “the web framework for perfectionists with deadlines”, the &lt;a href="https://blog.jetbrains.com/pycharm/2023/11/what-is-the-django-web-framework/" rel="noopener noreferrer"&gt;Django framework&lt;/a&gt; was designed to make it easier and quicker to build robust web apps.&lt;/p&gt;

&lt;p&gt;First made available as an open-source project in 2005, Django is a mature project that remains in active development 20 years later. It’s suitable for many web applications, including social media, e-commerce, news, and entertainment sites.&lt;/p&gt;

&lt;p&gt;Django follows a model-view-template (MVT) architecture, where each component has a specific role. Models are responsible for handling the data and defining its structure. The views manage the business logic, processing requests and fetching the necessary data from the models. Finally, templates present this data to the end user – similar to views in a model-view-controller (MVC) architecture. &lt;/p&gt;
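
&lt;p&gt;To make that division of labor concrete, here’s a minimal sketch of the three layers working together (the &lt;code&gt;Article&lt;/code&gt; model and template path are illustrative, not from any particular project):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# models.py – the data layer
from django.db import models

class Article(models.Model):
    title = models.CharField(max_length=200)
    body = models.TextField()

# views.py – the business logic
from django.shortcuts import render
from .models import Article

def article_list(request):
    articles = Article.objects.all()  # fetch data via the ORM
    # The template renders the data; the view never builds HTML itself.
    return render(request, "articles/article_list.html", {"articles": articles})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;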

&lt;p&gt;As a full-stack web framework, Django can be used to build an entire web app (from database to HTML and JavaScript frontend).&lt;/p&gt;

&lt;p&gt;Alternatively, you can use the &lt;a href="https://blog.jetbrains.com/pycharm/2023/11/what-is-the-django-web-framework/" rel="noopener noreferrer"&gt;Django REST Framework&lt;/a&gt; to combine Django with a frontend framework (such as React) to build both mobile and browser-based apps.&lt;/p&gt;

&lt;p&gt;Explore our comprehensive &lt;a href="https://blog.jetbrains.com/pycharm/2024/01/how-to-learn-django/" rel="noopener noreferrer"&gt;Django guide&lt;/a&gt;, featuring an overview of prerequisite knowledge, a structured learning path, and additional resources to help you master the framework. &lt;/p&gt;

&lt;h3&gt;
  
  
  Django advantages
&lt;/h3&gt;

&lt;p&gt;There are plenty of reasons why Django remains one of the most widely used Python web frameworks, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extensive functionality:&lt;/strong&gt; With a “batteries included” approach, Django offers built-in features like authentication, caching, data validation, and session management. Its &lt;a href="https://docs.djangoproject.com/en/dev/misc/design-philosophies/#don-t-repeat-yourself-dry" rel="noopener noreferrer"&gt;don’t repeat yourself (DRY)&lt;/a&gt; principle speeds up development and reduces bugs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ease of setup:&lt;/strong&gt; Django simplifies dependency management with its built-in features, reducing the need for external packages. This helps streamline the initial setup and minimizes compatibility issues, so you can get up and running sooner.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database support:&lt;/strong&gt; Django’s ORM (object-relational mapping) makes data handling more straightforward, enabling you to work with databases like SQLite, MySQL, and PostgreSQL without needing SQL knowledge. However, it’s less suitable for non-relational databases like MongoDB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security:&lt;/strong&gt; Built-in defenses against common vulnerabilities such as cross-site scripting (XSS), SQL injection, and clickjacking help quickly secure your app from the start.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; Despite being monolithic, Django allows for horizontal scaling of the application’s architecture (business logic and templates), caching to ease database load, and asynchronous processing to improve efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community and documentation:&lt;/strong&gt; Django has a vast, active community and detailed &lt;a href="https://docs.djangoproject.com/en/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;, with tutorials and support readily available.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Django disadvantages
&lt;/h3&gt;

&lt;p&gt;Despite its many advantages, there are a few reasons you might want to look at options other than Django when developing your next web app.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Heavyweight:&lt;/strong&gt; Its “batteries included” design can be too much for smaller apps, where a lightweight framework like Flask may be more appropriate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning curve:&lt;/strong&gt; Django’s extensive features naturally come with a steeper learning curve, though there are plenty of resources available to help new developers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; Django is generally slower compared to other frameworks like Flask and FastAPI, but built-in caching and &lt;a href="https://www.youtube.com/watch?v=lkkxTceQft8" rel="noopener noreferrer"&gt;asynchronous processing&lt;/a&gt; can help improve the response times.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Flask
&lt;/h2&gt;

&lt;p&gt;Flask is a Python-based micro-framework for backend web development. However, don’t let the term “micro” deceive you. As we’ll see, Flask isn’t only limited to smaller web apps. &lt;/p&gt;

&lt;p&gt;Instead, Flask is designed with a simple core based on &lt;a href="https://palletsprojects.com/p/werkzeug/" rel="noopener noreferrer"&gt;Werkzeug WSGI&lt;/a&gt; (Web Server Gateway Interface) and &lt;a href="https://palletsprojects.com/p/jinja/" rel="noopener noreferrer"&gt;Jinja2 templates&lt;/a&gt;. Well-known users of Flask include Netflix, Airbnb, and Reddit.&lt;/p&gt;

&lt;p&gt;Flask was initially created as an April Fools’ Day joke and released as an open-source project in 2010, a few years after Django. The micro-framework’s approach is fundamentally different from Django’s. While Django takes a “batteries included” style and comes with a lot of the functionality you may need for building web apps, Flask is much leaner.&lt;/p&gt;

&lt;p&gt;The philosophy behind the micro-framework is that everyone has their preferences, so developers should be free to choose their own components. For this reason, Flask doesn’t include a database, ORM (object-relational mapper), or ODM (object-document mapper). &lt;/p&gt;

&lt;p&gt;When you build a web app with Flask, very little is decided for you upfront. This can have significant benefits, as we’ll discuss below.&lt;/p&gt;
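
&lt;p&gt;That minimalism is easiest to appreciate in code – a complete, runnable Flask application fits in a handful of lines (a standard hello-world sketch, not tied to any particular project):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# app.py – a complete Flask application
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "Hello, Flask!"

if __name__ == "__main__":
    app.run(debug=True)  # development server only
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;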

&lt;h3&gt;
  
  
  Flask advantages
&lt;/h3&gt;

&lt;p&gt;We’ve seen usage of Flask grow steadily over the last five years through &lt;a href="https://www.jetbrains.com/lp/devecosystem-2023/" rel="noopener noreferrer"&gt;our State of the Developer Ecosystem survey&lt;/a&gt; – it overtook Django for the first time in 2021. &lt;/p&gt;

&lt;p&gt;Some reasons for choosing Flask as a backend web framework include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight design:&lt;/strong&gt; Flask’s minimalist approach offers a flexible alternative to Django, making it ideal for smaller applications or projects where Django’s extensive features may feel excessive. However, Flask isn’t limited to small projects – you can extend it as needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility:&lt;/strong&gt; Flask allows you to choose the libraries and frameworks for core functionality, such as data handling and user authentication. This enables you to select the best tools for your project and extend it in unforeseen ways.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; Flask’s modular design makes it easy to scale horizontally. Using a NoSQL database layer can further enhance scalability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shallow learning curve:&lt;/strong&gt; Its simple design makes Flask easy to learn, though you may need to explore extensions for more complex apps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community and documentation:&lt;/strong&gt; Flask has extensive (if somewhat technical) &lt;a href="https://flask.palletsprojects.com/en/3.0.x/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; and a clear codebase. While its community is smaller than Django’s, Flask remains active and is growing steadily.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Flask disadvantages
&lt;/h3&gt;

&lt;p&gt;While Flask has a lot to offer, there are a few things to consider before you decide to use it in your next web development project.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bring your own everything:&lt;/strong&gt; Flask’s micro-framework design and flexibility require you to handle much of that core functionality, including data validation, session management, and caching. While this flexibility can be beneficial, it can also slow the development process, as you’ll need to find existing libraries or build features from scratch. Additionally, dependencies must be managed over time to ensure they remain compatible with Flask. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security:&lt;/strong&gt; Flask has minimal built-in security. Beyond securing client-side cookies, you must implement web security best practices and ensure the security of the dependencies you include, applying updates as needed. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; While Flask performs slightly better than Django, it lags behind FastAPI. Flask offers some &lt;a href="https://flask.palletsprojects.com/en/stable/deploying/asgi/" rel="noopener noreferrer"&gt;ASGI support&lt;/a&gt; (the standard used by FastAPI), but it is more tightly tied to WSGI.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FastAPI
&lt;/h2&gt;

&lt;p&gt;As the name suggests, FastAPI is a micro-framework for building high-performance web APIs with Python. Despite being relatively new – it was first released as an open-source project in 2018 – FastAPI has quickly become popular among developers, ranking third in our list of the most popular Python web frameworks since 2021.&lt;/p&gt;

&lt;p&gt;FastAPI is based on &lt;a href="https://www.uvicorn.org/" rel="noopener noreferrer"&gt;Uvicorn&lt;/a&gt;, an ASGI (Asynchronous Server Gateway Interface) server, and &lt;a href="https://www.starlette.io/" rel="noopener noreferrer"&gt;Starlette&lt;/a&gt;, a web micro-framework. FastAPI adds data validation, serialization, and documentation to streamline building web APIs.&lt;/p&gt;

&lt;p&gt;When developing FastAPI, the micro-framework’s creator drew on experience working with many different frameworks and tools. Whereas Django was developed before frontend JavaScript web frameworks (such as React or Vue.js) were prominent, FastAPI was designed with this context in mind. &lt;/p&gt;

&lt;p&gt;The emergence of &lt;a href="https://www.openapis.org/" rel="noopener noreferrer"&gt;OpenAPI&lt;/a&gt; (formerly Swagger) as a format for structuring and documenting APIs in the preceding years provided an industry standard that FastAPI could leverage.&lt;/p&gt;

&lt;p&gt;Beyond the obvious use case of creating RESTful APIs, FastAPI is ideal for applications that require real-time responses, such as messaging platforms and dashboards. Its high performance and asynchronous capabilities make it a good fit for data-intensive apps, including machine learning models, data processing, and analytics.&lt;/p&gt;
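
&lt;p&gt;As a rough sketch of what this looks like in practice (the &lt;code&gt;Item&lt;/code&gt; model and route are illustrative), a minimal FastAPI service might be:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# main.py – a minimal FastAPI service
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Item(BaseModel):
    name: str
    price: float

@app.post("/items/")
async def create_item(item: Item):
    # The JSON body is validated against Item before this function runs;
    # interactive OpenAPI docs are generated automatically at /docs.
    return {"name": item.name, "price": item.price}

# Run with: uvicorn main:app --reload
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;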

&lt;h3&gt;
  
  
  FastAPI advantages
&lt;/h3&gt;

&lt;p&gt;FastAPI first received its own category in &lt;a href="https://www.jetbrains.com/lp/devecosystem-2021/" rel="noopener noreferrer"&gt;our State of the Developer Ecosystem survey&lt;/a&gt; in 2021, with 14% of respondents using the micro-framework. &lt;/p&gt;

&lt;p&gt;Since then, usage has increased to 20%, alongside a slight dip in the use of Flask and Django. &lt;/p&gt;

&lt;p&gt;These are some of the reasons why developers are choosing FastAPI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; Designed for speed, FastAPI supports asynchronous processing and bi-directional web sockets (courtesy of Starlette). It outperformed both Django and Flask in benchmark tests, making it ideal for high-traffic applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; Like Flask, FastAPI is highly modular, making it easy to scale and ideal for containerized deployments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adherence to industry standards:&lt;/strong&gt; FastAPI is fully compatible with &lt;a href="https://oauth.net/2/" rel="noopener noreferrer"&gt;OAuth 2.0&lt;/a&gt;, OpenAPI (formerly Swagger), and JSON Schema. As a result, you can implement secure authentication and generate your API documentation with minimal effort.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ease of use:&lt;/strong&gt; FastAPI’s use of &lt;a href="https://pydantic.dev/" rel="noopener noreferrer"&gt;Pydantic&lt;/a&gt; for type hints and validation speeds up development by providing type checks, auto-completion, and request validation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation:&lt;/strong&gt; FastAPI comes with a sizable body of documentation and growing third-party resources, making it accessible for developers at all levels.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  FastAPI disadvantages
&lt;/h3&gt;

&lt;p&gt;Before deciding that FastAPI is the best framework for your next project, bear in mind the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Maturity:&lt;/strong&gt; FastAPI, being newer, lacks the maturity of Django or Flask. Its community is smaller, and the user experience may be less streamlined due to less extensive use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compatibility:&lt;/strong&gt; As a micro-framework, FastAPI requires additional functionality for fully featured apps. There are fewer compatible libraries compared to Django or Flask, which may require you to develop your own extensions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Choosing between Flask, Django, and FastAPI
&lt;/h2&gt;

&lt;p&gt;So, which is the best Python web framework? As with many things in programming, the answer is “it depends”.&lt;/p&gt;

&lt;p&gt;The right choice hinges on answering a few questions: What kind of app are you building? What are your priorities? How do you expect your project to grow in the future?&lt;/p&gt;

&lt;p&gt;All three popular Python web frameworks come with unique strengths, so assessing them in the context of your application will help you make the best decision. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Django&lt;/strong&gt; is a great option if you need standard web app functionality out of the box, making it suitable for projects that require a more robust structure. It’s particularly advantageous if you’re using a relational database, as its ORM simplifies data management and provides built-in security features. However, the extensive functionality may feel overwhelming for smaller projects or simple applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flask&lt;/strong&gt;, on the other hand, offers greater flexibility. Its minimalist design enables developers to pick and choose the extensions and libraries they want, making it suitable for projects where you need to customize features. This approach works well for startups or MVPs, where your requirements might change and evolve rapidly. While Flask is easy to get started with, keep in mind that building more intricate applications will mean exploring various extensions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FastAPI&lt;/strong&gt; is a strong contender when speed is of the essence, especially for API-first or &lt;a href="https://blog.jetbrains.com/pycharm/2024/09/how-to-use-fastapi-for-machine-learning/" rel="noopener noreferrer"&gt;machine learning&lt;/a&gt; projects. It uses modern Python features like type hints to provide automatic data validation and documentation. FastAPI is an excellent choice for applications that need high performance, like microservices or data-driven APIs. Despite this, it may not be as feature-rich as Django or Flask in terms of built-in functionality, which means you might need to implement additional features manually. &lt;/p&gt;

&lt;p&gt;For a deeper comparison between Django and the different web frameworks, check out our other guides, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.jetbrains.com/pycharm/2023/11/django-vs-flask-which-is-the-best-python-web-framework/" rel="noopener noreferrer"&gt;Django vs. Flask&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.jetbrains.com/pycharm/2023/12/django-vs-fastapi-which-is-the-best-python-web-framework/" rel="noopener noreferrer"&gt;Django vs. FastAPI &lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.jetbrains.com/pycharm/2024/06/the-state-of-django/" rel="noopener noreferrer"&gt;The State of Django 2024&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.jetbrains.com/pycharm/2023/11/what-is-the-django-web-framework/" rel="noopener noreferrer"&gt;What is the Django Web Framework?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.jetbrains.com/pycharm/2024/01/how-to-learn-django/" rel="noopener noreferrer"&gt;How to Learn Django&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.jetbrains.com/pycharm/2025/01/django-views/" rel="noopener noreferrer"&gt;An Introduction to Django Views&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.jetbrains.com/pycharm/2025/02/the-ultimate-guide-to-django-templates/" rel="noopener noreferrer"&gt;The Ultimate Guide to Django Templates&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.jetbrains.com/pycharm/2024/09/django-project-ideas/" rel="noopener noreferrer"&gt;Django Project Ideas&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Start your web development project with PyCharm
&lt;/h2&gt;

&lt;p&gt;Regardless of your primary framework, you can access all the essential web development tools in a single IDE. &lt;a href="https://www.jetbrains.com/pycharm/web-development/" rel="noopener noreferrer"&gt;PyCharm&lt;/a&gt; provides built-in support for Django, FastAPI, and Flask, while also offering top-notch integration with frontend frameworks like React, Angular, and Vue.js.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/web-development/" rel="noopener noreferrer"&gt;Start with PyCharm for free&lt;/a&gt;&lt;/p&gt;

</description>
      <category>django</category>
      <category>fastapi</category>
      <category>flask</category>
      <category>webdev</category>
    </item>
    <item>
      <title>The Ultimate Guide to Django Templates</title>
      <dc:creator>Evgenia Verbina</dc:creator>
      <pubDate>Wed, 05 Feb 2025 10:37:47 +0000</pubDate>
      <link>https://dev.to/pycharm/the-ultimate-guide-to-django-templates-21cf</link>
      <guid>https://dev.to/pycharm/the-ultimate-guide-to-django-templates-21cf</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetz4bwaymq74bnb6uzip.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetz4bwaymq74bnb6uzip.png" alt="The Ultimate Guide to Django Templates" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Django templates are a crucial part of the framework. Understanding what they are and why they’re useful can help you build seamless, adaptable, and functional templates for your &lt;a href="https://blog.jetbrains.com/pycharm/2023/11/what-is-the-django-web-framework/" rel="noopener noreferrer"&gt;Django&lt;/a&gt; sites and apps.&lt;/p&gt;

&lt;p&gt;If you’re new to the framework and looking to set up your first &lt;a href="https://www.jetbrains.com/help/pycharm/creating-and-running-your-first-django-project.html" rel="noopener noreferrer"&gt;Django project&lt;/a&gt;, grasping templates is vital. In this guide, you’ll find everything you need to know about Django templates, including the different types and how to use them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Django templates?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/help/pycharm/templates.html" rel="noopener noreferrer"&gt;Django templates&lt;/a&gt; are a fundamental part of the Django framework. They allow you to separate the visual presentation of your site from the underlying code. A template contains the static parts of the desired HTML output and special syntax describing how dynamic content will be inserted. &lt;/p&gt;

&lt;p&gt;Ultimately, templates can generate complete web pages, while database queries and other data processing tasks are handled by &lt;a href="https://blog.jetbrains.com/pycharm/2025/01/django-views/" rel="noopener noreferrer"&gt;views&lt;/a&gt; and &lt;a href="https://docs.djangoproject.com/en/5.1/topics/db/models/" rel="noopener noreferrer"&gt;models&lt;/a&gt;. This separation ensures clean, maintainable code by keeping the HTML presentation separate from the business logic in the Python code of your Django project. Without templates, you’d need to embed HTML directly into your Python code, making it hard to read and debug.&lt;/p&gt;

&lt;p&gt;Here is an example of a Django template containing some HTML, a variable &lt;code&gt;name&lt;/code&gt;, and basic &lt;code&gt;if/else&lt;/code&gt; logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;h1&amp;gt;Welcome!&amp;lt;/h1&amp;gt;

{% if name %}
  &amp;lt;h1&amp;gt;Hello, {{ name }}!&amp;lt;/h1&amp;gt;
{% else %}
  &amp;lt;h1&amp;gt;Hello, Guest!&amp;lt;/h1&amp;gt;
{% endif %}
&amp;lt;h1&amp;gt;{{ heading }}&amp;lt;/h1&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Benefits of using templates
&lt;/h3&gt;

&lt;p&gt;Developers use Django templates to help them build reliable apps quickly and efficiently. Other key benefits of templates include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code reusability:&lt;/strong&gt; Reusable components and layouts can be created for consistency across pages and apps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easier maintenance:&lt;/strong&gt; The appearance of web pages may be modified without altering the underlying logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved readability:&lt;/strong&gt; HTML code can be kept clean and understandable without the need for complex logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Template inheritance:&lt;/strong&gt; Common structures and layouts may be defined to reduce duplication and promote consistency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic content:&lt;/strong&gt; It’s possible to build personalized web pages that adapt to user inputs and data variations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance optimization:&lt;/strong&gt; Templates can be cached to improve app or website performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Challenges and limitations
&lt;/h3&gt;

&lt;p&gt;While templates are essential for rendering web pages in Django, they should be used thoughtfully, especially in complex projects with bigger datasets. Despite the relative simplicity of Django’s template language, overly complex templates with numerous nested tags, filters, and inheritance can become difficult to manage and maintain. Instead of embedding too much logic into your templates, aim to keep them focused on presentation. Customization options are also limited unless you create your own custom tags or filters.&lt;/p&gt;

&lt;p&gt;Migrating to a different template engine can be challenging, as Django’s default engine is closely tied to the &lt;a href="https://blog.jetbrains.com/pycharm/2023/11/what-is-the-django-web-framework/" rel="noopener noreferrer"&gt;framework&lt;/a&gt;. That said, Django ships with built-in support for Jinja, which makes switching to it relatively straightforward – we will discuss this possibility later in this guide.&lt;/p&gt;

&lt;h3&gt;
  
  
  Debugging Django templates
&lt;/h3&gt;

&lt;p&gt;In some situations (such as when issues arise), it can be useful to see how your template works. For this, you can use template debugging.&lt;/p&gt;

&lt;p&gt;Template debugging focuses on identifying errors in how your HTML and dynamic data interact. Common problems include missing variables, incorrect template tags, and logic errors.&lt;/p&gt;

&lt;p&gt;Luckily, Django provides helpful tools like &lt;code&gt;{{ debug }}&lt;/code&gt; for inspecting your templates and detailed error pages that highlight where the problem lies. This makes it easier to pinpoint and resolve issues, ensuring your templates render as expected.&lt;/p&gt;
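
&lt;p&gt;Note that the &lt;code&gt;{{ debug }}&lt;/code&gt; variable is only populated when &lt;code&gt;DEBUG&lt;/code&gt; is on, the debug context processor is enabled, and the request comes from an address in &lt;code&gt;INTERNAL_IPS&lt;/code&gt; – roughly the following settings sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# settings.py (sketch) – prerequisites for {{ debug }} and {{ sql_queries }}
DEBUG = True
INTERNAL_IPS = ["127.0.0.1"]

TEMPLATES = [{
    "BACKEND": "django.template.backends.django.DjangoTemplates",
    "APP_DIRS": True,
    "OPTIONS": {
        "context_processors": [
            "django.template.context_processors.debug",
            "django.template.context_processors.request",
        ],
    },
}]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;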

&lt;h2&gt;
  
  
  Understanding the Django Template Language (DTL)
&lt;/h2&gt;

&lt;p&gt;The Django Template Language (DTL) is Django’s built-in templating engine, designed to simplify the creation of dynamic web pages. It seamlessly blends HTML with Django-specific tags and filters, allowing you to generate rich, data-driven content directly from your &lt;a href="https://blog.jetbrains.com/pycharm/2023/04/create-a-django-app-in-pycharm/" rel="noopener noreferrer"&gt;Django app&lt;/a&gt;. Let’s explore some of the key features that make DTL a powerful tool for building templates.&lt;/p&gt;

&lt;h3&gt;
  
  
  DTL basic syntax and structure
&lt;/h3&gt;

&lt;p&gt;Django templates are written with a combination of HTML and DTL syntax. The basic structure of a Django template consists of HTML markup with embedded Django tags and variables.&lt;/p&gt;

&lt;p&gt;Here’s an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!DOCTYPE html&amp;gt;
&amp;lt;html&amp;gt;
  &amp;lt;head&amp;gt;
    &amp;lt;title&amp;gt;{{ page_title }}&amp;lt;/title&amp;gt;
  &amp;lt;/head&amp;gt;
  &amp;lt;body&amp;gt;
    &amp;lt;h1&amp;gt;{{ heading }}&amp;lt;/h1&amp;gt;
    &amp;lt;ul&amp;gt;
      {% for item in item_list %}
        &amp;lt;li&amp;gt;{{ item.name }}&amp;lt;/li&amp;gt;
      {% endfor %}
    &amp;lt;/ul&amp;gt;
  &amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Variables, filters, and tags
&lt;/h3&gt;

&lt;p&gt;The DTL has several features for working with variables, filters, and tags – all three appear together in the short snippet after this list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Variables:&lt;/strong&gt; Variables display dynamic data in your templates. They are enclosed in double curly brackets, e.g. &lt;code&gt;{{ variable_name }}&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filters:&lt;/strong&gt; Filters modify or format the value of a variable before rendering it. They are applied using a pipe character (&lt;code&gt;|&lt;/code&gt;), e.g. &lt;code&gt;{{ variable_name|upper }}&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tags:&lt;/strong&gt; Tags control the logic and flow of your templates. They are enclosed in &lt;code&gt;{% %}&lt;/code&gt; blocks and can perform various operations like loops, conditionals, and template inclusions.&lt;/li&gt;
&lt;/ul&gt;
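
&lt;p&gt;Here’s how all three look together in one short snippet, using the &lt;code&gt;user&lt;/code&gt; variable that Django’s auth context processor provides by default:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{% if user.is_authenticated %}
  Hello, {{ user.username|title }}!
{% else %}
  Hello, guest!
{% endif %}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;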

&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/web-development/django/" rel="noopener noreferrer"&gt;PyCharm&lt;/a&gt;, a professional IDE for Django development, simplifies working with Django templates by providing syntax highlighting, which color-codes tags, variables, and HTML for better readability. It also offers real-time error detection, ensuring you don’t miss closing tags or misplace syntax. With auto-completion for variables and tags, you can code faster and with fewer mistakes.&lt;/p&gt;



&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/web-development/django/" rel="noopener noreferrer"&gt;Start with PyCharm Pro for free&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Template inheritance and extending base templates
&lt;/h3&gt;

&lt;p&gt;The framework’s template inheritance system enables you to create a base template that contains the standard structure and the layout for your website or app.&lt;/p&gt;

&lt;p&gt;You can then create child templates that inherit from the base template and override specific blocks or sections as needed. This encourages code reuse and consistency across your different templates.&lt;/p&gt;

&lt;p&gt;To create a base template, you define blocks using the &lt;code&gt;{% block %}&lt;/code&gt; tag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!-- base.html --&amp;gt;
&amp;lt;!DOCTYPE html&amp;gt;
&amp;lt;html&amp;gt;
  &amp;lt;head&amp;gt;
    &amp;lt;title&amp;gt;{% block title %}Default Title{% endblock %}&amp;lt;/title&amp;gt;
  &amp;lt;/head&amp;gt;
  &amp;lt;body&amp;gt;
    {% block content %}
    {% endblock %}
  &amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Child templates then extend the base templates and override certain blocks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!-- child_template.html --&amp;gt;
{% extends 'base.html' %}

{% block title %}My Page Title{% endblock %}

{% block content %}
  &amp;lt;h1&amp;gt;My Page Content&amp;lt;/h1&amp;gt;
  &amp;lt;p&amp;gt;This is the content of my page.&amp;lt;/p&amp;gt;
{% endblock %}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Django template tags
&lt;/h2&gt;

&lt;p&gt;Tags are an essential element of Django templates. They provide various functionalities, from conditional rendering and looping to template inheritance and inclusion.&lt;/p&gt;

&lt;p&gt;Let’s explore them in more detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common Django template tags
&lt;/h3&gt;

&lt;p&gt;There are several template tags in Django, but these are the ones you’ll probably use most frequently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;{% if %}&lt;/code&gt;: This tag allows you to conditionally render content based on a specific condition. It’s often used with the &lt;code&gt;{% else %}&lt;/code&gt; and &lt;code&gt;{% elif %}&lt;/code&gt; tags.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;{% for %}&lt;/code&gt;: The &lt;code&gt;{% for %}&lt;/code&gt; tag is used to iterate over a sequence, such as a list or query set, and render content for each item in the sequence.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;{% include %}&lt;/code&gt;: This tag enables you to include the contents of another template file within the current template. It facilitates the reuse of common template snippets across multiple templates.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;{% block %}&lt;/code&gt;: The &lt;code&gt;{% block %}&lt;/code&gt; tag is used in conjunction with template inheritance. It defines a block of content that can be overridden by child templates when extending a base template.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;{% extends %}&lt;/code&gt;: This tag specifies the base template that the current template should inherit from.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;{% url %}&lt;/code&gt;: This tag is used to generate a URL for a named URL pattern in your Django project. It helps keep your templates decoupled from the actual URL paths.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;{% load %}&lt;/code&gt;: The &lt;code&gt;{% load %}&lt;/code&gt; tag is used to load custom template tags and filters from a Python module or library, enabling you to extend the functionality of the Django template system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are just some examples of the many template tags available in Django. Tags like &lt;code&gt;{% with %}&lt;/code&gt;, &lt;code&gt;{% cycle %}&lt;/code&gt;, &lt;code&gt;{% comment %}&lt;/code&gt;, and others can provide more functionality for advanced projects, helping you build customized and efficient apps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using template tags
&lt;/h3&gt;

&lt;p&gt;Here’s a detailed example of how you might use tags in a Django template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{% extends 'base.html' %}
{% load custom_filters %}

{% block content %}
  &amp;lt;h1&amp;gt;{{ page_title }}&amp;lt;/h1&amp;gt;
  {% if object_list %}
    &amp;lt;ul&amp;gt;
      {% for obj in object_list %}
&amp;lt;!-- Truncate the object name to 25 characters; the truncate filter
     is assumed to come from custom_filters (Django's built-in
     equivalent is truncatechars). --&amp;gt;
        &amp;lt;li&amp;gt;{{ obj.name|truncate:25 }}&amp;lt;/li&amp;gt;
      {% endfor %}
    &amp;lt;/ul&amp;gt;
  {% else %}
    &amp;lt;p&amp;gt;No objects found.&amp;lt;/p&amp;gt;
  {% endif %}

  {% include 'partials/pagination.html' %}
{% endblock %}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we extend a base template, load custom filters, and then define a block for the main content.&lt;/p&gt;

&lt;p&gt;Inside the block, we check whether an &lt;code&gt;object_list&lt;/code&gt; exists, and if so, we loop through it and display the truncated names of each object. We show a “No objects found” message if the list is empty.&lt;/p&gt;

&lt;p&gt;Finally, we include a partial template for pagination. This template is a reusable snippet of HTML that can be included in other templates, enabling you to manage and update common elements (like pagination) more efficiently.&lt;/p&gt;
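
&lt;p&gt;For illustration, such a partial might look like this, assuming the standard &lt;code&gt;page_obj&lt;/code&gt; that Django’s pagination (for example, via &lt;code&gt;ListView&lt;/code&gt;) places in the context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!-- partials/pagination.html (illustrative sketch) --&amp;gt;
{% if page_obj.has_previous %}
  &amp;lt;a href="?page={{ page_obj.previous_page_number }}"&amp;gt;Previous&amp;lt;/a&amp;gt;
{% endif %}
&amp;lt;span&amp;gt;Page {{ page_obj.number }} of {{ page_obj.paginator.num_pages }}&amp;lt;/span&amp;gt;
{% if page_obj.has_next %}
  &amp;lt;a href="?page={{ page_obj.next_page_number }}"&amp;gt;Next&amp;lt;/a&amp;gt;
{% endif %}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;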

&lt;h2&gt;
  
  
  Django admin templates
&lt;/h2&gt;

&lt;p&gt;Django’s built-in admin interface gives you a user-friendly and intuitive way to manage your application data. It’s powered by a set of templates defining its structure, layout, and appearance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Functionality
&lt;/h3&gt;

&lt;p&gt;The Django admin templates handle various tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Authentication:&lt;/strong&gt; Controls user authentication, login, and logout.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model management:&lt;/strong&gt; Displays lists of model instances and creates, edits, and deletes instances as needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Form rendering:&lt;/strong&gt; Renders forms for creating and editing model instances.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Navigation:&lt;/strong&gt; Renders the navigation structure of the admin interface, including the main menu and app-specific sub-menus.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pagination:&lt;/strong&gt; Renders pagination controls when displaying lists of model instances.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;History tracking:&lt;/strong&gt; Displays and manages the change history of model instances.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Django’s built-in admin templates provide a solid foundation for managing your application’s data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Customizing admin templates
&lt;/h3&gt;

&lt;p&gt;Although Django’s admin templates offer a good, functional interface out of the box, you may want to customize their appearance or behavior to suit your individual project’s needs.&lt;/p&gt;

&lt;p&gt;You can change things to match your project’s branding, improve the user experience, or add custom functionality unique to your app.&lt;/p&gt;

&lt;p&gt;There are several ways to do this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Override templates:&lt;/strong&gt; You can override default admin templates by creating templates with the same file structure and naming convention in your project’s templates directory. Django will then automatically use your custom templates instead of the built-in ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extend base templates:&lt;/strong&gt; Many of Django’s admin templates are built using template inheritance. You can create templates that extend the base admin templates and override specific blocks or sections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Template options:&lt;/strong&gt; Django has various template options that enable you to customize the admin interface’s behavior. This includes displaying certain fields, specifying which ones should be editable, and defining custom templates for specific model fields.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Admin site customization:&lt;/strong&gt; You can customize the admin site’s appearance and behavior by subclassing the &lt;code&gt;AdminSite&lt;/code&gt; class and registering your custom admin site with Django, as shown in the sketch after this list.&lt;/li&gt;
&lt;/ul&gt;
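
&lt;p&gt;As a quick sketch of the last approach (the class and site names here are made up):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# admin.py – subclassing AdminSite (illustrative names)
from django.contrib import admin

class ProjectAdminSite(admin.AdminSite):
    site_header = "My Project administration"
    site_title = "My Project admin"

admin_site = ProjectAdminSite(name="project_admin")

# Register models with the custom site instead of the default admin.site,
# and point your urls.py at admin_site.urls.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;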

&lt;h2&gt;
  
  
  URL templating in Django
&lt;/h2&gt;

&lt;p&gt;URL templates in Django offer a flexible way to define and generate URLs for web applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding URL templates
&lt;/h3&gt;

&lt;p&gt;In Django, you define URL patterns in the project’s &lt;code&gt;urls.py&lt;/code&gt; file using the &lt;code&gt;path&lt;/code&gt; function from the &lt;code&gt;django.urls&lt;/code&gt; module.&lt;/p&gt;

&lt;p&gt;These patterns map URLs to the Python functions (views) that handle the corresponding HTTP requests.&lt;/p&gt;

&lt;p&gt;Here’s an example of a basic URL pattern in Django:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# urls.py
from django.urls import path
from . import views

urlpatterns = [
    path('', views.home, name='home'),
    path('about/', views.about, name='about'),
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, the URL pattern &lt;code&gt;''&lt;/code&gt; (the empty string) maps to the &lt;code&gt;views.home&lt;/code&gt; view function, and the URL pattern &lt;code&gt;'about/'&lt;/code&gt; maps to the &lt;code&gt;views.about&lt;/code&gt; view function.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dynamic URL generation with URL templates
&lt;/h3&gt;

&lt;p&gt;URL templates in Django allow you to include variables or parameters in your URL patterns. This means you can create dynamic URLs that represent different instances of the same resource or include more data.&lt;/p&gt;

&lt;p&gt;If your &lt;code&gt;urls.py&lt;/code&gt; file includes other URL files using &lt;code&gt;include()&lt;/code&gt;, PyCharm automatically gathers and recognizes all nested routes, ensuring that URL name suggestions remain accurate. You can also navigate to URL definitions by &lt;em&gt;Ctrl+Click&lt;/em&gt;ing a URL name to jump directly to its source, even if the URL is defined in a child file.&lt;/p&gt;
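
&lt;p&gt;For instance, a project-level &lt;code&gt;urls.py&lt;/code&gt; might pull in an app’s routes like this (the &lt;code&gt;blog&lt;/code&gt; app name is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# project/urls.py
from django.urls import include, path

urlpatterns = [
    path('blog/', include('blog.urls')),  # routes defined in blog/urls.py
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;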

&lt;p&gt;Let’s look at an example of a URL template with a variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# urls.py
urlpatterns = [
    path('blog/&amp;lt;int:year&amp;gt;/', views.year_archive, name='blog_year_archive'),
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, the URL pattern &lt;code&gt;'blog/&amp;lt;int:year&amp;gt;/'&lt;/code&gt; includes a variable &lt;code&gt;year&lt;/code&gt; of type &lt;code&gt;int&lt;/code&gt;. When a request matches this pattern, Django will pass the value of &lt;code&gt;year&lt;/code&gt; as an argument to the &lt;code&gt;views.year_archive&lt;/code&gt; view function.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using Django URLs
&lt;/h3&gt;

&lt;p&gt;Django URLs are the foundation of any application and work by linking user requests to the appropriate views. By defining URL patterns that match specific views, Django ensures your site remains organized and scalable. &lt;/p&gt;

&lt;h4&gt;
  
  
  Using URL templates with Django’s &lt;code&gt;reverse&lt;/code&gt; function
&lt;/h4&gt;

&lt;p&gt;Django’s &lt;code&gt;reverse&lt;/code&gt; function lets you generate URLs based on their named URL patterns. It takes the name of the URL pattern as its first argument, followed by any required arguments, and returns the corresponding URL.&lt;/p&gt;

&lt;p&gt;Here’s an example of it in action:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# views.py
from django.shortcuts import render
from django.urls import reverse

def blog_post_detail(request, year, month, slug):
    # ...
    url = reverse('blog_post_detail', args=[year, month, slug])
    return render(request, 'blog/post_detail.html', {'url': url})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the &lt;code&gt;reverse&lt;/code&gt; function generates the URL for the &lt;code&gt;blog_post_detail&lt;/code&gt; URL pattern, passing the year, month, and &lt;a href="https://docs.djangoproject.com/en/dev/glossary/#term-slug" rel="noopener noreferrer"&gt;slug&lt;/a&gt; values as arguments.&lt;/p&gt;

&lt;p&gt;You can then use the returned URL in templates or other application parts.&lt;/p&gt;

&lt;h4&gt;
  
  
  Using URL tags in Django templates
&lt;/h4&gt;

&lt;p&gt;Django’s &lt;code&gt;{% url %}&lt;/code&gt; template tag provides an elegant way to generate URLs directly within your template. Instead of hardcoding URLs, you can refer to named URL patterns, which makes your templates more flexible and easier to manage.&lt;/p&gt;

&lt;p&gt;Here’s an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;a href="{% url 'blog_post_detail' year=2024 month=10 slug=post.slug %}"&amp;gt; 
Read More 
&amp;lt;/a&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, the &lt;code&gt;{% url %}&lt;/code&gt; tag creates a URL for the &lt;code&gt;blog_post_detail&lt;/code&gt; view, passing in the &lt;code&gt;year&lt;/code&gt;, &lt;code&gt;month&lt;/code&gt;, and &lt;code&gt;slug&lt;/code&gt; parameters. It’s important to make sure these arguments match the URL pattern defined in your &lt;code&gt;urls.py&lt;/code&gt; file, which should look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;path('blog/&amp;lt;int:year&amp;gt;/&amp;lt;int:month&amp;gt;/&amp;lt;slug:slug&amp;gt;/', views.blog_post_detail, name='blog_post_detail'),
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach helps keep your templates clean and adaptable, particularly as your project evolves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Jinja vs. Django templates
&lt;/h2&gt;

&lt;p&gt;Although Django comes with a built-in template engine (DTL), developers also have the option to use alternatives like Jinja.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://jinja.palletsprojects.com/en/stable/" rel="noopener noreferrer"&gt;Jinja&lt;/a&gt; is a popular, modern, and feature-rich template engine for Python. Initially developed for the &lt;a href="https://blog.jetbrains.com/pycharm/2023/11/django-vs-flask-which-is-the-best-python-web-framework/" rel="noopener noreferrer"&gt;Flask&lt;/a&gt; web framework, it’s also compatible with Django.&lt;/p&gt;

&lt;p&gt;The engine was designed to be fast, secure, and highly extensible. Its broad feature set and capabilities make it versatile for rendering dynamic content.&lt;/p&gt;

&lt;p&gt;Some of Jinja’s key features and advantages over Django’s DTL include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A more concise and intuitive syntax.&lt;/li&gt;
&lt;li&gt;Sandboxed execution for increased security against code injection attacks.&lt;/li&gt;
&lt;li&gt;A more flexible and powerful inheritance system.&lt;/li&gt;
&lt;li&gt;Better debugging tools and reporting mechanisms.&lt;/li&gt;
&lt;li&gt;Faster performance when working with complex templates or large datasets.&lt;/li&gt;
&lt;li&gt;Enhanced functionality with built-in filters and macros, enabling more complex rendering logic without cluttering the template.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PyCharm can automatically detect the &lt;code&gt;*.jinja&lt;/code&gt; file type and provides syntax highlighting, code completion, and error detection, along with support for custom filters and extensions, ensuring a smooth development experience.&lt;/p&gt;

&lt;p&gt;Despite these benefits, it’s also important to remember that integrating Jinja into a Django project requires a more complex setup and further configuration.&lt;/p&gt;
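
&lt;p&gt;As a rough idea of what that setup involves, Django ships with a Jinja2 backend you can enable in &lt;code&gt;TEMPLATES&lt;/code&gt; (the directory and module names here are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# settings.py (sketch) – enabling Django's built-in Jinja2 backend
TEMPLATES = [
    {
        "BACKEND": "django.template.backends.jinja2.Jinja2",
        "DIRS": [BASE_DIR / "jinja2"],  # BASE_DIR comes from the default settings
        "APP_DIRS": True,
        # Optional hook to a function that builds a custom jinja2.Environment:
        "OPTIONS": {"environment": "myproject.jinja2.environment"},
    },
    # Keep the DTL backend alongside if the admin and other apps need it.
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;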

&lt;p&gt;Some developers might also prefer to stick with Django’s built-in template engine in order to keep everything within the Django ecosystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code faster with Django live templates
&lt;/h3&gt;

&lt;p&gt;With PyCharm’s live template feature, you can quickly insert commonly used code snippets with a simple keyword shortcut.&lt;/p&gt;

&lt;p&gt;For example, you can invoke live templates by pressing &lt;em&gt;⌘J&lt;/em&gt; (&lt;em&gt;Ctrl+J&lt;/em&gt; on Windows/Linux), typing &lt;code&gt;ListView&lt;/code&gt;, and hitting the Tab key.&lt;/p&gt;



&lt;p&gt;This reduces boilerplate coding, speeds up development, and ensures consistent syntax. You can even &lt;strong&gt;customize or create your own templates&lt;/strong&gt; to fit specific project needs. This feature is particularly useful for DTL syntax, where loops, conditionals, and block structures are frequently repeated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Django templates: best practices and tips
&lt;/h2&gt;

&lt;p&gt;Working with Django templates is a great way to manage the presentation layer of your web apps.&lt;/p&gt;

&lt;p&gt;However, following established guidelines and carrying out performance optimizations is essential to keep your templates maintainable, secure, and well-organized.&lt;/p&gt;

&lt;p&gt;Here are some best practices and tips to remember when using Django templates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Separate presentation and business logic&lt;/strong&gt;. Keep templates focused on rendering data and handle complex processing in views or models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Organize your templates logically.&lt;/strong&gt; Follow Django’s file structure by separating templates by app and functionality, using subdirectories as needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Django’s naming conventions&lt;/strong&gt;. Django follows a ‘convention over configuration’ principle, letting you name your templates in a specific way so that you don’t need to provide your template name explicitly. For instance, when using class-based views like &lt;code&gt;ListView&lt;/code&gt;, Django automatically looks for a template named &lt;code&gt;&amp;lt;app&amp;gt;/&amp;lt;model&amp;gt;_list.html&lt;/code&gt;, thus simplifying your code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Break down elaborate tasks into reusable components.&lt;/strong&gt; Promote code reuse and improve maintainability by using template tags, filters, and includes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Follow consistent naming conventions.&lt;/strong&gt; Use clear and descriptive names for your templates, tags, and filters. This makes it easier for other developers to read your code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Django’s safe rendering filters.&lt;/strong&gt; Always escape user-provided data before rendering to prevent XSS vulnerabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document complex template logic.&lt;/strong&gt; Use clear comments to explain intricate parts of your templates. This will help others (and your future self) understand your code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Profile your templates&lt;/strong&gt;. Use Django’s profiling tools to find and optimize performance bottlenecks like inefficient loops and convoluted logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/guide/django/links/django-in-pycharm-tips-reloaded/" rel="noopener noreferrer"&gt;Watch this video&lt;/a&gt; to explore Django tips and PyCharm features in more detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Whether you’re building a simple website or a more complicated app, you should now know how to create Django templates that enhance user experience and streamline your development process.&lt;/p&gt;

&lt;p&gt;But templates are just one component of the Django framework. Explore our other &lt;a href="https://blog.jetbrains.com/pycharm/tag/django/" rel="noopener noreferrer"&gt;Django blogs&lt;/a&gt; and resources that can help you &lt;a href="https://blog.jetbrains.com/pycharm/2024/01/how-to-learn-django/" rel="noopener noreferrer"&gt;learn Django&lt;/a&gt;, discover &lt;a href="https://blog.jetbrains.com/pycharm/2023/12/django-5-0-delight-unraveling-the-newest-features/" rel="noopener noreferrer"&gt;Django’s newest features&lt;/a&gt;, and more. You may also want to familiarize yourself with &lt;a href="https://docs.djangoproject.com/en/5.0/" rel="noopener noreferrer"&gt;Django’s official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reliable Django support in PyCharm
&lt;/h2&gt;

&lt;p&gt;From complete beginners to experienced developers, &lt;a href="https://www.jetbrains.com/pycharm/web-development/django/" rel="noopener noreferrer"&gt;PyCharm Professional&lt;/a&gt; is on hand to help streamline your Django development workflow.&lt;/p&gt;

&lt;p&gt;As a Django IDE, it provides Django-specific code assistance, debugging, live previews, project-wide navigation, and refactoring capabilities. PyCharm includes full support for Django templates, allowing you to manage and edit them with ease. You can also connect to your database with a single click and work seamlessly with TypeScript, JavaScript, and other frontend frameworks.&lt;/p&gt;

&lt;p&gt;For full details of how to work with Django templates in PyCharm, see our &lt;a href="https://www.jetbrains.com/help/pycharm/templates.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;. Those who are relatively new to the Django framework may benefit from first reading our comprehensive tutorial, which covers all the steps in the process of &lt;a href="https://www.jetbrains.com/guide/django/tutorials/django-aws/setup-django/" rel="noopener noreferrer"&gt;creating a new Django app in PyCharm&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Ready to get started? Download PyCharm now and enjoy a more productive development process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/web-development/django/" rel="noopener noreferrer"&gt;Start with PyCharm Pro for free&lt;/a&gt;&lt;/p&gt;

</description>
      <category>django</category>
      <category>python</category>
    </item>
    <item>
      <title>An Introduction to Django Views</title>
      <dc:creator>Evgenia Verbina</dc:creator>
      <pubDate>Wed, 29 Jan 2025 10:51:08 +0000</pubDate>
      <link>https://dev.to/pycharm/an-introduction-to-django-views-4cb9</link>
      <guid>https://dev.to/pycharm/an-introduction-to-django-views-4cb9</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F219u9i5x3vt1xxo0rmxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F219u9i5x3vt1xxo0rmxn.png" alt="An Introduction to Django Views" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Views are central to Django’s architecture pattern, and having a solid grasp of how to work with them is essential for any developer working with the framework. If you’re new to developing web apps with Django or just need a refresher on views, we’ve got you covered. &lt;/p&gt;

&lt;p&gt;Gaining a better understanding of views will help you make faster progress in your Django project. Whether you’re working on an API backend or web UI flows, knowing how to use views is crucial.&lt;/p&gt;

&lt;p&gt;Read on to discover what Django views are, their different types, best practices for working with them, and examples of use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Django views?
&lt;/h2&gt;

&lt;p&gt;Views are a core component of Django’s MTV (model-template-view) architecture pattern. They essentially act as middlemen between &lt;a href="https://docs.djangoproject.com/en/5.1/topics/db/models/" rel="noopener noreferrer"&gt;models&lt;/a&gt; and &lt;a href="https://blog.jetbrains.com/pycharm/2025/02/the-ultimate-guide-to-django-templates/" rel="noopener noreferrer"&gt;templates&lt;/a&gt;, processing user requests and returning responses.&lt;/p&gt;

&lt;p&gt;You may have come across views in the MVC (model-view-controller) pattern. However, these are slightly &lt;a href="https://docs.djangoproject.com/en/5.1/faq/general/#faq-mtv" rel="noopener noreferrer"&gt;different from views in Django&lt;/a&gt; and don’t translate exactly. Django views are essentially controllers in MVC, while Django templates roughly align with views in MVC. This makes understanding the nuances of Django views vital, even if you’re familiar with views in an MVC context.&lt;/p&gt;

&lt;p&gt;Views are part of the user interface in Django, and they handle the logic and data processing for web requests made to your Django-powered apps and sites. They render your templates into what the user sees when they view your webpage. Each function-based or class-based view takes a user’s request, fetches data from the models, applies business logic or data processing, and then renders a template and returns an HTTP response.&lt;/p&gt;

&lt;p&gt;This response can be anything a web browser can display and is typically an HTML webpage. However, Django views can also return images, XML documents, redirects, error pages, and more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rendering and passing data to templates
&lt;/h2&gt;

&lt;p&gt;Django provides the &lt;code&gt;render()&lt;/code&gt; shortcut to make template rendering simple from within views. Using &lt;code&gt;render()&lt;/code&gt; helps avoid the boilerplate of loading the template and creating the response manually.&lt;/p&gt;

&lt;p&gt;PyCharm offers smart code completion that automatically suggests the &lt;code&gt;render()&lt;/code&gt; function from &lt;code&gt;django.shortcuts&lt;/code&gt; when you start typing it in your views. It also recognizes template names and provides autocompletion for template paths, helping you avoid typos and errors.&lt;/p&gt;

&lt;p&gt;You pass &lt;code&gt;render()&lt;/code&gt; the request, the template name, and a context dictionary that supplies data for the template. Once the necessary data is obtained, the view passes it to the template, where it is rendered and presented to the user.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from django.shortcuts import render

def my_view(request):
    # Some business logic to obtain data
    data_to_pass = {'variable1': 'value1', 'variable2': 'value2'}

    # Pass the data to the template
    return render(request, 'my_template.html', context=data_to_pass)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, &lt;code&gt;data_to_pass&lt;/code&gt; is a dictionary containing the data you want to send to the template. The &lt;code&gt;render&lt;/code&gt; function is then used to render the template (&lt;code&gt;my_template.html&lt;/code&gt;) with the provided context data.&lt;/p&gt;

&lt;p&gt;Now, in your template (&lt;code&gt;my_template.html&lt;/code&gt;), you can access and display the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!DOCTYPE html&amp;gt;
&amp;lt;html&amp;gt;
&amp;lt;head&amp;gt;
    &amp;lt;title&amp;gt;My Template&amp;lt;/title&amp;gt;
&amp;lt;/head&amp;gt;
&amp;lt;body&amp;gt;
    &amp;lt;h1&amp;gt;{{ variable1 }}&amp;lt;/h1&amp;gt;
    &amp;lt;p&amp;gt;{{ variable2 }}&amp;lt;/p&amp;gt;
&amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the template, you use double curly braces (&lt;code&gt;{{ }}&lt;/code&gt;) to indicate template variables. These will be replaced with the values from the context data passed by the view.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/web-development/django/" rel="noopener noreferrer"&gt;PyCharm&lt;/a&gt; offers completion and syntax highlighting for Django template tags, variables, and loops. It also provides in-editor linting for common mistakes. This allows you to focus on building views and handling logic, rather than spending time manually filling in template elements or debugging common errors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnchxbvzvnellyh2yjgt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnchxbvzvnellyh2yjgt.png" alt="PyCharm Django completion" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/web-development/django/" rel="noopener noreferrer"&gt;Start with PyCharm Pro for free&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Function-based views
&lt;/h2&gt;

&lt;p&gt;Django has two types of views: function-based views and class-based views.&lt;/p&gt;

&lt;p&gt;Function-based views are built using simple Python functions and are generally divided into four basic categories: create, read, update, and delete (CRUD). These operations form the foundation of most web frameworks. Each view takes in an HTTP request and returns an HTTP response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from django.http import HttpResponse

def my_view(request):

    # View logic goes here
    context = {"message": "Hello world"}

    return HttpResponse(render(request, "mytemplate.html", context))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This snippet handles the logic of the view, prepares a context dictionary for passing data to a template that is rendered, and returns the final template HTML in a response object.&lt;/p&gt;

&lt;p&gt;Function-based views are simple and straightforward. The logic is contained in a single Python function instead of spread across methods in a class, making them most suited to use cases with minimal processing.&lt;/p&gt;

&lt;p&gt;PyCharm allows you to automatically generate the &lt;code&gt;def my_view(request)&lt;/code&gt; structure using &lt;a href="https://www.jetbrains.com/help/pycharm/using-live-templates.html" rel="noopener noreferrer"&gt;live templates&lt;/a&gt;. Live templates are pre-defined code snippets that can be expanded into boilerplate code. This feature saves you time and ensures a consistent structure for your view definitions.&lt;/p&gt;

&lt;p&gt;You can invoke live templates simply by pressing &lt;em&gt;⌘J&lt;/em&gt;, typing &lt;code&gt;Listview&lt;/code&gt;, and pressing the tab key. &lt;/p&gt;



&lt;p&gt;Moreover, PyCharm includes a &lt;em&gt;Django Structure&lt;/em&gt; tool window, where you can see a list of all the views in your Django project, organized by app. This allows you to quickly locate views, navigate between them, and identify which file each view belongs to.&lt;/p&gt;



&lt;h2&gt;
  
  
  Class-based views
&lt;/h2&gt;

&lt;p&gt;Django introduced class-based views so users wouldn’t need to write the same code repeatedly. They don’t replace function-based views but instead have certain applications and advantages, especially in cases where complex logic is required.&lt;/p&gt;

&lt;p&gt;Class-based views in Django provide reusable parent classes that implement various patterns and functionality typically needed by web application views. You can derive your views from these parent classes to reduce boilerplate code.&lt;/p&gt;

&lt;p&gt;Class-based views offer generic parent classes like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ListView&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DetailView&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CreateView&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;And many more.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below are two similar code snippets demonstrating a simple &lt;code&gt;BookListView&lt;/code&gt;. The first shows a basic implementation using the default class-based conventions, while the second illustrates how you can customize the view by specifying additional parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basic implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from django.views.generic import ListView
from .models import Book 

class BookListView(ListView):
    model = Book
    # The template_name is omitted because Django defaults to 'book_list.html' 
    # based on the convention of &amp;lt;model_name&amp;gt;_list.html for ListView.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;BookListView&lt;/code&gt; is rendered, it automatically queries the &lt;code&gt;Book&lt;/code&gt; records and passes them to &lt;code&gt;book_list.html&lt;/code&gt; as &lt;code&gt;object_list&lt;/code&gt; (with &lt;code&gt;book_list&lt;/code&gt; available as an alias). This means you can create a view to list objects quickly without needing to rewrite the underlying logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Customized implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from django.views.generic import ListView
from .models import Book 

class BookListView(ListView):
    model = Book

    # You can customize the view further by adding additional attributes or methods
    def get_queryset(self):
        # Example of customizing the queryset to filter books
        return Book.objects.filter(is_available=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the second snippet, we’ve introduced a custom &lt;code&gt;get_queryset()&lt;/code&gt; method, allowing us to filter the records displayed in the view more precisely. This shows how class-based views can be extended beyond their default functionality to meet the needs of your application. &lt;/p&gt;

&lt;p&gt;Class-based views also define methods that tie into key parts of the request and response lifecycle, such as: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;get()&lt;/code&gt; – logic for &lt;code&gt;GET&lt;/code&gt; requests.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;post()&lt;/code&gt; – logic for &lt;code&gt;POST&lt;/code&gt; requests.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dispatch()&lt;/code&gt; – inspects the HTTP method and routes the request to the matching handler, such as &lt;code&gt;get()&lt;/code&gt; or &lt;code&gt;post()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These types of views provide structure while offering customization where needed, making them well-suited to elaborate use cases.&lt;/p&gt;

&lt;p&gt;PyCharm offers live templates for class-based views like &lt;code&gt;ListView&lt;/code&gt;, &lt;code&gt;DetailView&lt;/code&gt;, and &lt;code&gt;TemplateView&lt;/code&gt;, allowing you to generate entire view classes in seconds, complete with boilerplate methods and docstrings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz89m8d6okn9mum7ey23t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz89m8d6okn9mum7ey23t.png" alt="Django live templates in PyCharm" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating custom class-based views
&lt;/h3&gt;

&lt;p&gt;You can also create your own view classes by subclassing Django’s generic ones and customizing them for your needs. &lt;/p&gt;

&lt;p&gt;Some use cases where you might want to make your own classes include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adding business logic, such as complicated calculations.&lt;/li&gt;
&lt;li&gt;Mixing multiple generic parents to blend functionality.&lt;/li&gt;
&lt;li&gt;Managing sessions or state across multiple requests.&lt;/li&gt;
&lt;li&gt;Optimizing database access with custom queries. &lt;/li&gt;
&lt;li&gt;Reusing common rendering logic across different areas. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A custom class-based view could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from django.views.generic import View
from django.shortcuts import render
from . import models

class ProductSalesView(View):

    def get(self, request):

        # Custom data processing 
        sales = get_sales_data()

        return render(request, "sales.html", {"sales": sales})

    def post(self, request):

        # Custom form handling
        form = SalesSearchForm(request.POST)  
        if form.is_valid():
            results = models.Sale.objects.filter(date__gte=form.cleaned_data['start_date'])
            context = {"results": results}
            return render(request, "search_results.html", context)

        # Invalid form handling
        errors = form.errors
        return render(request, "sales.html", {"errors": errors})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the custom &lt;code&gt;get()&lt;/code&gt; and &lt;code&gt;post()&lt;/code&gt; handlers let a single view both display the sales page and process the search form, with each request type getting its own logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use each view type
&lt;/h2&gt;

&lt;p&gt;Function-based and class-based views can both be useful depending on the complexity and needs of the view logic. &lt;/p&gt;

&lt;p&gt;The main differences are that class-based views:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Promote reuse via subclassing, with behavior inherited from parent classes.&lt;/li&gt;
&lt;li&gt;Are ideal for state management between requests.&lt;/li&gt;
&lt;li&gt;Provide more structure and enforced discipline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You might use them when working with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dashboard pages with complex rendering logic. &lt;/li&gt;
&lt;li&gt;Public-facing pages that display dynamic data.&lt;/li&gt;
&lt;li&gt;Admin portals for content management.&lt;/li&gt;
&lt;li&gt;List or detail pages involving database models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the other hand, function-based views:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are simpler and take less code to create.&lt;/li&gt;
&lt;li&gt;Can be easier for Python developers to grasp.&lt;/li&gt;
&lt;li&gt;Are highly flexible and have fewer constraints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Their use cases include: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prototyping ideas.&lt;/li&gt;
&lt;li&gt;Simple CRUD or database views.&lt;/li&gt;
&lt;li&gt;Landing or marketing pages. &lt;/li&gt;
&lt;li&gt;API endpoints for serving web requests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, function-based views are flexible, straightforward, and easy to reason about. However, for more complex cases, you’ll end up writing more code that you can’t reuse.&lt;/p&gt;

&lt;p&gt;Class-based views in Django enforce structure and are reusable, but they can be more challenging to understand and implement, as well as harder to debug.&lt;/p&gt;

&lt;h2&gt;
  
  
  Views and URLs
&lt;/h2&gt;

&lt;p&gt;As we’ve established, in Django, views are the functions or classes that determine how a template is rendered. Each view links to a specific URL pattern, guiding incoming requests to the right place.&lt;/p&gt;

&lt;p&gt;Understanding the relationship between views and URLs is important for managing your application’s flow effectively. &lt;/p&gt;

&lt;p&gt;Every view corresponds with a URL pattern defined in your Django app’s &lt;code&gt;urls.py&lt;/code&gt; file. This URL mapping ensures that when a user navigates to a specific address in your application, Django knows exactly which view to invoke. &lt;/p&gt;

&lt;p&gt;Let’s take a look at a simple URL configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from django.urls import path
from .views import BookListView

urlpatterns = [
    path('books/', BookListView.as_view(), name='book-list'),
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this setup, when a user visits &lt;code&gt;/books/&lt;/code&gt;, the &lt;code&gt;BookListView&lt;/code&gt; kicks in to render the list of books. By clearly mapping URLs to views, you make your codebase easier to read and more organized.&lt;/p&gt;
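
&lt;p&gt;The &lt;code&gt;name='book-list'&lt;/code&gt; argument also lets you refer to this route without hard-coding the path. As a small sketch based on the URL configuration above, you can resolve the name in Python with &lt;code&gt;reverse()&lt;/code&gt;, or in templates with the &lt;code&gt;{% url %}&lt;/code&gt; tag:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from django.urls import reverse
from django.shortcuts import redirect

# Resolve the named route defined above; returns '/books/'
books_url = reverse('book-list')

# A common use: redirect to the named route after a successful form submission
def after_create(request):
    return redirect('book-list')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;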

&lt;h3&gt;
  
  
  Simplify URL management with PyCharm
&lt;/h3&gt;

&lt;p&gt;Managing and visualizing endpoints in Django can become challenging as your application grows. PyCharm addresses this with its &lt;em&gt;Endpoints&lt;/em&gt; tool window, which provides a centralized view of all your app’s URL patterns, linked views, and HTTP methods. This feature allows you to see a list of every endpoint in your project, making it easier to track which views are tied to specific URLs. &lt;/p&gt;

&lt;p&gt;Instead of searching through multiple &lt;code&gt;urls.py&lt;/code&gt; files, you can instantly locate and navigate to the corresponding views with just a click. This is especially useful for larger Django projects where URL configurations span multiple files or when working in teams where establishing context quickly is crucial.&lt;/p&gt;

&lt;p&gt;Furthermore, the &lt;em&gt;Endpoints&lt;/em&gt; tool window lets you visualize all endpoints in a table-like interface. Each row displays the URL path, the HTTP method (&lt;code&gt;GET&lt;/code&gt;, &lt;code&gt;POST&lt;/code&gt;, etc.), and the associated view function or class of a given endpoint. &lt;/p&gt;

&lt;p&gt;This feature not only boosts productivity but also improves code navigation, allowing you to spot missing or duplicated URL patterns with ease. This level of visibility is invaluable for debugging routing issues or onboarding new developers to a project.&lt;/p&gt;

&lt;p&gt;Check out this video for more information on the &lt;em&gt;Endpoints&lt;/em&gt; tool window and how you can benefit from it. &lt;/p&gt;

&lt;h2&gt;
  
  
  Best practices for using Django views
&lt;/h2&gt;

&lt;p&gt;Here are some guidelines that can help you create well-structured and maintainable views.&lt;/p&gt;

&lt;h3&gt;
  
  
  Keep views focused
&lt;/h3&gt;

&lt;p&gt;Views should concentrate on handling requests, fetching data, passing data to templates, and controlling flow and redirects. Complicated &lt;a href="https://forum.djangoproject.com/t/where-to-put-business-logic-in-django/282" rel="noopener noreferrer"&gt;business logic&lt;/a&gt; and complex processing should happen elsewhere, such as in model methods or dedicated service classes. &lt;/p&gt;

&lt;p&gt;However, you should be mindful not to overload your models with too much logic, as this can lead to the “fat model” anti-pattern in Django. &lt;a href="https://docs.djangoproject.com/en/5.1/topics/class-based-views/" rel="noopener noreferrer"&gt;Django’s documentation on views&lt;/a&gt; provides more insights about structuring them properly. &lt;/p&gt;
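
&lt;p&gt;As a minimal sketch of this separation (the module, model, and field names are illustrative), business logic can live in a plain service function that the view simply calls:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# services.py (hypothetical module)
from django.db.models import Sum

from .models import Sale  # assumes a Sale model with 'date' and 'amount' fields

def total_revenue_since(start_date):
    """Aggregate revenue here so the view only handles the request and response."""
    return Sale.objects.filter(date__gte=start_date).aggregate(total=Sum('amount'))['total']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The view then stays focused: it parses the request, calls &lt;code&gt;total_revenue_since()&lt;/code&gt;, and passes the result to a template.&lt;/p&gt;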

&lt;h3&gt;
  
  
  Keep views and templates thin
&lt;/h3&gt;

&lt;p&gt;It’s best to keep both views and templates slim. Views should handle request processing and data retrieval, while templates should focus on presentation with minimal logic.&lt;/p&gt;

&lt;p&gt;Complex processing should be done in Python outside the templates to improve maintainability and testing. For more on this, check out the &lt;a href="https://docs.djangoproject.com/en/stable/topics/templates/" rel="noopener noreferrer"&gt;Django templates documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decouple database queries
&lt;/h3&gt;

&lt;p&gt;Extracting database queries into separate model managers or repositories instead of placing them directly in views can help reduce duplication. Refer to the &lt;a href="https://docs.djangoproject.com/en/stable/topics/db/models/" rel="noopener noreferrer"&gt;Django models documentation&lt;/a&gt; for guidance on managing database interactions effectively. &lt;/p&gt;
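
&lt;p&gt;For example, a query that would otherwise be repeated across views can live in a custom model manager. This is a sketch with an illustrative &lt;code&gt;Book&lt;/code&gt; model and field names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from django.db import models

class AvailableBookManager(models.Manager):
    def get_queryset(self):
        # Centralize the filter so every caller shares one definition of "available"
        return super().get_queryset().filter(is_available=True)

class Book(models.Model):
    title = models.CharField(max_length=200)
    is_available = models.BooleanField(default=True)

    objects = models.Manager()          # default manager
    available = AvailableBookManager()  # Book.available.all() in any view
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;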

&lt;h3&gt;
  
  
  Use generic class-based views when possible
&lt;/h3&gt;

&lt;p&gt;Django’s generic class-based views, like &lt;code&gt;DetailView&lt;/code&gt; and &lt;code&gt;ListView&lt;/code&gt;, provide reusability without requiring you to write much code. Opt for using them over reinventing the wheel to make better use of your time. The &lt;a href="https://docs.djangoproject.com/en/stable/topics/class-based-views/generic-display/" rel="noopener noreferrer"&gt;generic views documentation&lt;/a&gt; is an excellent resource for understanding these features. &lt;/p&gt;

&lt;h3&gt;
  
  
  Function-based views are OK for simple cases
&lt;/h3&gt;

&lt;p&gt;For basic views like serving APIs, a function can be more effective than a class. Reserve complex class-based views for intricate UI flows. The &lt;a href="https://docs.djangoproject.com/en/stable/topics/http/views/" rel="noopener noreferrer"&gt;writing views documentation&lt;/a&gt; page offers helpful examples.&lt;/p&gt;

&lt;h3&gt;
  
  
  Structure routes and URLs cleanly
&lt;/h3&gt;

&lt;p&gt;Organize routes and view handlers by grouping them into apps by functionality. This makes it easier to find and navigate the source. Check out the &lt;a href="https://docs.djangoproject.com/en/stable/topics/http/urls/" rel="noopener noreferrer"&gt;Django URL dispatcher documentation&lt;/a&gt; for best practices in structuring your URL configurations. &lt;/p&gt;
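
&lt;p&gt;In practice, this usually means a project-level &lt;code&gt;urls.py&lt;/code&gt; that delegates to each app with &lt;code&gt;include()&lt;/code&gt;. The app names below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# project/urls.py (sketch)
from django.urls import include, path

urlpatterns = [
    path('books/', include('books.urls')),        # all book-related routes
    path('accounts/', include('accounts.urls')),  # all account-related routes
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;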

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;Now that you have a basic understanding of views in Django, you’ll want to dig deeper into the framework and explore your next steps.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Brush up on your Django knowledge with our &lt;a href="https://blog.jetbrains.com/pycharm/2024/01/how-to-learn-django/" rel="noopener noreferrer"&gt;&lt;em&gt;How to Learn Django&lt;/em&gt;&lt;/a&gt; blog post, which is ideal for beginners or those looking to refresh their expertise.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Discover how to &lt;a href="https://www.jetbrains.com/help/pycharm/creating-and-running-your-first-django-project.html" rel="noopener noreferrer"&gt;create and run your first Django project&lt;/a&gt; in PyCharm, with our tutorial on crafting a basic to-do application, or explore our complete list of &lt;a href="https://blog.jetbrains.com/pycharm/2024/09/django-project-ideas/" rel="noopener noreferrer"&gt;Django project ideas&lt;/a&gt; for further inspiration. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Explore the &lt;a href="https://blog.jetbrains.com/pycharm/2024/06/the-state-of-django/" rel="noopener noreferrer"&gt;state of Django&lt;/a&gt; to see the latest trends in Django development for further inspiration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you’re still deciding which Python framework to use, our &lt;a href="https://blog.jetbrains.com/pycharm/2023/11/django-vs-flask-which-is-the-best-python-web-framework/" rel="noopener noreferrer"&gt;&lt;em&gt;Django vs. Flask&lt;/em&gt;&lt;/a&gt; and &lt;a href="https://blog.jetbrains.com/pycharm/2023/12/django-vs-fastapi-which-is-the-best-python-web-framework/" rel="noopener noreferrer"&gt;&lt;em&gt;Django vs. FastAPI&lt;/em&gt;&lt;/a&gt; comparison guides can help.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Django support in PyCharm
&lt;/h3&gt;

&lt;p&gt;PyCharm Professional is the best-in-class IDE for &lt;a href="https://www.jetbrains.com/pycharm/web-development/django" rel="noopener noreferrer"&gt;Django development&lt;/a&gt;. It allows you to code faster with Django-specific code assistance, project-wide navigation and refactoring, and full support for Django templates. You can connect to your database in a single click and work on TypeScript, JavaScript, and frontend frameworks. PyCharm also supports Flask and FastAPI out of the box. &lt;/p&gt;

&lt;p&gt;Create better applications and streamline your code. Get started with PyCharm now for an effortless Django development experience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/web-development/django/" rel="noopener noreferrer"&gt;Start with PyCharm Pro for free&lt;/a&gt;&lt;/p&gt;

</description>
      <category>django</category>
      <category>python</category>
    </item>
    <item>
      <title>Anomaly Detection in Time Series</title>
      <dc:creator>Evgenia Verbina</dc:creator>
      <pubDate>Wed, 22 Jan 2025 12:14:32 +0000</pubDate>
      <link>https://dev.to/pycharm/anomaly-detection-in-time-series-3pa3</link>
      <guid>https://dev.to/pycharm/anomaly-detection-in-time-series-3pa3</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhe1wcqsvgz96dzfq80q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhe1wcqsvgz96dzfq80q.png" alt="Anomaly Detection in Time Series" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How do you identify unusual patterns in data that might reveal critical issues or hidden opportunities? &lt;a href="https://blog.jetbrains.com/pycharm/2025/01/anomaly-detection-in-machine-learning/" rel="noopener noreferrer"&gt;Anomaly detection&lt;/a&gt; helps identify data that deviates significantly from the norm. Time series data, which consists of data collected over time, often includes trends and seasonal patterns. Anomalies in time series data occur when these patterns are disrupted, making anomaly detection a valuable tool in industries like sales, finance, manufacturing, and healthcare.&lt;/p&gt;

&lt;p&gt;As time series data has unique characteristics like seasonality and trends, specialized methods are required to detect anomalies effectively. In this blog post, we’ll explore some popular methods for anomaly detection in time series, including STL decomposition and LSTM prediction, with detailed code examples to help you get started.&lt;/p&gt;

&lt;h2&gt;
  
  
  Time series anomaly detection in businesses
&lt;/h2&gt;

&lt;p&gt;Time series data is essential to many businesses and services. Many businesses record data over time with timestamps, allowing changes to be analyzed and data to be compared over time. Time series are useful when comparing a quantity over a period, for example in a year-over-year comparison where the data exhibits seasonal characteristics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sales monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the most common examples of time series data with seasonality is sales data. Since sales are heavily affected by annual holidays and the time of year, it is hard to draw conclusions about sales data without accounting for seasonality. Because of that, a common method for analyzing and finding anomalies in sales data is STL decomposition, which we will cover in detail &lt;a href="https://blog.jetbrains.com/pycharm/2025/01/anomaly-detection-in-time-series/#stl-beehive" rel="noopener noreferrer"&gt;later in this blog post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Financial data, such as transactions and stock prices, are typical examples of time series data. In the finance industry, analyzing and detecting anomalies in this data is a common practice. For example, time series prediction models can be used in automatic trading. We’ll use a time series prediction to identify anomalies in stock data &lt;a href="https://blog.jetbrains.com/pycharm/2025/01/anomaly-detection-in-time-series/#lstm-stock" rel="noopener noreferrer"&gt;later in this blog post.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manufacturing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Another use case of time series anomaly detection is monitoring defects in production lines. Machines are often monitored continuously, making time series data readily available. Being able to notify management of potential failures is essential, and anomaly detection plays a key role.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Medicine and healthcare&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In medicine and healthcare, human vitals are monitored continuously so that anomalies can be detected. This is important in medical research, but it’s critical in diagnostics. If a patient at a hospital has anomalies in their vitals and is not treated immediately, the results can be fatal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is it important to use special methods for time series anomaly detection?
&lt;/h2&gt;

&lt;p&gt;Time series data is special in the sense that it sometimes cannot be treated like other types of data. For example, when we apply a train-test split to time series data, the sequentially related nature of the data means we cannot shuffle it. This is also true when feeding time series data into a deep learning model. A recurrent neural network (RNN) is commonly used to take the sequential relationship into account, and training data is input as time windows, which preserve the sequence of events within.&lt;/p&gt;
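
&lt;p&gt;As a quick sketch (the variable name is illustrative), a chronological split keeps the order of observations intact instead of sampling at random:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Split a time series 80/20 without shuffling
split_point = int(len(timeseries) * 0.8)
train, test = timeseries[:split_point], timeseries[split_point:]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;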

&lt;p&gt;Time series data is also special because it often has seasonality and trends that we cannot ignore. This seasonality can manifest in a 24-hour cycle, a 7-day cycle, or a 12-month cycle, just to name a few common possibilities. Anomalies can only be determined after the seasonality and trends have been considered, as you will see in &lt;a href="https://blog.jetbrains.com/pycharm/2025/01/anomaly-detection-in-time-series/#stl-beehive" rel="noopener noreferrer"&gt;our example below&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Methods used for anomaly detection in time series
&lt;/h2&gt;

&lt;p&gt;Because time series data is special, there are specific methods for detecting anomalies in it. Depending on the type of data, some of the methods and algorithms we mentioned in the previous &lt;a href="https://blog.jetbrains.com/pycharm/2025/01/anomaly-detection-in-machine-learning/" rel="noopener noreferrer"&gt;blog post about anomaly detection&lt;/a&gt; can be used on time series data. However, with those methods, the anomaly detection may not be as robust as it would be with methods designed specifically for time series data. In some cases, a combination of detection methods can be used to reconfirm the detection result and avoid false positives or negatives.&lt;/p&gt;

&lt;h3&gt;
  
  
  STL decomposition
&lt;/h3&gt;

&lt;p&gt;One of the most popular ways to analyze time series data that has seasonality is STL decomposition – seasonal-trend decomposition using LOESS (locally estimated scatterplot smoothing). In this method, a time series is decomposed using an estimate of seasonality (with the period provided or determined using an algorithm), a trend (estimated), and the residual (the noise in the data). A &lt;a href="https://www.jetbrains.com/help/pycharm/python.html" rel="noopener noreferrer"&gt;Python&lt;/a&gt; library that provides &lt;a href="https://www.statsmodels.org/stable/examples/notebooks/generated/stl_decomposition.html" rel="noopener noreferrer"&gt;STL decomposition tools&lt;/a&gt; is the &lt;a href="https://www.statsmodels.org/stable/index.html" rel="noopener noreferrer"&gt;statsmodels&lt;/a&gt; library.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmszn3kpoto7kz6587ec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmszn3kpoto7kz6587ec.png" alt="STL decomposition" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An anomaly is detected when the residual is beyond a certain threshold. &lt;/p&gt;

&lt;h3&gt;
  
  
  Using STL decomposition on beehive data
&lt;/h3&gt;

&lt;p&gt;In an earlier &lt;a href="https://blog.jetbrains.com/pycharm/2025/01/anomaly-detection-in-machine-learning/" rel="noopener noreferrer"&gt;blog post&lt;/a&gt;, we explored anomaly detection in beehives using the &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html" rel="noopener noreferrer"&gt;OneClassSVM&lt;/a&gt; and &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html" rel="noopener noreferrer"&gt;IsolationForest&lt;/a&gt; methods. &lt;/p&gt;

&lt;p&gt;In this tutorial, we’ll analyze &lt;a href="https://www.kaggle.com/datasets/vivovinco/beehives" rel="noopener noreferrer"&gt;beehive data&lt;/a&gt; as a time series using the &lt;code&gt;STL&lt;/code&gt; class provided by the statsmodels library. To get started, set up your environment using this file: &lt;a href="https://github.com/Cheukting/anomaly-detection/blob/main/requirements.txt" rel="noopener noreferrer"&gt;requirements.txt&lt;/a&gt;. &lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1. Install the library&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Since we have so far only been using models provided by scikit-learn, we will need to install statsmodels from PyPI. This is easy to do in &lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;PyCharm&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;Start with PyCharm Pro for free&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Go to the &lt;em&gt;Python&lt;/em&gt; &lt;a href="https://www.jetbrains.com/help/pycharm/installing-uninstalling-and-upgrading-packages.html" rel="noopener noreferrer"&gt;&lt;em&gt;Packages&lt;/em&gt;&lt;/a&gt; window (choose the icon at the bottom of the left-hand side of the IDE) and type statsmodels in the search box.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3d8fql3simgbbbskn3y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3d8fql3simgbbbskn3y.png" alt="Statsmodels in PyCharm" width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see all of the information about the package on the right-hand side. To install it, simply click &lt;em&gt;Install package&lt;/em&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;2. Create a Jupyter notebook&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;To investigate the dataset further, let’s create a &lt;a href="https://www.jetbrains.com/help/pycharm/jupyter-notebook-support.html" rel="noopener noreferrer"&gt;Jupyter notebook&lt;/a&gt; to take advantage of the tools that PyCharm’s Jupyter notebook environment provides.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fna10s8s26lmpgzs10olq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fna10s8s26lmpgzs10olq.png" alt="Create a Jupyter notebook in PyCharm" width="800" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will import &lt;a href="https://blog.jetbrains.com/pycharm/2024/10/data-exploration-with-pandas/" rel="noopener noreferrer"&gt;pandas&lt;/a&gt; and load the &lt;code&gt;.csv&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

df = pd.read_csv('../data/Hive17.csv', sep=";")
df = df.dropna()
df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ixa39m0s5tp4ik0ol6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ixa39m0s5tp4ik0ol6e.png" alt="Import pandas in PyCharm" width="800" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3. Inspect the data as graphs&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Now, we can inspect the data as graphs. Here, we would like to see the temperature of hive 17 over time. Click on &lt;em&gt;Chart view&lt;/em&gt; in the dataframe inspector and then choose &lt;em&gt;T17&lt;/em&gt; as the y-axis in the series settings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2rvlo4ydpk796mhp9i1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2rvlo4ydpk796mhp9i1.gif" alt="Inspect the data as graphs in PyCharm" width="720" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When expressed as a time series, the temperature has a lot of ups and downs. This indicates periodic behavior, likely due to the day-night cycle, so it is safe to assume there is a 24-hour period for the temperature. &lt;/p&gt;

&lt;p&gt;Next, there is a trend of temperature dropping over time. If you inspect the &lt;em&gt;DateTime&lt;/em&gt; column, you can see that the dates range from August to November. Since the &lt;a href="https://www.kaggle.com/datasets/vivovinco/beehives/data" rel="noopener noreferrer"&gt;Kaggle page of the dataset&lt;/a&gt; indicates that the data was collected in Turkey, the transition from summer to fall explains our observation that the temperature is dropping over time.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;4. Time series decomposition&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;To understand the time series and detect anomalies, we will perform STL decomposition, importing the &lt;code&gt;STL&lt;/code&gt; class from statsmodels and fitting it with our temperature data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from statsmodels.tsa.seasonal import STL

stl = STL(df["T17"], period=24, robust=True) 
result = stl.fit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will have to provide a period for the decomposition to work. As we mentioned before, it is safe to assume a 24-hour cycle.&lt;/p&gt;

&lt;p&gt;According to the documentation, &lt;code&gt;STL&lt;/code&gt; decomposes a time series into three components: trend, seasonal, and residual. To get a clearer look at the decomposed result, we can use the built-in &lt;code&gt;plot&lt;/code&gt; method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;result.plot()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjf8q9miqxvyeijal4d1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjf8q9miqxvyeijal4d1.png" alt="Time series decomposition" width="800" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see the &lt;em&gt;Trend&lt;/em&gt; and &lt;em&gt;Season&lt;/em&gt; plots seem to align with our assumptions above. However, we are interested in the residual plot at the bottom, which is the original series without the trend and seasonal changes. Any extremely high or low value in the residual indicates an anomaly.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;5. Anomaly threshold&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Next, we would like to determine what values of the residual we’ll consider abnormal. To do that, we can look at the residual’s histogram.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;result.resid.plot.hist()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqd6ih8hl14kkx9znky19.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqd6ih8hl14kkx9znky19.png" alt="Anomaly threshold in PyCharm" width="800" height="631"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This can be considered a normal distribution around 0, with a long tail above 5 and below -5, so we’ll set the threshold to 5.&lt;/p&gt;

&lt;p&gt;To show the anomalies on the original time series, we can color all of them red in the graph like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import matplotlib.pyplot as plt

threshold = 5
anomalies_filter = result.resid.abs() &amp;gt; threshold
anomalies = df["T17"][anomalies_filter]

plt.figure(figsize=(14, 8))
plt.scatter(x=anomalies.index, y=anomalies, color="red", label="anomalies")
plt.plot(df.index, df['T17'], color='blue')
plt.title('Temperatures in Hive 17')
plt.xlabel('Hours')
plt.ylabel('Temperature')
plt.legend()
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zt9fvo93prtvvbdrl78.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zt9fvo93prtvvbdrl78.png" alt="Anomalies on the original time series in PyCharm" width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Without STL decomposition, it is very hard to identify these anomalies in a time series consisting of periods and trends.&lt;/p&gt;

&lt;h3&gt;
  
  
  LSTM prediction
&lt;/h3&gt;

&lt;p&gt;Another way to detect anomalies in time series data is to predict the series with a deep learning model and compare its estimates to the actual data points. If an estimate is very different from the actual data point, it could be a sign of anomalous data.&lt;/p&gt;

&lt;p&gt;One of the most popular deep learning algorithms for predicting sequential data is the long short-term memory (LSTM) model, which is a type of recurrent neural network (RNN). The LSTM model has input, forget, and output gates, which are learned weight matrices that control how much information is kept, discarded, and passed on to the next iteration over the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmmu9gp013o5iumy8nhcg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmmu9gp013o5iumy8nhcg.png" alt="LSTM memory cell" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since time series data is sequential, meaning the order of data points matters and should not be shuffled, the LSTM model is an effective deep learning model for predicting the value at a given time. The prediction can then be compared to the actual data, and a threshold can be set to determine whether the actual data point is an anomaly.&lt;/p&gt;
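
&lt;p&gt;In code, the idea reduces to comparing the prediction error against a threshold. The following is a minimal sketch; &lt;code&gt;actual&lt;/code&gt;, &lt;code&gt;predicted&lt;/code&gt;, and the threshold value are placeholders rather than part of the tutorial below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

threshold = 3.0  # illustrative; in practice, derive it from the error distribution
errors = np.abs(actual - predicted)
anomaly_mask = errors &amp;gt; threshold  # True where the model's estimate is far off
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;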

&lt;h3&gt;
  
  
  Using LSTM prediction on stock prices
&lt;/h3&gt;

&lt;p&gt;Now let’s start a new Jupyter project to detect any anomalies in Apple’s stock price over the past 5 years. The &lt;a href="https://www.nasdaq.com/market-activity/stocks/aapl/historical?page=1&amp;amp;rows_per_page=25&amp;amp;timeline=y5" rel="noopener noreferrer"&gt;stock price dataset&lt;/a&gt; shows the most up-to-date data. If you want to follow along with the blog post, you can &lt;a href="https://github.com/Cheukting/lstm_anomaly_detection/tree/main/data" rel="noopener noreferrer"&gt;download the dataset&lt;/a&gt; we are using.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1. Start a Jupyter project&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;When starting a new project, you can choose to create a Jupyter one, which is optimized for data science. In the &lt;em&gt;New Project&lt;/em&gt; window, you can create a Git repository and determine which conda installation to use for managing your environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftq7dtpgudj5l6vmebjgq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftq7dtpgudj5l6vmebjgq.png" alt="Start a Jupyter project in PyCharm" width="800" height="644"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After starting the project, you will see an example notebook. Go ahead and start a new Jupyter notebook for this exercise.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj32smg65usrwuhxwiut2.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj32smg65usrwuhxwiut2.gif" alt="An example notebook in PyCharm" width="716" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After that, let’s set up &lt;code&gt;requirements.txt&lt;/code&gt;. We will need pandas, matplotlib, and PyTorch, which is named torch on PyPI. Since PyTorch is not included in the conda environment, PyCharm will tell us that we are missing the package. To install the package, click on the lightbulb and select &lt;em&gt;Install all missing packages&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frg6toit833gw1j6z1kua.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frg6toit833gw1j6z1kua.gif" alt="Install all missing packages in PyCharm" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;2. Loading and inspecting the data&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Next, let’s put our dataset &lt;a href="https://github.com/Cheukting/lstm_anomaly_detection/tree/main/data" rel="noopener noreferrer"&gt;apple_stock_5y.csv&lt;/a&gt; in the data folder and load it as a pandas DataFrame to inspect it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

df = pd.read_csv('data/apple_stock_5y.csv')
df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the interactive table, we can easily see if any data is missing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4onlzqurdlzgg2qvt4h.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4onlzqurdlzgg2qvt4h.gif" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is no missing data, but we have one issue – we would like to use the &lt;em&gt;Close/Last&lt;/em&gt; price but it is not a numeric data type. Let’s do a conversion and inspect our data again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df["Close/Last"] = df["Close/Last"].apply(lambda x: float(x[1:]))
df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we can inspect the price with the interactive table. Click on the plot icon on the left and a plot will be created. By default, it uses &lt;em&gt;Date&lt;/em&gt; as the x-axis and &lt;em&gt;Volume&lt;/em&gt; as the y-axis. Since we would like to inspect the &lt;em&gt;Close/Last&lt;/em&gt; price, go to the settings by clicking the gear icon on the right and choose &lt;em&gt;Close/Last&lt;/em&gt; as the y-axis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67nltgl7pq8renr9p1x8.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67nltgl7pq8renr9p1x8.gif" width="642" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3. Preparing the training data for LSTM&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Next, we have to prepare the training data to be used in the LSTM model. We need to prepare a sequence of vectors (feature X), each representing a time window, to predict the next price. The next price will form another sequence (target y). Here we can choose how big this time window is with the &lt;code&gt;lookback&lt;/code&gt; variable. The following code creates sequences X and y which will then be converted to PyTorch tensors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch

lookback = 5
timeseries = df[["Close/Last"]].values.astype('float32')

X, y = [], []
for i in range(len(timeseries)-lookback):
    feature = timeseries[i:i+lookback]
    target = timeseries[i+1:i+lookback+1]
    X.append(feature)
    y.append(target)

X = torch.tensor(X)
y = torch.tensor(y)

print(X.shape, y.shape)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generally speaking, the bigger the window, the bigger our model will be, since the input vector is bigger. However, with a bigger window, the sequence of inputs will be shorter, so determining this lookback window is a balancing act. We will start with 5, but feel free to try different values to see the differences.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;4. Build and train the model&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;We can build the model by creating a class using the &lt;a href="https://pytorch.org/docs/stable/nn.html" rel="noopener noreferrer"&gt;nn module&lt;/a&gt; in PyTorch before we train it. The nn module provides building blocks, such as different neural network layers. In this exercise, we will build a simple &lt;a href="https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html" rel="noopener noreferrer"&gt;LSTM layer&lt;/a&gt; followed by a &lt;a href="https://pytorch.org/docs/stable/generated/torch.nn.Linear.html" rel="noopener noreferrer"&gt;linear layer&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch.nn as nn

class StockModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=50, num_layers=1, batch_first=True)
        self.linear = nn.Linear(50, 1)
    def forward(self, x):
        x, _ = self.lstm(x)
        x = self.linear(x)
        return x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we will train our model. Before training it, we will need to create an optimizer, a &lt;a href="https://pytorch.org/docs/stable/nn.html#loss-functions" rel="noopener noreferrer"&gt;loss function&lt;/a&gt; used to calculate the loss between the predicted and actual y values, and a &lt;a href="https://pytorch.org/docs/stable/data.html#data-loading-order-and-sampler" rel="noopener noreferrer"&gt;data loader&lt;/a&gt; to feed in our training data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import torch.optim as optim
import torch.utils.data as data

model = StockModel()
optimizer = optim.Adam(model.parameters())
loss_fn = nn.MSELoss()
loader = data.DataLoader(data.TensorDataset(X, y), shuffle=True, batch_size=8)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data loader can safely shuffle the input because we have already created the time windows; the sequential relationship within each window is preserved.&lt;/p&gt;

&lt;p&gt;Training is done with a &lt;code&gt;for&lt;/code&gt; loop that iterates over the epochs. Every 100 epochs, we will print out the loss and observe how the model converges:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;n_epochs = 1000
for epoch in range(n_epochs):
    model.train()
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if epoch % 100 != 0:
        continue
    model.eval()
    with torch.no_grad():
        y_pred = model(X)
        rmse = np.sqrt(loss_fn(y_pred, y))
    print(f"Epoch {epoch}: RMSE {rmse:.4f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We start with 1,000 epochs, but the model converges quite quickly. Feel free to try other epoch counts to achieve the best result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6g6vjuqwrmvfabacn6a9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6g6vjuqwrmvfabacn6a9.png" alt="Epochs for training" width="800" height="621"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In PyCharm, a cell that requires some time to execute will provide a notification about how much time remains and a shortcut to the cell. This is very handy when training machine learning models, especially deep learning models, in Jupyter notebooks.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;5. Plot the prediction and find the errors&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Next, we will create the prediction and plot it together with the actual time series. Note that we have to create a 2D NumPy array to match the shape of the actual time series. The actual time series will be in blue, while the predicted time series will be in red.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import matplotlib.pyplot as plt

with torch.no_grad():
    pred_series = np.ones_like(timeseries) * np.nan
    pred_series[lookback:] = model(X)[:, -1, :]

plt.plot(timeseries, c='b')
plt.plot(pred_series, c='r')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffb6r1bchxij4iu6ljfpl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffb6r1bchxij4iu6ljfpl.png" alt="Plot the prediction and find the errors" width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you look carefully, you will see that the predictions and the actual values do not align perfectly. However, most of the predictions track the actual values well.&lt;/p&gt;

&lt;p&gt;To inspect the errors closely, we can create an error series and use the interactive table to observe them. We are using the absolute error this time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;error = abs(timeseries-pred_series)
error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the table settings to create a histogram with the absolute error on the x-axis and its count on the y-axis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6k9ppsjg6w4f1l28hsd.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6k9ppsjg6w4f1l28hsd.gif" width="452" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;6. Decide on the anomaly threshold and visualize&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Most of the points will have an absolute error of less than 6, so we can set that as the anomaly threshold. Similar to &lt;a href="https://blog.jetbrains.com/pycharm/2025/01/anomaly-detection-in-time-series/#anomaly-threshold" rel="noopener noreferrer"&gt;what we did for the beehive anomalies&lt;/a&gt;, we can plot the anomalous data points in the graph.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;threshold = 6
error_series = pd.Series(error.flatten())
price_series = pd.Series(timeseries.flatten())

anomalies_filter = error_series &amp;gt; threshold
anomalies = price_series[anomalies_filter]

plt.figure(figsize=(14, 8))
plt.scatter(x=anomalies.index, y=anomalies, color="red", label="anomalies")
plt.plot(df.index, timeseries, color='blue')
plt.title('Closing price')
plt.xlabel('Days')
plt.ylabel('Price')
plt.legend()
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhu9qjqbsn3szpoeeftv3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhu9qjqbsn3szpoeeftv3.png" alt="Plot the anomalous data points in the graph" width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Time series data is a common form of data used in many applications, including business and scientific research. Due to its sequential nature, special methods and algorithms are used to detect anomalies in it. In this blog post, we demonstrated how to identify anomalies using STL decomposition to remove seasonality and trend. We also demonstrated how to use deep learning and an LSTM model to compare predicted and actual values in order to determine anomalies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detect anomalies using PyCharm
&lt;/h2&gt;

&lt;p&gt;With the Jupyter project in PyCharm Professional, you can easily organize an anomaly detection project with many data files and notebooks. Graph output can be generated to inspect anomalies, and plots are readily accessible in PyCharm. Other features, such as auto-complete suggestions, make navigating all the Scikit-learn models and Matplotlib plot settings a blast.&lt;/p&gt;

&lt;p&gt;Power up your data science projects by using PyCharm, and &lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;check out the data science features offered&lt;/a&gt; to streamline your data science workflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;Start with PyCharm Pro for free&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>anomalydetection</category>
    </item>
    <item>
      <title>Anomaly Detection in Machine Learning Using Python</title>
      <dc:creator>Evgenia Verbina</dc:creator>
      <pubDate>Thu, 16 Jan 2025 10:08:19 +0000</pubDate>
      <link>https://dev.to/pycharm/anomaly-detection-in-machine-learning-using-python-3fbb</link>
      <guid>https://dev.to/pycharm/anomaly-detection-in-machine-learning-using-python-3fbb</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36ijtvagdu6ynghp0o31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36ijtvagdu6ynghp0o31.png" alt="Anomaly Detection in Machine Learning Using Python" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In recent years, many of our applications have been driven by the high volume of data that we are able to collect and process. Some may refer to us being in the age of data. One of the essential aspects of handling such a large amount of data is &lt;strong&gt;anomaly detection&lt;/strong&gt; – processes that enable us to identify outliers, data that is outside the bounds of expectation and demonstrates behavior that is out of the norm. In scientific research, anomalous data points could be the result of technical issues and may need to be discarded when drawing conclusions, or they could lead to new discoveries.&lt;/p&gt;

&lt;p&gt;In this blog post, we’ll see why using machine learning for anomaly detection is helpful and explore key techniques for detecting anomalies using Python. You’ll learn how to implement popular methods like OneClassSVM and Isolation Forest, see examples of how to visualize these results and understand how to apply them to real-world problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where is anomaly detection used?
&lt;/h2&gt;

&lt;p&gt;Anomaly detection has become a crucial part of modern-day business intelligence, as it provides insight into what could go wrong and helps identify potential problems early. Here are some examples of anomaly detection in modern-day business.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security alerts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some cyber security attacks can be detected via anomaly detection; for example, a spike in request volume may indicate a &lt;a href="https://en.wikipedia.org/wiki/Denial-of-service_attack" rel="noopener noreferrer"&gt;DDoS attack&lt;/a&gt;, while suspicious login behavior, like multiple failed attempts, may indicate unauthorized access. Detecting suspicious user behavior can reveal potential cyber security threats, and companies can act on them accordingly to prevent or minimize the damage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fraud detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In financial organizations, for example, banks can use anomaly detection to highlight suspicious account activities, which may be an indication of illegal activities like money laundering or identity theft. Suspicious transactions can also be a sign of credit card fraud.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One common practice for web services is to collect real-time performance metrics so that abnormal behavior in the system can be spotted. For example, a spike in memory usage may show that something in the system isn’t functioning properly, and engineers may need to address it immediately to avoid a break in service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use machine learning for anomaly detection?
&lt;/h2&gt;

&lt;p&gt;Although traditional statistical methods can also help find outliers, the use of machine learning for anomaly detection has been a game changer. With machine learning algorithms, more complex data (e.g. with multiple parameters) can be analyzed all at once. Machine learning techniques also provide a means to analyze categorical data that isn’t easy to analyze using traditional statistical methods, which are more suited to numerical data.  &lt;/p&gt;

&lt;p&gt;Much of the time, these anomaly detection algorithms are programmed and can be deployed as an application (see our &lt;a href="https://blog.jetbrains.com/pycharm/2024/09/how-to-use-fastapi-for-machine-learning/" rel="noopener noreferrer"&gt;FastAPI for Machine Learning&lt;/a&gt; tutorial) and run on demand or at scheduled intervals to detect any anomalies. This means that they can prompt immediate actions within the company and can also be used as reporting tools for business intelligence teams to review and adjust strategies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of anomaly detection techniques and algorithms
&lt;/h2&gt;

&lt;p&gt;There are generally two main types of anomaly detection: outlier detection and novelty detection.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outlier detection&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;Outlier detection is sometimes referred to as &lt;strong&gt;unsupervised&lt;/strong&gt; anomaly detection, as it is assumed that in the training data, there are some undetected anomalies (thus unlabeled), and the approach is to use unsupervised machine learning algorithms to pick them out. Some of these algorithms include &lt;a href="https://scikit-learn.org/stable/modules/outlier_detection.html" rel="noopener noreferrer"&gt;one-class support vector machines (SVMs), Isolation Forest, Local Outlier Factor, and Elliptic Envelope&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Novelty detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On the other hand, novelty detection is sometimes referred to as &lt;strong&gt;semi-supervised&lt;/strong&gt; anomaly detection. Since we assume the training data contains no anomalies, all of it is labeled as normal. The goal is to detect whether or not new data is an anomaly, which is sometimes referred to as a novelty. The algorithms used in outlier detection can also be used for novelty detection, provided that there are no anomalies in the training data.&lt;/p&gt;
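
&lt;p&gt;As an illustration only, here is a minimal novelty detection sketch on synthetic data; the data, the nu value, and the test points are made-up assumptions for the example, not taken from any real dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.svm import OneClassSVM
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))  # training data assumed to be anomaly-free

clf = OneClassSVM(nu=0.05).fit(X_train)

# predict() returns 1 for normal points and -1 for novelties
X_new = np.array([[0.1, -0.2], [6.0, 6.0]])
print(clf.predict(X_new))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
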

&lt;p&gt;Beyond outlier detection and novelty detection, it is also very common to need anomaly detection in time series data. However, since the approaches and techniques used for time series data often differ from the algorithms mentioned above, we’ll discuss them in detail at a later date.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code example: finding anomalies in the Beehives dataset
&lt;/h2&gt;

&lt;p&gt;In this blog post, we’ll be using this &lt;a href="https://www.kaggle.com/datasets/vivovinco/beehives/data" rel="noopener noreferrer"&gt;Beehives dataset&lt;/a&gt; as an example to detect any anomalies in the hives. This dataset provides various measurements of the hives (including temperature and relative humidity) at various times.&lt;/p&gt;

&lt;p&gt;Here, we’ll show two very different methods for discovering anomalies: OneClassSVM, which is based on support vector machine technology and which we’ll use to draw decision boundaries, and Isolation Forest, an ensemble method similar to Random Forest.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example: OneClassSVM
&lt;/h3&gt;

&lt;p&gt;In this first example, we’ll use the data for hive 17. Assuming bees keep their hive in a constant, pleasant environment for the colony, we can look at whether this is true and whether there are times when the hive experiences anomalous temperature and relative humidity levels. We’ll use &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html" rel="noopener noreferrer"&gt;OneClassSVM&lt;/a&gt; to fit our data and look at the decision boundaries on a scatter plot.&lt;/p&gt;

&lt;p&gt;The SVM in OneClassSVM stands for &lt;a href="https://scikit-learn.org/stable/modules/svm.html#svm" rel="noopener noreferrer"&gt;support vector machine&lt;/a&gt;, a popular machine learning algorithm for classification and regression. While support vector machines are generally used to &lt;a href="https://scikit-learn.org/stable/modules/svm.html#mathematical-formulation" rel="noopener noreferrer"&gt;classify data points in high dimensions&lt;/a&gt;, here we choose a kernel and a scalar parameter to define a frontier: a decision boundary that includes most of the data points (normal data) while leaving a small number of anomalies outside the boundary, reflecting the probability (nu) of finding a new anomaly. The method of using support vector machines for anomaly detection is covered in a paper by Schölkopf et al. entitled &lt;a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-99-87.pdf" rel="noopener noreferrer"&gt;&lt;em&gt;Estimating the Support of a High-Dimensional Distribution&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;
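
&lt;p&gt;To make the role of nu concrete, here is a small sketch on synthetic data (the data and nu values are illustrative assumptions): the fraction of training points flagged as anomalies roughly tracks the nu you choose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.svm import OneClassSVM
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))

# nu roughly bounds the fraction of training points left outside the frontier
for nu in (0.01, 0.1, 0.3):
    pred = OneClassSVM(nu=nu).fit(X).predict(X)
    print(f"nu={nu}: {(pred == -1).mean():.2f} flagged as anomalies")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
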

&lt;h4&gt;
  
  
  &lt;strong&gt;1. Start a Jupyter project&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;When starting a &lt;a href="https://www.jetbrains.com/help/pycharm/creating-and-running-your-first-python-project.html" rel="noopener noreferrer"&gt;new project&lt;/a&gt; in PyCharm (Professional 2024.2.2), select &lt;em&gt;Jupyter&lt;/em&gt; under &lt;em&gt;Python&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;Start with PyCharm Pro for free&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9w5wrck36807kuyaqg1j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9w5wrck36807kuyaqg1j.png" alt="Start a Jupyter project in PyCharm" width="800" height="649"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The benefit of using a &lt;a href="https://www.jetbrains.com/help/pycharm/scientific-mode.html" rel="noopener noreferrer"&gt;Jupyter project&lt;/a&gt; (previously also known as a Scientific project) in PyCharm is that a file structure is generated for you, including a folder for storing your data and a folder to store all the &lt;a href="https://www.jetbrains.com/help/pycharm/jupyter-notebook-support.html" rel="noopener noreferrer"&gt;Jupyter notebooks&lt;/a&gt; so you can keep all your experiments in one place. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7nzggxh19m8n747v2bk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7nzggxh19m8n747v2bk.png" alt="Jupyter projects in PyCharm" width="800" height="711"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another huge benefit is that we can render graphs very easily with &lt;a href="https://matplotlib.org/index.html" rel="noopener noreferrer"&gt;Matplotlib&lt;/a&gt;. You will see that in the steps below.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;2. Install dependencies&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Download this &lt;a href="https://github.com/Cheukting/anomaly-detection/blob/main/requirements.txt" rel="noopener noreferrer"&gt;requirements.txt&lt;/a&gt; from the relevant GitHub repo. Once you place it in the project directory and open it in PyCharm, you will see a prompt asking you to install the missing libraries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnz0taekx02uiniphbpzg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnz0taekx02uiniphbpzg.png" alt="Install dependencies in PyCharm" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on &lt;em&gt;Install requirements&lt;/em&gt;, and all of the requirements will be installed for you. In this project, we’re using Python 3.11.1.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3. Import and inspect the data&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;You can either download the &lt;a href="https://www.kaggle.com/datasets/vivovinco/beehives/data" rel="noopener noreferrer"&gt;“Beehives” dataset from Kaggle&lt;/a&gt; or from this &lt;a href="https://github.com/Cheukting/anomaly-detection/tree/main/data" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;. Put all three CSVs in the &lt;em&gt;Data&lt;/em&gt; folder. Then, in main.py, enter the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

df = pd.read_csv('data/Hive17.csv', sep=";")
df = df.dropna()
print(df.head())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, press the &lt;em&gt;Run&lt;/em&gt; button in the top right-hand corner of the screen, and our code will be run in the Python console, giving us an idea of what our data looks like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vi23v0pzwkwlywttize.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vi23v0pzwkwlywttize.gif" alt="Import data in PyCharm" width="800" height="610"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;4. Fit the data points and inspect them in a graph&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Since we’ll be using the OneClassSVM from scikit-learn, we’ll import it together with DecisionBoundaryDisplay and Matplotlib using the code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.svm import OneClassSVM
from sklearn.inspection import DecisionBoundaryDisplay

import matplotlib.pyplot as plt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;According to the data’s description, we know that column T17 represents the temperature of the hive, and RH17 represents the relative humidity of the hive. We’ll extract the values of these two columns as our input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X = df[["T17", "RH17"]].values
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we’ll create and fit the model. Note that we’ll try the default setting first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;estimator = OneClassSVM().fit(X)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we’ll show the decision boundary together with the data points:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;disp = DecisionBoundaryDisplay.from_estimator(
    estimator,
    X,
    response_method="decision_function",
    plot_method="contour",
    xlabel="Temperature", ylabel="Humidity",
    levels=[0],
)
disp.ax_.scatter(X[:, 0], X[:, 1])
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, save and press &lt;em&gt;Run&lt;/em&gt; again, and you’ll see that the plot is shown in a separate window for inspection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flb8hx8y5v16tjmwl8art.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flb8hx8y5v16tjmwl8art.png" alt="Fit the data points and inspect them in a graph in PyCharm" width="800" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;5. Fine-tune hyperparameters&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;As the plot above shows, the decision boundary does not fit the data points very well. The data points form a couple of irregular shapes instead of an oval. To fine-tune our model, we have to provide specific values of &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDOneClassSVM.html" rel="noopener noreferrer"&gt;“nu” and “gamma” to the OneClassSVM model&lt;/a&gt;. You can try it out yourself, but after a couple of tests, it seems “nu=0.1, gamma=0.05” gives the best result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forjbozrhcx8fgjvkphqv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forjbozrhcx8fgjvkphqv.png" alt="Fine-tune hyperparameters in PyCharm" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Example: Isolation Forest
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html" rel="noopener noreferrer"&gt;Isolation Forest&lt;/a&gt; is an &lt;a href="https://scikit-learn.org/stable/api/sklearn.ensemble.html" rel="noopener noreferrer"&gt;ensemble-based method&lt;/a&gt;, similar to the more popular&lt;a href="https://scikit-learn.org/stable/modules/ensemble.html#forest" rel="noopener noreferrer"&gt;Random Forest&lt;/a&gt;classification method. By randomly selecting parting features and values, it will create many decision trees, and the path length from the root of the tree to the node making that decision will then be averaged over all the trees (hence “forest”). A short average path length indicates anomalies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9c74eg08vf97yolp0zn8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9c74eg08vf97yolp0zn8.png" alt="Isolation Forest" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A short decision path usually indicates data that is very different from the others.&lt;/em&gt;&lt;/p&gt;
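
&lt;p&gt;To see this path-length scoring in action, here is a minimal sketch on synthetic data (the planted anomaly and the values are assumptions for illustration). Scikit-learn’s score_samples returns lower values for shorter average paths, i.e. more anomalous points:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.ensemble import IsolationForest
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])  # one planted anomaly

iso = IsolationForest(n_estimators=100, random_state=0).fit(X)
scores = iso.score_samples(X)  # lower score = shorter average path = more anomalous
print("Most anomalous point:", X[scores.argmin()])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
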

&lt;p&gt;Now, let’s compare the result of OneClassSVM with IsolationForest. To do that, we’ll make two plots of the decision boundaries made by the two algorithms. In the following steps, we’ll build on the script above using the same &lt;a href="https://www.kaggle.com/datasets/vivovinco/beehives/data" rel="noopener noreferrer"&gt;hive 17 data&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;1. Import IsolationForest&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;IsolationForest can be imported from the ensemble module in Scikit-learn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.ensemble import IsolationForest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;2. Refactor and add a new estimator&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Since we’ll now have two different estimators, let’s put them in a list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;estimators = [
    OneClassSVM(nu=0.1, gamma=0.05).fit(X),
    IsolationForest(n_estimators=100).fit(X)
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, we’ll use a &lt;code&gt;for&lt;/code&gt; loop to iterate over all the estimators.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for estimator in estimators:
    disp = DecisionBoundaryDisplay.from_estimator(
        estimator,
        X,
        response_method="decision_function",
        plot_method="contour",
        xlabel="Temperature", ylabel="Humidity",
        levels=[0],
    )
    disp.ax_.scatter(X[:, 0], X[:, 1])
    plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a final touch, we’ll also add a title to each of the graphs for easier inspection. To do that, we’ll add the following after disp.ax_.scatter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;disp.ax_.set_title(
        f"Decision boundary using {estimator. __class__. __name__ }"
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You may find that refactoring using PyCharm is very easy with the auto-complete suggestions it provides.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99k8jl3t7rnhk935va2q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99k8jl3t7rnhk935va2q.png" alt="Refactoring using auto-completion in PyCharm" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4s6o7cm3e9kpm1quhj2k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4s6o7cm3e9kpm1quhj2k.png" alt="Auto-completion in PyCharm" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3. Run the code&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Like before, running the code is as easy as pressing the &lt;em&gt;Run&lt;/em&gt; button in the top-right corner. After running the code this time, we should get two graphs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnj80w7rc3i2nr2huwwc4.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnj80w7rc3i2nr2huwwc4.gif" alt="Run the code in PyCharm" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can easily flip through the two graphs with the preview on the right. As you can see, the decision boundaries produced by the two algorithms are quite different. When doing anomaly detection, it’s worth experimenting with various algorithms and parameters to find the one that suits the use case best.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next step: anomaly detection in time series data
&lt;/h2&gt;

&lt;p&gt;If your data is like our beehive data – a time series – then there are additional methods for singling out anomalies. Since time series have trends and seasonal periods, anything that falls outside these patterns can be considered an anomaly. Popular methods for detecting anomalies in time series include STL decomposition and LSTM prediction.&lt;/p&gt;

&lt;p&gt;Learn how to use these methods to detect anomalies in time series &lt;a href="https://blog.jetbrains.com/pycharm/2025/01/anomaly-detection-in-time-series/" rel="noopener noreferrer"&gt;in this blog post&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Anomaly detection has proven to be an important aspect of business intelligence, and being able to identify anomalies and prompt immediate action is essential in some sectors of business. Using the proper machine learning model to automatically detect anomalies can help analyze complicated, high volumes of data in a short period of time. In this blog post, we demonstrated how to identify anomalies using models like OneClassSVM and Isolation Forest.&lt;/p&gt;

&lt;p&gt;To learn more about using PyCharm for machine learning, please check out “&lt;a href="https://blog.jetbrains.com/pycharm/2022/06/start-studying-machine-learning-with-pycharm/" rel="noopener noreferrer"&gt;Start Studying Machine Learning With PyCharm&lt;/a&gt;” and “&lt;a href="https://blog.jetbrains.com/pycharm/2024/09/how-to-use-jupyter-notebooks-in-pycharm/" rel="noopener noreferrer"&gt;How to Use Jupyter Notebooks in PyCharm&lt;/a&gt;”.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detect anomalies using PyCharm
&lt;/h2&gt;

&lt;p&gt;With the Jupyter project in PyCharm Professional, you can easily organize an anomaly detection project with many data files and notebooks. Graph output can be generated to inspect anomalies, and plots are readily accessible in PyCharm. Other features, such as auto-complete suggestions, make navigating all the Scikit-learn models and Matplotlib plot settings a blast.&lt;/p&gt;

&lt;p&gt;Power up your data science project by using PyCharm; check out the &lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;data science features&lt;/a&gt; offered to streamline your data science workflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;Start with PyCharm Pro for free&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>anomalydetection</category>
    </item>
    <item>
      <title>Data Cleaning in Data Science</title>
      <dc:creator>Evgenia Verbina</dc:creator>
      <pubDate>Wed, 08 Jan 2025 15:02:13 +0000</pubDate>
      <link>https://dev.to/pycharm/data-cleaning-in-data-science-1ch2</link>
      <guid>https://dev.to/pycharm/data-cleaning-in-data-science-1ch2</guid>
      <description>&lt;p&gt;In this Data Science blog post series, we’ve talked about &lt;a href="https://blog.jetbrains.com/pycharm/2024/10/how-to-get-data/" rel="noopener noreferrer"&gt;where to get data from&lt;/a&gt; and how to &lt;a href="https://blog.jetbrains.com/pycharm/2024/10/data-exploration-with-pandas/" rel="noopener noreferrer"&gt;explore that data using pandas&lt;/a&gt;, but whilst that data is excellent for learning, it’s not similar to what we will term &lt;em&gt;real-world&lt;/em&gt; data. Data for learning has often already been cleaned and curated to allow you to learn quickly without needing to venture into the world of data cleaning, but real-world data has problems and is messy. Real-world data needs cleaning before it can give us useful insights, and that’s the subject of this blog post. &lt;/p&gt;

&lt;p&gt;Data problems can come from the behaviour of the data itself, the way the data was gathered, or even the way the data was input. Mistakes and oversights can happen at every stage of the journey. &lt;/p&gt;

&lt;p&gt;We are specifically talking about data cleaning here rather than data transformation. Data cleaning ensures that conclusions you make from your data can be generalised to the population you define. In contrast, data transformation involves tasks such as converting data formats, normalising data and aggregating data. &lt;/p&gt;

&lt;h2&gt;
  
  
  Why Is Data Cleaning Important?
&lt;/h2&gt;

&lt;p&gt;The first thing we need to understand about datasets is what they represent. Most datasets are a sample representing a wider population, and in working with this sample, you will be able to extrapolate (or &lt;em&gt;generalise&lt;/em&gt;) your findings to this population. For example, we used a &lt;a href="https://www.kaggle.com/datasets/prevek18/ames-housing-dataset" rel="noopener noreferrer"&gt;dataset&lt;/a&gt; in the previous two blog posts. This dataset is broadly about house sales, but it only covers a small geographical area, a small period of time and potentially not all houses in that area and period; it is a sample of a larger population. &lt;/p&gt;

&lt;p&gt;Your data needs to be a representative sample of the wider population, for example, all house sales in that area over a defined period. To ensure that our data is a representative sample of the wider population, we must first define our population’s boundaries. &lt;/p&gt;

&lt;p&gt;As you might imagine, it’s often impractical to work with the entire population, except perhaps census data, so you need to decide where your boundaries are. These boundaries might be geographical, demographical, time-based, action-based (such as transactional) or industry-specific. There are numerous ways to define your population, but to generalise your data reliably, this is something you must do before you clean your data.&lt;/p&gt;

&lt;p&gt;In summary, if you’re planning to use your data for any kind of analysis or &lt;a href="https://blog.jetbrains.com/pycharm/2022/06/start-studying-machine-learning-with-pycharm/" rel="noopener noreferrer"&gt;machine learning&lt;/a&gt;, you need to spend time cleaning the data to ensure that you can rely on your insights and generalise them to the &lt;em&gt;real world&lt;/em&gt;. Cleaning your data results in more accurate analysis and, when it comes to machine learning, performance improvements, too.&lt;/p&gt;

&lt;p&gt;Without cleaning your data, you risk issues such as not being able to generalise your learnings to the wider population reliably, inaccurate summary statistics and incorrect visualisations. If you are using your data to train machine learning models, this can also lead to errors and inaccurate predictions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://jb.gg/m8p92h" rel="noopener noreferrer"&gt;Try PyCharm Professional for free&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Examples of Data Cleaning
&lt;/h2&gt;

&lt;p&gt;We’re going to take a look at five tasks you can use to clean your data. This is not an exhaustive list, but it’s a good place to start when you get your hands on some real-world data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deduplicating data
&lt;/h3&gt;

&lt;p&gt;Duplicates are a problem because they can distort your data. Imagine you are plotting a histogram where you’re using the frequency of sale prices. If you have duplicates of the same value, you will end up with a histogram that has an inaccurate pattern based on the prices that are duplicated. &lt;/p&gt;

&lt;p&gt;As a side note, when we talk about duplication being a problem in datasets, we are talking about duplication of whole rows, each of which is a single observation. There will be duplicate values in the columns, and we expect this. We’re just talking about duplicate observations. &lt;/p&gt;

&lt;p&gt;Fortunately for us, there is a &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html" rel="noopener noreferrer"&gt;pandas method&lt;/a&gt; we can use to help us detect if there are any duplicates in our data. We can use &lt;a href="https://www.jetbrains.com/ai/" rel="noopener noreferrer"&gt;JetBrains AI&lt;/a&gt; chat if we need a reminder with a prompt such as:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Code to identify duplicate rows&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here’s the resulting code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;duplicate_rows = df[df.duplicated()]
duplicate_rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code assumes that your DataFrame is called &lt;em&gt;df&lt;/em&gt;, so make sure to change it to the name of your DataFrame if it is not.&lt;/p&gt;

&lt;p&gt;There isn’t any duplicated data in the &lt;a href="https://www.kaggle.com/datasets/prevek18/ames-housing-dataset" rel="noopener noreferrer"&gt;Ames Housing dataset&lt;/a&gt; that we’ve been using, but if you’re keen to try it out, take a look at the &lt;a href="https://www.kaggle.com/datasets/cites/cites-wildlife-trade-database" rel="noopener noreferrer"&gt;CITES Wildlife Trade Database&lt;/a&gt; dataset and see if you can find the duplicates using the pandas method above.&lt;/p&gt;

&lt;p&gt;Once you have identified duplicates in your dataset, you must remove them to avoid distorting your results. You can get the code for this with JetBrains AI again with a prompt such as:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Code to drop duplicates from my dataframe&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The resulting code drops the duplicates, resets the index of your DataFrame and then displays it as a new DataFrame called df_cleaned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_cleaned = df.drop_duplicates()
df_cleaned.reset_index(drop=True, inplace=True)
df_cleaned
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are other pandas functions that you can use for more &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html" rel="noopener noreferrer"&gt;advanced duplicate management&lt;/a&gt;, but this is enough to get you started with deduplicating your dataset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dealing with implausible values
&lt;/h3&gt;

&lt;p&gt;Implausible values can occur when data is entered incorrectly or something has gone wrong in the data-gathering process. For our &lt;a href="https://www.kaggle.com/datasets/prevek18/ames-housing-dataset" rel="noopener noreferrer"&gt;Ames Housing dataset&lt;/a&gt;, an implausible value might be a negative SalePrice, or a numerical value for Roof Style.&lt;/p&gt;

&lt;p&gt;Spotting implausible values in your dataset relies on a broad approach that includes looking at your &lt;a href="https://blog.jetbrains.com/pycharm/2024/10/data-exploration-with-pandas/#summary-statistics" rel="noopener noreferrer"&gt;summary statistics&lt;/a&gt;, checking the data validation rules defined by the collector for each column, noting any data points that fall outside of this validation, and using visualisations to spot patterns and anything that looks like it might be an anomaly.&lt;/p&gt;

&lt;p&gt;You will want to deal with implausible values as they can add noise and cause problems with your analysis. However, how you deal with them is somewhat open to interpretation. If you don’t have many implausible values relative to the size of your dataset, you may want to remove the records containing them. For example, if you’ve identified an implausible value in row 214 of your dataset, you can use the &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html" rel="noopener noreferrer"&gt;pandas drop function&lt;/a&gt; to remove that row from your dataset. &lt;/p&gt;

&lt;p&gt;Once again, we can get JetBrains AI to generate the code we need with a prompt like: &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Code that drops index 214 from&lt;/em&gt; &lt;em&gt;#df_cleaned&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Note that in &lt;a href="https://www.jetbrains.com/help/pycharm/jupyter-notebook-support.html" rel="noopener noreferrer"&gt;PyCharm’s Jupyter notebooks&lt;/a&gt;, I can prefix words with the # sign to indicate to JetBrains AI Assistant that I am providing additional context – in this case, that my DataFrame is called df_cleaned.&lt;/p&gt;

&lt;p&gt;The resulting code will remove that observation from your DataFrame, reset the index and display it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_cleaned = df_cleaned.drop(index=214)
df_cleaned.reset_index(drop=True, inplace=True)
df_cleaned
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Another popular strategy for dealing with implausible values is to impute them, meaning you replace the value with a different, plausible value based on a defined strategy. One of the most common strategies is to use the median value instead of the implausible value. Since the median is not affected by outliers, it is often chosen by data scientists for this purpose, but equally, the mean or the mode value of your data might be more appropriate in some situations. &lt;/p&gt;
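
&lt;p&gt;As a minimal sketch of median imputation (assuming the df_cleaned DataFrame from earlier; the negative-SalePrice condition is just an illustrative rule for flagging implausible values):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Replace implausible values (here: negative prices) with the column median
median_price = df_cleaned['SalePrice'].median()
df_cleaned.loc[df_cleaned['SalePrice'] &amp;lt; 0, 'SalePrice'] = median_price
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
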

&lt;p&gt;Alternatively, if you have domain knowledge about the dataset and how the data was gathered, you can replace the implausible value with one that is more meaningful. If you’re involved in or know of the data-gathering process, this option might be for you. &lt;/p&gt;

&lt;p&gt;How you choose to handle implausible values depends on their prevalence in your dataset, how the data was gathered and how you intend to define your population as well as other factors such as your domain knowledge. &lt;/p&gt;

&lt;h3&gt;
  
  
  Formatting data
&lt;/h3&gt;

&lt;p&gt;You can often spot formatting problems with your &lt;a href="https://blog.jetbrains.com/pycharm/2024/10/data-exploration-with-pandas/#summary-statistics" rel="noopener noreferrer"&gt;summary statistics&lt;/a&gt; or early &lt;a href="https://blog.jetbrains.com/pycharm/2024/10/data-exploration-with-pandas/#graphs" rel="noopener noreferrer"&gt;visualisations&lt;/a&gt; you perform to get an idea of the shape of your data. Some examples of inconsistent formatting are numerical values not all being defined to the same decimal place or variations in terms of spelling, such as “first” and “1st”. Incorrect data formatting can also have implications for the memory footprint of your data.&lt;/p&gt;

&lt;p&gt;Once you spot formatting issues in your dataset, you need to standardise the values. Depending on the issue you are facing, this normally involves defining your own standard and applying the change. Again, the pandas library has some useful functions here such as &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.round.html" rel="noopener noreferrer"&gt;round&lt;/a&gt;. If you wanted to round the SalePrice column to 2 decimal places, we could ask JetBrains AI for the code:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Code to round&lt;/em&gt; &lt;em&gt;#SalePrice&lt;/em&gt; &lt;em&gt;to two decimal places&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The resulting code will perform the rounding and then print out the first 10 rows so you can check it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_cleaned['SalePrice'] = df_cleaned['SalePrice].round(2)
df_cleaned.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As another example, you might have inconsistent spelling – for example, a HouseStyle column that has both “1Story” and “OneStory”, and you’re confident that they mean the same thing. You can use the following prompt to get code for that:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Code to change all instances of&lt;/em&gt; &lt;em&gt;#OneStory&lt;/em&gt; &lt;em&gt;to&lt;/em&gt; &lt;em&gt;#1Story&lt;/em&gt; &lt;em&gt;in&lt;/em&gt; &lt;em&gt;#HouseStyle&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The resulting code does exactly that, replacing all instances of OneStory with 1Story:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_cleaned[HouseStyle'] = df_cleaned['HouseStyle'].replace('OneStory', '1Story')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Addressing outliers
&lt;/h3&gt;

&lt;p&gt;Outliers are very common in datasets, but how you address them, if at all, is very context-dependent. One of the easiest ways to spot outliers is with a box plot, which we can create with the &lt;a href="https://seaborn.pydata.org/generated/seaborn.boxplot.html" rel="noopener noreferrer"&gt;seaborn&lt;/a&gt; and &lt;a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.figure.html" rel="noopener noreferrer"&gt;matplotlib&lt;/a&gt; libraries. I discussed box plots in my previous blog post on &lt;a href="https://blog.jetbrains.com/pycharm/2024/10/data-exploration-with-pandas/" rel="noopener noreferrer"&gt;exploring data with pandas&lt;/a&gt; if you need a quick refresher.&lt;/p&gt;

&lt;p&gt;We’ll look at SalePrice in our &lt;a href="https://www.kaggle.com/datasets/prevek18/ames-housing-dataset" rel="noopener noreferrer"&gt;Ames housing dataset&lt;/a&gt; for this box plot. Again, I’ll use JetBrains AI to generate the code for me with a prompt such as:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Code to create a box plot of&lt;/em&gt; &lt;em&gt;#SalePrice&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here’s the resulting code that we need to run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import seaborn as sns
import matplotlib.pyplot as plt

# Create a box plot for SalePrice
plt.figure(figsize=(10, 6))
sns.boxplot(x=df_cleaned['SalePrice'])
plt.title('Box Plot of SalePrice')
plt.xlabel('SalePrice')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The box plot tells us that we have a positive skew because the vertical median line inside the blue box is to the left of the centre. A positive skew tells us that we have more house prices at the cheaper end of the scale, which is not surprising. The box plot also tells us visually that we have lots of outliers on the right-hand side. That is a small number of houses that are much more expensive than the median price.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4nn0to1m0llpqeltnv34.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4nn0to1m0llpqeltnv34.png" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might accept these outliers as it’s fairly typical to expect a small number of houses with a larger price point than the majority. However, this is all dependent on the population you want to be able to generalise to and the conclusions you want to draw from your data. Putting clear boundaries around what is and what is not part of your population will allow you to make an informed decision about whether outliers in your data are going to be a problem. &lt;/p&gt;

&lt;p&gt;For example, if your population consists of people who will not be buying expensive mansions, then perhaps you can delete these outliers. If, on the other hand, your population demographics include those who might reasonably be expected to buy these expensive houses, you might want to keep them as they’re relevant to your population.&lt;/p&gt;

&lt;p&gt;I’ve talked about box plots here as a way to spot outliers, but other options, such as scatter plots and histograms, can quickly show you whether you have outliers in your data, so you can make an informed decision about whether you need to do anything about them.&lt;/p&gt;

&lt;p&gt;Addressing outliers usually falls into two categories – deleting them or using summary statistics less prone to outliers. In the first instance, we need to know exactly which rows they are. &lt;/p&gt;

&lt;p&gt;Until now, we’ve just been discussing how to identify them visually. There are different ways to determine which observations are and aren’t outliers. One common approach is to use a method called the &lt;em&gt;modified Z-score&lt;/em&gt;. Before we look at how and why it’s modified, the Z-score is defined as:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Z-score =&lt;/em&gt; (&lt;em&gt;data point value&lt;/em&gt; – &lt;em&gt;mean&lt;/em&gt;) / &lt;em&gt;standard deviation&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The reason we modify the Z-score for detecting outliers is that both the mean and the standard deviation are prone to outlier influence by virtue of how they are calculated. The modified Z-score is defined as:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Modified Z-score =&lt;/em&gt; (&lt;em&gt;data point value&lt;/em&gt; – &lt;em&gt;median&lt;/em&gt;) / &lt;em&gt;median absolute&lt;/em&gt; &lt;em&gt;deviation&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As we learned when we talked about &lt;a href="https://blog.jetbrains.com/pycharm/2024/10/data-exploration-with-pandas/#summary-statistics" rel="noopener noreferrer"&gt;summary statistics&lt;/a&gt;, the median is not affected by outliers. The &lt;em&gt;median absolute deviation&lt;/em&gt; is the &lt;em&gt;median&lt;/em&gt; value of the dataset’s absolute deviations from the &lt;em&gt;median&lt;/em&gt;. For example, if your data set contains these values:&lt;/p&gt;

&lt;p&gt;1, 2, 2, 2, &lt;strong&gt;3&lt;/strong&gt;, 3, 3, 5, 9&lt;/p&gt;

&lt;p&gt;Then your &lt;em&gt;median&lt;/em&gt; is 3 as the value in the middle of the dataset. The deviations from your median value of 3 are as follows in the order presented above:&lt;/p&gt;

&lt;p&gt;2, 1, 1, 1, 0, 0, 0, 2, 6&lt;/p&gt;

&lt;p&gt;When we sort these into ascending order we get:&lt;/p&gt;

&lt;p&gt;0, 0, 0, 1, &lt;strong&gt;1&lt;/strong&gt;, 1, 2, 2, 6&lt;/p&gt;

&lt;p&gt;Meaning the &lt;em&gt;median absolute deviation&lt;/em&gt; in this example would be 1.&lt;/p&gt;
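
&lt;p&gt;You can verify this small worked example with plain NumPy. Note that this is the raw median absolute deviation; the scipy helper used below returns a different value because scale='normal' applies a normal-consistency scaling factor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

values = np.array([1, 2, 2, 2, 3, 3, 3, 5, 9])
median = np.median(values)                  # 3.0
mad = np.median(np.abs(values - median))    # 1.0
print(median, mad)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
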

&lt;p&gt;How do we calculate the Modified Z-score for our data set for SalePrice? Let’s give JetBrains AI this prompt:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Given:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Modified Z-score = (data point value – median) / median absolute deviation&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Calculate the Modified Z-score for&lt;/em&gt; &lt;em&gt;#SalePrice&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That results in this code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
from scipy.stats import median_abs_deviation

# Assuming 'data' is your DataFrame and it contains the 'SalePrice' column

# Calculate the median for the SalePrice column
median_sale_price = data['SalePrice'].median()

# Calculate the Median Absolute Deviation (MAD) for the SalePrice column
mad_sale_price = median_abs_deviation(data['SalePrice'], scale='normal')

# Calculate the Modified Z-scores for the SalePrice column
data['Modified_Z_Score'] = (data['SalePrice'] - median_sale_price) / mad_sale_price

# Display the first few rows with the Modified Z-scores
print(data[['SalePrice', 'Modified_Z_Score']].head())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we install the required packages and run this code, we’re partway there, but we still need to decide what counts as an outlier for our SalePrice based on the modified Z-score. The general wisdom is that an outlier is anything &amp;gt;= 3 or &amp;lt;= -3; however, it’s worth noting that, like most statistical decisions, this threshold can and should be tailored to your dataset.&lt;/p&gt;

&lt;p&gt;Let’s pass another prompt over to JetBrains AI so we can further tailor our output:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Just list those that have a&lt;/em&gt; &lt;em&gt;#Modified_Z_Score&lt;/em&gt; &lt;em&gt;of 3 or above or -3 or below&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I’m going to take this snippet of code and use it to replace the relevant rows above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Filter the rows where the Modified Z-score is 3 or above, or -3 or below
outliers = data[(data['Modified_Z_Score'] &amp;gt;= 3) | (data['Modified_Z_Score'] &amp;lt;= -3)]

# Print all the filtered rows, showing their index and SalePrice
outliers = (outliers[['SalePrice', 'Modified_Z_Score']])
outliers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I have modified this code to save the outliers in a new DataFrame called outliers and print them out so I can view them. &lt;/p&gt;

&lt;p&gt;Our next step would be to remove these outliers from our DataFrame. Again we can use JetBrains AI to generate the code with a prompt like:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Create a new dataframe without the outliers&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data_without_outliers = data.drop(index=outliers.index)

# Display the new DataFrame without outliers
print(data_without_outliers)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our new DataFrame, data_without_outliers, excludes those values where the SalePrice variable is considered an outlier. &lt;/p&gt;

&lt;p&gt;We can update our box plot code to look at the new DataFrame. It still shows our positive skew as we’d expect, but the values considered as outliers have been removed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import seaborn as sns
import matplotlib.pyplot as plt

# Create a box plot for SalePrice
plt.figure(figsize=(10, 6))
sns.boxplot(x=data_without_outliers['SalePrice'])
plt.title('Box Plot of SalePrice')
plt.xlabel('SalePrice')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjspaojjyw1pj04johphx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjspaojjyw1pj04johphx.png" width="800" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before we finish up here, let’s look at what percentage of observations we removed from our DataFrame because we considered them to be outliers based on SalePrice.&lt;/p&gt;

&lt;p&gt;We can use a prompt such as:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Calculate the percentage of observations removed between #data and #data_without_outliers&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Calculate the number of observations in the original and filtered DataFrames
original_count = len(data)
filtered_count = len(data_without_outliers)

# Calculate the number of removed observations
removed_count = original_count - filtered_count

# Calculate the percentage of observations removed
percentage_removed = (removed_count / original_count) * 100

# Display the percentage
print(f"Percentage of observations removed: {percentage_removed:.2f}%")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PyCharm tells us that 5.67% of observations have been removed.&lt;/p&gt;

&lt;p&gt;As I mentioned earlier, if you are keeping your outliers, consider using summary values less prone to being affected by outliers such as the &lt;em&gt;median&lt;/em&gt; and &lt;em&gt;interquartile range&lt;/em&gt;. You might consider using these measurements to form your conclusions when you’re working with datasets that you know contain outliers that you’ve not removed because they are relevant to the population you’ve defined and the conclusions you want to draw.&lt;/p&gt;
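
&lt;p&gt;As a quick illustration, here’s a minimal sketch (assuming the same &lt;code&gt;data&lt;/code&gt; DataFrame we’ve been working with) of how you could compute these robust summary values with pandas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The median is far less sensitive to extreme values than the mean
median_price = data['SalePrice'].median()

# The interquartile range captures the spread of the middle 50% of observations
q1 = data['SalePrice'].quantile(0.25)
q3 = data['SalePrice'].quantile(0.75)
iqr = q3 - q1

print(f"Median: {median_price}, IQR: {iqr}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
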

&lt;h3&gt;
  
  
  Missing values
&lt;/h3&gt;

&lt;p&gt;The fastest way to spot missing values in your dataset is with your summary statistics. As a reminder, in your DataFrame, click &lt;em&gt;Show Column Statistics&lt;/em&gt; on the right-hand side and then select &lt;em&gt;Compact&lt;/em&gt;. Missing values in the columns are shown in red, as you can see here for Lot Frontage in our &lt;a href="https://www.kaggle.com/datasets/prevek18/ames-housing-dataset" rel="noopener noreferrer"&gt;Ames housing dataset&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXdeSNdJvl9sk5Z8QXEJCr5rhDMI5GTGmaRdqvkIufNS8QZNQi-1QwDF1LQgTS_e9vm0B-pSKa5o2aZnNZEmPiAzvoaOjvRxmOICDRzuM_0iWumPGH_UWyR07Q8xTrzIUnYvL7-j%3Fkey%3DncaAk2neSPZb4YRTlVBIqdzw" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXdeSNdJvl9sk5Z8QXEJCr5rhDMI5GTGmaRdqvkIufNS8QZNQi-1QwDF1LQgTS_e9vm0B-pSKa5o2aZnNZEmPiAzvoaOjvRxmOICDRzuM_0iWumPGH_UWyR07Q8xTrzIUnYvL7-j%3Fkey%3DncaAk2neSPZb4YRTlVBIqdzw" width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;
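
&lt;p&gt;If you prefer a programmatic check to complement the Column Statistics view, a short pandas snippet (a sketch, assuming the same &lt;code&gt;data&lt;/code&gt; DataFrame) lists the missing counts per column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Count missing values per column, largest first
missing_counts = data.isnull().sum().sort_values(ascending=False)

# Show only the columns that actually contain missing values
print(missing_counts[missing_counts &gt; 0])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
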

&lt;p&gt;There are three kinds of missingness that we have to consider for our data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing completely at random&lt;/li&gt;
&lt;li&gt;Missing at random&lt;/li&gt;
&lt;li&gt;Missing not at random&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Missing completely at random
&lt;/h3&gt;

&lt;p&gt;Missing completely at random means the data has gone missing entirely by chance and the fact that it is missing has no relationship to other variables in the dataset. This can happen when someone forgets to answer a survey question, for example. &lt;/p&gt;

&lt;p&gt;Data that is missing completely at random is rare, but it’s also among the easiest to deal with. If you have a relatively small number of observations missing completely at random, the most common approach is to delete those observations because doing so shouldn’t affect the integrity of your dataset and, thus, the conclusions you hope to draw. &lt;/p&gt;
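
&lt;p&gt;As a minimal sketch (assuming you’ve verified the missingness really is completely at random, and using Lot Frontage purely as an illustration), deleting those observations in pandas is a one-liner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Drop rows with missing values in the affected column only
# (the subset is illustrative; use the columns you judged to be MCAR)
data_complete = data.dropna(subset=['Lot Frontage'])

print(f"Rows before: {len(data)}, rows after: {len(data_complete)}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
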

&lt;h3&gt;
  
  
  Missing at random
&lt;/h3&gt;

&lt;p&gt;Missing at random has a pattern to it, but we’re able to explain that pattern through other variables we’ve measured. For example, someone didn’t answer a survey question because of how the data was collected.&lt;/p&gt;

&lt;p&gt;Consider in our &lt;a href="https://www.kaggle.com/datasets/prevek18/ames-housing-dataset" rel="noopener noreferrer"&gt;Ames housing dataset&lt;/a&gt; again, perhaps the Lot Frontage variable is missing more frequently for houses that are sold by certain real estate agencies. In that case, this missingness could be due to inconsistent data entry practices by some agencies. If true, the fact that the Lot Frontage data is missing is related to how the agency that sold the property gathered the data, which is an observed characteristic, not the Lot Frontage itself. &lt;/p&gt;

&lt;p&gt;When you have data missing at random, you will want to understand why that data is missing, which often involves digging into how the data was gathered. Once you understand why the data is missing, you can choose what to do. One of the more common approaches to deal with missing at random is to impute the values. We’ve already touched on this for implausible values, but it’s a valid strategy for missingness too. There are various options you could choose from based on your defined population and the conclusions you want to draw, including using correlated variables such as house size, year built, and sale price in this example. If you understand the pattern behind the missing data, you can often use contextual information to impute the values, which ensures that relationships between data in your dataset are preserved.  &lt;/p&gt;
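
&lt;p&gt;Here’s one possible sketch of contextual imputation. Grouping by Neighborhood is an assumption made for illustration, on the reasoning that lot dimensions tend to be similar within a neighborhood; you’d pick the grouping that fits your own explanation of the missingness:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Fill missing Lot Frontage values with the median of comparable houses
data['Lot Frontage'] = data.groupby('Neighborhood')['Lot Frontage'].transform(
    lambda s: s.fillna(s.median())
)

# Any groups that were entirely missing fall back to the overall median
data['Lot Frontage'] = data['Lot Frontage'].fillna(data['Lot Frontage'].median())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
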

&lt;h3&gt;
  
  
  Missing not at random
&lt;/h3&gt;

&lt;p&gt;Finally, missing not at random is when the likelihood of data being missing is related to unobserved data; in other words, the missingness depends on information we haven’t measured, often the missing value itself. &lt;/p&gt;

&lt;p&gt;One last time, let’s return to our &lt;a href="https://www.kaggle.com/datasets/prevek18/ames-housing-dataset" rel="noopener noreferrer"&gt;Ames housing dataset&lt;/a&gt; and the fact that we have missing data in Lot Frontage. One scenario for data missing not at random is when sellers deliberately choose not to report Lot Frontage if they consider it &lt;em&gt;small&lt;/em&gt; and thus reporting it might reduce the sale price of their house. If the likelihood of Lot Frontage data being missing depends on the size of the frontage itself (which is unobserved), smaller lot frontages are less likely to be reported, meaning the missingness is directly related to the missing value.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visualising missingness
&lt;/h3&gt;

&lt;p&gt;Whenever data is missing, you need to establish whether there’s a pattern. If you have a pattern, then you have a problem that you’ll likely have to address before you can generalize your data. &lt;/p&gt;

&lt;p&gt;One of the easiest ways to look for patterns is with heat map visualisations. Before we get into the code, let’s exclude variables with no missingness. We can prompt JetBrains AI for this code:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Code to create a new dataframe that contains only columns with missingness&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here’s our code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Identify columns with any missing values
columns_with_missing = data.columns[data.isnull().any()]

# Create a new DataFrame with only columns that have missing values
data_with_missingness = data[columns_with_missing]

# Display the new DataFrame
print(data_with_missingness)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before you run this code, change the final line so we can benefit from PyCharm’s nice DataFrame layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data_with_missingness
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now it’s time to create a heatmap; again, we can give JetBrains AI a prompt such as:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Create a heatmap of&lt;/em&gt; &lt;em&gt;#data_with_missingness&lt;/em&gt; &lt;em&gt;that is transposed&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here’s the resulting code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import seaborn as sns
import matplotlib.pyplot as plt

# Transpose the data_with_missingness DataFrame
transposed_data = data_with_missingness.T

# Create a heatmap to visualize missingness
plt.figure(figsize=(12, 8))
sns.heatmap(transposed_data.isnull(), cbar=False, yticklabels=True)
plt.title('Missing Data Heatmap (Transposed)')
plt.xlabel('Instances')
plt.ylabel('Features')
plt.tight_layout()
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that I removed &lt;code&gt;cmap='viridis'&lt;/code&gt; from the heatmap arguments, as I find that color map hard to view. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9gmbua2mpn6hsdpxsej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9gmbua2mpn6hsdpxsej.png" width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This heatmap suggests that there might be a pattern of missingness because the same variables are missing across multiple rows. In one group, we can see that Bsmt Qual, Bsmt Cond, Bsmt Exposure, BsmtFin Type 1, and BsmtFin Type 2 are all missing from the same observations. In another group, we can see that Garage Type, Garage Yr Blt, Garage Finish, Garage Qual, and Garage Cond are all missing from the same observations.&lt;/p&gt;

&lt;p&gt;These variables all relate to basements and garages, but there are other variables related to garages or basements that are not missing. One possible explanation is that different questions were asked about garages and basements in different real estate agencies when the data was gathered, and not all of them recorded as much detail as is in the dataset. Such scenarios are common with data you don’t collect yourself, and you can explore how the data was collected if you need to learn more about missingness in your dataset.&lt;/p&gt;
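
&lt;p&gt;If you want to quantify what the heatmap suggests, a short check (a sketch, assuming the column names match your copy of the dataset) counts how often the basement columns are missing together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Columns that appear to be missing together in the heatmap
bsmt_cols = ['Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure',
             'BsmtFin Type 1', 'BsmtFin Type 2']

# Rows where every basement column is missing at once
all_missing = data[bsmt_cols].isnull().all(axis=1).sum()

# Rows where at least one basement column is missing
any_missing = data[bsmt_cols].isnull().any(axis=1).sum()

print(f"{all_missing} of the {any_missing} rows with any basement missingness are missing all five columns")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
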

&lt;h2&gt;
  
  
Best practices for data cleaning
&lt;/h2&gt;

&lt;p&gt;As I’ve mentioned, defining your population is high on the list of best practices for data cleaning. Know what you want to achieve and how you want to generalise your data before you start cleaning it. &lt;/p&gt;

&lt;p&gt;You also need to ensure that all your methods are reproducible, because reproducibility goes hand in hand with clean data. Workflows that aren’t reproducible can cause significant problems further down the line. For this reason, I recommend keeping your Jupyter notebooks tidy and sequential while taking advantage of the Markdown features to document your decision-making at every step, especially with cleaning. &lt;/p&gt;

&lt;p&gt;When cleaning data, you should work incrementally, modifying the DataFrame rather than the original CSV file or database, and ensuring you do it all in reproducible, well-documented code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Data cleaning is a big topic, and it can have many challenges. The larger the dataset is, the more challenging the cleaning process is. You will need to keep your population in mind to generalise your conclusions more widely while balancing tradeoffs between removing and imputing missing values and understanding why that data is missing in the first place. &lt;/p&gt;

&lt;p&gt;You can think of yourself as the voice of the data. You know the journey that the data has been on and how you have maintained data integrity at all stages. You are the best person to document that journey and share it with others. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://jb.gg/m8p92h" rel="noopener noreferrer"&gt;Try PyCharm Professional for free&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>datacleaning</category>
    </item>
    <item>
      <title>7 Reasons You Should Use dbt Core in PyCharm</title>
      <dc:creator>Evgenia Verbina</dc:creator>
      <pubDate>Mon, 16 Dec 2024 12:58:55 +0000</pubDate>
      <link>https://dev.to/pycharm/7-reasons-you-should-use-dbt-core-in-pycharm-1j5a</link>
      <guid>https://dev.to/pycharm/7-reasons-you-should-use-dbt-core-in-pycharm-1j5a</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jshm3yed7xkaw5jiqc6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jshm3yed7xkaw5jiqc6.png" alt="7 Reasons You Should Use dbt Core in PyCharm" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;dbt Core is a modern data transformation framework. It doesn’t extract or load data and is only responsible for the T in the ELT (extract-load-transform) process. dbt connects to your data warehouse and helps you prepare your data so it can later be used to answer business questions.&lt;/p&gt;

&lt;p&gt;In this blog post, we’ll talk about the top benefits of dbt and the advantages of using it in &lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;PyCharm Professional&lt;/a&gt;. To make the most of these features, you should be familiar with the framework. If you know SQL well, you’ll likely find it easy to use, and if you are a total novice in the field, you can use the &lt;a href="https://learn.getdbt.com/catalog" rel="noopener noreferrer"&gt;dbt portal&lt;/a&gt; to get acquainted with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why you should use dbt
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Modularity and code reusability&lt;/strong&gt; – Transformations can be saved into modular, reusable models. For instance, in this example the model &lt;em&gt;int_count_customer.sql&lt;/em&gt; has a reference to &lt;em&gt;stg_day_customer.sql&lt;/em&gt; and reuses its code.&lt;/li&gt;
&lt;/ul&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Versioning&lt;/strong&gt; – dbt projects can be stored in version control systems like Git or GitHub. This allows you to track changes, collaborate with other team members, and maintain a record of all transformations.&lt;/li&gt;
&lt;/ul&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Testing&lt;/strong&gt; – dbt allows you to write tests for your data models easily and check whether the data has any duplicates or null values. Additionally, you can even create specific rules to test against, and you can perform tests on both the model and the project levels.&lt;/li&gt;
&lt;/ul&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Documentation&lt;/strong&gt; – dbt auto-generates documentation for data models, ensuring that team members and stakeholders all understand the data lineage and model definitions in the same way.&lt;/li&gt;
&lt;/ul&gt;



&lt;p&gt;To summarize, dbt brings best practices in engineering to the field of data analysis, allowing you to produce higher-quality results while providing you with a straightforward and intuitive workflow.&lt;/p&gt;

&lt;p&gt;These benefits are just the tip of the iceberg when it comes to what the tool can do.&lt;/p&gt;

&lt;h2&gt;
  
  
  How PyCharm streamlines your dbt workflow
&lt;/h2&gt;

&lt;p&gt;Having established the benefits of dbt, we can now turn to the 7 key reasons to use it in PyCharm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;User-friendly onboarding&lt;/strong&gt; – PyCharm streamlines the initial setup. As demonstrated in this video, setting up a project and configuring the necessary settings is straightforward.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified workspace for databases and dbt&lt;/strong&gt; – PyCharm’s integrated database plugin, powered by &lt;a href="https://www.jetbrains.com/datagrip/" rel="noopener noreferrer"&gt;JetBrains DataGrip&lt;/a&gt;, makes handling SQL databases significantly easier. Since it’s compatible with all databases that dbt works with, you don’t have to worry about juggling multiple tools. You can focus on data modeling and instantly view outcomes, all in one place. Covering even a small number of the plugin’s features would take hours, but luckily we have a nice set of webinars dedicated to PyCharm’s functionality for databases: &lt;a href="https://www.youtube.com/watch?v=_FlpiNno088&amp;amp;t=1301s" rel="noopener noreferrer"&gt;Visual SQL Development with PyCharm&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git and dbt integration&lt;/strong&gt; – In one interface, you can easily clone the repo, track any changes, manage branches, resolve conflicts, and collaborate with teammates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autocompletion for your .yml and Jinja-templated SQL files&lt;/strong&gt; – People love using PyCharm because of its smart autocompletion, which it, of course, offers for dbt as well.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local history&lt;/strong&gt; – This feature lets you undo recent changes if they cause problems. You can also compare different versions to see what was changed and check whether updates were made correctly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Assistant&lt;/strong&gt; – AI Assistant is really helpful, especially if you’re just starting with dbt Core. It is context-aware, and in addition to having it answer your questions in the AI chat, you can have it generate code and fix problems for you, streamlining your work with data models. It also saves you from worrying about what to write in commit messages by composing them for you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project navigation&lt;/strong&gt; – PyCharm excels in project navigation, offering features like fast search functionality and the &lt;em&gt;Go to Declaration&lt;/em&gt; feature, both of which allow you to navigate through your dbt models effortlessly.&lt;/li&gt;
&lt;/ol&gt;



&lt;p&gt;That’s just a glimpse of the benefits PyCharm already offers for dbt, and our support is still in its early stages. We invite you to test it out and share your insights. Whether you have suggestions for features or want to let us know about areas for improvement, we’re eager to hear from you. &lt;/p&gt;

&lt;p&gt;Get started with PyCharm by using the promo code &lt;strong&gt;dbt-PyCharm&lt;/strong&gt; to get a 3-month free trial.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/store/redeem/" rel="noopener noreferrer"&gt;Redeem your code&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Want to learn how to use dbt in PyCharm? Head to the &lt;a href="https://www.jetbrains.com/help/pycharm/create-and-configure-dbt-project.html#profiles" rel="noopener noreferrer"&gt;documentation page&lt;/a&gt; to learn more about the IDE’s dbt support.&lt;/p&gt;

&lt;p&gt;Eager to learn more about dbt in general? Take a look &lt;a href="https://blog.jetbrains.com/big-data-tools/2022/01/25/how-i-started-out-with-dbt/" rel="noopener noreferrer"&gt;at this post on the experience of using dbt&lt;/a&gt; and &lt;a href="https://blog.jetbrains.com/big-data-tools/2022/02/22/dbt-deeper-concepts-materialization/" rel="noopener noreferrer"&gt;this analysis of deeper dbt concepts&lt;/a&gt; by Pavel Finkelshteyn.&lt;/p&gt;

</description>
      <category>dbt</category>
    </item>
    <item>
      <title>Introduction to Sentiment Analysis in Python</title>
      <dc:creator>Evgenia Verbina</dc:creator>
      <pubDate>Thu, 12 Dec 2024 10:01:40 +0000</pubDate>
      <link>https://dev.to/pycharm/introduction-to-sentiment-analysis-in-python-4omo</link>
      <guid>https://dev.to/pycharm/introduction-to-sentiment-analysis-in-python-4omo</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkfzfjszn63sfyg04qzh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkfzfjszn63sfyg04qzh.png" alt="Introduction to Sentiment Analysis in Python" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sentiment analysis is one of the most popular ways to analyze text. It allows us to see at a glance how people are feeling across a wide range of areas and has useful applications in fields like customer service, market and product research, and competitive analysis.&lt;/p&gt;

&lt;p&gt;Like any area of natural language processing (NLP), sentiment analysis can get complex. Luckily, &lt;a href="https://www.jetbrains.com/guide/python/" rel="noopener noreferrer"&gt;Python&lt;/a&gt; has excellent packages and tools that make this branch of NLP much more approachable.&lt;/p&gt;

&lt;p&gt;In this blog post, we’ll explore some of the most popular packages for analyzing sentiment in Python, how they work, and how you can train your own sentiment analysis model using state-of-the-art techniques. We’ll also look at some &lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;PyCharm&lt;/a&gt; features that make working with these packages easier and faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is sentiment analysis?
&lt;/h2&gt;

&lt;p&gt;Sentiment analysis is the process of analyzing a piece of text to determine its emotional tone. As you can probably see from this definition, sentiment analysis is a very broad field that incorporates a wide variety of methods within the field of natural language processing.&lt;/p&gt;

&lt;p&gt;There are many ways to define “emotional tone”. The most commonly used methods determine the &lt;em&gt;valence&lt;/em&gt; or &lt;em&gt;polarity&lt;/em&gt; of a piece of text – that is, how positive or negative the sentiment expressed in a text is. Emotional tone is also usually treated as a text classification problem, where text is categorized as either positive or negative.&lt;/p&gt;

&lt;p&gt;Take the following &lt;a href="https://www.amazon.com/AmazonBasics-12-Cup-Coffee-Reusable-Stainless/dp/B084ZH769P/ref=sr_1_1_ffob_sspa?th=1" rel="noopener noreferrer"&gt;Amazon product review&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Few26dxosudxs0s1jwgil.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Few26dxosudxs0s1jwgil.png" width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is obviously not a happy customer, and sentiment analysis techniques would classify this review as negative.&lt;/p&gt;

&lt;p&gt;Contrast this with a much more satisfied buyer:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23r5j6rymgx8sjijssua.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23r5j6rymgx8sjijssua.png" width="800" height="208"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This time, sentiment analysis techniques would classify this as positive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Different types of sentiment analysis
&lt;/h3&gt;

&lt;p&gt;There are multiple ways of extracting emotional information from text. Let’s review a few of the most important ones.&lt;/p&gt;

&lt;h4&gt;
  
  
  Ways of defining sentiment
&lt;/h4&gt;

&lt;p&gt;First, sentiment analysis approaches have several different ways of defining sentiment or emotion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Binary&lt;/strong&gt; : This is where the valence of a document is divided into two categories, either &lt;em&gt;positive&lt;/em&gt; or &lt;em&gt;negative&lt;/em&gt;, as with the &lt;a href="https://huggingface.co/datasets/stanfordnlp/sst2" rel="noopener noreferrer"&gt;SST-2 dataset&lt;/a&gt;. Related to this are classifications of valence that add a &lt;em&gt;neutral&lt;/em&gt; class (where a text expresses no sentiment about a topic) or even a &lt;em&gt;conflict&lt;/em&gt; class (where a text expresses both positive and negative sentiment about a topic).&lt;/p&gt;

&lt;p&gt;Some sentiment analyzers use a related measure to classify texts into &lt;em&gt;subjective&lt;/em&gt; or &lt;em&gt;objective&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-grained&lt;/strong&gt; : This term describes several different ways of approaching sentiment analysis, but here it refers to breaking down positive and negative valence into a Likert scale. A well-known example of this is the &lt;a href="https://huggingface.co/datasets/SetFit/sst5" rel="noopener noreferrer"&gt;SST-5 dataset&lt;/a&gt;, which uses a five-point Likert scale with the classes &lt;em&gt;very positive&lt;/em&gt;, &lt;em&gt;positive&lt;/em&gt;, &lt;em&gt;neutral&lt;/em&gt;, &lt;em&gt;negative&lt;/em&gt;, and &lt;em&gt;very negative&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Continuous&lt;/strong&gt; : The valence of a piece of text can also be measured continuously, with scores indicating how positive or negative the sentiment of the writer was. For example, the &lt;a href="https://github.com/cjhutto/vaderSentiment" rel="noopener noreferrer"&gt;VADER sentiment analyzer&lt;/a&gt; gives a piece of text a score between –1 (&lt;em&gt;strongly negative&lt;/em&gt;) and 1 (&lt;em&gt;strongly positive&lt;/em&gt;), with scores close to 0 indicating a neutral sentiment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Emotion-based&lt;/strong&gt; : Also known as emotion detection or emotion identification, this approach attempts to detect the specific emotion being expressed in a piece of text. You can approach this in two ways. Categorical emotion detection tries to classify the sentiment expressed by a text into one of a handful of discrete emotions, usually based on the &lt;a href="https://www.tandfonline.com/doi/abs/10.1080/02699939208411068" rel="noopener noreferrer"&gt;Ekman&lt;/a&gt; model, which includes &lt;em&gt;anger&lt;/em&gt;, &lt;em&gt;disgust&lt;/em&gt;, &lt;em&gt;fear&lt;/em&gt;, &lt;em&gt;joy&lt;/em&gt;, &lt;em&gt;sadness&lt;/em&gt;, and &lt;em&gt;surprise&lt;/em&gt;. A &lt;a href="https://huggingface.co/j-hartmann/emotion-english-distilroberta-base#appendix-%F0%9F%93%9A" rel="noopener noreferrer"&gt;number of datasets&lt;/a&gt; exist for this type of emotion detection. Dimensional emotional detection is less commonly used in sentiment analysis and instead tries to measure &lt;a href="https://link.springer.com/article/10.1007/s12144-014-9219-4" rel="noopener noreferrer"&gt;three emotional aspects&lt;/a&gt; of a piece of text: &lt;em&gt;polarity&lt;/em&gt;, &lt;em&gt;arousal&lt;/em&gt; (how exciting a feeling is), and &lt;em&gt;dominance&lt;/em&gt; (how restricted the emotional expression is).&lt;/p&gt;

&lt;h4&gt;
  
  
  Levels of analysis
&lt;/h4&gt;

&lt;p&gt;We can also consider different levels at which we can analyze a piece of text. To understand this better, let’s consider another review of the coffee maker:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frkc3f92dqx4hv04punkm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frkc3f92dqx4hv04punkm.png" width="800" height="208"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document-level&lt;/strong&gt; : This is the most basic level of analysis, where one sentiment for an entire piece of text will be returned. Document-level analysis might be fine for very short pieces of text, such as Tweets, but can give misleading answers if there is any mixed sentiment. For example, if we based the sentiment analysis for this review on the whole document, it would likely be classified as neutral or conflict, as we have two opposing sentiments about the same coffee machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sentence-level&lt;/strong&gt; : This is where the sentiment for each sentence is predicted separately. For the coffee machine review, sentence-level analysis would tell us that the reviewer felt positively about some parts of the product but negatively about others. However, this analysis doesn’t tell us what things the reviewer liked and disliked about the coffee machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aspect-based&lt;/strong&gt; : This type of sentiment analysis dives deeper into a piece of text and tries to understand the sentiment of users about specific aspects. For our review of the coffee maker, the reviewer mentioned two aspects: &lt;em&gt;appearance&lt;/em&gt; and &lt;em&gt;noise&lt;/em&gt;. By extracting these aspects, we have more information about what the user specifically did and did not like. They had a positive sentiment about the machine’s appearance but a negative sentiment about the noise it made.&lt;/p&gt;

&lt;h4&gt;
  
  
  Coupling sentiment analysis with other NLP techniques
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Intent-based&lt;/strong&gt; : In this final type of sentiment analysis, the text is classified in two ways: in terms of the sentiment being expressed, and the topic of the text. For example, if a telecommunication company receives a ticket complaining about how often their service goes down, they could classify the text intent or topic as &lt;em&gt;service reliability&lt;/em&gt; and the sentiment as &lt;em&gt;negative&lt;/em&gt;. As with aspect-based sentiment analysis, this analysis gives the company much more information than knowing whether their customers are generally happy or unhappy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Applications of sentiment analysis
&lt;/h3&gt;

&lt;p&gt;By now, you can probably already think of some potential use cases for sentiment analysis. Basically, it can be used anywhere that you could get text feedback or opinions about a topic. Organizations or individuals can use sentiment analysis to do social media monitoring and see how people feel about a brand, government entity, or topic.&lt;/p&gt;

&lt;p&gt;Customer feedback analysis can be used to find out the sentiments expressed in feedback or tickets. Product reviews can be analyzed to see how satisfied or dissatisfied people are with a company’s products. Finally, sentiment analysis can be a key component in market research and competitive analysis, where how people feel about emerging trends, features, and competitors can help guide a company’s strategies.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does sentiment analysis work?
&lt;/h2&gt;

&lt;p&gt;At a general level, sentiment analysis operates by linking words (or, in more sophisticated models, the overall tone of a text) to an emotion. The most common approaches to sentiment analysis fall into one of the three methods below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lexicon-based approaches
&lt;/h3&gt;

&lt;p&gt;These methods rely on a lexicon that includes sentiment scores for a range of words. They combine these scores using a set of rules to get the overall sentiment for a piece of text. These methods tend to be very fast and also have the advantage of yielding more fine-grained continuous sentiment scores. However, as the lexicons need to be handcrafted, they can be time-consuming and expensive to produce.&lt;/p&gt;

&lt;h3&gt;
  
  
  Machine learning models
&lt;/h3&gt;

&lt;p&gt;These methods train a machine learning model, most commonly a Naive Bayes classifier, on a dataset that contains text and their sentiment labels, such as movie reviews. In this model, texts are generally classified as positive, negative, and sometimes neutral. These models also tend to be very fast, but as they usually don’t take into account the relationship between words in the input, they may struggle with more complex texts that involve qualifiers and negations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Large language models
&lt;/h3&gt;

&lt;p&gt;These methods rely on fine-tuning a pre-trained transformer-based large language model on the same datasets used to train the machine learning classifiers mentioned earlier. These sophisticated models are capable of modeling complex relationships between words in a piece of text but tend to be slower than the other two methods.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sentiment analysis in Python
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/help/pycharm/python.html" rel="noopener noreferrer"&gt;Python&lt;/a&gt; has a rich ecosystem of packages for &lt;a href="https://blog.jetbrains.com/pycharm/tag/nlp/" rel="noopener noreferrer"&gt;NLP&lt;/a&gt;, meaning you are spoiled for choice when doing sentiment analysis in this language.&lt;/p&gt;

&lt;p&gt;Let’s review some of the most popular &lt;a href="https://www.jetbrains.com/guide/python/tutorials/getting-started-pycharm/installing-and-managing-python-packages/" rel="noopener noreferrer"&gt;Python packages&lt;/a&gt; for sentiment analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  The best Python libraries for sentiment analysis
&lt;/h3&gt;

&lt;h4&gt;
  
  
  VADER
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.nltk.org/api/nltk.sentiment.vader.html" rel="noopener noreferrer"&gt;VADER (Valence Aware Dictionary and Sentiment Reasoner)&lt;/a&gt; is a popular lexicon-based sentiment analyzer. Built into the powerful &lt;a href="https://www.nltk.org/index.html" rel="noopener noreferrer"&gt;NLTK package&lt;/a&gt;, this analyzer returns four sentiment scores: the degree to which the text was &lt;em&gt;positive&lt;/em&gt;, &lt;em&gt;neutral&lt;/em&gt;, or &lt;em&gt;negative&lt;/em&gt;, as well as a &lt;em&gt;compound&lt;/em&gt; sentiment score. The positive, neutral, and negative scores range from 0 to 1 and indicate the proportion of the text that was positive, neutral, or negative. The compound score ranges from –1 (extremely negative) to 1 (extremely positive) and indicates the overall sentiment valence of the text.&lt;/p&gt;

&lt;p&gt;Let’s look at a basic example of how it works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We first need to download the VADER lexicon.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nltk.download('vader_lexicon')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can then instantiate the VADER &lt;code&gt;SentimentIntensityAnalyzer()&lt;/code&gt; and extract the sentiment scores using the &lt;code&gt;polarity_scores()&lt;/code&gt; method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;analyzer = SentimentIntensityAnalyzer()

sentence = "I love PyCharm! It's my favorite Python IDE."
sentiment_scores = analyzer.polarity_scores(sentence)
print(sentiment_scores)

{'neg': 0.0, 'neu': 0.572, 'pos': 0.428, 'compound': 0.6696}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that VADER has given this piece of text an overall sentiment score of 0.67 and classified its contents as 43% positive, 57% neutral, and 0% negative.&lt;/p&gt;

&lt;p&gt;VADER works by looking up the sentiment scores for each word in its lexicon and combining them using a nuanced set of rules. For example, qualifiers can increase or decrease the intensity of a word’s sentiment, so a qualifier such as “a bit” before a word would decrease the sentiment intensity, but “extremely” would amplify it.&lt;/p&gt;
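
&lt;p&gt;You can see this for yourself with a small sketch that reuses the analyzer from above and scores the same adjective with different qualifiers (the exact scores may vary slightly between VADER versions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The same sentiment-bearing word, dampened and amplified by qualifiers
for text in ["The IDE is good.",
             "The IDE is a bit good.",
             "The IDE is extremely good."]:
    print(text, analyzer.polarity_scores(text)["compound"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
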

&lt;p&gt;VADER’s lexicon includes abbreviations such as “smh” (shaking my head) and emojis, making it particularly suitable for social media text. VADER’s main limitation is that it doesn’t work for languages other than English, but you can use projects such as &lt;a href="https://github.com/brunneis/vader-multi" rel="noopener noreferrer"&gt;&lt;code&gt;vader-multi&lt;/code&gt;&lt;/a&gt; as an alternative. I wrote about &lt;a href="https://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html" rel="noopener noreferrer"&gt;how VADER works&lt;/a&gt; if you’re interested in taking a deeper dive into this package.&lt;/p&gt;

&lt;h4&gt;
  
  
  NLTK
&lt;/h4&gt;

&lt;p&gt;Additionally, you can use NLTK to train your own machine learning-based sentiment classifier, using classifiers from &lt;code&gt;scikit-learn&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;There are many ways of processing the text to feed into these models, but the simplest way is doing it based on the words that are present in the text, a type of text modeling called the bag-of-words approach. The most straightforward type of bag-of-words modeling is &lt;em&gt;binary vectorisation&lt;/em&gt;, where each word is treated as a feature, with the value of that feature being either 0 or 1 (whether the word is absent or present in the text, respectively).&lt;/p&gt;
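
&lt;p&gt;Here’s a minimal sketch of binary vectorisation using scikit-learn’s &lt;code&gt;CountVectorizer&lt;/code&gt; (one common way to implement this; the toy sentences are just for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love this IDE", "I do not love this editor"]

# binary=True records presence/absence (1/0) instead of word counts
vectorizer = CountVectorizer(binary=True)
features = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())
print(features.toarray())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
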

&lt;p&gt;If you’re new to working with text data and NLP, and you’d like more information about how text can be converted into inputs for machine learning models, I gave a &lt;a href="https://www.youtube.com/live/WYmyZBg2VFI?feature=shared&amp;amp;t=261" rel="noopener noreferrer"&gt;talk on this topic&lt;/a&gt; that provides a gentle introduction.&lt;/p&gt;

&lt;p&gt;You can see an example in the &lt;a href="https://www.nltk.org/howto/sentiment.html#sentiment-analysis" rel="noopener noreferrer"&gt;NLTK documentation&lt;/a&gt;, where a Naive Bayes classifier is trained to predict whether a piece of text is subjective or objective. In this example, they add an additional negation qualifier to some of the terms based on rules which indicate whether that word or character is likely involved in negating a sentiment expressed elsewhere in the text. Real Python also has a &lt;a href="https://realpython.com/python-nltk-sentiment-analysis/#customizing-nltks-sentiment-analysis" rel="noopener noreferrer"&gt;sentiment analysis tutorial&lt;/a&gt; on training your own classifiers using NLTK, if you want to learn more about this topic.&lt;/p&gt;
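
&lt;p&gt;To give a rough idea of the overall shape of such a pipeline, here’s a sketch with toy data (purely illustrative; the tutorials linked above use proper labeled corpora such as movie reviews):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data; in practice you'd use a labeled corpus
train_texts = ["great product, works well", "terrible, broke after a day",
               "absolutely love it", "waste of money"]
train_labels = ["pos", "neg", "pos", "neg"]

# Bag-of-words features feeding a Naive Bayes classifier
model = make_pipeline(CountVectorizer(binary=True), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["this works great"]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
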

&lt;h4&gt;
  
  
  Pattern and TextBlob
&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://github.com/clips/pattern" rel="noopener noreferrer"&gt;Pattern&lt;/a&gt; package provides another lexicon-based approach to &lt;a href="https://github.com/clips/pattern/blob/d25511f9ca7ed9356b801d8663b8b5168464e68f/pattern/text/%20__init__.py#L2316" rel="noopener noreferrer"&gt;analyzing sentiment&lt;/a&gt;. It uses the &lt;a href="https://github.com/aesuli/SentiWordNet" rel="noopener noreferrer"&gt;SentiWordNet&lt;/a&gt; lexicon, where each synonym group (&lt;em&gt;synset&lt;/em&gt;) from &lt;a href="https://github.com/clips/pattern" rel="noopener noreferrer"&gt;WordNet&lt;/a&gt; is assigned a score for positivity, negativity, and objectivity. The positive and negative scores for each word are combined using a series of rules to give a final polarity score. Similarly, the objectivity score for each word is combined to give a final subjectivity score.&lt;/p&gt;

&lt;p&gt;As WordNet contains part-of-speech information, the rules can take into account whether adjectives or adverbs preceding a word modify its sentiment. The ruleset also considers negations, exclamation marks, and emojis, and even includes some rules to handle idioms and sarcasm.&lt;/p&gt;

&lt;p&gt;However, Pattern as a standalone library is only compatible with Python 3.6. As such, the most common way to use Pattern is through &lt;a href="https://textblob.readthedocs.io/en/dev/" rel="noopener noreferrer"&gt;TextBlob&lt;/a&gt;. By default, the &lt;a href="https://github.com/sloria/TextBlob/blob/e19171014bfba910d1e33527f46d514837da234e/src/textblob/en/sentiments.py#L15" rel="noopener noreferrer"&gt;TextBlob sentiment analyzer&lt;/a&gt; uses its own implementation of the Pattern library to generate sentiment scores.&lt;/p&gt;

&lt;p&gt;Let’s have a look at this in action:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from textblob import TextBlob
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that we run the TextBlob method over our text, and then extract the sentiment using the &lt;code&gt;sentiment&lt;/code&gt; attribute.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pattern_blob = TextBlob("I love PyCharm! It's my favorite Python IDE.")
sentiment = pattern_blob.sentiment

print(f"Polarity: {sentiment.polarity}")
print(f"Subjectivity: {sentiment.subjectivity}")

Polarity: 0.625
Subjectivity: 0.6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For our example sentence, Pattern in TextBlob gives us a polarity score of 0.625 (relatively close to the score given by VADER), and a subjectivity score of 0.6.&lt;/p&gt;

&lt;p&gt;But there’s also a second way of getting sentiment scores in TextBlob. This package also includes a &lt;a href="https://github.com/sloria/TextBlob/blob/e19171014bfba910d1e33527f46d514837da234e/src/textblob/en/sentiments.py#L53" rel="noopener noreferrer"&gt;pre-trained Naive Bayes classifier&lt;/a&gt;, which will label a piece of text as either positive or negative, and give you the probability of the text being either positive or negative.&lt;/p&gt;

&lt;p&gt;To use this method, we first need to download both the &lt;code&gt;punkt&lt;/code&gt; module and the &lt;code&gt;movie-reviews&lt;/code&gt; dataset from NLTK, which is used to train this model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import nltk
nltk.download('movie_reviews')
nltk.download('punkt')

from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once again, we need to run &lt;code&gt;TextBlob&lt;/code&gt; over our text, but this time we add the argument &lt;code&gt;analyzer=NaiveBayesAnalyzer()&lt;/code&gt;. Then, as before, we use the sentiment attribute to extract the sentiment scores.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nb_blob = TextBlob("I love PyCharm! It's my favorite Python IDE.", analyzer=NaiveBayesAnalyzer())
sentiment = nb_blob.sentiment
print(sentiment)

Sentiment(classification='pos', p_pos=0.5851800554016624, p_neg=0.4148199445983381)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time we end up with a label of &lt;code&gt;pos&lt;/code&gt; (positive), with the model predicting that the text has a 59% probability of being positive and a 41% probability of being negative.&lt;/p&gt;

&lt;h4&gt;
  
  
  spaCy
&lt;/h4&gt;

&lt;p&gt;Another option is to use &lt;a href="https://spacy.io/" rel="noopener noreferrer"&gt;spaCy&lt;/a&gt; for sentiment analysis. spaCy is another popular package for NLP in Python, and has a wide range of options for processing text.&lt;/p&gt;

&lt;p&gt;The first method is to use the &lt;a href="https://spacy.io/universe/project/spacy-textblob" rel="noopener noreferrer"&gt;spacytextblob&lt;/a&gt; plugin to run the TextBlob sentiment analyzer as part of your spaCy pipeline. Before you can do this, you’ll first need to install both &lt;code&gt;spacy&lt;/code&gt; and &lt;code&gt;spacytextblob&lt;/code&gt; and download the appropriate language model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import spacy
import spacy.cli
from spacytextblob.spacytextblob import SpacyTextBlob

spacy.cli.download("en_core_web_sm")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We then load in this language model and add &lt;code&gt;spacytextblob&lt;/code&gt; to our text processing pipeline. TextBlob can be used through spaCy’s &lt;code&gt;pipe&lt;/code&gt; method, which means we can include it as part of a more complex text processing pipeline, including preprocessing steps such as part-of-speech tagging, lemmatization, and named-entity recognition. Preprocessing can normalize and enrich text, helping downstream models to get the most information out of the text inputs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For now, we’ll just analyze our sample sentence without preprocessing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;doc = nlp("I love PyCharm! It's my favorite Python IDE.")

print('Polarity: ', doc._.polarity)
print('Subjectivity: ', doc._.subjectivity)

Polarity: 0.625
Subjectivity: 0.6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We get the same results as when using TextBlob above.&lt;/p&gt;

&lt;p&gt;A second way we can do sentiment analysis in spaCy is by training our own model using the &lt;a href="https://spacy.io/api/textcategorizer" rel="noopener noreferrer"&gt;TextCategorizer class&lt;/a&gt;. This allows you to train a range of &lt;a href="https://spacy.io/api/architectures" rel="noopener noreferrer"&gt;spaCy-provided model architectures&lt;/a&gt; using a sentiment analysis training set. Again, as this can be used as part of the spaCy pipeline, you have many options for preprocessing your text before training your model.&lt;/p&gt;

&lt;p&gt;Finally, you can use large language models to do sentiment analysis through &lt;a href="https://spacy.io/api/large-language-models#sentiment" rel="noopener noreferrer"&gt;spacy-llm&lt;/a&gt;. This allows you to prompt a variety of proprietary large language models (LLMs) from OpenAI, Anthropic, Cohere, and Google to perform sentiment analysis over your texts.&lt;/p&gt;

&lt;p&gt;This approach works slightly differently from the other methods we’ve discussed. Instead of training the model, we can use generalist models like GPT-4 to predict the sentiment of a text. You can do this either through zero-shot learning (where a prompt but no examples are passed to the model) or few-shot learning (where a prompt and a number of examples are passed to the model).&lt;/p&gt;

&lt;h4&gt;
  
  
  Transformers
&lt;/h4&gt;

&lt;p&gt;The final Python package for sentiment analysis we’ll discuss is &lt;a href="https://huggingface.co/docs/transformers/en/index" rel="noopener noreferrer"&gt;Transformers&lt;/a&gt; from &lt;a href="https://huggingface.co/" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Hugging Face hosts all major open-source LLMs for free use (among other models, including computer vision and audio models), and provides a platform for training, deploying, and sharing these models. Its Transformers package offers a wide range of functionality (including sentiment analysis) for working with the LLMs hosted by Hugging Face.&lt;/p&gt;
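
&lt;p&gt;The quickest entry point is the &lt;code&gt;pipeline&lt;/code&gt; API. Here’s a minimal sketch; the first call downloads a default English sentiment model, and you can pass a &lt;code&gt;model&lt;/code&gt; argument to choose a specific one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import pipeline

# Downloads a default sentiment model on first use
classifier = pipeline("sentiment-analysis")

print(classifier("I love PyCharm! It's my favorite Python IDE."))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
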

&lt;h2&gt;
  
  
  Understanding the results of sentiment analyzers
&lt;/h2&gt;

&lt;p&gt;Now that we’ve covered all of the ways you can do sentiment analysis in Python, you might be wondering, “How can I apply this to my own data?”&lt;/p&gt;

&lt;p&gt;To understand this, let’s use PyCharm to compare two packages, VADER and TextBlob. Their multiple sentiment scores offer us a few different perspectives on our data. We’ll use these packages to analyze the Amazon reviews dataset.&lt;/p&gt;

&lt;p&gt;PyCharm Professional is a powerful Python IDE for &lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;data science&lt;/a&gt; that supports advanced Python &lt;a href="https://www.jetbrains.com/help/pycharm/auto-completing-code.html" rel="noopener noreferrer"&gt;code completion&lt;/a&gt;, inspections and &lt;a href="https://www.jetbrains.com/help/pycharm/debugging-code.html" rel="noopener noreferrer"&gt;debugging&lt;/a&gt;, as well as rich support for &lt;a href="https://www.jetbrains.com/pycharm/integrations/#databases" rel="noopener noreferrer"&gt;databases&lt;/a&gt;, &lt;a href="https://www.jetbrains.com/help/pycharm/running-jupyter-notebook-cells.html" rel="noopener noreferrer"&gt;Jupyter&lt;/a&gt;, &lt;a href="https://www.jetbrains.com/help/pycharm/using-git-integration.html" rel="noopener noreferrer"&gt;Git&lt;/a&gt;, &lt;a href="https://www.jetbrains.com/help/pycharm/conda-support-creating-conda-virtual-environment.html" rel="noopener noreferrer"&gt;Conda&lt;/a&gt;, and more – all out of the box. In addition to these, you’ll also get incredibly useful features like our DataFrame &lt;em&gt;Column Statistics&lt;/em&gt; and &lt;em&gt;Chart View&lt;/em&gt;, as well as Hugging Face &lt;a href="https://www.jetbrains.com/pycharm/integrations/" rel="noopener noreferrer"&gt;integrations&lt;/a&gt; that make working with LLMs much quicker and easier. In this blog post, we’ll explore PyCharm’s advanced features for working with DataFrames, which will give us a quick overview of how our sentiment scores are distributed between the two packages.&lt;/p&gt;

&lt;p&gt;If you’re now ready to get started on your own sentiment analysis project, you can activate your free three-month subscription to PyCharm. Click on the link below, and enter this promo code: &lt;strong&gt;PCSA24&lt;/strong&gt;. You’ll then receive an activation code via email.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/store/redeem/" rel="noopener noreferrer"&gt;Activate your 3-month subscription&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first thing we need to do is load in the data. We can use the &lt;code&gt;load_dataset()&lt;/code&gt; method from the Datasets package to download this &lt;a href="https://huggingface.co/datasets/fancyzhx/amazon_polarity" rel="noopener noreferrer"&gt;data from the Hugging Face Hub&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datasets import load_dataset
amazon = load_dataset("fancyzhx/amazon_polarity")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can hover over the name of the dataset to see the Hugging Face dataset card right inside PyCharm, providing you with a convenient way to get information about Hugging Face assets without leaving the IDE.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyby69h2tnrgf7yj2dzu7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyby69h2tnrgf7yj2dzu7.png" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see the contents of this dataset here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;amazon

DatasetDict({
    train: Dataset({
        features: ['label', 'title', 'content'],
        num_rows: 3600000
    })
    test: Dataset({
        features: ['label', 'title', 'content'],
        num_rows: 400000
    })
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The training dataset has 3.6 million observations, and the test dataset contains 400,000. We’ll be working with the training dataset in this tutorial.&lt;/p&gt;

&lt;p&gt;We’ll now load in the VADER &lt;code&gt;SentimentIntensityAnalyzer&lt;/code&gt; and the TextBlob method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk

nltk.download("vader_lexicon")

analyzer = SentimentIntensityAnalyzer()

from textblob import TextBlob
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The training dataset has too many observations to comfortably visualize, so we’ll take a random sample of 1,000 reviews to represent the general sentiment of all the reviewers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from random import sample
sample_reviews = sample(amazon["train"]["content"], 1000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s now get the VADER and TextBlob scores for each of these reviews. We’ll loop over each review, run it through both sentiment analyzers, and append the scores to dedicated lists.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vader_neg = []
vader_neu = []
vader_pos = []
vader_compound = []
textblob_polarity = []
textblob_subjectivity = []

for review in sample_reviews:
    vader_sent = analyzer.polarity_scores(review)
    vader_neg.append(vader_sent["neg"])
    vader_neu.append(vader_sent["neu"])
    vader_pos.append(vader_sent["pos"])
    vader_compound.append(vader_sent["compound"])

    textblob_sent = TextBlob(review).sentiment
    textblob_polarity.append(textblob_sent.polarity)
    textblob_subjectivity.append(textblob_sent.subjectivity)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll then pop each of these lists into a &lt;a href="https://pandas.pydata.org/" rel="noopener noreferrer"&gt;pandas&lt;/a&gt; DataFrame as a separate column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

sent_scores = pd.DataFrame({
   "vader_neg": vader_neg,
   "vader_neu": vader_neu,
   "vader_pos": vader_pos,
   "vader_compound": vader_compound,
   "textblob_polarity": textblob_polarity,
   "textblob_subjectivity": textblob_subjectivity
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we’re ready to start exploring our results.&lt;/p&gt;

&lt;p&gt;Typically, this would be the point where we’d start creating a bunch of code for exploratory data analysis. This might be done using pandas’ &lt;code&gt;describe&lt;/code&gt; method to get summary statistics over our columns, and writing &lt;a href="https://matplotlib.org/" rel="noopener noreferrer"&gt;Matplotlib&lt;/a&gt; or &lt;a href="https://seaborn.pydata.org/" rel="noopener noreferrer"&gt;seaborn&lt;/a&gt; code to visualize our results. However, PyCharm has some features to speed this whole thing up.&lt;/p&gt;

&lt;p&gt;Let’s go ahead and print our DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sent_scores
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see a button in the top right-hand corner, called &lt;em&gt;Show Column Statistics&lt;/em&gt;. Clicking this gives us two different options: &lt;em&gt;Compact&lt;/em&gt; and &lt;em&gt;Detailed&lt;/em&gt;. Let’s select &lt;em&gt;Detailed&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fix6uxrw3z7566xlu5kbd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fix6uxrw3z7566xlu5kbd.png" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we have summary statistics provided as part of our column headers! Looking at these, we can see the VADER compound score has a mean of 0.4 (median = 0.6), while the TextBlob polarity score provides a mean of 0.2 (median = 0.2).&lt;/p&gt;

&lt;p&gt;This result indicates that, on average, VADER tends to estimate the same set of reviews more positively than TextBlob does. It also shows that for both sentiment analyzers, we likely have more positive reviews than negative ones – we can dive into this in more detail by checking some visualizations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xs3z48wc70kwfebj75s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xs3z48wc70kwfebj75s.png" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another PyCharm feature we can use is the DataFrame &lt;em&gt;Chart View&lt;/em&gt;. The button for this function is in the top left-hand corner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhuafh3crpe867fzve74d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhuafh3crpe867fzve74d.png" width="764" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we click on the button, we switch over to the chart editor. From here, we can create no-code visualizations straight from our DataFrame.&lt;/p&gt;

&lt;p&gt;Let’s start with VADER’s compound score. To start creating this chart, go to &lt;em&gt;Show Series Settings&lt;/em&gt; in the top right-hand corner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wxiz52wu4xtp9y5e1ol.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wxiz52wu4xtp9y5e1ol.png" width="634" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Remove the default values for &lt;em&gt;X Axis&lt;/em&gt; and &lt;em&gt;Y Axis&lt;/em&gt;. Replace the &lt;em&gt;X Axis&lt;/em&gt; value with &lt;code&gt;vader_compound&lt;/code&gt;, and the &lt;em&gt;Y Axis&lt;/em&gt; value with &lt;code&gt;vader_compound&lt;/code&gt;. Click on the arrow next to the variable name in the &lt;em&gt;Y Axis&lt;/em&gt; field, and select &lt;code&gt;count&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Finally, select &lt;em&gt;Histogram&lt;/em&gt; from the chart icons, just under &lt;em&gt;Series Settings&lt;/em&gt;. We likely have a bimodal distribution for the VADER compound score, with a slight peak around –0.8 and a much larger one around 0.9. These peaks likely represent the split between negative and positive reviews, and they also show that there are far more positive reviews than negative ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbq2auev8ho359du80ope.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbq2auev8ho359du80ope.png" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s repeat the same exercise and create a histogram to see the distribution of the TextBlob polarity scores.&lt;/p&gt;

&lt;p&gt;In contrast, TextBlob tends to rate most reviews as neutral, with very few being strongly positive or negative. To understand why the two sentiment analyzers disagree, let’s look at one review that VADER rated as strongly positive and another it rated as strongly negative, both of which TextBlob rated as neutral.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ufmipq7rbfhwkhjh2sx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ufmipq7rbfhwkhjh2sx.png" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’ll get the index of the first review that VADER rated as positive but TextBlob rated as neutral:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sent_scores[(sent_scores["vader_compound"] &amp;gt;= 0.8) &amp;amp; (sent_scores["textblob_polarity"].between(-0.1, 0.1))].index[0]

42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we get the index of the first review that VADER rated as negative but TextBlob rated as neutral:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sent_scores[(sent_scores["vader_compound"] &amp;lt;= -0.8) &amp;amp; (sent_scores["textblob_polarity"].between(-0.1, 0.1))].index[0]

0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s first retrieve the positive review:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_reviews[42]

"I love carpet sweepers for a fast clean up and a way to conserve energy. The Ewbank Multi-Sweep is a solid, well built appliance. However, if you have pets, you will find that it takes more time cleaning the sweeper than it does to actually sweep the room. The Ewbank does pick up pet hair most effectively but emptying it is a bit awkward. You need to take a rag to clean out both dirt trays and then you need a small tooth comb to pull the hair out of the brushes and the wheels. To do a proper cleaning takes quite a bit of time. My old Bissell is easier to clean when it comes to pet hair and it does a great job. If you do not have pets, I would recommend this product because it is definitely well made and for small cleanups, it would suffice. For those who complain about appliances being made of plastic, unfortunately, these days, that's the norm. It's not great and plastic definitely does not hold up but, sadly, product quality is no longer a priority in business."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This review seems mixed, but is overall somewhat positive.&lt;/p&gt;

&lt;p&gt;Now, let’s look at the negative review:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_reviews[0]

'The only redeeming feature of this Cuisinart 4-cup coffee maker is the sleek black and silver design. After that, it rapidly goes downhill. It is frustratingly difficult to pour water from the carafe into the chamber unless it\'s done extremely slow and with accurate positioning. Even then, water still tends to dribble out and create a mess. The lid, itself, is VERY poorly designed with it\'s molded, round "grip" to supposedly remove the lid from the carafe. The only way I can remove it is to insert a sharp pointed object into one of the front pouring holes and pry it off! I\'ve also occasionally had a problem with the water not filtering down through the grounds, creating a coffee ground lake in the upper chamber and a mess below. I think the designer should go back to the drawing-board for this one.'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This review is unambiguously negative. From comparing the two, VADER appears more accurate, but it does tend to overly prioritize positive terms in a piece of text.&lt;/p&gt;

&lt;p&gt;The final thing we can consider is how subjective versus objective each review is. We’ll do this by creating a histogram of TextBlob’s subjectivity score.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhit0o8zrfq1x8nremc6t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhit0o8zrfq1x8nremc6t.png" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Interestingly, there is a good distribution of subjectivity in the reviews, with most reviews being a mixture of subjective and objective writing. A small number of reviews are also very subjective (close to 1) or very objective (close to 0).&lt;/p&gt;

&lt;p&gt;Between them, these scores give us a nice way of segmenting the data. If you need to know the objective things that people did and did not like about the products, you could look at the reviews with a low subjectivity score and VADER compound scores close to 1 and –1, respectively.&lt;/p&gt;

&lt;p&gt;In contrast, if you want to know people’s emotional reactions to the products, you could take the reviews with a high subjectivity score and either high or low VADER compound scores.&lt;/p&gt;
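
&lt;p&gt;As a rough sketch, that kind of segmentation might look like the following in pandas (the cut-off values are arbitrary illustrations rather than recommendations):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Arbitrary illustrative thresholds: tune these for your own data
objective_reviews = sent_scores[sent_scores["textblob_subjectivity"] &amp;lt; 0.3]
objective_likes = objective_reviews[objective_reviews["vader_compound"] &amp;gt;= 0.8]
objective_dislikes = objective_reviews[objective_reviews["vader_compound"] &amp;lt;= -0.8]

# Strongly emotional reactions, positive or negative
emotional_reviews = sent_scores[
    (sent_scores["textblob_subjectivity"] &amp;gt; 0.7)
    &amp;amp; (sent_scores["vader_compound"].abs() &amp;gt;= 0.8)
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
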

&lt;h2&gt;
  
  
  Things to consider
&lt;/h2&gt;

&lt;p&gt;As with any problem in natural language processing, there are a number of things to watch out for when doing sentiment analysis.&lt;/p&gt;

&lt;p&gt;One of the biggest considerations is the language of the texts you’re trying to analyze. Many of the lexicon-based methods only work for a limited number of languages, so if you’re working with languages not supported by these lexicons, you may need to take another approach, such as using a fine-tuned LLM or training your own model(s).&lt;/p&gt;

&lt;p&gt;As texts increase in complexity, it can also be difficult for lexicon-based analyzers and bag-of-words-based models to correctly detect sentiment. Sarcasm or more subtle context indicators can be hard for simpler models to detect, and these models may not be able to accurately classify the sentiment of such texts. LLMs may be able to handle more complex texts, but you would need to experiment with different models.&lt;/p&gt;

&lt;p&gt;Finally, sentiment analysis runs into the same issues as any other machine learning problem. Your models will only be as good as the training data you use, and if you cannot get high-quality training and testing datasets suited to your problem domain, you will not be able to correctly predict the sentiment of your target audience.&lt;/p&gt;

&lt;p&gt;You should also make sure that your targets are appropriate for your business problem. It might seem attractive to build a model to know whether your products make your customers “sad”, “angry”, or “disgusted”, but if this doesn’t help you make a decision about how to improve your products, then it isn’t solving your problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;In this blog post, we dove deeply into the fascinating area of Python sentiment analysis and showed how this complex field is made more approachable by a range of powerful packages.&lt;/p&gt;

&lt;p&gt;We covered the potential applications of sentiment analysis, different ways of assessing sentiment, and the main methods of extracting sentiment from a piece of text. We also saw some helpful features in PyCharm that make working with models and interpreting their results simpler and faster.&lt;/p&gt;

&lt;p&gt;While the field of natural language processing is currently focused intently on large language models, the older techniques of using lexicon-based analyzers or traditional machine learning models, like Naive Bayes classifiers, still have their place in sentiment analysis. These techniques shine when analyzing simpler texts, or when fast predictions or ease of deployment are priorities. LLMs are best suited to more complex or nuanced texts.&lt;/p&gt;

&lt;p&gt;Now that you’ve grasped the basics, you can learn how to do &lt;a href="https://blog.jetbrains.com/pycharm/2024/12/how-to-do-sentiment-analysis-with-large-language-models/" rel="noopener noreferrer"&gt;sentiment analysis with LLMs&lt;/a&gt; in our tutorial. The step-by-step guide helps you discover how to select the right model for your task, use it for sentiment analysis, and even fine-tune it yourself.&lt;/p&gt;

&lt;p&gt;If you’d like to continue learning about natural language processing or machine learning more broadly after finishing this blog post, here are some resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.jetbrains.com/pycharm/2024/12/how-to-do-sentiment-analysis-with-large-language-models/" rel="noopener noreferrer"&gt;Learn how to do sentiment analysis with large language models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.jetbrains.com/pycharm/2022/06/start-studying-machine-learning-with-pycharm/" rel="noopener noreferrer"&gt;Start studying machine learning with PyCharm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lp.jetbrains.com/research/ml_methods/" rel="noopener noreferrer"&gt;Explore machine learning methods in software engineering&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Get started with sentiment analysis in PyCharm today
&lt;/h2&gt;

&lt;p&gt;If you’re now ready to get started on your own sentiment analysis project, you can activate your free three-month subscription to PyCharm. Click on the link below, and enter this promo code: &lt;strong&gt;PCSA24&lt;/strong&gt;. You’ll then receive an activation code via email.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/store/redeem/" rel="noopener noreferrer"&gt;Activate your 3-month subscription&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>llms</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>How to Do Sentiment Analysis With Large Language Models</title>
      <dc:creator>Evgenia Verbina</dc:creator>
      <pubDate>Thu, 05 Dec 2024 10:49:14 +0000</pubDate>
      <link>https://dev.to/pycharm/how-to-do-sentiment-analysis-with-large-language-models-5ca4</link>
      <guid>https://dev.to/pycharm/how-to-do-sentiment-analysis-with-large-language-models-5ca4</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bxgouxodiwfsxuhzqq4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bxgouxodiwfsxuhzqq4.png" alt="How to Do Sentiment Analysis with Large Language Models" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sentiment analysis is a powerful tool for understanding emotions in text. While there are many ways to approach sentiment analysis, including more traditional lexicon-based and machine learning approaches, today we’ll be focusing on one of the most cutting-edge ways of working with text – large language models (LLMs). We’ll explain how you can use these powerful models to predict the sentiment expressed in a text.&lt;/p&gt;

&lt;p&gt;As a practical tutorial, this post will introduce you to the types of LLMs most suited for sentiment analysis tasks and then show you how to choose the right model for your specific task.&lt;/p&gt;

&lt;p&gt;We’ll cover using models that other people have fine-tuned for sentiment analysis and how to fine-tune one yourself. We’ll also look at some of the powerful tools and resources available that can help you work with these models easily, while demystifying what can feel like an overly complex and overwhelming topic.&lt;/p&gt;

&lt;p&gt;To get the most out of this blog post, we’d recommend you have some experience training machine learning or deep learning models and be confident using Python. &lt;a href="https://blog.jetbrains.com/pycharm/2024/12/introduction-to-sentiment-analysis-in-python/" rel="noopener noreferrer"&gt;Our introductory blog post on sentiment analysis with Python&lt;/a&gt; is a great place to begin. That said, you don’t necessarily need to have a background in large language models to enjoy it.&lt;/p&gt;

&lt;p&gt;Let’s get started!&lt;/p&gt;

&lt;h2&gt;
  
  
  What are large language models?
&lt;/h2&gt;

&lt;p&gt;Large language models are some of the latest and most powerful &lt;a href="https://www.jetbrains.com/help/pycharm/scientific-tools.html" rel="noopener noreferrer"&gt;tools&lt;/a&gt; for solving natural language problems. In brief, they are generalist language models that can complete a range of natural language tasks, from named entity recognition to question answering. LLMs are based on the transformer architecture, a type of neural network that uses a mechanism called attention to represent complex and nuanced relationships between words in a piece of text. This design allows LLMs to accurately represent the information being conveyed in a piece of text.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://huggingface.co/learn/nlp-course/chapter1/4?fw=pt" rel="noopener noreferrer"&gt;full transformer model architecture&lt;/a&gt; consists of two blocks. Encoder blocks are designed to receive text inputs and build a representation of them, creating a feature set based on the text corpus over which the model is trained. Decoder blocks take the features generated by the encoder and other inputs and attempt to generate a sequence based on these.&lt;/p&gt;

&lt;p&gt;Transformer models can be divided up based on whether they contain encoder blocks, decoder blocks, or both.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/learn/nlp-course/chapter1/5?fw=pt" rel="noopener noreferrer"&gt;Encoder-only models&lt;/a&gt; tend to be good at tasks requiring a detailed understanding of the input to do downstream tasks, like text classification and named entity recognition.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/learn/nlp-course/chapter1/6?fw=pt" rel="noopener noreferrer"&gt;Decoder-only models&lt;/a&gt; are best for tasks such as text generation.&lt;/li&gt;
&lt;li&gt;Encoder-decoder, or &lt;a href="https://huggingface.co/learn/nlp-course/chapter1/7?fw=pt" rel="noopener noreferrer"&gt;sequence-to-sequence models&lt;/a&gt; are mainly used for tasks that require the model to evaluate an input and generate a different output, such as translation. In fact, translation was the original task that transformer models were designed for!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This &lt;a href="https://huggingface.co/learn/nlp-course/chapter1/9?fw=pt" rel="noopener noreferrer"&gt;Hugging Face table&lt;/a&gt; (also featured below), which I took from their course on &lt;a href="https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt" rel="noopener noreferrer"&gt;natural language processing&lt;/a&gt;, gives an overview of what each model tends to be strongest at.&lt;/p&gt;

&lt;p&gt;After finishing this blog post and discovering what other natural language tasks you can perform with the Transformers library, I recommend the course if you’d like to learn more about LLMs. It strikes an excellent balance between accessibility and technical depth.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Model type&lt;/th&gt;&lt;th&gt;Examples&lt;/th&gt;&lt;th&gt;Tasks&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Encoder-only&lt;/td&gt;&lt;td&gt;ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa&lt;/td&gt;&lt;td&gt;Sentence classification, named entity recognition, extractive question answering&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Decoder-only&lt;/td&gt;&lt;td&gt;CTRL, GPT, GPT-2, Transformer XL&lt;/td&gt;&lt;td&gt;Text generation&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Encoder-decoder&lt;/td&gt;&lt;td&gt;BART, T5, Marian, mBART&lt;/td&gt;&lt;td&gt;Summarization, translation, generative question answering&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Sentiment analysis is usually treated as a text or sentence classification problem with LLMs, meaning that encoder-only models such as RoBERTa, BERT, and ELECTRA are most often used for this task. However, there are some exceptions. For example, the top scoring model for aspect-based sentiment analysis, InstructABSA, is based on a fine-tuned version of T5, an encoder-decoder model.&lt;/p&gt;
&lt;h2&gt;
  
  
  Using large language models for sentiment analysis
&lt;/h2&gt;

&lt;p&gt;With all of the background out of the way, we can now get started with using LLMs to do sentiment analysis.&lt;/p&gt;
&lt;h3&gt;
  
  
  Install PyCharm to get started with sentiment analysis
&lt;/h3&gt;

&lt;p&gt;We’ll use &lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;PyCharm Professional&lt;/a&gt; for this demo, but you can follow along with any other IDE that supports Python development.&lt;/p&gt;

&lt;p&gt;PyCharm Professional is a powerful Python IDE for &lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;data science&lt;/a&gt;. It supports advanced Python &lt;a href="https://www.jetbrains.com/help/pycharm/auto-completing-code.html" rel="noopener noreferrer"&gt;code completion&lt;/a&gt;, inspections and &lt;a href="https://www.jetbrains.com/help/pycharm/debugging-code.html" rel="noopener noreferrer"&gt;debugging&lt;/a&gt;, and rich support for &lt;a href="https://www.jetbrains.com/pycharm/integrations/#databases" rel="noopener noreferrer"&gt;databases&lt;/a&gt;, &lt;a href="https://www.jetbrains.com/help/pycharm/running-jupyter-notebook-cells.html" rel="noopener noreferrer"&gt;Jupyter&lt;/a&gt;, &lt;a href="https://www.jetbrains.com/help/pycharm/using-git-integration.html" rel="noopener noreferrer"&gt;Git&lt;/a&gt;, &lt;a href="https://www.jetbrains.com/help/pycharm/conda-support-creating-conda-virtual-environment.html" rel="noopener noreferrer"&gt;Conda&lt;/a&gt;, and more right out of the box. You can try out great features such as our DataFrame &lt;em&gt;Column Statistics&lt;/em&gt; and &lt;em&gt;Chart View&lt;/em&gt;, as well as &lt;a href="https://blog.jetbrains.com/pycharm/2024/11/hugging-face-integration/" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt; integrations, which make working with LLMs much simpler and faster.&lt;/p&gt;

&lt;p&gt;If you’d like to follow along with this tutorial, you can activate your free three-month subscription to PyCharm using this special promo code: &lt;strong&gt;PCSA24&lt;/strong&gt;. Click on the link below, and enter the code. You’ll then receive an activation code through your email.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/store/redeem/" rel="noopener noreferrer"&gt;Activate your free three-month subscription&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Import the required libraries
&lt;/h3&gt;

&lt;p&gt;There are two parts to this tutorial: using an LLM that someone else has fine-tuned for sentiment analysis, and fine-tuning a model ourselves.&lt;/p&gt;

&lt;p&gt;In order to run both parts of this tutorial, we need to import the following packages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transformers: As described, this will allow us to use fine-tuned LLMs for sentiment analysis and fine-tune our own models.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pytorch.org/" rel="noopener noreferrer"&gt;PyTorch&lt;/a&gt;, &lt;a href="https://www.tensorflow.org/" rel="noopener noreferrer"&gt;Tensorflow&lt;/a&gt;, or &lt;a href="https://flax.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;Flax&lt;/a&gt;: Transformers acts as a high-level interface for deep learning frameworks, reusing their functionality for building, training, and running neural networks. In order to actually work with LLMs using the Transformers package, you will need to install your choice of PyTorch, Tensorflow, or Flax. PyTorch supports the &lt;a href="https://jax.readthedocs.io/en/latest/quickstart.html" rel="noopener noreferrer"&gt;largest number of models&lt;/a&gt; of the three frameworks, so that’s the one we’ll use in this tutorial.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/docs/datasets/en/index" rel="noopener noreferrer"&gt;Datasets&lt;/a&gt;: This is another package from Hugging Face that allows you to easily work with the datasets hosted on Hugging Face Hub. We’ll need this package to get a dataset to fine-tune an LLM for sentiment analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In order to fine-tune our own model, we also need to import these additional packages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://numpy.org/" rel="noopener noreferrer"&gt;NumPy&lt;/a&gt;: NumPy allows us to work with arrays. We’ll need this to do some post-processing on the predictions generated by our LLM.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scikit-learn.org/stable/" rel="noopener noreferrer"&gt;scikit-learn&lt;/a&gt;: This package contains a huge range of functionality for machine learning. We’ll use it to evaluate the performance of our model.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/docs/evaluate/en/index" rel="noopener noreferrer"&gt;Evaluate&lt;/a&gt;: This is another package from Hugging Face. Evaluate adds a convenient interface for measuring the performance of models. It will give us an alternative way of measuring our model’s performance.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/docs/accelerate/en/index" rel="noopener noreferrer"&gt;Accelerate&lt;/a&gt;: This final package from Hugging Face, Accelerate, takes care of distributed model training.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can easily find and install these in PyCharm. Make sure you’re using a Python 3.7 or higher interpreter. For this demo, we’ll be using Python 3.11.7.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flapmpaqiqoeyhoa5a538.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flapmpaqiqoeyhoa5a538.png" width="800" height="197"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Pick the right model
&lt;/h3&gt;

&lt;p&gt;The next step is picking the right model. Before we get into that, we need to cover some terminology.&lt;/p&gt;

&lt;p&gt;LLMs are made up of two components: an &lt;em&gt;architecture&lt;/em&gt; and a &lt;em&gt;checkpoint&lt;/em&gt;. The architecture is like the blueprint of the model, and describes what will be contained in each layer and each operation that takes place within the model.&lt;/p&gt;

&lt;p&gt;The checkpoint refers to the weights that will be used within each layer. Each of the pretrained models will use an architecture like T5 or GPT, and obtain the specific weights (the model checkpoint) by training the model over a huge corpus of text data.&lt;/p&gt;

&lt;p&gt;Fine-tuning will adjust the weights in the checkpoint by retraining the last layer(s) on a dataset specialized for a certain task or domain. To make predictions (called &lt;em&gt;inference&lt;/em&gt;), an architecture loads in a checkpoint and uses it to process text inputs; together, the two are called a model.&lt;/p&gt;
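
&lt;p&gt;To make the distinction concrete, here’s a small illustrative sketch using the Transformers library, with BERT as an example checkpoint: the config captures the architecture, while &lt;code&gt;from_pretrained()&lt;/code&gt; also downloads the checkpoint weights.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import AutoConfig, AutoModel

# The config is the blueprint: layer count, hidden size, and so on
config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.num_hidden_layers, config.hidden_size)

# from_pretrained() loads the architecture plus the trained checkpoint weights
model = AutoModel.from_pretrained("bert-base-uncased")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
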

&lt;p&gt;If you’ve ever looked at the &lt;a href="https://huggingface.co/models" rel="noopener noreferrer"&gt;models available on Hugging Face&lt;/a&gt;, you might have been overwhelmed by the sheer number of them (even when we narrow them down to encoder-only models).&lt;/p&gt;

&lt;p&gt;So, how do you know which one to use for sentiment analysis?&lt;/p&gt;

&lt;p&gt;One useful place to start is the &lt;a href="https://paperswithcode.com/task/sentiment-analysis" rel="noopener noreferrer"&gt;sentiment analysis page&lt;/a&gt; on Papers With Code. This page includes a very helpful overview of this task and a Benchmarks table that includes the top-performing models for each sentiment analysis benchmarking dataset. From this page, we can see that some of the commonly appearing models are those based on BERT and RoBERTa architectures.&lt;/p&gt;

&lt;p&gt;While we may not be able to access these exact model checkpoints on Hugging Face (as not all of them will be uploaded there), it can give us a guide for what sorts of models might perform well at this task. Papers With Code also has similar pages for a range of other natural language tasks: If you search for the task in the upper left-hand corner of the site, you can navigate to these.&lt;/p&gt;

&lt;p&gt;Now that we know what kinds of architectures are likely to do well for this problem, we can start searching for a specific model.&lt;/p&gt;

&lt;p&gt;PyCharm has a built-in integration with Hugging Face that allows us to search for models directly. Simply right-click anywhere in your Jupyter notebook or Python script, and select &lt;em&gt;Insert HF model&lt;/em&gt;. You’ll be presented with the following window:&lt;/p&gt;



&lt;p&gt;You can see that we can find Hugging Face models either by the task type (which we can select from the menu on the left-hand side), by keyword search in the search box at the top of the window, or by a combination of both. Models are ranked by the number of likes by default, but we can also select models based on downloads or when the model was created or last modified.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foou9nn89ut0m3c7st2jc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foou9nn89ut0m3c7st2jc.png" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you use a model for a task, the checkpoint is downloaded and cached, making it faster the next time you need to use that model. You can see all of the models you’ve downloaded in the &lt;em&gt;Hugging Face&lt;/em&gt; tool window.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxlhxrnmbk0x35iqulkz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxlhxrnmbk0x35iqulkz.png" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we’ve downloaded the model, we can also look at its model card again by hovering over the model name in our Jupyter notebook or Python script. We can do the same thing with dataset cards.&lt;/p&gt;
&lt;h2&gt;
  
  
  Use a fine-tuned LLM for sentiment analysis
&lt;/h2&gt;

&lt;p&gt;Let’s move on to how we can use a model that someone else has already fine-tuned for sentiment analysis.&lt;/p&gt;

&lt;p&gt;As mentioned, sentiment analysis is usually treated as a text classification problem for LLMs.  This means that in our Hugging Face model selection window, we’ll select &lt;em&gt;Text Classification&lt;/em&gt;, which can be found under &lt;em&gt;Natural Language Processing&lt;/em&gt; on the left-hand side. To narrow the results down to sentiment analysis models, we’ll type “sentiment” in the search box in the upper left-hand corner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpfp99vdjdsjqakse4r4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpfp99vdjdsjqakse4r4.png" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see various fine-tuned models, and as expected from what we saw on the Papers With Code Benchmarks table, most of them use RoBERTa or BERT architectures. Let’s try out the top ranked model, &lt;em&gt;Twitter-roBERTa-base for Sentiment Analysis&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8y9mlnkxffyrep7co48u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8y9mlnkxffyrep7co48u.png" width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see that after we select &lt;em&gt;Use Model&lt;/em&gt; in the Hugging Face model selection window, code is automatically generated at the caret in our Jupyter notebook or Python script to allow us to start working with this model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import pipeline
pipe = pipeline("text-classification",
 model="cardiffnlp/twitter-roberta-base-sentiment-latest")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before we can do inference with this model, we’ll need to modify this code.&lt;/p&gt;

&lt;p&gt;The first thing we can check is whether we have a GPU available, which will make the model run faster. We’ll check for two types: NVIDIA GPUs, which support CUDA, and Apple GPUs, which support MPS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"MPS available: {torch.backends.mps.is_available()}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My computer supports MPS, so we can add a &lt;code&gt;device&lt;/code&gt; argument to the pipeline and set it to &lt;code&gt;"mps"&lt;/code&gt;. If your computer supports CUDA, you can instead pass &lt;code&gt;device=0&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import pipeline

pipe = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-sentiment-latest",
                device="mps")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we can get the fine-tuned LLM to run inference over our example text.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;result = pipe("I love PyCharm! It's my favorite Python IDE.")
result

[{'label': 'positive', 'score': 0.9914802312850952}]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that this model predicts the text is positive, with 99% probability.&lt;/p&gt;
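
&lt;p&gt;Pipelines also accept a list of texts, which is handy for scoring several examples at once. Here’s a quick sketch with a couple of made-up inputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;texts = [
    "The debugger saved me hours today.",
    "The app crashes every time I open it.",
]

# The pipeline returns one prediction per input, in the same order
for text, prediction in zip(texts, pipe(texts)):
    print(f"{prediction['label']} ({prediction['score']:.3f}): {text}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
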

&lt;h2&gt;
  
  
  Fine-tune your own LLM for sentiment analysis
&lt;/h2&gt;

&lt;p&gt;The other way we can use LLMs for sentiment analysis is to fine-tune our own model.&lt;/p&gt;

&lt;p&gt;You might wonder why you’d bother doing this, given the huge number of fine-tuned models that already exist on Hugging Face Hub. The main reason you might want to fine-tune a model is so that you can tailor it to your specific use case.&lt;/p&gt;

&lt;p&gt;Most models are fine-tuned on public datasets, especially social media posts and movie reviews, and you might need your model to be more sensitive to your specific domain or use case.&lt;/p&gt;

&lt;p&gt;Model fine-tuning can be quite a complex topic, so in this demonstration, I’ll explain how to do it at a more general level. However, if you want to understand this in more detail, you can read more about it in Hugging Face’s excellent NLP course, which I recommended earlier. In their tutorial, they explain in detail how to &lt;a href="https://huggingface.co/learn/nlp-course/chapter3/2?fw=pt" rel="noopener noreferrer"&gt;process data&lt;/a&gt; for fine-tuning models and two different approaches to fine-tuning: with the &lt;a href="https://huggingface.co/learn/nlp-course/chapter3/3?fw=pt" rel="noopener noreferrer"&gt;trainer API&lt;/a&gt; and &lt;a href="https://huggingface.co/learn/nlp-course/chapter3/4?fw=pt" rel="noopener noreferrer"&gt;without&lt;/a&gt; it.&lt;/p&gt;

&lt;p&gt;To demonstrate how to fine-tune a model, we’ll use the &lt;a href="https://huggingface.co/datasets/stanfordnlp/sst2" rel="noopener noreferrer"&gt;SST-2 dataset&lt;/a&gt;, which is composed of single lines pulled from movie reviews that have been annotated as either negative or positive.&lt;/p&gt;

&lt;p&gt;As mentioned earlier, BERT models consistently show up as top performers on the Papers With Code benchmarks, so we’ll fine-tune a BERT checkpoint.&lt;/p&gt;

&lt;p&gt;We can again search for these models in PyCharm’s Hugging Face model selection window.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foaqk239t7p47w4bfi2cy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foaqk239t7p47w4bfi2cy.png" width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that the most popular BERT model is &lt;code&gt;bert-base-uncased&lt;/code&gt;. This is perfect for our use case: it was trained on lowercase text, so it will match the casing of our dataset.&lt;/p&gt;

&lt;p&gt;We could have used the popular &lt;code&gt;bert-large-uncased&lt;/code&gt;, but with 110 million parameters versus BERT large’s 340 million, the base model is a bit friendlier for fine-tuning on a local machine.&lt;/p&gt;

&lt;p&gt;If you still want to use a smaller model, you could also try this with a &lt;a href="https://huggingface.co/distilbert/distilbert-base-uncased" rel="noopener noreferrer"&gt;DistilBERT model&lt;/a&gt;, which has far fewer parameters but still preserves most of the performance of the original BERT models.&lt;/p&gt;
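
&lt;p&gt;If you do try that route, the only change needed is the checkpoint string we define below; a hypothetical swap might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical swap: use DistilBERT instead of BERT base
checkpoint = "distilbert/distilbert-base-uncased"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
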

&lt;p&gt;Let’s start by reading in our dataset. We can do so using the &lt;code&gt;load_dataset()&lt;/code&gt; function from the Datasets package. SST-2 is part of the &lt;a href="https://huggingface.co/datasets/nyu-mll/glue" rel="noopener noreferrer"&gt;GLUE&lt;/a&gt; dataset, which is designed to see how well a model can complete a range of natural language tasks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datasets import load_dataset

sst_2_raw = load_dataset("glue", "sst2")
sst_2_raw

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This dataset has already been split into the train, validation, and test sets. We have 67,349 training examples – quite a modest number for fine-tuning such a large model.&lt;/p&gt;

&lt;p&gt;Here’s an example from this dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sst_2_raw["train"][1]

{'sentence': 'contains no wit , only labored gags ', 'label': 0, 'idx': 1}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see what the labels mean by calling the features attribute on the training set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sst_2_raw["train"].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'positive'], id=None),
 'idx': Value(dtype='int32', id=None)}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;0 indicates a negative sentiment, and 1 indicates a positive one.&lt;/p&gt;

&lt;p&gt;Let’s look at the number in each class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(f'Number of negative examples: {sst_2_raw["train"]["label"].count(0)}')
print(f'Number of positive examples: {sst_2_raw["train"]["label"].count(1)}')

Number of negative examples: 29780
Number of positive examples: 37569
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The classes in our training data are a tad unbalanced, but they aren’t excessively skewed.&lt;/p&gt;

&lt;p&gt;We now need to tokenize our data, transforming the raw text into a form that our model can use. To do this, we need to use the same tokenizer that was used to train the &lt;code&gt;bert-base-uncased&lt;/code&gt; model in the first place. The &lt;code&gt;AutoTokenizer&lt;/code&gt; class will take care of all of the under-the-hood details for us.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import AutoTokenizer

checkpoint = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once we’ve loaded in the correct tokenizer, we can apply this to the training data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tokenised_sentences = tokenizer(sst_2_raw["train"]["sentence"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we need to add a function to pad our tokenized sentences. This will make sure all of the inputs in a training batch are the same length – text inputs are rarely the same length and models require a consistent number of features for each input.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import DataCollatorWithPadding

def tokenize_function(example):
    return tokenizer(example["sentence"])

tokenized_datasets = sst_2_raw.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we’ve prepared our dataset, we need to determine how well the model is fitting to the data as it trains. To do this, we need to decide which metrics to use to evaluate the model’s prediction performance.&lt;/p&gt;

&lt;p&gt;As we’re dealing with a binary classification problem, we have a few choices of metrics, the most popular of which are accuracy, precision, recall, and the F1 score. In the “Evaluate the model” section, we’ll discuss the pros and cons of using each of these measures.&lt;/p&gt;

&lt;p&gt;We have two ways of creating an evaluation function for our model. The first is using the Evaluate package. This package allows us to use the specific evaluator for the SST-2 dataset, meaning we’ll evaluate the model fine-tuning using the specific metrics for this task. In the case of SST-2, the metric used is accuracy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import evaluate
import numpy as np

def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "sst2")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, if we want to customize the metrics used, we can also create our own evaluation function. &lt;/p&gt;

&lt;p&gt;In this case, I’ve imported the accuracy, precision, recall, and F1 score metrics from scikit-learn. I’ve then created a function which takes in the &lt;em&gt;predicted&lt;/em&gt; labels versus &lt;em&gt;actual&lt;/em&gt; labels for each sentence and calculates the four required metrics. We’ll use this function, as it gives us a wider variety of metrics we can check our model performance against.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import numpy as np

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1': f1_score(labels, predictions, average='macro'),
        'precision': precision_score(labels, predictions, average='macro'),
        'recall': recall_score(labels, predictions, average='macro')
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we’ve done all of the setup, we’re ready to train the model. The first thing we need to do is define some parameters that will control the training process using the &lt;code&gt;TrainingArguments&lt;/code&gt; class. We’ve only specified a few parameters here, but &lt;a href="https://huggingface.co/docs/transformers/v4.43.4/en/main_classes/trainer#transformers.TrainingArguments" rel="noopener noreferrer"&gt;this class&lt;/a&gt; has an enormous number of possible arguments allowing you to calibrate your model training to a high degree of specificity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="sst2-bert-fine-tuning",
                                  eval_strategy="epoch",
                                  num_train_epochs=3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our case, we’ve used the following arguments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;output_dir&lt;/code&gt;: The output directory where we want our model predictions and checkpoints saved.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;eval_strategy="epoch"&lt;/code&gt;: This ensures that the evaluation is performed at the end of each training epoch. Other possible values are “steps” (meaning that evaluation is done at regular step intervals) and “no” (meaning that evaluation is not done during training).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;num_train_epochs=3&lt;/code&gt;: This sets the number of training epochs (or the number of times the training loop will repeat over all of the data). In this case, it’s set to train on the data three times.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The next step is to load in our pre-trained BERT model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s break this down step-by-step:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;AutoModelForSequenceClassification&lt;/code&gt; class does two things. First, it automatically identifies the appropriate model architecture from the Hugging Face model hub given the provided checkpoint string. In our case, this would be the BERT architecture. Second, it converts this model into one we can use for classification. It does this by discarding the weights in the model’s final layer(s) so that we can retrain these using our sentiment analysis dataset.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;from_pretrained()&lt;/code&gt; method loads in our selected checkpoint, which in this case is &lt;code&gt;bert-base-uncased&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The argument &lt;code&gt;num_labels=2&lt;/code&gt; indicates that we have two classes to predict in our model: &lt;em&gt;positive&lt;/em&gt; and &lt;em&gt;negative&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We get a message telling us that some model weights were not initialized when we ran this code. This message is exactly the one we want – it tells us that the &lt;code&gt;AutoModelForSequenceClassification&lt;/code&gt; class reset the final model weights in preparation for our fine-tuning.&lt;/p&gt;

&lt;p&gt;The last step is to set up our &lt;code&gt;Trainer&lt;/code&gt; object. This stage takes in the model, the training arguments, the train and validation datasets, our tokenizer and padding function, and our evaluation function. It uses all of these to train the weights for the head (or final layers) of the BERT model, evaluating the performance of the model after each epoch on the validation set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now kick off the training. The &lt;code&gt;Trainer&lt;/code&gt; class gives us a nice timer that tells us both the elapsed time and how much longer the training is estimated to take. We can also see the metrics after each epoch, as we requested when creating the &lt;code&gt;TrainingArguments&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;trainer.train()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3s971aybo3d8hhe2dz1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3s971aybo3d8hhe2dz1.png" width="800" height="105"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluate the model
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Classification metrics
&lt;/h4&gt;

&lt;p&gt;Before we have a look at how our model performed, let’s first discuss the evaluation metrics we used in more detail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt;: As mentioned, this is the default evaluation metric for the SST-2 dataset. Accuracy is the simplest metric for evaluating classification models, being the ratio of correct predictions to all predictions. Accuracy is a good choice when the target classes are well balanced, meaning each class has an approximately equal number of instances.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precision&lt;/strong&gt;: Precision calculates the percentage of the correctly predicted positive observations to the total predicted positives. It is important when the cost of a false positive is high. For example, in spam detection, you would rather miss a spam email (false negative) than have non-spam emails land in your spam folder (false positive).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall (also known as sensitivity)&lt;/strong&gt;: Recall calculates the percentage of the correctly predicted positive observations to all observations in the actual class. It is of interest when the cost of false negatives is high, meaning classifying a positive class incorrectly as negative. For example, in disease diagnosis, you would rather have false alarms (false positives) than miss someone who is actually ill (false negatives).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;F1-score&lt;/strong&gt;: The F1-score is the harmonic mean of precision and recall. It tries to find the balance between both measures. It is a more reliable metric than accuracy when dealing with imbalanced classes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In our case, we had slightly imbalanced classes, so it’s a good idea to check both accuracy and the F1 score. If they differ, the F1 score is likely to be more trustworthy. However, if they are roughly the same, it is nice to be able to use accuracy, as it is easily interpretable.&lt;/p&gt;

&lt;p&gt;Knowing whether your model is better at predicting one class versus the other is also useful. Depending on your application, capturing all customers who are unhappy with your service may be more important, even if you sometimes get false negatives. In this case, a model with high recall would be a priority over high precision.&lt;/p&gt;
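
&lt;p&gt;One simple way to check per-class behavior is a confusion matrix over the validation set. This sketch reuses the &lt;code&gt;trainer&lt;/code&gt; and &lt;code&gt;tokenized_datasets&lt;/code&gt; objects from earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
from sklearn.metrics import confusion_matrix

# trainer.predict() returns the model logits plus the true labels
preds = trainer.predict(tokenized_datasets["validation"])
predicted_labels = np.argmax(preds.predictions, axis=-1)

# Rows are actual classes (0 = negative, 1 = positive), columns are predictions
print(confusion_matrix(preds.label_ids, predicted_labels))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
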

&lt;h4&gt;
  
  
  Model predictions
&lt;/h4&gt;

&lt;p&gt;Now that we’ve trained our model, we need to evaluate it. Normally, we would use the test set to get a final, unbiased evaluation, but the SST-2 test set does not have labels, so we cannot use it for evaluation. In this case, we’ll use the validation set accuracy scores for our final evaluation. We can do this using the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;trainer.evaluate(eval_dataset=tokenized_datasets["validation"])

{'eval_loss': 0.4223457872867584,
 'eval_accuracy': 0.9071100917431193,
 'eval_f1': 0.9070209502998072,
 'eval_precision': 0.9074841225920363,
 'eval_recall': 0.9068472678285763,
 'eval_runtime': 3.9341,
 'eval_samples_per_second': 221.649,
 'eval_steps_per_second': 27.706,
 'epoch': 3.0}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We see that the model has 90% accuracy on the validation set, comparable to other &lt;a href="https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english" rel="noopener noreferrer"&gt;BERT models trained on SST-2&lt;/a&gt;. If we wanted to improve our model performance, we could investigate a few things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Check whether the model is overfitting&lt;/strong&gt; : While small by LLM standards, the BERT model we used for fine-tuning is still very large, and our training set was quite modest. In such cases, overfitting is quite common. To check this, we should compare our validation set metrics with our training set metrics. If the training set metrics are much higher than the validation set metrics, then we have overfit the model. You can adjust a &lt;a href="https://discuss.huggingface.co/t/bert-fine-tuning-low-epochs/54869" rel="noopener noreferrer"&gt;range of parameters&lt;/a&gt; during model training to help mitigate this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Train on more epochs&lt;/strong&gt; : In this example, we only trained the model for three epochs. If the model is not overfitting, continuing to train it for longer may improve its performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check where the model has misclassified&lt;/strong&gt; : We could dig into where the model is classifying correctly and incorrectly to see if we could spot a pattern. This may allow us to spot any issues with ambiguous cases or mislabelled data. Perhaps the fact this is a binary classification problem with no label for “neutral” sentiment means there is a subset of sentences that the model cannot properly classify.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To finish our section on evaluating this model, let’s see how it performs on our test sentence. We’ll pass our fine-tuned model and tokenizer to a &lt;code&gt;TextClassificationPipeline&lt;/code&gt;, then pass our sentence to this pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import TextClassificationPipeline

pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=True)

predictions = pipeline("I love PyCharm! It's my favourite Python IDE.")

print(predictions)

[[{'label': 'LABEL_0', 'score': 0.0006891043740324676}, {'label': 'LABEL_1', 'score': 0.9993108510971069}]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our model assigns &lt;code&gt;LABEL_0&lt;/code&gt; (negative) a probability of 0.0007 and &lt;code&gt;LABEL_1&lt;/code&gt; (positive) a probability of 0.999, indicating it predicts that the sentence has a positive sentiment with 99% certainty. This result is similar to the one we got from the fine-tuned RoBERTa model we used earlier in the post.&lt;/p&gt;
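
&lt;p&gt;The &lt;code&gt;LABEL_0&lt;/code&gt; and &lt;code&gt;LABEL_1&lt;/code&gt; names are just the defaults for a freshly fine-tuned model. If you’d prefer human-readable labels, you can set the mapping on the model config yourself; a minimal sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Map the default class indices to readable names
model.config.id2label = {0: "negative", 1: "positive"}
model.config.label2id = {"negative": 0, "positive": 1}

# Depending on your Transformers version, you may need to recreate the
# pipeline for the new names to be picked up in its output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
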

&lt;h4&gt;
  
  
  Sentiment analysis benchmarks
&lt;/h4&gt;

&lt;p&gt;Instead of evaluating the model on only the dataset it was trained on, we could also assess it on other datasets.&lt;/p&gt;

&lt;p&gt;As you can see from the Papers With Code benchmarking table, you can use a wide variety of labeled datasets to assess the performance of your sentiment classifiers. These datasets include the &lt;a href="https://huggingface.co/datasets/SetFit/sst5" rel="noopener noreferrer"&gt;SST-5 fine-grained classification&lt;/a&gt;, &lt;a href="https://huggingface.co/datasets/stanfordnlp/imdb" rel="noopener noreferrer"&gt;IMDB dataset&lt;/a&gt;, &lt;a href="https://huggingface.co/datasets/yassiracharki/Yelp_Reviews_for_Binary_Senti_Analysis" rel="noopener noreferrer"&gt;Yelp binary&lt;/a&gt; and &lt;a href="https://www.kaggle.com/datasets/yacharki/yelp-reviews-for-sa-finegrained-5-classes-csv" rel="noopener noreferrer"&gt;fine-grained classification&lt;/a&gt;, &lt;a href="https://huggingface.co/datasets/fancyzhx/amazon_polarity" rel="noopener noreferrer"&gt;Amazon review polarity&lt;/a&gt;, &lt;a href="https://huggingface.co/datasets/cardiffnlp/tweet_eval" rel="noopener noreferrer"&gt;TweetEval&lt;/a&gt;, and the &lt;a href="https://www.kaggle.com/datasets/charitarth/semeval-2014-task-4-aspectbasedsentimentanalysis" rel="noopener noreferrer"&gt;SemEval Aspect-based&lt;/a&gt; sentiment analysis dataset.&lt;/p&gt;

&lt;p&gt;When evaluating your model, the main thing is to ensure that the datasets represent your problem domain.&lt;/p&gt;

&lt;p&gt;Most of the benchmarking datasets contain either reviews or social media texts, so if your problem is in either of these domains, you may find an existing benchmark that mirrors your business domain closely enough. However, if you are applying sentiment analysis to a more specialized problem, it may be necessary to create your own benchmarks to ensure your model can generalize to your problem domain properly.&lt;/p&gt;

&lt;p&gt;Since there are multiple ways of measuring sentiment, it’s also necessary to make sure that any benchmarks you use to assess your model have the same target as the dataset you trained your model on.&lt;/p&gt;

&lt;p&gt;For example, it wouldn’t be a fair measure of a model’s performance to fine-tune it on the SST-2 with a binary target, and then test it on the SST-5. As the model has never seen the &lt;em&gt;very positive&lt;/em&gt;, &lt;em&gt;very negative&lt;/em&gt;, and &lt;em&gt;neutral&lt;/em&gt; categories, it will not be able to accurately predict texts with these labels and hence will perform poorly.&lt;/p&gt;
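
&lt;p&gt;To illustrate what a fair cross-benchmark check could look like: IMDB shares SST-2’s binary positive/negative target, so evaluating our fine-tuned model on a slice of it might look something like this (a sketch only; the small test slice just keeps the run quick):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datasets import load_dataset

# IMDB also uses 0 = negative, 1 = positive, matching our SST-2 labels
imdb = load_dataset("stanfordnlp/imdb", split="test[:200]")

# Truncate long reviews so they fit within BERT's 512-token limit
tokenized_imdb = imdb.map(
    lambda batch: tokenizer(batch["text"], truncation=True), batched=True
)

print(trainer.evaluate(eval_dataset=tokenized_imdb))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
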

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;In this blog post, we saw how LLMs can be a powerful way of classifying the sentiment expressed in a piece of text and took a hands-on approach to fine-tuning an LLM for this purpose.&lt;/p&gt;

&lt;p&gt;We saw how understanding which types of models are best suited to sentiment analysis can help you narrow down your options, and how resources like Papers With Code let you see the top-performing models on different benchmarks.&lt;/p&gt;

&lt;p&gt;We also learned how Hugging Face’s powerful tooling for using these models and their integration into PyCharm makes using LLMs for sentiment analysis approachable for anyone with a background in machine learning.&lt;/p&gt;

&lt;p&gt;If you’d like to continue learning about large language models, check out our guest blog post by Dido Grigorov, who explains how to &lt;a href="https://blog.jetbrains.com/pycharm/2024/08/how-to-build-chatbots-with-langchain/" rel="noopener noreferrer"&gt;build a chatbot using the LangChain package&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get started with sentiment analysis with PyCharm today
&lt;/h2&gt;

&lt;p&gt;If you’re ready to get started on your own sentiment analysis project, you can activate your free three-month subscription to PyCharm. Click on the link below, and enter this promo code: &lt;strong&gt;PCSA24&lt;/strong&gt;. You’ll then receive an activation code through your email.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/store/redeem/" rel="noopener noreferrer"&gt;Activate your free three-month subscription&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>ai</category>
      <category>llms</category>
      <category>pycharm</category>
    </item>
  </channel>
</rss>
