<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: PyCharm</title>
    <description>The latest articles on DEV Community by PyCharm (@pycharm).</description>
    <link>https://dev.to/pycharm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F10300%2F3ec3046c-353a-4634-8d18-8637962a97df.png</url>
      <title>DEV Community: PyCharm</title>
      <link>https://dev.to/pycharm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pycharm"/>
    <language>en</language>
    <item>
      <title>PyCharm, the Only Python IDE You Need</title>
      <dc:creator>Evgenia Verbina</dc:creator>
      <pubDate>Wed, 16 Apr 2025 12:11:26 +0000</pubDate>
      <link>https://dev.to/pycharm/pycharm-the-only-python-ide-you-need-45gj</link>
      <guid>https://dev.to/pycharm/pycharm-the-only-python-ide-you-need-45gj</guid>
      <description>&lt;p&gt;&lt;em&gt;Estimated reading time: 3 minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0tay6grff42s1mrtrkci.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0tay6grff42s1mrtrkci.png" alt="One PyCharm for Everyone" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/" rel="noopener noreferrer"&gt;PyCharm&lt;/a&gt; is now one powerful, unified product! Its core functionality, including Jupyter Notebook support, will be free, and a Pro subscription will be available with additional features. Starting with the 2025.1 release, every user will get instant access to a free one-month Pro trial, so you’ll be able to access all of PyCharm’s advanced features right away. After the trial, you can choose whether to continue with a Pro subscription or keep using the core features for free.&lt;/p&gt;

&lt;p&gt;Previously, PyCharm was offered as two separate products: the free Community Edition and the Professional Edition with extended capabilities. Now, with a single streamlined product, you no longer need to choose. Everything is in one place, and you can seamlessly switch between core and advanced features within the same installation whenever you need to.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;💡 What’s new?&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;✅ One product for all developers&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You no longer need to worry about additional downloads or switching between editions. PyCharm is now a single product. Start with a month of full Pro access for free, and then keep using the core features at no cost. Upgrade to Pro anytime within the same installation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;🎓 Free Jupyter Notebook support&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;PyCharm now offers free Jupyter support, including running, debugging, output rendering, and intelligent code assistance in notebooks. It’s perfect for data workflows, no Pro subscription required. However, a Pro subscription does offer more advanced capabilities, including remote notebooks, dynamic tables, SQL cells, and others.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;🚀 Seamless access to Pro&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With every new major PyCharm release (currently three times a year), you will get instant access to a free one-month Pro trial. Once it ends, you can continue using the core features for free.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;🛠️ One product, better quality&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Focusing on a single PyCharm product will help us improve overall quality, streamline updates, and deliver new features faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does it mean for me?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🐍 I’m a PyCharm Community Edition user
&lt;/h3&gt;

&lt;p&gt;First of all, &lt;strong&gt;thank you&lt;/strong&gt; for being part of our amazing community! Your feedback, passion, and contributions have helped shape PyCharm into the tool it is today.&lt;/p&gt;

&lt;p&gt;Nothing is going to change for you right away – you can upgrade to PyCharm Community 2025.1 as usual. Alternatively, you can switch to the new PyCharm manually right away and keep using everything you have now for free, plus Jupyter Notebook support.&lt;/p&gt;

&lt;p&gt;Starting with PyCharm 2025.2, we’ll offer a smooth migration path that preserves your current setup and preferences. PyCharm Community 2025.2 will be the final standalone version, and, from 2025.3 onward, all Community Edition users will transition to the unified PyCharm experience.&lt;/p&gt;

&lt;p&gt;Rest assured – our commitment to open-source development remains as strong as ever. The Community Edition codebase will stay public on GitHub, and we’ll continue to maintain and update it. We’ll also provide an easy way to build PyCharm from source via GitHub Actions.&lt;/p&gt;

&lt;p&gt;Have more questions about what’s next? Read &lt;a href="https://www.jetbrains.com/pycharm/download#faq" rel="noopener noreferrer"&gt;our extended FAQ&lt;/a&gt; for more details.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;👥 I’m a PyCharm Professional Edition user&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Nothing changes! Your license will automatically work with the new single PyCharm product. Simply &lt;a href="https://www.jetbrains.com/pycharm/download/" rel="noopener noreferrer"&gt;upgrade to PyCharm 2025.1&lt;/a&gt; and continue enjoying everything Pro has to offer.&lt;/p&gt;

&lt;h3&gt;
  
  
&lt;strong&gt;🆕 I’m new to PyCharm&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You can start right away with the new single PyCharm product. You’ll get a free one-month Pro trial with full functionality. After that, you can purchase a Pro subscription and keep using PyCharm with its full capabilities, or you can continue using just the core features – including Jupyter Notebook support – for free. &lt;a href="https://www.jetbrains.com/pycharm/download/" rel="noopener noreferrer"&gt;Download PyCharm&lt;/a&gt; now.&lt;/p&gt;

</description>
      <category>news</category>
      <category>releases</category>
    </item>
    <item>
      <title>Which Is the Best Python Web Framework: Django, Flask, or FastAPI?</title>
      <dc:creator>Evgenia Verbina</dc:creator>
      <pubDate>Tue, 18 Feb 2025 10:00:17 +0000</pubDate>
      <link>https://dev.to/pycharm/which-is-the-best-python-web-framework-django-flask-or-fastapi-5el4</link>
      <guid>https://dev.to/pycharm/which-is-the-best-python-web-framework-django-flask-or-fastapi-5el4</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllwzo0jjw3uica9qikgy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllwzo0jjw3uica9qikgy.png" alt="Which Is the best Python web framework: Django, Flask, or FastAPI?" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Search for Python web frameworks, and three names will consistently come up: &lt;a href="https://www.jetbrains.com/pycharm/web-development/django/" rel="noopener noreferrer"&gt;Django&lt;/a&gt;, Flask, and FastAPI. Our latest &lt;a href="https://lp.jetbrains.com/python-developers-survey-2023/" rel="noopener noreferrer"&gt;Python Developer Survey Results&lt;/a&gt; confirm that these three frameworks remain developers’ top choices for backend web development with Python.&lt;/p&gt;

&lt;p&gt;All three frameworks are open-source and compatible with the latest versions of Python. &lt;/p&gt;

&lt;p&gt;But how do you determine which web framework is best for your project? Here, we’ll look at the pros and cons of each and compare how they stack up against one another.&lt;/p&gt;

&lt;h2&gt;
  
  
  Django
&lt;/h2&gt;

&lt;p&gt;Django is a “batteries included”, full-stack web framework used by the likes of Instagram, Spotify, and Dropbox. Pitched as “the web framework for perfectionists with deadlines”, the &lt;a href="https://blog.jetbrains.com/pycharm/2023/11/what-is-the-django-web-framework/" rel="noopener noreferrer"&gt;Django framework&lt;/a&gt; was designed to make it easier and quicker to build robust web apps.&lt;/p&gt;

&lt;p&gt;First made available as an open-source project in 2005, Django is a mature project that remains in active development 20 years later. It’s suitable for many web applications, including social media, e-commerce, news, and entertainment sites.&lt;/p&gt;

&lt;p&gt;Django follows a model-view-template (MVT) architecture, where each component has a specific role. Models are responsible for handling the data and defining its structure. The views manage the business logic, processing requests and fetching the necessary data from the models. Finally, templates present this data to the end user – similar to views in a model-view-controller (MVC) architecture. &lt;/p&gt;
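
&lt;p&gt;To make that division of labor concrete, here’s a minimal sketch of the three layers working together (the &lt;code&gt;Article&lt;/code&gt; model and template path are illustrative, not from any particular project):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# models.py – the data layer
from django.db import models

class Article(models.Model):
    title = models.CharField(max_length=200)
    body = models.TextField()

# views.py – the business logic
from django.shortcuts import render
from .models import Article

def article_list(request):
    articles = Article.objects.all()  # fetch data via the ORM
    # The template renders the data; the view never builds HTML itself.
    return render(request, "articles/article_list.html", {"articles": articles})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;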

&lt;p&gt;As a full-stack web framework, Django can be used to build an entire web app (from database to HTML and JavaScript frontend).&lt;/p&gt;

&lt;p&gt;Alternatively, you can use the &lt;a href="https://blog.jetbrains.com/pycharm/2023/11/what-is-the-django-web-framework/" rel="noopener noreferrer"&gt;Django REST Framework&lt;/a&gt; to combine Django with a frontend framework (such as React) to build both mobile and browser-based apps.&lt;/p&gt;

&lt;p&gt;Explore our comprehensive &lt;a href="https://blog.jetbrains.com/pycharm/2024/01/how-to-learn-django/" rel="noopener noreferrer"&gt;Django guide&lt;/a&gt;, featuring an overview of prerequisite knowledge, a structured learning path, and additional resources to help you master the framework. &lt;/p&gt;

&lt;h3&gt;
  
  
  Django advantages
&lt;/h3&gt;

&lt;p&gt;There are plenty of reasons why Django remains one of the most widely used Python web frameworks, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extensive functionality:&lt;/strong&gt; With a “batteries included” approach, Django offers built-in features like authentication, caching, data validation, and session management. Its &lt;a href="https://docs.djangoproject.com/en/dev/misc/design-philosophies/#don-t-repeat-yourself-dry" rel="noopener noreferrer"&gt;don’t repeat yourself (DRY)&lt;/a&gt; principle speeds up development and reduces bugs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ease of setup:&lt;/strong&gt; Django simplifies dependency management with its built-in features, reducing the need for external packages. This helps streamline the initial setup and minimizes compatibility issues, so you can get up and running sooner.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database support:&lt;/strong&gt; Django’s ORM (object-relational mapping) makes data handling more straightforward, enabling you to work with databases like SQLite, MySQL, and PostgreSQL without needing SQL knowledge. However, it’s less suitable for non-relational databases like MongoDB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security:&lt;/strong&gt; Built-in defenses against common vulnerabilities such as cross-site scripting (XSS), SQL injection, and clickjacking help quickly secure your app from the start.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; Despite being monolithic, Django allows for horizontal scaling of the application’s architecture (business logic and templates), caching to ease database load, and asynchronous processing to improve efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community and documentation:&lt;/strong&gt; Django has a vast, active community and detailed &lt;a href="https://docs.djangoproject.com/en/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;, with tutorials and support readily available.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Django disadvantages
&lt;/h3&gt;

&lt;p&gt;Despite its many advantages, there are a few reasons you might want to look at options other than Django when developing your next web app.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Heavyweight:&lt;/strong&gt; Its “batteries included” design can be too much for smaller apps, where a lightweight framework like Flask may be more appropriate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning curve:&lt;/strong&gt; Django’s extensive features naturally come with a steeper learning curve, though there are plenty of resources available to help new developers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; Django is generally slower compared to other frameworks like Flask and FastAPI, but built-in caching and &lt;a href="https://www.youtube.com/watch?v=lkkxTceQft8" rel="noopener noreferrer"&gt;asynchronous processing&lt;/a&gt; can help improve the response times.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Flask
&lt;/h2&gt;

&lt;p&gt;Flask is a Python-based micro-framework for backend web development. However, don’t let the term “micro” deceive you. As we’ll see, Flask isn’t only limited to smaller web apps. &lt;/p&gt;

&lt;p&gt;Instead, Flask is designed with a simple core based on &lt;a href="https://palletsprojects.com/p/werkzeug/" rel="noopener noreferrer"&gt;Werkzeug WSGI&lt;/a&gt; (Web Server Gateway Interface) and &lt;a href="https://palletsprojects.com/p/jinja/" rel="noopener noreferrer"&gt;Jinja2 templates&lt;/a&gt;. Well-known users of Flask include Netflix, Airbnb, and Reddit.&lt;/p&gt;

&lt;p&gt;Flask was initially created as an April Fools’ Day joke and released as an open-source project in 2010, a few years after Django. The micro-framework’s approach is fundamentally different from Django’s. While Django takes a “batteries included” style and comes with a lot of the functionality you may need for building web apps, Flask is much leaner.&lt;/p&gt;

&lt;p&gt;The philosophy behind the micro-framework is that everyone has their preferences, so developers should be free to choose their own components. For this reason, Flask doesn’t include a database, ORM (object-relational mapper), or ODM (object-document mapper). &lt;/p&gt;

&lt;p&gt;When you build a web app with Flask, very little is decided for you upfront. This can have significant benefits, as we’ll discuss below.&lt;/p&gt;
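
&lt;p&gt;That minimalism is easiest to appreciate in code – a complete, runnable Flask application fits in a handful of lines (a standard hello-world sketch, not tied to any particular project):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# app.py – a complete Flask application
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "Hello, Flask!"

if __name__ == "__main__":
    app.run(debug=True)  # development server only
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;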

&lt;h3&gt;
  
  
  Flask advantages
&lt;/h3&gt;

&lt;p&gt;We’ve seen usage of Flask grow steadily over the last five years through &lt;a href="https://www.jetbrains.com/lp/devecosystem-2023/" rel="noopener noreferrer"&gt;our State of the Developer Ecosystem survey&lt;/a&gt; – it overtook Django for the first time in 2021. &lt;/p&gt;

&lt;p&gt;Some reasons for choosing Flask as a backend web framework include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight design:&lt;/strong&gt; Flask’s minimalist approach offers a flexible alternative to Django, making it ideal for smaller applications or projects where Django’s extensive features may feel excessive. However, Flask isn’t limited to small projects – you can extend it as needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility:&lt;/strong&gt; Flask allows you to choose the libraries and frameworks for core functionality, such as data handling and user authentication. This enables you to select the best tools for your project and extend it in unforeseen ways.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; Flask’s modular design makes it easy to scale horizontally. Using a NoSQL database layer can further enhance scalability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shallow learning curve:&lt;/strong&gt; Its simple design makes Flask easy to learn, though you may need to explore extensions for more complex apps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community and documentation:&lt;/strong&gt; Flask has extensive (if somewhat technical) &lt;a href="https://flask.palletsprojects.com/en/3.0.x/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; and a clear codebase. While its community is smaller than Django’s, Flask remains active and is growing steadily.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Flask disadvantages
&lt;/h3&gt;

&lt;p&gt;While Flask has a lot to offer, there are a few things to consider before you decide to use it in your next web development project.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bring your own everything:&lt;/strong&gt; Flask’s micro-framework design and flexibility require you to handle much of that core functionality, including data validation, session management, and caching. While this flexibility can be beneficial, it can also slow the development process, as you’ll need to find existing libraries or build features from scratch. Additionally, dependencies must be managed over time to ensure they remain compatible with Flask. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security:&lt;/strong&gt; Flask has minimal built-in security. Beyond securing client-side cookies, you must implement web security best practices and ensure the security of the dependencies you include, applying updates as needed. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; While Flask performs slightly better than Django, it lags behind FastAPI. Flask offers some &lt;a href="https://flask.palletsprojects.com/en/stable/deploying/asgi/" rel="noopener noreferrer"&gt;ASGI support&lt;/a&gt; (the standard used by FastAPI), but it is more tightly tied to WSGI.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FastAPI
&lt;/h2&gt;

&lt;p&gt;As the name suggests, FastAPI is a micro-framework for building high-performance web APIs with Python. Despite being relatively new – it was first released as an open-source project in 2018 – FastAPI has quickly become popular among developers, ranking third in our list of the most popular Python web frameworks since 2021.&lt;/p&gt;

&lt;p&gt;FastAPI is based on &lt;a href="https://www.uvicorn.org/" rel="noopener noreferrer"&gt;Uvicorn&lt;/a&gt;, an ASGI (Asynchronous Server Gateway Interface) server, and &lt;a href="https://www.starlette.io/" rel="noopener noreferrer"&gt;Starlette&lt;/a&gt;, a web micro-framework. FastAPI adds data validation, serialization, and documentation to streamline building web APIs.&lt;/p&gt;

&lt;p&gt;When developing FastAPI, the micro-framework’s creator drew on experience working with many different frameworks and tools. Whereas Django was developed before frontend JavaScript web frameworks (such as React or Vue.js) were prominent, FastAPI was designed with this context in mind. &lt;/p&gt;

&lt;p&gt;The emergence of &lt;a href="https://www.openapis.org/" rel="noopener noreferrer"&gt;OpenAPI&lt;/a&gt; (formerly Swagger) as a format for structuring and documenting APIs in the preceding years provided an industry standard that FastAPI could leverage.&lt;/p&gt;

&lt;p&gt;Beyond the obvious use case of creating RESTful APIs, FastAPI is ideal for applications that require real-time responses, such as messaging platforms and dashboards. Its high performance and asynchronous capabilities make it a good fit for data-intensive apps, including machine learning models, data processing, and analytics.&lt;/p&gt;
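
&lt;p&gt;As a rough sketch of what this looks like in practice (the &lt;code&gt;Item&lt;/code&gt; model and route are illustrative), a minimal FastAPI service might be:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# main.py – a minimal FastAPI service
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Item(BaseModel):
    name: str
    price: float

@app.post("/items/")
async def create_item(item: Item):
    # The JSON body is validated against Item before this function runs;
    # interactive OpenAPI docs are generated automatically at /docs.
    return {"name": item.name, "price": item.price}

# Run with: uvicorn main:app --reload
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;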

&lt;h3&gt;
  
  
  FastAPI advantages
&lt;/h3&gt;

&lt;p&gt;FastAPI first received its own category in &lt;a href="https://www.jetbrains.com/lp/devecosystem-2021/" rel="noopener noreferrer"&gt;our State of the Developer Ecosystem survey&lt;/a&gt; in 2021, with 14% of respondents using the micro-framework. &lt;/p&gt;

&lt;p&gt;Since then, usage has increased to 20%, alongside a slight dip in the use of Flask and Django. &lt;/p&gt;

&lt;p&gt;These are some of the reasons why developers are choosing FastAPI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; Designed for speed, FastAPI supports asynchronous processing and bi-directional web sockets (courtesy of Starlette). It outperformed both Django and Flask in benchmark tests, making it ideal for high-traffic applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; Like Flask, FastAPI is highly modular, making it easy to scale and ideal for containerized deployments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adherence to industry standards:&lt;/strong&gt; FastAPI is fully compatible with &lt;a href="https://oauth.net/2/" rel="noopener noreferrer"&gt;OAuth 2.0&lt;/a&gt;, OpenAPI (formerly Swagger), and JSON Schema. As a result, you can implement secure authentication and generate your API documentation with minimal effort.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ease of use:&lt;/strong&gt; FastAPI’s use of &lt;a href="https://pydantic.dev/" rel="noopener noreferrer"&gt;Pydantic&lt;/a&gt; for type hints and validation speeds up development by providing type checks, auto-completion, and request validation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation:&lt;/strong&gt; FastAPI comes with a sizable body of documentation and growing third-party resources, making it accessible for developers at all levels.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  FastAPI disadvantages
&lt;/h3&gt;

&lt;p&gt;Before deciding that FastAPI is the best framework for your next project, bear in mind the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Maturity:&lt;/strong&gt; FastAPI, being newer, lacks the maturity of Django or Flask. Its community is smaller, and the user experience may be less streamlined due to less extensive use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compatibility:&lt;/strong&gt; As a micro-framework, FastAPI requires additional functionality for fully featured apps. There are fewer compatible libraries compared to Django or Flask, which may require you to develop your own extensions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Choosing between Flask, Django, and FastAPI
&lt;/h2&gt;

&lt;p&gt;So, which is the best Python web framework? As with many things in programming, the answer is “it depends”.&lt;/p&gt;

&lt;p&gt;The right choice hinges on answering a few questions: What kind of app are you building? What are your priorities? How do you expect your project to grow in the future?&lt;/p&gt;

&lt;p&gt;All three popular Python web frameworks come with unique strengths, so assessing them in the context of your application will help you make the best decision. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Django&lt;/strong&gt; is a great option if you need standard web app functionality out of the box, making it suitable for projects that require a more robust structure. It’s particularly advantageous if you’re using a relational database, as its ORM simplifies data management and provides built-in security features. However, the extensive functionality may feel overwhelming for smaller projects or simple applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flask&lt;/strong&gt;, on the other hand, offers greater flexibility. Its minimalist design enables developers to pick and choose the extensions and libraries they want, making it suitable for projects where you need to customize features. This approach works well for startups or MVPs, where your requirements might change and evolve rapidly. While Flask is easy to get started with, keep in mind that building more intricate applications will mean exploring various extensions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FastAPI&lt;/strong&gt; is a strong contender when speed is of the essence, especially for API-first or &lt;a href="https://blog.jetbrains.com/pycharm/2024/09/how-to-use-fastapi-for-machine-learning/" rel="noopener noreferrer"&gt;machine learning&lt;/a&gt; projects. It uses modern Python features like type hints to provide automatic data validation and documentation. FastAPI is an excellent choice for applications that need high performance, like microservices or data-driven APIs. Despite this, it may not be as feature-rich as Django or Flask in terms of built-in functionality, which means you might need to implement additional features manually. &lt;/p&gt;

&lt;p&gt;For a deeper comparison between Django and the different web frameworks, check out our other guides, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.jetbrains.com/pycharm/2023/11/django-vs-flask-which-is-the-best-python-web-framework/" rel="noopener noreferrer"&gt;Django vs. Flask&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.jetbrains.com/pycharm/2023/12/django-vs-fastapi-which-is-the-best-python-web-framework/" rel="noopener noreferrer"&gt;Django vs. FastAPI &lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.jetbrains.com/pycharm/2024/06/the-state-of-django/" rel="noopener noreferrer"&gt;The State of Django 2024&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.jetbrains.com/pycharm/2023/11/what-is-the-django-web-framework/" rel="noopener noreferrer"&gt;What is the Django Web Framework?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.jetbrains.com/pycharm/2024/01/how-to-learn-django/" rel="noopener noreferrer"&gt;How to Learn Django&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.jetbrains.com/pycharm/2025/01/django-views/" rel="noopener noreferrer"&gt;An Introduction to Django Views&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.jetbrains.com/pycharm/2025/02/the-ultimate-guide-to-django-templates/" rel="noopener noreferrer"&gt;The Ultimate Guide to Django Templates&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.jetbrains.com/pycharm/2024/09/django-project-ideas/" rel="noopener noreferrer"&gt;Django Project Ideas&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Start your web development project with PyCharm
&lt;/h2&gt;

&lt;p&gt;Regardless of your primary framework, you can access all the essential web development tools in a single IDE. &lt;a href="https://www.jetbrains.com/pycharm/web-development/" rel="noopener noreferrer"&gt;PyCharm&lt;/a&gt; provides built-in support for Django, FastAPI, and Flask, while also offering top-notch integration with frontend frameworks like React, Angular, and Vue.js.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/web-development/" rel="noopener noreferrer"&gt;Start with PyCharm for free&lt;/a&gt;&lt;/p&gt;

</description>
      <category>django</category>
      <category>fastapi</category>
      <category>flask</category>
      <category>webdev</category>
    </item>
    <item>
      <title>The Ultimate Guide to Django Templates</title>
      <dc:creator>Evgenia Verbina</dc:creator>
      <pubDate>Wed, 05 Feb 2025 10:37:47 +0000</pubDate>
      <link>https://dev.to/pycharm/the-ultimate-guide-to-django-templates-21cf</link>
      <guid>https://dev.to/pycharm/the-ultimate-guide-to-django-templates-21cf</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetz4bwaymq74bnb6uzip.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetz4bwaymq74bnb6uzip.png" alt="The Ultimate Guide to Django Templates" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Django templates are a crucial part of the framework. Understanding what they are and why they’re useful can help you build seamless, adaptable, and functional templates for your &lt;a href="https://blog.jetbrains.com/pycharm/2023/11/what-is-the-django-web-framework/" rel="noopener noreferrer"&gt;Django&lt;/a&gt; sites and apps.&lt;/p&gt;

&lt;p&gt;If you’re new to the framework and looking to set up your first &lt;a href="https://www.jetbrains.com/help/pycharm/creating-and-running-your-first-django-project.html" rel="noopener noreferrer"&gt;Django project&lt;/a&gt;, grasping templates is vital. In this guide, you’ll find everything you need to know about Django templates, including the different types and how to use them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Django templates?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/help/pycharm/templates.html" rel="noopener noreferrer"&gt;Django templates&lt;/a&gt; are a fundamental part of the Django framework. They allow you to separate the visual presentation of your site from the underlying code. A template contains the static parts of the desired HTML output and special syntax describing how dynamic content will be inserted. &lt;/p&gt;

&lt;p&gt;Ultimately, templates can generate complete web pages, while database queries and other data processing tasks are handled by &lt;a href="https://blog.jetbrains.com/pycharm/2025/01/django-views/" rel="noopener noreferrer"&gt;views&lt;/a&gt; and &lt;a href="https://docs.djangoproject.com/en/5.1/topics/db/models/" rel="noopener noreferrer"&gt;models&lt;/a&gt;. This separation ensures clean, maintainable code by keeping the HTML presentation separate from the business logic in the Python code of your Django project. Without templates, you’d need to embed HTML directly into your Python code, making it hard to read and debug.&lt;/p&gt;

&lt;p&gt;Here is an example of a Django template containing some HTML, a variable &lt;code&gt;name&lt;/code&gt;, and basic &lt;code&gt;if/else&lt;/code&gt; logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;h1&amp;gt;Welcome!&amp;lt;/h1&amp;gt;

{% if name %}
  &amp;lt;h1&amp;gt;Hello, {{ name }}!&amp;lt;/h1&amp;gt;
{% else %}
  &amp;lt;h1&amp;gt;Hello, Guest!&amp;lt;/h1&amp;gt;
{% endif %}
&amp;lt;h1&amp;gt;{{ heading }}&amp;lt;/h1&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Benefits of using templates
&lt;/h3&gt;

&lt;p&gt;Developers use Django templates to help them build reliable apps quickly and efficiently. Other key benefits of templates include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code reusability:&lt;/strong&gt; Reusable components and layouts can be created for consistency across pages and apps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easier maintenance:&lt;/strong&gt; The appearance of web pages may be modified without altering the underlying logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved readability:&lt;/strong&gt; HTML code can be kept clean and understandable without the need for complex logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Template inheritance:&lt;/strong&gt; Common structures and layouts may be defined to reduce duplication and promote consistency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic content:&lt;/strong&gt; It’s possible to build personalized web pages that adapt to user inputs and data variations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance optimization:&lt;/strong&gt; Templates can be cached to improve app or website performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Challenges and limitations
&lt;/h3&gt;

&lt;p&gt;While templates are essential for rendering web pages in Django, they should be used thoughtfully, especially in complex projects with bigger datasets. Despite the relative simplicity of Django’s template language, overly complex templates with numerous nested tags, filters, and inheritance can become difficult to manage and maintain. Instead of embedding too much logic into your templates, aim to keep them focused on presentation. Customization options are also limited unless you create your own custom tags or filters.&lt;/p&gt;

&lt;p&gt;Migrating to a different template engine can be challenging, as Django’s default engine is closely tied to the &lt;a href="https://blog.jetbrains.com/pycharm/2023/11/what-is-the-django-web-framework/" rel="noopener noreferrer"&gt;framework&lt;/a&gt;. That said, Django ships with built-in support for Jinja, which makes switching to it relatively straightforward – we will discuss this possibility later in this guide.&lt;/p&gt;

&lt;h3&gt;
  
  
  Debugging Django templates
&lt;/h3&gt;

&lt;p&gt;In some situations (such as when issues arise), it can be useful to see how your template works. For this, you can use template debugging.&lt;/p&gt;

&lt;p&gt;Template debugging focuses on identifying errors in how your HTML and dynamic data interact. Common problems include missing variables, incorrect template tags, and logic errors.&lt;/p&gt;

&lt;p&gt;Luckily, Django provides helpful tools like &lt;code&gt;{{ debug }}&lt;/code&gt; for inspecting your templates and detailed error pages that highlight where the problem lies. This makes it easier to pinpoint and resolve issues, ensuring your templates render as expected.&lt;/p&gt;
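
&lt;p&gt;Note that the &lt;code&gt;{{ debug }}&lt;/code&gt; variable is only populated when &lt;code&gt;DEBUG&lt;/code&gt; is on, the debug context processor is enabled, and the request comes from an address in &lt;code&gt;INTERNAL_IPS&lt;/code&gt; – roughly the following settings sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# settings.py (sketch) – prerequisites for {{ debug }} and {{ sql_queries }}
DEBUG = True
INTERNAL_IPS = ["127.0.0.1"]

TEMPLATES = [{
    "BACKEND": "django.template.backends.django.DjangoTemplates",
    "APP_DIRS": True,
    "OPTIONS": {
        "context_processors": [
            "django.template.context_processors.debug",
            "django.template.context_processors.request",
        ],
    },
}]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;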

&lt;h2&gt;
  
  
  Understanding the Django Template Language (DTL)
&lt;/h2&gt;

&lt;p&gt;The Django Template Language (DTL) is Django’s built-in templating engine, designed to simplify the creation of dynamic web pages. It seamlessly blends HTML with Django-specific tags and filters, allowing you to generate rich, data-driven content directly from your &lt;a href="https://blog.jetbrains.com/pycharm/2023/04/create-a-django-app-in-pycharm/" rel="noopener noreferrer"&gt;Django app&lt;/a&gt;. Let’s explore some of the key features that make DTL a powerful tool for building templates.&lt;/p&gt;

&lt;h3&gt;
  
  
  DTL basic syntax and structure
&lt;/h3&gt;

&lt;p&gt;Django templates are written with a combination of HTML and DTL syntax. The basic structure of a Django template consists of HTML markup with embedded Django tags and variables.&lt;/p&gt;

&lt;p&gt;Here’s an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!DOCTYPE html&amp;gt;
&amp;lt;html&amp;gt;
  &amp;lt;head&amp;gt;
    &amp;lt;title&amp;gt;{{ page_title }}&amp;lt;/title&amp;gt;
  &amp;lt;/head&amp;gt;
  &amp;lt;body&amp;gt;
    &amp;lt;h1&amp;gt;{{ heading }}&amp;lt;/h1&amp;gt;
    &amp;lt;ul&amp;gt;
      {% for item in item_list %}
        &amp;lt;li&amp;gt;{{ item.name }}&amp;lt;/li&amp;gt;
      {% endfor %}
    &amp;lt;/ul&amp;gt;
  &amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Variables, filters, and tags
&lt;/h3&gt;

&lt;p&gt;The DTL has several features for working with variables, filters, and tags – all three appear together in the short snippet after this list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Variables:&lt;/strong&gt; Variables display dynamic data in your templates. They are enclosed in double curly brackets, e.g. &lt;code&gt;{{ variable_name }}&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filters:&lt;/strong&gt; Filters modify or format the value of a variable before rendering it. They are applied using a pipe character (&lt;code&gt;|&lt;/code&gt;), e.g. &lt;code&gt;{{ variable_name|upper }}&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tags:&lt;/strong&gt; Tags control the logic and flow of your templates. They are enclosed in &lt;code&gt;{% %}&lt;/code&gt; blocks and can perform various operations like loops, conditionals, and template inclusions.&lt;/li&gt;
&lt;/ul&gt;
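
&lt;p&gt;Here’s how all three look together in one short snippet, using the &lt;code&gt;user&lt;/code&gt; variable that Django’s auth context processor provides by default:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{% if user.is_authenticated %}
  Hello, {{ user.username|title }}!
{% else %}
  Hello, guest!
{% endif %}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;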

&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/web-development/django/" rel="noopener noreferrer"&gt;PyCharm&lt;/a&gt;, a professional IDE for Django development, simplifies working with Django templates by providing syntax highlighting, which color-codes tags, variables, and HTML for better readability. It also offers real-time error detection, ensuring you don’t miss closing tags or misplace syntax. With auto-completion for variables and tags, you can code faster and with fewer mistakes.&lt;/p&gt;



&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/web-development/django/" rel="noopener noreferrer"&gt;Start with PyCharm Pro for free&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Template inheritance and extending base templates
&lt;/h3&gt;

&lt;p&gt;The framework’s template inheritance system enables you to create a base template that contains the standard structure and the layout for your website or app.&lt;/p&gt;

&lt;p&gt;You can then create child templates that inherit from the base template and override specific blocks or sections as needed. This encourages code reuse and consistency across your different templates.&lt;/p&gt;

&lt;p&gt;To create a base template, you define blocks using the &lt;code&gt;{% block %}&lt;/code&gt; tag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!-- base.html --&amp;gt;
&amp;lt;!DOCTYPE html&amp;gt;
&amp;lt;html&amp;gt;
  &amp;lt;head&amp;gt;
    &amp;lt;title&amp;gt;{% block title %}Default Title{% endblock %}&amp;lt;/title&amp;gt;
  &amp;lt;/head&amp;gt;
  &amp;lt;body&amp;gt;
    {% block content %}
    {% endblock %}
  &amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Child templates then extend the base templates and override certain blocks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!-- child_template.html --&amp;gt;
{% extends 'base.html' %}

{% block title %}My Page Title{% endblock %}

{% block content %}
  &amp;lt;h1&amp;gt;My Page Content&amp;lt;/h1&amp;gt;
  &amp;lt;p&amp;gt;This is the content of my page.&amp;lt;/p&amp;gt;
{% endblock %}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Django template tags
&lt;/h2&gt;

&lt;p&gt;Tags are an essential element of Django templates. They provide various functionalities, from conditional rendering and looping to template inheritance and inclusion.&lt;/p&gt;

&lt;p&gt;Let’s explore them in more detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common Django template tags
&lt;/h3&gt;

&lt;p&gt;There are several template tags in Django, but these are the ones you’ll probably use most frequently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;{% if %}&lt;/code&gt;: This tag allows you to conditionally render content based on a specific condition. It’s often used with the &lt;code&gt;{% else %}&lt;/code&gt; and &lt;code&gt;{% elif %}&lt;/code&gt; tags.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;{% for %}&lt;/code&gt;: The &lt;code&gt;{% for %}&lt;/code&gt; tag is used to iterate over a sequence, such as a list or query set, and render content for each item in the sequence.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;{% include %}&lt;/code&gt;: This tag enables you to include the contents of another template file within the current template. It facilitates the reuse of common template snippets across multiple templates.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;{% block %}&lt;/code&gt;: The &lt;code&gt;{% block %}&lt;/code&gt; tag is used in conjunction with template inheritance. It defines a block of content that can be overridden by child templates when extending a base template.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;{% extends %}&lt;/code&gt;: This tag specifies the base template that the current template should inherit from.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;{% url %}&lt;/code&gt;: This tag is used to generate a URL for a named URL pattern in your Django project. It helps keep your templates decoupled from the actual URL paths.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;{% load %}&lt;/code&gt;: The &lt;code&gt;{% load %}&lt;/code&gt; tag is used to load custom template tags and filters from a Python module or library, enabling you to extend the functionality of the Django template system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are just some examples of the many template tags available in Django. Tags like &lt;code&gt;{% with %}&lt;/code&gt;, &lt;code&gt;{% cycle %}&lt;/code&gt;, &lt;code&gt;{% comment %}&lt;/code&gt;, and others can provide more functionality for advanced projects, helping you build customized and efficient apps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using template tags
&lt;/h3&gt;

&lt;p&gt;Here’s a detailed example of how you might use tags in a Django template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{% extends 'base.html' %}
{% load custom_filters %}

{% block content %}
  &amp;lt;h1&amp;gt;{{ page_title }}&amp;lt;/h1&amp;gt;
  {% if object_list %}
    &amp;lt;ul&amp;gt;
      {% for obj in object_list %}
&amp;lt;!-- Truncate the object name to 25 characters; the truncate filter
     is assumed to come from custom_filters (Django's built-in
     equivalent is truncatechars). --&amp;gt;
        &amp;lt;li&amp;gt;{{ obj.name|truncate:25 }}&amp;lt;/li&amp;gt;
      {% endfor %}
    &amp;lt;/ul&amp;gt;
  {% else %}
    &amp;lt;p&amp;gt;No objects found.&amp;lt;/p&amp;gt;
  {% endif %}

  {% include 'partials/pagination.html' %}
{% endblock %}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we extend a base template, load custom filters, and then define a block for the main content.&lt;/p&gt;

&lt;p&gt;Inside the block, we check whether an &lt;code&gt;object_list&lt;/code&gt; exists, and if so, we loop through it and display the truncated names of each object. We show a “No objects found” message if the list is empty.&lt;/p&gt;

&lt;p&gt;Finally, we include a partial template for pagination. This template is a reusable snippet of HTML that can be included in other templates, enabling you to manage and update common elements (like pagination) more efficiently.&lt;/p&gt;
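
&lt;p&gt;For illustration, such a partial might look like this, assuming the standard &lt;code&gt;page_obj&lt;/code&gt; that Django’s pagination (for example, via &lt;code&gt;ListView&lt;/code&gt;) places in the context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!-- partials/pagination.html (illustrative sketch) --&amp;gt;
{% if page_obj.has_previous %}
  &amp;lt;a href="?page={{ page_obj.previous_page_number }}"&amp;gt;Previous&amp;lt;/a&amp;gt;
{% endif %}
&amp;lt;span&amp;gt;Page {{ page_obj.number }} of {{ page_obj.paginator.num_pages }}&amp;lt;/span&amp;gt;
{% if page_obj.has_next %}
  &amp;lt;a href="?page={{ page_obj.next_page_number }}"&amp;gt;Next&amp;lt;/a&amp;gt;
{% endif %}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;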

&lt;h2&gt;
  
  
  Django admin templates
&lt;/h2&gt;

&lt;p&gt;Django’s built-in admin interface gives you a user-friendly and intuitive way to manage your application data. It’s powered by a set of templates defining its structure, layout, and appearance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Functionality
&lt;/h3&gt;

&lt;p&gt;The Django admin templates handle various tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Authentication:&lt;/strong&gt; Controls user authentication, login, and logout.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model management:&lt;/strong&gt; Displays lists of model instances and creates, edits, and deletes instances as needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Form rendering:&lt;/strong&gt; Renders forms for creating and editing model instances.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Navigation:&lt;/strong&gt; Renders the navigation structure of the admin interface, including the main menu and app-specific sub-menus.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pagination:&lt;/strong&gt; Renders pagination controls when displaying lists of model instances.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;History tracking:&lt;/strong&gt; Displays and manages the change history of model instances.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Django’s built-in admin templates provide a solid foundation for managing your application’s data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Customizing admin templates
&lt;/h3&gt;

&lt;p&gt;Although Django’s admin templates offer a good, functional interface out of the box, you may want to customize their appearance or behavior to suit your individual project’s needs.&lt;/p&gt;

&lt;p&gt;You can change things to match your project’s branding, improve the user experience, or add custom functionality unique to your app.&lt;/p&gt;

&lt;p&gt;There are several ways to do this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Override templates:&lt;/strong&gt; You can override default admin templates by creating templates with the same file structure and naming convention in your project’s templates directory. Django will then automatically use your custom templates instead of the built-in ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extend base templates:&lt;/strong&gt; Many of Django’s admin templates are built using template inheritance. You can create templates that extend the base admin templates and override specific blocks or sections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Template options:&lt;/strong&gt; Django has various template options that enable you to customize the admin interface’s behavior. This includes displaying certain fields, specifying which ones should be editable, and defining custom templates for specific model fields.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Admin site customization:&lt;/strong&gt; You can customize the admin site’s appearance and behavior by subclassing the &lt;code&gt;AdminSite&lt;/code&gt; class and registering your custom admin site with Django, as shown in the sketch after this list.&lt;/li&gt;
&lt;/ul&gt;
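
&lt;p&gt;As a quick sketch of the last approach (the class and site names here are made up):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# admin.py – subclassing AdminSite (illustrative names)
from django.contrib import admin

class ProjectAdminSite(admin.AdminSite):
    site_header = "My Project administration"
    site_title = "My Project admin"

admin_site = ProjectAdminSite(name="project_admin")

# Register models with the custom site instead of the default admin.site,
# and point your urls.py at admin_site.urls.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;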

&lt;h2&gt;
  
  
  URL templating in Django
&lt;/h2&gt;

&lt;p&gt;URL templates in Django offer a flexible way to define and generate URLs for web applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding URL templates
&lt;/h3&gt;

&lt;p&gt;In Django, you define URL patterns in the project’s &lt;code&gt;urls.py&lt;/code&gt; file using the &lt;code&gt;path&lt;/code&gt; function from the &lt;code&gt;django.urls&lt;/code&gt; module.&lt;/p&gt;

&lt;p&gt;These patterns map URLs to the Python functions (views) that handle the corresponding HTTP requests.&lt;/p&gt;

&lt;p&gt;Here’s an example of a basic URL pattern in Django:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# urls.py
from django.urls import path
from . import views

urlpatterns = [
    path('', views.home, name='home'),
    path('about/', views.about, name='about'),
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, the URL pattern &lt;code&gt;''&lt;/code&gt; (the empty string) maps to the &lt;code&gt;views.home&lt;/code&gt; view function, and the URL pattern &lt;code&gt;'about/'&lt;/code&gt; maps to the &lt;code&gt;views.about&lt;/code&gt; view function.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dynamic URL generation with URL templates
&lt;/h3&gt;

&lt;p&gt;URL templates in Django allow you to include variables or parameters in your URL patterns. This means you can create dynamic URLs that represent different instances of the same resource or include more data.&lt;/p&gt;

&lt;p&gt;If your &lt;code&gt;urls.py&lt;/code&gt; file includes other URL files using &lt;code&gt;include()&lt;/code&gt;, PyCharm automatically gathers and recognizes all nested routes, ensuring that URL name suggestions remain accurate. You can also navigate to URL definitions by &lt;em&gt;Ctrl+Click&lt;/em&gt;ing a URL name to jump directly to its source, even if the URL is defined in a child file.&lt;/p&gt;
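
&lt;p&gt;For instance, a project-level &lt;code&gt;urls.py&lt;/code&gt; might pull in an app’s routes like this (the &lt;code&gt;blog&lt;/code&gt; app name is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# project/urls.py
from django.urls import include, path

urlpatterns = [
    path('blog/', include('blog.urls')),  # routes defined in blog/urls.py
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;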

&lt;p&gt;Let’s look at an example of a URL template with a variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# urls.py
urlpatterns = [
    path('blog/&amp;lt;int:year&amp;gt;/', views.year_archive, name='blog_year_archive'),
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, the URL pattern &lt;code&gt;'blog/&amp;lt;int:year&amp;gt;/'&lt;/code&gt; includes a variable &lt;code&gt;year&lt;/code&gt; of type &lt;code&gt;int&lt;/code&gt;. When a request matches this pattern, Django will pass the value of &lt;code&gt;year&lt;/code&gt; as an argument to the &lt;code&gt;views.year_archive&lt;/code&gt; view function.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using Django URLs
&lt;/h3&gt;

&lt;p&gt;Django URLs are the foundation of any application and work by linking user requests to the appropriate views. By defining URL patterns that match specific views, Django ensures your site remains organized and scalable. &lt;/p&gt;

&lt;h4&gt;
  
  
  Using URL templates with Django’s &lt;code&gt;reverse&lt;/code&gt; function
&lt;/h4&gt;

&lt;p&gt;Django’s &lt;code&gt;reverse&lt;/code&gt; function lets you generate URLs based on their named URL patterns. It takes the name of the URL pattern as its first argument, followed by any required arguments, and returns the corresponding URL.&lt;/p&gt;

&lt;p&gt;Here’s an example of it in action:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# views.py
from django.shortcuts import render
from django.urls import reverse

def blog_post_detail(request, year, month, slug):
    # ...
    url = reverse('blog_post_detail', args=[year, month, slug])
    return render(request, 'blog/post_detail.html', {'url': url})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the &lt;code&gt;reverse&lt;/code&gt; function generates the URL for the &lt;code&gt;blog_post_detail&lt;/code&gt; URL pattern, passing the year, month, and &lt;a href="https://docs.djangoproject.com/en/dev/glossary/#term-slug" rel="noopener noreferrer"&gt;slug&lt;/a&gt; values as arguments.&lt;/p&gt;

&lt;p&gt;You can then use the returned URL in templates or other application parts.&lt;/p&gt;

&lt;h4&gt;
  
  
  Using URL tags in Django templates
&lt;/h4&gt;

&lt;p&gt;Django’s &lt;code&gt;{% url %}&lt;/code&gt; template tag provides an elegant way to generate URLs directly within your template. Instead of hardcoding URLs, you can refer to named URL patterns, which makes your templates more flexible and easier to manage.&lt;/p&gt;

&lt;p&gt;Here’s an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;a href="{% url 'blog_post_detail' year=2024 month=10 slug=post.slug %}"&amp;gt; 
Read More 
&amp;lt;/a&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, the &lt;code&gt;{% url %}&lt;/code&gt; tag creates a URL for the &lt;code&gt;blog_post_detail&lt;/code&gt; view, passing in the &lt;code&gt;year&lt;/code&gt;, &lt;code&gt;month&lt;/code&gt;, and &lt;code&gt;slug&lt;/code&gt; parameters. It’s important to make sure these arguments match the URL pattern defined in your &lt;code&gt;urls.py&lt;/code&gt; file, which should look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;path('blog/&amp;lt;int:year&amp;gt;/&amp;lt;int:month&amp;gt;/&amp;lt;slug:slug&amp;gt;/', views.blog_post_detail, name='blog_post_detail'),
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach helps keep your templates clean and adaptable, particularly as your project evolves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Jinja vs. Django templates
&lt;/h2&gt;

&lt;p&gt;Although Django comes with a built-in template engine (DTL), developers also have the option to use alternatives like Jinja.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://jinja.palletsprojects.com/en/stable/" rel="noopener noreferrer"&gt;Jinja&lt;/a&gt; is a popular, modern, and feature-rich template engine for Python. Initially developed for the &lt;a href="https://blog.jetbrains.com/pycharm/2023/11/django-vs-flask-which-is-the-best-python-web-framework/" rel="noopener noreferrer"&gt;Flask&lt;/a&gt; web framework, it’s also compatible with Django.&lt;/p&gt;

&lt;p&gt;The engine was designed to be fast, secure, and highly extensible. Its broad feature set and capabilities make it versatile for rendering dynamic content.&lt;/p&gt;

&lt;p&gt;Some of Jinja’s key features and advantages over Django’s DTL include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A more concise and intuitive syntax.&lt;/li&gt;
&lt;li&gt;Sandboxed execution for increased security against code injection attacks.&lt;/li&gt;
&lt;li&gt;A more flexible and powerful inheritance system.&lt;/li&gt;
&lt;li&gt;Better debugging tools and reporting mechanisms.&lt;/li&gt;
&lt;li&gt;Faster performance when working with complex templates or large datasets.&lt;/li&gt;
&lt;li&gt;Enhanced functionality with built-in filters and macros, enabling more complex rendering logic without cluttering the template.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PyCharm can automatically detect the &lt;code&gt;*.jinja&lt;/code&gt; file type and provides syntax highlighting, code completion, and error detection, along with support for custom filters and extensions, ensuring a smooth development experience.&lt;/p&gt;

&lt;p&gt;Despite these benefits, it’s also important to remember that integrating Jinja into a Django project requires a more complex setup and further configuration.&lt;/p&gt;
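
&lt;p&gt;As a rough idea of what that setup involves, Django ships with a Jinja2 backend you can enable in &lt;code&gt;TEMPLATES&lt;/code&gt; (the directory and module names here are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# settings.py (sketch) – enabling Django's built-in Jinja2 backend
TEMPLATES = [
    {
        "BACKEND": "django.template.backends.jinja2.Jinja2",
        "DIRS": [BASE_DIR / "jinja2"],  # BASE_DIR comes from the default settings
        "APP_DIRS": True,
        # Optional hook to a function that builds a custom jinja2.Environment:
        "OPTIONS": {"environment": "myproject.jinja2.environment"},
    },
    # Keep the DTL backend alongside if the admin and other apps need it.
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;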

&lt;p&gt;Some developers might also prefer to stick with Django’s built-in template engine in order to keep everything within the Django ecosystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code faster with Django live templates
&lt;/h3&gt;

&lt;p&gt;With PyCharm’s live template feature, you can quickly insert commonly used code snippets with a simple keyword shortcut.&lt;/p&gt;

&lt;p&gt;For example, you can invoke live templates by pressing &lt;em&gt;⌘J&lt;/em&gt; (&lt;em&gt;Ctrl+J&lt;/em&gt; on Windows/Linux), typing &lt;code&gt;ListView&lt;/code&gt;, and hitting the Tab key.&lt;/p&gt;



&lt;p&gt;This reduces boilerplate coding, speeds up development, and ensures consistent syntax. You can even &lt;strong&gt;customize or create your own templates&lt;/strong&gt; to fit specific project needs. This feature is particularly useful for DTL syntax, where loops, conditionals, and block structures are frequently repeated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Django templates: best practices and tips
&lt;/h2&gt;

&lt;p&gt;Working with Django templates is a great way to manage the presentation layer of your web apps.&lt;/p&gt;

&lt;p&gt;However, following established guidelines and carrying out performance optimizations is essential to keep your templates maintainable, secure, and well-organized.&lt;/p&gt;

&lt;p&gt;Here are some best practices and tips to remember when using Django templates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Separate presentation and business logic&lt;/strong&gt;. Keep templates focused on rendering data and handle complex processing in views or models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Organize your templates logically.&lt;/strong&gt; Follow Django’s file structure by separating templates by app and functionality, using subdirectories as needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Django’s naming conventions&lt;/strong&gt;. Django follows a ‘convention over configuration’ principle, letting you name your templates in a specific way so that you don’t need to provide your template name explicitly. For instance, when using class-based views like &lt;code&gt;ListView&lt;/code&gt;, Django automatically looks for a template named &lt;code&gt;&amp;lt;app&amp;gt;/&amp;lt;model&amp;gt;_list.html&lt;/code&gt;, thus simplifying your code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Break down elaborate tasks into reusable components.&lt;/strong&gt; Promote code reuse and improve maintainability by using template tags, filters, and includes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Follow consistent naming conventions.&lt;/strong&gt; Use clear and descriptive names for your templates, tags, and filters. This makes it easier for other developers to read your code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Django’s safe rendering filters.&lt;/strong&gt; Always escape user-provided data before rendering to prevent XSS vulnerabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document complex template logic.&lt;/strong&gt; Use clear comments to explain intricate parts of your templates. This will help others (and your future self) understand your code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Profile your templates&lt;/strong&gt;. Use Django’s profiling tools to find and optimize performance bottlenecks like inefficient loops and convoluted logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/guide/django/links/django-in-pycharm-tips-reloaded/" rel="noopener noreferrer"&gt;Watch this video&lt;/a&gt; to explore Django tips and PyCharm features in more detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Whether you’re building a simple website or a more complicated app, you should now know how to create Django templates that enhance user experience and streamline your development process.&lt;/p&gt;

&lt;p&gt;But templates are just one component of the Django framework. Explore our other &lt;a href="https://blog.jetbrains.com/pycharm/tag/django/" rel="noopener noreferrer"&gt;Django blogs&lt;/a&gt; and resources that can help you &lt;a href="https://blog.jetbrains.com/pycharm/2024/01/how-to-learn-django/" rel="noopener noreferrer"&gt;learn Django&lt;/a&gt;, discover &lt;a href="https://blog.jetbrains.com/pycharm/2023/12/django-5-0-delight-unraveling-the-newest-features/" rel="noopener noreferrer"&gt;Django’s newest features&lt;/a&gt;, and more. You may also want to familiarize yourself with &lt;a href="https://docs.djangoproject.com/en/5.0/" rel="noopener noreferrer"&gt;Django’s official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reliable Django support in PyCharm
&lt;/h2&gt;

&lt;p&gt;From complete beginners to experienced developers, &lt;a href="https://www.jetbrains.com/pycharm/web-development/django/" rel="noopener noreferrer"&gt;PyCharm Professional&lt;/a&gt; is on hand to help streamline your Django development workflow.&lt;/p&gt;

&lt;p&gt;As a Django IDE, it provides Django-specific code assistance, debugging, live previews, project-wide navigation, and refactoring capabilities. PyCharm includes full support for Django templates, allowing you to manage and edit them with ease. You can also connect to your database with a single click and work seamlessly with TypeScript, JavaScript, and other frontend frameworks.&lt;/p&gt;

&lt;p&gt;For full details of how to work with Django templates in PyCharm, see our &lt;a href="https://www.jetbrains.com/help/pycharm/templates.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;. Those who are relatively new to the Django framework may benefit from first reading our comprehensive tutorial, which covers all the steps in the process of &lt;a href="https://www.jetbrains.com/guide/django/tutorials/django-aws/setup-django/" rel="noopener noreferrer"&gt;creating a new Django app in PyCharm&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Ready to get started? Download PyCharm now and enjoy a more productive development process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/web-development/django/" rel="noopener noreferrer"&gt;Start with PyCharm Pro for free&lt;/a&gt;&lt;/p&gt;

</description>
      <category>django</category>
      <category>python</category>
    </item>
    <item>
      <title>An Introduction to Django Views</title>
      <dc:creator>Evgenia Verbina</dc:creator>
      <pubDate>Wed, 29 Jan 2025 10:51:08 +0000</pubDate>
      <link>https://dev.to/pycharm/an-introduction-to-django-views-4cb9</link>
      <guid>https://dev.to/pycharm/an-introduction-to-django-views-4cb9</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F219u9i5x3vt1xxo0rmxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F219u9i5x3vt1xxo0rmxn.png" alt="An Introduction to Django Views" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Views are central to Django’s architecture pattern, and having a solid grasp of how to work with them is essential for any developer working with the framework. If you’re new to developing web apps with Django or just need a refresher on views, we’ve got you covered. &lt;/p&gt;

&lt;p&gt;Gaining a better understanding of views will help you make faster progress in your Django project. Whether you’re working on an API backend or web UI flows, knowing how to use views is crucial.&lt;/p&gt;

&lt;p&gt;Read on to discover what Django views are, their different types, best practices for working with them, and examples of use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Django views?
&lt;/h2&gt;

&lt;p&gt;Views are a core component of Django’s MTV (model-template-view) architecture pattern. They essentially act as middlemen between &lt;a href="https://docs.djangoproject.com/en/5.1/topics/db/models/" rel="noopener noreferrer"&gt;models&lt;/a&gt; and &lt;a href="https://blog.jetbrains.com/pycharm/2025/02/the-ultimate-guide-to-django-templates/" rel="noopener noreferrer"&gt;templates&lt;/a&gt;, processing user requests and returning responses.&lt;/p&gt;

&lt;p&gt;You may have come across views in the MVC (model-view-controller) pattern. However, these are slightly &lt;a href="https://docs.djangoproject.com/en/5.1/faq/general/#faq-mtv" rel="noopener noreferrer"&gt;different from views in Django&lt;/a&gt; and don’t translate exactly. Django views are essentially controllers in MVC, while Django templates roughly align with views in MVC. This makes understanding the nuances of Django views vital, even if you’re familiar with views in an MVC context.&lt;/p&gt;

&lt;p&gt;Views are part of the user interface in Django, and they handle the logic and data processing for web requests made to your Django-powered apps and sites. They render your templates into what the user sees when they view your webpage. Each function-based or class-based view takes a user’s request, fetches data from the models, applies business logic or data processing, and then renders a template and returns an HTTP response.&lt;/p&gt;

&lt;p&gt;This response can be anything a web browser can display and is typically an HTML webpage. However, Django views can also return images, XML documents, redirects, error pages, and more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rendering and passing data to templates
&lt;/h2&gt;

&lt;p&gt;Django provides the &lt;code&gt;render()&lt;/code&gt; shortcut to make template rendering simple from within views. Using &lt;code&gt;render()&lt;/code&gt; helps avoid the boilerplate of loading the template and creating the response manually.&lt;/p&gt;

&lt;p&gt;PyCharm offers smart code completion that automatically suggests the &lt;code&gt;render()&lt;/code&gt; function from &lt;code&gt;django.shortcuts&lt;/code&gt; when you start typing it in your views. It also recognizes template names and provides autocompletion for template paths, helping you avoid typos and errors.&lt;/p&gt;

&lt;p&gt;You pass &lt;code&gt;render()&lt;/code&gt; the request, the template name, and a context dictionary that supplies data for the template. Once the necessary data is obtained, the view passes it to the template, where it is rendered and presented to the user.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from django.shortcuts import render

def my_view(request):
    # Some business logic to obtain data
    data_to_pass = {'variable1': 'value1', 'variable2': 'value2'}

    # Pass the data to the template
    return render(request, 'my_template.html', context=data_to_pass)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, &lt;code&gt;data_to_pass&lt;/code&gt; is a dictionary containing the data you want to send to the template. The &lt;code&gt;render&lt;/code&gt; function is then used to render the template (&lt;code&gt;my_template.html&lt;/code&gt;) with the provided context data.&lt;/p&gt;

&lt;p&gt;Now, in your template (&lt;code&gt;my_template.html&lt;/code&gt;), you can access and display the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!DOCTYPE html&amp;gt;
&amp;lt;html&amp;gt;
&amp;lt;head&amp;gt;
    &amp;lt;title&amp;gt;My Template&amp;lt;/title&amp;gt;
&amp;lt;/head&amp;gt;
&amp;lt;body&amp;gt;
    &amp;lt;h1&amp;gt;{{ variable1 }}&amp;lt;/h1&amp;gt;
    &amp;lt;p&amp;gt;{{ variable2 }}&amp;lt;/p&amp;gt;
&amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the template, you use double curly braces (&lt;code&gt;{{ }}&lt;/code&gt;) to indicate template variables. These will be replaced with the values from the context data passed by the view.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/web-development/django/" rel="noopener noreferrer"&gt;PyCharm&lt;/a&gt; offers completion and syntax highlighting for Django template tags, variables, and loops. It also provides in-editor linting for common mistakes. This allows you to focus on building views and handling logic, rather than spending time manually filling in template elements or debugging common errors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnchxbvzvnellyh2yjgt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnchxbvzvnellyh2yjgt.png" alt="PyCharm Django completion" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/web-development/django/" rel="noopener noreferrer"&gt;Start with PyCharm Pro for free&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Function-based views
&lt;/h2&gt;

&lt;p&gt;Django has two types of views: function-based views and class-based views.&lt;/p&gt;

&lt;p&gt;Function-based views are built using simple Python functions and are generally divided into four basic categories: create, read, update, and delete (CRUD). These operations form the foundation of most web frameworks. Each view takes in an HTTP request and returns an HTTP response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from django.http import HttpResponse

def my_view(request):

    # View logic goes here
    context = {"message": "Hello world"}

    return HttpResponse(render(request, "mytemplate.html", context))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This snippet handles the logic of the view, prepares a context dictionary for passing data to a template that is rendered, and returns the final template HTML in a response object.&lt;/p&gt;

&lt;p&gt;Function-based views are simple and straightforward. The logic is contained in a single Python function instead of spread across methods in a class, making them most suited to use cases with minimal processing.&lt;/p&gt;

&lt;p&gt;PyCharm allows you to automatically generate the &lt;code&gt;def my_view(request)&lt;/code&gt; structure using &lt;a href="https://www.jetbrains.com/help/pycharm/using-live-templates.html" rel="noopener noreferrer"&gt;live templates&lt;/a&gt;. Live templates are pre-defined code snippets that can be expanded into boilerplate code. This feature saves you time and ensures a consistent structure for your view definitions.&lt;/p&gt;

&lt;p&gt;You can invoke live templates simply by pressing &lt;em&gt;⌘J&lt;/em&gt;, typing &lt;code&gt;Listview&lt;/code&gt;, and pressing the tab key. &lt;/p&gt;



&lt;p&gt;Moreover, PyCharm includes a &lt;em&gt;Django Structure&lt;/em&gt; tool window, where you can see a list of all the views in your Django project, organized by app. This allows you to quickly locate views, navigate between them, and identify which file each view belongs to.&lt;/p&gt;



&lt;h2&gt;
  
  
  Class-based views
&lt;/h2&gt;

&lt;p&gt;Django introduced class-based views so users wouldn’t need to write the same code repeatedly. They don’t replace function-based views but instead have certain applications and advantages, especially in cases where complex logic is required.&lt;/p&gt;

&lt;p&gt;Class-based views in Django provide reusable parent classes that implement various patterns and functionality typically needed by web application views. You can derive your views from these parent classes to reduce boilerplate code.&lt;/p&gt;

&lt;p&gt;Class-based views offer generic parent classes like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ListView&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DetailView&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CreateView&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;And many more.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below are two similar code snippets demonstrating a simple &lt;code&gt;BookListView&lt;/code&gt;. The first shows a basic implementation using the default class-based conventions, while the second illustrates how you can customize the view by specifying additional parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basic implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from django.views.generic import ListView
from .models import Book 

class BookListView(ListView):
    model = Book
    # The template_name is omitted because Django defaults to 'book_list.html' 
    # based on the convention of &amp;lt;model_name&amp;gt;_list.html for ListView.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;BookListView&lt;/code&gt; is rendered, it automatically queries the &lt;code&gt;Book&lt;/code&gt; records and passes them to &lt;code&gt;book_list.html&lt;/code&gt; as &lt;code&gt;object_list&lt;/code&gt; (with &lt;code&gt;book_list&lt;/code&gt; available as an alias). This means you can create a view to list objects quickly without needing to rewrite the underlying logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Customized implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from django.views.generic import ListView
from .models import Book 

class BookListView(ListView):
    model = Book

    # You can customize the view further by adding additional attributes or methods
    def get_queryset(self):
        # Example of customizing the queryset to filter books
        return Book.objects.filter(is_available=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the second snippet, we’ve introduced a custom &lt;code&gt;get_queryset()&lt;/code&gt; method, allowing us to filter the records displayed in the view more precisely. This shows how class-based views can be extended beyond their default functionality to meet the needs of your application. &lt;/p&gt;

&lt;p&gt;Class-based views also define methods that tie into key parts of the request and response lifecycle, such as: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;get()&lt;/code&gt; – logic for &lt;code&gt;GET&lt;/code&gt; requests.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;post()&lt;/code&gt; – logic for &lt;code&gt;POST&lt;/code&gt; requests.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dispatch()&lt;/code&gt; – inspects the HTTP method and routes the request to the matching handler, such as &lt;code&gt;get()&lt;/code&gt; or &lt;code&gt;post()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These types of views provide structure while offering customization where needed, making them well-suited to elaborate use cases.&lt;/p&gt;

&lt;p&gt;PyCharm offers live templates for class-based views like &lt;code&gt;ListView&lt;/code&gt;, &lt;code&gt;DetailView&lt;/code&gt;, and &lt;code&gt;TemplateView&lt;/code&gt;, allowing you to generate entire view classes in seconds, complete with boilerplate methods and docstrings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz89m8d6okn9mum7ey23t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz89m8d6okn9mum7ey23t.png" alt="Django live templates in PyCharm" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating custom class-based views
&lt;/h3&gt;

&lt;p&gt;You can also create your own view classes by subclassing Django’s generic ones and customizing them for your needs. &lt;/p&gt;

&lt;p&gt;Some use cases where you might want to make your own classes include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adding business logic, such as complicated calculations.&lt;/li&gt;
&lt;li&gt;Mixing multiple generic parents to blend functionality.&lt;/li&gt;
&lt;li&gt;Managing sessions or state across multiple requests.&lt;/li&gt;
&lt;li&gt;Optimizing database access with custom queries. &lt;/li&gt;
&lt;li&gt;Reusing common rendering logic across different areas. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A custom class-based view could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from django.views.generic import View
from django.shortcuts import render
from . import models

class ProductSalesView(View):

    def get(self, request):

        # Custom data processing 
        sales = get_sales_data()

        return render(request, "sales.html", {"sales": sales})

    def post(self, request):

        # Custom form handling
        form = SalesSearchForm(request.POST)  
        if form.is_valid():
            results = models.Sale.objects.filter(date__gte=form.cleaned_data['start_date'])
            context = {"results": results}
            return render(request, "search_results.html", context)

        # Invalid form handling
        errors = form.errors
        return render(request, "sales.html", {"errors": errors})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the custom &lt;code&gt;get()&lt;/code&gt; and &lt;code&gt;post()&lt;/code&gt; handlers let a single view both display the sales page and process the search form, with each request type getting its own logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use each view type
&lt;/h2&gt;

&lt;p&gt;Function-based and class-based views can both be useful depending on the complexity and needs of the view logic. &lt;/p&gt;

&lt;p&gt;The main differences are that class-based views:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Promote reuse via subclassing, with behavior inherited from parent classes.&lt;/li&gt;
&lt;li&gt;Are ideal for state management between requests.&lt;/li&gt;
&lt;li&gt;Provide more structure and enforced discipline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You might use them when working with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dashboard pages with complex rendering logic. &lt;/li&gt;
&lt;li&gt;Public-facing pages that display dynamic data.&lt;/li&gt;
&lt;li&gt;Admin portals for content management.&lt;/li&gt;
&lt;li&gt;List or detail pages involving database models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the other hand, function-based views:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are simpler and take less code to create.&lt;/li&gt;
&lt;li&gt;Can be easier for Python developers to grasp.&lt;/li&gt;
&lt;li&gt;Are highly flexible and have fewer constraints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Their use cases include: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prototyping ideas.&lt;/li&gt;
&lt;li&gt;Simple CRUD or database views.&lt;/li&gt;
&lt;li&gt;Landing or marketing pages. &lt;/li&gt;
&lt;li&gt;API endpoints for serving web requests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, function-based views are flexible, straightforward, and easy to reason about. However, for more complex cases, you’ll end up writing more code that you can’t reuse.&lt;/p&gt;

&lt;p&gt;Class-based views in Django enforce structure and are reusable, but they can be more challenging to understand and implement, as well as harder to debug.&lt;/p&gt;

&lt;h2&gt;
  
  
  Views and URLs
&lt;/h2&gt;

&lt;p&gt;As we’ve established, in Django, views are the functions or classes that determine how a template is rendered. Each view links to a specific URL pattern, guiding incoming requests to the right place.&lt;/p&gt;

&lt;p&gt;Understanding the relationship between views and URLs is important for managing your application’s flow effectively. &lt;/p&gt;

&lt;p&gt;Every view corresponds with a URL pattern defined in your Django app’s &lt;code&gt;urls.py&lt;/code&gt; file. This URL mapping ensures that when a user navigates to a specific address in your application, Django knows exactly which view to invoke. &lt;/p&gt;

&lt;p&gt;Let’s take a look at a simple URL configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from django.urls import path
from .views import BookListView

urlpatterns = [
    path('books/', BookListView.as_view(), name='book-list'),
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this setup, when a user visits &lt;code&gt;/books/&lt;/code&gt;, the &lt;code&gt;BookListView&lt;/code&gt; kicks in to render the list of books. By clearly mapping URLs to views, you make your codebase easier to read and more organized.&lt;/p&gt;
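
&lt;p&gt;The &lt;code&gt;name='book-list'&lt;/code&gt; argument also lets you refer to this route without hard-coding the path. As a small sketch based on the URL configuration above, you can resolve the name in Python with &lt;code&gt;reverse()&lt;/code&gt;, or in templates with the &lt;code&gt;{% url %}&lt;/code&gt; tag:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from django.urls import reverse
from django.shortcuts import redirect

# Resolve the named route defined above; returns '/books/'
books_url = reverse('book-list')

# A common use: redirect to the named route after a successful form submission
def after_create(request):
    return redirect('book-list')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;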

&lt;h3&gt;
  
  
  Simplify URL management with PyCharm
&lt;/h3&gt;

&lt;p&gt;Managing and visualizing endpoints in Django can become challenging as your application grows. PyCharm addresses this with its &lt;em&gt;Endpoints&lt;/em&gt; tool window, which provides a centralized view of all your app’s URL patterns, linked views, and HTTP methods. This feature allows you to see a list of every endpoint in your project, making it easier to track which views are tied to specific URLs. &lt;/p&gt;

&lt;p&gt;Instead of searching through multiple &lt;code&gt;urls.py&lt;/code&gt; files, you can instantly locate and navigate to the corresponding views with just a click. This is especially useful for larger Django projects where URL configurations span multiple files or when working in teams where establishing context quickly is crucial.&lt;/p&gt;

&lt;p&gt;Furthermore, the &lt;em&gt;Endpoints&lt;/em&gt; tool window lets you visualize all endpoints in a table-like interface. Each row displays the URL path, the HTTP method (&lt;code&gt;GET&lt;/code&gt;, &lt;code&gt;POST&lt;/code&gt;, etc.), and the associated view function or class of a given endpoint. &lt;/p&gt;

&lt;p&gt;This feature not only boosts productivity but also improves code navigation, allowing you to spot missing or duplicated URL patterns with ease. This level of visibility is invaluable for debugging routing issues or onboarding new developers to a project.&lt;/p&gt;

&lt;p&gt;Check out this video for more information on the &lt;em&gt;Endpoints&lt;/em&gt; tool window and how you can benefit from it. &lt;/p&gt;

&lt;h2&gt;
  
  
  Best practices for using Django views
&lt;/h2&gt;

&lt;p&gt;Here are some guidelines that can help you create well-structured and maintainable views.&lt;/p&gt;

&lt;h3&gt;
  
  
  Keep views focused
&lt;/h3&gt;

&lt;p&gt;Views should concentrate on handling requests, fetching data, passing data to templates, and controlling flow and redirects. Complicated &lt;a href="https://forum.djangoproject.com/t/where-to-put-business-logic-in-django/282" rel="noopener noreferrer"&gt;business logic&lt;/a&gt; and complex processing should happen elsewhere, such as in model methods or dedicated service classes. &lt;/p&gt;

&lt;p&gt;However, you should be mindful not to overload your models with too much logic, as this can lead to the “fat model” anti-pattern in Django. &lt;a href="https://docs.djangoproject.com/en/5.1/topics/class-based-views/" rel="noopener noreferrer"&gt;Django’s documentation on views&lt;/a&gt; provides more insights about structuring them properly. &lt;/p&gt;
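
&lt;p&gt;As a minimal sketch of this separation (the module, model, and field names are illustrative), business logic can live in a plain service function that the view simply calls:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# services.py (hypothetical module)
from django.db.models import Sum

from .models import Sale  # assumes a Sale model with 'date' and 'amount' fields

def total_revenue_since(start_date):
    """Aggregate revenue here so the view only handles the request and response."""
    return Sale.objects.filter(date__gte=start_date).aggregate(total=Sum('amount'))['total']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The view then stays focused: it parses the request, calls &lt;code&gt;total_revenue_since()&lt;/code&gt;, and passes the result to a template.&lt;/p&gt;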

&lt;h3&gt;
  
  
  Keep views and templates thin
&lt;/h3&gt;

&lt;p&gt;It’s best to keep both views and templates slim. Views should handle request processing and data retrieval, while templates should focus on presentation with minimal logic.&lt;/p&gt;

&lt;p&gt;Complex processing should be done in Python outside the templates to improve maintainability and testing. For more on this, check out the &lt;a href="https://docs.djangoproject.com/en/stable/topics/templates/" rel="noopener noreferrer"&gt;Django templates documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decouple database queries
&lt;/h3&gt;

&lt;p&gt;Extracting database queries into separate model managers or repositories instead of placing them directly in views can help reduce duplication. Refer to the &lt;a href="https://docs.djangoproject.com/en/stable/topics/db/models/" rel="noopener noreferrer"&gt;Django models documentation&lt;/a&gt; for guidance on managing database interactions effectively. &lt;/p&gt;
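
&lt;p&gt;For example, a query that would otherwise be repeated across views can live in a custom model manager. This is a sketch with an illustrative &lt;code&gt;Book&lt;/code&gt; model and field names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from django.db import models

class AvailableBookManager(models.Manager):
    def get_queryset(self):
        # Centralize the filter so every caller shares one definition of "available"
        return super().get_queryset().filter(is_available=True)

class Book(models.Model):
    title = models.CharField(max_length=200)
    is_available = models.BooleanField(default=True)

    objects = models.Manager()          # default manager
    available = AvailableBookManager()  # Book.available.all() in any view
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;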

&lt;h3&gt;
  
  
  Use generic class-based views when possible
&lt;/h3&gt;

&lt;p&gt;Django’s generic class-based views, like &lt;code&gt;DetailView&lt;/code&gt; and &lt;code&gt;ListView&lt;/code&gt;, provide reusability without requiring you to write much code. Opt for using them over reinventing the wheel to make better use of your time. The &lt;a href="https://docs.djangoproject.com/en/stable/topics/class-based-views/generic-display/" rel="noopener noreferrer"&gt;generic views documentation&lt;/a&gt; is an excellent resource for understanding these features. &lt;/p&gt;

&lt;h3&gt;
  
  
  Function-based views are OK for simple cases
&lt;/h3&gt;

&lt;p&gt;For basic views like serving APIs, a function can be more effective than a class. Reserve complex class-based views for intricate UI flows. The &lt;a href="https://docs.djangoproject.com/en/stable/topics/http/views/" rel="noopener noreferrer"&gt;writing views documentation&lt;/a&gt; page offers helpful examples.&lt;/p&gt;

&lt;h3&gt;
  
  
  Structure routes and URLs cleanly
&lt;/h3&gt;

&lt;p&gt;Organize routes and view handlers by grouping them into apps by functionality. This makes it easier to find and navigate the source. Check out the &lt;a href="https://docs.djangoproject.com/en/stable/topics/http/urls/" rel="noopener noreferrer"&gt;Django URL dispatcher documentation&lt;/a&gt; for best practices in structuring your URL configurations. &lt;/p&gt;
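
&lt;p&gt;In practice, this usually means a project-level &lt;code&gt;urls.py&lt;/code&gt; that delegates to each app with &lt;code&gt;include()&lt;/code&gt;. The app names below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# project/urls.py (sketch)
from django.urls import include, path

urlpatterns = [
    path('books/', include('books.urls')),        # all book-related routes
    path('accounts/', include('accounts.urls')),  # all account-related routes
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;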

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;Now that you have a basic understanding of views in Django, you’ll want to dig deeper into the framework and explore your next steps.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Brush up on your Django knowledge with our &lt;a href="https://blog.jetbrains.com/pycharm/2024/01/how-to-learn-django/" rel="noopener noreferrer"&gt;&lt;em&gt;How to Learn Django&lt;/em&gt;&lt;/a&gt; blog post, which is ideal for beginners or those looking to refresh their expertise.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Discover how to &lt;a href="https://www.jetbrains.com/help/pycharm/creating-and-running-your-first-django-project.html" rel="noopener noreferrer"&gt;create and run your first Django project&lt;/a&gt; in PyCharm, with our tutorial on crafting a basic to-do application, or explore our complete list of &lt;a href="https://blog.jetbrains.com/pycharm/2024/09/django-project-ideas/" rel="noopener noreferrer"&gt;Django project ideas&lt;/a&gt; for further inspiration. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Explore the &lt;a href="https://blog.jetbrains.com/pycharm/2024/06/the-state-of-django/" rel="noopener noreferrer"&gt;state of Django&lt;/a&gt; to see the latest trends in Django development for further inspiration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you’re still deciding which Python framework to use, our &lt;a href="https://blog.jetbrains.com/pycharm/2023/11/django-vs-flask-which-is-the-best-python-web-framework/" rel="noopener noreferrer"&gt;&lt;em&gt;Django vs. Flask&lt;/em&gt;&lt;/a&gt; and &lt;a href="https://blog.jetbrains.com/pycharm/2023/12/django-vs-fastapi-which-is-the-best-python-web-framework/" rel="noopener noreferrer"&gt;&lt;em&gt;Django vs. FastAPI&lt;/em&gt;&lt;/a&gt; comparison guides can help.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Django support in PyCharm
&lt;/h3&gt;

&lt;p&gt;PyCharm Professional is the best-in-class IDE for &lt;a href="https://www.jetbrains.com/pycharm/web-development/django" rel="noopener noreferrer"&gt;Django development&lt;/a&gt;. It allows you to code faster with Django-specific code assistance, project-wide navigation and refactoring, and full support for Django templates. You can connect to your database in a single click and work on TypeScript, JavaScript, and frontend frameworks. PyCharm also supports Flask and FastAPI out of the box. &lt;/p&gt;

&lt;p&gt;Create better applications and streamline your code. Get started with PyCharm now for an effortless Django development experience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/web-development/django/" rel="noopener noreferrer"&gt;Start with PyCharm Pro for free&lt;/a&gt;&lt;/p&gt;

</description>
      <category>django</category>
      <category>python</category>
    </item>
    <item>
      <title>Anomaly Detection in Time Series</title>
      <dc:creator>Evgenia Verbina</dc:creator>
      <pubDate>Wed, 22 Jan 2025 12:14:32 +0000</pubDate>
      <link>https://dev.to/pycharm/anomaly-detection-in-time-series-3pa3</link>
      <guid>https://dev.to/pycharm/anomaly-detection-in-time-series-3pa3</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhe1wcqsvgz96dzfq80q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhe1wcqsvgz96dzfq80q.png" alt="Anomaly Detection in Time Series" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How do you identify unusual patterns in data that might reveal critical issues or hidden opportunities? &lt;a href="https://blog.jetbrains.com/pycharm/2025/01/anomaly-detection-in-machine-learning/" rel="noopener noreferrer"&gt;Anomaly detection&lt;/a&gt; helps identify data that deviates significantly from the norm. Time series data, which consists of data collected over time, often includes trends and seasonal patterns. Anomalies in time series data occur when these patterns are disrupted, making anomaly detection a valuable tool in industries like sales, finance, manufacturing, and healthcare.&lt;/p&gt;

&lt;p&gt;As time series data has unique characteristics like seasonality and trends, specialized methods are required to detect anomalies effectively. In this blog post, we’ll explore some popular methods for anomaly detection in time series, including STL decomposition and LSTM prediction, with detailed code examples to help you get started.&lt;/p&gt;

&lt;h2&gt;
  
  
  Time series anomaly detection in businesses
&lt;/h2&gt;

&lt;p&gt;Time series data is essential to many businesses and services. Many businesses record data over time with timestamps, allowing changes to be analyzed and data to be compared over time. Time series are useful when comparing a quantity over a period, for example in a year-over-year comparison where the data exhibits seasonal characteristics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sales monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the most common examples of time series data with seasonality is sales data. Since sales are heavily affected by annual holidays and the time of year, it is hard to draw conclusions about sales data without accounting for seasonality. Because of that, a common method for analyzing and finding anomalies in sales data is STL decomposition, which we will cover in detail &lt;a href="https://blog.jetbrains.com/pycharm/2025/01/anomaly-detection-in-time-series/#stl-beehive" rel="noopener noreferrer"&gt;later in this blog post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Financial data, such as transactions and stock prices, are typical examples of time series data. In the finance industry, analyzing and detecting anomalies in this data is a common practice. For example, time series prediction models can be used in automatic trading. We’ll use a time series prediction to identify anomalies in stock data &lt;a href="https://blog.jetbrains.com/pycharm/2025/01/anomaly-detection-in-time-series/#lstm-stock" rel="noopener noreferrer"&gt;later in this blog post.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manufacturing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Another use case of time series anomaly detection is monitoring defects in production lines. Machines are often monitored continuously, making time series data readily available. Being able to notify management of potential failures is essential, and anomaly detection plays a key role.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Medicine and healthcare&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In medicine and healthcare, human vitals are monitored continuously so that anomalies can be detected. This is important in medical research, but it’s critical in diagnostics. If a patient at a hospital has anomalies in their vitals and is not treated immediately, the results can be fatal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is it important to use special methods for time series anomaly detection?
&lt;/h2&gt;

&lt;p&gt;Time series data is special in the sense that it sometimes cannot be treated like other types of data. For example, when we apply a train-test split to time series data, the sequentially related nature of the data means we cannot shuffle it. This is also true when feeding time series data into a deep learning model. A recurrent neural network (RNN) is commonly used to take the sequential relationship into account, and training data is input as time windows, which preserve the sequence of events within.&lt;/p&gt;
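
&lt;p&gt;As a quick sketch (the variable name is illustrative), a chronological split keeps the order of observations intact instead of sampling at random:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Split a time series 80/20 without shuffling
split_point = int(len(timeseries) * 0.8)
train, test = timeseries[:split_point], timeseries[split_point:]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;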

&lt;p&gt;Time series data is also special because it often has seasonality and trends that we cannot ignore. This seasonality can manifest in a 24-hour cycle, a 7-day cycle, or a 12-month cycle, just to name a few common possibilities. Anomalies can only be determined after the seasonality and trends have been considered, as you will see in &lt;a href="https://blog.jetbrains.com/pycharm/2025/01/anomaly-detection-in-time-series/#stl-beehive" rel="noopener noreferrer"&gt;our example below&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Methods used for anomaly detection in time series
&lt;/h2&gt;

&lt;p&gt;Because time series data is special, there are specific methods for detecting anomalies in it. Depending on the type of data, some of the methods and algorithms we mentioned in the previous &lt;a href="https://blog.jetbrains.com/pycharm/2025/01/anomaly-detection-in-machine-learning/" rel="noopener noreferrer"&gt;blog post about anomaly detection&lt;/a&gt; can be used on time series data. However, with those methods, the anomaly detection may not be as robust as it would be with methods designed specifically for time series data. In some cases, a combination of detection methods can be used to reconfirm the detection result and avoid false positives or negatives.&lt;/p&gt;

&lt;h3&gt;
  
  
  STL decomposition
&lt;/h3&gt;

&lt;p&gt;One of the most popular ways to analyze time series data that has seasonality is STL decomposition – seasonal-trend decomposition using LOESS (locally estimated scatterplot smoothing). In this method, a time series is decomposed using an estimate of seasonality (with the period provided or determined using an algorithm), a trend (estimated), and the residual (the noise in the data). A &lt;a href="https://www.jetbrains.com/help/pycharm/python.html" rel="noopener noreferrer"&gt;Python&lt;/a&gt; library that provides &lt;a href="https://www.statsmodels.org/stable/examples/notebooks/generated/stl_decomposition.html" rel="noopener noreferrer"&gt;STL decomposition tools&lt;/a&gt; is the &lt;a href="https://www.statsmodels.org/stable/index.html" rel="noopener noreferrer"&gt;statsmodels&lt;/a&gt; library.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmszn3kpoto7kz6587ec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmszn3kpoto7kz6587ec.png" alt="STL decomposition" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An anomaly is detected when the residual is beyond a certain threshold. &lt;/p&gt;

&lt;h3&gt;
  
  
  Using STL decomposition on beehive data
&lt;/h3&gt;

&lt;p&gt;In an earlier &lt;a href="https://blog.jetbrains.com/pycharm/2025/01/anomaly-detection-in-machine-learning/" rel="noopener noreferrer"&gt;blog post&lt;/a&gt;, we explored anomaly detection in beehives using the &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html" rel="noopener noreferrer"&gt;OneClassSVM&lt;/a&gt; and &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html" rel="noopener noreferrer"&gt;IsolationForest&lt;/a&gt; methods. &lt;/p&gt;

&lt;p&gt;In this tutorial, we’ll analyze &lt;a href="https://www.kaggle.com/datasets/vivovinco/beehives" rel="noopener noreferrer"&gt;beehive data&lt;/a&gt; as a time series using the &lt;code&gt;STL&lt;/code&gt; class provided by the statsmodels library. To get started, set up your environment using this file: &lt;a href="https://github.com/Cheukting/anomaly-detection/blob/main/requirements.txt" rel="noopener noreferrer"&gt;requirements.txt&lt;/a&gt;. &lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1. Install the library&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Since we have so far only been using models provided by scikit-learn, we will need to install statsmodels from PyPI. This is easy to do in &lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;PyCharm&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;Start with PyCharm Pro for free&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Go to the &lt;em&gt;Python&lt;/em&gt; &lt;a href="https://www.jetbrains.com/help/pycharm/installing-uninstalling-and-upgrading-packages.html" rel="noopener noreferrer"&gt;&lt;em&gt;Packages&lt;/em&gt;&lt;/a&gt; window (choose the icon at the bottom of the left-hand side of the IDE) and type statsmodels in the search box.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3d8fql3simgbbbskn3y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3d8fql3simgbbbskn3y.png" alt="Statsmodels in PyCharm" width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see all of the information about the package on the right-hand side. To install it, simply click &lt;em&gt;Install package&lt;/em&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;2. Create a Jupyter notebook&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;To investigate the dataset further, let’s create a &lt;a href="https://www.jetbrains.com/help/pycharm/jupyter-notebook-support.html" rel="noopener noreferrer"&gt;Jupyter notebook&lt;/a&gt; to take advantage of the tools that PyCharm’s Jupyter notebook environment provides.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fna10s8s26lmpgzs10olq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fna10s8s26lmpgzs10olq.png" alt="Create a Jupyter notebook in PyCharm" width="800" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will import &lt;a href="https://blog.jetbrains.com/pycharm/2024/10/data-exploration-with-pandas/" rel="noopener noreferrer"&gt;pandas&lt;/a&gt; and load the &lt;code&gt;.csv&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

df = pd.read_csv('../data/Hive17.csv', sep=";")
df = df.dropna()
df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ixa39m0s5tp4ik0ol6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ixa39m0s5tp4ik0ol6e.png" alt="Import pandas in PyCharm" width="800" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3. Inspect the data as graphs&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Now, we can inspect the data as graphs. Here, we would like to see the temperature of hive 17 over time. Click on &lt;em&gt;Chart view&lt;/em&gt; in the dataframe inspector and then choose &lt;em&gt;T17&lt;/em&gt; as the y-axis in the series settings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2rvlo4ydpk796mhp9i1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2rvlo4ydpk796mhp9i1.gif" alt="Inspect the data as graphs in PyCharm" width="720" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When expressed as a time series, the temperature has a lot of ups and downs. This indicates periodic behavior, likely due to the day-night cycle, so it is safe to assume there is a 24-hour period for the temperature. &lt;/p&gt;

&lt;p&gt;Next, there is a trend of temperature dropping over time. If you inspect the &lt;em&gt;DateTime&lt;/em&gt; column, you can see that the dates range from August to November. Since the &lt;a href="https://www.kaggle.com/datasets/vivovinco/beehives/data" rel="noopener noreferrer"&gt;Kaggle page of the dataset&lt;/a&gt; indicates that the data was collected in Turkey, the transition from summer to fall explains our observation that the temperature is dropping over time.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;4. Time series decomposition&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;To understand the time series and detect anomalies, we will perform STL decomposition, importing the &lt;code&gt;STL&lt;/code&gt; class from statsmodels and fitting it with our temperature data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from statsmodels.tsa.seasonal import STL

stl = STL(df["T17"], period=24, robust=True) 
result = stl.fit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will have to provide a period for the decomposition to work. As we mentioned before, it is safe to assume a 24-hour cycle.&lt;/p&gt;

&lt;p&gt;According to the documentation, &lt;code&gt;STL&lt;/code&gt; decomposes a time series into three components: trend, seasonal, and residual. To get a clearer look at the decomposed result, we can use the built-in &lt;code&gt;plot&lt;/code&gt; method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;result.plot()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjf8q9miqxvyeijal4d1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjf8q9miqxvyeijal4d1.png" alt="Time series decomposition" width="800" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see the &lt;em&gt;Trend&lt;/em&gt; and &lt;em&gt;Season&lt;/em&gt; plots seem to align with our assumptions above. However, we are interested in the residual plot at the bottom, which is the original series without the trend and seasonal changes. Any extremely high or low value in the residual indicates an anomaly.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;5. Anomaly threshold&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Next, we would like to determine what values of the residual we’ll consider abnormal. To do that, we can look at the residual’s histogram.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;result.resid.plot.hist()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqd6ih8hl14kkx9znky19.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqd6ih8hl14kkx9znky19.png" alt="Anomaly threshold in PyCharm" width="800" height="631"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This can be considered a normal distribution around 0, with a long tail above 5 and below -5, so we’ll set the threshold to 5.&lt;/p&gt;

&lt;p&gt;To show the anomalies on the original time series, we can color all of them red in the graph like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import matplotlib.pyplot as plt

threshold = 5
anomalies_filter = result.resid.abs() &amp;gt; threshold
anomalies = df["T17"][anomalies_filter]

plt.figure(figsize=(14, 8))
plt.scatter(x=anomalies.index, y=anomalies, color="red", label="anomalies")
plt.plot(df.index, df['T17'], color='blue')
plt.title('Temperatures in Hive 17')
plt.xlabel('Hours')
plt.ylabel('Temperature')
plt.legend()
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zt9fvo93prtvvbdrl78.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zt9fvo93prtvvbdrl78.png" alt="Anomalies on the original time series in PyCharm" width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Without STL decomposition, it is very hard to identify these anomalies in a time series consisting of periods and trends.&lt;/p&gt;

&lt;h3&gt;
  
  
  LSTM prediction
&lt;/h3&gt;

&lt;p&gt;Another way to detect anomalies in time series data is to predict the series with a deep learning model and compare its estimates to the actual data points. If an estimate is very different from the actual data point, it could be a sign of anomalous data.&lt;/p&gt;

&lt;p&gt;One of the most popular deep learning algorithms for predicting sequential data is the long short-term memory (LSTM) model, which is a type of recurrent neural network (RNN). The LSTM model has input, forget, and output gates, which are learned weight matrices that control how much information is kept, discarded, and passed on to the next iteration over the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmmu9gp013o5iumy8nhcg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmmu9gp013o5iumy8nhcg.png" alt="LSTM memory cell" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since time series data is sequential, meaning the order of data points matters and should not be shuffled, the LSTM model is an effective deep learning model for predicting the value at a given time. The prediction can then be compared to the actual data, and a threshold can be set to determine whether the actual data point is an anomaly.&lt;/p&gt;
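
&lt;p&gt;In code, the idea reduces to comparing the prediction error against a threshold. The following is a minimal sketch; &lt;code&gt;actual&lt;/code&gt;, &lt;code&gt;predicted&lt;/code&gt;, and the threshold value are placeholders rather than part of the tutorial below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

threshold = 3.0  # illustrative; in practice, derive it from the error distribution
errors = np.abs(actual - predicted)
anomaly_mask = errors &amp;gt; threshold  # True where the model's estimate is far off
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;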

&lt;h3&gt;
  
  
  Using LSTM prediction on stock prices
&lt;/h3&gt;

&lt;p&gt;Now let’s start a new Jupyter project to detect any anomalies in Apple’s stock price over the past 5 years. The &lt;a href="https://www.nasdaq.com/market-activity/stocks/aapl/historical?page=1&amp;amp;rows_per_page=25&amp;amp;timeline=y5" rel="noopener noreferrer"&gt;stock price dataset&lt;/a&gt; shows the most up-to-date data. If you want to follow along with the blog post, you can &lt;a href="https://github.com/Cheukting/lstm_anomaly_detection/tree/main/data" rel="noopener noreferrer"&gt;download the dataset&lt;/a&gt; we are using.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1. Start a Jupyter project&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;When starting a new project, you can choose to create a Jupyter one, which is optimized for data science. In the &lt;em&gt;New Project&lt;/em&gt; window, you can create a Git repository and determine which conda installation to use for managing your environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftq7dtpgudj5l6vmebjgq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftq7dtpgudj5l6vmebjgq.png" alt="Start a Jupyter project in PyCharm" width="800" height="644"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After starting the project, you will see an example notebook. Go ahead and start a new Jupyter notebook for this exercise.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj32smg65usrwuhxwiut2.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj32smg65usrwuhxwiut2.gif" alt="An example notebook in PyCharm" width="716" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After that, let’s set up &lt;code&gt;requirements.txt&lt;/code&gt;. We will need pandas, matplotlib, and PyTorch, which is named torch on PyPI. Since PyTorch is not included in the conda environment, PyCharm will tell us that we are missing the package. To install the package, click on the lightbulb and select &lt;em&gt;Install all missing packages&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frg6toit833gw1j6z1kua.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frg6toit833gw1j6z1kua.gif" alt="Install all missing packages in PyCharm" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;2. Loading and inspecting the data&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Next, let’s put our dataset &lt;a href="https://github.com/Cheukting/lstm_anomaly_detection/tree/main/data" rel="noopener noreferrer"&gt;apple_stock_5y.csv&lt;/a&gt; in the data folder and load it as a pandas DataFrame to inspect it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

df = pd.read_csv('data/apple_stock_5y.csv')
df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the interactive table, we can easily see if any data is missing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4onlzqurdlzgg2qvt4h.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4onlzqurdlzgg2qvt4h.gif" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is no missing data, but we have one issue – we would like to use the &lt;em&gt;Close/Last&lt;/em&gt; price but it is not a numeric data type. Let’s do a conversion and inspect our data again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df["Close/Last"] = df["Close/Last"].apply(lambda x: float(x[1:]))
df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we can inspect the price with the interactive table. Click on the plot icon on the left and a plot will be created. By default, it uses &lt;em&gt;Date&lt;/em&gt; as the x-axis and &lt;em&gt;Volume&lt;/em&gt; as the y-axis. Since we would like to inspect the &lt;em&gt;Close/Last&lt;/em&gt; price, go to the settings by clicking the gear icon on the right and choose &lt;em&gt;Close/Last&lt;/em&gt; as the y-axis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67nltgl7pq8renr9p1x8.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67nltgl7pq8renr9p1x8.gif" width="642" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3. Preparing the training data for LSTM&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Next, we have to prepare the training data to be used in the LSTM model. We need to prepare a sequence of vectors (feature X), each representing a time window, to predict the next price. The next price will form another sequence (target y). Here we can choose how big this time window is with the &lt;code&gt;lookback&lt;/code&gt; variable. The following code creates sequences X and y which will then be converted to PyTorch tensors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch

lookback = 5
timeseries = df[["Close/Last"]].values.astype('float32')

X, y = [], []
for i in range(len(timeseries)-lookback):
    feature = timeseries[i:i+lookback]
    target = timeseries[i+1:i+lookback+1]
    X.append(feature)
    y.append(target)

X = torch.tensor(X)
y = torch.tensor(y)

print(X.shape, y.shape)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generally speaking, the bigger the window, the bigger our model will be, since the input vector is bigger. However, with a bigger window, the sequence of inputs will be shorter, so determining this lookback window is a balancing act. We will start with 5, but feel free to try different values to see the differences.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;4. Build and train the model&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;We can build the model by creating a class using the &lt;a href="https://pytorch.org/docs/stable/nn.html" rel="noopener noreferrer"&gt;nn module&lt;/a&gt; in PyTorch before we train it. The nn module provides building blocks, such as different neural network layers. In this exercise, we will build a simple &lt;a href="https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html" rel="noopener noreferrer"&gt;LSTM layer&lt;/a&gt; followed by a &lt;a href="https://pytorch.org/docs/stable/generated/torch.nn.Linear.html" rel="noopener noreferrer"&gt;linear layer&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch.nn as nn

class StockModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=50, num_layers=1, batch_first=True)
        self.linear = nn.Linear(50, 1)
    def forward(self, x):
        x, _ = self.lstm(x)
        x = self.linear(x)
        return x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we will train our model. Before training it, we will need to create an optimizer, a &lt;a href="https://pytorch.org/docs/stable/nn.html#loss-functions" rel="noopener noreferrer"&gt;loss function&lt;/a&gt; used to calculate the loss between the predicted and actual y values, and a &lt;a href="https://pytorch.org/docs/stable/data.html#data-loading-order-and-sampler" rel="noopener noreferrer"&gt;data loader&lt;/a&gt; to feed in our training data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import torch.optim as optim
import torch.utils.data as data

model = StockModel()
optimizer = optim.Adam(model.parameters())
loss_fn = nn.MSELoss()
loader = data.DataLoader(data.TensorDataset(X, y), shuffle=True, batch_size=8)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data loader can safely shuffle the input because we have already created the time windows; the sequential relationship within each window is preserved.&lt;/p&gt;

&lt;p&gt;Training is done with a &lt;code&gt;for&lt;/code&gt; loop that iterates over the epochs. Every 100 epochs, we will print out the loss and observe how the model converges:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;n_epochs = 1000
for epoch in range(n_epochs):
    model.train()
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if epoch % 100 != 0:
        continue
    model.eval()
    with torch.no_grad():
        y_pred = model(X)
        rmse = np.sqrt(loss_fn(y_pred, y))
    print(f"Epoch {epoch}: RMSE {rmse:.4f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We start with 1,000 epochs, but the model converges quite quickly. Feel free to try other epoch counts to achieve the best result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6g6vjuqwrmvfabacn6a9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6g6vjuqwrmvfabacn6a9.png" alt="Epochs for training" width="800" height="621"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In PyCharm, a cell that requires some time to execute will provide a notification about how much time remains and a shortcut to the cell. This is very handy when training machine learning models, especially deep learning models, in Jupyter notebooks.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;5. Plot the prediction and find the errors&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Next, we will create the prediction and plot it together with the actual time series. Note that we have to create a 2D NumPy array to match the shape of the actual time series. The actual time series will be in blue, while the predicted time series will be in red.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import matplotlib.pyplot as plt

with torch.no_grad():
    pred_series = np.ones_like(timeseries) * np.nan
    pred_series[lookback:] = model(X)[:, -1, :]

plt.plot(timeseries, c='b')
plt.plot(pred_series, c='r')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffb6r1bchxij4iu6ljfpl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffb6r1bchxij4iu6ljfpl.png" alt="Plot the prediction and find the errors" width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you look carefully, you will see that the predictions and the actual values do not align perfectly. However, most of the predictions track the actual values well.&lt;/p&gt;

&lt;p&gt;To inspect the errors closely, we can create an error series and use the interactive table to observe them. We are using the absolute error this time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;error = abs(timeseries-pred_series)
error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the table settings to create a histogram with the absolute error on the x-axis and its count on the y-axis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6k9ppsjg6w4f1l28hsd.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6k9ppsjg6w4f1l28hsd.gif" width="452" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;6. Decide on the anomaly threshold and visualize&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Most of the points will have an absolute error of less than 6, so we can set that as the anomaly threshold. Similar to &lt;a href="https://blog.jetbrains.com/pycharm/2025/01/anomaly-detection-in-time-series/#anomaly-threshold" rel="noopener noreferrer"&gt;what we did for the beehive anomalies&lt;/a&gt;, we can plot the anomalous data points in the graph.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;threshold = 6
error_series = pd.Series(error.flatten())
price_series = pd.Series(timeseries.flatten())

anomalies_filter = error_series &amp;gt; threshold
anomalies = price_series[anomalies_filter]

plt.figure(figsize=(14, 8))
plt.scatter(x=anomalies.index, y=anomalies, color="red", label="anomalies")
plt.plot(df.index, timeseries, color='blue')
plt.title('Closing price')
plt.xlabel('Days')
plt.ylabel('Price')
plt.legend()
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhu9qjqbsn3szpoeeftv3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhu9qjqbsn3szpoeeftv3.png" alt="Plot the anomalous data points in the graph" width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Time series data is a common form of data used in many applications, including business and scientific research. Due to its sequential nature, special methods and algorithms are used to detect anomalies in it. In this blog post, we demonstrated how to identify anomalies using STL decomposition to remove seasonality and trend. We also demonstrated how to use deep learning and an LSTM model to compare predicted and actual values in order to determine anomalies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detect anomalies using PyCharm
&lt;/h2&gt;

&lt;p&gt;With the Jupyter project in PyCharm Professional, you can easily organize an anomaly detection project with many data files and notebooks. Graph output can be generated to inspect anomalies, and plots are readily accessible in PyCharm. Other features, such as auto-complete suggestions, make navigating all the Scikit-learn models and Matplotlib plot settings a blast.&lt;/p&gt;

&lt;p&gt;Power up your data science projects by using PyCharm, and &lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;check out the data science features offered&lt;/a&gt; to streamline your data science workflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;Start with PyCharm Pro for free&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>anomalydetection</category>
    </item>
    <item>
      <title>Anomaly Detection in Machine Learning Using Python</title>
      <dc:creator>Evgenia Verbina</dc:creator>
      <pubDate>Thu, 16 Jan 2025 10:08:19 +0000</pubDate>
      <link>https://dev.to/pycharm/anomaly-detection-in-machine-learning-using-python-3fbb</link>
      <guid>https://dev.to/pycharm/anomaly-detection-in-machine-learning-using-python-3fbb</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36ijtvagdu6ynghp0o31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36ijtvagdu6ynghp0o31.png" alt="Anomaly Detection in Machine Learning Using Python" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In recent years, many of our applications have been driven by the high volume of data that we are able to collect and process. Some may refer to us being in the age of data. One of the essential aspects of handling such a large amount of data is &lt;strong&gt;anomaly detection&lt;/strong&gt; – processes that enable us to identify outliers, data that is outside the bounds of expectation and demonstrates behavior that is out of the norm. In scientific research, anomalous data points could be the result of technical issues and may need to be discarded when drawing conclusions, or they could lead to new discoveries.&lt;/p&gt;

&lt;p&gt;In this blog post, we’ll see why using machine learning for anomaly detection is helpful and explore key techniques for detecting anomalies using Python. You’ll learn how to implement popular methods like OneClassSVM and Isolation Forest, see examples of how to visualize these results and understand how to apply them to real-world problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where is anomaly detection used?
&lt;/h2&gt;

&lt;p&gt;Anomaly detection has become a crucial part of modern-day business intelligence, as it provides insight into what could go wrong and helps identify potential problems early. Here are some examples of anomaly detection in modern-day business.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security alerts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some cyber security attacks can be detected via anomaly detection; for example, a spike in request volume may indicate a &lt;a href="https://en.wikipedia.org/wiki/Denial-of-service_attack" rel="noopener noreferrer"&gt;DDoS attack&lt;/a&gt;, while suspicious login behavior, like multiple failed attempts, may indicate unauthorized access. Detecting suspicious user behavior can reveal potential cyber security threats, and companies can act on them accordingly to prevent or minimize the damage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fraud detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In financial organizations, for example, banks can use anomaly detection to highlight suspicious account activities, which may be an indication of illegal activities like money laundering or identity theft. Suspicious transactions can also be a sign of credit card fraud.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One common practice for web services is to collect real-time performance metrics so that abnormal behavior in the system can be spotted. For example, a spike in memory usage may show that something in the system isn’t functioning properly, and engineers may need to address it immediately to avoid a break in service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use machine learning for anomaly detection?
&lt;/h2&gt;

&lt;p&gt;Although traditional statistical methods can also help find outliers, the use of machine learning for anomaly detection has been a game changer. With machine learning algorithms, more complex data (e.g. with multiple parameters) can be analyzed all at once. Machine learning techniques also provide a means to analyze categorical data that isn’t easy to analyze using traditional statistical methods, which are more suited to numerical data.  &lt;/p&gt;

&lt;p&gt;Much of the time, these anomaly detection algorithms are programmed and can be deployed as an application (see our &lt;a href="https://blog.jetbrains.com/pycharm/2024/09/how-to-use-fastapi-for-machine-learning/" rel="noopener noreferrer"&gt;FastAPI for Machine Learning&lt;/a&gt; tutorial) and run on demand or at scheduled intervals to detect any anomalies. This means that they can prompt immediate actions within the company and can also be used as reporting tools for business intelligence teams to review and adjust strategies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of anomaly detection techniques and algorithms
&lt;/h2&gt;

&lt;p&gt;There are generally two main types of anomaly detection: outlier detection and novelty detection.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outlier detection&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;Outlier detection is sometimes referred to as &lt;strong&gt;unsupervised&lt;/strong&gt; anomaly detection, as it is assumed that in the training data, there are some undetected anomalies (thus unlabeled), and the approach is to use unsupervised machine learning algorithms to pick them out. Some of these algorithms include &lt;a href="https://scikit-learn.org/stable/modules/outlier_detection.html" rel="noopener noreferrer"&gt;one-class support vector machines (SVMs), Isolation Forest, Local Outlier Factor, and Elliptic Envelope&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Novelty detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On the other hand, novelty detection is sometimes referred to as &lt;strong&gt;semi-supervised&lt;/strong&gt; anomaly detection. Since we assume the training data contains no anomalies, all of it is labeled as normal. The goal is to detect whether or not new data is an anomaly, which is sometimes referred to as a novelty. The algorithms used in outlier detection can also be used for novelty detection, provided that there are no anomalies in the training data.&lt;/p&gt;
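
&lt;p&gt;As an illustration only, here is a minimal novelty detection sketch on synthetic data; the data, the nu value, and the test points are made-up assumptions for the example, not taken from any real dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.svm import OneClassSVM
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))  # training data assumed to be anomaly-free

clf = OneClassSVM(nu=0.05).fit(X_train)

# predict() returns 1 for normal points and -1 for novelties
X_new = np.array([[0.1, -0.2], [6.0, 6.0]])
print(clf.predict(X_new))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
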

&lt;p&gt;Beyond outlier detection and novelty detection, it is also very common to need anomaly detection in time series data. However, since the approaches and techniques used for time series data often differ from the algorithms mentioned above, we’ll discuss them in detail at a later date.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code example: finding anomalies in the Beehives dataset
&lt;/h2&gt;

&lt;p&gt;In this blog post, we’ll be using this &lt;a href="https://www.kaggle.com/datasets/vivovinco/beehives/data" rel="noopener noreferrer"&gt;Beehives dataset&lt;/a&gt; as an example to detect any anomalies in the hives. This dataset provides various measurements of the hives (including temperature and relative humidity) at various times.&lt;/p&gt;

&lt;p&gt;Here, we’ll show two very different methods for discovering anomalies: OneClassSVM, which is based on support vector machine technology and which we’ll use to draw decision boundaries, and Isolation Forest, an ensemble method similar to Random Forest.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example: OneClassSVM
&lt;/h3&gt;

&lt;p&gt;In this first example, we’ll use the data for hive 17. Assuming bees keep their hive in a constant, pleasant environment for the colony, we can look at whether this is true and whether there are times when the hive experiences anomalous temperature and relative humidity levels. We’ll use &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html" rel="noopener noreferrer"&gt;OneClassSVM&lt;/a&gt; to fit our data and look at the decision boundaries on a scatter plot.&lt;/p&gt;

&lt;p&gt;The SVM in OneClassSVM stands for &lt;a href="https://scikit-learn.org/stable/modules/svm.html#svm" rel="noopener noreferrer"&gt;support vector machine&lt;/a&gt;, a popular machine learning algorithm for classification and regression. While support vector machines are generally used to &lt;a href="https://scikit-learn.org/stable/modules/svm.html#mathematical-formulation" rel="noopener noreferrer"&gt;classify data points in high dimensions&lt;/a&gt;, here we choose a kernel and a scalar parameter to define a frontier: a decision boundary that includes most of the data points (normal data) while leaving a small number of anomalies outside the boundary, reflecting the probability (nu) of finding a new anomaly. The method of using support vector machines for anomaly detection is covered in a paper by Schölkopf et al. entitled &lt;a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-99-87.pdf" rel="noopener noreferrer"&gt;&lt;em&gt;Estimating the Support of a High-Dimensional Distribution&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;
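
&lt;p&gt;To make the role of nu concrete, here is a small sketch on synthetic data (the data and nu values are illustrative assumptions): the fraction of training points flagged as anomalies roughly tracks the nu you choose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.svm import OneClassSVM
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))

# nu roughly bounds the fraction of training points left outside the frontier
for nu in (0.01, 0.1, 0.3):
    pred = OneClassSVM(nu=nu).fit(X).predict(X)
    print(f"nu={nu}: {(pred == -1).mean():.2f} flagged as anomalies")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
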

&lt;h4&gt;
  
  
  &lt;strong&gt;1. Start a Jupyter project&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;When starting a &lt;a href="https://www.jetbrains.com/help/pycharm/creating-and-running-your-first-python-project.html" rel="noopener noreferrer"&gt;new project&lt;/a&gt; in PyCharm (Professional 2024.2.2), select &lt;em&gt;Jupyter&lt;/em&gt; under &lt;em&gt;Python&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;Start with PyCharm Pro for free&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9w5wrck36807kuyaqg1j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9w5wrck36807kuyaqg1j.png" alt="Start a Jupyter project in PyCharm" width="800" height="649"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The benefit of using a &lt;a href="https://www.jetbrains.com/help/pycharm/scientific-mode.html" rel="noopener noreferrer"&gt;Jupyter project&lt;/a&gt; (previously also known as a Scientific project) in PyCharm is that a file structure is generated for you, including a folder for storing your data and a folder to store all the &lt;a href="https://www.jetbrains.com/help/pycharm/jupyter-notebook-support.html" rel="noopener noreferrer"&gt;Jupyter notebooks&lt;/a&gt; so you can keep all your experiments in one place. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7nzggxh19m8n747v2bk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7nzggxh19m8n747v2bk.png" alt="Jupyter projects in PyCharm" width="800" height="711"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another huge benefit is that we can render graphs very easily with &lt;a href="https://matplotlib.org/index.html" rel="noopener noreferrer"&gt;Matplotlib&lt;/a&gt;. You will see that in the steps below.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;2. Install dependencies&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Download this &lt;a href="https://github.com/Cheukting/anomaly-detection/blob/main/requirements.txt" rel="noopener noreferrer"&gt;requirements.txt&lt;/a&gt; from the relevant GitHub repo. Once you place it in the project directory and open it in PyCharm, you will see a prompt asking you to install the missing libraries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnz0taekx02uiniphbpzg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnz0taekx02uiniphbpzg.png" alt="Install dependencies in PyCharm" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on &lt;em&gt;Install requirements&lt;/em&gt;, and all of the requirements will be installed for you. In this project, we’re using Python 3.11.1.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3. Import and inspect the data&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;You can either download the &lt;a href="https://www.kaggle.com/datasets/vivovinco/beehives/data" rel="noopener noreferrer"&gt;“Beehives” dataset from Kaggle&lt;/a&gt; or from this &lt;a href="https://github.com/Cheukting/anomaly-detection/tree/main/data" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;. Put all three CSVs in the &lt;em&gt;Data&lt;/em&gt; folder. Then, in main.py, enter the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

df = pd.read_csv('data/Hive17.csv', sep=";")
df = df.dropna()
print(df.head())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, press the &lt;em&gt;Run&lt;/em&gt; button in the top right-hand corner of the screen, and our code will be run in the Python console, giving us an idea of what our data looks like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vi23v0pzwkwlywttize.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vi23v0pzwkwlywttize.gif" alt="Import data in PyCharm" width="800" height="610"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;4. Fit the data points and inspect them in a graph&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Since we’ll be using the OneClassSVM from scikit-learn, we’ll import it together with DecisionBoundaryDisplay and Matplotlib using the code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.svm import OneClassSVM
from sklearn.inspection import DecisionBoundaryDisplay

import matplotlib.pyplot as plt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;According to the data’s description, we know that column T17 represents the temperature of the hive, and RH17 represents the relative humidity of the hive. We’ll extract the values of these two columns as our input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X = df[["T17", "RH17"]].values
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we’ll create and fit the model. Note that we’ll try the default setting first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;estimator = OneClassSVM().fit(X)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we’ll show the decision boundary together with the data points:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;disp = DecisionBoundaryDisplay.from_estimator(
    estimator,
    X,
    response_method="decision_function",
    plot_method="contour",
    xlabel="Temperature", ylabel="Humidity",
    levels=[0],
)
disp.ax_.scatter(X[:, 0], X[:, 1])
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, save and press &lt;em&gt;Run&lt;/em&gt; again, and you’ll see that the plot is shown in a separate window for inspection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flb8hx8y5v16tjmwl8art.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flb8hx8y5v16tjmwl8art.png" alt="Fit the data points and inspect them in a graph in PyCharm" width="800" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;5. Fine-tune hyperparameters&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;As the plot above shows, the decision boundary does not fit the data points very well. The data points form a couple of irregular shapes instead of an oval. To fine-tune our model, we have to provide specific values of &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDOneClassSVM.html" rel="noopener noreferrer"&gt;“nu” and “gamma” to the OneClassSVM model&lt;/a&gt;. You can try it out yourself, but after a couple of tests, it seems “nu=0.1, gamma=0.05” gives the best result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forjbozrhcx8fgjvkphqv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forjbozrhcx8fgjvkphqv.png" alt="Fine-tune hyperparameters in PyCharm" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Example: Isolation Forest
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html" rel="noopener noreferrer"&gt;Isolation Forest&lt;/a&gt; is an &lt;a href="https://scikit-learn.org/stable/api/sklearn.ensemble.html" rel="noopener noreferrer"&gt;ensemble-based method&lt;/a&gt;, similar to the more popular&lt;a href="https://scikit-learn.org/stable/modules/ensemble.html#forest" rel="noopener noreferrer"&gt;Random Forest&lt;/a&gt;classification method. By randomly selecting parting features and values, it will create many decision trees, and the path length from the root of the tree to the node making that decision will then be averaged over all the trees (hence “forest”). A short average path length indicates anomalies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9c74eg08vf97yolp0zn8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9c74eg08vf97yolp0zn8.png" alt="Isolation Forest" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A short decision path usually indicates data that is very different from the others.&lt;/em&gt;&lt;/p&gt;
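
&lt;p&gt;To see this path-length scoring in action, here is a minimal sketch on synthetic data (the planted anomaly and the values are assumptions for illustration). Scikit-learn’s score_samples returns lower values for shorter average paths, i.e. more anomalous points:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.ensemble import IsolationForest
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])  # one planted anomaly

iso = IsolationForest(n_estimators=100, random_state=0).fit(X)
scores = iso.score_samples(X)  # lower score = shorter average path = more anomalous
print("Most anomalous point:", X[scores.argmin()])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
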

&lt;p&gt;Now, let’s compare the result of OneClassSVM with IsolationForest. To do that, we’ll make two plots of the decision boundaries made by the two algorithms. In the following steps, we’ll build on the script above using the same &lt;a href="https://www.kaggle.com/datasets/vivovinco/beehives/data" rel="noopener noreferrer"&gt;hive 17 data&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;1. Import IsolationForest&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;IsolationForest can be imported from the ensemble module in Scikit-learn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.ensemble import IsolationForest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;2. Refactor and add a new estimator&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Since we’ll now have two different estimators, let’s put them in a list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;estimators = [
    OneClassSVM(nu=0.1, gamma=0.05).fit(X),
    IsolationForest(n_estimators=100).fit(X)
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, we’ll use a &lt;code&gt;for&lt;/code&gt; loop to iterate over all the estimators.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for estimator in estimators:
    disp = DecisionBoundaryDisplay.from_estimator(
        estimator,
        X,
        response_method="decision_function",
        plot_method="contour",
        xlabel="Temperature", ylabel="Humidity",
        levels=[0],
    )
    disp.ax_.scatter(X[:, 0], X[:, 1])
    plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a final touch, we’ll also add a title to each of the graphs for easier inspection. To do that, we’ll add the following after disp.ax_.scatter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;disp.ax_.set_title(
        f"Decision boundary using {estimator. __class__. __name__ }"
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You may find that refactoring using PyCharm is very easy with the auto-complete suggestions it provides.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99k8jl3t7rnhk935va2q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99k8jl3t7rnhk935va2q.png" alt="Refactoring using auto-completion in PyCharm" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4s6o7cm3e9kpm1quhj2k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4s6o7cm3e9kpm1quhj2k.png" alt="Auto-completion in PyCharm" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3. Run the code&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Like before, running the code is as easy as pressing the &lt;em&gt;Run&lt;/em&gt; button in the top-right corner. After running the code this time, we should get two graphs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnj80w7rc3i2nr2huwwc4.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnj80w7rc3i2nr2huwwc4.gif" alt="Run the code in PyCharm" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can easily flip through the two graphs with the preview on the right. As you can see, the decision boundaries produced by the two algorithms are quite different. When doing anomaly detection, it’s worth experimenting with various algorithms and parameters to find the one that suits the use case best.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next step: anomaly detection in time series data
&lt;/h2&gt;

&lt;p&gt;If your data is like our beehive data – a time series – then there are additional methods for singling out anomalies. Since time series have trends and seasonal periods, anything that falls outside these patterns can be considered an anomaly. Popular methods for detecting anomalies in time series include STL decomposition and LSTM prediction.&lt;/p&gt;

&lt;p&gt;Learn how to use these methods to detect anomalies in time series &lt;a href="https://blog.jetbrains.com/pycharm/2025/01/anomaly-detection-in-time-series/" rel="noopener noreferrer"&gt;in this blog post&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Anomaly detection has proven to be an important aspect of business intelligence, and being able to identify anomalies and prompt immediate action is essential in some sectors of business. Using the proper machine learning model to automatically detect anomalies can help analyze complicated, high volumes of data in a short period of time. In this blog post, we demonstrated how to identify anomalies using models like OneClassSVM and Isolation Forest.&lt;/p&gt;

&lt;p&gt;To learn more about using PyCharm for machine learning, please check out “&lt;a href="https://blog.jetbrains.com/pycharm/2022/06/start-studying-machine-learning-with-pycharm/" rel="noopener noreferrer"&gt;Start Studying Machine Learning With PyCharm&lt;/a&gt;” and “&lt;a href="https://blog.jetbrains.com/pycharm/2024/09/how-to-use-jupyter-notebooks-in-pycharm/" rel="noopener noreferrer"&gt;How to Use Jupyter Notebooks in PyCharm&lt;/a&gt;”.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detect anomalies using PyCharm
&lt;/h2&gt;

&lt;p&gt;With the Jupyter project in PyCharm Professional, you can easily organize an anomaly detection project with many data files and notebooks. Graph output can be generated to inspect anomalies, and plots are readily accessible in PyCharm. Other features, such as auto-complete suggestions, make navigating all the Scikit-learn models and Matplotlib plot settings a blast.&lt;/p&gt;

&lt;p&gt;Power up your data science project by using PyCharm; check out the &lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;data science features&lt;/a&gt; offered to streamline your data science workflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;Start with PyCharm Pro for free&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>anomalydetection</category>
    </item>
    <item>
      <title>Data Cleaning in Data Science</title>
      <dc:creator>Evgenia Verbina</dc:creator>
      <pubDate>Wed, 08 Jan 2025 15:02:13 +0000</pubDate>
      <link>https://dev.to/pycharm/data-cleaning-in-data-science-1ch2</link>
      <guid>https://dev.to/pycharm/data-cleaning-in-data-science-1ch2</guid>
      <description>&lt;p&gt;In this Data Science blog post series, we’ve talked about &lt;a href="https://blog.jetbrains.com/pycharm/2024/10/how-to-get-data/" rel="noopener noreferrer"&gt;where to get data from&lt;/a&gt; and how to &lt;a href="https://blog.jetbrains.com/pycharm/2024/10/data-exploration-with-pandas/" rel="noopener noreferrer"&gt;explore that data using pandas&lt;/a&gt;, but whilst that data is excellent for learning, it’s not similar to what we will term &lt;em&gt;real-world&lt;/em&gt; data. Data for learning has often already been cleaned and curated to allow you to learn quickly without needing to venture into the world of data cleaning, but real-world data has problems and is messy. Real-world data needs cleaning before it can give us useful insights, and that’s the subject of this blog post. &lt;/p&gt;

&lt;p&gt;Data problems can come from the behaviour of the data itself, the way the data was gathered, or even the way the data was input. Mistakes and oversights can happen at every stage of the journey. &lt;/p&gt;

&lt;p&gt;We are specifically talking about data cleaning here rather than data transformation. Data cleaning ensures that conclusions you make from your data can be generalised to the population you define. In contrast, data transformation involves tasks such as converting data formats, normalising data and aggregating data. &lt;/p&gt;

&lt;h2&gt;
  
  
  Why Is Data Cleaning Important?
&lt;/h2&gt;

&lt;p&gt;The first thing we need to understand about datasets is what they represent. Most datasets are a sample representing a wider population, and in working with this sample, you will be able to extrapolate (or &lt;em&gt;generalise&lt;/em&gt;) your findings to this population. For example, we used a &lt;a href="https://www.kaggle.com/datasets/prevek18/ames-housing-dataset" rel="noopener noreferrer"&gt;dataset&lt;/a&gt; in the previous two blog posts. This dataset is broadly about house sales, but it only covers a small geographical area, a small period of time and potentially not all houses in that area and period; it is a sample of a larger population. &lt;/p&gt;

&lt;p&gt;Your data needs to be a representative sample of the wider population, for example, all house sales in that area over a defined period. To ensure that our data is a representative sample of the wider population, we must first define our population’s boundaries. &lt;/p&gt;

&lt;p&gt;As you might imagine, it’s often impractical to work with the entire population, except perhaps census data, so you need to decide where your boundaries are. These boundaries might be geographical, demographical, time-based, action-based (such as transactional) or industry-specific. There are numerous ways to define your population, but to generalise your data reliably, this is something you must do before you clean your data.&lt;/p&gt;

&lt;p&gt;In summary, if you’re planning to use your data for any kind of analysis or &lt;a href="https://blog.jetbrains.com/pycharm/2022/06/start-studying-machine-learning-with-pycharm/" rel="noopener noreferrer"&gt;machine learning&lt;/a&gt;, you need to spend time cleaning the data to ensure that you can rely on your insights and generalise them to the &lt;em&gt;real world&lt;/em&gt;. Cleaning your data results in more accurate analysis and, when it comes to machine learning, performance improvements, too.&lt;/p&gt;

&lt;p&gt;Without cleaning your data, you risk issues such as not being able to generalise your learnings to the wider population reliably, inaccurate summary statistics and incorrect visualisations. If you are using your data to train machine learning models, this can also lead to errors and inaccurate predictions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://jb.gg/m8p92h" rel="noopener noreferrer"&gt;Try PyCharm Professional for free&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Examples of Data Cleaning
&lt;/h2&gt;

&lt;p&gt;We’re going to take a look at five tasks you can use to clean your data. This is not an exhaustive list, but it’s a good place to start when you get your hands on some real-world data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deduplicating data
&lt;/h3&gt;

&lt;p&gt;Duplicates are a problem because they can distort your data. Imagine you are plotting a histogram where you’re using the frequency of sale prices. If you have duplicates of the same value, you will end up with a histogram that has an inaccurate pattern based on the prices that are duplicated. &lt;/p&gt;

&lt;p&gt;As a side note, when we talk about duplication being a problem in datasets, we are talking about duplication of whole rows, each of which is a single observation. There will be duplicate values in the columns, and we expect this. We’re just talking about duplicate observations. &lt;/p&gt;

&lt;p&gt;Fortunately for us, there is a &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html" rel="noopener noreferrer"&gt;pandas method&lt;/a&gt; we can use to help us detect if there are any duplicates in our data. We can use &lt;a href="https://www.jetbrains.com/ai/" rel="noopener noreferrer"&gt;JetBrains AI&lt;/a&gt; chat if we need a reminder with a prompt such as:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Code to identify duplicate rows&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here’s the resulting code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;duplicate_rows = df[df.duplicated()]
duplicate_rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code assumes that your DataFrame is called &lt;em&gt;df&lt;/em&gt;, so make sure to change it to the name of your DataFrame if it is not.&lt;/p&gt;

&lt;p&gt;There isn’t any duplicated data in the &lt;a href="https://www.kaggle.com/datasets/prevek18/ames-housing-dataset" rel="noopener noreferrer"&gt;Ames Housing dataset&lt;/a&gt; that we’ve been using, but if you’re keen to try it out, take a look at the &lt;a href="https://www.kaggle.com/datasets/cites/cites-wildlife-trade-database" rel="noopener noreferrer"&gt;CITES Wildlife Trade Database&lt;/a&gt; dataset and see if you can find the duplicates using the pandas method above.&lt;/p&gt;

&lt;p&gt;Once you have identified duplicates in your dataset, you must remove them to avoid distorting your results. You can get the code for this with JetBrains AI again with a prompt such as:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Code to drop duplicates from my dataframe&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The resulting code drops the duplicates, resets the index of your DataFrame and then displays it as a new DataFrame called df_cleaned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_cleaned = df.drop_duplicates()
df_cleaned.reset_index(drop=True, inplace=True)
df_cleaned
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are other pandas functions that you can use for more &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html" rel="noopener noreferrer"&gt;advanced duplicate management&lt;/a&gt;, but this is enough to get you started with deduplicating your dataset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dealing with implausible values
&lt;/h3&gt;

&lt;p&gt;Implausible values can occur when data is entered incorrectly or something has gone wrong in the data-gathering process. For our &lt;a href="https://www.kaggle.com/datasets/prevek18/ames-housing-dataset" rel="noopener noreferrer"&gt;Ames Housing dataset&lt;/a&gt;, an implausible value might be a negative SalePrice, or a numerical value for Roof Style.&lt;/p&gt;

&lt;p&gt;Spotting implausible values in your dataset relies on a broad approach that includes looking at your &lt;a href="https://blog.jetbrains.com/pycharm/2024/10/data-exploration-with-pandas/#summary-statistics" rel="noopener noreferrer"&gt;summary statistics&lt;/a&gt;, checking the data validation rules defined by the collector for each column, noting any data points that fall outside of this validation, and using visualisations to spot patterns and anything that looks like it might be an anomaly.&lt;/p&gt;

&lt;p&gt;You will want to deal with implausible values as they can add noise and cause problems with your analysis. However, how you deal with them is somewhat open to interpretation. If you don’t have many implausible values relative to the size of your dataset, you may want to remove the records containing them. For example, if you’ve identified an implausible value in row 214 of your dataset, you can use the &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html" rel="noopener noreferrer"&gt;pandas drop function&lt;/a&gt; to remove that row from your dataset. &lt;/p&gt;

&lt;p&gt;Once again, we can get JetBrains AI to generate the code we need with a prompt like: &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Code that drops index 214 from&lt;/em&gt; &lt;em&gt;#df_cleaned&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Note that in &lt;a href="https://www.jetbrains.com/help/pycharm/jupyter-notebook-support.html" rel="noopener noreferrer"&gt;PyCharm’s Jupyter notebooks&lt;/a&gt;, I can prefix words with the # sign to indicate to JetBrains AI Assistant that I am providing additional context – in this case, that my DataFrame is called df_cleaned.&lt;/p&gt;

&lt;p&gt;The resulting code will remove that observation from your DataFrame, reset the index and display it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_cleaned = df_cleaned.drop(index=214)
df_cleaned.reset_index(drop=True, inplace=True)
df_cleaned
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Another popular strategy for dealing with implausible values is to impute them, meaning you replace the value with a different, plausible value based on a defined strategy. One of the most common strategies is to use the median value instead of the implausible value. Since the median is not affected by outliers, it is often chosen by data scientists for this purpose, but equally, the mean or the mode value of your data might be more appropriate in some situations. &lt;/p&gt;
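
&lt;p&gt;As a minimal sketch of median imputation (assuming the df_cleaned DataFrame from earlier; the negative-SalePrice condition is just an illustrative rule for flagging implausible values):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Replace implausible values (here: negative prices) with the column median
median_price = df_cleaned['SalePrice'].median()
df_cleaned.loc[df_cleaned['SalePrice'] &amp;lt; 0, 'SalePrice'] = median_price
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
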

&lt;p&gt;Alternatively, if you have domain knowledge about the dataset and how the data was gathered, you can replace the implausible value with one that is more meaningful. If you’re involved in or know of the data-gathering process, this option might be for you. &lt;/p&gt;

&lt;p&gt;How you choose to handle implausible values depends on their prevalence in your dataset, how the data was gathered and how you intend to define your population as well as other factors such as your domain knowledge. &lt;/p&gt;

&lt;h3&gt;
  
  
  Formatting data
&lt;/h3&gt;

&lt;p&gt;You can often spot formatting problems with your &lt;a href="https://blog.jetbrains.com/pycharm/2024/10/data-exploration-with-pandas/#summary-statistics" rel="noopener noreferrer"&gt;summary statistics&lt;/a&gt; or early &lt;a href="https://blog.jetbrains.com/pycharm/2024/10/data-exploration-with-pandas/#graphs" rel="noopener noreferrer"&gt;visualisations&lt;/a&gt; you perform to get an idea of the shape of your data. Some examples of inconsistent formatting are numerical values not all being defined to the same decimal place or variations in terms of spelling, such as “first” and “1st”. Incorrect data formatting can also have implications for the memory footprint of your data.&lt;/p&gt;

&lt;p&gt;Once you spot formatting issues in your dataset, you need to standardise the values. Depending on the issue you are facing, this normally involves defining your own standard and applying the change. Again, the pandas library has some useful functions here such as &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.round.html" rel="noopener noreferrer"&gt;round&lt;/a&gt;. If you wanted to round the SalePrice column to 2 decimal places, we could ask JetBrains AI for the code:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Code to round&lt;/em&gt; &lt;em&gt;#SalePrice&lt;/em&gt; &lt;em&gt;to two decimal places&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The resulting code will perform the rounding and then print out the first 10 rows so you can check it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_cleaned['SalePrice'] = df_cleaned['SalePrice].round(2)
df_cleaned.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As another example, you might have inconsistent spelling – for example, a HouseStyle column that has both “1Story” and “OneStory”, and you’re confident that they mean the same thing. You can use the following prompt to get code for that:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Code to change all instances of&lt;/em&gt; &lt;em&gt;#OneStory&lt;/em&gt; &lt;em&gt;to&lt;/em&gt; &lt;em&gt;#1Story&lt;/em&gt; &lt;em&gt;in&lt;/em&gt; &lt;em&gt;#HouseStyle&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The resulting code does exactly that, replacing all instances of OneStory with 1Story:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_cleaned[HouseStyle'] = df_cleaned['HouseStyle'].replace('OneStory', '1Story')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Addressing outliers
&lt;/h3&gt;

&lt;p&gt;Outliers are very common in datasets, but how you address them, if at all, is very context-dependent. One of the easiest ways to spot outliers is with a box plot, which we can create with the &lt;a href="https://seaborn.pydata.org/generated/seaborn.boxplot.html" rel="noopener noreferrer"&gt;seaborn&lt;/a&gt; and &lt;a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.figure.html" rel="noopener noreferrer"&gt;matplotlib&lt;/a&gt; libraries. I discussed box plots in my previous blog post on &lt;a href="https://blog.jetbrains.com/pycharm/2024/10/data-exploration-with-pandas/" rel="noopener noreferrer"&gt;exploring data with pandas&lt;/a&gt; if you need a quick refresher.&lt;/p&gt;

&lt;p&gt;We’ll look at SalePrice in our &lt;a href="https://www.kaggle.com/datasets/prevek18/ames-housing-dataset" rel="noopener noreferrer"&gt;Ames housing dataset&lt;/a&gt; for this box plot. Again, I’ll use JetBrains AI to generate the code for me with a prompt such as:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Code to create a box plot of&lt;/em&gt; &lt;em&gt;#SalePrice&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here’s the resulting code that we need to run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import seaborn as sns
import matplotlib.pyplot as plt

# Create a box plot for SalePrice
plt.figure(figsize=(10, 6))
sns.boxplot(x=df_cleaned['SalePrice'])
plt.title('Box Plot of SalePrice')
plt.xlabel('SalePrice')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The box plot tells us that we have a positive skew because the vertical median line inside the blue box is to the left of the centre. A positive skew tells us that we have more house prices at the cheaper end of the scale, which is not surprising. The box plot also tells us visually that we have lots of outliers on the right-hand side. That is a small number of houses that are much more expensive than the median price.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4nn0to1m0llpqeltnv34.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4nn0to1m0llpqeltnv34.png" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might accept these outliers as it’s fairly typical to expect a small number of houses with a larger price point than the majority. However, this is all dependent on the population you want to be able to generalise to and the conclusions you want to draw from your data. Putting clear boundaries around what is and what is not part of your population will allow you to make an informed decision about whether outliers in your data are going to be a problem. &lt;/p&gt;

&lt;p&gt;For example, if your population consists of people who will not be buying expensive mansions, then perhaps you can delete these outliers. If, on the other hand, your population demographics include those who might reasonably be expected to buy these expensive houses, you might want to keep them as they’re relevant to your population.&lt;/p&gt;

&lt;p&gt;I’ve talked about box plots here as a way to spot outliers, but other options, such as scatter plots and histograms, can quickly show you whether you have outliers in your data, so you can make an informed decision about whether you need to do anything about them.&lt;/p&gt;

&lt;p&gt;Addressing outliers usually falls into two categories – deleting them or using summary statistics less prone to outliers. In the first instance, we need to know exactly which rows they are. &lt;/p&gt;

&lt;p&gt;Until now, we’ve just been discussing how to identify them visually. There are different ways to determine which observations are and aren’t outliers. One common approach is to use a method called the &lt;em&gt;modified Z-score&lt;/em&gt;. Before we look at how and why it’s modified, the Z-score is defined as:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Z-score =&lt;/em&gt; (&lt;em&gt;data point value&lt;/em&gt; – &lt;em&gt;mean&lt;/em&gt;) / &lt;em&gt;standard deviation&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The reason we modify the Z-score for detecting outliers is that both the mean and the standard deviation are prone to outlier influence by virtue of how they are calculated. The modified Z-score is defined as:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Modified Z-score =&lt;/em&gt; (&lt;em&gt;data point value&lt;/em&gt; – &lt;em&gt;median&lt;/em&gt;) / &lt;em&gt;median absolute&lt;/em&gt; &lt;em&gt;deviation&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As we learned when we talked about &lt;a href="https://blog.jetbrains.com/pycharm/2024/10/data-exploration-with-pandas/#summary-statistics" rel="noopener noreferrer"&gt;summary statistics&lt;/a&gt;, the median is not affected by outliers. The &lt;em&gt;median absolute deviation&lt;/em&gt; is the &lt;em&gt;median&lt;/em&gt; value of the dataset’s absolute deviations from the &lt;em&gt;median&lt;/em&gt;. For example, if your data set contains these values:&lt;/p&gt;

&lt;p&gt;1, 2, 2, 2, &lt;strong&gt;3&lt;/strong&gt;, 3, 3, 5, 9&lt;/p&gt;

&lt;p&gt;Then your &lt;em&gt;median&lt;/em&gt; is 3 as the value in the middle of the dataset. The deviations from your median value of 3 are as follows in the order presented above:&lt;/p&gt;

&lt;p&gt;2, 1, 1, 1, 0, 0, 0, 2, 6&lt;/p&gt;

&lt;p&gt;When we sort these into ascending order we get:&lt;/p&gt;

&lt;p&gt;0, 0, 0, 1, &lt;strong&gt;1&lt;/strong&gt;, 1, 2, 2, 6&lt;/p&gt;

&lt;p&gt;Meaning the &lt;em&gt;median absolute deviation&lt;/em&gt; in this example would be 1.&lt;/p&gt;
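
&lt;p&gt;You can verify this small worked example with plain NumPy. Note that this is the raw median absolute deviation; the scipy helper used below returns a different value because scale='normal' applies a normal-consistency scaling factor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

values = np.array([1, 2, 2, 2, 3, 3, 3, 5, 9])
median = np.median(values)                  # 3.0
mad = np.median(np.abs(values - median))    # 1.0
print(median, mad)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
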

&lt;p&gt;How do we calculate the Modified Z-score for our data set for SalePrice? Let’s give JetBrains AI this prompt:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Given:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Modified Z-score = (data point value – median) / median absolute deviation&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Calculate the Modified Z-score for&lt;/em&gt; &lt;em&gt;#SalePrice&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That results in this code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
from scipy.stats import median_abs_deviation

# Assuming 'data' is your DataFrame and it contains the 'SalePrice' column

# Calculate the median for the SalePrice column
median_sale_price = data['SalePrice'].median()

# Calculate the Median Absolute Deviation (MAD) for the SalePrice column
mad_sale_price = median_abs_deviation(data['SalePrice'], scale='normal')

# Calculate the Modified Z-scores for the SalePrice column
data['Modified_Z_Score'] = (data['SalePrice'] - median_sale_price) / mad_sale_price

# Display the first few rows with the Modified Z-scores
print(data[['SalePrice', 'Modified_Z_Score']].head())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we install the required packages and run this code, we’re partway there, but we still need to decide what counts as an outlier for our SalePrice based on the modified Z-score. The general wisdom is that an outlier is anything &amp;gt;= 3 or &amp;lt;= -3; however, it’s worth noting that, like most statistical decisions, this threshold can and should be tailored to your dataset.&lt;/p&gt;

&lt;p&gt;Let’s pass another prompt over to JetBrains AI so we can further tailor our output:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Just list those that have a&lt;/em&gt; &lt;em&gt;#Modified_Z_Score&lt;/em&gt; &lt;em&gt;of 3 or above or -3 or below&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I’m going to take this snippet of code and use it to replace the relevant rows above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Filter the rows where the Modified Z-score is 3 or above, or -3 or below
outliers = data[(data['Modified_Z_Score'] &amp;gt;= 3) | (data['Modified_Z_Score'] &amp;lt;= -3)]

# Print all the filtered rows, showing their index and SalePrice
outliers = (outliers[['SalePrice', 'Modified_Z_Score']])
outliers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I have modified this code to save the outliers in a new DataFrame called outliers and print them out so I can view them. &lt;/p&gt;

&lt;p&gt;Our next step would be to remove these outliers from our DataFrame. Again we can use JetBrains AI to generate the code with a prompt like:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Create a new dataframe without the outliers&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data_without_outliers = data.drop(index=outliers.index)

# Display the new DataFrame without outliers
print(data_without_outliers)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our new DataFrame, data_without_outliers, excludes those values where the SalePrice variable is considered an outlier. &lt;/p&gt;

&lt;p&gt;We can update our box plot code to look at the new DataFrame. It still shows our positive skew as we’d expect, but the values considered as outliers have been removed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import seaborn as sns
import matplotlib.pyplot as plt

# Create a box plot for SalePrice
plt.figure(figsize=(10, 6))
sns.boxplot(x=data_without_outliers['SalePrice'])
plt.title('Box Plot of SalePrice')
plt.xlabel('SalePrice')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjspaojjyw1pj04johphx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjspaojjyw1pj04johphx.png" width="800" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before we finish up here, let’s look at what percentage of observations we removed from our DataFrame because we considered them to be outliers based on SalePrice.&lt;/p&gt;

&lt;p&gt;We can use a prompt such as:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Calculate the percentage of observations removed between #data and #data_without_outliers&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Calculate the number of observations in the original and filtered DataFrames
original_count = len(data)
filtered_count = len(data_without_outliers)

# Calculate the number of removed observations
removed_count = original_count - filtered_count

# Calculate the percentage of observations removed
percentage_removed = (removed_count / original_count) * 100

# Display the percentage
print(f"Percentage of observations removed: {percentage_removed:.2f}%")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PyCharm tells us that 5.67% of observations have been removed.&lt;/p&gt;

&lt;p&gt;As I mentioned earlier, if you are keeping your outliers, consider using summary values less prone to being affected by outliers such as the &lt;em&gt;median&lt;/em&gt; and &lt;em&gt;interquartile range&lt;/em&gt;. You might consider using these measurements to form your conclusions when you’re working with datasets that you know contain outliers that you’ve not removed because they are relevant to the population you’ve defined and the conclusions you want to draw.&lt;/p&gt;
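
&lt;p&gt;As a quick illustration, here’s a minimal sketch (assuming the same &lt;code&gt;data&lt;/code&gt; DataFrame we’ve been working with) of how you could compute these robust summary values with pandas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The median is far less sensitive to extreme values than the mean
median_price = data['SalePrice'].median()

# The interquartile range captures the spread of the middle 50% of observations
q1 = data['SalePrice'].quantile(0.25)
q3 = data['SalePrice'].quantile(0.75)
iqr = q3 - q1

print(f"Median: {median_price}, IQR: {iqr}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
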

&lt;h3&gt;
  
  
  Missing values
&lt;/h3&gt;

&lt;p&gt;The fastest way to spot missing values in your dataset is with your summary statistics. As a reminder, in your DataFrame, click &lt;em&gt;Show Column Statistics&lt;/em&gt; on the right-hand side and then select &lt;em&gt;Compact&lt;/em&gt;. Missing values in the columns are shown in red, as you can see here for Lot Frontage in our &lt;a href="https://www.kaggle.com/datasets/prevek18/ames-housing-dataset" rel="noopener noreferrer"&gt;Ames housing dataset&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXdeSNdJvl9sk5Z8QXEJCr5rhDMI5GTGmaRdqvkIufNS8QZNQi-1QwDF1LQgTS_e9vm0B-pSKa5o2aZnNZEmPiAzvoaOjvRxmOICDRzuM_0iWumPGH_UWyR07Q8xTrzIUnYvL7-j%3Fkey%3DncaAk2neSPZb4YRTlVBIqdzw" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXdeSNdJvl9sk5Z8QXEJCr5rhDMI5GTGmaRdqvkIufNS8QZNQi-1QwDF1LQgTS_e9vm0B-pSKa5o2aZnNZEmPiAzvoaOjvRxmOICDRzuM_0iWumPGH_UWyR07Q8xTrzIUnYvL7-j%3Fkey%3DncaAk2neSPZb4YRTlVBIqdzw" width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;
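
&lt;p&gt;If you prefer a programmatic check to complement the Column Statistics view, a short pandas snippet (a sketch, assuming the same &lt;code&gt;data&lt;/code&gt; DataFrame) lists the missing counts per column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Count missing values per column, largest first
missing_counts = data.isnull().sum().sort_values(ascending=False)

# Show only the columns that actually contain missing values
print(missing_counts[missing_counts &gt; 0])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
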

&lt;p&gt;There are three kinds of missingness that we have to consider for our data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing completely at random&lt;/li&gt;
&lt;li&gt;Missing at random&lt;/li&gt;
&lt;li&gt;Missing not at random&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Missing completely at random
&lt;/h3&gt;

&lt;p&gt;Missing completely at random means the data has gone missing entirely by chance and the fact that it is missing has no relationship to other variables in the dataset. This can happen when someone forgets to answer a survey question, for example. &lt;/p&gt;

&lt;p&gt;Data that is missing completely at random is rare, but it’s also among the easiest to deal with. If you have a relatively small number of observations missing completely at random, the most common approach is to delete those observations because doing so shouldn’t affect the integrity of your dataset and, thus, the conclusions you hope to draw. &lt;/p&gt;
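
&lt;p&gt;As a minimal sketch (assuming you’ve verified the missingness really is completely at random, and using Lot Frontage purely as an illustration), deleting those observations in pandas is a one-liner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Drop rows with missing values in the affected column only
# (the subset is illustrative; use the columns you judged to be MCAR)
data_complete = data.dropna(subset=['Lot Frontage'])

print(f"Rows before: {len(data)}, rows after: {len(data_complete)}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
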

&lt;h3&gt;
  
  
  Missing at random
&lt;/h3&gt;

&lt;p&gt;Missing at random has a pattern to it, but we’re able to explain that pattern through other variables we’ve measured. For example, someone didn’t answer a survey question because of how the data was collected.&lt;/p&gt;

&lt;p&gt;Consider in our &lt;a href="https://www.kaggle.com/datasets/prevek18/ames-housing-dataset" rel="noopener noreferrer"&gt;Ames housing dataset&lt;/a&gt; again, perhaps the Lot Frontage variable is missing more frequently for houses that are sold by certain real estate agencies. In that case, this missingness could be due to inconsistent data entry practices by some agencies. If true, the fact that the Lot Frontage data is missing is related to how the agency that sold the property gathered the data, which is an observed characteristic, not the Lot Frontage itself. &lt;/p&gt;

&lt;p&gt;When you have data missing at random, you will want to understand why that data is missing, which often involves digging into how the data was gathered. Once you understand why the data is missing, you can choose what to do. One of the more common approaches to deal with missing at random is to impute the values. We’ve already touched on this for implausible values, but it’s a valid strategy for missingness too. There are various options you could choose from based on your defined population and the conclusions you want to draw, including using correlated variables such as house size, year built, and sale price in this example. If you understand the pattern behind the missing data, you can often use contextual information to impute the values, which ensures that relationships between data in your dataset are preserved.  &lt;/p&gt;
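
&lt;p&gt;Here’s one possible sketch of contextual imputation. Grouping by Neighborhood is an assumption made for illustration, on the reasoning that lot dimensions tend to be similar within a neighborhood; you’d pick the grouping that fits your own explanation of the missingness:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Fill missing Lot Frontage values with the median of comparable houses
data['Lot Frontage'] = data.groupby('Neighborhood')['Lot Frontage'].transform(
    lambda s: s.fillna(s.median())
)

# Any groups that were entirely missing fall back to the overall median
data['Lot Frontage'] = data['Lot Frontage'].fillna(data['Lot Frontage'].median())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
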

&lt;h3&gt;
  
  
  Missing not at random
&lt;/h3&gt;

&lt;p&gt;Finally, missing not at random is when the likelihood of data being missing is related to unobserved data; in other words, the missingness depends on information we haven’t measured, often the missing value itself. &lt;/p&gt;

&lt;p&gt;One last time, let’s return to our &lt;a href="https://www.kaggle.com/datasets/prevek18/ames-housing-dataset" rel="noopener noreferrer"&gt;Ames housing dataset&lt;/a&gt; and the fact that we have missing data in Lot Frontage. One scenario for data missing not at random is when sellers deliberately choose not to report Lot Frontage if they consider it &lt;em&gt;small&lt;/em&gt; and thus reporting it might reduce the sale price of their house. If the likelihood of Lot Frontage data being missing depends on the size of the frontage itself (which is unobserved), smaller lot frontages are less likely to be reported, meaning the missingness is directly related to the missing value.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visualising missingness
&lt;/h3&gt;

&lt;p&gt;Whenever data is missing, you need to establish whether there’s a pattern. If you have a pattern, then you have a problem that you’ll likely have to address before you can generalize your data. &lt;/p&gt;

&lt;p&gt;One of the easiest ways to look for patterns is with heat map visualisations. Before we get into the code, let’s exclude variables with no missingness. We can prompt JetBrains AI for this code:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Code to create a new dataframe that contains only columns with missingness&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here’s our code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Identify columns with any missing values
columns_with_missing = data.columns[data.isnull().any()]

# Create a new DataFrame with only columns that have missing values
data_with_missingness = data[columns_with_missing]

# Display the new DataFrame
print(data_with_missingness)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before you run this code, change the final line so we can benefit from PyCharm’s nice DataFrame layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data_with_missingness
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now it’s time to create a heatmap; again, we can give JetBrains AI a prompt such as:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Create a heatmap of&lt;/em&gt; &lt;em&gt;#data_with_missingness&lt;/em&gt; &lt;em&gt;that is transposed&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here’s the resulting code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import seaborn as sns
import matplotlib.pyplot as plt

# Transpose the data_with_missingness DataFrame
transposed_data = data_with_missingness.T

# Create a heatmap to visualize missingness
plt.figure(figsize=(12, 8))
sns.heatmap(transposed_data.isnull(), cbar=False, yticklabels=True)
plt.title('Missing Data Heatmap (Transposed)')
plt.xlabel('Instances')
plt.ylabel('Features')
plt.tight_layout()
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that I removed &lt;code&gt;cmap='viridis'&lt;/code&gt; from the heatmap arguments, as I find that color map hard to view. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9gmbua2mpn6hsdpxsej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9gmbua2mpn6hsdpxsej.png" width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This heatmap suggests that there might be a pattern of missingness because the same variables are missing across multiple rows. In one group, we can see that Bsmt Qual, Bsmt Cond, Bsmt Exposure, BsmtFin Type 1, and BsmtFin Type 2 are all missing from the same observations. In another group, we can see that Garage Type, Garage Yr Blt, Garage Finish, Garage Qual, and Garage Cond are all missing from the same observations.&lt;/p&gt;

&lt;p&gt;These variables all relate to basements and garages, but there are other variables related to garages or basements that are not missing. One possible explanation is that different questions were asked about garages and basements in different real estate agencies when the data was gathered, and not all of them recorded as much detail as is in the dataset. Such scenarios are common with data you don’t collect yourself, and you can explore how the data was collected if you need to learn more about missingness in your dataset.&lt;/p&gt;
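
&lt;p&gt;If you want to quantify what the heatmap suggests, a short check (a sketch, assuming the column names match your copy of the dataset) counts how often the basement columns are missing together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Columns that appear to be missing together in the heatmap
bsmt_cols = ['Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure',
             'BsmtFin Type 1', 'BsmtFin Type 2']

# Rows where every basement column is missing at once
all_missing = data[bsmt_cols].isnull().all(axis=1).sum()

# Rows where at least one basement column is missing
any_missing = data[bsmt_cols].isnull().any(axis=1).sum()

print(f"{all_missing} of the {any_missing} rows with any basement missingness are missing all five columns")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
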

&lt;h2&gt;
  
  
Best practices for data cleaning
&lt;/h2&gt;

&lt;p&gt;As I’ve mentioned, defining your population is high on the list of best practices for data cleaning. Know what you want to achieve and how you want to generalise your data before you start cleaning it. &lt;/p&gt;

&lt;p&gt;You also need to ensure that all your methods are reproducible, because reproducibility goes hand in hand with clean data. Workflows that aren’t reproducible can cause significant problems further down the line. For this reason, I recommend keeping your Jupyter notebooks tidy and sequential while taking advantage of the Markdown features to document your decision-making at every step, especially with cleaning. &lt;/p&gt;

&lt;p&gt;When cleaning data, you should work incrementally, modifying the DataFrame rather than the original CSV file or database, and ensuring you do it all in reproducible, well-documented code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Data cleaning is a big topic, and it can have many challenges. The larger the dataset is, the more challenging the cleaning process is. You will need to keep your population in mind to generalise your conclusions more widely while balancing tradeoffs between removing and imputing missing values and understanding why that data is missing in the first place. &lt;/p&gt;

&lt;p&gt;You can think of yourself as the voice of the data. You know the journey that the data has been on and how you have maintained data integrity at all stages. You are the best person to document that journey and share it with others. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://jb.gg/m8p92h" rel="noopener noreferrer"&gt;Try PyCharm Professional for free&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>datacleaning</category>
    </item>
    <item>
      <title>7 Reasons You Should Use dbt Core in PyCharm</title>
      <dc:creator>Evgenia Verbina</dc:creator>
      <pubDate>Mon, 16 Dec 2024 12:58:55 +0000</pubDate>
      <link>https://dev.to/pycharm/7-reasons-you-should-use-dbt-core-in-pycharm-1j5a</link>
      <guid>https://dev.to/pycharm/7-reasons-you-should-use-dbt-core-in-pycharm-1j5a</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jshm3yed7xkaw5jiqc6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jshm3yed7xkaw5jiqc6.png" alt="7 Reasons You Should Use dbt Core in PyCharm" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;dbt Core is a modern data transformation framework. It doesn’t extract or load data and is only responsible for the T in the ELT (extract-load-transform) process. dbt connects to your data warehouse and helps you prepare your data so it can later be used to answer business questions.&lt;/p&gt;

&lt;p&gt;In this blog post, we’ll talk about the top benefits of dbt and the advantages of using it in &lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;PyCharm Professional&lt;/a&gt;. To make the most of these features, you should be familiar with the framework. If you know SQL well, you’ll likely find it easy to use, and if you are a total novice in the field, you can use the &lt;a href="https://learn.getdbt.com/catalog" rel="noopener noreferrer"&gt;dbt portal&lt;/a&gt; to get acquainted with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why you should use dbt
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Modularity and code reusability&lt;/strong&gt; – Transformations can be saved into modular, reusable models. For instance, in this example the model &lt;em&gt;int_count_customer.sql&lt;/em&gt; has a reference to &lt;em&gt;stg_day_customer.sql&lt;/em&gt; and reuses its code.&lt;/li&gt;
&lt;/ul&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Versioning&lt;/strong&gt; – dbt projects can be stored in version control systems like Git or GitHub. This allows you to track changes, collaborate with other team members, and maintain a record of all transformations.&lt;/li&gt;
&lt;/ul&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Testing&lt;/strong&gt; – dbt allows you to write tests for your data models easily and check whether the data has any duplicates or null values. Additionally, you can even create specific rules to test against, and you can perform tests on both the model and the project levels.&lt;/li&gt;
&lt;/ul&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Documentation&lt;/strong&gt; – dbt auto-generates documentation for data models, ensuring that team members and stakeholders all understand the data lineage and model definitions in the same way.&lt;/li&gt;
&lt;/ul&gt;



&lt;p&gt;To summarize, dbt brings best practices in engineering to the field of data analysis, allowing you to produce higher-quality results while providing you with a straightforward and intuitive workflow.&lt;/p&gt;

&lt;p&gt;These benefits are just the tip of the iceberg when it comes to what the tool can do.&lt;/p&gt;

&lt;h2&gt;
  
  
  How PyCharm streamlines your dbt workflow
&lt;/h2&gt;

&lt;p&gt;Having established the benefits of dbt, we can now turn to the 7 key reasons to use it in PyCharm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;User-friendly onboarding&lt;/strong&gt; – PyCharm streamlines the initial setup. As demonstrated in this video, setting up a project and configuring the necessary settings is straightforward.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified workspace for databases and dbt&lt;/strong&gt; – PyCharm’s integrated database plugin, powered by &lt;a href="https://www.jetbrains.com/datagrip/" rel="noopener noreferrer"&gt;JetBrains DataGrip&lt;/a&gt;, makes handling SQL databases significantly easier. Since it’s compatible with all databases that dbt works with, you don’t have to worry about juggling multiple tools. You can focus on data modeling and instantly view outcomes, all in one place. Covering even a small number of the plugin’s features would take hours, but luckily we have a nice set of webinars dedicated to PyCharm’s functionality for databases: &lt;a href="https://www.youtube.com/watch?v=_FlpiNno088&amp;amp;t=1301s" rel="noopener noreferrer"&gt;Visual SQL Development with PyCharm&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git and dbt integration&lt;/strong&gt; – In one interface, you can easily clone the repo, track any changes, manage branches, resolve conflicts, and collaborate with teammates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autocompletion for your .yml and Jinja-templated SQL files&lt;/strong&gt; – People love using PyCharm because of its smart autocompletion, which it, of course, offers for dbt as well.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local history&lt;/strong&gt; – This feature lets you undo recent changes if they cause problems. You can also compare different versions to see what was changed and check whether updates were made correctly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Assistant&lt;/strong&gt; – AI Assistant is really helpful, especially if you’re just starting with dbt Core. It is context-aware, and in addition to having it answer your questions in the AI chat, you can have it generate code and fix problems for you, streamlining your work with data models. It also saves you from worrying about what to write in commit messages by composing them for you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project navigation&lt;/strong&gt; – PyCharm excels in project navigation, offering features like fast search functionality and the &lt;em&gt;Go to Declaration&lt;/em&gt; feature, both of which allow you to navigate through your dbt models effortlessly.&lt;/li&gt;
&lt;/ol&gt;



&lt;p&gt;That’s just a glimpse of the benefits PyCharm already offers for dbt, and our support is still in its early stages. We invite you to test it out and share your insights. Whether you have suggestions for features or want to let us know about areas for improvement, we’re eager to hear from you. &lt;/p&gt;

&lt;p&gt;Get started with PyCharm by using the promo code &lt;strong&gt;dbt-PyCharm&lt;/strong&gt; to get a 3-month free trial.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/store/redeem/" rel="noopener noreferrer"&gt;Redeem your code&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Want to learn how to use dbt in PyCharm? Head to the &lt;a href="https://www.jetbrains.com/help/pycharm/create-and-configure-dbt-project.html#profiles" rel="noopener noreferrer"&gt;documentation page&lt;/a&gt; to learn more about the IDE’s dbt support.&lt;/p&gt;

&lt;p&gt;Eager to learn more about dbt in general? Take a look &lt;a href="https://blog.jetbrains.com/big-data-tools/2022/01/25/how-i-started-out-with-dbt/" rel="noopener noreferrer"&gt;at this post on the experience of using dbt&lt;/a&gt; and &lt;a href="https://blog.jetbrains.com/big-data-tools/2022/02/22/dbt-deeper-concepts-materialization/" rel="noopener noreferrer"&gt;this analysis of deeper dbt concepts&lt;/a&gt; by Pavel Finkelshteyn.&lt;/p&gt;

</description>
      <category>dbt</category>
    </item>
    <item>
      <title>Introduction to Sentiment Analysis in Python</title>
      <dc:creator>Evgenia Verbina</dc:creator>
      <pubDate>Thu, 12 Dec 2024 10:01:40 +0000</pubDate>
      <link>https://dev.to/pycharm/introduction-to-sentiment-analysis-in-python-4omo</link>
      <guid>https://dev.to/pycharm/introduction-to-sentiment-analysis-in-python-4omo</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkfzfjszn63sfyg04qzh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkfzfjszn63sfyg04qzh.png" alt="Introduction to Sentiment Analysis in Python" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sentiment analysis is one of the most popular ways to analyze text. It allows us to see at a glance how people are feeling across a wide range of areas and has useful applications in fields like customer service, market and product research, and competitive analysis.&lt;/p&gt;

&lt;p&gt;Like any area of natural language processing (NLP), sentiment analysis can get complex. Luckily, &lt;a href="https://www.jetbrains.com/guide/python/" rel="noopener noreferrer"&gt;Python&lt;/a&gt; has excellent packages and tools that make this branch of NLP much more approachable.&lt;/p&gt;

&lt;p&gt;In this blog post, we’ll explore some of the most popular packages for analyzing sentiment in Python, how they work, and how you can train your own sentiment analysis model using state-of-the-art techniques. We’ll also look at some &lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;PyCharm&lt;/a&gt; features that make working with these packages easier and faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is sentiment analysis?
&lt;/h2&gt;

&lt;p&gt;Sentiment analysis is the process of analyzing a piece of text to determine its emotional tone. As you can probably see from this definition, sentiment analysis is a very broad field that incorporates a wide variety of methods within the field of natural language processing.&lt;/p&gt;

&lt;p&gt;There are many ways to define “emotional tone”. The most commonly used methods determine the &lt;em&gt;valence&lt;/em&gt; or &lt;em&gt;polarity&lt;/em&gt; of a piece of text – that is, how positive or negative the sentiment expressed in a text is. Emotional tone is also usually treated as a text classification problem, where text is categorized as either positive or negative.&lt;/p&gt;

&lt;p&gt;Take the following &lt;a href="https://www.amazon.com/AmazonBasics-12-Cup-Coffee-Reusable-Stainless/dp/B084ZH769P/ref=sr_1_1_ffob_sspa?th=1" rel="noopener noreferrer"&gt;Amazon product review&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Few26dxosudxs0s1jwgil.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Few26dxosudxs0s1jwgil.png" width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is obviously not a happy customer, and sentiment analysis techniques would classify this review as negative.&lt;/p&gt;

&lt;p&gt;Contrast this with a much more satisfied buyer:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23r5j6rymgx8sjijssua.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23r5j6rymgx8sjijssua.png" width="800" height="208"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This time, sentiment analysis techniques would classify this as positive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Different types of sentiment analysis
&lt;/h3&gt;

&lt;p&gt;There are multiple ways of extracting emotional information from text. Let’s review a few of the most important ones.&lt;/p&gt;

&lt;h4&gt;
  
  
  Ways of defining sentiment
&lt;/h4&gt;

&lt;p&gt;First, sentiment analysis approaches have several different ways of defining sentiment or emotion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Binary&lt;/strong&gt; : This is where the valence of a document is divided into two categories, either &lt;em&gt;positive&lt;/em&gt; or &lt;em&gt;negative&lt;/em&gt;, as with the &lt;a href="https://huggingface.co/datasets/stanfordnlp/sst2" rel="noopener noreferrer"&gt;SST-2 dataset&lt;/a&gt;. Related to this are classifications of valence that add a &lt;em&gt;neutral&lt;/em&gt; class (where a text expresses no sentiment about a topic) or even a &lt;em&gt;conflict&lt;/em&gt; class (where a text expresses both positive and negative sentiment about a topic).&lt;/p&gt;

&lt;p&gt;Some sentiment analyzers use a related measure to classify texts into &lt;em&gt;subjective&lt;/em&gt; or &lt;em&gt;objective&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-grained&lt;/strong&gt; : This term describes several different ways of approaching sentiment analysis, but here it refers to breaking down positive and negative valence into a Likert scale. A well-known example of this is the &lt;a href="https://huggingface.co/datasets/SetFit/sst5" rel="noopener noreferrer"&gt;SST-5 dataset&lt;/a&gt;, which uses a five-point Likert scale with the classes &lt;em&gt;very positive&lt;/em&gt;, &lt;em&gt;positive&lt;/em&gt;, &lt;em&gt;neutral&lt;/em&gt;, &lt;em&gt;negative&lt;/em&gt;, and &lt;em&gt;very negative&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Continuous&lt;/strong&gt; : The valence of a piece of text can also be measured continuously, with scores indicating how positive or negative the sentiment of the writer was. For example, the &lt;a href="https://github.com/cjhutto/vaderSentiment" rel="noopener noreferrer"&gt;VADER sentiment analyzer&lt;/a&gt; gives a piece of text a score between –1 (&lt;em&gt;strongly negative&lt;/em&gt;) and 1 (&lt;em&gt;strongly positive&lt;/em&gt;), with scores close to 0 indicating a neutral sentiment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Emotion-based&lt;/strong&gt; : Also known as emotion detection or emotion identification, this approach attempts to detect the specific emotion being expressed in a piece of text. You can approach this in two ways. Categorical emotion detection tries to classify the sentiment expressed by a text into one of a handful of discrete emotions, usually based on the &lt;a href="https://www.tandfonline.com/doi/abs/10.1080/02699939208411068" rel="noopener noreferrer"&gt;Ekman&lt;/a&gt; model, which includes &lt;em&gt;anger&lt;/em&gt;, &lt;em&gt;disgust&lt;/em&gt;, &lt;em&gt;fear&lt;/em&gt;, &lt;em&gt;joy&lt;/em&gt;, &lt;em&gt;sadness&lt;/em&gt;, and &lt;em&gt;surprise&lt;/em&gt;. A &lt;a href="https://huggingface.co/j-hartmann/emotion-english-distilroberta-base#appendix-%F0%9F%93%9A" rel="noopener noreferrer"&gt;number of datasets&lt;/a&gt; exist for this type of emotion detection. Dimensional emotional detection is less commonly used in sentiment analysis and instead tries to measure &lt;a href="https://link.springer.com/article/10.1007/s12144-014-9219-4" rel="noopener noreferrer"&gt;three emotional aspects&lt;/a&gt; of a piece of text: &lt;em&gt;polarity&lt;/em&gt;, &lt;em&gt;arousal&lt;/em&gt; (how exciting a feeling is), and &lt;em&gt;dominance&lt;/em&gt; (how restricted the emotional expression is).&lt;/p&gt;

&lt;h4&gt;
  
  
  Levels of analysis
&lt;/h4&gt;

&lt;p&gt;We can also consider different levels at which we can analyze a piece of text. To understand this better, let’s consider another review of the coffee maker:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frkc3f92dqx4hv04punkm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frkc3f92dqx4hv04punkm.png" width="800" height="208"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document-level&lt;/strong&gt; : This is the most basic level of analysis, where one sentiment for an entire piece of text will be returned. Document-level analysis might be fine for very short pieces of text, such as Tweets, but can give misleading answers if there is any mixed sentiment. For example, if we based the sentiment analysis for this review on the whole document, it would likely be classified as neutral or conflict, as we have two opposing sentiments about the same coffee machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sentence-level&lt;/strong&gt; : This is where the sentiment for each sentence is predicted separately. For the coffee machine review, sentence-level analysis would tell us that the reviewer felt positively about some parts of the product but negatively about others. However, this analysis doesn’t tell us what things the reviewer liked and disliked about the coffee machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aspect-based&lt;/strong&gt; : This type of sentiment analysis dives deeper into a piece of text and tries to understand the sentiment of users about specific aspects. For our review of the coffee maker, the reviewer mentioned two aspects: &lt;em&gt;appearance&lt;/em&gt; and &lt;em&gt;noise&lt;/em&gt;. By extracting these aspects, we have more information about what the user specifically did and did not like. They had a positive sentiment about the machine’s appearance but a negative sentiment about the noise it made.&lt;/p&gt;

&lt;h4&gt;
  
  
  Coupling sentiment analysis with other NLP techniques
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Intent-based&lt;/strong&gt; : In this final type of sentiment analysis, the text is classified in two ways: in terms of the sentiment being expressed, and the topic of the text. For example, if a telecommunication company receives a ticket complaining about how often their service goes down, they could classify the text intent or topic as &lt;em&gt;service reliability&lt;/em&gt; and the sentiment as &lt;em&gt;negative&lt;/em&gt;. As with aspect-based sentiment analysis, this analysis gives the company much more information than knowing whether their customers are generally happy or unhappy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Applications of sentiment analysis
&lt;/h3&gt;

&lt;p&gt;By now, you can probably already think of some potential use cases for sentiment analysis. Basically, it can be used anywhere that you could get text feedback or opinions about a topic. Organizations or individuals can use sentiment analysis to do social media monitoring and see how people feel about a brand, government entity, or topic.&lt;/p&gt;

&lt;p&gt;Customer feedback analysis can be used to find out the sentiments expressed in feedback or tickets. Product reviews can be analyzed to see how satisfied or dissatisfied people are with a company’s products. Finally, sentiment analysis can be a key component in market research and competitive analysis, where how people feel about emerging trends, features, and competitors can help guide a company’s strategies.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does sentiment analysis work?
&lt;/h2&gt;

&lt;p&gt;At a general level, sentiment analysis operates by linking words (or, in more sophisticated models, the overall tone of a text) to an emotion. The most common approaches to sentiment analysis fall into one of the three methods below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lexicon-based approaches
&lt;/h3&gt;

&lt;p&gt;These methods rely on a lexicon that includes sentiment scores for a range of words. They combine these scores using a set of rules to get the overall sentiment for a piece of text. These methods tend to be very fast and also have the advantage of yielding more fine-grained continuous sentiment scores. However, as the lexicons need to be handcrafted, they can be time-consuming and expensive to produce.&lt;/p&gt;

&lt;h3&gt;
  
  
  Machine learning models
&lt;/h3&gt;

&lt;p&gt;These methods train a machine learning model, most commonly a Naive Bayes classifier, on a dataset that contains text and their sentiment labels, such as movie reviews. In this model, texts are generally classified as positive, negative, and sometimes neutral. These models also tend to be very fast, but as they usually don’t take into account the relationship between words in the input, they may struggle with more complex texts that involve qualifiers and negations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Large language models
&lt;/h3&gt;

&lt;p&gt;These methods rely on fine-tuning a pre-trained transformer-based large language model on the same datasets used to train the machine learning classifiers mentioned earlier. These sophisticated models are capable of modeling complex relationships between words in a piece of text but tend to be slower than the other two methods.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sentiment analysis in Python
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/help/pycharm/python.html" rel="noopener noreferrer"&gt;Python&lt;/a&gt; has a rich ecosystem of packages for &lt;a href="https://blog.jetbrains.com/pycharm/tag/nlp/" rel="noopener noreferrer"&gt;NLP&lt;/a&gt;, meaning you are spoiled for choice when doing sentiment analysis in this language.&lt;/p&gt;

&lt;p&gt;Let’s review some of the most popular &lt;a href="https://www.jetbrains.com/guide/python/tutorials/getting-started-pycharm/installing-and-managing-python-packages/" rel="noopener noreferrer"&gt;Python packages&lt;/a&gt; for sentiment analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  The best Python libraries for sentiment analysis
&lt;/h3&gt;

&lt;h4&gt;
  
  
  VADER
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.nltk.org/api/nltk.sentiment.vader.html" rel="noopener noreferrer"&gt;VADER (Valence Aware Dictionary and Sentiment Reasoner)&lt;/a&gt; is a popular lexicon-based sentiment analyzer. Built into the powerful &lt;a href="https://www.nltk.org/index.html" rel="noopener noreferrer"&gt;NLTK package&lt;/a&gt;, this analyzer returns four sentiment scores: the degree to which the text was &lt;em&gt;positive&lt;/em&gt;, &lt;em&gt;neutral&lt;/em&gt;, or &lt;em&gt;negative&lt;/em&gt;, as well as a &lt;em&gt;compound&lt;/em&gt; sentiment score. The positive, neutral, and negative scores range from 0 to 1 and indicate the proportion of the text that was positive, neutral, or negative. The compound score ranges from –1 (extremely negative) to 1 (extremely positive) and indicates the overall sentiment valence of the text.&lt;/p&gt;

&lt;p&gt;Let’s look at a basic example of how it works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We first need to download the VADER lexicon.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nltk.download('vader_lexicon')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can then instantiate the VADER &lt;code&gt;SentimentIntensityAnalyzer()&lt;/code&gt; and extract the sentiment scores using the &lt;code&gt;polarity_scores()&lt;/code&gt; method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;analyzer = SentimentIntensityAnalyzer()

sentence = "I love PyCharm! It's my favorite Python IDE."
sentiment_scores = analyzer.polarity_scores(sentence)
print(sentiment_scores)

{'neg': 0.0, 'neu': 0.572, 'pos': 0.428, 'compound': 0.6696}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that VADER has given this piece of text an overall sentiment score of 0.67 and classified its contents as 43% positive, 57% neutral, and 0% negative.&lt;/p&gt;

&lt;p&gt;VADER works by looking up the sentiment scores for each word in its lexicon and combining them using a nuanced set of rules. For example, qualifiers can increase or decrease the intensity of a word’s sentiment, so a qualifier such as “a bit” before a word would decrease the sentiment intensity, but “extremely” would amplify it.&lt;/p&gt;
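
&lt;p&gt;You can see this for yourself with a small sketch that reuses the analyzer from above and scores the same adjective with different qualifiers (the exact scores may vary slightly between VADER versions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The same sentiment-bearing word, dampened and amplified by qualifiers
for text in ["The IDE is good.",
             "The IDE is a bit good.",
             "The IDE is extremely good."]:
    print(text, analyzer.polarity_scores(text)["compound"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
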

&lt;p&gt;VADER’s lexicon includes abbreviations such as “smh” (shaking my head) and emojis, making it particularly suitable for social media text. VADER’s main limitation is that it doesn’t work for languages other than English, but you can use projects such as &lt;a href="https://github.com/brunneis/vader-multi" rel="noopener noreferrer"&gt;&lt;code&gt;vader-multi&lt;/code&gt;&lt;/a&gt; as an alternative. I wrote about &lt;a href="https://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html" rel="noopener noreferrer"&gt;how VADER works&lt;/a&gt; if you’re interested in taking a deeper dive into this package.&lt;/p&gt;

&lt;h4&gt;
  
  
  NLTK
&lt;/h4&gt;

&lt;p&gt;Additionally, you can use NLTK to train your own machine learning-based sentiment classifier, using classifiers from &lt;code&gt;scikit-learn&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;There are many ways of processing the text to feed into these models, but the simplest way is doing it based on the words that are present in the text, a type of text modeling called the bag-of-words approach. The most straightforward type of bag-of-words modeling is &lt;em&gt;binary vectorisation&lt;/em&gt;, where each word is treated as a feature, with the value of that feature being either 0 or 1 (whether the word is absent or present in the text, respectively).&lt;/p&gt;
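
&lt;p&gt;Here’s a minimal sketch of binary vectorisation using scikit-learn’s &lt;code&gt;CountVectorizer&lt;/code&gt; (one common way to implement this; the toy sentences are just for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love this IDE", "I do not love this editor"]

# binary=True records presence/absence (1/0) instead of word counts
vectorizer = CountVectorizer(binary=True)
features = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())
print(features.toarray())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
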

&lt;p&gt;If you’re new to working with text data and NLP, and you’d like more information about how text can be converted into inputs for machine learning models, I gave a &lt;a href="https://www.youtube.com/live/WYmyZBg2VFI?feature=shared&amp;amp;t=261" rel="noopener noreferrer"&gt;talk on this topic&lt;/a&gt; that provides a gentle introduction.&lt;/p&gt;

&lt;p&gt;You can see an example in the &lt;a href="https://www.nltk.org/howto/sentiment.html#sentiment-analysis" rel="noopener noreferrer"&gt;NLTK documentation&lt;/a&gt;, where a Naive Bayes classifier is trained to predict whether a piece of text is subjective or objective. In this example, they add an additional negation qualifier to some of the terms based on rules which indicate whether that word or character is likely involved in negating a sentiment expressed elsewhere in the text. Real Python also has a &lt;a href="https://realpython.com/python-nltk-sentiment-analysis/#customizing-nltks-sentiment-analysis" rel="noopener noreferrer"&gt;sentiment analysis tutorial&lt;/a&gt; on training your own classifiers using NLTK, if you want to learn more about this topic.&lt;/p&gt;
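
&lt;p&gt;To give a rough idea of the overall shape of such a pipeline, here’s a sketch with toy data (purely illustrative; the tutorials linked above use proper labeled corpora such as movie reviews):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data; in practice you'd use a labeled corpus
train_texts = ["great product, works well", "terrible, broke after a day",
               "absolutely love it", "waste of money"]
train_labels = ["pos", "neg", "pos", "neg"]

# Bag-of-words features feeding a Naive Bayes classifier
model = make_pipeline(CountVectorizer(binary=True), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["this works great"]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
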

&lt;h4&gt;
  
  
  Pattern and TextBlob
&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://github.com/clips/pattern" rel="noopener noreferrer"&gt;Pattern&lt;/a&gt; package provides another lexicon-based approach to &lt;a href="https://github.com/clips/pattern/blob/d25511f9ca7ed9356b801d8663b8b5168464e68f/pattern/text/%20__init__.py#L2316" rel="noopener noreferrer"&gt;analyzing sentiment&lt;/a&gt;. It uses the &lt;a href="https://github.com/aesuli/SentiWordNet" rel="noopener noreferrer"&gt;SentiWordNet&lt;/a&gt; lexicon, where each synonym group (&lt;em&gt;synset&lt;/em&gt;) from &lt;a href="https://github.com/clips/pattern" rel="noopener noreferrer"&gt;WordNet&lt;/a&gt; is assigned a score for positivity, negativity, and objectivity. The positive and negative scores for each word are combined using a series of rules to give a final polarity score. Similarly, the objectivity score for each word is combined to give a final subjectivity score.&lt;/p&gt;

&lt;p&gt;As WordNet contains part-of-speech information, the rules can take into account whether adjectives or adverbs preceding a word modify its sentiment. The ruleset also considers negations, exclamation marks, and emojis, and even includes some rules to handle idioms and sarcasm.&lt;/p&gt;

&lt;p&gt;However, Pattern as a standalone library is only compatible with Python 3.6. As such, the most common way to use Pattern is through &lt;a href="https://textblob.readthedocs.io/en/dev/" rel="noopener noreferrer"&gt;TextBlob&lt;/a&gt;. By default, the &lt;a href="https://github.com/sloria/TextBlob/blob/e19171014bfba910d1e33527f46d514837da234e/src/textblob/en/sentiments.py#L15" rel="noopener noreferrer"&gt;TextBlob sentiment analyzer&lt;/a&gt; uses its own implementation of the Pattern library to generate sentiment scores.&lt;/p&gt;

&lt;p&gt;Let’s have a look at this in action:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from textblob import TextBlob
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that we run the TextBlob method over our text, and then extract the sentiment using the &lt;code&gt;sentiment&lt;/code&gt; attribute.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pattern_blob = TextBlob("I love PyCharm! It's my favorite Python IDE.")
sentiment = pattern_blob.sentiment

print(f"Polarity: {sentiment.polarity}")
print(f"Subjectivity: {sentiment.subjectivity}")

Polarity: 0.625
Subjectivity: 0.6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For our example sentence, Pattern in TextBlob gives us a polarity score of 0.625 (relatively close to the score given by VADER), and a subjectivity score of 0.6.&lt;/p&gt;

&lt;p&gt;But there’s also a second way of getting sentiment scores in TextBlob. This package also includes a &lt;a href="https://github.com/sloria/TextBlob/blob/e19171014bfba910d1e33527f46d514837da234e/src/textblob/en/sentiments.py#L53" rel="noopener noreferrer"&gt;pre-trained Naive Bayes classifier&lt;/a&gt;, which will label a piece of text as either positive or negative, and give you the probability of the text being either positive or negative.&lt;/p&gt;

&lt;p&gt;To use this method, we first need to download both the &lt;code&gt;punkt&lt;/code&gt; module and the &lt;code&gt;movie-reviews&lt;/code&gt; dataset from NLTK, which is used to train this model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import nltk
nltk.download('movie_reviews')
nltk.download('punkt')

from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once again, we need to run &lt;code&gt;TextBlob&lt;/code&gt; over our text, but this time we add the argument &lt;code&gt;analyzer=NaiveBayesAnalyzer()&lt;/code&gt;. Then, as before, we use the sentiment attribute to extract the sentiment scores.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nb_blob = TextBlob("I love PyCharm! It's my favorite Python IDE.", analyzer=NaiveBayesAnalyzer())
sentiment = nb_blob.sentiment
print(sentiment)

Sentiment(classification='pos', p_pos=0.5851800554016624, p_neg=0.4148199445983381)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time we end up with a label of &lt;code&gt;pos&lt;/code&gt; (positive), with the model predicting that the text has a 59% probability of being positive and a 41% probability of being negative.&lt;/p&gt;

&lt;h4&gt;
  
  
  spaCy
&lt;/h4&gt;

&lt;p&gt;Another option is to use &lt;a href="https://spacy.io/" rel="noopener noreferrer"&gt;spaCy&lt;/a&gt; for sentiment analysis. spaCy is another popular package for NLP in Python, and has a wide range of options for processing text.&lt;/p&gt;

&lt;p&gt;The first method is to use the &lt;a href="https://spacy.io/universe/project/spacy-textblob" rel="noopener noreferrer"&gt;spacytextblob&lt;/a&gt; plugin to run the TextBlob sentiment analyzer as part of your spaCy pipeline. Before you can do this, you’ll first need to install both &lt;code&gt;spacy&lt;/code&gt; and &lt;code&gt;spacytextblob&lt;/code&gt; and download the appropriate language model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import spacy
import spacy.cli
from spacytextblob.spacytextblob import SpacyTextBlob

spacy.cli.download("en_core_web_sm")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We then load in this language model and add &lt;code&gt;spacytextblob&lt;/code&gt; to our text processing pipeline. TextBlob can be used through spaCy’s &lt;code&gt;pipe&lt;/code&gt; method, which means we can include it as part of a more complex text processing pipeline, including preprocessing steps such as part-of-speech tagging, lemmatization, and named-entity recognition. Preprocessing can normalize and enrich text, helping downstream models to get the most information out of the text inputs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For now, we’ll just analyze our sample sentence without preprocessing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;doc = nlp("I love PyCharm! It's my favorite Python IDE.")

print('Polarity: ', doc._.polarity)
print('Subjectivity: ', doc._.subjectivity)

Polarity: 0.625
Subjectivity: 0.6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We get the same results as when using TextBlob above.&lt;/p&gt;

&lt;p&gt;A second way we can do sentiment analysis in spaCy is by training our own model using the &lt;a href="https://spacy.io/api/textcategorizer" rel="noopener noreferrer"&gt;TextCategorizer class&lt;/a&gt;. This allows you to train a range of &lt;a href="https://spacy.io/api/architectures" rel="noopener noreferrer"&gt;spaCy-provided model architectures&lt;/a&gt; using a sentiment analysis training set. Again, as this can be used as part of the spaCy pipeline, you have many options for preprocessing your text before training your model.&lt;/p&gt;

&lt;p&gt;Finally, you can use large language models to do sentiment analysis through &lt;a href="https://spacy.io/api/large-language-models#sentiment" rel="noopener noreferrer"&gt;spacy-llm&lt;/a&gt;. This allows you to prompt a variety of proprietary large language models (LLMs) from OpenAI, Anthropic, Cohere, and Google to perform sentiment analysis over your texts.&lt;/p&gt;

&lt;p&gt;This approach works slightly differently from the other methods we’ve discussed. Instead of training the model, we can use generalist models like GPT-4 to predict the sentiment of a text. You can do this either through zero-shot learning (where a prompt but no examples are passed to the model) or few-shot learning (where a prompt and a number of examples are passed to the model).&lt;/p&gt;

&lt;h4&gt;
  
  
  Transformers
&lt;/h4&gt;

&lt;p&gt;The final Python package for sentiment analysis we’ll discuss is &lt;a href="https://huggingface.co/docs/transformers/en/index" rel="noopener noreferrer"&gt;Transformers&lt;/a&gt; from &lt;a href="https://huggingface.co/" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Hugging Face hosts all major open-source LLMs for free use (among other models, including computer vision and audio models), and provides a platform for training, deploying, and sharing these models. Its Transformers package offers a wide range of functionality (including sentiment analysis) for working with the LLMs hosted by Hugging Face.&lt;/p&gt;
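
&lt;p&gt;The quickest entry point is the &lt;code&gt;pipeline&lt;/code&gt; API. Here’s a minimal sketch; the first call downloads a default English sentiment model, and you can pass a &lt;code&gt;model&lt;/code&gt; argument to choose a specific one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import pipeline

# Downloads a default sentiment model on first use
classifier = pipeline("sentiment-analysis")

print(classifier("I love PyCharm! It's my favorite Python IDE."))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
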

&lt;h2&gt;
  
  
  Understanding the results of sentiment analyzers
&lt;/h2&gt;

&lt;p&gt;Now that we’ve covered all of the ways you can do sentiment analysis in Python, you might be wondering, “How can I apply this to my own data?”&lt;/p&gt;

&lt;p&gt;To understand this, let’s use PyCharm to compare two packages, VADER and TextBlob. Their multiple sentiment scores offer us a few different perspectives on our data. We’ll use these packages to analyze the Amazon reviews dataset.&lt;/p&gt;

&lt;p&gt;PyCharm Professional is a powerful Python IDE for &lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;data science&lt;/a&gt; that supports advanced Python &lt;a href="https://www.jetbrains.com/help/pycharm/auto-completing-code.html" rel="noopener noreferrer"&gt;code completion&lt;/a&gt;, inspections and &lt;a href="https://www.jetbrains.com/help/pycharm/debugging-code.html" rel="noopener noreferrer"&gt;debugging&lt;/a&gt;, as well as rich support for &lt;a href="https://www.jetbrains.com/pycharm/integrations/#databases" rel="noopener noreferrer"&gt;databases&lt;/a&gt;, &lt;a href="https://www.jetbrains.com/help/pycharm/running-jupyter-notebook-cells.html" rel="noopener noreferrer"&gt;Jupyter&lt;/a&gt;, &lt;a href="https://www.jetbrains.com/help/pycharm/using-git-integration.html" rel="noopener noreferrer"&gt;Git&lt;/a&gt;, &lt;a href="https://www.jetbrains.com/help/pycharm/conda-support-creating-conda-virtual-environment.html" rel="noopener noreferrer"&gt;Conda&lt;/a&gt;, and more – all out of the box. In addition to these, you’ll also get incredibly useful features like our DataFrame &lt;em&gt;Column Statistics&lt;/em&gt; and &lt;em&gt;Chart View&lt;/em&gt;, as well as Hugging Face &lt;a href="https://www.jetbrains.com/pycharm/integrations/" rel="noopener noreferrer"&gt;integrations&lt;/a&gt; that make working with LLMs much quicker and easier. In this blog post, we’ll explore PyCharm’s advanced features for working with DataFrames, which will give us a quick overview of how our sentiment scores are distributed between the two packages.&lt;/p&gt;

&lt;p&gt;If you’re now ready to get started on your own sentiment analysis project, you can activate your free three-month subscription to PyCharm. Click on the link below, and enter this promo code: &lt;strong&gt;PCSA24&lt;/strong&gt;. You’ll then receive an activation code via email.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/store/redeem/" rel="noopener noreferrer"&gt;Activate your 3-month subscription&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first thing we need to do is load in the data. We can use the &lt;code&gt;load_dataset()&lt;/code&gt; method from the Datasets package to download this &lt;a href="https://huggingface.co/datasets/fancyzhx/amazon_polarity" rel="noopener noreferrer"&gt;data from the Hugging Face Hub&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datasets import load_dataset
amazon = load_dataset("fancyzhx/amazon_polarity")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can hover over the name of the dataset to see the Hugging Face dataset card right inside PyCharm, providing you with a convenient way to get information about Hugging Face assets without leaving the IDE.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyby69h2tnrgf7yj2dzu7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyby69h2tnrgf7yj2dzu7.png" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see the contents of this dataset here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;amazon

DatasetDict({
    train: Dataset({
        features: ['label', 'title', 'content'],
        num_rows: 3600000
    })
    test: Dataset({
        features: ['label', 'title', 'content'],
        num_rows: 400000
    })
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The training dataset has 3.6 million observations, and the test dataset contains 400,000. We’ll be working with the training dataset in this tutorial.&lt;/p&gt;

&lt;p&gt;We’ll now load in the VADER &lt;code&gt;SentimentIntensityAnalyzer&lt;/code&gt; and the TextBlob method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk

nltk.download("vader_lexicon")

analyzer = SentimentIntensityAnalyzer()

from textblob import TextBlob
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The training dataset has too many observations to comfortably visualize, so we’ll take a random sample of 1,000 reviews to represent the general sentiment of all the reviewers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from random import sample
sample_reviews = sample(amazon["train"]["content"], 1000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s now get the VADER and TextBlob scores for each of these reviews. We’ll loop over each review, run it through both sentiment analyzers, and append the scores to dedicated lists.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vader_neg = []
vader_neu = []
vader_pos = []
vader_compound = []
textblob_polarity = []
textblob_subjectivity = []

for review in sample_reviews:
    vader_sent = analyzer.polarity_scores(review)
    vader_neg.append(vader_sent["neg"])
    vader_neu.append(vader_sent["neu"])
    vader_pos.append(vader_sent["pos"])
    vader_compound.append(vader_sent["compound"])

    textblob_sent = TextBlob(review).sentiment
    textblob_polarity.append(textblob_sent.polarity)
    textblob_subjectivity.append(textblob_sent.subjectivity)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll then pop each of these lists into a &lt;a href="https://pandas.pydata.org/" rel="noopener noreferrer"&gt;pandas&lt;/a&gt; DataFrame as a separate column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

sent_scores = pd.DataFrame({
   "vader_neg": vader_neg,
   "vader_neu": vader_neu,
   "vader_pos": vader_pos,
   "vader_compound": vader_compound,
   "textblob_polarity": textblob_polarity,
   "textblob_subjectivity": textblob_subjectivity
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we’re ready to start exploring our results.&lt;/p&gt;

&lt;p&gt;Typically, this would be the point where we’d start creating a bunch of code for exploratory data analysis. This might be done using pandas’ &lt;code&gt;describe&lt;/code&gt; method to get summary statistics over our columns, and writing &lt;a href="https://matplotlib.org/" rel="noopener noreferrer"&gt;Matplotlib&lt;/a&gt; or &lt;a href="https://seaborn.pydata.org/" rel="noopener noreferrer"&gt;seaborn&lt;/a&gt; code to visualize our results. However, PyCharm has some features to speed this whole thing up.&lt;/p&gt;

&lt;p&gt;Let’s go ahead and print our DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sent_scores
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see a button in the top right-hand corner, called &lt;em&gt;Show Column Statistics&lt;/em&gt;. Clicking this gives us two different options: &lt;em&gt;Compact&lt;/em&gt; and &lt;em&gt;Detailed&lt;/em&gt;. Let’s select &lt;em&gt;Detailed&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fix6uxrw3z7566xlu5kbd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fix6uxrw3z7566xlu5kbd.png" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we have summary statistics provided as part of our column headers! Looking at these, we can see the VADER compound score has a mean of 0.4 (median = 0.6), while the TextBlob polarity score provides a mean of 0.2 (median = 0.2).&lt;/p&gt;

&lt;p&gt;This result indicates that, on average, VADER tends to estimate the same set of reviews more positively than TextBlob does. It also shows that for both sentiment analyzers, we likely have more positive reviews than negative ones – we can dive into this in more detail by checking some visualizations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xs3z48wc70kwfebj75s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xs3z48wc70kwfebj75s.png" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another PyCharm feature we can use is the DataFrame &lt;em&gt;Chart View&lt;/em&gt;. The button for this function is in the top left-hand corner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhuafh3crpe867fzve74d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhuafh3crpe867fzve74d.png" width="764" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we click on the button, we switch over to the chart editor. From here, we can create no-code visualizations straight from our DataFrame.&lt;/p&gt;

&lt;p&gt;Let’s start with VADER’s compound score. To start creating this chart, go to &lt;em&gt;Show Series Settings&lt;/em&gt; in the top right-hand corner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wxiz52wu4xtp9y5e1ol.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wxiz52wu4xtp9y5e1ol.png" width="634" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Remove the default values for &lt;em&gt;X Axis&lt;/em&gt; and &lt;em&gt;Y Axis&lt;/em&gt;. Replace the &lt;em&gt;X Axis&lt;/em&gt; value with &lt;code&gt;vader_compound&lt;/code&gt;, and the &lt;em&gt;Y Axis&lt;/em&gt; value with &lt;code&gt;vader_compound&lt;/code&gt;. Click on the arrow next to the variable name in the &lt;em&gt;Y Axis&lt;/em&gt; field, and select &lt;code&gt;count&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Finally, select &lt;em&gt;Histogram&lt;/em&gt; from the chart icons, just under &lt;em&gt;Series Settings&lt;/em&gt;. We likely have a bimodal distribution for the VADER compound score, with a slight peak around –0.8 and a much larger one around 0.9. These peaks likely represent the split between negative and positive reviews, and they also show that there are far more positive reviews than negative ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbq2auev8ho359du80ope.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbq2auev8ho359du80ope.png" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s repeat the same exercise and create a histogram to see the distribution of the TextBlob polarity scores.&lt;/p&gt;

&lt;p&gt;In contrast, TextBlob tends to rate most reviews as neutral, with very few being strongly positive or negative. To understand why the two sentiment analyzers disagree, let’s look at one review that VADER rated as strongly positive and another it rated as strongly negative, both of which TextBlob rated as neutral.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ufmipq7rbfhwkhjh2sx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ufmipq7rbfhwkhjh2sx.png" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’ll get the index of the first review that VADER rated as positive but TextBlob rated as neutral:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sent_scores[(sent_scores["vader_compound"] &amp;gt;= 0.8) &amp;amp; (sent_scores["textblob_polarity"].between(-0.1, 0.1))].index[0]

42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we get the index of the first review that VADER rated as negative but TextBlob rated as neutral:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sent_scores[(sent_scores["vader_compound"] &amp;lt;= -0.8) &amp;amp; (sent_scores["textblob_polarity"].between(-0.1, 0.1))].index[0]

0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s first retrieve the positive review:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_reviews[42]

"I love carpet sweepers for a fast clean up and a way to conserve energy. The Ewbank Multi-Sweep is a solid, well built appliance. However, if you have pets, you will find that it takes more time cleaning the sweeper than it does to actually sweep the room. The Ewbank does pick up pet hair most effectively but emptying it is a bit awkward. You need to take a rag to clean out both dirt trays and then you need a small tooth comb to pull the hair out of the brushes and the wheels. To do a proper cleaning takes quite a bit of time. My old Bissell is easier to clean when it comes to pet hair and it does a great job. If you do not have pets, I would recommend this product because it is definitely well made and for small cleanups, it would suffice. For those who complain about appliances being made of plastic, unfortunately, these days, that's the norm. It's not great and plastic definitely does not hold up but, sadly, product quality is no longer a priority in business."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This review seems mixed, but is overall somewhat positive.&lt;/p&gt;

&lt;p&gt;Now, let’s look at the negative review:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_reviews[0]

'The only redeeming feature of this Cuisinart 4-cup coffee maker is the sleek black and silver design. After that, it rapidly goes downhill. It is frustratingly difficult to pour water from the carafe into the chamber unless it\'s done extremely slow and with accurate positioning. Even then, water still tends to dribble out and create a mess. The lid, itself, is VERY poorly designed with it\'s molded, round "grip" to supposedly remove the lid from the carafe. The only way I can remove it is to insert a sharp pointed object into one of the front pouring holes and pry it off! I\'ve also occasionally had a problem with the water not filtering down through the grounds, creating a coffee ground lake in the upper chamber and a mess below. I think the designer should go back to the drawing-board for this one.'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This review is unambiguously negative. From comparing the two, VADER appears more accurate, but it does tend to overly prioritize positive terms in a piece of text.&lt;/p&gt;

&lt;p&gt;The final thing we can consider is how subjective versus objective each review is. We’ll do this by creating a histogram of TextBlob’s subjectivity score.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhit0o8zrfq1x8nremc6t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhit0o8zrfq1x8nremc6t.png" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Interestingly, there is a good distribution of subjectivity in the reviews, with most reviews being a mixture of subjective and objective writing. A small number of reviews are also very subjective (close to 1) or very objective (close to 0).&lt;/p&gt;

&lt;p&gt;Between them, these scores give us a nice way of segmenting the data. If you need to know the objective things that people did and did not like about the products, you could look at the reviews with a low subjectivity score and VADER compound scores close to 1 and –1, respectively.&lt;/p&gt;

&lt;p&gt;In contrast, if you want to know people’s emotional reactions to the products, you could take the reviews with a high subjectivity score and either high or low VADER compound scores.&lt;/p&gt;
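
&lt;p&gt;As a rough sketch, that kind of segmentation might look like the following in pandas (the cut-off values are arbitrary illustrations rather than recommendations):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Arbitrary illustrative thresholds: tune these for your own data
objective_reviews = sent_scores[sent_scores["textblob_subjectivity"] &amp;lt; 0.3]
objective_likes = objective_reviews[objective_reviews["vader_compound"] &amp;gt;= 0.8]
objective_dislikes = objective_reviews[objective_reviews["vader_compound"] &amp;lt;= -0.8]

# Strongly emotional reactions, positive or negative
emotional_reviews = sent_scores[
    (sent_scores["textblob_subjectivity"] &amp;gt; 0.7)
    &amp;amp; (sent_scores["vader_compound"].abs() &amp;gt;= 0.8)
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
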

&lt;h2&gt;
  
  
  Things to consider
&lt;/h2&gt;

&lt;p&gt;As with any problem in natural language processing, there are a number of things to watch out for when doing sentiment analysis.&lt;/p&gt;

&lt;p&gt;One of the biggest considerations is the language of the texts you’re trying to analyze. Many of the lexicon-based methods only work for a limited number of languages, so if you’re working with languages not supported by these lexicons, you may need to take another approach, such as using a fine-tuned LLM or training your own model(s).&lt;/p&gt;

&lt;p&gt;As texts increase in complexity, it can also be difficult for lexicon-based analyzers and bag-of-words-based models to correctly detect sentiment. Sarcasm or more subtle context indicators can be hard for simpler models to detect, and these models may not be able to accurately classify the sentiment of such texts. LLMs may be able to handle more complex texts, but you would need to experiment with different models.&lt;/p&gt;

&lt;p&gt;Finally, sentiment analysis runs into the same issues as any other machine learning problem. Your models will only be as good as the training data you use, and if you cannot get high-quality training and testing datasets suited to your problem domain, you will not be able to correctly predict the sentiment of your target audience.&lt;/p&gt;

&lt;p&gt;You should also make sure that your targets are appropriate for your business problem. It might seem attractive to build a model to know whether your products make your customers “sad”, “angry”, or “disgusted”, but if this doesn’t help you make a decision about how to improve your products, then it isn’t solving your problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;In this blog post, we dove deeply into the fascinating area of Python sentiment analysis and showed how this complex field is made more approachable by a range of powerful packages.&lt;/p&gt;

&lt;p&gt;We covered the potential applications of sentiment analysis, different ways of assessing sentiment, and the main methods of extracting sentiment from a piece of text. We also saw some helpful features in PyCharm that make working with models and interpreting their results simpler and faster.&lt;/p&gt;

&lt;p&gt;While the field of natural language processing is currently focused intently on large language models, the older techniques of using lexicon-based analyzers or traditional machine learning models, like Naive Bayes classifiers, still have their place in sentiment analysis. These techniques shine when analyzing simpler texts, or when fast predictions or ease of deployment are priorities. LLMs are best suited to more complex or nuanced texts.&lt;/p&gt;

&lt;p&gt;Now that you’ve grasped the basics, you can learn how to do &lt;a href="https://blog.jetbrains.com/pycharm/2024/12/how-to-do-sentiment-analysis-with-large-language-models/" rel="noopener noreferrer"&gt;sentiment analysis with LLMs&lt;/a&gt; in our tutorial. The step-by-step guide helps you discover how to select the right model for your task, use it for sentiment analysis, and even fine-tune it yourself.&lt;/p&gt;

&lt;p&gt;If you’d like to continue learning about natural language processing or machine learning more broadly after finishing this blog post, here are some resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.jetbrains.com/pycharm/2024/12/how-to-do-sentiment-analysis-with-large-language-models/" rel="noopener noreferrer"&gt;Learn how to do sentiment analysis with large language models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.jetbrains.com/pycharm/2022/06/start-studying-machine-learning-with-pycharm/" rel="noopener noreferrer"&gt;Start studying machine learning with PyCharm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lp.jetbrains.com/research/ml_methods/" rel="noopener noreferrer"&gt;Explore machine learning methods in software engineering&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Get started with sentiment analysis in PyCharm today
&lt;/h2&gt;

&lt;p&gt;If you’re now ready to get started on your own sentiment analysis project, you can activate your free three-month subscription to PyCharm. Click on the link below, and enter this promo code: &lt;strong&gt;PCSA24&lt;/strong&gt;. You’ll then receive an activation code via email.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/store/redeem/" rel="noopener noreferrer"&gt;Activate your 3-month subscription&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>llms</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>How to Do Sentiment Analysis With Large Language Models</title>
      <dc:creator>Evgenia Verbina</dc:creator>
      <pubDate>Thu, 05 Dec 2024 10:49:14 +0000</pubDate>
      <link>https://dev.to/pycharm/how-to-do-sentiment-analysis-with-large-language-models-5ca4</link>
      <guid>https://dev.to/pycharm/how-to-do-sentiment-analysis-with-large-language-models-5ca4</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bxgouxodiwfsxuhzqq4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bxgouxodiwfsxuhzqq4.png" alt="How to Do Sentiment Analysis with Large Language Models" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sentiment analysis is a powerful tool for understanding emotions in text. While there are many ways to approach sentiment analysis, including more traditional lexicon-based and machine learning approaches, today we’ll be focusing on one of the most cutting-edge ways of working with text – large language models (LLMs). We’ll explain how you can use these powerful models to predict the sentiment expressed in a text.&lt;/p&gt;

&lt;p&gt;As a practical tutorial, this post will introduce you to the types of LLMs most suited for sentiment analysis tasks and then show you how to choose the right model for your specific task.&lt;/p&gt;

&lt;p&gt;We’ll cover using models that other people have fine-tuned for sentiment analysis and how to fine-tune one yourself. We’ll also look at some of the powerful tools and resources available that can help you work with these models easily, while demystifying what can feel like an overly complex and overwhelming topic.&lt;/p&gt;

&lt;p&gt;To get the most out of this blog post, we’d recommend you have some experience training machine learning or deep learning models and be confident using Python. &lt;a href="https://blog.jetbrains.com/pycharm/2024/12/introduction-to-sentiment-analysis-in-python/" rel="noopener noreferrer"&gt;Our introductory blog post on sentiment analysis with Python&lt;/a&gt; is a great place to begin. That said, you don’t necessarily need to have a background in large language models to enjoy it.&lt;/p&gt;

&lt;p&gt;Let’s get started!&lt;/p&gt;

&lt;h2&gt;
  
  
  What are large language models?
&lt;/h2&gt;

&lt;p&gt;Large language models are some of the latest and most powerful &lt;a href="https://www.jetbrains.com/help/pycharm/scientific-tools.html" rel="noopener noreferrer"&gt;tools&lt;/a&gt; for solving natural language problems. In brief, they are generalist language models that can complete a range of natural language tasks, from named entity recognition to question answering. LLMs are based on the transformer architecture, a type of neural network that uses a mechanism called attention to represent complex and nuanced relationships between words in a piece of text. This design allows LLMs to accurately represent the information being conveyed in a piece of text.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://huggingface.co/learn/nlp-course/chapter1/4?fw=pt" rel="noopener noreferrer"&gt;full transformer model architecture&lt;/a&gt; consists of two blocks. Encoder blocks are designed to receive text inputs and build a representation of them, creating a feature set based on the text corpus over which the model is trained. Decoder blocks take the features generated by the encoder and other inputs and attempt to generate a sequence based on these.&lt;/p&gt;

&lt;p&gt;Transformer models can be divided up based on whether they contain encoder blocks, decoder blocks, or both.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/learn/nlp-course/chapter1/5?fw=pt" rel="noopener noreferrer"&gt;Encoder-only models&lt;/a&gt; tend to be good at tasks requiring a detailed understanding of the input to do downstream tasks, like text classification and named entity recognition.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/learn/nlp-course/chapter1/6?fw=pt" rel="noopener noreferrer"&gt;Decoder-only models&lt;/a&gt; are best for tasks such as text generation.&lt;/li&gt;
&lt;li&gt;Encoder-decoder, or &lt;a href="https://huggingface.co/learn/nlp-course/chapter1/7?fw=pt" rel="noopener noreferrer"&gt;sequence-to-sequence models&lt;/a&gt; are mainly used for tasks that require the model to evaluate an input and generate a different output, such as translation. In fact, translation was the original task that transformer models were designed for!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This &lt;a href="https://huggingface.co/learn/nlp-course/chapter1/9?fw=pt" rel="noopener noreferrer"&gt;Hugging Face table&lt;/a&gt; (also featured below), which I took from their course on &lt;a href="https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt" rel="noopener noreferrer"&gt;natural language processing&lt;/a&gt;, gives an overview of what each model tends to be strongest at.&lt;/p&gt;

&lt;p&gt;After finishing this blog post and discovering what other natural language tasks you can perform with the Transformers library, I recommend the course if you’d like to learn more about LLMs. It strikes an excellent balance between accessibility and technical depth.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Model type&lt;/th&gt;&lt;th&gt;Examples&lt;/th&gt;&lt;th&gt;Tasks&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Encoder-only&lt;/td&gt;&lt;td&gt;ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa&lt;/td&gt;&lt;td&gt;Sentence classification, named entity recognition, extractive question answering&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Decoder-only&lt;/td&gt;&lt;td&gt;CTRL, GPT, GPT-2, Transformer XL&lt;/td&gt;&lt;td&gt;Text generation&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Encoder-decoder&lt;/td&gt;&lt;td&gt;BART, T5, Marian, mBART&lt;/td&gt;&lt;td&gt;Summarization, translation, generative question answering&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Sentiment analysis is usually treated as a text or sentence classification problem with LLMs, meaning that encoder-only models such as RoBERTa, BERT, and ELECTRA are most often used for this task. However, there are some exceptions. For example, the top scoring model for aspect-based sentiment analysis, InstructABSA, is based on a fine-tuned version of T5, an encoder-decoder model.&lt;/p&gt;
&lt;h2&gt;
  
  
  Using large language models for sentiment analysis
&lt;/h2&gt;

&lt;p&gt;With all of the background out of the way, we can now get started with using LLMs to do sentiment analysis.&lt;/p&gt;
&lt;h3&gt;
  
  
  Install PyCharm to get started with sentiment analysis
&lt;/h3&gt;

&lt;p&gt;We’ll use &lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;PyCharm Professional&lt;/a&gt; for this demo, but you can follow along with any other IDE that supports Python development.&lt;/p&gt;

&lt;p&gt;PyCharm Professional is a powerful Python IDE for &lt;a href="https://www.jetbrains.com/pycharm/data-science/" rel="noopener noreferrer"&gt;data science&lt;/a&gt;. It supports advanced Python &lt;a href="https://www.jetbrains.com/help/pycharm/auto-completing-code.html" rel="noopener noreferrer"&gt;code completion&lt;/a&gt;, inspections and &lt;a href="https://www.jetbrains.com/help/pycharm/debugging-code.html" rel="noopener noreferrer"&gt;debugging&lt;/a&gt;, and rich support for &lt;a href="https://www.jetbrains.com/pycharm/integrations/#databases" rel="noopener noreferrer"&gt;databases&lt;/a&gt;, &lt;a href="https://www.jetbrains.com/help/pycharm/running-jupyter-notebook-cells.html" rel="noopener noreferrer"&gt;Jupyter&lt;/a&gt;, &lt;a href="https://www.jetbrains.com/help/pycharm/using-git-integration.html" rel="noopener noreferrer"&gt;Git&lt;/a&gt;, &lt;a href="https://www.jetbrains.com/help/pycharm/conda-support-creating-conda-virtual-environment.html" rel="noopener noreferrer"&gt;Conda&lt;/a&gt;, and more right out of the box. You can try out great features such as our DataFrame &lt;em&gt;Column Statistics&lt;/em&gt; and &lt;em&gt;Chart View&lt;/em&gt;, as well as &lt;a href="https://blog.jetbrains.com/pycharm/2024/11/hugging-face-integration/" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt; integrations, which make working with LLMs much simpler and faster.&lt;/p&gt;

&lt;p&gt;If you’d like to follow along with this tutorial, you can activate your free three-month subscription to PyCharm using this special promo code: &lt;strong&gt;PCSA24&lt;/strong&gt;. Click on the link below, and enter the code. You’ll then receive an activation code through your email.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/store/redeem/" rel="noopener noreferrer"&gt;Activate your free three-month subscription&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Import the required libraries
&lt;/h3&gt;

&lt;p&gt;There are two parts to this tutorial: using an LLM that someone else has fine-tuned for sentiment analysis, and fine-tuning a model ourselves.&lt;/p&gt;

&lt;p&gt;In order to run both parts of this tutorial, we need to import the following packages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transformers: As described, this will allow us to use fine-tuned LLMs for sentiment analysis and fine-tune our own models.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pytorch.org/" rel="noopener noreferrer"&gt;PyTorch&lt;/a&gt;, &lt;a href="https://www.tensorflow.org/" rel="noopener noreferrer"&gt;Tensorflow&lt;/a&gt;, or &lt;a href="https://flax.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;Flax&lt;/a&gt;: Transformers acts as a high-level interface for deep learning frameworks, reusing their functionality for building, training, and running neural networks. In order to actually work with LLMs using the Transformers package, you will need to install your choice of PyTorch, Tensorflow, or Flax. PyTorch supports the &lt;a href="https://jax.readthedocs.io/en/latest/quickstart.html" rel="noopener noreferrer"&gt;largest number of models&lt;/a&gt; of the three frameworks, so that’s the one we’ll use in this tutorial.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/docs/datasets/en/index" rel="noopener noreferrer"&gt;Datasets&lt;/a&gt;: This is another package from Hugging Face that allows you to easily work with the datasets hosted on Hugging Face Hub. We’ll need this package to get a dataset to fine-tune an LLM for sentiment analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In order to fine-tune our own model, we also need to import these additional packages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://numpy.org/" rel="noopener noreferrer"&gt;NumPy&lt;/a&gt;: NumPy allows us to work with arrays. We’ll need this to do some post-processing on the predictions generated by our LLM.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scikit-learn.org/stable/" rel="noopener noreferrer"&gt;scikit-learn&lt;/a&gt;: This package contains a huge range of functionality for machine learning. We’ll use it to evaluate the performance of our model.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/docs/evaluate/en/index" rel="noopener noreferrer"&gt;Evaluate&lt;/a&gt;: This is another package from Hugging Face. Evaluate adds a convenient interface for measuring the performance of models. It will give us an alternative way of measuring our model’s performance.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/docs/accelerate/en/index" rel="noopener noreferrer"&gt;Accelerate&lt;/a&gt;: This final package from Hugging Face, Accelerate, takes care of distributed model training.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can easily find and install these in PyCharm. Make sure you’re using a Python 3.7 or higher interpreter. For this demo, we’ll be using Python 3.11.7.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flapmpaqiqoeyhoa5a538.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flapmpaqiqoeyhoa5a538.png" width="800" height="197"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Pick the right model
&lt;/h3&gt;

&lt;p&gt;The next step is picking the right model. Before we get into that, we need to cover some terminology.&lt;/p&gt;

&lt;p&gt;LLMs are made up of two components: an &lt;em&gt;architecture&lt;/em&gt; and a &lt;em&gt;checkpoint&lt;/em&gt;. The architecture is like the blueprint of the model, and describes what will be contained in each layer and each operation that takes place within the model.&lt;/p&gt;

&lt;p&gt;The checkpoint refers to the weights that will be used within each layer. Each of the pretrained models will use an architecture like T5 or GPT, and obtain the specific weights (the model checkpoint) by training the model over a huge corpus of text data.&lt;/p&gt;

&lt;p&gt;Fine-tuning will adjust the weights in the checkpoint by retraining the last layer(s) on a dataset specialized for a certain task or domain. To make predictions (called &lt;em&gt;inference&lt;/em&gt;), an architecture loads in a checkpoint and uses it to process text inputs; together, the two are called a model.&lt;/p&gt;
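
&lt;p&gt;To make the distinction concrete, here’s a small illustrative sketch using the Transformers library, with BERT as an example checkpoint: the config captures the architecture, while &lt;code&gt;from_pretrained()&lt;/code&gt; also downloads the checkpoint weights.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import AutoConfig, AutoModel

# The config is the blueprint: layer count, hidden size, and so on
config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.num_hidden_layers, config.hidden_size)

# from_pretrained() loads the architecture plus the trained checkpoint weights
model = AutoModel.from_pretrained("bert-base-uncased")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
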

&lt;p&gt;If you’ve ever looked at the &lt;a href="https://huggingface.co/models" rel="noopener noreferrer"&gt;models available on Hugging Face&lt;/a&gt;, you might have been overwhelmed by the sheer number of them (even when we narrow them down to encoder-only models).&lt;/p&gt;

&lt;p&gt;So, how do you know which one to use for sentiment analysis?&lt;/p&gt;

&lt;p&gt;One useful place to start is the &lt;a href="https://paperswithcode.com/task/sentiment-analysis" rel="noopener noreferrer"&gt;sentiment analysis page&lt;/a&gt; on Papers With Code. This page includes a very helpful overview of this task and a Benchmarks table that includes the top-performing models for each sentiment analysis benchmarking dataset. From this page, we can see that some of the commonly appearing models are those based on BERT and RoBERTa architectures.&lt;/p&gt;

&lt;p&gt;While we may not be able to access these exact model checkpoints on Hugging Face (as not all of them will be uploaded there), it can give us a guide for what sorts of models might perform well at this task. Papers With Code also has similar pages for a range of other natural language tasks: If you search for the task in the upper left-hand corner of the site, you can navigate to these.&lt;/p&gt;

&lt;p&gt;Now that we know what kinds of architectures are likely to do well for this problem, we can start searching for a specific model.&lt;/p&gt;

&lt;p&gt;PyCharm has a built-in integration with Hugging Face that allows us to search for models directly. Simply right-click anywhere in your Jupyter notebook or Python script, and select &lt;em&gt;Insert HF model&lt;/em&gt;. You’ll be presented with the following window:&lt;/p&gt;



&lt;p&gt;You can see that we can find Hugging Face models either by the task type (which we can select from the menu on the left-hand side), by keyword search in the search box at the top of the window, or by a combination of both. Models are ranked by the number of likes by default, but we can also select models based on downloads or when the model was created or last modified.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foou9nn89ut0m3c7st2jc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foou9nn89ut0m3c7st2jc.png" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you use a model for a task, the checkpoint is downloaded and cached, making it faster the next time you need to use that model. You can see all of the models you’ve downloaded in the &lt;em&gt;Hugging Face&lt;/em&gt; tool window.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxlhxrnmbk0x35iqulkz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxlhxrnmbk0x35iqulkz.png" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we’ve downloaded the model, we can also look at its model card again by hovering over the model name in our Jupyter notebook or Python script. We can do the same thing with dataset cards.&lt;/p&gt;
&lt;h2&gt;
  
  
  Use a fine-tuned LLM for sentiment analysis
&lt;/h2&gt;

&lt;p&gt;Let’s move on to how we can use a model that someone else has already fine-tuned for sentiment analysis.&lt;/p&gt;

&lt;p&gt;As mentioned, sentiment analysis is usually treated as a text classification problem for LLMs.  This means that in our Hugging Face model selection window, we’ll select &lt;em&gt;Text Classification&lt;/em&gt;, which can be found under &lt;em&gt;Natural Language Processing&lt;/em&gt; on the left-hand side. To narrow the results down to sentiment analysis models, we’ll type “sentiment” in the search box in the upper left-hand corner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpfp99vdjdsjqakse4r4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpfp99vdjdsjqakse4r4.png" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see various fine-tuned models, and as expected from what we saw on the Papers With Code Benchmarks table, most of them use RoBERTa or BERT architectures. Let’s try out the top ranked model, &lt;em&gt;Twitter-roBERTa-base for Sentiment Analysis&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8y9mlnkxffyrep7co48u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8y9mlnkxffyrep7co48u.png" width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see that after we select &lt;em&gt;Use Model&lt;/em&gt; in the Hugging Face model selection window, code is automatically generated at the caret in our Jupyter notebook or Python script to allow us to start working with this model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import pipeline
pipe = pipeline("text-classification",
 model="cardiffnlp/twitter-roberta-base-sentiment-latest")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before we can do inference with this model, we’ll need to modify this code.&lt;/p&gt;

&lt;p&gt;The first thing we can check is whether we have a GPU available, which will make the model run faster. We’ll check for two types: NVIDIA GPUs, which support CUDA, and Apple GPUs, which support MPS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"MPS available: {torch.backends.mps.is_available()}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My computer supports MPS, so we can add a &lt;code&gt;device&lt;/code&gt; argument to the pipeline and set it to &lt;code&gt;"mps"&lt;/code&gt;. If your computer supports CUDA, you can instead pass &lt;code&gt;device=0&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import pipeline

pipe = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-sentiment-latest",
                device="mps")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we can get the fine-tuned LLM to run inference over our example text.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;result = pipe("I love PyCharm! It's my favorite Python IDE.")
result

[{'label': 'positive', 'score': 0.9914802312850952}]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that this model predicts the text is positive, with 99% probability.&lt;/p&gt;
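
&lt;p&gt;Pipelines also accept a list of texts, which is handy for scoring several examples at once. Here’s a quick sketch with a couple of made-up inputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;texts = [
    "The debugger saved me hours today.",
    "The app crashes every time I open it.",
]

# The pipeline returns one prediction per input, in the same order
for text, prediction in zip(texts, pipe(texts)):
    print(f"{prediction['label']} ({prediction['score']:.3f}): {text}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
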

&lt;h2&gt;
  
  
  Fine-tune your own LLM for sentiment analysis
&lt;/h2&gt;

&lt;p&gt;The other way we can use LLMs for sentiment analysis is to fine-tune our own model.&lt;/p&gt;

&lt;p&gt;You might wonder why you’d bother doing this, given the huge number of fine-tuned models that already exist on Hugging Face Hub. The main reason you might want to fine-tune a model is so that you can tailor it to your specific use case.&lt;/p&gt;

&lt;p&gt;Most models are fine-tuned on public datasets, especially social media posts and movie reviews, and you might need your model to be more sensitive to your specific domain or use case.&lt;/p&gt;

&lt;p&gt;Model fine-tuning can be quite a complex topic, so in this demonstration, I’ll explain how to do it at a more general level. However, if you want to understand this in more detail, you can read more about it in Hugging Face’s excellent NLP course, which I recommended earlier. In their tutorial, they explain in detail how to &lt;a href="https://huggingface.co/learn/nlp-course/chapter3/2?fw=pt" rel="noopener noreferrer"&gt;process data&lt;/a&gt; for fine-tuning models and two different approaches to fine-tuning: with the &lt;a href="https://huggingface.co/learn/nlp-course/chapter3/3?fw=pt" rel="noopener noreferrer"&gt;trainer API&lt;/a&gt; and &lt;a href="https://huggingface.co/learn/nlp-course/chapter3/4?fw=pt" rel="noopener noreferrer"&gt;without&lt;/a&gt; it.&lt;/p&gt;

&lt;p&gt;To demonstrate how to fine-tune a model, we’ll use the &lt;a href="https://huggingface.co/datasets/stanfordnlp/sst2" rel="noopener noreferrer"&gt;SST-2 dataset&lt;/a&gt;, which is composed of single lines pulled from movie reviews that have been annotated as either negative or positive.&lt;/p&gt;

&lt;p&gt;As mentioned earlier, BERT models consistently show up as top performers on the Papers With Code benchmarks, so we’ll fine-tune a BERT checkpoint.&lt;/p&gt;

&lt;p&gt;We can again search for these models in PyCharm’s Hugging Face model selection window.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foaqk239t7p47w4bfi2cy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foaqk239t7p47w4bfi2cy.png" width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that the most popular BERT model is &lt;code&gt;bert-base-uncased&lt;/code&gt;. This is perfect for our use case: it was trained on lowercase text, so it will match the casing of our dataset.&lt;/p&gt;

&lt;p&gt;We could have used the popular &lt;code&gt;bert-large-uncased&lt;/code&gt;, but with 110 million parameters versus BERT large’s 340 million, the base model is a bit friendlier for fine-tuning on a local machine.&lt;/p&gt;

&lt;p&gt;If you still want to use a smaller model, you could also try this with a &lt;a href="https://huggingface.co/distilbert/distilbert-base-uncased" rel="noopener noreferrer"&gt;DistilBERT model&lt;/a&gt;, which has far fewer parameters but still preserves most of the performance of the original BERT models.&lt;/p&gt;
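
&lt;p&gt;If you do try that route, the only change needed is the checkpoint string we define below; a hypothetical swap might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical swap: use DistilBERT instead of BERT base
checkpoint = "distilbert/distilbert-base-uncased"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
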

&lt;p&gt;Let’s start by reading in our dataset. We can do so using the &lt;code&gt;load_dataset()&lt;/code&gt; function from the Datasets package. SST-2 is part of the &lt;a href="https://huggingface.co/datasets/nyu-mll/glue" rel="noopener noreferrer"&gt;GLUE&lt;/a&gt; dataset, which is designed to see how well a model can complete a range of natural language tasks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datasets import load_dataset

sst_2_raw = load_dataset("glue", "sst2")
sst_2_raw

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This dataset has already been split into the train, validation, and test sets. We have 67,349 training examples – quite a modest number for fine-tuning such a large model.&lt;/p&gt;

&lt;p&gt;Here’s an example from this dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sst_2_raw["train"][1]

{'sentence': 'contains no wit , only labored gags ', 'label': 0, 'idx': 1}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see what the labels mean by calling the features attribute on the training set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sst_2_raw["train"].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'positive'], id=None),
 'idx': Value(dtype='int32', id=None)}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;0 indicates a negative sentiment, and 1 indicates a positive one.&lt;/p&gt;

&lt;p&gt;Let’s look at the number in each class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(f'Number of negative examples: {sst_2_raw["train"]["label"].count(0)}')
print(f'Number of positive examples: {sst_2_raw["train"]["label"].count(1)}')

Number of negative examples: 29780
Number of positive examples: 37569
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The classes in our training data are a tad unbalanced, but they aren’t excessively skewed.&lt;/p&gt;

&lt;p&gt;We now need to tokenize our data, transforming the raw text into a form that our model can use. To do this, we need to use the same tokenizer that was used to train the &lt;code&gt;bert-base-uncased&lt;/code&gt; model in the first place. The &lt;code&gt;AutoTokenizer&lt;/code&gt; class will take care of all of the under-the-hood details for us.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import AutoTokenizer

checkpoint = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once we’ve loaded in the correct tokenizer, we can apply this to the training data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tokenised_sentences = tokenizer(sst_2_raw["train"]["sentence"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we need to add a function to pad our tokenized sentences. This will make sure all of the inputs in a training batch are the same length – text inputs are rarely the same length and models require a consistent number of features for each input.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import DataCollatorWithPadding

def tokenize_function(example):
    return tokenizer(example["sentence"])

tokenized_datasets = sst_2_raw.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we’ve prepared our dataset, we need to determine how well the model is fitting to the data as it trains. To do this, we need to decide which metrics to use to evaluate the model’s prediction performance.&lt;/p&gt;

&lt;p&gt;As we’re dealing with a binary classification problem, we have a few choices of metrics, the most popular of which are accuracy, precision, recall, and the F1 score. In the “Evaluate the model” section, we’ll discuss the pros and cons of using each of these measures.&lt;/p&gt;

&lt;p&gt;We have two ways of creating an evaluation function for our model. The first is using the Evaluate package. This package allows us to use the specific evaluator for the SST-2 dataset, meaning we’ll evaluate the model fine-tuning using the specific metrics for this task. In the case of SST-2, the metric used is accuracy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import evaluate
import numpy as np

def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "sst2")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, if we want to customize the metrics used, we can also create our own evaluation function. &lt;/p&gt;

&lt;p&gt;In this case, I’ve imported the accuracy, precision, recall, and F1 score metrics from scikit-learn. I’ve then created a function which takes in the &lt;em&gt;predicted&lt;/em&gt; labels versus &lt;em&gt;actual&lt;/em&gt; labels for each sentence and calculates the four required metrics. We’ll use this function, as it gives us a wider variety of metrics we can check our model performance against.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import numpy as np

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1': f1_score(labels, predictions, average='macro'),
        'precision': precision_score(labels, predictions, average='macro'),
        'recall': recall_score(labels, predictions, average='macro')
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we’ve done all of the setup, we’re ready to train the model. The first thing we need to do is define some parameters that will control the training process using the &lt;code&gt;TrainingArguments&lt;/code&gt; class. We’ve only specified a few parameters here, but &lt;a href="https://huggingface.co/docs/transformers/v4.43.4/en/main_classes/trainer#transformers.TrainingArguments" rel="noopener noreferrer"&gt;this class&lt;/a&gt; has an enormous number of possible arguments allowing you to calibrate your model training to a high degree of specificity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="sst2-bert-fine-tuning",
                                  eval_strategy="epoch",
                                  num_train_epochs=3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our case, we’ve used the following arguments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;output_dir&lt;/code&gt;: The output directory where we want our model predictions and checkpoints saved.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;eval_strategy="epoch"&lt;/code&gt;: This ensures that the evaluation is performed at the end of each training epoch. Other possible values are “steps” (meaning that evaluation is done at regular step intervals) and “no” (meaning that evaluation is not done during training).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;num_train_epochs=3&lt;/code&gt;: This sets the number of training epochs (or the number of times the training loop will repeat over all of the data). In this case, it’s set to train on the data three times.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The next step is to load in our pre-trained BERT model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s break this down step-by-step:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;AutoModelForSequenceClassification&lt;/code&gt; class does two things. First, it automatically identifies the appropriate model architecture from the Hugging Face model hub given the provided checkpoint string. In our case, this would be the BERT architecture. Second, it converts this model into one we can use for classification. It does this by discarding the weights in the model’s final layer(s) so that we can retrain these using our sentiment analysis dataset.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;from_pretrained()&lt;/code&gt; method loads in our selected checkpoint, which in this case is &lt;code&gt;bert-base-uncased&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The argument &lt;code&gt;num_labels=2&lt;/code&gt; indicates that we have two classes to predict in our model: &lt;em&gt;positive&lt;/em&gt; and &lt;em&gt;negative&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We get a message telling us that some model weights were not initialized when we ran this code. This message is exactly the one we want – it tells us that the &lt;code&gt;AutoModelForSequenceClassification&lt;/code&gt; class reset the final model weights in preparation for our fine-tuning.&lt;/p&gt;

&lt;p&gt;The last step is to set up our &lt;code&gt;Trainer&lt;/code&gt; object. This stage takes in the model, the training arguments, the train and validation datasets, our tokenizer and padding function, and our evaluation function. It uses all of these to train the weights for the head (or final layers) of the BERT model, evaluating the performance of the model after each epoch on the validation set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now kick off the training. The &lt;code&gt;Trainer&lt;/code&gt; class gives us a nice timer that tells us both the elapsed time and how much longer the training is estimated to take. We can also see the metrics after each epoch, as we requested when creating the &lt;code&gt;TrainingArguments&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;trainer.train()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3s971aybo3d8hhe2dz1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3s971aybo3d8hhe2dz1.png" width="800" height="105"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluate the model
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Classification metrics
&lt;/h4&gt;

&lt;p&gt;Before we have a look at how our model performed, let’s first discuss the evaluation metrics we used in more detail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt;: As mentioned, this is the default evaluation metric for the SST-2 dataset. Accuracy is the simplest metric for evaluating classification models, being the ratio of correct predictions to all predictions. Accuracy is a good choice when the target classes are well balanced, meaning each class has an approximately equal number of instances.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precision&lt;/strong&gt;: Precision calculates the percentage of the correctly predicted positive observations to the total predicted positives. It is important when the cost of a false positive is high. For example, in spam detection, you would rather miss a spam email (false negative) than have non-spam emails land in your spam folder (false positive).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall (also known as sensitivity)&lt;/strong&gt;: Recall calculates the percentage of the correctly predicted positive observations to all observations in the actual class. It is of interest when the cost of false negatives is high, meaning classifying a positive class incorrectly as negative. For example, in disease diagnosis, you would rather have false alarms (false positives) than miss someone who is actually ill (false negatives).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;F1-score&lt;/strong&gt;: The F1-score is the harmonic mean of precision and recall. It tries to find the balance between both measures. It is a more reliable metric than accuracy when dealing with imbalanced classes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In our case, we had slightly imbalanced classes, so it’s a good idea to check both accuracy and the F1 score. If they differ, the F1 score is likely to be more trustworthy. However, if they are roughly the same, it is nice to be able to use accuracy, as it is easily interpretable.&lt;/p&gt;

&lt;p&gt;Knowing whether your model is better at predicting one class versus the other is also useful. Depending on your application, capturing all customers who are unhappy with your service may be more important, even if you sometimes get false negatives. In this case, a model with high recall would be a priority over high precision.&lt;/p&gt;
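
&lt;p&gt;One simple way to check per-class behavior is a confusion matrix over the validation set. This sketch reuses the &lt;code&gt;trainer&lt;/code&gt; and &lt;code&gt;tokenized_datasets&lt;/code&gt; objects from earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
from sklearn.metrics import confusion_matrix

# trainer.predict() returns the model logits plus the true labels
preds = trainer.predict(tokenized_datasets["validation"])
predicted_labels = np.argmax(preds.predictions, axis=-1)

# Rows are actual classes (0 = negative, 1 = positive), columns are predictions
print(confusion_matrix(preds.label_ids, predicted_labels))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
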

&lt;h4&gt;
  
  
  Model predictions
&lt;/h4&gt;

&lt;p&gt;Now that we’ve trained our model, we need to evaluate it. Normally, we would use the test set to get a final, unbiased evaluation, but the SST-2 test set does not have labels, so we cannot use it for evaluation. In this case, we’ll use the validation set accuracy scores for our final evaluation. We can do this using the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;trainer.evaluate(eval_dataset=tokenized_datasets["validation"])

{'eval_loss': 0.4223457872867584,
 'eval_accuracy': 0.9071100917431193,
 'eval_f1': 0.9070209502998072,
 'eval_precision': 0.9074841225920363,
 'eval_recall': 0.9068472678285763,
 'eval_runtime': 3.9341,
 'eval_samples_per_second': 221.649,
 'eval_steps_per_second': 27.706,
 'epoch': 3.0}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We see that the model has 90% accuracy on the validation set, comparable to other &lt;a href="https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english" rel="noopener noreferrer"&gt;BERT models trained on SST-2&lt;/a&gt;. If we wanted to improve our model performance, we could investigate a few things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Check whether the model is overfitting&lt;/strong&gt; : While small by LLM standards, the BERT model we used for fine-tuning is still very large, and our training set was quite modest. In such cases, overfitting is quite common. To check this, we should compare our validation set metrics with our training set metrics. If the training set metrics are much higher than the validation set metrics, then we have overfit the model. You can adjust a &lt;a href="https://discuss.huggingface.co/t/bert-fine-tuning-low-epochs/54869" rel="noopener noreferrer"&gt;range of parameters&lt;/a&gt; during model training to help mitigate this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Train on more epochs&lt;/strong&gt; : In this example, we only trained the model for three epochs. If the model is not overfitting, continuing to train it for longer may improve its performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check where the model has misclassified&lt;/strong&gt; : We could dig into where the model is classifying correctly and incorrectly to see if we could spot a pattern. This may allow us to spot any issues with ambiguous cases or mislabelled data. Perhaps the fact this is a binary classification problem with no label for “neutral” sentiment means there is a subset of sentences that the model cannot properly classify.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To finish our section on evaluating this model, let’s see how it performs on our test sentence. We’ll pass our fine-tuned model and tokenizer to a &lt;code&gt;TextClassificationPipeline&lt;/code&gt;, then pass our sentence to this pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import TextClassificationPipeline

pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=True)

predictions = pipeline("I love PyCharm! It's my favourite Python IDE.")

print(predictions)

[[{'label': 'LABEL_0', 'score': 0.0006891043740324676}, {'label': 'LABEL_1', 'score': 0.9993108510971069}]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our model assigns &lt;code&gt;LABEL_0&lt;/code&gt; (negative) a probability of 0.0007 and &lt;code&gt;LABEL_1&lt;/code&gt; (positive) a probability of 0.999, indicating it predicts that the sentence has a positive sentiment with 99% certainty. This result is similar to the one we got from the fine-tuned RoBERTa model we used earlier in the post.&lt;/p&gt;
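
&lt;p&gt;The &lt;code&gt;LABEL_0&lt;/code&gt; and &lt;code&gt;LABEL_1&lt;/code&gt; names are just the defaults for a freshly fine-tuned model. If you’d prefer human-readable labels, you can set the mapping on the model config yourself; a minimal sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Map the default class indices to readable names
model.config.id2label = {0: "negative", 1: "positive"}
model.config.label2id = {"negative": 0, "positive": 1}

# Depending on your Transformers version, you may need to recreate the
# pipeline for the new names to be picked up in its output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
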

&lt;h4&gt;
  
  
  Sentiment analysis benchmarks
&lt;/h4&gt;

&lt;p&gt;Instead of evaluating the model on only the dataset it was trained on, we could also assess it on other datasets.&lt;/p&gt;

&lt;p&gt;As you can see from the Papers With Code benchmarking table, you can use a wide variety of labeled datasets to assess the performance of your sentiment classifiers. These datasets include the &lt;a href="https://huggingface.co/datasets/SetFit/sst5" rel="noopener noreferrer"&gt;SST-5 fine-grained classification&lt;/a&gt;, &lt;a href="https://huggingface.co/datasets/stanfordnlp/imdb" rel="noopener noreferrer"&gt;IMDB dataset&lt;/a&gt;, &lt;a href="https://huggingface.co/datasets/yassiracharki/Yelp_Reviews_for_Binary_Senti_Analysis" rel="noopener noreferrer"&gt;Yelp binary&lt;/a&gt; and &lt;a href="https://www.kaggle.com/datasets/yacharki/yelp-reviews-for-sa-finegrained-5-classes-csv" rel="noopener noreferrer"&gt;fine-grained classification&lt;/a&gt;, &lt;a href="https://huggingface.co/datasets/fancyzhx/amazon_polarity" rel="noopener noreferrer"&gt;Amazon review polarity&lt;/a&gt;, &lt;a href="https://huggingface.co/datasets/cardiffnlp/tweet_eval" rel="noopener noreferrer"&gt;TweetEval&lt;/a&gt;, and the &lt;a href="https://www.kaggle.com/datasets/charitarth/semeval-2014-task-4-aspectbasedsentimentanalysis" rel="noopener noreferrer"&gt;SemEval Aspect-based&lt;/a&gt; sentiment analysis dataset.&lt;/p&gt;

&lt;p&gt;When evaluating your model, the main thing is to ensure that the datasets represent your problem domain.&lt;/p&gt;

&lt;p&gt;Most of the benchmarking datasets contain either reviews or social media texts, so if your problem is in either of these domains, you may find an existing benchmark that mirrors your business domain closely enough. However, if you are applying sentiment analysis to a more specialized problem, it may be necessary to create your own benchmarks to ensure your model can generalize to your problem domain properly.&lt;/p&gt;

&lt;p&gt;Since there are multiple ways of measuring sentiment, it’s also necessary to make sure that any benchmarks you use to assess your model have the same target as the dataset you trained your model on.&lt;/p&gt;

&lt;p&gt;For example, it wouldn’t be a fair measure of a model’s performance to fine-tune it on the SST-2 with a binary target, and then test it on the SST-5. As the model has never seen the &lt;em&gt;very positive&lt;/em&gt;, &lt;em&gt;very negative&lt;/em&gt;, and &lt;em&gt;neutral&lt;/em&gt; categories, it will not be able to accurately predict texts with these labels and hence will perform poorly.&lt;/p&gt;
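
&lt;p&gt;To illustrate what a fair cross-benchmark check could look like: IMDB shares SST-2’s binary positive/negative target, so evaluating our fine-tuned model on a slice of it might look something like this (a sketch only; the small test slice just keeps the run quick):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datasets import load_dataset

# IMDB also uses 0 = negative, 1 = positive, matching our SST-2 labels
imdb = load_dataset("stanfordnlp/imdb", split="test[:200]")

# Truncate long reviews so they fit within BERT's 512-token limit
tokenized_imdb = imdb.map(
    lambda batch: tokenizer(batch["text"], truncation=True), batched=True
)

print(trainer.evaluate(eval_dataset=tokenized_imdb))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
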

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;In this blog post, we saw how LLMs can be a powerful way of classifying the sentiment expressed in a piece of text and took a hands-on approach to fine-tuning an LLM for this purpose.&lt;/p&gt;

&lt;p&gt;We saw how understanding which types of models are best suited to sentiment analysis can help you narrow down your options, and how resources like Papers With Code let you see the top-performing models on different benchmarks.&lt;/p&gt;

&lt;p&gt;We also learned how Hugging Face’s powerful tooling for using these models and their integration into PyCharm makes using LLMs for sentiment analysis approachable for anyone with a background in machine learning.&lt;/p&gt;

&lt;p&gt;If you’d like to continue learning about large language models, check out our guest blog post by Dido Grigorov, who explains how to &lt;a href="https://blog.jetbrains.com/pycharm/2024/08/how-to-build-chatbots-with-langchain/" rel="noopener noreferrer"&gt;build a chatbot using the LangChain package&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get started with sentiment analysis with PyCharm today
&lt;/h2&gt;

&lt;p&gt;If you’re ready to get started on your own sentiment analysis project, you can activate your free three-month subscription to PyCharm. Click on the link below, and enter this promo code: &lt;strong&gt;PCSA24&lt;/strong&gt;. You’ll then receive an activation code through your email.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jetbrains.com/store/redeem/" rel="noopener noreferrer"&gt;Activate your free three-month subscription&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>ai</category>
      <category>llms</category>
      <category>pycharm</category>
    </item>
  </channel>
</rss>
