<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: thebadcoder</title>
    <description>The latest articles on DEV Community by thebadcoder (@thebadcoder96).</description>
    <link>https://dev.to/thebadcoder96</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1270417%2Fd7651cdc-47fa-4821-8480-23c8d03f8935.jpeg</url>
      <title>DEV Community: thebadcoder</title>
      <link>https://dev.to/thebadcoder96</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/thebadcoder96"/>
    <language>en</language>
    <item>
      <title>The Sustainability Impacts of ChatGPT: A Comprehensive Analysis</title>
      <dc:creator>thebadcoder</dc:creator>
      <pubDate>Wed, 17 Apr 2024 14:44:01 +0000</pubDate>
      <link>https://dev.to/thebadcoder96/the-sustainability-impacts-of-chatgpt-a-comprehensive-analysis-2ll8</link>
      <guid>https://dev.to/thebadcoder96/the-sustainability-impacts-of-chatgpt-a-comprehensive-analysis-2ll8</guid>
      <description>&lt;p&gt;Large Language Models (LLMs) like GPT (Generative Pre-trained Transformer) and LLaMA (Large Language Model Meta AI) have revolutionized the way we interact with data and machines, providing deep insights and enhancing human-machine interactions. As transformative as LLMs are for tasks like translation, content generation, and customer support, they come with substantial environmental costs primarily due to their high energy demands.&lt;/p&gt;

&lt;p&gt;This article provides the essential technical backdrop, explains how LLMs affect our environment, surveys the ongoing efforts to mitigate these effects, and looks at how policies and personal actions can contribute to more sustainable AI practices.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Large Language Models?
&lt;/h2&gt;

&lt;p&gt;ChatGPT, Claude, Gemini, and yes, BERT (Bidirectional Encoder Representations from Transformers) are all Large Language Models. But what are they, and why are they so energy-intensive?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1712002641366-f59bbbee71c7%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3wzMDAzMzh8MHwxfHNlYXJjaHwzM3x8Y2hhdGdwdHxlbnwwfHx8fDE3MTMyNzUxMDV8MA%26ixlib%3Drb-4.0.3%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1712002641366-f59bbbee71c7%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3wzMDAzMzh8MHwxfHNlYXJjaHwzM3x8Y2hhdGdwdHxlbnwwfHx8fDE3MTMyNzUxMDV8MA%26ixlib%3Drb-4.0.3%26q%3D80%26w%3D1080" title="a person holding a cell phone with icons on the screen" alt="a person holding a cell phone with icons on the screen"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Large Language Models (LLMs) are a type of artificial intelligence system trained on vast amounts of text data, allowing them to generate human-like responses, understand and process natural language, and perform a wide range of language-related tasks. An LLM essentially looks for patterns in text to figure out what to say back to you, and that text comes from huge collections of articles, books, posts, and more, collectively called training data.&lt;/p&gt;

&lt;p&gt;Essentially, LLMs work by using neural networks to identify patterns and relationships in the training data, which can then be used to generate new text, answer questions, translate between languages, and more. These neural networks have layers of algorithms, each designed to recognize different elements of human language, from simple grammar to complex idioms and, above all, context.&lt;/p&gt;

&lt;p&gt;If you are enjoying this so far, subscribe for free and &lt;a href="https://www.linkedin.com/in/mishalsalim/" rel="noopener noreferrer"&gt;follow me&lt;/a&gt; for more content :)&lt;/p&gt;

&lt;p&gt;The training process involves repeatedly adjusting these layers to minimize errors in the output, requiring many iterations across potentially billions of parameters. For example, GPT-3 has &lt;strong&gt;about 175 billion&lt;/strong&gt; parameters and was trained on about &lt;strong&gt;45TB&lt;/strong&gt; of text data from different datasets. By today's standards, that is a small-to-medium LLM. The more capable a model is, the more complex and resource-intensive it gets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2Fw_1456%2Cc_limit%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252Fa6c9da08-2fe6-4d75-9f4d-18b7b655718d_847x936.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2Fw_1456%2Cc_limit%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252Fa6c9da08-2fe6-4d75-9f4d-18b7b655718d_847x936.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Source: &lt;a href="https://informationisbeautiful.net/visualizations/the-rise-of-generative-ai-large-language-models-llms-like-chatgpt/" rel="noopener noreferrer"&gt;Information is Beautiful&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This computation is not only data-intensive (remember the huge amounts of training data?) but also requires a lot of electrical power, typically executed on specialized hardware like GPUs (Graphics Processing Units) or TPUs (Tensor Processing Units). You can already see the storage, training/processing, and operational costs that it can incur.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Environmental Impacts of LLMs
&lt;/h2&gt;

&lt;p&gt;Each phase of an LLM's lifecycle has its own footprint:&lt;/p&gt;

&lt;h3&gt;
  
  
  Training Impacts:
&lt;/h3&gt;

&lt;p&gt;The training process requires considerable computational resources, typically involving multiple high-powered GPUs or TPUs that run continuously for weeks or even months. This consumes large amounts of electricity, contributing to the carbon footprint of LLMs. AI companies boast about how amazing and powerful their new models are and how much information they can process, but they rarely discuss the computational and environmental cost of those models.&lt;/p&gt;

&lt;p&gt;For example, students from the University of Copenhagen developed a tool to predict the &lt;a href="https://aibusiness.com/nlp/danish-students-develop-tool-to-measure-the-carbon-footprint-of-ai" rel="noopener noreferrer"&gt;carbon footprint of algorithms&lt;/a&gt; and found that one training session with GPT-3 uses the &lt;strong&gt;same amount of energy that is needed by 126 homes&lt;/strong&gt; in Denmark annually.&lt;/p&gt;

&lt;p&gt;Another &lt;a href="https://arxiv.org/pdf/1906.02243v1.pdf" rel="noopener noreferrer"&gt;famous study&lt;/a&gt; by researchers at the University of Massachusetts, Amherst, performed an analysis of the carbon footprint of transformer models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2Fw_1456%2Cc_limit%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252Ff417f257-27f1-4dcd-9141-397a4f982ac3_639x231.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2Fw_1456%2Cc_limit%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252Ff417f257-27f1-4dcd-9141-397a4f982ac3_639x231.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Source: &lt;a href="https://www.technologyreview.com/2019/06/06/239031/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes/" rel="noopener noreferrer"&gt;MIT Technology Review&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;They found that training a transformer with around 213 million parameters (using neural architecture search) emits almost 5 times the lifetime emissions of the average American car, or roughly the equivalent of 315 round-trip flights between New York and San Francisco.&lt;/p&gt;

&lt;p&gt;Just to put things in perspective, let’s see what an actual model’s carbon footprint looks like. That 213-million-parameter model produces &lt;strong&gt;626,155 lbs of CO2&lt;/strong&gt;. Claude 3 is rumored to have &lt;strong&gt;500 billion parameters&lt;/strong&gt;. Let’s assume linear scaling (a big simplification for our simple brains to comprehend), which works out to an increase of roughly &lt;strong&gt;2,347x&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So 626,155 x 2,347 is a whopping &lt;strong&gt;1.47 billion lbs of CO2&lt;/strong&gt; for a model like Claude 3 that we are using today.&lt;/p&gt;
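&lt;p&gt;The back-of-envelope arithmetic above can be sketched in a few lines of Python. Note the assumptions: linear scaling with parameter count (a big simplification), the 626,155 lbs figure from the UMass Amherst study, and the rumored parameter count for Claude 3.&lt;/p&gt;

```python
# Back-of-envelope scaling of training emissions (linear-scaling assumption).
base_params = 213e6          # transformer from the UMass Amherst study
base_co2_lbs = 626_155       # reported training emissions for that model
claude3_params = 500e9       # rumored parameter count

ratio = claude3_params / base_params       # roughly 2,347x
estimate = base_co2_lbs * ratio            # roughly 1.47 billion lbs
print(f"{ratio:,.0f}x scale-up, ~{estimate:,.0f} lbs of CO2")
```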

&lt;p&gt;This is, of course, due to the energy-intensive nature of the training process, which involves running the model through billions of computations. But why are companies so shy about revealing such numbers?&lt;/p&gt;

&lt;h3&gt;
  
  
  Storage and Operational Impacts:
&lt;/h3&gt;

&lt;p&gt;The data centers that power LLMs are also a major source of environmental impact. These facilities require large amounts of energy for cooling, ventilation, and other operational needs. They also generate e-waste from the constant upgrading and replacement of hardware.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1558494949-ef010cbdcc31%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxkYXRhJTIwY2VudGVyfGVufDB8fHx8MTcxMzI4NDcxNHww%26ixlib%3Drb-4.0.3%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1558494949-ef010cbdcc31%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxkYXRhJTIwY2VudGVyfGVufDB8fHx8MTcxMzI4NDcxNHww%26ixlib%3Drb-4.0.3%26q%3D80%26w%3D1080" title="cable network" alt="cable network"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A recent &lt;a href="https://arxiv.org/pdf/2304.03271.pdf" rel="noopener noreferrer"&gt;study&lt;/a&gt; at the University of California, Riverside, revealed the significant water footprint of LLMs. Microsoft used approximately &lt;strong&gt;700,000 liters of freshwater&lt;/strong&gt; in its data centers during GPT-3’s training, roughly the amount of water needed to manufacture 320 Tesla vehicles.&lt;/p&gt;

&lt;p&gt;That figure covers training, but the model also uses a lot of water during inference (when you are using it). A brief exchange of 20-50 queries uses roughly a 500 ml bottle’s worth of water. Given the size of the user base, the cumulative water footprint of these interactions is significant: if we imagine 1 billion users each having one such exchange, that is &lt;strong&gt;500 million liters of water&lt;/strong&gt;, enough to fill around 200 Olympic-sized swimming pools.&lt;/p&gt;
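&lt;p&gt;The same estimate in code, with the assumptions spelled out: 500 ml per 20-50 query exchange, one exchange per user, and 2.5 million liters per Olympic-sized pool.&lt;/p&gt;

```python
# Rough cumulative water-footprint estimate for inference.
users = 1_000_000_000            # hypothetical user count
liters_per_exchange = 0.5        # one 500 ml bottle per brief exchange
olympic_pool_liters = 2_500_000  # nominal Olympic pool volume

total_liters = users * liters_per_exchange      # 500 million liters
pools = total_liters / olympic_pool_liters      # about 200 pools
print(f"{total_liters:,.0f} L, about {pools:,.0f} Olympic pools")
```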

&lt;p&gt;Moreover, the storage and hosting of LLMs, which can be terabytes in size, requires dedicated server infrastructure, further adding to the environmental footprint.&lt;/p&gt;

&lt;p&gt;When &lt;a href="https://www.bloomberg.com/news/articles/2023-03-09/how-much-energy-do-ai-and-chatgpt-use-no-one-knows-for-sure?leadSource=uverify%20wall" rel="noopener noreferrer"&gt;Bloomberg&lt;/a&gt; asked OpenAI about these sustainability concerns, the company had this to say:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;‘OpenAI runs on Azure, and we work closely with Microsoft’s team to improve efficiency and our footprint to run large language models.’&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I also found these discussions on the &lt;a href="https://community.openai.com/t/sustainable-development-and-ai/377448" rel="noopener noreferrer"&gt;OpenAI Community&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Hardware and Other Impacts:
&lt;/h3&gt;

&lt;p&gt;LLMs also require significant hardware resources, such as high-performance GPUs, storage, and memory. While reading &lt;a href="https://towardsdatascience.com/the-carbon-footprint-of-chatgpt-66932314627d#:~:text=Carbon%20footprint%20from%20training%20ChatGPT&amp;amp;text=It%20has%20been%20estimated%20that,552%20tons%20CO2e%20%5B1%5D" rel="noopener noreferrer"&gt;this article&lt;/a&gt;, I came across the author’s update that they had assumed ChatGPT runs on 16 GPUs when, in fact, it runs on more than 29,000 GPUs. The manufacturing, transportation, and eventual disposal of this hardware also have environmental impacts.&lt;/p&gt;

&lt;p&gt;This raises concerns about environmental justice, as the resource-intensive nature of these models may disproportionately affect marginalized communities and developing regions that have less access to clean energy and sustainable infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to be more sustainable with AI?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1569163139500-66446e2926ca%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3wzMDAzMzh8MHwxfHNlYXJjaHwxNXx8c3VzdGFpbmFibGV8ZW58MHx8fHwxNzEzMjk3NjY1fDA%26ixlib%3Drb-4.0.3%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1569163139500-66446e2926ca%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3wzMDAzMzh8MHwxfHNlYXJjaHwxNXx8c3VzdGFpbmFibGV8ZW58MHx8fHwxNzEzMjk3NjY1fDA%26ixlib%3Drb-4.0.3%26q%3D80%26w%3D1080" title="Earth is more valuable than money signage" alt="Earth is more valuable than money signage"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the concerns around the environmental impacts of LLMs have grown, there have been various efforts and advancements to make these models more sustainable. But I can’t lie, I do not feel like these efforts are enough. I am not convinced that the speed of these advancements can keep up with the exponential growth of LLMs and AI.&lt;/p&gt;

&lt;p&gt;Some of the key areas that we need to start or continue focusing on:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. What can companies do?
&lt;/h3&gt;

&lt;p&gt;Companies are investing heavily in reducing the energy consumption of AI tools and models. Some are exploring carbon offsets to counterbalance the emissions generated by their LLMs. Microsoft, Google, Apple, and Meta have all pledged to be carbon-neutral and net-zero. Google, for example, developed tensor processing units (TPUs) that are more energy-efficient than traditional GPUs for machine-learning workloads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2Fw_1456%2Cc_limit%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252Facb6a7cd-86fe-47e8-9c45-e5ca36046b28_810x1409.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2Fw_1456%2Cc_limit%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252Facb6a7cd-86fe-47e8-9c45-e5ca36046b28_810x1409.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Source: &lt;a href="https://www.dentallace.com/blogs/news/45-stats-about-big-tech-s-carbon-footprint-by-tom-read" rel="noopener noreferrer"&gt;Dental Lace&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But there is still a need for public awareness of just how much impact AI has on the planet. Companies need to be transparent about their footprints and work with the public to make the Earth more sustainable. Beyond that, they need to implement renewable energy and hardware recycling programs.&lt;/p&gt;

&lt;p&gt;This is not just about being green in the public eye; it’s about pushing the sustainable agenda to the forefront and setting industry standards.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. What can developers do?
&lt;/h3&gt;

&lt;p&gt;There is already an emphasis on researching and developing more efficient models. Techniques such as transfer learning, pruning, quantization, knowledge distillation, etc. are being employed to make models more efficient without sacrificing performance.&lt;/p&gt;
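&lt;p&gt;To make one of these techniques concrete, here is a toy sketch of post-training quantization in plain NumPy (an illustration of the idea, not any framework's actual API): float32 weights are mapped to int8 with a per-tensor scale, cutting memory four-fold at the cost of a small reconstruction error.&lt;/p&gt;

```python
import numpy as np

# Toy post-training quantization: map float32 weights to int8.
rng = np.random.default_rng(0)
weights = rng.standard_normal(1_000).astype(np.float32)

scale = np.abs(weights).max() / 127.0           # per-tensor scale factor
q = np.round(weights / scale).astype(np.int8)   # 1 byte per weight vs 4
restored = q.astype(np.float32) * scale         # approximate reconstruction

print(weights.nbytes, q.nbytes)                 # 4000 vs 1000 bytes
print(np.abs(restored - weights).max())         # small quantization error
```

Real deployments use frameworks' built-in quantization tooling, but the memory trade-off is exactly this.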

&lt;p&gt;Developers should prioritize building and contributing to open-source projects focused on sustainable AI practices. More training in eco-conscious programming can also be embedded in developer education.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. What can policymakers do?
&lt;/h3&gt;

&lt;p&gt;Policymaking plays a crucial role in guiding the development and implementation of AI technologies sustainably. Policies that incentivize energy-efficient AI, set emissions targets, and encourage the use of renewable energy can help drive companies to adopt more sustainable practices.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://commission.europa.eu/strategy-and-policy/priorities-2019-2024/european-green-deal_en" rel="noopener noreferrer"&gt;EU’s Green Deal&lt;/a&gt; includes specific provisions for digital sector sustainability, aiming to significantly reduce its carbon and electronic-waste footprint. The &lt;a href="https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai" rel="noopener noreferrer"&gt;AI Act&lt;/a&gt; is the first-ever legal framework on AI and addresses its risks. These are a good start, but we need more, especially in the US. We need to enact policies that require tech companies to report and reduce their carbon footprints. Transparency in energy consumption should be mandatory, not optional.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. What can YOU do?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1600493572531-c056ef2eaac4%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3wzMDAzMzh8MHwxfHNlYXJjaHwxfHx5b3V8ZW58MHx8fHwxNzEzMjI2NTYwfDA%26ixlib%3Drb-4.0.3%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1600493572531-c056ef2eaac4%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3wzMDAzMzh8MHwxfHNlYXJjaHwxfHx5b3V8ZW58MHx8fHwxNzEzMjI2NTYwfDA%26ixlib%3Drb-4.0.3%26q%3D80%26w%3D1080" title="silver framed eyeglasses on yellow surface" alt="silver framed eyeglasses on yellow surface"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course, as individuals, raising awareness and supporting green AI initiatives is crucial, but it starts with being well-informed. Understanding the environmental implications of AI use and sharing this knowledge can catalyze collective action toward sustainable practices. Being vocal and using our influence matters significantly.&lt;/p&gt;

&lt;p&gt;Additionally, on the technical side, using concise prompts and selecting models that are efficient in processing can reduce computational demands. By streamlining the complexity and length of prompts and reducing unnecessary interactions, we can significantly cut down on the computational resources needed. This not only conserves energy but also aligns with more sustainable AI usage practices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;While LLMs offer unprecedented capabilities, their environmental impact cannot be overlooked. It’s clear that bold steps are needed from all stakeholders—companies, developers, policymakers, and users alike. The path to sustainable AI is complex and challenging, but with concerted effort and innovation, it’s possible to harness the benefits of AI without compromising our planet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2Fw_1456%2Cc_limit%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F6c2592ac-8d50-4551-a511-bae27f1d5c5a_717x202.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2Fw_1456%2Cc_limit%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F6c2592ac-8d50-4551-a511-bae27f1d5c5a_717x202.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Response from ChatGPT 4&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;That is it from me! I hope this exploration was helpful in some way! What are your thoughts on sustainable AI? What are some trends you noticed?&lt;/p&gt;

&lt;p&gt;If you found value in this article, please share it with someone who might also benefit from it. Your support helps spread knowledge and inspires more content like this. Let's keep the conversation going—share your thoughts and experiences below!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>discuss</category>
      <category>learning</category>
    </item>
    <item>
      <title>Generator Functions in Python</title>
      <dc:creator>thebadcoder</dc:creator>
      <pubDate>Mon, 04 Mar 2024 01:42:58 +0000</pubDate>
      <link>https://dev.to/thebadcoder96/generator-functions-in-python-17bb</link>
      <guid>https://dev.to/thebadcoder96/generator-functions-in-python-17bb</guid>
      <description>&lt;p&gt;Today, we are going to learn about one of Python's most intriguing features—&lt;strong&gt;generator functions&lt;/strong&gt;. These are obviously not your normal functions; they're a handy tool for dealing with data more efficiently, especially when you have extremely large datasets that could give traditional functions a run for their memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Generator functions?
&lt;/h2&gt;

&lt;p&gt;Any function that uses &lt;code&gt;yield&lt;/code&gt; instead of &lt;code&gt;return&lt;/code&gt; is a generator function. Simple, right? But what does &lt;code&gt;yield&lt;/code&gt; do? The &lt;code&gt;yield&lt;/code&gt; keyword produces a sequence of values over time instead of returning a single value. For example, instead of returning an entire list or a whole dataset, it produces one item from the list, or one row from the dataset, at a time.&lt;/p&gt;

&lt;p&gt;This allows generator functions to produce values on the fly and pauses their state between outputs, making them memory-efficient and performance-friendly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This has nothing to do with the trending and cool &lt;em&gt;Generative AI&lt;/em&gt; that's been capturing everyone's attention; we are talking about functions that help us streamline our data processing tasks with elegance and efficiency.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  How does it do that?
&lt;/h3&gt;

&lt;p&gt;A normal function runs through all the lines of code within it and returns a value. A generator function, however, returns an iterator, in this case a generator object. It produces only one value at a time, so no other values are stored. Each time the &lt;code&gt;next()&lt;/code&gt; function is called on the object, the next value is produced.&lt;/p&gt;
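&lt;p&gt;A quick illustration of that pause-and-resume behavior (the generator and its values here are made up for the example):&lt;/p&gt;

```python
def three_steps():
    yield "first"
    yield "second"
    yield "third"

gen = three_steps()   # calling the function returns a generator object
print(next(gen))      # "first": runs until the first yield, then pauses
print(next(gen))      # "second": resumes right after the previous yield
print(next(gen))      # "third": one more call would raise StopIteration
```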

&lt;p&gt;Basically, generators let us loop through something without storing everything all at once. You can learn more about this in detail &lt;a href="https://betterprogramming.pub/yield-in-python-76413d5e2a27"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let’s try to understand it from an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;infinite_counter&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;
        &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="c1"&gt;# Using the generator
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;infinite_counter&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Stop at 10 to keep things sane!
&lt;/span&gt;        &lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a simple infinite counter. Here's what's happening:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;infinite_counter&lt;/code&gt; defines a generator function that starts counting from 0.&lt;/li&gt;
&lt;li&gt;Inside an infinite loop, it yields the current count, then increments it.&lt;/li&gt;
&lt;li&gt;In the &lt;code&gt;for&lt;/code&gt; loop, the function is called so it produces numbers from 0 upwards.&lt;/li&gt;
&lt;li&gt;The loop prints each number, and we've added a break condition to stop the loop after it prints 10, preventing an actual infinite loop.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This example shows the power of generators to handle potentially infinite sequences in a memory-efficient way, yielding one item at a time!&lt;/p&gt;

&lt;h4&gt;
  
  
  Generator Expressions:
&lt;/h4&gt;

&lt;p&gt;Generator functions are just one way to create generator objects. Another way is &lt;em&gt;Generator Expressions&lt;/em&gt;. They are just like list comprehensions but for generators; you can convert a list comprehension into a generator expression by replacing the square brackets &lt;code&gt;[]&lt;/code&gt; with parentheses &lt;code&gt;()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;squares&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This generator expression creates a sequence of squared numbers on demand, showcasing the elegance and simplicity of generator expressions for on-the-fly data processing.&lt;/p&gt;
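&lt;p&gt;The memory difference is easy to see with &lt;code&gt;sys.getsizeof&lt;/code&gt; (exact sizes vary by platform and Python version, so treat the numbers as illustrative):&lt;/p&gt;

```python
import sys

squares_list = [x**2 for x in range(1_000_000)]   # materializes every value
squares_gen = (x**2 for x in range(1_000_000))    # lazy, fixed-size object

print(sys.getsizeof(squares_list))   # several megabytes
print(sys.getsizeof(squares_gen))    # a couple hundred bytes at most
```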

&lt;h2&gt;
  
  
  Generator Functions in Data Processing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hws0n8kvk7axaeirg2l.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hws0n8kvk7axaeirg2l.jpeg" alt="Image description" width="800" height="534"&gt;&lt;/a&gt;&lt;br&gt;
Generator functions are the unsung heroes of memory management. Their ability to yield data incrementally means they can process information piece by piece, rather than loading everything into memory at once. This approach is not just about being resourceful; it's a necessity when dealing with datasets that span gigabytes or even terabytes.&lt;/p&gt;

&lt;p&gt;Generator functions are incredibly versatile, fitting a wide range of scenarios beyond the basics:&lt;br&gt;
- &lt;strong&gt;Real-time data streams&lt;/strong&gt;: perfect for processing live data feeds, where data is continuous and potentially infinite.&lt;br&gt;
- &lt;strong&gt;Large files&lt;/strong&gt;: useful for reading and processing data without loading everything into memory simultaneously.&lt;br&gt;
- &lt;strong&gt;Data transformation pipelines&lt;/strong&gt;: implement stages of data transformation where each function passes its output to the next, efficiently handling data at each step.&lt;/p&gt;

&lt;p&gt;Let’s run through some examples because &lt;a href="https://www.linkedin.com/in/mishalsalim/"&gt;the author&lt;/a&gt; is not happy with the length of this article so far.&lt;/p&gt;
&lt;h3&gt;
  
  
  Example 1: Processing large files
&lt;/h3&gt;

&lt;p&gt;Let’s say you want to filter out specific entries from a file based on certain criteria, such as error messages in a log file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;filter_errors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_file&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ERROR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function goes through each line of the log, yielding only those that contain error messages, showing how generator functions can be used for real-time data filtering.&lt;/p&gt;
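&lt;p&gt;Generators like this also compose naturally into pipelines, where each stage consumes the previous one and only a single record is in flight at a time. A minimal sketch (the stage names and the tiny CSV-style records are made up for illustration):&lt;/p&gt;

```python
def read_records(lines):
    # stage 1: normalize raw lines
    for line in lines:
        yield line.strip()

def parse(records):
    # stage 2: split each record into fields
    for record in records:
        yield record.split(",")

def row_totals(rows):
    # stage 3: sum the numeric fields of each row
    for row in rows:
        yield sum(int(value) for value in row[1:])

data = ["a,1,2", "b,3,4"]
print(list(row_totals(parse(read_records(data)))))  # [3, 7]
```

Nothing runs until the final `list()` pulls values through, so the same pipeline works unchanged on a file object with millions of lines.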

&lt;h3&gt;
  
  
  Example 2: Data Loading and Preprocessing
&lt;/h3&gt;

&lt;p&gt;Generators are particularly useful in machine learning for data loading and preprocessing. Libraries like TensorFlow and PyTorch provide data loaders that can stream data from disk in batches using generator functions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TensorFlow&lt;/strong&gt; extensively uses the concept of generators through its &lt;code&gt;tf.data.Dataset&lt;/code&gt; API, which allows for efficient data loading, preprocessing, and augmentation on the fly during model training. &lt;/p&gt;

&lt;p&gt;The &lt;code&gt;from_generator&lt;/code&gt; method allows you to create a Dataset from a Python generator. Here, TensorFlow uses the generator function indirectly to stream data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;

&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_types&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arg1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;arg2&lt;/span&gt;&lt;span class="p"&gt;,)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Batching data for training
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
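&lt;p&gt;The &lt;code&gt;load_data&lt;/code&gt;, &lt;code&gt;arg1&lt;/code&gt;, and &lt;code&gt;arg2&lt;/code&gt; names above are placeholders. A minimal generator to pair with &lt;code&gt;from_generator&lt;/code&gt; might look like this (synthetic data, purely illustrative; note that TensorFlow passes &lt;code&gt;args&lt;/code&gt; to the generator as tensors, hence the &lt;code&gt;int()&lt;/code&gt; casts):&lt;/p&gt;

```python
import random

def load_data(n_samples=100, n_features=4):
    # Hypothetical generator: yields one (features, label) pair
    # at a time instead of materializing the whole dataset.
    for _ in range(int(n_samples)):
        x = [random.random() for _ in range(int(n_features))]
        y = sum(x)
        yield x, y

sample = list(load_data(5, 3))
print(len(sample))  # 5
```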



&lt;h3&gt;
  
  
  Example 3: Principles in Pandas
&lt;/h3&gt;

&lt;p&gt;Pandas offers functionality that aligns with the principles of generators, useful when dealing with large datasets that might not fit into memory.&lt;/p&gt;

&lt;p&gt;For row-wise iteration, &lt;code&gt;iterrows&lt;/code&gt; and &lt;code&gt;itertuples&lt;/code&gt; can be used, though it's worth noting that these methods are rarely the most efficient way to iterate over a DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;

&lt;span class="c1"&gt;# iterrows example
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# itertuples example
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;itertuples&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
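&lt;p&gt;As a point of comparison, the same column-wise work is usually much faster with vectorized operations, which avoid the per-row Python overhead of &lt;code&gt;iterrows&lt;/code&gt; and &lt;code&gt;itertuples&lt;/code&gt; entirely:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({'a': range(10), 'b': range(10, 20)})

# Vectorized: operates on whole columns at C speed,
# no Python-level loop over rows.
df['total'] = df['a'] + df['b']
print(df['total'].tolist())  # [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
```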



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Pro-Tip&lt;/em&gt;&lt;/strong&gt;: Pandas' &lt;code&gt;read_csv&lt;/code&gt; function can process large CSV files in manageable chunks, which is particularly beneficial for datasets that don't fit in memory. When you pass the &lt;code&gt;chunksize&lt;/code&gt; parameter, the function returns an iterator object instead of a single DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;large_file.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunksize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Process each chunk here
&lt;/span&gt;    &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
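&lt;p&gt;The &lt;code&gt;process&lt;/code&gt; call above is a placeholder. Here is a self-contained version of the pattern: it writes a small CSV (a stand-in for the hypothetical &lt;code&gt;large_file.csv&lt;/code&gt;) and sums a column chunk by chunk, so only &lt;code&gt;chunksize&lt;/code&gt; rows are ever in memory at once:&lt;/p&gt;

```python
import os
import pandas as pd

# Stand-in for a large file: 10 rows, summed 3 rows at a time
pd.DataFrame({'value': range(10)}).to_csv('demo.csv', index=False)

total = 0
for chunk in pd.read_csv('demo.csv', chunksize=3):
    # Each chunk is an ordinary DataFrame of up to 3 rows
    total += chunk['value'].sum()

print(total)  # 45
os.remove('demo.csv')
```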



&lt;h3&gt;
  
  
  Example 4: Web Crawling with Generators
&lt;/h3&gt;

&lt;p&gt;Scrapy, an asynchronous web scraping framework, uses generators and coroutines to handle web requests and responses efficiently. Here's a simplified example of a Scrapy spider that uses generators to crawl web pages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ErrorLogSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error_log_spider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://example.com/logs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract log page URLs
&lt;/span&gt;        &lt;span class="n"&gt;log_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;log_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_log&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract error messages
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;error_msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.error::text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;error_msg&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We start from a main page, follow links to log pages, and extract error messages, all while using generators to facilitate efficient data extraction and processing in web crawling tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example 5: Related concepts in other libraries
&lt;/h3&gt;

&lt;p&gt;This concept of deferring computation and efficiently managing resources is a common thread that ties together various Python libraries. Libraries like &lt;code&gt;numpy&lt;/code&gt; leverage iterators for creating arrays from iterable sequences, optimizing memory usage in data manipulation. Similarly, &lt;code&gt;PySpark&lt;/code&gt; employs lazy evaluation to efficiently process big data across distributed systems, executing transformations only when an action requires the result, thereby optimizing computation and resource utilization. &lt;/p&gt;
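&lt;p&gt;For example, NumPy's &lt;code&gt;fromiter&lt;/code&gt; builds an array directly from a generator, so the intermediate Python list never exists:&lt;/p&gt;

```python
import numpy as np

# Squares are produced lazily; fromiter consumes them one at a time
squares = (i * i for i in range(5))
arr = np.fromiter(squares, dtype=np.int64)
print(arr)  # [ 0  1  4  9 16]
```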

&lt;p&gt;Understanding and leveraging these principles allows data professionals to handle larger datasets, speed up data processing, and write more efficient and scalable Python code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dvd5oah7yrpii54x0eh.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dvd5oah7yrpii54x0eh.gif" alt="Funny Gif" width="480" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hopefully, these examples illustrate the power of generators, iterators, and related concepts, highlighting their importance in efficient data processing and analysis across contexts. Whether you're processing streams of real-time data or chipping away at massive datasets, embracing generator functions can elevate your data handling to new heights.&lt;/p&gt;




&lt;p&gt;That is it from me! Have you used similar techniques before, or do you see new opportunities to apply them in your work? I hope this exploration was helpful in some way! &lt;/p&gt;

&lt;p&gt;If you found value in this article, please share it with someone who might also benefit from it. Your support helps spread knowledge and inspires more content like this. Let's keep the conversation going—share your thoughts and experiences below!&lt;/p&gt;

</description>
      <category>python</category>
      <category>beginners</category>
      <category>learning</category>
      <category>data</category>
    </item>
    <item>
      <title>What is Brand Safety Analysis? : A Data Nerd’s Perspective</title>
      <dc:creator>thebadcoder</dc:creator>
      <pubDate>Thu, 01 Feb 2024 17:08:32 +0000</pubDate>
      <link>https://dev.to/thebadcoder96/what-is-brand-safety-analysis-a-data-nerds-perspective-6gg</link>
      <guid>https://dev.to/thebadcoder96/what-is-brand-safety-analysis-a-data-nerds-perspective-6gg</guid>
      <description>&lt;p&gt;The concept of Brand Safety Analysis (BSA) has emerged as a crucial tool for marketers, content creators, and data enthusiasts. But what does it really mean, and why is it gaining such traction?&lt;/p&gt;

&lt;p&gt;I came across BSA through Conan O’Brien’s podcast, “Conan O’Brien Needs A Friend,” and decided to make it the topic of my first written article. So critique me on how I’m doing!&lt;/p&gt;

&lt;p&gt;Anyway, if you don’t know Conan or haven’t seen his work, I highly recommend &lt;a href="https://www.youtube.com/@TeamCoco"&gt;checking him out&lt;/a&gt;. But... you definitely don’t know me, so &lt;a href="https://www.linkedin.com/in/mishalsalim/"&gt;check me out too.&lt;/a&gt; :P&lt;/p&gt;

&lt;p&gt;Conan O’Brien is not just a household name; he’s a seasoned comedian, writer, and producer known for his quick wit and innovative comedy. He used to write for The Simpsons and IS my favorite talk show host, and he is now podcasting too. Every year on his podcast, they hold a ‘State of the Podcast’ session, similar to how the United States of America holds its ‘State of the Union’, where they analyze how the podcast is doing overall.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1xvjwsuk6rr7h6j63o0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1xvjwsuk6rr7h6j63o0.png" alt="Conan at TBS" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Brand Safety Analysis?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Targeted Advertising&lt;/strong&gt;: BSA enables brands to place their advertisements more strategically, ensuring creators align with the content’s values and reach the desired audience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content Strategy&lt;/strong&gt;: It offers content creators insight into how their thematic choices or language use might influence potential brand partnerships.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consumer Insights&lt;/strong&gt;: By understanding what content resonates with audiences, brands can tailor their marketing strategies more effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Essence of Brand Safety Analysis
&lt;/h2&gt;

&lt;p&gt;At its core, Brand Safety Analysis is an evaluative process used by brands to ensure that a platform’s content aligns with their values and public image. When you start out creating content and getting sponsorships, it is very unlikely that you come across BSA. This is because smaller companies do not really care about your brand when they set you up with sponsorship “codes” that they track to know how many sales are coming from your audience.&lt;/p&gt;

&lt;p&gt;But as you get bigger and get approached by bigger companies for partnerships, BSA plays a huge role. If a company is partnering with you, and you do not align with their views/values, it can hurt both parties. This analysis acts like a compatibility check for potential partnerships between advertisers and content creators. In the case of Conan’s podcast, a detailed BSA provided insights into various content dimensions that advertisers scrutinize, such as:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Obscenity and Profanity&lt;/strong&gt;: Quantifying the use of language that might be deemed inappropriate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adult and Sexual Content&lt;/strong&gt;: Assessing references to sexual content, innuendos, or adult themes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hate Speech and Aggression&lt;/strong&gt;: Identifying content that could be perceived as discriminatory or promoting violence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Illegal Drug References&lt;/strong&gt;: Highlighting mentions of illegal substances or activities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Military Conflict&lt;/strong&gt;: Analyzing discussions related to wars or military actions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Violation of Human Rights&lt;/strong&gt;: Scrutinizing content for potential endorsements of human rights violations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8tjerpp8s8ly7bxzxf1v.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8tjerpp8s8ly7bxzxf1v.gif" alt="Conan at TBS" width="480" height="257"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The challenge here, particularly for a data nerd, lies in the methodology. How do we develop algorithms capable of understanding humor, sarcasm, and the nuanced dynamics of conversation? Conan’s content, renowned for its comedic genius and sarcasm, brings to light the intricate challenge of quantifying content properly.&lt;/p&gt;
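&lt;p&gt;As a data nerd's toy illustration (not how any real BSA vendor works), the crudest possible version is a keyword tally per category, which immediately shows why sarcasm and context break naive approaches:&lt;/p&gt;

```python
# Toy brand-safety tally: counts flagged keywords per category.
# Real systems use context-aware NLP models; this naive version
# cannot tell a joke about profanity from actual profanity.
FLAGGED = {
    "profanity": {"damn", "hell"},
    "drugs": {"cocaine", "heroin"},
}

def safety_tally(transcript):
    words = transcript.lower().split()
    return {cat: sum(w in vocab for w in words)
            for cat, vocab in FLAGGED.items()}

print(safety_tally("what the hell was that damn cocaine joke"))
# {'profanity': 2, 'drugs': 1}
```

&lt;p&gt;A comedian riffing about drugs and a podcast endorsing them score identically here; closing that gap is exactly the hard, interesting part.&lt;/p&gt;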

&lt;h2&gt;
  
  
  The Future of Content and Brand Synergy
&lt;/h2&gt;

&lt;p&gt;Brand Safety Analysis is not just about avoiding controversy. It’s about creating a harmonious relationship between content and advertising, benefiting brands, creators, and consumers. As we delve deeper into the digital content era, the significance of Brand Safety Analysis only grows, opening new avenues for exploration and understanding, especially in the field of data.&lt;/p&gt;

&lt;p&gt;For tech professionals, marketers, and business people, BSA promises a blend of challenges and opportunities, all aimed at enhancing the digital content landscape.&lt;/p&gt;

&lt;p&gt;That is it from me! What’s your take on Brand Safety Analysis? Have you encountered similar methodologies in your work, or do you see potential applications in your field? I hope you enjoyed this article and found it helpful in some way! I would really appreciate it if you would like, comment, or share this article with someone who might find some value in it.&lt;/p&gt;

&lt;p&gt;Check out the video from Conan that talks about BSA and inspired me to write my first article :)&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/eVJNCq2ZuYE"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>contentwriting</category>
      <category>beginners</category>
      <category>datascience</category>
      <category>analytics</category>
    </item>
  </channel>
</rss>
