A Beginner’s Guide to Building Your First Serverless Data Pipeline on AWS

Stepping into the world of cloud computing, especially AWS, can feel daunting with its vast array of services.
This article will walk you through building a simple, cost-effective serverless data ingestion pipeline that fetches real-time weather data and stores it in AWS S3. It's perfect for anyone with basic Python knowledge looking to get their hands dirty with AWS Lambda and S3 on the free tier.

Why Serverless and AWS?
Serverless computing, like AWS Lambda, allows you to run code without provisioning or managing servers. You only pay for the compute time you consume, making it incredibly cost-effective and scalable for many use cases. Combining this with AWS S3, a highly durable and scalable object storage service, gives you a powerful foundation for data pipelines, microservices, and more.

For a beginner, this project is ideal because it:

  1. Uses services with generous AWS Free Tier limits.
  2. Introduces fundamental AWS concepts: Compute (Lambda), Storage (S3), Permissions (IAM), and Automation (EventBridge).
  3. Leverages Python, a language many data professionals are familiar with.

Project Goal & Architecture
Our goal is simple: automatically fetch current weather data for a specific location and store it as a historical record in an AWS S3 bucket.

+------------------+     +-------------------+     +---------------------+
| AWS EventBridge  | --> | AWS Lambda        | --> | AWS S3 Bucket       |
| (Scheduled Rule) |     | (Python Function) |     | (Weather Data JSON) |
+------------------+     +-------------------+     +---------------------+

The Core Components

  1. AWS S3 (Simple Storage Service): Our data lake! This is where all the fetched weather data (as JSON files) will be stored. S3 is designed for 99.999999999% durability, making it extremely reliable.
  2. AWS IAM (Identity and Access Management): To manage who can do what in our AWS account. We'll create a specific role for our Lambda function, granting it only the necessary permissions (e.g., to write to S3).
  3. AWS Lambda: The heart of our serverless magic. Our Python code will run here, fetching data, processing it, and pushing it to S3. We don't worry about servers, patching, or scaling; Lambda handles it all.
  4. Open-Meteo API: Our external data source. This is a free and open-source API that provides weather data without requiring an API key for basic usage.
  5. AWS EventBridge (formerly CloudWatch Events): This service will act as our scheduler. We'll set up a rule to trigger our Lambda function at regular intervals (e.g., daily).
  6. AWS CloudWatch: Lambda automatically integrates with CloudWatch for logging and monitoring, which is essential for debugging and observing our function's execution.

Check out the full project README on my GitHub.

Step 1: Set up Your AWS Account (If you haven't already)

  1. Go to aws.amazon.com.
  2. Click "Create an AWS Account" and follow the instructions. You will need a credit card, but you won't be charged for free-tier usage. Important: Note down your AWS Account ID.

Step 2: Create an S3 Bucket (Storage)
This is where your weather data files will be stored.

  1. Log in to the AWS Management Console.
  2. Search for "S3" in the search bar at the top and click on "S3". Click "Create bucket".


  • Bucket name: Choose a globally unique name (e.g., yourname-weather-data-bucket). Rule: Bucket names must be lowercase, with no spaces, and unique across all of AWS.
  • AWS Region: Choose a region close to you (e.g., us-east-1 (N. Virginia), eu-west-1 (Ireland)). This affects latency and, for some services, pricing, but small-scale usage like this project typically stays within the free tier in any region.
  • Object Ownership: Keep the default "ACLs disabled (recommended)".
  • Block Public Access settings for this bucket: Keep "Block all public access" checked. This is crucial for security. Your Lambda function will access it, not the public internet.
  • Leave other settings as default for now.
  • Click "Create bucket".


Step 3: Create an IAM Role for Lambda (Permissions)
Your Lambda function needs permission to interact with other AWS services (like S3 for putting objects and CloudWatch for logging).

  1. In the AWS Management Console, search for "IAM" and click on "IAM".
  2. In the left navigation pane, click "Roles".
  3. Click "Create role".
  4. Trusted entity type: Select "AWS service".
  5. Use case: Select "Lambda" from the list. Click "Next".
  6. Permissions policies:
     • Search for AWSLambdaBasicExecutionRole and select it. This grants permission to write logs to CloudWatch.
     • Search for AmazonS3FullAccess and select it. This lets the function put objects into S3 (it grants far more access than the function needs; fine for a learning project, but in production you would scope it down to s3:PutObject on your specific bucket).
  7. Click "Next".
  8. Role name: Give it a descriptive name, e.g., lambda-weather-s3-role.
  9. Leave other settings as default.
  10. Click "Create role".
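The same role can be created programmatically. A boto3 sketch using the names from this guide (run it with credentials that have IAM permissions; the managed policy ARNs are the standard AWS ones):

import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the Lambda service assume this role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

role = iam.create_role(
    RoleName="lambda-weather-s3-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the same two managed policies selected in the console
for arn in (
    "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
):
    iam.attach_role_policy(RoleName="lambda-weather-s3-role", PolicyArn=arn)

print(role["Role"]["Arn"])  # you'll need this ARN if you script the next step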

Step 4: Create Your Lambda Function (Compute)
This will host your Python code.

  1. In the AWS Management Console, search for "Lambda" and click on "Lambda".
  2. Click "Create function".
  3. Select "Author from scratch" (the right choice for custom code).
  4. Function name: e.g., WeatherDataFetcher
  5. Runtime: Choose Python 3.10 (or the latest Python 3.x available).
  6. Architecture: x86_64 (default).
  7. Permissions:
     • Under "Change default execution role", select "Use an existing role".
     • Choose the IAM role you just created: lambda-weather-s3-role.
  8. Click "Create function".
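Purely as an illustration of the API route (in this guide we paste the code into the console in the next step), here's a hedged boto3 sketch of the same creation step. It assumes you've zipped lambda_function.py into function.zip; the account ID in the role ARN is a placeholder:

import boto3

lambda_client = boto3.client("lambda")

with open("function.zip", "rb") as f:
    zip_bytes = f.read()

lambda_client.create_function(
    FunctionName="WeatherDataFetcher",
    Runtime="python3.10",
    Role="arn:aws:iam::123456789012:role/lambda-weather-s3-role",  # placeholder account ID
    Handler="lambda_function.lambda_handler",  # file name . function name
    Code={"ZipFile": zip_bytes},
    Timeout=30,
    MemorySize=128,
)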

Step 5: Write Lambda Python Code and Configure Environment
Now you'll add your Python code to the Lambda function.

Once your Lambda function is created, you'll be on its configuration page. Scroll down to the "Code source" section.

You'll see a default lambda_function.py file. Replace its content with the following Python code.

import json
import os
import datetime
import logging
import urllib.request

import boto3

# Set up logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Create the S3 client once, at module load, so warm invocations reuse it
s3 = boto3.client('s3')

# Configuration (make sure these are set in the Lambda environment variables)
S3_BUCKET_NAME = os.environ.get('S3_BUCKET_NAME')
CITY_LATITUDE = os.environ.get('CITY_LATITUDE')
CITY_LONGITUDE = os.environ.get('CITY_LONGITUDE')

WEATHER_API_URL = "https://api.open-meteo.com/v1/forecast"

def lambda_handler(event, context):
    logger.info(f"Lambda function triggered at {datetime.datetime.now()}")

    if not S3_BUCKET_NAME or not CITY_LATITUDE or not CITY_LONGITUDE:
        logger.error("Missing environment variables: S3_BUCKET_NAME, CITY_LATITUDE, or CITY_LONGITUDE.")
        return {
            'statusCode': 500,
            'body': json.dumps('Configuration error: Missing environment variables.')
        }

    try:
        # Build API URL with parameters
        params = (
            f"latitude={CITY_LATITUDE}&longitude={CITY_LONGITUDE}"
            "&current_weather=true"
            "&temperature_unit=fahrenheit"
            "&windspeed_unit=mph"
            "&timezone=auto"
        )
        full_url = f"{WEATHER_API_URL}?{params}"
        logger.info(f"Fetching weather data from: {full_url}")

        with urllib.request.urlopen(full_url) as response:
            weather_data = json.loads(response.read().decode())

        logger.info(f"Weather data: {weather_data}")

        current_weather = weather_data.get('current_weather', {})
        logger.info(f"Current weather: {current_weather}")

        # Parse timestamp safely
        time_value = current_weather.get('time')

        if time_value:
            try:
                # Open-Meteo returns ISO8601 string like "2025-06-20T18:30"
                timestamp = datetime.datetime.fromisoformat(time_value).isoformat()
            except ValueError:
                # If format isn't ISO, assume UNIX timestamp
                timestamp = datetime.datetime.fromtimestamp(float(time_value)).isoformat()
        else:
            timestamp = datetime.datetime.now().isoformat()

        formatted_data = {
            "timestamp": timestamp,
            "latitude": CITY_LATITUDE,
            "longitude": CITY_LONGITUDE,
            "temperature": current_weather.get('temperature'),
            "windspeed": current_weather.get('windspeed'),
            "winddirection": current_weather.get('winddirection'),
            "weathercode": current_weather.get('weathercode'),
            "is_day": current_weather.get('is_day'),
            "source_api": "Open-Meteo"
        }

        # Timezone-aware UTC; avoids the deprecated utcnow() on newer Python runtimes
        current_utc_time = datetime.datetime.now(datetime.timezone.utc)
        s3_key_prefix = current_utc_time.strftime("data/%Y/%m/%d/")
        s3_file_name = current_utc_time.strftime("weather_%Y-%m-%d-%H-%M-%S.json")
        s3_object_key = s3_key_prefix + s3_file_name
        logger.info(f"Uploading data to S3 key: {s3_object_key}")

        s3.put_object(
            Bucket=S3_BUCKET_NAME,
            Key=s3_object_key,
            Body=json.dumps(formatted_data, indent=2),
            ContentType='application/json'
        )
        logger.info(f"Successfully uploaded to s3://{S3_BUCKET_NAME}/{s3_object_key}")

        return {
            'statusCode': 200,
            'body': json.dumps(f'Weather data saved to S3 bucket {S3_BUCKET_NAME} as {s3_object_key}!')
        }

    except Exception as e:
        logger.error(f"Error occurred: {e}")
        return {
            'statusCode': 500,
            'body': json.dumps(f'An error occurred: {e}')
        }

  • Deploy the code: After pasting the code, click the "Deploy" button at the top right of the "Code source" section.

  • Configure Environment Variables:
    • Below the "Code source" section, click the "Configuration" tab.
    • In the left sidebar, click "Environment variables".
    • Click "Edit" and then "Add environment variable".
    • Add the following three variables:

      Key: S3_BUCKET_NAME, Value: yourname-weather-data-bucket (your actual S3 bucket name)
      Key: CITY_LATITUDE, Value: YOUR_CITY_LATITUDE (e.g., 28.61 for New Delhi)
      Key: CITY_LONGITUDE, Value: YOUR_CITY_LONGITUDE (e.g., 77.20 for New Delhi)

    • Click "Save".

  • Adjust Basic Settings (Memory & Timeout):
    • Still under the "Configuration" tab, click "General configuration".
    • Click "Edit".
    • Memory: Keep it at 128 MB (the default and lowest cost).
    • Timeout: Increase it to 30 seconds (or 1 minute) to allow enough time for the API call and S3 upload.
    • Click "Save".

Step 6: Test Your Lambda Function

  1. On your Lambda function page, click the "Test" tab (or "Test" button near the top right).
  2. Click the "Create new event" dropdown.
  3. Event name: test-invoke
  4. Event template: "hello-world" (the content of the JSON doesn't matter for this function as it doesn't use the event payload).
  5. Click "Save".
  6. Click the "Test" button.


You should see "Execution results" indicating Status: Succeeded.
Check the "Log output" for the logger.info() messages, which confirm the API call and the S3 upload.

  • Verify in S3: Go back to your S3 bucket in the AWS console. You should see a new folder data/ and inside it, subfolders for the current year/month/day, containing a new JSON file (e.g., weather_YYYY-MM-DD-HH-MM-SS.json).
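You can also verify from code rather than clicking through the console. A small boto3 sketch that lists whatever the function has written under the data/ prefix (bucket name is the placeholder from earlier):

import boto3

s3 = boto3.client("s3")

resp = s3.list_objects_v2(Bucket="yourname-weather-data-bucket", Prefix="data/")
for obj in resp.get("Contents", []):  # "Contents" is absent when the prefix is empty
    print(obj["Key"], obj["Size"])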

Step 7: Configure a Trigger (EventBridge Schedule)
This makes your function run automatically.

  1. On your Lambda function page, click the "Configuration" tab.
  2. In the left sidebar, click "Triggers".
  3. Click "Add trigger".
  4. Select a source: Choose "EventBridge (CloudWatch Events)".
  5. Rule: "Create a new rule".
  6. Rule name: e.g., daily-weather-fetcher
  7. Rule type: "Schedule expression".
  8. Schedule expression: cron(0 0 * * ? *) (this runs the function once a day, at midnight UTC). Tip: You can use rate(1 day) for daily, or cron(0 12 ? * MON-FRI *) for 12 PM UTC, Monday-Friday. Be mindful of frequency to stay within the free tier! For testing you could use rate(5 minutes), but change it back after testing.
  9. Leave other settings as default.
  10. Click "Add". Your Lambda function will now automatically run according to your schedule, fetching and storing weather data daily!
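The same trigger can be wired up in code. A hedged boto3 sketch using the names from this guide (the region and account ID in the function ARN are placeholders; three calls are needed because the rule, the invoke permission, and the target are separate resources):

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

function_arn = "arn:aws:lambda:us-east-1:123456789012:function:WeatherDataFetcher"  # placeholder

# 1. Create (or update) the scheduled rule
rule = events.put_rule(
    Name="daily-weather-fetcher",
    ScheduleExpression="cron(0 0 * * ? *)",
)

# 2. Allow EventBridge to invoke the function
lambda_client.add_permission(
    FunctionName="WeatherDataFetcher",
    StatementId="daily-weather-fetcher-invoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)

# 3. Point the rule at the function
events.put_targets(
    Rule="daily-weather-fetcher",
    Targets=[{"Id": "WeatherDataFetcher", "Arn": function_arn}],
)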

Step 8: Clean Up Your AWS Resources (Crucial for Free Tier)
Always remember to clean up resources after a learning project to avoid unexpected charges, especially if you step outside the free tier. Delete, in this order: the EventBridge rule, the Lambda function, the S3 bucket (after emptying its contents), and the IAM role.
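If you scripted the setup, you can script the teardown too. A sketch using the same resource names (double-check the bucket name before running, since emptying it deletes every object):

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")
s3 = boto3.resource("s3")
iam = boto3.client("iam")

# 1. Remove the schedule (targets must go before the rule itself)
events.remove_targets(Rule="daily-weather-fetcher", Ids=["WeatherDataFetcher"])
events.delete_rule(Name="daily-weather-fetcher")

# 2. Delete the function
lambda_client.delete_function(FunctionName="WeatherDataFetcher")

# 3. Empty, then delete, the bucket
bucket = s3.Bucket("yourname-weather-data-bucket")
bucket.objects.all().delete()
bucket.delete()

# 4. Detach the managed policies, then delete the role
for arn in (
    "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
):
    iam.detach_role_policy(RoleName="lambda-weather-s3-role", PolicyArn=arn)
iam.delete_role(RoleName="lambda-weather-s3-role")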

I hope this article provides a clear guide and inspires others to get hands-on with AWS. The cloud journey is full of exciting possibilities!
