Jen Wike Huger for Parseable

Ingesting Data to Parseable Using Pandas

A Step-by-Step Guide

Managing and deriving insights from vast amounts of historical data is not just a challenge but a necessity. Imagine your team grappling with numerous log files, trying to pinpoint an issue. Because the logs are stored as flat files, searching through them is slow and inefficient. This scenario is all too familiar for many developers.

Enter Parseable, a powerful solution to analyze your application logs. By integrating with pandas, the renowned Python library for data analysis, Parseable offers a seamless way to ingest and leverage historical data without the need to discard valuable logs.

In this post, we explore how Parseable can revolutionize your data management strategy, enabling you to unlock actionable insights from both current and archived log data effortlessly.

Requirements

  • Python installed on your system
  • Pandas library
  • Requests library
  • A CSV file to be ingested
  • Access to the Parseable API
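If the Pandas and Requests libraries aren't installed yet, both are available from PyPI:

pip install pandas requests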

The CSV File

Our code example is based on a Kaggle dataset; we've used a CSV file named e-shop_clothing_2008.csv. Feel free to use your own dataset to follow along. First, ensure your CSV file is formatted correctly and accessible from the script's directory; a quick way to check is shown below.
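To sanity-check the delimiter and column layout before ingesting, you can preview a few rows. A minimal sketch, assuming the file sits in the script's directory (this dataset uses ';' as its separator; swap in your own path and delimiter):

import pandas as pd

# Preview the first few rows to confirm the delimiter and columns
preview = pd.read_csv('e-shop_clothing_2008.csv', delimiter=';', nrows=5)
print(preview)
print(preview.dtypes)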

The Parseable API

Next, we'll interact with the Parseable API to send our data. Here we're using the demo Parseable instance. Before sending any data, please ensure you have entered the correct endpoint and credentials:

  • Endpoint: https://demo.technocube.in/api/v1/ingest
  • Username: admin
  • Password: admin
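Before running the full ingest, it can help to confirm that the endpoint is reachable and the credentials are accepted. A minimal sketch, assuming the instance exposes Parseable's stream-listing endpoint (GET /api/v1/logstream):

import requests

# Assumed sanity check: list existing streams to verify connectivity
# and credentials against the demo instance.
base_url = 'https://demo.technocube.in/api/v1'
response = requests.get(f'{base_url}/logstream', auth=('admin', 'admin'))
print(response.status_code)  # 200 means the credentials were accepted
print(response.json())       # streams that already exist on the instance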

Writing the Script

Here’s a Python script that reads the CSV file in chunks and sends each chunk to the Parseable API (in a stream called testclickstream). Replace the CSV file path, Parseable endpoint, and authentication credentials with your own.

import pandas as pd
import requests
import json

# Define the CSV file path
csv_file_path = 'e-shop_clothing_2008.csv'

# Define the Parseable API endpoint
parseable_endpoint = 'https://demo.technocube.in/api/v1/ingest'

# Basic authentication credentials
username = 'admin'
password = 'admin'

headers = {
    'Content-Type': 'application/json',
    'X-P-Stream': 'testclickstream'
}

# Read and process the CSV file in chunks
chunk_size = 100  # Number of rows per chunk
for chunk in pd.read_csv(csv_file_path, chunksize=chunk_size, delimiter=';'):
    # Convert the chunk DataFrame to a list of dictionaries
    json_data = chunk.to_dict(orient='records')

    # Convert list of dictionaries to JSON string
    json_str = json.dumps(json_data)

    # Send the JSON data to Parseable
    response = requests.post(parseable_endpoint, auth=(username, password), headers=headers, data=json_str)

    # Check the response
    if response.status_code == 200:
        print('Chunk sent successfully!')
    else:
        print(f'Failed to send chunk. Status code: {response.status_code}')
        print(response.text)

Explanation of the Script

We've divided the code's flow into six steps to help you understand what it does and how the Pandas library and Parseable work together.

Importing Libraries
The script starts by importing Pandas for data manipulation, Requests for HTTP requests, and JSON for handling JSON data.

Defining File Path and Endpoint
Specify the path to the CSV file and the Parseable API endpoint. Replace these with your actual file path and API endpoint.

Authentication and Headers
Set up basic authentication credentials and headers. The X-P-Stream header indicates the stream or collection name.

Reading CSV in Chunks
Use pd.read_csv with the chunksize parameter to read the CSV file in 100-row chunks, so large files are processed without running into memory issues.

Converting Data to JSON
Convert each chunk to a list of dictionaries using to_dict with orient='records', then to a JSON string.
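
To make the conversion concrete, here's a tiny illustration with made-up rows (the column names are hypothetical, not the Kaggle dataset's):

import json
import pandas as pd

# Illustrative DataFrame with hypothetical columns
df = pd.DataFrame({'session': [1, 2], 'country': ['PL', 'DE']})

records = df.to_dict(orient='records')
print(records)              # [{'session': 1, 'country': 'PL'}, {'session': 2, 'country': 'DE'}]
print(json.dumps(records))  # the JSON array string that gets POSTed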

Sending Data to Parseable
Send the JSON data to the Parseable API using a POST request. Check the response status code to ensure successful ingestion. Print any errors.

Handling Errors and Retries

In real-world scenarios, network issues or server errors can interrupt ingestion. To make the script more robust, add error handling and retries so transient failures don't silently drop chunks.
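One way to do this is with the Requests library's built-in retry support. A sketch with illustrative retry counts and backoff, not Parseable recommendations (allowed_methods requires urllib3 1.26 or newer; the payload here is a placeholder):

import json
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

parseable_endpoint = 'https://demo.technocube.in/api/v1/ingest'
headers = {'Content-Type': 'application/json', 'X-P-Stream': 'testclickstream'}

# Retry connection errors and 5xx responses with exponential backoff.
# POST isn't retried by default, so it must be listed explicitly.
retry = Retry(total=3, backoff_factor=1,
              status_forcelist=[500, 502, 503, 504],
              allowed_methods=['POST'])
session = requests.Session()
session.mount('https://', HTTPAdapter(max_retries=retry))

payload = json.dumps([{'example': 'event'}])  # placeholder payload
response = session.post(parseable_endpoint, auth=('admin', 'admin'),
                        headers=headers, data=payload)
response.raise_for_status()  # raise if ingestion still failed after retries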

Next Steps

Ingesting data into Parseable using Pandas is straightforward and efficient. By reading data in chunks and converting it to JSON, we can seamlessly send it to the Parseable API.

This script serves as a foundation and is customizable to your specific needs, including sophisticated error handling, logging, or parallel processing.
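
For instance, here's a sketch of parallel ingestion using Python's concurrent.futures; the worker count and the send_chunk helper are illustrative choices, not part of Parseable's API:

import json
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import requests

def send_chunk(chunk):
    # Illustrative helper: POST one chunk and return the status code.
    return requests.post(
        'https://demo.technocube.in/api/v1/ingest',
        auth=('admin', 'admin'),
        headers={'Content-Type': 'application/json',
                 'X-P-Stream': 'testclickstream'},
        data=json.dumps(chunk.to_dict(orient='records')),
    ).status_code

chunks = pd.read_csv('e-shop_clothing_2008.csv', chunksize=100, delimiter=';')
with ThreadPoolExecutor(max_workers=4) as pool:
    for status in pool.map(send_chunk, chunks):
        print('Chunk sent successfully!' if status == 200
              else f'Failed to send chunk. Status code: {status}')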

Follow this guide to integrate Pandas and Parseable effectively, ensuring smooth and efficient data ingestion for your projects.

To get started with Parseable, or just to try it out, visit our demo page.
