Predicting Tomorrow's Tremors: A Machine Learning Approach to Earthquake Nowcasting in California

Earthquakes are a constant, terrifying reality, especially in tectonically active zones like California. While pinpointing the exact time and location of a future quake remains one of science's grand challenges, the concept of earthquake nowcasting offers a pragmatic alternative: assessing the current probability of a significant event happening within a near-term window.

This article walks through the entire journey of building and deploying a machine learning model designed to nowcast the likelihood of Magnitude 6.0+ earthquakes in California within a 30-day horizon. I'll cover everything from robust data acquisition to feature engineering, model training, and the practicalities of deployment.

1. The Bedrock: Data Acquisition

Every data-driven project starts with data. For me, this meant building a comprehensive historical catalog of seismic events in California.

I leveraged the ObsPy library to interact with the USGS FDSN client. My goal was to gather every earthquake of Magnitude 2.0 or greater (M2+) within the study region (32.0°N to 42.0°N latitude, 125.0°W to 114.0°W longitude) from 1990 to the present day.

One of the initial hurdles was dealing with potential API request limits when trying to fetch decades of data at once. To overcome this, I implemented a robust chunking mechanism. Instead of one massive request, I'd iteratively fetch data in smaller time windows (starting with years, then recursively breaking down into months, weeks, or even days if a chunk proved too large). This ensured I could reliably acquire the entire historical catalog without hitting service caps. The collected data was then saved locally as a CSV for efficient reuse.
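Here's a minimal sketch of that chunked fetch using ObsPy's FDSN client; the function name and the simple halving-on-failure policy are illustrative rather than the project's exact code:

Python

from obspy import UTCDateTime
from obspy.clients.fdsn import Client

client = Client("USGS")

def fetch_chunk(start, end, min_span_days=1):
    """Fetch M2+ California events for [start, end), halving the window
    whenever a request fails (e.g., because it is too large)."""
    try:
        return client.get_events(
            starttime=start, endtime=end,
            minmagnitude=2.0,
            minlatitude=32.0, maxlatitude=42.0,
            minlongitude=-125.0, maxlongitude=-114.0,
        )
    except Exception:
        # A real implementation would distinguish "chunk too large" from
        # other failures (e.g., no data); this sketch just halves and retries.
        if (end - start) / 86400.0 <= min_span_days:  # difference is in seconds
            raise
        mid = start + (end - start) / 2.0
        return fetch_chunk(start, mid) + fetch_chunk(mid, end)

catalog = fetch_chunk(UTCDateTime(1990, 1, 1), UTCDateTime())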

2. Sculpting Signals: Feature Engineering

Raw earthquake event lists aren't directly useful for machine learning. The magic happens in feature engineering – transforming this raw data into meaningful numerical representations that the model can learn from.

For each sliding time window, I calculated a rich set of features:

Regional Features (Across the Entire California Study Area):

These capture the overall seismic state of the broader region:

  • Seismicity Rate: The total number of events within the window.

  • b-value: A critical seismological parameter: the slope of the Gutenberg-Richter frequency-magnitude relation, describing the relative proportion of small to large earthquakes. A decrease in b-value can sometimes precede larger events, suggesting increased stress (a minimal estimator sketch follows this list).

  • Magnitude Statistics: Mean, standard deviation, and maximum magnitude.

  • Inter-Event Time Statistics: Mean and coefficient of variation of the time between successive earthquakes. Irregularity (high CV) might be a signal.

  • Depth Statistics: Mean and standard deviation of earthquake depths.
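Because the b-value recurs in both the regional and the per-cell features below, here's a minimal sketch of the standard maximum-likelihood estimator (Aki, 1965); the completeness magnitude (M2.0, matching the catalog cutoff) and the 0.1-unit binning correction are assumptions:

Python

import numpy as np

def b_value(mags, mag_complete=2.0, bin_width=0.1):
    """Aki (1965) maximum-likelihood b-value with a binning correction."""
    mags = np.asarray(mags, dtype=float)
    mags = mags[mags >= mag_complete]
    if mags.size < 2:
        return np.nan  # too few events for a stable estimate
    return np.log10(np.e) / (mags.mean() - (mag_complete - bin_width / 2.0))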

Spatial Features (Per Grid Cell):

To capture localized patterns, I divided the California region into a grid of 0.5-degree by 0.5-degree cells. For each cell containing at least 3 events within the window (the MIN_EVENTS_PER_CELL threshold, which keeps the per-cell statistics stable), I calculated the following; a minimal sketch follows this list:

  • Local Seismicity Rate

  • Local b-value

  • Local Mean Magnitude
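Here's the per-cell sketch referenced above, reusing the b_value helper; the column names and the feature-naming scheme are illustrative assumptions:

Python

import numpy as np

MIN_EVENTS_PER_CELL = 3
CELL_SIZE = 0.5  # degrees

def cell_features(window):
    """Per-cell features for one time window; `window` is a DataFrame
    with 'lat', 'lon', and 'mag' columns."""
    feats = {}
    groups = window.groupby(
        [np.floor(window["lat"] / CELL_SIZE), np.floor(window["lon"] / CELL_SIZE)]
    )
    for (row, col), events in groups:
        if len(events) < MIN_EVENTS_PER_CELL:
            continue  # skip cells too sparse for stable statistics
        key = f"cell_{int(row)}_{int(col)}"
        feats[f"{key}_rate"] = len(events)
        feats[f"{key}_b_value"] = b_value(events["mag"])
        feats[f"{key}_mean_mag"] = events["mag"].mean()
    return feats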

I used a 90-day sliding window to compute these features, advancing the window by 7 days for each new sample. The target label was binary: 1 if a Magnitude 6.0+ earthquake occurred within the 30-day prediction horizon immediately following the feature window, and 0 otherwise. The whole windowing and labeling loop condenses to the sketch below.
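Here, df is assumed to be the catalog as a pandas DataFrame with 'time' (datetime64), 'mag', 'lat', 'lon', and 'depth' columns, and regional_features is a pared-down stand-in for the full regional feature set described earlier:

Python

import pandas as pd

WINDOW_DAYS, STEP_DAYS, HORIZON_DAYS = 90, 7, 30

def regional_features(window):
    """A pared-down version of the regional features described above."""
    iet = window["time"].sort_values().diff().dt.total_seconds().dropna()
    return {
        "event_count": len(window),
        "b_value": b_value(window["mag"]),  # from the earlier sketch
        "mag_mean": window["mag"].mean(),
        "mag_std": window["mag"].std(),
        "mag_max": window["mag"].max(),
        "iet_mean": iet.mean(),
        "iet_cv": iet.std() / iet.mean() if len(iet) > 1 else 0.0,
        "depth_mean": window["depth"].mean(),
        "depth_std": window["depth"].std(),
    }

def build_samples(df, target_mag=6.0):
    rows, start = [], df["time"].min()
    last = df["time"].max() - pd.Timedelta(days=WINDOW_DAYS + HORIZON_DAYS)
    while start <= last:
        w_end = start + pd.Timedelta(days=WINDOW_DAYS)
        h_end = w_end + pd.Timedelta(days=HORIZON_DAYS)
        window = df[(df["time"] >= start) & (df["time"] < w_end)]
        horizon = df[(df["time"] >= w_end) & (df["time"] < h_end)]
        feats = regional_features(window)
        feats.update(cell_features(window))  # per-cell features, sketched above
        # Label: did an M6.0+ event occur in the 30-day horizon?
        feats["label"] = int((horizon["mag"] >= target_mag).any())
        rows.append(feats)
        start += pd.Timedelta(days=STEP_DAYS)
    return pd.DataFrame(rows).fillna(0)  # inactive cells become zeros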

3. The Brain: Model Training & Evaluation

With our features ready, it was time to train the predictive brain of our system.

I chose XGBoost Classifier as the core model. It's a powerful gradient boosting framework known for its performance and ability to handle complex, tabular datasets.

Tackling Class Imbalance

Earthquake nowcasting suffers from extreme class imbalance: periods without a large earthquake vastly outnumber periods preceding one. To address this, I employed two strategies (sketched after this list):

  1. Stratified Splitting: When splitting data into training and test sets, I used stratify=y to ensure both sets maintained the original proportion of large-earthquake windows.

  2. SMOTE (Synthetic Minority Over-sampling Technique): Applied to the training data only, SMOTE generated synthetic samples of the minority class (large-earthquake windows), balancing the dataset the model learns from. I dynamically adjusted SMOTE's k_neighbors parameter so it always had enough real minority samples to work with.
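A minimal sketch of both steps, assuming X holds the window features and y the binary labels from the feature engineering stage:

Python

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Stratified split preserves the rare-positive proportion in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# SMOTE requires k_neighbors < number of real minority samples, so cap it.
n_minority = Counter(y_train)[1]
smote = SMOTE(k_neighbors=min(5, max(1, n_minority - 1)), random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)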

Hyperparameter Tuning & Evaluation

I performed hyperparameter tuning with GridSearchCV, optimizing for F1-score. The F1-score is particularly valuable on imbalanced datasets because it balances precision (minimizing false positives) against recall (minimizing false negatives). A sketch of the search follows.
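Here's an illustrative version of the search; the parameter grid shown is an assumption, not the project's actual grid:

Python

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    XGBClassifier(eval_metric="logloss", random_state=42),
    param_grid,
    scoring="f1",   # optimize the precision/recall balance
    cv=5,
    n_jobs=-1,
)
search.fit(X_train_bal, y_train_bal)
best_model = search.best_estimator_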

After training, the model's performance was rigorously evaluated on the untouched test set. I examined the following (condensed into the snippet after this list):

  1. Classification Report: Providing precision, recall, and F1-score for both classes.

  2. ROC-AUC Score: A measure of the model's ability to distinguish between classes across all possible thresholds.

  3. Confusion Matrix: A visual breakdown of true positives, true negatives, false positives, and false negatives.
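These checks condense to a few lines against the held-out test set:

Python

from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]  # probability of class 1

print(classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_proba))
print(confusion_matrix(y_test, y_pred))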

Here's an example of a Confusion Matrix from a training run:

[Figure: Confusion Matrix from a training run]

I also analyzed the Precision-Recall Curve, which is often more informative than ROC for imbalanced datasets:

[Figure: Precision-Recall Curve]

Finally, I looked at Feature Importance to understand which seismic indicators the XGBoost model deemed most influential. Features related to regional seismicity rate, standard deviation of depth, coefficient of variation of inter-event time, and localized spatial b-values often topped the list.

The Optimal Threshold

Crucially, I didn't just rely on the model's default 0.5 probability threshold. I analyzed precision, recall, and F1-score across a range of thresholds and identified the optimal F1 threshold as 0.3593, the best balance between catching large earthquakes and avoiding excessive false alarms for this particular model. The threshold search is sketched below.
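The search itself is short with scikit-learn's precision-recall utilities, reusing y_proba from the evaluation step:

Python

import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
# F1 at every candidate threshold; the final (precision, recall) point
# has no associated threshold, hence the [:-1] slice.
f1_scores = 2 * precision * recall / (precision + recall + 1e-12)
optimal_threshold = thresholds[np.argmax(f1_scores[:-1])]
print(f"Optimal F1 threshold: {optimal_threshold:.4f}")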

4. Bringing it Live: Model Deployment

Training a great model is one thing; making it useful in a real-world setting is another. This required careful deployment steps.

Model Serialization

After training, the best_model (the optimized XGBoost classifier) was saved to disk using joblib. But there's a critical detail: XGBoost models are sensitive to the order of input features. So, alongside the model, I also saved the exact ordered list of feature column names the model was trained on. This ensures that when the model is loaded later for prediction, the incoming data is always presented in the same sequence.

Python

import joblib
# ... after model training ...
joblib.dump(best_model, "earthquake_prediction_model.joblib")
joblib.dump(X.columns.tolist(), "model_feature_columns.joblib")

The Prediction Script (predict_earthquake.py)

A separate, lightweight Python script (predict_earthquake.py) was created specifically for making live predictions. This script is designed to run independently, without needing to retrain the model. Its core functions are:

Load Assets: It loads the saved earthquake_prediction_model.joblib and the model_feature_columns.joblib list.

Fetch Latest Data: It connects to the USGS FDSN client to fetch only the most recent earthquake data required for the current 90-day feature window (ending at the current time).

Consistent Feature Engineering: It applies the exact same feature engineering logic as the training pipeline to this latest data. This consistency is paramount. It also handles cases where certain grid cells might be inactive in the current window by filling their features with zeros, matching how the training data was prepared.

Predict: The engineered features are passed to the loaded model, which outputs a probability score.

Threshold & Alert: The pre-determined optimal threshold of 0.3593 is applied; if the predicted probability exceeds it, an alert is triggered. These steps condense to the sketch below.
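Condensed, the prediction path looks roughly like this; live_features stands in for the one-row DataFrame produced by the shared feature-engineering step for the current 90-day window:

Python

import joblib

# Load the serialized model and the training-time column order.
model = joblib.load("earthquake_prediction_model.joblib")
feature_columns = joblib.load("model_feature_columns.joblib")

# Align live features to the training column order; grid cells that are
# inactive in the current window come back as zero-filled columns.
X_live = live_features.reindex(columns=feature_columns, fill_value=0)

prediction_proba = model.predict_proba(X_live)[0, 1]
prediction_label = int(prediction_proba >= 0.3593)  # tuned threshold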

5. Staying Vigilant: Automation & Alerting

A prediction system is only useful if it runs consistently and communicates its findings effectively.

Automated Scheduling
To ensure continuous nowcasting, predict_earthquake.py was automated to run at regular intervals (e.g., daily). This was set up using:

cron jobs for Linux/macOS environments, which allow scheduling commands to run at specific times (a sample crontab entry follows this list).

Task Scheduler for Windows, providing a graphical interface for similar functionality.
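For the cron route, a single crontab entry is enough; the daily 06:00 schedule and the script path below are placeholders:

Cron

# Run the nowcast daily at 06:00 (adjust the interpreter and path).
0 6 * * * /usr/bin/python3 /path/to/predict_earthquake.py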

Robust Alerting & Logging
Beyond simple console output, the system was enhanced for practical deployment:

Dedicated Log File: All prediction-cycle information – data fetches, warnings (like NaN values being filled), and final predictions – is written to a dedicated log file (earthquake_prediction.log). This is invaluable for monitoring the system's health and for troubleshooting; the one-time logging setup is sketched after the alert snippet below.

Email Notifications: Crucially, if a "Large Quake" prediction is made (probability >= 0.3593), the script is configured to send an immediate email alert. This ensures that relevant stakeholders are notified without needing to constantly monitor logs.

Python

# Conceptual snippet for email alert in predict_earthquake.py
import smtplib
from email.mime.text import MIMEText
import logging # Already configured earlier

# ... email config variables ...

def send_email_alert(subject, body):
    try:
        msg = MIMEText(body)
        msg["Subject"] = subject
        msg["From"] = EMAIL_SENDER
        msg["To"] = EMAIL_RECEIVER
        with smtplib.SMTP(SMTP_SERVER, SMTP_PORT) as server:
            server.starttls()
            server.login(EMAIL_SENDER, EMAIL_PASSWORD)
            server.send_message(msg)
        logging.info(f"Email alert sent successfully to {EMAIL_RECEIVER}")
    except Exception as e:
        logging.error(f"Failed to send email alert: {e}")

# ... later in make_prediction() ...
if prediction_label == 1:
    alert_message = (f"ALERT! A large earthquake (M{TARGET_MAGNITUDE}+) is predicted "
                     f"in California within the next {PREDICTION_HORIZON_DAYS} days.\n"
                     f"Probability: {prediction_proba:.4f}")
    logging.warning(f"Prediction: {alert_message}")
    send_email_alert("Earthquake Prediction ALERT!", alert_message)
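The "# Already configured earlier" comment in the snippet above refers to a one-time logging setup near the top of predict_earthquake.py; here's a minimal sketch using the log file name mentioned earlier:

Python

import logging

# One-time setup: append every prediction cycle's output to the log file.
logging.basicConfig(
    filename="earthquake_prediction.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)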

Conclusion

This project successfully establishes a complete, automated pipeline for earthquake nowcasting in California using machine learning. From meticulously gathering and engineering seismic data to training a robust XGBoost model and deploying it with automated scheduling and alerting, the system represents a significant step towards leveraging data science for natural hazard preparedness.

While the inherent complexities of earthquake prediction mean no model is perfect, this system provides a valuable, data-driven assessment of current seismic risk. The journey highlights the importance of not just model accuracy, but also the practical considerations of data handling, feature consistency, and operational deployment in building real-world ML solutions.

The next steps involve continuous monitoring of the system's performance, periodic retraining with updated data to ensure relevance, and potentially exploring more advanced validation techniques like time-series cross-validation for even greater robustness.

Resources

Full Project Code (GitHub): https://github.com/oye-bobs/siesmicanalyzer

ObsPy Library: https://docs.obspy.org/

USGS FDSN Web Service: https://earthquake.usgs.gov/ws/

Scikit-learn Documentation: https://scikit-learn.org/

XGBoost Documentation: https://xgboost.readthedocs.io/
