
Rimsha Malik


Meteorite Fall Prediction Using Machine Learning & Explainable AI

ABSTRACT:
Meteorites are rocky fragments from outer space and are among the most important surviving evidence about the origin of the solar system, the formation of Earth, and the processes at work in the cosmos. More than 45,000 meteorite events have been recorded in the NASA/Meteoritical Society database, with attributes such as mass, class, geographic coordinates, fall status, and recovery year. Analyzing a dataset of this size makes it possible not only to identify patterns but also to predict whether a meteorite was "Found" or "Fell" from its recorded characteristics. In this study, the data went through several preprocessing steps, including handling missing values, removing faulty entries such as invalid years or zero coordinates, encoding categorical variables, and scaling continuous attributes to obtain the best model performance. Several machine learning algorithms, including Random Forest, Neural Networks, SVM, KNN, and AdaBoost, were trained and compared. Their performance was quantified using accuracy, precision, recall, F1-score, and ROC/AUC. To make the results more understandable, explainable AI techniques such as SHAP and LIME were used alongside these accuracy measures to interpret individual predictions and to reveal the overall importance of the features. The outcomes show that, among the approaches tested, ensemble methods, especially Random Forest and AdaBoost, gave the best classification accuracy. The interpretability analysis further indicated that a meteorite's mass, its classification type, and its recovery year were the three major factors driving the prediction of fall status. This study is a step forward for astronomical data science: it opens a data-driven, interpretable path toward understanding meteoritic behavior and supports the scientific decision-making process.
1. Introduction
Meteorites are remnants of primordial bodies that travel through space and ultimately fall to our planet, carrying some of the oldest materials in the solar system. Their examination provides fundamental knowledge about planet formation, chemical evolution, and the cosmic timeline. Because many meteorites are found long after they landed, without their fall ever being witnessed, it is still very useful to distinguish meteorites that were "Found" from those that "Fell". By recognizing these trends, scientists can support future studies of meteorite origins, impact frequency, and the challenges of detection.
The advent of large-scale datasets has opened doors for astronomical research to incorporate machine learning techniques. Among them is the NASA Meteorite Landings dataset, built on the Meteoritical Society's records and one of the most complete worldwide meteorite databases; it includes attributes such as mass, geographic coordinates, classification type, and year of recovery. These features provide a strong foundation for identifying relationships between meteorite characteristics and their fall status. By applying computational methods to this dataset, researchers can investigate trends that would be difficult to observe manually in such a large volume of data.
Machine learning classification algorithms are well suited to predicting fall status because they can handle both numerical and categorical variables. Beyond the predictions themselves, these models can indicate, for example, whether heavier meteorites are more likely to be seen falling or whether certain regions report more observed falls. Such understanding deepens knowledge of meteorite detection, the environmental factors affecting it, and historical reporting practices.
Although predictive accuracy is of utmost importance, the scientific community also needs interpretability in order to validate the decision-making process. For that reason, this study uses the SHAP and LIME explanation methods, which are widely used in explainable AI. These methods give an open view of how a model works by quantifying how much each feature contributes to a prediction. The research therefore combines powerful machine learning models with interpretability tools to produce trustworthy, intelligible, and scientifically valuable insights into meteorite classification.
2. Methodology

This study follows a structured data mining workflow, including dataset understanding, preprocessing, feature selection, model training, and evaluation.

Dataset:
The NASA Meteorite Landings dataset, containing over 45,000 meteorite records, was used. It includes attributes like name, ID, mass, classification, fall status, year, latitude, longitude, and geolocation. The dataset is a mix of numerical and categorical variables and contains inconsistencies, missing values, and anomalies.

Data Preprocessing:
Data cleaning involved handling missing values, removing unrealistic or invalid entries (e.g., zero coordinates, negative masses, invalid years), encoding categorical variables, and scaling continuous features. This resulted in a cleaner dataset suitable for analysis and modeling.
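As a rough illustration, the cleaning steps described above could look like the pandas/scikit-learn sketch below. Column names such as `mass (g)`, `year`, `reclat`, `reclong`, and `fall` follow the public NASA Meteorite Landings CSV; the file path, year bounds, and other thresholds are illustrative assumptions, not the authors' exact pipeline.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Load the NASA Meteorite Landings export (path is illustrative).
df = pd.read_csv("Meteorite_Landings.csv")

# Drop rows missing any of the columns used for modelling.
df = df.dropna(subset=["mass (g)", "year", "reclat", "reclong", "fall"])

# Remove unrealistic entries: zero coordinates, non-positive masses,
# and implausible years (bounds here are just example values; some
# exports store 'year' as a date string and need converting first).
df = df[(df["reclat"] != 0) | (df["reclong"] != 0)]
df = df[df["mass (g)"] > 0]
df = df[(df["year"] >= 860) & (df["year"] <= 2016)]

# Encode the target ("Fell"/"Found") and scale the continuous features.
y = LabelEncoder().fit_transform(df["fall"])          # Fell=0, Found=1
X = StandardScaler().fit_transform(df[["mass (g)", "year", "reclat", "reclong"]])
```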

Exploratory Data Analysis (EDA):
Visualization techniques revealed patterns and distributions in the data:

Box plots showed latitude variability and outliers.

Histograms highlighted the dominance of “Found” meteorites and spatial clusters.

Bar plots revealed temporal patterns and the frequency of “Fell” meteorites.
EDA provided insights into spatial, temporal, and categorical trends.
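A minimal matplotlib sketch of the plots mentioned above might look like this, assuming the cleaned DataFrame `df` from the preprocessing step:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Box plot: latitude variability and outliers.
axes[0].boxplot(df["reclat"].dropna())
axes[0].set_title("Latitude (reclat)")

# Bar chart of fall status: dominance of "Found" records.
df["fall"].value_counts().plot(kind="bar", ax=axes[1], title="Fall status counts")

# Bar plot of "Fell" events per decade: temporal pattern.
fell = df[df["fall"] == "Fell"]
(fell["year"] // 10 * 10).value_counts().sort_index().plot(
    kind="bar", ax=axes[2], title='"Fell" events per decade')

plt.tight_layout()
plt.show()
```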

Feature Selection:
Feature importance was assessed using Information Gain, Gain Ratio, and Gini Decrease in Orange. Top features selected were reclat (latitude), mass, year, and reclong (longitude). Less informative attributes like ID, name, nametype, recclass, and geolocation were removed, simplifying the model and improving predictive accuracy.
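The scoring itself was done in Orange's Rank widget; a roughly equivalent check in scikit-learn could use mutual information (closely related to information gain) on the same candidate features. This is a sketch of that equivalent, not the Orange workflow itself:

```python
from sklearn.feature_selection import mutual_info_classif

features = ["mass (g)", "year", "reclat", "reclong"]
scores = mutual_info_classif(df[features], y, random_state=0)
for name, score in sorted(zip(features, scores), key=lambda p: -p[1]):
    print(f"{name}: {score:.3f}")
```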

Models:
Several machine learning models were applied to classify meteorites as “Fell” or “Found” (a short training sketch follows this list):

Logistic Regression – linear and interpretable.

Decision Tree – captures non-linear relationships.

Random Forest – ensemble method for stability and accuracy.

K-Nearest Neighbors – distance-based classification.

Support Vector Machine – finds optimal hyperplane for classification.

Multi-Layer Perceptron (MLP) – neural network capturing deep non-linear patterns.

AdaBoost – boosts weak learners for improved performance.
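A compact scikit-learn sketch of this model suite is shown below; the hyperparameters are illustrative defaults, not the exact settings used in the study.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Hold out a stratified test set from the preprocessed X, y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(probability=True, random_state=42),
    "MLP": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=200, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
```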

Performance Evaluation:
Models were evaluated using accuracy, precision, recall, F1-score, ROC curves, and AUC. These metrics provided a comprehensive assessment of the models’ ability to classify meteorites correctly and distinguish between “Fell” and “Found” events.
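The same metrics can be computed for each fitted model with scikit-learn's metric functions (a sketch, continuing the loop above):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

for name, model in models.items():
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    # Note: with the encoding above, class 1 is "Found"; pass pos_label=0
    # to precision/recall/F1 to score the "Fell" class instead.
    print(f"{name}: "
          f"acc={accuracy_score(y_test, y_pred):.3f} "
          f"prec={precision_score(y_test, y_pred):.3f} "
          f"rec={recall_score(y_test, y_pred):.3f} "
          f"f1={f1_score(y_test, y_pred):.3f} "
          f"auc={roc_auc_score(y_test, y_prob):.3f}")
```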
3. Results

Test and Score:
Among the six models tested, AdaBoost performed best, achieving near-perfect results (AUC 1.000, F1-score 0.999). Random Forest also excelled with high accuracy, precision, and recall. k-Nearest Neighbors (kNN) showed strong performance (AUC 0.997) with balanced predictions. Neural Networks performed moderately (AUC 0.988), while Naive Bayes had average results due to its independence assumptions. SVM performed weakest, struggling to separate the two classes.
| Model | AUC | Accuracy | F1 | Precision | Recall | MCC |
| -------------- | ----- | -------- | ----- | --------- | ------ | ----- |
| SVM | 0.660 | 0.305 | 0.058 | 0.030 | 0.775 | 0.024 |
| kNN | 0.997 | 0.989 | 0.780 | 0.827 | 0.738 | 0.776 |
| Random Forest | 1.000 | 0.999 | 0.988 | 0.993 | 0.982 | 0.987 |
| Naive Bayes | 0.966 | 0.960 | 0.482 | 0.375 | 0.674 | 0.485 |
| Neural Network | 0.988 | 0.985 | 0.694 | 0.800 | 0.612 | 0.693 |
| AdaBoost | 1.000 | 1.000 | 0.999 | 1.000 | 0.999 | 0.999 |
ROC Curve Analysis:
ROC curves confirmed the test scores: AdaBoost and Random Forest achieved the best class separation (curves near top-left corner), followed by kNN. Neural Network performed moderately, Naive Bayes showed limited performance, and SVM performed poorly (curve near diagonal).
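ROC curves like these can be drawn directly from the fitted models (a sketch assuming scikit-learn ≥ 1.0 and the `models` dictionary above):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

ax = plt.gca()
for name, model in models.items():
    RocCurveDisplay.from_estimator(model, X_test, y_test, name=name, ax=ax)
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance diagonal
plt.show()
```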

Confusion Matrix Analysis:

SVM: Many false positives, poor precision.

Decision Tree: Failed on minority class (“Fell”).

kNN: Correctly identified 539 “Fell” cases, moderate false positives.

Random Forest: Best performance, correctly classified 712 “Fell” cases with only 3 false positives.

Naive Bayes: Moderate with higher false positives.
Overall, Random Forest and kNN were most effective; SVM and Decision Tree were unsuitable for imbalanced classes.
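The per-class breakdowns above come from confusion matrices; for any fitted model they can be reproduced like this (a sketch using the models dictionary from earlier):

```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, models["Random Forest"].predict(X_test))
print(cm)  # rows = true class, columns = predicted class; index 0 = "Fell", 1 = "Found"
```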

Performance Curve Analysis:
kNN achieved the highest performance-curve score (21.246) for the “Fell” class, showing excellent ranking ability. Random Forest also scored strongly (6.449), confirming the predictive power of both models.

Model Interpretability (SHAP & LIME):
SHAP and LIME were used to explain model predictions:

SHAP: Provided a global view of feature impacts across the dataset.

LIME: Offered local explanations for individual predictions.
These tools helped understand how each feature influenced predictions, improving both global and local interpretability, especially for Random Forest.
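A minimal sketch of how SHAP and LIME could be applied to the Random Forest model is shown below; the `shap` and `lime` packages are assumed to be installed, and the feature names follow the earlier sketches rather than the authors' exact setup.

```python
import shap
from lime.lime_tabular import LimeTabularExplainer

feature_names = ["mass (g)", "year", "reclat", "reclong"]
rf = models["Random Forest"]

# SHAP: global view of feature impacts across the test set.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
# Depending on the shap version, shap_values may be a list with one array
# per class; in that case pass one class's values (e.g. shap_values[1]).
shap.summary_plot(shap_values, X_test, feature_names=feature_names)

# LIME: local explanation for a single test instance.
lime_explainer = LimeTabularExplainer(
    X_train, feature_names=feature_names,
    class_names=["Fell", "Found"], mode="classification")
explanation = lime_explainer.explain_instance(X_test[0], rf.predict_proba)
print(explanation.as_list())
```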
