Missing Data Analysis Lab is a Flask-served Python application for studying missing-value behavior, comparing imputation strategies, training regression models, and surfacing experiment results through a lightweight dashboard. This page walks through the project's architecture, API endpoints, and machine learning pipeline.
Team Members
This Project Was Developed By
We would like to express our sincere gratitude to @chanda_rajkumar for their valuable guidance and support throughout this project. Their insights into system design, architecture, and development played a key role in shaping Missing Data Analysis Lab.
Why this project combines analysis, modeling, and delivery in one app
This project is not just a notebook or a model-training script. It is organized as a complete experiment workflow where dataset upload, missingness diagnostics, model benchmarking, optimization, result persistence, and dashboard delivery all belong to the same application boundary. That matters because model quality is only one part of the user experience. The project also needs reproducible datasets, clear visual outputs, explainable metrics, and an interface that keeps the analysis readable.
That is why the application is centered around a Flask runtime that serves both the API layer and the frontend entry page. The browser loads a first-party dashboard from the frontend folder, while the backend coordinates experiment execution, saved artifacts, authentication flow, and optional MongoDB persistence. Instead of scattering the workflow across unrelated tools, the project keeps the analysis pipeline and the user-facing dashboard closely aligned.
This design is especially useful for research-style ML work. Missing-data analysis usually generates many outputs: summaries, plots, tuned model metrics, prediction tables, and explanatory text. A single integrated app makes those outputs easier to reproduce, compare, and present. It also creates a cleaner path from raw CSV input to interpretable model evaluation.
Our application architecture
The current stack is organized around a Flask backend in api/flask_main.py, a reusable experiment pipeline in src/pipeline.py, static dashboard files in frontend/, generated artifacts in results/, and optional MongoDB-backed storage for experiment and dataset history. The application can work with uploaded CSV files or synthesize benchmark datasets when no file is supplied. From there it analyzes missingness, applies imputation strategies, trains models, optionally runs Optuna-based Bayesian optimization, and returns structured results to the dashboard.
{
"runtime": "Flask-served Python web application",
"ui_entry": "frontend/index.html",
"frontend_assets": [
"frontend/styles.css",
"frontend/app.js",
"frontend/auth.css",
"frontend/auth.js"
],
"core_modules": [
"api/flask_main.py",
"api/auth_routes.py",
"src/pipeline.py",
"src/missing_analysis.py",
"src/imputation.py",
"src/models.py",
"src/optimization.py",
"src/mongo_store.py"
],
"supported_models": [
"linear_regression",
"ridge",
"lasso",
"random_forest"
],
"imputation_methods": [
"drop_rows",
"mean",
"median",
"iterative"
],
"storage_mode": "in-memory fallback with optional MongoDB persistence"
}
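The architecture summary above notes that the backend can synthesize a benchmark dataset when no CSV is uploaded. The sketch below shows one minimal way such a dataset with controlled missingness could be generated; the helper name, column names, and the 15% missing rate are illustrative assumptions, not the project's exact implementation.

```python
# Minimal sketch: synthesize a regression dataset with controlled missingness.
# The helper name, column names, and 15% missing rate are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression


def make_benchmark_dataset(n_rows: int = 500, n_features: int = 8,
                           missing_rate: float = 0.15, seed: int = 42) -> pd.DataFrame:
    """Return a DataFrame with a numeric target and randomly masked feature cells."""
    rng = np.random.default_rng(seed)
    X, y = make_regression(n_samples=n_rows, n_features=n_features,
                           noise=10.0, random_state=seed)
    feature_cols = [f"feature_{i}" for i in range(n_features)]
    df = pd.DataFrame(X, columns=feature_cols)
    df["target"] = y

    # Mask a fraction of the feature cells (never the target) to simulate
    # missing-completely-at-random behavior.
    mask = rng.random((n_rows, n_features)) < missing_rate
    df[feature_cols] = df[feature_cols].mask(mask)
    return df
```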
Pipeline workflow and ML methodology
The heart of the project lives in the experiment pipeline. The workflow starts by either loading an uploaded CSV or generating a synthetic regression dataset with controlled missingness. The system then detects the target column, computes missing-value analysis, creates missingness plots, splits the dataset into train, validation, and test partitions, and benchmarks multiple model and imputation combinations.
For each imputation method, the project trains baseline regressors and optionally performs Bayesian optimization through Optuna. Validation metrics guide model selection so the project does not choose the best configuration only by test-set luck. Once the best run is identified, the application saves prediction outputs, comparison plots, residual diagnostics, and result summaries that the dashboard can render directly.
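To make the optional Optuna step more concrete, here is a minimal sketch of how a Bayesian search over random-forest hyperparameters could be scored on the validation split; the search space bounds, trial count, and function name are assumptions rather than the project's actual tuning code.

```python
# Sketch of Optuna-based Bayesian tuning scored on the validation split.
# Search space bounds, trial count, and the function name are assumptions.
import numpy as np
import optuna
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error


def tune_random_forest(X_train, y_train, X_val, y_val, n_trials: int = 30):
    def objective(trial):
        model = RandomForestRegressor(
            n_estimators=trial.suggest_int("n_estimators", 50, 400),
            max_depth=trial.suggest_int("max_depth", 3, 20),
            min_samples_leaf=trial.suggest_int("min_samples_leaf", 1, 10),
            random_state=0,
        )
        model.fit(X_train, y_train)
        # Validation RMSE drives the search; the test set stays untouched.
        return float(np.sqrt(mean_squared_error(y_val, model.predict(X_val))))

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params, study.best_value
```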
{
"dataset_sources": [
"uploaded CSV",
"synthetic regression dataset"
],
"analysis_steps": [
"missingness summary",
"missingness plots",
"train/validation/test split",
"imputation comparison",
"baseline model training",
"Optuna Bayesian optimization",
"diagnostic plot generation",
"insight and ranking generation"
],
"selection_rule": "best configuration is chosen using validation-first scoring",
"outputs": [
"results_table",
"best_run",
"predictions",
"feature_importance",
"imputation_ranking",
"artifact URLs"
]
}
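A condensed sketch of the benchmarking loop summarized above follows. It compares mean, median, and iterative imputation (the drop_rows strategy is omitted here because it changes row counts) across the four supported regressors and applies the validation-first selection rule; the split ratios and estimator defaults are assumptions, not the project's exact settings.

```python
# Condensed benchmarking loop: compare imputation methods and regressors,
# then select the winner on validation RMSE (split ratios are assumptions).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


def benchmark(X, y, seed: int = 0):
    # 60/20/20 train/validation/test split (ratios assumed for illustration).
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=seed)

    imputers = {
        "mean": SimpleImputer(strategy="mean"),
        "median": SimpleImputer(strategy="median"),
        "iterative": IterativeImputer(random_state=seed),
    }
    models = {
        "linear_regression": LinearRegression(),
        "ridge": Ridge(),
        "lasso": Lasso(),
        "random_forest": RandomForestRegressor(random_state=seed),
    }

    runs = []
    for imp_name, imputer in imputers.items():
        X_tr = imputer.fit_transform(X_train)  # fit the imputer on training data only
        X_va = imputer.transform(X_val)
        for model_name, model in models.items():
            model.fit(X_tr, y_train)
            rmse = float(np.sqrt(mean_squared_error(y_val, model.predict(X_va))))
            runs.append({"imputation": imp_name, "model": model_name, "val_rmse": rmse})

    # Validation-first selection: the best configuration is chosen on the
    # validation split; the held-out test set is reserved for final reporting.
    best = min(runs, key=lambda r: r["val_rmse"])
    return runs, best
```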
API and route surface
The Flask layer exposes the analysis workflow as practical API endpoints rather than leaving it buried in offline scripts. The root route serves the dashboard, /frontend/<filename> serves static frontend files, and /artifacts/<filename> exposes generated plots and result files. Data-oriented routes support upload, summary generation, missingness analysis, training, optimization, leaderboard views, performance trend reporting, and dataset history inspection.
The project also includes an authentication flow backed by MongoDB for signup, login, logout, and session-aware usage. That makes the app more than a throwaway experiment runner. It is structured like a shareable product with a user-facing control room and protected data actions.
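For orientation, here is a stripped-down sketch of what that route surface could look like in Flask. The root, /frontend/&lt;filename&gt;, and /artifacts/&lt;filename&gt; routes follow the description above, while the upload route path and all handler bodies are illustrative assumptions.

```python
# Stripped-down sketch of the Flask route surface. The root, /frontend/<filename>,
# and /artifacts/<filename> routes follow the text above; the upload route path
# and every handler body are illustrative assumptions.
from flask import Flask, jsonify, request, send_from_directory

app = Flask(__name__)


@app.route("/")
def dashboard():
    # Serve the first-party dashboard entry page.
    return send_from_directory("frontend", "index.html")


@app.route("/frontend/<path:filename>")
def frontend_assets(filename):
    # Static dashboard assets such as styles.css, app.js, auth.css, auth.js.
    return send_from_directory("frontend", filename)


@app.route("/artifacts/<path:filename>")
def artifacts(filename):
    # Generated plots and result files written to results/.
    return send_from_directory("results", filename)


@app.route("/api/upload", methods=["POST"])  # exact path is an assumption
def upload_dataset():
    uploaded = request.files["file"]
    # ... cache the uploaded CSV for the active session ...
    return jsonify({"status": "uploaded", "filename": uploaded.filename})


if __name__ == "__main__":
    app.run(debug=True)
```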
MongoDB and persistence strategy
MongoDB is used here as an application support layer rather than as the core modeling engine. The main experiment can still run without live database storage, but when MongoDB is available the system persists dataset metadata, experiment results, leaderboard rows, and time-series style performance history. That gives the project a bridge between local experimentation and longer-term result tracking.
This storage model fits the project because experiment outputs are document-like. One run may contain summary metrics, artifact paths, prediction arrays, insight text, and metadata about the selected model and imputation method. Another record may only contain dataset-level statistics. MongoDB is a flexible choice for storing those evolving shapes without forcing the whole project into a rigid table design too early.
{
"dataset_cache": "in-memory cache for active uploaded data",
"latest_result": "in-memory fallback for current experiment session",
"mongodb_usage": [
"dataset metadata",
"experiment results",
"leaderboard records",
"performance trend points",
"authentication users and sessions"
],
"fallback_behavior": "saved results JSON and CSV can still populate the UI when live ML runtime is unavailable"
}
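A small sketch of that "optional MongoDB with in-memory fallback" pattern is shown below; the database name, collection name, and document fields are assumptions chosen for illustration, not the project's actual schema.

```python
# Sketch of optional MongoDB persistence with an in-memory fallback.
# The database name, collection name, and document fields are assumptions.
from datetime import datetime, timezone
from typing import Optional

from pymongo import MongoClient
from pymongo.errors import PyMongoError

_memory_store = []  # in-memory fallback used when MongoDB is unavailable


def save_experiment(result: dict, mongo_uri: Optional[str] = None) -> str:
    """Persist an experiment document, falling back to memory on failure."""
    doc = {**result, "saved_at": datetime.now(timezone.utc)}
    if mongo_uri:
        try:
            client = MongoClient(mongo_uri, serverSelectionTimeoutMS=2000)
            client["missing_data_lab"]["experiments"].insert_one(doc)
            return "mongodb"
        except PyMongoError:
            pass  # fall through to the in-memory store
    _memory_store.append(doc)
    return "memory"
```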
Frontend dashboard structure
The user interface is intentionally lightweight, but it is not minimal in capability. The dashboard is divided into focused pages for overview, experiment control, insights, visuals, and results. That structure keeps the workflow approachable: the user can inspect dataset summaries, configure experiments, review generated explanations, inspect charts, and download predictions without leaving the same app.
This is a good fit for the project because missing-data experiments produce multiple categories of output. Putting everything into a single page would quickly become hard to scan. By separating control, interpretation, and visualization, the frontend makes the research workflow feel more like an analysis studio than a raw API demo.
Generated artifacts and evaluation outputs
The project writes a complete output package into the results/ directory. That includes tabular results in CSV and JSON form, missingness visualizations, performance comparison plots, prediction-vs-actual plots, residual distribution charts, residuals-vs-predicted diagnostics, and Q-Q normality plots. These artifacts are important because they turn raw metrics into interpretable evidence about how the model behaves under different missing-data treatments.
By persisting those files, the project becomes much easier to demonstrate, review, and compare across experiments. Even when the live training runtime is not available, previously saved artifacts can still power the dashboard through fallback loading behavior.
{
"results_dir_outputs": [
"latest_results.csv",
"latest_results.json",
"predictions_output.csv",
"missingness_heatmap.png",
"missingness_bar.png",
"performance_comparison.png",
"actual_vs_predicted.png",
"residual_distribution.png",
"residuals_vs_predicted.png",
"qq_plot.png"
],
"evaluation_metrics": [
"MSE",
"RMSE",
"R2",
"MAE"
]
}
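As a reference point, the sketch below computes the four evaluation metrics listed above and writes a predictions file matching the results/ listing; the helper functions themselves are illustrative, not the project's actual code.

```python
# Sketch of the evaluation metrics listed above plus one artifact writer.
# The file name follows the results/ listing; the helpers themselves are illustrative.
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def evaluate(y_true, y_pred) -> dict:
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MSE": float(mse),
        "RMSE": float(np.sqrt(mse)),
        "R2": float(r2_score(y_true, y_pred)),
        "MAE": float(mean_absolute_error(y_true, y_pred)),
    }


def save_predictions(y_true, y_pred, path: str = "results/predictions_output.csv") -> None:
    # Store actuals, predictions, and residuals for the dashboard download link.
    pd.DataFrame({
        "actual": np.asarray(y_true),
        "predicted": np.asarray(y_pred),
        "residual": np.asarray(y_true) - np.asarray(y_pred),
    }).to_csv(path, index=False)
```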
Execution
The project is designed to run locally as a Flask application. The dashboard is served on the local Flask port, and the generated results can be viewed directly through the interface once the backend is running. The implementation is also organized cleanly enough to support classroom demos, portfolio presentation, or later deployment work.
Missing Data Analysis Lab is a Flask-based machine learning application for missing-data diagnostics, imputation comparison, regression benchmarking, Bayesian optimization, optional MongoDB persistence, and dashboard-based result delivery.



