data_science_project

Chronic Disease Risk Prediction using BRFSS

Overview

Modern lifestyles have reshaped many of our everyday habits, often for the worse: poor diet, increased stress, insufficient sleep and reduced exercise. These behavioural shifts, among others, have been associated with the steady rise of chronic diseases such as diabetes, cardiovascular disease, obesity and even mental health disorders.

According to the World Health Organization (WHO), non-communicable diseases (conditions not caused by infection, usually with long-term health consequences) are responsible for the majority of deaths globally. Notably, many of these cases are linked to lifestyle factors that can be modified. There is therefore a pressing need for tools that help people understand how much their daily habits contribute to the risk of a specific disease.

This project sits at the intersection of healthcare and data science: a system capable of predicting the risk of several common chronic diseases from lifestyle characteristics. The system relies on public health survey data and machine learning techniques to identify patterns between lifestyle factors and health outcomes. Finally, an interactive prototype allows users to explore the dataset and its characteristics, and to assess how different lifestyle choices influence their health risks.


Problem Statement

Chronic disease cases such as diabetes, heart disease, obesity and depression are rising worldwide. Healthcare systems typically respond with widely available medical treatments, but such treatments address the symptoms rather than the underlying causes.

The problem matters because of its potential for preventative care. Overprescribing medicine for everyday conditions is a common practice, especially in wealthy, high-GDP countries, even when a better non-medical alternative exists. This is where this project's contribution lies.

The core problem addressed in this project is the lack of accessible tools that help individuals understand the impact of their lifestyle habits on their personal health. This project develops a predictive system that estimates the probability of developing chronic diseases from lifestyle factors. The system is not intended to diagnose disease; it operates as a risk estimation tool for raising awareness and supporting preventative measures.

Research Questions

  • Which lifestyle variables (e.g., BMI, physical activity, smoking, sleep) have the strongest influence on disease risk?
  • Can machine learning models capture patterns in survey data to produce reliable estimates of disease risk?
  • Can we cluster groups of individuals with similar lifestyle and health risks?

Hypotheses

  • Individuals with poorer lifestyle habits (e.g., low physical activity, poor sleep, high BMI) are more likely to develop chronic diseases.
  • Machine learning models trained on such data can produce reliable estimates of disease risk.

Dataset (BRFSS)

This project uses the Behavioral Risk Factor Surveillance System (BRFSS) dataset, provided by the CDC.

BRFSS is a large-scale health-related survey dataset containing information about:

  • Lifestyle habits (smoking, exercise, diet)
  • Health conditions (diabetes, heart disease, etc.)
  • Demographics (age, gender, etc.)

System Architecture

The application is built around a streaming data pipeline and a REST API backend:

BRFSS .XPT File
      │
      ▼
 Kafka Producer  ──►  Kafka Topic (brfss_raw)
                              │
                    ┌─────────┴──────────┐
                    │                    │
               Spark Streaming     Kafka Consumer
               (transformation)    (raw storage)
                    │                    │
                    ▼                    ▼
           brfss_processed         brfss_raw
            (PostgreSQL)           (PostgreSQL)
                    │
                    ▼
          ML Training (Random Forest)
                    │
                    ▼
         Saved Models (.pkl files)
                    │
                    ▼
           FastAPI REST Backend
                    │
                    ▼
             Frontend (port 5173)

Key components:

  • FastAPI — REST API backend with JWT authentication
  • PostgreSQL — primary database for users, surveys, processed data, and ML models
  • Apache Kafka (via Docker) — message broker for streaming raw BRFSS records
  • Apache Spark (via Docker) — stream processor that cleans and transforms raw records
  • Random Forest classifiers — one per disease (diabetes, depression, heart disease, arthritis)
  • SHAP — explains individual predictions by showing feature contributions
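The producer stage of this pipeline can be sketched as follows. This is an illustrative sketch only: the kafka-python and pandas packages are assumed, the helper names are our own, and the real producer in backend/kafka/ may differ in detail.

```python
# Sketch of the producer stage: read BRFSS .XPT records and publish
# them to the brfss_raw topic in batches. Illustrative only; the real
# producer lives in backend/kafka/.
import json
from itertools import islice

def record_batches(records, batch_size=5000):
    """Yield successive lists of at most batch_size records."""
    it = iter(records)
    while batch := list(islice(it, batch_size)):
        yield batch

def serialize(record):
    """Encode one record as UTF-8 JSON bytes for the Kafka topic."""
    return json.dumps(record).encode("utf-8")

def stream_xpt_to_kafka(xpt_path, broker="localhost:9092",
                        topic="brfss_raw", batch_size=5000):
    # Imports deferred so the pure helpers above stay dependency-free.
    import pandas as pd
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=broker)
    df = pd.read_sas(xpt_path, format="xport")  # SAS Transport format
    for batch in record_batches(df.to_dict(orient="records"), batch_size):
        for record in batch:
            producer.send(topic, serialize(record))
        producer.flush()  # block until this batch is acknowledged
    producer.close()
```

Batching plus an explicit flush keeps memory bounded while still amortising network round-trips, which is why KAFKA_BATCH_SIZE (Step 6) is tunable.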

Prerequisites

Install all of the following before proceeding:

| Software | Version | Notes |
|---|---|---|
| Python | 3.10+ | Use a virtual environment |
| PostgreSQL | 14+ | Must be running locally on port 5432 |
| Docker Desktop | Latest | For Kafka and Spark containers |
| Java (JDK) | 11 or 17 | Required by Spark |
| Hadoop (Windows only) | 3.x | Required by Spark on Windows (see below) |

Project Structure

data_science_project/
├── backend/
│   ├── app/
│   │   ├── core/           # Config, security (JWT), dependencies
│   │   ├── db/             # Database engine and table initialisation
│   │   ├── models/         # SQLModel ORM table definitions
│   │   ├── repositories/   # Database query functions
│   │   ├── routers/        # FastAPI route handlers
│   │   ├── schemas/        # Pydantic request/response models
│   │   ├── services/       # Business logic layer
│   │   └── main.py         # FastAPI app entry point
│   ├── imputation/         # Hot-deck missing value imputation
│   ├── kafka/              # Kafka producer and consumer
│   ├── machine_learning/   # Model training, evaluation, clustering
│   ├── offline_scripts/    # One-time setup and initial ingestion scripts
│   └── spark/              # Spark streaming processor
├── docker-compose.yml      # Kafka + Spark infrastructure
├── requirements.txt        # Python dependencies
└── .env                    # Environment variables (you create this)

Step-by-Step Setup

Step 1 — Download the BRFSS Dataset

  1. Go to: https://www.cdc.gov/brfss/annual_data/annual_data.htm
  2. Select a year (e.g., 2024)
  3. Download the SAS Transport Format (.XPT) file — it will be named something like LLCP2024.XPT
  4. Place the file somewhere accessible on your machine and note the full path

Step 2 — Install Python Dependencies

It is recommended to use a virtual environment:

python -m venv venv

# Windows
venv\Scripts\activate

# macOS / Linux
source venv/bin/activate

pip install -r requirements.txt

Note: pyspark requires Java to be installed and JAVA_HOME to be set. Verify with java -version.


Step 3 — Set Up PostgreSQL

  1. Install and start PostgreSQL (default port 5432)
  2. Open the PostgreSQL shell or pgAdmin and create a new database:
CREATE DATABASE brfss;
  3. Note down your PostgreSQL username and password — you will need them for the .env file

Step 4 — Set Up Hadoop (Windows Only)

Apache Spark requires Hadoop binaries on Windows. The application expects Hadoop at C:\hadoop.

  1. Download a pre-built Hadoop binary package for Windows (e.g., from https://github.com/steveloughran/winutils/releases — choose the version matching your Spark: Spark 3.5.x uses Hadoop 3.x)
  2. Extract the archive so that C:\hadoop\bin\winutils.exe and C:\hadoop\bin\hadoop.dll exist
  3. Set the environment variable:
    HADOOP_HOME=C:\hadoop
    
    You can set this permanently in Windows: System Properties → Environment Variables → New System Variable
  4. Also add C:\hadoop\bin to your system PATH

Step 5 — Start Infrastructure with Docker

Make sure Docker Desktop is running, then from the project root:

docker-compose up -d

This starts three containers:

  • broker — Apache Kafka on port 9092
  • spark-master — Spark master on ports 8080 (UI) and 7077
  • spark-worker — Spark worker with 8 GB memory and 4 cores, connected to the master

Verify they are running:

docker ps

Step 6 — Create the .env File

Create a file named .env in the project root (same level as docker-compose.yml) with the following content — replace the placeholder values with your own:

# PostgreSQL connection
SQLALCHEMY_DATABASE_URL=postgresql+psycopg2://<user>:<password>@localhost:5432/brfss

# JWT authentication
SECRET_KEY=your_random_secret_key_here
ALGORITHM=HS256
ACCESS_TOKEN_EXPIRE_MINUTES=480

# Kafka settings
KAFKA_BROKER=localhost:9092
KAFKA_EXTERNAL_BROKER=localhost:9092
KAFKA_TOPIC=brfss_raw
KAFKA_BATCH_SIZE=5000
XPT_FILE_PATH=C:/path/to/your/LLCP2024.XPT

# Spark → PostgreSQL (used by Spark JDBC writer)
POSTGRES_JDBC_URL=jdbc:postgresql://localhost:5432/brfss
POSTGRES_USER=<user>
POSTGRES_PASSWORD=<password>

Tips:

  • SECRET_KEY should be a long, random string. Generate one with: python -c "import secrets; print(secrets.token_hex(32))"
  • XPT_FILE_PATH must use forward slashes or escaped backslashes even on Windows
  • KAFKA_BATCH_SIZE=5000 is a good default; increase for faster ingestion on powerful machines

Step 7 — Initialise the Database Tables

Run these two scripts once to create the raw and processed BRFSS tables in PostgreSQL:

# From the project root:

# 1. Create the brfss_raw table (schema is inferred from the XPT file)
python -m backend.offline_scripts.init_raw_brfss_table

# 2. Create the brfss_processed table
python -m backend.spark.init_processed_brfss_table

The application's FastAPI startup also auto-creates the account, mlmodel, predictions, and survey tables via SQLModel on first run, so you do not need to create those manually.


Step 8 — Initial Data Ingestion

This one-time script loads the BRFSS .XPT file, cleans and transforms all records, applies hot-deck imputation for missing values, and writes the result to the brfss_processed table:

python -m backend.offline_scripts.initial_dataset_ingestion

This may take several minutes depending on the size of the dataset (the 2024 file contains roughly 400,000 records). You will see progress output including missing value counts before and after imputation.


Step 9 — Train the Machine Learning Models

Once the brfss_processed table has data, train the classifiers:

python -m backend.machine_learning.trainer

This trains one Random Forest classifier per disease target:

  • diabetes
  • depression
  • heart_disease
  • arthritis

Each model is evaluated with ROC AUC, PR AUC, and Brier score. Models are saved as .pkl files inside backend/data/saved_models/ and registered in the mlmodel database table. The best-performing model per disease is automatically marked as active.
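The three reported metrics can be reproduced with scikit-learn. The snippet below is a sketch on toy labels, not the project's evaluation code in backend/machine_learning/.

```python
# Sketch of the three evaluation metrics on toy data; the project's
# actual evaluation lives in backend/machine_learning/.
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             roc_auc_score)

y_true = [0, 0, 1, 1, 0, 1]                # true disease labels
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # predicted probabilities

roc_auc = roc_auc_score(y_true, y_prob)           # ranking quality
pr_auc = average_precision_score(y_true, y_prob)  # PR AUC (average precision)
brier = brier_score_loss(y_true, y_prob)          # calibration: mean squared
                                                  # error of the probabilities
print(f"ROC AUC={roc_auc:.3f}  PR AUC={pr_auc:.3f}  Brier={brier:.3f}")
```

PR AUC is the most informative of the three when positives are rare, which is typical for disease labels in survey data; the Brier score checks that the probabilities themselves, not just their ranking, are trustworthy.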


Step 10 — Run the Backend Server

From the project root:

uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000

The API will be available at:

  • Base URL: http://localhost:8000
  • Interactive docs (Swagger UI): http://localhost:8000/docs
  • Alternative docs (ReDoc): http://localhost:8000/redoc

On startup the application automatically creates any missing database tables.


API Reference

All endpoints that require authentication expect a Bearer token in the Authorization header:

Authorization: Bearer <access_token>

Authentication

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| POST | /account/ | No | Register a new user account |
| POST | /account/login | No | Log in and receive a JWT token |

Register request body:

{
  "email": "user@example.com",
  "password": "yourpassword"
}

Login request body:

{
  "email": "user@example.com",
  "password": "yourpassword"
}

Login response:

{
  "access_token": "eyJ...",
  "token_type": "bearer"
}
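The register/login flow can be sketched with the requests library. The helper names below are our own; the endpoint paths and payload fields follow the table above.

```python
# Sketch of the login flow against the running backend (Step 10).
# Helper names are illustrative; paths follow the API tables above.
import requests

BASE = "http://localhost:8000"

def auth_header(token: str) -> dict:
    """Build the Authorization header expected by protected endpoints."""
    return {"Authorization": f"Bearer {token}"}

def login(email: str, password: str) -> str:
    """POST /account/login and return the JWT access token."""
    resp = requests.post(f"{BASE}/account/login",
                         json={"email": email, "password": password})
    resp.raise_for_status()
    return resp.json()["access_token"]

# Usage (requires the backend to be running):
# token = login("user@example.com", "yourpassword")
# surveys = requests.get(f"{BASE}/survey/", headers=auth_header(token)).json()
```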

Disease Risk Prediction

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| POST | /predict/ | Yes (user) | Submit lifestyle data and receive disease risk scores |

Request body:

{
  "sex": "male",
  "age_group": "35-39",
  "income": "$50,000 to $100,000",
  "state": "California",
  "education": "College graduate",
  "employment": "Employed",
  "smoked_100_cigarettes": false,
  "alcohol_consumption": true,
  "exercise": true,
  "mental_health_days": 3,
  "ecigarette_use": "Never used",
  "physical_health_days": 2,
  "weight_kg": 80.0,
  "height_m": 1.80
}

Valid values for enumerated fields:

| Field | Valid values |
|---|---|
| sex | "male", "female" |
| age_group | "18-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-64", "65-69", "70-74", "75-79", "80+" |
| income | "Less than $15,000", "$15,000 to $25,000", "$25,000 to $35,000", "$35,000 to $50,000", "$50,000 to $100,000", "$100,000 to $200,000", "$200,000 or more" |
| education | "Never attended school", "Elementary", "Some high school", "High school graduate", "Some college", "College graduate" |
| employment | "Employed", "Unemployed", "Student", "Retired", "Unable to work" |
| ecigarette_use | "Never used", "Every day", "Some days", "Used in past only" |
| state | Any U.S. state, "District of Columbia", "Guam", "Puerto Rico", "Virgin Islands" |
| mental_health_days | Integer 0–30 |
| physical_health_days | Integer 0–30 |
| weight_kg | Float 23.0–295.0 |
| height_m | Float 0.91–2.44 |

Response:

{
  "predictions": [
    {
      "model_id": 1,
      "disease": "diabetes",
      "prediction": 0.12,
      "shap_values": {
        "sex": -0.003,
        "age_group": 0.021,
        "income": -0.015,
        "weight_kg": 0.045,
        ...
      }
    },
    ...
  ]
}

The prediction is a probability (0.0–1.0) of having that disease. The shap_values show how much each feature pushed the prediction up or down.
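SHAP explanations are additive: the per-feature contributions, plus a base value (the model's average output over its training data), sum to approximately the returned probability. A small illustration with made-up numbers:

```python
# Illustration of SHAP additivity with made-up numbers; real values
# come from the /predict/ response.
shap_values = {"sex": -0.003, "age_group": 0.021,
               "income": -0.015, "weight_kg": 0.045}
base_value = 0.064  # hypothetical average model output

# Base value plus all contributions recovers the predicted probability.
prediction = base_value + sum(shap_values.values())

# Features sorted by impact, keeping only risk-increasing ones.
risky = [f for f, v in sorted(shap_values.items(),
                              key=lambda kv: -abs(kv[1])) if v > 0]
print(f"predicted risk approx {prediction:.3f}; "
      f"risk-increasing features: {risky}")
```

Reading the response this way tells a user not just their risk, but which habits push it up most.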


Survey History

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /survey/ | Yes (user) | Get all past surveys for the logged-in user |
| GET | /survey/{survey_id} | Yes (user) | Get details of a specific survey |
| DELETE | /survey/{survey_id} | Yes (user) | Delete a specific survey |

Analytics

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /analytics/bar | Yes (user) | Get aggregated statistics for bar chart visualisation |

Query parameters:

| Parameter | Required | Valid values |
|---|---|---|
| metric | Yes | "disease_prevalence", "exercise_rate", "smoking_rate", "avg_mental_health_days" |
| group_by | Yes | "age_group", "sex", "income", "employment" |
| disease | Only when metric=disease_prevalence | "diabetes", "heart_disease", "depression", "arthritis" |
| sex | No | "Male", "Female" |
| age_group | No | Any valid age group string |
| income | No | Any valid income string |
| education | No | Any valid education string |
| employment | No | Any valid employment string |

Example request:

GET /analytics/bar?metric=disease_prevalence&group_by=age_group&disease=diabetes

Response:

{
  "labels": ["18-24", "25-29", "30-34", ...],
  "values": [0.03, 0.04, 0.06, ...]
}

Data Upload (Admin Only)

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| POST | /upload/ | Yes (admin role) | Upload a new BRFSS .XPT file and trigger the full pipeline |

Upload a .XPT file using multipart/form-data. The pipeline runs in a background thread and will:

  1. Stream records through Kafka
  2. Transform them with Spark
  3. Store the results in brfss_processed
  4. Retrain all ML models

To give a user admin privileges, update their role column directly in the account table:

UPDATE account SET role = 'admin' WHERE email = 'admin@example.com';
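Triggering the pipeline from a script can be sketched with requests. This is a hedged sketch: the multipart field name "file" and the helper names are assumptions, not confirmed from the source.

```python
# Sketch of the admin upload; the multipart field name "file" is an
# assumption. The token must belong to an account with role = 'admin'.
import requests

def xpt_filename(path: str) -> str:
    """File name to send in the multipart form (handles both separators)."""
    return path.replace("\\", "/").rsplit("/", 1)[-1]

def upload_xpt(path: str, token: str, base: str = "http://localhost:8000"):
    with open(path, "rb") as fh:
        resp = requests.post(
            f"{base}/upload/",
            headers={"Authorization": f"Bearer {token}"},
            files={"file": (xpt_filename(path), fh)},
        )
    resp.raise_for_status()
    return resp.json()  # the pipeline continues in a background thread
```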

Data Pipeline Details

Hot-Deck Imputation

Missing values in the BRFSS dataset are filled using hot-deck imputation: for each respondent with a missing value in a column, a donor is found from already-complete records that shares the same values for a set of grouping columns (e.g., same sex and age group). The donor's value is then copied across.

The imputation is applied in a specific order to ensure that grouping columns are available when needed:

age_group → education → employment → income →
alcohol_consumption → exercise → ecigarette_use →
smoked_100_cigarettes → mental_health_days →
physical_health_days → weight_kg → height_m
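The scheme described above can be sketched in pandas. This is an illustration of hot-deck imputation in general, not the project's implementation in backend/imputation/, which may differ in donor selection and ordering.

```python
# Illustrative hot-deck imputation: fill a missing value by sampling a
# donor from complete records that share the same grouping columns.
# The project's real implementation lives in backend/imputation/.
import pandas as pd

def hot_deck(df: pd.DataFrame, target: str, group_cols: list,
             seed: int = 0) -> pd.DataFrame:
    out = df.copy()
    for _, idx in out.groupby(group_cols).groups.items():
        block = out.loc[idx, target]
        donors = block.dropna()
        missing = block.index[block.isna()]
        if donors.empty or missing.empty:
            continue  # no donor in this cell: value stays missing
        out.loc[missing, target] = donors.sample(
            n=len(missing), replace=True, random_state=seed
        ).to_numpy()
    return out

df = pd.DataFrame({
    "sex": ["male", "male", "female", "female"],
    "age_group": ["35-39"] * 4,
    "weight_kg": [80.0, None, 62.0, None],
})
imputed = hot_deck(df, "weight_kg", ["sex", "age_group"])
```

Grouping by sex and age_group ensures a donor resembles the respondent, which is why the columns are imputed in the order shown: each grouping column must be complete before it can be used to select donors.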

BRFSS Variable Mapping

The raw BRFSS variables are mapped to human-readable column names:

| Raw Variable | Processed Column |
|---|---|
| SEXVAR | sex |
| _AGEG5YR | age_group |
| _INCOMG1 | income |
| _STATE | state |
| EDUCA | education |
| EMPLOY1 | employment |
| SMOKE100 | smoked_100_cigarettes |
| DRNKANY6 | alcohol_consumption |
| EXERANY2 | exercise |
| MENTHLTH | mental_health_days |
| ECIGNOW3 / ECIGNOW2 | ecigarette_use |
| PHYSHLTH | physical_health_days |
| WTKG3 | weight_kg (kg) |
| HTM4 | height_m (metres) |
| DIABETE4 | diabetes |
| ADDEPEV3 | depression |
| _MICHD | heart_disease |
| HAVARTH4 | arthritis |
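In pandas terms the mapping is a column rename, shown here for a few columns. The unit scaling is an assumption for illustration (verify the implied decimal places against the BRFSS codebook); the real transformation is done by the Spark job in backend/spark/.

```python
# The mapping table expressed as a pandas rename, for a few columns.
# The real transformation is the Spark job in backend/spark/.
import pandas as pd

RENAME = {
    "SEXVAR": "sex",
    "_AGEG5YR": "age_group",
    "_INCOMG1": "income",
    "SMOKE100": "smoked_100_cigarettes",
    "MENTHLTH": "mental_health_days",
    "WTKG3": "weight_kg",
    "HTM4": "height_m",
}

raw = pd.DataFrame({"SEXVAR": [1], "WTKG3": [8000], "HTM4": [180]})
processed = raw.rename(columns=RENAME)

# Assumed unit scaling: WTKG3 and HTM4 carry two implied decimal places
# in the raw file (e.g. 8000 -> 80.00 kg). Check the BRFSS codebook.
processed["weight_kg"] = processed["weight_kg"] / 100
processed["height_m"] = processed["height_m"] / 100
```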

Troubleshooting

ModuleNotFoundError: No module named 'backend'
Run all commands from the project root directory so that the backend package is on the Python path.

hadoop.dll error when running Spark on Windows
Ensure C:\hadoop\bin\hadoop.dll exists and HADOOP_HOME=C:\hadoop is set as a system environment variable. Restart your terminal after setting it.

Kafka connection refused
Make sure the Docker containers are running: docker ps. If the broker container is not listed, run docker-compose up -d again.

SQLALCHEMY_DATABASE_URL validation error on startup
The .env file is missing or not in the project root. Double-check the file exists at the same level as docker-compose.yml.

Spark cannot connect to PostgreSQL
Spark uses JDBC to write to PostgreSQL. Ensure POSTGRES_JDBC_URL, POSTGRES_USER, and POSTGRES_PASSWORD are set correctly in .env. The JDBC driver (org.postgresql:postgresql:42.7.3) is downloaded automatically by Spark on first run — this requires an internet connection.

No active ML models / prediction returns an error
Run the training step (Step 9) before making prediction requests. You need at least one trained model per disease in the mlmodel table with is_active = true.
