Modern lifestyles have substantially reshaped our everyday habits, which are now often characterized by poor diet, increased stress, insufficient sleep, and reduced exercise. Such behavioral shifts, among many others, have been associated with a steady increase in chronic diseases such as diabetes, cardiovascular disease, obesity, and mental health disorders.
According to the World Health Organization (WHO), non-communicable diseases, which are not caused by infections and usually carry long-term health consequences, are responsible for the majority of deaths globally. Notably, many of these deaths are linked to lifestyle factors that can be readily modified. There is therefore an urgent need for tools that help people understand how much their daily habits contribute to the risk of specific diseases.
This project aims to develop a system at the intersection of healthcare and data science that predicts the risk of several common chronic diseases from lifestyle characteristics. The system relies on public health survey data and uses machine learning techniques to identify patterns linking lifestyle factors to health outcomes. Finally, an interactive prototype will allow users to explore the dataset and its characteristics, as well as assess how different lifestyle choices influence their health risks.
Throughout the world, there has been an overall increase in chronic diseases such as diabetes, heart disease, obesity, and depression. While healthcare systems often manage these conditions with widely available medical treatments, such treatments address the symptoms rather than the underlying causes.
This problem is important because of its potential for preventative care. Overprescribing medicine is a common practice for treating everyday conditions, especially in well-developed, high-GDP countries. However, a better non-medical alternative usually exists, and this is where this project's contribution lies.
The core problem addressed in this project is the lack of accessible tools that help individuals understand the impact of their lifestyle habits on their personal health. This project aims to develop a predictive system that estimates the probability of developing chronic diseases based on lifestyle factors. The system is not intended to diagnose disease; rather, it operates as a risk estimation tool for raising awareness and supporting preventative measures.
- Which lifestyle variables (e.g., BMI, physical activity, smoking, sleep) have the strongest influence on disease risk?
- Can machine learning models capture patterns in survey data to produce reliable estimates of disease risk?
- Can we cluster groups of individuals with similar lifestyle and health risks?
- Individuals with poorer lifestyle habits (e.g., low physical activity, poor sleep, high BMI) are more likely to develop chronic diseases.
- Machine learning models trained on such data can produce reliable estimates of disease risk.
This project uses the Behavioral Risk Factor Surveillance System (BRFSS) dataset, provided by the CDC.
- Main source: https://www.cdc.gov/brfss/index.html
- Annual datasets: https://www.cdc.gov/brfss/annual_data/annual_data.htm
- Example (2022): https://www.cdc.gov/brfss/annual_data/annual_2022.html
BRFSS is a large-scale health-related survey dataset containing information about:
- Lifestyle habits (smoking, exercise, diet)
- Health conditions (diabetes, heart disease, etc.)
- Demographics (age, gender, etc.)
The application is built around a streaming data pipeline and a REST API backend:

```
BRFSS .XPT File
       │
       ▼
Kafka Producer ──► Kafka Topic (brfss_raw)
                         │
              ┌──────────┴──────────┐
              │                     │
      Spark Streaming        Kafka Consumer
      (transformation)        (raw storage)
              │                     │
              ▼                     ▼
      brfss_processed           brfss_raw
       (PostgreSQL)            (PostgreSQL)
              │
              ▼
  ML Training (Random Forest)
              │
              ▼
   Saved Models (.pkl files)
              │
              ▼
     FastAPI REST Backend
              │
              ▼
      Frontend (port 5173)
```
Key components:
- FastAPI — REST API backend with JWT authentication
- PostgreSQL — primary database for users, surveys, processed data, and ML models
- Apache Kafka (via Docker) — message broker for streaming raw BRFSS records
- Apache Spark (via Docker) — stream processor that cleans and transforms raw records
- Random Forest classifiers — one per disease (diabetes, depression, heart disease, arthritis)
- SHAP — explains individual predictions by showing feature contributions
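The producer's exact implementation is not shown here, but the core idea of the streaming ingestion step is serialising each BRFSS record as a JSON message and sending it to the `brfss_raw` topic in batches of `KAFKA_BATCH_SIZE`. A minimal stdlib sketch of that serialisation and batching logic (function names are illustrative, not the project's actual API):

```python
import json

def to_kafka_message(record: dict) -> bytes:
    """Serialise one BRFSS record as UTF-8 JSON, a typical payload format
    for a Kafka topic such as brfss_raw (illustrative only)."""
    return json.dumps(record, sort_keys=True).encode("utf-8")

def batch(records: list[dict], size: int) -> list[list[dict]]:
    """Split records into fixed-size batches, mirroring KAFKA_BATCH_SIZE."""
    return [records[i:i + size] for i in range(0, len(records), size)]

records = [{"SEXVAR": 1, "MENTHLTH": 3}, {"SEXVAR": 2, "MENTHLTH": 0}, {"SEXVAR": 1}]
msgs = [to_kafka_message(r) for r in records]
batches = batch(records, 2)
print(len(msgs), len(batches))  # 3 2
```

In the real pipeline the resulting bytes would be handed to a Kafka client (e.g. `kafka-python`'s `KafkaProducer.send`), which is omitted here so the sketch runs without a broker.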
Install all of the following before proceeding:
| Software | Version | Notes |
|---|---|---|
| Python | 3.10+ | Use a virtual environment |
| PostgreSQL | 14+ | Must be running locally on port 5432 |
| Docker Desktop | Latest | For Kafka and Spark containers |
| Java (JDK) | 11 or 17 | Required by Spark |
| Hadoop (Windows only) | 3.x | Required by Spark on Windows — see below |
data_science_project/
├── backend/
│ ├── app/
│ │ ├── core/ # Config, security (JWT), dependencies
│ │ ├── db/ # Database engine and table initialisation
│ │ ├── models/ # SQLModel ORM table definitions
│ │ ├── repositories/ # Database query functions
│ │ ├── routers/ # FastAPI route handlers
│ │ ├── schemas/ # Pydantic request/response models
│ │ ├── services/ # Business logic layer
│ │ └── main.py # FastAPI app entry point
│ ├── imputation/ # Hot-deck missing value imputation
│ ├── kafka/ # Kafka producer and consumer
│ ├── machine_learning/ # Model training, evaluation, clustering
│ ├── offline_scripts/ # One-time setup and initial ingestion scripts
│ └── spark/ # Spark streaming processor
├── docker-compose.yml # Kafka + Spark infrastructure
├── requirements.txt # Python dependencies
└── .env # Environment variables (you create this)
- Go to: https://www.cdc.gov/brfss/annual_data/annual_data.htm
- Select a year (e.g., 2024)
- Download the SAS Transport Format (.XPT) file; it will be named something like `LLCP2024.XPT`
- Place the file somewhere accessible on your machine and note the full path
It is recommended to use a virtual environment:

```shell
python -m venv venv

# Windows
venv\Scripts\activate

# macOS / Linux
source venv/bin/activate

pip install -r requirements.txt
```

Note: `pyspark` requires Java to be installed and `JAVA_HOME` to be set. Verify with `java -version`.
- Install and start PostgreSQL (default port 5432)
- Open the PostgreSQL shell or pgAdmin and create a new database:

```sql
CREATE DATABASE brfss;
```

- Note down your PostgreSQL username and password — you will need them for the `.env` file
Apache Spark requires Hadoop binaries on Windows. The application expects Hadoop at C:\hadoop.
- Download a pre-built Hadoop binary package for Windows (e.g., from https://github.com/steveloughran/winutils/releases — choose the version matching your Spark: Spark 3.5.x uses Hadoop 3.x)
- Extract the archive so that `C:\hadoop\bin\winutils.exe` and `C:\hadoop\bin\hadoop.dll` exist
- Set the environment variable `HADOOP_HOME=C:\hadoop` (you can set this permanently in Windows: System Properties → Environment Variables → New System Variable)
- Also add `C:\hadoop\bin` to your system `PATH`
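As a quick sanity check of the Hadoop setup above, the expected paths and environment variable can be verified programmatically. This is a hypothetical helper, not part of the project; it only encodes the paths this guide states (`HADOOP_HOME=C:\hadoop`, `winutils.exe`, `hadoop.dll`):

```python
import os

def hadoop_issues(env: dict, exists=os.path.exists) -> list[str]:
    """Report Windows Hadoop setup problems for the paths this guide
    expects: HADOOP_HOME=C:\\hadoop containing bin\\winutils.exe and
    bin\\hadoop.dll. `exists` is injectable so the sketch is testable."""
    issues = []
    home = env.get("HADOOP_HOME")
    if home != r"C:\hadoop":
        issues.append("HADOOP_HOME is not set to C:\\hadoop")
    for name in ("winutils.exe", "hadoop.dll"):
        if home and not exists(os.path.join(home, "bin", name)):
            issues.append(f"missing {name}")
    return issues

# With a correctly configured environment this prints []:
print(hadoop_issues({"HADOOP_HOME": r"C:\hadoop"}, exists=lambda p: True))  # []
```

On a real machine you would call `hadoop_issues(dict(os.environ))` and fix whatever it reports.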
Make sure Docker Desktop is running, then from the project root:
```shell
docker-compose up -d
```

This starts three containers:

- broker — Apache Kafka on port `9092`
- spark-master — Spark master on ports `8080` (UI) and `7077`
- spark-worker — Spark worker with 8 GB memory and 4 cores, connected to the master

Verify they are running:

```shell
docker ps
```

Create a file named `.env` in the project root (same level as `docker-compose.yml`) with the following content, replacing the placeholder values with your own:
```
# PostgreSQL connection
SQLALCHEMY_DATABASE_URL=postgresql+psycopg2://<user>:<password>@localhost:5432/brfss

# JWT authentication
SECRET_KEY=your_random_secret_key_here
ALGORITHM=HS256
ACCESS_TOKEN_EXPIRE_MINUTES=480

# Kafka settings
KAFKA_BROKER=localhost:9092
KAFKA_EXTERNAL_BROKER=localhost:9092
KAFKA_TOPIC=brfss_raw
KAFKA_BATCH_SIZE=5000
XPT_FILE_PATH=C:/path/to/your/LLCP2024.XPT

# Spark → PostgreSQL (used by Spark JDBC writer)
POSTGRES_JDBC_URL=jdbc:postgresql://localhost:5432/brfss
POSTGRES_USER=<user>
POSTGRES_PASSWORD=<password>
```

Tips:

- `SECRET_KEY` should be a long, random string. Generate one with: `python -c "import secrets; print(secrets.token_hex(32))"`
- `XPT_FILE_PATH` must use forward slashes or escaped backslashes, even on Windows
- `KAFKA_BATCH_SIZE=5000` is a good default; increase it for faster ingestion on powerful machines
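How the application actually loads these settings is not shown here (Python projects typically use `python-dotenv` or `pydantic-settings`), but the `KEY=VALUE` format itself is simple. A minimal stdlib sketch of parsing it, for illustration only:

```python
def parse_env(text: str) -> dict[str, str]:
    """Minimal parser for the KEY=VALUE .env format shown above.
    Blank lines and '#' comments are skipped; real projects should
    prefer python-dotenv or pydantic-settings."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

sample = """# Kafka settings
KAFKA_BROKER=localhost:9092
KAFKA_TOPIC=brfss_raw
"""
cfg = parse_env(sample)
print(cfg["KAFKA_TOPIC"])  # brfss_raw
```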
Run these two scripts once to create the raw and processed BRFSS tables in PostgreSQL:
```shell
# From the project root:

# 1. Create the brfss_raw table (schema is inferred from the XPT file)
python -m backend.offline_scripts.init_raw_brfss_table

# 2. Create the brfss_processed table
python -m backend.spark.init_processed_brfss_table
```

The application's FastAPI startup also auto-creates the `account`, `mlmodel`, `predictions`, and `survey` tables via SQLModel on first run, so you do not need to create those manually.
This one-time script loads the BRFSS .XPT file, cleans and transforms all records, applies hot-deck imputation for missing values, and writes the result to the `brfss_processed` table:

```shell
python -m backend.offline_scripts.initial_dataset_ingestion
```

This may take several minutes depending on the size of the dataset (the 2024 file contains ~400,000+ records). You will see progress output including missing value counts before and after imputation.
Once the `brfss_processed` table has data, train the classifiers:

```shell
python -m backend.machine_learning.trainer
```

This trains one Random Forest classifier per disease target:

- `diabetes`
- `depression`
- `heart_disease`
- `arthritis`
Each model is evaluated with ROC AUC, PR AUC, and Brier score. Models are saved as `.pkl` files inside `backend/data/saved_models/` and registered in the `mlmodel` database table. The best-performing model per disease is automatically marked as active.
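Of the three metrics, the Brier score is the least widely known: it is the mean squared difference between the predicted probability and the binary outcome, so lower is better (0 is perfect, and 0.25 is what you get by always predicting 0.5). A small self-contained illustration:

```python
def brier_score(y_true: list[int], y_prob: list[float]) -> float:
    """Mean squared error between predicted probabilities and 0/1
    outcomes. Rewards calibrated probabilities, not just correct
    classifications."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

# A well-calibrated model on mostly-negative cases scores close to 0:
print(round(brier_score([0, 0, 1, 0], [0.1, 0.2, 0.8, 0.0]), 4))  # 0.0225
```

In practice the project would compute this with `sklearn.metrics.brier_score_loss`; the hand-rolled version above just makes the definition explicit.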
From the project root:
```shell
uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000
```

The API will be available at:

- Base URL: `http://localhost:8000`
- Interactive docs (Swagger UI): `http://localhost:8000/docs`
- Alternative docs (ReDoc): `http://localhost:8000/redoc`
On startup the application automatically creates any missing database tables.
All endpoints that require authentication expect a Bearer token in the Authorization header:
```
Authorization: Bearer <access_token>
```
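A client can attach this header with any HTTP library; a stdlib `urllib` sketch is below. The token value is a hypothetical placeholder: in practice it comes from the `POST /account/login` response. The request is constructed but deliberately not sent, so the sketch runs offline:

```python
import urllib.request

# Hypothetical token; obtain a real one via POST /account/login.
token = "eyJ..."

req = urllib.request.Request(
    "http://localhost:8000/survey/",
    headers={"Authorization": f"Bearer {token}"},
)
# urllib.request.urlopen(req) would send it with the header attached.
print(req.get_header("Authorization"))  # Bearer eyJ...
```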
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| POST | `/account/` | No | Register a new user account |
| POST | `/account/login` | No | Log in and receive a JWT token |
Register request body:
```json
{
  "email": "user@example.com",
  "password": "yourpassword"
}
```

Login request body:

```json
{
  "email": "user@example.com",
  "password": "yourpassword"
}
```

Login response:

```json
{
  "access_token": "eyJ...",
  "token_type": "bearer"
}
```

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| POST | `/predict/` | Yes (user) | Submit lifestyle data and receive disease risk scores |
Request body:
```json
{
  "sex": "male",
  "age_group": "35-39",
  "income": "$50,000 to $100,000",
  "state": "California",
  "education": "College graduate",
  "employment": "Employed",
  "smoked_100_cigarettes": false,
  "alcohol_consumption": true,
  "exercise": true,
  "mental_health_days": 3,
  "ecigarette_use": "Never used",
  "physical_health_days": 2,
  "weight_kg": 80.0,
  "height_m": 1.80
}
```

Valid values for enumerated fields:
| Field | Valid values |
|---|---|
| `sex` | `"male"`, `"female"` |
| `age_group` | `"18-24"`, `"25-29"`, `"30-34"`, `"35-39"`, `"40-44"`, `"45-49"`, `"50-54"`, `"55-59"`, `"60-64"`, `"65-69"`, `"70-74"`, `"75-79"`, `"80+"` |
| `income` | `"Less than $15,000"`, `"$15,000 to $25,000"`, `"$25,000 to $35,000"`, `"$35,000 to $50,000"`, `"$50,000 to $100,000"`, `"$100,000 to $200,000"`, `"$200,000 or more"` |
| `education` | `"Never attended school"`, `"Elementary"`, `"Some high school"`, `"High school graduate"`, `"Some college"`, `"College graduate"` |
| `employment` | `"Employed"`, `"Unemployed"`, `"Student"`, `"Retired"`, `"Unable to work"` |
| `ecigarette_use` | `"Never used"`, `"Every day"`, `"Some days"`, `"Used in past only"` |
| `state` | Any U.S. state, `"District of Columbia"`, `"Guam"`, `"Puerto Rico"`, `"Virgin Islands"` |
| `mental_health_days` | Integer 0–30 |
| `physical_health_days` | Integer 0–30 |
| `weight_kg` | Float 23.0–295.0 |
| `height_m` | Float 0.91–2.44 |
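The API enforces these constraints server-side via its Pydantic schemas, but a client can pre-validate before sending. A small illustrative check covering a subset of the fields above (the helper is hypothetical, not part of the project):

```python
VALID_SEX = {"male", "female"}
VALID_ECIG = {"Never used", "Every day", "Some days", "Used in past only"}

def validate(payload: dict) -> list[str]:
    """Return the names of fields that violate the enumerations and
    ranges documented above (subset only, for illustration)."""
    errors = []
    if payload.get("sex") not in VALID_SEX:
        errors.append("sex")
    if payload.get("ecigarette_use") not in VALID_ECIG:
        errors.append("ecigarette_use")
    if not 0 <= payload.get("mental_health_days", -1) <= 30:
        errors.append("mental_health_days")
    if not 0.91 <= payload.get("height_m", 0.0) <= 2.44:
        errors.append("height_m")
    return errors

print(validate({"sex": "male", "ecigarette_use": "Never used",
                "mental_health_days": 3, "height_m": 1.80}))  # []
```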
Response:
```json
{
  "predictions": [
    {
      "model_id": 1,
      "disease": "diabetes",
      "prediction": 0.12,
      "shap_values": {
        "sex": -0.003,
        "age_group": 0.021,
        "income": -0.015,
        "weight_kg": 0.045,
        ...
      }
    },
    ...
  ]
}
```

The prediction is a probability (0.0–1.0) of having that disease. The `shap_values` show how much each feature pushed the prediction up or down.
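One common way a client might interpret the `shap_values` object is to rank features by the magnitude of their contribution: positive values pushed the predicted risk up, negative values pushed it down. A small sketch (the helper name is illustrative):

```python
def top_drivers(shap_values: dict[str, float], n: int = 3) -> list[tuple[str, float]]:
    """Rank features by absolute SHAP contribution, largest first."""
    return sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)[:n]

# Values taken from the example response above:
shap = {"sex": -0.003, "age_group": 0.021, "income": -0.015, "weight_kg": 0.045}
print(top_drivers(shap))
# [('weight_kg', 0.045), ('age_group', 0.021), ('income', -0.015)]
```

Here `weight_kg` is the strongest driver of the 0.12 diabetes risk, while `income` pulls the risk down.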
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | `/survey/` | Yes (user) | Get all past surveys for the logged-in user |
| GET | `/survey/{survey_id}` | Yes (user) | Get details of a specific survey |
| DELETE | `/survey/{survey_id}` | Yes (user) | Delete a specific survey |
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | `/analytics/bar` | Yes (user) | Get aggregated statistics for bar chart visualisation |
Query parameters:
| Parameter | Required | Valid values |
|---|---|---|
| `metric` | Yes | `"disease_prevalence"`, `"exercise_rate"`, `"smoking_rate"`, `"avg_mental_health_days"` |
| `group_by` | Yes | `"age_group"`, `"sex"`, `"income"`, `"employment"` |
| `disease` | Only when `metric=disease_prevalence` | `"diabetes"`, `"heart_disease"`, `"depression"`, `"arthritis"` |
| `sex` | No | `"Male"`, `"Female"` |
| `age_group` | No | Any valid age group string |
| `income` | No | Any valid income string |
| `education` | No | Any valid education string |
| `employment` | No | Any valid employment string |
Example request:
```
GET /analytics/bar?metric=disease_prevalence&group_by=age_group&disease=diabetes
```
Response:
```json
{
  "labels": ["18-24", "25-29", "30-34", ...],
  "values": [0.03, 0.04, 0.06, ...]
}
```

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| POST | `/upload/` | Yes (admin role) | Upload a new BRFSS .XPT file and trigger the full pipeline |
Upload a .XPT file using multipart/form-data. The pipeline runs in a background thread and will:
- Stream records through Kafka
- Transform them with Spark
- Store the results in `brfss_processed`
- Retrain all ML models
To give a user admin privileges, update their `role` column directly in the `account` table:
```sql
UPDATE account SET role = 'admin' WHERE email = 'admin@example.com';
```

Missing values in the BRFSS dataset are filled using hot-deck imputation: for each respondent with a missing value in a column, a donor is found from already-complete records that shares the same values for a set of grouping columns (e.g., same sex and age group). The donor's value is then copied across.
The imputation is applied in a specific order to ensure that grouping columns are available when needed:
```
age_group → education → employment → income →
alcohol_consumption → exercise → ecigarette_use →
smoked_100_cigarettes → mental_health_days →
physical_health_days → weight_kg → height_m
```
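The project's actual implementation lives in `backend/imputation/`; a simplified stdlib sketch of the hot-deck idea described above (find a donor matching on the grouping columns, copy its value) looks like this:

```python
import random

def hot_deck(records: list[dict], column: str, group_cols: list[str],
             rng: random.Random) -> None:
    """Fill missing values in `column` in place by copying from a random
    donor record that matches on all `group_cols`. Simplified sketch of
    hot-deck imputation, not the project's actual code."""
    # Index complete records by their grouping-column values.
    donors: dict[tuple, list] = {}
    for r in records:
        if r.get(column) is not None:
            donors.setdefault(tuple(r[g] for g in group_cols), []).append(r[column])
    # Copy a donor value into each incomplete record with a matching group.
    for r in records:
        if r.get(column) is None:
            pool = donors.get(tuple(r[g] for g in group_cols))
            if pool:
                r[column] = rng.choice(pool)

people = [
    {"sex": "female", "age_group": "35-39", "weight_kg": 68.0},
    {"sex": "female", "age_group": "35-39", "weight_kg": None},
    {"sex": "male", "age_group": "35-39", "weight_kg": 82.0},
]
hot_deck(people, "weight_kg", ["sex", "age_group"], random.Random(0))
print(people[1]["weight_kg"])  # 68.0 (the only matching donor)
```

This also shows why the ordering above matters: a column can only be used for grouping once it has itself been imputed.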
The raw BRFSS variables are mapped to human-readable column names:
| Raw Variable | Processed Column |
|---|---|
| `SEXVAR` | `sex` |
| `_AGEG5YR` | `age_group` |
| `_INCOMG1` | `income` |
| `_STATE` | `state` |
| `EDUCA` | `education` |
| `EMPLOY1` | `employment` |
| `SMOKE100` | `smoked_100_cigarettes` |
| `DRNKANY6` | `alcohol_consumption` |
| `EXERANY2` | `exercise` |
| `MENTHLTH` | `mental_health_days` |
| `ECIGNOW3` / `ECIGNOW2` | `ecigarette_use` |
| `PHYSHLTH` | `physical_health_days` |
| `WTKG3` | `weight_kg` (kg) |
| `HTM4` | `height_m` (metres) |
| `DIABETE4` | `diabetes` |
| `ADDEPEV3` | `depression` |
| `_MICHD` | `heart_disease` |
| `HAVARTH4` | `arthritis` |
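Beyond renaming, the transformation also decodes the numeric survey codes into readable values. The sketch below illustrates that recoding for a few variables; the code-to-label mappings shown (e.g. `SEXVAR` 1/2, `EXERANY2` 1/2, and `WTKG3` carrying two implied decimal places) are taken from the BRFSS codebook conventions and should be treated as assumptions here, since the project's Spark transformation is the authoritative source:

```python
# Illustrative code-to-label mappings (verify against the BRFSS codebook).
SEXVAR = {1: "male", 2: "female"}
EXERANY2 = {1: True, 2: False}

def recode(raw: dict) -> dict:
    """Decode a few raw BRFSS fields into the processed columns above."""
    return {
        "sex": SEXVAR.get(raw.get("SEXVAR")),
        "exercise": EXERANY2.get(raw.get("EXERANY2")),
        # WTKG3 stores kilograms with two implied decimal places.
        "weight_kg": raw["WTKG3"] / 100 if raw.get("WTKG3") else None,
    }

print(recode({"SEXVAR": 1, "EXERANY2": 1, "WTKG3": 8000}))
# {'sex': 'male', 'exercise': True, 'weight_kg': 80.0}
```

Unknown or missing codes fall through to `None`, which is where the hot-deck imputation step takes over.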
**`ModuleNotFoundError: No module named 'backend'`**

Run all commands from the project root directory so that the `backend` package is on the Python path.

**`hadoop.dll` error when running Spark on Windows**

Ensure `C:\hadoop\bin\hadoop.dll` exists and `HADOOP_HOME=C:\hadoop` is set as a system environment variable. Restart your terminal after setting it.

**Kafka connection refused**

Make sure the Docker containers are running: `docker ps`. If the broker container is not listed, run `docker-compose up -d` again.

**`SQLALCHEMY_DATABASE_URL` validation error on startup**

The `.env` file is missing or not in the project root. Double-check the file exists at the same level as `docker-compose.yml`.

**Spark cannot connect to PostgreSQL**

Spark uses JDBC to write to PostgreSQL. Ensure `POSTGRES_JDBC_URL`, `POSTGRES_USER`, and `POSTGRES_PASSWORD` are set correctly in `.env`. The JDBC driver (`org.postgresql:postgresql:42.7.3`) is downloaded automatically by Spark on first run; this requires an internet connection.

**No active ML models / prediction returns an error**

Run the training step (Step 9) before making prediction requests. You need at least one trained model per disease in the `mlmodel` table with `is_active = true`.