Modern lifestyles have substantially reshaped our everyday habits, which are now often characterized by poor diet, increased stress, insufficient sleep, and reduced exercise. Such behavioral shifts, among many others, have been associated with a steady increase in chronic diseases such as diabetes, cardiovascular disease, obesity, and mental health disorders.
According to the World Health Organization (WHO), non-communicable diseases, which are not caused by infections and usually carry long-term health consequences, are responsible for the majority of deaths globally. Notably, many of these deaths are linked to lifestyle factors that can be readily modified. There is therefore an urgent need for tools that help people understand how much their daily habits contribute to the risk of specific diseases.
This project aims to develop a system at the intersection of healthcare and data science that predicts the risk of several common chronic diseases from lifestyle characteristics. The system relies on public health survey data and uses machine learning techniques to identify patterns linking lifestyle factors to health outcomes. Finally, an interactive prototype will allow users to explore the dataset and its characteristics, as well as assess how different lifestyle choices influence their health risks.
Throughout the world, there has been an overall increase in chronic diseases such as diabetes, heart disease, obesity, and depression. While healthcare systems often manage these conditions with widely available medical treatments, such treatments address the symptoms rather than the underlying causes.
This problem is important because of its potential for preventative care. Overprescribing medicine is a common practice for treating everyday conditions, especially in well-developed, high-GDP countries. However, a better non-medical alternative usually exists, and this is where this project's contribution lies.
The core problem addressed in this project is the lack of accessible tools that help individuals understand the impact of their lifestyle habits on their personal health. This project aims to develop a predictive system that estimates the probability of developing chronic diseases based on lifestyle factors. The system is not intended to diagnose disease; rather, it operates as a risk estimation tool for raising awareness and supporting preventative measures.
- Which lifestyle variables (e.g., BMI, physical activity, smoking, sleep) have the strongest influence on disease risk?
- Can machine learning models capture patterns in survey data to produce reliable estimates of disease risk?
- Can we cluster groups of individuals with similar lifestyle and health risks?
- Individuals with poorer lifestyle habits (e.g., low physical activity, poor sleep, high BMI) are more likely to develop chronic diseases.
- Machine learning models trained on such data can produce reliable estimates of disease risk.
This project uses the Behavioral Risk Factor Surveillance System (BRFSS) dataset, provided by the CDC.
- Main source: https://www.cdc.gov/brfss/index.html
- Annual datasets: https://www.cdc.gov/brfss/annual_data/annual_data.htm
- Example (2022): https://www.cdc.gov/brfss/annual_data/annual_2022.html
BRFSS is a large-scale health-related survey dataset containing information about:
- Lifestyle habits (smoking, exercise, diet)
- Health conditions (diabetes, heart disease, etc.)
- Demographics (age, gender, etc.)
The application is built around a streaming data pipeline and a REST API backend:

```
BRFSS .XPT File
       │
       ▼
Kafka Producer ──► Kafka Topic (brfss_raw)
                         │
              ┌──────────┴──────────┐
              │                     │
      Spark Streaming        Kafka Consumer
      (transformation)        (raw storage)
              │                     │
              ▼                     ▼
      brfss_processed           brfss_raw
       (PostgreSQL)            (PostgreSQL)
              │
              ▼
  ML Training (Random Forest)
              │
              ▼
   Saved Models (.pkl files)
              │
              ▼
     FastAPI REST Backend
              │
              ▼
      Frontend (port 5173)
```
Key components:
- FastAPI — REST API backend with JWT authentication
- PostgreSQL — primary database for users, surveys, processed data, and ML models
- Apache Kafka (via Docker) — message broker for streaming raw BRFSS records
- Apache Spark (via Docker) — stream processor that cleans and transforms raw records
- Random Forest classifiers — one per disease (diabetes, depression, heart disease, arthritis)
- SHAP — explains individual predictions by showing feature contributions
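The producer's exact implementation is not shown here, but the core idea of the streaming ingestion step is serialising each BRFSS record as a JSON message and sending it to the `brfss_raw` topic in batches of `KAFKA_BATCH_SIZE`. A minimal stdlib sketch of that serialisation and batching logic (function names are illustrative, not the project's actual API):

```python
import json

def to_kafka_message(record: dict) -> bytes:
    """Serialise one BRFSS record as UTF-8 JSON, a typical payload format
    for a Kafka topic such as brfss_raw (illustrative only)."""
    return json.dumps(record, sort_keys=True).encode("utf-8")

def batch(records: list[dict], size: int) -> list[list[dict]]:
    """Split records into fixed-size batches, mirroring KAFKA_BATCH_SIZE."""
    return [records[i:i + size] for i in range(0, len(records), size)]

records = [{"SEXVAR": 1, "MENTHLTH": 3}, {"SEXVAR": 2, "MENTHLTH": 0}, {"SEXVAR": 1}]
msgs = [to_kafka_message(r) for r in records]
batches = batch(records, 2)
print(len(msgs), len(batches))  # 3 2
```

In the real pipeline the resulting bytes would be handed to a Kafka client (e.g. `kafka-python`'s `KafkaProducer.send`), which is omitted here so the sketch runs without a broker.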
Install all of the following before proceeding:
| Software | Version | Notes |
|---|---|---|
| Python | 3.10+ | Use a virtual environment |
| PostgreSQL | 14+ | Must be running locally on port 5432 |
| Docker Desktop | Latest | For Kafka and Spark containers |
| Java (JDK) | 11 or 17 | Required by Spark |
| Hadoop (Windows only) | 3.x | Required by Spark on Windows — see below |
data_science_project/
├── backend/
│ ├── app/
│ │ ├── core/ # Config, security (JWT), dependencies
│ │ ├── db/ # Database engine and table initialisation
│ │ ├── models/ # SQLModel ORM table definitions
│ │ ├── repositories/ # Database query functions
│ │ ├── routers/ # FastAPI route handlers
│ │ ├── schemas/ # Pydantic request/response models
│ │ ├── services/ # Business logic layer
│ │ └── main.py # FastAPI app entry point
│ ├── imputation/ # Hot-deck missing value imputation
│ ├── kafka/ # Kafka producer and consumer
│ ├── machine_learning/ # Model training, evaluation, clustering
│ ├── offline_scripts/ # One-time setup and initial ingestion scripts
│ └── spark/ # Spark streaming processor
├── docker-compose.yml # Kafka + Spark infrastructure
├── requirements.txt # Python dependencies
└── .env # Environment variables (you create this)
- Go to: https://www.cdc.gov/brfss/annual_data/annual_data.htm
- Select a year (e.g., 2024)
- Download the SAS Transport Format (.XPT) file; it will be named something like `LLCP2024.XPT`
- Place the file somewhere accessible on your machine and note the full path
It is recommended to use a virtual environment:

```shell
python -m venv venv

# Windows
venv\Scripts\activate

# macOS / Linux
source venv/bin/activate

pip install -r requirements.txt
```

Note: `pyspark` requires Java to be installed and `JAVA_HOME` to be set. Verify with `java -version`.
- Install and start PostgreSQL (default port 5432)
- Open the PostgreSQL shell or pgAdmin and create a new database:

```sql
CREATE DATABASE brfss;
```

- Note down your PostgreSQL username and password — you will need them for the `.env` file
Apache Spark requires Hadoop binaries on Windows. The application expects Hadoop at C:\hadoop.
- Download a pre-built Hadoop binary package for Windows (e.g., from https://github.com/steveloughran/winutils/releases — choose the version matching your Spark: Spark 3.5.x uses Hadoop 3.x)
- Extract the archive so that `C:\hadoop\bin\winutils.exe` and `C:\hadoop\bin\hadoop.dll` exist
- Set the environment variable `HADOOP_HOME=C:\hadoop` (you can set this permanently in Windows: System Properties → Environment Variables → New System Variable)
- Also add `C:\hadoop\bin` to your system `PATH`
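As a quick sanity check of the Hadoop setup above, the expected paths and environment variable can be verified programmatically. This is a hypothetical helper, not part of the project; it only encodes the paths this guide states (`HADOOP_HOME=C:\hadoop`, `winutils.exe`, `hadoop.dll`):

```python
import os

def hadoop_issues(env: dict, exists=os.path.exists) -> list[str]:
    """Report Windows Hadoop setup problems for the paths this guide
    expects: HADOOP_HOME=C:\\hadoop containing bin\\winutils.exe and
    bin\\hadoop.dll. `exists` is injectable so the sketch is testable."""
    issues = []
    home = env.get("HADOOP_HOME")
    if home != r"C:\hadoop":
        issues.append("HADOOP_HOME is not set to C:\\hadoop")
    for name in ("winutils.exe", "hadoop.dll"):
        if home and not exists(os.path.join(home, "bin", name)):
            issues.append(f"missing {name}")
    return issues

# With a correctly configured environment this prints []:
print(hadoop_issues({"HADOOP_HOME": r"C:\hadoop"}, exists=lambda p: True))  # []
```

On a real machine you would call `hadoop_issues(dict(os.environ))` and fix whatever it reports.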
Make sure Docker Desktop is running, then from the project root:
```shell
docker-compose up -d
```

This starts three containers:

- broker — Apache Kafka on port `9092`
- spark-master — Spark master on ports `8080` (UI) and `7077`
- spark-worker — Spark worker with 8 GB memory and 4 cores, connected to the master

Verify they are running:

```shell
docker ps
```

Create a file named `.env` in the project root (same level as `docker-compose.yml`) with the following content, replacing the placeholder values with your own:
```
# PostgreSQL connection
SQLALCHEMY_DATABASE_URL=postgresql+psycopg2://<user>:<password>@localhost:5432/brfss

# JWT authentication
SECRET_KEY=your_random_secret_key_here
ALGORITHM=HS256
ACCESS_TOKEN_EXPIRE_MINUTES=480

# Kafka settings
KAFKA_BROKER=localhost:9092
KAFKA_EXTERNAL_BROKER=localhost:9092
KAFKA_TOPIC=brfss_raw
KAFKA_BATCH_SIZE=5000
XPT_FILE_PATH=C:/path/to/your/LLCP2024.XPT

# Spark → PostgreSQL (used by Spark JDBC writer)
POSTGRES_JDBC_URL=jdbc:postgresql://localhost:5432/brfss
POSTGRES_USER=<user>
POSTGRES_PASSWORD=<password>
```

Tips:

- `SECRET_KEY` should be a long, random string. Generate one with: `python -c "import secrets; print(secrets.token_hex(32))"`
- `XPT_FILE_PATH` must use forward slashes or escaped backslashes, even on Windows
- `KAFKA_BATCH_SIZE=5000` is a good default; increase it for faster ingestion on powerful machines
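How the application actually loads these settings is not shown here (Python projects typically use `python-dotenv` or `pydantic-settings`), but the `KEY=VALUE` format itself is simple. A minimal stdlib sketch of parsing it, for illustration only:

```python
def parse_env(text: str) -> dict[str, str]:
    """Minimal parser for the KEY=VALUE .env format shown above.
    Blank lines and '#' comments are skipped; real projects should
    prefer python-dotenv or pydantic-settings."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

sample = """# Kafka settings
KAFKA_BROKER=localhost:9092
KAFKA_TOPIC=brfss_raw
"""
cfg = parse_env(sample)
print(cfg["KAFKA_TOPIC"])  # brfss_raw
```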
Run these two scripts once to create the raw and processed BRFSS tables in PostgreSQL:
```shell
# From the project root:

# 1. Create the brfss_raw table (schema is inferred from the XPT file)
python -m backend.offline_scripts.init_raw_brfss_table

# 2. Create the brfss_processed table
python -m backend.spark.init_processed_brfss_table
```

The application's FastAPI startup also auto-creates the `account`, `mlmodel`, `predictions`, and `survey` tables via SQLModel on first run, so you do not need to create those manually.
This one-time script loads the BRFSS .XPT file, cleans and transforms all records, applies hot-deck imputation for missing values, and writes the result to the `brfss_processed` table:

```shell
python -m backend.offline_scripts.initial_dataset_ingestion
```

This may take several minutes depending on the size of the dataset (the 2024 file contains ~400,000+ records). You will see progress output including missing value counts before and after imputation.
Once the `brfss_processed` table has data, train the classifiers:

```shell
python -m backend.machine_learning.trainer
```

This trains one Random Forest classifier per disease target:

- `diabetes`
- `depression`
- `heart_disease`
- `arthritis`
Each model is evaluated with ROC AUC, PR AUC, and Brier score. Models are saved as `.pkl` files inside `backend/data/saved_models/` and registered in the `mlmodel` database table. The best-performing model per disease is automatically marked as active.
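Of the three metrics, the Brier score is the least widely known: it is the mean squared difference between the predicted probability and the binary outcome, so lower is better (0 is perfect, and 0.25 is what you get by always predicting 0.5). A small self-contained illustration:

```python
def brier_score(y_true: list[int], y_prob: list[float]) -> float:
    """Mean squared error between predicted probabilities and 0/1
    outcomes. Rewards calibrated probabilities, not just correct
    classifications."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

# A well-calibrated model on mostly-negative cases scores close to 0:
print(round(brier_score([0, 0, 1, 0], [0.1, 0.2, 0.8, 0.0]), 4))  # 0.0225
```

In practice the project would compute this with `sklearn.metrics.brier_score_loss`; the hand-rolled version above just makes the definition explicit.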
From the project root:
```shell
uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000
```

The API will be available at:

- Base URL: `http://localhost:8000`
- Interactive docs (Swagger UI): `http://localhost:8000/docs`
- Alternative docs (ReDoc): `http://localhost:8000/redoc`
On startup the application automatically creates any missing database tables.
All endpoints that require authentication expect a Bearer token in the Authorization header:
```
Authorization: Bearer <access_token>
```
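A client can attach this header with any HTTP library; a stdlib `urllib` sketch is below. The token value is a hypothetical placeholder: in practice it comes from the `POST /account/login` response. The request is constructed but deliberately not sent, so the sketch runs offline:

```python
import urllib.request

# Hypothetical token; obtain a real one via POST /account/login.
token = "eyJ..."

req = urllib.request.Request(
    "http://localhost:8000/survey/",
    headers={"Authorization": f"Bearer {token}"},
)
# urllib.request.urlopen(req) would send it with the header attached.
print(req.get_header("Authorization"))  # Bearer eyJ...
```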
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| POST | `/account/` | No | Register a new user account |
| POST | `/account/login` | No | Log in and receive a JWT token |
Register request body:
```json
{
  "email": "user@example.com",
  "password": "yourpassword"
}
```

Login request body:

```json
{
  "email": "user@example.com",
  "password": "yourpassword"
}
```

Login response:

```json
{
  "access_token": "eyJ...",
  "token_type": "bearer"
}
```

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| POST | `/predict/` | Yes (user) | Submit lifestyle data and receive disease risk scores |
Request body:
```json
{
  "sex": "male",
  "age_group": "35-39",
  "income": "$50,000 to $100,000",
  "state": "California",
  "education": "College graduate",
  "employment": "Employed",
  "smoked_100_cigarettes": false,
  "alcohol_consumption": true,
  "exercise": true,
  "mental_health_days": 3,
  "ecigarette_use": "Never used",
  "physical_health_days": 2,
  "weight_kg": 80.0,
  "height_m": 1.80
}
```

Valid values for enumerated fields:
| Field | Valid values |
|---|---|
| `sex` | `"male"`, `"female"` |
| `age_group` | `"18-24"`, `"25-29"`, `"30-34"`, `"35-39"`, `"40-44"`, `"45-49"`, `"50-54"`, `"55-59"`, `"60-64"`, `"65-69"`, `"70-74"`, `"75-79"`, `"80+"` |
| `income` | `"Less than $15,000"`, `"$15,000 to $25,000"`, `"$25,000 to $35,000"`, `"$35,000 to $50,000"`, `"$50,000 to $100,000"`, `"$100,000 to $200,000"`, `"$200,000 or more"` |
| `education` | `"Never attended school"`, `"Elementary"`, `"Some high school"`, `"High school graduate"`, `"Some college"`, `"College graduate"` |
| `employment` | `"Employed"`, `"Unemployed"`, `"Student"`, `"Retired"`, `"Unable to work"` |
| `ecigarette_use` | `"Never used"`, `"Every day"`, `"Some days"`, `"Used in past only"` |
| `state` | Any U.S. state, `"District of Columbia"`, `"Guam"`, `"Puerto Rico"`, `"Virgin Islands"` |
| `mental_health_days` | Integer 0–30 |
| `physical_health_days` | Integer 0–30 |
| `weight_kg` | Float 23.0–295.0 |
| `height_m` | Float 0.91–2.44 |
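The API enforces these constraints server-side via its Pydantic schemas, but a client can pre-validate before sending. A small illustrative check covering a subset of the fields above (the helper is hypothetical, not part of the project):

```python
VALID_SEX = {"male", "female"}
VALID_ECIG = {"Never used", "Every day", "Some days", "Used in past only"}

def validate(payload: dict) -> list[str]:
    """Return the names of fields that violate the enumerations and
    ranges documented above (subset only, for illustration)."""
    errors = []
    if payload.get("sex") not in VALID_SEX:
        errors.append("sex")
    if payload.get("ecigarette_use") not in VALID_ECIG:
        errors.append("ecigarette_use")
    if not 0 <= payload.get("mental_health_days", -1) <= 30:
        errors.append("mental_health_days")
    if not 0.91 <= payload.get("height_m", 0.0) <= 2.44:
        errors.append("height_m")
    return errors

print(validate({"sex": "male", "ecigarette_use": "Never used",
                "mental_health_days": 3, "height_m": 1.80}))  # []
```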
Response:
```json
{
  "predictions": [
    {
      "model_id": 1,
      "disease": "diabetes",
      "prediction": 0.12,
      "shap_values": {
        "sex": -0.003,
        "age_group": 0.021,
        "income": -0.015,
        "weight_kg": 0.045,
        ...
      }
    },
    ...
  ]
}
```

The prediction is a probability (0.0–1.0) of having that disease. The `shap_values` show how much each feature pushed the prediction up or down.
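One common way a client might interpret the `shap_values` object is to rank features by the magnitude of their contribution: positive values pushed the predicted risk up, negative values pushed it down. A small sketch (the helper name is illustrative):

```python
def top_drivers(shap_values: dict[str, float], n: int = 3) -> list[tuple[str, float]]:
    """Rank features by absolute SHAP contribution, largest first."""
    return sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)[:n]

# Values taken from the example response above:
shap = {"sex": -0.003, "age_group": 0.021, "income": -0.015, "weight_kg": 0.045}
print(top_drivers(shap))
# [('weight_kg', 0.045), ('age_group', 0.021), ('income', -0.015)]
```

Here `weight_kg` is the strongest driver of the 0.12 diabetes risk, while `income` pulls the risk down.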
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | `/survey/` | Yes (user) | Get all past surveys for the logged-in user |
| GET | `/survey/{survey_id}` | Yes (user) | Get details of a specific survey |
| DELETE | `/survey/{survey_id}` | Yes (user) | Delete a specific survey |
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | `/analytics/bar` | Yes (user) | Get aggregated statistics for bar chart visualisation |
Query parameters:
| Parameter | Required | Valid values |
|---|---|---|
| `metric` | Yes | `"disease_prevalence"`, `"exercise_rate"`, `"smoking_rate"`, `"avg_mental_health_days"` |
| `group_by` | Yes | `"age_group"`, `"sex"`, `"income"`, `"employment"` |
| `disease` | Only when `metric=disease_prevalence` | `"diabetes"`, `"heart_disease"`, `"depression"`, `"arthritis"` |
| `sex` | No | `"Male"`, `"Female"` |
| `age_group` | No | Any valid age group string |
| `income` | No | Any valid income string |
| `education` | No | Any valid education string |
| `employment` | No | Any valid employment string |
Example request:
```
GET /analytics/bar?metric=disease_prevalence&group_by=age_group&disease=diabetes
```
Response:
```json
{
  "labels": ["18-24", "25-29", "30-34", ...],
  "values": [0.03, 0.04, 0.06, ...]
}
```

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| POST | `/upload/` | Yes (admin role) | Upload a new BRFSS .XPT file and trigger the full pipeline |
Upload a .XPT file using multipart/form-data. The pipeline runs in a background thread and will:
- Stream records through Kafka
- Transform them with Spark
- Store the results in `brfss_processed`
- Retrain all ML models
To give a user admin privileges, update their `role` column directly in the `account` table:
```sql
UPDATE account SET role = 'admin' WHERE email = 'admin@example.com';
```

Missing values in the BRFSS dataset are filled using hot-deck imputation: for each respondent with a missing value in a column, a donor is found from already-complete records that shares the same values for a set of grouping columns (e.g., same sex and age group). The donor's value is then copied across.
The imputation is applied in a specific order to ensure that grouping columns are available when needed:
```
age_group → education → employment → income →
alcohol_consumption → exercise → ecigarette_use →
smoked_100_cigarettes → mental_health_days →
physical_health_days → weight_kg → height_m
```
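The project's actual implementation lives in `backend/imputation/`; a simplified stdlib sketch of the hot-deck idea described above (find a donor matching on the grouping columns, copy its value) looks like this:

```python
import random

def hot_deck(records: list[dict], column: str, group_cols: list[str],
             rng: random.Random) -> None:
    """Fill missing values in `column` in place by copying from a random
    donor record that matches on all `group_cols`. Simplified sketch of
    hot-deck imputation, not the project's actual code."""
    # Index complete records by their grouping-column values.
    donors: dict[tuple, list] = {}
    for r in records:
        if r.get(column) is not None:
            donors.setdefault(tuple(r[g] for g in group_cols), []).append(r[column])
    # Copy a donor value into each incomplete record with a matching group.
    for r in records:
        if r.get(column) is None:
            pool = donors.get(tuple(r[g] for g in group_cols))
            if pool:
                r[column] = rng.choice(pool)

people = [
    {"sex": "female", "age_group": "35-39", "weight_kg": 68.0},
    {"sex": "female", "age_group": "35-39", "weight_kg": None},
    {"sex": "male", "age_group": "35-39", "weight_kg": 82.0},
]
hot_deck(people, "weight_kg", ["sex", "age_group"], random.Random(0))
print(people[1]["weight_kg"])  # 68.0 (the only matching donor)
```

This also shows why the ordering above matters: a column can only be used for grouping once it has itself been imputed.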
The raw BRFSS variables are mapped to human-readable column names:
| Raw Variable | Processed Column |
|---|---|
| `SEXVAR` | `sex` |
| `_AGEG5YR` | `age_group` |
| `_INCOMG1` | `income` |
| `_STATE` | `state` |
| `EDUCA` | `education` |
| `EMPLOY1` | `employment` |
| `SMOKE100` | `smoked_100_cigarettes` |
| `DRNKANY6` | `alcohol_consumption` |
| `EXERANY2` | `exercise` |
| `MENTHLTH` | `mental_health_days` |
| `ECIGNOW3` / `ECIGNOW2` | `ecigarette_use` |
| `PHYSHLTH` | `physical_health_days` |
| `WTKG3` | `weight_kg` (kg) |
| `HTM4` | `height_m` (metres) |
| `DIABETE4` | `diabetes` |
| `ADDEPEV3` | `depression` |
| `_MICHD` | `heart_disease` |
| `HAVARTH4` | `arthritis` |
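Beyond renaming, the transformation also decodes the numeric survey codes into readable values. The sketch below illustrates that recoding for a few variables; the code-to-label mappings shown (e.g. `SEXVAR` 1/2, `EXERANY2` 1/2, and `WTKG3` carrying two implied decimal places) are taken from the BRFSS codebook conventions and should be treated as assumptions here, since the project's Spark transformation is the authoritative source:

```python
# Illustrative code-to-label mappings (verify against the BRFSS codebook).
SEXVAR = {1: "male", 2: "female"}
EXERANY2 = {1: True, 2: False}

def recode(raw: dict) -> dict:
    """Decode a few raw BRFSS fields into the processed columns above."""
    return {
        "sex": SEXVAR.get(raw.get("SEXVAR")),
        "exercise": EXERANY2.get(raw.get("EXERANY2")),
        # WTKG3 stores kilograms with two implied decimal places.
        "weight_kg": raw["WTKG3"] / 100 if raw.get("WTKG3") else None,
    }

print(recode({"SEXVAR": 1, "EXERANY2": 1, "WTKG3": 8000}))
# {'sex': 'male', 'exercise': True, 'weight_kg': 80.0}
```

Unknown or missing codes fall through to `None`, which is where the hot-deck imputation step takes over.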
**`ModuleNotFoundError: No module named 'backend'`**

Run all commands from the project root directory so that the `backend` package is on the Python path.

**`hadoop.dll` error when running Spark on Windows**

Ensure `C:\hadoop\bin\hadoop.dll` exists and `HADOOP_HOME=C:\hadoop` is set as a system environment variable. Restart your terminal after setting it.

**Kafka connection refused**

Make sure the Docker containers are running: `docker ps`. If the broker container is not listed, run `docker-compose up -d` again.

**`SQLALCHEMY_DATABASE_URL` validation error on startup**

The `.env` file is missing or not in the project root. Double-check the file exists at the same level as `docker-compose.yml`.

**Spark cannot connect to PostgreSQL**

Spark uses JDBC to write to PostgreSQL. Ensure `POSTGRES_JDBC_URL`, `POSTGRES_USER`, and `POSTGRES_PASSWORD` are set correctly in `.env`. The JDBC driver (`org.postgresql:postgresql:42.7.3`) is downloaded automatically by Spark on first run; this requires an internet connection.

**No active ML models / prediction returns an error**

Run the training step (Step 9) before making prediction requests. You need at least one trained model per disease in the `mlmodel` table with `is_active = true`.