Comprehensive tennis match-prediction system powered by historical data, bookmaker odds, and machine learning. Designed to run as a Streamlit web application; data pipelines operate autonomously on GitHub Actions.
This repository originally took inspiration from LewisWJackson's tennis predictor, but has evolved substantially with new data sources, caching layers, and a modern UI.
- Historical data (2020–present) built from TennisMyLife and tennis-data.co.uk; odds matched via intelligent name normalisation (81.5% success rate).
- Live pre-match odds fetched from Matchstat RapidAPI with per-day caching and a 500-call/month budget guard.
- Full feature engineering pipeline generating ELO, serve stats, surface form, H2H counts, and market probabilities. Built daily via GitHub Actions.
- Streamlit UI with three tabs:
  - Today's Matches (live odds, ELO, market value)
  - Match Explorer (filterable historical dataset)
  - ELO Rankings (overall and surface leaderboards)
- Automated update workflow (.github/workflows/update_data.yml) downloads the latest matches and rebuilds features, committing changes back to main.
- MIT-licensed data sources: TennisMyLife (1968–present) and tennis-data.co.uk (odds 2020–2025). All code is permissively licensed.
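
The per-day caching and 500-call/month budget guard mentioned for live odds could be sketched roughly as follows. This is a minimal in-memory illustration, not the code in matchstat_api.py: the class name `BudgetedCache`, the `fetch` callable, and the choice to raise `RuntimeError` are all assumptions, and a real client would likely persist the cache and counter to disk.

```python
import datetime as dt

MONTHLY_BUDGET = 500  # call limit stated in this README


class BudgetedCache:
    """Per-day response cache with a monthly call counter (hypothetical sketch)."""

    def __init__(self, fetch, budget=MONTHLY_BUDGET):
        self._fetch = fetch    # callable(date) -> payload; performs the real API request
        self._budget = budget
        self._cache = {}       # date -> cached payload
        self._calls = {}       # (year, month) -> number of calls made

    def get(self, day: dt.date):
        # Serve from the cache if this day was already fetched: no API call spent.
        if day in self._cache:
            return self._cache[day]

        # Assumption: requests are for the current day, so the requested
        # date's month is also the month the call is billed to.
        month = (day.year, day.month)
        used = self._calls.get(month, 0)
        if used >= self._budget:
            raise RuntimeError(f"monthly budget of {self._budget} calls exhausted")

        payload = self._fetch(day)
        self._calls[month] = used + 1
        self._cache[day] = payload
        return payload
```

Repeated requests for the same day cost one call in total, and the counter resets naturally when the month changes because it is keyed by (year, month).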
- Clone this repo and create a Python 3.11 venv.
- Install dependencies:

      pip install -r requirements.txt
- Populate keys:
  - ODDS_API_KEY for The Odds API (optional; historical odds join).
  - RAPIDAPI_KEY for the Matchstat tennis API; place in .env or .streamlit/secrets.toml (required for live odds).
- Run initial data prep:

      python update_tml_data.py   # download current-year TML files
      python features.py          # build feature matrix (2020+)
- Train or update the prediction model (optional, but needed for the model probabilities and betting edge shown in the UI):

      python train.py   # trains & saves best model; metrics printed

  The training script evaluates accuracy, AUC-ROC, Brier score, and log loss; results are stored in data_files/tennis_predictor.pkl and displayed in the "Model Stats" tab of the app.
- Start the app:

      streamlit run predictions.py

  - Today's Matches tab now shows market odds, model win probabilities, and a green-highlighted "edge" column when the model's probability exceeds the market's implied probability.
  - Cells remain blank when neither player has ATP main-tour history (e.g. futures/ITF events).
- Deploy to Streamlit Cloud by connecting this repo; the GitHub Action will keep data fresh each morning.
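
The "edge" column in the Today's Matches tab compares the model's win probability with the market's implied probability. A minimal sketch of that comparison is shown here, assuming decimal odds and the plain 1/odds conversion. The function names are hypothetical, and whether the app also strips the bookmaker's margin from the implied probability is not stated in this README.

```python
def implied_probability(decimal_odds: float) -> float:
    """Convert decimal odds to the market's implied win probability."""
    if decimal_odds <= 1.0:
        raise ValueError("decimal odds must be greater than 1.0")
    # Note: this raw figure still contains the bookmaker's margin.
    return 1.0 / decimal_odds


def edge(model_prob: float, decimal_odds: float) -> float:
    """Model probability minus implied probability; positive means the model is more bullish."""
    return model_prob - implied_probability(decimal_odds)
```

For example, a player priced at 2.00 has an implied probability of 0.5, so a model probability of 0.6 gives an edge of about 0.1, which is the case the UI highlights in green.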
tennis-predictions/
├── data_files/ # intermediate and output datasets
│ ├── features_2020_present.parquet # feature matrix used by app
│ └── *.xlsx # raw tennis-data.co.uk downloads
├── docs/ # design and reference documentation
├── tml-data/ # TennisMyLife CSVs + enriched odds
├── matchstat_api.py # client with caching & budget tracking
├── features.py # feature engineering pipeline
├── update_tml_data.py # daily TML downloader
├── enrich_with_odds.py # join tennis-data.co.uk odds onto TML
├── predictions.py # Streamlit application
└── .github/workflows/update_data.yml # scheduled data-refresh CI
A GitHub Action (update_data.yml) runs daily at 05:00 UTC to:
- Refresh current-year TML files.
- Re-run features.py to rebuild features_2020_present.parquet.
- Commit and push any changed data.
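
To reproduce the refresh locally, the same steps can be approximated in a short script. This is only a sketch of the sequence listed above, not the contents of update_data.yml: the names `REFRESH_STEPS`, `refresh`, and `has_changes` are hypothetical, and using the output of `git status --porcelain` to decide whether anything needs committing is an assumption about how "changed data" is detected.

```python
import subprocess
import sys

# The two scripts named in this README, in the order the workflow runs them.
REFRESH_STEPS = [
    [sys.executable, "update_tml_data.py"],
    [sys.executable, "features.py"],
]


def refresh(run=subprocess.run):
    """Run each refresh step in order; check=True stops at the first failure."""
    for cmd in REFRESH_STEPS:
        run(cmd, check=True)


def has_changes(status_output: str) -> bool:
    """True when `git status --porcelain` listed at least one changed path."""
    return bool(status_output.strip())
```

Calling `refresh()` from the repository root runs both scripts; passing the text printed by `git status --porcelain` to `has_changes` then tells you whether a commit is worth making.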
See the docs/ folder for deeper guides — data acquisition, feature
engineering, odds integration, and more. Start with
docs/01_roadmap.md.
Happy coding and may your nets be full of aces! 🎾