stan-buren/README.md

Telegram X GitHub Email

Hi there 👋

I'm Stan Büren

Forward-thinking Data Engineer evolving into a Data Architect, skilled in building reliable, scalable, and secure data foundations that turn raw data into actionable business insights. I combine 3+ years of business and compliance experience with 2+ years of hands-on engineering and programming experience in production environments.


🎯 Core Philosophy

"My architectural focus is on treating data as a product."

  • Quality-First Engineering: I prioritize data quality and apply rigorous software engineering best practices to distributed systems so that downstream analysts and data scientists can derive value without friction.
  • Declarative Logic: I strongly advocate for declarative, easily readable logic, such as modular SQL in dbt or well-documented PySpark transformations.
  • Production Stability: A robust pipeline is not just about moving data from Point A to Point B; it is about testing, governing, and securing it along the way—especially when navigating strict compliance frameworks like DORA and the EU AI Act.

💡 The Engineering Intersection

To me, data engineering is the ultimate intersection of security, data management, DataOps, data architecture, orchestration, and software engineering. I am highly communicative, endlessly curious, and deeply passionate about continuous learning.

Furthermore, I am an AI-empowered practitioner. I actively leverage modern AI tooling in my daily workflows to significantly boost my productivity, accelerate my learning curve, and deliver faster, higher-quality results for the business.

🛠 Tech Stack

💻 Languages
🗄️ Databases
☁️ Cloud
⚙️ Infrastructure
🔧 Data Engineering
📊 Visualization
⚙️ Webpages

dbt Spark Databricks Apache Iceberg SeaweedFS MinIO Redpanda Traefik Kestra Gemini Antigravity uv just

Certificates

CS50 Certificate 1 CS50 Certificate 2

Data Engineering Zoomcamp Certificate

📁 Projects

A production-ready metadata ingestion engine and local lakehouse orchestrator for European power grid data.

Python Apache Spark Apache Iceberg Docker SeaweedFS uv

This repository implements a scalable data ingestion layer pulling electrical transmission metadata from the ENTSO-E platforms. It features a layered, fully testable I/O structure that separates raw client fetches from cloud storage uploads.

  • Layered I/O & Emulated Lakehouse: Orchestrates dynamically configured sync routines (tested with standard pytest mocks) and syncs data to a local SeaweedFS S3 instance that stores Apache Iceberg tables managed through Spark 4.1.1.
  • Centralized Path SSOT: All directory layouts are declaratively configured in paths.yml (Single Source of Truth), dynamically populated as Python Path objects, and audited via AST-based quality gates to prevent hardcoding.
  • Configurable Ingestion Scopes: Developers can declaratively select specific power grid domains (Load, Generation, Transmission) to ingest via YAML configurations.
  • Strict Observability: Uses structured logging, custom domain exceptions, and a localized limits config to avoid API rate-limiting issues.
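
The AST-based quality gate mentioned above could look roughly like the sketch below. It is not the repository's actual implementation: the `looks_like_path` heuristic, the function names, and the sample source are all hypothetical, chosen only to show how Python's standard `ast` module can flag path-like string literals that should have come from paths.yml.

```python
import ast


def looks_like_path(value: str) -> bool:
    """Hypothetical heuristic: contains a separator, no spaces, not a URL."""
    if value.startswith(("http://", "https://")):
        return False
    return "/" in value and " " not in value


def find_hardcoded_paths(source: str) -> list[tuple[int, str]]:
    """Return (line number, literal) pairs for path-like string constants."""
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        # Comments never reach the AST, so only real string literals are checked.
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if looks_like_path(node.value):
                hits.append((node.lineno, node.value))
    return sorted(hits)


# Sample module text used for illustration only.
SAMPLE = '''
from pathlib import Path
raw_dir = Path("data/raw/load")      # hardcoded: should come from paths.yml
label = "ENTSO-E load data"          # plain text, not a path
url = "https://example.org/api"      # URL, ignored by this heuristic
'''

if __name__ == "__main__":
    for lineno, literal in find_hardcoded_paths(SAMPLE):
        print(f"line {lineno}: hardcoded path {literal!r}")
```

A gate like this would typically run in CI or as a pytest test over the package sources and fail the build when the list of hits is non-empty.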

An industrial-grade real-time MLOps pipeline estimating engine Remaining Useful Life (RUL) via Bayesian Inference.

Python Google Cloud Terraform PyTorch Redpanda DuckDB Sigstore Streamlit

This project features a real-time streaming telemetry pipeline and a complete MLOps lifecycle. The core Bayesian Variational Inference (Flipout) model is trained on Google Cloud Platform (GCP) using high-performance compute nodes, then packaged and cryptographically signed before deployment for inference.

  • High-Performance GCP Training Loop: Orchestrates automated training runs on ephemeral GCP Compute Engine instances (AMD Milan-based C2D High-Performance instances) provisioned via Terraform. Preprocesses massive NASA HDF5 telemetry datasets in parallel across 32 cores, computing global Z-score statistics before training.
  • Keyless Attestation & Secure MLOps: Hardens model distribution by exporting weights as SafeTensors, generating keyless cryptographic signatures via Sigstore / Cosign on the GCS-integrated worker, and publishing build metadata (provenance.json) to Google Artifact Registry.
  • Bayesian VI & Flight Class Analysis: Solves short-haul vs long-haul mission estimation drift using Bayesian CNNs to output both RUL predictions and a real-time confidence/uncertainty (Sigma) threshold.
  • Hardware Isolation Shim: Employs an adaptive runtime shim layer that dynamically intercepts research-grade execution parameters (via metaclass hooks) to force CPU execution and prevent CUDA runtime crashes on edge serving nodes.
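
Computing global Z-score statistics over data that is preprocessed in parallel requires combining per-chunk statistics into one mean and standard deviation. The sketch below shows one common way to do that (the pairwise merge formula for count, mean, and sum of squared deviations). It is an assumption about the approach, not the project's code: the `Stats` class, the function names, and the toy chunks are hypothetical, and the chunks are processed sequentially here rather than across cores.

```python
import math
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Stats:
    count: int
    mean: float
    m2: float  # sum of squared deviations from the mean


def chunk_stats(values: list[float]) -> Stats:
    """Statistics for one non-empty chunk, as a single worker would compute them."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return Stats(n, mean, m2)


def merge(a: Stats, b: Stats) -> Stats:
    """Combine two partial results without revisiting the raw values."""
    n = a.count + b.count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / n
    m2 = a.m2 + b.m2 + delta**2 * a.count * b.count / n
    return Stats(n, mean, m2)


def zscore(value: float, stats: Stats) -> float:
    """Normalize a value with the global mean and population standard deviation."""
    std = math.sqrt(stats.m2 / stats.count)
    return (value - stats.mean) / std


if __name__ == "__main__":
    chunks = [[1.0, 2.0, 3.0], [4.0, 5.0]]  # toy stand-in for sensor readings
    global_stats = reduce(merge, (chunk_stats(c) for c in chunks))
    print(global_stats)
    print(zscore(5.0, global_stats))
```

Because `merge` is associative, the partial results can be combined in any order, which is what makes the per-chunk work safe to distribute.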

An enterprise-grade, real-time data engineering pipeline streaming aircraft engine telemetry.

Go Redpanda Apache Spark Terraform dbt Google Cloud Streamlit

This project simulates a fleet of aircraft engines generating high-frequency telemetry in real-time, ingests the massive event stream using a modern distributed stack, and delivers analytical health insights through a cloud data warehouse.

  • High-Throughput Edge Simulator: Features a custom Golang simulator acting as an edge device, streaming millions of sensor telemetry records directly to Redpanda (a Kafka API-compatible broker).
  • Structured Streaming & Lakehouse: Consumes event streams via PySpark 4.1.1 and flushes them to Google Cloud Storage (GCS) as Parquet files using Hive partitioning to minimize downstream scan costs.
  • Zero-Copy BigQuery DWH: Integrates BigQuery External Tables to automatically discover GCS partitions, enabling analytics without data duplication.
  • Analytics Engineering & BI: Implements staging and metrics mart layers in dbt (calculating running averages of exhaust temperature margins) and visualizes engine degradation in a Streamlit dashboard.
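
Hive partitioning reduces scan costs because partition values are encoded in the directory names, so a query engine can skip whole directories without opening any file. The sketch below illustrates the idea in plain Python. The partition columns (`date`, `engine_id`), the record fields, and the function names are hypothetical and are not taken from the project; the real pipeline writes Parquet through Spark rather than building paths by hand.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_path(record: dict) -> str:
    """Build a Hive-style key=value directory path for one record."""
    ts = datetime.fromtimestamp(record["ts"], tz=timezone.utc)
    return f"date={ts:%Y-%m-%d}/engine_id={record['engine_id']}"


def group_by_partition(records: list[dict]) -> dict[str, list[dict]]:
    """Group records by the directory they would be written to."""
    groups = defaultdict(list)
    for record in records:
        groups[partition_path(record)].append(record)
    return dict(groups)


def prune(paths: list[str], **filters: object) -> list[str]:
    """Keep only paths whose key=value segments match every filter."""
    wanted = {f"{key}={value}" for key, value in filters.items()}
    return [p for p in paths if wanted <= set(p.split("/"))]


if __name__ == "__main__":
    records = [
        {"ts": 1704067200, "engine_id": 1, "egt_margin": 41.2},  # 2024-01-01 UTC
        {"ts": 1704067260, "engine_id": 2, "egt_margin": 39.8},
        {"ts": 1704153600, "engine_id": 1, "egt_margin": 40.5},  # 2024-01-02 UTC
    ]
    paths = sorted(group_by_partition(records))
    # Only directories for the requested day remain; the rest are never read.
    print(prune(paths, date="2024-01-01"))
```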

A modern, data-driven static website for a children's theatre studio in Saint Petersburg.

Astro Vite Sass JavaScript HTML5

Designed as a responsive, content-rich web platform for a real children's theatre studio. Originally built on vanilla HTML/JS, the site was refactored into a modular, high-performance static site using Astro and JSON-driven content schemas.

  • Astro Architecture Migration: Refactored the codebase from a monolithic layout into a multi-page static site utilizing reusable Astro components and layouts.
  • JSON-Driven Content Modeling: Decoupled structural data (repertoire, scheduling, FAQs) from HTML, storing them as clean JSON collections rendered dynamically inside Astro templates.
  • Advanced Styling with Sass: Reorganized custom styles using modular SCSS variables, nested rules, and structured mixins compiled via Vite.
  • Third-Party Afisha Integration: Seamlessly embeds the Yandex.Afisha widget, letting users browse shows and securely purchase tickets inline.
  • SEO & Performance Optimization: Generated zero-JS static HTML outputs, ensuring sub-second load times, excellent web vitals, and search engine indexability.

📬 Let's Connect

Feel free to reach out for collaborations, data engineering discussions, or just to say hello!

Pinned

  1. cmapss-streaming-pipeline (Public)

    End-to-End Data Engineering pipeline for aircraft engine telemetry. Features a high-throughput Golang simulator, Redpanda, PySpark 4.1.1 ingestion, GCS Data Lake, BigQuery, dbt, and Streamlit.

    Go

  2. emozika-theatre (Public)

    Website for Emozika children's theatre

    SCSS

  3. entsoe-quickstart (Public)

    Use this repo to start your own ENTSO-E journey. This repo provides the initial, self-sustaining lift needed to get the data flowing.

    Python

  4. n-cmapss-rul-mlops-factory (Public)

    N-CMAPSS MLOps Factory. A uv monorepo (3 isolated pkgs) with 1-click orchestrators for cloud training & Redpanda/Streamlit live inference. Trains a Bayesian VI (Flipout) model for RUL with uncertai…

    Python