This project implements a complete end‑to‑end ELT pipeline for healthcare procedure data using dbt and Python, following a modern medallion architecture:
Bronze → raw data (seeds)
Silver → cleaned & enriched staging models
Python → filtering on the cost‑per‑day calculation
Gold → dimensions, fact table, analysis queries, and payer‑specific marts
The project satisfies all requirements from Tasks 1–3 of the assignment.
Architecture
Bronze Layer — Raw Data
The Bronze layer contains raw CSVs loaded as dbt seeds:
1. procedures
2. patients
3. encounters
4. payers
5. organizations
These are stored as raw tables with no transformations applied.
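As a sketch of how these seeds could be declared (the project name, paths, and schema below are assumptions, not taken from the actual repository), a dbt_project.yml excerpt might look like:

```yaml
# dbt_project.yml (excerpt; illustrative only, names are assumed)
seeds:
  healthcare_elt:        # hypothetical project name
    +schema: bronze      # land all seed CSVs in a bronze schema
    procedures:
      +column_types:
        base_cost: numeric   # avoid dbt inferring cost as a string
```

Running `dbt seed` would then load each CSV as a raw table in the bronze schema.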
Silver Layer — Cleaned & Enriched Staging Models
The Silver layer standardizes and enriches the data:
- cleans column names
- casts data types
- trims whitespace
- converts timestamps
- calculates duration_seconds
- extracts date parts (year, month_number, week_number, day_name)
- calculates cost per day
- creates a Python model that filters on cost per day (procedure_cost.py)
I decided to separate the calculation of COST_PER_DAY from the filtering step. The calculation happens in stg_procedures, ensuring that all downstream models (including the Task 2 analytical queries) have access to the complete dataset. The filtering (procedures costing less than 30,000 per day) is implemented in the Python model procedure_cost, as required by Task 1.
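That calculation-then-filter split can be sketched in plain Python (the column names, timestamp format, and one-day floor below are assumptions for illustration, not the project's actual logic):

```python
from datetime import datetime

def cost_per_day(base_cost, start, stop):
    """Cost per day of a procedure; assumes ISO-8601 timestamps."""
    seconds = (datetime.fromisoformat(stop) - datetime.fromisoformat(start)).total_seconds()
    days = max(seconds / 86400, 1)  # floor at one day so short procedures don't divide by ~0
    return base_cost / days

# Toy rows standing in for stg_procedures; values are invented.
rows = [
    {"code": "A", "base_cost": 120000, "start": "2024-01-01T08:00:00", "stop": "2024-01-03T08:00:00"},
    {"code": "B", "base_cost": 20000,  "start": "2024-01-01T08:00:00", "stop": "2024-01-01T10:00:00"},
]

# The Task 1 filter: keep procedures costing less than 30,000 per day.
kept = [r for r in rows if cost_per_day(r["base_cost"], r["start"], r["stop"]) < 30_000]
print([r["code"] for r in kept])  # ['B'] (A works out to 60,000/day)
```

In the actual dbt Python model, the same filter would presumably apply to the DataFrame returned by `dbt.ref("stg_procedures")`, with the calculation already done upstream.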
A schema.yml file defines tests such as:
- not_null
- unique
- accepted_values
- greater_than
These tests ensure data quality before moving to Gold.
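A hedged example of what such a schema.yml entry could look like (model and column names are assumptions; note that greater_than would be a custom generic test, since only the first three ship with dbt core):

```yaml
# models/silver/schema.yml (illustrative; names are assumed)
version: 2
models:
  - name: stg_procedures
    columns:
      - name: procedure_id
        tests:
          - not_null
          - unique
      - name: cost_per_day
        tests:
          - greater_than:   # custom generic test, not built into dbt core
              value: 0
```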
Gold Layer — Dimensions, Fact, Analysis & Marts
Dimensions
The Gold layer contains four dimensions:
- dim_patients — demographic attributes
- dim_payers — payer metadata
- dim_procedures — procedure descriptions
- dim_encounters — the bridge between procedures and payers
Fact Table
fact_procedures joins the dimension tables and is the central fact table for analysis.
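As a miniature illustration of how the star schema hangs together (table and column names here are invented, with dict lookups standing in for SQL joins), each procedure reaches its payer through the encounter bridge:

```python
# Hypothetical star-schema fragment; all names and values are made up.
dim_encounters = {"e1": {"patient_id": "p1", "payer_id": "pay1"}}
dim_payers = {"pay1": {"payer_name": "Humana"}}

procedures = [{"procedure_id": "pr1", "encounter_id": "e1", "base_cost": 5000}]

# Build fact rows by resolving encounter, then payer (the two joins).
fact_procedures = []
for proc in procedures:
    enc = dim_encounters[proc["encounter_id"]]
    fact_procedures.append({
        **proc,
        "patient_id": enc["patient_id"],
        "payer_name": dim_payers[enc["payer_id"]]["payer_name"],
    })
print(fact_procedures[0]["payer_name"])  # Humana
```

In the real fact_procedures model this would be two SQL joins, but the lookup chain is the same: procedure → encounter → payer.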
Analysis Queries (Task 2)
Three analytical models answer the required business questions:
1. Rank payers by total cost
2. Top 5 highest‑costing patients
3. Top 5 procedures by median cost/day
These are implemented in gold/analysis/.
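The third question (median cost/day per procedure) can be sketched in plain Python; the rows and column names below are invented for illustration:

```python
from statistics import median
from collections import defaultdict

# Toy rows standing in for fact_procedures output; values are invented.
rows = [
    {"description": "Appendectomy", "cost_per_day": 9000},
    {"description": "Appendectomy", "cost_per_day": 11000},
    {"description": "Colonoscopy", "cost_per_day": 2000},
    {"description": "Colonoscopy", "cost_per_day": 4000},
    {"description": "Colonoscopy", "cost_per_day": 3000},
]

# Group cost/day values by procedure description.
by_procedure = defaultdict(list)
for row in rows:
    by_procedure[row["description"]].append(row["cost_per_day"])

# Rank procedures by median cost/day, highest first (top 5 in the real model).
ranked = sorted(
    ((desc, median(costs)) for desc, costs in by_procedure.items()),
    key=lambda pair: pair[1],
    reverse=True,
)[:5]
print(ranked)  # [('Appendectomy', 10000.0), ('Colonoscopy', 3000)]
```

In SQL this would typically use a median/percentile aggregate with GROUP BY and a LIMIT 5, though the exact function name varies by warehouse.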
Data Marts (Task 3)
Two payer‑specific marts were created:
- dm_united_healthcare
- dm_humana
Each mart filters the fact table by payer_name. These marts provide business‑ready datasets for each client.
Notes & Limitations
I intentionally kept this project as simple and lightweight as possible to focus on clarity, correctness, and demonstrating the core concepts of dbt, dimensional modeling, and Python-based transformations. Because of this, there are a few practical considerations:
The project structure is complete, but a fully production-ready dbt deployment would require additional elements, such as:
- a fully configured profiles.yml
- environment-specific connection settings
- CI/CD integration
- documentation site generation
- more extensive testing coverage
I have not executed the full pipeline end‑to‑end inside dbt Cloud or dbt Core, so the physical creation of all tables and views has not been validated in a live warehouse.
The SQL and Python models are written to be syntactically correct and logically consistent, but depending on the warehouse (Databricks, Postgres, DuckDB, etc.), minor adjustments may be required.
Despite these limitations, the project demonstrates:
- a complete medallion architecture
- a working star schema
- Python + SQL hybrid transformations
- data marts for two clients
- dbt testing
- clean, modular, interview‑ready code
The goal was to deliver a clear, maintainable, and conceptually strong solution — and this structure provides a solid foundation for further development.