WB Data Catalog (Experimental): AI-powered metadata discovery, profiling & cohort building for BigQuery #421

Open
vrajat44 wants to merge 3 commits into master from wb-data-catalog-v2

vrajat44 commented May 25, 2026

WB Data Catalog (Experimental)

AI-powered metadata discovery, profiling, and cohort building for BigQuery datasets. Automatically generates semantic profiles, FHIR concept bindings, and cross-table relationships, then lets you build cohorts and query your data through natural language.

Features

  • Technical Profiling — automated column stats (nulls, cardinality, distributions) via BigQuery
  • Semantic Profiling — AI-generated metadata: business names, definitions, terminology bindings, sensitivity classification, entity classification, cohort dimensions
  • FHIR Concept Bindings — fixed concept binding (column IS a concept) vs code system binding (column CONTAINS codes), value set bindings for cohort dimensions
  • Structural Links — typed join paths with cardinality and confidence between tables
  • Bulk Profiling — profile all tables in a dataset with progress tracking and two-pass enrichment
  • Cohort Builder — three modes: table filters, terminology-based, natural language
  • Chat Agent — natural language Q&A over profiled metadata with tool use (SQL generation, concept search)
  • Data Preview — sample rows with column-level metadata overlay
  • Chart Advisor — AI-suggested visualizations from profiled data
  • Graphic Walker — interactive visual exploration (Tableau-style)
  • Terminology Registry — browse and search all terminology bindings across the project
  • Workspace Integration — auto-detect Workbench workspaces, data collections, cross-project datasets
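
The Technical Profiling item above mentions null counts and cardinality computed via BigQuery. As a rough illustration only (this is not the code in the PR), a per-column stats query could be assembled as below. The helper name `build_profile_sql` and the `__nulls` / `__cardinality` aliases are hypothetical, and `APPROX_COUNT_DISTINCT` is an assumption about what is acceptable for profiling; an exact `COUNT(DISTINCT ...)` would work as well.

```python
from typing import Sequence


def build_profile_sql(table: str, columns: Sequence[str]) -> str:
    """Build one BigQuery Standard SQL query that returns the row count plus
    a null count and an approximate distinct count for each column.

    Hypothetical helper for illustration; names are not taken from the PR.
    """
    if not columns:
        raise ValueError("at least one column is required")
    parts = ["COUNT(*) AS row_count"]
    for col in columns:
        # Backticks quote identifiers in BigQuery Standard SQL.
        parts.append(f"COUNTIF(`{col}` IS NULL) AS `{col}__nulls`")
        parts.append(f"APPROX_COUNT_DISTINCT(`{col}`) AS `{col}__cardinality`")
    select_list = ",\n  ".join(parts)
    return f"SELECT\n  {select_list}\nFROM `{table}`"


if __name__ == "__main__":
    # Table and column names here are made up for the example.
    print(build_profile_sql("my-project.my_dataset.patients", ["patient_id", "gender"]))
```

The sketch interpolates identifiers directly into the SQL text, so real code would need to validate table and column names against the dataset schema first.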

Install

Step | Action
1 | Open a Workbench workspace → Apps tab → Add App
2 | Select "Custom App"
3 | App name: data-catalog, Source: wb-data-catalog-v2
4 | Wait for build (~3 min), then open

Architecture

Layer | Tech | Details
Frontend | React 18 + Vite + TypeScript | SPA with react-router, styled-components, recharts
Backend | FastAPI + Python 3.11+ | BigQuery SDK, Vertex AI (Gemini), GCS for profile storage
AI | Gemini 2.5 Flash (default) | Semantic profiling, chat agent, chart advisor, cohort NL
Storage | GCS bucket metadata-json-{project} | Profile JSONs, catalog context, terminology registry
Deploy | Workbench custom app (Docker) | Single container, FastAPI serves built frontend as static
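
The Storage row mentions profile JSONs kept in a bucket named metadata-json-{project}, and the feature list distinguishes fixed concept bindings from code system bindings. The actual schema is not shown in this description, so the following is only a guess at what such a profile might look like. Every class and field name here (`ConceptBinding`, `ColumnProfile`, `TableProfile`, `kind`, and so on) is hypothetical; only the bucket naming pattern comes from the table above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ConceptBinding:
    # "fixed": the column itself IS one concept.
    # "code_system": the column CONTAINS codes drawn from a code system.
    kind: str
    system: str
    code: Optional[str] = None  # only meaningful when kind == "fixed"

    def __post_init__(self) -> None:
        if self.kind not in ("fixed", "code_system"):
            raise ValueError(f"unknown binding kind: {self.kind}")
        if self.kind == "fixed" and self.code is None:
            raise ValueError("a fixed binding needs a concept code")


@dataclass
class ColumnProfile:
    name: str
    null_count: int
    cardinality: int
    binding: Optional[ConceptBinding] = None


@dataclass
class TableProfile:
    table: str
    columns: List[ColumnProfile] = field(default_factory=list)


def bucket_name(project: str) -> str:
    """Apply the metadata-json-{project} pattern from the architecture table."""
    return f"metadata-json-{project}"


def to_profile_json(profile: TableProfile) -> str:
    """Serialize a profile (including nested bindings) to a JSON string."""
    return json.dumps(asdict(profile), indent=2, sort_keys=True)


if __name__ == "__main__":
    # Illustrative values only: a heart-rate column bound to one LOINC concept,
    # and a diagnosis column that holds ICD-10-CM codes.
    profile = TableProfile(
        table="my_dataset.vitals",
        columns=[
            ColumnProfile("heart_rate", 0, 120,
                          ConceptBinding("fixed", "http://loinc.org", "8867-4")),
            ColumnProfile("dx_code", 14, 950,
                          ConceptBinding("code_system", "http://hl7.org/fhir/sid/icd-10-cm")),
        ],
    )
    print(bucket_name("my-project"))
    print(to_profile_json(profile))
```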

Gemini Model

The app defaults to gemini-2.5-flash. The model is configurable via the Settings panel, and the app auto-detects available models and locations from Vertex AI.
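
One way to picture the default-plus-override behaviour described here is a small selection function. This is a sketch under assumptions, not the PR's logic: `choose_model` is a hypothetical name, the list of available models is passed in rather than fetched from Vertex AI, and falling back to the first available model is an invented policy.

```python
from typing import Iterable, Optional

DEFAULT_MODEL = "gemini-2.5-flash"  # default named in the PR description


def choose_model(available: Iterable[str], configured: Optional[str] = None) -> str:
    """Pick a model: the user's configured choice if it is available,
    otherwise the default, otherwise the first available model.

    Hypothetical helper; the fallback order is an assumption.
    """
    models = list(available)
    if not models:
        raise ValueError("no models available")
    for candidate in (configured, DEFAULT_MODEL):
        if candidate and candidate in models:
            return candidate
    return models[0]


if __name__ == "__main__":
    detected = ["gemini-2.5-pro", "gemini-2.5-flash"]  # example list, not a real lookup
    print(choose_model(detected))
    print(choose_model(detected, configured="gemini-2.5-pro"))
```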

What's in this PR

src/wb-data-catalog-v2/
├── backend/           # FastAPI app, profiling engine, chat agent
│   ├── verily_profiler/  # Profiling models, semantic LLM, tech stats
│   └── verily_chat/      # LangGraph agent, context builder
├── frontend/          # React SPA
│   ├── src/components/   # UI components (profiles, chat, cohort builder)
│   ├── src/hooks/        # Data fetching hooks
│   └── src/pages/        # Route pages
├── tests/             # Profile field tests, regression tests
├── Dockerfile         # Multi-stage build
├── docker-compose.yaml
├── start.sh
└── .devcontainer/     # Workbench devcontainer config

Feedback

Share feedback or report bugs

🤖 Generated with Claude Code

vrajat44 and others added 3 commits May 8, 2026 10:58
React + FastAPI data catalog with technical and semantic profiling,
terminology registry, three-mode cohort builder (table filters,
terminology, natural language), and chat agent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vrajat44 changed the title from "WB Data Catalog v2: AI-powered data catalog for BigQuery datasets" to "MetaLens: AI-powered metadata discovery, profiling & cohort building for BigQuery" on Jun 2, 2026
vrajat44 changed the title from "MetaLens: AI-powered metadata discovery, profiling & cohort building for BigQuery" to "WB Data Catalog (Experimental): AI-powered metadata discovery, profiling & cohort building for BigQuery" on Jun 2, 2026
vrajat44 marked this pull request as ready for review on Jun 2, 2026, 21:59
vrajat44 requested review from a team as code owners on Jun 2, 2026, 21:59