SemplificaAI/t5-temporal-normalizer

T5 Temporal Normalizer


A high-performance, multilingual temporal expression normalizer and pseudonymization system based on the ByT5 architecture, with a production-ready Rust implementation built on ONNX Runtime 2.0 RC9.

🚀 Key Features

  • 🌍 Multilingual Support: Italian, English, French, German, Spanish, Portuguese
  • 🔒 Privacy-Preserving: Deterministic date shifting for patient data pseudonymization
  • ⚡ High Performance: Optimized for CPU, GPU (CUDA), and NPU (Qualcomm)
  • 🎯 Multiple Precision: FP32, FP16, INT8 quantization support
  • 🛡️ OCR Robust: Handles noisy, garbled, and typo-ridden temporal expressions
  • 📦 Production Ready: Rust implementation with ONNX Runtime for cross-platform deployment

📋 Supported Formats

Input Examples

# Dates (relative, absolute, noisy)
"15/mar/72" → "2024-03-15"
"3 days after" → "2024-01-04"
"D+5" → "2024-01-06"
"12dicem6re" → "2024-12-12"  # OCR error

# Times (various formats)
"16:21" → "16:21"
"4pm" → "16:00"
"08:05" → "08:05"
"22.15" → "22:15"

Prompt Format

YYYY-MM-DD | lang_code | temporal_expression

Example: 2024-01-01 | it | 15/mar/72
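
As a minimal sketch of how a caller might assemble this prompt, the helper below joins the three fields with the ` | ` separator shown above. The name `build_prompt` and the language check are illustrative assumptions, not part of the project's API.

```python
from datetime import date

# Language codes listed in this README.
SUPPORTED_LANGS = {"it", "en", "fr", "es", "de", "pt"}


def build_prompt(anchor: date, lang: str, expression: str) -> str:
    """Format a prompt as 'YYYY-MM-DD | lang_code | temporal_expression'."""
    # Hypothetical validation: reject codes that the README does not list.
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language code: {lang!r}")
    return f"{anchor.isoformat()} | {lang} | {expression.strip()}"


print(build_prompt(date(2024, 1, 1), "it", "15/mar/72"))
# 2024-01-01 | it | 15/mar/72
```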

🌍 Multilingual Support

The model supports 6 major languages with equal training distribution:

  • 🇮🇹 Italian (it)
  • 🇬🇧 English (en)
  • 🇫🇷 French (fr)
  • 🇪🇸 Spanish (es)
  • 🇩🇪 German (de)
  • 🇵🇹 Portuguese (pt)

Each language includes medical slang, abbreviations, and cultural variations in temporal expressions.

🎯 Use Cases

🏥 Clinical/Medical (EHR)

Extracts precise timelines from electronic health records, where clinicians rely on heavy shorthand and abbreviations ("ricovero", "post-op", "adm", "D+3").

⚖️ Legal & Compliance

Analysis of contracts and legal documents where deadlines are expressed in relative terms.

🤖 Conversational AI & Booking

Chatbots for appointments ("book for next Tuesday afternoon").

📊 Financial Analysis & News

Timeline extraction from articles or financial reports.

🚚 Logistics & Supply Chain

Parsing informal emails or shipping documents ("delivery expected in 2 days").

🚀 Performance & Hardware Support

📊 Benchmark Results (1K samples)

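| Precision | Model Size | Accuracy | Throughput | Memory Usage |
|-----------|------------|----------|------------|--------------|
| FP32      | ~1.14 GB   | 99.40%   | ~44.0/s    | ~500 MB      |
| FP16      | ~738 MB    | 99.40%   | ~39.8/s    | ~400 MB      |
| INT8      | ~290 MB    | 99.40%   | ~31.7/s    | ~300 MB      |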

🖥️ Platform Support

  • 🖥️ Desktop: CPU x86-64 with SIMD optimizations
  • 🎮 GPU: NVIDIA CUDA (RTX 3090/4090) with TensorRT support
  • 📱 Edge/Mobile: ARM64 (Qualcomm Snapdragon) optimized
  • ☁️ Cloud: Docker containers, Kubernetes ready

🔧 Rust Implementation Features

  • ONNX Runtime 2.0 RC9: Latest optimizations and bug fixes
  • Memory Safety: Rust's ownership system rules out use-after-free, double frees, and data races in safe code
  • Zero-Copy: Efficient tensor handling with minimal allocations
  • Async Support: Optional tokio integration for web services
  • Batch Processing: Optimized for high-throughput scenarios

📁 Project Structure

t5-temporal-normalizer/
├── 🐍 Python Components
│   ├── train_byt5.py              # Training script (GPU optimized)
│   ├── export_onnx_v2.py          # ONNX export with quantization
│   ├── dataset_generator.py       # Synthetic data generation
│   └── eval_*.py                  # Evaluation scripts
├── 🦀 Rust Implementation
│   ├── rust/
│   │   ├── Cargo.toml             # Dependencies (ONNX 2.0 RC9)
│   │   └── src/
│   │       ├── lib.rs             # Core library
│   │       └── main.rs            # Demo application
│   └── README_RUST.md             # Rust-specific documentation
├── 📊 Models & Data
│   ├── byt5_temporal_model/       # Trained PyTorch model
│   ├── onnx_models/               # Exported ONNX models
│   └── temporal_train_v3.csv      # Training dataset
└── 📚 Documentation
    ├── PLAN.md                    # Project roadmap
    └── README.md                  # This file

🏗️ Architecture

🧠 ByT5 (Token-Free Model)

  • Byte-level: Operates directly on UTF-8 bytes, so no input can fall outside the vocabulary (no OOV tokens)
  • OCR Robust: Naturally handles typos, garbled text, and formatting errors
  • Anchor Date Injection: The anchor date is part of the prompt, so the model resolves relative expressions against it during the forward pass

🔒 Privacy-Preserving Features

  • Deterministic Shifting: SHA256(patient_id + salt) → 1-30 day offset
  • Timeline Preservation: Relative differences between dates remain constant
  • Time Integrity: Hours are never shifted, only dates
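
The shifting scheme above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's code: the README only gives `SHA256(patient_id + salt) → 1-30 day offset`, so the way the digest is reduced to an offset (first 8 bytes, modulo 30, plus 1) and the function names are hypothetical.

```python
import hashlib
from datetime import date, timedelta


def shift_days(patient_id: str, salt: str) -> int:
    """Map (patient_id, salt) to a stable offset in the range 1..30."""
    digest = hashlib.sha256((patient_id + salt).encode("utf-8")).digest()
    # Assumption: reduce the first 8 digest bytes modulo 30, then add 1.
    return int.from_bytes(digest[:8], "big") % 30 + 1


def shift_date(d: date, patient_id: str, salt: str) -> date:
    """Shift a date forward by the patient's deterministic offset."""
    return d + timedelta(days=shift_days(patient_id, salt))


salt = "secret_salt_for_privacy"
admission = shift_date(date(2024, 1, 1), "patient_001", salt)
discharge = shift_date(date(2024, 1, 4), "patient_001", salt)

# The same patient always gets the same offset, so intervals are preserved.
print((discharge - admission).days)  # 3
```

Because the offset depends only on the patient ID and the salt, every date for one patient moves by the same amount, which is what keeps relative differences constant.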

🎯 Input/Output Format

Input:  "2024-01-01 | it | 15/mar/72"
Output: "2024-03-15" (shifted by +14 days for patient_001)

Input:  "2024-01-01 | en | 4pm"
Output: "16:00" (unchanged - time is never shifted)
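
One way to honor the "times are never shifted" rule is to apply the offset only when the model output is an ISO date. The sketch below is a hypothetical post-processing step: `apply_shift` is not part of the project's API, and the `was_shifted` flag simply mirrors the field name used in the Rust example.

```python
import re
from datetime import date, timedelta

# Matches normalized dates such as "2024-01-04".
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def apply_shift(normalized: str, offset_days: int) -> tuple[str, bool]:
    """Return (value, was_shifted); only ISO dates are shifted."""
    if DATE_RE.fullmatch(normalized):
        try:
            shifted = date.fromisoformat(normalized) + timedelta(days=offset_days)
        except ValueError:
            # Looks like a date but is not a valid one; leave it untouched.
            return normalized, False
        return shifted.isoformat(), True
    # Times (HH:MM) and anything unrecognized pass through unchanged.
    return normalized, False


print(apply_shift("2024-01-04", 14))  # ('2024-01-18', True)
print(apply_shift("16:00", 14))       # ('16:00', False)
```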

🚀 Quick Start

1️⃣ Python Setup

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2️⃣ Train Model (GPU Required)

# Generate synthetic dataset
python dataset_generator.py

# Train ByT5 model (RTX 3090/4090 recommended)
# CUDA_VISIBLE_DEVICES=1 selects the second GPU; use 0 on a single-GPU machine
CUDA_VISIBLE_DEVICES=1 python train_byt5.py

3️⃣ Export to ONNX

# Export all precision variants
python export_onnx_v2.py

4️⃣ Rust Implementation

# Compile Rust library
cd rust
cargo build --release

# Run demo
cargo run --bin temporal_normalizer --release

5️⃣ Python Inference

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load fine-tuned model
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("./byt5_temporal_model/final")

# Normalize temporal expression
inputs = tokenizer("2024-01-01 | en | 3 days post admission", return_tensors="pt")
outputs = model.generate(**inputs, max_length=16)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)  # Output: 2024-01-04

🦀 Rust Usage Example

use t5_temporal_normalizer::{TemporalNormalizer, TemporalResult};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize normalizer
    let mut normalizer = TemporalNormalizer::new(
        "../byt5_temporal_model/onnx/model.onnx",
        "../byt5_temporal_model/onnx/tokenizer.json",
        "secret_salt_for_privacy".to_string(),
    )?;

    // Normalize with date shifting
    let result = normalizer.normalize_temporal_expression(
        "15/mar/72",
        "2024-01-01",
        "it",
        Some("patient_001")
    )?;

    println!("{} -> {} (shifted: {})", 
        result.original, result.normalized, result.was_shifted);
    
    Ok(())
}

🔧 Configuration

Environment Variables

# GPU selection (for multi-GPU systems); indices start at 0, so 1 is the second GPU
export CUDA_VISIBLE_DEVICES=1

# Rust optimizations
export RUSTFLAGS='-C target-cpu=native'

ONNX Runtime Providers

  • CPUExecutionProvider: Universal CPU support
  • CUDAExecutionProvider: NVIDIA GPU acceleration
  • TensorrtExecutionProvider: Maximum GPU performance
  • DmlExecutionProvider: Windows DirectML (including ARM64 devices)
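
When several of these providers may or may not be present on a machine, a common pattern is to try them in preference order and fall back to the CPU. The helper below is a minimal sketch of that selection logic; `pick_providers` and the ordering in `PREFERRED` are assumptions for illustration, not settings taken from this project.

```python
from typing import Sequence

# Hypothetical preference order: fastest GPU paths first, CPU last.
PREFERRED = (
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "DmlExecutionProvider",
    "CPUExecutionProvider",
)


def pick_providers(
    available: Sequence[str], preferred: Sequence[str] = PREFERRED
) -> list[str]:
    """Keep the preferred providers that are available, in preference order."""
    chosen = [p for p in preferred if p in available]
    # CPU is the universal fallback, so make sure it is always present.
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


print(pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
# ['CUDAExecutionProvider', 'CPUExecutionProvider']
```

With the Python `onnxruntime` package, the `available` list can come from `onnxruntime.get_available_providers()`, and the result can be passed as the `providers` argument of `onnxruntime.InferenceSession`.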

📊 Dataset & Training

  • Size: 2.5M synthetic examples (80% train, 10% validation, 10% test)
  • Languages: Balanced multilingual distribution
  • Noise: 15% OCR errors, 15% missing language tags
  • Training: Exact Match metric with early stopping
  • Precision: BF16 training for GPU efficiency
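
To make the noise rates and the Exact Match metric concrete, here is a hedged sketch. It is not `dataset_generator.py`: the look-alike table, the per-character rate, and the choice to leave the language field empty when the tag is "missing" are all assumptions. Only the 15% rates and the `b` → `6` confusion (from the "12dicem6re" example) come from this README.

```python
import random

# Hypothetical look-alike substitutions; "b" -> "6" mirrors "12dicem6re".
OCR_CONFUSIONS = {"b": "6", "o": "0", "l": "1", "s": "5"}


def inject_ocr_noise(text: str, rng: random.Random, char_rate: float = 0.3) -> str:
    """Replace some confusable characters with their look-alikes."""
    return "".join(
        OCR_CONFUSIONS[c] if c in OCR_CONFUSIONS and rng.random() < char_rate else c
        for c in text
    )


def make_prompt(anchor: str, lang: str, expr: str, rng: random.Random) -> str:
    """Build one training prompt, applying the two noise sources independently."""
    if rng.random() < 0.15:  # 15% of examples get OCR-style errors
        expr = inject_ocr_noise(expr, rng)
    if rng.random() < 0.15:  # 15% of examples lose their language tag
        lang = ""            # assumption: the field is left empty
    return f"{anchor} | {lang} | {expr}"


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that equal their reference string exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


print(exact_match(["2024-01-04", "16:00"], ["2024-01-04", "16:21"]))  # 0.5
```

Exact Match is strict by design: a prediction that is off by a single character, such as "16:00" against "16:21", counts as a miss.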

🧪 Testing & Validation

# Python validation
python eval_pytorch.py
python eval_onnx.py

# Rust testing
cd rust && cargo test

# Performance benchmarks
cd rust && cargo run --release -- --benchmark

🐛 Troubleshooting

Common Issues

  • CUDA not found: Install PyTorch with CUDA support
  • Model loading error: Check file paths in main.rs
  • Performance issues: Use RUSTFLAGS='-C target-cpu=native'
  • Memory errors: Reduce batch size or use INT8 quantization

Debug Mode

# Enable verbose logging
RUST_LOG=debug cargo run --release

# Check that PyTorch can see a CUDA device
python -c "import torch; print(torch.cuda.is_available())"

📄 License

MIT License - see LICENSE file for details.

🤝 Contributing

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open Pull Request

📞 Support


⚡ Built with Rust, ONNX Runtime 2.0 RC9, and ByT5 for production-grade temporal normalization.
