T5 Temporal Normalizer

A high-performance, multilingual temporal expression normalizer and pseudonymization system based on the ByT5 architecture, with a production-ready Rust implementation using ONNX Runtime 2.0 RC9.

🚀 Key Features

🌍 Multilingual Support: Italian, English, French, German, Spanish, Portuguese
🔒 Privacy-Preserving: Deterministic date shifting for patient data pseudonymization
⚡ High Performance: Optimized for CPU, GPU (CUDA), and NPU (Qualcomm)
🎯 Multiple Precision: FP32, FP16, INT8 quantization support
🛡️ OCR Robust: Handles noisy, garbled, and typo-ridden temporal expressions
📦 Production Ready: Rust implementation with ONNX Runtime for cross-platform deployment

📋 Supported Formats

Input Examples

# Dates (relative, absolute, noisy), resolved against the anchor date 2024-01-01
"15/mar/72"           → "2024-03-15"
"3 days after"        → "2024-01-04"
"D+5"                 → "2024-01-06"
"12dicem6re"          → "2024-12-12"  # OCR error

# Times (various formats)
"16:21"              → "16:21"
"4pm"                → "16:00"
"08:05"              → "08:05"
"22.15"              → "22:15"

Prompt Format

YYYY-MM-DD | lang_code | temporal_expression

Example: 2024-01-01 | it | 15/mar/72
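
As a minimal sketch of the prompt format above, the three fields can be joined with " | " in the documented order. The helper name `build_prompt` is hypothetical and not part of this repository.

```python
from datetime import date


def build_prompt(anchor: date, lang_code: str, expression: str) -> str:
    """Join anchor date, language code, and expression as 'YYYY-MM-DD | lang | expr'."""
    # date.isoformat() yields the YYYY-MM-DD form the prompt expects.
    return f"{anchor.isoformat()} | {lang_code} | {expression}"


# Reproduces the example prompt from this section.
print(build_prompt(date(2024, 1, 1), "it", "15/mar/72"))
```

Keeping prompt construction in one place makes it harder to pass the fields in the wrong order or with a differently formatted anchor date.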

🌍 Multilingual Support

The model supports six languages, equally represented in the training data:

🇮🇹 Italian (it)
🇬🇧 English (en)
🇫🇷 French (fr)
🇪🇸 Spanish (es)
🇩🇪 German (de)
🇵🇹 Portuguese (pt)

The training data for each language includes medical slang, abbreviations, and cultural variations in temporal expressions.

🎯 Use Cases

🏥 Clinical/Medical (EHR)

Extracts precise timelines from electronic health records, where clinicians write in terse shorthand ("ricovero", "post-op", "adm", "D+3").

⚖️ Legal & Compliance

Analysis of contracts and legal documents where deadlines are expressed relative to another date or event.

🤖 Conversational AI & Booking

Chatbots for appointments ("book for next Tuesday afternoon").

📊 Financial Analysis & News

Timeline extraction from articles or financial reports.

🚚 Logistics & Supply Chain

Parsing informal emails or shipping documents ("delivery expected in 2 days").

🚀 Performance & Hardware Support

📊 Benchmark Results (1K samples)

Precision   Model Size   Accuracy   Throughput (samples/s)   Memory Usage
FP32        ~1.14 GB     99.40%     ~44.0                    ~500 MB
FP16        ~738 MB      99.40%     ~39.8                    ~400 MB
INT8        ~290 MB      99.40%     ~31.7                    ~300 MB
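
A throughput figure like the ones in this table can be obtained by timing a fixed set of prompts. The sketch below is an assumption about how that might be done, not the benchmark code used here; `samples_per_second` and the stand-in normalizer are hypothetical.

```python
import time
from typing import Callable, Sequence


def samples_per_second(normalize: Callable[[str], str], prompts: Sequence[str]) -> float:
    """Run `normalize` once per prompt and return prompts processed per second."""
    start = time.perf_counter()
    for prompt in prompts:
        normalize(prompt)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    return len(prompts) / elapsed if elapsed > 0 else float("inf")


# Stand-in for a real model call, used only to exercise the harness.
rate = samples_per_second(lambda p: p.strip(), ["2024-01-01 | en | 4pm"] * 1000)
print(f"{rate:.1f} samples/s")
```

For a real measurement, a few warm-up calls before timing usually give more stable numbers.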

🖥️ Platform Support

🖥️ Desktop: CPU x86-64 with SIMD optimizations
🎮 GPU: NVIDIA CUDA (RTX 3090/4090) with TensorRT support
📱 Edge/Mobile: ARM64 (Qualcomm Snapdragon) optimized
☁️ Cloud: Docker containers, Kubernetes ready

🔧 Rust Implementation Features

ONNX Runtime 2.0 RC9: Latest optimizations and bug fixes
Memory Safety: Rust's ownership system rules out use-after-free, double frees, and data races at compile time
Zero-Copy: Efficient tensor handling with minimal allocations
Async Support: Optional tokio integration for web services
Batch Processing: Optimized for high-throughput scenarios

📁 Project Structure

t5-temporal-normalizer/
├── 🐍 Python Components
│   ├── train_byt5.py              # Training script (GPU optimized)
│   ├── export_onnx_v2.py          # ONNX export with quantization
│   ├── dataset_generator.py       # Synthetic data generation
│   └── eval_*.py                  # Evaluation scripts
├── 🦀 Rust Implementation
│   ├── rust/
│   │   ├── Cargo.toml             # Dependencies (ONNX 2.0 RC9)
│   │   └── src/
│   │       ├── lib.rs             # Core library
│   │       └── main.rs            # Demo application
│   └── README_RUST.md             # Rust-specific documentation
├── 📊 Models & Data
│   ├── byt5_temporal_model/       # Trained PyTorch model
│   ├── onnx_models/               # Exported ONNX models
│   └── temporal_train_v3.csv      # Training dataset
└── 📚 Documentation
    ├── PLAN.md                    # Project roadmap
    └── README.md                  # This file

🏗️ Architecture

🧠 ByT5 (Token-Free Model)

Byte-level: Operates directly on UTF-8 bytes, so no input is ever out of vocabulary
OCR Robust: Naturally handles typos, garbled text, and formatting errors
Anchor Date Injection: The anchor date is part of the prompt, so the model resolves relative expressions in its forward pass instead of in separate date arithmetic
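
To illustrate why byte-level input has no out-of-vocabulary cases, the sketch below mimics the ByT5 convention of mapping each UTF-8 byte to an id offset by 3 (ids 0, 1, and 2 are reserved for padding, end-of-sequence, and unknown) and appending the end-of-sequence id. It is a standalone approximation for illustration; the function name `byt5_style_ids` is hypothetical, and real code should use the Hugging Face tokenizer.

```python
from __future__ import annotations

OFFSET = 3  # ids 0..2 are reserved for pad, eos, and unk
EOS_ID = 1


def byt5_style_ids(text: str) -> list[int]:
    """Encode text as UTF-8 bytes shifted by OFFSET, followed by the EOS id."""
    return [b + OFFSET for b in text.encode("utf-8")] + [EOS_ID]


# A clean and an OCR-garbled expression both encode without any unknown id.
for expr in ("12 dicembre", "12dicem6re"):
    ids = byt5_style_ids(expr)
    assert all(OFFSET <= i < 256 + OFFSET for i in ids[:-1])
    print(expr, len(ids))
```

Because every possible byte has an id, a garbled character changes a few input ids rather than collapsing a whole word into an unknown token.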

🔒 Privacy-Preserving Features

Deterministic Shifting: SHA256(patient_id + salt) → 1-30 day offset
Timeline Preservation: Relative differences between dates remain constant
Time Integrity: Hours are never shifted, only dates
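
The deterministic shifting described above can be sketched as follows. Only the `SHA256(patient_id + salt)` input and the 1-30 day range come from this document. Taking the first 8 digest bytes modulo 30 and shifting forward in time are assumptions made for illustration, and the function names are hypothetical.

```python
import hashlib
from datetime import date, timedelta


def shift_days(patient_id: str, salt: str) -> int:
    """Derive a stable offset in the range 1..30 from SHA-256(patient_id + salt)."""
    digest = hashlib.sha256((patient_id + salt).encode("utf-8")).digest()
    # Assumption: reduce the first 8 bytes of the digest to the 1..30 range.
    return int.from_bytes(digest[:8], "big") % 30 + 1


def shift_date(iso_date: str, patient_id: str, salt: str) -> str:
    """Shift an ISO date forward by the patient's offset."""
    shifted = date.fromisoformat(iso_date) + timedelta(days=shift_days(patient_id, salt))
    return shifted.isoformat()


# The same patient always gets the same offset, so intervals are preserved.
a = date.fromisoformat(shift_date("2024-01-01", "patient_001", "salt"))
b = date.fromisoformat(shift_date("2024-01-04", "patient_001", "salt"))
print((b - a).days)  # 3
```

Since the offset depends only on the patient identifier and the salt, every date for one patient moves by the same amount, which is what keeps relative differences constant.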

🎯 Input/Output Format

Input:  "2024-01-01 | it | 15/mar/72"
Output: "2024-03-15" (shifted by +14 days for patient_001)

Input:  "2024-01-01 | en | 4pm"
Output: "16:00" (unchanged - time is never shifted)
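
The two cases above differ only in whether the normalized output is a date or a time. A minimal sketch of that decision, assuming dates are recognized by their YYYY-MM-DD shape (the helper `apply_shift` is hypothetical and may not match how the Rust library does it):

```python
from __future__ import annotations

import re
from datetime import date, timedelta

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def apply_shift(normalized: str, offset_days: int) -> tuple[str, bool]:
    """Shift ISO dates by offset_days; return anything else unchanged.

    The boolean mirrors a was_shifted flag: True only when a date was moved.
    """
    if ISO_DATE.fullmatch(normalized):
        shifted = date.fromisoformat(normalized) + timedelta(days=offset_days)
        return shifted.isoformat(), True
    return normalized, False


print(apply_shift("2024-03-01", 14))  # ('2024-03-15', True)
print(apply_shift("16:00", 14))       # ('16:00', False)
```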

🚀 Quick Start

1️⃣ Python Setup

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2️⃣ Train Model (GPU Required)

# Generate synthetic dataset
python dataset_generator.py

# Train ByT5 model (RTX 3090/4090 recommended)
# CUDA_VISIBLE_DEVICES=1 exposes only the second GPU; use 0 on a single-GPU machine
CUDA_VISIBLE_DEVICES=1 python train_byt5.py

3️⃣ Export to ONNX

# Export all precision variants
python export_onnx_v2.py

4️⃣ Rust Implementation

# Compile Rust library
cd rust
cargo build --release

# Run demo
cargo run --bin temporal_normalizer --release

5️⃣ Python Inference

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load fine-tuned model
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("./byt5_temporal_model/final")

# Normalize temporal expression
inputs = tokenizer("2024-01-01 | en | 3 days post admission", return_tensors="pt")
outputs = model.generate(**inputs, max_length=16)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)  # Output: 2024-01-04

🦀 Rust Usage Example

use t5_temporal_normalizer::TemporalNormalizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize normalizer
    let mut normalizer = TemporalNormalizer::new(
        "../byt5_temporal_model/onnx/model.onnx",
        "../byt5_temporal_model/onnx/tokenizer.json",
        "secret_salt_for_privacy".to_string(),
    )?;

    // Normalize with date shifting
    let result = normalizer.normalize_temporal_expression(
        "15/mar/72",
        "2024-01-01",
        "it",
        Some("patient_001")
    )?;

    println!("{} -> {} (shifted: {})", 
        result.original, result.normalized, result.was_shifted);
    
    Ok(())
}

🔧 Configuration

Environment Variables

# GPU selection (for multi-GPU systems)
export CUDA_VISIBLE_DEVICES=1

# Rust optimizations
export RUSTFLAGS='-C target-cpu=native'

ONNX Runtime Providers

CPUExecutionProvider: Universal CPU support
CUDAExecutionProvider: NVIDIA GPU acceleration
TensorrtExecutionProvider: Maximum GPU performance
DmlExecutionProvider: DirectML acceleration on Windows (DirectX 12 GPUs)
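
When several providers are compiled in, a common pattern is to request them in order of preference and fall back to the CPU. The sketch below shows that selection logic in plain Python. The preference order and the helper `pick_providers` are assumptions; in the Python `onnxruntime` package the available list would come from `onnxruntime.get_available_providers()`, and the result would be passed as `providers=` to `onnxruntime.InferenceSession`.

```python
from __future__ import annotations

from typing import Sequence

# Assumed preference order: fastest accelerator first, CPU last.
PREFERENCE = (
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "DmlExecutionProvider",
    "CPUExecutionProvider",
)


def pick_providers(available: Sequence[str], preference: Sequence[str] = PREFERENCE) -> list[str]:
    """Return the preferred providers that are available, in preference order."""
    chosen = [p for p in preference if p in available]
    # Fall back to the CPU provider if nothing in the preference list matched.
    return chosen or ["CPUExecutionProvider"]


print(pick_providers(["CPUExecutionProvider", "CUDAExecutionProvider"]))
# ['CUDAExecutionProvider', 'CPUExecutionProvider']
```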

📊 Dataset & Training

Size: 2.5M synthetic examples (80% train, 10% validation, 10% test)
Languages: Balanced multilingual distribution
Noise: 15% OCR errors, 15% missing language tags
Training: Exact Match metric with early stopping
Precision: BF16 training for GPU efficiency
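
Exact Match counts a prediction as correct only when it equals the reference string. A minimal sketch, assuming surrounding whitespace is ignored (the function `exact_match` is hypothetical and not taken from the evaluation scripts):

```python
from typing import Sequence


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the fraction of predictions that equal their reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    # Assumption: leading and trailing whitespace does not count as a mismatch.
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


print(exact_match(["2024-01-04", "16:00"], ["2024-01-04", "16:30"]))  # 0.5
```

The metric gives no partial credit: a date that is off by one day scores the same as a completely wrong output, which suits normalization tasks where only the exact value is usable.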

🧪 Testing & Validation

# Python validation
python eval_pytorch.py
python eval_onnx.py

# Rust testing
cd rust && cargo test

# Performance benchmarks
cd rust && cargo run --release -- --benchmark

🐛 Troubleshooting

Common Issues

CUDA not found: Install PyTorch with CUDA support
Model loading error: Check file paths in main.rs
Performance issues: Use RUSTFLAGS='-C target-cpu=native'
Memory errors: Reduce batch size or use INT8 quantization

Debug Mode

# Enable verbose logging
RUST_LOG=debug cargo run --release

# Check whether PyTorch can see a CUDA device
python -c "import torch; print(torch.cuda.is_available())"

📄 License

MIT License - see LICENSE file for details.

🤝 Contributing

Fork the repository
Create feature branch (git checkout -b feature/amazing-feature)
Commit changes (git commit -m 'Add amazing feature')
Push to branch (git push origin feature/amazing-feature)
Open Pull Request

📞 Support

⚡ Built with Rust, ONNX Runtime 2.0 RC9, and ByT5 for production-grade temporal normalization.
