SentinalX is a telecom fraud detection system that combines rule-based filtering with machine learning in a hybrid pipeline, delivering high detection accuracy while guaranteeing zero false positives for legitimate high-frequency users such as delivery partners.
**Hybrid Detection Architecture**
- Stage 1: Hard rule filter for instant whitelisting
- Stage 2: Isolation Forest ML for sophisticated anomaly detection
**Performance Targets**
- Accuracy: 96-98%
- Precision: 95%+ (low false positives)
- Recall: 98%+ (catch all fraudsters)
- Delivery Partner FPR: 0% (guaranteed by hard rule)
- Inference Time: <50ms per prediction
**Production Ready**
- Fast inference (<50ms)
- Scalable architecture
- Comprehensive evaluation metrics
- Easy deployment
```
Input Data
    │
    ▼
┌───────────────────────────────────────┐
│  STAGE 1: Hard Rule Filter            │
│                                       │
│  IF callFrequency > 50 AND            │
│     avgCallDistance < 10:             │
│     → LEGITIMATE (100% confident)     │
└───────────────────────────────────────┘
    │ (if not matched)
    ▼
┌───────────────────────────────────────┐
│  STAGE 2: Isolation Forest ML         │
│                                       │
│  • Feature engineering                │
│  • Anomaly score calculation          │
│  • Risk type classification          │
└───────────────────────────────────────┘
    │
    ▼
Final Prediction + Confidence
```
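The two-stage flow above can be sketched as a simple dispatcher. This is a minimal illustration, not the project's actual code: `ml_model` and the score `threshold` are placeholders for the trained Isolation Forest and its learned cutoff.

```python
def detect(record, ml_model=None, threshold=-0.1):
    """Route a call record through the two-stage pipeline (sketch)."""
    # Stage 1: hard rule — many calls (>50) inside a small area (<10 km)
    # is the delivery-partner pattern, whitelisted with full confidence.
    if record["callFrequency"] > 50 and record["avgCallDistance"] < 10:
        return {"prediction": "LEGITIMATE", "confidence": 1.0,
                "detectionStage": "RULE_BASED"}

    # Stage 2: everyone else falls through to the ML anomaly detector.
    score = ml_model.score(record) if ml_model else 0.0
    verdict = "FRAUD" if score < threshold else "LEGITIMATE"
    return {"prediction": verdict, "confidence": abs(score),
            "detectionStage": "ML_ISOLATION_FOREST"}
```

Note that the rule is checked first, so a matching record never reaches the ML stage at all — that ordering is what makes the 0% delivery-partner FPR a guarantee rather than a statistical outcome.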
- Delivery Partners: High frequency, low distance (protected by hard rule)
- Regular Users: Low frequency, longer calls
- Business Users: Moderate activity, professional patterns
- Traveling Professionals: Multiple locations, legitimate reasons
- Digital Arrest Bots: Very high frequency, ultra-short calls, cross-state
- Traditional Scammers: Moderate volume, long distance operations
- Low Volume Scammers: Targeted attacks, fewer victims
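Synthetic users for each profile can be generated by sampling from per-profile parameter ranges. The ranges below are illustrative placeholders — the actual ranges live in `data_generator.py`:

```python
import random

# Illustrative parameter ranges per profile (NOT the ones used by
# data_generator.py — that file is the source of truth).
PROFILES = {
    "delivery_partner":   {"callFrequency": (55, 90),  "avgCallDistance": (1, 9)},
    "digital_arrest_bot": {"callFrequency": (80, 150), "avgCallDistance": (800, 2500)},
}

def sample_user(profile, rng=random):
    """Draw one synthetic user uniformly from the profile's ranges."""
    spec = PROFILES[profile]
    return {feat: rng.uniform(lo, hi) for feat, (lo, hi) in spec.items()}
```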
- Python 3.8+
- pip package manager
```bash
# Clone or navigate to the project directory
cd SentinalX

# Install dependencies
pip install -r requirements.txt
```

Generate the datasets:

```bash
python data_generator.py
```

This creates:

- `Data/training_dataset.csv` (13,000 samples, 65% legitimate / 35% fraud)
- `Data/test_dataset.csv` (2,600 samples, same distribution)

Train the model:

```bash
python train_model.py
```

Output:

- `models/isolation_forest.pkl` (trained ML model)
- `models/scaler.pkl` (feature scaler)
- `models/config.json` (configuration & training stats)

Evaluate:

```bash
python evaluate_model.py
```

Generates comprehensive metrics:
- Overall accuracy, precision, recall
- Per-user-type performance
- Delivery partner FPR (should be 0%)
- Confusion matrix visualization
Results are saved to `evaluation_report.json`.
Demo Mode (pre-configured test cases):

```bash
python predict.py
```

Interactive Mode (manual input):

```bash
python predict.py --interactive
```

Based on the hybrid architecture, you should achieve:
| Metric | Target | Why It Matters |
|---|---|---|
| Accuracy | 96-98% | Overall correctness |
| Precision (Fraud) | 95%+ | Low false positives = happy users |
| Recall (Fraud) | 98%+ | Catch all scammers |
| Delivery Partner FPR | 0% | Hard rule guarantees protection |
| Inference Time | <50ms | Real-time processing |
```python
model = IsolationForest(
    n_estimators=100,    # Number of trees (speed vs accuracy)
    contamination=0.3,   # Expected fraud rate (30%)
    max_samples=256,     # Samples per tree (speed optimization)
    random_state=42,     # Reproducibility
    n_jobs=-1            # Use all CPU cores
)
```

The model uses 9 key features:
Base Features:

1. `avgDuration` - Average call duration (seconds)
2. `callFrequency` - Number of calls in period
3. `uniqueContacts` - Number of unique contacts
4. `avgCallDistance` - Average distance between caller/receiver (km)
5. `circleDiversity` - Number of geographic circles
Engineered Features:

6. `call_intensity` = callFrequency / avgDuration
7. `distance_per_call` = avgCallDistance / callFrequency
8. `contact_circle_ratio` = uniqueContacts / (circleDiversity + 1)
9. `high_freq_long_distance` = (callFrequency > 40) & (avgCallDistance > 1000)
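The four engineered features are simple arithmetic over the base five; a minimal sketch of the transformations listed above (the `engineer_features` helper name is illustrative, not from the codebase):

```python
def engineer_features(row):
    """Derive features 6-9 from the five base features."""
    return {
        "call_intensity": row["callFrequency"] / row["avgDuration"],
        "distance_per_call": row["avgCallDistance"] / row["callFrequency"],
        "contact_circle_ratio": row["uniqueContacts"] / (row["circleDiversity"] + 1),
        # Boolean flag stored as 0/1 so it feeds the scaler cleanly
        "high_freq_long_distance": int(row["callFrequency"] > 40
                                       and row["avgCallDistance"] > 1000),
    }
```

For the fraud example used later in this README (89 calls at 2,100 km average distance), `high_freq_long_distance` fires, which is exactly the cross-state pattern the flag is designed to surface.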
```
SentinalX/
│
├── Data/
│   ├── training_dataset.csv   # Training data (13K samples)
│   └── test_dataset.csv       # Test data (2.6K samples)
│
├── models/                    # Saved models (created after training)
│   ├── isolation_forest.pkl
│   ├── scaler.pkl
│   └── config.json
│
├── data_generator.py          # Synthetic data generation
├── train_model.py             # Model training pipeline
├── evaluate_model.py          # Comprehensive evaluation
├── predict.py                 # Inference & predictions
├── requirements.txt           # Python dependencies
└── README.md                  # This file
```
Single prediction (legitimate delivery partner):

```python
from predict import FraudDetectionAPI

api = FraudDetectionAPI('models')

result = api.predict_single({
    'callFrequency': 65,
    'avgDuration': 8.5,
    'uniqueContacts': 85,
    'avgCallDistance': 6.2,
    'circleDiversity': 1
})

print(result['prediction'])      # → LEGITIMATE
print(result['confidence'])      # → 1.0
print(result['detectionStage'])  # → RULE_BASED
```

Single prediction (fraud):

```python
result = api.predict_single({
    'callFrequency': 89,
    'avgDuration': 5.2,
    'uniqueContacts': 156,
    'avgCallDistance': 2100,
    'circleDiversity': 6
})

print(result['prediction'])      # → FRAUD
print(result['riskType'])        # → HIGH_RISK_CROSS_STATE_OPERATION
print(result['detectionStage'])  # → ML_ISOLATION_FOREST
```

Batch prediction:

```python
users = [
    {'callFrequency': 65, 'avgDuration': 8.5, ...},
    {'callFrequency': 18, 'avgDuration': 120, ...},
    {'callFrequency': 89, 'avgDuration': 5.2, ...}
]

results = api.predict_batch(users)
for i, result in enumerate(results):
    print(f"User {i+1}: {result['prediction']}")
```

The hard rule in Stage 1:

```python
if callFrequency > 50 and avgCallDistance < 10:
    return {
        "isAnomaly": False,
        "confidence": 1.0,
        "riskType": "LEGITIMATE_HIGH_FREQUENCY",
        "detectionStage": "RULE_BASED"
    }
```

Why this works:
- Delivery partners make many calls (>50) in a small area (<10km)
- This pattern is nearly impossible for fraud operations
- Guarantees 0% false positives for this critical user group
For everyone else, the model:
- Scales features using StandardScaler
- Calculates anomaly score (more negative = more suspicious)
- Classifies as fraud if score exceeds threshold
- Provides confidence and risk type
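The steps above can be sketched end-to-end with scikit-learn. This uses toy training data; the real pipeline loads `models/scaler.pkl` and `models/isolation_forest.pkl`, and the risk-type classification step is omitted here:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the 9-feature training matrix.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(500, 9))

# Scale features, then fit the forest on the scaled data,
# mirroring the configuration shown earlier in this README.
scaler = StandardScaler().fit(X_train)
model = IsolationForest(n_estimators=100, contamination=0.3,
                        max_samples=256, random_state=42,
                        n_jobs=-1).fit(scaler.transform(X_train))

x = rng.normal(size=(1, 9)) * 8   # an extreme, likely-anomalous point
score = model.score_samples(scaler.transform(x))[0]  # more negative = more suspicious
is_fraud = score < model.offset_  # cutoff implied by contamination=0.3
```

`score_samples` returns the raw anomaly score, and `offset_` is the threshold scikit-learn derives from the `contamination` setting; comparing the two is equivalent to checking `model.predict(...) == -1`.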
- Overall Accuracy: Should stay >96%
- Precision: Minimize false positives (target >95%)
- Recall: Catch all fraud (target >98%)
- Delivery Partner FPR: Must be 0%
- Inference Time: Must be <50ms
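The inference-time budget is easy to check with a small timing harness (a generic sketch; `measure_latency` is not part of the codebase):

```python
import time

def measure_latency(predict_fn, sample, runs=200):
    """Average wall-clock latency of predict_fn in milliseconds."""
    start = time.perf_counter()
    for _ in range(runs):
        predict_fn(sample)
    elapsed = time.perf_counter() - start
    return elapsed / runs * 1000.0  # ms per call
```

Averaging over many runs smooths out one-off costs such as the first-call warm-up, which would otherwise dominate a single measurement.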
After running `evaluate_model.py`, check `evaluation_report.json` for:
- Detailed metrics by user type
- Confusion matrix breakdown
- Detection stage statistics
- Inference time analysis
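The critical delivery-partner FPR check reduces to a few lines. The field names (`userType`, `label`, `prediction`) are illustrative, not the report's actual schema:

```python
def delivery_partner_fpr(records):
    """Fraction of legitimate delivery-partner records flagged as fraud.

    Each record is a dict with 'userType', 'label' (ground truth),
    and 'prediction' (model output) — illustrative field names.
    """
    dp = [r for r in records
          if r["userType"] == "delivery_partner" and r["label"] == "LEGITIMATE"]
    if not dp:
        return 0.0  # no delivery partners in the batch
    false_positives = sum(r["prediction"] == "FRAUD" for r in dp)
    return false_positives / len(dp)
```

Any nonzero result here means the hard rule is being bypassed somewhere and should be treated as a release blocker.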
- Model trained on sufficient data (13K+ samples)
- Test accuracy meets targets (>96%)
- Delivery partner FPR is 0%
- Inference time <50ms
- Error handling implemented
- Monitoring/logging in place
- Model versioning system
- Rollback plan ready
- Speed: Use ONNX runtime for faster inference
- Scale: Deploy with FastAPI or Flask
- Monitoring: Track prediction distribution over time
- Retraining: Schedule periodic retraining with new data
Issue: Model accuracy below 96%
- Solution: Regenerate data with more samples
- Check feature engineering calculations
Issue: Delivery partner FPR > 0%
- Solution: Verify hard rule implementation
- Check if callFrequency/avgCallDistance features are correct
Issue: Inference time > 50ms
- Solution: Reduce the `max_samples` parameter
- Consider ONNX conversion for production
Issue: Import errors
- Solution: Run `pip install -r requirements.txt`
- Ensure Python 3.8+ is installed
- Fast Training: O(n log n) complexity
- Anomaly Detection: No need for labeled fraud examples during training
- Handles High Dimensions: Works well with engineered features
- Fast Inference: <50ms per prediction
- No Hyperparameter Tuning: Works well with default settings
- Guarantees: Hard rule provides absolute protection for delivery partners
- Flexibility: ML handles novel fraud patterns
- Speed: Rule-based stage is instant
- Interpretability: Clear reasoning for both stages
This project is part of the SentinalX fraud detection initiative.
To improve the model:
- Add new user profiles to `data_generator.py`
- Tune Isolation Forest parameters in `train_model.py`
- Add new engineered features
- Implement additional detection stages
For issues or questions:
- Check the troubleshooting section
- Review evaluation metrics
- Verify data quality
Once deployed, track:
- Fraud catch rate: % of actual fraud detected
- User satisfaction: Complaints about false positives
- Processing speed: Average inference time
- Model drift: Performance degradation over time
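A cheap first signal of drift is a sustained shift in the rolling fraud-flag rate away from the rate observed at deployment. A minimal sketch (the `DriftMonitor` class is illustrative, not part of the codebase):

```python
from collections import deque

class DriftMonitor:
    """Track the rolling fraud-flag rate over the last `window` predictions."""

    def __init__(self, baseline_rate, window=1000, tolerance=0.10):
        self.baseline = baseline_rate    # flag rate observed at deployment
        self.tolerance = tolerance       # allowed absolute deviation
        self.window = deque(maxlen=window)

    def record(self, is_fraud):
        self.window.append(int(is_fraud))

    def drifted(self):
        if len(self.window) < self.window.maxlen:
            return False                 # not enough data yet
        rate = sum(self.window) / len(self.window)
        return abs(rate - self.baseline) > self.tolerance
```

This only detects shifts in the prediction distribution, not in accuracy; a drift alarm should trigger a deeper evaluation against fresh labeled data.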
Target: 98%+ fraud detection with <1% false positive rate
Built with ❤️ for safer telecommunications