This repository contains my work, progress, and contributions for the Data Engineering Project, part of the Applied Data Science and Artificial Intelligence program.
The goal of the project is to design and implement a complete data engineering pipeline covering:
- Web data crawling from multiple dynamic websites
- Collection of structured data and media content
- Distributed storage using Hadoop HDFS
- Data redundancy and fault-tolerance testing
- Database design and implementation
- Data ingestion from HDFS into a relational database
- Business intelligence and analytical query development
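
As a rough illustration of the last two stages (ingestion into a relational database and an analytical query), here is a minimal sketch. It is not the project's actual code: an in-memory SQLite database stands in for the relational database, a JSON Lines string stands in for a file read from HDFS, and the `items` table, the record fields, and all function names are hypothetical.

```python
import json
import sqlite3

# Hypothetical sample of crawled records; the real fields depend on the target sites.
records = [
    {"site": "example-a", "title": "Item 1", "price": 10.0},
    {"site": "example-a", "title": "Item 2", "price": 30.0},
    {"site": "example-b", "title": "Item 3", "price": 20.0},
]


def to_json_lines(rows):
    """Serialize records as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(row) for row in rows)


def ingest(conn, json_lines):
    """Parse JSON Lines text and insert each record into the items table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (site TEXT, title TEXT, price REAL)"
    )
    rows = [json.loads(line) for line in json_lines.splitlines() if line.strip()]
    conn.executemany(
        "INSERT INTO items (site, title, price) VALUES (:site, :title, :price)",
        rows,
    )
    conn.commit()
    return len(rows)


def average_price_by_site(conn):
    """Example analytical query: average price per source site."""
    cursor = conn.execute(
        "SELECT site, AVG(price) FROM items GROUP BY site ORDER BY site"
    )
    return cursor.fetchall()


# SQLite in memory is an assumption made so the sketch runs without any setup.
conn = sqlite3.connect(":memory:")
inserted = ingest(conn, to_json_lines(records))
summary = average_price_by_site(conn)
print(inserted, summary)
```

In the real pipeline the JSON Lines text would come from files stored in HDFS, and the SQL would target whichever relational database the project uses; only the overall shape (parse, insert, aggregate) is meant to carry over.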
This repository serves as a record of:
- Project development progress
- Individual contributions
- Source code and scripts
- Configuration files
- Documentation
- Testing results
- Data processing workflows
The project uses the following tools and technologies:
- Python
- Playwright
- Hadoop HDFS
- Apache Spark
- Docker
- SQL
- Git & GitHub
The repository is updated throughout the project lifecycle to document:
- Data collection activities
- Data storage implementation
- Cluster setup and configuration
- Database development
- Query creation and analysis
- Testing and optimization