Cloud Analytics for Data Analyst Job Market Insights
Project Overview
This project implements cloud analytics and a data warehouse for historical data on data analyst job postings. We created an entity-relationship model, built a data warehouse on a star schema, and orchestrated a data pipeline using Extract, Transform, and Load (ETL) procedures. The project summarizes trends in the selected historical dataset to help job seekers develop the skills that improve their career prospects in data-oriented roles.
Methodology
Dataset Selection
We used a dataset of data analyst job postings obtained from Kaggle and collected through SERP API. The dataset contains over 16,000 job listings for data analysts in the US and is updated daily with more than 100 new postings.
ETL Process
We implemented an ETL process orchestrated by Apache Airflow that loads data into Google BigQuery. The process involves two layers: a staging layer, where data quality checks run, and a main layer, where the data is split into dimension and fact tables stored in our data warehouse.
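The two layers can be illustrated with a minimal Python sketch. It is not the project's actual code: the field names (`job_id`, `title`, `company`, `salary`), the choice of required fields, and the single company dimension are all assumptions made for illustration.

```python
# Assumed required fields; the real staging checks are not described in this README.
REQUIRED_FIELDS = ("job_id", "title", "company")


def passes_quality_checks(row):
    """Staging-layer check: every required field must be present and non-empty."""
    return all(row.get(field) for field in REQUIRED_FIELDS)


def build_star_schema(rows):
    """Split validated rows into one dimension table and one fact table."""
    company_keys = {}  # company name -> surrogate key
    fact_rows = []
    for row in rows:
        if not passes_quality_checks(row):
            continue  # rejected rows never reach the main layer
        # Reuse the existing surrogate key or assign the next one.
        key = company_keys.setdefault(row["company"], len(company_keys) + 1)
        fact_rows.append(
            {
                "job_id": row["job_id"],
                "title": row["title"],
                "company_key": key,
                "salary": row.get("salary"),
            }
        )
    dim_rows = [
        {"company_key": key, "company_name": name}
        for name, key in company_keys.items()
    ]
    return dim_rows, fact_rows
```

The fact table keeps only a surrogate key for each company, so descriptive attributes live in one place and the fact rows stay narrow.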
Cloud Architecture
Our cloud-based ETL setup uses Google Cloud Platform, including:
- Cloud Composer (built on Apache Airflow) for orchestration
- Cloud Storage buckets for data storage
- BigQuery as our data warehouse
- MongoDB Atlas for NoSQL storage
Airflow Pipeline
We created an Airflow DAG (Directed Acyclic Graph) that defines the pipeline, including tasks for data extraction, transformation, and loading. The pipeline is scheduled to run daily and includes features such as:
- A Google Cloud Storage (GCS) update sensor that triggers the pipeline
- Mail alerts for success, failure, and exceptions
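A DAG with these features might look roughly like the sketch below. It assumes Airflow 2.4 or later with the Google provider package installed and SMTP configured for email. The bucket, object, table, and email address are hypothetical placeholders, and the task layout is one plausible arrangement rather than the project's actual DAG.

```python
from datetime import datetime, timedelta

# Hypothetical placeholders: replace with your own bucket, object, table, and address.
BUCKET = "job-postings-raw"
SOURCE_OBJECT = "job_postings.csv"
STAGING_TABLE = "my_project.staging.job_postings"
ALERT_EMAIL = "team@example.com"

# Failure alerts are handled by Airflow itself through these default arguments.
DEFAULT_ARGS = {
    "owner": "airflow",
    "email": [ALERT_EMAIL],
    "email_on_failure": True,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}


def build_dag():
    """Build the daily DAG; Airflow is imported here so the module loads without it."""
    from airflow import DAG
    from airflow.operators.email import EmailOperator
    from airflow.providers.google.cloud.sensors.gcs import GCSObjectUpdateSensor
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
        GCSToBigQueryOperator,
    )

    with DAG(
        dag_id="job_postings_etl",
        start_date=datetime(2023, 1, 1),
        schedule="@daily",
        catchup=False,
        default_args=DEFAULT_ARGS,
    ) as dag:
        # Wait until the source object in GCS has been updated.
        wait_for_update = GCSObjectUpdateSensor(
            task_id="wait_for_gcs_update",
            bucket=BUCKET,
            object=SOURCE_OBJECT,
        )
        # Load the raw file into the staging table.
        load_staging = GCSToBigQueryOperator(
            task_id="load_to_staging",
            bucket=BUCKET,
            source_objects=[SOURCE_OBJECT],
            destination_project_dataset_table=STAGING_TABLE,
            source_format="CSV",
            skip_leading_rows=1,
            autodetect=True,
            write_disposition="WRITE_TRUNCATE",
        )
        # Send a success email once the load has finished.
        notify_success = EmailOperator(
            task_id="notify_success",
            to=ALERT_EMAIL,
            subject="job_postings_etl succeeded",
            html_content="The daily load finished without errors.",
        )
        wait_for_update >> load_staging >> notify_success
    return dag
```

In a real DAG file, a module-level `dag = build_dag()` would be needed so the scheduler can discover the DAG. The transformation tasks that build the dimension and fact tables would sit between the load and the notification.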
Analysis
We performed various analyses using SQL queries on Google BigQuery and created visualizations in Tableau. Key insights include:
- Number of work-from-home (WFH) postings by company
- Average salary by top companies
- Count of job postings by company
- Most mentioned skills in job postings
- Salary statistics by job title
- Job postings and mean salary by job schedule
- Top 10 cities with the highest average salary
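As an illustration of the kind of query behind these analyses, the sketch below counts postings and averages salary by company over a tiny star schema. It uses Python's built-in `sqlite3` as a local stand-in for BigQuery, and the table names, column names, and sample rows are invented for the example.

```python
import sqlite3

# In-memory database standing in for the warehouse; schema and rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_company (
        company_key INTEGER PRIMARY KEY,
        company_name TEXT
    );
    CREATE TABLE fact_job_posting (
        job_id TEXT PRIMARY KEY,
        company_key INTEGER REFERENCES dim_company (company_key),
        salary_yearly REAL
    );
    INSERT INTO dim_company VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO fact_job_posting VALUES
        ('a1', 1, 80000), ('a2', 1, 100000), ('b1', 2, NULL);
    """
)

# Join the fact table to the company dimension, then aggregate per company.
POSTINGS_BY_COMPANY = """
SELECT c.company_name,
       COUNT(*) AS postings,
       AVG(f.salary_yearly) AS avg_salary
FROM fact_job_posting AS f
JOIN dim_company AS c ON c.company_key = f.company_key
GROUP BY c.company_name
ORDER BY postings DESC, c.company_name
"""

results = conn.execute(POSTINGS_BY_COMPANY).fetchall()
for company_name, postings, avg_salary in results:
    print(company_name, postings, avg_salary)
```

`AVG` skips NULL salaries, so a company whose postings list no salary still appears in the count but has no average.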
Key Findings
- High demand for Data Analysts at the entry level
- Data Engineers have the highest average salary, at $217,500
- Most in-demand skills: SQL, Power BI, Tableau, Python, and Excel
- High demand in central US states (Missouri, Kansas, Oklahoma, Arkansas)
- Top recruiting companies: Upwork, Edward Jones, Talentify.io, Dice, and Insight Global
- Preferred recruitment channels: LinkedIn, Upwork, Indeed, Bebee
Conclusion
This project demonstrates how cloud-based ETL and analytics can derive meaningful insights from large datasets. Using Google Cloud Platform services together with SQL and Tableau, we processed and analyzed more than 16,000 job postings, producing insights that job seekers in the data analytics field can act on.