Data Projects
SweetPulse
Personal Project, Vancouver-based dessert cafe (name protected under NDA, references available upon request)
github.com/awlh18/SweetPulse
About the project:
After learning that the owner of a local dessert cafe wanted to understand how weather impacts sales, I reached out to explore the question. I conducted a multiple regression analysis using historical sales data provided by the owner and weather data for Metro Vancouver sourced from vancouver.weatherstats.ca. The initial findings revealed statistically significant correlations between sales of specific products and weather conditions.
Recognizing the potential value from the analysis, the project evolved into a forecasting and analytics tool - SweetPulse. The tool provides real-time sales and order volume predictions based on weather forecasts and other operational inputs, as well as interactive visualizations to uncover actionable sales patterns. The objective is to support data-driven decision-making for daily operations planning.
Highlights:
- Conducted exploratory and statistical analyses to understand key revenue drivers, sales trends, and seasonality; identified items with strong weather correlation and quantified impact of independent business campaigns on sales.
- Trained and deployed Linear and Poisson Regression models to forecast daily revenue and order volume, achieving prediction errors within 8% of average sales and 10% of average orders.
- Created a Streamlit dashboard application to monitor KPIs, visualize insights, and access demand forecasts; integrated a retraining pipeline allowing cafe staff to update models and dashboard with new data.
- Implemented model diagnostics to help users monitor model performance and validity, supporting future retraining and model selection efforts.
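To make the "prediction errors within 8% of average sales" figure concrete, here is a minimal sketch of how an error can be expressed relative to average sales. It assumes mean absolute error (MAE) as the error measure, which is an assumption made for illustration; the function name and the revenue figures are hypothetical and are not taken from the SweetPulse repository.

```python
from statistics import mean


def relative_mae(actual, predicted):
    """Mean absolute error expressed as a fraction of the mean actual value."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mae = mean(abs(a - p) for a, p in zip(actual, predicted))
    return mae / mean(actual)


# Made-up daily revenue figures, for illustration only.
actual_revenue = [1000.0, 1200.0, 800.0, 1000.0]
forecast_revenue = [950.0, 1260.0, 860.0, 970.0]

# Absolute errors are 50, 60, 60, 30 (MAE = 50); mean revenue is 1000.
print(f"{relative_mae(actual_revenue, forecast_revenue):.1%}")  # 5.0%
```

Scaling the error by average sales makes it comparable across targets with different units, such as revenue in dollars and order counts.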
Tools used:
- Python, pandas, NumPy, scikit-learn, statsmodels, Docker, Streamlit, plotly, pandera, Make
PhishSense 2.0
UBC MDS Capstone Project, UBC Cybersecurity
github.com/awlh18/phishsense-2.0
About the project:
One of UBC Cybersecurity’s core responsibilities is protecting UBC email users from phishing attacks. While the team uses machine learning to classify reported emails, the existing model relies solely on word occurrences in the header and body, which limits its performance and necessitates manual review by a human analyst. This process is further strained by the high volume of reported emails and the time sensitivity of addressing these threats.
The project aims to develop a new machine learning pipeline that accurately distinguishes benign emails from suspicious ones, thereby reducing the number of tickets requiring manual review. The solution shortens response time to truly malicious threats, reduces manual workload, and strengthens UBC’s cybersecurity.
Highlights:
- Explored a range of approaches, including classical machine learning models, Bidirectional Encoder Representations from Transformers (BERT) based models, and large language models (LLMs).
- The final data product employs a stacking classifier with four XGBoost base estimators (each learning from a different set of email features) and a Support Vector Classifier as the final classifier.
- Achieved an F1-score of 0.75 for benign email classification, an improvement of 0.20 over the existing model. This gain is projected to reduce manual review time by 10 hours per month.
- Built a data preprocessing pipeline to transform raw email files into structured tabular data with extracted metadata and content features for downstream analysis and model training.
- Incorporated the pipeline into a web service for integration into UBC Cybersecurity’s existing incident response workflow.
- Presented model design and projected operational efficiencies to UBC IT leadership, translating technical results into business insights.
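As a rough illustration of the preprocessing step described above, the sketch below turns one raw email into a flat row of metadata and content features using Python's standard-library email module. The feature names, the sample message, and the extract_features helper are hypothetical and chosen for illustration; they are not the features or code used in PhishSense 2.0.

```python
import re
from email import message_from_string
from email.utils import parseaddr

URL_PATTERN = re.compile(r"https?://\S+")

# Made-up message using reserved example domains.
RAW_EMAIL = """\
From: "IT Support" <support@example.com>
Reply-To: attacker@example.net
To: user@example.org
Subject: Urgent: verify your account
Content-Type: text/plain; charset="utf-8"

Please confirm your password at http://example.net/login today.
"""


def extract_features(raw_email):
    """Parse a raw email string into one row of tabular features."""
    msg = message_from_string(raw_email)
    sender = parseaddr(msg.get("From", ""))[1].lower()
    reply_to = parseaddr(msg.get("Reply-To", ""))[1].lower()
    # Only single-part text bodies are handled; multipart parsing is omitted.
    body = msg.get_payload() if not msg.is_multipart() else ""
    return {
        "sender_domain": sender.rpartition("@")[2],
        "reply_to_mismatch": bool(reply_to) and reply_to != sender,
        "subject_length": len(msg.get("Subject", "")),
        "url_count": len(URL_PATTERN.findall(body)),
    }


features = extract_features(RAW_EMAIL)
print(features)
```

Each dictionary can become one row of a pandas DataFrame, and different groups of columns could then feed separate base estimators in a stacking setup.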
Tools used:
- Python, pandas, scikit-learn, XGBoost, matplotlib, Flask, pandera, email, Click, Make, regex
Vancouver Condo Market Tracker
Personal Project
github.com/awlh18/CondoTracker
About the project:
An interactive Tableau dashboard that provides a holistic view of the City of Vancouver’s condo market alongside other indicators of the Canadian mortgage market.
The dashboard highlights Vancouver’s condo market activity, such as housing starts, completions, and inventory levels. It also tracks mortgage rate trends, origination volumes, and household financial vulnerability, offering a comprehensive view of market sentiment. The objective is to help real estate market participants, such as realtors, investors, renters, and policymakers, make more informed decisions.
Data are sourced from the public APIs of the Bank of Canada (BoC) and the Canada Mortgage and Housing Corporation (CMHC).
Tools used:
- Python, pandas, NumPy, requests, rpy2, R, Tableau
linreg_ally
UBC MDS Academic Project
github.com/UBC-MDS/linreg_ally
About the project:
linreg_ally is a Python package designed to help users determine whether Ordinary Least Squares (OLS) regression is an appropriate model for their data. The package automates key steps, including fitting the OLS model, checking data formatting, validating model assumptions, and detecting multicollinearity, so users can confirm that the data meets the prerequisites for a reliable model.
The project was part of the coursework for DSCI 524 - Collaborative Software Development, which emphasizes version control, testing, continuous integration, and modular programming in data science workflows.
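To illustrate the idea behind multicollinearity detection, here is a minimal sketch of the variance inflation factor (VIF) for the special case of exactly two predictors, where VIF = 1 / (1 - r^2) and r is their Pearson correlation. It does not use or reproduce the linreg_ally API; the function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model.

    Perfectly collinear predictors (r = 1 or -1) raise ZeroDivisionError.
    """
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Made-up predictor values, for illustration only (r = 0.8 here).
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]

vif = vif_two_predictors(x1, x2)
print(round(vif, 2))  # 2.78
```

With more than two predictors, each VIF comes from regressing one predictor on all the others. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.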
Highlights:
- Published on PyPI along with a CI/CD pipeline for automated testing and deployment.
Tools used:
- Python, Poetry, Cookiecutter, GitHub Actions, pytest