Data Projects

SweetPulse

Personal Project, Vancouver-based dessert cafe (name protected under NDA, references available upon request)
github.com/awlh18/SweetPulse

About the project:
After learning that the owner of a local dessert cafe wanted to understand how weather affects sales, I reached out to explore the question. I conducted a multiple regression analysis using historical sales data provided by the owner and Metro Vancouver weather data sourced from vancouver.weatherstats.ca. The initial findings revealed a statistically significant correlation between sales of specific products and weather conditions.

Recognizing the potential value of the analysis, I developed the project into a forecasting and analytics tool, SweetPulse. The tool provides real-time sales and order volume predictions based on weather forecasts and other operational inputs, as well as interactive visualizations that surface actionable sales patterns. The objective is to support data-driven decision-making in daily operations planning.

Highlights:

Trained and deployed linear and Poisson regression models to forecast daily revenue and order volume, achieving prediction errors within 8% of average sales and 10% of average orders.
Implemented model diagnostics to help users monitor model performance and validity, supporting future retraining and model selection efforts.
Developed a pipeline enabling cafe staff to easily retrain forecasting models and refresh dashboard visualizations as new sales data becomes available.
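
The order-volume forecasting idea above can be illustrated with a minimal sketch. It is not the SweetPulse code: the data is synthetic, the feature names (temperature, rainfall, weekend flag) are assumptions, and "error as a share of average orders" is interpreted here as mean absolute error divided by the mean of the observed counts, which may differ from the metric used in the project.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the cafe's data): hypothetical daily
# temperature (Celsius), rainfall (mm), and a weekend indicator.
n_days = 400
temperature = rng.uniform(0, 30, n_days)
rainfall = rng.exponential(3.0, n_days)
is_weekend = rng.integers(0, 2, n_days)
X = np.column_stack([temperature, rainfall, is_weekend])

# Simulated order counts whose log-mean is linear in the inputs,
# which is the relationship a Poisson regression assumes.
expected_orders = np.exp(4.0 + 0.02 * temperature - 0.03 * rainfall + 0.3 * is_weekend)
orders = rng.poisson(expected_orders)

X_train, X_test, y_train, y_test = train_test_split(
    X, orders, test_size=0.25, random_state=0
)

# Scale the features, then fit a lightly regularized Poisson GLM.
model = make_pipeline(StandardScaler(), PoissonRegressor(alpha=1e-4, max_iter=1000))
model.fit(X_train, y_train)

predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
relative_error = mae / y_test.mean()  # assumed definition of "error vs. average"
print(f"MAE as a share of average daily orders: {relative_error:.1%}")
```

A Poisson model suits count targets such as daily orders because its predictions are always positive, whereas a plain linear model can predict negative values.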

Tools used:

Python, pandas, NumPy, scikit-learn, statsmodels, Docker, Streamlit, plotly, pandera, Make

PhishSense 2.0

UBC MDS Capstone Project, UBC Cybersecurity
github.com/awlh18/phishsense-2.0

About the project:
One of UBC Cybersecurity’s core responsibilities is protecting UBC email users from phishing attacks. While the team uses machine learning to classify reported emails, the existing model relies solely on word occurrences in the header and body, which limits its performance and necessitates manual review by a human analyst. This process is further challenged by the high volume of reported emails and the time sensitivity of addressing these threats.

The project aims to develop a new machine learning pipeline that accurately distinguishes benign emails from suspicious ones, thereby reducing the number of tickets requiring manual review. The solution shortens response time to genuinely malicious threats, reduces manual workload, and strengthens UBC’s cybersecurity.

Highlights:

Explored a range of approaches, including classical machine learning models, Bidirectional Encoder Representations from Transformers (BERT) based models, and large language models (LLMs).
The final data product employs a stacking classifier with four XGBoost base estimators (each learning from a different set of email features) and a Support Vector Classifier as the final classifier.
Achieved an F1 score of 0.75 for benign email classification, an improvement of 0.20 over the existing model. The performance gain is projected to reduce manual review time by 10 hours per month.
Built a data preprocessing pipeline to transform raw email files into structured tabular data with extracted metadata and content features for downstream analysis and model training.
Incorporated pipeline into a containerized web service for integration into UBC Cybersecurity’s existing incident response workflow.
Presented model design and projected operational efficiencies to UBC IT leadership, translating technical results into business insights.

Tools used:

Python, pandas, scikit-learn, XGBoost, Podman, matplotlib, Flask, pandera, email, Click, Make, regex

linreg_ally

UBC MDS Academic Project
github.com/UBC-MDS/linreg_ally

About the project:
linreg_ally is a Python package designed to help users determine if Ordinary Least Squares (OLS) regression is an appropriate model for their data. The package automates key steps, including performing OLS regression, data formatting checks, assumption validation, and multicollinearity detection, ensuring the data meets the prerequisites for a reliable model.

The project is part of the coursework for DSCI 524 - Collaborative Software Development, which emphasizes incorporating version control, testing, continuous integration, and modular programming into data science workflows.
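
As an illustration of the multicollinearity detection step mentioned above, one common diagnostic is the variance inflation factor (VIF). The sketch below is not linreg_ally's API: the function name `variance_inflation_factors` is hypothetical, the data is synthetic, and the package may use a different method or threshold.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return the VIF of each column of X (hypothetical helper, not linreg_ally's API).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on all other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = residuals @ residuals
        ss_tot = ((target - target.mean()) ** 2).sum()
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)


rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)  # nearly a copy of x1, so highly collinear
x3 = rng.normal(size=200)             # independent of the other two
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

A frequently cited rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity. Here the first two columns should land far above that range, and the third close to 1.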

Highlights:

Published on PyPI along with a CI/CD pipeline for automated testing and deployment.

Tools used:

Python, Poetry, Cookiecutter, GitHub Actions, pytest