Pipelines as Expressions
In this tutorial, we’ll build an end-to-end machine learning pipeline using xorq expressions to predict the number of comments a Hacker News story will receive based on its title. The pipeline fetches live data, processes text, trains a model, and makes predictions, all within a single composable expression.
Fetching Live Data
First, we create a function to fetch the latest stories from Hacker News:
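As a rough sketch of what such a fetch function could look like, the snippet below reads the public Hacker News Firebase API using only the standard library. The function names (`get_json`, `fetch_latest_stories`), the choice of the `newstories` endpoint, and the three fields kept are assumptions made for illustration; the tutorial's own helper may differ.

```python
import json
from urllib.request import urlopen

# Base URL of the public Hacker News Firebase API.
HN_API = "https://hacker-news.firebaseio.com/v0"


def get_json(url, timeout=10):
    """Download one URL and decode its JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def fetch_latest_stories(n=100, fetch=get_json):
    """Return up to n recent stories as dicts with id, title and comment count.

    `fetch` is injectable (hypothetical design choice) so the function can be
    exercised without network access.
    """
    story_ids = fetch(f"{HN_API}/newstories.json")[:n]
    stories = []
    for story_id in story_ids:
        item = fetch(f"{HN_API}/item/{story_id}.json")
        # Deleted items can come back as null or without a title; skip them.
        if not item or item.get("type") != "story" or not item.get("title"):
            continue
        stories.append(
            {
                "id": item["id"],
                "title": item["title"],
                # `descendants` is the total comment count; treat absent as zero.
                "descendants": item.get("descendants", 0),
            }
        )
    return stories
```

Calling `fetch_latest_stories(50)` issues one request for the ID list and one per story, so a production version would likely batch or parallelize the item requests.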
We then load this data into xorq:
Text Vectorization
To process the story titles, we use TF-IDF vectorization:
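A minimal sketch of the vectorization step, assuming scikit-learn's `TfidfVectorizer` is the implementation behind it. The function names and the `max_features` and `stop_words` settings are illustrative choices, not values taken from the tutorial.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def fit_tfidf(titles, max_features=1000):
    """Learn the vocabulary and IDF weights from a collection of titles."""
    vectorizer = TfidfVectorizer(max_features=max_features, stop_words="english")
    vectorizer.fit(titles)
    return vectorizer


def transform_titles(vectorizer, titles):
    """Turn titles into a dense (n_titles, n_terms) float array."""
    return vectorizer.transform(titles).toarray()
```

Keeping fitting and transforming as separate functions means the vocabulary can be learned once on training titles and then applied unchanged to any later batch.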
These functions are wrapped as PyArrow UDFs so they can be used in expressions:
Train/Test Split and Feature Transformation
We split our data while preserving uniqueness based on ID and title:
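One common way to obtain a split keyed on ID and title is to hash the key and compare the result against a threshold, so that a given story always lands in the same partition across runs. The sketch below illustrates that idea in plain Python; whether xorq uses this exact scheme is an assumption, and `assign_split` and `split_rows` are hypothetical names.

```python
import hashlib


def assign_split(story_id, title, test_fraction=0.2):
    """Deterministically map an (id, title) key to 'train' or 'test'."""
    key = f"{story_id}|{title}".encode("utf-8")
    # Scale the first 8 bytes of the digest to a float in [0, 1).
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


def split_rows(rows, test_fraction=0.2):
    """Partition story dicts into (train, test) lists by hashed key."""
    train, test = [], []
    for row in rows:
        part = assign_split(row["id"], row["title"], test_fraction)
        (test if part == "test" else train).append(row)
    return train, test
```

Because the assignment depends only on the key, re-fetching the same story later cannot move it from the training set to the test set.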
And transform both sets using our vectorization UDFs:
Model Training and Prediction
We define functions for training an XGBoost regressor and making predictions with it:
Building the Pipeline Expression
Finally, we compose everything into a single expression that works as a pipeline:
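To illustrate what composing steps into one deferred expression means, here is a toy stand-in written in plain Python. It is not xorq's API: the `Deferred` class and its `pipe` and `execute` methods are hypothetical and only mimic the idea that building the pipeline records the steps, while execution happens later and shared intermediate results are computed once.

```python
class Deferred:
    """A node that records a function and its inputs without running it."""

    def __init__(self, func, *inputs, name=None):
        self.func = func
        self.inputs = inputs
        self.name = name or func.__name__

    def pipe(self, func, *extra, name=None):
        """Append a step that consumes this node's result as its first argument."""
        return Deferred(func, self, *extra, name=name)

    def execute(self, cache=None):
        """Evaluate the graph depth-first, computing each node at most once."""
        cache = {} if cache is None else cache
        if id(self) not in cache:
            args = [
                arg.execute(cache) if isinstance(arg, Deferred) else arg
                for arg in self.inputs
            ]
            cache[id(self)] = self.func(*args)
        return cache[id(self)]


def load_titles():
    return ["Show HN: a tiny database", "Rust compiler speedups"]


def count_words(titles):
    return [len(title.split()) for title in titles]


# Building the expression runs nothing yet.
pipeline = Deferred(load_titles).pipe(count_words)
# Work happens only when the expression is executed.
word_counts = pipeline.execute()
```

In this toy version the cache lives for a single `execute` call; a real engine can persist intermediate results across runs.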
To run the pipeline and get predictions:
The key advantage of this approach is that everything, from data fetching through prediction, is expressed as a single composable pipeline. xorq handles execution details and optimization, and can cache intermediate results.
This makes it easy to:
- Update the pipeline with new data
- Modify individual steps without rewriting the whole pipeline
- Cache and reuse expensive computations
- Execute different parts of the pipeline on different engines
This expression-based approach provides a clean, declarative way to build ML pipelines while maintaining the flexibility to use powerful ML libraries like scikit-learn and XGBoost.