Core Concepts
The core concepts you need to understand xorq.
Caching System
xorq provides a sophisticated caching system that enables efficient iterative development of ML pipelines. The caching system allows you to:
- Cache results from upstream query engines
- Persist data locally or in remote storage
- Automatically invalidate cache when source data changes
- Chain caches across multiple engines
Storage Types
xorq supports two main types of cache storage, SourceStorage and SnapshotStorage, plus ParquetCacheStorage, a special case of SourceStorage:
1. SourceStorage
   - Automatically invalidates cache when upstream data changes
   - Persistence depends on the source backend
   - Supports both remote (Snowflake, Postgres) and in-process (pandas, DuckDB) backends
2. SnapshotStorage
   - No automatic invalidation
   - Ideal for one-off analyses
   - Persistence depends on the source backend
3. ParquetCacheStorage
   - Special case of SourceStorage
   - Caches results as Parquet files on local disk
   - Uses the source backend for writing
   - Ensures durable persistence
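The difference between automatic invalidation and snapshot behaviour can be sketched in plain Python. This is a toy model, not xorq's implementation: `FakeSource`, `SourceLikeCache`, and `SnapshotLikeCache` are hypothetical names, and the integer `version` counter is an assumed stand-in for whatever change signal a real backend exposes.

```python
import hashlib
import json


class FakeSource:
    """Hypothetical upstream table that bumps a version on every change."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.version = 0

    def append(self, row):
        self.rows.append(row)
        self.version += 1

    def read(self):
        return list(self.rows)


def _digest(*parts):
    # Stable hash of the key components.
    return hashlib.sha256(json.dumps(parts).encode()).hexdigest()


class SourceLikeCache:
    """Cache key includes the source version, so upstream changes cause a miss."""

    def __init__(self):
        self.store = {}
        self.computations = 0

    def key(self, source, query_name):
        return _digest(query_name, source.version)

    def get(self, source, query_name, compute):
        k = self.key(source, query_name)
        if k not in self.store:
            self.computations += 1
            self.store[k] = compute(source.read())
        return self.store[k]


class SnapshotLikeCache(SourceLikeCache):
    """Cache key ignores the source state, so the first result is kept."""

    def key(self, source, query_name):
        return _digest(query_name)


source = FakeSource([1, 2, 3])
source_cache = SourceLikeCache()
snapshot_cache = SnapshotLikeCache()

first_source = source_cache.get(source, "total", sum)
first_snapshot = snapshot_cache.get(source, "total", sum)

source.append(4)  # upstream data changes

second_source = source_cache.get(source, "total", sum)      # recomputed
second_snapshot = snapshot_cache.get(source, "total", sum)  # stale by design
```

In this sketch the source-aware cache recomputes after the append, while the snapshot cache keeps returning the original total, which is the trade-off that makes snapshots suitable for one-off analyses.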
Hashing Strategies
Cache invalidation uses different hashing strategies based on the storage type:
| Storage Type | Hash Components |
|---|---|
| In-Memory | Data bytes + Schema |
| Disk-Based | Query plan + Schema |
| Remote | Table metadata + Last modified time |
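To make the table concrete, here is a minimal sketch of how such keys could be derived with the standard library's `hashlib`. The function names, the dictionary-based schema and metadata, and the choice of SHA-256 are all assumptions made for illustration; the actual hashing code in xorq may differ.

```python
import hashlib


def _sha(*parts: bytes) -> str:
    h = hashlib.sha256()
    for part in parts:
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


def _encode_mapping(mapping: dict) -> bytes:
    # Sort items so the key does not depend on insertion order.
    return repr(sorted(mapping.items())).encode()


def in_memory_key(data: bytes, schema: dict) -> str:
    """In-memory data: hash the data bytes together with the schema."""
    return _sha(data, _encode_mapping(schema))


def disk_key(query_plan: str, schema: dict) -> str:
    """Disk-based data: hash the query plan together with the schema."""
    return _sha(query_plan.encode(), _encode_mapping(schema))


def remote_key(table_metadata: dict, last_modified: str) -> str:
    """Remote data: hash table metadata together with the last-modified time."""
    return _sha(_encode_mapping(table_metadata), last_modified.encode())


schema = {"id": "int64", "amount": "float64"}

key_a = in_memory_key(b"\x01\x02\x03", schema)
key_b = in_memory_key(b"\x01\x02\x04", schema)  # one byte changed

remote_before = remote_key({"name": "orders", "rows": 100}, "2024-01-01T00:00:00")
remote_after = remote_key({"name": "orders", "rows": 100}, "2024-01-02T00:00:00")
```

Any change to a hashed component yields a new key, so a lookup under the new key misses and the result is recomputed.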
Key Benefits
- Faster Iteration:
  - Reduce network calls to source systems
  - Minimize recomputation of expensive operations
  - Cache intermediate results for complex pipelines
- Declarative Integration:
  - Chain cache operations anywhere in the expression
  - Transparent integration with existing pipelines
  - Multiple storage options for different use cases
- Automatic Management:
  - Smart invalidation based on source changes
  - No manual cache management required
  - Efficient storage utilization
- Multi-Engine Support:
  - Cache data between different engines
  - Optimize storage location for performance
  - Flexible persistence options
Multi-Engine System
xorq’s multi-engine system enables seamless data movement between different query engines, allowing you to leverage the strengths of each engine while maintaining a unified workflow.
The into_backend Operator
The core of xorq’s multi-engine capability is the into_backend operator, which enables:
- Transparent data movement between engines
- Zero-copy data transfer using Apache Arrow
- Automatic optimization of data placement
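The idea of handing a partial result from one engine to another can be illustrated with two independent in-memory SQLite databases standing in for two engines. This is only a conceptual sketch: `into_backend_sketch` is a hypothetical helper written for this example, it copies rows rather than using Arrow, and it does not reflect the signature of xorq's into_backend.

```python
import sqlite3


def into_backend_sketch(src, query, dst, table):
    """Run `query` on `src` and materialise its result as `table` in `dst`."""
    cursor = src.execute(query)
    names = [column[0] for column in cursor.description]
    columns = ", ".join(f'"{name}"' for name in names)
    placeholders = ", ".join("?" for _ in names)
    dst.execute(f'CREATE TABLE "{table}" ({columns})')
    # The source cursor is consumed lazily, row by row.
    dst.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', cursor)
    dst.commit()
    return table


# Two separate "engines", each holding a different table.
engine_a = sqlite3.connect(":memory:")
engine_b = sqlite3.connect(":memory:")

engine_a.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL)")
engine_a.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 5.0), (2, 11, 50.0), (3, 10, 75.0)],
)
engine_b.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
engine_b.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(10, "ada"), (11, "grace")],
)

# Filter in engine A, then continue the pipeline in engine B.
into_backend_sketch(
    engine_a,
    "SELECT id, customer_id, amount FROM orders WHERE amount > 20",
    engine_b,
    "big_orders",
)
joined = engine_b.execute(
    "SELECT c.name, o.amount FROM big_orders o "
    "JOIN customers c USING (customer_id) ORDER BY o.id"
).fetchall()
```

The filter runs where the orders live, and only the reduced result crosses over to the engine that holds the customers table.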
Supported Engines
xorq currently supports:
- In-Process Engines
  - DuckDB
  - DataFusion
  - pandas
- Distributed Engines
  - Trino
  - Snowflake
  - BigQuery
Engine Selection Guidelines
Choose engines based on their strengths:
- DuckDB: Local processing, AsOf joins, efficient file formats
- DataFusion: Custom UDFs, streaming processing
- Trino: Distributed queries, federation, security
- Snowflake/BigQuery: Managed infrastructure, scalability
Data Transfer
Data movement between engines is handled through:
- Arrow Flight: Zero-copy data transfer protocol
- Memory Management: Automatic spilling to disk
- Batching: Efficient chunk-based processing
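Chunk-based processing can be sketched with a small generator. This is a generic illustration of batching, assuming a plain Python iterable as the source and a list as the sink; it does not model Arrow Flight or spilling to disk, and `iter_batches` and `transfer` are hypothetical names.

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most `batch_size` rows without loading all rows at once."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def transfer(rows, sink, batch_size=1000):
    """Move rows into `sink` one batch at a time; return the number of batches."""
    batches_sent = 0
    for batch in iter_batches(rows, batch_size):
        sink.extend(batch)
        batches_sent += 1
    return batches_sent


received = []
batch_count = transfer(range(7), received, batch_size=3)
```

Only one batch is held in memory by the sender at any time, which is what keeps peak memory bounded for large transfers.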
Custom UD(X)F System
xorq lets you extend query engines with custom User-Defined Functions (UDFs). Three key types are supported:
1. Scalar UDF with Model Integration
2. User Defined Aggregate Functions
You can also use a UDAF to train ML models; see the example for training an XGBoost model.
3. Window UDF for Analysis
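As a rough illustration of how the three kinds differ in shape, the sketch below registers a scalar and an aggregate function with Python's built-in `sqlite3` module and writes the window case as a plain function. SQLite is used only as a convenient stand-in engine; the function names and data are made up for this example, and xorq's own UDF API is not shown here.

```python
import sqlite3


def fahrenheit(celsius):
    """Scalar UDF: one input row produces one output value."""
    return celsius * 9 / 5 + 32


class ValueRange:
    """Aggregate UDF: many input rows produce one output value (max - min)."""

    def __init__(self):
        self.low = None
        self.high = None

    def step(self, value):
        self.low = value if self.low is None else min(self.low, value)
        self.high = value if self.high is None else max(self.high, value)

    def finalize(self):
        return None if self.low is None else self.high - self.low


def rolling_mean(values, window):
    """Window function: one output per row, computed over a trailing frame."""
    result = []
    for i in range(len(values)):
        frame = values[max(0, i - window + 1): i + 1]
        result.append(sum(frame) / len(frame))
    return result


con = sqlite3.connect(":memory:")
con.create_function("fahrenheit", 1, fahrenheit)
con.create_aggregate("value_range", 1, ValueRange)
con.execute("CREATE TABLE readings (city TEXT, celsius REAL)")
con.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("oslo", 0.0), ("oslo", 10.0), ("rome", 20.0), ("rome", 35.0)],
)

scalar_rows = con.execute(
    "SELECT fahrenheit(celsius) FROM readings ORDER BY rowid"
).fetchall()
aggregate_rows = con.execute(
    "SELECT city, value_range(celsius) FROM readings GROUP BY city ORDER BY city"
).fetchall()
window_values = rolling_mean([0.0, 10.0, 20.0, 35.0], window=2)
```

The scalar function returns as many values as there are rows, the aggregate returns one value per group, and the window function returns one value per row while looking at neighbouring rows.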
Ephemeral Flight Service
xorq’s Ephemeral Flight Service provides a high-performance data transfer mechanism between engines using Apache Arrow Flight. Unlike traditional data transfer methods, this service provides:
- Automatic Lifecycle Management
- Zero-Copy Data Movement
  - Direct memory transfer between processes
  - No serialization/deserialization overhead
  - Efficient handling of large datasets
- Process Isolation
  - Separate processes for different engines
  - Independent resource management
  - Fault isolation
- Resource Management
- Security Integration
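The "ephemeral" part of the service, a server that exists only for the duration of a task, can be sketched with the standard library. This example is an assumption-laden stand-in: it uses a plain TCP server in a background thread instead of an Arrow Flight server in a separate process, and `ephemeral_service` and `request` are hypothetical names.

```python
import contextlib
import socket
import socketserver
import threading


class _UppercaseHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Read one line and send it back in upper case.
        line = self.rfile.readline()
        self.wfile.write(line.upper())


@contextlib.contextmanager
def ephemeral_service(handler=_UppercaseHandler):
    """Start a server on a free local port and tear it down on exit."""
    server = socketserver.TCPServer(("127.0.0.1", 0), handler)  # port 0: OS picks one
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        yield server.server_address
    finally:
        server.shutdown()      # stop the serve_forever loop
        server.server_close()  # release the listening socket
        thread.join()


def request(address, payload: bytes) -> bytes:
    with socket.create_connection(address, timeout=5) as sock:
        sock.sendall(payload + b"\n")
        return sock.makefile("rb").readline().rstrip(b"\n")


with ephemeral_service() as address:
    reply = request(address, b"hello")

# Once the block exits, the service is gone.
try:
    request(address, b"again")
    reachable_after_exit = True
except OSError:
    reachable_after_exit = False
```

Tying the server's lifetime to a context manager means callers never start or stop it by hand, which is the sense in which lifecycle management is automatic in this sketch.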
Comparison with Ibis
While xorq is built on top of Ibis, it extends its capabilities in several key ways.
Key Advantages over Ibis
- Unified Experience
  - Consistent API across engines
  - Seamless engine transitions
  - Integrated caching system
- ML Focus
  - Built-in ML tooling
  - Efficient feature engineering
  - Model training integration
- Development Workflow
  - Interactive development support
  - Caching for iteration