This tutorial will walk you through the key features of xorq, a data processing library that enables you to work seamlessly across multiple data engines.

Installation

First, install xorq using pip. We’ll include the duckdb extra to enable the DuckDB backend:

pip install 'xorq[duckdb]'
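
To confirm the install succeeded and see which version pip resolved, you can query the installed package metadata; this is plain standard-library Python, not a xorq API:

import importlib.metadata

print(importlib.metadata.version("xorq"))  # prints the installed xorq version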

Setting up Connections

Let’s start by creating connections to different backends:

import xorq as xo
from xorq import _  # import the underscore accessor for column reference

# Create connections to different backends
con = xo.connect()  # xorq's main connection
ddb = xo.duckdb.connect()  # DuckDB connection
pg = xo.postgres.connect_examples()  # Postgres connection

In this section, we:

  • Import xorq and its deferred underscore object (_), which allows us to refer to columns
  • Create a xorq connection (backed by the xorq backend)
  • Create a DuckDB connection
  • Create a Postgres connection

Note that you can create a custom Postgres connection by specifying the connection parameters explicitly, for example:

pg = xo.postgres.connect(
  host="localhost",
  port=5432,
  user="postgres",
  password="postgres",
  database="ibis_testing"
)
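
If you’d rather not hard-code credentials, one option is to read them from environment variables. This is a plain-Python sketch; the variable names (POSTGRES_HOST and friends) are illustrative, not ones xorq looks for itself:

import os

pg = xo.postgres.connect(
    host=os.environ.get("POSTGRES_HOST", "localhost"),
    port=int(os.environ.get("POSTGRES_PORT", "5432")),
    user=os.environ.get("POSTGRES_USER", "postgres"),
    password=os.environ.get("POSTGRES_PASSWORD", "postgres"),
    database=os.environ.get("POSTGRES_DATABASE", "ibis_testing"),
)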

Reading Data

Now let’s read some data into xorq:

# Read a parquet file using xorq
path = xo.config.options.pins.get_path("batting")
batting = con.read_parquet(path, table_name="batting")

xorq can read data from various sources. Here we’re reading a Parquet file directly. The table_name parameter specifies how this table will be referenced inside the con backend.
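
Parquet is just one option; if xorq’s backend mirrors the Ibis reader API (worth verifying against the xorq docs), a CSV file can be registered the same way. The file name below is hypothetical:

# Hypothetical: read a CSV file, assuming the connection exposes
# an Ibis-style read_csv alongside read_parquet
players = con.read_csv("players.csv", table_name="players")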

Basic Operations

Let’s perform some basic data operations:

# Filtering and selection
recent_batting = (
    batting.filter(batting.yearID > 2010)  # filter for recent years
    .select(['playerID', 'yearID', 'teamID', 'G', 'AB', 'R', 'H'])  # select specific columns
)

# Execute to see results
recent_batting.execute()

The output is

       playerID  yearID teamID    G     AB     R      H
0     utleych01    2013    PHI  131  476.0  73.0  135.0
1     valaich01    2013    MIA   22   64.0   4.0   14.0
2     valbulu01    2013    CHN  108  331.0  34.0   72.0
3     valdejo02    2013    NYN   66  133.0  16.0   25.0
4     valdelu01    2013    BAL    4    0.0   0.0    0.0
...         ...     ...    ...  ...    ...   ...    ...
7122  ugglada01    2013    ATL  136  448.0  60.0   80.0
7123  uptonbj01    2013    ATL  126  391.0  30.0   72.0
7124  uptonju01    2013    ATL  149  558.0  94.0  147.0
7125  uribeju01    2013    LAN  132  388.0  47.0  108.0
7126  urruthe01    2013    BAL   24   58.0   5.0   16.0

[7127 rows x 7 columns]

Note that xorq operations are lazy: nothing executes until you call execute(), which returns a pandas DataFrame.
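
Because execution is deferred, you can inspect an expression before running it. Assuming xorq follows Ibis expression semantics, schema() resolves column types without scanning any data, and printing the expression shows its plan rather than rows:

# Inspect the deferred expression without running it
# (assumes Ibis-style expression semantics)
print(recent_batting.schema())  # column names and types, resolved lazily
print(recent_batting)           # prints the expression tree, not data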

Multi-Engine Operations

One of xorq’s powerful features is the ability to move data between different backends using into_backend(). This method converts an expression from one backend into a table in another backend, using a PyArrow RecordBatchReader as an intermediate format:

# Read a table from Postgres and move it to xorq's backend
awards = pg.table("awards_players").into_backend(con, "awards")  # bring into xorq backend

# Perform a join between the two tables
player_awards = (
    recent_batting.join(
        awards,
        ['playerID', 'yearID'],  # join keys
        how='left'  # left join
    )
    .select([
        'playerID',
        'yearID',
        'teamID',
        'awardID',
        'G',
        'AB',
        'H'
    ])
)

player_awards.execute()

The corresponding output is:

       playerID  yearID teamID                      awardID    G     AB      H
0     victosh01    2011    PHI          Branch Rickey Award  132  519.0  145.0
1     aybarer01    2011    LAA                   Gold Glove  143  556.0  155.0
2     parrage01    2011    ARI                   Gold Glove  141  445.0  130.0
3     rodnefe01    2012    TBA  Comeback Player of the Year   76    0.0    0.0
4     molinya01    2012    SLN                   Gold Glove  138  505.0  159.0
...         ...     ...    ...                          ...  ...    ...    ...
7196  sanchjo01    2013    PIT                         None    6    5.0    0.0
7197  saundmi01    2013    SEA                         None  132  406.0   96.0
7198  selleju01    2013    LAN                         None   27   69.0   13.0
7199  skipwky01    2013    MIA                         None    4    3.0    0.0
7200  sogarer01    2013    OAK                         None  130  368.0   98.0

[7201 rows x 7 columns]

into_backend() is particularly useful when you want to:

  • Move data between different database engines
  • Combine data from multiple sources (see the sketch after this list)
  • Avoid writing intermediate results to disk
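
As a sketch of combining sources, data can also flow from DuckDB into the xorq backend. The file and table below are hypothetical; the pattern is the same into_backend() call used above:

# Hypothetical sketch: register a local Parquet file in DuckDB,
# then move it into the xorq backend for a cross-engine join
teams = ddb.read_parquet("teams.parquet", table_name="teams")  # hypothetical file
teams_in_con = teams.into_backend(con, "teams")
enriched = player_awards.join(teams_in_con, ["teamID"], how="left")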

Leveraging Different Backends

Different backends have different strengths. Let’s use DuckDB for some aggregations:

# Move data to DuckDB for additional processing
ddb_awards = player_awards.into_backend(ddb, "ddb_awards")

# Perform DuckDB-specific operations
ddb_summary = (
    ddb_awards.group_by(['yearID', 'awardID'])
    .agg([
        _.count().name('player_count'),
        _.G.mean().name('avg_games'),
        _.H.mean().name('avg_hits'),
    ])
    .order_by(['yearID', 'awardID'])
)

print("Award summary from DuckDB:")
ddb_summary.execute()

The output:

    yearID                              awardID  ...   avg_games    avg_hits
0     2011                             ALCS MVP  ...  124.000000  125.000000
1     2011                    All-Star Game MVP  ...  162.000000  170.000000
2     2011                      Babe Ruth Award  ...   97.000000   99.000000
3     2011                  Branch Rickey Award  ...  132.000000  145.000000
4     2011          Comeback Player of the Year  ...  151.500000  179.500000
..     ...                                  ...  ...         ...         ...
80    2015                   Rookie of the Year  ...  125.000000  131.000000
81    2015                       Silver Slugger  ...  144.555556  160.722222
82    2015  TSN Major League Player of the Year  ...  158.000000  184.000000
83    2015                     World Series MVP  ...  142.000000  138.000000
84    2015                                 None  ...   45.295580   25.419199

[85 rows x 5 columns]

Caching Expressions

xorq provides caching capabilities to optimize performance:

from pathlib import Path
from xorq.common.caching import ParquetCacheStorage

# Create a storage for cached data
cache_storage = ParquetCacheStorage(source=con, path=Path.cwd())

# Cache the results
cached_awards = player_awards.cache(storage=cache_storage)

# First execution populates the cache; subsequent executions read from it
cached_awards.execute()
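
To see what the cache produced, you can look for Parquet files under the storage path. This assumes ParquetCacheStorage materializes results as Parquet files beneath the path you configured:

# Verify the cache on disk; assumes cached results land somewhere
# under the path passed to ParquetCacheStorage
for cached_file in Path.cwd().rglob("*.parquet"):
    print(cached_file)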

Key Takeaways

  • xorq provides a consistent interface across different data engines
  • Operations are lazy until .execute() is called
  • Data can be moved between backends using into_backend()
  • Caching helps optimize performance for frequently used queries
  • Different backends can be leveraged for their specific strengths