Table expressions
Tables are one of the core data structures in Ibis.
Table
An immutable and lazy dataframe.
Analogous to a SQL table or a pandas DataFrame. A table expression contains an ordered set of named columns, each with a single known type. Unless explicitly ordered with an `.order_by()`, the order of rows is undefined.
Table immutability means that the data underlying an Ibis Table cannot be modified: every method on a Table returns a new Table with those changes. Laziness means that an Ibis Table expression does not run your computation every time you call one of its methods. Instead, it is a symbolic expression that represents a set of operations to be performed, which is typically translated into a SQL query. That SQL query is then executed on a backend, where the data actually lives. The result (now small enough to be manageable) can then be materialized back into Python as a pandas/pyarrow/Python DataFrame/Column/scalar.
You will not create Table objects directly. Instead, you will create one:

- from a pandas DataFrame, pyarrow table, Polars table, or raw Python dicts/lists with `letsql.memtable(df)`
- from an existing table in a data platform with `connection.table("name")`
- from a file or URL, into a specific backend with `connection.read_csv/parquet/json("path/to/file")` (only some backends, typically local ones, support this)
- from a file or URL, into the default backend with `ibis.read_csv/read_json/read_parquet("path/to/file")`
Methods
Name | Description |
---|---|
alias | Create a table expression with a specific name alias . |
as_scalar | Inform ibis that the table expression should be treated as a scalar. |
count | Compute the number of rows in the table. |
difference | Compute the set difference of multiple table expressions. |
distinct | Return a Table with duplicate rows removed. |
dropna | Deprecated - use drop_null instead. |
fillna | Deprecated - use fill_null instead. |
filter | Select rows from table based on predicates . |
intersect | Compute the set intersection of multiple table expressions. |
limit | Select n rows from self starting at offset . |
order_by | Sort a table by one or more expressions. |
sample | Sample a fraction of rows from a table. |
select | Compute a new table expression using exprs and named_exprs . |
sql | Run a SQL query against a table expression. |
union | Compute the set union of multiple table expressions. |
view | Create a new table expression distinct from the current one. |
cache | Cache the results of a computation to improve performance on subsequent executions. |
into_backend | Convert the Expr to a table in the given backend con with an optional table name name . |
alias
Create a table expression with a specific name `alias`.
This method is useful for exposing an ibis expression to the underlying backend for use in the `Table.sql` method.
`.alias` creates a temporary view in the database. This side effect will be removed in a future version of xorq and is not part of the public API.
Parameters
Name | Type | Description | Default |
---|---|---|---|
alias | str | Name of the child expression | required |
Returns
Name | Type | Description |
---|---|---|
Table | A table expression |
Examples
as_scalar
Inform ibis that the table expression should be treated as a scalar.
Note that the table must have exactly one column and one row for this to work. If the table has more than one column, an error will be raised at expression construction time. If the table has more than one row, an error will be raised by the backend when the expression is executed.
Returns
Name | Type | Description |
---|---|---|
Scalar | A scalar subquery |
Examples
count
Compute the number of rows in the table.
Parameters
Name | Type | Description | Default |
---|---|---|---|
where | ir.BooleanValue | None | Optional boolean expression to filter rows when counting. | None |
Returns
Name | Type | Description |
---|---|---|
IntegerScalar | Number of rows in the table |
Examples
difference
Compute the set difference of multiple table expressions.
The input tables must have identical schemas.
Parameters
Name | Type | Description | Default |
---|---|---|---|
table | Table | A table expression | required |
*rest | Table | Additional table expressions | () |
distinct | bool | Only diff distinct rows not occurring in the calling table | True |
See Also
Returns
Name | Type | Description |
---|---|---|
Table | The rows present in self that are not present in the other tables. |
Examples
distinct
Return a Table with duplicate rows removed.
Similar to `pandas.DataFrame.drop_duplicates()`.
Some backends do not support `keep='last'`.
Parameters
Name | Type | Description | Default |
---|---|---|---|
on | str | Iterable[str] | s.Selector | None | Only consider certain columns for identifying duplicates. By default, deduplicate all of the columns. | None |
keep | Literal['first', 'last'] | None | Determines which duplicates to keep. - "first" : Drop duplicates except for the first occurrence. - "last" : Drop duplicates except for the last occurrence. - None : Drop all duplicates. | 'first' |
Examples
Compute the distinct rows of a subset of columns
Drop all duplicate rows except the first
Drop all duplicate rows except the last
Drop all duplicated rows
You can pass selectors to `on`.
The only valid values of `keep` are `"first"`, `"last"`, and `None`.
dropna
Deprecated - use `drop_null` instead.
fillna
Deprecated - use `fill_null` instead.
filter
Select rows from `table` based on `predicates`.
Parameters
Name | Type | Description | Default |
---|---|---|---|
predicates | ir.BooleanValue | Sequence[ir.BooleanValue] | IfAnyAll | Boolean value expressions used to select rows in table . | () |
Returns
Name | Type | Description |
---|---|---|
Table | Filtered table expression |
Examples
intersect
Compute the set intersection of multiple table expressions.
The input tables must have identical schemas.
Parameters
Name | Type | Description | Default |
---|---|---|---|
table | Table | A table expression | required |
*rest | Table | Additional table expressions | () |
distinct | bool | Only return distinct rows | True |
Returns
Name | Type | Description |
---|---|---|
Table | A new table containing the intersection of all input tables. |
See Also
Examples
limit
Select `n` rows from `self` starting at `offset`.
The result set is not deterministic without a call to `order_by`.
Parameters
Name | Type | Description | Default |
---|---|---|---|
n | int | None | Number of rows to include. If None , the entire table is selected starting from offset . | required |
offset | int | Number of rows to skip first | 0 |
Returns
Name | Type | Description |
---|---|---|
Table | The first n rows of self starting at offset |
Examples
You can use `None` with `offset` to slice starting from a particular row.
See Also
order_by
Sort a table by one or more expressions.
Similar to `pandas.DataFrame.sort_values()`.
Parameters
Name | Type | Description | Default |
---|---|---|---|
by | str | ir.Column | s.Selector | Sequence[str] | Sequence[ir.Column] | Sequence[s.Selector] | None | Expressions to sort the table by. | () |
Returns
Name | Type | Description |
---|---|---|
Table | Sorted table |
Examples
Sort by b. Default is ascending. Note how capital letters come before lowercase
Sort in descending order
You can also use the deferred API to get the same result
Sort by multiple columns/expressions
You can pass arbitrary expressions to use as sort keys. For example, to ignore the case of the strings in column b.
This means that shuffling a Table is super simple
Selectors are allowed as sort keys and are a concise way to sort by multiple columns matching some criteria
Use the `across` selector to apply a specific order to multiple columns
sample
Sample a fraction of rows from a table.
Sampling is by definition a random operation. Some backends support specifying a `seed` for repeatable results, but not all backends support that option. And some backends (DuckDB, for example) do support specifying a seed but may still not have repeatable results in all cases.
In all cases, results are backend-specific. An execution against one backend is unlikely to sample the same rows when executed against a different backend, even with the same `seed` set.
Parameters
Name | Type | Description | Default |
---|---|---|---|
fraction | float | The percentage of rows to include in the sample, expressed as a float between 0 and 1. | required |
method | Literal['row', 'block'] | The sampling method to use. The default is "row", which includes each row with a probability of fraction . If method is "block", some backends may instead sample a fraction of blocks of rows (where "block" is a backend-dependent definition). This is identical to "row" for backends lacking a blockwise sampling implementation. For those coming from SQL, "row" and "block" correspond to "bernoulli" and "system" respectively in a TABLESAMPLE clause. | 'row' |
seed | int | None | An optional random seed to use, for repeatable sampling. The range of possible seed values is backend specific (most support at least [0, 2**31 - 1] ). Backends that never support specifying a seed for repeatable sampling will error appropriately. Note that some backends (like DuckDB) do support specifying a seed, but may still not have repeatable results in all cases. | None |
Returns
Name | Type | Description |
---|---|---|
Table | The input table, with fraction of rows selected. |
Examples
Sample approximately half the rows, with a seed specified for reproducibility.
select
Compute a new table expression using `exprs` and `named_exprs`.
Passing an aggregate function to this method will broadcast the aggregate's value over the number of rows in the table and automatically construct a window function expression. See the examples section for more details.
For backwards compatibility the keyword argument `exprs` is reserved and cannot be used to name an expression. This behavior will be removed in v4.
Parameters
Name | Type | Description | Default |
---|---|---|---|
exprs | ir.Value | str | Iterable[ir.Value | str] | Column expression, string, or list of column expressions and strings. | () |
named_exprs | ir.Value | str | Column expressions | {} |
Returns
Name | Type | Description |
---|---|---|
Table | Table expression |
Examples
Simple projection
In that simple case, you could also just use Python's indexing syntax
Projection by zero-indexed column position
Projection with renaming and compute in one call
You can do the same thing with a named expression, and using the deferred API
Projection with aggregation expressions
Projection with a selector
Projection + aggregation across multiple columns
sql
Run a SQL query against a table expression.
Parameters
Name | Type | Description | Default |
---|---|---|---|
query | str | Query string | required |
dialect | str | None | Optional string indicating the dialect of query . Defaults to the backend’s native dialect. | None |
Returns
Name | Type | Description |
---|---|---|
Table | An opaque table expression |
Examples
Mix and match ibis expressions with SQL queries
Because ibis expressions aren't named, they aren't visible to subsequent `.sql` calls. Use the `alias` method to assign a name to an expression.
See Also
union
Compute the set union of multiple table expressions.
The input tables must have identical schemas.
Parameters
Name | Type | Description | Default |
---|---|---|---|
table | Table | A table expression | required |
*rest | Table | Additional table expressions | () |
distinct | bool | Only return distinct rows | False |
Returns
Name | Type | Description |
---|---|---|
Table | A new table containing the union of all input tables. |
See Also
Examples
view
Create a new table expression distinct from the current one.
Use this API for any self-referencing operations like a self-join.
Returns
Name | Type | Description |
---|---|---|
Table | Table expression |
cache
Cache the results of a computation to improve performance on subsequent executions. This method allows you to cache the results of a computation either in memory, on disk using Parquet files, or in a database table. The caching strategy and storage location are determined by the storage parameter.
Parameters
Name | Type | Description | Default |
---|---|---|---|
storage | CacheStorage | The storage strategy to use for caching. Can be one of: - ParquetCacheStorage: Caches results as Parquet files on disk - SourceStorage: Caches results in the source database - ParquetSnapshot: Creates a snapshot of data in Parquet format - SnapshotStorage: Creates a snapshot in the source database If None, uses the default storage configuration. | None |
Returns
Name | Type | Description |
---|---|---|
Expr | A new expression that represents the cached computation. |
Notes
The cache method supports two main strategies:
1. ModificationTimeStrategy: Tracks changes based on modification time
2. SnapshotStrategy: Creates point-in-time snapshots of the data
Each strategy can be combined with either Parquet or database storage.
Examples
Using ParquetCacheStorage:
Using SourceStorage with PostgreSQL:
Using cache with filtering:
See Also
ParquetCacheStorage : Storage implementation for Parquet files
SourceStorage : Storage implementation for database tables
ModificationTimeStrategy : Strategy for tracking changes by modification time
SnapshotStrategy : Strategy for creating data snapshots
Notes
- The cache is identified by a unique key based on the computation and strategy
- Cache invalidation is handled automatically based on the chosen strategy
- Cross-source caching (e.g., from PostgreSQL to DuckDB) is supported
- Cache locations can be configured globally through letsql.config.options
into_backend
Convert the Expr to a table in the given backend `con`, with an optional table name `name`.
The table is backed by a PyArrow RecordBatchReader; the RecordBatchReader is teed so it can safely be reused without spilling to disk.
Parameters
Name | Type | Description | Default |
---|---|---|---|
con | | The backend where the table should be created | required |
name | | The name of the table | None |