Interface SupportsBucketing


@PublicEvolving public interface SupportsBucketing
Enables writing bucketed data into a DynamicTableSink.

Buckets enable load balancing in an external storage system by splitting data into disjoint subsets. These subsets group rows with potentially "infinite" keyspace into smaller and more manageable chunks that allow for efficient parallel processing.

Bucketing depends heavily on the semantics of the underlying connector. However, a user can influence the bucketing behavior by specifying the number of buckets, the bucketing algorithm, and (if the algorithm allows it) the columns which are used for target bucket calculation.

All bucketing components (i.e. bucket number, distribution algorithm, bucket key columns) are optional from a SQL syntax perspective. This ability interface defines which algorithms (listAlgorithms()) are effectively supported and whether a bucket count is mandatory (requiresBucketCount()). The planner will perform necessary validation checks.

Given the following SQL statements:


 -- Example 1
 CREATE TABLE MyTable (uid BIGINT, name STRING) DISTRIBUTED BY HASH(uid) INTO 4 BUCKETS;

 -- Example 2
 CREATE TABLE MyTable (uid BIGINT, name STRING) DISTRIBUTED BY (uid) INTO 4 BUCKETS;

 -- Example 3
 CREATE TABLE MyTable (uid BIGINT, name STRING) DISTRIBUTED BY (uid);

 -- Example 4
 CREATE TABLE MyTable (uid BIGINT, name STRING) DISTRIBUTED INTO 4 BUCKETS;
 

Example 1 declares a hash function on a fixed number of 4 buckets (i.e. HASH(uid) % 4 = target bucket). Example 2 leaves the selection of an algorithm up to the connector, represented as TableDistribution.Kind.UNKNOWN. Additionally, Example 3 leaves the number of buckets up to the connector. In contrast, Example 4 only defines the number of buckets.
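
As a rough illustration of the arithmetic in Example 1, the following sketch maps a key to one of 4 buckets. The class BucketAssignmentSketch and the method targetBucket are hypothetical names, and Long.hashCode only stands in for whatever hash function a connector actually applies; real connectors may hash differently.

```java
public class BucketAssignmentSketch {

    /**
     * Maps a key to a bucket index in [0, numBuckets).
     * Hypothetical helper for illustration; not part of the Flink API.
     */
    public static int targetBucket(long uid, int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        // floorMod keeps the result non-negative even if the hash is negative.
        return Math.floorMod(Long.hashCode(uid), numBuckets);
    }

    public static void main(String[] args) {
        // Every uid lands in exactly one of the 4 disjoint buckets.
        for (long uid : new long[] {1L, 2L, 7L, -5L}) {
            System.out.println("uid=" + uid + " -> bucket " + targetBucket(uid, 4));
        }
    }
}
```

Because the bucket index depends only on the key and the bucket count, rows with the same uid always end up in the same bucket.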

A sink can implement both SupportsPartitioning and SupportsBucketing. Conceptually, a partition can be seen as a kind of "directory", whereas buckets correspond to "files" within each directory. Partitioning splits the data on a small, human-readable keyspace (e.g. by year or by geographical region). This enables efficient selection via equality, inequality, or ranges because the existing partitions are known. Bucketing operates within partitions on a large, potentially infinite keyspace.
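
The "directory" and "file" analogy can be sketched as follows. The path format, the region partition column, and the class PartitionBucketLayoutSketch are all assumptions made for illustration; an actual connector decides its own physical layout.

```java
public class PartitionBucketLayoutSketch {

    /**
     * Builds a hypothetical "directory/file" location for a row:
     * the partition value selects the directory, the hashed key selects the file.
     */
    public static String location(String region, long uid, int numBuckets) {
        int bucket = Math.floorMod(Long.hashCode(uid), numBuckets);
        return "region=" + region + "/bucket-" + bucket;
    }

    public static void main(String[] args) {
        // Same partition, different buckets.
        System.out.println(location("EU", 7L, 4));
        System.out.println(location("EU", 8L, 4));
        // Same key, different partition.
        System.out.println(location("US", 7L, 4));
    }
}
```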

  • Method Details

    • listAlgorithms

      Set<TableDistribution.Kind> listAlgorithms()
      Returns the set of supported bucketing algorithms.

      The set must be non-empty. Otherwise, the planner will throw an error during validation.

      If specifying an algorithm is optional, this set must include TableDistribution.Kind.UNKNOWN.

    • requiresBucketCount

      boolean requiresBucketCount()
      Returns whether the DynamicTableSink requires a bucket count.

      If this method returns true, a bucket count is mandatory for the DynamicTableSink.

      If this method returns false, the DynamicTableSink may or may not consider the provided bucket count.
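
The contract of the two methods can be sketched with local stand-in types. Everything below is an assumption made for illustration: the Kind enum and the nested SupportsBucketing interface only mirror the signatures documented above so that the example compiles without Flink on the classpath, the RANGE value is assumed, and validate is a hypothetical approximation of the planner's checks rather than its actual implementation.

```java
import java.util.EnumSet;
import java.util.Set;

public class BucketingContractSketch {

    /** Local stand-in for TableDistribution.Kind (assumed values; not the Flink class). */
    public enum Kind { UNKNOWN, HASH, RANGE }

    /** Local mirror of the two methods documented above. */
    public interface SupportsBucketing {
        Set<Kind> listAlgorithms();
        boolean requiresBucketCount();
    }

    /**
     * Hypothetical validation. Returns null if the statement is acceptable,
     * otherwise an error message. "requested" is UNKNOWN when the statement
     * names no algorithm; "bucketCount" is null when no count was given.
     */
    public static String validate(SupportsBucketing sink, Kind requested, Integer bucketCount) {
        Set<Kind> supported = sink.listAlgorithms();
        if (supported.isEmpty()) {
            return "listAlgorithms() returned an empty set";
        }
        if (!supported.contains(requested)) {
            return "unsupported algorithm: " + requested;
        }
        if (sink.requiresBucketCount() && bucketCount == null) {
            return "bucket count is required";
        }
        return null;
    }

    /** Hypothetical sink: HASH only, so both the algorithm and the count are mandatory. */
    public static final SupportsBucketing HASH_ONLY = new SupportsBucketing() {
        @Override
        public Set<Kind> listAlgorithms() {
            return EnumSet.of(Kind.HASH);
        }

        @Override
        public boolean requiresBucketCount() {
            return true;
        }
    };

    public static void main(String[] args) {
        System.out.println(validate(HASH_ONLY, Kind.HASH, 4));     // shaped like Example 1
        System.out.println(validate(HASH_ONLY, Kind.UNKNOWN, 4));  // shaped like Example 2
        System.out.println(validate(HASH_ONLY, Kind.HASH, null));  // algorithm given, no count
    }
}
```

A sink that wants to accept statements shaped like Example 2 or Example 3 would add Kind.UNKNOWN to the returned set, and one that tolerates a missing count would return false from requiresBucketCount().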