Package org.apache.polaris.persistence.nosql.maintenance.api


Maintenance operations are tasks that are executed regularly against a backend database.

Types of maintenance operations include:

  • Purging unreferenced objects and references within a catalog
  • Purging whole catalogs that are marked to be purged
  • Purging whole realms that are marked to be purged

Discussion

Not all databases support "prefix-key" deletions, which are necessary, for example, to purge a whole realm. Some databases do support "deleting a huge number of rows". Some offer a separate API for prefix-key deletions, for example dropRowRange on the table-admin client of Google's Bigtable. Relational databases may require different configurations with respect to isolation level to run those maintenance operations in a "better" way. Some databases, for example Apache Cassandra, RocksDB, or Amazon's DynamoDB, do not support prefix-key deletions at all.

Backend implementations therefore expose whether they can leverage "prefix-key deletions" when one or more realms are to be purged. If a Backend does not support "prefix-key deletions", the whole repository has to be scanned.
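
The capability check described above can be sketched as follows. This is a minimal illustration, not the actual Backend API: the names SketchBackend, supportsPrefixKeyDeletion, deleteRealmByPrefix and purgeRealms are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/** Illustrative only: all names here are hypothetical, not the Polaris Backend API. */
final class RealmPurgeSketch {

  /** Hypothetical, minimal view of a backend for this example. */
  interface SketchBackend {
    /** Whether the database can delete all rows that share a key prefix. */
    boolean supportsPrefixKeyDeletion();

    /** Deletes all rows of the realm; only meaningful if prefix-key deletion is supported. */
    void deleteRealmByPrefix(String realmId);
  }

  /**
   * Purges realms directly where the backend allows it.
   *
   * @return the realms that could not be purged directly and therefore
   *     have to be handled by a scan of the whole repository
   */
  static List<String> purgeRealms(SketchBackend backend, Collection<String> realmsToPurge) {
    List<String> requireFullScan = new ArrayList<>();
    for (String realmId : realmsToPurge) {
      if (backend.supportsPrefixKeyDeletion()) {
        backend.deleteRealmByPrefix(realmId);
      } else {
        requireFullScan.add(realmId);
      }
    }
    return requireFullScan;
  }
}
```

A caller would run the repository scan only if the returned list is non-empty.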

Purging unreferenced data

The other maintenance operations, like purging a catalog or purging unreferenced objects or references, use the following approach, which works even for large multi-tenant setups:

  1. Memoize the current timestamp, minus some amount to account for expected wall-clock drift.
  2. Identify all objects and references that must be retained and memoize those in a probabilistic data structure (Bloom filter). See below.
  3. Scan the whole database to identify the objects and references that were not identified as being referenced in the previous step.
  4. Delete the unreferenced objects and references if, and only if, their createdAtMicros() timestamp is less than the timestamp memoized in the first step.
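
The steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the maintenance service implementation: Item, TinyBloomFilter, cutoffMicros and findPurgeable are hypothetical names, the toy Bloom filter with two hash functions stands in for whatever filter the real implementation uses, and only the createdAtMicros name comes from the text above.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Illustrative only: all types and methods here are hypothetical. */
final class PurgeUnreferencedSketch {

  /** Hypothetical stored item: an ID plus its creation timestamp in microseconds. */
  static final class Item {
    final String id;
    final long createdAtMicros;

    Item(String id, long createdAtMicros) {
      this.id = id;
      this.createdAtMicros = createdAtMicros;
    }
  }

  /** Toy Bloom filter: may report false positives, never false negatives. */
  static final class TinyBloomFilter {
    private final BitSet bits;
    private final int size;

    TinyBloomFilter(int size) {
      this.size = size;
      this.bits = new BitSet(size);
    }

    void add(String id) {
      bits.set(index(id, 0));
      bits.set(index(id, 1));
    }

    boolean mightContain(String id) {
      return bits.get(index(id, 0)) && bits.get(index(id, 1));
    }

    private int index(String id, int seed) {
      int h = id.hashCode() * 31 + seed * 0x9E3779B9;
      h ^= (h >>> 16);
      return Math.floorMod(h, size);
    }
  }

  /** Step 1: the cutoff is "now" minus an allowance for wall-clock drift. */
  static long cutoffMicros(long nowMicros, long driftAllowanceMicros) {
    return nowMicros - driftAllowanceMicros;
  }

  /** Steps 2 to 4: returns the IDs that are unreferenced and older than the cutoff. */
  static List<String> findPurgeable(
      Iterable<String> retainedIds, Iterable<Item> allItems, long cutoffMicros, int filterBits) {
    // Step 2: memoize everything that must be retained.
    TinyBloomFilter retained = new TinyBloomFilter(filterBits);
    for (String id : retainedIds) {
      retained.add(id);
    }
    // Step 3: scan everything; step 4: select only old, unreferenced items.
    List<String> purgeable = new ArrayList<>();
    for (Item item : allItems) {
      boolean unreferenced = !retained.mightContain(item.id);
      boolean oldEnough = item.createdAtMicros < cutoffMicros;
      if (unreferenced && oldEnough) {
        purgeable.add(item.id);
      }
    }
    return purgeable;
  }
}
```

A false positive in the filter only means that an unreferenced item survives until a later run; a retained item is never selected for deletion. The timestamp check protects items that were created while the identification step was still running.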

Identifying objects and references

Implementations of @ApplicationScoped PerRealmRetainedIdentifier are called to identify the references and objects that have to be retained for a realm.

Implementations of @ApplicationScoped ObjTypeRetainedIdentifier are called for each identified object of the requested object type.

Realm status

The maintenance service implementation checks the current status of each realm that is to be retained or purged and verifies that the status is valid for being retained (valid: ACTIVE and INACTIVE) or for being purged (valid: PURGING). Realms that have been asked to be purged and for which no data has been encountered are state-transitioned to PURGED.
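
The status rules above can be sketched as follows. This is a minimal illustration, not the actual realm API: the enum lists only the four statuses named in this section, and the method names mayRetain, mayPurge and afterPurgeRun are hypothetical.

```java
/** Illustrative only: a simplified model of the realm status checks. */
final class RealmStatusSketch {

  /** Only the statuses mentioned in this section; the real set may be larger. */
  enum Status {
    ACTIVE,
    INACTIVE,
    PURGING,
    PURGED
  }

  /** A realm may be retained only while it is ACTIVE or INACTIVE. */
  static boolean mayRetain(Status status) {
    return status == Status.ACTIVE || status == Status.INACTIVE;
  }

  /** A realm may be purged only while it is PURGING. */
  static boolean mayPurge(Status status) {
    return status == Status.PURGING;
  }

  /** A PURGING realm for which no data was encountered transitions to PURGED. */
  static Status afterPurgeRun(Status current, boolean dataEncountered) {
    if (current == Status.PURGING && !dataEncountered) {
      return Status.PURGED;
    }
    return current;
  }
}
```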

System realm "::system::"

The system realm is maintained like every other realm.

Future export use cases (TBD/TBC)

These can be useful in a hosted and multi-tenant SaaS environment, when an export of the data for a particular realm is requested.

  • Export live/referenced objects, filtered by realm. A possible implementation would hook into the implementation of PerRealmRetainedIdentifier via a delegate over Persistence. The actual approach and implementation are therefore out of scope for the maintenance service.
  • Low-level export, filtered by realm. This one is different from the one above, as it would export references and all object-parts, in contrast to fully materialized objects. A possible implementation would hook into the scanning-part of the maintenance service implementation.