As your database grows and the number of transactions per second rises, you'll inevitably face the challenge of table bloat. Although PostgreSQL assists as much as possible with its autovacuum feature, there will come a time when you will have to decide whether to run VACUUM FULL. Unless you can afford long downtime windows, this is not an easy decision.

Thankfully, the rich PostgreSQL ecosystem offers more than one way to make this easier. Setting aside older tools like pg_reorg, two contenders worth considering are pg_repack and pg_squeeze. This article dives into their strengths and weaknesses to help you decide which one is the better fit for your specific use case.

Cases of Heavy-Duty Maintenance

In scenarios with a continuous and predictable transaction pattern, you can usually rely on autovacuum. However, there are use cases where this won't be sufficient. These typically involve "bulk" operations, whether bulk imports, deletions, or a combination of both. Imagine scenarios where:

  • You migrate one or more column data types over a longer period (and no, using ALTER COLUMN name TYPE new_type is not the best option, since it can rewrite the whole table while holding an exclusive lock).
  • You drop a column that has been moved to a different table.
  • You need to modify a large amount of data, using soft deletes followed later by actual DELETEs.
  • Due to changes in compliance requirements, you need to DELETE a massive amount of data.

In all these cases, you might end up with a heavily bloated table, and the traditional solution would be VACUUM FULL, which is associated with significant downtime because it holds an exclusive lock on the table for the duration of the rewrite.
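
To gauge whether a table is approaching that point, the cumulative statistics views can give a rough estimate of accumulated dead tuples. A minimal sketch; note that n_dead_tup is only an estimate, and the query simply surfaces the biggest offenders:

    -- List the ten tables with the most dead tuples.
    SELECT relname,
           n_live_tup,
           n_dead_tup,
           round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
           last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 10;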

Alternatives to VACUUM FULL

Due to the limitations of VACUUM FULL, several alternatives have emerged. The first contender in this space was pg_reorg (which this comparison will not cover), later superseded by pg_repack. The relatively new kid on the block is pg_squeeze.

Each solution comes with a different architecture, deployment method, and set of pros and cons.

As an added benefit, both tools can serve as alternatives to CLUSTER (which, like VACUUM FULL, requires an exclusive table lock), making them effective at optimising data storage based on a given order.
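
For reference, the built-in command they can stand in for looks like this; it rewrites the table in index order while blocking both reads and writes (the table and index names are placeholders):

    -- Rewrites public.orders in the physical order of the given index,
    -- holding an ACCESS EXCLUSIVE lock for the whole rewrite.
    CLUSTER public.orders USING orders_created_at_idx;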

Both tools also share common behaviour. They won't magically remove the bloat: you need at least as much free space as the rebuilt table will require. Therefore, deploying either tool needs to be considered well before you reach critical levels of available disk space. Both will also generate a significant volume of WAL, which needs to be stored, processed, backed up, and so on.
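
To estimate that headroom, the built-in size functions are usually enough. A quick sketch (public.orders is again a placeholder):

    -- The rebuilt copy needs roughly the size of the table plus its indexes.
    SELECT pg_size_pretty(pg_table_size('public.orders'))          AS table_size,
           pg_size_pretty(pg_indexes_size('public.orders'))        AS index_size,
           pg_size_pretty(pg_total_relation_size('public.orders')) AS total_size;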

While not directly comparable to these tools, another method that can be applied manually is table partitioning. Although it won't be included in this comparison, if planned in advance it can simplify certain maintenance tasks, improve performance, and make routine vacuuming faster and more efficient. To be fair, it all depends on the specific scenarios and usage.

pg_repack

At the time of writing this article, pg_repack can be considered the most well-known solution for combating table bloat in the PostgreSQL ecosystem. Built as an extension, it's easy to install (either from package repositories or via a self-compiled artifact) and its setup does not require a cluster restart.
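
Assuming the binaries are already present on the server, enabling it is a one-liner in the target database:

    -- No server restart needed; run in the database you want to repack.
    CREATE EXTENSION pg_repack;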

pg_repack works by creating a new copy of the table being processed and setting up triggers that capture concurrent changes while the existing data is copied over. An exclusive table lock is required only briefly, at the start of the process and again at the end, when the old and new tables are swapped.

The maintenance is initiated from the CLI, which lets you specify a number of arguments to fine-tune the repacking of individual tables to match your requirements. pg_repack can also re-cluster data based on a list of columns alone, i.e., it does not require an index, and can therefore overcome some of pg_squeeze's limitations in this regard.
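
A typical invocation might look like the following; the database, table, and column names are placeholders:

    # Rebuild one table, ordering rows by a column instead of an index,
    # and rebuild its indexes with two parallel workers.
    pg_repack --dbname=mydb --table=public.orders --order-by=created_at --jobs=2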

pg_repack is very effective in reclaiming space and can work with all types of bloat. It's available out of the box both on Amazon RDS and Google Cloud SQL.

The drawback you might experience when terminating the process (which may be necessary for various reasons, such as its impact on the running environment) is that it won't clean up all its leftover artifacts, since it does not remain connected to the target database.

pg_squeeze

Compared to pg_repack, pg_squeeze is built differently, relying on logical decoding instead of triggers. Its main benefit is a lower impact on the host system during table rebuilds, improving availability and stability.

Like pg_repack, pg_squeeze creates a new table and copies the existing data into it. Logical decoding is used to stream changes from the original table to the newly created one in real time. This allows the new table to stay up to date during the process without unnecessary impact on regular operations against the bloated table. Thus, pg_squeeze significantly reduces the need for locking: an exclusive lock is needed only during the final phase of the operation, when the old table is swapped out for the new, optimized one. The maximum duration of that exclusive lock can also be configured.

While pg_squeeze is also available as an extension, its deployment necessitates configuration changes involving wal_level, max_replication_slots, and shared_preload_libraries. Because shared_preload_libraries can only be changed at server start, a cluster restart is required.
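
The relevant postgresql.conf changes look roughly like this (the slot count is illustrative; size it for your existing replication needs plus pg_squeeze):

    # postgresql.conf; changing these requires a server restart
    wal_level = logical
    max_replication_slots = 4          # keep at least one slot free for pg_squeeze
    shared_preload_libraries = 'pg_squeeze'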

On the other hand, pg_squeeze is designed for regular rather than purely ad-hoc processing. You can register a table for regular processing, and whenever the table meets the criteria to be "squeezed," a task is added to a queue, where tasks are processed sequentially in the order they were created. The automated processing offers basic options usable in most scenarios, but you might find it limiting if you need to work around other operational constraints of the cluster or environment. Having said that, the maintenance of a table can also be triggered manually, as sketched below.
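
Registration happens through the squeeze.tables configuration table, and a one-off run can be triggered by a function call. The sketch below follows the pg_squeeze documentation, but exact column names and function signatures vary between versions, and public.orders is a placeholder:

    -- Register a table for automated processing; the schedule composite
    -- tells the background worker when to check the squeeze criteria
    -- (here: daily at 22:30 on selected days of the week).
    INSERT INTO squeeze.tables (tabschema, tabname, schedule)
    VALUES ('public', 'orders', ('{30}', '{22}', NULL, NULL, '{1,2,3,4,5}'));

    -- Or run a one-off squeeze manually.
    SELECT squeeze.squeeze_table('public', 'orders');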

While pg_squeeze might be considered superior to pg_repack in terms of maintenance operations and impact, it comes with a significant caveat when reclaiming space: unlike pg_repack, it copies the rows exactly as they are. This behaviour renders it ineffective at removing bloat caused by dropped columns (a behaviour still present at the time of writing this article, at the end of April 2024).

As already mentioned, pg_squeeze can use a specified index for clustering when needed. One limitation you might run into is that clustering cannot be performed on a partial index. Unlike pg_repack, it always seems to clean up all its artefacts properly (thanks to the always-running background worker).

Unfortunately, at the time of writing this article, pg_squeeze is not available on Amazon RDS, only on Google Cloud SQL.

Conclusion

It's remarkable to have two mature and production-tested tools at your disposal. Deciding between them comes down to your specific requirements, use cases, and the operational specifics of your business services.

If a very opinionated distinction has to be drawn, it is this: use pg_squeeze for automated, continuous cleaning of specific tables, and pg_repack as the heavyweight champion of controlled setups.


(Edit) Added availability on Amazon RDS and Google Cloud SQL.