GC jobs are hanging or never finishing for some users after the upgrade to 3.4.
Staff are currently mystified. Customers are rolling back to 3.3.
“There were improvements to reduce the number of repeated atime updates on already-seen chunks (as tracked by the cache), a change in the iteration logic to better correlate index files, and a bugfix for possibly untouched chunks. Why this has negative effects with your particular setup remains to be clarified; so far the possible cause is only a hypothesis.”
Thanks for the heads-up; I was scheduled to upgrade my PBS next weekend.
I’m also concerned. This appears to be serious. I’ve frozen upgrades as well.
This is the latest from the staff.
- My guess is that you are writing and pruning backups faster than GC can keep up with. New backups might be created faster (using fast incremental mode) than GC phase 1 can process them, due to the cache capacity limit. How big are such backup snapshots typically in your case? Are they in the TiB range? In previous versions, new snapshot indices were not considered during GC, which could lead to some chunks not being touched in very specific edge cases with long-running GC and high-frequency pruning setups.
- You can try increasing the gc-cache-capacity in the datastore tuning options to the maximum value of 8388608 and restart GC, see https://pbs.proxmox.com/docs/storage.html#tuning (a command sketch follows this list).
- In general, you should consider adding a special device for such a storage setup, see https://pbs.proxmox.com/docs/sysadmin.html#zfs-special-device (sketched below as well).
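To make the tuning suggestion concrete, here is a minimal sketch of what that could look like on the CLI, assuming PBS 3.4+ where the gc-cache-capacity tuning option exists. The datastore name store1 is a placeholder, and the exact tuning syntax should be double-checked against the linked docs for your version:

```
# Raise the GC atime-update cache to its documented maximum.
# Assumes PBS 3.4+; 'store1' is a placeholder datastore name.
proxmox-backup-manager datastore update store1 --tuning 'gc-cache-capacity=8388608'

# Kick off a fresh GC run and check on it afterwards.
proxmox-backup-manager garbage-collection start store1
proxmox-backup-manager garbage-collection status store1
```

As for the special-device recommendation, the usual ZFS approach is to add a mirrored special vdev so metadata (and thus atime updates) land on fast storage. A sketch with placeholder pool and device names; keep in mind that only data written after the vdev is added goes to it, and that a special vdev must survive for the pool to survive, hence the mirror:

```
# Add a mirrored special vdev for metadata ('tank' and the device
# paths are placeholders). Mirror it: losing the special vdev loses the pool.
zpool add tank special mirror /dev/disk/by-id/nvme-A /dev/disk/by-id/nvme-B
zpool status tank
```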
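Neither sketch is official staff guidance; they are just the obvious way to act on the two bullets above on a stock PBS-on-ZFS box.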
My interpretation is that this issue will affect setups with slower disks and larger backups.
The recommendation to add an SSD special vdev is an interesting aside. Like, are they actually pushing that now? That’s some ZFS rocket surgery. (I like ZFS.)