There have been some problems with the validator ever since it crashed about a week ago. It doesn't crash any more since I added much more disk space, but it easily falls behind and needs several hours to catch up after each restart. I learned this from the state_accounting field in the output of "rippled server_info". I created a metric and a chart from it. This is what the chart looks like for the last week.
I saw some "Not deleting" messages in the log, which led me to the above code.
The "Full" mode means the validator is fully synced and participating in the consensus process. You can see it is now asymptotically approaching 100. Each value is the percentage of uptime the validator has spent in that mode.
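For anyone who wants to build a similar metric, here is a rough sketch of how the percentages can be derived. It assumes state_accounting maps each mode name to an object with a "duration_us" string, which is how server_info reports it on my node; the function name mode_percentages and the sample numbers are made up for illustration and are not my real data.

```python
def mode_percentages(state_accounting):
    """Return the share of uptime (in percent) spent in each server mode."""
    # duration_us is reported as a string of microseconds, so convert to int.
    durations = {
        mode: int(stats["duration_us"])
        for mode, stats in state_accounting.items()
    }
    total = sum(durations.values())
    if total == 0:
        # No uptime recorded yet; avoid dividing by zero.
        return {mode: 0.0 for mode in durations}
    return {mode: 100.0 * d / total for mode, d in durations.items()}


# Made-up sample shaped like the state_accounting object (not real data).
sample = {
    "connected": {"duration_us": "1000000", "transitions": 1},
    "disconnected": {"duration_us": "1000000", "transitions": 1},
    "full": {"duration_us": "6000000", "transitions": 1},
    "syncing": {"duration_us": "1000000", "transitions": 1},
    "tracking": {"duration_us": "1000000", "transitions": 1},
}

percentages = mode_percentages(sample)
print(percentages["full"])  # 60.0
```

Because the durations are cumulative since the server started, the "full" percentage can only creep toward 100 after a long catch-up period, which is why the chart looks asymptotic.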
I figured out why the disk ran out of space: the validator had fallen behind. When that happens, online delete is disabled, so old ledger data keeps accumulating on disk. See this code:
I still can't determine the exact reason why it fell behind in the first place. I suspected it was caused by some "insane" testnet peer, so I blocked some peers with iptables. But now it is in Full mode almost 100% of the time and it still has an insane testnet peer, so that may not be the cause.
However, I do know why it didn't recover after the disk was enlarged. I believe it's because I had changed the db from RocksDB to NuDB while still having online_delete set in the config. I made the change because I learned that NuDB works better on SSDs. But NuDB doesn't support update or delete; it is meant to be used in a full-history validator. It appears that whenever the validator tried to do online_delete, it started to fall behind. Now I have switched back to RocksDB and everything is fine.