A noisy neighbor in the cloud
An RDS database hit a strange issue: its CPU credits were being drained. It was a burstable T3 medium instance.
1. Nothing significant was running in the database at that time. The top 10 SQL queries in Performance Insights were PMM (performance monitoring tool) queries, which are light.
2. According to Performance Insights, CPU usage was not high. CPU utilization was close to single digits when the alert was sent, and the peak was 15%.
3. Metrics on the database instance showed CPU credits being exhausted.
4. I had seen some CPU steal percentage earlier in the day, but that alone does not explain the ongoing drain in CPU credits.
5. I reduced autovacuum intensity (the poor autovacuum worker was blamed by the application team again), but it made no difference to the situation.
6. pg_stat_activity showed no long-running sessions.
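The mismatch between items 2 and 3 can be made concrete with a little credit arithmetic. The sketch below assumes the published EC2 figures for a t3.medium (2 vCPUs, 24 credits earned per hour, one credit being one vCPU at 100% for one minute) also apply to the RDS instance in question. The function names are illustrative and not part of any AWS API.

```python
# Rough CPU credit arithmetic for a burstable instance.
# Assumption: published t3.medium figures (2 vCPUs, 24 credits earned per hour).
# One CPU credit = one vCPU running at 100% for one minute.
T3_MEDIUM_VCPUS = 2
T3_MEDIUM_CREDITS_EARNED_PER_HOUR = 24.0


def credits_spent_per_hour(avg_cpu_pct, vcpus=T3_MEDIUM_VCPUS):
    """Credits consumed in one hour at a given average utilization (percent)."""
    return vcpus * avg_cpu_pct * 60.0 / 100.0


def net_credits_per_hour(avg_cpu_pct,
                         vcpus=T3_MEDIUM_VCPUS,
                         earned=T3_MEDIUM_CREDITS_EARNED_PER_HOUR):
    """Positive means the balance should grow, negative means it drains."""
    return earned - credits_spent_per_hour(avg_cpu_pct, vcpus)


def break_even_utilization_pct(vcpus=T3_MEDIUM_VCPUS,
                               earned=T3_MEDIUM_CREDITS_EARNED_PER_HOUR):
    """Average utilization (percent) at which earning equals spending."""
    return 100.0 * earned / (vcpus * 60.0)


if __name__ == "__main__":
    print("break-even utilization %:", break_even_utilization_pct())
    print("net credits/hour at the 15% peak:", net_credits_per_hour(15.0))
```

Under these assumptions the break-even point is 20% average utilization, so even the 15% peak should have left the balance growing rather than shrinking. That is why the reported utilization could not account for the credit exhaustion by itself.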
Thus, the only culprit left was a noisy neighbor, who is invisible to us. In discussion with the application team, we agreed that the T class was not appropriate for the service level or the criticality of the database. A Graviton upgrade was already planned within a few days, so we moved to a standard Graviton instance (m6g.large) a bit early.
Everything looks fine after the upgrade, and the incident has been resolved.