I think OOM kills are an important one, especially with containerized workloads. I've found that RAM used/limit metrics aren't sufficient: the spike that leads to the OOM event often happens faster than the metric resolution, which gives misleading charts.
Ideally I'd see these events overlaid with the time series to make it obvious that a restart was caused by OOM as opposed to other forms of crash.
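For what it's worth, here's a rough sketch of how such events could be derived, assuming cgroup v2, where each cgroup exposes a `memory.events` file with a cumulative `oom_kill` counter. Because the counter is cumulative, a kill still shows up on the next poll even if the memory spike itself fell between two samples. The function names and the polling setup are my own for illustration, not from any particular monitoring tool.

```python
def parse_memory_events(text: str) -> dict[str, int]:
    """Parse the 'key value' lines of a cgroup v2 memory.events file."""
    counters: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def oom_kill_events(samples):
    """Turn polled (timestamp, memory.events text) pairs into discrete events.

    Returns a list of (timestamp, new_kills) for every poll where the
    cumulative oom_kill counter increased since the previous poll.
    A lower value (e.g. the cgroup was recreated) is only taken as the
    new baseline; that reset handling is an assumption of this sketch.
    """
    events = []
    previous = None
    for timestamp, text in samples:
        current = parse_memory_events(text).get("oom_kill", 0)
        if previous is not None and current > previous:
            events.append((timestamp, current - previous))
        previous = current
    return events


if __name__ == "__main__":
    # Hypothetical polls taken 15 seconds apart.
    polls = [
        (0, "low 0\nhigh 0\nmax 0\noom 0\noom_kill 0\n"),
        (15, "low 0\nhigh 0\nmax 3\noom 1\noom_kill 1\n"),
        (30, "low 0\nhigh 0\nmax 3\noom 1\noom_kill 1\n"),
    ]
    for timestamp, kills in oom_kill_events(polls):
        print(f"t={timestamp}s: {kills} OOM kill(s)")
```

The resulting timestamps are what I'd want drawn as markers on top of the memory chart.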