RHQ, the common services project for infrastructure management

  Trends
Added by Heiko W. Rupp, last edited by Heiko W. Rupp on Jun 09, 2008

Trends and proactive alarming

Currently we are only able to alert operations when a problem has already occurred. It is possible to have alerts on crossing thresholds, for example when a metric comes within 10% of its upper or lower value, or when a metric value crosses some threshold around the baseline of the metric.

In order to proactively inform operations about upcoming issues, we should add trending functionality and the possibility to compute the time delta until a critical situation will arise (with a certain probability). Let's have a look at the following graph:

Here we have a dynamic metric and a trend function. In addition we have a threshold value (SLA, "Service Level Agreement"). With the help of the current value and the trend graph, we could compute a time value deltaT after which the metric would hit the threshold value. deltaT could then be fed into the alert subsystem to alert if deltaT is less than a given value.
Of course, this is not limited to dynamic metrics, but would work even better for trendsup or trendsdown metrics, as the extrapolation is easier in that case.
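
The deltaT computation described above can be sketched as follows. This is only an illustration under the assumption that the trend function is a straight line with a known slope; the class and method names are hypothetical and not part of RHQ.

```java
import java.util.OptionalDouble;

/** Hypothetical helper for illustration; none of these names exist in RHQ. */
final class DeltaTEstimator {

    /**
     * Estimates the hours until the metric reaches the threshold, assuming a
     * linear trend with the given slope (metric units per hour).
     * Returns empty if the trend is flat or points away from the threshold.
     */
    static OptionalDouble hoursUntilThreshold(double current, double slopePerHour, double threshold) {
        double gap = threshold - current;
        if (gap == 0.0) {
            return OptionalDouble.of(0.0); // already exactly at the threshold
        }
        if (slopePerHour == 0.0 || Math.signum(gap) != Math.signum(slopePerHour)) {
            return OptionalDouble.empty(); // the threshold is never reached on this trend
        }
        return OptionalDouble.of(gap / slopePerHour);
    }

    /** Alert condition from the text: deltaT is less than a given value. */
    static boolean shouldAlert(OptionalDouble deltaTHours, double minLeadTimeHours) {
        return deltaTHours.isPresent() && deltaTHours.getAsDouble() < minLeadTimeHours;
    }

    public static void main(String[] args) {
        // Assumed example: storage 80% used, growing 0.5 points per hour, threshold at 95%.
        OptionalDouble deltaT = hoursUntilThreshold(80.0, 0.5, 95.0);
        System.out.println("deltaT in hours: " + deltaT);
        System.out.println("alert if less than 48h: " + shouldAlert(deltaT, 48.0));
    }
}
```

Note that this sketch treats the threshold as a level the metric moves towards; a metric that has already crossed it is a case for the existing threshold alerts, not for the trend alert.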

This algorithm is targeted at metrics where operations is expected to react before a critical situation arises. An example of this would be the used capacity of a storage array: if it is known that the storage will be full within the next two days, operations still has time to proactively add an additional disk or replace an existing one with a bigger one.
An example where this makes no sense would be the CPU load. One could argue that if the load reaches some level, an additional machine could be brought into service, but the changes in CPU load are too quick to effectively trigger this with some trend function.

Possible implementation

As this trend computation is relatively expensive, it should be disabled by default and would have to be explicitly enabled.

We said previously that this proactive alarming should only be used for metrics where operations has a chance to prevent the issue. This means that there is no need to calculate deltaT in real time. Instead a "cron job" can update it, e.g. on an hourly basis.
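
A minimal sketch of such a periodic update, using the standard java.util.concurrent scheduler as a stand-in for the "cron job". The class DeltaTRefresher, its methods and the metric IDs are assumptions made for this example and do not exist in RHQ; the caller is expected to pass only the metrics for which trending was explicitly enabled.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.ToDoubleFunction;

/** Hypothetical periodic deltaT updater; not an RHQ class. */
final class DeltaTRefresher {
    private final Map<String, Double> cachedDeltaT = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    /** Recomputes deltaT for every given metric and stores the result. */
    void refresh(Iterable<String> metricIds, ToDoubleFunction<String> computeDeltaT) {
        for (String id : metricIds) {
            try {
                cachedDeltaT.put(id, computeDeltaT.applyAsDouble(id));
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all further scheduled runs.
                System.err.println("deltaT update failed for " + id + ": " + e);
            }
        }
    }

    /** Runs refresh immediately and then once per period, e.g. (1, TimeUnit.HOURS). */
    void start(Iterable<String> metricIds, ToDoubleFunction<String> computeDeltaT, long period, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(() -> refresh(metricIds, computeDeltaT), 0, period, unit);
    }

    /** Last computed deltaT for the metric, or null if none was computed yet. */
    Double cached(String metricId) {
        return cachedDeltaT.get(metricId);
    }

    void stop() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        DeltaTRefresher refresher = new DeltaTRefresher();
        // The lambda is a placeholder for the real (expensive) trend computation.
        refresher.start(List.of("storage.used"), id -> 30.0, 1, TimeUnit.HOURS);
        Thread.sleep(200); // give the first run a moment to finish
        System.out.println("cached deltaT: " + refresher.cached("storage.used"));
        refresher.stop();
    }
}
```

The alert subsystem would then only read the cached value, so the expensive computation stays out of the alerting path.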

Depending on the type of the base metric, different trending functions could be applied. For dynamic metrics, a 30, 100 or 200 day trend might be interesting.
For trendsup and trendsdown metrics, it should be relatively easy to approximate the curve by a straight line through two points (the previous measurement and the current value).
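
The two-point approximation could look like the following sketch. Time is assumed to be given in hours, and the class and method names are made up for this example.

```java
/** Hypothetical two-point linear trend; not an RHQ class. */
final class TwoPointTrend {

    /** Slope (metric units per hour) of the line through (t1, v1) and (t2, v2). */
    static double slope(double t1, double v1, double t2, double v2) {
        if (t1 == t2) {
            throw new IllegalArgumentException("the two measurements need different timestamps");
        }
        return (v2 - v1) / (t2 - t1);
    }

    /** Value predicted for time t by extending the line beyond the current point (t2, v2). */
    static double extrapolate(double t1, double v1, double t2, double v2, double t) {
        return v2 + slope(t1, v1, t2, v2) * (t - t2);
    }

    public static void main(String[] args) {
        // Assumed example: 70% used at hour 0 and 72% used at hour 4.
        System.out.println("slope per hour: " + slope(0.0, 70.0, 4.0, 72.0));
        System.out.println("predicted at hour 24: " + extrapolate(0.0, 70.0, 4.0, 72.0, 24.0));
    }
}
```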

Of course that is very simplistic, and a quadratic or even higher-order curve fitted to more than two points would be better.
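
As one possible way to do this, a quadratic curve v(t) = a + b*t + c*t^2 can be fitted to three or more points by least squares. The sketch below solves the normal equations directly; the choice of least squares, the class name and the method names are assumptions of this example, not something the proposal prescribes.

```java
/** Hypothetical least-squares quadratic trend; not an RHQ class. */
final class QuadraticTrend {

    /** Returns {a, b, c} minimising the squared error of a + b*t + c*t^2 over the points. */
    static double[] fit(double[] t, double[] v) {
        if (t.length != v.length || t.length < 3) {
            throw new IllegalArgumentException("need at least three (t, v) pairs");
        }
        // Augmented normal equations: m[i][j] = sum of t^(i+j), m[i][3] = sum of v * t^i.
        double[][] m = new double[3][4];
        for (int k = 0; k < t.length; k++) {
            double x = t[k];
            double[] pow = {1.0, x, x * x, x * x * x, x * x * x * x};
            for (int i = 0; i < 3; i++) {
                for (int j = 0; j < 3; j++) {
                    m[i][j] += pow[i + j];
                }
                m[i][3] += v[k] * pow[i];
            }
        }
        // Gaussian elimination with partial pivoting.
        for (int col = 0; col < 3; col++) {
            int pivot = col;
            for (int r = col + 1; r < 3; r++) {
                if (Math.abs(m[r][col]) > Math.abs(m[pivot][col])) {
                    pivot = r;
                }
            }
            double[] tmp = m[col];
            m[col] = m[pivot];
            m[pivot] = tmp;
            if (Math.abs(m[col][col]) < 1e-12) {
                throw new IllegalArgumentException("need at least three distinct timestamps");
            }
            for (int r = col + 1; r < 3; r++) {
                double factor = m[r][col] / m[col][col];
                for (int c = col; c < 4; c++) {
                    m[r][c] -= factor * m[col][c];
                }
            }
        }
        // Back substitution.
        double[] coeffs = new double[3];
        for (int i = 2; i >= 0; i--) {
            double sum = m[i][3];
            for (int j = i + 1; j < 3; j++) {
                sum -= m[i][j] * coeffs[j];
            }
            coeffs[i] = sum / m[i][i];
        }
        return coeffs;
    }

    /** Evaluates the fitted curve at time t. */
    static double valueAt(double[] coeffs, double t) {
        return coeffs[0] + t * (coeffs[1] + t * coeffs[2]);
    }

    public static void main(String[] args) {
        double[] hours = {0, 1, 2, 3};
        double[] values = {1, 6, 17, 34};
        double[] coeffs = fit(hours, values);
        System.out.println("predicted at hour 4: " + valueAt(coeffs, 4.0));
    }
}
```

For numerical stability the timestamps should be small offsets (for example hours since the first sample) rather than raw epoch milliseconds.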
