The Problem with the Baseline Out of Bounds Concept
Summary of Baselines/OOBsBaselines are calculated periodically that consist of a min/max/average for a measurement over a period of time (on the order of days - 1 day, 3 days or 7 days typically). Baselines are normally recalculated every X days, where X is typically 1 or 3. Once a baseline is calculated, if a measurement is collected that is more than 5% higher than the baseline max, or more than 5% lower than the baseline min, an OOB is inserted into the database (RHQ_MEASUREMENT_OOB) to indicate a collection was out of bounds from the normal values most recently seen. A resource that has 1 or more OOBs associated with it is considered a "problem resource" because its measurements have come in out of bounds from its normal past activity. Why OOBs Are SuboptimalThe concept of baselines is fine - they represent a known range of normal measurement values over a recent period of time. The concept of OOBs, which sounds fine in theory, actually produces less-than-useful results in practice. Some of it is due to the way we currently implement it, and some of it is based on the semantics of the measurement in question. To illustrate the point, I will give some examples that I have actually seen happen.
First ExampleHere is the first example for annoying OOB behavior. I had shutdown all servers/agents the night before. I then completely truncated the OOB and baseline tables (to start from scratch). The next day, I started up a new server and forced it to generate new baselines (so, baselines are now new and up to date). I then started a single agent on the same box as where the server was running. Within a few minutes (if that) I got over 100 OOBs generated. About 10 minutes later this number doubled to almost 200 OOBs. (note that none of these OOBs were coming from perftest plugin resources - I updated the perftest plugin and fixed it so it will not generate boat loads of artificial OOBs - so all OOBs were true OOBs in true resources) Also note that I only have a single agent running. 1 agent, less than 15 minutes or so generating 200 OOBs. Extrapolate that out - 1 agent in 1 hour = 800 OOBs. 100 agents in 1 hour = 80,000 OOBs. 100 agents in 1 day ~~ 2M OOBs. Now, this was after the server and agents were down for several hours. So, the baselines included data that were several hours old and didn't include data from the newly started server and agent. Things like system memory usage, server thread count and the like are now going to be potentially very different when they were the last time the systems were running. Because of this, many metrics will have values that are no longer within the range of the baseline that was calculated based on data from days ago, hence, for every measurement collection in question, an OOB will be inserted in the DB. Second ExampleTake a look at the graph image below that shows metric "Active Thread Count": The server was shutdown as you can see for several hours earlier and then restarted. It started with a smaller number of threads than before. This metric is sending off OOBs every collection interval. Now, for the next 1 day until we recalc the baseline, I predict that every time this is collected, an OOB will be inserted because previously, the min/max was 137/297 but we are coming in at under 100 threads. So unless and until the server starts spinning up new threads (up to 130, or -5%), we are going to trigger a low OOB every time this metric is collected. Of course, as new agents start talking to this server, those threads might spin up, so this may be a bit artificial (after all, I only have 1 agent talking to this server now). However, you can assume this same kind of thing would happen with other types of measurements - that is to say, after a resource restart, measurements might come in at vastly different values than what they were before. While technically these are out of bounds from previous values, can we call these "problem resources"? Semantically, this server is not a problem resource - but because it has many OOBs associated with it, our UI would list this as a problem resource. Third ExampleTake a look at the graph image below that shows a graph for the average time it takes for the agent to send messages to the server: Again, the agent was restarted and the server is able to reply to the agents messages. Before, the baseline was min/max=1.3s/34.4s. But my newly started agent has stabilized and can send messages in under 1 second now (0.6s around). Since the low baseline is so small, there is a very small window that I can go under without triggering an OOB (5% is 0.13s - round it to 0.1s for sake of discussion). Baseline Min is 1.3s so unless my agent is slow enough to come in at 1.2s or slower, I will be getting OOBs. But notice what I said - the agent has to have degraded performance for it to not trigger an OOB. Once my agent is stable and can send messages fast (e.g. 0.6s) - that triggers an OOB. That is poor semantics because common sense says it should be the opposite - I shouldn't make this resource look like a "problem resource" because in this particular resource's case, falling well below the baseline min is actually a good thing. These semantics are obviously specific to this resource and this metric (i.e. other metrics that fall well below the baseline min might be cause for concern, but this is not one of those cases). Why The OOB Concept Falls ShortThe OOB design as it stands today falls short. Baselines are fine - people can see what the min/max/avg is for a particular metric over a period of time - that seems useful. But blindly triggering an event like an OOB when a measurement falls outside that range is not as useful. Why isn't this useful? I have so many OOBs, I don't know which ones to ignore and which ones I should consider valid enough to be concerned about! And even if an OOB is valid, in some cases, I might be getting OOBs for many consecutive collection periods so its hard for me to determine if an new OOB represents a new problem, or if it represents a problem that has already been raised by a previous OOB. As with any kind of information overload, once users get flooded with OOBs like this (in which the majority will end up getting ignored), it won't be long before they simply ignore OOBs entirely. So, while in theory OOBs sound like an interesting and useful feature, we have seen in practice that OOBs simply flood the system in such a way that they simply can't provide any useful information. When thousands to millions of OOBs occur, with the vast number of them potentially not actually representing a new problem, it is very easy for a user to conclude that trying to find a meaningful OOB is too time-consuming to be worth the trouble, and therefore OOBs to be of little-to-no value. What Can We Make Useful?As mentioned before, baselines are useful - they represent a normal range of values previously collected from a resource. But what we are doing today is flooding the user with OOBs that alert him to every measurement that is out of bounds from those baselines - whether or not the information is useful or semantically makes sense. We are forcing the user to scan the vast collection of OOBs and look for those that are important. Even though OOBs (as we know them) are proving to be less useful, the concept of alerting the user when a measurement is out of bounds is still very useful - but only if the user says it is useful. In other words, let the user tell us what baselines are important to him and have the user tell us which out-of-bounds conditions is important to him. Only then should we issue an "OOB" alert. In fact, we already have the above capability - we can alert users based on values compared to the baseline! See the screen snapshots below: Note a couple added benefits here:
We can set these OOB alerts on a per resource basis or across all resources of a particular type using alert templates. So, even if we were to take out OOB functionality entirely, we still have the ability to inform the user when a value is found to be out-of-bounds - we just ask that the user define an alert to tell us when we should inform him of the OOB condition. We don't blindly log an OOB whenver the condition occurs. If we want to keep the ability to store OOBs, I recommend we disable OOBs for all measurements, but have the user be able to turn on an "enable" flag to say, "I am interested in counting the number of times this measurement is out of bounds". Note that the user has to do this anyway if they want to be alerted when this condition occurs. In the end, what OOBs give us that alert-on-baseline doesn't is just a count of the number of times a measurement is out of bounds. Just a number. It has no semantics whatsoever - it doesn't necessarily mean the resource is a problem resource and it doesn't necessarily mean the measurement is bad or wrong. All it simply is, is "this is the number of times a metric was seen to be out of the range of the baseline". With that in mind, I'm not sure if its worth trying to fix the OOB design - does OOB provide such a valueable feature over and above alert-on-baseline-OOB to warrant spending additional time to fix it? What If We Removed OOB Functionality?If we completely remove OOB calculations and storage as implemented in RHQ 1.1, our server would run much better:
What Can Replace Current OOB Functionality?Remember that, in theory, the concept of an OOB sounds great. How we make this useful in the real world becomes the challenge. What can we do as a replacement to the OOB functionality we plan to rip out? First, let's come to a conclusion as to what it is we really want when we say we like the OOB as a theoretical concept. What we want is for the system to answer a very basic question, which is:
In other words, what we really want to know is which resources are behaving in a way that it hasn't before, based on recent past behavior. This really is nothing more than a database query that compares a resource's measurements with its baselines - any measurements that are outside of the baselines will force its resource to be returned by the query. As a corrolary, we could have this query sort the results such that those resources that are behaving the most abnormally would come first. So, we can therefore, replace the current OOB functionality with a database query with a UI component that displays the results of that query. What we would like future OOB features to tell us is:
|