Machine Learning

Machine Learning with the Elastic Stack Part-1


Exploring Count Functions in Elastic ML

Learn about count functions in this article by Rich Collier, a solutions architect at Elastic. Joining the Elastic team from the Prelert acquisition, Rich has over 20 years' experience as a solutions architect and pre-sales systems engineer for software, hardware, and service-based solutions.

Elastic ML jobs contain detectors, each a combination of a function applied to some aspect of the data (for example, a field). The detectors we will explore in this article are those that simply count occurrences of things over time.

The three main functions to get familiar with are as follows:

  • Count: Counts the number of documents in the bucket resulting from a query of the raw data index
  • High Count: The same as Count, but will only flag an anomaly if the count is higher than expected
  • Low Count: The same as Count, but will only flag an anomaly if the count is lower than expected

You will see that there are a variety of one-sided functions in ML (to only detect anomalies in a certain direction). Additionally, it is important to know that this function is not counting a field or even the existence of fields within a document; it is merely counting the documents.
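Because the function counts whole documents, the equivalent raw query is just a Date Histogram aggregation with no field-level metric at all. A minimal sketch, assuming a hypothetical index named `web-logs` with a `@timestamp` time field (`fixed_interval` was named `interval` in Elasticsearch versions before 7.2):

```json
GET web-logs/_search
{
  "size": 0,
  "aggs": {
    "events_per_10m": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "10m"
      }
    }
  }
}
```

Each bucket in the response carries a `doc_count` — the same per-bucket quantity that the Count function models over time.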

To get a more intuitive feeling for what the Count function does, let's see what a standard (non-ML) Kibana visualization shows us for a particular dataset when that dataset is viewed with a Count aggregation on the y-axis and a Date Histogram aggregation with a 10-minute interval on the x-axis:

[Screenshot: Kibana vertical bar visualization of the document count per 10-minute bucket]
From the preceding screenshot, we can make a few observations:

  • This vertical bar visualization counts the number of documents in the index for each 10-minute bucket of time and displays the resulting view. We can see, for example, that the number of documents at the 11:10 AM mark on February 9 spikes much higher than the typical rate (the rate at points in time excluding the spike); in this case, the count is 277.
  • To automate the analysis of this data, we can analyze it with an ML job. We can use a Single Metric Job since there is only one time series (a count of all docs in this index). After the initial steps of the Single Metric Job wizard are completed, configuring the job will look like the following:

[Screenshot: Single Metric Job configuration with the Count aggregation and a 10m bucket span]
We can see that the Count aggregation function is used (although High Count would also have been appropriate), and the Bucket span is set to the same value we used when building our Kibana visualization. After running the job, the resulting anomaly is found:

[Screenshot: Single Metric Viewer showing the anomaly at the spike of 277 documents/events]
Of course, the anomaly of 277 documents/events is exactly what we had hoped would be found, since this is exactly what we saw when we manually analyzed the data in the vertical bar visualization earlier.
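The same job can also be created directly through the ML API rather than the wizard. A minimal sketch, assuming a hypothetical job id of `event_rate_count` and a `@timestamp` time field (on pre-7.0 clusters the endpoint carries an `_xpack` prefix):

```json
PUT _ml/anomaly_detectors/event_rate_count
{
  "analysis_config": {
    "bucket_span": "10m",
    "detectors": [
      { "function": "count" }
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}
```

A datafeed pointing at the source index would then supply the documents to the job.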

Notice what happens, however, if the same data is analyzed with a 60m bucket span instead of a 10m one:

[Screenshot: the same data analyzed with a 60m bucket span; the spike is no longer flagged]
Note that because the rate spike was so short-lived, aggregating the event count over the span of an hour smooths it away: the spike no longer stands out from the typical hourly rate, so ML does not flag it as anomalous.

As mentioned earlier, the one-sided functions of Low Count and High Count are especially useful when trying to find deviations in only one direction. Perhaps you only want to find a drop in orders on your e-commerce site (because a spike in orders would be good news!), or perhaps you only want to spot a spike in errors (because a drop in errors is a good thing too!).
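In the job API, these one-sided variants are the `low_count` and `high_count` functions. A sketch of an orders-drop detector, under the same assumptions as before (hypothetical job id and time field):

```json
PUT _ml/anomaly_detectors/orders_drop
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "low_count",
        "detector_description": "Unusually low order volume"
      }
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}
```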

Remember, the Count functions count documents, not fields. If you have a field that represents a summarized count of something, then that will need special treatment as described in the next section.

Continue Reading Article: Machine Learning with the Elastic Stack Part-2

© Copyright 2017. All Rights Reserved.

A Product of HunterTech Ventures