Machine Learning with the Elastic Stack Part-2

Summarized counts

We stated earlier that the Count functions simply tally the number of documents per unit of time. But what if the data that you are using contains a field whose value is already a summarized count? For example, in the following data, the events_per_min field represents a summarized number of occurrences of something (online purchases, in this case) that occurred in the last minute:

{
    "metrictype": "kpi",
    "@timestamp": "2016-02-12T23:11:09.000Z",
    "events_per_min": 22,
    "@version": "1",
    "type": "it_ops_kpi",
    "metricname": "online_purchases",
    "metricvalue": "22",
    "kpi_indicator": "online_purchases"
}

To get the ML job to recognize that the events_per_min field is what needs to be tallied (and not the documents themselves), we need to set the summary_count_field_name directive, which in the UI can only be set in Advanced jobs.

After specifying events_per_min as the summary_count_field_name, the appropriate detector configuration in this case simply employs the low_count function.
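As a rough sketch of how this configuration might look outside the UI, the job body below is built as a Python dictionary and printed as JSON. The analysis_config, detectors, summary_count_field_name, bucket_span, and data_description keys follow the anomaly detection job format; the 15-minute bucket span and the description text are assumptions chosen for illustration, not values taken from the job discussed here.

```python
import json

# Assumed values for illustration only: the bucket span and description
# of the original job are not given in the text.
job_config = {
    "description": "Low online purchases (summarized counts)",
    "analysis_config": {
        "bucket_span": "15m",
        # Tally the value of this field instead of counting documents.
        "summary_count_field_name": "events_per_min",
        "detectors": [
            # Flag buckets where the tallied count is unusually low.
            {"function": "low_count"}
        ],
    },
    "data_description": {"time_field": "@timestamp"},
}

print(json.dumps(job_config, indent=2))
```

The printed JSON is the kind of body that would be supplied when creating the job through the API rather than the Advanced job wizard.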

 

The results of running the job give exactly what we expect: a detection of cases when customer online purchases were lower than they should have been, including times when orders dropped completely to zero, as well as a partial loss of orders around midday on one day.

Splitting the counts

Counts can also be split along a categorical field when using the Count functions. This makes it easy to run many event rate analyses simultaneously, using either the Multi Metric job or the Advanced job UI wizards.

Some common use cases for this are as follows:

  • Finding an increase in error messages in a log by error ID or type
  • Finding a change in log volume by host; perhaps some configuration was changed
  • Determining whether certain products suddenly are selling better or worse than they used to

To accomplish this, the same mechanisms are used. For example, in a Multi Metric job, one can choose a categorical field on which to split the data while using a Count (event rate) function.
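A split like this could be expressed in a detector roughly as follows. The sketch assumes the categorical field is named airline (the text only shows the value AAL, not the field name) and uses partition_field_name for the split; the 15-minute bucket span is likewise an assumption.

```python
import json

# Assumed field name: the article shows the value "AAL" but not the field.
split_field = "airline"

analysis_config = {
    "bucket_span": "15m",  # assumed for illustration
    "detectors": [
        # One event rate model is built per distinct value of the split field.
        {"function": "count", "partition_field_name": split_field}
    ],
    # Optional: listing the split field as an influencer helps attribute
    # anomalies to specific entities.
    "influencers": [split_field],
}

print(json.dumps(analysis_config, indent=2))
```

Each distinct value of the split field gets its own baseline, so a spike for one entity does not have to stand out against the combined volume of all the others.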

 

In the results, only one of the many entities being modeled was determined to be unusual (a spike in the volume of requests for the airline AAL).

As you can see, volume-based variations are easy to spot across a large number of unique instances of a categorical field in the data. We can tell at a glance which entities are unusual and which are not.

Other counting functions

In addition to the functions described so far, there are several other counting functions that enable a broader set of use cases.

Non-zero count

The non-zero count functions (non_zero_count, low_non_zero_count, and high_non_zero_count) also perform count-based analysis, but allow for accurate modeling when the data is sparse and the absence of data should be treated as null rather than explicitly as zero. In other words, suppose a dataset over time looks like the following:

4,3,0,0,2,0,5,3,2,0,2,0,0,1,0,4

With the non_zero_count functions, this data will be interpreted as the following:

4,3,2,5,3,2,2,1,4
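The effect can be imitated in a few lines of Python. This is only an illustration of the idea of ignoring zero buckets, not the way the ML job is implemented; the helper name drop_zero_buckets is made up for this sketch.

```python
def drop_zero_buckets(counts):
    """Keep only the non-zero bucket counts, treating zeros as missing."""
    return [c for c in counts if c != 0]


raw = [4, 3, 0, 0, 2, 0, 5, 3, 2, 0, 2, 0, 0, 1, 0, 4]
modeled = drop_zero_buckets(raw)
print(modeled)  # [4, 3, 2, 5, 3, 2, 2, 1, 4]

# The typical value shifts once empty buckets stop counting as zero.
mean_with_zeros = sum(raw) / len(raw)
mean_without_zeros = sum(modeled) / len(modeled)
print(mean_with_zeros, mean_without_zeros)
```

Because the empty buckets no longer pull the average down, the baseline reflects what a bucket looks like when something actually happens in it.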

The act of treating zeros as null can be useful in cases where the non-existence of measurements at regular intervals is expected. Some practical examples of this are as follows:

  • The number of airline tickets purchased per month by an individual
  • The number of times a server reboots in a day
  • The number of login attempts on a system per hour

Distinct count

The distinct count functions (distinct_count, low_distinct_count, and high_distinct_count) measure the uniqueness (cardinality) of values for a particular field. There are many possible uses of these functions, particularly in the context of population analysis, to uncover entities that are logging an overly diverse set of field values. A classic example is looking for IP addresses that are engaged in port scanning, accessing an unusually large number of distinct destination port numbers on remote machines:

{
  "function": "high_distinct_count",
  "field_name": "dest_port",
  "over_field_name": "src_ip"
}

Notice that the src_ip field is defined as the over field, thus invoking population analysis and comparing the activity of source IPs against each other. An additional discussion on population analysis follows next.
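To make the idea of cardinality per entity concrete, the sketch below counts distinct destination ports per source IP for a single time bucket. The events are made-up sample data and the helper name is hypothetical; the real job models these counts statistically over time rather than simply picking the largest.

```python
from collections import defaultdict

# Made-up (src_ip, dest_port) events for one time bucket.
events = [
    ("10.0.0.1", 443), ("10.0.0.1", 443), ("10.0.0.1", 80),
    ("10.0.0.2", 443), ("10.0.0.2", 443),
    ("10.0.0.3", 21), ("10.0.0.3", 22), ("10.0.0.3", 23),
    ("10.0.0.3", 25), ("10.0.0.3", 80), ("10.0.0.3", 443),
]


def distinct_ports_by_source(events):
    """Return the number of distinct destination ports seen per source IP."""
    ports = defaultdict(set)
    for src_ip, dest_port in events:
        ports[src_ip].add(dest_port)
    return {ip: len(seen) for ip, seen in ports.items()}


counts = distinct_ports_by_source(events)
most_diverse = max(counts, key=counts.get)
print(counts)
print(most_diverse)
```

Repeated hits on the same port do not raise the count, so a busy but well-behaved client stays low while a scanner touching many ports stands out from the rest of the population.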

If you found this article interesting, you can explore Machine Learning with the Elastic Stack to leverage Elastic Stack’s machine learning features to gain valuable insight from your data. Machine Learning with the Elastic Stack is a comprehensive overview of the embedded commercial features of anomaly detection and forecasting.

© copyright 2017 www.aimlmarketplace.com. All Rights Reserved.
