Splunk MLTK’s Cluster Numeric Events assistant for clustering and understanding the housing market

Splunk has hundreds of apps that extend its
core capabilities. The “Machine Learning Toolkit” app delivers new ML-specific search
commands, visualizations, and other tools for performing machine learning on your Splunk
data. In addition, the app has ML Assistants that guide you through the process
of performing particular analytics, building custom models, and operationalizing them on
the Splunk platform.

In this video, we’ll explore the “Cluster Numeric Events” assistant, which helps group
events based on the similarity of their numeric fields. This task is commonly referred to
as clustering. There are countless uses for clustering, from identifying the hot zones in
any spread of points to market segmentation. We will cover those in subsequent videos.

Note that most of the input and output panels in this assistant have tooltips that give
more information about the particular panel. Hover over a panel title or field label to
reveal the tooltip.

For this example, we want to group the vehicles
driven on a racetrack by onboard metrics such as engine temperature and G-forces. These
groups could help us understand the similarities between various car models. Using the
search field, we first load the track day dataset that is packaged with the app. After
clicking the search button, we get the “Raw Data Preview” panel below, which shows the
fields in the dataset.

We also have the option to preprocess our data before clustering. We can standardize our
fields so that scale differences between numeric fields, such as lateral G-force and
engine speed, are removed and do not bias the clustering. After this, we can apply the
principal component analysis (PCA) option to reduce the dimensionality of the problem. PCA
is useful for reducing the number of fields used for processing while preserving as much
of their variability as possible. We will reduce our seven fields to three.

The next step is to select the algorithm to be
used for clustering. Splunk’s ML app includes several clustering algorithms, each of which
is better suited to different kinds of data. You should experiment with the options to
see which one works best for your data. For this example, we will use the Birch algorithm.

In this example, we group the cars based on the three fields generated by PCA (that is,
the principal components). Selecting fields, in general, is a hard problem that depends on
the use case, the characteristics of the data, and the properties of the algorithm itself,
all of which are beyond the scope of this video.

The Birch algorithm requires that we specify k, the number of clusters. In this example,
we’ll group the cars into six clusters.

Next, we have the option to save this model, following the naming convention that we can
find on the “Load Existing Settings” tab; we’ll revisit this tab later in this video.

After clicking the cluster button, the cars are divided into six clusters based on the
selected fields, along with some unclustered points.

To evaluate the clusters and their relation
to the fields used for clustering, the assistant creates a Cluster Visualization. This
visualization is a set of scatter plots of the fields selected for clustering, with the
clusters highlighted in each plot. Hover over any point in the visualization to see its
value. You can also select a group of points in one plot and see the corresponding points
in the other plots.

On the “Load Existing Settings” tab, we can compare various clustering setups using
details such as Field to use for clustering, Clustering Algorithm, and Dimensionality
Reduction, and reload the one that worked best.
We can also apply our configuration to new data.

These assistants not only walk you through the steps required for clustering, but also
let you reuse all of these tools and provide a variety of options for operationalizing
them. You can view the SPL commands used at every step, schedule the configured
clustering setup, create alerts based on the output generated, or use one of the
validation panels on your own dashboard.

To learn more about the Machine Learning Toolkit, including the other Assistants, browse
through our ML videos.
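
As a rough companion to the preprocessing step in this walkthrough, here is a minimal Python sketch of what standardizing fields does before clustering. It runs outside Splunk and is not the SPL that the assistant generates. The readings are made-up values, and the `standardize` helper is a hypothetical name used only for illustration. The sketch assumes that clustering relies on Euclidean distance between events, which is why a field measured in thousands (engine speed) can drown out a field measured in fractions (lateral G-force).

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical readings, not taken from the track day dataset:
# (engine speed in rpm, lateral G-force)
readings = [
    (2000.0, 0.1),
    (2100.0, 1.2),
    (6000.0, 0.2),
    (6100.0, 1.3),
]


def standardize(rows):
    """Rescale each column to zero mean and unit population standard deviation.

    Assumes every column has at least two distinct values, so that the
    standard deviation is never zero.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]


scaled = standardize(readings)

# Rows 0 and 2 differ mainly in engine speed; rows 0 and 1 differ mainly in G-force.
raw_gap_rpm = dist(readings[0], readings[2])
raw_gap_g = dist(readings[0], readings[1])
scaled_gap_rpm = dist(scaled[0], scaled[2])
scaled_gap_g = dist(scaled[0], scaled[1])

print(f"raw:    speed gap {raw_gap_rpm:.2f}, G-force gap {raw_gap_g:.2f}")
print(f"scaled: speed gap {scaled_gap_rpm:.2f}, G-force gap {scaled_gap_g:.2f}")
```

On the raw values, the gap between events that differ in engine speed is far larger than the gap between events that differ in G-force, so a distance-based algorithm would effectively ignore G-force. After standardization, the two gaps are of similar size, so both fields contribute to the grouping.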
