Monitor Dell Foglight Topology Churn with Splunk

Topology churn is one issue that can cause serious Foglight performance degradation. It is the result of existing topology objects being constantly changed, creating new versions of those objects, and is usually caused by bad configurations or poorly written custom agents. We can view the overall churn by browsing the Alarms dashboard’s All System Changes view (see figure 1).

While the dashboard above gives you an indication of churn, it does not tell you what is causing it. That information is only available if you generate a Foglight Management Server (FMS) Support Bundle and examine the Diagnostics Snapshot data (see figure 2), and even then it is a fixed snapshot of changes over the past week. The column that denotes churn is Num Recent Versions.

There is a better approach. If we capture the topology type changes every 30 minutes and feed each snapshot to Splunk, we can start graphing the data and spotting trends. Being able to spot trends means we can understand when churn usually occurs and focus our efforts on reducing it.
To illustrate, I run an FMS and a Splunk lab in Docker containers (see figure 3).

Next, I wrote a Foglight Groovy script that extracts the number of changes observed for each topology type over a 30-minute period. Figure 4 shows the script in action.
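If you want to experiment with your own version, below is a minimal sketch of one way to count recently changed objects per topology type. It is not the script from figure 4, and the server["QueryService"] lookup, the queryTopologyObjects() call, the "lastUpdated" property and the getType().getName() accessors are assumptions about Foglight's Groovy scripting interface that you should verify against your FMS version.

    // Minimal sketch only -- not the script shown in figure 4.
    // Counts topology objects whose last update falls inside the past 30 minutes,
    // grouped by topology type, as a rough proxy for "recent versions".
    import java.util.concurrent.TimeUnit

    def queryService = server["QueryService"]    // assumption: script service lookup

    // Only count changes observed in the last 30 minutes.
    def cutoff = System.currentTimeMillis() - TimeUnit.MINUTES.toMillis(30)

    // Tally of recently changed objects per topology type.
    def churnByType = [:].withDefault { 0 }

    queryService.queryTopologyObjects("TopologyObject").each { obj ->
        def lastUpdated = obj.get("lastUpdated")   // assumption: property name
        if (lastUpdated != null && lastUpdated.time >= cutoff) {
            churnByType[obj.type.name] += 1
        }
    }

    // Emit one key=value line per type so Splunk can extract the fields automatically.
    def out = new StringBuilder()
    churnByType.sort { a, b -> b.value <=> a.value }.each { type, count ->
        out.append("topology_type=\"${type}\" changes=${count}\n")
    }

    // fglcmd prints the script's return value, which is what Splunk will ingest.
    return out.toString()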

Splunk can then execute the script above every 30 minutes, storing the results for analysis. Instead of having Splunk call fglcmd.sh directly, I wrote a wrapper called run.sh (see figure 5).
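For completeness, here is a hedged sketch of what such a wrapper might look like. Every path, hostname and credential is a placeholder, and the fglcmd.sh switches (-srv, -port, -usr, -pwd, -cmd script:run -f) should be checked against your Foglight version.

    #!/bin/sh
    # Sketch of a run.sh wrapper (the real one is shown in figure 5).
    # All paths, hostnames and credentials are placeholders.

    FGLCMD_DIR=/opt/quest/foglight/bin                                    # where fglcmd.sh lives
    SCRIPT=/opt/splunk/etc/apps/foglight_churn/bin/topology_churn.groovy  # the Groovy script above

    # Run the Groovy script on the FMS; the output goes to stdout,
    # which Splunk's scripted input captures.
    "${FGLCMD_DIR}/fglcmd.sh" \
        -srv fms.example.com -port 8080 \
        -usr foglight -pwd foglight \
        -cmd script:run -f "${SCRIPT}"

On the Splunk side, a scripted input stanza along these lines in inputs.conf (the app, index and sourcetype names are made up) runs the wrapper every 30 minutes:

    [script://./bin/run.sh]
    interval = 1800
    sourcetype = foglight:topology_churn
    index = main
    disabled = 0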

Once the data is stored in Splunk, we can analyse it and create dashboards that highlight the topology types causing churn in real time. Figure 6 below shows such an example. Compared with what you see in figure 1, you get heaps more intelligence to work with when trying to reduce Foglight topology churn.
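As a starting point for such a dashboard, a search along these lines (using the hypothetical field names from the script sketch above) charts churn per topology type over time:

    index=main sourcetype="foglight:topology_churn"
    | timechart span=30m sum(changes) AS changes BY topology_type

Topology types that consistently dominate the chart are the ones worth investigating first.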
