Topology churn is one issue that can cause serious Foglight performance degradation. It results from the constant modification and creation of new versions of existing topology objects, typically caused by bad configurations or poorly written custom agents. We can view the overall churn by browsing the Alarms dashboard’s All System Changes view (see figure 1).

Figure 1: (Foglight Management Server) All system changes

While the dashboard above gives you an indication of churn, it does not tell you what is causing it. That information is only available if you generate a Foglight Management Server (FMS) Support Bundle and examine the Diagnostic Snapshot data (see figure 2), but that is a fixed snapshot of changes over the past week. The column that denotes churn is Num Recent Versions.

Figure 2: (Bash) Churn from the diagnostic snapshot

There is a better approach. If we capture the topology type changes every 30 minutes and feed each snapshot to Splunk, we can start graphing the data and spotting trends. Being able to spot trends means we can understand when churn usually occurs and focus our efforts on reducing it.

To provide an example, I run an FMS and a Splunk lab in Docker containers (see figure 3).

Figure 3: (Bash) Foglight Management Server and Splunk running on Docker containers
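
In case you want to build a similar lab, the commands below are a rough sketch of the setup. The Splunk container uses the official splunk/splunk image; the Foglight image name, ports and passwords are placeholders for whatever build you have locally, as I am not aware of an official public Foglight container image.

```bash
# Minimal sketch of the lab setup -- image names, ports and passwords are
# examples only; adjust them to match your own environment.

# Splunk (official image): web UI on 8000, management API on 8089
docker run -d --name splunk \
  -e SPLUNK_START_ARGS=--accept-license \
  -e SPLUNK_PASSWORD='ChangeMe123!' \
  -p 8000:8000 -p 8089:8089 \
  splunk/splunk:latest

# Foglight Management Server -- 'foglight-fms' is a hypothetical local image;
# 8080 is the default FMS web console port
docker run -d --name fms \
  -p 8080:8080 \
  foglight-fms:latest
```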

Next, I wrote a Foglight Groovy script that extracts the number of changes observed for a topology type over a 30-minute period. Figure 4 shows the script in action.

Figure 4: (Bash) Groovy script to extract churn for last 30 minutes
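
I have not reproduced the exact script from figure 4 here, but the sketch below shows the general idea, assuming the FMS script environment exposes a QueryService with queryTopologyObjects() and that topology objects carry a lastUpdated property. Verify those names against the scripting API reference for your Foglight version before relying on them.

```groovy
// Sketch only: count topology objects of selected types that changed in the
// last 30 minutes and print one CSV line per type for Splunk to ingest.
// The QueryService lookup, the "!Type" query syntax and the lastUpdated
// property are assumptions -- check them against your FMS scripting API docs.
def queryService = server["QueryService"]

def windowMs = 30 * 60 * 1000L
def cutoff   = System.currentTimeMillis() - windowMs

// Types suspected of churning; adjust this list for your environment.
def typesToCheck = ["Host", "HostCPUs", "Memory", "NetworkInterface"]

def now = new Date().format("yyyy-MM-dd HH:mm:ss")

typesToCheck.each { typeName ->
    def objects = queryService.queryTopologyObjects("!" + typeName)
    def changed = objects.findAll { obj ->
        def updated = obj.get("lastUpdated")   // assumed built-in property
        updated != null && updated.time >= cutoff
    }.size()
    println "${now},${typeName},${changed}"
}
```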

The script above can then be executed by Splunk every 30 minutes, with the results stored and analysed. Instead of calling the fglcmd.sh script directly, I wrote a wrapper called run.sh (see figure 5).
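
A wrapper along these lines is all that is needed. The paths, credentials and the script:run command shown below are assumptions, so confirm the exact fglcmd.sh syntax for your Foglight version with its built-in help.

```bash
#!/bin/bash
# run.sh -- asks the FMS to run the churn script and prints the output so
# Splunk can ingest it. Paths, credentials and the script:run command are
# assumptions; confirm the exact syntax with 'fglcmd.sh -help'.

FGLCMD_HOME=/opt/quest/foglight/bin
FMS_HOST=fms
FMS_PORT=8080
FMS_USER=foglight
FMS_PASS=foglight

"${FGLCMD_HOME}/fglcmd.sh" \
  -srv "${FMS_HOST}" -port "${FMS_PORT}" \
  -usr "${FMS_USER}" -pwd "${FMS_PASS}" \
  -cmd script:run -f /opt/scripts/topology_churn.groovy
```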

Figure 5: (Splunk) Configuring Splunk to run script every 30 minutes to collect churn metrics
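
Whether you set this up through the Splunk UI or directly in a configuration file, the result is a scripted input. A minimal inputs.conf stanza along these lines (the app name, index and sourcetype are examples only) runs run.sh every 30 minutes:

```ini
# $SPLUNK_HOME/etc/apps/foglight_churn/local/inputs.conf
# Run the wrapper every 30 minutes (1800 seconds)
[script://$SPLUNK_HOME/etc/apps/foglight_churn/bin/run.sh]
interval = 1800
sourcetype = foglight:churn
index = foglight
disabled = 0
```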

Once the data is stored in Splunk, we can analyse it and create dashboards that highlight the topology types causing churn in real time. Figure 6 below shows such an example. Compare this with what you see in figure 1: you get heaps more intelligence to work with when trying to reduce Foglight topology churn.

Figure 6: (Splunk) Splunk Dashboard showing churn
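
A panel like the one in figure 6 can be driven by a simple search. The sketch below assumes the CSV fields emitted by the script have been extracted as topology_type and num_changes:

```
index=foglight sourcetype=foglight:churn
| timechart span=30m sum(num_changes) AS changes BY topology_type
```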
