Scaling HP Diagnostics

I was recently called in to troubleshoot a large-scale HP Diagnostics installation. This Tech Tip contains some of the things I learnt about scaling a Diagnostics Server to handle a large number of Probes.

First, some architectural information…

In a basic installation, a single Diagnostics Server running in Commander mode collects data from multiple Probes.
[Figure: HP Diagnostics with a single server in Commander mode]

To scale Diagnostics to handle a larger number of Probes, the recommended method is to have multiple Diagnostics Servers in Mediator mode which aggregate the data from the Probes, and then send it to a Diagnostics Server in Commander mode.
[Figure: HP Diagnostics with servers in Mediator mode]

In this particular situation, we needed to scale a single Diagnostics 6.6 Server (in Commander mode) running on Solaris 9 from 4 Probes to 52 Probes. The default Diagnostics Server configuration is definitely not set up to handle this many probes.

Increase your JVM heap size

The recommended JVM heap sizes vary depending on the number of Probes the Diagnostics Server has to handle.

Number of Probes    Recommended Heap Size
0 – 10              512 MB
11 – 20             700 MB
21 – 30             1,400 MB

The default setting is 512 MB, which I increased to 2048 MB in the <diagnostics server dir>/bin/server.nanny file by changing “-Xmx512m” to “-Xmx2048m”.
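
If you want to sanity-check which heap size a Diagnostics Server is currently configured with, a quick grep of the nanny file will show the JVM options (the path placeholder follows the same convention as above):

<wpdiagnsvr01> $ grep Xmx <diagnostics server dir>/bin/server.nanny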

Increase the number of threads used to collect data from the Probes

Every Probe has a buffer where it stores data until the Diagnostics Server collects it. The Diagnostics HealthView was displaying warnings for most Probes, with the message “The probe is reporting that it is capturing data faster than it can handle it.”

[Figure: HP Diagnostics HealthView showing the warning “The probe is reporting that it is capturing data faster than it can handle it”]

…which appears in the server.log file like this:

2009-05-25 17:50:36,450: INFO     registrar                      phHandler.addWarningToGraphElement[1559]: Adding warning for element: The probe is reporting that it is capturing data faster than it can handle it.

You will also see the following error in the probe.log file as the buffer fills up:

2009-05-25 09:44:53,537 SEVERE com.mercury.opal.capture [MessageListenerThreadPool : 3] Failed to capture Exception: code=Server.generalException, description=java.net.SocketTimeoutException: Socket operation timed out before it could be completed, details=
com.mercury.diagnostics.common.io.LimitExceededException
	at com.mercury.opal.capture.util.Buffer.incrementPosition_th(Buffer.java:756)
	at com.mercury.opal.capture.util.DataBuffer.prepareForBulkWrite(DataBuffer.java:349)
	at com.mercury.opal.capture.event.FragmentSOAPFaultEvent.exception(FragmentSOAPFaultEvent.java:56)
	at com.mercury.opal.capture.BasicMethodCaptureAgent.exception(BasicMethodCaptureAgent.java:1041)
	at com.mercury.opal.capture.proxy.MethodCaptureProxy.exception(MethodCaptureProxy.java:367)

This is either a problem with the Probe buffer size or with the frequency at which the Diagnostics Server collects data.

I found that the default number of threads used to collect data from all the Probes is only 5, which is far too small for a large number of Probes. I changed the following line in server.properties from 5 to 52, so that there is one thread per Probe (threads are cheap):

# Maximum number of Threads to use when pulling data from probes.
#  -- Pulling from a probe can take 60s to timeout if there is something wrong
#     on the SUT.  If there are many probes timing out, increasing the number 
#     of Threads will allow data to be pulled from other probes more timely.
probe.pull.max.threads = 5
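
After the change, the same entry in server.properties looked like this (one puller thread per Probe):

# Maximum number of Threads to use when pulling data from probes.
probe.pull.max.threads = 52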

Increase the number of threads used to process the data from the Probes

After I changed the number of Diagnostics Server threads that pull data from the probes, the error on the Probes went away, but the Diagnostics Server started complaining that it could not keep up with the amount of data that it was receiving.

2009-05-25 17:50:36,467: INFO     registrar                      phHandler.addWarningToGraphElement[1559]: Adding warning for element: The server is reporting that it is receiving data from probes faster than it can process it.  Probes may begin throttling.

It was also necessary to increase the number of correlation threads and aggregation threads in server.properties from their default value of 2.

#maximum number of correlation threads to use
correlation.max.thread.count=2
 
#number of aggregation threads to use
aggregation.thread.count=2
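
I won't pretend there is a universal number for these. As a rough illustration only, an edited block for a heavily loaded server might look something like the following, with the actual values tuned to your probe count and available CPUs:

#maximum number of correlation threads to use (illustrative value)
correlation.max.thread.count=8

#number of aggregation threads to use (illustrative value)
aggregation.thread.count=8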

There were also some operating system settings that needed to be changed on Solaris.

Increase the number of file descriptors

Solaris (and other Unix-like operating systems) limits the number of files that can be open at one time by a shell and its child processes. You can check this by running the following:

<wpdiagnsvr01> $ ulimit -a

In my case, the “open files” limit was set to 256. This was increased to 1024 by putting a call to “ulimit -n 1024” at the top of the Diagnostics Server startup script.
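
For example, the top of the startup script ended up looking something like this (the script name and the rest of its contents will vary between installations):

#!/bin/sh
# Raise the open file descriptor limit before the Diagnostics Server JVM starts.
# Note: this cannot exceed the hard limit reported by "ulimit -Hn".
ulimit -n 1024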

Increase the max number of TCP connections

I was seeing lots of connectivity problems between the Probes and the Diagnostics Server.

In the probe.log, I would see long stretches where the probe would be trying to connect…

2009-05-31 07:37:21,010 INFO class com.mercury.diagnostics.probe.enterprise.HostPort [Server connection Thread] Attempting to connect to mediator on AppMbr02 wpdiagsvr01:2612 [null]...
2009-05-31 07:37:51,015 INFO class com.mercury.diagnostics.probe.enterprise.MediatorManager [Server connection Thread] Connecting to server: AppMbr02 wpdiagsvr01:2612 [null]
2009-05-31 07:37:51,015 INFO class com.mercury.diagnostics.probe.enterprise.HostPort [Server connection Thread] Attempting to connect to mediator on AppMbr02 wpdiagsvr01:2612 [null]...
2009-05-31 07:38:21,020 INFO class com.mercury.diagnostics.probe.enterprise.MediatorManager [Server connection Thread] Connecting to server: AppMbr02 wpdiagsvr01:2612 [null]
2009-05-31 07:38:21,021 INFO class com.mercury.diagnostics.probe.enterprise.HostPort [Server connection Thread] Attempting to connect to mediator on AppMbr02 wpdiagsvr01:2612 [null]...
2009-05-31 07:38:51,026 INFO class com.mercury.diagnostics.probe.enterprise.MediatorManager [Server connection Thread] Connecting to server: AppMbr02 wpdiagsvr01:2612 [null]
2009-05-31 07:38:51,026 INFO class com.mercury.diagnostics.probe.enterprise.HostPort [Server connection Thread] Attempting to connect to mediator on AppMbr02 wpdiagsvr01:2612 [null]...
2009-05-31 07:39:21,031 INFO class com.mercury.diagnostics.probe.enterprise.MediatorManager [Server connection Thread] Connecting to server: AppMbr02 wpdiagsvr01:2612 [null]
2009-05-31 07:39:21,032 INFO class com.mercury.diagnostics.probe.enterprise.HostPort [Server connection Thread] Attempting to connect to mediator on AppMbr02 wpdiagsvr01:2612 [null]...
2009-05-31 07:39:51,037 INFO class com.mercury.diagnostics.probe.enterprise.MediatorManager [Server connection Thread] Connecting to server: AppMbr02 wpdiagsvr01:2612 [null]
2009-05-31 07:39:51,038 INFO class com.mercury.diagnostics.probe.enterprise.HostPort [Server connection Thread] Attempting to connect to mediator on AppMbr02 wpdiagsvr01:2612 [null]...
2009-05-31 07:40:21,043 INFO class com.mercury.diagnostics.probe.enterprise.MediatorManager [Server connection Thread] Connecting to server: AppMbr02 wpdiagsvr01:2612 [null]
2009-05-31 07:40:21,043 INFO class com.mercury.diagnostics.probe.enterprise.HostPort [Server connection Thread] Attempting to connect to mediator on AppMbr02 wpdiagsvr01:2612 [null]...

…and in the server.log, I would see network IO errors:

2009-06-03 09:27:54,164: WARNING  data_in                        ProbePullerTask.logQueryError[295]      : Unable to pull data from probe AppMbr01-wpapp01 ProbeTrendsPullerTask [next pull:1243985275000] - Remote query error: IO error sending query to http://localhost:2006/rhttp/out/AppMbr01:35009/query/?response_format=writable&action=probe_data&clientRelease=7&dataset=onlineCache&path=/level[equals(name, 'trends')]&response_format=writable&reset_records=true - http://localhost:2006/rhttp/out/AppMbr01:35009/query/?response_format=writable&action=probe_data&clientRelease=7&dataset=onlineCache&path=/level[equals(name, 'trends')]&response_format=writable&reset_records=true
2009-06-03 09:28:06,913: INFO     data_in                        ProbePullerTask.updateCache[638]        : Successfully pulled (ProbeTrendsPullerTask) data from  AppMbr01-wpapp01 ProbeTrendsPullerTask
...
2009-05-25 18:22:07,178: WARNING  time_synchronization           ProbeInfo.update[474]                   : Unable to synchronize time for probe AppMbr15 Exception is: java.net.SocketTimeoutException: connect timed out

The default configuration of Solaris 9 allows only a very small queue of pending TCP connections on a listening socket (defined by tcp_conn_req_max_q). The current value can be checked by running:

ndd /dev/tcp tcp_conn_req_max_q

I found that tcp_conn_req_max_q was set to 1024, which was too small for the number of connections that the listener on port 2006 needed to handle. I increased both the tcp_conn_req_max_q and tcp_conn_req_max_q0 settings to four times their original values.

To find out if you are hitting this limit, run the following command:

<wpdiagnsvr01> $  netstat -s | fgrep -i listendrop
        tcpListenDrop       = 15684     tcpListenDropQ0     =     0

This shows that 15684 TCP connections have been refused since the server was last rebooted. If this number keeps increasing, you are hitting your tcp_conn_req_max_q limit and should increase it.

Increasing these settings must be done by the root user. The new values should also be added to a Solaris startup file so that they are still applied after the server is restarted.
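
As root, the values can be raised with ndd -set. The first value below follows the four-fold increase described above (1024 x 4); the tcp_conn_req_max_q0 value is illustrative, since its starting value will differ between systems:

ndd -set /dev/tcp tcp_conn_req_max_q 4096
ndd -set /dev/tcp tcp_conn_req_max_q0 4096

A common way to make this permanent is to add the same ndd -set commands to a startup script under /etc/init.d (with a link in /etc/rc2.d) so they are reapplied after a reboot.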

Comments

Can you offer any experience or guidance on how many fully subscribed Mediators you have scaled the Diagnostics Commander to consume and consolidate data from?

Stuart Moncrieff

On Red Hat Linux, check the maximum TCP listen backlog by looking in /proc/sys/net/core/somaxconn (the default is 128, which is too low).

To see if you are hitting the TCP connection limit, run netstat --tcp --listening --statistics, and check for “failed connection attempts”.
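
A quick way to check and raise it (the value shown is illustrative; run as root, and add net.core.somaxconn to /etc/sysctl.conf to make it permanent):

cat /proc/sys/net/core/somaxconn
echo 1024 > /proc/sys/net/core/somaxconn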

Stuart Moncrieff

Note that if you install any patches for Diagnostics from HP (or if you upgrade your Diagnostics Server), they tend to overwrite all your configuration files.

Make sure that you have backed up all the configuration files in the Diagnostics Server's etc directory, and also back up your server.nanny file (which is in a different directory) if you have changed it.
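
A simple way to do this is to tar up both locations before applying the patch (the installation path is a placeholder in the same style used above):

cd <diagnostics server dir>
tar cvf /var/tmp/diagnostics-config-backup.tar etc bin/server.nanny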
