Scaling HP Diagnostics

I was recently called in to troubleshoot a large-scale HP Diagnostics installation. This Tech Tip contains some of the things I learnt about scaling a Diagnostics Server to handle a large number of Probes.
First, some architectural information...
For a basic installation, there is a single Diagnostics Server that is running in Commander mode which collects data from multiple Probes.
To scale Diagnostics to handle a larger number of Probes, the recommended method is to have multiple Diagnostics Servers in Mediator mode which aggregate the data from the Probes, and then send it to a Diagnostics Server in Commander mode.
In this particular situation, we needed to scale a single Diagnostics 6.6 Server (in Commander mode) running on Solaris 9 from 4 Probes to 52 Probes. The default Diagnostics Server configuration is definitely not set up to handle this many probes.
Increase your JVM heap size
The recommended JVM heap sizes vary depending on the number of Probes the Diagnostics Server has to handle.
| Number of Probes | Recommended Heap Size |
|---|---|
| 0 – 10 | 512 MB |
| 11 – 20 | 700 MB |
| 21 – 30 | 1,400 MB |
The default setting is 512 MB. Since 52 Probes is well beyond the top row of the table, I increased the heap to 2048 MB in the <diagnostics server dir>/bin/server.nanny file by changing "-Xmx512m" to "-Xmx2048m". Note the trailing "m": without it the JVM interprets the value as bytes and will refuse to start.
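The edit can be scripted. Here is a minimal sketch using a stand-in server.nanny containing only the JVM argument (the real file lives under the Diagnostics Server install directory and holds many more settings):

```shell
# Demo of the server.nanny heap edit against a stand-in file.
NANNY=server.nanny
printf '%s\n' '-Xms512m -Xmx512m' > "$NANNY"   # stand-in JVM arguments
cp "$NANNY" "$NANNY.bak"                       # keep a backup before editing
# Note the trailing "m": without it the JVM reads the value as bytes.
sed 's/-Xmx512m/-Xmx2048m/' "$NANNY.bak" > "$NANNY"
cat "$NANNY"                                   # -> -Xms512m -Xmx2048m
```

Restart the Diagnostics Server after the change so the new heap size takes effect.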
Increase the number of threads used to collect data from the Probes
Every Probe has a buffer where it stores data until the Diagnostics Server collects it. The Diagnostics HealthView was displaying a warning for most Probes: "The probe is reporting that it is capturing data faster than it can handle it."
The same warning appears in the server.log file like this:
2009-05-25 17:50:36,450: INFO registrar phHandler.addWarningToGraphElement[1559]: Adding warning for element: The probe is reporting that it is capturing data faster than it can handle it.
You will also see the following error in the probe.log file as the buffer fills up:
2009-05-25 09:44:53,537 SEVERE com.mercury.opal.capture [MessageListenerThreadPool : 3] Failed to capture Exception: code=Server.generalException, description=java.net.SocketTimeoutException: Socket operation timed out before it could be completed, details=
com.mercury.diagnostics.common.io.LimitExceededException
at com.mercury.opal.capture.util.Buffer.incrementPosition_th(Buffer.java:756)
at com.mercury.opal.capture.util.DataBuffer.prepareForBulkWrite(DataBuffer.java:349)
at com.mercury.opal.capture.event.FragmentSOAPFaultEvent.exception(FragmentSOAPFaultEvent.java:56)
at com.mercury.opal.capture.BasicMethodCaptureAgent.exception(BasicMethodCaptureAgent.java:1041)
at com.mercury.opal.capture.proxy.MethodCaptureProxy.exception(MethodCaptureProxy.java:367)
This points to either the Probe buffer size or the frequency at which the Diagnostics Server collects data.
I found that the default number of threads used to pull data from all of the Probes is only 5, which is far too few for a large number of Probes. In server.properties, I changed the following value from 5 to 52 (threads are cheap)...
# Maximum number of Threads to use when pulling data from probes.
# -- Pulling from a probe can take 60s to timeout if there is something wrong
# on the SUT. If there are many probes timing out, increasing the number
# of Threads will allow data to be pulled from other probes more timely.
probe.pull.max.threads = 5
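With 52 Probes, the edited entry in server.properties simply becomes:

```properties
# Maximum number of Threads to use when pulling data from probes.
# Raised from the default of 5 so that one slow or timing-out probe
# does not hold up collection from the other 51.
probe.pull.max.threads = 52
```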
Increase the number of threads used to process the data from the Probes
After I changed the number of Diagnostics Server threads that pull data from the probes, the error on the Probes went away, but the Diagnostics Server started complaining that it could not keep up with the amount of data that it was receiving.
2009-05-25 17:50:36,467: INFO registrar phHandler.addWarningToGraphElement[1559]: Adding warning for element: The server is reporting that it is receiving data from probes faster than it can process it. Probes may begin throttling.
It was also necessary to increase the number of correlation threads and aggregation threads (again in server.properties) from their default value of 2.
#maximum number of correlation threads to use
correlation.max.thread.count=2
#number of aggregation threads to use
aggregation.thread.count=2
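For example (the final values here are an assumption; the article records only that the defaults of 2 were raised):

```properties
# server.properties -- illustrative values; tune to your CPU count and
# the volume of data arriving from the Probes.
correlation.max.thread.count=4
aggregation.thread.count=4
```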
Some operating system settings also needed to be changed on Solaris.
Increase the number of file descriptors
Solaris (and other Unix-like operating systems) limits the number of files that can be open at one time by a shell and its child processes. You can check the current limits by running:
$ ulimit -a
In my case, the "open files" limit was 256. I increased it to 1024 by adding "ulimit -n 1024" at the top of the Diagnostics Server startup script.
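A sketch of the addition, assuming a Bourne-compatible startup script:

```shell
# Near the top of the Diagnostics Server startup script, before the JVM
# is launched, so the server process inherits the higher limit.
ulimit -n 1024 2>/dev/null || \
    echo "WARNING: could not raise fd limit (raise the hard limit as root)" >&2
echo "open files limit: $(ulimit -n)"
```

The soft limit can only be raised up to the hard limit; if the hard limit is below 1024, it must be raised by root first.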
Increase the max number of TCP connections
I was seeing lots of connectivity problems between the Probes and the Diagnostics Server.
In the probe.log, I would see long stretches where the Probe was trying (and failing) to connect:
2009-05-31 07:37:21,010 INFO class com.mercury.diagnostics.probe.enterprise.HostPort [Server connection Thread] Attempting to connect to mediator on AppMbr02 wpdiagsvr01:2612 [null]...
2009-05-31 07:37:51,015 INFO class com.mercury.diagnostics.probe.enterprise.MediatorManager [Server connection Thread] Connecting to server: AppMbr02 wpdiagsvr01:2612 [null]
2009-05-31 07:37:51,015 INFO class com.mercury.diagnostics.probe.enterprise.HostPort [Server connection Thread] Attempting to connect to mediator on AppMbr02 wpdiagsvr01:2612 [null]...
2009-05-31 07:38:21,020 INFO class com.mercury.diagnostics.probe.enterprise.MediatorManager [Server connection Thread] Connecting to server: AppMbr02 wpdiagsvr01:2612 [null]
2009-05-31 07:38:21,021 INFO class com.mercury.diagnostics.probe.enterprise.HostPort [Server connection Thread] Attempting to connect to mediator on AppMbr02 wpdiagsvr01:2612 [null]...
2009-05-31 07:38:51,026 INFO class com.mercury.diagnostics.probe.enterprise.MediatorManager [Server connection Thread] Connecting to server: AppMbr02 wpdiagsvr01:2612 [null]
2009-05-31 07:38:51,026 INFO class com.mercury.diagnostics.probe.enterprise.HostPort [Server connection Thread] Attempting to connect to mediator on AppMbr02 wpdiagsvr01:2612 [null]...
2009-05-31 07:39:21,031 INFO class com.mercury.diagnostics.probe.enterprise.MediatorManager [Server connection Thread] Connecting to server: AppMbr02 wpdiagsvr01:2612 [null]
2009-05-31 07:39:21,032 INFO class com.mercury.diagnostics.probe.enterprise.HostPort [Server connection Thread] Attempting to connect to mediator on AppMbr02 wpdiagsvr01:2612 [null]...
2009-05-31 07:39:51,037 INFO class com.mercury.diagnostics.probe.enterprise.MediatorManager [Server connection Thread] Connecting to server: AppMbr02 wpdiagsvr01:2612 [null]
2009-05-31 07:39:51,038 INFO class com.mercury.diagnostics.probe.enterprise.HostPort [Server connection Thread] Attempting to connect to mediator on AppMbr02 wpdiagsvr01:2612 [null]...
2009-05-31 07:40:21,043 INFO class com.mercury.diagnostics.probe.enterprise.MediatorManager [Server connection Thread] Connecting to server: AppMbr02 wpdiagsvr01:2612 [null]
2009-05-31 07:40:21,043 INFO class com.mercury.diagnostics.probe.enterprise.HostPort [Server connection Thread] Attempting to connect to mediator on AppMbr02 wpdiagsvr01:2612 [null]...
...and in the server.log, I would see network IO errors:
2009-06-03 09:27:54,164: WARNING data_in ProbePullerTask.logQueryError[295] : Unable to pull data from probe AppMbr01-wpapp01 ProbeTrendsPullerTask [next pull:1243985275000] - Remote query error: IO error sending query to http://localhost:2006/rhttp/out/AppMbr01:35009/query/?response_format=writable&action=probe_data&clientRelease=7&dataset=onlineCache&path=/level[equals(name, 'trends')]&response_format=writable&reset_records=true - http://localhost:2006/rhttp/out/AppMbr01:35009/query/?response_format=writable&action=probe_data&clientRelease=7&dataset=onlineCache&path=/level[equals(name, 'trends')]&response_format=writable&reset_records=true
2009-06-03 09:28:06,913: INFO data_in ProbePullerTask.updateCache[638] : Successfully pulled (ProbeTrendsPullerTask) data from AppMbr01-wpapp01 ProbeTrendsPullerTask
...
2009-05-25 18:22:07,178: WARNING time_synchronization
ProbeInfo.update[474] : Unable to synchronize time for probe AppMbr15 Exception is: java.net.SocketTimeoutException: connect timed out
The default configuration of Solaris 9 allows only a small TCP listen queue (the number of pending connections waiting to be accepted), controlled by the tcp_conn_req_max_q parameter. Check it by running:
ndd /dev/tcp tcp_conn_req_max_q
I found that tcp_conn_req_max_q was set to 1024, which was too small for the number of connections the listener on port 2006 had to handle, so I increased both tcp_conn_req_max_q and tcp_conn_req_max_q0 to four times their existing values.
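On Solaris this is done with ndd, run as root. A sketch of the tuning (the tcp_conn_req_max_q0 value is illustrative, since the article records only the tcp_conn_req_max_q figure):

```shell
# Read the current limits (values are system-specific):
ndd /dev/tcp tcp_conn_req_max_q      # was 1024 on this system
ndd /dev/tcp tcp_conn_req_max_q0

# Quadruple them; takes effect immediately but does not survive a reboot.
ndd -set /dev/tcp tcp_conn_req_max_q 4096
ndd -set /dev/tcp tcp_conn_req_max_q0 16384
```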
To find out if you are hitting this limit, run the following command:
$ netstat -s | fgrep -i listendrop
tcpListenDrop = 15684 tcpListenDropQ0 = 0
This shows that 15684 TCP connections have been dropped since the server last booted. If this number keeps increasing, you are hitting the tcp_conn_req_max_q limit and should raise it.
Increasing these settings must be done by the root user. They should also be applied from a Solaris startup (rc) script so that they survive a reboot.
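One conventional way to persist the tuning is a small rc script (a sketch; the script name and run-level link are hypothetical):

```shell
#!/sbin/sh
# /etc/init.d/tcp-tuning (hypothetical name) -- reapply the listen-queue
# tuning at boot. Link it into run level 2, e.g.:
#   ln /etc/init.d/tcp-tuning /etc/rc2.d/S69tcp-tuning
case "$1" in
start)
        ndd -set /dev/tcp tcp_conn_req_max_q 4096
        ndd -set /dev/tcp tcp_conn_req_max_q0 16384
        ;;
esac
```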