YSoft SafeQ Performance and Availability Monitoring Guidelines

Executive Summary

The purpose of this document is to provide guidelines on monitoring the overall performance and system health of YSoft SafeQ in the production environment.

This document is intended as a guide only; it is intended to supplement, not replace, the expertise of the contracted infrastructure monitoring team.

Best Practices

From a high-level perspective, the performance and health of the solution can be assessed by answering five questions:

  • Are the services running as needed?

  • What is the CPU utilization?

  • What is the memory utilization?

  • What is the hard disk utilization?

  • What is the network utilization?

Most common issues resulting in degraded or outright denied service will manifest as one of these metrics falling outside its normal boundaries. Such variances indicate a need to begin diagnosing system performance issues.

Y Soft recommends looking at the general health of services using a multitude of tools. Aside from monitoring the general health of the underlying infrastructure - which is beyond the scope of this article - there are operating system and service-level metrics that can be leveraged. These include standard tools such as Performance Monitor and Windows service monitoring, as well as YSoft SafeQ-specific APIs and tools.

Windows Performance Counters

Microsoft Windows operating systems come with the standard monitoring tool perfmon.exe. This solution is more robust than Task Manager or Resource Monitor, in that it provides a wider array of metrics and logging support. The following metrics should be noted when analyzing YSoft SafeQ servers for stability. Technicians familiar with perfmon can set up monitoring to alert when deviations appear.

Values marked with a (*) indicate that each individual instance will be collected.

Values marked with a (_Total) indicate that the sum or average (where appropriate) of all instances will be collected.

Each entry below lists the counter collected, its description (from Perfmon), the ideal range, and notes where applicable.

\Memory\Available MBytes
Description: Available MBytes is the amount of physical memory, in megabytes, immediately available for allocation to a process or for system use. It is equal to the sum of memory assigned to the standby (cached), free, and zero page lists.
Ideal range: N/A

\Memory\Pages/sec
Description: Pages/sec is the rate at which pages are read from or written to disk to resolve hard page faults. This counter is a primary indicator of the kinds of faults that cause system-wide delays. It is the sum of Memory\Pages Input/sec and Memory\Pages Output/sec. It is counted in numbers of pages, so it can be compared to other counts of pages, such as Memory\Page Faults/sec, without conversion. It includes pages retrieved to satisfy faults in the file system cache (usually requested by applications) and in non-cached mapped memory files.
Ideal range: Near 0.
Notes: This indicates how often page file content is written to or read from disk. High values indicate the system is running low on physical memory.

\Memory\Pages Input/sec
Description: Pages Input/sec is the rate at which pages are read from disk to resolve hard page faults. Hard page faults occur when a process refers to a page in virtual memory that is not in its working set or elsewhere in physical memory, and must be retrieved from disk. When a page is faulted, the system tries to read multiple contiguous pages into memory to maximize the benefit of the read operation. Compare the value of Memory\Pages Input/sec to the value of Memory\Page Reads/sec to determine the average number of pages read into memory during each read operation.
Ideal range: Near 0.
Notes: Occasional spikes are expected.

\Memory\Pool Nonpaged Bytes
Description: Pool Nonpaged Bytes is the size, in bytes, of the nonpaged pool, an area of the system virtual memory that is used for objects that cannot be written to disk, but must remain in physical memory as long as they are allocated. Memory\Pool Nonpaged Bytes is calculated differently than Process\Pool Nonpaged Bytes, so it might not equal Process(_Total)\Pool Nonpaged Bytes. This counter displays the last observed value only; it is not an average.
Ideal range: N/A

\Memory\Pool Paged Bytes
Description: Pool Paged Bytes is the size, in bytes, of the paged pool, an area of the system virtual memory that is used for objects that can be written to disk when they are not being used. Memory\Pool Paged Bytes is calculated differently than Process\Pool Paged Bytes, so it might not equal Process(_Total)\Pool Paged Bytes. This counter displays the last observed value only; it is not an average.
Ideal range: N/A

\Memory\% Committed Bytes in Use
Description: % Committed Bytes In Use is the ratio of Memory\Committed Bytes to the Memory\Commit Limit. Committed memory is the physical memory in use for which space has been reserved in the paging file should it need to be written to disk. The commit limit is determined by the size of the paging file. If the paging file is enlarged, the commit limit increases and the ratio is reduced. This counter displays the current percentage value only; it is not an average.
Ideal range: Less than 50%.

\Network Interface(*)\Packets Received Errors
Description: Packets Received Errors is the number of inbound packets that contained errors preventing them from being deliverable to a higher-layer protocol.
Ideal range: Near 0.

\Network Interface(*)\Output Queue Length
Description: Output Queue Length is the length of the output packet queue (in packets). If this is longer than two, there are delays and the bottleneck should be found and eliminated, if possible. Since the requests are queued by the Network Driver Interface Specification (NDIS) in this implementation, this will always be 0.
Ideal range: Less than 2.
Notes: A sustained average higher than 2 indicates a network bottleneck.

\Network Interface(*)\Bytes Total/sec
Description: Bytes Total/sec is the rate at which bytes are sent and received over each network adapter, including framing characters. Network Interface\Bytes Total/sec is a sum of Network Interface\Bytes Received/sec and Network Interface\Bytes Sent/sec.
Ideal range: N/A

\PhysicalDisk(_Total)\Avg. Disk sec/Read
Description: Avg. Disk sec/Read is the average time, in seconds, of a read of data from the disk.
Ideal range: Near 0.
Notes: Useful metric for determining disk latency; higher values are worse.

\PhysicalDisk(_Total)\Avg. Disk sec/Transfer
Description: Avg. Disk sec/Transfer is the time, in seconds, of the average disk transfer.
Ideal range: Near 0.
Notes: Useful metric for determining disk latency; higher values are worse.

\PhysicalDisk(_Total)\Avg. Disk sec/Write
Description: Avg. Disk sec/Write is the average time, in seconds, of a write of data to the disk.
Ideal range: Near 0.
Notes: Useful metric for determining disk latency; higher values are worse.

\PhysicalDisk(*)\Current Disk Queue Length
Description: Current Disk Queue Length is the number of requests outstanding on the disk at the time the performance data is collected. It also includes requests in service at the time of the collection. This is an instantaneous snapshot, not an average over the time interval. Multi-spindle disk devices can have multiple requests that are active at one time, but other concurrent requests are awaiting service. This counter might reflect a transitory high or low queue length, but if there is a sustained load on the disk drive, it is likely that this will be consistently high. Requests experience delays proportional to the length of this queue minus the number of spindles on the disks. For good performance, this difference should average less than two.
Ideal range: 2-3 per spindle during idle.
Notes: During spikes, check correlation with \Memory\Pages Input/sec.

\PhysicalDisk(_Total)\Disk Bytes/sec
Description: Disk Bytes/sec is the rate at which bytes are transferred to or from the disk during write or read operations.
Ideal range: N/A
Notes: Check correlation between this, \PhysicalDisk\Current Disk Queue Length, and \Memory\Pages Input/sec.

\PhysicalDisk(_Total)\% Idle Time
Description: % Idle Time reports the percentage of time during the sample interval that the disk was idle.
Ideal range: Varies
Notes: Very low idle time indicates either that the system is being overutilized or that the disk is not responsive enough; compare with the other physical disk metrics. Very high idle time indicates the server is being underutilized.

\Process(_Total)\Working Set
Description: Working Set is the current size, in bytes, of the Working Set of this process. The Working Set is the set of memory pages touched recently by the threads in the process. If free memory in the computer is above a threshold, pages are left in the Working Set of a process even if they are not in use. When free memory falls below a threshold, pages are trimmed from Working Sets. If they are needed, they will then be soft-faulted back into the Working Set before leaving main memory.
Ideal range: N/A
Notes: Useful for comparison with the \Memory metrics.

\Processor(_Total)\% Processor Time
Description: % Processor Time is the percentage of elapsed time that the processor spends executing a non-idle thread. It is calculated by measuring the percentage of time that the processor spends executing the idle thread and then subtracting that value from 100%. (Each processor has an idle thread that consumes cycles when no other threads are ready to run.) This counter is the primary indicator of processor activity and displays the average percentage of busy time observed during the sample interval. Note that the accounting calculation of whether the processor is idle is performed at an internal sampling interval of the system clock (10 ms). On today's fast processors, % Processor Time can therefore underestimate processor utilization, as the processor may spend a lot of time servicing threads between system clock samples. Workload-based timer applications are one example of applications that are more likely to be measured inaccurately, because timers are signaled just after the sample is taken.
Ideal range: Less than 50%.
Notes: With high values, compare to the processor queue length to determine the total load on the system.

\Processor(_Total)\% Idle Time
Description: % Idle Time is the percentage of time the processor is idle during the sample interval.
Ideal range: 50% or more.
Notes: Together with \Processor(_Total)\% Processor Time, this helps show how much time is spent context switching.

\System\Processor Queue Length
Description: Processor Queue Length is the number of threads in the processor queue. Unlike the disk counters, this counter shows ready threads only, not threads that are running. There is a single queue for processor time even on computers with multiple processors. Therefore, if a computer has multiple processors, you need to divide this value by the number of processors servicing the workload. A sustained processor queue of less than 10 threads per processor is normally acceptable, depending on the workload.
Ideal range: Less than 10 per processor.

\TCPv4\Connection Failures
Description: Connection Failures is the number of times TCP connections have made a direct transition to the CLOSED state from the SYN-SENT state or the SYN-RCVD state, plus the number of times TCP connections have made a direct transition to the LISTEN state from the SYN-RCVD state.
Ideal range: N/A
Notes: This counter accumulates since the last system restart; examine the delta between samples. A large change could indicate network issues.

\TCPv4\Connections Established
Description: Connections Established is the number of TCP connections for which the current state is either ESTABLISHED or CLOSE-WAIT.
Ideal range: Varies
Notes: This shows how many active connections are present.

\TCPv4\Connections Reset
Description: Connections Reset is the number of times TCP connections have made a direct transition to the CLOSED state from either the ESTABLISHED state or the CLOSE-WAIT state.
Ideal range: N/A
Notes: This counter accumulates since the last system restart; examine the delta between samples. A large change could indicate network issues.

\TCPv6\Connection Failures
Description: Connection Failures is the number of times TCP connections have made a direct transition to the CLOSED state from the SYN-SENT state or the SYN-RCVD state, plus the number of times TCP connections have made a direct transition to the LISTEN state from the SYN-RCVD state.
Ideal range: N/A
Notes: Reserved for future use.

\TCPv6\Connections Established
Description: Connections Established is the number of TCP connections for which the current state is either ESTABLISHED or CLOSE-WAIT.
Ideal range: N/A
Notes: Reserved for future use.

\TCPv6\Connections Reset
Description: Connections Reset is the number of times TCP connections have made a direct transition to the CLOSED state from either the ESTABLISHED state or the CLOSE-WAIT state.
Ideal range: N/A
Notes: Reserved for future use.
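For ad hoc checks, a few of the counters above can be sampled directly from PowerShell. The following is a minimal sketch only; a production deployment would typically use a perfmon Data Collector Set or a monitoring agent, and the counter selection and sample interval here are illustrative.

# Collect a representative subset of the counters listed above.
$counters = @(
    '\Memory\Available MBytes',
    '\Memory\Pages/sec',
    '\Processor(_Total)\% Processor Time',
    '\PhysicalDisk(_Total)\Avg. Disk sec/Read',
    '\System\Processor Queue Length'
)

Get-Counter -Counter $counters -SampleInterval 15 -MaxSamples 4 |
    ForEach-Object {
        foreach ($sample in $_.CounterSamples) {
            # Print timestamp, counter path, and the cooked (calculated) value.
            '{0}  {1} = {2:N2}' -f $sample.Timestamp, $sample.Path, $sample.CookedValue
        }
    }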


Additional Performance Monitoring Metrics: Microsoft SQL Server

Each entry below lists the value collected and its description.

Process Object : % Processor Time : sqlservr
Description: CPU time consumed by the sqlservr process (the Microsoft SQL Server service process).

SQL Server Access Methods Object : Full Scans/sec
Description: The Full Scan access method bypasses all indexes and may indicate sub-optimal performance. A certain amount of full-scan access cannot be prevented, but extensive use of full scans should trigger analysis and optimization.

SQL Server Databases : Active Transactions : All instances
Description: Number of concurrently running transactions. It should not exceed the long-term observed threshold. While this number is closely related to user activity in the system, a value that grows continuously over long periods of time may indicate problems.

SQL Server Databases : Transactions/sec : All instances
Description: Performance-oriented metric indicating database engine throughput.

SQL Server Transactions : Longest Transaction Running Time
Description: Transactions represent database operations, which are all time bound. Some transactions are long running, but no transaction should run indefinitely.

SQL Server Buffer Manager Object : Cache Hit Ratio (Buffer Cache Hit Ratio)
Description: Performance-oriented metric. If the cache hit ratio is steadily low, analysis of the performance profile should be triggered to optimize cache utilization.

SQL Server General Statistics Object : User Connections
Description: Number of concurrent user connections. While there can be up to hundreds of concurrent connections, this number should not exceed a certain threshold; refer to the SafeQ configuration to determine that threshold.

  • Example: if the connection pool size in SafeQ is configured to 100 concurrent connections, this number should generally stay below 100. If it does not, that indicates connection pool exhaustion, which should trigger an alert.

SQL Server Locks Object : Average Wait Time : All instances
Description: Average wait time on SQL Server locks, which provide mutual exclusion on shared resources. The wait time should stay below or around the observed threshold. If this number grows steadily, that may indicate a problem.

SQL Server Locks Object : Number of Deadlocks/sec : All instances
Description: Deadlocks indicate deadlocked transactions.
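Most of these metrics are exposed as Windows performance counters and can be sampled with the same tooling as the operating system metrics. The following PowerShell sketch is illustrative only; the counter paths assume a default (unnamed) SQL Server instance, as named instances expose them under an MSSQL$<InstanceName> object name instead.

# Sample a subset of the SQL Server counters listed above (default-instance counter paths).
$sqlCounters = @(
    '\Process(sqlservr)\% Processor Time',
    '\SQLServer:Access Methods\Full Scans/sec',
    '\SQLServer:Databases(_Total)\Transactions/sec',
    '\SQLServer:Buffer Manager\Buffer cache hit ratio',
    '\SQLServer:General Statistics\User Connections'
)

Get-Counter -Counter $sqlCounters -SampleInterval 15 -MaxSamples 4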

Service Monitoring

YSoft SafeQ comprises several services, which vary by server role and functionality. The general health of all of these services is important for stable and sustained operation of the solution as a whole:

Each entry below lists the service name, its description, the server roles it runs on, and notes where applicable.

YSoft SafeQ Terminal Server
Description: Responsible for communication and AAA (Authentication, Authorization, Accounting).
Server roles: Site Server

YSoft SafeQ Spooler Controller Group Service
Description: Responsible for synchronization of Site Servers within a group.
Server roles: Site Server
Notes: The startup type is Manual by default; the service runs when a cluster is formed and is stopped when the Spooler Controller (SPOC) is standalone.

YSoft SafeQ Spooler Controller
Description: Business logic layer for the Terminal Server, FlexiSpooler, and workflow processing.
Server roles: Site Server

YSoft SafeQ Mobile Print Server
Description: Processes print jobs submitted via email workflows.
Server roles: Mobile Print Server
Notes: Multiple instances of the Mobile Print Server service can co-exist in one environment.

YSoft SafeQ Management Service
Description: Hosts the administrative web interface (Apache Tomcat) and manages the solution enterprise-wide.
Server roles: Management Server

YSoft SafeQ LDAP Replicator
Description: Responsible for replication of user data from directory services (Microsoft Active Directory).
Server roles: Management Server

YSoft SafeQ FlexiSpooler
Description: Responsible for print job reception, storage, and release.
Server roles: Site Server

YSoft Bundled Etcd
Description: Responsible for centralized configuration of the solution.
Server roles: All
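A minimal PowerShell sketch for a service-state check is shown below. It assumes the display names of the services start with "YSoft", as in the table above, and should be narrowed to the services actually relevant to the server's role.

# Alert on any YSoft SafeQ service that is installed but not running.
$safeqServices = Get-Service -DisplayName 'YSoft*' -ErrorAction SilentlyContinue
$stopped = $safeqServices | Where-Object { $_.Status -ne 'Running' }

foreach ($service in $stopped) {
    # Note: the Spooler Controller Group Service is legitimately stopped on standalone Site Servers (see Notes above).
    Write-Warning "Service '$($service.DisplayName)' is $($service.Status)"
}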


Monitoring Network Services

In general, monitoring using TCP half-open (half-handshake) connections is strongly recommended; a similar monitoring technique is employed by load-balancing solutions such as F5 BIG-IP. A typical third-party tool capable of performing half-open connection checks is nmap.

Each entry below lists the port, the service listening on it, its purpose, the implications of an outage, and whether it should be monitored on the Management Server and/or the Site Server.

Port 443 (Management Service)
Description: Serves the Dashboard.
Implications: If unavailable, Dashboard availability is impacted.
Monitor on Management Server: YES. Monitor on Site Server: NO.

Port 515 (LPD listener)
Description: Allows job reception from workstations.
Implications: If unavailable, jobs are not received by the Site Server.
Monitor on Management Server: NO. Monitor on Site Server: YES.

Port 4096 (Management / Site Server service)
Description: Common port used for hardware terminal communication; can also be used to determine that the service is up.
Implications: If the Management service is unavailable, the Dashboard is unavailable as well (even if the service process is running).
Monitor on Management Server: YES. Monitor on Site Server: YES.

Port 5012 (Terminal Server service)
Description: Allows Embedded Terminal authentication.
Implications: If unavailable, users cannot authenticate at MFPs.
Monitor on Management Server: NO. Monitor on Site Server: YES.

Port 9100 (SafeQ Client listener)
Description: Allows job reception from the SafeQ Client (only if failover option 4 or 5 is in use).
Implications: If unavailable, users cannot print.
Monitor on Management Server: NO. Monitor on Site Server: YES.
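Where nmap is not available on the monitoring host, the listed ports can at least be verified with a full-handshake check from PowerShell. The sketch below covers the Site Server ports from the table; note that Test-NetConnection completes the TCP handshake, whereas a true half-open check requires a SYN scan (for example, nmap -sS). The host name is a placeholder.

# Ports monitored on a Site Server according to the table above.
$siteServer = 'siteserver.example.tld'            # placeholder - use the actual Site Server FQDN
$siteServerPorts = 515, 4096, 5012, 9100

foreach ($port in $siteServerPorts) {
    $result = Test-NetConnection -ComputerName $siteServer -Port $port -WarningAction SilentlyContinue
    if (-not $result.TcpTestSucceeded) {
        Write-Warning "Port $port on $siteServer is not reachable"
    }
}

# On the Management Server, check ports 443 and 4096 the same way.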


Terminal Server API Integration

YSoft SafeQ's Terminal Server, which is present on all Site Servers, exposes a REST API that can be leveraged to check the availability of services. Infrastructure monitoring capable of invoking REST endpoints can access the server status through the following cURL command, replacing {example.tld} with the Site Server's FQDN or IP address (the default Terminal Server port is 5021):

For UNIX clients:

curl --include 'https://{example.tld}:5021/ts/v1/hello'

For Windows clients using PowerShell (in Windows PowerShell, curl is an alias for Invoke-WebRequest):

curl -Uri https://{example.tld}:5021/ts/v1/hello

If the server is operational, an HTTP response of 200 OK is returned; an HTTP response of 500 (Internal Server Error) indicates an application failure.

(Screenshot: HELLO resource check response)

The HELLO resource is a diagnostic resource intentionally built into the Terminal Server service. Many application monitoring tools provide installable agents or connectors that can invoke RESTful web services and evaluate the status code. While the resource can be checked manually, its main purpose is to be monitored automatically by an application monitoring solution.
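Where no dedicated monitoring agent is available, the check can also be scripted. A minimal PowerShell sketch is shown below; {example.tld}, the port, and the alerting mechanism are placeholders, and certificate validation may need separate handling if the Terminal Server uses a self-signed certificate.

# Replace example.tld with the Site Server FQDN or IP address; 5021 is the default Terminal Server port.
try {
    $response = Invoke-WebRequest -Uri 'https://example.tld:5021/ts/v1/hello' -UseBasicParsing
    if ($response.StatusCode -ne 200) {
        Write-Warning "Terminal Server returned HTTP $($response.StatusCode)"
    }
} catch {
    # A 500 response or a connection failure ends up here.
    Write-Warning "Terminal Server HELLO check failed: $_"
}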

Site Server Monitoring with JMX

The Site Server installation can also be monitored through Java Management Extensions (JMX) Management Beans (MBeans) exposed by the YSoft SafeQ Spooler Controller service. JMX provides instrumentation of the Java Virtual Machine. Many application monitoring tools include JMX connectors that enable automated collection and monitoring of JMX metrics. JMX metrics can also be checked interactively using the bundled JConsole tool (see screenshot) or a third-party JMX command-line utility (https://github.com/jiaqi/jmxterm).

By default, JMX information is exposed on the localhost interface on tcp/9898. The configuration can be changed to enforce TLS-based encryption and username/password authentication.

(Screenshot: JMX metrics viewed in JConsole)

Recommended JMX MBeans and Metrics (Attributes) to Include in Application Monitoring

Each entry below lists the MBean, the attributes to collect, their descriptions, and the expected values where defined.

MBean: distCache:component=CacheManager,name="cacheManager",type=CacheManager
  Attribute: clusterSize - Number of members of the Site Server cluster. Expected value: equal on all members of the cluster.

MBean: java.lang:type=Threading
  Attribute: threadCount - Number of threads.

MBean: safeq/ymq/MessagingContext
  Attribute: getOnlinePeersCount - Number of connected peers.
  Attribute: getOnlinePeers - List of connected peers. Expected value: matching GUIDs for Spoolers, Clients, and Mobile Print Servers.
  Attribute: getDisconnectedPeersCount - Number of disconnected peers. Expected value: 0.
  Attribute: getDisconnectedPeers - List of disconnected peers. Expected value: should be empty.

MBean: safeq/eu.ysoft.safeq.ors.OrsNode
  Attribute: getNodeState - The state of the Site Server. Expected value: ONLINE.
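Collecting the MBean attributes above requires a JMX-capable client (JConsole, jmxterm, or an application monitoring agent). From plain PowerShell, only a basic reachability test of the JMX port is practical; the sketch below assumes the default tcp/9898 endpoint and must be run on the Site Server itself.

# Basic liveness check of the Spooler Controller JMX endpoint (default: localhost, tcp/9898).
$jmx = Test-NetConnection -ComputerName 'localhost' -Port 9898 -WarningAction SilentlyContinue
if (-not $jmx.TcpTestSucceeded) {
    Write-Warning 'Spooler Controller JMX endpoint (tcp/9898) is not listening'
}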

Proactive Care

Y Soft offers a YSoft SafeQ analysis solution known as Proactive Care, which was developed in response to monitoring requests from customers.

The following files should be monitored as often as every 15 minutes (controlled by the shortTask configuration property in proactive-care-agent.conf). The files mentioned below are typically located in the folder C:\SafeQ6\Proactive Care Agent\results. They are CSV-formatted files, and columns are numbered starting at 1 (a minimal parsing sketch follows the file descriptions below):

sqhc-orsmonitor-[cluster name].result

  1. Check column 1 for the timestamp; alert if the file has not been updated recently, which means the monitoring is not running.

  2. Check column 3; alert if the value is not 1, which means the server is offline.

  3. Check column 16; alert if the value differs from the number of members in the cluster.

sqhc-services-[server name]-SPOC.result

  1. Check column 1 for the timestamp; alert if the file has not been updated recently, which means the monitoring is not running.

  2. Check columns 3-7, which record the return values of the service checks; alert if a value is above the expected threshold.

    • XSA = ping result on https://<hostName>:5012/XeroxXSA/Service.asmx

    • XSA_IP = ping result on https://<hostIP>:5012/XeroxXSA/Service.asmx

    • EIP = ping result on http://<hostName>:5011/

    • EIP_IP = ping result on http://<hostIP>:5011/

    • EUI = response from the End User Interface

sqhc-services-[server name]-FSP.result

  1. Check column 1 for the timestamp; alert if the file has not been updated recently, which means the monitoring is not running.

  2. Check column 3 for the LPR response; alert if it is not 0.
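The following PowerShell sketch illustrates how such a check could be automated for the orsmonitor result file. The file name, delimiter, timestamp format, staleness threshold, and expected cluster size are assumptions and must be adjusted to the actual files produced in your environment.

$resultFile = 'C:\SafeQ6\Proactive Care Agent\results\sqhc-orsmonitor-CLUSTER.result'   # replace CLUSTER with the actual cluster name
$lastLine = Get-Content -Path $resultFile | Select-Object -Last 1
$columns = $lastLine -split ','                                                          # assumes comma-delimited values without a header row

# Column 1: timestamp - alert if the file is stale (monitoring is not running).
# The [datetime] cast assumes a parseable timestamp format; adjust if necessary.
if (((Get-Date) - [datetime]$columns[0]) -gt [timespan]::FromMinutes(30)) {
    Write-Warning 'Proactive Care result file has not been updated recently'
}

# Column 3: online flag - alert if not 1 (server offline).
if ($columns[2] -ne '1') {
    Write-Warning 'Site Server is reported offline'
}

# Column 16: cluster member count - alert if it differs from the expected size (here assumed to be 2).
if ([int]$columns[15] -ne 2) {
    Write-Warning "Unexpected number of cluster members: $($columns[15])"
}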

Infinispan HTTP/REST Endpoint

If advanced cluster health monitoring is required, including FlexiSpooler-to-Spooler Controller connectivity, the locally available Infinispan HTTP/REST endpoint can be used. This endpoint is critical for system functionality, so caution is advised.

  • FlexiSpooler Address Book registrations can be retrieved from endpoint:
    http://localhost:81/distLayer/com.ysoft.safeq.spoc.addressbook.AddressBook_distnamespace/

Please note that, due to security and performance sensitivity, this endpoint is available on the localhost (loopback) network interface only.
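A minimal PowerShell sketch for retrieving the Address Book registrations from the local endpoint is shown below. The response format is not documented here, so the sketch only verifies that the endpoint responds and prints the returned content.

# Must be run on the Site Server itself - the endpoint is bound to the loopback interface only.
$uri = 'http://localhost:81/distLayer/com.ysoft.safeq.spoc.addressbook.AddressBook_distnamespace/'
try {
    $response = Invoke-WebRequest -Uri $uri -UseBasicParsing
    "HTTP $($response.StatusCode)"
    $response.Content
} catch {
    Write-Warning "Infinispan REST endpoint check failed: $_"
}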

Database Maintenance on Management Servers

YSoft SafeQ 6 performs internal database maintenance tasks every day (by default at 1:00 a.m.).

Execution and successful finish of these maintenance tasks can be observed in the management-service logs:

Started service: 'DATABASE_MAINTENANCE' with result 'SUCCESS' for tenant: 'ApplicationTenantIdentification[tenantGuid=cluster_mngmt]'
on cluster node: 'skyfwzbh4t0i1k9l'

Upon successful finish, the following message is logged:

Ending invocation: 'Invocation[id=15266, invocationStatus=IN_PROGRESS, clusterNodeId='skyfwzbh4t0i1k9l',
lastModification='2018-08-27T05:00:00.083Z', serviceIdentification=DATABASE_MAINTENANCE]'
for tenant: 'ApplicationTenantIdentification[tenantGuid=cluster_mngmt]'

Please note that the DATABASE_MAINTENANCE service task is triggered on all Management Servers. If a single database instance is used for all cluster nodes, the task will successfully complete on only one of them.

Failure is indicated by the following message in the logs:

Started service: 'DATABASE_MAINTENANCE' with result 'FAILED' for tenant: 'ApplicationTenantIdentification[tenantGuid=cluster_mngmt]' on cluster node: '8612h4ol3voj5xpb'

Please note that these messages are logged with INFO severity, so you need to have INFO log level enabled to see them.
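Where the management-service logs are not already collected by a log-monitoring agent, a simple text search for the failure message is usually sufficient. A minimal PowerShell sketch is shown below; the log path is an assumption and must be adjusted to the actual management-service log location.

# The log directory below is an assumption - point it at the actual management-service logs.
$logFiles = 'C:\SafeQ6\Management\logs\*.log'

# Report the most recent maintenance runs that finished with result 'FAILED'.
Select-String -Path $logFiles -Pattern "DATABASE_MAINTENANCE' with result 'FAILED'" -SimpleMatch |
    Select-Object -Last 5 -Property Filename, LineNumber, Line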

Microsoft SQL Server: Logical Index Fragmentation

If the regular maintenance fails to run, database index fragmentation will continue to increase over time (depending on the actual traffic in the system). The expected fragmentation level is around 10%; it should stay around this value and not increase over time. The expected runtime of the DATABASE_MAINTENANCE task is up to 15 minutes, depending on configuration and fragmentation levels.

Fragmentation can be checked using the following query:

MS SQL Index Fragmentation

SELECT
    OBJECT_NAME(ips.object_id) AS [TableName],
    avg_fragmentation_in_percent,
    si.name AS [IndexName],
    schema_name(st.schema_id) AS [SchemaName],
    page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') ips
JOIN sys.tables st WITH (NOLOCK)
    ON ips.object_id = st.object_id
JOIN sys.indexes si WITH (NOLOCK)
    ON ips.object_id = si.object_id AND ips.index_id = si.index_id
WHERE st.is_ms_shipped = 0
    AND si.name IS NOT NULL
    AND avg_fragmentation_in_percent >= 10
    AND page_count > 1000
ORDER BY ips.avg_fragmentation_in_percent DESC;

The sample output from that query can look like this:

(Screenshot: sample output of the fragmentation query)

More detailed information (for troubleshooting) can be obtained using the following query, which is more resource-intensive and should only be used when the extra detail is needed:

SELECT
    OBJECT_NAME(ips.object_id) AS [TableName],
    avg_fragmentation_in_percent,
    si.name AS [IndexName],
    schema_name(st.schema_id) AS [SchemaName],
    page_count,
    index_level
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'detailed') ips
JOIN sys.tables st WITH (NOLOCK)
    ON ips.object_id = st.object_id
JOIN sys.indexes si WITH (NOLOCK)
    ON ips.object_id = si.object_id AND ips.index_id = si.index_id
WHERE st.is_ms_shipped = 0
    AND si.name IS NOT NULL
ORDER BY si.name, index_level DESC;
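For automated collection, the SAMPLED query above can also be scheduled from PowerShell, for example via the SqlServer module. The sketch below is illustrative only; the script path, instance name, database name, and alert threshold are placeholders to be adapted to the environment.

# Requires the SqlServer module (Install-Module SqlServer).
# 'SQLSERVER01' and 'SQDB' are placeholders for the actual instance and YSoft SafeQ database names.
$query = Get-Content -Path 'C:\Scripts\index-fragmentation.sql' -Raw   # the SAMPLED query shown above, saved to a file
Invoke-Sqlcmd -ServerInstance 'SQLSERVER01' -Database 'SQDB' -Query $query |
    Where-Object { $_.avg_fragmentation_in_percent -gt 30 } |          # alert threshold is illustrative
    ForEach-Object {
        Write-Warning ("Index {0} on {1} is {2:N1}% fragmented" -f $_.IndexName, $_.TableName, $_.avg_fragmentation_in_percent)
    }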