Forwarding Issues - Splunk Enterprise Troubleshooting Use Case - 27

14.06.24 09:16 PM - By Murugan

Forwarding Issues:

Issue: 
The TCP output processor has paused the data flow. Heavy Forwarder (HF) queues are blocked while the indexer's queues are empty.

Errors seen on Indexer:
ERROR TcpInputProc [13891 FwdDataReceiverThread] - Encountered S2S Exception=Failed to parse observed latency with value="18446744073709524" in json doc, doc="{"green":{"count":207},"red":{"count":1,"splunkd.file_monitor_input.ingestion_latency.ingestion_latency_gap_multiplier":{"ids":["4DD6F0C8-F61C-4719-98CD-4EA9F59C7B3C:XX.XX.237.32:60706"],"values":[18446744073709524]}}}" for data received from src=XX.XX.245.11:52664.
HF puts the indexers under quarantine:
WARN TcpOutputProc [17851 indexerPipe_1] - The TCP output processor has paused the data flow. Forwarding to host_dest="XX.XX.245.10" inside output group idx from host_src="STG-EMS-MW-N2" has been blocked for blocked_seconds="8820". This can stall the data flow towards indexing and other network outputs. Review the receiving system's health in the Splunk Monitoring Console. It is probably not accepting data.
WARN AutoLoadBalancedConnectionStrategy [17892 TcpOutEloop] - Applying quarantine to ip="XX.XX.245.10" port="9997" connid="0" _numberOfFailures="2"
08-10-2022 13:20:35.805 +0300 INFO AutoLoadBalancedConnectionStrategy [17892 TcpOutEloop] - Removing quarantine from idx="XX.XX.245.10:9997" connid="0"
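
To spot this symptom quickly in the HF's splunkd.log, the paused-output warning can be scanned for its host_dest and blocked_seconds fields. The following is only a rough sketch: the function name blocked_outputs, the min_seconds threshold and the sample line are assumptions made for illustration, and the regex tolerates the fields appearing with or without quotes.

```python
import re

# Sample line modeled on the HF warning quoted above.
SAMPLE = (
    'WARN TcpOutputProc [17851 indexerPipe_1] - The TCP output processor has paused the data flow. '
    'Forwarding to host_dest="XX.XX.245.10" inside output group idx from host_src="STG-EMS-MW-N2" '
    'has been blocked for blocked_seconds="8820".'
)

# Capture the destination and how long it has been blocked; quotes are optional.
PAUSED = re.compile(r'host_dest="?(?P<dest>[^\s"]+)"?.*?blocked_seconds="?(?P<secs>\d+)')


def blocked_outputs(lines, min_seconds=300):
    """Yield (host_dest, blocked_seconds) for paused-output warnings.

    min_seconds is a hypothetical threshold, not a Splunk setting.
    """
    for line in lines:
        if "TcpOutputProc" not in line:
            continue
        match = PAUSED.search(line)
        if match and int(match.group("secs")) >= min_seconds:
            yield match.group("dest"), int(match.group("secs"))


print(list(blocked_outputs([SAMPLE])))
```

In practice the lines would come from iterating over an open splunkd.log file handle rather than from a sample string.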

Root cause:
Ingestion latency is reported as part of the heartbeat. When the issue hits the HF, it starts sending an overflowed integer to the indexer. The indexer fails to parse that value and stops responding to heartbeats because of the error:
ERROR TcpInputProc [13891 FwdDataReceiverThread] - Encountered S2S Exception=Failed to parse observed latency with value="18446744073709524" in json doc, doc="{"green":{"count":207},"red":{"count":1,"splunkd.file_monitor_input.ingestion_latency.ingestion_latency_gap_multiplier":{"ids":["4DD6F0C8-F61C-4719-98CD-4EA9F59C7B3C:XX.XX.237.32:60706"],"values":[18446744073709524]}}}" for data received from src=XX.XX.245.11:52664.
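
As a rough illustration of what "overflown integer" means here, the sketch below pulls the latency value and source out of the indexer error and measures how far the value sits below 2**64. The helper names parse_latency_error and wrap_gap are made up for this example, the doc field in the sample is abbreviated, and the scale factor of 1000 is purely an assumption: it is one reading that makes the number look like a small negative latency wrapped into an unsigned 64-bit integer, not something confirmed by Splunk.

```python
import re

# Sample modeled on the indexer error above; the JSON doc is abbreviated as {...}.
SAMPLE = (
    'ERROR TcpInputProc [13891 FwdDataReceiverThread] - Encountered S2S Exception=Failed to parse '
    'observed latency with value="18446744073709524" in json doc, doc="{...}" '
    'for data received from src=XX.XX.245.11:52664.'
)

# Tolerates stray or missing quotes around the fields.
LATENCY_ERROR = re.compile(
    r'Failed"? to parse observed latency with value="?(?P<value>\d+)"?'
    r'.*?src="?(?P<src>[^\s":]+:\d+)'
)

U64 = 2**64  # number of distinct unsigned 64-bit values


def parse_latency_error(line):
    """Return {'value': int, 'src': str} for a matching error line, else None."""
    match = LATENCY_ERROR.search(line)
    if not match:
        return None
    return {"value": int(match.group("value")), "src": match.group("src")}


def wrap_gap(value, scale=1000):
    """Distance below 2**64 after multiplying by scale (scale is an assumed unit conversion)."""
    return U64 - value * scale


hit = parse_latency_error(SAMPLE)
print(hit["src"], wrap_gap(hit["value"]))
```

For the value in the log, wrap_gap returns 27616, which is small compared with 2**64. That is consistent with a negative number having been stored in an unsigned field, but it remains an interpretation.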

Solution: 
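In health.conf on the HF, set aggregate_ingestion_latency_health = 0 under [health_reporter] to disable the aggregated ingestion latency health check. Also set disabled = 1 (and alert.disabled = 1) under [feature:ingestion_latency] to disable the feature, then restart the indexer instance.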

File name: $SPLUNK_HOME/etc/system/local/health.conf 
[health_reporter]
aggregate_ingestion_latency_health = 0
[feature:ingestion_latency]
alert.disabled = 1
disabled = 1
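
Before restarting, it can help to confirm that the file really contains the three settings. The sketch below uses Python's configparser on the file contents; the function name missing_settings and the EXPECTED table are made up for this example. Splunk .conf files are not strictly INI, so this only works for simple files like the one above (for instance, no settings placed before the first stanza).

```python
import configparser

# The settings from the solution above, keyed by (stanza, setting).
EXPECTED = {
    ("health_reporter", "aggregate_ingestion_latency_health"): "0",
    ("feature:ingestion_latency", "alert.disabled"): "1",
    ("feature:ingestion_latency", "disabled"): "1",
}

SAMPLE_CONF = """\
[health_reporter]
aggregate_ingestion_latency_health = 0
[feature:ingestion_latency]
alert.disabled = 1
disabled = 1
"""


def missing_settings(conf_text):
    """Return (stanza, setting, found_value) for each expected setting that is absent or different."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep setting names exactly as written
    parser.read_string(conf_text)
    problems = []
    for (stanza, key), wanted in EXPECTED.items():
        found = parser.get(stanza, key, fallback=None)
        if found != wanted:
            problems.append((stanza, key, found))
    return problems


print(missing_settings(SAMPLE_CONF))
```

An empty list means all three settings are present with the expected values. On a live instance, running splunk btool health list --debug shows the merged configuration that splunkd will actually use, which is the more authoritative check.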
