feat(server): observability improvements and startup/retry fixes #159

Open

anderslindho wants to merge 6 commits into master from improve-logging-observability

Conversation

anderslindho (Contributor) commented May 8, 2026

Addresses several operational pain points found in production at ESS.

Fixes

  • push_always_retry defaulted to True (a regression), which could freeze all commit processing indefinitely when CF was unreachable. Default is now False.
  • cleanOnStart/cleanOnStop swept channels synchronously on the reactor thread, blocking IOC commits for the full duration of the sweep. Both now run in a background thread.
  • Warning messages for IOC connection problems were ambiguous and could fire twice for the same event. A numeric iocName (ephemeral source port) is now flagged explicitly.

Observability

  • Periodic status lines (configurable via statusInterval, default 60 s) log active/queued connections against maxActive, together with the CF processor's IOC and channel counts: the key indicators that were invisible during the production incident.
  • CF push duration is logged per attempt, making CF latency regressions visible in the log stream.
  • Optional Prometheus metrics endpoint (metricsPort, requires pip install recceiver[metrics]). Exposes connection state, CF processor state, and a push duration histogram. Degrades gracefully to no-ops if prometheus-client is not installed.
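The per-attempt push timing described above can be sketched as follows. This is a minimal stand-in, not the actual recceiver code; `timed_push` and `push_fn` are illustrative names:

```python
import logging
import time

_log = logging.getLogger(__name__)


def timed_push(push_fn, payload):
    """Run one CF push attempt and log its wall-clock duration,
    whether the attempt succeeds or raises."""
    start = time.monotonic()
    try:
        return push_fn(payload)
    finally:
        _log.info("CF push attempt took %.3f s", time.monotonic() - start)
```

Using `time.monotonic()` rather than `time.time()` keeps the measurement immune to system clock adjustments.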

…ssion

Defaulting to True changed the failure mode from "exhaust retries and
release the global lock" to "hold the lock indefinitely", which freezes
all commit processing when CF is unreachable. False restores sensible
behaviour: the commit drops and the IOC retries on its next reconnect.
…thread

clean_service() blocks the reactor for the full duration of the sweep,
holding the global lock and preventing all IOC commits. cleanOnStart
now schedules the sweep as a background thread after startup, so commits
are accepted immediately. cleanOnStop uses deferToThread so the reactor
stays live during shutdown while the lock is still held.

A numeric iocName that matches the source port range means the iocid
changes on every reconnect, silently accumulating stale channels in CF.
Log a warning when this is detected so misconfigured reccasters can be
found.

Distinguishes disconnect-before-upload from update-without-initial to
avoid the same warning firing twice for the same event.
There is currently no runtime visibility into whether maxActive is
throttling connections or how many channels the CF processor is
tracking. A LoopingCall in RecService logs active/queued connections
against the limit every 60 s; CFProcessor logs known_iocs and
tracked_channels on the same interval. Both are configurable via
statusInterval (0 disables).
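The periodic status loop can be approximated with stdlib threading; the real code uses Twisted's `task.LoopingCall`, and the `StatusLoop` class below is only an illustrative sketch of the statusInterval contract (0 disables the loop):

```python
import threading


class StatusLoop:
    """Periodic status logger; an interval of 0 disables it."""

    def __init__(self, log_status, interval):
        self.log_status = log_status
        self.interval = interval
        self._stop = threading.Event()
        self._thread = None

    def start(self):
        if self.interval <= 0:
            return  # statusInterval = 0 disables the loop
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # wait() returns False on timeout, True once stop() is called
        while not self._stop.wait(self.interval):
            self.log_status()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```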

Without timing, slow CF commits are invisible until they cause a
backlog. Per-attempt duration is now logged so latency regressions show
up in the log stream.

push_to_cf also checks processor.running on each iteration so a service
stop during a retry loop drains immediately rather than waiting up to
60 s per attempt.
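The combined behaviour of the bounded retry loop and the `processor.running` check can be sketched as follows (function and parameter names are illustrative; the real loop lives in recceiver's CF processor):

```python
import time


def push_with_retries(push_fn, payload, *, max_retries=3, always_retry=False,
                      is_running=lambda: True, backoff=1.0):
    """Bounded CF push retry loop.

    With always_retry=False (the restored default) the loop gives up
    after max_retries and the commit is dropped; the IOC re-sends it
    on its next reconnect. The is_running check lets a service stop
    drain the loop immediately instead of waiting out the backoff.
    """
    count = 0
    while always_retry or count < max_retries:
        if not is_running():  # service stopping: drain immediately
            return False
        try:
            push_fn(payload)
            return True
        except ConnectionError:
            count += 1
            time.sleep(backoff)
    return False
```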

Exposes connection state, CF processor state, and CF push
performance as Prometheus metrics on a configurable HTTP port
(metricsPort, disabled by default). Requires the optional
prometheus-client dependency (pip install recceiver[metrics]).

Gracefully degrades to no-ops if prometheus-client is not installed
so the dependency is truly optional.
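The graceful-degradation pattern looks roughly like this (a sketch, not recceiver's metrics module; the metric name is made up for illustration):

```python
try:
    from prometheus_client import Histogram  # optional dependency
except ImportError:
    class _Noop:
        """Accepts any metric call and does nothing."""
        def labels(self, *args, **kwargs):
            return self

        def observe(self, *args, **kwargs):
            pass

        def inc(self, *args, **kwargs):
            pass

    def Histogram(*args, **kwargs):  # same call shape as the real class
        return _Noop()

# Callers use the metric identically whether or not
# prometheus-client is installed.
PUSH_DURATION = Histogram("cf_push_seconds", "CF push duration in seconds")
```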

sonarqubecloud Bot commented May 8, 2026

"""Stop the CFProcessor service with lock held.

If clean_on_stop is enabled, mark all channels as inactive.
The sweep runs in a thread so the reactor stays live during shutdown.

We don't want the reactor live during any type of clean, do we?

```python
record_info_by_name = CFProcessor.record_info_by_name(record_infos, ioc_info)
self.update_ioc_infos(transaction, ioc_info, records_to_delete, record_info_by_name)
if not transaction.connected and ioc_info.ioc_id not in self.iocs:
    _log.warning(
```

Shouldn't we exit here?

Comment thread server/demo.conf
```ini
# Time interval for sending recceiver advertisements
#announceInterval = 15.0

# Interval in seconds between periodic status log lines (0 to disable)
```

not 100% sure about this, but ok.

And also update recceiver-full.conf.

```python
count = 0
sleep = 1.0
while processor.cf_config.push_always_retry or count < processor.cf_config.push_max_retries:
    if not processor.running:
```

Is there a way to have a test for this?

Comment thread server/pyproject.toml
Comment thread server/recceiver/metrics.py
```python
ioc_name = str(port)
_log.debug("IOC at %s:%d did not send IOCNAME; using port as iocName", host, port)
_log.debug("IOC at %s:%d has no iocName; using source port as iocName", host, port)
if ioc_name.isdigit() and 1024 <= int(ioc_name) <= 65535:
```

This condition seems odd to me. Does this ever occur outside of the conditional right above it?

```python
def stopService(self):
    _log.info("Stopping RecService")

    if hasattr(self, "_statusLoop") and self._statusLoop.running:
```

Ugh, I don't love using hasattr. Why not just initialise _statusLoop to None?

```python
self.lock.release()

self._statusLoop = task.LoopingCall(self._logStatus)
self._statusLoop.start(60.0, now=False)
```

This doesn't look very configurable to me...
