
Multiplex filtering #362

Open
amyjaynethompson wants to merge 6 commits into main from multiplex_filtering

Conversation

@amyjaynethompson (Contributor) commented Apr 28, 2026

xia2.multiplex has a built-in filtering option that can greatly improve data reduction quality. The VMXm beamline, in particular, always manually reprocesses datasets with xia2.multiplex to turn these filtering parameters on, so it would be useful to include this as part of the auto-processing infrastructure.

The issue has always been that the filtering can be slow, which can impede rapid feedback. In xia2, we recently made a new command-line program, xia2.multiplex_filtering, which performs the same filtering algorithms on a completed multiplex job. Splitting the algorithm into two separate programs allows rapid feedback from the initial multiplex run while still providing a filtered MTZ later. This PR attempts to provide the trigger/wrappers for such a filtering pipeline.

The cluster number is passed through from multiplex to multiplex_filtering to ensure that filtering is not triggered on clusters (cluster support could be implemented in the future, but it would need slightly different triggering requirements).

As this pipeline relies on a finished multiplex directory (it needs specific files that are not of interest to users), checks are done to make sure the data is available where expected. This uses the same delay multipliers as multiplex.
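The availability check described above could be sketched as follows. This is a minimal illustration, not the PR's actual code, and the file names are hypothetical stand-ins for whatever multiplex outputs the trigger actually requires:

```python
from pathlib import Path

# Hypothetical file names: the real set of multiplex outputs checked by the
# trigger may differ.
EXPECTED_FILES = ("scaled.expt", "scaled.refl", "xia2.multiplex.log")


def multiplex_outputs_ready(multiplex_dir: Path) -> bool:
    """Return True once every expected multiplex output file exists."""
    return all((multiplex_dir / name).is_file() for name in EXPECTED_FILES)
```

If the files are not yet present, the trigger would checkpoint and retry after a delay, scaled by the same multipliers multiplex uses.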

The sample group information is also passed through from multiplex. This is important, as there can be multiple sample groups related to a single DCID. Multiplex also passes through the actual DCIDs it used in processing. This matters because the stored list of related DCIDs can include rotation/grid scans or other datasets that should not be used. Given that all the relevant queries are already done in the multiplex trigger, it seemed easiest to pass these through rather than repeating the queries.

The filtering itself is set to image_group mode, which means all the images are grouped into batches and a deltacchalf algorithm is used to check whether any of these batches correlate poorly with the rest of the data. A group size of 50 is the default, as this corresponds to 5 deg of rotation (following standard 0.1 deg fine slicing). However, VMXm have had success with a group size of 10, so that is specified for their beamline.
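The relationship between group size and rotation wedge stated above works out as follows (illustrative arithmetic only):

```python
# With standard 0.1 deg fine slicing, a 50-image batch spans a 5 deg wedge
# of rotation data, which is the default deltacchalf image_group batch.
fine_slicing_deg = 0.1   # oscillation per image
target_wedge_deg = 5.0   # rotation wedge per deltacchalf batch
group_size = round(target_wedge_deg / fine_slicing_deg)  # default of 50

# VMXm's group_size of 10 corresponds to 1 deg wedges at the same slicing.
vmxm_wedge_deg = 10 * fine_slicing_deg
```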

The general intent is to test on VMXm first via staging, then roll it out live for VMXm only initially. This will provide useful stress testing prior to deployment on other beamlines. Eventually, it is expected that this will be triggered on all beamlines after multiplex.

NOTE: this will need dials/latest to run, as it includes xia2.multiplex_filtering bug fixes that are not in the latest release.

@amyjaynethompson (Contributor, Author)

Refactored the code so that a separate trigger function is no longer needed. The multiplex recipe has been updated so that a new output channel, "filtering", can trigger xia2.multiplex_filtering. This saves passing parameters between multiplex and multiplex_filtering.
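An illustrative sketch of the recipe shape this describes (node numbers, queue names, and parameters here are placeholders, not the actual dlstbx recipe):

```json
{
  "1": {
    "queue": "xia2.multiplex",
    "output": {"filtering": [2]}
  },
  "2": {
    "queue": "xia2.multiplex_filtering"
  }
}
```

The "filtering" output channel on the multiplex node is what lets the multiplex wrapper hand off directly to the filtering step without a separate trigger function.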

Comment thread src/dlstbx/services/trigger.py Outdated
Comment on lines +2366 to +2375
if parameters.cluster_num != "None":
    is_cluster = True
else:
    is_cluster = False

if is_cluster:
    self.log.info(
        f"Incoming multiplex is cluster {parameters.cluster_num}. Filtering not currently supported for clusters."
    )
    return {"success": True}
Contributor:

Suggested change:

-if parameters.cluster_num != "None":
-    is_cluster = True
-else:
-    is_cluster = False
-
-if is_cluster:
+if parameters.cluster_num != "None":
     self.log.info(
         f"Incoming multiplex is cluster {parameters.cluster_num}. Filtering not currently supported for clusters."
     )
     return {"success": True}

Comment thread src/dlstbx/services/trigger.py Outdated
        f"Incoming multiplex is cluster {parameters.cluster_num}. Filtering not currently supported for clusters."
    )
    return {"success": True}
else:
Contributor:

The else statement isn't needed here, because the function will already have returned for clusters.

Comment thread src/dlstbx/services/trigger.py Outdated
Comment on lines +2377 to +2384
multiplex_dir = parameters.multiplex_job
if not multiplex_dir.is_dir():
    self.log.error(
        f"Given multiplex directory {multiplex_dir} does not exist. Aborting job."
    )
    return {"success": True}
else:
    self.log.info(f"Previous multiplex job at {multiplex_dir}")
Contributor:

The multiplex directory not existing is not expected behaviour. I'd suggest not handling it gracefully: just let the pipeline crash or raise an error so that any issues are more obvious.
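The reviewer's suggestion of failing loudly could look like this sketch (not the PR's actual code; the function name is illustrative):

```python
from pathlib import Path


def check_multiplex_dir(multiplex_dir: Path) -> Path:
    """Raise rather than silently succeed if the upstream directory is gone."""
    if not multiplex_dir.is_dir():
        raise RuntimeError(f"Multiplex directory {multiplex_dir} does not exist")
    return multiplex_dir
```

An unhandled exception here would surface in the service logs, making a broken upstream multiplex job immediately visible instead of quietly returning success.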

Comment thread src/dlstbx/services/trigger.py Outdated
Comment on lines +2402 to +2425
query = (
    (
        session.query(AutoProcProgram, ProcessingJob.dataCollectionId).join(
            ProcessingJob,
            ProcessingJob.processingJobId == AutoProcProgram.processingJobId,
        )
    )
    .filter(ProcessingJob.dataCollectionId == parameters.dcid)
    .filter(ProcessingJob.automatic == True)  # noqa E712
    .filter(ProcessingJob.recordTimestamp > min_start_time)
    .filter(
        AutoProcProgram.processingJobId == parameters.multiplex_id
    )  # check only parent multiplex
    .filter(
        or_(
            AutoProcProgram.processingStatus == None,  # noqa E711
            AutoProcProgram.processingStartTime == None,  # noqa E711
        )
    )
)

# If there are any running (or yet to start) jobs, then checkpoint with delay
waiting_processing_jobs = query.all()
Contributor:

Waiting for the multiplex job to finish should not be necessary, as it only triggers downstream pipelines once it has finished (technically slightly before the wrapper has finished, but xia2.multiplex will have completed and the necessary files will have been created).

Comment thread src/dlstbx/services/trigger.py Outdated
Comment on lines +2467 to +2471
dc = (
    session.query(DataCollection)
    .filter(DataCollection.dataCollectionId == parameters.dcid)
    .one()
)
Contributor:

This query doesn't get used

Comment thread src/dlstbx/services/trigger.py Outdated
Comment on lines +2201 to +2202
for idx, dcid in enumerate(dcids):
    job_parameters.append((f"group_dcid_{idx}", str(dcid)))
Contributor:

It would be better to upload these all into a single list. This is possible by uploading multiple job processing parameters under the same key (see how this is handled in the dimple trigger for example). This will save you having to wrestle with this format of having a separate key per dcid elsewhere in the code.
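The difference between the two parameter shapes could be sketched like this (the key name "group_dcids" is hypothetical; the dimple trigger is the in-tree reference for the repeated-key convention):

```python
dcids = [101, 102, 103]

# Current approach: a separate parameter key per DCID
per_key = [(f"group_dcid_{i}", str(d)) for i, d in enumerate(dcids)]

# Suggested approach: repeat one key for every DCID ...
single_key = [("group_dcids", str(d)) for d in dcids]

# ... so consumers can recover the whole list with a single lookup,
# instead of pattern-matching key names elsewhere in the code.
grouped = [value for key, value in single_key if key == "group_dcids"]
```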

diffraction_plan_info: Optional[DiffractionPlanInfo] = None
recipe: Optional[str] = None
use_clustering: Optional[List[str]] = None
use_filtering: Optional[List[str]] = None
Contributor:

Suggested change:

-    use_filtering: Optional[List[str]] = None
+    use_filtering: List[str] = []

Use of Optional isn't necessary here, the default value is what truly makes a parameter optional in pydantic terms. The Optional type just means that a None value can be supplied, which I don't think would match the use case.
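A minimal stdlib sketch of the distinction, using dataclasses as a stand-in for the pydantic model (field and beamline names are illustrative; pydantic itself copies mutable defaults per instance, so `= []` is safe there):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WithOptional:
    # Optional means None is an allowed *value*; it is the default that
    # makes the field optional to supply.
    use_filtering: Optional[List[str]] = None


@dataclass
class WithListDefault:
    # Plain dataclasses need default_factory for a mutable default.
    use_filtering: List[str] = field(default_factory=list)


# With Optional, membership tests must guard against None:
p = WithOptional()
filtering_on = bool(p.use_filtering and "vmxm" in p.use_filtering)

# With a list default, the membership test is always valid on its own:
q = WithListDefault(use_filtering=["vmxm"])
filtering_on_q = "vmxm" in q.use_filtering
```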


self.log.info(f"xia2.multiplex trigger: Processing job {jobid} triggered")
self.log.info(
    f"xia2.multiplex_filtering trigger: Processing job {jobid} triggered"
Contributor:

Suggested change:

-    f"xia2.multiplex_filtering trigger: Processing job {jobid} triggered"
+    f"xia2.multiplex trigger: Processing job {jobid} triggered"

Comment on lines +2200 to +2203
if (
parameters.use_filtering
and parameters.beamline in parameters.use_filtering
):
Contributor:

Suggested change:

-if (
-    parameters.use_filtering
-    and parameters.beamline in parameters.use_filtering
-):
+if parameters.beamline in parameters.use_filtering:

If you remove the Optional typing as described above, this check simplifies.


# Placeholder code for future iterations which may run filtering jobs on clusters

if cluster_num is not None:
Contributor:

I don't think it makes sense to retain cluster logic here (and possibly elsewhere in this wrapper). The filtering job is triggered as a separate job by the multiplex wrapper. With the way the recipe is structured, if you were running filtering on clusters, a separate call of the filtering wrapper would be made for each cluster, so you wouldn't need logic that loops over and distinguishes between clusters and non-clusters.
