Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Contributing isoscope scheduler #1126

Open
wants to merge 52 commits into
base: master
Choose a base branch
from

Conversation

vitaly-krugl
Copy link

@vitaly-krugl vitaly-krugl commented Sep 6, 2024

READY FOR REVIEW (tests implemented and passing) ...

Implementation of the Distributed Scope Isolation scheduler isoscope for pytest-xdist.

Background

I developed this scheduler to facilitate a CI use case at my workplace. This scheduler has been in constant - and successful - use for quite some time now.

NOTE: In the following discussion, the meaning of "Scope" is the same as in the loadscope scheduler - it's the set of tests associated with a single test class or the set of module-level test functions associated with a single python test module.

Introduction

The proposed scheduler facilitates distributed execution of unit tests on a one-Scope-at-a-time basis with coordinated Scope-level resource setup and teardown.

We say that the test Scopes are isolated, since each test Scope - along with its Scope-level resource Setup/Teardown - is executed entirely before the subsequent test Scope. Hence we name this scheduler isoscope.

Alternatives: While it's possible to achieve this type of isolation by using the loadscope scheduler with a single xdist worker (--dist=loadscope -n1) which serializes executions of all tests, the isoscope scheduler avails more performant distributed execution of each Scope's tests.

Tests Scopes that meet the following constraints may benefit from the isoscope scheduler:

  1. All of the Scope's tests can share common class-level and/or module-level resource Setup/Teardown
  2. Within a given Scope, tests do not have undesirable side-effects upon each other
  3. Each test within a given Scope has non-trivial execution runtime (e.g., expensive network I/O).

Properties of this scheduler

  1. Executes one test Scope at a time, distributing its tests across the configured XDist Workers
  2. The isoscope scheduler provides a distributed coordination mechanism for Scope-level Resource Setup/Teardown with the following guarantees:
    • Resource Setup is performed exactly once per test Scope.
    • Resource Teardown is performed exactly once per test Scope.
    • Resource Setup of the executing test Scope completes BEFORE execution of the Scope's tests.
    • Resource Teardown phase of the executing test Scope begins after completion of all tests of the Scope.
    • Resource Setup of the next test Scope begins after completion of the previous test Scope's Resource Teardown.

Foundational Design Concepts of this Scheduler

Pytest/xdist Runtest Protocol: Each xdist worker always holds one test item in the queue and won't execute it until it receives either an additional test item from xdist scheduler or the "shutdown" signal in order to satisfy the nextitem requirements of pytest_runtest_protocol which requires a valid "nextitem" during the test session and treats nextitem=None as the end of the test session.

Our design takes advantage of the above-described pytest/xdist Runtest Protocol, including the requirement to always retain the last test item in each worker's input queue, to implement an orderly isolation of all tests within a given test Scope. We call it the "At least two active-Scope tests per worker" rule, described as follows:

When isoscope scheduler distributes tests for the Active Scope - the Scope it intends to execute in the current cycle - it distributes at least two tests from this Scope into each available worker queue. The only exception is when the Active Scope has only one test - in this case only this single test is distributed to an available worker. Here is the detailed workflow:

  1. Before the workers begin executing the Active Scope's tests in their input queues, each of those workers will execute the Active Scope's class or module Setup fixtures. NOTE: isoscope provides the iso_scheduling fixture of type IsoSchedulingFixture that facilitates coordination of Scope-level resource Setup/Teardown such that Setup and Teardown are performed exactly once per Scope versus per worker.
  2. Then each worker will proceed to execute all but the last test in its queue, pausing at the last test (including for the Scope with a single test case scenario). The pausing at the last remaining test in each worker's queue is due to the pytest/xdist Runtest Protocol described above.
  3. At this stage, all the workers that were given tests from the current Active Scope will have executed all of the Scope's Setup fixtures and all but the last test in each of their queues.
  4. To force execution of those last remaining tests from the current Active Scope, isoscope enqueues "Fence" test items from the subsequent Scope(s) - one Fence test per xdist Worker queue containing its last test from the current Active Scope. If there are not enough Fence items available in the remaining test Scopes, then isoscope enqueues the "Shutdown" messages into the remaining workers instead of Fence tests. This causes each such xdist Worker to process its last remaining test from the current Active Scope and execute all of the Scope's class or module Teardown fixtures.

NOTE: The distribution of the Fence tests also takes into account the DSI "At least two active-Scope tests per worker" Rule: DSI limits the number of Fence items it takes from a subsequent Test Scope, making sure that the Rule won't be compromised when the Fence Scope becomes the Active Scope.


Thanks for submitting a PR, your contribution is really appreciated!

Here's a quick checklist that should be present in PRs:

  • Make sure to include reasonable tests for your change if necessary

  • We use towncrier for changelog management, so please add a news file into the changelog folder following these guidelines:

    • Name it $issue_id.$type for example 588.bugfix;

    • If you don't have an issue_id change it to the PR id after creating it

    • Ensure type is one of removal, feature, bugfix, vendor, doc or trivial

    • Make sure to use full sentences with correct case and punctuation, for example:

      Fix issue with non-ascii contents in doctest text files.
      

@nicoddemus
Copy link
Member

nicoddemus commented Sep 18, 2024

Hi @vitaly-krugl thanks for the PR!

Although this PR has very high quality, to be honest I'm hesitant of accepting this PR, because it is somewhat complex and pytest-xdist does not have much maintenance bandwidth and it adds to that.

But this is my opinion only, @RonnyPfannschmidt @bluetech what do you think?

Also consider that this might be published as a separate plugin, it does not need to be included in pytest-xdist: the package would implement pytest_xdist_make_schedule to return the custom scheduler; if there is anything else missing in pytest-xdist to accommodate an external scheduler we would be glad to adapt pytest-xdist for that.

@RonnyPfannschmidt
Copy link
Member

first i want to recognize the effort and creativity that went into this pr,
the implementation sort out a very complex coordination problem in "just" a few thousand lines of code

those few thousand lines are conceptually very dense, and currently come without additional tests,
i have no idea if/when i can dedicate a few days to actually gnaw trough it and properly comprehend it

we have a very bad match up here for the maintenance situation - the new contributed code is without tests and of necessarily high complexity

most volunteer managed opensource projects are unable to take in such a contribution easily, in particular if it comes in without any deep prior engagement/discussion with the maintainer team

if the goal is to make this quickly available, then i strongly recommend publishing this as a "plugin" to xdist,
in which case we are happy to integrate ci in a way that ensures that future xdist changes do not break this plugin

if the goal is to land this in xdist, then its likely to take a long time, as the contribution size and complexity of the solved problem is immense

i agree that the code itself is high quality, but the high complexity of the solved problem, the sheer size of this contribution, the need for tests to be written and our own awareness of our very limited maintenance capabilities
turn this contribution into something that has quite the intimidating factor as both, due diligence in review was well as continued maintenance

its going to be a while before im in a place to give it the due diligence it needs, and i suspect this will be a problem for the initial contribution

i also suspect that multi host aware architecture of the coordinator will look different than initially imagined

my current recommendation would be to begin this as own pypi package, adding a own testsuite and to coordinate with xdist to ensure xdist changes dont break it

in addition if this is not a direct part of xdist, its more easy to declare multi host testing not being supported in the feature and engage as we eventually enable to pull this in

i absolutely want to enable coordinated fixture/resource creation/destruction
i also beleive that the full conclusion of supporting managing resources will need it to be able to be declarative
(and creating that api wont be a walk in the par)

@vitaly-krugl
Copy link
Author

vitaly-krugl commented Oct 21, 2024

Hi @nicoddemus and @RonnyPfannschmidt, thank you for following up with you recommendations. I immensely appreciate the effort and personal sacrifice of the volunteer maintainers, having previously contributed extensively to an open source project myself (Pika).

I developed this scheduler to facilitate a CI use case at my workplace. This scheduler has been in constant - and successful - use for quite some time now.

@RonnyPfannschmidt ...

Just to be clear - I am working on tests. I integrated the isoscope scheduler into existing acceptance tests. I will add isoscope-specific tests next. UPDATE: tests are done.

Regarding complexity: The code of this scheduler could be simplified quite a bit if there is a supported (and efficient) way to shut down and restart all worker nodes between scopes (without re-triggering costly test collection; e.g., we have over 30,000 test cases - and growing rapidly - so cannot afford additional costly collection cycles between scopes. Please advise whether it's possible to shut down and restart all worker nodes efficiently (without triggering test collection), and how (a link to an example would be appreciated).

Regarding code size: The code is very well commented, which contributes significantly to line count, but not to complexity of the code.

Also, I will include design documentation.

I would be happy to engage in any further discussion with the maintainer team. Since I already had the code that solved a useful problem, I expected that the creation of the pull request would naturally facilitate/trigger a discussion.

There is not a huge amount of urgency from my side, as the company is using its own copy of the scheduler. That said, I would like to get it integrated into xdist out of practical considerations, if ever possible.

As a fallback, I will consider open-sourcing it as an independent PyPi project, in which case your offer of CI integration to ensure compatibility with future xdist changes would be immensely helpful. Is there an example of such CI integration in existence today?

Finally, the feedback about multi-host support/testing is noted. Are you referring to the --tx ssh=myhostpopen feature or something else? I would like an opportunity to discuss this further to gain a better understanding.

Vitaly Kruglikov and others added 25 commits October 25, 2024 18:39
…ripts to work around "TypeError: 'type' object is not subscriptable" in py38
…List vs list in type hings, favoring list and tuple.
…TestIsoScope.test_single_scope_subset_of_workers_utilized.
Vitaly Kruglikov and others added 25 commits October 25, 2024 18:39
… TestIsoScope.test_multi_scope_with_insufficient_fence.
….test_multi_scope_with_insufficient_fence.
…st_distributed_setup_teardown_coordination.
… in TestIsoScope.test_distributed_setup_teardown_coordination.
… bytes in position 2-3: truncated \UXXXXXXXX escape in TestIsoScope.test_distributed_setup_teardown_coordination.
…ybe_call_setup to work correctly on Windows.
@vitaly-krugl vitaly-krugl changed the title WORK-IN-PROGRESS - contributing isoscope scheduler Contributing isoscope scheduler Oct 25, 2024
@vitaly-krugl vitaly-krugl changed the title Contributing isoscope scheduler Contributing isoscope scheduler Oct 25, 2024
@RonnyPfannschmidt
Copy link
Member

there is no way to "restart" a worker without collection - when a new worker starts, it has no collection and need to run it to get into "shape"

@RonnyPfannschmidt
Copy link
Member

and yes - multi-host is about --tx=ssh...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants