gh-135552: Make the GC clear weakrefs later. by nascheme · Pull Request #136189 · python/cpython · GitHub | Latest TMZ Celebrity News & Gossip | Watch TMZ Live
Skip to content

gh-135552: Make the GC clear weakrefs later. #136189

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

nascheme
Copy link
Member

@nascheme nascheme commented Jul 1, 2025

Fix two bugs related to the interaction of weakrefs and the garbage
collector. The weakrefs in the tp_subclasses dictionary are needed in
order to correctly invalidate type caches (for example, by calling
PyType_Modified()). Clearing weakrefs before calling finalizers causes
the caches to not be correctly invalidated. That can cause crashes since the
caches can refer to invalid objects. This is fixed by deferring the
clearing of weakrefs to classes and without callbacks until after finalizers
are executed.

The second bug is caused by weakrefs created while running finalizers. Those
weakrefs can be outside the set of unreachable garbage and therefore survive
the delete_garbage() phase (where tp_clear() is called on objects).
Those weakrefs can expose to Python-level code objects that have had
tp_clear() called on them. See GH-91636 as an example of this kind of
bug. This is fixed be clearing all weakrefs to unreachable objects after
finalizers have been executed.

Clear the weakrefs to unreachable objects after finalizers are called.
@neonene
Copy link
Contributor

neonene commented Jul 1, 2025

I can confirm this PR fixes the gh-132413 issue as well.

@nascheme
Copy link
Member Author

nascheme commented Jul 1, 2025

I think this fixes (or mostly fixes) gh-91636 as well.

@nascheme nascheme requested a review from pablogsal July 1, 2025 23:33
@nascheme nascheme added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jul 1, 2025
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @nascheme for commit 12f0b5c 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F136189%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jul 1, 2025
@nascheme
Copy link
Member Author

nascheme commented Jul 2, 2025

This introduces refleaks, it seems. One of the leaking tests:
test.test_concurrent_futures.test_shutdown.ProcessPoolSpawnProcessPoolShutdownTest.test_shutdown_gh_132969_case_1
My unconfirmed suspicion is that a finalizer is now resurrecting an object via a weakref. Previously that weakref would be cleared before the finalizer is run. The multiprocessing finalizer logic seems very complicated. :-/

@nascheme
Copy link
Member Author

nascheme commented Jul 2, 2025

This is the smallest leaking example I could make so far. Something with ProcessPoolExecutor leaking maybe.

class LeakTest(unittest.TestCase):
    @classmethod
    def _fail_task(cls, n):
        raise ValueError("failing task")

    def test_leak_case(self):
        # this leaks references
        executor = futures.ProcessPoolExecutor(
                max_workers=1,
                max_tasks_per_child=1,
                )
        f2 = executor.submit(LeakTest._fail_task, 0)
        try:
            f2.result()
        except ValueError:
            pass

        # Ensure that the executor cleans up after called
        # shutdown with wait=False
        executor_manager_thread = executor._executor_manager_thread
        executor.shutdown(wait=False)
        time.sleep(0.2)
        executor_manager_thread.join()

    def test_leak_case2(self):
        # this does not leak
        with futures.ProcessPoolExecutor(
                max_workers=1,
                max_tasks_per_child=1,
                ) as executor:
            f2 = executor.submit(LeakTest._fail_task, 0)
            try:
                f2.result()
            except ValueError:
                pass

@neonene
Copy link
Contributor

neonene commented Jul 2, 2025

Other leaking examples (on Windows):

1. test_logging:
import logging
import logging.config
import logging.handlers
from multiprocessing import Queue, Manager

class ConfigDictTest(unittest.TestCase):
    def test_multiprocessing_queues_XXX(self):
        config = {
            'version': 1,
            'handlers' : {
                'spam' : {
                    'class': 'logging.handlers.QueueHandler',
                    'queue': Manager().Queue()  ,         # Leak
                    # 'queue': Manager().JoinableQueue()  # Leak
                    # 'queue': Queue(),                   # No leak

                },
            },
            'root': {'handlers': ['spam']}
        }
        logger = logging.getLogger()
        logging.config.dictConfig(config)
        while logger.handlers:
            h = logger.handlers[0]
            logger.removeHandler(h)
            h.close()
2. test_interpreters.test_api:
import contextlib
import threading
import types
from concurrent import interpreters

def func():
    raise Exception('spam!')

@contextlib.contextmanager
def captured_thread_exception():
    ctx = types.SimpleNamespace(caught=None)
    def excepthook(args):
        ctx.caught = args
    orig_excepthook = threading.excepthook
    threading.excepthook = excepthook
    try:
        yield ctx
    finally:
        threading.excepthook = orig_excepthook

class TestInterpreterCall(unittest.TestCase):
    def test_call_in_thread_XXX(self):
        interp = interpreters.create()
        call = (interp._call, interp.call)[1]   # 0: No leak, 1: Leak
        with captured_thread_exception() as _:
            t = threading.Thread(target=call, args=(interp, func, (), {}))
            t.start()
            t.join()

@nascheme
Copy link
Member Author

nascheme commented Jul 2, 2025

The majority (maybe all) of these leaks are caused by the WeakValueDictionary used as multiprocessing.util._afterfork_registry. That took some digging to find. I'm not yet sure of a good fix for this. Explicitly cleaning the dead weak references from the .data dict works but it not too elegant.

This avoids breaking tests while fixing the issue with tp_subclasses. In
the long term, it would be better to defer the clear of all weakrefs,
not just the ones referring to classes.  However, that is a more
distruptive change and would seem to have a higher chance of breaking
user code.  So, it would not be something to do in a bugfix release.
@nascheme
Copy link
Member Author

nascheme commented Jul 3, 2025

The majority (maybe all) of these leaks are caused by the WeakValueDictionary used as multiprocessing.util._afterfork_registry. That took some digging to find. I'm not yet sure of a good fix for this. Explicitly cleaning the dead weak references from the .data dict works but it not too elegant.

Nope, that doesn't fix all the leaks. And having to explicitly clean the weakrefs from the WeakValueDictionary really shouldn't be needed, I think. The KeyedRef class uses a callback and so they should be cleaned from the dict when the referred value dies. So, I'm not exactly sure what's going on there.

For the purposes of having a fix that we can backport (should probably be backported to all maintained Python versions), a less disruptive fix would be better. To that end, I've changed this PR to only defer clearing weakrefs to class objects. That fixes the tp_subclasses bug but should be less likely to break currently working code.

@nascheme nascheme added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jul 3, 2025
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @nascheme for commit 2f3daba 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F136189%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jul 3, 2025
We need to clear those before executing the callback.  Since this
ensures they can't run a second time, we don't need
_PyGC_SET_FINALIZED().  Revise comments.
@nascheme
Copy link
Member Author

nascheme commented Jul 3, 2025

The KeyedRef class uses a callback and so they should be cleaned from the dict when the referred value dies. So, I'm not exactly sure what's going on there.

Ah, the KeyedRef callback requires that the weakref is cleared when it is called, otherwise it was not deleting the item from the weak dictionary. So we need to clear the weakrefs with callbacks before executing them. That fixes the refleaks, I believe.

I revised this PR to be something that is potentially suitable for backporting. To minimize the behaviour change, I'm only deferring the clear of weakrefs that refer to types (in order to allow tp_subclasses to work) or with callbacks. I still have an extra clearing pass that gets done after the finalizers are called. That avoids bugs like #91636.

If this gets merged, I think we should consider deferring clearing all weakrefs (without callbacks) until after finalizers are executed. I think that's more correct since it gives the finalizers a better opportunity to do their job.

* *not* be cleared so that caches based on the type version are correctly
* invalidated (see GH-135552 as a bug caused by this).
*/
clear_weakrefs(&final_unreachable);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ough! Nice trick with final_unreachable!

@pablogsal pablogsal self-assigned this Jul 3, 2025
nascheme added 3 commits July 3, 2025 06:04
This is a bit trickly because the wrlist can be changing as we
iterate over it.
@nascheme nascheme marked this pull request as ready for review July 3, 2025 13:24
@sergey-miryanov
Copy link
Contributor

@nascheme Do you plan adding tests for gh-135552?

@efimov-mikhail
Copy link
Contributor

FWIW, current approach for clearing weakrefs immediately before delete_garbage looks fine for me.

@pablogsal
Copy link
Member

Thanks a lot for the PR @nascheme!

I have made a pass over the code and couldn't find anything incorrect (apart from the leaks and whatnot) but I would like to do another pass or at least discuss how are are now avoiding resurrection from the callbacks since I assume now is possible that we have callbacks triggering AFTER we handle resurrected objects so I would like to make sure that the bug we fixed in #16687 doesn't come back in some weird form

@nascheme
Copy link
Member Author

nascheme commented Jul 3, 2025

[...] I assume now is possible that we have callbacks triggering AFTER we handle resurrected objects so I would like to make sure that the bug we fixed in #16687 doesn't come back in some weird form

It was always the case that finalizers (both del methods and weakref callbacks) can trigger during the delete_garbage phase. This PR doesn't change that. I had forgotten that when I first made a PR that just stopped clearing the weakrefs. The reason is our tp_traverse methods don't give a complete view of the object graph. The test_trash_weakref_clear() is a unit test that creates this situation.

What needs to be true is that the unreachable set is not accessible by any finalizer that can run during delete_garbage. The "trick" of accounting for all the refcounts ensures that, even if we don't have a complete view of the object graph. We have to clear the weakrefs to the unreachable set since they are not accounted for by the refcount trick.

This kind of change can definitely have some extra scrutiny. The interaction of weakrefs and GC has caused so many bugs already. I don't feel all that confident that this doesn't introduce or re-introduce a bug. Adding some additional unit tests, as Sergey suggests would be a good idea.

@sergey-miryanov
Copy link
Contributor

Adding some additional unit tests, as Sergey suggests would be a good idea.

Do you need some hands on this?

@nascheme
Copy link
Member Author

nascheme commented Jul 3, 2025

Do you need some hands on this?

Sure! Adding tests based on the reproducer scripts in gh-135552 would be a good start.

@sergey-miryanov
Copy link
Contributor

Sure! Adding tests based on the reproducer scripts in gh-135552 would be a good start.

Should I open a new PR? I'm not sure if I can/might push to this branch.

@nascheme
Copy link
Member Author

nascheme commented Jul 3, 2025

Sure! Adding tests based on the reproducer scripts in gh-135552 would be a good start.

Should I open a new PR? I'm not sure if I can/might push to this branch.

Yes, I think a PR just for additional unit tests would be the easiest way to manage it.

@nascheme
Copy link
Member Author

nascheme commented Jul 4, 2025

I think this PR, as is, will no longer fix GH-132413. Only weakrefs with callbacks or weakrefs to classes are deferred from clearing. GH-132413 is caused by a weakref to a module object being cleared early. Depending on how serious we think that bug is, we could change this PR back to deferring the clear of all weakrefs without callbacks. Doing that doesn't break many unit tests (just a couple in test_finalization that's explicitly checking this behavior).

@efimov-mikhail
Copy link
Contributor

IMO, it's better to defer clearing of all weakrefs without callbacks. But it's up to release managers to decide which fixes we want to backport.

FYI, @Yhg1s, @hugovk

@neonene
Copy link
Contributor

neonene commented Jul 4, 2025

Only weakrefs with callbacks or weakrefs to classes are deferred from clearing

Weakref can be deferred from clearing if a dummy callback is given, right? If so, the following result on the PR is unexpected to me:

import itertools as mod
callback = lambda wr: None

def gen():
    import weakref
    wr = weakref.ref(mod, callback)
    print(1, wr())
    try:
        yield
    finally:
        print(2, wr())

it = gen()
next(it)
print('//')
1 <module 'itertools' (built-in)>
//
2 None

I guess gh-132413 could have another backporable workaround like this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants

TMZ Celebrity News – Breaking Stories, Videos & Gossip

Looking for the latest TMZ celebrity news? You've come to the right place. From shocking Hollywood scandals to exclusive videos, TMZ delivers it all in real time.

Whether it’s a red carpet slip-up, a viral paparazzi moment, or a legal drama involving your favorite stars, TMZ news is always first to break the story. Stay in the loop with daily updates, insider tips, and jaw-dropping photos.

🎥 Watch TMZ Live

TMZ Live brings you daily celebrity news and interviews straight from the TMZ newsroom. Don’t miss a beat—watch now and see what’s trending in Hollywood.