YJIT: Allow parallel scanning for JIT-compiled code by wks · Pull Request #13758 · ruby/ruby · GitHub

YJIT: Allow parallel scanning for JIT-compiled code #13758


Open
wks wants to merge 1 commit into master from feature/ruby-yjit-parallel-scan

Conversation

wks
Contributor

@wks wks commented Jul 1, 2025

Some GC modules, notably MMTk, support parallel GC, i.e. multiple GC threads working in parallel during a GC. Currently, when two GC threads scan two iseq objects simultaneously with YJIT enabled, both threads attempt to borrow CodeBlock::mem_block, which results in a panic.

We make two changes to YJIT in order to support parallel GC.

Block now holds a list of addresses instead of offsets. This makes it unnecessary to borrow CodeBlock::mem_block to resolve absolute addresses from offsets.
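As a sketch of this first change (illustrative names, not the actual YJIT code): the absolute addresses can be resolved once, when the block is recorded, so a GC thread scanning the block later never needs the code region's base pointer and therefore never borrows `CodeBlock::mem_block`.

```rust
/// Illustrative sketch, not the actual YJIT API: resolve page-relative
/// offsets into absolute addresses up front, at block-recording time,
/// instead of at every GC scan.
fn offsets_to_addresses(base: *const u8, offsets: &[u32]) -> Box<[*const u8]> {
    offsets
        .iter()
        .map(|&off| base.wrapping_add(off as usize))
        .collect()
}
```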

We now set the YJIT code memory to writable in bulk before the reference-updating phase, and reset it to executable in bulk after the reference-updating phase. Previously, YJIT lazily set memory pages writable while updating object references embedded in JIT-compiled machine code, and set the memory back to executable by calling mark_all_executable. This approach is inherently unfriendly to parallel GC because (1) it borrows CodeBlock::mem_block, and (2) it sets the whole CodeBlock as executable, which races with other GC threads that are updating other iseq objects. It also has performance overhead due to the frequent invocation of system calls. We now set the permission of all the code memory in bulk before and after the reference-updating phase. Multiple GC threads can then perform raw memory writes in parallel. We should also see a performance improvement during moving GC because of the reduced number of mprotect system calls.
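The bulk-permission pattern can be modeled like this (a simplified sketch, not the actual implementation; in the real change the flips happen in the hooks called around the reference-updating phase, as a single pair of mprotect calls on the code region):

```rust
use std::thread;

// Simplified model of the bulk-permission pattern: flip the whole code region
// writable once, let GC threads patch disjoint blocks in parallel, then flip
// it back to executable once. The two stubs stand in for the single pair of
// mprotect calls made in the real implementation.
fn set_region_writable() { /* mprotect(addr, len, PROT_READ | PROT_WRITE) */ }
fn set_region_executable() { /* mprotect(addr, len, PROT_READ | PROT_EXEC) */ }

fn update_references_in_parallel(code: &mut [u8], n_threads: usize) {
    set_region_writable();
    let chunk = (code.len() + n_threads - 1) / n_threads;
    thread::scope(|s| {
        for part in code.chunks_mut(chunk.max(1)) {
            // Each "GC thread" patches its own blocks; incrementing a byte
            // stands in for rewriting an embedded object reference.
            s.spawn(move || {
                for b in part {
                    *b = b.wrapping_add(1);
                }
            });
        }
    });
    set_region_executable();
}
```

Because every thread writes only to its own blocks while the whole region is writable, no per-page permission changes (and no shared borrow of the code block) are needed during the parallel phase.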

@matzbot matzbot requested a review from a team July 1, 2025 11:54
@wks wks force-pushed the feature/ruby-yjit-parallel-scan branch from f2d2f8f to f54b3c5 Compare July 1, 2025 12:01
@wks wks marked this pull request as draft July 1, 2025 12:12
@wks
Contributor Author

wks commented Jul 1, 2025

There is an alternative solution for Block::gc_obj_offsets. Currently, YJIT embeds object references inside the function. The x86_64 backend uses 64-bit immediates, and the ARM64 backend emits the code sequence LDR, B and .data, where the B instruction jumps over the data. In both cases, we need to record the offsets or the addresses of those embedded references.

An alternative solution is to put all the object reference constants after the function, like this:

jit_compiled_func1:
  ...
  ldr x1, ref1 ; Load an object reference constant
  ldr x2, ref2 ; Load another object reference constant
  ...
  ldr x3, ref3 ; Load another object reference constant
  ...
  ret
  .align 3  
ref1:
  .xword 0x123456789abcdef0  ; The first embedded reference
ref2:
  .xword 0x123456789abcdf40  ; The second embedded reference
ref3:
  .xword 0x123456789abcdf80  ; The third embedded reference

If we use this approach for both x86_64 and ARM64, then we can eliminate Block::gc_obj_offsets, and we only need to record the address of label ref1 and the number of object references used by that function. That will further simplify the object reference scanning operation because all references are stored contiguously. But that requires more changes to the YJIT codegen.
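With such a per-function constant pool, scanning would reduce to walking a contiguous array: only the pool's start address and the reference count would need recording. A hypothetical sketch:

```rust
/// Hypothetical sketch of scanning a contiguous constant pool: with all of a
/// function's embedded references stored back-to-back after its code, the GC
/// only needs the pool's start address and the reference count.
unsafe fn scan_ref_pool(start: *const usize, count: usize, visit: &mut dyn FnMut(usize)) {
    for i in 0..count {
        // Each slot holds one embedded object reference (an `.xword` above).
        visit(*start.add(i));
    }
}
```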

@wks
Contributor Author

wks commented Jul 2, 2025

Here are the results of running yjit-bench. I ran it on my laptop running Arch Linux. Both the master branch and this PR were built with configure --prefix=$PWD/install --disable-install-doc && make install -j. I ran the benchmarks with and without YJIT.

Some benchmarks are faster and some are slower. fannkuchredux has a 10% slowdown, which is reproducible; I don't know the cause.

No YJIT:

master: ruby 3.5.0dev (2025-07-01T11:28:47Z master 9d080765cc) +PRISM [x86_64-linux]
pr: ruby 3.5.0dev (2025-07-01T12:00:40Z feature/ruby-yjit-.. f54b3c5a79) +PRISM [x86_64-linux]

-----------------  -----------  ----------  -------  ----------  ----------  ---------
bench              master (ms)  stddev (%)  pr (ms)  stddev (%)  pr 1st itr  master/pr
activerecord       184.5        0.4         179.2    0.8         1.037       1.029    
chunky-png         397.4        0.1         398.1    0.1         1.006       0.998    
erubi-rails        740.7        0.2         738.6    0.5         0.913       1.003    
hexapdf            1348.0       1.5         1342.7   0.8         0.998       1.004    
liquid-c           41.2         7.8         37.9     5.2         1.055       1.087    
liquid-compile     33.4         5.5         31.4     3.9         0.982       1.062    
liquid-render      89.5         1.8         85.7     1.5         1.026       1.044    
lobsters           630.5        0.5         627.7    0.6         1.000       1.004    
mail               82.1         6.2         80.9     5.2         1.015       1.015    
psych-load         1260.4       0.2         1289.3   0.3         0.978       0.978    
railsbench         1609.1       0.6         1598.1   0.6         1.025       1.007    
rubocop            94.2         0.7         88.8     0.6         1.020       1.061    
ruby-lsp           96.5         0.2         94.4     0.3         1.052       1.023    
sequel             36.8         0.4         36.8     1.3         1.015       1.000    
binarytrees        176.2        0.3         171.7    2.4         0.996       1.027    
blurhash           165.9        0.2         160.2    0.2         1.034       1.036    
erubi              94.2         0.5         93.1     1.4         1.075       1.011    
etanni             134.5        0.4         135.0    0.6         1.025       0.996    
fannkuchredux      248.5        0.1         278.6    1.2         0.890       0.892    
fluentd            217.7        0.4         209.5    0.2         1.020       1.039    
graphql            166.7        0.1         166.5    0.1         1.009       1.001    
graphql-native     297.4        0.1         295.5    0.2         1.024       1.007    
lee                544.6        0.4         544.3    0.6         1.000       1.000    
matmul             271.5        0.2         269.1    0.4         1.008       1.009    
nbody              60.6         0.3         60.3     0.3         1.003       1.005    
nqueens            113.9        0.2         112.3    0.3         1.014       1.015    
optcarrot          2988.9       0.7         3002.4   0.7         0.997       0.996    
protoboeuf         93.0         0.6         91.5     0.3         1.036       1.016    
protoboeuf-encode  75.3         0.3         74.4     0.3         1.013       1.012    
rack               23.6         0.8         23.6     0.7         1.008       0.998    
ruby-json          166.4        0.1         165.5    0.8         1.013       1.006    
rubyboy            2737.4       0.2         2771.8   0.2         0.991       0.988    
rubykon            558.7        1.5         555.9    1.4         0.986       1.005    
sudoku             280.9        0.3         291.8    0.3         0.964       0.963    
tinygql            364.3        0.2         368.0    0.6         0.989       0.990    
30k_ifelse         394.0        0.2         384.3    0.7         1.059       1.025    
30k_methods        320.1        1.6         310.5    0.2         1.022       1.031    
cfunc_itself       44.4         1.9         43.1     2.4         1.061       1.031    
fib                100.8        0.4         100.6    0.4         1.002       1.001    
getivar            45.6         0.6         45.9     0.7         0.992       0.994    
keyword_args       124.0        0.3         123.7    1.0         1.004       1.002    
loops-times        505.9        0.1         496.0    0.3         1.019       1.020    
object-new         40.9         0.2         40.6     0.3         1.015       1.007    
respond_to         102.2        0.6         101.8    1.2         1.007       1.005    
ruby-xor           54.5         0.7         54.1     0.7         0.995       1.007    
setivar            42.8         1.3         42.6     0.6         1.004       1.004    
setivar_object     44.7         0.6         44.6     0.5         0.991       1.001    
setivar_young      44.7         0.6         44.8     0.6         0.999       0.998    
str_concat         31.6         2.6         31.4     2.2         1.002       1.004    
throw              11.3         2.2         11.2     0.7         1.005       1.008    
-----------------  -----------  ----------  -------  ----------  ----------  ---------

With --enable=yjit

master: ruby 3.5.0dev (2025-07-01T11:28:47Z master 9d080765cc) +YJIT +PRISM [x86_64-linux]
pr: ruby 3.5.0dev (2025-07-01T12:00:40Z feature/ruby-yjit-.. f54b3c5a79) +YJIT +PRISM [x86_64-linux]

-----------------  -----------  ----------  -------  ----------  ----------  ---------
bench              master (ms)  stddev (%)  pr (ms)  stddev (%)  pr 1st itr  master/pr
activerecord       185.4        0.4         178.7    0.1         1.041       1.037    
chunky-png         395.9        0.2         397.6    0.2         0.991       0.996    
erubi-rails        749.2        0.4         737.4    0.2         0.918       1.016    
hexapdf            1358.2       1.0         1352.8   1.0         1.021       1.004    
liquid-c           41.5         8.2         37.9     4.9         1.053       1.095    
liquid-compile     33.3         5.4         31.0     1.3         0.984       1.073    
liquid-render      90.1         1.9         85.9     1.2         1.022       1.049    
lobsters           645.5        0.7         613.5    0.6         1.022       1.052    
mail               83.8         5.8         81.3     5.2         1.025       1.030    
psych-load         1268.2       0.3         1285.5   0.3         0.987       0.986    
railsbench         1642.6       1.1         1573.5   0.5         1.056       1.044    
rubocop            95.9         1.2         89.5     1.2         1.024       1.072    
ruby-lsp           96.7         0.5         97.7     1.3         1.019       0.990    
sequel             37.2         2.9         37.7     0.6         1.005       0.989    
binarytrees        88.4         0.3         83.7     4.8         1.007       1.056    
blurhash           75.9         0.3         75.9     0.3         1.004       1.000    
erubi              95.8         0.6         92.2     1.0         1.078       1.038    
etanni             119.2        0.7         118.6    0.4         0.987       1.005    
fannkuchredux      111.5        69.8        131.4    73.1        0.908       0.849    
fluentd            218.9        0.6         209.9    0.3         1.031       1.043    
graphql            166.4        0.1         167.0    1.0         1.008       0.996    
graphql-native     298.3        0.2         292.9    0.1         1.032       1.018    
lee                544.5        0.4         545.3    0.5         1.004       0.998    
matmul             120.7        0.6         118.7    0.5         1.015       1.017    
nbody              22.1         0.5         21.9     2.1         0.999       1.006    
nqueens            24.1         0.3         24.0     0.4         1.023       1.004    
optcarrot          719.1        1.4         725.9    1.3         0.993       0.991    
protoboeuf         19.2         2.6         17.4     2.3         1.035       1.105    
protoboeuf-encode  17.2         3.4         17.1     3.1         1.007       1.010    
rack               23.8         0.9         23.4     0.6         0.997       1.017    
ruby-json          144.5        1.3         145.1    0.3         1.034       0.996    
rubyboy            2743.5       0.2         2805.4   0.2         0.980       0.978    
rubykon            261.2        2.1         258.6    1.6         0.980       1.010    
sudoku             77.3         0.1         77.6     0.1         1.007       0.996    
tinygql            365.5        0.1         370.6    0.2         0.983       0.986    
30k_ifelse         50.9         0.7         49.4     0.2         1.001       1.031    
30k_methods        25.6         0.5         25.4     0.4         1.027       1.011    
cfunc_itself       11.3         4.8         11.3     4.8         1.002       1.002    
fib                15.3         0.5         15.1     0.3         1.007       1.012    
getivar            5.9          64.0        5.9      64.0        1.006       1.002    
keyword_args       11.4         5.0         11.4     5.6         1.000       1.002    
loops-times        139.7        0.7         136.4    0.6         1.017       1.024    
object-new         22.3         15.9        22.1     16.0        1.007       1.008    
respond_to         3.8          10.8        3.8      10.9        0.966       1.003    
ruby-xor           10.5         0.8         10.5     0.8         0.977       1.000    
setivar            3.1          87.5        3.1      87.5        0.998       1.005    
setivar_object     20.9         20.6        21.0     20.1        1.008       0.996    
setivar_young      20.7         20.7        20.7     20.5        1.004       1.000    
str_concat         11.7         1.7         12.0     1.7         0.973       0.980    
throw              8.5          1.3         8.6      7.2         1.000       0.992    
-----------------  -----------  ----------  -------  ----------  ----------  ---------

@wks
Contributor Author

wks commented Jul 2, 2025

I constructed a microbenchmark that simply calls GC.compact 1000 times, and I recorded the time each invocation of gc_update_references takes. Specifically, for this PR, the time includes the invocations of both rb_gc_before_updating_jit_code() and rb_gc_after_updating_jit_code(). Even with an empty script, the CRuby runtime will have some YJIT-compiled methods as long as the --enable=yjit option is set. I ran the benchmark on both the master branch and this PR, with --enable=yjit. The average time to execute gc_update_references dropped from 473±40 µs to 442±25 µs, a reduction of 6.5%.

The perf tool also shows that the time spent in rb_gc_{before,after}_updating_jit_code is small compared to the accumulated execution time of mprotect in rb_yjit_iseq_update_references.

@wks wks marked this pull request as ready for review July 2, 2025 14:01
-    // Offsets for GC managed objects in the mainline code block
-    gc_obj_offsets: Box<[u32]>,
+    // Pointers to references to GC managed objects in the mainline code block
+    gc_obj_addresses: Box<[*const u8]>,
Member

This doubles the memory usage, which is a no-go.

To avoid the RefCell borrowing, you only need the starting address of the JIT code region and the ability to mark everything as writable. You can put a copy of the starting address into a global and derive a full pointer off of that everywhere you need it.

> An alternative solution is to put all the object reference constants after the function, like this:

That's a no-go for LBBV, which requires as few gaps in the code as possible for good code layout.
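The suggestion to publish the region's starting address in a global (keeping the compact u32 offsets while avoiding the RefCell borrow) might look roughly like this sketch; the names and the choice of atomic are assumptions, not the actual codebase:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Sketch of the alternative above: keep the u32 offsets in each Block, but
// mirror the code region's base address into a global that GC threads can
// read without borrowing CodeBlock::mem_block.
static CODE_REGION_START: AtomicUsize = AtomicUsize::new(0);

/// Called once when the code region is allocated (hypothetical hook).
fn publish_code_region_start(base: *const u8) {
    CODE_REGION_START.store(base as usize, Ordering::Release);
}

/// Derive an absolute pointer from a stored offset; usable from any GC thread.
fn address_of(offset: u32) -> *const u8 {
    (CODE_REGION_START.load(Ordering::Acquire) + offset as usize) as *const u8
}
```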

Member

> This doubles the memory usage, which is a no-go.

But doesn't it only double the memory used for pointers to objects in the generated machine code, which is probably significantly smaller than the code itself? In that case, this might not increase memory usage by a significant amount.

Maybe @wks can measure what the overhead of this is for memory usage.

Contributor Author

> measure what the overhead of this is for memory usage.

I measured the blocks during an invocation of GC.compact after CRuby starts. There were 1484 live Block instances visited in the compacting GC.

In the following table, num_offsets is block.gc_obj_offsets.len(). count is the number of Block instances that have that many offsets. mean is the average block.code_size(), and the columns from min to max show its distribution.

| num_offsets | count | mean | min | 10% | 25% | median | 75% | 90% | 95% | 99% | max |
|-------------|-------|---------|-----|-----|------|--------|-------|-------|-------|---------|-------|
| 0 | 912 | 91.5219 | 0 | 4 | 5 | 13 | 33 | 57 | 86.45 | 111 | 11423 |
| 1 | 287 | 118.592 | 13 | 13 | 16.5 | 31 | 77 | 97.8 | 113.5 | 160.4 | 10532 |
| 2 | 131 | 395.519 | 26 | 79 | 160 | 182 | 196 | 200 | 205 | 10250.9 | 10924 |
| 3 | 146 | 392.973 | 104 | 164 | 169 | 174 | 194 | 210 | 221.25 | 10080 | 11197 |
| 4 | 7 | 205.429 | 121 | 175 | 211 | 211 | 225.5 | 228.8 | 230.9 | 232.58 | 233 |
| 6 | 1 | 283 | 283 | 283 | 283 | 283 | 283 | 283 | 283 | 283 | 283 |
| total | 1484 | 153.916 | 0 | 5 | 7 | 28 | 87 | 180 | 197 | 241.5 | 11423 |

From the table, most blocks have no embedded references at all (num_offsets = 0). Most of the rest have at most 3 embedded references, and one block has 6. On the other hand, the size of a code block can range from a few bytes to about 11K bytes. For Block instances with 1-3 embedded references, the mean block size is about 100-400 bytes. The overhead of the offsets array (4 to 12 bytes) is small compared to that, and even if we replace it with a pointer array (8 to 24 bytes), the overhead is still relatively small.

Contributor Author

@wks wks Jul 3, 2025

I tried again with the erubi-rails benchmark. At the end of the benchmark, I ran a GC.compact and then iterated through all iseq instances in the heap, visiting all of their Block instances as in rb_yjit_iseq_update_references. Here are the statistics.

| num_offsets | count | mean | min | 10% | 25% | median | 75% | 90% | 95% | 99% | max |
|-------------|-------|---------|-----|------|-----|--------|-------|-----|-----|---------|-------|
| 0 | 12032 | 88.9778 | 0 | 4 | 5 | 13 | 29 | 50 | 63 | 115 | 13535 |
| 1 | 3279 | 185.364 | 13 | 13 | 17 | 46 | 82 | 99 | 139 | 10056.1 | 12546 |
| 2 | 1674 | 358.159 | 26 | 80.9 | 170 | 185 | 197 | 205 | 210 | 10675.4 | 12647 |
| 3 | 1799 | 327.539 | 104 | 165 | 169 | 177 | 193 | 208 | 222 | 9867.02 | 12633 |
| 4 | 147 | 349.32 | 121 | 194 | 204 | 218 | 223.5 | 230 | 233 | 4876.12 | 11658 |
| 6 | 1 | 283 | 283 | 283 | 283 | 283 | 283 | 283 | 283 | 283 | 283 |
| total | 18932 | 154.174 | 0 | 5 | 7 | 27 | 79 | 184 | 197 | 253.69 | 13535 |

The distribution is roughly the same. The majority have 0 embedded references, most of the rest have 1-3, and no Block instance has more than 6. The median code_size is roughly the same as for an empty script, but the max code_size gets a little bigger.

So I think we shouldn't need to worry about the space overhead of pointing to each GC reference with an 8-byte pointer instead of a 4-byte offset.

Member

If I'm doing the rough math correctly, there's about one object reference per 90 bytes of machine code. That was previously a 4-byte (5%) overhead and is now an 8-byte (10%) overhead, so this change would increase YJIT memory usage by about 5% of the machine code size.

Contributor Author

> there's about one object reference per 90 bytes of machine code

No. There is, on average, one embedded reference per 223.7 bytes of machine code on x86_64 for an empty script, and one per 231 bytes for the erubi-rails benchmark. So a 4-byte offset is a 1.8% overhead, and an 8-byte pointer is 3.6%. This change therefore increases memory usage by less than 1.8% of the machine code size. Remember also that each Box<[u32]> or Box<[*const u8]> carries a malloc overhead and a 64-bit length field (because [T] is a slice type).

Note: To compute the "bytes per embedded reference", we divide the total machine code bytes by the total number of references. The former is the sum-of-products of the "mean" and "count" columns, and the latter is the sum-of-products of the "num_offsets" and "count" columns. Take the data for an empty script as an example. The total machine code bytes is 91.5219 * 912 + 118.592 * 287 + ... + 283 * 1, or just use the "total" row, which gives 153.916 * 1484; both come to 228412. The total number of references is 0 * 912 + 1 * 287 + ... + 6 * 1, which is 1021. Then 228412 / 1021 == 223.714.
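That arithmetic can be checked mechanically from the empty-script table (the values below are copied from the table above):

```rust
// Recompute "bytes per embedded reference" from the empty-script table;
// each row is (count, mean code size, num_offsets).
fn bytes_per_reference() -> f64 {
    let rows: [(f64, f64, f64); 6] = [
        (912.0, 91.5219, 0.0),
        (287.0, 118.592, 1.0),
        (131.0, 395.519, 2.0),
        (146.0, 392.973, 3.0),
        (7.0, 205.429, 4.0),
        (1.0, 283.0, 6.0),
    ];
    // Total machine code bytes: sum of count * mean.
    let total_bytes: f64 = rows.iter().map(|&(count, mean, _)| count * mean).sum();
    // Total embedded references: sum of count * num_offsets.
    let total_refs: f64 = rows.iter().map(|&(count, _, n)| count * n).sum();
    total_bytes / total_refs
}
```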

