YJIT: Allow parallel scanning for JIT-compiled code #13758
Conversation
Some GC modules, notably MMTk, support parallel GC, i.e. multiple GC threads work in parallel during a GC. Currently, when two GC threads scan two iseq objects simultaneously while YJIT is enabled, both threads will attempt to borrow `CodeBlock::mem_block`, which results in a panic.

We make two changes to YJIT in order to support parallel GC.

1. `Block` now holds a list of addresses instead of offsets. This makes it unnecessary to borrow `CodeBlock::mem_block` and resolve absolute addresses from offsets.
2. We now set the YJIT code memory to writable in bulk before the reference-updating phase, and reset it to executable in bulk after the reference-updating phase. Previously, YJIT lazily set memory pages writable while updating object references embedded in JIT-compiled machine code, and set the memory back to executable by calling `mark_all_executable`. This approach is inherently unfriendly to parallel GC because (1) it borrows `CodeBlock::mem_block`, and (2) it sets the whole `CodeBlock` as executable, which races with other GC threads that are updating other iseq objects. It also has performance overhead due to the frequent invocation of system calls. Now that we set the permission of all the code memory in bulk before and after the reference-updating phase, multiple GC threads can perform raw memory writes in parallel. We should also see a performance improvement during moving GC because of the reduced number of `mprotect` system calls.
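As a rough illustration of point 2, here is a minimal sketch of the bulk permission flip. It is not the PR's actual API: `with_code_writable` and the assumption of a single contiguous, page-aligned code region are mine, and error handling is elided.

```rust
// Minimal sketch (hypothetical names): flip page permissions once for the
// whole JIT code region around the reference-updating phase, instead of
// per-page inside it. Assumes a contiguous, page-aligned region.
unsafe fn with_code_writable(base: *mut u8, size: usize, update_refs: impl FnOnce()) {
    // One mprotect before the phase: every GC thread may now patch
    // embedded references in parallel, with no further permission changes.
    libc::mprotect(base.cast(), size, libc::PROT_READ | libc::PROT_WRITE);

    update_refs();

    // One mprotect after the phase restores execute permission in bulk.
    // (Return values should be checked in real code.)
    libc::mprotect(base.cast(), size, libc::PROT_READ | libc::PROT_EXEC);
}
```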
Force-pushed from f2d2f8f to f54b3c5.
An alternative solution is to put all the object reference constants after the function, like this:

```asm
jit_compiled_func1:
    ...
    ldr x1, ref1 ; Load an object reference constant
    ldr x2, ref2 ; Load another object reference constant
    ...
    ldr x3, ref3 ; Load another object reference constant
    ...
    ret

    .align 3
ref1:
    .xword 0x123456789abcdef0 ; The first embedded reference
ref2:
    .xword 0x123456789abcdf40 ; The second embedded reference
ref3:
    .xword 0x123456789abcdf80 ; The third embedded reference
```

If we use this approach for both x86_64 and ARM64, then we can eliminate [...]
Here are the results of running yjit-bench. I ran on my laptop with Arch Linux. Both the master branch and this PR are built with [...]. Some benchmarks are faster, and some are slower.

No YJIT: [...]

With [...]
I constructed a microbenchmark that simply calls [...]
```diff
-    // Offsets for GC managed objects in the mainline code block
-    gc_obj_offsets: Box<[u32]>,
+    // Pointers to references to GC managed objects in the mainline code block
+    gc_obj_addresses: Box<[*const u8]>,
```
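To illustrate why the stored absolute addresses help, here is a hedged sketch of a reference-updating walk over such a list. `forwarded` is a hypothetical stand-in for the GC's forwarding lookup, and the 64-bit `VALUE` width is an assumption; this is not the PR's actual code.

```rust
/// Hypothetical stand-in for the GC's forwarding-table lookup.
fn forwarded(obj: u64) -> u64 {
    obj
}

/// Simplified sketch: with absolute addresses stored in the `Block`, a GC
/// thread can patch each embedded reference in place, with no need to borrow
/// `CodeBlock::mem_block` to translate an offset into a pointer.
unsafe fn update_block_refs(gc_obj_addresses: &[*const u8]) {
    for &addr in gc_obj_addresses {
        // Each address points at a VALUE embedded in machine code
        // (assumed 64-bit here); it may not be naturally aligned.
        let slot = addr as *mut u64;
        let old_ref = slot.read_unaligned();
        let new_ref = forwarded(old_ref);
        if new_ref != old_ref {
            slot.write_unaligned(new_ref);
        }
    }
}
```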
This doubles the memory usage, which is a no-go.
To avoid the `RefCell` borrowing, you only need the starting address of the JIT code region and the ability to mark everything as writable. You can put a copy of the starting address into a global and derive a full pointer off of that everywhere you need it.
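For what it's worth, a minimal sketch of that idea (the global and all names here are hypothetical, not existing YJIT identifiers):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical global: the start of the JIT code region, published once at
// startup so GC threads never have to borrow CodeBlock::mem_block for it.
static CODE_REGION_START: AtomicUsize = AtomicUsize::new(0);

// Derive a full pointer from a stored 4-byte offset wherever it is needed.
fn offset_to_ptr(offset: u32) -> *const u8 {
    let base = CODE_REGION_START.load(Ordering::Relaxed);
    (base + offset as usize) as *const u8
}
```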
> An alternative solution is to put all the object reference constants after the function, like this:
That's a no-go for LBBV, which requires as few gaps in the code as possible for good code layout.
> This doubles the memory usage, which is a no-go.
But doesn't it only double the memory used for pointers to objects in the generated machine code, which is probably significantly smaller than the code itself? In that case, this might not increase the memory by a significant amount?
Maybe @wks can measure what the overhead of this is for memory usage.
> measure what the overhead of this is for memory usage.
I measured the blocks during an invocation of `GC.compact` after CRuby starts. There are 1484 live `Block` instances visited in the compacting GC. In the following table, `num_offsets` is `block.gc_obj_offsets.len()`. `count` is the number of `Block` instances that have that many offsets. `mean` is the average `block.code_size()`, and the columns from `min` to `max` are its distribution.
| num_offsets | count | mean | min | 10% | 25% | median | 75% | 90% | 95% | 99% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 912 | 91.5219 | 0 | 4 | 5 | 13 | 33 | 57 | 86.45 | 111 | 11423 |
| 1 | 287 | 118.592 | 13 | 13 | 16.5 | 31 | 77 | 97.8 | 113.5 | 160.4 | 10532 |
| 2 | 131 | 395.519 | 26 | 79 | 160 | 182 | 196 | 200 | 205 | 10250.9 | 10924 |
| 3 | 146 | 392.973 | 104 | 164 | 169 | 174 | 194 | 210 | 221.25 | 10080 | 11197 |
| 4 | 7 | 205.429 | 121 | 175 | 211 | 211 | 225.5 | 228.8 | 230.9 | 232.58 | 233 |
| 6 | 1 | 283 | 283 | 283 | 283 | 283 | 283 | 283 | 283 | 283 | 283 |
| total | 1484 | 153.916 | 0 | 5 | 7 | 28 | 87 | 180 | 197 | 241.5 | 11423 |
From the table, most blocks have no embedded references at all (num_offsets = 0). Most others have <= 3 embedded references, and one block has 6 embedded references. On the other hand, the size of a code block can range from a few bytes to 11K bytes. For `Block` instances with 1-3 embedded references, the mean block size is about 100-400 bytes. The overhead of the offsets array (4 to 12 bytes) is small compared to that. Even if we replace it with a pointer array (8 to 24 bytes), the overhead is still relatively small.
I tried again with the erubi-rails benchmark. At the end of the benchmark, I ran a `GC.compact` and then iterated through all `iseq` instances in the heap, visiting all of their `Block` instances like in `rb_yjit_iseq_update_references`. Here are the statistics.
| num_offsets | count | mean | min | 10% | 25% | median | 75% | 90% | 95% | 99% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12032 | 88.9778 | 0 | 4 | 5 | 13 | 29 | 50 | 63 | 115 | 13535 |
| 1 | 3279 | 185.364 | 13 | 13 | 17 | 46 | 82 | 99 | 139 | 10056.1 | 12546 |
| 2 | 1674 | 358.159 | 26 | 80.9 | 170 | 185 | 197 | 205 | 210 | 10675.4 | 12647 |
| 3 | 1799 | 327.539 | 104 | 165 | 169 | 177 | 193 | 208 | 222 | 9867.02 | 12633 |
| 4 | 147 | 349.32 | 121 | 194 | 204 | 218 | 223.5 | 230 | 233 | 4876.12 | 11658 |
| 6 | 1 | 283 | 283 | 283 | 283 | 283 | 283 | 283 | 283 | 283 | 283 |
| total | 18932 | 154.174 | 0 | 5 | 7 | 27 | 79 | 184 | 197 | 253.69 | 13535 |
The distribution is roughly the same. The majority have 0 embedded references, most others have 1-3 refs, and no `Block` instance has more than 6 embedded references. The median `code_size` is roughly the same as for an empty script, but the max `code_size` gets a little bit bigger.

So I think we shouldn't need to worry about the space overhead of whether each GC reference is pointed to by a 4-byte offset or an 8-byte pointer.
If I'm doing the rough math correctly, there's about one object reference per 90 bytes of machine code. That was previously a 4-byte (~5%) overhead and is now an 8-byte (~10%) overhead. This change would increase YJIT memory usage by about 5% of the machine code size.
> there's about one object reference per 90 bytes of machine code

No. There is, on average, one embedded reference per 223.7 bytes of machine code on x86_64 for an empty script, and one per 231 bytes for the erubi-rails benchmark. So a 4-byte offset is 1.8% overhead, and an 8-byte pointer will be 3.6%. So this change increases the memory usage by less than 1.8% of the machine code size. Remember that each `Box<[u32]>` or `Box<[*const u8]>` also has a malloc overhead and a 64-bit length field (because `[T]` is a slice type).
Note: to compute the "bytes per embedded reference", we divide the total machine code bytes by the total number of references. The former can be obtained by computing the sum of products of the "mean" and "count" columns, and the latter is the sum of products of the "num_offsets" and "count" columns. Take the data for an empty script as an example. The total machine code bytes is `91.5219 * 912 + 118.592 * 287 + ... + 283 * 1`, or just consider the "total" row, which gives `153.916 * 1484`; both are `228412`. The total number of references is `0 * 912 + 1 * 287 + ... + 6 * 1`, which is `1021`. Then `228412 / 1021 == 223.714`.
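For context on that last point, the 64-bit length field lives in the boxed slice's fat pointer inside the owning struct, so switching the element type from `u32` to `*const u8` doubles only the element storage, not this fixed part. A quick sanity check (standard Rust, nothing YJIT-specific assumed):

```rust
use std::mem::size_of;

fn main() {
    // A boxed slice is a fat pointer: a data pointer plus a 64-bit length.
    // On 64-bit targets this fixed 16-byte part is the same for both types.
    assert_eq!(size_of::<Box<[u32]>>(), 16);
    assert_eq!(size_of::<Box<[*const u8]>>(), 16);
}
```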