YJIT: Allow parallel scanning for JIT-compiled code #13758
Conversation
Some GC modules, notably MMTk, support parallel GC, i.e. multiple GC threads work in parallel during a GC. Currently, when two GC threads scan two iseq objects simultaneously while YJIT is enabled, both threads will attempt to borrow `CodeBlock::mem_block`, which results in a panic.

We make two changes to YJIT in order to support parallel GC.

1. `Block` now holds a list of addresses instead of offsets. This makes it unnecessary to borrow `CodeBlock::mem_block` and resolve absolute addresses from offsets.
2. We now set the YJIT code memory to writable in bulk before the reference-updating phase, and reset it to executable in bulk after the reference-updating phase. Previously, YJIT lazily set memory pages writable while updating object references embedded in JIT-compiled machine code, and set the memory back to executable by calling `mark_all_executable`. This approach is inherently unfriendly to parallel GC because (1) it borrows `CodeBlock::mem_block`, and (2) it sets the whole `CodeBlock` as executable, which races with other GC threads that are updating other iseq objects. It also has performance overhead due to the frequent invocation of system calls. Now that we set the permission of all the code memory in bulk before and after the reference-updating phase, multiple GC threads can perform raw memory writes in parallel. We should also see a performance improvement during moving GC because of the reduced number of `mprotect` system calls.
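As a rough illustration of point 2, here is a minimal sketch of the bulk permission flip. It is not the PR's actual API: `with_code_writable` and the assumption of a single contiguous, page-aligned code region are mine, and error handling is elided.

```rust
// Minimal sketch (hypothetical names): flip page permissions once for the
// whole JIT code region around the reference-updating phase, instead of
// per-page inside it. Assumes a contiguous, page-aligned region.
unsafe fn with_code_writable(base: *mut u8, size: usize, update_refs: impl FnOnce()) {
    // One mprotect before the phase: every GC thread may now patch
    // embedded references in parallel, with no further permission changes.
    libc::mprotect(base.cast(), size, libc::PROT_READ | libc::PROT_WRITE);

    update_refs();

    // One mprotect after the phase restores execute permission in bulk.
    // (Return values should be checked in real code.)
    libc::mprotect(base.cast(), size, libc::PROT_READ | libc::PROT_EXEC);
}
```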
Force-pushed from f2d2f8f to f54b3c5.
An alternative solution is to put all the object reference constants after the function, like this:

```asm
jit_compiled_func1:
    ...
    ldr x1, ref1 ; Load an object reference constant
    ldr x2, ref2 ; Load another object reference constant
    ...
    ldr x3, ref3 ; Load another object reference constant
    ...
    ret

    .align 3
ref1:
    .xword 0x123456789abcdef0 ; The first embedded reference
ref2:
    .xword 0x123456789abcdf40 ; The second embedded reference
ref3:
    .xword 0x123456789abcdf80 ; The third embedded reference
```

If we use this approach for both x86_64 and ARM64, then we can eliminate [...]
Here are the results of running yjit-bench. I ran on my laptop with Arch Linux. Both the master branch and this PR are built with [...]. Some benchmarks are faster, and some are slower.

No YJIT: [...]

With [...]
I constructed a microbenchmark that simply calls [...]
```diff
-    // Offsets for GC managed objects in the mainline code block
-    gc_obj_offsets: Box<[u32]>,
+    // Pointers to references to GC managed objects in the mainline code block
+    gc_obj_addresses: Box<[*const u8]>,
```
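To illustrate why the stored absolute addresses help, here is a hedged sketch of a reference-updating walk over such a list. `forwarded` is a hypothetical stand-in for the GC's forwarding lookup, and the 64-bit `VALUE` width is an assumption; this is not the PR's actual code.

```rust
/// Hypothetical stand-in for the GC's forwarding-table lookup.
fn forwarded(obj: u64) -> u64 {
    obj
}

/// Simplified sketch: with absolute addresses stored in the `Block`, a GC
/// thread can patch each embedded reference in place, with no need to borrow
/// `CodeBlock::mem_block` to translate an offset into a pointer.
unsafe fn update_block_refs(gc_obj_addresses: &[*const u8]) {
    for &addr in gc_obj_addresses {
        // Each address points at a VALUE embedded in machine code
        // (assumed 64-bit here); it may not be naturally aligned.
        let slot = addr as *mut u64;
        let old_ref = slot.read_unaligned();
        let new_ref = forwarded(old_ref);
        if new_ref != old_ref {
            slot.write_unaligned(new_ref);
        }
    }
}
```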
This doubles the memory usage, which is a no-go.
To avoid the `RefCell` borrowing, you only need the starting address of the JIT code region and the ability to mark everything as writable. You can put a copy of the starting address into a global and derive a full pointer off of that everywhere you need it.
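For what it's worth, a minimal sketch of that idea (the global and all names here are hypothetical, not existing YJIT identifiers):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical global: the start of the JIT code region, published once at
// startup so GC threads never have to borrow CodeBlock::mem_block for it.
static CODE_REGION_START: AtomicUsize = AtomicUsize::new(0);

// Derive a full pointer from a stored 4-byte offset wherever it is needed.
fn offset_to_ptr(offset: u32) -> *const u8 {
    let base = CODE_REGION_START.load(Ordering::Relaxed);
    (base + offset as usize) as *const u8
}
```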
> An alternative solution is to put all the object reference constants after the function, like this:
That's a no-go for LBBV, which requires as few gaps in the code as possible for good code layout.
> This doubles the memory usage, which is a no-go.
But doesn't it only double the memory used for pointers to objects in the generated machine code, which is probably significantly smaller than the code itself? In that case, this might not increase the memory by a significant amount?
Maybe @wks can measure what the overhead of this is for memory usage.
> measure what the overhead of this is for memory usage.
I measured the blocks during an invocation of `GC.compact` after CRuby starts. There are 1484 live `Block` instances visited in the compacting GC. In the following table, `num_offsets` is `block.gc_obj_offsets.len()`. `count` is the number of `Block` instances that have that many offsets. `mean` is the average `block.code_size()`, and the columns from `min` to `max` are its distribution.
| num_offsets | count | mean | min | 10% | 25% | median | 75% | 90% | 95% | 99% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 912 | 91.5219 | 0 | 4 | 5 | 13 | 33 | 57 | 86.45 | 111 | 11423 |
| 1 | 287 | 118.592 | 13 | 13 | 16.5 | 31 | 77 | 97.8 | 113.5 | 160.4 | 10532 |
| 2 | 131 | 395.519 | 26 | 79 | 160 | 182 | 196 | 200 | 205 | 10250.9 | 10924 |
| 3 | 146 | 392.973 | 104 | 164 | 169 | 174 | 194 | 210 | 221.25 | 10080 | 11197 |
| 4 | 7 | 205.429 | 121 | 175 | 211 | 211 | 225.5 | 228.8 | 230.9 | 232.58 | 233 |
| 6 | 1 | 283 | 283 | 283 | 283 | 283 | 283 | 283 | 283 | 283 | 283 |
| total | 1484 | 153.916 | 0 | 5 | 7 | 28 | 87 | 180 | 197 | 241.5 | 11423 |
From the table, most blocks have no embedded references at all (num_offsets = 0). Most others have <= 3 embedded references, and one block has 6 embedded references. On the other hand, the size of a code block can range from a few bytes to 11K bytes. For `Block` instances with 1-3 embedded references, the mean block size is about 100-400 bytes. The overhead of the offsets array (4 to 12 bytes) is small compared to that. Even if we replace it with a pointer array (8 to 24 bytes), the overhead is still relatively small.
I tried again with the erubi-rails benchmark. At the end of the benchmark, I ran a `GC.compact` and then iterated through all `iseq` instances in the heap, visiting all of their `Block` instances like in `rb_yjit_iseq_update_references`. Here are the statistics.
| num_offsets | count | mean | min | 10% | 25% | median | 75% | 90% | 95% | 99% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12032 | 88.9778 | 0 | 4 | 5 | 13 | 29 | 50 | 63 | 115 | 13535 |
| 1 | 3279 | 185.364 | 13 | 13 | 17 | 46 | 82 | 99 | 139 | 10056.1 | 12546 |
| 2 | 1674 | 358.159 | 26 | 80.9 | 170 | 185 | 197 | 205 | 210 | 10675.4 | 12647 |
| 3 | 1799 | 327.539 | 104 | 165 | 169 | 177 | 193 | 208 | 222 | 9867.02 | 12633 |
| 4 | 147 | 349.32 | 121 | 194 | 204 | 218 | 223.5 | 230 | 233 | 4876.12 | 11658 |
| 6 | 1 | 283 | 283 | 283 | 283 | 283 | 283 | 283 | 283 | 283 | 283 |
| total | 18932 | 154.174 | 0 | 5 | 7 | 27 | 79 | 184 | 197 | 253.69 | 13535 |
The distribution is roughly the same. The majority have 0 embedded references, most others have 1-3 refs, and no `Block` instance has more than 6 embedded references. The median `code_size` is roughly the same as for an empty script, but the max `code_size` gets a little bit bigger.

So I think we shouldn't need to worry about the space overhead of whether each GC reference is pointed to by a 4-byte offset or an 8-byte pointer.
If I'm doing the rough math correctly, there's about one object reference per 90 bytes of machine code. That was previously a 4-byte (~5%) overhead and is now an 8-byte (~10%) overhead. This change would increase YJIT memory usage by about 5% of the machine code size.
> there's about one object reference per 90 bytes of machine code

No. There is, on average, one embedded reference per 223.7 bytes of machine code on x86_64 for an empty script, and one per 231 bytes for the erubi-rails benchmark. So a 4-byte offset is 1.8% overhead, and an 8-byte pointer will be 3.6%. So this change increases the memory usage by less than 1.8% of the machine code size. Remember that each `Box<[u32]>` or `Box<[*const u8]>` also has a malloc overhead and a 64-bit length field (because `[T]` is a slice type).
Note: to compute the "bytes per embedded reference", we divide the total machine code bytes by the total number of references. The former can be obtained by computing the sum of products of the "mean" and "count" columns, and the latter is the sum of products of the "num_offsets" and "count" columns. Take the data for an empty script as an example. The total machine code bytes is `91.5219 * 912 + 118.592 * 287 + ... + 283 * 1`, or just consider the "total" row, which gives `153.916 * 1484`; both are `228412`. The total number of references is `0 * 912 + 1 * 287 + ... + 6 * 1`, which is `1021`. Then `228412 / 1021 == 223.714`.
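For context on that last point, the 64-bit length field lives in the boxed slice's fat pointer inside the owning struct, so switching the element type from `u32` to `*const u8` doubles only the element storage, not this fixed part. A quick sanity check (standard Rust, nothing YJIT-specific assumed):

```rust
use std::mem::size_of;

fn main() {
    // A boxed slice is a fat pointer: a data pointer plus a 64-bit length.
    // On 64-bit targets this fixed 16-byte part is the same for both types.
    assert_eq!(size_of::<Box<[u32]>>(), 16);
    assert_eq!(size_of::<Box<[*const u8]>>(), 16);
}
```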