Special JIT support for FFI #14491

Draft · wants to merge 68 commits into base: master

@dstogov
Member

dstogov commented Jun 6, 2024

No description provided.

@dstogov
Member Author

dstogov commented Jun 6, 2024

@iluuu1994 @nielsdos I would appreciate it if you could take a quick look at this when you have time. If this is interesting to you, please share your ideas.

This is still a very early PoC. It aims to generate optimized native code instead of generic calls to FFI callbacks. There are many unresolved questions:

  • JIT should generate FFI type guards (check that the FFI\CData type is the same as during trace recording and compilation)
  • most FFI types are not persistent. I didn't find an efficient way to implement FFI\CData type guards yet.
  • FFI array bounds checks are not implemented
  • It would be great to store pointers to FFI\CData in CPU registers and "unbox" the temporary FFI\CData objects
  • Guards, bounds checks and CData pointer loads should be moved out of loops
  • Access to fields of FFI structures and unions is not implemented
  • Access to FFI variables
  • Native code to call FFI functions
  • Native wrappers for FFI callbacks

Even in its current state, this makes access to FFI arrays more than 20 times faster. See the following example:

<?php
function ary3($n) {
  for ($i=0; $i<$n; $i++) {
    $X[$i] = $i + 1;
    $Y[$i] = 0;
  }
  for ($k=0; $k<1000; $k++) {
    for ($i=$n-1; $i>=0; $i--) {
      $Y[$i] += $X[$i];
    }
  }
  $last = $n-1;
  print "$Y[0] $Y[$last]\n";
}

function ary3_ffi($n) {
  $X = FFI::new("int[$n]");
  $Y = FFI::new("int[$n]");
  for ($i=0; $i<$n; $i++) {
    $X[$i] = $i + 1;
    $Y[$i] = 0;
  }
  for ($k=0; $k<1000; $k++) {
    for ($i=$n-1; $i>=0; $i--) {
      $Y[$i] += $X[$i];
    }
  }
  $last = $n-1;
  print "$Y[0] $Y[$last]\n";
}

/*****/

function gethrtime()
{
  $hrtime = hrtime();
  return (($hrtime[0]*1000000000 + $hrtime[1]) / 1000000000);
}

function start_test()
{
  ob_start();
  return gethrtime();
}

function end_test($start, $name)
{
  global $total;
  $end = gethrtime();
  ob_end_clean();
  $total += $end-$start;
  $num = number_format($end-$start,3);
  $pad = str_repeat(" ", 24-strlen($name)-strlen($num));

  echo $name.$pad.$num."\n";
  ob_start();
  return gethrtime();
}

function total()
{
  global $total;
  $pad = str_repeat("-", 24);
  echo $pad."\n";
  $num = number_format($total,3);
  $pad = str_repeat(" ", 24-strlen("Total")-strlen($num));
  echo "Total".$pad.$num."\n";
}

$t0 = $t = start_test();
ary3(200000);
$t = end_test($t, "ary3(200000)");
ary3_ffi(200000);
$t = end_test($t, "ary3_ffi(200000)");

@bwoebi
Member

bwoebi commented Jun 7, 2024

I absolutely love the idea of JITting specific functions (like FFI here). I hope it will also allow JITting some function calls away completely in the future.

I just think that the JIT should expose an API for JITting specific functions, rather than the other way round, where extensions expose their internals to the JIT and the handling has to be hardcoded in the JIT. That should scale better once more extensions find something JIT-worthy.
I.e. the code doing the JITting of the FFI functions and operator overloads should live in ext/ffi.
I'm okay with not doing that right away, but I feel like JIT should become separate from opcache and have a proper public API eventually...

@nielsdos
Member

nielsdos commented Jun 7, 2024

I like the idea. Extensions in PHP are often wrappers around C libraries, and by adding support for JIT specializations for FFI, it opens the door for creating extension-like functionality within PHP with reasonable overhead.

I think that LuaJIT does something similar with their FFI, but it's been a long time since I looked at that. Perhaps there are ideas there that we could use here too. I'm not sure.

I agree with Bob's comment, but it also seems like a lot more effort and difficulty (as he already pointed out).

> most FFI types are not persistent. I didn't find an efficient way to implement FFI\CData type guards yet.

If I understand correctly, the problem is the following: in normal cases you'd compare the FFI type pointer in the guard, but because the types are not persistent, pointers don't uniquely identify a type (e.g. a type allocated later may reuse the same memory address). Furthermore, it may not always be safe to dereference the type pointer, because the type could have been freed.
Maybe this could be solved by giving each FFI type a unique ID that is never reused, and then compare against that ID in the guard. The ID could be created by a simple counter. I'm not sure.
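
To illustrate the counter-based ID idea, here is a minimal C sketch. It is not code from this PR: `demo_ctype`, `demo_ctype_new` and `demo_type_guard` are hypothetical names, and the struct is a simplified stand-in for a real FFI type descriptor. The point is only that the guard compares a never-reused ID instead of an address that the allocator may hand out again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for an FFI type descriptor. */
typedef struct demo_ctype {
    uint64_t id;    /* unique for the lifetime of the process, never reused */
    size_t   size;  /* element size in bytes */
} demo_ctype;

/* Simple monotonically increasing counter; 0 is reserved for "no ID". */
static uint64_t demo_next_type_id = 1;

/* Allocate a type and stamp it with a fresh ID. */
demo_ctype *demo_ctype_new(size_t size)
{
    demo_ctype *t = malloc(sizeof(*t));
    assert(t != NULL);
    t->id = demo_next_type_id++;
    t->size = size;
    return t;
}

/* What a type guard would check: the ID recorded at trace time,
 * not the address of the type structure. */
bool demo_type_guard(const demo_ctype *t, uint64_t recorded_id)
{
    return t->id == recorded_id;
}
```

Even if a later allocation lands at the address of a freed type, its ID differs from the recorded one, so the guard fails as it should. The sketch ignores the multi-process and lifetime questions raised later in this thread.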

> Guards, bounds checks and CData pointer loads should be moved out of loops

For guards and bounds checks, I suppose this could be solved in a general way if IR itself had range analysis or value set analysis (e.g. as part of SCCP). That would not only benefit FFI but also PHP itself. I see an open PR for SCCP so maybe this "issue" goes away in the future anyway.

@dstogov
Member Author

dstogov commented Jun 10, 2024

> Maybe this could be solved by giving each FFI type a unique ID that is never reused, and then compare against that ID in the guard. The ID could be created by a simple counter. I'm not sure.

LuaJIT uses this approach, but we will have to serialize IDs across several workers and probably keep the types forever

@arnaud-lb
Member

I also like the idea. This would reduce the amount of C code in use-cases such as Niels mentioned, which is a good thing.

> > Maybe this could be solved by giving each FFI type a unique ID that is never reused, and then compare against that ID in the guard. The ID could be created by a simple counter. I'm not sure.
>
> LuaJIT uses this approach, but we will have to serialize IDs across several workers and probably keep the types forever

At a minimum this requires a mapping from type structures to IDs, so that IDs are stable across workers and subsequent requests?

The size of the associated storage may be manageable if IDs are only used by JIT and are only allocated when a type is JITed, because then the mapping has the same lifetime as the JIT buffer, and also grows at the same time as the JIT buffer.

> > Guards, bounds checks and CData pointer loads should be moved out of loops
>
> For guards and bounds checks, I suppose this could be solved in a general way if IR itself had range analysis or value set analysis (e.g. as part of SCCP). That would not only benefit FFI but also PHP itself. I see an open PR for SCCP so maybe this "issue" goes away in the future anyway.

Agreed. I was looking at range analysis earlier this year, and will continue working on this topic (range analysis) soon (unless someone else does it first - I don't want to block progress), so I will check if this can have an impact here.

@dstogov
Member Author

dstogov commented Jun 10, 2024

> At a minimum this requires a mapping from type structures to IDs, so that IDs are stable across workers and subsequent requests?

yes.

> The size of the associated storage may be manageable if IDs are only used by JIT and are only allocated when a type is JITed, because then the mapping has the same lifetime as the JIT buffer, and also grows at the same time as the JIT buffer.

I'm not sure if we can "persist" some CType during JIT-ing, because we will need to update all CData objects of this type.

> > > Guards, bounds checks and CData pointer loads should be moved out of loops
> >
> > For guards and bounds checks, I suppose this could be solved in a general way if IR itself had range analysis or value set analysis (e.g. as part of SCCP). That would not only benefit FFI but also PHP itself. I see an open PR for SCCP so maybe this "issue" goes away in the future anyway.
>
> Agreed. I was looking at range analysis earlier this year, and will continue working on this topic (range analysis) soon (unless someone else does it first - I don't want to block progress), so I will check if this can have an impact here.

LuaJIT achieves good code through loop peeling. It emits the loop body twice and removes all redundant code from the second copy using folding rules (common subexpression elimination, load forwarding, guard elimination, etc.).
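
As a rough illustration of loop peeling, the C sketch below writes out by hand what such a transformation could produce. All names (`demo_cdata`, `demo_sum_naive`, `demo_sum_peeled`) are hypothetical and the struct is a simplified stand-in, not the real CData layout. The naive version repeats the type guard and the pointer and length loads on every iteration. The peeled version does them once in a separate first iteration, so the remaining loop keeps only the per-index bounds check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an FFI array object. */
typedef struct demo_cdata {
    uint64_t type_id;  /* identifies the element type */
    size_t   length;   /* number of elements */
    int32_t *ptr;      /* start of the array */
} demo_cdata;

/* Naive form: guard, loads and bounds check on every iteration.
 * Returning false models a side exit from JITed code. */
bool demo_sum_naive(const demo_cdata *x, uint64_t expected_type,
                    size_t n, int64_t *out)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (x->type_id != expected_type) return false; /* type guard */
        if (i >= x->length) return false;              /* bounds check */
        sum += x->ptr[i];
    }
    *out = sum;
    return true;
}

/* Peeled form: the first iteration is split off and does all checks.
 * In the remaining loop the type guard is known to hold and the loads
 * of ptr and length are forwarded from the peeled iteration. */
bool demo_sum_peeled(const demo_cdata *x, uint64_t expected_type,
                     size_t n, int64_t *out)
{
    assert(x != NULL && out != NULL);
    if (n == 0) { *out = 0; return true; }

    if (x->type_id != expected_type) return false; /* guard, once */
    const int32_t *ptr = x->ptr;                   /* load, once */
    size_t length = x->length;                     /* load, once */
    if (length == 0) return false;                 /* bounds check, i = 0 */
    int64_t sum = ptr[0];

    for (size_t i = 1; i < n; i++) {
        if (i >= length) return false; /* index changes, so this stays */
        sum += ptr[i];
    }
    *out = sum;
    return true;
}
```

Removing the remaining bounds check as well would need the kind of range analysis discussed above, since the index differs on every iteration.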

@arnaud-lb
Member

> I'm not sure if we can "persist" some CType during JIT-ing, because we will need to update all CData objects of this type.

Indeed. I was thinking about something like this:

get_id(ctype):
    if ctype.id:
        return ctype.id
    if mapping[ctype]:
        return ctype.id := mapping[ctype]
    return ctype.id := mapping[ctype] := next_id()

This handles future instances, but it doesn't account for other existing instances in the same request, or for existing instances in other workers that will get to execute the JITed code.
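
The lookup above could be sketched in C roughly as follows. This is an illustration under stated assumptions, not code from the PR: the names are hypothetical, the mapping is a small fixed-size table searched linearly, and types are assumed to be comparable through a canonical string, which the thread does not specify.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_TYPES 64

/* Hypothetical type descriptor; `canonical` is an assumed structural key. */
typedef struct demo_ctype {
    const char *canonical;  /* e.g. "int32[42]" */
    uint32_t    id;         /* 0 means "no ID assigned yet" */
} demo_ctype;

/* Hypothetical mapping from structural keys to IDs (ID = index + 1).
 * Keys are stored by pointer and must outlive the mapping. */
typedef struct demo_mapping {
    const char *keys[DEMO_MAX_TYPES];
    uint32_t    count;
} demo_mapping;

uint32_t demo_get_id(demo_mapping *m, demo_ctype *t)
{
    assert(m != NULL && t != NULL && t->canonical != NULL);

    /* Fast path: the ID is already cached on this instance. */
    if (t->id != 0) {
        return t->id;
    }
    /* A structurally equal type was seen before: reuse its ID. */
    for (uint32_t i = 0; i < m->count; i++) {
        if (strcmp(m->keys[i], t->canonical) == 0) {
            return t->id = i + 1;
        }
    }
    /* First sighting: allocate the next ID and remember the key. */
    assert(m->count < DEMO_MAX_TYPES);
    m->keys[m->count] = t->canonical;
    m->count++;
    return t->id = m->count;
}
```

Two separately allocated but structurally equal types receive the same ID, which is the "future instances" case. An instance that never passes through `demo_get_id` keeps `id == 0`, which is the gap described above.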

Maybe we can have a special exit that fetches the id? This is starting to get complicated though.

Comment on lines +2680 to +2730
str = accel_find_interned_string(str);
if (str && (str->gc.u.type_info & IS_STR_FFI_TYPE)) {
Member

You may need to include the cdef in the lookup key in some way, as str may depend on it.

E.g.:

$cdef = FFI::cdef("typedef char test;");
$cdata = $cdef->new("test");

$cdef = FFI::cdef("typedef int test;");
$cdata = $cdef->new("test");

Another possible issue is that this could lead to a high number of cached types, due to types like $cdef->new('char[' . $len . ']').
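
One way to picture the suggested fix is a cache whose key combines the cdef scope with the type string. The C sketch below is only an illustration: `demo_type_cache`, `cdef_id` and the other names are hypothetical, and the PR's actual lookup goes through interned strings rather than a table like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_CACHE_CAPACITY 16

typedef struct demo_cache_entry {
    uint32_t    cdef_id;    /* hypothetical identity of the FFI::cdef() scope */
    const char *type_name;  /* type string as written by the user */
    size_t      size;       /* resolved size in bytes */
} demo_cache_entry;

typedef struct demo_type_cache {
    demo_cache_entry entries[DEMO_CACHE_CAPACITY];
    size_t           count;
} demo_type_cache;

/* Look up by (cdef_id, type_name); the name alone is ambiguous. */
const demo_cache_entry *demo_cache_find(const demo_type_cache *c,
                                        uint32_t cdef_id,
                                        const char *type_name)
{
    assert(c != NULL && type_name != NULL);
    for (size_t i = 0; i < c->count; i++) {
        const demo_cache_entry *e = &c->entries[i];
        if (e->cdef_id == cdef_id && strcmp(e->type_name, type_name) == 0) {
            return e;
        }
    }
    return NULL;
}

/* Add an entry; callers are expected to look up first. Returns false
 * when the cache is full, so the number of cached types stays bounded. */
bool demo_cache_add(demo_type_cache *c, uint32_t cdef_id,
                    const char *type_name, size_t size)
{
    assert(c != NULL && type_name != NULL);
    if (c->count == DEMO_CACHE_CAPACITY) {
        return false;
    }
    c->entries[c->count++] = (demo_cache_entry){ cdef_id, type_name, size };
    return true;
}
```

With this key, the two `test` typedefs from the example resolve to different entries, and the fixed capacity is one simple answer to the concern about many distinct `char[N]` types.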

Member Author

> You may need to include the cdef in the lookup key in some way, as str may depend on it.

Yes. One of the PHPT tests already catches this problem. Work is in progress...

@chopins
Contributor

chopins commented Jun 24, 2024

At present, the low performance of FFI\CData calculations and other operations is caused by conversion to PHP types and magic calls. If the value is simply assigned to CData, its performance is not inferior. So these problems can be avoided with good coding. The other thing is to avoid frequent type conversions by manipulating symbol overloads. I don't think it's a good idea to get a little bit of acceleration through JIT, and it would make the FFI API ugly

@dstogov
Member Author

dstogov commented Jun 24, 2024

> At present, the low performance of FFI\CData calculations and other operations is caused by conversion to PHP types and magic calls. If the value is simply assigned to CData, its performance is not inferior.

Right. This is what the JIT is going to do.

> I don't think it's a good idea to get a little bit of acceleration through JIT, and it would make the FFI API ugly

  • The current PoC shows a 20x speedup (see the example at the top).
  • This PR doesn't change the PHP ext/ffi API at all.

@chopins
Contributor

chopins commented Jun 24, 2024

  1. Isn't it better to use the do_operation class handler, similar to GMP?
  2. As discussed above, the FFI type needs to be clarified, so it is necessary to require access to the FFI type through an instance. I don't recommend accessing the FFI API through an instance.
  3. FFI is not enabled by default, so JIT may not be available.

@dstogov
Member Author

dstogov commented Jun 24, 2024

> 1. Isn't it better to use the do_operation class handler, similar to GMP?

I don't understand what you are proposing.
See the following PHP code:

$x = FFI::new("int[42]");
$y = $x[$i];

This PR translates the last line into 3 machine instructions:

movq 0x60(%r14), %rcx       ; load Z_OBJ_P() from $x zval
movq 0x40(%rcx), %rcx       ; load CData->ptr (start of the array)
movl (%rcx, %rax, 4), %ecx  ; load element of the array (%rax contains value of $i)

How can you make this better with do_operation?

@chopins
Contributor

chopins commented Jun 25, 2024

I want to optimize the zend_ffi_cdata_do_operation() function, but there is no good way to match the CData array.

I'm not against JIT improving performance, but I'm against changing the API to be less good because of the need to optimize performance. It is still necessary to make sure that PHP can write elegant and concise code.

@dstogov
Member Author

dstogov commented Jun 25, 2024

> I'm not against JIT improving performance, but I'm against changing the API to be less good because of the need to optimize performance. It is still necessary to make sure that PHP can write elegant and concise code.

What kind of API changes do you mean? This PR doesn't change anything visible to PHP programmers.

@chopins
Contributor

chopins commented Jun 25, 2024

Is the following PR related to FFI JIT?
4acf008#commitcomment-143451098

@dstogov
Member Author

dstogov commented Jun 25, 2024

> Is the following PR related to FFI JIT?
> 4acf008#commitcomment-143451098

Not at all. I don't like it, and I think your last RFC may be a good solution.

5 participants