RFC: renumber Arm targets + Apple feature detection #2076

Closed
jan-wassenberg opened this issue Apr 10, 2024 · 9 comments

Comments

@jan-wassenberg
Member

jan-wassenberg commented Apr 10, 2024

FYI we are working on supporting dynamic dispatch with Clang on Arm. As part of this, we may insert another NEON target using some of the optional features (fp16, bf16, dot, perhaps fp16fml - please let us know if you'd like to use/target others).

We'd want this target to be used if it's available, but it should not take precedence over any SVE targets. To enable that, we'd have to renumber the Arm targets. This could cause breakage for a project that uses the combination of:

  • GCC on aarch64
  • dynamic dispatch via foreach_target.h
  • precompiled shared libraries or objects, which are not compiled fresh during a build.

This combination seems sufficiently unlikely, but please let us know within, say, a week if you have any concerns.

For concreteness, the plan is to insert 2 targets below HWY_NEON, 3 below HWY_SVE2, and that leaves 4 below HWY_SVE2_128.
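
To illustrate why inserting a target requires renumbering, here is a minimal sketch. It assumes that targets are single bits in a 64-bit mask and that a lower bit means a more preferred target. The constant names and bit positions are hypothetical and are not Highway's actual values. Because such values are compile-time constants, objects built against an older numbering would presumably disagree with freshly built code, which is the breakage scenario described above.

```cpp
#include <cstdint>

// Hypothetical bit assignments for illustration only; NOT Highway's values.
// Assumption of this sketch: a lower bit means a more preferred target.
constexpr int64_t kSketchSve2 = int64_t{1} << 1;
constexpr int64_t kSketchSve = int64_t{1} << 2;
constexpr int64_t kSketchNeonExtended = int64_t{1} << 3;  // newly inserted
constexpr int64_t kSketchNeon = int64_t{1} << 4;

// Returns the most preferred target in `supported` (its lowest set bit),
// or 0 if no target bit is set.
constexpr int64_t BestTarget(int64_t supported) {
  return static_cast<int64_t>(static_cast<uint64_t>(supported) &
                              (~static_cast<uint64_t>(supported) + 1));
}
```

With this ordering, the new NEON target wins over plain NEON when both are supported, but any SVE target still takes precedence because its bit is lower.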

@johnplatts
Contributor

Here is a function that detects whether an optional CPU feature is present on macOS/iOS/iPadOS:

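// Returns true if the named sysctl entry exists and reports a nonzero value.
static HWY_INLINE bool HasCpuFeature(const char* feature_name) {
  int result = 0;
  size_t len = sizeof(result);
  // No new value is being set, so the last two arguments are nullptr and 0.
  return sysctlbyname(feature_name, &result, &len, nullptr, 0) == 0 &&
         result != 0;
}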

This requires including the <sys/sysctl.h> header, which declares the sysctlbyname function on macOS/iOS/iPadOS.

Apple lists the optional AArch64 SIMD ISA extensions that can be queried on macOS/iOS/iPadOS at https://developer.apple.com/documentation/kernel/1387446-sysctlbyname/determining_instruction_set_characteristics.

@jan-wassenberg
Member Author

Thanks @johnplatts - good point, seems like a good occasion to also add support for runtime dispatch on Apple.
I think the ones we'd look at are:

  • NEON: AdvSIMD_HPFPCvt, FEAT_AES+FEAT_PMULL;
  • NEON2 or NEON_8_6 (or any better ideas for the name?): FEAT_BF16, FEAT_DotProd, FEAT_FHM/FEAT_FP16.
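
The feature lists above could be checked roughly as follows. This is only a sketch under stated assumptions: the `hw.optional.*` key names are taken to follow Apple's documented naming, the `ArmFeatures` struct and the `Supports*` helpers are hypothetical, and requiring all four optional features for the new target is a guess, since the thread leaves open whether FEAT_FHM is needed. Outside Apple platforms every query returns false.

```cpp
#include <cstddef>
#if defined(__APPLE__)
#include <sys/sysctl.h>
#endif

// Hypothetical helper: true if the named sysctl exists and is nonzero.
// On non-Apple platforms this sketch always returns false.
static bool QueryFlag(const char* name) {
#if defined(__APPLE__)
  int value = 0;
  size_t len = sizeof(value);
  return sysctlbyname(name, &value, &len, nullptr, 0) == 0 && value != 0;
#else
  (void)name;
  return false;
#endif
}

// Hypothetical feature summary; field names are illustrative only.
struct ArmFeatures {
  bool half_cvt, aes, pmull, bf16, dotprod, fhm, fp16;
};

// Key names are assumed to match Apple's sysctl documentation.
static ArmFeatures DetectAppleArmFeatures() {
  ArmFeatures f;
  f.half_cvt = QueryFlag("hw.optional.AdvSIMD_HPFPCvt");
  f.aes = QueryFlag("hw.optional.arm.FEAT_AES");
  f.pmull = QueryFlag("hw.optional.arm.FEAT_PMULL");
  f.bf16 = QueryFlag("hw.optional.arm.FEAT_BF16");
  f.dotprod = QueryFlag("hw.optional.arm.FEAT_DotProd");
  f.fhm = QueryFlag("hw.optional.arm.FEAT_FHM");
  f.fp16 = QueryFlag("hw.optional.arm.FEAT_FP16");
  return f;
}

// Baseline NEON target: half-float conversion plus AES and PMULL.
static bool SupportsNeon(const ArmFeatures& f) {
  return f.half_cvt && f.aes && f.pmull;
}

// Proposed extra NEON target. Assumption: all four optional features are
// required; the thread does not settle this.
static bool SupportsNeonExtended(const ArmFeatures& f) {
  return SupportsNeon(f) && f.bf16 && f.dotprod && f.fhm && f.fp16;
}
```

Keeping the decision functions separate from the sysctl queries lets them be exercised with hand-built `ArmFeatures` values on any platform.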

jan-wassenberg changed the title from "RFC: renumber Arm targets?" to "RFC: renumber Arm targets + Apple feature detection" on Apr 11, 2024

@johnplatts
Contributor

AVX3/AVX3_DL target detection should also be updated for x86_64 on macOS because:
(a) XGETBV might fail to report support for ZMM vectors and AVX3 mask registers on macOS (even when both the CPU and OS support AVX512F) until AVX512 instructions are invoked, and
(b) there are bugs in AVX3/AVX3_DL context saving on macOS releases earlier than 12.2.

Here are some functions that check whether Highway is running on macOS 12.2 or later (the code below requires including <sys/utsname.h> on macOS):

static HWY_INLINE bool ParseU32(const char*& ptr, uint32_t& parsed_val) {
  uint64_t parsed_u64 = 0;

  const char* start_ptr = ptr;
  for (char ch; (ch = (*ptr)) != '\0'; ++ptr) {
    unsigned digit = static_cast<unsigned char>(ch) -
                     static_cast<unsigned>(static_cast<unsigned char>('0'));
    if (digit > 9) {
      break;
    }

    parsed_u64 = (parsed_u64 * 10u) + digit;
    if (parsed_u64 > 0xFFFFFFFFu) {
      return false;
    }
  }

  parsed_val = static_cast<uint32_t>(parsed_u64);
  return (ptr != start_ptr);
}

static HWY_INLINE bool IsMacOS_12_2_Or_Later() {
  struct utsname uname_buf;
  ZeroBytes(uname_buf);

  if ((uname(&uname_buf)) != 0) {
    return false;
  }

  const char* ptr = uname_buf.release;
  if (!ptr) {
    return false;
  }

  uint32_t major;
  uint32_t minor;
  if (!ParseU32(ptr, major)) {
    return false;
  }

  if (*ptr != '.') {
    return false;
  }

  ++ptr;
  if (!ParseU32(ptr, minor)) {
    return false;
  }

  // macOS 12.2 or later corresponds to Darwin kernel version 21.3 or later.
  return (major > 21 || (major == 21 && minor >= 3));
}
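
For unit testing, the version comparison can be separated from the uname call. The following is a sketch only: `DarwinReleaseAtLeast` is a hypothetical helper that takes the release string as a parameter, and it uses sscanf, which is less strict than ParseU32 above (for example, it tolerates leading whitespace).

```cpp
#include <cstdio>

// Hypothetical helper: returns true if `release` (for example "21.3.0", the
// form found in utsname::release) is at least want_major.want_minor.
// Returns false for a null or unparsable string.
static bool DarwinReleaseAtLeast(const char* release, unsigned want_major,
                                 unsigned want_minor) {
  unsigned rel_major = 0;
  unsigned rel_minor = 0;
  if (release == nullptr ||
      std::sscanf(release, "%u.%u", &rel_major, &rel_minor) != 2) {
    return false;
  }
  return rel_major > want_major ||
         (rel_major == want_major && rel_minor >= want_minor);
}
```

A caller on macOS would pass `uname_buf.release` together with 21 and 3 to express the macOS 12.2 threshold used above.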

Here is an updated snippet that correctly checks for AVX3 support on macOS:

  if (has_xsave && has_osxsave) {
#ifdef __APPLE__
    // On macOS, AVX3 is usable only if we are running on macOS 12.2 or later
    // and HasCpuFeature("hw.optional.avx512f") returns true.
    const bool have_avx3_xsave_support =
        IsMacOS_12_2_Or_Later() && HasCpuFeature("hw.optional.avx512f");
#endif

    const uint32_t xcr0 = ReadXCR0();
    constexpr int64_t min_avx3 = HWY_AVX3 | HWY_AVX3_DL | HWY_AVX3_SPR;
    // XMM/YMM
    if (!IsBitSet(xcr0, 1) || !IsBitSet(xcr0, 2)) {
      // Clear the AVX2/AVX3 bits if XMM/YMM XSAVE is not enabled
      bits &= ~min_avx2;
    }

#ifndef __APPLE__
    // On operating systems other than macOS, AVX3 is usable only if bits 5, 6,
    // and 7 of XCR0 are set.
    const bool have_avx3_xsave_support =
        IsBitSet(xcr0, 5) && IsBitSet(xcr0, 6) && IsBitSet(xcr0, 7);
#endif

    // opmask, ZMM lo/hi
    if (!have_avx3_xsave_support) {
      bits &= ~min_avx3;
    }
  } else {  // !has_xsave || !has_osxsave
    // Clear the AVX2/AVX3 bits if the CPU or OS does not support XSAVE
    bits &= ~min_avx2;
  }
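
The XCR0 bit tests in the non-macOS path can be isolated as pure functions, which makes them easy to check against known register values. This is a sketch: `TestBit`, `OsSavesAvx`, and `OsSavesAvx512` are hypothetical names and are not part of Highway.

```cpp
#include <cstdint>

// Hypothetical stand-in for the IsBitSet helper used in the snippet above.
constexpr bool TestBit(uint32_t value, unsigned bit) {
  return ((value >> bit) & 1u) != 0;
}

// XCR0 bit 1 covers XMM (SSE) state and bit 2 covers YMM (AVX) state.
constexpr bool OsSavesAvx(uint32_t xcr0) {
  return TestBit(xcr0, 1) && TestBit(xcr0, 2);
}

// XCR0 bits 5, 6, and 7 cover the opmask registers, the upper halves of
// ZMM0-15, and ZMM16-31 respectively. AVX-512 state also needs AVX state.
constexpr bool OsSavesAvx512(uint32_t xcr0) {
  return OsSavesAvx(xcr0) && TestBit(xcr0, 5) && TestBit(xcr0, 6) &&
         TestBit(xcr0, 7);
}
```

For example, an XCR0 value of 0x07 passes the AVX check but not the AVX-512 check, while 0xE7 passes both.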

The macOS AVX3 context-saving bug is discussed at https://community.intel.com/t5/Software-Tuning-Performance/MacOS-Darwin-kernel-bug-clobbers-AVX-512-opmask-register-state/m-p/1327259, golang/go#49233, and simdutf/simdutf#236.

@jan-wassenberg
Member Author

Nice find, thank you @johnplatts ! Would you like to send this code as a pull request, with a comment mentioning the intel.com forum discussion link?

@johnplatts
Contributor

> Nice find, thank you @johnplatts ! Would you like to send this code as a pull request, with a comment mentioning the intel.com forum discussion link?

I have made the changes to the x86 DetectTargets() that fix AVX3 detection on macOS in pull request #2083.

That pull request also adds HasCpuFeature to hwy/targets.cc, where it is available when Highway is compiled for macOS/iOS/iPadOS. The updated x86 DetectTargets() uses HasCpuFeature on macOS to check that the OS supports AVX3, and HasCpuFeature can also detect some of the AArch64 SIMD extensions on Apple Silicon CPUs.

@johnplatts
Contributor

Windows on AArch64 also has the IsProcessorFeaturePresent function, which can check for some of the AArch64 instruction set extensions (including the SDOT/UDOT instructions). It is described at https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-isprocessorfeaturepresent.
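
A check for the dot-product instructions might look like the sketch below. It assumes the Windows SDK in use defines PF_ARM_V82_DP_INSTRUCTIONS_AVAILABLE and falls back to false otherwise; the function name and the `kIsWindowsArm64` constant are hypothetical.

```cpp
#if defined(_WIN32) && (defined(_M_ARM64) || defined(__aarch64__))
#include <windows.h>
constexpr bool kIsWindowsArm64 = true;
#else
constexpr bool kIsWindowsArm64 = false;
#endif

// Hypothetical helper: true if Windows reports the Armv8.2 dot-product
// (SDOT/UDOT) instructions. Always false when not built for Windows on
// AArch64, or when the SDK headers lack the feature constant.
static bool HasDotProdOnWindowsArm64() {
#if defined(_WIN32) && (defined(_M_ARM64) || defined(__aarch64__)) && \
    defined(PF_ARM_V82_DP_INSTRUCTIONS_AVAILABLE)
  return IsProcessorFeaturePresent(PF_ARM_V82_DP_INSTRUCTIONS_AVAILABLE) != 0;
#else
  return false;
#endif
}
```

Guarding on the constant keeps the sketch compiling with older SDK headers, at the cost of silently reporting the feature as absent there.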

copybara-service bot pushed a commit that referenced this issue Apr 23, 2024
PiperOrigin-RevId: 627388792
copybara-service bot pushed a commit that referenced this issue Apr 24, 2024
PiperOrigin-RevId: 627388792
copybara-service bot pushed a commit that referenced this issue Apr 24, 2024
PiperOrigin-RevId: 627710484
@kleisauke
Contributor

> Windows on AArch64 also has the IsProcessorFeaturePresent function that can check for the presence of some of the AArch64 instruction set extensions

Unfortunately, that doesn't cover SVE. Any code with SVE intrinsics cannot be used on Windows targets, see:
llvm/llvm-project#64278 (comment)

@johnplatts
Contributor

> > Windows on AArch64 also has the IsProcessorFeaturePresent function that can check for the presence of some of the AArch64 instruction set extensions
>
> Unfortunately, that doesn't cover SVE. Any code with SVE intrinsics cannot be used on Windows targets, see: llvm/llvm-project#64278 (comment)

Microsoft is likely planning to add SVE support in a future Windows release: SVE detection for Windows on AArch64 was recently added to the .NET Runtime in pull request dotnet/runtime#100937.

A new constant for the AArch64 SVE feature, PF_ARM_SVE_INSTRUCTIONS_AVAILABLE, was recently added to https://github.com/dotnet/runtime/blob/main/src/native/minipal/cpufeatures.c. It has not yet made its way into the Windows headers or the IsProcessorFeaturePresent API documentation.

The Visual C++ 2022 compiler also does not currently support SVE, so compiling the SVE target for Windows on AArch64 requires Clang.

@jan-wassenberg
Member Author

The renumbering is done, and thanks @johnplatts for adding the Apple detection :)
