NVRTC Error Compilation


Error Handling#

NVRTC defines the following enumeration type and function for API call error handling.

class cuda.nvrtc.nvrtcResult(value)#

The enumerated type nvrtcResult defines API call result codes.
NVRTC API functions return nvrtcResult to indicate the call result.

NVRTC_SUCCESS = 0#
NVRTC_ERROR_OUT_OF_MEMORY = 1#
NVRTC_ERROR_PROGRAM_CREATION_FAILURE = 2#
NVRTC_ERROR_INVALID_INPUT = 3#
NVRTC_ERROR_INVALID_PROGRAM = 4#
NVRTC_ERROR_INVALID_OPTION = 5#
NVRTC_ERROR_COMPILATION = 6#
NVRTC_ERROR_BUILTIN_OPERATION_FAILURE = 7#
NVRTC_ERROR_NO_NAME_EXPRESSIONS_AFTER_COMPILATION = 8#
NVRTC_ERROR_NO_LOWERED_NAMES_BEFORE_COMPILATION = 9#
NVRTC_ERROR_NAME_EXPRESSION_NOT_VALID = 10#
NVRTC_ERROR_INTERNAL_ERROR = 11#
cuda.nvrtc.nvrtcGetErrorString(result: nvrtcResult)#

nvrtcGetErrorString is a helper function that returns a string describing the given nvrtcResult code, e.g., NVRTC_SUCCESS to “NVRTC_SUCCESS”. For unrecognized enumeration values, it returns “NVRTC_ERROR unknown”.

Parameters:

result (nvrtcResult) – CUDA Runtime Compilation API result code.

Returns:
  • nvrtcResult – NVRTC_SUCCESS

  • bytes – Message string for the given nvrtcResult code.
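
As a sketch of typical usage (assuming the cuda-python convention that every binding returns its nvrtcResult as the first element of a tuple), a small error-checking helper can be built on top of nvrtcGetErrorString:

```python
from cuda import nvrtc

def check(result):
    # Raise on any result code other than NVRTC_SUCCESS, using
    # nvrtcGetErrorString to produce a readable message.
    if result != nvrtc.nvrtcResult.NVRTC_SUCCESS:
        _, msg = nvrtc.nvrtcGetErrorString(result)
        raise RuntimeError(f"NVRTC error: {msg.decode()}")
```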

General Information Query#

NVRTC defines the following function for general information query.

cuda.nvrtc.nvrtcVersion()#

nvrtcVersion sets the output parameters major and minor with the CUDA Runtime Compilation version number.

Returns:
  • nvrtcResult

    • NVRTC_SUCCESS

    • NVRTC_ERROR_INVALID_INPUT

  • major (int) – CUDA Runtime Compilation major version number.

  • minor (int) – CUDA Runtime Compilation minor version number.
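
For example, a minimal version query under the binding's result-first tuple convention:

```python
from cuda import nvrtc

err, major, minor = nvrtc.nvrtcVersion()
if err == nvrtc.nvrtcResult.NVRTC_SUCCESS:
    print(f"NVRTC version: {major}.{minor}")
```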

cuda.nvrtc.nvrtcGetNumSupportedArchs()#

nvrtcGetNumSupportedArchs sets the output parameter numArchs with the number of architectures supported by NVRTC. This can then be used to pass an array to nvrtcGetSupportedArchs to get the supported architectures.

See also

nvrtcGetSupportedArchs

Returns:
  • nvrtcResult

    • NVRTC_SUCCESS

    • NVRTC_ERROR_INVALID_INPUT

  • numArchs (int) – number of supported architectures.

cuda.nvrtc.nvrtcGetSupportedArchs()#

nvrtcGetSupportedArchs populates the array passed via the output parameter supportedArchs with the architectures supported by NVRTC. The array is sorted in ascending order. The size of the array to be passed can be determined using nvrtcGetNumSupportedArchs.

See also

nvrtcGetNumSupportedArchs

Returns:
  • nvrtcResult

    • NVRTC_SUCCESS

    • NVRTC_ERROR_INVALID_INPUT

  • supportedArchs (List[int]) – sorted array of supported architectures.
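
In the Python binding the output array is allocated by the wrapper, so the pair of calls reduces to the following sketch:

```python
from cuda import nvrtc

err, num_archs = nvrtc.nvrtcGetNumSupportedArchs()
err, archs = nvrtc.nvrtcGetSupportedArchs()
# archs is ascending, e.g. [50, 52, 53, 60, ...] depending on the NVRTC version
print(f"{num_archs} supported architectures: {archs}")
```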

Compilation#

NVRTC defines the following type and functions for actual compilation.

class cuda.nvrtc.nvrtcProgram(*args, **kwargs)#

nvrtcProgram is the unit of compilation, and an opaque handle for a program.

To compile a CUDA program string, an instance of nvrtcProgram must be created first with nvrtcCreateProgram, then compiled with nvrtcCompileProgram.

getPtr()#

Get memory address of class instance

cuda.nvrtc.nvrtcCreateProgram(char *src, char *name, int numHeaders, list headers, list includeNames)#

nvrtcCreateProgram creates an instance of nvrtcProgram with the given input parameters, and sets the output parameter prog with it.

Parameters:
  • src (bytes) – CUDA program source.

  • name (bytes) – CUDA program name. name can be NULL; “default_program” is
    used when name is NULL or “”.

  • numHeaders (int) – Number of headers used. numHeaders must be greater than or equal
    to 0.

  • headers (List[bytes]) – Sources of the headers. headers can be NULL when numHeaders
    is 0.

  • includeNames (List[bytes]) – Name of each header by which they can be included in the CUDA
    program source. includeNames can be NULL when numHeaders is
    0. These headers must be included with the exact names specified
    here.

Returns:
  • nvrtcResult

    • NVRTC_SUCCESS

    • NVRTC_ERROR_OUT_OF_MEMORY

    • NVRTC_ERROR_PROGRAM_CREATION_FAILURE

    • NVRTC_ERROR_INVALID_INPUT

    • NVRTC_ERROR_INVALID_PROGRAM

  • prog (nvrtcProgram) – CUDA Runtime Compilation program.
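
A minimal creation sketch (the SAXPY kernel source and file name are illustrative, and error checks are omitted):

```python
from cuda import nvrtc

src = b"""
extern "C" __global__
void saxpy(float a, float *x, float *y, float *out, size_t n) {
    size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = a * x[tid] + y[tid];
}
"""
# No headers are used, so numHeaders is 0 and both lists are empty.
err, prog = nvrtc.nvrtcCreateProgram(src, b"saxpy.cu", 0, [], [])
```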

cuda.nvrtc.nvrtcDestroyProgram(prog)#

nvrtcDestroyProgram destroys the given program.

Parameters:

prog (nvrtcProgram) – CUDA Runtime Compilation program.

Returns:
  • NVRTC_SUCCESS

  • NVRTC_ERROR_INVALID_PROGRAM

Return type:

nvrtcResult

See also

nvrtcCreateProgram

cuda.nvrtc.nvrtcCompileProgram(prog, int numOptions, list options)#

nvrtcCompileProgram compiles the given program.

It supports compile options listed in Supported Compile Options.

Parameters:
  • prog (nvrtcProgram) – CUDA Runtime Compilation program.

  • numOptions (int) – Number of compiler options passed.

  • options (List[bytes]) – Compiler options in the form of C string array. options can be
    NULL when numOptions is 0.

Returns:
  • NVRTC_SUCCESS

  • NVRTC_ERROR_OUT_OF_MEMORY

  • NVRTC_ERROR_INVALID_INPUT

  • NVRTC_ERROR_INVALID_PROGRAM

  • NVRTC_ERROR_INVALID_OPTION

  • NVRTC_ERROR_COMPILATION

  • NVRTC_ERROR_BUILTIN_OPERATION_FAILURE

Return type:

nvrtcResult
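
Continuing the sketch above, a compilation with two options might look as follows (the architecture value is an assumption; choose one reported by nvrtcGetSupportedArchs):

```python
opts = [b"--gpu-architecture=compute_75", b"--fmad=false"]
err, = nvrtc.nvrtcCompileProgram(prog, len(opts), opts)
```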

cuda.nvrtc.nvrtcGetPTXSize(prog)#

nvrtcGetPTXSize sets the value of ptxSizeRet with the size of the PTX generated by the previous compilation of prog (including the trailing NULL).

Parameters:

prog (nvrtcProgram) – CUDA Runtime Compilation program.

Returns:
  • nvrtcResult

    • NVRTC_SUCCESS

    • NVRTC_ERROR_INVALID_INPUT

    • NVRTC_ERROR_INVALID_PROGRAM

  • ptxSizeRet (int) – Size of the generated PTX (including the trailing NULL).

cuda.nvrtc.nvrtcGetPTX(prog, char *ptx)#

nvrtcGetPTX stores the PTX generated by the previous compilation of prog in the memory pointed by ptx.

Parameters:
  • prog (nvrtcProgram) – CUDA Runtime Compilation program.

  • ptx (bytes) – Compiled result.

Returns:
  • NVRTC_SUCCESS

  • NVRTC_ERROR_INVALID_INPUT

  • NVRTC_ERROR_INVALID_PROGRAM

Return type:

nvrtcResult
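
In Python, the size query plus caller-allocated buffer pattern looks like this for the prog created earlier (the b" " padding idiom follows NVIDIA's cuda-python samples):

```python
err, ptx_size = nvrtc.nvrtcGetPTXSize(prog)
ptx = b" " * ptx_size   # buffer sized from nvrtcGetPTXSize, filled in place
err, = nvrtc.nvrtcGetPTX(prog, ptx)
```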

cuda.nvrtc.nvrtcGetCUBINSize(prog)#

nvrtcGetCUBINSize sets the value of cubinSizeRet with the size of the cubin generated by the previous compilation of prog. The value of cubinSizeRet is set to 0 if the value specified to -arch is a virtual architecture instead of an actual architecture.

Parameters:

prog (nvrtcProgram) – CUDA Runtime Compilation program.

Returns:
  • nvrtcResult

    • NVRTC_SUCCESS

    • NVRTC_ERROR_INVALID_INPUT

    • NVRTC_ERROR_INVALID_PROGRAM

  • cubinSizeRet (int) – Size of the generated cubin.

cuda.nvrtc.nvrtcGetCUBIN(prog, char *cubin)#

nvrtcGetCUBIN stores the cubin generated by the previous compilation of prog in the memory pointed by cubin. No cubin is available if the value specified to -arch is a virtual architecture instead of an actual architecture.

Parameters:
  • prog (nvrtcProgram) – CUDA Runtime Compilation program.

  • cubin (bytes) – Compiled and assembled result.

Returns:
  • NVRTC_SUCCESS

  • NVRTC_ERROR_INVALID_INPUT

  • NVRTC_ERROR_INVALID_PROGRAM

Return type:

nvrtcResult

See also

nvrtcGetCUBINSize

cuda.nvrtc.nvrtcGetNVVMSize(prog)#

DEPRECATION NOTICE: This function will be removed in a future release. Please use nvrtcGetLTOIRSize (and nvrtcGetLTOIR) instead.

Parameters:

prog (nvrtcProgram) – None

Returns:
  • nvrtcResult

  • nvvmSizeRet (int) – None

cuda.nvrtc.nvrtcGetNVVM(prog, char *nvvm)#

DEPRECATION NOTICE: This function will be removed in a future release. Please use nvrtcGetLTOIR (and nvrtcGetLTOIRSize) instead.

Parameters:
  • prog (nvrtcProgram) – None

  • nvvm (bytes) – None

Return type:

nvrtcResult

cuda.nvrtc.nvrtcGetLTOIRSize(prog)#

nvrtcGetLTOIRSize sets the value of LTOIRSizeRet with the size of the LTO IR generated by the previous compilation of prog. The value of LTOIRSizeRet is set to 0 if the program was not compiled with -dlto.

Parameters:

prog (nvrtcProgram) – CUDA Runtime Compilation program.

Returns:
  • nvrtcResult

    • NVRTC_SUCCESS

    • NVRTC_ERROR_INVALID_INPUT

    • NVRTC_ERROR_INVALID_PROGRAM

  • LTOIRSizeRet (int) – Size of the generated LTO IR.

cuda.nvrtc.nvrtcGetLTOIR(prog, char *LTOIR)#

nvrtcGetLTOIR stores the LTO IR generated by the previous compilation of prog in the memory pointed by LTOIR. No LTO IR is available if the program was compiled without -dlto.

Parameters:
  • prog (nvrtcProgram) – CUDA Runtime Compilation program.

  • LTOIR (bytes) – Compiled result.

Returns:
  • NVRTC_SUCCESS

  • NVRTC_ERROR_INVALID_INPUT

  • NVRTC_ERROR_INVALID_PROGRAM

Return type:

nvrtcResult

See also

nvrtcGetLTOIRSize

cuda.nvrtc.nvrtcGetOptiXIRSize(prog)#

nvrtcGetOptiXIRSize sets the value of optixirSizeRet with the size of the OptiX IR generated by the previous compilation of prog. The value of optixirSizeRet is set to 0 if the program was compiled with options incompatible with OptiX IR generation.

Parameters:

prog (nvrtcProgram) – CUDA Runtime Compilation program.

Returns:
  • nvrtcResult

    • NVRTC_SUCCESS

    • NVRTC_ERROR_INVALID_INPUT

    • NVRTC_ERROR_INVALID_PROGRAM

  • optixirSizeRet (int) – Size of the generated OptiX IR.

cuda.nvrtc.nvrtcGetOptiXIR(prog, char *optixir)#

nvrtcGetOptiXIR stores the OptiX IR generated by the previous compilation of prog in the memory pointed by optixir. No OptiX IR is available if the program was compiled with options incompatible with OptiX IR generation.

Parameters:
  • prog (nvrtcProgram) – CUDA Runtime Compilation program.

  • optixir (bytes) – Compiled OptiX IR result.

Returns:
  • NVRTC_SUCCESS

  • NVRTC_ERROR_INVALID_INPUT

  • NVRTC_ERROR_INVALID_PROGRAM

Return type:

nvrtcResult

cuda.nvrtc.nvrtcGetProgramLogSize(prog)#

nvrtcGetProgramLogSize sets logSizeRet with the size of the log generated by the previous compilation of prog (including the trailing NULL).

Note that a compilation log may be generated containing warnings and
informative messages, even when the compilation of prog succeeds.

Parameters:

prog (nvrtcProgram) – CUDA Runtime Compilation program.

Returns:
  • nvrtcResult

    • NVRTC_SUCCESS

    • NVRTC_ERROR_INVALID_INPUT

    • NVRTC_ERROR_INVALID_PROGRAM

  • logSizeRet (int) – Size of the compilation log (including the trailing NULL).

See also

nvrtcGetProgramLog

cuda.nvrtc.nvrtcGetProgramLog(prog, char *log)#

nvrtcGetProgramLog stores the log generated by the previous compilation of prog in the memory pointed by log.

Parameters:
  • prog (nvrtcProgram) – CUDA Runtime Compilation program.

  • log (bytes) – Compilation log.

Returns:
  • NVRTC_SUCCESS

  • NVRTC_ERROR_INVALID_INPUT

  • NVRTC_ERROR_INVALID_PROGRAM

Return type:

nvrtcResult
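
The log is retrieved with the same size-then-buffer pattern as the PTX (continuing the earlier sketch, with prog as created above); it is worth printing even on success, since warnings land here:

```python
err, log_size = nvrtc.nvrtcGetProgramLogSize(prog)
log = b" " * log_size
err, = nvrtc.nvrtcGetProgramLog(prog, log)
print(log.decode())
```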

cuda.nvrtc.nvrtcAddNameExpression(prog, char *name_expression)#

nvrtcAddNameExpression notes the given name expression denoting the address of a __global__ function or __device__/__constant__ variable.

The identical name expression string must be provided on a subsequent
call to nvrtcGetLoweredName to extract the lowered name.

Parameters:
  • prog (nvrtcProgram) – CUDA Runtime Compilation program.

  • name_expression (bytes) – constant expression denoting the address of a __global__ function or __device__/__constant__ variable.

Returns:
  • NVRTC_SUCCESS

  • NVRTC_ERROR_NO_NAME_EXPRESSIONS_AFTER_COMPILATION

Return type:

nvrtcResult

cuda.nvrtc.nvrtcGetLoweredName(prog, char *name_expression)#

nvrtcGetLoweredName extracts the lowered (mangled) name for a __global__ function or __device__/__constant__ variable, and updates lowered_name to point to it. The memory containing the name is released when the NVRTC program is destroyed by nvrtcDestroyProgram. The identical name expression must have been previously provided to nvrtcAddNameExpression.

Parameters:
  • prog (nvrtcProgram) – CUDA Runtime Compilation program.

  • name_expression (bytes) – constant expression denoting the address of a __global__ function or __device__/__constant__ variable.

Returns:
  • nvrtcResult –

    • NVRTC_SUCCESS

    • NVRTC_ERROR_NO_LOWERED_NAMES_BEFORE_COMPILATION

    • NVRTC_ERROR_NAME_EXPRESSION_NOT_VALID

  • lowered_name (bytes) – initialized by the function to point to a C string containing the
    lowered (mangled) name corresponding to the provided name
    expression.
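
The required order is add, then compile, then query. A sketch with a hypothetical kernel name (prog as created earlier):

```python
expr = b"&mykernel<float>"                       # hypothetical name expression
err, = nvrtc.nvrtcAddNameExpression(prog, expr)  # must precede compilation
err, = nvrtc.nvrtcCompileProgram(prog, 0, [])
err, lowered = nvrtc.nvrtcGetLoweredName(prog, expr)
# 'lowered' is the mangled name to use with the CUDA Driver API; the backing
# memory is freed by nvrtcDestroyProgram(prog).
```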

Supported Compile Options#

NVRTC supports the compile options below. Option names with two preceding dashes (--) are long option names and option names with one preceding dash (-) are short option names. Short option names can be used instead of long option names. When a compile option takes an argument, an assignment operator (=) is used to separate the compile option argument from the compile option name, e.g., "--gpu-architecture=compute_60". Alternatively, the compile option name and the argument can be specified in separate strings without an assignment operator, e.g., "--gpu-architecture" "compute_60". Single-character short option names, such as -D, -U, and -I, do not require an assignment operator, and the compile option name and the argument can be present in the same string with or without spaces between them. For instance, "-D=<def>", "-D<def>", and "-D <def>" are all supported.
The valid compiler options are:

  • Compilation targets

    • --gpu-architecture=<arch> (-arch)

      Specify the name of the class of GPU architectures for which the input must be compiled.

      • Valid <arch>s:

        • compute_50

        • compute_52

        • compute_53

        • compute_60

        • compute_61

        • compute_62

        • compute_70

        • compute_72

        • compute_75

        • compute_80

        • compute_87

        • compute_89

        • compute_90

        • sm_50

        • sm_52

        • sm_53

        • sm_60

        • sm_61

        • sm_62

        • sm_70

        • sm_72

        • sm_75

        • sm_80

        • sm_87

        • sm_89

        • sm_90

      • Default: compute_52

  • Separate compilation / whole-program compilation

    • --device-c (-dc)

      Generate relocatable code that can be linked with other relocatable device code. It is equivalent to --relocatable-device-code=true.

    • --device-w (-dw)

      Generate non-relocatable code. It is equivalent to --relocatable-device-code=false.

    • --relocatable-device-code={true|false} (-rdc)

      Enable (disable) the generation of relocatable device code.

      • Default: false

    • --extensible-whole-program (-ewp)

      Do extensible whole program compilation of device code.

      • Default: false

  • Debugging support

    • --device-debug (-G)

      Generate debug information. If --dopt is not specified, this turns off all optimizations.

    • --generate-line-info (-lineinfo)

      Generate line-number information.

  • Code generation

    • --dopt on (-dopt)

    • --dopt=on

      Enable device code optimization. When specified along with ‘-G’, enables limited debug information generation for optimized device code (currently, only line number information). When ‘-G’ is not specified, ‘-dopt=on’ is implicit.

    • --ptxas-options <options> (-Xptxas)

    • --ptxas-options=<options>

      Specify options directly to ptxas, the PTX optimizing assembler.

    • --maxrregcount=<N> (-maxrregcount)

      Specify the maximum amount of registers that GPU functions can use. Until a function-specific limit, a higher value will generally increase the performance of individual GPU threads that execute this function. However, because thread registers are allocated from a global register pool on each GPU, a higher value of this option will also reduce the maximum thread block size, thereby reducing the amount of thread parallelism. Hence, a good maxrregcount value is the result of a trade-off. If this option is not specified, then no maximum is assumed. Values less than the minimum registers required by the ABI will be bumped up by the compiler to the ABI minimum limit.

    • --ftz={true|false} (-ftz)

      When performing single-precision floating-point operations, flush denormal values to zero or preserve denormal values. --use_fast_math implies --ftz=true.

      • Default: false

    • --prec-sqrt={true|false} (-prec-sqrt)

      For single-precision floating-point square root, use IEEE round-to-nearest mode or use a faster approximation. --use_fast_math implies --prec-sqrt=false.

      • Default: true

    • --prec-div={true|false} (-prec-div)

      For single-precision floating-point division and reciprocals, use IEEE round-to-nearest mode or use a faster approximation. --use_fast_math implies --prec-div=false.

      • Default: true

    • --fmad={true|false} (-fmad)

      Enables (disables) the contraction of floating-point multiplies and adds/subtracts into floating-point multiply-add operations (FMAD, FFMA, or DFMA). --use_fast_math implies --fmad=true.

      • Default: true

    • --use_fast_math (-use_fast_math)

      Make use of fast math operations. --use_fast_math implies --ftz=true --prec-div=false --prec-sqrt=false --fmad=true.

    • --extra-device-vectorization (-extra-device-vectorization)

      Enables more aggressive device code vectorization in the NVVM optimizer.

    • --modify-stack-limit={true|false} (-modify-stack-limit)

      On Linux, during compilation, use setrlimit() to increase stack size to maximum allowed. The limit is reset to the previous value at the end of compilation. Note: setrlimit() changes the value for the entire process.

      • Default: true

    • --dlink-time-opt (-dlto)

      Generate intermediate code for later link-time optimization. It implies -rdc=true. Note: when this option is used the nvrtcGetLTOIR API should be used, as PTX or Cubin will not be generated.

    • --gen-opt-lto (-gen-opt-lto)

      Run the optimizer passes before generating the LTO IR.

    • --optix-ir (-optix-ir)

      Generate OptiX IR. The OptiX IR is only intended for consumption by OptiX through appropriate APIs. This feature is not supported with link-time optimization (-dlto). Note: when this option is used the nvrtcGetOptiXIR API should be used, as PTX or Cubin will not be generated.

  • Preprocessing

    • --define-macro=<def> (-D)

      <def> can be either <name> or <name>=<definition>.

      • <name>

        Predefine <name> as a macro with definition 1.

      • <name>=<definition>

        The contents of <definition> are tokenized and preprocessed as if they appeared during translation phase three in a #define directive. In particular, the definition will be truncated by embedded new line characters.

    • --undefine-macro=<def> (-U)

      Cancel any previous definition of <def>.

    • --include-path=<dir> (-I)

      Add the directory <dir> to the list of directories to be searched for headers. These paths are searched after the list of headers given to nvrtcCreateProgram.

    • --pre-include=<header> (-include)

      Preinclude <header> during preprocessing.

    • --no-source-include (-no-source-include)

      By default, the preprocessor adds the directory of each input source to the include path. This option disables that behavior and considers only the paths specified explicitly.

  • Language Dialect

    • --std={c++03|c++11|c++14|c++17|c++20} (-std={c++03|c++11|c++14|c++17|c++20})

      Set language dialect to C++03, C++11, C++14, C++17 or C++20.

    • --builtin-move-forward={true|false} (-builtin-move-forward)

      Provide builtin definitions of std::move and std::forward, when C++11 language dialect is selected.

      • Default: true

    • --builtin-initializer-list={true|false} (-builtin-initializer-list)

      Provide builtin definitions of std::initializer_list class and member functions when C++11 language dialect is selected.

      • Default: true

  • Misc.

    • --disable-warnings (-w)

      Inhibit all warning messages.

    • --restrict (-restrict)

      Programmer assertion that all kernel pointer parameters are restrict pointers.

    • --device-as-default-execution-space (-default-device)

      Treat entities with no execution space annotation as __device__ entities.

    • --device-int128 (-device-int128)

      Allow the __int128 type in device code. Also causes the macro __CUDACC_RTC_INT128__ to be defined.

    • --optimization-info=<kind> (-opt-info)

      Provide optimization reports for the specified kind of optimization. The following kind tags are supported:

      • inline : emit a remark when a function is inlined.

    • --version-ident={true|false} (-dQ)

      Embed the compiler's version info into the generated PTX/CUBIN.

      • Default: false

    • --display-error-number (-err-no)

      Display diagnostic number for warning messages. (Default)

    • --no-display-error-number (-no-err-no)

      Disables the display of a diagnostic number for warning messages.

    • --diag-error=<error-number>,… (-diag-error)

      Emit error for specified diagnostic message number(s). Message numbers can be separated by comma.

    • --diag-suppress=<error-number>,… (-diag-suppress)

      Suppress specified diagnostic message number(s). Message numbers can be separated by comma.

    • --diag-warn=<error-number>,… (-diag-warn)

      Emit warning for specified diagnostic message number(s). Message numbers can be separated by comma.

NVRTC_ERROR_BUILTIN_OPERATION_FAILURE, HIVEOS (kawpowminer, closed)

@castillojim24 Is this still an issue? We have made lots of improvements, and I know some users are using HiveOS with no issues.

Please try it with version 1.2.1 and see if you are successful 🙂

castillojim24 commented on January 15, 2023

I tried with the HiveOS customized version 1.2.0 and with 1.2.1 compiled by myself, but the error persists.

castillojim24 commented on January 15, 2023

nyakze commented on January 15, 2023

What was causing the issue? How did you solve it?
Please elaborate so others having this issue can solve it as well

commented on January 15, 2023

I also have that problem. It would be nice if he shared his knowledge.

castillojim24 commented on January 15, 2023

skydivematy, are you using hiveos?

If that is the case, copy the files libnvrtc-builtins.so and libnvrtc.so.10.2 to the folder /hive/miners/ethminer/kawpowminer/1.2.2

commented on January 15, 2023

Thanks for your response, I did it. I copied both files from the /hive/lib folder to /hive/miners/ethminer/kawpowminer/1.2.2 but still get the same error:
terminate called after throwing an instance of 'cuda_runtime_error' what(): CUDA NVRTC error in func compileKernel at line 411 calling compileResult failed with error NVRTC_ERROR_BUILTIN_OPERATION_FAILURE

nyakze commented on January 15, 2023

It seems that it's an issue of HiveOS incorrectly linking libraries. If you want to fix it yourself, try the steps below. Note that t-rex is using 10.0, which is also incorrectly linked, so do the same for that version as well.
Do the following:
locate libnvrtc-builtins.so | grep 10.2
Check that the link is wrong (pointing to 9.2):
ls -lah /hive/lib/libnvrtc-builtins.so
rm /hive/lib/libnvrtc-builtins.so
ln -s /hive/lib/libnvrtc-builtins.so.10.2 /hive/lib/libnvrtc-builtins.so

nyakze commented on January 15, 2023

If the above doesn't help, run nvidia-driver-update.

commented on January 15, 2023

nvidia driver 440.82, hiveos all latest updates

nyakze commented on January 15, 2023

nvidia driver 440.82, hiveos all latest updates

Which linux kernel is your hive running on? Is it 5.0.21?
You can use hive-replace --list from the hive shell and update to the latest distro. There are some things hive doesn't update.


NVRTC_ERROR_BUILTIN_OPERATION_FAILURE, HIVEOS #8

kawpowminer 1.1.3+commit.adb6e361
Build: linux/release/gnu

 i 21:22:03 kawpow Configured pool rvnt.minermore.com:4505
 i 21:22:03 kawpow Api server listening on port 3334.
 i 21:22:03 kawpow Selected pool rvnt.minermore.com:4505
 i 21:22:03 kawpow Stratum mode : Stratum
 i 21:22:03 kawpow Established connection to rvnt.minermore.com [149.28.243.216:4505]
 i 21:22:03 kawpow Spinning up miners.
cu 21:22:03 cuda-0 Using Pci Id : 01:00.0 GeForce GTX 1070 (Compute 6.1) Memory : 7.93 GB
cu 21:22:03 cuda-1 Using Pci Id : 04:00.0 GeForce GTX 1070 (Compute 6.1) Memory : 7.93 GB
cu 21:22:03 cuda-2 Using Pci Id : 06:00.0 GeForce GTX 1070 (Compute 6.1) Memory : 7.93 GB
cu 21:22:03 cuda-3 Using Pci Id : 09:00.0 GeForce GTX 1070 (Compute 6.1) Memory : 7.93 GB
 i 21:22:03 kawpow New target set to: 00000009f6000000000000000000000000000000000000000000000000000000
 i 21:22:03 kawpow Epoch : 36 Difficulty : 431.17 Mh
 i 21:22:03 kawpow Job: aedaec7d… block 275179 rvnt.minermore.com [149.28.243.216:4505]
cu 21:22:03 cuda-4 Using Pci Id : 0a:00.0 GeForce RTX 2070 (Compute 7.5) Memory : 7.93 GB
cu 21:22:05 cuda-1 Generating DAG + Light : 1.30 GB
cu 21:22:05 cuda-0 Generating DAG + Light : 1.30 GB
cu 21:22:05 cuda-2 Generating DAG + Light : 1.30 GB
 i 21:22:05 kawpow Authorized worker jimnu.1070s
cu 21:22:05 cuda-4 Generating DAG + Light : 1.30 GB
cu 21:22:05 cuda-3 Generating DAG + Light : 1.30 GB
 i 21:22:06 kawpow Job: f51faeff… block 275180 rvnt.minermore.com [149.28.243.216:4505]
 i 21:22:08 kawpow New API session from 127.0.0.1:59540
 i 21:22:08 kawpow API : Method miner_getstat1 requested
 m 21:22:08 kawpow 0:00 A0 0.00 h - cu0 0.00 45C 74%, cu1 0.00 45C 75%, cu2 0.00 47C 74%, cu3 0.00 53C 75%, cu4 0.00 45C 60%
cu 21:22:08 cuda-4 Generated DAG + Light in 3,820 ms. 6.63 GB left.
terminate called after throwing an instance of 'cuda_runtime_error'
what(): CUDA NVRTC error in func compileKernel at line 411 calling compileResult failed with error NVRTC_ERROR_BUILTIN_OPERATION_FAILURE


NVRTC User Guide

The user guide to NVRTC.

1. Introduction

NVRTC is a runtime compilation library for CUDA C++. It accepts CUDA C++ source code in character string form and creates handles that can be used to obtain the PTX. The PTX string generated by NVRTC can be loaded by cuModuleLoadData and cuModuleLoadDataEx, and linked with other modules by using the nvJitLink library or using cuLinkAddData of the CUDA Driver API. This facility can often provide optimizations and performance not possible in a purely offline static compilation.

Previously, runtime compilation required spawning a separate nvcc process at runtime. NVRTC addresses these issues by providing a library interface that eliminates overhead associated with spawning separate processes, disk I/O, etc., while keeping application deployment simple.

2. Getting Started

2.1. System Requirements

2.2. Installation

3. User Interface

This chapter presents the API of NVRTC. Basic usage of the API is explained in Basic Usage.

3.1. Error Handling

NVRTC defines the following enumeration type and function for API call error handling.


3.2. General Information Query

NVRTC defines the following function for general information query.


3.3. Compilation

NVRTC defines the following type and functions for actual compilation.


3.4. Supported Compile Options

The supported options are listed in full in the Supported Compile Options section above.

3.5. Host Helper

NVRTC defines the following functions for easier interaction with host code.

template<typename T> nvrtcResult nvrtcGetTypeName(std::string* result) [inline]

This function is only provided when the macro NVRTC_GET_TYPE_NAME is defined with a non-zero value. It uses abi::__cxa_demangle or UnDecorateSymbolName function calls to extract the type name, when using gcc/clang or cl.exe compilers, respectively. If the name extraction fails, it will return NVRTC_ERROR_INTERNAL_ERROR, otherwise *result is initialized with the extracted name.

nvrtcGetTypeName() is not multi-thread safe because it calls UnDecorateSymbolName(), which is not multi-thread safe.

The returned string may contain Microsoft-specific keywords such as __ptr64 and __cdecl.


4. Language

Unlike the offline nvcc compiler, NVRTC is meant for compiling only device CUDA C++ code. It does not accept host code or host compiler extensions in the input code, unless otherwise noted.

4.1. Execution Space

NVRTC uses __host__ as the default execution space, and it generates an error if it encounters any host code in the input. That is, if the input contains entities with explicit __host__ annotations or no execution space annotation, NVRTC will emit an error. __host__ __device__ functions are treated as device functions.

NVRTC provides a compile option, --device-as-default-execution-space, that enables an alternative compilation mode, in which entities with no execution space annotations are treated as __device__ entities.

4.2. Separate Compilation

NVRTC itself does not provide any linker. Users can, however, use the nvJitLink library or cuLinkAddData in the CUDA Driver API to link the generated relocatable PTX code with other relocatable code. To generate relocatable PTX code, the compile option --relocatable-device-code=true or --device-c is required.

4.3. Dynamic Parallelism

4.4. Integer Size

Different operating systems define integer type sizes differently. Linux x86_64 implements LP64, and Windows x86_64 implements LLP64.

Table 1. Integer sizes in bits for LLP64 and LP64

          short   int   long   long long   pointers and size_t
LLP64     16      32    32     64          64
LP64      16      32    64     64          64

NVRTC implements LP64 on Linux and LLP64 on Windows.

NVRTC supports 128-bit integer types through the __int128 type. This can be enabled with the --device-int128 flag. 128-bit integer support is not available on Windows.

4.5. Include Syntax

When nvrtcCompileProgram() is called, the current working directory is added to the header search path used for locating files included with the quoted syntax (e.g., #include "foo.h"), before the code is compiled.

4.6. Predefined Macros

  • __CUDACC_RTC__ : useful for distinguishing between runtime and offline nvcc compilation in user code.
  • __CUDACC__ : defined with same semantics as with offline nvcc compilation.
  • __CUDACC_RDC__ : defined with same semantics as with offline nvcc compilation.
  • __CUDACC_EWP__ : defined with same semantics as with offline nvcc compilation.
  • __CUDACC_DEBUG__ : defined with same semantics as with offline nvcc compilation.
  • __CUDA_ARCH__ : defined with same semantics as with offline nvcc compilation.
  • __CUDA_ARCH_LIST__ : defined with same semantics as with offline nvcc compilation.
  • __CUDACC_VER_MAJOR__ : defined with the major version number as returned by nvrtcVersion .
  • __CUDACC_VER_MINOR__ : defined with the minor version number as returned by nvrtcVersion .
  • __CUDACC_VER_BUILD__ : defined with the build version number.
  • __NVCC_DIAG_PRAGMA_SUPPORT__ : defined with same semantics as with offline nvcc compilation.
  • __CUDACC_RTC_INT128__ : defined when -device-int128 flag is specified during compilation, and indicates that __int128 type is supported.
  • NULL : null pointer constant.
  • va_start
  • va_end
  • va_arg
  • va_copy : defined when language dialect C++11 or later is selected.
  • __cplusplus
  • _WIN64 : defined on Windows platforms.
  • __LP64__ : defined on non-Windows platforms where long int and pointer types are 64-bits.
  • __cdecl : defined to empty on all platforms.
  • __ptr64 : defined to empty on Windows platforms.

4.7. Predefined Types

  • clock_t
  • size_t
  • ptrdiff_t
  • va_list : Note that the definition of this type may be different than the one selected by nvcc when compiling CUDA code.
  • Predefined types such as dim3 , char4 , etc., that are available in the CUDA Runtime headers when compiling offline with nvcc are also available, unless otherwise noted.

4.8. Builtin Functions

Builtin functions provided by the CUDA Runtime headers when compiling offline with nvcc are available, unless otherwise noted.

4.9. Default C++ Dialect

The default C++ dialect is C++17. Other dialects can be selected using the -std flag.

5. Basic Usage

This section of the document uses a simple example, Single-Precision α⋅X Plus Y (SAXPY), shown in Figure 1 to explain what is involved in runtime compilation with NVRTC. For brevity and readability, error checks on the API return values are not shown. The complete code listing is available in Example: SAXPY.

First, an instance of nvrtcProgram needs to be created. Figure 2 shows creation of nvrtcProgram for SAXPY. As SAXPY does not require any header, 0 is passed as numHeaders, and NULL as headers and includeNames.

If SAXPY had any #include directives, the contents of the files that are #include'd can be passed as elements of headers, and their names as elements of includeNames. For example, #include "foo.h" and #include "bar.h" would require 2 as numHeaders, { <contents of foo.h>, <contents of bar.h> } as headers, and { "foo.h", "bar.h" } as includeNames (where <contents of foo.h> and <contents of bar.h> must be replaced by the actual contents of foo.h and bar.h). Alternatively, the compile option -I can be used if the header is guaranteed to exist in the file system at runtime.
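
A sketch of the same idea with the Python binding (the file names and contents here are hypothetical stand-ins):

```python
from cuda import nvrtc

foo_h = b"#define FOO 1\n"          # stand-in for the real contents of foo.h
bar_h = b"#define BAR 2\n"          # stand-in for the real contents of bar.h
src = b'#include "foo.h"\n#include "bar.h"\n// ... kernel code ...\n'

err, prog = nvrtc.nvrtcCreateProgram(
    src, b"prog.cu",
    2,                       # numHeaders
    [foo_h, bar_h],          # headers: contents, in order
    [b"foo.h", b"bar.h"])    # includeNames: names used in #include
```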

Once the instance of nvrtcProgram for compilation is created, it can be compiled by nvrtcCompileProgram as shown in Figure 3. Two compile options are used in this example, --gpu-architecture=compute_80 and --fmad=false, to generate code for the compute_80 architecture and to disable the contraction of floating-point multiplies and adds/subtracts into floating-point multiply-add operations. Other combinations of compile options can be used as needed, and Supported Compile Options lists valid compile options.

After the compilation completes, users can obtain the program compilation log and the generated PTX as Figure 4 shows. NVRTC does not generate valid PTX when the compilation fails, and it may generate a compilation log (e.g., containing warnings) even when the compilation succeeds.

An nvrtcProgram can be compiled by nvrtcCompileProgram multiple times with different compile options; only the PTX and the log generated by the last compilation can be retrieved.

When the instance of nvrtcProgram is no longer needed, it can be destroyed by nvrtcDestroyProgram as shown in Figure 5.

The generated PTX can be further manipulated by the CUDA Driver API for execution or linking. Figure 6 shows an example code sequence for execution of the generated PTX.
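
Since the figures referenced above are not reproduced in this copy, here is a minimal end-to-end sketch using the cuda-python bindings; the PTX-loading idiom follows NVIDIA's published cuda-python SAXPY sample, and error checks are omitted for brevity:

```python
import numpy as np
from cuda import cuda, nvrtc

src = (b'extern "C" __global__ void saxpy(float a, float *x, float *y, '
       b'float *out, size_t n) { size_t i = blockIdx.x * blockDim.x + '
       b'threadIdx.x; if (i < n) out[i] = a * x[i] + y[i]; }')

# Create and compile (Figures 2 and 3 in the original guide).
err, prog = nvrtc.nvrtcCreateProgram(src, b"saxpy.cu", 0, [], [])
err, = nvrtc.nvrtcCompileProgram(prog, 1, [b"--gpu-architecture=compute_80"])

# Retrieve the log and PTX (Figure 4).
err, log_size = nvrtc.nvrtcGetProgramLogSize(prog)
log = b" " * log_size
err, = nvrtc.nvrtcGetProgramLog(prog, log)
err, ptx_size = nvrtc.nvrtcGetPTXSize(prog)
ptx = b" " * ptx_size
err, = nvrtc.nvrtcGetPTX(prog, ptx)
err, = nvrtc.nvrtcDestroyProgram(prog)   # Figure 5

# Load the PTX and look up the kernel via the Driver API (Figure 6).
err, = cuda.cuInit(0)
err, dev = cuda.cuDeviceGet(0)
err, ctx = cuda.cuCtxCreate(0, dev)
ptx_buf = np.char.array(ptx)
err, module = cuda.cuModuleLoadData(ptx_buf.ctypes.data)
err, kernel = cuda.cuModuleGetFunction(module, b"saxpy")
```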

6. Accessing Lowered Names

6.1. Introduction

NVRTC will mangle __global__ function names and names of __device__ and __constant__ variables as specified by the IA64 ABI. If the generated PTX is being loaded using the CUDA Driver API, the kernel function or __device__ / __constant__ variable must be looked up by name, but this is hard to do when the name has been mangled. To address this problem, NVRTC provides API functions that map source level __global__ function or __device__ / __constant__ variable names to the mangled names present in the generated PTX.

The two API functions nvrtcAddNameExpression and nvrtcGetLoweredName work together to provide this functionality. First, a ‘name expression’ string denoting the address for the __global__ function or __device__ / __constant__ variable is provided to nvrtcAddNameExpression . Then, the program is compiled with nvrtcCompileProgram . During compilation, NVRTC will parse the name expression string as a C++ constant expression at the end of the user program. The constant expression must provide the address of the __global__ function or __device__ / __constant__ variable. Finally, the function nvrtcGetLoweredName is called with the original name expression and it returns a pointer to the lowered name. The lowered name can be used to refer to the kernel or variable in the CUDA Driver API.

NVRTC guarantees that any __global__ function or __device__ / __constant__ variable referenced in a call to nvrtcAddNameExpression will be present in the generated PTX (if the definition is available in the input source code).

6.2. Example

Example: Using Lowered Name lists a complete runnable example. Some relevant snippets:

6.3. Notes

  1. Sequence of calls: All name expressions must be added using nvrtcAddNameExpression before the NVRTC program is compiled with nvrtcCompileProgram . This is required because the name expressions are parsed at the end of the user program, and may trigger template instantiations. The lowered names must be looked up by calling nvrtcGetLoweredName only after the NVRTC program has been compiled, and before it has been destroyed. The pointer returned by nvrtcGetLoweredName points to memory owned by NVRTC, and this memory is freed when the NVRTC program has been destroyed ( nvrtcDestroyProgram ). Thus the correct sequence of calls is : nvrtcAddNameExpression , nvrtcCompileProgram , nvrtcGetLoweredName , nvrtcDestroyProgram .
  2. Identical Name Expressions: The name expression string passed to nvrtcAddNameExpression and nvrtcGetLoweredName must have identical characters. For example, "foo" and "foo " are not identical strings, even though semantically they refer to the same entity (foo), because the second string has an extra whitespace character.
  3. Constant Expressions: The characters in the name expression string are parsed as a C++ constant expression at the end of the user program. Any errors during parsing will cause compilation failure and compiler diagnostics will be generated in the compilation log. The constant expression must refer to the address of a __global__ function or __device__ / __constant__ variable.
  4. Address of overloaded function: If the NVRTC source code has multiple overloaded __global__ functions, then the name expression must use a cast operation to disambiguate. However, casts are not allowed in constant expressions for C++ dialects before C++11. If using such name expressions, please compile the code in C++11 or later dialect using the '-std' command line flag. Example: Consider a GPU code string that contains two overloaded kernels, __global__ void foo(int) and __global__ void foo(char). The name expression '(void(*)(int))foo' correctly disambiguates 'foo(int)', but the program must be compiled in C++11 or later dialect (e.g. '-std=c++11') because casts are not allowed in pre-C++11 constant expressions. A sketch is shown after this list.
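
A hedged sketch of note 4 with the Python binding (the source and expression strings are illustrative):

```python
from cuda import nvrtc

src = b"""
__global__ void foo(int) { }
__global__ void foo(char) { }
"""
err, prog = nvrtc.nvrtcCreateProgram(src, b"foo.cu", 0, [], [])

expr = b"(void(*)(int))foo"                      # cast disambiguates foo(int)
err, = nvrtc.nvrtcAddNameExpression(prog, expr)
err, = nvrtc.nvrtcCompileProgram(prog, 1, [b"--std=c++11"])
err, lowered = nvrtc.nvrtcGetLoweredName(prog, expr)
```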

7. Interfacing With Template Host Code

7.1. Introduction

In some scenarios, it is useful to instantiate __global__ function templates in device code based on template arguments in host code. The NVRTC helper function nvrtcGetTypeName can be used to extract the source level name of a type in host code, and this string can be used to instantiate a __global__ function template and get the mangled name of the instantiation using the nvrtcAddNameExpression and nvrtcGetLoweredName functions.

nvrtcGetTypeName is defined inline in the NVRTC header file, and is available when the macro NVRTC_GET_TYPE_NAME is defined with a non-zero value. It uses the abi::__cxa_demangle and UnDecorateSymbolName host code functions when using gcc/clang and cl.exe compilers, respectively. Users may need to specify additional header paths and libraries to find the host functions used ( abi::__cxa_demangle / UnDecorateSymbolName ). See the build instructions for the example below for reference (Build Instructions).

7.2. Example

Example: Using nvrtcGetTypeName lists a complete runnable example. Some relevant snippets:

8. Versioning Scheme

8.1. NVRTC Shared Library Versioning

In the following, MAJOR and MINOR denote the major and minor versions of the CUDA Toolkit, e.g. for CUDA 11.2, MAJOR is "11" and MINOR is "2".

  • Linux:
    • In CUDA toolkits prior to CUDA 11.3, the soname was set to "MAJOR.MINOR".
    • In CUDA 11.3 and later 11.x toolkits, the soname field is set to "11.2".
    • In CUDA toolkits with major version > 11 (e.g. CUDA 12.x), the soname field is set to "MAJOR".
  • Windows:
    • In CUDA toolkits prior to CUDA 11.3, the DLL name was of the form "nvrtc64_XY_0.dll", where X = MAJOR, Y = MINOR.
    • In CUDA 11.3 and later 11.x toolkits, the DLL name is "nvrtc64_112_0.dll".
    • In CUDA toolkits with major version > 11 (e.g. CUDA 12.x), the DLL name is of the form "nvrtc64_X0_0.dll" where X = MAJOR.

Consider a CUDA toolkit with major version > 11. The NVRTC shared library in this CUDA toolkit will have the same soname (Linux) or DLL name (Windows) as an NVRTC shared library in a previous minor version of the same CUDA toolkit. Similarly, the NVRTC shared library in CUDA 11.3 and later 11.x releases will have the same soname (Linux) or DLL name (Windows) as the NVRTC shared library in CUDA 11.2.

As a consequence of the versioning scheme described above, an NVRTC client that links against a particular NVRTC shared library will continue to work with a future NVRTC shared library with a matching soname (Linux) or DLL name (Windows). This allows the NVRTC client to take advantage of bug fixes and enhancements available in the more recent NVRTC shared library. However, the more recent NVRTC shared library may generate PTX with a version that is not accepted by the CUDA Driver API functions of an older CUDA driver, as explained in the CUDA Compatibility document.

Alternately, an NVRTC client can either link against the static NVRTC library or redistribute a specific version of the NVRTC shared library and use dlopen (Linux) or LoadLibrary (Windows) functions to use that library at run time. Either approach allows the NVRTC client to maintain control over the version of NVRTC being used during deployment, to ensure predictable functionality and performance.

8.2. NVRTC-builtins Library

The NVRTC-builtins library contains helper code that is part of the NVRTC package. It is only used by the NVRTC library internally. Each NVRTC library is only compatible with the NVRTC-builtins library from the same CUDA toolkit.

9. Miscellaneous Notes

9.1. Thread safety

Multiple threads can invoke NVRTC API functions concurrently, as long as there is no race condition. In this context, a race condition is defined to occur if multiple threads concurrently invoke NVRTC API functions with the same nvrtcProgram argument, where at least one thread is invoking either nvrtcCompileProgram or nvrtcAddNameExpression.

9.2. Stack Size

On Linux, NVRTC will increase the stack size to the maximum allowed using the setrlimit() function during compilation. This reduces the chance that the compiler will run out of stack when processing complex input sources. The stack size is reset to the previous value when compilation is completed.

Because setrlimit() changes the stack size for the entire process, it will also affect other application threads that may be executing concurrently. The command line flag -modify-stack-limit=false will prevent NVRTC from modifying the stack limit.

9.3. NVRTC Static Library

The NVRTC static library references functions defined in the NVRTC-builtins static library and the PTX compiler static library. Please see Build Instructions for an example.

A. Example: SAXPY

A.1. Code (saxpy.cpp)

A.2. Build Instructions

B. Example: Using Lowered Name

B.1. Code (lowered-name.cpp)

B.2. Build Instructions

C. Example: Using nvrtcGetTypeName

C.1. Code (host-type-name.cpp)

C.2. Build Instructions

D. Example: Dynamic Parallelism

D.1. Code (dynamic-parallelism.cpp)

D.2. Build Instructions

E. Example: Device LTO (link time optimization)

This section demonstrates device link time optimization (LTO). There are two units of LTO IR. The first unit is generated offline using nvcc, by specifying the architecture as '-arch lto_XX' (see offline.cu). The generated LTO IR is packaged in a fatbinary.

The second unit is generated online using NVRTC, by specifying the flag '-dlto' (see online.cpp).

These two units are then passed to libnvJitLink* API functions, which link together the LTO IR, run the optimizer on the linked IR and generate a cubin (see online.cpp). The cubin is then loaded on the GPU and executed.

E.1. Code (offline.cu)

E.2. Code (online.cpp)

E.3. Build Instructions

Notices

Notice

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.

NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.

Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.

NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.

NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.

NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.

No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.

Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.

THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.

Source

NVRTCError: NVRTC_ERROR_COMPILATION with `cupy.cuda.compile_with_cache`

tommy19970714 opened this issue 2 years ago

🐛 Bug

I got the following error when compiling the model to JIT with cupy.cuda.compile_with_cache.

NVRTCError: NVRTC_ERROR_COMPILATION (6)

During handling of the above exception, another exception occurred:

CompileException                          Traceback (most recent call last)

cupy/util.pyx in cupy.util.memoize.decorator.ret()

/usr/local/lib/python3.7/dist-packages/cupy/cuda/compiler.py in compile(self, options)
    440         except nvrtc.NVRTCError:
    441             log = nvrtc.getProgramLog(self.ptr)
--> 442             raise CompileException(log, self.src, self.name, options, 'nvrtc')
    443 
    444 

CompileException: /tmp/tmpan1ut480/3b7c153ce98d06488f1cbac8793f6dff_2.cubin.cu(16): error: identifier "tensor" is undefined

1 error detected in the compilation of "/tmp/tmpan1ut480/3b7c153ce98d06488f1cbac8793f6dff_2.cubin.cu".

To Reproduce

Here is a Colab notebook that reproduces the error:
https://colab.research.google.com/drive/1WDRCN6wPIAsl5tBFKfne0ABN49estM9P?usp=sharing

This is the minimal reproduction code:

import cupy

@cupy.util.memoize(for_each_device=True)
def cupy_launch(strFunction, strKernel):
	return cupy.cuda.compile_with_cache(strKernel).get_function(strFunction)

kernel_Correlation_rearrange = " .... "

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()

    def forward(self, x_warp_after, x_cond):
        cupy_launch('kernel_Correlation_rearrange', cupy_kernel('kernel_Correlation_rearrange', {
          'intStride': 1,
          'input': x_warp_after,
          'output': x_cond
        }))(
        )
        return x_warp_after, x_cond

net = Net().cuda()
input1 = torch.randn([1, 256, 8, 6]).cuda()
input2 = torch.randn([1, 256, 8, 6]).cuda()
trace_model = torch.jit.trace(net, [input1, input2])

Expected behavior

The kernel should compile without errors; instead, the error above occurs whenever I use cupy.cuda.compile_with_cache.

Environment

  • CuPy version: cupy-cuda101==7.4.0
  • CUDA/cuDNN version: 11.0.221
  • PyTorch Version (e.g., 1.0): 1.8.1+cu101
  • OS (e.g., Linux): Ubuntu 18.04.5 LTS (x86_64)
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source): no
  • Python version: 3.7 (64-bit runtime)
  • GPU models and configuration: GPU 0: Tesla T4
  • Any other relevant information:

Additional context

I opened an issue in the pytorch repository before, but I realized that the problem is not a pytorch issue, but a cupy issue.

This is neither PyTorch’s nor CuPy’s bug, but rather an issue in the way you did string processing to generate your kernel. Notice this error:

CompileException: /tmp/tmpan1ut480/3b7c153ce98d06488f1cbac8793f6dff_2.cubin.cu(16): error: identifier "tensor" is undefined

It is a common C/C++ error telling you that the definition for the identifier tensor is missing. You should check how that identifier entered the code string. CuPy provides some environment variables, and the one you need to help you debug the code generation is either CUPY_CACHE_SAVE_CUDA_SOURCE or CUPY_DUMP_CUDA_SOURCE_ON_ERROR.

By the way, it is best not to use cupy.cuda.compile_with_cache(), because it is subject to change without notice (it's considered an internal API, AFAIK). There is a public API, cupy.RawModule, for exactly this need (see the tutorial).
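
A minimal sketch combining both suggestions; the kernel here is a stand-in, and the environment variable has to be set before the failing compilation runs:

import os
os.environ["CUPY_DUMP_CUDA_SOURCE_ON_ERROR"] = "1"  # dump source on failure

import cupy as cp

source = r'''
extern "C" __global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}
'''
module = cp.RawModule(code=source)   # public API, replaces compile_with_cache
scale = module.get_function("scale")

x = cp.arange(8, dtype=cp.float32)
scale((1,), (8,), (x, cp.int32(8)))  # grid, block, kernel arguments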

@leofang @kmaehashi Thank you for your response!

I tried debugging with the environment variables in the following Colab:
https://colab.research.google.com/drive/1WDRCN6wPIAsl5tBFKfne0ABN49estM9P?usp=sharing
But the error was the same as before, and I could not tell which part of the code was at fault.

Also, following that feedback, I tried replacing cupy.cuda.compile_with_cache() with cupy.RawModule, but the error was the same. You can check the replaced code in the Colab above.

As @leofang told me, I think it is a C/C++ error, but since the original C/C++ code does not use tensor, it is difficult to find the cause.
Can you see what the problem is in the C/C++ code?

P.S.
I checked the issue that @kmaehashi shared with me.
The PyTorch forum thread is the one I posted.
Issue #3987 is very similar to this one; it is closed, but there is no solution.

As shown in the output, this is the code you are passing to cupy.RawModule:

Name: /tmp/tmpfl9rgemg/2580de419f706496d2e0e83122e0c456_2.cubin.cu
Options: -I/usr/local/lib/python3.7/dist-packages/cupy/core/include -I /usr/local/lib/python3.7/dist-packages/cupy/core/include/cupy/_cuda/cuda-10.1 -I /usr/local/cuda/include -ftz=true -arch=compute_37
CUDA source:
01 
02 	extern "C" __global__ void kernel_Correlation_rearrange(
03 		const int n,
04 		const float* input,
05 		float* output
06 	) {
07 	  int intIndex = (blockIdx.x * blockDim.x) + threadIdx.x;
08 
09 	  if (intIndex >= n) {
10 	    return;
11 	  }
12 
13 	  int intSample = blockIdx.z;
14 	  int intChannel = blockIdx.y;
15 
16 	  float fltValue = input[(((intSample * tensor(256)) + intChannel) * tensor(8) * tensor(6)) + intIndex];
17 
18 	  __syncthreads();
19 
20 	  int intPaddedY = (intIndex / tensor(6)) + 3*1;
21 	  int intPaddedX = (intIndex % tensor(6)) + 3*1;
22 	  int intRearrange = ((tensor(6) + 6*1) * intPaddedY) + intPaddedX;
23 
24 	  output[(((intSample * tensor(256) * tensor(8)) + intRearrange) * tensor(256)) + intChannel] = 100;
25 	}
26 

Line 16 contains the undefined identifier tensor. That is the cause of the error: identifier "tensor" is undefined.

@kmaehashi Thank you for letting me know.
tensor is not used in the original C++ code, so it seems to be introduced during the conversion by CuPy.
The original code:

kernel_Correlation_rearrange = '''
	extern "C" __global__ void kernel_Correlation_rearrange(
		const int n,
		const float* input,
		float* output
	) {
	  int intIndex = (blockIdx.x * blockDim.x) + threadIdx.x;

	  if (intIndex >= n) {
	    return;
	  }

	  int intSample = blockIdx.z;
	  int intChannel = blockIdx.y;

	  float fltValue = input[(((intSample * SIZE_1(input)) + intChannel) * SIZE_2(input) * SIZE_3(input)) + intIndex];

	  __syncthreads();

	  int intPaddedY = (intIndex / SIZE_3(input)) + 3*{{intStride}};
	  int intPaddedX = (intIndex % SIZE_3(input)) + 3*{{intStride}};
	  int intRearrange = ((SIZE_3(input) + 6*{{intStride}}) * intPaddedY) + intPaddedX;

	  output[(((intSample * SIZE_1(output) * SIZE_2(output)) + intRearrange) * SIZE_1(input)) + intChannel] = 100;
	}
'''

Do you have any idea how I can prevent it from being converted by cupy?

> but it seems to be converted by cupy.

No, it is done by your code.

def cupy_kernel(strFunction, objVariables):
	strKernel = globals()[strFunction].replace('{{intStride}}', str(objVariables['intStride']))

	while True:
		objMatch = re.search('(SIZE_)([0-4])(\()([^)]*)(\))', strKernel)
...

Please report this issue to the repository that provides the code.

Thanks to @kmaehashi's advice, this problem has been solved.

The problem is that when tracing to JIT, the sizes that were plain ints become tensor types.
I solved the problem by rewriting the following code.

import re

def cupy_kernel(strFunction, objVariables):
	strKernel = globals()[strFunction].replace('{{intStride}}', str(objVariables['intStride']))

	while True:
		objMatch = re.search('(SIZE_)([0-4])(\()([^)]*)(\))', strKernel)

		if objMatch is None:
			break
		# end

		intArg = int(objMatch.group(2))

		strTensor = objMatch.group(4)
		intSizes = objVariables[strTensor].size()

		#####
		# HERE: I changed the following lines. Under tracing, size()
		# returns 0-d tensors, so str() yields "tensor(256)" instead of
		# "256"; rewriting "tensor" to "int" keeps the emitted C++ valid.
		replaceStr = str(intSizes[intArg]).replace("tensor", "int")
		strKernel = strKernel.replace(objMatch.group(), replaceStr)
		# HERE
		#####
	# end

	while True:
		objMatch = re.search('(VALUE_)([0-4])(\()([^)]+)(\))', strKernel)

		if objMatch is None:
			break
		# end

		intArgs = int(objMatch.group(2))
		strArgs = objMatch.group(4).split(',')

		strTensor = strArgs[0]
		intStrides = objVariables[strTensor].stride()
		strIndex = [ '((' + strArgs[intArg + 1].replace('{', '(').replace('}', ')').strip() + ')*' + str(intStrides[intArg]) + ')' for intArg in range(intArgs) ]

		strKernel = strKernel.replace(objMatch.group(0), strTensor + '[' + str.join('+', strIndex) + ']')
	# end

	return strKernel
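
A hypothetical, slightly more direct variant of the lines between the HERE markers: convert the traced size to a plain Python int before stringifying it, instead of patching the word "tensor" in the string (same generated C++ in the end).

# Hypothetical drop-in for the two lines between the HERE markers above.
replaceStr = str(int(intSizes[intArg]))  # 0-d tensor -> plain int -> "256"
strKernel = strKernel.replace(objMatch.group(), replaceStr)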

07-10-2020, 02:10 AM
(This post was last modified: 07-10-2020, 02:13 AM by jen42.)

I’m having trouble getting hashcat 6 to work with any of the NVIDIA 2000-series GPUs I’ve tried. I’ve tried a laptop RTX 2060 (Linux Mint) and a hashing rig with a mix of GTX 1080 Founders and 2080 Supers (Ubuntu 18.04), with the same result in each case: only the 2080 GPUs fail. The 1080s work just fine…

I have easier access to the laptop, so I will use that for the examples below:

First — without nvidia-cuda-toolkit installed:

Code:

$ sudo hashcat-6.0.0/hashcat.bin -m 1000 -b
hashcat (v6.0.0) starting in benchmark mode...

Benchmarking uses hand-optimized kernel code by default.
You can use it in your cracking session by setting the -O option.
Note: Using optimized kernel code limits the maximum supported password length.
To disable the optimized kernel code in benchmark mode, use the -w option.

* Device #1: CUDA SDK Toolkit installation NOT detected.
            CUDA SDK Toolkit installation required for proper device support and utilization
            Falling back to OpenCL Runtime

* Device #1: WARNING! Kernel exec timeout is not disabled.
            This may cause "CL_OUT_OF_RESOURCES" or related errors.
            To disable the timeout, see: https://hashcat.net/q/timeoutpatch
nvmlDeviceGetFanSpeed(): Not Supported

OpenCL API (OpenCL 1.2 CUDA 10.2.185) - Platform #1 [NVIDIA Corporation]
========================================================================
* Device #1: GeForce RTX 2060, 5632/5934 MB (1483 MB allocatable), 30MCU

Benchmark relevant options:
===========================
* --optimized-kernel-enable

Hashmode: 1000 - NTLM

Speed.#1.........: 37744.5 MH/s (53.21ms) @ Accel:64 Loops:1024 Thr:1024 Vec:8

Started: Thu Jul  9 17:51:19 2020
Stopped: Thu Jul  9 17:51:26 2020

And the nvidia packages that are installed when it works:

Code:

$ sudo apt list --installed | grep -i nvidia

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libnvidia-cfg1-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
libnvidia-common-435/bionic,bionic,bionic-updates,bionic-updates,now 435.21-0ubuntu0.18.04.2 all [installed]
libnvidia-common-440/bionic,bionic,bionic-updates,bionic-updates,bionic-security,bionic-security,now 440.100-0ubuntu0.18.04.1 all [installed,automatic]
libnvidia-compute-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
libnvidia-decode-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
libnvidia-encode-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
libnvidia-extra-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
libnvidia-fbc1-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
libnvidia-gl-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
libnvidia-ifr1-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
nvidia-compute-utils-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
nvidia-dkms-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
nvidia-driver-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed]
nvidia-kernel-common-440/bionic,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
nvidia-kernel-source-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
nvidia-prime/bionic-updates,bionic-updates,now 0.8.8.2 all [installed]
nvidia-settings/bionic,now 440.64-0ubuntu0~0.18.04.1 amd64 [installed]
nvidia-utils-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
xserver-xorg-video-nvidia-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]

As you can see, it works, but complains about falling back to OpenCL. In addition, hashcat 6 pegs the CPU at 100% when running in the OpenCL fallback, which is not ideal. hashcat 5 can run for weeks (probably forever) without using more than 20% of the CPU, so we have currently reverted to hashcat 5 because of this problem.

With the CUDA Toolkit installed:

Code:

$ sudo hashcat-6.0.0/hashcat.bin -m 1000 -b
hashcat (v6.0.0) starting in benchmark mode...

Benchmarking uses hand-optimized kernel code by default.
You can use it in your cracking session by setting the -O option.
Note: Using optimized kernel code limits the maximum supported password length.
To disable the optimized kernel code in benchmark mode, use the -w option.

* Device #1: WARNING! Kernel exec timeout is not disabled.
            This may cause "CL_OUT_OF_RESOURCES" or related errors.
            To disable the timeout, see: https://hashcat.net/q/timeoutpatch
* Device #2: WARNING! Kernel exec timeout is not disabled.
            This may cause "CL_OUT_OF_RESOURCES" or related errors.
            To disable the timeout, see: https://hashcat.net/q/timeoutpatch
nvmlDeviceGetFanSpeed(): Not Supported

CUDA API (CUDA 10.2)
====================
* Device #1: GeForce RTX 2060, 5644/5934 MB, 30MCU

OpenCL API (OpenCL 1.2 CUDA 10.2.185) - Platform #1 [NVIDIA Corporation]
========================================================================
* Device #2: GeForce RTX 2060, skipped

Benchmark relevant options:
===========================
* --optimized-kernel-enable

Hashmode: 1000 - NTLM

nvrtcCompileProgram(): NVRTC_ERROR_INVALID_OPTION

nvrtc: error: invalid value for --gpu-architecture (-arch)

* Device #1: Kernel /home/jen42/hashcat-6.0.0/OpenCL/shared.cl build failed.

* Device #1: Kernel /home/jen42/hashcat-6.0.0/OpenCL/shared.cl build failed.

Started: Thu Jul  9 18:00:43 2020
Stopped: Thu Jul  9 18:00:44 2020

Here is the package list for all nvidia when the cuda toolkit is installed:

Code:

$ sudo apt list --installed | grep -i nvidia

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libnvidia-cfg1-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
libnvidia-common-435/bionic,bionic,bionic-updates,bionic-updates,now 435.21-0ubuntu0.18.04.2 all [installed]
libnvidia-common-440/bionic,bionic,bionic-updates,bionic-updates,bionic-security,bionic-security,now 440.100-0ubuntu0.18.04.1 all [installed,automatic]
libnvidia-compute-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
libnvidia-decode-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
libnvidia-encode-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
libnvidia-extra-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
libnvidia-fbc1-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
libnvidia-gl-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
libnvidia-ifr1-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
nvidia-compute-utils-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
nvidia-cuda-dev/bionic,now 9.1.85-3ubuntu1 amd64 [installed,automatic]
nvidia-cuda-toolkit/bionic,now 9.1.85-3ubuntu1 amd64 [installed]
nvidia-dkms-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
nvidia-driver-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed]
nvidia-kernel-common-440/bionic,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
nvidia-kernel-source-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
nvidia-opencl-dev/bionic,now 9.1.85-3ubuntu1 amd64 [installed,automatic]
nvidia-prime/bionic-updates,bionic-updates,now 0.8.8.2 all [installed]
nvidia-profiler/bionic,now 9.1.85-3ubuntu1 amd64 [installed,automatic]
nvidia-settings/bionic,now 440.64-0ubuntu0~0.18.04.1 amd64 [installed]
nvidia-utils-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]
xserver-xorg-video-nvidia-440/bionic,bionic-updates,bionic-security,now 440.100-0ubuntu0.18.04.1 amd64 [installed,automatic]

As you can see, the 2080s just don’t work with nvidia-cuda-toolkit and its dependencies installed. I’ve tried using the default Ubuntu repositories and installing the NVIDIA drivers with apt, and also downloading the drivers directly from NVIDIA and installing them that way. No change.

I have used the precompiled hashcat binaries and built from source. I’ve also tried the same with the latest beta of hashcat.  No change.

Additionally, hashcat 6.0 with only the 1080 GPU AND the CUDA toolkit still pegs the CPU at 100%, so we still can’t use hashcat 6 even with just that GPU without CPU issues.

Any ideas why this doesn’t work?

Posts: 302
Threads: 2
Joined: Dec 2015

Your driver versions and component versions, as shown in your package lists, are not consistent with the versions provided with the current CUDA Toolkit installation.

Grab this file https://developer.nvidia.com/cuda-downlo…nfilelocal

It’s the runfile for installing on Ubuntu 18.04. You should see versions consistent with the following list after installation: https://docs.nvidia.com/cuda/cuda-toolki…t-versions

The current driver version for Linux is >= 450.XX.

Now, that said, I’m not certain a driver upgrade alone will solve the issue. It sounds more like you’ve got some other environmental issue at play, especially as I cannot reproduce this on my Linux test bench. It is also mixed generation, including 20-series cards, running Ubuntu 18.04 with the latest toolkit, and hashcat 5.1.0 and 6.0.0 both run perfectly.
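
The mismatch can also be checked directly: the nvidia-cuda-toolkit package in the list above is 9.1, which predates Turing, so its NVRTC has no compute_75 target for the 20-series cards and rejects -arch with NVRTC_ERROR_INVALID_OPTION, as in the output above. A hedged sketch using the query functions documented earlier on this page (assuming the cuda-python bindings):

from cuda import nvrtc

err, major, minor = nvrtc.nvrtcVersion()
err, archs = nvrtc.nvrtcGetSupportedArchs()
print("NVRTC %d.%d supports:" % (major, minor), archs)
if 75 not in archs:
    # Turing (RTX 2060/2080, compute capability 7.5) is missing, so any
    # compile with -arch=compute_75 will fail as shown in the benchmark log.
    print("This NVRTC predates Turing; upgrade the toolkit and driver.")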

Posts: 2
Threads: 1
Joined: Jul 2020

Thank you! The 450 driver fixed the hashcat 6.0 CUDA errors.

I’m still seeing CPU usage at 100% with hashcat 6 but not with 5 for the same tests, though:

Hashcat 5.1 is using 6-10% CPU:

Code:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND                                       

2448 root      20  0 9209036 1.580g 130340 S  6.6  5.1  0:12.09 hashcat64.bin             

Hashcat 6.0 is using 96-100%:

Code:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND                                       
2557 root      20  0 11.330g 990188 105096 S  96.0  3.0  0:11.82 hashcat.bin           

How does your CPU look with hashcat 6 vs 5?

Posts: 302
Threads: 2
Joined: Dec 2015

6.0.0 is definitely using a bit more CPU for me, though not nearly as much as you are seeing. How many cards are in that system? This may be an issue of scale, where more cards use more CPU, as we had with the NVIDIA core-spinning issue. The suggestion, due mostly to that issue, is still to have at minimum one fast core per GPU in the system, with one or two cores left over for the rest of the system.

aaronbgr…@gmail.com

Nov 11, 2017, 6:41:18 PM

to theano-dev

Hello,

I wrote some GPU Ops for Theano a few months ago. In particular, "git show" on the version of Theano that I checked out returns:

commit d49b53682449a39175daf72724860024fd6b7b1f
Merge: 98c3625 bcf999f
Author: Frédéric Bastien <frederic…@gmail.com>
Date:   Fri Apr 7 09:52:30 2017 -0400

    Merge pull request #5820 from mike1808/patch-1

        fix theano.sparse.mul docs

At the time, I was using libgpuarray 0.6.3.

When I update to Theano 1.0.0rc1 and libgpuarray 0.7.4 and then try to compile my GPU Ops (i.e., as part of a Theano graph, by calling theano.function), I get the NVIDIA compiler error "nvrtcCompileProgram: NVRTC_ERROR_COMPILATION" and no further information.

Now, I realize that libgpuarray 0.7.x is API- and ABI-incompatible with prior versions, so I expect that I will need to make some changes to my GPU Op code. However, this typically manifests as a compile error, where Theano points me to a C file that it has generated and is trying to compile, and the output usually includes some helpful compiler error message.

Has anyone seen this NVRTC compilation error before and/or have any advice for troubleshooting? I will continue looking through the Theano documentation and code to see if I can find a solution. So far, Google searches have not been very helpful. There is one thread on this at https://github.com/Theano/Theano/issues/5943, which turned out to be caused by old GPU hardware using an unsupported version of cuDNN.

For further background: I’m running this on Ubuntu 16.04, Python 2.7, NumPy 1.13, CUDA 8, and cuDNN 5.1. Updating to CUDA 9 / cuDNN 7 produces the same error. Reverting to the Theano checkout cited above allows me to compile and run the GPU Ops again, though I have only tried this with CUDA 8 and not with the newer release. The GPU in question is a GTX 1060, though I can reproduce the problem on another machine with a Titan X Pascal.

Any thoughts on this would be greatly appreciated. Thanks for your help.

Sincerely,
Aaron

Pascal Lamblin

Nov 15, 2017, 11:19:27 PM

to thean…@googlegroups.com

If you have the option to check out libgpuarray from GitHub and recompile it with -DCMAKE_BUILD_TYPE=Debug, then you will get the code for the CUDA kernel as well as the error message from NVRTC.

HTH,




Pascal Lamblin
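
As a related aside, when NVRTC is driven directly rather than through libgpuarray, the compile log can be fetched without a Debug rebuild. A hedged sketch with the cuda-python bindings used elsewhere on this page (the kernel error is deliberate):

from cuda import nvrtc

src = b'extern "C" __global__ void bad() { undefined_symbol(); }'
err, prog = nvrtc.nvrtcCreateProgram(src, b"bad.cu", 0, [], [])
err, = nvrtc.nvrtcCompileProgram(prog, 0, [])
if err == nvrtc.nvrtcResult.NVRTC_ERROR_COMPILATION:
    err, log_size = nvrtc.nvrtcGetProgramLogSize(prog)
    log = b" " * log_size
    nvrtc.nvrtcGetProgramLog(prog, log)  # fills the buffer with the log
    print(log.decode())                  # e.g. 'identifier ... is undefined'
nvrtc.nvrtcDestroyProgram(prog)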

aaronbgr…@gmail.com

Nov 22, 2017, 10:24:15 PM

to theano-dev

Hi,

I figured it out: I had nested if statements in my GPU kernel, which for some reason caused the error. The outer if statement is there so I can have a total execution grid size that is (slightly) larger than the number of elements I want to compute, and the nested if statement was a "mode" switch that makes the function do different things depending on the mode (it’s a pooling function; think max pooling vs. sum/mean pooling).

I now have a GPU kernel for each mode, and it runs fine.

Best,
Aaron
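
A hypothetical reconstruction of that fix (kernel names and the pooling shapes are invented for illustration), written as CUDA source strings in the style of the kernels earlier on this page: one kernel per mode, each keeping only the outer bounds check.

pool_max_src = r'''
extern "C" __global__ void pool_max(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;               // grid may be slightly oversized
    y[i] = fmaxf(x[2 * i], x[2 * i + 1]);
}
'''

pool_sum_src = r'''
extern "C" __global__ void pool_sum(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    y[i] = x[2 * i] + x[2 * i + 1];   // sum pooling gets its own kernel
}
'''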
