Introduction
The Coprocessor Offload Infrastructure (COI) library is designed for communication between the processes on the host and the Intel® Xeon Phi™ Coprocessors. As per the COI terminology, one process is a source, and the other process is a sink. The communication channel between them is a pipeline initiated from the source to the sink. The source and sink are two binary executables compiled and built for their respective architectures. The source process is responsible for launching the coprocessor process through the COI API calls.
This blog is written to help developers analyze and debug COI errors encountered while executing applications that use the COI API to offload to Intel® Xeon Phi™ Coprocessors. It explains the methods and tools a developer can use to trace a COI error and get meaningful information about it. It does not explain how to build and run COI applications. For detailed COI API documentation and for build and run steps, refer to the COI API Reference Manual and the coi_getting_started guide respectively, both of which are included in the Intel® MPSS installation package.
By default, COI is installed in the following locations:
/usr/share/doc/intel-coi-<version> | COI API Reference Manual, COI getting started guide, and the release notes |
/usr/include | Include files required to build COI applications |
/usr/share/doc/intel-coi-<version>/tutorials | Code samples that can be helpful for learning how to write COI applications |
/usr/bin | COI tools to assist in development |
/usr/lib64 | COI shared libraries needed to build COI applications |
The rest of the blog is organized into the following sections, where each section explains a COI debugging/tracing method in detail:
Getting Error Information Using COIRESULT
Inspecting the Automatically Produced COI Log File
Trace Libraries Loaded Using SINK_LD_TRACE_LOADED_OBJECTS Environment Variable
Using coitrace to assist with debugging
Conclusion
Getting Error Information Using COIRESULT
COI uses COIRESULT values for its error reporting. The form of the error message varies depending on the function that received and checked the COIRESULT value. However, the message usually takes the form:
{function that checked for COIRESULT} with {COIRESULT mnemonic}
There are a couple of ways to check a COIRESULT value and report it, as the following two examples show (see footnote 1):
Example 1:
    #include <stdio.h>
    #include <intel-coi/source/COIProcess_source.h>
    #include <intel-coi/source/COIEngine_source.h>

    int main(void)
    {
        COIRESULT result = COI_ERROR;
        COIENGINE engine;

        /* Request the first coprocessor and report any failure by name. */
        result = COIEngineGetHandle(COI_ISA_MIC, 0, &engine);
        if (result != COI_SUCCESS)
        {
            printf("COIEngineGetHandle result %s\n", COIResultGetName(result));
            return -1;
        }
        return 0;
    }
Example 2:
    #include <stdio.h>
    #include <intel-coi/source/COIProcess_source.h>
    #include <intel-coi/source/COIEngine_source.h>

    /* Evaluates a COI call once; on failure prints the call text and the
       COIRESULT mnemonic, then returns -1 from the enclosing function. */
    #define CHECK_RESULT(_COIFUNC) \
    { \
        COIRESULT result = _COIFUNC; \
        if (result != COI_SUCCESS) \
        { \
            printf("%s returned %s\n", #_COIFUNC, COIResultGetName(result)); \
            return -1; \
        } \
    }

    int main(void)
    {
        COIENGINE engine;

        /* Every call to a COI API function can now be wrapped by CHECK_RESULT. */
        CHECK_RESULT(COIEngineGetHandle(COI_ISA_MIC, 0, &engine));
        return 0;
    }
The names and basic meanings of each possible COIRESULT value are given in the header file COIResult_common.h (default location: /usr/include/intel-coi/common). They are also listed here, with remarks on why each error might occur.
Error code | Offload Error | Remark |
0 | COI_SUCCESS | The function succeeded without error |
1 | COI_ERROR | Unspecified error |
2 | COI_NOT_INITIALIZED | The function was called before the system was initialized |
3 | COI_ALREADY_INITIALIZED | The function was called after the system was initialized |
4 | COI_ALREADY_EXISTS | Cannot complete the request due to the existence of a similar object |
5 | COI_DOES_NOT_EXIST | The specified object was not found |
6 | COI_INVALID_POINTER | One of the addresses provided was not valid |
7 | COI_OUT_OF_RANGE | One of the arguments contains a value that is invalid |
8 | COI_NOT_SUPPORTED | This function is not currently supported as used |
9 | COI_TIME_OUT_REACHED | The specified time out caused the function to abort |
10 | COI_MEMORY_OVERLAP | The source and destination range specified overlaps for the same buffer |
11 | COI_ARGUMENT_MISMATCH | The specified arguments are not compatible |
12 | COI_SIZE_MISMATCH | The specified size does not match the expected size |
13 | COI_OUT_OF_MEMORY | The function was unable to allocate the required memory |
14 | COI_INVALID_HANDLE | One of the handles provided was not valid |
15 | COI_RETRY | This function currently can't complete, but might be able to later |
16 | COI_RESOURCE_EXHAUSTED | The resource was not large enough |
17 | COI_ALREADY_LOCKED | The object was expected to be unlocked, but was locked |
18 | COI_NOT_LOCKED | The object was expected to be locked, but was unlocked |
19 | COI_MISSING_DEPENDENCY | One or more dependent components could not be found |
20 | COI_UNDEFINED_SYMBOL | One or more symbols the component required was not defined in any library |
21 | COI_PENDING | Operation is not finished |
22 | COI_BINARY_AND_HARDWARE_MISMATCH | A specified binary will not run on the specified hardware |
23 | COI_PROCESS_DIED | One of the COI processes died |
24 | COI_INVALID_FILE | The file is invalid for its intended usage in the function |
25 | COI_EVENT_CANCELED | Event wait on a user event that was unregistered or is being unregistered returns this error |
26 | COI_VERSION_MISMATCH | The version of Intel® Coprocessor Offload Infrastructure on the host is not compatible with the version on the device |
27 | COI_BAD_PORT | The port that the host is set to connect to is invalid |
28 | COI_AUTHENTICATION_FAILURE | The daemon was unable to authenticate the user that requested an engine. Only reported if daemon is set up for authorization |
29 | COI_NUM_RESULTS | Reserved, do not use |
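When only the numeric code is at hand (for example, a bare number in a log), the table above can be turned into a small lookup. The following is a minimal sketch in C. The helper name coi_result_name_from_code is hypothetical and not part of the COI API, and the name list is copied from the table rather than from the header. In code that already links against COI, COIResultGetName is the natural choice.

```c
#include <stddef.h>

/* Mnemonics in numeric order, copied from the table above.
   Hypothetical helper data: not part of the COI API. */
static const char *const coi_result_names[] = {
    "COI_SUCCESS",              "COI_ERROR",
    "COI_NOT_INITIALIZED",      "COI_ALREADY_INITIALIZED",
    "COI_ALREADY_EXISTS",       "COI_DOES_NOT_EXIST",
    "COI_INVALID_POINTER",      "COI_OUT_OF_RANGE",
    "COI_NOT_SUPPORTED",        "COI_TIME_OUT_REACHED",
    "COI_MEMORY_OVERLAP",       "COI_ARGUMENT_MISMATCH",
    "COI_SIZE_MISMATCH",        "COI_OUT_OF_MEMORY",
    "COI_INVALID_HANDLE",       "COI_RETRY",
    "COI_RESOURCE_EXHAUSTED",   "COI_ALREADY_LOCKED",
    "COI_NOT_LOCKED",           "COI_MISSING_DEPENDENCY",
    "COI_UNDEFINED_SYMBOL",     "COI_PENDING",
    "COI_BINARY_AND_HARDWARE_MISMATCH",
    "COI_PROCESS_DIED",         "COI_INVALID_FILE",
    "COI_EVENT_CANCELED",       "COI_VERSION_MISMATCH",
    "COI_BAD_PORT",             "COI_AUTHENTICATION_FAILURE",
    "COI_NUM_RESULTS"
};

/* Returns the mnemonic for a numeric code, or "UNKNOWN" when the
   code falls outside the range covered by the table. */
const char *coi_result_name_from_code(int code)
{
    size_t count = sizeof coi_result_names / sizeof coi_result_names[0];

    if (code < 0 || (size_t)code >= count) {
        return "UNKNOWN";
    }
    return coi_result_names[code];
}
```

For example, coi_result_name_from_code(19) yields the string COI_MISSING_DEPENDENCY, matching row 19 of the table.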
Inspecting the Automatically Produced COI Log File
Sometimes, having an accurate error code doesn’t necessarily make a problem clear. For example, if COIProcessCreateFromFile returns COI_MISSING_DEPENDENCY, this indicates that a dynamic library needed by the executable could not be found in the source or sink file systems. If the debug version of the COI library is used, however, there is a possibility that more information can be learned by looking at the automatically-produced log file. This file is named <executable>.coilog, where <executable> is the name of the source executable. It is located in the current directory in effect when the application was launched.
In order to use the debug version of the COI library, you will have to extract and compile the COI library from the source provided with your version of Intel® MPSS.
The steps to compile the debug version of the COI library are as follows:
- If you have not already done so, download the mpss-src-<MPSS-version>.tar file from the Intel® MPSS webpage and extract it
    tar -xf mpss-src-<MPSS-version>.tar
- Extract the MPSS COI source
    cd mpss-<MPSS-version>/src
    tar -xjf mpss-coi-<MPSS-version>.tar.bz2
- Compiling a debug version of the COI library requires that some of the metadata files are present in the /usr/include directory. If they are not already present, extract the provided mpss-metadata-<MPSS-version>.tar.bz2 source file and copy the required files
    tar -xjf mpss-metadata-<MPSS-version>.tar.bz2
    cp mpss-metadata-<MPSS-version>/mpss-metadata.c /usr/include/.
    cp mpss-metadata-<MPSS-version>/mpss-metadata.mk /usr/include/.
- From the extracted mpss-coi-<MPSS-version> directory you can compile and install the COI library as follows:
    make debug                 # builds the debug COI library in the build directory
    make debug-install-host    # installs the debug version of the COI library on the host
    make debug-install-sdk     # installs the required SDK files
- To install these new binaries and libraries on the coprocessor you will need to overwrite the card’s COI library (done manually for each coprocessor card)
    scp build/device-linux-debug/libcoi_device.so mic0:/usr/lib64/libcoi_device.so.0
    ssh mic0 "/etc/init.d/coi stop"
    scp build/device-linux-debug/coi_daemon mic0:/usr/bin/coi_daemon
    ssh mic0 "/etc/init.d/coi start"
Once the debug version of the COI library is installed, an <executable>.coilog file is created whenever the application is launched. In the event of an error, <executable>.coilog is populated with an entry like the following:
[SOURCE][0xfffffffe][3484974483003000][..\..\mechanism\proxy\uproxy_host.cpp:185][COILOG_LEVEL_ERROR][COIProxy::WorkerThread]: Error: scif_recv failed: 108
where:
[SOURCE] | refers to whether the error occurred on source (Host) or sink (Coprocessor) |
[0xfffffffe] | is the pthread ID of the reporting thread, in hexadecimal |
[3484974483003000] | refers to the timestamp of the event (tickcount) |
[..\..\mechanism\proxy\uproxy_host.cpp:185] | Source file and line number |
Error: scif_recv failed: 108 | error number corresponding to its entry in the /usr/include/asm-generic/errno*.h header files. For example, 108 corresponds to ESHUTDOWN (Cannot send after transport endpoint shutdown) |
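The field layout above lends itself to mechanical parsing, for instance when scanning a long .coilog for error numbers. Below is a minimal sketch in C that assumes every entry has exactly the seven parts shown in the sample line. The struct and the function names (coilog_entry, coilog_parse, coilog_trailing_code, coilog_code_from_line) are hypothetical and not part of COI, and the buffer sizes are arbitrary choices.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One .coilog entry, split into the fields described above.
   Hypothetical type for illustration; not part of COI. */
typedef struct {
    char origin[16];     /* SOURCE or SINK */
    char thread_id[32];  /* hexadecimal thread ID */
    char timestamp[32];  /* tick count */
    char location[256];  /* source file and line number */
    char level[32];      /* for example COILOG_LEVEL_ERROR */
    char function[64];   /* reporting function */
    char message[256];   /* free-form text after the colon */
} coilog_entry;

/* Splits one log line into its fields. Returns 1 on success, 0 otherwise.
   Each %N[^]] reads up to N characters that are not ']'. */
int coilog_parse(const char *line, coilog_entry *out)
{
    int matched = sscanf(line,
        "[%15[^]]][%31[^]]][%31[^]]][%255[^]]][%31[^]]][%63[^]]]: %255[^\n]",
        out->origin, out->thread_id, out->timestamp, out->location,
        out->level, out->function, out->message);

    return matched == 7;
}

/* Returns the number that follows the last ':' in a message,
   or -1 when the message does not end in a number. */
int coilog_trailing_code(const char *message)
{
    const char *colon = strrchr(message, ':');
    char *end = NULL;
    long value;

    if (colon == NULL) {
        return -1;
    }
    value = strtol(colon + 1, &end, 10);
    if (end == colon + 1) {
        return -1;  /* no digits after the colon */
    }
    return (int)value;
}

/* Convenience wrapper: parses a whole line and returns its trailing
   error number, or -1 when the line is not a well-formed entry. */
int coilog_code_from_line(const char *line)
{
    coilog_entry entry;

    if (!coilog_parse(line, &entry)) {
        return -1;
    }
    return coilog_trailing_code(entry.message);
}
```

Passing the returned number to the standard strerror function gives the system's description of it, which avoids looking the value up in the errno headers by hand.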
Trace Libraries Loaded Using SINK_LD_TRACE_LOADED_OBJECTS Environment Variable
If the environment variable SINK_LD_TRACE_LOADED_OBJECTS is set to a non-empty value, the behavior of the COIProcessCreate* APIs changes. Instead of creating the process, the coi_daemon prints to standard output (stdout) information about which libraries it is loading. If all the dynamic dependencies are found, the API returns COI_NOT_INITIALIZED. The COIProcess is never actually created while this environment variable is set; it is meant solely as a debugging aid.
One scenario where this is useful is when a binary was built on one system that had all the needed libraries but must run on a completely different system with different environment settings. In this case, SINK_LD_TRACE_LOADED_OBJECTS lets you verify that your environment is configured correctly before you attempt to launch your application.
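Because COI_NOT_INITIALIZED means "all dependencies found" only while the variable is set, a source program that wants to report trace runs clearly has to look at both the variable and the result. The sketch below illustrates that decision in C under stated assumptions: the type and function names are hypothetical, and the numeric value 2 for COI_NOT_INITIALIZED is taken from the table earlier in this post rather than from the COI headers, so that the example compiles without COI installed.

```c
#include <stddef.h>

/* Value of COI_NOT_INITIALIZED according to the COIRESULT table above.
   Assumption: real code would use the constant from the COI headers. */
enum { RESULT_NOT_INITIALIZED = 2 };

/* Hypothetical classification of a COIProcessCreate* outcome. */
typedef enum {
    TRACE_DISABLED,     /* variable unset or empty: normal process creation */
    TRACE_DEPS_FOUND,   /* trace mode and every dependency was located */
    TRACE_DEPS_PROBLEM  /* trace mode and some other result came back */
} trace_outcome;

/* env_value is the value of SINK_LD_TRACE_LOADED_OBJECTS (NULL if unset);
   coi_result is the numeric result of the COIProcessCreate* call. */
trace_outcome classify_trace_result(const char *env_value, int coi_result)
{
    if (env_value == NULL || env_value[0] == '\0') {
        return TRACE_DISABLED;
    }
    if (coi_result == RESULT_NOT_INITIALIZED) {
        return TRACE_DEPS_FOUND;
    }
    return TRACE_DEPS_PROBLEM;
}
```

A caller would typically pass getenv("SINK_LD_TRACE_LOADED_OBJECTS") as the first argument and the value returned by the process-creation call as the second.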
The steps for using the environment variable SINK_LD_TRACE_LOADED_OBJECTS are as follows:
- Since the information about which libraries are loaded originates from coi_daemon, its prints must be redirected to the console rather than to /dev/null (the default). To do this, restart the coi_daemon on the coprocessor as follows:
    [user@host ~] ssh mic0                     # ssh directly into the coprocessor
    [user@host-mic0 ~] /etc/init.d/coi stop    # stop coi_daemon if it is already running
    [user@host-mic0 ~] coi_daemon --= &        # restart coi_daemon with prints redirected to stdout (console)
- Now, using a different shell on the host, execute your COI application with the environment variable SINK_LD_TRACE_LOADED_OBJECTS set to a non-empty value. For example, as shown below, we can set the environment variable to 1 and run our sample COI application on the host. If no dependency is missing, we get the following output:
    [user@host release] SINK_LD_TRACE_LOADED_OBJECTS=1 ./coi_simple_source_host
    # output
    2 engines available
    Got engine handle
    COIProcessCreateFromFile( engine, SINK_NAME, 0, NULL, false, NULL, false, NULL, 0, NULL, &proc ) returned COI_NOT_INITIALIZED
- Now, if you check the console on mic0, you will see the information about the loaded libraries. The sample coi_daemon output below shows a different case: the coi_device library is missing on the device, so the coi_daemon reports a dynamic dependency check failure:
    [user@host-mic0 ~] COI_DAEMON is trying to create a process 'coi_simple_sink_mic' using the following files:
    <SOURCE>: /home/slgogar/COI_TEST/release/coi_simple_sink_mic
    <SINK>: libstdc++.so.6
    <SINK>: libm.so.6
    <SINK>: libgcc_s.so.1
    <SINK>: libc.so.6
    <FAIL>: libcoi_device.so.0
    dynamic dependency check failed on 1 libraries. COIRESULT= COI_MISSING_DEPENDENCY
    libcoi_device.so.0
    process create ending abnormally
- Once the environment settings are all verified, restart the coi_daemon on the coprocessor with its default settings as follows:
    [user@host-mic0 ~] /etc/init.d/coi stop
    [user@host-mic0 ~] /etc/init.d/coi start
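If the coi_daemon output from the steps above is captured to a file or buffer, the failed dependencies can be picked out by looking for the <FAIL> marker. The following C sketch assumes the marker text appears exactly as in the sample output ("<FAIL>: " followed by the library name). Both function names are hypothetical helpers, not part of COI.

```c
#include <stddef.h>
#include <string.h>

/* Marker that precedes each unresolved library in the sample output. */
static const char fail_marker[] = "<FAIL>: ";

/* Counts how many libraries the daemon output reports as failed. */
int count_failed_dependencies(const char *daemon_output)
{
    int count = 0;
    const char *cursor = daemon_output;

    while ((cursor = strstr(cursor, fail_marker)) != NULL) {
        count++;
        cursor += sizeof fail_marker - 1;  /* continue after this marker */
    }
    return count;
}

/* Returns a pointer to the first failed library name inside the output,
   or NULL when no failure is reported. The name runs until the next
   whitespace character; the caller decides how much of it to copy. */
const char *first_failed_dependency(const char *daemon_output)
{
    const char *hit = strstr(daemon_output, fail_marker);

    if (hit == NULL) {
        return NULL;
    }
    return hit + sizeof fail_marker - 1;
}
```

The length of the name returned by first_failed_dependency can be obtained with strcspn(name, " \t\r\n") before copying or printing it.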
Using coitrace to assist with debugging
Included in the installation package is a tool called coitrace. This trace utility operates similarly to Unix*-style tools like strace* and shows all of the COI API invocations along with their input parameters. This helps you see which COI calls an application actually makes while you are debugging it. To see a complete list of options, run
coitrace -h
To use coitrace, simply execute your program through coitrace. For example, without coitrace the hello_world sample executes as follows:
    [user@hostname release]# ./hello_world_source_host
    2 engines available
    Got engine handle
    Sink process created, press enter to destroy it.
    Hello from the sink!
    Sink process returned 0
    Sink exit reason SHUTDOWN OK
When the same hello_world sample is run through coitrace, the tool prints additional information for each COI function call, such as the function arguments, the thread ID, and the returned values:
    [user@hostname release]$ coitrace ./hello_world_source_host
    COIEngineGetCount [ThID:0x7fbc167d5740]
        in_ISA = COI_ISA_MIC
        out_pNumEngines = 0x7fff963b8698 0x00000002 (hex) : 2 (dec)
    2 engines available
    COIEngineGetHandle [ThID:0x7fbc167d5740]
        in_ISA = COI_ISA_MIC
        in_EngineIndex = 0x00000000 (hex) : 0 (dec)
        out_pEngineHandle = 0x7fff963b8680 0x7fbc16a73d60
    Got engine handle
    COIProcessCreateFromMemory [ThID:0x7fbc167d5740]
        in_Engine = 0x7fbc16a73d60
        in_pBinaryName = hello_world_sink_mic
        in_pBinaryBuffer = 0x7fbc167ec000
        in_BinaryBufferLength = 0x000000000000288f (hex) : 10383 (dec)
        in_Argc = 0
        in_ppArgv = 0
        (bool) in_DupEnv = false
        in_ppAdditionalEnv = 0
        (bool) in_ProxyActive = true
        in_Reserved = (nil)
        in_BufferSpace = 0x0000000000000000 (hex) : 0 (dec)
        in_LibrarySearchPath = (nil)
        in_FileOfOrigin = hello_world_sink_mic
        in_FileOfOriginOffset = 0x0000000000000000 (hex) : 0 (dec)
        out_pProcess = 0x7fff963b8688 0x1802a60
    COIProcessCreateFromFile [ThID:0x7fbc167d5740]
        in_Engine = 0x7fbc16a73d60
        in_pBinaryName = hello_world_sink_mic
        in_Argc = 0
        in_ppArgv = 0
        (bool) in_DupEnv = false
        in_ppAdditionalEnv = 0
        (bool) in_ProxyActive = true
        in_Reserved = (nil)
        in_BufferSpace = 0x0000000000000000 (hex) : 0 (dec)
        in_LibrarySearchPath = (nil)
        out_pProcess = 0x7fff963b8688 0x1802a60
    Sink process created, press enter to destroy it.
    Hello from the sink!
    COIProcessDestroy [ThID:0x7fbc167d5740]
        in_Process = 0x1802a60
        in_WaitForMainTimeout = -1
        (bool) in_ForceDestroy = false
        out_pProcessReturn = 0x7fff963b869f
    Sink process returned 0
    Sink exit reason SHUTDOWN OK
Conclusion
At this point you should have a better understanding of how to analyze and debug COI API errors. Depending on the complexity of your application, you might have to use several of these methods and tools in combination to track down a COI API error. Moreover, when the COI application is correctly linked with the debug version of the COI library, debuggers (like GDB) can read the debug symbols and provide useful information relevant to the error.
1 The code snippets are extracted from the COI tutorials (sample examples) provided with the Intel® MPSS installation. By default, the Intel® MPSS installation copies the sample programs to the /usr/share/doc/intel-coi-<version>/tutorials directory.
Other Related References
https://software.intel.com/en-us/articles/debugging-intel-xeon-phi-applications-on-linux-host
https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss
