I’m working on an Arch Linux based image for the Apalis TK1.
The system boots, modules load, firmware appears to load, and all seems good. However, after boot, the first run of any of the CUDA samples takes a very long time, with no CPU activity in the meantime. I’ve identified the culprit: cuInit takes about 4 minutes to return, but after the first run it performs the same as on the Ubuntu JetPack image. I have run out of ideas trying to figure out what might be missing on the Arch system to initialize the GPU.
strace shows the time is spent in ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x41, 0x2, 0x18)) = 0, whereas on Ubuntu the call is decoded a bit differently: ioctl(4, AGPIOC_RELEASE or APM_IOC_SUSPEND or SNDRV_PCM_IOCTL_TSTAMP, 0xbec70ef0) = 0
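For reference, the raw request number behind that _IOC(...) decoding can be computed from the standard Linux _IOC bit layout (direction in bits 30-31, size in bits 16-29, type/magic in bits 8-15, number in bits 0-7). This is plain arithmetic, not specific to any driver; which driver owns the magic byte 0x41 ('A') is a separate question:

```shell
# Decode _IOC(_IOC_READ|_IOC_WRITE, 0x41, 0x2, 0x18) into the raw
# ioctl request number, using the standard Linux _IOC bit layout:
#   dir (2 bits) << 30 | size (14 bits) << 16 | type (8 bits) << 8 | nr (8 bits)
dir=$(( 2 | 1 ))   # _IOC_READ=2, _IOC_WRITE=1
typ=0x41           # magic byte 'A'
nr=0x2
size=0x18
printf '0x%X\n' $(( (dir << 30) | (size << 16) | (typ << 8) | nr ))
# prints 0xC0184102
```

Grepping the Tegra kernel headers for that magic byte should identify which driver (and which command number 0x2) the process is blocked in.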
running as root makes no difference, so it’s not a permissions issue
ls -l /dev/nv* shows the same device handles on both systems (arch/ubuntu)
I made an Arch package for the kernel and headers, and also created a collection of packages for CUDA Toolkit 6.5 from the .deb packages.
Once the system is completely functional, it seems easier to control which software packages go into it, and much easier than the BitBake route.
easier access to more recent software versions (except those bound to the hardware)
maintaining access to old package versions if needed, through Arch’s ABS system and ARM (Arch Rollback Machine)
The hardware is in the post; I have both Apalis TK1 V1.1 and V1.2 available.
Not sure what you mean by software version? The firmware (/lib/firmware) is identical to the Ubuntu system.
Linux alarm 3.10.40-2.8.3-g877a32308600-dirty #1 SMP PREEMPT Wed Nov 21 10:04:22 UTC 2018 armv7l GNU/Linux
the "dirty" is because I modified the Makefile, adding -fno-pic to the module build (KBUILD_CFLAGS_MODULE := -DMODULE -fno-pic); without this, modules will not load.
To get USB working, in the kernel config file I have:
CONFIG_EXTRA_FIRMWARE="tegra_xusb_firmware"
CONFIG_EXTRA_FIRMWARE_DIR="firmware"
Apart from this the kernels are identical, i.e. built from the same source using the same kernel configuration.
One note: I might have an issue with the K20 firmware, but unless the K20 is involved in initializing the GPU, that shouldn’t be related.
Do you have anything custom in the ubuntu image for initialization or is all “inherited” from nvidia ?
[root@alarm alarm]# lsmod
Module Size Used by
nvhost_vi 2869 0
joydev 8119 0
apalis_tk1_k20_ts 2081 0
apalis_tk1_k20_can 6835 0
apalis_tk1_k20_adc 1532 0
gpio_apalis_tk1_k20 1956 0
apalis_tk1_k20 9401 4 apalis_tk1_k20_adc,apalis_tk1_k20_can,apalis_tk1_k20_ts,gpio_apalis_tk1_k20
[alarm@alarm ~]$ time ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GK20A"
CUDA Driver Version / Runtime Version 6.5 / 6.5
CUDA Capability Major/Minor version number: 3.2
Total amount of global memory: 1924 MBytes (2017759232 bytes)
( 1) Multiprocessors, (192) CUDA Cores/MP: 192 CUDA Cores
GPU Clock rate: 852 MHz (0.85 GHz)
Memory Clock rate: 924 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 131072 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.5, CUDA Runtime Version = 6.5, NumDevs = 1, Device0 = GK20A
Result = PASS
real 4m0.761s
user 0m0.024s
sys 0m0.110s
Second and subsequent runs give roughly:
real 0m0.070s
user 0m0.013s
sys 0m0.033s
Do you have anything custom in the ubuntu image for initialization or is all “inherited” from nvidia ?
Yeah, there are some custom scripts/daemons provided by NVIDIA running at Linux startup. Before we discuss this, it would be interesting to know whether your application runs fine on the regular Ubuntu JetPack or not. Could you try the latest release? Thanks.
On a Yocto-based image (Morty branch; linux-toradex 3.10.40-2.7.5+g6c533d3) I’m getting the exact same delay on the exact same ioctl call (ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x41, 0x2, 0x18))). Here it also takes about 4 minutes before it continues. This too is with an Apalis TK1 on an Ixora 1.1A carrier board.
Ubuntu: the application performs the same every time (it is the latest release).
Arch: cuInit takes 4 minutes the first time; meanwhile the CPU apparently does nothing (htop), tegrastats stalls until the call to cuInit completes, and the process is in D state.
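A process in D state is blocked in the kernel, so the kernel can usually say where it is sleeping. A minimal sketch for inspecting this while cuInit hangs (the pidof lookup by binary name is an assumption; /proc/&lt;pid&gt;/stack needs root and a kernel built with stack tracing):

```shell
# While cuInit hangs, inspect where the blocked process sleeps in the kernel.
pid=$(pidof deviceQuery)         # adjust for your binary name
cat /proc/$pid/wchan; echo       # symbolic wait channel (single symbol name)
cat /proc/$pid/stack             # full kernel stack trace (root only)
```

The wait channel or stack should point at the driver function the 4 minutes are spent in, which narrows things down much further than the ioctl number alone.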
Do you mean the nv and nvfb scripts ("/etc/systemd/nv.sh")? Or is there more?
I have that too, though slightly modified to make it independent of dpkg.
I think it might be related to glibc being newer than what the libraries were built for; at least valgrind/callgrind suggests there are a lot more symbol lookups than on the Ubuntu system. But I’m not sure what to do about it, or whether anything can be done. And if this is indeed the issue, shouldn’t the CPU be busy?
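One way to quantify the dynamic-linker work without valgrind is glibc’s built-in LD_DEBUG facility, which both Arch and the Ubuntu image should support. A sketch (the deviceQuery path and the output file names are assumptions):

```shell
# Count relocations and measure dynamic-linker startup time (stderr output);
# compare the numbers between the Arch and Ubuntu systems.
LD_DEBUG=statistics ./deviceQuery 2> ld-stats.txt
grep -i relocation ld-stats.txt

# Or log every individual symbol binding (very verbose):
LD_DEBUG=bindings ./deviceQuery 2> ld-bindings.txt
```

If the statistics are similar on both systems, the glibc theory is unlikely; and either way, relocation work is CPU-bound, so a process sitting in D state inside an ioctl points at the kernel driver rather than the linker.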
This means that in the Ubuntu image, NVIDIA is running some proprietary binaries/daemons which are not included in the Toradex meta layer. So we don’t know more about this issue with GPU initialization. You should check for the differences between your Arch and the NVIDIA Ubuntu system.
I had to leave the issue for some time, but hopefully I will get back to it. I spent some time trying to figure out what is different in the Ubuntu image (running bootchartd, comparing scripts, etc.) but never found any difference.
However, the Ubuntu system is running an X server while the Arch system does not. Could it be that the X server does some extra init step that mitigates the 4 min wait? The drivers for the TK1 are compiled for xorg-xserver 15, which brings some dependency challenges, so I haven’t gone down that path yet.
@haffmans do you have an x-server in your yocto-morty image ? Or did you solve it another way?
@haffmans
I have just tried disabling the X server and client on the Ubuntu system; this does not introduce any performance penalty on the first CUDA init.