Arch Linux on Apalis TK1: GPU initialization is slow

Hi

I’m working on an Arch Linux based image for the Apalis TK1.
The system boots, modules load and the firmware appears to load, so all seems good. However, after boot, the first run of any of the CUDA samples takes a very long time, with no CPU activity in the meantime. I’ve identified the culprit: “cuInit” takes about 4 minutes to return. After that first run, performance is the same as on the Ubuntu JetPack image. I have run out of ideas about what might be missing on the Arch system to initialize the GPU.

strace shows the time is spent in ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x41, 0x2, 0x18) = 0, whereas on Ubuntu the call looks a bit different: ioctl(4, AGPIOC_RELEASE or APM_IOC_SUSPEND or SNDRV_PCM_IOCTL_TSTAMP, 0xbec70ef0) = 0
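A note on reading the two strace lines. Below is a minimal sketch that packs the fields strace printed on Arch into a request number, assuming the generic Linux ioctl layout (8-bit nr, 8-bit type, 14-bit size, 2-bit direction), which is what ARM uses. The helper names ioc and ioc_decode are made up for illustration. Type 0x41 is the character 'A' and nr is 2. Older strace versions appear to match ioctl names on type and nr alone, which would explain the "AGPIOC_RELEASE or APM_IOC_SUSPEND or ..." label, so both traces may well show the same request decoded by different strace versions. That is an assumption, not something confirmed here.

```python
# Generic Linux ioctl number layout (asm-generic/ioctl.h), assumed for ARM.
IOC_NRBITS, IOC_TYPEBITS, IOC_SIZEBITS = 8, 8, 14
IOC_NRSHIFT = 0
IOC_TYPESHIFT = IOC_NRSHIFT + IOC_NRBITS      # 8
IOC_SIZESHIFT = IOC_TYPESHIFT + IOC_TYPEBITS  # 16
IOC_DIRSHIFT = IOC_SIZESHIFT + IOC_SIZEBITS   # 30
IOC_WRITE, IOC_READ = 1, 2


def ioc(direction, type_, nr, size):
    """Pack direction/type/nr/size into an ioctl request number (hypothetical helper)."""
    return (
        (direction << IOC_DIRSHIFT)
        | (size << IOC_SIZESHIFT)
        | (type_ << IOC_TYPESHIFT)
        | (nr << IOC_NRSHIFT)
    )


def ioc_decode(request):
    """Split a request number back into its four fields (hypothetical helper)."""
    return {
        "dir": (request >> IOC_DIRSHIFT) & 0x3,
        "size": (request >> IOC_SIZESHIFT) & ((1 << IOC_SIZEBITS) - 1),
        "type": (request >> IOC_TYPESHIFT) & 0xFF,
        "nr": (request >> IOC_NRSHIFT) & 0xFF,
    }


# Fields exactly as printed by strace on the Arch system.
arch_request = ioc(IOC_READ | IOC_WRITE, 0x41, 0x2, 0x18)
print(hex(arch_request), ioc_decode(arch_request), chr(0x41))
```

Comparing the raw request numbers on both systems (for example by running the same strace binary on each) would settle whether the calls really differ.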

  • running as root makes no difference, so it's not a permissions issue
  • ls -l /dev/nv* shows the same device handles on both systems (arch/ubuntu)
  • tegra init script is running during boot
  • dmesg shows similar output

Hi @Arhcie

Could you provide the version of the Hardware and Software of your module? How did you create your Arch-Linux based Image?

Why do you want to use Arch Linux and not Ubuntu?

Best regards, Jaski

I used information from this to create the basic image:
Jetson/Porting Arch - eLinux.org
and
https://developer.toradex.com/knowledge-base/apalis-tk1-mainline-with-nouveau

I used this:

to make an Arch package for the kernel and headers. I also created a collection of packages for cuda-toolkit 6.5 from the deb packages.

Once I have a completely functional system, it seems easier to control which software packages should be in the system, and much easier than the BitBake route.

  • easier access to more recent software versions (except those bound by hardware)
  • maintaining access to old package versions if needed, through Arch ABS system and ARM (Arch roll back machine)

hardware is in the post, I have both apalis-tk1 v1.1 and v1.2 available
not sure what you mean by software version? firmware (lib/firmware) is identical to the ubuntu system

hi, Thanks for this Information.

not sure what you mean by software version? firmware (lib/firmware) is identical to the ubuntu system

I meant the software version installed on the module ( uname -a ).

There is a similar issue with cuda on a different module.
It would be best to write to NVIDIA Support.

Linux alarm 3.10.40-2.8.3-g877a32308600-dirty #1 SMP PREEMPT Wed Nov 21 10:04:22 UTC 2018 armv7l GNU/Linux

The "dirty" is because I modified the Makefile, adding -fno-pic to the module build (KBUILD_CFLAGS_MODULE := -DMODULE -fno-pic); without this, modules will not load.

To get USB working, I have this in the config file:
CONFIG_EXTRA_FIRMWARE="tegra_xusb_firmware"
CONFIG_EXTRA_FIRMWARE_DIR="firmware"

Apart from this, the kernels are identical, i.e. built from the same source using the same kernel configuration.

  • one note, I might have an issue with the K20 firmware, but unless the K20 is involved in initializing the GPU that shouldn’t be related.

Do you have anything custom in the ubuntu image for initialization or is all “inherited” from nvidia ?

[root@alarm alarm]# lsmod
Module                  Size  Used by
nvhost_vi               2869  0
joydev                  8119  0
apalis_tk1_k20_ts       2081  0
apalis_tk1_k20_can      6835  0
apalis_tk1_k20_adc      1532  0
gpio_apalis_tk1_k20     1956  0
apalis_tk1_k20          9401  4 apalis_tk1_k20_adc,apalis_tk1_k20_can,apalis_tk1_k20_ts,gpio_apalis_tk1_k20

[alarm@alarm ~]$ time ./deviceQuery 
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)
Device 0: "GK20A"
  CUDA Driver Version / Runtime Version          6.5 / 6.5
  CUDA Capability Major/Minor version number:    3.2
  Total amount of global memory:                 1924 MBytes (2017759232 bytes)
  ( 1) Multiprocessors, (192) CUDA Cores/MP:     192 CUDA Cores
  GPU Clock rate:                                852 MHz (0.85 GHz)
  Memory Clock rate:                             924 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 131072 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.5, CUDA Runtime Version = 6.5, NumDevs = 1, Device0 = GK20A
Result = PASS

real    4m0.761s
user    0m0.024s
sys     0m0.110s

The second and subsequent runs give approximately:
real 0m0.070s
user 0m0.013s
sys 0m0.033s
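The timings above show the pattern that matters: almost all of the wall-clock time is spent waiting rather than computing. Here is a rough sketch of how that could be quantified for any call, using only the Python standard library. The name blocked_fraction is made up for illustration, and time.sleep stands in for the slow call, since it also blocks without using CPU.

```python
import time


def blocked_fraction(fn):
    """Run fn() and return (wall_seconds, cpu_seconds, fraction of wall time off-CPU).

    time.process_time() counts user + sys CPU time of this process only,
    so work done in child processes is not included.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    fraction = 1.0 - cpu / wall if wall > 0 else 0.0
    return wall, cpu, max(0.0, fraction)


# Stand-in for the slow call: sleeping blocks without consuming CPU time.
wall, cpu, frac = blocked_fraction(lambda: time.sleep(0.2))
print(f"wall={wall:.3f}s cpu={cpu:.3f}s blocked={frac:.1%}")

# The same ratio for the first deviceQuery run reported above
# (real 4m0.761s, user 0.024s, sys 0.110s).
real, user, sys_ = 4 * 60 + 0.761, 0.024, 0.110
first_run_blocked = 1.0 - (user + sys_) / real
print(f"first deviceQuery run blocked: {first_run_blocked:.2%}")
```

A result this close to 100% points at a wait inside the kernel or driver rather than at extra work in user space.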

Do you have anything custom in the ubuntu image for initialization or is all “inherited” from nvidia ?

Yes, there are some custom scripts/daemons provided by NVIDIA that run at Linux startup. Before we discuss this, it would be interesting to know whether your application runs fine on the regular Ubuntu JetPack. Could you try the latest release? Thanks.

On a Yocto-based image (Morty branch; linux-toradex 3.10.40-2.7.5+g6c533d3) I’m getting the exact same delay on the exact same ioctl call ( ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x41, 0x2, 0x18) ). Here it also takes about 4 minutes before it continues. This too is with an Apalis TK1 on an Ixora 1.1A carrier board.

Any solution would be greatly appreciated.

Yes - that's the whole point

  • Ubuntu: the application performs the same every time (it is the latest release)
  • Arch: cuInit takes 4 minutes the first time; meanwhile the CPU apparently does nothing (htop), tegrastats stalls until the call to cuInit completes, and the process is in D state
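The D state mentioned above (uninterruptible sleep) can also be read directly from /proc while the call hangs. The sketch below parses the state field of a /proc/<pid>/stat line. The function names are made up for illustration, and the sample line is a hypothetical example, not output captured from the TK1.

```python
import os


def parse_stat(stat_line):
    """Return (comm, state) from the contents of /proc/<pid>/stat.

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split on the last ')' instead of splitting on whitespace.
    """
    head, _, tail = stat_line.rpartition(")")
    comm = head.split("(", 1)[1]
    state = tail.split()[0]
    return comm, state


def process_state(pid):
    """Read and parse /proc/<pid>/stat for a running process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_stat(f.read())


# Hypothetical stat line for a process stuck in uninterruptible sleep.
sample = "1234 (deviceQuery) D 1 1234 1234 0 -1 4194304 100 0 0 0 2 11 0 0 20 0 1 0"
print(parse_stat(sample))

# On a Linux system, inspect the current process as a quick check.
if os.path.exists("/proc/self/stat"):
    print(process_state("self"))
```

If the kernel exposes it, /proc/<pid>/wchan may additionally name the kernel function the task is sleeping in, which could narrow down where the ioctl is waiting.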

Do you mean the nv and nvfb scripts (“/etc/systemd/nv.sh”), or is there more?

I have that too, though slightly different to make it independent of dpkg.

I think it might be related to glibc being newer than what the libraries were built for; at least valgrind/callgrind suggests there are a lot more symbol lookups than on the Ubuntu system. I’m not sure what to do about it, or whether anything can be done. And if this is indeed the issue, shouldn’t the CPU be busy?

This means that in the Ubuntu image, NVIDIA is running some proprietary binaries/daemons which are not included in the Toradex meta layer, so we don’t know more about this GPU initialization issue. You should check for differences between your Arch system and the NVIDIA Ubuntu system.

I had to leave the issue for some time, but hopefully I will get back to it. I spent some time trying to figure out what is different in the Ubuntu image, running bootchartd, comparing scripts etc., but never found any difference.

However, the Ubuntu system is running an X server and the Arch system is not. Could it be that the X server does some extra init step that avoids the 4 minute wait? The drivers for the TK1 are compiled for xorg-xserver 15, which gives some dependency challenges, so I haven’t gone down that path yet.

@haffmans do you have an x-server in your yocto-morty image ? Or did you solve it another way?

@Arhcie I have not yet solved the problem. The Yocto image also does not run any X server, but we might be able to give that a try soon-ish.

@haffmans
I have just tried disabling the X server and client on the Ubuntu system - this does not introduce any performance penalty on the first CUDA init.