[cfarm-users] cfarm109 GPU access problem
Thomas Schwinge
tschwinge at baylibre.com
Thu Mar 12 00:44:25 CET 2026
Hi Zach!
On 2026-03-11T16:14:56-0500, Zach van Rijn <me at zv.io> wrote:
> On Wed, 2026-03-11 at 16:36 +0100, Thomas Schwinge wrote:
>> Is there some (permissions?) problem on cfarm109? With
>> 'nvidia-smi', there are some errors, but the GPU shows up:
>>
>> $ nvidia-smi
>> NvRmMemInitNvmap failed: error Permission denied
>> NvRmMemMgrInit failed: Memory Manager Not supported, line 333
>> NvRmMemMgrInit failed: error type 196626
>> libnvrm_gpu.so: NvRmGpuLibOpen failed, error=196625
>> NvRmMemInitNvmap failed: error Permission denied
>> NvRmMemMgrInit failed: Memory Manager Not supported, line 333
>> NvRmMemMgrInit failed: error type 196626
>> libnvrm_gpu.so: NvRmGpuLibOpen failed, error=196625
>> Wed Mar 11 07:19:17 2026
Specifically, per 'strace', I see:
[...]
openat(AT_FDCWD, "/dev/nvmap", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)
write(2, "NvRmMemInitNvmap failed: error P"..., 49) = 49
[...]
So it fails to open '/dev/nvmap':
$ ls -alF /dev/nvmap
cr--r----- 1 root video 10, 123 Jan 1 1970 /dev/nvmap
I'm not in group "video" -- but also not sure if that's really the
expected permissions setup for this file.
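A quick way to cross-check that is to compare the device's owning group against the current user's groups (a generic sketch; '/dev/nvmap' is the node from the 'ls' output above):

```shell
# Report the device node's mode/owner/group and whether the current
# user is a member of the owning group ('/dev/nvmap' from above).
dev=/dev/nvmap
if [ -e "$dev" ]; then
    stat -c 'mode=%a owner=%U group=%G' "$dev"
    grp=$(stat -c %G "$dev")
    if id -nG | tr ' ' '\n' | grep -qx "$grp"; then
        echo "current user IS in group '$grp'"
    else
        echo "current user is NOT in group '$grp'"
    fi
else
    echo "$dev does not exist"
fi
```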
> Indeed it is a permissions error (I only tested as root
Uh... ;-P
Before we analyze/experiment any further, please try the following.
Reboot the system. Does '/dev/nvmap' already exist (probably not?); if
yes, what are its permissions? Now run, for example, 'nvidia-smi' as
non-root (!) user. Does '/dev/nvmap' now exist (probably yes); if yes,
what are its permissions?
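In shell terms, the sequence I'd like to see is roughly this (a sketch; it degrades to informational messages where the driver isn't installed):

```shell
# After a fresh reboot: check for device nodes before any GPU access,
# trigger one access as a NON-root user, then check again.
ls -l /dev/nvmap /dev/nvidia* 2>/dev/null || echo "no device nodes yet"
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi >/dev/null 2>&1 || true   # first GPU access (non-root!)
else
    echo "nvidia-smi not installed here"
fi
ls -l /dev/nvmap /dev/nvidia* 2>/dev/null || echo "still no device nodes"
```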
If I remember correctly, long ago (with "classic" GPU card/drivers), I
once had a similar issue, where 'nvidia-modprobe' (?) wouldn't set up
the '/dev/nvidia*' permissions correctly, depending on whether the first
access (triggering the load of the 'nvidia' etc. modules) happened as
root or as a non-root user.
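If it turns out to be the same class of problem, a crude interim workaround would be something like the following (as root; group "video" and mode 0660 are assumptions based on the 'ls' output above, and this only papers over the symptom until the udev/modprobe setup is fixed):

```shell
# Interim workaround sketch: make /dev/nvmap group-accessible for
# "video". Run as root. This is symptom relief, not the proper fix.
dev=/dev/nvmap
if [ -e "$dev" ]; then
    chgrp video "$dev" && chmod 0660 "$dev"
    stat -c 'mode=%a group=%G' "$dev"
else
    echo "$dev not present; nothing to adjust"
fi
```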
Regards
Thomas
> and then
> assumed that it worked the same as with cfarm107/cfarm108, but I
> can confirm non-root cfarm109 users will have CUDA failures):
>
>
> spark (cfarm107/cfarm108)
> -----
>
> zv at cfarm107:~/311$ ./llama.cpp/build-cuda/bin/llama-cli -s 0 -c 0 -m models/GLM-4.7-Flash-UD-Q8_K_XL.gguf -p "print the date then exit"
> ggml_cuda_init: found 1 CUDA devices (Total VRAM: 122566 MiB):
>   Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 122566 MiB (73237 MiB free)
>
> Loading model...
>
> [ Prompt: 105.2 t/s | Generation: 37.7 t/s ]
>
>
> thor (cfarm109)
> ----
>
> zv at cfarm109:~/311$ ./llama.cpp/build-cuda/bin/llama-cli -s 0 -c 0 -m models/GLM-4.7-Flash-UD-Q8_K_XL.gguf
> NvRmMemInitNvmap failed: error Permission denied
> NvRmMemMgrInit failed: Memory Manager Not supported, line 333
> NvRmMemMgrInit failed: error type 196626
> ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected
>
> Loading model...
>
> [ Prompt: 15.4 t/s | Generation: 7.1 t/s ]
>
> root at cfarm109:/home/zv/311# ./llama.cpp/build-cuda/bin/llama-cli -s 0 -c 0 -m models/GLM-4.7-Flash-UD-Q8_K_XL.gguf
> ggml_cuda_init: found 1 CUDA devices (Total VRAM: 125771 MiB):
>   Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes, VRAM: 125771 MiB (84135 MiB free)
>
> Loading model...
>
> [ Prompt: 17.5 t/s | Generation: 25.3 t/s ]
>
>
> analysis
> --------
>
> The 7.1 t/s is anomalous (GPU not utilized). The 25-38 t/s is
> typical for Thor and Spark respectively. Potential fixes?
>
> Upon first glance this may be a known issue on Jetson boards
> (ignore any Docker-specific remarks; the issue is unrelated):
>
> https://forums.developer.nvidia.com/t/gpu-driver-access-failure-on-isaac-ros-4-0-on-jetson-thor/359506/13
>
> https://forums.developer.nvidia.com/t/nvrmmeminitnvmap-failed-with-permission-denied/313501
>
> https://forums.developer.nvidia.com/t/nvrmmeminitnvmap-failed-with-permission-denied-error-when-running-nvidia-docker-in-rootless-mode-on-jetson-orin-nano/319532
>
> The /etc/udev/rules.d/99-tegra-devices.rules file on-machine is:
>
> https://termbin.com/row5
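If that rules file turns out not to cover 'nvmap', a hypothetical rule along these lines could be tried (group and mode are guesses based on the current device ownership, not a verified fix):

```
# Hypothetical addition to /etc/udev/rules.d/99-tegra-devices.rules:
# grant the "video" group read/write access to the nvmap node.
KERNEL=="nvmap", GROUP="video", MODE="0660"
```

After editing, the rules would need reloading ('udevadm control --reload-rules && udevadm trigger') or a reboot.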
>
> I am happy to share any other info/test results I can.
>
>
> memory
> ------
>
> (Not a permissions problem but worth saying for completeness).
>
> Having conducted the same test on both Spark and Thor, the Spark
> is using a few GB of memory while the Thor is using 40GB+ even
> after the test completed. This is re. my email from 2026-03-07:
>
> "[cfarm-users] cfarm109 memory monitoring"
>
> To free the memory (temporary workaround):
>
> $ cfarm-drop-caches
>
> Now the system is back to normal memory usage.
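For reference, 'cfarm-drop-caches' is presumably a thin wrapper around the kernel's standard page-cache drop; a sketch of what such a helper typically does (an assumption; the actual script may do more):

```shell
# Presumed core of a drop-caches helper (assumption; needs root).
if [ -w /proc/sys/vm/drop_caches ]; then
    sync                                  # flush dirty pages first
    echo 3 > /proc/sys/vm/drop_caches 2>/dev/null || echo "write failed"
else
    echo "need root to write /proc/sys/vm/drop_caches"
fi
```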
>
>
> groups
> ------
>
> The comments in the above forum posts suggest adding users to
> the 'video' group. That's easy enough to test. Fewer warnings:
>
> zv at cfarm109:~/311$ ./llama.cpp/build-cuda/bin/llama-cli -s 0 -c 0 -m models/GLM-4.7-Flash-UD-Q8_K_XL.gguf
> ggml_cuda_init: failed to initialize CUDA: operation not supported
>
> Loading model...
>
> however there is still no GPU utilization. No CUDA devices.
>
> To test your example of 'nvidia-smi' as non-root with video,
> there are no warnings, so that is a plus, but not a solution:
>
> zv at cfarm109:~$ nvidia-smi
> Wed Mar 11 16:07:56 2026
> ...
>
> Building llama.cpp as normal user (even as member of 'video'):
>
> ...
> -- Detecting CUDA compile features - done
> -- Using CMAKE_CUDA_ARCHITECTURES=native
> CMAKE_CUDA_ARCHITECTURES_NATIVE=No CUDA devices found.-real
> -- CUDA host compiler is GNU 13.3.0
> ...
>
> Building llama.cpp as root:
>
> ...
> -- Detecting CUDA compile features - done
> -- Using CMAKE_CUDA_ARCHITECTURES=110-real
> CMAKE_CUDA_ARCHITECTURES_NATIVE=110-real
> -- CUDA host compiler is GNU 13.3.0
> ...
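As a possible workaround for the failing native detection (an untested sketch: GGML_CUDA is llama.cpp's current CMake switch, and "110-real" is taken from the root build log above):

```shell
# Sketch: pin the CUDA architecture explicitly instead of relying on
# the broken "native" detection. Guarded so it only configures when
# run from a llama.cpp checkout with cmake available.
if [ -f CMakeLists.txt ] && command -v cmake >/dev/null 2>&1; then
    cmake -B build-cuda -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=110-real
else
    echo "run this from a llama.cpp checkout with cmake installed"
fi
```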
>
>
> jetpack
> -------
>
> When I installed JetPack on cfarm109 I had to do some manual
> steps (known issue); not likely related but worth mentioning:
>
> root at cfarm109:~# apt-get install nvidia-jetpack
> Reading package lists... Done
> Building dependency tree... Done
> Reading state information... Done
> Some packages could not be installed. This may mean that you have
> requested an impossible situation or if you are using the unstable
> distribution that some required packages have not yet been created
> or been moved out of Incoming.
> The following information may help to resolve the situation:
>
> The following packages have unmet dependencies:
>  nvidia-container : Depends: nvidia-container-toolkit-base (= 1.18.0-1) but 1.18.1-1 is to be installed
>                     Depends: libnvidia-container-tools (= 1.18.0-1) but 1.18.1-1 is to be installed
>                     Depends: nvidia-container-toolkit (= 1.18.0-1) but 1.18.1-1 is to be installed
>                     Depends: libnvidia-container1 (= 1.18.0-1) but 1.18.1-1 is to be installed
> E: Unable to correct problems, you have held broken packages.
>
> https://forums.developer.nvidia.com/t/jetpack-7-1-apt-install-issue-nvidia-container/357136
>
> v=1.18.0-1
> apt-get install -y \
> nvidia-container-toolkit=${v} \
> nvidia-container-toolkit-base=${v} \
> libnvidia-container-tools=${v} \
> libnvidia-container1=${v} \
> --allow-downgrades \
> ;
> apt-get install nvidia-jetpack
>
>
> This is the only deviation from the stock 38.4.0 setup, and all
> post-installation steps mirror those of cfarm107/cfarm108.
>
> The JetPack issue should eventually be fixed upstream.
>
>
> conclusion
> ----------
>
> At the moment I don't know of a trivial workaround. Obviously it
> is not possible to give all users root access.
>
> I'm glad that cfarm107/cfarm108 work as expected and can serve
> as a reference for comparison to cfarm109. Maybe auditing all
> the relevant files, devices, and configurations can clue us in.
>
> For now, please use those machines for GPU testing. Note that
> cfarm108 is going to be temporarily unstable while I investigate
> rootless Docker with GPU support and may be rebooted any time.
>
> I hope the fix for the cfarm109 permissions issue will be trivial
> to implement, but first it has to be found.
>
> I'm happy to look into it but I'd greatly appreciate some extra
> eyes since I am extremely busy with non-computer commitments and
> it may take me a few weeks to get back to normal availability.
>
>
> Zach