[cfarm-users] cfarm109 GPU access problem (was: cfarm109 additional details)

Zach van Rijn me at zv.io
Wed Mar 11 22:14:56 CET 2026


On Wed, 2026-03-11 at 16:36 +0100, Thomas Schwinge wrote:
> Hi!
> 
> ...
> 
> 
> Is there some (permissions?) problem on cfarm109?  With
> 'nvidia-smi', there are some errors, but the GPU shows up:
> 
>     $ nvidia-smi
>     NvRmMemInitNvmap failed: error Permission denied
>     NvRmMemMgrInit failed: Memory Manager Not supported, line
> 333
>     NvRmMemMgrInit failed: error type 196626
>     libnvrm_gpu.so: NvRmGpuLibOpen failed, error=196625
>     NvRmMemInitNvmap failed: error Permission denied
>     NvRmMemMgrInit failed: Memory Manager Not supported, line
> 333
>     NvRmMemMgrInit failed: error type 196626
>     libnvrm_gpu.so: NvRmGpuLibOpen failed, error=196625
>     Wed Mar 11 07:19:17 2026    

Indeed it is a permissions error (I had only tested as root and
assumed it behaved the same as cfarm107/cfarm108, but I can
confirm that non-root cfarm109 users hit CUDA failures):


spark (cfarm107/cfarm108)
-----

zv at cfarm107:~/311$ ./llama.cpp/build-cuda/bin/llama-cli -s 0 -c 0 -m models/GLM-4.7-Flash-UD-Q8_K_XL.gguf -p "print the date then exit"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 122566 MiB):
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 122566 MiB (73237 MiB free)

Loading model...

[ Prompt: 105.2 t/s | Generation: 37.7 t/s ]


thor (cfarm109)
----

zv at cfarm109:~/311$ ./llama.cpp/build-cuda/bin/llama-cli -s 0 -c 0 -m models/GLM-4.7-Flash-UD-Q8_K_XL.gguf
NvRmMemInitNvmap failed: error Permission denied
NvRmMemMgrInit failed: Memory Manager Not supported, line 333
NvRmMemMgrInit failed: error type 196626
ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected
                                                                    
Loading model...

[ Prompt: 15.4 t/s | Generation: 7.1 t/s ]

root at cfarm109:/home/zv/311# ./llama.cpp/build-cuda/bin/llama-cli -s 0 -c 0 -m models/GLM-4.7-Flash-UD-Q8_K_XL.gguf
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 125771 MiB):
  Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes, VRAM: 125771 MiB (84135 MiB free)

Loading model...

[ Prompt: 17.5 t/s | Generation: 25.3 t/s ]


analysis
--------

The 7.1 t/s is anomalous (the GPU is not being utilized). The
25.3 t/s and 37.7 t/s figures are typical for Thor and Spark,
respectively. Potential fixes?

At first glance this looks like a known issue on Jetson boards
(ignore the Docker-specific remarks in these threads; our issue
is unrelated to Docker):

https://forums.developer.nvidia.com/t/gpu-driver-access-failure-on-isaac-ros-4-0-on-jetson-thor/359506/13

https://forums.developer.nvidia.com/t/nvrmmeminitnvmap-failed-with-permission-denied/313501

https://forums.developer.nvidia.com/t/nvrmmeminitnvmap-failed-with-permission-denied-error-when-running-nvidia-docker-in-rootless-mode-on-jetson-orin-nano/319532

The /etc/udev/rules.d/99-tegra-devices.rules file on-machine is:

    https://termbin.com/row5

I am happy to share any other info/test results I can.
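For auditing, the relevant device nodes can be listed directly and
diffed against cfarm107/cfarm108. A minimal sketch (the node names
/dev/nvmap, /dev/nvhost-*, /dev/nvgpu* are an assumption based on
typical Tegra userspace; the exact set on Thor may differ):

```shell
#!/bin/sh
# Print mode, owner:group, and path for GPU-related device nodes so
# the output can be diffed between a working machine and cfarm109.
# Adjust the globs to whatever actually exists on each machine.
for dev in /dev/nvmap /dev/nvhost-* /dev/nvgpu* /dev/dri/*; do
    [ -e "$dev" ] && stat -c '%a %U:%G %n' "$dev"
done
```

If the udev rules file is being applied correctly, the group and
mode shown here should match what 99-tegra-devices.rules requests.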


memory
------

(Not a permissions problem, but worth mentioning for completeness.)

Having run the same test on both Spark and Thor: the Spark is
using only a few GB of memory, while the Thor is still using
40+ GB even after the test completed. This relates to my email
from 2026-03-07:

    "[cfarm-users] cfarm109 memory monitoring"

To free the memory (temporary workaround):

  $ cfarm-drop-caches

Now the system is back to normal memory usage.
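For reference, a sketch of what such a cache drop typically amounts
to, assuming cfarm-drop-caches wraps the standard kernel interface
(an assumption; the wrapper requires root either way):

```shell
# Hypothetical equivalent of the cfarm-drop-caches wrapper, using the
# standard kernel drop_caches interface (must run as root):
sync                                # flush dirty pages to disk first
echo 3 > /proc/sys/vm/drop_caches   # free pagecache, dentries, inodes
free -h                             # confirm memory was released
```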


groups
------

The comments in the forum posts above suggest adding users to
the 'video' group. That is easy enough to test, and it does
reduce the warnings:
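The group change itself is straightforward (sketch; 'zv' stands in
for any affected account, and the new membership only takes effect
on the next login, or via 'newgrp video'):

```shell
# As root: append the user to the 'video' group, then confirm the
# membership took effect.
usermod -aG video zv
id -nG zv | tr ' ' '\n' | grep -x video
```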

zv at cfarm109:~/311$ ./llama.cpp/build-cuda/bin/llama-cli -s 0 -c 0 -m models/GLM-4.7-Flash-UD-Q8_K_XL.gguf
ggml_cuda_init: failed to initialize CUDA: operation not supported
                                                                    
Loading model...

However, there is still no GPU utilization: no CUDA devices are
detected.

Re-running your 'nvidia-smi' example as a non-root member of
'video' produces no warnings, which is a plus, but not a solution:

zv at cfarm109:~$ nvidia-smi
Wed Mar 11 16:07:56 2026       
...

Building llama.cpp as a normal user (even as a member of 'video'):

...
-- Detecting CUDA compile features - done                      
-- Using CMAKE_CUDA_ARCHITECTURES=native
CMAKE_CUDA_ARCHITECTURES_NATIVE=No CUDA devices found.-real    
-- CUDA host compiler is GNU 13.3.0
...

Building llama.cpp as root:

...
-- Detecting CUDA compile features - done
-- Using CMAKE_CUDA_ARCHITECTURES=110-real
CMAKE_CUDA_ARCHITECTURES_NATIVE=110-real
-- CUDA host compiler is GNU 13.3.0
...
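As a build-side workaround (not a fix for the underlying
permissions issue), the architecture could be pinned explicitly so
CMake does not have to probe the inaccessible GPU; '110-real' is
taken from the root build log above:

```shell
# Sketch: configure llama.cpp's CUDA build with an explicit
# architecture instead of 'native' detection, which fails when the
# configure step cannot open the GPU.
cmake -B build-cuda -DGGML_CUDA=ON \
      -DCMAKE_CUDA_ARCHITECTURES=110-real
cmake --build build-cuda -j
```

Of course the resulting binary would still hit the same permission
errors at run time, so this only unblocks compilation.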


jetpack
-------

When I installed JetPack on cfarm109 I had to perform some manual
steps (a known issue); this is likely unrelated, but worth
mentioning:

root at cfarm109:~# apt-get install nvidia-jetpack
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 nvidia-container : Depends: nvidia-container-toolkit-base (= 1.18.0-1) but 1.18.1-1 is to be installed
                    Depends: libnvidia-container-tools (= 1.18.0-1) but 1.18.1-1 is to be installed
                    Depends: nvidia-container-toolkit (= 1.18.0-1) but 1.18.1-1 is to be installed
                    Depends: libnvidia-container1 (= 1.18.0-1) but 1.18.1-1 is to be installed
E: Unable to correct problems, you have held broken packages.

https://forums.developer.nvidia.com/t/jetpack-7-1-apt-install-issue-nvidia-container/357136

v=1.18.0-1
apt-get install -y \
    nvidia-container-toolkit=${v} \
    nvidia-container-toolkit-base=${v} \
    libnvidia-container-tools=${v} \
    libnvidia-container1=${v} \
    --allow-downgrades \
    ;
apt-get install nvidia-jetpack
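Until the upstream fix lands, it may also be worth pinning the
downgraded packages so a routine upgrade does not reintroduce the
mismatch (a suggestion, not something I have applied on cfarm109):

```shell
# Hold the downgraded packages at 1.18.0-1 so 'apt-get upgrade' does
# not pull 1.18.1-1 back in; release later with 'apt-mark unhold'.
apt-mark hold \
    nvidia-container-toolkit \
    nvidia-container-toolkit-base \
    libnvidia-container-tools \
    libnvidia-container1
apt-cache policy nvidia-container-toolkit   # verify the pinned version
```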


This is the only deviation from the stock 38.4.0 setup; all
post-installation steps mirror those of cfarm107/cfarm108.

The JetPack issue should eventually be fixed upstream.


conclusion
----------

At the moment I do not know of a trivial workaround, and obviously
giving all users root access is not an option.

I'm glad that cfarm107/cfarm108 work as expected and can serve
as a reference for comparison to cfarm109. Maybe auditing all
the relevant files, devices, and configurations can clue us in.

For now, please use those machines for GPU testing. Note that
cfarm108 is going to be temporarily unstable while I investigate
rootless Docker with GPU support and may be rebooted any time.

I hope the fix for the cfarm109 permissions issue turns out to be
trivial to implement, but it still has to be found first.

I'm happy to look into it, but I'd greatly appreciate some extra
eyes: I am extremely busy with non-computer commitments, and it
may take me a few weeks to return to normal availability.


Zach

