get Jetson GPU Information #476

Closed
1 commit
62 changes: 60 additions & 2 deletions beszel/internal/agent/gpu.go
@@ -6,6 +6,7 @@ import (
"encoding/json"
"fmt"
"os/exec"
"regexp"
"strconv"
"strings"
"sync"
@@ -18,6 +19,7 @@ import (
type GPUManager struct {
nvidiaSmi bool
rocmSmi bool
tegrastats bool
GpuDataMap map[string]*system.GPUData
mutex sync.Mutex
}
@@ -89,6 +91,47 @@ func (c *gpuCollector) collect() error {
return c.cmd.Wait()
}

// parseJetsonData parses the output of tegrastats and updates the GPUData map
func (gm *GPUManager) parseJetsonData(output []byte) bool {
data := string(output)
ramPattern := regexp.MustCompile(`RAM (\d+)/(\d+)MB`)
gr3dPattern := regexp.MustCompile(`GR3D_FREQ (\d+)%`)
tempPattern := regexp.MustCompile(`([a-z0-9_]+)@(\d+\.?\d*)C`)
Owner

Maybe we can use gpu@(\d+\.?\d*)C here to target the gpu temp?
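
A minimal sketch of what that targeted pattern could look like, assuming tegrastats prints the GPU sensor as `gpu@<value>C`. The helper name `parseGPUTemp` and the sample string are hypothetical and not part of this PR:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// gpuTempPattern matches only the sensor named "gpu", e.g. "gpu@41.5C".
// The leading \b stops it from matching a longer name that ends in "gpu".
var gpuTempPattern = regexp.MustCompile(`\bgpu@(\d+\.?\d*)C`)

// parseGPUTemp (hypothetical helper) returns the GPU temperature in degrees
// Celsius and reports whether a "gpu@...C" entry was found in the line.
func parseGPUTemp(line string) (float64, bool) {
	m := gpuTempPattern.FindStringSubmatch(line)
	if m == nil {
		return 0, false
	}
	temp, err := strconv.ParseFloat(m[1], 64)
	if err != nil {
		return 0, false
	}
	return temp, true
}

func main() {
	// Illustrative fragment only; not captured from a real device.
	sample := "cpu@45.5C gpu@41.5C"
	temp, ok := parseGPUTemp(sample)
	fmt.Println(temp, ok)
}
```

Compiling the pattern once at package level also avoids rebuilding it on every tegrastats line.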

Contributor Author

Maybe. I'm using an AGX Orin.
This is the tegrastats output:

01-24-2025 09:56:48 RAM 4300/30698MB (lfb 197x4MB) SWAP 28/15349MB (cached 0MB) CPU [0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729] EMC_FREQ 0%@2133 GR3D_FREQ 0%@[0,0] NVENC off NVDEC off NVJPG off NVJPG1 off VIC off OFA off NVDLA0 off NVDLA1 off PVA0_FREQ off APE 174 [email protected] iwlwifi_1@43C [email protected] [email protected] [email protected] [email protected] VDD_GPU_SOC 2171mW/2171mW VDD_CPU_CV 241mW/241mW VIN_SYS_5V0 4375mW/4375mW

I will check the code and try to test it on more Jetson models.
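
As a quick check, the RAM and GR3D patterns from the diff can be run against the start of the AGX Orin line above. This is only a sketch: the `jetsonSample` type and `parseSample` function are hypothetical names, and the sample is truncated before the temperature fields.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Same expressions as in the diff, compiled once at package level.
var (
	ramPattern  = regexp.MustCompile(`RAM (\d+)/(\d+)MB`)
	gr3dPattern = regexp.MustCompile(`GR3D_FREQ (\d+)%`)
)

// jetsonSample is a hypothetical holder for the parsed values.
type jetsonSample struct {
	MemUsedMB    float64
	MemTotalMB   float64
	UsagePercent float64
}

// parseSample extracts RAM and GR3D usage from one tegrastats line.
// It returns false if either field is missing or fails to parse.
func parseSample(line string) (jetsonSample, bool) {
	var s jetsonSample
	ram := ramPattern.FindStringSubmatch(line)
	gr3d := gr3dPattern.FindStringSubmatch(line)
	if ram == nil || gr3d == nil {
		return s, false
	}
	var err error
	if s.MemUsedMB, err = strconv.ParseFloat(ram[1], 64); err != nil {
		return s, false
	}
	if s.MemTotalMB, err = strconv.ParseFloat(ram[2], 64); err != nil {
		return s, false
	}
	if s.UsagePercent, err = strconv.ParseFloat(gr3d[1], 64); err != nil {
		return s, false
	}
	return s, true
}

func main() {
	// Start of the AGX Orin line quoted above, cut off before the sensors.
	line := "01-24-2025 09:56:48 RAM 4300/30698MB (lfb 197x4MB) SWAP 28/15349MB (cached 0MB) EMC_FREQ 0%@2133 GR3D_FREQ 0%@[0,0]"
	s, ok := parseSample(line)
	fmt.Println(s.MemUsedMB, s.MemTotalMB, s.UsagePercent, ok)
}
```

The `GR3D_FREQ 0%@[0,0]` form still matches because the pattern stops at the `%` sign.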

Owner

Thanks, it looks like for power you have VDD_GPU_SOC 2171mW, VDD_CPU_CV 241mW, and VIN_SYS_5V0 4375mW.

On Orin Nano they have VDD_IN 12479mW, VDD_CPU_GPU_CV 4667mW, and VDD_SOC 2817mW.

Annoying that they are different, but I guess we have to pick the best option for each model.

Owner

I found this. It looks like Orin Nano and Orin NX do not have separate power monitors for CPU and GPU.

If you don't want to work on this anymore I think what you have is fine. Those models can't get GPU temperature. Later I can add a fallback for total system power and use VDD_IN for them.

https://forums.developer.nvidia.com/t/any-way-to-monitor-gpu-specific-power-usage-on-jetson-orin-nano/318764

https://docs.nvidia.com/jetson/archives/r36.4/DeveloperGuide/SD/PlatformPowerAndPerformance/JetsonOrinNanoSeriesJetsonOrinNxSeriesAndJetsonAgxOrinSeries.html#jetson-orin-nx-series-and-jetson-orin-nano-series
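
One way the fallback could look, as a rough sketch rather than the planned implementation: try the GPU-specific rail first and fall back to `VDD_IN` (total system power) when it is absent. The function name `parsePowerWatts`, the preference order, and the Orin Nano sample format are assumptions; only the rail names and milliwatt values come from this thread.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// powerRail pairs a rail name with its compiled pattern.
type powerRail struct {
	name    string
	pattern *regexp.Regexp
}

// powerRails is ordered by preference (an assumption for illustration):
// the GPU-specific rail first, then total input power as a fallback.
var powerRails = []powerRail{
	{"VDD_GPU_SOC", regexp.MustCompile(`\bVDD_GPU_SOC (\d+)mW`)},
	{"VDD_IN", regexp.MustCompile(`\bVDD_IN (\d+)mW`)},
}

// parsePowerWatts (hypothetical helper) returns the first rail found in the
// line, its instantaneous reading converted from mW to W, and a found flag.
func parsePowerWatts(line string) (string, float64, bool) {
	for _, rail := range powerRails {
		m := rail.pattern.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		mw, err := strconv.ParseFloat(m[1], 64)
		if err != nil {
			continue
		}
		return rail.name, mw / 1000, true
	}
	return "", 0, false
}

func main() {
	// Power fields from the AGX Orin output quoted earlier in this thread.
	agx := "VDD_GPU_SOC 2171mW/2171mW VDD_CPU_CV 241mW/241mW VIN_SYS_5V0 4375mW/4375mW"
	// Assumed Orin Nano layout, built from the values mentioned above.
	nano := "VDD_IN 12479mW/12479mW VDD_CPU_GPU_CV 4667mW/4667mW VDD_SOC 2817mW/2817mW"

	for _, line := range []string{agx, nano} {
		name, watts, ok := parsePowerWatts(line)
		fmt.Println(name, watts, ok)
	}
}
```

Returning the rail name alongside the value would let the caller label a `VDD_IN` reading as system power rather than GPU power.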

Contributor Author

Haha, Jetson does have a lot of versioning issues, and it's killing us too. We have a lot of Jetsons, but it's the New Year, so I can test it later in the year!

powerPattern := regexp.MustCompile(`VDD_GPU_SOC (\d+)mW`)
gm.mutex.Lock()
defer gm.mutex.Unlock()
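// The Jetson GPU is tracked under key "0". Create the entry on first use so
// the writes below cannot dereference a nil pointer. This assumes GpuDataMap
// itself has already been initialized.
gpuData, ok := gm.GpuDataMap["0"]
if !ok {
gpuData = &system.GPUData{}
gm.GpuDataMap["0"] = gpuData
}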
// Parse RAM usage
ramMatches := ramPattern.FindStringSubmatch(data)
if ramMatches != nil {
gpuData.MemoryUsed, _ = strconv.ParseFloat(ramMatches[1], 64)
gpuData.MemoryTotal, _ = strconv.ParseFloat(ramMatches[2], 64)
}
// Parse GR3D (GPU) usage
gr3dMatches := gr3dPattern.FindStringSubmatch(data)
if gr3dMatches != nil {
usage, _ := strconv.ParseFloat(gr3dMatches[1], 64)
gpuData.Usage = usage / 100
}

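// Parse temperature. Note: this reads the "cpu" sensor into the GPU
// temperature field; the review thread above discusses matching "gpu" instead.
tempMatches := tempPattern.FindAllStringSubmatch(data, -1)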
for _, match := range tempMatches {
if match[1] == "cpu" {
gpuData.Temperature, _ = strconv.ParseFloat(match[2], 64)
break
}
}

// Parse power usage
powerMatches := powerPattern.FindStringSubmatch(data)
if powerMatches != nil {
power, _ := strconv.ParseFloat(powerMatches[1], 64)
gpuData.Power = power / 1000
}
gpuData.Count++
return true
}

// parseNvidiaData parses the output of nvidia-smi and updates the GPUData map
func (gm *GPUManager) parseNvidiaData(output []byte) bool {
fields := strings.Split(string(output), ", ")
@@ -200,10 +243,14 @@ func (gm *GPUManager) detectGPUs() error {
if err := exec.Command("rocm-smi").Run(); err == nil {
gm.rocmSmi = true
}
-if gm.nvidiaSmi || gm.rocmSmi {
+_, err := exec.LookPath("tegrastats")
+if err == nil {
+gm.tegrastats = true
+}
+if gm.nvidiaSmi || gm.rocmSmi || gm.tegrastats {
return nil
}
-return fmt.Errorf("no GPU found - install nvidia-smi or rocm-smi")
+return fmt.Errorf("no GPU found - install nvidia-smi or rocm-smi or tegrastats")
}

// startCollector starts the appropriate GPU data collector based on the command
@@ -226,7 +273,15 @@ func (gm *GPUManager) startCollector(command string) {
parse: gm.parseAmdData,
}
go amdCollector.start()
case "tegrastats":
jetsonCollector := gpuCollector{
name: "tegrastats",
cmd: exec.Command("tegrastats"),
parse: gm.parseJetsonData,
}
go jetsonCollector.start()
}

}

// NewGPUManager creates and initializes a new GPUManager
@@ -243,6 +298,9 @@ func NewGPUManager() (*GPUManager, error) {
if gm.rocmSmi {
gm.startCollector("rocm-smi")
}
if gm.tegrastats {
gm.startCollector("tegrastats")
}

return &gm, nil
}