Monitoring

Installing monitoring tools on Rancher and monitoring GPUs

Installing the monitoring tool in Rancher

  • Go to Apps -> Charts and search for "monitoring"

    (screenshot: charts)

  • Install Monitoring. Once the installation runs, the monitoring app is deployed to Rancher's cattle-monitoring-system namespace
    (after installation, log out of Rancher and log back in)

  • A Monitoring menu now appears in the navigation area

    (screenshot: navigationbar)

  • Check the dashboard

    (screenshot: dashboard)

  • Check Grafana

    (screenshot: grafana)

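Log in to Grafana. The default admin username and password for the Grafana instance are admin/prom-operator. (Regardless of who has the password, cluster administrator permission in Rancher is still required to access the Grafana instance.) You can also supply alternative credentials when deploying or upgrading the chart.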
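As a rough illustration of supplying alternative credentials, the sketch below writes a small values override file. The file name grafana-credentials.yaml and the GRAFANA_ADMIN_PASSWORD variable are hypothetical, and the key grafana.adminPassword is an assumption based on the upstream kube-prometheus-stack chart; check the values of the Monitoring chart version you install before relying on it.

```shell
# Hypothetical file name; not part of the original setup.
values_file="grafana-credentials.yaml"

# Take the password from the environment, falling back to a placeholder.
new_password="${GRAFANA_ADMIN_PASSWORD:-change-me}"

# ASSUMPTION: the chart reads the Grafana admin password from
# grafana.adminPassword (as the upstream kube-prometheus-stack chart does).
cat > "$values_file" <<EOF
grafana:
  adminPassword: "${new_password}"
EOF

cat "$values_file"
```

The resulting YAML can be pasted into the chart's YAML editor in the Rancher UI during install or upgrade, or passed to helm with -f.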

Installing dcgm-exporter for GPU node monitoring

  • Install Helm v3
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 && \
    chmod 700 get_helm.sh && \
    ./get_helm.sh
  • Add the Helm repository
helm repo add gpu-helm-charts \
    https://nvidia.github.io/gpu-monitoring-tools/helm-charts
  • Update the repository
helm repo update
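  • Install the official DCGM-Exporter chart
# Default install (commented out, not used here):
# helm install --generate-name gpu-helm-charts/dcgm-exporter

# To monitor the orca mlops nodes, install with a values file in which tolerations and nodeSelector have been modified.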
helm install -f dcgm-values.yaml --generate-name gpu-helm-charts/dcgm-exporter
  • Check the chart
$ helm search repo gpu-helm-charts
NAME                                    CHART VERSION   APP VERSION     DESCRIPTION
gpu-helm-charts/dcgm-exporter           2.4.0           2.4.0           A Helm chart for DCGM exporter
gpu-helm-charts/kube-prometheus         0.0.43                          Manifests, dashboards, and alerting rules for e...
gpu-helm-charts/prometheus-operator     0.0.15                          Provides easy monitoring definitions for Kubern...

$ helm inspect chart gpu-helm-charts/dcgm-exporter 
apiVersion: v2
appVersion: 2.4.0
description: A Helm chart for DCGM exporter
home: https://github.com/nvidia/gpu-monitoring-tools/
icon: https://assets.nvidiagrid.net/ngc/logos/DCGM.png
keywords:
- gpu
- cuda
- compute
- monitoring
- telemetry
- tesla
kubeVersion: '>= 1.13.0-0'
name: dcgm-exporter
sources:
- https://gitlab.com/nvidia/container-toolkit/gpu-monitoring-tools
version: 2.4.0
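  • Save the default values as dcgm-values.yaml (for example, by redirecting the output of the command below to that file), then edit tolerations and nodeSelector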
$ helm inspect values gpu-helm-charts/dcgm-exporter 
# Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

image:
  repository: nvcr.io/nvidia/k8s/dcgm-exporter
  pullPolicy: IfNotPresent
  # Image tag defaults to AppVersion, but you can use the tag key
  # for the image tag, e.g:
  tag: 2.2.9-2.4.0-ubuntu18.04

# Comment the following line to stop profiling metrics from DCGM
arguments: ["-f", "/etc/dcgm-exporter/dcp-metrics-included.csv"]
# NOTE: in general, add any command line arguments to arguments above
# and they will be passed through.
# Use "-r", "<HOST>:<PORT>" to connect to an already running hostengine
# Example arguments: ["-r", "host123:5555"]
# Use "-n" to remove the hostname tag from the output.
# Example arguments: ["-n"]
# Use "-d" to specify the devices to monitor. -d must be followed by a string
# in the following format: [f] or [g[:numeric_range][+]][i[:numeric_range]]
# Where a numeric range is something like 0-4 or 0,2,4, etc.
# Example arguments: ["-d", "g+i"] to monitor all GPUs and GPU instances or
# ["-d", "g:0-3"] to monitor GPUs 0-3.

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""

serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name:

podSecurityContext: {}
  # fsGroup: 2000

securityContext:
  runAsNonRoot: false
  runAsUser: 0
  capabilities:
     add: ["SYS_ADMIN"]
  # readOnlyRootFilesystem: true

service:
  type: ClusterIP
  port: 9400
  address: ":9400"
  # Annotations to add to the service
  annotations: {}

resources: {}
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi
serviceMonitor:
  enabled: true
  interval: 15s
  additionalLabels: {}
    #monitoring: prometheus

mapPodsMetrics: false

nodeSelector: {}
  #node: gpu

tolerations: []
#- operator: Exists

affinity: {}
  #nodeAffinity:
  #  requiredDuringSchedulingIgnoredDuringExecution:
  #    nodeSelectorTerms:
  #    - matchExpressions:
  #      - key: nvidia-gpu
  #        operator: Exists

extraHostVolumes: []
#- name: host-binaries
#  hostPath: /opt/bin

extraVolumeMounts: []
#- name: host-binaries
#  mountPath: /opt/bin
#  readOnly: true

extraEnv: []
#- name: EXTRA_VAR
#  value: "TheStringValue"
  • Delete the Helm release
$ helm ls
NAME                            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                          APP VERSION  
dcgm-exporter-1658202339        default         1               2022-07-19 12:45:40.1694572 +0900 KST   deployed        dcgm-exporter-2.4.0            2.4.0


$ helm delete dcgm-exporter-1658202339
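The install step above relies on a dcgm-values.yaml with modified tolerations and nodeSelector, but the modified file itself is not shown. A minimal sketch of such a file follows. It is an override file rather than a full copy of the defaults, and the label node: gpu and the blanket toleration are simply the commented-out hints from the default values, not the settings actually used on the orca mlops nodes.

```shell
# Write a minimal override file; values passed with -f are layered
# on top of the chart defaults, so only the changed keys are needed.
# ASSUMPTION: "node: gpu" and "operator: Exists" are taken from the
# commented hints in the default values; replace them with the real
# labels and taints of your GPU nodes.
cat > dcgm-values.yaml <<'EOF'
nodeSelector:
  node: gpu

tolerations:
- operator: Exists
EOF

# Print the install command instead of running it, so the sketch is
# safe to run on a machine without helm or cluster access.
echo "helm install -f dcgm-values.yaml --generate-name gpu-helm-charts/dcgm-exporter"
```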

Verifying operation after adding the exporter

$ kubectl get pods -A | grep exporter 
cattle-monitoring-system    rancher-monitoring-prometheus-node-exporter-6q7pp            1/1     Running            0          56m
cattle-monitoring-system    rancher-monitoring-prometheus-node-exporter-8bmpz            1/1     Running            0          56m
cattle-monitoring-system    rancher-monitoring-prometheus-node-exporter-8xrk6            1/1     Running            0          56m
cattle-monitoring-system    rancher-monitoring-prometheus-node-exporter-blvhr            1/1     Running            0          56m
cattle-monitoring-system    rancher-monitoring-prometheus-node-exporter-kc4ql            1/1     Running            0          56m
cattle-monitoring-system    rancher-monitoring-prometheus-node-exporter-l56nm            1/1     Running            0          56m
cattle-monitoring-system    rancher-monitoring-prometheus-node-exporter-qkk82            1/1     Running            0          56m
cattle-monitoring-system    rancher-monitoring-prometheus-node-exporter-s654d            1/1     Running            0          56m
cattle-monitoring-system    rancher-monitoring-prometheus-node-exporter-x6hjz            1/1     Running            0          56m
default                     dcgm-exporter-1658202339-6wwlg                               0/1     CrashLoopBackOff   6          5m22s
default                     dcgm-exporter-1658202339-ffch6                               0/1     CrashLoopBackOff   6          5m22s
default                     dcgm-exporter-1658202339-kgldt                               0/1     CrashLoopBackOff   7          5m22s
default                     dcgm-exporter-1658202339-nprlc                               0/1     CrashLoopBackOff   6          5m22s
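With many exporter pods, the failing ones are easy to miss in the listing. The following sketch filters rows in the `kubectl get pods -A` column layout (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE) down to pods that are not Running. The function name not_running is made up for this example, and the sample rows are copied from the output above.

```shell
# Print "namespace/pod status restarts" for every row whose STATUS
# column (4th field) is neither Running nor Completed.
not_running() {
  awk 'NF >= 5 && $1 != "NAMESPACE" && $4 != "Running" && $4 != "Completed" {
    print $1 "/" $2, $4, $5
  }'
}

# Two sample rows taken from the listing above.
sample='cattle-monitoring-system    rancher-monitoring-prometheus-node-exporter-6q7pp            1/1     Running            0          56m
default                     dcgm-exporter-1658202339-6wwlg                               0/1     CrashLoopBackOff   6          5m22s'

printf '%s\n' "$sample" | not_running
```

Against a live cluster the same filter could be fed with `kubectl get pods -A | not_running`.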

Inspecting the CrashLoopBackOff state

  • The logs look normal
$ kubectl logs dcgm-exporter-1658205662-qwhs9
time="2022-07-19T03:50:16Z" level=info msg="Starting dcgm-exporter"
time="2022-07-19T03:50:16Z" level=info msg="DCGM successfully initialized!"
time="2022-07-19T03:50:16Z" level=info msg="Collecting DCP Metrics"
time="2022-07-19T03:50:16Z" level=info msg="Kubernetes metrics collection enabled!"
time="2022-07-19T03:50:16Z" level=info msg="Pipeline starting"
time="2022-07-19T03:50:16Z" level=info msg="Starting webserver"
  • Check the pod status
$ kubectl describe pod dcgm-exporter-1658205662-qwhs9
......
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  29m                default-scheduler  Successfully assigned default/dcgm-exporter-1658205662-qwhs9 to hcidc-sv-paz-orca-worker-09
  Normal   Pulled     29m                kubelet            Container image "nvcr.io/nvidia/k8s/dcgm-exporter:2.2.9-2.4.0-ubuntu18.04" already present on machine
  Normal   Created    29m                kubelet            Created container exporter
  Normal   Started    29m                kubelet            Started container exporter
  Warning  Unhealthy  28m (x3 over 29m)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503
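Since the logs are clean, the Events section is where the failure shows up. A small sketch for summarizing which kind of probe is failing in `kubectl describe pod` output is given below. The function name probe_failures is hypothetical, and it counts event lines, not the "x3" repeat counts inside them.

```shell
# Count event lines per probe kind (Liveness, Readiness, Startup)
# in text read from stdin.
probe_failures() {
  awk '
    match($0, /(Liveness|Readiness|Startup) probe failed/) {
      kind = substr($0, RSTART, RLENGTH)
      sub(/ probe failed/, "", kind)
      count[kind]++
    }
    END { for (k in count) print k, count[k] }
  '
}

# Event line copied from the describe output above.
event='  Warning  Unhealthy  28m (x3 over 29m)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503'

printf '%s\n' "$event" | probe_failures
```

With cluster access, `kubectl describe pod <pod-name> | probe_failures` would give the same summary for a real pod.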

Resolving the CrashLoopBackOff

$ kubectl edit daemonset.apps/dcgm-exporter-1658205662

# Change initialDelaySeconds to 60
.......
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /health
    port: 9400
    scheme: HTTP
  initialDelaySeconds: 60
  periodSeconds: 5
  successThreshold: 1
  timeoutSeconds: 1
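The same change can be expressed non-interactively as a JSON Patch, which is easier to repeat than an editor session. The sketch below only builds and prints the patch. The helper name liveness_delay_patch is made up, and the patch assumes the exporter is the first container in the pod template and that initialDelaySeconds is already set on it.

```shell
# Build a JSON Patch that sets livenessProbe.initialDelaySeconds on
# the first container of a DaemonSet pod template.
# ASSUMPTION: the exporter is container index 0 and the field exists.
liveness_delay_patch() {
  delay="$1"
  printf '[{"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds","value":%d}]\n' "$delay"
}

liveness_delay_patch 60

# With cluster access, the patch could be applied to the DaemonSet
# named in the text like this:
# kubectl patch daemonset dcgm-exporter-1658205662 --type=json -p "$(liveness_delay_patch 60)"
```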

Go to Grafana and install the GPU dashboard

Install the DCGM Exporter Dashboard
https://grafana.com/grafana/dashboards/12239
  • Select Import

    (screenshot: add-dashboard1)

  • Enter the dashboard ID (12239)

    (screenshot: add-dashboard2)

  • Select Prometheus as the data source

    (screenshot: add-dashboard3)

  • Check the dashboard

    (screenshot: gpu-dashboard)
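If the dashboard stays empty, one thing to check is whether the exporter serves GPU metrics at all. The sketch below parses Prometheus text-format output and prints the utilization per GPU. The function name gpu_util is hypothetical, and the sample metric lines are invented for illustration (only the metric name DCGM_FI_DEV_GPU_UTIL and the gpu label follow what dcgm-exporter emits).

```shell
# Read Prometheus text-format metrics on stdin and print
# "gpu=<index> util=<value>" for each DCGM_FI_DEV_GPU_UTIL sample.
gpu_util() {
  awk '/^DCGM_FI_DEV_GPU_UTIL\{/ {
    if (match($0, /gpu="[^"]*"/)) {
      # Strip the leading gpu=" (5 chars) and the closing quote.
      idx = substr($0, RSTART + 5, RLENGTH - 6)
      print "gpu=" idx, "util=" $NF
    }
  }'
}

# Made-up sample lines, not real measurements.
metrics='# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",device="nvidia0"} 37
DCGM_FI_DEV_GPU_UTIL{gpu="1",device="nvidia1"} 0'

printf '%s\n' "$metrics" | gpu_util
```

Against a running exporter, the input could come from the service port configured in the values (9400), for example after a `kubectl port-forward` to the exporter service: `curl -s localhost:9400/metrics | gpu_util`.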