Building a Linux Server Monitoring Stack: Prometheus + Grafana in Practice

admin · 2026-06-10 · Servers

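1. Introduction

In operations work, not knowing what is happening on your servers is the most dangerous state to be in. A solid monitoring stack lets you spot anomalies before users notice them, moving you from reactive firefighting to proactive prevention. This article walks step by step through building a complete Linux server monitoring stack on Prometheus and Grafana.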

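2. Architecture Overview

The monitoring stack consists of the following components:

Component	Role	Port
Prometheus Server	Collects and stores time-series data	9090
Node Exporter	Collects host metrics (deployed on every server)	9100
Grafana	Visualization and alerting	3000
Alertmanager	Alert routing and notification	9093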

Data flow:

Server A (Node Exporter:9100) ─┐
Server B (Node Exporter:9100) ─┼─→ Prometheus (9090) ──→ Grafana (3000)
Server C (Node Exporter:9100) ─┘       │
                                       └─→ Alertmanager (9093) ──→ notification channels

3. Environment Preparation

3.1 System Requirements

Operating system: Ubuntu 22.04+ / CentOS 7+ / Debian 11+
Minimum: 1 CPU core, 1 GB RAM, 20 GB disk
Recommended: 2 CPU cores, 4 GB RAM, 50 GB disk (SSD preferred)

3.2 Create a Dedicated User

For security, every monitoring component runs as a non-root user:

# Create the prometheus user
sudo useradd --no-create-home --shell /usr/sbin/nologin prometheus

# Create the configuration, rule, and data directories
sudo mkdir -p /etc/prometheus/alerts /var/lib/prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus

4. Installing Prometheus

4.1 Download and Install

# Download a release (this article uses v3.2.0; check the releases page for the current version)
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v3.2.0/prometheus-3.2.0.linux-amd64.tar.gz
tar xzf prometheus-3.2.0.linux-amd64.tar.gz

# Install the binaries
sudo cp prometheus-3.2.0.linux-amd64/{prometheus,promtool} /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/{prometheus,promtool}

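# Copy the sample configuration as a starting point
# (Prometheus 3.x release archives no longer include the consoles/ and console_libraries/ examples)
sudo cp prometheus-3.2.0.linux-amd64/prometheus.yml /etc/prometheus/
sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml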

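4.2 Configure Prometheus

# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s      # how often to scrape targets
  evaluation_interval: 15s  # how often to evaluate rules
  scrape_timeout: 10s       # per-scrape timeout (must not exceed scrape_interval)

# Where to send firing alerts (Alertmanager is installed in section 7.2)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

# Alert rule files (paths are relative to this file)
rule_files:
  - "alerts/*.yml"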

# Scrape targets
scrape_configs:
  # Prometheus's own metrics
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Server nodes
  - job_name: 'node'
    static_configs:
      - targets:
        - '192.168.1.10:9100'   # Server A
        - '192.168.1.11:9100'   # Server B
        - '192.168.1.12:9100'   # Server C

  # Optional: Docker daemon metrics (requires "metrics-addr" in /etc/docker/daemon.json)
  - job_name: 'docker'
    static_configs:
      - targets: ['localhost:9323']
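
Prometheus rejects a configuration whose scrape_timeout is greater than its scrape_interval, so it can help to sanity-check the two values before restarting the service. The following is a minimal Python sketch of that check, not part of Prometheus itself: the helper names `parse_duration` and `check_scrape_timing` are hypothetical, and the parser accepts only a subset of the duration syntax.

```python
import re

# Units accepted here are a subset of the Prometheus duration syntax.
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}
_DURATION_PART = re.compile(r"(\d+)(ms|s|m|h|d)")


def parse_duration(text: str) -> float:
    """Convert a duration such as '15s' or '1m30s' into seconds."""
    parts = _DURATION_PART.findall(text)
    # Reject strings with leftover characters, e.g. '15 seconds'.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"invalid duration: {text!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)


def check_scrape_timing(interval: str, timeout: str) -> bool:
    """Return True when the timeout does not exceed the scrape interval."""
    return parse_duration(timeout) <= parse_duration(interval)


# Values from the global section above.
print(check_scrape_timing("15s", "10s"))
```

For a complete validation of the real file, `promtool check config /etc/prometheus/prometheus.yml` is the authoritative tool; the sketch only illustrates the timing constraint.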

4.3 Create the systemd Service

# /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus \
    --web.listen-address=0.0.0.0:9090

Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

# Start the service
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
sudo systemctl status prometheus
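
Once the service is running, the Prometheus HTTP API endpoint `/api/v1/targets` reports whether each scrape target is healthy. The sketch below shows one way to summarize that response in Python. The function names are hypothetical, the base URL assumes the setup in this article, and the sample payload is trimmed and illustrative rather than real output.

```python
import json
from urllib.request import urlopen


def summarize_targets(payload: dict) -> dict:
    """Map each active target's instance label to its health value."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return {t["labels"]["instance"]: t["health"] for t in targets}


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Query the Prometheus HTTP API (base_url is an assumption from this setup)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


# Sample response reduced to the fields used above (values are made up).
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"instance": "192.168.1.10:9100", "job": "node"}, "health": "up"},
            {"labels": {"instance": "192.168.1.11:9100", "job": "node"}, "health": "down"},
        ]
    },
}
print(summarize_targets(sample))
```

On the monitoring server, `summarize_targets(fetch_targets())` would give the same summary for the live instance.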

5. Deploying Node Exporter

Node Exporter must be deployed on every server you want to monitor.

5.1 Installation

# Download Node Exporter
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.9.0/node_exporter-1.9.0.linux-amd64.tar.gz
tar xzf node_exporter-1.9.0.linux-amd64.tar.gz

# Install
sudo cp node_exporter-1.9.0.linux-amd64/node_exporter /usr/local/bin/
sudo useradd --no-create-home --shell /usr/sbin/nologin node_exporter

5.2 systemd Service

# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Documentation=https://prometheus.io/docs/guides/node-exporter/
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --web.listen-address=:9100 \
    --path.rootfs=/ \
    --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)

Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

# Start
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

# Verify
curl http://localhost:9100/metrics | head -20
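
The `/metrics` endpoint returns plain text in the Prometheus exposition format: comment lines starting with `#`, then one sample per line with an optional label set. To make that format concrete, here is a simplified Python parser. It is a sketch for illustration only (the name `parse_metrics` is hypothetical, and edge cases such as timestamps or exotic label values are not fully handled), not a replacement for an official client library.

```python
import re

# metric_name{label="value",...} value
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list:
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Illustrative excerpt; real output contains many more series.
example = """# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.5
"""
for sample in parse_metrics(example):
    print(sample)
```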

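6. Installing Grafana

6.1 Installation

# Ubuntu/Debian: add the Grafana APT repository (apt-key is deprecated, so use a keyring file)
sudo apt-get install -y apt-transport-https wget gnupg
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install -y grafana

# Or install the standalone .deb package
wget https://dl.grafana.com/oss/release/grafana_11.5.0_amd64.deb
sudo dpkg -i grafana_11.5.0_amd64.deb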

6.2 Configuration

# /etc/grafana/grafana.ini
[server]
http_addr = 0.0.0.0
http_port = 3000
domain = monitor.example.com
root_url = https://monitor.example.com

[security]
admin_user = admin
admin_password = CHANGE_ME_TO_A_STRONG_PASSWORD

[auth.anonymous]
enabled = false

[auth.basic]
enabled = true

# Grafana 11 removed legacy alerting (the old [alerting] section); use unified alerting
[unified_alerting]
enabled = true

# Start Grafana
sudo systemctl daemon-reload
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

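6.3 Configure the Data Source

Go to Connections → Data sources → Add data source (Configuration → Data Sources in older Grafana versions)
Select Prometheus
Set the URL to http://localhost:9090
Click Save & test and confirm the connection succeeds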
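
The same data source can also be registered without the UI through Grafana's HTTP API (`POST /api/datasources`). The sketch below only builds the request so the payload is easy to inspect. The function name is hypothetical, the credentials are placeholders, and Basic Auth is assumed to be enabled as in the configuration above.

```python
import base64
import json
from urllib.request import Request


def build_datasource_request(grafana_url: str, user: str, password: str) -> Request:
    """Build (but do not send) a request that registers Prometheus as a data source."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": "http://localhost:9090",
        "access": "proxy",   # Grafana's backend performs the queries
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Placeholder credentials; replace them with the admin account you configured.
req = build_datasource_request("http://localhost:3000", "admin", "secret")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it. Building and sending are kept separate here so the example has no side effects when run.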

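6.4 Import Dashboards

The Grafana community provides many ready-made dashboard templates:

Node Exporter Full (ID: 1860): the most comprehensive host monitoring dashboard
1 Node Dashboard (ID: 11074): a compact single-node view
Linux Hosts Dashboard (ID: 9276): a multi-host overview

To import: Dashboards → Import → enter the dashboard ID → Load → select the Prometheus data source → Import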

7. Alert Configuration

7.1 Alert Rules

# /etc/prometheus/alerts/node_alerts.yml
groups:
  - name: node_alerts
    interval: 30s
    rules:
      # Server down
      - alert: InstanceDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
          description: "Server {{ $labels.instance }} has been unreachable for more than 1 minute"

      # High CPU usage
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'High CPU usage ({{ printf "%.1f" $value }}%)'
          description: "{{ $labels.instance }} CPU usage has been above 80% for 5 minutes"

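      # Low disk space
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="tmpfs"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          # The expression already yields a percentage, so format it directly
          # (humanizePercentage expects a 0-1 ratio and would print 800% for a value of 8)
          summary: 'Low disk space ({{ printf "%.1f" $value }}% free)'
          description: "{{ $labels.instance }} root filesystem has less than 10% free space"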

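      # Low memory
      - alert: MemoryLow
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          # The value is already a percentage, so humanizePercentage would overstate it by 100x
          summary: 'Low available memory ({{ printf "%.1f" $value }}%)'
          description: "{{ $labels.instance }} has less than 10% of its memory available"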

      # High disk I/O
      - alert: HighDiskIO
        expr: rate(node_disk_io_time_seconds_total[5m]) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High disk I/O"
          description: "{{ $labels.instance }} disk {{ $labels.device }} has been busy more than 50% of the time for 10 minutes"
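
To see what the HighCpuUsage expression computes, the arithmetic can be reproduced by hand: `rate()` over the idle counter gives the fraction of each second a core spent idle, `avg by(instance)` averages that across cores, and subtracting from 100 yields the busy percentage. The Python sketch below mirrors that calculation with made-up counter values. It is a simplification (no counter-reset handling or extrapolation, both of which the real `rate()` performs), and the function names are hypothetical.

```python
def per_second_rate(first: float, last: float, seconds: float) -> float:
    """Simplified rate(): counter increase divided by elapsed time."""
    return (last - first) / seconds


def cpu_usage_percent(idle_samples: dict, seconds: float) -> float:
    """Mirror 100 - avg(rate(idle)) * 100 for a single instance.

    idle_samples maps a CPU core to the (first, last) values of
    node_cpu_seconds_total{mode="idle"} observed `seconds` apart.
    """
    rates = [per_second_rate(a, b, seconds) for a, b in idle_samples.values()]
    return 100 - (sum(rates) / len(rates)) * 100


# Two cores over a 5-minute (300 s) window; the numbers are invented.
samples = {"0": (1000.0, 1030.0), "1": (2000.0, 2060.0)}
usage = cpu_usage_percent(samples, 300)
print(round(usage, 1), usage > 80)
```

Core 0 was idle 10% of the window and core 1 was idle 20%, so the average idle fraction is 15% and usage is about 85%, which exceeds the rule's threshold of 80.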

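7.2 Installing Alertmanager

cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.28.0/alertmanager-0.28.0.linux-amd64.tar.gz
tar xzf alertmanager-0.28.0.linux-amd64.tar.gz
sudo cp alertmanager-0.28.0.linux-amd64/{alertmanager,amtool} /usr/local/bin/
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager

# Run it as a dedicated user, like the other components
sudo useradd --no-create-home --shell /usr/sbin/nologin alertmanager
sudo chown alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
# Start it with a systemd unit modeled on prometheus.service, using
#   ExecStart=/usr/local/bin/alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/var/lib/alertmanager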

# /etc/alertmanager/alertmanager.yml
global:
  # SMTP settings shared by every email receiver
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'password'

route:
  receiver: 'default'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # "matchers" replaces the deprecated "match" field
    - matchers:
        - 'severity="critical"'
      receiver: 'critical'
      repeat_interval: 1h

receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@example.com'

  - name: 'critical'
    webhook_configs:
      - url: 'https://hooks.example.com/alert'
    email_configs:
      - to: 'admin@example.com'
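
The routing tree above sends alerts labeled severity=critical to the critical receiver and everything else to default. The Python sketch below models that decision so the behavior is easy to reason about. It is a deliberately simplified illustration: it supports only equality matching and ignores grouping, `continue`, and regex matchers, and the `Route` class and `resolve` function are hypothetical names, not Alertmanager APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    repeat_interval: str
    matchers: dict = field(default_factory=dict)  # label name -> required value
    routes: list = field(default_factory=list)    # child routes, checked in order


def resolve(route: Route, labels: dict) -> Route:
    """Return the deepest route whose matchers all equal the alert's labels."""
    for child in route.routes:
        if all(labels.get(k) == v for k, v in child.matchers.items()):
            return resolve(child, labels)
    return route  # no child matched, so the parent handles the alert


root = Route(
    receiver="default",
    repeat_interval="4h",
    routes=[
        Route(receiver="critical", repeat_interval="1h",
              matchers={"severity": "critical"}),
    ],
)

down = resolve(root, {"alertname": "InstanceDown", "severity": "critical"})
cpu = resolve(root, {"alertname": "HighCpuUsage", "severity": "warning"})
print(down.receiver, down.repeat_interval)
print(cpu.receiver, cpu.repeat_interval)
```

With the rules from section 7.1, InstanceDown and DiskSpaceLow would follow the critical route and repeat every hour, while the warning-level alerts fall back to the default receiver.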

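8. Advanced Configuration

8.1 Data Persistence and Retention

# Add these to the Prometheus startup flags (whichever limit is reached first applies)
--storage.tsdb.retention.time=30d      # keep data for 30 days
--storage.tsdb.retention.size=50GB     # cap storage at 50 GB
--storage.tsdb.wal-compression          # compress the WAL (already the default since v2.20)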
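
When choosing retention values, the Prometheus documentation suggests a rough capacity formula: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with about 1 to 2 bytes per sample on average. The sketch below applies it in Python. The figure of roughly 1,000 series per host is an assumption for illustration, so measure your own series count before relying on the result.

```python
def estimate_disk_bytes(retention_days: float, series: int,
                        scrape_interval_s: float,
                        bytes_per_sample: float = 2.0) -> float:
    """retention_seconds * samples_per_second * bytes_per_sample."""
    samples_per_second = series / scrape_interval_s
    return retention_days * 86400 * samples_per_second * bytes_per_sample


# Assumption: 3 hosts exposing roughly 1,000 series each, scraped every 15 s.
needed = estimate_disk_bytes(retention_days=30, series=3 * 1000, scrape_interval_s=15)
print(f"{needed / 1e9:.2f} GB")
```

Under these assumptions, 30 days of data needs on the order of 1 GB, far below the 50GB size cap, so the time limit would be the one that takes effect.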

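8.2 Security Hardening

Enable authentication: Grafana requires login by default; for Prometheus and Alertmanager, put a reverse proxy (such as Nginx) in front and add Basic Auth
Firewall rules: allow only the monitoring server to reach port 9100 on Node Exporter hosts
HTTPS: use Let's Encrypt to serve Grafana over HTTPS

# Example Nginx reverse proxy configuration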
server {
    listen 443 ssl;
    server_name monitor.example.com;

    ssl_certificate /etc/letsencrypt/live/monitor.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/monitor.example.com/privkey.pem;

    location / {
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

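8.3 One-Step Deployment with Docker Compose

If you prefer a containerized deployment, you can use Docker Compose. Inside the Compose network, point the targets in prometheus.yml at the service names (for example node_exporter:9100 and alertmanager:9093) rather than localhost: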

# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v3.2.0
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts:/etc/prometheus/alerts
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    ports:
      - "9090:9090"
    restart: unless-stopped

  node_exporter:
    image: prom/node-exporter:v1.9.0
    container_name: node_exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    ports:
      - "9100:9100"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.5.0
    container_name: grafana
    environment:
      # Placeholder: replace with a strong password before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=change-me
      - GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-simple-json-datasource
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.28.0
    container_name: alertmanager
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

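9. Troubleshooting Common Problems

Problem	Likely cause	Fix
Prometheus cannot scrape data	Firewall is blocking port 9100	`sudo ufw allow from PROMETHEUS_IP to any port 9100 proto tcp`
Grafana data source connection fails	Wrong Prometheus address	Check that the URL includes `http://`
Alerts do not fire	Rule syntax error	`promtool check rules /etc/prometheus/alerts/*.yml`
Disk usage grows quickly	Retention period is too long	Adjust `--storage.tsdb.retention.time` or `--storage.tsdb.retention.size`
Node Exporter returns no data	Service is not running	`systemctl status node_exporter`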
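
Several rows in the table come down to whether a port is reachable from the monitoring server. A quick way to test that is a plain TCP connection attempt, sketched below in Python. The port numbers come from the component table in this article; the function names and the choice of a 2-second timeout are assumptions.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports taken from the component table in this article.
PORTS = {"prometheus": 9090, "node_exporter": 9100, "grafana": 3000, "alertmanager": 9093}


def check_host(host: str) -> dict:
    """Map each component name to whether its port accepts connections on host."""
    return {name: port_open(host, port) for name, port in PORTS.items()}
```

For example, `check_host("192.168.1.10")` run on the Prometheus server would show at a glance whether the firewall lets port 9100 through. A closed port only says the connection failed; it does not distinguish a firewall block from a stopped service, so pair it with `systemctl status` on the target host.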

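10. Summary

By following this article, you have built a complete Linux server monitoring stack:

Prometheus collects and stores time-series data
Node Exporter exposes host-level system metrics
Grafana provides dashboards and alert management
Alertmanager handles alert routing and notification delivery

The core strengths of this stack are:

Free and open source: every component is open-source software with no license fees
Rich ecosystem: the community offers hundreds of exporters and dashboard templates
Scalable: it handles anything from a few servers to clusters of thousands
Standardized: Prometheus has become the de facto standard for cloud-native monitoring

Setting up monitoring is only the first step. What matters more is the operational process around it: reviewing alerts regularly, tuning rules, and keeping dashboards current. The value of monitoring lies not in the tools themselves but in the problems they help you find and the incidents they help you avoid.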
