Prometheus + Grafana 监控 NVIDIA GPU

1.首先安装 NVIDIA Data Center GPU Manager (DCGM),从 https://developer.nvidia.com/dcgm 下载安装

nv-hostengine -t
yum erase -y datacenter-gpu-manager
rpm -ivh datacenter-gpu-manager*
systemctl enable --now dcgm.service

2. 安装 NVIDIA DCGM exporter for Prometheus,从 https://github.com/NVIDIA/gpu-monitoring-tools/tree/master/exporters/prometheus-dcgm 下载手工安装

wget -q -O /usr/local/bin/dcgm-exporter https://raw.githubusercontent.com/NVIDIA/gpu-monitoring-tools/master/exporters/prometheus-dcgm/dcgm-exporter/dcgm-exporter
chmod +x /usr/local/bin/dcgm-exporter
mkdir /run/prometheus 
wget -q -O /etc/systemd/system/prometheus-dcgm.service https://raw.githubusercontent.com/NVIDIA/gpu-monitoring-tools/master/exporters/prometheus-dcgm/bare-metal/prometheus-dcgm.service
systemctl daemon-reload
systemctl enable --now prometheus-dcgm.service

3. 从 https://prometheus.io/download/#node_exporter 下载 node_exporter,手工安装为服务并添加 dcgm-exporter 资料

tar xf node_exporter*.tar.gz
mv node_exporter-*/node_exporter /usr/local/bin/
chown root:root /usr/local/bin/node_exporter
chmod +x /usr/local/bin/node_exporter

cat > /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

sed -i '/ExecStart=\/usr\/local\/bin\/node_exporter/c\ExecStart=\/usr\/local\/bin\/node_exporter --collector.textfile.directory=\/run\/prometheus' /etc/systemd/system/node_exporter.service

systemctl daemon-reload
systemctl enable --now node_exporter.service

4. Grafana 添加这个Dashboard
https://grafana.com/grafana/dashboards/11752

批量更改Transmission的tracker

某些站会改tracker,所以需要批量更改已有种子的tracker

#!/bin/bash

username="yaoge123"
password="yaoge123"
host="localhost:9091"

for i in $(transmission-remote $host -n $username:$password -l|awk '{print $1}'|grep -o -P '\d+')
do
	out=$(transmission-remote $host -n $username:$password -t $i -it|grep $1)
	if [ -n "$out" ];then
		id=$(echo "$out"|awk -F: '{print $1}'|grep -o -P '\d+')
		echo $i
		transmission-remote $host -n $username:$password -t $i -tr $id
		transmission-remote $host -n $username:$password -t $i -td $2
	fi
done

用法:./chtracker.sh https://old.tracker.com https://new.tracker.com/announce.php?passkey=xxxxxxxxxx

Docker 自动更新 Let’s Encrypt

在nginx的 docker run 中添加webroot和配置文件挂载

-v $PWD/nginx/letsencrypt/:/var/www/letsencrypt:ro \
-v $PWD/letsencrypt/etc/:/etc/nginx/letsencrypt/:ro \

在nginx中将wwwroot发布出去

location ^~ /.well-known/ {
    root /var/www/letsencrypt/;
}

在nginx中配置证书文件

ssl_certificate letsencrypt/live/www.yaoge123.com/fullchain.pem;
ssl_certificate_key letsencrypt/live/www.yaoge123.com/privkey.pem;

创建 certbot 的docker run脚本,以后只要周期性运行这个脚本就可以自动更新证书了

#!/bin/sh
cd $(dirname $0)
pwd

docker run -it --rm \
	-v $PWD/letsencrypt/etc:/etc/letsencrypt \
	-v $PWD/letsencrypt/lib:/var/lib/letsencrypt \
	-v $PWD/letsencrypt/log:/var/log/letsencrypt \
	-v $PWD/nginx/letsencrypt:/var/www \
	certbot/certbot \
	certonly --webroot \
	--email yaoge123@example.com --agree-tos --no-eff-email \
	--webroot-path=/var/www/ \
	-n \
	--domains www.yaoge123.com
docker kill --signal=HUP nginx

CentOS 7 YUM 安装 Cacti

先添加EPEL再用yum安装cacti和中文字体

yum install cacti cacti-spine mariadb-server google-noto-sans-simplified-chinese-fonts

编辑 /etc/httpd/conf.d/cacti.conf ,在 Directory /usr/share/cacti/ 中添加可访问的浏览器客户端

编辑 /etc/cron.d/cacti ,去掉注释

编辑 /etc/spine.conf ,注释RDB_*

创建数据库

[root@yaoge123]# mysqladmin --user=root create cacti

创建数据库用户

[root@yaoge123]# mysql --user=root mysql
MariaDB [mysql]> GRANT ALL ON cacti.* TO cactiuser@localhost IDENTIFIED BY 'cactiuser';
MariaDB [mysql]> flush privileges;

数据库用户增加 timezone 权限

[root@yaoge123]# mysql -u root
MariaDB [(none)]> GRANT SELECT ON mysql.time_zone_name TO cactiuser@localhost IDENTIFIED BY 'cactiuser';
MariaDB [(none)]> flush privileges;

数据库增加 timezone

[root@yaoge123]# mysql_tzinfo_to_sql /usr/share/zoneinfo/ | mysql -u root mysql

新建一个文件 /etc/my.cnf.d/cacti.cnf ,内容供参考根据实际情况修改

[mysqld]
character-set-client = utf8mb4
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
innodb_additional_mem_pool_size = 80M
innodb_buffer_pool_size = 1024M
innodb_doublewrite = ON
innodb_file_format = Barracuda
innodb_file_per_table = ON
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
innodb_large_prefix = ON
join_buffer_size = 748M
max_allowed_packet = 16777216
max_heap_table_size = 374M
tmp_table_size = 374M

重启相关服务,设置开机自动启动

systemctl restart mariadb
systemctl enable mariadb
systemctl restart httpd
systemctl enable httpd

导入数据库

[root@yaoge123]# mysql cacti < /usr/share/doc/cacti-*/cacti.sql

浏览器打开 http://<server>/cacti/ ,默认用户名密码为 admin/admin

UniverStor P20000 (ActiveScale P100) NFS性能测试

首先这是一个不严谨的测试,Client和Server都是在有生产压力的情况下做额外测试,仅使用了一个Client和Server(仅使用三个Server中的一个),结果仅供参考。

NFS Client:mirrors.nju.edu.cn 的生产虚拟机,负载不轻,VMXNET 3网卡,主机万兆网卡,通过下面一个系统节点的IP挂载NFS。
mount -t nfs -o nfsvers=3,wsize=1048576,rsize=1048576,proto=tcp,async,lookupcache=none,timeo=600 x.x.x.x:/yaoge123 /mnt

NFS Server: UniverStor P20000 (ActiveScale P100) 5.5.0.40,三分之一配置,三个系统节点六个存储节点,box.nju.edu.cn的后端存储。
access_type=RW,clients=x.x.x.x,sec=sys

Continue reading

HPE ProLiant DL380 Gen10 不同BIOS设置内存性能测试

硬件环境

2*Intel(R) Xeon(R) Gold 5122 CPU @ 3.60GHz
12*HPE SmartMemory DDR4-2666 RDIMM 16GiB

iLO 5 1.37 Oct 25 2018
System ROM U30 v1.46 (10/02/2018)
Intelligent Platform Abstraction Data 7.2.0 Build 30
System Programmable Logic Device 0x2A
Power Management Controller Firmware 1.0.4
NVMe Backplane Firmware 1.20
Power Supply Firmware 1.00
Power Supply Firmware 1.00
Innovation Engine (IE) Firmware 0.1.6.1
Server Platform Services (SPS) Firmware 4.0.4.288
Redundant System ROM U30 v1.42 (06/20/2018)
Intelligent Provisioning 3.20.154
Power Management Controller FW Bootloader 1.1
HPE Smart Storage Battery 1 Firmware 0.60
HPE Eth 10/25Gb 2p 631FLR-SFP28 Adptr 212.0.103001
HPE Ethernet 1Gb 4-port 331i Adapter – NIC 20.12.41
HPE Smart Array P816i-a SR Gen10 1.65
HPE 100Gb 1p OP101 QSFP28 x16 OPA Adptr 1.5.2.0.0
HPE InfiniBand EDR/Ethernet 100Gb 2-port 840QSF 12.22.40.30
Embedded Video Controller 2.5

软件环境

CentOS Linux release 7.6.1810 (Core)
Linux yaoge123 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Intel(R) Memory Latency Checker – v3.6

Continue reading

安装 GPFS 管理GUI

  1. GUI节点安装 gpfs.gss.pmcollector-.rpm gpfs.gss.pmsensors-.rpm gpfs.gui-.noarch.rpm gpfs.java-.x86_64.rpm
  2. 所有节点安装 gpfs.gss.pmsensors-*.rpm
  3. 初始化收集器节点 mmperfmon config generate –collectors [node list],GUI节点必须是收集器节点
  4. 启用传感器节点 mmchnode –perfmon -N [SENSOR_NODE_LIST]
  5. 设置容量监控节点和间隔 mmperfmon config update GPFSDiskCap.restrict=[node] GPFSDiskCap.period=86400
  6. 设置fileset容量监控节点和间隔 mmperfmon config update GPFSFilesetQuota.restrict=[node] GPFSFilesetQuota.period=3600
  7. GUI节点自动启动systemctl enable gpfsgui

删除

  1. GUI节点:systemctl stop gpfsgui; systemctl disable gpfsgui;
  2. mmlscluster |grep perfmon 查询一下哪些节点,mmchnode –noperfmon -N [SENSOR_NODE_LIST]
  3. mmperfmon config delete –all
  4. 清空数据库 psql postgres postgres -c “drop schema fscc cascade”
  5. 删除相关的rpm包 yum erase gpfs.gss.pmcollector gpfs.gss.pmsensors gpfs.gui gpfs.java
  6. mmlsnodeclass 查询有哪些节点,分别用mmchnodeclass GUI_SERVERS delete -N <……> 和 mmchnodeclass GUI_MGMT_SERVERS delete -N <……> 删除

OpenLDAP 升级报错pwdMaxRecordedFailure不存在

升级到OpenLDAP 2.4.44,出现以下错误

User Schema load failed for attribute "pwdMaxRecordedFailure". Error code 17: attribute type undefined
config error processing olcOverlay={1}ppolicy,olcDatabase={2}hdb,cn=config: User Schema load failed for attribute "pwdMaxRecordedFailure". Erro...ype undefined
slapd stopped.

解决办法

cd /etc/openldap/slapd.d/cn=config/cn=schema
mv cn\=\{3\}ppolicy.ldif cn\=\{3\}ppolicy.ldif.bak
mv /etc/openldap/schema/ppolicy.ldif cn\=\{3\}ppolicy.ldif

 

OpenLDAP 密码策略

OpenLDAP默认是没有密码检查策略的,123456这也得密码也能接受,这显然是管理员不希望看到的。

  1. 导入密码策略schema
    ldapadd -Y EXTERNAL -H ldapi:/// -D "cn=config" -f /etc/openldap/schema/ppolicy.ldif
  2. 加载模块,因为已经添加过syncprov模块了,所以只要追加ppolicy模块就可以了
    dn: cn=module{0},cn=config
    changetype: modify
    add: olcModuleLoad
    olcModuleLoad: ppolicy.la
    
    ldapmodify -Y EXTERNAL -H ldapi:/// -f mod_ppolicy.ldif
  3. 指定默认策略dn名
    dn: olcOverlay=ppolicy,olcDatabase={2}hdb,cn=config
    changeType: add
    objectClass: olcOverlayConfig
    objectClass: olcPPolicyConfig
    olcOverlay: ppolicy
    olcPPolicyDefault: cn=default,ou=ppolicy,dc=yaoge123,dc=com
    olcPPolicyHashCleartext: TRUE
    ldapmodify -Y EXTERNAL -H ldapi:/// -f ppolicy.ldif
  4. 创建默认策略
    dn: ou=ppolicy,dc=yaoge123,dc=com
    objectClass: organizationalUnit
    objectClass: top
    ou: ppolicy
    
    dn: cn=default,ou=ppolicy,dc=yaoge123,dc=com
    cn: default
    objectClass: top
    objectClass: device
    objectClass: pwdPolicy
    objectClass: pwdPolicyChecker
    pwdAllowUserChange: TRUE
    pwdAttribute: userPassword
    pwdCheckQuality: 2
    pwdExpireWarning: 604800
    pwdFailureCountInterval: 0
    pwdGraceAuthnLimit: 5
    pwdInHistory: 5
    pwdLockout: TRUE
    pwdLockoutDuration: 600
    pwdMaxAge: 0
    pwdMaxFailure: 5
    pwdMinAge: 0
    pwdMinLength: 8
    pwdMustChange: FALSE
    pwdSafeModify: FALSE
    pwdCheckModule: check_password.so
    ldapadd -Y EXTERNAL -H ldapi:/// -f defaultppolicy.ldif
  5. 修改/etc/openldap/check_password.conf,定义check_password.so规则
  6. MirrorMode的两台LDAP均需进行上述同样的配置

Seafile集成卡巴斯基

防病毒脚本 /opt/kaspersky/kav4fs_scan.sh

#!/bin/bash

VIRUS_FOUND=1
CLEAN=0
UNDEFINED=2
KAV4FS='/opt/kaspersky/kav4fs/bin/kav4fs-control'
if [ ! -x $KAV4FS ]
then
    echo "Binary not executable"
    exit $UNDEFINED
fi

SCAN_OUTPUT=`$KAV4FS --scan-file "$1"`
if [ "$?" -ne 0 ]
then
    echo "Error due to check file '$1'"
    exit 3
fi

while read line
do
	OUT1=`echo $line|cut -d':' -f 1`
	OUT2=`echo $line|cut -d':' -f 2|sed 's/ //g'`
	case "$OUT1" in
        "Threats found" )
                THREATS_C=$OUT2
                ;;
        "Riskware found" )
                RISKWARE_C=$OUT2
                ;;
        "Infected" )
                INFECTED=$OUT2
                ;;
        "Suspicious" )
                SUSPICIOUS=$OUT2
                ;;
        "Scan errors" )
                SCAN_ERRORS_C=$OUT2
                ;;
        "Password protected" )
                PASSWORD_PROTECTED=$OUT2
                ;;
        "Corrupted" )
                CORRUPTED=$OUT2
                ;;
	esac
done <<< "$SCAN_OUTPUT"

if [ $INFECTED -gt 0 ]
then
    exit $VIRUS_FOUND
elif [ $THREATS_C -gt 0 -o $RISKWARE_C -gt 0 -o $SUSPICIOUS -gt 0 -o $SCAN_ERRORS_C -gt 0 -o $CORRUPTED -gt 0 ]
then
    exit $UNDEFINED
else
    exit $CLEAN
fi

/opt/seafile/conf/seafile.conf 添加防病毒配置

[virus_scan]
scan_command = /opt/kaspersky/kav4fs_scan.sh
virus_code = 1
nonvirus_code = 0
scan_interval = 60

每天crontab清除kav4fs的日志/etc/cron.d/kav

30 0 * * * root find /var/log/kaspersky/kav4fs/supervisor_trace.log* -exec rm {} \;
40 0 * * * root /opt/kaspersky/kav4fs/bin/kav4fs-control -S --clean-stat