Service vm 100 in error state must be disabled and fixed first


  • #1

Hi,

the HA state for all our VMs on our Proxmox test environment has been "error" since this morning, and the backup has failed. The odd part: all VMs are running perfectly. The state of our containers is fine.

How can I remove the error state without restarting the VMs? And more importantly, how can I figure out why the error state was raised? I have been unable to find any clue in the logs so far.

Thanks in advance.

  • #2

I just got the same "problem" this morning for a single VM. It was in the "error" state but still up and running fine...

My problem was that I had added a backup schedule conflicting with the one already in place (same time, different location), and one backup task failed to lock the VM. I guess that is where the error state came from... nothing useful in the logs except this entry (journalctl -u pve-ha-lrm -l):

Dec 06 00:30:34 pve2 pve-ha-lrm[24176]: service vm:110 is in an error state and needs manual intervention. Look up ‘ERROR REC

But I have not been able to clear the error state without changing the state to "disabled" in HA Resources (which stopped the VM...) and then back to "started" again...

state: <disabled | enabled | ignored | started | stopped> (default = started)
[…]
disabled

The CRM tries to put the resource in stopped state, but does not try to relocate the resources on node failures. The main purpose of this state is error recovery, because it is the only way to move a resource out of the error state.
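So in practice the disable/start cycle looks like this from the shell; a minimal sketch for the vm:110 from the log line above (note that the disable step does stop the VM before it is started again):

# clear the error state by cycling through "disabled", then request a start again
ha-manager set vm:110 --state disabled
ha-manager set vm:110 --state started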

Maybe you could just "remove" the VM from the HA Resources list and "add" it again, but I had not tried that before using the previous solution.

  • #3

Hi Belokan,

the funny thing is that the backup error was for the group "testing", while the errors are reported for the group "hosting".

Anyway, your proposed solution works fine, but is a pain in the ass when you have more than 10 VMs. Is there a command line solution? pvecm only seems to have the option to remove a node.

  • #4

Hi,

we have the same problem in a 4-node cluster, with one node already running kernel 4.13.
Virtual machines sporadically go into HA error state, but they keep running without any problems.

@deniswitt yes, you can set HA states from the command line:
# ha-manager set <service> --state <started|stopped|disabled>
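And for the "more than 10 VMs" case from #3, a quick loop sketch with placeholder VM IDs (the disable step still stops each VM, so plan for that):

# hypothetical list of HA-managed VM IDs stuck in the error state
for id in 100 101 102; do
    ha-manager set vm:$id --state disabled
    ha-manager set vm:$id --state started
done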

  • #5

Hi baerm,

we are using 4.13 as well.

ha-manager doesn't seem to help, as the only working option is "disabled", which stops the machines, which is exactly what we are trying to avoid. Thanks anyway.

  • #6

Hi deniswitt,

What seems to work, apart from disable -> start, is setting the state to "ignored" (not sure if this is possible via the command line, but I would guess so), then migrating the VM and then setting the HA state back to "started" (sketched below). I'm not sure the migration is necessary at all, but that was our workflow at least.

We have another issue with Debian Stretch VMs and migrations, so this didn't work for all our error-state VMs, but the Jessie VMs worked well ;)
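Roughly, that workflow as shell commands; a sketch only, with example service/VM IDs and a placeholder target node (and, as the next reply shows, some versions refuse the "ignored" state for services already in error):

ha-manager set vm:102 --state ignored    # take the service out of HA management
qm migrate 102 <target-node> --online    # optional live migration (plain qm, since HA is ignoring the service)
ha-manager set vm:102 --state started    # hand the service back to HA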

  • #7

Hi baerm,

unfortunately "ignored" doesn't work: service 'vm:102' in error state, must be disabled and fixed first (500)

Same for migration:

Requesting HA migration for VM 111 to node hosting2
service 'vm:111' in error state, must be disabled and fixed first
TASK ERROR: command 'ha-manager migrate vm:111 hosting2' failed: exit code 255

  • #8

Got another VM in "error" this morning, and not due to a backup lock this time.
Is there a place to find a clear log of what happened to end up in an error state? Same as last time, the VM was up and running without any issue...

Thanks.

  • #9

Up …

Same "problem" this morning: a VM in error state. I removed it from HA resources and then added it again in order to avoid a pointless restart (changing the state from error to stopped to started).

Where can I find any clue as to why it was put into the "error" state?

Thanks in advance.

  • #10

It appears to be happening here as well, but only if a VM has some kind of HA state set and we attempt to use the Backup feature.

edit: A kind of hacky "fix" is to remove the affected VMs from HA, then add them back in. Not really an optimal option.


  • #11

Wondering if there are any others that are seeing this issue:

The situation for us is that we have a cluster that is set up with HA, and VMs on the nodes are in an HA group (nothing fancy, default settings). Backups are set up for various VMs (no compression, snapshot) to a storage location on a host outside of the cluster via NFS.

The backups end up running, but afterwards the VMs that were backed up have an HA state that reads "error". Backups that run for VMs that are not in an HA group do not have any issues (no HA state to be broken).

The "fix" for affected VMs is to remove them from their respective HA group and re-add them, which is not a "fix" but a bad band-aid.

Any kind of assistance or further information would be wonderful. Thanks!

  • #12

Is there any log file where we can see what HA is doing and why VMs go into an error state?
Some sort of error log?


  • #13

Got the same issue..

journalctl -u pve-ha-lrm -l was the closest I could find, but it's shallow logging, nothing but exit code 255.
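For what it's worth, these are the other places the HA stack logs to; the same shallow messages in my case, but worth checking:

journalctl -u pve-ha-lrm -u pve-ha-crm --since today   # local resource manager + cluster resource manager units
grep -i 'pve-ha' /var/log/syslog                       # same messages as collected by syslog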

  • #15

I ran into this thread when searching Google. I’ve been trying my best to break our test-bed. The error I was getting was
service 'vm:100' in error state, must be disabled and fixed first.

The fix for me was to remove VM 100's HA entry from Datacenter > HA. After this it was stuck with a migrate lock, which I assume is what killed it in the first place. Obviously at that point the migrate lock was no longer applicable, so I unlocked it from the PVE command line with qm unlock 100.

Once this was done I was able to add the VM back into HA and it booted successfully.
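For reference, roughly the same steps from the shell, as a sketch (VM 100 as above; add a --group option on the last line if the VM belonged to an HA group):

ha-manager remove vm:100                # drop the HA resource entry
qm unlock 100                           # clear the leftover migrate lock
ha-manager add vm:100 --state started   # re-add the VM to HA and start it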


Proxmox — cluster and High Availability tests

Nodes

  1. plmasrv06

    • eno1 → vmbr0 192.168.207.61/24 (admin)

    • eno2 10.10.207.60/24 (data)

  2. plmasrv07

    • eno1 → vmbr0 192.168.207.71/24 (admin)

    • eno2 10.10.207.70/24 (data)

  3. plmasrv08

    • eno1 → vmbr0 192.168.207.81/24 (admin)

    • eno2 10.10.207.80/24 (data)

  1. bellebaie

    • plmasrv06

    • plmasrv07

    • plmasrv08

External NFS server (Debian)

vi /etc/network/interfaces
# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
allow-hotplug enp13s0
iface enp13s0 inet static
        address 192.168.207.100
        netmask 255.255.255.0
        gateway 192.168.207.254

allow-hotplug enp15s0
iface enp15s0 inet static
        address 10.10.207.100
        netmask 255.255.255.0
apt install nfs-kernel-server
mkdir /usr/share/nfs-shared
chmod 757 /usr/share/nfs-shared
vi /etc/exports
/usr/share/nfs-shared/ 10.10.207.60(rw,root_squash)
/usr/share/nfs-shared/ 10.10.207.70(rw,root_squash)
/usr/share/nfs-shared/ 10.10.207.80(rw,root_squash)

# reboot
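A full reboot is not strictly required just to publish the exports; assuming nfs-kernel-server is already running, something like the following is enough:

exportfs -ra                          # re-read /etc/exports and apply it
showmount -e localhost                # check which paths are exported
systemctl status nfs-kernel-server    # confirm the NFS server is running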

Setting up a scheduled backup

  1. Datacenter → Backup → Add

    • Storage : local

    • Selection mode : All

    • Mode : Snapshot
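The GUI job above ends up driving vzdump; a roughly equivalent one-off run on a node would look like this (flag values mapped from the GUI choices, as an assumption):

vzdump --all 1 --mode snapshot --storage local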

Creating VMs

  1. Create a Debian Stretch VM on plmasrv06

    • ID : 100

    • Name : vm1

    • Hard Disk (scsi0) : NFS:100/vm-100-disk-1.qcow2,size=32G

    • Processors : 2 (1 sockets, 2 cores)

    • Memory : 2.00 GiB

    • → Options → Start at boot → Yes

  2. Full clone of VM 100

    • ID : 101

    • Name : vm2

  3. Wait for clone vm2 to be created — OK

  4. Full clone of VM 100

    • ID : 102

    • Name : vm3

  5. Wait for clone vm3 to be created — OK
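The same clones can also be created from the shell; a sketch with the IDs above (--full 1 forces full rather than linked clones):

qm clone 100 101 --name vm2 --full 1
qm clone 100 102 --name vm3 --full 1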

Proxmox

  1. → Datacenter

  2. → Storage

  3. → Add

  4. → “NFS”

    • ID : nfs

    • Server : 10.10.207.100

    • Export : /usr/share/nfs-shared

    • Content : all

    • Nodes : All (No restrictions)

    • Enable : checked

    • Max Backups : 1
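The equivalent storage definition from any node's shell, as a sketch (the content list here is an assumption standing in for the GUI's "all" choice):

pvesm add nfs nfs --server 10.10.207.100 --export /usr/share/nfs-shared \
    --content images,rootdir,iso,vztmpl,backup --maxfiles 1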

Live migration

  • vm1

    • plmasrv06 → plmasrv07 — OK

    • plmasrv07 → plmasrv08 — OK

    • plmasrv08 → plmasrv06 — OK

  • vm2

    • plmasrv06 → plmasrv07 — OK

    • plmasrv07 → plmasrv08 — OK

    • plmasrv08 → plmasrv06 — OK

  • vm3

    • plmasrv06 → plmasrv07 — OK

    • plmasrv07 → plmasrv08 — OK

    • plmasrv08 → plmasrv06 — OK

All VMs are back on plmasrv06.
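The same live migrations can be requested from the shell; a minimal sketch for one hop (the command has to run on the node currently hosting the VM):

qm migrate 100 plmasrv07 --online    # move vm1 to plmasrv07 while it keeps running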

High Availability

  1. Datacenter → HA → Groups → Create

    • ID : g123

    • All nodes

    • no options checked

  2. Datacenter → HA → Resources → Add

    • Add each VM and assign it to the group “g123” (CLI equivalent sketched below)
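A sketch of the equivalent shell commands, using the VM IDs created above:

ha-manager groupadd g123 --nodes plmasrv06,plmasrv07,plmasrv08
ha-manager add vm:100 --group g123
ha-manager add vm:101 --group g123
ha-manager add vm:102 --group g123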

Status

root@plmasrv06:~# ha-manager status
quorum OK
master plmasrv06 (active, Tue Sep 18 11:31:50 2018)
lrm plmasrv06 (active, Tue Sep 18 11:31:51 2018)
lrm plmasrv07 (active, Tue Sep 18 11:31:54 2018)
lrm plmasrv08 (idle, Tue Sep 18 11:31:56 2018)
service vm:100 (plmasrv06, started)
service vm:101 (plmasrv06, started)
service vm:102 (plmasrv06, started)

Tests

  1. Disconnect the admin interface of node plmasrv06

    • The node icon goes from green to red

    • vm1 and vm3 are restarted on plmasrv06 and vm2 on plmasrv07

    • plmasrv06 reboots after 60 s

  2. Reconnect the admin interface of node plmasrv06

    • plmasrv06 is back in the cluster

    • the VMs stay on plmasrv06 and plmasrv07

  3. Migrate the VMs back to plmasrv06 — OK

  4. Remove the VMs from the HA group “g123”

    1. → VM

    2. → More

    3. → Manage HA

    4. → Group : clear “g123”

  5. Delete the HA resources and the group “g123”

    1. → Datacenter

    2. → HA

    3. → Resources

    4. → select the resources

    5. → Remove

root@plmasrv06:~# ha-manager status
quorum OK
master plmasrv07 (active, Tue Sep 18 12:00:05 2018)
lrm plmasrv06 (idle, Tue Sep 18 12:00:12 2018)
lrm plmasrv07 (active, Tue Sep 18 12:00:04 2018)
lrm plmasrv08 (active, Tue Sep 18 12:00:05 2018)
  1. Create HA groups

    1. g1

      • plmasrv06

      • “restricted” checked

      • “no failback” not checked

    2. g2

      • plmasrv07

      • “restricted” not checked

      • “no failback” checked

    3. g3

      • plmasrv08

      • “restricted” checked

      • “no failback” checked

  2. Create HA resources

    • vm1 → g1

    • vm2 → g2

    • vm3 → g3
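For reference, the same groups and resource assignments from the shell, as a sketch (boolean flag values written explicitly):

ha-manager groupadd g1 --nodes plmasrv06 --restricted 1 --nofailback 0
ha-manager groupadd g2 --nodes plmasrv07 --restricted 0 --nofailback 1
ha-manager groupadd g3 --nodes plmasrv08 --restricted 1 --nofailback 1
ha-manager add vm:100 --group g1
ha-manager add vm:101 --group g2
ha-manager add vm:102 --group g3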

vm3 ends up being migrated to plmasrv08.

  1. Disconnect the admin interface of node plmasrv06

    • The node icon goes from green to red

    • vm2 is restarted on plmasrv07

    • vm1 stays on plmasrv06

    • plmasrv06 reboots after 60 seconds

    • vm1 is NOT restarted on plmasrv06

  2. Reconnect the admin interface of node plmasrv06

    • plmasrv06 is back in the cluster

    • vm1 is DOWN on plmasrv06

    • vm2 is UP on plmasrv07

    • vm3 is UP on plmasrv08

  3. Manually start vm1 on plmasrv06

Requesting HA start for VM 100
service 'vm:100' in error state, must be disabled and fixed first
TASK ERROR: command 'ha-manager set vm:100 --state started' failed: exit code 255
  1. → vm1

    1. → More

    2. → Manage HA

    3. Request State : disabled

    4. Start

The “restricted” option confines the VM to its group. If there are not enough nodes left in the group, the service simply stays stopped. This is useful, for example, when a USB dongle acts as the license key for some software (there is no point restarting the VM on another node without the dongle).

  1. Disconnect the admin interface of node plmasrv07

    • The node icon goes from green to red

    • plmasrv07 reboots after 60 seconds

    • vm2 is restarted on plmasrv06

  2. Reconnect the admin interface of node plmasrv07

    • plmasrv07 is back in the cluster

    • vm1 is UP on plmasrv06

    • vm2 is UP on plmasrv06

    • vm3 is UP on plmasrv08

The “no failback” option lets a VM restart on a node that is not part of its group. In that case, however, once a node of the group is healthy again, the VM is not migrated back into the group.

This is useful if, for example, you would rather migrate the VMs back by hand because you are worried about overloading the infrastructure a second time.

  1. Migrate vm2 to plmasrv07 — OK

    • vm1 is UP on plmasrv06

    • vm2 is UP on plmasrv07

    • vm3 is UP on plmasrv08

  1. Disconnect the admin interface of node plmasrv08

    • The node icon goes from green to red

    • plmasrv08 reboots after 60 seconds

    • vm3 stays on plmasrv08

  2. Reconnect the admin interface of node plmasrv08

    • plmasrv08 is back in the cluster

    • vm3 is DOWN on plmasrv08

    • vm1 is UP on plmasrv06

    • vm2 is UP on plmasrv07

  3. Manually start vm3 on plmasrv08

Requesting HA start for VM 103
service 'vm:103' in error state, must be disabled and fixed first
TASK ERROR: command 'ha-manager set vm:103 --state started' failed: exit code 255
  1. → vm3

    1. → More

    2. → Manage HA

    3. Request State : disabled

    4. Start

I cannot imagine an interesting case where “restricted” and “no failback” would both be checked.

  1. Delete all HA resources and groups

  2. Create a group “g123” containing the three nodes

  3. Add each VM to the group “g123”

  4. Migrate vm2 and vm3 to plmasrv06 — OK

  5. Shut down plmasrv06

    • The plmasrv06 icon goes from green to red

    • vm1 and vm3 are restarted on plmasrv07

    • vm2 is restarted on plmasrv08


On startup, the following error appears:

kvm: -drive file=/var/lib/vz/images/100/vm-100-disk-1.qcow2,if=none,id=drive-ide0,format=qcow2,aio=native,cache=none,detect-zeroes=on: file system may not support O_DIRECT

kvm: -drive file=/var/lib/vz/images/100/vm-100-disk-1.qcow2,if=none,id=drive-ide0,format=qcow2,aio=native,cache=none,detect-zeroes=on: could not open disk image /var/lib/vz/images/100/vm-100-disk-1.qcow2: Could not open ‘/var/lib/vz/images/100/vm-100-disk-1.qcow2’: Invalid argument

TASK ERROR: start failed: command ‘/usr/bin/kvm -id 100 -chardev ‘socket,id=qmp,path=/var/run/qemu-server/100.qmp,server,nowait’ -mon ‘chardev=qmp,mode=control’ -vnc unix:/var/run/qemu-server/100.vnc,x509,password -pidfile /var/run/qemu-server/100.pid -daemonize -smbios ‘type=1,uuid=23f1d264-91c8-408e-bb84-9ac94cf529fc’ -name TServer -smp ‘4,sockets=1,cores=4,maxcpus=4’ -nodefaults -boot ‘menu=on,strict=on,reboot-timeout=1000’ -vga cirrus -cpu kvm64,+lahf_lm,+x2apic,+sep -m 10240 -k en-us -cpuunits 1000 -device ‘piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2’ -device ‘usb-tablet,id=tablet,bus=uhci.0,port=1’ -device ‘virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3’ -iscsi ‘initiator-name=iqn.1993-08.org.debian:01:5ed3e1eb441e’ -drive ‘file=/var/lib/vz/template/iso/debian-8.0.0-amd64-DVD-1.iso,if=none,id=drive-ide2,media=cdrom,aio=native’ -device ‘ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200’ -drive ‘file=/var/lib/vz/images/100/vm-100-disk-1.qcow2,if=none,id=drive-ide0,format=qcow2,aio=native,cache=none,detect-zeroes=on’ -device ‘ide-hd,bus=ide.0,unit=0,drive=drive-ide0,id=ide0,bootindex=100’ -netdev ‘type=tap,id=net0,ifname=tap100i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown’ -device ‘e1000,mac=E6:03:E9:F0:74:81,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300» failed: exit code 1

The console shows:

root@pve:~# kvm

Could not initialize SDL(No available video device) — exiting

root@pve:~# lspci | grep VGA

07:0b.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 10)

The server is a DELL PowerEdge 1100. Has anyone else run into this problem?
