Linux deployment from template by Ansible not working. Network in disconnected state. #1991

Open
Vibhanshuj49 opened this issue Feb 1, 2024 · 30 comments

Comments

@Vibhanshuj49

Hi,

I have been trying to deploy a Linux VM from a template with the playbook below. The deployment itself succeeds, but the VM does not connect to the network: the adapter stays in the disconnected state.
On the VM console I see "A start job is running for wait for Network to be configured", which eventually times out.

As part of troubleshooting we tried the following, but nothing worked. Any suggestions on how to fix this?

  1. Tried with Ubuntu and CentOS images.
  2. Upgraded the Ansible version.
  3. Tried with the perl and curl packages.

- name: Clone a virtual machine from Linux template and customize
  community.vmware.vmware_guest:
    hostname: "{{ vcenter_ip }}"
    username: "{{ vcenter_username }}"
    password: "{{ vcenter_password }}"
    datacenter: "{{ vcenter_datacenter }}"
    validate_certs: false
    state: poweredon
    folder:
    template: "{{ vm_templet }}"
    name: "{{ vm_name }}"
    cluster: "{{ vcenter_cluster }}"
    networks:
      - name:
        start_connected: True
        connected: True
        device_type: vmxnet3
        ip:
        netmask:
        gateway:
        type: static
        dns_servers:
    wait_for_ip_address: true
    customization:
      hostname:
      dns_servers:
        -
        -
      dns_suffix:
        -
      timezone: 165
      script_text: |
        #!/bin/bash
        touch /tmp/touch-from-playbook
  delegate_to: localhost

@aydinguven

Can you try installing VMware Tools or open-vm-tools on the template?

@MaximilianClemens
Contributor

Do you use distributed vSwitches and standard vSwitches in parallel? I have noticed problems when port groups with the same name exist on different switches.

@FireHelmet

FireHelmet commented Feb 27, 2024

Hello,

I am encountering the same problem. My VM runs Rocky Linux, and VMware Tools is already pre-installed in my template.

I tried with and without these options:
connected: true
start_connected: true

I don't have this issue with Windows templates. All of my VMs use the same vSwitch, which is a distributed vSwitch.

I use version 4.1.0 of the collection and ansible-core 2.16.3.

Thank you for your support.

@FireHelmet

Hi @mariolenz ,

May I ask for your help, please?

Thank you!

@ihumster
Collaborator

@FireHelmet Try some simple experiments to determine the scope of the problem:

  1. Create a vSphere Standard Switch (VSS) on the same host, without uplinks, and create a port group with a simple name on it (for example "test_pg").
  2. Run your playbook with a task based on the community.vmware.vmware_guest module, with a specific host, switch, and port group selected.

If the VM is created from the template and connects to the network, the problem is probably not in the template and, PROBABLY, not in the module. To rule the module out (or confirm it as the cause), it may make sense to create a separate port group in your VDS for further testing.
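
The test switch and port group from step 1 could also be created with Ansible. This is only a minimal sketch, not something posted in this thread: it assumes the community.vmware.vmware_vswitch and community.vmware.vmware_portgroup modules, and the names vSwitchTest and test_pg as well as the MY_* values are placeholders.

```yaml
# Sketch only: placeholder names, adjust to your environment.
- name: Create an isolated standard vSwitch and port group for testing
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Create a standard vSwitch without uplinks (no nics given)
      community.vmware.vmware_vswitch:
        hostname: MY_VCENTER
        username: MY_USER
        password: MY_PASSWORD
        validate_certs: false
        esxi_hostname: MY_ESX_HOST
        switch: vSwitchTest        # hypothetical switch name
        state: present

    - name: Create a simply named port group on that vSwitch
      community.vmware.vmware_portgroup:
        hostname: MY_VCENTER
        username: MY_USER
        password: MY_PASSWORD
        validate_certs: false
        hosts:
          - MY_ESX_HOST
        switch: vSwitchTest
        portgroup: test_pg         # name suggested in step 1
        vlan_id: 0                 # no VLAN tagging
        state: present
```

Because the switch has no uplinks, a VM attached to test_pg has no external connectivity; the only thing being tested is whether the adapter shows up as connected.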

@FireHelmet

Hello @ihumster ,

Thank you for your quick answer.

I tested what you proposed:

  1. Created a new port group named 'test' on a standard vSwitch on one of my hosts in the same cluster.
  2. Deployed a VM from the same template with the same playbook, except that I added the key for the ESXi host where the new port group was created and set the name of the new port group.
  3. The VM was deployed correctly and assigned to the new port group, but the network card is still disconnected.

Please find the evidence below:
image

image

Please find the playbook I used for the deployment, and the result, below:

- hosts: all
  gather_facts: yes
  connection: local
  environment:
    VMWARE_VALIDATE_CERTS: false
      
  tasks:
    - name: Deploy new virtual machine from template '${option.vm-tpl}'
      community.vmware.vmware_guest:
        hostname: MY_VCENTER
        username: MY_USER
        password: MY_PASSWORD
        datacenter: MY_DC
        esxi_hostname: MY_ESX_HOST
        folder: /
        state: poweredon
        name: FROT998
        template: TPL-Linux_Rocky_9.3
        datastore: DS003
        hardware:
          memory_mb: 4096
          num_cpus: 2
          num_cpu_cores_per_socket: 1
          version: 20
        networks:
          - name: test
            connected: true
            start_connected: true
      delegate_to: localhost

Thank you for your support,

@ihumster
Collaborator

ihumster commented Mar 1, 2024

@FireHelmet Hmm. Why is the "Connect At Power On" checkbox disabled? Does the template have the same setting? If so, then this is the cause of the error.

@FireHelmet

I don't know why this checkbox is disabled on the Linux VM. I don't have the issue with Windows deployments.

No, the VM used as the template has the checkbox enabled. (I converted the template back to a VM to show you this setting, because it can't be viewed or changed while the VM is a template.)

image

@ihumster
Collaborator

ihumster commented Mar 1, 2024

@FireHelmet

Thanks @ihumster ,

I tried adding ethernet0.startConnected = "TRUE", but the option disappears after I convert the VM back to a template, even though the checkbox is still enabled when I convert it to a VM again.

Also, I'm not using any customization template.

As a last test I deployed my template manually from vCenter, and there the checkbox is enabled as expected, so I think the problem comes from the Ansible collection or pyVmomi.

See the screenshots of the manual deployment of FROT997 below:

image

image

What's your opinion?

Many thanks for your support

@ihumster
Collaborator

ihumster commented Mar 1, 2024

@FireHelmet One more thing to check: can you try adding the allow_guest_control property to the networks section of your playbook and setting it to false?

        networks:
          - name: test
            connected: true
            start_connected: true
            allow_guest_control: false

@FireHelmet

@ihumster ,

Still disconnected,

image

The playbook I used,

- hosts: all
  gather_facts: yes
  connection: local
  environment:
    VMWARE_VALIDATE_CERTS: false
      
  tasks:
    - name: Deploy new virtual machine from template '${option.vm-tpl}'
      community.vmware.vmware_guest:
        hostname: MY_VCENTER
        username: MY_USER
        password: MY_PASSWORD
        datacenter: MY_DC
        esxi_hostname: MY_ESX_HOST
        folder: /
        state: poweredon
        name: FROT998
        template: TPL-Linux_Rocky_9.3
        datastore: DS008
        hardware:
          memory_mb: 4096
          num_cpus: 2
          num_cpu_cores_per_socket: 1
          version: 20
        networks:
          - name: test
            connected: true
            start_connected: true
            allow_guest_control: false
      delegate_to: localhost
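
To compare what vCenter reports for the adapter after each run without relying on screenshots, the NIC state could be queried from the same play. This is a minimal sketch, assuming the community.vmware.vmware_guest_network module and its gather_network_info option; the registered variable name nic_state is arbitrary.

```yaml
# Sketch only: tasks to append after the deployment task above.
- name: Gather network adapter information for the deployed VM
  community.vmware.vmware_guest_network:
    hostname: MY_VCENTER
    username: MY_USER
    password: MY_PASSWORD
    validate_certs: false
    datacenter: MY_DC
    name: FROT998
    gather_network_info: true
  delegate_to: localhost
  register: nic_state            # arbitrary variable name

- name: Show what vCenter reports for each adapter
  ansible.builtin.debug:
    var: nic_state
```

The output should include the connected and start-connected flags per adapter, which makes it easier to see whether the values passed in networks were applied at all.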

@MaximilianClemens
Contributor

Hello @FireHelmet,

Can you check whether there are events like "customization started" on the deployed VM (in the UI under Monitor > Events)?
Can you also log in to the console of the deployed VM and check /var/log for cloud-init logs generated after the cloning?

And as a last test, can you try just this:

networks:
  - name: SRV-LAN

That is, remove connected, start_connected, and allow_guest_control. When those settings are set, the module seems to do this:

if nic_change_detected:
    # Change to fix the issue found while configuring opaque network
    # VMs cloned from a template with opaque network will get disconnected
    # Replacing deprecated config parameter with relocation Spec
    if isinstance(net_obj, vim.OpaqueNetwork):
        self.relospec.deviceChange.append(nic)
    else:
        self.configspec.deviceChange.append(nic)
    self.change_detected = True

Maybe that is related to this issue?

Regards
Maximilian

@FireHelmet

FireHelmet commented Mar 1, 2024

Hello @MaximilianClemens ,

Yes, please see below the events of a newly created VM, deployed from the same template with the same settings except for the name:

image

Also, there is no cloud-init log, because I don't use the customization feature of vSphere:

image

As for the test with only - name: SRV-LAN, I had already tried it and got the same result, unfortunately.

What is an "opaque network"?

Thank you for your support too.

@MaximilianClemens
Contributor

MaximilianClemens commented Mar 1, 2024

Is there anything under Events in the vSphere UI?

I don't know what an opaque network is, but my theory was that something triggers a customization even when it is not wanted, and that this customization fails. That failure would result in a disconnected adapter.

@FireHelmet

FireHelmet commented Mar 1, 2024

@MaximilianClemens ,

No, nothing related to "customization". As I wrote, the issue doesn't appear when I deploy the VM from the same template manually from the vCenter UI. There are also no issues with Windows templates.

For now I use a workaround: a PowerShell script using PowerCLI, run on a Windows host through this Ansible playbook:

- hosts: all
  gather_facts: no
  tasks:
  - name: Connect the Network Interface of '${option.vm-name}'
    ansible.windows.win_powershell:
      script: |
        Set-PowerCLIConfiguration -InvalidCertificateAction Ignore -ParticipateInCEIP $false -Confirm:$false | out-null
        Connect-VIServer -Server ${option.vcenter-hostname} -User ${option.vcenter-username} -Password ${option.vcenter-password}
        Get-VM ${option.vm-name} | Get-NetworkAdapter | Set-NetworkAdapter -StartConnected:$true -Connected:$true -Confirm:$false
        Disconnect-VIServer -Server ${option.vcenter-hostname} -Confirm:$false

I use Rundeck on top of Ansible, which is why the variables have the format ${option.xxx}.
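
An Ansible-only variant of this workaround might look like the sketch below. It is an assumption, not something tested in this thread: it uses the community.vmware.vmware_guest_network module to reconfigure the existing adapter after deployment, and "Network adapter 1" is assumed to be the label of the first NIC. Whether this module succeeds where vmware_guest leaves the adapter disconnected has not been confirmed here.

```yaml
# Sketch only: run after the VM has been deployed.
- name: Connect the first network adapter of the deployed VM
  community.vmware.vmware_guest_network:
    hostname: MY_VCENTER
    username: MY_USER
    password: MY_PASSWORD
    validate_certs: false
    datacenter: MY_DC
    name: FROT998
    label: "Network adapter 1"   # assumed label of the existing NIC
    network_name: test           # port group the NIC is attached to
    connected: true
    start_connected: true
    state: present
  delegate_to: localhost
```

This would avoid the dependency on a Windows host with PowerCLI, at the cost of a second task after every deployment.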

@ihumster
Collaborator

ihumster commented Mar 2, 2024

@MaximilianClemens FYI
"Opaque network" is a term from NSX-T. It is used for port groups that NSX-T Manager creates on top of an N-VDS (current NSX-T versions no longer use them; they exist for compatibility).

@ihumster
Collaborator

ihumster commented Mar 2, 2024

@FireHelmet I guess we'll have to dive into some deeper debugging. Please deploy a new VM from this template and share the VM log, vmware.log, from the VM's directory on the datastore (please just post it on Pastebin, for example).

@FilipFabicevic

> I tried to add ethernet0.startConnected = "TRUE" but the option disappear after reconverting the VM to template even if when I reconvert again to VM the checkbox is still enabled.

Where did you add this? I would like to try it.
BTW, I got this working by installing cloud-init on the template, but our company does not use cloud-init in production, so I have to find a workaround.

@FireHelmet

Hello @ihumster ,

Please find the log here https://pastebin.com/w13XP186 . The retention is 1 month.
I sent the password for accessing this link by email to [email protected]

Thank you very much for your support

@FireHelmet

Hello @FilipFabicevic ,

I added the key/value to the .vmx file of the VM. But as I said, this fix doesn't work, and not only for me.

@ihumster
Collaborator

ihumster commented Mar 8, 2024

@FireHelmet I looked at the log and didn't see anything relevant to the problem. Perhaps you need to dig further and look at the part of hostd.log from the ESXi server that covers the VM startup.

For convenience, you can move all the VMs off one of the hosts (switch DRS to Manual mode), then run the playbook (deploying the VM to this host rather than to the cluster) and watch hostd.log at the same time.
Perhaps the reason will be visible there. Judging by the VM startup logs, the cause is not in the VM but in the vSphere infrastructure.

@FireHelmet

@ihumster ,

Please find the hostd.log here https://pastebin.com/6qPL78hE . The retention is 1 month.
I sent the password for accessing this link by email to [email protected]

I extracted only the log entries around the VM ID and/or the name of the VM. I hope this log will help you.

Thank you very much for your support

@ihumster
Collaborator

ihumster commented Mar 8, 2024

@FireHelmet Please also add vmkernel.log from the host for the same time period.

@FireHelmet

@ihumster

Please find the vmkernel.log here https://pastebin.com/WDYYsBuQ. The retention is 1 month.
I sent the password for accessing this link by email to [email protected]

I extracted only the log entries around the VM ID and/or the name of the VM. I hope this log will help you.

Thank you very much for your support

@ihumster
Collaborator

ihumster commented Mar 8, 2024

@FireHelmet Either the logs are incomplete, or there is nothing in them about the VM's network adapter. You extracted the logs around the VM ID/name, but we need more logs about dvportgroup-66 of your DSwitch.

@djvujke

djvujke commented Mar 14, 2024

I have the same network names on a standard switch and on a DSwitch. So when I want to change the adapter with this task:

- name:  Changing network adapter  
  vmware_guest:
    <<: *vmware_connection
    name: "{{ my_vm.vm_name }}"
    networks:
    - name: "{{ my_vm.vm_network }}"
      ip: "{{ my_vm.vm_ip_address }}"
      netmask: "{{ my_vm.vm_netmask }}"
      state: present
      start_connected: true
      connected: true
      dvswitch_name: "DSwitch" 

the task fails:
TASK [Changing network adapter] **************************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Failed to connect virtual device ethernet0. ", "op": "reconfig"}

I do set dvswitch_name to my switch's name, but I'm not sure the module really selects the network from the dvswitch; it seems to take the one from the standard switch instead.
When I select the network manually, I pick the one from the dvswitch and the change works.

vSphere 7.0.3 U3

@FireHelmet

Hey @djvujke ,

Please open a new issue because it's not the same topic.

Thanks

@jbertozzi

Hello,

Just came here to say that we have been encountering the same issue since we migrated from vSphere 7 to vSphere 8.0.2 (build 23319993).

I found a similar issue on the VMware forums. Take a look at this comment: https://communities.vmware.com/t5/vCenter-Server-Discussions/deployed-VM-from-template-but-NIC-is-disconnected/m-p/2977937/highlight/true#M94606

We are currently testing adding the following configuration to the template:

cat /etc/vmware-tools/tools.conf
[deployPkg]
enable-custom-scripts = true

I will keep you updated.
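
If the template is managed with Ansible, the tools.conf setting above could be applied with a task along these lines. This is a minimal sketch, assuming the community.general.ini_file module is available and that the task runs inside the template VM (converted back to a VM and reachable over SSH). Whether the setting actually resolves the disconnected NIC is not confirmed in this thread.

```yaml
# Sketch only: run against the template VM itself, not against vCenter.
- name: Enable custom scripts for deployPkg in the VMware Tools config
  community.general.ini_file:
    path: /etc/vmware-tools/tools.conf
    section: deployPkg
    option: enable-custom-scripts
    value: "true"
    mode: "0644"
  become: true
```

The VM then has to be converted back to a template before the next deployment so that clones pick up the change.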

Regards,

@runejuhl

FYI I had some apparently similar issues.

I made a hacky workaround by changing the network interfaces to start disconnected when provisioning, and connecting them after the VM was created. This seemed to work nicely, and might work for you as well.

In the end it turned out that my issue was caused by the VM template not having Netplan installed, and VMware expecting it to be available and failing customization because of this. The symptoms were similar enough that I thought I had the same issue as y'all 😅
