Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ansible goes into hung state on windows servers #564

Open
GDMTT opened this issue Oct 20, 2023 · 13 comments
Open

Ansible goes into hung state on windows servers #564

GDMTT opened this issue Oct 20, 2023 · 13 comments

Comments

@GDMTT
Copy link

GDMTT commented Oct 20, 2023

SUMMARY

When running a Job against between 1 and 20 Windows servers will hang on one or two servers. This causes the job to get stuck on that task waiting for the hung server.
I have found that the process is not entirely hung on the Windows server. using TCP Dump I can see the communication between the hung process and the Ansible server, slowly going back an forth, but no real activity to say something is happening. Killing all the process running under the Ansible service account or rebooting the server fails the task for that windows server and the job continues as per normal.

I originally thought it was related to the recourses available on the Windows server at the time when Ansible connected but when I last investigated, and I logged onto the servers there was nothing I could see that would cause this hung state.

I'm struggling to figure out where this issue as I can't recreate it on demand.

There is nothing useful in the event log that I can see.

Another observation that may or may not be related is that there were Ansible processes running on a couple of servers many days after the pervious round of patching. I resume these where hung jobs that weren't resolved when the jobs were canceled by the engineer. At that time I had not fund the TCP Dump method of identifying which server was the hung server.

ISSUE TYPE
  • Bug Report
COMPONENT NAME

ansible.windows

ANSIBLE VERSION
bash-5.1$ ansible --version
ansible [core 2.15.5]
  config file = None
  configured module search path = ['/runner/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.9/site-packages/ansible
  ansible collection location = /runner/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.9.17 (main, Aug  9 2023, 00:00:00) [GCC 11.4.1 20230605 (Red Hat 11.4.1-2)] (/usr/bin/python3)
  jinja version = 3.1.2
  libyaml = True
bash-5.1$

COLLECTION VERSION
bash-5.1$ ansible-galaxy collection list

# /usr/share/ansible/collections/ansible_collections
Collection              Version
----------------------- -------
amazon.aws              6.5.0
ansible.posix           1.5.4
ansible.windows         2.1.0
awx.awx                 23.2.0
azure.azcollection      1.18.1
community.vmware        3.10.0
google.cloud            1.2.0
kubernetes.core         2.4.0
openstack.cloud         2.1.0
ovirt.ovirt             3.2.0
redhatinsights.insights 1.0.8
theforeman.foreman      3.14.0
bash-5.1$

CONFIGURATION
bash-5.1$ ansible-config dump --only-changed
CONFIG_FILE() = None
bash-5.1$

OS / ENVIRONMENT

Windows Server 2016, 2019, 2022, 2012R2

STEPS TO REPRODUCE

The play book below is the one where we see this issue the most, but it's the one we use the most.
On the last round of patching job was stuck on the task "Ping host to wake up" and for another Job template it was also on the first. But I dont remember this always being the cases. In the past jobs have been found stuck further along.

Variables set on the inventory
---
ansible_connection: winrm
ansible_winrm_server_cert_validation: ignore
ansible_winrm_transport: kerberos
ansible_port: 5986
ansible_winrm_scheme: https

Playbook 
---
 - hosts: all
   tasks:
   - name: Ping host to wake up
     ansible.windows.win_ping:

   - name: Reboot - 3600s timeout
     ansible.windows.win_reboot:
       reboot_timeout: 3600

   - name: Check for missing updates
     win_updates:
       state: searched
     register: update_results

   - name: report update results
     debug:
       msg: |
         {% for k in update_results.updates %}
         {{ update_results.updates[k].title }}
         {% endfor %}         

   - name: Apply Security, Critical updates, Update Rollups log to C:\ansible_wu.txt
     ansible.windows.win_updates:
       category_names:
         - SecurityUpdates
         - CriticalUpdates
         - UpdateRollups
         - Updates
         - DefinitionUpdates		
         - MicrosoftDefenderAntivirus
       reject_list:
         - KB2267602
       reboot: yes
       log_path: C:\ansible_wu.txt

   - name: Check for missing updates
     win_updates:
       state: searched
     register: update_results

   - name: report update results
     debug:
       msg: |
         {% for k in update_results.updates %}
         {{ update_results.updates[k].title }}
         {% endfor %}         

   - name: Reboot - 3600s timeout
     ansible.windows.win_reboot:
       reboot_timeout: 3600

EXPECTED RESULTS

The job doesn't get blocked and hung server is failed

ACTUAL RESULTS

No verbose logs at this time.


@GDMTT
Copy link
Author

GDMTT commented Oct 20, 2023

I managed to get a server to do it and found the following in the PowerShell operational event log :

9:40:10pm
Error Message = System error.

Context:
Severity = Warning
Host Name = Default Host
Host Version = 5.1.14409.2001
Host ID = bb89f310-f4c7-48f6-a2cf-51f450d1c837
Host Application = C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -NoProfile -NonInteractive -ExecutionPolicy Unrestricted -EncodedCommand JgBjAGgAYwBwAC4AYwBvAG0AIAA2ADUAMAAwADEAIAA+ACAAJABuAHUAbABsAAoAaQBmACAAKAAkAFAAUwBWAGUAcgBzAGkAbwBuAFQAYQBiAGwAZQAuAFAAUwBWAGUAcgBzAGkAbwBuACAALQBsAHQAIABbAFYAZQByAHMAaQBvAG4AXQAiADMALgAwACIAKQAgAHsACgAnAHsAIgBmAGEAaQBsAGUAZAAiADoAdAByAHUAZQAsACIAbQBzAGcAIgA6ACIAQQBuAHMAaQBiAGwAZQAgAHIAZQBxAHUAaQByAGUAcwAgAFAAbwB3AGUAcgBTAGgAZQBsAGwAIAB2ADMALgAwACAAbwByACAAbgBlAHcAZQByACIAfQAnAAoAZQB4AGkAdAAgADEACgB9AAoAJABlAHgAZQBjAF8AdwByAGEAcABwAGUAcgBfAHMAdAByACAAPQAgACQAaQBuAHAAdQB0ACAAfAAgAE8AdQB0AC0AUwB0AHIAaQBuAGcACgAkAHMAcABsAGkAdABfAHAAYQByAHQAcwAgAD0AIAAkAGUAeABlAGMAXwB3AHIAYQBwAHAAZQByAF8AcwB0AHIALgBTAHAAbABpAHQAKABAACgAIgBgADAAYAAwAGAAMABgADAAIgApACwAIAAyACwAIABbAFMAdAByAGkAbgBnAFMAcABsAGkAdABPAHAAdABpAG8AbgBzAF0AOgA6AFIAZQBtAG8AdgBlAEUAbQBwAHQAeQBFAG4AdAByAGkAZQBzACkACgBJAGYAIAAoAC0AbgBvAHQAIAAkAHMAcABsAGkAdABfAHAAYQByAHQAcwAuAEwAZQBuAGcAdABoACAALQBlAHEAIAAyACkAIAB7ACAAdABoAHIAbwB3ACAAIgBpAG4AdgBhAGwAaQBkACAAcABhAHkAbABvAGEAZAAiACAAfQAKAFMAZQB0AC0AVgBhAHIAaQBhAGIAbABlACAALQBOAGEAbQBlACAAagBzAG8AbgBfAHIAYQB3ACAALQBWAGEAbAB1AGUAIAAkAHMAcABsAGkAdABfAHAAYQByAHQAcwBbADEAXQAKACQAZQB4AGUAYwBfAHcAcgBhAHAAcABlAHIAIAA9ACAAWwBTAGMAcgBpAHAAdABCAGwAbwBjAGsAXQA6ADoAQwByAGUAYQB0AGUAKAAkAHMAcABsAGkAdABfAHAAYQByAHQAcwBbADAAXQApAAoAJgAkAGUAeABlAGMAXwB3AHIAYQBwAHAAZQByAA==
Engine Version = 5.1.14409.2001
Runspace ID = e65587ed-574d-4ebb-b127-e0bd27b9533c
Pipeline ID = 1
Command Name =
Command Type =
Script Name =
Command Path =
Sequence Number = 1000
User = GALLAGHER1\ServiceAnsible_SRV
Connected User =
Shell ID = Microsoft.PowerShell

User Data:

9:40:11pm
Error Message = System error.

Context:
Severity = Warning
Host Name = Default Host
Host Version = 5.1.14409.2001
Host ID = 66d671f6-bad9-4f9c-a952-5ec8ceda5256
Host Application = C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -NoProfile -NonInteractive -ExecutionPolicy Unrestricted -EncodedCommand JgBjAGgAYwBwAC4AYwBvAG0AIAA2ADUAMAAwADEAIAA+ACAAJABuAHUAbABsAAoAaQBmACAAKAAkAFAAUwBWAGUAcgBzAGkAbwBuAFQAYQBiAGwAZQAuAFAAUwBWAGUAcgBzAGkAbwBuACAALQBsAHQAIABbAFYAZQByAHMAaQBvAG4AXQAiADMALgAwACIAKQAgAHsACgAnAHsAIgBmAGEAaQBsAGUAZAAiADoAdAByAHUAZQAsACIAbQBzAGcAIgA6ACIAQQBuAHMAaQBiAGwAZQAgAHIAZQBxAHUAaQByAGUAcwAgAFAAbwB3AGUAcgBTAGgAZQBsAGwAIAB2ADMALgAwACAAbwByACAAbgBlAHcAZQByACIAfQAnAAoAZQB4AGkAdAAgADEACgB9AAoAJABlAHgAZQBjAF8AdwByAGEAcABwAGUAcgBfAHMAdAByACAAPQAgACQAaQBuAHAAdQB0ACAAfAAgAE8AdQB0AC0AUwB0AHIAaQBuAGcACgAkAHMAcABsAGkAdABfAHAAYQByAHQAcwAgAD0AIAAkAGUAeABlAGMAXwB3AHIAYQBwAHAAZQByAF8AcwB0AHIALgBTAHAAbABpAHQAKABAACgAIgBgADAAYAAwAGAAMABgADAAIgApACwAIAAyACwAIABbAFMAdAByAGkAbgBnAFMAcABsAGkAdABPAHAAdABpAG8AbgBzAF0AOgA6AFIAZQBtAG8AdgBlAEUAbQBwAHQAeQBFAG4AdAByAGkAZQBzACkACgBJAGYAIAAoAC0AbgBvAHQAIAAkAHMAcABsAGkAdABfAHAAYQByAHQAcwAuAEwAZQBuAGcAdABoACAALQBlAHEAIAAyACkAIAB7ACAAdABoAHIAbwB3ACAAIgBpAG4AdgBhAGwAaQBkACAAcABhAHkAbABvAGEAZAAiACAAfQAKAFMAZQB0AC0AVgBhAHIAaQBhAGIAbABlACAALQBOAGEAbQBlACAAagBzAG8AbgBfAHIAYQB3ACAALQBWAGEAbAB1AGUAIAAkAHMAcABsAGkAdABfAHAAYQByAHQAcwBbADEAXQAKACQAZQB4AGUAYwBfAHcAcgBhAHAAcABlAHIAIAA9ACAAWwBTAGMAcgBpAHAAdABCAGwAbwBjAGsAXQA6ADoAQwByAGUAYQB0AGUAKAAkAHMAcABsAGkAdABfAHAAYQByAHQAcwBbADAAXQApAAoAJgAkAGUAeABlAGMAXwB3AHIAYQBwAHAAZQByAA==
Engine Version = 5.1.14409.2001
Runspace ID = 7f03471e-5fea-4d23-920a-f86276f1dee3
Pipeline ID = 1
Command Name =
Command Type =
Script Name =
Command Path =
Sequence Number = 1157
User = DOMAIN\Ansible
Connected User =
Shell ID = Microsoft.PowerShell

User Data:

look like it gets stuck in a loop of starting up powershell then erroring and staring again.

@jborean93 jborean93 transferred this issue from ansible-collections/community.windows Oct 23, 2023
@jborean93
Copy link
Collaborator

I've transferred this to the ansible.windows collection as that is where the win_updates module lives and is potentially the cause behind this. The win_updates module is quite complicated and doesn't follow the traditional way that modules are executed for various reasons. Essentially what happens is that the action plugin kicks off the module which spawns a new PowerShell process in the background. The action plugin will then repeatedly run the module again to get the intermediate output which is why it looks like it's in a loop of starting powershell again and again.

What you can do is set the log_path option to a location on the remote host. This will be able to give you a lot more information on the progress of the task and to see whether it is stuck or not. Keep in mind the Windows Updates agent can take a really long time depending on the updates but if you can share the log/debug output from a task that seems stuck that will very much help try to narrow down what might be the problem.

@GDMTT
Copy link
Author

GDMTT commented Oct 23, 2023

@jborean93 I'm not 100% convinced this is purely a win_update as I have seen the same thing happen gathering facts.

@jborean93
Copy link
Collaborator

I would certainly try and narrow it down if you can, knowing whether it's something with the complex win_updates module or just something else like win_ping will certainly help try to identify where the issue might be. We can also further narrow it down to see if it fails with - raw: echo "hi" as a test. Right now we are not aware of any outstanding bugs that cause a hang on the connection so will be reliant on yourself to provide a reproducer or the information needed sorry.

@ste990
Copy link

ste990 commented Jan 2, 2024

I am getting this with 2 servers out of around 20.
When it happens, the server hangs and there are no event logs from around when the scheduled job runs.
I'll try the log_path parameter you have recommended and see if it gives us more info the next time it happens.
With myself, once you restart the problematic machine it's fine, but comes back again after maybe 1-2 weeks (1 run per day).

All I'm doing on the remote machine is below:

  • name: Search-only, ruturn a list
    ansible.windows.win_updates:
    state: searched
    register: updates

I'm running the below:
ansible [core 2.13.10]
python version = 3.8.10 (default, Nov 22 2023, 10:22:35) [GCC 9.4.0]
jinja version = 3.1.2

Collection Version


ansible.eda 1.4.2
ansible.windows 2.0.0

@ste990
Copy link

ste990 commented Jan 7, 2024

Still getting this issue. Added in the log_path parameter, when a machine hangs it doesn't give out any logs, or create a log file.

It seems like Ansible establishes a winrm connection, and then initiates it's task to run the update search. I haven't been able to see where it hangs, I'm guessing it's earlier in the process given it's not created a log file yet.

I'm using this via gitlab, but have seen the same when running it directly on the command line.

I will setup another job for win_ping and run it on a schedule to see if the issue comes back. Then will move on to getting the return of a command.

For reference, I've built a new server and get the same results:

Ubuntu 22.04
Python 3.10
ansible core 2.16.2
ansible.windows collection 2.2.0

@ste990
Copy link

ste990 commented Jan 8, 2024

I can confirm that this issue still occurs when running win_ping, just happened to 3 servers across 20.

@ste990
Copy link

ste990 commented Jan 8, 2024

On my test server I noticed the server was running with around 400-500MB free in RAM. I am running zabbix proxy on this and using it in a centralized monitoring system with Ansible/Gitlab being used for orchestration. Zabbix uses memory for caching, and multiple pollers which can utilize alot of memory.

I checked on the server where the issue originated, and I can see it only has 160 MB of RAM free.

I upped the test server to 12GB RAM and the win pings seem to run fine for 11 consecutive runs. I have done a few rounds of searching updates which have went successfully. I've created a schedule to run this every 30 minutes to see if the issue happens again.

@ste990
Copy link

ste990 commented Jan 8, 2024

I have run the search updates 10 consecutive times without any issues and a noticeable performance increase.

This issue seems to occur when ansible is running, and the local server it's running on has swapping enabled.

Although memory free was low, memory available was sufficient, but it seemed that it was going to swap rather than taking back from the buffer/cache.

I'm at a loss as to why this would cause a remote windows server to hang.

@jborean93 - Sorry to ping directly, wonder if you could shed some light on this? I think it must be related to one of the ps scripts it passes in initially which I think acts as a pipe for ansible to send more scripts in?

@ste990
Copy link

ste990 commented Jan 9, 2024

Issue came back this morning with no memory constraints.

@ste990
Copy link

ste990 commented Jan 9, 2024

processes are still in place on the ansible box. I done an strace on these yesterday and it looked like they were just awaiting for input from the remote server.

@ste990
Copy link

ste990 commented Jan 9, 2024

ansible-search-updates-2024-01-09-04-03-10.log

I have managed to get the logs from one of the servers it failed on, which didn't completely hang but we can't RDP to it and some applications (like event viewer) crash when you try to run them.

This happens after a specific period of time/attempts to the same selection of servers. Still trying to find out what the commonality is between them that's causing this.

@jborean93
Copy link
Collaborator

Unfortunately I can't really help too much as this is something specific to your environment. We are reliant on the WinRM service working properly with things like the operational timeout to avoid problems like this. If they aren't then there's not much we can do about it sorry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants