Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What is the way to go if you expect the connection to devices in the testbed file to fail? #233

Open
JanisFalkenhagenWork opened this issue Apr 3, 2024 · 24 comments
Assignees

Comments

@JanisFalkenhagenWork
Copy link

JanisFalkenhagenWork commented Apr 3, 2024

Hey there,

i am currently developing some tests for my company's infrastructure and because it's a pretty much constantly changing network, failed connects are expected and do not indicate wrong behaviour.

If i then use a testbed which contains all devices, the typical testbed.connect() and further usage of the devices inside of that testbed fails.
See the example below for a quick demo.

[...]
class CommonSetup(aetest.CommonSetup):
    @aetest.subsection
    def connect_to_devs(self, testbed):
        testbed.connect() 
# Here a log message is written to the cli containing the traceback of an internal exception that happens when connection to the device. It 
#is NOT propagated to this context. Otherwise, i could remove the device dynamically from the testbed. It looks like the remaining 
#connections still don't get established. See also Issue #226.

class Testcase(aetest.Testcase):
    @aetest.setup
    def setup(self, testbed):
        self.results = testbed.parse("<sh something>")

# Now when i run this, i get an error:
#3299:  Caught an exception while executing section setup:
#3300:  Traceback (most recent call last):
#3301:    File "TESTSDIR/TEST_SCRIPT_NAME", line LineNo, in setup
#3302:      self.results = testbed.parse(f"<sh something>")
#3303:                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
#3304:    File "src/pyats/topology/testbed.py", line 816, in pyats.topology.testbed.Testbed.parse
#3305:   TypeError: 'NoneType' object is not iterable

    @aetest.test
    def test(self):
        [...]

Is there a way for me to have the testbed.parse command only operate on devices, that have been connected?
Is there a way to force connect to all devices, even though there are some devices that can't be connected to atm?
Is there a way for me to see if a device inside the testbed isn't connected successfully, so i can remove it dynamically?

Is this just bad test design on my part?

I'm looking forward to your answers and thank you for your time. :D

Have a nice day
Janis

@JanisFalkenhagenWork
Copy link
Author

Ahh and im on the current version:

You are currently running pyATS version: 24.2
Python: 3.11.5 [64bit]

Package Version


genie 24.2
genie.libs.clean 24.2
genie.libs.conf 24.2
genie.libs.filetransferutils 24.2
genie.libs.health 24.2
genie.libs.ops 24.2
genie.libs.parser 24.2
genie.libs.robot 24.2
genie.libs.sdk 24.2
genie.telemetry 24.2
genie.trafficgen 24.2
pyats 24.2
pyats.aereport 24.2
pyats.aetest 24.2
pyats.async 24.2
pyats.connections 24.2
pyats.contrib 24.2
pyats.datastructures 24.2
pyats.easypy 24.2
pyats.kleenex 24.2
pyats.log 24.2
pyats.reporter 24.2
pyats.results 24.2
pyats.robot 24.2
pyats.tcl 24.2
pyats.topology 24.2
pyats.utils 24.2
rest.connector 24.2
unicon 24.2
unicon.plugins 24.2
yang.connector 24.2

@Neeraja1bala
Copy link

Hello @JanisFalkenhagenWork ,

Kindly give some time. will check and update you.

@JanisFalkenhagenWork
Copy link
Author

Hey,

for now i used this workaround:

class CommonSetup(aetest.CommonSetup)
    @aetest.subsection
    def connect_to_devices(self, testbed):
        dev_to_remove = set()
        futures = []
        with ThreadPoolExecutor() as executor:
            for dev_name in testbed.devices.keys():
                dev = testbed.devices[dev_name]
                futures.append(
                    executor.submit(self._connect_device, dev, dev_name)
                )
        
        for future in futures:
            success, dev_name = future.result()
            if not success:
                dev_to_remove.add(dev_name)
                testbed.remove_device(dev_name)
        
        logging.warning("The following devices couldnt be connected to:")
        logging.warning(dev_to_remove)
        
    def _connect_device(self, dev_obj, dev_name):
        success = True
        try:
            dev_obj.connect()
        except Exception as e:
            logging.warning(f"Device {dev_name} failed to connect. Removing it from the testbed.")
            success = False
        return success, dev_name

It makes the code less concise but seems to work. Also we modifiy the testbed in place, what might have sideeffects for other scripts, as i dont think that they are cloned for each testscript.
It also leaves the user vulnerable to concurrency problems. It would be nice, if functionality like this could be made available in the release version. (If it isn't already and we just missed it xD.)

I am still curious about your solution or recommended/endorsed way of solving this problem.

Have a nice day. :D

@sowmyadn010501
Copy link

Hi, thanks for the info. Kindly give me some time to check on this with my team.

@sowmyadn010501
Copy link

Hi, the testbed.connect() will connect where the devices able to and fails when the devices are unable to connect. As for the Issue #226 its already fixed. Kindly check

@JanisFalkenhagenWork
Copy link
Author

I disagree. The connect does not fail on my machine at least. As i've written above, it prints a log message but does not throw an error. This means that the setup step would not fail, the subsequent tests would thus be executed. However, you are right that the connect stops any further execution. So even if i somehow catch the device that is not connectable, any devices afterwards are not connected, meaning, those tests would unneccessarily fail.

Also i am sorry but this does not really answer my question because i asked on how i would handle this issue, if i expect devices not to be connectable. From what i am understanding from your answer, i would guess that i cannot use the testbed.connect() shortcut. But what would be the standard way of doing the connection then if i wanted to implement it by the book? Or is this case just not covered because it usually points to a critically failing network anyways?

And yeah issue #226 is closed and tbh i don't recall why i mentioned it. I think because it complains about Exceptions being thrown which are not anymore. Maybe the fix mentioned in that issue by @SohanTirpude was to remove the exceptions being propagated to the user, i.e. the testscript, which now makes my life more difficult.

Thanks for the response.

@Neeraja1bala
Copy link

Neeraja1bala commented Apr 23, 2024

Hi,

Kindly try with updated pyats as issue #226 is resolved and in this, @SohanTirpude did not removes the exceptions but only showing exceptions for devices which are unable to connect and for other devices are able to connect. So, kindly try.
And we have a method device.is_connected() which we can use to go through all the devices available in testbed and check whether it is connected or not. If it is not, then please remove it from the testbed and continue the job execution.

@JanisFalkenhagenWork
Copy link
Author

JanisFalkenhagenWork commented Apr 25, 2024

Hi,

thanks for your answer. I'll try as soon as i get to it. My timetable is pretty full though atm. So please give me some time to test this out.

The device.is_connected() method is a good hint though. Thank you.

Have a nice rest of your week
Janis

@JanisFalkenhagenWork
Copy link
Author

But just out of curiosity?

Why is the device.is_connected() method not listed here?

https://pubhub.devnetcloud.com/media/pyats/docs/topology/concept.html

@Neeraja1bala
Copy link

Hello Janis,

Thanks for your update, take your time to try this. Once you did, kindly update the status.

@Neeraja1bala
Copy link

Hello Janis,

Kindly let us know the update on this issue.

@JanisFalkenhagenWork
Copy link
Author

Hi,

i am currently on vacation. Please give me some more time to test this and let me come back to you then.

Thank you in advance.
Janis

@Neeraja1bala
Copy link

Hello ,

Kindly let us know, when you will come back from vacation.

@Neeraja1bala
Copy link

Hello @JanisFalkenhagenWork ,

Is any update on this issue, kindly let us know.

@JanisFalkenhagenWork
Copy link
Author

JanisFalkenhagenWork commented May 22, 2024

Hey,

i just tried it on the newest version available to me. This is the output of the pyats version check command:

You are currently running pyATS version: 24.4
Python: 3.11.5 [64bit]

  Package                      Version
  ---------------------------- -------
  genie                        24.4.3
  genie.libs.clean             24.4
  genie.libs.conf              24.4
  genie.libs.filetransferutils 24.4
  genie.libs.health            24.4
  genie.libs.ops               24.4
  genie.libs.parser            24.4
  genie.libs.robot             24.4
  genie.libs.sdk               24.4
  genie.telemetry              24.4
  genie.trafficgen             24.4
  pyats                        24.4
  pyats.aereport               24.4
  pyats.aetest                 24.4
  pyats.async                  24.4
  pyats.connections            24.4
  pyats.contrib                24.4
  pyats.datastructures         24.4
  pyats.easypy                 24.4
  pyats.kleenex                24.4
  pyats.log                    24.4
  pyats.reporter               24.4
  pyats.results                24.4
  pyats.robot                  24.4
  pyats.tcl                    24.4
  pyats.topology               24.4
  pyats.utils                  24.4
  rest.connector               24.4
  unicon                       24.4
  unicon.plugins               24.4
  yang.connector               24.4

The same behaviour as i described above is still happening:

I created a testbed containing three devices:

  1. device is able to connect when handled manually.
  2. device is running into a timeout
  3. device is able to connect when handled manually.

Then i used the following code to check my hypothethis:

from genie.testbed import load

path_to_testbed = "<my_path>"
tb = load(path_to_testbed)

tb.connect()

This then prints the following to the console:

<uninteresting Logstuff from the first device>

failed to connect to 2nd_device
Failed while bringing device to "any" state
Traceback (most recent call last):
  File "src/unicon/statemachine/statemachine.py", line 737, in unicon.statemachine.statemachine.StateMachine.go_to
  File "src/unicon/statemachine/statetransition.py", line 478, in unicon.statemachine.statetransition.AnyStateTransition.do_transitions
  File "src/unicon/eal/dialogs.py", line 479, in unicon.eal.dialogs.Dialog.process
  File "src/unicon/eal/dialog_processor.py", line 339, in unicon.eal.dialog_processor.SimpleDialogProcessor.process
  File "src/unicon/eal/dialog_processor.py", line 219, in unicon.eal.dialog_processor.SimpleDialogProcessor.expect_eval_statements
  File "src/unicon/eal/backend/spawn.py", line 111, in unicon.eal.backend.spawn.RawSpawn.read_update_buffer
  File "src/unicon/eal/backend/spawn.py", line 96, in unicon.eal.backend.spawn.RawSpawn.read
  File "src/unicon/eal/backend/pty_backend.py", line 174, in unicon.eal.backend.pty_backend.Spawn._read
unicon.core.errors.EOF: Unable to read. Connection closed or not available

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "src/unicon/bases/connection.py", line 794, in unicon.bases.connection.Connection.connect
  File "src/unicon/bases/routers/connection_provider.py", line 227, in unicon.bases.routers.connection_provider.BaseSingleRpConnectionProvider.connect
  File "src/unicon/bases/routers/connection_provider.py", line 257, in unicon.bases.routers.connection_provider.BaseSingleRpConnectionProvider.establish_connection
  File "src/unicon/statemachine/statemachine.py", line 740, in unicon.statemachine.statemachine.StateMachine.go_to
unicon.core.errors.StateMachineError: Failed while bringing device to "any" state

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "src/pyats/topology/testbed.py", line 537, in pyats.topology.testbed.Testbed.connect
  File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/lib64/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/pyats/connections/manager.py", line 512, in pyats.connections.manager.ConnectionManager.connect
  File "src/unicon/bases/connection.py", line 801, in unicon.bases.connection.Connection.connect
unicon.core.errors.ConnectionError: failed to connect to 2nd_device
Failed while bringing device to "any" state

This looks like an exception but it is not an exception that is propagated to the user context, i.e., my context. This is only printed to an output stream.
To test this hypothesis i used the following modified code:

from genie.testbed import load

path_to_testbed = "<my_path>"
tb = load(path_to_testbed)

tb.connect()
# In the original example, the script ends here ^.
print("If this is printed, the previous function could not have thrown an exception as it wasn't caught and thus would be propagated further up the stack, i.e., crash the program leading to this line not being executed.")

As expected this scripts prints the stuff from the script above plus the last string. We can therefore conclude that the tb.connect() does NOT throw an exception. This means, that we cannot recognize from the user code if specific devices have failed to connect without iterating over them once again. This effectively makes the testbed.parse() function not usable without further testbed checking.

Additionally, the execution of the connect function just halts, as soon, as one device is encountered that is not connectable, increasing the negative impact of this behaviour. This means that simply removing not connected devices from the testbed via the device.is_connected() method, for example, is not feasible as this would remove not only the devices that aren't connectable but also the devices that weren't tried to connect to because of the internal exception beforehand. I.e., the testbed would not cover the complete reachable network. In my example from above this means that the second and third device would be removed from the testbed even though, only the second device is not connectable.

That is why i asked, if this is by design and if there is a workaround that does not lead to reimplementing the tb.connect() function. The latter obviously works, as i demonstrated here:

Hey,

for now i used this workaround:

class CommonSetup(aetest.CommonSetup)
    @aetest.subsection
    def connect_to_devices(self, testbed):
        dev_to_remove = set()
        futures = []
        with ThreadPoolExecutor() as executor:
            for dev_name in testbed.devices.keys():
                dev = testbed.devices[dev_name]
                futures.append(
                    executor.submit(self._connect_device, dev, dev_name)
                )
        
        for future in futures:
            success, dev_name = future.result()
            if not success:
                dev_to_remove.add(dev_name)
                testbed.remove_device(dev_name)
        
        logging.warning("The following devices couldnt be connected to:")
        logging.warning(dev_to_remove)
        
    def _connect_device(self, dev_obj, dev_name):
        success = True
        try:
            dev_obj.connect()
        except Exception as e:
            logging.warning(f"Device {dev_name} failed to connect. Removing it from the testbed.")
            success = False
        return success, dev_name

It makes the code less concise but seems to work. Also we modifiy the testbed in place, what might have sideeffects for other scripts, as i dont think that they are cloned for each testscript. It also leaves the user vulnerable to concurrency problems. It would be nice, if functionality like this could be made available in the release version. (If it isn't already and we just missed it xD.)

I am still curious about your solution or recommended/endorsed way of solving this problem.

Have a nice day. :D

I hope this clarifies my problem and shows that i tested it thoroughly enough as you requested. :)

Have a nice day
Janis

@Neeraja1bala
Copy link

Neeraja1bala commented May 23, 2024

Hello @JanisFalkenhagenWork ,

Can you let me know, how many devices are in yaml and which one is not connectable second device of third device. Is that same thing happened in internal aswell.
If devices are not reachable by default testbed object will not remove. And we are not raising the exception we are changing the logging.exception.

@JanisFalkenhagenWork
Copy link
Author

Hi @Neeraja1bala

as i wrote above the testbed contains three devices. The first and third (when read from the top) are reachable and connectable via ssh The second device is not present in the network, the connection thus times out:

I created a testbed containing three devices:

1. device is able to connect when handled manually.

2. device is running into a timeout

3. device is able to connect when handled manually.

All devices are cisco Catalyst 2960L.

Do i understand you correctly that the testbed.connect() function should remove non reachable devices?
I am pretty certain that this is not happening, as that would prevent my issue in the first place. But i'll test that right away again. Give me an hour at most until i get back to you.

I am not sure what you mean by

And we are not raising the exception we are changing the logging.exception.

Also the meaning of this sentence is not clear to me. Could you please elaborate on those two?

Is that same thing happened in internal aswell.

@JanisFalkenhagenWork
Copy link
Author

Ok, i tested again. I used the same testbed as above with the following code:

tb = load("path_to_testbed")
print(tb.devices.keys())  # dict_keys(['device_a', 'DUMMY_NOT_PRESENT', 'device_b'])

tb.connect()

print(tb.devices.keys())  # dict_keys(['device_a', 'DUMMY_NOT_PRESENT', 'device_b'])

for key in tb.devices.keys():
    dev = tb.devices[key]
    print(dev.is_connected())
#  True
#  False
#  True

As you can see, the device which is not present is not removed.

Also contrary to what i saw in my last tests, this time the third device is connected. This, i did not expect, so i tested with two of our large production testbeds again. It did also not show the behaviour where the connection establishment is aborted once one device fails. I am not sure what changed since then but this means, that the testbed.connect() function in conjunction with device.is_connected() solves that problem.

This means that stable test operation would require this code for each connect, as long as one cannot gurantee that all devices are reachabel, right?

[...]
tb.connect()
dev_to_remove = []
for key in tb.devices.keys():
    dev = tb.devices[key]
    if not dev.is_connected():
        dev_to_remove.append(key)

for key in dev_to_remove:
    tb.remove_device(key)

But if this is practically required for each testcase, why not implement it in the testbed.connect() function by default?
Maybe add a method parameter that allows turning this behaviour on in order to keep the old interface functional.

Also it would be very much bad practice if the testbed object passed to me in my testcase's setup method, was modified, i.e., if i removed not connected devices from it because all subsequent testcases would be handed the modified testbed. So the better solution would be to have the testbed.parse() method only work on the connected devices inside the testbed or am i mistaken here?

And my main question remains:

Is this just bad test design on my part, i.e., should testbeds only contain devices that are guranteed to connect successfully?

@Neeraja1bala
Copy link

Hello @JanisFalkenhagenWork ,

If devices are not connectable, tb.connect() will not remove those devices . You should have to do it in your side.

OR

You can set empty list, and add devices in that list if devices are connectable

[...]
tb.connect()
dev_to_remove = []
for key in tb.devices.keys():
    dev = tb.devices[key]
    if not dev.is_connected():
        dev_to_remove.append(key)

for key in dev_to_remove:
    tb.remove_device(key)

@Neeraja1bala
Copy link

Hello @JanisFalkenhagenWork ,
Any update on this, kindly let us know the status .

@JanisFalkenhagenWork
Copy link
Author

JanisFalkenhagenWork commented May 28, 2024

Hello,

your code example is simply copied from my answer. It does not do what your caption says it does.

You still didn't answer the question i have:

Are testbeds containing devices that are not connectable bad style/bad practice?

Also i find it a bit rude to change your previous answers after the fact without marking changes.

@Neeraja1bala
Copy link

Hello ,

Yes, it is a bad practice but , if devices are not connectable you have to remove those devices in testbed or make sure those devices are reachable if not you have to do above code block.

Are testbeds containing devices that are not connectable bad style/bad practice?

I had forgot to mentioned you that previous change i made.

Also i find it a bit rude to change your previous answers after the fact without marking changes.

@JanisFalkenhagenWork
Copy link
Author

Ok. Thank you for the answer,

I have no further questions.

@Neeraja1bala
Copy link

Thanks for confirmation. Will be closing this issue.

Neeraja

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants