
Floating Point Exception #22

Open
ZaidQureshi opened this issue May 24, 2019 · 16 comments

@ZaidQureshi

I am trying to run the example nvm-identify but I get the following output:
Resetting controller and setting up admin queues...
Floating point exception

The dmesg output is this:
[May24 16:40] traps: nvm-identify[3179] trap divide error ip:7f6d2f98a434 sp:7ffd9a74e3b0 error:0 in libnvm.so[7f6d2f985000+9000]
I am not sure what is going on. Any help would be appreciated.

@enfiskutensykkel
Owner

Hi Zaid,

This probably means that there is a division by zero somewhere. Not really sure where that would be. Would it be possible for you to add some print statements in the examples/identify/module.c and examples/identify/common.c files to see how far it gets before the bug occurs, and ideally which library call fails? I'm wondering if the error occurs in the reset_ctrl function (in common.c) or in the identify_ctrl function (also in common.c).

Thanks

@enfiskutensykkel enfiskutensykkel self-assigned this May 25, 2019
@ZaidQureshi
Author

ZaidQureshi commented May 28, 2019

I did some debugging and it seems the problem is on line 51 in include/nvm_queue.h, which contains the following check:
if ((uint16_t) ((sq->tail - sq->head) % sq->max_entries) == sq->max_entries - 1)
So for some reason the tail, head, and max_entries of the admin queue are all 0.

@enfiskutensykkel
Owner

Thank you for your effort.

I suspect that this may be because it is not able to read controller registers properly. I assume you are using the kernel module version (and not the SmartIO version)?

In examples/identify/module.c, could you print out the members of ctrl directly after the call to nvm_ctrl_init, particularly the values of ctrl->page_size and ctrl->max_entries ?

@ZaidQureshi
Author

These are the values for each of the members of ctrl after the nvm_ctrl_init call:

CTRL->DSTRD: 0
CTRL->timeout: 30000
CTRL->MAX_ENTRIES: 0
CTRL->MM_SIZE: 8192
CTRL->mm_ptr: 0x7f511d31f000

@enfiskutensykkel
Owner

Thank you again.

This is a bit strange: it appears that it managed to read the time-out value from the registers, but the maximum number of entries has become 0. I may be doing something wrong in src/ctrl.c, or maybe I've misinterpreted the spec. Some fields are defined so that 0 means 1, so I need to check whether that is the case here.

Out of curiosity, what kind of NVMe controller/disk is this?

@ZaidQureshi
Author

This is a Corsair NX500 SSD.

@ZaidQureshi
Author

ZaidQureshi commented May 28, 2019

So the value read from the CAP.MQES register field in initialize_handle in src/ctrl.c is 65535 for me. When we add 1, it overflows and wraps to 0. Now the question is why it returns 65535.
When I remove the plus 1, I get valid values for the admin queue's max entries, and the controller information is returned correctly.

@enfiskutensykkel
Owner

The reason I add 1 is that MQES is, according to the spec,

This is a 0's based value. The minimum value is 1h, indicating two entries
Meaning that a value of 0 is illegal, a value of 1 means 2 entries, and so on.

If it returns 65535, I suspect a different type of issue, namely that all MMIO reads are returning 0xFFs (this is a common symptom when reading does not work). If you attempt to read any other values directly from the memory mapped pointer, does it all return 0xFFs? Could you for example try this:
printf("%lx\n", *((uint64_t*) mm_ptr));

@ZaidQureshi
Author

ZaidQureshi commented May 28, 2019

This is the value printed due to printf("%lx\n", *((uint64_t*) ctrl->mm_ptr)); after calling nvm_ctrl_init:
203c03ffff

@enfiskutensykkel
Owner

enfiskutensykkel commented May 28, 2019

That's a good thing!

The controller probably supports 65536 entries, and the bug is that I'm using a 16-bit variable to read out the value and add one. I'm trying to find a datasheet or documentation to verify that it supports that many queue entries, but I suspect that this is the case.

I'll add a quick fix for it and add your name in the commit message (if you want), or you can send me a patch or a pull request if you prefer. In the meantime, you can probably comment out the +1.

@enfiskutensykkel
Owner

Thank you for your time and effort, by the way.

@ZaidQureshi
Author

Wait, but the specification states that this register is only 16 bits.

@enfiskutensykkel
Owner

enfiskutensykkel commented May 28, 2019

Yes, the register is only 16 bits, but since the field is zero-based, I believe the effective value is allowed to be 65536. I'll look around and check the spec errata to see if it is mentioned, and I'll also have a look at what the Linux kernel NVMe driver does. Does the identify example program work for you if you don't add one?

@ZaidQureshi
Author

Yes, that example works, but I think people typically always use the MQES+1 value, so you will need to change the type of the max_entries field throughout your code. I believe this is also causing another issue when memory is unmapped at the end of the latency benchmark, where I get the following error:
[unmap_memory] Page unmapping kernel request failed: Invalid argument

I am trying to debug it and see what is happening.

@enfiskutensykkel
Owner

enfiskutensykkel commented May 28, 2019

The spec says in the introduction chapter:

Support for 65,535 I/O queues, with each I/O queue supporting up to 64K outstanding commands.

Since it specifies 2^16 - 1 I/O queues, perhaps "64K" in this case really does mean 2^16 outstanding commands?

Regardless, the unmapping is a known bug (see #18), I haven't had time to rewrite the kernel module yet. The problem is how I'm currently storing address mappings, as I'm storing everything into one global list when I should instead create a list per open descriptor. It happens with the nvm-latency-benchmark because I check against the PID, and for some reason CUDA programs fork into child processes so the "wrong" PID is attempting to clean up the stored mapping.

It will clean up automatically once you unload the module, so do that every now and then if you run the nvm-latency-bench program multiple times. The warnings and errors look bad, but it is expected.

@enfiskutensykkel
Owner

The permanent fix would be to change the type of max_entries to uint32_t as you suggested, or alternatively always use + 1 when I reference it. I'll have to look into the impact of both solutions.
