
SPI implementation relies on fragile timing #657

Open
Felix-El opened this issue Jan 5, 2023 · 6 comments

@Felix-El

Felix-El commented Jan 5, 2023

I'm working on a project where a Feather M4 board connects to a Wiznet W5500 Ethernet chip via SPI for reading and processing raw Ethernet frames. The ATSAMD is running at 120 MHz and the SPI Master via SERCOM1 is operated at 20 MHz.

While each part worked on its own (SPI communication, USB ACM, GPIOs, and a custom SysTick-based timer), I soon started seeing issues that were hard to diagnose. The SPI SERCOM started reporting overflow errors after 10-60 s of operation, and only when building with the release profile; unoptimized builds seemed to work.
After days of digging I realized that my SysTick exception (firing at only a 1 ms rate and doing nothing but incrementing a software counter) could influence the timing enough to break the SPI write / transfer! I expect any other interrupt could do the same.

The issue seems to be the implementation of Duplex write and transfer within src/sercom/spi/impl_ehal_thumbv7em.rs.
As the two are very similar, let's take transfer as the basis for discussion:

#[inline]
fn transfer<'w>(&mut self, words: &'w mut [Word<$Length>]) -> Result<&'w [Word<$Length>], Error> {
    let cells = core::cell::Cell::from_mut(words).as_slice_of_cells();
    let mut to_send = cells.iter();
    let mut to_recv = cells.iter();
    while to_recv.len() > 0 {
        let flags = self.read_flags_errors()?;
        if to_send.len() > 0 && flags.contains(Flags::DRE) {
            let word = match to_send.next() {
                Some(cell) => cell.get(),
                None => unreachable!(),
            };
            self.config.as_mut().regs.write_data(word as u32);
        }
        if to_recv.len() > to_send.len() && flags.contains(Flags::RXC) {
            let word = self.config.as_mut().regs.read_data() as Word<$Length>;
            match to_recv.next() {
                Some(cell) => cell.set(word),
                None => unreachable!(),
            }
        }
    }
    Ok(words)
}

The while loop is there to ensure that we read as many words as we write. This alone, however, is not sufficient: since the SERCOM only buffers a single received word, reading must not lag more than one word behind writing. Let's call this situation (-1).
To understand the issue I recorded the contents of the SERCOM's INTFLAG register in response to sending a single word (by writing the DATA register). The sequence turns out to be [0, ..., 1, ..., 5, ..., 7, ...], where DRE=1, TXC=2, RXC=4.
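The recorded sequence is easier to read once the bits are decoded. A minimal sketch (bit values taken from the observation above; `decode` is just an illustration, not part of the HAL):

```rust
// Decode the recorded INTFLAG snapshots into flag names
// (bit values as observed: DRE = 1, TXC = 2, RXC = 4).
fn decode(flags: u8) -> Vec<&'static str> {
    let mut names = Vec::new();
    if flags & 1 != 0 { names.push("DRE"); }
    if flags & 2 != 0 { names.push("TXC"); }
    if flags & 4 != 0 { names.push("RXC"); }
    names
}

fn main() {
    // The sequence seen after writing one word to DATA:
    for f in [0u8, 1, 5, 7] {
        println!("{:#05b} -> {:?}", f, decode(f));
    }
}
```

So 1 is DRE alone (data register empty again), 5 is DRE|RXC (a word has arrived), and 7 adds TXC once the shift register has drained.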

As you can see, the DRE bit is set first, so depending on execution timing (optimization, interrupts, ...) what happens next is unpredictable. Assume we start in the first iteration of the while loop:

  • Iteration 1: The first if hits because DRE is set => transmit the first word. The second if does not hit, as RXC was clear at the time of sampling. We're one read behind (-1).
  • Iteration N>1: it now depends on when we take the next snapshot of INTFLAG:
    1. Possibility: both DRE and RXC are observed. The first if hits and we transmit (-2), but shortly after (hopefully) the second if hits and we manage to read a word, getting back to (-1) and avoiding overflow.
    2. Possibility: only DRE is observed, the same situation as iteration 1. We're now at (-2), and in the following iterations we'll have to catch up on reading with even tighter timing to avoid the overflow.
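To make the failure mode concrete, here is a tiny software model of the receive side (names invented for illustration; the real SERCOM is only approximated): with a one-word RX buffer, reaching (-2) is only safe if the read catches up before the next word finishes shifting in.

```rust
// Hypothetical model of the SERCOM's one-word RX buffer: if a second
// word finishes shifting in before the first is read, BUFOVF arises.
struct RxModel {
    buffer: Option<u8>, // receive side of DATA: one word deep
    overflow: bool,     // models the BUFOVF status bit
}

impl RxModel {
    // A word finishes shifting in from the bus.
    fn word_received(&mut self, w: u8) {
        if self.buffer.is_some() {
            self.overflow = true; // previous word not yet read: overflow
        } else {
            self.buffer = Some(w);
        }
    }
    fn read_data(&mut self) -> Option<u8> {
        self.buffer.take()
    }
}

fn main() {
    let mut rx = RxModel { buffer: None, overflow: false };

    // Situation (-1): one word outstanding, read in time -> fine.
    rx.word_received(0xAA);
    assert_eq!(rx.read_data(), Some(0xAA));
    assert!(!rx.overflow);

    // Situation (-2) with a delayed read (e.g. the SysTick exception
    // fires): the second word completes while the first is unread.
    rx.word_received(0xBB);
    rx.word_received(0xCC);
    assert!(rx.overflow);
    println!("overflow = {}", rx.overflow);
}
```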

It appears the existing code is quite sensitive to timing variations - one could say it has real-time requirements of its own.
I was able to write a workaround for my project using a thin newtype wrapper:

// SafeSpi is a thin newtype over the HAL's Spi, roughly:
// struct SafeSpi<C: ValidConfig, A: Capability>(Spi<C, A>);
impl<C, A> Transfer<u8> for SafeSpi<C, A>
where
    C: ValidConfig,
    A: Capability,
{
    type Error = Error;

    fn transfer<'w>(&mut self, words: &'w mut [u8]) -> Result<&'w [u8], Self::Error> {
        let cells = Cell::from_mut(words).as_slice_of_cells();
        let mut to_send = cells.iter();
        let mut to_recv = cells.iter();
        while to_recv.len() > 0 {
            if to_send.len() > 0 {
                while !self.0.read_flags_errors()?.contains(Flags::DRE) {}
                let word = match to_send.next() {
                    Some(cell) => cell.get(),
                    None => unreachable!(),
                };
                unsafe { self.0.write_data(word as u32) };
            }
            if to_recv.len() > to_send.len() {
                while !self.0.read_flags_errors()?.contains(Flags::RXC) {}
                let word = (unsafe { self.0.read_data() }) as u8;
                match to_recv.next() {
                    Some(cell) => cell.set(word),
                    None => unreachable!(),
                }
            }
        }
        Ok(words)
    }
}

This solved the issue in my case, so I'm sharing these findings. I believe a conservative implementation like this should be shipped with the crate as a safer default, even though I see a few subtle issues:

  • The existing implementation avoids dead time by "streaming" new data, while my implementation waits for RXC before pushing another word, introducing a delay between words.
  • If a constant stream of words is not upheld, will the SPI SERCOM assume "end of transfer" prematurely and deassert CS when using Hardware Slave Select? If so, it would likely render HWSS useless, since some SPI clients rely on CS staying active for the entire transaction; only software-driven CS would be possible. Maybe not an issue, but it needs documentation.
  • Perhaps this can't even be made reliable enough without involving DMA?
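For comparison, a possible middle ground between the crate's streaming loop and the fully lock-step workaround would be to re-sample the flags each time and service the receive side before pushing another word. The sketch below is untested on hardware; `SpiRegs` and `MockSpi` are invented stand-ins (not the crate's API) so the ordering can at least be exercised in software:

```rust
// Invented register abstraction standing in for the SERCOM.
const DRE: u8 = 1;
const RXC: u8 = 4;

trait SpiRegs {
    fn flags(&mut self) -> u8;
    fn write_data(&mut self, w: u8);
    fn read_data(&mut self) -> u8;
}

// Transfer loop: drain the RX side first, with freshly sampled flags,
// so the one-word buffer is emptied before another word is pushed.
fn transfer<R: SpiRegs>(regs: &mut R, words: &mut [u8]) {
    let (mut sent, mut recvd) = (0, 0);
    let n = words.len();
    while recvd < n {
        if recvd < sent && regs.flags() & RXC != 0 {
            words[recvd] = regs.read_data();
            recvd += 1;
        }
        if sent < n && regs.flags() & DRE != 0 {
            regs.write_data(words[sent]);
            sent += 1;
        }
    }
}

// Loopback mock with a one-word RX buffer: a written word "completes"
// on the next flags() poll and moves into the RX buffer.
struct MockSpi {
    shift: Option<u8>,
    rx: Option<u8>,
    overflow: bool,
}

impl SpiRegs for MockSpi {
    fn flags(&mut self) -> u8 {
        if let Some(w) = self.shift.take() {
            if self.rx.is_some() { self.overflow = true; } else { self.rx = Some(w); }
        }
        let mut f = 0;
        if self.shift.is_none() { f |= DRE; }
        if self.rx.is_some() { f |= RXC; }
        f
    }
    fn write_data(&mut self, w: u8) { self.shift = Some(w); }
    fn read_data(&mut self) -> u8 { self.rx.take().unwrap_or(0) }
}

fn main() {
    let mut spi = MockSpi { shift: None, rx: None, overflow: false };
    let mut buf = [1u8, 2, 3, 4];
    transfer(&mut spi, &mut buf);
    assert_eq!(buf, [1, 2, 3, 4]); // loopback echoes each word
    assert!(!spi.overflow);
    println!("ok: {:?}, overflow = {}", buf, spi.overflow);
}
```

Draining RX first narrows the overflow window, but it is not a proof of real-time safety on hardware; an interrupt between the two ifs can still delay the read.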

I'm willing to help improve the implementation, but I feel I have neither the experience nor the tools to analyze the outcome at the waveform level.

(Find attached the Rust program I used for recording the INTFLAG register after sending a word. It is mostly assembly, but SERCOM1 must be configured (in Rust) prior to running it.)

Crates used:

embedded-hal = "0.2.7"
embedded-nal = "0.6.0"
embedded-time = "0.12.1"
cortex-m = { version = "0.7.6", features = ["critical-section-single-core"] }
cortex-m-rt = "0.7.2"
feather_m4 = { version = "0.10.1", features = ["usb"] }
atsamd-hal = { version = "0.15.1" }
usb-device = "0.2.9"
usbd-serial = "0.1.1"
w5500 = { path = "priv/w5500" } # unpublished master w/ raw Ethernet frame reading
critical-section = "1.1.1"
heapless = "0.7.16"
@bradleyharden
Contributor

@Felix-El, I skimmed your post, but I will need some additional time to go through it in detail. I suspect the root cause is a problem I noticed before but haven't fixed yet. Maybe you can confirm?

I realized that I've been enabling the receiver (RXEN bit), which is required for Capability = Duplex, and then coding around it when Capability = Tx. What I should have done instead is simply disable the receiver for Capability = Tx. That would stop any RXC interrupts and prevent any BUFOVF errors.

Right now, I'm working to get #450 ported to the D11 and D21 chips. Then I want to go through and incorporate it into the sercom::spi module. At the same time, I will update the code to disable the receiver when it isn't needed.

@Felix-El
Author

@bradleyharden, I have only used the feather_m4::spi_master() function, which yields a type specialized to Capability = Duplex; that is where the reported issues manifest. From reading the ATSAMx manual, I believe BUFOVF will not occur with the receiver disabled.

@bradleyharden
Contributor

@Felix-El, I finally took the time to read your first post. I'm confused though.

You talk about being 0, 1 or 2 reads behind (0, -1, -2). But unless I'm mistaken, that has nothing to do with overflow, at least not directly. There's a separate TX buffer, and you can fill that with words, so that it always has the next byte ready to transmit. And I believe the TX buffer is at least 2 deep, so you should be able to reach -2 pretty much immediately. If overflow is happening, it's not because you're filling the TX buffer too quickly, it's because the RX buffer isn't being read fast enough. Your solution does indeed prevent overflow, but it does it in a brute force manner, by simply never transmitting the next word until you receive the current word. As you say, that has a number of downsides.

Let's take a step back for a second. I believe the implementation you showed is for a blocking embedded-hal trait, right? You also said your SPI clock is 20 MHz. What is your word length? Is it one byte? If so, a single SPI transaction takes 400 ns, which is only 48 clocks at 120 MHz. That means you need to service both a read and a write every 48 clock cycles, on top of servicing the exception. To me, that's a recipe for timing problems.
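A back-of-the-envelope check of those numbers (plain arithmetic, no hardware involved):

```rust
// Timing budget: 8 bits per word at a 20 MHz SCK, CPU at 120 MHz.
fn main() {
    let sck_hz: u64 = 20_000_000;
    let cpu_hz: u64 = 120_000_000;
    let bits_per_word: u64 = 8;

    let word_ns = bits_per_word * 1_000_000_000 / sck_hz;
    let cycles_per_word = bits_per_word * cpu_hz / sck_hz;

    assert_eq!(word_ns, 400);        // 400 ns per byte on the wire
    assert_eq!(cycles_per_word, 48); // 48 CPU clocks to service TX + RX
    println!("{} ns per word, {} cycles per word", word_ns, cycles_per_word);
}
```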

My first thought is that you should re-evaluate your architecture. At the very least, it seems like you should be using high priority interrupts for this. But with only 48 clock cycles, interrupt latency is likely problematic. A better approach is probably to move to DMA. If my assumptions above are correct, I'm honestly surprised it works as well as it does without DMA. I had a similar problem, where I needed to service a SPI bus quickly, and I ran into problems immediately. I switched to DMA and haven't had a problem since.

One last question. Are you familiar with RTIC? It's really useful when you need hard real-time guarantees. I would highly recommend it.

@Felix-El
Author

Felix-El commented Jan 20, 2023

@bradleyharden, thanks for looking into this.

You talk about being 0, 1 or 2 reads behind (0, -1, -2). But unless I'm mistaken, that has nothing to do with overflow, at least not directly. There's a separate TX buffer, and you can fill that with words, so that it always has the next byte ready to transmit. And I believe the TX buffer is at least 2 deep, so you should be able to reach -2 pretty much immediately.

The TX buffer is, strictly speaking, just one word deep, but the shift register can be considered a second storage element.
If I understand the manual correctly, when the shift register is drained, the buffered TX word is moved into it and we're informed via DRE (TXC only fires once the entire transfer has completed).

If overflow is happening, it's not because you're filling the TX buffer too quickly, it's because the RX buffer isn't being read fast enough.

Correct, that's exactly what I meant. If we send off 2 TX words, we risk not picking up the first RX word before the second arrives - this happens asynchronously, at the SPI controller's pace.

Your solution does indeed prevent overflow, but it does it in a brute force manner, by simply never transmitting the next word until you receive the current word. As you say, that has a number of downsides.

Absolutely.

Let's take a step back for a second. I believe the implementation you showed is for a blocking embedded-hal trait, right? You also said your SPI clock is 20 MHz. What is your word length? Is it one byte?

Yes that is the case.

If so, a single SPI transaction takes 400 ns, which is only 48 clocks at 120 MHz.

Yup.

That means you need to service both a read and a write every 48 clock cycles, on top of servicing the exception. To me, that's a recipe for timing problems.

Well, a 20 MHz clock is just the toggle rate of the SCK line. If at times the master does not manage to provide or read data at "real-time" speed, there should only be a gap between words on the bus (idle time, with SCK held). Or at least that is my expectation.

My first thought is that you should re-evaluate your architecture. At the very least, it seems like you should be using high priority interrupts for this. But with only 48 clock cycles, interrupt latency is likely problematic. A better approach is probably to move to DMA.

Yes, exactly: without DMA we risk triggering the reported behavior, i.e. if interrupt processing takes too long between the TX write and the RX read.

If my assumptions above are correct, I'm honestly surprised it works as well as it does without DMA. I had a similar problem, where I needed to service a SPI bus quickly, and I ran into problems immediately. I switched to DMA and haven't had a problem since.

Is there an example of using DMA with this crate? I'm generally open to using it. In any case, that does not resolve the issue at hand.

One last question. Are you familiar with RTIC? It's really useful when you need hard real-time guarantees. I would highly recommend it.

I have not used RTIC so far, and the solution I found is good enough for my use case. However, I'm worried about the implied real-time requirements of using SPI here. Being the master, with a hardware SPI controller (not bit-banging), I wouldn't expect to have to observe this many details. Let's call this hidden complexity.
Have you ever seen this mentioned in the docs of this or any other peripheral crate?

I understand the solution I chose is conservative, but at least it works reliably. Shouldn't this behavior be the default?
People who aim for best performance will use DMA anyway, but they will likely also start off prototyping with this synchronous SPI, and might get frustrated when it simply fails at some unfortunate "clock ratio", or nondeterministically due to interrupts.

sakian added a commit to VitalBio/atsamd that referenced this issue Apr 11, 2024
@sakian
Contributor

sakian commented Apr 11, 2024

For what it's worth, I ran into this issue on one of my projects as well, talking to a stepper driver (TMC5130A). Your alterations are working for me.

@flelchuk

flelchuk commented Apr 12, 2024

For what it's worth, I ran into this issue on one of my projects as well, talking to a stepper driver (TMC5130A). Your alterations are working for me.

I'm glad it was of use to someone else too.
