JIT bytecode optimizer #162

Open
jolivepetrus opened this issue May 9, 2018 · 14 comments

Comments

@jolivepetrus
Contributor

Because Lua RTOS makes intensive use of read-only tables, a series of optimizations can be performed on the program's bytecode to speed up execution. This can help programs written for Lua RTOS reach performance similar to that of programs written in C, and it is especially important when the programmer uses the Lua RTOS hardware-access modules.

We have started work on an initial version of the Lua RTOS JIT bytecode optimizer, with excellent results.

@jolivepetrus
Contributor Author

I have some initial experimental results.

Objective: measure how much time Lua needs to respond to a GPIO interrupt.

Test code:

pio.pin.setdir(pio.OUTPUT, pio.GPIO21)
pio.pin.interrupt(pio.GPIO14, function(value)
    pio.pin.sethigh(pio.GPIO21)
    tmr.delayus(1)
    pio.pin.setlow(pio.GPIO21)
end, pio.pin.IntrRaiseEdge)

device = pwm.attach(pio.GPIO26, 1000, 0.1)
device:start()

Wire: GPIO26 to GPIO14
Test points: GPIO14 and GPIO21

CASE A: Lua RTOS with Lua locks & without JIT bytecode optimizer

[image: casea]

CASE B: Lua RTOS with Lua locks & JIT bytecode optimizer

[image: caseb]

CASE C: Lua RTOS without Lua locks & JIT bytecode optimizer

[image: casec]

Conclusions:

CASE A: use C if you need a response time < 60 usecs
CASE B: use C if you need a response time < 35 usecs
CASE C: use C if you need a response time < 16 usecs

It is clear that the JIT has a positive impact on performance, and we need to determine in which situations it is safe to disable Lua locks.

@jolivepetrus
Contributor Author

Hi guys,

A first version of the JIT byte-code optimizer is available in 6763f11.

@yawor
Contributor

yawor commented May 11, 2018

@jolivepetrus what exactly are Lua locks? When is it safe to disable them? Are they needed only when multiple threads access the same hardware? If I use only a dedicated thread to access specific hardware, or I use explicit mutexes in my code to prevent simultaneous access to it, can I safely disable the Lua locks?

@jolivepetrus
Contributor Author

jolivepetrus commented May 11, 2018

@yawor,

Lua locks are recursive mutexes that the Lua API uses to protect the Lua state against concurrent access. For example, when calling lua_newuserdata or lua_pushinteger, a lock is acquired before modifying the Lua state (the structure that holds, for example, the global variables) and released once the Lua state has been modified.

Lua RTOS is designed so that Lua locks can be omitted in certain circumstances:

  • Callbacks use a new Lua state (in fact, a light version of the lua_State structure). Code executed in a callback does not need locks when accessing the Lua API.

  • Lua RTOS threads also use a new Lua state, so the above rule applies to them as well.

  • Within a callback or a Lua RTOS thread, only the Lua API functions that access global variables must be protected by a Lua lock.

With the current Lua RTOS version, it should be safe to disable locks if you follow these rules:

  • To run programs, put the source code into a Lua file and use dofile to execute it. Avoid pasting code into the console.

  • Always protect access to global variables from Lua RTOS threads (use the mutual exclusion functions from the thread module).

Disabling Lua locks, or minimizing their use, is feasible, but requires some internal work to be transparent to the programmer. For now, please follow the above rules, and just program in the usual way.
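
The second rule above can be sketched as follows. This is only an illustration, not code from the Lua RTOS sources: it assumes the thread module exposes createmutex() and that the returned mutex has lock() and unlock() methods. The fallback table is a hypothetical stand-in so that the sketch also runs on a desktop Lua interpreter, and increment_counter is a made-up name.

```lua
-- Assumption: on Lua RTOS the global `thread` module provides createmutex(),
-- and the returned mutex has lock()/unlock() methods.
-- The fallback is a hypothetical no-op stand-in for desktop Lua only.
local thread = thread or {
    createmutex = function()
        return {
            lock = function() end,
            unlock = function() end,
        }
    end,
}

counter = 0                        -- global variable shared between threads
local mtx = thread.createmutex()

-- Hypothetical helper: update the shared global only while holding the mutex.
function increment_counter()
    mtx:lock()
    counter = counter + 1
    local value = counter          -- read the global while still locked
    mtx:unlock()
    return value
end

print(increment_counter())
```

Following the first rule, a file containing this code would be started with dofile rather than pasted into the console.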

@yawor
Contributor

yawor commented May 11, 2018

@jolivepetrus thanks for the explanation
Unfortunately, enabling JIT and disabling Lua locks at the same time has a negative impact on SPI.

I've been testing this by executing 10-12 consecutive spi:readwrite operations, each sending 2 bytes. There is nothing more in the program right now (wifi and network services disabled). I recorded two times: the time between NSS going down and coming back up, and the time from there until NSS goes down again.

  • LL on, JIT off: 260 us, 46 us
  • LL off, JIT off: 140 us, 22 us
  • LL on, JIT on: 180 us, 24 us
  • LL off, JIT on: 180 us, 25 us

So I got the best performance with both LL and JIT off. It's strange that with JIT on, there's almost no difference on whether the LL is on or off.
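
To compare the four configurations more directly, the two recorded times can be added to get the period of one complete transfer. The sketch below uses only the numbers from the list above; the field names and the cycle_time helper are illustrative.

```lua
-- Measurements from the list above, in microseconds:
-- low = NSS down to NSS up, gap = NSS up until NSS goes down again.
local results = {
    { name = "LL on, JIT off",  low = 260, gap = 46 },
    { name = "LL off, JIT off", low = 140, gap = 22 },
    { name = "LL on, JIT on",   low = 180, gap = 24 },
    { name = "LL off, JIT on",  low = 180, gap = 25 },
}

-- Hypothetical helper: period of one complete transfer.
function cycle_time(r)
    return r.low + r.gap
end

local baseline = cycle_time(results[1])
for _, r in ipairs(results) do
    print(string.format("%-16s %3d us per transfer, %.2fx vs. LL on, JIT off",
        r.name, cycle_time(r), baseline / cycle_time(r)))
end
```

With both options off, one transfer takes 162 us instead of 306 us for the baseline.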

BTW is disabling hardware locks supported? I wanted to try disabling them in Kconfig, but then I got some error with i2c unlock function missing during compilation.

@jolivepetrus
Contributor Author

@yawor,

Please attach your test code so that I can check the JIT optimizations.

@jolivepetrus
Contributor Author

@yawor,

In theory, the JIT commits disable Lua locks and the read-only table cache when the JIT is enabled:

Cache:

When accessing read-only tables, Lua RTOS can get the key/value pair from a cache. This can
speed up the execution of Lua RTOS scripts. This option is disabled when the JIT bytecode
optimizer is enabled.

Lua locks:

Use locks when the program enters the Lua core. This option is disabled when the JIT bytecode
optimizer is enabled.

This is why you see no difference between LL=on and LL=off when JIT=on: with the JIT enabled, Lua locks are always disabled.

The LL setting remains in Kconfig only for compatibility and will be removed soon.

@yawor
Contributor

yawor commented May 11, 2018

Ok, here's a little twist. I have two programs: one in which I started implementing communication with the RFM12B module, and a stripped-down one which only enables SPI and sends some bytes.
I actually ran my earlier tests on the bigger program, and now I've checked the stripped-down one. What's strange is that the stripped-down version is slower than the full one (on a build where JIT is enabled, I get 200-210 us instead of the 180 us I posted earlier). The SPI code is almost identical (apart from the function name); the difference is that in the full program I start a thread before running the init commands over SPI.

Here's the stripped-down code:

local spiData = {
   0x0123,
   0x4567,
   0x89ab,
   0xcdef,
   0x0000,
   0x1111,
   0x2222,
   0x3333,
   0x4444,
   0x5555,
   0x6666,
}

local dev

local function readwrite(data)
    dev:select()
    local ret = dev:readwrite((data & 0xff00) >> 8, data & 0xff)
    dev:deselect()
    return ret[0] << 8 | ret[1]
end

function runtest()
    dev = spi.attach(spi.SPI3, spi.MASTER, pio.GPIO5, 2000000, 8, 0)
    for i, cmd in ipairs(spiData) do
        readwrite(cmd)
    end
end

And here's the full one:

-- RFM12B

local FREQUENCY = 1664 -- math.floor((868320000 - 860000000) / 5000) -- 868.320 MHz
local FSK_SHIFT = 48 -- (math.floor(60000 / 15000) - 1) << 4 -- 60000
local DATARATE = 35 -- math.floor((10000000 / 29 / 9600) - 0.5)

local CMD_CFG = 0x8000
local CFG_EL = 0x80
local CFG_EF = 0x40
--local CFG_BAND_315 = 0x00
--local CFG_BAND_433 = 0x10
local CFG_BAND_868 = 0x20
--local CFG_BAND_915 = 0x30
--local CFG_XTAL_8_5PF  = 0x00
--local CFG_XTAL_9_0PF  = 0x01
--local CFG_XTAL_9_5PF  = 0x02
--local CFG_XTAL_10_0PF = 0x03
--local CFG_XTAL_10_5PF = 0x04
--local CFG_XTAL_11_0PF = 0x05
--local CFG_XTAL_11_5PF = 0x06
--local CFG_XTAL_12_0PF = 0x07
local CFG_XTAL_12_5PF = 0x08
--local CFG_XTAL_13_0PF = 0x09
--local CFG_XTAL_13_5PF = 0x0A
--local CFG_XTAL_14_0PF = 0x0B
--local CFG_XTAL_14_5PF = 0x0C
--local CFG_XTAL_15_0PF = 0x0D
--local CFG_XTAL_15_5PF = 0x0E
--local CFG_XTAL_16_0PF = 0x0F

local CMD_PWRMGT = 0x8200
local PWRMGT_ER = 0x80
--local PWRMGT_EBB = 0x40
local PWRMGT_ET = 0x20
--local PWRMGT_ES = 0x10
--local PWRMGT_EX = 0x08
--local PWRMGT_EB = 0x04
--local PWRMGT_EW = 0x02
local PWRMGT_DC = 0x01

local CMD_FREQUENCY = 0xA000

local CMD_DATARATE = 0xC600
--local DATARATE_CS = 0x80

local CMD_RXCTRL = 0x9000
local RXCTRL_P16_VDI = 0x400
local RXCTRL_VDI_FAST = 0x000
--local RXCTRL_VDI_MEDIUM = 0x100
--local RXCTRL_VDI_SLOW = 0x200
--local RXCTRL_VDI_ALWAYS_ON = 0x300
--local RXCTRL_BW_400 = 0x20
--local RXCTRL_BW_340 = 0x40
--local RXCTRL_BW_270 = 0x60
local RXCTRL_BW_200 = 0x80
--local RXCTRL_BW_134 = 0xA0
--local RXCTRL_BW_67 = 0xC0
local RXCTRL_LNA_0 = 0x00
--local RXCTRL_LNA_6 = 0x08
--local RXCTRL_LNA_14 = 0x10
--local RXCTRL_LNA_20 = 0x18
local RXCTRL_RSSI_103 = 0x00
--local RXCTRL_RSSI_97 = 0x01
--local RXCTRL_RSSI_91 = 0x02
--local RXCTRL_RSSI_85 = 0x03
--local RXCTRL_RSSI_79 = 0x04
--local RXCTRL_RSSI_73 = 0x05
--local RXCTRL_RSSI_67 = 0x06
--local RXCTRL_RSSI_61 = 0x07

local CMD_DATAFILTER = 0xC228
local DATAFILTER_AL = 0x80
--local DATAFILTER_ML = 0x40
--local DATAFILTER_S = 0x10

local CMD_FIFORESET = 0xCA00
--local FIFORESET_SP = 0x08
--local FIFORESET_AL = 0x04
local FIFORESET_FF = 0x02
local FIFORESET_DR = 0x01

--local CMD_SYNCPATTERN = 0xCE00

local CMD_READ = 0xB000

local CMD_AFC = 0xC400
--local AFC_AUTO_OFF = 0x00
--local AFC_AUTO_ONCE = 0x40
local AFC_AUTO_VDI = 0x80
--local AFC_AUTO_KEEP = 0xC0
local AFC_LIMIT_OFF = 0x00
--local AFC_LIMIT_16 = 0x10
--local AFC_LIMIT_8 = 0x20
--local AFC_LIMIT_4 = 0x30
--local AFC_ST = 0x08
--local AFC_FI = 0x04
local AFC_OE = 0x02
local AFC_EN = 0x01

local CMD_TXCONF = 0x9800
--local TXCONF_MP = 0x100
local TXCONF_POWER_0 = 0x00
--local TXCONF_POWER_3 = 0x01
--local TXCONF_POWER_6 = 0x02
--local TXCONF_POWER_9 = 0x03
--local TXCONF_POWER_12 = 0x04
--local TXCONF_POWER_15 = 0x05
--local TXCONF_POWER_18 = 0x06
--local TXCONF_POWER_21 = 0x07

local CMD_PLL = 0xCC02
--local PLL_DDY = 0x08
--local PLL_DDIT = 0x04
local PLL_BW0 = 0x01

local CMD_TX = 0xB800

--local CMD_WAKEUP = 0xE000

--local CMD_DUTYCYCLE = 0xC800
--local DUTYCYCLE_ENABLE = 0x01

local CMD_STATUS = 0x0000
--local STATUS_RGIT = 0x8000
local STATUS_FFIT = 0x8000
--local STATUS_POR = 0x4000
--local STATUS_RGUR = 0x2000
--local STATUS_FFOV = 0x2000
--local STATUS_WKUP = 0x1000
--local STATUS_EXT = 0x0800
--local STATUS_LBD = 0x0400
--local STATUS_FFEM = 0x0200
--local STATUS_ATS = 0x0100
--local STATUS_RSSI = 0x0100
--local STATUS_DQD = 0x0080
--local STATUS_CRL = 0x0040
--local STATUS_ATGL = 0x0020

local CMD_RESET = 0xffff

local CMD_PWRMGT_DEFAULT = CMD_PWRMGT | PWRMGT_DC
--local CMD_PWRMGT_TRANSMIT = CMD_PWRMGT_DEFAULT | PWRMGT_ET
local CMD_PWRMGT_RECEIVE = CMD_PWRMGT_DEFAULT | PWRMGT_ER
local CMD_CLEAR_FIFO = CMD_FIFORESET | FIFORESET_DR | (8 << 4)
local CMD_ACCEPT_DATA = CMD_CLEAR_FIFO | FIFORESET_FF

local INIT_COMMANDS = {
    CMD_CFG | CFG_EL | CFG_EF | CFG_BAND_868 | CFG_XTAL_12_5PF,
    CMD_PWRMGT_DEFAULT,
    CMD_FREQUENCY | FREQUENCY,
    CMD_DATARATE | DATARATE,
    CMD_RXCTRL | RXCTRL_P16_VDI | RXCTRL_VDI_FAST | RXCTRL_BW_200 | RXCTRL_LNA_0 | RXCTRL_RSSI_103,
    CMD_DATAFILTER | DATAFILTER_AL | 4,
    CMD_CLEAR_FIFO,
    CMD_AFC | AFC_AUTO_VDI | AFC_LIMIT_OFF | AFC_OE | AFC_EN,
    CMD_TXCONF | TXCONF_POWER_0 | FSK_SHIFT,
    CMD_PLL | PLL_BW0,
    CMD_PWRMGT_RECEIVE,
}

local dev
--local buffer = {}
buffer = {}
local packets = {}
--local recvd = 0
recvd = 0
local recheck = false
--local status
local packet_rcvd = event.create()

function rfm12_readwrite(data)
    dev:select()
    local ret = dev:readwrite((data & 0xff00) >> 8, data & 0xff)
    dev:deselect()
    return ret[0] << 8 | ret[1]
end

function rfm12_init()
    dev = spi.attach(spi.SPI3, spi.MASTER, pio.GPIO5, 2000000, 8, 0)
    thread.start(extafree_packet, 8 * 1024)

    pio.pin.setdir(pio.OUTPUT, pio.GPIO16)
    pio.pin.setdir(pio.OUTPUT, pio.GPIO12)
    pio.pin.setlow(pio.GPIO12)
    thread.sleepms(1)
    pio.pin.sethigh(pio.GPIO12)
    for i, cmd in ipairs(INIT_COMMANDS) do
        rfm12_readwrite(cmd)
    end
    rfm12_readwrite(CMD_STATUS)
    --pio.pin.interrupt(pio.GPIO21, rfm12_callback, pio.pin.IntrNegEdge, 100, 8 * 1024, 22)
    pio.pin.interrupt(pio.GPIO21, rfm12_callback, pio.pin.IntrLowLevel, 100, 8 * 1024, 22)
    rfm12_readwrite(CMD_CLEAR_FIFO)
    rfm12_readwrite(CMD_ACCEPT_DATA)
end

function extafree_packet()
    while true do
        packet_rcvd:wait()
        while #packets > 0 do
            local packet = table.remove(packets, 1)
            local cs = 0
            for i = 1, 10 do
                cs = cs + packet[i]
            end
            cs = cs & 0xff

            print(table.concat(packet, ' '), cs == packet[11])
        end
        packet_rcvd:done()
    end
end

function rfm12_callback()
    while true do
        status = rfm12_readwrite(CMD_STATUS)
        recheck = false
        if status & STATUS_FFIT ~= 0 then
            recheck = true
            recvd = recvd + 1
            buffer[recvd] = rfm12_readwrite(CMD_READ) & 0xff
            if recvd == 11 then
                rfm12_readwrite(CMD_CLEAR_FIFO)
                rfm12_readwrite(CMD_ACCEPT_DATA)
                table.insert(packets, buffer)
                buffer = {}
                recvd = 0
                packet_rcvd:broadcast()
            end
        end
        if not recheck then break end
    end
end

Don't worry about the rest of the program in the full code. I'm testing it right now without anything connected to the SPI except the logic analyser. When you run rfm12_init function, it should send about 13 transactions, each containing 2 bytes.

@jolivepetrus
Contributor Author

@yawor,

New optimizations have been added in the JIT (see f88aa17).

There are also changes in the Lua spi module (readwrite function). The function now returns an array t[k] with k >= 1, which ensures that the result table is a true array and that no table resizes occur while the table is built. In the previous version the table was t[k] with k >= 0, which is not an array (in Lua, arrays are indexed from 1 to the array size) and caused continuous table resizes.
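
The difference between the two layouts can be seen in plain Lua, independently of the spi module. The sketch below is only an illustration: the byte values are made up, and word_from_bytes is a hypothetical helper showing how code that previously read ret[0] and ret[1] would read ret[1] and ret[2] after this change.

```lua
-- Old-style result: the first byte sits at index 0, outside the Lua sequence.
local old_ret = {}
old_ret[0] = 0x12
old_ret[1] = 0x34

-- New-style result: a proper sequence starting at index 1.
local new_ret = { 0x12, 0x34 }

-- The length operator only counts the sequence part starting at index 1.
print(#old_ret, #new_ret)

-- Hypothetical helper: combine two received bytes into a 16-bit word.
-- Equivalent to ret[1] << 8 | ret[2] on Lua 5.3.
function word_from_bytes(ret)
    return ret[1] * 256 + ret[2]
end

print(string.format("0x%04x", word_from_bytes(new_ret)))
```

Note that ipairs also starts at index 1, so a loop over the old-style table would silently skip the first byte.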

You should see a time of 150 usecs between NSS with the JIT enabled. Lower values are not feasible without changes in the SPI driver, but I have some ideas for the driver that would reduce the transaction duration.

@jolivepetrus
Contributor Author

@yawor,

Please try with commit 05bd970. Now the SPI Lua module uses an internal buffer to make transfers to the SPI bus.

@yawor
Contributor

yawor commented May 31, 2018

@jolivepetrus
I can confirm that the SPI operation now looks a lot better. I don't have the RF module connected right now (just logic analyser), so I don't know if it's now enough to properly handle the transmission. I'll check it during the weekend.

@yawor
Contributor

yawor commented Jun 3, 2018

@jolivepetrus
First of all, thank you for your great support regarding this issue.

I don't know if this is the right issue to post this in; maybe I should open a new issue for the SPI?
Unfortunately, the recent changes have introduced an error in the spi:readwrite function. It now returns the same data as given in the arguments, instead of the data actually received. I've confirmed with the logic analyser that the slave device is not causing this, and the same thing happens when I disconnect the MISO pin (it should return zeros when MISO is not connected).

I've tested the SPI performance and my conclusion is that the moment the NSS pin is activated/deactivated only helps partially. I've tried with and without the new SPI_FLAG_CS_AUTO to time both approaches and the total time to send 10 SPI requests is almost the same. When I test the time between last falling clock edge of the previous transfer to the first rising clock edge of the next transfer, it is about 190 us with or without the flag. I don't know if this time can be shortened further - probably not much. It can certainly go up: just enabling wifi in the config.lua adds ~20 us to this time.

BTW I've noticed that the NSS now rises up at the exact same time as the last clock edge falls down on the transfer. Is it possible to introduce a little delay? The time between NSS going down and the first clock edge is 1/2 of clock cycle, which should also be ok for the NSS going up after the last clock edge. Or add ability to add a number of dummy bits at the end. Some devices don't like NSS going up on the last clock edge. For example, the RFM12B datasheet calls this time "Select hold time" and requires it to be at least 25 ns.
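
As a rough check of the half-cycle suggestion above: at the 2 MHz clock used in the test code in this thread, half a clock cycle is well above the 25 ns select hold time quoted from the datasheet. The helper name below is illustrative.

```lua
-- SPI clock used in the test code in this thread (spi.attach(..., 2000000, ...)).
local SPI_CLOCK_HZ = 2000000
-- Minimum "Select hold time" quoted above from the RFM12B datasheet.
local MIN_HOLD_NS = 25

-- Hypothetical helper: duration of half a clock cycle, in nanoseconds.
function half_cycle_ns(clock_hz)
    return 1e9 / (2 * clock_hz)
end

local hold = half_cycle_ns(SPI_CLOCK_HZ)
print(string.format("half cycle = %.0f ns, required >= %d ns, ok = %s",
    hold, MIN_HOLD_NS, tostring(hold >= MIN_HOLD_NS)))
```

At 2 MHz a half cycle is 250 ns, ten times the required minimum.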

@jolivepetrus
Contributor Author

jolivepetrus commented Jun 4, 2018

We can introduce a small delay before NSS goes high, but not when NSS goes low, due to an ESP32 erratum (this only works in half-duplex). Maybe it's better to make SPI_FLAG_CS_AUTO = 0 the default, since this introduces a small delay.

@jolivepetrus
Contributor Author

@yawor,

The reading issue is fixed in ba7d71b.
