Concept for Realtime-Units (PRUs)

  • SPI to adc/dac ⇾ use dedicated hardware and not bit-banging

    • bit-banging transfer takes 8 (DAC) - 12 (ADC) ticks per bit ⇾ so 192 ticks for 24 bit at best (DAC) and 384 ticks for 32 bit (ADC)

    • no PRU-Peripheral is utilizable for it

    • Host-SPI is accessible by PRU (~40 ticks read delay for <= 32-bit, writing should take 1 tick) ⇾ Buffered, FIFO, allows 4-32 bit words, max 48 MHz

    • CS-Pins need precise timing, or at least repeatable (same delay, equidistant), so at beginning of IRQ-Loop

      • no IRQ to synchronize transaction!

      • start of transaction on host-spi can be half a SPI-Clock-Cycle accurate (Start Request ⇾ CS Low)

      • CS Pin could be PRU-Controlled/Monitored-Pin

    • alternative: DAC needs no MISO, so one extra PIN would allow to talk to both ICs at the same time

      • clk can be shared and would be reduced to the lower one (ADC is slower and has longer registers, so improvement would be 1/3 less time)

  • GPIO to target

    • PRU-GPIO pin direction is controlled by host, in linux via device tree or alternatively via “cape-universal”

    • PRU can’t access host GPIO (i found no memory-mapping)

    • a trick would be to reduce PRU to an observer with his separate pins ⇾ linux can then program, control and listen on it’s own pins

    • one PRU could sample 16 bits with close to 100 MHz …

    • there is no interrupt for PRU-pin-changes, but for transitions on host-gpio-modules

    • NOTE: some pins allow hardware-timestamping via int-routine, data-sheet even implies this is possible for all pins

  • UART to target ⇾ seems to be only host-controlled

    • could also be handled by PRU ⇾ has dedicated UART (TL16C550) no autobaud, 192 Mbps, 16-bit FIFO

    • HOST-UART is accessible by PRU (~40 ticks read delay) ⇾ (16C750), 300 bps to 3.7 Mbps, autobaud up to 115.2 Kbps

    • could just be monitored by PRU (GPIO)

  • SPI to target

    • control by userspace seems to be the way to go, so user can decide whether to use SPI or just the GPIO

    • one additional SPI available

    • monitor and timestamp by PRU

  • Programmer, currently for SWD, but also allow JTAG

    • should stay in user-space, SWD needs flexible pin-direction

  • unify Pins to Target ⇾ Programming, SPI, UART and GPIO

    • would be possible if Pins are controlled and used in Linux, PRU is just recording Pins

  • I2C to dac (static voltage) handled by PRU, to minimize errors

    • PRU could access host-I2C

    • PRU can utilize MDIO-Interface for that

  • vCap ⇾ see sub-chapter below

  • Scheduling

    • one PRU should do time critical things, i.e. sampling into ringbuffer in shared PRU-Memory → other PRU-Core should handle transfer to cpu-memory

    • work timer/interrupt-based with short transactions, could do pin-reading the rest of the time

  • try to add unit tests for critical code sections (vCap) ⇾ make it more modular

  • benchmark the loop

    • timer, counter with min, max, mean with copy to host

    • or just use debug-pins to mark active parts and analyze like pwm

    • maybe more useful than disassembling via godbolt

vCap - Converter & Capacitor Emulation

  • main future goals from the thesis regarding vCap

    • dynamic capacitance ⇾ allow capacitor sweeps, find optimum cost by benchmarking the target firmware

    • energy aware debugging ⇾ keep energy in capacitor constant during commands

    • support more targets, mainly msp430

  • find a better name

BeagleBone Features, Comparison

PRU ICSS High-level-Overview

  • 200 MHz cores

  • one 32bit-op per cycle

  • no division

  • 8 kB RAM per Core

  • 12 kB shared RAM

  • 3 banks of Scratch-Pad (=3x30x 32-bit registers) directly between the cores, 1Cycle Access

  • access to host memory, L3 interconnect (expensive wait, see benchmark below)

  • INTC, 64 events, 10 channels

  • access to host periphery (QSPI, GPIO, even USB), ~ 40 cycles read latency

  • CPU has mailbox system to send 4x 32-bit IRQ-messages to individual cores (also PRUs)

TI-Wiki contains datasheets for various sub-topics

  • PRU C/C++ optimization guide, presentations,

  • Subprocessor documentation

  • PRU-Projects_, notably

    • PyPRUSS (programming PRUs on beaglebone black),

    • libpruio (high speed data handling),

    • BeagleLogic (100 MHz, 14CH Logic Analyzer),

    • High Speed data acquisition

Latency Benchmarks

source: “sprace8a.pdf”

  • writes take normally 1 cycle, reads 2 to 14 (UART) cycles local, reads 30 - 40 cycles global (periphery)

  • transfer shared RAM to DDR 5 cycles / 4 byte, to 65 cycles / 128 byte

  • transfer DDR to shared RAM 47 cycles / 4 byte, 107 cycles / 128 byte ⇾ prefer large chunks

PRUs (ICSS, ICSSG) Supported techniques

source: link for feature-comparison

  • mostly called (enhanced) EGPIO:

  • 16 bit parallel capture input for GPIO, r31[15:0] are DataIn, r31[16] is ClockIn

  • 28 bit shift input ⇾ pru_DATAIN, r31_status[27:0], with counter stats, internal clock source ⇾ which pin?

    • WARNING: this seems to leave only ONE input

  • 3 Ch peripheral interface (on ICSS device dependent) - not found on BBB ??

  • Shift output

  • dedicated UART (with 16-bit FIFO, 192 Mbps) based on TL16C550, no speedsense, but autoflow (cts, rts)

  • eCAP (enhanced Capture)

  • IEP (industrial Ethernet)

  • 2x MII_RT (media independent interface), MDIO (management Data IO)

    • each MII has 32 byte RX FIFO, 64 byte TX FIFO, even TX_EN (as Chip-select) but has clk input ⇾ NO SPI

Beagle Black ⇾ AM3358

  • 1 PRU = 2 Cores, 200 MHz, 8 KB IRAM (instructions) per Core, 8 KB DRAM per core, 12 KB shared DRAM, 17/17 GP-Inp, 16/16 GP-Out, 3 Banks Scratch Pad

  • eGPIO on register x30000 / pins pr1_pru0_pru_r31[16:0] (INP) and pr1_pru0_pru_r30[15:0] (OUT) for PRU0, same for PRU1 with changed register name

  • UART on register x28000 / pins pr1_uart0_rxd/txd/cts_n/rts_n

  • eCap on pr1_ecap0_ecap_capin_apwm_o ⇾ capture input or aux PWM out

  • MDIO has an IO pin pr1_mdio_data

Beagle AI ⇾ AM5729

  • 2 PRU, 200 MHz, 12 KB IRAM per Core, 8 KB DRAM per Core, 32 KB shared DRAM, 21/21 GP-Inp, 21 GP-Out, 3 Banks Scratch Pad

  • same peripherals as AM3358

Possible Compilers

  • ti c compiler, supports c99, asm and c++2003 (https://www.ti.com/tool/TI-CGT#PRU)

  • gcc pru port, in mainline now, (https://github.com/dinuxbg/gnupru/wiki)

Program - Optimizations

PRU Good Practice

  • passing of arguments: 16 registers to pass 32-bit each

  • auto-incrementing loops are without overhead [for (i = 0; i < X; ++i)]

  • O2 tries to rewrite div-const-int into reciprocal mult

  • mixing of asm, c, c++ can bring trouble when activating optimizations

  • a more efficient (single instruction) access to local memory in the lower 16-bits (__near), can be used

  • variables in shared memory always “volatile”

  • const helps, at least to save RAM (if defined at compile-time)

  • compiler switch can decide if char is signed or unsigned

  • don’t mix signed and unsigned variables ⇾ expensive typecast

  • don’t bring signed variable <32bit into code ⇾ expensive typecast and handling

CCS Compiler Switches

  • opt_level=[1-4]

  • opt_for_speed=[0-5]

  • fp_mode=[strict] ⇾ disable fp-usage

Current Program Flow PRU0

  • only drawing on paper atm

Current Program Flow PRU1

  • only drawing on paper atm