Memory Interface

Improvements for the Memory Interface between PRU and user space

shepherd consists of an embedded Linux board (BeagleBone Black) that has an ARM core and two special real-time coprocessors (called PRUs)
there are two basic functions for shepherd:
- harvesting / recording an energy source
- emulating that energy environment for a connected wireless node (target MCU)
the focus is on the emulation part, as it is the most constrained
the PRUs sample an ADC, write to a DAC, read GPIO … and calculate some real-time math (virtual power source)
the Linux side is controlled by a Python program that has a direct memory interface to the PRUs ⇾ that program supplies input data and collects the resulting measurement stream
(side-info) there is an optional second communication channel to a kernel module (Python and the PRUs can each talk to that module) controlling most of the state-machine
Problem: the memory interface for exchanging the measurement data described above has some design flaws, listed in “Current Situation” and “Known Constraints” below

Current Situation

overly complicated and “expensive” borrow & return system with a 64-segment ring buffer (SampleBuffer)
SampleBuffer currently holds 100 ms of data (10 kSamples) and gpio-samples
nested gpio-struct (GPIOEdges) inside SampleBuffer holds only ~ 16 kSamples ⇾ artificial bottleneck
current trick / dirty hack, as real-time constraints got violated from time to time (when reading from RAM took too long): pru1 now does the reading from RAM and shares the data with pru0 via the fast shared RAM (exclusive to the PRUs)

Timings for reference (emulation, data from mid 2021):

10’000 ns for each loop (@ 100 kHz) available for getting data, processing it and writing data
~600 ns for reading the ADC (current-value)
400 - 3’000 ns for reading data from RAM
~ 8’000 ns for the virtual source calculations (worst case)
~ 800 ns for buffer-swap (only every 100 ms)
- 400 ns prepare buffer
- 200 ns mutex-part / gpio-swap
- 200 ns send full buffer
720 ns for writing to DAC
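
To put the timings above into perspective, here is a rough budget calculation. It is only a sketch: it assumes that all steps run one after another on a single PRU, uses the worst-case figure for the virtual source, and leaves out the buffer-swap (it only happens every 100 ms). The names are made up for this illustration.

```python
# Per-sample timing budget of the emulation loop (values from the list above, in ns).
LOOP_BUDGET_NS = 10_000      # one loop iteration at 100 kHz

ADC_READ_NS = 600
RAM_READ_NS = (400, 3_000)   # best case, worst case
VSOURCE_NS = 8_000           # virtual source calculations, worst case
DAC_WRITE_NS = 720


def loop_time_ns(ram_read_ns: int) -> int:
    """Sum of the sequential steps for a single sample (assumed: no overlap)."""
    return ADC_READ_NS + ram_read_ns + VSOURCE_NS + DAC_WRITE_NS


for label, ram_ns in zip(("best", "worst"), RAM_READ_NS):
    total = loop_time_ns(ram_ns)
    print(f"{label} case: {total} ns, headroom {LOOP_BUDGET_NS - total} ns")
# best case: 9720 ns, headroom 280 ns
# worst case: 12320 ns, headroom -2320 ns
```

Under these assumptions a slow RAM read alone is enough to exceed the 10’000 ns window, which fits the occasional real-time violations mentioned above.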

our goal is to remove overhead and bottlenecks and to boost the performance, mainly to bring the gpio sampling to reliable frequencies in the range of 8 - 16 MHz
the gpio sampling currently varies from 840 kHz to 5.7 MHz with a mean of 2.2 MHz
the main points of attack will be
- the design of a new memory interface (buffer-design)
- redesign of the state-machine coordinating the measurement (time-sync, buffer swap, controlling measurement-states)
- maybe: improved sampling routines for the ICs (bitbanged SPI in assembler)
another possibility for high throughput gpio-sampling:
- offer two firmwares: virtual source emulation with slower GPIO-Sampling OR
- fast GPIO-Sampling with the virtual power source disabled (it is occupying > 90% of PRU0)
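
The GPIO sampling rates mentioned above can be translated into PRU cycles per sample, which shows how tight the goal is. This is a small sketch that only assumes the 200 MHz PRU base clock (5 ns per cycle); the function and dictionary names are hypothetical.

```python
PRU_CLOCK_HZ = 200_000_000          # PRU base clock
NS_PER_CYCLE = 1e9 / PRU_CLOCK_HZ   # 5 ns


def cycles_per_sample(sample_rate_hz: float) -> float:
    """PRU cycles available between two GPIO samples at the given rate."""
    return PRU_CLOCK_HZ / sample_rate_hz


rates_hz = {
    "current min": 840e3,
    "current mean": 2.2e6,
    "current max": 5.7e6,
    "goal low": 8e6,
    "goal high": 16e6,
}

for name, rate in rates_hz.items():
    cycles = cycles_per_sample(rate)
    print(f"{name}: {cycles:.1f} cycles ({cycles * NS_PER_CYCLE:.1f} ns) per sample")
```

At 8 MHz there are 25 cycles between two samples, at 16 MHz only 12.5, so the sampling loop would have almost no room for anything besides reading and storing the GPIO state.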

Known Constraints

roughly 1 MB/s in both directions over the mem-interface (for emulation / power traces)
event-based gpio-sampling with high throughput might overburden the BeagleBone, example:
- 1 MBaud Serial might cause 1 * 10^6 events per second
- an event consists of 2 byte gpio-register & 8 byte timestamp
- 10 byte @ 1 MHz are ~ 10 MB/s
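
The data-rate estimate above can be reproduced with a small sketch. The event layout used here (field order, little-endian, no padding, timestamp in ns) is an assumption for illustration and not necessarily the format shepherd uses.

```python
import struct

# Hypothetical event layout: 2 byte GPIO register followed by an 8 byte timestamp.
# "<" selects little-endian and disables alignment padding, so the size is 10 bytes.
GPIO_EVENT = struct.Struct("<HQ")


def pack_event(gpio_state: int, timestamp_ns: int) -> bytes:
    """Serialize one GPIO event into its 10 byte representation."""
    return GPIO_EVENT.pack(gpio_state, timestamp_ns)


def data_rate_bytes_per_s(events_per_s: int) -> int:
    """Raw stream size, ignoring any buffer or protocol overhead."""
    return events_per_s * GPIO_EVENT.size


event = pack_event(0x0201, 1_000_000)
print(len(event))                                       # 10
print(data_rate_bytes_per_s(1_000_000) / 1e6, "MB/s")   # 10.0 MB/s
```

That is about ten times the roughly 1 MB/s currently moved over the mem-interface for the power traces.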
Note: there is another testbed called FlockLab that also uses the BeagleBone. They sample serial via a serial kernel module and are limited to ~ 0.5 to 1 MBaud, which produces high system load
the PRU is good at writing into (system) RAM with just 1 cycle, but slow at reading with 100-600 cycles per read (at 200 MHz PRU baseclock)
- by using memcopy, one read can be larger than uint32 while needing only a little more time
- currently the PRU reads the voltage- and current-value in two separate operations (design-flaw)
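
One way to avoid the two separate reads would be to store voltage and current side by side, so that a single 8 byte copy fetches both. The following is only a sketch of such a record as seen from the Python side; the name IVSample, the field widths (uint32 each) and the example values are assumptions.

```python
import ctypes


class IVSample(ctypes.LittleEndianStructure):
    """Hypothetical record: one voltage and one current value, stored adjacently."""
    _fields_ = [
        ("voltage", ctypes.c_uint32),
        ("current", ctypes.c_uint32),
    ]


sample = IVSample(voltage=3_300_000, current=12_345)
raw = bytes(sample)                         # what would sit in the shared buffer
print(len(raw), ctypes.sizeof(IVSample))    # 8 8

# Reading the pair back needs only one contiguous 8 byte copy.
restored = IVSample.from_buffer_copy(raw)
print(restored.voltage, restored.current)   # 3300000 12345
```

With such a layout the PRU side could fetch the pair with one larger read instead of two uint32 reads, trading a little extra copy time for one saved RAM access.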
each PRU has only 8 kB of private RAM, plus 12 kB of shared RAM (between the two PRUs)
there might be more …

BeagleBone, Power-Adapter
SD-Card & SD-Cardreader for flashing a Linux-Image
Network-Cable and external router or switch to connect via ssh
logic-analyzer to determine timings of subroutines
dev PC with shell (Linux preferred, but WSL, PowerShell or a macOS shell also work)

External BBone-Projects that may help: