# Direct Memory Access (DMA)
This section covers the core requirements for building a memory safe API around DMA transfers.
The DMA peripheral is used to perform memory transfers in parallel to the work of the processor (the execution of the main program). A DMA transfer is more or less equivalent to spawning a thread (see `thread::spawn`) to do a `memcpy`.

We'll use the fork-join model to illustrate the requirements of a memory safe API.

Consider the following DMA primitives:
```rust
/// A singleton that represents a single DMA channel (channel 1 in this case)
///
/// This singleton has exclusive access to the registers of DMA channel 1
pub struct Dma1Channel1 {
    // ..
}

impl Dma1Channel1 {
    /// Data will be written to this `address`
    ///
    /// `inc` indicates whether the address will be incremented after every byte transfer
    ///
    /// NOTE this performs a volatile write
    pub fn set_destination_address(&mut self, address: usize, inc: bool) {
        // ..
    }

    /// Data will be read from this `address`
    ///
    /// `inc` indicates whether the address will be incremented after every byte transfer
    ///
    /// NOTE this performs a volatile write
    pub fn set_source_address(&mut self, address: usize, inc: bool) {
        // ..
    }

    /// Number of bytes to transfer
    ///
    /// NOTE this performs a volatile write
    pub fn set_transfer_length(&mut self, len: usize) {
        // ..
    }

    /// Starts the DMA transfer
    ///
    /// NOTE this performs a volatile write
    pub fn start(&mut self) {
        // ..
    }

    /// Stops the DMA transfer
    ///
    /// NOTE this performs a volatile write
    pub fn stop(&mut self) {
        // ..
    }

    /// Returns `true` if there's a transfer in progress
    ///
    /// NOTE this performs a volatile read
    pub fn in_progress() -> bool {
        // ..
    }
}
```
Assume that `Dma1Channel1` is statically configured to work with serial port (AKA UART or USART) #1, `Serial1`, in one-shot mode (i.e. not circular mode).

`Serial1` provides the following blocking API:
```rust
/// A singleton that represents serial port #1
pub struct Serial1 {
    // ..
}

impl Serial1 {
    /// Reads out a single byte
    ///
    /// NOTE: blocks if no byte is available to be read
    pub fn read(&mut self) -> Result<u8, Error> {
        // ..
    }

    /// Sends out a single byte
    ///
    /// NOTE: blocks if the output FIFO buffer is full
    pub fn write(&mut self, byte: u8) -> Result<(), Error> {
        // ..
    }
}
```
Let's say we want to extend the `Serial1` API to (a) asynchronously send out a buffer and (b) asynchronously fill a buffer.

We'll start with a memory unsafe API and iterate on it until it's completely memory safe. At each step we'll show how the API can be broken, to make you aware of the issues that need to be addressed when dealing with asynchronous memory operations.
## A first stab
For starters, let's try to use the `Write::write_all` API as a reference. To keep things simple let's ignore all error handling.
```rust
/// A singleton that represents serial port #1
pub struct Serial1 {
    // NOTE: we extend this struct by adding the DMA channel singleton
    dma: Dma1Channel1,
    // ..
}

impl Serial1 {
    /// Sends out the given `buffer`
    ///
    /// Returns a value that represents the in-progress DMA transfer
    pub fn write_all<'a>(mut self, buffer: &'a [u8]) -> Transfer<&'a [u8]> {
        self.dma.set_destination_address(USART1_TX, false);
        self.dma.set_source_address(buffer.as_ptr() as usize, true);
        self.dma.set_transfer_length(buffer.len());

        self.dma.start();

        Transfer { buffer }
    }
}

/// A DMA transfer
pub struct Transfer<B> {
    buffer: B,
}

impl<B> Transfer<B> {
    /// Returns `true` if the DMA transfer has finished
    pub fn is_done(&self) -> bool {
        !Dma1Channel1::in_progress()
    }

    /// Blocks until the transfer is done and returns the buffer
    pub fn wait(self) -> B {
        // Busy wait until the transfer is done
        while !self.is_done() {}

        self.buffer
    }
}
```
> **NOTE:** `Transfer` could expose a futures- or generator-based API instead of the API shown above. That's an API design question that has little bearing on the memory safety of the overall API so we won't delve into it in this text.
We can also implement an asynchronous version of `Read::read_exact`.
```rust
impl Serial1 {
    /// Receives data into the given `buffer` until it's filled
    ///
    /// Returns a value that represents the in-progress DMA transfer
    pub fn read_exact<'a>(&mut self, buffer: &'a mut [u8]) -> Transfer<&'a mut [u8]> {
        self.dma.set_source_address(USART1_RX, false);
        self.dma
            .set_destination_address(buffer.as_mut_ptr() as usize, true);
        self.dma.set_transfer_length(buffer.len());

        self.dma.start();

        Transfer { buffer }
    }
}
```
Here's how to use the `write_all` API:
```rust
fn write(serial: Serial1) {
    // fire and forget
    serial.write_all(b"Hello, world!\n");

    // do other stuff
}
```
And here's an example of using the `read_exact` API:
```rust
fn read(mut serial: Serial1) {
    let mut buf = [0; 16];
    let t = serial.read_exact(&mut buf);

    // do other stuff

    t.wait();

    match buf.split(|b| *b == b'\n').next() {
        Some(b"some-command") => { /* do something */ }
        _ => { /* do something else */ }
    }
}
```
## `mem::forget`
`mem::forget` is a safe API. If our API is truly safe then we should be able to use both together without running into undefined behavior. However, that's not the case; consider the following example:
```rust
fn unsound(mut serial: Serial1) {
    start(&mut serial);

    bar();
}

#[inline(never)]
fn start(serial: &mut Serial1) {
    let mut buf = [0; 16];

    // start a DMA transfer and forget the returned `Transfer` value
    mem::forget(serial.read_exact(&mut buf));
}

#[inline(never)]
fn bar() {
    // stack variables
    let mut x = 0;
    let mut y = 0;

    // use `x` and `y`
}
```
Here we start a DMA transfer, in `start`, to fill an array allocated on the stack and then `mem::forget` the returned `Transfer` value. Then we proceed to return from `start` and execute the function `bar`.

This series of operations results in undefined behavior. The DMA transfer writes to stack memory, but that memory is released when `start` returns and is then reused by `bar` to allocate variables like `x` and `y`. At runtime this could result in variables `x` and `y` changing their values at random times. The DMA transfer could also overwrite the state (e.g. the link register) pushed onto the stack by the prologue of function `bar`.
Note that if we had used `mem::drop` instead of `mem::forget`, it would have been possible to make `Transfer`'s destructor stop the DMA transfer, and then the program would have been safe. But one can not rely on destructors running to enforce memory safety, because `mem::forget` and memory leaks (see `Rc` cycles) are safe in Rust.
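As a brief illustration of that last point, here's a minimal sketch (not part of the DMA API, and using `std` types purely for illustration) of leaking memory without any `unsafe` code: an `Rc` reference cycle keeps both allocations alive forever, so no destructor ever runs.

```rust
use std::cell::RefCell;
use std::rc::Rc;

struct Node {
    next: RefCell<Option<Rc<Node>>>,
}

fn leak() {
    let a = Rc::new(Node { next: RefCell::new(None) });
    let b = Rc::new(Node { next: RefCell::new(Some(a.clone())) });

    // close the cycle: `a` points to `b` and `b` points to `a`
    *a.next.borrow_mut() = Some(b);

    // when `a` goes out of scope both reference counts stay above zero,
    // so neither `Node` is ever dropped -- all in safe code
}
```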
We can fix this particular problem by changing the lifetime of the buffer from `'a` to `'static` in both APIs.
```rust
impl Serial1 {
    /// Receives data into the given `buffer` until it's filled
    ///
    /// Returns a value that represents the in-progress DMA transfer
    pub fn read_exact(&mut self, buffer: &'static mut [u8]) -> Transfer<&'static mut [u8]> {
        // .. same as before ..
    }

    /// Sends out the given `buffer`
    ///
    /// Returns a value that represents the in-progress DMA transfer
    pub fn write_all(mut self, buffer: &'static [u8]) -> Transfer<&'static [u8]> {
        // .. same as before ..
    }
}
```
If we try to replicate the previous problem we note that `mem::forget` no longer causes problems.
```rust
#[allow(dead_code)]
fn sound(mut serial: Serial1, buf: &'static mut [u8; 16]) {
    // NOTE `buf` is moved into `foo`
    foo(&mut serial, buf);

    bar();
}

#[inline(never)]
fn foo(serial: &mut Serial1, buf: &'static mut [u8]) {
    // start a DMA transfer and forget the returned `Transfer` value
    mem::forget(serial.read_exact(buf));
}

#[inline(never)]
fn bar() {
    // stack variables
    let mut x = 0;
    let mut y = 0;

    // use `x` and `y`
}
```
As before, the DMA transfer continues after `mem::forget`-ing the `Transfer` value. This time that's not an issue because `buf` is statically allocated (e.g. a `static mut` variable) and not on the stack.
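For illustration, here's a minimal, hedged sketch of one way to obtain such a `&'static mut [u8]` from a `static mut` variable; the helper name is made up and the function is only sound if it's called at most once.

```rust
static mut BUFFER: [u8; 16] = [0; 16];

/// Returns the statically allocated buffer.
///
/// # Safety
///
/// Must be called at most once; a second call would create a second,
/// aliasing `&'static mut` reference to the same memory.
unsafe fn take_buffer() -> &'static mut [u8] {
    &mut BUFFER
}
```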
## Overlapping use
Our API doesn't prevent the user from using the `Serial` interface while the DMA transfer is in progress. This could make the transfer fail or cause data to be lost.
There are several ways to prevent overlapping use. One way is to have `Transfer` take ownership of `Serial1` and return it back when `wait` is called.
```rust
/// A DMA transfer
pub struct Transfer<B> {
    buffer: B,
    // NOTE: added
    serial: Serial1,
}

impl<B> Transfer<B> {
    /// Blocks until the transfer is done and returns the buffer
    // NOTE: the return value has changed
    pub fn wait(self) -> (B, Serial1) {
        // Busy wait until the transfer is done
        while !self.is_done() {}

        (self.buffer, self.serial)
    }

    // ..
}

impl Serial1 {
    /// Receives data into the given `buffer` until it's filled
    ///
    /// Returns a value that represents the in-progress DMA transfer
    // NOTE we now take `self` by value
    pub fn read_exact(mut self, buffer: &'static mut [u8]) -> Transfer<&'static mut [u8]> {
        // .. same as before ..

        Transfer {
            buffer,
            // NOTE: added
            serial: self,
        }
    }

    /// Sends out the given `buffer`
    ///
    /// Returns a value that represents the in-progress DMA transfer
    // NOTE we now take `self` by value
    pub fn write_all(mut self, buffer: &'static [u8]) -> Transfer<&'static [u8]> {
        // .. same as before ..

        Transfer {
            buffer,
            // NOTE: added
            serial: self,
        }
    }
}
```
The move semantics statically prevent access to `Serial1` while the transfer is in progress.
```rust
fn read(serial: Serial1, buf: &'static mut [u8; 16]) {
    let t = serial.read_exact(buf);

    // let byte = serial.read(); //~ ERROR: `serial` has been moved

    // .. do stuff ..

    let (serial, buf) = t.wait();

    // .. do more stuff ..
}
```
There are other ways to prevent overlapping use. For example, a (`Cell`) flag that indicates whether a DMA transfer is in progress could be added to `Serial1`. When the flag is set, `read`, `write`, `read_exact` and `write_all` would all return an error (e.g. `Error::InUse`) at runtime. The flag would be set when `write_all` / `read_exact` is used and cleared in `Transfer.wait`.
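Here's a rough sketch of that runtime-checked alternative, reusing the `Serial1`, `Dma1Channel1` and `Error` types from earlier; the `in_use` field and the `Error::InUse` variant are illustrative names, not part of the API developed in this text.

```rust
use core::cell::Cell;

pub struct Serial1 {
    dma: Dma1Channel1,
    // NOTE: added -- `true` while a DMA transfer is in progress
    in_use: Cell<bool>,
    // ..
}

impl Serial1 {
    /// Reads out a single byte
    pub fn read(&mut self) -> Result<u8, Error> {
        if self.in_use.get() {
            // a DMA transfer currently owns the peripheral
            return Err(Error::InUse);
        }

        // .. same blocking read as before ..
    }

    /// Receives data into the given `buffer` until it's filled
    pub fn read_exact(
        &mut self,
        buffer: &'static mut [u8],
    ) -> Result<Transfer<&'static mut [u8]>, Error> {
        if self.in_use.replace(true) {
            return Err(Error::InUse);
        }

        // .. configure and start the DMA transfer as before ..
    }
}

// `Transfer::wait` would clear the flag, e.g. `serial.in_use.set(false)`,
// before handing the buffer back to the caller.
```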
## Compiler (mis)optimizations
The compiler is free to re-order and merge non-volatile memory operations to better optimize a program. With our current API, this freedom can lead to undefined behavior. Consider the following example:
```rust
fn reorder(serial: Serial1, buf: &'static mut [u8]) {
    // zero the buffer (for no particular reason)
    buf.iter_mut().for_each(|byte| *byte = 0);

    let t = serial.read_exact(buf);

    // ... do other stuff ..

    let (buf, serial) = t.wait();

    buf.reverse();

    // .. do stuff with `buf` ..
}
```
Here the compiler is free to move `buf.reverse()` before `t.wait()`, which would result in a data race: both the processor and the DMA would end up modifying `buf` at the same time. Similarly, the compiler can move the zeroing operation to after `read_exact`, which would also result in a data race.
To prevent these problematic reorderings we can use a `compiler_fence`:
```rust
impl Serial1 {
    /// Receives data into the given `buffer` until it's filled
    ///
    /// Returns a value that represents the in-progress DMA transfer
    pub fn read_exact(mut self, buffer: &'static mut [u8]) -> Transfer<&'static mut [u8]> {
        self.dma.set_source_address(USART1_RX, false);
        self.dma
            .set_destination_address(buffer.as_mut_ptr() as usize, true);
        self.dma.set_transfer_length(buffer.len());

        // NOTE: added
        atomic::compiler_fence(Ordering::Release);

        // NOTE: this is a volatile *write*
        self.dma.start();

        Transfer {
            buffer,
            serial: self,
        }
    }

    /// Sends out the given `buffer`
    ///
    /// Returns a value that represents the in-progress DMA transfer
    pub fn write_all(mut self, buffer: &'static [u8]) -> Transfer<&'static [u8]> {
        self.dma.set_destination_address(USART1_TX, false);
        self.dma.set_source_address(buffer.as_ptr() as usize, true);
        self.dma.set_transfer_length(buffer.len());

        // NOTE: added
        atomic::compiler_fence(Ordering::Release);

        // NOTE: this is a volatile *write*
        self.dma.start();

        Transfer {
            buffer,
            serial: self,
        }
    }
}

impl<B> Transfer<B> {
    /// Blocks until the transfer is done and returns the buffer
    pub fn wait(self) -> (B, Serial1) {
        // NOTE: this is a volatile *read*
        while !self.is_done() {}

        // NOTE: added
        atomic::compiler_fence(Ordering::Acquire);

        (self.buffer, self.serial)
    }

    // ..
}
```
We use `Ordering::Release` in `read_exact` and `write_all` to prevent all preceding memory operations from being moved after `self.dma.start()`, which performs a volatile write.

Likewise, we use `Ordering::Acquire` in `Transfer.wait` to prevent all subsequent memory operations from being moved before `self.is_done()`, which performs a volatile read.
To better visualize the effect of the fences here's a slightly tweaked version of the example from the previous section. We have added the fences and their orderings in the comments.
```rust
fn reorder(serial: Serial1, buf: &'static mut [u8], x: &mut u32) {
    // zero the buffer (for no particular reason)
    buf.iter_mut().for_each(|byte| *byte = 0);

    *x += 1;

    let t = serial.read_exact(buf); // compiler_fence(Ordering::Release) ▲

    // NOTE: the processor can't access `buf` between the fences
    // ... do other stuff ..
    *x += 2;

    let (buf, serial) = t.wait(); // compiler_fence(Ordering::Acquire) ▼

    *x += 3;

    buf.reverse();

    // .. do stuff with `buf` ..
}
```
The zeroing operation can not be moved after `read_exact` due to the `Release` fence. Similarly, the `reverse` operation can not be moved before `wait` due to the `Acquire` fence. The memory operations between both fences can be freely reordered across the fences, but none of those operations involves `buf`, so such reorderings do not result in undefined behavior.
Note that `compiler_fence` is a bit stronger than what's required. For example, the fences will prevent the operations on `x` from being merged even though we know that `buf` doesn't overlap with `x` (due to Rust's aliasing rules). However, there exists no intrinsic that's more fine grained than `compiler_fence`.
## Don't we need a memory barrier?
That depends on the target architecture. In the case of Cortex-M0 to M4F cores, AN321 says:
> 3.2 Typical usages
>
> (..)
>
> The use of DMB is rarely needed in Cortex-M processors because they do not reorder memory transactions. However, it is needed if the software is to be reused on other ARM processors, especially multi-master systems. For example:
>
> - DMA controller configuration. A barrier is required between a CPU memory access and a DMA operation.
>
> (..)
>
> 4.18 Multi-master systems
>
> (..)
>
> Omitting the DMB or DSB instruction in the examples in Figure 41 on page 47 and Figure 42 would not cause any error because the Cortex-M processors:
>
> - do not re-order memory transfers
> - do not permit two write transfers to be overlapped.
Where Figure 41 shows a DMB (memory barrier) instruction being used before starting a DMA transaction.
In the case of Cortex-M7 cores you'll need memory barriers (DMB/DSB) if you are using the data cache (DCache), unless you manually invalidate the buffer used by the DMA. Even with the data cache disabled, memory barriers might still be required to avoid reordering in the store buffer.
If your target is a multi-core system then it's very likely that you'll need memory barriers.
If you do need the memory barrier then you need to use `atomic::fence` instead of `compiler_fence`. That should generate a DMB instruction on Cortex-M devices.
## Generic buffer
Our API is more restrictive than it needs to be. For example, the following program won't be accepted even though it's valid.
```rust
fn reuse(serial: Serial1, msg: &'static mut [u8]) {
    // send a message
    let t1 = serial.write_all(msg);

    // ..

    let (msg, serial) = t1.wait(); // `msg` is now `&'static [u8]`

    msg.reverse();

    // now send it in reverse
    let t2 = serial.write_all(msg);

    // ..

    let (buf, serial) = t2.wait();

    // ..
}
```
To accept such a program we can make the buffer argument generic.
```rust
// as-slice = "0.1.0"
use as_slice::{AsMutSlice, AsSlice};

impl Serial1 {
    /// Receives data into the given `buffer` until it's filled
    ///
    /// Returns a value that represents the in-progress DMA transfer
    pub fn read_exact<B>(mut self, mut buffer: B) -> Transfer<B>
    where
        B: AsMutSlice<Element = u8>,
    {
        // NOTE: added
        let slice = buffer.as_mut_slice();
        let (ptr, len) = (slice.as_mut_ptr(), slice.len());

        self.dma.set_source_address(USART1_RX, false);

        // NOTE: tweaked
        self.dma.set_destination_address(ptr as usize, true);
        self.dma.set_transfer_length(len);

        atomic::compiler_fence(Ordering::Release);
        self.dma.start();

        Transfer {
            buffer,
            serial: self,
        }
    }

    /// Sends out the given `buffer`
    ///
    /// Returns a value that represents the in-progress DMA transfer
    pub fn write_all<B>(mut self, buffer: B) -> Transfer<B>
    where
        B: AsSlice<Element = u8>,
    {
        // NOTE: added
        let slice = buffer.as_slice();
        let (ptr, len) = (slice.as_ptr(), slice.len());

        self.dma.set_destination_address(USART1_TX, false);

        // NOTE: tweaked
        self.dma.set_source_address(ptr as usize, true);
        self.dma.set_transfer_length(len);

        atomic::compiler_fence(Ordering::Release);
        self.dma.start();

        Transfer {
            buffer,
            serial: self,
        }
    }
}
```
> **NOTE:** `AsRef<[u8]>` (`AsMut<[u8]>`) could have been used instead of `AsSlice<Element = u8>` (`AsMutSlice<Element = u8>`).
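A hedged sketch of that alternative, keeping everything else the same and only swapping the trait bounds and the conversion calls:

```rust
impl Serial1 {
    /// Receives data into the given `buffer` until it's filled
    ///
    /// Returns a value that represents the in-progress DMA transfer
    pub fn read_exact<B>(mut self, mut buffer: B) -> Transfer<B>
    where
        B: AsMut<[u8]>,
    {
        let slice = buffer.as_mut();
        let (ptr, len) = (slice.as_mut_ptr(), slice.len());

        // .. same as before ..
    }

    /// Sends out the given `buffer`
    ///
    /// Returns a value that represents the in-progress DMA transfer
    pub fn write_all<B>(mut self, buffer: B) -> Transfer<B>
    where
        B: AsRef<[u8]>,
    {
        let slice = buffer.as_ref();
        let (ptr, len) = (slice.as_ptr(), slice.len());

        // .. same as before ..
    }
}
```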
Now the `reuse` program will be accepted.
## Immovable buffers
With this modification the API will also accept arrays by value (e.g. `[u8; 16]`). However, using arrays can result in pointer invalidation. Consider the following program.
```rust
fn invalidate(serial: Serial1) {
    let t = start(serial);

    bar();

    let (buf, serial) = t.wait();
}

#[inline(never)]
fn start(serial: Serial1) -> Transfer<[u8; 16]> {
    // array allocated in this frame
    let buffer = [0; 16];

    serial.read_exact(buffer)
}

#[inline(never)]
fn bar() {
    // stack variables
    let mut x = 0;
    let mut y = 0;

    // use `x` and `y`
}
```
The `read_exact` operation will use the address of the `buffer` local to the `start` function. That local `buffer` will be freed when `start` returns and the pointer used in `read_exact` will become invalidated. You'll end up with a situation similar to the `unsound` example.
To avoid this problem we require that the buffer used with our API retains its memory location even when it's moved. The `Pin` newtype provides such a guarantee. We can update our API to require that all buffers are "pinned" first.
> **NOTE:** To compile all the programs below this point you'll need Rust `>=1.33.0`. As of time of writing (2019-01-04) that means using the nightly channel.
```rust
/// A DMA transfer
pub struct Transfer<B> {
    // NOTE: changed
    buffer: Pin<B>,
    serial: Serial1,
}

impl Serial1 {
    /// Receives data into the given `buffer` until it's filled
    ///
    /// Returns a value that represents the in-progress DMA transfer
    pub fn read_exact<B>(mut self, mut buffer: Pin<B>) -> Transfer<B>
    where
        // NOTE: bounds changed
        B: DerefMut,
        B::Target: AsMutSlice<Element = u8> + Unpin,
    {
        // .. same as before ..
    }

    /// Sends out the given `buffer`
    ///
    /// Returns a value that represents the in-progress DMA transfer
    pub fn write_all<B>(mut self, buffer: Pin<B>) -> Transfer<B>
    where
        // NOTE: bounds changed
        B: Deref,
        B::Target: AsSlice<Element = u8>,
    {
        // .. same as before ..
    }
}
```
> **NOTE:** We could have used the `StableDeref` trait instead of the `Pin` newtype but opted for `Pin` since it's provided in the standard library.
With this new API we can use `&'static mut` references, `Box`-ed slices, `Rc`-ed slices, etc.
```rust
fn static_mut(serial: Serial1, buf: &'static mut [u8]) {
    let buf = Pin::new(buf);

    let t = serial.read_exact(buf);

    // ..

    let (buf, serial) = t.wait();

    // ..
}

fn boxed(serial: Serial1, buf: Box<[u8]>) {
    let buf = Pin::new(buf);

    let t = serial.read_exact(buf);

    // ..

    let (buf, serial) = t.wait();

    // ..
}
```
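The `Rc`-ed slices mentioned above only give shared access, so they satisfy the bounds of `write_all` but not those of `read_exact`. A hedged sketch, assuming an allocator is available:

```rust
fn reference_counted(serial: Serial1, buf: Rc<[u8]>) {
    let buf = Pin::new(buf);

    // `Rc<[u8]>: Deref<Target = [u8]>`, so only sending is possible
    let t = serial.write_all(buf);

    // ..

    let (buf, serial) = t.wait();

    // ..
}
```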
## `'static` bound
Does pinning let us safely use stack allocated arrays? The answer is no. Consider the following example.
```rust
fn unsound(serial: Serial1) {
    start(serial);

    bar();
}

// pin-utils = "0.1.0-alpha.4"
use pin_utils::pin_mut;

#[inline(never)]
fn start(serial: Serial1) {
    let buffer = [0; 16];

    // pin the `buffer` to this stack frame
    // `buffer` now has type `Pin<&mut [u8; 16]>`
    pin_mut!(buffer);

    mem::forget(serial.read_exact(buffer));
}

#[inline(never)]
fn bar() {
    // stack variables
    let mut x = 0;
    let mut y = 0;

    // use `x` and `y`
}
```
As seen many times before, the above program runs into undefined behavior due to stack frame corruption.
The API is unsound for buffers of type `Pin<&'a mut [u8]>` where `'a` is not `'static`. To prevent the problem we have to add a `'static` bound in some places.
```rust
impl Serial1 {
    /// Receives data into the given `buffer` until it's filled
    ///
    /// Returns a value that represents the in-progress DMA transfer
    pub fn read_exact<B>(mut self, mut buffer: Pin<B>) -> Transfer<B>
    where
        // NOTE: added 'static bound
        B: DerefMut + 'static,
        B::Target: AsMutSlice<Element = u8> + Unpin,
    {
        // .. same as before ..
    }

    /// Sends out the given `buffer`
    ///
    /// Returns a value that represents the in-progress DMA transfer
    pub fn write_all<B>(mut self, buffer: Pin<B>) -> Transfer<B>
    where
        // NOTE: added 'static bound
        B: Deref + 'static,
        B::Target: AsSlice<Element = u8>,
    {
        // .. same as before ..
    }
}
```
Now the problematic program will be rejected.
## Destructors
Now that the API accepts `Box`-es and other types that have destructors, we need to decide what to do when `Transfer` is early-dropped.

Normally, `Transfer` values are consumed using the `wait` method, but it's also possible to, implicitly or explicitly, `drop` the value before the transfer is over. For example, dropping a `Transfer<Box<[u8]>>` value will cause the buffer to be deallocated. This can result in undefined behavior if the transfer is still in progress, as the DMA would end up writing to deallocated memory.
In such a scenario one option is to make `Transfer.drop` stop the DMA transfer. The other option is to make `Transfer.drop` wait for the transfer to finish. We'll pick the former option as it's cheaper.
```rust
/// A DMA transfer
pub struct Transfer<B> {
    // NOTE: always `Some` variant
    inner: Option<Inner<B>>,
}

// NOTE: previously named `Transfer<B>`
struct Inner<B> {
    buffer: Pin<B>,
    serial: Serial1,
}

impl<B> Transfer<B> {
    /// Blocks until the transfer is done and returns the buffer
    pub fn wait(mut self) -> (Pin<B>, Serial1) {
        while !self.is_done() {}

        atomic::compiler_fence(Ordering::Acquire);

        let inner = self
            .inner
            .take()
            .unwrap_or_else(|| unsafe { hint::unreachable_unchecked() });

        (inner.buffer, inner.serial)
    }
}

impl<B> Drop for Transfer<B> {
    fn drop(&mut self) {
        if let Some(inner) = self.inner.as_mut() {
            // NOTE: this is a volatile write
            inner.serial.dma.stop();

            // we need a read here to make the Acquire fence effective
            // we do *not* need this if `dma.stop` does a RMW operation
            unsafe {
                ptr::read_volatile(&0);
            }

            // we need a fence here for the same reason we need one in `Transfer.wait`
            atomic::compiler_fence(Ordering::Acquire);
        }
    }
}

impl Serial1 {
    /// Receives data into the given `buffer` until it's filled
    ///
    /// Returns a value that represents the in-progress DMA transfer
    pub fn read_exact<B>(mut self, mut buffer: Pin<B>) -> Transfer<B>
    where
        B: DerefMut + 'static,
        B::Target: AsMutSlice<Element = u8> + Unpin,
    {
        // .. same as before ..

        Transfer {
            inner: Some(Inner {
                buffer,
                serial: self,
            }),
        }
    }

    /// Sends out the given `buffer`
    ///
    /// Returns a value that represents the in-progress DMA transfer
    pub fn write_all<B>(mut self, buffer: Pin<B>) -> Transfer<B>
    where
        B: Deref + 'static,
        B::Target: AsSlice<Element = u8>,
    {
        // .. same as before ..

        Transfer {
            inner: Some(Inner {
                buffer,
                serial: self,
            }),
        }
    }
}
```
Now the DMA transfer will be stopped before the buffer is deallocated.
```rust
fn reuse(serial: Serial1) {
    let buf = Pin::new(Box::new([0; 16]));

    let t = serial.read_exact(buf); // compiler_fence(Ordering::Release) ▲

    // ..

    // this stops the DMA transfer and frees memory
    mem::drop(t); // compiler_fence(Ordering::Acquire) ▼

    // this likely reuses the previous memory allocation
    let mut buf = Box::new([0; 16]);

    // .. do stuff with `buf` ..
}
```
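For comparison, here's a minimal sketch of the other option mentioned earlier: a destructor that waits for the transfer to finish instead of stopping it. It would replace the `Drop` implementation shown above.

```rust
impl<B> Drop for Transfer<B> {
    fn drop(&mut self) {
        if self.inner.is_some() {
            // busy wait until the transfer is done instead of stopping it
            // NOTE: `is_done` performs a volatile read
            while !self.is_done() {}

            // same reasoning as the fence in `Transfer.wait`
            atomic::compiler_fence(Ordering::Acquire);
        }
    }
}
```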
## Summary
To sum it up, we need to consider all the following points to achieve memory safe DMA transfers:
- Use immovable buffers plus indirection: `Pin<B>`. Alternatively, you can use the `StableDeref` trait.
- The ownership of the buffer must be passed to the DMA: `B: 'static`.
- Do not rely on destructors running for memory safety. Consider what happens if `mem::forget` is used with your API.
- Do add a custom destructor that stops the DMA transfer, or waits for it to finish. Consider what happens if `mem::drop` is used with your API.
This text leaves out several details required to build a production grade DMA abstraction, like configuring the DMA channels (e.g. streams, circular vs one-shot mode, etc.), alignment of buffers, error handling, how to make the abstraction device-agnostic, etc. All those aspects are left as an exercise for the reader / community (`:P`).