UBI - Unsorted Block Images
Table of contents
- Big red note
- Overview
- Source code
- Mailing list
- User-space tools
- UBI headers
- UBI volume table
- Minimum flash input/output unit
- NAND flash sub-pages
- UBI headers position
- Flash space overhead
- Saving erase counters
- Marking eraseblocks as bad
- Scalability issues
- Volume auto-resize
- UBI operations
- More documentation
- How to send a bugreport?
Big red note
People are often confused about what UBI is, which is why this section exists. Please, realize that:
- UBI is not a Flash Translation Layer (FTL), and it has nothing to do with FTL;
- UBI works with bare flashes, and it does not work with consumer flashes like MMC, RS-MMC, eMMC, SD, mini-SD, micro-SD, CompactFlash, MemoryStick, USB flash drive, etc; instead, UBI works with raw flash devices, which are mostly found in embedded devices like mobile phones, etc.
Please, do not be confused. Read here for more information about how raw flash devices are different from FTL devices.
Overview
UBI (Latin: "where?") stands for "Unsorted Block Images". It is a volume management system for raw flash devices which manages multiple logical volumes on a single physical flash device and spreads the I/O load (i.e., wear-leveling) across the whole flash chip.
In a sense, UBI may be compared to the Logical Volume Manager (LVM). Whereas LVM maps logical sectors to physical sectors, UBI maps logical eraseblocks to physical eraseblocks. But besides the mapping, UBI implements global wear-leveling and transparent I/O error handling.
An UBI volume is a set of consecutive logical eraseblocks (LEBs). Each logical eraseblock may be mapped to any physical eraseblock (PEB). This mapping is managed by UBI and hidden from users. It is the base mechanism for global wear-leveling (along with per-physical eraseblock erase counters and the ability to transparently move data from more worn-out physical eraseblocks to less worn-out ones).
UBI volume size is specified when the volume is created and may later be changed (volumes are dynamically re-sizable). There are user-space tools which may be used to manipulate UBI volumes.
There are 2 types of UBI volumes - dynamic volumes and static volumes. Static volumes are read-only and their contents are protected by CRC-32 checksums, while dynamic volumes are read-write and the upper layers (e.g., a file-system) are responsible for ensuring data integrity.
UBI is aware of bad eraseblocks (e.g., NAND flash may have them) and frees the upper layers from any bad block handling. UBI has a pool of reserved physical eraseblocks, and when a physical eraseblock becomes bad, it transparently substitutes it with a good physical eraseblock. UBI moves good data from the newly appeared bad physical eraseblocks to good ones. The result is that users of UBI volumes do not notice I/O errors as UBI takes care of them.
NAND flashes may have bit-flips which occur on read and write operations. Bit-flips are corrected by ECC checksums, but they may accumulate over time and cause data loss. UBI handles this by moving data from physical eraseblocks which have bit-flips to other physical eraseblocks. This process is called scrubbing. Scrubbing is done transparently in the background and is hidden from upper layers.
Here is a short list of the main UBI features:
- UBI provides volumes which may be dynamically created, removed, or re-sized;
- UBI implements wear-leveling across the whole flash device (i.e., you may continuously write/erase only one logical eraseblock of an UBI volume, but UBI will spread this to all physical eraseblocks of the flash chip);
- UBI transparently handles bad physical eraseblocks;
- UBI minimizes chances to lose data by means of scrubbing.
Here is a comparison of MTD partitions and UBI volumes. They are somewhat similar because:
- both consist of eraseblocks - logical eraseblocks in case of UBI volumes, and physical eraseblocks in case of MTD partitions;
- both support three basic operations - read, write, and erase.
But UBI volumes have the following advantages over MTD partitions:
- UBI volumes have no eraseblock wear-leveling constraints, so users do not have to care about this at all, which means the upper level software may be simpler;
- UBI volumes have no bad eraseblocks, which also leads to simpler upper level software;
- UBI volumes are dynamic in a sense that they may be created, removed or re-sized dynamically, while MTD partitions are static;
- UBI handles bit-flips which again makes the upper level software simpler;
- UBI provides a volume update operation which makes it easy to detect interrupted software updates and recover;
- UBI provides an atomic logical eraseblock change operation which makes it possible to change the contents of a logical eraseblock without losing the data if an unclean reboot happens during the operation; this might be very useful for the upper-level software (e.g., for a file-system);
- UBI has an un-map operation, which just un-maps a logical eraseblock from the physical eraseblock, schedules the physical eraseblock for erasure and returns; this is very quick and frees upper level software from implementing its own mechanisms to defer erasures (e.g., JFFS2 has to implement such mechanisms).
There is an additional driver called gluebi which emulates MTD devices on top of UBI volumes. This looks a little strange, because UBI works on top of an MTD device, then gluebi emulates other MTD devices on top, but this actually works and makes it possible for existing software (e.g., JFFS2) to run on top of UBI volumes. However, new software may benefit from the advanced UBI features and let UBI solve many issues which the flash technology imposes.
Source code
UBI is in the main-line Linux kernel starting from version 2.6.22. But it is recommended to use the latest UBI, because we have fixed many bugs since that time, made many improvements and added new features. The UBI git tree may be found at:
git://git.infradead.org/ubi-2.6.git
Here is the corresponding Git-web view.
The git tree has 2 branches - the master branch and the linux-next branch. The master branch contains the most recent stuff which is often incomplete, buggy, or has not been tested very well. This branch is re-based from time to time. Please, do not use it unless you are an UBI developer. The linux-next branch contains stable UBI changes which are going to be merged upstream soon. This branch is included in the linux-next git tree. The linux-next branch is never re-based. Please, use this branch.
Mailing list
You are welcome to send feed-back, bug-reports, patches, etc to the MTD mailing list.
User-space tools
UBI user-space tools are available from the following git repository:
git://git.infradead.org/mtd-utils.git
Please, clone it and compile. The easiest way to compile UBI tools is to go straight to the ubi-utils sub-directory and type make. This section provides information about how to compile the whole mtd-utils repository tree.
The repository contains the following UBI tools:
- ubinfo - provides information about UBI devices and volumes found in the system;
- ubiattach - attaches MTD devices (which describe raw flash) to UBI and creates corresponding UBI devices;
- ubidetach - detaches MTD devices from UBI devices (the opposite to what ubiattach does);
- ubimkvol - creates UBI volumes on UBI devices;
- ubirmvol - removes UBI volumes from UBI devices;
- ubiupdatevol - updates UBI volumes; this tool uses the UBI volume update feature which leaves the volume in "corrupted" state if the update was interrupted; additionally, this tool may be used to wipe out UBI volumes;
- ubicrc32 - calculates CRC-32 checksum of a file with the same initial seed as UBI would use;
- ubinize - generates UBI images;
- ubiformat - formats empty flash, erases flash and preserves erase counters, flashes UBI images to MTD devices;
- mtdinfo - reports information about MTD devices found in the system.
All UBI tools support "-h" option and print sufficient usage information.
Note, the ubiattach and ubidetach tools won't work if the kernel version is less than 2.6.25, because corresponding UBI features did not exist in the older kernels.
UBI headers
UBI stores 2 small 64-byte headers at the beginning of each non-bad physical eraseblock:
- erase counter header (or EC header) which contains the erase counter of the physical eraseblock (PEB) plus some other not so important information;
- volume identifier header (or VID header) which stores volume ID and logical eraseblock (LEB) number this PEB belongs to (plus some other not so important information).
This is why logical eraseblocks are smaller than physical eraseblocks - the headers take some flash space.
All UBI headers are protected by the CRC-32 checksum. Please, refer to the drivers/mtd/ubi/ubi-media.h file in the linux kernel for more information about the header's contents.
When UBI attaches an MTD device, it has to scan it, read all headers, check the CRC-32 checksums, and store erase counters and the logical-to-physical eraseblock mapping information in RAM. Please, refer to the "Scalability issues" section for information about scalability issues related to this.
After UBI has erased a PEB, it writes the EC header with an increased erase counter value. This means that PEBs always have the EC header, except for the short period of time after the erasure and before the EC header is written. Should an unclean reboot happen during this short period of time, the EC header is lost or becomes corrupted. In this case UBI writes a new EC header with an average erase counter just after the MTD device scanning is done.
The VID header is written to the PEB when UBI associates it with an LEB. Let's consider what happens to the headers in case of some UBI operations.
- The LEB un-map operation just un-maps the LEB from the PEB and schedules the PEB for erasure. When the PEB is erased, the EC header is written straight away. The VID header is not written.
- The LEB map operation or a write operation to an un-mapped LEB makes UBI find an appropriate PEB and write the VID header to it (the EC header must already be there). Note, the write operation to an already mapped LEB just writes the data straight to PEB and does not change the UBI headers.
UBI maintains two per-PEB headers because it needs to write different information on flash at different moments of time:
- after a PEB is erased, the EC header is written straight away, which minimizes the probability of losing the erase counter due to unclean reboots;
- when UBI associates a PEB with an LEB, the VID header is written to the PEB.
When the EC header is written to a PEB, UBI does not yet know the volume ID and LEB number this PEB will be associated with. This is why UBI needs to do two separate write operations and to have two separate headers.
UBI volume table
Volume table is an on-flash data structure which contains information about each volume on this UBI device. The volume table is an array of volume table records. Each record contains the following information:
- volume size;
- volume name;
- volume type (dynamic or static);
- volume alignment;
- update marker (set for volumes which had interrupted updates);
- auto-resize flag;
- CRC-32 checksum for this record.
Each record describes one UBI volume, and the record index in the volume table array corresponds to the volume ID. I.e., UBI volume 0 is described by record 0 in the volume table, and so on. The number of records in the volume table is limited by the LEB size, but cannot be greater than 128. This means that UBI devices cannot have more than 128 volumes.
Every time an UBI volume is created, removed, re-sized, re-named or updated, the corresponding volume table record is changed. UBI maintains two copies of the volume table for reliability and power-cut tolerance reasons.
Implementation details
Internally, the volume table resides in a special-purpose UBI volume which is called layout volume. This volume consists of 2 LEBs - one for each copy of the volume table. The layout volume is an "internal" UBI volume, and the users do not see it and cannot access it. When reading or writing the layout volume, UBI uses the same mechanisms which are used for normal user volumes.
UBI uses the following algorithm when updating a volume table record.
- Prepare in-memory buffer with the new volume table contents.
- Un-map LEB0 of the layout volume.
- Write the new volume table to LEB0.
- Un-map LEB1 of the layout volume.
- Write the new volume table to LEB1.
- Flush the UBI work queue to make sure the PEBs corresponding to the un-mapped LEBs are erased.
When attaching the MTD device, UBI makes sure that the 2 volume table copies are equivalent. If they are not equivalent, which may be caused by an unclean reboot, UBI picks the one from LEB0 and copies it to LEB1 of the layout volume (because LEB0 is written first and is therefore newer). If one of the volume table copies is corrupted, UBI restores it from the other volume table copy.
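The attach-time reconciliation of the two volume table copies can be sketched roughly as follows. This is only an illustration and not the kernel code: the struct vtbl_copy type, the VTBL_BYTES size and the vtbl_reconcile() function are hypothetical names, and the "corrupted" flag stands in for a failed CRC-32 check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define VTBL_BYTES 64 /* hypothetical size of one volume table copy */

/* One copy of the volume table as read from the layout volume. */
struct vtbl_copy {
	uint8_t data[VTBL_BYTES];
	bool corrupted; /* stands in for a failed CRC-32 check */
};

/*
 * Make both copies equivalent again after attaching.
 * Returns 0 on success and -1 if both copies are corrupted.
 */
int vtbl_reconcile(struct vtbl_copy *leb0, struct vtbl_copy *leb1)
{
	assert(leb0 != NULL && leb1 != NULL);

	if (leb0->corrupted && leb1->corrupted)
		return -1;

	if (leb0->corrupted) {
		/* Restore the corrupted copy from the good one. */
		*leb0 = *leb1;
		return 0;
	}

	/*
	 * LEB0 is written before LEB1, so if the copies differ (or LEB1
	 * is corrupted), LEB0 holds the newer contents.
	 */
	if (leb1->corrupted ||
	    memcmp(leb0->data, leb1->data, VTBL_BYTES) != 0)
		*leb1 = *leb0;

	return 0;
}
```

The ordering of the update steps is what makes this safe: at any moment at least one of the two LEBs holds a complete volume table.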
Minimum flash input/output unit
UBI uses an abstract model of flash. In short, from UBI's point of view the flash (or MTD device) consists of eraseblocks, which may be good or bad. Each good eraseblock may be read from, written to, or erased. Good eraseblocks may also be marked as bad.
Flash reads and writes may only be done in portions of minimum input/output unit size, which depends on flash type.
- NOR flashes usually have min. I/O unit size of 1 byte, because NOR flashes usually allow reading and writing single bytes (in fact, it is even possible to change individual bits).
- Some NOR flashes may have other min. I/O unit sizes, e.g. 16 or 32 bytes in case of ECC'd NOR flashes.
- NAND flashes usually have 512, 2048 or 4096 byte min. I/O unit size, which corresponds to NAND page size. NAND flashes store per-NAND page ECC codes in the OOB area, which means that the whole NAND page has to be written at once to calculate the ECC code, and the whole NAND page has to be read at once to check the ECC code.
The min. I/O unit size is a very important characteristic of the MTD device. It affects many things, e.g.:
- physical position of the VID header depends on the min. I/O unit size, which means that LEB size also depends on it; generally, the larger the min. I/O unit size, the smaller the LEB size and the greater the UBI flash space overhead;
- all writes to LEBs should be aligned to the min. I/O unit size, and should be a multiple of the min. I/O unit size; this does not apply to reads, but bear in mind that on the MTD level all reads are done in units of the min. I/O unit size anyway; this is just hidden from users by buffering the read data and copying only the requested amount of bytes to the user buffer.
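The write alignment rule above can be expressed as a small check. This is a minimal sketch; write_is_aligned() and round_up_to_min_io() are hypothetical helper names, not part of the UBI API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Check that a write to an LEB starts on a min. I/O unit boundary and
 * covers a whole number of min. I/O units.
 */
bool write_is_aligned(size_t offset, size_t len, size_t min_io_size)
{
	assert(min_io_size > 0);
	return offset % min_io_size == 0 && len % min_io_size == 0;
}

/* Round a length up to the next multiple of the min. I/O unit size. */
size_t round_up_to_min_io(size_t len, size_t min_io_size)
{
	assert(min_io_size > 0);
	return (len + min_io_size - 1) / min_io_size * min_io_size;
}
```

For example, on a NAND flash with 2048-byte pages a 3000-byte payload would have to be padded to 4096 bytes before it is written, while on a NOR flash with a 1-byte min. I/O unit any offset and length pass the check.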
NAND flash sub-pages
As described in the previous section, all UBI I/O should be done in units of the min. I/O unit size, which is equivalent to NAND page size in case of NAND flash. However, some SLC NAND flashes allow for smaller I/O units, which are called sub-pages in MTD terminology. Not all NANDs have sub-pages.
- MLC NANDs do not have sub-pages, at least to the date of writing of this piece of documentation (April 2009).
- SLC NANDs usually do have sub-pages. E.g., 512-byte NAND pages usually consist of 2x256-byte sub-pages, and 2048-byte NAND pages consist of 4x512-byte sub-pages.
- SLC OneNAND chips with 2048 bytes NAND page size have 4x512-byte sub-pages.
If the NAND flash supports sub-pages, ECC codes can be calculated on a per-sub-page basis instead of a per-NAND page basis. In this case it becomes possible to read and write sub-pages independently.
But even though the NAND chip may support sub-pages, the NAND controller may disallow them. Indeed, if the flash is managed by a controller which calculates ECC codes on a per-NAND page basis, then it is impossible to do I/O in sub-page units. E.g., this is the case for the OLPC XO-1 laptop - its NAND chip supports sub-pages, but the NAND controller does not.
Note, sub-page is an MTD term, but this is also referred to as "NOP" which stands for "number of partial programs". NOP1 NAND flashes have no sub-pages - UBI treats them as NANDs with sub-page size equivalent to NAND page size. NOP2 NAND flashes have 2 sub-pages (half a NAND page each), NOP4 flashes have 4 sub-pages (quarter of a NAND page each).
UBI utilizes sub-pages to lessen flash space overhead. The overhead is less if NAND flash supports sub-pages (see the "Flash space overhead" section). Indeed, let's consider a NAND flash with 128KiB eraseblocks and 2048-byte pages. If it does not have sub-pages, UBI puts the VID header at physical offset 2048, so LEB size becomes 124KiB (128KiB minus one NAND page which stores the EC header and minus another NAND page which stores the VID header). In contrast, if the NAND flash does have sub-pages, UBI puts the VID header at physical offset 512 (the second sub-page), so LEB size becomes 126KiB (128KiB minus one NAND page which is used for storing both UBI headers). See the "UBI headers position" section for more information about where the UBI headers are stored.
Sub-pages are used by UBI only internally, and only for storing the headers. The UBI API does not allow users to do I/O in sub-page units. One of the reasons for this is that sub-page writes may be slow. To write a sub-page, the driver may actually write the whole NAND page, but put 0xFF bytes to the sub-pages which are not relevant to this operation. E.g., this means that writing 4 sub-pages may be 4 times slower than writing the whole NAND page at once. Thus, UBI does use sub-pages for the headers, but this notion does not exist in the UBI API.
UBI headers position
The EC header always resides at offset 0 and takes 64 bytes, the VID header resides at the next available min. I/O unit or sub-page, and also takes 64 bytes. For example:
- in case of NOR flash which has 1 byte min. I/O unit, the VID header resides at offset 64;
- in case of NAND flash which does not have sub-pages, the VID header resides at the second NAND page;
- in case of NAND flash which has sub-pages, the VID header resides at the second sub-page.
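The placement rules above boil down to rounding up to the next I/O unit. Here is a minimal sketch of that arithmetic; the function names are hypothetical, and hdrs_io_size is assumed to be the sub-page size when sub-pages are usable and the min. I/O unit size otherwise.

```c
#include <assert.h>
#include <stddef.h>

#define EC_HDR_SIZE  64 /* size of the EC header in bytes */
#define VID_HDR_SIZE 64 /* size of the VID header in bytes */

/* Round x up to the next multiple of a. */
static size_t align_up(size_t x, size_t a)
{
	return (x + a - 1) / a * a;
}

/*
 * Offset of the VID header inside a PEB: the first header I/O unit
 * boundary after the EC header, which itself sits at offset 0.
 */
size_t vid_hdr_offset(size_t hdrs_io_size)
{
	assert(hdrs_io_size > 0);
	return align_up(EC_HDR_SIZE, hdrs_io_size);
}

/*
 * Offset where LEB data start: the first min. I/O unit boundary after
 * the VID header. This is also the per-PEB header overhead in bytes.
 */
size_t leb_data_offset(size_t min_io_size, size_t hdrs_io_size)
{
	assert(min_io_size > 0);
	return align_up(vid_hdr_offset(hdrs_io_size) + VID_HDR_SIZE,
			min_io_size);
}
```

With these assumptions, a NOR flash with a 1-byte min. I/O unit gets the VID header at offset 64 and data at offset 128. A NAND flash with 2048-byte pages and no sub-pages gets them at offsets 2048 and 4096. The same NAND with 512-byte sub-pages gets them at offsets 512 and 2048.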
Flash space overhead
UBI uses some amount of flash space for its own purposes, thus reducing the amount of flash space available for UBI users. Namely:
- 2 PEBs are used to store the volume table;
- 1 PEB is reserved for wear-leveling purposes;
- 1 PEB is reserved for the atomic LEB change operation;
- some amount of PEBs is reserved for bad PEB handling; this is applicable for NAND flash, but not for NOR flash; the percentage of reserved PEBs is configurable and is 1% by default;
- UBI stores the EC and VID headers at the beginning of each PEB; the amount of bytes used for these purposes depends on the flash type and is explained below.
Let's introduce the following symbols:
- P - total number of physical eraseblocks on the MTD device;
- SP - physical eraseblock size;
- SL - logical eraseblock size;
- B - number of PEBs reserved for bad PEB handling; it is 1% of P for NAND by default, and 0 for NOR and other flash types which do not have bad PEBs;
- O - the overhead related to storing EC and VID headers in bytes, i.e. O = SP - SL.
The UBI overhead is (B + 4) * SP + O * (P - B - 4), i.e., this number of bytes will not be accessible to users. O is different for different flashes:
- in case of NOR flash which has 1 byte minimum input/output unit, O is 128 bytes;
- in case of NAND flash which does not have sub-pages (e.g., MLC NAND), O is 2 NAND pages, i.e. 4KiB in case of a 2KiB NAND page and 1KiB in case of a 512-byte NAND page;
- in case of NAND flash which has sub-pages, UBI optimizes its on-flash layout and puts the EC and VID headers in the same NAND page, but in different sub-pages; in this case O is only one NAND page;
- for other flashes the overhead should be 2 min. I/O units if the min. I/O unit size is greater than or equal to 64 bytes, and 2 times 64 bytes aligned to the min. I/O unit size if the min. I/O unit size is less than 64 bytes.
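The overhead formula can be evaluated with a few lines of C. This is only a worked illustration of the formula above; the function name is hypothetical and the parameters follow the symbols P, SP, SL and B introduced in this section.

```c
#include <assert.h>
#include <stdint.h>

/*
 * UBI flash space overhead in bytes:
 *   (B + 4) * SP + O * (P - B - 4), where O = SP - SL.
 */
uint64_t ubi_overhead_bytes(uint64_t P, uint64_t SP, uint64_t SL, uint64_t B)
{
	uint64_t O;

	assert(SP >= SL);
	assert(P >= B + 4);

	O = SP - SL; /* per-PEB space taken by the EC and VID headers */
	return (B + 4) * SP + O * (P - B - 4);
}
```

As an example, take a 256MiB NAND flash with 128KiB eraseblocks (P = 2048, SP = 131072) and 2KiB pages, and assume that 1% of P is rounded down to B = 20. Without sub-pages SL is 124KiB and the overhead is 11436032 bytes. With sub-pages SL is 126KiB and the overhead drops to 7290880 bytes.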
Saving erase counters
When working with UBI, it is important to realize that UBI stores erase counters on the flash media. Namely, each physical eraseblock has a so-called erase counter header which stores the number of times this physical eraseblock has been erased (see the "UBI headers" section). And of course, it is important not to lose the erase counters, which means that the tools you use to erase the flash and to write UBI images have to be UBI-aware. The mtd-utils repository contains the ubiformat utility which does things right.
UBI flasher details
The following is a list of what the UBI flasher program has to do when erasing the flash or when flashing UBI images.
- First of all, scan the flash and collect the erase counters. Namely, read the EC header from each PEB, check the CRC-32 checksum of the header, and save the erase counter in RAM. It is not necessary to read VID headers. Bad PEBs should be skipped.
- Calculate the average erase counter. It should be used for PEBs with corrupted or missing EC headers. Such PEBs may be there because of unclean reboots, but there shouldn't be too many of them.
- If the intention is to just erase the flash, then each PEB has to be erased and proper EC header has to be written at the beginning of the PEB. The EC header should contain incremented erase counter. Bad PEBs should be just skipped. For NAND flashes, in case of I/O errors while erasing or writing, the PEB should be marked as bad (see here for more information how UBI marks PEBs as bad).
- If the intention is to flash an UBI image, then the flasher should do the following for each non-bad PEB.
  - Read the contents of this PEB from the UBI image (PEB size bytes) into a buffer.
  - Strip min. I/O units full of 0xFF bytes from the end of the buffer (the details are given below in this section).
  - Erase the PEB.
  - Change the EC header in the buffer - put the new erase counter value there and re-calculate the CRC-32 checksum.
  - Write the buffer to the physical eraseblock.
In practice the input UBI image is usually shorter than the flash, so the flasher has to flash the used PEBs properly, and erase the unused PEBs properly.
Note, when writing an UBI image, it does not matter where eraseblocks from the input UBI image will be written. For example, the first input eraseblock may be written to the first PEB, or to the second one, or to the last one.
Also note, if you implement a flasher which writes UBI images at the production line, i.e., only once, then the flasher does not have to change EC headers of the input UBI image, because this is new flash and each PEB has zero erase counter anyway. This means the production line flasher may be simpler.
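The erase counter bookkeeping from the list above can be sketched as follows. This is a simplified illustration and not the ubiformat code: the EC_MISSING and EC_BAD_PEB markers and both function names are hypothetical, and writing the plain average (without incrementing it) to PEBs with lost EC headers is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EC_MISSING (-1) /* hypothetical: EC header corrupted or absent */
#define EC_BAD_PEB (-2) /* hypothetical: PEB is bad and must be skipped */

/*
 * Average erase counter over all PEBs with a valid EC header.
 * Entries of ec[] are scanned counters or one of the markers above.
 */
int64_t mean_erase_counter(const int64_t *ec, size_t n)
{
	int64_t sum = 0;
	size_t valid = 0;
	size_t i;

	assert(ec != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (ec[i] >= 0) {
			sum += ec[i];
			valid++;
		}
	}
	return valid ? sum / (int64_t)valid : 0;
}

/*
 * Erase counter to put into the new EC header of a non-bad PEB:
 * the incremented old value, or the average if the old one was lost.
 */
int64_t next_erase_counter(int64_t scanned, int64_t mean)
{
	assert(scanned != EC_BAD_PEB);
	return scanned >= 0 ? scanned + 1 : mean;
}
```

A flasher would first fill the ec[] array while scanning, then call mean_erase_counter() once, and then call next_erase_counter() for every non-bad PEB just before writing its EC header.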
If your UBI image contains UBIFS file system, and your flash is NAND, you may have to drop 0xFF bytes from the end of input PEB data. This is very important, although not required for all NAND flashes. Sometimes a failure to do this may result in very unpleasant problems which might be difficult to debug later. So we recommend to always do this.
The reason for this is that UBIFS treats NAND pages which contain only 0xFF bytes (let's refer to them as empty NAND pages) as free. For example, suppose the first NAND page of a PEB has some data, the second one is empty, the third one also has some data, the fourth one and the rest of NAND pages are empty as well. In this case UBIFS will treat all NAND pages starting from the fourth one as free, and will write data there. However, if the flasher program has already written 0xFF's to these pages, they will be written to twice! Many NAND flashes require NAND pages to be written only once, even if the data contains only 0xFF bytes.
To put it differently, writing 0xFF bytes may have side-effects. What the flasher has to do is to drop all empty NAND pages from the end of the PEB buffer before writing it. It is not necessary to drop all empty NAND pages, just the last ones. This means that the flasher does not have to scan the whole buffer for 0xFF's. It is enough to scan the buffer from the end and stop on the first non-0xFF byte. This is much faster. Here is the code from UBI which does the right thing.
/**
 * ubi_calc_data_len - calculate how much real data are stored in a buffer.
 * @ubi: UBI device description object
 * @buf: a buffer with the contents of the physical eraseblock
 * @length: the buffer length
 *
 * This function calculates how much "real data" is stored in @buf and returns
 * the length. Continuous 0xFF bytes at the end of the buffer are not
 * considered as "real data".
 */
int ubi_calc_data_len(const struct ubi_device *ubi, const void *buf,
                      int length)
{
        int i;

        /* Scan backwards and stop at the first non-0xFF byte */
        for (i = length - 1; i >= 0; i--)
                if (((const uint8_t *)buf)[i] != 0xFF)
                        break;

        /* The resulting length must be aligned to the minimum flash I/O size */
        length = ALIGN(i + 1, ubi->min_io_size);
        return length;
}
This function is called before writing the buf buffer to the PEB. The purpose of this function is to drop 0xFF's from the end and prevent the situation described above. The ubi->min_io_size is the minimal input/output unit size which is equivalent to NAND page size.
By the way, we experienced similar problems with JFFS2. The JFFS2 images generated by the mkfs.jffs2 program were padded to the physical eraseblock size and were later flashed to our NAND. The flasher did not bother skipping empty NAND pages. When JFFS2 was mounted, it wrote to those NAND pages, and the writes did not fail. But later we observed weird ECC errors. It took a while to find out the problem. In other words, this is also relevant to JFFS2 images.
Marking eraseblocks as bad
This section is relevant for NAND flashes and other flashes which may have bad eraseblocks. UBI marks physical eraseblocks as bad on 2 occasions:
- eraseblock write operation failed, in which case UBI moves the data from this PEB to some other PEB (data recovery) and schedules this PEB for torturing;
- erase operation failed with EIO error, in which case the eraseblock is marked as bad straight away.
The torturing is done in background with the purpose of detecting whether the physical eraseblock is really bad. The write failure might have happened because of many reasons, including bugs in the driver or in the upper level stuff like the file system (e.g., the FS mistakenly writes many times to the same NAND page). During the torturing UBI does the following:
- erase the eraseblock;
- read it back and make sure it contains only 0xFF bytes;
- write test pattern bytes;
- read the eraseblock back and check the pattern;
- and so on for several patterns (0xA5, 0x5A, 0x00).
The eraseblock is not marked as bad if it survives the torture test. Note, a bit-flip during the torture test is treated as a good reason to mark the eraseblock bad as well. Please, refer to the torture_peb() function for detailed information.
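The torture sequence can be illustrated on a simulated eraseblock. This is a toy model and not the kernel torture_peb() function: the struct sim_peb type, its stuck_byte fault model and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SIM_PEB_SIZE 1024 /* hypothetical eraseblock size for the model */

/* A simulated eraseblock with an optional defect. */
struct sim_peb {
	uint8_t data[SIM_PEB_SIZE];
	int stuck_byte; /* index of a byte stuck at 0x00, or -1 if healthy */
};

/* Model both erase (value 0xFF) and a whole-block pattern write. */
static void sim_fill(struct sim_peb *peb, uint8_t value)
{
	memset(peb->data, value, SIM_PEB_SIZE);
	if (peb->stuck_byte >= 0)
		peb->data[peb->stuck_byte] = 0x00;
}

/* Read the block back and check that every byte equals value. */
static bool sim_contains_only(const struct sim_peb *peb, uint8_t value)
{
	int i;

	for (i = 0; i < SIM_PEB_SIZE; i++)
		if (peb->data[i] != value)
			return false;
	return true;
}

/* Returns true if the eraseblock survives the torture test. */
bool sim_torture_passes(struct sim_peb *peb)
{
	static const uint8_t patterns[] = { 0xA5, 0x5A, 0x00 };
	size_t i;

	assert(peb != NULL && peb->stuck_byte < SIM_PEB_SIZE);

	for (i = 0; i < sizeof(patterns); i++) {
		sim_fill(peb, 0xFF); /* erase */
		if (!sim_contains_only(peb, 0xFF))
			return false;
		sim_fill(peb, patterns[i]); /* write the test pattern */
		if (!sim_contains_only(peb, patterns[i]))
			return false;
	}
	return true;
}
```

In this model a healthy block passes all three patterns, while a block with a stuck byte already fails the first read-back after erasure and would be marked as bad.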
Scalability issues
Unfortunately, UBI scales linearly in terms of flash size. UBI initialization time depends linearly on the number of physical eraseblocks on the flash. This means that the larger the flash is, the more time it takes for UBI to initialize (i.e., to attach the MTD device). The initialization time depends on the flash I/O speed and (slightly) on the CPU speed, because:
- UBI scans the MTD device when attaching - it reads the EC and VID headers from every single PEB; the headers are small (64 bytes each), so this means reading 128 bytes from each PEB in case of NOR flash or one or two NAND pages in case of NAND flash (this depends on whether the NAND flash supports sub-pages or not); this is anyway much less than JFFS2 needs to read when it mounts MTD devices, so UBI attaches MTD devices many times faster than JFFS2 would mount a file system on the same MTD device;
- UBI calculates the CRC-32 checksum of each EC and VID header, which consumes CPU, although this is usually minor compared to the flash I/O overhead.
Here are some figures:
- a 256MiB OneNAND flash found in Nokia N800 devices is attached in less than 1 second; the flash does support sub-pages, so UBI only has to read the first 2KiB NAND page of each PEB while scanning;
- a 1GiB NAND flash found in OLPC XO-1 devices is attached in about 2 seconds; the flash is an SLC NAND and supports sub-pages, but the Cafe controller which is used in the laptop does not allow sub-page writes, so UBI has to read two 2KiB NAND pages from each PEB.
Unfortunately we do not have more data and the reader is welcome to send it to us via the MTD mailing list.
Implementation details
In general, UBI needs three tables to operate:
- volume table which contains per-volume information, like volume size, type, etc;
- eraseblock association (EBA) table which contains the logical-to-physical eraseblock mapping information; for example, when reading an LEB, UBI first looks up the table to find the corresponding PEB number, then reads from this PEB;
- erase counters (EC) table which contains the erase counter value for each physical eraseblock; UBI wear-leveling sub-system uses this table when it needs to find, for example, a highly worn-out PEB.
The volume table is maintained on flash. It changes only when UBI volumes are created, deleted and re-sized, which are rare and not time-critical operations, and UBI can afford a slow and simple method of the volume table management.
The EBA and EC tables are changed every time an LEB is mapped to a PEB or a PEB is erased, which happens quite often and means that the table management methods should be fast and efficient.
UBI could maintain the EBA and EC tables on the flash media, but this would inevitably involve journaling, journal replay, journal commit, etc. In other words, this would introduce a lot of complexity. But UBI would be logarithmically scalable in this case.
One of the UBI requirements was simplicity of the on-flash format, because UBI authors had to read UBI volumes from the boot-loader and they had very tough constraints on the boot-loader code size. It was basically impossible to add complex journal scanning and replay code to the boot-loader.
So UBI does not maintain the EBA and EC tables on the flash media. Instead, it builds them in RAM each time it attaches the MTD device. This means that UBI has to scan the whole flash and read the EC and VID headers from each PEB in order to build the in-RAM EC and EBA tables.
The drawbacks of this design are poor scalability and relatively high overhead on NAND flashes (e.g., the overhead is 1.5%-3% of flash space in case of a NAND flash with 2KiB NAND page and 128KiB eraseblock). The advantages are a simple binary format and robustness, as a result of simplicity.
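Building the in-RAM tables from the scan results can be sketched as follows. This is a deliberately simplified model and not the kernel implementation: struct scanned_peb, LEB_UNMAPPED and build_tables() are hypothetical names, only one volume is handled per call, and the case where two PEBs claim the same LEB is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LEB_UNMAPPED (-1) /* hypothetical marker for an un-mapped LEB */

/* What the scan learned about one PEB from its EC and VID headers. */
struct scanned_peb {
	uint64_t ec;  /* erase counter from the EC header */
	int has_vid;  /* non-zero if a valid VID header was found */
	int vol_id;   /* volume ID from the VID header */
	int lnum;     /* LEB number from the VID header */
};

/*
 * Fill the EC table for all PEBs and the EBA table of volume vol_id.
 * eba[] must have leb_count entries, ec_tbl[] must have peb_count.
 */
void build_tables(const struct scanned_peb *pebs, int peb_count,
		  int vol_id, int *eba, int leb_count, uint64_t *ec_tbl)
{
	int l, p;

	assert(pebs != NULL && eba != NULL && ec_tbl != NULL);

	for (l = 0; l < leb_count; l++)
		eba[l] = LEB_UNMAPPED;

	for (p = 0; p < peb_count; p++) {
		ec_tbl[p] = pebs[p].ec;
		if (pebs[p].has_vid && pebs[p].vol_id == vol_id &&
		    pebs[p].lnum >= 0 && pebs[p].lnum < leb_count)
			eba[pebs[p].lnum] = p; /* LEB lnum lives in PEB p */
	}
}
```

The loop touches every PEB exactly once, which is where the linear attach time comes from.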
Nonetheless, it is always possible to create UBI2 which would maintain the tables in separate flash areas. UBI2 would not be compatible with UBI because of completely different on-flash formats, but the user interfaces would stay the same, which would guarantee compatibility of all the software built on top of UBI.
Volume auto-resize
It is well-known that NAND chips have some amount of physical eraseblocks marked as bad by the manufacturer. The bad PEBs are distributed randomly and their number is different, although manufacturers usually guarantee that the first few physical eraseblocks are not bad and the total amount of bad PEBs does not exceed certain number. For example, a new 256MiB Samsung OneNAND chip is guaranteed to have not more than 40 128KiB PEBs (but of course, more physical eraseblock will become bad over time). This is about 2% of flash size.
When it is needed to create an UBI image which will be flashed to the end user devices in production line, you should define exact sizes of all volumes (the sizes are stored in the UBI volume table). But it is difficult to do because the total flash chip size may vary depending on the amount of the initially bad PEBs.
One obvious way to solve the problem is to assume the worst case, when all chips would have maximum amount of bad PEBs. But in practice, most of the chips will have only few bad PEBs which is far less than the maximum. In general, it is fine - this will increase reliability, because UBI anyway uses all PEBs of the device. On the other hand UBI anyway reserves some amount of physical eraseblocks for bad PEB handling which is 1% of PEBs by default. So in case of the above mentioned OneNAND chip the result would be that 1% of PEBs would be reserved by UBI, and 0-2% of PEBs would not be used (they would be seen as available LEBs to the UBI users).
But there is an alternative approach - one of the volumes may have the auto-resize mark, which means that its size has to be enlarged when UBI is run for the first time. After the volume size is adjusted, UBI removes the auto-resize mark and the volume is not re-sized anymore. The auto-resize flag is stored in the volume table and only one volume may be marked as auto-resize. For example, if there is a volume which is intended to have the root file-system, it may be reasonable to mark it with the auto-resize flag.
In the example with the OneNAND chip, if one of the UBI volumes is marked as auto-resize, it will be enlarged by 0-2% on the first UBI boot, but 1% of PEBs will still be reserved for bad PEB handling.
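The numbers in the OneNAND example can be checked with a few lines of arithmetic. The sketch below is only an illustration: the peb_budget structure and the helper names are hypothetical, the reserve is computed from the total PEB count, and any other UBI overhead is ignored.

```c
#include <assert.h>

/* Hypothetical helper type: the PEB budget of one flash chip. */
struct peb_budget {
    unsigned total;        /* PEBs on the chip */
    unsigned max_bad;      /* guaranteed maximum of initially bad PEBs */
    unsigned actual_bad;   /* bad PEBs found on this particular chip */
    unsigned reserved_pct; /* percentage reserved for bad PEB handling */
};

/* PEBs reserved for bad PEB handling (rounded down). */
static unsigned reserved_pebs(const struct peb_budget *b)
{
    return b->total * b->reserved_pct / 100;
}

/* PEBs left for volumes when the image is sized for the worst case. */
static unsigned worst_case_pebs(const struct peb_budget *b)
{
    return b->total - b->max_bad - reserved_pebs(b);
}

/* Extra PEBs an auto-resize volume could absorb on this chip. */
static unsigned auto_resize_gain(const struct peb_budget *b)
{
    assert(b->actual_bad <= b->max_bad);
    return b->max_bad - b->actual_bad;
}
```

For a 2048-PEB chip (256MiB with 128KiB PEBs), a limit of 40 bad PEBs and a 1% reserve, this gives 20 reserved PEBs and 1988 PEBs for volumes in the worst case; a chip with only 5 bad PEBs would leave 35 more PEBs for the auto-resize volume.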
Note, the auto-resize feature exists in the Linux kernel starting from version 2.6.25.
UBI operations
LEB un-map
The LEB un-map operation is implemented by the ubi_leb_unmap() UBI kernel API function. Starting from kernel version 2.6.29, the un-map operation is also available to user-space programs via the UBI_IOCEBUNMAP ioctl command. The ioctl should be called for UBI volume character devices.
The LEB un-map operation:
- first un-maps the LEB from the corresponding PEB;
- then schedules the PEB for erasure and returns; it does not wait for the erasure of the PEB to be finished; the PEB is instead erased in context of the UBI background thread.
UBI returns all 0xFF bytes when an un-mapped LEB is read, so the un-map operation may be considered as a very fast erase operation. But there is one aspect UBI programmers have to be well aware of.
Suppose you un-map LEB L which is mapped to PEB P. Since P is not synchronously erased, but just scheduled for erasure, there might be "surprises" in case of unclean reboots: if the reboot happens before P has been physically erased, L will be mapped to P again when UBI attaches the MTD device after the unclean reboot. Indeed, UBI will scan the MTD device, find P which refers to L, and add this mapping information to the EBA table.
But once you write any data to L, or map it using the LEB map operation, it gets mapped to a new PEB and the old contents are gone forever, because even in case of an unclean reboot UBI would pick the newer mapping for L.
Implementation details
This section describes how UBI distinguishes between older and newer versions of an LEB in case of an unclean reboot. Suppose we un-map LEB L which is mapped to PEB P1, which means UBI schedules P1 for erasure. Then we write some data to L, which means that UBI finds another PEB P2, maps L to P2, and writes the data to P2. If an unclean reboot happens before P1 is physically erased, but after the write operation, we end up with 2 PEBs (P1 and P2) mapped to the same LEB L.
To handle situations like this, UBI maintains a global 64-bit sequence number variable. The sequence number variable is increased each time a PEB is mapped to an LEB and its value is stored in the VID header of the PEB. So each VID header has a unique sequence number, and the larger the sequence number, the "younger" the VID header. When UBI attaches MTD devices, it initializes the global sequence number variable to the highest value found in existing VID headers plus one.
In the above situation, UBI just selects a PEB with higher sequence number (P2) and drops the PEB with lower sequence number (P1).
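The selection rule can be sketched in a few lines of C. This is a simplified model and not the UBI scanning code: the scanned_peb structure and both function names are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for what is learned from one VID header. */
struct scanned_peb {
    int pnum;       /* physical eraseblock number */
    int lnum;       /* logical eraseblock the VID header refers to */
    uint64_t sqnum; /* sequence number stored in the VID header */
};

/* Of two PEBs referring to the same LEB, return the one mapped later. */
static const struct scanned_peb *pick_newer(const struct scanned_peb *a,
                                            const struct scanned_peb *b)
{
    assert(a->lnum == b->lnum);
    return a->sqnum > b->sqnum ? a : b;
}

/* Global sequence number after attaching: highest value found plus one. */
static uint64_t next_sqnum(const struct scanned_peb *pebs, size_t n)
{
    uint64_t max = 0;

    for (size_t i = 0; i < n; i++)
        if (pebs[i].sqnum > max)
            max = pebs[i].sqnum;
    return max + 1;
}
```

With P1 carrying sequence number 100 and P2 carrying 101 for the same LEB, pick_newer() returns P2 regardless of the argument order, and the next mapping would use sequence number 102.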
Note, the situation is more difficult if an unclean reboot happens when UBI moves the contents of one PEB to another for wear-leveling purposes, or when it happens during the atomic LEB change operation. In this case it is not enough to just pick the newer PEB, it is also necessary to make sure the data reached the new PEB.
LEB map
The LEB map operation maps a previously un-mapped logical eraseblock to a physical eraseblock. For example, if the operation is run for LEB A, UBI will find an appropriate PEB, write a VID header to the PEB, and amend the in-memory EBA table. The VID header will refer to LEB A. After this operation all I/O to LEB A will actually go to the mapped PEB.
The LEB map operation is available via the ubi_leb_map() UBI kernel API function, or via the UBI_IOCEBMAP volume character device ioctl command. However, the ioctl interface is available only starting from kernel version 2.6.29.
The LEB map operation accepts the dtype parameter, which tells UBI which type of data the LEB will contain. Namely, dtype may be one of:
- UBI_SHORTTERM - the LEB will store short-term data, which means that it will be erased soon; UBI will map this LEB to a PEB with low erase counter, so it will grow relative to other PEB erase counters;
- UBI_LONGTERM - the LEB will store long-term data and will not be erased soon; UBI will map this LEB to a PEB with high erase counter, so it will go down relative to other PEB erase counters;
- UBI_UNKNOWN - should be used most of the time, when you are not sure whether the data are long-term or short-term.
Bear in mind that dtype is only a hint. Please, use UBI_UNKNOWN if unsure. And note, UBI authors never really tested the effects of using UBI_SHORTTERM and UBI_LONGTERM, so there is no guarantee they improve anything.
One of the possible use-cases of the LEB map operation is making sure the old LEB contents are gone forever. As explained in the LEB un-map section, when an LEB is un-mapped, the corresponding PEB is not erased straight away. And if an unclean reboot happens, the LEB may become mapped to the same PEB again after UBI attaches the MTD device. So, if you map the LEB just after un-mapping it, you are guaranteed that the old LEB contents never come back. In other words, the LEB is guaranteed to contain only 0xFF bytes after the map operation returns, even in case of an unclean reboot.
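From user-space, this un-map-then-map sequence might look roughly as follows. This is a sketch under assumptions: discard_leb() is a hypothetical helper, the request layout (struct ubi_map_req with the lnum and dtype fields) is assumed to match your include/mtd/ubi-user.h, and the dtype hint is passed as a plain number because not every version of the header may define the hint constants.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <mtd/ubi-user.h>

/*
 * Un-map LEB 'lnum' and map it again straight away, so that the old
 * contents cannot re-appear after an unclean reboot.
 * 'fd' is an open UBI volume character device, 'dtype' is the data type
 * hint expected by the running kernel.
 * Returns 0 on success and -1 on error (errno is set by ioctl()).
 */
static int discard_leb(int fd, int32_t lnum, int8_t dtype)
{
    struct ubi_map_req req;

    assert(lnum >= 0);

    /* Step 1: un-map; the old PEB is only scheduled for erasure. */
    if (ioctl(fd, UBI_IOCEBUNMAP, &lnum) < 0)
        return -1;

    /* Step 2: map the LEB to a new PEB. */
    memset(&req, 0, sizeof(req));
    req.lnum = lnum;
    req.dtype = dtype;
    if (ioctl(fd, UBI_IOCEBMAP, &req) < 0)
        return -1;

    return 0;
}
```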
Please, use the LEB map operation carefully. Do not use it unless it is really needed, because mapped LEBs add more overhead on the UBI wear-leveling sub-system compared to un-mapped LEBs. Indeed, if an LEB is un-mapped, there is no PEB which contains the LEB's data, and the wear-leveling sub-system does not have to move any data to maintain wear-leveling. Conversely, if the LEB is mapped to a PEB, there is one more PEB for the wear-leveling sub-system to care about, and one more LEB to re-map to another PEB if the erase counter of the current PEB becomes too low (then the LEB is re-mapped to a PEB with higher erase counter and the old PEB is used for other operations).
Volume update
The volume update operation is useful for device software updates. The operation replaces the contents of the whole UBI volume with new contents. But if it gets interrupted in the middle of the update, the volume goes into the "corrupted" state and further I/O on the volume ends up with an EBADF error. The only way to get the volume back to the normal state is to start a new volume update operation and finish it.
The volume update operation makes it possible to detect interrupted updates and re-start them, for example with the help of a "mirror" volume which has the same contents, or by showing a dialog window which informs the user about the problem and requests re-flashing. In contrast, it is difficult to detect interrupted updates in case of raw MTD partitions.
The volume update operation is available via the user-space UBI interface and is not available via the UBI kernel API. To update a volume, you first have to call the UBI_IOCVOLUP ioctl of the corresponding UBI volume character device and pass it a pointer to a 64-bit value containing the length of the new volume contents in bytes. Then this amount of bytes has to be written to the volume character device. Once the last byte has been sent to the character device, the update operation is finished. Schematically, the sequence is:
    fd = open("/dev/my_volume", O_RDWR);
    ioctl(fd, UBI_IOCVOLUP, &image_size);
    write(fd, buf, image_size);
    close(fd);
See include/mtd/ubi-user.h for more details. Bear in mind, the old contents of the volume are not preserved in case of an interrupted update. Also, you do not have to write all new data in one go. It is OK to call the write() function an arbitrary number of times and pass an arbitrary amount of data each time. The operation will be finished after all the data have been written. If the last write operation contains more bytes than UBI expects, the extra data are just ignored.
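A slightly fuller version of the schematic sequence, with error checks and data written in several chunks, might look like the sketch below. The update_volume() helper, the 4KiB buffer size and the use of a stdio stream as the data source are choices made for this illustration only.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <mtd/ubi-user.h>

/*
 * Stream 'image_size' bytes from 'src' into the UBI volume character
 * device 'vol_path'. Returns 0 on success and -1 on error; after an error
 * the update is left unfinished and has to be started again.
 */
static int update_volume(const char *vol_path, FILE *src, int64_t image_size)
{
    char buf[4096];
    int64_t left = image_size;
    int fd;

    assert(image_size >= 0);

    fd = open(vol_path, O_RDWR);
    if (fd < 0)
        return -1;

    /* Announce the length of the new contents. */
    if (ioctl(fd, UBI_IOCVOLUP, &image_size) < 0)
        goto fail;

    while (left > 0) {
        size_t want = left < (int64_t)sizeof(buf) ? (size_t)left : sizeof(buf);
        size_t got = fread(buf, 1, want, src);
        size_t off = 0;

        if (got == 0)
            goto fail; /* the source ended too early */

        /* write() may accept fewer bytes than requested. */
        while (off < got) {
            ssize_t n = write(fd, buf + off, got - off);

            if (n <= 0)
                goto fail;
            off += (size_t)n;
        }
        left -= (int64_t)got;
    }
    return close(fd);

fail:
    close(fd);
    return -1;
}
```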
A special case of the volume update operation is what we call volume truncation, which is done by the same ioctl command if the data length is zero. In this case the volume is just wiped out and will contain all 0xFF bytes (all LEBs will be un-mapped).
Note, the /sys/class/ubi/ubiX_X/corrupted sysfs file reflects the "corrupted" state of the volume: it contains ASCII "0\n" if the volume is OK and "1\n" if it is corrupted (because a volume update had started but was not finished).
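A program may check this file before using a volume. The following is a minimal sketch; volume_is_corrupted() is a hypothetical helper and the caller supplies the full path of the sysfs file.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read the "corrupted" sysfs file of a volume.
 * Returns 1 if the volume is corrupted, 0 if it is OK, and -1 if the
 * file cannot be opened or does not contain 0 or 1.
 */
static int volume_is_corrupted(const char *attr_path)
{
    FILE *f;
    int value = -1;
    int parsed;

    assert(attr_path != NULL);

    f = fopen(attr_path, "r");
    if (!f)
        return -1;
    parsed = fscanf(f, "%d", &value);
    fclose(f);

    if (parsed != 1 || (value != 0 && value != 1))
        return -1;
    return value;
}
```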
The volume update operation does not preserve the old volume contents if it is interrupted, so it is not atomic. However, UBI also provides atomic volume updates by means of the volume re-name operation.
The volume update is implemented with the help of the so-called update marker. Once the user has issued the UBI_IOCVOLUP ioctl, UBI sets the update marker flag for the volume in the corresponding record of the UBI volume table. Then the volume is wiped out and UBI waits for the user to pass the data. Once all the data have arrived and have been written to the flash, the update marker is cleared. But in case of an interruption (e.g., unclean reboot, crash of the update application, etc.), the update marker is not cleared and the volume is treated as "corrupted". Only a new successful update operation may clear the update marker.
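The life cycle of the update marker can be illustrated with a toy in-memory model. Nothing below is UBI code: the vol_model structure and the three functions are invented for the illustration and only mirror the rules described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the update state of one volume. */
struct vol_model {
    bool upd_marker;  /* set while an update is in progress */
    int64_t expected; /* bytes announced when the update was started */
    int64_t received; /* bytes written so far */
};

/* The update is started: the marker is set and the volume is wiped. */
static void update_start(struct vol_model *v, int64_t bytes)
{
    assert(bytes >= 0);
    v->expected = bytes;
    v->received = 0;
    /* A zero-length update (truncation) has no data to wait for. */
    v->upd_marker = bytes > 0;
}

/* Data arrive in arbitrary chunks; the last byte clears the marker. */
static void update_write(struct vol_model *v, int64_t count)
{
    if (!v->upd_marker)
        return;
    v->received += count;
    if (v->received >= v->expected) {
        v->received = v->expected; /* extra bytes are ignored */
        v->upd_marker = false;
    }
}

/* A volume whose marker is still set is treated as corrupted. */
static bool is_corrupted_after_reboot(const struct vol_model *v)
{
    return v->upd_marker;
}
```

In this model, an update of 100 bytes interrupted after 60 bytes leaves the marker set, so the volume would be reported as corrupted until a new update runs to completion.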
Atomic LEB change
The atomic LEB change operation changes the contents of an LEB atomically, so that the old contents is preserved if the operation is interrupted. In other words, the result of the operation is that the LEB either has the old contents or the new contents.
The operation is available via the ubi_leb_change() kernel API call. The user-space interface for this operation exists starting from kernel version 2.6.25.
The user-space atomic LEB change operation is run via the UBI_IOCEBCH ioctl command. You have to pass a pointer to a properly filled request object of struct ubi_leb_change_req type. The object stores the LEB number to change and the length of the new contents. Then you have to write the specified amount of bytes to the volume character device. Notice some similarity to the user-space interface of the volume update operation. Schematically, the sequence is:
    struct ubi_leb_change_req req;

    req.lnum = lnum_to_change;
    req.bytes = data_len;
    req.dtype = UBI_LONGTERM; /* data persistency (may also be UBI_SHORTTERM and UBI_UNKNOWN) */

    fd = open("/dev/my_volume", O_RDWR);
    ioctl(fd, UBI_IOCEBCH, &req);
    write(fd, data_buf, data_len);
    close(fd);
If for some reason the user does not write the declared amount of bytes and closes the file, the operation is canceled and the old contents of the LEB is preserved.
Similarly to the volume update operation, it does not matter how many times the write() function is called and how much data it passes to the UBI volume each time. The atomic LEB change operation finishes once the last data byte has arrived.
The atomic LEB change operation might be very useful for file-systems, for example UBIFS uses this operation as the last resort when it commits the file-system index. This operation may also be exploited to create an FTL layer on top of UBI (see here for the description of the idea).
Keep in mind that the atomic LEB change operation calculates the CRC-32 checksum of the new data, so it has some overhead compared to the LEB erase plus LEB write sequence. The volume update operation does not calculate the data CRC-32, so it is faster to update the volume than to atomically change all its eraseblocks. Remember this additional overhead and do not use the operation if atomicity is not really needed.
Implementation details
Suppose UBI has to change a logical eraseblock L which is mapped to a physical eraseblock P1. First of all, UBI always has one free PEB reserved for the atomic LEB change operation; let it be P2. Before the operation, P1 stores the contents of the LEB L and P2 is free (it contains only the EC header and 0xFF bytes). The new data are written to P2, not to P1, so should anything go wrong, the old contents of the LEB are still there.
When the operation finishes, UBI un-maps L from P1, maps it to P2, and schedules P1 for erasure. If the operation is interrupted, L stays mapped to P1 and P2 is scheduled for erasure.
If an unclean reboot happens half way through the atomic LEB change operation, it is obvious that UBI has to preserve the L -> P1 mapping and erase P2 when it attaches the MTD device next time. But if the unclean reboot happens just after the atomic LEB change operation finishes, but before P1 is physically erased, it is obvious that UBI has to preserve the L -> P2 mapping and erase P1.
To resolve situations like that, UBI calculates the CRC-32 checksum of the new contents of the LEB before it is written to flash, and stores it in the VID header (together with the data length). When UBI finds 2 PEBs P1 and P2 mapped to the same LEB L during the initialization, it selects the one with higher sequence number (P2) only if the data CRC-32 is correct (which means that all data has been written to the flash media), otherwise it selects the PEB with lower sequence number (P1). Of course, UBI has to read the LEB contents in order to check the CRC-32 checksum.
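The decision described above can be sketched as follows. This is a simplified model, not the UBI implementation: the peb_copy structure and the resolve() function are hypothetical, and crc32_simple() is a plain bitwise CRC-32 whose initial value and final inversion are not necessarily the ones UBI uses on flash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), for illustration only. */
static uint32_t crc32_simple(const uint8_t *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* What the model knows about one PEB that claims to hold the LEB. */
struct peb_copy {
    int pnum;            /* physical eraseblock number */
    uint64_t sqnum;      /* sequence number from the VID header */
    const uint8_t *data; /* LEB data read back from the PEB */
    size_t data_len;     /* data length from the VID header */
    uint32_t data_crc;   /* data CRC-32 from the VID header */
};

/*
 * Choose between two PEBs mapped to the same LEB: the newer one wins only
 * if its data match the stored checksum, otherwise the older one is kept.
 */
static const struct peb_copy *resolve(const struct peb_copy *a,
                                      const struct peb_copy *b)
{
    const struct peb_copy *newer, *older;

    assert(a != NULL && b != NULL);
    newer = a->sqnum > b->sqnum ? a : b;
    older = newer == a ? b : a;

    if (crc32_simple(newer->data, newer->data_len) == newer->data_crc)
        return newer;
    return older;
}
```

Note that the check needs the data of the newer PEB only, which matches the remark above that UBI has to read the LEB contents to verify the checksum.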
More documentation
Unfortunately, there are no thorough and strict UBI documents. But there is an old UBI design document which has some out-of-date information, but is still useful: ubidesign.pdf.
There is also a PowerPoint UBI presentation available: ubi.ppt. Note, this document has to be looked at in Windows, because it contains a lot of animation and Open Office cannot properly show it. Use slide show (F5 key) when you look, because otherwise the animation is not shown.
Much useful information may be found in the FAQ section.
And of course just reading the UBI interface C header files, which contain quite a few comments, may help: include/mtd/ubi-user.h contains the user-space interface definition (namely, it defines UBI ioctl commands and the involved data structures), include/linux/mtd/ubi.h defines the kernel API, and the drivers/mtd/ubi/kapi.c file contains comments for each kernel API function (just above the body of the function).
How to send an UBI bugreport?
Before sending a bug report:
- make sure you have compiled kernel symbols in (CONFIG_KALLSYMS_ALL=y in .config);
- enable UBI debugging (CONFIG_MTD_UBI_DEBUG=y in .config); please, mark only the "UBI debugging" check-box and do not mark other debugging sub-options like "UBI debugging messages" unless you know what you are doing.
Please, attach all the bug-related messages including the UBI messages from the kernel ring buffer, which may be collected using the dmesg utility or using minicom with serial console capturing. Please, describe how the problem can be reproduced (if it can be). The bugreport should be sent to the MTD mailing list. Please, do not send private e-mails to UBI authors, always CC the mailing list!