
Solid State Drive Bad Block Management Method

Renice

How are Bad Blocks generated? What means does an SSD use to discover and manage Bad Blocks? What are the problems with the Bad Block management strategy suggested by the manufacturers? What kind of management method is better? Will formatting the drive cause the Bad Block Table to be lost? What are the security risks after an SSD is repaired? This article discusses these questions one by one.


Overview


The design of Bad Block management directly affects SSD reliability and efficiency. The Bad Block management practices recommended by NAND Flash manufacturers are not always reasonable, and if abnormal conditions are not considered carefully during product design, unexpected Bad Blocks often result.


For example, after testing SSDs with several different controllers, we found that new Bad Blocks caused by abnormal power loss are very common. Searching for "abnormal power failure resulting in Bad Blocks" or similar keywords shows that this problem does not only occur during testing; many end users run into it as well.

Who Manages Bad Blocks



For controllers without a dedicated flash file system, Bad Blocks can be managed by the SSD controller firmware.

When a dedicated flash file system is used, Bad Blocks are managed by that file system or its driver.

[Figure: Bad Block management]


Three Types of Bad Blocks


1. Factory Bad Blocks (or initial Bad Blocks): blocks that fail the manufacturer's published standards during factory testing and are marked as Bad Blocks before the device leaves the factory. Some factory Bad Blocks can be erased, while others cannot;

2. New Bad Blocks (grown Bad Blocks) caused by wear during use;

3. False Bad Blocks misjudged by the controller due to abnormal power loss or other causes.

Not all new Bad Blocks are caused by wear. If the SSD has no power-loss protection, an abnormal power loss may cause the controller to misjudge a block as bad, or may actually create new Bad Blocks.
Without power-loss protection, if the lower page has already been programmed successfully and power is suddenly lost while the paired upper page is being programmed, the data in the lower page will inevitably be corrupted.
If the number of bit errors exceeds the correction capability of the SSD's ECC, the read fails and the controller judges the block to be a "Bad Block" and marks it in the Bad Block Table.

Some new Bad Blocks can be erased. After such a block is erased, subsequent read, write and erase operations may not fail again, because whether an error occurs also depends on the pattern of the written data: a block that fails with one pattern may pass with another.

The Ratio of Factory Bad Blocks in the Entire Device



Renice has consulted several NAND Flash manufacturers, and their general statement is that the factory Bad Block ratio does not exceed 2%. Manufacturers also keep some margin so that even when the maximum P/E cycle limit is reached, the Bad Block ratio still does not exceed 2%.

Guaranteeing 2% does not seem to be easy, however: a new sample Renice received from the factory tested at a Bad Block ratio of 2.55%, which exceeds 2%.


[Figure: Bad Block scan]




How to Determine Bad Blocks in SSD


1. Judgment Method for Factory Bad Blocks


Scanning for factory Bad Blocks basically means checking whether the byte at the address specified by the manufacturer contains the FFh flag; if it does not, the block is a Bad Block.


The location of the Bad Block marker is roughly the same across manufacturers, but it differs between SLC and MLC. Taking Micron as an example:

1.1 For small-page SLC (528-byte pages), check whether the sixth byte in the spare area of the first page of each block is FFh; if not, the block is a Bad Block;
1.2 For large-page SLC (pages of 2112 bytes or more), check whether the first and sixth bytes of the spare area of the first page of each block are FFh; if not, the block is a Bad Block;
1.3 For MLC, check whether the first or second byte of the spare area of the first page and of the last page of each block is 0xFF; if it is 0xFF the block is good, and if not it is a Bad Block.

The following figure from a Hynix datasheet illustrates this; a minimal scanning sketch follows the figure.

[Figure: factory Bad Block marker locations]
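As a concrete illustration of rule 1.3, here is a minimal C sketch of a factory Bad Block scan for a large-page MLC device. The geometry constants, the in-memory spare_marker array and the read_spare_byte() helper are hypothetical stand-ins for a real NAND driver; only the marker check follows the scheme described above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical geometry -- real values come from the NAND datasheet. */
#define NUM_BLOCKS      16
#define PAGES_PER_BLOCK 256

/* Simulated spare-area marker byte per [block][page]; a real driver
 * would fetch it from the device with a NAND READ command instead. */
static uint8_t spare_marker[NUM_BLOCKS][PAGES_PER_BLOCK];

static uint8_t read_spare_byte(uint32_t block, uint32_t page)
{
    return spare_marker[block][page];
}

/* MLC rule described above: the marker byte in the spare area of the
 * first page and of the last page must both read 0xFF; otherwise the
 * block was marked bad at the factory. */
static bool is_factory_bad_block(uint32_t block)
{
    return read_spare_byte(block, 0) != 0xFF ||
           read_spare_byte(block, PAGES_PER_BLOCK - 1) != 0xFF;
}

int main(void)
{
    /* Pretend the device ships with block 3 marked bad at the factory. */
    for (uint32_t b = 0; b < NUM_BLOCKS; b++)
        for (uint32_t p = 0; p < PAGES_PER_BLOCK; p++)
            spare_marker[b][p] = 0xFF;
    spare_marker[3][0] = 0x00;

    /* Scan once, before any erase, and build the initial bad-block list. */
    for (uint32_t b = 0; b < NUM_BLOCKS; b++)
        if (is_factory_bad_block(b))
            printf("block %u: factory bad\n", (unsigned)b);
    return 0;
}
```

The scan has to be run once on the virgin device and the result saved, because, as noted below, erasing a block clears the factory marker and it cannot be recovered.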



What data is in the Bad Block? All 0 or all 1?

The test results are shown below. This may hold for factory Bad Blocks, but it does not necessarily hold for new Bad Blocks; otherwise it would be impossible to hide data through "Bad Blocks":


[Figure: data read back from a Bad Block]



Can factory Bad Blocks be erased?
Some "can" be erased, while others are prohibited from being erased by the manufacturer. And "can" erase only means that the Bad Block identification can be changed by sending an erase command, rather than suggesting that the Bad Block can be used.

Manufacturers strongly recommend not erasing factory Bad Blocks: once the Bad Block flag is erased it cannot be "recovered", and writing data to such a block is risky.


[Figure: can factory Bad Blocks be erased?]



2. Judgment Method for New Bad Blocks During Use


New Bad Blocks are identified from the feedback of the NAND Flash status register, which reports whether an operation succeeded. If the status register reports a failure during Program or Erase, the SSD controller lists the block as a Bad Block.


Specifically:

2.1 An error occurs while executing an erase command;
2.2 An error occurs while executing a write (program) command;
2.3 An error occurs while executing a read command: if the number of bit errors exceeds the correction capability of the ECC, the block is judged as a Bad Block (a minimal sketch of these checks follows).
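Here is a minimal sketch of how a controller might apply these three checks, assuming a hypothetical result structure filled in by the NAND driver and a hypothetical ECC strength; the real status bits and thresholds are device specific.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ECC_CORRECTABLE_BITS 40   /* hypothetical ECC strength per codeword */

/* Hypothetical operation result, as a driver might report it. */
typedef struct {
    bool     status_fail;   /* NAND status register reported FAIL */
    uint32_t bit_errors;    /* bit errors seen by ECC on a read   */
} nand_result_t;

/* The three rules from the list above. */
static bool erase_failed(nand_result_t r)   { return r.status_fail; }
static bool program_failed(nand_result_t r) { return r.status_fail; }
static bool read_failed(nand_result_t r)
{
    return r.bit_errors > ECC_CORRECTABLE_BITS;   /* uncorrectable read */
}

int main(void)
{
    nand_result_t erase_ok  = { .status_fail = false, .bit_errors = 0 };
    nand_result_t prog_fail = { .status_fail = true,  .bit_errors = 0 };
    nand_result_t read_weak = { .status_fail = false, .bit_errors = 55 };

    printf("erase bad?   %d\n", erase_failed(erase_ok));    /* 0             */
    printf("program bad? %d\n", program_failed(prog_fail)); /* 1 -> mark bad */
    printf("read bad?    %d\n", read_failed(read_weak));    /* 1 -> mark bad */
    return 0;
}
```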

Bad Block Management Methods


Bad Blocks are managed by building and updating a Bad Block Table (BBT). There is no unified specification or practice for the BBT: some engineers use one table to manage both factory Bad Blocks and new Bad Blocks, some manage the two in separate tables, and some keep the initial Bad Blocks as one table while adding the factory Bad Blocks to the new Bad Blocks in another.


The encoding of the table entries is also not consistent. Some use a coarse representation, for example 0 for a good block and 1 for a Bad Block (or vice versa). Others use a more detailed encoding, such as 00 for factory Bad Blocks, 01 for Bad Blocks from Program failures, 10 for Bad Blocks from Read failures, and 11 for Bad Blocks from Erase failures.

The Bad Block Table is generally saved in a dedicated area (for example Block0 page0 and Block1 page1) and read directly at each power-on, which is more efficient than rescanning. Since the NAND Flash holding the table can itself fail, the BBT is usually backed up; how many copies are kept differs from vendor to vendor, with some keeping 2 copies and others 8. The number can be chosen with a simple majority-voting argument from probability theory, but in any case at least 2 copies should be kept.
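As one possible illustration of the two ideas above, the sketch below stores (block, reason) entries using the 2-bit reason codes mentioned, and recovers the BBT from three stored copies with a bit-wise majority vote. The serialized layout, sizes and helper names are hypothetical, not any vendor's actual format.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Bad-block reason codes using the 2-bit encoding mentioned above.
 * The table here stores one (block, reason) entry per bad block. */
enum bb_reason { BB_FACTORY = 0x0, BB_PROGRAM = 0x1, BB_READ = 0x2, BB_ERASE = 0x3 };

typedef struct {
    uint32_t block;       /* physical block number */
    uint8_t  reason;      /* one of enum bb_reason */
} bbt_entry_t;

#define BBT_BYTES 256     /* hypothetical serialized BBT size */

/* Bit-wise majority of three copies read back from flash: a bit is 1
 * only if at least two of the three copies agree, so a corrupted byte
 * in a single copy is outvoted. */
static uint8_t vote3(uint8_t a, uint8_t b, uint8_t c)
{
    return (uint8_t)((a & b) | (a & c) | (b & c));
}

int main(void)
{
    uint8_t copy0[BBT_BYTES], copy1[BBT_BYTES], copy2[BBT_BYTES], voted[BBT_BYTES];

    /* Pretend a serialized BBT with two entries was written three times. */
    bbt_entry_t entries[2] = { { .block = 3,  .reason = BB_FACTORY },
                               { .block = 97, .reason = BB_ERASE   } };
    memset(copy0, 0xFF, sizeof copy0);
    memcpy(copy0, entries, sizeof entries);
    memcpy(copy1, copy0, sizeof copy0);
    memcpy(copy2, copy0, sizeof copy0);
    copy1[5] ^= 0xA5;                      /* corrupt one byte of one copy */

    for (size_t i = 0; i < BBT_BYTES; i++)
        voted[i] = vote3(copy0[i], copy1[i], copy2[i]);

    printf("vote recovers original: %s\n",
           memcmp(voted, copy0, BBT_BYTES) == 0 ? "yes" : "no");
    return 0;
}
```

With three copies, a single corrupted copy is always outvoted; keeping more copies raises the tolerance at the cost of space.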

Bad Block management strategies generally include the Bad Block skipping strategy and the Bad Block Replacement strategy.

Bad Block Skipping Strategy



1. For initial Bad Blocks, the skipping strategy uses the BBT to skip over the Bad Block and stores the data directly in the next good block.
2. For new Bad Blocks, the block is added to the BBT, the valid data in it is moved to the next good block, and the Bad Block is skipped in every subsequent Read, Program or Erase (a minimal sketch follows).
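A minimal sketch of the skipping idea referenced above, assuming a simple in-memory bitmap as the BBT: the intended physical block is walked forward past any block marked bad, so the data lands in the next good block. A real FTL maps addresses through its own tables rather than this toy linear walk.

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_BLOCKS 16

static bool bbt[NUM_BLOCKS];   /* true = bad (factory or grown) */

/* Skip strategy: starting at the intended physical block, walk forward
 * until a good block is found. Returns -1 if no good block remains. */
static int next_good_block(int start)
{
    for (int b = start; b < NUM_BLOCKS; b++)
        if (!bbt[b])
            return b;
    return -1;
}

int main(void)
{
    bbt[2] = true;             /* factory bad block from the initial scan */
    bbt[3] = true;             /* grown bad block added after a failure   */

    /* Data intended for block 2 simply goes to the next good block. */
    printf("write meant for block 2 goes to block %d\n", next_good_block(2)); /* 4 */
    return 0;
}
```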



Bad Block Replacement Strategy (recommended by NAND Flash manufacturers)


Bad Block Replacement means using good blocks in a reserved area to replace Bad Blocks generated during use. Suppose an error occurs on page n during a program operation. Under the replacement strategy, the data from page 0 to page (n-1) is copied to the same positions in a free block in the reserved area (say Block D), and the data for page n, still held in the data register, is then written to page n of Block D.
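Here is a minimal sketch of that replacement flow, with an in-memory array standing in for the NAND device: pages 0 to n-1 are copied from the failing block to the same positions in a reserved free block, and the page that failed to program is then written there from the still-available data buffer. The names and geometry are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGES_PER_BLOCK 8
#define PAGE_SIZE       16      /* toy page size */
#define NUM_BLOCKS      8

/* In-memory stand-in for the NAND array. */
static uint8_t flash[NUM_BLOCKS][PAGES_PER_BLOCK][PAGE_SIZE];

/* Replacement flow: a program error occurred on page n of bad_blk.
 * Copy pages 0..n-1 to the same positions in spare_blk, then write the
 * data-register contents (the page that failed) to page n of spare_blk. */
static void replace_block(int bad_blk, int spare_blk, int n, const uint8_t *data_reg)
{
    for (int p = 0; p < n; p++)
        memcpy(flash[spare_blk][p], flash[bad_blk][p], PAGE_SIZE);
    memcpy(flash[spare_blk][n], data_reg, PAGE_SIZE);
    /* Real firmware would now mark bad_blk in the BBT and remap its
     * address to spare_blk. */
}

int main(void)
{
    uint8_t page_buf[PAGE_SIZE];
    memset(page_buf, 0xAB, sizeof page_buf);   /* page that failed to program   */
    memset(flash[1], 0x11, sizeof flash[1]);   /* pages already in the bad block */

    replace_block(/*bad_blk=*/1, /*spare_blk=*/7, /*n=*/3, page_buf);
    printf("spare block page 0 byte: 0x%02X, page 3 byte: 0x%02X\n",
           flash[7][0][0], flash[7][3][0]);    /* 0x11, 0xAB */
    return 0;
}
```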


The manufacturer's suggestion is to divide the entire data area into two parts: a user-visible area for the user's normal data operations, and a spare area reserved specifically for replacing Bad Blocks, which holds the replacement data and the Bad Block Table. The spare area amounts to 2% of the total capacity.


[Figure: Bad Block Replacement strategy]



When a Bad Block appears, the FTL remaps the Bad Block address to a good block address in the reserved area instead of simply skipping to the next good block. Before each write to a logical address, it first works out which physical address will be written and whether that address is a Bad Block; if it is, the data is written to the corresponding address in the reserved area.
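A minimal sketch of the remapping lookup under this strategy, assuming a small table that maps a bad physical block to its replacement in the reserved area; this is illustrative only and not any particular FTL's data structure.

```c
#include <stdint.h>
#include <stdio.h>

#define MAX_REMAPS 16

/* One remap entry: a bad block and the reserved-area block replacing it. */
typedef struct { uint32_t bad_block; uint32_t replacement; } remap_t;

static remap_t remap_table[MAX_REMAPS];
static int     remap_count;

/* Before each write, translate the target physical block: if it has
 * been remapped, the data goes to the replacement block instead. */
static uint32_t resolve_block(uint32_t phys_block)
{
    for (int i = 0; i < remap_count; i++)
        if (remap_table[i].bad_block == phys_block)
            return remap_table[i].replacement;
    return phys_block;
}

int main(void)
{
    /* Block 42 went bad and was replaced by block 1000 in the reserved area. */
    remap_table[remap_count++] = (remap_t){ .bad_block = 42, .replacement = 1000 };

    printf("write to block 42 -> block %u\n", (unsigned)resolve_block(42)); /* 1000 */
    printf("write to block 43 -> block %u\n", (unsigned)resolve_block(43)); /* 43   */
    return 0;
}
```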

However, the manufacturers give no indication of whether the 2% reserved area should be part of the OP area or an additional area, nor of whether it is dynamic or static. If it is a separate, static area, this approach has the following disadvantages:

1. Directly reserving 2% of the blocks for Bad Block replacement reduces the available capacity and wastes space. At the same time, because fewer blocks are available, wear on the remaining blocks is accelerated;

2. If the number of Bad Blocks in the available area exceeds 2%, that is, all the reserved replacement blocks have been used up, newly generated Bad Blocks can no longer be handled and the SSD faces the end of its life.


Bad Block Replacement Strategy (the practice of some SSD manufacturers)


In real product designs, it is actually rare to see a separate 2% area set aside for Bad Block Replacement. In general, free blocks in the OP (Over-Provisioning) area are used to replace blocks that go bad during use. Take garbage collection as an example: when garbage collection runs, the valid page data in the block to be reclaimed is first moved to a free block, and the block is then erased. If the status register reports that the Erase failed, the Bad Block management mechanism adds the block address to the new Bad Block list, writes the valid data pages from the Bad Block to a free block in the OP area, updates the Bad Block Table, and the next time data is written the Bad Block is simply skipped in favor of the next available block.
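A minimal sketch of that garbage-collection path, with hypothetical move_valid_pages() and nand_erase() helpers standing in for the real firmware routines: the victim block's valid pages are relocated to a free block taken from the OP pool, and if the erase status then reports a failure the block is recorded in the grown Bad Block Table so that future writes skip it.

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_BLOCKS 16

static bool grown_bbt[NUM_BLOCKS];       /* grown bad-block list             */
static bool erase_will_fail[NUM_BLOCKS]; /* simulated status-register result */

/* Hypothetical stand-ins for the real firmware routines. */
static void move_valid_pages(int src_blk, int dst_blk)
{
    printf("GC: valid pages of block %d moved to OP free block %d\n", src_blk, dst_blk);
}

static bool nand_erase(int blk)          /* true = status register OK */
{
    return !erase_will_fail[blk];
}

/* Garbage-collect one block: relocate its valid pages to a free block
 * taken from the OP pool, then erase it. If the erase status is FAIL,
 * record the block in the grown bad-block table; its data is already
 * safe in the free block, and future writes simply skip this block. */
static void gc_reclaim(int victim_blk, int op_free_blk)
{
    move_valid_pages(victim_blk, op_free_blk);
    if (!nand_erase(victim_blk)) {
        grown_bbt[victim_blk] = true;
        printf("GC: erase of block %d failed, added to grown BBT\n", victim_blk);
    }
}

int main(void)
{
    erase_will_fail[5] = true;           /* simulate a block whose erase fails */
    gc_reclaim(/*victim_blk=*/5, /*op_free_blk=*/12);
    gc_reclaim(/*victim_blk=*/6, /*op_free_blk=*/13);
    return 0;
}
```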


The OP size varies from manufacturer to manufacturer; different application scenarios and reliability requirements call for different OP sizes.
There is a trade-off between OP and stability. The larger the OP, the more free space garbage collection has to work with during sustained writes, the more stable the performance, and the smoother the performance curve.
Conversely, the smaller the OP, the worse the performance stability, but the more space is available to the user and the lower the cost.

Generally speaking, OP is set between 5% and 50%, with 7% being a common ratio. Unlike the fixed 2% block area suggested by the manufacturers, the 7% OP is not a fixed set of blocks; it is distributed dynamically across all blocks, which is also more favorable to the wear-leveling strategy.

Security Risk of SSD Repair


For most SSD manufacturers that do not own the controller technology, the usual practice when a product is returned for repair is to replace the faulty device and then re-run the mass-production (re-initialization) tool. At that point the list of new Bad Blocks, i.e. the grown Bad Block Table, is lost. This means that Bad Blocks already present in the NAND Flash that was not replaced are no longer tracked, and the operating system or sensitive data may be written into those Bad Block areas, which can cause the user's operating system to crash. Even for a manufacturer that controls its own controller, whether the existing Bad Block list is preserved for the user depends on the attitude of the manufacturer the user is dealing with.


Will Bad Blocks Affect the SSD's Read/Write Performance and Stability


Factory Bad Blocks are isolated at the bit line, so they do not affect the erase speed of other blocks. However, if enough new Bad Blocks appear across the SSD, the number of available blocks on the drive decreases, which increases the frequency of garbage collection, and the reduced OP capacity seriously hurts garbage collection efficiency. Once Bad Blocks accumulate beyond a certain level they therefore affect the SSD's performance stability, especially during sustained writes: because the drive has to run garbage collection, performance drops and the performance curve fluctuates heavily.
