They write in small blocks (eg, 4K), but can only erase in very large blocks (eg, 2M). Writing to an empty device is easy, but eventually you have to overwrite something, and you can't just replace a 4K block with another. You have to take a contiguous block of 2M, wipe the entire thing, rewrite whatever part of it was useful, then do the write you actually wanted to.
This effect is known as "write amplification", and it means that in bad cases you need to do many times more work than the host system requested.
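As a back-of-the-envelope illustration using the example sizes above (real page and erase-block sizes vary by drive), the worst case for a single small overwrite looks like this:

```shell
page=4096        # example 4K write/page granularity
erase=2097152    # example 2M erase-block granularity
# Worst case: one 4K overwrite forces the whole 2M block to be erased
# and rewritten, so the drive does erase/page times the requested work.
echo "worst-case write amplification: $((erase / page))x"
# → worst-case write amplification: 512x
```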
Modern high-end SSDs have various ways of dealing with this, such as a RAM cache, an SLC cache, and extra reserved space (over-provisioning) so there's always some spare room, but there are still limits.
This is what TRIM (confusingly also known as 'discard' in some contexts) is for. The SSD operates on blocks and has no way of knowing that a chunk you wrote earlier is now useless because you deleted the file it belonged to; without TRIM, it only finds out when you eventually overwrite that block. TRIM tells the drive "these parts can be recycled" and lets it prepare empty blocks in advance. So make sure to TRIM once in a while.
Sadly, TRIM initially wasn't well specified with regard to performance characteristics -- some old drives can stall on it for a while. So while many filesystems support emitting TRIM automatically on every delete, that can cause severe performance issues on some drives, and the usual recommendation is to run it as a maintenance task on a timer instead.
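On systemd-based distributions this is already packaged: util-linux ships an `fstrim.timer`/`fstrim.service` pair that trims mounted filesystems on a schedule (weekly by default). A minimal setup:

```shell
# Enable the stock weekly trim timer shipped with util-linux
sudo systemctl enable --now fstrim.timer
# See when it last ran and when it will fire next
systemctl list-timers fstrim.timer
```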
TL;DR: Run `fstrim /mountpoint`. Wait a while before testing if it changed anything. The drive isn't obligated to do the work immediately.
It gets a bit more complex than that due to layers: LVM and dm-crypt can filter out TRIM requests. You can use `lsblk -D` to check the support status at each layer.
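For example, on a stacked setup you can check whether discards survive each layer and then trim by hand (`/mountpoint` here is a placeholder):

```shell
# DISC-GRAN and DISC-MAX of 0 on a layer mean discards are filtered out there
lsblk --discard
# Trim a mounted filesystem manually; -v reports how many bytes were trimmed
sudo fstrim -v /mountpoint
```

If a dm-crypt layer shows zeros, discards are disabled there by default (for security reasons) and have to be enabled explicitly, e.g. with the `discard` option in crypttab.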
> You have to take a contiguous block of 2M, wipe the entire thing, rewrite whatever part of it was useful,
This is called garbage collection. It can happen at any time in the background, but it becomes more frequent, and may move into the write path, as the drive fills. 2M is an example size; the actual erase-block size varies by drive model and is rarely disclosed.
> Modern high end SSDs have various ways of dealing with this like a RAM cache, a SLC cache and extra reserved space to always have some spare room, but there are still limits.
There are multiple types of SLC cache as well. Client drives may have a small dedicated SLC area (a few gigabytes) that can absorb a short burst of writes. Client drives may also have pseudo-SLC (pSLC), marketed under names like TurboWrite (Samsung). With pSLC, when about 30% of the drive's NAND is erased, the drive will use that as SLC. So a 1 TB drive will use about 300 GB of space as a 100 GB SLC cache.
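The 300 GB → 100 GB ratio follows from bit density: TLC NAND stores 3 bits per cell, while SLC mode stores only 1, so erased TLC capacity run as pSLC holds a third as much data.

```shell
bits_per_cell=3   # TLC stores 3 bits per cell; SLC mode stores 1
erased_gb=300     # ~30% of a 1 TB drive's NAND, per the example above
echo "pSLC cache size: $((erased_gb / bits_per_cell)) GB"
# → pSLC cache size: 100 GB
```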
How performance degrades as the drive fills varies widely between drive models. Some drives show significant read and write degradation long before they're half written. Others maintain fairly consistent read performance (maybe within 90% of advertised) regardless of how full the drive is. For instance, the original version of the Samsung 980 Pro maintains close-to-spec read speeds regardless of how full it is, but write performance drops from about 5200 MB/s to about 1300(?) MB/s the moment it hits 70% allocated.
Datacenter and enterprise class drives tend to have lower peak performance than client drives, but their performance is much more consistent regardless of how full they are.
If you are buying a client NVMe drive for speed, buy one larger than you need and set aside at least 30% of it as unpartitioned (or unused partition) space. This prevents the OS from writing to that 30% of the drive, keeping plenty of room for pSLC and similar optimizations. It also extends the life of the drive, since garbage collection will have to rewrite the same data less frequently, resulting in a lower write amplification factor.[1]
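One way to set this aside at partitioning time (the device name is hypothetical -- adjust for your system, and note this wipes the disk):

```shell
# DESTRUCTIVE: replaces the partition table on the target disk.
# /dev/nvme0n1 is a placeholder device name.
sudo parted /dev/nvme0n1 --script \
    mklabel gpt \
    mkpart primary ext4 0% 70%   # leave the last 30% unpartitioned
```

This only helps if the reserved area was never written (or has been trimmed), since the controller can only treat erased NAND as spare.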