I have a colleague who is working on hydrogen–deuterium exchange, which requires high protein coverage. They are doing some initial testing with BSA. They tried searching against databases of various sizes with the concatenated target–decoy strategy: just BSA vs. cRAP (which includes BSA) vs. the entire bovine database. As you might expect, the smaller the database, the more they identify and thus the higher the sequence coverage for BSA.
My first reaction is to be skeptical of the small-database searches, but the more I think about it, the harder it is to come up with a reason why the target–decoy strategy would not be valid for a small database. The fundamental premise of the strategy is that an incorrect hit is equally likely to match the target or the decoy half of the database (assuming the two halves are the same size). That should still hold regardless of database size, right?
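To convince myself, here is a toy Monte Carlo sketch of that premise. It assumes an incorrect match draws an independent random best score from every candidate in the target half and in an equal-sized decoy half (the database sizes and score model are made up for illustration); the fraction of incorrect best hits landing on the target side should hover around 0.5 whether the database is tiny or large.

```python
import random

def frac_incorrect_hits_on_target(db_size, n_spectra=4000, seed=7):
    """For each unmatchable spectrum, draw a random best score for every
    candidate in a target database and an equal-sized decoy database,
    and record whether the overall best hit was a target. Under the
    equal-likelihood premise this fraction is ~0.5 at any db_size."""
    rng = random.Random(seed)
    target_wins = 0
    for _ in range(n_spectra):
        best_target = max(rng.random() for _ in range(db_size))
        best_decoy = max(rng.random() for _ in range(db_size))
        if best_target > best_decoy:
            target_wins += 1
    return target_wins / n_spectra

small = frac_incorrect_hits_on_target(db_size=5)    # single-protein scale
large = frac_incorrect_hits_on_target(db_size=500)  # proteome scale
```

Both fractions come out near 0.5, which is consistent with the premise being size-independent, at least under these simplistic assumptions about score distributions.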
There is another issue at play here, but I think it is a little more clear-cut. A low number of hits can lead to an overestimation of IDs. Let's say you only have 10 peptides in your sample, and they are all confidently identified. Those 10 peptides score the highest, and if the next hit down the list is a decoy, you correctly claim to identify only those 10 at ≤1% FDR. But there is a 50% chance that the first incorrect hit after the top 10 is a target rather than a decoy; in that case you would claim 11 IDs, a 10% overestimate. There are smaller but still significant probabilities of picking up several more peptides by random chance before the first decoy appears. Once you get into the hundreds or thousands of identifications this issue becomes much less significant.
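The scenario above is easy to simulate. This is a hedged sketch, not any real search engine's counting: it assumes the 10 correct hits always rank on top, each incorrect hit below them is a fair coin flip between target and decoy, and IDs are accepted down the ranked list until the first decoy.

```python
import random

def simulate_claimed_ids(n_true=10, n_incorrect=100, trials=20000, seed=1):
    """Simulate target-decoy counting when only n_true peptides are
    genuinely present. Correct hits always outscore incorrect ones;
    each incorrect hit is a 50/50 target-or-decoy coin flip, and we
    accept hits down the ranked list until the first decoy."""
    rng = random.Random(seed)
    claimed = []
    for _ in range(trials):
        ids = n_true
        for _ in range(n_incorrect):
            if rng.random() < 0.5:   # incorrect hit landed on a target
                ids += 1             # ...so it gets (wrongly) accepted
            else:                    # first decoy: stop accepting
                break
        claimed.append(ids)
    return claimed

claimed = simulate_claimed_ids()
mean_claimed = sum(claimed) / len(claimed)
frac_overestimated = sum(c > 10 for c in claimed) / len(claimed)
```

Under these assumptions about half the runs claim more than 10 IDs, and on average roughly one spurious target sneaks in above the first decoy, i.e. the mean claimed count sits near 11 rather than 10.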
But back to the small-database issue. Assuming you have enough hits that the small-numbers problem does not apply, is there anything fundamentally wrong with searching against a small or even single-protein database? How would this violate the principles of target–decoy FDR? Or is it possible that this is simply a case of it being easier to identify one thing when it is the only thing you are looking for, rather than one of many?