
Thread: How to fairly search data from a low complexity (e.g. single protein) sample

  1. #1
    Administrator
    Join Date
    Jun 2011
    Location
    Sunnyvale, CA
    Posts
    209


    I have a colleague who is working on hydrogen–deuterium exchange, which requires high protein coverage. They are doing some initial testing with BSA. They tried searching against databases of various sizes with the concatenated target–decoy strategy: just BSA vs. cRAP (which includes BSA) vs. the entire bovine database. As you might expect, the smaller the database, the more they identify and thus the higher the sequence coverage for BSA.

    My first reaction is to be skeptical of the small-database searches, but the more I have thought about it, the harder it is to come up with a reason why the target–decoy strategy would not be valid for a small database. The fundamental premise of the strategy is that an incorrect hit is equally likely to match the target or the decoy database (assuming they are of equal size). That should still be the case regardless of database size, right?
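    To make the premise concrete, here is a minimal sketch of how FDR is estimated from a concatenated target–decoy search: count decoy hits above a score cutoff and use them to estimate the number of incorrect target hits. The scores and labels are hypothetical illustrative values, not output from any real search engine.

    ```python
    def fdr_at_threshold(psms, threshold):
        """Estimate FDR among PSMs scoring >= threshold.

        psms: list of (score, is_decoy) tuples from a concatenated
        target-decoy search. The premise: an incorrect PSM is equally
        likely to hit a target or a decoy sequence, so the decoy count
        estimates the number of incorrect target hits.
        """
        accepted = [(s, d) for s, d in psms if s >= threshold]
        decoys = sum(1 for _, d in accepted if d)
        targets = sum(1 for _, d in accepted if not d)
        return decoys / targets if targets else 0.0

    # Hypothetical scores: several high-scoring targets, one mid-scoring decoy.
    psms = [(90, False), (85, False), (80, False), (60, True),
            (55, False), (40, True), (35, False)]
    print(fdr_at_threshold(psms, 70))  # 0 decoys / 3 targets = 0.0
    print(fdr_at_threshold(psms, 50))  # 1 decoy / 4 targets = 0.25
    ```

    Note that nothing in this arithmetic depends on database size, which is the heart of the question.
    
    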

    There is another issue at play here, but I think it is a little more clear cut. A low number of hits could lead to an overestimation of IDs. Let's say you only have 10 peptides in your sample, and they are all confidently identified. Those 10 peptides score the highest, and if the next hit is a decoy, you correctly claim to identify only those 10 at ≤1% FDR. But there is a 50% chance that the next hit after the top 10 is a target rather than a decoy, in which case you would claim 11 IDs, a 10% overestimate. There are smaller but still significant probabilities of picking up several more spurious peptides by random chance. Once you get into the hundreds or thousands of identifications, this issue becomes much less significant.
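    The small-numbers argument above is easy to check with a toy simulation: after the 10 true hits, treat every subsequent random hit as a 50/50 coin flip between target and decoy, and count how many targets sneak in before the first decoy raises the estimated FDR above zero. (The model is deliberately simplified; it assumes an equal-sized decoy database and ignores score ties.)

    ```python
    import random

    def mean_spurious_ids(trials=20000, max_random_hits=1000, seed=1):
        """After the true hits, each random hit is equally likely to be a
        target or a decoy. Count target hits accepted before the first
        decoy appears, averaged over many simulated experiments."""
        random.seed(seed)
        total = 0
        for _ in range(trials):
            for _ in range(max_random_hits):
                if random.random() < 0.5:  # random hit lands on a target
                    total += 1
                else:
                    break  # first decoy: estimated FDR becomes nonzero
        return total / trials

    print(mean_spurious_ids())  # ~1.0 extra ID on average, a 10% overestimate on 10
    ```

    The expected count is the mean of a geometric distribution, which comes out to exactly 1 extra ID on average, matching the 10% overestimate described above.
    
    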

    But back to the small-database issue. Assuming you have enough hits for the small numbers to not be a problem, is there anything fundamentally wrong with searching against a small or even single-protein database? How would this violate the principles of target–decoy FDR? Or is it possible that this is a case of it being easier to identify something when you are only looking for it instead of many things?

  2. #2
    Angiotensin Member
    Join Date
    Jun 2011
    Location
    Seattle, WA
    Posts
    36
    This is not really an answer to your question, but in cases like this what I always do is throw my protein of interest (BSA) into a database from another organism, like yeast, and search against a target–decoy version of that to generate more decoy hits.

  3. #3
    Albumin Member
    Join Date
    Sep 2011
    Location
    Delhi
    Posts
    82
    The target–decoy approach is a statistical procedure for FDR estimation rather than calculation (yes, they are different). So, as with any statistical procedure, it depends heavily on random sampling and the law of large numbers. Take the case of a die: the probability of rolling a 6 is 1/6, but the observed frequency only approaches 1/6 when we REPEAT the throw many times; the more we do it, the closer to 1/6 we get. Doing the same for an MS/MS search would be disastrous in terms of time. So, all premises holding true, target–decoy needs a moderately large database. I do not have exact numbers right now, but IMO somewhere around 1000 sequences should be fine; the larger the better. But as the database size grows beyond what I call the algorithm's limitation, the results will start getting poorer. So any algorithm has two limits: a lower database-size limit, where everything passes (because there are too few decoys for meaningful statistics), and an upper limit, where decoys overpower the targets. Of course, data quality and search parameters also play a role.
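    The die analogy above is just the law of large numbers; a two-line simulation shows the convergence the post is describing (a fair six-sided die, observed frequency of a 6 approaching 1/6 as throws increase):

    ```python
    import random

    def six_frequency(throws, seed=0):
        """Empirical frequency of rolling a 6 on a fair die; converges
        to 1/6 as the number of throws grows."""
        random.seed(seed)
        return sum(random.randint(1, 6) == 6 for _ in range(throws)) / throws

    for n in (6, 60, 60000):
        print(n, round(six_frequency(n), 3))
    ```

    The claim in the post is that decoy hits play the role of the throws: with too few of them, the estimated rate is as unreliable as the frequency of sixes after six throws.
    
    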

    daniswan has suggested a good trick; I do the same whenever in doubt.

  4. #4
    Administrator Doug's Avatar
    Join Date
    Jun 2011
    Location
    Redwood City, CA
    Posts
    306
    "lower DB size limit, where everything passes (because decoys don't have enough statistical numbers)"

    I am not sure I agree with this. Why would everything pass? I think an incorrect hit still has an equal chance of hitting a target or decoy peptide.

    It seems to me that the premise of FDR still holds true. I think when you only have one protein in a sample, FDR estimation just doesn't work that well in general. At the peptide level, FDR should normally be computed on unique peptides, and a single protein generally doesn't have enough unique peptides to estimate FDR accurately. And protein-level FDR certainly won't work for a single protein. So I think if you already know what protein you have and you are just looking for a specific peptide, then the single-protein database could be fine. If you are trying to see whether a certain protein is present in your sample, then a larger decoy database probably makes sense. Though I don't think FDR estimates will be accurate in either case.

  5. #5
    Administrator
    Join Date
    Jun 2011
    Location
    Sunnyvale, CA
    Posts
    209
    daniswan – Good idea. I am comparing several different ways of doing the search and I will try your strategy as well.

    aky – I understand your point about dice but do you really need to repeat an MS/MS search several times (presumably with different decoys) to get reliable results, or would having hundreds to thousands of PSMs in a single search be sufficient? Do you have an explanation as to why you need a certain number of decoys in the context of the assumptions of the target–decoy strategy? I am still struggling to understand why having a small number of decoys (like 1) would lead to an underestimate of FDR / overestimate of IDs at a given FDR.

  6. #6
    Albumin Member
    Join Date
    Sep 2011
    Location
    Delhi
    Posts
    82
    The target–decoy (TD) strategy works because of large numbers. The equal chance of an incorrect hit matching a target or decoy is not flawed at all, but it only speaks to ONE particular hit. When we estimate FDR we do not use average decoy scores but the highest decoy scores, and the behavior of the highest scores only stabilises with a large number of decoys.

    Doug's point is valid. But in a small database there will be few decoys, just as there are few targets. If by chance the best decoy scores too high or too low, the targets will either mostly fail or mostly pass, respectively. As you have mentioned, the FDR is still relevant, but we do not have enough statistical power (or confidence) in the estimated FDR.
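    The instability of the best decoy score can be illustrated with a simulation. Under an assumed (hypothetical) null where each decoy's score is uniform on 0–100, the spread of the maximum across repeated searches shrinks sharply as the number of decoys grows, which is exactly the "too high or too low by chance" problem described above:

    ```python
    import random
    import statistics

    def best_decoy_spread(n_decoys, trials=2000, seed=2):
        """Draw n_decoys scores from a hypothetical uniform(0, 100) null,
        keep the maximum, and return its standard deviation across many
        simulated searches. The maximum is what effectively sets the
        acceptance cutoff in a target-decoy search."""
        random.seed(seed)
        maxima = [max(random.uniform(0, 100) for _ in range(n_decoys))
                  for _ in range(trials)]
        return statistics.stdev(maxima)

    for n in (5, 50, 5000):
        print(n, round(best_decoy_spread(n), 2))  # spread shrinks as n grows
    ```

    With 5 decoys the cutoff-setting score swings by many points from search to search; with thousands it is nearly fixed, which is why a moderately large database gives a stable FDR estimate.
    
    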

    Craig, you don't need to perform a search many times with different decoys. I think I've read a paper (can't remember the title/authors) where this was done and not much difference was found in the estimated FDRs. Interesting point about a large number of PSMs compensating for statistical power when the database is small. I too believe this should work, though I can't pinpoint how or why.

    I am currently trying to fathom some of these issues and would look forward to ideas, questions, help or critique.

  7. #7
    Glycine Member
    Join Date
    Feb 2012
    Posts
    5
    In my opinion, the size of the database doesn't matter; what really matters is the number of events, i.e. the number of scans searched against the database. Theoretically, as long as that number is big enough, the target–decoy strategy holds true.
    Besides, I think putting the single protein into the database of another organism is a stringent but conservative approach. For single-protein analysis, manual inspection is simply the best.

  8. #8
    Administrator
    Join Date
    Jun 2011
    Location
    Sunnyvale, CA
    Posts
    209
    I like the idea of manual inspection as an extra layer of confidence for a small number of biologically important results. But even for a single protein I'm not sure this is practical. If you take a large protein like BSA (66 kDa) and allow up to 1 missed cleavage you end up with over 100 peptides. Once you consider multiple charge states, non-optimal dynamic exclusion, and contaminants, it is easy to get thousands of MS/MS spectra. Plus, one person might have an error rate of 10% while another might have an error rate of 0.1%, and there is no guarantee of consistency even within one dataset manually analyzed by one person.

SharedProteomics
an online proteomics community