Ideas for Validator #158

meier-rene · Feb 27, 2019

I want to collect ideas for automatic validation in this issue:
- ~~check for duplicate entries in CH$NAME~~ implemented and applied, 242 records fixed
- ~~check for InChIKey-style pattern in CH$NAME~~ implemented and applied, 131 records fixed
- perhaps flag super-short names that contain both letters and numbers (these are often database codes such as CID1233; a lot of ZINC and CHEBI IDs sneak through)
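
The short-name idea could be sketched roughly as below. This is only an illustration, not the validator's actual code: the function name, the 12-character cut-off and the "letters followed by digits" pattern are all assumptions, and a hit should be treated as a warning for manual review rather than an error.

```python
import re

# Assumed pattern: at least two letters, an optional separator, then digits only.
# This is a guess at what "database code" looks like, not a rule from the issue.
CODE_LIKE = re.compile(r"^[A-Za-z]{2,}[:_-]?\d+$")


def looks_like_database_code(name: str, max_len: int = 12) -> bool:
    """Return True for short names such as 'CID1233' that resemble database IDs.

    max_len is a hypothetical cut-off for "super-short"; tune it on real records.
    """
    name = name.strip()
    return len(name) <= max_len and CODE_LIKE.match(name) is not None


if __name__ == "__main__":
    for candidate in ["CID1233", "ZINC00001234", "CHEBI:12345", "Caffeine", "2,4-D"]:
        print(candidate, looks_like_database_code(candidate))
```

Legitimate short names that happen to fit the pattern would also be flagged, which is why a warning seems safer than a hard failure.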

Treutler · Feb 27, 2019

I stumbled over this spectrum and noticed the fragment m/z values compared to the parent-ion m/z (423.5989 and 782.2591 vs 147.0441). If the validator could reveal fragments which are too heavy, then we would at least be aware of that.

schymane · Feb 27, 2019

That spectrum looks like it is only noise (those are the only two peaks). Note that spectra processed with RMassBank can sometimes contain heavier peaks if they have certain adducts (up to +N2O allowed), so this should be considered in any validation. There are, however, many spectra with bogus heavy peaks that are clearly just noise (where it is clear from mass defect etc, like in this case) and maybe the (sub)formula assignment routine in RMB could be integrated into the validator to help separate the possible goodies from the baddies? What is going to be the procedure for spectra that the validator identifies as (likely) pure noise, like the example you just raised?
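
A minimal sketch of such a heavy-fragment check, assuming singly charged ions and using the +N2O allowance mentioned above as the upper limit. The function name and the `tol` parameter are hypothetical, and this does not attempt the (sub)formula assignment that RMassBank performs.

```python
# Monoisotopic mass of N2O, from N = 14.0030740 and O = 15.9949146 (about 44.0011 Da).
N2O_MASS = 2 * 14.0030740 + 15.9949146


def heavy_peaks(peak_mzs, precursor_mz, allowance=N2O_MASS, tol=0.01):
    """Return peaks heavier than precursor m/z + allowance (+ tol).

    Assumes charge 1, so m/z differences can be compared directly with masses.
    tol is an assumed absolute tolerance in m/z units.
    """
    limit = precursor_mz + allowance + tol
    return [mz for mz in peak_mzs if mz > limit]


if __name__ == "__main__":
    # The two peaks and the precursor m/z quoted earlier in this thread.
    print(heavy_peaks([423.5989, 782.2591], 147.0441))
```

Peaks returned by this function would only be candidates for a flag; whether they are adducts or noise still needs a closer look.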

meier-rene · Feb 27, 2019

I have no idea what the proper procedure is, especially in this case, because it's experimental data, not metadata. I wouldn't touch it. One could flag it or raise an issue with the original contributor, but sometimes this will be complicated.

Treutler · Mar 4, 2019

Check CH$NAME for

- duplicated names
- InChIKeys, i.e. "^[A-Z]{14,14}-[A-Z]{10,10}-[A-Z]{1,1}"
- typical database IDs, e.g. ChEBI, PubChem CID, ...
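
For illustration, the three CH$NAME checks could look something like this. It is a sketch under assumptions: the InChIKey pattern is the one given above with an end anchor added, the two database-ID patterns are guesses that would need extending, and `check_names` is a hypothetical helper, not part of the existing validator.

```python
import re
from collections import Counter

# InChIKey pattern from the comment above, with "$" added to anchor the end.
INCHIKEY = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
# Assumed ID formats for ChEBI and PubChem CID; other databases are not covered.
DB_ID = re.compile(r"^(CHEBI:\d+|CID\s?\d+)$", re.IGNORECASE)


def check_names(record_lines):
    """Return (problem, name) tuples for the CH$NAME lines of one record."""
    prefix = "CH$NAME:"
    names = [line[len(prefix):].strip() for line in record_lines if line.startswith(prefix)]
    problems = [("duplicate", name) for name, count in Counter(names).items() if count > 1]
    for name in dict.fromkeys(names):  # each distinct name once, in file order
        if INCHIKEY.match(name):
            problems.append(("inchikey", name))
        elif DB_ID.match(name):
            problems.append(("database_id", name))
    return problems


if __name__ == "__main__":
    record = [
        "CH$NAME: Caffeine",
        "CH$NAME: Caffeine",
        "CH$NAME: RYYVLZVUVIJVGH-UHFFFAOYSA-N",
        "CH$NAME: CHEBI:27732",
    ]
    for problem in check_names(record):
        print(problem)
```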

schymane · Mar 25, 2019

In light of issues found/raised by Herbert Oberacher recently, I see a couple of new ideas we should consider implementing in the validator:

- Do a spectral search of new entries and flag records if extremely high similarity is shown to a record with a different compound annotation. To discuss: do this globally or introduce different warnings? If the compound annotation is clearly completely different (radically different precursor mass etc.), this should be an error or high-level warning; but we will get many cases where they are isobars or substances that may validly have very similar spectra as they are closely related. Note: this will only catch errors for substances that are already on the public record, but it's a start. The contributor should be prompted to check records.
- We should flag all Orbitrap spectra that have a base peak of absolute intensity below 10,000. These are highly likely noise only. Contributor should check and decide whether to deprecate?
- Orbitrap records where one peak is just above 10,000 but all others are below are also likely just noise.
- To discuss in more detail: how to catch and deprecate poor-quality records where we now have better records? Similarity is challenging to use alone, as we can have poor similarity because of different settings ...
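
The two Orbitrap intensity rules above might be sketched as follows. This is a hedged illustration only: the function name is made up, "just above" is interpreted as "within 1.5 times the threshold" purely as an assumption, and the caller is expected to apply it only to Orbitrap records whose peak list holds absolute intensities.

```python
def orbitrap_noise_flag(intensities, threshold=10_000.0, just_above=1.5):
    """Return a short reason if the spectrum looks like noise, else None.

    threshold is the 10,000 absolute-intensity limit from the comment above.
    just_above is a hypothetical factor defining "just above the threshold".
    """
    ranked = sorted(intensities, reverse=True)
    if not ranked:
        return "empty peak list"
    if ranked[0] < threshold:
        return "base peak below threshold"
    others_below = all(value < threshold for value in ranked[1:])
    if ranked[0] <= just_above * threshold and others_below:
        return "single peak barely above threshold"
    return None


if __name__ == "__main__":
    print(orbitrap_noise_flag([5000, 3000]))
    print(orbitrap_noise_flag([12000, 4000, 800]))
    print(orbitrap_noise_flag([250000, 40000, 12000]))
```

Returning a reason string rather than a boolean would let the validator show the contributor why a record was flagged.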

schymane · Mar 25, 2019

Also check for possible interfering precursor masses in the peak list, e.g. within an isolation width of about 1, using the MS$FOCUSED_ION: PRECURSOR_M/Z entry.
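
One possible reading of this check, as a rough sketch: treat the isolation window as centred on the precursor m/z with a total width of 1, and report any other peak that falls inside it. The centred window, the `tol` used to exclude the precursor itself, and the function name are all assumptions.

```python
def interfering_peaks(peak_mzs, precursor_mz, isolation_width=1.0, tol=0.01):
    """Return peaks inside the isolation window that are not the precursor itself.

    Assumes a window of precursor_mz +/- isolation_width / 2.
    tol is an assumed tolerance for deciding that a peak *is* the precursor.
    """
    half_width = isolation_width / 2
    return [mz for mz in peak_mzs if tol < abs(mz - precursor_mz) <= half_width]


if __name__ == "__main__":
    # Made-up peak list around the precursor m/z quoted earlier in this thread.
    print(interfering_peaks([146.70, 147.0441, 147.35, 148.05], 147.0441))
```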

Treutler · Apr 11, 2019

Please check whether all fragment m/z values in the PK$ANNOTATION section are present in the PK$PEAK section.

schymane · Apr 11, 2019

Good idea. I suggest building in a slight tolerance to avoid decimal place issues. I wouldn't check the reverse, i.e. there may be fewer PK$ANNOTATION entries than PK$PEAK, but there should not be more PK$ANNOTATION entries than PK$PEAK (unless anyone puts out multiple annotations for a given PK$PEAK; I am not aware of this case ... RMassBank only puts out one formula per peak and tags if more were possible ...).

meowcat · Apr 11, 2019

> unless anyone puts out multiple annotations for a given PK$PEAK, I am not aware of this case ...

I am not 100% certain that we don't have cases like this from RMassBank - the s4power branch is certainly able to produce such records if you tell it to.
I personally don't think a record should be invalid if multiple annotations are present for a peak. Note that the annotation field is loosely defined in what it is allowed to contain, so this is certainly legal and possibly also welcome in some cases...

schymane · Apr 11, 2019

Agree with @meowcat - in principle no problem with having multiple annotations for one peak
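
Putting the points from this exchange together, the annotation check could be sketched like this: every annotated m/z must match some peak within a small tolerance, several annotations for the same peak are accepted, and peaks without annotations are not reported. The function name and the default tolerance of 0.001 are assumptions; parsing of the PK$ANNOTATION and PK$PEAK sections is left out.

```python
def unmatched_annotations(annotation_mzs, peak_mzs, tol=0.001):
    """Return annotated m/z values that have no peak within +/- tol.

    Only annotation -> peak is checked, not the reverse. Repeated annotations
    for one peak all match that peak, so they are not reported.
    """
    return [
        a for a in annotation_mzs
        if not any(abs(a - p) <= tol for p in peak_mzs)
    ]


if __name__ == "__main__":
    peaks = [91.05423, 119.04914, 147.04410]
    annotations = [91.0542, 119.0491, 119.0491, 200.1000]
    print(unmatched_annotations(annotations, peaks))
```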

schymane changed the title ~~Ideas vor Validator~~ Ideas for Validator Mar 25, 2019

MassBank/MassBank-web
