Ideas for Validator #158

Open
meier-rene opened this Issue Feb 27, 2019 · 10 comments

Contributor

meier-rene commented Feb 27, 2019

I want to collect ideas for automatic validation in this issue:
- check for duplicate entries in CH$NAME: implemented and applied, 242 records fixed
- check for InChIKey-style patterns in CH$NAME: implemented and applied, 131 records fixed
- perhaps flag super-short names that contain both letters and numbers, as these are often database codes (e.g. CID1233 or similar); a lot of ZINC and ChEBI IDs sneak through
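The third idea could look roughly like the sketch below. It is only an illustration: the function name and the 12-character cutoff are assumptions, since the issue does not define what counts as "super-short".

```python
import re

# Assumption: "super-short" means at most 12 characters; the issue gives no threshold.
MAX_CODE_LENGTH = 12


def looks_like_database_code(name: str, max_length: int = MAX_CODE_LENGTH) -> bool:
    """Return True for short names that contain both letters and digits."""
    candidate = name.strip()
    if len(candidate) > max_length:
        return False
    has_letter = re.search(r"[A-Za-z]", candidate) is not None
    has_digit = re.search(r"\d", candidate) is not None
    return has_letter and has_digit


# Made-up example names, not taken from real records.
names = ["CID1233", "ZINC00001", "Caffeine", "2,4-D"]
flagged = [n for n in names if looks_like_database_code(n)]
```

Note that a legitimate short name such as "2,4-D" is flagged as well, which is why this would probably work better as a warning for manual review than as a hard error.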

Contributor

Treutler commented Feb 27, 2019

I stumbled over this spectrum and noticed that some fragment m/z values are far above the precursor m/z (423.5989 and 782.2591 vs. 147.0441). If the validator could reveal fragments that are too heavy, then we would at least be aware of them.
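A minimal sketch of such a check, assuming the peak m/z values and the precursor m/z have already been parsed from the record. The function name and the 0.01 tolerance are illustrative assumptions, and the check ignores multiply charged precursors, whose fragments can legitimately appear at higher m/z.

```python
def fragments_above_precursor(peak_mzs, precursor_mz, tolerance=0.01):
    """Return all peak m/z values that exceed the precursor m/z by more than `tolerance`."""
    return [mz for mz in peak_mzs if mz > precursor_mz + tolerance]


# The precursor and the two heavy fragments are the values quoted above;
# 91.0542 is a made-up lighter fragment added for contrast.
peak_mzs = [91.0542, 147.0441, 423.5989, 782.2591]
too_heavy = fragments_above_precursor(peak_mzs, precursor_mz=147.0441)
```

For the values above, `too_heavy` contains 423.5989 and 782.2591, so the record would be flagged.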

Member

schymane commented Feb 27, 2019

Contributor Author

meier-rene commented Feb 27, 2019

I don't know what the proper procedure is, especially in this case, because it's experimental data, not metadata. I wouldn't touch it. One could flag it or raise an issue with the original contributor, but sometimes this will be complicated.

Contributor

Treutler commented Mar 4, 2019

Check CH$NAME for

  • duplicated names
  • InChIKeys, i.e. "^[A-Z]{14,14}-[A-Z]{10,10}-[A-Z]{1,1}"
  • typical database IDs, i.e. ChEBI, PubChem CID, ...
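The three checks could be sketched as follows. This is not the validator's actual implementation: the function name and the database ID prefixes are assumptions chosen for illustration, and the InChIKey regex is the one above with the quantifiers shortened.

```python
import re

# Same pattern as above, with {14,14} written as {14} and so on.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]")
# Assumption: a few illustrative ID shapes (ChEBI, PubChem CID, ZINC); not a complete list.
DATABASE_ID_PATTERN = re.compile(r"^(CHEBI:?|CID\s?|ZINC)\d+$", re.IGNORECASE)


def check_names(names):
    """Return a list of warning strings for the CH$NAME values of one record."""
    warnings = []
    seen = set()
    for name in names:
        stripped = name.strip()
        if stripped in seen:
            warnings.append(f"duplicate name: {stripped}")
        seen.add(stripped)
        if INCHIKEY_PATTERN.match(stripped):
            warnings.append(f"InChIKey used as name: {stripped}")
        if DATABASE_ID_PATTERN.match(stripped):
            warnings.append(f"database ID used as name: {stripped}")
    return warnings


# Illustrative CH$NAME values for a single record.
names = ["Caffeine", "RYYVLZVUVIJVGH-UHFFFAOYSA-N", "CHEBI:27732", "Caffeine"]
warnings = check_names(names)
```

Duplicates are compared as exact strings after trimming whitespace; whether the comparison should also ignore case is left open here.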

Member

schymane commented Mar 25, 2019

In light of issues found/raised by Herbert Oberacher recently, I see a couple of new ideas we should consider implementing in the validator:

  • do a spectral search of new entries and flag records if extremely high similarity is shown to a record with a different compound annotation. To discuss: do this globally or introduce different warnings? If the compound annotation is clearly completely different (radically different precursor mass etc.), this should be an error or high-level warning; but we will get many cases where they are isobars or substances that may validly have very similar spectra because they are closely related. Note: this will only catch errors for substances that are already on the public record, but it's a start. The contributor should be prompted to check the records.
  • We should flag all Orbitrap spectra that have a base peak of absolute intensity below 10,000. These are highly likely noise only. Contributor should check and decide whether to deprecate?
  • Orbitrap records where one peak is just above 10,000 but all others below are also likely just noise
  • To discuss in more detail: how to catch and deprecate poor quality records where we now have better records? Similarity is challenging to use alone as we can have poor similarity because of different settings ...
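The two Orbitrap intensity ideas could be sketched like this. The function assumes the caller has already selected Orbitrap records and extracted the absolute intensities; the function name and the factor of 2 used to interpret "just above" are assumptions, not something agreed on in this issue.

```python
ORBITRAP_NOISE_THRESHOLD = 10_000
# Assumption: "just above" means less than twice the threshold.
JUST_ABOVE_FACTOR = 2.0


def noise_warning(intensities, threshold=ORBITRAP_NOISE_THRESHOLD,
                  just_above_factor=JUST_ABOVE_FACTOR):
    """Return a warning string for a probably noise-only spectrum, else None."""
    if not intensities:
        return None
    ranked = sorted(intensities, reverse=True)
    base_peak = ranked[0]
    if base_peak < threshold:
        return "base peak below threshold"
    others_below = all(i < threshold for i in ranked[1:])
    if len(ranked) > 1 and others_below and base_peak < threshold * just_above_factor:
        return "single peak just above threshold"
    return None


# Made-up intensity lists for illustration.
all_noise = noise_warning([3200, 5100, 8700])
one_peak = noise_warning([10400, 6100, 2300])
clean = noise_warning([250000, 9000, 4000])
```

A spectrum with one dominant peak far above the threshold is deliberately not flagged, because a strong base peak with weak fragments is a plausible real spectrum.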

schymane changed the title from "Ideas vor Validator" to "Ideas for Validator" on Mar 25, 2019

Member

schymane commented Mar 25, 2019

  • also check for the presence of possible interfering precursor masses in the peak list, e.g. around the isolation width of 1, using MS$FOCUSED_ION: PRECURSOR_M/Z entry
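One possible reading of this check, as a sketch: look for peaks that fall inside the isolation window but are not the precursor itself. It assumes the window is centred on the PRECURSOR_M/Z value with a total width of 1; the function name and the 0.01 matching tolerance are likewise assumptions.

```python
def interfering_peaks(peaks, precursor_mz, isolation_width=1.0, match_tolerance=0.01):
    """Return (mz, intensity) pairs inside the isolation window, excluding the precursor.

    `peaks` is a list of (mz, intensity) tuples.
    """
    half_width = isolation_width / 2
    return [
        (mz, intensity)
        for mz, intensity in peaks
        if match_tolerance < abs(mz - precursor_mz) <= half_width
    ]


# Made-up peak list around a precursor at m/z 147.0441.
peaks = [(146.8, 500), (147.0441, 10000), (147.3, 800), (148.0475, 900)]
suspects = interfering_peaks(peaks, precursor_mz=147.0441)
```

Here the peaks at 146.8 and 147.3 are reported, while the peak at 148.0475 lies outside the window and is ignored.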

Contributor

Treutler commented Apr 11, 2019

Please check whether all fragment m/z values in the PK$ANNOTATION section are present in the PK$PEAK section.

Member

schymane commented Apr 11, 2019

Good idea. I suggest building in a slight tolerance to avoid decimal-place issues. I wouldn't check the reverse, i.e. there may be fewer PK$ANNOTATION entries than PK$PEAK entries, but there should not be more PK$ANNOTATION entries than PK$PEAK entries (unless anyone puts out multiple annotations for a given PK$PEAK, I am not aware of this case ... RMassBank only puts out one formula per peak and tags if more were possible).
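A sketch of the one-directional check with a tolerance, assuming both m/z lists have already been parsed from the record. The function name and the default tolerance of 0.001 are illustrative assumptions.

```python
def unmatched_annotations(annotation_mzs, peak_mzs, tolerance=0.001):
    """Return annotation m/z values that have no PK$PEAK m/z within `tolerance`."""
    return [
        a for a in annotation_mzs
        if not any(abs(a - p) <= tolerance for p in peak_mzs)
    ]


# Made-up values: 91.054 matches 91.0542 within the tolerance, 130.0 has no peak.
peak_mzs = [91.0542, 119.0491, 147.0441]
annotation_mzs = [91.054, 119.0491, 130.0]
missing = unmatched_annotations(annotation_mzs, peak_mzs)
```

Only annotations without a matching peak are reported; peaks without an annotation are not, and several annotations pointing at the same peak pass without complaint.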

meowcat commented Apr 11, 2019

> unless anyone puts out multiple annotations for a given PK$PEAK, I am not aware of this case ...

  • I am not 100% certain that we don't have cases like this from RMassBank - the s4power branch is certainly able to produce such records if you tell it to.
  • I personally don't think a record should be invalid if multiple annotations are present for a peak. Note that the annotation field is loosely defined in what it is allowed to contain, so this is certainly legal and possibly also welcome in some cases...

Member

schymane commented Apr 11, 2019

Agree with @meowcat - in principle no problem with having multiple annotations for one peak
