Fraggle – A new similarity searching algorithm

Jameed Hussain, Gavin Harper

Introduction
Brief history of the technique
Why we created (yet) another similarity method
How it works
Performance

Brief history of Fraggle
First written in 2008 using the Daylight toolkit
  – Now 5 years old
One of several similarity methods in regular use at GSK
  – Method of choice for “boosting” SAR
Has provided leads for several drug discovery programs
Re-implemented using RDKit this year

Chemical Similarity Methods
There is no shortage of chemical similarity methods:
  – Path-based fps
  – Morgan fps
  – Topological Torsion / Atom Pairs
  – 2D pharmacophore methods, RGs / ErGs
  – 3D fps
  – ...
Why does the world need another?

Chemical Similarity Methods
Why did we create another similarity method?
Specifically built to fix a particular issue that affects path-based fps
  – Small changes in the middle of a molecule
Affects other similarity methods too

ChEMBL_11085_A_27 & ChEMBL_11085_A_78 RDK5: 0.42 ECFP4: 0.65 TT: 0.47

ChEMBL_28_A_27 & ChEMBL_28_A_45 RDK5: 0.45 ECFP4: 0.66 TT: 0.48

Riniker, S., & Landrum, G. A. (2013). Open-source platform to benchmark fingerprints for ligand-based virtual screening. Journal of Cheminformatics, 5(1), 26.

Substructure searching
Similarity and substructure searching are complementary
Substructure searching requires knowing which part of the molecule is important
  – That part is fixed as the substructure; the rest of the compound can be anything
Similarity searching does not require a fixed substructure
  – “Most” of the compound needs to be the same
How can we capture some of the benefits of a substructure search?
  – “Large changes in a small part of a molecule”

Fraggle – how does it work?

Fraggle works in three steps:
  – Query Fragmentation
  – Tversky Search
  – Post-Processing

Query fragmentation
“Make the method behave like a substructure search”
If you don’t know which part of the molecule is important, how do you know which substructure to search with?
  – Use “all the interesting” substructures
An algorithm fragments the query molecule and selects the “interesting” substructures
  – Employs simple rules
  – Tries to capture all the constituent rings in a query molecule

ChEMBL_11265_A_41

Fragmentation Algorithm – Acyclic cuts
Enumerate all the single acyclic bond cuts
  – Discard fragmentations where you only chop a single atom off
  – Keep a fragment if it is >60% of the query molecule
Enumerate all the double acyclic bond cuts
  – Discard fragmentations where you only chop a single atom off
  – Keep the two fragments with one attachment point
  – Needs to be >60% of the query molecule

ChEMBL_11265_A_41
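The single acyclic cut rule above can be sketched in plain Python. This is only an illustration under stated assumptions, not the Fraggle implementation: the molecule is a toy graph of heavy atoms, an "acyclic bond" is taken to be any bond whose removal splits the graph, and the 60% threshold is measured in atom counts. The function name `single_acyclic_cuts` and the example graph are hypothetical.

```python
from collections import defaultdict


def component_from(start, adjacency, cut_bond):
    """Return the atoms reachable from `start` once `cut_bond` is removed."""
    seen = {start}
    stack = [start]
    while stack:
        atom = stack.pop()
        for neighbour in adjacency[atom]:
            if frozenset((atom, neighbour)) == cut_bond:
                continue  # do not walk across the bond being cut
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


def single_acyclic_cuts(n_atoms, bonds, min_fraction=0.6):
    """Enumerate single acyclic bond cuts on a connected molecular graph.

    Assumption: fragment size is the number of heavy atoms.
    Returns a list of (bond, fragment_atoms) for every kept fragment.
    """
    adjacency = defaultdict(set)
    for a, b in bonds:
        adjacency[a].add(b)
        adjacency[b].add(a)

    all_atoms = set(range(n_atoms))
    kept = []
    for a, b in bonds:
        side_a = component_from(a, adjacency, frozenset((a, b)))
        if b in side_a:
            continue  # ring bond: cutting it does not split the molecule
        side_b = all_atoms - side_a
        if min(len(side_a), len(side_b)) == 1:
            continue  # the cut only chops a single atom off
        for fragment in (side_a, side_b):
            if len(fragment) > min_fraction * n_atoms:
                kept.append(((a, b), frozenset(fragment)))
    return kept


# Toy query: a six-membered ring (atoms 0-5) carrying a five-atom chain (6-10).
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
         (0, 6), (6, 7), (7, 8), (8, 9), (9, 10)]
for bond, fragment in single_acyclic_cuts(11, bonds):
    print(bond, sorted(fragment))
```

In this toy case the cut next to the ring is dropped because neither side exceeds 60% of the 11 atoms, and the terminal cut is dropped because it removes a single atom.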

Fragmentation Algorithm – Ring cuts
For compounds with fused / spiro ring systems
Enumerate all single “ring cuts” (cut at the 2 exocyclic bonds)
  – Needs to be >40% of the query molecule
Enumerate all single “ring cuts” combined with an acyclic bond cut
  – Needs to be >60% of the query molecule

ChEMBL_11265_A_41

Tversky Search
For each fragmentation, carry out a Tversky search against the database
  – ChemAxon FP
  – Alpha=0.95, Beta=0.05 (“substructure similarity”)
  – Tversky similarity cut-off=0.9
Tversky search gives superior results compared to substructure searching (more “fuzziness”)
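A minimal sketch of the search step, assuming fingerprints are represented as Python sets of on-bit indices (the slides use the ChemAxon FP, which is not reproduced here). It also assumes that alpha weights the bits found only in the query fragment and beta the bits found only in the database molecule, so that a database molecule containing nearly all of the fragment's bits scores close to 1. The names `tversky` and `tversky_search` and the toy database are hypothetical.

```python
def tversky(query_bits, target_bits, alpha, beta):
    """Tversky similarity between two fingerprints given as sets of on-bits."""
    common = len(query_bits & target_bits)
    only_query = len(query_bits - target_bits)
    only_target = len(target_bits - query_bits)
    denominator = alpha * only_query + beta * only_target + common
    return common / denominator if denominator else 0.0


def tversky_search(fragment_bits, database, alpha=0.95, beta=0.05, cutoff=0.9):
    """Return (name, score) for database entries at or above the cut-off."""
    hits = []
    for name, bits in database.items():
        score = tversky(fragment_bits, bits, alpha, beta)
        if score >= cutoff:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data: mol_a contains every fragment bit plus ten extra bits,
# mol_b shares only half of the fragment bits.
fragment = set(range(1, 11))
database = {
    "mol_a": fragment | set(range(20, 30)),
    "mol_b": set(range(1, 6)) | set(range(30, 35)),
}
print(tversky_search(fragment, database))
```

With these weights the extra bits in mol_a cost very little, which is the "substructure similarity" behaviour described above, while the missing fragment bits in mol_b are penalised heavily.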

Post Processing
Tversky search can retrieve results which are uninteresting with respect to the original query molecule

[Figure: query fragmentation of ChEMBL_11085_A_27, with retrieved compounds ChEMBL_zinc_D_3054 (Tversky: 0.90) and ChEMBL_11085_A_78 (Tversky: 0.90)]

[Figure: ChEMBL_11085_A_27 query fragmentation; RDK5 Similarity: 0.36 and RDK5 Similarity: 0.42]

[Figure: ChEMBL_11085_A_27 query fragmentation; False Positive, RDK5 Similarity: 0.25; High Scoring Match, RDK5 Similarity: 1.0]

[Figure: ChEMBL_11085_A_27 query fragmentation; False Positive, Fraggle Similarity: 0.36; High Scoring Match, RDK5 Similarity: 1.0]

Post Processing – gory details...
Post-matching algorithm:
  – For each query fragmentation and db molecule pair:
      Map the fragmentation onto the molecule
      Modify the non-matching atoms of the molecule
        – Aromatic atoms become *
        – Aliphatic atoms become Sc
  – Carry out an RDK5 fp Tanimoto similarity using these “modified” query and db molecules
      Done for every “fragmentation”, and the highest similarity is selected
  – Compare the highest similarity with the RDK5 fp Tanimoto on the unmodified query and db molecule
      Pick the higher of the two to give the Fraggle similarity
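The final scoring step above can be sketched as follows. This is an illustration only: fingerprints are modelled as sets of on-bits rather than real RDK5 fingerprints, and the atom-masking step is assumed to have been done already, so the function simply receives one (modified query, modified db molecule) fingerprint pair per fragmentation. The name `fraggle_similarity` and its arguments are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def fraggle_similarity(query_fp, db_fp, modified_pairs):
    """Combine per-fragmentation scores with the unmodified score.

    modified_pairs: iterable of (modified_query_fp, modified_db_fp), one pair
    per query fragmentation that retrieved this db molecule (assumed input).
    """
    best_modified = max(
        (tanimoto(q, d) for q, d in modified_pairs), default=0.0
    )
    # The reported score is the higher of the best modified-pair score and
    # the plain Tanimoto on the unmodified molecules.
    return max(best_modified, tanimoto(query_fp, db_fp))


query_fp = {1, 2, 3, 4}
db_fp = {1, 2, 5, 6}
pairs = [({1, 2, 9}, {1, 2, 9}), ({1, 2, 3}, {1, 5, 6})]
print(fraggle_similarity(query_fp, db_fp, pairs))
```

Taking the maximum means a database molecule never scores lower than its ordinary fingerprint similarity to the query.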

Fragment Mapping
Fragments are matched onto the retrieved and query molecules using partial fingerprints and Tversky similarity
  – The partial fingerprint (pFP) of an atom (in a compound) is the set of bits it sets in the compound fingerprint
Compare the pFP of every atom of a molecule against the FP of the fragments
  – Tversky >0.8 is considered a match
Partial fingerprints with Tversky allow for very computationally cheap alignments
  – Crude but fast
Perfectly adequate for this application
  – “Fuzziness” is good
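The atom-level matching described above might look like the sketch below. The slides give only the 0.8 threshold, so the Tversky weights used here (alpha=1, beta=0, i.e. the fraction of an atom's bits that also appear in the fragment fingerprint) are an assumption, as are the set-of-bits representation and the names `matching_atoms` and `atom_pfps`.

```python
def tversky(bits_a, bits_b, alpha, beta):
    """Tversky similarity between two fingerprints given as sets of on-bits."""
    common = len(bits_a & bits_b)
    denominator = (alpha * len(bits_a - bits_b)
                   + beta * len(bits_b - bits_a)
                   + common)
    return common / denominator if denominator else 0.0


def matching_atoms(atom_pfps, fragment_fp, alpha=1.0, beta=0.0, threshold=0.8):
    """Return the indices of atoms whose pFP matches the fragment fingerprint.

    atom_pfps: dict mapping atom index -> set of bits that atom sets in the
    whole-molecule fingerprint (its partial fingerprint).
    """
    return {
        index
        for index, pfp in atom_pfps.items()
        if tversky(pfp, fragment_fp, alpha, beta) > threshold
    }


fragment_fp = {1, 2, 3, 4, 5}
atom_pfps = {
    0: {1, 2, 3},           # every bit is in the fragment FP
    1: {2, 3, 6, 7},        # half of the bits are in the fragment FP
    2: {7, 8},              # no overlap
    3: {1, 2, 3, 4, 5, 9},  # five of six bits are in the fragment FP
}
print(sorted(matching_atoms(atom_pfps, fragment_fp)))
```

Atoms that fail the threshold are the "non-matching" atoms that the post-processing step would then mask.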

What types of compounds does Fraggle find?
Not as sensitive to changes in the middle of a molecule
Fraggle similarity for the pairs of cmpds below is 1:

ChEMBL_11085_A_27 & ChEMBL_11085_A_78 Fraggle: 1.0 RDK5: 0.42 ECFP4: 0.65 TT: 0.47

ChEMBL_28_A_27 & ChEMBL_28_A_45 Fraggle: 1.0 RDK5: 0.45 ECFP4: 0.66 TT: 0.48

What types of compounds does Fraggle find?
“Large changes in a small part of a molecule”

ChEMBL_10579_A_78 & ChEMBL_10579_A_39 Fraggle: 0.89 RDK5: 0.62 ECFP4: 0.8 TT: 0.78

ChEMBL_11682_A_2 & ChEMBL_11682_A_52 Fraggle: 0.86 RDK5: 0.38 ECFP4: 0.64 TT: 0.57

ChEMBL_10579_A_16 & ChEMBL_10579_A_39 Fraggle: 0.89 RDK5: 0.52 ECFP4: 0.75 TT: 0.68

What types of compounds does Fraggle find?
Performs very well with fused and spiro queries

ChEMBL_11265_A_64 & ChEMBL_11265_A_41 Fraggle: 0.81 RDK5: 0.49 ECFP4: 0.66 TT: 0.59

ChEMBL_11279_A_53 & ChEMBL_11279_A_35 Fraggle: 0.92 RDK5: 0.63 ECFP4: 0.7 TT: 0.61

ChEMBL_11085_A_97 & ChEMBL_11085_A_74 Fraggle: 0.81 RDK5: 0.64 ECFP4: 0.44 TT: 0.31

Performance - AUC
Acknowledging the work of Sereina Riniker and Greg Landrum:
Riniker, S., & Landrum, G. A. (2013). Open-source platform to benchmark fingerprints for ligand-based virtual screening. Journal of Cheminformatics, 5(1), 26.

Compared Fraggle, RDK5, TT, ECFP4, MACCS, ECFP0
Results from post-hoc Friedman tests of the average rank:

         RDK5     Fraggle  ECFP4    MACCS    ECFP0
TT       X        O        -        -        -
RDK5              X        O        -        -
Fraggle                    X        -        -
ECFP4                               O        -
MACCS                                        -

X: No statistically significant difference
O: Difference around the confidence level
-: Statistically significant difference

Performance – BEDROC20
Results from post-hoc Friedman test of the average rank:

         ECFP4    RDK5     Fraggle  MACCS    ECFP0
TT       X        X        O        -        -
ECFP4             X        X        -        -
RDK5                       X        -        -
Fraggle                             -        -
MACCS                                        -

X: No statistically significant difference
O: Difference around the confidence level
-: Statistically significant difference

Fraggle is “in the mix” with the best performing methods
  – Benefits from RDK5 for the AUC metric
  – Similar performance to ECFP4, RDK5 (and TT) for BEDROC20

Correlation with other methods
Take all actives from the evaluation platform
  – For the actives in each dataset, generate a similarity matrix
How does the similarity ranking correlate (Spearman) between methods?
Fraggle is worth running alongside other top performing methods
[Figures: correlation between methods for ChEMBL and MUV]
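As a reminder of what the Spearman comparison measures, here is a small standard-library sketch: it ranks two lists of similarity values (ties share the average rank) and computes the Pearson correlation of the ranks. The input lists are invented numbers for illustration, not results from the evaluation platform, and the function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rank_x) / n, sum(rank_y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    var_x = sum((a - mean_x) ** 2 for a in rank_x)
    var_y = sum((b - mean_y) ** 2 for b in rank_y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant input; 0.0 chosen here
    return cov / (var_x * var_y) ** 0.5


# Invented similarities of four compounds to one query under two methods.
method_a = [0.9, 0.5, 0.7, 0.2]
method_b = [0.8, 0.4, 0.6, 0.1]
print(spearman(method_a, method_b))
```

A low correlation between two well-performing methods suggests they rank compounds differently, which is the argument for running them side by side.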

Possible Enhancements
The method has a number of “tuneable” parameters
  – Size of fragments selected for Tversky searching
  – FP and parameters to use for Tversky searching against the db
      Does RDK5 give better results than the ChemAxon FP?
      What are the optimum alpha, beta and cut-off parameters?
  – Tversky parameters for pFP comparison
The parameters chosen are based on very limited datasets and our judgement
  – Balance speed vs retrieval performance
What happens if I drop the Tversky db searching step?
  – “Post process” every cmpd in the db
The evaluation platform provides a more rigorous way to determine the “best general” parameters

Summary

Brief history of the technique
Why we created (yet) another similarity method
How it works
Performance

Back-up Slides

Performance – AUC Rankings
[Figure: AUC rankings; smaller is better]

Performance – BEDROC20 Rankings
[Figure: BEDROC20 rankings; smaller is better]

Correlation with other methods
Take all actives from the evaluation platform
  – For the actives in each dataset, generate a similarity matrix
How does the similarity ranking correlate (Spearman) between methods?
[Figure: correlation between methods for DUD]

Tversky Metric
When comparing molecule A and molecule B:

$$ \mathrm{Tversky}(A, B) = \frac{c}{\alpha a + \beta b + c} $$

a is the count of bits on in mol A but not in mol B.
b is the count of bits on in mol B but not in mol A.
c is the count of the bits on in both mol A and mol B.

α=1, β=0: similarity of molecule B as a superstructure of molecule A
α=0, β=1: similarity of molecule B as a substructure of molecule A
α=1, β=1: Tanimoto similarity (α=0.5, β=0.5 gives the Dice coefficient)
