Full Descriptions Of Available Activities
From CDK-Taverna 2.0 Wiki
This page describes and gives a short introduction of the usage of all available activities in the CDK-Taverna 2.0 project.
The I/O folder provides basic activities for reading and writing different types of data from and to hard-disk. They all have an input port to specify the files to read from or the destination to write the data to. File writers that can handle iterative incoming data can be configured to write one file per iteration or only a single file for all iterations. Every file writer provides a list of all written files.
ARFF File Reader
Reads ARFF files from hard-disk. Sets the last attribute as the class attribute when its name is "Class".
ARFF File Writer
Writes ARFF files to hard-disk.
CML Chem File Reader
Reads CML Chem Files from hard-disk.
CML Chem File Writer
Writes CML Chem Files to hard-disk.
CSV File Reader
Reads CSV Files from hard-disk.
CSV File Writer
Writes CSV Files to hard-disk. Specially designed to write CSV data coming from QSAR vector data.
MDL SDFile Reader
Reads MDL SDFiles from hard-disk.
MDL SDFile Writer
Writes MDL SDFiles to hard-disk.
MDL Mol File Reader
Reads MDL Mol Files from hard-disk.
MDL Mol File Writer
Writes MDL Mol Files to hard-disk.
MDL RXN File Reader
Reads MDL RXN Files from hard-disk.
Multi MDL RXN File Reader
Reads Multi MDL RXN Files from hard-disk. This format is not a native MDL format. It is made up of a list of MDL RXN strings separated by a "$$$$" delimiter string.
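The multi RXN layout described here can be illustrated with a minimal Python sketch. It is not the CDK-Taverna implementation; the function name `split_multi_rxn` and the sample text are hypothetical, and the only thing taken from the description is the "$$$$" delimiter.

```python
def split_multi_rxn(text, delimiter="$$$$"):
    """Split a multi-RXN string into individual RXN record strings."""
    records = []
    for chunk in text.split(delimiter):
        record = chunk.strip("\r\n")   # drop the line breaks around the delimiter
        if record.strip():             # skip the empty chunk after the last delimiter
            records.append(record)
    return records


# Hypothetical, heavily shortened sample content (not valid RXN blocks).
sample = "$RXN\nfirst reaction\n$$$$\n$RXN\nsecond reaction\n$$$$\n"
reactions = split_multi_rxn(sample)
```

Each returned string could then be handed to a parser for single MDL RXN records.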
MDL RXN File Writer
Writes MDL RXN Files to hard-disk.
SMILES File Reader
Reads SMILES Files from hard-disk.
SMILES File Writer
Writes SMILES Files to hard-disk.
Text File Writer
Writes incoming string data to hard-disk.
XRFF File Reader
Reads XRFF files from hard-disk.
XRFF File Writer
Writes XRFF files to hard-disk.
The iterative I/O folder provides the ability to handle huge files by reading them iteratively. These activities have to be configured like the basic input activities. They all have an input port to specify the files to read the data from. Additionally, you can adjust the number of elements read per iteration through the second port.
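The idea of reading a fixed number of elements per iteration can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of the activities: the function name `iter_sdf_chunks` and the demo content are hypothetical. It relies only on the fact that records in an MDL SDFile end with a "$$$$" line.

```python
import io


def iter_sdf_chunks(handle, records_per_iteration):
    """Yield lists of SD records, at most `records_per_iteration` per list."""
    chunk, lines = [], []
    for line in handle:
        lines.append(line)
        if line.strip() == "$$$$":          # end of one SD record
            chunk.append("".join(lines))
            lines = []
            if len(chunk) == records_per_iteration:
                yield chunk
                chunk = []
    if chunk:                                # remaining records form a smaller final chunk
        yield chunk


# Hypothetical stand-in for a large file with three (shortened) records.
demo = io.StringIO("mol1\n$$$$\nmol2\n$$$$\nmol3\n$$$$\n")
sizes = [len(c) for c in iter_sdf_chunks(demo, 2)]
```

Because the generator only keeps one chunk in memory at a time, the file size is not limited by the available memory.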
This activity is needed for the iterative loop reader activities. You have to connect an activity to the "state" output port of the loop activities because otherwise the port will not be evaluated. For an example, have a look at the Loop SDFile Reader activity.
These two activities can only be used in combination with each other. The acceptor activity caches all the data coming from an iterative source. Afterwards, the emitter activity reads the cached data at once and provides all of it to the subsequent workflow in a single invocation.
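The interplay of the two activities can be pictured with a small Python sketch. The class `IterativeCache` and its method names are hypothetical and only mirror the described behaviour: collect data over many iterations, then hand it over in one step.

```python
class IterativeCache:
    """Collects data from many iterations and hands it over in one piece."""

    def __init__(self):
        self._items = []

    def accept(self, batch):
        # Acceptor side: called once per iteration of the iterative source.
        self._items.extend(batch)

    def emit(self):
        # Emitter side: returns everything cached so far and clears the cache.
        items, self._items = self._items, []
        return items


cache = IterativeCache()
for batch in (["a", "b"], ["c"]):   # two iterations of a hypothetical source
    cache.accept(batch)
everything = cache.emit()
```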
Iterative RXN File Reader
Iterative file reader for MDL RXN files. The provided file chooser has multi file selection enabled.
Iterative Multi RXN File Reader
Iterative file reader for multi MDL RXN files. This format is not a native MDL format. It is made up of a list of MDL RXN strings separated by a "$$$$" delimiter string.
Iterative SDFile Reader
Iterative file reader for MDL SDFiles.
Loop SDFile Reader
Iterative file reader for MDL SDFiles. The difference from the Iterative RXN/SD File readers is that the whole nested workflow is executed before the next iteration step starts.
Loop RXN File Reader
Iterative file reader for MDL RXN files. The difference from the Iterative RXN/SD File readers is that the whole nested workflow is executed before the next iteration step starts. The configuration process is the same as for the Loop SDFile Reader activity.
These activities are used to convert string data to the data format used within the CDK-Taverna 2.0 project and back. The activities do not have to be configured.
CML String to Structures Converter
Converts given CML strings to CDK-Taverna 2.0 structure objects.
Structure to CML String Converter
Converts given CDK-Taverna 2.0 structure objects to CML strings.
MDL Mol String to Structures Converter
Converts given MDL Mol File strings to CDK-Taverna 2.0 structure objects.
Structure to MDL Mol String Converter
Converts given CDK-Taverna 2.0 structure objects to MDL Mol File strings.
MDL SDFile String to Structures Converter
Converts given MDL SDFile strings to CDK-Taverna 2.0 structure objects.
Structure to SDFile String Converter
Converts given CDK-Taverna 2.0 structure objects to MDL SDFile strings.
MDL RXN String to Reaction Converter
Converts given MDL RXN strings to CDK-Taverna 2.0 reaction objects.
Reaction to MDL RXN String Converter
Converts given CDK-Taverna 2.0 reaction objects to MDL RXN strings.
SMILES String to Structure Converter
Converts given SMILES strings to CDK-Taverna 2.0 structure objects.
Structure to SMILES String Converter
Converts given CDK-Taverna 2.0 structure objects to SMILES strings.
This folder provides different filter activities.
Atom Type Filter
Filters out molecules which contain atom types unknown to the CDK.
Filters out duplicate structures.
Rule Of Five Filter
Filters out structures which fail the Rule Of Five.
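As an illustration of what such a filter checks, here is a minimal Python sketch based on the commonly cited Lipinski thresholds (molecular weight at most 500, logP at most 5, at most 5 hydrogen bond donors, at most 10 hydrogen bond acceptors). The function names, the `max_violations` parameter and the descriptor values are assumptions for this example; how many violations the activity itself tolerates is not stated here.

```python
def rule_of_five_violations(mol_weight, log_p, h_donors, h_acceptors):
    """Count how many of the four Rule of Five criteria are violated."""
    checks = [
        mol_weight > 500,    # molecular weight above 500
        log_p > 5,           # octanol-water partition coefficient above 5
        h_donors > 5,        # more than 5 hydrogen bond donors
        h_acceptors > 10,    # more than 10 hydrogen bond acceptors
    ]
    return sum(checks)


def passes_rule_of_five(descriptors, max_violations=0):
    # max_violations is an assumption; some conventions tolerate one violation.
    return rule_of_five_violations(**descriptors) <= max_violations


# Hypothetical descriptor values, not calculated from real structures.
small = {"mol_weight": 180.2, "log_p": 1.2, "h_donors": 1, "h_acceptors": 4}
large = {"mol_weight": 812.0, "log_p": 6.3, "h_donors": 2, "h_acceptors": 12}
kept = [d for d in (small, large) if passes_rule_of_five(d)]
```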
This folder provides activities for basic isomorphism mapping between molecular structures.
Performs a test whether two structures are structurally identical.
Filters structures according to whether or not they contain the query subgraph structure.
JChemPaint is an editor and viewer for 2D chemical structures. The activity can be used to edit chemical structures at runtime.
This folder contains several activities which could not be classified to a certain class.
Add Explicit Hydrogens
Adds explicit hydrogens to the structures.
Add Implicit Hydrogens
Adds implicit hydrogens to the structures.
Hueckel Aromaticity Detector
Detects the aromaticity based on the Hueckel 4n+2 pi-electrons rule applied to isolated ring systems.
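The electron count condition of the Hueckel rule can be written down in a few lines of Python. This sketch only checks whether a given number of pi electrons has the form 4n+2; ring perception and electron counting, which the activity needs as well, are not shown, and the function name is hypothetical.

```python
def satisfies_hueckel_rule(pi_electrons):
    """Return True if the count equals 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


benzene_like = satisfies_hueckel_rule(6)          # 6 = 4*1 + 2
cyclobutadiene_like = satisfies_hueckel_rule(4)   # 4 is not of the form 4n + 2
```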
Reaction Reactant Splitter
Splits a given reaction into its reactants.
Sorts given reactions according to their number of reactants.
Tag Molecules With UUID
Tags given molecules with a UUID.
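What tagging with a UUID amounts to can be sketched in Python with the standard `uuid` module. Molecules are represented here as plain dictionaries and the property key `"uuid"` is an assumption for this example; the activity itself works on CDK-Taverna 2.0 structure objects.

```python
import uuid


def tag_with_uuid(molecules, key="uuid"):
    """Attach a freshly generated random UUID string to every molecule."""
    for molecule in molecules:
        molecule[key] = str(uuid.uuid4())
    return molecules


# Hypothetical molecule records.
tagged = tag_with_uuid([{"name": "mol A"}, {"name": "mol B"}])
```

The UUID gives every molecule a stable identifier that later steps, such as the clustering result files, can refer to.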
Generates 2D-Coordinates for molecules.
This folder provides activities to curate compound libraries.
Curate Strange Elements
Sorts out molecules which contain atom types other than the following: C, H, N, O, P, S, Cl, F, As, Se, Br, I, B.
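The check behind this curation step can be sketched in Python. Molecules are reduced to lists of element symbols for this example, and the function name is hypothetical; the allowed element set is the one listed above.

```python
ALLOWED_ELEMENTS = {"C", "H", "N", "O", "P", "S", "Cl", "F", "As", "Se", "Br", "I", "B"}


def has_only_allowed_elements(element_symbols):
    """Return True if every atom belongs to the allowed element set."""
    return set(element_symbols) <= ALLOWED_ELEMENTS


# Hypothetical molecules given as lists of element symbols.
organic = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
with_metal = ["C", "O", "O", "Na"]
curated = [m for m in (organic, with_metal) if has_only_allowed_elements(m)]
```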
Molecule Connectivity Checker
Sorts out counter ions from given structures. The maximum size for ions is definable.
Remove Sugar Groups
The enumeration activity generates a virtual compound library based on a generic reaction and its associated reactant lists.
Example workflows can be found here.
The Reaction Enumerator activity.
- Variable Region Checker: Gives the user the possibility to define variable regions. This makes the generic template reactions much more flexible.
- Multi Match Checker: Checks whether the reactant matches the reaction template more than once, so that every possible result reaction can be enumerated.
Reaction Enumerator Subgraph Filter
Specially designed subgraph filter for the Reaction Enumerator activity.
Write Molecule As JPG
Renders given structures into JPG image files.
Write Molecule As PNG
Renders given structures into PNG image files.
Write Molecule As PDF
Renders given structures into a single PDF file.
Write Reaction As PDF
Renders given reactions into a single PDF file.
This folder contains activities for the calculation and the processing of QSAR descriptor results.
Example workflows can be found here.
Note: It is strongly recommended to write one file per iteration during iterative workflows. When the workflow is finished merge the CSVs with the Merge CSVs To QSAR Vector activity.
Calculate QSAR Vector Statistics
Create Fingerprint Item List From QSAR Vector
CSV To QSAR Vector
Merge CSVs To QSAR Vector
Curate QSAR Vector
Removes descriptor values that could not be calculated from the given QSAR vector and removes columns whose minimum and maximum values are identical. You can choose between three curation methods:
- Dynamic curation between rows and columns: Tries to maximize the number of remaining descriptor values. This curation type is an intermediate between curation types 2 and 3.
- Curate only columns: Discards the columns which contain descriptors that could not be calculated.
- Curate only rows: Discards the rows (molecules) which contain descriptors that could not be calculated.
Additionally, you can choose whether columns whose descriptor values have identical minimum and maximum should be discarded.
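The column-only and row-only curation methods, together with the removal of constant columns, can be sketched in Python. The QSAR vector is modelled as a list of rows with `None` marking a value that could not be calculated. This representation, the function names and the descriptor names are assumptions for this example, and the dynamic method is not shown.

```python
def curate_columns(names, rows):
    """Discard every column that contains a missing (None) value."""
    keep = [j for j in range(len(names)) if all(r[j] is not None for r in rows)]
    return [names[j] for j in keep], [[r[j] for j in keep] for r in rows]


def curate_rows(rows):
    """Discard every row (molecule) that contains a missing (None) value."""
    return [r for r in rows if all(v is not None for v in r)]


def drop_constant_columns(names, rows):
    """Discard columns whose minimum equals their maximum (no missing values allowed)."""
    keep = [j for j in range(len(names))
            if min(r[j] for r in rows) != max(r[j] for r in rows)]
    return [names[j] for j in keep], [[r[j] for j in keep] for r in rows]


# Hypothetical descriptor table: two molecules, three descriptors.
names = ["A", "B", "C"]
rows = [[1.0, None, 3.0],
        [2.0, 5.0, 3.0]]

col_names, col_rows = curate_columns(names, rows)           # drops "B"
final_names, final_rows = drop_constant_columns(col_names, col_rows)  # drops "C"
row_curated = curate_rows(rows)                              # drops the first molecule
```

Column curation keeps all molecules but loses descriptors, while row curation keeps all descriptors but loses molecules, which is the trade-off the dynamic method tries to balance.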
Merge QSAR Vectors
This activity combines the power of all QSAR descriptors in one single activity. You can choose all available descriptors to be calculated at once.
QSAR Descriptor Threaded (Experimental)
This activity is based on the QSAR Descriptor activity but can use multithreading for the QSAR descriptor calculations. You can set the number of threads in the configuration panel.
Note: It is tagged as experimental because the CDK is not explicitly thread safe.
QSAR Vector Generator
QSAR Vector To CSV
Atomic QSAR Descriptors
Atomic Proton QSAR Descriptors
Atompair QSAR Descriptors
Bond QSAR Descriptors
Molecular QSAR Descriptors
Protein QSAR Descriptors
This activity implements the ART-2a classification algorithm. There are six parameters to configure:
- Number of classifications: Determines the number of classifications within the interval between the lower and upper vigilance parameter limits.
- The upper vigilance limit. The vigilance parameter determines the number of resulting classes. The higher the vigilance parameter, the higher the number of resulting classes.
- The lower vigilance limit.
- The maximum classification time.
- Scale fingerprint items to values between 0.0 and 1.0.
- The output directory of the classification result files.
ART-2a result As PDF
ART-2a result As PDF File Reader
ART-2a Result Considering Different Origins As PDF
This activity visualizes the fraction of each origin in the resulting classes so that it is possible to determine the similarity between different compound sources. Equal fractions within the classes indicate a high similarity between the sources. The output directory is the same as for the ART-2a Clusterer activity results.
This folder provides activities for clustering and result visualisation. It uses the Weka Machine Learning library. All result activities use the same output directory as the Weka clustering worker.
Example workflows can be found here.
Setup instructions for the activity:
- Choose clustering algorithm from list.
- Click the configure button.
- Configure the chosen clusterer and press OK.
- Add job to job list.
- Choose the output directory.
Finally apply the settings and close the configuration panel. To add more than one job to the list repeat steps 1-4.
Create Weka Dataset From QSAR Vector
Converts a QSAR vector to a Weka dataset.
Extract Clustering Result As CSV
Writes statistics and UUID cluster memberships into files.
Extract Clustering Result As PDF
Visualizes the clustering result as PDF.
Clustering Result Considering Different Origins As PDF
Visualizes the clustering result from different origins and shows the ratio of the sources in the different clusters. The file is saved as a PDF file. The activity uses the same output directory as the Weka clustering worker.
Generate Silhouette Plot From Clustering Result As CSV
Generate Silhouette Plot From Clustering Result As PDF
Split Molecules Into Clusters
- Multiple Linear regression
- Three-layer perceptron-type neural networks
- Support Vector Machines
- M5P Regression Trees
Create Weka Regression Dataset
This activity creates a regression dataset from a basic Weka dataset, e.g. one provided by the Create Weka Dataset From QSAR Vector activity. The first attribute of the dataset has to be a UUID, followed by n numeric data attributes. The last field is the class field; it has to be numeric and named "Class". The ID Class CSV File consists of two columns. The first column contains the UUID and the second column the numeric class value.
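How a two-column ID class file can be joined with a dataset is sketched below in Python. The comma delimiter, the absence of a header row, the function names and the shortened IDs are assumptions for this example. The dataset layout follows the description: UUID first, numeric attributes next, class value last.

```python
import csv
import io


def read_id_class_csv(handle):
    """Read 'UUID,class value' rows into a dictionary UUID -> float."""
    classes = {}
    for row in csv.reader(handle):
        if len(row) < 2:                  # skip blank or incomplete lines
            continue
        classes[row[0].strip()] = float(row[1])
    return classes


def attach_class(rows, classes):
    """Append the class value as last field; rows without a class entry are skipped."""
    return [row + [classes[row[0]]] for row in rows if row[0] in classes]


# Hypothetical data with shortened IDs instead of full UUIDs.
id_class_file = io.StringIO("id-1,0.75\nid-2,1.5\n")
dataset_rows = [["id-1", 1.0, 2.0], ["id-2", 3.0, 4.0], ["id-3", 5.0, 6.0]]
regression_rows = attach_class(dataset_rows, read_id_class_csv(id_class_file))
```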
Split Dataset Into Train-/Testset
This activity splits a given dataset into a trainset and a testset. There are three algorithms available:
Composes the sets randomly.
- Cluster Representatives:
Uses the simple KMeans clusterer to assemble the sets.
- Single Global Max:
Also uses the simple KMeans clusterer to assemble the sets. Afterwards, however, the algorithm tries to optimize the sets by moving the worst datapoint from the testset into the trainset. This step is performed for a certain number of iterations. Within every iteration, a classification step is performed to determine the worst described datapoint in the testset. The blacklisting should usually be enabled because the algorithm is very prone to oscillation and the blacklisting suppresses this behaviour.
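The simplest of these strategies, composing the sets randomly, can be sketched in Python as follows. The function name, the `train_fraction` and `seed` parameters and the 80/20 ratio are assumptions for this example and are not taken from the activity's configuration.

```python
import random


def random_split(items, train_fraction=0.8, seed=0):
    """Shuffle the items reproducibly and cut them into trainset and testset."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)       # local generator, global state untouched
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train_set, test_set = random_split(range(10), train_fraction=0.8)
```

The cluster-based strategies replace the shuffle with a selection step that draws datapoints from every cluster, so that both sets cover the descriptor space.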
GA Attribute Selection
The activity uses a genetic algorithm to find an optimized set of attributes.
Heuristic Attribute Selection
The activity tries to sort the attributes according to their relevance for the underlying machine learning problem. The algorithm evaluates the performance of every attribute and leaves out the worst. This step is repeated until only one attribute remains.
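The elimination loop described above can be sketched in Python. How the activity measures the performance of an attribute is not specified here, so this sketch assumes a leave-one-out scheme: the worst attribute is the one whose removal leaves the best-scoring subset. The function name, the `evaluate` callback and the weights are hypothetical.

```python
def rank_attributes_by_elimination(attributes, evaluate):
    """Return attributes ordered from most to least relevant.

    `evaluate(subset)` must return a score where higher means a better model.
    """
    remaining = list(attributes)
    eliminated = []
    while len(remaining) > 1:
        # Assumed criterion: dropping the worst attribute hurts the score least.
        worst = max(remaining,
                    key=lambda a: evaluate([x for x in remaining if x != a]))
        remaining.remove(worst)
        eliminated.append(worst)
    # The survivor is the most relevant; the first eliminated is the least relevant.
    return remaining + eliminated[::-1]


# Toy scoring function with hypothetical attribute weights.
weights = {"a": 3, "b": 1, "c": 2}
ranking = rank_attributes_by_elimination(
    ["a", "b", "c"], lambda subset: sum(weights[x] for x in subset))
```

In a real setting, `evaluate` would train and validate a model on the given attribute subset, which makes the procedure expensive for datasets with many attributes.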
Evaluate Regression Results as PDF
This activity produces a PDF containing different plots and statistics characterising the used machine learning model.
For further questions, feel free to contact us at the CDK-Taverna mailing list: